Loading Oracle Data into Hadoop Using Sqoop

Introduction

The era of “big data” and the “cloud” is driving companies to change. Just to keep pace, they must learn new skills and implement new practices that leverage these new data sources and technologies.

Transferring large volumes of data between the different components of a big data analytics ecosystem is a common requirement.

In this article, we discuss how to transfer data between Oracle Database and HDFS (Hadoop Distributed File System) using Sqoop.

Apache Hadoop

The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes so they can process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have local access to. This allows the data to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system, where computation and data are distributed via high-speed networking.
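
To make the storage side concrete, here is a quick, hedged sketch of interacting with HDFS from the command line; the file and directory names below are only illustrative.

# Create a directory in HDFS and copy a local file into it (paths are illustrative)
hadoop fs -mkdir -p /tmp/demo
hadoop fs -put /home/user/sales.csv /tmp/demo/
# List the directory and read the file back; its blocks are distributed across the cluster
hadoop fs -ls /tmp/demo
hadoop fs -cat /tmp/demo/sales.csv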

The base Apache Hadoop framework is composed of the following modules:

Hadoop Common – contains libraries and utilities needed by other Hadoop modules.

Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.

Hadoop YARN – a platform responsible for managing computing resources in clusters and using them to schedule users’ applications.

Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.

Beyond the base framework, the wider Hadoop ecosystem includes:

Apache Hive – gives a SQL-like interface to query data stored in the various databases and file systems that integrate with Hadoop.

Apache HBase – a column-oriented key/value data store built to run on top of HDFS.

Together, these components make Hadoop a framework for handling large datasets in a distributed computing environment.

Hadoop Installation using Ambari

Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides a dashboard for viewing cluster health, including heat maps, and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner. It has a simple, interactive UI for installing the various tools and performing management, configuration and monitoring tasks. Below we take you through the steps for installing Hadoop and its ecosystem components on a multi-node cluster. Ambari has two components:

Ambari Server

This is the master process, which communicates with the Ambari agents installed on each node participating in the cluster. It has a PostgreSQL database instance which is used to maintain all cluster-related metadata.

Ambari Agent

These act as Ambari’s agents on each node. Each agent periodically sends its health status to the server, along with various metrics, the status of installed services, and other information. Based on these reports, the master decides on the next action and conveys it back to the agent to act on.

Installation of Ambari

  1. Download the Ambari public repository
  2. Install the Ambari RPMs
  3. Configure the Ambari server

After Ambari is installed, the next step is to configure the Ambari server and set it up to provision the cluster.

  4. Start the Ambari server with:

ambari-server start

  5. Log in to the Ambari web UI.
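
Putting these steps together, a rough sketch of the flow on a CentOS/RHEL node looks like the following (the repository URL is a placeholder; use the one that matches your OS and Ambari release):

# 1. Download the Ambari public repo file (<ambari-repo-url> is a placeholder)
wget -nv <ambari-repo-url> -O /etc/yum.repos.d/ambari.repo
# 2. Install the Ambari server RPM
yum install -y ambari-server
# 3. Configure the server (JDK and database; PostgreSQL is used by default)
ambari-server setup
# 4. Start the server
ambari-server start
# 5. Log in to the web UI at http://<ambari-host>:8080 (default credentials are admin/admin)

On each node that will join the cluster, the Ambari agent can similarly be installed with yum install -y ambari-agent and started with ambari-agent start, or registered automatically from the web UI during cluster provisioning.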

Apache Sqoop

Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.

Command – Description

sqoop import – Imports a database table into HDFS as a text file or in a binary representation format such as Avro or SequenceFiles. Command-line arguments may be used to import into Hive or HBase.
sqoop export – Exports HDFS files to a pre-existing Oracle database table.
sqoop import-all-tables – Imports all tables into HDFS.
sqoop job – Creates a saved job to be run later.
sqoop merge – Combines two datasets in HDFS.
sqoop metastore – Creates a metadata repository for creating and running saved jobs.
sqoop codegen – Generates Java classes to encapsulate and interpret imported records.
sqoop create-hive-table – Creates a table in the Hive metastore based on a database table.
sqoop eval – Runs SQL queries against a database, for evaluation before a subsequent import.
sqoop list-databases – Lists all the database schemas. (Oracle Database XE does not support this command.)
sqoop list-tables – Lists all the tables in a database.
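
For example, before running an import you can sanity-check connectivity with sqoop list-tables and sqoop eval; the connection details below match the ones used in Step 1, so treat them as illustrative.

# List the tables visible to the SYSTEM user
sqoop list-tables --connect "jdbc:oracle:thin:@localhost:1521:XE" --username "SYSTEM" --password "calgary10"
# Run a quick query against the source table before importing it
sqoop eval --connect "jdbc:oracle:thin:@localhost:1521:XE" --username "SYSTEM" --password "calgary10" --query "SELECT COUNT(*) FROM HR.DEPARTMENTS"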

Step 1: Use Sqoop to import data from the Oracle database into HDFS

To bulk-transfer the Oracle Database HR.DEPARTMENTS table to HDFS, specify the target HDFS directory with the --target-dir argument, the Oracle database connection parameters with the --connect argument, the table name with the --table argument, and the columns with the --columns argument.

bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" \
  --table "HR.DEPARTMENTS" \
  --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" \
  --target-dir "/tmp/hdfs_sqoopimport" --verbose
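
As an alternative to the two-step flow shown here, Sqoop can also write straight into Hive in a single command using its --hive-import option; the sketch below is only illustrative, and the Hive table name departments is an assumption.

bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" \
  --table "HR.DEPARTMENTS" \
  --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" \
  --hive-import --create-hive-table --hive-table departments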

Step 2: Load the Sqoop-extracted data above into a Hive table

LOAD DATA INPATH '<<HDFS path>>' INTO TABLE <<hive table name>>;
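
For instance, a minimal sketch of this step from the shell, assuming a Hive table named departments whose columns match the import in Step 1 (Sqoop's default text output is comma-delimited):

# Create a matching Hive table (table name and schema are assumptions)
hive -e "CREATE TABLE IF NOT EXISTS departments (department_id INT, department_name STRING, manager_id INT, location_id INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
# Move the Sqoop output from HDFS into the Hive table
hive -e "LOAD DATA INPATH '/tmp/hdfs_sqoopimport' INTO TABLE departments;"
# Quick sanity check
hive -e "SELECT COUNT(*) FROM departments;"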

Conclusion

We have now successfully loaded a table from the Oracle database into Hive using Sqoop. We can query both the Oracle and Hive databases to check that the data has been loaded correctly.

The advantage of big data comes into play when you have the ability to correlate it with your existing enterprise data. There is an implicit product requirement here: consolidating these various architectural principles into a single integrated solution. A single solution allows you to address not only the complexities of mapping, accessing, and loading big data, but also of combining it with your enterprise data.

References

  • https://www.toadworld.com/platforms/oracle/w/wiki/10891.transferring-bulk-data-between-Oracle-database-and-Hadoop-ecosystem-with-sqoop
  • https://en.wikipedia.org/wiki/Apache_Hadoop

Author:

Hi, I am Rajesh from the DBA team, and I have been working at 4iapps for the last 3.8 years as an Apps DBA. With a background in engineering and expertise in Oracle E-Business Suite and Oracle Database, I started blogging about the latest upgrades and configurations. My hobbies are blogging, singing and photography.

 
