The era of “big data” and the “cloud” is driving companies to change. Just to keep pace, they must learn new skills and adopt new practices that leverage these new data sources and technologies. Transferring big data between the components of a big data analytics ecosystem is a common requirement.
In this article, we discuss how to transfer data between Oracle Database and HDFS (Hadoop Distributed File System) using Sqoop.
The core of Apache Hadoop consists of a storage part, known as the Hadoop Distributed File System (HDFS), and a processing part, the MapReduce programming model. Hadoop splits files into large blocks and distributes them across nodes in a cluster. It then transfers packaged code to the nodes to process the data in parallel. This approach takes advantage of data locality, where nodes manipulate the data they have local access to. This allows the data to be processed faster and more efficiently than in a more conventional supercomputer architecture that relies on a parallel file system where computation and data are distributed via high-speed networking.
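The map–shuffle–reduce flow that MapReduce implements can be mimicked, purely as a toy illustration, with an ordinary shell pipeline:

```shell
# Toy word count illustrating the MapReduce phases with plain shell
# tools (an illustration only -- this is not Hadoop itself).
#   map:     tr emits one word ("record") per line
#   shuffle: sort brings identical keys together
#   reduce:  uniq -c aggregates each group into a count
echo "big data and big clusters of data" | tr ' ' '\n' | sort | uniq -c
```

In real Hadoop, the same three phases run in parallel across the cluster's nodes, with the shuffle moving data between them over the network.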
The base Apache Hadoop framework is composed of the following modules:
- Hadoop Common – contains libraries and utilities needed by other Hadoop modules.
- Hadoop Distributed File System (HDFS) – a distributed file system that stores data on commodity machines, providing very high aggregate bandwidth across the cluster.
- Hadoop YARN – a platform responsible for managing computing resources in clusters and scheduling users’ applications on them.
- Hadoop MapReduce – an implementation of the MapReduce programming model for large-scale data processing.
- Apache Hive – gives a SQL-like interface to query data stored in various databases and file systems that integrate with Hadoop.
- Apache HBase – a column-oriented key/value data store built to run on top of HDFS.

Together, these components make Hadoop a framework for handling large data sets in a distributed computing environment.
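As a quick illustration of Hive’s SQL-like interface, a query such as the following can be run from the Hive CLI (the `employees` table here is hypothetical and assumed to already exist in your warehouse):

```shell
# Run a SQL-like aggregate query against a hypothetical Hive table.
# Requires a working Hive installation; the table name is illustrative.
hive -e "SELECT department_id, COUNT(*) AS emp_count
         FROM employees
         GROUP BY department_id;"
```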
Hadoop Installation using Ambari
Apache Ambari is a web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides a dashboard for viewing cluster health, such as heat maps, and the ability to view MapReduce, Pig, and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner. It has a simple, interactive UI for installing various tools and performing management, configuration, and monitoring tasks. Below we take you through the steps for installing Hadoop and its ecosystem components on a multi-node cluster. Ambari has two components:
Ambari Server – the master process, which communicates with the Ambari agents installed on each node participating in the cluster. It maintains all cluster-related metadata in a PostgreSQL database instance.
Ambari Agent – the acting agent for Ambari on each node. Each agent periodically sends its own health status to the server, along with various metrics, the status of installed services, and more. Based on this, the master decides on the next action and conveys it back to the agent to act on.
Installation of Ambari
- Download Ambari public repo
- Install Ambari RPMs
- Configure the Ambari server
After installation, the next step is to configure Ambari and set it up to provision the cluster.
- Start the server and log in to the web UI
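On a RHEL/CentOS node, these steps typically look something like the following sketch (the repository URL and version number are examples only; substitute the release you need):

```shell
# 1. Download the Ambari public repo (example URL/version).
wget -O /etc/yum.repos.d/ambari.repo \
  http://public-repo-1.hortonworks.com/ambari/centos7/2.x/updates/2.7.5.0/ambari.repo
# 2. Install the Ambari server RPMs.
yum install -y ambari-server
# 3. Configure the server; -s accepts the defaults non-interactively
#    (including the bundled PostgreSQL metadata database).
ambari-server setup -s
# 4. Start the server; the web UI then listens on port 8080.
ambari-server start
```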
Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured data stores such as relational databases.
| Command | Description |
| --- | --- |
| sqoop import | Imports a database table to HDFS as a text file or in binary representation formats such as Avro or SequenceFiles. Command-line arguments may be used to import to Hive or HBase. |
| sqoop export | Exports HDFS files to a pre-existing Oracle database table. |
| sqoop import-all-tables | Imports all tables into HDFS. |
| sqoop job | Creates a saved job to be run later. |
| sqoop-merge | Combines two datasets in HDFS. |
| sqoop metastore | Creates a metadata repository for creating and running saved jobs. |
| sqoop codegen | Generates Java classes to encapsulate and interpret imported records. |
| sqoop create-hive-table | Creates a table in the Hive metastore based on a database table. |
| sqoop eval | Runs SQL queries on a database table for evaluation prior to import. |
| sqoop list-databases | Lists all the database schemas. Oracle Database XE does not support this command. |
| sqoop list-tables | Lists all the tables in a database. |
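Before importing, `list-tables` and `eval` can be used to sanity-check the connection and source data (the connection details here match the import example in this article):

```shell
# Sanity-check the Oracle connection before importing.
sqoop list-tables \
  --connect "jdbc:oracle:thin:@localhost:1521:XE" \
  --username "SYSTEM" --password "calgary10"

# Run an evaluation query against the source table.
sqoop eval \
  --connect "jdbc:oracle:thin:@localhost:1521:XE" \
  --username "SYSTEM" --password "calgary10" \
  --query "SELECT COUNT(*) FROM HR.DEPARTMENTS"
```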
Step 1: Use Sqoop to import data from the Oracle database into HDFS
To bulk-transfer the Oracle Database HR.DEPARTMENTS table to HDFS, specify the target HDFS directory with the --target-dir argument, the Oracle database connection parameters with the --connect argument, the table name with the --table argument, and the columns with the --columns argument.
bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns
--target-dir "/tmp/hdfs_sqoopimport" --verbose
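Once the import completes, the rows should land as delimited text files in the target directory; they can be inspected with the HDFS CLI:

```shell
# List the imported files (Sqoop writes one part-m-* file per mapper).
hdfs dfs -ls /tmp/hdfs_sqoopimport
# Print their contents; rows appear as comma-delimited text.
hdfs dfs -cat /tmp/hdfs_sqoopimport/part-m-*
```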
Step 2: Load the Sqoop-extracted data into a Hive table
LOAD DATA INPATH '<<HDFS path>>' INTO TABLE <<hive table name>>;
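Putting it together, a minimal sketch of the Hive side might look like this (the table definition is an assumption, based on the standard columns of the HR.DEPARTMENTS sample schema and Sqoop's default comma delimiter; adjust it to match your import):

```shell
# Create a matching Hive table and load the Sqoop output into it.
# Column names/types are an assumption based on the HR sample schema.
hive -e "
CREATE TABLE IF NOT EXISTS departments (
  department_id   INT,
  department_name STRING,
  manager_id      INT,
  location_id     INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA INPATH '/tmp/hdfs_sqoopimport' INTO TABLE departments;"
```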
Thus we have successfully loaded a table from the Oracle database into Hive using Sqoop. We can query both the Oracle and Hive databases to check that the data loaded correctly.
The advantage of Big Data comes into play when you can correlate it with your existing enterprise data. There is an implicit product requirement here: consolidating these various architectural principles into a single integrated solution. A single solution lets you address not only the complexities of mapping, accessing, and loading Big Data, but also those of combining it with your enterprise data.
- https://www.toadworld.com/platforms/oracle/w/wiki/10891.transferring-bulk-data-between-Oracle-database-and-Hadoop-ecosystem-with-sqoop
Hi, I am Rajesh from the DBA team; I have been working at 4iapps for the last 3.8 years as an Apps DBA. With a background in engineering and expertise in Oracle E-Business Suite and Oracle Database, I started blogging about the latest upgrades and configurations. My hobbies are blogging, singing, and photography.