
Wiki Page: Transferring Bulk Data between Oracle Database and Hadoop Ecosystem with Sqoop

Written by Deepak Vohra

In the era of big data, transferring bulk data between the different components of a big data analytics ecosystem is a common requirement. Sqoop is a top-level Apache project for bulk transferring data between the Apache Hadoop ecosystem and a structured datastore, such as the Oracle Database relational database. In this article we discuss how to transfer data between Oracle Database and HDFS (Hadoop Distributed File System), Apache Hive, and Apache HBase.

The Sqoop tool provides the following commands to bulk transfer data between the Hadoop ecosystem and Oracle Database.

sqoop import: Imports a database table to HDFS as a text file or a binary representation format such as Avro or SequenceFiles. Command-line arguments may be used to import to Hive or HBase instead.
sqoop export: Exports HDFS files to a pre-existing Oracle Database table.
sqoop import-all-tables: Imports all tables into HDFS.
sqoop job: Creates a saved job to be run later.
sqoop merge: Combines two datasets in HDFS.
sqoop metastore: Creates a metadata repository for creating and running saved jobs.
sqoop codegen: Generates Java classes to encapsulate and interpret imported records.
sqoop create-hive-table: Creates a table in the Hive metastore based on a database table.
sqoop eval: Runs SQL queries on a database table for evaluation before a subsequent import.
sqoop list-databases: Lists all the database schemas. Oracle Database XE does not support this command.
sqoop list-tables: Lists all the tables in a database.

We shall discuss the import and export commands in the subsequent sections.

Preliminary Setup

Only certain versions of Hadoop run with certain versions of Sqoop and Hive; any Oracle Database version may be used. The following combination of versions was used.

Apache Hadoop 1.0.0
Apache Hive 0.9.0
Apache Sqoop 1.4.1-incubating
Apache HBase 0.94.1
Java 6
Oracle Database (Oracle Database 10g Express Edition is used in this article)

We also need to set the environment variables for Sqoop, Hive, HBase, and Oracle Database.

sudo vi ~/.bashrc

export HIVE_HOME=/usr/local/hive-0.9.0
export SQOOP_HOME=/usr/local/sqoop-1.4.1-incubating__hadoop-1.0.0
export HBASE_HOME=/usr/local/hbase-0.94.1
export ORACLE_HOME=/usr/lib/oracle/xe/app/oracle/product/10.2.0/server
export ORACLE_SID=XE
export NLS_LANG=`$ORACLE_HOME/bin/nls_lang.sh`
export PATH=$HIVE_HOME/bin:$SQOOP_HOME/bin:$ORACLE_HOME/bin:$PATH

As the Oracle JDBC driver is used to connect to Oracle Database from Sqoop, copy the Oracle Database JDBC driver JAR file to the Sqoop classpath.

sudo cp $ORACLE_HOME/jdbc/lib/ojdbc14_g.jar /usr/local/sqoop-1.4.1-incubating__hadoop-1.0.0/lib

Importing to HDFS from Oracle Database

The sqoop import command is used to import bulk data into the Hadoop filesystem, Hive, and HBase. To bulk transfer the Oracle Database HR.DEPARTMENTS table to HDFS, specify the target HDFS directory with the --target-dir argument (arg), the Oracle Database connection parameters with the --connect arg, the table name with the --table arg, and the columns with the --columns arg.

bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" --target-dir "/tmp/hdfs_sqoopimport" --verbose

The output from the Sqoop import shows the running of the MapReduce job.
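The imported records may also be verified directly from HDFS. As a quick sketch (the part file name below is the standard naming for a map-only job's output and could differ in a given run), list the target directory and display one of the files; by default the records are stored as comma-separated text, one record per line.

# list the files Sqoop wrote to the target directory
bin/hadoop fs -ls /tmp/hdfs_sqoopimport
# display the first map task's output
bin/hadoop fs -cat /tmp/hdfs_sqoopimport/part-m-00000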
The Sqoop import into HDFS may also be run as a saved job. First, create a saved job with the sqoop job command and the --create arg.

bin/sqoop job --create import_job -- import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" --target-dir "/tmp/hdfssqoopimport" --verbose

Subsequently, run the import_job job with the following command.

bin/sqoop job --exec import_job

As the output shows, the HR.DEPARTMENTS table gets imported into HDFS just as with the direct sqoop import command.

Exporting from HDFS to Oracle Database

It may be required to bulk export data from HDFS to Oracle Database. In the previous section we bulk imported data from Oracle Database into the HDFS directory /tmp/hdfs_sqoopimport; in this section we export the HDFS data back to Oracle Database. As the HR.DEPARTMENTS table already has data, create a copy of the table (HR.DEPARTMENTSCOPY) without the data, for example with a statement such as CREATE TABLE HR.DEPARTMENTSCOPY AS SELECT * FROM HR.DEPARTMENTS WHERE 1=0. Then run the sqoop export command to export from HDFS to Oracle Database.

bin/sqoop export --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --export-dir "/tmp/hdfs_sqoopimport" --table "HR.DEPARTMENTSCOPY" --verbose

The export directory is specified with the --export-dir arg, and the Oracle Database table with the --table arg. The MapReduce export job runs, and as the output indicates, 27 records get exported to the Oracle Database table HR.DEPARTMENTSCOPY. If a SELECT query is run on the HR.DEPARTMENTSCOPY table, the 27 exported records get listed.

Importing to HBase from Oracle Database

To import into HBase, specify the --hbase-create-table arg to create the HBase table, the HBase table name with the --hbase-table arg, and the HBase table column family with the --column-family arg. Specify the Oracle Database table to import with the --table arg.

sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --hbase-create-table --hbase-table "DEPARTMENTS" --column-family "dept" --table "HR.DEPARTMENTS"

The output from the command shows that 27 records get transferred to HBase from the Oracle Database table HR.DEPARTMENTS. The imported table may be listed with the scan command in the HBase shell.
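As a quick check (a sketch; the exact row keys and cell values depend on the imported data), the table can be scanned from the HBase shell. By default Sqoop uses the source table's primary key, DEPARTMENT_ID here, as the HBase row key and stores the remaining columns in the dept column family.

bin/hbase shell
# inside the shell: list the imported rows; LIMIT keeps the output short
scan 'DEPARTMENTS', {LIMIT => 5}
exit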
Importing to Hive from Oracle Database

The Sqoop tool may also be used to import into Hive by adding the --hive-import arg to the sqoop import command. Specify the Hive table with the --hive-table arg, and request that the Hive table be created with the --create-hive-table arg. Specify the Oracle Database table to import with the --table arg.

bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --hadoop-home "/usr/local/hadoop-1.0.0" --password "calgary10" --username "SYSTEM" --hive-import --create-hive-table --hive-table "depts" --table "HR.DEPARTMENTS" --verbose

As the output indicates, the HR.DEPARTMENTS table gets imported into Hive, with the data loaded into the Hive warehouse directory.

Sqoop imports/exports bulk data from/to Oracle Database by first getting the metadata about the Oracle Database table with the SQL statement SELECT t.* FROM DEPARTMENTS t WHERE 1=0. Sqoop may not be able to obtain all the required metadata, such as the column names, with this query, so it is recommended to specify the column names in the Sqoop import command, as we did. Also, Oracle Database table columns may need to be cast to less precise types by Sqoop. To specify a mapping other than the default for particular columns, the --map-column-java and --map-column-hive args are provided.
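For example, the following sketch (the Java type mappings and the target directory are illustrative assumptions, not taken from the article) maps the numeric columns to the Java Integer type during an HDFS import; --map-column-hive accepts an analogous comma-separated COLUMN=TYPE list for Hive imports.

bin/sqoop import --connect "jdbc:oracle:thin:@localhost:1521:XE" --password "calgary10" --username "SYSTEM" --table "HR.DEPARTMENTS" --columns "DEPARTMENT_ID,DEPARTMENT_NAME,MANAGER_ID,LOCATION_ID" --map-column-java "DEPARTMENT_ID=Integer,MANAGER_ID=Integer,LOCATION_ID=Integer" --target-dir "/tmp/hdfs_sqoopimport_mapped" --verbose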
In this article we learned how to bulk transfer data between Oracle Database and the Hadoop ecosystem: HDFS, Hive, and HBase.