Written by Deepak Vohra

Apache Tajo is a distributed data warehouse system for Apache Hadoop. If SQL is the only query language you know, Apache Tajo is the framework for you, as it fully supports the SQL standards (ANSI/ISO). With Tajo, data stored in HDFS and other data sources can be queried using SQL as if it were in a relational database. Tajo itself uses a relational database, Apache Derby, as its Catalog Store and also supports other relational databases including MySQL and Oracle Database. In the previous tutorial we used MySQL Database as the Catalog Store for Apache Tajo. In this tutorial we shall use Oracle Database as the Catalog Store.

- Setting the Environment
- Creating an Oracle Database User for Tajo
- Configuring Apache Tajo
- Starting Tajo
- Creating a Tajo Table

Setting the Environment

The same setup used with MySQL Database as the Catalog Store can be used with Oracle Database, apart from the database-specific differences. Create a directory to install the required software and set its permissions to global (777).
    mkdir /tajo
    chmod -R 777 /tajo
    cd /tajo

The following software is required in this tutorial.
- Oracle Database
- Apache Tajo
- Apache Hadoop 2.3 or higher (up to 2.6)
- Java 7
- Oracle Database JDBC Driver Jar

Download and extract the tar.gz file for Apache Tajo.
    wget http://apache.mirror.gtcomm.net/tajo/tajo-0.10.1/tajo-0.10.1.tar.gz
    tar -xvf tajo-0.10.1.tar.gz

Download and extract the Java 7 tar file.
    tar -xvf jdk-7u55-linux-i586.gz

Download and extract the Hadoop 2.3 tar file.
    wget http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.3.0-cdh5.0.1.tar.gz
    tar -xvf hadoop-2.3.0-cdh5.0.1.tar.gz

Create symlinks for the Hadoop bin and conf directories.
    ln -s /tajo/hadoop-2.3.0-cdh5.0.1/bin-mapreduce1 /tajo/hadoop-2.3.0-cdh5.0.1/share/hadoop/mapreduce1/bin
    ln -s /tajo/hadoop-2.3.0-cdh5.0.1/etc/hadoop /tajo/hadoop-2.3.0-cdh5.0.1/share/hadoop/mapreduce1/conf

Set the core Hadoop properties in the core-site.xml configuration file.
    vi /tajo/hadoop-2.3.0-cdh5.0.1/etc/hadoop/core-site.xml

    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://10.0.2.15:8020/</value>
      </property>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>file:///var/lib/hadoop-0.20/cache</value>
      </property>
    </configuration>

Create the directory specified in the hadoop.tmp.dir property.
    mkdir -p /var/lib/hadoop-0.20/cache
    chmod -R 777 /var/lib/hadoop-0.20/cache

Set the HDFS configuration properties in the hdfs-site.xml configuration file.
    vi /tajo/hadoop-2.3.0-cdh5.0.1/etc/hadoop/hdfs-site.xml

    <configuration>
      <property>
        <name>dfs.permissions.superusergroup</name>
        <value>hadoop</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///data/1/dfs/nn</value>
      </property>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.permissions</name>
        <value>false</value>
      </property>
    </configuration>

Create the NameNode storage directory.
    mkdir -p /data/1/dfs/nn
    chmod -R 777 /data/1/dfs/nn

Set the environment variables for Oracle Database, Apache Hadoop, Apache Tajo, and Java.
    vi ~/.bashrc

    export TAJO_HOME=/tajo/tajo-0.10.1
    export TAJO_CONF=$TAJO_HOME/conf
    export ORACLE_HOME=/home/oracle/app/oracle/product/11.2.0/dbhome_1
    export ORACLE_SID=ORCL
    export HADOOP_PREFIX=/tajo/hadoop-2.3.0-cdh5.0.1
    export HADOOP_CONF=$HADOOP_PREFIX/etc/hadoop
    export JAVA_HOME=/tajo/jdk1.7.0_55
    export HADOOP_MAPRED_HOME=/tajo/hadoop-2.3.0-cdh5.0.1/share/hadoop/mapreduce1
    export HADOOP_HOME=/tajo/hadoop-2.3.0-cdh5.0.1/share/hadoop/mapreduce1
    export HADOOP_CLASSPATH=$HADOOP_HOME/*:$HADOOP_HOME/lib/*:$HADOOP_PREFIX/share/hadoop/common/lib/*:$TAJO_HOME/lib/*:$TAJO_CONF:$HADOOP_PREFIX/share/hadoop/yarn/*
    export PATH=/usr/lib/qt-3.3/bin:/usr/local/sbin:/usr/sbin:/sbin:/usr/local/bin:/usr/bin:/bin:$HADOOP_HOME/bin:$HADOOP_MAPRED_HOME:$ORACLE_HOME/bin:$TAJO_HOME/bin
    export CLASSPATH=$HADOOP_CLASSPATH
    export HADOOP_NAMENODE_USER=hadoop
    export HADOOP_DATANODE_USER=hadoop

Copy the Oracle JDBC Driver Jar file to the Tajo lib directory.
    cp ojdbc6.jar $TAJO_HOME/lib

Format the NameNode and start HDFS, which comprises the NameNode and the DataNode.
    hadoop namenode -format
    hadoop namenode
    hadoop datanode

Create a directory in HDFS for the data file wlslog.csv. This tutorial uses the same wlslog.csv as the MySQL Database tutorial.
    hdfs dfs -mkdir hdfs://localhost:8020/wlslog
    hadoop dfs -chmod -R g+w hdfs://localhost:8020/wlslog
    hdfs dfs -put wlslog.csv hdfs://localhost:8020/wlslog

Also create an HDFS directory for the Tajo Master and make it group-writable (g+w). The directory is configured in tajo-site.xml in the tajo.rootdir property.
    hadoop dfs -mkdir hdfs://localhost:8020/tajo
    hadoop dfs -chmod -R g+w hdfs://localhost:8020/tajo

Creating an Oracle Database User for Tajo

We need to create an Oracle user for Tajo and grant the user privileges. Start SQL*Plus and create a user called “tajo” with password “tajo”. Grant the “tajo” user all privileges.
    SQL> create user tajo identified by tajo;
    SQL> grant all privileges to tajo;

Configuring Apache Tajo

The Apache Tajo configuration is similar to that used for MySQL Database, except that the Catalog Store configured in catalog-site.xml has to be for Oracle Database instead of MySQL Database. The JDBC Driver jar file added to the Tajo classpath is also different. In tajo-env.sh set the environment variables HADOOP_HOME, JAVA_HOME and TAJO_CLASSPATH. Add the Oracle Database JDBC Driver jar file to TAJO_CLASSPATH.
    vi /tajo/tajo-0.10.1/conf/tajo-env.sh

    export HADOOP_HOME=/tajo/hadoop-2.3.0-cdh5.0.1/share/hadoop/mapreduce1
    export JAVA_HOME=/tajo/jdk1.7.0_55
    export TAJO_CLASSPATH=$CLASSPATH:/tajo/ojdbc6.jar

Copy the catalog-site.xml.template file to catalog-site.xml and configure the connection id and password used to connect to Oracle Database. Uncomment the section for Oracle and set the store class with the tajo.catalog.store.class property to org.apache.tajo.catalog.store.OracleStore; the property should already be set to this value if the Oracle section has not been modified from the default. Set tajo.catalog.uri to the connection URL for Oracle Database.
The connection URL should include the Oracle Database service name rather than the Oracle Database instance name, which would be acceptable in some other configurations such as Oracle Loader for Hadoop.
    cp $TAJO_CONF/catalog-site.xml.template $TAJO_CONF/catalog-site.xml
    vi $TAJO_CONF/catalog-site.xml

    <property>
      <name>tajo.catalog.connection.id</name>
      <value>tajo</value>
    </property>
    <property>
      <name>tajo.catalog.connection.password</name>
      <value>tajo</value>
    </property>
    <property>
      <name>tajo.catalog.store.class</name>
      <value>org.apache.tajo.catalog.store.OracleStore</value>
    </property>
    <property>
      <name>tajo.catalog.uri</name>
      <value>jdbc:oracle:thin:@127.0.0.1:1521/ORCL.168.1.68.1</value>
    </property>

The Oracle Catalog Store Driver section should look like the listing above; the service name and hostname could be different.

Starting Tajo

In the MySQL as a Catalog Store tutorial, we started Tajo with the following command.
    $TAJO_HOME/bin/start-tajo.sh

Tajo may also be started with the following command.
    $TAJO_HOME/bin/tajo master

The output from the preceding command indicates that the TajoMasterService has been instantiated. A more detailed output from the bin/tajo master command is as follows.
    [root@localhost tajo-0.10.1]# bin/tajo master
    15/09/09 14:22:34 INFO master.TajoMaster: STARTUP_MSG:
    /************************************************************
    STARTUP_MSG: Starting TajoMaster
    STARTUP_MSG: host = localhost.oraclelinux/10.0.2.15
    STARTUP_MSG: args = []
    STARTUP_MSG: version = 0.10.1
    STARTUP_MSG: classpath = /tajo/tajo-0.10.1/bin/../conf:/tajo/tajo-0.10.1/bin/../tajo-
    STARTUP_MSG: build = git@github.com:hyunsik/tajo.git -r c94edd86a17cf9e26ffb58787a8f7dcf13cfbf43; compiled by 'hyunsik' on 2015-06-24T04:07Z
    STARTUP_MSG: java = 1.7.0_55
    ************************************************************/
    15/09/09 14:22:35 INFO master.TajoMaster: registered UNIX signal handlers for [TERM, HUP, INT]
    15/09/09 14:22:47 INFO mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
    15/09/09 14:22:50 INFO webapp.HttpServer: Jetty bound to port 26080
    15/09/09 14:22:50 INFO mortbay.log: jetty-6.1.14
    15/09/09 14:22:57 INFO mortbay.log: Started SelectChannelConnector@0.0.0.0:26080
    15/09/09 14:22:57 INFO master.TajoMaster: Tajo Root Directory: hdfs://10.0.2.15:8020/tajo
    15/09/09 14:23:23 INFO master.TajoMaster: FileSystem (hdfs://10.0.2.15:8020) is initialized.
    15/09/09 14:23:27 INFO master.TajoMaster: Tajo Warehouse dir: hdfs://10.0.2.15:8020/tajo/warehouse
    15/09/09 14:23:27 INFO master.TajoMaster: Staging dir: hdfs://10.0.2.15:8020/tajo/warehouse
    15/09/09 14:23:52 WARN storage.FileStorageManager: does not support block metadata. ('dfs.datanode.hdfs-blocks-metadata.enabled')
    15/09/09 14:24:16 INFO rm.TajoWorkerResourceManager: WorkerResourceAllocationThread start
    15/09/09 14:24:16 INFO event.AsyncDispatcher: Registering class org.apache.tajo.master.rm.WorkerEventType for class org.apache.tajo.master.rm.TajoWorkerResourceManager$WorkerEventDispatcher
    15/09/09 14:24:20 INFO rpc.RpcChannelFactory: Create TajoResourceTrackerProtocol-1 ServerSocketChannelFactory. Worker:3
    15/09/09 14:24:24 INFO rpc.NettyServerBase: Rpc (TajoResourceTrackerProtocol) listens on /127.0.0.1:26003
    15/09/09 14:24:24 INFO rm.TajoResourceTracker: TajoResourceTracker starts up (localhost/10.0.2.15:26003)
    15/09/09 14:24:24 INFO catalog.CatalogServer: Catalog Store Class: org.apache.tajo.catalog.store.OracleStore
    15/09/09 14:24:29 INFO store.OracleStore: Loaded the Catalog driver (oracle.jdbc.OracleDriver)
    15/09/09 14:24:29 INFO store.OracleStore: Trying to connect database (jdbc:oracle:thin:@127.0.0.1:1521/ORCL.168.1.68)
    15/09/09 14:24:47 INFO store.OracleStore: Connected to database (jdbc:oracle:thin:@127.0.0.1:1521/ORCL.168.1.68)
    15/09/09 14:25:22 INFO store.XMLCatalogSchemaManager: meta TABLE is created.
    15/09/09 14:25:24 INFO store.XMLCatalogSchemaManager: tablespaces TABLE is created.
    15/09/09 14:25:25 INFO store.XMLCatalogSchemaManager: TABLESPACES_SEQ SEQUENCE is created.
    15/09/09 14:25:33 INFO store.XMLCatalogSchemaManager: TABLESPACES_AUTOINC TRIGGER is created.
    15/09/09 14:25:34 INFO store.XMLCatalogSchemaManager: DATABASES_ TABLE is created.
    15/09/09 14:25:35 INFO store.XMLCatalogSchemaManager: DATABASES__SEQ SEQUENCE is created.
    15/09/09 14:25:41 INFO store.XMLCatalogSchemaManager: DATABASES__AUTOINC TRIGGER is created.
    15/09/09 14:25:42 INFO store.XMLCatalogSchemaManager: TABLES TABLE is created.
    15/09/09 14:25:42 INFO store.XMLCatalogSchemaManager: TABLES_SEQ SEQUENCE is created.
    15/09/09 14:25:44 INFO store.XMLCatalogSchemaManager: TABLES_AUTOINC TRIGGER is created.
    15/09/09 14:25:48 INFO store.XMLCatalogSchemaManager: TABLES_IDX_DB_ID INDEX is created.
    15/09/09 14:25:49 INFO store.XMLCatalogSchemaManager: TABLES_IDX_TABLE_ID INDEX is created.
    15/09/09 14:25:51 INFO store.XMLCatalogSchemaManager: COLUMNS TABLE is created.
    15/09/09 14:25:52 INFO store.XMLCatalogSchemaManager: OPTIONS TABLE is created.
    15/09/09 14:25:55 INFO store.XMLCatalogSchemaManager: INDEXES TABLE is created.
    15/09/09 14:25:57 INFO store.XMLCatalogSchemaManager: INDEXES_IDX_TID_COLUMN_NAME INDEX is created.
    15/09/09 14:25:58 INFO store.XMLCatalogSchemaManager: STATS TABLE is created.
    15/09/09 14:25:59 INFO store.XMLCatalogSchemaManager: PARTITION_METHODS TABLE is created.
    15/09/09 14:26:01 INFO store.XMLCatalogSchemaManager: PARTITIONS TABLE is created.
    15/09/09 14:26:02 INFO store.XMLCatalogSchemaManager: PARTITIONS_IDX_TID INDEX is created.
    15/09/09 14:26:03 INFO store.OracleStore: The base tables of CatalogServer are created.
    15/09/09 14:26:04 INFO event.AsyncDispatcher: Registering class org.apache.tajo.querymaster.QueryJobEvent$Type for class org.apache.tajo.master.QueryManager$QueryJobManagerEventHandler
    15/09/09 14:26:04 INFO master.TajoMaster: Tajo Master is initialized.
    15/09/09 14:26:04 INFO master.TajoMaster: TajoMaster is starting up
    15/09/09 14:26:23 INFO catalog.CatalogServer: tablespace "default" (hdfs://10.0.2.15:8020/tajo/warehouse) is created
    15/09/09 14:26:26 INFO catalog.CatalogServer: database "default" is created
    15/09/09 14:26:27 INFO rm.Worker: Worker with slots=m:1024,d:1.0,c:1, used=m:0,d:0.0,c:0 is joined to Tajo cluster
    15/09/09 14:26:27 INFO rm.Worker: -1944742237 Node Transitioned from NEW to RUNNING
    15/09/09 14:26:28 INFO rpc.RpcChannelFactory: Create CatalogProtocol-2 ServerSocketChannelFactory. Worker:2
    15/09/09 14:26:28 INFO rpc.NettyServerBase: Rpc (CatalogProtocol) listens on /127.0.0.1:26005
    15/09/09 14:26:28 INFO catalog.CatalogServer: Catalog Server startup (10.0.2.15:26005)
    15/09/09 14:26:41 INFO rpc.RpcChannelFactory: Create TajoMasterClientProtocol-3 ServerSocketChannelFactory. Worker:1
    15/09/09 14:26:41 INFO rpc.NettyServerBase: Rpc (TajoMasterClientProtocol) listens on /127.0.0.1:26002
    15/09/09 14:26:41 INFO master.TajoMasterClientService: Instantiated TajoMasterClientService at localhost/10.0.2.15:26002
    15/09/09 14:26:42 INFO rpc.RpcChannelFactory: Create QueryCoordinatorProtocol-4 ServerSocketChannelFactory. Worker:2
    15/09/09 14:26:42 INFO rpc.NettyServerBase: Rpc (QueryCoordinatorProtocol) listens on /127.0.0.1:26001
    15/09/09 14:26:42 INFO master.QueryCoordinatorService: Instantiated TajoMasterService at localhost/10.0.2.15:26001

Creating a Tajo Table

Having started the TajoMaster, which also starts the worker and catalog store, next start the Tajo shell.
    $TAJO_HOME/bin/tsql

The Tajo shell connects to the “default” database. Create an external table called “wlslog” over the data file in HDFS.
    tajo> create external table wlslog (id text,time_stamp text,category text,type text,servername text,code text,msg text) using text with ('text.delimiter'=',') location 'hdfs://localhost:8020/wlslog';

An external table gets created. Run a SQL query on the Tajo table.
    select * from wlslog;

The selected data gets listed in the Tajo shell.

In this tutorial we used Oracle Database as the Catalog Store for Apache Tajo.
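
As a footnote to the catalog-site.xml step, the difference between the service-name and instance (SID) forms of the Oracle thin-driver URL can be sketched in shell. This is only an illustration under stated assumptions: the function names oracle_service_url and oracle_sid_url are hypothetical helpers, not part of Tajo or Oracle, and the host, port and service name are the example values from this tutorial.

```shell
#!/usr/bin/env bash
# Sketch: compose Oracle thin-driver JDBC URLs.
# oracle_service_url and oracle_sid_url are hypothetical helper names
# introduced for illustration; they are not part of Tajo or Oracle.

# Service-name form: host:port/service.
# This is the form used for tajo.catalog.uri in this tutorial.
oracle_service_url() {
  local host="$1" port="$2" service="$3"
  printf 'jdbc:oracle:thin:@%s:%s/%s\n' "$host" "$port" "$service"
}

# SID form: host:port:SID. Shown only for contrast.
oracle_sid_url() {
  local host="$1" port="$2" sid="$3"
  printf 'jdbc:oracle:thin:@%s:%s:%s\n' "$host" "$port" "$sid"
}

# Example values taken from this tutorial; adjust for your database.
oracle_service_url 127.0.0.1 1521 ORCL.168.1.68.1
oracle_sid_url 127.0.0.1 1521 ORCL
```

The only difference is the separator before the last component: a slash for a service name, a colon for a SID. If the service name is not known, running lsnrctl status on the database host should list the services registered with the listener.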