By Wissem El Khlifi (Oracle ACE)

Introduction

Recommendation systems are the typical data products of the e-commerce industry. They are a good starting point when explaining what data science is to non-data scientists. Any Internet user, or anyone who has bought a product on Amazon, has interacted with recommendation engines: they have been emailed or shown suggested products on Amazon, or recommended clothes on Zalando. Recommendation engines are used all the time in e-commerce: "You might like this movie/book" (based on movies the user has already seen or books purchased in the past), or "Other people that bought this book also bought xyz".

In this article, I will walk you through what it took for me to build a real-world recommendation system for FindMyHospital.com, a website where users (patients) are recommended a hospital based on their illness (SubCategory) and the scores (1 to 10) that other patients have given to rate the hospitals.

To set up a recommendation engine, suppose we have a set of users represented by a vector U and a set of hospitals to recommend represented by a vector P (for Product). We link a user U to a product P if that user has given it a rating (a score). We need a recommendation algorithm that scales: one able to build large models and to produce recommendations in near real time, while the user is still navigating the FindMyHospital.com website.

The business needs a big data reservoir, very high performance data processing and scalable machine learning. We need a fast OLTP database system, typically an Oracle Exadata running Oracle Database 12c, to store all user and hospital data, and an Oracle Big Data Appliance: a high-performance, secure platform for running diverse workloads on Hadoop and NoSQL systems, handling structured and unstructured data in very large files on HDFS (the Hadoop Distributed File System). Oracle Big Data Appliance runs Hadoop workloads (MapReduce 2, Apache Spark, Hive, etc.) via the Cloudera CDH distribution. In this article, we will use Apache Spark to build the recommendation system. The results of the recommendations will be stored in JSON format in an Oracle Database 12c on Exadata. Starting with version 12.1.0.2, Oracle Database supports JSON natively; we use JSON because of its simplicity and because it is smaller than the equivalent XML. The following picture shows the complete big data management system used in our infrastructure.

Image: Big Data Appliance & Oracle Exadata

Now that we have decided to use Apache Spark for the data processing, let's first give a brief introduction to the framework.

What is Apache Spark?

Apache Spark is designed for in-memory processing of very large data sets. It supports simple "map" and "reduce" operations, SQL queries, streaming data, and complex analytics such as graph algorithms and machine learning (MLlib). Spark can be deployed on top of Hadoop HDFS, taking advantage of large files distributed across the data nodes. Unlike MapReduce, Spark is designed for advanced, real-time analytics and has the framework and tools (Spark Streaming) to deliver when a short time-to-insight is critical. According to Cloudera, Spark is able to execute batch-processing jobs 10 to 100 times faster than the MapReduce engine, primarily by reducing the number of reads and writes to disk. MLlib is Spark's scalable machine learning library, consisting of common learning algorithms and utilities, including classification, regression, clustering, and collaborative filtering.
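The code samples later in this article assume that a Spark context (the sc variable) and a Spark SQL context (the sqlctx variable) already exist. A minimal setup sketch, assuming the Spark 1.x Java API used throughout the article and an application name chosen here only for illustration, could look like this:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.SQLContext;

public class RecommenderSetup {
    public static void main(String[] args) throws Exception {
        // Application name is illustrative; on the cluster the master is supplied
        // by spark-submit (--master yarn), as shown at the end of this article
        SparkConf conf = new SparkConf().setAppName("FindMyHospitalRecommender");
        JavaSparkContext sc = new JavaSparkContext(conf);
        // SQL context used later to create DataFrames and register temporary tables
        SQLContext sqlctx = new SQLContext(sc);
        // ... the recommender code shown in the rest of the article goes here ...
        sc.close();
    }
}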
Collaborative filtering, the last of these, is the algorithm used for recommendation systems in Spark; we will come back to it shortly. Apache Spark programs can be developed in the most popular programming languages on the market, such as Java, Scala, SQL and Python; since version 1.4, Spark also supports R. In this article, we will use Java to program the recommender.

Collaborative Filtering Algorithm for Recommender Systems

Collaborative filtering is a family of algorithms that exploit the relationship between users and products through their ratings or scores. The algorithm works on the users' rating history to recommend products that a target user has not rated yet. The fundamental idea behind this approach is that ratings given by some users can be used to recommend a product to other users who have not seen or purchased that product before. No other information about the user plays a role in the recommendation; the ratings given to particular products are the only input to the algorithm. You can read more about collaborative filtering in the article Collaborative Filtering for Implicit Feedback Datasets. Since Spark MLlib uses the Alternating Least Squares method, a short introduction to it is in order.

Alternating Least Squares (ALS) Algorithm for Recommender Systems

We start from a matrix of users (x) by products (y), the products being hospitals in our case. Suppose we have N users and K products; we want to learn a factor matrix Y that represents the hospitals, where the factor vector of each hospital is how we represent that hospital in a latent feature space. Note that we ignore any sort of metadata for the moment (city, country, subcategory of the illness, etc.). In the same way, we want to learn a factor vector for each user, giving a factor matrix X for the users alongside the factor matrix Y for the hospitals (each user and each hospital is represented by a column vector). We now have two unknown matrices, so we adopt an alternating least squares approach with regularization: we first fix X and estimate Y, then fix Y and estimate X, and repeat. After a suitable number of iterations we reach a convergence point where the matrices X and Y are no longer changing, or the change is very small.

In Spark MLlib, collaborative filtering is commonly used for recommender systems. These techniques aim to fill in the missing entries of a user-item association matrix. MLlib currently supports model-based collaborative filtering, in which users and products are described by a small set of latent factors that can be used to predict the missing entries; it uses the alternating least squares (ALS) algorithm to learn these latent factors.
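For reference, the optimization that ALS performs can be written as follows; this is a standard textbook formulation, not taken from the original article. With r_ui the rating user u gave to hospital i, x_u the user factor vectors and y_i the hospital factor vectors:

$$ \min_{X,\,Y} \;\; \sum_{(u,i)\ \mathrm{observed}} \bigl(r_{ui} - x_u^{\top} y_i\bigr)^2 \;+\; \lambda \Bigl(\sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2\Bigr) $$

Holding Y fixed, each x_u has a closed-form least-squares solution, and vice versa; alternating between the two gives the ALS iterations. The implicit-feedback variant used later through ALS.trainImplicit (described in the paper linked above) replaces r_ui with a binary preference p_ui and weights each term by a confidence c_ui = 1 + alpha * r_ui, which is where the alpha parameter used during training comes from.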
The implementation of the Recommender System:

Data preparation and cleansing:

We connect to the Oracle database (12.1.0.2 or higher) and create a table called HospitalRating that contains all user ratings of hospitals. Note the "IS JSON" constraint in the syntax, which validates that each stored document is well-formed JSON.

CREATE TABLE HospitalRating
  (rate CLOB CONSTRAINT hosp_valid_json CHECK (rate IS JSON));

We insert some data into the HospitalRating table. To keep the example simple, we insert only 4 records:

INSERT INTO HospitalRating VALUES (
'{ "SubCategory": "Cancer",
   "Hospital": { "HospitalID": "2", "HospName": "Hospital Quiron Barcelona", "HospCity": "Barcelona", "HospCountry": "Spain" },
   "Rating":   { "RatingID": 5515355560, "RatingValue": 9 },
   "UserInfo": { "UserName": "user1", "UserID": 1, "UserCity": "Barcelona", "UserCountry": "Spain" } }');

INSERT INTO HospitalRating VALUES (
'{ "SubCategory": "Cancer",
   "Hospital": { "HospitalID": "1", "HospName": "Hospital Del Mar", "HospCity": "Barcelona", "HospCountry": "Spain" },
   "Rating":   { "RatingID": 555449550, "RatingValue": 8 },
   "UserInfo": { "UserName": "user2", "UserID": 2, "UserCity": "Barcelona", "UserCountry": "Spain" } }');

INSERT INTO HospitalRating VALUES (
'{ "SubCategory": "Cancer",
   "Hospital": { "HospitalID": "4", "HospName": "Hospital Vall d Hebron", "HospCity": "Barcelona", "HospCountry": "Spain" },
   "Rating":   { "RatingID": 9495257490, "RatingValue": 9 },
   "UserInfo": { "UserName": "user1", "UserID": 1, "UserCity": "Barcelona", "UserCountry": "Spain" } }');

INSERT INTO HospitalRating VALUES (
'{ "SubCategory": "Cancer",
   "Hospital": { "HospitalID": "2", "HospName": "Hospital Quiron Barcelona", "HospCity": "Barcelona", "HospCountry": "Spain" },
   "Rating":   { "RatingID": 5545456500, "RatingValue": 9 },
   "UserInfo": { "UserName": "user2", "UserID": 2, "UserCity": "Barcelona", "UserCountry": "Spain" } }');

The result of the insertion looks like the following:

SUBCATEGORY  HOSPITALID  HOSPNAME                   HOSPCITY   HOSPCOUNTRY  RATINGID    RATINGVALUE  USERNAME  USERID  USERCITY   USERCOUNTRY
Cancer       2           Hospital Quiron Barcelona  Barcelona  Spain        5545456500  9            user2     2       Barcelona  Spain
Cancer       2           Hospital Quiron Barcelona  Barcelona  Spain        5515355560  9            user1     1       Barcelona  Spain
Cancer       1           Hospital Del Mar           Barcelona  Spain        555449550   8            user2     2       Barcelona  Spain
Cancer       4           Hospital Vall d Hebron     Barcelona  Spain        9495257490  9            user1     1       Barcelona  Spain

User 2 (user id 2) has given ratings of 9 and 8 to "Hospital Quiron Barcelona" and "Hospital Del Mar" respectively. User 1 (user id 1) has given a rating of 9 out of 10 to both "Hospital Quiron Barcelona" and "Hospital Vall d Hebron". Both users gave a good rating to "Hospital Quiron Barcelona"; this is the similarity between them. Both users also live in Barcelona, so the recommendation algorithm must take into account where users live, so that we recommend the hospitals closest to the user's home. We also need to take the subcategory into account: we want to recommend hospitals that have been given a good rating for the relevant illness. We could consider other metadata (age, gender, etc.), but to keep the example simple we only take into account the subcategory plus the city and country of the users.

An Oracle table called Recommendation will store the final result of the big data processing:

CREATE TABLE Recommendation
  (rec CLOB CONSTRAINT rec_valid_json CHECK (rec IS JSON));

In Java, we use the Oracle JDBC driver (ojdbc7.jar) to connect to the running Oracle 12c database instance. From the ratings we generate a CSV file to be stored in Hadoop HDFS, so that we can later process it from the Spark Java program.
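For reference, and reconstructed from the column order of the SELECT statement in the code that follows (the article does not show the file itself), each line of hosprates.csv has the following layout:

SubCategory,HospitalID,HospName,RatingID,RatingValue,UserName,UserID,UserCity,HospCity,UserCountry,HospCountry

After splitting a line on commas (zero-based indexing), sarray[1] therefore holds the hospital id, sarray[4] the rating value, sarray[6] the user id, and sarray[8] and sarray[10] the hospital city and country, which the code later combines into a "City-Country" key.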
// Connection settings for the Oracle Exadata database (ojdbc7.jar on the classpath)
public static final String DBURL  = "jdbc:oracle:thin:@oraexadata.findmyhospital.com:1521:hospital";
public static final String DBUSER = "wissem";
public static final String DBPASS = "wissem123";

String inputFile = "hdfs://spark-01:8020/tmp/hosprates.csv";
// Note: a plain PrintWriter writes to the local file system; to write directly to the
// hdfs:// path above, the Hadoop FileSystem API would be needed, or the generated file
// must be copied into HDFS afterwards (e.g. with "hdfs dfs -put")
PrintWriter writer90 = new PrintWriter(inputFile, "UTF-8");

// Load the Oracle JDBC driver
DriverManager.registerDriver(new oracle.jdbc.driver.OracleDriver());
// Connect to the Oracle database
Connection con = DriverManager.getConnection(DBURL, DBUSER, DBPASS);
Statement statement = con.createStatement();

// Execute a SELECT query on Oracle, using the JSON dot notation to flatten the documents
ResultSet rs = statement.executeQuery(
      " SELECT hrt.rate.SubCategory, "
    + "hrt.rate.Hospital.HospitalID,"
    + "hrt.rate.Hospital.HospName,"
    + "hrt.rate.Rating.RatingID,"
    + "hrt.rate.Rating.RatingValue,"
    + "hrt.rate.UserInfo.UserName,"
    + "hrt.rate.UserInfo.UserID,"
    + "hrt.rate.UserInfo.UserCity,"
    + "hrt.rate.Hospital.HospCity,"
    + "hrt.rate.UserInfo.UserCountry,"
    + "hrt.rate.Hospital.HospCountry "
    + "FROM HospitalRating hrt");

// Write every row as one CSV line
while (rs.next()) {
    writer90.println(rs.getString(1) + "," + rs.getString(2) + "," + rs.getString(3) + ","
        + rs.getString(4) + "," + rs.getString(5) + "," + rs.getString(6) + ","
        + rs.getString(7) + "," + rs.getString(8) + "," + rs.getString(9) + ","
        + rs.getString(10) + "," + rs.getString(11));
}
writer90.close();

We create a distributed in-memory collection of items in Spark, called a Resilient Distributed Dataset (RDD), on top of the Hadoop HDFS file. In this case we create a distributed collection of objects based on the lines of the CSV file hosprates.csv; we can then apply operations to these objects that are automatically parallelized across the cluster. The following is an extract from the Java code:

JavaRDD<String> rawUserHospitalData = sc.textFile(inputFile);

Once all the objects are loaded into the cluster's distributed memory, we create a pair RDD called rawHospRDD (of Tuple2 elements) holding all hospitals, using Spark's flatMap function. The function is applied to each element of the source RDD and produces a new RDD of key-value pairs. Note that this is also where we prepare the data by cleaning out invalid records.

JavaRDD<Tuple2<Integer, String>> rawHosp = rawUserHospitalData
    .flatMap(new FlatMapFunction<String, Tuple2<Integer, String>>() {
        private static final long serialVersionUID = 1L;

        public Iterable<Tuple2<Integer, String>> call(String s) {
            String[] sarray = s.replaceAll("\\[|\\]", " ").split(",");
            List<Tuple2<Integer, String>> returnList = new ArrayList<Tuple2<Integer, String>>();
            if (sarray.length >= 2) {
                try {
                    if (Utils.isInteger(sarray[1])) {
                        // key = hospital id, value = "HospCity-HospCountry"
                        returnList.add(new Tuple2<Integer, String>(
                            Integer.parseInt(sarray[1]),
                            sarray[8].concat("-" + sarray[10]).trim()));
                    } else {
                        // invalid hospital id: mark the record as not applicable
                        returnList.add(new Tuple2<Integer, String>(-1, "NA"));
                    }
                } catch (NumberFormatException e) {
                    e.printStackTrace();
                    returnList.add(new Tuple2<Integer, String>(-1, "NA"));
                }
            } else {
                // malformed line
                returnList.add(new Tuple2<Integer, String>(-1, "NA"));
            }
            return returnList;
        }
    });

JavaPairRDD<Integer, String> rawHospRDD = JavaPairRDD.fromJavaRDD(rawHosp);
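The flatMap above relies on a helper, Utils.isInteger, that is not listed in the article. A plausible implementation, shown here only as an assumption about the author's Utils class, would be:

public class Utils {
    // Returns true if the string can be parsed as an int; used to filter out
    // invalid or non-numeric fields coming from the CSV extract
    public static boolean isInteger(String s) {
        if (s == null) {
            return false;
        }
        try {
            Integer.parseInt(s.trim());
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }
}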
Remember that we will take the city and country of the hospitals into account when we recommend hospitals. We therefore create an RDD (tableCityCountry) that is turned into a DataFrame: a distributed collection equivalent to a table in a relational database. The DataFrame is registered in Spark memory as a temporary table called CityCountry via the registerTempTable method of the Spark SQL context (the sqlctx variable), and from it we build distinctCityCountry, the list of distinct city and country values.

JavaRDD<CityCountry> tableCityCountry = rawHospRDD
    .map(new Function<Tuple2<Integer, String>, CityCountry>() {
        private static final long serialVersionUID = 1L;

        public CityCountry call(Tuple2<Integer, String> v1) throws Exception {
            CityCountry st = new CityCountry(v1._1(), v1._2());
            return st;
        }
    });

DataFrame dfCityCountry = sqlctx.createDataFrame(tableCityCountry, CityCountry.class);
dfCityCountry.registerTempTable("CityCountry");

DataFrame distinctCityCountry = sqlctx
    .sql("select hospCityCountry FROM CityCountry group by hospCityCountry");

List<String> distinctCityCountryList = distinctCityCountry
    .toJavaRDD().map(new Function<Row, String>() {
        private static final long serialVersionUID = 1L;

        public String call(Row v1) throws Exception {
            return v1.getString(0);
        }
    }).collect();

Another pair RDD, rawHospUserRDD, is needed: it maps each user id to the hospitals that user has rated. We create a broadcast variable rawHospUserB in order to keep a read-only copy of it cached (the cache() method is used) on each machine, rather than shipping a copy with every task. Broadcast variables can be used to give every Spark data node a copy of a large input dataset in an efficient manner; Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce inter-node communication cost.

JavaRDD<Tuple2<Integer, String>> rawHospUser = rawUserHospitalData
    .flatMap(new FlatMapFunction<String, Tuple2<Integer, String>>() {
        // ... same structure as the previous flatMap ...
        // key = user id, value = quoted hospital id
        returnList.add(new Tuple2<Integer, String>(
            Integer.parseInt(sarray[6]),
            addQuotes(Integer.parseInt(sarray[1]))));
        return returnList;
        // ... content skipped ...
    });

JavaPairRDD<Integer, String> rawHospUserRDD = JavaPairRDD.fromJavaRDD(rawHospUser);
final Broadcast<JavaPairRDD<Integer, String>> rawHospUserB = sc.broadcast(rawHospUserRDD.cache());

The same steps are used to build the list of distinct user identifiers (distinctUserIdList) from a Spark SQL DataFrame. This list of users takes into account the city and country of the hospital: the variable d below iterates over distinctCityCountryList.

DataFrame dfUser = sqlctx.createDataFrame(tableUser, User.class);
dfUser.registerTempTable("User");

DataFrame distinctUserId = sqlctx.sql("select userId FROM User where userId <> -1"
    + " AND userCityCountry='" + d + "' group by userId");

List<Integer> distinctUserIdList = distinctUserId.toJavaRDD()
    .map(new Function<Row, Integer>() {
        private static final long serialVersionUID = 1L;

        public Integer call(Row v1) throws Exception {
            return v1.getInt(0);
        }
    }).collect();
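The CityCountry and User classes used with createDataFrame are not listed in the article either. Spark SQL's createDataFrame(rdd, SomeClass.class) expects JavaBeans with getters and setters, so, judging from the fields referenced in the SQL above, they presumably look roughly like the following sketch (field names and types are assumptions):

// Assumed JavaBean backing the CityCountry temporary table
public class CityCountry implements java.io.Serializable {
    private int hospitalId;           // first element of the (hospital id, "City-Country") pair
    private String hospCityCountry;   // e.g. "Barcelona-Spain"

    public CityCountry() { }
    public CityCountry(int hospitalId, String hospCityCountry) {
        this.hospitalId = hospitalId;
        this.hospCityCountry = hospCityCountry;
    }
    public int getHospitalId() { return hospitalId; }
    public void setHospitalId(int hospitalId) { this.hospitalId = hospitalId; }
    public String getHospCityCountry() { return hospCityCountry; }
    public void setHospCityCountry(String hospCityCountry) { this.hospCityCountry = hospCityCountry; }
}

// Assumed JavaBean backing the User temporary table
public class User implements java.io.Serializable {
    private int userId;
    private String userCityCountry;

    public User() { }
    public User(int userId, String userCityCountry) {
        this.userId = userId;
        this.userCityCountry = userCityCountry;
    }
    public int getUserId() { return userId; }
    public void setUserId(int userId) { this.userId = userId; }
    public String getUserCityCountry() { return userCityCountry; }
    public void setUserCityCountry(String userCityCountry) { this.userCityCountry = userCityCountry; }
}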
Train Data:

The ALS implementation in MLlib has the following parameters:

numBlocks is the number of blocks used to parallelize computation (set to -1 to auto-configure). In our application we set it to -1 and let Spark decide.
rank is the number of latent factors in the model. In our application we set it to 2.
iterations is the number of iterations to run; in our case it is set to 10, based on trial and error.
lambda specifies the regularization parameter in ALS. We set it to 0.01, based on trial and error.
alpha is a parameter applicable to the implicit feedback variant of ALS that governs the baseline confidence in preference observations. We set it to 0.01, based on trial and error.

We use 10% of the data as the training data: we take a 10% sample of the raw data and build an RDD called trainData of Rating objects (org.apache.spark.mllib.recommendation.Rating). We then build the recommendation model (the model variable) with the ALS algorithm by calling the ALS.trainImplicit method of MLlib. The Java code looks like the following:

// train on 10% of the whole data
JavaRDD<Rating> trainData = rawUserHospitalData.sample(true, 0.1)
    .map(new Function<String, Rating>() {
        private static final long serialVersionUID = 1L;

        public Rating call(String s) throws Exception {
            String[] sarray = s.replaceAll("\\[|\\]", " ").split(",");
            // only keep ratings of hospitals located in the current city-country d
            if (d.equals(sarray[8].concat("-" + sarray[10]).trim())) {
                int userID = Integer.parseInt(sarray[6]);
                int hospitalID = Integer.parseInt(sarray[1]);
                double count = Double.parseDouble(sarray[4]);
                // ... remainder skipped in the article; presumably
                // return new Rating(userID, hospitalID, count);
            }
            // ... content skipped ...
        }
    });

// build the model: rank = 2, iterations = 10, lambda = 0.01, alpha = 0.01
MatrixFactorizationModel model = ALS.trainImplicit(trainData.rdd(), 2, 10, 0.01, 0.01);

Run the recommendation algorithm for the whole data set:

We decided to insert the result of the ALS processing into an Oracle 12c database, so we prepare an INSERT statement by calling the prepareStatement method.

// Turn auto-commit off (it is on by default)
con.setAutoCommit(false);
// Create the INSERT statement
PreparedStatement stmt = con.prepareStatement("INSERT INTO Recommendation VALUES (?)");

Recommendations for the whole data set are obtained via the recommendProducts method of the MatrixFactorizationModel class. This method returns an array of Rating objects, each of which contains the given user ID, a product ID (here, a hospital id) and a score in the rating field. Each entry represents one recommended product (a hospital in our case), and they are sorted by decreasing score; the first one returned is the hospital predicted to be most strongly recommended to the user. The score is an opaque value that indicates how strongly the product is recommended. We take the most strongly recommended hospitals for a given user. We loop over the distinct user ids, using the lookup method on the broadcast pair RDD to collect the hospitals each user has already rated. We then prepare an Oracle CLOB data type in a variable called clob, into which we store the user id concatenated with the recommendation date (the system date) together with the hospital identifier (product.toString().trim()), all as a valid JSON document. We execute the insert by calling the executeUpdate method, and finally commit the results via con.commit().

for (Integer u : distinctUserIdList) {
    try {
        // recommend the TOP 5 hospitals for user u
        Rating[] recommendProducts = model.recommendProducts(u, 5);
        JavaPairRDD<Integer, String> rawHospUserB1 = rawHospUserB.getValue();
        List<String> lookupUserHosp = rawHospUserB1.lookup(u);
        int taken = 0;
        for (Rating p : recommendProducts) {
            // ...
            String product = addQuotes(p.product());
            // ...
            String recommendationID = u + "-" + strDate;
            // ...
            // Create a new Clob instance, as we insert into a CLOB column
            Clob clob = con.createClob();
            // Store the JSON document into the CLOB
            clob.setString(1, "{RecUserID:" + addQuotes(recommendationID)
                + ", RecHospitalID: " + product.toString().trim() + "}");
            // Bind the Clob as input parameter of the INSERT statement
            stmt.setClob(1, clob);
            // Execute the INSERT statement
            int affectedRows = stmt.executeUpdate();
            // Free up the resource
            clob.free();
        }
        // Commit the inserted rows into the database
        con.commit();
        // ...

// ... at the end of the program, close all resources
sc.close();
rs.close();
statement.close();
con.close();

Once the Java code is compiled, we create a JAR archive of the project (sparkRecommendation.jar) and run it on the Spark cluster with the following command line. Note that we give the executors 20 gigabytes of memory and 40 CPU cores across the cluster:

sudo -u hdfs spark-submit --class main_mLlib --master yarn --executor-memory 20G --total-executor-cores 40 .../sparkRecommendation.jar
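Before moving on to the SQL validation, the predictions can also be sanity-checked inside the Spark job itself. As a small illustrative addition (not part of the original code), MatrixFactorizationModel.predict returns the predicted score for a single user/hospital pair:

// Predicted score for user 1 and hospital 1; a higher value means the hospital
// would be ranked higher among the recommendations for that user
double score = model.predict(1, 1);
System.out.println("Predicted score for user 1, hospital 1: " + score);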
Data validation:

This is the last step: checking whether the recommendation results are correct. In our example, user 2 (user id 2) has given ratings of 9 and 8 to "Hospital Quiron Barcelona" and "Hospital Del Mar" respectively, and user 1 (user id 1) has given a rating of 9 out of 10 to both "Hospital Quiron Barcelona" and "Hospital Vall d Hebron". When we query the Recommendation table, we can see that we recommend hospital id 1 (Hospital Del Mar) to user 1 and hospital id 4 (Hospital Vall d Hebron) to user 2, which is exactly what we expect.

SQL> SELECT rec.rec.RecUserID as RecUserID,
  2         rec.rec.RecHospitalID as RecHospitalID
  3  FROM Recommendation rec;

RECUSERID        RECHOSPITALID
---------------- -------------
1-2016-02-08     1
2-2016-02-08     4

SQL>

Now that the recommendation data is available and validated, we can show the recommended hospitals in the user's panel on the website.

Conclusion

In this article, we have seen a real-world recommendation system. We explained the ALS recommendation algorithm and the most important parts of the Java code used with Spark MLlib, and we used JSON documents to store the recommendation results in an Oracle Database 12c. The complete Java code is available at: https://github.com/orawiss/SparkMLlib