Quantcast
Channel: Oracle
Viewing all articles
Browse latest Browse all 1814

Blog Post: Join performance in MongoDB 3.2 using $lookup

$
0
0
One of the key tenants of MongoDB schema design is to account for the absence of server-side joins. Data is joined all the time inside of application code of course, but traditionally there’s been no way to perform joins within the server itself. This changed in 3.2 with the introduction of the $lookup operator within the aggregation framework. $lookup performs the equivalent of a left outer join – eg: it retrieves matching data from another document and returns null data if no match is found. Here’s an example using the MongoDB version of the Sakila dataset that I converted from MySQL back in this post : 1: db.films.aggregate([ 2: {$match:{ "Actors.First name" : "CHRISTIAN" , 3: "Actors.Last name" : "GABLE" }}, 4: {$lookup: { 5: from: "customers" , 6: as : "customerData" , 7: localField: "_id" , 8: foreignField: "Rentals.filmId" 9: }}, 10: {$unwind: "$customerData" }, 11: {$project:{ "Title" :1, 12: "FirstName" : "$customerData.First Name" , 13: "LastName" : "$customerData.Last Name" }}, 14: ]) What we’re doing here is finding all customers who have ever hired a film staring “Christian Gable”; We start by finding those films in the films collection (lines 2-3), then use $lookup to retrieve customer data (lines 4-9). Films embeds actors in the “Actors” array; the customers collection embeds films that have been hired in the "Rentals" array. The result of the join contains all the customers who have borrowed the movie returned as an array, so we use the $unwind operator to “flatten” them out (line 10). The resulting output looks like this: { "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "SUSAN", "LastName" : "WILSON" } { "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "REBECCA", "LastName" : "SCOTT" } { "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "DEBRA", "LastName" : "NELSON" } { "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "MARIE", "LastName" : "TURNER" } { "_id" : 1, "Title" : "ACADEMY DINOSAUR", "FirstName" : "TINA", "LastName" : "SIMMONS" } One thing that we need to be careful here is with join performance. The $lookup function is going to be executed once for each document returned by our $match condition. There is - AFAIK - no equivalent of a hash or sort merge join operation possible here, so we need to make sure that we've used an index. Unfortunately, the explain() command doesn’t help us. It tells us only if we have used an index to perform the initial $match , but doesn't show us if we used an index within the $lookup . Here's the explain output from the operation above (TL;DR): 1: > db.films.explain().aggregate([ 2: ... {$match:{ "Actors.First name" : "CHRISTIAN" , 3: ... "Actors.Last name" : "GABLE" }}, 4: ... {$lookup: { 5: ... from: "customers" , 6: ... as : "customerData" , 7: ... localField: "_id" , 8: ... foreignField: "Rentals.filmId" 9: ... }}, 10: ... {$unwind: "$customerData" }, 11: ... {$project:{ "Title" :1, 12: ... "FirstName" : "$customerData.First Name" , 13: ... "LastName" : "$customerData.Last Name" }}, 14: ... 15: ... ]) 16: { 17: "waitedMS" : NumberLong(0), 18: "stages" : [ 19: { 20: "$cursor" : { 21: "query" : { 22: "Actors.First name" : "CHRISTIAN" , 23: "Actors.Last name" : "GABLE" 24: }, 25: "fields" : { 26: "Title" : 1, 27: "customerData.First Name" : 1, 28: "customerData.Last Name" : 1, 29: "_id" : 1 30: }, 31: "queryPlanner" : { 32: "plannerVersion" : 1, 33: "namespace" : "sakila.films" , 34: "indexFilterSet" : false , 35: "parsedQuery" : { 36: "$and" : [ 37: { 38: "Actors.First name" : { 39: "$eq" : "CHRISTIAN" 40: } 41: }, 42: { 43: "Actors.Last name" : { 44: "$eq" : "GABLE" 45: } 46: } 47: ] 48: }, 49: "winningPlan" : { 50: "stage" : "COLLSCAN" , 51: "filter" : { 52: "$and" : [ 53: { 54: "Actors.First name" : { 55: "$eq" : "CHRISTIAN" 56: } 57: }, 58: { 59: "Actors.Last name" : { 60: "$eq" : "GABLE" 61: } 62: } 63: ] 64: }, 65: "direction" : "forward" 66: }, 67: "rejectedPlans" : [ ] 68: } 69: } 70: }, 71: { 72: "$lookup" : { 73: "from" : "customers" , 74: "as" : "customerData" , 75: "localField" : "_id" , 76: "foreignField" : "Rentals.filmId" , 77: "unwinding" : { 78: "preserveNullAndEmptyArrays" : false 79: } 80: } 81: }, 82: { 83: "$project" : { 84: "Title" : true , 85: "FirstName" : "$customerData.First Name" , 86: "LastName" : "$customerData.Last Name" 87: } 88: } 89: ], 90: "ok" : 1 91: } However, we can see the queries created by the $lookup function if we enable profiling. For instance if we turn profiling on can see a full collection scan of customers has have been generated for every film document that has been joined: These “nested” collection scans are bad news. Below is the results of a benchmark in which I joined two collections using $lookup with and without an index. As you can see, the unindexed $lookup degrades steeply as the number of rows to be joined increases. The solution is obvious: Always create an index on the foreignField attributes in a $lookup, unless the collections are of trivial size. The MongoDB company is putting a lot of new features into the aggregation framework: they clearly intend to create a very powerful and flexible capability that matches and maybe even exceeds what can be done with SQL. Indeed, the aggregation framework seems poised to become a dataflow language similar to Pig. Anyone wanting to do any serious work in MongoDB should make sure they are very comfortable with aggregate. If you use $lookup to perform joins in aggregate, make sure there is an index on the ForiegnField attribute.

Viewing all articles
Browse latest Browse all 1814

Trending Articles



<script src="https://jsc.adskeeper.com/r/s/rssing.com.1596347.js" async> </script>