Chapter 9. Recommendation Engines Using MapReduce
• Amazon.com and MyBuys.com, which provide recommendation systems for
similar items that a user might purchase—in other words, when a user views
what other shoppers bought along with the currently selected item
• Tripbase.com, a travel website that recommends travel/vacation packages based
on a user’s input or preferences
• Netflix, which predicts movies that a user might enjoy based on the user’s previous ratings and watching habits (as compared to the behavior of other users)
In this chapter we will address the following areas, which have roots in recommendation engines and systems:
• Customers Who Bought This Item Also Bought
• Frequently Bought Together
• Recommend Connection
For details on recommendation systems, refer to [1], [26], and [11].
Customers Who Bought This Item Also Bought
Most ecommerce vendors, including Amazon.com, use the feature “Customers Who
Bought This Item Also Bought” (CWBTIAB) on their websites for recommending
books, CDs, and other items. Here we will build a simple recommendation system to
implement the CWBTIAB feature.
Suppose that the Amazon.com store log contains a userID and a boughtItem for each sale. We are going to implement the CWBTIAB functionality using the MapReduce paradigm. Whenever an item is shown, the store will suggest the five other items most often bought by buyers of that item.
Input
We assume that the input is a set of large transaction records (a transaction log contains a lot of data, including transaction ID, date, price, etc.), which have the following fields:
<userID, boughtItem>
Expected Output
The recommendation engine should emit key-value pairs in which the key is the item and the value is a list of the five items most commonly purchased by customers who also bought the original item.
MapReduce Solution
We implement CWBTIAB with two iterations of MapReduce:
• Phase 1: generate lists of all items bought by the same user. Grouping is handled
by the Hadoop framework, where both the mapper and the reducer perform an
identity function.
• Phase 2: solve the co-occurrences problem on list items. We use the Stripes design pattern and emit only the five most common co-occurrences.
Before we discuss these two phases, I will explain the concept of Stripes with a simple
example.
Stripes design pattern
Stripes is a design pattern, and the main idea behind it is to group pairs together into an associative array. Consider the classic case shown in Table 9-1 of key-value pairs emitted by a mapper (note that in this example, the mapper’s output key is a composite key, Tuple2).
Table 9-1. Mapper output: classic approach
Key      Value
(k, k1)  3
(k, k2)  2
(k, k3)  4
(k, k4)  6
(z, z1)  7
(z, z2)  8
(z, z3)  5
The idea behind the Stripes approach is that rather than emitting many key-value pairs, we just emit one per stripe, as shown in Table 9-2 (note that in this example, k and z are natural keys).
Table 9-2. Mapper output: Stripes approach
Key  Value
k    { (k1, 3), (k2, 2), (k3, 4), (k4, 6) }
z    { (z1, 7), (z2, 8), (z3, 5) }
The Stripes approach creates an associative array (or a hash table) for each natural key and reduces the number of key-value pairs emitted by each mapper. While the emitted value of each mapper becomes a complex object (an associative array), with the Stripes approach there will be less sorting and shuffling of key-value pairs.
How does a reducer work in the Stripes approach? Reducers perform an element-wise sum of associative arrays. Consider the following three key-value examples for a reducer (as input):
K -> { (a, 1), (b, 2), (c, 4), (d, 3) }
K -> { (a, 2), (c, 2) }
K -> { (a, 3), (b, 5), (d, 5) }
The generated output will be:
K -> { (a, 1+2+3), (b, 2+5), (c, 4+2), (d, 3+5) }
or:
K -> { (a, 6), (b, 7), (c, 6), (d, 8) }
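This element-wise summing of stripes can be sketched in plain Python (a local illustration of the reducer logic only, not Hadoop code; names are mine):

```python
from collections import Counter

def reduce_stripes(stripes):
    """Element-wise sum of associative arrays (stripes) for one key."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)  # adds the counts element-wise
    return dict(total)

stripes = [
    {"a": 1, "b": 2, "c": 4, "d": 3},
    {"a": 2, "c": 2},
    {"a": 3, "b": 5, "d": 5},
]
print(reduce_stripes(stripes))  # {'a': 6, 'b': 7, 'c': 6, 'd': 8}
```

Counter.update() with a mapping argument adds counts rather than replacing them, which is exactly the element-wise sum described above.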
The advantages of the Stripes approach are as follows:
• Since mappers generate fewer key-value pairs than with the classic approach, there will be less sorting and shuffling required.
• Stripes enables us to make effective use of combiners (as a local per-node optimization).
• Stripes offers us better performance [14].
Some disadvantages of the Stripes approach are:
• It’s more difficult to implement (since the value emitted by each mapper is an associative array, we have to write a serializer and deserializer for that associative array).
• The underlying objects (i.e., values generated by mappers as associative arrays) are more heavyweight.
• Stripes has a fundamental limitation in terms of the size of the event space (since we are creating an associative array for each natural key, we need to make sure that the mappers have enough RAM to hold these hash tables).
MapReduce phase 1
The first MapReduce phase generates lists of all items bought by the same user.
Grouping is done by the MapReduce framework on the userID (as a key). Both the
mapper and the reducer perform an identity function. The goal of phase 1 is to find
all items purchased by all users.
The mapper is an identity function that emits key-value pairs as received (see Example 9-1).
Example 9-1. Mapper: phase 1

// key = userID
// value = item bought by userID
map(userID, item) {
   emit(userID, item);
}
The reducer is an identity function that groups all items for a single user (see Example 9-2).
Example 9-2. Reducer: phase 1

// key = userID
// value = list of items bought by userID
reduce(userID, items[I1, I2, ..., In]) {
   emit(userID, items);
}
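Phase 1 can be simulated locally in Python; the dictionary grouping below stands in for the shuffle the Hadoop framework performs on userID (a sketch with made-up data, not framework code):

```python
from collections import defaultdict

def phase1(transactions):
    """Group all items bought by the same user.
    transactions: iterable of (userID, item) pairs."""
    items_by_user = defaultdict(list)
    for user_id, item in transactions:  # identity mapper: emit(userID, item)
        items_by_user[user_id].append(item)
    # identity reducer: emit(userID, items)
    return dict(items_by_user)

log = [("u1", "P1"), ("u2", "P2"), ("u1", "P3"), ("u2", "P3")]
print(phase1(log))  # {'u1': ['P1', 'P3'], 'u2': ['P2', 'P3']}
```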
MapReduce phase 2
The second MapReduce phase solves the co-occurrences problem on list items. With the Stripes approach, the mappers do most of the work, aggregating data and then passing it to the combiners and reducers. The reducers then emit the expected output (an item followed by a list of the five most common co-occurrences).
Since we might be creating many hash tables (associative arrays) for each mapper/reducer, we need to make sure that we have enough memory to hold these data structures. If the number of users or items is large enough, they might not fit in memory.
If your memory/RAM is limited, you might consider creating these hash tables on
disk (such a solution is available in MapDB).1
As you can see in Example 9-3, the mapper in this phase includes combiner functionality. (Using the Stripes approach, the mappers do most of the work, which has to be done by map() and combine() functions.) As noted earlier, the Stripes approach minimizes the number of keys generated, and therefore the MapReduce execution framework has less shuffling and sorting to perform. However, since we are using a non-primitive type for our values (an associative array), more serialization and deserialization will be required.
1 MapDB (developed by Jan Kotek) provides concurrent maps, sets, and queues backed by disk storage or off-heap memory. It is a fast and easy-to-use embedded Java database engine.
Example 9-3. Mapper: phase 2

// key = userID
// value = list of items bought by userID
map(userID, items[I1, I2, ..., In]) {
   for (Item item : items) {
      Map<Item, Integer> map = new HashMap<Item, Integer>();
      for (Item j : items) {
         map(j) = map(j) + 1;
      }
      emit(item, map);
   }
}
As Example 9-4 shows, in this phase the reducer generates the “top 5” items most commonly purchased along with every item in all transactions. The reducer performs an item-wise sum of all stripes (each represented as an associative array) for a given item.
Example 9-4. Reducer: phase 2

// key = item
// value = list of stripes [M1, M2, ..., Mm]
reduce(item, stripes[M1, M2, ..., Mm]) {
   Map<Item, Integer> final = new HashMap<Item, Integer>();
   for (Map<Item, Integer> map : stripes) {
      for (all (k, v) in map) {
         final(k) = final(k) + v;
      }
   }
   emit(item, top(5, final));
}
top(N, Map<Item, Integer>) will return the N items with the top/maximum frequencies in a given associative array.
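Phase 2 can likewise be sketched locally in Python, with most_common() playing the role of top(N, ...). Here the item itself is excluded from its own stripe, a detail the pseudocode leaves implicit, and all names are mine:

```python
from collections import Counter

def map_phase2(user_id, items):
    """For each item bought by this user, emit a stripe counting
    the other items bought along with it."""
    for item in items:
        stripe = Counter(j for j in items if j != item)
        yield item, stripe

def reduce_phase2(item, stripes, n=5):
    """Element-wise sum of all stripes for an item, then keep the
    top-n co-occurrences (the top(n, ...) step of the reducer)."""
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return item, total.most_common(n)

# stripes for item P1 from a single user's purchase list
stripes = [s for it, s in map_phase2("u1", ["P1", "P2", "P3"]) if it == "P1"]
print(reduce_phase2("P1", stripes))
```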
Frequently Bought Together
The purpose of this section is to implement the “Frequently Bought Together” (FBT) feature using MapReduce/Hadoop. FBT is a behavioral targeting technique that leverages users’ previous purchasing history in order to select and display other relevant products that a user may want to buy. Suppose you are searching for Donald Knuth’s The Art of Computer Programming on Amazon.com. On the product detail page, you will see a section called “Frequently Bought Together” that lists the original item you searched for, and books often bought along with it (see Figure 9-1).
Figure 9-1. Frequently Bought Together feature
How does Amazon come up with this list for most of its items? It basically does a search for relationships between items (such as books and CDs). Typically, ecommerce sites like Amazon.com gather data on their customers’ purchasing habits. Using association rule learning, these sites can determine which products are frequently bought together and use that information for marketing purposes. This is sometimes referred to as a variation on Market Basket Analysis, a well-known topic in data mining covered in Chapter 7 of this book.
Input and Expected Output
Let’s assume that we have an input of product sales transactions for all customers. Let’s also assume that we have n transactions (labeled T1, ..., Tn) and m products (labeled P1, ..., Pm), and that we have compiled the input as shown in Table 9-3.
Table 9-3. Product sales transactions for all customers
Transaction  Purchased items
T1           {P1,1, P1,2, ..., P1,k1}
T2           {P2,1, P2,2, ..., P2,k2}
...          ...
Tn           {Pn,1, Pn,2, ..., Pn,kn}
In this table:
• Pi,j ∊ {P1, ..., Pm}.
• ki is the number of items purchased in transaction Ti.
• Each line of input is a transaction ID, followed by a list of products purchased.
Our goal is to build a hash table for which the key will be Pi for i = 1, 2, ..., m, and the
value will be the list of products purchased together.
For example, say we have the input shown in Table 9-4.
Table 9-4. Input for FBT example
Transaction  Purchased items
T1           {P1, P2, P3}
T2           {P2, P3}
T3           {P2, P3, P4}
T4           {P5, P6}
T5           {P3, P4}
Then our desired output will look like Table 9-5.
Table 9-5. Desired output for FBT example
Item  Frequently bought together
P1    {P2, P3}
P2    {P1, P3, P4}
P3    {P1, P2, P4}
P4    {P2, P3}
P5    {P6}
P6    {P5}
Therefore, if a customer is browsing product P3, we can say the frequently bought
together products are P1, P2, and P4.
MapReduce Solution
The map() function will take a single transaction and generate a set of key-value pairs to be consumed by reduce(). The mapper emits each pair of transaction items (i.e., products) as a key and the number of occurrences of that pair in the transaction as its value (for all transactions; the transaction ID is ignored).
For example, map() for transaction 1 (T1) will emit the following key-value pairs:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]
Note that if we simply use pairs of products in transaction order as keys, the counts for the occurrences of product pairs will be incorrect. For example, suppose transactions T1 and T2 contain the same items but in a different order:
T1: (P1, P2, P3)
T2: (P1, P3, P2)
then for transaction T1, map() will generate:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]
and for transaction T2, map() will generate:
[<P1, P3>, 1]
[<P1, P2>, 1]
[<P3, P2>, 1]
As you can see from the map() outputs for transactions T1 and T2, the keys (P2, P3) and (P3, P2) are not identified as the same even though they represent the same pair of products, so the pair counts will be incorrect. We can avoid this problem if we sort the transaction products in alphabetical order before generating the key-value pairs. After sorting the items in the transactions, we will get:
sorted T1: (P1, P2, P3)
sorted T2: (P1, P2, P3)
Now each transaction (T1 and T2) will have the following three key-value pairs:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]
We accumulate the values of the occurrences for these two transactions as follows:
[<P1, P2>, 2], [<P1, P3>, 2], [<P2, P3>, 2]. This gives us the correct counts for the total number of occurrences.
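The effect of sorting can be checked with a small Python sketch (a local illustration only; the sort_items flag and all names are mine):

```python
from itertools import combinations

def pair_counts(transactions, sort_items=True):
    """Count product pairs across transactions. Sorting each
    transaction first makes (P2, P3) and (P3, P2) collide into one key."""
    counts = {}
    for items in transactions:
        if sort_items:
            items = sorted(items)
        for pair in combinations(items, 2):
            counts[pair] = counts.get(pair, 0) + 1
    return counts

t1, t2 = ["P1", "P2", "P3"], ["P1", "P3", "P2"]
unsorted_counts = pair_counts([t1, t2], sort_items=False)
print(unsorted_counts[("P2", "P3")], unsorted_counts[("P3", "P2")])  # 1 1
print(pair_counts([t1, t2]))  # each of the three pairs counted twice
```

Without sorting, (P2, P3) and (P3, P2) each get a count of 1; with sorting, all three pairs correctly get a count of 2.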
Mapper
The mapper, shown in Example 9-5, reads the input data and creates a list of items for each transaction. For each transaction, its time complexity is O(n), where n is the number of items in the transaction. The items in the transaction list are then sorted to avoid duplicate keys like (P2, P3) and (P3, P2); the time complexity of Quicksort is O(n log n). Finally, the sorted transaction items are converted into pairs of items as keys, a cross operation that generates all pairs of the items in the list.
Example 9-5. Frequently Bought Together: map() function

// key is the transaction ID and is ignored here
// value = transaction items (P1, P2, ..., Pm)
map(key, value) {
   (S1, S2, ..., Sm) = sort(P1, P2, ..., Pm);
   // now, we have: S1 < S2 < ... < Sm
   ListOfPair<Si, Sj> = generateCombinations(S1, S2, ..., Sm);
   for ( (Si, Sj) pair : ListOfPair ) {
      // reducer key is: (Si, Sj)
      // reducer value is: integer 1
      emit([(Si, Sj), 1]);
   }
}
generateCombinations(S1, S2, ..., Sm) generates all combinations between any
two items in a given transaction. For example, generateCombinations(S1, S2, S3,
S4) will return the following pairs:
(S1, S2)
(S1, S3)
(S1, S4)
(S2, S3)
(S2, S4)
(S3, S4)
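A generateCombinations equivalent already exists in Python's standard library as itertools.combinations (shown here only to illustrate the expected pairs):

```python
from itertools import combinations

# all 2-element combinations, in the same order as generateCombinations
pairs = list(combinations(["S1", "S2", "S3", "S4"], 2))
print(pairs)
# [('S1', 'S2'), ('S1', 'S3'), ('S1', 'S4'), ('S2', 'S3'), ('S2', 'S4'), ('S3', 'S4')]
```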
Finally, map() will output the following key-value pairs:
<P1, P2>  1
<P1, P3>  1
<P2, P3>  1
<P2, P3>  1
<P2, P3>  1
<P2, P4>  1
<P3, P4>  1
<P3, P4>  1
<P5, P6>  1
Reducer
The FBT algorithm for the reducer is illustrated in Example 9-6. The reducer sums the values for each key. Thus, its time complexity is O(v), where v is the number of values per key.
Example 9-6. Frequently Bought Together: reduce() function

// key is in the form of (Si, Sj)
// value = List<Integer>, where each element is an integer number
reduce(key, values) {
   int sum = 0;
   for (int i : values) {
      sum += i;
   }
   emit(key, sum);
}
Reducer output
The reducer will create output in the following format:
<Pi, Pj>, N
where N is the number of transactions in which products Pi and Pj have been purchased together. The higher N is, the closer the relationship between the two products. Now, with this output, we can create the desired hash table, where the keys are {P1, P2, ..., Pm}.
For our input, the reducer will generate the following output:
<P1, P2>  1
<P1, P3>  1
<P2, P3>  3
<P2, P4>  1
<P3, P4>  2
<P5, P6>  1
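The whole FBT pipeline on the sample input can be simulated in a few lines of Python (a local sketch rather than MapReduce code; it reproduces the pair counts for the example transactions):

```python
from collections import Counter
from itertools import combinations

def fbt(transactions):
    """Map step: emit sorted product pairs per transaction;
    reduce step: sum the counts per pair (done here by Counter)."""
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return dict(counts)

transactions = [
    ["P1", "P2", "P3"],  # T1
    ["P2", "P3"],        # T2
    ["P2", "P3", "P4"],  # T3
    ["P5", "P6"],        # T4
    ["P3", "P4"],        # T5
]
result = fbt(transactions)
print(result[("P2", "P3")], result[("P3", "P4")])  # 3 2
```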
Recommend Connection
In this section, I provide a complete Spark-based MapReduce algorithm to recommend that people connect with each other. Spark provides an API for graphs and graph-parallel computation called GraphX, but we won’t be using that here; we’ll only use the Spark API.
These days, there are lots of social network sites (Facebook, Instagram, LinkedIn,
Pinterest, etc.). One feature they all have in common is recommending that people
connect. For example, the “People You May Know” feature from LinkedIn allows
members to see others they might want to link with. The basic idea is this: if Alex is a
friend of Bob and Alex is a friend of Barbara (i.e., Alex is a common friend of Bob
and Barbara, but Bob and Barbara do not know each other), then the social network
system should recommend that Bob connect with Barbara and vice versa. In other
words, if two people have a set of mutual friends, but they are not friends, then the
MapReduce solution should recommend that they be connected to each other.
Friendship among all users can be expressed as a graph. For our simple example, we can use the graph shown in Figure 9-2.
Figure 9-2. Friendship connection
In mathematics, a graph is an ordered pair G = (V, E) comprising a set V of vertices or nodes together with a set E of edges or lines, which are two-element subsets of V (i.e., an edge is related to two vertices, and the relation is represented as an unordered pair of the vertices with respect to the particular edge). This type of graph may be described precisely as undirected and simple. Our assumption for our MapReduce solutions is that a friendship between people can be represented in an undirected graph (if person A is a friend of person B, then B is a friend of A). Most social networks (such as Facebook and LinkedIn) use bidirectional friendships, while friendship on Twitter is directional.
In our case, the graph is an ordered pair G = (V, E) where:
• V is a finite set of people (users of the social network).
• E is a binary relation on V called the edge set, whose elements represent friendships.
From a graph theory perspective, for each person or member of a specific social network who is within two degrees of person A, we count how many distinct paths (with two connecting edges) exist between this person and person A. We then rank this list in terms of the number of paths and show the top 10 people that person A should connect with. I will show how we can use a MapReduce solution to compute this top 10 connection list for every person. Therefore, our goal is to precompute (as a batch job) the top 10 recommended people for every member of a social network.
The friendship recommendation problem can be stated as: for every person P, we
determine a list P1, P2, ..., P10 composed of the 10 people with whom person P has the
most friends in common.
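The common-friends idea can be sketched in plain Python before turning to Spark (all names are mine, ties are broken arbitrarily, and the input is assumed to be a symmetric friendship map):

```python
from itertools import combinations
from collections import Counter

def recommend(friends, top_n=10):
    """friends: dict person -> set of direct friends (undirected).
    For each person, rank non-friends by the number of mutual friends."""
    mutual = Counter()  # (a, b) -> number of common friends
    for person, fs in friends.items():
        # every pair of this person's friends shares 'person' as a mutual friend
        for a, b in combinations(sorted(fs), 2):
            if b not in friends.get(a, set()):  # skip existing friendships
                mutual[(a, b)] += 1
    recs = {}
    for (a, b), n in mutual.items():
        recs.setdefault(a, []).append((b, n))
        recs.setdefault(b, []).append((a, n))
    # keep the top_n candidates per person, ranked by mutual-friend count
    return {p: sorted(lst, key=lambda x: -x[1])[:top_n] for p, lst in recs.items()}

friends = {
    "Alex": {"Bob", "Barbara"},
    "Bob": {"Alex"},
    "Barbara": {"Alex"},
}
print(recommend(friends))  # Bob and Barbara are recommended to each other
```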