Chapter 9. Recommendation Engines Using MapReduce


• Amazon.com and MyBuys.com, which recommend similar items that a user might
purchase; in other words, when a user views an item, the site shows what other
shoppers bought along with it
• Tripbase.com, a travel website that recommends travel/vacation packages based
on a user’s input or preferences
• Netflix, which predicts movies that a user might enjoy based on the user's
previous ratings and watching habits (as compared to the behavior of other users)
In this chapter we will address the following areas, which have roots in
recommendation engines and systems:
• Customers Who Bought This Item Also Bought
• Frequently Bought Together
• Recommend Connection
For details on recommendation systems, refer to [1], [26], and [11].

Customers Who Bought This Item Also Bought
Most ecommerce vendors, including Amazon.com, use the feature “Customers Who
Bought This Item Also Bought” (CWBTIAB) on their websites for recommending
books, CDs, and other items. Here we will build a simple recommendation system to
implement the CWBTIAB feature.
Suppose that the Amazon.com store log contains a user-id and bought-item for
each sale. We are going to implement CWBTIAB functionality using the MapReduce
paradigm. Whenever an item is shown, the store will suggest five other items most
often bought by buyers of that item.

Input
We assume that the input is a set of large transactions (a transaction log contains a lot
of data, including transaction ID, date, price, etc.), which have the following fields:
<user-id><,><bought-item>

Expected Output
The recommendation engine should emit key-value pairs in which the key is the item
and the value is a list of the five items most commonly purchased by customers who
also bought the original item.

MapReduce Solution
We implement CWBTIAB with two iterations of MapReduce:
• Phase 1: generate lists of all items bought by the same user. Grouping is handled
by the Hadoop framework, where both the mapper and the reducer perform an
identity function.
• Phase 2: solve the co-occurrences problem on list items. We use the Stripes
design pattern and emit only the five most common co-occurrences.
Before we discuss these two phases, I will explain the concept of Stripes with a simple
example.

Stripes design pattern
Stripes is a design pattern whose main idea is to group pairs together into an
associative array. Consider the classic case shown in Table 9-1 of key-value pairs
emitted by a mapper (note that in this example, the mapper's output key is a
composite key, Tuple2).
Table 9-1. Mapper output: classic approach
Key      Value
(k, k1)  3
(k, k2)  2
(k, k3)  4
(k, k4)  6
(z, z1)  7
(z, z2)  8
(z, z3)  5

The idea behind the Stripes approach is that rather than emitting many key-value
pairs, we just emit one per stripe, as shown in Table 9-2 (note that in this example, k
and z are natural keys).
Table 9-2. Mapper output: Stripes approach
Key  Value
k    { (k1, 3), (k2, 2), (k3, 4), (k4, 6) }
z    { (z1, 7), (z2, 8), (z3, 5) }

The Stripes approach creates an associative array (or a hash table) for each natural
key and reduces the number of key-value pairs emitted by each mapper. While the


emitted value of each mapper becomes a complex object (an associative array), with
the Stripes approach there will be less sorting and shuffling of key-value pairs.
How does a reducer work in the Stripes approach? Reducers perform an element-wise
sum of associative arrays. Consider the following three key-value examples for a
reducer (as input):
K -> { (a, 1), (b, 2), (c, 4), (d, 3) }
K -> { (a, 2), (c, 2) }
K -> { (a, 3), (b, 5), (d, 5) }

The generated output will be:
K -> { (a, 1+2+3), (b, 2+5), (c, 4+2), (d, 3+5) }

or:
K -> { (a, 6), (b, 7), (c, 6), (d, 8) }
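The element-wise sum above can be sketched in a few lines of Python (a minimal in-memory simulation of the Stripes reducer, not Hadoop code; the dictionaries stand in for the emitted associative arrays):

```python
from collections import Counter

def stripes_reduce(stripes):
    # Element-wise sum of the associative arrays received for one key.
    total = Counter()
    for stripe in stripes:
        total.update(stripe)
    return dict(total)

stripes = [
    {"a": 1, "b": 2, "c": 4, "d": 3},
    {"a": 2, "c": 2},
    {"a": 3, "b": 5, "d": 5},
]
print(stripes_reduce(stripes))  # {'a': 6, 'b': 7, 'c': 6, 'd': 8}
```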

The advantages of the Stripes approach are as follows:
• Since mappers generate fewer key-value pairs than with the classic approach,
there will be less sorting and shuffling required.
• Stripes enables us to make effective use of combiners (as a local per-node
optimization).
• Stripes offers better performance [14].
Some disadvantages of the Stripes approach are:
• It's more difficult to implement (since the value emitted by each mapper is an
associative array, we have to write a serializer and deserializer for that
associative array).
• The underlying objects (i.e., the values generated by mappers as associative
arrays) are more heavyweight.
• Stripes has a fundamental limitation in terms of the size of the event space
(since we are creating an associative array for each natural key, we need to make
sure that the mappers have enough RAM to hold these hash tables).

MapReduce phase 1
The first MapReduce phase generates lists of all items bought by the same user.
Grouping is done by the MapReduce framework on the userID (as a key). Both the
mapper and the reducer perform an identity function. The goal of phase 1 is to find
all the items purchased by each user.
The mapper is an identity function that emits key-value pairs as received (see
Example 9-1).

Example 9-1. Mapper: phase 1

// key = userID
// value = item bought by userID
map(userID, item) {
   emit(userID, item);
}

The reducer is an identity function that groups all items for a single user (see
Example 9-2).
Example 9-2. Reducer: phase 1

// key = userID
// value = list of items bought by userID
reduce(userID, items[I1, I2, ..., In]) {
   emit(userID, items);
}
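Phase 1 can be simulated in plain Python (an in-memory sketch of the grouping that the Hadoop framework performs on userID; the sample log entries are hypothetical):

```python
from collections import defaultdict

def phase1(transaction_log):
    # Group bought items by user, as Hadoop's shuffle does on the userID key.
    items_by_user = defaultdict(list)
    for user_id, item in transaction_log:
        items_by_user[user_id].append(item)
    return dict(items_by_user)

log = [("u1", "itemA"), ("u2", "itemB"), ("u1", "itemC")]
print(phase1(log))  # {'u1': ['itemA', 'itemC'], 'u2': ['itemB']}
```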

MapReduce phase 2
The second MapReduce phase solves the co-occurrences problem on list items. With
the Stripes approach, the mappers do most of the work, aggregating data and then
passing it to the combiners and reducers. The reducers then emit the expected output
(an item followed by a list of the five most common co-occurrences).
Since we might be creating many hash tables (associative arrays) in each mapper
and reducer, we need to make sure that we have enough memory to hold these data
structures. If the number of users or items is large enough, the hash tables
might not fit in memory. If your RAM is limited, you might consider creating
these hash tables on disk (such a solution is available in MapDB).1
As you can see in Example 9-3, the mapper in this phase includes combiner
functionality. (Using the Stripes approach, the mappers do most of the work,
which has to be done by the map() and combine() functions.) As noted earlier, the
Stripes approach minimizes the number of keys generated, and therefore the
MapReduce execution framework has less shuffling and sorting to perform. However,
since we are using a nonprimitive type for our values (an associative array),
more serialization and deserialization will be required.

1 MapDB (developed by Jan Kotek) provides concurrent maps, sets, and queues backed by disk storage or off-heap memory. It is a fast and easy-to-use embedded Java database engine.


Example 9-3. Mapper: phase 2

// key = userID
// value = list of items bought by userID
map(userID, items[I1, I2, ..., In]) {
   for (Item item : items) {
      Map map = new HashMap();
      for (Item j : items) {
         if (!item.equals(j)) {
            map(j) = map(j) + 1;
         }
      }
      emit(item, map);
   }
}
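A minimal Python sketch of this phase-2 mapper (an in-memory stand-in for the pseudocode above; it skips the item itself, since an item trivially co-occurs with itself):

```python
from collections import Counter

def phase2_map(user_id, items):
    # For each item, emit (item, stripe): the stripe counts the other
    # items bought by the same user.
    for item in items:
        stripe = Counter(j for j in items if j != item)
        yield item, dict(stripe)

emitted = dict(phase2_map("u1", ["A", "B", "C"]))
print(emitted["A"])  # {'B': 1, 'C': 1}
```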

As Example 9-4 shows, in this phase the reducer generates the “top 5” items most
commonly purchased along with every item in all transactions. The reducer performs
an item-wise sum of all stripes (represented as an associative array) for a given item.
Example 9-4. Reducer: phase 2

// key = item
// value = list of stripes [M1, M2, ..., Mm]
reduce(item, stripes[M1, M2, ..., Mm]) {
   Map final = new HashMap();
   for (Map map : stripes) {
      for (all (k, v) in map) {
         final(k) = final(k) + v;
      }
   }
   emit(item, top(5, final));
}

top(N, Map) will return the N items with the maximum frequencies for a given
associative array.
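In Python, top(N, Map) could be sketched with heapq.nlargest (the function name and the (item, frequency) return shape here are my choice, not from the pseudocode):

```python
import heapq

def top(n, assoc):
    # Return the n (item, frequency) entries with the highest frequencies.
    return heapq.nlargest(n, assoc.items(), key=lambda kv: kv[1])

print(top(2, {"a": 6, "b": 7, "c": 6, "d": 8}))  # [('d', 8), ('b', 7)]
```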

Frequently Bought Together
The purpose of this section is to implement the "Frequently Bought Together" (FBT)
feature using MapReduce/Hadoop. FBT is a behavioral targeting technique that
leverages users' previous purchasing history in order to select and display other
relevant products that a user may want to buy. Suppose you are searching for
Donald Knuth's The Art of Computer Programming on Amazon.com. On the product
detail page, you will see a section called "Frequently Bought Together" that
lists the original item you searched for, along with books often bought with it
(see Figure 9-1).


Figure 9-1. Frequently Bought Together feature
How does Amazon come up with this list for most of its items? It basically
searches for relationships between items (such as books and CDs). Typically,
ecommerce sites like Amazon.com gather data on their customers' purchasing
habits. Using association rule learning, these sites can determine which products
are frequently bought together and use that information for marketing purposes.
This is sometimes referred to as a variation on Market Basket Analysis, a
well-known topic in data mining covered in Chapter 7 of this book.

Input and Expected Output
Let’s assume that we have an input of product sales transactions for all customers.
Let’s also assume that we have n transactions (labeled as T1, ..., Tn) and m products
(labeled as P1, ..., Pm), and that we have compiled the input as shown in Table 9-3.
Table 9-3. Product sales transactions for all customers
Transaction  Purchased items
T1           {P1,1, P1,2, ..., P1,k1}
T2           {P2,1, P2,2, ..., P2,k2}
...          ...
Tn           {Pn,1, Pn,2, ..., Pn,kn}

In this table:
• Pi,j ∊ {P1, ..., Pm}.
• ki is the number of items purchased in transaction Ti.
• Each line of input is a transaction ID, followed by a list of products purchased.
Our goal is to build a hash table in which the key is Pi, for i = 1, 2, ..., m, and
the value is the list of products frequently purchased together with Pi.
For example, say we have the input shown in Table 9-4.

Table 9-4. Input for FBT example
Transaction  Purchased items
T1           {P1, P2, P3}
T2           {P2, P3}
T3           {P2, P3, P4}
T4           {P5, P6}
T5           {P3, P4}

Then our desired output will look like Table 9-5.
Table 9-5. Desired output for FBT example
Item  Frequently bought together
P1    {P2, P3}
P2    {P1, P3, P4}
P3    {P1, P2, P4}
P4    {P2, P3}
P5    {P6}
P6    {P5}

Therefore, if a customer is browsing product P3, we can say the frequently bought
together products are P1, P2, and P4.

MapReduce Solution
The map() function will take a single transaction and generate a set of key-value
pairs to be consumed by reduce(). The mapper emits a pair of transaction items
(i.e., products) as the key and the number of occurrences of that pair in the
transaction as the value; the transaction ID is ignored.
For example, map() for transaction 1 (T1) will emit the following key-value pairs:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]

Note that if we simply select two products in a transaction as a key without
ordering them, the counts for the occurrences of the product pairs will be
incorrect. For example, suppose transactions T1 and T2 contain the following
products (the same items, but in a different order):
T1: (P1, P2, P3)
T2: (P1, P3, P2)


then for transaction T1, map() will generate:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]

and for transaction T2, map() will generate:
[<P1, P3>, 1]
[<P1, P2>, 1]
[<P3, P2>, 1]

As you can see from the map() outputs for transactions T1 and T2, the keys
(P2, P3) and (P3, P2) are not identified as the same pair even though they
contain the same two products, so their occurrences are counted separately. We
know that this is not correct. We can avoid this problem by sorting the
transaction products in alphabetical order before generating the key-value pairs.
After sorting the items in the transactions, we get:
sorted T1: (P1, P2, P3)
sorted T2: (P1, P2, P3)

Now each transaction (T1 and T2) will have the following three key-value pairs:
[<P1, P2>, 1]
[<P1, P3>, 1]
[<P2, P3>, 1]

We accumulate the values of the occurrences for these two transactions as
follows: [<P1, P2>, 2], [<P1, P3>, 2], [<P2, P3>, 2]. This gives us the correct
counts for the total number of occurrences.
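The effect of sorting before pair generation can be checked with a short Python snippet (a sketch using itertools.combinations for the pairing; the helper name is mine):

```python
from itertools import combinations

def transaction_pairs(items):
    # Sort first so (P2, P3) and (P3, P2) become the same key.
    return list(combinations(sorted(items), 2))

t1 = transaction_pairs(["P1", "P2", "P3"])
t2 = transaction_pairs(["P1", "P3", "P2"])
print(t1 == t2)  # True
```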

Mapper
The mapper, shown in Example 9-5, reads the input data and creates a list of items for
each transaction. For each transaction, its time complexity is O(n), where n is the
number of items for a transaction. Then, the items in the transaction list are sorted to
avoid duplicate keys like (P2, P3) and (P3, P2). The time complexity of Quicksort
is O(n log n). Then, the sorted transaction items are converted to pairs of items as
keys, which is a cross-operation that allows us to generate cross-pairs of the items in
the list.
Example 9-5. Frequently Bought Together: map() function

// key is transaction ID and ignored here
// value = transaction items (P1, P2, ..., Pm)
map(key, value) {
   (S1, S2, ..., Sm) = sort(P1, P2, ..., Pm);
   // now, we have: S1 < S2 < ... < Sm
   ListOfPair<Si, Sj> = generateCombinations(S1, S2, ..., Sm);
   for ( (Si, Sj) pair : ListOfPair ) {
      // reducer key is: (Si, Sj)
      // reducer value is: integer 1
      emit([(Si, Sj), 1]);
   }
}

generateCombinations(S1, S2, ..., Sm) generates all combinations between any
two items in a given transaction. For example, generateCombinations(S1, S2, S3,
S4) will return the following pairs:
(S1, S2)
(S1, S3)
(S1, S4)
(S2, S3)
(S2, S4)
(S3, S4)
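In Python, generateCombinations maps directly onto itertools.combinations (a sketch; the pseudocode is not tied to any particular library):

```python
from itertools import combinations

def generate_combinations(items):
    # All unordered pairs of items from one (sorted) transaction.
    return list(combinations(items, 2))

print(generate_combinations(["S1", "S2", "S3", "S4"]))
# [('S1', 'S2'), ('S1', 'S3'), ('S1', 'S4'), ('S2', 'S3'), ('S2', 'S4'), ('S3', 'S4')]
```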

Finally, map() will output the following key-value pairs:
<P1, P2>  1
<P1, P3>  1
<P2, P3>  1
<P2, P3>  1
<P2, P3>  1
<P2, P4>  1
<P3, P4>  1
<P5, P6>  1
<P3, P4>  1

Reducer
The FBT algorithm for the reducer is illustrated in Example 9-6. The reducer
sums the values for each key; thus, its time complexity is O(v), where v is the
number of values per key.
Example 9-6. Frequently Bought Together: reduce() function

// key is in the form (Si, Sj)
// value = List<Integer>, where each element is an integer number
reduce(key, value) {
   int sum = 0;
   for (int i : value) {
      sum += i;
   }
   emit(key, sum);
}


Reducer output
The reducer will create the following output format:
[<Pi, Pj>, N]

where N is the number of transactions in which products Pi and Pj have been
purchased together. The higher N is, the closer the relationship between these
two products.
Now, with this output, we can create the desired hash table, where the keys are {P1,
P2, ..., Pm}.
For our input, the reducer will generate the following output:

<P1, P2>  1
<P1, P3>  1
<P2, P3>  3
<P2, P4>  1
<P3, P4>  2
<P5, P6>  1
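The whole FBT map/reduce pipeline can be simulated in a few lines of Python (an in-memory sketch that combines the map, sort, and reduce steps for the Table 9-4 input; real Hadoop jobs would distribute this work):

```python
from collections import Counter
from itertools import combinations

def fbt_counts(transactions):
    # Count how often each sorted pair of products appears together
    # in a transaction (map + shuffle + reduce, done in memory).
    counts = Counter()
    for items in transactions:
        counts.update(combinations(sorted(items), 2))
    return counts

transactions = [
    ["P1", "P2", "P3"],  # T1
    ["P2", "P3"],        # T2
    ["P2", "P3", "P4"],  # T3
    ["P5", "P6"],        # T4
    ["P3", "P4"],        # T5
]
counts = fbt_counts(transactions)
print(counts[("P2", "P3")], counts[("P3", "P4")])  # 3 2
```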

Recommend Connection
In this section, I provide a complete Spark-based MapReduce algorithm to
recommend that people connect with each other. Spark provides an API for graphs
and graph-parallel computation called GraphX, but we won't be using that here;
we'll only use the Spark API.
These days, there are lots of social network sites (Facebook, Instagram, LinkedIn,
Pinterest, etc.). One feature they all have in common is recommending that people
connect. For example, the “People You May Know” feature from LinkedIn allows
members to see others they might want to link with. The basic idea is this: if Alex is a
friend of Bob and Alex is a friend of Barbara (i.e., Alex is a common friend of Bob
and Barbara, but Bob and Barbara do not know each other), then the social network
system should recommend that Bob connect with Barbara and vice versa. In other
words, if two people have a set of mutual friends, but they are not friends, then the
MapReduce solution should recommend that they be connected to each other.
Friendship among all users can be expressed as a graph. For our simple example, we
can use the graph shown in Figure 9-2.


Figure 9-2. Friendship connection
In mathematics, a graph is an ordered pair G = (V, E) comprising a set V of
vertices or nodes together with a set E of edges or lines, which are two-element
subsets of V (i.e., each edge joins two vertices and is represented as an
unordered pair of those vertices). This type of graph may be described precisely
as undirected and simple. Our assumption for our MapReduce solution is that a
friendship between people can be represented in an undirected graph (if person A
is a friend of person B, then B is a friend of A). Most social networks (such as
Facebook and LinkedIn) use bidirectional friendships, while friendship on
Twitter is directional.
In our case, the graph is an ordered pair G = (V, E) where:
• V is a finite set of people (users of the social network).
• E is a binary relation on V, called the edge set, whose elements represent
friendships.
From a graph theory perspective, for each person or member of a specific social
network who is within two degrees of person A, we count how many distinct paths
(with two connecting edges) exist between this person and person A. We then rank
this list by the number of paths and show the top 10 people that person A should
connect with. I will show how we can use a MapReduce solution to compute this
top 10 connection list for every person. Therefore, our goal is to precompute
(as a batch job) the top 10 recommended people for every member of a social
network.
The friendship recommendation problem can be stated as: for every person P, we
determine a list P1, P2, ..., P10 composed of the 10 people with whom person P has the
most friends in common.
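As a small in-memory sketch (not the Spark solution developed in this section), the mutual-friends ranking can be computed like this; the sample graph is the Alex/Bob/Barbara example from above:

```python
from collections import Counter
from itertools import combinations

def recommend(friends, top_n=10):
    # friends maps each person to the set of their direct friends
    # (undirected: if A is in friends[B], then B is in friends[A]).
    mutual = {person: Counter() for person in friends}
    for person, fs in friends.items():
        # Every pair of this person's friends shares them as a mutual friend.
        for a, b in combinations(sorted(fs), 2):
            if b not in friends[a]:  # only recommend people not yet connected
                mutual[a][b] += 1
                mutual[b][a] += 1
    return {p: [q for q, _ in c.most_common(top_n)] for p, c in mutual.items()}

graph = {
    "Alex": {"Bob", "Barbara"},
    "Bob": {"Alex"},
    "Barbara": {"Alex"},
}
print(recommend(graph)["Bob"])  # ['Barbara']
```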
