Chapter 11. Smarter Email Marketing with the Markov Model

• Phase 1: build a model using historical training data.
• Phase 2: predict the next state for new data using the model built in phase 1.

Markov Chains in a Nutshell
Let S = {S1, S2, S3, ...} be a finite set of states. We want to compute the following
probability:
P (Sn|Sn–1,Sn–2, ..., S1)
Markov’s first-order assumption is the following:
P (Sn|Sn–1,Sn–2, ..., S1) ≈ P (Sn|Sn–1)
This approximation states the Markov property: the state of the system at time t + 1
depends only on the state of the system at time t.
The Markov second-order assumption is the following:
P (Sn|Sn–1,Sn–2, ..., S1) ≈ P (Sn|Sn–1,Sn–2)
Now, we can express the joint probability using the Markov assumption:
P (S1, S2, ..., Sn) = P (S1) × ∏i=2..n P (Si|Si–1)
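The factorization above can be turned into a few lines of code. Here is a minimal sketch using a made-up two-state chain; the state names and probability values are illustrative, not from the chapter:

```java
import java.util.Map;

public class MarkovJoint {

    // Under the first-order assumption:
    // P(S1, ..., Sn) = P(S1) * product of P(Si | Si-1) for i = 2..n.
    // Transition probabilities are stored as "current->next" -> probability.
    static double jointProbability(String[] states,
                                   Map<String, Double> transition,
                                   Map<String, Double> initial) {
        double p = initial.get(states[0]);
        for (int i = 1; i < states.length; i++) {
            p *= transition.get(states[i - 1] + "->" + states[i]);
        }
        return p;
    }

    public static void main(String[] args) {
        Map<String, Double> init = Map.of("A", 0.5, "B", 0.5);
        Map<String, Double> trans = Map.of(
            "A->A", 0.9, "A->B", 0.1,
            "B->A", 0.4, "B->B", 0.6);
        // P(A, A, B) = P(A) * P(A|A) * P(B|A) = 0.5 * 0.9 * 0.1 = 0.045
        System.out.println(jointProbability(new String[] {"A", "A", "B"}, trans, init));
    }
}
```

Note that only one-step transition probabilities are ever consulted; that is exactly what the Markov property buys us.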

Markov random processes can be summarized as follows:1
• A random sequence has the Markov property if its distribution is determined
solely by its current state. Any random process having this property is called a
Markov random process.
• For observable state sequences (i.e., when the state is known from data), this
leads to a Markov chain model, which we will use to predict the next effective
email marketing date.

1 Based on slides by Mehmet Yunus Dönmez: http://bit.ly/hidden_markov_mod.


• For nonobservable states, this leads to a hidden Markov model (HMM).
Now, let’s formalize the Markov chain that we will use in this chapter. Our Markov
chain has three components:
State space
A finite set of states S = {S1, S2, S3, ...}
Transition probabilities
A function f: S × S → R such that:
• 0 ≤ f(a, b) ≤ 1 for all a, b ∊ S
• Σb∈S f(a, b) = 1 for every a ∊ S

Initial distribution
A function g: S → R such that:
• 0 ≤ g(a) ≤ 1 for every a ∊ S
• Σa∈S g(a) = 1

Then a Markov chain is a random process in S such that:
• At time 0 the state of the chain is distributed according to function g.
• If at time t the state of the chain is a, then at time t + 1 it will be at state b with a
probability of f(a, b) for every b ∊ S.
Let’s consider an example: let a weather pattern for a city consist of four states—
sunny, cloudy, rainy, foggy—and further assume that the state does not change for a
day. The sum of each row shown in Table 11-1 is 1.00.
Table 11-1. City weather pattern (one state per day)

Today's weather   Tomorrow's weather
                  sunny   rainy   cloudy   foggy
sunny             0.6     0.1     0.2      0.1
rainy             0.5     0.2     0.2      0.1
cloudy            0.1     0.7     0.1      0.1
foggy             0.0     0.3     0.4      0.3
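Table 11-1 can be checked programmatically: every row of a valid transition matrix must sum to 1. A small sketch (the state order follows the table):

```java
public class WeatherChain {
    // States in the order used by Table 11-1.
    static final String[] STATES = {"sunny", "rainy", "cloudy", "foggy"};

    // Transition matrix from Table 11-1: T[i][j] = P(tomorrow = j | today = i).
    static final double[][] T = {
        {0.6, 0.1, 0.2, 0.1},  // sunny
        {0.5, 0.2, 0.2, 0.1},  // rainy
        {0.1, 0.7, 0.1, 0.1},  // cloudy
        {0.0, 0.3, 0.4, 0.3},  // foggy
    };

    // Sum of one row; must be 1.0 for a valid stochastic matrix.
    static double rowSum(int i) {
        double s = 0.0;
        for (double p : T[i]) s += p;
        return s;
    }

    public static void main(String[] args) {
        for (int i = 0; i < STATES.length; i++) {
            System.out.println(STATES[i] + ": " + rowSum(i));
        }
    }
}
```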

Now, we can answer the following questions:
• Given that today is sunny, what is the probability that tomorrow will be cloudy
and the day after will be foggy? We compute this as follows:


P (S2 = cloudy, S3 = foggy|S1 = sunny)
= P (S3 = foggy|S2 = cloudy, S1 = sunny) ×
P (S2 = cloudy|S1 = sunny)
= P (S3 = foggy|S2 = cloudy) ×
P (S2 = cloudy|S1 = sunny)
= 0.1 × 0.2
= 0.02
• Given that today is foggy, what is the probability that it will be foggy two days
from now? (This means that the intervening day can be any of sunny, cloudy,
rainy, or foggy.)
P (S3 = foggy|S1 = foggy)
= P (S3 = foggy, S2 = sunny|S1 = foggy) +
  P (S3 = foggy, S2 = cloudy|S1 = foggy) +
  P (S3 = foggy, S2 = rainy|S1 = foggy) +
  P (S3 = foggy, S2 = foggy|S1 = foggy)
= P (S3 = foggy|S2 = sunny) × P (S2 = sunny|S1 = foggy) +
  P (S3 = foggy|S2 = cloudy) × P (S2 = cloudy|S1 = foggy) +
  P (S3 = foggy|S2 = rainy) × P (S2 = rainy|S1 = foggy) +
  P (S3 = foggy|S2 = foggy) × P (S2 = foggy|S1 = foggy)
= 0.1 × 0.0 + 0.1 × 0.4 + 0.1 × 0.3 + 0.3 × 0.3
= 0.00 + 0.04 + 0.03 + 0.09
= 0.16
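Both questions reduce to lookups in the transition matrix: the first multiplies two one-step probabilities along a path, and the second marginalizes over the middle day. A sketch reproducing both answers:

```java
public class WeatherQueries {
    static final int SUNNY = 0, RAINY = 1, CLOUDY = 2, FOGGY = 3;

    // Transition matrix from Table 11-1: T[i][j] = P(tomorrow = j | today = i).
    static final double[][] T = {
        {0.6, 0.1, 0.2, 0.1},  // sunny
        {0.5, 0.2, 0.2, 0.1},  // rainy
        {0.1, 0.7, 0.1, 0.1},  // cloudy
        {0.0, 0.3, 0.4, 0.3},  // foggy
    };

    // P(S2 = b, S3 = c | S1 = a) = P(c | b) * P(b | a)
    static double twoStepPath(int a, int b, int c) {
        return T[b][c] * T[a][b];
    }

    // P(S3 = c | S1 = a): sum over all possible middle-day states b.
    static double twoStep(int a, int c) {
        double p = 0.0;
        for (int b = 0; b < T.length; b++) p += T[b][c] * T[a][b];
        return p;
    }

    public static void main(String[] args) {
        System.out.println(twoStepPath(SUNNY, CLOUDY, FOGGY)); // 0.2 * 0.1 = 0.02
        System.out.println(twoStep(FOGGY, FOGGY)); // 0.00 + 0.04 + 0.03 + 0.09 = 0.16
    }
}
```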
One of the main goals of this chapter is to build the model (that is, the probability
transition table) that will define f(a, b) for all a ∊ S. Once we have created this model,
the rest is easy.


Markov Model Using MapReduce
Assume we have historical customer transaction data that includes transaction-id,
customer-id, purchase-date, and amount. Therefore, each input record will have the
following format:

<transaction-id><,><customer-id><,><purchase-date><,><amount>

The entire solution involves two MapReduce jobs and a set of Ruby scripts (the Ruby
scripts were developed by Pranab Ghosh; I will provide pointers for them as we use
them).
The entire workflow is depicted in Figure 11-1, and our solution (the steps shown in
Figure 11-1) is outlined here:
1. We use a script to generate fake customer data (1).
2. The MapReduce projection job (2) accepts the customer data (1) as input and
generates sorted data (3): each customer's purchase dates in ascending order.
3. The state converter script (4) accepts the sorted data (3) and generates the state
sequence (5).
4. The MapReduce Markov state transition job (6) accepts the state sequence (5)
as input and generates the Markov chain model (7). This model enables us to
predict the next state.
5. Finally, the next-state prediction script (9) accepts new customer data (8) and
the Markov chain model (7) and predicts the best date for our next marketing
email.

Figure 11-1. Markov workflow


Generating Time-Ordered Transactions with MapReduce
The goal of this MapReduce phase is to accept historical customer transaction data
and generate the following output for every customer-id:
customerID (Date1, Amount1);(Date2, Amount2);...(DateN, AmountN)

such that:
Date1 ≤ Date2 ≤ ... ≤ DateN

The MapReduce output is sorted by purchase date in ascending order. Generating
sorted output can be accomplished in two ways: each reducer can sort its output by
the purchase date in ascending order (here you need enough RAM to hold all your
data for sorting), or you can use MapReduce’s secondary sorting technique to sort
data by date (with this option, you do not need much RAM at all). After the output is
generated, we will convert (Date, Amount) into a two-letter symbol (this step is done
by a script) that stands for a Markov chain state. I will present solutions for both
cases. The final output generated from this phase will have the following format:
customerID, State1, State2, ..., StateN

Example 11-1 defines the map() function for our time-ordered transactions.
Example 11-1. Time-ordered transactions: map() function
/**
 * @param key is ignored
 * @param value is transaction-id, customer-id, purchase-date, amount
 */
map(key, value) {
   Pair<Date, Integer> pair = (value.purchase-date, value.amount);
   emit(value.customer-id, pair);
}

Example 11-2 defines the reduce() function for our time-ordered transactions.
Example 11-2. Time-ordered transactions: reduce() function
/**
 * @param key is a customer-id
 * @param values is a list of Pair(Date, Integer) objects
 */
reduce(String key, List<Pair<Date, Integer>> values) {
   sortedValues = sortbyDateInAscendingOrder(values);
   emit(key, sortedValues);
}
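Outside Hadoop, the projection performed by these two functions can be sketched as plain Java: group transactions by customer, then sort each customer's list by date. The field and method names here are hypothetical, not from the chapter's classes:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class InMemoryProjection {
    // One input record: transaction-id, customer-id, purchase-date, amount.
    record Txn(String txnId, String customerId, String date, int amount) {}

    // customer-id -> list of "date,amount" strings, sorted by date ascending.
    static Map<String, List<String>> project(List<Txn> txns) {
        Map<String, List<Txn>> byCustomer = new TreeMap<>();
        for (Txn t : txns) {
            byCustomer.computeIfAbsent(t.customerId(), k -> new ArrayList<>()).add(t);
        }
        Map<String, List<String>> out = new TreeMap<>();
        for (Map.Entry<String, List<Txn>> e : byCustomer.entrySet()) {
            List<Txn> list = e.getValue();
            // ISO dates (yyyy-MM-dd) sort correctly as strings.
            list.sort(Comparator.comparing(Txn::date));
            List<String> pairs = new ArrayList<>();
            for (Txn t : list) pairs.add(t.date() + "," + t.amount());
            out.put(e.getKey(), pairs);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Txn> txns = List.of(
            new Txn("t1", "C1", "2013-04-02", 109),
            new Txn("t2", "C1", "2013-01-06", 190),
            new Txn("t3", "C2", "2013-01-22", 51));
        System.out.println(project(txns));
    }
}
```

The in-memory sort at the end mirrors the reducer in Example 11-2, and it has the same limitation: every customer's transactions must fit in RAM at once.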

In this solution, mappers emit key-value pairs where the key is a customer-id and
the value is a pair of (purchase-date, amount). Data arrives at the reducers unsorted,
and the reducer itself sorts the transactions. If too many transactions arrive at a
reducer, this sort might cause an out-of-memory error. Our second Hadoop solution
does not have this restriction: rather than sorting data in memory in each reducer,
we will use secondary sorting, which (as you might recall from earlier in this book)
is a feature of the MapReduce paradigm for sorting reducer values.
Table 11-2 lists the implementation classes we’ll need for Hadoop solution 1.
Table 11-2. Hadoop solution 1: implementation classes

Class name                      Description
SortInMemoryProjectionDriver    Driver class to submit jobs
SortInMemoryProjectionMapper    Mapper class
SortInMemoryProjectionReducer   Reducer class
DateUtil                        Basic date utility class
Partial input
Pranab Ghosh has provided a Ruby script (buy_xaction.rb) that generates fake
customer transaction data:
$ # create test data by using a ruby script:
$ ./buy_xaction.rb 80000 210 .05 > training.txt
$ # copy test data to Hadoop/HDFS
$ hadoop fs -put training.txt /markov/projection_by_sorting_in_ram/input/
$ # inspect data in HDFS
$ hadoop fs -cat /markov/projection_by_sorting_in_ram/input/training.txt
...
EY2I3D12PZ,1382705171,2013-07-29,28
VC38QFM2IF,1382705172,2013-07-29,84
1022R2QPWG,1382705173,2013-07-29,27
4G02MW73CK,1382705174,2013-07-29,31
VKV2K1S0D2,1382705175,2013-07-29,28
LDFK8WZQFH,1382705176,2013-07-29,25


8874144Q11,1382705177,2013-07-29,180
...

Log of sample run
The log output is shown here; it has been edited and formatted to fit the page:
# ./run.sh
...
Deleted hdfs://localhost:9000/lib/projection_by_sorting_in_ram.jar
Deleted hdfs://localhost:9000/markov/projection_by_sorting_in_ram/output
...
13/11/27 12:03:16 INFO mapred.JobClient: Running job: job_201311271011_0012
13/11/27 12:03:17 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/27 12:04:16 INFO mapred.JobClient: map 100% reduce 100%
13/11/27 12:04:17 INFO mapred.JobClient: Job complete: job_201311271011_0012
...
13/11/27 12:04:17 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 12:04:17 INFO mapred.JobClient: Map input records=832280
13/11/27 12:04:17 INFO mapred.JobClient: Reduce input records=832280
13/11/27 12:04:17 INFO mapred.JobClient: Reduce input groups=79998
13/11/27 12:04:17 INFO mapred.JobClient: Reduce output records=79998
13/11/27 12:04:17 INFO mapred.JobClient: Map output records=832280
13/11/27 12:04:17 INFO SortInMemoryProjectionDriver: jobStatus: true
13/11/27 12:04:17 INFO SortInMemoryProjectionDriver: elapsedTime (in milliseconds): 62063

Partial output
...
ZTOBR28AH2 2013-01-06,190;2013-04-02,109;2013-04-09,26;...
ZV2A56WNI6 2013-01-22,51;2013-01-24,34;2013-02-09,52;...
ZXN7727FBA 2013-02-07,164;2013-02-23,30;2013-03-28,107;...
ZY44ATNBK7 2013-03-27,191;2013-04-27,51;2013-05-06,31;...
...

The next step is to build our model’s transition probabilities—that is, to define:
0.0 ≤ P (state1, state2) ≤ 1.0
where state1 and state2 ∊ {SL, SE, SG, ML, ME, MG, LL, LE, LG}.
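Building f(a, b) amounts to counting consecutive state pairs and normalizing each row. A sketch of that counting step (the input sequences are made up; the state symbols follow Table 11-4):

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TransitionModel {
    // Build f(a, b) by counting consecutive state pairs, then dividing each
    // count by its row total so every row sums to 1.
    static Map<String, Map<String, Double>> build(List<List<String>> sequences) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (List<String> seq : sequences) {
            for (int i = 1; i < seq.size(); i++) {
                counts.computeIfAbsent(seq.get(i - 1), k -> new TreeMap<>())
                      .merge(seq.get(i), 1, Integer::sum);
            }
        }
        Map<String, Map<String, Double>> model = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
            int total = row.getValue().values().stream().mapToInt(Integer::intValue).sum();
            Map<String, Double> probs = new TreeMap<>();
            for (Map.Entry<String, Integer> e : row.getValue().entrySet()) {
                probs.put(e.getKey(), e.getValue() / (double) total);
            }
            model.put(row.getKey(), probs);
        }
        return model;
    }

    public static void main(String[] args) {
        List<List<String>> seqs = List.of(
            List.of("SL", "SE", "SL", "SG"),
            List.of("SL", "SE", "ME"));
        // From "SL": SE occurs twice, SG once -> f(SL,SE)=2/3, f(SL,SG)=1/3
        System.out.println(build(seqs));
    }
}
```

The MapReduce job in step 4 of the workflow performs this same pair counting, just distributed across mappers and reducers.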

This implementation sorts reducer values using the secondary sort technique (an
alternative to Hadoop solution 1: by using the Secondary Sort design pattern, we no
longer have to buffer all reducer values in memory for sorting). To accomplish this,
we need some custom classes, which will be plugged into the MapReduce framework.
Mappers emit key-value pairs where the key is a pair of (customer-id, purchase-date)
and the value is a pair of (purchase-date, amount). Data arrives at the reducers
sorted. As you can see, to generate sorted values for each reducer key, we include the
purchase-date (i.e., the first part of the emitted mapper value) as part of the mapper
key. So, again, the CompositeKey comprises a pair of (customer-id, purchase-date).
The value (purchase-date, amount) is represented by the class
edu.umd.cloud9.io.pair.PairOfLongInt, where the Long part holds the purchase
date and the Int part holds the purchase amount.
The MapReduce framework guarantees that once data reaches a reducer, it is
grouped by key. As just noted, we have a CompositeKey, so we need to make sure that
records are grouped solely by the natural key (i.e., the customer-id). We accomplish
this by writing a custom partitioner class, NaturalKeyPartitioner. We also need to
provide additional plug-in classes:
Configuration conf = new Configuration();
JobConf jobconf = new JobConf(conf, SecondarySortProjectionDriver.class);
...
jobconf.setPartitionerClass(NaturalKeyPartitioner.class);
jobconf.setOutputKeyComparatorClass(CompositeKeyComparator.class);
jobconf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
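The essential behavior of NaturalKeyPartitioner can be sketched without the Hadoop API: partition on the hash of the natural key only, ignoring the date half of the composite key. This is a standalone illustration, not the chapter's actual class:

```java
public class NaturalKeyPartitionerSketch {
    // Partition on the natural key (customer-id) only, ignoring the
    // purchase-date half of the composite key. This guarantees that all of
    // one customer's records reach the same reducer, whatever their dates.
    static int getPartition(String customerId, int numReducers) {
        // Mask the sign bit so the modulo result is never negative.
        return (customerId.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Two composite keys that share a customer-id always land on the
        // same partition, because the date never enters the computation.
        int p1 = getPartition("V31E55G4FI", 10);
        int p2 = getPartition("V31E55G4FI", 10);
        System.out.println(p1 == p2);
    }
}
```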

Now let’s refresh your memory of the Secondary Sort pattern covered in Chapters 1
and 2. Mappers generate key-value pairs. The order of the values arriving at a reducer
is unspecified and can vary between job runs. For example, let all mappers generate
(K, V1), (K, V2), and (K, V3). So, for the key K, we have three values: V1, V2, V3. A
reducer that is processing key K, then, might get one (out of six) of the following
orders:
V1, V2, V3
V1, V3, V2
V2, V1, V3
V2, V3, V1
V3, V1, V2
V3, V2, V1

In most situations (depending on your requirements and how you will process
reducer values), it really does not matter in what order these values arrive. But if you
want the values you're receiving and processing to be sorted in some order (such as
ascending or descending), your first option is to collect all values V1, V2, V3 and
then apply a sort function to them. As we've discussed, this technique might not be
feasible if your servers do not have enough RAM to hold all the values. But there is
another, preferable method that scales out very well and eliminates any worry about
big RAM requirements: you can use the sort and shuffle feature of the MapReduce
framework. As you've learned, this technique is called a secondary sort, and it allows
the MapReduce programmer to control the order in which the values appear within a
reduce() function call. To achieve this, you need to use a composite key that contains
the information needed to sort both by key and by value, and then you decouple the
grouping and the sorting of the intermediate data. Secondary sorting, then, enables
us to define the sorting order of the values generated by mappers; sorting is done on
both the keys and the values. Grouping, in turn, enables us to decide which key-value
pairs are put together into a single reduce() function call. The composite key for our
example is illustrated in Figure 11-2.

Figure 11-2. Composite key for secondary sorting
In Hadoop, the grouping is controlled in two places: the partitioner, which sends
mapper output to reducer tasks, and the grouping comparator, which groups data
within a reducer task. Both the partitioner and the grouping comparator are
pluggable classes for each MapReduce job. To sort the values for reducers, we also
define a pluggable class that sets the output key comparator. Therefore, in raw
MapReduce, implementing secondary sorting takes a little work: we must define a
composite key and specify three functions (by defining pluggable classes for the
MapReduce job; details are given in Table 11-3) that each use it in a different way.
Note that for our example the natural key is a customer-id (as a string object), so
there is no need to wrap it in a separate natural key class. The natural key is what you
would normally use as the key or "group by" operator.
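The sorting and grouping comparators can likewise be sketched in plain Java: sort on the full composite key, but group on the natural key only. The names below mirror the chapter's classes in spirit; the code is a standalone sketch, not the Hadoop implementation:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SecondarySortSketch {
    record CompositeKey(String customerId, String purchaseDate) {}

    // CompositeKeyComparator: sort by customer-id, then by purchase-date,
    // so each customer's dates arrive at the reducer in ascending order.
    static final Comparator<CompositeKey> SORT =
        Comparator.comparing(CompositeKey::customerId)
                  .thenComparing(CompositeKey::purchaseDate);

    // NaturalKeyGroupingComparator: two keys belong to the same reduce()
    // call iff their customer-ids match; the date is ignored for grouping.
    static boolean sameGroup(CompositeKey a, CompositeKey b) {
        return a.customerId().equals(b.customerId());
    }

    public static void main(String[] args) {
        List<CompositeKey> keys = new ArrayList<>(List.of(
            new CompositeKey("C1", "2013-04-02"),
            new CompositeKey("C2", "2013-01-22"),
            new CompositeKey("C1", "2013-01-06")));
        keys.sort(SORT);
        System.out.println(keys.get(0).purchaseDate()); // C1's earliest date comes first
        System.out.println(sameGroup(keys.get(0), keys.get(1)));
    }
}
```

Decoupling these two comparisons is the whole trick: the full key drives the sort order, while only its natural half decides reduce-call boundaries.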
Table 11-3. Hadoop solution 2: implementation classes

Class name                       Description
SecondarySortProjectionDriver    Driver class to submit jobs
SecondarySortProjectionMapper    Mapper class
SecondarySortProjectionReducer   Reducer class
CompositeKey                     Custom key to hold a pair of (customer-id, purchase-date),
                                 which is a combination of the natural key and the natural
                                 value we want to sort by
CompositeKeyComparator           How to sort CompositeKey objects; compares two composite
                                 keys for sorting
NaturalKeyGroupingComparator     Considers the natural key; makes sure that a single reducer
                                 sees a custom view of the groups (how to group by customer-id)
NaturalKeyPartitioner            How to partition by the natural key (customer-id) to
                                 reducers; puts all data into a logical group in which we want
                                 the secondary sort to occur on the natural value
DateUtil                         Basic date utility class

Partial input
V31E55G4FI,1381872898,2013-01-01,123
301UNH7I2F,1381872899,2013-01-01,148
PP2KVIR4LD,1381872900,2013-01-01,163
AC57MM3WNV,1381872901,2013-01-01,188
BN020INHUM,1381872902,2013-01-01,116
UP8R2SOR77,1381872903,2013-01-01,183
VD91210MGH,1381872904,2013-01-01,204
COI4OXHET1,1381872905,2013-01-01,78
76S34ZE89C,1381872906,2013-01-01,105
6K3SNF2EG1,1381872907,2013-01-01,214

Log of sample run
The log output is shown here; it has been edited and formatted to fit the page:
# ./run.sh
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
...
Deleted hdfs://localhost:9000/lib/projection_by_secondary_sort.jar
Deleted hdfs://localhost:9000/markov/projection_by_secondary_sort/output
...
13/11/27 15:14:34 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/27 15:14:34 INFO mapred.JobClient: Running job: job_201311271459_0003
13/11/27 15:14:35 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/27 15:16:02 INFO mapred.JobClient: map 100% reduce 100%
13/11/27 15:16:03 INFO mapred.JobClient: Job complete: job_201311271459_0003
...
13/11/27 15:16:03 INFO mapred.JobClient: Map-Reduce Framework
13/11/27 15:16:03 INFO mapred.JobClient: Map input records=832280
13/11/27 15:16:03 INFO mapred.JobClient: Combine input records=0
13/11/27 15:16:03 INFO mapred.JobClient: Reduce input records=832280
13/11/27 15:16:03 INFO mapred.JobClient: Reduce input groups=79998
13/11/27 15:16:03 INFO mapred.JobClient: Combine output records=0
13/11/27 15:16:03 INFO mapred.JobClient: Reduce output records=79998
13/11/27 15:16:03 INFO mapred.JobClient: Map output records=832280
13/11/27 15:16:03 INFO SecondarySortProjectionDriver: elapsedTime (in milliseconds): 89809

Partial output
...
ZSY40NVPS6 2013-01-01,110;2013-01-11,32;2013-03-04,111;2013-04-09,65;...
ZTLNF0O4LN 2013-01-16,55;2013-03-21,164;2013-05-14,66;2013-06-29,81;...
ZV20AIXG8L 2013-01-13,210;2013-02-03,32;2013-02-10,48;2013-02-23,27;...
...

Generating State Sequences
The goal of this section is to convert a transaction sequence into a state sequence. Both
of our Hadoop solutions generated the following output (i.e., the transaction
sequence):
customer-id (Date1, Amount1);(Date2, Amount2);...(DateN, AmountN)

such that:
Date1 ≤ Date2 ≤ ... ≤ DateN

We need to convert this output (“transaction sequence”) into a “state sequence” as:
customer-id, State1, State2, ..., StateN

The next task is to convert (using another Ruby script developed by Pranab Ghosh)
the sorted sequence of (purchase-date, amount) pairs into a set of Markov chain
states. The state is indicated by a two-letter symbol; each letter is defined in
Table 11-4.
Table 11-4. Letters indicating a Markov chain state

Time elapsed since last transaction   Amount spent compared to previous transaction
S: Small                              L: Significantly less than
M: Medium                             E: More or less the same
L: Large                              G: Significantly greater than
Therefore, we will have nine (3 × 3) states, as shown in Table 11-5.
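The state converter's core decision can be sketched as follows. The 30/60-day and 0.9/1.1 amount-ratio cutoffs below are invented for illustration; the real thresholds live in Pranab Ghosh's script:

```java
public class StateEncoder {
    // First letter: elapsed days since the previous transaction (S/M/L).
    // Second letter: this amount relative to the previous amount (L/E/G).
    // The 30/60-day and 0.9/1.1 ratio cutoffs are assumed values.
    static String encode(long daysSincePrev, double amount, double prevAmount) {
        char time = daysSincePrev < 30 ? 'S' : (daysSincePrev < 60 ? 'M' : 'L');
        double ratio = amount / prevAmount;
        char spend = ratio < 0.9 ? 'L' : (ratio > 1.1 ? 'G' : 'E');
        return "" + time + spend;
    }

    public static void main(String[] args) {
        System.out.println(encode(10, 50.0, 200.0));  // small gap, spent much less -> "SL"
        System.out.println(encode(75, 210.0, 100.0)); // large gap, spent much more -> "LG"
    }
}
```

Because each state depends on the previous transaction, a sequence of N transactions yields N − 1 states per customer.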