Chapter 11. Smarter Email Marketing with the Markov Model
• Phase 1: build a model by using historical training data.
• Phase 2: make a prediction for the next new data using the model built in
phase 1.
Markov Chains in a Nutshell
Let S = {S1, S2, S3, ...} be a finite set of states. We want to collect the following
probabilities:
P (Sn | Sn–1, Sn–2, ..., S1)
Markov’s first-order assumption is the following:
P (Sn | Sn–1, Sn–2, ..., S1) ≈ P (Sn | Sn–1)
This approximation states the Markov property: the state of the system at time t + 1
depends only on the state of the system at time t.
The Markov second-order assumption is the following:
P (Sn | Sn–1, Sn–2, ..., S1) ≈ P (Sn | Sn–1, Sn–2)
Now, we can express the joint probability using the first-order Markov assumption:
P (S1, S2, ..., Sn) = ∏ i=1..n P (Si | Si–1)
where the first factor, P (S1 | S0), is interpreted as the initial probability P (S1).
Markov random processes can be summarized as follows:1
• A random sequence has the Markov property if its distribution is determined
solely by its current state. Any random process having this property is called a
Markov random process.
• For observable state sequences (i.e., when the state is known from data), this
leads to a Markov chain model, which we will use to predict the next effective
email marketing date.
1 Based on slides by Mehmet Yunus Dönmez: http://bit.ly/hidden_markov_mod.
• For nonobservable states, this leads to a hidden Markov model (HMM).
Now, let’s formalize the Markov chain that we will use in this chapter. Our Markov
chain has three components:
State space
A finite set of states S = {S1, S2, S3, ...}
Transition probabilities
A function f: S × S → R such that:
• 0 ≤ f(a, b) ≤ 1 for all a, b ∊ S
• ∑ b∈S f(a, b) = 1 for every a ∊ S
Initial distribution
A function g: S → R such that:
• 0 ≤ g(a) ≤ 1 for every a ∊ S
• ∑ a∈S g(a) = 1
Then a Markov chain is a random process in S such that:
• At time 0 the state of the chain is distributed according to function g.
• If at time t the state of the chain is a, then at time t + 1 it will be at state b with a
probability of f(a, b) for every b ∊ S.
Let’s consider an example: let a weather pattern for a city consist of four states—
sunny, cloudy, rainy, foggy—and further assume that the state does not change for a
day. The sum of each row shown in Table 11-1 is 1.00.
Table 11-1. City weather pattern (one state per day)

Today’s weather   Tomorrow’s weather
                  sunny   rainy   cloudy   foggy
sunny             0.6     0.1     0.2      0.1
rainy             0.5     0.2     0.2      0.1
cloudy            0.1     0.7     0.1      0.1
foggy             0.0     0.3     0.4      0.3
Now, we can answer the following questions:
• Given that today is sunny, what is the probability that tomorrow will be cloudy
and the day after will be foggy? We compute this as follows:
P (S2 = cloudy, S3 = foggy | S1 = sunny)
= P (S3 = foggy | S2 = cloudy, S1 = sunny) × P (S2 = cloudy | S1 = sunny)
= P (S3 = foggy | S2 = cloudy) × P (S2 = cloudy | S1 = sunny)
= 0.1 × 0.2
= 0.02
• Given that today is foggy, what is the probability that it will be foggy two days
from now? (The intervening day can be any of sunny, cloudy, rainy, or foggy.)
P (S3 = foggy | S1 = foggy)
= P (S3 = foggy, S2 = sunny | S1 = foggy) +
  P (S3 = foggy, S2 = cloudy | S1 = foggy) +
  P (S3 = foggy, S2 = rainy | S1 = foggy) +
  P (S3 = foggy, S2 = foggy | S1 = foggy)
= P (S3 = foggy | S2 = sunny) × P (S2 = sunny | S1 = foggy) +
  P (S3 = foggy | S2 = cloudy) × P (S2 = cloudy | S1 = foggy) +
  P (S3 = foggy | S2 = rainy) × P (S2 = rainy | S1 = foggy) +
  P (S3 = foggy | S2 = foggy) × P (S2 = foggy | S1 = foggy)
= 0.1 × 0.0 +
  0.1 × 0.4 +
  0.1 × 0.3 +
  0.3 × 0.3
= 0.00 + 0.04 + 0.03 + 0.09
= 0.16
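To make the arithmetic concrete, the two computations above can be checked with a short program that encodes the transition matrix of Table 11-1. This is an illustrative sketch; the class and method names are ours, not part of this chapter's code:

```java
// Encodes the weather transition matrix and verifies the two worked
// probabilities. Row/column order is sunny, rainy, cloudy, foggy.
public class WeatherChain {
    static final double[][] F = {
        // sunny rainy cloudy foggy
        {0.6, 0.1, 0.2, 0.1},   // from sunny
        {0.5, 0.2, 0.2, 0.1},   // from rainy
        {0.1, 0.7, 0.1, 0.1},   // from cloudy
        {0.0, 0.3, 0.4, 0.3}    // from foggy
    };
    static final int SUNNY = 0, RAINY = 1, CLOUDY = 2, FOGGY = 3;

    // P(S3 = to | S1 = from): sum over every possible middle state
    static double twoStep(int from, int to) {
        double p = 0.0;
        for (int mid = 0; mid < 4; mid++) {
            p += F[from][mid] * F[mid][to];
        }
        return p;
    }

    public static void main(String[] args) {
        // P(S2 = cloudy, S3 = foggy | S1 = sunny) = 0.2 * 0.1
        double q1 = F[SUNNY][CLOUDY] * F[CLOUDY][FOGGY];
        // P(S3 = foggy | S1 = foggy), summed over the middle day
        double q2 = twoStep(FOGGY, FOGGY);
        System.out.println(q1 + " " + q2);
    }
}
```

The same `twoStep` loop is just one entry of the matrix product F × F, which is how k-step transition probabilities generalize.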
One of the main goals of this chapter is to build the model (that is, the probability
transition table) that will define f(a, b) for all a ∊ S. Once we have created this model,
the rest is easy.
Markov Model Using MapReduce
Assume we have historical customer transaction data that includes transaction-id,
customer-id, purchase-date, and amount. Therefore, each input record will have the
following format:
<transaction-id><,><customer-id><,><purchase-date><,><amount>
The entire solution involves two MapReduce jobs and a set of Ruby scripts (the Ruby
scripts were developed by Pranab Ghosh; I will provide pointers for them as we use
them).
The entire workflow is depicted in Figure 11-1, and our solution (the steps shown in
Figure 11-1) is outlined here:
1. We use a script to generate fake customer data (1).
2. The MapReduce projection (2) accepts customer data (1) as an input and generates
sorted data (3). The sorted data (3) consists of purchase dates in ascending
order.
3. The state converter script (4) accepts the sorted data (3) and generates the state
sequence (5).
4. The MapReduce Markov state transition model (6) accepts the state sequence (5)
as an input and generates a Markov chain model (7). This model enables us to
predict the next state.
5. Finally, the next state prediction script (9), which accepts new customer data (8)
and the Markov chain model (7), predicts the best date for our next marketing
email.
Figure 11-1. Markov workflow
Generating Time-Ordered Transactions with MapReduce
The goal of this MapReduce phase is to accept historical customer transaction data
and generate the following output for every customerID:
customerID (Date1, Amount1);(Date2, Amount2);...(DateN, AmountN)
such that:
Date1 ≤ Date2 ≤ ... ≤ DateN
The MapReduce output is sorted by purchase date in ascending order. Generating
sorted output can be accomplished in two ways: each reducer can sort its output by
the purchase date in ascending order (here you need enough RAM to hold all your
data for sorting), or you can use MapReduce’s secondary sorting technique to sort
data by date (with this option, you do not need much RAM at all). After the output is
generated, we will convert (Date, Amount) into a two-letter symbol (this step is done
by a script) that stands for a Markov chain state. I will present solutions for both
cases. The final output generated from this phase will have the following format:
customerID, State1, State2, ..., StateN
Example 11-1 defines the map() function for our time-ordered transactions.
Example 11-1. Time-ordered transactions: map() function

/**
 * @param key is ignored
 * @param value is transaction-id, customer-id, purchase-date, amount
 */
map(key, value) {
   Pair<Date, Integer> pair = (value.purchase-date, value.amount);
   emit(value.customer-id, pair);
}
Example 11-2 defines the reduce() function for our time-ordered transactions.

Example 11-2. Time-ordered transactions: reduce() function

/**
 * @param key is a customer-id
 * @param values is a list of Pair(Date, Integer)
 */
reduce(String key, List<Pair<Date, Integer>> values) {
   sortedValues = sortByDateInAscendingOrder(values);
   emit(key, sortedValues);
}
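The map() and reduce() functions above can be simulated in a single plain-Java process to see exactly what this phase produces. This is an illustrative sketch (class and method names are ours, not the book's implementation classes), grouping by customer and then sorting each group's pairs by date:

```java
import java.util.*;

// Single-process simulation of the projection phase: group
// (purchase-date, amount) pairs by customer-id, then sort each
// customer's transactions by date in ascending order.
public class ProjectionSimulation {

    // Each input record: {transaction-id, customer-id, purchase-date, amount}
    public static Map<String, List<String[]>> project(List<String[]> records) {
        Map<String, List<String[]>> byCustomer = new TreeMap<>();
        // "map" step: emit (customer-id, (purchase-date, amount))
        for (String[] r : records) {
            byCustomer.computeIfAbsent(r[1], k -> new ArrayList<>())
                      .add(new String[]{r[2], r[3]});
        }
        // "reduce" step: sort each customer's pairs by date, ascending;
        // ISO-style dates (yyyy-MM-dd) sort correctly as plain strings
        for (List<String[]> pairs : byCustomer.values()) {
            pairs.sort((a, b) -> a[0].compareTo(b[0]));
        }
        return byCustomer;
    }
}
```

The in-memory `pairs.sort(...)` call is precisely the step that solution 1 performs inside each reducer, and the step that solution 2 replaces with secondary sorting.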
Hadoop Solution 1: Time-Ordered Transactions
In this solution, mappers emit key-value pairs, where the key is a customer-id and
the value is a pair of (purchase-date, amount). Data arrives at the reducers unsorted.
This solution performs transaction sorting in the reducer. If the number of
transactions arriving at the reducers is too big, this might cause an out-of-memory
exception in the reducers. Our second Hadoop solution will not have this restriction:
rather than sorting data in memory for each reducer, we will use secondary sorting,
which (as you might recall from earlier in this book) is a feature of the MapReduce
paradigm for sorting the values passed to reducers.
Table 11-2 lists the implementation classes we’ll need for Hadoop solution 1.

Table 11-2. Hadoop solution 1: implementation classes

Class name                      Description
SortInMemoryProjectionDriver    Driver class to submit jobs
SortInMemoryProjectionMapper    Mapper class
SortInMemoryProjectionReducer   Reducer class
DateUtil                        Basic date utility class
HadoopUtil                      Basic Hadoop utility class
Partial input
Pranab Ghosh has provided a Ruby script (buy_xaction.rb) that generates fake
customer transaction data:
$ # create test data by using a ruby script:
$ ./buy_xaction.rb 80000 210 .05 > training.txt
$ # copy test data to Hadoop/HDFS
$ hadoop fs -copyFromLocal training.txt \
    /markov/projection_by_sorting_in_ram/input/
$ # inspect data in HDFS
$ hadoop fs -cat \
    /markov/projection_by_sorting_in_ram/input/training.txt
...
EY2I3D12PZ,1382705171,2013-07-29,28
VC38QFM2IF,1382705172,2013-07-29,84
1022R2QPWG,1382705173,2013-07-29,27
4G02MW73CK,1382705174,2013-07-29,31
VKV2K1S0D2,1382705175,2013-07-29,28
LDFK8WZQFH,1382705176,2013-07-29,25
8874144Q11,1382705177,2013-07-29,180
...
Log of sample run
The log output is shown here; it has been edited and formatted to fit the page:
# ./run.sh
...
Deleted hdfs://localhost:9000/lib/projection_by_sorting_in_ram.jar
Deleted hdfs://localhost:9000/markov/projection_by_sorting_in_ram/output
...
13/11/27 12:03:16 INFO mapred.JobClient: Running job: job_201311271011_0012
13/11/27 12:03:17 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/27 12:04:16 INFO mapred.JobClient: map 100% reduce 100%
13/11/27 12:04:17 INFO mapred.JobClient: Job complete: job_201311271011_0012
...
13/11/27 12:04:17 INFO mapred.JobClient: MapReduce Framework
13/11/27 12:04:17 INFO mapred.JobClient: Map input records=832280
13/11/27 12:04:17 INFO mapred.JobClient: Reduce input records=832280
13/11/27 12:04:17 INFO mapred.JobClient: Reduce input groups=79998
13/11/27 12:04:17 INFO mapred.JobClient: Reduce output records=79998
13/11/27 12:04:17 INFO mapred.JobClient: Map output records=832280
13/11/27 12:04:17 INFO SortInMemoryProjectionDriver: jobStatus: true
13/11/27 12:04:17 INFO SortInMemoryProjectionDriver:
elapsedTime (in milliseconds): 62063
Partial output
...
ZTOBR28AH2  2013-01-06,190;2013-04-02,109;2013-04-09,26;...
ZV2A56WNI6  2013-01-22,51;2013-01-24,34;2013-02-09,52;...
ZXN7727FBA  2013-02-07,164;2013-02-23,30;2013-03-28,107;...
ZY44ATNBK7  2013-03-27,191;2013-04-27,51;2013-05-06,31;...
...
The next step is to build our model’s transition probabilities—that is, to define:
0.0 ≤ P (state1, state2) ≤ 1.0
where state1 and state2 ∊ {SL, SE, SG, ML, ME, MG, LL, LE, LG}.
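Conceptually, building those transition probabilities amounts to counting consecutive state pairs across all customers' state sequences and then normalizing each row so it sums to 1. Here is a small sketch of that idea in plain Java (class and method names are illustrative, and the MapReduce distribution of the counting is omitted):

```java
import java.util.*;

// Builds the transition table f(a, b) by counting consecutive state
// pairs in each state sequence and normalizing each row to sum to 1.
// Assumes every state in the input is one of the nine valid symbols.
public class TransitionModel {
    static final List<String> STATES = Arrays.asList(
        "SL", "SE", "SG", "ML", "ME", "MG", "LL", "LE", "LG");

    public static double[][] build(List<List<String>> sequences) {
        int n = STATES.size();
        double[][] counts = new double[n][n];
        for (List<String> seq : sequences) {
            for (int i = 0; i < seq.size() - 1; i++) {
                int a = STATES.indexOf(seq.get(i));
                int b = STATES.indexOf(seq.get(i + 1));
                counts[a][b] += 1.0;
            }
        }
        // normalize each row so that sum over b of f(a, b) = 1
        for (int a = 0; a < n; a++) {
            double rowSum = 0.0;
            for (int b = 0; b < n; b++) rowSum += counts[a][b];
            if (rowSum > 0) {
                for (int b = 0; b < n; b++) counts[a][b] /= rowSum;
            }
        }
        return counts;
    }
}
```

The model-building MapReduce job in this chapter distributes exactly this count-then-normalize computation across the cluster.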
Hadoop Solution 2: Time-Ordered Transactions
This implementation provides a solution that sorts reducer values using the
secondary sort technique (this approach is an alternative to Hadoop solution 1; by
using the Secondary Sort design pattern, we no longer must buffer all reducer values
in memory/RAM for sorting). To accomplish this, we need some custom classes, which
will be plugged into the MapReduce framework implementation. Mappers emit key-value
pairs where the key is a pair of (customer-id, purchase-date) and the value
is a pair of (purchase-date, amount). Data arrives at the reducers sorted. As you
can see, to generate sorted values for each reducer key, we include the purchase-date
(i.e., the first part of the emitted mapper value) as part of the mapper key. So,
again, the CompositeKey comprises a pair (customer-id, purchase-date).
The value (purchase-date, amount) is represented by the class
edu.umd.cloud9.io.pair.PairOfLongInt, where the Long part represents the purchase
date and the Int part represents the purchase amount.
The MapReduce framework guarantees that once the data values reach a reducer, all
data is grouped by key. As just noted, we have a CompositeKey, so we need to make
sure that records are grouped solely by the natural key (i.e., the customer-id). We
accomplish this by writing a custom partitioner class: NaturalKeyPartitioner. We
also need to provide more plug-in classes:
Configuration conf = new Configuration();
JobConf jobconf = new JobConf(conf, SecondarySortProjectionDriver.class);
...
jobconf.setPartitionerClass(NaturalKeyPartitioner.class);
jobconf.setOutputKeyComparatorClass(CompositeKeyComparator.class);
jobconf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);
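The essential trick inside NaturalKeyPartitioner is that only the natural key participates in partitioning, so every record for a given customer reaches the same reducer regardless of its purchase-date. The following is a plain-Java illustration of that logic only; it is not the actual Hadoop Partitioner subclass from the book's code:

```java
// Core idea behind NaturalKeyPartitioner: hash only the natural key
// (customer-id), ignoring the purchase-date half of the composite key,
// so all of a customer's records land on the same reducer.
public class NaturalKeyPartitioning {

    // Composite key = (customer-id, purchase-date); purchaseDate is
    // deliberately unused here, since partitioning must ignore it.
    public static int partition(String customerId, long purchaseDate, int numReducers) {
        // mask the sign bit: hashCode() may be negative
        return (customerId.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}
```

If the purchase-date were hashed too, the same customer's records could scatter across reducers and the per-customer date ordering would be lost.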
Now let’s refresh your memory of the Secondary Sort pattern covered in Chapters 1
and 2. Mappers generate key-value pairs. The order of the values arriving at a reducer
is unspecified and can vary between job runs. For example, let all mappers generate
(K, V1), (K, V2), and (K, V3). So, for the key K, we have three values: V1, V2, V3. A
reducer that is processing key K, then, might get one (out of six) of the following
orders:
V1, V2, V3
V1, V3, V2
V2, V1, V3
V2, V3, V1
V3, V1, V2
V3, V2, V1
In most situations (depending on your requirements and how you will process
reducer values), it really does not matter in what order these values arrive. But if you
want the values you’re receiving and processing to be sorted in some order (such as
ascending or descending), your first option is to get all values V1, V2, V3 and then
apply a sort function on them to get your desired order. As we’ve discussed, this sort
technique might not be feasible if you do not have enough RAM in your servers to
hold all the values. But there is another, preferable method that scales out very well
and eliminates any worry about big RAM requirements: you can use the sort and
shuffle feature of the MapReduce framework. As you’ve learned, this technique is
called a secondary sort, and it allows the MapReduce programmer to control the
order in which the values appear within a reduce() function call. To achieve this, you
need to use a composite key that contains the information needed to sort both by key
and by value, and then you decouple the grouping and the sorting of the intermediate
data. Secondary sorting, then, enables us to define the sorting order of the values
generated by mappers. Again, sorting is done on both the keys and the values. Further,
grouping enables us to decide which key-value pairs are put together into a single
reduce() function call. The composite key for our example is illustrated in
Figure 11-2.
Figure 11-2. Composite key for secondary sorting
In Hadoop, the grouping is controlled in two places: the partitioner, which sends
mapper output to reducer tasks, and the grouping comparator, which groups data
within a reducer task. Both the partitioner and group comparator functionality can
be accomplished through pluggable classes for each MapReduce job. To sort the
values for reducers, we have to define a pluggable class that sets the output key
comparator. Therefore, using raw MapReduce, implementing secondary sorting can take
a little bit of work: we must define a composite key and specify three functions (by
defining pluggable classes for the MapReduce job; details are given in Table 11-3)
that each use the composite key in a different way. Note that for our example the
natural key is a customer-id (as a string object), so there is no need to wrap it
in another class. The natural key is what you would normally use as the key or
“group by” operator.
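The heart of the composite-key approach is its comparison logic: sort first by the natural key (customer-id) and then by the natural value (purchase-date), which is what makes values arrive at the reducer date-sorted. Here is a plain-Java sketch of that ordering; it is illustrative only, not the Writable-based CompositeKey class from the book's code:

```java
// Comparison logic behind CompositeKey / CompositeKeyComparator:
// order by customer-id first (grouping), then by purchase-date
// (ordering within a customer's group).
public class CompositeKeySketch implements Comparable<CompositeKeySketch> {
    final String customerId;   // natural key: controls grouping
    final long purchaseDate;   // natural value: controls ordering in a group

    CompositeKeySketch(String customerId, long purchaseDate) {
        this.customerId = customerId;
        this.purchaseDate = purchaseDate;
    }

    @Override
    public int compareTo(CompositeKeySketch other) {
        int byCustomer = this.customerId.compareTo(other.customerId);
        if (byCustomer != 0) {
            return byCustomer;
        }
        return Long.compare(this.purchaseDate, other.purchaseDate);
    }
}
```

The grouping comparator then compares only `customerId`, so keys that differ just in `purchaseDate` still fall into the same reduce() call.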
Table 11-3. Hadoop solution 2: implementation classes

Class name                       Description
SecondarySortProjectionDriver    Driver class to submit jobs
SecondarySortProjectionMapper    Mapper class
SecondarySortProjectionReducer   Reducer class
CompositeKey                     Custom key to hold a pair of (customer-id, purchase-date),
                                 which is a combination of the natural key and the natural
                                 value we want to sort by
CompositeKeyComparator           Defines how to sort CompositeKey objects; compares two
                                 composite keys for sorting
NaturalKeyGroupingComparator     Considers only the natural key; makes sure that a single
                                 reducer sees a custom view of the groups (how to group by
                                 customer-id)
NaturalKeyPartitioner            Defines how to partition by the natural key (customer-id)
                                 to reducers; sends all data for a logical group to one
                                 reducer, where the secondary sort occurs on the natural value
DateUtil                         Basic date utility class
HadoopUtil                       Basic Hadoop utility class
Partial input
# hadoop fs -cat /markov/projection_by_secondary_sort/input/training.txt | head
V31E55G4FI,1381872898,2013-01-01,123
301UNH7I2F,1381872899,2013-01-01,148
PP2KVIR4LD,1381872900,2013-01-01,163
AC57MM3WNV,1381872901,2013-01-01,188
BN020INHUM,1381872902,2013-01-01,116
UP8R2SOR77,1381872903,2013-01-01,183
VD91210MGH,1381872904,2013-01-01,204
COI4OXHET1,1381872905,2013-01-01,78
76S34ZE89C,1381872906,2013-01-01,105
6K3SNF2EG1,1381872907,2013-01-01,214
Log of sample run
The log output is shown here; it has been edited and formatted to fit the page:
# ./run.sh
JAVA_HOME=/System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
...
Deleted hdfs://localhost:9000/lib/projection_by_secondary_sort.jar
Deleted hdfs://localhost:9000/markov/projection_by_secondary_sort/output
...
13/11/27 15:14:34 INFO mapred.FileInputFormat: Total input paths to process : 1
13/11/27 15:14:34 INFO mapred.JobClient: Running job: job_201311271459_0003
13/11/27 15:14:35 INFO mapred.JobClient: map 0% reduce 0%
...
13/11/27 15:16:02 INFO mapred.JobClient: map 100% reduce 100%
13/11/27 15:16:03 INFO mapred.JobClient: Job complete: job_201311271459_0003
...
13/11/27 15:16:03 INFO mapred.JobClient: MapReduce Framework
13/11/27 15:16:03 INFO mapred.JobClient:
Map input records=832280
13/11/27 15:16:03 INFO mapred.JobClient:
Combine input records=0
13/11/27 15:16:03 INFO mapred.JobClient:
Reduce input records=832280
13/11/27 15:16:03 INFO mapred.JobClient:
Reduce input groups=79998
13/11/27 15:16:03 INFO mapred.JobClient:
Combine output records=0
13/11/27 15:16:03 INFO mapred.JobClient:
Reduce output records=79998
13/11/27 15:16:03 INFO mapred.JobClient:
Map output records=832280
13/11/27 15:16:03 INFO SecondarySortProjectionDriver: elapsedTime
(in milliseconds): 89809
Partial output
...
ZSY40NVPS6 2013-01-01,110;2013-01-11,32;2013-03-04,111;2013-04-09,65;...
ZTLNF0O4LN 2013-01-16,55;2013-03-21,164;2013-05-14,66;2013-06-29,81;...
ZV20AIXG8L 2013-01-13,210;2013-02-03,32;2013-02-10,48;2013-02-23,27;...
...
Generating State Sequences
The goal of this section is to convert a transaction sequence into a state sequence. Both
of our Hadoop solutions generated the following output (i.e., the transaction
sequence):
customerid (Date1, Amount1);(Date2, Amount2);...(DateN, AmountN)
such that:
Date1 ≤ Date2 ≤ ... ≤ DateN
We need to convert this output (the “transaction sequence”) into a “state sequence”
of the form:
customer-id, State1, State2, ..., StateN
The next task is to convert (using another Ruby script developed by Pranab Ghosh)
the sorted sequence of (purchase-date, amount) pairs into a set of Markov chain
states. The state is indicated by a two-letter symbol; each letter is defined in
Table 11-4.
Table 11-4. Letters indicating a Markov chain state

Time elapsed since last transaction   Amount spent compared to previous transaction
S: Small                              L: Significantly less than
M: Medium                             E: More or less the same
L: Large                              G: Significantly greater than
Therefore, we will have nine (3 × 3) states, as shown in Table 11-5.
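The state encoding can be sketched as a pair of small classification functions: one for the elapsed time between purchases and one for the amount ratio. Note that the numeric boundaries below are illustrative assumptions of ours; the Ruby state-converter script defines its own thresholds:

```java
// Converts one (daysSincePrevious, amount, previousAmount) observation
// into a two-letter Markov state per Table 11-4. The day and ratio
// boundaries here are assumed for illustration, not taken from the
// book's converter script.
public class StateEncoder {

    static char elapsedSymbol(long days) {
        if (days < 30) return 'S';    // small gap (assumed threshold)
        if (days < 60) return 'M';    // medium gap (assumed threshold)
        return 'L';                   // large gap
    }

    static char amountSymbol(long amount, long previousAmount) {
        double ratio = (double) amount / previousAmount;
        if (ratio < 0.9) return 'L';  // significantly less (assumed)
        if (ratio > 1.1) return 'G';  // significantly greater (assumed)
        return 'E';                   // more or less the same
    }

    public static String state(long daysSincePrevious, long amount, long previousAmount) {
        return "" + elapsedSymbol(daysSincePrevious) + amountSymbol(amount, previousAmount);
    }
}
```

Applying this to each consecutive pair of a customer's date-sorted transactions yields exactly the customer-id, State1, ..., StateN lines that feed the model-building job.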