Chapter 2. Secondary Sort: A Detailed Example

preferable solution). Implementation of this solution is simple; it is presented in
Chapter 1 and will not be discussed in this chapter.
Solution #2
Use the Secondary Sort design pattern of the MapReduce framework, and
reducer values will arrive sorted (i.e., there is no need to sort values in memory).
This technique uses the shuffle and sort phases of the MapReduce framework
to sort the reducer values. This solution is preferable to solution #1 because you do
not depend on memory for sorting (again, if you have too many values, solution #1
might not be a viable option). The rest of this chapter focuses on presenting
solution #2.
We implement solution #2 in Hadoop by using:
• The old Hadoop API (using org.apache.hadoop.mapred.JobConf and
org.apache.hadoop.mapred.*); I have intentionally included this API in case
you have not migrated to the new Hadoop API.
• The new Hadoop API (using org.apache.hadoop.mapreduce.Job and
org.apache.hadoop.mapreduce.lib.*).

Secondary Sorting Technique
Say we have the following values for key = K:
(K, V1), (K, V2), ..., (K, Vn)

and further assume that each Vi is a tuple of m attributes as follows:
(ai1, ai2, ..., aim)

where we want to sort the reducer’s tuple values by ai1. We will denote (ai2, ...,
aim) (the remaining attributes) with r. Therefore, we can express reducer values as:
(K, (a1, r1)), (K, (a2, r2)), ..., (K, (an, rn))

To sort the reducer values by ai, we create a composite key: (K, ai). Our new
mappers will emit the key-value pairs for key = K shown in Table 2-1.


Table 2-1. Key-value pairs emitted by mappers

Key        Value
(K, a1)    (a1, r1)
(K, a2)    (a2, r2)
...        ...
(K, an)    (an, rn)

So, the composite key is (K, ai), and the natural key is K. Defining the composite key
(by adding the attribute ai to the natural key) enables us to sort the reducer values
using the MapReduce framework, but when we want to partition keys, we will
partition them by the natural key (K). The composite key and the natural key are
illustrated in Figure 2-1.
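The interplay of the two keys can be simulated in plain Java, without Hadoop: sort records by the composite key (K, ai), then group them by the natural key K alone. Each group's values then arrive already sorted, which is exactly what the reducer sees. The class and method names below are illustrative only, not part of the book's code.

```java
import java.util.*;

public class SecondarySortSketch {

    // Sort by the composite key (naturalKey, attribute), then group by the
    // natural key alone; each group's value list comes out already sorted.
    static Map<String, List<Integer>> sortAndGroup(String[][] pairs) {
        String[][] copy = pairs.clone();
        Arrays.sort(copy,
            Comparator.<String[], String>comparing(p -> p[0])          // natural key K
                      .thenComparingInt(p -> Integer.parseInt(p[1]))); // attribute ai
        Map<String, List<Integer>> grouped = new LinkedHashMap<>();
        for (String[] p : copy) {
            grouped.computeIfAbsent(p[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(p[1]));
        }
        return grouped;
    }

    public static void main(String[] args) {
        // (K, ai) pairs as mappers might emit them, in arbitrary order
        String[][] pairs = {
            {"K1", "3"}, {"K2", "2"}, {"K1", "1"}, {"K2", "9"}, {"K1", "2"}
        };
        System.out.println(sortAndGroup(pairs));
        // {K1=[1, 2, 3], K2=[2, 9]}
    }
}
```

Note that no value list is ever sorted in memory; the ordering falls out of the single sort over composite keys, which is what the MapReduce framework's sort phase provides for free.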

Figure 2-1. Secondary sorting keys
We have to tell the MapReduce framework how to sort the keys by using a composite
key composed of two fields, K and ai. For this we need to define a plug-in sort class,
CompositeKeyComparator, which will sort the composite keys. Example 2-1
shows how you plug this comparator class into the MapReduce framework.
Example 2-1. Plugging in the comparator class
import org.apache.hadoop.mapred.JobConf;
...
JobConf conf = new JobConf(getConf(), SecondarySortDriver.class);
...
// map() creates key-value pairs of
// (CompositeKey, NaturalValue)
conf.setMapOutputKeyClass(CompositeKey.class);
conf.setMapOutputValueClass(NaturalValue.class);
...
// Plug-in Comparator class:
// how CompositeKey objects will be sorted
conf.setOutputKeyComparatorClass(CompositeKeyComparator.class);

The CompositeKeyComparator class tells the MapReduce framework how to sort the
composite keys. The implementation in Example 2-2 compares two
WritableComparable objects (each representing a CompositeKey object).
Example 2-2. Comparator class: CompositeKeyComparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable k1, WritableComparable k2) {
        CompositeKey ck1 = (CompositeKey) k1;
        CompositeKey ck2 = (CompositeKey) k2;
        // compare ck1 with ck2 and return
        //    0, if ck1 and ck2 are identical
        //    1, if ck1 > ck2
        //   -1, if ck1 < ck2
        //
        // detail of implementation is provided in subsections
        return 0; // placeholder; see the full implementation in Example 2-8
    }
}

The next class to plug in is a "natural key partitioner" class (let's call it
NaturalKeyPartitioner) that will implement the Partitioner interface.1 Example 2-3
shows how we plug this class into the MapReduce framework.
Example 2-3. Plugging in NaturalKeyPartitioner
import org.apache.hadoop.mapred.JobConf;
...
JobConf conf = new JobConf(getConf(), SecondarySortDriver.class);
...
conf.setPartitionerClass(NaturalKeyPartitioner.class);

1 org.apache.hadoop.mapred.Partitioner


Next, we define the NaturalKeyPartitioner class, as shown in Example 2-4.
Example 2-4. Defining the NaturalKeyPartitioner class
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

/**
 * NaturalKeyPartitioner partitions the data output from the
 * map phase before it is sent through the shuffle phase.
 *
 * getPartition() partitions data generated by mappers.
 * This function should partition data by the natural key.
 */
public class NaturalKeyPartitioner implements
    Partitioner<CompositeKey, NaturalValue> {

    @Override
    public int getPartition(CompositeKey key,
                            NaturalValue value,
                            int numberOfPartitions) {
        // partition by the natural key only (hash-based partitioning assumed)
        return Math.abs(key.getNaturalKey().hashCode() % numberOfPartitions);
    }

    @Override
    public void configure(JobConf arg) {
    }
}
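The essential property of the partitioner is that the timestamp part of the composite key plays no role: every record for a given natural key must land on the same reducer. A minimal plain-Java sketch of such a hash-based getPartition() (the class name is illustrative, not the book's code; masking the sign bit is an assumption that stands in for Math.abs):

```java
public class NaturalKeyPartitionSketch {

    // Hash only the natural key; the rest of the composite key is
    // deliberately ignored, so all records for one key share a partition.
    static int getPartition(String naturalKey, int numberOfPartitions) {
        // mask the sign bit so the result is never negative
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numberOfPartitions;
    }

    public static void main(String[] args) {
        // the same natural key always maps to the same partition
        System.out.println(getPartition("GOOG", 10) == getPartition("GOOG", 10)); // true
    }
}
```

If the partitioner hashed the full composite key instead, records for one stock symbol could be scattered across reducers and the grouping comparator would never see them together.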

The last piece to plug in is NaturalKeyGroupingComparator, which just compares
two natural keys. Example 2-5 shows how you plug this class into the MapReduce
framework.
Example 2-5. Plugging in NaturalKeyGroupingComparator
import org.apache.hadoop.mapred.JobConf;
...
JobConf conf = new JobConf(getConf(), SecondarySortDriver.class);
...
conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class);

Next, as shown in Example 2-6, we define the NaturalKeyGroupingComparator class.
Example 2-6. Defining the NaturalKeyGroupingComparator class
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * NaturalKeyGroupingComparator
 *
 * This class is used during Hadoop's shuffle phase to group
 * composite keys by the first part (natural) of their key.
 */
public class NaturalKeyGroupingComparator extends WritableComparator {

    protected NaturalKeyGroupingComparator() {
        super(NaturalKey.class, true);
    }

    @Override
    public int compare(WritableComparable o1, WritableComparable o2) {
        NaturalKey nk1 = (NaturalKey) o1;
        NaturalKey nk2 = (NaturalKey) o2;
        return nk1.getNaturalKey().compareTo(nk2.getNaturalKey());
    }
}

Complete Example of Secondary Sorting
Consider the following data:
Stock-Symbol Date Closed-Price

and assume that we want to generate the following output data per stock symbol:
Stock-Symbol: (Date1, Price1)(Date2, Price2)...(Daten, Pricen)

where:
Date1 ≤ Date2 ≤ ... ≤ Daten

We want the reducer values to be sorted by the date of the closed price. We can
accomplish this by secondary sorting.

Input Format
We assume that our input data is in CSV format:
Stock-Symbol,Date,Closed-Price

For example:
ILMN,2013-12-05,97.65
GOOG,2013-12-09,1078.14
IBM,2013-12-09,177.46
ILMN,2013-12-09,101.33
ILMN,2013-12-06,99.25
GOOG,2013-12-06,1069.87
IBM,2013-12-06,177.67
GOOG,2013-12-05,1057.34


Output Format
We want our output to be sorted by date of closed price, so for our sample input, our
desired output is listed as follows:
ILMN: (2013-12-05,97.65)(2013-12-06,99.25)(2013-12-09,101.33)
GOOG: (2013-12-05,1057.34)(2013-12-06,1069.87)(2013-12-09,1078.14)
IBM: (2013-12-06,177.67)(2013-12-09,177.46)

Composite Key
The natural key is the stock symbol, and the composite key is a pair of
(Stock-Symbol, Date). The Date field has to be part of our composite key because we
want reducer values to be sorted by Date. The natural key and composite key are
illustrated in Figure 2-2.

Figure 2-2. Secondary sorting: composite and natural keys
We can define the composite key class as CompositeKey and its associated comparator
class as CompositeKeyComparator (this class tells MapReduce how to sort objects of
CompositeKey).

Composite key definition
In Example 2-7, the composite key is defined as the CompositeKey class, which
implements the WritableComparable interface.2

2 WritableComparable(s) can be compared to each other, typically via Comparator(s). Any type that is to be
used as a key in the Hadoop/MapReduce framework should implement this interface.


Example 2-7. Defining the composite key
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

/**
 * CompositeKey: represents a pair of (String stockSymbol, long timestamp).
 * Note that timestamp represents the Date.
 *
 * We do a primary grouping pass on the stockSymbol field to get all of
 * the data of one type together, and then our secondary sort during
 * the shuffle phase uses the timestamp long member to sort the data points
 * so that they arrive at the reducer partitioned and in sorted order (by date).
 */
public class CompositeKey implements WritableComparable<CompositeKey> {
    // natural key is (stockSymbol)
    // composite key is a pair (stockSymbol, timestamp)
    private String stockSymbol; // stock symbol
    private long timestamp;     // date

    public CompositeKey(String stockSymbol, long timestamp) {
        set(stockSymbol, timestamp);
    }

    public CompositeKey() {
    }

    public void set(String stockSymbol, long timestamp) {
        this.stockSymbol = stockSymbol;
        this.timestamp = timestamp;
    }

    public String getStockSymbol() {
        return this.stockSymbol;
    }

    public long getTimestamp() {
        return this.timestamp;
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        this.stockSymbol = in.readUTF();
        this.timestamp = in.readLong();
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(this.stockSymbol);
        out.writeLong(this.timestamp);
    }

    @Override
    public int compareTo(CompositeKey other) {
        if (this.stockSymbol.compareTo(other.stockSymbol) != 0) {
            return this.stockSymbol.compareTo(other.stockSymbol);
        }
        else if (this.timestamp != other.timestamp) {
            return timestamp < other.timestamp ? -1 : 1;
        }
        else {
            return 0;
        }
    }
}
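CompositeKey stores the date as a long timestamp rather than as a string. The chapter's DateUtil class is not listed here; the sketch below shows the kind of conversion it presumably performs (the class and method names are assumptions made for illustration). Sorting these longs sorts the records chronologically.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;

public class DateSketch {

    // Parse a "yyyy-MM-dd" date into epoch milliseconds; comparing the
    // resulting longs orders records chronologically.
    static long toTimestamp(String date) {
        try {
            return new SimpleDateFormat("yyyy-MM-dd").parse(date).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("bad date: " + date, e);
        }
    }

    public static void main(String[] args) {
        // an earlier date yields a smaller timestamp
        System.out.println(toTimestamp("2013-12-05") < toTimestamp("2013-12-09")); // true
    }
}
```

Writing the long with writeLong() (as CompositeKey does) also keeps the serialized key compact and cheap to compare.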

Composite key comparator definition
Example 2-8 defines the composite key comparator as the CompositeKeyComparator
class, which compares two CompositeKey objects by implementing the compare()
method. The compare() method returns 0 if they are identical, –1 if the first
composite key is smaller than the second one, and +1 otherwise.
Example 2-8. Defining the composite key comparator
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

/**
 * CompositeKeyComparator
 *
 * The purpose of this class is to enable comparison of two CompositeKeys.
 */
public class CompositeKeyComparator extends WritableComparator {

    protected CompositeKeyComparator() {
        super(CompositeKey.class, true);
    }

    @Override
    public int compare(WritableComparable wc1, WritableComparable wc2) {
        CompositeKey ck1 = (CompositeKey) wc1;
        CompositeKey ck2 = (CompositeKey) wc2;
        int comparison = ck1.getStockSymbol().compareTo(ck2.getStockSymbol());
        if (comparison == 0) {
            // stock symbols are equal here
            if (ck1.getTimestamp() == ck2.getTimestamp()) {
                return 0;
            }
            else if (ck1.getTimestamp() < ck2.getTimestamp()) {
                return -1;
            }
            else {
                return 1;
            }
        }
        else {
            return comparison;
        }
    }
}
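The two-level comparison can be checked in isolation with plain Java. The class below is a simplified stand-in for CompositeKeyComparator that works on bare (symbol, timestamp) values instead of Writable objects; it is for illustration only.

```java
public class CompareContractSketch {

    // Same two-level ordering as CompositeKeyComparator:
    // symbol first, then timestamp
    static int compare(String s1, long t1, String s2, long t2) {
        int comparison = s1.compareTo(s2);
        if (comparison != 0) {
            return comparison; // symbols differ: they alone decide the order
        }
        return Long.compare(t1, t2); // same symbol: earlier timestamp first
    }

    public static void main(String[] args) {
        System.out.println(compare("GOOG", 5L, "GOOG", 9L)); // -1: same symbol, earlier date first
        System.out.println(compare("GOOG", 5L, "GOOG", 5L)); // 0: identical keys
    }
}
```

Note that String.compareTo() may return any negative or positive value, not just –1 or +1; callers should test the sign of the result, not its exact magnitude.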

Sample Run—Old Hadoop API
The classes shown in Table 2-2 use the old Hadoop API to implement the Secondary
Sort design pattern.
Table 2-2. Implementation classes using the old Hadoop API

Class name                     Description
CompositeKey                   Defines a composite key
CompositeKeyComparator         Implements sorting composite keys
DateUtil                       Defines some useful date handling methods
HadoopUtil                     Defines some utility functions
NaturalKeyGroupingComparator   Defines how natural keys will be grouped together
NaturalKeyPartitioner          Implements how natural keys will be partitioned
NaturalValue                   Defines a natural value
SecondarySortDriver            Submits a job to Hadoop
SecondarySortMapper            Defines map()
SecondarySortReducer           Defines reduce()

Input
# hadoop fs -ls /secondary_sort_chapter/input/
Found 1 items
-rw-r--r-... /secondary_sort_chapter/input/sample_input.txt
# hadoop fs -cat /secondary_sort_chapter/input/sample_input.txt
ILMN,2013-12-05,97.65
GOOG,2013-12-09,1078.14
IBM,2013-12-09,177.46


ILMN,2013-12-09,101.33
ILMN,2013-12-06,99.25
GOOG,2013-12-06,1069.87
IBM,2013-12-06,177.67
GOOG,2013-12-05,1057.34

Running the MapReduce Job
# ./run.sh
...
13/12/12 21:13:20 INFO mapred.FileInputFormat: Total input paths to process : 1
13/12/12 21:13:21 INFO mapred.JobClient: Running job: job_201312122109_0002
13/12/12 21:13:22 INFO mapred.JobClient: map 0% reduce 0%
...
13/12/12 21:14:25 INFO mapred.JobClient: map 100% reduce 100%
...
13/12/12 21:14:26 INFO mapred.JobClient: Map-Reduce Framework
13/12/12 21:14:26 INFO mapred.JobClient: Map input records=8
13/12/12 21:14:26 INFO mapred.JobClient: Combine input records=0
13/12/12 21:14:26 INFO mapred.JobClient: Reduce input records=8
13/12/12 21:14:26 INFO mapred.JobClient: Reduce input groups=3
13/12/12 21:14:26 INFO mapred.JobClient: Combine output records=0
13/12/12 21:14:26 INFO mapred.JobClient: Reduce output records=3
13/12/12 21:14:26 INFO mapred.JobClient: Map output records=8

Output
# hadoop fs -ls /secondary_sort_chapter/output/
-rw-r--r-- 1 ... 0 2013-12-12 21:14 /secondary_sort_chapter/output/_SUCCESS
drwxr-xr-x - ... 0 2013-12-12 21:13 /secondary_sort_chapter/output/_logs
-rw-r--r-- 1 ... 0 2013-12-12 21:13 /secondary_sort_chapter/output/part-00000
-rw-r--r-- 1 ... 66 2013-12-12 21:13 /secondary_sort_chapter/output/part-00001
...
-rw-r--r-- 1 ... 0 2013-12-12 21:14 /secondary_sort_chapter/output/part-00008
-rw-r--r-- 1 ... 43 2013-12-12 21:14 /secondary_sort_chapter/output/part-00009
# hadoop fs -cat /secondary_sort_chapter/output/part*
GOOG (2013-12-05,1057.34)(2013-12-06,1069.87)(2013-12-09,1078.14)
ILMN (2013-12-05,97.65)(2013-12-06,99.25)(2013-12-09,101.33)
IBM (2013-12-06,177.67)(2013-12-09,177.46)

Sample Run—New Hadoop API
The classes shown in Table 2-3 use the new Hadoop API to implement the Secondary
Sort design pattern.


Table 2-3. Implementation classes using the new Hadoop API

Class name                     Description
CompositeKey                   Defines a composite key
CompositeKeyComparator         Implements sorting composite keys
DateUtil                       Defines some useful date handling methods
HadoopUtil                     Defines some utility functions
NaturalKeyGroupingComparator   Defines how natural keys will be grouped together
NaturalKeyPartitioner          Implements how natural keys will be partitioned
NaturalValue                   Defines a natural value
SecondarySortDriver            Submits a job to Hadoop
SecondarySortMapper            Defines map()
SecondarySortReducer           Defines reduce()

Input
# hadoop fs -ls /secondary_sort_chapter_new_api/input/
Found 1 items
-rw-r--r-... /secondary_sort_chapter_new_api/input/sample_input.txt
# hadoop fs -cat /secondary_sort_chapter_new_api/input/sample_input.txt
ILMN,2013-12-05,97.65
GOOG,2013-12-09,1078.14
IBM,2013-12-09,177.46
ILMN,2013-12-09,101.33
ILMN,2013-12-06,99.25
GOOG,2013-12-06,1069.87
IBM,2013-12-06,177.67
GOOG,2013-12-05,1057.34

Running the MapReduce Job
# ./run.sh
...
13/12/14 21:18:25 INFO ... Total input paths to process : 1
...
13/12/14 21:18:25 INFO mapred.JobClient: Running job: job_201312142112_0002
13/12/14 21:18:26 INFO mapred.JobClient: map 0% reduce 0%
13/12/14 21:19:15 INFO mapred.JobClient: map 100% reduce 100%
13/12/14 21:19:16 INFO mapred.JobClient: Job complete: job_201312142112_0002
...
13/12/14 21:19:16 INFO mapred.JobClient: Map-Reduce Framework
13/12/14 21:19:16 INFO mapred.JobClient: Map input records=8
13/12/14 21:19:16 INFO mapred.JobClient: Spilled Records=16
13/12/14 21:19:16 INFO mapred.JobClient: Combine input records=0
13/12/14 21:19:16 INFO mapred.JobClient: Reduce input records=8