Chapter 4. Common Hadoop Processing Patterns

Deltas (updated records)
HDFS is a “write once and read many” filesystem, so making modifications at the
record level is not a simple thing to do. In the deltas use case, we have an existing
data set with a primary key (or composite key), and updated records are continually
being added to that data set.
We cover methods for dealing with the first case, fully duplicate records, elsewhere in
this book—for example, in the clickstream case study in Chapter 8—so we’ll discuss
the second case, record updates, in this example. This will require implementing
processing to rewrite an existing data set so that it only shows the newest versions of
each record.
If you’re familiar with HBase, you might have noted that this is similar to the way
HBase works; at a high level, a region in HBase has an HFile that has values linked to
a key. When new data is added, a second HFile is added with keys and values. During
cleanup activities called compactions, HBase does a merge join to execute this
deduplication pattern, as shown in Figure 4-1.

Figure 4-1. HBase compaction
Note that we’re omitting some additional complexity, such as HBase’s support for
versioning, in the preceding example.
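At its core, this compaction is simply a keyed merge that keeps the newest version of
each value. Here is a minimal, purely illustrative sketch of that idea in Scala (the
compact function and its Map-based types are ours for illustration, not HBase APIs):
// Illustrative sketch only: merge an existing set of records with newer ones,
// keeping the entry with the latest timestamp for each key, the way a
// compaction's merge join would.
def compact(existing: Map[String, (Long, String)],
            updates: Map[String, (Long, String)]): Map[String, (Long, String)] =
  (existing.toSeq ++ updates.toSeq)
    .groupBy { case (key, _) => key }
    .map { case (key, entries) => (key, entries.map(_._2).maxBy(_._1)) }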

Data Generation for Deduplication Example
Before we get into examples of implementing this pattern, let’s first look at some code
to create test data. We are going to use the Scala object GenDedupInput, which uses
the HDFS API to create a file on HDFS and write out records in the following format:
{PrimaryKey},{timeStamp},{value}

We’ll write x records spread across y unique primary keys. This means that if we set x
to 100 and y to 10, we will get close to 10 records for each primary key, as seen in
this example:
import java.io.{BufferedWriter, OutputStreamWriter}

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

import scala.util.Random

object GenDedupInput {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      println("{outputPath} {numberOfRecords} {numberOfUniqueRecords}")
      return
    }

    // The output file that will hold the data
    val outputPath = new Path(args(0))
    // Number of records to be written to the file
    val numberOfRecords = args(1).toLong
    // Number of unique primary keys
    val numberOfUniqueRecords = args(2).toInt

    // Open fileSystem to HDFS
    val fileSystem = FileSystem.get(new Configuration())

    // Create buffered writer
    val writer = new BufferedWriter(
      new OutputStreamWriter(fileSystem.create(outputPath)))

    val r = new Random()

    // This loop will write out all the records
    // Every primary key will get about
    // numberOfRecords/numberOfUniqueRecords records
    for (i <- 0L until numberOfRecords) {
      val uniqueId = r.nextInt(numberOfUniqueRecords)
      // Format: {key},{timeStamp},{value}
      writer.write(uniqueId + "," + i + "," + r.nextInt(10000))
      writer.newLine()
    }

    writer.close()
  }
}

Code Example: Spark Deduplication in Scala
Now that we’ve created our test data in HDFS, let’s look at the code to deduplicate
these records in the SparkDedupExecution object:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.{SparkConf, SparkContext}
// Brings in the implicit pair-RDD functions (reduceByKey) on older Spark releases
import org.apache.spark.SparkContext._

object SparkDedupExecution {
  def main(args: Array[String]): Unit = {
    if (args.length < 2) {
      println("{inputPath} {outputPath}")
      return
    }

    // Set up given parameters
    val inputPath = args(0)
    val outputPath = args(1)

    // Set up Spark conf and connection
    val sparkConf = new SparkConf().setAppName("SparkDedupExecution")
    sparkConf.set("spark.cleaner.ttl", "120000")
    val sc = new SparkContext(sparkConf)

    // Read data in from HDFS
    val dedupOriginalDataRDD = sc.hadoopFile(inputPath,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      1)

    // Get the data in a key-value format
    val keyValueRDD = dedupOriginalDataRDD.map(t => {
      val splits = t._2.toString.split(",")
      (splits(0), (splits(1), splits(2)))})

    // Reduce by key so we will only get one record for every primary key
    val reducedRDD =
      keyValueRDD.reduceByKey((a,b) => if (a._1.compareTo(b._1) > 0) a else b)

    // Format the data to a human-readable format and write it back out to HDFS
    reducedRDD
      .map(r => r._1 + "," + r._2._1 + "," + r._2._2)
      .saveAsTextFile(outputPath)
  }
}

Let’s break this code down to discuss what’s going on. We’ll skip the setup code, which
just gets the user arguments and sets up the SparkContext, and skip to the following
code that will get our duplicate record data from HDFS:
val dedupOriginalDataRDD = sc.hadoopFile(inputPath,
classOf[TextInputFormat],
classOf[LongWritable],
classOf[Text],
1)

There are many ways to read data in Spark, but for this example we’ll use the
hadoopFile() method so we can show how existing input formats can be used. If you
have done much MapReduce programming, you will be familiar with TextInputFormat,
which is one of the most basic input formats available. The TextInputFormat
provides functionality that allows Spark or MapReduce jobs to break up a directory
into files, which are then broken up into blocks to be processed by different tasks.
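For comparison, here is a sketch of the simpler route: textFile() also uses
TextInputFormat under the covers but returns only the line contents, so the key-value
parsing that follows stays the same:
// Sketch: read the same input with textFile(); each element is just the line.
val dedupLinesRDD = sc.textFile(inputPath)
val keyValueFromLinesRDD = dedupLinesRDD.map(line => {
  val splits = line.split(",")
  (splits(0), (splits(1), splits(2)))
})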
The next item of code is the first map() function:
val keyValueRDD = dedupOriginalDataRDD.map(t => {
val splits = t._2.toString.split(",")
(splits(0), (splits(1), splits(2)))})


This code will run in parallel across different workers and parse the incoming records
into a Tuple object that has two values representing a key and a value.
This key-value structure is required for the next piece of code, which will use the
reduceByKey() method. As you might guess by the name, in order to use the
reduceByKey() method we need a key.
Now let’s look at the code that calls reduceByKey():
val reducedRDD =
keyValueRDD.reduceByKey((a,b) => if (a._1.compareTo(b._1) > 0) a else b)

The reduceByKey() method takes a function that accepts a left and a right value and
returns a value of the same type. The goal of reduceByKey() is to combine all values
of the same key. In the word count example, it is used to add all the counts of a single
word to get the total count. In our example, a and b are (timestamp, value) tuples,
and we return whichever one has the greater timestamp. Since the key we’re reducing
by is the primary key, this function makes sure we keep only one record per primary
key: the one with the latest timestamp. That is what deduplicates the data.
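For reference, this is the additive use of reduceByKey() from the word count example
just mentioned (a sketch, where words stands in for a hypothetical RDD[String] of
input words):
// Sketch: each word becomes (word, 1), and reduceByKey() adds the
// counts together for each distinct word.
val wordCounts = words
  .map(word => (word, 1))
  .reduceByKey((a, b) => a + b)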
The last bit of code will just write the results back to HDFS:
reducedRDD
.map(r => r._1 + "," + r._2._1 + "," + r._2._2)
.saveAsTextFile(outputPath)

We will get a text output file for every partition in Spark, similar to the way
MapReduce outputs a file for each mapper or reducer at the end of a MapReduce job.
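If a single output file is preferred and the deduplicated result is small enough for one
task to write, the RDD can be coalesced before the save; here is a sketch:
// Sketch: collapse to a single partition before writing so that
// saveAsTextFile() produces one part file instead of one per partition.
reducedRDD
  .coalesce(1)
  .map(r => r._1 + "," + r._2._1 + "," + r._2._2)
  .saveAsTextFile(outputPath)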

Code Example: Deduplication in SQL
Now we’ll turn to the venerable SQL—well, more accurately, HiveQL, although the
examples in this chapter will work with either Hive or Impala. First, we need to put
our test data into a table using this data definition language (DDL) query:
CREATE EXTERNAL TABLE COMPACTION_TABLE (
  PRIMARY_KEY STRING,
  TIME_STAMP BIGINT,
  EVENT_VALUE STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'compaction_data';

Now that we have a table, let’s look at the query to perform the deduplication:
SELECT
  A.PRIMARY_KEY,
  A.TIME_STAMP,
  MAX(A.EVENT_VALUE)
FROM COMPACTION_TABLE A JOIN (
  SELECT
    PRIMARY_KEY AS P_KEY,
    MAX(TIME_STAMP) AS TIME_SP
  FROM COMPACTION_TABLE
  GROUP BY PRIMARY_KEY
) B
WHERE A.PRIMARY_KEY = B.P_KEY AND A.TIME_STAMP = B.TIME_SP
GROUP BY A.PRIMARY_KEY, A.TIME_STAMP

Here we have a two-level-deep SQL query. The inner SELECT gets the latest
TIME_STAMP for each PRIMARY_KEY. The outer SELECT then joins back to the table on
that key and timestamp to pull out the corresponding EVENT_VALUE. Also note that we
apply a MAX() function to EVENT_VALUE; this is because we want a single value per
key, so if two EVENT_VALUEs share the same timestamp, we keep the one with the
greatest value for our new record.
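The same query can also be driven from Spark through a HiveContext. Here is a
sketch, assuming a Spark build with Hive support, the table defined above, and an
existing SparkContext named sc:
import org.apache.spark.sql.hive.HiveContext

// Sketch: issue the deduplication query against the Hive metastore from Spark.
val hiveContext = new HiveContext(sc)
val dedupedRows = hiveContext.sql(
  """SELECT A.PRIMARY_KEY, A.TIME_STAMP, MAX(A.EVENT_VALUE)
    |FROM COMPACTION_TABLE A JOIN (
    |  SELECT PRIMARY_KEY AS P_KEY, MAX(TIME_STAMP) AS TIME_SP
    |  FROM COMPACTION_TABLE
    |  GROUP BY PRIMARY_KEY
    |) B ON A.PRIMARY_KEY = B.P_KEY AND A.TIME_STAMP = B.TIME_SP
    |GROUP BY A.PRIMARY_KEY, A.TIME_STAMP""".stripMargin)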

Pattern: Windowing Analysis
Windowing functions provide the ability to scan over an ordered sequence of events
over some window—for example, a specific slice of time. This pattern is very
powerful and is useful in many industries:
• It can be used in finance to gain a better understanding of changing security
prices.
• In sensor monitoring, it’s useful in predicting failure from abnormalities in
readings.
• It can be used in churn analysis for trying to predict if a customer will leave you
based on behavior patterns.
• In gaming, it can help to identify trends as users progress from novice to expert.
To illustrate, we’ll use an example that relates to the finance use case: finding peaks
and valleys in equity prices in order to provide some insight into price changes. A
peak is a record that has a lower price before it and a lower price after it, while a valley
is just the opposite, with higher prices on both sides, as shown in Figure 4-2.


Figure 4-2. Peaks and valleys in stock prices over time
To implement this example we’ll need to maintain a window of stock prices in order
to determine where the peaks and valleys occur.
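The comparison at the heart of the pattern is small enough to state directly. Here is a
sketch of the test applied to a value and its two neighbors (the classify function is
ours, for illustration only; the Spark and SQL implementations that follow express
the same logic):
// Sketch: classify a point given the values immediately before and after it.
def classify(prev: Int, current: Int, next: Int): String =
  if (current > prev && current > next) "peak"
  else if (current < prev && current < next) "valley"
  else "slope"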
Note that a simple example like this makes it possible for us to show the solution in
both SQL and Spark. As windowing analysis gets more complex, SQL becomes a less
suitable solution.

Data Generation for Windowing Analysis Example
Let’s create some test data containing records with a value that goes up and down,
similar to stock prices. The following code example takes the same input parameters
as our last data generation tool—numberOfRecords and numberOfUniqueIds—
although the resulting records will be somewhat different:
Primary key
An identifier for each sequence of events we are analyzing—for example, a stock
ticker symbol. This will be based on the numberOfUniqueIds input parameter.
Incrementing counter
This will be unique for every record in the generated data.
Event value
This will have a value that increases and decreases for a random set of records for
a given primary key.
Let’s take a look at the code to generate this test data:
def main(args: Array[String]): Unit = {
  if (args.length < 3) {
    println("{outputPath} {numberOfRecords} {numberOfUniqueIds}")
    return
  }

  val outputPath = new Path(args(0))
  val numberOfRecords = args(1).toInt
  val numberOfUniqueIds = args(2).toInt

  val fileSystem = FileSystem.get(new Configuration())
  val writer =
    new BufferedWriter(new OutputStreamWriter(fileSystem.create(outputPath)))

  val r = new Random()
  var direction = 1
  var directionCounter = r.nextInt(numberOfUniqueIds * 10)
  var currentPointer = 0

  for (i <- 0 until numberOfRecords) {
    val uniqueId = r.nextInt(numberOfUniqueIds)
    currentPointer = currentPointer + direction
    directionCounter = directionCounter - 1
    if (directionCounter == 0) {
      // Reassign (not redeclare) the counter, then flip the direction so the
      // generated values alternate between climbing and falling
      directionCounter = r.nextInt(numberOfUniqueIds * 10)
      direction = direction * -1
    }
    writer.write(uniqueId + "," + i + "," + currentPointer)
    writer.newLine()
  }
  writer.close()
}

Code Example: Peaks and Valleys in Spark
Now, let’s look at the code to implement this pattern in Spark. There’s quite a bit
going on in the following example, so after presenting the code we’ll drill down into
each piece to explain it.
You’ll find this code in the SparkPeaksAndValleysExecution.scala file:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.TextInputFormat
import org.apache.spark.rdd.ShuffledRDD
import org.apache.spark.{Partitioner, SparkConf, SparkContext}

import scala.collection.mutable

object SparkPeaksAndValleysExecution {
  def main(args: Array[String]): Unit = {
    if (args.length < 3) {
      println("{inputPath} {outputPath} {numberOfPartitions}")
      return
    }

    val inputPath = args(0)
    val outputPath = args(1)
    val numberOfPartitions = args(2).toInt

    val sparkConf = new SparkConf().setAppName("SparkTimeSeriesExecution")
    sparkConf.set("spark.cleaner.ttl", "120000")
    val sc = new SparkContext(sparkConf)

    // Read in the data
    var originalDataRDD = sc.hadoopFile(inputPath,
      classOf[TextInputFormat],
      classOf[LongWritable],
      classOf[Text],
      1).map(r => {
      val splits = r._2.toString.split(",")
      (new DataKey(splits(0), splits(1).toLong), splits(2).toInt)
    })

    // Partitioner to partition by primaryKey only
    val partitioner = new Partitioner {
      override def numPartitions: Int = numberOfPartitions

      override def getPartition(key: Any): Int = {
        Math.abs(key.asInstanceOf[DataKey].uniqueId.hashCode() % numPartitions)
      }
    }

    // Partition and sort
    val partedSortedRDD =
      new ShuffledRDD[DataKey, Int, Int](
        originalDataRDD,
        partitioner).setKeyOrdering(implicitly[Ordering[DataKey]])

    // MapPartition to do windowing
    val pivotPointRDD = partedSortedRDD.mapPartitions(it => {
      val results = new mutable.MutableList[PivotPoint]

      // Keeping context
      var lastUniqueId = "foobar"
      var lastRecord: (DataKey, Int) = null
      var lastLastRecord: (DataKey, Int) = null

      var position = 0

      it.foreach(r => {
        position = position + 1

        if (!lastUniqueId.equals(r._1.uniqueId)) {
          lastRecord = null
          lastLastRecord = null
        }

        // Finding peaks and valleys
        if (lastRecord != null && lastLastRecord != null) {
          if (lastRecord._2 < r._2 && lastRecord._2 < lastLastRecord._2) {
            results.+=(new PivotPoint(r._1.uniqueId,
              position,
              lastRecord._1.eventTime,
              lastRecord._2,
              false))
          } else if (lastRecord._2 > r._2 && lastRecord._2 > lastLastRecord._2) {
            results.+=(new PivotPoint(r._1.uniqueId,
              position,
              lastRecord._1.eventTime,
              lastRecord._2,
              true))
          }
        }

        lastUniqueId = r._1.uniqueId
        lastLastRecord = lastRecord
        lastRecord = r
      })

      results.iterator
    })

    // Format output
    pivotPointRDD.map(r => {
      val pivotType = if (r.isPeak) "peak" else "valley"
      r.uniqueId + "," +
        r.position + "," +
        r.eventTime + "," +
        r.eventValue + "," +
        pivotType
    }).saveAsTextFile(outputPath)
  }

  class DataKey(val uniqueId: String, val eventTime: Long)
      extends Serializable with Comparable[DataKey] {
    override def compareTo(other: DataKey): Int = {
      val compare1 = uniqueId.compareTo(other.uniqueId)
      if (compare1 == 0) {
        eventTime.compareTo(other.eventTime)
      } else {
        compare1
      }
    }
  }

  class PivotPoint(val uniqueId: String,
                   val position: Int,
                   val eventTime: Long,
                   val eventValue: Int,
                   val isPeak: Boolean) extends Serializable {}
}

Walking through this code from the top: the first section is nothing too interesting;
we’re simply reading the input data and parsing it into easy-to-consume objects.
This is where things get interesting. We’re defining a partitioner here, just like
defining a custom partitioner with MapReduce. A partitioner helps us decide which
records go to which worker after the shuffle process. We need a custom partitioner
here because we have a two-part key: primary_key and position. We want to sort by
both, but we only want to partition by the primary_key, so we get output like that
shown in Figure 4-3.
This is the shuffle action that will partition and sort the data for us. Note that newer
Spark releases provide a transformation called repartitionAndSortWithinPartitions(),
which offers this functionality directly, but this example is coded against Spark 1.2
and implements the shuffle manually with a ShuffledRDD.
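For reference, here is a sketch of what that step could look like using that
transformation, relying on DataKey’s Comparable to supply the implicit Ordering the
call requires:
// Sketch only: shuffle with the custom partitioner and sort each partition
// by key, replacing the hand-built ShuffledRDD used in the example above.
val partedSortedRDD =
  originalDataRDD.repartitionAndSortWithinPartitions(partitioner)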
This mapPartitions() call allows us to run through each primary_key’s records in
order of position. This is where the windowing happens.
This is the context information we need in order to find peaks and valleys and to
know when we have moved on to a new primary_key. Remember, to decide whether a
record is a peak or a valley we need to know the record before it and the one after it.
So we keep the current record along with lastRecord and lastLastRecord, and we can
determine whether lastRecord is a peak or a valley by comparing it against the other
two.
Perform the comparisons to determine whether lastRecord is a peak or a valley.
And finally, this is the code that will format the records and write them to HDFS.

Figure 4-3. Partitioning in the peaks and valleys example—here we partition the
sequences into two groups, so we can distribute the analysis on two workers while still
keeping all events for each sequence together


Code Example: Peaks and Valleys in SQL
As in the previous example, we’ll first create a table over our test data:
CREATE EXTERNAL TABLE PEAK_AND_VALLEY_TABLE (
  PRIMARY_KEY STRING,
  POSITION BIGINT,
  EVENT_VALUE BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'PeakAndValleyData';

Now that we have our table we need to order the records and then use the lead() and
lag() functions, which will provide the context of the surrounding records:
SELECT
  PRIMARY_KEY,
  POSITION,
  EVENT_VALUE,
  CASE
    WHEN LEAD_EVENT_VALUE IS NULL OR LAG_EVENT_VALUE IS NULL THEN 'EDGE'
    WHEN EVENT_VALUE < LEAD_EVENT_VALUE AND EVENT_VALUE < LAG_EVENT_VALUE
      THEN 'VALLEY'
    WHEN EVENT_VALUE > LEAD_EVENT_VALUE AND EVENT_VALUE > LAG_EVENT_VALUE
      THEN 'PEAK'
    ELSE 'SLOPE'
  END AS POINT_TYPE
FROM
(
  SELECT
    PRIMARY_KEY,
    POSITION,
    EVENT_VALUE,
    LEAD(EVENT_VALUE, 1, null)
      OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LEAD_EVENT_VALUE,
    LAG(EVENT_VALUE, 1, null)
      OVER (PARTITION BY PRIMARY_KEY ORDER BY POSITION) AS LAG_EVENT_VALUE
  FROM PEAK_AND_VALLEY_TABLE
) A

Although this SQL is not overly complex, it is big enough that we should break it
down to explain what’s going on. After the inner subquery executes, the data is
organized so that all the information we need is in the same record, and we can use
that record to determine whether we have one of the following: an edge, a point at the
leftmost or rightmost part of the window timeline; a peak, a value that has a lower
value both before and after it; a valley, a value that has a higher value both before and
after it; or a slope, any other point on the way up or down.