Chapter 5. Graph Processing on Hadoop

Figure 5-1. Edges and vertices
Now let’s change the image a little and give each vertex some information. Let’s say
each vertex represents a person, and we want it to have some information about that
person like his or her name and type (which in this case would be Person). Now for
the edges, let’s give them some information also describing the relationship of the two
vertices they connect. We can use information like movies viewed, or relationships
like brother, father, mother, and wife (see Figure 5-2).

Figure 5-2. Adding information to vertices and edges
But this isn’t enough information. We know that Karen is TJ’s mother, but TJ can’t
also be Karen’s mother, and we know that Andrew watched Iron Man, but Iron Man is
not watching Andrew. So, we can fix this problem by giving our edges directions,
which gives us an image, or graph, that looks like Figure 5-3.

Figure 5-3. Edges showing directional relationships between vertices
This is more like it. We can show this graph to even a nontechnical person and he or
she will still be able to figure out what we’re trying to express.

What Is Graph Processing?
When we talk about graph processing, we are talking about doing processing at a
global level that may touch every vertex in the graph. This is in contrast to the idea of
graph querying. Graph querying is when you ask a question of the graph, like “Who’s
connected to Karen?” This query will execute the following steps:
1. Look up Karen and her edges.
2. Follow each edge and get those vertices.
3. Return the resulting list to the user.
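The three query steps above can be sketched in a few lines of Java. This is only an illustration: the class name GraphQuery and the in-memory map are assumptions standing in for a real data store such as HBase, and the two edges are the ones mentioned earlier (Karen is TJ's mother, Andrew watched Iron Man).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for a data store keyed by vertex name.
class GraphQuery {
    // Each vertex maps to the vertices its outgoing edges point at.
    static final Map<String, List<String>> EDGES = new HashMap<>();
    static {
        EDGES.put("Karen", List.of("TJ"));        // Karen is TJ's mother
        EDGES.put("Andrew", List.of("Iron Man")); // Andrew watched Iron Man
    }

    // Step 1: look up the vertex and its edges. Step 2: follow each edge.
    // Step 3: return the resulting list. Unknown vertices yield an empty list.
    static List<String> connectedTo(String vertex) {
        return EDGES.getOrDefault(vertex, List.of());
    }

    public static void main(String[] args) {
        System.out.println(connectedTo("Karen"));
    }
}
```

The point of the sketch is that the work is bounded by one vertex and its edges, which is why a single client can answer it.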
Graph processing means something different: asking a question like “What are the top
five connections with a separation of five degrees?” This question is much larger in
scale than the query and requires much more horsepower to process, because it involves
looking through a lot of people and all their connections. In contrast, the query
focused on a single user could have been executed by a single client with a couple of
hops to a data store such as HBase.
These concepts manifest in many real-world examples:
• The Web can be seen as a very large graph, where web pages are the vertices, and
links are the edges. This leads to a number of opportunities for analysis, including
algorithms such as PageRank, famously used by Google in ranking search results.
• As should be evident from the preceding discussion, social networks are a natural
application of graph processing; for example, determining degrees of connection
between users of a networking site.
It’s these types of applications that this chapter will focus on: how do we ask these
questions of data in Hadoop in a way that is effective, maintainable, and performant?

How Do You Process a Graph in a Distributed System?
In order to perform this processing on a system like Hadoop, we can start with
MapReduce. The problem with MapReduce is that it can only give us a one-layer join,
which means we have to tackle a graph like peeling an onion. For those of you who
don’t peel onions, it is very different from peeling an apple. An onion has many layers
to get through before you reach the core. In addition, that property of onions that
makes your eyes water and makes the peeling experience less than joyful is similar to
how processing a graph with MapReduce might reduce you to tears at a certain point.
The graph in Figure 5-4 is an example of what we mean by “like peeling an onion.”
The center dot is our starting person, and every growing circle is yet another
MapReduce job to figure out who is connected for each level.

Figure 5-4. In MapReduce, tackling a graph is like peeling an onion
This hurts even more when we realize that with every pass we are rereading and
rewriting the whole graph to disk.
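To make the cost of the onion approach concrete, here is a minimal sketch in plain Java. It is not MapReduce code: each loop iteration merely stands in for one job, and the class name OnionPeel, the edge-list representation, and the method names are assumptions made for illustration. Note that every pass scans the complete edge list, even though only one layer is discovered.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class OnionPeel {
    // One simulated "job": scan every edge and join it against the frontier.
    static Set<Integer> expandOneLayer(List<int[]> edges,
                                       Set<Integer> frontier,
                                       Set<Integer> seen) {
        Set<Integer> next = new HashSet<>();
        for (int[] e : edges) { // full scan of the graph on every pass
            if (frontier.contains(e[0]) && !seen.contains(e[1])) {
                next.add(e[1]);
            }
        }
        return next;
    }

    // Returns how many full passes run before no new vertex is found.
    static int passesToCover(List<int[]> edges, int start) {
        Set<Integer> seen = new HashSet<>(Set.of(start));
        Set<Integer> frontier = Set.of(start);
        int passes = 0;
        while (!frontier.isEmpty()) {
            frontier = expandOneLayer(edges, frontier, seen);
            seen.addAll(frontier);
            passes++;
        }
        return passes;
    }

    public static void main(String[] args) {
        List<int[]> chain = List.of(new int[]{1, 2}, new int[]{2, 3}, new int[]{3, 4});
        System.out.println(passesToCover(chain, 1));
    }
}
```

For a chain of three edges the sketch runs four passes: three that each peel one layer, and a final one that finds nothing new and ends the loop.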

Thankfully, once again some very smart people at Google decided to break the rules.
In this case, it was the MapReduce rule that mappers are not allowed to talk to other
mappers. This shared-nothing concept is very important to a distributed system like
Hadoop that needs sync points and strategies to recover from failure. So how did
these very smart people solve this problem? In short, they found another way to
get the same sync points and recovery strategies without the limitations of siloed
mappers.

The Bulk Synchronous Parallel Model
So how do we maintain synchronous processing and still break the “no talking
between mappers” rule? The answer was provided by a British computer scientist
named Leslie Valiant of Harvard, who developed the Bulk Synchronous Parallel
(BSP) model in the 1990s. This BSP model is at the core of the Google graph
processing solution called Pregel.
The idea of BSP is pretty complex, yet simple at the same time. In short, it is the idea
of distributed processes doing work within a superstep. These distributed processes
can send messages to each other, but they cannot act upon those messages until the
next superstep. These supersteps act as the boundaries for our needed sync points. We
can only reach the next superstep when all distributed processes have finished
processing and sending messages for the current superstep. There’s then normally a
single-threaded process that decides whether the overall process needs to continue
with a new superstep. It’s acceptable for this process to run in a single thread,
since it does very little work in comparison to the worker threads and thus isn’t a
bottleneck.
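The superstep mechanics can be sketched in a single JVM. The following is an illustration only, not Pregel or Giraph code: the class BspMax, the inbox/outbox maps, and the choice of problem (spreading the largest vertex value through the graph) are all assumptions. What it does show is the two rules described above: messages sent in one superstep are readable only in the next, and a simple sequential check decides whether to continue.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BspMax {
    // Runs supersteps until no messages are sent; returns the superstep count.
    // Every vertex ID must appear as a key in both maps.
    static int run(Map<Integer, List<Integer>> edges, Map<Integer, Integer> values) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        int superstep = 0;
        while (true) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Integer v : edges.keySet()) { // each vertex works on its own state
                int before = values.get(v);
                int best = before;
                for (int m : inbox.getOrDefault(v, List.of())) {
                    best = Math.max(best, m);
                }
                values.put(v, best);
                // Send on the first superstep, or whenever the value improved.
                if (superstep == 0 || best > before) {
                    for (Integer target : edges.get(v)) {
                        outbox.computeIfAbsent(target, k -> new ArrayList<>()).add(best);
                    }
                }
            }
            superstep++;
            // Barrier: messages become visible only now, for the next superstep.
            inbox = outbox;
            // The single-threaded decision: stop when nothing was sent.
            if (outbox.isEmpty()) {
                return superstep;
            }
        }
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> edges = new HashMap<>();
        edges.put(1, List.of(2));
        edges.put(2, List.of(1, 3));
        edges.put(3, List.of(2));
        Map<Integer, Integer> values = new HashMap<>(Map.of(1, 3, 2, 6, 3, 2));
        System.out.println(run(edges, values) + " supersteps, values " + values);
    }
}
```

In a real cluster the loop body runs in parallel across workers and the barrier is a network synchronization point; the sequential loop here only mimics that ordering.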

BSP by Example
That was admittedly a short definition of a distributed processing model that took
years of research. We’ll help clarify with a short example of BSP. Scientists can use
graph processing as a way to model the spread of a disease through a community. In
this example, we illustrate this with zombies, who began as humans but changed after
being bitten by another zombie.
Let’s make a new graph and call it zombie bites. As you can see in Figure 5-5, in the
start state we have one zombie and a bunch of people. The rule when the processing
starts is that the zombie can bite every human it shares an edge with, and then when a
vertex is bitten, it must turn itself into a zombie and continue by biting all of its edges.
Once a zombie bites, it will not bite again because everyone around it will already
have become a zombie, and we know from watching countless zombie movies that
zombies never bite other zombies.
Figure 5-5 shows what the graph looks like in the different supersteps of our BSP
execution model.

Figure 5-5. Supersteps for the zombie bites graph
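Before turning to real frameworks, the zombie rule can be tried out as a small single-JVM simulation. This sketch makes several assumptions: the class name ZombieBites, the use of plain maps for edges and state, and the labels "Human" and "Zombie.N" (where N is the superstep in which the vertex turned) are illustrative choices, not part of any framework.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class ZombieBites {
    // state maps vertex ID to "Human", "Zombie", or "Zombie.<superstep>".
    // Returns the number of the last superstep in which anyone was bitten.
    static int run(Map<Integer, List<Integer>> edges, Map<Integer, String> state) {
        // Superstep 0: the starting zombies bite along all of their edges.
        Set<Integer> bitten = new HashSet<>();
        for (Map.Entry<Integer, String> e : state.entrySet()) {
            if (e.getValue().startsWith("Zombie")) {
                bitten.addAll(edges.getOrDefault(e.getKey(), List.of()));
            }
        }
        int superstep = 0;
        while (!bitten.isEmpty()) {
            superstep++; // barrier: bites land in the following superstep
            Set<Integer> nextBitten = new HashSet<>();
            for (Integer v : bitten) {
                // Zombies ignore bites; only humans turn and bite onward.
                if ("Human".equals(state.get(v))) {
                    state.put(v, "Zombie." + superstep);
                    nextBitten.addAll(edges.getOrDefault(v, List.of()));
                }
            }
            bitten = nextBitten;
        }
        return superstep;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> edges =
            Map.of(1, List.of(2, 3), 2, List.of(4), 3, List.of(4), 4, List.of(5));
        Map<Integer, String> state = new HashMap<>(
            Map.of(1, "Zombie", 2, "Human", 3, "Human", 4, "Human", 5, "Human"));
        System.out.println(run(edges, state) + " supersteps: " + state);
    }
}
```

Each vertex bites at most once, in the superstep where it turns, which matches the rule that a zombie never needs to bite again.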
We’ll be introducing two graph processing tools, Giraph and GraphX, in this chapter
to show implementations of this example. But before we do so, it is important to note
that BSP is not the only solution. With Spark, the penalty for the onion approach has
been hugely reduced from the days of MapReduce: the I/O writes and reads in between
onion layers are largely mitigated, at least in the cases where the data can fit in
memory. With that said, the BSP model is very different in that it only has to send
the messages between the distributed processes, whereas the onion joining will have
to resend everything.
In the next two subsections we will dive into the two most popular graph processing
frameworks for Hadoop today. First will be Giraph, which was born out of Yahoo! and
used by Facebook as part of its graph search. Giraph is the more mature and stable
system, with the claim of handling up to a trillion edges.
The second tool is the newer GraphX, which is part of the Apache Spark project.
Spark GraphX gets a lot of its roots from GraphLab, an earlier open source graph
processing project, and is an extension built on Spark’s generic DAG execution
engine. Although still young and not as tuned and stable as Giraph, GraphX still
holds a lot of promise because of its ease of use and integration with all other
components of Spark.

Giraph
Giraph is an open source implementation of Google’s Pregel. From the ground up,
Giraph is built for graph processing. This differs from Spark’s GraphX, which as
noted, contains an implementation of the Pregel API built on the Spark DAG engine.
To get a simple view of Giraph, let’s remove a lot of its details and focus on the three
main stages of a Giraph program (see Figure 5-6):
1. Read and partition the data.
2. Batch-process the graph with BSP.
3. Write the graph back to disk.

Figure 5-6. The three main stages of a Giraph program

There are many other details to Giraph that we can’t cover here. Our intent is to
provide enough detail for you to decide which tools belong in your architecture.
Let’s dig into these stages in more detail and look at the code we will have to
implement to customize these stages to our data and the zombie biting problem.

Read and Partition the Data
Just as MapReduce and Spark have input formats, Giraph has VertexInputFormat. In
both cases, an input format takes care of providing the splits and the record or vertex
reader. In our implementation we will stick with the default split logic and only
override the reader. So our ZombieTextVertexInputFormat is as simple as the following:
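import java.io.IOException;
import org.apache.giraph.io.formats.TextVertexInputFormat;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Type parameters: vertex ID, vertex value, edge value.
public class ZombieTextVertexInputFormat extends
    TextVertexInputFormat<LongWritable, Text, LongWritable> {
  @Override
  public TextVertexReader createVertexReader(InputSplit split,
                                             TaskAttemptContext context)
      throws IOException {
    return new ZombieTextReader();
  }
}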

The next thing we need is a VertexReader. The main difference from a normal
MapReduce RecordReader is that a RecordReader returns a key and a value Writable,
whereas the VertexReader returns a Vertex object.
So, what is a Vertex object? It is made up of three parts:
Vertex ID
This is an ID that uniquely identifies a vertex in our graph.
Vertex value
This is an object that contains information about our vertex. In our example, it
will store the state of our human or zombie and at which step he or she turned
into a zombie. For simplicity we will use a string that looks like "Human" or
"Zombie.2" for a zombie that was bitten on the second superstep.
Edge
This is made up of two parts: the vertex ID of the target vertex that the edge points
to, and an object that can represent information about the edge, such as what type of
edge it is (for example, a relationship, a distance, or a weight).
So now that we know what a vertex is, let’s see what a vertex looks like in our source
file:
{vertexId}|{Type}|{comma-separated vertexId of "bitable" people}
2|Human|4,6

This is a vertex with the ID of 2 that is currently a human and is connected to vertices
4 and 6 with a directional edge. So, let’s look at the code that will take this line and
turn it into a Vertex object:
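// TextVertexReader is an inner class of TextVertexInputFormat, so this reader
// is declared inside ZombieTextVertexInputFormat.
public class ZombieTextReader extends TextVertexReader {
  @Override
  public boolean nextVertex() throws IOException, InterruptedException {
    // Delegate to the wrapped line reader; true while lines remain.
    return getRecordReader().nextKeyValue();
  }

  @Override
  public Vertex<LongWritable, Text, LongWritable> getCurrentVertex()
      throws IOException, InterruptedException {
    // Line format: {vertexId}|{Type}|{comma-separated target vertex IDs}
    Text line = getRecordReader().getCurrentValue();
    String[] majorParts = line.toString().split("\\|");
    LongWritable id = new LongWritable(Long.parseLong(majorParts[0]));
    Text value = new Text(majorParts[1]);
    ArrayList<Edge<LongWritable, LongWritable>> edgeIdList =
        new ArrayList<Edge<LongWritable, LongWritable>>();
    // The third field is optional: a vertex may have no outgoing edges.
    if (majorParts.length > 2) {
      String[] edgeIds = majorParts[2].split(",");
      for (String edgeId : edgeIds) {
        DefaultEdge<LongWritable, LongWritable> edge =
            new DefaultEdge<LongWritable, LongWritable>();
        LongWritable longEdgeId = new LongWritable(Long.parseLong(edgeId));
        edge.setTargetVertexId(longEdgeId);
        edge.setValue(longEdgeId); // dummy value
        edgeIdList.add(edge);
      }
    }
    Vertex<LongWritable, Text, LongWritable> vertex = getConf().createVertex();
    vertex.initialize(id, value, edgeIdList);
    return vertex;
  }
}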

There’s a lot going on in this code, so let’s break it down:
• Our VertexReader extends TextVertexReader, so we are reading text files line by
line. Note that we’d have to change our parent reader if we intended to read any
other Hadoop file type.
• nextVertex() is an interesting method. If you drill down into the parent class,
you’ll see that it uses the normal RecordReader to try to read the next line and
returns whether there is something left.
• The getCurrentVertex() method is where we parse the line and create and
populate a Vertex object.
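The parsing done in getCurrentVertex() can be tried without a cluster. The sketch below repeats the same split logic in plain Java; the class ZombieLineParser and its ParsedVertex holder are hypothetical stand-ins for Giraph's Vertex and Edge objects, used here only so the example runs with the JDK alone.

```java
import java.util.ArrayList;
import java.util.List;

class ZombieLineParser {
    // Plain-Java stand-in for a vertex: ID, value, and outgoing edge targets.
    static final class ParsedVertex {
        final long id;
        final String value;
        final List<Long> targets;

        ParsedVertex(long id, String value, List<Long> targets) {
            this.id = id;
            this.value = value;
            this.targets = targets;
        }
    }

    // Line format: {vertexId}|{Type}|{comma-separated target vertex IDs}
    static ParsedVertex parse(String line) {
        String[] majorParts = line.split("\\|"); // '|' must be escaped in a regex
        long id = Long.parseLong(majorParts[0]);
        String value = majorParts[1];
        List<Long> targets = new ArrayList<>();
        if (majorParts.length > 2) { // the edge list is optional
            for (String edgeId : majorParts[2].split(",")) {
                targets.add(Long.parseLong(edgeId));
            }
        }
        return new ParsedVertex(id, value, targets);
    }

    public static void main(String[] args) {
        ParsedVertex v = parse("2|Human|4,6");
        System.out.println(v.id + " " + v.value + " " + v.targets);
    }
}
```

Like the reader above, this sketch does no validation: a malformed ID would surface as a NumberFormatException.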
As this method fires, the resulting Vertex objects are partitioned to the
different distributed workers across the cluster. The default partitioning logic is a
basic hash partition, but it can be modified. This is out of scope for this example, but
just note you have control over the partitioning. If you can identify patterns that will
force clumps of the graph onto fewer distributed tasks, then the result may be less
network usage and a corresponding increase in speed.
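To see why partitioning affects network usage, consider this rough sketch of a hash partition. It is not Giraph's partitioner: the class HashPartitionSketch and the rule "hash of the vertex ID modulo the worker count" are assumptions chosen to illustrate the idea. An edge whose two ends land on different workers is one whose messages must cross the network.

```java
import java.util.List;
import java.util.Map;

class HashPartitionSketch {
    // Assumed rule for illustration: worker = hash(vertexId) mod workerCount.
    static int workerFor(long vertexId, int workerCount) {
        return Math.floorMod(Long.hashCode(vertexId), workerCount);
    }

    // Counts edges whose source and target live on different workers.
    static int remoteEdges(Map<Long, List<Long>> edges, int workerCount) {
        int remote = 0;
        for (Map.Entry<Long, List<Long>> e : edges.entrySet()) {
            int sourceWorker = workerFor(e.getKey(), workerCount);
            for (long target : e.getValue()) {
                if (workerFor(target, workerCount) != sourceWorker) {
                    remote++;
                }
            }
        }
        return remote;
    }

    public static void main(String[] args) {
        // The sample vertex from the source file: 2|Human|4,6
        Map<Long, List<Long>> edges = Map.of(2L, List.of(4L, 6L));
        System.out.println(remoteEdges(edges, 4) + " remote edge(s) with 4 workers");
    }
}
```

With four workers, vertices 2 and 6 share a worker under this rule while vertex 4 does not, so one of the two edges is remote. A partitioning scheme that keeps connected vertices together would lower that count.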
Once the data is loaded in memory (or disk with the new spill-to-disk functionality in
Giraph), we can move to processing with BSP in the next sub-section.
Before we move on, note that this is just an example of the VertexInputFormat.
There are more advanced options in Giraph like reading in vertices and edges
through different readers and advanced partitioning strategies, but that is out of
scope for this book.

Batch Process the Graph with BSP
Of all the parts of Giraph, the BSP execution pattern is the hardest to understand for
newcomers. To make it easier, let’s focus on three computation stages: vertex, master,
and worker. We will go through the code for these three stages soon, but check out
Figure 5-7 first.

Figure 5-7. Three computation stages of the BSP execution pattern: vertex, master, and
worker

Hopefully, from the image you can see that each BSP pass will start with a master
computation stage. Then it will follow with a worker computation stage on each
distributed JVM, followed by a vertex computation for every vertex in that JVM’s local
memory or local disk.
These vertex computations may send messages to other vertices, but the receiving
vertices will not get those messages until the next BSP pass.
Let’s start with the simplest of the computation stages, the master compute:
import org.apache.giraph.aggregators.LongSumAggregator;
import org.apache.giraph.master.DefaultMasterCompute;
import org.apache.hadoop.io.LongWritable;

public class ZombieMasterCompute extends DefaultMasterCompute {
  // Prints the aggregated zombie count and graph totals for debugging.
  @Override
  public void compute() {
    LongWritable zombies = getAggregatedValue("zombie.count");
    System.out.println("Superstep " + String.valueOf(getSuperstep()) +
        " - zombies:" + zombies);
    System.out.println("Superstep " + String.valueOf(getSuperstep()) +
        " - getTotalNumEdges():" + getTotalNumEdges());
    System.out.println("Superstep " + String.valueOf(getSuperstep()) +
        " - getTotalNumVertices():" +
        getTotalNumVertices());
  }

  // Registers the aggregator that the compute stages read and update.
  @Override
  public void initialize()
      throws InstantiationException, IllegalAccessException {
    registerAggregator("zombie.count", LongSumAggregator.class);
  }
}

Let’s dig into the two methods in the ZombieMasterCompute class. First, we’ll look at
the initialize() method. This is called before we really get started. The important
thing we are doing here is registering an Aggregator class.
An Aggregator class is like an advanced counter in MapReduce but more like the
accumulators in Spark. There are many aggregators to select from in Giraph, as
shown in the following list, but there is nothing stopping you from creating your own
custom one.
Here are some examples of Giraph aggregators:
• Sum
• Avg
• Max
• Min
• TextAppend
• Boolean And/Or
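The idea behind a sum aggregator can be sketched in plain Java. This is a simplified model, not Giraph's implementation: the class SumAggregatorSketch and its endSuperstep() method are hypothetical, and the sketch assumes that values aggregated during one superstep become readable in the next and that the running total is reset at each barrier.

```java
class SumAggregatorSketch {
    private long current;   // accumulated during the running superstep
    private long published; // total from the previous superstep, readable by all

    // Called by vertex computations to contribute a value.
    void aggregate(long value) {
        current += value;
    }

    // What master, worker, and vertex code would read.
    long getAggregatedValue() {
        return published;
    }

    // Called at the barrier between supersteps (assumed reset behavior).
    void endSuperstep() {
        published = current;
        current = 0;
    }

    public static void main(String[] args) {
        SumAggregatorSketch zombieCount = new SumAggregatorSketch();
        zombieCount.aggregate(1);
        zombieCount.aggregate(1);
        zombieCount.endSuperstep();
        System.out.println(zombieCount.getAggregatedValue());
    }
}
```

The delay between contributing a value and reading the total mirrors the BSP rule for messages: nothing sent in a superstep is visible until the next one.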
The second method in the ZombieMasterCompute class is compute(), and this will fire
at the start of every BSP superstep. In this case we are just printing out some
information that will help us debug our process.
On to the next bit of code, which is the ZombieWorkerContext class for the worker
computation stage. This is what will execute before and after the application and each
superstep. It can be used for advanced purposes like making aggregated values available
at the start of a superstep so that they are accessible to a vertex compute step. But,
for this simple example, we are doing nothing more than using System.out.println() so
that we can see when these different methods are being called during processing:
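import org.apache.giraph.worker.WorkerContext;

public class ZombieWorkerContext extends WorkerContext {
  // Runs once on each worker before the first superstep.
  @Override
  public void preApplication() {
    System.out.println("PreApplication # of Zombies: " +
        getAggregatedValue("zombie.count"));
  }

  // Runs once on each worker after the last superstep.
  @Override
  public void postApplication() {
    System.out.println("PostApplication # of Zombies: " +
        getAggregatedValue("zombie.count"));
  }

  // Runs on each worker before the vertex computations of a superstep.
  @Override
  public void preSuperstep() {
    System.out.println("PreSuperstep # of Zombies: " +
        getAggregatedValue("zombie.count"));
  }

  // Runs on each worker after the vertex computations of a superstep.
  @Override
  public void postSuperstep() {
    System.out.println("PostSuperstep # of Zombies: " +
        getAggregatedValue("zombie.count"));
  }
}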
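The order in which these hooks fire can be summarized with a small trace generator. This is a simplified model built from the description in this section, not output captured from Giraph: the class LifecycleSketch is hypothetical, it assumes a single worker, and the exact interleaving on a real cluster may differ.

```java
import java.util.ArrayList;
import java.util.List;

class LifecycleSketch {
    // Builds the assumed call order for one worker.
    static List<String> trace(int supersteps, int verticesPerWorker) {
        List<String> calls = new ArrayList<>();
        calls.add("worker.preApplication");
        for (int s = 0; s < supersteps; s++) {
            calls.add("master.compute(" + s + ")");       // master stage first
            calls.add("worker.preSuperstep(" + s + ")");  // then the worker stage
            for (int v = 0; v < verticesPerWorker; v++) {
                calls.add("vertex.compute(" + s + "," + v + ")");
            }
            calls.add("worker.postSuperstep(" + s + ")");
        }
        calls.add("worker.postApplication");
        return calls;
    }

    public static void main(String[] args) {
        for (String call : trace(2, 2)) {
            System.out.println(call);
        }
    }
}
```

Comparing a trace like this with the System.out.println() output of a real run is a quick way to check your understanding of when each method is called.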

Last and most complex is the vertex computation stage:
public class ZombieComputation
    extends BasicComputation {
  private static final Logger LOG = Logger.getLogger(ZombieComputation.class);
  Text zombieText = new Text("Zombie");
  LongWritable longIncrement = new LongWritable(1);
