6.9 Case study: event log processing with Apache Flume
records every day. If you multiply that by the number of servers to monitor, well, you
get the picture: it’s big data.
Few organizations store their raw event log data in RDBMSs, because they don’t
need the update and transactional processing features. Because NoSQL systems scale
and integrate with tools like MapReduce, they’re cost effective when you’re looking to
analyze event log data.
Though we’ll use the term event log data to describe this data, a more precise term
is timestamped immutable data streams. Timestamped immutable data is created once
but never updated, so you don’t have to worry about update operations. You only
need to focus on the reliable storage of the records and the efficient analysis of the
data, which is the case with many big data problems.
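To make the idea of timestamped immutable data concrete, here is a minimal Python sketch. The names `LogEvent` and `EventLog` are hypothetical and chosen only for illustration; they don't come from any library. The sketch assumes a store that supports only append and read operations, which is all this kind of data needs.

```python
from dataclasses import dataclass, FrozenInstanceError
from datetime import datetime, timezone


@dataclass(frozen=True)
class LogEvent:
    """One immutable event record (hypothetical shape, for illustration)."""
    timestamp: datetime
    host: str
    message: str


class EventLog:
    """Append-only store: supports append and read, no update or delete."""

    def __init__(self):
        self._events = []

    def append(self, event: LogEvent) -> None:
        self._events.append(event)

    def events(self) -> tuple:
        # Return a tuple so callers cannot modify the stored sequence.
        return tuple(self._events)


log = EventLog()
log.append(LogEvent(datetime(2013, 1, 1, tzinfo=timezone.utc), "web01", "started"))

try:
    log.events()[0].message = "edited"   # any attempt to update is rejected
    update_rejected = False
except FrozenInstanceError:
    update_rejected = True
```

Because no record can change after it's written, the store never needs locking or transaction control for updates, which is the property the text relies on.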
Distributed log file analysis is critical to allow an organization to quickly find errors
in systems and take corrective action before services are disrupted. It’s also a good
example of the need for both real-time analysis and batch analysis of large datasets.
Challenges of event log data analysis
If you’ve ever been responsible for monitoring web or database servers, you know that
you can see what’s happening on a server by looking at its detailed log file. Log events
add a record to the log file when your system starts up, when a job runs, and when
warnings or errors occur.
Events are classified according to their severity level using a standardized set of
severity codes. An example of these codes (from lowest to highest severity level) might
be TRACE, DEBUG, INFO, WARNING, ERROR, or FATAL. These codes have been standardized in the Java Log4j system.
Most events found in log files are informational (INFO level) events. They tell you
how fast a web page is served or how quickly a query is executed. Informational events
are generally used for looking at system averages and monitoring performance. Other
event types such as WARNING, ERROR, or FATAL events are critical and should notify
an operator to take action or intervene.
Filtering and reporting on log events on a single system is straightforward and can
be done by writing a script that searches for keywords in the log file. In contrast, big
data problems occur when you have hundreds or thousands of systems all generating
events on servers around the world. The challenge is to create a mechanism to get
immediate notification of critical events and allow the noncritical events to be ignored.
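For the single-system case, the kind of script described above might look like the following Python sketch. It assumes a hypothetical line layout of timestamp, severity, and message separated by whitespace; real log layouts vary and depend on how logging is configured. The function name `critical_lines` and the sample lines are invented for illustration.

```python
import re

# Severity names from the text, ordered lowest to highest.
SEVERITIES = ["TRACE", "DEBUG", "INFO", "WARNING", "ERROR", "FATAL"]
RANK = {name: i for i, name in enumerate(SEVERITIES)}

# Assumed line layout: "<timestamp> <SEVERITY> <message>".
LINE_PATTERN = re.compile(r"^(\S+)\s+(" + "|".join(SEVERITIES) + r")\s+(.*)$")


def critical_lines(lines, threshold="WARNING"):
    """Yield (timestamp, severity, message) for events at or above threshold."""
    for line in lines:
        match = LINE_PATTERN.match(line)
        if match and RANK[match.group(2)] >= RANK[threshold]:
            yield match.groups()


sample = [
    "2013-01-01T10:00:00 INFO page /home served in 120 ms",
    "2013-01-01T10:00:01 ERROR database connection lost",
    "2013-01-01T10:00:02 DEBUG cache hit",
    "2013-01-01T10:00:03 FATAL out of memory",
]
alerts = list(critical_lines(sample))
```

A script like this works on one file on one machine. It offers no delivery guarantees and no aggregation across servers, which is where the distributed problem begins.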
A common solution to this problem is to create two channels of communication
between a server and the operations center. Figure 6.14 shows how these channels
work. At the top of the diagram, you see where all events are pulled from the server,
transformed, and then the aggregates updated in a reliable filesystem such as HDFS.
In the lower part of the diagram, you see the second channel, where critical events
are retrieved from the server and sent directly to the operations dashboard for immediate action.
Using NoSQL to manage big data
Figure 6.14 Critical time-sensitive events must be quickly extracted from log event streams
and routed directly to an operator’s console. Other events are processed in bulk using
MapReduce transforms after they’ve been stored in a reliable filesystem such as HDFS.
To meet these requirements, your system must meet the following objectives:
It must filter out time-sensitive events based on a set of rules.
It must efficiently and reliably transmit all events in large batch files to a centralized event store.
It must reliably route all time-sensitive events using a fast channel.
Let’s see how you can meet these objectives with Apache Flume.
How Apache Flume works to gather distributed event data
Apache Flume is an open source Java framework specifically designed to process event
log data. The word flume refers to a water-filled trough used to transport logs in the
lumber industry. Flume is designed to provide a distributed, reliable, and available system for efficiently collecting, aggregating, and moving large amounts of log data from
different sources to a centralized data store. Because Flume was created by members
of the Hadoop community, HDFS and HBase are the most common storage targets.
Flume is built around the concept of a flow pipeline, as depicted in figure 6.15.
Figure 6.15 The key components of an Apache Flume flow pipeline.
Data arrives at a Flume agent through a source component that’s driven
by a client Java component. The agent contains multiple data channels
that are made available to one or more sink objects.
Figure 6.16 How Log4j agents might be configured to write log data to a Flume
agent with a slow and a fast channel. All data will be written directly to HDFS. All
time-critical data will be written directly to an operator console.
Here’s a narrative of how a flow pipeline works:
A client program such as a Log4jAppender writes log data into a log file. The client program typically is part of the application being monitored.
A source within a Flume agent program receives all events and writes them to one or
more durable channels. Channel events persist even if a server goes down and
must be restarted.
Once an event arrives in a channel, it’ll stay there until a sink service removes it.
The channel is responsible for making sure that all events are reliably delivered
to a sink.
A sink is responsible for pulling events off of the channel and delivering them
to the next stage. This can be another source of another Flume agent, or a terminal destination such as HDFS. Terminal sinks will typically store the event in
three or more separate servers to provide redundancy in case of a node failure.
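The flow described in these steps can be sketched in plain Python. This is a toy simulation, not Flume's API: the classes `Channel` and `Source`, the `drain` function, and the `CRITICAL` rule set are all hypothetical names. The channel here is an in-memory queue, whereas a durable channel would persist events across restarts. The sketch also assumes one slow channel that receives every event and one fast channel that receives only critical ones.

```python
from collections import deque

# Assumed rule: these severities count as time-critical.
CRITICAL = {"WARNING", "ERROR", "FATAL"}


class Channel:
    """In-memory stand-in for a channel; a durable channel would persist events."""

    def __init__(self):
        self._queue = deque()

    def put(self, event):
        self._queue.append(event)

    def take_all(self):
        """Remove and return every queued event, oldest first."""
        events = list(self._queue)
        self._queue.clear()
        return events


class Source:
    """Receives events and writes each one to the appropriate channels."""

    def __init__(self, slow, fast):
        self.slow, self.fast = slow, fast

    def receive(self, event):
        self.slow.put(event)                 # every event goes to bulk storage
        if event["severity"] in CRITICAL:    # rule-based filter for the fast path
            self.fast.put(event)


def drain(channel, destination):
    """Sink step: pull events off a channel and deliver them downstream."""
    destination.extend(channel.take_all())


slow, fast = Channel(), Channel()
source = Source(slow, fast)
for severity, message in [("INFO", "page served"), ("ERROR", "disk full"), ("INFO", "job done")]:
    source.receive({"severity": severity, "message": message})

bulk_store, console = [], []     # stand-ins for HDFS and the operator console
drain(slow, bulk_store)
drain(fast, console)
```

Events stay in a channel until a sink removes them, so a slow bulk destination never delays delivery on the fast path.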
Now let’s look at how you can configure Apache Flume to meet the specific slow and
fast processing requirements we just described. Figure 6.16 is an example of this
Once log events are stored in HDFS, a scheduled batch tool can periodically summarize totals and averages for various events. For example, a report
might generate average response times of web services or web page rendering times.
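As a rough illustration of such a report, the following Python sketch computes average response times per service in a map-then-reduce style. It's a single-process stand-in for a MapReduce job, not Hadoop code. The event fields `service` and `response_ms` and the function names are assumptions made for the example.

```python
from collections import defaultdict


def map_event(event):
    """Map step: emit (key, value) = (service name, response time in ms)."""
    return event["service"], event["response_ms"]


def reduce_average(pairs):
    """Reduce step: group values by key and average each group."""
    totals = defaultdict(lambda: [0.0, 0])   # key -> [sum, count]
    for key, value in pairs:
        totals[key][0] += value
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


events = [
    {"service": "search", "response_ms": 100},
    {"service": "search", "response_ms": 300},
    {"service": "login", "response_ms": 50},
]
averages = reduce_average(map_event(e) for e in events)
```

On a cluster, the map step would run on the nodes that hold the data, and the partial sums and counts would be combined per key. The arithmetic is the same.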
This case study showed how Apache Flume supplies an infrastructure for allowing programs to subscribe to key events and route them to different services with different
Apache Flume is a custom-built framework specifically created with the intent of
reliably transferring event log data into a central data store such as HDFS. HDFS, in
turn, is ideally suited to storing large blocks of read-mostly data. HDFS has no extra
overhead for transaction control or update operations; its focus is large and reliable
storage. HDFS is designed as an efficient source for all your analytical reports written
in MapReduce. Since data can be evenly distributed over hundreds of nodes in a
Hadoop cluster, the MapReduce reports can quickly build whatever summary data you
need. This is ideal for creating materialized views and storing them in your RDBMSs or
Although Apache Flume was originally written for processing log files, it’s a general-purpose tool and can be used on other types of immutable big data problems such as
data loggers or raw data from web crawling systems. As data loggers get lower in price,
tools like Apache Flume will be needed to preprocess more big data problems.
6.10 Case study: computer-aided discovery of health care fraud
In this case study, we’ll take a look at a problem that can’t be easily solved using a
shared-nothing architecture. This is the problem of looking for patterns of fraud
using large graphs. Highly connected graphs aren’t partition tolerant—meaning that
you can’t divide the queries on a graph across two or more shared-nothing processors. If
your graph is too large to fit in the RAM of a commodity processor, you may need to
look at an alternative to a shared-nothing system.
This case study is important because it explores the limits of what a cluster of
shared-nothing systems can do. We include this case study because we want to avoid a
tendency for architects to recommend large shared-nothing clusters for all problems.
Although shared-nothing architectures work for many big data problems, they don’t
provide for linear scaling of highly connected data such as graphs or RDBMSs containing joins. Looking for hidden patterns in large graphs is one area that’s best solved
with a custom hardware approach.
6.10.1 What is health care fraud detection?
The US Office of Management and Budget estimates that improper
payments in Medicare and Medicaid came to $50.7 billion in 2010, nearly 8.5% of the
annual Medicare budget. A portion of this staggering figure is the result of improper
documentation, but it’s certain that Medicare fraud costs taxpayers tens of billions of
Existing efforts to detect fraud have focused on searching for suspicious submissions from individual beneficiaries and health care providers. These efforts yielded
$4.1 billion in fraud recovery in 2011, around 10% of the total estimated fraud.
Unfortunately, fraud is becoming more sophisticated, and detection must move
beyond the search for individuals to the discovery of patterns of collusion among multiple beneficiaries and/or health care providers. Identifying these patterns is challenging, as fraudulent behaviors continuously change, requiring the analyst to hypothesize
that a pattern of relationships could indicate fraud, visualize and evaluate the results,
and iteratively refine their hypothesis.
6.10.2 Using graphs and custom shared-memory hardware to detect health care fraud
Graphs are valuable in situations where data discovery is required. Graphs can show
relationships between health care beneficiaries, their claims, associated care providers, tests performed, and other relevant data. Graph analytics search through the data
to find patterns of relationships between all of these entities that might indicate collusion to commit fraud.
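One relationship pattern an analyst might hypothesize is that two providers share an unusually large number of beneficiaries. The Python sketch below counts shared beneficiaries for every provider pair. The claim records, the threshold `min_shared`, and the function names are all invented for illustration. This isn't the detection method used in the case study, only an example of what a query over relationships looks like.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical claim records: (beneficiary, provider).
claims = [
    ("b1", "pA"), ("b1", "pB"),
    ("b2", "pA"), ("b2", "pB"),
    ("b3", "pA"), ("b3", "pB"),
    ("b4", "pC"), ("b1", "pC"),
]


def shared_beneficiaries(claims):
    """Count, for every pair of providers, the beneficiaries they have in common."""
    providers_of = defaultdict(set)          # beneficiary -> providers seen
    for beneficiary, provider in claims:
        providers_of[beneficiary].add(provider)

    shared = defaultdict(int)                # (provider, provider) -> count
    for providers in providers_of.values():
        for pair in combinations(sorted(providers), 2):
            shared[pair] += 1
    return dict(shared)


def suspicious_pairs(claims, min_shared=3):
    """Pairs whose overlap meets an analyst-chosen threshold (an assumption)."""
    return {pair: n for pair, n in shared_beneficiaries(claims).items() if n >= min_shared}


flagged = suspicious_pairs(claims)
```

A flagged pair is only a starting point for review. As the text notes, the analyst then visualizes the result, evaluates it, and refines the hypothesis.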
The graph representing Medicare data is large: it represents six million providers,
a hundred million patients, and billions of claim records. The graph data is interconnected between health care providers, diagnostic tests, and common treatments associated with each patient and their claim records. This amount of data can’t be held in
the memory of a single server, and partitioning the data across multiple nodes in a
computing cluster isn’t feasible. Attempts to do so may result in incomplete queries
due to all the links crossing partition boundaries, the need to page data in and out of
memory, and the delays added by slower network and storage speeds. Meanwhile,
fraud continues to occur at an alarming rate.
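A small calculation shows why partitioning hurts on highly connected data. The Python sketch below builds a toy graph in which every node is linked to every other node, splits the nodes into equal contiguous blocks, and measures the fraction of edges that cross a partition boundary. The graph size and the placement rule are assumptions chosen to keep the arithmetic easy to check. Real graphs and real partitioners are far more complex.

```python
def cut_fraction(edges, partition_of):
    """Fraction of edges whose two endpoints land in different partitions."""
    crossing = sum(1 for u, v in edges if partition_of(u) != partition_of(v))
    return crossing / len(edges)


N = 8
# Fully connected toy graph on N nodes: one edge per unordered pair (28 edges).
dense_edges = [(u, v) for u in range(N) for v in range(u + 1, N)]


def blocks(parts):
    """Placement rule: contiguous blocks of node ids, one block per partition."""
    size = N // parts
    return lambda node: node // size


two_way = cut_fraction(dense_edges, blocks(2))    # 16 of 28 edges cross
four_way = cut_fraction(dense_edges, blocks(4))   # 24 of 28 edges cross
```

In this toy case, more than half of the edges cross a boundary with two partitions, and the share grows as partitions are added. Any query that follows those edges must wait on the network.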
Medicare fraud analytics requires an in-memory graph solution that can merge
heterogeneous data from a variety of sources, use queries to find patterns, and discover similarities as well as exact matches. With every item of data loaded into memory, there’s no need to contend with the issue of graph partitioning. The graph can be
dynamically updated with new data easily, and existing queries can integrate the new
data into the analytics being performed, making the discovery of hidden relationships
in the data feasible.
Figure 6.17 shows the high-level architecture of how shared-memory systems are
used to look for patterns in large graphs.
With these requirements in mind, a US federally funded lab with a mandate to
identify Medicare and Medicaid fraud deployed YarcData’s Urika appliance. The
appliance is capable of scaling from 1 to 512 terabytes of memory, shared by up to 8,192
Figure 6.17 How large graphs are loaded into a central shared-memory structure. This example
shows a graph in a central multiterabyte RAM store with potentially hundreds or thousands of
simultaneous threads in CPUs performing queries on the graph. Note that, like other NoSQL
systems, the data stays in RAM while the analysis is processing. Each CPU can perform an
independent query on the graph without interfering with the others.