Chapter 20. Hive Integration with Oozie


Java action
A Java class with a main method is launched with optional arguments.
Pig action
A Pig script is run.
Hive action
A Hive HQL query is run.
DistCp action
A distcp command is run to copy data to or from another HDFS cluster.

Hive Thrift Service Action
The built-in Hive action works well but it has some drawbacks. It uses Hive as a fat
client. Most of the Hive distributions, including JARs and configuration files, need to
be copied into the workflow directory. When Oozie launches an action, it will launch
from a random TaskTracker node. There may be a problem reaching the metastore if
you have your metastore set up to only allow access from specific hosts. Since Hive can
leave artifacts like the hive-history file or some /tmp entries if a job fails, make sure to
clean up across your pool of TaskTrackers.
The fat-client challenges of Hive have been solved (mostly) by using Hive Thrift Service
(see Chapter 16). The HiveServiceBAction (Hive Service “plan B” Action) leverages the
Hive Thrift Service to launch jobs. This has the benefits of funneling all the Hive operations to a predefined set of nodes running Hive service:

$ cd ~
$ git clone git://github.com/edwardcapriolo/hive_test.git
$ cd hive_test
$ mvn wagon:download-single
$ mvn exec:exec
$ mvn install

$ cd ~
$ git clone git://github.com/edwardcapriolo/m6d_oozie.git
$ cd m6d_oozie
$ mvn install

A Two-Query Workflow
A workflow is created by setting up a specific directory hierarchy with required JAR
files, a job.properties file, and a workflow.xml file. This hierarchy has to be stored in
HDFS, but it is best to assemble the folder locally and then copy it to HDFS:

mkdir myapp
mkdir myapp/lib
cp $HIVE_HOME/lib/*.jar myapp/lib/
cp m6d_oozie-1.0.0.jar myapp/lib/
cp hive_test-4.0.0.jar myapp/lib/


The job.properties file sets the name of the filesystem and the JobTracker. Additional
properties can be set here to be used as Hadoop job configuration properties:
The job.properties file:

The workflow.xml is the file where actions are defined:



The first action creates the target table:

CREATE TABLE IF NOT EXISTS zz_zz_abc (a int, b int)

The second action populates the table:

INSERT OVERWRITE TABLE zz_zz_abc SELECT dma_code, site_id
FROM BCO WHERE dt=20120426 AND offer=4159 LIMIT 10


<kill name="fail">
    <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
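Tying the two queries together, a complete workflow.xml might look like the following sketch. The action name create_table matches the wf:actionData reference used later in this chapter; however, the main class and the Hive Thrift Service host/port arguments are assumptions based on how HiveServiceBAction is described, not a verified listing:

```xml
<workflow-app xmlns="uri:oozie:workflow:0.2" name="two-query-wf">
  <start to="create_table"/>

  <action name="create_table">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.m6d.oozie.HiveServiceBAction</main-class>
      <!-- Host and port of a node running the Hive Thrift Service (placeholders) -->
      <arg>hiveservice.example.com</arg>
      <arg>10000</arg>
      <arg>CREATE TABLE IF NOT EXISTS zz_zz_abc (a int, b int)</arg>
    </java>
    <ok to="insert_query"/>
    <error to="fail"/>
  </action>

  <action name="insert_query">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <main-class>com.m6d.oozie.HiveServiceBAction</main-class>
      <arg>hiveservice.example.com</arg>
      <arg>10000</arg>
      <arg>INSERT OVERWRITE TABLE zz_zz_abc SELECT dma_code, site_id
           FROM BCO WHERE dt=20120426 AND offer=4159 LIMIT 10</arg>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
  </kill>
  <end name="end"/>
</workflow-app>
```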

Oozie Web Console
The Oozie web console is helpful for troubleshooting jobs. Oozie launches each action
inside a map task and captures all the input and output. Oozie does a good job presenting this information as well as providing links to job status pages found on the
Hadoop JobTracker web console.
Here is a screenshot of the Oozie web console.

Variables in Workflows
A workflow based on completely static queries is useful but not overly practical. Most
of the use cases for Oozie run a series of processes against files for today or this week.


In the previous workflow, you may have noticed the KILL tag and the interpolated
variable inside of it:

<message>Java failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>

Oozie provides an EL (Expression Language) to access variables. Key-value pairs
defined in job.properties can be referenced this way.
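For example, a key defined in job.properties can be interpolated into workflow.xml with the ${} syntax; the key name targetTable below is hypothetical:

```xml
<!-- job.properties contains the line: targetTable=zz_zz_abc -->
<arg>CREATE TABLE IF NOT EXISTS ${targetTable} (a int, b int)</arg>
```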

Capturing Output
Oozie also has a <capture-output/> tag that can be placed inside an action. Output
captured can be emailed with an error or sent to another process. Oozie sets a Java
property in each action that can be used as a filename to write output to. The code
below shows how this property is accessed:
private static final String
    OOZIE_ACTION_OUTPUT_PROPERTIES = "oozie.action.output.properties";

public static void main(String args[]) throws Exception {
    String oozieProp = System.getProperty(OOZIE_ACTION_OUTPUT_PROPERTIES);
    ...
}

Your application can output data to that location.
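A minimal sketch of a complete action class that writes one key/value pair to that location follows; the class name, the key x, and its value are illustrative, not taken from the original listing:

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.Properties;

public class OozieOutputExample {
  private static final String
      OOZIE_ACTION_OUTPUT_PROPERTIES = "oozie.action.output.properties";

  public static void main(String[] args) throws Exception {
    // Oozie sets this system property to the file it reads after the action ends
    String oozieProp = System.getProperty(OOZIE_ACTION_OUTPUT_PROPERTIES);
    if (oozieProp == null) {
      throw new IllegalStateException("not running inside an Oozie action");
    }
    Properties props = new Properties();
    props.setProperty("x", "20120426"); // later readable via actionData['x']
    // The standard java.util.Properties format is what Oozie parses
    OutputStream os = new FileOutputStream(new File(oozieProp));
    try {
      props.store(os, "");
    } finally {
      os.close();
    }
  }
}
```

Any key written this way becomes accessible to later actions in the workflow.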

Capturing Output to Variables
We have discussed both capturing output and Oozie variables; using them together
provides what you need for daily workflows.
Looking at our previous example, we see that we are selecting data from a hardcoded
day: FROM BCO WHERE dt=20120426. We would like to run this workflow every day, so we
need to substitute the hardcoded dt=20120426 with a date:
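One simple way to produce such a dated key=value pair is the date command with a + format string; this is a sketch of the idea, and the action's actual implementation may differ:

```shell
# Emit today's date as a key=value pair suitable for an Oozie
# action's output properties file (the key name "x" is illustrative)
date +x=%Y%m%d
```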





This will produce output like:

x=20120426

You can then access this output later in the process:
You said ${wf:actionData('create_table')['x']}

There are many more things you can do with Oozie, including integrating Hive jobs
with jobs implemented with other tools, such as Pig, Java MapReduce, etc. See the
Oozie website for more details.



Hive and Amazon Web Services (AWS)

—Mark Grover

One of the services that Amazon provides as a part of Amazon Web Services (AWS) is
Elastic MapReduce (EMR). With EMR comes the ability to spin up a cluster of nodes
on demand. These clusters come with Hadoop and Hive installed and configured. (You
can also configure the clusters with Pig and other tools.) You can then run your Hive
queries and terminate the cluster when you are done, only paying for the time you used
the cluster. This section describes how to use Elastic MapReduce, some best practices,
and wraps up with pros and cons of using EMR versus other options.
You may wish to refer to the online AWS documentation available at
http://aws.amazon.com/elasticmapreduce/ while reading this chapter. This chapter
won't cover all the details of using Amazon EMR with Hive. It is designed to provide
an overview and discuss some practical details.

Why Elastic MapReduce?
Small teams and start-ups often don't have the resources to set up their own cluster.
An in-house cluster is a fixed cost: it requires an initial investment, the effort to set up
servers and switches, and ongoing maintenance of a Hadoop and Hive installation.
On the other hand, Elastic MapReduce comes with a variable cost, plus the installation
and maintenance is Amazon’s responsibility. This is a huge benefit for teams that can’t
or don’t want to invest in their own clusters, and even for larger teams that need a test
bed to try out new tools and ideas without affecting their production clusters.

An Amazon cluster is composed of one or more instances. Instances come in various
sizes, with different amounts of RAM, compute power, disk capacity, platform, and I/O
performance. It can be hard to determine which size will work best for your use case.
With EMR, it's easy to start with small instance sizes, monitor performance with tools
like Ganglia, and then experiment with different instance sizes to find the best balance
of cost versus performance.

Before You Start
Before using Amazon EMR, you need to set up an Amazon Web Services (AWS) account. The Amazon EMR Getting Started Guide provides instructions on how to sign
up for an AWS account.
You will also need to create an Amazon S3 bucket for storing your input data and
retrieving the output results of your Hive processing.
When you set up your AWS account, make sure that all your Amazon EC2 instances,
key pairs, security groups, and EMR jobflows are located in the same region to avoid
cross-region transfer costs. Try to locate your Amazon S3 buckets and EMR jobflows
in the same region for better performance.
Although Amazon EMR supports several versions of Hadoop and Hive, only some
combinations of versions of Hadoop and Hive are supported. See the Amazon EMR
documentation to find out the supported version combinations of Hadoop and Hive.

Managing Your EMR Hive Cluster
Amazon provides multiple ways to bring up, terminate, and modify a Hive cluster.
Currently, there are three ways you can manage your EMR Hive cluster:
EMR AWS Management Console (web-based frontend)
This is the easiest way to bring up a cluster and requires no setup. However, as you
start to scale, it is best to move to one of the other methods.
EMR Command-Line Interface
This allows users to manage a cluster using a simple Ruby-based CLI, named
elastic-mapreduce. The Amazon EMR online documentation describes how to
install and use this CLI.
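As an illustration only, launching an interactive Hive cluster with this CLI looked roughly like the command below; the exact flags varied across versions of the elastic-mapreduce tool, so consult the EMR documentation for your version:

```shell
# Launch a small cluster with Hive installed and keep it alive until
# explicitly terminated (flags are from the old Ruby elastic-mapreduce
# CLI and may differ in your version; requires AWS credentials)
elastic-mapreduce --create --alive \
  --name "hive-test-cluster" \
  --num-instances 3 \
  --instance-type m1.small \
  --hive-interactive
```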
EMR API
This allows users to manage an EMR cluster by using a language-specific SDK to
call EMR APIs. Details on downloading and using the SDKs are available in the
Amazon EMR documentation. SDKs are available for Android, iOS, Java, PHP,
Python, Ruby, Windows, and .NET. A drawback of an SDK is that sometimes
particular SDK wrapper implementations lag behind the latest version of the
EMR API.
It is common to use more than one way to manage Hive clusters.
