Configuring YARN (with the yarn-site.xml File)

- yarn.nodemanager.aux-services.mapreduce_shuffle.class: This parameter tells YARN which class implements the shuffle service for MapReduce. The value I specified for this parameter, org.apache.hadoop.mapred.ShuffleHandler, instructs YARN to use this class to perform the shuffle. The class name spells out exactly how to implement the value you set for the yarn.nodemanager.aux-services property.

To add the two parameters described here, edit the /opt/yarn/hadoop-2.6.0/etc/
hadoop/yarn-site.xml file as shown in Figure 3.10.
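For reference, here is a minimal sketch of how these two entries might look in yarn-site.xml for Hadoop 2.x; your file will typically contain other properties as well, and you should adjust the entries to match your distribution:

<configuration>
  <!-- Run the MapReduce shuffle as a NodeManager auxiliary service -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
  <!-- Class that implements the shuffle auxiliary service -->
  <property>
    <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
  </property>
</configuration>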
Although at this point we can get by with the default Hadoop values for memory usage, it's a good idea to start getting acquainted with some key YARN-related configuration parameters. One of the most important parameters for a Hadoop administrator is yarn.nodemanager.resource.memory-mb. This parameter specifies the total memory that YARN can consume on each node. Let's say you set this parameter to 40,960MB, as shown here:

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>40960</value>
</property>


If you want YARN to launch a maximum of 10 containers per node, you can do so by
specifying the yarn.scheduler.minimum-allocation-mb parameter, as shown here:

<property>
  <name>yarn.scheduler.minimum-allocation-mb</name>
  <value>4096</value>
</property>


Since YARN has at most 40GB (40,960MB) to work with on this node, setting the minimum memory per container to 4,096MB (4GB) restricts the node to running no more than 10 containers at any given time. Each container runs a single map or reduce task, and as tasks finish, new containers may be started and assigned to new tasks; however, at any given point in time, no more than 10 containers or tasks can run on this node. Note that the administrator sets the default number of reducers for MapReduce v2 jobs by configuring the mapreduce.job.reduces property in the mapred-site.xml file. The developer can override this default value either by setting the number of reducers in the driver or on the command line at runtime, as illustrated next.
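Here is a minimal sketch of the command-line override; the jar name, driver class and HDFS paths are hypothetical, and the -D option requires a driver that uses ToolRunner/GenericOptionsParser:

# Override the default reducer count for a single job run
# (wordcount.jar, the WordCount class and the paths are hypothetical examples)
$ hadoop jar wordcount.jar com.example.WordCount \
    -D mapreduce.job.reduces=4 \
    /user/alice/input /user/alice/output

Inside the driver itself, the equivalent call is job.setNumReduceTasks(4).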

Figure 3.10 Editing the yarn-site.xml file to add the YARN auxiliary service parameters


Configure HDFS (with the hdfs-site.xml File)
The hdfs-site.xml file controls the behavior of all HDFS-related Hadoop components, such as the NameNode, the Secondary NameNode and the DataNodes. You need to configure the following basic parameters to enable your single-node cluster to function.
- fs.default.name: This attribute lets you specify the URI for the cluster's NameNode. DataNodes use this URI to register with the NameNode, letting applications access the data stored on the DataNodes. Clients also use this URI to retrieve the locations of the data blocks in HDFS. It's common to specify 9000 as the port, but you can use a different port if you wish.
- dfs.replication: By default, Hadoop replicates each data block three times when writing a file. The default value, 3, is also the typically recommended value. However, since you have only a single node in this cluster, you must change the value of this parameter to 1.
- dfs.datanode.data.dir: This parameter determines exactly where on its local file system a DataNode stores its blocks. As you can see, you need to provide a normal Linux directory for storing HDFS data. Later, you'll format the NameNode, which converts this directory into something managed by HDFS rather than by the local Linux file system. Note the following about this parameter:
  - You can list the local file system directories in a comma-separated list. Make sure there aren't any spaces between a comma and the next directory path in the list.
  - You can specify different values for this parameter for each DataNode if you wish.
- dfs.namenode.name.dir: This parameter tells Hadoop where to store the NameNode's key metadata files, such as the fsimage and edits files. The value for this parameter points to a local file system directory, and only the NameNode service accesses it for reading and writing its metadata. In some ways, you can think of this as the most important Hadoop configuration parameter of all, since losing the NameNode's HDFS metadata effectively means losing all of your HDFS data. You will still have all the data blocks in the cluster, but without the metadata that describes those blocks, you can't reconstruct the original files.
- dfs.namenode.checkpoint.dir: This parameter specifies the directory where the Secondary NameNode stores its versions of the metadata-related files. The Secondary NameNode uses the directory you specify for this parameter to store the fsimage file and the edit log (the journal for the fsimage file).

Edit the /opt/yarn/hadoop-2.6.0/etc/hadoop/hdfs-site.xml file to add the appropriate
values for the parameters listed here, as shown in Figure 3.11.
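For reference, here is a minimal sketch of what these hdfs-site.xml entries might look like. The NameNode directory /var/data/hadoop/hdfs/nn matches the path that appears later in the output of the format command; the DataNode directory, the checkpoint directory and the localhost URI are assumptions that you should adapt to your own layout:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <!-- Where the NameNode keeps its fsimage and edits files -->
    <name>dfs.namenode.name.dir</name>
    <value>file:/var/data/hadoop/hdfs/nn</value>
  </property>
  <property>
    <!-- Where the Secondary NameNode keeps its checkpoints (assumed path) -->
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:/var/data/hadoop/hdfs/snn</value>
  </property>
  <property>
    <!-- Where the DataNode stores its blocks (assumed path) -->
    <name>dfs.datanode.data.dir</name>
    <value>file:/var/data/hadoop/hdfs/dn</value>
  </property>
</configuration>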
At this point, you’re all done with configuring Hadoop in a pseudo-distributed cluster.
Just two steps remain for you to access HDFS and start running MapReduce applications
in the shiny new Hadoop cluster! The next step is to format the newly created distributed
file system (HDFS). The final step is to start up the Hadoop cluster.


Figure 3.11 Editing the hdfs-site.xml file to add HDFS-related configuration properties

Operating the New Hadoop Cluster
You've finished configuring the Hadoop services, but nothing is actually running in the cluster yet. In order to start the new Hadoop cluster, you must start the services that support the two primary components of Hadoop, which are the HDFS storage system and the YARN processing system.
Before you can start the cluster services up, there's one item of business you need to take care of: the formatting of HDFS.

Formatting the Distributed File System
Before you can start using HDFS for the very first time, you must format it. As you can
probably guess, you do this only once.
Note
Formatting an existing HDFS file system essentially wipes all data on it and sets up a new HDFS file system! Formatting HDFS really means you're initializing the directory in which the NameNode stores its metadata.

As you may recall, the parameter dfs.namenode.name.dir in the hdfs-site.xml file
specifies the location where the NameNode service stores its metadata. When you run
the formatting command for the first time, it creates the necessary metadata files; when you reformat, it wipes out all the files in this directory. A real-life Hadoop administrator can't simply format a production file system to get around a technical problem; you must persist and fix the problem!


In order to format HDFS, you must log in not as the root user, which you've been doing until now, but as the user hdfs.
# cd /opt/yarn/hadoop-2.6.0/bin
# su hdfs
$ ./hdfs namenode -format
INFO common.Storage: Storage directory /var/data/hadoop/hdfs/nn has been successfully formatted.
$

Setting the Environment Variables
Before you issue the start and stop commands to control the Hadoop daemons, make
sure you export the following environment variables:
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/
export HADOOP_HOME=/opt/yarn/hadoop-2.6.0
export HADOOP_PREFIX=/opt/yarn/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

You can also place these variables in the /etc/profile.d directory, within the hadoop.sh file, as shown here:
[root@hadoop1 hadoop]# cat /etc/profile.d/hadoop.sh
export JAVA_HOME=/usr/lib/jvm/java-1.6.0-openjdk-1.6.0.0.x86_64/
export HADOOP_HOME=/opt/yarn/hadoop-2.6.0
export HADOOP_PREFIX=/opt/yarn/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
[root@hadoop1 hadoop]#

Now you’re ready to fire up the cluster daemons so you can finally start working
with your pseudo-distributed cluster!

Starting the HDFS and YARN Services
In order to work with a Hadoop cluster, you must start up the HDFS- and YARN-related services. The HDFS services are the NameNode and DataNode services. The YARN services include the ResourceManager (one per cluster), the NodeManager (one on each worker node) and the JobHistoryServer.
The following sections explain how to start the HDFS and YARN services in the
pseudo-distributed cluster.
Starting the Hadoop Services
In our simple pseudo-distributed cluster, there are three HDFS services:
- NameNode
- Secondary NameNode
- DataNode

Tip
If you configure a Standby NameNode, you don’t need the Secondary NameNode.


By default, when you create a new Hadoop 2 cluster, you’ll have a Secondary NameNode
but not a Standby NameNode, which you’ll need to explicitly configure. A Secondary
NameNode doesn’t help in failing over, so it can’t offer high availability. In our present
case, since we’re dealing with a very simple cluster, we can use the default Secondary
NameNode to perform the updates of the fsimage file. The recommended practice in a
production Hadoop setup is to configure high availability for the NameNode by configuring a Standby NameNode. When you do this, you don’t need to use the Secondary
NameNode. In addition to the NameNode and the Secondary NameNode (or a Standby
NameNode if you configure it), you’ll also have multiple DataNodes, one for each node
in your cluster (unless you choose not to run a DataNode on the master nodes where you
run key Hadoop services such as the NameNode and ResourceManager).
Here are the steps to follow in order to start up all three of the HDFS services in our
pseudo-distributed cluster:
$ su hdfs
$ cd /opt/yarn/hadoop-2.6.0/sbin
$ ./hadoop-daemon.sh start namenode
starting namenode, logging to /opt/yarn/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.out
[root@hadoop1 sbin]#
# Start this on the server where the Secondary NameNode is configured to run
$ ./hadoop-daemon.sh start secondarynamenode
starting secondarynamenode, logging to /opt/yarn/hadoop-2.6.0/logs/hadoop-hdfs-secondarynamenode-limulus.out
[root@hadoop1 sbin]#
[root@hadoop1 sbin]# ./hadoop-daemon.sh start datanode
starting datanode, logging to /var/log/hadoop/hdfs/hadoop-root-datanode-hadoop1.localdomain.out
[root@hadoop1 sbin]#

If the NameNode or the DataNodes fail to start, it's easy to find out why. Just open the log file shown in the output of the start command (hadoop-root-datanode-hadoop1.localdomain.out for the DataNode in the preceding output) and check the reason. Usually it's because you haven't set the correct path for HADOOP_HOME or HADOOP_PREFIX. For example, to find out why the NameNode failed to start, view the file named /opt/yarn/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.out. As mentioned earlier, startup problems are usually quite easy to fix!
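If you want to inspect such a log quickly, you can simply view its tail; this is a sketch, and the exact file names depend on your hostname:

# Check the last lines of the NameNode startup output and log (file names vary by hostname)
$ tail -n 50 /opt/yarn/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.out
$ tail -n 50 /opt/yarn/hadoop-2.6.0/logs/hadoop-hdfs-namenode-limulus.log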
Now that you have all your HDFS services successfully started, it’s time to start up
the second Hadoop component, YARN.
Starting the YARN Services
Start the YARN services by logging in as the user yarn. There are three YARN services you must start:
- ResourceManager
- NodeManager
- JobHistoryServer


There’s only one ResourceManager (later you’ll learn how to set up a high-availability
system with an active and a Standby ResourceManager) and a single JobHistoryServer per
cluster, and the NodeManager service runs on every node where you run a DataNode in
your cluster.
# su - yarn
$ cd /opt/yarn/hadoop-2.6.0/sbin
$ ./yarn-daemon.sh start resourcemanager
starting resourcemanager, logging to /opt/yarn/hadoop-2.6.0/logs/yarn-yarn-resourcemanager-limulus.out
$ ./yarn-daemon.sh start nodemanager
starting nodemanager, logging to /opt/yarn/hadoop-2.6.0/logs/yarn-yarn-nodemanager-limulus.out
$ ./mr-jobhistory-daemon.sh start historyserver

As with the HDFS services, if you're unable to start up one of the YARN services, go to the log directory shown in the output of the start command for that service. The log file will show you why Hadoop was unable to start the process. In a simple cluster like ours, it's usually because something such as the Hadoop home directory hasn't been configured correctly.

Verifying the Service Startup
A quick and easy way to check whether all the HDFS and YARN services have started
running is to run the jps (the Java Virtual Machine Process Status Tool) command as
the root user. The jps command shows all running Java processes on your single server
that hosts all the HDFS and YARN services.
[root@hadoop1 sbin]# jps
4180 NodeManager
9186 Jps
3833 NameNode
3940 DataNode
3772 SecondaryNameNode
4108 ResourceManager
[root@hadoop1 sbin]#

As you can tell, the jps command reveals that all the HDFS and YARN services,
such as the ResourceManager and the NameNode that you’ve started earlier, are in fact
running. You can alternatively use the jps command as shown here to find the process
status for a specific Hadoop daemon:
$ /usr/jdk/latest/bin/jps | grep NameNode
3658 NameNode
$

You can also use normal Linux process commands to verify whether the Hadoop
services are running, as shown here:
$ ps -ef|grep -i NameNode
$ ps -ef|grep DataNode


Once you've verified that all services are running as expected, you can check out the HDFS file system by issuing the following command:
$ hdfs dfs -ls /

This is a basic HDFS command that's quite similar to the Linux ls command; it shows the directories under the HDFS root directory. As you'll see later, the directories under /user in the HDFS hierarchy commonly serve as the "home directories" for the services and users in your system. For example, Hive by default uses the /user/hive/warehouse directory for storing the data in its tables.
[root@hadoop2 sbin]# hdfs dfs -ls /
Found 4 items
drwxr-xr-x   - hdfs supergroup          0 2014-12-26 17:08 /system
drwxr-xr-x   - hdfs supergroup          0 2015-01-25 16:05 /test
drwx-wx-wx   - hdfs supergroup          0 2014-12-26 14:39 /tmp
drwxr-xr-x   - hdfs supergroup          0 2014-12-26 14:39 /user
[root@hadoop2 sbin]#
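Later, when you add regular users to the cluster, you'll typically give each one a home directory under /user. Here's a minimal sketch, using a hypothetical user named alice, run as the hdfs superuser:

# Create an HDFS home directory for a (hypothetical) user named alice
$ su - hdfs
$ hdfs dfs -mkdir -p /user/alice
$ hdfs dfs -chown alice:alice /user/alice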

Shutting Down the Services
Now that you’ve satisfied yourself that the new Hadoop pseudo-distributed cluster is
working correctly, you may want to shut down your new cluster. Use the following set
of commands to shut the cluster down.
$ su - hdfs
$ cd /opt/yarn/hadoop-2.6.0/sbin
$ ./hadoop-daemon.sh stop datanode
$ ./hadoop-daemon.sh stop secondarynamenode
$ ./hadoop-daemon.sh stop namenode
# su - yarn
$ cd /opt/yarn/hadoop-2.6.0/sbin
$ ./yarn-daemon.sh stop resourcemanager
$ ./yarn-daemon.sh stop nodemanager
$ ./mr-jobhistory-daemon.sh stop historyserver

Summary
Here's what you learned in this chapter:
- Installing a simple pseudo-distributed cluster takes only a single server, and you can get going in a couple of hours.
- Initial configuration of a simple cluster requires only a handful of parameters, so you can start running your applications very quickly.
- You use separate start and stop commands to control the various Hadoop daemons that are part of a cluster.

4
Planning for and Creating a Fully Distributed Cluster
This chapter covers the following:
- Planning your Hadoop cluster
- Sizing your cluster
- Installing a multinode, fully distributed Hadoop cluster
- Configuring HDFS and YARN for production

Chapter 3, "Creating and Configuring a Simple Hadoop Cluster," showed how to get going with Hadoop by creating a simple pseudo-distributed cluster that has essentially the same functionality as a full-fledged Hadoop cluster. This chapter takes things further by showing you how to create a multi-node cluster, as well as configure it for effective performance. Before I jump into the creation and configuration of a full-fledged, multinode Hadoop cluster, I'll discuss key factors in planning your cluster and how to size it.
Creating a simple one-node cluster is easy. However, most people want to know how
to create their own working Hadoop cluster on a set of simple servers or even on a
(powerful) laptop. Or, some of you may want guidance on setting up a real-life, multinode Hadoop cluster in a production environment. I therefore explain how to create
a multinode Hadoop 2 cluster, using a virtual environment to keep things simple.
More precisely, I show you how to create your own three-node Hadoop cluster using
Oracle’s VirtualBox to create multiple nodes on a single server. The book does include
an appendix, Appendix A, “Installing VirtualBox and Linux and Cloning the Virtual
Machines,” that shows the actual VirtualBox installation and cloning steps.
Once you have a working multi-node cluster set up, you can follow the instructions in
this book to create a Hadoop 2 cluster and configure it for operation. It’s my firm belief
that creating a full-fledged Hadoop cluster in this manner strengthens your understanding
of the Hadoop architecture and gets you in the catbird seat for setting up a much larger
real-life production cluster.


As you’ll learn through this book, a large part of a Hadoop administrator’s job is to
master the various dials and knobs that make Hadoop run—the configuration parameters for HDFS, YARN and other components. Incorrect configuration settings, or
configuration properties left at their default values, are at the heart of a vast majority of
Hadoop-related performance issues. Installing a Hadoop cluster from scratch, as shown
in this chapter, will make you aware of how the configuration parameters are used by
Hadoop and help you learn how to optimize your cluster’s operations.

Planning Your Hadoop Cluster
When planning a cluster, you must begin by understanding your company’s data needs.
You must also evaluate the type of data processing that’ll occur in the cluster. Doing
this will let you figure out the HDFS storage you’ll need, as well as the throughput speed
you must have in order to efficiently process the data.
The type of work to which you’ll put the cluster has a huge bearing on how you configure a cluster’s storage, network and CPU. If your expected workload is going to be
CPU intensive, disk speed and network speed are less important. If your cluster is going
to perform a large amount of heavy MapReduce processing, then network bandwidth
becomes a significant factor. Maybe you’ll need to get multiple NICs for each node in
such a case.
Many organizations get their feet wet by starting with a very small cluster of about half a dozen nodes or so and adding more nodes as the data volumes increase.
You can plan for the growth of the cluster in accordance with how your data grows.
If your data is growing by 1TB daily, with Hadoop’s default replication of 3, that’s 90TB
of additional storage required per month. If you allocate about 20-25 percent of the total
storage to the local Linux file system (as you’ll see, Hadoop also needs space on the local file
system, in addition to space it needs for storing HDFS files), you’ll need about 120TB
of new storage per month. If you buy servers with 12 disk drives of 3TB each (about 36TB of raw storage per server), you'll need roughly 4 new servers a month on average.
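To make the arithmetic explicit, here's a back-of-the-envelope sketch in shell, using the assumptions from the example above (1TB/day growth, replication factor 3, roughly 25 percent local file system overhead, and 12 x 3TB disks per server):

# Back-of-the-envelope cluster growth estimate (assumptions from the example above)
daily_growth_tb=1         # new data per day
replication=3             # HDFS replication factor
days_per_month=30
overhead_pct=25           # share of raw storage reserved for the local Linux file system
disks_per_server=12
tb_per_disk=3

hdfs_tb_per_month=$(( daily_growth_tb * replication * days_per_month ))    # 90TB
total_tb_per_month=$(( hdfs_tb_per_month * 100 / (100 - overhead_pct) ))   # ~120TB
tb_per_server=$(( disks_per_server * tb_per_disk ))                        # 36TB
servers_per_month=$(( (total_tb_per_month + tb_per_server - 1) / tb_per_server ))

echo "HDFS growth per month: ${hdfs_tb_per_month}TB"
echo "Total new storage needed per month: ~${total_tb_per_month}TB"
echo "New servers needed per month: ~${servers_per_month}"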

General Cluster Planning Considerations
In a single node “cluster,” also called a pseudo-distributed Hadoop installation, the single
node will host all Hadoop services, such as the NameNode, the ResourceManager, the
DataNode and the JobHistoryServer daemons. In a real-life production Hadoop cluster,
the architecture will usually consist of one or more racks.
Each rack in the cluster contains multiple server nodes, and the nodes within a rack are usually interconnected using a 1GbE rack-level switch. These rack-level switches are in turn connected to a set of larger (say, 10GbE) cluster-level switches, which may be connected to yet another level of switching infrastructure.
Figure 4.1 shows one way of architecting Hadoop, where all the key master services
are located on dedicated master nodes. These are services such as the NameNode and
ResourceManager, which are essential to the functioning of the cluster. However, in many environments, it's common to have the master nodes share a node with other services such as the DataNodes.

Figure 4.1  The basic components of a Hadoop cluster, showing the master nodes and the DataNodes. (Figure callouts: a pair of 10GbE switches connects the nodes; the master nodes run the NameNodes, ResourceManager and JobHistoryServer; the edge node is for clients who are running Pig, Hive and other jobs in your cluster; worker nodes store HDFS data and also process the Hadoop jobs.)
Figure 4.2 shows the typical architecture for a small Hadoop cluster: a single rack of nodes in which all the master services run on nodes separate from those running the DataNodes. A pair of 10GbE networks supports the cluster. On each master node in this single-rack configuration, you can deploy the master services as shown in Figure 4.2.
The master services are the NameNode (active and Standby), the ResourceManager
(active and Standby), the JournalNodes and the JobHistoryServer. The JournalNodes
are needed only in high-availability architectures. The lightweight ZooKeeper service
needs to run on at least three nodes for quorum purposes and to support NameNode
high availability, so we have it running on the three master nodes.


Figure 4.2  Multiple master nodes with various master services running on each node:
Master Node 1: Active NameNode, ZooKeeper, JournalNode
Master Node 2: Standby NameNode, ZooKeeper, JournalNode
Master Node 3: ResourceManager, ZooKeeper, JournalNode

Server Form Factors
You can choose between two different form factors for your cluster nodes:
- Blade servers are fitted inside blade enclosures, which provide storage, networking and power for the servers in the enclosures. A typical rack (72-inch floor rack) of servers can fit somewhere around 42 blade servers because of their small footprint (each server takes up 1 RU, which is short for "a rack unit"). However, blade servers aren't the best strategy for Hadoop installations, because they share resources with other servers and also have limited storage capacities.
- Rack servers are full-fledged, stand-alone servers, which don't share resources with other servers in a rack and also provide room for storage expansion. Rack servers have a larger footprint than blade servers, and usually about 18-20 of them will fit inside a standard server rack.

Criteria for Choosing the Nodes
While cost is definitely a key factor in choosing a specific type of server for your cluster,
the initial cost of purchasing the servers is but one of the factors that determine the true
long-term cost to an organization. In addition to the initial cost, you ought to consider
other factors that add to the long-term total cost of managing your cluster, such as the
power consumption and cooling factor of the servers, as well as the reliability and the maintenance costs for servicing the storage, CPUs and network.
It's not recommended that you use the most inexpensive desktop-class computers to build a Hadoop cluster. You should select a midrange Intel server with a fairly large storage capacity, typically about 36-48TB per server.
The term commodity server is commonly used to describe the class of servers a Hadoop cluster requires. A commodity server is usually an Intel server with some hard disk storage. A commodity server is considered an "average" server, which means it is affordable but by no means implies low quality.