Step 5: Install an HDP2.X Cluster

3. Next, Ambari will ask for your list of nodes, one per line, and whether registration should use a private key or be done manually. In this installation, we are using the manual method (see Figure 5.8).

Figure 5.8 Hadoop install options
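With manual registration, the Ambari-Agent must be installed on each node and pointed at the Ambari-Server before registration can succeed. The following is a hypothetical sketch for a Red Hat-style node; the package name, configuration path, and server host name (ambari.example.com) are assumptions that may vary with your Ambari version.

# Install the Ambari-Agent, point it at the Ambari-Server, and start it
yum install -y ambari-agent
sed -i 's/hostname=localhost/hostname=ambari.example.com/' \
  /etc/ambari-agent/conf/ambari-agent.ini
service ambari-agent start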

The installed Ambari-Agents should register with the Ambari-Server at this point. Any install warnings, such as ntpd not running on the nodes, will also be displayed here. If there are issues or warnings, the registration window will indicate them as shown in Figure 5.9. Note that this example installs a “cluster” of one node.
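If ntpd is flagged as not running, one way to clear the warning is to start and enable it on every node before retrying registration. This is a minimal sketch, assuming pdsh and the all_hosts file used elsewhere in this book:

# Start ntpd now and enable it at boot on all nodes (Red Hat-style init)
pdsh -w ^all_hosts "service ntpd start && chkconfig ntpd on"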

Figure 5.9 Ambari host registration screen

If everything is set correctly, your window should look like Figure 5.10.

Figure 5.10 Successful Ambari host registration screen

4. The next step is to select which components of the HDP2.X stack you want
to install. At the very least, you will want to install HDFS, YARN, and
MapReduceV2. In Figure 5.11, we will install everything.

Figure 5.11 Ambari services selection

5. The next step is to assign functions to the various nodes in your cluster (see
Figure 5.12). The Assign Masters window allows for the selection of master
nodes—that is, NameNode, ResourceManager, HBaseMaster, Hive Server,
Oozie Server, etc. All nodes that have registered with the Ambari-Server will be
available in the drop-down selection box. Remember that the ResourceManager
has replaced the JobTracker from Hadoop version 1 and in a multi-node installation should always be given its own dedicated node.

Figure 5.12 Ambari host assignment

6. In this step, you assign NodeManagers (which run YARN containers), RegionServers, and DataNodes (HDFS). Remember that the NodeManager has replaced the TaskTracker from Hadoop version 1, so you should always co-locate a NodeManager with each DataNode to ensure that local data is available for YARN containers. The selection window is shown in Figure 5.13. Again, this example has only a single node.

Figure 5.13 Ambari slave and client component assignment

7. The next set of screens allows you to define any initial parameter changes and
usernames for the services selected for installation (i.e., Hive, Oozie, and Nagios). Users are required to set up the database passwords and alert reporting
email before continuing. The Hive database setup is pictured in Figure 5.14.

Figure 5.14 Ambari customized services window

8. The final step before installing Hadoop is a review of your configuration. Figure 5.15 summarizes the actions that are about to take place. Be sure to double-check all settings before you commit to an install.

Figure 5.15 Ambari final review window

9. During the installation step shown in Figure 5.16, the cluster is actually provisioned with the various software packages. By clicking on the node name, you
can drill down into the installation status of every component. Should the installation encounter any errors on specific nodes, these errors will be highlighted on
this screen.

Figure 5.16 Ambari deployment process

10. Once the installation of the nodes is complete, a summary window similar to Figure 5.17 will be displayed. The screen indicates which tasks were completed and summarizes any preliminary testing of cluster services.

Figure 5.17 Ambari summary window

Congratulations! You have just completed installing HDP2.X with Ambari. Consult the online Ambari documentation (http://docs.hortonworks.com/#2) for further details on the installation process.

If you are using Hive and have an FQDN longer than 60 characters (as is common in some cloud deployments), note that this can cause authentication issues with the MySQL database that Hive installs by default with Ambari. To work around this issue, start the MySQL database with the "--skip-name-resolve" option to disable FQDN resolution and authenticate based only on IP address.
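For example, the option can be applied persistently by placing it in the MySQL configuration file and restarting the service. This is a sketch assuming a Red Hat-style layout; skip-name-resolve must end up under the [mysqld] section, so edit /etc/my.cnf by hand if that is not the last section in the file.

# Append skip-name-resolve to /etc/my.cnf and restart MySQL
echo "skip-name-resolve" >> /etc/my.cnf
service mysqld restart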

Wrap-up
It is possible to perform an automated script-based install of moderate to very large
clusters. The use of parallel distributed shell and copy commands (pdsh and pdcp,
respectively) makes a fully remote installation on any number of nodes possible. The
script-based install process is designed to be flexible, and users can easily modify it for
their specific needs on Red Hat (and derivative)–based distributions.
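As a reminder of the pattern, both tools take a file of target hosts; the host file and payload below are illustrative.

# Run a command on every node listed in all_hosts
pdsh -w ^all_hosts "uptime"

# Copy a file to every node in parallel
pdcp -w ^all_hosts hadoop-2.2.0.tar.gz /opt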
In addition to the install script, some useful functions for creating and changing the
Hadoop XML property files are made available to users. To aid with start-up and shutdown of Hadoop services, the scripted install also provides SysV init scripts for Red Hat–
based systems.
Finally, a graphical install process using Apache Ambari was described in this
chapter. With Ambari, the entire Hadoop installation process can be automated with
a powerful point-and-click interface. As we will see in Chapter 6, “Apache Hadoop
YARN Administration,” Ambari can also be used for administration purposes.
Installing Hadoop 2 YARN from scratch is also easy. The single-machine installation outlined in Chapter 2, “Apache Hadoop YARN Install Quick Start,” can be used
as a guide. Again, in custom scenarios, pdsh and pdcp can be very valuable.

6
Apache Hadoop YARN Administration

Administering a YARN cluster involves many things. Those familiar with Hadoop 1 may know that there are many configuration properties, whose values are listed in the Hadoop documentation. Instead of repeating that information here, we give practical examples of how you can use open-source tools to manage and understand a complex environment like a YARN cluster.
To effectively administer YARN, we will use the bash scripts and init system scripts developed in Chapter 5, “Installing Apache Hadoop YARN.” Also, YARN and Hadoop in general comprise a distributed data platform written in Java. Naturally, this means that there will be many Java processes running on your servers, so it is a good idea to know the basics of those processes and how to analyze them should the need arise.
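For example, the jps and jstack utilities that ship with the JDK can list the Hadoop daemons on a node and capture a thread dump for later analysis; the PID and output file below are illustrative.

# List running Java processes; Hadoop daemons appear by class name
# (e.g., ResourceManager, NodeManager, NameNode, DataNode)
jps

# Capture a thread dump of one daemon using a PID reported by jps
jstack <pid> > daemon-threads.txt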
We will not cover Hadoop Distributed File System (HDFS) administration in this chapter. It is
assumed that most readers are familiar with HDFS operations and can navigate the file
system. For those unfamiliar with HDFS, see Appendix F for a short introduction. In
addition, further information on HDFS can be found on the Apache Hadoop website:
http://hadoop.apache.org/docs/stable/hdfs_user_guide.html. In this chapter, we cover
some basic YARN administration scenarios, introduce both Nagios and Ganglia for
cluster monitoring, discuss JVM monitoring, and introduce the Ambari management
interface.

Script-based Configuration
In Chapter 5, “Installing Apache Hadoop YARN,” we presented some bash scripts to
help us install and configure Hadoop. If you haven’t read that chapter, we suggest you
examine it to get an idea of how we’ll reuse the scripts to manage our cluster once
it’s up and running. If you’ve already read Chapter 5, you’ll recall that we use a script

called hadoop-xml-conf.sh to do XML file processing. We can reuse these commands
to create an administration script that assists us in creating and pushing out Hadoop
configuration changes to our cluster. This script, called configure-hadoop2.sh, is part
of the hadoop2-install-scripts.tgz tar file from the book’s repository (see Appendix A). A listing of the administration script is also available in Appendix C.
The configure-hadoop2.sh script is designed to push (and possibly delete) configuration properties to the cluster and optionally restart various services within the
cluster. Since the bulk of our work for these scripts was presented in Chapter 5, we
will use these scripts as a starting point. You will need to set your version of Hadoop
in the beginning of the script.
HADOOP_VERSION=2.2.0
HADOOP_HOME="/opt/hadoop-${HADOOP_VERSION}"

The script also sources hadoop-xml-conf.sh, which contains the basic file manipulation commands. We also need to decide whether we want to restart and refresh the
Hadoop cluster. The default is refresh=false.
We can reuse our scripts to create a function that adds or overwrites a configuration property.
put()
{
  # Add or overwrite a single property in the given Hadoop XML file
  put_config --file "$file" --property "$property" --value "$value"
}
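For instance, a call to set the NodeManager memory limit in yarn-site.xml might look like the following; the property name and value are illustrative.

put_config --file yarn-site.xml \
  --property yarn.nodemanager.resource.memory-mb \
  --value 8192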

The put_config function from hadoop-xml-conf.sh can be used in the same way
as was shown in Chapter 5. In a similar fashion, we can add a function to delete a
property.
delete()
{
  # Remove a single property from the given Hadoop XML file
  del_config --file "$file" --property "$property"
}
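A hypothetical call to remove that same illustrative property would be:

del_config --file yarn-site.xml --property yarn.nodemanager.resource.memory-mb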

Next, we enlist the help of pdcp to push the file out to the cluster in a single command. We’ve kept the all_hosts file on our machine from the installation process,
but in the event you deleted this file, just create a new one with the fully qualified
domain names of every host on which you want the configuration file to reside.
deploy()
{
  # Push the updated configuration file to every node in parallel
  echo "Deploying $file to the cluster..."
  pdcp -w ^all_hosts "$file" "$HADOOP_HOME/etc/hadoop/"
}
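If all_hosts must be recreated, it is simply one fully qualified domain name per line; the host names here are illustrative.

cat > all_hosts <<EOF
n0.example.com
n1.example.com
n2.example.com
EOF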

We’ve gotten good use out of our existing scripts to modify configuration files,
so all we need is a way to restart Hadoop. We need to be careful as to how we bring
down the services on each node, because the order in which the services are brought