Appendix A. Creating a Hadoop Pseudo-Distributed Development Environment

If you’re brave enough to set up the environment yourself, go ahead and move to the
next section!

Setting Up Linux
Before you can get started installing Hadoop, you’ll need to have a Linux environment
configured and ready to use. These instructions assume that you can get an
Ubuntu 14.04 distribution installed on the machine of your choice, either in a dual-boot
configuration or using a virtual machine. Whether you use Ubuntu Server or Ubuntu
Desktop is up to you; either way, you’ll need to be comfortable working with the
command line.
Our base environment is Ubuntu x64 Desktop 14.04 LTS.
Make sure your system is fully up to date by running the following commands:


$ sudo apt-get update && sudo apt-get upgrade
$ sudo apt-get install build-essential ssh lzop git rsync curl
$ sudo apt-get install python-dev python-setuptools
$ sudo apt-get install libcurl4-openssl-dev
$ sudo easy_install pip
$ sudo pip install virtualenv virtualenvwrapper python-dateutil
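As a quick sanity check after the installs, the following sketch reports any tools that
are still missing from the PATH. The function name check_tools is hypothetical (it is
not part of Ubuntu or Hadoop); it relies only on the shell's command -v builtin.

```shell
# check_tools: hypothetical helper that prints every named tool that cannot
# be found on the PATH and returns non-zero if at least one is missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}
# Example: check_tools ssh lzop git rsync curl pip virtualenv
```

The call prints nothing and returns success when every listed tool is available.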

Creating a Hadoop User
In order to secure our Hadoop services, we will make sure that Hadoop runs as a
Hadoop-specific user and group. This user is able to initiate SSH connections
to other nodes in a cluster, but does not have administrative access that could damage
the operating system on which the service runs. Implementing Linux permissions
also helps secure HDFS and is the first step in preparing a secure computing cluster.
This tutorial is not meant for operational implementation; however, as a data scientist,
you may find that these permissions save you some headaches in the long run, so it is
helpful to have them in place in your development environment. They also keep the
Hadoop installation separate from other software applications and help organize the
maintenance of the machine.
Create the hadoop user and group, then add the student user to the Hadoop group:
$ sudo addgroup hadoop
$ sudo adduser --ingroup hadoop hadoop
$ sudo usermod -a -G hadoop student

Once you have logged out and logged back in (or restarted the machine), you should
be able to see that you’ve been added to the hadoop group by issuing the groups command.
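The same membership check can be scripted. This is a minimal sketch; in_group is a
hypothetical helper name, and it assumes the id utility that ships with Ubuntu.

```shell
# in_group: hypothetical helper that succeeds when USER belongs to GROUP.
in_group() {
    user="$1"
    group="$2"
    # id -nG prints the user's group names separated by spaces;
    # put one name per line and look for an exact match.
    id -nG "$user" | tr ' ' '\n' | grep -qx "$group"
}
# Example: in_group student hadoop && echo "student is in the hadoop group"
```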


Configuring SSH
SSH is required and must be installed on your system to use Hadoop (and to better
manage the virtual environment, especially if you’re using a headless Ubuntu).
Generate some SSH keys for the hadoop user by issuing the following commands:
$ sudo su hadoop
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/student/.ssh/id_rsa):
Created directory '/home/student/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/student/.ssh/id_rsa.
Your public key has been saved in /home/student/.ssh/id_rsa.pub.
[... snip ...]

Simply press Enter at all the prompts to accept the default and to create a key that
does not require a password to authenticate (this is required for Hadoop). It is good
practice to keep an administrative user separate from the Hadoop user because of the
password-less SSH requirement; however, because this is a developer cluster, we’ll
take the shortcut of making the student user the Hadoop user.
In order to allow the key to be used to SSH into the box, copy the public key to the
authorized_keys file with the following command:
$ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
$ chmod 600 ~/.ssh/authorized_keys

You should be able to download this key and use it to SSH into the Ubuntu environment.
To test the SSH key, issue the following command:
$ ssh -l hadoop localhost

If this completes without asking you for a password, then you have successfully
configured SSH for Hadoop.
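If the test still prompts for a password, one thing worth checking is the file mode set
with chmod above. The sketch below compares a file's mode against an expected value.
It assumes GNU stat (as shipped with Ubuntu), and check_mode is a hypothetical helper
name.

```shell
# check_mode: hypothetical helper that succeeds when TARGET has exactly the
# octal mode WANT, as printed by GNU stat.
check_mode() {
    target="$1"
    want="$2"
    have="$(stat -c '%a' "$target")" || return 1
    [ "$have" = "$want" ]
}
# Example, run as the hadoop user:
#   check_mode ~/.ssh/authorized_keys 600 || echo "unexpected mode"
```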

Installing Java
Hadoop and most of the Hadoop ecosystem require Java to run. Hadoop requires
Oracle Java 1.6.x or greater, and it used to recommend particular versions
of Java. Now, however, the Hadoop project maintains a report of the various JDKs
that work well with Hadoop. Ubuntu does not carry an Oracle JDK in its
repositories because it is proprietary code, so we will install OpenJDK instead.
For more information on supported Java versions, see Hadoop Java Versions; for
information about installing different versions on Ubuntu, see Installing Java on
Ubuntu.
$ sudo apt-get install openjdk-7-*

Do a quick check to ensure the right version of Java is installed:
$ java -version
java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Hadoop is currently built and tested on both OpenJDK and Oracle’s JDK/JRE.

Disabling IPv6
It has been reported for a while now that Hadoop running on Ubuntu has a conflict
with IPv6, and ever since Hadoop 0.20, Ubuntu users have been disabling IPv6 on
their clustered boxes. It is unclear whether this is still a bug in the latest versions of
Hadoop, but in a single node or pseudo-distributed environment we will have no
need for IPv6, so it is best to simply disable it and not worry about any potential
problems.
Edit the /etc/sysctl.conf file by running the following command:
$ gksu gedit /etc/sysctl.conf

Then add the following lines to the end of the file:
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
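If you prefer to script this edit, the sketch below appends a line only when it is not
already present, so running it twice does not duplicate the settings. The helper name
append_once is hypothetical; it uses only grep and printf.

```shell
# append_once: hypothetical helper that appends LINE to FILE only if the
# exact line is not already there (grep -x matches whole lines, -F disables
# pattern interpretation).
append_once() {
    file="$1"
    line="$2"
    grep -qxF -- "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}
# Example (needs a root shell, because /etc/sysctl.conf is owned by root):
#   append_once /etc/sysctl.conf 'net.ipv6.conf.all.disable_ipv6 = 1'
```

Running sudo sysctl -p afterwards reloads /etc/sysctl.conf and may apply the settings
without waiting for a reboot.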

For this change to take effect, reboot your computer. Once it has rebooted, check the
status with the following command:
$ cat /proc/sys/net/ipv6/conf/all/disable_ipv6

If the output is 0, then IPv6 is enabled; if it is 1, then we have successfully disabled it.

Installing Hadoop
To get Hadoop, you’ll need to download the release of your choice from one of the
Apache Download Mirrors. These instructions will download the current stable version
of Hadoop with YARN at the time of this writing, Hadoop 2.5.0.
After you’ve selected a mirror, type the following commands into a Terminal window,
replacing http://apache.mirror.com/hadoop-2.5.0/ with the mirror URL that you
selected and that is best for your region:
$ curl -O http://apache.mirror.com/hadoop-2.5.0/hadoop-2.5.0.tar.gz

You can verify the download by ensuring that the MD5 checksum of the file matches the
checksum published for the release, which should also be available at the mirror:


$ md5sum hadoop-2.5.0.tar.gz
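The comparison itself can be automated. A minimal sketch, assuming you have copied the
published checksum from the mirror by hand; verify_md5 is a hypothetical helper name.

```shell
# verify_md5: hypothetical helper that succeeds when the MD5 checksum of FILE
# equals EXPECTED. Upper-case hex digits in EXPECTED are accepted.
verify_md5() {
    file="$1"
    expected="$(printf '%s' "$2" | tr 'A-F' 'a-f')"
    # md5sum prints "<checksum>  <filename>"; keep only the first field.
    actual="$(md5sum "$file" | awk '{ print $1 }')"
    [ "$actual" = "$expected" ]
}
# Example:
#   verify_md5 hadoop-2.5.0.tar.gz "<checksum from the mirror>" || echo "mismatch"
```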


Of course, you can use any mechanism you wish to download Hadoop—wget or a
browser will work just fine.

After obtaining the compressed tarball, the next step is to unpack it. You can use an
Archive Manager or simply use the commands given later in this section. The most
significant decision that you have to make is where to unpack Hadoop.
The Linux operating system depends upon a hierarchical directory structure to function.
At the root, many directories that you’ve heard of have specific purposes: /etc is
used to store configuration files and /home is used to store user-specific files. Most
applications find themselves in a variety of locations; for example, /bin and /sbin
include programs that are vital for the OS and /usr/bin and /usr/sbin are for programs
that are not vital but are system-wide. The directory /usr/local is for locally installed
programs and /var is used for program data including caches and logs. You can read
more about these directories in this Stack Exchange post.
Good choices for where to put Hadoop are the /opt and /srv directories. The /opt
directory contains non-packaged programs, usually from source; a lot of developers put
their code there for deployments. The /srv directory stands for services; Hadoop, HBase,
Hive, and others run as services on your machine, so this seems like a great place to put
things. It’s also a standard location that’s easy to get to, so let’s put everything there.
Enter the following commands:

$ tar -xzf hadoop-2.5.0.tar.gz
$ sudo mv hadoop-2.5.0 /srv/
$ sudo chown -R hadoop:hadoop /srv/hadoop-2.5.0
$ sudo chmod g+w -R /srv/hadoop-2.5.0
$ sudo ln -s /srv/hadoop-2.5.0 /srv/hadoop

These commands unpack Hadoop, move it to the service directory where we will
keep all of our Hadoop and cluster services, then set permissions. Finally, we create a
symlink to the version of Hadoop that we would like to use, which makes it easy to
upgrade our Hadoop distribution in the future.
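To illustrate why the symlink helps, the sketch below repoints a link at a different
unpacked release without touching anything that refers to the short name. The helper
name switch_release is hypothetical, and the -n flag assumes GNU ln.

```shell
# switch_release: hypothetical helper that points the symlink PREFIX/NAME at
# the unpacked release directory PREFIX/RELEASE.
switch_release() {
    prefix="$1"
    name="$2"
    release="$3"
    if [ ! -d "$prefix/$release" ]; then
        echo "no such release: $prefix/$release" >&2
        return 1
    fi
    # -f replaces an existing link; -n keeps ln from following an existing
    # link into the directory it points at.
    ln -sfn "$prefix/$release" "$prefix/$name"
}
# Example, as root: switch_release /srv hadoop hadoop-2.5.0
```

Because HADOOP_HOME refers to /srv/hadoop, an upgrade under this scheme only needs a
newly unpacked directory and one call to repoint the link.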

In order to ensure everything executes correctly, we are going to set some environment
variables so that Hadoop executes in its correct context. Enter the following
command on the command line to open up a text editor with the profile of the
hadoop user to change the environment variables:
$ gksu gedit /home/hadoop/.bashrc

Add the following lines to this file:
# Set the Hadoop-related environment variables
export HADOOP_HOME=/srv/hadoop
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
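The JAVA_HOME path above matches the 64-bit OpenJDK 7 package; with another JDK or
architecture the directory name will differ. The sketch below derives the value from the
location of the java binary. It assumes the usual JDK layout, in which java lives under
bin/ or jre/bin/, and java_home_from_bin is a hypothetical helper name.

```shell
# java_home_from_bin: hypothetical helper that turns the full path of a java
# binary into the matching JAVA_HOME directory.
java_home_from_bin() {
    java_bin="$1"
    home="${java_bin%/bin/java}"   # strip the trailing /bin/java
    home="${home%/jre}"            # strip a trailing /jre, if present
    printf '%s\n' "$home"
}
# Example: resolve the symlinks behind the java command first (GNU readlink).
#   java_home_from_bin "$(readlink -f "$(command -v java)")"
```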

We’ll also add some convenience functionality to the student user environment. Open
the student user bash aliases file with the following command:
$ gedit ~/.bash_aliases

Add the following contents to that file:
# Set the Hadoop-related environment variables
export HADOOP_HOME=/srv/hadoop
export HADOOP_STREAMING=$HADOOP_HOME/share/hadoop/tools/lib/
# Set the JAVA_HOME
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
# Helpful aliases
alias ..="cd .."
alias ...="cd ../.."
alias hfs="hadoop fs"
alias hls="hfs -ls"

These simple aliases may save you a lot of typing in the long run! Feel free to add any
other helpers that you think might be useful in your development work.
Check that your environment configuration has worked by running a Hadoop command:
$ hadoop version
Hadoop 2.5.0
Subversion http://svn.apache.org/repos/asf/hadoop/common -r 1616291
Compiled by jenkins on 2014-08-06T17:31Z
Compiled with protoc 2.5.0
From source with checksum 423dcd5a752eddd8e45ead6fd5ff9a24
This command was run using /srv/hadoop-2.5.0/share/hadoop/common/

If that ran with no errors and displayed output similar to what is shown here, then
everything has been configured correctly up to this point.


Hadoop Configuration
The penultimate step in setting up Hadoop as a pseudo-distributed node is to edit the
configuration files for the Hadoop environment, the core site, the MapReduce site, the
HDFS site, and the YARN site.
Edit the hadoop-env.sh file by entering the following on the command line:
$ gedit $HADOOP_HOME/etc/hadoop/hadoop-env.sh

The most important part of this configuration is to change the following line:
# The Java implementation to use
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

Next, edit the core site configuration file:
$ gedit $HADOOP_HOME/etc/hadoop/core-site.xml

Replace the empty <configuration></configuration> element with the following:
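A minimal sketch of what a pseudo-distributed core-site.xml might contain, written as
a shell helper so that it can be applied in one step. The helper name write_core_site and
the NameNode address hdfs://localhost:9000 are assumptions rather than values taken
from the text. Pointing hadoop.tmp.dir at /var/app/hadoop/data is likewise an assumption,
chosen to match the directory the Namenode is formatted into later in this appendix.

```shell
# write_core_site: hypothetical helper that writes a minimal core-site.xml
# into the configuration directory given as its first argument.
write_core_site() {
    conf_dir="$1"
    cat > "$conf_dir/core-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Assumed address of the local NameNode -->
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://localhost:9000</value>
  </property>
  <!-- Assumed base directory for Hadoop's local data -->
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/var/app/hadoop/data</value>
  </property>
</configuration>
EOF
}
# Example: write_core_site "$HADOOP_HOME/etc/hadoop"
```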



Edit the MapReduce site configuration by first copying the template, then opening
the copy for editing:
$ cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template \
    $HADOOP_HOME/etc/hadoop/mapred-site.xml
$ gedit $HADOOP_HOME/etc/hadoop/mapred-site.xml

Replace the empty <configuration></configuration> element with the following:
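A minimal sketch of a mapred-site.xml for this setup, again as a shell helper. It sets
only mapreduce.framework.name, the Hadoop 2 property that selects YARN as the
execution framework; whether further properties are needed is not covered here. The
helper name write_mapred_site is hypothetical.

```shell
# write_mapred_site: hypothetical helper that writes a minimal
# mapred-site.xml into the configuration directory given as its argument.
write_mapred_site() {
    conf_dir="$1"
    cat > "$conf_dir/mapred-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Run MapReduce jobs on YARN -->
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value>
  </property>
</configuration>
EOF
}
# Example: write_mapred_site "$HADOOP_HOME/etc/hadoop"
```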


Now edit the hdfs site configuration by editing the following file:
$ gedit $HADOOP_HOME/etc/hadoop/hdfs-site.xml

Replace the empty <configuration></configuration> element with the following:
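A minimal sketch of an hdfs-site.xml for a single machine. With only one DataNode
available, a replication factor of 1 is the usual choice; that value is an assumption here,
not one taken from the text, and write_hdfs_site is a hypothetical helper name.

```shell
# write_hdfs_site: hypothetical helper that writes a minimal hdfs-site.xml
# into the configuration directory given as its argument.
write_hdfs_site() {
    conf_dir="$1"
    cat > "$conf_dir/hdfs-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Keep a single copy of each block (assumed: one DataNode) -->
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>
EOF
}
# Example: write_hdfs_site "$HADOOP_HOME/etc/hadoop"
```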


Finally, edit the yarn site configuration file:
$ gedit $HADOOP_HOME/etc/hadoop/yarn-site.xml

And update the configuration as follows:
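A minimal sketch of a yarn-site.xml, as a shell helper. It enables only the
mapreduce_shuffle auxiliary service on the NodeManager, which MapReduce jobs on YARN
rely on. Any further settings (for example, ResourceManager addresses) are left at their
defaults, and write_yarn_site is a hypothetical helper name.

```shell
# write_yarn_site: hypothetical helper that writes a minimal yarn-site.xml
# into the configuration directory given as its argument.
write_yarn_site() {
    conf_dir="$1"
    cat > "$conf_dir/yarn-site.xml" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <!-- Shuffle service used by MapReduce jobs running on YARN -->
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>
EOF
}
# Example: write_yarn_site "$HADOOP_HOME/etc/hadoop"
```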






With these files edited, Hadoop should be fully configured as a pseudo-distributed node.

Formatting the Namenode
The final step before we can turn Hadoop on is to format the Namenode. The Namenode
is in charge of HDFS, the distributed file system. The Namenode on this
machine is going to keep its files in the /var/app/hadoop/data directory. We need to
initialize this directory, then format the Namenode to properly use it:

$ sudo mkdir -p /var/app/hadoop/data
$ sudo chown hadoop:hadoop -R /var/app/hadoop
$ sudo su hadoop
$ hadoop namenode -format

You should see a bunch of Java messages scrolling down the page. If the namenode
command has executed successfully (there should be directories inside of
the /var/app/hadoop/data directory, including a dfs directory), then Hadoop is set up
and ready to use!
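The directory check described above can be scripted, which is also a convenient guard
against formatting a second time (reformatting discards the existing HDFS metadata). A
minimal sketch; namenode_formatted is a hypothetical helper name.

```shell
# namenode_formatted: hypothetical helper that succeeds when DATA_DIR
# (default /var/app/hadoop/data) already contains a dfs directory.
namenode_formatted() {
    data_dir="${1:-/var/app/hadoop/data}"
    [ -d "$data_dir/dfs" ]
}
# Example, as the hadoop user: format only when nothing is there yet.
#   namenode_formatted || hadoop namenode -format
```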

Starting Hadoop
At this point, we can start and run our Hadoop daemons. When you formatted the
Namenode, you switched to being the hadoop user with the sudo su hadoop command.
If you’re still that user, go ahead and execute the following commands:
$ $HADOOP_HOME/sbin/start-dfs.sh
$ $HADOOP_HOME/sbin/start-yarn.sh

The daemons should start up and issue messages about where they are logging to and
other important information. If you are asked whether to trust the SSH host key, type
yes at the prompt. You can see the processes that are running via the jps command:
$ jps
5298 Jps
4690 ResourceManager
4541 SecondaryNameNode
4813 NodeManager
4227 NameNode

If the processes are not running, then something has gone wrong. You can also access
the Hadoop cluster administration site by opening a browser and pointing it to
http://localhost:8088; this should bring up a page with the Hadoop logo and a table of
applications.
To wrap up the configuration, prepare a space on HDFS for our student account to
store data and to run analytical jobs on:
$ hadoop fs -mkdir -p /user/student
$ hadoop fs -chown student:student /user/student

You can now exit from the hadoop user’s shell with the exit command.

Restarting Hadoop
If you reboot your machine, the Hadoop daemons will stop running and will not
automatically be restarted. If you are attempting to run a Hadoop command and you
get a “Connection refused” message, it is likely because the daemons are not running.
You can check this by issuing the jps command as sudo:
$ sudo jps
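To see at a glance which daemons are down, the jps output can be filtered. The sketch
below reads jps output on standard input and prints each expected daemon name that does
not appear. The helper name missing_daemons is hypothetical, and the list of expected
names is supplied by the caller. DataNode appears in the example on the assumption that
start-dfs.sh also launches a DataNode on this machine.

```shell
# missing_daemons: hypothetical helper. Reads "PID Name" lines (the format
# printed by jps) on stdin and prints every expected name given as an
# argument that is not present in the second column.
missing_daemons() {
    jps_output="$(cat)"
    for name in "$@"; do
        printf '%s\n' "$jps_output" |
            awk -v n="$name" '$2 == n { found = 1 } END { exit !found }' ||
            printf '%s\n' "$name"
    done
}
# Example:
#   sudo jps | missing_daemons NameNode DataNode SecondaryNameNode \
#       ResourceManager NodeManager
```

No output means every listed daemon is running; otherwise the names printed are the ones
to look for in the logs before restarting.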

To restart Hadoop in the case that it shuts down, issue the following commands:
$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-dfs.sh
$ sudo -H -u hadoop $HADOOP_HOME/sbin/start-yarn.sh

The processes should start up again as the dedicated hadoop user and you’ll be back
on your way!



Installing Hadoop Ecosystem Products

In addition to the core functionality provided in Hadoop, this book covers several
other Hadoop ecosystem projects that are built on top of Hadoop. In a typical setting,
these products are often installed either on the same cluster that hosts Hadoop and
YARN, or are configured to connect to the Hadoop cluster. In this book, we will
assume that you have set up and configured Apache Hadoop in a single node,
pseudo-distributed mode. However, there are several other options to get up and running
with a single node Hadoop cluster along with the Hadoop ecosystem products that
we will discuss in this book.

Packaged Hadoop Distributions
The easiest way to get up and running with a single-machine configuration of
Hadoop is to install one of the virtualized Hadoop distributions provided by the
major Hadoop vendors. These include Cloudera’s Quickstart VM, Hortonworks
Sandbox, and MapR’s sandbox for Hadoop. These virtual machines contain a
single-node Hadoop cluster in addition to the popular Apache Hadoop ecosystem projects
as well as proprietary applications and tools that are included in a simple turn-key
bundle. You can use your preferred virtualization software such as VMware Player or
VirtualBox to run these VMs.

Self-Installation of Apache Hadoop Ecosystem Products
If you are not using a packaged distribution of Hadoop, but instead installing Apache
Hadoop manually, then you will also need to manually install and configure the various
Hadoop ecosystem projects that we discuss in this book to work with your
Hadoop installation.


For the most part, installing services (e.g., Hive, HBase, or others) in the Hadoop
environment we have set up will consist of the following:
1. Download the release tarball of the service
2. Unpack the release to the /srv/ directory (where we have been installing our
Hadoop services) and create a symlink from the release to a simple name
3. Configure environment variables with the paths to the service
4. Configure the service to run in pseudo-distributed mode
In this appendix, we’ll walk through the steps to install Sqoop to work with our
pseudo-distributed Hadoop cluster. These steps can be reproduced for nearly all the
other Hadoop ecosystem projects that we discuss in this book.
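Step 2 of the list above (unpack the release, then create a symlink with a simple name)
can be sketched as a small shell function. The helper name install_release and its
argument order are hypothetical; it assumes a gzip-compressed tarball and GNU ln.

```shell
# install_release: hypothetical helper for the unpack-and-symlink step.
#   $1  path to the release tarball
#   $2  name of the directory the tarball unpacks to
#   $3  simple name for the symlink
#   $4  install prefix (defaults to /srv)
install_release() {
    tarball="$1"
    release="$2"
    name="$3"
    prefix="${4:-/srv}"
    tar -xzf "$tarball" -C "$prefix" || return 1
    ln -sfn "$prefix/$release" "$prefix/$name"
}
# Example, as root:
#   install_release sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz \
#       sqoop-1.4.6.bin__hadoop-2.0.4-alpha sqoop
```

Ownership changes and environment variables (steps 3 and 4) are service-specific and are
left out of the sketch.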

Basic Installation and Configuration Steps
Let’s start by downloading the latest stable release of Sqoop from the Apache Sqoop
Download Mirrors, which as of this writing is version 1.4.6. Make sure
you are a user with admin (sudo) privileges and grab the version of Sqoop that is
compatible with your version of Hadoop (in this example, Hadoop 2.5.1):
~$ wget http://apache.arvixe.com/sqoop/1.4.6/sqoop-1.4.6.bin__
~$ sudo mv sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz /srv/
~$ cd /srv
/srv$ sudo tar -xvf sqoop-1.4.6.bin__hadoop-2.0.4-alpha.tar.gz
/srv$ sudo chown -R hadoop:hadoop sqoop-1.4.6.bin__hadoop-2.0.4-alpha
/srv$ sudo ln -s $(pwd)/sqoop-1.4.6.bin__hadoop-2.0.4-alpha $(pwd)/sqoop

Now switch to the hadoop user using the sudo su command and edit your Bash
configuration to add some environment variables for convenience:
/srv$ sudo su hadoop
$ vim ~/.bashrc

Add the following environment variables to your bashrc profile:
# Sqoop aliases
export SQOOP_HOME=/srv/sqoop

Then source the profile to add the new variables to the current shell environment:
~$ source ~/.bashrc

We can verify that Sqoop is successfully installed by running sqoop help from
the $SQOOP_HOME directory:
/srv$ cd $SQOOP_HOME
/srv/sqoop$ sqoop help



Appendix B: Installing Hadoop Ecosystem Products