Hadoop

This page describes how to configure Hadoop, the open source implementation of the MapReduce framework, for use in the ICB.

Prerequisites

Java

Hadoop requires Java 1.6 in order to execute. Virtually every ICB Linux server provides the Sun JDK 1.6 by default. See the Softlib documentation for information on using software packages available on the servers.
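
A quick way to confirm that a suitable JDK is first on your path is to check the version directly:

 $ which java
 $ java -version

The output should report a Java 1.6 version.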

Hadoop Distribution

The Hadoop software is available from the Hadoop Core Release page. As of this writing, 0.20.0 is the latest version and is installed in the hadoop account on the Linux machines available to the ICB.

OpenSSH

OpenSSH is required for Hadoop to execute in a distributed fashion. See the PBTech wiki entry about setting up Password Free Logins.
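
For reference, password-free login is typically set up by generating an SSH key pair and appending the public key to ~/.ssh/authorized_keys; a minimal sketch (see the PBTech wiki entry for the ICB-specific details):

 $ ssh-keygen -t rsa -P "" -f ~/.ssh/id_rsa      # empty passphrase, default key location
 $ cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
 $ chmod 600 ~/.ssh/authorized_keys
 $ ssh localhost                                 # should now log in without a password prompt

Because the main user directories are shared across the ICB machines, the same authorized_keys file is visible from every node.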

Windows Specific Notes

OpenSSH is already installed and configured on the Linux machines available to the ICB. While the Hadoop developers do not recommend Windows for use in a production environment, it may be useful to use Windows desktops during the development and test phase. In order to have Windows machines participate in a Hadoop workgroup, an ssh daemon must be running on each desktop to accept remote tasks. Fortunately, Cygwin provides an implementation of OpenSSH for the Windows operating system. This section assumes that the reader knows how to install Cygwin and optional packages. Once Cygwin is installed and configured properly, execute the following commands to configure the sshd process.

 $ chmod +r /etc/passwd
 $ chmod +r /etc/group
 $ chmod +rx /var
 $ ssh-host-config
  • If the script asks you about "privilege separation", answer yes
  • If the script asks about "create a local user sshd on this machine", answer yes
  • If the script asks you about "install sshd as a service", answer yes

To start the sshd service, execute

 cygrunsrv --start sshd

At this point, the sshd service should be configured and running, and it will appear in the Windows Service Manager as something like "CYGWIN sshd". The daemon process will start again on reboot and can be managed like any other Windows service.
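
For example, the service can be stopped and its status checked with cygrunsrv, or through the standard Services control panel:

 $ cygrunsrv --stop sshd
 $ cygrunsrv --query sshd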

To verify that the server is running properly, ssh to localhost; you should see something similar to the following:

 $ ssh localhost
 Last login: Tue Oct 21 16:21:13 2008 from localhost
 Fanfare!!!
 You are successfully logged in to this server!!!
 $

It is very likely that the Windows, Norton or McAfee firewall will be blocking remote access to port 22 by default. This port must be opened to allow access from other machines. Once the firewall is configured, the last step is to enable password-free login by generating the appropriate SSH private/public keys as described in the OpenSSH section above.
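
On the built-in Windows firewall, opening the port can be done from an Administrator command prompt roughly as follows; this is only a sketch for the XP-era firewall, and Norton or McAfee have their own configuration dialogs:

 > netsh firewall add portopening TCP 22 sshd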

Windows user names containing spaces

You will run into problems if your Windows user name contains spaces. Specifically, Hadoop will try to create a /tmp/${user.name} directory, which will fail because the space is not escaped.

To solve this issue, modify the file conf/hadoop-defaults.xml and replace the ${user.name} portion of hadoop.tmp.dir with a value that contains no spaces:

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/[FIX THIS]</value>
   <description>A base for other temporary directories.</description>
 </property>
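
For example, for a hypothetical Windows account named "John Smith", a space-free value such as the following could be used (the user name jsmith here is purely illustrative):

 <property>
   <name>hadoop.tmp.dir</name>
   <value>/tmp/hadoop-jsmith</value>   <!-- hypothetical space-free user name -->
   <description>A base for other temporary directories.</description>
 </property>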

Configuration

This section describes some of the primary configuration settings used for the Hadoop setup at the ICB. Detailed instructions are available at http://hadoop.apache.org/.

Directory Structure

There are two important base directories used by Hadoop. The first, referred to as HADOOP_HOME, is the base directory of the Hadoop installation; source code, libraries, scripts, configuration files and examples all reside relative to it. On the ICB systems, HADOOP_HOME is set to ~hadoop/hadoop (the second "hadoop" is not a typo but a symbolic link to the version currently in use, e.g., hadoop-0.19.1). The second base directory, somewhat misleadingly referred to as hadoop.tmp.dir, defines the base directory for the Hadoop FileSystem. On the ICB systems it is generally set to /scratchLocal/hadoop.

Hadoop node configurations

The current configuration assigns both the NameNode and the JobTracker to chico.med.cornell.edu. This will likely change in the future.

NameNode

Settings for the NameNode and DFS are found in the file $HADOOP_HOME/conf/core-site.xml. The configuration for the ICB is as follows:

 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://chico.med.cornell.edu:9000/</value>
   </property>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/scratchLocal/hadoop/hadoop-${user.name}</value>
     <description>A base for other temporary directories.</description>
   </property>
 </configuration>

JobTracker

Settings for the JobTracker and MapReduce tasks are found in the file $HADOOP_HOME/conf/mapred-site.xml. The configuration for the ICB is as follows:

 <configuration>
   <property>
     <name>mapred.job.tracker</name>
     <value>chico.med.cornell.edu:9001</value>
   </property>
 </configuration>

Scheduler

The ICB Hadoop cluster is configured to use a pluggable Map/Reduce scheduler called the Fair Scheduler.
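
For reference, enabling the contrib Fair Scheduler in this version of Hadoop generally amounts to copying the fair scheduler jar from the contrib directory into $HADOOP_HOME/lib and pointing the JobTracker at the scheduler class in mapred-site.xml; a sketch of the relevant property (the ICB cluster already has this configured):

 <property>
   <name>mapred.jobtracker.taskScheduler</name>
   <value>org.apache.hadoop.mapred.FairScheduler</value>
 </property>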

Master and Slave Node Configuration

On The Master Node

Edit the file $HADOOP_HOME/conf/masters to specify the master node. In our case this will be

 chico.med.cornell.edu

Edit the file $HADOOP_HOME/conf/slaves to specify the slave nodes. In our case this will be

 claudel.med.cornell.edu
 rodin.med.cornell.edu

More nodes may be added to the slaves file in the future.

On Each Slave Node

Edit the file $HADOOP_HOME/conf/masters to specify the master node. In our case this will be

 chico.med.cornell.edu

The slaves do not need to know about one another. However, since the main user directories are shared, each slave node should be configured with its own fully qualified domain name rather than "localhost". Using the FQDN in the configurations will avoid problems with ssh keys in the hadoop user account.

Hadoop Specific Environment Variables

The file $HADOOP_HOME/conf/hadoop-env.sh contains environment settings for Hadoop. The only required environment variable is JAVA_HOME. On the ICB systems, JAVA_HOME is set to /softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist

Since the hadoop user account's home directory is served over NFS, the location of the log files created by Hadoop is overridden: HADOOP_LOG_DIR is set to /scratchLocal/hadoop/logs
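
The corresponding entries in conf/hadoop-env.sh look roughly like this:

 export JAVA_HOME=/softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist
 export HADOOP_LOG_DIR=/scratchLocal/hadoop/logs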

Firewall

This section describes the various ports used by Hadoop to communicate between nodes and to provide status and administration pages. The ports listed here are the default values and can be overridden in the site configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml); an example override is shown after the table.

 Name                              Port   Description
 fs.default.name                   9000   The port the NameNode listens on.
 mapred.job.tracker                9001   The port the MapReduce JobTracker listens on.
 dfs.datanode.address              50010  The port the DataNode data server listens on.
 dfs.datanode.ipc.address          50020  The DataNode IPC server port.
 mapred.job.tracker.http.address   50030  The JobTracker HTTP server port.
 mapred.task.tracker.http.address  50060  The TaskTracker HTTP server port.
 dfs.http.address                  50070  The NameNode HTTP server port.
 dfs.datanode.http.address         50075  The DataNode HTTP server port.
 dfs.secondary.http.address        50090  The secondary NameNode HTTP server port.
 dfs.https.address                 50470  The NameNode HTTPS server port.
 dfs.datanode.https.address        50475  The DataNode HTTPS server port.
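
As an illustration, a hypothetical override moving the NameNode web interface to a non-default port would typically be placed in conf/hdfs-site.xml:

 <property>
   <name>dfs.http.address</name>
   <value>0.0.0.0:50071</value>   <!-- hypothetical non-default port -->
 </property>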

Administration

Tasks in this section are only intended to be run from the hadoop administrator account (unless you are running a single node configuration locally). All instructions in this section assume the user is in the $HADOOP_HOME directory.

Initializing the Hadoop FileSystem (HDFS)

To format a new distributed filesystem, execute the following commands on the designated NameNode:

 $ bin/hadoop namenode -format
 $ bin/hadoop fs -mkdir /user
 $ bin/hadoop fs -chmod +w /user

Typically this will only need to be performed once, as it erases any existing HDFS data.

Hadoop Startup

Start the HDFS with the following command, run on the designated NameNode:

 $ bin/start-dfs.sh

This script starts the DataNode daemons on all the listed slaves.

Start Map-Reduce with the following command, run on the designated JobTracker:

 $ bin/start-mapred.sh

This script starts the TaskTracker daemons on all the listed slaves.

Note that if the NameNode and the JobTracker are running on the same machine, the bin/start-all.sh script can be used to start both the HDFS and Map-Reduce daemons.

Hadoop Shutdown

Stop Map/Reduce with the following command, run on the designated JobTracker:

 $ bin/stop-mapred.sh

This script stops the TaskTracker daemons on all the listed slaves.

Stop HDFS with the following command, run on the designated NameNode:

 $ bin/stop-dfs.sh

This script stops the DataNode daemons on all the listed slaves.

Note that if the NameNode and the JobTracker are running on the same machine, the bin/stop-all.sh script can be used to stop both the HDFS and Map-Reduce daemons.

Using Hadoop

All the hadoop commands are invoked by the hadoop script located in $HADOOP_HOME/bin. Running the hadoop script without any arguments prints a description of all commands. Users wishing to use the Hadoop cluster should add the following lines to ~/.bash_profile (or the initialization file corresponding to your default shell):

 export HADOOP_HOME=/home/hadoop/hadoop
 export PATH=${HADOOP_HOME}/bin:$PATH

The hadoop script itself will load the appropriate configuration.
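
After opening a new shell (or sourcing ~/.bash_profile), a quick sanity check is to confirm that the script is on the path and the cluster is reachable:

 $ which hadoop
 $ hadoop fs -ls /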

Interacting with the Hadoop FileSystem (HDFS)

The commands used to interact with the files in hadoop are very much like the commands found in the Unix operating system. The FileSystem shell is invoked by "bin/hadoop fs <args>". All the FileSystem shell commands take path URIs as arguments. A few basic commands are listed below. For a complete list of hadoop commands refer to the Hadoop Shell Commands manual.

cat
Copies source paths to stdout. Usage: hadoop fs -cat URI [URI …]
Example:
   $ hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
   $ hadoop fs -cat file:///file3 /user/hadoop/file4
cp
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory. Usage: hadoop fs -cp URI [URI …] <dest>
Example:
   $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
   $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
get
Copy files to the local file system. Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Example:
   $ hadoop fs -get /user/hadoop/file localfile
   $ hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
mkdir
Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path. Usage: hadoop fs -mkdir <paths>
Example:
   $ hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
   $ hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
mv
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted. Usage: hadoop fs -mv URI [URI …] <dest>
Example:
   $ hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
   $ hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
put
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem. Usage: hadoop fs -put <localsrc> ... <dst>
Example:
   $ hadoop fs -put localfile /user/hadoop/hadoopfile
   $ hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
   $ hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
   $ hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
rm
Delete files specified as args. Does not delete recursively; use rmr to remove directories and their contents. Usage: hadoop fs -rm URI [URI …]
Example:
   $ hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

Submitting Tasks to Hadoop

Most often, users will want to submit their own code to the Hadoop cluster for processing. Bundle the MapReduce code in a jar file and execute it with this command:

 $ hadoop jar <jar> [mainClass] args... 

or by simply providing the name of a class to execute

 $ hadoop CLASSNAME args...

For a complete list of hadoop commands refer to the Hadoop Commands Manual.
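
As a concrete illustration, the word count program bundled with the Hadoop examples jar can be submitted as follows; the input and output paths (and the username jsmith) are hypothetical, and the jar name should match the installed release:

 $ hadoop fs -put myinput.txt /user/jsmith/wordcount/input/myinput.txt
 $ hadoop jar ${HADOOP_HOME}/hadoop-0.20.0-examples.jar wordcount \
       /user/jsmith/wordcount/input /user/jsmith/wordcount/output
 $ hadoop fs -cat /user/jsmith/wordcount/output/part-00000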

Monitoring Hadoop

Hadoop provides a web interface for job tracking and for examining the state of the NameNode. The Hadoop filesystem can be browsed, and the log files for each node in the cluster retrieved, through the NameNode web interface.
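
Assuming the default ports listed in the Firewall section and the current NameNode/JobTracker assignment, the web interfaces are reachable at addresses along these lines:

 http://chico.med.cornell.edu:50030/   (JobTracker: job and task status)
 http://chico.med.cornell.edu:50070/   (NameNode: HDFS status, file browser, node logs)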

Using Hadoop remotely with SSH

Generally speaking, interacting with Hadoop is done directly from a node in the cluster (e.g., rodin, claudel). It is possible to access the Hadoop filesystem and submit jobs from remote machines as long as you have the appropriate permissions to do so.

A local copy of the Hadoop distribution should be installed on your machine; once again, the software is available from the Hadoop Core Release page. The ICB Hadoop client configuration files hadoop-client-conf.zip should be placed into the conf directory of the local Hadoop installation. The value of hadoop.job.ugi needs to be changed to reflect your ICB username; the username must appear twice, separated by a comma (if your username is jsmith, the value would be jsmith,jsmith). You should NOT attempt to start or stop the DFS or JobTracker on your local machine.
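
Using the hypothetical username jsmith from above, the corresponding entry in the client configuration would look like:

 <property>
   <name>hadoop.job.ugi</name>
   <value>jsmith,jsmith</value>
 </property>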

In order to access the hadoop cluster from your local desktop, an SSH session to a hadoop node must be established by opening a dynamic port-forwarding tunnel. This isn’t as hard as it sounds; it’s accomplished by:

 $ ssh -ND 2600 you@rodin.med.cornell.edu

The -D 2600 instructs SSH to open a SOCKS proxy on local port 2600. The SOCKS-based SocketFactory in Hadoop will then create connections forwarded over this SOCKS proxy. After this connection is established, you can minimize the ssh session and forget about it. (Alternatively you can add the option "-f" which will run the ssh tunnel in the background). Then just run Hadoop jobs in another terminal the normal way:

 $ ${HADOOP_HOME}/bin/hadoop fs -ls /
 $ ${HADOOP_HOME}/bin/hadoop jar myJarFile.jar myMainClass
   ...

See Securing a Hadoop Cluster Through a Gateway for more details.
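
For reference, the standard Hadoop properties that route client RPC over the SOCKS proxy (these are expected to be part of the client configuration files mentioned above, so you should not normally need to add them yourself) are:

 <property>
   <name>hadoop.rpc.socket.factory.class.default</name>
   <value>org.apache.hadoop.net.SocksSocketFactory</value>
 </property>
 <property>
   <name>hadoop.socks.server</name>
   <value>localhost:2600</value>
 </property>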

Notes

Recommended Practices

  • The HDFS is NOT backed up at all. Be sure to copy any critical data, or any data you need to keep, into your home directory, where it is archived on a regular basis.
  • Avoid using "localhost" in the configuration files. This gets messy when using shared configuration directories. Always use the fully qualified names.

Error Messages

Host key verification failed 
This generally means that the remote host has never been authenticated before. To set up the authentication, log into the host at least once from the NameNode, using ssh as the user running the Hadoop processes, so that the host key is added to known_hosts.
WARN fs.FileSystem: "chico.med.cornell.edu:9000" is a deprecated filesystem name. Use "hdfs://chico.med.cornell.edu:9000/" instead
The Hadoop examples use the old format for the filesystem name; the ICB setup uses the new hdfs:// URI format.
INFO ipc.Client: Retrying connect to server: chico.med.cornell.edu/157.139.217.3:9000
You are likely trying to execute a Hadoop command from a machine which is firewalled from the Hadoop server.

Annoyances

  • It seems that the hadoop examples and the hadoop streaming framework cannot handle a job whose output directory already exists. The output directory must be removed first by running:
 $ hadoop fs -rmr <path>
