Hadoop
This page describes how to configure the open source implementation of the MapReduce framework called Hadoop for use in the ICB.
Prerequisites
Java
Hadoop requires Java 1.6 in order to execute. Virtually every ICB Linux server provides the Sun JDK 1.6 by default. See the Softlib documentation for information on using software packages available on the servers.
Hadoop Distribution
The Hadoop software is available from the Hadoop Core Release page. As of this writing, 0.20.0 is the latest version and is installed in the hadoop account on the Linux machines available to the ICB.
OpenSSH
OpenSSH is required for Hadoop to execute in a distributed fashion. See the PBTech wiki entry about setting up Password Free Logins.
Windows Specific Notes
OpenSSH is already installed and configured on the Linux machines available to the ICB. While the Hadoop developers do not recommend Windows for production use, Windows desktops can be useful during development and testing. For a Windows desktop to participate in a Hadoop workgroup, an ssh daemon must be running on it to accept remote tasks. Cygwin provides an implementation of OpenSSH for Windows. This section assumes that the reader knows how to install Cygwin and its optional packages. Once Cygwin is installed and configured properly, execute the following commands to configure the sshd process.
$ chmod +r /etc/passwd
$ chmod +r /etc/group
$ chmod +rx /var
$ ssh-host-config
- If the script asks you about "privilege separation", answer yes
- If the script asks about "create a local user sshd on this machine", answer yes
- If the script asks you about "install sshd as a service", answer yes
To start the sshd service, execute
cygrunsrv --start sshd
At this point, the sshd service should be configured and running and it will appear in the Windows Service Manager as something like "CYGWIN sshd". The daemon process should start again upon reboot. It can be managed like any other windows service at this point.
To verify that the server is running properly, log in to localhost. You should see something similar to the following:
$ ssh localhost
Last login: Tue Oct 21 16:21:13 2008 from localhost
Fanfare!!!
You are successfully logged in to this server!!!
$
It is very likely that the Windows, Norton or McAfee firewall will be blocking remote access to port 22 by default. This port must be open to allow access from other machines. Once the firewall is configured, the last step would be to enable the password free login by generating the appropriate ssh private/public keys as listed in the OpenSSH section above.
Windows user name contains a space in it
You will run into problems if your Windows user name contains spaces. Specifically, Hadoop will try to create a temporary directory whose name includes ${user.name} (by default under /tmp). This will fail because the space is not escaped.
To solve this issue, modify the file conf/hadoop-defaults.xml so that ${user.name} is replaced by a value that contains no spaces.
<property>
<name>hadoop.tmp.dir</name>
<value>/tmp/[FIX THIS]</value>
<description>A base for other temporary directories.</description>
</property>
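One way to pick the replacement value is to derive it from the Windows user name by substituting the spaces. The following is only a sketch: the helper name safe_hadoop_user and the example user name are hypothetical, and any value without spaces would work equally well.

```shell
# Hypothetical helper: turn a user name into a space-free directory name.
safe_hadoop_user() {
    # Replace every space with an underscore.
    printf '%s\n' "$1" | tr ' ' '_'
}

# Print a candidate value for hadoop.tmp.dir for an example user name.
echo "/tmp/$(safe_hadoop_user "John Smith")"
```

The printed path can then be pasted into the value element in place of the placeholder.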
Configuration
This section describes some of the primary configuration settings used for the Hadoop setup at the ICB. Detailed instructions are available at http://hadoop.apache.org/.
Directory Structure
There are two important base directories used by Hadoop. The first, referred to as HADOOP_HOME, is the base directory for the Hadoop installation. Source code, libraries, scripts, configuration files and examples all reside relative to this directory. On the ICB systems, HADOOP_HOME is set to ~hadoop/hadoop. (The second hadoop is not a typo; it is a link to the version currently in use, e.g., hadoop-0.19.1.) The second base directory, somewhat misleadingly referred to as hadoop.tmp.dir, defines the base directory for the Hadoop FileSystem. This directory is generally set to /scratchLocal/hadoop on the ICB systems.
Hadoop node configurations
The current configuration assigns both the NameNode and the JobTracker to chico.med.cornell.edu. This will likely change in the future.
NameNode
Settings for the NameNode and DFS are found in the file $HADOOP_HOME/conf/core-site.xml. The configuration for the ICB is as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://chico.med.cornell.edu:9000/</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/scratchLocal/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
JobTracker
Settings for the JobTracker and MapReduce tasks are found in the file $HADOOP_HOME/conf/mapred-site.xml. The configuration for the ICB is as follows:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>chico.med.cornell.edu:9001</value>
</property>
</configuration>
Scheduler
The ICB Hadoop cluster is configured to use a pluggable Map/Reduce scheduler called the Fair Scheduler.
Master and Slave Node Configuration
On The Master Node
Edit the file $HADOOP_HOME/conf/masters to specify the master node. In our case this will be
chico.med.cornell.edu
Edit the file $HADOOP_HOME/conf/slaves to specify the slave nodes. In our case this will be
claudel.med.cornell.edu
rodin.med.cornell.edu
More nodes may be added to the slaves file in the future.
On Each Slave Node
Edit the file $HADOOP_HOME/conf/masters to specify the master node. In our case this will be
chico.med.cornell.edu
The slaves do not need to know about one another. However, since the main user directories are shared, each slave node should be configured with its own fully qualified domain name (FQDN) and not "localhost". Using the FQDN in the configurations will avoid problems with ssh keys in the hadoop user account.
Hadoop Specific Environment Variables
The file $HADOOP_HOME/conf/hadoop-env.sh contains environment settings for Hadoop. The only required environment variable is JAVA_HOME. On the ICB systems, JAVA_HOME is set to /softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist
Since the hadoop user account is in a directory served by nfs, we need to override the location for the log files created by hadoop. We therefore set HADOOP_LOG_DIR to /scratchLocal/hadoop/logs
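Hadoop's environment file is a shell script read by the launcher scripts, so both settings are plain export statements. A minimal excerpt using the two ICB values given above might look like this (the rest of the file is left at its shipped defaults):

```shell
# Location of the JDK used to run the Hadoop daemons (ICB value from above).
export JAVA_HOME=/softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist

# Keep log files on local scratch space instead of the NFS-served home directory.
export HADOOP_LOG_DIR=/scratchLocal/hadoop/logs
```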
Firewall
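This section describes the ports used by Hadoop to communicate between nodes or to provide status and administration. The ports for fs.default.name and mapred.job.tracker are the values configured for the ICB above; the remaining ports are Hadoop's default values. All of them can be overridden in the site configuration files (hadoop-site.xml in releases before 0.20).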
| Name | Port | Description |
|---|---|---|
| fs.default.name | 9000 | The port that the name node will listen to. |
| mapred.job.tracker | 9001 | The port that the MapReduce job tracker will listen to. |
| dfs.datanode.address | 50010 | The port that the datanode server will listen to. |
| dfs.datanode.ipc.address | 50020 | The datanode ipc server port. |
| mapred.job.tracker.http.address | 50030 | The job tracker http server port the server will listen on. |
| mapred.task.tracker.http.address | 50060 | The task tracker http server port. |
| dfs.http.address | 50070 | The dfs namenode http server port the server will listen on. |
| dfs.https.address | 50470 | The dfs namenode https server port the server will listen on. |
| dfs.datanode.http.address | 50075 | The datanode http server port. |
| dfs.datanode.https.address | 50475 | The datanode https server port. |
| dfs.secondary.http.address | 50090 | The secondary namenode http server port. |
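To see which of the ports in the table are reachable from a given machine, a small loop over the port list can help. This is a sketch only: it assumes a netcat (nc) that supports the -z and -w options, and the function name check_hadoop_ports is hypothetical. Since each daemon runs only on some hosts, a port reported as closed may simply belong to a daemon that is not running on the host being checked.

```shell
# Ports from the table above, in the same order.
HADOOP_PORTS="9000 9001 50010 50020 50030 50060 50070 50470 50075 50475 50090"

# Report whether each port on the given host accepts TCP connections.
# Assumes nc supports -z (connect without sending data) and -w (timeout in seconds).
check_hadoop_ports() {
    host=$1
    for port in $HADOOP_PORTS; do
        if nc -z -w 2 "$host" "$port" >/dev/null 2>&1; then
            echo "$host:$port open"
        else
            echo "$host:$port blocked or not listening"
        fi
    done
}

# Example (run from the machine whose access you want to verify):
#   check_hadoop_ports chico.med.cornell.edu
```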
Administration
Tasks in this section are only intended to be run from the hadoop administrator account (unless you are running a single node configuration locally). All instructions in this section assume the user is in the $HADOOP_HOME directory.
Initializing the Hadoop FileSystem (HDFS)
To format a new distributed filesystem execute the following command on the designated NameNode.
$ bin/hadoop namenode -format
$ bin/hadoop fs -mkdir /user
$ bin/hadoop fs -chmod +w /user
Typically this will only need to be performed once as it will delete any previous installation of the HDFS.
Hadoop Startup
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
This script starts the DataNode daemons on all the listed slaves.
Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
This script starts the TaskTracker daemons on all the listed slaves.
Note that if the NameNode and the JobTracker are running on the same machine, the bin/start-all.sh script can be used to start both the HDFS and Map-Reduce daemons.
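After startup, one way to confirm which daemons are actually running on a node is the jps tool that ships with the JDK, which lists Java processes by main class name. The filter below is a sketch; the function name is hypothetical and it simply keeps the lines that mention the daemon names used in this section.

```shell
# Keep only the lines of jps output that name a Hadoop daemon.
# Reads jps-style lines ("<pid> <MainClass>") on standard input.
filter_hadoop_daemons() {
    grep -E 'NameNode|DataNode|JobTracker|TaskTracker'
}

# Example (run on the node you want to inspect):
#   jps | filter_hadoop_daemons
```

On a slave node you would expect to see DataNode and TaskTracker entries; on the master, NameNode and JobTracker.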
Hadoop Shutdown
Stop Map-Reduce with the following command, run on the designated JobTracker:
$ bin/stop-mapred.sh
This script stops the TaskTracker daemons on all the listed slaves.
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
This script stops the DataNode daemons on all the listed slaves.
Note that if the NameNode and the JobTracker are running on the same machine, the bin/stop-all.sh script can be used to stop both the HDFS and Map-Reduce daemons.
Using Hadoop
All the hadoop commands are invoked by the hadoop script located in $HADOOP_HOME/bin. Running the hadoop script without any arguments prints the description for all commands. Users wishing to use the Hadoop cluster should add the following lines to ~/.bash_profile (or the initialization file corresponding to your default shell):
export HADOOP_HOME=/home/hadoop/hadoop
export PATH=${HADOOP_HOME}/bin:$PATH
The hadoop script itself will load the appropriate configuration.
Interacting with the Hadoop FileSystem (HDFS)
The commands used to interact with the files in hadoop are very much like the commands found in the Unix operating system. The FileSystem shell is invoked by "bin/hadoop fs <args>". All the FileSystem shell commands take path URIs as arguments. A few basic commands are listed below. For a complete list of hadoop commands refer to the Hadoop Shell Commands manual.
- cat
- Copies source paths to stdout. Usage: hadoop fs -cat URI [URI …]
- Example:
- $ hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
- $ hadoop fs -cat file:///file3 /user/hadoop/file4
- cp
- Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory. Usage: hadoop fs -cp URI [URI …] <dest>
- Example:
- $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
- get
- Copy files to the local file system. Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
- Example:
- $ hadoop fs -get /user/hadoop/file localfile
- $ hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
- mkdir
- Takes path URIs as arguments and creates directories. The behavior is much like Unix mkdir -p, creating parent directories along the path. Usage: hadoop fs -mkdir <paths>
- Example:
- $ hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
- $ hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
- mv
- Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted. Usage: hadoop fs -mv URI [URI …] <dest>
- Example:
- $ hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
- $ hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
- put
- Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem. Usage: hadoop fs -put <localsrc> ... <dst>
- Example:
- $ hadoop fs -put localfile /user/hadoop/hadoopfile
- $ hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
- $ hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
- $ hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
- rm
- Delete files specified as args. This command does not delete directories recursively; use rmr for recursive deletes. Usage: hadoop fs -rm URI [URI …]
- Example:
- $ hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
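The commands above can be combined into a short round trip that copies a local file into HDFS, reads it back, and cleans up. This is a sketch under a few assumptions: the function name hdfs_round_trip is hypothetical, the local file and HDFS directory are chosen by the caller, and the HDFS directory is assumed not to exist yet.

```shell
# Copy a local file into HDFS, read it back, fetch a copy, and remove it again.
# Arguments: a local file and an HDFS directory (both chosen by the caller).
# The chain stops at the first step that fails.
hdfs_round_trip() {
    local_file=$1
    hdfs_dir=$2
    name=$(basename "$local_file")

    hadoop fs -mkdir "$hdfs_dir" &&
    hadoop fs -put "$local_file" "$hdfs_dir/$name" &&
    hadoop fs -cat "$hdfs_dir/$name" &&
    hadoop fs -get "$hdfs_dir/$name" "$local_file.copy" &&
    hadoop fs -rm "$hdfs_dir/$name"
}

# Example:
#   hdfs_round_trip notes.txt /user/jsmith/demo
```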
Submitting Tasks to Hadoop
Most often users will want to submit their code to the Hadoop cluster for processing. Users can bundle their Map Reduce code in a jar file and execute it using this command.
$ hadoop jar <jar> [mainClass] args...
or by simply providing the name of a class to execute
$ hadoop CLASSNAME args...
For a complete list of hadoop commands refer to the Hadoop Commands Manual.
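As a concrete instance of the jar form, the sketch below runs the wordcount program from the examples jar that ships with the Hadoop distribution. It assumes the jar sits at the top of $HADOOP_HOME under a name matching hadoop-*-examples.jar; the function name and the input and output paths are placeholders.

```shell
# Run the bundled wordcount example on an HDFS input path.
# Arguments: HDFS input path and HDFS output directory.
run_wordcount() {
    input=$1
    output=$2
    # The glob is left unquoted so the shell expands it to the versioned jar name.
    hadoop jar "$HADOOP_HOME"/hadoop-*-examples.jar wordcount "$input" "$output"
}

# Example:
#   run_wordcount /user/jsmith/books /user/jsmith/wordcount-out
```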
Monitoring Hadoop
Hadoop provides a web interface for job tracking and also for examining the state of the namenode. The hadoop filesystem and log files for each node in the hadoop cluster can also be retrieved using the web interface for the NameNode.
- JobTracker: http://chico.med.cornell.edu:50030/
- NameNode: http://chico.med.cornell.edu:50070/
- Scheduler: http://chico.med.cornell.edu:50030/scheduler
Using Hadoop remotely with SSH
Generally speaking, interacting with Hadoop is done directly from a node in the cluster (e.g., rodin, claudel). It is possible to access the Hadoop filesystem and submit jobs from remote machines as long as you have the appropriate permissions to do so.
A local copy of the Hadoop distribution should be installed on your machine. As noted above, the Hadoop software is available from the Hadoop Core Release page. The ICB Hadoop client configuration files, hadoop-client-conf.zip, should be placed into the conf directory of the local Hadoop installation. The value of hadoop.job.ugi must be changed to your ICB username, written twice and separated by a comma: if your username is jsmith, the value would be jsmith,jsmith. You should NOT attempt to start or stop the DFS or JobTracker on your local machine.
In order to access the hadoop cluster from your local desktop, an SSH session to a hadoop node must be established by opening a dynamic port-forwarding tunnel. This isn’t as hard as it sounds; it’s accomplished by:
$ ssh -ND 2600 you@rodin.med.cornell.edu
The -D 2600 instructs SSH to open a SOCKS proxy on local port 2600. The SOCKS-based SocketFactory in Hadoop will then create connections forwarded over this SOCKS proxy. After this connection is established, you can minimize the ssh session and forget about it. (Alternatively you can add the option "-f" which will run the ssh tunnel in the background). Then just run Hadoop jobs in another terminal the normal way:
$ ${HADOOP_HOME}/bin/hadoop fs -ls /
$ ${HADOOP_HOME}/bin/hadoop jar myJarFile.jar myMainClass
...
See Securing a Hadoop Cluster Through a Gateway for more details.
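If you would rather not keep a tunnel open between commands, the tunnel and the Hadoop command can be wrapped together. The helper below is a sketch, not part of the ICB setup: its name is hypothetical, it assumes password-free login to the cluster node (a background ssh cannot prompt for a password), and it waits a fixed two seconds for the tunnel instead of checking that it is ready.

```shell
# Run one command while a SOCKS tunnel to a cluster node is open.
# Usage: with_hadoop_tunnel user@node command [args...]
with_hadoop_tunnel() {
    gateway=$1
    shift

    # Open the dynamic port-forwarding tunnel on local port 2600 in the background.
    ssh -N -D 2600 "$gateway" &
    tunnel_pid=$!

    # Crude fixed wait for the tunnel to come up; adjust as needed.
    sleep 2

    # Run the requested command and remember its exit status.
    status=0
    "$@" || status=$?

    # Close the tunnel whether or not the command succeeded.
    kill "$tunnel_pid" 2>/dev/null || true
    wait "$tunnel_pid" 2>/dev/null || true
    return "$status"
}

# Example:
#   with_hadoop_tunnel you@rodin.med.cornell.edu "${HADOOP_HOME}/bin/hadoop" fs -ls /
```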
Documentation
- The main Hadoop web site: http://hadoop.apache.org/
- The official MapReduce tutorial: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
- Hadoop classes API (0.20.0): http://hadoop.apache.org/core/docs/r0.20.0/api/
- Hadoop Wiki: http://wiki.apache.org/hadoop/
- Hadoop Fair Scheduler: http://hadoop.apache.org/core/docs/current/fair_scheduler.html
Notes
Recommended Practices
- The HDFS is NOT backed up at all. Be sure to copy any data that is critical, or that you need to keep, into your home directory, where it is archived on a regular basis.
- Avoid using "localhost" in the configuration files. This gets messy when using shared configuration directories. Always use the fully qualified names.
Error Messages
- Host key verification failed
- This generally means that the remote host has never been authenticated before. To set up the authentication, log into the host at least once from the name node using ssh as the user running the hadoop process.
- WARN fs.FileSystem: "chico.med.cornell.edu:9000" is a deprecated filesystem name. Use "hdfs://chico.med.cornell.edu:9000/" instead
- The hadoop examples use an old format for the setup. We use the new format.
- INFO ipc.Client: Retrying connect to server: chico.med.cornell.edu/157.139.217.3:9000
- You are likely trying to execute a Hadoop command from a machine which is firewalled from the Hadoop server.
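For the "Host key verification failed" case, logging into every slave once can be scripted from the name node. The loop below is a sketch that reads host names from a slaves file; the function name is hypothetical. It should be run interactively as the user running the hadoop process, since ssh will ask you to confirm each new host key.

```shell
# Log into every host listed in a slaves file once so its host key gets recorded.
# Argument: path to the slaves file (one host name per line).
prime_host_keys() {
    slaves_file=$1
    while read -r host; do
        # Skip blank lines.
        [ -n "$host" ] || continue
        # -n keeps ssh from consuming the rest of the slaves file on stdin.
        ssh -n "$host" hostname
    done < "$slaves_file"
}

# Example (from $HADOOP_HOME on the name node):
#   prime_host_keys conf/slaves
```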
Annoyances
- It seems that the hadoop examples and the hadoop streaming framework cannot handle a job where the output directory already exists. This needs to be removed first by running:
$ hadoop fs -rmr <path>
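To avoid an error from rmr when the directory is not there, the removal can be guarded with the FileSystem shell's -test -e check, which returns 0 when the path exists. The helper name below is hypothetical; treat this as a sketch and double-check the path before using it, since rmr deletes recursively.

```shell
# Remove an HDFS output directory only if it already exists.
# Argument: the HDFS path of the job's output directory.
clear_output_dir() {
    if hadoop fs -test -e "$1"; then
        hadoop fs -rmr "$1"
    fi
}

# Example, before resubmitting a job:
#   clear_output_dir /user/jsmith/wordcount-out
```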
