Hadoop

From Icbwiki

(Difference between revisions)
Jump to: navigation, search
Revision as of 20:43, 31 October 2008
Kdorff (Talk | contribs)
(NameNode and JobTracker)
← Previous diff
Revision as of 20:44, 31 October 2008
Kdorff (Talk | contribs)
(NameNode and JobTracker)
Next diff →
Line 52: Line 52:
<name>mapred.job.tracker</name> <name>mapred.job.tracker</name>
<value>chico.med.cornell.edu:9001</value> <value>chico.med.cornell.edu:9001</value>
- </property> 
- <property> 
- <name>dfs.replication</name> 
- <value>1</value> 
</property> </property>
<property> <property>

Revision as of 20:44, 31 October 2008

This page describes how to configure the open source implementation of the MapReduce framework called Hadoop for use in the ICB.

Contents

Prerequisites

Java

Hadoop requires Java 1.5 in order to execute. Virtually every ICB Linux server provides the Sun JDK 1.6 by default. See the Softlib documentation for information on using software packages available on the servers.

Hadoop Distribution

The hadoop software is available at the Hadoop Core Release. As of this writing, 0.18.1 is the latest version and is installed into the hadoop account on the Linux machines available to the ICB.

OpenSSH

OpenSSH is required for Hadoop to execute in a distributed fashion. See the PBTech wiki entry about setting up Password Free Logins.

Windows Specific Notes

OpenSSH is already installed and configured on the Linux machines available to the ICB. While the Hadoop developers do not recommend Windows for use in a production environment, it may be useful to use Windows desktops while in the development and test phase. In order to have windows participate in a Hadoop workgroup, an ssh daemon must be running on each desktop in order to accept remote tasks. Fortunately, Cygwin provides an implementation of OpenSSH for Windows operating system. This section assumes that the reader knows how to install cygwin and optional packages. Once cygwin is installed and configured properly, execute the following commands to configure the sshd process.

 $ chmod +r /etc/passwd
 $ chmod +r /etc/group
 $ chmod +rx /var
 $ ssh-host-config
  • If the script asks you about "privilege separation", answer yes
  • If the script asks about "create a local user sshd on this machine", answer yes
  • If the script asks you about "install sshd as a service", answer yes

To start the sshd service, execute

 cygrunsrv --start sshd

At this point, the sshd service should be configured and running and it will appear in the Windows Service Manager as something like "CYGWIN sshd". The daemon process should start again upon reboot. It can be managed like any other windows service at this point.

To verify that the server is running properly, you should see something similar to the following

 $ ssh localhost
 Last login: Tue Oct 21 16:21:13 2008 from localhost
 Fanfare!!!
 You are successfully logged in to this server!!!
 $

It is very likely that the Windows, Norton or McAfee firewall will be blocking remote access to port 22 by default. This port must be open to allow access from other machines. Once the firewall is configured, the last step would be to enable the password free login by generating the appropriate ssh private/public keys as listed in the OpenSSH section above.

Configuration

This section describes some of the primary configuration settings used for te hadoop setup at the ICB. Detailed instructions are available at http://hadoop.apache.org/.

Directory Structure

There are two important base directories used by hadoop. The first, referred to as HADOOP_HOME is the base directory for the hadoop installation. Source code, libraries, scripts, configuration files and examples all reside relative to this directory. On the ICB systems, HADOOP_HOME is set to ~hadoop/hadoop-0.18.1. The second base directory, somewhat misleadingly referred to as hadoop.tmp.dir, defines the base directory for the Hadoop FileSystem. This directory is generally set to /scratchLocal/hadoop on the ICB systems.

NameNode and JobTracker

The current configuration assigns both the NameNode and the JobTracker to chico.med.cornell.edu. This will likely change in the future. These are set in the file $HADOOP_HOME/conf/hadoop-site.xml. The main configuration for the ICB is as follows:

 <configuration>
   <property>
     <name>fs.default.name</name>
     <value>hdfs://chico.med.cornell.edu:9000/</value>
   </property>
   <property>
     <name>mapred.job.tracker</name>
     <value>chico.med.cornell.edu:9001</value>
   </property>
   <property>
     <name>hadoop.tmp.dir</name>
     <value>/scratchLocal/hadoop/hadoop-${user.name}</value>
     <description>A base for other temporary directories.</description>
   </property>
 </configuration>

Hadoop Specific Environment Variables

The file $HADOOP_HOME/conf/hadoop-site.xml contains environment settings for hadoop. The only required environment variable is JAVA_HOME. On the ICB systems, JAVA_HOME is set to /softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist.

Since the hadoop user account is in a directory served by nfs, we need to override the location for the log files created by hadoop. We therefore set HADOOP_LOG_DIR to /scratchLocal/hadoop/logs.

Firewall

This section describes the various ports used by hadoop to communicate between nodes or to provide status and administration. Ports listed here are the default values and can be specified in the hadoop-site.xml file.

Name Port Description
fs.default.name 9000 The port that the name node will listen to.
mapred.job.tracker 9001 The port that the MapReduce job tracker will listen to.
dfs.datanode.address 50010 The port that the datanode server will listen to.
dfs.datanode.ipc.address 50020 The datanode ipc server port.
mapred.job.tracker.http.address 50030 The job tracker http server port the server will listen on.
mapred.task.tracker.http.address 50060 The task tracker http server port.
dfs.http.address 50070 The dfs namenode http server port the server will listen on.
dfs.https.address 50470 The dfs namenode https server port the server will listen on.
dfs.datanode.http.address 50075 The datanode http server port.
dfs.datanode.https.address 50475 The datanode https server port.
dfs.secondary.http.address 50090 The secondary namenode http server port.


Using Hadoop

All the hadoop commands are invoked by the bin/hadoop script. Running hadoop script without any arguments prints the description for all commands. All instructions in this section assume the user is in the $HADOOP_HOME directory.

Hadoop Startup

To format a new distributed filesystem execute the following command on the designated NameNode.

 $ bin/hadoop namenode -format

Typically this will only need to be performed once as it will delete any previous installation of the HDFS.

Start the HDFS with the following command, run on the designated NameNode:

 $ bin/start-dfs.sh

This script starts the DataNode daemons on all the listed slaves.

Start Map-Reduce with the following command, run on the designated JobTracker:

 $ bin/start-mapred.sh

This script starts the TaskTracker daemons on all the listed slaves.

Note that if the NameNode and the JobTracker are running on the same machine, the bin/start-all.sh script can be used to start both the HDFS and Map-Reduce daemons.

Hadoop Shutdown

Stop HDFS with the following command, run on the designated NameNode:

 $ bin/stop-dfs.sh

This script stops the DataNode daemons on all the listed slaves.

Stop Map/Reduce with the following command, run on the designated the designated JobTracker:

 $ bin/stop-mapred.sh

This script stops the TaskTracker daemons on all the listed slaves.

Note that if the NameNode and the JobTracker are running on the same machine, the bin/stop-all.sh script can be used to stop both the HDFS and Map-Reduce daemons.

Interacting with the Hadoop FileSystem (HDFS)

The commands used to interact with the files in hadoop are very much like the commands found in the Unix operating system. The FileSystem shell is invoked by "bin/hadoop fs <args>". All the FileSystem shell commands take path URIs as arguments. A few basic commands are listed below. For a complete list of hadoop commands refer to the Hadoop Shell Commands manual.

cat
Copies source paths to stdout. Usage: hadoop fs -cat URI [URI …]
Example:
   $ hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
   $ hadoop fs -cat file:///file3 /user/hadoop/file4
cp
Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory. Usage: hadoop fs -cp URI [URI …] <dest>
Example:
   $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
   $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
get
Copy files to the local file system. Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
Example:
   $ hadoop fs -get /user/hadoop/file localfile
   $ hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
mkdir
Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p creating parent directories along the path. Usage: hadoop fs -mkdir <paths>
Example:
   $ hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
   $ hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
mv
Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted. Usage: hadoop fs -mv URI [URI …] <dest>
Example:
   $ hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
   $ hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
put
Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem. Usage: hadoop fs -put <localsrc> ... <dst>
Example:
   $ hadoop fs -put localfile /user/hadoop/hadoopfile
   $ hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
   $ hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
   $ hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
rm
Delete files specified as args. Only deletes non empty directory and files. Usage: hadoop fs -rm URI [URI …]
Example:
   $ hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir

Submitting Tasks to Hadoop

Most often users will want to submit their code to the Hadoop cluster for processing. Users can bundle their Map Reduce code in a jar file and execute it using this command.

 $ hadoop jar <jar> [mainClass] args... 

or by simply providing the name of a class to execute

 $ hadoop CLASSNAME args...

For a complete list of hadoop commands refer to the Hadoop Commands Manual.

Monitoring Hadoop

Hadoop provides a web interface for job tracking and also for examining the state of the namenode. The hadoop filesystem and log files for each node in the hadoop cluster can also be retrieved using the web interface for the NameNode.

Documentation

Notes

Configuration

  • In order to allow users to submit jobs, as themselves, the user directory in the HDFS needs to allow writes from users other than the hadoop user. To do this execute the following as the hadoop user:
 $ bin/hadoop fs -chmod  +w /user

Recommended Practices

  • The HDFS is NOT backed up at all. Be sure to copy any data that is critical or you need to save into your home directory where it will be archived on a regular basis.
  • Avoid using "localhost" in the configuration files. This gets messy when using shared configuration directories. Always use the fully qualified names.

Error Messages

Host key verification failed 
This generally means that the remote host has never been authenticated before. To set up the authentication, log into the host at least once from the name node using ssh as the user running the hadoop process.
WARN fs.FileSystem: "chico.med.cornell.edu:9000" is a deprecated filesystem name. Use "hdfs://chico.med.cornell.edu:9000/" instead
When changing the settings in the hadoop-site.xml configuration file to the recommended name, some of the provided examples failed. We should be aware that this may become an issue at some point.

Links

Personal tools