Hadoop
From Icbwiki
| Revision as of 20:43, 31 October 2008 Kdorff (Talk | contribs) (→NameNode and JobTracker) ← Previous diff |
Revision as of 20:44, 31 October 2008 Kdorff (Talk | contribs) (→NameNode and JobTracker) Next diff → |
||
| Line 52: | Line 52: | ||
| <name>mapred.job.tracker</name> | <name>mapred.job.tracker</name> | ||
| <value>chico.med.cornell.edu:9001</value> | <value>chico.med.cornell.edu:9001</value> | ||
| - | </property> | ||
| - | <property> | ||
| - | <name>dfs.replication</name> | ||
| - | <value>1</value> | ||
| </property> | </property> | ||
| <property> | <property> | ||
Revision as of 20:44, 31 October 2008
This page describes how to configure the open source implementation of the MapReduce framework called Hadoop for use in the ICB.
Contents |
Prerequisites
Java
Hadoop requires Java 1.5 in order to execute. Virtually every ICB Linux server provides the Sun JDK 1.6 by default. See the Softlib documentation for information on using software packages available on the servers.
Hadoop Distribution
The hadoop software is available at the Hadoop Core Release. As of this writing, 0.18.1 is the latest version and is installed into the hadoop account on the Linux machines available to the ICB.
OpenSSH
OpenSSH is required for Hadoop to execute in a distributed fashion. See the PBTech wiki entry about setting up Password Free Logins.
Windows Specific Notes
OpenSSH is already installed and configured on the Linux machines available to the ICB. While the Hadoop developers do not recommend Windows for use in a production environment, it may be useful to use Windows desktops while in the development and test phase. In order to have windows participate in a Hadoop workgroup, an ssh daemon must be running on each desktop in order to accept remote tasks. Fortunately, Cygwin provides an implementation of OpenSSH for Windows operating system. This section assumes that the reader knows how to install cygwin and optional packages. Once cygwin is installed and configured properly, execute the following commands to configure the sshd process.
$ chmod +r /etc/passwd $ chmod +r /etc/group $ chmod +rx /var $ ssh-host-config
- If the script asks you about "privilege separation", answer yes
- If the script asks about "create a local user sshd on this machine", answer yes
- If the script asks you about "install sshd as a service", answer yes
To start the sshd service, execute
cygrunsrv --start sshd
At this point, the sshd service should be configured and running and it will appear in the Windows Service Manager as something like "CYGWIN sshd". The daemon process should start again upon reboot. It can be managed like any other windows service at this point.
To verify that the server is running properly, you should see something similar to the following
$ ssh localhost Last login: Tue Oct 21 16:21:13 2008 from localhost Fanfare!!! You are successfully logged in to this server!!! $
It is very likely that the Windows, Norton or McAfee firewall will be blocking remote access to port 22 by default. This port must be open to allow access from other machines. Once the firewall is configured, the last step would be to enable the password free login by generating the appropriate ssh private/public keys as listed in the OpenSSH section above.
Configuration
This section describes some of the primary configuration settings used for te hadoop setup at the ICB. Detailed instructions are available at http://hadoop.apache.org/.
Directory Structure
There are two important base directories used by hadoop. The first, referred to as HADOOP_HOME is the base directory for the hadoop installation. Source code, libraries, scripts, configuration files and examples all reside relative to this directory. On the ICB systems, HADOOP_HOME is set to ~hadoop/hadoop-0.18.1. The second base directory, somewhat misleadingly referred to as hadoop.tmp.dir, defines the base directory for the Hadoop FileSystem. This directory is generally set to /scratchLocal/hadoop on the ICB systems.
NameNode and JobTracker
The current configuration assigns both the NameNode and the JobTracker to chico.med.cornell.edu. This will likely change in the future. These are set in the file $HADOOP_HOME/conf/hadoop-site.xml. The main configuration for the ICB is as follows:
<configuration>
<property>
<name>fs.default.name</name>
<value>hdfs://chico.med.cornell.edu:9000/</value>
</property>
<property>
<name>mapred.job.tracker</name>
<value>chico.med.cornell.edu:9001</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/scratchLocal/hadoop/hadoop-${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
</configuration>
Hadoop Specific Environment Variables
The file $HADOOP_HOME/conf/hadoop-site.xml contains environment settings for hadoop. The only required environment variable is JAVA_HOME. On the ICB systems, JAVA_HOME is set to /softlib/exe/x86_64/pkg/sun_jdk/6.0.2/dist.
Since the hadoop user account is in a directory served by nfs, we need to override the location for the log files created by hadoop. We therefore set HADOOP_LOG_DIR to /scratchLocal/hadoop/logs.
Firewall
This section describes the various ports used by hadoop to communicate between nodes or to provide status and administration. Ports listed here are the default values and can be specified in the hadoop-site.xml file.
| Name | Port | Description |
|---|---|---|
| fs.default.name | 9000 | The port that the name node will listen to. |
| mapred.job.tracker | 9001 | The port that the MapReduce job tracker will listen to. |
| dfs.datanode.address | 50010 | The port that the datanode server will listen to. |
| dfs.datanode.ipc.address | 50020 | The datanode ipc server port. |
| mapred.job.tracker.http.address | 50030 | The job tracker http server port the server will listen on. |
| mapred.task.tracker.http.address | 50060 | The task tracker http server port. |
| dfs.http.address | 50070 | The dfs namenode http server port the server will listen on. |
| dfs.https.address | 50470 | The dfs namenode https server port the server will listen on. |
| dfs.datanode.http.address | 50075 | The datanode http server port. |
| dfs.datanode.https.address | 50475 | The datanode https server port. |
| dfs.secondary.http.address | 50090 | The secondary namenode http server port. |
Using Hadoop
All the hadoop commands are invoked by the bin/hadoop script. Running hadoop script without any arguments prints the description for all commands. All instructions in this section assume the user is in the $HADOOP_HOME directory.
Hadoop Startup
To format a new distributed filesystem execute the following command on the designated NameNode.
$ bin/hadoop namenode -format
Typically this will only need to be performed once as it will delete any previous installation of the HDFS.
Start the HDFS with the following command, run on the designated NameNode:
$ bin/start-dfs.sh
This script starts the DataNode daemons on all the listed slaves.
Start Map-Reduce with the following command, run on the designated JobTracker:
$ bin/start-mapred.sh
This script starts the TaskTracker daemons on all the listed slaves.
Note that if the NameNode and the JobTracker are running on the same machine, the bin/start-all.sh script can be used to start both the HDFS and Map-Reduce daemons.
Hadoop Shutdown
Stop HDFS with the following command, run on the designated NameNode:
$ bin/stop-dfs.sh
This script stops the DataNode daemons on all the listed slaves.
Stop Map/Reduce with the following command, run on the designated the designated JobTracker:
$ bin/stop-mapred.sh
This script stops the TaskTracker daemons on all the listed slaves.
Note that if the NameNode and the JobTracker are running on the same machine, the bin/stop-all.sh script can be used to stop both the HDFS and Map-Reduce daemons.
Interacting with the Hadoop FileSystem (HDFS)
The commands used to interact with the files in hadoop are very much like the commands found in the Unix operating system. The FileSystem shell is invoked by "bin/hadoop fs <args>". All the FileSystem shell commands take path URIs as arguments. A few basic commands are listed below. For a complete list of hadoop commands refer to the Hadoop Shell Commands manual.
- cat
- Copies source paths to stdout. Usage: hadoop fs -cat URI [URI …]
- Example:
- $ hadoop fs -cat hdfs://nn1.example.com/file1 hdfs://nn2.example.com/file2
- $ hadoop fs -cat file:///file3 /user/hadoop/file4
- cp
- Copy files from source to destination. This command allows multiple sources as well in which case the destination must be a directory. Usage: hadoop fs -cp URI [URI …] <dest>
- Example:
- $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2
- $ hadoop fs -cp /user/hadoop/file1 /user/hadoop/file2 /user/hadoop/dir
- get
- Copy files to the local file system. Usage: hadoop fs -get [-ignorecrc] [-crc] <src> <localdst>
- Example:
- $ hadoop fs -get /user/hadoop/file localfile
- $ hadoop fs -get hdfs://nn.example.com/user/hadoop/file localfile
- mkdir
- Takes path uri's as argument and creates directories. The behavior is much like unix mkdir -p creating parent directories along the path. Usage: hadoop fs -mkdir <paths>
- Example:
- $ hadoop fs -mkdir /user/hadoop/dir1 /user/hadoop/dir2
- $ hadoop fs -mkdir hdfs://nn1.example.com/user/hadoop/dir hdfs://nn2.example.com/user/hadoop/dir
- mv
- Moves files from source to destination. This command allows multiple sources as well in which case the destination needs to be a directory. Moving files across filesystems is not permitted. Usage: hadoop fs -mv URI [URI …] <dest>
- Example:
- $ hadoop fs -mv /user/hadoop/file1 /user/hadoop/file2
- $ hadoop fs -mv hdfs://nn.example.com/file1 hdfs://nn.example.com/file2 hdfs://nn.example.com/file3 hdfs://nn.example.com/dir1
- put
- Copy single src, or multiple srcs from local file system to the destination filesystem. Also reads input from stdin and writes to destination filesystem. Usage: hadoop fs -put <localsrc> ... <dst>
- Example:
- $ hadoop fs -put localfile /user/hadoop/hadoopfile
- $ hadoop fs -put localfile1 localfile2 /user/hadoop/hadoopdir
- $ hadoop fs -put localfile hdfs://nn.example.com/hadoop/hadoopfile
- $ hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile
- rm
- Delete files specified as args. Only deletes non empty directory and files. Usage: hadoop fs -rm URI [URI …]
- Example:
- $ hadoop fs -rm hdfs://nn.example.com/file /user/hadoop/emptydir
Submitting Tasks to Hadoop
Most often users will want to submit their code to the Hadoop cluster for processing. Users can bundle their Map Reduce code in a jar file and execute it using this command.
$ hadoop jar <jar> [mainClass] args...
or by simply providing the name of a class to execute
$ hadoop CLASSNAME args...
For a complete list of hadoop commands refer to the Hadoop Commands Manual.
Monitoring Hadoop
Hadoop provides a web interface for job tracking and also for examining the state of the namenode. The hadoop filesystem and log files for each node in the hadoop cluster can also be retrieved using the web interface for the NameNode.
- JobTracker: http://chico.med.cornell.edu:50030/
- NameNode: http://chico.med.cornell.edu:50070/
Documentation
- The main Hadoop web site: http://hadoop.apache.org/
- The official MapReduce tutorial: http://hadoop.apache.org/core/docs/current/mapred_tutorial.html
- Hadoop classes API (0.18.1): http://hadoop.apache.org/core/docs/r0.18.1/api/
- Hadoop Wiki: http://wiki.apache.org/hadoop/
Notes
Configuration
- In order to allow users to submit jobs, as themselves, the user directory in the HDFS needs to allow writes from users other than the hadoop user. To do this execute the following as the hadoop user:
$ bin/hadoop fs -chmod +w /user
Recommended Practices
- The HDFS is NOT backed up at all. Be sure to copy any data that is critical or you need to save into your home directory where it will be archived on a regular basis.
- Avoid using "localhost" in the configuration files. This gets messy when using shared configuration directories. Always use the fully qualified names.
Error Messages
- Host key verification failed
- This generally means that the remote host has never been authenticated before. To set up the authentication, log into the host at least once from the name node using ssh as the user running the hadoop process.
- WARN fs.FileSystem: "chico.med.cornell.edu:9000" is a deprecated filesystem name. Use "hdfs://chico.med.cornell.edu:9000/" instead
- When changing the settings in the hadoop-site.xml configuration file to the recommended name, some of the provided examples failed. We should be aware that this may become an issue at some point.
