Hadoop on Debian Wheezy
December 16, 2012
We at the ButtonFactory care deeply about Data. And the latest hype seems to be that the Data must be Big. So let's get our hands dirty and take the plunge into Hadoop.
This is how you can install Hadoop on a Debian Wheezy virtual machine in VirtualBox.
My laptop is running Windows 8 Enterprise.
Java Development Kit
$ chmod u=rwx jdk-6u38-linux-x64.bin
# The .bin file is a self-extracting archive, so run it to unpack the JDK
$ ./jdk-6u38-linux-x64.bin
We will move the JDK to /opt like this:
$ mkdir /opt/jvm
$ mv jdk1.6.0_38/ /opt/jvm/jdk1.6.0_38/
$ update-alternatives --install /usr/bin/java java /opt/jvm/jdk1.6.0_38/jre/bin/java 3
$ update-alternatives --config java
Now we can check the version:
$ java -version
java version "1.6.0_38"
Java(TM) SE Runtime Environment (build 1.6.0_38-b05)
Java HotSpot(TM) 64-Bit Server VM (build 20.13-b02, mixed mode)
The Hadoop user
I followed Michael Noll's howto for Ubuntu.
We need to create a hadoop group and an hduser, and we'll put hduser in the hadoop group:
addgroup hadoop
adduser --ingroup hadoop hduser
We need to disable IPv6, so let's add the following lines to the end of /etc/sysctl.conf:
#disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
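The lines in /etc/sysctl.conf only take effect after a reboot, or after reloading them as root with sysctl -p. Here is a small sketch for checking the result. The helper name ipv6_state is my own invention, and it simply reads the value the kernel exposes under /proc:

```shell
# Hypothetical helper: reports the IPv6 state from a disable_ipv6 file.
# $1 is optional and defaults to the kernel's entry for all interfaces.
ipv6_state() {
  file="${1:-/proc/sys/net/ipv6/conf/all/disable_ipv6}"
  if [ ! -r "$file" ]; then
    echo "unknown"
    return 0
  fi
  case "$(cat "$file")" in
    1) echo "disabled" ;;
    0) echo "enabled" ;;
    *) echo "unknown" ;;
  esac
}

# As root, reload /etc/sysctl.conf first with: sysctl -p
ipv6_state
```

If this prints "enabled" after the reload, the lines probably did not end up in /etc/sysctl.conf the way you intended.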
Next we need to add Java and Hadoop to the path.
Add Java to the path
We need to set the $JAVA_HOME and $HADOOP_HOME variables and add both bin directories to our path.
# Set Hadoop-related environment variables
export HADOOP_HOME=/opt/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/opt/jvm/jdk1.6.0_38
# Add Hadoop bin/ and JAVA bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$JAVA_HOME/bin
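After opening a new shell you may want to confirm that both directories really ended up on the path. A minimal sketch: path_has is a made-up helper, and the fallback directories are the ones used in this post.

```shell
# Hypothetical helper: succeeds if directory $1 is one of the
# colon-separated entries in PATH.
path_has() {
  case ":$PATH:" in
    *":$1:"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Fall back to this post's locations if the variables are not set.
for dir in "${HADOOP_HOME:-/opt/hadoop}/bin" "${JAVA_HOME:-/opt/jvm/jdk1.6.0_38}/bin"; do
  if path_has "$dir"; then
    echo "on PATH: $dir"
  else
    echo "missing from PATH: $dir"
  fi
done
```

Wrapping PATH in colons before matching avoids false hits on directories that merely share a prefix, such as /opt/hadoop/bin2.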
Next we can install Hadoop, also under /opt. I prefer /opt, but feel free to use /usr or something else.
I'm using Hadoop 1.0.4, because it is stable. We might want to test with 2.0 later on.
cd /opt
tar xzf hadoop-1.0.4.tar.gz
mv hadoop-1.0.4 hadoop
chown -R hduser:hadoop hadoop
We need to edit some of Hadoop's config files now.
Open /opt/hadoop/conf/hadoop-env.sh and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory. Change
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.5-sun
to
# The java implementation to use. Required.
export JAVA_HOME=/opt/jvm/jdk1.6.0_38
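This is where Hadoop stores its Data. Add these properties inside the <configuration> element of /opt/hadoop/conf/core-site.xml:
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>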
We need to create this directory and set ownership correctly:
mkdir -p /app/hadoop/tmp
chown hduser:hadoop /app/hadoop/tmp
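In /opt/hadoop/conf/mapred-site.xml, again inside the <configuration> element:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>
And in /opt/hadoop/conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>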
Switch to hduser with su, because Hadoop runs under that account:
su - hduser
First format the namenode:
/opt/hadoop/bin/hadoop namenode -format
Now start Hadoop:
/opt/hadoop/bin/start-all.sh
You can check whether everything is running with the JDK's jps command:
hduser@wheezy:$ jps
2764 JobTracker
3374 Jps
2554 DataNode
2667 SecondaryNameNode
2879 TaskTracker
2449 NameNode
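Scanning the jps listing by eye works fine, but it is easy to script. A rough sketch that reports which of the five daemons from the listing above are missing; the function name missing_daemons is my own, and it assumes jps prints one "pid name" pair per line as shown:

```shell
# Daemons expected on this single-node setup (taken from the jps listing above).
EXPECTED="NameNode DataNode SecondaryNameNode JobTracker TaskTracker"

# Hypothetical helper: reads jps-style output ("<pid> <name>" per line)
# on stdin and prints each expected daemon that is absent, one per line.
missing_daemons() {
  running=$(awk '{print $2}')
  for daemon in $EXPECTED; do
    # -x matches whole lines only, so NameNode does not match SecondaryNameNode
    echo "$running" | grep -qx "$daemon" || echo "$daemon"
  done
}

# Only run the check when jps is actually on the PATH.
if command -v jps >/dev/null 2>&1; then
  jps | missing_daemons
fi
```

No output means all five daemons were found.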
If something went wrong and you don't see the DataNode running, then stop Hadoop, remove all the files in /app/hadoop/tmp, format the namenode again and start over. Note that this wipes everything stored in HDFS.
bin/stop-all.sh
cd /app/hadoop/tmp/
rm -rf *
cd /opt/hadoop/
bin/hadoop namenode -format
bin/start-all.sh
Now you should be able to browse to:
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon
If you browse from the host machine, replace 'localhost' with the IP address of your Debian virtual machine.
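If you would rather check the three web UIs from a shell than from a browser, something along these lines should do. It is only a sketch: hadoop_ui_urls and probe_hadoop_uis are names I made up, the ports are the ones listed above, and it assumes curl is installed.

```shell
# Hypothetical helper: prints the three web UI URLs for a given host.
hadoop_ui_urls() {
  host="${1:-localhost}"
  for port in 50070 50030 50060; do   # NameNode, JobTracker, TaskTracker
    echo "http://$host:$port/"
  done
}

# Hypothetical helper: prints the HTTP status code for each UI.
# curl reports 000 when it cannot connect at all.
probe_hadoop_uis() {
  for url in $(hadoop_ui_urls "$1"); do
    code=$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$url" || true)
    echo "$url -> $code"
  done
}

hadoop_ui_urls localhost
```

Inside the virtual machine, run probe_hadoop_uis localhost; from the host, pass the VM's IP address instead. On the Debian guest, hostname -I prints its addresses.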
Run a MapReduce Task
You can run one of the examples like this:
hduser@wheezy:/opt/hadoop$ bin/hadoop jar hadoop-examples-1.0.4.jar pi 10 1000000
When all goes well the output should be something like this:
SNIP
12/12/16 16:48:57 INFO mapred.JobClient: Combine output records=0
12/12/16 16:48:57 INFO mapred.JobClient: Physical memory (bytes) snapshot=1655828480
12/12/16 16:48:57 INFO mapred.JobClient: Reduce output records=0
12/12/16 16:48:57 INFO mapred.JobClient: Virtual memory (bytes) snapshot=5379981312
12/12/16 16:48:57 INFO mapred.JobClient: Map output records=20
Job Finished in 71.721 seconds
Estimated value of Pi is 3.14158440000000000000
SNIP
But chances are it doesn't work right away. I had to double-check my file permissions, because I once ran Hadoop as root, which made root the owner of the log directory. After that, hduser was no longer allowed to write to it.
So check for errors like:
WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stdout
10/01/18 10:52:48 WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stderr
Then, as root and from /opt/hadoop, give hduser ownership of /opt/hadoop/logs:
chown -R hduser:hadoop logs
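To see whether anything in the log directory still belongs to someone else, a find one-liner is enough. A small sketch; not_owned_by is a hypothetical helper name, and the path and user are the ones from this post:

```shell
# Hypothetical helper: lists everything under directory $2
# that is not owned by user $1.
not_owned_by() {
  find "$2" ! -user "$1"
}

# Paths and user name follow this post; adjust them for your setup.
if [ -d /opt/hadoop/logs ] && id hduser >/dev/null 2>&1; then
  not_owned_by hduser /opt/hadoop/logs
fi
```

An empty result means hduser owns the whole tree and the chown did its job.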