TBF wishes you all a happy coding and troubleshooting New Year!

In 2012 we celebrated our 3rd anniversary.

The year 2012 was all about Oracle migrations and troubleshooting, playing with RavenDB, and picking up our old PHP skills.

We welcomed 20,000 viewers in 2012.

The busiest day of the year was November 27th with 141 views. The most popular post that day was Recover your corrupt datafiles in oracle – ora-00376.

These are the posts that got the most views in 2012.

More numbers and figures: annual report 2012

It was a busy year, all work and no play. But we will make up for this next year. We are currently obsessed with big data, so I expect lots of big data coming your way in 2013.

For now, stay safe, enjoy, and have a great ‘Oud en Nieuw’ (New Year’s Eve), as we Dutch girls say 😉

Hadoop on Debian Wheezy


We at the ButtonFactory care deeply about Data. And the latest hype seems to be that the Data must be Big. So let’s get our hands dirty already and take the plunge into Hadoop.

This is how you can install Hadoop on a Debian Wheezy virtual machine in VirtualBox.
My laptop is running Windows 8 Enterprise.

Java Development Kit

Install Debian Wheezy from here.
When choosing packages, select only base system and SSH server.
We need to install Java 6 (see wiki).
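Wheezy itself only ships OpenJDK, so following the wiki we grab the Sun/Oracle self-extracting installer. A sketch; the update number (6u38 here) is simply the one that was current for us:

    # first download the Java SE 6 Linux x64 .bin installer from Oracle's site
    chmod u+x jdk-6u38-linux-x64.bin
    ./jdk-6u38-linux-x64.bin    # self-extracts into ./jdk1.6.0_38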

We will move the JDK to /opt like this:
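Assuming the installer extracted into jdk1.6.0_38 in the current directory:

    sudo mkdir -p /opt
    sudo mv jdk1.6.0_38 /opt/    # the JDK now lives in /opt/jdk1.6.0_38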

Now we can check the version:
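Pointing straight at the new location:

    /opt/jdk1.6.0_38/bin/java -version
    # should report something like: java version "1.6.0_38"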

The Hadoop user

I followed Michael Noll’s howto for Ubuntu.

We need to create a hadoop group and an hduser account, and we’ll put hduser in the hadoop group:
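On Debian that boils down to:

    sudo addgroup hadoop                    # the group for everything Hadoop-related
    sudo adduser --ingroup hadoop hduser    # creates hduser directly in the hadoop group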

We need to disable IPv6, so let’s add the following lines to the end of /etc/sysctl.conf:
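Straight from the howto:

    # disable IPv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1

After a reboot (or a sudo sysctl -p), cat /proc/sys/net/ipv6/conf/all/disable_ipv6 should print 1.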

Next we need to add Java and Hadoop to the path.

Add Java to the path

We need to set the $JAVA_HOME variable and add Java to our path.
Edit ~/.bashrc:
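Hadoop will end up in /opt/hadoop (we install it in a moment), so we can add both paths at once. A sketch, assuming the JDK location from above; remember that hduser needs these lines in its own ~/.bashrc too, since Hadoop runs as hduser:

    # assumes the JDK was moved to /opt/jdk1.6.0_38 and Hadoop will live in /opt/hadoop
    export JAVA_HOME=/opt/jdk1.6.0_38
    export HADOOP_HOME=/opt/hadoop
    export PATH=$PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin

Log out and back in (or source ~/.bashrc) to pick up the changes.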

Next we can install Hadoop, also under /opt. I prefer /opt, but feel free to use /usr or something else.

Installing Hadoop

I’m using Hadoop 1.0.4, because it is stable. We might want to test with 2.0 later on.
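Downloading and unpacking into /opt could look like this (archive.apache.org is one place that still serves the 1.0.4 tarball):

    cd /opt
    sudo wget http://archive.apache.org/dist/hadoop/core/hadoop-1.0.4/hadoop-1.0.4.tar.gz
    sudo tar xzf hadoop-1.0.4.tar.gz
    sudo mv hadoop-1.0.4 hadoop          # so the install simply lives in /opt/hadoop
    sudo chown -R hduser:hadoop hadoop   # hduser will run Hadoop, so it should own the files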

We need to edit some of Hadoop’s config files now.

hadoop-env.sh

Open /opt/hadoop/conf/hadoop-env.sh and set the JAVA_HOME environment variable to the Sun JDK/JRE 6 directory.
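In our case that is the JDK we put in /opt:

    # /opt/hadoop/conf/hadoop-env.sh
    export JAVA_HOME=/opt/jdk1.6.0_38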

core-site.xml

This is where we tell Hadoop where to store its data.
/opt/hadoop/conf/core-site.xml
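These values follow the howto: /app/hadoop/tmp is the base directory HDFS will use, and 54310 is just the conventional NameNode port from that guide:

    <!-- /opt/hadoop/conf/core-site.xml -->
    <configuration>
      <property>
        <name>hadoop.tmp.dir</name>
        <value>/app/hadoop/tmp</value>
        <description>A base for other temporary directories.</description>
      </property>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:54310</value>
        <description>URI of the default file system.</description>
      </property>
    </configuration>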

We need to create this directory and set ownership correctly:
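For example:

    sudo mkdir -p /app/hadoop/tmp
    sudo chown hduser:hadoop /app/hadoop/tmp
    sudo chmod 750 /app/hadoop/tmp    # keep other users out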

mapred-site.xml

vim mapred-site.xml
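Again following the howto; 54311 is its conventional JobTracker port:

    <!-- /opt/hadoop/conf/mapred-site.xml -->
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:54311</value>
        <description>Host and port of the MapReduce JobTracker.</description>
      </property>
    </configuration>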

hdfs-site.xml

vim hdfs-site.xml
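With only one node, a replication factor of 1 is the sensible choice:

    <!-- /opt/hadoop/conf/hdfs-site.xml -->
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
        <description>Only one copy of each block on a single-node cluster.</description>
      </property>
    </configuration>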

Starting Hadoop

su to hduser, because Hadoop runs under that account:
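From our regular account:

    su - hduser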

First format the namenode:
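This is only needed once, on a fresh install (it wipes the HDFS metadata):

    /opt/hadoop/bin/hadoop namenode -format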

Now start Hadoop:
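The start script launches all the daemons. Note that it logs into localhost over SSH, so hduser needs a passwordless SSH key for localhost (the howto covers that step):

    /opt/hadoop/bin/start-all.sh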

You can check that it all runs with Java’s jps command:
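jps ships with the JDK and lists the running Java processes; on a healthy single-node setup all five daemons show up, something like:

    jps
    # 1788 NameNode
    # 1938 DataNode
    # 2085 SecondaryNameNode
    # 2149 JobTracker
    # 2287 TaskTracker
    # 2349 Jps      (the PIDs will differ, of course)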

If something went wrong and you don’t see the DataNode running, stop Hadoop, remove all the files in /app/hadoop/tmp, reformat the namenode, and start again.
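In commands, that recovery looks roughly like this (careful: it deletes everything in HDFS):

    /opt/hadoop/bin/stop-all.sh
    rm -rf /app/hadoop/tmp/*                   # wipes all HDFS data
    /opt/hadoop/bin/hadoop namenode -format
    /opt/hadoop/bin/start-all.sh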

Now you should be able to browse to:
http://localhost:50070/ – web UI of the NameNode daemon
http://localhost:50030/ – web UI of the JobTracker daemon
http://localhost:50060/ – web UI of the TaskTracker daemon

If you browse from the host machine, replace ‘localhost’ with the IP of your virtual Debian machine.

Run a MapReduce Task

You can run one of the examples like this:
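For instance the pi estimator from the bundled examples jar, which is handy because it needs no input data (the jar name matches the 1.0.4 tarball; the arguments are 10 maps with 100 samples each):

    cd /opt/hadoop
    bin/hadoop jar hadoop-examples-1.0.4.jar pi 10 100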

When all goes well the output should be something like this:
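Abridged, and the timing is obviously illustrative (the pi estimate itself is deterministic for these arguments):

    Number of Maps  = 10
    Samples per Map = 100
    ...
    Job Finished in 42.885 seconds
    Estimated value of Pi is 3.14800000000000000000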

But chances are it doesn’t work right away. I had to double-check my file permissions, because I once ran Hadoop as root, which made root the owner of the log directory; after that, hduser was no longer allowed to write to it.

So check for errors like:

    WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stdout
    10/01/18 10:52:48 WARN mapred.JobClient: Error reading task outputhttp://wheezy:50060/tasklog?plaintext=true&taskid=attempt_201001181020_0002_m_000014_0&filter=stderr

And take ownership of /opt/hadoop/logs:
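From an account with sudo rights:

    sudo chown -R hduser:hadoop /opt/hadoop/logs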