softwaremom

Tag Archives: Hadoop

Following the Cloudera Hadoop Course

Since I hit a road block last time I tried to follow the Hadoop: The Definitive Guide book, I have not done much with it.  My time was spent learning to develop iOS apps for the iPad.  Today I went back to the Cloudera site and found a very nice course on the subject.  I like the format, and it explains everything very clearly and with illustrations, kind of like Khan Academy for Hadoop.  🙂  I went through two lessons tonight.

https://www.udacity.com/course/viewer#!/c-ud617/l-308873796/m-416359060

Trying to Compile Hadoop Sample Code with Eclipse

I am having trouble building and running the sample code from Hadoop: The Definitive Guide.  The problem is that I am not familiar with how to set the CLASSPATH for the Java compiler.  I found this webcast from Cloudera:

http://blog.cloudera.com/blog/2013/08/how-to-use-eclipse-with-mapreduce-in-clouderas-quickstart-vm/

So I created a project and added the downloaded sample code.  Then I clicked on the Properties of the project:

[screenshot: classpath1]

Clicked on “Add External Jars…”

Browsed to /usr/lib/hadoop/lib and selected all the jar files

[screenshot: classpath2]

Added more jar files…

[screenshot: classpath3]

After that, all the jar files showed up in my project folder, making it really cluttered.

[screenshot: classpath4]

At this point I gave up and called it a night.

The next night, after I was able to compile the WordCount example from Cloudera on the command line, I got an idea: what if I added the hadoop-mapreduce library to this project as well?  It turned out that works.  The error squiggles under the import statements stayed at first, then disappeared.  Oh, I also needed to switch to the J2SE-1.5 JRE library.  Eclipse asked me if I wanted to switch, so I am guessing the sample code uses the old JRE calls.  Now all I have left is the warning that Job() is deprecated:

Job job = new Job();
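
From what I can tell, the fix is the static factory method that replaced this constructor in the org.apache.hadoop.mapreduce API of Hadoop 2.x; a minimal sketch (the JobSetup class name and the job name are mine, not from the sample code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobSetup {
    public static void main(String[] args) throws Exception {
        // Job.getInstance() replaces the deprecated Job() constructor
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(JobSetup.class);
    }
}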

At this point, I am going to table it and learn from Cloudera or some other place such as Hortonworks.

In doing my research, I found out that you can run Hadoop code inside the Eclipse IDE.  This posting, https://github.com/itsamiths/hadoop-eclipse-plugins, lets you download the Eclipse Hadoop plug-in or build your own plug-in.  I think I am going to try it and test it out with the WordCount sample code.

Compiling Cloudera WordCount Hadoop Sample Code

Since I've been having trouble building the sample code from Hadoop: The Definitive Guide using Eclipse, I thought I'd go back to the Cloudera site and build its sample code, following the instructions from here:  http://www.cloudera.com/content/cloudera-content/cloudera-docs/HadoopTutorial/CDH4/Hadoop-Tutorial/ht_usage.html

However, since the version of Hadoop has changed (I am using Hadoop 2.0.0-cdh4.5.0), the classpath is different, too.  When I tried compiling WordCount.java:

$ mkdir wordcount_classes
$ javac -cp <classpath> -d wordcount_classes WordCount.java
where <classpath> is /usr/lib/hadoop/*:/usr/lib/hadoop/client-0.20/*

I get errors: 

WordCount.java:9: package org.apache.hadoop.mapred does not exist
import org.apache.hadoop.mapred.*;
^
WordCount.java:14: cannot find symbol
symbol : class MapReduceBase
location: class org.myorg.WordCount
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {

Obviously, more jar files need to be included in the classpath.  After checking the installed directories, this works:

javac -cp /usr/lib/hadoop/*:/usr/lib/hadoop-mapreduce/* -d wordcount_classes WordCount.java

After that, I can create the jar file, run the application, and examine the results, just like what's in the instructions.
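
For my own notes, those remaining steps from the Cloudera tutorial look roughly like this (the HDFS input/output paths are the tutorial's examples and will vary by setup):

$ jar -cvf wordcount.jar -C wordcount_classes/ .
$ hadoop jar wordcount.jar org.myorg.WordCount /user/cloudera/wordcount/input /user/cloudera/wordcount/output
$ hadoop fs -cat /user/cloudera/wordcount/output/part-00000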

Getting Ready to Learn Hadoop

I had previously installed and configured Hadoop, and I got Hadoop: The Definitive Guide by Tom White.  Chapter 2 of the book talks about writing a MapReduce program to process the NCDC weather data.  So, first, I need to download the data files.
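
As a preview, the chapter's example boils down to a mapper that pulls the year and the air temperature out of each fixed-width NCDC record.  Here is a sketch from my reading of the chapter (the field offsets and the quality-code check are as I remember them from the book, so treat this as approximate):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits (year, air temperature) for each valid NCDC record
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}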

Downloading the data turned out not to be that straightforward.  The book's companion website, https://github.com/tomwhite/hadoop-book, only has two gzip files, for 1900 and 1911, and each contains only one file.  The NCDC website has tons of data for download; some of it is free and some is not.  I went online to see if anyone else had shared how they downloaded the files.  This site provides the answer closest to what I need: https://groups.google.com/forum/#!topic/nosql-databases/LXgRfqo7H-k

I created a shell script (hey, a chance to learn that, too!) and executed it to grab data from 1990-2012.  It works, but little did I know how long it would take.  So far the script has been running for a full day, and it is still downloading the weather data from 1998.  The shell script is below:

#!/bin/bash
# Grab the NCDC weather data for 1990-2012, one folder per year
cd ~/hadoop-book/data
for i in {1990..2012}
do
  mkdir -p $i
  # Quote the pattern so the shell doesn't expand it before wget sees it
  wget -r -np -R "index.html*" http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/
  # Move the downloaded .gz files into the per-year folder and clean up
  cp ftp3.ncdc.noaa.gov/pub/data/noaa/$i/*.gz ~/hadoop-book/data/$i/
  rm ftp3.ncdc.noaa.gov/pub/data/noaa/$i/*
  rmdir ftp3.ncdc.noaa.gov/pub/data/noaa/$i
done

The script is supposed to skip the index.html* files, but wget still downloads them; with -r it apparently has to fetch each index page to discover the links before -R can delete the rejected files.  Never mind that.  At the end of the execution, I will have a hadoop-book/data folder with one folder per year from 1990-2012.  The folders will contain many, many small gzip files, one from each weather station.

I also downloaded the source code from https://github.com/tomwhite/hadoop-book.  It took me a long time to find the "Download" button on the right-hand side.  The Readme tells me that I need to download Maven.  It took a little research to find this site (http://stackoverflow.com/questions/15630055/how-to-install-maven-3-on-ubuntu-12-04-12-10-13-04-by-using-apt-get), which tells me I need to edit a file.

The whole process I came up with is as follows:

  1. sudo -H gedit /etc/apt/sources.list

  2. Add the following lines to the sources.list file:

    deb http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

    deb-src http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

  3. sudo apt-get update && sudo apt-get install maven3

  4. sudo ln -s /usr/share/maven3/bin/mvn /usr/bin/mvn
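
To confirm the install worked, a quick optional check (this just prints the version Maven reports):

mvn -version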

Then I ran

mvn package -DskipTests -Dhadoop.version=1.1.1

Initially, it gave me an error saying my JAVA_HOME was not set correctly.  I fixed it by running:

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

I also modified my .bashrc file so I will have it in my next session.  After that, I ran the same mvn command and it started building everything.  I am not sure how it knows what to build, but I guess the project's pom.xml files tell it; that is what Maven is for.  In any case, it started working away and printed a lot of information on the screen.  This is a fragment of it:

[WARNING] Replacing pre-existing project main-artifact file: /home/mei/hadoop-book-master/ch13/target/ch13-3.0.jar
with assembly file: /home/mei/hadoop-book-master/ch13/target/../../hbase-examples.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Chapter 14: ZooKeeper 3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (enforce-versions) @ ch14 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ch14 ---
[INFO] Compiling 10 source files to /home/mei/hadoop-book-master/ch14/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ch14 ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ ch14 ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ ch14 ---
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/ch14-3.0.jar
[INFO]
[INFO] --- maven-assembly-plugin:2.2.1:single (make-assembly) @ ch14 ---
[INFO] Reading assembly descriptor: ../book/src/main/assembly/jar.xml
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[WARNING] Configuration options: 'appendAssemblyId' is set to false, and 'classifier' is missing.
Instead of attaching the assembly file: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar, it will become the file for main project artifact.

It built a lot of jar files:

[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop: The Definitive Guide, Example Code 3.0
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Hadoop: The Definitive Guide, Project ............. SUCCESS [7.850s]
[INFO] Common Code ....................................... SUCCESS [1:02.842s]
[INFO] Chapter 2: MapReduce .............................. SUCCESS [1.007s]
[INFO] Chapter 3: The Hadoop Distributed Filesystem ...... SUCCESS [2.791s]
[INFO] Chapter 4: Hadoop I/O ............................. SUCCESS [1.962s]
[INFO] Chapter 4: Hadoop I/O (Avro) ...................... SUCCESS [27.794s]
[INFO] Chapter 5: Developing a MapReduce Application ..... SUCCESS [1.647s]
[INFO] Chapter 7: MapReduce Types and Formats ............ SUCCESS [1.690s]
[INFO] Chapter 8: MapReduce Features ..................... SUCCESS [1.943s]
[INFO] Chapter 11: Pig ................................... SUCCESS [12.575s]
[INFO] Chapter 12: Hive .................................. SUCCESS [28.504s]
[INFO] Chapter 13: HBase ................................. SUCCESS [29.176s]
[INFO] Chapter 14: ZooKeeper ............................. SUCCESS [0.618s]
[INFO] Chapter 15: Sqoop ................................. SUCCESS [16.455s]
[INFO] Chapter 16: Case Studies .......................... SUCCESS [0.949s]
[INFO] Hadoop Examples JAR ............................... SUCCESS [0.894s]
[INFO] Snippet testing ................................... SUCCESS [4.474s]
[INFO] Hadoop: The Definitive Guide, Example Code ........ SUCCESS [0.000s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:23.641s
[INFO] Finished at: Sun Dec 29 00:54:02 PST 2013
[INFO] Final Memory: 158M/562M
[INFO] ------------------------------------------------------------------------

Now I am supposed to be able to run it, but I don't know how yet.  I am also waiting for the data files to finish downloading.
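
From skimming ahead in Chapter 2, the run step looks something like this (assuming the examples jar built above and the small sample file from the book's repo; I have not tried it yet):

export HADOOP_CLASSPATH=hadoop-examples.jar
hadoop MaxTemperature input/ncdc/sample.txt output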

Restarting My Blog

I've been thinking of switching my technology focus for a while: I've been on the Microsoft bandwagon since 1995 and have devoted all my professional energy to keeping up with the latest of what Redmond has to offer.  This is great for my current position.  However, it keeps me somewhat shut off from the other camp, that is, the rest of the world.  With Hadoop and its derived technologies so prominent these days, all the major companies request Hadoop skills when hiring BI talent.  I want to be on that boat!

Three weeks ago, I installed a VirtualBox VM with Ubuntu Linux, then installed and configured Hadoop 2.2.0 and got the system ready as a single-node machine.  It was much harder than I thought, and I felt like a baby taking her first steps.  What I've learned so far:

1. how to keep my system updated.

2. how to download a package or a piece of software and install it on the machine.

3. how to create a new user as the service account, and how to switch users in the terminal (see the commands after this list).

4. how to use the Nano editor and what sudo means.

5. the "official" Hadoop installation doc is very hard to read because it assumes working knowledge of Linux.  Online search results, though helpful, are often outdated.  While I was updating the config XML files, one blogger referred to a file that exists in a different location, and that threw me off.  What I did was keep looking and not get stuck in one place.  Eventually, it worked.
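
For item 3, the commands I pieced together look roughly like this (the hadoop group and hduser names are just examples, not anything Hadoop requires):

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser
su - hduser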

I am at the point where I tried to compile the sample WordCount Java program in Hadoop:

$ javac -classpath ${HADOOP_HOME}/hadoop-${HADOOP_VERSION}-core.jar -d wordcount_classes WordCount.java

I am getting 18 errors, starting with

import org.apache.hadoop.mapred.*

It says package org.apache.hadoop.mapred does not exist.  I am guessing the classpath is incorrect.
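
One thing I plan to try (assuming the hadoop launcher script is on the PATH): the hadoop classpath command prints the classpath Hadoop itself uses, and its output can be passed straight to javac:

javac -classpath $(hadoop classpath) -d wordcount_classes WordCount.java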

In a separate endeavor, I am learning Java from the beginning.  I wrote a HelloWorld program today.  I just have to keep at it, even if it's just a little bit of progress each day.
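
For reference, a HelloWorld is about as small as Java gets; mine looked more or less like this:

// The classic first program: prints one line and exits
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World!");
    }
}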