softwaremom


Getting Ready to Learn Hadoop

I had previously installed and configured Hadoop and picked up Hadoop: The Definitive Guide by Tom White.  Chapter 2 of the book walks through writing a MapReduce program to process the NCDC weather data.  So, first, I needed to download the data files.

It turned out to be not that straightforward.  The book’s companion website, https://github.com/tomwhite/hadoop-book, only has 2 gzip files, for 1900 and 1911, and each contains only 1 file.  The NCDC website has tons of data for download; some of it is free and some is not.  I went online to see whether anyone else had shared how they downloaded the files.  This site provides the answer closest to what I need:  https://groups.google.com/forum/#!topic/nosql-databases/LXgRfqo7H-k

I created a shell script (hey, a chance to learn that, too!) and executed it to grab data from 1990-2012.  It works, but little did I know how long it would take.  So far the script has been running for 1 full day, and it is still downloading the weather data for 1998.  The shell script is below:

#!/bin/bash
# Grab the NCDC weather data for 1990-2012, one folder per year.
cd ~/hadoop-book/data
for i in {1990..2012}
do
    mkdir -p "$i"
    # -r: recurse, -np: don't ascend to the parent directory,
    # -R: reject index files (quoted so the shell doesn't expand the *)
    wget -r -np -R "index.html*" "http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/"
    # wget mirrors into an ftp3.ncdc.noaa.gov/... tree; move the files
    # out into the per-year folder and remove the mirror directories
    mv "ftp3.ncdc.noaa.gov/pub/data/noaa/$i"/*.gz "$i"/
    rm -r "ftp3.ncdc.noaa.gov/pub/data/noaa/$i"
done

The script is supposed to skip the index.html* files, but wget still downloads them (apparently it needs the directory listings to follow the links).  Never mind that.  At the end of the execution, I will have a hadoop-book/data folder with one subfolder per year from 1990 to 2012.  Each folder will contain many, many small gzip files, one from each weather station.
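I have read that lots of tiny files are inefficient for Hadoop, so one idea I am considering (my own, not from the book; `merge_years` is just a helper name I made up) is to concatenate each year’s station files into a single gzip per year.  Gzip streams can be concatenated directly and still decompress as one stream:

```shell
# Hypothetical helper: merge each year's *.gz station files under the
# given data directory into one <year>.gz next to the year folder.
merge_years() {
    local data=$1 y
    for y in "$data"/[0-9][0-9][0-9][0-9]; do
        [ -d "$y" ] || continue      # skip if no year directories exist
        cat "$y"/*.gz > "$y.gz"      # e.g. data/1990/*.gz -> data/1990.gz
    done
}

merge_years ~/hadoop-book/data 2>/dev/null
```

It is harmless to run before the download finishes, since it only touches the year folders that already exist.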

I also downloaded the source code from https://github.com/tomwhite/hadoop-book.  It took me a long time to find the “Download” button on the right-hand side.  The Readme tells me that I need to download Maven.  It took a little research to find this site (http://stackoverflow.com/questions/15630055/how-to-install-maven-3-on-ubuntu-12-04-12-10-13-04-by-using-apt-get), which tells me I need to edit a file.

The whole process I came up with is as follows:

  1. sudo -H gedit /etc/apt/sources.list

  2. Add the following lines to the sources.list file:

    deb http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

    deb-src http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

  3. sudo apt-get update && sudo apt-get install maven3

  4. sudo ln -s /usr/share/maven3/bin/mvn /usr/bin/mvn

Then I ran

mvn package -DskipTests -Dhadoop.version=1.1.1

Initially, it gave me an error saying my JAVA_HOME was not set correctly.  I fixed it with

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64
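In case your JDK landed somewhere else, here is a rough way I would hunt for the right path (these are guesses, not from the book; the last bit just strips the trailing /bin/java off wherever the java binary on the PATH resolves to):

```shell
# List installed JVMs (locations vary by distro and JDK version)
ls -d /usr/lib/jvm/*/ 2>/dev/null || true

# Or derive a JAVA_HOME candidate from whichever java is on the PATH,
# by stripping the trailing /bin/java from its resolved location
java_bin=$(readlink -f "$(command -v java)" 2>/dev/null) || true
echo "JAVA_HOME candidate: ${java_bin%/bin/java}"
```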

I also modified my .bashrc file so it will be set for my next session.  After that, I ran the same mvn command and it started building everything.  I am not sure how it knows what to do, but I guess that is what Maven is for.  In any case, it worked away and printed a lot of information on the screen.  This is a fragment of it:

[WARNING] Replacing pre-existing project main-artifact file: /home/mei/hadoop-book-master/ch13/target/ch13-3.0.jar
with assembly file: /home/mei/hadoop-book-master/ch13/target/../../hbase-examples.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Chapter 14: ZooKeeper 3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (enforce-versions) @ ch14 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ch14 ---
[INFO] Compiling 10 source files to /home/mei/hadoop-book-master/ch14/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ch14 ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ ch14 ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ ch14 ---
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/ch14-3.0.jar
[INFO]
[INFO] --- maven-assembly-plugin:2.2.1:single (make-assembly) @ ch14 ---
[INFO] Reading assembly descriptor: ../book/src/main/assembly/jar.xml
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[WARNING] Configuration options: 'appendAssemblyId' is set to false, and 'classifier' is missing.
Instead of attaching the assembly file: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar, it will become the file for main project artifact.

It built a lot of jar files, and finished with a build summary:

[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop: The Definitive Guide, Example Code 3.0
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Hadoop: The Definitive Guide, Project ............. SUCCESS [7.850s]
[INFO] Common Code ....................................... SUCCESS [1:02.842s]
[INFO] Chapter 2: MapReduce .............................. SUCCESS [1.007s]
[INFO] Chapter 3: The Hadoop Distributed Filesystem ...... SUCCESS [2.791s]
[INFO] Chapter 4: Hadoop I/O ............................. SUCCESS [1.962s]
[INFO] Chapter 4: Hadoop I/O (Avro) ...................... SUCCESS [27.794s]
[INFO] Chapter 5: Developing a MapReduce Application ..... SUCCESS [1.647s]
[INFO] Chapter 7: MapReduce Types and Formats ............ SUCCESS [1.690s]
[INFO] Chapter 8: MapReduce Features ..................... SUCCESS [1.943s]
[INFO] Chapter 11: Pig ................................... SUCCESS [12.575s]
[INFO] Chapter 12: Hive .................................. SUCCESS [28.504s]
[INFO] Chapter 13: HBase ................................. SUCCESS [29.176s]
[INFO] Chapter 14: ZooKeeper ............................. SUCCESS [0.618s]
[INFO] Chapter 15: Sqoop ................................. SUCCESS [16.455s]
[INFO] Chapter 16: Case Studies .......................... SUCCESS [0.949s]
[INFO] Hadoop Examples JAR ............................... SUCCESS [0.894s]
[INFO] Snippet testing ................................... SUCCESS [4.474s]
[INFO] Hadoop: The Definitive Guide, Example Code ........ SUCCESS [0.000s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:23.641s
[INFO] Finished at: Sun Dec 29 00:54:02 PST 2013
[INFO] Final Memory: 158M/562M
[INFO] ------------------------------------------------------------------------

Now I am supposed to be able to run it, but I don’t know how yet.  I am also still waiting for the data files to finish downloading.
