softwaremom

Monthly Archives: December 2013

Getting Ready to Learn Hadoop

I had previously installed and configured Hadoop and got Hadoop: The Definitive Guide by Tom White.  Chapter 2 of the book walks through writing a MapReduce program to process the NCDC weather data; the mapper sketched below gives the flavor of it.  So, first, I need to download the data files.
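From my reading so far, the heart of the Chapter 2 example is a mapper that pulls the year and air temperature out of each fixed-width NCDC record.  This is my paraphrase of the book's listing, reconstructed from memory (the column offsets are the ones the book uses), so treat it as a sketch:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MISSING = 9999;

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        // The sign character decides where the temperature digits start
        int airTemperature;
        if (line.charAt(87) == '+') {
            airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        // Emit (year, temperature) only for valid, non-missing readings
        if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
        }
    }
}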

Downloading turns out not to be that straightforward.  The book's companion website, https://github.com/tomwhite/hadoop-book, only has two gzip files, for 1900 and 1911, each containing a single file.  The NCDC website has tons of data for download, some free and some not.  I went online to see whether anyone else had shared how they downloaded the files.  This site provides the answer closest to what I need: https://groups.google.com/forum/#!topic/nosql-databases/LXgRfqo7H-k

I’ve created a shell script (hey, a chance to learn that, too!) and executed it to grab data for 1990-2012.  It works, but little did I know how long it would take.  So far the script has been running for one full day, and it is still downloading the weather data for 1998.  The shell script is below:

#!/bin/bash
cd ~/hadoop-book/data
for i in {1990..2012}
do
  mkdir -p "$i"
  # -r: recurse, -np: don't ascend to the parent directory,
  # -R: reject matching files (quoted so the shell doesn't expand the pattern)
  wget -r -np -R "index.html*" "http://ftp3.ncdc.noaa.gov/pub/data/noaa/$i/"
  # wget mirrors into a directory named after the host; move the files out and clean up
  mv "ftp3.ncdc.noaa.gov/pub/data/noaa/$i"/*.gz "$i/"
  rm -rf "ftp3.ncdc.noaa.gov/pub/data/noaa/$i"
done

The script is supposed to skip the index.html* files, but wget still downloads them; it has to fetch the index pages to discover the links to follow, and with -R it just deletes them afterwards.  Never mind that.  At the end of the execution, I will have a hadoop-book/data folder with one folder per year from 1990-2012.  The folders will contain many, many small gzip files, one from each weather station.

I also downloaded the source code from https://github.com/tomwhite/hadoop-book.  It took me a long time to find the “Download” button on the right-hand side.  The Readme tells me that I need Maven.  A little research turned up this site (http://stackoverflow.com/questions/15630055/how-to-install-maven-3-on-ubuntu-12-04-12-10-13-04-by-using-apt-get), which tells me I need to edit a file.

The whole process I came up with is as follows:

  1. sudo -H gedit /etc/apt/sources.list

  2. Add the following lines to the sources.list file:

    deb http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

    deb-src http://ppa.launchpad.net/natecarlson/maven3/ubuntu precise main

  3. sudo apt-get update && sudo apt-get install maven3

  4. sudo ln -s /usr/share/maven3/bin/mvn /usr/bin/mvn

Then I ran

mvn package -DskipTests -Dhadoop.version=1.1.1

Initially, it gave me an error saying my JAVA_HOME was not set correctly.  I fixed it with

export JAVA_HOME=/usr/lib/jvm/java-1.7.0-openjdk-amd64

I also modified my .bashrc file so the variable will be set for my next session.  After that, I ran the same mvn command and it started building everything.  I am not sure how it knows what to build, but I guess that’s what Maven’s pom.xml files are for.  In any case, it started working away and printed out a lot of information on the screen.  This is a fragment of it:

[WARNING] Replacing pre-existing project main-artifact file: /home/mei/hadoop-book-master/ch13/target/ch13-3.0.jar
with assembly file: /home/mei/hadoop-book-master/ch13/target/../../hbase-examples.jar
[INFO]
[INFO] ------------------------------------------------------------------------
[INFO] Building Chapter 14: ZooKeeper 3.0
[INFO] ------------------------------------------------------------------------
[INFO]
[INFO] --- maven-enforcer-plugin:1.0.1:enforce (enforce-versions) @ ch14 ---
[INFO]
[INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/main/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:compile (default-compile) @ ch14 ---
[INFO] Compiling 10 source files to /home/mei/hadoop-book-master/ch14/target/classes
[INFO]
[INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ ch14 ---
[INFO] Using 'UTF-8' encoding to copy filtered resources.
[INFO] skip non existing resourceDirectory /home/mei/hadoop-book-master/ch14/src/test/resources
[INFO]
[INFO] --- maven-compiler-plugin:2.3.2:testCompile (default-testCompile) @ ch14 ---
[INFO] No sources to compile
[INFO]
[INFO] --- maven-surefire-plugin:2.5:test (default-test) @ ch14 ---
[INFO] Tests are skipped.
[INFO]
[INFO] --- maven-jar-plugin:2.4:jar (default-jar) @ ch14 ---
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/ch14-3.0.jar
[INFO]
[INFO] --- maven-assembly-plugin:2.2.1:single (make-assembly) @ ch14 ---
[INFO] Reading assembly descriptor: ../book/src/main/assembly/jar.xml
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[INFO] Building jar: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar
[INFO] DeleteGroup.class already added, skipping
[INFO] ListGroup.class already added, skipping
[INFO] ResilientActiveKeyValueStore.class already added, skipping
[INFO] CreateGroup.class already added, skipping
[INFO] ResilientConfigUpdater.class already added, skipping
[INFO] ConnectionWatcher.class already added, skipping
[INFO] JoinGroup.class already added, skipping
[INFO] ActiveKeyValueStore.class already added, skipping
[INFO] ConfigUpdater.class already added, skipping
[INFO] ConfigWatcher.class already added, skipping
[WARNING] Configuration options: 'appendAssemblyId' is set to false, and 'classifier' is missing.
Instead of attaching the assembly file: /home/mei/hadoop-book-master/ch14/target/../../zookeeper-examples.jar, it will become the file for main project artifact.

It built a lot of jar files:

[INFO] ------------------------------------------------------------------------
[INFO] Building Hadoop: The Definitive Guide, Example Code 3.0
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Hadoop: The Definitive Guide, Project ............. SUCCESS [7.850s]
[INFO] Common Code ....................................... SUCCESS [1:02.842s]
[INFO] Chapter 2: MapReduce .............................. SUCCESS [1.007s]
[INFO] Chapter 3: The Hadoop Distributed Filesystem ...... SUCCESS [2.791s]
[INFO] Chapter 4: Hadoop I/O ............................. SUCCESS [1.962s]
[INFO] Chapter 4: Hadoop I/O (Avro) ...................... SUCCESS [27.794s]
[INFO] Chapter 5: Developing a MapReduce Application ..... SUCCESS [1.647s]
[INFO] Chapter 7: MapReduce Types and Formats ............ SUCCESS [1.690s]
[INFO] Chapter 8: MapReduce Features ..................... SUCCESS [1.943s]
[INFO] Chapter 11: Pig ................................... SUCCESS [12.575s]
[INFO] Chapter 12: Hive .................................. SUCCESS [28.504s]
[INFO] Chapter 13: HBase ................................. SUCCESS [29.176s]
[INFO] Chapter 14: ZooKeeper ............................. SUCCESS [0.618s]
[INFO] Chapter 15: Sqoop ................................. SUCCESS [16.455s]
[INFO] Chapter 16: Case Studies .......................... SUCCESS [0.949s]
[INFO] Hadoop Examples JAR ............................... SUCCESS [0.894s]
[INFO] Snippet testing ................................... SUCCESS [4.474s]
[INFO] Hadoop: The Definitive Guide, Example Code ........ SUCCESS [0.000s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 3:23.641s
[INFO] Finished at: Sun Dec 29 00:54:02 PST 2013
[INFO] Final Memory: 158M/562M
[INFO] ------------------------------------------------------------------------

Now I am supposed to be able to run it, but I don’t know how yet.  I am also waiting for the data files to finish downloading.
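From skimming ahead in Chapter 2, I gather you run it through a small driver class that wires the mapper above to a matching reducer (MaxTemperatureReducer in the book) and submits the job.  Again, this is my paraphrase of the book's listing, not something I have been able to run yet:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
        }
        Job job = new Job();
        job.setJarByClass(MaxTemperature.class); // locates the containing JAR on the cluster
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output dir must not exist yet

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}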


Back to Learning Microsoft Technologies

The last week was hectic.  At work, the team needs to create a POC project that uses Microsoft Dynamics CRM 2013 for case management.  With Christmas coming, almost everyone took off for the holidays.  Right before that, I volunteered to configure or create the CRM entities, based on the logical data model I’ve been working on for the last 5 months.

So, I’ve downloaded the Dynamics CRM SDK, learned how to query the entities and perform CRUD operations using the early-bound object model (with code generated by CrmSvcUtil.exe).  I’ve also tried using the late-bound API to do the same.  On the configuration side, I’ve learned how to modify the existing entities by adding fields, relationships, and forms.  In addition, I’ve learned how to create new custom entities.

When it comes to the data, I used the Settings area to first download a Data Import Template (an XML file).  The fields in the template are based on the form I created for the entity.  I then opened the XML file in Excel and populated the template with real data rows.  I saved the file and used the Import Data wizard in Settings to import the data rows into the CRM entity.  It walked me through the steps where I can define the mappings between the Excel file’s columns and the target entity’s fields.  What’s interesting is that, for a “lookup” column, you can put in any field that uniquely identifies a record in the lookup entity.

My next step, after completing the configuration of entities in CRM, is to write a test harness or client app that performs simple case management activities.  The PM may not want me to do it, but I think I should just do it myself and learn something.  I am thinking of creating a LightSwitch app to connect to the OData services exposed by Dynamics CRM.  This article has good info: http://wp.sjkp.dk/lightswitch-working-with-microsoft-dynamics-crm-2011-or-2013-data-2/.  There are lots of samples there.

I’ve noticed that LightSwitch lets you create an application based on a data endpoint.  The choices (for VS 2013) are database, SharePoint, OData service, or WCF RIA service.  The last one piqued my interest, so I am going to try it.  I don’t remember whether I’ve heard of WCF RIA Services before or not.  This post has a good walkthrough that I am going to try soon: http://www.silverlightshow.net/items/WCF-RIA-Services-Part-1-Getting-Started.aspx.

I have not done any extensive application development since 2007 because my focus shifted to BI.  There was so much to learn, and I can say I am 80% on top of the Microsoft BI stack now.  With my renewed interest in coding, there is so much to catch up on.

MongoDB Driver Import Issue – Resolved

Tonight my younger daughter had an orchestra concert at her school.  I went and sat on the gym bench reading “Hadoop: The Definitive Guide”.  Even though I cannot do the hands-on exercises there, it got me started on seeing the actual MapReduce Java code.

After the concert I continued to work on my first Java MongoDB program.  Since the issue was that the interpreter could not find the MongoDB Java driver, I re-downloaded the mongo-java-driver-2.11.3.jar file and placed it in the /usr/lib/mongodb folder (created by me).  I am not sure if it’s the right folder, but I gave it a try.  Some blogger suggested putting it in the JAVA_HOME folder.  That led me to discover that I didn’t have that variable set.  After some searching, I found that I have multiple versions (6 and 7) of the OpenJDK runtime in the /usr/lib/jvm folder.  So, I set the variable in my .bashrc file for future use.

I found in this forum (http://forums.devshed.com/java-help-9/importing-a-jar-file-with-eclipse-595444.html) that someone had a similar question, and this is the answer that worked:

In project explorer right click on the folder/package where you want to import the JAR. Then choose “Import”. In the import wizard, select General > File system. Browse for the JAR and select it. Now you imported the JAR, but before you can use it, you need to add it to the classpath. Once again in project explorer right click on the project root folder, select Properties. In the properties menu select Java Build Path. In Java Build Path open the libraries tab and click on “Add JARs…” and browse for the JAR you just imported.

If you do not wish to actually import the package to your project but just use it, then you can skip the first part of this instruction and directly go to the “Libraries” tab in the Java Build Path. This time you should click on “Add external JARs…” and browse for the JAR in your file system.

I imported the jar file by selecting the Import option after right-clicking on the project icon in Package Explorer, then selecting General > File system and browsing to the folder.  After the selection, the list box on the right showed the jar files, but I checked the folder in the left list box instead.  I guess that means importing all the jar files in that folder.

I didn’t set the classpath, but it worked.  I’ve added some code to add documents to the collection and iterate through the documents in the collection to print out the objects.

I like how the driver supports DBObject, which serializes into JSON format.

// coll is the DBCollection opened earlier, e.g.:
// DBCollection coll = new MongoClient("localhost").getDB("test").getCollection("testData");
BasicDBObject doc = new BasicDBObject("name", "MongoDB")
        .append("type", "database")
        .append("count", 1)
        .append("info", new BasicDBObject("x", 203).append("y", 102));

coll.insert(doc);

// find() returns a cursor over all documents in the collection
for (DBObject obj : coll.find()) {
    System.out.println(obj.toString());
}

The result is:

{ "_id" : { "$oid" : "52b146b8e4b04530a89967c2"} , "name" : "John" , "type" : "Singer" , "lastname" : "Lennon" , "info" : { "x" : 203 , "y" : 102}}

I guess they might have the same thing for C# and I just haven’t come across it.  So far I’ve only tried the LINQ driver, which does require you to define a POCO class to represent the schema.

Installing MongoDB on Ubuntu

Tonight I’ve done some learning on Java programming.  Once I got used to the syntax and keywords, it’s quite easy.  I can do interfaces and inheritance (extends).  I also found a way to add an additional option for the JRE version by changing the properties of the project.  The use of binary literals (0b10101, for example) only works on JRE 7; Eclipse automatically asked me if I wanted to switch to 7.  Nice.  A toy example of what I tried is below.
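This is roughly the kind of scratch code I was playing with (entirely my own toy example, nothing from a tutorial):

// An interface, inheritance via extends, and a JDK 7 binary literal
interface Instrument {
    void play();
}

class StringInstrument {
    protected int strings = 0b110; // binary literal for 6; needs JRE 7
}

class Guitar extends StringInstrument implements Instrument {
    public void play() {
        System.out.println("Strumming " + strings + " strings");
    }
}

public class Scratch {
    public static void main(String[] args) {
        new Guitar().play(); // prints: Strumming 6 strings
    }
}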

Then, I proceeded to install MongoDB on my Ubuntu system.  The official MongoDB site does it one way, without storing the key for the mongodb user.  I found a site that provides a different way to do the server installation: http://www.mkyong.com/mongodb/how-to-install-mongodb-on-ubuntu/

Now I can start and stop the server.  However, I still need to find out how to import the MongoDB driver JAR files into my project.

Last Week’s Learning on C# Programming with MongoDB

Over the week, I did the following:

1. On Windows, I played with the MongoDB C# driver and its LINQ support.  The tutorial (http://docs.mongodb.org/ecosystem/tutorial/use-linq-queries-with-csharp-driver/) shows that you have to declare a POCO class that exposes an Id property:

public class Entity
{
    public ObjectId Id { get; set; }

    public string Name { get; set; }
}

Then, you create the client, server, database, and typed collection:

var connectionString = "mongodb://localhost";
var client = new MongoClient(connectionString);
var server = client.GetServer();
var database = server.GetDatabase("test"); // "test" is the name of the database
var collection = database.GetCollection<Entity>("entities"); // "entities" is the name of the collection

After that, you can insert a new document by using collection.Insert, and query the collection with

var query = Query<Entity>.EQ(e => e.Name, "Tom"); // substitute any field and value to search on
var entity = collection.FindOne(query);

Or, use Set or Save methods to modify the document.

However, I feel that there must be another set of APIs that allows type-less access to the database.  The beauty of NoSQL is that it allows schema-less design.  I saw that some of the methods have a different type of interface, so perhaps that’s what it’s for.  Will explore that later.

2.  I’ve installed Ubuntu on my retired Windows PC and put Java and Hadoop on it.  This time, I used the Cloudera CDH package, so things went a lot smoother.  Everything uses the default install and there’s no need to modify the site.xml files.  This is all very new to me, so I don’t fully understand what went on.  But the WordCount sample code compiled and ran.

I’ve created two shell script files in my ~/bin folder: hadoop-hdfs-start and hadoop-hdfs-stop, for starting and stopping all the processes for hadoop, mapred, and yarn.


Started to learn MongoDB

I temporarily put aside my quest on Hadoop and dabbled in MongoDB a little this week, just because I wanted something less frustrating.

MongoDB is relatively easy to get into.  You download the file, unzip it, and create one directory dedicated as the data folder and another for the log folder.  You can set up a mongod.cfg file for basic configuration of mongod.exe, the service engine listening on the port; a minimal example is below.
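For reference, here is a minimal mongod.cfg of the kind I mean.  This is my own sketch using the 2.x key=value config format, and the paths are just examples:

# where mongod stores its data files
dbpath=c:\learn\mongoDb\data
# write log output to a file instead of the console
logpath=c:\learn\mongoDb\log\mongod.log
logappend=true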

mongod runs as a Windows service.  To install it, open a cmd prompt window as Administrator and type

  • Bin\mongod.exe --config c:\learn\mongoDb\mongod.cfg --install

The config file needs to be given as an absolute path.  If not, it has to be fixed up that way in the registry entry for the service.

Open a command line prompt and run mongo.  That’s the client.  It presents you with a console where you write commands in JavaScript.  Will find out whether it supports C#.

It defaults to test as the database.

  • db shows the currently open database.
  • show dbs lists all the databases.
  • use <database> creates and opens a new database.

Instead of tables, MongoDB has collections.

  •  db.<collection>.insert(<something>)

Generate data in a loop

  • for (var i = 1; i <= 25; i++) db.testData.insert( { x : i } )

Show data with

  • db.testData.find()

It will show all the objects in testData, a page at a time.  You need to type "it" to go to the next page.

That’s it so far.