Hadoop 1: Get Started

Hadoop is an open-source software framework that supports data-intensive distributed applications.

There are two architectural layers: HDFS and MapReduce.

  • HDFS is a filesystem that can store very large data sets by scaling out across a cluster of hosts.
  • MapReduce is a data processing framework that takes a specification of how the data will be input and output from its two stages (map and reduce) and applies it across large data sets on HDFS.

Apache Hadoop consists of four components: NameNode, DataNode, JobTracker and TaskTracker. These components can be deployed in three modes: local standalone mode, pseudo-distributed mode and fully distributed mode.

The following shows how to get Hadoop started in the default (local standalone) mode on Linux, which is the simplest mode and needs the least configuration. I am using Oracle Enterprise Linux on VirtualBox, but the same steps should work on other Linux / Unix systems.

1. Download

Download the tarball hadoop-1.2.1-bin.tar.gz from http://mirror.its.dal.ca/apache/hadoop/common/hadoop-1.2.1/.

2. Uncompress

tar -xf hadoop-1.2.1-bin.tar.gz

In my case, it was uncompressed to /home/hadoop/hadoop-1.2.1
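
If you want to know where the tarball will land before extracting it, listing its top-level directory is a quick check. The following is only a minimal sketch: top_level_dir is a helper name I made up for illustration, and it assumes a gzip-compressed tarball.

```shell
# Hypothetical helper (not part of Hadoop): print the top-level
# directory name(s) found inside a gzip-compressed tarball.
top_level_dir() {
  tar -tzf "$1" | cut -d/ -f1 | sort -u
}

# Example usage (assumes the tarball is in the current directory):
# top_level_dir hadoop-1.2.1-bin.tar.gz
```

If it prints a single name, the archive will unpack into one directory of that name under the current directory.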

3. Add Symbolic Link

su

ln -s /home/hadoop/hadoop-1.2.1 /opt/hadoop
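
A plain ln -s can do the wrong thing if the link already exists (for example after an upgrade to a newer release). Below is a small sketch of a more forgiving variant. The function name link_hadoop is made up for illustration, and it assumes an ln that supports the -n option (GNU and BSD ln both do).

```shell
# Hypothetical helper: point LINK at TARGET, replacing an older link if present.
link_hadoop() {
  target=$1
  link=$2
  # Refuse to create a dangling link.
  [ -d "$target" ] || { echo "not a directory: $target" >&2; return 1; }
  # -s symbolic, -f replace an existing link, -n do not follow an existing link
  ln -sfn "$target" "$link"
}

# Example usage (run as root, paths as in this post):
# link_hadoop /home/hadoop/hadoop-1.2.1 /opt/hadoop
```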

4. Setup Local Profile

I am using the bash shell, so I added the following two lines to my .bashrc.

export HADOOP_HOME=/home/hadoop/hadoop-1.2.1
export PATH=$HADOOP_HOME/bin:$PATH
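
If you script this step, it is easy to end up with the same lines in .bashrc several times. Here is a minimal sketch that appends a line only when it is not already there. append_once is a name I made up, not a standard command.

```shell
# Hypothetical helper: append LINE to FILE unless an identical line already exists.
append_once() {
  file=$1
  line=$2
  touch "$file"
  # -x whole-line match, -F fixed string (no regex), -q quiet
  grep -qxF -e "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Example usage (single quotes keep $HADOOP_HOME unexpanded in the file):
# append_once ~/.bashrc 'export HADOOP_HOME=/home/hadoop/hadoop-1.2.1'
# append_once ~/.bashrc 'export PATH=$HADOOP_HOME/bin:$PATH'
```

Remember to run source ~/.bashrc or open a new shell afterwards so the variables take effect.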

5. Setup JAVA_HOME

Find out where your Java SDK is located. In my case, it is /usr/java/latest.
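
If you are not sure where Java lives, one way is to follow the java binary on your PATH back to its installation directory. This is a rough sketch only: java_home_from_binary is a made-up name, it assumes GNU readlink (for the -f option), and it assumes the usual <home>/bin/java layout.

```shell
# Hypothetical helper: resolve symlinks on a java binary and strip the
# trailing /bin/java to get a candidate JAVA_HOME.
java_home_from_binary() {
  real=$(readlink -f "$1") || return 1
  dirname "$(dirname "$real")"
}

# Example usage:
# java_home_from_binary "$(command -v java)"
```

If the resolved binary sits under a jre/ subdirectory of a JDK, the result will be that jre directory, so check the output before using it.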

Edit the $HADOOP_HOME/conf/hadoop-env.sh script and set the following line:

export JAVA_HOME=/usr/java/latest
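
The edit can also be scripted. The sketch below assumes that hadoop-env.sh either contains an export JAVA_HOME= line (possibly commented out with #) or none at all, and that the path contains no |, & or backslash characters. set_java_home is a name I made up for illustration.

```shell
# Hypothetical helper: set JAVA_HOME in an env script.
# Replaces an existing (possibly commented-out) export line, else appends one.
set_java_home() {
  env_file=$1
  java_home=$2
  if grep -q '^[# ]*export JAVA_HOME=' "$env_file"; then
    # Write to a temporary file first so a failed sed does not clobber the original.
    sed "s|^[# ]*export JAVA_HOME=.*|export JAVA_HOME=$java_home|" "$env_file" > "$env_file.tmp" &&
      mv "$env_file.tmp" "$env_file"
  else
    printf 'export JAVA_HOME=%s\n' "$java_home" >> "$env_file"
  fi
}

# Example usage (paths as in this post):
# set_java_home "$HADOOP_HOME/conf/hadoop-env.sh" /usr/java/latest
```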

6. Test Drive

cd $HADOOP_HOME

hadoop jar ./hadoop-examples-1.2.1.jar pi 2 1000

This starts $HADOOP_HOME/bin/hadoop with $HADOOP_HOME/hadoop-examples-1.2.1.jar and runs the pi example with 2 map tasks, each using 1000 samples, to estimate pi. The Java class behind this example is PiEstimator.
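
To make the estimate less of a black box: as far as I understand the PiEstimator source, the map tasks generate points in the unit square from a Halton sequence (bases 2 and 3) and count how many fall inside the inscribed circle, and the final value is 4 x inside / total. The awk sketch below mirrors that idea in a single process. It is my own illustration, not the Hadoop code, and estimate_pi is a made-up name.

```shell
# Hypothetical helper: estimate pi from N quasi-random points (default 2000).
estimate_pi() {
  awk -v n="${1:-2000}" '
    # Radical inverse of i in the given base (one coordinate of a Halton point).
    function halton(i, base,    f, r) {
      f = 1; r = 0
      while (i > 0) {
        f /= base
        r += f * (i % base)
        i = int(i / base)
      }
      return r
    }
    BEGIN {
      inside = 0
      for (i = 1; i <= n; i++) {
        # Shift the point so the circle of radius 0.5 is centred on the origin.
        x = halton(i, 2) - 0.5
        y = halton(i, 3) - 0.5
        if (x * x + y * y <= 0.25) inside++
      }
      # Circle area / square area = pi / 4, so pi is about 4 * inside / n.
      printf "%.4f\n", 4 * inside / n
    }'
}

# Example usage, matching the total sample count of the test drive (2 x 1000):
# estimate_pi 2000
```

In the real job the counting is split across the map tasks and a single reduce task adds up the counts.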

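Hadoop is running in the default (local standalone) mode. In this mode no separate NameNode, DataNode, JobTracker or TaskTracker daemons are started; the whole job runs in a single Java process against the local filesystem, which is why the output below shows LocalJobRunner and file:/ paths.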

The following is the output from this test.

[hadoop@localhost hadoop-1.2.1]$ hadoop jar ./hadoop-examples-1.2.1.jar pi 2 1000
Warning: $HADOOP_HOME is deprecated.

Number of Maps  = 2
Samples per Map = 1000
13/09/13 10:05:00 INFO util.NativeCodeLoader: Loaded the native-hadoop library
Wrote input for Map #0
Wrote input for Map #1
Starting Job
13/09/13 10:05:00 INFO mapred.FileInputFormat: Total input paths to process : 2
13/09/13 10:05:00 INFO mapred.JobClient: Running job: job_local691945750_0001
13/09/13 10:05:00 INFO mapred.LocalJobRunner: Waiting for map tasks
13/09/13 10:05:00 INFO mapred.LocalJobRunner: Starting task: attempt_local691945750_0001_m_000000_0
13/09/13 10:05:00 INFO util.ProcessTree: setsid exited with exit code 0
13/09/13 10:05:00 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@ec4a87
13/09/13 10:05:00 INFO mapred.MapTask: Processing split: file:/home/hadoop/hadoop-1.2.1/PiEstimator_TMP_3_141592654/in/part0:0+118
13/09/13 10:05:00 INFO mapred.MapTask: numReduceTasks: 1
13/09/13 10:05:00 INFO mapred.MapTask: io.sort.mb = 100
13/09/13 10:05:00 INFO mapred.MapTask: data buffer = 79691776/99614720
13/09/13 10:05:00 INFO mapred.MapTask: record buffer = 262144/327680
13/09/13 10:05:00 INFO mapred.MapTask: Starting flush of map output
13/09/13 10:05:00 INFO mapred.MapTask: Finished spill 0
13/09/13 10:05:00 INFO mapred.Task: Task:attempt_local691945750_0001_m_000000_0 is done. And is in the process of commiting
13/09/13 10:05:00 INFO mapred.LocalJobRunner: Generated 1000 samples.
13/09/13 10:05:00 INFO mapred.Task: Task 'attempt_local691945750_0001_m_000000_0' done.
13/09/13 10:05:00 INFO mapred.LocalJobRunner: Finishing task: attempt_local691945750_0001_m_000000_0
13/09/13 10:05:00 INFO mapred.LocalJobRunner: Starting task: attempt_local691945750_0001_m_000001_0
13/09/13 10:05:00 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@2cb49d
13/09/13 10:05:00 INFO mapred.MapTask: Processing split: file:/home/hadoop/hadoop-1.2.1/PiEstimator_TMP_3_141592654/in/part1:0+118
13/09/13 10:05:00 INFO mapred.MapTask: numReduceTasks: 1
13/09/13 10:05:00 INFO mapred.MapTask: io.sort.mb = 100
13/09/13 10:05:01 INFO mapred.MapTask: data buffer = 79691776/99614720
13/09/13 10:05:01 INFO mapred.MapTask: record buffer = 262144/327680
13/09/13 10:05:01 INFO mapred.MapTask: Starting flush of map output
13/09/13 10:05:01 INFO mapred.MapTask: Finished spill 0
13/09/13 10:05:01 INFO mapred.Task: Task:attempt_local691945750_0001_m_000001_0 is done. And is in the process of commiting
13/09/13 10:05:01 INFO mapred.LocalJobRunner: Generated 1000 samples.
13/09/13 10:05:01 INFO mapred.Task: Task 'attempt_local691945750_0001_m_000001_0' done.
13/09/13 10:05:01 INFO mapred.LocalJobRunner: Finishing task: attempt_local691945750_0001_m_000001_0
13/09/13 10:05:01 INFO mapred.LocalJobRunner: Map task executor complete.
13/09/13 10:05:01 INFO mapred.Task:  Using ResourceCalculatorPlugin : org.apache.hadoop.util.LinuxResourceCalculatorPlugin@1d225a7
13/09/13 10:05:01 INFO mapred.LocalJobRunner:
13/09/13 10:05:01 INFO mapred.Merger: Merging 2 sorted segments
13/09/13 10:05:01 INFO mapred.Merger: Down to the last merge-pass, with 2 segments left of total size: 48 bytes
13/09/13 10:05:01 INFO mapred.LocalJobRunner:
13/09/13 10:05:01 INFO mapred.Task: Task:attempt_local691945750_0001_r_000000_0 is done. And is in the process of commiting
13/09/13 10:05:01 INFO mapred.LocalJobRunner:
13/09/13 10:05:01 INFO mapred.Task: Task attempt_local691945750_0001_r_000000_0 is allowed to commit now
13/09/13 10:05:01 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local691945750_0001_r_000000_0' to file:/home/hadoop/hadoop-1.2.1/PiEstimator_TMP_3_141592654/out
13/09/13 10:05:01 INFO mapred.LocalJobRunner: reduce > reduce
13/09/13 10:05:01 INFO mapred.Task: Task 'attempt_local691945750_0001_r_000000_0' done.
13/09/13 10:05:01 INFO mapred.JobClient:  map 100% reduce 100%
13/09/13 10:05:01 INFO mapred.JobClient: Job complete: job_local691945750_0001
13/09/13 10:05:01 INFO mapred.JobClient: Counters: 21
13/09/13 10:05:01 INFO mapred.JobClient:   File Input Format Counters
13/09/13 10:05:01 INFO mapred.JobClient:     Bytes Read=260
13/09/13 10:05:01 INFO mapred.JobClient:   File Output Format Counters
13/09/13 10:05:01 INFO mapred.JobClient:     Bytes Written=109
13/09/13 10:05:01 INFO mapred.JobClient:   FileSystemCounters
13/09/13 10:05:01 INFO mapred.JobClient:     FILE_BYTES_READ=430323
13/09/13 10:05:01 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=587849
13/09/13 10:05:01 INFO mapred.JobClient:   Map-Reduce Framework
13/09/13 10:05:01 INFO mapred.JobClient:     Map output materialized bytes=56
13/09/13 10:05:01 INFO mapred.JobClient:     Map input records=2
13/09/13 10:05:01 INFO mapred.JobClient:     Reduce shuffle bytes=0
13/09/13 10:05:01 INFO mapred.JobClient:     Spilled Records=8
13/09/13 10:05:01 INFO mapred.JobClient:     Map output bytes=36
13/09/13 10:05:01 INFO mapred.JobClient:     Total committed heap usage (bytes)=475803648
13/09/13 10:05:01 INFO mapred.JobClient:     CPU time spent (ms)=0
13/09/13 10:05:01 INFO mapred.JobClient:     Map input bytes=48
13/09/13 10:05:01 INFO mapred.JobClient:     SPLIT_RAW_BYTES=240
13/09/13 10:05:01 INFO mapred.JobClient:     Combine input records=0
13/09/13 10:05:01 INFO mapred.JobClient:     Reduce input records=4
13/09/13 10:05:01 INFO mapred.JobClient:     Reduce input groups=4
13/09/13 10:05:01 INFO mapred.JobClient:     Combine output records=0
13/09/13 10:05:01 INFO mapred.JobClient:     Physical memory (bytes) snapshot=0
13/09/13 10:05:01 INFO mapred.JobClient:     Reduce output records=0
13/09/13 10:05:01 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=0
13/09/13 10:05:01 INFO mapred.JobClient:     Map output records=4
Job Finished in 1.236 seconds
Estimated value of Pi is 3.14400000000000000000
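
Most of this output is INFO logging. If you only care about the final number, you can filter for the last line. A minimal sketch, assuming the output format shown above (extract_pi is a made-up name):

```shell
# Hypothetical helper: read job output on stdin and print only the estimate.
extract_pi() {
  awk '/^Estimated value of Pi is/ { print $NF }'
}

# Example usage (2>&1 merges the log messages into the same stream):
# hadoop jar ./hadoop-examples-1.2.1.jar pi 2 1000 2>&1 | extract_pi
```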

About henry416
I am a computer technology explorer and a university student based in Toronto. If you have any questions, please feel free to discuss and comment here.
