Setting up Apache Pig

Apache Pig is part of the Hadoop ecosystem: a high-level, procedural data-flow language that makes processing data on Hadoop much easier than writing MapReduce jobs by hand. You write a script in Pig Latin which, under the hood, is translated into MapReduce jobs. For a lot of day-to-day data munging this is the quickest way of getting results.
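To give a flavour of what Pig Latin looks like, here is a minimal sketch of a script that groups a colon-delimited file by its first field and counts the rows in each group (the file names and field layout are just placeholders for illustration):

-- load a colon-delimited file, group by the first field, count, and write out
data   = LOAD 'somefile' USING PigStorage(':') AS (user:chararray, rest:chararray);
groups = GROUP data BY user;
counts = FOREACH groups GENERATE group AS user, COUNT(data) AS n;
STORE counts INTO 'counts.out';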

First, download and install Pig:

$ wget http://mirror.ox.ac.uk/sites/rsync.apache.org/pig/pig-0.12.0/pig-0.12.0.tar.gz
$ sudo tar -vxzf pig-0.12.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv pig-0.12.0 pig
$ sudo chown -R hduser:hadoop pig

Now add the following lines to .bashrc:

$ cd ~
$ gedit .bashrc
export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
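If you would rather do this from the command line than in gedit, appending the two lines and re-reading the file achieves the same thing:

$ echo 'export PIG_HOME=/usr/local/pig' >> ~/.bashrc
$ echo 'export PATH=$PATH:$PIG_HOME/bin' >> ~/.bashrc
$ source ~/.bashrc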

Once the new variables are in place (reload .bashrc or open a new terminal), test the installation:

$ pig -h

You should see the help text, which means you now have a working installation of Pig. However, Pig 0.12.0 and Hadoop 2.2.0 are not compatible out of the box; you need to rebuild Pig against Hadoop 2. The rebuild needs ant, so install that first:

$ sudo apt-get install ant

and now rebuild Pig

$ cd /usr/local/pig
$ ant clean jar-withouthadoop -Dhadoopversion=23
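Once ant finishes, it's worth a quick sanity check that the rebuilt "withouthadoop" jar is present and that Pig still starts (the exact jar name can vary slightly between builds, hence the wildcard):

$ ls /usr/local/pig/*withouthadoop*.jar
$ pig -version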

One final thing we'll do is make the logging a bit less noisy, since by default it's so verbose that it's hard to see the wood for the trees. The file /usr/local/pig/conf/pig.properties can point Pig at a log4j configuration file.

First, edit pig.properties:

$ cd /usr/local/pig/conf
$ sudo gedit pig.properties

and add in the following line:

log4jconf=/usr/local/pig/conf/log4j_WARN
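(Equivalently, you can append that line from the shell rather than opening an editor:)

$ echo 'log4jconf=/usr/local/pig/conf/log4j_WARN' | sudo tee -a /usr/local/pig/conf/pig.properties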

Now create this file:

$ cp log4j.properties.template log4j_WARN
$ sudo gedit log4j_WARN

and make sure the following lines are present:

log4j.logger.org.apache.pig=WARN, A
log4j.logger.org.apache.hadoop=WARN, A

Next, let's do something with it.

I installed Hadoop last week, which means I can run Pig in MapReduce mode (i.e. it uses HDFS and Hadoop; the alternative is to run it in local mode, where it uses the local, native filesystem).
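As a quick aside, local mode needs no Hadoop daemons at all, so you can try it straight away against an ordinary local file (just an illustration; DUMP prints the tuples to the console rather than writing to HDFS):

$ pig -x local
grunt> data = LOAD '/etc/passwd' using PigStorage(':');
grunt> DUMP data;
grunt> quit

For the rest of this post, though, I'll stick with MapReduce mode.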

First, I'll make sure that the Hadoop services are running:

$ start-dfs.sh
....
$ start-yarn.sh
....
$ jps
If everything is successful, you should see the following services running:
2986 ResourceManager
3209 NodeManager
2549 DataNode
2329 NameNode
2824 SecondaryNameNode
7970 Jps

And now finally we can make it do something!

Copy a simple file into HDFS:

$ cp /etc/passwd passwd
$ hadoop fs -put passwd .
$ hadoop fs -ls
-rw-r--r-- 1 hduser supergroup 1796 2014-03-31 07:57 passwd

Now go into the Pig shell and do something with it: take the passwd file, strip out the user IDs, and write the result back to HDFS:

$ pig
grunt> data = LOAD 'passwd' using PigStorage(':');
grunt> userids = foreach data generate $0 as id;
grunt> store userids into 'passwd.out';
grunt> quit

$ hadoop fs -cat passwd.out/*
...
sshd
hduser

If that works then you should be OK!
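As a follow-up, the same pipeline can be run non-interactively by putting the statements into a script file and handing it to pig (the file name passwd.pig and the output directory passwd2.out are just examples; the output path must not already exist in HDFS):

$ cat > passwd.pig <<'EOF'
data = LOAD 'passwd' using PigStorage(':');
userids = foreach data generate $0 as id;
store userids into 'passwd2.out';
EOF
$ pig passwd.pig
$ hadoop fs -cat passwd2.out/*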
