Apache Pig is part of the Hadoop ecosystem and makes processing data on Hadoop a lot easier than writing MapReduce jobs by hand. You write a script in Pig Latin, a procedural data-flow language, which under the hood is translated into MapReduce jobs. This is often the quickest way of getting things done.
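To give a flavour of Pig Latin, here is a minimal sketch of a script that counts records per user. The input file logs.txt and its two tab-separated columns are made up purely for illustration, but each statement below is the kind of thing Pig compiles into MapReduce work for you:

-- hypothetical example: count records per user in a tab-delimited file
logs = LOAD 'logs.txt' USING PigStorage('\t') AS (user:chararray, action:chararray);
grouped = GROUP logs BY user;
counts = FOREACH grouped GENERATE group AS user, COUNT(logs) AS n;
STORE counts INTO 'counts.out';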
First, download and install Pig:
$ wget http://mirror.ox.ac.uk/sites/rsync.apache.org/pig/pig-0.12.0/pig-0.12.0.tar.gz
$ sudo tar -vxzf pig-0.12.0.tar.gz -C /usr/local
$ cd /usr/local
$ sudo mv pig-0.12.0 pig
$ sudo chown -R hduser:hadoop pig
Now add the following lines to .bashrc:
$ cd ~
$ gedit .bashrc

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin
Now reload your shell so the changes take effect (restart the terminal, or run source ~/.bashrc). To test the installation, type:
$ pig -h
You should see some usage blurb indicating you now have a working installation of Pig. However, Pig 0.12.0 and Hadoop 2.2.0 are not compatible out of the box, so Pig needs to be rebuilt against Hadoop 2. The rebuild requires Ant, so install that first:
$ sudo apt-get install ant
and now rebuild Pig:
$ cd /usr/local/pig
$ ant clean jar-withouthadoop -Dhadoopversion=23
One final thing we'll do is make the logging a bit less spammy; by default the output is so verbose that it is hard to see the wood for the trees. The file /usr/local/pig/conf/pig.properties can point Pig at a log4j configuration file.
First, edit pig.properties:
$ cd /usr/local/pig/conf
$ sudo gedit pig.properties
and add in the following line:
log4jconf=/usr/local/pig/conf/log4j_WARN
Now create this file:
$ cp log4j.properties.template log4j_WARN
$ sudo gedit log4j_WARN
and make sure the following lines are present:
log4j.logger.org.apache.pig=WARN, A
log4j.logger.org.apache.hadoop=WARN, A
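For reference, the relevant part of log4j_WARN should end up looking something like the sketch below. The copied template already defines a console appender (named A here); the exact appender and layout lines are just an assumption for illustration, so keep whatever the template gives you and only make sure the two logger lines are set to WARN:

log4j.appender.A=org.apache.log4j.ConsoleAppender
log4j.appender.A.layout=org.apache.log4j.PatternLayout
log4j.appender.A.layout.ConversionPattern=%d [%t] %-5p %c - %m%n
log4j.logger.org.apache.pig=WARN, A
log4j.logger.org.apache.hadoop=WARN, A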
Next, let's do something with it.
I installed Hadoop last week, which means I can run Pig in MapReduce mode (i.e. it uses HDFS and Hadoop; the alternative is local mode, where Pig uses the local, native filesystem).
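As an aside, you can pick the execution mode explicitly when launching the shell with the -x flag; both of these are standard Pig invocations, with mapreduce being the default:

$ pig -x local      # run against the local filesystem, no Hadoop services needed
$ pig -x mapreduce  # run against HDFS and the Hadoop cluster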
First, I'll make sure that the Hadoop services are running:
$ start-dfs.sh
....
$ start-yarn.sh
....
$ jps

If everything is successful, you should see the following services running:

2986 ResourceManager
3209 NodeManager
2549 DataNode
2329 NameNode
2824 SecondaryNameNode
7970 Jps
And now finally we can make it do something!
Copy a simple file into HDFS:
$ cp /etc/passwd passwd
$ hadoop fs -put passwd .
$ hadoop fs -ls
-rw-r--r--   1 hduser supergroup       1796 2014-03-31 07:57 passwd
Now go into the Pig shell and do some stuff: take the passwd file, strip out just the user IDs, and write the result back to HDFS:
$ pig
grunt> data = LOAD 'passwd' USING PigStorage(':');
grunt> userids = FOREACH data GENERATE $0 AS id;
grunt> STORE userids INTO 'passwd.out';
grunt> quit
$ hadoop fs -cat passwd.out/*
...
sshd
hduser
If that works then you should be OK!
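As a final tip, once you are happy with the statements you can put them in a script file and run the whole thing non-interactively. The sketch below assumes a made-up script name, userids.pig, and writes to passwd2.out simply to avoid clashing with the output directory created above:

$ cat > userids.pig <<'EOF'
data = LOAD 'passwd' USING PigStorage(':');
userids = FOREACH data GENERATE $0 AS id;
STORE userids INTO 'passwd2.out';
EOF
$ pig userids.pig
$ hadoop fs -cat passwd2.out/*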