Pig
For online Hadoop training, send mail to [email protected]
Agenda
Download Pig tar.gz file
Extract the content of Pig tar.gz
Configure pig-env.sh file
Configure pig.properties file
Start your Hadoop
Start Pig shell
Input file for Pig query
Access HDFS from Pig shell
Execute Pig commands
Store Pig query's output into HDFS
Check the output
Comparison of HBase/Hive/Pig
Download Pig from Apache website
www.apache.org/dyn/closer.cgi/pig
Select a stable version of Pig
Click on pig-0.11.0.tar.gz
Save pig-0.11.0.tar.gz file
Untar pig-0.11.0.tar.gz file
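For example, assuming the 0.11.0 release and a working directory of /home/neeraj/local_cluster_home (the mirror URL and local paths are illustrative):
cd /home/neeraj/local_cluster_home
wget http://archive.apache.org/dist/pig/pig-0.11.0/pig-0.11.0.tar.gz   # any Apache mirror works
tar -xzf pig-0.11.0.tar.gz                                             # creates pig-0.11.0/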
Configure pig-env.sh file
Create pig-env.sh file in PIG_HOME/conf
Add the following entries in PIG_HOME/conf/pig-env.sh file
export JAVA_HOME=/usr
export PIG_HOME=/home/neeraj/local_cluster_home/pig-0.11.0
export HADOOP_HOME=/home/neeraj/local_cluster_home/hadoop-1.0.3
export PIG_CLASSPATH=$HADOOP_HOME/conf/
Configure pig.properties file
Add the following entries in PIG_HOME/conf/pig.properties file
fs.default.name=hdfs://localhost:9000
mapred.job.tracker=localhost:9001
Copy core-site.xml, hdfs-site.xml & mapred-site.xml files from HADOOP_HOME/conf to PIG_HOME/conf
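Assuming HADOOP_HOME and PIG_HOME are exported as in pig-env.sh above, a sketch of the copy:
cp $HADOOP_HOME/conf/core-site.xml   $PIG_HOME/conf/
cp $HADOOP_HOME/conf/hdfs-site.xml   $PIG_HOME/conf/
cp $HADOOP_HOME/conf/mapred-site.xml $PIG_HOME/conf/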
Start your Hadoop
Check Hadoop processes & safe mode
Make sure that safe mode is off before you start Pig
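A minimal sketch for a Hadoop 1.x single-node setup (script names assume the HADOOP_HOME shown earlier):
$HADOOP_HOME/bin/start-all.sh        # starts NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker
jps                                  # verify that all five daemons are running
hadoop dfsadmin -safemode get        # should report: Safe mode is OFF
hadoop dfsadmin -safemode leave      # leave safe mode manually if it is still ON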
Start Pig shell
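With PIG_CLASSPATH pointing at the Hadoop conf directory, the Grunt shell can be started in MapReduce mode; local mode is shown only as an alternative for quick tests:
$PIG_HOME/bin/pig            # MapReduce mode, connects to the cluster configured above
$PIG_HOME/bin/pig -x local   # local mode, runs against the local file system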
Input file for Pig
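The queries below assume a tab-separated file with one year and one temperature per line (9999 is filtered out as a missing reading in the queries below); the sample rows are illustrative:
1950    22
1950    9999
1951    31
Copy it into HDFS under the path used in the queries:
hadoop fs -mkdir /pig_input_files
hadoop fs -put temprature.txt /pig_input_files/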
Access HDFS from Pig shell
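From the Grunt prompt, HDFS can be browsed directly with fs commands, for example:
grunt> fs -ls /pig_input_files
grunt> fs -cat /pig_input_files/temprature.txt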
Execute Pig query
records = LOAD '/pig_input_files/temprature.txt' AS (year:chararray, temperature:int);
filtered_records = FILTER records BY temperature != 9999;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
DUMP max_temp;
Execute Pig query
records = LOAD '/pig_input_files/temprature.txt' AS (year:chararray, temperature:int);
filtered_records = FILTER records BY temperature != 9999;
grouped_records = GROUP filtered_records BY year;
max_temp = FOREACH grouped_records GENERATE group, MAX(filtered_records.temperature);
STORE max_temp INTO '/pig_output_files';
Pig job details
Output of Pig query
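STORE writes one or more part files under the output directory; they can be inspected from the command line (the part file name may differ on your cluster):
hadoop fs -ls /pig_output_files
hadoop fs -cat /pig_output_files/part-r-00000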
Exit from Pig shell
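From the Grunt prompt:
grunt> quit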
HBase/Hive/Pig
HBase/Hive/Pig suitability
HBase is suitable when...
When you need to handle unstructured data
When you need to edit the data
When you need versioned data
Hive is suitable when...
When you need to handle structured data
When you don't need to edit the data
When you are comfortable with SQL syntax
Pig is suitable when...
When you need to handle structured data
When you don't need to edit the data
When you are comfortable with scripting
…Thanks…
For online Hadoop training, send mail to [email protected]