Apache Pig | |
Developer: | Apache Software Foundation, Yahoo Research |
Latest Release Version: | 0.17.0 |
Operating System: | Microsoft Windows, OS X, Linux |
Genre: | Data analytics |
License: | Apache License 2.0 |
Apache Pig[1] is a high-level platform for creating programs that run on Apache Hadoop. The language for this platform is called Pig Latin. Pig can execute its Hadoop jobs in MapReduce, Apache Tez, or Apache Spark.[2] Pig Latin abstracts programming away from the Java MapReduce idiom into a higher-level notation, much as SQL does for relational database management systems. Pig Latin can be extended with user-defined functions (UDFs), which the user can write in Java, Python, JavaScript, Ruby or Groovy[3] and then call directly from the language.
Apache Pig was originally[4] developed at Yahoo Research around 2006 to give researchers an ad hoc way of creating and executing MapReduce jobs on very large data sets. In 2007,[5] it was moved into the Apache Software Foundation.
Version | Original release date | Latest version | Release date[6]
---|---|---|---
0.1 | 2008-09-11 | 0.1.1 | 2008-12-05
0.2 | 2009-04-08 | 0.2.0 | 2009-04-08
0.3 | 2009-06-25 | 0.3.0 | 2009-06-25
0.4 | 2009-08-29 | 0.4.0 | 2009-08-29
0.5 | 2009-09-29 | 0.5.0 | 2009-09-29
0.6 | 2010-03-01 | 0.6.0 | 2010-03-01
0.7 | 2010-05-13 | 0.7.0 | 2010-05-13
0.8 | 2010-12-17 | 0.8.1 | 2011-04-24
0.9 | 2011-07-29 | 0.9.2 | 2012-01-22
0.10 | 2012-01-22 | 0.10.1 | 2012-04-25
0.11 | 2013-02-21 | 0.11.1 | 2013-04-01
0.12 | 2013-10-14 | 0.12.1 | 2014-04-14
0.13 | 2014-07-04 | 0.13.0 | 2014-07-04
0.14 | 2014-11-20 | 0.14.0 | 2014-11-20
0.15 | 2015-06-06 | 0.15.0 | 2015-06-06
0.16 | 2016-06-08 | 0.16.0 | 2016-06-08
0.17 | 2017-06-19 | 0.17.0 | 2017-06-19
The name of the Pig programming language was chosen arbitrarily; it stuck because it was memorable, easy to spell, and novel.[7][8][9]
Below is an example of a "Word Count" program in Pig Latin:
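A minimal sketch of such a script, assuming a plain-text input with one record per line. The input and output paths and all relation names are illustrative, not taken from any particular deployment:

```pig
-- Load each line of the input as a single chararray field.
-- '/tmp/input-pages' is an illustrative path.
input_lines = LOAD '/tmp/input-pages' AS (line:chararray);

-- Split each line into words; FLATTEN turns the resulting bag
-- into one row per word.
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- Keep only tokens made up entirely of word characters.
filtered_words = FILTER words BY word MATCHES '\\w+';

-- Group identical words together and count the rows in each group.
word_groups = GROUP filtered_words BY word;
word_count = FOREACH word_groups GENERATE group AS word, COUNT(filtered_words) AS occurrences;

-- Sort by descending frequency and write the result.
ordered_word_count = ORDER word_count BY occurrences DESC;
STORE ordered_word_count INTO '/tmp/word-counts';
```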
The above program will generate parallel executable tasks which can be distributed across multiple machines in a Hadoop cluster to count the number of words in a dataset such as all the webpages on the internet.
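In comparison to SQL, Pig is procedural rather than declarative, lets users control aspects of the execution plan, supports splitting a pipeline into sub-streams, and allows user code and data loading at any point in the pipeline.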
On the other hand, it has been argued that DBMSs are substantially faster than MapReduce systems once the data is loaded, but that loading the data takes considerably longer in the database systems. It has also been argued that RDBMSs offer out-of-the-box support for column storage, working with compressed data, indexes for efficient random data access, and transaction-level fault tolerance.[10]
Pig Latin is procedural and fits naturally in the pipeline paradigm, while SQL is declarative. In SQL, users can specify that data from two tables must be joined, but generally not which join implementation to use. (Some SQL systems do let the query writer specify the join implementation, but "... for many SQL applications the query writer may not have enough knowledge of the data or enough expertise to specify an appropriate join algorithm.") Pig Latin allows users to specify an implementation, or aspects of an implementation, to be used in executing a script in several ways. In effect, Pig Latin programming is similar to specifying a query execution plan, making it easier for programmers to explicitly control the flow of their data processing task.[11]
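As a hedged illustration of specifying an implementation, the sketch below joins the same two relations three ways using the `USING` clause of `JOIN`, and sets reducer parallelism with `PARALLEL`. The relation names, field names and input paths are hypothetical:

```pig
-- Hypothetical inputs: a large click log and a small user table.
clicks = LOAD 'clicks' AS (user_id:int, url:chararray);
users  = LOAD 'users'  AS (user_id:int, name:chararray);

-- Default join: no implementation is named, so Pig uses its
-- standard shuffle-based join.
joined_default = JOIN clicks BY user_id, users BY user_id;

-- Fragment-replicate join: the relations listed after the first
-- (here, users) are held in memory, so they must be small.
joined_replicated = JOIN clicks BY user_id, users BY user_id USING 'replicated';

-- Skewed join: intended for keys whose values are unevenly distributed.
joined_skewed = JOIN clicks BY user_id, users BY user_id USING 'skewed';

-- PARALLEL requests a number of reduce tasks for this operator.
clicks_by_url = GROUP clicks BY url PARALLEL 10;
```

Which variant is appropriate depends on the size and key distribution of the data, which is exactly the knowledge the script author is assumed to have.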
SQL is oriented around queries that produce a single result. SQL handles trees naturally, but has no built-in mechanism for splitting a data processing stream and applying different operators to each sub-stream. A Pig Latin script describes a directed acyclic graph (DAG) rather than a linear pipeline.
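A minimal sketch of such a split, using the `SPLIT` operator to send one input down two branches that are processed differently and stored separately. The relation names, fields and paths are assumptions made for the example:

```pig
-- Hypothetical input: one row per event.
events = LOAD 'events' AS (user_id:int, status:chararray, num_bytes:long);

-- Split the stream into two sub-streams. Rows whose status is null
-- satisfy neither condition and are dropped.
SPLIT events INTO
    ok_events IF status == 'ok',
    failed_events IF status != 'ok';

-- Branch 1: total bytes transferred per user for successful events.
ok_by_user = GROUP ok_events BY user_id;
ok_totals = FOREACH ok_by_user GENERATE group AS user_id, SUM(ok_events.num_bytes) AS total_bytes;

-- Branch 2: number of failures per user.
failed_by_user = GROUP failed_events BY user_id;
failed_counts = FOREACH failed_by_user GENERATE group AS user_id, COUNT(failed_events) AS failures;

-- Two outputs from one script: the plan is a DAG, not a single chain.
STORE ok_totals INTO 'out/ok_totals';
STORE failed_counts INTO 'out/failed_counts';
```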
Pig Latin's ability to include user code at any point in the pipeline is useful for pipeline development. If SQL is used, data must first be imported into the database, and then the cleansing and transformation process can begin.[12]
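The sketch below shows one way user code might be placed in the middle of a pipeline that reads raw files directly, with no prior import step. The jar `myudfs.jar` and the function `myudfs.CleanUrl` are hypothetical; the example assumes a Java UDF of that name that takes a chararray and returns a chararray:

```pig
-- Hypothetical jar containing a user-defined function.
REGISTER myudfs.jar;
DEFINE CleanUrl myudfs.CleanUrl();

-- Read raw data straight from storage.
raw = LOAD 'raw_logs' AS (user_id:int, url:chararray);

-- Built-in operator first, then user code, then built-in operators again.
non_empty = FILTER raw BY url IS NOT NULL;
cleaned = FOREACH non_empty GENERATE user_id, CleanUrl(url) AS url;

by_url = GROUP cleaned BY url;
url_hits = FOREACH by_url GENERATE group AS url, COUNT(cleaned) AS num_hits;

STORE url_hits INTO 'out/hits_by_url';
```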