Apache Arrow Explained

Developer: Apache Software Foundation
Programming Language: C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, Rust
Genre: Data format, algorithms
License: Apache License 2.0
Repo: https://github.com/apache/arrow

Apache Arrow is a language-agnostic software framework for developing data analytics applications that process columnar data. It contains a standardized column-oriented memory format that can represent flat and hierarchical data for efficient analytic operations on modern CPU and GPU hardware.[1] [2] [3] [4] [5] This reduces or eliminates factors that limit the feasibility of working with large sets of data, such as the cost, volatility, or physical constraints of dynamic random-access memory.
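To make the idea of a column-oriented memory layout concrete, here is a minimal sketch in plain Python. It is a simplified illustration and not Arrow's own library code: it assumes a single integer column stored as one contiguous values buffer plus a validity bitmap that marks which slots are null, and the helper names (build_int_column, is_valid, column_sum) are hypothetical.

```python
from array import array


def build_int_column(values):
    """Pack a list of ints (None = null) into a validity bitmap and a values buffer."""
    validity = bytearray((len(values) + 7) // 8)  # one bit per slot
    data = array("i")  # contiguous signed ints (4 bytes each on most platforms)
    for i, v in enumerate(values):
        if v is None:
            data.append(0)  # placeholder; the slot is ignored because its bit is 0
        else:
            validity[i // 8] |= 1 << (i % 8)  # least-significant-bit numbering
            data.append(v)
    return validity, data


def is_valid(validity, i):
    """Return True if slot i holds a non-null value."""
    return bool(validity[i // 8] >> (i % 8) & 1)


def column_sum(validity, data):
    """Scan the contiguous values buffer, skipping null slots."""
    return sum(v for i, v in enumerate(data) if is_valid(validity, i))


validity, data = build_int_column([1, None, 3, 4])
total = column_sum(validity, data)
```

Because all values of the column sit next to each other in one buffer, an aggregate such as the sum only has to scan that buffer, which is the property that makes columnar layouts attractive for analytic workloads.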

Interoperability

Arrow can be used with Apache Parquet, Apache Spark, NumPy, PySpark, pandas and other data processing libraries. The project includes native software libraries written in C, C++, C#, Go, Java, JavaScript, Julia, MATLAB, Python, R, Ruby, and Rust. Arrow allows for zero-copy reads and fast data access and interchange without serialization overhead between these languages and systems.[1]
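The general idea behind a zero-copy read can be sketched with Python's built-in buffer protocol. This is an analogy using only the standard library, not Arrow's implementation: a memoryview exposes the same underlying memory as the original array, so a consumer can read (or slice) the data without it being copied or serialized.

```python
from array import array

# A contiguous buffer of 64-bit floats, standing in for a column of values.
values = array("d", [1.5, 2.5, 3.5, 4.5])

# A memoryview shares the array's memory; no bytes are copied.
view = memoryview(values)

# Slicing a memoryview is also zero-copy: it is a window onto the same buffer.
window = view[1:3]

# A write through the original array is visible through the window,
# which shows that both names refer to the same memory.
values[1] = 20.0
shared = window.tolist()
```

Here `shared` reflects the updated value because `window` never held its own copy of the data.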

Applications

Arrow has been used in diverse domains, including analytics,[6] genomics,[7] [8] and cloud computing.[9]

Comparison to Apache Parquet and ORC

Apache Parquet and Apache ORC are popular examples of on-disk columnar data formats. Arrow is designed as a complement to these formats for processing data in-memory.[10] The hardware resource engineering trade-offs for in-memory processing differ from those associated with on-disk storage.[11] The Arrow and Parquet projects include libraries that allow for reading and writing data between the two formats.[12]

Governance

Apache Arrow was announced by The Apache Software Foundation on February 17, 2016,[13] with development led by a coalition of developers from other open source data analytics projects.[14] [15] [16] [17] The initial codebase and Java library were seeded by code from Apache Drill.

Notes and References

  1. Web site: Apache Arrow and Distributed Compute with Kubernetes. 13 Dec 2018.
  2. Web site: Apache Arrow: Lining Up The Ducks In A Row... Or Column. Tony. Baer. Seeking Alpha. 17 February 2016.
  3. Web site: Apache Arrow: The little data accelerator that could. 25 February 2019. Tony. Baer. ZDNet.
  4. Web site: Apache Arrow's Columnar Layouts of Data Could Accelerate Hadoop, Spark. 23 February 2016. The New Stack. Susan. Hall.
  5. Web site: Apache Arrow aims to speed access to big data. Yegulalp. Serdar. 27 February 2016. InfoWorld.
  6. Book: Dinsmore T.W. Disruptive Analytics. In-Memory Analytics: Satisfying the Need for Speed. 2016. Apress, Berkeley, CA. 978-1-4842-1312-4. 97–116. 10.1007/978-1-4842-1311-7_5.
  7. Versaci F, Pireddu L, Zanetti G. 2016. Scalable genomics: from raw data to aligned reads on Apache YARN. IEEE International Conference on Big Data. 1232–1241.
  8. Tanveer Ahmad. 2019. ArrowSAM: In-Memory Genomics Data Processing through Apache Arrow Framework. bioRxiv. 741843. 10.1101/741843. free.
  9. Maas M, Asanović K, Kubiatowicz J. 2017. Return of the runtimes: rethinking the language runtime system for the cloud 3.0 era. Proceedings of the 16th Workshop on Hot Topics in Operating Systems (ACM). 138–143. 10.1145/3102980.3103003. free.
  10. Web site: Le Dem. Julien. Apache Arrow and Apache Parquet: Why We Needed Different Projects for Columnar Data, On Disk and In-Memory. KDnuggets.
  11. Web site: Apache Arrow vs. Parquet and ORC: Do we really need a third Apache project for columnar data representation?. 2017-10-31.
  12. Web site: PyArrow: Reading and Writing the Apache Parquet Format.
  13. Web site: The Apache® Software Foundation Announces Apache Arrow™ as a Top-Level Project. 17 February 2016. The Apache Software Foundation Blog. live. https://web.archive.org/web/20160313195334/https://blogs.apache.org/foundation/entry/the_apache_software_foundation_announces87. 2016-03-13.
  14. Web site: Apache Foundation rushes out Apache Arrow as top-level project. Martin. Alexander J.. 17 February 2016. The Register.
  15. Web site: Big data gets a new open-source project, Apache Arrow: It offers performance improvements of more than 100x on analytical workloads, the foundation says.. 2016-02-17. 2018-01-31. 2016-07-27. https://web.archive.org/web/20160727221445/http://www.cio.com/article/3034279/big-data-gets-a-new-open-source-project-apache-arrow.html. dead.
  16. Web site: The first release of Apache Arrow. Le Dem. Julien. 28 November 2016. SD Times.
  17. Web site: Julien Le Dem on the Future of Column-Oriented Data Processing with Apache Arrow.