LakeFS explained

lakeFS
Author: Einat Orr, Oz Katz
Developer: Treeverse
Released: August 3, 2020
Latest Release Version: 0.104.0
Repo: https://github.com/treeverse/lakeFS
Programming Language: Go
Genre: Data version control
License: Apache 2.0

lakeFS is free and open-source software developed by Treeverse.[1] [2] It provides scalable, format-agnostic version control for data lakes,[3] using Git-like semantics to create and access different versions of the data.

First released in August 2020, its features include data version tracking, isolated development and testing, repository rollback, and continuous data integration and deployment.

History

lakeFS was developed by Oz Katz and Einat Orr in 2020.[4] [5]

Its first public release, v0.8.1, was published by Treeverse in August 2020. This version offered Git-like operations for any file format and compatibility with AWS S3 storage, and its versioning engine was based on multiversion concurrency control (MVCC).[6]

In 2021, the versioning engine was replaced by Graveler, which raised capacity to billions of objects with limited performance impact.[7]

In July 2021, Treeverse, the parent company of lakeFS, received an investment of $23 million in a Series A funding round, led by Dell Technologies Capital, Norwest Venture Partners, and Zeev Ventures.[8] [9]

In June 2022, lakeFS Cloud was introduced as a managed service to facilitate versioning in cloud data lakes. This service helps mitigate challenges related to tracking data changes and reverting to previous versions.

Software

Overview

lakeFS is a data versioning engine that manages data in a way similar to code. It offers Git-like operations such as branching, committing, merging, and reverting, which are used to manage data and its corresponding schema throughout the entire data life cycle.[10]
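The Git-like operations named above can be illustrated with a toy model. The following is a minimal Python sketch, assuming commits are immutable snapshots and branches are named pointers to them. The `ToyRepo` class and its method names are hypothetical, chosen only for illustration; they are not the lakeFS API and do not reflect its internal design.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRepo:
    """Hypothetical in-memory model; not lakeFS's API or storage format."""

    # Each commit is a snapshot mapping path -> content.
    commits: list = field(default_factory=lambda: [{}])
    # Each branch is a name pointing at a commit index.
    branches: dict = field(default_factory=lambda: {"main": 0})

    def branch(self, name, source):
        # A new branch starts at the same commit as its source.
        self.branches[name] = self.branches[source]

    def commit(self, branch, changes):
        # Record a new snapshot on top of the branch's current one.
        snapshot = dict(self.commits[self.branches[branch]])
        snapshot.update(changes)
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def merge(self, source, target):
        # Naive merge: entries from the source win (no conflict detection).
        return self.commit(target, self.commits[self.branches[source]])

    def revert(self, branch, commit_id):
        # Record a new commit whose contents equal an earlier snapshot.
        self.commits.append(dict(self.commits[commit_id]))
        self.branches[branch] = len(self.commits) - 1

    def read(self, branch):
        return self.commits[self.branches[branch]]


repo = ToyRepo()
first = repo.commit("main", {"users.csv": "v1"})
repo.branch("dev", "main")
repo.commit("dev", {"users.csv": "v2", "orders.csv": "v1"})
main_before_merge = dict(repo.read("main"))  # dev changes are isolated
repo.merge("dev", "main")
main_after_merge = dict(repo.read("main"))
repo.revert("main", first)
main_after_revert = dict(repo.read("main"))
```

In this sketch the changes made on `dev` stay invisible on `main` until the merge, and the revert restores the earlier state by adding a commit rather than deleting history.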

Features

lakeFS provides an interface to object stores such as S3 and to data management systems such as AWS Glue and Databricks. It delegates the actual data storage to backend services such as AWS and handles branch tracking itself. Multiple storage providers are supported.

lakeFS simplifies branch creation, tracking, and merging. Because experimental changes are isolated on a branch, testing does not require duplicating the complete dataset. Branches can be created, merged, or deleted as required, and lakeFS integrates with continuous integration and deployment pipelines via webhooks.

When working with arbitrary object storage, lakeFS accesses data blocks through API calls. It stores branching information as metadata, which keeps later operations on the objects efficient.
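The split between stored data and branch metadata can be sketched as follows. This is an illustrative Python model only, assuming a content-addressed store and one path-to-address table per branch. `BackingStore` and `BranchMetadata` are hypothetical names, and the sketch makes no claim about how lakeFS actually lays out its metadata.

```python
import hashlib


class BackingStore:
    """Stands in for an object store such as S3 (hypothetical, in-memory)."""

    def __init__(self):
        self.objects = {}  # address -> bytes

    def put(self, data: bytes) -> str:
        # Content addressing: identical data is stored only once.
        address = hashlib.sha256(data).hexdigest()
        self.objects[address] = data
        return address

    def get(self, address: str) -> bytes:
        return self.objects[address]


class BranchMetadata:
    """Tracks, per branch, which stored object each path points to."""

    def __init__(self, store):
        self.store = store
        self.branches = {"main": {}}  # branch -> {path: address}

    def upload(self, branch, path, data):
        self.branches[branch][path] = self.store.put(data)

    def create_branch(self, name, source):
        # Only the pointer table is copied; no stored object is duplicated.
        self.branches[name] = dict(self.branches[source])

    def read(self, branch, path):
        return self.store.get(self.branches[branch][path])


store = BackingStore()
meta = BranchMetadata(store)
meta.upload("main", "users.csv", b"id,name\n1,Ada\n")
meta.upload("main", "orders.csv", b"id,total\n1,9.5\n")
objects_before_branch = len(store.objects)

meta.create_branch("experiment", "main")
objects_after_branch = len(store.objects)  # unchanged by branching

meta.upload("experiment", "users.csv", b"id,name\n1,Grace\n")
objects_after_change = len(store.objects)  # only the changed object is added
main_content = meta.read("main", "users.csv")
```

Creating the `experiment` branch adds nothing to the store, and changing one file on it adds a single object while `main` keeps reading the original.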

lakeFS hooks

lakeFS hooks enable specific checks and validations before key lifecycle events. Unlike Git hooks, they call a remote server to run the tests. A hook can be configured to check table schemas when data is merged from development or test branches into production; if validation fails, the merge is blocked. Hooks therefore serve as a tool for enforcing schemas and applying standardized rules across data sources and producers.

Events that can trigger hooks include commits, merges, branch creation, and tag changes.[11] In a merge, the pre-merge hook runs on the source branch before the merge is finalized.
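The pre-merge schema check described above can be sketched in Python. This is a simplified illustration that runs the check in-process, whereas lakeFS hooks call a remote server. The `merge` function, the `MergeBlocked` exception, and the dictionary layout of branches and schemas are all assumptions made for this example, not part of lakeFS.

```python
class MergeBlocked(Exception):
    """Raised when a pre-merge hook rejects the merge (hypothetical)."""


def schema_compatible(source_schema, target_schema):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for column, dtype in target_schema.items():
        if column not in source_schema:
            problems.append(f"missing column: {column}")
        elif source_schema[column] != dtype:
            problems.append(
                f"type changed: {column} {dtype} -> {source_schema[column]}"
            )
    return problems


def merge(source, target, pre_merge_hooks):
    # Hooks inspect the source branch before the merge is finalized.
    for hook in pre_merge_hooks:
        problems = hook(source["schema"], target["schema"])
        if problems:
            raise MergeBlocked("; ".join(problems))
    target["schema"] = dict(source["schema"])
    return target


production = {"schema": {"id": "int", "email": "string"}}
dev_bad = {"schema": {"id": "string"}}
dev_ok = {"schema": {"id": "int", "email": "string", "country": "string"}}

try:
    merge(dev_bad, production, [schema_compatible])
    blocked_reason = None
except MergeBlocked as error:
    blocked_reason = str(error)  # production is left untouched

merge(dev_ok, production, [schema_compatible])  # adding a column passes
```

The rule chosen here (existing columns must keep their name and type, new columns are allowed) is only one possible policy; a real hook would encode whatever rules the data producers agree on.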

Notes and References

  1. Peter Wayner (June 27, 2022). "LakeFS brings branching to data lakes". VentureBeat. Retrieved June 27, 2023. Archived June 27, 2023, at https://web.archive.org/web/20230627121630/https://venturebeat.com/data-infrastructure/lakefs-brings-branching-to-data-lakes/
  2. James R. Borck (October 18, 2021). "The best open source software of 2021". InfoWorld. Retrieved July 18, 2023. Archived March 8, 2023, at https://web.archive.org/web/20230308023356/https://www.infoworld.com/article/3637038/the-best-open-source-software-of-2021.html#slide24
  3. Sean Michael Kerner (June 22, 2022). "Treeverse set to launch lakeFS cloud data lake service". Retrieved June 27, 2023. Archived June 27, 2023, at https://web.archive.org/web/20230627121630/https://www.techtarget.com/searchdatamanagement/news/252521898/Treeverse-set-to-launch-LakeFS-cloud-data-lake-service
  4. Niva Goldberg (July 29, 2021). "Israeli Startup Treeverse Secures $23 Million for Open Source Technology". Jewish Business News. Retrieved July 18, 2023. Archived July 8, 2023, at https://web.archive.org/web/20230708111042/https://jewishbusinessnews.com/2021/07/29/israeli-startup-treeverse-secures-23-million-for-open-source-technology/
  5. Paul Sawers (July 28, 2021). "Treeverse raises $23M to bring Git-like version control to data lakes". Retrieved June 27, 2023. Archived September 24, 2023, at https://web.archive.org/web/20230924070544/https://venturebeat.com/business/treeverse-raises-23m-to-bring-git-like-version-control-to-data-lakes/
  6. "v0.8.1". GitHub. Retrieved June 27, 2023. Archived June 28, 2024, at https://web.archive.org/web/20240628162346/https://github.com/treeverse/lakeFS/releases/tag/v0.8.1
  7. "lakeFS Architecture". Retrieved August 10, 2023. Archived August 10, 2023, at https://web.archive.org/web/20230810235357/https://docs.lakefs.io/understand/architecture.html
  8. Meir Orbach (July 28, 2021). "Treeverse raises $15 million Series A to leverage lakeFS". Retrieved July 18, 2023. Archived July 7, 2023, at https://web.archive.org/web/20230707222613/https://www.calcalistech.com/ctech/articles/0,7340,L-3913525,00.html
  9. Noga Martin (July 28, 2021). "Open source technology lakeFS secures $23M in funding". Retrieved August 10, 2023. Archived July 10, 2023, at https://web.archive.org/web/20230710134110/https://www.israelhayom.com/2021/07/28/open-source-technology-lakefs-secures-23m-in-funding/
  10. Yaniv Ben Hemo (February 3, 2023). "How To Avoid "Schema Drift"". Retrieved August 10, 2023. Archived August 10, 2023, at https://web.archive.org/web/20230810235359/https://dzone.com/articles/how-to-avoid-schema-drift
  11. Iddo Avneri (June 27, 2023). "Managing Schema Validation in a Data Lake Using Data Version Control". Retrieved August 11, 2023. Archived August 11, 2023, at https://web.archive.org/web/20230811162629/https://dzone.com/articles/managing-schema-validation-in-a-data-lake-using-da