What is a data lakehouse?

A data lakehouse is a data management architecture that combines the flexibility, cost-efficiency, and scale of data lakes with the data management features and ACID transactions of data warehouses. It is designed to provide broader analytics support than a data warehouse while avoiding the data management shortcomings that can limit the effectiveness of data lakes. By merging the data structures and management features of a warehouse with the low-cost storage used for data lakes, a lakehouse enables business intelligence (BI) and machine learning (ML) on all of an organization's data.
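
To make this concrete, here is a minimal sketch of the pattern, assuming the open-source deltalake Python package (Delta Lake is one of several open table formats, alongside Apache Iceberg and Apache Hudi, commonly used to layer ACID transactions over data lake storage). The local path and sample columns are illustrative stand-ins for cloud object storage and real data, not a definitive implementation.

```python
# A minimal sketch of the lakehouse idea: tabular data is written to an open
# table format on cheap file or object storage, gaining ACID transactions and
# versioned metadata without copying the data into a separate warehouse.
import pandas as pd
from deltalake import DeltaTable, write_deltalake

# A local path stands in here for cloud object storage (e.g. s3://bucket/sales).
table_path = "/tmp/lakehouse/sales"

# Each write is an atomic, versioned transaction recorded in the table's log.
orders = pd.DataFrame(
    {"order_id": [1, 2], "region": ["EU", "US"], "amount": [120.0, 80.0]}
)
write_deltalake(table_path, orders, mode="append")

more_orders = pd.DataFrame({"order_id": [3], "region": ["EU"], "amount": [200.0]})
write_deltalake(table_path, more_orders, mode="append")

# Readers always see a consistent snapshot of the table.
table = DeltaTable(table_path)
print(table.version())    # current table version (1 after the two appends)
print(table.to_pandas())  # full, consistent view of the data
print(table.history())    # audit trail of the transactions
```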

The key features of a data lakehouse include:

  • Flexibility: A data lakehouse can store a wide range of data from both internal and external sources and make it available to a variety of end users, including data scientists, data analysts, BI analysts and developers, business analysts, corporate and business executives, and marketing and sales teams.

  • Cost-efficiency: Data lakehouses are built on low-cost storage used for data lakes, which makes them more cost-effective than traditional data warehouses.

  • Scale: Data lakehouses can handle large amounts of data, making them suitable for big data applications.

  • ACID transactions: Data lakehouses support ACID transactions, which ensure data consistency and integrity.

  • Data management: Data lakehouses provide data management features, such as data quality and governance, that are lacking in data lakes.

  • Machine learning and business intelligence: Data lakehouses support machine learning and business intelligence alongside SQL analytics, real-time data applications, and data science, all against the same data (see the sketch after this list).
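
As a brief illustration of that last point, the sketch below runs a BI-style SQL aggregation and a small ML model against the same rows, assuming DuckDB and scikit-learn are available. The inline DataFrame is a stand-in for data read from a lakehouse table, such as the one in the earlier sketch.

```python
# A minimal sketch of BI-style SQL and ML running against the same lakehouse
# data, with no copy into a separate warehouse. The DataFrame below stands in
# for rows read from an open-format table (e.g. DeltaTable(path).to_pandas()).
import duckdb
import pandas as pd
from sklearn.linear_model import LinearRegression

orders = pd.DataFrame(
    {
        "region": ["EU", "US", "EU", "US", "EU"],
        "units": [3, 2, 5, 1, 4],
        "amount": [120.0, 80.0, 200.0, 40.0, 160.0],
    }
)

# BI / SQL analytics: DuckDB can query the in-memory DataFrame directly by name.
revenue_by_region = duckdb.sql(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region"
).df()
print(revenue_by_region)

# Machine learning: the same rows feed a simple regression model.
model = LinearRegression().fit(orders[["units"]], orders["amount"])
print(model.predict(pd.DataFrame({"units": [6]})))  # predicted amount for 6 units
```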

The term "data lakehouse" was first documented in 2017, but the concept was popularized by data platform vendor Databricks in 2020 and has since been embraced by various other vendors.