What is the Parquet file format?

Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient compression and encoding schemes for handling complex data in bulk, and it is intended as a common interchange format for both batch and interactive workloads. Some key characteristics of Parquet include:

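  • Columnar: Unlike row-based formats such as CSV or Avro, Parquet is column-oriented, meaning the values of each table column are stored next to each other rather than the values of each record. Because values of the same type and similar content sit together, they compress well, which saves storage space, and analytics queries can skip the columns they do not need, which speeds them up.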

  • Language Agnostic: The format is not tied to any one programming language, so developers can read and write the same Parquet file from different languages.

  • Self-Describing: In addition to data, a Parquet file contains metadata, including its schema and structure. Because each file carries the information needed to interpret its own records, readers do not need an external schema definition, which makes it easier to decouple the services that write, store, and read Parquet files.

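  • Flexible Compression Options: Compression is specified per column, so each column can use the codec (for example Snappy, Gzip, or Zstandard) and the encoding (for example dictionary or run-length encoding) that best suits its data.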

  • Support for Complex Data Types: Parquet was designed from the ground up to support advanced nested data structures, such as lists, maps, and structs.
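
To make the columnar and encoding points above more concrete, here is a minimal sketch in plain Python. It does not use a Parquet library and is not how Parquet is implemented internally; it only illustrates the ideas of pivoting records into columns, dictionary-encoding a repetitive column, and run-length-encoding the result. The sample records and all function names (`to_columns`, `dictionary_encode`, `run_length_encode`) are hypothetical and chosen purely for illustration.

```python
from itertools import groupby

# Hypothetical sample records in row-oriented form, as they would appear in a CSV file.
rows = [
    {"id": 1, "country": "US", "amount": 10.0},
    {"id": 2, "country": "US", "amount": 12.5},
    {"id": 3, "country": "DE", "amount": 7.0},
    {"id": 4, "country": "DE", "amount": 3.5},
]


def to_columns(records):
    """Pivot a list of records into one list of values per column."""
    return {name: [record[name] for record in records] for name in records[0]}


def dictionary_encode(values):
    """Replace each value with an index into a list of distinct values."""
    dictionary = []   # distinct values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    codes = []
    for value in values:
        if value not in positions:
            positions[value] = len(dictionary)
            dictionary.append(value)
        codes.append(positions[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


columns = to_columns(rows)

# A repetitive column shrinks to a small dictionary plus a few runs.
dictionary, codes = dictionary_encode(columns["country"])
runs = run_length_encode(codes)

# An aggregate over one column touches only that column's values.
total_amount = sum(columns["amount"])

print(columns["country"])  # ['US', 'US', 'DE', 'DE']
print(dictionary, codes)   # ['US', 'DE'] [0, 0, 1, 1]
print(runs)                # [(0, 2), (1, 2)]
print(total_amount)        # 33.0
```

In the row-oriented form, computing the total would require visiting every record and skipping over `id` and `country`; in the columnar form, only the `amount` list is touched. That is the same effect, in miniature, that lets a Parquet reader skip unneeded columns.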

A Parquet file consists of a header, one or more row groups, and a footer. Each row group holds a horizontal slice of the table's rows, and within a row group the values of each column are stored together. The footer holds the file's metadata, including the schema and the location of each column's data. This layout is best for queries that need only certain columns from a large table, because a reader can fetch just those columns, greatly reducing I/O. Parquet has reportedly helped its users reduce storage requirements by at least one-third on large datasets, while also greatly improving scan and deserialization times, which lowers overall costs.
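
The header and footer framing can be sketched in a few lines of plain Python. This relies on details from the Parquet format specification that the text above does not spell out: an unencrypted file starts and ends with the 4-byte magic marker `PAR1`, and the 4 bytes just before the trailing marker hold the length of the serialized file metadata as a little-endian integer. The function name `locate_footer` is hypothetical, and the byte blob below is a synthetic stand-in that has the right framing but is not a valid Parquet file. The sketch only locates the metadata; decoding it requires a real Parquet library.

```python
import io
import struct

MAGIC = b"PAR1"  # 4-byte marker at both the start and the end of the file


def locate_footer(f):
    """Return (offset, length) of the serialized file metadata.

    `f` is a seekable binary file object. Only the framing is checked;
    the metadata itself is not decoded.
    """
    f.seek(0, io.SEEK_END)
    size = f.tell()
    if size < 12:  # header magic + footer length + trailing magic
        raise ValueError("too small to be a Parquet file")

    f.seek(0)
    if f.read(4) != MAGIC:
        raise ValueError("missing header magic")

    # The last 8 bytes are: 4-byte little-endian metadata length, then magic.
    f.seek(size - 8)
    tail = f.read(8)
    if tail[4:] != MAGIC:
        raise ValueError("missing footer magic")

    (meta_len,) = struct.unpack("<I", tail[:4])
    meta_start = size - 8 - meta_len
    if meta_start < 4:
        raise ValueError("footer length exceeds file size")
    return meta_start, meta_len


# Synthetic stand-in with the same framing (NOT a valid Parquet file).
fake_row_groups = b"\x00" * 100
fake_metadata = b"\x01" * 20
blob = (
    MAGIC
    + fake_row_groups
    + fake_metadata
    + struct.pack("<I", len(fake_metadata))
    + MAGIC
)

offset, length = locate_footer(io.BytesIO(blob))
print(offset, length)  # 104 20
```

Because the metadata sits at the end of the file, a reader can seek to the footer first, learn where each column's data lives, and then read only the byte ranges it needs. With a real file, the same function could be called on a handle opened with `open(path, "rb")`.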