Parquet vs Avro Format
Avro is a row-based storage format for Hadoop.
Parquet is a column-based storage format for Hadoop.
If your use case typically scans or retrieves all of the fields in a row in each query, Avro is usually the best choice.
If your dataset has many columns and your use case typically involves working with a subset of those columns rather than with entire records, Parquet is optimized for that kind of workload.
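To make the trade-off concrete, here is a minimal sketch in Python, assuming the third-party fastavro and pyarrow packages; the event schema, record contents, and file names are invented for illustration. An Avro reader decodes whole records, while a Parquet reader can be asked for a single column.

    import fastavro
    import pyarrow as pa
    import pyarrow.parquet as pq

    records = [{"user_id": i, "event": "click", "ts": 1700000000 + i}
               for i in range(1000)]

    # Avro is row-based: every field of every record is written and
    # read together.
    schema = fastavro.parse_schema({
        "name": "Event", "type": "record",
        "fields": [{"name": "user_id", "type": "long"},
                   {"name": "event", "type": "string"},
                   {"name": "ts", "type": "long"}],
    })
    with open("events.avro", "wb") as out:
        fastavro.writer(out, schema, records)
    with open("events.avro", "rb") as src:
        for record in fastavro.reader(src):  # decodes all fields of each row
            pass

    # Parquet is column-based: a reader can fetch just the columns it needs.
    pq.write_table(pa.Table.from_pylist(records), "events.parquet")
    subset = pq.read_table("events.parquet", columns=["user_id"])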
A Parquet "file" is actually a collection of files stored in a single directory. The Parquet format offers features that make it the ideal choice for storing "big data" on distributed file systems.
For more information, see Apache Parquet.
Motivation
We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem.
Parquet is built from the ground up with complex nested data structures in mind and uses the record shredding and assembly algorithm described in the Dremel paper. We believe this approach is superior to simple flattening of nested namespaces.
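As a small illustration of record shredding and assembly in practice, here is a sketch using pyarrow (the schema and field names are invented): nested records are shredded into columns on write and reassembled on read, rather than flattened into dotted top-level names.

    import pyarrow as pa
    import pyarrow.parquet as pq

    rows = [
        {"id": 1, "user": {"name": "ada", "address": {"city": "London"}}},
        {"id": 2, "user": {"name": "bob", "address": {"city": "Boston"}}},
    ]
    table = pa.Table.from_pylist(rows)  # pyarrow infers nested struct types

    pq.write_table(table, "nested.parquet")
    back = pq.read_table("nested.parquet")
    print(back.schema)       # user: struct<name: string, address: struct<city: string>>
    print(back.to_pylist())  # the original nested records, reassembled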
Parquet is built to support very efficient compression and encoding schemes. Multiple projects have demonstrated the performance impact of applying the right compression and encoding scheme to the data. Parquet allows compression schemes to be specified on a per-column level and is future-proofed to allow adding more encodings as they are invented and implemented.
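For example, with pyarrow the codec can be set per column at write time; the column names and codec pairings below are illustrative only, not a recommendation.

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({
        "user_id": [1, 2, 3],
        "payload": ["a" * 100, "b" * 100, "c" * 100],
    })

    # A dict assigns each column its own codec; passing a single string
    # instead would apply one codec to every column.
    pq.write_table(
        table,
        "mixed.parquet",
        compression={"user_id": "snappy", "payload": "gzip"},
        use_dictionary=["user_id"],  # dictionary-encode only this column
    )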
Parquet is built to be used by anyone. The Hadoop ecosystem is rich with data processing frameworks, and we are not interested in playing favorites. We believe that an efficient, well-implemented columnar storage substrate should be useful to all frameworks without the cost of extensive and difficult-to-set-up dependencies.