Member-only story

Why is Parquet format so popular?

And how it compares to pandas DataFrame?

4 min readJan 27, 2023

Introduction
History
Why so popular?
Parquet vs pandas DataFrame
Conclusion
References

Introduction

Parquet is a popular columnar storage format for big data processing. It’s widely used in the Hadoop ecosystem and provides several benefits over traditional row-based storage formats like CSV and JSON. In this article, we’ll take a closer look at why Parquet is so popular and how it can help improve the performance and efficiency of big data processing tasks. Also, we’ll compare it to the popular pandas DataFrame.

History

The Parquet format was created in 2013 by the Apache Software Foundation’s Parquet project, a collaboration between Twitter, Cloudera, and other organizations. The goal of the project was to create a columnar storage format that was optimized for big data processing and could be used with a variety of data processing frameworks such as Hadoop, Impala, and Hive. The project was a response to the growing need for a more efficient way of storing and processing large datasets as data collection and storage was rapidly increasing. Since its release, the Parquet format has become one of the most popular storage formats for big data, widely used in the industry and adopted by many companies.

Why so popular?

The first reason why Parquet is so popular is its high compression and encoding capabilities. Parquet uses a technique called columnar storage, which organizes data in a way that allows for more efficient compression and encoding. This means that data stored in Parquet format takes up less space on disk and can be read and processed more quickly.

Another benefit of Parquet is its support for advanced data types and encoding schemes. Parquet supports a wide range of data types, including integers, floating-point numbers, strings, and timestamps. It also supports advanced encoding schemes like dictionary encoding, which can further reduce the size of the data on disk.

Parquet also supports advanced data querying capabilities. It provides a feature called predicate pushdown, which allows query engines to filter data on disk before reading it into…

Why is Parquet format so popular?

And how it compares to pandas DataFrame?

Table of contents

Introduction

History

Why so popular?

Written by Mori

Responses (1)