Why is Parquet format so popular?

And how it compares to pandas DataFrame?

Mori
4 min readJan 27, 2023

Table of contents

Introduction

Parquet is a popular columnar storage format for big data processing. It’s widely used in the Hadoop ecosystem and provides several benefits over traditional row-based storage formats like CSV and JSON. In this article, we’ll take a closer look at why Parquet is so popular and how it can help improve the performance and efficiency of big data processing tasks. Also, we’ll compare it to the popular pandas DataFrame.

History

The Parquet format was created in 2013 by the Apache Software Foundation’s Parquet project, a collaboration between Twitter, Cloudera, and other organizations. The goal of the project was to create a columnar storage format that was optimized for big data processing and could be used with a variety of data processing frameworks such as Hadoop, Impala, and Hive. The project was a response to the growing need for a more efficient way of storing and processing large datasets as data collection and storage was rapidly increasing. Since its release, the Parquet format has become one of the most popular storage formats for big data, widely used in the industry and adopted by many…

--

--

Mori

Date Scientist/Machine Learning Engineer | Passionate about solving real-world problems | PhD in Computer Science