pandas - Python Data Analysis Library

How fast can we process a CSV file

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Feb 22, 2024

Introduction Comma-separated values (CSV) are an extremely popular format to store tabular data because of their simplicity and how easy it is to write them. Unlike more efficient binary formats such as Parquet, the file can be read directly by a human, for example: name,age Maryam,23 Mèng yáo …
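As a rough illustration of how simple the format is to read, here is a minimal sketch using only Python's standard library `csv` module. It parses just the header and the first record visible in the excerpt above; it is not the approach benchmarked in the original post.

```python
import csv
import io

# Header row and first record as they appear in the excerpt above.
raw = "name,age\nMaryam,23\n"

# DictReader uses the first line as column names and yields one dict per row.
# Every value comes back as a string; CSV itself carries no type information.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    print(row["name"], int(row["age"]))
```

Because the values are plain text, any type conversion (such as `int(...)` here) is left to the reader of the file. That is part of what makes CSV easy to write but slower to process than a binary format.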

What's new in pandas 2.2

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Jan 25, 2024

The most interesting things about the new release. pandas 2.2 was released on January 22nd, 2024. Let’s take a look at the things this release introduces and how it will help us improve our pandas workloads. It includes a bunch of improvements that will improve the user …

Deep dive into pandas Copy-on-Write mode - part III

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Sep 28, 2023

Explaining the migration path for Copy-on-Write Introduction The introduction of Copy-on-Write (CoW) is a breaking change that will have some impact on existing pandas code. We will investigate how we can adapt our code to avoid errors when CoW is enabled by default. This is currently planned for the pandas …

What's new in pandas 2.1

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Sep 06, 2023

The most interesting things about the new release. pandas 2.1 was released on August 30th, 2023. Let’s take a look at the things this release introduces and how it will help us improve our pandas workloads. It includes a bunch of improvements and also a set of new …

Deep Dive into pandas Copy-on-Write Mode - Part II

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Aug 16, 2023

Explaining how Copy-on-Write optimizes performance Introduction The first post explained how the Copy-on-Write mechanism works. It highlights some areas where copies are introduced into the workflow. This post will focus on optimizations that ensure that this won't slow the average workflow down. We utilize a technique that pandas internals use …

Deep Dive into pandas Copy-on-Write Mode - Part I

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Aug 08, 2023

Explaining how Copy-on-Write works internally Introduction pandas 2.0 was released in early April and brought many improvements to the new Copy-on-Write (CoW) mode. The feature is expected to become the default in pandas 3.0, which is scheduled for April 2024 at the moment. There are no plans for …

pandas Internals Explained

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Jul 20, 2023

Explaining the pandas data model and its advantages Introduction pandas enables you to choose between different types of arrays to represent the data of your DataFrame. Historically, most DataFrames are backed by NumPy arrays. pandas 2.0 introduced the option to use PyArrow arrays as a storage format. There exists …

Dask performance benchmarking put to the test: Fixing a pandas bottleneck

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Jun 27, 2023

Getting notified of a significant performance regression the day before release sucks, but quickly identifying and resolving it feels great! We were getting set up at our booth at JupyterCon 2023 when we received a notification: An engineer on our team had spotted a significant performance regression in Dask. With …

Benchmarking pandas against Polars from a pandas PoV

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Jun 14, 2023

Or: How writing efficient pandas code matters Introduction I've regularly seen benchmarks that show how much faster Polars is compared to pandas. The fact that Polars is faster than pandas is not too surprising since it is multithreaded while pandas is mostly single-core. The big difference surprises me though. That's …

Utilizing PyArrow to improve pandas and Dask workflows

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Jun 04, 2023

Get the most out of PyArrow support in pandas and Dask right now Introduction This post investigates where we can use PyArrow to improve our pandas and Dask workflows right now. General support for PyArrow dtypes was added with pandas 2.0 to pandas and Dask. This solves a bunch …

Welcoming pandas 2.0

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Mar 22, 2023

How the API is changing and how to leverage new functionalities Introduction After 3 years of development, the second pandas 2.0 release candidate was released on the 16th of March. There are many new features in pandas 2.0, including improved extension array support, pyarrow support for DataFrames and …

pandas 2.0 and the Arrow revolution (part I)

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Feb 17, 2023

Introduction At the time of writing this post, we are in the process of releasing pandas 2.0. The project has a large number of users, and it's used in production quite widely by personal and corporate users. This large user base forces us to be conservative and makes us …

A guide to efficient data selection in pandas

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Feb 09, 2023

Improve performance when selecting data from a pandas object Introduction There exist different ways of selecting a subset of data from a pandas object. Depending on the specific operation, the result will either be a view pointing to the original data or a copy of the original data. This ties …

A solution for inconsistencies in indexing operations in pandas

Source: Patrick Hoefler - pandas | Author: Patrick Hoefler | Published: Dec 22, 2022

Get rid of annoying SettingWithCopyWarning messages Introduction Indexing operations in pandas are quite flexible and thus have many cases that can behave quite differently and therefore produce unexpected results. Additionally, it is hard to predict when a SettingWithCopyWarning is raised and what this means exactly. I’ll show a couple of …

pandas with hundreds of millions of rows

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Sep 21, 2022

The problem We want to find out which are the top 5 American airports with the largest average (mean) delay on domestic flights. Data We will be using the Data Expo 2009: Airline on time data dataset from the Harvard Dataverse. The data consists of flight arrival and departure details …
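The question posed above maps onto a group-by followed by a top-N selection. The sketch below shows that shape on a tiny made-up table. The column names `origin` and `dep_delay`, the airport codes, and all values are hypothetical placeholders, not the schema or contents of the Data Expo 2009 dataset, and this is not necessarily how the original post solves the problem at scale.

```python
import pandas as pd

# Hypothetical toy data: two flights per airport, delays in minutes.
flights = pd.DataFrame(
    {
        "origin": ["AAA", "AAA", "BBB", "BBB", "CCC", "CCC",
                   "DDD", "DDD", "EEE", "EEE", "FFF", "FFF"],
        "dep_delay": [10, 20, 0, 4, 30, 50, 5, 5, 12, 8, 60, 0],
    }
)

# Mean delay per airport, then keep the five largest means.
top = flights.groupby("origin")["dep_delay"].mean().nlargest(5)
print(top)
```

With hundreds of millions of rows, the same expression applies in principle. The practical difficulty lies in loading the data and fitting it in memory.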

On copies and views: getting rid of the SettingWithCopyWarning

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Apr 07, 2022

Pandas' current behavior on whether indexing returns a view or copy is confusing, even for experienced users. But it doesn’t have to be this way. We can make this aspect of pandas easier to grasp by simplifying the copy/view rules, and at the same time make pandas more memory-efficient. And get rid of the SettingWithCopyWarning.

Write up of the NumFOCUS grant to improve pandas benchmarks and diversity

Source: pandas blog | Author: pandas team | Published: Apr 01, 2022

By Lucy Jiménez and Dorothy Kabarozi B. We want to share our experience working on Improvements to the ASV benchmarking framework and diversity efforts sponsored by NumFOCUS to the pandas project. This grant focused on

pandas 1.0

Source: pandas blog | Author: pandas team | Published: Jan 29, 2020

Today pandas celebrates its 1.0.0 release. In many ways this is just a normal release with a host of new features, performance improvements, and bug fixes, which are documented in

Towards consistent missing value handling in Pandas

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Nov 30, 2019

This blogpost gives some background and motivation for my proposal on better missing value support in pandas, and the changes that have been merged in the development version (to be released in pandas 1.0): a new pd.NA scalar is introduced that can be used consistently across all data types.
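A minimal sketch of what "consistently across data types" looks like in practice, assuming a pandas version with the nullable `Int64` and `boolean` extension dtypes (1.0 or later). The variable names are illustrative only.

```python
import pandas as pd

# Nullable integer and boolean Series; None is stored as pd.NA in both.
ints = pd.Series([1, None, 3], dtype="Int64")
flags = pd.Series([True, None], dtype="boolean")

# The same scalar marks the missing entry regardless of dtype.
int_missing = ints.iloc[1]
flag_missing = flags.iloc[1]

# Reductions skip missing values by default.
total = ints.sum()

print(int_missing is pd.NA, flag_missing is pd.NA, total)
```

The integer column keeps an integer dtype despite the missing entry, instead of being cast to float to hold `NaN`.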

An update on the pandas documentation

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Nov 28, 2019

Some context This post is mainly a technical post on what's the status of the pandas documentation. But let me provide a bit of context on where this comes from. It's a personal opinion, but I think pandas is one of the clearest examples of how open source is transforming …

New pandas workflow

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Nov 17, 2019

Some exciting news. After some years of organizing sprints, and maintaining open source, I've been thinking about a more efficient workflow for projects with a high volume of activity, like pandas. An exaggerated example would be that I want to create 1,600 issues in pandas. One for each docstring of …

2019 NumFOCUS Awards and New Contributor Recognition

Source: pandas Archives - NumFOCUS | Author: Admin | Published: Nov 15, 2019

Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects

Source: pandas Archives - NumFOCUS | Author: Admin | Published: Nov 14, 2019

Highlights From The 2019 Pandas Hack

Source: pandas Archives - NumFOCUS | Author: nf-admin | Published: Sep 13, 2019

Dataframe summit @ EuroSciPy write up

Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Sep 10, 2019

EuroSciPy 2019 took place last week in Bilbao, Spain. This year we introduced the maintainers track, a room dedicated to discussions among maintainers. The idea is similar to the birds of a feather or unconference sessions of other conferences, but focussed on open source maintainers and contributors. And we scheduled …

2019 pandas user survey

Source: pandas blog | Author: pandas team | Published: Aug 22, 2019

Pandas recently conducted a user survey to help guide future development. Thanks to everyone who participated! This post presents the high-level results. This analysis and the raw data can be found on

GeoPandas now uses the pandas ExtensionArray interface

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Aug 13, 2019

Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and allows for more upcoming changes in the future. And given the invasive code changes under the hood, testing is very welcome!

pandas: The two cultures

Source: datapythonista blog - pandas | Author: Marc | Published: Jul 22, 2019

Leo Breiman was a distinguished statistician at UC Berkeley, known among other things for his major contributions to CART (decision trees), and ensemble techniques, mainly bootstrap aggregation. Combining both, he was able to define one of the most popular machine learning models even today (18 years after the publication of …

pandas extension arrays

Source: pandas blog | Author: pandas team | Published: Jan 04, 2019

Extensibility was a major theme in pandas development over the last couple of releases. This post introduces the pandas extension array interface: the motivation behind it and how it might affect you

Inaugural NumFOCUS Awards and New Contributor Recognition

Source: pandas Archives - NumFOCUS | Author: Admin | Published: Sep 27, 2018

The Worldwide Pandas Documentation Sprint: A Closer Look

Source: pandas Archives - NumFOCUS | Author: Admin | Published: Mar 27, 2018

#pandasSprint write-up

Source: datapythonista blog - pandas | Author: Marc | Published: Mar 22, 2018

#pandasSprint took place this past 10th of March. To the best of my knowledge, it was an unprecedented kind of event, where around 500 people worked together to improve the documentation of the popular pandas library. As one of the people involved in the organization of the event, I wanted to write …

Activity on the pandas github repo during the March 10 documentation sprint

Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Mar 13, 2018

Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!

NumFOCUS Announces New Fiscally Sponsored Project: pandas

Source: pandas Archives - NumFOCUS | Author: nf-admin | Published: Oct 09, 2015

By Gina Helfrich. NumFOCUS is pleased to announce pandas as our newest fiscally sponsored project. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas enables users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like […]