Python operators and how they affect pandas
Source: datapythonista blog - pandas | Author: Marc Garcia | Published: May 22, 2020
Python is today among the most popular programming languages. And in my opinion, there is one main reason for it: readability. A clear example of how Python was designed to be readable is the following: if 'melon' not in ('apple', 'coconut'): print('it is missing!') And compare this to, for example, JavaScript: var fruits =
Read more
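The membership test quoted in the entry above can be tried directly. This is a minimal sketch, not code from the post: the FruitBasket class is hypothetical and only illustrates that `in` and `not in` work on any object that defines `__contains__`.

```python
# Hypothetical container, used only for illustration.
class FruitBasket:
    def __init__(self, *fruits):
        self._fruits = set(fruits)

    def __contains__(self, item):
        # Python calls this for both `in` and `not in`.
        return item in self._fruits


basket = FruitBasket("apple", "coconut")

# Reads almost like the English sentence it expresses.
missing = "melon" not in basket
if missing:
    print("it is missing!")
```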
Maintaining Performance
Source: datas-frame | Author: Tom Augspurger | Published: Apr 01, 2020
As pandas' documentation claims: pandas provides high-performance data structures. But how do we verify that the claim is correct? And how do we ensure that it stays correct over many releases? This post describes: pandas' current setup for monitoring performance; my personal debugging strategy for understanding and fixing performance regressions …
Read more
pandas 1.0
Source: pandas blog | Author: pandas team | Published: Jan 29, 2020
Today pandas celebrates its 1.0.0 release. In many ways this is just a normal release with a host of new features, performance improvements, and bug fixes, which are documented in
Read more
Towards consistent missing value handling in Pandas
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Nov 30, 2019
This blogpost gives some background and motivation for my proposal on better missing value support in pandas, and the changes that have been merged in the development version (to be released in pandas 1.0): a new pd.NA scalar is introduced that can be used consistently across all data types.
Read more
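To make the pd.NA scalar from the entry above concrete, here is a small sketch assuming pandas 1.0 or later. It uses the nullable "Int64" and "boolean" dtypes, and the variable names are illustrative only.

```python
import pandas as pd

# Nullable dtypes store missing entries as the same pd.NA scalar.
ints = pd.Series([1, None], dtype="Int64")
flags = pd.Series([True, None], dtype="boolean")

# Comparisons with pd.NA propagate the missing value ...
eq_result = pd.NA == 1

# ... while logical operations follow three-valued logic:
# the result is known whenever the other operand decides it.
or_result = pd.NA | True
and_result = pd.NA & False

print(ints[1], flags[1])
print(eq_result, or_result, and_result)
```

Because one scalar is shared across dtypes, a check such as `value is pd.NA` behaves the same for integer and boolean columns.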
An update on the pandas documentation
Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Nov 28, 2019
Some context: This post is mainly a technical post on the status of the pandas documentation. But let me provide a bit of context on where this comes from. It's a personal opinion, but I think pandas is one of the clearest examples of how open source is transforming …
Read more
New pandas workflow
Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Nov 17, 2019
Some exciting news. After some years of organizing sprints and maintaining open source, I've been thinking about a more efficient workflow for projects with a high volume of activity, like pandas. An exaggerated example would be that I want to create 1,600 issues in pandas. One for each docstring of …
Read more
2019 NumFOCUS Awards and New Contributor Recognition
Source: pandas | NumFOCUS | Author: Admin | Published: Nov 15, 2019
The post 2019 NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.
Read more
Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects
Source: pandas | NumFOCUS | Author: Admin | Published: Nov 14, 2019
The post Chan Zuckerberg Initiative Funds Maintenance of NumFOCUS Projects appeared first on NumFOCUS.
Read more
Highlights From The 2019 Pandas Hack
Source: pandas | NumFOCUS | Author: nf-admin | Published: Sep 13, 2019
The post Highlights From The 2019 Pandas Hack appeared first on NumFOCUS.
Read more
Dataframe summit @ EuroSciPy write up
Source: datapythonista blog - pandas | Author: Marc Garcia | Published: Sep 10, 2019
EuroSciPy 2019 took place last week in Bilbao, Spain. This year we introduced the maintainers track, a room dedicated to discussions among maintainers. The idea is similar to the birds-of-a-feather or unconference sessions of other conferences, but focussed on open source maintainers and contributors. And we scheduled …
Read more
2019 pandas user survey
Source: pandas blog | Author: pandas team | Published: Aug 22, 2019
Pandas recently conducted a user survey to help guide future development. Thanks to everyone who participated! This post presents the high-level results. This analysis and the raw data can be found on
Read more
GeoPandas now uses the pandas ExtensionArray interface
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Aug 13, 2019
Short summary: the upcoming 0.6.0 release of GeoPandas will feature a refactor based on the pandas ExtensionArray interface. Although this change should keep the user interface mostly stable, it enables more robust integration with pandas and allows for more upcoming changes in the future. And given the invasive code changes under the hood, testing is very welcome!
Read more
pandas: The two cultures
Source: datapythonista blog - pandas | Author: Marc | Published: Jul 22, 2019
Leo Breiman was a distinguished statistician at UC Berkeley, known among other things for his major contributions to CART (decision trees), and ensemble techniques, mainly bootstrap aggregation. Combining both, he was able to define one of the most popular machine learning models even today (18 years after the publication of …
Read more
pandas + binder
Source: datas-frame | Author: Tom Augspurger | Published: Jul 21, 2019
This post describes the start of a journey to get pandas' documentation running on Binder. The end result is this nice button: For a while now I've been jealous of Dask's examples repository. That's a repository containing a collection of Jupyter notebooks demonstrating Dask in action. It stitches together some …
Read more
pandas extension arrays
Source: pandas blog | Author: pandas team | Published: Jan 04, 2019
Extensibility was a major theme in pandas development over the last couple of releases. This post introduces the pandas extension array interface: the motivation behind it and how it might affect you
Read more
Inaugural NumFOCUS Awards and New Contributor Recognition
Source: pandas | NumFOCUS | Author: Admin | Published: Sep 27, 2018
The post Inaugural NumFOCUS Awards and New Contributor Recognition appeared first on NumFOCUS.
Read more
Tabular Data in Scikit-Learn and Dask-ML
Source: datas-frame | Author: Tom Augspurger | Published: Sep 17, 2018
Scikit-Learn 0.20.0 will contain some nice new features for working with tabular data. This blogpost will introduce those improvements with a small demo. We'll then see how Dask-ML was able to piggyback on the work done by scikit-learn to offer a version that works well with Dask Arrays …
Read more
Distributed Auto-ML with TPOT with Dask
Source: datas-frame | Author: Tom Augspurger | Published: Aug 30, 2018
This work is supported by Anaconda Inc. This post describes a recent improvement made to TPOT. TPOT is an automated machine learning library for Python. It does some feature engineering and hyper-parameter optimization for you. TPOT uses genetic algorithms to evaluate which models are performing well and how to choose …
Read more
Moral Philosophy for pandas or: What is .values?
Source: datas-frame | Author: Tom Augspurger | Published: Aug 14, 2018
The other day, I put up a Twitter poll asking a simple question: What's the type of series.values? "Pop Quiz! What are the possible results for the following: >>> type(pandas.Series.values)" (Tom Augspurger, @TomAugspurger, August 6, 2018). I was a bit limited for space, so I'll expand on …
Read more
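The question in the entry above can be explored with a short sketch. It assumes a recent pandas with NumPy installed, and it covers only two of the possible answers (a plain numeric Series and a categorical one); the variable names are illustrative.

```python
import numpy as np
import pandas as pd

numeric = pd.Series([1, 2, 3])
categorical = pd.Series(pd.Categorical(["a", "b", "a"]))

# .values returns different container types depending on the dtype.
numeric_values = numeric.values
categorical_values = categorical.values

print(type(numeric_values).__name__)
print(type(categorical_values).__name__)

# .to_numpy() is the explicit way to ask for a NumPy array.
as_numpy = categorical.to_numpy()
print(type(as_numpy).__name__)
```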
Modern Pandas (Part 8): Scaling
Source: datas-frame | Author: Tom Augspurger | Published: Apr 23, 2018
This is part 8 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). As I sit down to write this, the third-most popular pandas question on StackOverflow covers how to use pandas for large datasets. This is in …
Read more
The Worldwide Pandas Documentation Sprint: A Closer Look
Source: pandas | NumFOCUS | Author: Admin | Published: Mar 27, 2018
The post The Worldwide Pandas Documentation Sprint: A Closer Look appeared first on NumFOCUS.
Read more
#pandasSprint write-up
Source: datapythonista blog - pandas | Author: Marc | Published: Mar 22, 2018
#pandasSprint took place this past 10th of March. To the best of my knowledge, it was an unprecedented kind of event, where around 500 people worked together on improving the documentation of the popular pandas library. As one of the people involved in the organization of the event, I wanted to write …
Read more
Activity on the pandas github repo during the March 10 documentation sprint
Source: Joris Van den Bossche - pandas | Author: Joris Van den Bossche | Published: Mar 13, 2018
Last weekend, Marc Garcia and many others organised a world-wide pandas documentation sprint (https://python-sprints.github.io/pandas/). The goal was to improve the pandas API documentation, and I have to say, it was a great success!
Read more
dask-ml 0.4.1 Released
Source: datas-frame | Author: Tom Augspurger | Published: Feb 13, 2018
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. dask-ml 0.4.1 was released today with a few enhancements. See the changelog for all the changes from 0.4.0. Conda packages are available on conda-forge: $ conda install -c conda-forge dask-ml …
Read more
Extension Arrays for Pandas
Source: datas-frame | Author: Tom Augspurger | Published: Feb 12, 2018
This is a status update on some enhancements for pandas. The goal of the work is to store things that are sufficiently array-like in a pandas DataFrame, even if they aren't a regular NumPy array. Pandas already does this in a few places for some blessed types (like Categorical); we'd …
Read more
Easy distributed training with Joblib and dask
Source: datas-frame | Author: Tom Augspurger | Published: Feb 05, 2018
This work is supported by Anaconda Inc and the Data Driven Discovery Initiative from the Moore Foundation. This past week, I had a chance to visit some of the scikit-learn developers at Inria in Paris. It was a fun and productive week, and I'm thankful to them for hosting me …
Read more
dask-ml
Source: datas-frame | Author: Tom Augspurger | Published: Oct 26, 2017
Today we released the first version of dask-ml, a library for parallel and distributed machine learning. Read the documentation or install it with: pip install dask-ml. Packages are currently building for conda-forge, and will be up later today: conda install -c conda-forge dask-ml. The Goals: dask is, to quote the …
Read more
Scalable Machine Learning (Part 2): Partial Fit
Source: datas-frame | Author: Tom Augspurger | Published: Sep 15, 2017
This work is supported by Anaconda, Inc. and the Data Driven Discovery Initiative from the Moore Foundation. This is part two of my series on scalable machine learning (Small Fit, Big Predict; Scikit-Learn Partial Fit). You can download a notebook of this post here. Scikit-learn supports out-of-core learning (fitting a …
Read more
Scalable Machine Learning (Part 1)
Source: datas-frame | Author: Tom Augspurger | Published: Sep 11, 2017
This work is supported by Anaconda Inc. and the Data Driven Discovery Initiative from the Moore Foundation. Anaconda is interested in scaling the scientific python ecosystem. My current focus is on out-of-core, parallel, and distributed machine learning. This series of posts will introduce those concepts, explore what we have available …
Read more
Introducing Stitch
Source: datas-frame | Author: Tom Augspurger | Published: Aug 30, 2016
Today I released stitch into the wild. If you haven't yet, check out the examples page to see an example of what stitch does, and the Github repo for how to install. I'm using this post to explain why I wrote stitch, and some issues it tries to solve. Why …
Read more
Modern Pandas (Part 7): Timeseries
Source: datas-frame | Author: Tom Augspurger | Published: May 13, 2016
This is part 7 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Timeseries: Pandas started out in the financial world, so naturally it has strong timeseries support. The first half of this post will look at pandas' …
Read more
Modern Pandas (Part 6): Visualization
Source: datas-frame | Author: Tom Augspurger | Published: Apr 28, 2016
This is part 6 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Visualization and Exploratory Analysis: A few weeks ago, the R community went through some hand-wringing about plotting packages. For outsiders (like me) the details aren't …
Read more
Modern Pandas (Part 5): Tidy Data
Source: datas-frame | Author: Tom Augspurger | Published: Apr 22, 2016
This is part 5 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Reshaping & Tidy Data: "Structuring datasets to facilitate analysis" (Wickham 2014). So, you've sat down to analyze a new dataset. What do you do first? In …
Read more
Modern Pandas (Part 3): Indexes
Source: datas-frame | Author: Tom Augspurger | Published: Apr 11, 2016
This is part 3 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Indexes can be a difficult concept to grasp at first. I suspect this is partly because they're somewhat peculiar to pandas. These aren't like the …
Read more
Modern Pandas (Part 4): Performance
Source: datas-frame | Author: Tom Augspurger | Published: Apr 08, 2016
This is part 4 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Wes McKinney, the creator of pandas, is kind of obsessed with performance. From micro-optimizations for element access, to embedding a fast hash table inside pandas …
Read more
Modern Pandas (Part 2): Method Chaining
Source: datas-frame | Author: Tom Augspurger | Published: Apr 04, 2016
This is part 2 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Method Chaining: Method chaining, where you call methods on an object one after another, is in vogue at the moment. It's always been a style …
Read more
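As a rough illustration of the style described in the entry above, the sketch below chains several DataFrame methods into one expression. The data and column names are made up for this example and do not come from the post.

```python
import pandas as pd

# Hypothetical readings, invented for illustration.
df = pd.DataFrame(
    {
        "city": ["A", "B", "A", "B"],
        "temp_c": [10.0, 20.0, 14.0, 22.0],
    }
)

# Each method returns a new object, so the steps read top to bottom.
result = (
    df.assign(temp_f=lambda d: d["temp_c"] * 9 / 5 + 32)  # add a derived column
    .query("temp_f > 55")                                  # keep the warmer rows
    .groupby("city")["temp_f"]                             # group by city
    .mean()                                                # average per city
)

print(result)
```

The same steps could be written with intermediate variables; chaining mainly avoids naming throwaway results.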
Modern Pandas (Part 1)
Source: datas-frame | Author: Tom Augspurger | Published: Mar 21, 2016
This is part 1 in my series on writing modern idiomatic pandas (Modern Pandas, Method Chaining, Indexes, Fast Pandas, Tidy Data, Visualization, Time Series, Scaling). Effective Pandas. Introduction: This series is about how to make effective use of pandas, a data analysis library for the Python programming language. It's targeted …
Read more
Why pandas users should be excited about Apache Arrow
Source: Wes McKinney - pandas | Author: Wes McKinney | Published: Feb 22, 2016
I'm super excited to be involved in the new open source Apache Arrow community initiative. For Python (and R, too!), it will help enable: substantially improved data access speeds; closer to native performance Python extensions for big data systems like Apache Spark; new in-memory analytics functionality for nested / JSON-like data. There's plenty of places you can learn more about Arrow, but this post is about how it's specifically relevant to pandas users. See, for example: "Python and Hadoop: A State of the Union"; "Introducing Apache Arrow: A Fast, Interoperable In-Memory Columnar Data Structure Standard"; "Introducing Apache Arrow: Columnar In-Memory Analytics".
Read more
NumFOCUS Announces New Fiscally Sponsored Project: pandas
Source: pandas | NumFOCUS | Author: nf-admin | Published: Oct 09, 2015
by Gina Helfrich NumFOCUS is pleased to announce pandas as our newest fiscally sponsored project. pandas is an open source, BSD-licensed library providing high-performance, easy-to-use data structures and data analysis tools for the Python programming language. pandas enables users to carry out their entire data analysis workflow in Python without having to switch to a more domain-specific language like […] The post NumFOCUS Announces New Fiscally Sponsored Project: pandas appeared first on NumFOCUS.
Read more
dplyr and pandas
Source: datas-frame | Author: Tom Augspurger | Published: Oct 16, 2014
This notebook compares pandas and dplyr. The comparison is just on syntax (verbiage), not performance. Whether you're an R user looking to switch to pandas (or the other way around), I hope this guide will help ease the transition. We'll work through the introductory dplyr vignette to analyze some flight …
Read more
Practical Pandas Part 3 - Exploratory Data Analysis
Source: datas-frame | Author: Tom Augspurger | Published: Sep 16, 2014
Welcome back. As a reminder: in part 1 we got a dataset with my cycling data from last year merged and stored in an HDF5 store; in part 2 we did some cleaning and augmented the cycling data with data from http://forecast.io. You can find the full source code …
Read more
Practical Pandas Part 2 - More Tidying, More Data, and Merging
Source: datas-frame | Author: Tom Augspurger | Published: Sep 04, 2014
This is Part 2 in the Practical Pandas Series, where I work through a data analysis problem from start to finish. It's a misconception that we can cleanly separate the data analysis pipeline into a linear sequence of steps from: data acquisition; data tidying; exploratory analysis; model building; production. As …
Read more
Practical Pandas Part 1 - Reading the Data
Source: datas-frame | Author: Tom Augspurger | Published: Aug 26, 2014
This is the first post in a series where I'll show how I use pandas on real-world datasets. For this post, we'll look at data I collected with Cyclemeter on my daily bike ride to and from school last year. I had to manually start and stop the tracking at …
Read more
Using Python to tackle the CPS (Part 4)
Source: datas-frame | Author: Tom Augspurger | Published: May 19, 2014
Last time, we got to where we'd like to have started: One file per month, with each month laid out the same. As a reminder, the CPS interviews households 8 times over the course of 16 months. They're interviewed for 4 months, take 8 months off, and are interviewed four …
Read more
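The interview schedule mentioned in the entry above (8 interviews over 16 months, with an 8-month break after the first 4) can be written out as month offsets. This is a sketch under the assumption that the second block also lasts 4 months, which follows from the stated totals; the function name and parameters are hypothetical.

```python
def interview_offsets(first_block=4, break_months=8, second_block=4):
    """Return month offsets (0 = first interview month) at which a household is interviewed."""
    first = list(range(first_block))
    second_start = first_block + break_months
    second = list(range(second_start, second_start + second_block))
    return first + second


offsets = interview_offsets()
print(offsets)
```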
Using Python to tackle the CPS (Part 3)
Source: datas-frame | Author: Tom Augspurger | Published: May 19, 2014
In part 2 of this series, we set the stage to parse the data files themselves. As a reminder, we have a dictionary that looks like

          id  length  start  end
0     HRHHID      15      1   15
1    HRMONTH       2     16   17
2    HRYEAR4       4     18   21
3   HURESPLI       2     22   23
…
Read more
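A data dictionary like the one in the entry above is enough to slice a fixed-width record. The sketch below uses plain Python and assumes the start and end columns are 1-based and inclusive, which matches the lengths shown. The sample line is invented for illustration, and this is not the parser from the post.

```python
# (name, start, end) taken from the dictionary excerpt above;
# positions are assumed to be 1-based and inclusive.
FIELDS = [
    ("HRHHID", 1, 15),
    ("HRMONTH", 16, 17),
    ("HRYEAR4", 18, 21),
    ("HURESPLI", 22, 23),
]


def parse_line(line, fields=FIELDS):
    """Slice one fixed-width record into a dict of raw string values."""
    return {name: line[start - 1:end] for name, start, end in fields}


# Hypothetical record: 15-character household id, month, year, respondent line.
sample = "123456789012345" + "05" + "2014" + "01"
record = parse_line(sample)
print(record)
```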
Tidy Data in Action
Source: datas-frame | Author: Tom Augspurger | Published: Mar 27, 2014
Hadley Wickham wrote a famous paper (for a certain definition of famous) about the importance of tidy data when doing data analysis. I want to talk a bit about that, using an example from a StackOverflow post, with a solution using pandas. The principles of tidy data aren't language specific …
Read more
Organizing Papers
Source: datas-frame | Author: Tom Augspurger | Published: Feb 13, 2014
As a graduate student, you read a lot of journal articles... a lot. With the material in the articles being as difficult as it is, I didn't want to worry about organizing everything as well. That's why I wrote this script to help (I may have also been procrastinating from …
Read more
Using Python to tackle the CPS (Part 2)
Source: datas-frame | Author: Tom Augspurger | Published: Feb 04, 2014
Last time, we used Python to fetch some data from the Current Population Survey. Today, we'll work on parsing the files we just downloaded. We downloaded two types of files last time: CPS monthly tables, a fixed-width format text file with the actual data; Data Dictionaries, a text file describing …
Read more
Using Python to tackle the CPS
Source: datas-frame | Author: Tom Augspurger | Published: Jan 27, 2014
The Current Population Survey is an important source of data for economists. Its modern form took shape in the 1970s and unfortunately the data format and distribution show their age. Some centers like IPUMS have attempted to put a nicer face on accessing the data, but they haven't done everything …
Read more