PDEP-9: Allow third-party projects to register pandas connectors with a standard API
This document proposes that third-party projects implementing I/O or memory connectors to pandas can register them using Python's entrypoint system, and make them available to pandas users with the usual pandas I/O interface. Packages independent of pandas could, for instance, implement readers from DuckDB and writers to Delta Lake, and when installed in the user environment the user would be able to use them as if they were implemented in pandas. For example:
    import pandas

    pandas.load_io_plugins()

    df = pandas.DataFrame.read_duckdb("SELECT * FROM 'my_dataset.parquet';")

    df.to_deltalake('/delta/my_dataset')
This would make it easy to extend the existing set of connectors, adding support for new formats and database engines, data lake technologies, out-of-core connectors, the new ADBC interface, and others, while at the same time reducing the maintenance cost of the pandas code base.
pandas supports importing and exporting data from different formats using
I/O connectors, currently implemented in
pandas/io, as well as connectors
to in-memory structures like Python structures or other library formats.
In many cases, those connectors wrap an existing Python library, while in
some others, pandas implements the logic to read and write to a particular
format.
In some cases, different engines exist for the same format. The API to use
those connectors is
pandas.read_<format>(engine='<engine-name>', ...) to
import data, and
DataFrame.to_<format>(engine='<engine-name>', ...) to
export data.
For objects exported to memory (like a Python dict) the API is the same as
for I/O: DataFrame.to_<format>(...). For formats imported from objects in
memory, the API is different, using the
from_ prefix instead of read_.
In some cases, the pandas API provides
DataFrame.to_* methods that are not
used to export the data to a disk or memory object, but instead to transform
the index of a DataFrame.
Dependencies of the connectors are not loaded by default, and are
imported when the connector is used. If the dependencies are not installed,
an ImportError is raised.
    >>> pandas.read_gbq(query)
    Traceback (most recent call last):
      ...
    ImportError: Missing optional dependency 'pandas-gbq'.
    pandas-gbq is required to load data from Google BigQuery.
    See the docs: https://pandas-gbq.readthedocs.io.
    Use pip or conda to install pandas-gbq.
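The lazy-import behaviour described above can be sketched as follows. This is a minimal illustration and not pandas' actual implementation: the helper `import_optional`, the reader `read_example` and the package name `hypothetical_backend_pkg` are all invented for this example.

```python
import importlib


def import_optional(name, purpose):
    """Import an optional dependency only when a connector needs it.

    Hypothetical helper for illustration; pandas has its own internal
    equivalent.
    """
    try:
        return importlib.import_module(name)
    except ImportError as exc:
        raise ImportError(
            f"Missing optional dependency '{name}'. "
            f"{name} is required to {purpose}. "
            f"Use pip or conda to install {name}."
        ) from exc


def read_example(query):
    # The dependency is imported here, at call time, so importing the
    # module that defines read_example never requires the backend.
    backend = import_optional("hypothetical_backend_pkg", "load example data")
    return backend.read(query)
```

Calling `read_example(...)` in an environment without the backend raises an ImportError that names the missing package, while defining or importing the reader itself costs nothing.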
The list of formats can be found in the IO guide. A more detailed table, including in-memory objects and I/O connectors in the DataFrame styler, is presented next:
At the time of writing this document, the
io/ module contains
close to 100,000 lines of Python, C and Cython code.
There are no objective criteria for when a format is included in pandas, and the list above is mostly the result of a developer being interested in implementing the connectors for a certain format in pandas.
The number of existing formats available for data that can be processed with pandas is constantly increasing, and it is difficult for pandas to keep up to date even with popular formats. It possibly makes sense to have connectors to PyArrow, PySpark, Iceberg, DuckDB, Hive, Polars, and many others.
At the same time, some of the formats are not frequently used as shown in the 2019 user survey. Those less popular formats include SPSS, SAS, Google BigQuery and Stata. Note that only I/O formats (and not memory formats like records or xarray) were included in the survey.
The maintenance cost of supporting all formats lies not only in maintaining the code and reviewing pull requests. It also includes a significant amount of time spent on CI systems installing dependencies, compiling code, running tests, etc.
In some cases, the main maintainers of some of the connectors are not part of the pandas core development team, but people specialized in one of the formats.
While the current pandas approach has worked reasonably well, it is difficult to find a stable solution where the maintenance burden on pandas stays manageable while users can still interact, in an easy and intuitive way, with all the formats and representations they are interested in.
Third-party packages are already able to implement connectors to pandas, but there are some limitations to it:
- Given the large number of formats supported by pandas itself, third-party connectors are likely seen as second-class citizens, not important enough to be used, or not well supported.
- There is no standard API for external I/O connectors, and users need
to learn each of them individually. Since the pandas I/O API is inconsistent
in using read/to instead of read/write or from/to, developers in many cases
ignore the convention. Also, even if developers follow the pandas convention,
the namespaces would be different, since developers of connectors will rarely
monkeypatch their functions into the pandas namespaces.
- Method chaining is not possible with third-party I/O connectors to export
data, unless authors monkeypatch the
DataFrame class, which should not be encouraged.
This document proposes to open the development of pandas I/O connectors to third-party libraries in a standard way that overcomes those limitations.
Implementing this proposal would not require major changes to pandas, and the API defined next would be used.
Users will be able to install third-party packages implementing pandas
connectors using the standard packaging tools (pip, conda, etc.). These
connectors should implement entrypoints that pandas will use to
automatically create the corresponding functions and methods
pandas.read_*, pandas.DataFrame.to_* and
pandas.Series.to_*. Arbitrary function or
method names will not be created by this interface; only the
read_* and to_* patterns will be allowed.
By simply installing the appropriate packages and calling the function
pandas.load_io_plugins() users will be able to use code like this:
    import pandas

    pandas.load_io_plugins()

    df = pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")

    df.to_hive(hive_conn, "hive_table")
This API allows for method chaining:
    (pandas.read_duckdb("SELECT * FROM 'dataset.parquet';")
           .to_hive(hive_conn, "hive_table"))
The total number of I/O functions and methods is expected to be small, as users in general use only a small subset of formats. The number could actually be reduced from the current state if the less popular formats (such as SAS, SPSS, BigQuery, etc.) are removed from the pandas core into third-party packages. Moving these connectors is not part of this proposal, and could be discussed later in a separate proposal.
Third-party packages would implement entrypoints
to define the connectors that they implement, under a group
dataframe.io.
For example, a hypothetical project
pandas_duckdb implementing a
read_duckdb function could use
pyproject.toml to define the following entry point:
    [project.entry-points."dataframe.io"]
    reader_duckdb = "pandas_duckdb:read_duckdb"
When the user calls
pandas.load_io_plugins(), it would read the entrypoint registry for the
dataframe.io group, and would dynamically create methods in the
pandas, pandas.DataFrame and pandas.Series namespaces for them. Only
entrypoints with a name starting with
reader_ or writer_ would be processed by pandas, and the functions
registered in the entrypoint would be made available to pandas users in the corresponding
pandas namespaces. The text after the keywords
reader_ and writer_ would be used
for the name of the function. In the example above, the entrypoint name
reader_duckdb would create pandas.read_duckdb. An entrypoint with name
writer_hive would create the methods DataFrame.to_hive and Series.to_hive.
Entrypoints not starting with
reader_ or writer_ would be ignored by this interface,
but will not raise an exception since they can be used for future extensions of this
API, or other related dataframe I/O interfaces.
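The registration mechanism described above could look roughly like the sketch below. It is only an illustration under several assumptions: `load_io_plugins` here takes its inputs as arguments so that it runs without pandas, `fake_pandas` stands in for the pandas module, `Frame` stands in for `pandas.DataFrame`, and the entry points are built by hand from standard-library functions. A real implementation would more likely obtain them from `importlib.metadata.entry_points(group="dataframe.io")`.

```python
import types
from importlib.metadata import EntryPoint


def load_io_plugins(namespace, frame_types, entry_points):
    """Create read_*/to_* callables from reader_/writer_ entry points.

    namespace stands in for the pandas module and frame_types for
    (DataFrame, Series); both are assumptions made for this sketch.
    """
    for ep in entry_points:
        kind, _, fmt = ep.name.partition("_")
        if not fmt:
            continue
        if kind == "reader":
            # reader_<fmt> becomes namespace.read_<fmt>
            setattr(namespace, f"read_{fmt}", ep.load())
        elif kind == "writer":
            # writer_<fmt> becomes a to_<fmt> method on each frame type
            func = ep.load()
            for frame_type in frame_types:
                setattr(frame_type, f"to_{fmt}", func)
        # Any other name is ignored without raising an exception.


class Frame(dict):
    """Stand-in for pandas.DataFrame so the sketch runs without pandas."""


fake_pandas = types.SimpleNamespace()
registry = [
    EntryPoint(name="reader_json", value="json:loads", group="dataframe.io"),
    EntryPoint(name="writer_json", value="json:dumps", group="dataframe.io"),
    EntryPoint(name="something_else", value="json:load", group="dataframe.io"),
]
load_io_plugins(fake_pandas, [Frame], registry)

data = fake_pandas.read_json('{"a": 1}')
text = Frame(data).to_json()
```

Because the writer is attached to the class as a plain function, it receives the frame as its first argument, which is what makes method chaining work.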
Connectors will use the dataframe interchange API to provide data to pandas. When
data is read from a connector, and before returning it to the user as a response
to pandas.read_<format>, data will be parsed from the data interchange interface
and converted to a pandas DataFrame. In practice, connectors are likely to return
a pandas DataFrame or a PyArrow Table, but the interface will support any object
implementing the dataframe interchange API.
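One way to picture this conversion step is a wrapper placed around every registered reader. The sketch below is an assumption about how that could be done, not part of the proposal: `wrap_reader`, `FakeTable` and `read_fake` are hypothetical names, and the `convert` argument plays the role that `pandas.api.interchange.from_dataframe` could play in pandas (it is not imported here so that the sketch has no dependencies).

```python
import functools


def wrap_reader(reader, convert):
    """Return a reader whose result is always passed through convert.

    convert turns an interchange-capable object into the host dataframe
    type. The check below only looks for the __dataframe__ method that
    the dataframe interchange protocol defines.
    """

    @functools.wraps(reader)
    def wrapper(*args, **kwargs):
        result = reader(*args, **kwargs)
        if not hasattr(result, "__dataframe__"):
            raise TypeError(
                f"{reader.__name__} returned {type(result).__name__}, which "
                "does not implement the dataframe interchange protocol"
            )
        return convert(result)

    return wrapper


class FakeTable:
    """Toy object exposing __dataframe__, standing in for a real table."""

    def __init__(self, columns):
        self.columns = columns

    def __dataframe__(self, nan_as_null=False, allow_copy=True):
        return self


def read_fake(n):
    return FakeTable({"x": list(range(n))})


# A toy converter; with pandas installed, from_dataframe could be used.
read_fake_wrapped = wrap_reader(read_fake, convert=lambda t: dict(t.columns))
result = read_fake_wrapped(3)
```

With this shape, connector authors can return whatever object is natural for their library, and the user always receives the same type.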
In order to provide a better and more consistent experience to users, guidelines will be created to unify terminology and behavior. Some of the topics to unify are defined next.
Guidelines to avoid name conflicts. Since it is expected that more than one
implementation exists for certain formats, as already happens, guidelines on
how to name connectors would be created. The easiest approach is probably to use
as the format a string of the type
to_<format>_<implementation-id> if it is
expected that more than one connector can exist. For example, for LanceDB it is likely
that only one connector exists, and the name
lance can be used (which would create
pandas.read_lance or DataFrame.to_lance). But for a new
csv reader based on the
Arrow2 Rust implementation, the guidelines can recommend using
csv_arrow2 (which would create pandas.read_csv_arrow2).
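As a small illustration of such a naming scheme, the helper below derives the entry point names and the resulting pandas-facing names from a format and an optional implementation id. The function `connector_names` and the keys of the dictionary it returns are hypothetical and only meant to make the convention concrete.

```python
def connector_names(fmt, implementation_id=None):
    """Build entry point and pandas-facing names for a connector.

    Hypothetical helper: the suffix is <format> when a single connector
    is expected, and <format>_<implementation-id> otherwise.
    """
    suffix = fmt if implementation_id is None else f"{fmt}_{implementation_id}"
    return {
        "reader_entrypoint": f"reader_{suffix}",
        "writer_entrypoint": f"writer_{suffix}",
        "read_function": f"read_{suffix}",
        "to_method": f"to_{suffix}",
    }


lance = connector_names("lance")
csv_arrow2 = connector_names("csv", "arrow2")
```

Here `lance["to_method"]` is `"to_lance"` and `csv_arrow2["read_function"]` is `"read_csv_arrow2"`.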
Existence and naming of parameters, since many connectors are likely to provide similar features, like loading only a subset of columns in the data, or dealing with paths. Examples of recommendations to connector developers could be:
columns: Use this argument to let the user load a subset of columns. Allow a list or tuple.
path: Use this argument if the dataset is a file on disk. Allow a string, a pathlib.Path object, or a file descriptor. For a string object, allow URLs that will be automatically downloaded, compressed files that will be automatically uncompressed, etc. Specific libraries can be recommended to deal with those in an easier and more consistent way.
schema: For datasets that don't have a schema (e.g. csv), allow providing an Apache Arrow schema instance, and automatically infer types if not provided.
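To make the example recommendations more tangible, here is a sketch of a reader that follows the `path` and `columns` suggestions using only the standard library. Everything in it is an assumption for illustration: the name `read_example_csv` is invented, the function returns a plain dict of columns instead of a DataFrame, it treats "file descriptor" as an open file-like object, and it leaves out URL handling, compression and the `schema` argument.

```python
import csv
import io
import os


def read_example_csv(path, columns=None):
    """Hypothetical reader following the example path/columns guidelines.

    Returns a dict mapping column name to a list of string values.
    """
    if isinstance(path, (str, os.PathLike)):
        # Strings and pathlib.Path objects are opened (and closed) here.
        handle = open(path, newline="")
        should_close = True
    else:
        # Anything else is assumed to be an already open file-like object.
        handle, should_close = path, False
    try:
        reader = csv.DictReader(handle)
        if columns is not None:
            selected = list(columns)  # accepts a list or a tuple
        else:
            selected = list(reader.fieldnames or [])
        data = {name: [] for name in selected}
        for row in reader:
            for name in selected:
                data[name].append(row[name])
        return data
    finally:
        if should_close:
            handle.close()


subset = read_example_csv(
    io.StringIO("a,b,c\n1,2,3\n4,5,6\n"), columns=("a", "c")
)
```

In this call only columns `a` and `c` are kept, so column `b` never reaches the returned data.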
Note that the above are only examples of guidelines for illustration, and not a proposal of the guidelines, which would be developed independently after this PDEP is approved.
Connector registry and documentation. To simplify the discovery of connectors and their documentation, connector developers can be encouraged to register their projects in a central location, and to use a standard structure for documentation. This would allow the creation of a unified website to find the available connectors and their documentation. It would also make it possible to customize the documentation for specific implementations, and include their final API.
This section lists specific examples of connectors that could immediately benefit from this proposal.
PyArrow currently provides
pyarrow.Table.from_pandas and pyarrow.Table.to_pandas.
With the new interface, it could also register
pandas.read_pyarrow and DataFrame.to_pyarrow, so pandas users can use the converters with
the interface they are used to, when PyArrow is installed in the environment.
Better integration with PyArrow tables was discussed in
Current API:

    pyarrow.Table.from_pandas(table.to_pandas()
                                   .query('my_col > 0'))

Proposed API:

    (pandas.read_pyarrow(table)
           .query('my_col > 0')
           .to_pyarrow())
Polars, Vaex and other dataframe frameworks could benefit from third-party projects that make the interoperability with pandas use a more explicit API. Integration with Polars was requested in #47368.
Current API:

    polars.DataFrame(df.to_pandas()
                       .query('my_col > 0'))

Proposed API:

    (pandas.read_polars(df)
           .query('my_col > 0')
           .to_polars())
DuckDB provides an out-of-core engine able to push predicates before the data is loaded, making much better use of memory and significantly decreasing loading time. pandas, because of its eager nature, is not able to easily implement this itself, but could benefit from a DuckDB loader. The loader can already be implemented inside pandas (it has already been proposed in #45678), or as a third-party extension with an arbitrary API. But this proposal would allow the creation of a third-party extension with a standard and intuitive API:
pandas.read_duckdb("SELECT * FROM 'dataset.parquet' WHERE my_col > 0")
Out-of-core algorithms push some operations like filtering or grouping to the loading of the data. While this is not currently possible, connectors implementing out-of-core algorithms could be developed using this interface.
Big data systems such as Hive, Iceberg, Presto, etc. could benefit from a standard way to load data to pandas. Regular SQL databases that can return their query results as Arrow would also benefit from better and faster connectors than the existing ones based on SQLAlchemy and Python structures.
Any other format, including domain-specific formats, could easily implement pandas connectors with a clear and intuitive API.
The implementation of this proposal has some limitations discussed here:
- Lack of support for multiple engines. The current pandas I/O API
supports multiple engines for the same format (for the same function or
method name), for example
read_csv(engine='pyarrow', ...). Supporting engines requires that all engines for a particular format use the same signature (the same parameters), which is not ideal. Different connectors are likely to have different parameters, and using
**kwargs provides users with a more complex and difficult experience. For this reason this proposal prefers that function and method names are unique instead of supporting an option for engines.
- Lack of support for type checking of connectors. This PDEP proposes creating functions and methods dynamically, and those are not supported for type checking using stubs. This is already the case for other dynamically created components of pandas, such as custom accessors.
- No improvements to the current I/O API. In the discussions of this
proposal it has been considered to improve the current pandas I/O API to
fix the inconsistency of using
read/to (instead of, for example,
read/write), avoid using
to_ prefixed methods for non-I/O operations, or use a dedicated namespace (e.g.
DataFrame.io) for the connectors. All of these changes are out of scope for this PDEP.
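The first limitation in the list above, preferring unique names over an engine option, can be illustrated with a toy comparison. All function names here are hypothetical and the "engines" just return tuples. The point is only to show how a shared name forces options behind `**kwargs`, while unique names keep each connector's signature explicit.

```python
import inspect


def _read_with_a(path, delimiter=","):
    return ("a", path, delimiter)


def _read_with_b(path, chunk_size=1024):
    return ("b", path, chunk_size)


def read_example(path, engine="a", **kwargs):
    """Single name, several engines: options are hidden behind **kwargs."""
    engines = {"a": _read_with_a, "b": _read_with_b}
    return engines[engine](path, **kwargs)


# Unique names instead: each connector keeps its own explicit signature,
# so help(), autocompletion and inspect.signature show the real options.
read_example_a = _read_with_a
read_example_b = _read_with_b

shared_params = list(inspect.signature(read_example).parameters)
unique_params = list(inspect.signature(read_example_b).parameters)
```

With the shared name, a call such as `read_example("f.csv", engine="b", delimiter=";")` fails only at call time with a TypeError raised inside the engine, and nothing in the signature of `read_example` tells the user which options engine `b` accepts.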
This PDEP is exclusively to support a better API for existing or future connectors. It is out of scope for this PDEP to implement changes to any connectors existing in the pandas code base.
Some ideas for future discussion related to this PDEP include:
Automatic loading of I/O plugins when pandas is imported.
Removing from the pandas code base some of the least frequently used connectors, such as SAS, SPSS or Google BigQuery, and moving them to third-party connectors registered with this interface.
Discussing a better API for pandas connectors. For example, using
read_* methods instead of from_* methods, renaming
to_* methods not used as I/O connectors, using a consistent terminology like from/to, read/write, load/dump, etc. or using a dedicated namespace for connectors (e.g.
pandas.io instead of the general pandas namespace).
Implement as I/O connectors some of the formats supported by the
- 5 March 2023: Initial version
- 30 May 2023: Major refactoring to use the existing pandas API and the dataframe interchange API, and to require users to load the plugins explicitly
- 13 June 2023: The PDEP did not get any support after several iterations, and it has been closed as rejected by the author