PyArrow Functionality
pandas can utilize PyArrow to extend functionality and improve the performance of various APIs. This includes:
- More extensive data types compared to NumPy 
- Missing data support (NA) for all data types 
- Performant IO reader integration 
- Interoperability with other dataframe libraries based on the Apache Arrow specification (e.g. polars, cuDF) 
To use this functionality, please ensure you have installed the minimum supported PyArrow version.
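One way to check which PyArrow version is installed is sketched below using only the standard library. The helper name `installed_pyarrow_version` is hypothetical, and the minimum version is not stated in this section, so the result still has to be compared against the pandas installation requirements.

```python
from importlib import metadata


def installed_pyarrow_version():
    """Return the installed PyArrow version string, or None if it is missing."""
    try:
        return metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return None


version = installed_pyarrow_version()
print(version if version is not None else "pyarrow is not installed")
```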
Data Structure Integration
A Series, Index, or the columns of a DataFrame can be directly backed by a pyarrow.ChunkedArray,
which is similar to a NumPy array. To construct these from the main pandas data structures, pass a string of the type followed by
[pyarrow], e.g. "int64[pyarrow]", into the dtype parameter.
In [1]: ser = pd.Series([-1.5, 0.2, None], dtype="float32[pyarrow]")
In [2]: ser
Out[2]: 
0    -1.5
1     0.2
2    <NA>
dtype: float[pyarrow]
In [3]: idx = pd.Index([True, None], dtype="bool[pyarrow]")
In [4]: idx
Out[4]: Index([True, <NA>], dtype='bool[pyarrow]')
In [5]: df = pd.DataFrame([[1, 2], [3, 4]], dtype="uint64[pyarrow]")
In [6]: df
Out[6]: 
   0  1
0  1  2
1  3  4
Note
The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow") which is not equivalent to
specifying dtype=pd.ArrowDtype(pa.string()). Generally, operations on the data will behave similarly
except pd.StringDtype("pyarrow") can return NumPy-backed nullable types while pd.ArrowDtype(pa.string())
will return ArrowDtype.
In [7]: import pyarrow as pa
In [8]: data = list("abc")
In [9]: ser_sd = pd.Series(data, dtype="string[pyarrow]")
In [10]: ser_ad = pd.Series(data, dtype=pd.ArrowDtype(pa.string()))
In [11]: ser_ad.dtype == ser_sd.dtype
Out[11]: False
In [12]: ser_sd.str.contains("a")
Out[12]: 
0     True
1    False
2    False
dtype: boolean
In [13]: ser_ad.str.contains("a")
Out[13]: 
0     True
1    False
2    False
dtype: bool[pyarrow]
For PyArrow types that accept parameters, you can pass in a PyArrow type with those parameters
into ArrowDtype to use in the dtype parameter.
In [14]: import pyarrow as pa
In [15]: list_str_type = pa.list_(pa.string())
In [16]: ser = pd.Series([["hello"], ["there"]], dtype=pd.ArrowDtype(list_str_type))
In [17]: ser
Out[17]: 
0    ['hello']
1    ['there']
dtype: list<item: string>[pyarrow]
In [18]: from datetime import time
In [19]: idx = pd.Index([time(12, 30), None], dtype=pd.ArrowDtype(pa.time64("us")))
In [20]: idx
Out[20]: Index([12:30:00, <NA>], dtype='time64[us][pyarrow]')
In [21]: from decimal import Decimal
In [22]: decimal_type = pd.ArrowDtype(pa.decimal128(3, scale=2))
In [23]: data = [[Decimal("3.19"), None], [None, Decimal("-1.23")]]
In [24]: df = pd.DataFrame(data, dtype=decimal_type)
In [25]: df
Out[25]: 
      0      1
0  3.19   <NA>
1  <NA>  -1.23
If you already have a pyarrow.Array or pyarrow.ChunkedArray,
you can pass it into arrays.ArrowExtensionArray to construct the associated Series, Index
or DataFrame object.
In [26]: pa_array = pa.array(
   ....:     [{"1": "2"}, {"10": "20"}, None],
   ....:     type=pa.map_(pa.string(), pa.string()),
   ....: )
   ....: 
In [27]: ser = pd.Series(pd.arrays.ArrowExtensionArray(pa_array))
In [28]: ser
Out[28]: 
0      [('1', '2')]
1    [('10', '20')]
2              <NA>
dtype: map<string, string>[pyarrow]
To retrieve a pyarrow.ChunkedArray from a Series or Index, you can call
the pyarrow array constructor on the Series or Index.
In [29]: ser = pd.Series([1, 2, None], dtype="uint8[pyarrow]")
In [30]: pa.array(ser)
Out[30]: 
<pyarrow.lib.UInt8Array object at 0x7f842e072860>
[
  1,
  2,
  null
]
In [31]: idx = pd.Index(ser)
In [32]: pa.array(idx)
Out[32]: 
<pyarrow.lib.UInt8Array object at 0x7f842e0726e0>
[
  1,
  2,
  null
]
To convert a pyarrow.Table to a DataFrame, you can call the
pyarrow.Table.to_pandas() method with types_mapper=pd.ArrowDtype.
In [33]: table = pa.table([pa.array([1, 2, 3], type=pa.int64())], names=["a"])
In [34]: df = table.to_pandas(types_mapper=pd.ArrowDtype)
In [35]: df
Out[35]: 
   a
0  1
1  2
2  3
In [36]: df.dtypes
Out[36]: 
a    int64[pyarrow]
dtype: object
Operations
PyArrow data structure integration is implemented through pandas’ ExtensionArray interface;
therefore, supported functionality exists where this interface is integrated within the pandas API. Additionally, this functionality
is accelerated with PyArrow compute functions where available. This includes:
- Numeric aggregations 
- Numeric arithmetic 
- Numeric rounding 
- Logical and comparison functions 
- String functionality 
- Datetime functionality 
The following are just some examples of operations that are accelerated by native PyArrow compute functions.
In [37]: import pyarrow as pa
In [38]: ser = pd.Series([-1.545, 0.211, None], dtype="float32[pyarrow]")
In [39]: ser.mean()
Out[39]: -0.6669999808073044
In [40]: ser + ser
Out[40]: 
0    -3.09
1    0.422
2     <NA>
dtype: float[pyarrow]
In [41]: ser > (ser + 1)
Out[41]: 
0    False
1    False
2     <NA>
dtype: bool[pyarrow]
In [42]: ser.dropna()
Out[42]: 
0   -1.545
1    0.211
dtype: float[pyarrow]
In [43]: ser.isna()
Out[43]: 
0    False
1    False
2     True
dtype: bool
In [44]: ser.fillna(0)
Out[44]: 
0   -1.545
1    0.211
2      0.0
dtype: float[pyarrow]
In [45]: ser_str = pd.Series(["a", "b", None], dtype=pd.ArrowDtype(pa.string()))
In [46]: ser_str.str.startswith("a")
Out[46]: 
0     True
1    False
2     <NA>
dtype: bool[pyarrow]
In [47]: from datetime import datetime
In [48]: pa_type = pd.ArrowDtype(pa.timestamp("ns"))
In [49]: ser_dt = pd.Series([datetime(2022, 1, 1), None], dtype=pa_type)
In [50]: ser_dt.dt.strftime("%Y-%m")
Out[50]: 
0    2022-01
1       <NA>
dtype: string[pyarrow]
I/O Reading
PyArrow also provides IO reading functionality that has been integrated into several pandas IO readers. Functions
such as read_csv provide an engine keyword that can dispatch to PyArrow to accelerate reading from an IO source.
In [51]: import io
In [52]: data = io.StringIO("""a,b,c
   ....:    1,2.5,True
   ....:    3,4.5,False
   ....: """)
   ....: 
In [53]: df = pd.read_csv(data, engine="pyarrow")
In [54]: df
Out[54]: 
   a    b      c
0  1  2.5   True
1  3  4.5  False
By default, these functions and all other IO reader functions return NumPy-backed data. These readers can return
PyArrow-backed data when the parameter dtype_backend="pyarrow" is specified. A reader does not need to set
engine="pyarrow" to return PyArrow-backed data.
In [55]: import io
In [56]: data = io.StringIO("""a,b,c,d,e,f,g,h,i
   ....:     1,2.5,True,a,,,,,
   ....:     3,4.5,False,b,6,7.5,True,a,
   ....: """)
   ....: 
In [57]: df_pyarrow = pd.read_csv(data, dtype_backend="pyarrow")
In [58]: df_pyarrow.dtypes
Out[58]: 
a     int64[pyarrow]
b    double[pyarrow]
c      bool[pyarrow]
d    string[pyarrow]
e     int64[pyarrow]
f    double[pyarrow]
g      bool[pyarrow]
h    string[pyarrow]
i      null[pyarrow]
dtype: object
Several non-IO reader functions can also use the dtype_backend argument to return PyArrow-backed data including: