Nullable integer data type
Note
IntegerArray is currently experimental. Its API or implementation may change without warning. It uses pandas.NA as the missing value.
In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some large integers (for example, 2**53 + 1) cannot even be represented exactly as 64-bit floating point numbers.
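The precision problem can be illustrated with a short sketch. The identifier value below is made up for illustration; it is simply the smallest positive integer that a 64-bit float cannot hold exactly.

```python
import pandas as pd

# Hypothetical identifier: 2**53 + 1 has no exact float64 representation.
big_id = 2**53 + 1

# Forcing float64 (what NaN-based missing data requires) rounds the value.
as_float = pd.Series([big_id, None], dtype="float64")

# The nullable integer dtype keeps the integer intact next to a missing value.
as_nullable = pd.Series([big_id, None], dtype="Int64")

print(int(as_float.iloc[0]) == big_id)     # the float copy no longer matches
print(int(as_nullable.iloc[0]) == big_id)  # the Int64 copy still matches
```

For an identifier column, the rounded float value would silently point at a different record, which is why the nullable dtype is preferable here.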
Construction
pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas.
In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
Alternatively, use the string alias "Int64" (note the capital "I", which differentiates it from NumPy’s 'int64' dtype):
In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
All NA-like values are replaced with pandas.NA.
In [4]: pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")
Out[4]:
<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64
This array can be stored in a DataFrame or Series like any NumPy array.
In [5]: pd.Series(arr)
Out[5]:
0 1
1 2
2 <NA>
dtype: Int64
You can also pass the list-like object to the Series constructor with the dtype.
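As a minimal sketch of that shortcut, the list goes straight to the Series constructor together with the "Int64" alias; the values are arbitrary.

```python
import pandas as pd

# Build the Series directly from a list, naming the nullable dtype explicitly.
s = pd.Series([1, 2, None], dtype="Int64")

print(s.dtype)                                   # the nullable integer dtype
print(isinstance(s.array, pd.arrays.IntegerArray))  # backed by an IntegerArray
```

The backing array is the same IntegerArray that pd.array() would produce, so no intermediate array needs to be created by hand.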
Warning
Currently pandas.array() and pandas.Series() use different rules for dtype inference. pandas.array() will infer a nullable-integer dtype:
In [6]: pd.array([1, None])
Out[6]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
In [7]: pd.array([1, 2])
Out[7]:
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64
For backwards-compatibility, Series infers these as either integer or float dtype.
In [8]: pd.Series([1, None])
Out[8]:
0 1.0
1 NaN
dtype: float64
In [9]: pd.Series([1, 2])
Out[9]:
0 1
1 2
dtype: int64
We recommend explicitly providing the dtype to avoid confusion.
In [10]: pd.array([1, None], dtype="Int64")
Out[10]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
In [11]: pd.Series([1, None], dtype="Int64")
Out[11]:
0 1
1 <NA>
dtype: Int64
In the future, we may provide an option for Series to infer a nullable-integer dtype.
If you create a column of NA values (for example, to fill them later) with df['new_col'] = pd.NA, the dtype of the new column is set to object. Operations on this column will perform worse than with the appropriate type. It’s better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") (or another dtype that supports NA).
In [12]: df = pd.DataFrame()
In [13]: df['objects'] = pd.NA
In [14]: df.dtypes
Out[14]:
objects object
dtype: object
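For contrast, here is a sketch of the recommended approach on a frame that already has rows. The frame and its "key" column are hypothetical. Passing index=df.index is an assumption made here so that the placeholder column lines up with every existing row.

```python
import pandas as pd

# Hypothetical frame with three existing rows.
df = pd.DataFrame({"key": ["a", "b", "c"]})

# Create the placeholder column with a nullable integer dtype up front.
df["new_col"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Filling a value later keeps the column as Int64 rather than object.
df.loc[0, "new_col"] = 10

print(df.dtypes)
```

Because the dtype is fixed at creation time, later integer assignments do not need a separate conversion step.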
Operations
Operations involving an integer array behave similarly to operations on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.
In [15]: s = pd.Series([1, 2, None], dtype="Int64")
# arithmetic
In [16]: s + 1
Out[16]:
0 2
1 3
2 <NA>
dtype: Int64
# comparison
In [17]: s == 1
Out[17]:
0 True
1 False
2 <NA>
dtype: boolean
# slicing operation
In [18]: s.iloc[1:3]
Out[18]:
1 2
2 <NA>
dtype: Int64
# operate with other dtypes
In [19]: s + s.iloc[1:3].astype("Int8")
Out[19]:
0 <NA>
1 4
2 <NA>
dtype: Int64
# coerce when needed
In [20]: s + 0.01
Out[20]:
0 1.01
1 2.01
2 <NA>
dtype: Float64
These dtypes can operate as part of a DataFrame.
In [21]: df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})
In [22]: df
Out[22]:
      A  B  C
0     1  1  a
1     2  1  a
2  <NA>  3  b
In [23]: df.dtypes
Out[23]:
A Int64
B int64
C object
dtype: object
These dtypes can be merged, reshaped, and cast.
In [24]: pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
Out[24]:
A Int64
B int64
C object
dtype: object
In [25]: df["A"].astype(float)
Out[25]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
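Casting to a plain NumPy integer dtype is a different matter, because that dtype has nowhere to store a missing value. The sketch below shows two possible workarounds; the fill value 0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# A NumPy integer dtype cannot hold a missing value, so this cast fails.
try:
    s.astype("int64")
    cast_failed = False
except ValueError:
    cast_failed = True

# Option 1: pick a replacement value first, then cast.
filled = s.fillna(0).astype("int64")

# Option 2: convert to float and let NaN stand in for the missing entries.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)

print(cast_failed, filled.dtype, as_float.dtype)
```

Which option fits depends on whether a sentinel value is acceptable in the downstream code.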
Reduction and groupby operations such as sum() work as well.
In [26]: df.sum(numeric_only=True)
Out[26]:
A 3
B 5
dtype: Int64
In [27]: df.sum()
Out[27]:
A 3
B 5
C aab
dtype: object
In [28]: df.groupby("B").A.sum()
Out[28]:
B
1 3
3 0
Name: A, dtype: Int64
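The 0 for group 3 above comes from skipping missing values, which leaves nothing to add. A small sketch of how the skipna and min_count arguments of sum() change that behaviour on a single Series:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Default: missing values are skipped.
total_skipping = s.sum()

# skipna=False: any missing value makes the whole result missing.
total_strict = s.sum(skipna=False)

# A slice holding only the missing value sums to 0 by default ...
only_missing = s.iloc[2:]
empty_sum = only_missing.sum()

# ... unless at least one valid value is required via min_count.
guarded_sum = only_missing.sum(min_count=1)

print(total_skipping, total_strict, empty_sum, guarded_sum)
```

Use min_count=1 when an all-missing group should be reported as missing rather than as zero.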
Scalar NA value
arrays.IntegerArray uses pandas.NA as its scalar missing value. Selecting a single element that’s missing returns pandas.NA:
In [29]: a = pd.array([1, None], dtype="Int64")
In [30]: a[1]
Out[30]: <NA>
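Because comparisons with pandas.NA themselves evaluate to pandas.NA, an equality check is not a reliable way to detect a missing scalar. A brief sketch of two checks that do work:

```python
import pandas as pd

a = pd.array([1, None], dtype="Int64")
missing = a[1]

# Identity check against the pandas.NA singleton.
is_na_object = missing is pd.NA

# pd.isna() also recognises the scalar.
detected = pd.isna(missing)

# A comparison propagates the missing value instead of returning a bool.
comparison = missing == 1

print(is_na_object, detected, comparison)
```

Prefer pd.isna() in general code, since it also handles NaN and None.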