Nullable integer data type

arrays.IntegerArray uses pandas.NA as its missing value.

In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a float, this forces an array of integers with any missing values to become floating point. In some cases, this may not matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot even be represented exactly as floating point numbers: a 64-bit float can only hold every integer up to a magnitude of 2**53.
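As a rough illustration of the identifier problem, the sketch below uses a made-up identifier just above 2**53 (the value and the variable names are assumptions for this example). It compares what happens to that value under default inference and under the nullable integer dtype this page describes.

```python
import pandas as pd

# Hypothetical identifier: 2**53 + 1 has no exact float64 representation.
big_id = 2**53 + 1

# With default inference, the missing value forces a float64 column.
as_float = pd.Series([big_id, None])

# With the nullable integer dtype, the column stays integer-valued.
as_nullable = pd.Series([big_id, None], dtype="Int64")

# Round-trip each stored value back to a Python int and compare.
float_matches = int(as_float.iloc[0]) == big_id        # precision lost
nullable_matches = int(as_nullable.iloc[0]) == big_id  # value preserved

print(as_float.dtype, float_matches)
print(as_nullable.dtype, nullable_matches)
```

The float column silently stores a neighbouring value, which is why identifier-like columns are a typical case for the nullable dtype.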

Construction

pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension type implemented within pandas.

In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())

In [2]: arr
Out[2]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

Alternatively, use the string alias "Int64" (note the capital "I", which distinguishes it from NumPy’s 'int64' dtype):

In [3]: pd.array([1, 2, np.nan], dtype="Int64")
Out[3]: 
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64

All NA-like values are replaced with pandas.NA.

In [4]: pd.array([1, 2, np.nan, None, pd.NA], dtype="Int64")
Out[4]: 
<IntegerArray>
[1, 2, <NA>, <NA>, <NA>]
Length: 5, dtype: Int64

This array can be stored in a DataFrame or Series like any NumPy array.

In [5]: pd.Series(arr)
Out[5]: 
0       1
1       2
2    <NA>
dtype: Int64

You can also pass the list-like object to the Series constructor with the dtype.
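A minimal sketch of that direct construction, assuming a plain Python list as input (the variable names are illustrative only):

```python
import pandas as pd

# Build the Series directly from a list, using the string alias.
s_alias = pd.Series([1, 2, None], dtype="Int64")

# Equivalent construction with the dtype object.
s_dtype = pd.Series([1, 2, None], dtype=pd.Int64Dtype())

print(s_alias.dtype, s_dtype.dtype)
print(s_alias.isna().tolist())
```

Both spellings should yield the same dtype, so the choice between them is a matter of style.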

Warning

Currently pandas.array() and pandas.Series() use different rules for dtype inference. pandas.array() will infer a nullable-integer dtype:

In [6]: pd.array([1, None])
Out[6]: 
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

In [7]: pd.array([1, 2])
Out[7]: 
<IntegerArray>
[1, 2]
Length: 2, dtype: Int64

For backwards compatibility, Series infers these as either integer or float dtype:

In [8]: pd.Series([1, None])
Out[8]: 
0    1.0
1    NaN
dtype: float64

In [9]: pd.Series([1, 2])
Out[9]: 
0    1
1    2
dtype: int64

We recommend explicitly providing the dtype to avoid confusion.

In [10]: pd.array([1, None], dtype="Int64")
Out[10]: 
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64

In [11]: pd.Series([1, None], dtype="Int64")
Out[11]: 
0       1
1    <NA>
dtype: Int64

In the future, we may provide an option for Series to infer a nullable-integer dtype.

If you create a column of NA values (for example, to fill them in later) with df['new_col'] = pd.NA, the new column gets object dtype, and operations on it will be slower than with an appropriate dtype. It’s better to use df['new_col'] = pd.Series(pd.NA, dtype="Int64") (or another dtype that supports NA).

In [12]: df = pd.DataFrame()

In [13]: df['objects'] = pd.NA

In [14]: df.dtypes
Out[14]: 
objects    object
dtype: object
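For contrast, here is a sketch of the recommended form on a small frame. The frame contents and column names are hypothetical, and passing index=df.index is an addition made here so that the placeholder has one entry per existing row.

```python
import pandas as pd

# Hypothetical existing data.
df = pd.DataFrame({"id": [10, 20, 30]})

# Placeholder column of missing values with a nullable integer dtype.
df["new_col"] = pd.Series(pd.NA, index=df.index, dtype="Int64")

# Filling a value in later keeps the nullable integer dtype.
df.loc[0, "new_col"] = 5

print(df.dtypes)
```

Because the column starts out as Int64 rather than object, later integer assignments do not need a separate conversion step.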

Operations

Operations involving an integer array behave similarly to those on NumPy arrays. Missing values are propagated, and the data is coerced to another dtype if needed.

In [15]: s = pd.Series([1, 2, None], dtype="Int64")

# arithmetic
In [16]: s + 1
Out[16]: 
0       2
1       3
2    <NA>
dtype: Int64

# comparison
In [17]: s == 1
Out[17]: 
0     True
1    False
2     <NA>
dtype: boolean

# slicing operation
In [18]: s.iloc[1:3]
Out[18]: 
1       2
2    <NA>
dtype: Int64

# operate with other dtypes
In [19]: s + s.iloc[1:3].astype("Int8")
Out[19]: 
0    <NA>
1       4
2    <NA>
dtype: Int64

# coerce when needed
In [20]: s + 0.01
Out[20]: 
0    1.01
1    2.01
2    <NA>
dtype: Float64
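Since comparisons propagate missing values, the resulting mask can itself contain NA. The sketch below shows one way such a mask might be used for selection. It relies on boolean indexing treating NA entries as False, and the explicit fillna(False) variant is shown as an alternative for readers who prefer to spell that out.

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# The comparison yields a nullable boolean mask with NA in the last slot.
mask = s == 1

# NA entries in a boolean mask are treated as False when indexing.
selected = s[mask]

# To make the handling of missing values explicit, fill the mask first.
selected_explicit = s[mask.fillna(False)]

print(selected)
```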

These dtypes can operate as part of a DataFrame.

In [21]: df = pd.DataFrame({"A": s, "B": [1, 1, 3], "C": list("aab")})

In [22]: df
Out[22]: 
      A  B  C
0     1  1  a
1     2  1  a
2  <NA>  3  b

In [23]: df.dtypes
Out[23]: 
A    Int64
B    int64
C      str
dtype: object

These dtypes can be merged, reshaped, and cast.

In [24]: pd.concat([df[["A"]], df[["B", "C"]]], axis=1).dtypes
Out[24]: 
A    Int64
B    int64
C      str
dtype: object

In [25]: df["A"].astype(float)
Out[25]: 
0    1.0
1    2.0
2    NaN
Name: A, dtype: float64
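Casting to a NumPy integer dtype needs a decision about the missing values first, because NumPy integers have no slot for NA. The following sketch shows two possible approaches. The sentinel value 0 is an arbitrary choice for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")

# Casting straight to a NumPy integer dtype fails while NA is present.
try:
    s.astype("int64")
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# Option 1: replace missing values with a sentinel, then cast.
filled = s.fillna(0).astype("int64")

# Option 2: export as a float array, representing NA as NaN.
as_float = s.to_numpy(dtype="float64", na_value=np.nan)

print(cast_failed, filled.dtype, as_float.dtype)
```

Option 1 keeps an integer dtype but makes the sentinel indistinguishable from real data, so it only fits when the sentinel cannot occur naturally.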

Reduction and groupby operations such as sum() work as well.

In [26]: df.sum(numeric_only=True)
Out[26]: 
A    3
B    5
dtype: Int64

In [27]: df.sum()
Out[27]: 
A      3
B      5
C    aab
dtype: object

In [28]: df.groupby("B").A.sum()
Out[28]: 
B
1    3
3    0
Name: A, dtype: Int64
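Note that the group B == 3 above contains only a missing value, yet its sum is 0. If NA is the preferred result for such groups, the min_count parameter of sum() can be used, as in this sketch, which rebuilds the same data without column C:

```python
import pandas as pd

s = pd.Series([1, 2, None], dtype="Int64")
df = pd.DataFrame({"A": s, "B": [1, 1, 3]})

# Default: the sum over the all-missing group B == 3 is 0.
default_sum = df.groupby("B")["A"].sum()

# Requiring at least one valid value turns that result into NA instead.
strict_sum = df.groupby("B")["A"].sum(min_count=1)

print(default_sum)
print(strict_sum)
```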

Scalar NA value

arrays.IntegerArray uses pandas.NA as its scalar missing value. Indexing a single element that is missing returns pandas.NA:

In [29]: a = pd.array([1, None], dtype="Int64")

In [30]: a[1]
Out[30]: <NA>
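When testing whether an extracted scalar is missing, an equality comparison is not suitable because it propagates NA. A short sketch of the checks that can be used instead (variable names are illustrative):

```python
import pandas as pd

a = pd.array([1, None], dtype="Int64")
missing = a[1]

# pandas.NA is a singleton, so an identity check works.
is_na_identity = missing is pd.NA

# pd.isna() also works and covers other missing-value markers.
is_na_func = pd.isna(missing)

# Comparing with == propagates NA rather than returning True or False.
comparison = missing == 1

print(is_na_identity, is_na_func, comparison)
```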