Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.
This section describes how duplicate labels change the behavior of certain operations, how to prevent duplicates from arising during operations, and how to detect them if they do.
In [1]: import pandas as pd

In [2]: import numpy as np
Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t be determined, and so pandas raises.
In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

In [4]: s1.reindex(["a", "b", "c"])
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-18a38f6978fe> in <module>
----> 1 s1.reindex(["a", "b", "c"])

/pandas/pandas/core/series.py in reindex(self, index, **kwargs)
   4313     )
   4314     def reindex(self, index=None, **kwargs):
-> 4315         return super().reindex(index=index, **kwargs)
   4316
   4317     def drop(

/pandas/pandas/core/generic.py in reindex(self, *args, **kwargs)
   4806
   4807         # perform the reindex on the axes
-> 4808         return self._reindex_axes(
   4809             axes, level, limit, tolerance, method, fill_value, copy
   4810         ).__finalize__(self, method="reindex")

/pandas/pandas/core/generic.py in _reindex_axes(self, axes, level, limit, tolerance, method, fill_value, copy)
   4827
   4828             axis = self._get_axis_number(a)
-> 4829             obj = obj._reindex_with_indexers(
   4830                 {axis: [new_index, indexer]},
   4831                 fill_value=fill_value,

/pandas/pandas/core/generic.py in _reindex_with_indexers(self, reindexers, fill_value, copy, allow_dups)
   4872
   4873             # TODO: speed up on homogeneous DataFrame objects
-> 4874             new_data = new_data.reindex_indexer(
   4875                 index,
   4876                 indexer,

/pandas/pandas/core/internals/managers.py in reindex_indexer(self, new_axis, indexer, axis, fill_value, allow_dups, copy, consolidate, only_slice)
   1299         # some axes don't allow reindexing with dups
   1300         if not allow_dups:
-> 1301             self.axes[axis]._can_reindex(indexer)
   1302
   1303         if axis >= self.ndim:

/pandas/pandas/core/indexes/base.py in _can_reindex(self, indexer)
   3474         # trying to reindex on an axis with duplicates
   3475         if not self._index_as_unique and len(indexer):
-> 3476             raise ValueError("cannot reindex from a duplicate axis")
   3477
   3478     def reindex(self, target, method=None, level=None, limit=None, tolerance=None):

ValueError: cannot reindex from a duplicate axis
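One possible way around this error, assuming it is acceptable to keep only the first occurrence of each label, is to drop the repeated labels before reindexing. The following is a minimal sketch rather than a recommended recipe; the variable names `deduplicated` and `result` are illustrative only.

```python
import pandas as pd

s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])

# Assumption: keeping the first occurrence of each label is acceptable.
# Index.duplicated marks every repeat after the first as True.
deduplicated = s1[~s1.index.duplicated(keep="first")]

# With a unique index, reindex no longer raises. The label "c" is
# missing from the data, so it is filled with NaN.
result = deduplicated.reindex(["a", "b", "c"])
print(result)
```

Whether dropping repeats is appropriate depends on the data; aggregating the duplicated rows may be a better fit in other cases.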
Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality. Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar. But with duplicates, this isn’t the case.
In [5]: df1 = pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "A", "B"])

In [6]: df1
Out[6]:
   A  A  B
0  0  1  2
1  3  4  5
We have duplicates in the columns. If we slice 'B', we get back a Series
In [7]: df1["B"]  # a series
Out[7]:
0    2
1    5
Name: B, dtype: int64
But slicing 'A' returns a DataFrame
In [8]: df1["A"]  # a DataFrame
Out[8]:
   A  A
0  0  1
1  3  4
This applies to row labels as well
In [9]: df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

In [10]: df2
Out[10]:
   A
a  0
a  1
b  2

In [11]: df2.loc["b", "A"]  # a scalar
Out[11]: 2

In [12]: df2.loc["a", "A"]  # a Series
Out[12]:
a    0
a    1
Name: A, dtype: int64
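If downstream code needs a predictable return type regardless of whether a label happens to be duplicated, one option is to pass the label inside a list, which does not reduce dimensionality. This is a small sketch of that idea; the names `unique_label` and `repeated_label` are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Passing the row label in a list keeps the row axis, so both
# lookups return a Series whether or not the label is duplicated.
unique_label = df2.loc[["b"], "A"]    # Series with one element
repeated_label = df2.loc[["a"], "A"]  # Series with two elements

print(type(unique_label).__name__, len(unique_label))
print(type(repeated_label).__name__, len(repeated_label))
```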
You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:
In [13]: df2
Out[13]:
   A
a  0
a  1
b  2

In [14]: df2.index.is_unique
Out[14]: False

In [15]: df2.columns.is_unique
Out[15]: True
Note
Checking whether an index is unique is somewhat expensive for large datasets. pandas does cache this result, so re-checking on the same index is very fast.
Index.duplicated() will return a boolean ndarray indicating whether a label is repeated.
In [16]: df2.index.duplicated()
Out[16]: array([False,  True, False])
The resulting array can be inverted and used as a boolean filter to drop duplicate rows.
In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]:
   A
a  0
b  2
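Index.duplicated() also accepts a keep argument that controls which occurrence is treated as the "original". The sketch below compares the three settings on the same frame; the variable names are illustrative.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# keep="first" (the default): repeats after the first are marked.
keep_first = df2.index.duplicated(keep="first")
# keep="last": repeats before the last are marked.
keep_last = df2.index.duplicated(keep="last")
# keep=False: every occurrence of a repeated label is marked.
mark_all = df2.index.duplicated(keep=False)

last_rows = df2.loc[~keep_last]    # keeps the last "a" row and "b"
only_unique = df2.loc[~mark_all]   # keeps only labels that never repeat

print(last_rows)
print(only_unique)
```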
If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.
In [18]: df2.groupby(level=0).mean()
Out[18]:
     A
a  0.5
b  2.0
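The same grouping approach extends to other reductions. As a sketch, assuming you want both a combined value and a count of how many rows shared each label, named aggregation can produce both at once; the output column names `total` and `n_rows` are hypothetical.

```python
import pandas as pd

df2 = pd.DataFrame({"A": [0, 1, 2]}, index=["a", "a", "b"])

# Group on the index labels and compute several reductions per label.
# "total" and "n_rows" are illustrative output column names.
summary = df2.groupby(level=0)["A"].agg(total="sum", n_rows="count")

print(summary)
```

The resulting index contains each label once, so the n_rows column doubles as a record of which labels were duplicated in the input.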
New in version 1.2.0.
As noted above, handling duplicates is an important feature when reading in raw data. That said, you may want to avoid introducing duplicates as part of a data processing pipeline (from methods like pandas.concat(), rename(), etc.). Both Series and DataFrame can disallow duplicate labels by calling .set_flags(allows_duplicate_labels=False) (the default is to allow them). If there are duplicate labels, an exception will be raised.
In [19]: pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-19-11af4ee9738e> in <module>
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)

/pandas/pandas/core/generic.py in set_flags(self, copy, allows_duplicate_labels)
    334         df = self.copy(deep=copy)
    335         if allows_duplicate_labels is not None:
--> 336             df.flags["allows_duplicate_labels"] = allows_duplicate_labels
    337         return df
    338

/pandas/pandas/core/flags.py in __setitem__(self, key, value)
    103         if key not in self._keys:
    104             raise ValueError(f"Unknown flag {key}. Must be one of {self._keys}")
--> 105         setattr(self, key, value)
    106
    107     def __repr__(self):

/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
     90         if not value:
     91             for ax in obj.axes:
---> 92                 ax._maybe_check_unique()
     93
     94         self._allows_duplicate_labels = value

/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    469             msg += f"\n{duplicates}"
    470
--> 471             raise DuplicateLabelError(msg)
    472
    473     @final

DuplicateLabelError: Index has duplicates.
      positions
label
b        [1, 2]
This applies to both row and column labels for a DataFrame
In [20]: pd.DataFrame([[0, 1, 2], [3, 4, 5]], columns=["A", "B", "C"],).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....:
Out[20]:
   A  B  C
0  0  1  2
1  3  4  5
This attribute can be checked or set with allows_duplicate_labels, which indicates whether that object can have duplicate labels.
In [21]: df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
   ....:     allows_duplicate_labels=False
   ....: )
   ....:

In [22]: df
Out[22]:
   A
x  0
y  1
X  2
Y  3

In [23]: df.flags.allows_duplicate_labels
Out[23]: False
DataFrame.set_flags() can be used to return a new DataFrame with attributes like allows_duplicate_labels set to some value
In [24]: df2 = df.set_flags(allows_duplicate_labels=True)

In [25]: df2.flags.allows_duplicate_labels
Out[25]: True
The new DataFrame returned is a view on the same data as the old DataFrame. Alternatively, the property can be set directly on the same object:
In [26]: df2.flags.allows_duplicate_labels = False

In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
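To make the difference between the two approaches concrete, the sketch below checks that set_flags() leaves the flag on the original object untouched, while assigning to flags.allows_duplicate_labels changes the object in place. The variable name `strict` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"])

# set_flags returns a new object; the flag on df itself keeps its
# default value (duplicates allowed).
strict = df.set_flags(allows_duplicate_labels=False)
original_flag_after_set_flags = df.flags.allows_duplicate_labels
strict_flag = strict.flags.allows_duplicate_labels

# Direct assignment modifies the flag on the existing object.
df.flags.allows_duplicate_labels = False
original_flag_after_assignment = df.flags.allows_duplicate_labels

print(original_flag_after_set_flags, strict_flag, original_flag_after_assignment)
```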
When processing raw, messy data you might initially read in the messy data (which potentially has duplicate labels), deduplicate, and then disallow duplicates going forward, to ensure that your data pipeline doesn’t introduce duplicates.
>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first()  # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False  # disallow going forward
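The following is a self-contained version of that pattern that can be run without a file on disk. The CSV content, the column names `key` and `value`, and the use of io.StringIO are all assumptions made for illustration; they stand in for whatever raw data you actually read.

```python
import io

import pandas as pd

# Hypothetical raw input with a repeated "a" label in the key column.
csv_text = "key,value\na,0\na,1\nb,2\n"

raw = pd.read_csv(io.StringIO(csv_text), index_col="key")

# Resolve duplicates by keeping the first row seen for each label.
deduplicated = raw.groupby(level=0).first()

# Disallow duplicate labels from this point in the pipeline onward.
deduplicated.flags.allows_duplicate_labels = False

print(deduplicated)
```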
Setting allows_duplicate_labels=False on a Series or DataFrame with duplicate labels, or performing an operation that introduces duplicate labels on a Series or DataFrame that disallows duplicates, will raise an errors.DuplicateLabelError.
In [28]: df.rename(str.upper)
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-28-17c8fb0b7c7f> in <module>
----> 1 df.rename(str.upper)

/pandas/pandas/util/_decorators.py in wrapper(*args, **kwargs)
    310         @wraps(func)
    311         def wrapper(*args, **kwargs) -> Callable[..., Any]:
--> 312             return func(*args, **kwargs)
    313
    314         kind = inspect.Parameter.POSITIONAL_OR_KEYWORD

/pandas/pandas/core/frame.py in rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   4436         4  3  6
   4437         """
-> 4438         return super().rename(
   4439             mapper=mapper,
   4440             index=index,

/pandas/pandas/core/generic.py in rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1062             return None
   1063         else:
-> 1064             return result.__finalize__(self, method="rename")
   1065
   1066     @rewrite_axis_style_signature("mapper", [("copy", True), ("inplace", False)])

/pandas/pandas/core/generic.py in __finalize__(self, other, method, **kwargs)
   5430                 self.attrs[name] = other.attrs[name]
   5431
-> 5432             self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   5433             # For subclasses using _metadata.
   5434             for name in set(self._metadata) & set(other._metadata):

/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
     90         if not value:
     91             for ax in obj.axes:
---> 92                 ax._maybe_check_unique()
     93
     94         self._allows_duplicate_labels = value

/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    469             msg += f"\n{duplicates}"
    470
--> 471             raise DuplicateLabelError(msg)
    472
    473     @final

DuplicateLabelError: Index has duplicates.
      positions
label
X        [0, 2]
Y        [1, 3]
This error message contains the labels that are duplicated, and the numeric positions of all the duplicates (including the “original”) in the Series or DataFrame.
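In a pipeline you may prefer to catch this error rather than let it propagate. The sketch below assumes the exception is available as pd.errors.DuplicateLabelError and simply captures its message; what to do with the message (logging, re-raising, falling back to deduplication) is left open.

```python
import pandas as pd

df = pd.DataFrame({"A": [0, 1, 2, 3]}, index=["x", "y", "X", "Y"]).set_flags(
    allows_duplicate_labels=False
)

message = None
try:
    # Upper-casing the row labels would make "x"/"X" and "y"/"Y" collide.
    df.rename(str.upper)
except pd.errors.DuplicateLabelError as err:
    # The message lists each duplicated label and its positions.
    message = str(err)

print(message)
```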
In general, disallowing duplicates is “sticky”. It’s preserved through operations.
In [29]: s1 = pd.Series(0, index=["a", "b"]).set_flags(allows_duplicate_labels=False)

In [30]: s1
Out[30]:
a    0
b    0
dtype: int64

In [31]: s1.head().rename({"a": "b"})
---------------------------------------------------------------------------
DuplicateLabelError                       Traceback (most recent call last)
<ipython-input-31-8f09bda3af1a> in <module>
----> 1 s1.head().rename({"a": "b"})

/pandas/pandas/core/series.py in rename(self, index, axis, copy, inplace, level, errors)
   4271         """
   4272         if callable(index) or is_dict_like(index):
-> 4273             return super().rename(
   4274                 index, copy=copy, inplace=inplace, level=level, errors=errors
   4275             )

/pandas/pandas/core/generic.py in rename(self, mapper, index, columns, axis, copy, inplace, level, errors)
   1062             return None
   1063         else:
-> 1064             return result.__finalize__(self, method="rename")
   1065
   1066     @rewrite_axis_style_signature("mapper", [("copy", True), ("inplace", False)])

/pandas/pandas/core/generic.py in __finalize__(self, other, method, **kwargs)
   5430                 self.attrs[name] = other.attrs[name]
   5431
-> 5432             self.flags.allows_duplicate_labels = other.flags.allows_duplicate_labels
   5433             # For subclasses using _metadata.
   5434             for name in set(self._metadata) & set(other._metadata):

/pandas/pandas/core/flags.py in allows_duplicate_labels(self, value)
     90         if not value:
     91             for ax in obj.axes:
---> 92                 ax._maybe_check_unique()
     93
     94         self._allows_duplicate_labels = value

/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    469             msg += f"\n{duplicates}"
    470
--> 471             raise DuplicateLabelError(msg)
    472
    473     @final

DuplicateLabelError: Index has duplicates.
      positions
label
b        [0, 1]
Warning
This is an experimental feature. Currently, many methods fail to propagate the allows_duplicate_labels value. In future versions it is expected that every method taking or returning one or more DataFrame or Series objects will propagate allows_duplicate_labels.