pandas.factorize
- pandas.factorize(values, sort=False, use_na_sentinel=True, size_hint=None)
- Encode the object as an enumerated type or categorical variable.
- This method is useful for obtaining a numeric representation of an array when all that matters is identifying distinct values. factorize is available both as a top-level function, pandas.factorize(), and as the methods Series.factorize() and Index.factorize().
- Parameters
- values : sequence
- A 1-D sequence. Sequences that aren’t pandas objects are coerced to ndarrays before factorization. 
- sort : bool, default False
- Sort uniques and shuffle codes to maintain the relationship. 
- use_na_sentinel : bool, default True
- If True, the sentinel -1 is used for NaN values. If False, NaN values are encoded as non-negative integers and NaN is kept in the uniques of the values.
- New in version 1.5.0.
- size_hint : int, optional
- Hint to the hashtable sizer. 
 
- Returns
- codes : ndarray
- An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
- uniques : ndarray, Index, or Categorical
- The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
- Note
- Even if there’s a missing value in values, uniques will not contain an entry for it.
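The relationship between codes and uniques described above can be checked directly. The following is a minimal sketch, assuming NumPy object arrays as input; all variable names are illustrative. It also shows one possible way to rebuild the input when the default -1 sentinel is present, since ndarray.take treats -1 as "last element" rather than as missing.

```python
import numpy as np
import pandas as pd

values = np.array(["b", "b", "a", "c", "b"], dtype=object)
codes, uniques = pd.factorize(values)

# Indexing uniques with codes rebuilds the original sequence.
rebuilt = uniques.take(codes)
print(rebuilt.tolist())  # ['b', 'b', 'a', 'c', 'b']

# With a missing value, the default sentinel -1 appears in codes.
with_missing = np.array(["b", None, "a"], dtype=object)
codes_m, uniques_m = pd.factorize(with_missing)
is_missing = codes_m == -1

# take() wraps -1 around to the last unique, so overwrite those
# positions with None afterwards.
restored = uniques_m.take(codes_m)
restored[is_missing] = None
print(restored.tolist())  # ['b', None, 'a']
```

Passing use_na_sentinel=False avoids the masking step, at the cost of a NaN entry in uniques.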
 
- Notes
- See the user guide for more examples.
- Examples
- These examples all show factorize as a top-level function like pd.factorize(values). The results are identical for methods like Series.factorize().

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

- With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.

>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'], sort=True)
>>> codes
array([1, 1, 0, 2, 1])
>>> uniques
array(['a', 'b', 'c'], dtype=object)

- When use_na_sentinel=True (the default), missing values are indicated in the codes with the sentinel value -1, and missing values are not included in uniques.

>>> codes, uniques = pd.factorize(['b', None, 'a', 'c', 'b'])
>>> codes
array([ 0, -1,  1,  2,  0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)

- Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.

>>> cat = pd.Categorical(['a', 'a', 'c'], categories=['a', 'b', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
['a', 'c']
Categories (3, object): ['a', 'b', 'c']

- Notice that 'b' is in uniques.categories, despite not being present in cat.values.
- For all other pandas objects, an Index of the appropriate type is returned.
>>> cat = pd.Series(['a', 'a', 'c'])
>>> codes, uniques = pd.factorize(cat)
>>> codes
array([0, 0, 1])
>>> uniques
Index(['a', 'c'], dtype='object')

- If NaN is in the values and we want to include NaN in the uniques of the values, set use_na_sentinel=False.

>>> values = np.array([1, 2, 1, np.nan])
>>> codes, uniques = pd.factorize(values)  # default: use_na_sentinel=True
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array([1., 2.])

>>> codes, uniques = pd.factorize(values, use_na_sentinel=False)
>>> codes
array([0, 1, 0, 2])
>>> uniques
array([ 1.,  2., nan])
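The examples above use the top-level function only. As a small sketch of the method forms mentioned in the description, the snippet below compares pd.factorize() with Series.factorize() on the same data and calls Index.factorize() once. The data and the variable names (such as colors) are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical example data; the name `colors` is illustrative only.
colors = pd.Series(["red", "blue", "red", None, "green"])

# The top-level function and the Series method should agree.
codes_fn, uniques_fn = pd.factorize(colors)
codes_m, uniques_m = colors.factorize()
same_codes = np.array_equal(codes_fn, codes_m)
same_uniques = uniques_fn.equals(uniques_m)

# Index.factorize() follows the same pattern.
idx_codes, idx_uniques = pd.Index(["x", "y", "x"]).factorize()

print(codes_m.tolist())    # [0, 1, 0, -1, 2]
print(list(uniques_m))     # ['red', 'blue', 'green']
print(idx_codes.tolist())  # [0, 1, 0]
```

Because the input here is a pandas object, uniques comes back as an Index rather than an ndarray, and the missing entry is encoded as -1 under the default use_na_sentinel=True.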