pandas.wide_to_long#

pandas.wide_to_long(df, stubnames, i, j, sep='', suffix='\\d+')[source]#

Unpivot a DataFrame from wide to long format.

Less flexible but more user-friendly than melt.

With stubnames [‘A’, ‘B’], this function expects to find one or more group of columns with format A-suffix1, A-suffix2,…, B-suffix1, B-suffix2,… You specify what you want to call this suffix in the resulting long format with j (for example j=’year’)

Each row of these wide variables are assumed to be uniquely identified by i (can be a single column name or a list of column names)

All remaining variables in the data frame are left intact.

Parameters:
dfDataFrame

The wide-format DataFrame.

stubnamesstr or list-like

The stub name(s). The wide format variables are assumed to start with the stub names.

istr or list-like

Column(s) to use as id variable(s).

jstr

The name of the sub-observation variable. What you wish to name your suffix in the long format.

sepstr, default “”

A character indicating the separation of the variable names in the wide format, to be stripped from the names in the long format. For example, if your column names are A-suffix1, A-suffix2, you can strip the hyphen by specifying sep=’-’.

suffixstr, default ‘\d+’

A regular expression capturing the wanted suffixes. ‘\d+’ captures numeric suffixes. Suffixes with no numbers could be specified with the negated character class ‘\D+’. You can also further disambiguate suffixes, for example, if your wide variables are of the form A-one, B-two,.., and you have an unrelated column A-rating, you can ignore the last one by specifying suffix=’(!?one|two)’. When all suffixes are numeric, they are cast to int64/float64.

Returns:
DataFrame

A DataFrame that contains each stub name as a variable, with new index (i, j).

See also

melt

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

pivot

Create a spreadsheet-style pivot table as a DataFrame.

DataFrame.pivot

Pivot without aggregation that can handle non-numeric data.

DataFrame.pivot_table

Generalization of pivot that can handle duplicate values for one index/column pair.

DataFrame.unstack

Pivot based on the index values instead of a column.

Notes

All extra variables are left untouched. This simply uses pandas.melt under the hood, but is hard-coded to “do the right thing” in a typical case.

Examples

>>> np.random.seed(123)
>>> df = pd.DataFrame(
...     {
...         "A1970": {0: "a", 1: "b", 2: "c"},
...         "A1980": {0: "d", 1: "e", 2: "f"},
...         "B1970": {0: 2.5, 1: 1.2, 2: 0.7},
...         "B1980": {0: 3.2, 1: 1.3, 2: 0.1},
...         "X": dict(zip(range(3), np.random.randn(3))),
...     }
... )
>>> df["id"] = df.index
>>> df
  A1970 A1980  B1970  B1980         X  id
0     a     d    2.5    3.2 -1.085631   0
1     b     e    1.2    1.3  0.997345   1
2     c     f    0.7    0.1  0.282978   2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
... 
                X  A    B
id year
0  1970 -1.085631  a  2.5
1  1970  0.997345  b  1.2
2  1970  0.282978  c  0.7
0  1980 -1.085631  d  3.2
1  1980  0.997345  e  1.3
2  1980  0.282978  f  0.1

With multiple id columns

>>> df = pd.DataFrame(
...     {
...         "famid": [1, 1, 1, 2, 2, 2, 3, 3, 3],
...         "birth": [1, 2, 3, 1, 2, 3, 1, 2, 3],
...         "ht1": [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...         "ht2": [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9],
...     }
... )
>>> df
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9
>>> long_format = pd.wide_to_long(df, stubnames="ht", i=["famid", "birth"], j="age")
>>> long_format
... 
                  ht
famid birth age
1     1     1    2.8
            2    3.4
      2     1    2.9
            2    3.8
      3     1    2.2
            2    2.9
2     1     1    2.0
            2    3.2
      2     1    1.8
            2    2.8
      3     1    1.9
            2    2.4
3     1     1    2.2
            2    3.3
      2     1    2.3
            2    3.4
      3     1    2.1
            2    2.9

Going from long back to wide just takes some creative use of unstack

>>> wide_format = long_format.unstack()
>>> wide_format.columns = wide_format.columns.map("{0[0]}{0[1]}".format)
>>> wide_format.reset_index()
   famid  birth  ht1  ht2
0      1      1  2.8  3.4
1      1      2  2.9  3.8
2      1      3  2.2  2.9
3      2      1  2.0  3.2
4      2      2  1.8  2.8
5      2      3  1.9  2.4
6      3      1  2.2  3.3
7      3      2  2.3  3.4
8      3      3  2.1  2.9

Less wieldy column names are also handled

>>> np.random.seed(0)
>>> df = pd.DataFrame(
...     {
...         "A(weekly)-2010": np.random.rand(3),
...         "A(weekly)-2011": np.random.rand(3),
...         "B(weekly)-2010": np.random.rand(3),
...         "B(weekly)-2011": np.random.rand(3),
...         "X": np.random.randint(3, size=3),
...     }
... )
>>> df["id"] = df.index
>>> df  
   A(weekly)-2010  A(weekly)-2011  B(weekly)-2010  B(weekly)-2011  X  id
0        0.548814        0.544883        0.437587        0.383442  0   0
1        0.715189        0.423655        0.891773        0.791725  1   1
2        0.602763        0.645894        0.963663        0.528895  1   2
>>> pd.wide_to_long(df, ["A(weekly)", "B(weekly)"], i="id", j="year", sep="-")
... 
         X  A(weekly)  B(weekly)
id year
0  2010  0   0.548814   0.437587
1  2010  1   0.715189   0.891773
2  2010  1   0.602763   0.963663
0  2011  0   0.544883   0.383442
1  2011  1   0.423655   0.791725
2  2011  1   0.645894   0.528895

If we have many columns, we could also use a regex to find our stubnames and pass that list on to wide_to_long

>>> stubnames = sorted(
...     set(
...         [
...             match[0]
...             for match in df.columns.str.findall(r"[A-B]\(.*\)").values
...             if match != []
...         ]
...     )
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']

All of the above examples have integers as suffixes. It is possible to have non-integers as suffixes.

>>> df = pd.DataFrame(
...     {
...         "famid": [1, 1, 1, 2, 2, 2, 3, 3, 3],
...         "birth": [1, 2, 3, 1, 2, 3, 1, 2, 3],
...         "ht_one": [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
...         "ht_two": [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9],
...     }
... )
>>> df
   famid  birth  ht_one  ht_two
0      1      1     2.8     3.4
1      1      2     2.9     3.8
2      1      3     2.2     2.9
3      2      1     2.0     3.2
4      2      2     1.8     2.8
5      2      3     1.9     2.4
6      3      1     2.2     3.3
7      3      2     2.3     3.4
8      3      3     2.1     2.9
>>> long_format = pd.wide_to_long(
...     df, stubnames="ht", i=["famid", "birth"], j="age", sep="_", suffix=r"\w+"
... )
>>> long_format
... 
                  ht
famid birth age
1     1     one  2.8
            two  3.4
      2     one  2.9
            two  3.8
      3     one  2.2
            two  2.9
2     1     one  2.0
            two  3.2
      2     one  1.8
            two  2.8
      3     one  1.9
            two  2.4
3     1     one  2.2
            two  3.3
      2     one  2.3
            two  3.4
      3     one  2.1
            two  2.9