【3.6.1】Pandas: wide to long with melt

June 18, 2022 pandas

pandas offers two ways to convert wide data to long data: melt and wide_to_long.

1. melt

pandas.melt(frame: pandas.core.frame.DataFrame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None) → pandas.core.frame.DataFrame

Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.

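melt "melts" (unpivots) columns into rows. It is also available as a DataFrame method and is used frequently. The parameters are explained below:

DataFrame.melt(id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None, ignore_index=True)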

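id_vars: [tuple, list, ndarray], columns to use as identifier variables; they are not melted.
value_vars: [tuple, list, ndarray], columns to melt; defaults to every column not listed in id_vars.
var_name: [scalar], name of the resulting variable column; defaults to variable (or the name of the column index, if it has one).
value_name: [scalar], name of the resulting value column; defaults to value.
col_level: [int, str], which level to melt when the columns are a MultiIndex.
ignore_index: [bool], if True (the default) the original index is discarded and replaced by a fresh 0..n-1 index; if False the original index is kept.

import pandas as pd
pd.set_option('display.notebook_repr_html',False)

# wide data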

w_df = pd.DataFrame({'A': [1,2,3],
                   'B': [4,5,6],
                   'C': [7,8,9]})
w_df

   A  B  C
0  1  4  7
1  2  5  8
2  3  6  9

When no arguments are passed, every column is melted.

# melt all columns

w_df.melt()

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9

Set id_vars to keep some columns as identifiers that are not melted; all remaining columns are melted.

# A is the identifier; B and C are melted

w_df.melt(id_vars=['A'])

   A variable  value
0  1        B      4
1  2        B      5
2  3        B      6
3  1        C      7
4  2        C      8
5  3        C      9

# A and B are identifiers; C is melted

w_df.melt(id_vars=['A','B'])

   A  B variable  value
0  1  4        C      7
1  2  5        C      8
2  3  6        C      9
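To make the role of id_vars more concrete, here is a small sketch (not part of the original walkthrough) that melts B and C with A as the identifier and then pivots the long result back to the wide shape. It assumes the values in A are unique, since pivot needs each (index, column) pair to appear only once; the names long_df and restored are purely illustrative.

```python
import pandas as pd

w_df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Melt B and C while keeping A as the identifier column.
long_df = w_df.melt(id_vars=["A"])

# Pivot back: one row per value of A, one column per former column name.
restored = (
    long_df.pivot(index="A", columns="variable", values="value")
    .reset_index()                  # turn A back into an ordinary column
    .rename_axis(columns=None)      # drop the leftover "variable" axis name
)
print(restored)
```

If the round trip works on your data, the identifier columns really do identify each original row.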

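Set value_vars to choose which columns are melted. Note that the remaining columns do not automatically become identifier columns; they are dropped from the result.

# melt only A

w_df.melt(value_vars=['A'])

  variable  value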
0        A      1
1        A      2
2        A      3
# melt only A and B
w_df.melt(value_vars=['A','B'])

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
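Because columns left out of value_vars are dropped rather than kept as identifiers, id_vars and value_vars are usually combined. A minimal sketch, with subset as an illustrative name:

```python
import pandas as pd

w_df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Keep A as the identifier, melt only B; C appears in neither list and is dropped.
subset = w_df.melt(id_vars=["A"], value_vars=["B"])
print(subset)
```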

Set var_name (default variable) and value_name (default value) to name the melted variable and value columns.

# name the variable and value columns

w_df.melt(var_name='code',value_name='count')

  code  count
0    A      1
1    A      2
2    A      3
3    B      4
4    B      5
5    B      6
6    C      7
7    C      8
8    C      9

Set ignore_index=False to keep the index of the original data.

w_df.melt(ignore_index=False)

  variable  value
0        A      1
1        A      2
2        A      3
0        B      4
1        B      5
2        B      6
0        C      7
1        C      8
2        C      9
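One way to use the retained index is to trace each long row back to the wide row it came from. The sketch below assumes the original index is the default unnamed RangeIndex, so reset_index produces a column called index; the names kept, row0, tidy and row are illustrative only.

```python
import pandas as pd

w_df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})

# Keep the original row labels; each label now appears once per melted column.
kept = w_df.melt(ignore_index=False)

# All long rows that came from original row 0.
row0 = kept.loc[0]

# Alternatively, turn the row label into an explicit column.
tidy = kept.reset_index().rename(columns={"index": "row"})
print(row0)
print(tidy.head())
```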

When the columns are a MultiIndex, set col_level to choose which column level is used for melting.

# data with MultiIndex columns

mi_w_df=w_df.copy()
mi_w_df.columns=[list('ABC'),list('DEF')]
mi_w_df

   A  B  C
   D  E  F
0  1  4  7
1  2  5  8
2  3  6  9

# melt using the first column level

mi_w_df.melt(col_level=0)

  variable  value
0        A      1
1        A      2
2        A      3
3        B      4
4        B      5
5        B      6
6        C      7
7        C      8
8        C      9

# melt using the second column level

mi_w_df.melt(col_level=1)

  variable  value
0        D      1
1        D      2
2        D      3
3        E      4
4        E      5
5        E      6
6        F      7
7        F      8
8        F      9
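For comparison, here is a sketch of what happens when col_level is left out for the same kind of data. As far as I know, pandas then keeps every column level and, because the levels are unnamed, labels them variable_0 and variable_1. The frame is built with pd.MultiIndex.from_arrays here, which is an explicit equivalent of the list-of-lists assignment above.

```python
import pandas as pd

# Same data as mi_w_df, with the two column levels built explicitly.
mi_w_df = pd.DataFrame(
    [[1, 4, 7], [2, 5, 8], [3, 6, 9]],
    columns=pd.MultiIndex.from_arrays([list("ABC"), list("DEF")]),
)

# No col_level: both levels are kept, one variable column per level.
both = mi_w_df.melt()
print(both)
```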

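2. wide_to_long

wide_to_long is a top-level pandas function (not a DataFrame method). It complements melt and is better suited to the special case where column names share a stub-plus-suffix structure.

pandas.wide_to_long(df, stubnames, i, j, sep='', suffix=r'\d+')

df: [pd.DataFrame], the wide DataFrame
stubnames: [str, list-like], the stub name(s) that the wide columns start with
i: [str, list-like], column(s) to use as identifier (index) variables
j: [str], name of the new index level that holds the suffixes
sep: [str, default ""], separator between the stub name and the suffix
suffix: [str, default r"\d+"], regular expression that the suffix must match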

# wide data

s_df = pd.DataFrame({"A1970" : [1,33,3],
                   "B1980" : [3,5,7],
                   "A1980" : [13,15,17],
                   "B1970" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
s_df

   A1970  B1980  A1980  B1970  x  y
0      1      3     13      6  1  4
1     33      5     15      8  2  5
2      3      7     17     14  3  6

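In this data the column names A1970, B1980, A1980 and B1970 share the same structure (a stub followed by a year). To split them into one column per stub plus a year level, use the wide_to_long function.

# wide to long for specific columns

pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i='x')

        y   A   B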
x year           
1 1970  4   1   6
  1980  4  13   3
2 1970  5  33   8
  1980  5  15   5
3 1970  6   3  14
  1980  6  17   7
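To show how wide_to_long relates to melt, the following sketch rebuilds a comparable result by hand: melt, split each column name into stub and year with a regular expression, then spread the stubs back into columns. This is only an illustration of the idea, not how pandas implements it; the pattern assumes the stubs are exactly A and B, and long_df, parts, manual and built_in are illustrative names.

```python
import pandas as pd

s_df = pd.DataFrame({"A1970": [1, 33, 3],
                     "B1980": [3, 5, 7],
                     "A1980": [13, 15, 17],
                     "B1970": [6, 8, 14],
                     "x": [1, 2, 3],
                     "y": [4, 5, 6]})

# Step 1: melt every stub column, keeping x and y as identifiers.
long_df = s_df.melt(id_vars=["x", "y"])

# Step 2: split names such as "A1970" into stub "A" and year 1970.
parts = long_df["variable"].str.extract(r"^(?P<stub>[AB])(?P<year>\d+)$")
long_df["stub"] = parts["stub"]
long_df["year"] = parts["year"].astype(int)

# Step 3: spread the stubs back out, one column per stub.
manual = long_df.set_index(["x", "y", "year", "stub"])["value"].unstack("stub")

# The built-in function, for comparison.
built_in = pd.wide_to_long(s_df, stubnames=["A", "B"], i=["x", "y"], j="year")
print(manual)
print(built_in.sort_index())
```

The manual route needs three steps and a hand-written pattern, which is the work wide_to_long saves when the column names follow a regular stub-plus-suffix scheme.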

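The function uses the strings in stubnames to find matching columns and converts only those to long form. Columns that do not match (B1970 and B1980 below) are kept as ordinary columns.

# convert only the columns that start with the stub A

pd.wide_to_long(s_df,stubnames=['A',],j='year',i='x')

        B1970  y  B1980   A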
x year                     
1 1970      6  4      3   1
2 1970      8  5      5  33
3 1970     14  6      7   3
1 1980      6  4      3  13
2 1980      8  5      5  15
3 1980     14  6      7  17

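If no column in the DataFrame matches the strings given in stubnames, an empty DataFrame is returned.

# no column name starts with the stub C, so an empty DataFrame is returned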

pd.wide_to_long(s_df,stubnames=['C',],j='year',i='x')

Empty DataFrame
Columns: [B1970, y, A1980, B1980, A1970, C]
Index: []
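Since an unmatched stub produces an empty result rather than an error, it can help to check beforehand which columns a stub would match. The helper below, matching_columns, is hypothetical (it is not part of pandas); it assumes a column matches when its whole name is the stub, then the separator, then something matching the suffix pattern.

```python
import re

import pandas as pd


def matching_columns(df, stub, sep="", suffix=r"\d+"):
    """Hypothetical helper: list the columns named stub + sep + suffix."""
    pattern = re.compile(rf"^{re.escape(stub)}{re.escape(sep)}{suffix}$")
    return [col for col in df.columns if pattern.match(str(col))]


s_df = pd.DataFrame({"A1970": [1, 33, 3],
                     "B1980": [3, 5, 7],
                     "A1980": [13, 15, 17],
                     "B1970": [6, 8, 14],
                     "x": [1, 2, 3],
                     "y": [4, 5, 6]})

print(matching_columns(s_df, "A"))  # columns the stub A would pick up
print(matching_columns(s_df, "C"))  # nothing matches the stub C
```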

The parameter i can be set to several columns, which produces several index levels.

# use multiple index columns

pd.wide_to_long(s_df,stubnames=['A','B'],j='year',i=['x','y'])

           A   B
x y year        
1 4 1970   1   6
    1980  13   3
2 5 1970  33   8
    1980  15   5
3 6 1970   3  14
    1980  17   7

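The parameter sep is the separator between stub and suffix. It defaults to "" and can be set to suit the actual column names.

# wide data (with - as the separator)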

sep_df = pd.DataFrame({"A-1970" : [1,33,3],
                   "B-1980" : [3,5,7],
                   "A-1980" : [13,15,17],
                   "B-1970" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
sep_df

   A-1970  B-1980  A-1980  B-1970  x  y
0       1       3      13       6  1  4
1      33       5      15       8  2  5
2       3       7      17      14  3  6

The separator in these column names is -, so sep='-' must be set for the conversion.

# set the sep parameter

pd.wide_to_long(sep_df,stubnames=['A','B'],j='year',i='x',sep='-')

        y   A   B
x year           
1 1970  4   1   6
  1980  4  13   3
2 1970  5  33   8
  1980  5  15   5
3 1970  6   3  14
  1980  6  17   7

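The parameter suffix is a regular expression describing the suffix. The default, r"\d+", matches one or more digits; replace it when the suffixes are not numeric.

# wide data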

suf_df = pd.DataFrame({"Aone" : [1,33,3],
                   "Btwo" : [3,5,7],
                   "Atwo" : [13,15,17],
                   "Bone" : [6,8,14],
                   "x"     : [1,2,3],
                   "y"     : [4,5,6]})
suf_df

   Aone  Btwo  Atwo  Bone  x  y
0     1     3    13     6  1  4
1    33     5    15     8  2  5
2     3     7    17    14  3  6
# specify the suffix pattern
pd.wide_to_long(suf_df,stubnames=['A','B'],j='year',i='x',suffix='(one|two)')

        y   A   B
x year           
1 one   4   1   6
  two   4  13   3
2 one   5  33   8
  two   5  15   5
3 one   6   3  14
  two   6  17   7
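A side effect worth knowing about: to my understanding, wide_to_long casts numeric suffixes to a numeric dtype, while non-numeric suffixes such as one and two stay as strings. The sketch below inspects the year level in both cases; numeric_years and text_years are illustrative names.

```python
import pandas as pd

s_df = pd.DataFrame({"A1970": [1, 33, 3], "B1980": [3, 5, 7],
                     "A1980": [13, 15, 17], "B1970": [6, 8, 14],
                     "x": [1, 2, 3], "y": [4, 5, 6]})
suf_df = pd.DataFrame({"Aone": [1, 33, 3], "Btwo": [3, 5, 7],
                       "Atwo": [13, 15, 17], "Bone": [6, 8, 14],
                       "x": [1, 2, 3], "y": [4, 5, 6]})

numeric = pd.wide_to_long(s_df, stubnames=["A", "B"], i="x", j="year")
text = pd.wide_to_long(suf_df, stubnames=["A", "B"], i="x", j="year",
                       suffix=r"(one|two)")

# Pull the suffix level out of each result's index.
numeric_years = numeric.index.get_level_values("year")
text_years = text.index.get_level_values("year")

print(numeric_years.dtype)      # dtype of the digit suffixes
print(sorted(set(text_years)))  # the distinct word suffixes
```

This matters when filtering afterwards: a numeric level is compared against 1970, a text level against 'one'.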

References

https://zhuanlan.zhihu.com/p/366403545
