【4.1.2】Pandas--DataFrame的group数据分析

一、DataFrame每个group中最大的值

原始数据:

In [1]: df
Out[1]:
    Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7

获取每个group中最大的值

In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count

获取每个group中最大值对应的行

In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
    Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7

每个group中最大的值形成一个新的列

In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
    Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7

二、每个group中提取前10个数据

df.groupby('Sp').head(10)

三、每个group的均值与size

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b', 
                      'a', 'b', 'a', 'a'],
            'key2' : ['one', 'one', 'two', 'three',
                      'two', 'two', 'one', 'three'],
            'data1': np.random.randn(8),
            'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)

运行结果:

      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three

进行分组

# dataframe根据key1进行分组
print(type(df_obj.groupby('key1')))

# dataframe的 data1 列根据 key1 进行分组
print(type(df_obj['data1'].groupby(df_obj['key1'])))

示例结果:

<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>

示例代码:

# 分组运算
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())

运行结果

         data1     data2
key1                    
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64

size() 返回每个分组的元素个数

示例代码:

# size
print(grouped1.size())
print(grouped2.size())

运行结果:

key1
a    5
b    3
dtype: int64

key1
a    5
b    3
dtype: int64

四、按自定义的key分组

obj.groupby(self_def_key)

自定义的key可为列表或多层列表

obj.groupby([‘label1’, ‘label2’])->多层dataframe

# 按自定义key分组,列表
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# 按自定义key分组,多层列表
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# 按多个列多层分组
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# 多层分组按key的顺序进行
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack可以将多层索引的结果转换成单层的dataframe
print(grouped3.mean().unstack())

运行结果:

0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64

key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64


key1  key2 
a     one      2
      three    1
      two      2
b     one      1
      three    1
      two      1
dtype: int64


               data1     data2
key2  key1                    
one   a     0.174988 -0.110804
      b    -0.214324  0.758372
three a    -1.019229 -1.143825
      b     0.522911  0.630814
two   a     1.428099  0.107465
      b    -0.264616  1.017155

          data1               data2          
key1          a         b         a         b
key2                                         
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155

参考资料

个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn

Sam avatar
About Sam
专注生物信息 专注转化医学