# 【4.1.3】Pandas--DataFrame的group数据分析

1. DataFrame每个group中最大的值
2. 每个group中提取前10个数据
3. 每个group的均值与size
4. 按自定义的key分组
5. groupby根据某一列 对于那一列进行计算

## 一、DataFrame每个group中最大的值

In [1]: df
Out[1]:
Sp  Mt Value  count
0  MM1  S1     a      3
1  MM1  S1     n      2
2  MM1  S3    cb      5
3  MM2  S3    mk      8
4  MM2  S4    bg     10
5  MM2  S4   dgd      1
6  MM4  S2    rd      2
7  MM4  S2    cb      2
8  MM4  S2   uyi      7


In [2]: df.groupby(['Mt'], sort=False)['count'].max()
Out[2]:
Mt
S1     3
S3     8
S4    10
S2     7
Name: count


In [3]: idx = df.groupby(['Mt'])['count'].transform(max) == df['count']

In [4]: df[idx]
Out[4]:
Sp  Mt Value  count
0  MM1  S1     a      3
3  MM2  S3    mk      8
4  MM2  S4    bg     10
8  MM4  S2   uyi      7


In [5]: df['count_max'] = df.groupby(['Mt'])['count'].transform(max)

In [6]: df
Out[6]:
Sp  Mt Value  count  count_max
0  MM1  S1     a      3          3
1  MM1  S1     n      2          3
2  MM1  S3    cb      5          8
3  MM2  S3    mk      8          8
4  MM2  S4    bg     10         10
5  MM2  S4   dgd      1         10
6  MM4  S2    rd      2          7
7  MM4  S2    cb      2          7
8  MM4  S2   uyi      7          7


## 二、每个group中提取前10个数据

df.groupby('Sp').head(10)


## 三、每个group的均值与size

import pandas as pd
import numpy as np

dict_obj = {'key1' : ['a', 'b', 'a', 'b',
'a', 'b', 'a', 'a'],
'key2' : ['one', 'one', 'two', 'three',
'two', 'two', 'one', 'three'],
'data1': np.random.randn(8),
'data2': np.random.randn(8)}
df_obj = pd.DataFrame(dict_obj)
print(df_obj)


      data1     data2 key1   key2
0  0.974685 -0.672494    a    one
1 -0.214324  0.758372    b    one
2  1.508838  0.392787    a    two
3  0.522911  0.630814    b  three
4  1.347359 -0.177858    a    two
5 -0.264616  1.017155    b    two
6 -0.624708  0.450885    a    one
7 -1.019229 -1.143825    a  three


# dataframe根据key1进行分组
print(type(df_obj.groupby('key1')))

# dataframe的 data1 列根据 key1 进行分组
print(type(df_obj['data1'].groupby(df_obj['key1'])))


<class 'pandas.core.groupby.DataFrameGroupBy'>
<class 'pandas.core.groupby.SeriesGroupBy'>


# 分组运算
grouped1 = df_obj.groupby('key1')
print(grouped1.mean())

grouped2 = df_obj['data1'].groupby(df_obj['key1'])
print(grouped2.mean())


         data1     data2
key1
a     0.437389 -0.230101
b     0.014657  0.802114
key1
a    0.437389
b    0.014657
Name: data1, dtype: float64


size() 返回每个分组的元素个数

# size
print(grouped1.size())
print(grouped2.size())


key1
a    5
b    3
dtype: int64

key1
a    5
b    3
dtype: int64


## 四、按自定义的key分组

obj.groupby(self_def_key)


obj.groupby([‘label1’, ‘label2’])->多层dataframe

# 按自定义key分组，列表
self_def_key = [0, 1, 2, 3, 3, 4, 5, 7]
print(df_obj.groupby(self_def_key).size())

# 按自定义key分组，多层列表
print(df_obj.groupby([df_obj['key1'], df_obj['key2']]).size())

# 按多个列多层分组
grouped2 = df_obj.groupby(['key1', 'key2'])
print(grouped2.size())

# 多层分组按key的顺序进行
grouped3 = df_obj.groupby(['key2', 'key1'])
print(grouped3.mean())
# unstack可以将多层索引的结果转换成单层的dataframe
print(grouped3.mean().unstack())


0    1
1    1
2    1
3    2
4    1
5    1
7    1
dtype: int64

key1  key2
a     one      2
three    1
two      2
b     one      1
three    1
two      1
dtype: int64

key1  key2
a     one      2
three    1
two      2
b     one      1
three    1
two      1
dtype: int64

data1     data2
key2  key1
one   a     0.174988 -0.110804
b    -0.214324  0.758372
three a    -1.019229 -1.143825
b     0.522911  0.630814
two   a     1.428099  0.107465
b    -0.264616  1.017155

data1               data2
key1          a         b         a         b
key2
one    0.174988 -0.214324 -0.110804  0.758372
three -1.019229  0.522911 -1.143825  0.630814
two    1.428099 -0.264616  0.107465  1.017155


## 五、groupby根据某一列 对于那一列进行计算

import pandas as pd
direction = ['A','A','A','A','B','B','B','B','C']
view_num = [1,1,2,1,2,3,1,1,1]
floor = [2,3,4,4,5,2,3,4,4]

df = pd.DataFrame.from_dict({'direction':direction,'view_num':view_num,'floor':floor})

print(df)

direction  view_num  floor
0         A         1      2
1         A         1      3
2         A         2      4
3         A         1      4
4         B         2      5
5         B         3      2
6         B         1      3
7         B         1      4
8         C         1      4


direction：房子朝向
view_num：看房人数
floor：楼层


A 看房人数最多的朝向

df.groupby(['direction'])['view_num'].sum()

direction
A    5
B    7
C    1
Name: view_num, dtype: int64


B 每个朝向的房子的数量

df.groupby(['direction'])['view_num'].count()

direction
A    4
B    4
C    1
Name: view_num, dtype: int64


C 求不同朝向的房子 平均、最大、最小楼层

df.groupby('direction').agg({'floor':{'max','min','mean'}})

floor
mean    min     max
direction
A   3.25    2   4
B   3.50    2   5
C   4.00    4   4


df = df.reset_index()


1.view_num 在两句代码中的作用

• A 中：将数据按照 direction 进行分类，将同一类的 direction 对应的 view_num 进行求和
• B 中：将数据按照 direction 进行分类，统计 direction 中每个 方向 出现的次数，此处的 view_num 只是代表：选择了这一列的数据，进行展示

2.如果没有 view_num

df.groupby(['direction']).sum()

view_num    floor
direction
A   5   13
B   7   14
C   1   4

df.groupby(['direction']).count()

view_num    floor
direction
A   4   4
B   4   4
C   1   1


3.agg 可以同时进行多项计算

df.groupby('direction').agg({'floor':{'max','min','mean'},'view_num':'max'})

floor   view_num
mean    min     max     view_num
direction
A   3.25    2   4   2
B   3.50    2   5   3
C   4.00    4   4   1

1. 除以group中的最大值

df[‘normal’] = df.groupby(‘direction’)[‘view_num’].transform(lambda x: x / x.max()) print(df)

direction view_num floor normal 0 A 1 2 0.500000 1 A 1 3 0.500000 2 A 2 4 1.000000 3 A 1 4 0.500000 4 B 2 5 0.666667 5 B 3 2 1.000000 6 B 1 3 0.333333 7 B 1 4 0.333333 8 C 1 4 1.000000