【3.3.1】Pandas--Dataframe索引修改

  • 重新索引:reindex
  • 将Index变成coloumn reset_index
  • 删除列或行:drop
  • 修改dataframe列名
  • 获取行数
  • dataframe宽表变窄表
  • dataframe窄表变宽表
  • 删掉重复值: drop_dupliates

获取行名:

list(dataframe.index)

获取列名

column_names_1  = list(data_info.columns.values)

一、reindex()

.reindex()能够改变或重排Series和DataFrame索引

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

d = d.reindex(columns = ['two','one'])
print d
   two  one
a    9    1
b    8    2
c    7    3
d    6    4

d =d.reindex(index = ['d','a','c','b'])
print d
   two  one
d    6    4
a    9    1
c    7    3
b    8    2

.reindex(index=None, columns=None, …)的参数

参数 说明
index, columns 新的行列自定义索引
fill_value 重新索引中,用于填充缺失位置的值
method 填充方法, ffill当前值向前填充,bfill向后填充
limit 最大填充量
copy 默认True,生成新的对象,False时,新旧相等不复制

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6


d2 =d.columns.insert(2,'add')
d3 = d.reindex(columns = d2,fill_value =200)

print d3
   one  two  add
a    1    9  200
b    2    8  200
c    3    7  200
d    4    6  200

Series和DataFrame的索引是Index类型 Index对象是不可修改类型 索引类型常用方法

方法 说明
.append(idx) 连接另一个Index对象,产生新的Index对象
.diff(idx) 计算差集,产生新的Index对象
.intersection(idx) 计算交集
.union(idx) 计算并集
.delete(loc) 删除loc位置处的元素
.insert(loc,e) 在loc位置增加一个元素e

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

nc =d.columns.delete(1)
ni =d.index.insert(4,'w')
d3 = d.reindex(index =ni,columns = nc,method='ffill')

print d3
   one
a    1
b    2
c    3
d    4
w    4

二、 将Index变成coloumn reset_index

重新生成index

df = df.reset_index()
del df['index']

二、drop() 丢掉数据

.drop()能够删除Series和DataFrame指定行或列索引

2.1 示例数据

>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

2.2 丢掉列:

方法一:

>>> df.drop(['B', 'C'], axis=1) # axis=1,丢掉列;如果axis=0,则丢掉行
   A   D
0  0   3
1  4   7
2  8  11

方法二

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

2.3 丢掉行

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

2.4 丢掉行或列

示例数据

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3,0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

丢掉行或列

>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

通过Level指定index或column所属的层次

>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8

三、修改DataFrame列名的方法

数据如下:

>>>import pandas as pd
>>>a = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
>>> a 
 A B C
0 1 4 7
1 2 5 8
2 3 6 9

方法一:暴力方法

>>>a.columns = ['a','b','c']
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

但是缺点是必须写三个,要不报错。

这个方法现在python3也开始报错了:

Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.py

data.org/pandas-docs/stable/indexing.html#attribute-access

方法二:较好的方法

>>>a.rename(columns={'A':'a', 'B':'b', 'C':'c'}, inplace = True)
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

好处是可以随意改个数:

>>>a.rename(columns={'A':'a', 'C':'c'}, inplace = True)
>>>a
 a B c
0 1 4 7
1 2 5 8
2 3 6 9

可以只改变’A',‘C’,不改变’B'。

方法三:重新赋值列名

df_2 = df_2.set_axis(['Amino Acid Sequence','PFSC Sequence','PDB Code','Chain','Start Site','End Site','Score'], axis='columns', inplace=False)

五、获得行数

准备数据

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000,3), columns=['col1', 'col2', 'col3'])
df.iloc[::2,0] = np.nan

获取行数

df.shape  # 得到df的行和列数
#(1000, 3)

df['col1'].count() #去除了NaN的数据
# 500

len(df.index)
# 1000

len(df)
# 1000

时间测评

因为CPU采用了缓存优化,所以计算的时间并不是很准确,但是也有一定的代表性。

%timeit df.shape
#The slowest run took 169.99 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 947 ns per loop

%timeit df['col1'].count()
#The slowest run took 50.63 times longer than the fastest. This could mean that an intermediate result is being cached.
#10000 loops, best of 3: 22.6 µs per loop

%timeit len(df.index)
#The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 490 ns per loop

%timeit len(df)
#The slowest run took 18.61 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 653 ns per loop

我们发现速度最快的是len(df.index) 方法, 其次是len(df)

最慢的是df[‘col1’].count(),因为该函数需要去除NaN,当然结果也与其他结果不同,使用时需要格外注意。

六、宽表变窄表

import pandas as pd
# 伪造一些数据
fake_data = {'subject':['math', 'english'],
            'A': [88, 90],
            'B': [70, 80],
            'C': [60, 78]}

# 宽表
test = pd.DataFrame(fake_data, columns=['subject', 'A', 'B', 'C'])
test
	subject	A	B	C
0	math	88	70	60
1	english	90	80	78

# 转换为窄表
pd.melt(test, id_vars=['subject'])

subject	variable	value
0	math	A	88
1	english	A	90
2	math	B	70
3	english	B	80
4	math	C	60
5	english	C	78

七、窄数据变宽数据

获得dataframe

#coding:utf-8
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

df =  pd.read_table('chain_hl_filtered.tsv',sep=" |\t", engine='python',index_col=False,header=None)  # 以空格或tab来切割数据
df_1 = df[[0,2]]   ## 只要0,2列数据
df_1['value'] = 1  ## 增加一列数据value
df_1.rename(columns={0:'pdb', 2:'chain'}, inplace = True)  ## 该列的名字

print '\ndf_1:\n'
print df_1.head()

df_1:

    pdb chain  value
0  12E8     L      1
1  12E8     H      1
2  12E8     L      1
3  12E8     H      1
4  15C8     L      1

根据pdb和chain列合并数据

df_2 = df_1.groupby(['pdb','chain'])['value'].sum().to_frame()

print '\ndf_2:\n'
print df_2.head()

df_2:

            value
pdb  chain       
12E8 H          2
     L          2
15C8 H          1
     L          1
1A0Q H          1

将index变成列的名字

df_3 = df_2.reset_index(level=['pdb','chain'])  #将index变成column的名字

print '\ndf_3:\n'
print df_3.head()

df_3:

    pdb chain  value
0  12E8     H      2
1  12E8     L      2
2  15C8     H      1
3  15C8     L      1
4  1A0Q     H      1

pivot将窄数据变成宽数据

df_4 =df_3.pivot(index='pdb',columns='chain',values='value').fillna(0)  # fillna将NA填充0

print '\ndf_4:\n'
print df_4.head()

df_4:

chain    H    L
pdb            
12E8   2.0  2.0
15C8   1.0  1.0
1A0Q   1.0  1.0
1A14   1.0  1.0
1A2Y   1.0  1.0

求每一列最小值和最大值

df_4 = df_4.reset_index()  #将index变成column的名字
df_4['sum'] = df_4['H'] +df_4['L']
# df_4['min'] = min(df_4['H'],df_4['L'])
df_4['min_value']=df_4.min(axis=1)
# df_raw['max_index']=np.argmin(np.array(df_raw),axis=1)


print '\ndf_4_2:\n'
print df_4.head()

df_4_2:

chain   pdb    H    L  sum  min_value
0      12E8  2.0  2.0  4.0        2.0
1      15C8  1.0  1.0  2.0        1.0
2      1A0Q  1.0  1.0  2.0        1.0
3      1A14  1.0  1.0  2.0        1.0
4      1A2Y  1.0  1.0  2.0        1.0


print df_4[df_4['min_value']>15]

作图

plt.hist(df_4['min_value'],bins=30,log=True)
plt.xlabel('#Pair HV-HL ')
plt.ylabel('# Count')

fig = matplotlib.pyplot.gcf()
fig.set_size_inches(18.5,10.5)

fig.savefig('pic/pair_HL_chains_distribution.png')

八、去重–drop_dupliates

某一行Series的duplicate

df['Seqno'].duplicated().head()
Out[47]: 
0    False
1     True
2     True
3     True
4    False
Name: Seqno, dtype: bool

keep参数指定保留哪一个

df.drop_duplicates(['Seqno'], keep='last').head()
Out[53]: 
    Unnamed: 0   Price  Seqno Symbol        time
3            3  1623.0    0.0   APPL  1473411963
7            7  1649.0    1.0   APPL  1473411964
11          11  1642.0    2.0   APPL  1473411965
15          15  1636.0    3.0   APPL  1473411966
19          19  1669.0    4.0   APPL  1473411967

讨论

1. ix 函数已经被丢弃

  • .loc for label based indexing or
  • .iloc for positional indexing

参考资料

药企,独角兽,苏州。团队长期招人,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn