【3.3.1】Pandas--Dataframe索引修改
- 重新索引:reindex
- 将Index变成coloumn reset_index
- 删除列或行:drop
- 修改dataframe列名
- 获取行数
- dataframe宽表变窄表
- dataframe窄表变宽表
- 删掉重复值: drop_dupliates
获取行名:
list(dataframe.index)
获取列名
column_names_1 = list(data_info.columns.values)
一、reindex()
.reindex()能够改变或重排Series和DataFrame索引
import pandas as pd
import numpy as np
dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
one two
a 1 9
b 2 8
c 3 7
d 4 6
d = d.reindex(columns = ['two','one'])
print d
two one
a 9 1
b 8 2
c 7 3
d 6 4
d =d.reindex(index = ['d','a','c','b'])
print d
two one
d 6 4
a 9 1
c 7 3
b 8 2
.reindex(index=None, columns=None, …)的参数
参数 说明
index, columns 新的行列自定义索引
fill_value 重新索引中,用于填充缺失位置的值
method 填充方法, ffill当前值向前填充,bfill向后填充
limit 最大填充量
copy 默认True,生成新的对象,False时,新旧相等不复制
案例:
import pandas as pd
import numpy as np
dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
one two
a 1 9
b 2 8
c 3 7
d 4 6
d2 =d.columns.insert(2,'add')
d3 = d.reindex(columns = d2,fill_value =200)
print d3
one two add
a 1 9 200
b 2 8 200
c 3 7 200
d 4 6 200
Series和DataFrame的索引是Index类型 Index对象是不可修改类型 索引类型常用方法
方法 说明
.append(idx) 连接另一个Index对象,产生新的Index对象
.diff(idx) 计算差集,产生新的Index对象
.intersection(idx) 计算交集
.union(idx) 计算并集
.delete(loc) 删除loc位置处的元素
.insert(loc,e) 在loc位置增加一个元素e
案例:
import pandas as pd
import numpy as np
dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
one two
a 1 9
b 2 8
c 3 7
d 4 6
nc =d.columns.delete(1)
ni =d.index.insert(4,'w')
d3 = d.reindex(index =ni,columns = nc,method='ffill')
print d3
one
a 1
b 2
c 3
d 4
w 4
二、 将Index变成coloumn reset_index
重新生成index
df = df.reset_index()
del df['index']
二、drop() 丢掉数据
.drop()能够删除Series和DataFrame指定行或列索引
2.1 示例数据
>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
... columns=['A', 'B', 'C', 'D'])
>>> df
A B C D
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
2.2 丢掉列:
方法一:
>>> df.drop(['B', 'C'], axis=1) # axis=1,丢掉列;如果axis=0,则丢掉行
A D
0 0 3
1 4 7
2 8 11
方法二
>>> df.drop(columns=['B', 'C'])
A D
0 0 3
1 4 7
2 8 11
2.3 丢掉行
>>> df.drop([0, 1])
A B C D
2 8 9 10 11
2.4 丢掉行或列
示例数据
>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
... ['speed', 'weight', 'length']],
... labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
... [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
... data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
... [250, 150], [1.5, 0.8], [320, 250],
... [1, 0.8], [0.3,0.2]])
>>> df
big small
lama speed 45.0 30.0
weight 200.0 100.0
length 1.5 1.0
cow speed 30.0 20.0
weight 250.0 150.0
length 1.5 0.8
falcon speed 320.0 250.0
weight 1.0 0.8
length 0.3 0.2
丢掉行或列
>>> df.drop(index='cow', columns='small')
big
lama speed 45.0
weight 200.0
length 1.5
falcon speed 320.0
weight 1.0
length 0.3
通过Level指定index或column所属的层次
>>> df.drop(index='length', level=1)
big small
lama speed 45.0 30.0
weight 200.0 100.0
cow speed 30.0 20.0
weight 250.0 150.0
falcon speed 320.0 250.0
weight 1.0 0.8
三、修改DataFrame列名的方法
数据如下:
>>>import pandas as pd
>>>a = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
>>> a
A B C
0 1 4 7
1 2 5 8
2 3 6 9
方法一:暴力方法
>>>a.columns = ['a','b','c']
>>>a
a b c
0 1 4 7
1 2 5 8
2 3 6 9
但是缺点是必须写三个,要不报错。
这个方法现在python3也开始报错了:
Pandas doesn't allow columns to be created via a new attribute name - see https://pandas.py
data.org/pandas-docs/stable/indexing.html#attribute-access
方法二:较好的方法
>>>a.rename(columns={'A':'a', 'B':'b', 'C':'c'}, inplace = True)
>>>a
a b c
0 1 4 7
1 2 5 8
2 3 6 9
好处是可以随意改个数:
>>>a.rename(columns={'A':'a', 'C':'c'}, inplace = True)
>>>a
a B c
0 1 4 7
1 2 5 8
2 3 6 9
可以只改变’A',‘C’,不改变’B'。
方法三:重新赋值列名
df_2 = df_2.set_axis(['Amino Acid Sequence','PFSC Sequence','PDB Code','Chain','Start Site','End Site','Score'], axis='columns', inplace=False)
五、获得行数
准备数据
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000,3), columns=['col1', 'col2', 'col3'])
df.iloc[::2,0] = np.nan
获取行数
df.shape # 得到df的行和列数
#(1000, 3)
df['col1'].count() #去除了NaN的数据
# 500
len(df.index)
# 1000
len(df)
# 1000
时间测评
因为CPU采用了缓存优化,所以计算的时间并不是很准确,但是也有一定的代表性。
%timeit df.shape
#The slowest run took 169.99 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 947 ns per loop
%timeit df['col1'].count()
#The slowest run took 50.63 times longer than the fastest. This could mean that an intermediate result is being cached.
#10000 loops, best of 3: 22.6 µs per loop
%timeit len(df.index)
#The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 490 ns per loop
%timeit len(df)
#The slowest run took 18.61 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 653 ns per loop
我们发现速度最快的是len(df.index) 方法, 其次是len(df)
最慢的是df[‘col1’].count(),因为该函数需要去除NaN,当然结果也与其他结果不同,使用时需要格外注意。
六、宽表变窄表
import pandas as pd
# 伪造一些数据
fake_data = {'subject':['math', 'english'],
'A': [88, 90],
'B': [70, 80],
'C': [60, 78]}
# 宽表
test = pd.DataFrame(fake_data, columns=['subject', 'A', 'B', 'C'])
test
subject A B C
0 math 88 70 60
1 english 90 80 78
# 转换为窄表
pd.melt(test, id_vars=['subject'])
subject variable value
0 math A 88
1 english A 90
2 math B 70
3 english B 80
4 math C 60
5 english C 78
七、窄数据变宽数据
获得dataframe
#coding:utf-8
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib
df = pd.read_table('chain_hl_filtered.tsv',sep=" |\t", engine='python',index_col=False,header=None) # 以空格或tab来切割数据
df_1 = df[[0,2]] ## 只要0,2列数据
df_1['value'] = 1 ## 增加一列数据value
df_1.rename(columns={0:'pdb', 2:'chain'}, inplace = True) ## 该列的名字
print '\ndf_1:\n'
print df_1.head()
df_1:
pdb chain value
0 12E8 L 1
1 12E8 H 1
2 12E8 L 1
3 12E8 H 1
4 15C8 L 1
根据pdb和chain列合并数据
df_2 = df_1.groupby(['pdb','chain'])['value'].sum().to_frame()
print '\ndf_2:\n'
print df_2.head()
df_2:
value
pdb chain
12E8 H 2
L 2
15C8 H 1
L 1
1A0Q H 1
将index变成列的名字
df_3 = df_2.reset_index(level=['pdb','chain']) #将index变成column的名字
print '\ndf_3:\n'
print df_3.head()
df_3:
pdb chain value
0 12E8 H 2
1 12E8 L 2
2 15C8 H 1
3 15C8 L 1
4 1A0Q H 1
pivot将窄数据变成宽数据
df_4 =df_3.pivot(index='pdb',columns='chain',values='value').fillna(0) # fillna将NA填充0
print '\ndf_4:\n'
print df_4.head()
df_4:
chain H L
pdb
12E8 2.0 2.0
15C8 1.0 1.0
1A0Q 1.0 1.0
1A14 1.0 1.0
1A2Y 1.0 1.0
求每一列最小值和最大值
df_4 = df_4.reset_index() #将index变成column的名字
df_4['sum'] = df_4['H'] +df_4['L']
# df_4['min'] = min(df_4['H'],df_4['L'])
df_4['min_value']=df_4.min(axis=1)
# df_raw['max_index']=np.argmin(np.array(df_raw),axis=1)
print '\ndf_4_2:\n'
print df_4.head()
df_4_2:
chain pdb H L sum min_value
0 12E8 2.0 2.0 4.0 2.0
1 15C8 1.0 1.0 2.0 1.0
2 1A0Q 1.0 1.0 2.0 1.0
3 1A14 1.0 1.0 2.0 1.0
4 1A2Y 1.0 1.0 2.0 1.0
print df_4[df_4['min_value']>15]
作图
plt.hist(df_4['min_value'],bins=30,log=True)
plt.xlabel('#Pair HV-HL ')
plt.ylabel('# Count')
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(18.5,10.5)
fig.savefig('pic/pair_HL_chains_distribution.png')
八、去重–drop_dupliates
某一行Series的duplicate
df['Seqno'].duplicated().head()
Out[47]:
0 False
1 True
2 True
3 True
4 False
Name: Seqno, dtype: bool
keep参数指定保留哪一个
df.drop_duplicates(['Seqno'], keep='last').head()
Out[53]:
Unnamed: 0 Price Seqno Symbol time
3 3 1623.0 0.0 APPL 1473411963
7 7 1649.0 1.0 APPL 1473411964
11 11 1642.0 2.0 APPL 1473411965
15 15 1636.0 3.0 APPL 1473411966
19 19 1669.0 4.0 APPL 1473411967
讨论
1. ix 函数已经被丢弃
- .loc for label based indexing or
- .iloc for positional indexing
参考资料
- https://blog.csdn.net/tanzuozhev/article/details/77411467
- https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.drop.html
- https://blog.csdn.net/Guo_ya_nan/article/details/82955345
- https://www.jianshu.com/p/89de6c085d22
- http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
这里是一个广告位,,感兴趣的都可以发邮件聊聊:tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn
个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn