【3】数据分析--10--科学计算--Pandas--3--Dataframe索引的修改

  • 如何改变Series和DataFrame对象?
  • 增加或重排:重新索引
  • 删除:drop

获取行名:

list(dataframe.index)

获取列名

column_names_1  = list(data_info.columns.values)

一、reindex()

.reindex()能够改变或重排Series和DataFrame索引

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

d = d.reindex(columns = ['two','one'])
print d
   two  one
a    9    1
b    8    2
c    7    3
d    6    4

d =d.reindex(index = ['d','a','c','b'])
print d
   two  one
d    6    4
a    9    1
c    7    3
b    8    2

.reindex(index=None, columns=None, …)的参数

参数 说明
index, columns 新的行列自定义索引
fill_value 重新索引中,用于填充缺失位置的值
method 填充方法, ffill当前值向前填充,bfill向后填充
limit 最大填充量
copy 默认True,生成新的对象,False时,新旧相等不复制

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6


d2 =d.columns.insert(2,'add')
d3 = d.reindex(columns = d2,fill_value =200)

print d3
   one  two  add
a    1    9  200
b    2    8  200
c    3    7  200
d    4    6  200

Series和DataFrame的索引是Index类型 Index对象是不可修改类型 索引类型常用方法

方法 说明
.append(idx) 连接另一个Index对象,产生新的Index对象
.diff(idx) 计算差集,产生新的Index对象
.intersection(idx) 计算交集
.union(idx) 计算并集
.delete(loc) 删除loc位置处的元素
.insert(loc,e) 在loc位置增加一个元素e

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

nc =d.columns.delete(1)
ni =d.index.insert(4,'w')
d3 = d.reindex(index =ni,columns = nc,method='ffill')

print d3
   one
a    1
b    2
c    3
d    4
w    4

二、drop() 丢掉数据

.drop()能够删除Series和DataFrame指定行或列索引

2.1 示例数据

>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

2.2 丢掉列:

方法一:

>>> df.drop(['B', 'C'], axis=1) # axis=1,丢掉列;如果axis=0,则丢掉行
   A   D
0  0   3
1  4   7
2  8  11

方法二

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

2.3 丢掉行

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

2.4 丢掉行或列

示例数据

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3,0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

丢掉行或列

>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

通过Level指定index或column所属的层次

>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8

三、修改DataFrame列名的方法

数据如下:

>>>import pandas as pd
>>>a = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
>>> a 
 A B C
0 1 4 7
1 2 5 8
2 3 6 9

方法一:暴力方法

>>>a.columns = ['a','b','c']
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

但是缺点是必须写三个,要不报错。

方法二:较好的方法

>>>a.rename(columns={'A':'a', 'B':'b', 'C':'c'}, inplace = True)
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

好处是可以随意改个数:

>>>a.rename(columns={'A':'a', 'C':'c'}, inplace = True)
>>>a
 a B c
0 1 4 7
1 2 5 8
2 3 6 9

可以只改变’A’,‘C’,不改变’B’。

四、更改pandas dataframe 列的顺序

这是我的df:

                             Net   Upper   Lower  Mid  Zsore
Answer option                                                
More than once a day          0%   0.22%  -0.12%   2    65 
Once a day                    0%   0.32%  -0.19%   3    45
Several times a week          2%   2.45%   1.10%   4    78
Once a week                   1%   1.63%  -0.40%   6    65

怎样将mid这一列移动到第一列?

                   Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

方法一: ix

In [27]:
# get a list of columns
cols = list(df)
# move the column to head of list using index, pop and insert
cols.insert(0, cols.pop(cols.index('Mid')))
cols
Out[27]:
['Mid', 'Net', 'Upper', 'Lower', 'Zsore']
In [28]:
# use ix to reorder
df = df.ix[:, cols]
df
Out[28]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

方法二:

In [39]:
mid = df['Mid']
df.drop(labels=['Mid'], axis=1,inplace = True)
df.insert(0, 'Mid', mid)
df
Out[39]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

五、获得行数

准备数据

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000,3), columns=['col1', 'col2', 'col3'])
df.iloc[::2,0] = np.nan

获取行数

df.shape  # 得到df的行和列数
#(1000, 3)

df['col1'].count() #去除了NaN的数据
# 500

len(df.index)
# 1000

len(df)
# 1000

时间测评

因为CPU采用了缓存优化,所以计算的时间并不是很准确,但是也有一定的代表性。

%timeit df.shape
#The slowest run took 169.99 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 947 ns per loop

%timeit df['col1'].count()
#The slowest run took 50.63 times longer than the fastest. This could mean that an intermediate result is being cached.
#10000 loops, best of 3: 22.6 µs per loop

%timeit len(df.index)
#The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 490 ns per loop

%timeit len(df)
#The slowest run took 18.61 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 653 ns per loop

我们发现速度最快的是len(df.index) 方法, 其次是len(df)

最慢的是df[‘col1’].count(),因为该函数需要去除NaN,当然结果也与其他结果不同,使用时需要格外注意。

参考资料

个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn

Sam avatar
About Sam
专注生物信息 专注转化医学