【3】数据分析--10--科学计算--Pandas--3--Dataframe索引的修改

  • 重新索引:reindex
  • 删除列或行:drop
  • 修改dataframe列名
  • 修改列或行的顺序
  • 获取行数
  • dataframe宽表变窄表
  • dataframe窄表变宽表
  • 删除空的数据行或列

获取行名:

list(dataframe.index)

获取列名

column_names_1  = list(data_info.columns.values)

一、reindex()

.reindex()能够改变或重排Series和DataFrame索引

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

d = d.reindex(columns = ['two','one'])
print d
   two  one
a    9    1
b    8    2
c    7    3
d    6    4

d =d.reindex(index = ['d','a','c','b'])
print d
   two  one
d    6    4
a    9    1
c    7    3
b    8    2

.reindex(index=None, columns=None, …)的参数

参数 说明
index, columns 新的行列自定义索引
fill_value 重新索引中,用于填充缺失位置的值
method 填充方法, ffill当前值向前填充,bfill向后填充
limit 最大填充量
copy 默认True,生成新的对象,False时,新旧相等不复制

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6


d2 =d.columns.insert(2,'add')
d3 = d.reindex(columns = d2,fill_value =200)

print d3
   one  two  add
a    1    9  200
b    2    8  200
c    3    7  200
d    4    6  200

Series和DataFrame的索引是Index类型 Index对象是不可修改类型 索引类型常用方法

方法 说明
.append(idx) 连接另一个Index对象,产生新的Index对象
.diff(idx) 计算差集,产生新的Index对象
.intersection(idx) 计算交集
.union(idx) 计算并集
.delete(loc) 删除loc位置处的元素
.insert(loc,e) 在loc位置增加一个元素e

案例:

import pandas as pd
import numpy as np 


dt = {'one':[1,2,3,4],'two':[9,8,7,6]}
d = pd.DataFrame(dt,index =['a','b','c','d'])
print d
   one  two
a    1    9
b    2    8
c    3    7
d    4    6

nc =d.columns.delete(1)
ni =d.index.insert(4,'w')
d3 = d.reindex(index =ni,columns = nc,method='ffill')

print d3
   one
a    1
b    2
c    3
d    4
w    4

二、drop() 丢掉数据

.drop()能够删除Series和DataFrame指定行或列索引

2.1 示例数据

>>> df = pd.DataFrame(np.arange(12).reshape(3,4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11

2.2 丢掉列:

方法一:

>>> df.drop(['B', 'C'], axis=1) # axis=1,丢掉列;如果axis=0,则丢掉行
   A   D
0  0   3
1  4   7
2  8  11

方法二

>>> df.drop(columns=['B', 'C'])
   A   D
0  0   3
1  4   7
2  8  11

2.3 丢掉行

>>> df.drop([0, 1])
   A  B   C   D
2  8  9  10  11

2.4 丢掉行或列

示例数据

>>> midx = pd.MultiIndex(levels=[['lama', 'cow', 'falcon'],
...                              ['speed', 'weight', 'length']],
...                      labels=[[0, 0, 0, 1, 1, 1, 2, 2, 2],
...                              [0, 1, 2, 0, 1, 2, 0, 1, 2]])
>>> df = pd.DataFrame(index=midx, columns=['big', 'small'],
...                   data=[[45, 30], [200, 100], [1.5, 1], [30, 20],
...                         [250, 150], [1.5, 0.8], [320, 250],
...                         [1, 0.8], [0.3,0.2]])
>>> df
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
        length  1.5     1.0
cow     speed   30.0    20.0
        weight  250.0   150.0
        length  1.5     0.8
falcon  speed   320.0   250.0
        weight  1.0     0.8
        length  0.3     0.2

丢掉行或列

>>> df.drop(index='cow', columns='small')
                big
lama    speed   45.0
        weight  200.0
        length  1.5
falcon  speed   320.0
        weight  1.0
        length  0.3

通过Level指定index或column所属的层次

>>> df.drop(index='length', level=1)
                big     small
lama    speed   45.0    30.0
        weight  200.0   100.0
cow     speed   30.0    20.0
        weight  250.0   150.0
falcon  speed   320.0   250.0
        weight  1.0     0.8

三、修改DataFrame列名的方法

数据如下:

>>>import pandas as pd
>>>a = pd.DataFrame({'A':[1,2,3], 'B':[4,5,6], 'C':[7,8,9]})
>>> a 
 A B C
0 1 4 7
1 2 5 8
2 3 6 9

方法一:暴力方法

>>>a.columns = ['a','b','c']
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

但是缺点是必须写三个,要不报错。

方法二:较好的方法

>>>a.rename(columns={'A':'a', 'B':'b', 'C':'c'}, inplace = True)
>>>a
 a b c
0 1 4 7
1 2 5 8
2 3 6 9

好处是可以随意改个数:

>>>a.rename(columns={'A':'a', 'C':'c'}, inplace = True)
>>>a
 a B c
0 1 4 7
1 2 5 8
2 3 6 9

可以只改变’A’,‘C’,不改变’B’。

四、更改pandas dataframe 列或行的顺序

4.1 修改列的顺序

这是我的df:

                             Net   Upper   Lower  Mid  Zsore
Answer option                                                
More than once a day          0%   0.22%  -0.12%   2    65 
Once a day                    0%   0.32%  -0.19%   3    45
Several times a week          2%   2.45%   1.10%   4    78
Once a week                   1%   1.63%  -0.40%   6    65

怎样将mid这一列移动到第一列?

                   Mid   Upper   Lower  Net  Zsore
Answer option                                                
More than once a day          2   0.22%  -0.12%   0%    65 
Once a day                    3   0.32%  -0.19%   0%    45
Several times a week          4   2.45%   1.10%   2%    78
Once a week                   6   1.63%  -0.40%   1%    65

方法一:

new_col = ['name','sum','H','L']
df_4 = df_3.loc[:,new_col]

方法二:

In [39]:
mid = df['Mid']
df.drop(labels=['Mid'], axis=1,inplace = True)
df.insert(0, 'Mid', mid)
df
Out[39]:
                      Mid Net  Upper   Lower  Zsore
Answer_option                                      
More_than_once_a_day    2  0%  0.22%  -0.12%     65
Once_a_day              3  0%  0.32%  -0.19%     45
Several_times_a_week    4  2%  2.45%   1.10%     78
Once_a_week             6  1%  1.63%  -0.40%     65

4.2 修改行的顺序

示例数据:

>>> df = pd.DataFrame({
...     'col1' : ['A', 'A', 'B', np.nan, 'D', 'C'],
...     'col2' : [2, 1, 9, 8, 7, 4],
...     'col3': [0, 1, 9, 4, 2, 3],
... })
>>> df
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
3   NaN  8    4
4   D    7    2
5   C    4    3

Sort by col1

>>> df.sort_values(by=['col1'])
    col1 col2 col3
0   A    2    0
1   A    1    1
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4

根据多列排序

>>> df.sort_values(by=['col1', 'col2'])
    col1 col2 col3
1   A    1    1
0   A    2    0
2   B    9    9
5   C    4    3
4   D    7    2
3   NaN  8    4

排序方式

>>> df.sort_values(by='col1', ascending=False)
    col1 col2 col3
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1
3   NaN  8    4

NA放的位置

>>> df.sort_values(by='col1', ascending=False, na_position='first')
    col1 col2 col3
3   NaN  8    4
4   D    7    2
5   C    4    3
2   B    9    9
0   A    2    0
1   A    1    1

注:

  • na_position : {‘first’, ‘last’}
  • axis : {0 or ‘index’, 1 or ‘columns’}, default 0 ;列还是行的排序

五、获得行数

准备数据

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(1000,3), columns=['col1', 'col2', 'col3'])
df.iloc[::2,0] = np.nan

获取行数

df.shape  # 得到df的行和列数
#(1000, 3)

df['col1'].count() #去除了NaN的数据
# 500

len(df.index)
# 1000

len(df)
# 1000

时间测评

因为CPU采用了缓存优化,所以计算的时间并不是很准确,但是也有一定的代表性。

%timeit df.shape
#The slowest run took 169.99 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 947 ns per loop

%timeit df['col1'].count()
#The slowest run took 50.63 times longer than the fastest. This could mean that an intermediate result is being cached.
#10000 loops, best of 3: 22.6 µs per loop

%timeit len(df.index)
#The slowest run took 14.11 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 490 ns per loop

%timeit len(df)
#The slowest run took 18.61 times longer than the fastest. This could mean that an intermediate result is being cached.
#1000000 loops, best of 3: 653 ns per loop

我们发现速度最快的是len(df.index) 方法, 其次是len(df)

最慢的是df[‘col1’].count(),因为该函数需要去除NaN,当然结果也与其他结果不同,使用时需要格外注意。

六、宽表变窄表

import pandas as pd
# 伪造一些数据
fake_data = {'subject':['math', 'english'],
            'A': [88, 90],
            'B': [70, 80],
            'C': [60, 78]}

# 宽表
test = pd.DataFrame(fake_data, columns=['subject', 'A', 'B', 'C'])
test
    subject A   B   C
0   math    88  70  60
1   english 90  80  78

# 转换为窄表
pd.melt(test, id_vars=['subject'])

subject variable    value
0   math    A   88
1   english A   90
2   math    B   70
3   english B   80
4   math    C   60
5   english C   78

七、窄数据变宽数据

获得dataframe

#coding:utf-8
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib

df =  pd.read_table('chain_hl_filtered.tsv',sep=" |\t", engine='python',index_col=False,header=None)  # 以空格或tab来切割数据
df_1 = df[[0,2]]   ## 只要0,2列数据
df_1['value'] = 1  ## 增加一列数据value
df_1.rename(columns={0:'pdb', 2:'chain'}, inplace = True)  ## 该列的名字

print '\ndf_1:\n'
print df_1.head()

df_1:

    pdb chain  value
0  12E8     L      1
1  12E8     H      1
2  12E8     L      1
3  12E8     H      1
4  15C8     L      1

根据pdb和chain列合并数据

df_2 = df_1.groupby(['pdb','chain'])['value'].sum().to_frame()

print '\ndf_2:\n'
print df_2.head()

df_2:

            value
pdb  chain       
12E8 H          2
     L          2
15C8 H          1
     L          1
1A0Q H          1

将index变成列的名字

df_3 = df_2.reset_index(level=['pdb','chain'])  #将index变成column的名字

print '\ndf_3:\n'
print df_3.head()

df_3:

    pdb chain  value
0  12E8     H      2
1  12E8     L      2
2  15C8     H      1
3  15C8     L      1
4  1A0Q     H      1

pivot将窄数据变成宽数据

df_4 =df_3.pivot(index='pdb',columns='chain',values='value').fillna(0)  # fillna将NA填充0

print '\ndf_4:\n'
print df_4.head()

df_4:

chain    H    L
pdb            
12E8   2.0  2.0
15C8   1.0  1.0
1A0Q   1.0  1.0
1A14   1.0  1.0
1A2Y   1.0  1.0

求每一列最小值和最大值

df_4 = df_4.reset_index()  #将index变成column的名字
df_4['sum'] = df_4['H'] +df_4['L']
# df_4['min'] = min(df_4['H'],df_4['L'])
df_4['min_value']=df_4.min(axis=1)
# df_raw['max_index']=np.argmin(np.array(df_raw),axis=1)


print '\ndf_4_2:\n'
print df_4.head()

df_4_2:

chain   pdb    H    L  sum  min_value
0      12E8  2.0  2.0  4.0        2.0
1      15C8  1.0  1.0  2.0        1.0
2      1A0Q  1.0  1.0  2.0        1.0
3      1A14  1.0  1.0  2.0        1.0
4      1A2Y  1.0  1.0  2.0        1.0


print df_4[df_4['min_value']>15]

作图

plt.hist(df_4['min_value'],bins=30,log=True)
plt.xlabel('#Pair HV-HL ')
plt.ylabel('# Count')

fig = matplotlib.pyplot.gcf()
fig.set_size_inches(18.5,10.5)

fig.savefig('pic/pair_HL_chains_distribution.png')

七、删除空数据行及列–dropna

import pandas as pd

#删除含有空数据的全部行

df4 = pd.read_csv('4.csv',  encoding='utf-8')
df4 = df4.dropna()

#可以通过axis参数来删除含有空数据的全部列

df4 = df4.dropna(axis=1)

#可以通过subset参数来删除在age和sex中含有空数据的全部行

df4 = df4.dropna(subset=["age", "sex"])
print(df4)
df4 = df4.dropna(subset=['age', 'body','home.dest'])

讨论

1. ix 函数已经被丢弃

  • .loc for label based indexing or
  • .iloc for positional indexing

参考资料

个人公众号,比较懒,很少更新,可以在上面提问题,如果回复不及时,可发邮件给我: tiehan@sina.cn

Sam avatar
About Sam
专注生物信息 专注转化医学