[5.1] Data Visualization: Data Preparation

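Before data can be visualized, it usually has to be cleaned up and reshaped into a target data set. This section collects common ways of reshaping data in R.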

I. Data Frames

1. Creating a data frame - data.frame()

# Create vector p
p = c("A", "B", "C")
# Create vector q
q = 1:3
# Create a data frame with two columns, p and q
dat = data.frame(p, q)
> dat
  p q
1 A 1
2 B 2
3 C 3

2. Inspecting a data frame - str()

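> str(dat)
'data.frame':   3 obs. of  2 variables:
 $ p: Factor w/ 3 levels "A","B","C": 1 2 3
 $ q: int  1 2 3

Note: since R 4.0.0, data.frame() no longer converts strings to factors by default, so p is reported as chr unless the data frame is built with stringsAsFactors = TRUE.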

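3. Adding a column to a data frame

The basic form is data_frame$new_column = vector. The following code creates a column named newcol in dat and assigns the vector v to it: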

dat$newcol = v

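If the vector is shorter than the number of rows, R recycles it until every row is filled. This only works when the number of rows is an exact multiple of the vector length; otherwise the assignment fails with an error.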
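
A minimal sketch of the recycling rule, reusing the dat data frame from above. The column names flag, ratio, and bad are made up for illustration:

```r
# Rebuild the small example data frame from above
dat = data.frame(p = c("A", "B", "C"), q = 1:3)

# A length-1 vector is recycled to fill all 3 rows
dat$flag = TRUE

# A length-3 vector fills the rows one to one
dat$ratio = dat$q / sum(dat$q)

# A length-2 vector does not divide 3 rows evenly, so R raises an error
res = try(dat$bad <- c(1, 2), silent = TRUE)
inherits(res, "try-error")
```

Wrapping the failing assignment in try() keeps the script running, and dat is left with the four columns added so far.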

4. Removing a column from a data frame

Assign NULL to the column. The following code deletes the badcol column from the data set:

dat$badcol = NULL

You can also use subset() (covered in more detail below) and put a minus sign in front of the column to be dropped:

dat = subset(dat, select = -badcol)
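
Both approaches can be compared side by side. This is a small sketch on a hypothetical data frame; the column names id, badcol, and score are invented here:

```r
# Hypothetical data frame with a column we want to drop
dat = data.frame(id = 1:3, badcol = c("x", "y", "z"), score = c(0.5, 0.7, 0.9))

# Option 1: assign NULL to the column (modifies the data frame in place)
d1 = dat
d1$badcol = NULL

# Option 2: subset() with a minus sign returns a copy without the column
d2 = subset(dat, select = -badcol)

# Several columns can be dropped at once
d3 = subset(dat, select = -c(badcol, score))

names(d1)
names(d3)
```

Note that subset() keeps the result a data frame even when only one column remains.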

5. Renaming columns in a data frame

Assign a vector of column names to names():

names(dat) = c("name1", "name2", "name3")

To rename a single column by its current name:

# Rename the column named ctrl to Control
names(anthoming)[names(anthoming) == "ctrl"] = c("Control")

6. Reordering the columns of a data frame

Columns can be reordered by numeric position:

# Reorder by numeric column position
dat = dat[c(1,3,2)]

Or by column name:

# Reorder by column name
dat = dat[c("col1", "col3", "col2")]
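
Renaming and reordering can be tried together on a throwaway data frame. The sketch below assumes hypothetical column names col1, col2, and col3 and a new name label:

```r
# Hypothetical data frame with three columns
dat = data.frame(col1 = 1:2, col2 = c("a", "b"), col3 = c(TRUE, FALSE))

# Rename one column by looking up its current name
names(dat)[names(dat) == "col2"] = "label"

# Reorder by position, then by name; both return a data frame
by_pos  = dat[c(1, 3, 2)]
by_name = dat[c("col3", "col1", "label")]

names(by_pos)   # "col1" "col3" "label"
names(by_name)  # "col3" "col1" "label"
```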

7. Extracting a subset of a data frame - subset()

The following R code takes the climate data frame and selects the Year and Anomaly10y columns of the rows whose Source is "Berkeley":

> climate[1:10,]
     Source Year Anomaly1y Anomaly5y Anomaly10y Unc10y
1  Berkeley 1800        NA        NA     -0.435  0.505
2  Berkeley 1801        NA        NA     -0.453  0.493
3  Berkeley 1802        NA        NA     -0.460  0.486
4  Berkeley 1803        NA        NA     -0.493  0.489
5  Berkeley 1804        NA        NA     -0.536  0.483
6  Berkeley 1805        NA        NA     -0.541  0.475
7  Berkeley 1806        NA        NA     -0.590  0.468
8  Berkeley 1807        NA        NA     -0.695  0.461
9  Berkeley 1808        NA        NA     -0.763  0.453
10 Berkeley 1809        NA        NA     -0.818  0.451

# subset(): the first argument is the data frame, the second is the row condition, and select picks the columns

subset(climate, Source == "Berkeley", select = c(Year, Anomaly10y))

The same kind of selection can be written with bracket indexing, here also restricting Year to 1900 through 2000:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y")]

If you grab just a single column this way, it will be returned as a vector instead of a data frame. To prevent this, use drop=FALSE, as in:

climate[climate$Source == "Berkeley" & climate$Year >= 1900 & climate$Year <= 2000,
        c("Year", "Anomaly10y"), drop=FALSE]
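
The effect of drop is easiest to see with a single column. This sketch uses a toy stand-in for climate: three rows copied from the listing above plus one made-up row with Source "Other":

```r
# Toy stand-in for the climate data (last row is invented)
climate = data.frame(
  Source     = c("Berkeley", "Berkeley", "Berkeley", "Other"),
  Year       = c(1800, 1801, 1802, 1800),
  Anomaly10y = c(-0.435, -0.453, -0.460, NA)
)

# subset(): row condition and column selection in one call
s1 = subset(climate, Source == "Berkeley" & Year >= 1801,
            select = c(Year, Anomaly10y))

# Bracket indexing with one column drops to a plain vector ...
v = climate[climate$Source == "Berkeley", "Anomaly10y"]

# ... unless drop = FALSE keeps the data frame structure
d = climate[climate$Source == "Berkeley", "Anomaly10y", drop = FALSE]

is.data.frame(v)  # FALSE
is.data.frame(d)  # TRUE
```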

Replacing values:

# Cap coverage: every value of 10000 or more is set to 5000 (vectorized, no loop needed)
aa$coverage[aa$coverage >= 10000] = 5000

Converting numbers to character strings:

aa$repeat_time = as.character(aa$repeat_time)

Summarizing data:

> summary(primers_info$coverage == 0)
   Mode   FALSE    TRUE    NA's 
logical    2052      23       0 

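Splitting data for a boxplot:

# Split the rows into three GC-content bins (use the vectorized &, not &&)
tmp1 = tmp[tmp$gc <= 40, ]
tmp2 = tmp[tmp$gc > 40 & tmp$gc <= 60, ]
tmp3 = tmp[tmp$gc > 60, ]
# The bins usually differ in size, so collect them in a list; cbind() would recycle the shorter vectors
groups = list(tmp1$coverage, tmp2$coverage, tmp3$coverage)
# Preview the first five values of each bin
lapply(groups, head, 5)
boxplot(groups)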
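
An alternative that avoids building three separate data frames is to label each row with its bin and use the formula interface of boxplot(). The sketch below runs on made-up data; the column name gc_bin and the bin labels are assumptions introduced here:

```r
# Made-up data: gc is GC content in percent, coverage is a read depth
set.seed(1)
tmp = data.frame(gc = runif(200, 20, 80), coverage = rpois(200, 50))

# Label each row with its GC bin: (-Inf,40], (40,60], (60,Inf]
tmp$gc_bin = cut(tmp$gc, breaks = c(-Inf, 40, 60, Inf),
                 labels = c("<=40", "40-60", ">60"))

# One box per bin; the formula interface handles unequal group sizes
bp = boxplot(coverage ~ gc_bin, data = tmp,
             xlab = "GC content (%)", ylab = "coverage")

# bp$n holds the number of observations in each box
bp$n
```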

Splitting a column on a delimiter character:

df <- data.frame(a=c("1-2", "23-4", "5-67", "89-10", "11-23"))
library(splitstackshape)
concat.split.multiple(df, "a", "-")
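
If installing splitstackshape is not an option, a similar result can be sketched in base R with strsplit(). This assumes every value contains exactly one delimiter; the new column names a_1 and a_2 are chosen here for illustration:

```r
df = data.frame(a = c("1-2", "23-4", "5-67", "89-10", "11-23"))

# Split each string on "-" and stack the pieces into a two-column matrix
parts = do.call(rbind, strsplit(as.character(df$a), "-", fixed = TRUE))

# Attach the pieces as numeric columns
df$a_1 = as.numeric(parts[, 1])
df$a_2 = as.numeric(parts[, 2])
df
```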

II. Reshaping Factor Levels

1. Changing the order of factor levels

# By default, levels are ordered alphabetically
    sizes = factor(c("small", "large", "large", "small", "medium"))
    sizes
    small  large  large  small  medium
    Levels: large medium small
    # Change the order of levels
    sizes = factor(sizes, levels = c("small", "medium", "large"))
    sizes
    small  large  large  small  medium
    Levels: small medium large

# Reverse the order of the levels
factor(sizes, levels = rev(levels(sizes)))
    small  large  large  small  medium
    Levels: large medium small

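2. Reordering factor levels by data values - reorder()

The following example reorders the levels of the spray column by the count column, using mean as the summary function: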

 # Make a copy since we'll modify it
    iss = InsectSprays
    iss$spray
     [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
    [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
    Levels: A B C D E F
    iss$spray = reorder(iss$spray, iss$count, FUN=mean)
    iss$spray
     [1] A A A A A A A A A A A A B B B B B B B B B B B B C C C C C C C C C C C C D D
    [39] D D D D D D D D D D E E E E E E E E E E E E F F F F F F F F F F F F
    attr(,"scores")
            A         B         C         D         E         F
    14.500000 15.333333  2.083333  4.916667  3.500000 16.666667
    Levels: C E D A B F
# reorder(): the first argument is the factor, the second is the vector of values to order by, and FUN is the summary function
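
To check what reorder() did, the per-level means can be computed separately and compared with the new level order. The sketch below also shows one common way to get decreasing order, by negating the values:

```r
iss = InsectSprays

# Mean count per spray, for reference
means = tapply(iss$count, iss$spray, mean)

# Reorder the levels by increasing mean count
iss$spray = reorder(iss$spray, iss$count, FUN = mean)
levels(iss$spray)

# Negate the values to sort the levels in decreasing order instead
desc = reorder(InsectSprays$spray, -InsectSprays$count, FUN = mean)
levels(desc)
```

Only the level order changes; the values stored in the factor stay where they were.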

3. Renaming factor levels - revalue() / mapvalues() in the plyr package

Either of the two calls below renames the levels "small", "medium", and "large" of the factor f to "S", "M", and "L":

sizes = factor(c( "small", "large", "large", "small", "medium"))
sizes
    small  large  large  small  medium
 Levels: large medium small

levels(sizes)
    "large"  "medium" "small"

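library(plyr)
f = sizes  # f is a copy of the factor defined above
# Method 1
f = revalue(f, c(small = "S", medium = "M", large = "L"))
# Method 2 (equivalent; use one or the other on the original factor)
f = mapvalues(f, c("small", "medium", "large"), c("S", "M", "L"))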


# Base R alternative: rename a single level by position (here the first level, "large")
levels(sizes)[1] = "L"
sizes
small  L      L      small  medium
Levels: L medium small
# Rename all levels at once
levels(sizes) = c("L", "M", "S")
sizes

4. Dropping unused factor levels - droplevels()

The following R code removes the unused levels from the factor f:

droplevels(f)
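
Unused levels typically appear after subsetting. A short sketch using the sizes factor from earlier in this section:

```r
sizes = factor(c("small", "large", "large", "small", "medium"))

# Subsetting keeps all original levels, even ones no longer present
f = sizes[sizes != "medium"]
levels(f)

# droplevels() removes the unused "medium" level
f = droplevels(f)
levels(f)
```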

III. Reshaping Variables

1. Recoding values - match()

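To replace certain values with other specific values, use match(). The following R code maps the values "ctrl", "trt1", and "trt2" in the group column of the data frame pg to "No", "Yes", and "Yes", and stores the result in a new treatment column:

# Old values
oldvals = c("ctrl", "trt1", "trt2")
# New values
newvals = factor(c("No", "Yes", "Yes"))
# Replace: look up each group value's position in oldvals, then index newvals with it
pg$treatment = newvals[match(pg$group, oldvals)]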
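
The lookup is easier to follow when the intermediate index is printed. This sketch works on five rows of the built-in PlantGrowth data, where the control level is spelled ctrl; the variable name idx is introduced here:

```r
pg = PlantGrowth[c(1, 2, 11, 21, 22), ]

oldvals = c("ctrl", "trt1", "trt2")
newvals = factor(c("No", "Yes", "Yes"))

# match() returns, for each group value, its position in oldvals
idx = match(pg$group, oldvals)
idx

# Indexing newvals by those positions yields the recoded values
pg$treatment = newvals[idx]
pg
```

A value that is missing from oldvals would produce NA in idx, and therefore NA in the new column.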

2. Converting a continuous variable to a categorical one - cut()

# Work on a subset of the PlantGrowth data set
pg = PlantGrowth[c(1,2,11,21,22), ]
pg
weight group
  4.17  ctrl
  5.58  ctrl
  4.81  trt1
  6.31  trt2
  5.12  trt2

# Bin weight into three classes: (0,5], (5,6], (6,Inf]
pg$wtclass = cut(pg$weight, breaks = c(0, 5, 6, Inf))
pg
 weight group wtclass
   4.17  ctrl   (0,5]
   5.58  ctrl   (5,6]
   4.81  trt1   (0,5]
   6.31  trt2 (6,Inf]
   5.12  trt2   (5,6]

If you want the categories to be closed on the left and open on the right, set right = FALSE:
cut(pg$weight, breaks = c(0, 5, 6, Inf), right = FALSE)
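
The two settings only differ for values that sit exactly on a break. A small sketch with a made-up vector w that includes the boundary values 5 and 6:

```r
w = c(4.17, 5.58, 4.81, 6.31, 5, 6)

# Default: intervals are open on the left, closed on the right
r1 = cut(w, breaks = c(0, 5, 6, Inf))

# right = FALSE: closed on the left, open on the right
r2 = cut(w, breaks = c(0, 5, 6, Inf), right = FALSE)

# The boundary values 5 and 6 land in different bins
data.frame(w, default = r1, left_closed = r2)
```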

3. Transforming data by group - ddply() in the plyr package

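Passing transform to ddply() applies a transformation within each group. The following R code splits the cabbages data frame by the Cult column and adds a new column named DevWt, computed as each HeadWt value minus the mean HeadWt of its group:

# ddply(): the first argument is the data frame, the second is the grouping variable, the third is the function to apply, and the fourth defines the new column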

library(MASS) # For the data set
library(plyr)
cb = ddply(cabbages, "Cult", transform, DevWt = HeadWt - mean(HeadWt))
 Cult Date HeadWt VitC       DevWt
  c39  d16    2.5   51 -0.40666667
  c39  d16    2.2   55 -0.70666667
 ...
  c52  d21    1.5   66 -0.78000000
  c52  d21    1.6   72 -0.68000000
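
For readers without plyr, the same kind of per-group deviation can be sketched in base R with ave(). The data frame cab below is a made-up miniature that only borrows the column names Cult and HeadWt; its values are invented:

```r
# Made-up miniature of the cabbages data (invented values)
cab = data.frame(
  Cult   = c("c39", "c39", "c39", "c52", "c52", "c52"),
  HeadWt = c(2.5, 2.2, 3.1, 1.5, 1.6, 2.3)
)

# ave() returns the group mean aligned with each original row
grp_mean = ave(cab$HeadWt, cab$Cult, FUN = mean)

# Deviation of each head weight from its cultivar mean
cab$DevWt = cab$HeadWt - grp_mean
cab
```

As with transform, the number of rows is unchanged and the deviations within each group sum to zero.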

4. Summarizing data by group - ddply() in the plyr package

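Passing summarise to ddply() computes a summary within each group. The difference from the transformation above is that a summary returns one row per group, whereas a transformation keeps the original number of rows and only adds or modifies columns. The following R code groups the cabbages data frame by the Cult and Date columns and produces a column named Weight that holds the mean HeadWt of each group:

# ddply(): the first argument is the data frame, the second is the grouping variables, the third is the function to apply, and the fourth defines the output column
cb = ddply(cabbages, c("Cult", "Date"), summarise, Weight = mean(HeadWt))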
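
A comparable summary can be sketched in base R with aggregate(). The data frame cab below is made up (same column names as cabbages, invented values), so the numbers are illustrative only:

```r
# Made-up miniature of the cabbages data (invented values)
cab = data.frame(
  Cult   = c("c39", "c39", "c39", "c39", "c52", "c52", "c52", "c52"),
  Date   = c("d16", "d16", "d21", "d21", "d16", "d16", "d21", "d21"),
  HeadWt = c(2.5, 2.2, 3.2, 2.8, 2.0, 2.4, 1.5, 1.6)
)

# One output row per Cult/Date combination, holding the mean HeadWt
agg = aggregate(HeadWt ~ Cult + Date, data = cab, FUN = mean)

# Rename the summary column to match the ddply() example
names(agg)[names(agg) == "HeadWt"] = "Weight"
agg
```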

IV. Reshaping Between Long and Wide Data

1. Wide data -> long data - melt() in the reshape2 package

The anthoming data set looks like this:

> anthoming
  angle expt ctrl
1   -20    1    0
2   -10    7    3
3     0    2    3
4    10    0    3
5    20    0    1

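The expt and ctrl columns can be combined into a single column. The combined data frame is called long data relative to the original, and the original is called wide data relative to the combined one. The following R code uses melt() to "lengthen" the data set above:

# melt(): the first argument is the data frame, id.vars names the identifier columns, variable.name names the column that will hold the former column names, and value.name names the column that will hold the values
melt(anthoming, id.vars = "angle", variable.name = "condition", value.name = "count")

Result after lengthening: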

> melt(anthoming, id.vars = "angle", variable.name = "condition", value.name = "count")
   angle condition count
1    -20      expt     1
2    -10      expt     7
3      0      expt     2
4     10      expt     0
5     20      expt     0
6    -20      ctrl     0
7    -10      ctrl     3
8      0      ctrl     3
9     10      ctrl     3
10    20      ctrl     1
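
To see what the lengthening does mechanically, the same long table can be assembled by hand in base R. This is an illustrative sketch, not how melt() is implemented; the helper variable measure is introduced here:

```r
anthoming = data.frame(
  angle = c(-20, -10, 0, 10, 20),
  expt  = c(1, 7, 2, 0, 0),
  ctrl  = c(0, 3, 3, 3, 1)
)

measure = c("expt", "ctrl")  # columns to stack

long = data.frame(
  # Repeat the identifier once per stacked column
  angle     = rep(anthoming$angle, times = length(measure)),
  # Record which original column each value came from
  condition = factor(rep(measure, each = nrow(anthoming)), levels = measure),
  # Concatenate the values column by column
  count     = unlist(anthoming[measure], use.names = FALSE)
)
long
```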

2. Long data -> wide data - dcast() in the reshape2 package

The plum data set looks like this:

> plum
  length      time survival count
1   long   at_once     dead    84
2   long in_spring     dead   156
3  short   at_once     dead   133
4  short in_spring     dead   209
5   long   at_once    alive   156
6   long in_spring    alive    84
7  short   at_once    alive   107
8  short in_spring    alive    31

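In this data frame the length and time columns serve as identifier columns. The following R code flattens (widens) the data frame:

# dcast(): the first argument is the data frame; in the formula, the left side lists the identifier columns and the right side the column whose values become the new column names; value.var names the column that supplies the cell values
dcast(plum, length + time ~ survival, value.var = "count")

Result after widening: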

> dcast(plum, length + time ~ survival, value.var = "count")
  length      time dead alive
1   long   at_once   84   156
2   long in_spring  156    84
3  short   at_once  133   107
4  short in_spring  209    31
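
Base R offers reshape() for the same job, with a less compact interface. The sketch below rebuilds plum from the listing above and widens it; the final renaming step is only there to match the column names of the dcast() result:

```r
plum = data.frame(
  length   = rep(c("long", "long", "short", "short"), times = 2),
  time     = rep(c("at_once", "in_spring"), times = 4),
  survival = rep(c("dead", "alive"), each = 4),
  count    = c(84, 156, 133, 209, 156, 84, 107, 31)
)

# idvar: identifier columns; timevar: column whose values become new columns
wide = reshape(plum, idvar = c("length", "time"), timevar = "survival",
               direction = "wide")

# reshape() prefixes the new columns with the value column name
names(wide)[names(wide) == "count.dead"]  = "dead"
names(wide)[names(wide) == "count.alive"] = "alive"
wide
```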

