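Key points of the training:
Know what R is; learn the basic syntax, the RStudio editor, reading and writing files, how R differs from Excel, plotting and other visualization, and the Bioconductor family of packages for bioinformatics.
Above all, beginners should not get hung up on minutiae. Read broadly, keep the fundamentals firmly in mind, and apply what you learn through practice. The underlying computing logic and algorithms can be filled in later. Always remember that the point of learning R is to analyze bioinformatics data.
Indeed. The minimum requirement is that beginners know why to use R, can open R/RStudio, can carry out common Excel or SPSS operations in R, can draw a few decent plots, and can run a statistical analysis or two and get results, which gives a sense of accomplishment. For Bioconductor, one or two examples are enough to show how to learn it.
The basic logic of getting started with R
0. Installing R and RStudio, and a tour of the interface layout
1. What can R do?
2. Basic overview
1. Data types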
| | Homogeneous | Heterogeneous |
|---|---|---|
| 1d | Atomic vector | List |
| 2d | Matrix | Data frame |
| nd | Array | |
> aa = c('1', "2")
> aa[0]
character(0)
> aa[1]
[1] "1"
Data structure
> str(aa)
 chr [1:2] "1" "2"
> typeof(aa)
[1] "character"
> class(aa)
[1] "character"
| | typeof() | class() |
|---|---|---|
| strings or vector of strings | character | character |
| numbers or vector of numbers | double or integer | numeric or integer |
| list | list | list |
| data.frame* | list | data.frame |
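As a quick check of the table above, the sketch below inspects a few example objects. The variable names are illustrative only. Note that typeof() reports the storage type, so ordinary numbers show up as "double" while class() reports "numeric".

```r
# Compare typeof() (internal storage type) with class() (what S3 dispatch sees).
chr <- c("a", "b")
num <- c(1.5, 2)
int <- 1:3
lst <- list(1, "a")
df  <- data.frame(x = 1:2)

typeof(chr); class(chr)   # "character", "character"
typeof(num); class(num)   # "double",    "numeric"
typeof(int); class(int)   # "integer",   "integer"
typeof(lst); class(lst)   # "list",      "list"
typeof(df);  class(df)    # "list",      "data.frame"
```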
1d (vectors: atomic vector and list)
is.atomic() || is.list()
| Property | Function | What it is |
|---|---|---|
| Type | typeof() | what it is |
| Length | length() | how many elements |
| Attributes | attributes() | additional arbitrary metadata |

| Attribute | Function | What it is |
|---|---|---|
| Names | names(x) | a character vector giving each element a name |
| Dimensions | dim(x) | used to turn vectors into matrices and arrays |
| Class | class(x) | used to implement the S3 object system |
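A minimal sketch of these properties on made-up vectors (v and m are illustrative names, not from the original notes):

```r
# A named numeric vector: type, length, and names.
v <- c(a = 1, b = 2, c = 3)
typeof(v)        # "double"
length(v)        # 3
names(v)         # "a" "b" "c"

# Setting the dim attribute turns a plain vector into a 2 x 3 matrix.
m <- 1:6
dim(m) <- c(2, 3)
is.matrix(m)     # TRUE
attributes(m)    # a list holding only $dim
```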
<- Left assignment, binary
-> Right assignment, binary
= Left assignment, but not recommended
<<- Deep assignment: modifies an existing variable found in a parent environment
2. Reading and writing data
getwd()
Find the current working directory (where inputs are found and outputs are sent).
setwd('C://file/path')
Change the current working directory.
R data object I/O
data(x) loads specified data set; if no arg is given it lists all available data sets
save(file, ...) saves the specified objects (...) in XDR platform-independent binary format
save.image(file) saves all objects
load(file) loads datasets written with save
Read and write a delimited text file.
read.table(file), read.csv(file), read.delim("file"), read.fwf("file") read a file using defaults sensible for a table/csv/delimited/fixed-width file and create a data frame from it.
write.table(x,file), write.csv(x,file) saves x after converting to a data frame
df <- read.table('file.txt')
write.table(df, 'file.txt')
Read and write a comma separated value file. This is a special case of read.table/ write.table.
df <- read.csv('file.csv')
write.csv(df, 'file.csv')
Read and write an R data file, a file type special for R.
load('file.RData')
save(df, file = 'file.RData')
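The read/write pairs above can be tried without touching real files. The sketch below round-trips a small made-up data frame through a CSV file and an RData file in the session's temporary directory; the file names and the data are illustrative assumptions.

```r
# Round-trip a small data frame through CSV and RData files in tempdir().
df <- data.frame(id = 1:3, score = c(90.5, 85, 77.25))

csv_path <- file.path(tempdir(), "example.csv")
write.csv(df, csv_path, row.names = FALSE)  # row.names = FALSE avoids an extra index column
df_csv <- read.csv(csv_path)

rdata_path <- file.path(tempdir(), "example.RData")
save(df, file = rdata_path)
rm(df)
load(rdata_path)                            # restores the object under its original name, df

all.equal(df, df_csv)                       # TRUE
```

Unlike read.csv(), load() does not return the data; it recreates the saved objects under the names they had when saved.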
Data manipulation (subsetting, reshaping, statistics) and writing out results (plotting)
File connections of functions can also be used to read and write to the clipboard instead of a file
Mac OS: x <- read.delim(pipe("pbpaste"))
Windows: x <- read.delim("clipboard")
3. Creating data
c(…) generic function to combine arguments with the default forming a vector; with recursive=TRUE descends through lists combining all elements into one vector
from:to generates a sequence; “:” has operator priority; 1:4 + 1 is “2,3,4,5”
seq(from,to) generates a sequence by= specifies increment; length= specifies desired length
seq(along=x) generates 1, 2, …, length(x); useful in for loops
rep(x,times) replicate x times; use each to repeat “each” element of x each times; rep(c(1,2,3),2) is 1 2 3 1 2 3; rep(c(1,2,3),each=2) is 1 1 2 2 3 3
data.frame(…) create a data frame of the named or unnamed arguments; data.frame(v=1:4, ch=c("a","B","c","d"), n=10); shorter vectors are recycled to the length of the longest
list(…) create a list of the named or unnamed arguments; list(a=c(1,2), b="hi", c=3)
array(x,dim=) array with data x; specify dimensions like dim=c(3,4,2); elements of x recycle if x is not long enough
matrix(x,nrow,ncol) matrix; elements of x recycle
factor(x,levels) encodes a vector x as a factor
gl(n, k, length=n*k, labels=1:n) generate levels (factors) by specifying the pattern of their levels; n is the number of levels, and k is the number of replications
expand.grid() a data frame from all combinations of the supplied vectors or factors
Examples
c(2, 4, 6)
2:6
seq(2, 3, by=0.5)
rep(1:2, times=3)
rep(1:2, each=3)
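A few more of the constructors from this section, with their results shown as comments. The labels and column names are made up for illustration.

```r
seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00
s <- 1:4 + 1                 # 2 3 4 5, because ":" binds tighter than "+"

# gl(n, k): a factor with n levels, each repeated k times.
g <- gl(2, 3, labels = c("ctrl", "treat"))
g                            # ctrl ctrl ctrl treat treat treat

# expand.grid(): one row per combination of the supplied vectors.
grid <- expand.grid(dose = c(1, 2), sex = c("F", "M"))
nrow(grid)                   # 4

# data.frame() recycles the length-1 argument n to 4 rows.
d <- data.frame(v = 1:4, ch = c("a", "B", "c", "d"), n = 10)
d$n                          # 10 10 10 10
```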
4. Getting an overview of the data
Type conversion
as.array(x) as.character(x) as.data.frame(x) as.factor(x) as.logical(x) as.numeric(x), convert type; for a complete list, use methods(as)
Type checking
is.na(x), is.null(x), is.nan(x); is.array(x), is.data.frame(x), is.numeric(x), is.complex(x), is.character(x); for a complete list, use methods(is)
head(x), tail(x) returns first or last parts of an object
summary(x) generic function to give a summary
str(x) display internal structure of the data
length(x) number of elements in x
dim(x) retrieve or set the dimension of an object; dim(x) <- c(3,2)
dimnames(x) retrieve or set the dimension names of an object
nrow(x), ncol(x) number of rows/cols; NROW(x), NCOL(x) is the same but treats a vector as a one-column matrix
class(x) get or set the class of x; class(x) <- "myclass"
unclass(x) removes the class attribute of x
attr(x,which) get or set the attribute which of x
attributes(obj) get or set the list of attributes of obj
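As a rough illustration of these inspection, conversion, and checking functions, the sketch below uses the built-in iris data set plus two small made-up vectors.

```r
data(iris)
dim(iris)                    # 150 5
head(iris, 3)                # first three rows
str(iris)                    # column types at a glance
summary(iris$Sepal.Length)   # min, quartiles, mean, max

# Conversion: text that cannot be parsed as a number becomes NA (with a warning).
x <- c("1", "2", "x")
n <- suppressWarnings(as.numeric(x))
is.na(n)                     # FALSE FALSE TRUE

# nrow() returns NULL for a plain vector; NROW()/NCOL() treat it as one column.
v <- 1:5
nrow(v)                      # NULL
NROW(v); NCOL(v)             # 5, 1
```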
5. Combining data
sort(x) Return x sorted.
rev(x) Return x reversed.
table(x) See counts of values.
unique(x) See unique values.
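For instance, on a small made-up vector:

```r
x <- c(3, 1, 2, 3, 1)
sort(x)      # 1 1 2 3 3
rev(x)       # 1 3 2 1 3
unique(x)    # 3 1 2 (order of first appearance)
table(x)     # counts per value: 1 -> 2, 2 -> 1, 3 -> 2
```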
Reshaping data
merge(a,b) merge two data frames by common col or row names
stack(x, …) transform data available as separate cols in a data frame or list into a single col
unstack(x, …) inverse of stack()
rbind(…) , cbind(…) combines supplied matrices,data frames, etc. by rows or cols
melt(data, id.vars, measure.vars) changes an object into a suitable form for easy casting, (reshape2 package)
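dcast(data, formula, fun.aggregate), acast(data, formula, fun.aggregate) apply fun.aggregate to melted data using formula, returning a data frame or an array (reshape2 package; the older reshape package calls this cast)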
recast(data, formula) melts and casts in a single step (reshape2 package)
reshape(x, direction…) reshapes data frame between 'wide' (repeated measurements in separate cols) and 'long' (repeated measurements in separate rows) format based on direction
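A small base-R sketch of merging, binding, and stacking. The data frames a, b, and wide are invented for illustration; the reshape2 functions are left out so the example runs without extra packages.

```r
a <- data.frame(id = c(1, 2, 3), height = c(170, 165, 180))
b <- data.frame(id = c(2, 3, 4), weight = c(60, 75, 82))

merged     <- merge(a, b)              # inner join on the shared column id: ids 2 and 3
merged_all <- merge(a, b, all = TRUE)  # keep all ids 1..4, filling gaps with NA

more_rows <- rbind(a, data.frame(id = 4, height = 175))  # append a row
more_cols <- cbind(a, group = c("x", "y", "z"))          # append a column

wide <- data.frame(t1 = c(5, 6), t2 = c(7, 8))
long <- stack(wide)    # two columns: values (5 6 7 8) and ind (t1 t1 t2 t2)
back <- unstack(long)  # back to one column per group
```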
paste(x, y, sep = ' ') Join multiple vectors together.
paste(x, collapse = ' ') Join elements of a vector together.
grep(pattern, x) Find regular expression matches in x
gsub(pattern, replace, x) Replace matches in x with a string
toupper(x) Convert to uppercase.
tolower(x) Convert to lowercase.
nchar(x) Number of characters in a string
cut(x, breaks = 4) Turn a numeric vector into a factor by ‘cutting’ into sections.
factor(x) Turn a vector into a factor. Can set the levels of the factor and the order.
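The string and factor helpers above, applied to a few made-up values:

```r
x <- c("geneA", "geneB", "ctrl")
paste(x, 1:3, sep = "_")     # "geneA_1" "geneB_2" "ctrl_3"
paste(x, collapse = " ")     # "geneA geneB ctrl"
grep("^gene", x)             # 1 2 (positions of the matches)
gsub("gene", "G", x)         # "GA" "GB" "ctrl"
toupper(x)                   # "GENEA" "GENEB" "CTRL"
nchar(x)                     # 5 5 4

ages <- c(3, 25, 47, 80)
bins <- cut(ages, breaks = 4)   # factor with 4 equal-width intervals

# Explicit levels fix the order of the factor.
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
levels(f)                    # "low" "high"
as.integer(f)                # 1 2 1
```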
4. Subsetting
x$y is equivalent to x[['y', exact = FALSE]]
var <- 'cyl'
x$var # doesn't work, translated to x[['var']]
x[[var]] # use this instead
df1[df1$col1 == 5 & df1$col2 == 4, ]
subset(df1, col1 == 5 & col2 == 4)
which(c(T, F, T, F)) # returns 1 3
Indexing vectors:
x[n] nth element
x[-n] all but the nth element
x[1:n] first n elements
x[-(1:n)] elements from n+1 to the end
x[c(1,4,2)] specific elements
x["name"] element named "name"
x[x > 3] all elements greater than 3
x[x > 3 & x < 5] all elements between 3 and 5
x[x %in% c("a","if")] elements in the given set
Indexing lists
x[n] list with elements n
x[[n]] nth element of the list
x[["name"]] element named "name"
x$name as above (w. partial matching)
Indexing matrices
x[i,j] element at row i, column j
x[i,] row i
x[,j] column j
x[,c(1,3)] columns 1 and 3
x["name",] row named "name"
Indexing data frames (same as matrices plus the following)
x[["name"]] column named "name"
x$name as above (w. partial matching)
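The indexing forms above can be tried on small throwaway objects. All names and values below are illustrative.

```r
# Vectors
x <- c(a = 10, b = 20, c = 30, d = 40)
x[2]                   # b = 20
x[-2]                  # everything except b
x["c"]                 # c = 30
x[x > 15 & x < 35]     # b = 20, c = 30

# Lists: [ returns a list, [[ returns the element itself
lst <- list(name = "sample1", values = 1:3)
lst[2]                 # a list of length 1
lst[["values"]]        # 1 2 3
lst$val                # 1 2 3, since $ allows partial name matching

# Matrices (filled column by column)
m <- matrix(1:6, nrow = 2)
m[2, 3]                # 6
m[, c(1, 3)]           # columns 1 and 3

# Data frames
df1 <- data.frame(col1 = c(5, 5, 1), col2 = c(4, 2, 4))
hit <- df1[df1$col1 == 5 & df1$col2 == 4, ]   # row 1 only
var <- "col2"
df1[[var]]             # 4 2 4; df1$var would look for a column literally named "var"
which(c(TRUE, FALSE, TRUE, FALSE))            # 1 3
```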
4. Define the problem and know which package to use (loading packages)
Packages
install.packages("pkgs", lib) download and install pkgs from repository (lib) or other external source
update.packages checks for new versions and offers to install
library(pkg) loads pkg; if pkg is omitted it lists packages
detach("package:pkg") removes pkg from memory
install.packages('dplyr') Download and install a package from CRAN.
library(dplyr) Load the package into the session, making all its functions available to use.
dplyr::select Use a particular function from a package.
data(iris) Load a built-in dataset into the environment
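To try the pkg::function form and data() without installing anything, the sketch below uses the stats and datasets packages that ship with R. It assumes a default session in which stats is attached at startup.

```r
# stats is attached by default, so it already appears on the search path.
"package:stats" %in% search()   # TRUE in a default session

# pkg::fun calls a function from a specific package explicitly.
stats::sd(c(1, 2, 3))           # 1

# data() copies a built-in data set into the workspace.
data(iris)
nrow(iris)                      # 150
```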
5. Functions and loops
Loops
for (variable in sequence){
  Do something
}
for (i in 1:4){
  j <- i + 10
  print(j)
}
If statements
if (condition){
  Do something
} else {
  Do something different
}
if (i > 3){
  print('Yes')
} else {
  print('No')
}
While Loop
while (condition){
  Do something
}
while (i < 5){
  print(i)
  i <- i + 1
}
Logical comparisons
a == b a>b a >= b
a != b a < b a <= b
is.na(a) is.null(a)
Functions
Functions are objects. A function has three parts:
body() code inside the function
formals() list of arguments which controls how you can call the function
environment() "map" of the location of the function's variables (see "Enclosing Environment")
Arguments are evaluated lazily, so an argument that is never used is never evaluated:
f <- function(x) { 10 }
f(stop('This is an error!')) # returns 10
function_name <- function(var){
  Do something
  return(new_variable)
}
square <- function(x){
  squared <- x*x
  return(squared)
}
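The three parts of a function can be inspected directly. The sketch below reuses the square function from above and adds a small example of an argument that is never evaluated because the function body never uses it.

```r
square <- function(x) {
  squared <- x * x
  return(squared)
}

square(4)            # 16
formals(square)      # the argument list: $x
body(square)         # the code between the braces
environment(square)  # the environment where square was defined

# The body of f never touches x, so the stop() call is never run.
f <- function(x) {
  10
}
f(stop("This is an error!"))   # 10
```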
Common statistical functions
log(x) exp(x) max(x) min(x) sum(x) mean(x) median(x) quantile(x)
round(x, n) rank(x) signif(x, n) var(x) cor(x, y) sd(x)
lm(y ~ x, data=df) Linear model.
glm(y ~ x, data=df) Generalised linear model.
summary Get more detailed information out of a model.
t.test(x, y) Perform a t-test for difference between means.
pairwise.t.test Perform pairwise t-tests between group levels, with correction for multiple testing.
prop.test Test for a difference between proportions
aov Analysis of variance.
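A minimal sketch of these modelling and testing functions on the built-in mtcars data set. The choice of variables (mpg, wt, am, cyl) is only for illustration.

```r
data(mtcars)

# Linear model: fuel economy as a function of weight.
fit <- lm(mpg ~ wt, data = mtcars)
coef(fit)        # intercept and slope
summary(fit)     # standard errors, R-squared, p-values

# Two-sample t-test: mpg for automatic vs. manual transmission.
tt <- t.test(mpg ~ am, data = mtcars)
tt$p.value

# One-way analysis of variance: mpg across cylinder counts.
av <- aov(mpg ~ factor(cyl), data = mtcars)
summary(av)
```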
Statistical distributions
| | Random Variates | Density Function | Cumulative Distribution | Quantile |
|---|---|---|---|---|
| Normal | rnorm | dnorm | pnorm | qnorm |
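The four prefixes r, d, p, and q behave as follows for the standard normal distribution. The seed value is arbitrary.

```r
set.seed(1)          # make the random draws reproducible
x <- rnorm(5)        # five random draws from N(0, 1)

dnorm(0)             # density at 0, equal to 1/sqrt(2*pi), about 0.399
pnorm(1.96)          # P(X <= 1.96), about 0.975
qnorm(0.975)         # the 97.5% quantile, about 1.96
qnorm(pnorm(1.5))    # 1.5, since qnorm inverts pnorm
```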
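Common plots
- How to debug (common beginner problems)
Some common problems:
- Chinese vs. English input mode (full-width punctuation), mistyped names, wrong paths
- Wrong data type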
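Two of these problems can be reproduced in a few lines. The sketch below is illustrative only: it shows numbers that were read in as text, and a path that does not exist (the path is made up).

```r
# Numbers stored as text: arithmetic fails until they are converted.
x <- c("1", "2", "3")
msg <- tryCatch(sum(x), error = function(e) conditionMessage(e))
msg                    # the error message raised by sum() on character input
class(x)               # "character", which explains the error
sum(as.numeric(x))     # 6

# Wrong path: check before reading.
file.exists("no/such/file.txt")   # FALSE unless such a file happens to exist
getwd()                           # relative paths are resolved from here
```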
7. Some commonly used commands
1. Getting help
help(topic) documentation on topic
help(package = 'dplyr')
help.search('weighted mean')
?topic same as above; special chars need quotes: for example ?'&&'
?mean
2. About objects
summary(x) generic function to give a “summary” of x, often a statistical one
str(x) display the internal structure of an R object
dir() show files in the current directory
class(iris) Find the class an object belongs to.
3. Useful keyboard shortcuts
Clear the entire line: Ctrl + U
- 3 case studies
Plotting (heatmap), ggplot
- heatmap
- PCA
Bioconductor
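As a possible starting point for the heatmap and PCA cases, here is a base-R sketch using the built-in mtcars and iris data sets. The data sets, the scaling choices, and the temporary PDF output are assumptions made for illustration; the notes do not specify them, and a ggplot2 or Bioconductor version would look different.

```r
# Write plots to a temporary PDF so the sketch also runs non-interactively.
pdf(file = tempfile(fileext = ".pdf"))

# Heatmap: rows are cars, columns are variables scaled to be comparable.
heatmap(as.matrix(mtcars), scale = "column")

# PCA on the four numeric iris measurements, scaled to unit variance.
data(iris)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
summary(pca)                                   # variance explained per component
plot(pca$x[, 1:2], col = as.integer(iris$Species),
     main = "PCA of iris measurements")

dev.off()
```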
R Programming Cheat Sheet
1. Environments
- Global environment, access with globalenv(), is the interactive workspace. This is the environment in which you normally work. The parent of the global environment is the last package that you attached with library() or require().
- Base environment, access with baseenv(), is the environment of the base package. Its parent is the empty environment.
- Empty environment, access with emptyenv(), is the ultimate ancestor of all environments, and the only environment without a parent. Empty environments contain nothing.
- Current environment, access with environment()
Package search order
> search()
[1] ".GlobalEnv" "tools:rstudio" "package:stats" "package:graphics" "package:grDevices"
[6] "package:utils" "package:datasets" "package:methods" "Autoloads" "package:base"
> library(reshape2); search()
[1] ".GlobalEnv" "package:reshape2" "tools:rstudio" "package:stats" "package:graphics"
[6] "package:grDevices" "package:utils" "package:datasets" "package:methods" "Autoloads"
[11] "package:base"
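The relationships between these environments can be checked directly. This is a small sketch; the exact search path depends on which packages are attached in your session.

```r
environmentName(globalenv())   # "R_GlobalEnv"
environmentName(baseenv())     # "base"
environmentName(emptyenv())    # "R_EmptyEnv"

# The base environment's parent is the empty environment.
identical(parent.env(baseenv()), emptyenv())   # TRUE

# The search path starts at the global environment and ends at base.
sp <- search()
sp[1]                # ".GlobalEnv"
sp[length(sp)]       # "package:base"
```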
<- (Regular assignment arrow): always creates a variable in the current environment
<<- (Deep assignment arrow): modifies an existing variable found by walking up the parent environments
Warning: If <<- doesn't find an existing variable, it will create one in the global environment.
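A common use of the deep assignment arrow is a counter that keeps state between calls. The sketch below uses hypothetical names (make_counter, count) to show the difference from regular assignment.

```r
make_counter <- function() {
  count <- 0                 # lives in the environment created by this call
  function() {
    count <<- count + 1      # updates count in the enclosing environment
    count
  }
}

counter <- make_counter()
first  <- counter()          # 1
second <- counter()          # 2
```

With a regular `<-` inside the inner function, a new local count would be created on every call and the counter would always return 1.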
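5. Debugging
6. Object-oriented programming
When you run into a problem, think about where its root cause lies, and spend more effort thinking about how to solve it. Work it, enjoy it.
Recommended resources
First, download the printable R cheatsheet (link: http://pan.baidu.com/s/1nv5Oulb, password: 4tsn). Keep it on your desk or by your pillow so you can browse and memorize it at any time.
I have a personal public account, but I am rather lazy and rarely update it. You can ask questions there; if I do not reply promptly, email me at tiehan@sina.cn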