R Tips

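Data manipulation

Inspecting data

glimpse() displays a data frame “turned on its side”: each column is printed as one row, so that as much data as possible fits on screen.
Use tibble instead of data.frame. A tibble prints in a friendlier way.

tibble is regarded as a modernized data.frame whose main strengths are correctness and convenience, but its data-processing performance does not appear to be significantly better than that of data.frame. If you want better performance, use data.table.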
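As a quick illustration of the two points above, here is a minimal sketch that assumes the tibble package is installed. The data set (mtcars, which ships with R) and the variable name cars_tbl are chosen here for illustration only.

```r
library(tibble)

# Convert a built-in data frame to a tibble
cars_tbl <- as_tibble(mtcars)

# Printing a tibble shows only the first rows, plus each column's type
print(cars_tbl)

# glimpse() prints one column per line, with values running left to right
glimpse(cars_tbl)
```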

Adding new elements

New elements can be assigned directly to both data frames and lists:

df <- data.frame(
    a = c(1, 2, 3),
    b = c("b")
)
df$c <- c("a", "b", "c")
df

#>   a b c
#> 1 1 b a
#> 2 2 b b
#> 3 3 b c

# Name list elements with `=`; `<-` inside list() would leave them unnamed
lt <- list(
    a = matrix(1:10, nrow = 2),
    b = "test_string"
)
lt[["c"]] <- "additional entry"
lt

#> $a
#>      [,1] [,2] [,3] [,4] [,5]
#> [1,]    1    3    5    7    9
#> [2,]    2    4    6    8   10
#> 
#> $b
#> [1] "test_string"
#> 
#> $c
#> [1] "additional entry"

Use c() or append() to insert new elements into a vector:

vc <- c(1, 2, 3)
c(vc, 4)
c(4, vc)
append(vc, 0, after = 0)            # insert at the front
append(vc, 4, after = length(vc))   # insert at the end
append(vc, 4)                       # inserts at the end by default

#> [1] 1 2 3 4
#> [1] 4 1 2 3
#> [1] 0 1 2 3
#> [1] 1 2 3 4
#> [1] 1 2 3 4

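Subsetting

For lists, $ and [[]] behave almost identically and return the element itself (the main difference is that $ allows partial matching of names), whereas [] returns a subset whose type is still a list.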

lt <- list(
    a = c(1, 2, 3),
    b = "string"
)

cat("lt$a:\n")
lt$a
cat("lt[[\"a\"]]:\n")
lt[["a"]]
cat("lt[\"a\"]:\n")
lt["a"]
cat("lt[1:2]:\n")
lt[1:2]

#> lt$a:
#> [1] 1 2 3
#> lt[["a"]]:
#> [1] 1 2 3
#> lt["a"]:
#> $a
#> [1] 1 2 3
#> 
#> lt[1:2]:
#> $a
#> [1] 1 2 3
#> 
#> $b
#> [1] "string"
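
The difference between an element and a one-element sublist, and the partial matching done by $, can be checked directly. The following is a small sketch in base R; the list and the names alpha, partial and exact are made up for this illustration.

```r
lt <- list(
    alpha = c(1, 2, 3),
    b = "string"
)

class(lt[["alpha"]])   # the element itself: "numeric"
class(lt["alpha"])     # a one-element list: "list"

# $ matches names partially; [[ ]] needs an exact match by default
partial <- lt$al       # resolves to lt$alpha
exact <- lt[["al"]]    # no element is named exactly "al", so this is NULL
```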

Pipes

Besides %>%, R (4.1+) provides a native pipe operator, |>. In RStudio, enable it via “Tools” → “Global Options…” → “Code” → “Editing” → “Use native pipe operator, |> (requires R 4.1+)”. The default shortcut for inserting a pipe is Ctrl+Shift+M.
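
A minimal sketch of the native pipe, using only base R (so it needs R 4.1 or later). The vector values is made up for illustration.

```r
values <- c(1, 4, 9, 16)

# x |> f() is evaluated as f(x), so this is the same as sum(sqrt(values))
total <- values |> sqrt() |> sum()
total
#> [1] 10
```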

Writing data to separate sheets of an Excel file

The xlsx package can do this, but it depends on Java, so the openxlsx package is recommended instead.

library(openxlsx)

# Placeholder data frames; substitute your own data
dataframe1 <- head(mtcars)
dataframe2 <- head(iris)

# Create a blank workbook
OUT <- createWorkbook()

# Add some sheets to the workbook
addWorksheet(OUT, "Sheet 1 Name")
addWorksheet(OUT, "Sheet 2 Name")

# Write the data to the sheets
writeData(OUT, sheet = "Sheet 1 Name", x = dataframe1)
writeData(OUT, sheet = "Sheet 2 Name", x = dataframe2)

# Reorder worksheets (the workbook has two sheets, so reverse 1:2)
worksheetOrder(OUT) <- rev(1:2)

# Export the file
saveWorkbook(OUT, "My output file.xlsx")

Flattening a list

Given a list that contains several vectors, combine them into a single vector.

l <- list(
    a = c(1, 2, 3, 4, 5, 6, 7),
    b = c(3, 4, 5, 6, 7, 8, 9)
)

# Concatenate all vectors (duplicates are kept)
do.call(c, l)
# Keep only the elements that appear in every vector
Reduce(intersect, l)

#> a1 a2 a3 a4 a5 a6 a7 b1 b2 b3 b4 b5 b6 b7 
#>  1  2  3  4  5  6  7  3  4  5  6  7  8  9 
#> [1] 3 4 5 6 7
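
Two closely related variants, sketched in base R under the same input: unlist() flattens the list like do.call(c, l) and can drop the generated names, while Reduce(union, l) keeps each distinct value once. The names flat and distinct are made up for this sketch.

```r
l <- list(
    a = c(1, 2, 3, 4, 5, 6, 7),
    b = c(3, 4, 5, 6, 7, 8, 9)
)

# Flatten without the a1, a2, ... names
flat <- unlist(l, use.names = FALSE)

# Distinct elements that appear in at least one vector
distinct <- Reduce(union, l)
distinct
#> [1] 1 2 3 4 5 6 7 8 9
```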

Handling large data

See the “Handling large data” tab in the post “Manipulating data with dplyr”.

Sparse matrices

library(Matrix)

# A small, mostly-zero dense matrix as placeholder input; use your own matrix here
regMat <- matrix(c(0, 0, 3, 0, 5, 0, 0, 0, 0, 0, 2, 0), nrow = 3)

mat_a <- as(regMat, "sparseMatrix")       # see also `vignette("Intro2Matrix")`
mat_b <- Matrix(regMat, sparse = TRUE)    # Thanks to Aaron for pointing this out

# Check whether both routes give the same object, then print the sparse matrix
identical(mat_a, mat_b)
mat_a

Removing variables

Use rm() or its equivalent, remove().
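
A short base R sketch of the two common calling styles. The variable names tmp_a and tmp_b are placeholders.

```r
tmp_a <- 1
tmp_b <- "x"

rm(tmp_a)                # remove a variable by its bare name
rm(list = c("tmp_b"))    # remove variables named in a character vector

# rm(list = ls()) would clear every variable in the current environment
exists("tmp_a")
#> [1] FALSE
```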

Plotting

Plot layout

Drawing several plots at once:

# Split the plotting area into 2 rows and 2 columns, filled row by row
par(mfrow = c(2, 2))

plot(1:10, rnorm(10), main = "Plot 1", col = "blue", pch = 16)
plot(1:10, runif(10), main = "Plot 2", col = "red", pch = 16)
plot(1:10, rnorm(10, 5), main = "Plot 3", col = "green", pch = 16)
plot(1:10, runif(10, 0, 5), main = "Plot 4", col = "purple", pch = 16)

# Restore the default single-plot layout
par(mfrow = c(1, 1))

With the mfcol parameter, the panels are filled column by column instead:

# Split the plotting area into 2 rows and 2 columns, filled column by column
par(mfcol = c(2, 2))

plot(1:10, rnorm(10), main = "Plot 1", col = "blue", pch = 16)
plot(1:10, runif(10), main = "Plot 2", col = "red", pch = 16)
plot(1:10, rnorm(10, 5), main = "Plot 3", col = "green", pch = 16)
plot(1:10, runif(10, 0, 5), main = "Plot 4", col = "purple", pch = 16)

# Restore the default single-plot layout
par(mfrow = c(1, 1))
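
Instead of resetting the layout by hand, the previous settings can be saved and restored, because par() invisibly returns the old values of whatever it changes. A minimal sketch in base R follows; old_par and restored are names made up here, and the example assumes the layout was the default single panel beforehand.

```r
# Save the previous settings while switching to a 1 x 2 layout
old_par <- par(mfrow = c(1, 2))

plot(1:10, rnorm(10), main = "Left panel", pch = 16)
hist(rnorm(100), main = "Right panel")

# Put back exactly what was in effect before
par(old_par)
restored <- par("mfrow")
```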

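Common plotting functions

par(): sets a number of important graphical parameters
plot(): the basic plotting function, also used to create a “canvas” for other functions
rasterImage(): draws a raster (pixel) image
text(): draws text at an arbitrary position
title(): draws the title, subtitle and the x and y axis labels.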
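
The functions above can be combined on one canvas. The following is a rough base R sketch; the coordinates, the tiny 2 x 2 image and all label texts are invented for illustration.

```r
# Margins: bottom, left, top, right (in lines of text)
par(mar = c(5, 4, 3, 1))

# An empty canvas spanning 0..1 on both axes, without axis labels for now
plot(0:1, 0:1, type = "n", xlab = "", ylab = "")

# A 2 x 2 grayscale image; values between 0 (black) and 1 (white)
img <- as.raster(matrix(c(0, 0.3, 0.6, 1), nrow = 2))
rasterImage(img, xleft = 0.1, ybottom = 0.1, xright = 0.5, ytop = 0.5,
            interpolate = FALSE)

# Text centered at the given coordinates
text(0.75, 0.75, "a label")

# Title, subtitle and axis labels added after the fact
title(main = "Main title", sub = "Subtitle", xlab = "x axis", ylab = "y axis")
```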

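`ggplot2` notes

Note: inside a for loop, print() must be called explicitly to display a plot, because automatic printing only happens for expressions evaluated at the top level.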
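
A minimal sketch of this, assuming ggplot2 is installed. The data set (mtcars) and the loop variable column are chosen here for illustration.

```r
library(ggplot2)

for (column in c("wt", "hp")) {
    p <- ggplot(mtcars, aes(x = .data[[column]], y = mpg)) +
        geom_point() +
        labs(x = column)
    # Without print(), nothing is drawn inside the loop
    print(p)
}
```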

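Arranging multiple plots

The patchwork package makes it easy to arrange plots with +, | and /.
cowplot integrates well with ggplot2. Its plot_grid() function handles layout, much like gridExtra::grid.arrange() but with a more concise syntax.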

library(ggplot2)
library(cowplot)

plot1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
plot2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
plot3 <- ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_point()
plot_list <- list(plot1, plot2, plot3)

# Combine plot_list via do.call
combined_plot <- do.call(plot_grid, c(plot_list, ncol = 2))
combined_plot

# Alternatively, pass the list through the plotlist argument
plot_grid(plotlist = plot_list, ncol = 2)
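
For comparison, here is a sketch of the patchwork operators, assuming the patchwork package is installed. The plots mirror the ones used above; the names p1, p2, p3 and combined are made up for this sketch.

```r
library(ggplot2)
library(patchwork)

p1 <- ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point()
p3 <- ggplot(mtcars, aes(x = cyl, y = mpg)) + geom_point()

# | places plots side by side, / stacks them vertically
combined <- (p1 | p2) / p3
combined
```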

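Controlling axes

coord_fixed(ratio = 1): fixes the ratio between the x and y scales at 1:1.
scale_x_continuous(limits = c(2, 5)): sets the range of the x axis.
scale_x_continuous(breaks = c(0, 100, 200, 300, 350), labels = c(0, 100, 200, 300, "infinite")): sets the tick positions and their labels.
scale_x_continuous(transform = "log10") (the trans argument is deprecated): sets the axis transformation.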
Built-in transformations include “asn”, “atanh”, “boxcox”, “date”, “exp”, “hms”, “identity”, “log”, “log10”, “log1p”, “log2”, “logit”, “modulus”, “probability”, “probit”, “pseudo_log”, “reciprocal”, “reverse”, “sqrt” and “time”.

library(ggplot2)

# A custom transformation built with scales::trans_new(). Pass the object itself
# to `transform`; the string "log10" selects the transformation built into scales.
custom_log10 <- scales::trans_new(
    name = "trans_log10",
    transform = function(x) log10(x),
    inverse = function(x) 10 ^ x,
    breaks = function(limits) 10 ^ pretty(log10(limits))
)

xs <- seq(1, 100, by = 0.2)
# Linear x axis
ggplot(data.frame(x = xs, y = xs ^ 2 - 20 * xs)) +
    geom_smooth(mapping = aes(x, y))
# Log-transformed x axis
ggplot(data.frame(x = xs, y = xs ^ 2 - 20 * xs)) +
    geom_smooth(mapping = aes(x, y)) +
    scale_x_continuous(transform = custom_log10)
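
The other axis helpers listed in this section can be combined in one plot. The following sketch assumes ggplot2 is installed; the data set, limits, breaks and labels are illustrative values, not recommendations.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    # One unit on the x axis is drawn as long as one unit on the y axis
    coord_fixed(ratio = 1) +
    # Points outside the limits are dropped (with a warning) when drawn
    scale_x_continuous(
        limits = c(2, 5),
        breaks = c(2, 3, 4, 5),
        labels = c("2", "3", "4", "max")
    )
p
```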

Aloha's Blog