
Data

The index file data.csv contains 1935 records. The data are split into 10 roughly equal parts for cross-validation of the model.

ID task file structure content language score
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11
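
Since 1935 is not divisible by 10, the ten parts cannot be exactly equal in size. The base-R sketch below works through the arithmetic by assigning fold numbers cyclically. This is only an illustration of the expected fold sizes, not how createFolds() assigns rows, and the names n, k and fold_sizes are made up for the example.

```r
n <- 1935   # number of records in data.csv
k <- 10     # number of folds

# assign fold labels 1..k cyclically, then count the records in each fold
fold_sizes <- table(rep_len(seq_len(k), n))
print(fold_sizes)
```

Every fold ends up with either 193 or 194 records. Folds produced by a stratified split may differ from these sizes slightly.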

Splitting the data

Use the createFolds() function from the R caret package to split the data:


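library(caret)

# read in the index file
data <- read.csv('data.csv')

# create 10 random folds, stratified on score;
# the result is a list of row-index vectors, one per fold
folds <- createFolds(data$score, k = 10)

# add a new column recording the fold number of each row
data$folds <- 0

for (i in seq_along(folds)) {
  data$folds[folds[[i]]] <- i
}

# save the updated index (assumed step: the listing itself never writes the file)
write.csv(data, 'data.csv', row.names = FALSE)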

The result is a new index file data.csv with a folds column:

ID task file structure content language score folds
1 TASK1 161102007511.txt 3.0 3.0 3.0 9.0 3
2 TASK1 161102008210.txt 3.0 3.5 3.0 9.5 4
……
1935 TASK3 161102007425.txt 4.0 3.5 3.5 11 8
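
With the folds column in place, cross-validation can hold out one fold at a time and train on the other nine. The sketch below shows one way this might look. The data frame toy is a small made-up stand-in for data.csv, and lm(score ~ content) is only a placeholder model; the actual model and features are not given here, so both are assumptions.

```r
set.seed(1)

# toy stand-in for data.csv: 50 rows with sub-scores, a total score and a folds column
marks <- seq(1, 5, by = 0.5)
toy <- data.frame(
  structure = sample(marks, 50, replace = TRUE),
  content   = sample(marks, 50, replace = TRUE),
  language  = sample(marks, 50, replace = TRUE)
)
toy$score <- toy$structure + toy$content + toy$language
toy$folds <- sample(rep_len(1:10, nrow(toy)))   # 5 rows per fold, shuffled

# hold out each fold in turn, fit on the rest, and record the held-out RMSE
rmse <- numeric(10)
for (i in 1:10) {
  train <- toy[toy$folds != i, ]
  test  <- toy[toy$folds == i, ]
  fit   <- lm(score ~ content, data = train)   # placeholder model
  pred  <- predict(fit, newdata = test)
  rmse[i] <- sqrt(mean((test$score - pred)^2))
}

# average error across the 10 folds
mean(rmse)
```

Storing the fold number in the index file, rather than re-sampling each run, keeps the split fixed so that different models are compared on the same held-out records.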