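The OP has provided a data frame with only one row. So it is unclear what the expected result is when there are multiple rows with varying numbers of words in Text. Is it required that
- the resulting columns contain the same number of words in every row (if enough words are available), or
- each row is split separately?
Solution for case 1
If the requirement is that each column should contain the same number of words in all rows (if enough words are available), then the row with the most words determines the distribution. The columns of rows with fewer words are filled from the left (left-aligned).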
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID      Text1           Text2           Text3                Text4           Text5
1:  1  This is a very long piece of string. This contains many lines.
2:  2  This is a very long piece   of string. It      contains one or two more words.
3:  3 Short text
4:  4    Shorter
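Rows 1 and 2 contain the same number of words (3 each) in columns Text1 to Text4. Rows with fewer words are filled from the left, and their remaining columns stay empty.
Data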
library(data.table)
DT <- fread(
'ID Text
1 "This is a very long piece of string. This contains many lines."
2 "This is a very long piece of string. It contains one or two more words."
3 "Short text"
4 "Shorter"')
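Explanation
After coercion to data.table, the text in each row is split at whitespace and returned in long format, one word per row (this might be regarded as the equivalent of a time series):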
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID]
ID V1
1: 1 This
2: 1 is
3: 1 a
4: 1 very
5: 1 long
6: 1 piece
7: 1 of
8: 1 string.
9: 1 This
10: 1 contains
11: 1 many
12: 1 lines.
13: 2 This
14: 2 is
15: 2 a
16: 2 very
17: 2 long
18: 2 piece
19: 2 of
20: 2 string.
21: 2 It
22: 2 contains
23: 2 one
24: 2 or
25: 2 two
26: 2 more
27: 2 words.
28: 3 Short
29: 3 text
30: 4 Shorter
ID V1
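A side note on this step, not part of the original answer: the pattern "\\s" matches a single whitespace character, so consecutive blanks in Text would produce empty strings among the words. A minimal base R sketch, assuming such input can occur, is to split on "\\s+" instead. The sample string x is made up for illustration.

```r
# Hypothetical input with a double space between "two" and "words"
x <- "two  words"

# A single-whitespace pattern yields an empty string between the words
single <- strsplit(x, "\\s")[[1]]

# Splitting on runs of whitespace avoids the empty element
runs <- strsplit(x, "\\s+")[[1]]

print(single)  # "two" ""    "words"
print(runs)    # "two"   "words"
```

With the sample data of this answer both patterns give the same result, because the words are separated by single blanks.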
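Then the words are pasted together again using a computed grouping variable, which is created by applying the cut() function to the rowid() numbering in order to form n_brks chunks: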
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]
ID cut V1
1: 1 (0.986,3.8] This is a
2: 1 (3.8,6.6] very long piece
3: 1 (6.6,9.4] of string. This
4: 1 (9.4,12.2] contains many lines.
5: 2 (0.986,3.8] This is a
6: 2 (3.8,6.6] very long piece
7: 2 (6.6,9.4] of string. It
8: 2 (9.4,12.2] contains one or
9: 2 (12.2,15] two more words.
10: 3 (0.986,3.8] Short text
11: 4 (0.986,3.8] Shorter
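The interval labels above follow from how cut() behaves when it is given a number of breaks: it divides the range of all values it receives into equally wide intervals. Because the call sits in by, it sees the word positions of all rows at once, so the longest row (15 words) fixes the bins. A small base R sketch of this behaviour follows; pos_long and pos_short are illustrative names only.

```r
n_brks <- 5L

# Word positions of the longest row (15 words) and of a 12-word row
pos_long  <- 1:15
pos_short <- 1:12

# Bin all positions together, as happens when cut() is evaluated in `by`
bins <- cut(c(pos_long, pos_short), n_brks, labels = FALSE)
bins_long  <- bins[seq_along(pos_long)]
bins_short <- bins[-seq_along(pos_long)]

# Number of words per chunk
print(tabulate(bins_long, n_brks))   # 3 3 3 3 3
print(tabulate(bins_short, n_brks))  # 3 3 3 3 0
```

The 12-word row fills the first four chunks with 3 words each and leaves the fifth one empty, which matches the output shown above.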
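Finally, this result is reshaped from long to wide format to create the expected result. The column headers are created by the rowid() function, and missing values are replaced by "":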
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
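To see how the column headers come about, rowid() can be called on a plain vector. The sketch below assumes the data.table package is installed; the vector id is a made-up stand-in for the ID column in long format.

```r
library(data.table)

# Hypothetical ID vector in long format: three chunks for ID 1, one for ID 2
id <- c(1, 1, 1, 2)

# rowid() numbers the entries within each ID; prefix turns them into labels
headers <- rowid(id, prefix = "Text")
print(headers)  # "Text1" "Text2" "Text3" "Text1"
```

Since the numbering restarts for each ID, the chunks of every row are assigned to Text1, Text2, and so on, starting from the left.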
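Solution for case 2
If the requirement is that each row is split separately and its words are distributed evenly, the number of words in a column will differ from row to row. Rows with fewer words than columns will have at most one word per column.
The solution for this case is a modification of Jaap's suggestion: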
library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
, ri := cut(seq_len(.N), n_brks), by = ID][
, paste(V1, collapse = " "), by = .(ID, ri)][
, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
   ID     Text1           Text2         Text3           Text4                Text5
1:  1 This is a       very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text
4:  4   Shorter
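Now, the number of words in each column varies from row to row. For example, columns Text2 to Text4 hold 2 words each in row 1 but 3 words each in row 2. The 2 words of row 3 are placed in separate columns.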
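The chunk sizes can be checked in base R without data.table. The following is only a sketch of what cut(seq_len(.N), n_brks) does within each ID; chunk_sizes is a hypothetical helper introduced for illustration.

```r
n_brks <- 5L

# Number of words per chunk when the word positions of a single row
# are binned on their own
chunk_sizes <- function(n_words) {
  tabulate(cut(seq_len(n_words), n_brks, labels = FALSE), n_brks)
}

print(chunk_sizes(12L))  # 3 2 2 2 3
print(chunk_sizes(15L))  # 3 3 3 3 3
print(chunk_sizes(2L))   # 1 0 0 0 1
```

For the row with 2 words, cut() assigns the words to the first and the last interval. They still end up in Text1 and Text2 because the column headers are derived from rowid(), not from the interval labels.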