Break up each dataframe row text into five even chunks of text

【Question title】: Break up each dataframe row text into five even chunks of text
【Posted】: 2017-09-30 12:04:15
【Question】:

I am hoping to get some help with this tricky string problem.

Current data frame:

ID  Text
1   This is a very long piece of string. This contains many lines.

I want to turn it into:

ID   Text1            Text2            Text3           Text4         Text5
1    This is a        very long piece  of string.      This contains  many lines.

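The string should be split at an even number of words. In the example above, I am trying to split the row evenly into 5 parts, so each column should contain 20% of the words.

The purpose behind this is to structure the words so that they can be treated as time-series data, since the conversation is simply being split up.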

【Comments】:

Tags: r dataframe

【Solution 1】:

There may be better options, but this works without any additional packages.

First, we create a reproducible example:

df <- data.frame(ID=1:2,
                 Text=c("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.",
                        "Lorem ipsum dolor sit amet, consectetur adipiscing elit"),
                 stringsAsFactors = FALSE)

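Then, chunkize is a wrapper around split + cut, which is the tricky part. It takes a character string, splits it on spaces, cuts the words into n chunks, and returns a data.frame with n columns. (We remove the names so that rbind works.)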

chunkize <- function(chr, n=5){
  x <- strsplit(chr, " ")[[1]]
  df <- as.data.frame(
    lapply(
      split(x, 
            cut(seq_along(x), 
                breaks=n)), 
      paste, collapse=" "), 
    stringsAsFactors = FALSE, col.names=NULL)
  names(df) <- NULL
  df
}
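
To see what the split + cut step inside chunkize does on its own, here is a minimal sketch using the 12-word sentence from the question. The variable names (words, bins, chunks) are only illustrative and are not part of the answer.

```r
# Illustrative sketch: the 12-word sentence from the question
words <- strsplit("This is a very long piece of string. This contains many lines.", " ")[[1]]

# cut() bins the word positions 1..12 into 5 intervals of equal width;
# labels = FALSE returns the bin number instead of a factor label
bins <- cut(seq_along(words), breaks = 5, labels = FALSE)

# split() groups the words by bin, paste() glues each group back together
chunks <- vapply(split(words, bins), paste, character(1), collapse = " ")
print(unname(chunks))
```

Because 12 words cannot be divided into 5 equal groups, the chunks hold 2 or 3 words each rather than exactly 20%.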

Then we simply apply it to each row. We also add the ID column:

df_chunked <- do.call("rbind", 
                      apply(df, 1, 
                         function(x) cbind(x[1], chunkize(x[-1], 5))))

Finally, we rename the columns:

colnames(df_chunked) <- c("ID", paste0("Text", 1:5))

The same thing wrapped into a convenient function:

chunkize_this <- function(df, n=5){
  chunkize <- function(chr, n){
    x <- strsplit(chr, " ")[[1]]
    df <- as.data.frame(
      lapply(
        split(x, 
              cut(seq_along(x), 
                  breaks=n)), 
        paste, collapse=" "), 
      stringsAsFactors = FALSE, col.names=NULL)
    names(df) <- NULL
    df
  }

  df_chunked <- do.call("rbind", 
                        apply(df, 1, function(x) cbind(x[1], chunkize(x[-1], n))))
  colnames(df_chunked) <- c(colnames(df)[1], paste0("Text", 1:n))
  rownames(df_chunked) <- NULL
  df_chunked
}

You can try it with:

View(chunkize_this(df, 3))
View(chunkize_this(df, 5))
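
Since apply() first converts the data frame to a character matrix, the ID column comes back as text. If the original column type matters, a variant along these lines may help. It is only a sketch: chunk_text is a hypothetical helper, not part of the answer, and it returns a plain character vector instead of a one-row data.frame. The small demo data frame is made up for illustration.

```r
# Hypothetical helper: split one string into n chunks, returned as a character vector
chunk_text <- function(chr, n = 5) {
  x <- strsplit(chr, " ")[[1]]
  # keep all n levels so that empty chunks become "" instead of disappearing
  bins <- factor(cut(seq_along(x), breaks = n, labels = FALSE), levels = seq_len(n))
  vapply(split(x, bins), paste, character(1), collapse = " ")
}

# Made-up demo data
demo <- data.frame(ID = 1:2,
                   Text = c("a b c d e f g h i j", "k l m"),
                   stringsAsFactors = FALSE)

# One column of the vapply() result per row of 'demo'; transpose to get one row each
chunks <- t(vapply(demo$Text, chunk_text, character(3), n = 3, USE.NAMES = FALSE))
colnames(chunks) <- paste0("Text", 1:3)

# ID is taken straight from the data frame, so it keeps its integer type
res <- data.frame(ID = demo$ID, chunks, stringsAsFactors = FALSE)
print(res)
```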

Another example:

df <- read.table(h=T, text=
  'ID   Text
  1    "This is a very long piece of string. This contains many lines."
  2    "This is a very long piece of string. It contains one or two more word."
  3    "Short"'
)

> chunkize_this(df, 5)
ID     Text1           Text2         Text3           Text4                Text5
1  1 This is a       very long      piece of    string. This contains many lines.
2  2 This is a very long piece of string. It contains one or       two more word.
3  3                                   Short
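
The single word in row 3 ends up in the middle column rather than in Text1. A small sketch of why, based on how base R's cut() behaves when all values are identical: the five intervals are then built around that one value, so it falls into the middle one. The names bin_single and bin_pair are illustrative.

```r
# One word means one position; cut() builds 5 narrow intervals around it
bin_single <- cut(1, breaks = 5, labels = FALSE)
print(bin_single)

# With two words, the positions land in the first and the last interval
bin_pair <- cut(1:2, breaks = 5, labels = FALSE)
print(bin_pair)
```

If short rows should be left-aligned instead, the bin numbers can be renumbered consecutively, which is what the other answers do with rleid() or rowid().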

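【Comments】:

Wow! Thank you. This works very well. The rows are split evenly, just as I hoped. For further context, I have been trying to apply topic #5 from this paper, which requires this kind of manipulation: arxiv.org/pdf/1605.04462.pdf

【Solution 2】:

Alternative approaches implemented in data.table, base R and the tidyverse. The number of parts can be hard-coded or assigned beforehand: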

# pre-allocating number of parts
np <- 5

The different options:

1) With 'data.table':

library(data.table)

# method 1
setDT(DF)[, strsplit(Text, "\\s"), by = ID
          ][, grp := rleid(cut(1:.N, np)), by = ID
            ][, paste(V1, collapse = " "), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = "V1")]

# method 2
setDT(DF)[, strsplit(Text, ' '), by = ID
          ][, grp := {s <- ceiling(.N/np); rleid(s:(.N+s-1) %/% (.N/np))}, by = ID
            ][, paste(V1, collapse = ' '), by = .(ID, grp)
              ][, dcast(.SD, ID ~ paste0('Text', grp), value.var = 'V1')]

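Both give a result like the following (the output shown matches method 2; with method 1 the first boundary in row 1 falls one word later, giving "This is a" / "very long"):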

   ID     Text1           Text2         Text3           Text4                Text5
1:  1   This is     a very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text            NA              NA                   NA

2) Base R:

# method 1
equal_parts <- function(x, np = 5) {
  n <- cut(seq_along(x), np)
  n <- as.integer(n)
  cumsum(c(1, diff(n) > 0))
}

# method 2
equal_parts <- function(x, np = 5) {
  n <- length(x)
  s <- ceiling(n/np)
  rl <- rle(s:(n+s-1) %/% (n/np))$lengths
  rep(seq_along(rl), rl)
}

DF.long <- stack(setNames(strsplit(DF$Text, ' '), DF$ID))

DF.long$grp <- with(DF.long, ave(values, ind, FUN =  equal_parts))
DF.agg <- aggregate(values ~ ind + grp, DF.long, paste0, collapse = ' ')

reshape(DF.agg, idvar = 'ind', timevar = 'grp', direction = 'wide')

which gives:

  ind  values.1        values.2      values.3        values.4             values.5
1   1   This is     a very long      piece of    string. This contains many lines.
2   2 This is a very long piece of string. It contains one or      two more words.
3   3     Short            text          <NA>            <NA>                 <NA>
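
The two equal_parts variants can be compared directly on a 12-word vector. This is a sketch only; the suffixed names equal_parts_cut and equal_parts_rle are hypothetical and are used here so that both definitions can exist side by side.

```r
equal_parts_cut <- function(x, np = 5) {      # method 1 from above
  n <- as.integer(cut(seq_along(x), np))
  cumsum(c(1, diff(n) > 0))                   # renumber the bins consecutively
}
equal_parts_rle <- function(x, np = 5) {      # method 2 from above
  n <- length(x)
  s <- ceiling(n / np)
  rl <- rle(s:(n + s - 1) %/% (n / np))$lengths
  rep(seq_along(rl), rl)
}

words <- strsplit("This is a very long piece of string. This contains many lines.", " ")[[1]]
g1 <- equal_parts_cut(words)
g2 <- equal_parts_rle(words)
print(rbind(method1 = g1, method2 = g2))

# A two-word row gets groups 1 and 2, so it is left-aligned after reshaping
print(equal_parts_cut(c("Short", "text")))
```

Both return one group number per word, numbered consecutively from 1, which is why short rows fill the columns from the left. The exact boundaries need not coincide between the two variants: the output shown above starts row 1 with "This is" / "a very long", whereas cut() on 12 positions puts three words into the first chunk.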

3) With the 'tidyverse':

library(dplyr)
library(tidyr)
separate_rows(DF, Text) %>% 
  group_by(ID) %>% 
  mutate(grp = equal_parts(Text)) %>%     # using the 'equal_parts'-function from the base R solution
  group_by(grp, add = TRUE) %>%           # in dplyr >= 1.0.0 this argument is named '.add'
  summarise(Text = paste0(Text, collapse = ' ')) %>% 
  spread(grp, Text)

which gives:

# A tibble: 3 x 6
# Groups:   ID [3]
     ID       `1`             `2`           `3`             `4`                  `5`
* <int>     <chr>           <chr>         <chr>           <chr>                <chr>
1     1   This is     a very long      piece of    string. This contains many lines.
2     2 This is a very long piece of string. It contains one or      two more words.
3     3     Short            text          <NA>            <NA>                 <NA>

Data used:

DF <- structure(list(ID = 1:3, Text = c("This is a very long piece of string. This contains many lines.", 
                                        "This is a very long piece of string. It contains one or two more words.", 
                                        "Short text")),
                .Names = c("ID", "Text"), row.names = c(NA, -3L), class = "data.frame")

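【Comments】:

This does not give the correct result imo, because the number of words per column varies from row to row ;-)
@Uwe It does. The OP wants the text split into 5 parts (which also becomes clear when looking at the desired output).
Perhaps you are right, but sample data with only one row leaves room for interpretation and speculation. By the way, my gut feeling is that this might be an X-Y problem.
Thank you very much for this solution. I like using the tidyverse best, so I used it to implement the solution. It works exactly as I hoped. Thanks.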

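【Solution 3】:

The OP has provided a data frame with only one row. So it is unclear what the expected result is when there are multiple rows with differing numbers of words in Text. Is it required that

1. the result columns contain the same number of words (if enough words are available), or
2. each row is split separately?

Solution for case 1

If it is required that each column contains the same number of words across all rows (if enough words are available), then the row with the most words determines the distribution. The columns of rows with fewer words are filled starting from the left (left-aligned).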

library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
    , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

   ID      Text1           Text2           Text3                Text4           Text5
1:  1  This is a very long piece of string. This contains many lines.                
2:  2  This is a very long piece   of string. It      contains one or two more words.
3:  3 Short text                                                                     
4:  4    Shorter

In rows 1 and 2, columns Text1 to Text4 contain the same number of words (3 each). Rows with fewer words than columns are filled from the left.

Data

library(data.table)

DT <- fread(
  'ID   Text
   1    "This is a very long piece of string. This contains many lines."
   2    "This is a very long piece of string. It contains one or two more words."
   3    "Short text"
   4     "Shorter"')

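Explanation

After coercion to data.table, the text in each row is split at word boundaries and returned in long format (which might be regarded as the equivalent of a time series):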

n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID]

    ID       V1
 1:  1     This
 2:  1       is
 3:  1        a
 4:  1     very
 5:  1     long
 6:  1    piece
 7:  1       of
 8:  1  string.
 9:  1     This
10:  1 contains
11:  1     many
12:  1   lines.
13:  2     This
14:  2       is
15:  2        a
16:  2     very
17:  2     long
18:  2    piece
19:  2       of
20:  2  string.
21:  2       It
22:  2 contains
23:  2      one
24:  2       or
25:  2      two
26:  2     more
27:  2   words.
28:  3    Short
29:  3     text
30:  4  Shorter
    ID       V1

The words are then concatenated again using a computed grouping variable, which applies the cut() function to the rowid() numbering to create n_brks chunks:

setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))]

    ID         cut                   V1
 1:  1 (0.986,3.8]            This is a
 2:  1   (3.8,6.6]      very long piece
 3:  1   (6.6,9.4]      of string. This
 4:  1  (9.4,12.2] contains many lines.
 5:  2 (0.986,3.8]            This is a
 6:  2   (3.8,6.6]      very long piece
 7:  2   (6.6,9.4]        of string. It
 8:  2  (9.4,12.2]      contains one or
 9:  2   (12.2,15]      two more words.
10:  3 (0.986,3.8]           Short text
11:  4 (0.986,3.8]              Shorter

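Finally, this result is reshaped from long to wide format to create the expected result. The column headers are created by the rowid() function, and missing values are replaced by "":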

setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , paste(V1, collapse = " "), by = .(ID, cut(rowid(ID), n_brks))][
    , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

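Solution for case 2

If it is required that each row is split separately with the words evenly distributed, then the number of words in each column will vary from row to row. Rows with fewer words than columns have at most one word per column.

The solution for this case is a modification of Jaap's suggestion: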

library(data.table)
n_brks <- 5L
setDT(DT)[, strsplit(Text, "\\s"), by = ID][
  , ri := cut(seq_len(.N), n_brks), by = ID][
    , paste(V1, collapse = " "), by = .(ID, ri)][
      , dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]

   ID     Text1           Text2         Text3           Text4                Text5
1:  1 This is a       very long      piece of    string. This contains many lines.
2:  2 This is a very long piece of string. It contains one or      two more words.
3:  3     Short            text                                                   
4:  4   Shorter

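Now the number of words in each column varies from row to row. For example, columns Text2 to Text4 hold 2 words each in row 1 and 3 words each in row 2. The 2 words of row 3 are placed in separate columns.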
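
The difference between the two cases comes down to where cut() computes its breaks: once over all rows (case 1) or separately within each row (case 2). A base R sketch of just that step, using the first three rows of the sample data; the names pos, id, global_bin and row_bin are illustrative and do not appear in the answer.

```r
texts <- c("This is a very long piece of string. This contains many lines.",
           "This is a very long piece of string. It contains one or two more words.",
           "Short text")
words <- strsplit(texts, " ")

pos <- unlist(lapply(words, seq_along))        # word position within its row, like rowid(ID)
id  <- rep(seq_along(words), lengths(words))   # which row each word belongs to

# Case 1: one set of breaks for all rows, so the longest row (15 words) sets the bins
global_bin <- cut(pos, 5, labels = FALSE)

# Case 2: breaks are computed separately within each row
row_bin <- ave(pos, id, FUN = function(p) cut(p, 5, labels = FALSE))

# Words per bin for row 1 under each scheme
print(tabulate(global_bin[id == 1], nbins = 5))
print(tabulate(row_bin[id == 1], nbins = 5))
```

In the per-row variant the two words of the short row fall into the first and the last bin; the dcast() call above then numbers the chunks consecutively with rowid(), which is why they appear in Text1 and Text2.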

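【Comments】:

This does not give the correct result imo; note that the first row does not have five text groups.
A possible alternative: setDT(DT)[, strsplit(Text, "\\s"), by = ID][, ri := rowid(ID)][, ri := cut(ri, 5), by = ID][, paste(V1, collapse = " "), by = .(ID, ri)][, dcast(.SD, ID ~ rowid(ID, prefix = "Text"), fill = "", value.var = "V1")]
Apologies for any confusion caused. It is the first case of the problem that I am dealing with. Your solution works very well too. Thank you.
Thanks for your feedback and for confirming that you are looking for the case 1 solution. But why did you accept a case 2 solution? (Please don't get me wrong. I am just curious.)
Oh wait. I see what you mean. Case 2 is what I am looking for. I need to re-read that part carefully.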