【Question title】: How to identify repeated words as well as position and number of repeats in sentences
【Posted】: 2020-03-05 16:50:35
【Question】:

I have a dataset of sentences that contain consecutive word repetitions:

Data

df <- data.frame(
  Turn = c("oh is that that steak i got the other night",       # that that
           "no no no i 'm dave and you 're alan",               # no no no
           "yeah i mean the the film was quite long though",    # the the
           "it had steve martin in it it 's a comedy"))         # it it

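Goal

What I would like to obtain is three further columns added to this data frame:

  • df$rep_Word: a column specifying the repeated word
  • df$rep_Pos: a column specifying the position in the sentence of the first repeated occurrence (that is, the second instance of the word)
  • df$rep_Numb: a column specifying how many times the word is repeated

So the expected data frame looks like this:

Expected result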

df
                                            Turn rep_Word rep_Pos rep_Numb
1    oh is that that steak i got the other night     that       4        1
2            no no no i 'm dave and you 're alan       no       2        2
3 yeah i mean the the film was quite long though      the       5        1
4       it had steve martin in it it 's a comedy       it       7        1

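Attempted solution so far

My hunch is that strsplit and the function duplicated can be used to get at the repeated words, their positions, and the number of repeats, for example like this: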

df_split <- apply(df, 2, function(x) strsplit(x, "\\s"))

df_split
$Turn
$Turn[[1]]
 [1] "oh"    "is"    "that"  "that"  "steak" "i"     "got"   "the"   "other" "night"
$Turn[[2]]
 [1] "no"   "no"   "no"   "i"    "'m"   "dave" "and"  "you"  "'re"  "alan"
$Turn[[3]]
 [1] "yeah"   "i"      "mean"   "the"    "the"    "film"   "was"    "quite"  "long"   "though"
$Turn[[4]]
 [1] "it"     "had"    "steve"  "martin" "in"     "it"     "it"     "'s"     "a"      "comedy"

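For the first sentence in df, for example, duplicated shows which word is repeated (namely the word for which duplicated evaluates to TRUE), and the number and position of the repeats can also be read off that information: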

duplicated(df_split$Turn[[1]])
 [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

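The problem is that I don't know how to work with duplicated so as to obtain the desired additional columns in df. Help with this is much appreciated.

【Question comments】:

    Tags: r duplicates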


    【Solution 1】:

    Here is another way to solve the problem.

    df <- data.frame(
      Turn = c("oh is that that steak i got the other night",  # that that
               "no no no i 'm dave and you 're alan",               # no no no
               "yeah i mean the the film was quite long though",    # the the
               "it had steve martin in it it 's a comedy",         # it it)
               "it had steve martin in in it it 's a comedy",
               "yeah i mean the film was quite long though", 
               "hi hi then other words and hi hi again",
               "no no no i 'm dave yes yes and you 're alan no no no no"))  # no no no and no no no no
    
    library(data.table)
    cols <- c("rep_Word", "rep_Pos", "rep_Numb")
    setDT(df)[, (cols) := {
      words <- strsplit(as.character(Turn), " ")[[1]]
      idx <- rleid(words)
      check <- duplicated(idx)
      chg <- check - shift(check, fill = FALSE)
      starts <- which(chg == 1)
      aend <- if(sum(chg) == 0L) which(chg == -1) else c(which(chg == -1), length(chg) + 1L)
      freq <- aend - starts
      wrd <- words[starts]
      no_dup_default <- .(.(NA_character_), .(NA_integer_), .(NA_integer_))
      if(length(wrd)) .(.(wrd), .(starts), .(freq)) else no_dup_default
    }, seq.int(nrow(df))]
    
    
    df
    #                                                       Turn   rep_Word  rep_Pos rep_Numb
    # 1:             oh is that that steak i got the other night       that        4        1
    # 2:                     no no no i 'm dave and you 're alan         no        2        2
    # 3:          yeah i mean the the film was quite long though        the        5        1
    # 4:                it had steve martin in it it 's a comedy         it        7        1
    # 5:             it had steve martin in in it it 's a comedy      in,it      6,8      1,1
    # 6:              yeah i mean the film was quite long though         NA       NA       NA
    # 7:                  hi hi then other words and hi hi again      hi,hi      2,8      1,1
    # 8: no no no i 'm dave yes yes and you 're alan no no no no  no,yes,no  2, 8,14    2,1,3
    #                
    
    # or
    df[, lapply(.SD, unlist), seq.int(nrow(df))][, -1]
    #                                                        Turn rep_Word rep_Pos rep_Numb
    #  1:             oh is that that steak i got the other night     that       4        1
    #  2:                     no no no i 'm dave and you 're alan       no       2        2
    #  3:          yeah i mean the the film was quite long though      the       5        1
    #  4:                it had steve martin in it it 's a comedy       it       7        1
    #  5:             it had steve martin in in it it 's a comedy       in       6        1
    #  6:             it had steve martin in in it it 's a comedy       it       8        1
    #  7:              yeah i mean the film was quite long though     <NA>      NA       NA
    #  8:                  hi hi then other words and hi hi again       hi       2        1
    #  9:                  hi hi then other words and hi hi again       hi       8        1
    # 10: no no no i 'm dave yes yes and you 're alan no no no no       no       2        2
    # 11: no no no i 'm dave yes yes and you 're alan no no no no      yes       8        1
    # 12: no no no i 'm dave yes yes and you 're alan no no no no       no      14        3
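
    The intermediate vectors in the data.table code are easier to follow on a single sentence. The sketch below traces them in base R only. The cumsum() expression is a stand-in for rleid(), the manual lag is a stand-in for shift(), and the way the last stretch is closed is a simplification; all three are assumptions made for illustration, not part of the answer.

```r
# One sentence with two separate stretches of repeats.
words <- strsplit("no no no i 'm dave yes yes", " ")[[1]]
n <- length(words)

# Run id: increases by one whenever the word changes (stand-in for rleid()).
idx <- cumsum(c(TRUE, words[-1] != words[-n]))

# TRUE for every word that repeats the word directly before it.
check <- duplicated(idx)

# +1 where a stretch of repeats begins, -1 on the first word after it ends
# (manual lag as a stand-in for shift(check, fill = FALSE)).
chg <- check - c(FALSE, check[-n])

starts <- which(chg == 1)
ends   <- which(chg == -1)
# A stretch that runs to the end of the sentence has no closing -1,
# so close it one position past the last word.
if (length(ends) < length(starts)) ends <- c(ends, n + 1L)

rep_Word <- words[starts]   # "no"  "yes"
rep_Pos  <- starts          # 2 8
rep_Numb <- ends - starts   # 2 1
```

    The three result vectors correspond to the list columns that the answer stores per row.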
    

    【Comments】:

      【Solution 2】:

      One option using purrr, dplyr, and tibble could be:

      library(dplyr)
      library(purrr)
      library(tibble)

      bind_cols(df, 
                map_dfr(strsplit(df$Turn, " ", fixed = TRUE), 
                        ~ enframe(., value = "rep_word") %>%
                         group_by(rleid = with(rle(rep_word), rep(seq_along(lengths), lengths))) %>%
                         filter(n() > 1) %>%
                         summarise(rep_word = first(rep_word),
                                   rep_pos = nth(name, 2),
                                   rep_number = n()-1) %>%
                         select(-rleid) %>%
                         summarise_all(toString)))
      
                                                  Turn rep_word rep_pos rep_number
      1    oh is that that steak i got the other night     that       4          1
      2            no no no i 'm dave and you 're alan       no       2          2
      3 yeah i mean the the film was quite long though      the       5          1
      4       it had steve martin in it it 's a comedy       it       7          1
      

      【Comments】:

      • I get this error: Error in strsplit(df$Turn, " ", fixed = TRUE) : non-character argument
      • You need to import your data with stringsAsFactors = FALSE.
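
      The error in the comment above can be reproduced when Turn is a factor, which was the data.frame() default before R 4.0.0. A small sketch, with the object names df_fct, df_chr, and msg made up for illustration:

```r
sentences <- c("no no no i 'm dave", "it it 's a comedy")

# Turn stored as a factor, as data.frame() did by default before R 4.0.0.
df_fct <- data.frame(Turn = sentences, stringsAsFactors = TRUE)

# strsplit() rejects factors; capture the message instead of stopping.
msg <- tryCatch(strsplit(df_fct$Turn, " ", fixed = TRUE),
                error = function(e) conditionMessage(e))

# Fix 1: build the data frame with character columns.
df_chr <- data.frame(Turn = sentences, stringsAsFactors = FALSE)
split_chr <- strsplit(df_chr$Turn, " ", fixed = TRUE)

# Fix 2: convert the factor on the fly.
split_fct <- strsplit(as.character(df_fct$Turn), " ", fixed = TRUE)
```

      Both fixes give the same list of word vectors, so either should work with the pipeline above.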
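      【Solution 3】:

      Here is a base R answer that relies on converting the words to a factor. It also handles 1) sentences with no repeated words and 2) sentences in which more than one word is repeated.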

         ID                                                    Turn rep_Word rep_Pos rep_Numb
      1   1             oh is that that steak i got the other night     that       4        1
      2   2                     no no no i 'm dave and you 're alan       no       2        2
      3   3          yeah i mean the the film was quite long though      the       5        1
      4   4                it had steve martin in it it 's a comedy       it       7        1
      5   5             it had steve martin in in it it 's a comedy       in       6        1
      6   5             it had steve martin in in it it 's a comedy       it       8        1
      7   6              yeah i mean the film was quite long though     <NA>      NA        0
      8   7                  hi hi then other words and hi hi again       hi       2        1
      9   7                  hi hi then other words and hi hi again       hi       8        1
      10  8 no no no i 'm dave yes yes and you 're alan no no no no       no       2        2
      11  8 no no no i 'm dave yes yes and you 're alan no no no no      yes       8        1
      12  8 no no no i 'm dave yes yes and you 're alan no no no no       no      14        3
      

      Code for the above:

      l = list("oh is that that steak i got the other night",       # that that
                  "no no no i 'm dave and you 're alan",               # no no no
                  "yeah i mean the the film was quite long though",    # the the
                  "it had steve martin in it it 's a comedy",         # it it)
               "it had steve martin in in it it 's a comedy",
               "yeah i mean the film was quite long though", 
               "hi hi then other words and hi hi again",
               "no no no i 'm dave yes yes and you 're alan no no no no")
      
      n = length(l)
      ans = vector('list', length = n)
      
      for (i in seq_len(n)){
        sentence = l[[i]]
        words_fct = factor(strsplit(sentence, " ", fixed = TRUE)[[1L]])
        levs = as.integer(words_fct)
        inds = which(diff(levs) == 0L)
      
        rep_Numb = length(inds)
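        # Note: length(rep_Numb > 1L) is always 1, so this branch always runs.
        # It also handles sentences with no repeat (inds becomes NA) or one stretch.
        if (length(rep_Numb > 1L)) {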
          diffs = diff(inds) 
          diffs_eq_1 = diffs == 1L
          if (all(diffs_eq_1)) {
            inds = inds[1L]
          } else {
            inds = inds[c(TRUE, !diffs_eq_1)]
            sums = cumsum(diffs_eq_1)
            rep_Numb = c(sums[!diffs_eq_1], sums[length(sums)]) - c(0L, sums[!diffs_eq_1]) + 1L
          }
        }
        ans[[i]] = data.frame(ID = i,
                              Turn = sentence,
                              rep_Word = levels(words_fct)[levs[inds]],
                              rep_Pos = inds + 1L,
                              rep_Numb)
      }
      
      do.call(rbind, ans)
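
      The core of the loop is the test diff(levs) == 0L: two neighbouring words that are identical get the same integer code, so their difference is zero. The trace below runs that step on one sentence chosen for illustration. The grouping with cumsum() and table() is a simplified stand-in for the loop's bookkeeping, not the answer's own code, and it assumes the sentence has at least one repeat.

```r
words <- strsplit("hi hi then other words and hi hi hi", " ", fixed = TRUE)[[1L]]
words_fct <- factor(words)
levs <- as.integer(words_fct)   # same word -> same integer code

# Index i is returned when word i and word i + 1 are identical.
inds <- which(diff(levs) == 0L)   # 1 7 8

# Consecutive indices belong to the same stretch of repeats;
# a gap larger than 1 starts a new stretch.
new_stretch <- c(TRUE, diff(inds) != 1L)

rep_Pos  <- inds[new_stretch] + 1L                   # 2 8
rep_Numb <- as.integer(table(cumsum(new_stretch)))   # 1 2
rep_Word <- words[rep_Pos]                           # "hi" "hi"
```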
      

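      【Comments】:

      • Does this solution also account for sentences in which more than one word is repeated, such as yeah yeah that 's right told told you?
      • Yes? What is the expected result?
      • But in the answer you say "for #2, this picks the first repeated word"; that is, in the example above it only counts yeah yeah and ignores told told, right?
      • Well, I said yes? because there is no expected output for that scenario. What is the expected result for hi hi then other words and hi hi again?
      • I made an edit to include multiple instances. Instead of pasting the results together, it includes one row for each repeated word.
      【Solution 4】:

      duplicated also flags non-consecutive repeats: in row 4 it marks every "it" after the first one, not just the adjacent pair. Using rle is therefore probably the better choice.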

      v.rle <- lapply(strsplit(as.character(df$Turn), " "), rle)
      v.rle.l <- mapply(`[`, v.rle, "lengths")
      v.rle.v <- mapply(`[`, v.rle, "values")
      res <- within(df, {
        rep_Pos <- mapply(function(x) el(which(x > 1)) + 1, v.rle.l)
        rep_Numb <- mapply(`[`, v.rle.l, rep_Pos - 1) - 1
        rep_Word <- mapply(`[`, v.rle.v, rep_Pos - 1)
      })
      res
      #                                             Turn rep_Word rep_Numb rep_Pos
      # 1    oh is that that steak i got the other night     that        1       4
      # 2            no no no i 'm dave and you 're alan       no        2       2
      # 3 yeah i mean the the film was quite long though      the        1       5
      # 4       it had steve martin in it it 's a comedy       it        1       7
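
      The difference between the two functions shows up on row 4 of the example data. duplicated() looks at the whole sentence, while rle() only groups neighbouring words. A quick check in base R (the object names are arbitrary):

```r
words <- strsplit("it had steve martin in it it 's a comedy", " ")[[1]]

# duplicated() flags every "it" after the first one, adjacent or not.
dup_pos <- which(duplicated(words))
dup_pos
# [1] 6 7

# rle() collapses only neighbouring identical words into runs.
r <- rle(words)
run_word <- r$values[r$lengths > 1]
run_word
# [1] "it"

# Run length minus one = repeats beyond the first occurrence.
run_reps <- r$lengths[r$lengths > 1] - 1
run_reps
# [1] 1
```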
      

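      Edit

      To fully account for sentences with several repeated words, or with none at all, you may want to use the adapted version below. If there are several dupes, it shows the positions and words separated by commas; if there are none, it shows NA.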

      df2 <- data.frame(
        Turn = c("oh is that that steak i got the other night",  # that that
                 "no no no i 'm dave and you 're alan",          # no no no
                 "yeah i mean the film was quite long though",   # no repeat
                 "it had steve martin in in it it 's a comedy")) # in in, it it
      
      v.rle <- lapply(STRSP <- strsplit(as.character(df2$Turn), " "), rle)
      v.rle.l <- mapply(`[`, v.rle, "lengths")
      v.rle.v <- mapply(`[`, v.rle, "values")
      
      res <- within(df2, {
        rep_Pos <- mapply(function(x) {
          w <- which(x > 1) + 1
          if (length(w) == 0) NA 
          else if (length(w) > 1) cbind(w + seq(w) - 1)
          else w
        }, v.rle.l)
        rep_Numb <- mapply(function(x) cbind(x[x > 1] - 1), v.rle.l)  # run length minus one
        rep_Numb[lengths(rep_Numb) == 0] <- NA
        rep_Word <- sapply(mapply(`[`, STRSP, lapply(rep_Pos, `-`, 1)), cbind)
      })
      res
      #                                          Turn rep_Word rep_Numb rep_Pos
      # 1 oh is that that steak i got the other night     that        1       4
      # 2         no no no i 'm dave and you 're alan       no        2       2
      # 3  yeah i mean the film was quite long though       NA       NA      NA
      # 4 it had steve martin in in it it 's a comedy   in, it     1, 1    6, 8
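
      The position arithmetic w + seq(w) - 1 in the adapted version appears to give the right answer only when every earlier repeated word occurs exactly twice. A more general sketch derives the positions from the cumulative run lengths instead. find_repeats is a hypothetical helper name, not part of the answer, and it returns one row per stretch of repeats rather than comma-separated cells.

```r
# Hypothetical helper: one row per stretch of consecutive repeats.
find_repeats <- function(sentence) {
  r <- rle(strsplit(sentence, " ", fixed = TRUE)[[1]])
  # Position of the first word of each run.
  run_start <- cumsum(r$lengths) - r$lengths + 1L
  hit <- r$lengths > 1L
  if (!any(hit)) {
    # No consecutive repeats: return a single NA row.
    return(data.frame(rep_Word = NA_character_, rep_Pos = NA_integer_,
                      rep_Numb = NA_integer_, stringsAsFactors = FALSE))
  }
  data.frame(rep_Word = r$values[hit],
             rep_Pos  = run_start[hit] + 1L,  # second occurrence of the word
             rep_Numb = r$lengths[hit] - 1L,  # repeats beyond the first occurrence
             stringsAsFactors = FALSE)
}

res <- find_repeats("no no no i 'm dave yes yes and you 're alan no no no no")
res
#   rep_Word rep_Pos rep_Numb
# 1       no       2        2
# 2      yes       8        1
# 3       no      14        3
```

      Applying it to every row could be done with lapply(as.character(df2$Turn), find_repeats) followed by do.call(rbind, ...), in the same way Solution 3 combines its per-sentence results.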
      

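      【Comments】:

      • @ChrisRuehlemann I just fixed a typo in the code, does it work now?
      • @ChrisRuehlemann You're right, I must have overlooked that :) See the edit.
      • @ChrisRuehlemann Yes, there can be different word repeats in one row, can you confirm?
      • @ChrisRuehlemann I have now fixed the code so that it counts (only) the first repeated word. It should no longer throw an error.
      • @ChrisRuehlemann Just assign it with res <- within(.), see the edit.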