【Question title】: Add a column of listed keywords (strings) based on a text column
【Posted】: 2018-07-08 06:32:39
【Question】:

If I have a data frame containing the following column:

df$text <- c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example")

and a vector of strings like this:

keywords <- c("not that long", "This string", "example", "helps")

I am trying to add a column to my data frame that contains the list of keywords present in each row's text:

df$keywords:

1 c("This string","not that long")    
2 c("This string","not that long")    
3 c("helps","example")

However, I am not sure how to 1) extract the matching words from the text column, and 2) list the matched words in each row of the new column.

【Comments】:

    Tags: r string dataframe


    【Solution 1】:

    Perhaps something like this:

    df = data.frame(text=c("This string is not that long", "This string is a bit longer but still not that long", "This one just helps with the example"))
    keywords <- c("not that long", "This string", "example", "helps")
    
    df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords,grepl,x)]})
    

    Output:

                                                     text                   keywords
    1                        This string is not that long not that long, This string
    2 This string is a bit longer but still not that long not that long, This string
    3                This one just helps with the example             example, helps
    

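    The outer lapply iterates over df$text, and the inner sapply checks whether each element of keywords occurs in the current element of df$text. A slightly longer, but perhaps easier to read, equivalent is: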

    df$keywords = lapply(df$text, function(x) {keywords[sapply(keywords, function(y){grepl(y,x)})]})
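One caveat with the approach above: grepl treats each keyword as a regular expression by default, so a keyword containing characters such as "." or "(" may match more (or less) than intended. The following is a minimal sketch of the same idea with literal matching via fixed = TRUE. The helper name find_keywords is hypothetical and not part of the original answer, and stringsAsFactors = FALSE is added on the assumption that the text column should stay a character vector on older R versions.

```r
# Same sample data as in the question.
df <- data.frame(
  text = c("This string is not that long",
           "This string is a bit longer but still not that long",
           "This one just helps with the example"),
  stringsAsFactors = FALSE
)
keywords <- c("not that long", "This string", "example", "helps")

# Hypothetical helper: return the keywords that occur literally in one string.
find_keywords <- function(x, keywords) {
  # fixed = TRUE matches the keyword as plain text, not as a regex.
  hits <- vapply(keywords, function(k) grepl(k, x, fixed = TRUE), logical(1))
  keywords[hits]
}

# One character vector per row, stored as a list column.
df$keywords <- lapply(df$text, find_keywords, keywords = keywords)
```

With this version the matches in each row come back in the order of the keywords vector, as in the output shown above.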
    

    Hope this helps!

    【Comments】:

      【Solution 2】:

      We can extract them with str_extract_all from stringr:

      library(stringr)
      df$keywords <- str_extract_all(df$text, paste(keywords, collapse = "|"))
      df
      #                                                text                   keywords
      #1                        This string is not that long This string, not that long
      #2 This string is a bit longer but still not that long This string, not that long
      #3                This one just helps with the example             helps, example
      

      Or in a pipe chain:

      library(dplyr)
      df %>%
         mutate(keywords = str_extract_all(text, paste(keywords, collapse = "|")))
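For readers who prefer not to add a package, the same alternation pattern can be applied with base R. This is a sketch under the assumption that the keywords contain no regex metacharacters (true for the sample data); the variable names pattern, overlapping, and sentence are illustrative only. It also shows one behaviour worth knowing: an alternation returns non-overlapping matches in order of appearance, so overlapping keywords can be missed, whereas testing each keyword separately with grepl finds them all.

```r
df <- data.frame(
  text = c("This string is not that long",
           "This string is a bit longer but still not that long",
           "This one just helps with the example"),
  stringsAsFactors = FALSE
)
keywords <- c("not that long", "This string", "example", "helps")

# Build one alternation pattern and pull out every match per row.
pattern <- paste(keywords, collapse = "|")
df$keywords <- regmatches(df$text, gregexpr(pattern, df$text))

# Overlap caveat: "This string" and "string is" share the word "string".
overlapping <- c("This string", "string is")
sentence <- "This string is not that long"

# Alternation consumes "This string" first, so "string is" is not reported.
via_alternation <- regmatches(
  sentence, gregexpr(paste(overlapping, collapse = "|"), sentence)
)[[1]]

# Testing each keyword on its own reports both.
via_grepl <- overlapping[
  vapply(overlapping, grepl, logical(1), x = sentence, fixed = TRUE)
]
```

Here the matches in each row appear in the order they occur in the text, which is why this answer's output lists "This string" before "not that long".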
      

      【Comments】:
