【Question title】:How do I include rows from a dataframe that contain certain keywords
【Posted】:2026-02-09 10:20:05
【Question description】:

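I'm analyzing reddit threads for an assignment, and I only want to include threads that contain certain keywords.

I have a list of keywords: keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

The data frame has 3 columns. I only want to include rows where the column named title contains one of the keywords.

For example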

  title                                                  created_utc
1 Anyone have a RH wallet yet? Asking for a friend       164128421
2 Ravi Menon, managing director of the Monetary Auth...  164131283
3 Different Augmented Reality(AR) NFT apps and marke...  164134123

keywordstest2 <- paste0(keywords, collapse = "|")
dfsub %>% filter(grepl(keywordstest2, title))

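I tried this, and obviously it didn't work.

Does anyone know how to do this? Thanks :D

【Discussion】:

  • Sorry, I just accepted. Thanks for your help.

Tags: r dplyr filter reddit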


【Solution 1】:

This should work.

library(tidyverse)

dfsub %>% 
filter(grepl('addict|addicted|addiction|addictive|afraid|anxiety|anxious|cry|crying|delusion|delusional', title))

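【Discussion】:

  • Thank you very much for your help.
【Solution 2】:

You can try this. I extended the example so that it contains 2 of the keywords.

Edit: as Merijn mentioned in the comments, I added word boundaries \\b to exclude false positives, because grepl does partial matching.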

library(dplyr)

keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
"anxious", "cry", "crying", "delusion", "delusional")

df %>% 
  filter(grepl(paste0("\\b", paste(keywords, collapse = "\\b|\\b"), "\\b"), title))

  id                                                             title created_utc
1  1         Anyone have a RH wallet yet? Asking delusion for a friend   164128421
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123

Data

df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
"Ravi Menon managing director of the Monetary Auth... crypto", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
-3L))
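
The effect of the word boundaries can be checked in isolation. The following is a minimal base R sketch, not part of the original answer; the `titles` vector is made up for illustration, and `ignore.case = TRUE` is an added assumption so that capitalized words such as "Anxiety" also match.

```r
# Hypothetical sample titles, invented for this sketch
keywords <- c("cry", "crying", "anxiety")
titles <- c("I cry a lot", "crypto is up", "Anxiety about NFTs")

# Plain alternation: matches the keyword anywhere, even inside a longer word
plain <- paste(keywords, collapse = "|")

# Same alternation wrapped in word boundaries: whole words only
bounded <- paste0("\\b(", plain, ")\\b")

plain_hits   <- grepl(plain, titles, ignore.case = TRUE)
bounded_hits <- grepl(bounded, titles, ignore.case = TRUE, perl = TRUE)

print(data.frame(titles, plain_hits, bounded_hits))
```

Wrapping the whole alternation in one group, `\\b(a|b|c)\\b`, should behave the same as putting `\\b` around every keyword, and it keeps the pattern shorter.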

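【Discussion】:

  • @BenJohnson Well, it skipped the second row in the example, so it should work correctly. Have you checked names such as df, title, and keywords? Maybe there is a typo?
  • Never mind, I got it working. Thank you very much for your help.
  • Hi, sorry, 1 more question. If I want to export the filtered dataset as a csv, what is the best way to do that?
  • @BenJohnson Use write.csv() for that (type ?write.csv for more help).
  • This will also match crypto because of cry (if present)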
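
The write.csv() suggestion from the comments could look roughly like this. It is only a sketch: `filtered` stands in for whatever filtered data frame you end up with, and the output path is a temporary file chosen for illustration.

```r
# `filtered` is a hypothetical stand-in for the filtered data frame
filtered <- data.frame(
  id = c(1L, 3L),
  title = c("Asking delusion for a friend", "NFT apps and anxiety"),
  created_utc = c(164128421L, 164134123L)
)

# Illustrative path; replace it with a real file name such as "filtered.csv"
out_file <- tempfile(fileext = ".csv")

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(filtered, out_file, row.names = FALSE)

# Read the file back to confirm what was written
check <- read.csv(out_file)
print(check)
```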
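【Solution 3】:

Here is another tidyverse option. I collapse your keywords into a single search pattern (e.g., addict or addicted or ...). Then I use str_detect on title to look for any of those keywords and, if one is found, keep the row (using filter).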

library(tidyverse)

df %>% 
  filter(str_detect(title, paste(keywords, collapse = "|")))

Or in base R, you can filter in one line:

df[grep(paste(keywords, collapse = "|"),df$title),]

Output

  id                                                             title created_utc
1  1         Anyone have a RH wallet yet? Asking delusion for a friend   164128421
2  3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123

Data

df <-
 structure(
   list(
     id = 1:3,
     title = c(
       "Anyone have a RH wallet yet? Asking delusion for a friend",
       "Ravi Menon managing director of the Monetary Auth...",
       "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
     ),
     created_utc = c(164128421L, 164131283L, 164134123L)
   ),
   class = "data.frame",
   row.names = c(NA,-3L)
 )

keywords <- c('addict', 'addicted', 'addiction','addictive', 'afraid' ,'anxiety','anxious','cry','crying','delusion','delusional')

To filter on 2 columns, you can do the following:

df %>%
  filter(Reduce(`|`, across(
    c(title, selftext), .fns = ~ str_detect(., paste(keywords, collapse = "|"))
  )))
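
The same "keep the row if any of several columns matches" idea can also be sketched in base R. This is an illustration under assumptions, not the answerer's code: the sample data above has no selftext column, so the `posts` data frame here, including its `selftext` values, is invented, and the word boundaries and `ignore.case = TRUE` are additions.

```r
keywords <- c("anxiety", "delusion")
pattern <- paste0("\\b(", paste(keywords, collapse = "|"), ")\\b")

# Hypothetical data with a selftext column, invented for this sketch
posts <- data.frame(
  title = c("RH wallet", "Monetary Auth", "AR NFT apps and anxiety"),
  selftext = c("pure delusion", "nothing here", "more text"),
  stringsAsFactors = FALSE
)

# One logical vector per column, then a row-wise OR across the columns
cols <- c("title", "selftext")
hits <- lapply(posts[cols], function(col) {
  grepl(pattern, col, ignore.case = TRUE, perl = TRUE)
})
keep <- Reduce(`|`, hits)

result <- posts[keep, ]
print(result)
```

Recent dplyr versions also provide if_any(), which expresses the same row-wise OR inside filter() without Reduce.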

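【Discussion】:

    【Solution 4】:

    For this relatively small keyword list, collapsing with | is an option, but you will run into problems when the string to match becomes too large. The answers given so far also match "crypto" because of the keyword "cry". I adjusted df slightly to include the word "crypto".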

    df <- structure(list(id = 1:3, title = c("Anyone have a RH wallet yet? Asking delusion for a friend",
    "Ravi Menon managing director of the Monetary crypto Auth...", "Different Augmented Reality(AR) NFT apps and anxiety and marke..."
    ), created_utc = c(164128421L, 164131283L, 164134123L)), class = "data.frame", row.names = c(NA,
    -3L))
    
    keywords <- c("addict", "addicted", "addiction", "addictive", "afraid", "anxiety",
    "anxious", "cry", "crying", "delusion", "delusional")
    
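    library(dplyr)    # for %>%, group_by() and filter()
    library(stringi)  # for stri_extract_all_words() and stri_trans_tolower()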
    
    df %>% 
      group_by(id) %>%
      filter(any(stri_trans_tolower(stri_extract_all_words(title)[[1]]) %in% keywords))
    
    # # A tibble: 2 x 3
    # # Groups:   id [2]
    #      id title                                                             created_utc
    #   <int> <chr>                                                                   <int>
    # 1     1 Anyone have a RH wallet yet? Asking delusion for a friend           164128421
    # 2     3 Different Augmented Reality(AR) NFT apps and anxiety and marke...   164134123
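
    The whole-word lookup above can be approximated in base R without stringi. This is a rough sketch under one loud assumption: splitting on `[^a-z]+` only handles ASCII letters, whereas stri_extract_all_words uses proper word-boundary rules. The `titles` vector is made up for illustration.

```r
keywords <- c("cry", "anxiety", "delusion")

# Hypothetical titles, invented for this sketch
titles <- c(
  "Asking delusion for a friend",
  "Monetary crypto Auth...",
  "NFT apps and Anxiety"
)

# Lowercase each title and split it on runs of non-letters (ASCII only)
words <- strsplit(tolower(titles), "[^a-z]+")

# Keep a title when at least one of its words is exactly a keyword
keep <- vapply(words, function(w) any(w %in% keywords), logical(1))

print(titles[keep])
```

    Because the comparison is an exact `%in%` lookup rather than a regular expression, the size of the keyword list does not affect pattern length, and "crypto" is not matched by "cry".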
    

    【Discussion】:
