Selecting data frame rows based on partial string match in a column: answers

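【Question title】: Selecting data frame rows based on partial string match in a column
【Posted】: 2021-09-08 00:25:36
【Question】:

I want to select rows from a data frame based on a partial match of a string in a column, e.g. column "x" contains the string "hsa". Using sqldf - if it had a like syntax - I would do something like: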

select * from <> where x like 'hsa'.

Unfortunately, sqldf does not support that syntax.

Or similarly:

selectedRows <- df[ , df$x %like% "hsa-"]

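which of course doesn't work.

Can somebody please help me with this?

【Question discussion】:

Could you post a few lines of your data, preferably using something like dput(head(conservedData)).

Tags: r regex string match subset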

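【Solution 1】:

I notice that you mention a function %like% in your current approach. I don't know whether that refers to the %like% from "data.table", but if it does, you can definitely use it as follows.

Note that the object does not have to be a data.table (but also remember that subsetting works differently for data.frames and data.tables):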

library(data.table)
mtcars[rownames(mtcars) %like% "Merc", ]
iris[iris$Species %like% "osa", ]

If that is what you had, then perhaps you had just mixed up the row and column positions when subsetting the data.
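To make the row-versus-column point concrete, here is a minimal sketch in base R. The data frame `d` and its contents are made up for illustration, and `grepl()` stands in for `%like%` so that no package is needed (as far as I know, `%like%` is a thin wrapper around `grepl()`).

```r
# Hypothetical data; the column name x follows the question.
d <- data.frame(x = c("hsa-let-7a", "mmu-let-7a", "hsa-miR-21"),
                score = c(1.2, 3.4, 5.6),
                stringsAsFactors = FALSE)

keep <- grepl("hsa-", d$x)   # one TRUE/FALSE flag per row

# Logical vector BEFORE the comma: selects rows.
selectedRows <- d[keep, ]

# Logical vector AFTER the comma (as in the question) tries to select columns.
# Here it fails because there are 3 flags but only 2 columns.
wrong <- tryCatch(d[, keep], error = function(e) "error")
```

The only change from the question's attempt is the position of the comma: df[flags, ] filters rows, while df[, flags] filters columns.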

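If you don't want to load a package, you can try using grep() to search for the string you're matching. Here's an example with the mtcars dataset, where we match all rows whose row names include "Merc":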

mtcars[grep("Merc", rownames(mtcars)), ]
#              mpg cyl  disp  hp drat   wt qsec vs am gear carb
# Merc 240D   24.4   4 146.7  62 3.69 3.19 20.0  1  0    4    2
# Merc 230    22.8   4 140.8  95 3.92 3.15 22.9  1  0    4    2
# Merc 280    19.2   6 167.6 123 3.92 3.44 18.3  1  0    4    4
# Merc 280C   17.8   6 167.6 123 3.92 3.44 18.9  1  0    4    4
# Merc 450SE  16.4   8 275.8 180 3.07 4.07 17.4  0  0    3    3
# Merc 450SL  17.3   8 275.8 180 3.07 3.73 17.6  0  0    3    3
# Merc 450SLC 15.2   8 275.8 180 3.07 3.78 18.0  0  0    3    3

And another example, using the iris dataset and searching for the string osa:

irisSubset <- iris[grep("osa", iris$Species), ]
head(irisSubset)
#   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
# 1          5.1         3.5          1.4         0.2  setosa
# 2          4.9         3.0          1.4         0.2  setosa
# 3          4.7         3.2          1.3         0.2  setosa
# 4          4.6         3.1          1.5         0.2  setosa
# 5          5.0         3.6          1.4         0.2  setosa
# 6          5.4         3.9          1.7         0.4  setosa

For your problem, try:

selectedRows <- conservedData[grep("hsa-", conservedData$miRNA), ]
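Because grep() treats its pattern as a regular expression by default, "hsa-" matches anywhere in the string. The sketch below shows how anchoring and fixed = TRUE change the result. The contents of conservedData here are invented for illustration; only the column name miRNA comes from the line above.

```r
# Hypothetical stand-in for conservedData; only the miRNA column matters here.
conservedData <- data.frame(
  miRNA = c("hsa-let-7a", "mmu-let-7a", "pre-hsa-miR-21", "hsa.miR.1"),
  stringsAsFactors = FALSE
)

# Unanchored: matches "hsa-" anywhere in the string.
anywhere <- conservedData[grep("hsa-", conservedData$miRNA), , drop = FALSE]

# Anchored with ^: matches only strings that start with "hsa-".
atStart <- conservedData[grep("^hsa-", conservedData$miRNA), , drop = FALSE]

# fixed = TRUE: the pattern is taken literally, so "." is a plain dot
# rather than the regex "any character".
literalDot <- conservedData[grep("hsa.", conservedData$miRNA, fixed = TRUE), ,
                            drop = FALSE]
```

drop = FALSE is used only because this toy data frame has a single column; without it, the subset would collapse to a plain vector.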

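【Discussion】:

+1: Also note that grep supports regular expressions, so you may want to use ^hsa- as the pattern to match only at the start of the string.
@nico: Actually, grep comes from the ed command g/re/p (global / regular expression / print), and its real power only reveals itself to the regex-fu master ;-): en.wikipedia.org/wiki/Grep
The %like% suggestion is great! I recommend putting it at the top of your answer.
@ArenCambre, done. Maybe it will help me get 11 more votes so I can get a new hat before the end of the year :-)
@A5C1D2H2I1M1N2O1R2T1 Great answer! Is there a way to use %like% to search for two strings that occur together (such as "pet" and "pip" in "peter piper", within one row of the data frame)?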
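One possible way to handle the two-string case from the last comment, sketched in base R with invented data. It uses grepl() rather than %like%; since %like% also yields a logical vector, combining two %like% tests with & should presumably work the same way, but that is an assumption and is not shown here.

```r
# Hypothetical data for illustration.
people <- data.frame(name = c("peter piper", "peter pan", "pippi"),
                     stringsAsFactors = FALSE)

# Both substrings must occur, in any order.
both <- people[grepl("pet", people$name) & grepl("pip", people$name), ,
               drop = FALSE]

# A single regex works when the order is known: "pet" followed later by "pip".
ordered <- people[grepl("pet.*pip", people$name), , drop = FALSE]
```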

【Solution 2】:

LIKE should work in sqlite:

require(sqldf)
df <- data.frame(name = c('bob','robert','peter'),id=c(1,2,3))
sqldf("select * from df where name LIKE '%er%'")
    name id
1 robert  2
2  peter  3
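The % signs in the pattern are what make this a partial match. A likely reason the query in the question appeared not to work is that like 'hsa' has no wildcard, so it only matches values that are exactly "hsa". The sketch below uses invented data (the column name x follows the question), shows the base R counterparts of both patterns, and runs the sqldf versions only if the package happens to be installed.

```r
# Hypothetical data; the column name x follows the question.
df <- data.frame(x = c("hsa-let-7a", "mmu-let-7a", "hsa"),
                 stringsAsFactors = FALSE)

# Base R counterparts of the two LIKE patterns:
exact_match  <- df[df$x == "hsa", , drop = FALSE]           # like 'hsa'
prefix_match <- df[startsWith(df$x, "hsa"), , drop = FALSE] # like 'hsa%'

# The same comparison through sqldf, run only if the package is installed.
if (requireNamespace("sqldf", quietly = TRUE)) {
  library(sqldf)
  print(sqldf("select * from df where x like 'hsa'"))   # no wildcard
  print(sqldf("select * from df where x like 'hsa%'"))  # prefix wildcard
}
```

One difference to keep in mind: SQLite's LIKE is case-insensitive for ASCII letters by default, whereas == and startsWith() in R are case-sensitive.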

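【Discussion】:

sqldf works best for listing rows. However, it cannot delete rows.
Why is the R package loaded with require() here?
Because it is not a standard R library; you have to install it manually and then load it with the require function.

【Solution 3】:

Try str_detect() from the stringr package, which detects the presence or absence of a pattern in a string.

Here is an approach that also uses the %>% pipe and filter() from the dplyr package: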

library(stringr)
library(dplyr)

CO2 %>%
  filter(str_detect(Treatment, "non"))

   Plant        Type  Treatment conc uptake
1    Qn1      Quebec nonchilled   95   16.0
2    Qn1      Quebec nonchilled  175   30.4
3    Qn1      Quebec nonchilled  250   34.8
4    Qn1      Quebec nonchilled  350   37.2
5    Qn1      Quebec nonchilled  500   35.3
...

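This filters the sample CO2 data set (which ships with R) for rows where the Treatment variable contains the substring "non". You can adjust whether str_detect finds fixed matches or uses a regex - see the documentation for the stringr package.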
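As a small illustration of the fixed-versus-regex distinction, the sketch below uses an invented character vector. The base R lines run anywhere; the stringr lines use stringr::fixed() to request literal matching and run only if stringr is installed.

```r
# Hypothetical strings chosen so that regex and literal matching differ.
x <- c("a.b", "axb", "ab")

# Base R: regex (default) versus literal matching.
regex_hits <- grepl("a.b", x)                 # "." matches any character
fixed_hits <- grepl("a.b", x, fixed = TRUE)   # "." must be a literal dot

# stringr counterpart, run only if the package is installed.
if (requireNamespace("stringr", quietly = TRUE)) {
  print(stringr::str_detect(x, "a.b"))                  # regex
  print(stringr::str_detect(x, stringr::fixed("a.b")))  # literal
}
```

For a plain substring such as "non" the two modes give the same rows; the difference only shows up once the pattern contains regex metacharacters like ".", "(" or "+".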

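【Discussion】:

You can also use the trc_detect function like this: myDataFrame[str_detect(myDataFrame$key, myKeyPattern),]
@Bemipefe Don't you mean the str_detect function rather than trc_detect?
@Martin Yes, you're right. That was a typo.

【Solution 4】:

Another option is to simply use the grepl function: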

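# Define the example data frame first (same data as in Solution 2)
df <- data.frame(name = c('bob','robert','peter'),
                 id = c(1,2,3)
                 )

df[grepl('er', df$name), ]
#     name id
# 2 robert  2
# 3  peter  3

# The same idea on the built-in CO2 data set
CO2[grepl('non', CO2$Treatment), ]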
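grepl() returns a logical vector while grep() returns integer positions. For keeping matches the two are interchangeable, but they differ when dropping rows and nothing matches. A minimal sketch with the same toy data (the variable names are only illustrative):

```r
df <- data.frame(name = c("bob", "robert", "peter"), id = c(1, 2, 3))

# Keeping matches: both forms select the same rows.
by_index <- df[grep("er", df$name), ]
by_flag  <- df[grepl("er", df$name), ]

# Dropping rows for a pattern that matches nothing:
dropped_grepl <- df[!grepl("zzz", df$name), ]  # all 3 rows kept, as intended
dropped_grep  <- df[-grep("zzz", df$name), ]   # 0 rows: -integer(0) selects nothing
```

For that reason the logical form is usually the safer choice whenever the condition might be negated.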
