Extracting different percentages/numbers from a paragraph/string in R

【Question title】: Extract different percentages/numbers from a paragraph/string in r
【Posted】: 2020-09-15 04:54:43
【Question description】:

I am new to R and am struggling to extract percentages/numbers from strings in a data frame. For example,

df <- data.frame(
  Species = c("Bidens pilosa","Orobanche ramose"),
  Impact = c("Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%; and four to eight plants, 28%...In contrast, suppression of the weed by the crop was only 10%","Cypress was estimated to have a 28% loss annually. The annual increase of the disease in some stands in the Peloponnesus, with an initial attack of 20%, ranged from 5% to 20% ")
)

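My questions are as follows:

I would only like to extract the yield loss of the different crops, i.e. 10 and 28 in this case, and skip the other percentages and numbers (such as 9.4%, 17.3%, 5%, etc.). Can I achieve this in R, or does it require some natural language processing skills?
If it is too hard to tell the different kinds of percentages apart, how can I extract all the percentages/numbers at once so that I can pick the right ones manually? I have tried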

df %>% str_match_all("[0-9]+") %>% unlist %>% as.numeric

or

parse_number(df$Impact)

but I don't think either of them works, because they gave me consecutive numbers.

Thanks for your help.

【Question discussion】:

Tags: r regex stringr

【Solution 1】:

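1) There is no clear pattern for how to extract the yield loss. In the first string alone, I see "yield loss" mentioned twice:

Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%;

So, at least to me, it is not clear why 10 should be chosen rather than 9.4.

2) To extract all the percentages/numbers, you can use: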
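If the figure you want happens to be the first percentage in each string, as it is in the two sample rows, one rough heuristic is to keep only the first number that is directly followed by a percent sign. The sketch below uses base R; the names `impact`, `pct_pattern` and `headline_loss` are made up for illustration, and the "first percentage wins" rule is an assumption that will not hold for arbitrary text.

```r
# Sample strings from the question.
impact <- c(
  "Soyabean yield loss was 10%. A density of one plant resulted in a yield loss of 9.4%; two plants, 17.3%; and four to eight plants, 28%...In contrast, suppression of the weed by the crop was only 10%",
  "Cypress was estimated to have a 28% loss annually. The annual increase of the disease in some stands in the Peloponnesus, with an initial attack of 20%, ranged from 5% to 20% "
)

# Assumed heuristic: the headline figure is the first number that is
# immediately followed by "%". The lookahead (?=%) checks for the percent
# sign without including it in the match.
pct_pattern <- "\\d+(?:\\.\\d+)?(?=%)"

# regexpr() returns only the first match per string; perl = TRUE is needed
# for the lookahead.
first_match <- regmatches(impact, regexpr(pct_pattern, impact, perl = TRUE))
headline_loss <- as.numeric(first_match)
headline_loss
# [1] 10 28
```

Note that `regmatches()` combined with `regexpr()` silently drops strings that contain no match, so the result can be shorter than the input if some rows have no percentage at all.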

stringr::str_extract_all(df$Impact, "\\d+\\.?\\d?")

#[[1]]
#[1] "10"   "9.4"  "17.3" "28"   "10"  

#[[2]]
#[1] "28" "20" "5"  "20"

which is equivalent to

regmatches(df$Impact, gregexpr("\\d+\\.?\\d?", df$Impact))

in base R.
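For the manual-selection route, it may help to flatten the list of matches into one row per number, with the species next to it. This is only a sketch: the strings are abbreviated stand-ins for the question's data, and `matches` and `candidates` are illustrative names.

```r
# Abbreviated stand-ins for the question's data (not the full strings).
df <- data.frame(
  Species = c("Bidens pilosa", "Orobanche ramose"),
  Impact = c(
    "Yield loss was 10%. One plant caused a loss of 9.4%; two plants, 17.3%",
    "A 28% loss annually, ranged from 5% to 20%"
  ),
  stringsAsFactors = FALSE
)

# Same pattern as in the answer: one list element per row of df.
matches <- regmatches(df$Impact, gregexpr("\\d+\\.?\\d?", df$Impact))

# Repeat each species once per number found in its string, then flatten.
candidates <- data.frame(
  Species = rep(df$Species, lengths(matches)),
  Value = as.numeric(unlist(matches)),
  stringsAsFactors = FALSE
)
candidates
```

Each row of `candidates` is one extracted number, so the rows that are not yield losses can be filtered out by hand.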

\\d+ matches one or more digits

\\.? is an optional decimal point

\\d? is an optional single digit.
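One consequence of the pieces above is that the pattern keeps at most one digit after the decimal point, and it can pick up a sentence-ending period after a whole number. Neither shows up in the sample data, but if your text has values like 12.75, a slightly stricter variant may be safer. The test string `x` and the names `loose` and `strict` are invented for this sketch.

```r
# Invented test string with a two-decimal value and numbers before a period.
x <- "Loss was 12.75% in year 1. Then it fell to 3."

# Pattern from the answer: splits 12.75 and keeps the trailing periods.
loose <- regmatches(x, gregexpr("\\d+\\.?\\d?", x))[[1]]
loose
# [1] "12.7" "5"    "1."   "3."

# Stricter variant: the decimal point is only consumed when at least one
# digit follows it, and any number of decimal digits is allowed.
strict <- regmatches(x, gregexpr("\\d+(?:\\.\\d+)?", x, perl = TRUE))[[1]]
strict
# [1] "12.75" "1"     "3"
```

On the two strings from the question both patterns return the same numbers, so the difference only matters for other inputs.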

【Discussion】:

Thanks for your reply. But I am quite confused about what "\\d+\\.?\\d?" stands for.
@Vivi I have updated the answer to explain that. See if it helps.