How to extract the first number from each string in a vector in R? Answers

【Question title】: How to extract the first number from each string in a vector in R?
【Posted】: 2014-11-11 04:22:29
【Question】:

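I am new to regular expressions in R. I have a vector, and I want to extract the first number that occurs in each of its strings.

The vector is named "shootsummary" and looks like this.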

> head(shootsummary)
[1] Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police.                                         
[2] Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him.                           
[3] John Zawahri, 23, armed with a homemade assault rifle and high-capacity magazines, killed his brother and father at home and then headed to Santa Monica College, where he was eventually killed by police.      
[4] Dennis Clark III, 27, shot and killed his girlfriend in their shared apartment, and then shot two witnesses in the building's parking lot and a third victim in another apartment, before being killed by police.
[5] Kurt Myers, 64, shot six people in neighboring towns, killing two in a barbershop and two at a car care business, before being killed by officers in a shootout after a nearly 19-hour standoff.

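The first number in each string is the person's age. I want to extract the ages from these strings without mixing them up with the other numbers that appear later in the same line.

I tried: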

as.numeric(gsub("\\D", "", shootsummary))

Result:

[1]  34128     42     23     27   6419

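I am looking for a result like the one below, containing only the ages extracted from the sentences, without the other numbers that occur after the age.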

[1]  34     42     23     27   64

【Comments on the question】:

Suppose one of the vector elements contains no number. What do you want returned? In my solution, it returns NA.

Tags: regex r vector

【Solution 1】:

stringi would be faster:

library(stringi)
stri_extract_first(shootsummary, regex="\\d+")
#[1] "34" "42" "23" "27" "64"

【Comments】:

Thanks akrun, I installed 'stringi' and ran the code successfully. Thank you for the quick reply.

【Solution 2】:

You can try the sub command below:

> test
[1] "Aaron Alexis, 34, a military veteran and contractor from Texas, opened fire in the Navy installation, killing 12 people and wounding 8 before being shot dead by police."              
[2] "Pedro Vargas, 42, set fire to his apartment, killed six people in the complex, and held another two hostages at gunpoint before a SWAT team stormed the building and fatally shot him."
> sub("^\\D*(\\d+).*$", "\\1", test)
[1] "34" "42"

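Pattern explanation:

^ asserts that we are at the start of the line.
\D* matches zero or more non-digit characters.
(\d+) then captures the following one or more digits into group 1 (the first number).
.* matches any character zero or more times.
$ asserts that we are at the end of the line.
Finally, all matched characters are replaced by the characters in group 1.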
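
The pattern walkthrough above can be checked on a small example. This is a minimal sketch using a made-up vector x (not the data from the question). It also illustrates one caveat: when a string contains no digits, the pattern does not match, so sub() returns the string unchanged rather than NA. The grepl() guard is an addition for illustration, not part of the original answer.

```r
# Hypothetical sample data, for illustration only
x <- c("Aaron Alexis, 34, killing 12 people", "no digits here")

pattern <- "^\\D*(\\d+).*$"

# Replace the whole string with capture group 1 (the first run of digits)
first_num <- sub(pattern, "\\1", x)

# Where the pattern does not match, sub() leaves the input untouched,
# so mark those elements as NA explicitly before converting
first_num[!grepl(pattern, x)] <- NA
ages <- as.numeric(first_num)
```

Without the guard, as.numeric() would still produce NA for the unmatched string, but only with a coercion warning.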

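【Comments】:

Thanks avinash, it works like a charm... Since I am new to regular expressions, could you help me understand more clearly what you did here? Thanks in advance.

【Solution 3】:

One option is str_extract from stringr, wrapped in as.numeric.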

> library(stringr)
> as.numeric(str_extract(shootsummary, "[0-9]+"))
# [1] 34 42 23 27 64

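Update: To answer the question you asked in the comments on this answer, here is a short explanation. A full explanation of each function can be found in its help file.

str_extract returns the first occurrence of the regular expression. It is vectorized over the character vector in its first argument.
The regular expression [0-9]+ matches any of the characters "0" to "9", one or more times.
as.numeric converts the resulting character vector to a numeric vector.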

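【Comments】:

Thanks Richard, your code works too. May I know how it does that? I am completely new to regular expressions and only familiar with very basic, simple patterns. Thanks in advance.
Thanks for showing it in stringr. I have been meaning to use stringr, and you helped me get started with it :)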

【Solution 4】:

How about

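# as.character() guards against shootsummary being a factor, which strsplit() rejects
splitbycomma <- strsplit(as.character(shootsummary), ",")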
as.numeric(  sapply(splitbycomma, "[", 2)  )

【Comments】:

I don't think I got anything from this. The first line itself did not run for me. Let me tweak it and get back to you. Thanks berry.

【Solution 5】:

R's regmatches() function, used with regexpr(), returns a vector containing the first regex match from each element that has one:

regmatches(shootsummary, regexpr("\\d+", shootsummary, perl=TRUE));
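
One point worth checking with this approach: regmatches() returns only the matches, so elements without any digits are dropped and the result can be shorter than the input. The following is a minimal sketch, using a made-up vector x (not the data from the question), of one way to keep the result aligned with the input by filling in NA.

```r
# Hypothetical sample data, for illustration only
x <- c("Kurt Myers, 64, shot six people", "no digits here", "John Zawahri, 23")

# regexpr() gives the position of the first match, or -1 if there is none
m <- regexpr("\\d+", x, perl = TRUE)

# regmatches() silently drops the non-matching element
matches <- regmatches(x, m)

# Fill a full-length NA vector so positions stay aligned with x
first_num <- rep(NA_character_, length(x))
first_num[m != -1] <- matches
ages <- as.numeric(first_num)
```

Wrapping the result in as.numeric() also addresses the fact that regmatches() returns character strings.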

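【Comments】:

Hey Tim, thanks. This seems simpler and makes good use of R's own functions. It returns the result as character strings, but it is still useful. Thanks.

【Solution 6】:

You can use sub: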

test <- ("xff 34 sfsdg 352 efsrg")

sub(".*?(\\d+).*", "\\1", test)
# [1] "34"

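How does the regular expression work?

. matches any character. The quantifier * means any number of occurrences. The ? makes .* lazy, so it matches as few characters as possible, stopping at the first match of \\d (a digit). The quantifier + means one or more occurrences. The parentheses around \\d+ form the first capture group. This may be followed by further characters (.*). The second argument (\\1) replaces the whole string with the first capture group (i.e., the first number).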
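
To see why the ? matters, the lazy pattern can be compared with a greedy one on the same string. This is a small sketch; perl = TRUE is added here as an assumption of the sketch (it is not part of the answer above) so that both calls use the PCRE engine, whose backtracking behavior is well defined.

```r
test <- "xff 34 sfsdg 352 efsrg"

# Lazy .*? consumes as little as possible, so the group captures the first number
lazy <- sub(".*?(\\d+).*", "\\1", test, perl = TRUE)
# lazy is "34"

# Greedy .* consumes as much as it can and backtracks only far enough
# to leave a single digit for \\d+, so the group captures the last digit
greedy <- sub(".*(\\d+).*", "\\1", test, perl = TRUE)
# greedy is "2"
```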

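【Comments】:

Thanks Sven, your code is similar to avinash's. I am curious how you did this and would like to understand the basic concepts. I am new to regex. Thanks in advance.
@user3563667 I added an explanation.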

【Solution 7】:

You can do this nicely with the str_first_number() function from the strex package or, for more general needs, with the str_nth_number() function.

pacman::p_load(strex)
shootsummary <- c("Aaron Alexis, 34, a military veteran and contractor ...",
                  "Pedro Vargas, 42, set fire to his apartment, killed six ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "John Zawahri, 23, armed with a homemade assault rifle ...",
                  "Dennis Clark III, 27, shot and killed his girlfriend ...",
                  "Kurt Myers, 64, shot six people in neighboring ..."
)
str_first_number(shootsummary)
#> [1] 34 42 23 23 27 64
str_nth_number(shootsummary, n = 1)
#> [1] 34 42 23 23 27 64

Created on 2018-09-03 by the reprex package (v0.2.0).

【Comments】: