【Question title】: Extract age from text in R [closed]
【Posted】: 2018-08-07 11:29:55
【Question description】:

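I have a .csv file with a column containing book descriptions scraped from the web, which I imported into R for further analysis. My goal is to extract the age of the protagonist from this column in R, so what I have in mind is: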

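  1. Use a regular expression to match strings such as "age" and "-year-old"
  2. Copy the sentences that contain these strings into a new column (so that I can make sure the sentence is not, for example, "In the middle age 50 people living in xy")
  3. Extract the numbers (and, if possible, some number words) from that column into a new column.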

The resulting table (or possibly data.frame) would hopefully look something like this:

|Description             |Sentence           |Age
|YY is a novel by Mr. X  |The 12-year-old boy| 12
|about a boy. The 12-year|is named Dave.     |
|-old boy is named Dave..|                   |

It would be great if you could help, because my R skills are still very limited and I have not found a way to solve this yet!

【Question comments】:

Tags: r regex string stringr text-extraction


【Solution 1】:

Another option, for when the string contains other numbers/descriptions besides the age but you only need the age.

library(stringr)

description <- "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave is happy. Dave lives at 42 Washington street."

# Split into sentences on "." and keep those containing "-year-old".
# Note: splitting on every "." also splits after abbreviations such as "Mr.".
sentences <- str_split(description, "\\.")[[1]]
sentence <- sentences[str_detect(sentences, "-year-old")]
sentence
# [1] " The 12-year-old boy is named Dave"

# Extract the digits that are immediately followed by "-year-old".
age <- as.numeric(str_extract(sentence, "\\d+(?=-year-old)"))
age
# [1] 12

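Here we use the string "-year-old" to tell us which sentence to extract, and then we extract the number that immediately precedes that string as the age.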
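Since the question is about a whole column rather than a single string, the same idea can be sketched over a data frame. This is only an illustration under assumptions: the data frame `books`, its `Description` column, and the helper `first_age_sentence()` are made-up names, and the sketch uses base R only so that it runs without extra packages.

```r
# Hypothetical data: `books` and `Description` are made-up names.
books <- data.frame(
  Description = c(
    "YY is a novel by Mr. X about a boy. The boy is 5 feet tall. The 12-year-old boy is named Dave. Dave lives at 42 Washington street.",
    "A story about a lighthouse. Nobody in it has a stated age."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: first sentence containing "-year-old", or NA if none.
first_age_sentence <- function(text) {
  sentences <- trimws(strsplit(text, ".", fixed = TRUE)[[1]])
  hits <- sentences[grepl("-year-old", sentences, fixed = TRUE)]
  if (length(hits) == 0) NA_character_ else hits[1]
}

books$Sentence <- vapply(books$Description, first_age_sentence,
                         character(1), USE.NAMES = FALSE)

# Digits directly followed by "-year-old" (lookahead needs perl = TRUE).
m <- regexpr("[0-9]+(?=-year-old)", books$Sentence, perl = TRUE)
matched <- !is.na(m) & m > 0

# Rows without a match keep NA as their age.
books$Age <- NA_real_
books$Age[matched] <- as.numeric(regmatches(books$Sentence, m))
```

Rows whose description has no "-year-old" phrase end up with `NA` in both new columns, which makes them easy to review by hand afterwards.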

【Comments】:

    【Solution 2】:

    You can try the following:

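    library(stringr)  # also provides the %>% pipe
    
    description <- "YY is a novel by Mr. X about a boy. The 12-year-old boy is named Dave. Dave is happy."
    
    # Match from the preceding "." to the next "." around the first run of digits,
    # then strip the leading ". ". A sentence at the very start of the text is not
    # matched, because the pattern requires a preceding ".".
    sentence <- str_extract(description, pattern = "\\.[^\\.]*[0-9]+[^\\.]*\\.") %>%
      str_replace("^\\. ", "")
    sentence
    # [1] "The 12-year-old boy is named Dave."
    
    # The result is a character string; wrap it in as.numeric() if a number is needed.
    age <- str_extract(sentence, pattern = "[0-9]+")
    age
    # [1] "12"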
    

    【Comments】:

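    • Thanks! I edited the pattern slightly so that it also covers written-out numbers and only matches one- or two-digit numbers, but your example works.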
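
    The edited pattern mentioned in the comment is not shown in the thread, so the following is only a guess at what such a variant could look like. It is anchored on "-year-old" and written in base R; the word list `number_words` and the helper `extract_age()` are hypothetical names introduced for illustration.

```r
# Hypothetical lookup table: extend it to cover the ages you expect.
number_words <- c(one = 1, two = 2, three = 3, four = 4, five = 5, six = 6,
                  seven = 7, eight = 8, nine = 9, ten = 10, eleven = 11,
                  twelve = 12, thirteen = 13, fourteen = 14, fifteen = 15,
                  sixteen = 16, seventeen = 17, eighteen = 18, nineteen = 19,
                  twenty = 20)

# One or two digits, or a listed number word, directly before "-year-old".
# The lookbehind rejects tokens glued to a preceding hyphen or word character,
# so "112-year-old" and "twenty-one-year-old" give no match instead of a wrong one.
age_pattern <- paste0("(?<![-\\w])([0-9]{1,2}|",
                      paste(names(number_words), collapse = "|"),
                      ")(?=-year-old)")

# Hypothetical helper: returns the age as a number, or NA if nothing matches.
extract_age <- function(text) {
  m <- regexpr(age_pattern, text, perl = TRUE, ignore.case = TRUE)
  if (is.na(m) || m == -1) return(NA_real_)
  token <- tolower(regmatches(text, m))
  if (grepl("^[0-9]+$", token)) as.numeric(token) else unname(number_words[token])
}

extract_age("The twelve-year-old boy is named Dave.")
```

    Compound number words such as "twenty-one" are deliberately left unmatched here; handling them would need a longer word list or a dedicated parser.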