【Posted】: 2019-03-27 04:16:43
【Question】:
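I am working on a text mining task in R.
Tasks:
1) Count sentences
2) Identify quotes and save them in a vector
Problems:
False full stops such as "..." and full stops in titles such as "Mr." must be handled.
The body text definitely contains quotes, and those quotes will contain "...". I am thinking of extracting the quotes from the body and saving them in a vector. (Some manipulation of them is needed as well.)
Important: my text data is in a Word document. I load it into R with readtext("path to .docx file"). When I look at the text, the quotes appear as just " and not as \", unlike in the reproducible text.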
library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text
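On the " versus \" point: in R, \" is only the escaped spelling of the double-quote character inside a double-quoted string literal, so both forms are the same single character. The short sketch below illustrates that with base R. The last two lines rest on an assumption that the post does not confirm, namely that the Word document uses typographic (curly) quotes; the variable names are made up for illustration.

```r
q_escaped <- "\""   # escaped spelling inside a double-quoted literal
q_plain   <- '"'    # the same character inside a single-quoted literal
identical(q_escaped, q_plain)  # TRUE
nchar(q_escaped)               # 1

x <- "say \"hi\""
print(x)       # print() shows the escaped form of the string
writeLines(x)  # writeLines() shows the raw characters

# Assumption: only needed if the .docx contains curly quotes (U+201C, U+201D).
curly    <- "say \u201Chi\u201D"
straight <- gsub("\u201C|\u201D", "\"", curly)
```

If the quotes turn out to be curly, normalizing them first means a single pattern built around " can be used afterwards.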
Reproducible text:
text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "
# splitting by "."
unlist(strsplit(text, "\\."))
The problem is that it also splits on the false full stops.
The solution I tried:
# getting rid of . in titles
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)
The problem with this is that it does not replace [Miss. with [Miss
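A likely reason: "\\S+" matches a whole whitespace-delimited token, so the token seen here is "[Miss." (bracket included), which is not one of the names in the list. As a minimal sketch in base R, assuming the set of titles is known in advance, the dot can be dropped with a word-boundary pattern instead. The names titles, pattern, x and out are illustrative only.

```r
# Titles whose trailing dot should be removed (same set as vec.rep in the post).
titles  <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

# \b anchors at a word boundary, so a leading "[" does not get in the way.
pattern <- paste0("\\b(", paste(titles, collapse = "|"), ")\\.")

x   <- "Mrs. Keyboard likes Miss. K [Miss. Keyboard is a bit of a princess ...]"
out <- gsub(pattern, "\\1", x, perl = TRUE)
out
# "Mrs Keyboard likes Miss K [Miss Keyboard is a bit of a princess ...]"
```

The "..." is left alone because it does not follow one of the listed titles.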
Identifying quotes:
library(stringi)
stri_extract_all_regex(text, '"\\S+"')
But this does not work either. (It does work with \" in the code below.)
stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')
The exact expected vectors are:
sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")
I want to split the text into sentences (so that I can count how many sentences each paragraph has), and I want the quotes separated out as well.
quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""
【Discussion】:
-
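Putting a full stop after Miss is odd, since it is not an abbreviation. Even if you remove that dot with text <- gsub("Miss.", "Miss", text, fixed=TRUE), I can't make use of the tm/OpenNLP packages, because they parse out [4] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n \"Mom how are you o.k. with being called Mrs. Keyboard?" as a sentence. What are your sentence-splitting rules? Should any text inside double quotes be extracted as is, even if it contains several sentences?
-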
What is the expected result for the text above?
-
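However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...] is one sentence, and "Mom how are you o.k. with being called Mrs. Keyboard?" is another, because they are separated by \n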
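Taking that rule at face value (a line break always ends a sentence), one possible base-R sketch is below. It is not a general sentence tokenizer: it assumes straight double quotes, a fixed list of titles, that a line starting with a quote is kept whole, and that a new sentence starts with a capital letter. The helper split_line and the variable names are hypothetical, and the result differs in small details from the expected vector in the question (for example, the trailing "..." stays attached to the second sentence).

```r
# Sample text mirroring the reproducible text in the question.
text <- paste(
  "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ...",
  "However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]",
  "\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". ",
  sep = "\n"
)

# 1) Remove dots that do not end a sentence: known titles and "o.k.".
clean <- gsub("\\b(Mr|Mrs|Ms|Miss|Dr|Jr)\\.", "\\1", text, perl = TRUE)
clean <- gsub("o.k.", "ok", clean, fixed = TRUE)

# 2) A line break always separates sentences.
lines <- trimws(unlist(strsplit(clean, "\n", fixed = TRUE)))
lines <- lines[nzchar(lines)]

# 3) Within an unquoted line, split on whitespace that follows . ? or !
#    and precedes a capital letter; quoted lines are kept whole.
split_line <- function(x) {
  if (startsWith(x, "\"")) return(x)
  unlist(strsplit(x, "(?<=[.?!])\\s+(?=[A-Z])", perl = TRUE))
}
sentences <- unlist(lapply(lines, split_line))

length(sentences)  # 4
```

Counting sentences per paragraph then comes down to calling length() on the result of split_line for each line.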
-
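OK, you can use gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text) to match all of your current vec values. Note that this will not handle o.k.; you may need a different approach for that. How the text should be split into sentences still seems unclear, though.
-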
If you only want to extract the quotes, try
regmatches(text, gregexpr('"[^"]*"', text))
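To illustrate the difference between the two patterns, here is a small base-R comparison on the quoted line from the reproducible text. It assumes straight double quotes; no_space and quotes are illustrative names.

```r
text <- "\"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "

# '"\\S+"' requires a run of non-space characters between the two quotes,
# so it cannot match a quotation that contains spaces.
no_space <- regmatches(text, gregexpr('"\\S+"', text))[[1]]

# '"[^"]*"' accepts anything except a closing quote, spaces included.
quotes <- regmatches(text, gregexpr('"[^"]*"', text))[[1]]

length(no_space)  # 0
length(quotes)    # 1
```

That also explains why the one-word test string "some text \"quote\" some other text" worked with the first pattern while the real text did not.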