【Question title】: R text mining - remove special characters and quotes
【Posted】: 2019-03-27 04:16:43
【Question】:

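I am working on a text mining task in R.

Tasks:

1) Count sentences

2) Identify quotes and save them in a vector

Problems:

False full stops such as "..." and the periods in titles such as "Mr." have to be dealt with.

The body text definitely contains quotes, and some of them will contain "...". I am thinking of extracting those quotes from the main body and saving them in a vector. (Some manipulation needs to be done on them as well.)

IMPORTANT: my text data is in a Word document. I load it in R with readtext("path to .docx file"). When I view the text, the quotes appear as plain " and not as \", unlike in the reproducible text below.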

library(readtext)
path <- "C:/Users/.../"
a <- readtext(paste(path, "Text.docx", sep = ""))
title <- a$doc_id
text <- a$text

Reproducible text

text <- "Mr. and Mrs. Keyboard have two children. Keyboard Jr. and Miss. Keyboard. ... 
However, Miss. Keyboard likes being called Miss. K [Miss. Keyboard is a bit of a princess ...]
 \"Mom how are you o.k. with being called Mrs. Keyboard? I'll never get it...\". "


#  splitting by "." 
unlist(strsplit(text, "\\."))

The problem is that this also splits at the false full stops.

The solution I tried:

# getting rid of . in titles 
vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")

library(gsubfn)
# replacing . in titles
gsubfn("\\S+", setNames(as.list(vec.rep), vec), text)

The problem with this is that it does not replace [Miss. with [Miss

Identifying quotes:

library(stringi)
stri_extract_all_regex(text, '"\\S+"')

But this does not work either. (It does work with \" in the code below.)

stri_extract_all_regex("some text \"quote\" some other text", '"\\S+"')

The exact expected vectors are:

sentences <- c("Mr and Mrs Keyboard have two children. ", "Keyboard Jr and Miss Keyboard.", "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]", "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\"")

I want the sentences separated (so that I can count how many sentences each paragraph has), and the quotes separated as well.

quotes <- "\"Mom how are you ok with being called Mrs Keyboard? I'll never get it...\""

【Question discussion】:

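It is odd to have a period after Miss, since it is not an abbreviation. Even if you remove that dot with text <- gsub("Miss.", "Miss", text, fixed=TRUE), I cannot make use of the tm / OpenNLP packages, because they parse out [4] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n \"Mom how are you o.k. with being called Mrs. Keyboard?" as a single sentence. What are your sentence-splitting rules? Should any text inside double quotes be extracted as is, even if it contains several sentences?
What is the expected result for the text above?
However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...] is one sentence, and "Mom how are you o.k. with being called Mrs. Keyboard?" is another, because they are separated by \n
OK, you can use gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text) to match all of your current vec values. Note that this will not handle o.k.; you may need another approach for that. The rules for splitting into sentences still do not seem very clear, though.
If you just want to extract the quotes, try regmatches(text, gregexpr('"[^"]*"', text))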

Tags: r regex text gsub mining

【Solution 1】:

To match all of your current vec values, you can use

gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)

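That is, \w+ matches 1 or more word characters and \. matches a dot. Matches that are not names in the replacement list (for example "children." or the "o." and "k." in "o.k.") are left unchanged.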
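
To see why the original \S+ attempt missed [Miss. while \w+\. catches it, it can help to list the tokens each pattern produces. The sketch below uses base R only. The sample string x, the lookup vector, and the result y are illustrative names that do not appear in the question, and the `regmatches<-` step is a base-R stand-in for the gsubfn call, not the same function.

```r
# Hypothetical input, shortened from the question's text
x <- "called Miss. K [Miss. Keyboard is o.k.]"

# \S+ takes whole whitespace-delimited chunks, so the bracket stays glued to the title
tokens_S <- regmatches(x, gregexpr("\\S+", x))[[1]]
# "called" "Miss." "K" "[Miss." "Keyboard" "is" "o.k.]"

# \w+\. takes only word characters followed by a dot
tokens_w <- regmatches(x, gregexpr("\\w+\\.", x))[[1]]
# "Miss." "Miss." "o." "k."

vec     <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
lookup  <- setNames(vec.rep, vec)

# Replace each match found in the lookup; leave every other match as it is
y <- x
m <- gregexpr("\\w+\\.", y)
regmatches(y, m) <- lapply(regmatches(y, m), function(tok) {
  known <- tok %in% names(lookup)
  tok[known] <- unname(lookup[tok[known]])
  tok
})
y
# "called Miss K [Miss Keyboard is o.k.]"
```

Because "[Miss." is not a name in the lookup, a pattern that yields that token can never trigger the replacement, whereas "Miss." can.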

Next, if you only want to extract the quotes, use

regmatches(text, gregexpr('"[^"]*"', text))

Here " matches a literal ", and [^"]* matches 0 or more characters other than ".

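As a small illustration of the difference between the two quote patterns, the sketch below runs both on a made-up string. The last part is an assumption about the Word input: if the .docx text contains curly quotes (U+201C and U+201D), mapping them to straight quotes first lets the same pattern apply. All variable names are hypothetical.

```r
s <- "some text \"first quote\" and \"a second one...\" the end"

# '"\\S+"' allows no whitespace between the quotes, so multi-word quotes are missed
no_space <- regmatches(s, gregexpr('"\\S+"', s))[[1]]
# character(0)

# '"[^"]*"' allows anything except a double quote between the delimiters
quotes <- regmatches(s, gregexpr('"[^"]*"', s))[[1]]
# "\"first quote\"" "\"a second one...\""

# Assumption: text read from Word may use curly quotes; normalise them first
w <- "She said \u201CHi there\u201D and left"
w_straight <- gsub("[\u201C\u201D]", "\"", w)
quotes_w <- regmatches(w_straight, gregexpr('"[^"]*"', w_straight))[[1]]
# "\"Hi there\""
```
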
If you plan to match your sentences together with the quotes, you might consider

regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))

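Details

\\s* - 0 or more whitespace characters
"[^"]*" - a ", then 0 or more characters other than ", then a "
| - or
[^"?!.]+ - 1 or more characters other than ?, ", ! and .
[[:space:]?!.]+ - 1 or more whitespace, ?, ! or . characters
[^"[:alnum:]]* - 0 or more characters that are neither alphanumeric nor "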
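
The breakdown above can be checked on a short made-up string. This is only a sketch: the input s is hypothetical, and it simply shows that a quoted span is returned whole even when it contains sentence punctuation, while unquoted text is cut at each run of terminators.

```r
pattern <- '\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*'

# Hypothetical input: one plain sentence, one quote with two sentences, one plain sentence
s <- "Hello there. \"A quote. With two sentences!\" Bye now."

pieces <- regmatches(s, gregexpr(pattern, s))[[1]]
pieces
# "Hello there. "  "\"A quote. With two sentences!\""  " Bye now."

# Surrounding whitespace can be dropped afterwards
trimws(pieces)
```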

R sample code:

> vec <- c("Mr.", "Mrs.", "Ms.", "Miss.", "Dr.", "Jr.")
> vec.rep <- c("Mr", "Mrs", "Ms", "Miss", "Dr", "Jr")
> library(gsubfn)
> text <- gsubfn("\\w+\\.", setNames(as.list(vec.rep), vec), text)
> regmatches(text, gregexpr('\\s*"[^"]*"|[^"?!.]+[[:space:]?!.]+[^"[:alnum:]]*', trimws(text)))
[[1]]
[1] "Mr and Mrs Keyboard have two children. "                                                       
[2] "Keyboard Jr and Miss Keyboard. ... \n"                                                         
[3] "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n "
[4] "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
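
From this result, the two things the question asks for (a sentence count and a separate vector of quotes) could be derived roughly as follows. This is a sketch under assumptions: the matches vector is copied by hand from the output above, the variable names are made up, and dropping a stray trailing " ..." is one possible clean-up rule, not something the answer specifies.

```r
# Matches copied from the output above
matches <- c(
  "Mr and Mrs Keyboard have two children. ",
  "Keyboard Jr and Miss Keyboard. ... \n",
  "However, Miss Keyboard likes being called Miss K [Miss Keyboard is a bit of a princess ...]\n ",
  "\"Mom how are you o.k. with being called Mrs Keyboard? I'll never get it...\""
)

# Strip the whitespace and newlines left around each match
sentences <- trimws(matches)

# Task 1: number of sentences
n_sentences <- length(sentences)

# Task 2: elements that start and end with a double quote are the quotes
is_quote <- grepl('^".*"$', sentences)
quotes <- sentences[is_quote]

# Optional clean-up (assumed rule): drop a free-standing "..." at the end of an element
cleaned <- sub("\\s+\\.{3}$", "", sentences)
cleaned[2]
# "Keyboard Jr and Miss Keyboard."
```

Note that "o.k." is still present in the quote, since the title replacement only touches the values listed in vec.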

【Discussion】: