R stopwords: getting rid of ALL the words starting with 'https'

【Question title】: R stopwords: getting rid of ALL the words starting with 'https'
【Posted】: 2023-01-26 23:34:10
【Question】:

I'm working on a project that includes scraping Twitter.

The problem: I can't seem to remove ALL of the words that start with 'https'.

My code:

library(twitteR)
library(tm)
library(RColorBrewer)
library(e1071)
library(class)
library(wordcloud)
library(tidytext)

scraped_tweets <- searchTwitter('Silk Sonic - leave door open', n = 10000, lang='en')

# get text data from tweets
scraped_text <- sapply(scraped_tweets, function(x){x$getText()})


# removing emojis and characters
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')

scraped_corpus <- Corpus(VectorSource(scraped_text))

doc_matrix <- TermDocumentMatrix(scraped_corpus, control = list(removePunctuation=T,
                                      stopwords = c('https','http', 'sonic', 
                                               'silk',stopwords('english')),
                                                removeNumbers = T,tolower = T))


# convert object into a matrix
doc_matrix <- as.matrix(doc_matrix)


# get word counts

head(doc_matrix,1)

words <- sort(rowSums(doc_matrix), decreasing = T)

dm <- data.frame(word = names(words), freq = words)


# wordcloud

wordcloud(dm$word, dm$freq, random.order = F, colors = brewer.pal(8, 'Dark2'))

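I added 'https' and 'http' to the stop word list, but it didn't help. I could of course clean the output with gsub, but that doesn't solve it, because the rest of each link name still shows up in the output.

Any ideas on how I could do this?

Thanks in advance.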

【Comments】:

Tags: r twitter sentiment-analysis stop-words

【Solution 1】:

Let's look at the documentation for tm:

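stopwords: Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.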

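The stopwords argument doesn't appear to do any partial or pattern matching on the stop words you supply, so 'https' only removes tokens that are exactly 'https'. It does accept a custom function, which is one option, but I think the simplest approach is to strip the URLs from the character vector before converting it to a corpus: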
scraped_text <- sapply(scraped_tweets, function(x){x$getText()})


# removing emojis and characters
scraped_text <- iconv(scraped_text, 'UTF-8', 'ASCII')

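# Added lines for regex-based URL removal. str_remove_all() comes from stringr,
# and the raw string r"(...)" needs R >= 4.0.0. The trailing |$ in the lookahead
# lets a URL at the very end of a tweet match as well.
library(stringr)
scraped_text <- str_remove_all(scraped_text, r"(https?://[^)\]\s]+(?=[)\]\s]|$))")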


scraped_corpus <- Corpus(VectorSource(scraped_text))
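This is a fairly simple regex for URL identification, but it works reasonably well. More sophisticated ones exist and are easy to find with a quick search.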
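
To see why the stop word list alone doesn't help, and what stripping URLs up front changes, here is a minimal base-R sketch. The sample tweets, the token `httpstcoabc123`, and the helper `strip_urls` are all made up for illustration. It assumes that punctuation removal collapses a link into a single token, which would explain the leftover "link names" described in the question. It uses `gsub` with a simpler pattern than the one above, so it is a variant of the same idea rather than the exact regex from this answer.

```r
# Hypothetical tokens of the kind left behind once punctuation is stripped from a link
tokens <- c("door", "https", "httpstcoabc123", "open")
custom_stopwords <- c("https", "http")

# Exact matching (what a stop word list does) only drops the token equal to "https"
kept_exact <- tokens[!(tokens %in% custom_stopwords)]

# Pattern matching drops every token that starts with "http"
kept_pattern <- tokens[!grepl("^http", tokens)]

# Hypothetical helper: remove URLs from raw text before any tokenisation happens
strip_urls <- function(x) {
  # Drop "http://" or "https://" plus everything up to whitespace, ")" or "]"
  x <- gsub("https?://[^\\s)\\]]+", "", x, perl = TRUE)
  # Collapse the whitespace left behind and trim the ends
  trimws(gsub("\\s+", " ", x, perl = TRUE))
}

# Made-up sample tweets
sample_tweets <- c(
  "Leave the Door Open is on repeat https://t.co/abc123",
  "new video (http://example.com/watch) out now",
  "no links here"
)
cleaned <- strip_urls(sample_tweets)

print(kept_exact)
print(kept_pattern)
print(cleaned)
```

Cleaning the raw text first means the corpus never sees the link fragments, so no stop word entry is needed for them. The same `strip_urls` call could be applied to `scraped_text` right after the `iconv` step.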

【Comments】: