【发布时间】:2019-11-12 15:53:04
【问题描述】:
我是 R 新手,正在尝试从语料库中删除无意义的单词。我有一个数据框,其中一列包含电子邮件,另一列包含目标变量。我正在尝试清理电子邮件正文数据。我为此使用了 tm 和 qdap 包。 我已经解决了大多数其他问题并尝试了以下示例: Remove meaningless words from corpus in R 我遇到的问题是,当我想从语料库中删除不需要的标记(不是字典单词)时,出现错误。
library(qdap)
library(tm)
corpus = Corpus(VectorSource(Email$Body))
corpus = tm_map(corpus, tolower)
corpus = tm_map(corpus, removePunctuation)
corpus = tm_map(corpus, removeNumbers)
corpus = tm_map(corpus, stripWhitespace)
corpus = tm_map(corpus, stemDocument)
tdm = TermDocumentMatrix(corpus)
all_tokens = findFreqTerms(tdm,1)
tokens_to_remove = setdiff(all_tokens, GradyAugmented)
corpus <- tm_map(corpus, content_transformer(removeWords), tokens_to_remove)
通过运行上面的代码行,我得到了以下错误。
invalid regular expression '(*UCP)\b(zyx|zyer|zxxxxxâ|zxxxxx|zwischenzeit|zwei|zvolen|zverejneni|zurã|zum|zstepswc|zquez|zprã|zorunlulu|zona|zoho|znis|zmir|zlf|zink|zierk|zhou|zhodnoteni|zgyã|zgã|zfs|zfbeswstat|zerust|zeroâ|zeppelinstr|zellerstrass|zeldir|zel|zdanska|zcfqc|zaventem|zarecka|zarardan|zaragoza|zaobchã|zamã|zakã|zaira|zahradnikova|zagorska|zagã|zachyti|zabih|zã|yusof|yukinobu|yui|ypg|ypaint|youtub|yoursid|youâ|yoshitada|yorkshir|yollayan|yokohama|yoganandam|yiewsley|yhlhjpz|yer|yeovil|yeni|yeatman|yazarina|yazaki|yaz|yasakt|yarm|yara|yannick|yanlislikla|yakar|yaiza|yabortslitem|yã|xxxxx|xxxxgbl|xuezi|xuefeng|xprn|xma|xlsx|xjchvnbbafeg|xiii|xii|xiaonan|xgb|xcede|wythenshaw|wys|wydzial|wydzia|wycomb|www|wuppert|wroclaw|wroc|wrightâ|wpisana|woustvil|wouldnâ|worthwhil|worsley|worri|worldwid|worldâ|workwear|worcestershir|worc|wootton|wooller|woodtec|woodsid|woodmansey|woodley|woodham|woodgat|wonâ|wolverhampton|wjodoyg|wjgfjiq|witti|witt|witkowski|wiss
In addition: Warning message:
In gsub(sprintf("(*UCP)\\b(%s)\\b", paste(sort(words, decreasing = TRUE), :
PCRE pattern compilation error
'regular expression is too large'
at ''
电子邮件的样本语料库:
[794] "c mailto sent march ne rntbci accountspay nmuk subject new sig plc item still new statement await retriev use link connect account connect account link work copi past follow text address bar top internet browser https od datainterconnect com sigd sigdodsaccount php p thgqmdz d dt s contact credit control contact experi technic problem visit http bau faq datainterconnect com sig make payment call autom credit debit card payment line sig may abl help improv cashflow risk manag retent recoveri contract disput via www sigfinancetool co uk websit provid detail uniqu award win servic care select third parti avail sig custom power"
tokens_to_remove[1:10]
[1] "advis" "appli" "atlassian" "bosch" "boschrexroth" "busi"
[7] "communic" "dcen" "dcgbsom" "email"
我想删除所有在英语中没有意义的单词,例如 c、mailto、ne、accountspay、nmuk 等。
【问题讨论】:
标签: r text nlp data-cleaning