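Since you are probably looking for fruit-flavor combinations in a set of texts that also contain non-fruit words, I made up some documents similar to the ones in your example. I used the quanteda package to build a document-feature matrix, then filtered it down to the ngrams that contain fruit words.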
docs <- c("One flavor is apple strawberry lime.",
"Another flavor is apple grape lime.",
"Pineapple mango guava is our newest flavor.",
"There is also kiwi guava and grape apple.",
"Mixed berry was introduced last year.",
"Did you like kiwi guava pineapple?",
"Try the lime mixed berry.")
flavorwords <- c("apple", "guava", "berry", "kiwi", "guava", "grape")
require(quanteda)
# form a document-feature matrix ignoring common stopwords plus "like" and "also"
# for unigrams, bigrams, trigrams
fruitDfm <- dfm(docs, ngrams = 1:3, ignoredFeatures = c("like", "also", stopwords("english")))
## Creating a dfm from a character vector ...
## ... lowercasing
## ... tokenizing
## ... indexing documents: 7 documents
## ... indexing features: 90 feature types
## ... removed 47 features, from 176 supplied (glob) feature types
## ... complete.
## ... created a 7 x 40 sparse dfm
## Elapsed time: 0.01 seconds.
# select only those features matching any of the flavorwords as a regular expression
# (substring match, so "pineapple" and "strawberry" are kept as well)
fruitDfm <- selectFeatures(fruitDfm, flavorwords, valuetype = "regex")
## kept 22 features, from 5 supplied (regex) feature types
# show the features
topfeatures(fruitDfm, nfeature(fruitDfm))
## apple guava grape pineapple kiwi
##     3     3     2         2    2
## kiwi_guava berry mixed_berry strawberry apple_strawberry
##          2     2           2          1                1
## strawberry_lime apple_strawberry_lime apple_grape grape_lime apple_grape_lime
##               1                     1           1          1                1
## pineapple_mango mango_guava pineapple_mango_guava grape_apple guava_pineapple
##               1           1                     1           1               1
## kiwi_guava_pineapple lime_mixed_berry
##                    1                1
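To make the filtering step concrete without relying on a particular quanteda version, here is a minimal base-R sketch of the same idea: build underscore-joined 1- to 3-grams from one text, then keep those that match any flavor word as a regular expression. The helper `make_ngrams` and the variable names are hypothetical (they are not quanteda functions), and unlike the `dfm()` call above this sketch does not remove stopwords.

```r
# Hypothetical helper (not part of quanteda): 1- to 3-grams of a single text.
make_ngrams <- function(text, n = 1:3, sep = "_") {
  # lowercase, strip punctuation, split on whitespace
  toks <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  toks <- toks[nzchar(toks)]
  out <- character(0)
  for (k in n) {
    if (length(toks) >= k) {
      starts <- seq_len(length(toks) - k + 1)
      out <- c(out, vapply(starts,
                           function(i) paste(toks[i:(i + k - 1)], collapse = sep),
                           character(1)))
    }
  }
  out
}

# one alternation pattern built from the flavor words
flavor_pattern <- paste(c("apple", "guava", "berry", "kiwi", "grape"), collapse = "|")

grams <- make_ngrams("Try the lime mixed berry.")
fruit_grams <- grams[grepl(flavor_pattern, grams)]
fruit_grams
```

For this sentence only the ngrams containing "berry" survive the filter, which mirrors what the regex selection does to the dfm features.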
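Added:
If you want to match terms that are not separated by spaces against the documents, you can form the ngrams with an empty-string concatenator and match them as follows.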
flavorwordsConcat <- c("applestrawberrylime", "applegrapelime", "pineapplemangoguava",
"kiwiguava", "grapeapple", "mixedberry", "kiwiguavapineapple",
"limemixedberry")
fruitDfm <- dfm(docs, ngrams = 1:3, concatenator = "")
fruitDfm <- fruitDfm[, features(fruitDfm) %in% flavorwordsConcat]
fruitDfm
# Document-feature matrix of: 7 documents, 8 features.
# 7 x 8 sparse Matrix of class "dfmSparse"
#        features
# docs  applestrawberrylime applegrapelime pineapplemangoguava kiwiguava grapeapple mixedberry kiwiguavapineapple limemixedberry
# text1                   1              0                   0         0          0          0                  0              0
# text2                   0              1                   0         0          0          0                  0              0
# text3                   0              0                   1         0          0          0                  0              0
# text4                   0              0                   0         1          1          0                  0              0
# text5                   0              0                   0         0          0          1                  0              0
# text6                   0              0                   0         1          0          0                  1              0
# text7                   0              0                   0         0          0          1                  0              1
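The same exact-match idea can be sketched in base R for two of the documents, as a rough illustration only. The helper `concat_ngrams` and the object `counts` are hypothetical names, not quanteda functions; the sketch joins tokens with an empty string and counts how often each target term occurs per document.

```r
docs_small <- c("There is also kiwi guava and grape apple.",
                "Did you like kiwi guava pineapple?")
targets <- c("kiwiguava", "grapeapple", "kiwiguavapineapple")

# Hypothetical helper: 1- to 3-grams of one text, joined with no separator.
concat_ngrams <- function(text, n = 1:3) {
  toks <- strsplit(gsub("[[:punct:]]", "", tolower(text)), "\\s+")[[1]]
  unlist(lapply(n, function(k) {
    if (length(toks) < k) return(character(0))
    vapply(seq_len(length(toks) - k + 1),
           function(i) paste(toks[i:(i + k - 1)], collapse = ""),
           character(1))
  }))
}

# documents in rows, target terms in columns, exact string matches only
counts <- t(vapply(docs_small, function(d) {
  g <- concat_ngrams(d)
  vapply(targets, function(term) sum(g == term), integer(1))
}, integer(length(targets))))
dimnames(counts) <- list(paste0("text", seq_along(docs_small)), targets)
counts
```

Because the comparison is an exact string match rather than a regular expression, "kiwiguava" is not counted a second time inside "kiwiguavapineapple".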
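If your texts contain the flavor words already concatenated, you can instead match a unigram dfm against permutations of the individual fruit words pasted together. For example, combinat::permn() generates every ordering of the full word list (restrict it to three-word subsets if you only want the trigram permutations):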
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")
head(unlist(combinat::permn(unigramFlavorWords, paste, collapse = "")))
## [1] "appleguavagrapepineapplekiwi" "appleguavagrapekiwipineapple" "appleguavakiwigrapepineapple"
## [4] "applekiwiguavagrapepineapple" "kiwiappleguavagrapepineapple" "kiwiappleguavapineapplegrape"
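The permn() call above orders all five words at once. If only three-word orderings are wanted, one possible base-R sketch is to enumerate every ordered triple of word positions and drop the triples that reuse a word. The names `idx` and `trigramPerms` are hypothetical and used here only for illustration.

```r
unigramFlavorWords <- c("apple", "guava", "grape", "pineapple", "kiwi")

# every ordered triple of word positions
positions <- seq_along(unigramFlavorWords)
idx <- expand.grid(first = positions, second = positions, third = positions)

# keep only triples made of three different words
distinct <- idx$first != idx$second &
  idx$first != idx$third &
  idx$second != idx$third
idx <- idx[distinct, ]

# paste the three words together with no separator
trigramPerms <- paste0(unigramFlavorWords[idx$first],
                       unigramFlavorWords[idx$second],
                       unigramFlavorWords[idx$third])

length(trigramPerms)                       # 5 * 4 * 3 orderings
"kiwiguavapineapple" %in% trigramPerms
```

The resulting character vector can then be used with `%in%` against the feature names of a unigram dfm, in the same way as flavorwordsConcat earlier.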