[Posted]: 2020-02-04 03:35:17
[Question]:
Data
I have a data frame in R with the following structure:
ID group text
100 1 An apple is a sweet, edible fruit produced by an apple tree.
103 1 An apple is a sweet, edible fruit produced by an apple tree.
105 1 Some dog breeds show more variation in size than other dog breeds.
106 1 An apple is a sweet, edible fruit produced by an apple tree.
107 1 An apple is a sweet, edible fruit produced by an apple tree.
209 1 Some dog breeds show more variation in size than other dog breeds.
300 1 Some dog breeds show more variation in size than other dog breeds.
501 1 An apple is a sweet, edible fruit produced by an apple tree.
503 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711 2 Pizza is a savory dish of Italian origin.
799 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811 2 Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
It can be reproduced with this code:
test_df <- data.frame(
"ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
"group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
"text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)
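In reality, the texts for each topic differ slightly from one another, and there are several hundred thousand of them spread across a few dozen groups.
What I want to do
I am trying to write a function that does the following:
- For each group in the data frame, compare all the texts in that group and identify the main lexical topics.
- Then add the relevant topic for each text to the data frame as a new column.
Here is an example of two rows of the data frame after the analysis: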
ID group topic text
100 1 apple An apple is a sweet, edible fruit produced by an apple tree.
105 1 dog Some dog breeds show more variation in size than other dog breeds.
What I have so far
I can run this on the complete data frame (without subsetting by group) using the following code:
# Preparing the texts
library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# Identifying topics
library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')
# Inputting the topic classification into the dataframe
test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
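The indexing in the last line can be illustrated with plain vectors. This is only a sketch with made-up values standing in for what `terms()` and `topics()` are assumed to return here (a named vector holding one top term per topic, and one topic index per document); `top_terms`, `doc_topics`, and `doc_labels` are hypothetical names.

```r
# Made-up stand-ins (assumptions), not real LDA output
top_terms  <- c("Topic 1" = "appl", "Topic 2" = "dog")  # one top term per topic
doc_topics <- c(1L, 1L, 2L, 1L)                         # most likely topic per document

# Indexing the term vector by topic number gives one label per document
doc_labels <- unname(top_terms[doc_topics])
print(doc_labels)
```

Because the corpus is stemmed, labels produced this way would be stems such as `appl` rather than full words such as `apple`.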
I tried turning this into a function and then running it on the data frame by subset with the following code:
library(tm)
library(topicmodels)
topic_identifier <- function(text) {
  corpus <- Corpus(VectorSource(text))
  corpus <- tm_map(corpus, removeWords, stopwords('english'))
  corpus <- tm_map(corpus, stemDocument, language = 'english')
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
  lda.output <- LDA(TF, k=2, method = 'Gibbs')
  test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
}
by(test_df$text, test_df$group, topic_identifier)
But this does not let me save the relevant output for each subset back into the original df.
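One possible way around this, sketched under assumptions: have the per-group function return only a vector of labels (one per input text) instead of calling `cbind` on `test_df`, and let base R's `ave()` write the per-group results back in the original row order. In the sketch below, `top_word` and `label_texts` are hypothetical stand-ins for the LDA step so that the example runs without tm or topicmodels, `add_topic_by_group` is a hypothetical helper, and `demo_df` is a cut-down copy of the data.

```r
# Cut-down copy of the example data (strings kept as character, not factor)
demo_df <- data.frame(
  ID    = c(100, 105, 503, 711),
  group = c(1, 1, 2, 2),
  text  = c("An apple is a sweet, edible fruit produced by an apple tree.",
            "Some dog breeds show more variation in size than other dog breeds.",
            "Ice cream is a sweetened frozen food typically eaten as a snack or dessert.",
            "Pizza is a savory dish of Italian origin."),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for the LDA step:
# the most frequent word longer than 3 letters in a single text
top_word <- function(txt) {
  words <- strsplit(tolower(gsub("[[:punct:]]", "", txt)), "[[:space:]]+")[[1]]
  words <- words[nchar(words) > 3]
  if (length(words) == 0) return(NA_character_)
  counts <- table(words)
  names(counts)[which.max(counts)]
}

# Receives all texts of ONE group and returns one label per text,
# in the same order and with the same length as the input
label_texts <- function(texts) {
  vapply(texts, top_word, character(1), USE.NAMES = FALSE)
}

# ave() splits the texts by group, applies the function to each subset,
# and puts the results back in the original row order
add_topic_by_group <- function(df, topic_fun) {
  df$topic <- ave(as.character(df$text), df$group, FUN = topic_fun)
  df
}

result <- add_topic_by_group(demo_df, label_texts)
print(result[, c("ID", "group", "topic")])
```

To plug in the real pipeline, `topic_identifier` would presumably have to end with `terms(lda.output)[topics(lda.output)]` as its return value so that it can be passed as `topic_fun`; that combination has not been tested here.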
[Discussion]:
Tags: r function dataframe subset cluster-analysis