在数据帧的子集上运行聚类函数并为数据帧中的每个子集写入结果答案

【问题标题】：Running a clustering function on subsets of dataframe and writing the results for each subset in the dataframe在数据帧的子集上运行聚类函数并为数据帧中的每个子集写入结果
【发布时间】：2020-02-04 03:35:17
【问题描述】：

数据

我在 R 中有一个具有以下结构的数据框：

ID   group           text
100    1    An apple is a sweet, edible fruit produced by an apple tree.
103    1    An apple is a sweet, edible fruit produced by an apple tree.
105    1    Some dog breeds show more variation in size than other dog breeds.
106    1    An apple is a sweet, edible fruit produced by an apple tree.
107    1    An apple is a sweet, edible fruit produced by an apple tree.
209    1    Some dog breeds show more variation in size than other dog breeds.
300    1    Some dog breeds show more variation in size than other dog breeds.
501    1    An apple is a sweet, edible fruit produced by an apple tree.
503    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
711    2    Pizza is a savory dish of Italian origin.
799    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.
811    2    Ice cream is a sweetened frozen food typically eaten as a snack or dessert.

可以用这段代码重现：

test_df <- data.frame(
  "ID" = c(100, 103, 105, 106, 107, 209, 300, 501, 503, 711, 799, 811,),
  "group" = c(1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2),
  "text" = c('An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Some dog breeds show more variation in size than other dog breeds.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'An apple is a sweet, edible fruit produced by an apple tree.', 'Some dog breeds show more variation in size than other dog breeds.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Pizza is a savory dish of Italian origin.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.', 'Ice cream is a sweetened frozen food typically eaten as a snack or dessert.')
)

实际上，每个主题的文本都略有不同，有几十万个，分布在几十个组中。

我要做什么

我正在尝试编写一个执行以下操作的函数：

对于数据框中的每个组，比较该组中的所有文本，并确定主要词汇主题。
然后，在数据框中输入每个文本的相关主题作为新列。

以下是分析后数据框中两行的示例：

ID   group    topic           text
100    1      apple    An apple is a sweet, edible fruit produced by an apple tree.
105    1       dog     Some dog breeds show more variation in size than other dog breeds.

我目前所拥有的

我可以使用以下代码在一个完整的数据帧上运行这种功能（不按组进行子集化）：

# Preparing the texts

library(tm)
corpus <- Corpus(VectorSource(test_df$text))
corpus <- tm_map(corpus, removeWords, stopwords('english'))
corpus <- tm_map(corpus, stemDocument, language = 'english')
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Identifying topics

library(topicmodels)
TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
lda.output <- LDA(TF, k=2, method = 'Gibbs')

# Inputting the topic classification into the dataframe

test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])

我尝试将其转换为函数，然后使用以下代码按子集在数据帧上运行该函数：

library(tm)
library(topicmodels)

topic_identifier <- function(text) {
  corpus <- Corpus(VectorSource(text))
    corpus <- tm_map(corpus, removeWords, stopwords('english'))
    corpus <- tm_map(corpus, stemDocument, language = 'english')
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, stripWhitespace)
  TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
      lda.output <- LDA(TF, k=2, method = 'Gibbs')
  test_df <- cbind(test_df, terms(lda.output)[topics(lda.output)])
    }

by(test_df$text, test_df$group, topic_identifier)

但这不允许我将每个子集的相关输出保存在原始 df 中。

【问题讨论】：

标签： r function dataframe subset cluster-analysis

【解决方案1】：

by 函数在接收数据帧而不是列向量 text 作为输入时效果最佳。然后，您可以操作此数据框以返回非原始 test_df。本质上，整个数据帧上的完全相同的过程被保留在子数据帧上运行。

此外，您需要使用<- 分配结果以构建对象列表，这些对象甚至可以在末尾使用do.call + rbind 绑定在一起（假设每个数据框保持相同的列数和名称):

topic_identifier <- function(sub_df) { 
   corpus <- Corpus(VectorSource(sub_df$text)) 
   corpus <- tm_map(corpus, removeWords, stopwords('english')) 
   corpus <- tm_map(corpus, stemDocument, language = 'english') 
   corpus <- tm_map(corpus, removePunctuation) 
   corpus <- tm_map(corpus, stripWhitespace) 

   TF <- DocumentTermMatrix(corpus, control = list(weighting = weightTf))
   lda.output <- LDA(TF, k=2, method = 'Gibbs') 

   sub_df <- cbind(sub_df, terms(lda.output)[topics(lda.output)]) 
   return(sub_df)
} 

# BUILD LIST OF DFs, PASSING IN AND RETURNING A DATA FRAME
df_list <- by(test_df, test_df$group, topic_identifier)

# CONCATENATE ALL DFs INTO SINGLE DF
final_df <- do.call(rbind, unname(df_list))

【讨论】：