Transform Two Column Data Frame into Quanteda Dictionary Format (answers)

【Question title】: Transform Two Column Data Frame into Quanteda Dictionary Format
【Posted】: 2021-08-12 14:04:26
【Question】:

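My ultimate goal is to create a quanteda dictionary for classifying text data by topic.

However, my topic keywords are stored in a somewhat different format: one column holds about 4000 keywords, and a second column specifies the topic each keyword belongs to. Note that the number of words per topic is not equal. My data looks like this: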

     keywords      topic
[1]  "one"         "number"
[2]  "two"         "number"
[3]  "three"       "number"
[4]  "triangle"    "form"
[5]  "circle"      "form"
[...]

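How can I transform my keywords into the (quanteda) dictionary format, i.e. a named list with one character vector per topic, each containing that topic's keywords?

The list should look like this: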

list(number = c("one","two","three"),
     form = c("triangle","circle"))

Any help is much appreciated!

Here is my approach so far, but it does not seem right (or work) to me:

# 1) Initialize an empty list of vectors that corresponds to my number of topics & add topic names ("topic_names" is just a vector type chr 1:88 that includes the topic names)

mydictionary <- vector(mode = "list", length = 88)
names(mydictionary) <- topic_names

# 2) Create a loop that checks for each keyword to match a topic and adds it to the respective vector of my dictionary

# I got it working for one keyword like this:
if (names(mydictionary[1]) == keyword_list$topic[1]) { # if topic of keyword matches topic vector name
  mydictionary[[1]] <- c(mydictionary[[1]], keyword_list$keywords[1]) #add keyword to topic vector
}

# However, I don't know how to transform this into a loop, since a loop has to check every index of keyword_list for every index of mydictionary and I don't know how to achieve this...

【Question comments】:

Tags: r dictionary transformation quanteda

【Solution 1】:

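If your data is in a data.frame like topics (see the Data section), you can quickly get it into the list you want with the split function, which groups the keywords column by the topic column.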

my_dictionary <- split(topics$keywords, topics$topic)
my_dictionary

$form
[1] "triangle" "circle"  

$number
[1] "one"   "two"   "three"
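For comparison, here is a minimal sketch of the explicit loop the question was trying to write, assuming the keywords and topics are plain character columns with no missing values. One pass over the rows is enough, because a list can be indexed directly by topic name. The sample data and the names loop_dictionary, split_dictionary and same_result are illustrative only.

```r
# Illustrative data in the same shape as the question (not the real 4000 keywords).
keyword_list <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic    = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# Start from an empty list and append each keyword to the entry named after its topic.
loop_dictionary <- list()
for (i in seq_len(nrow(keyword_list))) {
  topic_name <- keyword_list$topic[i]
  loop_dictionary[[topic_name]] <- c(loop_dictionary[[topic_name]],
                                     keyword_list$keywords[i])
}

# split() does the same grouping in one call; sort the loop result by name to compare.
split_dictionary <- split(keyword_list$keywords, keyword_list$topic)
same_result <- identical(loop_dictionary[order(names(loop_dictionary))],
                         split_dictionary)
print(same_result)
```

The loop keeps topics in order of first appearance, while split() returns them sorted by name, which is why the loop result is reordered before the comparison.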

Data:

topics <- structure(list(keywords = c("one", "two", "three", "triangle", 
"circle"), topic = c("number", "number", "number", "form", "form"
)), class = "data.frame", row.names = c(NA, -5L))
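The list produced by split is not yet a quanteda dictionary object. A minimal sketch of the remaining step, assuming the quanteda package is installed: dictionary() accepts a named list, and tokens_lookup() applies the result to tokenized text. The sample sentence and the names keyword_groups, my_dict and toks are made up for illustration, and the quanteda calls are skipped if the package is not available.

```r
# Same sample data as in the Data section, rebuilt so this block runs on its own.
topics <- data.frame(
  keywords = c("one", "two", "three", "triangle", "circle"),
  topic    = c("number", "number", "number", "form", "form"),
  stringsAsFactors = FALSE
)

# Named list: one character vector of keywords per topic.
keyword_groups <- split(topics$keywords, topics$topic)

# Only run the quanteda part when the package is installed (assumption: it may not be).
if (requireNamespace("quanteda", quietly = TRUE)) {
  my_dict <- quanteda::dictionary(keyword_groups)

  # Hypothetical example text, tokenized and mapped onto the dictionary keys.
  toks <- quanteda::tokens("one circle and two more circles")
  print(quanteda::tokens_lookup(toks, dictionary = my_dict))
}
```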

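【Comments】:

Perfect, thank you! It worked right away! I see now that I was going about the whole thing in far too complicated a way.
Just a small follow-up question: when using split(), I lose one topic. My list contains only 87 vectors instead of 88, even though my original data.frame contains 88 unique topics. Any idea why and how this happens?
Do you have a topic with an NA value? If so, split will ignore it and drop it.
No, every topic has at least one assigned keyword.
It had nothing to do with split(). I had checked the keywords column for NAs and found none, and checking the number of topics with unique() gave me 88. I just had not followed your suggestion properly: there was indeed a topic with an NA value, i.e. my data contained only 87 topics (although I expected 88 because I had inserted 88, one was apparently lost along the way).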
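The 88 versus 87 mismatch in this thread is consistent with how base R treats missing values: unique() counts NA as a value, while split() drops rows whose grouping value is NA. A small sketch with made-up data (the "stray" keyword and all variable names are hypothetical) that reproduces the difference and locates the affected rows:

```r
# Made-up data: one keyword has a missing topic.
topics <- data.frame(
  keywords = c("one", "two", "circle", "stray"),
  topic    = c("number", "number", "form", NA),
  stringsAsFactors = FALSE
)

# unique() keeps NA as one of the distinct values.
n_unique <- length(unique(topics$topic))

# split() silently drops the row whose topic is NA.
n_split <- length(split(topics$keywords, topics$topic))

print(c(unique = n_unique, split = n_split))

# Show the rows with a missing topic so they can be fixed at the source.
print(topics[is.na(topics$topic), ])

# Alternative: addNA() turns NA into a factor level, so those rows form their own group.
with_na_group <- split(topics$keywords, addNA(topics$topic))
print(length(with_na_group))
```

Checking the topic column with is.na() before splitting, rather than only the keywords column, would have surfaced the problem directly.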