【Question title】: Detokenize a quanteda tokens object
【Posted】: 2021-11-27 23:48:30
【Question】:

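I have a quanteda tokens object created with the `window` option (see the code below). I want to do this for a series of words to inform the creation of a custom dictionary. How can I "detokenize" it, that is, concatenate or recombine each tokenized "window" of text into a single string? Each string could be an item in a list or a row in a data.frame. I just need to be able to read each instance of the word or phrase (in this case "future") in its context.

Is there a command or some code that would let me "detokenize" this?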

library(quanteda)
library(dplyr)

# Example data
d <- c("Thank you Mr. Speaker.  Mr. Speaker I’m not sure how,   but to the department of PWTTS, regarding the question I’d asked previously about the  future of our water reservoir.  I wonder if that was looked at since I ask that question to  Ms. Thompson.  Thank you", "Thank you Mr. Speaker.  Now if that doctor would be  located in that community how is the logistics or air travel going to be, moving between  the communities in the future.  Thank you")

# Corpus
c <- corpus(d)

# My tokens object consisting of 3-word window around instances of "future".
ttt <- tokens(c, remove_punct = TRUE, remove_numbers = FALSE) %>%
  tokens_keep(pattern = "future", window = 3)

【Comments】:

    Tags: r text tokenize corpus quanteda


    【Solution 1】:

    For list output:

    > lapply(ttt, paste, collapse = " ")
    $text1
    [1] "previously about the future of our water"
    
    
    $text2
    [1] "communities in the future Thank you"
    

    Or, for a named character vector, which can easily become a column in your data.frame:

    > vapply(ttt, paste, collapse = " ", character(1))
                                         text1                                      text2 
    "previously about the future of our water"      "communities in the future Thank you" 
    

    【Discussion】:

    • This works well. I used as.data.frame() on the object created with vapply(). Thanks!
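
The data.frame step could look roughly like the sketch below. It is not the original poster's code: the sample sentences and the names `txt`, `win`, `windows` and `context_df` are made up for illustration, and it uses `data.frame()` directly rather than `as.data.frame()` so that the document names end up in their own column.

```r
library(quanteda)

# Hypothetical sample texts (not the data from the question)
txt <- c(text1 = "We discussed the future of our water reservoir.",
         text2 = "Travel between the communities in the future. Thank you")

# Keep a 3-token window on each side of every match of "future"
win <- tokens_keep(tokens(txt, remove_punct = TRUE),
                   pattern = "future", window = 3)

# as.list() yields one character vector of tokens per document;
# paste(collapse = " ") joins each vector back into a single string
windows <- vapply(as.list(win), paste, character(1), collapse = " ")

# One row per document: the document name and its recombined window
context_df <- data.frame(doc_id = names(windows),
                         context = unname(windows),
                         stringsAsFactors = FALSE)
print(context_df)
```

Note that this gives one string per document, so if a document contains several matches, their windows end up joined in the same string.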