【Question title】: Dictionary not influenced by input?
【Posted】: 2018-07-23 05:11:03
【Question description】:

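The fastrtext package has a get_dictionary() function, which I assumed would return all the words in the dictionary. However, when I set wordNgrams to 2 or 3, it returns exactly the same list of words as when I set wordNgrams to 1. Can someone tell me what is going on here? Thanks!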

【Question comments】:

Tags: r text-classification fasttext


【Solution 1】:

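When you increase n in the n-grams, your fasttext classification algorithm still works on the same dictionary in every case. What changes is the training input: besides the individual words ("I", "love", "NY"), the model is also trained on concatenations of adjacent words ("I love", "love NY", which are bigrams). These word n-grams are not added to the dictionary; fastText hashes them into a fixed number of buckets (the -bucket option), which is why get_dictionary() returns the same word list. For demonstration I trained on 5-grams (pentagrams ;)). Of course, the larger n is, the longer the computation takes, but syntactic structure is captured better.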

library(fastrtext)

data("train_sentences")
data("test_sentences")

# prepare data
tmp_file_model <- tempfile()

train_labels <- paste0("__label__", train_sentences[,"class.text"])
train_texts <- tolower(train_sentences[,"text"])
train_to_write <- paste(train_labels, train_texts)
train_tmp_file_txt <- tempfile()
writeLines(text = train_to_write, con = train_tmp_file_txt)

test_labels <- paste0("__label__", test_sentences[,"class.text"])
test_texts <- tolower(test_sentences[,"text"])
test_to_write <- paste(test_labels, test_texts)

# train model 1 with unigrams only (wordNgrams = 1), timing 5 runs
library(microbenchmark)
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 1, "-verbose", 1)), times = 5)

# mean time: 1.229228 seconds

model1 <- load_model(tmp_file_model)

# train model 2 with word n-grams up to length 5 (wordNgrams = 5), timing 5 runs
microbenchmark(execute(commands = c("supervised", "-input", train_tmp_file_txt,
                     "-output", tmp_file_model, "-dim", 20, "-lr", 1,
                     "-epoch", 20, "-wordNgrams", 5, "-verbose", 1)), times = 5)

# mean time: 2.659191 seconds

model2 <- load_model(tmp_file_model)
str(get_dictionary(model1))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
str(get_dictionary(model2))
# chr [1:5060] "the" "</s>" "of" "to" "and" "in" "a" "that" "is" "for" ...
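
As a rough illustration of why the dictionary stays the same, here is a toy sketch in base R. It is not fastText's actual implementation: word_ngrams() and toy_bucket() are hypothetical helpers made up for this example, and the character-sum hash is only a stand-in for the real hash function. The point is that the word list is built from single tokens, while the n-grams are derived from it and mapped to bucket indices.

```r
# Toy illustration only; NOT how fastText is implemented internally.
tokens <- c("i", "love", "ny")

# The dictionary is built from single words, independent of wordNgrams.
dictionary <- unique(tokens)

# Hypothetical helper: all word n-grams of length 2..n from a token vector.
word_ngrams <- function(tokens, n) {
  out <- character(0)
  for (k in seq_len(n)[-1]) {               # k = 2, ..., n (empty when n = 1)
    if (length(tokens) < k) break           # sentence too short for k-grams
    starts <- seq_len(length(tokens) - k + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + k - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical stand-in hash: maps an n-gram to one of n_buckets slots.
toy_bucket <- function(gram, n_buckets = 10L) {
  sum(utf8ToInt(gram)) %% n_buckets + 1L
}

bigrams <- word_ngrams(tokens, 2)           # "i love" "love ny"
buckets <- vapply(bigrams, toy_bucket, integer(1))

print(dictionary)                           # still only the single words
print(bigrams)
print(buckets)                              # bucket indices, not dictionary entries
```

With n = 1 the helper returns no n-grams at all, and with larger n only the set of hashed features grows; the dictionary vector is never touched.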

【Comments】:
