【Posted】: 2014-11-15 13:13:05
【Question】:
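I think I more or less understand Naive Bayes, but I have a few questions about implementing it for a simple binary text classification task.
Suppose a document D_i is some subset of the vocabulary x_1, x_2, ..., x_n.
There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i).
I have three questions: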
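Written out under the usual naive (conditional independence) assumption, and treating the document as the set of vocabulary words it contains, the quantity being computed is:

```latex
P(c_i \mid D) \;\propto\; P(D \mid c_i)\, P(c_i)
            \;=\; P(c_i) \prod_{x_j \in D} P(x_j \mid c_i)
```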
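- Is P(c_i) #docs in c_i / #total docs, or #words in c_i / #total words?
- Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
- Suppose x_j does not exist in the training set: do I give it a probability of 1 so that it does not alter the calculation?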
For example, say I have a training set:
training = [("hello world", "good"),
            ("bye world", "bad")]
So the classes would have
good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye": 1}
So now, if I want to compute the probability that a test string is good:
test1 = ["hello", "again"]
p_good = sum(good_class.values())/sum(all.values())
p_hello_good = good_class["hello"]/all["hello"]
p_again_good = 1 # because "again" doesn't exist in our training set
p_test1_good = p_good * p_hello_good * p_again_good
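For reference, the calculation above can be put together as a small runnable sketch. It only reproduces the computation as posed (word-count prior, per-word ratio of class count to overall count, factor of 1 for unseen words); whether those are the right estimates is exactly what the question asks. The helper names `count_words` and `score`, and the name `all_words` (used instead of `all` to avoid shadowing the Python builtin), are assumptions made for illustration.

```python
from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]


def count_words(training, label):
    """Count word occurrences over all training documents with the given label."""
    counts = Counter()
    for text, doc_label in training:
        if doc_label == label:
            counts.update(text.split())
    return counts


good_class = count_words(training, "good")
bad_class = count_words(training, "bad")
all_words = good_class + bad_class  # word counts over the whole training set


def score(words, class_counts, all_counts):
    """Score a document the way the question does (not necessarily the correct estimate)."""
    # Prior taken as (#words in class) / (#total words), as in the question.
    p = sum(class_counts.values()) / sum(all_counts.values())
    for word in words:
        if word in all_counts:
            # Counter returns 0 for words never seen in this class.
            p *= class_counts[word] / all_counts[word]
        # Words absent from the training set contribute a factor of 1.
    return p


test1 = ["hello", "again"]
p_test1_good = score(test1, good_class, all_words)
p_test1_bad = score(test1, bad_class, all_words)
print(p_test1_good, p_test1_bad)
```

With this training set, "hello" never appears in the bad class, so the bad-class score collapses to zero no matter what else the document contains.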
【Discussion】:
Tags: algorithm machine-learning