【Question title】: Basic concepts: Naive Bayes algorithm for classification
【Posted】: 2014-11-15 13:13:05
【Question】:

I think I more or less understand naive Bayes, but I have a few questions about implementing a simple binary text classification test.

Suppose a document D_i is some subset of the vocabulary x_1, x_2, ... x_n.

There are two classes c_i that any document can fall into, and I want to compute P(c_i|D) for some input document D, which is proportional to P(D|c_i)P(c_i).

I have three questions:

  1. Should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?
  2. Should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
  3. If an x_j does not appear in the training set, do I give it a probability of 1 so that it doesn't change the calculation?

For example, suppose I have this training set:

training = [("hello world", "good"),
            ("bye world", "bad")]

So the class counts would be

good_class = {"hello": 1, "world": 1}
bad_class = {"bye": 1, "world": 1}
all = {"hello": 1, "world": 2, "bye": 1}
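For reference, the counts above can be built programmatically; a minimal sketch using Python's `collections.Counter` (the variable `all_words` stands in for `all`, which would shadow a builtin):

```python
from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# Tally word tokens per class by splitting each document on whitespace.
good_class = Counter(w for doc, label in training if label == "good"
                     for w in doc.split())
bad_class = Counter(w for doc, label in training if label == "bad"
                    for w in doc.split())

# Counter addition merges the tallies; corresponds to `all` above.
all_words = good_class + bad_class  # {"hello": 1, "world": 2, "bye": 1}
```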

So now, if I want to compute the probability that a test string is good:

test1 = ["hello", "again"]
p_good = sum(good_class.values())/sum(all.values())
p_hello_good = good_class["hello"]/all["hello"]
p_again_good = 1 # because "again" doesn't exist in our training set

p_test1_good = p_good * p_hello_good * p_again_good

【Question discussion】:

    Tags: algorithm machine-learning


    【Solution 1】:

    Since this question is quite broad, I can only give a limited answer:

    First: should P(c_i) be #docs in c_i / #total docs, or #words in c_i / #total words?

    P(c_i) = #docs in c_i / #total docs
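A minimal sketch of this prior on the toy training set from the question (variable names are illustrative):

```python
from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# P(c_i) as a document-count ratio: docs labelled c_i / total docs.
doc_counts = Counter(label for _, label in training)
p_good = doc_counts["good"] / len(training)  # 1 good doc out of 2 -> 0.5
```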
    

    Second: should P(x_j|c_i) be #times x_j appears in D / #times x_j appears in c_i?
    As @larsmans noted:

    It is the number of occurrences of the word in documents of that class,
    divided by the total number of words in that class over the whole dataset.
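A sketch of that corrected likelihood on the question's toy set (names are illustrative, not from the post):

```python
from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# Count word tokens belonging to the "good" class.
good_words = Counter(w for doc, label in training if label == "good"
                     for w in doc.split())

# P("hello" | good) = count of "hello" in good docs / total words in good docs
p_hello_good = good_words["hello"] / sum(good_words.values())  # 1 / 2 = 0.5
```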
    

    Third: if an x_j does not appear in the training set, do I give it a probability of 1 so that it doesn't change the calculation?

    For that we have the Laplace correction, or additive smoothing. It is applied as

    p(x_j|c_i) = (#times x_j appears in c_i + 1) / (#total words in c_i + |V|)

    which neutralizes the effect of features that do not occur.
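A sketch of additive smoothing on the question's toy set, where |V| is the vocabulary size over the whole training set (helper names are illustrative):

```python
from collections import Counter

training = [("hello world", "good"),
            ("bye world", "bad")]

# Word counts for the "good" class and the shared vocabulary.
good_words = Counter(w for doc, label in training if label == "good"
                     for w in doc.split())
vocab = {w for doc, _ in training for w in doc.split()}  # hello, world, bye

def p_word_given_good(word):
    # (count of word in class + 1) / (total words in class + |V|)
    return (good_words[word] + 1) / (sum(good_words.values()) + len(vocab))

p_again_good = p_word_given_good("again")  # unseen word: (0 + 1) / (2 + 3) = 0.2
```

This replaces the "probability of 1" idea from the question: unseen words get a small but non-zero probability instead of being ignored.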
    

    【Discussion】:

    • No, P(xⱼ|cᵢ) is the frequency of xⱼ in class cᵢ divided by the total number of word tokens in all documents of that class.