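Supervised Latent Dirichlet Allocation for Document Classification? Answers

【Question title】: Supervised Latent Dirichlet Allocation for Document Classification?
【Posted】: 2012-11-13 08:23:35
【Question】:

I have a bunch of documents that humans have already classified into groups.

Is there a modified version of LDA that I can use to train a model and then classify unknown documents with it?

【Comments on the question】:

Tags: machine-learning nlp classification document-classification lda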

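【Answer 1】:

Yes, you can try Labeled LDA in the Stanford Topic Modeling Toolbox: http://nlp.stanford.edu/software/tmt/tmt-0.4/

【Comments】:

Thanks, I'll take a look! Do you know whether there is a C/C++/Python implementation of L-LDA?
Sorry, I didn't see your message at first. I'm not aware of a C/Python implementation, but I haven't looked for one. I know Blei (the LDA author) usually publishes his code (C/C++) on his personal website, so I would check there.
The problem with this approach is that it requires each label to match exactly one topic (a 1-to-1 correspondence), so it is very restrictive.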

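【Answer 2】:

For what it's worth, LDA as a classifier is going to be fairly weak, because it is a generative model and classification is a discriminative problem. There is a variant of LDA called supervised LDA, which uses a more discriminative criterion to form the topics (you can get source code for it in various places), and there is also a paper with a max-margin formulation, though I don't know the status of its source code. I would avoid the Labeled LDA formulation unless you are sure it is what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.

However, it is worth pointing out that none of these methods use the topic model directly to do the classification. Instead, they represent each document not by word-based features but by its posterior over topics (the vector that results from running inference on the document), and feed that representation to a classifier, usually a linear SVM. This gives you topic-model-based dimensionality reduction followed by a strong discriminative classifier, which is probably what you are after. This pipeline is available in most languages using popular toolkits.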
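As a rough illustration of the pipeline described above (topic posteriors as features, then a linear SVM), here is a minimal sketch using scikit-learn. The toy corpus, the labels, and the choice of two topics are assumptions made purely for illustration; this uses plain unsupervised LDA, not supervised or Labeled LDA.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy labeled corpus (hypothetical data, for illustration only).
docs = [
    "simplistic silly and tedious",
    "exploitative and largely devoid of depth",
    "the film provides some great insight",
    "this is a film well worth seeing",
]
labels = ["neg", "neg", "pos", "pos"]

# Step 1: bag-of-words counts (LDA expects counts rather than tf-idf weights).
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)

# Step 2: per-document topic proportions, used as low-dimensional features.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_features = lda.fit_transform(counts)  # shape: (n_docs, n_topics)

# Step 3: a discriminative classifier trained on the topic features.
clf = LinearSVC()
clf.fit(topic_features, labels)

# Classify an unseen document by inferring its topic proportions first.
new_doc = ["a great film with real insight"]
new_features = lda.transform(vectorizer.transform(new_doc))
prediction = clf.predict(new_features)
print(prediction)
```

On a corpus this small the predicted label is not meaningful; the point is only the shape of the pipeline, in which the classifier never sees the words themselves.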

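【Comments】:

Another, newer approach that may be worth looking into is Partially Labeled LDA (link). It relaxes the requirement that every document in the training set must have a label.
Hey, the first link doesn't work. Is this the paper I should be looking at: arxiv.org/pdf/1003.0783.pdf?

【Answer 3】:

You can implement supervised LDA with PyMC, which uses a Metropolis sampler to learn the latent variables in the following graphical model:

[Figure: graphical model of supervised LDA]

The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the star rating associated with each document. The star rating is known as a response variable: a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find the latent topics that best predict the response variables of future unlabeled documents. For more information, see the original paper. Consider the following code: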

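#note: this code uses the PyMC 2.x API (pm.Container, pm.CompletedDirichlet, pm.MCMC)
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt  #needed for the topic plots at the end
from sklearn.feature_extraction.text import TfidfVectorizer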

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus =  ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3

#LDA parameters
num_features = 1000  #vocabulary size
num_topics = 4       #fixed for LDA

tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')

#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  #size D x V

print("number of docs: %d" % A_tfidf_sp.shape[0])
print("dictionary size: %d" % A_tfidf_sp.shape[1])

#tf-idf dictionary (scikit-learn >= 1.0 provides get_feature_names_out() instead)
tfidf_dict = tfidf.get_feature_names()

K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents

data = A_tfidf_sp.toarray()

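#Supervised LDA Graphical Model
#number of word slots per document; data is a dense D x V matrix, so this equals V for every document
Wd = [len(doc) for doc in data]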
alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])    

z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

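@pm.deterministic
def zbar(z=z):
    #empirical topic frequencies for each document
    zbar_list = []
    for i in range(len(z)):
        #fix the bin range to [0, K] so that bin k always counts topic k,
        #even when some topics are absent from the document
        hist, bin_edges = np.histogram(z[i], bins=K, range=(0, K))
        zbar_list.append(hist / float(np.sum(hist)))
    return pm.Container(zbar_list)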

eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                  value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

#visualize topics: plot the last posterior sample of each topic-word distribution
for k in range(K):
    phi_samples = np.squeeze(mcmc.trace('phi_%i' % k)[:])
    ax = plt.subplot(2, 2, k + 1)
    plt.bar(np.arange(V), phi_samples[-1,:])
plt.show()

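Given the training data (observed words and response variables), we can learn the global topics (beta; called phi in the code above) and the regression coefficients (eta) used to predict the response variable (Y), in addition to the topic proportions of each document (theta). To predict Y from the learned beta and eta, we can define a new model in which Y is not observed and use the previously learned beta and eta to obtain the following result:

[Figure: prediction results, with a histogram of the posterior over Y on the right]

Here we predict a positive review (approximately 2, given that the review ratings range from -2 to 2) for the test corpus, which consists of a single sentence, "this is a really positive review, great film", as shown by the posterior mode of the histogram on the right. See the ipython notebook for the full implementation.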
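The prediction step described above comes down to the sLDA response model, in which the expected response is the dot product of the regression coefficients and the document's empirical topic frequencies. The sketch below shows only that final computation with NumPy. The values of `eta` and `z_new` are made up for illustration; in practice they would come from posterior samples of the trained model and from inference on the unseen document.

```python
import numpy as np

K = 4  # number of topics, matching num_topics in the model above

# Hypothetical values, for illustration only (not results from the model).
eta = np.array([-1.5, -0.5, 0.5, 2.0])      # one regression coefficient per topic
z_new = np.array([3, 3, 2, 3, 0, 3, 3, 2])  # topic assignment of each word in the new document

# Empirical topic frequencies of the new document (zbar), one entry per topic.
zbar_new = np.bincount(z_new, minlength=K) / float(len(z_new))

# Expected response under the Gaussian response model: E[y] = eta . zbar
y_pred = float(np.dot(eta, zbar_new))
print(zbar_new, y_pred)
```

Averaging `y_pred` over many posterior samples of `eta` and `z_new` would give a posterior over Y like the histogram described above, rather than a single point estimate.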

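【Comments】:

Hi @vadim-smolyakov, how is this different from Multinomial Naive Bayes?
Yes, the purpose of sLDA is to learn global topics and local document scores (e.g. movie ratings) at the same time, while Multinomial Naive Bayes focuses more on classification. Both models need supervision (a score for sLDA, a class label for MNB). I did some analysis of Bernoulli NB that may be helpful here: github.com/vsmolyakov/experiments_with_python/blob/master/chp01/…
@VadimSmolyakov, how would we change the code if Y is not a number but text/a label?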