Title: Latent Dirichlet Allocation (LDA) with Gibbs Sampling in Python
Posted: 2022-07-04 17:23:31
Question:

I am trying to re-implement LDA with Gibbs sampling in Python 3.8, but my code gives wrong results. I would really appreciate your help debugging the Gibbs sampling procedure!

The code I adapted is from Agustinus Kristiadi's blog, which uses an inference method rather than sampling to obtain Z. The parameter names follow Griffiths et al. 2004. I am currently having trouble with the Gibbs sampling step; this is my implementation of that algorithm:
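Concretely, the full conditional I am sampling Z from is Eq. 5 of Griffiths et al. 2004, with n_ij_wi, n_ij_a, n_ij_di and n_i_di (defined in lda.py below) playing the roles of the four counts:

```latex
P(z_i = j \mid \mathbf{z}_{-i}, \mathbf{w}) \propto
\frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta}
\cdot
\frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha}
```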

As a simple correctness test, I use a vocabulary of size 5 and a set of 9 documents that I want to split into 2 topics. You can see the document set in the main.py code below. The first four documents should belong to one topic and the last five to the other.

Having coded all the parts and trained for 1000 iterations, the resulting document-topic distribution looks wrong - it assigns almost all documents to the first topic:

 [[0.57142857 0.42857143]
 [0.57142857 0.42857143]
 [0.42857143 0.57142857]
 [0.57142857 0.42857143]
 [0.71428571 0.28571429]
 [0.85714286 0.14285714]
 [0.57142857 0.42857143]
 [0.71428571 0.28571429]
 [0.57142857 0.42857143]]
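(For what it's worth, each entry of this matrix is the smoothed count (n_jd + alpha)/(n_ad + T*alpha) from the Theta update in lda.py, so with alpha=1, T=2 and 5-word documents only six distinct values can ever appear:)

```python
# Enumerate the possible Theta entries for 5-token documents with
# alpha = 1 and T = 2 topics, matching the update used in lda.py.
alpha, T, n_ad = 1, 2, 5
values = [(n_jd + alpha) / (n_ad + T * alpha) for n_jd in range(n_ad + 1)]
print(values)  # 1/7 ... 6/7, i.e. 0.142857..., 0.285714..., ..., 0.857142...
```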

However, once I obtain Z using Agustinus Kristiadi's approach instead, the code gives reasonable results:

[[0.14285714 0.85714286]
 [0.14285714 0.85714286]
 [0.28571429 0.71428571]
 [0.28571429 0.71428571]
 [0.85714286 0.14285714]
 [0.85714286 0.14285714]
 [0.85714286 0.14285714]
 [0.57142857 0.42857143]
 [0.85714286 0.14285714]]

I have checked the code many times but still cannot find the mistake. To me it looks like a faithful implementation of the algorithm above. I wonder whether I have misunderstood the Gibbs sampling procedure. I will show my code below.

Here is main.py:

import numpy as np
import lda

# Vocabulary - all the words
W = np.array([0, 1, 2, 3, 4])

# Document words
X = np.array([
    [0, 0, 1, 2, 2],
    [0, 0, 1, 1, 1],
    [0, 1, 2, 2, 2],
    [2, 2, 1, 1, 4],
    [4, 4, 4, 4, 4],
    [3, 3, 4, 4, 4],
    [3, 4, 4, 4, 4],
    [3, 3, 3, 4, 1],
    [4, 4, 3, 3, 2],
])

D = X.shape[0]  # num of docs
n_w = W.shape[0]  # num of words
T = 2  # num of topics

'''Randomized Initialization'''
# Dirichlet priors
alpha = 1    # Dirichlet parameter for Theta, document-topic distribution
beta = 1     # Dirichlet parameter for Phi, topic-word distribution
iterations = 1000

# Z := word-topic assignment
Z = np.zeros(shape=[D, n_w], dtype=int)

for i in range(D):
    for l in range(n_w):
        Z[i, l] = np.random.randint(T)  # randomly assign word's topic

# Theta := document-topic distribution
Theta = np.zeros([D, T])

for i in range(D):
    Theta[i] = np.random.dirichlet(alpha*np.ones(T))

# Phi := topic-word distribution
Phi = np.zeros([T, n_w])

for k in range(T):
    Phi[k] = np.random.dirichlet(beta*np.ones(n_w))

Theta, Phi, Z = lda.gibbs_sampling_mine(D, T, W, Theta, Phi, X, Z, alpha, beta, iterations)
print(Theta)

Here is lda.py:

import numpy as np

'''
Symbols for all the parameters follow Griffiths et al 2004: 
https://www.pnas.org/content/pnas/101/suppl_1/5228.full.pdf?__=
T: Number of topics
n_w: Number of words
D: Number of documents

Theta ~ Dirichlet(alpha), document-topic distribution
Phi ~ Dirichlet(beta), topic-word distribution

X: corpus
Z: word-topic assignment

-- For Z --
n_ij_wi: the number of word wi assigned to topic j, not including the current one
n_ij_a:  the number of words assigned to topic j, not including the current one
n_ij_di: the number of words in document di assigned to topic j, not including the current one
n_i_di:  the number of words in di minus one

-- For Phi --
n_jw: the number of times word w is assigned to topic j
n_ja: the total number of words assigned to topic j in Z

-- For Theta --
n_jd: the number of words in document d assigned to topic j
n_ad: the number of words in document d
'''

def gibbs_sampling_mine(D, T, W, Theta, Phi, X, Z, alpha, beta, iterations=1000):
    n_w = len(W)
    '''Gibbs sampling'''
    for it in range(iterations):
        # Sample from full conditional of Z
        # ---------------------------------
        for d in range(D):
            for w in range(n_w):
                P_zdw = np.zeros([T])
                for j in range(T):
                    n_ij_wi = find_n_ij_wi(Z, X, j, w, d)  
                    n_ij_a  = np.sum(Z==j)-1 if Z[d][w]==j else np.sum(Z==j)
                    n_ij_di = np.sum(Z[d]==j)-1 if Z[d][w]==j else np.sum(Z[d]==j)
                    n_i_di  = X[d].shape[0]-1
                    P_zdw[j] = (n_ij_wi + beta)/(n_ij_a + n_w*beta) * (n_ij_di+alpha)/(n_i_di+T*alpha)
                P_zdw = P_zdw / np.sum(P_zdw)
                Z[d][w] = np.random.multinomial(1, P_zdw).argmax()

        # Agustinus Kristiadi's implementation for Z: 
        # for i in range(D):
        #     for v in range(n_w):
        #         p_iv = np.exp(np.log(Theta[i]) + np.log(Phi[:, X[i, v]]))
        #         p_iv /= np.sum(p_iv)
        #         Z[i, v] = np.random.multinomial(1, p_iv).argmax()

        # Sample from full conditional of Theta - document-topic distribution
        # ----------------------------------
        for d in range(D):
            for j in range(T):
                n_jd = np.sum(Z[d]==j)
                n_ad = X[d].shape[0]
                Theta[d][j] = (n_jd + alpha) / (n_ad + T*alpha)

        # Sample from full conditional of Phi - topic-word distribution
        # ---------------------------------
        for j in range(T):
            for w in range(n_w):
                n_jw = find_n_jw(Z, X, j, w)
                n_ja = np.sum(Z==j)
                Phi[j][w] = (n_jw + beta) / (n_ja + T*beta)

    return Theta, Phi, Z


def find_n_jw(Z, X, j, w):
    n_jw = 0
    for d in range(X.shape[0]):
        for i in range(X.shape[1]):
            if Z[d][i]==j and X[d][i]==w:
                n_jw+=1
    return n_jw

def find_n_ij_wi(Z, X, j, w, d):
    n_ij_wi = 0
    for di in range(X.shape[0]):
        for i in range(X.shape[1]):
            if di==d and i==w:
                continue
            elif Z[di][i]==j and X[di][i]==w:
                n_ij_wi+=1
    return n_ij_wi
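To rule out the helpers, here is a quick standalone sanity check (not part of the training code, just a sketch on a toy array): summing find_n_jw over every topic and every vocabulary word must recover the total number of tokens, since each token has exactly one topic and one word id.

```python
import numpy as np

def find_n_jw(Z, X, j, w):
    # count tokens equal to word w that are currently assigned to topic j
    n_jw = 0
    for d in range(X.shape[0]):
        for i in range(X.shape[1]):
            if Z[d][i] == j and X[d][i] == w:
                n_jw += 1
    return n_jw

X = np.array([[0, 0, 1, 2, 2],
              [3, 3, 4, 4, 4]])   # 10 tokens, vocabulary {0..4}
Z = np.array([[0, 1, 0, 0, 1],
              [1, 1, 0, 1, 0]])   # an arbitrary topic assignment
total = sum(find_n_jw(Z, X, j, w) for j in range(2) for w in range(5))
print(total)  # 10 - every token is counted exactly once
```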

Comments:

    Tags: python-3.x implementation lda


    Answer 1:

    I have been working on the same implementation and tried to find the error in your code. I changed a lot of things, and I think the problem lies in how the counts are defined. Here is my code (not sure whether it is correct - I still get different results), so I would be very grateful if you spot my mistake! A function that converts X into a document-word count matrix:

    def doc_w_count(X):
        V = len(np.unique(X))
        D,W = np.shape(X)
        doc_w_counts = np.zeros([D,W])
        for d in range(D):
            for w in range(W):
                for v in range(V):
                    if X[d,w]==v:
                        doc_w_counts[d,v]+=1
        return doc_w_counts
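    As an aside, the same document-word count matrix can be built without the triple loop; a sketch using np.bincount (assuming word ids run from 0 to V-1):

```python
import numpy as np

def doc_w_count_vec(X, V):
    # one histogram of word ids per document row
    return np.stack([np.bincount(row, minlength=V) for row in X]).astype(float)

X = np.array([[0, 0, 1, 2, 2],
              [4, 4, 3, 3, 2]])
print(doc_w_count_vec(X, 5))
# [[2. 1. 2. 0. 0.]
#  [0. 0. 1. 2. 2.]]
```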
    

    Initialization:

    import numpy as np
    X = np.array([
        [0, 0, 1, 2, 2],
        [0, 0, 1, 1, 1],
        [0, 1, 2, 2, 2],
        [2, 2, 1, 1, 4],
        [4, 4, 4, 4, 4],
        [3, 3, 4, 4, 4],
        [3, 4, 4, 4, 4],
        [3, 3, 3, 4, 1],
        [4, 4, 3, 3, 2],
    ])
    K = 2
    D,W = np.shape(X)
    Z = np.random.randint(0,K,[D,W])
    
    alpha = 1
    beta = 1
    iters = 2000
    doc_w_counts_ = doc_w_count(X)
    doc_count = doc_w_counts_.sum(axis=1)
    doc_topic_count = np.zeros([D,K])
    topic_w_count = np.zeros([K,W])
    topic_count = np.zeros(K)
    

    Sampling:

    for it in range(iters):
        for d in range(D):
            for w in range(W):
                P_z = np.zeros(K)
                for k in range(K):
                    doc_topic_count[:,k] = np.sum((Z==k)*doc_w_counts_,axis=1)
                    topic_w_count[k,:]=np.sum((Z==k)*doc_w_counts_,axis=0)
                    topic_count[k] = np.sum((Z==k)*doc_w_counts_)
                    if Z[d,w]==k and doc_w_counts_[d,w]>0:
                        topic_w_count[k,w]-=1
                        topic_count[k]-=1
                        doc_topic_count[d,k]-=1
                    phi = (topic_w_count[k,w]+beta)/(topic_count[k]+W*beta)
                    theta = (doc_topic_count[d,k]+alpha)/(doc_count[d]+alpha*K)
                    P_z[k] = phi*theta
                P_z = P_z/np.sum(P_z)
                Z[d,w]=np.random.multinomial(K,P_z).argmax()
    

    Computing Theta:

    Theta = np.zeros([D,K])
    for d in range(D):
        for k in range(K):
            doc_topic_count[:,k] = np.sum((Z==k)*doc_w_counts_,axis=1)
            Theta[d,k] = (doc_topic_count[d,k]+alpha)/(doc_count[d]+alpha*K)
    

    The result for Theta:

    array([[0.85714286, 0.14285714],
           [0.57142857, 0.42857143],
           [0.14285714, 0.85714286],
           [0.85714286, 0.14285714],
           [0.85714286, 0.14285714],
           [0.85714286, 0.14285714],
           [0.85714286, 0.14285714],
           [0.85714286, 0.14285714],
           [0.57142857, 0.42857143]])
    

    As you can see, unfortunately I haven't solved it yet either, so if you figure out what is wrong with your code, please let me know!

    Best regards

    Discussion:
