【Question Title】: Implementing skip gram with scikit-learn?
【Posted】: 2019-12-28 19:20:33
【Question】:

Is there a way to implement skip-grams with the scikit-learn library? I have manually generated a list of n-skip-grams and passed it as the vocabulary to the CountVectorizer() method.

Unfortunately, its predictive performance is poor: only 63% accuracy. However, I get 77-80% accuracy with the default CountVectorizer() using ngram_range(min, max).

Is there a better way to implement skip-grams in scikit-learn?

Here is part of my code:

corpus = GetCorpus()  # reads the text from a file and returns it as a list

vocabulary = list(GetVocabulary(corpus, k, n))
# returns the k-skip-n-grams of the corpus

vec = CountVectorizer(
          tokenizer=lambda x: x.split(),
          ngram_range=(2,2),
          stop_words=stopWords,
          vocabulary=vocabulary)
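
GetCorpus() and GetVocabulary() are my own helpers; a minimal sketch of what the skip-gram generation step could look like (illustrative only, not my exact code, assuming whitespace-tokenized input):

from itertools import combinations

def k_skip_n_grams(tokens, k, n):
    # hypothetical stand-in for GetVocabulary(): yields all k-skip-n-grams
    for i in range(len(tokens) - n + 1):
        head = tokens[i]
        # the remaining n-1 tokens come from the next n-1+k positions
        for tail in combinations(tokens[i + 1:i + n + k], n - 1):
            yield ' '.join((head,) + tail)

# example: all 2-skip-bigrams of a short phrase
print(list(k_skip_n_grams('the rain in Spain'.split(), k=2, n=2)))
# -> ['the rain', 'the in', 'the Spain', 'rain in', 'rain Spain', 'in Spain']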

【Discussion】:

    Tags: python machine-learning scikit-learn


    【Solution 1】:

    To vectorize text with skip-grams in scikit-learn, simply passing the skip-gram tokens as the vocabulary to CountVectorizer will not work. You need to change the way tokens are processed, which can be done with a custom analyzer. Below is an example vectorizer that produces 1-skip-2-grams,

    from toolz import itertoolz, compose
    from toolz.curried import map as cmap, sliding_window, pluck
    from sklearn.feature_extraction.text import CountVectorizer
    
    class SkipGramVectorizer(CountVectorizer):
        def build_analyzer(self):    
            preprocess = self.build_preprocessor()
            stop_words = self.get_stop_words()
            tokenize = self.build_tokenizer()
            return lambda doc: self._word_skip_grams(
                    compose(tokenize, preprocess, self.decode)(doc),
                    stop_words)
    
        def _word_skip_grams(self, tokens, stop_words=None):
            # handle stop words
            if stop_words is not None:
                tokens = [w for w in tokens if w not in stop_words]
    
            # windows of 3 tokens -> keep the 1st and 3rd -> join with a space
            return compose(cmap(' '.join), pluck([0, 2]), sliding_window(3))(tokens)
    

    For example, applying it to this Wikipedia example,

    text = ['the rain in Spain falls mainly on the plain']
    
    vect = SkipGramVectorizer()
    vect.fit(text)
    vect.get_feature_names()
    

    this vectorizer produces the following tokens,

    ['falls on',  'in falls',  'mainly the',  'on plain',
     'rain spain',  'spain mainly',  'the in']
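
    The one-liner in _word_skip_grams composes three toolz steps. A minimal sketch of the intermediate values, using the plain (non-curried) forms on the tokens above:

    from toolz import sliding_window, pluck

    tokens = ['the', 'rain', 'in', 'spain']
    windows = list(sliding_window(3, tokens))
    # -> [('the', 'rain', 'in'), ('rain', 'in', 'spain')]
    pairs = list(pluck([0, 2], windows))  # keep the 1st and 3rd token of each window
    # -> [('the', 'in'), ('rain', 'spain')]
    print([' '.join(p) for p in pairs])
    # -> ['the in', 'rain spain']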
    

    【Discussion】:

    • Thanks for the reply, bro. I will try it soon and let you know.
    【Solution 2】:

    I came up with my own implementation of a skip-gram vectorizer. It is inspired by this post. I also restricted skip-grams to not cross sentence boundaries (using nltk.sent_tokenize) in order to limit the feature space. Here is my code:

    import nltk
    from itertools import combinations
    from toolz import compose
    from sklearn.feature_extraction.text import CountVectorizer
    
    class SkipGramVectorizer(CountVectorizer):
    
        def __init__(self, k=1, **kwds):
            super(SkipGramVectorizer, self).__init__(**kwds)
            self.k = k
    
        def build_sent_analyzer(self, preprocess, stop_words, tokenize):
            return lambda sent : self._word_skip_grams(
                    compose(tokenize, preprocess, self.decode)(sent),
                    stop_words)
    
        def build_analyzer(self):    
            preprocess = self.build_preprocessor()
            stop_words = self.get_stop_words()
            tokenize = self.build_tokenizer()
            sent_analyze = self.build_sent_analyzer(preprocess, stop_words, tokenize)
    
            return lambda doc : self._sent_skip_grams(doc, sent_analyze)
    
        def _sent_skip_grams(self, doc, sent_analyze):
            skip_grams = []
            for sent in nltk.sent_tokenize(doc):
                skip_grams.extend(sent_analyze(sent))
            return skip_grams
    
        def _word_skip_grams(self, tokens, stop_words=None):
            """Turn tokens into a sequence of n-grams after stop words filtering"""
            # handle stop words
            if stop_words is not None:
                tokens = [w for w in tokens if w not in stop_words]
    
            # handle token n-grams
            min_n, max_n = self.ngram_range
            k = self.k
            if max_n != 1:
                original_tokens = tokens
                if min_n == 1:
                    # no need to do any slicing for unigrams
                    # just iterate through the original tokens
                    tokens = list(original_tokens)
                    min_n += 1
                else:
                    tokens = []
    
                n_original_tokens = len(original_tokens)
    
                # bind method outside of loop to reduce overhead
                tokens_append = tokens.append
                space_join = " ".join
    
                for n in range(min_n,
                               min(max_n + 1, n_original_tokens + 1)):
                    for i in range(n_original_tokens - n + 1):
                        # k-skip-n-grams
                        head = [original_tokens[i]]                    
                        for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1):
                            tokens_append(space_join(head + list(skip_tail)))
            return tokens
    
    def test(text, ngram_range, k):
        vectorizer = SkipGramVectorizer(ngram_range=ngram_range, k=k)
        vectorizer.fit_transform(text)
        print(vectorizer.get_feature_names())
    
    def main():
        text = ['Insurgents killed in ongoing fighting.']
    
        # 2-skip-bi-grams
        test(text, (2,2), 2)
        # 2-skip-tri-grams
        test(text, (3,3), 2)
    ###############################################################################################
    if __name__ == '__main__':
        main()
    

    This generates the following feature names:

    ['in fighting', 'in ongoing', 'insurgents in', 'insurgents killed', 'insurgents ongoing', 'killed fighting', 'killed in', 'killed ongoing', 'ongoing fighting']
    ['in ongoing fighting', 'insurgents in fighting', 'insurgents in ongoing', 'insurgents killed fighting', 'insurgents killed in', 'insurgents killed ongoing', 'insurgents ongoing fighting', 'killed in fighting', 'killed in ongoing', 'killed ongoing fighting']
    

    Note that I essentially took the _word_ngrams function from the VectorizerMixin class and replaced the line

    tokens_append(space_join(original_tokens[i: i + n]))
    

    with the following:

    head = [original_tokens[i]]                    
    for skip_tail in combinations(original_tokens[i+1:i+n+k], n-1):
        tokens_append(space_join(head + list(skip_tail)))
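
    To see why this yields k-skip-n-grams: the head token is fixed at position i, and the remaining n-1 tokens are drawn, in order, from the next n-1+k positions. A minimal sketch of that inner step in isolation, on the example sentence from above:

    from itertools import combinations

    tokens = ['insurgents', 'killed', 'in', 'ongoing', 'fighting']
    n, k, i = 2, 2, 0
    head = [tokens[i]]
    # tail candidates are the next n-1+k = 3 tokens after position i
    for skip_tail in combinations(tokens[i+1:i+n+k], n-1):
        print(' '.join(head + list(skip_tail)))
    # prints: insurgents killed / insurgents in / insurgents ongoing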
    

    【Discussion】:
