【Question Title】: sklearn TfidfVectorizer: how to make certain words appear only as part of bigrams in the features
【Posted】: 2019-08-05 07:44:27
【Question Description】:

I want TfidfVectorizer's featurization to treat some predefined words, e.g. "script" and "rule", as usable only inside bigrams.

If I have the text "Script include is a script that has rule which has a business rule"

and I apply the following to the text above

tfidf = TfidfVectorizer(ngram_range=(1,2),stop_words='english')

I expect to get

['script include','business rule','include','business']

【Question Discussion】:

  • Why is 'include script' not in your output? In 'include is a script', 'is' and 'a' are stop words, and you are removing stop words. Can you clarify?

标签: python scikit-learn tfidfvectorizer


【Solution 1】:

Basically, you are looking to customize n-gram creation around your special words (called interested_words in the function below). The following is an adaptation of scikit-learn's default n-gram creation function:

def custom_word_ngrams(tokens, stop_words=None, interested_words=None):
    """Turn tokens into a sequence of n-grams after stop-word filtering."""
    stop_words = stop_words or []
    interested_words = interested_words or []

    original_tokens = tokens
    # positions of stop words; any bigram touching one of these is skipped
    stop_wrds_inds = np.where(np.isin(tokens, stop_words))[0]

    # unigrams: drop both the stop words and the special words
    tokens = [w for w in tokens if w not in stop_words + interested_words]

    n_original_tokens = len(original_tokens)

    # bind methods outside of the loop to reduce overhead
    tokens_append = tokens.append
    space_join = " ".join

    for i in range(n_original_tokens - 1):
        # keep a bigram only if neither position holds a stop word
        if not any(np.isin(stop_wrds_inds, [i, i + 1])):
            tokens_append(space_join(original_tokens[i: i + 2]))

    return tokens

Now we can plug this function into TfidfVectorizer's analyzer, as shown below:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.feature_extraction import text


def analyzer():
    base_vect = CountVectorizer()
    stop_words = list(text.ENGLISH_STOP_WORDS)
    preprocess = base_vect.build_preprocessor()
    tokenize = base_vect.build_tokenizer()

    return lambda doc: custom_word_ngrams(
        tokenize(preprocess(base_vect.decode(doc))), stop_words, ['script', 'rule']) 
    #feed your special words list here

vectorizer = TfidfVectorizer(analyzer=analyzer())
vectorizer.fit(["Script include is a script that has rule which has a business rule"])
vectorizer.get_feature_names_out()

array(['business', 'business rule', 'include', 'script include'], dtype=object)
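For reference, the same idea can be condensed into a self-contained sketch (the function and helper names below are illustrative, not from the answer); the logic is unchanged: special words are dropped from unigrams but kept in bigrams that contain no stop words:

```python
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

def word_plus_clean_bigrams(tokens, stop_words, special_words):
    # unigrams: drop stop words and the special words
    out = [w for w in tokens if w not in stop_words and w not in special_words]
    # bigrams: keep adjacent pairs that contain no stop word
    for a, b in zip(tokens, tokens[1:]):
        if a not in stop_words and b not in stop_words:
            out.append(a + " " + b)
    return out

def make_analyzer(special_words):
    # reuse the stock preprocessing and tokenization steps
    base = CountVectorizer()
    preprocess = base.build_preprocessor()
    tokenize = base.build_tokenizer()
    stops = set(text.ENGLISH_STOP_WORDS)
    return lambda doc: word_plus_clean_bigrams(
        tokenize(preprocess(doc)), stops, set(special_words))

vec = TfidfVectorizer(analyzer=make_analyzer(["script", "rule"]))
vec.fit(["Script include is a script that has rule which has a business rule"])
print(sorted(vec.vocabulary_))  # ['business', 'business rule', 'include', 'script include']
```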

【Discussion】:

    【Solution 2】:
    from sklearn.feature_extraction import text 
    # Given a vocabulary returns a filtered vocab which
    # contain only tokens in include_list and which are 
    # not stop words
    def filter_vocab(full_vocab, include_list):
        b_list = list()
        for x in full_vocab:
            add = False
            for t in x.split():
                if t in text.ENGLISH_STOP_WORDS:
                    add = False
                    break
                if t in include_list:
                    add = True
            if add:
                b_list.append(x)
        return b_list
    
    # Get all the ngrams (one can also use nltk.util.ngram)
    ngrams = TfidfVectorizer(ngram_range=(1,2), norm=None, smooth_idf=False, use_idf=False)
    X = ngrams.fit_transform(["Script include is a script that has rule which has a business rule"])
    full_vocab = ngrams.get_feature_names_out()
    
    # filter the full ngram based vocab
    filtered_v = filter_vocab(full_vocab,["include", "business"])
    
    # Get tf-idf using the new filtered vocab
    vectorizer = TfidfVectorizer(ngram_range=(1,2), vocabulary=filtered_v)
    X = vectorizer.fit_transform(["Script include is a script that has rule which has a business rule"])
    v = vectorizer.get_feature_names_out()
    print (v)
    

    The code is commented to explain what it is doing.
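    As a quick end-to-end check of this two-pass approach, the sketch below (with filter_vocab copied from above and the same example sentence) prints the filtered vocabulary; only n-grams that contain "include" or "business" and no stop words survive:

    ```python
    from sklearn.feature_extraction import text
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Copied from the answer: keep only n-grams that contain at least one
    # word from include_list and no stop words at all.
    def filter_vocab(full_vocab, include_list):
        b_list = []
        for x in full_vocab:
            add = False
            for t in x.split():
                if t in text.ENGLISH_STOP_WORDS:
                    add = False
                    break
                if t in include_list:
                    add = True
            if add:
                b_list.append(x)
        return b_list

    # First pass: collect every unigram and bigram.
    ngrams = TfidfVectorizer(ngram_range=(1, 2))
    ngrams.fit(["Script include is a script that has rule which has a business rule"])
    full_vocab = sorted(ngrams.vocabulary_)

    filtered_v = filter_vocab(full_vocab, ["include", "business"])
    print(filtered_v)  # ['business', 'business rule', 'include', 'script include']
    ```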

    【Discussion】:

    • Not exactly what I wanted... but it pointed me in a direction... thanks
    【Solution 3】:

    TfidfVectorizer allows you to supply your own tokenizer, so you can do something like the following. But you will lose the information about the other words in the vocabulary.

    from sklearn.feature_extraction.text import TfidfVectorizer
    corpus = ["Script include is a script that has rule which has a business rule"]
    
    vectorizer = TfidfVectorizer(ngram_range=(1,2),
                                 tokenizer=lambda doc: ["script", "rule"],
                                 stop_words='english')
    X = vectorizer.fit_transform(corpus)
    print(vectorizer.get_feature_names_out())
    
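    To illustrate the information loss mentioned above: because the lambda ignores its input, every document is tokenized to the same two words, so the learned vocabulary never grows beyond them (a minimal sketch assuming the same constant tokenizer; the second document is made up for illustration):

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer

    vectorizer = TfidfVectorizer(ngram_range=(1, 2),
                                 tokenizer=lambda doc: ["script", "rule"],
                                 stop_words='english')
    X = vectorizer.fit_transform([
        "Script include is a script that has rule which has a business rule",
        "an entirely different document",   # contributes nothing new
    ])
    print(sorted(vectorizer.vocabulary_))  # ['rule', 'script', 'script rule']
    ```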

    【Discussion】:
