【Question title】:Error when trying to create n-grams from a DataFrame column
【Posted】:2019-07-08 10:41:59
【Question description】:

Given a DataFrame with a single column, Text:

      Text
0     chest  pain  nstemi  this  84-year  old  man  present  on  26/5  with  
      chest  pain  associate  with  profuse  sweating  and  nausea

I want to create two new columns containing the unigrams and bigrams generated from that DataFrame.

This is the method I use to generate the n-grams:

    import re

    def generate_ngrams(self, s, n):
        # Convert to lowercase
        s = s.lower()

        # Replace all non-alphanumeric characters with spaces
        s = re.sub(r'[^a-zA-Z0-9\s]', ' ', s)

        # Break the sentence into tokens and remove empty tokens
        tokens = [token for token in s.split(" ") if token != ""]

        # Use zip to slide a window of n tokens over the list,
        # then concatenate each window into an n-gram and return the list
        ngrams = zip(*[tokens[i:] for i in range(n)])
        return [" ".join(ngram) for ngram in ngrams]

And this is how I try to populate my DataFrame:

    for index, row in featuresDF.iterrows():
        featuresDF.at[index, '1-gram'] = generate_ngrams(infoDF.at[index, 'Text'], 1)
        featuresDF.at[index, '2-gram'] = generate_ngrams(infoDF.at[index, 'Text'], 2)

When I run it, I get the following error: ValueError: setting an array element with a sequence.

Here is the traceback:

Traceback (most recent call last):

  File "<ipython-input-64-e014e2e1c7e2>", line 3, in <module>
    featuresDF.at[index, '1-gram'] = featureExtraction.generate_ngrams(infoDF.at[index, 'Text'], 1)

  File "C:\Users\as\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 2287, in __setitem__
    self.obj._set_value(*key, takeable=self._takeable)

  File "C:\Users\as\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 2815, in _set_value
    engine.set_value(series._values, index, value)

  File "pandas/_libs/index.pyx", line 95, in pandas._libs.index.IndexEngine.set_value

  File "pandas/_libs/index.pyx", line 106, in pandas._libs.index.IndexEngine.set_value

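I understand the problem occurs when I assign the unigrams and bigrams to the DataFrame, right? But I don't know how to fix it. Thanks!

【Question comments】:

    Tags: python arrays pandas dataframe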


    【Solution 1】:

    generate_ngrams() should return a string, but it returns a list like this:

    ['chest', 'pain', .....] 
    

    You could convert the list to a comma-separated string before returning it, for example:

    chest,pain, .....
    

    by adding these lines:

    ngramList = [" ".join(ngram) for ngram in ngrams]        
    return ','.join(ngramList)
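
To show the suggestion above end to end, here is a minimal sketch that returns one comma-separated string per row and assigns whole columns at once. The names `info_df` and `features_df` and the sample text are assumptions for illustration, and the function is written as a plain function rather than a method. It also uses `split()` instead of `split(" ")`, which is a small change from the original tokenizer.

```python
import re

import pandas as pd


def generate_ngrams(s, n):
    """Return the n-grams of `s` as one comma-separated string."""
    s = s.lower()
    # Replace every non-alphanumeric, non-whitespace character with a space
    s = re.sub(r"[^a-zA-Z0-9\s]", " ", s)
    # split() with no argument drops empty tokens and handles any whitespace
    tokens = s.split()
    ngrams = zip(*[tokens[i:] for i in range(n)])
    return ",".join(" ".join(ngram) for ngram in ngrams)


# Hypothetical sample data standing in for the asker's DataFrames
info_df = pd.DataFrame({"Text": ["chest pain nstemi this 84-year old man"]})
features_df = pd.DataFrame(index=info_df.index)

# Each cell now receives a plain string, so no sequence is written into a cell
features_df["1-gram"] = info_df["Text"].apply(generate_ngrams, n=1)
features_df["2-gram"] = info_df["Text"].apply(generate_ngrams, n=2)
print(features_df.loc[0, "2-gram"])
```

Assigning a full column through `apply` also avoids the row-by-row `iterrows()` loop.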
    

    Alternatively, you can use CountVectorizer to find the n-grams:

    from sklearn.feature_extraction.text import CountVectorizer
    vectorizer = CountVectorizer(ngram_range=(2, 2))  # (2, 2) means bigrams only; (1, 1) means unigrams only
    corpus = ['the boy is gone !']
    X = vectorizer.fit_transform(corpus)
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() on 1.0+
    print(vectorizer.get_feature_names_out())  # prints the array of n-grams found
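
The CountVectorizer call above yields the vocabulary of the whole corpus, not the n-grams of each row. If per-row n-grams are what the new columns should hold, one possible sketch is to use the vectorizer's `build_analyzer()` on each string. The DataFrame names and sample rows are assumptions for illustration. Note that CountVectorizer's default token pattern lowercases the text and drops single-character tokens, so the result can differ slightly from the hand-written tokenizer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample data standing in for the asker's DataFrame
info_df = pd.DataFrame({"Text": ["the boy is gone !", "chest pain nstemi"]})

# build_analyzer() returns a callable that preprocesses, tokenizes and
# builds the n-grams of a single string, without fitting a vocabulary.
bigram_analyzer = CountVectorizer(ngram_range=(2, 2)).build_analyzer()

features_df = pd.DataFrame(index=info_df.index)
# Join each row's bigrams into one string so every cell holds a scalar
features_df["2-gram"] = info_df["Text"].apply(
    lambda text: ",".join(bigram_analyzer(text))
)
print(features_df)
```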
    

    【Comments】:

      【Solution 2】:

      You are returning a list with return [" ".join(ngram) for ngram in ngrams].

      Instead of returning a list, return a single string:

      return ",".join(" ".join(ngram) for ngram in ngrams)

      If you really do want to store a list in each cell, see "ValueError: setting an array element with a sequence." for Pandas.
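
For the case where each cell should hold an actual list, one way to avoid the cell-by-cell assignment is to build the whole column with `apply`, which produces an object-dtype column. This is a minimal sketch under assumed names (`ngram_list`, `info_df`, `features_df`) with a simplified tokenizer, not the approach from the linked question.

```python
import pandas as pd


def ngram_list(text, n):
    """Return the n-grams of `text` as a list of strings (simplified tokenizer)."""
    tokens = text.lower().split()
    return [" ".join(gram) for gram in zip(*[tokens[i:] for i in range(n)])]


# Hypothetical sample data standing in for the asker's DataFrames
info_df = pd.DataFrame({"Text": ["chest pain nstemi", "profuse sweating and nausea"]})
features_df = pd.DataFrame(index=info_df.index)

# apply() returns a Series whose values are lists, stored with dtype object
features_df["1-gram"] = info_df["Text"].apply(ngram_list, n=1)
features_df["2-gram"] = info_df["Text"].apply(ngram_list, n=2)
print(features_df["2-gram"].tolist())
```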

      【Comments】:
