Pandas empty dataframe behaving strangely: answers

【Question title】: Pandas empty dataframe behaving strangely
【Posted】: 2021-07-31 19:08:11
【Question】:

def opinion_df_gen(preprocessor):
    op_files = {'POSITIVE': Path('clustering/resources/positive_words.txt').resolve(),
                'NEGATIVE':  Path('clustering/resources/negative_words.txt').resolve()}
    df_ = pd.DataFrame()
    for i, (sentiment, filepath) in enumerate(op_files.items()):
        print(filepath)
        word_set = preprocessor.lemmatize_words(file_path=filepath)
        print(len(word_set))
        df_['tokens'] = list(word_set)

preprocessor is a custom class, and preprocessor.lemmatize_words returns a set of words/tokens. The problem is that df_['tokens'] = list(word_set) raises an error.

Traceback (most recent call last):
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 184, in <module>
    main()
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 155, in main
    to_label_df, token_col, target_col = opinion_df_gen(preprocessor)
  File "E:/Anoop/cx-index-score/sentiment_analysis/main.py", line 35, in opinion_df_gen
    df_['tokens'] = list(word_set)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3044, in __setitem__
    self._set_item(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3120, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\frame.py", line 3768, in _sanitize_column
    value = sanitize_index(value, self.index)
  File "C:\ProgramData\Anaconda3\envs\cx_env_\lib\site-packages\pandas\core\internals\construction.py", line 748, in sanitize_index
    "Length of values "
ValueError: Length of values (4085) does not match length of index (1782)

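But if you look at the code, the dataframe is empty. I don't understand why it says the index has length 1782. As far as I can tell, this code should work. The pandas version is 1.1.5 and the Python version is 3.8.0.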

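【Comments】:

I don't think the problem is in the first iteration. You need to make sure all the sets have the same length, or handle the mismatch yourself.
Oh, now I understand. I should put df_ = pd.DataFrame() inside the for loop.
That way you create a separate DataFrame for each set. Is that what you want?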

Tags: python pandas dataframe indexing

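【Solution 1】:

The problem is that you assign to the same column on every pass through the for loop. There are two iterations: the first for the positive words, the second for the negative words. After the first iteration, df_['tokens'] holds the word_set for positive_words, and the DataFrame's index takes on the length of that list. Going by the error message, that length is 1782. When the second iteration runs for negative_words, its word_set has 4085 entries, which does not match the 1782-row index that df_ now has. Note that df_ is no longer empty at that point, hence the error.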
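As a minimal sketch of the behaviour described above, the snippet below first reproduces the length mismatch and then shows one possible way around it. The tiny word lists stand in for the output of preprocessor.lemmatize_words, and the 'sentiment' column is an assumption about what the final frame should look like; neither comes from the original code.

```python
import pandas as pd

# Hypothetical stand-ins for the two lemmatized word sets.
word_sets = {
    "POSITIVE": ["good", "great", "happy"],
    "NEGATIVE": ["bad", "awful", "sad", "poor", "ugly"],
}

# Reproduce the error: the first assignment gives the empty frame a 3-row index,
# so the second assignment of 5 values no longer fits.
df_ = pd.DataFrame()
df_["tokens"] = word_sets["POSITIVE"]
try:
    df_["tokens"] = word_sets["NEGATIVE"]
except ValueError as exc:
    error_message = str(exc)

# One possible fix: build a separate frame per sentiment, then stack them.
# The 'sentiment' label column is an assumed addition for illustration.
frames = [
    pd.DataFrame({"tokens": words, "sentiment": sentiment})
    for sentiment, words in word_sets.items()
]
labelled = pd.concat(frames, ignore_index=True)
```

Building one frame per word set and concatenating them keeps lists of different lengths from ever sharing an index, which is the situation that triggers the ValueError.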

【Comments】: