【Question title】: Grouping a row into multiple groups with pandas
【Posted】: 2016-03-14 05:01:28
【Question】:

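I have a set of sentences that I want to group so that all rows in a group share a particular word. However, a sentence can belong to many groups, because it contains many words.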

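So in the example below, there should be groups like these:

  • A "temperature" group containing all rows (0, 1, 2, 3 and 4)
  • A "freezes" group containing rows 2 and 4
  • A "the" group containing rows 0, 1, 2 and 3
  • A "metal" group containing only row 0
  • A group for every other word in the data set
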
import pandas as pd

# An example data set
df = pd.DataFrame({"sentences": [
    "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
    "the temperature at which a liquid boils",
    "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
    "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
    "a system for measuring temperature in which water freezes at 32º and boils at 212º"
]})

# Create a new series which is a list of words in each "sentences" column
df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))

# Try to group by this new column 
df.groupby('words').count()

# TypeError: unhashable type: 'list'

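However, my code throws the TypeError shown above. Since my task is a bit complicated, I understand it may take more than just calling groupby(). Can someone help me build these word groups with pandas?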

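Edit: After fixing the error by returning tuple(sentence.split()) (thanks to ethan-furman), I tried printing the result, but it doesn't seem to do anything useful. I think it may just be putting each row in its own group: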

print(df.groupby('words').count())

# sentences    5
# dtype: int64

【Question comments】:

    Tags: python python-3.x pandas group-by


    【Answer 1】:

    To fix your TypeError, you can change your lambda to

    lambda sentence: tuple(sentence.split())
    

    This returns a tuple instead of a list (and tuples are hashable).
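
A minimal sketch of why the conversion removes the error but not the underlying problem. The two-row frame `toy` is hypothetical and not part of the question's data. Lists cannot be hashed, tuples can, but grouping on the whole tuple only puts rows with identical word sequences together, so each distinct sentence ends up in its own group.

```python
import pandas as pd

# Hypothetical toy data, not the question's data set
toy = pd.DataFrame({"sentences": ["the water boils", "the water freezes"]})

# A list cannot be used as a group key because it is unhashable
try:
    hash(["the", "water"])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Tuples are hashable, so groupby accepts them as keys
toy["words"] = toy["sentences"].apply(lambda sentence: tuple(sentence.split()))

# Each distinct tuple is one key, so every row lands in its own group
n_groups = toy.groupby("words").ngroups
print(list_is_hashable, n_groups)
```

This matches the behaviour described in the question's edit: the rows share individual words, but no two rows share the full tuple.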

    【Comments】:

    • This does fix the error, but I still can't get the right result (see the edit)
    【Answer 2】:

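    You can use a set so that each word is unique. First, we need the collection of all words across all sentences. To do this, we initialize words as an empty set and then use a list comprehension to add each lowercased word from each sentence (after splitting the sentence).

    Next, we use a dictionary comprehension to build a dictionary keyed by each word in the set. Each value is a dataframe containing every sentence that contains that word. These are obtained by grouping with groupby(df.sentences.str.contains(word, case=False)) and then taking the group for which the condition is True.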

    words = set()
    _ = [words.add(word.lower()) for sentence in df.sentences for word in sentence.split()]
    
    word_dict = {word: df.groupby(df.sentences.str.contains(word, case=False)).get_group(True) 
                 for word in words}
    
    >>> word_dict['temperature']
                                               sentences
    0  two long pieces of metal fixed together, each ...
    1            the temperature at which a liquid boils
    2  a system for measuring temperature that is par...
    3  a unit for measuring temperature. Measurements...
    4  a system for measuring temperature in which wa...
    
    >>> word_dict['freezes']
                                               sentences
    2  a system for measuring temperature that is par...
    4  a system for measuring temperature in which wa...
    
    >>> words
    {'0',
     '100',
     '212\xc2\xba',
     '32\xc2\xba',
     'a',
     'amount',
     'and',
     'are',
     'as',
     'at',
     'bends',
     ...
    

    To get a dictionary of index values for each word:

    >>> {word: word_dict[word].index.tolist() for word in word_dict}
    {'0': [2],
     '100': [2],
     '212\xc2\xba': [4],
     '32\xc2\xba': [4],
     'a': [0, 1, 2, 3, 4],
     'amount': [0],
     'and': [2, 4],
     'are': [0, 3],
     'as': [2, 3, 4],
     'at': [0, 1, 2, 3, 4],
     'bends': [0],
     'boils': [1, 2, 4],
     'both': [0],
     'by': [3],
     'degrees': [2],
     'different': [0],
     'each': [0],
     'expressed': [3],
     'fixed': [0],
     'followed': [3],
     'for': [2, 3, 4],
     'freezes': [2, 4],
     ...
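
Note that `str.contains` performs a pattern search inside the sentence rather than comparing whole words, which is why 'a' and 'as' above match rows where they only appear inside longer words. A possible alternative, sketched here under the assumption that pandas 0.25 or later is available (for `Series.explode`), is to split each sentence into tokens and group the row labels by token. The three-sentence frame and the names `tokens` and `rows_by_word` are illustrative only.

```python
import pandas as pd

# Hypothetical toy data, shortened from the question's sentences
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a system in which water freezes at 32 degrees",
    "water freezes and boils",
]})

# One row per (row label, token); the original row label is repeated
tokens = df["sentences"].str.lower().str.split().explode()

# Collect the row labels for each token, keeping each label once per word
rows_by_word = (
    tokens.reset_index()          # columns: "index" (row label), "sentences" (token)
          .drop_duplicates()
          .groupby("sentences")["index"]
          .agg(list)
)

print(rows_by_word["freezes"])
print(rows_by_word["a"])

# For comparison: substring matching also hits "water" and "and" in row 2
substring_hits = df["sentences"].str.contains("a").tolist()
```

A group's rows can then be retrieved with `df.loc[rows_by_word["freezes"]]`. Punctuation stays attached to tokens in this sketch, so "temperature." and "temperature" would count as different words unless the text is cleaned first.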
    

    Or a boolean indicator matrix:

    >>> [df.sentences.str.contains(word, case=False).tolist() for word in word_dict]
    [[False, False, True, False, True],
     [False, False, False, True, False],
     [True, False, False, False, False],
     [False, False, True, False, False],
     ...
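
The nested list above has no labels, so it is hard to tell which row of the matrix belongs to which word. As a sketch of a labelled variant, `Series.str.get_dummies` can build the indicator table directly from whole tokens. The toy frame and the name `indicator` are assumptions for illustration, and splitting on a single space is a simplification.

```python
import pandas as pd

# Hypothetical toy data, shortened from the question's sentences
df = pd.DataFrame({"sentences": [
    "the temperature at which a liquid boils",
    "a system in which water freezes at 32 degrees",
    "water freezes and boils",
]})

# One column per distinct token, one row per sentence; True where the token occurs
indicator = df["sentences"].str.lower().str.get_dummies(sep=" ").astype(bool)

# Row labels of the sentences that contain a given word
rows_with_boils = indicator.index[indicator["boils"]].tolist()
print(indicator["freezes"].tolist())
print(rows_with_boils)
```

Selecting a group is then a boolean mask on the original frame, for example `df[indicator["freezes"]]`.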
    

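    【Comments】:

      【Answer 3】:

      My current solution uses pandas' MultiIndex feature. I'm sure it could be improved by using numpy more effectively, but I believe it will perform better than the other pure-Python answers: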

      import pandas as pd
      import numpy as np
      
      # An example data set
      df = pd.DataFrame({"sentences": [
          "two long pieces of metal fixed together, each of which bends a different amount when they are both heated to the same temperature",
          "the temperature at which a liquid boils",
          "a system for measuring temperature that is part of the metric system, in which water freezes at 0 degrees and boils at 100 degrees",
          "a unit for measuring temperature. Measurements are often expressed as a number followed by the symbol °",
          "a system for measuring temperature in which water freezes at 32º and boils at 212º"
      ]})
      
      # Create a new series which is a list of words in each "sentences" column
      df['words'] = df['sentences'].apply(lambda sentence: sentence.split(" "))
      
      # This is all the words in the dataset. Each word will be its own index (level of the MultiIndex)
      names = np.unique(df['words'].sum())
      
      # Create an array of tuples, one tuple for each row of data
      # Each tuple contains True if the row has that word in it, and False if it does not
      values = df['words'].map(
          lambda words: np.vectorize(
              lambda word: word in words)(names)
      )
      
      # Make a MultiIndex
      index = pd.MultiIndex.from_tuples(values, names=names)
      
      # Add the MultiIndex without creating a new data frame
      df.set_index(index, inplace=True)
      
      # Find all the rows that have the word 'temperature'
      xs = df.xs(True, level='temperature')
      
      print(xs.to_string(index=False))
      

      【Comments】:

      • Does this solution still work? Have you found anything better?