Automating the creation of sub-dataframes in Pandas (with classes?): answers

【Question title】: Automating the creation of sub-dataframe in Pandas (with classes?)
【Posted】: 2022-01-21 13:37:23
【Question】:

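I have a dataframe from which I want to create several sub-dataframes. Right now I create 3 sub-datasets "by hand", but I would like to automate the process because I need to reuse the code, and in the future there may be more than 3 sub-datasets.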

Let's say this is my dataset:

import pandas as pd
 

data = {'line':['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'c', 'r', 'j', 'j', 'r'],
        'time':['10', '3', '5', '50', '10', '20', '7', '33', '42', '15', '25', '9', '81']}
 
# Create DataFrame
df = pd.DataFrame(data)
 
# Print the output.
print(df)

The output is:

   line time
0     a   10
1     b    3
2     c    5
3     a   50
4     a   10
5     b   20
6     b    7
7     b   33
8     c   42
9     r   15
10    j   25
11    j    9
12    r   81

I need to create 3 sub-datasets, always excluding the values 'r' and 'j' in the 'line' column. This is what I am doing now:

a = df[~df['line'].str.startswith('r') & ~df['line'].str.startswith('j') & df['line'].str.startswith('a') ]

print(a)

  line time
0    a   10
3    a   50
4    a   10

b = df[~df['line'].str.startswith('r') & ~df['line'].str.startswith('j') & df['line'].str.startswith('b') ]

print(b)


  line time
1    b    3
5    b   20
6    b    7
7    b   33

c = df[~df['line'].str.startswith('r') & ~df['line'].str.startswith('j') & df['line'].str.startswith('c') ]

print(c)

  line time
2    c    5
8    c   42

As mentioned, I would like to automate this process. My idea was to create a class, something like this [edited code]:

class Line:
    line_r = df['line'].str.startswith('r')
    line_j = df['line'].str.startswith('j')
    
    def __init__(self, line): 
        self.line= df['line'].str.startswith('')
        
    def get_line(self):
        if df['line'].str.startswith('a'):
            return df[~line_r & ~line_j & (self.line)]
        elif df['line'].str.startswith('b'):
            return df[~line_r & ~line_j & (self.line)]
        elif df['line'].str.startswith('c'):
            return df[~line_r & ~line_j & (self.line)]
        else:
            pass

But when I try to call it, I get an error:

line_a = Line('a')

line_a.get_line()

The error is:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

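I think the problem lies in using a class to produce the output... Besides, the process is not really automated this way: if I need 50 sub-dataframes in the future, I would have to write 49 'elif' branches, which is not great...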

Indeed, if I use a for loop, I get a similar error:

for s in df[~df['line'].str.startswith('r') & ~df['line'].str.startswith('j') & df['line'].str.startswith('s')]:
    if s == a:
        print('Hello')

Error:

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

What do you think? Any suggestions?

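【Question comments】:

What is self.phase supposed to refer to? It does not seem to be well defined. How do you want to split the dataframe in the future? Based on the first character in line?
Sorry, it should be 'self.line', as you suggest. I have edited the post.

Tags: python pandas dataframe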

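【Solution 1】:

You can do this with the code below.

Keep in mind that this approach must be used with great care, because it forcibly overwrites the GLOBAL variables a, b and c.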

for let in ["a", "b", "c"]:
    # Bind each filtered sub-dataframe to a module-level name (a, b, c).
    globals()[let] = df[~df['line'].str.startswith('r') &
                        ~df['line'].str.startswith('j') &
                        df['line'].str.startswith(let)]
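
One way to avoid writing to globals() is to keep the sub-dataframes in a dictionary keyed by the line value. The sketch below is not part of the original answer. It assumes each sub-dataframe corresponds to one exact value in the 'line' column, as in the sample data, rather than to a prefix. The names `excluded`, `kept` and `subframes` are illustrative.

```python
import pandas as pd

data = {'line': ['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'c', 'r', 'j', 'j', 'r'],
        'time': ['10', '3', '5', '50', '10', '20', '7', '33', '42', '15', '25', '9', '81']}
df = pd.DataFrame(data)

# Values of 'line' that should never get a sub-dataframe (taken from the question).
excluded = ['r', 'j']

# Drop the excluded rows once, then split what is left by the value of 'line'.
kept = df[~df['line'].isin(excluded)]
subframes = {key: group for key, group in kept.groupby('line')}

print(sorted(subframes))   # ['a', 'b', 'c']
print(subframes['a'])      # rows 0, 3 and 4 of the original dataframe
```

With this layout, a new value in 'line' gets its own entry automatically, and nothing in the module namespace is overwritten.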

【Comments】:
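
The ValueError in the question comes from `if df['line'].str.startswith('a'):`, where `if` receives a whole boolean Series instead of a single True or False. If a class is still preferred, a minimal sketch could store the requested prefix and use it directly as a filter, so no if/elif chain is needed. The class name `LineFilter` and the `excluded` parameter are hypothetical, chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'line': ['a', 'b', 'c', 'a', 'a', 'b', 'b', 'b', 'c', 'r', 'j', 'j', 'r'],
    'time': ['10', '3', '5', '50', '10', '20', '7', '33', '42', '15', '25', '9', '81'],
})


class LineFilter:
    """Select the rows whose 'line' value starts with a given prefix."""

    def __init__(self, frame, prefix, excluded=('r', 'j')):
        self.frame = frame
        self.prefix = prefix
        self.excluded = tuple(excluded)

    def get_line(self):
        line = self.frame['line']
        # Start from the rows matching the requested prefix...
        mask = line.str.startswith(self.prefix)
        # ...and remove any row that starts with an excluded prefix.
        for bad in self.excluded:
            mask = mask & ~line.str.startswith(bad)
        return self.frame[mask]


line_a = LineFilter(df, 'a')
print(line_a.get_line())
```

The boolean Series is only ever used to index the dataframe, never as the condition of an `if`, which is what avoids the ambiguous truth value error.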