【Question title】: pandas/numpy: Grouping a dataframe with specified number of replicates
【Posted】: 2018-06-22 22:02:47
【Question】:

Here is my dataframe:

import pandas as pd

data1 = [['2017-02-10','orange','jon','small','1','1.1'], ['2017-02-10','orange','jon','medium','1','2.1'], ['2017-02-10','orange','jon','large','1','3.1'], ['2017-02-11','orange','mary','small','2','1.2'], ['2017-02-10','orange','jon','medium','2','2.2'], ['2017-02-10','orange','jon','large','2','3.2'], ['2017-02-10','orange','jon','small','1','7.1'], ['2017-02-11','orange','mary','medium','1','8.1'], ['2017-02-11','orange','mary','large','1','9.1'], ['2017-02-11','orange','mary','small','2','10.1'], ['2017-02-11','orange','mary','medium','2','11.1'], ['2017-02-11','orange','mary','large','2','12.1']]

df = pd.DataFrame(data1, columns=['date', 'fruit', 'name', 'size', 'replicate', 'weight'])
print(df)
          date   fruit  name    size replicate weight
0   2017-02-10  orange   jon   small         1    1.1
1   2017-02-10  orange   jon  medium         1    2.1
2   2017-02-10  orange   jon   large         1    3.1
3   2017-02-11  orange  mary   small         2    1.2
4   2017-02-10  orange   jon  medium         2    2.2
5   2017-02-10  orange   jon   large         2    3.2
6   2017-02-10  orange   jon   small         1    7.1
7   2017-02-11  orange  mary  medium         1    8.1
8   2017-02-11  orange  mary   large         1    9.1
9   2017-02-11  orange  mary   small         2   10.1
10  2017-02-11  orange  mary  medium         2   11.1
11  2017-02-11  orange  mary   large         2   12.1

I need to group this dataframe so that the output has the small, medium, and large values separated by replicate, like this:

val1 = ['2017-02-10', 'orange', 'jon', 'small', '1', '1.1'],
['2017-02-10', 'orange', 'jon', 'medium', '1', '2.1'],
['2017-02-10', 'orange', 'jon', 'large', '1', '3.1'],

val2 = ['2017-02-10', 'orange', 'jon', 'small', '2', '7.1'],
['2017-02-10', 'orange', 'jon', 'medium', '2', '2.2'],
['2017-02-10', 'orange', 'jon', 'large', '2', '3.2'],

val3 = ['2017-02-11', 'orange', 'mary', 'small', '1', '1.2'],
['2017-02-11', 'orange', 'mary', 'medium', '1', '8.1'],
['2017-02-11', 'orange', 'mary', 'large', '1', '9.1'],

val4....

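The format of the output does not matter much; what matters is how to group the data properly. Using non-pandas/numpy methods, I could create a unique identifier from the values of several columns, so that if an instance of 'jon' were out of place it would still be grouped correctly in the output. More specifically, each output group can have a unique identifier built from 'date', 'fruit', and 'name', but it must contain all the corresponding instances of 'small', 'medium', and 'large', along with their entries.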
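For illustration, the non-pandas idea described above might look roughly like the sketch below. It assumes the identifier is the tuple (date, fruit, name, replicate); that choice of key, the sample rows, and the name `grouped` are assumptions made for this example only.

```python
from collections import defaultdict

# Hypothetical sample rows: (date, fruit, name, size, replicate, weight).
# The 'mary' row is deliberately placed between the 'jon' rows.
rows = [
    ('2017-02-10', 'orange', 'jon', 'small', '1', '1.1'),
    ('2017-02-11', 'orange', 'mary', 'small', '2', '10.1'),
    ('2017-02-10', 'orange', 'jon', 'medium', '1', '2.1'),
    ('2017-02-10', 'orange', 'jon', 'large', '1', '3.1'),
]

# Assumption: a group is identified by (date, fruit, name, replicate).
grouped = defaultdict(list)
for date, fruit, name, size, replicate, weight in rows:
    grouped[(date, fruit, name, replicate)].append((size, weight))

# Rows with the same key end up together regardless of their position.
for key, members in grouped.items():
    print(key, members)
```

Because the key is built from column values rather than row positions, an out-of-place row still lands in the right group.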

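【Comments】:

  • Do you just want to extract 3 rows at a time? Also, shouldn't your output include val4?
  • No, I don't want to just extract 3 rows at a time. This example may happen to be organized that way, but not every case will be. And yes, val4 should also be in the output.

Tags: python pandas numpy grouping unique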


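【Solution 1】:

Ordered data

You can use a dictionary to hold a variable number of variables. If the rows are already ordered so that each group is contiguous, here is one way using pd.DataFrame.iloc and itertools.zip_longest: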

from itertools import zip_longest

# calculate when replicate changes
s = df['replicate'] != df['replicate'].shift()

# extract index of True values
idx = s[s].index

# enumerate and slice using integer location
vals = {num: df.iloc[i:j] for num, (i, j) in enumerate(zip_longest(idx, idx[1:]), 1)}

print(vals)

{1:          date   fruit name    size replicate weight
 0  2017-02-10  orange  jon   small         1    1.1
 1  2017-02-10  orange  jon  medium         1    2.1
 2  2017-02-10  orange  jon   large         1    3.1,
 2:          date   fruit name    size replicate weight
 3  2017-02-10  orange  jon   small         2    1.2
 4  2017-02-10  orange  jon  medium         2    2.2
 5  2017-02-10  orange  jon   large         2    3.2,
 3:          date   fruit  name    size replicate weight
 6  2017-02-11  orange  mary   small         1    7.1
 7  2017-02-11  orange  mary  medium         1    8.1
 8  2017-02-11  orange  mary   large         1    9.1,
 4:           date   fruit  name    size replicate weight
 9   2017-02-11  orange  mary   small         2   10.1
 10  2017-02-11  orange  mary  medium         2   11.1
 11  2017-02-11  orange  mary   large         2   12.1}
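To make the boundary detection easier to follow, here is a minimal sketch on a toy column. The frame `demo` and its values are hypothetical, and the end of the last run is written as `len(demo)` instead of the `None` that `zip_longest` supplies; both give the same slices with `iloc`.

```python
import pandas as pd

# Hypothetical toy column; the values are illustrative only.
demo = pd.DataFrame({'replicate': ['1', '1', '2', '2', '2', '1']})

# True wherever the value differs from the row above. The first row is
# always True because shift() places NaN there.
starts = demo['replicate'] != demo['replicate'].shift()

# Index labels where a new run begins.
idx = starts[starts].index.tolist()
print(idx)

# Pair each start with the next start; the last run ends at len(demo).
bounds = list(zip(idx, idx[1:] + [len(demo)]))
print(bounds)

# Slice by integer position, numbering the chunks from 1.
chunks = {num: demo.iloc[i:j] for num, (i, j) in enumerate(bounds, 1)}
```

Note that `idx` holds index labels, which coincide with integer positions only when the frame has a default RangeIndex; after filtering or sorting, calling `reset_index(drop=True)` first would be one way to keep `iloc` slicing valid.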

Unordered data

You can still use a dictionary; this time pd.DataFrame.groupby is convenient:

groups = df.groupby(['date', 'fruit', 'name', 'replicate'])

vals = {i: v for i, (_, v) in enumerate(groups, 1)}

print(vals)

{1:          date   fruit name    size replicate weight
0  2017-02-10  orange  jon   small         1    1.1
1  2017-02-10  orange  jon  medium         1    2.1
2  2017-02-10  orange  jon   large         1    3.1,
 2:          date   fruit name    size replicate weight
3  2017-02-10  orange  jon   small         2    1.2
4  2017-02-10  orange  jon  medium         2    2.2
5  2017-02-10  orange  jon   large         2    3.2,
 3:          date   fruit  name    size replicate weight
6  2017-02-11  orange  mary   small         1    7.1
7  2017-02-11  orange  mary  medium         1    8.1
8  2017-02-11  orange  mary   large         1    9.1,
 4:           date   fruit  name    size replicate weight
9   2017-02-11  orange  mary   small         2   10.1
10  2017-02-11  orange  mary  medium         2   11.1
11  2017-02-11  orange  mary   large         2   12.1}
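The question also requires every group to contain 'small', 'medium', and 'large'. As a rough sketch under the same grouping assumption (keys 'date', 'fruit', 'name', 'replicate'), one could flag incomplete groups as below. The frame `sample` and the names `required` and `complete` are hypothetical.

```python
import pandas as pd

# Hypothetical sample in which the 'mary' group is missing its 'large' row.
sample = pd.DataFrame(
    [
        ['2017-02-10', 'orange', 'jon', 'small', '1', '1.1'],
        ['2017-02-11', 'orange', 'mary', 'small', '1', '7.1'],
        ['2017-02-10', 'orange', 'jon', 'large', '1', '3.1'],
        ['2017-02-10', 'orange', 'jon', 'medium', '1', '2.1'],
        ['2017-02-11', 'orange', 'mary', 'medium', '1', '8.1'],
    ],
    columns=['date', 'fruit', 'name', 'size', 'replicate', 'weight'],
)

required = {'small', 'medium', 'large'}
keys = ['date', 'fruit', 'name', 'replicate']

# For each group, test whether its set of sizes equals the required set.
complete = sample.groupby(keys)['size'].agg(lambda s: set(s) == required)
print(complete)
```

Groups flagged False could then be dropped or reported before building the dictionary of sub-frames.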

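【Comments】:

  • While this technically works, it relies strictly on the row order in this example rather than on the structure of the dataframe. For instance, if I move a row elsewhere, this code no longer works. I will change the example to reflect this.... Sorry, and thanks for the help!
  • Can you specify the exact criteria used for grouping (e.g. which fields)? It would help if your example were less trivial, at least for future visitors.
  • @Rob, I have also updated the answer for the unordered case, under some assumptions.
  • Thank you very much for your effort and patience, jpp.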