Iterating through a dataframe to pull mins and maxes as well as other columns based off of those values

【Question title】: Iterating through a dataframe to pull mins and maxes as well as other columns based off of those values
【Posted】: 2021-12-18 04:55:29
【Question description】:

I'm very new to Python and to pandas, so any help would be appreciated!

I have a dataframe where the data is structured as below:

Batch_Name	Tag 1	Tag 2
2019-01	1	3
2019-02	2	3

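I want to iterate through the dataframe and pull the following into a new dataframe:

The max of each tag (there are 5 in my full dataframe)

The batch name at that max value

The min of that tag

The batch name at that min value

The mean of that tag

The standard deviation of that tag

I'm having a lot of trouble structuring this in my head, and I run into errors even when I only try to build the dataframe of summary statistics. Below is my first attempt at a function that computes the stats. I have no idea how to pull the batch names at all.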

def tag_stats(df):
    min_col = {}
    min_col_batch = {}
    max_col = {}
    max_col_batch = {}
    std_col = {}
    avg_col = {}
    for col in range(df.shape[3:]):
        max_col[col]= df[col].max()
        min_col[col]= df[col].min()
        std_col[col]= df[col].std()
        avg_col[col]= df[col].avg()

    result = pd.DataFrame([min_col, max_col, std_col, avg_col], index=['min', 'max', 'std', 'avg'])
    return result

【Question comments】:

df.agg([min, max, 'mean', 'std']) or df.describe()?
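
To make the comment above concrete, here is a minimal sketch on the two-row sample from the question. It assumes `Batch_Name` is moved into the index first so that only the tag columns are aggregated, and it passes the reducers as strings rather than the built-in `min` and `max`. As a side note, the posted function most likely fails because `df.shape[3:]` is a tuple slice, which `range` cannot accept, and because pandas Series have `.mean()` but no `.avg()`.

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    'Batch_Name': ['2019-01', '2019-02'],
    'Tag 1': [1, 2],
    'Tag 2': [3, 3],
})

# Index by batch so that only the numeric tag columns are aggregated
tags = df.set_index('Batch_Name')

# One row per statistic, one column per tag
stats = tags.agg(['min', 'max', 'mean', 'std'])
print(stats)

# describe() returns a similar table with count and quartiles added
print(tags.describe())
```

Neither call reports which batch holds the min or max. For that, `idxmin` and `idxmax` on a frame indexed by `Batch_Name` return the index labels.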

Tags: python pandas

【Solution 1】:

Here is an answer based on your code!

import pandas as pd
import numpy as np

#Slightly modified your function
def tag_stats(df, tag_list):
    df = df.set_index('Batch_Name')
    
    data = {
        'tag':[],
        'min':[],
        'max':[],
        'min_batch':[],
        'max_batch':[],
        'std':[],
        'mean':[],
    }
    for tag in tag_list:
        values = df[tag]
        
        data['tag'].append(tag)
        data['min'].append(values.min())
        data['max'].append(values.max())
        data['min_batch'].append(values.idxmin())
        data['max_batch'].append(values.idxmax())
        data['std'].append(values.std())
        data['mean'].append(values.mean())

    result = pd.DataFrame(data)
    
    return result


#Create a df using some random data
np.random.seed(1)

num_batches = 10

df = pd.DataFrame({
    'Batch_Name':['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1':np.random.randint(1,100,num_batches),
    'Tag 2':np.random.randint(1,100,num_batches),
    'Tag 3':np.random.randint(1,100,num_batches),
    'Tag 4':np.random.randint(1,100,num_batches),
    'Tag 5':np.random.randint(1,100,num_batches),
})


#Apply your function
cols = ['Tag 1','Tag 2','Tag 3','Tag 4','Tag 5']
summary_df = tag_stats(df, cols)
print(summary_df)

Output:

     tag  min  max min_batch max_batch        std  mean
0  Tag 1    2   80   batch_9   batch_6  32.200759  38.0
1  Tag 2    7   85   batch_2   batch_7  28.926919  39.9
2  Tag 3   14   97   batch_9   batch_7  33.297314  63.4
3  Tag 4    1   82   batch_7   batch_9  31.060693  37.1
4  Tag 5    4   89   batch_7   batch_1  31.212711  43.3

@It_is_Chris's comment is also great, so here is an answer based on it:

import pandas as pd
import numpy as np

#Create a df using some random data
np.random.seed(1)

num_batches = 10

df = pd.DataFrame({
    'Batch_Name':['batch_{}'.format(i) for i in range(num_batches)],
    'Tag 1':np.random.randint(1,100,num_batches),
    'Tag 2':np.random.randint(1,100,num_batches),
    'Tag 3':np.random.randint(1,100,num_batches),
    'Tag 4':np.random.randint(1,100,num_batches),
    'Tag 5':np.random.randint(1,100,num_batches),
})

#Convert to a long df and index by Batch_Name:
#       index  |    tag   | tag_value
# ------------------------------------
#     batch_0  |   Tag 1  |        38
#     batch_1  |   Tag 1  |        13
#     batch_2  |   Tag 1  |        73
long_df = df.melt(
    id_vars = 'Batch_Name',
    var_name = 'tag',
    value_name = 'tag_value',
).set_index('Batch_Name')

#Groupby tag and aggregate to get columns of interest
summary_df = long_df.groupby('tag').agg(
    max_value = ('tag_value','max'),
    max_batch = ('tag_value','idxmax'),
    min_value = ('tag_value','min'),
    min_batch = ('tag_value','idxmin'),
    mean_value = ('tag_value','mean'),
    std_value = ('tag_value','std'),
).reset_index()

summary_df

Output:

     tag  max_value max_batch  min_value min_batch  mean_value  std_value
0  Tag 1         80   batch_6          2   batch_9        38.0  32.200759
1  Tag 2         85   batch_7          7   batch_2        39.9  28.926919
2  Tag 3         97   batch_7         14   batch_9        63.4  33.297314
3  Tag 4         82   batch_9          1   batch_7        37.1  31.060693
4  Tag 5         89   batch_1          4   batch_7        43.3  31.212711
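
The same summary can also be built without a loop or a melt, by calling the column-wise reducers on the whole frame at once. This is only a sketch of an alternative, not part of the original answer. The three-row sample data is made up for illustration, and it assumes every column other than `Batch_Name` is a numeric tag column.

```python
import pandas as pd

# Hypothetical sample data in the same shape as the question's dataframe
df = pd.DataFrame({
    'Batch_Name': ['2019-01', '2019-02', '2019-03'],
    'Tag 1': [1, 2, 5],
    'Tag 2': [3, 3, 1],
})

# With Batch_Name as the index, idxmin/idxmax return batch names
tags = df.set_index('Batch_Name')

# Each reducer returns a Series indexed by tag name, so they align as columns
summary_df = pd.DataFrame({
    'min': tags.min(),
    'min_batch': tags.idxmin(),
    'max': tags.max(),
    'max_batch': tags.idxmax(),
    'mean': tags.mean(),
    'std': tags.std(),
}).rename_axis('tag').reset_index()

print(summary_df)
```

When several batches tie for the min or max, `idxmin` and `idxmax` report the first one in index order.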

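【Discussion】:

I'm not sure if I'm missing something, but this gives me the following error: TypeError: reduction operation 'argmax' not allowed for this dtype
Do you get that error when you copy and paste the code I wrote above, or only when you apply the code to your own data? Does your df have more columns than Batch_Name, Tag 1, Tag 2, and so on?
I've edited the answer above to include an approach closer to the code you posted. Maybe that will be easier for you to work with.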
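
One plausible cause of the `argmax` error, offered as an assumption because the commenter's data is not shown, is that some tag columns were read in as strings and therefore have `object` dtype. A minimal sketch of checking and converting the dtypes before calling `idxmax` follows. The sample values and the `tag_cols` list are hypothetical.

```python
import pandas as pd

# Hypothetical data: tag values stored as strings, giving object-dtype columns
df = pd.DataFrame({
    'Batch_Name': ['2019-01', '2019-02', '2019-03'],
    'Tag 1': ['1', '2', '5'],
    'Tag 2': ['3', '3', '1'],
})
print(df.dtypes)  # inspect dtypes first; the tag columns show up as object

tag_cols = ['Tag 1', 'Tag 2']
for col in tag_cols:
    # errors='coerce' turns unparseable entries into NaN instead of raising
    df[col] = pd.to_numeric(df[col], errors='coerce')

# With numeric columns, idxmax returns the batch name of each tag's max
max_batches = df.set_index('Batch_Name')[tag_cols].idxmax()
print(max_batches)
```

If the dataframe has extra non-numeric columns, passing only the tag columns, as the `tag_list` argument of `tag_stats` does, keeps them out of the reduction.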