【问题标题】:How to groupby multiple dataframes into an aggregated table如何将多个数据框分组到聚合表中
【发布时间】:2021-03-22 03:08:53
【问题描述】:

我使用加载数据

X_train, X_test, y_train, y_test = load_data() 
# y_train here has a string class name inside

然后我使用 sklearn 的编码器将我的数据更改为分类

from sklearn.preprocessing import LabelEncoder

数据示例

y_train = ["tomato","strawberry", "strawberry",  "potato", "strawberry",  "potato", "lemon"]
y_test = ["strawberry", "lemon", "lemon", "lemon"]

encoder = LabelEncoder()
y_train = encoder.fit_transform(y_train)
print(y_train)
>>>[3 2 2 1 2 1 0]
y_test = encoder.transform(y_test)
print(y_test)
>>>[2 0 0 0]

我可以打电话给encoder.inverse_transform(y_test) 来取回字符串

如何建立一个汇总表来显示类似的表格

Label     | y_train_count | y_test_count
----------+---------------+-------------
strawberry| 3             | 1
potato    | 2             | 0
...       | ...           | ...

【问题讨论】:

    标签: python pandas numpy dataframe group-by


    【解决方案1】:

    只需将列表编译成 Pandas 数据框,然后与groupby(用于长报告)和pivot_table(用于广泛报告)聚合。

    编译

    # COMPILE EACH LIST INTO COLUMN WITH INDICATORS
    y_train = ["tomato","strawberry", "strawberry",  "potato", "strawberry",  "potato", "lemon"]
    y_test = ["strawberry", "lemon", "lemon", "lemon"]
    
    categ_df = pd.concat([pd.DataFrame({'source': 'y_train', 'category': y_train}),
                          pd.DataFrame({'source': 'y_test', 'category': y_test})])
    
    categ_df
    #      source    category
    # 0   y_train      tomato
    # 1   y_train  strawberry
    # 2   y_train  strawberry
    # 3   y_train      potato
    # 4   y_train  strawberry
    # 5   y_train      potato
    # 6   y_train       lemon
    # 7    y_test  strawberry
    # 8    y_test       lemon
    # 9    y_test       lemon
    # 10   y_test       lemon
    

    聚合

    all_vals = pd.MultiIndex.from_product(
                    [categ_df['category'].unique(), categ_df['source'].unique()]
               )
                                   
    agg_df = (categ_df.groupby(['category', 'source'])
                      .agg(count=('category', 'size'))
                      .reindex(all_vals, axis='index')
                      .fillna(0)
             )
    agg_df
    #                     count
    # tomato     y_train    1.0
    #            y_test     0.0
    # strawberry y_train    3.0
    #            y_test     1.0
    # potato     y_train    2.0
    #            y_test     0.0
    # lemon      y_train    1.0
    #            y_test     3.0
    
    agg_df = (categ_df.assign(key=1)
                      .pivot_table(index='category', columns='source',
                                   values='key', aggfunc='count')
                      .reindex(categ_df['source'].unique(), 
                               axis='columns')
                      .fillna(0)
             )
    agg_df
    # source      y_test  y_train
    # category
    # lemon          3.0      1.0
    # potato         0.0      2.0
    # strawberry     1.0      3.0
    # tomato         0.0      1.0
    

    【讨论】:

    • 成功了!然而,这对我来说并不简单。能不能告诉我是怎么学的,或者是怎么学的,不要在这里再问了?
    • 正如所解释的,最终您的问题是如何编译和汇总数据(即按组计数)。因此,学习如何使用DataFrameconcat 编译数据以及如何使用groupbypivot_table 进行聚合。另一个reindex 是清理列和索引。阅读文档、教程,尤其是这些pandas user guides,最重要的是边做边学! Pandas 是 StackOverflow 上非常流行的标签。可能已经以不同的形式询问了任何错误/不需要的结果。所以在这里使用搜索过去的 Q !编码愉快!
    猜你喜欢
    • 2021-09-20
    • 2021-01-07
    • 2021-10-18
    • 1970-01-01
    • 1970-01-01
    • 2020-03-01
    • 2021-08-05
    • 1970-01-01
    • 2017-12-27
    相关资源
    最近更新 更多