Grouping and ungrouping based on a column: answers

【Question title】: Grouping and ungrouping based on a column
【Posted】: 2016-07-20 22:26:20
【Question】:

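My goal is to group the rows of a CSV file by the value of a column, and also to perform the inverse operation. As an example, I would like to be able to convert back and forth between these two formats: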

uniqueId, groupId, feature_1, feature_2
1, 100, text of 1, 10
2, 100, some text of 2, 20
3, 200, text of 3, 30
4, 200, more text of 4, 40
5, 100, another text of 5, 50

Grouped by groupId:

uniqueId, groupId, feature_1, feature_2
1|2|5, 100, text of 1|some text of 2|another text of 5, 10|20|50
3|4, 200, text of 3|more text of 4, 30|40

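The delimiter (here |) is assumed not to occur anywhere in the data.

I am trying to use pandas to perform this transformation. So far, my code can access the cells of the rows grouped by groupId, but I do not know how to populate the new DataFrame.

How can I complete my approach so that it produces the desired new df?

What would the inverse method look like, converting the new df back to the original?

If R is a better tool for this job, I am also open to suggestions in R.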

import pandas as pd

def getGroupedDataFrame(df, groupByField, delimiter):
    '''Create a df with the rows grouped on groupByField, values separated by delimiter.'''
    groupIds = set(df[groupByField])
    df_copy = pd.DataFrame(index=groupIds, columns=df.columns)
    # iterate over the different groupIds
    for groupId in groupIds:
        groupRows = df.loc[df[groupByField] == groupId]
        # for all rows of the groupId
        for index, row in groupRows.iterrows():
            # for all columns in the df
            for column in df.columns:
                print(row[column])
                # this prints the value of the cell
                # here, append row[column] to its cell in the df_copy row of groupId, separated by delimiter

【Question comments】:

Tags: python r csv pandas

【Solution 1】:

To perform the grouping, you can use groupby on 'groupId', and then, within each group, join each column using your chosen delimiter:

def group_delim(grp, delim='|'):
    """Join each column within a group using the given delimiter."""
    return grp.apply(lambda col: delim.join(col))

# Make sure the DataFrame consists of strings, then apply grouping function.
grouped = df.astype(str).groupby('groupId').apply(group_delim)

# Drop the grouped groupId column, and replace it with the index groupId.
grouped = grouped.drop('groupId', axis=1).reset_index()

Grouped output:

  groupId uniqueId                                   feature_1 feature_2
0     100    1|2|5  text of 1|some text of 2|another text of 5  10|20|50
1     200      3|4                    text of 3|more text of 4     30|40
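As a hedged alternative to the groupby(...).apply(...) call above, the same grouping can be sketched with groupby(...).agg(...), which joins each column directly and keeps groupId as a regular column, so the drop/reset_index step is not needed. The helper name group_rows and the inline sample DataFrame are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'uniqueId': [1, 2, 3, 4, 5],
    'groupId': [100, 100, 200, 200, 100],
    'feature_1': ['text of 1', 'some text of 2', 'text of 3',
                  'more text of 4', 'another text of 5'],
    'feature_2': [10, 20, 30, 40, 50],
})

def group_rows(frame, key, delim='|'):
    """Hypothetical helper: one row per key, other columns joined by delim."""
    as_text = frame.astype(str)  # str.join only accepts strings
    # sort=False keeps the groups in order of first appearance.
    return as_text.groupby(key, as_index=False, sort=False).agg(delim.join)

grouped = group_rows(df, 'groupId')
print(grouped)
```

Because every value is converted to a string first, the grouped frame holds text in all columns, including groupId.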

The inverse process follows a similar idea, but since each row is already its own group, you can use a regular apply with no groupby:

def ungroup_delim(col, delim='|'):
    """Split elements in a column by the given delimiter, stacking columnwise"""
    return col.str.split(delim, expand=True).stack()

# Apply the ungrouping function, and forward fill elements that aren't grouped.
ungrouped = grouped.apply(ungroup_delim).ffill()

# Drop the unwieldy altered index for a new one.
ungrouped = ungrouped.reset_index(drop=True)

Ungrouping recovers the original data (rows ordered by groupId, all values as strings):

  groupId uniqueId          feature_1 feature_2
0     100        1          text of 1        10
1     100        2     some text of 2        20
2     100        5  another text of 5        50
3     200        3          text of 3        30
4     200        4     more text of 4        40
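For comparison, here is a minimal sketch of another way to ungroup that does not rely on stack and forward fill. It assumes pandas 1.3 or later (where DataFrame.explode accepts a list of columns), a single-character delimiter, and that every delimited cell in a row holds the same number of items. The helper name ungroup_rows and the hand-built grouped frame are illustrative assumptions.

```python
import pandas as pd

# Grouped frame written out by hand, matching the grouped output shown earlier.
grouped = pd.DataFrame({
    'groupId': ['100', '200'],
    'uniqueId': ['1|2|5', '3|4'],
    'feature_1': ['text of 1|some text of 2|another text of 5',
                  'text of 3|more text of 4'],
    'feature_2': ['10|20|50', '30|40'],
})

def ungroup_rows(frame, key, delim='|'):
    """Hypothetical helper: split every non-key column and emit one row per item."""
    value_cols = [col for col in frame.columns if col != key]
    split = frame.copy()
    for col in value_cols:
        # Each cell becomes a list of strings, e.g. '3|4' -> ['3', '4'].
        split[col] = split[col].str.split(delim)
    # Explode all list columns together; the key is repeated for each new row.
    return split.explode(value_cols, ignore_index=True)

ungrouped = ungroup_rows(grouped, 'groupId')
print(ungrouped)
```

The values come back as strings, so numeric columns would still need an explicit conversion (for example with astype) to match the original dtypes.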

To use a different delimiter, pass delim as a keyword argument to apply:

foo.apply(group_delim, delim=';')
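The name foo in the line above reads as a placeholder for the object being applied to. As a small sketch of how the keyword is forwarded, the example below calls apply on a groupby object with a semicolon delimiter. Selecting the value columns explicitly (value_cols, an assumption made here) keeps the grouping column out of the function input, and the sample data is trimmed from the question.

```python
import pandas as pd

def group_delim(grp, delim='|'):
    """Join each column within a group using the given delimiter."""
    return grp.apply(lambda col: delim.join(col))

# Trimmed sample data from the question, converted to strings up front.
df = pd.DataFrame({
    'uniqueId': [1, 2, 3, 4, 5],
    'groupId': [100, 100, 200, 200, 100],
    'feature_1': ['text of 1', 'some text of 2', 'text of 3',
                  'more text of 4', 'another text of 5'],
}).astype(str)

value_cols = ['uniqueId', 'feature_1']

# Extra keyword arguments given to apply are passed through to group_delim.
semicolon = df.groupby('groupId')[value_cols].apply(group_delim, delim=';')
print(semicolon)
```

Here groupId ends up as the index of the result, so a reset_index() call would turn it back into a column.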

As a side note, iterating over a DataFrame row by row is generally very slow. Whenever possible, prefer vectorized methods like the ones I used above.
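To make the contrast concrete, the sketch below writes the grouping twice: once with an explicit iterrows loop, in the spirit of the code in the question, and once with a single groupby call, then checks that both give the same table. The function names group_with_loop and group_vectorized are illustrative assumptions, and no timings are claimed here; the point is only that the loop handles every cell in Python while the second version hands the iteration to pandas.

```python
from collections import defaultdict

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    'uniqueId': [1, 2, 3, 4, 5],
    'groupId': [100, 100, 200, 200, 100],
    'feature_1': ['text of 1', 'some text of 2', 'text of 3',
                  'more text of 4', 'another text of 5'],
    'feature_2': [10, 20, 30, 40, 50],
})

def group_with_loop(frame, key, delim='|'):
    """Row-by-row version: collects every cell in Python lists."""
    cells = defaultdict(lambda: defaultdict(list))  # group -> column -> items
    for _, row in frame.iterrows():
        for column in frame.columns:
            if column != key:
                cells[str(row[key])][column].append(str(row[column]))
    records = [
        {key: group, **{col: delim.join(items) for col, items in cols.items()}}
        for group, cols in cells.items()
    ]
    return pd.DataFrame(records)

def group_vectorized(frame, key, delim='|'):
    """Same result, with the per-group work delegated to groupby."""
    return frame.astype(str).groupby(key, as_index=False, sort=False).agg(delim.join)

looped = group_with_loop(df, 'groupId')
vectorized = group_vectorized(df, 'groupId')
same = looped.to_numpy().tolist() == vectorized.to_numpy().tolist()
print(same)
```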

【Comments】:

I noticed that in older versions of pandas, expand is not a recognized keyword argument in col.str.split(delim, expand=True). A workaround is described at stackoverflow.com/a/35567326/3229995

【Solution 2】:

A solution in R:

I define the initial data frame first (for clarity):

df <- data.frame(uniqueID = c(1,2,3,4,5),
           groupID = c(100,100,200,200,100),
           feature_1 = c("text of 1","some text of 2",
                       "text of 3", "more text of 4",
                       "another text of 5"),
           feature_2 = c(10,20,30,40,50), stringsAsFactors = F)

Getting the grouped data frame:

# Group and summarise using dplyr
library(dplyr)
grouped <- df %>% group_by(groupID) %>% summarise_each(funs(paste(.,collapse = "|")))

Output:

grouped

 groupID uniqueID                                  feature_1 feature_2
    (dbl)    (chr)                                      (chr)     (chr)
1     100    1|2|5 text of 1|some text of 2|another text of 5  10|20|50
2     200      3|4                   text of 3|more text of 4     30|40

Ungrouping to get back the original data frame:

library(stringr)
apply(grouped, 1, function(x)  {

        temp <- data.frame(str_split(x, '\\|'), stringsAsFactors = F)
        colnames(temp) <- names(x)
        temp

        }) %>%
        bind_rows()

Output:

  groupID uniqueID         feature_1 feature_2
    (chr)    (chr)             (chr)     (chr)
1     100        1         text of 1        10
2     100        2    some text of 2        20
3     100        5 another text of 5        50
4     200        3         text of 3        30
5     200        4    more text of 4        40

【Comments】: