How to preserve all unique combinations of values when merging columns?

【Question title】: How to preserve all unique combinations of values when merging columns?
【Posted】: 2018-03-12 22:13:10
【Question】:

Given the table

| A | B | C | C | C | D | D |
| 1 | 0 | x | y | z | 8 | 9 |
| 2 | 4 | x | b |   |   |   |

what is the best way to return

| A | B | C | D |
| 1 | 0 | x | 8 |
| 1 | 0 | y | 8 |
| 1 | 0 | z | 8 |
| 1 | 0 | x | 9 |
| 1 | 0 | y | 9 |
| 1 | 0 | z | 9 |
| 2 | 4 | x |   |
| 2 | 4 | b |   |

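I'm using pandas read_csv to pull this from a CSV... I'm not sure whether I can handle it there, or with SQL, or with Python dicts.

I've searched hard and can't find an answer.

(I'm a newbie, so I may be missing something basic...)

Edit: this needs to accommodate n rows.

【Question comments】:

Tags: python pandas dataframe merge duplicates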

【Solution 1】:

import pandas as pd

df = pd.DataFrame([[1,0,'x','y','z',8,9]], columns=list('ABCCCDD'))

result = pd.MultiIndex.from_product(
             [grp for key, grp in df.T.groupby(level=0)[0]]).to_frame(index=False)
print(result)

yields

   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9

If your DataFrame has more than one row:

import numpy as np
import pandas as pd

def row_to_arrays(row, idx):
    """
    Split a row into a list of component arrays.
    idx specifies the indices at which we want to split the row
    """
    # Use row[1:] because the first item in each row is the index 
    # (which we want to ignore)
    result = np.split(row[1:], idx)
    # Filter out empty strings
    result = [arr[arr != ''] for arr in result]
    # Filter out empty arrays
    result = [arr for arr in result if len(arr)]
    return result

def arrays_to_dataframe(arrays):
    """
    Convert list of arrays to product DataFrame
    """
    return pd.MultiIndex.from_product(arrays).to_frame(index=False) 

def df_to_row_product(df):
    # find the indices at which to cut each row
    idx = pd.DataFrame(df.columns).groupby(0)[0].agg(lambda x: x.index[0])[1:]
    data = [arrays_to_dataframe(row_to_arrays(row, idx))
            for row in df.itertuples()]
    result = pd.concat(data, ignore_index=True).fillna('')
    return result

df = pd.DataFrame([[1,0,'x','y','z',8,9],
                   [2,4,'x','b','','','']], columns=list('ABCCCDD'))

print(df_to_row_product(df))

yields

   0  1  2  3
0  1  0  x  8
1  1  0  x  9
2  1  0  y  8
3  1  0  y  9
4  1  0  z  8
5  1  0  z  9
6  2  4  x   
7  2  4  b

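【Comments】:

You could do pd.MultiIndex.from_product([g.values[0] for k, g in df.groupby(level=0, axis=1)]).to_frame(index=False) without the outer transpose.
So this is interesting, but we lose the header names. Also, I still can't get it to work on my other dataset... the result shows only 3 rows and 2 columns from a dataset with 3 unique columns and 61 rows. ... Could you provide a multi-row version to make troubleshooting easier?
I think I see the problem: my example gave a single row, but I need to accommodate multiple rows of data (question updated).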
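
Regarding the lost header names raised in the comments: the following is a minimal sketch, not the answerer's code, of a per-row product that keeps the labels by passing them to `pd.MultiIndex.from_product` through its `names` argument. The helper name `row_products_with_headers` is hypothetical, and the sketch assumes blank cells arrive either as empty strings or as NaN (as `read_csv` would typically produce).

```python
import pandas as pd

def row_products_with_headers(df):
    """Per-row Cartesian product that keeps column labels (illustrative helper)."""
    labels = list(dict.fromkeys(df.columns))  # unique labels, first-seen order
    frames = []
    for values in df.itertuples(index=False, name=None):
        # Collect this row's distinct, non-blank values under each label
        groups = {}
        for label, value in zip(df.columns, values):
            if pd.isna(value) or value == '':
                continue
            bucket = groups.setdefault(label, [])
            if value not in bucket:
                bucket.append(value)
        if not groups:
            continue  # nothing to expand in a fully blank row
        frame = pd.MultiIndex.from_product(
            list(groups.values()), names=list(groups)
        ).to_frame(index=False)
        # Labels with no values in this row become blank columns
        frames.append(frame.reindex(columns=labels, fill_value=''))
    return pd.concat(frames, ignore_index=True)

df = pd.DataFrame([[1, 0, 'x', 'y', 'z', 8, 9],
                   [2, 4, 'x', 'b', '', '', '']], columns=list('ABCCCDD'))
result = row_products_with_headers(df)
print(result)
```

For the two-row example this should give eight rows under the headers A, B, C, D, with D left blank for the rows that come from the second input row.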

【Solution 2】:

I can think of one possible solution, using a little preprocessing and itertools.product:

from itertools import product
import pandas as pd

# df is the single-row example DataFrame defined in Solution 1

prod = list(product(*df.groupby(df.columns, axis=1)\
                  .apply(lambda x: x.values.reshape(-1, )).tolist()))
prod
[(1, 0, 'x', 8),
 (1, 0, 'x', 9),
 (1, 0, 'y', 8),
 (1, 0, 'y', 9),
 (1, 0, 'z', 8),
 (1, 0, 'z', 9)]

df = pd.DataFrame(prod, columns=list('ABCD'))\
                 .sort_values('D').reset_index(drop=True)
df
   A  B  C  D
0  1  0  x  8
1  1  0  y  8
2  1  0  z  8
3  1  0  x  9
4  1  0  y  9
5  1  0  z  9

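【Comments】:

I was thinking of setting ['A','B'] as the index and then applying a Cartesian product. This one is nice.
Getting a SegFault on the prod= line with the full dataset.
@uxmatthew You must have a lot of columns, in which case this probably won't help you. Apologies.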
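
One plausible reading of the SegFault (an assumption; the thread does not confirm it) is that the prod= line flattens every value under a label across all rows before taking the product, so the number of tuples grows with a power of the row count. The sketch below uses plain Python lists rather than a DataFrame and hypothetical helper names (`group_by_label`, `iter_row_combinations`). It compares the two sizes and shows a lazy, row-at-a-time alternative built on itertools.product.

```python
import math
from itertools import product

def group_by_label(columns, row):
    """Group one row's non-blank values by column label, in first-seen order."""
    groups = {}
    for label, value in zip(columns, row):
        if value != '':
            groups.setdefault(label, []).append(value)
    return groups

def iter_row_combinations(columns, rows):
    """Lazily yield one dict per combination, expanding one input row at a time."""
    for row in rows:
        groups = group_by_label(columns, row)
        for combo in product(*groups.values()):
            yield dict(zip(groups, combo))

columns = list('ABCCCDD')
rows = [[1, 0, 'x', 'y', 'z', 8, 9],
        [2, 4, 'x', 'b', '', '', '']]

# Combinations produced when each input row is expanded on its own
per_row_total = sum(
    math.prod(len(v) for v in group_by_label(columns, r).values()) for r in rows
)

# Tuples produced if all values under a label are pooled across rows first
pooled = {}
for r in rows:
    for label, value in zip(columns, r):
        pooled.setdefault(label, []).append(value)
pooled_total = math.prod(len(v) for v in pooled.values())

print(per_row_total, pooled_total)  # prints: 8 96

combos = list(iter_row_combinations(columns, rows))
```

Because `iter_row_combinations` is a generator, the combinations can be consumed or written out in batches instead of being held in memory all at once.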