Conditional grouping in a pandas data frame: answers

【Question title】: Conditional grouping in pandas data frame
【Posted】: 2019-02-15 19:04:23
【Question】:

Consider a pandas data frame given by

df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})

which produces

   id            desc  mfr
0   0    This is text  ABC
1   1    John Doe ABC  DEF
2   2        John Doe  DEF
3   3   Something JKL  GHI
4   4  Something more  JKL

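I want to determine which ids belong together. Two ids belong together either when their mfr values match or when the mfr value of one is contained in the desc of the other. For example, id = 1 and 2 are in the same group because their mfr values are equal, but id = 0 and 1 are also in the same group because ABC, the mfr of id = 0, is part of the desc of id = 1.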
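
To make the rule concrete, here is a minimal sketch that lists the pairs of ids linked directly under it. It assumes that "contained in desc" means a whole-word match (desc split on whitespace), which the question does not state explicitly; the helper name `linked_pairs` is hypothetical and not part of the original post.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL'),
})


def linked_pairs(frame):
    """Return (id_a, id_b) pairs that are directly linked (hypothetical helper)."""
    rows = list(frame[['id', 'desc', 'mfr']].itertuples(index=False, name=None))
    pairs = []
    for (id_a, desc_a, mfr_a), (id_b, desc_b, mfr_b) in combinations(rows, 2):
        same_mfr = mfr_a == mfr_b
        # Assumption: "contained in desc" means a whole-word match.
        a_in_b = mfr_a in desc_b.split()
        b_in_a = mfr_b in desc_a.split()
        if same_mfr or a_in_b or b_in_a:
            pairs.append((int(id_a), int(id_b)))
    return pairs


print(linked_pairs(df))  # [(0, 1), (1, 2), (3, 4)]
```

Ids 0 and 2 are not linked directly but end up in the same group through id 1, so the desired grouping is the transitive closure of these links.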

The resulting data frame should be

   id            desc  mfr  group
0   0    This is text  ABC      0
1   1    John Doe ABC  DEF      0
2   2        John Doe  DEF      0
3   3   Something JKL  GHI      1
4   4  Something more  JKL      1

Does anyone have a good solution for this? I suspect there is no really simple one, so any solution is welcome.

【Comments】:

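I don't understand your question. What does it mean for ids to belong to each other? How does an id match the mfr column?
@AntonioAndrés "Either they are matched by the mfr column or the mfr value is contained in the desc column." Does that need further clarification?
Yes, I don't understand the first condition. The second condition is df['desc'] == df['mfr'], right?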
@AntonioAndrés See the edit in the original post.
OK, I see what you mean. I will try to help you.

Tags: python string pandas conditional pandas-groupby

【Answer 1】:

I assume that a 'desc' entry does not contain more than one 'mfr' value.
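
Whether this assumption holds for a given data set can be checked before running either solution. The sketch below counts how many distinct mfr values appear as whole words in each desc; the variable name `n_mfr_in_desc` is made up for illustration, and whole-word matching is itself an assumption.

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL'),
})

uniq_mfr = set(df['mfr'])

# Number of distinct 'mfr' values found among the words of each 'desc'.
n_mfr_in_desc = df['desc'].apply(lambda text: len(uniq_mfr.intersection(text.split())))

print(n_mfr_in_desc.tolist())  # [0, 1, 0, 1, 0]

# The assumption holds when no 'desc' contains more than one 'mfr' value.
assert (n_mfr_in_desc <= 1).all(), "some 'desc' contains more than one 'mfr' value"
```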

Solution 1:

import numpy as np
import pandas as pd

# original dataframe
df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})

# for final merge
ori = df.copy()

# max words used in 'desc'
max_len = max(df.desc.apply(lambda x: len(x.split(' '))))

# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()

# if list is less than max len, then pad with nan
def padding(lst, mx):
    for i in range(mx):
        if len(lst) < mx:
            lst.append(np.nan)
    return lst
df['desc'] = df.desc.apply(lambda x: x.split(' ')).apply(padding, args=(max_len,))

# each word makes 1 column
for i in range(max_len):
    newcol = 'desc{}'.format(i)
    df[newcol] = df.desc.apply(lambda x: x[i])
    df.loc[~df[newcol].isin(uniq_mfr), newcol] = np.nan

# merge created columns into 1 by taking 'mfr' values only
df['desc'] = df[df.columns[3:]].fillna('').sum(axis=1).replace('', np.nan)

# create [ABC, ABC] type of column by merging two columns (desc & mfr)
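df = df[df.columns[:3]].copy()  # copy so the assignments below do not act on a view
df['desc'] = df['desc'].fillna(df['mfr'])  # avoid chained in-place fillna
df['desc'] = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]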
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)

# BELOW IS COMMON WITH SOLUTION2
# from here I borrowed the solution by @mimomu from below URL (slightly modified)
# try to get merged tuple based on the common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools

L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L)) 

for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
        L += [tuple(set(itertools.chain.from_iterable(components)))]

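# allocate merged tuple to 'desc': look up each row's group through its 'mfr'
# value (pair[1]) instead of relying on sorted(L) lining up with the row order
merged = {tuple(sorted(t)) for t in L}
df['desc'] = [next(t for t in merged if pair[1] in t) for pair in df.desc]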

# group by the 'desc' tuples (tuples are hashable and can be group keys; lists cannot)
df['group'] = df.groupby('desc').ngroup()

# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]
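
The merging step in the second half can be tried on its own. Below is a minimal pure-Python sketch of merging lists that share common elements, applied to the [desc-or-mfr, mfr] pairs that the sample data yields. The name `merge_shared` is hypothetical, and unlike the loop above it returns one set per group rather than one tuple per row.

```python
def merge_shared(lists):
    """Merge lists that share at least one element into disjoint sets."""
    groups = []
    for items in lists:
        merged = set(items)
        remaining = []
        for group in groups:
            if group & merged:
                merged |= group  # absorb every existing group that overlaps
            else:
                remaining.append(group)
        groups = remaining + [merged]
    return groups


pairs = [['ABC', 'ABC'], ['ABC', 'DEF'], ['DEF', 'DEF'], ['JKL', 'GHI'], ['JKL', 'JKL']]

# Two groups come out: one holding ABC and DEF, one holding GHI and JKL.
print(merge_shared(pairs))
```

Because the existing groups are kept pairwise disjoint, a single pass over them per input list is enough to find every group that has to be absorbed.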

Solution 2 (the second half is shared with Solution 1):

import numpy as np
import pandas as pd

# original dataframe
df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL')
})

# for final merge
ori = df.copy()

# unique 'mfr' values
uniq_mfr = df.mfr.unique().tolist()

# make desc entries as lists
df['desc'] = df.desc.apply(lambda x: x.split(' '))

# pick up the first mfr value found in each desc entry, otherwise nan
mfr_in_descs = []
for ds in df.desc:
    for d in ds:
        if d in uniq_mfr:
            mfr_in_descs.append(d)
            break  # exactly one entry per row, wherever the match sits in desc
    else:
        # no word of this desc is an mfr value
        mfr_in_descs.append(np.nan)

# create column whose element is like [ABC, ABC]
df['desc'] = mfr_in_descs
df['desc'] = df['desc'].fillna(df['mfr'])  # avoid chained in-place fillna
df['desc'] = [[x, y] for x, y in zip(df.desc.tolist(), df.mfr.tolist())]
df = df[['id', 'desc']]
df = df.sort_values('desc').reset_index(drop=True)

# BELOW IS COMMON WITH SOLUTION1
# from here I borrowed the solution by @mimomu from below URL (slightly modified)
# try to get merged tuple based on the common elements
# https://stackoverflow.com/questions/4842613/merge-lists-that-share-common-elements
import itertools

L = df.desc.tolist()
LL = set(itertools.chain.from_iterable(L)) 

for each in LL:
    components = [x for x in L if each in x]
    for i in components:
        L.remove(i)
        L += [tuple(set(itertools.chain.from_iterable(components)))]

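# allocate merged tuple to 'desc': look up each row's group through its 'mfr'
# value (pair[1]) instead of relying on sorted(L) lining up with the row order
merged = {tuple(sorted(t)) for t in L}
df['desc'] = [next(t for t in merged if pair[1] in t) for pair in df.desc]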

# group by the 'desc' tuples (tuples are hashable and can be group keys; lists cannot)
df['group'] = df.groupby('desc').ngroup()

# merge with the original
df = df.drop('desc', axis=1).merge(ori, on='id', how='left')
df = df[['id', 'desc', 'mfr', 'group']]

Both solutions above give me the same resulting df:

   id            desc  mfr  group
0   0    This is text  ABC      0
1   1    John Doe ABC  DEF      0
2   2        John Doe  DEF      0
3   3   Something JKL  GHI      1
4   4  Something more  JKL      1
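
As an alternative sketch, not part of the original answer, the same grouping can be expressed as connected components using a small union-find over row positions. It assumes whole-word matching of mfr values in desc and does not need the one-mfr-per-desc assumption; `parent`, `find`, `union`, and `first_row` are illustrative names.

```python
import pandas as pd

df = pd.DataFrame({
    'id': range(5),
    'desc': ('This is text', 'John Doe ABC', 'John Doe', 'Something JKL', 'Something more'),
    'mfr': ('ABC', 'DEF', 'DEF', 'GHI', 'JKL'),
})

parent = list(range(len(df)))  # union-find forest over row positions


def find(i):
    # Follow parent links to the root, halving the path along the way.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(i, j):
    ri, rj = find(i), find(j)
    if ri != rj:
        parent[max(ri, rj)] = min(ri, rj)  # the smaller position becomes the root


# Rule 1: rows with equal 'mfr' belong together.
first_row = {}  # first row position seen for each 'mfr' value
for pos, mfr in enumerate(df['mfr']):
    if mfr in first_row:
        union(pos, first_row[mfr])
    else:
        first_row[mfr] = pos

# Rule 2: a row belongs with the rows whose 'mfr' appears as a word in its 'desc'.
for pos, desc in enumerate(df['desc']):
    for word in desc.split():
        if word in first_row:
            union(pos, first_row[word])

roots = [find(pos) for pos in range(len(df))]
# Number the groups 0, 1, ... in order of first appearance.
df['group'] = pd.factorize(pd.Series(roots))[0]

print(df['group'].tolist())  # [0, 0, 0, 1, 1]
```

This assigns the group to each row directly, so no sorting or final merge with a copy of the original frame is needed.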

【Comments】: