Creating new rows in a dataframe based on string values in multiple columns: answers

【Question title】: Creating new rows in dataframe based on string values in multiple columns
【Posted】: 2022-08-10 02:05:01
【Question】:

I'm stuck on this problem. I have a dataframe like the one below (the values in the last 3 columns are usually 4-5 character alphanumeric codes).

import pandas as pd

data = {'ID': ['P39', 'S32'],
        'Name': ['Pipe', 'Screw'],
        'Col3': ['Test1, Test2, Test3', 'Test6, Test7'],
        'Col4': ['', 'Test8, Test9'],
        'Col5': ['Test4, Test5', 'Test10, Test11, Test12, Test13']
       }

df = pd.DataFrame(data)

	ID	Name	Col3	Col4	Col5
0	P39	Pipe	Test1, Test2, Test3		Test4, Test5
1	S32	Screw	Test6, Test7	Test8, Test9	Test10, Test11, Test12, Test13

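I want to expand this dataframe, or create a new one, based on the values in the last 3 columns of each row. For each original row, the number of new rows should equal the largest number of comma-separated values found in any of the last 3 columns. The first 2 columns should stay the same across all expanded rows, while the last 3 columns should be filled with the individual values from the original columns, one value per row.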

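In the example above, the first row needs 3 rows in total (Col3 has the most values, 3), and the second row needs 4 rows in total (Col5 has the most values, 4). The desired output would be: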

	ID	Name	Col3	Col4	Col5
0	P39	Pipe	Test1		Test4
1	P39	Pipe	Test2		Test5
2	P39	Pipe	Test3
3	S32	Screw	Test6	Test8	Test10
4	S32	Screw	Test7	Test9	Test11
5	S32	Screw			Test12
6	S32	Screw			Test13

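I started by finding a way to count the number of rows needed. I also had the idea of appending the values to a new dataframe inside the same loop. However, I'm not sure how to separate the values in the last 3 columns and append them to the rows one by one. I know str.split() is useful for putting the values into a list. My only idea is that I might need to loop over each column separately and append each value to the correct row, but I don't know how to do that.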

output1 = pd.DataFrame(
    columns = ['ID', 'Name', 'Col3', 'Col4', 'Col5'])

for index, row in df.iterrows():

    output2 = pd.DataFrame(
        columns = ['ID', 'Name', 'Col3', 'Col4', 'Col5'])

    col3counter = df.iloc[index, 2].count(',')
    col4counter = df.iloc[index, 3].count(',')
    col5counter = df.iloc[index, 4].count(',')

    numofnewcols = max(col3counter, col4counter, col5counter) + 1

    iter1 = df.iloc[index, 2].split(', ')
    iter2 = df.iloc[index, 3].split(', ')
    iter3 = df.iloc[index, 4].split(', ')

    #for q in iter1
        #output2.iloc[ , 2] = 


    output1 = pd.concat([output1, output2], ignore_index=True)
    del output2

Tags: python pandas dataframe

【Solution 1】:

Here is one way:

cols = ['Col3','Col4','Col5']

s = df[cols].stack().str.split(', ')
s2 = s.str.len().groupby(level=0).transform(lambda x: x.max() - x)
(df.loc[:, ~df.columns.isin(cols)]
   .join((s + s2.map(lambda x: x * [''])).unstack())
   .explode(cols)
   .reset_index(drop=True))

Here is another way, using .stack() and str.split() and building a new df from the output:

cols = ['Col3','Col4','Col5']

s = df[cols].stack().str.split(', ')  # split on ', ' so the values carry no leading space
(df[['ID','Name']].join(pd.DataFrame(s.tolist(),index = s.index)
.stack()
.unstack(level=1)
.droplevel(1)
.fillna('')))

Output:

    ID   Name   Col3   Col4    Col5
0  P39   Pipe  Test1          Test4
1  P39   Pipe  Test2          Test5
2  P39   Pipe  Test3               
3  S32  Screw  Test6  Test8  Test10
4  S32  Screw  Test7  Test9  Test11
5  S32  Screw                Test12
6  S32  Screw                Test13
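
To see why the first approach lines the values up, it may help to look at its intermediate objects. The following is only a sketch on the question's sample data; the names `pad` and `padded` are chosen here for illustration and do not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['P39', 'S32'],
    'Name': ['Pipe', 'Screw'],
    'Col3': ['Test1, Test2, Test3', 'Test6, Test7'],
    'Col4': ['', 'Test8, Test9'],
    'Col5': ['Test4, Test5', 'Test10, Test11, Test12, Test13'],
})
cols = ['Col3', 'Col4', 'Col5']

# One list per (row, column) cell; an empty cell becomes [''].
s = df[cols].stack().str.split(', ')

# How many '' entries each list still needs to match the longest list in its row.
pad = s.str.len().groupby(level=0).transform(lambda x: x.max() - x)

# Adding two Series of lists concatenates the lists element by element.
padded = s + pad.map(lambda n: n * [''])

print(pad.tolist())               # padding needed per cell
print(padded.loc[(0, 'Col5')])    # the padded list for row 0, Col5
```

Once every list in a row has the same length, unstacking back to columns and calling explode on all three columns at once produces one output row per list position.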

【Discussion】:

【Solution 2】:

It's a bit tricky, but it should work with melt to flatten your dataframe, followed by pivot_table to reshape it:

out = (df.reset_index().melt(['ID', 'Name', 'index'], var_name='col', value_name='val')
         .assign(val=lambda x: x['val'].str.split(', ')).explode('val')
         .assign(row=lambda x: x.groupby(['index', 'col']).cumcount())
         .pivot_table('val', ['index', 'row', 'ID', 'Name'], 'col', aggfunc='first')
         .droplevel(['index', 'row']).reset_index().rename_axis(columns=None).fillna(''))

Output:

	ID	Name	Col3	Col4	Col5
0	P39	Pipe	Test1		Test4
1	P39	Pipe	Test2		Test5
2	P39	Pipe	Test3
3	S32	Screw	Test6	Test8	Test10
4	S32	Screw	Test7	Test9	Test11
5	S32	Screw			Test12
6	S32	Screw			Test13
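
The key step in the melt approach is the `row` counter: it numbers the values within each original row and column, so that pivot_table can later place the n-th value of every column on the same output row. A small sketch of the long format just before pivoting, on the question's sample data (the variable name `long` and the use of `ignore_index=True` are choices made here, not part of the answer):

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['P39', 'S32'],
    'Name': ['Pipe', 'Screw'],
    'Col3': ['Test1, Test2, Test3', 'Test6, Test7'],
    'Col4': ['', 'Test8, Test9'],
    'Col5': ['Test4, Test5', 'Test10, Test11, Test12, Test13'],
})

# Long format: one line per (original row, column, single value).
long = (df.reset_index()
          .melt(['ID', 'Name', 'index'], var_name='col', value_name='val')
          .assign(val=lambda x: x['val'].str.split(', '))
          .explode('val', ignore_index=True))

# Position of each value within its original cell: 0, 1, 2, ...
long['row'] = long.groupby(['index', 'col']).cumcount()

print(long[long['index'] == 0])
```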

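【Discussion】:

【Solution 3】:

This equalizes the number of values in each list, row by row, so that a multi-column explode gives you the desired output.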

import pandas as pd
import numpy as np

cols = ['Col3','Col4','Col5']

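# Turn each comma-separated string into a list of values.
for col in cols:
    df[col] = df[col].str.split(', ')

# Longest list in each row = number of output rows for that row.
# (On pandas >= 2.1, DataFrame.map replaces the deprecated applymap.)
df['rows'] = df[cols].applymap(len).max(axis=1)

# Pad the shorter lists with NaN so every list in a row has the same length.
for col in cols:
    df[col] = df[[col, 'rows']].apply(lambda x: x[col] + [np.nan]*(x['rows'] - len(x[col])), axis=1)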
'''
# Or, simplified with more-itertools and np.vectorize
from more_itertools import padded
vec_pad = np.vectorize(padded, excluded={1})
for col in cols:
    df[col] = vec_pad(df[col], np.nan, df.rows)
df[cols] = df[cols].applymap(list)
'''
df = (df.explode(cols)
        .drop('rows', axis=1)
        .replace('', np.nan))
print(df)

Output:

    ID   Name   Col3   Col4    Col5
0  P39   Pipe  Test1    NaN   Test4
0  P39   Pipe  Test2    NaN   Test5
0  P39   Pipe  Test3    NaN     NaN
1  S32  Screw  Test6  Test8  Test10
1  S32  Screw  Test7  Test9  Test11
1  S32  Screw    NaN    NaN  Test12
1  S32  Screw    NaN    NaN  Test13
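
The same "pad the shorter lists" idea can also be written as a plain row-by-row loop, which is closer to what the question attempted. The sketch below is an alternative, not any answerer's method: it assumes the standard library's itertools.zip_longest for the padding and fills the gaps with '' rather than NaN; `records` and `out` are illustrative names.

```python
from itertools import zip_longest

import pandas as pd

df = pd.DataFrame({
    'ID': ['P39', 'S32'],
    'Name': ['Pipe', 'Screw'],
    'Col3': ['Test1, Test2, Test3', 'Test6, Test7'],
    'Col4': ['', 'Test8, Test9'],
    'Col5': ['Test4, Test5', 'Test10, Test11, Test12, Test13'],
})
cols = ['Col3', 'Col4', 'Col5']

records = []
for row in df.itertuples(index=False):
    # Split each multi-value cell; an empty cell yields [''].
    lists = [getattr(row, col).split(', ') for col in cols]
    # zip_longest pads the shorter lists, giving one tuple per new row.
    for values in zip_longest(*lists, fillvalue=''):
        records.append((row.ID, row.Name, *values))

out = pd.DataFrame(records, columns=['ID', 'Name', *cols])
print(out)
```

For large frames the vectorized solutions above are likely to be faster; the loop mainly has the advantage of being easy to follow and debug.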

【Discussion】: