How to split data by delimiter for corresponding name and value columns: answers

【Question title】: How to split data by delimiter for corresponding name and value columns
【Posted】: 2022-01-08 15:15:25
【Question description】:

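I'm trying to work with an Excel file that is put together in a fairly annoying format (I didn't create it; it's an existing resource I'm using). The values of interest are in a column called (something like) All_Values, separated by periods, while the measures those values correspond to are specified in a separate column, All_Measures, also separated by periods and different for each row. For example, using a toy dataset: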

Object        All_Measures  All_Values     (additional columns that are not like this)
     1       Height.Weight      20.50      ...
     2       Weight.Height      65.30      ...
     3  Height.Width.Depth   22.30.10      ...

What I'd like to do is reformat the data like this, filling missing values with 0 (the final order of the columns doesn't matter):

Object  Height  Weight  Width  Depth  (additional columns)
     1      20      50      0      0  ...
     2      30      65      0      0  ...
     3      22       0     30     10  ...

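One way I could do this (very slowly, since it's a large dataset) is to create a new blank dataframe, then iterate over each row of the existing one, creating a new dataframe row with columns specified by splitting All_Measures by . and values specified by splitting All_Values by . as well. Then I drop All_Measures and All_Values from the row, append the new dataframe to the end of it, and append that to the blank dataframe. But this is clunky, and it would be nicer if there were a faster, more elegant way to do it.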

Since there's no error here, I don't have an MWE, but here is some code that can be copied to create a toy dataset like the one above, just in case.

import pandas as pd

df = pd.DataFrame(
    # rows first, then the column names as a keyword argument
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

【Question comments】:

Tags: python pandas dataframe

【Solution 1】:

Use str.split, explode, and pivot_table:

# split the "All" columns into lists
df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

# explode the lists into rows (exploding several columns at once needs pandas 1.3+)
df = df.explode(['All_Measures', 'All_Values'])

# convert the exploded strings to integers so pivot_table can aggregate them
df['All_Values'] = df['All_Values'].astype(int)

# pivot the measures into columns
df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

Output:

All_Measures       Depth Height Weight Width
Object Object_Name                          
1      First           0     20     50     0
2      Second          0     30     65     0
3      Third          10     22      0    30

Detailed breakdown

str.split turns the "All" columns into lists:

df['All_Measures'] = df['All_Measures'].str.split('.')
df['All_Values'] = df['All_Values'].str.split('.')

#    Object            All_Measures    All_Values Object_Name
# 0       1        [Height, Weight]      [20, 50]       First
# 1       2        [Weight, Height]      [65, 30]      Second
# 2       3  [Height, Width, Depth]  [22, 30, 10]       Third

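explode expands the lists into rows (the values are then converted from strings to integers so that they can be aggregated):

df = df.explode(['All_Measures', 'All_Values'])
df['All_Values'] = df['All_Values'].astype(int)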

#    Object All_Measures All_Values Object_Name
# 0       1       Height         20       First
# 0       1       Weight         50       First
# 1       2       Weight         65      Second
# 1       2       Height         30      Second
# 2       3       Height         22       Third
# 2       3        Width         30       Third
# 2       3        Depth         10       Third
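
Exploding two columns together only works when each row has the same number of items in both lists; otherwise pandas raises a ValueError about mismatched element counts. On a large, hand-assembled Excel file it may be worth checking for that first. The following is a minimal sketch, assuming the toy data from the question plus one deliberately malformed row (Object 4 is hypothetical, as are the names n_measures, n_values, mismatched, and clean):

```python
import pandas as pd

# Toy data from the question, plus one deliberately malformed row (hypothetical).
df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third'],
     [4, 'Height.Width', '15', 'Broken']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# Count the period-separated items on each side.
n_measures = df['All_Measures'].str.split('.').str.len()
n_values = df['All_Values'].str.split('.').str.len()

# Rows whose counts disagree cannot be exploded together.
mismatched = df[n_measures != n_values]
clean = df[n_measures == n_values]

print(mismatched[['Object', 'All_Measures', 'All_Values']])
```

Only the rows in clean would then go through the split and explode steps; how to handle the mismatched rows depends on the data.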

pivot_table pivots the measures into columns:

df.pivot_table(
    index=['Object', 'Object_Name'],
    columns='All_Measures',
    values='All_Values',
    fill_value=0)

# All_Measures       Depth Height Weight Width
# Object Object_Name                          
# 1      First           0     20     50     0
# 2      Second          0     30     65     0
# 3      Third          10     22      0    30

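【Comments】:

This is great! The only other thing I needed to do was convert the pivot table back into a dataframe at the end with pd.DataFrame(df.to_records()). Thanks for saving me a headache, and for introducing me to pd.DataFrame.explode().
Great. I think you can also reset the index of the pivot table to get a similar result: df.pivot_table(...).reset_index()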
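
To make the reset_index() suggestion concrete, here is a self-contained sketch of the whole pipeline ending in a flat dataframe. It assumes pandas 1.3 or later (for multi-column explode) and that every value is an integer, as in the toy data; the names long_df and wide are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# Split both columns into lists, explode them together, and make the values numeric.
# Assumption: every value is an integer, as in the toy data.
long_df = (
    df.assign(
        All_Measures=df['All_Measures'].str.split('.'),
        All_Values=df['All_Values'].str.split('.'),
    )
    .explode(['All_Measures', 'All_Values'])
    .astype({'All_Values': int})
)

# Pivot, then flatten the index back into ordinary columns.
wide = (
    long_df.pivot_table(
        index=['Object', 'Object_Name'],
        columns='All_Measures',
        values='All_Values',
        fill_value=0,
    )
    .reset_index()
    .rename_axis(columns=None)  # drop the leftover 'All_Measures' label
)

print(wide)
```

The rename_axis call is optional; it only removes the All_Measures label that pivot_table leaves on the column axis.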

【Solution 2】:

There may be some way to do this without a loop or apply(), but I couldn't think of one. Here's what I came up with:

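import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

def parse_combined_measure(row):
    # pair each measure name with its value for this row
    keys = row["All_Measures"].split(".")
    values = row["All_Values"].split(".")
    # Series.append was removed in pandas 2.0, so use pd.concat to add the new fields
    return pd.concat([row, pd.Series(dict(zip(keys, values)))])

# apply row by row; measures missing from a row come back as NaN
df2 = df.apply(parse_combined_measure, axis=1)
# the parsed values are still strings here; only the gaps become 0
df2 = df2.fillna(0)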
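
A variant of the same idea that avoids apply() is to build one plain dict per row and hand the whole list to the DataFrame constructor. It still loops in Python, so treat it as a sketch rather than a guaranteed speed-up; it assumes integer values as in the toy data, and the names records and measures_df are made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# One dict per row, pairing each measure name with its integer value.
# Assumption: every value is an integer, as in the toy data.
records = [
    dict(zip(measures.split('.'), map(int, values.split('.'))))
    for measures, values in zip(df['All_Measures'], df['All_Values'])
]

# Measures missing from a row become NaN, then 0. Reusing df.index keeps
# the rows aligned for the join below.
measures_df = pd.DataFrame(records, index=df.index).fillna(0).astype(int)

# Swap the two combined columns for the parsed ones.
df2 = df.drop(columns=['All_Measures', 'All_Values']).join(measures_df)

print(df2)
```

Because names and values are paired per row, this does not depend on the measures appearing in the same order in every row.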

【Comments】:

【Solution 3】:

# Create a new DataFrame with just the values extracted from the All_Values column
In [24]: new_df = df['All_Values'].str.split('.').apply(pd.Series)

In [25]: new_df
Out[25]:
    0   1    2
0  20  50  NaN
1  65  30  NaN
2  22  30   10

# Figure out the names those columns should have
In [37]: df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures']
Out[37]: 'Height.Width.Depth'

In [38]: new_df.columns = df.loc[df['All_Measures'].str.count(r'\.').idxmax(), 'All_Measures'].split('.')

In [39]: new_df
Out[39]:
  Height Width Depth
0     20    50   NaN
1     65    30   NaN
2     22    30    10

# Join the new DF with the original, except the columns we've expanded.
In [41]: df[['Object', 'Object_Name']].join(new_df)
Out[41]:
   Object Object_Name Height Width Depth
0       1       First     20    50   NaN
1       2      Second     65    30   NaN
2       3       Third     22    30    10

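【Comments】:

Shouldn't it be the second item's weight that is 65, rather than its height?
Yes, I like the idea, but it doesn't work for my dataset, where the order in All_Measures is inconsistent across rows. It also doesn't pick up the fact that the first two rows have Weight rather than Width or Depth, but the full set of names could be retrieved with new_df.columns = list(set('.'.join(df['All_Measures']).split('.'))).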
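
As a rough illustration of that last point, the distinct measure names can be collected in first-seen order with dict.fromkeys instead of set (a set does not keep any particular order). This is only a sketch on the toy data; all_names and positional are names made up for this example.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 'Height.Weight', '20.50', 'First'],
     [2, 'Weight.Height', '65.30', 'Second'],
     [3, 'Height.Width.Depth', '22.30.10', 'Third']],
    columns=['Object', 'All_Measures', 'All_Values', 'Object_Name'],
)

# Join every row's measure string, split once, and de-duplicate while keeping
# first-seen order (dict keys preserve insertion order in Python 3.7+).
all_names = list(dict.fromkeys('.'.join(df['All_Measures']).split('.')))

# The purely positional split of the values, as in Solution 3.
positional = df['All_Values'].str.split('.', expand=True)

print(all_names)
print(positional.shape)
```

On the toy data this gives four distinct names but only three positional columns, so the names cannot simply be assigned to new_df.columns; each value still has to be matched to its own row's measure names.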