【Question title】: Sorting rows in python pandas
【Posted】: 2021-01-07 17:10:24
【Question】:

I have a dataframe (a sample is shown below)

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,M       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots S                              S        330
Variation      2.7      Boots M                              M        330
Variable       3        Helmet           Helmet Sizes      E42,E41
Variation      3.8      Helmet E42                          E42       89
Variation      3.2      Helmet E41                          E41       89

What I want to do is sort the values by size, so the final dataframe should look like this:

  Type          SKU      Description   FullDescription        Size      Price
    Variable       2        Boots          Shoes on sale       S,M,XL        
    Variation      2.6      Boots S                             S       330
    Variation      2.7      Boots M                             M        330
    Variation      2.5      Boots XL                            XL        330
    Variable       3        Helmet          Helmet Sizes       E41,E42
    Variation      3.2      Helmet E41                          E41       89
    Variation      3.8      Helmet E42                          E42       89

I was able to get this result with the following code

sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'Variable').cumsum()) 
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df

The problem is that this code does not work on the following dataframe

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,3XL       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots 3XL                            3XL        330
Variation      2.7      Boots S                              S        330
Variable       3        Helmet           Helmet Sizes      S19, S9
Variation      3.8      Helmet E42                          S19       89
Variation      3.2      Helmet E41                          S9       89

It gives 'S,3XL,XL' and 'S19,S9', whereas the result I want is

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       S,XL,3XL       
Variation      2.7      Boots S                             S          330
Variation      2.5      Boots XL                            XL        330
Variation      2.6      Boots 3XL                           3XL        330
Variable       3        Helmet           Helmet Sizes      S9,S19
Variation      3.2      Helmet E41                          S9        89
Variation      3.8      Helmet E42                          S19       89

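With more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL'; for the second example it should be 'S9,S19,M9,M19,L9' and so on.

This is what I have tried so far, but it does not work and produces the wrong order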

sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'variable').cumsum())
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)

【Comments】:

Tags: python python-3.x pandas dataframe

【Solution 1】:

Step 1: Recreate the data

import pandas as pd

#----------------------#
# Recreate the dataset #
#----------------------#
# raw input (each block starts with a newline so it can be appended to the header line)
data_1 = """
Variable|2|Boots|Shoes on sale|XL,S,M|
Variation|2.5|Boots XL||XL|330
Variation|2.6|Boots S||S|330
Variation|2.7|Boots M||M|330
Variable|3|Helmet|Helmet Sizes|E42,E41|
Variation|3.8|Helmet E42||E42|89
Variation|3.2|Helmet E41||E41|89"""

data_2 = """
Variable|2|Boots|Shoes on sale|XL,S,3XL|
Variation|2.5|Boots XL||XL|330
Variation|2.6|Boots 3XL||3XL|330
Variation|2.7|Boots S||S|330
Variable|3|Helmet|Helmet Sizes|S19, S9|
Variation|3.8|Helmet E42||S19|89
Variation|3.2|Helmet E41||S9|89"""

# Construct 1 data set
data = 'Type|SKU|Description|FullDescription|Size|Price'
data += data_2 # this can also be data_1  or data_1 + data_2

# pre-process: split the lines and values into a list of lists.
data = [row.split('|') for row in data.split('\n')]

#-------------#
# create a df #
#-------------#
df = pd.DataFrame(data[1:], columns=data[0])
df

Intermediate result

   Type       SKU  Description  FullDescription  Size      Price
0  Variable   2    Boots        Shoes on sale    XL,S,3XL
1  Variation  2.5  Boots XL                      XL        330
2  Variation  2.6  Boots 3XL                     3XL       330
3  Variation  2.7  Boots S                       S         330
4  Variable   3    Helmet       Helmet Sizes     S19, S9
5  Variation  3.8  Helmet E42                    S19       89
6  Variation  3.2  Helmet E41                    S9        89

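Step 2: Create a priority dictionary

I am not really into fashion, and I am a guy (I only know S, M, L and XL),
so feel free to reorder the list or add extra sizes to it.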

# Prioritize the sizes
# ps, i don't know the order :) 
priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
priority_dict

Intermediate result

{'XXS': 0,
 'XS': 1,
 'S': 2,
 'M': 3,
 'L': 4,
 'XL': 5,
 'XXL': 6,
 '3XL': 7,
 '4XL': 8,
 '5XL': 9,
 'E41': 10,
 'E42': 11,
 'S9': 12,
 'S19': 13}
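
As an alternative to listing every size by hand, the order could be derived from the size label itself. The following is only a sketch and not part of the original answer: `size_key` and `BASE_ORDER` are hypothetical names, and the code assumes that letter sizes look like `XXS` or `3XL` and that numbered sizes look like `S9` or `M19`. Labels that fit neither pattern are simply sorted last.

```python
import re

# Assumed order of the base letters; any other prefix sorts after them.
BASE_ORDER = {"S": 0, "M": 1, "L": 2}

def size_key(size):
    """Hypothetical sort key: returns a tuple (kind, rank, prefix, number)."""
    size = size.strip().upper()

    # Letter sizes: XXS, XS, S, M, L, XL, XXL, 3XL, ...
    m = re.fullmatch(r"(\d+)?(X*)([SML])", size)
    if m:
        count, xs, base = m.groups()
        n_x = int(count) if count else len(xs)  # '3XL' -> 3, 'XXL' -> 2
        if base == "S":
            rank = -1 - n_x   # more X means smaller
        elif base == "L":
            rank = 1 + n_x    # more X means larger
        else:
            rank = 0          # M sits in the middle
        return (0, rank, "", 0)

    # Letter prefix plus number: S9, S19, M9, E41, ...
    m = re.fullmatch(r"([A-Z]+)(\d+)", size)
    if m:
        prefix, number = m.groups()
        return (1, BASE_ORDER.get(prefix, len(BASE_ORDER)), prefix, int(number))

    # Anything unrecognized goes last, in alphabetical order.
    return (2, 0, size, 0)

print(sorted(["XL", "S", "3XL"], key=size_key))  # ['S', 'XL', '3XL']
print(sorted(["S19", "S9"], key=size_key))       # ['S9', 'S19']
```

Such a key could be passed to `sorted(...)` in Step 3 in place of the dictionary lookup, at the cost of hard-coding assumptions about how the labels are written.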

Step 3: Create a list of tuples from the size string

# Split the string  "SIZE" into a list    "XL,S,M" --> ["XL", "S", "M"]
# And, add the value from our priority dict to it  --> [(5, "XL"), (2, "S"), (3, "M")]
# Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip())  for size in sizes.split(',')]) for sizes in df.Size]
df

Intermediate result

   Type       SKU  Description  FullDescription  Size      Price  TMP_SIZE
0  Variable   2    Boots        Shoes on sale    XL,S,3XL         [(2, S), (5, XL), (7, 3XL)]
1  Variation  2.5  Boots XL                      XL        330    [(5, XL)]
2  Variation  2.6  Boots 3XL                     3XL       330    [(7, 3XL)]
3  Variation  2.7  Boots S                       S         330    [(2, S)]
4  Variable   3    Helmet       Helmet Sizes     S19, S9          [(12, S9), (13, S19)]
5  Variation  3.8  Helmet E42                    S19       89     [(13, S19)]
6  Variation  3.2  Helmet E41                    S9        89     [(12, S9)]

Step 4: Clean up TMP_SIZE

# Create a new SIZE
# loop over TMP_SIZE and build a string from the second value of each tuple --> ', '.join( my_list )

df['NEW_SIZE'] = [', '.join([ size[1] for size in sizes ]) for sizes in df["TMP_SIZE"] ]

Intermediate result

Type    SKU     Description     ...     Size        Price  TMP_SIZE                       NEW_SIZE
0   Variable    2               ...     XL,S,3XL           [(2, S), (5, XL), (7, 3XL)]  S, XL, 3XL
1   Variation   2.5             ...     XL          330    [(5, XL)]                    XL
2   Variation   2.6             ...     3XL         330    [(7, 3XL)]                   3XL
3   Variation   2.7             ...     S           330    [(2, S)]                     S
4   Variable    3               ...     S19, S9            [(12, S9), (13, S19)]        S9, S19
5   Variation   3.8             ...     S19         89     [(13, S19)]                  S19
6   Variation   3.2             ...     S9          89     [(12, S9)]                   S9

Step 5: grp

Add your grp column

#grp
df['grp']= (df['Type'] == 'Variable').cumsum()
df
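
To illustrate what the `grp` column contains, here is a small stand-alone sketch. The `types` Series is a made-up stand-in for `df['Type']`. Each `Variable` row starts a new group, and the running sum of the boolean mask gives every following `Variation` row the same group number.

```python
import pandas as pd

# Stand-in for df['Type'] from the example data.
types = pd.Series(["Variable", "Variation", "Variation", "Variation",
                   "Variable", "Variation", "Variation"])

# True marks the start of a product; cumsum turns the marks into group ids.
grp = (types == "Variable").cumsum()
print(grp.tolist())  # [1, 1, 1, 1, 2, 2, 2]
```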

Step 6: Sort

In the last step you can sort everything (I think you need to sort on TMP_SIZE separately)

# sort the dataset
df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
df.sort_values(by=['grp', 'Type'])
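
A possible variation on the steps above, shown only as a sketch: instead of sorting on a column of lists, store a plain integer rank per row, which can be combined with `grp` and `Type` in a single `sort_values` call. The column name `size_rank`, the helper `sort_cell` and the reduced three-column dataframe are assumptions made for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Type": ["Variable", "Variation", "Variation", "Variation",
             "Variable", "Variation", "Variation"],
    "SKU": ["2", "2.5", "2.6", "2.7", "3", "3.8", "3.2"],
    "Size": ["XL,S,3XL", "XL", "3XL", "S", "S19, S9", "S19", "S9"],
})

order = ["XXS", "XS", "S", "M", "L", "XL", "XXL", "3XL", "4XL", "5XL", "S9", "S19"]
priority = {size: rank for rank, size in enumerate(order)}
unknown = len(priority)  # sizes missing from the list sort last

def sort_cell(cell):
    """Sort the comma-separated sizes inside one cell by priority."""
    parts = [p.strip() for p in cell.split(",")]
    return ",".join(sorted(parts, key=lambda p: priority.get(p, unknown)))

df["Size"] = df["Size"].map(sort_cell)
# Rank of the first (smallest) size in each cell.
df["size_rank"] = df["Size"].map(lambda c: priority.get(c.split(",")[0], unknown))
df["grp"] = (df["Type"] == "Variable").cumsum()

# 'Variable' sorts before 'Variation', so each parent row stays on top of its group.
result = (df.sort_values(["grp", "Type", "size_rank"])
            .drop(columns=["grp", "size_rank"])
            .reset_index(drop=True))
print(result)
```

Here the rows come out in SKU order 2, 2.7, 2.5, 2.6, 3, 3.2, 3.8, and the two `Variable` rows read `S,XL,3XL` and `S9,S19`.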

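【Discussion】:

Hi, thanks for the detailed answer. It works well for the given dataset, but not for the original dataset, which contains more kinds of sizes, such as EC 40-EC42, Ladies XS and Ladies XL. So it throws a 'TypeError: ' at me
I think that is because EC 40-EC42 is not in your priority dict --> { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict)
+ --> Is the string "EC 40-EC42" 1 size, or 2 sizes? If it is 2, you should split with a regular expression (so not only on ',' but also on '-') and clean your data further
The sizes look like this: 'EC 38-42,EC 43-46', 'EC 38-42', 'EC 43-46', 'EC 46,EC 48', ' EC 38,EC 40', 'EC 39-41,EC 42-XL,EC 45-47,EC 48-51', so each one is treated as a single value in the column, e.g. EC 38-42 is in one row, EC 43-46 in another row, and so on
@hyeri --> Did you try checking this? : { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict) to see whether it is empty?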
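
The check suggested in the comments can be sketched as follows. This is a stand-alone illustration, not code from the thread: `size_cells` is a made-up stand-in for `df.Size`, and `missing_sizes` and `sort_cell` are hypothetical helpers. It also shows one way to avoid the `TypeError`, assuming that the error comes from comparing `None` (a missing dictionary entry) with an integer: give unknown sizes a fallback rank.

```python
priority_dict = {k: e for e, k in enumerate([
    "XXS", "XS", "S", "M", "L", "XL", "XXL", "3XL", "4XL", "5XL",
    "E41", "E42", "S9", "S19"])}

# Stand-in for the values of df.Size.
size_cells = ["XL,S,3XL", "S19, S9", "EC 38-42,EC 43-46", "Ladies XS"]

def missing_sizes(cells, priority):
    """Return every size that appears in the data but not in the priority dict."""
    seen = {part.strip() for cell in cells for part in cell.split(",") if part.strip()}
    return seen - set(priority)

def sort_cell(cell, priority):
    """Sort one cell; unknown sizes get a fallback rank so nothing compares None with int."""
    fallback = len(priority)
    parts = [part.strip() for part in cell.split(",")]
    return ", ".join(sorted(parts, key=lambda p: (priority.get(p, fallback), p)))

print(sorted(missing_sizes(size_cells, priority_dict)))
# ['EC 38-42', 'EC 43-46', 'Ladies XS']
print(sort_cell("Ladies XS, XL,S", priority_dict))
# S, XL, Ladies XS
```

If the first result is not empty, those labels still need a place in the priority list (or a rule of their own) before the sort reflects the intended order.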