【Question title】: Sorting rows in python pandas
【Posted】: 2021-01-07 17:10:24
【Question】:

I have a dataframe (sample shown below)

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,M       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots S                              S        330
Variation      2.7      Boots M                              M        330
Variable       3        Helmet           Helmet Sizes      E42,E41
Variation      3.8      Helmet E42                          E42       89
Variation      3.2      Helmet E41                          E41       89

What I want to do is sort the values by size, so the final dataframe should look like this:

  Type          SKU      Description   FullDescription        Size      Price
    Variable       2        Boots          Shoes on sale       S,M,XL        
    Variation      2.6      Boots S                             S       330
    Variation      2.7      Boots M                             M        330
    Variation      2.5      Boots XL                            XL        330
    Variable       3        Helmet          Helmet Sizes       E41,E42
    Variation      3.2      Helmet E41                          E41       89
    Variation      3.8      Helmet E42                          E42       89

I was able to get this result with the following code

sizes, dig = ['S','M','XL','L',], ['000','111','333','222'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'Variable').cumsum()) 
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
df

The problem is that the given code does not work on this dataframe

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       XL,S,3XL       
Variation      2.5      Boots XL                             XL       330
Variation      2.6      Boots 3XL                            3XL        330
Variation      2.7      Boots S                              S        330
Variable       3        Helmet           Helmet Sizes      S19, S9
Variation      3.8      Helmet E42                          S19       89
Variation      3.2      Helmet E41                          S9       89

It gives 'S,3XL,XL' and 'S19,S9', whereas the result I want is

Type          SKU      Description   FullDescription        Size      Price
Variable       2        Boots          Shoes on sale       S,XL,3XL       
Variation      2.7      Boots S                             S          330
Variation      2.5      Boots XL                            XL        330
Variation      2.6      Boots 3XL                           3XL        330
Variable       3        Helmet           Helmet Sizes      S9,S19
Variation      3.2      Helmet E41                          S9        89
Variation      3.8      Helmet E42                          S19       89

If there are more sizes, the order should be 'XXS,XS,S,M,L,XL,XXL,3XL,4XL,5XL', and for the second example 'S9,S19,M9,M19,L9' and so on.

This is what I have tried so far, but it does not work and gives the wrong order

sizes, dig = ['XS','S','M','L','XL','XXL','3XL','4XL','5XL'], ['000','111','222','333','444','555','666','777','888'] #make sure dig values do not exist as a substring anywhere in your dataframe
df = (df.assign(Size=df['Size'].replace(sizes, dig, regex=True))
        .assign(grp=(df['Type'] == 'variable').cumsum())
        .sort_values(['grp', 'Type', 'Size']).drop('grp', axis=1))
df['Size'] = df['Size'].apply(lambda x: ','.join(sorted(x.split(',')))).replace(dig, sizes, regex=True)
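
One likely reason for the wrong order, sketched under the assumption that the substitutions are applied one pattern at a time in list order: 'L' is rewritten before 'XL' and '3XL' are reached, so those longer tokens never match as a whole. The snippet below reproduces the effect with plain `re.sub`; the helper name `encode` is made up for illustration. Separately, `df['Type'] == 'variable'` is case-sensitive and never matches 'Variable', so `grp` stays 0 for every row.

```python
import re

sizes = ['XS', 'S', 'M', 'L', 'XL', 'XXL', '3XL', '4XL', '5XL']
dig = ['000', '111', '222', '333', '444', '555', '666', '777', '888']

def encode(value):
    """Apply each substring substitution in list order (hypothetical helper)."""
    for pattern, code in zip(sizes, dig):
        value = re.sub(pattern, code, value)
    return value

# 'L' is replaced inside 'XL' and '3XL' before their own patterns are tried.
encoded = {size: encode(size) for size in ['S', 'XL', '3XL']}
print(encoded)

# Sorting by the encoded strings puts '3XL' before 'XL'.
wrong_order = sorted(encoded, key=encoded.get)
print(wrong_order)
```

Matching whole tokens (split on ',' first, then look each size up) avoids this kind of partial replacement.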

【Comments】:

    Tags: python python-3.x pandas dataframe


    【Solution 1】:

    Step 1: Recreate the data

    import pandas as pd
    
    #----------------------#
    # Recreate the dataset #
    #----------------------#
    # raw input
    data_1 = """
    Variable|2|Boots|Shoes on sale|XL,S,M|
    Variation|2.5|Boots XL||XL|330
    Variation|2.6|Boots S||S|330
    Variation|2.7|Boots M||M|330
    Variable|3|Helmet|Helmet Sizes|E42,E41|
    Variation|3.8|Helmet E42||E42|89
    Variation|3.2|Helmet E41||E41|89"""
    
    data_2 = """
    Variable|2|Boots|Shoes on sale|XL,S,3XL|
    Variation|2.5|Boots XL||XL|330
    Variation|2.6|Boots 3XL||3XL|330
    Variation|2.7|Boots S||S|330
    Variable|3|Helmet|Helmet Sizes|S19, S9|
    Variation|3.8|Helmet E42||S19|89
    Variation|3.2|Helmet E41||S9|89"""
    
    # Construct 1 data set (each raw block starts with a newline, so the header stays on its own line)
    data = 'Type|SKU|Description|FullDescription|Size|Price'
    data += data_2 # this can also be data_1  or data_1 + data_2
    
    # pre-process: split the lines and values into a list of lists,
    # stripping stray whitespace around each value.
    data = [[value.strip() for value in row.split('|')] for row in data.split('\n')]
    
    #-------------#
    # create a df #
    #-------------#
    df = pd.DataFrame(data[1:], columns=data[0])
    df
    

    Intermediate result

            Type  SKU  Description  FullDescription      Size  Price
    0   Variable    2        Boots    Shoes on sale  XL,S,3XL
    1  Variation  2.5     Boots XL                         XL    330
    2  Variation  2.6    Boots 3XL                        3XL    330
    3  Variation  2.7      Boots S                          S    330
    4   Variable    3       Helmet     Helmet Sizes   S19, S9
    5  Variation  3.8   Helmet E42                        S19     89
    6  Variation  3.2   Helmet E41                         S9     89
    

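    Step 2: Create a priority dictionary

    I'm not much into fashion, and I'm a guy on top of that --> (I only know S, M, L, XL).
    But feel free to reorder the list or add extra sizes.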

    # Prioritize the sizes
    # ps, i don't know the order :) 
    priority_dict = {k : e for e, k in enumerate([ 'XXS','XS','S','M','L','XL','XXL','3XL','4XL','5XL', 'E41', 'E42', 'S9', 'S19' ])}
    priority_dict
    

    Intermediate result

    {'XXS': 0,
     'XS': 1,
     'S': 2,
     'M': 3,
     'L': 4,
     'XL': 5,
     'XXL': 6,
     '3XL': 7,
     '4XL': 8,
     '5XL': 9,
     'E41': 10,
     'E42': 11,
     'S9': 12,
     'S19': 13}
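
Listing every size by hand gets long once families like 'S9,S19,M9,M19,L9' show up. As an alternative sketch (not part of the answer above), a key function can rank the named sizes from a list and split letter-plus-number sizes into a letter rank and an integer. The names `size_key`, `NAMED_RANK` and `LETTER_RANK`, and the assumed S < M < L letter order, are illustrative assumptions.

```python
import re

NAMED = ['XXS', 'XS', 'S', 'M', 'L', 'XL', 'XXL', '3XL', '4XL', '5XL']
NAMED_RANK = {name: rank for rank, name in enumerate(NAMED)}
LETTER_RANK = {'S': 0, 'M': 1, 'L': 2}  # assumed order for S9, M9, L9, ...

def size_key(size):
    """Return a sortable tuple (group, rank, letters, number) for one size."""
    size = size.strip()
    if size in NAMED_RANK:
        return (0, NAMED_RANK[size], '', 0)
    match = re.fullmatch(r'([A-Za-z]+)(\d+)', size)
    if match:
        letters, number = match.group(1), int(match.group(2))
        # Unknown letters (e.g. 'E') sort after the known ones, then alphabetically.
        return (1, LETTER_RANK.get(letters, len(LETTER_RANK)), letters, number)
    # Anything else sorts last, alphabetically.
    return (2, 0, size, 0)

print(sorted(['S19', 'M9', 'S9', 'L9', 'M19'], key=size_key))
print(sorted(['XL', 'S', '3XL'], key=size_key))
```

Because the number is compared as an integer, 'S9' sorts before 'S19', which a plain string sort would not do.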
    

    Step 3: Build a list of tuples from the Size string

    # Split the string  "SIZE" into a list    "XL,S,M" --> ["XL", "S", "M"]
    # And, add the value from our priority dict to it  --> [(5, "XL"), (2, "S"), (3, "M")]
    # Last but not least, sort list (by the first value) --> [(2, "S"), (3, "M"), (5, "XL")]
    df["TMP_SIZE"] = [ sorted([(priority_dict.get(size.strip()), size.strip())  for size in sizes.split(',')]) for sizes in df.Size]
    df
    

    Intermediate result

            Type  SKU  Description  FullDescription      Size  Price  TMP_SIZE
    0   Variable    2        Boots    Shoes on sale  XL,S,3XL         [(2, S), (5, XL), (7, 3XL)]
    1  Variation  2.5     Boots XL                         XL    330  [(5, XL)]
    2  Variation  2.6    Boots 3XL                        3XL    330  [(7, 3XL)]
    3  Variation  2.7      Boots S                          S    330  [(2, S)]
    4   Variable    3       Helmet     Helmet Sizes   S19, S9         [(12, S9), (13, S19)]
    5  Variation  3.8   Helmet E42                        S19     89  [(13, S19)]
    6  Variation  3.2   Helmet E41                         S9     89  [(12, S9)]
    

    Step 4: Clean up TMP_SIZE

    # Create a new SIZE
    # loop over TMP_SIZE and build a string from the second value of each tuple --> ', '.join( my_list )
    
    df['NEW_SIZE'] = [', '.join([ size[1] for size in sizes ]) for sizes in df["TMP_SIZE"] ]
    

    Intermediate result

            Type  SKU  ...      Size  Price  TMP_SIZE                     NEW_SIZE
    0   Variable    2  ...  XL,S,3XL         [(2, S), (5, XL), (7, 3XL)]  S, XL, 3XL
    1  Variation  2.5  ...        XL    330  [(5, XL)]                    XL
    2  Variation  2.6  ...       3XL    330  [(7, 3XL)]                   3XL
    3  Variation  2.7  ...         S    330  [(2, S)]                     S
    4   Variable    3  ...   S19, S9         [(12, S9), (13, S19)]        S9, S19
    5  Variation  3.8  ...       S19     89  [(13, S19)]                  S19
    6  Variation  3.2  ...        S9     89  [(12, S9)]                   S9
    

    Step 5: grp

    Add your grp column

    #grp
    df['grp']= (df['Type'] == 'Variable').cumsum()
    df
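
The `grp` column works because comparing the column to 'Variable' yields booleans, and a cumulative sum over them goes up by one at every 'Variable' row. Each parent row and the variations that follow it therefore share one group number. A minimal sketch on a stand-in Series (the name `types` is only for illustration):

```python
import pandas as pd

# Stand-in for df['Type'] from the sample data.
types = pd.Series(['Variable', 'Variation', 'Variation', 'Variation',
                   'Variable', 'Variation', 'Variation'])

# True counts as 1 and False as 0, so the running total increases at each 'Variable'.
grp = (types == 'Variable').cumsum()
print(grp.tolist())
```

The comparison is case-sensitive, so the string has to match the data exactly.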
    

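    Step 6: Sort

    In the last step you can sort everything (I think you need to sort on TMP_SIZE separately).

    # sort the dataset
    df = df.sort_values('TMP_SIZE') # notice that we sort on the list of tuples
    df = df.sort_values(by=['grp', 'Type']) # keep the result instead of only displaying it
    df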
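
As a compact variant of the steps in this answer, the sketch below swaps the list-of-tuples column for a plain integer rank column, so that a single `sort_values` call over three ordinary columns does all the ordering. The helper names `rank_of` and `sort_size_string`, the `size_rank` column, and the choice to place unknown sizes last are assumptions made for illustration, not part of the answer.

```python
import pandas as pd

order = ['XXS', 'XS', 'S', 'M', 'L', 'XL', 'XXL', '3XL', '4XL', '5XL',
         'E41', 'E42', 'S9', 'S19']
priority = {size: rank for rank, size in enumerate(order)}

def rank_of(size):
    # Unknown sizes get the last rank instead of None (an assumption).
    return priority.get(size.strip(), len(priority))

def sort_size_string(sizes):
    # "XL,S,3XL" -> "S,XL,3XL"
    parts = [s.strip() for s in sizes.split(',')]
    return ','.join(sorted(parts, key=rank_of))

df = pd.DataFrame({
    'Type': ['Variable', 'Variation', 'Variation', 'Variation',
             'Variable', 'Variation', 'Variation'],
    'SKU': ['2', '2.5', '2.6', '2.7', '3', '3.8', '3.2'],
    'Size': ['XL,S,3XL', 'XL', '3XL', 'S', 'S19, S9', 'S19', 'S9'],
})

df['Size'] = df['Size'].map(sort_size_string)
df['grp'] = (df['Type'] == 'Variable').cumsum()
# Rows holding several sizes are ranked by their smallest size.
df['size_rank'] = [min(rank_of(s) for s in sizes.split(',')) for sizes in df['Size']]

# 'Variable' sorts before 'Variation', so each parent row stays on top of its group.
df = (df.sort_values(['grp', 'Type', 'size_rank'])
        .drop(columns=['grp', 'size_rank'])
        .reset_index(drop=True))
print(df)
```

Sorting on scalar columns only keeps the whole ordering in one call and avoids comparing Python lists inside the frame.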
    

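    【Discussion】:

    • Hi, thanks for the detailed answer. It works very well for the given dataset, but not for the original dataset, which contains more kinds of sizes, e.g. EC 40-EC42, Ladies XS, Ladies XL. So it throws 'TypeError: ' at me
    • I think that is because EC 40-EC42 is not in your priority dict --> { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict)
    • + --> is the string "EC 40-EC42" one size, or two sizes? If it is two, you should split with a regular expression (so not only on ',' but also on '-') and clean your data further
    • The sizes look like this: '''EC 38-42,EC 43-46', 'EC 38-42', 'EC 43-46', 'EC 46,EC 48', ' EC 38,EC 40', 'EC 39-41,EC 42-XL,EC 45-47,EC 48-51', so each one is treated as a single value in the column, e.g. EC 38-42 in one row, EC 43-46 in another row, and so on
    • @hyeri --> did you try checking this? : { s.strip() for sizes in df.Size for s in [ size for size in sizes.split(',')]} - set(priority_dict) to see whether it is empty?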
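
A small sketch of the check discussed in these comments, using a stand-in list instead of the real `df.Size` (the sample values and the names `size_column`, `fallback` and `safe_sorted` are illustrative assumptions). It shows which sizes lack a priority, how a missing entry can lead to a `TypeError` when tuples are sorted, and one possible fallback that places unknown sizes last.

```python
priority_dict = {k: e for e, k in enumerate(['S', 'M', 'L', 'XL'])}
size_column = ['EC 38-42,EC 43-46', 'XL,S', 'Ladies XS']  # stand-in for df.Size

# Sizes present in the data that have no priority yet.
missing = {s.strip() for sizes in size_column for s in sizes.split(',')} - set(priority_dict)
print(sorted(missing))

# .get() returns None for a missing size, and None cannot be ordered against an int.
error = None
try:
    sorted([(priority_dict.get('S'), 'S'),
            (priority_dict.get('EC 38-42'), 'EC 38-42')])
except TypeError as exc:
    error = exc
print(type(error).__name__)

# One possible fallback: unknown sizes share the last rank and tie-break alphabetically.
fallback = len(priority_dict)
safe_sorted = sorted(['EC 43-46', 'XL', 'EC 38-42', 'S'],
                     key=lambda s: (priority_dict.get(s, fallback), s))
print(safe_sorted)
```

Whether unknown sizes should sort last, or be added to the priority list explicitly, depends on the data; the fallback only keeps the sort from failing.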