【Title】: Pandas scientific-notation values in a column give wrong output after applying MinMaxScaler
【Posted】: 2020-02-25 13:15:01
【Question】:

I have a dataframe as shown below. I applied MinMaxScaler on the Price column and then used the inverse_transform function to get the original price values back, but it returns wrong results. Please advise.

Dataframe:

Date    Customer    Price
1/6/2019    A   142404534.13
1/7/2019    A   989.34
1/8/2019    A   45444.57
1/9/2019    A   574343.10
1/10/2019   A   23232.34
1/1/2019    A   923423.00
1/2/2019    A   332.00
1/3/2019    A   2342323.24
1/4/2019    A   232.00
1/5/2019    A   65.70
1/6/2019    B   875.46
1/7/2019    B   142466027340.03
1/8/2019    B   25.17
1/9/2019    B   1.01
1/10/2019   B   1.00
1/10/2019   B   57.61
1/6/2019    B   232232.78
1/7/2019    B   15.20
1/8/2019    B   44.56
1/9/2019    B   2323254.45
1/10/2019   B   395.45
1/10/2019   B   23423454.92
1/6/2019    C   34.12
1/7/2019    C   89.34
1/8/2019    C   44.57
1/9/2019    C   343.10
1/10/2019   C   232.34

MinMaxScaler code on the df:

    from sklearn.preprocessing import MinMaxScaler

    df['Price'] = df['Price'].apply(lambda x: '{:.2f}'.format(x))
    scaler = MinMaxScaler()
    dff = df.groupby('Customer').Price.transform(lambda s: scaler.fit_transform(s.values.reshape(-1, 1)).ravel())
    dff = pd.DataFrame(dff)
    dff['Price'] = dff['Price'].apply(lambda x: '{:.2f}'.format(x))

    dff = pd.concat([dff['Price'], df['Customer']], axis=1)

dff output:

Price   Customer
0   1.00    A
1   0.00    A
2   0.00    A
3   0.00    A
4   0.00    A
5   0.01    A
6   0.00    A
7   0.02    A
8   0.00    A
9   0.00    A
10  0.00    B
11  1.00    B
12  0.00    B
.
.
.
.

20  0.00    B
21  0.00    B
22  0.00    C
23  0.18    C
24  0.03    C
25  1.00    C
26  0.64    C

inverse_transform code to get the actual price values back:

    dd = dff.groupby('Customer').Price.transform(lambda s: scaler.inverse_transform(s.values.reshape(-1, 1)).ravel())

    dd = pd.DataFrame(dd)
    dd['Price'] = dd['Price'].apply(lambda x: '{:.2f}'.format(x))
    dd = pd.concat([dd['Price'], df['Customer']], axis=1)

dd output:

Price   Customer
0   343.10  A
1   34.12   A
2   34.12   A
3   34.12   A
4   34.12   A
5   37.21   A
6   34.12   A
7   40.30   A
8   34.12   A
9   34.12   A
10  34.12   B
11  343.10  B
12  34.12   B
13  34.12   B
.
.
.
.
.
18  34.12   B
19  34.12   B
20  34.12   B
21  34.12   B
22  34.12   C
23  89.74   C
24  43.39   C
25  343.10  C
26  231.87  C

Please help and advise.

【Comments】:

    Tags: python pandas data-science normalization scientific-notation


    【Solution 1】:

    There are several issues in this script:

    1. df['Price'] = df['Price'].apply(lambda x: '{:.2f}'.format(x)) converts the Price column to strings, so no numeric operation is valid on that dtype. If you only want to set the displayed float precision, use pd.set_option('display.float_format', lambda x: '%.2f' % x) after the import pandas line instead.
    2. If I understand correctly, you want to scale the price values per customer to the 0-1 range. If that is the case, you cannot do it in one line like this:

      df.groupby('Customer').Price.transform(lambda s: scaler.fit_transform(s.values.reshape(-1, 1)).ravel())

      because every per-group fit overwrites the scaler object's min/max parameters from the previous group. So by the time you call scaler.inverse_transform(), it only holds the parameters of the last group.
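    The overwriting can be shown in isolation (a minimal sketch with made-up two-row groups; the numbers mirror the extremes of customers A and C above):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()

group_a = np.array([[65.7], [142404534.13]])  # customer A's min and max
group_c = np.array([[34.12], [343.10]])       # customer C's min and max

scaler.fit_transform(group_a)  # scaler now holds A's parameters
scaler.fit_transform(group_c)  # refitting replaces them with C's

# Inverting one of A's scaled values now uses C's min/max:
restored = scaler.inverse_transform(np.array([[1.0]]))
print(restored[0, 0])  # ~343.10, not 142404534.13
```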

    Solution:

    import pandas as pd
    pd.set_option('display.float_format', lambda x: '%.7f' % x)
    from sklearn.preprocessing import MinMaxScaler
    
    scalers_dict = dict({}) # Prepare place to hold scalers per Customer
    
    df = pd.read_csv('stack_data.csv', parse_dates=['Date'])
    
    df['Price_scaled'] = None
    df['Price_inverse_scaled'] = None
    
    # Loop over Customers, scale Price and save scaler for further use
    for customer in pd.unique(df['Customer']):
        scalers_dict[customer] = MinMaxScaler()
        scaled_price = scalers_dict[customer].fit_transform(df[df['Customer']==customer].Price.values.reshape(-1, 1))
        df.loc[df['Customer']==customer, 'Price_scaled'] = scaled_price
    
    # Loop over Customers and inverse scaled values
    for customer in pd.unique(df['Customer']):
        inverse_scale = scalers_dict[customer].inverse_transform(df[df['Customer']==customer].Price_scaled.values.reshape(-1, 1))
        df.loc[df['Customer']==customer, 'Price_inverse_scaled'] = inverse_scale
    

    Output (I set the precision to .7f because the max value is so large relative to the rest that, at two decimal places, most of the scaled values become 0.00):

            Date Customer                Price Price_scaled Price_inverse_scaled
    0  2019-01-06        A    142404534.1300000    1.0000000    142404534.1300000
    1  2019-01-07        A          989.3400000    0.0000065          989.3400000
    2  2019-01-08        A        45444.5700000    0.0003187        45444.5700000
    3  2019-01-09        A       574343.1000000    0.0040327       574343.1000000
    4  2019-01-10        A        23232.3400000    0.0001627        23232.3400000
    5  2019-01-01        A       923423.0000000    0.0064840       923423.0000000
    6  2019-01-02        A          332.0000000    0.0000019          332.0000000
    7  2019-01-03        A      2342323.2400000    0.0164479      2342323.2400000
    8  2019-01-04        A          232.0000000    0.0000012          232.0000000
    9  2019-01-05        A           65.7000000    0.0000000           65.7000000
    10 2019-01-06        B          875.4600000    0.0000000          875.4600000
    11 2019-01-07        B 142466027340.0299988    1.0000000 142466027340.0299988
    12 2019-01-08        B           25.1700000    0.0000000           25.1700000
    13 2019-01-09        B            1.0100000    0.0000000            1.0100000
    14 2019-01-10        B            1.0000000    0.0000000            1.0000000
    15 2019-01-10        B           57.6100000    0.0000000           57.6100000
    16 2019-01-06        B       232232.7800000    0.0000016       232232.7800000
    17 2019-01-07        B           15.2000000    0.0000000           15.2000000
    18 2019-01-08        B           44.5600000    0.0000000           44.5600000
    19 2019-01-09        B      2323254.4500000    0.0000163      2323254.4500000
    20 2019-01-10        B          395.4500000    0.0000000          395.4500000
    21 2019-01-10        B     23423454.9200000    0.0001644     23423454.9200000
    22 2019-01-06        C           34.1200000    0.0000000           34.1200000
    23 2019-01-07        C           89.3400000    0.1787171           89.3400000
    24 2019-01-08        C           44.5700000    0.0338210           44.5700000
    25 2019-01-09        C          343.1000000    1.0000000          343.1000000
    26 2019-01-10        C          232.3400000    0.6415302          232.3400000
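    As a side note: since MinMaxScaler on a single column is just (x - min) / (max - min), the same per-customer scaling and its exact inverse can be sketched without loops or scaler objects, using plain groupby arithmetic (a sketch on a small made-up frame; it assumes every customer has max > min, so there is no division by zero):

```python
import pandas as pd

# toy frame standing in for the question's data
df = pd.DataFrame({
    'Customer': ['A', 'A', 'A', 'B', 'B'],
    'Price':    [65.7, 989.34, 142404534.13, 1.0, 343.1],
})

g = df.groupby('Customer')['Price']
lo, hi = g.transform('min'), g.transform('max')

df['Price_scaled'] = (df['Price'] - lo) / (hi - lo)          # per-customer 0-1 scaling
df['Price_restored'] = df['Price_scaled'] * (hi - lo) + lo   # exact inverse

print(df)
```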
    

    Edit: multi-column version

    import pandas as pd
    pd.set_option('display.float_format', lambda x: '%.7f' % x)
    from sklearn.preprocessing import MinMaxScaler
    
    scalers_dict = dict({}) # Prepare place to hold scalers per Customer
    
    df = pd.read_csv('stack_data.csv', parse_dates=['Date'])
    df['Price2'] = range(df.shape[0])
    df['Price3'] = df['Price2'] * 2
    
    
    df = df.join(pd.DataFrame(0, df.index, ['Price1_scaled', 'Price2_scaled', 'Price3_scaled', 'Price1_inverse_scaled', 'Price2_inverse_scaled', 'Price3_inverse_scaled']))
    
    # Loop over Customers, scale Price and save scaler for further use
    for customer in pd.unique(df['Customer']):
        scalers_dict[customer] = MinMaxScaler()
        scaled_price = scalers_dict[customer].fit_transform(df.loc[df['Customer']==customer, 'Price':'Price3'])
        df.loc[df['Customer']==customer, 'Price1_scaled':'Price3_scaled'] = scaled_price
    
    # Loop over Customers and inverse scaled values
    for customer in pd.unique(df['Customer']):
        inverse_scale = scalers_dict[customer].inverse_transform(df.loc[df['Customer']==customer, 'Price1_scaled':'Price3_scaled'])
        df.loc[df['Customer']==customer, 'Price1_inverse_scaled':'Price3_inverse_scaled'] = inverse_scale
    

    Output:

             Date Customer                Price  Price2  Price3  Price1_scaled  Price2_scaled  Price3_scaled  Price1_inverse_scaled  Price2_inverse_scaled  Price3_inverse_scaled
    0  2019-01-06        A    142404534.1300000       0       0      1.0000000      0.0000000      0.0000000      142404534.1300000              0.0000000              0.0000000
    1  2019-01-07        A          989.3400000       1       2      0.0000065      0.1111111      0.1111111            989.3400000              1.0000000              2.0000000
    2  2019-01-08        A        45444.5700000       2       4      0.0003187      0.2222222      0.2222222          45444.5700000              2.0000000              4.0000000
    3  2019-01-09        A       574343.1000000       3       6      0.0040327      0.3333333      0.3333333         574343.1000000              3.0000000              6.0000000
    4  2019-01-10        A        23232.3400000       4       8      0.0001627      0.4444444      0.4444444          23232.3400000              4.0000000              8.0000000
    5  2019-01-01        A       923423.0000000       5      10      0.0064840      0.5555556      0.5555556         923423.0000000              5.0000000             10.0000000
    6  2019-01-02        A          332.0000000       6      12      0.0000019      0.6666667      0.6666667            332.0000000              6.0000000             12.0000000
    7  2019-01-03        A      2342323.2400000       7      14      0.0164479      0.7777778      0.7777778        2342323.2400000              7.0000000             14.0000000
    8  2019-01-04        A          232.0000000       8      16      0.0000012      0.8888889      0.8888889            232.0000000              8.0000000             16.0000000
    9  2019-01-05        A           65.7000000       9      18      0.0000000      1.0000000      1.0000000             65.7000000              9.0000000             18.0000000
    10 2019-01-06        B          875.4600000      10      20      0.0000000      0.0000000      0.0000000            875.4600000             10.0000000             20.0000000
    11 2019-01-07        B 142466027340.0299988      11      22      1.0000000      0.0909091      0.0909091   142466027340.0299988             11.0000000             22.0000000
    12 2019-01-08        B           25.1700000      12      24      0.0000000      0.1818182      0.1818182             25.1700000             12.0000000             24.0000000
    13 2019-01-09        B            1.0100000      13      26      0.0000000      0.2727273      0.2727273              1.0100000             13.0000000             26.0000000
    14 2019-01-10        B            1.0000000      14      28      0.0000000      0.3636364      0.3636364              1.0000000             14.0000000             28.0000000
    15 2019-01-10        B           57.6100000      15      30      0.0000000      0.4545455      0.4545455             57.6100000             15.0000000             30.0000000
    16 2019-01-06        B       232232.7800000      16      32      0.0000016      0.5454545      0.5454545         232232.7800000             16.0000000             32.0000000
    17 2019-01-07        B           15.2000000      17      34      0.0000000      0.6363636      0.6363636             15.2000000             17.0000000             34.0000000
    18 2019-01-08        B           44.5600000      18      36      0.0000000      0.7272727      0.7272727             44.5600000             18.0000000             36.0000000
    19 2019-01-09        B      2323254.4500000      19      38      0.0000163      0.8181818      0.8181818        2323254.4500000             19.0000000             38.0000000
    20 2019-01-10        B          395.4500000      20      40      0.0000000      0.9090909      0.9090909            395.4500000             20.0000000             40.0000000
    21 2019-01-10        B     23423454.9200000      21      42      0.0001644      1.0000000      1.0000000       23423454.9200000             21.0000000             42.0000000
    22 2019-01-06        C           34.1200000      22      44      0.0000000      0.0000000      0.0000000             34.1200000             22.0000000             44.0000000
    23 2019-01-07        C           89.3400000      23      46      0.1787171      0.2500000      0.2500000             89.3400000             23.0000000             46.0000000
    24 2019-01-08        C           44.5700000      24      48      0.0338210      0.5000000      0.5000000             44.5700000             24.0000000             48.0000000
    25 2019-01-09        C          343.1000000      25      50      1.0000000      0.7500000      0.7500000            343.1000000             25.0000000             50.0000000
    26 2019-01-10        C          232.3400000      26      52      0.6415302      1.0000000      1.0000000            232.3400000             26.0000000             52.0000000
    

    【Discussion】:

    • Thanks for the reply. If I want to run the above logic for multiple columns such as Price, Weight, and Height, how can I do that?
    • Just add more lines inside the loop for the other columns: scaled_price = scalers_dict[customer].fit_transform(df[df['Customer']==customer].Another_column.values.reshape(-1, 1))
    • If the dataframe is like the one in this link, stackoverflow.com/questions/58624544/…, should I scale them?
    • If I have 50 columns, adding a line per column is not a good idea; can you suggest something?
    • Then feed the whole slice into the scaler per customer: scaled_price = scalers_dict[customer].fit_transform(df.loc[df['Customer']==customer, 'Price':'Price50'])
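    Following up on the last comment: with many columns, one way to avoid spelling out a column slice is to select every numeric column by dtype (a sketch on a hypothetical toy frame; select_dtypes picks up all numeric columns, so drop any that should stay unscaled):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy frame standing in for the 50-column case
df = pd.DataFrame({
    'Customer': ['A', 'A', 'B', 'B'],
    'Price':    [10.0, 20.0, 5.0, 15.0],
    'Weight':   [1.0, 3.0, 2.0, 6.0],
})

num_cols = df.select_dtypes('number').columns  # every numeric column
scalers_dict = {}
for customer in df['Customer'].unique():
    mask = df['Customer'] == customer
    scalers_dict[customer] = MinMaxScaler()
    # one scaler per customer, fitted over all numeric columns at once
    df.loc[mask, num_cols] = scalers_dict[customer].fit_transform(df.loc[mask, num_cols])

print(df)
```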