【Question title】: Standardize some columns in Python Pandas dataframe?
【Posted】: 2018-09-13 11:11:54
【Question】:

The Python code below returns only an array, but I want the scaled values to replace the original data in the dataframe.

from sklearn.preprocessing import StandardScaler
df = StandardScaler().fit_transform(df[['cost', 'sales']])
df

Output

array([[ 1.99987622, -0.55900276],
       [-0.49786658, -0.45658181],
       [-0.5146864 , -0.505097  ],
       [-0.48104676, -0.47814412],
       [-0.50627649,  1.9988257 ]])

Original data

id  cost    sales   item
1   300       50    pen
2   3         88    bottle
3   1         70    drink
4   5         80    cup
5   2        999    ink

【Comments】:

    Tags: python pandas sklearn-pandas standardized


    【Solution 1】:

    Just assign the result back to the same columns:

    df[['cost', 'sales']] = StandardScaler().fit_transform(df[['cost', 'sales']])
    df
    Out[45]: 
       id      cost     sales    item
    0   1  1.999876 -0.559003     pen
    1   2 -0.497867 -0.456582  bottle
    2   3 -0.514686 -0.505097   drink
    3   4 -0.481047 -0.478144     cup
    4   5 -0.506276  1.998826     ink
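A minimal, self-contained sketch of the assignment above, using the data from the question. It also compares the result against a manual z-score, on the understanding that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to `ddof=1`. The names `cols`, `manual`, and `same` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5],
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})
cols = ["cost", "sales"]

# Manual z-score using the population standard deviation (ddof=0)
manual = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Same computation via StandardScaler, written back into the frame
df[cols] = StandardScaler().fit_transform(df[cols])

# Check that both approaches agree
same = np.allclose(df[cols].to_numpy(), manual.to_numpy())
print(df)
print("matches manual z-score:", same)
```

The `id` and `item` columns are untouched because only the selected columns are overwritten.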
    

    【Comments】:

      【Solution 2】:

      Or, if you want to select the columns by position rather than by name:

      import pandas as pd
      from sklearn.preprocessing import StandardScaler
      df = pd.DataFrame({"cost": [300,3,1,5,2], "sales": [50,88,70,80,999], "item": ["pen","bottle","drink","cup","ink"]})
      
      # Scale selected columns by index
      df.iloc[:, 0:2] = StandardScaler().fit_transform(df.iloc[:, 0:2])
      
             cost     sales    item
      0  1.999876 -0.559003     pen
      1 -0.497867 -0.456582  bottle
      2 -0.514686 -0.505097   drink
      3 -0.481047 -0.478144     cup
      4 -0.506276  1.998826     ink
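One caveat, offered as a hedged note rather than a definitive statement: the columns here start out as integers, and newer pandas releases have tightened the rules for writing floats into integer columns through `.iloc`, so depending on your version the positional assignment may emit a dtype warning or fail. A sketch that should sidestep this is to cast the target columns to float first; `scaled_cols` is a name introduced for illustration.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    "cost": [300, 3, 1, 5, 2],
    "sales": [50, 88, 70, 80, 999],
    "item": ["pen", "bottle", "drink", "cup", "ink"],
})

# Look up the labels of the columns at positions 0 and 1
scaled_cols = list(df.columns[0:2])

# Cast to float first so the scaled values fit the existing column dtype
df[scaled_cols] = df[scaled_cols].astype(float)

# Positional assignment now writes floats into float columns
df.iloc[:, 0:2] = StandardScaler().fit_transform(df.iloc[:, 0:2])
print(df)
```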
      

      The scaler object can also be kept, so that "new data" is scaled with the statistics of the already-fitted scaler:

      df = pd.DataFrame({"cost": [300,3,1,5,2], "sales": [50,88,70,80,999], "item": ["pen","bottle","drink","cup","ink"]})
      df_new = pd.DataFrame({"cost": [299,5,12,64,2], "sales": [55,99,48,20,999], "item": ["pen","bottle","drink","cup","ink"]})
      
      # Set up scaler
      scaler = StandardScaler().fit(df.iloc[:, 0:2])
      
      # Scale original data
      df.iloc[:, 0:2] = scaler.transform(df.iloc[:, 0:2])
      
      # Scale new data 
      df_new.iloc[:, 0:2] = scaler.transform(df_new.iloc[:, 0:2])
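To make "scaling new data with the existing scaler" concrete, here is a small sketch that inspects the fitted statistics and round-trips the scaler through `pickle` (in memory here; writing to a file should work the same way). It drops the `item` column for brevity, and the names `restored`, `expected`, and `matches` are illustrative.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"cost": [300, 3, 1, 5, 2], "sales": [50, 88, 70, 80, 999]})
df_new = pd.DataFrame({"cost": [299, 5, 12, 64, 2], "sales": [55, 99, 48, 20, 999]})

# Fit on the original data only
scaler = StandardScaler().fit(df)
print(scaler.mean_)   # per-column means of the original data
print(scaler.scale_)  # per-column standard deviations (ddof=0)

# Serialize and restore the fitted scaler
restored = pickle.loads(pickle.dumps(scaler))

# New data is scaled with the ORIGINAL mean and standard deviation
scaled_new = restored.transform(df_new)
expected = (df_new.to_numpy() - scaler.mean_) / scaler.scale_
matches = np.allclose(scaled_new, expected)
print("uses original statistics:", matches)
```

Because the new data is scaled with the original statistics, its columns will generally not have mean 0 and unit variance themselves.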
      

      【Comments】:

        【Solution 3】:

        If you want the benefits of an sklearn Pipeline (convenience/encapsulation, joint parameter selection, and safety against leakage), you can use ColumnTransformer:

        from sklearn.compose import ColumnTransformer
        from sklearn.preprocessing import StandardScaler

        preproc = ColumnTransformer(
            transformers=[
                ('scale', StandardScaler(), ["cost", "sales"]),
            ],
            remainder="passthrough",
        )
        

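        (There are several ways to specify which columns go to the scaler; see the docs.) You can now keep the fitted object around, as @Peter mentions, and you no longer have to repeat the slicing each time: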

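        # Fit on the original data, then reuse the fitted transformer on the new data.
        # By default both results are NumPy arrays, with the scaled columns first.
        df = preproc.fit_transform(df)
        df_new = preproc.transform(df_new)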
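Since the question asks for a dataframe rather than an array, here is a hedged sketch of wrapping the ColumnTransformer output back into a DataFrame. It relies on the transformed columns coming first and the passed-through columns following in their original order; `to_frame`, `out_cols`, and `scaled_cols` are hypothetical helper names, not part of scikit-learn. Recent scikit-learn versions (1.2 and later, as far as I know) also offer `set_output(transform="pandas")` as an alternative.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"cost": [300, 3, 1, 5, 2], "sales": [50, 88, 70, 80, 999],
                   "item": ["pen", "bottle", "drink", "cup", "ink"]})
df_new = pd.DataFrame({"cost": [299, 5, 12, 64, 2], "sales": [55, 99, 48, 20, 999],
                       "item": ["pen", "bottle", "drink", "cup", "ink"]})

scaled_cols = ["cost", "sales"]
preproc = ColumnTransformer(
    transformers=[("scale", StandardScaler(), scaled_cols)],
    remainder="passthrough",
)

# Output order: scaled columns first, then the passed-through columns
out_cols = scaled_cols + [c for c in df.columns if c not in scaled_cols]

def to_frame(array, index):
    """Wrap the transformer output in a DataFrame (hypothetical helper)."""
    frame = pd.DataFrame(array, columns=out_cols, index=index)
    # Mixed output arrives as object dtype; restore floats for scaled columns
    return frame.astype({c: float for c in scaled_cols})

df_scaled = to_frame(preproc.fit_transform(df), df.index)
df_new_scaled = to_frame(preproc.transform(df_new), df_new.index)
print(df_scaled)
print(df_new_scaled)
```

Note that the column order of the result follows the transformer, not the original frame, so reorder with `df_scaled[df.columns]` if the original layout matters.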
        

        【Comments】:
