【Question title】: How do I handle multiple non-ordinal categorical variables?
【Posted】: 2021-12-14 21:07:00
【Question description】:

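I found a dataset online containing this year's NBA player statistics. I am trying to run a linear regression on it to estimate how many points a player scores on average, given the following features: team name, position, age, and minutes played per game. However, I don't know how to handle the first two columns, which are my categorical variables. I have just started a data science course on Udemy, and the instructor hasn't really explained what to do in this situation, because his OneHotEncoding example only covers a dataset with a single categorical variable.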

My code:

#Import Libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

#Import Dataset

dataset = pd.read_csv('nba_clean.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values

#Encode Dataset

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough')
X = np.array(ct.fit_transform(X))

#Splitting the Dataset into Training set and Test Set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)

#Perform Multiple Linear Regression on Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

#Compare predicted values to true values
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
new_y_pred = y_pred.reshape(len(y_pred), 1)
new_y_test = y_test.reshape(len(y_test), 1)
print(np.concatenate((new_y_pred, new_y_test), 1))

【Question comments】:

    Tags: python machine-learning scikit-learn linear-regression


    【Solution 1】:

    Your column transformer has to handle every column type in the dataset, not only the one-hot-encoded columns. Replace

     ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough')
    

    with code along the following lines.

    First, define a list of column names for each column type:

    num_f  = ['age', 'points', ...]
    ord_f  = ['bbb', 'ccc', ...]
    cat_f  = ['aaa', 'ddd', ...]
    drop_f = []
    

    Then create a transformer for each type of value and combine them in a single ColumnTransformer:

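    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
    
    # create a transformer for the nominal (non-ordinal) categorical values
    cat_tr = Pipeline(steps=[
        ('onehot', OneHotEncoder())])
    
    # create a transformer for the ordinal categorical values
    ord_tr = Pipeline(steps=[
        ('ordinal', OrdinalEncoder())])
    
    # create a transformer for the numerical values
    num_tr = Pipeline(steps=[
        ('scaler', StandardScaler())])
    
    # selecting columns by name requires a pandas DataFrame as input,
    # not the NumPy array returned by .values
    ct = ColumnTransformer(transformers=[
        ("drop", 'drop', drop_f),
        ("cat", cat_tr, cat_f),
        ("ord", ord_tr, ord_f),
        ("num", num_tr, num_f),
        ], remainder='passthrough')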
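
To show how the pieces above fit together end to end, here is a minimal sketch that one-hot encodes two nominal columns, scales two numeric columns, and fits a linear regression. The toy data and the column names (`team`, `position`, `age`, `minutes`, `points`) are assumptions made for illustration; they are not the real schema of `nba_clean.csv`. The ordinal and drop branches are left out because this toy data has no such columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data; values and column names are made up for illustration
df = pd.DataFrame({
    "team": ["Lakers", "Spurs", "Lakers", "Bulls", "Spurs", "Bulls"],
    "position": ["PG", "C", "SF", "PG", "C", "SF"],
    "age": [25, 31, 28, 22, 34, 27],
    "minutes": [34.3, 14.1, 30.0, 20.5, 12.5, 29.9],
    "points": [23.1, 6.2, 18.4, 9.9, 4.8, 17.0],
})

cat_f = ["team", "position"]   # nominal categorical features
num_f = ["age", "minutes"]     # numeric features

X = df[cat_f + num_f]          # keep a DataFrame so columns can be selected by name
y = df["points"]

# One encoder handles both categorical columns at once
ct = ColumnTransformer(transformers=[
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_f),
    ("num", StandardScaler(), num_f),
])

# Chaining preprocessing and regression keeps fit/predict consistent
model = Pipeline(steps=[("prep", ct), ("reg", LinearRegression())])
model.fit(X, y)

X_enc = model.named_steps["prep"].transform(X)
print(X_enc.shape)                 # one column per team and position, plus the numeric ones
print(model.predict(X.head(2)))    # predictions for the first two players
```

Putting the transformer inside a `Pipeline` means that a train/test split only needs to split the raw DataFrame; the encoder is then fitted on the training rows only.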
    

    【Comments】:

    • Thanks! Do you know how to display all the columns? I get this: [[ 1. 0. 0. ... 0. 23.11 34.3 ] [ 1. 0. 0. ... 0. 30.62 14.1 ] [ 0. 0. 0. ... 1. 26.64 34. ] ... [ 0. 0. 0. ... 0. 25.6 14.8 ] [ 1. 0. 0. ... 0. 35.01 12.5 ] [ 0. 0. 0. ... 0. 25.68 29.9 ]]
    • @NoobAtDataScience, add pd.set_option('display.max_columns', None) to your code.
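
The output quoted in the first comment looks like a printed NumPy array, and pandas display options only affect pandas objects. A minimal sketch of both routes, assuming the transformed feature matrix is a dense NumPy array (the array `X` below is a made-up stand-in, not the asker's data):

```python
import sys

import numpy as np
import pandas as pd

# Hypothetical stand-in for a wide, transformed feature matrix
X = np.arange(2 * 600, dtype=float).reshape(2, 600)

# Route 1: stop NumPy from abbreviating large arrays with "..."
np.set_printoptions(threshold=sys.maxsize, linewidth=200, precision=2)
full_text = np.array2string(X)   # uses the print options set above
print(full_text)

# Route 2: wrap the array in a DataFrame, where the pandas option applies
pd.set_option('display.max_columns', None)
df_view = pd.DataFrame(X)
print(df_view.head())
```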
    【Solution 2】:

    You can use a pandas function to convert specific columns to one-hot encoding:

    pandas.get_dummies(data, columns=["TeamName", "Position"])
    

    For example:

    df = pd.DataFrame({
            "Player": ['player1', 'player2', 'player3'],
            "TeamName": ['Lakers', 'Spurs', 'Lakers'],
            "Position":['point guard', 'center', 'forward']
            })
        
    df
               Player TeamName     Position
           0  player1   Lakers  point guard
           1  player2    Spurs       center
           2  player3   Lakers      forward
    
    
    pd.get_dummies(df, columns=['TeamName', 'Position'], prefix='', prefix_sep='')
    
        Player   Lakers   Spurs   center   forward   point guard
    0  player1        1       0        0         0             1
    1  player2        0       1        1         0             0
    2  player3        1       0        0         1             0
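
As a hedged follow-up, the sketch below feeds `get_dummies` output into a linear regression. The `Age` and `Points` values are made up for illustration. It keeps the default column prefixes so that a team and a position can never produce the same column name, passes `drop_first=True` so one level per variable becomes the baseline (which avoids perfectly collinear dummy columns), and passes `dtype=int` because pandas 2.0 and later return booleans by default.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data; the numbers are invented for illustration
df = pd.DataFrame({
    "TeamName": ["Lakers", "Spurs", "Lakers", "Bulls"],
    "Position": ["point guard", "center", "forward", "center"],
    "Age": [25, 31, 28, 22],
    "Points": [23.1, 6.2, 18.4, 9.9],
})

# Encode both categorical columns in one call; Age passes through untouched
X = pd.get_dummies(
    df.drop(columns="Points"),
    columns=["TeamName", "Position"],
    drop_first=True,   # drop one level per variable as the baseline
    dtype=int,         # 0/1 integers instead of True/False
)
y = df["Points"]

print(X.columns.tolist())

regressor = LinearRegression().fit(X, y)
print(regressor.predict(X.head(2)))
```

One caveat with this approach: `get_dummies` only knows the categories present in the frame it is given, so encoding before the train/test split (or reindexing the test columns to match the training columns) keeps both sets aligned.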
    

    【Comments】:

    • Holy! You have no idea how much I needed this! Thank you so much!