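【Posted】: 2021-12-14 21:07:00
【Question】:
I got a dataset online containing this year's NBA player statistics. I'm trying to run a linear regression on it to predict how many points a given player averages, given the following features: team name, position, age, and minutes played per game. However, I don't know how to handle the first two columns, which are my categorical variables. I've just started a data science course on Udemy, and the instructor hasn't really explained what to do in this situation, because his OneHotEncoding example only covers a dataset with a single categorical variable.
My code: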
#Import Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#Import Dataset
dataset = pd.read_csv('nba_clean.csv')
X = dataset.iloc[:, 1:-1].values
y = dataset.iloc[:, -1].values
#Encode Dataset
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
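#sparse_threshold = 0 forces a dense result; with many team categories the default can return a sparse matrix, which np.array() does not turn into a 2-D array
ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0, 1])], remainder = 'passthrough', sparse_threshold = 0)
X = np.array(ct.fit_transform(X))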
#Splitting the Dataset into Training set and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state = 0)
#Perform Multiple Linear Regression on Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
#Compare predicted values to true values
y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2)
new_y_pred = y_pred.reshape(len(y_pred), 1)
new_y_test = y_test.reshape(len(y_test), 1)
print(np.concatenate((new_y_pred, new_y_test), 1))
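Listing both column indices in one `OneHotEncoder` entry, as the `[0, 1]` in the code above does, is enough to encode two categorical columns: each column gets its own set of indicator columns. Below is a minimal sketch of the same idea on made-up data, not a definitive solution. The column names (`Team`, `Pos`, `Age`, `MP`, `PTS`) and the toy rows are assumptions standing in for `nba_clean.csv`, whose actual layout is not shown. Selecting columns by name, wrapping everything in a `Pipeline`, and setting `handle_unknown="ignore"` and `sparse_threshold=0` are illustrative choices rather than anything the course prescribes.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy rows; the column names are assumptions, not the real CSV header.
toy = pd.DataFrame({
    "Team": ["AAA", "BBB", "CCC", "AAA", "BBB", "CCC", "AAA", "BBB"],
    "Pos": ["G", "F", "C", "F", "G", "G", "C", "F"],
    "Age": [22, 25, 31, 28, 24, 35, 27, 23],
    "MP": [30.1, 18.4, 25.0, 33.2, 12.5, 20.3, 28.7, 15.9],
    "PTS": [18.2, 7.5, 11.0, 21.4, 4.1, 8.8, 14.6, 6.0],
})

X = toy[["Team", "Pos", "Age", "MP"]]
y = toy["PTS"]

# One encoder handles both categorical columns; the numeric ones pass through.
encode = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["Team", "Pos"]),
    ],
    remainder="passthrough",
    sparse_threshold=0,  # always return a dense NumPy array
)

# The pipeline applies the encoding before the regression on fit and predict.
model = Pipeline([("encode", encode), ("regress", LinearRegression())])
model.fit(X, y)

# Inspect the encoded feature matrix: indicator columns first, then Age and MP.
X_encoded = model.named_steps["encode"].transform(X)
print(X_encoded.shape)

# A team not seen during fitting is encoded as all zeros instead of raising.
new_player = pd.DataFrame({"Team": ["ZZZ"], "Pos": ["G"], "Age": [26], "MP": [24.0]})
new_pred = model.predict(new_player)
print(new_pred)
```

With the real dataset, the same pipeline could be fitted on the training split only and then used to predict on the test split, so the encoder never sees test rows while learning its categories.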
【Discussion】:
Tags: python machine-learning scikit-learn linear-regression