Video: https://www.youtube.com/watch?v=BFaadIqWlAg

Code: https://github.com/jem1031/pandas-pipelines-custom-transformers

 

 


1. Model Training

After some simple preprocessing, let's use just one feature for prediction and see how the result looks.

#%%
import pandas as pd
import numpy as np
import os

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline

# SET UP

# Read in data
# source: https://data.seattle.gov/Permitting/Special-Events-Permits/dm95-f8w5
data_folder = '../data/'
data_file = 'Special_Events_Permits_2016.csv'
data_file_path = os.path.join(data_folder, data_file)
print("debug: data_file_path is {}".format(data_file_path))
df = pd.read_csv(data_file_path)

# Set aside 25% as test data
df_train, df_test = train_test_split(df, random_state=4321)

# Take a look
df_train.head()

#%%
# SIMPLE MODEL

# Binarize string feature
y_train = np.where(df_train.permit_status == 'Complete', 1, 0)
y_test  = np.where(df_test.permit_status == 'Complete', 1, 0)

print(y_train[:5])
print(y_test[:5])

# Fill missing values with 0; only this single column is used as the feature for this model
X_train_1 = df_train[['attendance']].fillna(value=0)
X_test_1  = df_test[['attendance']].fillna(value=0)

print(X_train_1[:5])
print(X_test_1[:5])

#%%
# Fit model
model_1 = LogisticRegression(random_state=5678)
model_1.fit(X_train_1, y_train)
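Pipeline is imported above but not yet used in this simple model. As a sketch (not from the original post), the manual fillna-then-fit steps can be folded into a single Pipeline estimator, here using sklearn's SimpleImputer on toy data in place of the manual fillna:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data with one missing attendance value
X = np.array([[100.0], [np.nan], [250.0], [50.0]])
y = np.array([1, 0, 1, 0])

# SimpleImputer with a constant 0 mirrors the fillna(value=0) step above
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='constant', fill_value=0)),
    ('model', LogisticRegression(random_state=5678)),
])
pipe.fit(X, y)
print(pipe.predict(X))
```

The advantage is that the same imputation is automatically re-applied at predict time, so train and test preprocessing cannot drift apart.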

 

2. Model Evaluation

Evaluation metric: ROC AUC

(1) obtain the binarized class predictions;

(2) obtain the predicted class probabilities.

y_pred_train_1 = model_1.predict(X_train_1)
print("y_pred_train_1 is {}".format(y_pred_train_1))
p_pred_train_1 = model_1.predict_proba(X_train_1)[:, 1]
print("p_pred_train_1 is {}".format(p_pred_train_1))

# Evaluate model
# baseline: always predict the average
p_baseline_test = [y_train.mean()]*len(y_test)
auc_baseline = roc_auc_score(y_test, p_baseline_test)
print(auc_baseline)  # 0.5

#######################################################
y_pred_test_1 = model_1.predict(X_test_1)
print("y_pred_test_1 is {}".format(y_pred_test_1))
p_pred_test_1 = model_1.predict_proba(X_test_1)[:, 1]
print("p_pred_test_1 is {}".format(p_pred_test_1))

# Evaluate model
auc_test_1 = roc_auc_score(y_test, p_pred_test_1)
print(auc_test_1)  # 0.576553672316

 

Ref: Understanding the ROC and AUC evaluation metrics, with a Python implementation

With FPR on the x-axis and TPR on the y-axis, the ROC curve connects all the (FPR, TPR) points obtained as the classification threshold is varied.

The red line is the ROC of random guessing; the closer the curve is to the top-left corner, the better the classifier.

AUC (Area Under Curve) is the area under the ROC curve.

With so many evaluation metrics already available, why use ROC and AUC at all?

Because the ROC curve has a very useful property: it stays unchanged when the distribution of positive and negative samples in the test set changes.
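As a quick illustration (toy labels and probabilities, not from the original post), sklearn's roc_curve performs exactly this threshold sweep, returning one (FPR, TPR) point per threshold, and roc_auc_score gives the area under that curve:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Made-up labels and predicted probabilities for illustration
y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])
p_pred = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.7, 0.6, 0.3])

# roc_curve sweeps the threshold and returns one (FPR, TPR) point per threshold
fpr, tpr, thresholds = roc_curve(y_true, p_pred)
print(list(zip(fpr, tpr)))

# AUC is the area under the (FPR, TPR) curve
auc = roc_auc_score(y_true, p_pred)
print(auc)
```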

See also: [Feature] Final pipeline: custom transformers

 

Evaluation metric: R²

The coefficient of determination (R² score) measures how good the model's predictive ability is (the degree of agreement, as a percentage, between predictions and ground truth).

The closer the predictions are to the true values, the larger R² is, up to a maximum of 1; a model with an R² of 0 is no better than simply predicting the mean (a mean-only model).

 

Ref: [Learning Machine Learning from Scratch, Part 12] MSE, RMSE, R2_score

Since different datasets have different scales, the three metrics above are hard to compare across datasets; instead, pick a reference baseline and compute the R² value against it, which makes models directly comparable.

R2_score < 0: the numerator is larger than the denominator, i.e. the trained model's error is larger than the error of just predicting the mean, so the model is actually worse than the mean baseline. This usually means the underlying relationship is not linear, and we mistakenly applied a linear model, producing large errors.
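A small sketch (made-up numbers, not from the original post) showing all three regimes of sklearn's r2_score: near 1 for good predictions, exactly 0 for the mean-only baseline, and negative when the model is worse than the mean:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.0, 5.0, 7.0, 9.0])

# Predictions close to the truth -> R^2 near 1
y_good = np.array([2.8, 5.1, 6.9, 9.2])
print(r2_score(y_true, y_good))

# Predicting the mean everywhere -> R^2 exactly 0
y_mean = np.full_like(y_true, y_true.mean())
print(r2_score(y_true, y_mean))

# Worse than the mean (predictions anti-correlated with truth) -> negative R^2
y_bad = np.array([9.0, 7.0, 5.0, 3.0])
print(r2_score(y_true, y_bad))
```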

Evaluation metric: Residual

The larger the variance of the predictions, the less stable the model;

import numpy as np
from sklearn.datasets import load_boston
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK
from sklearn.model_selection import cross_val_predict

boston = load_boston()
boston_X = boston.data
boston_y = boston.target
train_set = np.random.choice([True, False], len(boston_y), p=[.75, .25])
# Boolean index mask, convenient for picking the needed rows out of the dataset

mixed_kernel = CK(1.0, (1e-4, 1e4)) * RBF(10, (1e-4, 1e4))
gpr = GaussianProcessRegressor(alpha=5, n_restarts_optimizer=20, kernel=mixed_kernel)
gpr.fit(boston_X[train_set], boston_y[train_set])
test_preds = gpr.predict(boston_X[~train_set])
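The predictive uncertainty that the variance discussion above alludes to can be read straight from a fitted GaussianProcessRegressor via return_std=True. A minimal sketch on synthetic data (note load_boston itself was removed in scikit-learn 1.2, so a toy sine dataset is used here):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as CK

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, size=40)

kernel = CK(1.0, (1e-4, 1e4)) * RBF(1.0, (1e-4, 1e4))
gpr = GaussianProcessRegressor(alpha=0.01, kernel=kernel, random_state=0)
gpr.fit(X, y)

# return_std=True yields the predictive standard deviation per point:
# a large std means the model is uncertain (unstable) there
X_new = np.array([[2.0], [20.0]])  # 20.0 lies far outside the training range
mean, std = gpr.predict(X_new, return_std=True)
print(std)  # std at 20.0 is much larger than at 2.0
```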
