Logistic Regression
Logistic regression is a very widely used classification algorithm: it fits the data to a logit (logistic) function and uses that fit to predict the probability that an event occurs.
Steps for building a logistic regression model:

Import the data
Preprocess the data
Undersample (or oversample) the imbalanced data
Split the processed data into a training set and a test set
Cross-validate on the training set to find the best regularization parameter and reduce overfitting
Train and predict with the best regularization parameter, and inspect recall and precision
Adjust the classification threshold to trade off recall against precision
1. Data and task
Credit-card fraud data.
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

data = pd.read_csv("creditcard.csv")
data.head()
```
We will model the data with logistic regression. The task is binary classification: separate the transactions into fraudulent and non-fraudulent.
2. Preprocessing the data with sklearn
Standardization rescales each feature to a standard normal (Gaussian) distribution, i.e. zero mean and unit variance. The formula is (X - mean) / std, computed separately for each feature/column.
Standardization matters because a feature with a very large variance can dominate the objective function and prevent the estimator from learning the other features correctly.
It proceeds in two steps: centering (subtract the mean, so the mean becomes 0) and scaling (divide by the standard deviation, so the variance becomes 1).
sklearn.preprocessing provides a scale function that does exactly this, as shown below:
```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
x_scale = preprocessing.scale(x)
x_scale
```
The preprocessing module also provides a utility class, StandardScaler, which learns the standardizing transformation on the training set and then applies the very same transformation to the test set. The same transformation can also be applied to any new data that arrives later, without having to pool the data and recompute the statistics.
```python
scaler = preprocessing.StandardScaler().fit(x)
scaler
# StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.transform(x)
```
StandardScaler() accepts two boolean parameters, with_mean and with_std. Both default to True, but either can be set to False to skip the centering step or the scaling step.
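As a quick sanity check of those two flags, the sketch below disables centering: the column means stay nonzero while each column's standard deviation is still scaled to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# with_mean=False: skip centering, only divide each column by its std
scaler = StandardScaler(with_mean=False).fit(x)
x_scaled = scaler.transform(x)

print(x_scaled.mean(axis=0))  # column means are NOT all zero here
print(x_scaled.std(axis=0))   # but every column's std is 1
```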
1. Processing the data: undersampling
1.1 Preprocess the data: normalize the "Amount" column
```python
from sklearn.preprocessing import StandardScaler

if 'Amount' in data.columns:
    # reshape to a (n, 1) column, since fit_transform expects 2-D input
    data['normAmount'] = StandardScaler().fit_transform(
        data['Amount'].values.reshape(-1, 1))
```
1.2 Drop the features we no longer need
```python
if 'Time' in data.columns:
    data = data.drop(['Time'], axis=1)
if 'Amount' in data.columns:
    data = data.drop(['Amount'], axis=1)
print(data.columns, len(data.columns))

count_classes = pd.value_counts(data['Class'], sort=True).sort_index()
count_classes.plot(kind='bar')
plt.xlabel("Class")
plt.ylabel("Frequency")

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
print("SHAPE", X.shape, y.shape)
```
3. Undersampling
Reduce the majority class until it has the same number of samples as the minority class.
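The idea behind the undersampling steps below can be sketched on toy labels first (the names here are purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
labels = np.array([0] * 95 + [1] * 5)   # 95 majority samples, 5 minority

minority_idx = np.where(labels == 1)[0]
majority_idx = np.where(labels == 0)[0]

# draw, without replacement, as many majority samples as there are minority ones
sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, sampled_majority])

print(len(balanced_idx))                # 10: a 1:1 class ratio
print(labels[balanced_idx].sum())       # 5 minority samples kept
```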
1.3 Separate normal and fraudulent records via the "Class" feature
```python
number_records = data[data.Class == 1]
number_records_fraud = len(number_records)
frand_indices = np.array(number_records.index)
print("Fraud sample indices: {} in total".format(number_records_fraud), frand_indices[:10])

normal_indices = data[data.Class == 0].index
print("Normal sample indices: {} in total".format(len(normal_indices)), normal_indices[-10:])
print(">>> full data, normal vs. fraud:", len(normal_indices), '\t', number_records_fraud)
```
1.4 Undersample: randomly shrink the majority class to the size of the minority class
```python
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
print("Normal samples left after undersampling:", len(random_normal_indices))
```
1.5 Merge the indices (i.e. concatenate the sampled normal records with the original fraud records)
```python
under_sample_indices = np.concatenate([frand_indices, random_normal_indices])
print("Samples after merging:", len(under_sample_indices))

under_sample_data = data.iloc[under_sample_indices, :]
print("Merged sample:", under_sample_data[:1])
print(">>> undersampled data, normal vs. fraud:",
      len(under_sample_data[under_sample_data.Class == 0]), '\t',
      len(under_sample_data[under_sample_data.Class == 1]))
```
1.6 Extract the features and the labels from the merged data
```python
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
print(X_undersample.shape, y_undersample.shape)
print(len(under_sample_data[under_sample_data["Class"] == 1]),
      len(under_sample_data[under_sample_data["Class"] == 0]))
```
2. Splitting the data into training and test sets
```python
# sklearn.cross_validation was removed; the function now lives in model_selection
from sklearn.model_selection import train_test_split

print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("1. training set size", X_train.shape)
print("2. test set size", X_test.shape)
print(len(X_train) + len(X_test), len(y_train), len(y_test), "\n")

X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = \
    train_test_split(X_undersample, y_undersample, test_size=0.2, random_state=0)
print("3. training set size", X_train_undersample.shape)
print("4. test set size", X_test_undersample.shape)
print(len(X_train_undersample) + len(X_test_undersample),
      len(y_train_undersample), len(y_test_undersample), "\n")
```
Evaluation metric: recall
We do not use accuracy: on such an imbalanced dataset a model can score a high accuracy while catching almost no fraud, so accuracy tells us nothing useful about what we actually care about.
Model evaluation table:

|  | Relevant (positive class) | Non-relevant (negative class) |
|---|---|---|
| Retrieved | true positives (TP) | false positives (FP) |
| Not retrieved | false negatives (FN) | true negatives (TN) |
The metric is computed as:

Recall = TP / (TP + FN)
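A tiny numeric example (with made-up labels) of why recall is the right metric here: on data with 1% fraud, a lazy model that predicts "normal" for everything reaches 99% accuracy yet has 0% recall.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 99 + [1]   # 99 normal transactions, 1 fraud
y_pred = [0] * 100        # lazy model: everything is "normal"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches no fraud at all
```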
Overfitting: the model performs very well on the training set but poorly on the test set.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # formerly in sklearn.cross_validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
```
3. Choosing the regularization parameter via repeated cross-validation (random_state is the random seed)
```python
def printing_Kfold_scores(x_train_data, y_train_data):
    # modern KFold API: pass n_splits, then call .split() on the data
    fold = KFold(n_splits=5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    for index, c_param in enumerate(c_param_range):
        print(">>>>>>>>>>>>>>>>>>>>>>>>>>")
        print("C_parameter", c_param)
        recall_accs = []
        for iteration, (train_idx, test_idx) in enumerate(fold.split(y_train_data), start=1):
            # the L1 penalty requires the liblinear solver
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_idx, :],
                   y_train_data.iloc[train_idx, :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[test_idx, :].values)
            recall_acc = recall_score(y_train_data.iloc[test_idx, :].values,
                                      y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration", iteration, ": recall score =", recall_acc)
        results_table.loc[index, 'Mean recall score'] = np.mean(recall_accs)
        print('\nMean recall score', np.mean(recall_accs), '\n')

    print(results_table)
    # cast to float so idxmax works on the object-dtype column
    best_c = results_table.loc[
        results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    print("finally-------best is-------->", best_c)
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
```
Formatting ndarray output: set_printoptions
set_printoptions(precision=None, threshold=None, edgeitems=None, linewidth=None, suppress=None, nanstr=None, infstr=None, formatter=None)
precision: number of digits kept in the printed output (num)
threshold: arrays with fewer elements than threshold are printed in full instead of being summarized (num)
edgeitems: when an array is summarized, how many items are shown at its beginning and end (num)
formatter: an interesting one — much like Python 3's str.format(), it lets you customize the formatting of the output (the remaining parameters are not used here)
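A short demonstration of the options above (note that set_printoptions changes NumPy's global print state, so it is worth restoring the defaults afterwards):

```python
import numpy as np

a = np.array([1/3, 2/3, 1e-8])

np.set_printoptions(precision=2, suppress=True)
print(a)  # two digits; tiny values show as 0. instead of 1e-08

# formatter customizes output per dtype, much like str.format()
np.set_printoptions(formatter={'float': '{:.1f}'.format})
print(a)

# restore the defaults
np.set_printoptions(precision=8, suppress=False, formatter=None)
```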
4. Training and testing a logistic regression model with the best regularization parameter
```python
import itertools

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
print(type(y_pred_undersample), len(y_pred_undersample), "\n")

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
4. Training and testing with the best regularization parameter (using the original training and test sets)
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
5. Adjusting the threshold to get a better model
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
plt.figure(figsize=(10, 10))
for index, i in enumerate(thresholds):
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, index + 1)
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print(i, "Recall metric in the testing dataset:",
          cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title="Threshold >= %s" % i)
```
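The trade-off this loop explores can be seen with toy probabilities (made-up numbers, not model output): raising the threshold lowers recall but raises precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.9])  # hypothetical P(Class == 1)

for t in (0.1, 0.5):
    y_pred = (proba > t).astype(int)
    print("threshold", t,
          "recall", recall_score(y_true, y_pred),
          "precision", precision_score(y_true, y_pred))
# threshold 0.1: everything is flagged -> recall 1.0, precision 0.5
# threshold 0.5: one fraud missed     -> recall ~0.67, precision ~0.67
```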
Oversampling
Increase the minority class, generating new samples until it matches the size of the majority class.
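SMOTE's core idea — interpolating between a minority sample and one of its minority-class neighbors — can be sketched in a few lines of NumPy. This is a toy illustration of the interpolation step, not the library's actual implementation (which picks among k nearest neighbors):

```python
import numpy as np

rng = np.random.RandomState(0)
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])  # a few fraud points

def smote_like_sample(points, rng):
    """Create one synthetic point on the segment between two minority samples."""
    i, j = rng.choice(len(points), size=2, replace=False)
    gap = rng.rand()  # random position along the segment
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic.shape)  # (5, 2): five new minority-class points
```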
```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

credit_cards = pd.read_csv("creditcard.csv")
columns = credit_cards.columns
features_columns = columns.delete(len(columns) - 1)  # drop the last column ('Class')
print(features_columns)

features = credit_cards[features_columns]
labels = credit_cards['Class']
print("Original sample counts",
      credit_cards[credit_cards['Class'] == 0].shape,
      credit_cards[credit_cards['Class'] == 1].shape)
```
```python
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2, random_state=0)
print(features_train.shape, features_test.shape, labels_train.shape, labels_test.shape)
```
```python
oversampler = SMOTE(random_state=0)
# fit_sample was renamed fit_resample in newer imblearn versions
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
print("New fraud samples were indeed generated:",
      len(os_labels[os_labels == 1]), len(os_labels[os_labels == 0]))
print(os_features.shape, os_labels.shape)
```
```python
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)
```
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
Personal blog, feel free to visit: http://zj2626.com