Logistic Regression
Logistic regression is a very widely used classification algorithm: it fits the data to a logit (logistic) function and uses that fit to predict the probability that an event occurs.
Steps for building a logistic regression model:

Import the data
Preprocess the data
Undersample (or oversample) the imbalanced data
Split the processed data into a training set and a test set
Cross-validate on the training set to find the best regularization parameter and reduce overfitting
Train and predict with the best regularization parameter, and inspect recall and precision
Adjust the classification threshold to trade off recall against precision
1. Data and task
Credit-card fraud data.
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
%matplotlib inline

data = pd.read_csv("creditcard.csv")
data.head()
```
We will model the data with logistic regression. The task is binary classification: separate the transactions into fraudulent and non-fraudulent.
2. Preprocessing the data with sklearn
Standardization rescales each feature to a standard normal (Gaussian) distribution, i.e. zero mean and unit variance. The formula is (X - mean) / std, computed separately for each feature/column.
Standardization matters because a feature with a very large variance can dominate the objective function and prevent the estimator from learning the other features correctly.
It proceeds in two steps: centering (subtract the mean, so the mean becomes 0) and scaling (divide by the standard deviation, so the variance becomes 1).
sklearn.preprocessing provides a scale function that does exactly this, as shown below:
```python
import numpy as np
from sklearn import preprocessing

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])
x_scale = preprocessing.scale(x)
x_scale
```
The preprocessing module also provides a utility class, StandardScaler, which learns the standardizing transformation on the training set and then applies the very same transformation to the test set. The same transformation can also be applied to any new data that arrives later, without having to pool the data and recompute the statistics.
```python
scaler = preprocessing.StandardScaler().fit(x)
scaler
# StandardScaler(copy=True, with_mean=True, with_std=True)
scaler.transform(x)
```
StandardScaler() accepts two boolean parameters, with_mean and with_std. Both default to True, but either can be set to False to skip the centering step or the scaling step.
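As a quick sanity check of those two flags, the sketch below disables centering: the column means stay nonzero while each column's standard deviation is still scaled to 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[1., -1., 2.],
              [2., 0., 0.],
              [0., 1., -1.]])

# with_mean=False: skip centering, only divide each column by its std
scaler = StandardScaler(with_mean=False).fit(x)
x_scaled = scaler.transform(x)

print(x_scaled.mean(axis=0))  # column means are NOT all zero here
print(x_scaled.std(axis=0))   # but every column's std is 1
```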
1. Processing the data: undersampling
1.1 Preprocess the data: normalize the "Amount" column
```python
from sklearn.preprocessing import StandardScaler

if 'Amount' in data.columns:
    # reshape to a (n, 1) column, since fit_transform expects 2-D input
    data['normAmount'] = StandardScaler().fit_transform(
        data['Amount'].values.reshape(-1, 1))
```
1.2 Drop the features we no longer need
```python
if 'Time' in data.columns:
    data = data.drop(['Time'], axis=1)
if 'Amount' in data.columns:
    data = data.drop(['Amount'], axis=1)
print(data.columns, len(data.columns))

count_classes = pd.value_counts(data['Class'], sort=True).sort_index()
count_classes.plot(kind='bar')
plt.xlabel("Class")
plt.ylabel("Frequency")

X = data.loc[:, data.columns != 'Class']
y = data.loc[:, data.columns == 'Class']
print("SHAPE", X.shape, y.shape)
```
3. Undersampling
Reduce the majority class until it has the same number of samples as the minority class.
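The idea behind the undersampling steps below can be sketched on toy labels first (the names here are purely illustrative):

```python
import numpy as np

rng = np.random.RandomState(0)
labels = np.array([0] * 95 + [1] * 5)   # 95 majority samples, 5 minority

minority_idx = np.where(labels == 1)[0]
majority_idx = np.where(labels == 0)[0]

# draw, without replacement, as many majority samples as there are minority ones
sampled_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
balanced_idx = np.concatenate([minority_idx, sampled_majority])

print(len(balanced_idx))                # 10: a 1:1 class ratio
print(labels[balanced_idx].sum())       # 5 minority samples kept
```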
1.3 Separate normal and fraudulent records via the "Class" feature
```python
number_records = data[data.Class == 1]
number_records_fraud = len(number_records)
frand_indices = np.array(number_records.index)
print("Fraud sample indices: {} in total".format(number_records_fraud), frand_indices[:10])

normal_indices = data[data.Class == 0].index
print("Normal sample indices: {} in total".format(len(normal_indices)), normal_indices[-10:])
print(">>> full data, normal vs. fraud:", len(normal_indices), '\t', number_records_fraud)
```
1.4 Undersample: randomly shrink the majority class to the size of the minority class
```python
random_normal_indices = np.random.choice(normal_indices, number_records_fraud, replace=False)
random_normal_indices = np.array(random_normal_indices)
print("Normal samples left after undersampling:", len(random_normal_indices))
```
1.5 Merge the indices (i.e. concatenate the sampled normal records with the original fraud records)
```python
under_sample_indices = np.concatenate([frand_indices, random_normal_indices])
print("Samples after merging:", len(under_sample_indices))

under_sample_data = data.iloc[under_sample_indices, :]
print("Merged sample:", under_sample_data[:1])
print(">>> undersampled data, normal vs. fraud:",
      len(under_sample_data[under_sample_data.Class == 0]), '\t',
      len(under_sample_data[under_sample_data.Class == 1]))
```
1.6 Extract the features and the labels from the merged data
```python
X_undersample = under_sample_data.loc[:, under_sample_data.columns != 'Class']
y_undersample = under_sample_data.loc[:, under_sample_data.columns == 'Class']
print(X_undersample.shape, y_undersample.shape)
print(len(under_sample_data[under_sample_data["Class"] == 1]),
      len(under_sample_data[under_sample_data["Class"] == 0]))
```
2. Splitting the data into training and test sets
```python
# sklearn.cross_validation was removed; the function now lives in model_selection
from sklearn.model_selection import train_test_split

print(X.shape, y.shape)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
print("1. training set size", X_train.shape)
print("2. test set size", X_test.shape)
print(len(X_train) + len(X_test), len(y_train), len(y_test), "\n")

X_train_undersample, X_test_undersample, y_train_undersample, y_test_undersample = \
    train_test_split(X_undersample, y_undersample, test_size=0.2, random_state=0)
print("3. training set size", X_train_undersample.shape)
print("4. test set size", X_test_undersample.shape)
print(len(X_train_undersample) + len(X_test_undersample),
      len(y_train_undersample), len(y_test_undersample), "\n")
```
Evaluation metric: recall
We do not use accuracy: on such an imbalanced dataset a model can score a high accuracy while catching almost no fraud, so accuracy tells us nothing useful about what we actually care about.
Model evaluation table:

|  | Relevant (positive class) | Non-relevant (negative class) |
|---|---|---|
| Retrieved | true positives (TP) | false positives (FP) |
| Not retrieved | false negatives (FN) | true negatives (TN) |
The metric is computed as:

Recall = TP / (TP + FN)
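A tiny numeric example (with made-up labels) of why recall is the right metric here: on data with 1% fraud, a lazy model that predicts "normal" for everything reaches 99% accuracy yet has 0% recall.

```python
from sklearn.metrics import accuracy_score, recall_score

y_true = [0] * 99 + [1]   # 99 normal transactions, 1 fraud
y_pred = [0] * 100        # lazy model: everything is "normal"

print(accuracy_score(y_true, y_pred))  # 0.99 -- looks great
print(recall_score(y_true, y_pred))    # 0.0  -- catches no fraud at all
```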
Overfitting: the model performs very well on the training set but poorly on the test set.
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold  # formerly in sklearn.cross_validation
from sklearn.metrics import confusion_matrix, recall_score, classification_report
```
3. Choosing the regularization parameter via repeated cross-validation (random_state is the random seed)
```python
def printing_Kfold_scores(x_train_data, y_train_data):
    # modern KFold API: pass n_splits, then call .split() on the data
    fold = KFold(n_splits=5, shuffle=False)

    c_param_range = [0.01, 0.1, 1, 10, 100]
    results_table = pd.DataFrame(index=range(len(c_param_range)),
                                 columns=['C_parameter', 'Mean recall score'])
    results_table['C_parameter'] = c_param_range

    for index, c_param in enumerate(c_param_range):
        print(">>>>>>>>>>>>>>>>>>>>>>>>>>")
        print("C_parameter", c_param)
        recall_accs = []
        for iteration, (train_idx, test_idx) in enumerate(fold.split(y_train_data), start=1):
            # the L1 penalty requires the liblinear solver
            lr = LogisticRegression(C=c_param, penalty='l1', solver='liblinear')
            lr.fit(x_train_data.iloc[train_idx, :],
                   y_train_data.iloc[train_idx, :].values.ravel())
            y_pred_undersample = lr.predict(x_train_data.iloc[test_idx, :].values)
            recall_acc = recall_score(y_train_data.iloc[test_idx, :].values,
                                      y_pred_undersample)
            recall_accs.append(recall_acc)
            print("Iteration", iteration, ": recall score =", recall_acc)
        results_table.loc[index, 'Mean recall score'] = np.mean(recall_accs)
        print('\nMean recall score', np.mean(recall_accs), '\n')

    print(results_table)
    # cast to float so idxmax works on the object-dtype column
    best_c = results_table.loc[
        results_table['Mean recall score'].astype(float).idxmax()]['C_parameter']
    print("finally-------best is-------->", best_c)
    return best_c

best_c = printing_Kfold_scores(X_train_undersample, y_train_undersample)
```
Formatting ndarray output: set_printoptions
set_printoptions(precision=None, threshold=None, edgeitems=None, linewidth=None, suppress=None, nanstr=None, infstr=None, formatter=None)
precision: number of digits kept in the printed output (num)
threshold: arrays with fewer elements than threshold are printed in full instead of being summarized (num)
edgeitems: when an array is summarized, how many items are shown at its beginning and end (num)
formatter: an interesting one — much like Python 3's str.format(), it lets you customize the formatting of the output (the remaining parameters are not used here)
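A short demonstration of the options above (note that set_printoptions changes NumPy's global print state, so it is worth restoring the defaults afterwards):

```python
import numpy as np

a = np.array([1/3, 2/3, 1e-8])

np.set_printoptions(precision=2, suppress=True)
print(a)  # two digits; tiny values show as 0. instead of 1e-08

# formatter customizes output per dtype, much like str.format()
np.set_printoptions(formatter={'float': '{:.1f}'.format})
print(a)

# restore the defaults
np.set_printoptions(precision=8, suppress=False, formatter=None)
```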
4. Training and testing a logistic regression model with the best regularization parameter
```python
import itertools

lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train_undersample, y_train_undersample.values.ravel())
y_pred_undersample = lr.predict(X_test_undersample.values)
print(type(y_pred_undersample), len(y_pred_undersample), "\n")

def plot_confusion_matrix(cm, classes, title='Confusion matrix', cmap=plt.cm.Blues):
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=0)
    plt.yticks(tick_marks, classes)

    thresh = cm.max() / 2
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j], horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel("True label")
    plt.xlabel("Predicted label")

cnf_matrix = confusion_matrix(y_test_undersample, y_pred_undersample)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
4. Training and testing with the best regularization parameter (using the original training and test sets)
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred = lr.predict(X_test.values)

cnf_matrix = confusion_matrix(y_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
5. Adjusting the threshold to get a better model
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(X_train, y_train.values.ravel())
y_pred_undersample_proba = lr.predict_proba(X_test_undersample.values)

thresholds = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
plt.figure(figsize=(10, 10))
for index, i in enumerate(thresholds):
    y_test_predictions_high_recall = y_pred_undersample_proba[:, 1] > i
    plt.subplot(3, 3, index + 1)
    cnf_matrix = confusion_matrix(y_test_undersample, y_test_predictions_high_recall)
    np.set_printoptions(precision=2)
    print(i, "Recall metric in the testing dataset:",
          cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))
    class_names = [0, 1]
    plot_confusion_matrix(cnf_matrix, classes=class_names, title="Threshold >= %s" % i)
```
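The trade-off this loop explores can be seen with toy probabilities (made-up numbers, not model output): raising the threshold lowers recall but raises precision.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1])
proba = np.array([0.2, 0.4, 0.6, 0.3, 0.7, 0.9])  # hypothetical P(Class == 1)

for t in (0.1, 0.5):
    y_pred = (proba > t).astype(int)
    print("threshold", t,
          "recall", recall_score(y_true, y_pred),
          "precision", precision_score(y_true, y_pred))
# threshold 0.1: everything is flagged -> recall 1.0, precision 0.5
# threshold 0.5: one fraud missed     -> recall ~0.67, precision ~0.67
```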
Oversampling
Increase the minority class, generating new samples until it matches the size of the majority class.
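SMOTE's core idea — interpolating between a minority sample and one of its minority-class neighbors — can be sketched in a few lines of NumPy. This is a toy illustration of the interpolation step, not the library's actual implementation (which picks among k nearest neighbors):

```python
import numpy as np

rng = np.random.RandomState(0)
minority = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 2.0]])  # a few fraud points

def smote_like_sample(points, rng):
    """Create one synthetic point on the segment between two minority samples."""
    i, j = rng.choice(len(points), size=2, replace=False)
    gap = rng.rand()  # random position along the segment
    return points[i] + gap * (points[j] - points[i])

synthetic = np.array([smote_like_sample(minority, rng) for _ in range(5)])
print(synthetic.shape)  # (5, 2): five new minority-class points
```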
```python
import pandas as pd
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

credit_cards = pd.read_csv("creditcard.csv")
columns = credit_cards.columns
features_columns = columns.delete(len(columns) - 1)  # drop the last column ('Class')
print(features_columns)

features = credit_cards[features_columns]
labels = credit_cards['Class']
print("Original sample counts",
      credit_cards[credit_cards['Class'] == 0].shape,
      credit_cards[credit_cards['Class'] == 1].shape)
```
```python
features_train, features_test, labels_train, labels_test = \
    train_test_split(features, labels, test_size=0.2, random_state=0)
print(features_train.shape, features_test.shape, labels_train.shape, labels_test.shape)
```
```python
oversampler = SMOTE(random_state=0)
# fit_sample was renamed fit_resample in newer imblearn versions
os_features, os_labels = oversampler.fit_resample(features_train, labels_train)
print("New fraud samples were indeed generated:",
      len(os_labels[os_labels == 1]), len(os_labels[os_labels == 0]))
print(os_features.shape, os_labels.shape)
```
```python
os_features = pd.DataFrame(os_features)
os_labels = pd.DataFrame(os_labels)
best_c = printing_Kfold_scores(os_features, os_labels)
```
```python
lr = LogisticRegression(C=best_c, penalty='l1', solver='liblinear')
lr.fit(os_features, os_labels.values.ravel())
y_pred = lr.predict(features_test.values)

cnf_matrix = confusion_matrix(labels_test, y_pred)
np.set_printoptions(precision=2)
print("Recall metric in the testing dataset:",
      cnf_matrix[1, 1] / (cnf_matrix[1, 0] + cnf_matrix[1, 1]))

class_names = [0, 1]
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=class_names, title='Confusion matrix')
plt.show()
```
Personal blog, feel free to visit: http://zj2626.com