Machine Learning in Practice: Logistic Regression

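The basic idea is as follows: fit a regression formula to the classification boundary using the existing data, and use that formula to classify. The word "regression" here comes from best fitting: the goal is to find the best-fit parameters, and those parameters are obtained by an optimization algorithm while the classifier is trained. Logistic regression is a linear classifier; its decision boundary is linear, so it works best when the classes are (at least approximately) linearly separable.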
https://blog.csdn.net/lgb_love/article/details/80592147
https://blog.csdn.net/haochen233/article/details/79868125
The logistic function (sometimes also called the sigmoid function, because its graph is S-shaped):
$\sigma(z) = \frac{1}{1 + e^{-z}}$
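To make the role of the logistic function concrete, here is a minimal sketch in NumPy. The names `sigmoid` and `predict_label`, the weight vector `w`, and the bias `b` are illustrative assumptions and do not come from the original post; the 0.5 threshold is the usual convention for the binary case.

```python
import numpy as np


def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def predict_label(x, w, b):
    """Hypothetical helper: return 1 if sigmoid(w.x + b) >= 0.5, else 0."""
    return int(sigmoid(np.dot(w, x) + b) >= 0.5)


# Illustrative values only (assumed, not taken from the iris example)
w = np.array([1.0, -1.0])
b = 0.0
print(sigmoid(0.0))                           # value at the midpoint of the S-curve
print(predict_label(np.array([1.0, 2.0]), w, b))
```

Because the sigmoid is monotonic, thresholding its output at 0.5 is the same as checking the sign of `w.x + b`, which is why the resulting decision boundary is linear.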
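Regularization:
Overfitting means that the model is too complex: it performs well on the training set but poorly on unseen (test) data. An overfit model is said to have high variance, which may be because it has too many parameters relative to the available data, making the model overly complex.
Underfitting means that the model is too simple to capture the patterns hidden in the training set (high bias), so the trained model also performs poorly on unseen (test) data.
Adding a regularization term to the cost function of the regression penalizes large weights and thereby helps prevent overfitting.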
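As a rough illustration of what "adding a regularization term" means, the sketch below computes a cross-entropy cost with an L2 penalty for binary labels. The function name `regularized_cost`, the parameter `lam`, and the exact scaling (mean loss plus `lam / 2` times the squared weight norm) are assumptions made for this example; libraries use different but equivalent conventions. In scikit-learn's `LogisticRegression`, for instance, the parameter `C` is the inverse of the regularization strength, so a large value such as `C=1000.0` means weak regularization.

```python
import numpy as np


def regularized_cost(w, b, X, y, lam):
    """Mean cross-entropy plus an L2 penalty (lam / 2) * ||w||^2.

    X has shape (n_samples, n_features); y holds labels in {0, 1}.
    """
    z = X @ w + b
    p = 1.0 / (1.0 + np.exp(-z))   # predicted probability of class 1
    eps = 1e-12                    # guards against log(0)
    cross_entropy = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    penalty = 0.5 * lam * np.dot(w, w)
    return cross_entropy + penalty


# Toy data (assumed): two samples, two features
X_toy = np.array([[1.0, 0.0], [0.0, 1.0]])
y_toy = np.array([1, 0])
w_toy = np.array([2.0, 0.0])
print(regularized_cost(w_toy, 0.0, X_toy, y_toy, lam=0.0))  # data term only
print(regularized_cost(w_toy, 0.0, X_toy, y_toy, lam=1.0))  # data term plus penalty
```

The penalty grows with the squared size of the weights, so the optimizer is pushed toward smaller weights unless larger ones reduce the data term enough to pay for themselves.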
The following code uses logistic regression to classify the iris dataset that ships with Python's sklearn package.

import numpy as np
from matplotlib.colors import ListedColormap
import matplotlib.pyplot as plt
def plot_decision_regions(X, y, classifier, test_idx=None, resolution=0.02):
    markers = ('s', 'x', 'o', '^', 'v')
    colors = ('red', 'blue', 'lightgreen', 'gray', 'cyan')
    cmap = ListedColormap(colors[:len(np.unique(y))])
    x1_min, x1_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    x2_min, x2_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx1, xx2 = np.meshgrid(np.arange(x1_min, x1_max, resolution), np.arange(x2_min, x2_max, resolution))
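    # Build a matrix with the same number of columns as the training data and predict the class label Z for every grid point
    Z = classifier.predict(np.array([xx1.ravel(), xx2.ravel()]).T)
    Z = Z.reshape(xx1.shape)  # reshape Z to the same shape as xx1 and xx2
    # Use contourf to draw the decision regions, one color per predicted class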
    plt.contourf(xx1, xx2, Z, alpha=0.3, cmap=cmap)
    plt.xlim(xx1.min(), xx1.max())
    plt.ylim(xx2.min(), xx2.max())

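    # Plot the samples of each class with its own color and marker
    for idx, cl in enumerate(np.unique(y)):
        plt.scatter(x=X[y == cl, 0],
                    y=X[y == cl, 1],
                    alpha=0.8,
                    c=colors[idx],
                    marker=markers[idx],
                    label=cl,
                    edgecolor='black')

    # Circle the test samples if their indices were passed in (test_idx was previously unused)
    if test_idx is not None:
        X_test = X[list(test_idx), :]
        plt.scatter(X_test[:, 0], X_test[:, 1], facecolors='none', edgecolor='black',
                    alpha=1.0, linewidth=1, marker='o', s=100, label='test set')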


# Get the data for the training and test sets, using the iris dataset
from sklearn import datasets

iris = datasets.load_iris()
x = iris.data[:, [2, 3]]
y = iris.target

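# Split the dataset
from sklearn.model_selection import train_test_split

# train_test_split() from scikit-learn's model_selection module (the old cross_validation module has been removed) randomly splits the iris feature matrix x and the class-label vector y into test and training sets at a 3:7 ratio
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=0)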

# Standardize the features to improve performance
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()
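sc.fit(x_train)  # fit estimates the sample mean and standard deviation of each feature from the training data
x_train_std = sc.transform(x_train)  # transform standardizes the data using the mean and standard deviation estimated by fit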
x_test_std = sc.transform(x_test)

from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(C=1000.0, random_state=0)
lr.fit(x_train_std, y_train)
print("Training Score:%f" % lr.score(x_train_std, y_train))  # accuracy on (x_train_std, y_train)
print("Testing Score:%f" % lr.score(x_test_std, y_test))  # accuracy on (x_test_std, y_test)

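x_combined_std = np.vstack((x_train_std, x_test_std))  # stack the two feature arrays vertically (row-wise): training rows first, then test rows
y_combined = np.hstack((y_train, y_test))  # concatenate the two 1-D label arrays end to end, in the same order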
plot_decision_regions(X=x_combined_std, y=y_combined, classifier=lr, test_idx=range(105, 150))
plt.xlabel('petal length [standardized]')
plt.ylabel('petal width [standardized]')
plt.legend(loc='upper left')
plt.show()

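Result:
[Figure: decision regions of the logistic regression classifier, plotted over standardized petal length and petal width]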