Messy scatter plot regression line: Python

【Question title】: Messy scatter plot regression line: Python
【Posted】: 2016-05-30 09:01:23
【Question】:

With Python 2.7.6, matplotlib, and scikit-learn 0.17.0, when I draw polynomial regression lines over a scatter plot, the polynomial curves come out very messy.

Here is the script. It reads two columns of float data, draws the scatter plot, and fits the regressions:

import pandas as pd
import scipy.stats as stats
import pylab 
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
import sklearn
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge

df=pd.read_csv("boston_real_estate_market_clean.csv")

LSTAT = df['LSTAT'].as_matrix()

LSTAT=LSTAT.reshape(LSTAT.shape[0], 1)

MEDV=df['MEDV'].as_matrix()

MEDV=MEDV.reshape(MEDV.shape[0], 1)

# Train test set split
X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT, MEDV, test_size=0.3, random_state=1)

# Polynomial regression, nth order

plt.scatter(X_test1, y_test1, s=10, alpha=0.3)

for degree in [1,2,3,4,5]:
    model = make_pipeline(PolynomialFeatures(degree), Ridge())
    model.fit(X_train1,y_train1)
    y_plot = model.predict(X_test1)
    plt.plot(X_test1, y_plot, label="degree %d" % degree
             +'; $q^2$: %.2f' % model.score(X_train1, y_train1)
             +'; $R^2$: %.2f' % model.score(X_test1, y_test1))


plt.legend(loc='upper right')

plt.show()

My guess is that this happens because "X_test1, y_plot" are not sorted properly?

X_test1 is a numpy array like this:

[[  5.49]
 [ 16.65]
 [ 17.09]
 ....
 [ 25.68]
 [ 24.39]]

y_plot is a numpy array like this:

[[ 29.78517812]
 [ 17.16759833]
 [ 16.86462359]
 [ 23.18680265]
 ...
 [ 37.7631725 ]]

I tried sorting with this:

 [X_test1, y_plot] = zip(*sorted(zip(X_test1, y_plot), key=lambda y_plot: y_plot[0]))

     plt.plot(X_test1, y_plot, label="degree %d" % degree
              +'; $q^2$: %.2f' % model.score(X_train1, y_train1)
              +'; $R^2$: %.2f' % model.score(X_test1, y_test1))

The curves look normal now, but the result is strange: R^2 is negative.

Can anyone tell me what the real problem is, or how to sort correctly? Thanks!

【Comments】:

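This is especially strange, since the square of any real number should be positive... imaginary numbers?!
Have you tried passing reverse=True to sorted to reverse the sort? I don't know whether it will work, but it's worth a try.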
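
On the first comment: for regressors, the value returned by scikit-learn's score is the coefficient of determination, 1 - SS_res / SS_tot, not the literal square of a real number. It drops below zero whenever the predictions fit the targets worse than a constant prediction of the mean would. A minimal sketch with toy numbers (the helper name r_squared and the data are illustrative assumptions, not taken from the question):

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # spread around the mean
    return 1.0 - ss_res / ss_tot


# Toy targets, chosen so the sums are easy to check by hand (SS_tot = 5).
y_true = [1.0, 2.0, 3.0, 4.0]

r2_perfect = r_squared(y_true, y_true)             # SS_res = 0
r2_mean = r_squared(y_true, [2.5, 2.5, 2.5, 2.5])  # SS_res = SS_tot
r2_reversed = r_squared(y_true, y_true[::-1])      # SS_res = 20, pairing scrambled
```

Here r2_perfect is 1.0, r2_mean is 0.0, and r2_reversed is -3.0. The last case mimics what happens when predictions are compared against targets in the wrong order.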

Tags: python python-2.7 matplotlib scikit-learn

【Solution 1】:

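The plot is correct now, but your sort broke the pairing between X_test1 and y_test1, because you forgot to sort y_test1 in the same way. The best solution is to sort right after the split. The y_plot computed afterwards will then automatically be in the right order (untested example, with numpy imported as np):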

X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT, MEDV, test_size=0.3, random_state=1)

# X_test1 has shape (n, 1), so argsort has to run over its single column
sorted_index = np.argsort(X_test1[:, 0])
X_test1 = X_test1[sorted_index]
y_test1 = y_test1[sorted_index]
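
To make the effect of the pairing concrete, here is a rough end-to-end sketch of the sort-after-split approach. It rests on several assumptions: the data is synthetic (a stand-in for the LSTAT and MEDV columns, since the CSV is not available), train_test_split is imported from sklearn.model_selection (its location in newer scikit-learn releases; the question's 0.17.0 uses sklearn.cross_validation), and the variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in for LSTAT / MEDV (an assumption, not the real data).
rng = np.random.RandomState(0)
x = rng.uniform(2.0, 35.0, size=(200, 1))
y = 50.0 * np.exp(-0.05 * x) + rng.normal(scale=2.0, size=(200, 1))

X_train, X_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=1)

# Sort the test rows once by the single feature column and apply the same
# permutation to the targets, so every (x, y) pair stays together.
order = np.argsort(X_test[:, 0])
X_test_sorted = X_test[order]
y_test_sorted = y_test[order]

model = make_pipeline(PolynomialFeatures(degree=3), Ridge())
model.fit(X_train, y_train)

# Predictions come back in ascending-x order, ready for a line plot.
y_plot = model.predict(X_test_sorted)

# Same pairs in a different row order give the same score.
r2_sorted = model.score(X_test_sorted, y_test_sorted)
r2_unsorted = model.score(X_test, y_test)

# The mistake from the question: X sorted, y_test left in its old order.
r2_mismatched = model.score(X_test_sorted, y_test)
```

Passing X_test_sorted and y_plot to plt.plot should then draw one clean curve per degree. Consistent sorting leaves the score unchanged, while r2_mismatched compares predictions against the wrong targets, which is the likely source of the negative R^2 in the question.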

【Comments】: