【发布时间】:2016-05-30 09:01:23
【问题描述】:
在python 2.7.6,matlablib,scikit learn 0.17.0,当我在散点图上做多项式回归线时,多项式曲线会很乱,像这样:
脚本是这样的:它会读取两列浮动数据,做散点图和回归
import pandas as pd
import scipy.stats as stats
import pylab
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
import pylab as pl
import sklearn
from sklearn import preprocessing
from sklearn.cross_validation import train_test_split
from sklearn import datasets, linear_model
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import Ridge
df=pd.read_csv("boston_real_estate_market_clean.csv")
LSTAT = df['LSTAT'].as_matrix()
LSTAT=LSTAT.reshape(LSTAT.shape[0], 1)
MEDV=df['MEDV'].as_matrix()
MEDV=MEDV.reshape(MEDV.shape[0], 1)
# Train test set split
X_train1, X_test1, y_train1, y_test1 = train_test_split(LSTAT,MEDV,test_size=0.3,random_state=1)
# Ploynomial Regression-nst order
plt.scatter(X_test1, y_test1, s=10, alpha=0.3)
for degree in [1,2,3,4,5]:
model = make_pipeline(PolynomialFeatures(degree), Ridge())
model.fit(X_train1,y_train1)
y_plot = model.predict(X_test1)
plt.plot(X_test1, y_plot, label="degree %d" % degree
+'; $q^2$: %.2f' % model.score(X_train1, y_train1)
+'; $R^2$: %.2f' % model.score(X_test1, y_test1))
plt.legend(loc='upper right')
plt.show()
我猜是因为“X_test1, y_plot”没有正确排序?
X_test1 是一个像这样的 numpy 数组:
[[ 5.49]
[ 16.65]
[ 17.09]
....
[ 25.68]
[ 24.39]]
yplot 是一个像这样的 numpy 数组:
[[ 29.78517812]
[ 17.16759833]
[ 16.86462359]
[ 23.18680265]
...[ 37.7631725 ]]
我尝试用这个排序:
[X_test1, y_plot] = zip(*sorted(zip(X_test1, y_plot), key=lambda y_plot: y_plot[0]))
plt.plot(X_test1, y_plot, label="degree %d" % degree
+'; $q^2$: %.2f' % model.score(X_train1, y_train1)
+'; $R^2$: %.2f' % model.score(X_test1, y_test1))
曲线现在看起来很正常,但结果很奇怪,R^2 为负数。
任何大师都可以告诉我真正的问题是或如何正确排序吗?谢谢!
【问题讨论】:
-
这特别奇怪,因为任何实数的平方都应该是正数……虚数?!
-
您是否尝试过使用
reverse = True作为sorted的参数来反转排序?不知道它是否会起作用,但值得一试。
标签: python python-2.7 matplotlib scikit-learn