【Question title】: sklearn model returns a mean absolute error of 0, why?
【Posted】: 2021-03-09 01:48:14
【Question】:

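Playing around with sklearn, I want to predict the TSLA Close price for a few dates using the Open, High, and Low prices and Volume. I used a very basic model to predict the closing price, and the predictions are supposedly 100% accurate, but I don't know why. An error of 0% makes me feel that I haven't set up the model correctly.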

Code:

from os import X_OK
from numpy.lib.shape_base import apply_along_axis
import pandas as pd
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

tsla_data_path = "/Users/simon/Documents/PythonVS/ML/TSLA.csv"
tsla_data = pd.read_csv(tsla_data_path)
tsla_features = ['Open','High','Low','Volume']

y = tsla_data.Close
X = tsla_data[tsla_features]

# define model
tesla_model = DecisionTreeRegressor(random_state = 1)

# fit model
tesla_model.fit(X,y)

#print results
print('making predictions for the following five dates')
print(X.head())
print('________________________________________________')
print('the predictions are')
print(tesla_model.predict(X.head()))
print('________________________________________________')
print('the error is ')
print(mean_absolute_error(y.head(),tesla_model.predict(X.head())))

Output:

making predictions for the following five dates
        Open       High        Low    Volume
0  67.054001  67.099998  65.419998  39737000
1  66.223999  66.786003  65.713997  27778000
2  66.222000  66.251999  65.500000  12328000
3  65.879997  67.276001  65.737999  30372500
4  66.524002  67.582001  66.438004  32868500
________________________________________________
the predictions are
[65.783997 66.258003 65.987999 66.973999 67.239998]
________________________________________________
the error is
0.0

Data:

Date,Open,High,Low,Close,Adj_Close,Volume
2019-11-26,67.054001,67.099998,65.419998,65.783997,65.783997,39737000
2019-11-27,66.223999,66.786003,65.713997,66.258003,66.258003,27778000
2019-11-29,66.222000,66.251999,65.500000,65.987999,65.987999,12328000
2019-12-02,65.879997,67.276001,65.737999,66.973999,66.973999,30372500
2019-12-03,66.524002,67.582001,66.438004,67.239998,67.239998,32868500

【Question comments】:

  • You are predicting on the same set that you used for fit.

Tags: python pandas machine-learning scikit-learn


【Solution 1】:

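Measuring a model's performance on the same dataset that was used to train it is a mistake. A decision tree with no depth limit can memorize every training row, so its error on that data comes out as 0.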

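If you want a proper performance metric, you should split the dataset into two: one to train the model and the other to measure its performance. You can split the dataset with sklearn.model_selection.train_test_split(), as follows: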

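from sklearn.model_selection import train_test_split
tesla_model = DecisionTreeRegressor(random_state = 1)
# hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# fit on the training split only (features and targets)
tesla_model.fit(X_train, y_train)
# score on the held-out rows the model has not seen
mae = mean_absolute_error(y_test,tesla_model.predict(X_test))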

See this Wikipedia page, which explains the different datasets used in ML.
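
To see the effect without the TSLA file, here is a minimal sketch on synthetic data. The random features and noisy target are made-up stand-ins, not the asker's data. It fits the same kind of unconstrained DecisionTreeRegressor and compares the error on the training rows with the error on held-out rows. Under these assumptions the training MAE should be 0, because every training row ends up in its own leaf, while the held-out MAE is not.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the price data (hypothetical, NOT the TSLA file).
rng = np.random.default_rng(0)
X = rng.uniform(60.0, 70.0, size=(200, 4))            # 4 continuous features, all rows distinct
y = X.mean(axis=1) + rng.normal(0.0, 0.5, size=200)   # target = feature mean plus noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Default settings: no depth limit, so the tree keeps splitting
# until each leaf holds a single training row.
model = DecisionTreeRegressor(random_state=1)
model.fit(X_train, y_train)

# Error on the rows the tree was fitted on (memorized).
train_mae = mean_absolute_error(y_train, model.predict(X_train))
# Error on rows the tree has never seen.
test_mae = mean_absolute_error(y_test, model.predict(X_test))

print("MAE on training rows:", train_mae)
print("MAE on held-out rows:", test_mae)
```

Only the held-out number says anything about how the model would do on new dates.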

【Comments】:
