Using a list of values from train_test_split() as training data

【Question title】: Using a list of values from train_test_split() as training data
【Posted】: 2021-08-03 07:43:17
【Question description】:

I am trying to run a linear regression on some data. This is what the data looks like.

X = df['vectors'] looks like this:

0      [-1.86135, 1.3202, 0.023501, -2.9511, 1.62135,...
1      [0.5487195, 0.27389452, 0.49712706, 0.6853927,...
2      [-1.3525691, -0.8444542, 2.8269022, -1.4456564...
3      [1.0730275, -0.14970247, -1.1424525, -1.953272...
4      [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...

When I fit a linear regression model on it:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)

I get this error:

TypeError                                 Traceback (most recent call last)
TypeError: only size-1 arrays can be converted to Python scalars

The above exception was the direct cause of the following exception:

How do I convert the values in X to scalars? I was thinking of taking the mean of each vector, but I'm not sure how to go about it.

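【Comments】:

Do all the lists in X have the same number of elements?
@ArturoSbr Yes, they do; they are all padded.
X does not look like a list. It looks like a pd.Series from a stratified split for cross-validation, where the index is the fold number and each value is itself a list (of numeric values). Rather than assuming you know the type of X, check the values of type(X) and type(X.loc[0]).
Who said X is a list?
No worries. Did the answer I posted help? Let me know if you run into any problems.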
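
The type check suggested in the comments can be sketched as follows. The Series here is a made-up stand-in for df['vectors'] (the real data is not shown in full), so treat the values as an assumption:

```python
import pandas as pd

# Hypothetical stand-in for df['vectors']: each cell holds a list of floats.
X = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.0, 0.0, 0.0]])

container_type = type(X)       # type of the outer container
element_type = type(X.loc[0])  # type of what a single row holds
lengths = X.map(len)           # number of elements in each row

print(container_type)
print(element_type)
print(lengths.nunique() == 1)  # True when every row has the same length
```

If the rows turn out to be lists of equal length, the column can be turned into a regular 2D array.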

Tags: python pandas numpy machine-learning scikit-learn

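【Solution 1】:

By the looks of it, X is a pandas.Series object.

Since all the lists in the rows of X have the same length, you can reshape X into an ndarray with as many rows as X and as many columns as there are elements in each list.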

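# Imports
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Reshape into a 2D array: one row per list, one column per list element
X = np.array(X.explode()).reshape(len(X), -1)

# Do the same as before
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.4, random_state=101)
lm = LinearRegression()
lm.fit(X_train, y_train)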
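
One detail worth checking with this approach: Series.explode() on a column of lists returns an object-dtype Series, so the reshaped array is object dtype as well unless it is cast. A minimal sketch on made-up data (the two rows below are assumptions, not the asker's data) that casts to float explicitly:

```python
import numpy as np
import pandas as pd

# Hypothetical two-row example; each row is a list of three floats.
X = pd.Series([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])

exploded = X.explode()  # one scalar per row, object dtype
X_2d = np.array(exploded, dtype=float).reshape(len(X), -1)

print(exploded.dtype)
print(X_2d.shape, X_2d.dtype)
```

The reshape relies on every list having the same length; with ragged lists it would either fail or silently misalign the rows.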

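【Comments】:

【Solution 2】:

Try converting that list to an array with numpy.array and then make it two-dimensional: sklearn works with arrays and expects X to be 2D, with one row per sample and one column per feature.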
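
A minimal sketch of that conversion, assuming X is a Series of equal-length lists. The toy data, the target y, and the name X_2d are invented for illustration; np.array(X.tolist()) is one possible way to get the 2D array, not necessarily what the answerer had in mind:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, each a list of 3 floats.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(10, 3))
X = pd.Series(vectors.tolist())
y = vectors @ np.array([1.0, -2.0, 0.5])  # exactly linear target

# A list of equal-length lists converts directly to a 2D array.
X_2d = np.array(X.tolist(), dtype=float)  # shape (n_samples, n_features)

X_train, X_test, y_train, y_test = train_test_split(
    X_2d, y, test_size=0.4, random_state=101
)
lm = LinearRegression()
lm.fit(X_train, y_train)
print(X_2d.shape, lm.coef_)
```

Each list element becomes its own feature column, so the regression gets one coefficient per vector component instead of a single scalar per row.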

【Comments】: