Using pandas TimeStamp with scikit-learn: answers

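【Question title】: Using pandas TimeStamp with scikit-learn
【Posted】: 2016-05-28 03:43:15
【Question】:

sklearn classifiers accept a pandas TimeStamp (=datetime64[ns]) as a column in X, as long as all columns of X are of that type. But when X contains both TimeStamp and float columns, sklearn refuses to work with the TimeStamp.

Is there any workaround besides converting the TimeStamp to int with astype(int)? (I still need the original column to access dt.year and so on, so ideally I would rather not create a duplicate column just to feed features to sklearn.)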

import pandas as pd
from sklearn.linear_model import LinearRegression
test = pd.date_range('20000101', periods = 100)
test_df = pd.DataFrame({'date': test})
test_df['a'] = 1
test_df['y'] = 1
lr = LinearRegression()
lr.fit(test_df[['date']], test_df['y']) # works fine
lr.fit(test_df[['date', 'date']], test_df['y']) # works fine
lr.fit(test_df[['date', 'a']], test_df['y']) # complains

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-90-0605fa5bcdfa> in <module>()
----> 1 lr.fit(test_df[['date', 'a']], test_df['y'])

/home/shoya/.pyenv/versions/3.5.0/envs/study-env/lib/python3.5/site-packages/sklearn/linear_model/base.py in fit(self, X, y, sample_weight)
    434         n_jobs_ = self.n_jobs
    435         X, y = check_X_y(X, y, accept_sparse=['csr', 'csc', 'coo'],
--> 436                          y_numeric=True, multi_output=True)
    437 
    438         if ((sample_weight is not None) and np.atleast_1d(

/home/shoya/.pyenv/versions/3.5.0/envs/study-env/lib/python3.5/site-packages/sklearn/utils/validation.py in check_X_y(X, y, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, multi_output, ensure_min_samples, ensure_min_features, y_numeric, warn_on_dtype, estimator)
    521     X = check_array(X, accept_sparse, dtype, order, copy, force_all_finite,
    522                     ensure_2d, allow_nd, ensure_min_samples,
--> 523                     ensure_min_features, warn_on_dtype, estimator)
    524     if multi_output:
    525         y = check_array(y, 'csr', force_all_finite=True, ensure_2d=False,

/home/shoya/.pyenv/versions/3.5.0/envs/study-env/lib/python3.5/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    402         # make sure we acually converted to numeric:
    403         if dtype_numeric and array.dtype.kind == "O":
--> 404             array = array.astype(np.float64)
    405         if not allow_nd and array.ndim >= 3:
    406             raise ValueError("Found array with dim %d. %s expected <= 2."

TypeError: float() argument must be a string or a number, not 'Timestamp'

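Apparently, when the dtypes are mixed, the ndarray has dtype object, and sklearn tries to convert it to float, which fails on TimeStamp. But when all the dtypes are datetime64[ns], sklearn just leaves the array as it is.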
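
The dtype behavior described above can be checked directly. This is a minimal sketch that rebuilds the question's test_df and only inspects the NumPy arrays pandas produces; it makes no claim about how any particular scikit-learn version handles them.

```python
import pandas as pd

test_df = pd.DataFrame({'date': pd.date_range('20000101', periods=100)})
test_df['a'] = 1

# All-datetime selection: the resulting array keeps a datetime64 dtype
only_dates = test_df[['date', 'date']].to_numpy()
print(only_dates.dtype.kind)   # 'M' is NumPy's kind code for datetime64

# datetime + integer: there is no common dtype, so pandas falls back to object
mixed = test_df[['date', 'a']].to_numpy()
print(mixed.dtype)             # object
print(type(mixed[0, 0]))       # a pandas Timestamp, which float() cannot convert
```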

【Comments】:

Has the problem been solved? Can you share your solution?
You need at least two features for linear regression to work

Tags: python python-3.x datetime pandas scikit-learn

【Solution 1】:

I usually convert the DateTime into the features of interest, such as year, month, day, hour, and minute.

df['Year'] = df['Timestamp'].apply(lambda time: time.year)

df['Month'] = df['Timestamp'].apply(lambda time: time.month)

df['Day'] = df['Timestamp'].apply(lambda time: time.day)

df['Hour'] = df['Timestamp'].apply(lambda time: time.hour)

df['Minute'] = df['Timestamp'].apply(lambda time: time.minute)
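
The same features can also be read through the vectorized `.dt` accessor and collected in a separate frame, which leaves the original DataFrame without extra columns. This is a minimal sketch; the small `df` below is a hypothetical stand-in for the answer's data, and `features` is an illustrative name.

```python
import pandas as pd

# Hypothetical frame with a 'Timestamp' column, standing in for the answer's df
df = pd.DataFrame({
    'Timestamp': pd.date_range('2016-05-28 03:43', periods=3,
                               freq=pd.Timedelta(hours=1)),
})

# Build the calendar features in their own frame; df itself is not modified
ts = df['Timestamp'].dt
features = pd.DataFrame({
    'Year': ts.year,
    'Month': ts.month,
    'Day': ts.day,
    'Hour': ts.hour,
    'Minute': ts.minute,
}, index=df.index)

print(features)
```

`features` contains only integer columns, so it can be passed to an estimator on its own or joined with other numeric columns.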

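【Comments】:

【Solution 2】:

You want to fit X and y, where X holds the features (two or more) and y is the target. Use your DatetimeIndex as a time series, not as a feature. In my example, I fit earthquakes with mag > 7 and compute the number of days elapsed between consecutive earthquakes. The elapsed days, together with depth, latitude, and longitude, are fed to a linear regression model.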

import numpy as np
import matplotlib.pyplot as plt

# df is the earthquake DataFrame, assumed to be indexed by event time (DatetimeIndex)
events = df[df.mag > 7].sort_index()

# Time of the preceding event and the gap to it; the first event has no predecessor (NaT)
event_times = events.index.to_series()
events['previous'] = event_times.shift().to_numpy()
events['time_delta'] = event_times.diff().to_numpy()

# Gap in whole days; the first event gets 0
events['elapsed_days'] = events['time_delta'].dt.days.fillna(0)

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

X=events[['latitude','longitude','elapsed_days','depth']]
y=np.nan_to_num(events['mag'])
X_train,X_test,y_train, y_test=train_test_split(X,y,test_size=0.3,random_state=42)

lr = LinearRegression()

lr.fit(X_train, y_train)  # fit on the training split only, so X_test stays unseen
y_pred=lr.predict(X_test)

fig, ax= plt.subplots()
ax.plot(X_test['elapsed_days'],y_pred)
plt.title('Magnitude Prediction')
plt.show()
fig, ax= plt.subplots()
ax.plot(events.index,np.nan_to_num(events['mag']))
plt.xticks(rotation=90)
plt.legend(['Magnitude'])
twin_ax=ax.twinx()
twin_ax.plot(events.index,events['elapsed_days'],color='red')
plt.legend(['Elapsed Days'],loc=1)
plt.show()

【Comments】:

What does the DataFrame look like? Could you print its head?

【Solution 3】:

You can convert it to a suitable integer or float

test_df['date'] = test_df['date'].astype(int)
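
If the original column has to stay a datetime for `.dt` access, the conversion can be done on a temporary copy instead of overwriting test_df['date']. This is a minimal sketch, assuming seconds since the Unix epoch is an acceptable numeric encoding; the names `X` and `epoch` are illustrative and not part of the answer.

```python
import numpy as np
import pandas as pd

test_df = pd.DataFrame({'date': pd.date_range('20000101', periods=100)})
test_df['a'] = 1

# assign() returns a new frame, so test_df['date'] keeps its datetime64 dtype
epoch = pd.Timestamp('1970-01-01')
X = test_df[['date', 'a']].assign(
    date=lambda d: (d['date'] - epoch).dt.total_seconds()
)

print(np.asarray(X).dtype)          # float64: a purely numeric matrix
print(test_df['date'].dt.year[0])   # 2000: the original column is untouched
```

`X` can then be passed to `lr.fit(X, test_df['y'])` in place of `test_df[['date', 'a']]`.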

【Comments】:

Why the downvote? I second this, and I would convert to seconds, as shown in this post right here