【Question title】: Python: replacing missing values
【Posted】: 2021-09-24 22:07:08
【Question】:

I am trying to replace the missing values in one specific column of a DataFrame, but I am running into problems. I tried:

from sklearn.impute import SimpleImputer
fill_0_with_mean = SimpleImputer(missing_values=0, strategy='mean')
X_train['Age'] = fill_0_with_mean.fit_transform(X_train['Age'])

X_train[:,15] = fill_0_with_mean.fit_transform(X_train[:,15])

X_train[:,15:16] = fill_0_with_mean.fit_transform(X_train[:,15:16])

X_train['Age'] = fill_0_with_mean.fit_transform(X_train['Age'].values)

X_train[:,15:16] = fill_0_with_mean.fit_transform(X_train[:,15:16].values)

But I always get either ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). or IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices

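My data contains both zeros and missing (NaN) values. Can the imputer only handle one of the two? How should I do this? I also tried converting my Age column to integers: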

X_train['Age'] = X_train['Age'].as_type('int32')

But that only gives me other errors.

My data looks like this (Age column):

Age
0 31.0
1 79.0
2 53.0
3 40.0
4 55.0
...
44872 NaN
44873 NaN
44874 NaN
44875 NaN
44876 NaN

Could it be that I am mixing up NumPy and pandas? I used this to split my data into training and test sets:

from sklearn.model_selection import train_test_split

dep_var = ['is_overdue']
features = model_data2.columns
features = features.drop(dep_var)

print(features)

X = model_data2[features].values
Y = model_data2[dep_var].values

split_test_size = 0.30

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=split_test_size, random_state=42) 

Any help is much appreciated.

【Comments on the question】:

Tags: python pandas numpy scikit-learn


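【Solution 1】:

Since you want to replace 0 with the mean, you first have to fill the NaN values with 0, so that the imputer treats both as missing: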

fill_0_with_mean = SimpleImputer(missing_values=0, strategy='mean')
X_train['Age'] = fill_0_with_mean.fit_transform(X_train['Age'].fillna(0))
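A minimal, self-contained sketch of this idea, assuming X_train is a pandas DataFrame (the toy data below is hypothetical and only stands in for the asker's training set). SimpleImputer expects 2-D input, so the column is selected with double brackets and the result is flattened again before assignment:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data: one zero and one NaN that should both be imputed.
X_train = pd.DataFrame({"Age": [31.0, 79.0, 0.0, 40.0, np.nan]})

# Map NaN to 0 so that a single missing-value marker (0) covers both cases.
# Double brackets keep the selection 2-D with shape (n_samples, 1).
age = X_train[["Age"]].fillna(0)

fill_0_with_mean = SimpleImputer(missing_values=0, strategy="mean")

# fit_transform returns a 2-D NumPy array by default; ravel() makes it 1-D
# so it can be assigned back to a single column.
X_train["Age"] = fill_0_with_mean.fit_transform(age).ravel()

print(X_train["Age"].tolist())  # [31.0, 79.0, 50.0, 40.0, 50.0]
```

The mean is computed only over the non-zero entries (31, 79 and 40), so the zero and the former NaN both become 50.0.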

【Comments】:

  • Then I get this error: `IndexError: only integers, slices (:), ellipsis (...), numpy.newaxis (None) and integer or boolean arrays are valid indices`
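This IndexError is consistent with X_train being a NumPy array rather than a DataFrame: the question builds X and Y with `.values`, and an array cannot be indexed with a column name such as 'Age'. One possible way around it, sketched below under the assumption that dropping `.values` is acceptable for the rest of the pipeline, is to pass the DataFrames straight to train_test_split. The toy model_data2 is hypothetical; the imputer is fitted on the training split only and then reused on the test split.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's model_data2.
model_data2 = pd.DataFrame({
    "Age": [31.0, 79.0, 0.0, 40.0, np.nan, 55.0, 53.0, np.nan, 0.0, 62.0],
    "is_overdue": [0, 1, 0, 0, 1, 0, 1, 0, 1, 0],
})

dep_var = ["is_overdue"]
features = model_data2.columns.drop(dep_var)

# No .values here: X and Y stay pandas objects, so column names survive the split.
X = model_data2[features]
Y = model_data2[dep_var]

X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42
)

# Copy before assigning, so the split results are modified independently
# of the original frame.
X_train = X_train.copy()
X_test = X_test.copy()

imputer = SimpleImputer(missing_values=0, strategy="mean")

# Learn the mean from the training data, then apply the same mean to the test data.
X_train["Age"] = imputer.fit_transform(X_train[["Age"]].fillna(0)).ravel()
X_test["Age"] = imputer.transform(X_test[["Age"]].fillna(0)).ravel()

print(type(X_train).__name__, X_train["Age"].isna().sum())
```

If the data has to stay a NumPy array, the column would instead be addressed by position with a 2-D slice such as X_train[:, 15:16], after replacing its NaN entries with 0.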