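【Posted】:2021-05-28 14:27:47
【Question】:
Here is a sample of my data.
I wrote code that drops all categorical columns (for example, MSZoning). However, some of the non-categorical columns contain NA values. How can I exclude those from my dataset?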
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def main():
    print('Starting program execution')
    iowa_train_prices_file_path = 'C:\\...\\programs\\python\\kaggle_competition_iowa_house_prices_train.csv'
    iowa_file_data = pd.read_csv(iowa_train_prices_file_path)
    print('Read file')
    model_random_forest = RandomForestRegressor(random_state=1)
    features = ['MSSubClass','MSZoning',...]
    y = iowa_file_data.SalePrice
    # every column except SalePrice
    X = iowa_file_data.drop('SalePrice', axis=1)
    # The object dtype indicates a column has text (a hint that the column is categorical)
    X_dropped = X.select_dtypes(exclude=['object'])
    print("fitting model")
    # This call raises the ValueError described below
    model_random_forest.fit(X_dropped, y)
    print("MAE of dropped categorical approach")
    pd.set_option('display.max_rows', 500)
    pd.set_option('display.max_columns', 500)
    pd.set_option('display.width', 1000)

main()
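When I run the program, I get the error ValueError: Input contains NaN, infinity or a value too large for dtype('float32'), which I believe is caused by the NA value in the row with Id=8.
Question 1 - How do I remove such rows completely?
Question 2 - What is the type of columns like these, which are mostly numbers but have text in between? I thought print("X types",type(X.columns)) would tell me, but it does not show the column types.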
【Discussion】:
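A minimal sketch touching both questions, assuming the NA entries in the CSV are the literal text NA. The four rows below are invented for illustration and are not taken from the Kaggle file; `csv_text`, `toy`, `X_numeric`, `X_clean` and `y_clean` are hypothetical names. By default `pd.read_csv` treats the string NA as a missing value, so a column of numbers with an NA in it is typically loaded as float64 with NaN rather than as text. That is why it survives the object-dtype filter and then trips the model. Note also that `type(X.columns)` only reports the type of the column index object; `X.dtypes` is what lists the type of each column. The sketch selects numeric columns with `include="number"` as a stand-in for the question's `exclude=['object']`.

```python
import io
import pandas as pd

# Invented stand-in for the training CSV (values are illustrative only).
csv_text = """Id,MSZoning,LotFrontage,SalePrice
1,RL,65,208500
2,RL,80,181500
3,RM,NA,223500
4,RL,60,140000
"""
toy = pd.read_csv(io.StringIO(csv_text))

X = toy.drop("SalePrice", axis=1)
y = toy["SalePrice"]

# Question 2: dtypes shows the type of every column.
# LotFrontage is float64 because the NA entry was read as NaN.
print(X.dtypes)

# Keep only the numeric columns.
X_numeric = X.select_dtypes(include="number")

# Count missing values per column to see which ones would break the fit.
print(X_numeric.isna().sum())

# Question 1: drop every row that still contains a NaN,
# then select the matching target values so X and y stay aligned.
X_clean = X_numeric.dropna(axis=0)
y_clean = y.loc[X_clean.index]

print(X_clean.shape, y_clean.shape)
```

With the real data, `X_clean` and `y_clean` would be passed to `model_random_forest.fit` in place of `X_dropped` and `y`. Dropping rows discards every house that has any missing numeric value, so it is worth checking how many rows remain before fitting.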