【Question Title】: Is there a way to improve the model performance?
【Posted】: 2021-09-06 21:59:50
【Question】:
# Import the Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df1 = pd.read_csv('/content/drive/MyDrive/Regression/train.csv')

df1.shape

a = [x for x in df1.columns if df1[x].dtype == 'O']  # Categorical Columns
len(a)

b =  [x for x in df1.columns if df1[x].dtype != 'O']  # Numerical Columns
len(b)

df1[a]

df1[b]

# Filling the Categorical columns

def fill_in(dataset):
    for i in dataset.columns:
        # .isna() must be called and reduced with .any(); a bare .isna is
        # a bound method and is always truthy
        if dataset[i].isna().any() and dataset[i].dtype == 'O':
            dataset[i].fillna('missing', inplace=True)
    return dataset

fill_in(df1)

# Filling the Numerical columns



def filling_integer(dataset):
    for i in dataset.columns:
        # same fix as above: call .isna().any(); also normalize the indentation
        if dataset[i].isna().any() and dataset[i].dtype != 'O':
            dataset[i].fillna(dataset[i].median(), inplace=True)
    return dataset
filling_integer(df1)



sns.heatmap(df1.isna())

"""Check for Outliers"""

for i in b:
  plt.title(i)
  sns.boxplot(x=df1[i])
  plt.show()

"""Handling the outliers"""

!pip install feature-engine

from feature_engine.outliers import Winsorizer

# for Q-Q plots
import scipy.stats as stats

# create the capper

windsoriser = Winsorizer(capping_method='quantiles', # choose from iqr, gaussian or quantiles
                          tail='both', # cap left, right or both tails 
                          fold=0.05,
                          variables= list(df1[b]))

windsoriser.fit(df1)

df1_t = windsoriser.transform(df1)

# function to create boxplot.


def diagnostic_plots(df, variable):
    # function takes a dataframe (df) and
    # the variable of interest as arguments

    # define figure size
    plt.figure(figsize=(16, 4))

    # boxplot
    plt.subplot(1, 3, 3)
    sns.boxplot(y=df[variable])
    plt.title('Boxplot')

    plt.show()

diagnostic_plots(df1, 'SalePrice')
diagnostic_plots(df1_t, 'SalePrice')

diagnostic_plots(df1, 'WoodDeckSF')
diagnostic_plots(df1_t, 'WoodDeckSF')

df1.shape, df1_t.shape

df1_t.head().T

"""Converting Categorical into Numerical"""

for feature in a:
    labels_ordered=df1_t.groupby([feature])['SalePrice'].mean().sort_values().index
    labels_ordered={k:i for i,k in enumerate(labels_ordered,0)}
    df1_t[feature]=df1_t[feature].map(labels_ordered)

df1_t

"""Scale the Features"""

scale= [feature for feature in df1_t.columns if feature not in ['Id','SalePrice']]

from sklearn.preprocessing import MinMaxScaler
scaler=MinMaxScaler()
scaler.fit(df1_t[scale])

scaler.transform(df1_t[scale])

data = pd.concat([df1_t[['Id', 'SalePrice']].reset_index(drop=True),
                    pd.DataFrame(scaler.transform(df1_t[scale]), columns=scale)],axis=1)

data

X = data.drop(['Id','SalePrice'],axis=1)

y = data[['SalePrice']]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25)

import tensorflow as tf

ann = tf.keras.models.Sequential()

ann.add(tf.keras.layers.Dense(units=128, activation='relu'))

ann.add(tf.keras.layers.Dense(units=128,activation='relu'))

ann.add(tf.keras.layers.Dense(units=1))

ann.compile(optimizer = 'adam', loss = 'mean_squared_error')

ann.fit(X_train, y_train, batch_size = 32, epochs = 100)

I am using an ANN for a house-price regression problem, and the model's performance is very poor. Even after training for 100 epochs with 2 hidden layers of 128 nodes each, the loss is still huge:

Epoch 100/100
35/35 [==============================] - 0s 2ms/step - loss: 633115520.0000

Where am I going wrong? Can someone help me understand? Thanks in advance :)

【Comments】:

  • Please post a minimal reproducible example, not a link to the code
  • @ForceBru Hi, thanks for correcting me. I have updated the post. Can you tell me where I went wrong?
  • Use this and check the accuracy as well: ann.compile(optimizer = 'adam', loss = 'mean_squared_error', metrics=['accuracy'])
  • @PrakashDahal Hi, thanks for the reply. I tried it that way too, but the accuracy is very, very poor. Epoch 100/100 35/35 [==============================] - 0s 2ms/step - loss: 679746304.0000 - accuracy: 0.0000e+00
  • Try removing the windsoriser from the code and check what the new accuracy is

Tags: python deep-learning neural-network regression


【Solution 1】:

I think you should post this question on the Kaggle Discussion Forum, either the one specific to this dataset or the general discussion forum. That said, there is still plenty of room for improvement.

Categorical columns

Take the LotConfig column, for example. You have labeled all of its classes with numbers, so the model will understand Inside to be less preferable than Corner, because in the dataset Inside is assigned 0 and Corner 0.25.

The model will favor Corner, since it has the highest numeric value, and will be biased toward it, when in reality it should not be.
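One alternative (a sketch by the editor, not part of the original answer, using made-up rows in place of the real dataset) is to one-hot encode nominal columns such as LotConfig, so that no artificial ordering is implied between classes:

```python
import pandas as pd

# Toy frame standing in for df1; the LotConfig values are real Ames classes
df = pd.DataFrame({'LotConfig': ['Inside', 'Corner', 'FR2', 'Inside'],
                   'SalePrice': [200000, 250000, 180000, 210000]})

# Each class becomes its own 0/1 column, with no implied ranking
encoded = pd.get_dummies(df, columns=['LotConfig'])
print(sorted(c for c in encoded.columns if c.startswith('LotConfig_')))
# ['LotConfig_Corner', 'LotConfig_FR2', 'LotConfig_Inside']
```

With one-hot columns, the network can weight each class independently instead of treating the encoding as a magnitude.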

Numerical columns

For every missing value in a numerical column you fill in the median, which is wrong: it distorts the nature of the column. Suppose a column has 60% missing values; filling all of them with the median will make it completely different from what was actually observed. The missing values in each column must be assessed and filled differently.
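One way to make that assessment (a hypothetical helper sketched by the editor, with a made-up toy frame; the 0.3 threshold is an arbitrary illustration, not a rule from the answer) is to check the missing fraction per column before deciding whether median imputation is appropriate at all:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df1's numeric columns
df = pd.DataFrame({'LotFrontage': [65.0, np.nan, 68.0, np.nan],
                   'GrLivArea': [1710, 1262, 1786, 1717]})

# Fraction of missing values per column
missing_frac = df.isna().mean()
print(missing_frac['LotFrontage'])  # 0.5

# Median-impute only lightly missing columns; heavily missing ones
# deserve different treatment (an indicator column, a model-based
# imputer, or dropping the column entirely)
for col in df.columns:
    if 0 < missing_frac[col] <= 0.3:
        df[col] = df[col].fillna(df[col].median())
```

Here LotFrontage is 50% missing, so it is deliberately left untouched rather than flattened onto its median.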

Note: since the dataset has many categorical columns and few classes per column, a tree-based algorithm may work better.
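As a sketch of that suggestion (on synthetic data invented by the editor, not the actual competition set), a tree-based regressor plugs in with almost no preprocessing:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data: a linear term plus a nonlinear term
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
y = X[:, 0] * 3 + X[:, 1] ** 2 + rng.normal(scale=0.1, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Trees are insensitive to feature scaling and handle label-encoded
# categoricals without assuming the encoding's numeric order is meaningful
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))  # R^2 on held-out data
```

For a baseline, this often outperforms a small fully connected network on tabular data like the house-price set, and it needs no MinMax scaling at all.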

【Discussion】:
