【Question Title】: How do I find the false positive and false negative rates for a neural network?
【Posted】: 2019-08-28 07:56:31
【Question Description】:

I have the code below, which works well as a neural network. I know I need the confusion matrix library to find the false positive and false negative rates, but I don't know how to do it because I'm no programming expert. Can someone help?

import pandas as pd
from sklearn import preprocessing
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from keras.models import Sequential
from keras.layers import Dense

# read the csv file and convert into arrays for the machine to process
df = pd.read_csv('dataset_ori.csv')
dataset = df.values

# split the dataset into input features and the feature to predict
X = dataset[:,0:7]
Y = dataset[:,7]

# scale the input features with MinMaxScaler so that they all lie between 0 and 1
min_max_scaler = preprocessing.MinMaxScaler()

# store the dataset into an array
X_scale = min_max_scaler.fit_transform(X)

# split the dataset into 30% testing and the rest to train
X_train, X_val_and_test, Y_train, Y_val_and_test = train_test_split(X_scale, Y, test_size=0.3)

# split the val_and_test size equally to the validation set and the test set.
X_val, X_test, Y_val, Y_test = train_test_split(X_val_and_test, Y_val_and_test, test_size=0.5)

# specify the sequential model and describe the layers that will form the architecture of the neural network
model = Sequential([
    Dense(7, activation='relu', input_shape=(7,)),
    Dense(32, activation='relu'),
    Dense(5, activation='relu'),
    Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# training the data
hist = model.fit(X_train, Y_train, batch_size=32, epochs=100, validation_data=(X_val, Y_val))

# evaluate the accuracy of the classifier on the test set
scores = model.evaluate(X_test, Y_test)

print("Accuracy: %.2f%%" % (scores[1]*100))

Here is the code provided in the answer below. `response` and `model` are both highlighted in red as unresolved references:

from keras import models
from keras.layers import Dense, Dropout
from keras.utils import to_categorical
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from keras.models import Sequential
from keras.layers import Dense, Activation
from sklearn import metrics
from sklearn.preprocessing import StandardScaler


# read the csv file and convert into arrays for the machine to process
df = pd.read_csv('dataset_ori.csv')
dataset = df.values

# split the dataset into input features and the feature to predict
X = dataset[:,0:7]
Y = dataset[:,7]

# Splitting into Train and Test Set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(dataset,
                                                    response,
                                                    test_size = 0.2,
                                                    random_state = 0)

# Initialising the ANN
classifier = Sequential()

# Adding the input layer and the first hidden layer
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim =7 ))
model.add(Dropout(0.5))
# Adding the second hidden layer
classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
model.add(Dropout(0.5))
# Adding the output layer
classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))

# Compiling the ANN
classifier.compile(optimizer = 'adam', loss = 'sparse_categorical_crossentropy', metrics = ['accuracy'])

# Fitting the ANN to the Training set
classifier.fit(X_train, y_train, batch_size = 10, epochs = 20)

# Train model
scaler = StandardScaler()
classifier.fit(scaler.fit_transform(X_train.values), y_train)

# Summary of neural network
classifier.summary()

# Predicting the Test set results & Giving a threshold probability
y_prediction = classifier.predict_classes(scaler.transform(X_test.values))
print ("\n\naccuracy" , np.sum(y_prediction == y_test) / float(len(y_test)))
y_prediction = (y_prediction > 0.5)


#Let's see how our model performed
from sklearn.metrics import classification_report
print(classification_report(y_test, y_prediction))

【Question Discussion】:

    Tags: python machine-learning neural-network pycharm confusion-matrix


    【Solution 1】:

    Your input to confusion_matrix must be an array of ints, not one-hot encodings.

    # Predicting the Test set results
    y_pred = model.predict(X_test)
    y_pred = (y_pred > 0.5)
    matrix = metrics.confusion_matrix(y_test, y_pred)
    

    The output will look like the following, so applying a probability threshold of 0.5 converts it to binary.

    Output (y_pred):

    [0.87812372 0.77490434 0.30319547 0.84999743]
    
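The thresholding step described above can be sketched as follows, using the example probabilities shown (not real model output):

```python
import numpy as np

# example sigmoid outputs from above (not the real model's predictions)
y_pred = np.array([0.87812372, 0.77490434, 0.30319547, 0.84999743])

# apply the 0.5 probability threshold to get binary class labels
y_pred_binary = (y_pred > 0.5).astype(int)
print(y_pred_binary)  # [1 1 0 1]
```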

    The sklearn.metrics.accuracy_score(y_true, y_pred) method defines y_pred as:

    y_pred : 1d array-like, or label indicator array / sparse matrix. Predicted labels, as returned by a classifier.

    This means y_pred has to be an array of 1's or 0's (predicted labels). They should not be probabilities.

    The root cause of your error is a theoretical and not a computational issue: you are trying to use a classification metric (accuracy) in a regression (i.e. numeric prediction) setting, which is meaningless.

    Like most performance metrics, accuracy compares apples to apples (i.e. true labels of 0/1 with predictions that are again 0/1); so, when you ask the function to compare binary true labels (apples) with continuous predictions (oranges), you get an expected error, whose message tells you exactly what the problem is from a computational point of view:

    Classification metrics can't handle a mix of binary and continuous target
    

    Although the message doesn't tell you directly that you are trying to compute a metric that is invalid for your problem (and we shouldn't actually expect it to go that far), it is certainly a good thing that scikit-learn at least gives you a direct and explicit warning that you are attempting something wrong; this is not necessarily the case with other frameworks - see for example the behaviour of Keras in a very similar situation, where you get no warning at all and just end up complaining about low "accuracy" in a regression setting...
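A minimal sketch of the behaviour described above, with made-up labels (not the question's data): passing raw probabilities to a classification metric raises the error, while thresholded 0/1 labels work.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.2, 0.9, 0.6, 0.4])   # continuous sigmoid outputs

try:
    accuracy_score(y_true, y_prob)        # binary truth vs continuous predictions
except ValueError as e:
    print(e)  # Classification metrics can't handle a mix of binary and continuous targets

y_pred = (y_prob > 0.5).astype(int)       # threshold first...
print(accuracy_score(y_true, y_pred))     # ...then the metric works
```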

    from keras.models import Sequential
    from keras.layers import Dense, Dropout
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    from sklearn import metrics
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler
    import matplotlib.pyplot as plt
    import seaborn as sn
    
    
    # read the csv file and convert into arrays for the machine to process
    df = pd.read_csv('dataset_ori.csv')
    dataset = df.values
    
    # split the dataset into input features and the feature to predict
    X = dataset[:,0:7]
    Y = dataset[:,7]
    
    # Splitting into Train and Test Set
    X_train, X_test, y_train, y_test = train_test_split(X, Y,
                                                        test_size = 0.2,
                                                        random_state = 0)
    
    # Scale the input features (fit on train, apply to test)
    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)
    
    # Initialising the ANN
    classifier = Sequential()
    
    # Adding the input layer and the first hidden layer
    classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu', input_dim = 7))
    classifier.add(Dropout(0.5))
    # Adding the second hidden layer
    classifier.add(Dense(units = 10, kernel_initializer = 'uniform', activation = 'relu'))
    classifier.add(Dropout(0.5))
    # Adding the output layer
    classifier.add(Dense(units = 1, kernel_initializer = 'uniform', activation = 'sigmoid'))
    
    # Compiling the ANN (binary_crossentropy matches the single sigmoid output)
    classifier.compile(optimizer = 'adam', loss = 'binary_crossentropy', metrics = ['accuracy'])
    
    # Fitting the ANN to the Training set
    classifier.fit(X_train, y_train, batch_size = 10, epochs = 20)
    
    # Summary of neural network
    classifier.summary()
    
    # Predicting the Test set results & applying a probability threshold
    y_pred = (classifier.predict(X_test) > 0.5).astype(int).ravel()
    print ("\n\naccuracy" , np.sum(y_pred == y_test) / float(len(y_test)))
    
    
    ## EXTRA: Confusion Matrix Visualize
    from sklearn.metrics import confusion_matrix, accuracy_score
    cm = confusion_matrix(y_test, y_pred) # rows = truth, cols = prediction
    df_cm = pd.DataFrame(cm, index = (0, 1), columns = (0, 1))
    plt.figure(figsize = (10,7))
    sn.set(font_scale=1.4)
    sn.heatmap(df_cm, annot=True, fmt='g')
    print("Test Data Accuracy: %0.4f" % accuracy_score(y_test, y_pred))
    
    #Let's see how our model performed
    from sklearn.metrics import classification_report
    print(classification_report(y_test, y_pred))
    

    【Discussion】:

    • Could you clarify further? Also, metrics in metrics.confusion_matrix gives an error - what does it refer to, and how do I save it to an array? I thought I had done that. How do I print FP and FN?
    • Can you tell me whether your target y variable is binary (0/1) or continuous? Is it a regression problem or a classification problem?
    • Binary, so 0 or 1. The excel file I feed in is basically 8 columns and thousands of rows. The last column is 1 or 0, indicating whether or not it is malware (supervised learning). The last column is what goes into Y and should be the output of the NN. I'm really sorry to bother you, but like I said, I'm a complete beginner at this and could barely put the neural network together. Thanks for helping fix the code.
    • I will certainly do so
    • Can you tell me whether all the columns in that file are numeric or text?
    【Solution 2】:

    Since you have already imported confusion_matrix from scikit-learn, you can use this:

    cutoff = 0.5
    y_pred = model.predict(X_test)
    y_pred_classes = np.zeros_like(y_pred)    # initialise an array full of zeros
    y_pred_classes[y_pred > cutoff] = 1
    
    y_test_classes = np.zeros_like(y_pred)
    y_test_classes[Y_test > cutoff] = 1
    print(confusion_matrix(y_test_classes, y_pred_classes))
    

    The confusion matrix returned by scikit-learn is always laid out like this:

    True negatives     False positives
    False negatives    True positives
    

    To get tn and the others individually, you can run this:

    tn, fp, fn, tp = confusion_matrix(y_test_classes, y_pred_classes).ravel()
    (tn, fp, fn, tp)
    
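Since the question specifically asks for the false positive and false negative rates, they follow directly once tn, fp, fn, tp are unpacked. A small sketch with hand-made labels (not the real model's output):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# hypothetical thresholded labels standing in for the model's predictions
y_test_classes = np.array([0, 0, 1, 1, 1, 0])
y_pred_classes = np.array([0, 1, 1, 1, 0, 0])

tn, fp, fn, tp = confusion_matrix(y_test_classes, y_pred_classes).ravel()
fpr = fp / (fp + tn)   # false positive rate: fraction of real 0's predicted as 1
fnr = fn / (fn + tp)   # false negative rate: fraction of real 1's predicted as 0
print("FPR: %.3f  FNR: %.3f" % (fpr, fnr))
```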

    【Discussion】:

    • Error: Traceback (most recent call last): File "neuralNetwork3.py", line 38, in print(confusion_matrix(Y_test, y_predict)) File "C:\Users\Maryam\Anaconda3\envs\rango\lib\site-packages\sklearn\metrics\classification.py", line 253, in confusion_matrix y_type, y_true, y_pred = _check_targets(y_true, y_pred) File "C:\Users\Maryam\Anaconda3\envs\rango\lib\site-packages\sklearn\metrics\classification.py", line 81, in _check_targets "and {1} targets".format(type_true, type_pred)) ValueError: Classification metrics can't handle a mix of binary and continuous targets
    • Thanks for the help. I added the code above at the end of my program and got [[ 636 114] [ 45 2147]] when I ran it. Could you elaborate on what these numbers mean and how I can get the percentages/ratios of FP and FN as a single numeric metric?
    • Where do I need to add tn, fp, fn, tp = confusion_matrix(y_test_classes, y_pred_classes).ravel() (tn, fp, fn, tp)
    • Before the print function
    • y_test_classes = np.zeros_like(y_predict) y_test_classes[Y_test > cutoff] = 1 tn, fp, fn, tp = confusion_matrix(y_test_classes, y_pred_classes).ravel() (tn, fp, fn, tp) print(confusion_matrix(y_test_classes, y_pred_classes)) This is what I have, and I still get the same output "Accuracy: 92.86% [[ 595 144] [ 66 2137]]"