Posted: 2020-10-02 14:26:51
Problem description:
I want to model classifying whether a student passes a course based on a single input feature in the training data (the student's exam score).
I first created a dataset of exam scores for 1,000 students, normally distributed with a mean of 80. I then assigned a classification of "1" (pass) to the top 300 students, which, given the seed, corresponds to an exam-score cutoff of 80.87808591534409.
(Obviously we don't really need machine learning here, since this just means anyone scoring above 80.87808591534409 passes the course. But I want to build a model that accurately predicts this, so that I can start adding new input features and extend my classification beyond simple pass/fail.)
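The rule-based version of this classifier is a one-liner; a minimal NumPy sketch of it (the threshold value is taken from the question; the generator and seed here are illustrative):

```python
import numpy as np

# Threshold taken from the question's computed cutoff
THRESHOLD = 80.87808591534409

rng = np.random.default_rng(0)  # illustrative seed
scores = np.sort(rng.normal(80, 2, 1000))

# Rule-based classification: 1 (pass) iff the score exceeds the threshold
labels = (scores > THRESHOLD).astype(float)
```

Any model trained on this data only needs to recover that single threshold.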
Next, I created a test set the same way and classified those students using the threshold computed earlier for the training set (80.87808591534409).
Then, as you can see below or in the linked Jupyter notebook, I created a model that takes one input feature and returns two outputs: the probability of the zero-indexed classification (fail) and the probability of the one-indexed classification (pass).
I then trained it on the training dataset. But as you can see, the per-epoch loss never really improves; it just hovers around 0.6.
Finally, I ran the trained model on the test dataset and generated predictions.
I plotted the results as follows:
The green line represents the actual (not predicted) classifications of the test set. The blue line represents the probability of the zero-indexed outcome (fail), and the orange line represents the probability of the one-indexed outcome (pass).
As you can see, they stay flat. If my model were working correctly, I would expect these lines to swap positions at the threshold where the actual data switches from fail to pass.
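The index where that swap should happen can be computed directly; a small sketch using `np.searchsorted` on the sorted test scores (threshold and seed taken from the code below):

```python
import numpy as np

THRESHOLD = 80.87808591534409  # cutoff computed from the training set

np.random.seed(10)
test_exam_scores = np.sort(np.random.normal(80, 2, 1000))

# First index whose score reaches the threshold: the point where the
# predicted fail/pass probability lines ought to cross
crossover = int(np.searchsorted(test_exam_scores, THRESHOLD))
```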
I suspect I may be doing several things wrong, but if anyone has time to look over the code below and offer some advice, I would be very grateful.
I have created a public working example of my attempt here. The current code is included below.
The problem I'm running into is that training seems to get stuck while computing the loss, so the model reports that every student in my test set fails (all 1,000 of them), regardless of their exam score, which is obviously wrong.
import numpy as np
import matplotlib.pyplot as plt
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_datasets as tfds
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
print("Version: ", tf.__version__)
print("Eager mode: ", tf.executing_eagerly())
print("Hub version: ", hub.__version__)
print("GPU is", "available" if tf.config.experimental.list_physical_devices("GPU") else "NOT AVAILABLE")
## Create data
# Set Seed
np.random.seed(0)
# Create 1000 test scores normally distributed with a standard deviation of 2 and a mean of 80
train_exam_scores = np.sort(np.random.normal(80,2,1000))
# Create classification; top 300 pass the class (classification of 1), bottom 700 do not pass (classification of 0)
train_labels = np.array([0. for i in range(700)])
train_labels = np.append(train_labels, [1. for i in range(300)])
print("Point at which test scores correlate with passing class: {}".format(train_exam_scores[701]))
print("computed point with seed of 0 should be: 80.87808591534409")
print("Plot point at which test scores correlate with passing class")
## Plot view
plt.plot(train_exam_scores)
plt.plot(train_labels)
plt.show()
#create another set of 1000 test scores with different seed (10)
np.random.seed(10)
test_exam_scores = np.sort(np.random.normal(80,2,1000))
# create classification labels for the new test set based on passing rate of 80.87808591534409 determined above
test_labels = np.array([])
for score in test_exam_scores:
    if score >= 80.87808591534409:
        test_labels = np.append(test_labels, 1)
    else:
        test_labels = np.append(test_labels, 0)
plt.plot(test_exam_scores)
plt.plot(test_labels)
plt.show()
print(tf.shape(train_exam_scores))
print(tf.shape(train_labels))
print(tf.shape(test_exam_scores))
print(tf.shape(test_labels))
train_dataset = tf.data.Dataset.from_tensor_slices((train_exam_scores, train_labels))
test_dataset = tf.data.Dataset.from_tensor_slices((test_exam_scores, test_labels))
BATCH_SIZE = 5
SHUFFLE_BUFFER_SIZE = 1000
train_dataset = train_dataset.shuffle(SHUFFLE_BUFFER_SIZE).batch(BATCH_SIZE)
test_dataset = test_dataset.batch(BATCH_SIZE)
# view example of feature to label correlation, values above 80.87808591534409 are classified as 1, those below are classified as 0
features, labels = next(iter(train_dataset))
print(features)
print(labels)
# create model with first layer to take 1 input feature per student; and output layer of two values (percentage of 0 or 1 classification)
model = tf.keras.Sequential([
    tf.keras.layers.Dense(10, activation=tf.nn.relu, input_shape=(1,)),  # input shape required
    tf.keras.layers.Dense(10, activation=tf.nn.relu),
    tf.keras.layers.Dense(2)
])
# Test untrained model on training features; should produce nonsense results
predictions = model(features)
print(tf.nn.softmax(predictions[:5]))
print("Prediction: {}".format(tf.argmax(predictions, axis=1)))
loss_object = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)
model.compile(optimizer=optimizer,
              loss=loss_object,
              # integer labels with SparseCategoricalCrossentropy need a sparse
              # accuracy metric; 'categorical_accuracy' expects one-hot labels
              metrics=['sparse_categorical_accuracy'])
#train model
model.fit(train_dataset,
          epochs=20,
          validation_data=test_dataset,
          verbose=1)
#make predictions on test scores from test_dataset
predictions = model.predict(test_dataset)
print(tf.nn.softmax(predictions[:1000]))
print(tf.argmax(predictions, axis=1))
# I anticipate that the predictions would show a higher probability for index position [0] (classification 0, "did not pass")
#until it reaches a value greater than 80.87808591534409
# which in the test data with a seed of 10 should be the value at the 683 index position
# but at this point I would expect there to be a higher probability for index position [1] (classification 1), "did pass"
# because it is obvious from the data that anyone who scores higher than 80.87808591534409 should pass.
# Thus in the chart below I would expect the lines charting the probability to switch precisely at the point where the test classifications shift.
# However this is not the case. All predictions are the same for all 1000 values.
plt.plot(tf.nn.softmax(predictions[:1000]))
plt.plot(test_labels)
plt.show()
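Flat lines like these mean every row of softmax probabilities is (near) identical. A quick standalone way to see what that collapse looks like numerically, using hypothetical constant logits as a stand-in for `predictions`:

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# A collapsed model emits (almost) the same logits for every input;
# these constant logits are hypothetical stand-ins for `predictions`
flat_logits = np.tile([1.2, 0.3], (5, 1))
probs = softmax(flat_logits)

# Every row is identical, so argmax picks class 0 ("fail") everywhere
predicted_classes = probs.argmax(axis=1)
```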
Discussion:
Tags: python tensorflow machine-learning keras