What machine learning model to choose? Answers

【Question title】: What machine learning model to choose?
【Posted】: 2023-03-23 02:55:01
【Question description】:

I have a table in .csv format with the following columns:

recipe, defect, material1, material2, material3, ..., material122

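recipe is the ID of a combination of one or more materials (for example, recipe_1 corresponds to material1 + material3 + material28, while recipe_2 corresponds to material3 + material5).
defect is an ID denoting a defect found in some product manufactured with a given recipe.
materialN is the weight of a given material. However, I use the ratios of the materials rather than their weights (for example, for a recipe = material1 + material2, I say material1 = 0.25 and material2 = 0.75 instead of material1 = 5 kg and material2 = 15 kg).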

Note: the same recipe can have multiple defects.

This is what my training table looks like. It has 124 columns and nearly 90,000 rows.

Now I need to train a model that takes material1, material2, material3, ..., material122 as input and produces defect as output. For example, take rows 2-15 of my file:

given input: [0, 0, 0.898, 0.062, 0.039, 0, 0, ..., 0, 0] // ratios of materials for recipe 1701192
given output: [149, 146, 148, 90, 89, ..., 59, 71, 63] // defects found for recipe 1701192

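The main problem I see here is that the same recipe corresponds to several different defects. In addition, I need to predict multiple defects for each recipe in a test dataset that is given in a separate file.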

This is what the test dataset looks like. It has 123 columns and 8,400 rows. Note that it contains no information about defects: I need to predict them.

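Unfortunately, I do not know of any model that allows multiple predictions for a single combination of attributes. Can you recommend one? It could also be a neural network.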

【Question comments】:

Tags: machine-learning neural-network bigdata

【Solution 1】:

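One way to do this is multivariate regression. If you know all the defect types (classes) that can occur, you can treat them as n dependent variables and then run a regression on your data. One thing you should do before running the regression is standardize or normalize your input data, if you have not done so already. If all of your output variables are independent of each other, you can also run a separate analysis for each variable in the model.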
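As a rough illustration of the idea, here is a minimal sketch using only NumPy. It assumes the table is first collapsed to one row per recipe, with the defects encoded as one 0/1 indicator column per known defect ID; that encoding, the toy rows, the 0.5 threshold, and the helper names `design` and `predict_defects` are all assumptions made for this example, not something given in the question. Plain least squares stands in for "regression" here, and with several target columns it is equivalent to fitting each defect column separately.

```python
import numpy as np

# Toy stand-in for the training table: one row per (recipe, defect) pair.
# The values are made up; the real file has 122 material columns.
rows = [
    # (recipe, defect, material ratios)
    ("r1", 149, [0.90, 0.10, 0.00]),
    ("r1", 146, [0.90, 0.10, 0.00]),
    ("r2", 90,  [0.00, 0.25, 0.75]),
    ("r3", 146, [0.80, 0.20, 0.00]),
    ("r3", 149, [0.80, 0.20, 0.00]),
    ("r4", 90,  [0.10, 0.20, 0.70]),
]

# 1) Collapse to one row per recipe: ratios as input, set of defects as output.
ratios, defects = {}, {}
for recipe, defect, mats in rows:
    ratios[recipe] = mats
    defects.setdefault(recipe, set()).add(defect)

recipes = sorted(ratios)
classes = sorted({d for s in defects.values() for d in s})  # known defect IDs
X = np.array([ratios[r] for r in recipes], dtype=float)
# Indicator targets: Y[i, j] is 1 if recipe i showed defect classes[j], else 0.
Y = np.array([[1.0 if c in defects[r] else 0.0 for c in classes]
              for r in recipes])

# 2) Standardize the inputs using statistics of the training data only.
mean = X.mean(axis=0)
std = X.std(axis=0)
std[std == 0] = 1.0  # avoid division by zero for constant columns

def design(X_raw):
    """Standardize raw ratios and prepend an intercept column."""
    Z = (np.asarray(X_raw, dtype=float) - mean) / std
    return np.hstack([np.ones((len(Z), 1)), Z])

# 3) Multivariate least squares: one coefficient column per defect.
W, *_ = np.linalg.lstsq(design(X), Y, rcond=None)

def predict_defects(X_raw, threshold=0.5):
    """Return, for each input row, the defect IDs whose score passes the threshold."""
    scores = design(X_raw) @ W
    return [[c for c, s in zip(classes, row) if s >= threshold]
            for row in scores]

print(predict_defects([[0.85, 0.15, 0.00]]))
```

For the test file, the same `design` function would be applied to its material columns, so that the test rows are scaled with the training mean and standard deviation. The threshold controls how many defects are reported per recipe and would need tuning on held-out data.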

【Comments】: