ml.net sentiment analysis warning about format errors & bad values: answers

【Question title】: ml.net sentiment analysis warning about format errors & bad values
【Posted】: 2018-06-18 10:04:59
【Question description】:

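I'm having a problem with my ml.net console app. This is my first time using ml.net in Visual Studio, so I followed this tutorial from microsoft.com, which is a sentiment analysis using binary classification.

I'm trying to process some test data in the form of tsv files to get a positive or negative sentiment analysis, but when debugging I get a warning that there is 1 format error and 2 bad values.

I decided to ask all you great devs here on Stack to see if anyone can help me find a solution.

Below is an image of the debugging session:

Here are the links to my test data:
wiki-data
wiki-test-data

Finally, here is my code for anyone who wants to reproduce the problem:

There are 2 C# files: SentimentData.cs and Program.cs.

1 - SentimentData.cs: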

using System;
using System.Collections.Generic;
using System.Text;
using Microsoft.ML.Runtime.Api;

namespace MachineLearningTut
{
 public class SentimentData
 {
    [Column(ordinal: "0")]
    public string SentimentText;
    [Column(ordinal: "1", name: "Label")]
    public float Sentiment;
 }

 public class SentimentPrediction
 {
    [ColumnName("PredictedLabel")]
    public bool Sentiment;
 }
}

2 - Program.cs:

using System;
using Microsoft.ML.Models;
using Microsoft.ML.Runtime;
using Microsoft.ML.Runtime.Api;
using Microsoft.ML.Trainers;
using Microsoft.ML.Transforms;
using System.Collections.Generic;
using System.Linq;
using Microsoft.ML;
using Microsoft.ML.Data;
using System.Threading.Tasks;

namespace MachineLearningTut
{
class Program
{
    const string _dataPath = @".\Data\wikipedia-detox-250-line-data.tsv";
    const string _testDataPath = @".\Data\wikipedia-detox-250-line-test.tsv";
    const string _modelpath = @".\Data\Model.zip";

    static async Task Main(string[] args)
    {
        var model = await TrainAsync();

        Evaluate(model);

        Predict(model);
    }

    public static async Task<PredictionModel<SentimentData, SentimentPrediction>> TrainAsync()
    {
        var pipeline = new LearningPipeline();

        pipeline.Add(new TextLoader (_dataPath).CreateFrom<SentimentData>());

        pipeline.Add(new TextFeaturizer("Features", "SentimentText"));

        pipeline.Add(new FastForestBinaryClassifier() { NumLeaves = 5, NumTrees = 5, MinDocumentsInLeafs = 2 });

        PredictionModel<SentimentData, SentimentPrediction> model = pipeline.Train<SentimentData, SentimentPrediction>();

        await model.WriteAsync(path: _modelpath);

        return model;
    }

    public static void Evaluate(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        var testData = new TextLoader(_testDataPath).CreateFrom<SentimentData>();

        var evaluator = new BinaryClassificationEvaluator();

        BinaryClassificationMetrics metrics = evaluator.Evaluate(model, testData);

        Console.WriteLine();
        Console.WriteLine("PredictionModel quality metrics evaluation");
        Console.WriteLine("-------------------------------------");
        Console.WriteLine($"Accuracy: {metrics.Accuracy:P2}");
        Console.WriteLine($"Auc: {metrics.Auc:P2}");
        Console.WriteLine($"F1Score: {metrics.F1Score:P2}");

    }

    public static void Predict(PredictionModel<SentimentData, SentimentPrediction> model)
    {
        IEnumerable<SentimentData> sentiments = new[]
        {
            new SentimentData
            {
                SentimentText = "Please refrain from adding nonsense to Wikipedia."
            },

            new SentimentData
            {
                SentimentText = "He is the best, and the article should say that."
            }
        };

        IEnumerable<SentimentPrediction> predictions = model.Predict(sentiments);

        Console.WriteLine();
        Console.WriteLine("Sentiment Predictions");
        Console.WriteLine("---------------------");

        var sentimentsAndPredictions = sentiments.Zip(predictions, (sentiment, prediction) => (sentiment, prediction));

        foreach (var item in sentimentsAndPredictions)
        {
            Console.WriteLine($"Sentiment: {item.sentiment.SentimentText} | Prediction: {(item.prediction.Sentiment ? "Positive" : "Negative")}");
        }
        Console.WriteLine();
    }
}

}

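If anyone would like to see the solution code or more details, ask me in chat and I'll send it. Thanks in advance!!! [thumbs up]

【Question comments】:

Welcome to S.O. Please check the help section and provide a minimal reproducible example. Note that "it gives me warnings" is not a detailed description.
Thanks, should I rewrite the question and make it more detailed?
IMHO, you should not only provide more information but also include the minimum code that reproduces the problem, and describe what you intend to do and what errors you get.
@TheGodofOfficeWork You can edit the question to add your code :). I think I've seen this problem before when the types in the class that defines the data input columns or the score output were incorrect. Edit: here is this issue I saw earlier that looks similar.
@Jon Gotcha. I'm busy revising the question as we speak. I'll also take a look at that link you posted.

Tags: c# visual-studio ml.net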

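【Solution 1】:

I think I got it fixed for you. A few things to update:

First, I think you have the SentimentData properties in the opposite order from the columns in the data. Try changing it to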

[Column(ordinal: "0", name: "Label")]
public float Sentiment;

[Column(ordinal: "1")]
public string SentimentText;

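Second, use the useHeader parameter in the TextLoader.CreateFrom method. Don't forget to add it to the other TextLoader as well, the one that loads the validation data.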

pipeline.Add(new TextLoader(_dataPath).CreateFrom<SentimentData>(useHeader: true));

With those two updates, I got the following output. Looks like a decent model, with an AUC of 85%!
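To see why both changes matter, here is a minimal sketch in plain C# that does not use ML.NET at all. It only checks which tab-separated columns of a row can be read as a float, which is what a float-typed Label column needs. The class name ColumnProbe and the sample rows are made up for illustration; ML.NET's loader does its own parsing, so treat this as a way to inspect your file, not as a description of the library's internals.

```csharp
using System;
using System.Globalization;
using System.Linq;

public static class ColumnProbe
{
    // True when the field can be read as a float.
    public static bool IsNumeric(string field) =>
        float.TryParse(field, NumberStyles.Float, CultureInfo.InvariantCulture, out _);

    // For one tab-separated row, reports per column whether it is numeric.
    public static bool[] ProbeRow(string row) =>
        row.Split('\t').Select(IsNumeric).ToArray();

    public static void Main()
    {
        // Made-up sample rows (an assumption about the file layout):
        // a header row, then the label first and the text second.
        string[] rows =
        {
            "Sentiment\tSentimentText",
            "1\tSome comment text",
        };

        foreach (string row in rows)
        {
            // One True/False entry per column of the row.
            Console.WriteLine(string.Join(", ", ProbeRow(row)));
        }
    }
}
```

If the first column of the data rows is the numeric one, the float property belongs at ordinal 0. A header row has no numeric column at all, so it would probably be reported as a bad value unless the loader is told to skip it.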

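【Comments】:

I'd like to add that I also removed the badly formatted lines.
That sounds (and looks) like it will work. I'll try it right now.
Just one thing I'm not clear on. When you say to add the useHeader parameter to the validation data, do you mean the Evaluate method?
Yes! Otherwise you'll get one instance of a bad value being reported.
I've changed the SentimentData props, as well as the TextLoaders in TrainAsync/Evaluate, but the console still goes into break mode when debugging :( .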

【Solution 2】:

Another thing that helps with text-type datasets is indicating that the text has quotes:

new TextLoader("someFile.txt").CreateFrom<Input>(useHeader: true, allowQuotedStrings: true)

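【Comments】:

I tried adding that property to allow quoted strings but didn't see any effect, positive or negative. I can see where you're coming from with this answer. Thanks, Amy.
Note that allowQuotedStrings is set to true by default according to the docs. Definitely worth a try, though. You never know :p

【Solution 3】:

Lines 252 and 253 have badly formatted values. Maybe there are fields containing the delimiter character there. If you post your code or sample data, we can be more precise.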
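One way to check for stray delimiters is to count the tab-separated fields on every line. The sketch below is plain C# and assumes a two-column file. The class name FieldCountCheck is hypothetical. It splits on tabs naively and does not understand quoted fields, so a tab inside quotes will be flagged too.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class FieldCountCheck
{
    // Returns the 1-based numbers of the lines whose tab-separated
    // field count differs from expectedFields.
    public static List<int> FindBadLines(IEnumerable<string> lines, int expectedFields)
    {
        var bad = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            // Naive split: a tab inside a quoted field counts as a separator here.
            if (line.Split('\t').Length != expectedFields)
            {
                bad.Add(lineNumber);
            }
        }
        return bad;
    }

    public static void Main(string[] args)
    {
        if (args.Length == 0 || !File.Exists(args[0]))
        {
            Console.WriteLine("Usage: pass the path of a .tsv file to check.");
            return;
        }

        foreach (int n in FindBadLines(File.ReadLines(args[0]), expectedFields: 2))
        {
            Console.WriteLine($"Line {n} does not have exactly 2 tab-separated fields.");
        }
    }
}
```

Blank lines are flagged as well, because an empty line splits into a single empty field. Trailing blank lines at the end of a file are therefore worth ruling out when the reported line numbers are near the end of the data.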

【Comments】:

I've added my code to the question. I'll check whether there are any delimiters.