Stanford NLP Text Classifier, Custom Features and Confusion Matrix: Answers

【Question title】: Stanford NLP Text Classifier, Custom Features and Confusion Matrix
【Posted】: 2017-03-16 08:15:38
【Question description】:

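I am using the Stanford NLP text classifier (ColumnDataClassifier) in my Java code. I have two main questions.

1-) How can I print more detailed evaluation information, such as a confusion matrix?

2-) My code already does the preprocessing and extracts numeric features (vectors) for terms, such as binary features or TF-IDF values. How can I use these features to train and test the classifier?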

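【Comments】:

Here is a good resource for the classifier: nlp.stanford.edu/wiki/Software/Classifier
I don't think there is any direct way to print out a confusion matrix. Here is the javadoc for the class as well: nlp.stanford.edu/nlp/javadoc/javanlp/edu/stanford/nlp/classify/…
@StanfordNLPHelp Thanks. Could you take a look at this: stackoverflow.com/questions/40685303/…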

Tags: stanford-nlp text-classification

【Solution 1】:

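I asked a related question here. ColumnDataClassifier has no option to output its metrics as a confusion matrix. However, if you look at the code in ColumnDataClassifier.java, you can see where the TP, FP, TN and FN counts are written to standard output. That spot has the raw values you need. They could be aggregated into a confusion matrix by a method that prints it after a run, but you would have to write that code yourself.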
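As a rough illustration of the aggregation step, here is a minimal sketch that does not depend on the Stanford library at all. It assumes you can collect one (gold label, predicted label) pair per test item, for example by classifying each test datum yourself. The class name `ConfusionMatrixSketch`, its methods, and the spam/ham sample data are all hypothetical and are not part of ColumnDataClassifier.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: counts (gold, predicted) label pairs. Not part of Stanford NLP. */
class ConfusionMatrixSketch {
    // counts.get(gold).get(predicted) = number of test items with that pair
    private final Map<String, Map<String, Integer>> counts = new LinkedHashMap<>();
    // every label seen so far, in first-seen order (used for rows and columns)
    private final Set<String> labels = new LinkedHashSet<>();

    /** Record one test item. */
    void add(String gold, String predicted) {
        labels.add(gold);
        labels.add(predicted);
        counts.computeIfAbsent(gold, k -> new LinkedHashMap<>())
              .merge(predicted, 1, Integer::sum);
    }

    /** Number of items with this gold label that were predicted as this label. */
    int count(String gold, String predicted) {
        return counts.getOrDefault(gold, Collections.<String, Integer>emptyMap())
                     .getOrDefault(predicted, 0);
    }

    /** Gold and prediction both equal the label. */
    int truePositives(String label) {
        return count(label, label);
    }

    /** Predicted as the label, but the gold label was something else. */
    int falsePositives(String label) {
        int fp = 0;
        for (String gold : labels) {
            if (!gold.equals(label)) fp += count(gold, label);
        }
        return fp;
    }

    /** Gold label is the label, but something else was predicted. */
    int falseNegatives(String label) {
        int fn = 0;
        for (String predicted : labels) {
            if (!predicted.equals(label)) fn += count(label, predicted);
        }
        return fn;
    }

    /** Tab-separated matrix: one row per gold label, one column per predicted label. */
    String render() {
        StringBuilder sb = new StringBuilder("gold\\pred");
        for (String p : labels) sb.append('\t').append(p);
        sb.append('\n');
        for (String g : labels) {
            sb.append(g);
            for (String p : labels) sb.append('\t').append(count(g, p));
            sb.append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        ConfusionMatrixSketch cm = new ConfusionMatrixSketch();
        // Made-up (gold, predicted) pairs; real ones would come from your own test run.
        String[][] pairs = {
            {"spam", "spam"}, {"spam", "ham"}, {"ham", "ham"}, {"ham", "ham"}, {"ham", "spam"}
        };
        for (String[] pair : pairs) cm.add(pair[0], pair[1]);
        System.out.print(cm.render());
    }
}
```

Keeping the counts keyed by label pair, rather than storing only per-class TP/FP/FN totals, means the full matrix and the per-class counts can both be derived from the same data.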

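The wiki has an example of how to use numeric features with ColumnDataClassifier. If you use numeric features, take a look at these options in the API, which let you apply some transformations: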

Option         | Type    | Default | Description                                                                                  | Feature name
realValued     | boolean | false   | Treat this column as real-valued and do not perform any transforms on the feature value.     | Value
logTransform   | boolean | false   | Treat this column as real-valued and use the log of the value as the feature value.          | Log
logitTransform | boolean | false   | Treat this column as real-valued and use the logit of the value as the feature value.        | Logit
sqrtTransform  | boolean | false   | Treat this column as real-valued and use the square root of the value as the feature value.  | Sqrt
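To make the numeric-feature setup more concrete, here is a minimal sketch of preparing the input for such a run. It assumes a tab-separated data file with the gold label in column 0 and one numeric feature per following column, and it uses per-column property keys of the form `1.realValued`. The class `NumericFeatureSketch`, its method names, and the sample values are hypothetical; check the wiki example and the javadoc for the exact layout your version expects.

```java
import java.util.Locale;
import java.util.Properties;

/** Hypothetical helper for preparing numeric-feature input. Not part of Stanford NLP. */
class NumericFeatureSketch {

    /** Format one example as a tab-separated line: label first, then the feature values. */
    static String toLine(String label, double... features) {
        StringBuilder sb = new StringBuilder(label);
        for (double f : features) {
            // Locale.ROOT keeps the decimal separator a '.' regardless of system locale.
            sb.append('\t').append(String.format(Locale.ROOT, "%.4f", f));
        }
        return sb.toString();
    }

    /**
     * Build properties that mark columns 1..numFeatureColumns as real-valued.
     * Assumption: column 0 holds the gold label.
     */
    static Properties numericProps(int numFeatureColumns) {
        Properties props = new Properties();
        props.setProperty("goldAnswerColumn", "0");
        for (int col = 1; col <= numFeatureColumns; col++) {
            props.setProperty(col + ".realValued", "true");
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up TF-IDF style values for two examples with three features each.
        System.out.println(toLine("sports", 0.12, 0.0, 1.5));
        System.out.println(toLine("politics", 0.0, 0.87, 0.3));

        // These properties could then be saved to a .prop file or handed to the classifier.
        numericProps(3).forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

To apply one of the transforms instead, the same key pattern would be used with a different option name, for example `1.logTransform` in place of `1.realValued`.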

【Comments】: