How to separate words and special characters using Stanford tokenize? Answers

【Question title】: How to separate words and special characters using Stanford tokenize?
【Posted】: 2015-12-20 20:43:56
【Question description】:

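I am using the Stanford CoreNLP tool, and I need to split this string into tokens: "(see functional requirements number 150)."

The result of my code (as CoreLabels) is: [(see, functional, requirements, number, 150).]

When it should be: [(, see, functional, requirements, number, 150, ), .]

The code snippet is: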

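import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.util.CoreMap;
import java.util.List;
import java.util.Properties;

public List<CoreMap> armador(String text) {

   // Configure the pipeline: tokenizer, sentence splitter, POS tagger.
   Properties props = new Properties();
   props.put("annotators", "tokenize,ssplit,pos");
   props.put("ssplit.eolonly", "true");       // split sentences on newlines only
   props.put("tokenize.whitespace", "true");  // split tokens on whitespace only

   StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

   // Annotate the text and return its sentences.
   Annotation document = new Annotation(text);
   pipeline.annotate(document);
   List<CoreMap> result = document.get(CoreAnnotations.SentencesAnnotation.class);

   return result;
}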

Thanks, and sorry for my English!

【Question discussion】:

Tags: java stanford-nlp tokenize

【Solution 1】:

This is caused by the property:

props.put("tokenize.whitespace", "true");

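By default, CoreNLP runs Penn Treebank tokenization, which correctly splits the parentheses off as separate tokens. The tokenize.whitespace property, however, forces CoreNLP to split tokens on whitespace only, so "(see" and "150)." each stay in one piece.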
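To make the difference concrete, here is a rough sketch in plain Java, with no CoreNLP dependency. It compares a whitespace-only split with a simplified punctuation-aware split. The class name TokenizeSketch and both method names are made up for this illustration, and the regex is only a stand-in: it is not CoreNLP's Penn Treebank tokenizer, which handles many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of CoreNLP.
class TokenizeSketch {
    // A token is either a run of letters/digits or one other non-whitespace character.
    private static final Pattern TOKEN = Pattern.compile("[\\p{L}\\p{N}]+|\\S");

    // Roughly what tokenize.whitespace=true does: split on whitespace and nothing else.
    static List<String> whitespaceTokens(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    // Simplified stand-in for a punctuation-aware tokenizer.
    static List<String> punctuationAwareTokens(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "(see functional requirements number 150).";
        // prints [(see, functional, requirements, number, 150).]
        System.out.println(whitespaceTokens(text));
        // prints [(, see, functional, requirements, number, 150, ), .]
        System.out.println(punctuationAwareTokens(text));
    }
}
```

The first list matches the output reported in the question, and the second matches the desired one. With the real tokenizer, depending on the CoreNLP version and tokenizer options, the parentheses may be rendered as -LRB- and -RRB- rather than as literal characters.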

Edit: you should probably also be wary of props.put("ssplit.eolonly", "true"); it makes the sentence splitter break sentences only at newlines.
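Putting both points together, the properties could be built along these lines. This is only a sketch: the class PipelineProps and its method are hypothetical helpers, and only the property keys and the StanfordCoreNLP constructor come from the question. The helper uses nothing but java.util.Properties, so it compiles without CoreNLP; the pipeline call is shown as a comment.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not part of CoreNLP.
class PipelineProps {
    // Builds properties that leave the default tokenizer in place.
    static Properties defaultTokenizerProps(boolean oneSentencePerLine) {
        Properties props = new Properties();
        props.setProperty("annotators", "tokenize,ssplit,pos");
        // tokenize.whitespace is deliberately not set, so the default
        // tokenizer is free to split off parentheses and the final period.
        if (oneSentencePerLine) {
            // Only appropriate when the input really has one sentence per line.
            props.setProperty("ssplit.eolonly", "true");
        }
        return props;
    }

    public static void main(String[] args) {
        Properties props = defaultTokenizerProps(false);
        System.out.println(props.getProperty("annotators"));
        // With CoreNLP on the classpath, as in the question:
        // StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    }
}
```

Whether to keep ssplit.eolonly depends on the input: if each line already holds exactly one sentence, it is a reasonable setting; otherwise leaving it out lets the sentence splitter use its default behaviour.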

【Discussion】: