How to split the result of PTBTokenizer into sentences?

【Question title】: How to split the result of PTBTokenizer into sentences?
【Posted】: 2015-11-13 01:26:18
【Question】:

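I know I can use DocumentPreprocessor to split a text into sentences. But it does not provide enough information if you want to convert the tokenized text back to the original text. So I have to use PTBTokenizer, which has an invertible option.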

However, PTBTokenizer just returns an iterator over all the tokens (CoreLabels) in the document. It does not split the document into sentences.

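The documentation says:

The output of PTBTokenizer can be post-processed to divide a text into sentences.

But this is obviously not trivial.

Is there a class in the Stanford NLP library that takes a sequence of CoreLabels as input and outputs sentences? This is what I mean: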

List<List<CoreLabel>> split(List<CoreLabel> documentTokens);

【Question comments】:

Tags: stanford-nlp

【Solution 1】:

I would suggest using the StanfordCoreNLP class. Here is some sample code:

import java.io.*;
import java.util.*;
import edu.stanford.nlp.io.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.pipeline.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.semgraph.*;
import edu.stanford.nlp.ling.CoreAnnotations.*;
import edu.stanford.nlp.util.*;

public class PipelineExample {

    public static void main (String[] args) throws IOException {
        // Build a pipeline that tokenizes, splits sentences, and POS-tags.
        Properties props = new Properties();
        props.setProperty("annotators","tokenize, ssplit, pos");
        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
        String text = " I am a sentence.  I am another sentence.";
        Annotation annotation = new Annotation(text);
        pipeline.annotate(annotation);
        // The original text is kept on the annotation.
        System.out.println(annotation.get(TextAnnotation.class));
        // One CoreMap per sentence, each holding its own list of tokens.
        List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
        for (CoreMap sentence : sentences) {
            System.out.println(sentence.get(TokensAnnotation.class));
            for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
                // Surrounding whitespace and character offsets into the original text.
                System.out.println(token.after() != null);
                System.out.println(token.before() != null);
                System.out.println(token.beginPosition());
                System.out.println(token.endPosition());
            }
        }
    }

}
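
For anyone who already has a flat token list and only wants the grouping from the question's split(...) signature, here is a rough sketch of the post-processing idea in plain Java. Everything in it is an assumption for illustration: the class name NaiveSentenceSplitSketch, the textOf accessor, and above all the rule that a sentence ends at a ".", "?" or "!" token, which is far cruder than what the ssplit annotator does. I believe the library's own class for this job is WordToSentenceProcessor in edu.stanford.nlp.process, but check the Javadoc before relying on that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

public class NaiveSentenceSplitSketch {

    // Naive boundary rule (an assumption, not CoreNLP's rule).
    static boolean endsSentence(String tokenText) {
        return tokenText.equals(".") || tokenText.equals("?") || tokenText.equals("!");
    }

    // Same shape as split(...) in the question, generic over the token type.
    // textOf is a hypothetical accessor that returns the token's surface text.
    static <T> List<List<T>> split(List<T> documentTokens, Function<T, String> textOf) {
        List<List<T>> sentences = new ArrayList<>();
        List<T> current = new ArrayList<>();
        for (T token : documentTokens) {
            current.add(token);
            if (endsSentence(textOf.apply(token))) {
                sentences.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sentences.add(current); // trailing tokens with no final punctuation
        }
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
            "I", "am", "a", "sentence", ".", "I", "am", "another", "sentence", ".");
        // Plain strings stand in for tokens here, so the accessor is the identity.
        System.out.println(split(tokens, Function.identity()));
    }
}
```

With CoreLabel tokens the accessor would presumably be CoreLabel::word. The rule above breaks on abbreviations and on quotes or brackets after the final period, which is the non-trivial part the question mentions.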

【Comments】:

Will before(), after(), beginPosition() and endPosition() be implemented in the resulting CoreMaps (i.e., not just return nulls)?
Yes, those are all set.
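
Since those fields are populated, the original text can be rebuilt from the tokens, which is the point of the invertible option. A minimal sketch of that reconstruction, using a hypothetical Token class as a stand-in for CoreLabel (its three fields mirror before(), originalText() and after(); the class and method names here are made up for the example):

```java
import java.util.Arrays;
import java.util.List;

public class RebuildTextSketch {

    // Hypothetical stand-in for CoreLabel: only the three fields used here.
    static final class Token {
        final String before;        // whitespace preceding the token
        final String originalText;  // token exactly as it appeared in the input
        final String after;         // whitespace following the token

        Token(String before, String originalText, String after) {
            this.before = before;
            this.originalText = originalText;
            this.after = after;
        }
    }

    // Leading whitespace of the first token, then each token plus its trailing whitespace.
    static String rebuild(List<Token> tokens) {
        if (tokens.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder(tokens.get(0).before);
        for (Token t : tokens) {
            sb.append(t.originalText).append(t.after);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
            new Token(" ", "I", " "),
            new Token(" ", "am", " "),
            new Token(" ", "here", ""),
            new Token("", ".", "  "),
            new Token("  ", "Bye", ""),
            new Token("", ".", ""));
        System.out.println("[" + rebuild(tokens) + "]"); // prints "[ I am here.  Bye.]"
    }
}
```

Only the first token's before value is used, because each later token's before is the same whitespace as the previous token's after. The same loop run over one sentence's tokens would give that sentence's original text.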