【Question title】: OpenNLP sentence training example
【Posted】: 2015-12-24 19:21:55
【Question】:

I am trying to train a new model using the example from the official OpenNLP manual, shown below:


    Charset charset = Charset.forName("UTF-8");
    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);
    ObjectStream sampleStream = new SentenceSampleStream(lineStream);
    SentenceModel model;
    try {
      model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
    } finally {
      sampleStream.close();
    }
    OutputStream modelOut = null;
    try {
      modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
      model.serialize(modelOut);
    } finally {
      if (modelOut != null) 
      modelOut.close();
    }

The problem is on the second line:

    
    ObjectStream lineStream = new PlainTextByLineStream(new FileInputStream("en-sent.train"), charset);

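The documentation says: "Deprecated. Use PlainTextByLineStream(InputStreamFactory, Charset) instead." But I don't know how to use that constructor. I would like an example that uses the non-deprecated constructor with the same corpus file.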

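I have written the following code based on the OpenNLP documentation. It shows both ways of calling the train method: the deprecated one and the one the documentation suggests: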

    Charset charset = Charset.forName("UTF-8");
    InputStreamFactory inputStreamFactory=null;
    ObjectStream<String> lineStream=null;
    ObjectStream<SentenceSample> sampleStream=null;
    SentenceModel model=null;
    OutputStream modelOut = null;
    try{
        inputStreamFactory=InputStreamFactory.class.newInstance();
        lineStream=new PlainTextByLineStream(inputStreamFactory,charset);
        sampleStream = new SentenceSampleStream(lineStream);
        //The deprecated:
        model = SentenceDetectorME.train("en", sampleStream, true, null, TrainingParameters.defaultParams());
        //The suggested:
        model = SentenceDetectorME.train("en", sampleStream, new SentenceDetectorFactory(), new TrainingParameters()); 
    } catch (InstantiationException e2){
        e2.printStackTrace();
    } catch (IllegalAccessException e2){
        e2.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    }finally {
        try{
            sampleStream.close();
        } catch (IOException e){
            e.printStackTrace();
        }
    }
    try {
        modelOut = new BufferedOutputStream(new FileOutputStream(new File("modelFile")));
        model.serialize(modelOut);
    } catch (FileNotFoundException e){
        e.printStackTrace();
    } catch (IOException e){
        e.printStackTrace();
    } finally {
        if (modelOut != null) try{
            modelOut.close();
        } catch (IOException e){
            e.printStackTrace();
        }      
    }

But in this new code I don't know where to supply my corpus data file. Any ideas?

【Question comments】:

Tags: java opennlp training-data sentence


【Solution 1】:

You have to initialize inputStreamFactory with the data file you want, using

    inputStreamFactory = new MarkableFileInputStreamFactory(
            new File("en-sent.train"));
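
To illustrate why the question's `InputStreamFactory.class.newInstance()` cannot work while a concrete factory does, here is a minimal sketch that uses only the JDK. `StreamFactory` is a hypothetical stand-in that mirrors the single-method shape of OpenNLP's `InputStreamFactory`; it is not the library's own type, and the class name, helper methods, and temporary corpus file are all assumptions made for the demonstration. The point is that an interface has no constructor to call reflectively, whereas a factory bound to a file can hand out a fresh stream each time it is asked.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class InputStreamFactoryDemo {

    // Hypothetical stand-in for opennlp.tools.util.InputStreamFactory,
    // declared here so the sketch compiles without OpenNLP on the classpath.
    interface StreamFactory {
        InputStream createInputStream() throws IOException;
    }

    // Counts the lines of a fresh stream obtained from the factory.
    static int countLines(StreamFactory factory, Charset charset) throws IOException {
        int lines = 0;
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(factory.createInputStream(), charset))) {
            while (reader.readLine() != null) {
                lines++;
            }
        }
        return lines;
    }

    // Returns true when reflective instantiation of the interface fails.
    // An interface has no constructor, so there is nothing to instantiate.
    static boolean interfaceCannotBeInstantiated() {
        try {
            StreamFactory.class.getDeclaredConstructor().newInstance();
            return false;
        } catch (ReflectiveOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        // Temporary two-line corpus used only for this demonstration.
        Path corpus = Files.createTempFile("en-sent", ".train");
        Files.write(corpus,
                "First sentence.\nSecond sentence.\n".getBytes(StandardCharsets.UTF_8));

        // A concrete factory bound to the corpus file.
        StreamFactory factory = () -> Files.newInputStream(corpus);

        System.out.println(interfaceCannotBeInstantiated());             // prints true
        System.out.println(countLines(factory, StandardCharsets.UTF_8)); // prints 2
        // The factory opens a new stream on every call, so a second pass works.
        System.out.println(countLines(factory, StandardCharsets.UTF_8)); // prints 2

        Files.delete(corpus);
    }
}
```

In the question's code, the equivalent change is to replace the `newInstance()` line with the `MarkableFileInputStreamFactory` assignment above and pass that factory to `PlainTextByLineStream` as before.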

【Comments】: