BioNLP Stanford - tokenization

【Question title】: BioNLP Stanford - tokenization
【Posted】: 2016-10-06 04:23:08
【Question】:

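I am trying to tokenize biomedical text, so I decided to use http://nlp.stanford.edu/software/eventparser.shtml. The standalone program RunBioNLPTokenizer does what I need.

Now I want to write my own program that uses the Stanford library, so I read the code of RunBioNLPTokenizer, shown below.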

package edu.stanford.nlp.ie.machinereading.domains.bionlp;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.Collection;
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ie.machinereading.GenericDataSetReader;
import edu.stanford.nlp.ie.machinereading.msteventextractor.DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.EpigeneticsDataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.GENIA11DataSet;
import edu.stanford.nlp.ie.machinereading.msteventextractor.InfectiousDiseasesDataSet;
import edu.stanford.nlp.io.IOUtils;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.util.StringUtils;

/**
 * Standalone program to run our BioNLP tokenizer and save its output
 */
public class RunBioNLPTokenizer extends GenericDataSetReader {

  public static void main(String[] args) throws IOException {
    Properties props = StringUtils.argsToProperties(args);
    String basePath = props.getProperty("base.directory", "/u/nlp/data/bioNLP/2011/originals/");

    DataSet dataset = new GENIA11DataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "genia/training");
    runTokenizerForDirectory(dataset, basePath + "genia/development");
    runTokenizerForDirectory(dataset, basePath + "genia/testing");

    dataset = new EpigeneticsDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "epi/training");
    runTokenizerForDirectory(dataset, basePath + "epi/development");
    runTokenizerForDirectory(dataset, basePath + "epi/testing");

    dataset = new InfectiousDiseasesDataSet();
    dataset.getFilesystemInformation().setTokenizer("stanford");
    runTokenizerForDirectory(dataset, basePath + "infect/training");
    runTokenizerForDirectory(dataset, basePath + "infect/development");
    runTokenizerForDirectory(dataset, basePath + "infect/testing");
  }

  private static void runTokenizerForDirectory(DataSet dataset, String path) throws IOException {
    System.out.println("Input directory: " + path);
    BioNLPFormatReader reader = new BioNLPFormatReader();    
    for (File rawFile : reader.getRawFiles(path)) {
      System.out.println("Input filename: " + rawFile.getName());
      String rawText = IOUtils.slurpFile(rawFile);

      String docId = rawFile.getName().replace("." + BioNLPFormatReader.TEXT_EXTENSION, "");
      String parentPath = rawFile.getParent();

      runTokenizer(dataset.getFilesystemInformation().getTokenizedFilename(parentPath, docId), rawText);
    }
  }

  private static void runTokenizer(String tokenizedFilename, String text) {
    System.out.println("Tokenized filename: " + tokenizedFilename);
    Collection<String> sentences = BioNLPFormatReader.splitSentences(text);

    PrintStream os = null;
    try {
      os = new PrintStream(new FileOutputStream(tokenizedFilename));
    } catch (IOException e) {
      System.err.println("ERROR: cannot save online tokenization to " + tokenizedFilename);
      e.printStackTrace();
      System.exit(1);
    }

    for (String sentence : sentences) {
      BioNLPFormatReader.BioNLPTokenizer tokenizer = new BioNLPFormatReader.BioNLPTokenizer(sentence);
      List<CoreLabel> tokens = tokenizer.tokenize();
      for (CoreLabel l : tokens) {
        os.print(l.word() + " ");
      }
      os.println();
    }
    os.close();
  }
}

I wrote the code below. I managed to split the text into sentences, but I cannot use BioNLPTokenizer the way it is used in RunBioNLPTokenizer.

public static void main(String[] args) throws Exception {
  Collection<String> c = BioNLPFormatReader.splitSentences("..");
  for (String sentence : c) {
    System.out.println(sentence);
    // Fails to compile from outside the class: BioNLPTokenizer has protected access.
    BioNLPFormatReader.BioNLPTokenizer x = new BioNLPFormatReader.BioNLPTokenizer(sentence);
  }
}

I get this error:

Exception in thread "main" java.lang.RuntimeException: Uncompilable source code - edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader.BioNLPTokenizer has protected access in edu.stanford.nlp.ie.machinereading.domains.bionlp.BioNLPFormatReader

My question is: how can I tokenize biomedical sentences with the Stanford library without using RunBioNLPTokenizer?

【Question comments】:

Tags: java nlp stanford-nlp

【Answer 1】:

Unfortunately, we made BioNLPTokenizer a protected inner class, so you will need to edit the source and change its access to public.
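An alternative that avoids editing the source, and that the asker reports worked in the comments, is to reach the protected nested class from a subclass of BioNLPFormatReader. The following is only a minimal sketch of that access pattern. FormatReader, Tokenizer, and MyBioTokenizer are hypothetical stand-ins, not Stanford classes, so that the example compiles without the library, and the whitespace tokenization is a toy. It also assumes the nested class's constructor and tokenize() method are themselves accessible to the subclass.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for BioNLPFormatReader: it owns a protected nested tokenizer.
class FormatReader {
  // Protected nested class: visible to subclasses of FormatReader
  // (and to any class in the same package).
  protected static class Tokenizer {
    private final String text;

    // Assumption: the constructor is accessible to subclasses; here it is public.
    public Tokenizer(String text) {
      this.text = text;
    }

    // Toy whitespace tokenization, used only to make the sketch runnable.
    public List<String> tokenize() {
      List<String> tokens = new ArrayList<>();
      for (String piece : text.trim().split("\\s+")) {
        if (!piece.isEmpty()) {
          tokens.add(piece);
        }
      }
      return tokens;
    }
  }
}

// Hypothetical wrapper: subclassing makes the protected nested class reachable,
// and a public static method exposes it to the rest of the program.
class MyBioTokenizer extends FormatReader {
  public static List<String> tokenize(String sentence) {
    Tokenizer tokenizer = new Tokenizer(sentence);
    return tokenizer.tokenize();
  }
}

class SubclassAccessDemo {
  public static void main(String[] args) {
    for (String token : MyBioTokenizer.tokenize("IL-2 gene expression requires NF-kappa B")) {
      System.out.println(token);
    }
  }
}
```

In a real program the wrapper would extend BioNLPFormatReader and construct BioNLPFormatReader.BioNLPTokenizer the same way RunBioNLPTokenizer does. Because every class in this single-file sketch shares one package, the access restriction itself does not show up here; it only matters when the wrapper lives in a different package from the reader.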

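Note that BioNLPTokenizer may not be the most general-purpose tokenizer for biomedical sentences, so I would spot-check the output to make sure it is reasonable. We developed it heavily against the BioNLP 2009/2011 shared tasks.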
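One possible way to automate part of that spot check is to verify that the tokens, concatenated in order, contain exactly the non-whitespace characters of the input sentence. This is a sketch of one sanity check, not something the tokenizer's authors prescribe. TokenSpotCheck and its methods are hypothetical names, and the token list in main is written by hand for illustration rather than produced by BioNLPTokenizer.

```java
import java.util.Arrays;
import java.util.List;

class TokenSpotCheck {
  // Removes all whitespace so the sentence and its tokens can be compared directly.
  static String stripWhitespace(String s) {
    return s.replaceAll("\\s+", "");
  }

  // Returns true when the tokens, joined in order, contain exactly the
  // non-whitespace characters of the sentence.
  static boolean preservesCharacters(String sentence, List<String> tokens) {
    return stripWhitespace(String.join("", tokens)).equals(stripWhitespace(sentence));
  }

  public static void main(String[] args) {
    String sentence = "IL-2 gene expression requires NF-kappa B.";
    // Hand-written tokens standing in for real tokenizer output.
    List<String> tokens =
        Arrays.asList("IL-2", "gene", "expression", "requires", "NF-kappa", "B", ".");
    System.out.println(preservesCharacters(sentence, tokens) ? "OK" : "CHECK: " + sentence);
  }
}
```

A flagged sentence is only a candidate for manual review. Some tokenizers deliberately rewrite characters (for example, Penn Treebank style escaping of brackets), which would fail this check without being wrong.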

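【Comments】:

Thanks for your answer. I solved the problem (I think): I wrote a class that extends BioNLPFormatReader, and that works for me. I have read that this is a beta version. Does your library have any other tokenizer for biomedical text?
Glad to hear you found a workaround. I would probably say "mostly unmaintained" rather than "beta", since Mihai and I are no longer at Stanford :) Which library do you mean?
Well, I meant the Stanford CoreNLP library. But if you know of anything else for tokenization in the biomedical domain, I would appreciate it. Thanks in advance :)
Check out the toolkits section of this paper (it is mostly about sentence boundary detection, but it also covers some tokenization): ncbi.nlm.nih.gov/pmc/articles/PMC5001746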