[Posted]: 2017-04-29 22:37:48
[Question]:
I have a text file containing the following sample UTF-8 text:
ኣእምሮኣዊ/ADJ ጥዕና/N ።/PUN
ቅድሚ/PRE ብዙሕ/ADJ ዓመታት/N “/PUN ኣእምሮኣዊ/ADJ ስንክልና/N ብጋኔን/N ወይ/CON እከይ/ADJ መናፍስቲ/N ኢዩ/V_AUX ዝመጽእ/V_REL “/PUN ዝብል/V_REL ግጉይ/ADJ ኣመለኻኽታ/N ነይሩ/V_GER ።/PUN
ከም/CON ውጺኢቱ/N ድማ/CON ኣእምሮኣዊ/ADJ ስንክልና/N ዘጋጠሞም/ADJ ኣባላት/N ናይ/PRE ሓደ/NUM ሕብረተ-ሰብ/N ብኣሰቃቕን/ADJ ኢሰብኣውን/ADJ ኣገባብ/N ይተሓዙ/V_IMF ነይሮም/V_AUX ።/PUN
The LingPipe implementation of the HMM part-of-speech tagger for the Brown Corpus:
The BrownPosCorpus class reads the zipped POS corpus as follows:
public class BrownPosCorpus implements PosCorpus {

    private final File mBrownZipFile;

    public BrownPosCorpus(File brownZipFile) {
        mBrownZipFile = brownZipFile;
    }

    public Parser<ObjectHandler<Tagging<String>>> parser() {
        return new BrownPosParser();
    }

    public Iterator<InputSource> sourceIterator() throws IOException {
        return new BrownSourceIterator(mBrownZipFile);
    }

    static class BrownSourceIterator extends Iterators.Buffered<InputSource> {
        private ZipInputStream mZipIn = null;

        public BrownSourceIterator(File brownZipFile) throws IOException {
            FileInputStream fileIn = new FileInputStream(brownZipFile);
            mZipIn = new ZipInputStream(fileIn);
        }

        public InputSource bufferNext() {
            ZipEntry entry = null;
            try {
                while ((entry = mZipIn.getNextEntry()) != null) {
                    if (entry.isDirectory()) continue;
                    String name = entry.getName();
                    if (name.equals("brown/CONTENTS")
                        || name.equals("brown/README")) continue;
                    return new InputSource(mZipIn);
                }
            } catch (IOException e) {
                // ignore and close and return null
            }
            Streams.closeQuietly(mZipIn);
            return null;
        }
    }
}
The BrownPosParser.java class parses the zipped Brown POS corpus as follows:
public class BrownPosParser
    extends StringParser<ObjectHandler<Tagging<String>>> {

    @Override
    public void parseString(char[] cs, int start, int end) {
        String in = new String(cs,start,end-start);
        String[] sentences = in.split("\n");
        for (int i = 0; i < sentences.length; ++i)
            if (!Strings.allWhitespace(sentences[i]))
                processSentence(sentences[i]);
    }

    public String normalizeTag(String rawTag) {
        String tag = rawTag;
        String startTag = tag;
        // remove plus, default to first
        int splitIndex = tag.indexOf('+');
        if (splitIndex >= 0)
            tag = tag.substring(0,splitIndex);
        int lastHyphen = tag.lastIndexOf('-');
        if (lastHyphen >= 0) {
            String first = tag.substring(0,lastHyphen);
            String suffix = tag.substring(lastHyphen+1);
            if (suffix.equalsIgnoreCase("HL")
                || suffix.equalsIgnoreCase("TL")
                || suffix.equalsIgnoreCase("NC")) {
                tag = first;
            }
        }
        int firstHyphen = tag.indexOf('-');
        if (firstHyphen > 0) {
            String prefix = tag.substring(0,firstHyphen);
            String rest = tag.substring(firstHyphen+1);
            if (prefix.equalsIgnoreCase("FW")
                || prefix.equalsIgnoreCase("NC")
                || prefix.equalsIgnoreCase("NP"))
                tag = rest;
        }
        // neg last, and only if not whole thing
        int negIndex = tag.indexOf('*');
        if (negIndex > 0) {
            if (negIndex == tag.length()-1)
                tag = tag.substring(0,negIndex);
            else
                tag = tag.substring(0,negIndex)
                    + tag.substring(negIndex+1);
        }
        // multiple runs to normalize
        return tag.equals(startTag) ? tag : normalizeTag(tag);
    }

    private void processSentence(String sentence) {
        String[] tagTokenPairs = sentence.split(" ");
        List<String> tokenList = new ArrayList<String>(tagTokenPairs.length);
        List<String> tagList = new ArrayList<String>(tagTokenPairs.length);
        for (String pair : tagTokenPairs) {
            int j = pair.lastIndexOf('/');
            String token = pair.substring(0,j);
            String tag = normalizeTag(pair.substring(j+1));
            tokenList.add(token);
            tagList.add(tag);
        }
        Tagging<String> tagging
            = new Tagging<String>(tokenList,tagList);
        getHandler().handle(tagging);
    }
}
The problem is that parsing the UTF-8 corpus fails with the following error; the critical failure is in BrownPosParser.java:
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
The full stack trace is as follows:
C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags>ant eval-brown
Buildfile: C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build.xml
compile:
[javac] Compiling 11 source files to C:\Lingpipe-Ver-4.1.2\Experiments\NER\posTags\build\classes
eval-brown:
[java] COMMAND PARAMETERS:
[java] Sent eval rate=5
[java] Toks before eval=1000000
[java] Max n-best eval=32
[java] Max n-gram=8
[java] Num chars=128
[java] Lambda factor=8.0
[java] Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: -1
[java] at java.lang.String.substring(String.java:1967)
[java] at BrownPosParser.processSentence(BrownPosParser.java:72)
[java] at BrownPosParser.parseString(BrownPosParser.java:20)
[java] at com.aliasi.corpus.StringParser.parse(StringParser.java:71)
[java] at EvaluatePos.parseCorpus(EvaluatePos.java:123)
[java] at EvaluatePos.run(EvaluatePos.java:75)
[java] at EvaluatePos.main(EvaluatePos.java:183)
[java] Java Result: 1
Which part of the code should I modify to parse a UTF-8 POS corpus correctly?
Any help is much appreciated.
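The trace above ends in String.substring called from processSentence, and "String index out of range: -1" is what substring(0, j) reports when j is -1, that is, when lastIndexOf('/') found no slash in a piece. That can happen whenever split(" ") yields an empty or slash-less piece (two consecutive spaces, a trailing '\r' from Windows line endings, a stray fragment). A minimal sketch of a more defensive split follows. The class SafePairSplit and its Pairs holder are hypothetical names used only for illustration, not part of LingPipe, and tag normalization is left out.

```java
import java.util.ArrayList;
import java.util.List;

class SafePairSplit {

    /** Parallel token and tag lists for one sentence. */
    static final class Pairs {
        final List<String> tokens = new ArrayList<String>();
        final List<String> tags = new ArrayList<String>();
    }

    // Hypothetical helper: splits "token/TAG token/TAG ..." without
    // assuming that every piece contains a slash.
    static Pairs split(String sentence) {
        Pairs out = new Pairs();
        // trim() drops a trailing '\r'; the regex collapses runs of whitespace
        for (String pair : sentence.trim().split("\\s+")) {
            int j = pair.lastIndexOf('/');
            // j == -1: no slash; j == 0: empty token; slash at the end: empty tag
            if (j <= 0 || j == pair.length() - 1) continue;
            out.tokens.add(pair.substring(0, j));
            out.tags.add(pair.substring(j + 1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two tagged pieces from the sample, given as Unicode escapes,
        // with a double space and a trailing carriage return added.
        Pairs p = split("\u1325\u12D5\u1293/N  \u1362/PUN \r");
        System.out.println(p.tags); // [N, PUN]
    }
}
```

Silently skipping malformed pieces is one design choice; logging them or throwing an exception that names the offending line would make corpus errors easier to find.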
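[Comments]:
-
What exactly is your question?
-
What is the question? What is the connection between your text file and these two classes? What went wrong, and what did you expect?
-
This is a nice example of Ge'ez script (used in Ethiopia). But I can't see any question in your post. What do you want to know?
-
What is InputSource?
-
The problem is that these two classes use the Latin-1 character set. I can't parse the UTF-8 characters in the text file. Which part of the code should I modify to parse a UTF-8 POS corpus correctly?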
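Whether the posted classes really decode as Latin-1 is not confirmed by the code shown: BrownSourceIterator hands a raw byte stream to new InputSource(mZipIn) and never names a charset, so the decoding is left to whatever consumes the InputSource. The sketch below, using only the JDK, shows why the charset matters when bytes become characters. DecodeDemo and decode are hypothetical names for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class DecodeDemo {

    // Reads the whole stream into a String using an explicit charset.
    static String decode(InputStream in, Charset cs) {
        try {
            Reader reader = new InputStreamReader(in, cs);
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) sb.append(buf, 0, n);
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // One tagged token from the sample, given as Unicode escapes so
        // that this file's own encoding does not affect the result.
        String original = "\u1325\u12D5\u1293/N";
        byte[] bytes = original.getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(new ByteArrayInputStream(bytes),
                               StandardCharsets.UTF_8);
        String asLatin1 = decode(new ByteArrayInputStream(bytes),
                                 StandardCharsets.ISO_8859_1);

        System.out.println(asUtf8.equals(original));   // true
        System.out.println(asLatin1.equals(original)); // false
        System.out.println(asLatin1.length());         // 11, one char per byte
    }
}
```

org.xml.sax.InputSource also offers setEncoding(String) and a constructor that takes a Reader, so one option to try in bufferNext() is to set "UTF-8" on the InputSource or wrap mZipIn in an InputStreamReader with StandardCharsets.UTF_8. That LingPipe's parser honors either setting is an assumption here and should be checked against the LingPipe documentation.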
Tags: java parsing utf-8 substring tagging