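[Posted]: 2011-12-22 07:44:22
[Question]:
I need to extract one tag (which contains multiple children) from a page, and then split the retrieved text at the tags that contain a run of asterisks (*). The asterisk tags themselves should be removed, and the remaining parts should be stored in a String array.
I previously used http://htmlparser.sourceforge.net/, which works well for extracting the text of a specific tag.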
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import android.content.Context;
import android.util.Log;

import org.htmlparser.Attribute;
import org.htmlparser.Parser;
import org.htmlparser.Tag;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

public class ToeGuideParser extends NodeVisitor {
    private static final String TAG = "ToeGuideParser";
    final String url = "http://p7510.teamovercome.net/?page_id=18";
    private String Guide;
    Context context;
    int tag_number = 0;

    public ToeGuideParser() throws ParserException {
        this(null);
    }

    public ToeGuideParser(Context context) throws ParserException {
        // Store the caller's context in the field.
        this.context = context;
        long bfr = startStopWatch();
        Parser parser = new Parser(url);
        parser.visitAllNodesWith(this);
        stopStopWatch(bfr);
    }

    @Override
    public void visitTag(Tag tag) {
        String tagName = tag.getTagName();
        String content = tag.toPlainTextString();
        //Log.d(TAG, tagName);
        if (tagName.equalsIgnoreCase("div")) {
            Attribute attr = tag.getAttributeEx("class");
            if (attr != null) {
                String value = attr.getValue();
                // Constant-first equals() also covers a null attribute value.
                if ("entry-content".equals(value)) {
                    // Save the HTML of the guide container.
                    Guide = tag.toHtml(true);
                    int guide_start = tag.getStartingLineNumber();
                    int guide_end = tag.getEndingLineNumber();
                    Log.d(TAG, "Guide starts at " + guide_start + " and ends at " + guide_end);
                    //Log.d(TAG, Guide);
                }
            }
        }
        // Note: toPlainTextString() includes the text of all children, so every
        // ancestor of a separator tag also passes this check.
        if (content.contains("*****")) {
            tag_number++;
            int start = tag.getStartingLineNumber();
            int end = tag.getEndingLineNumber();
            Log.d(TAG, tag_number + " = Tag found at " + start + ", ends at " + end);
        }
    }

    private void split(String bfrSplit) {
        if (bfrSplit != null) {
            //Log.d(TAG, bfrSplit);
            // "\\1" is the regex backreference \1; a bare "\1" in a Java string
            // literal is an octal character escape. [^>]* keeps the attribute
            // match inside a single opening tag.
            Pattern pattern = Pattern.compile("<([A-Z][A-Z0-9]*)\\b[^>]*>\\s*[*]+\\s*</\\1>", Pattern.CASE_INSENSITIVE);
            Matcher matcher = pattern.matcher(bfrSplit);
            while (matcher.find()) {
                Log.d(TAG, "Start index: " + matcher.start());
                Log.d(TAG, " End index: " + matcher.end() + " ");
                Log.d(TAG, matcher.group());
            }
        }
    }

    @Override
    public void finishedParsing() {
        //split(Guide);
        Log.w(TAG, "#########");
        Log.w(TAG, "finished");
    }

    public long startStopWatch() {
        return System.currentTimeMillis();
    }

    public String stopStopWatch(long bfr) {
        long time = System.currentTimeMillis() - bfr;
        String formattedTime = "Time Taken: " + time + " milli's";
        Log.i(TAG, formattedTime);
        return formattedTime;
    }
}
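Problems with this code:
- The line numbers returned are completely wrong. (Matches are even reported at line numbers before the guide content starts.)
- The regular expression never matches, although I tested it in a regex tester against the page's source. I only tried the regex because the htmlparser-based code did not work.
Log output to illustrate: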
D/ToeGuideParser( 2146): 1 = Tag found at 11, ends at 11
D/ToeGuideParser( 2146): 2 = Tag found at 201, ends at 201
D/ToeGuideParser( 2146): 3 = Tag found at 202, ends at 202
D/ToeGuideParser( 2146): 4 = Tag found at 237, ends at 237
D/ToeGuideParser( 2146): 5 = Tag found at 238, ends at 238
D/ToeGuideParser( 2146): 6 = Tag found at 239, ends at 239
D/ToeGuideParser( 2146): Guide starts at 248 and ends at 248
D/ToeGuideParser( 2146): 7 = Tag found at 248, ends at 248
D/ToeGuideParser( 2146): 8 = Tag found at 261, ends at 261
D/ToeGuideParser( 2146): 9 = Tag found at 261, ends at 261
D/ToeGuideParser( 2146): 10 = Tag found at 280, ends at 280
D/ToeGuideParser( 2146): 11 = Tag found at 280, ends at 280
D/ToeGuideParser( 2146): 12 = Tag found at 307, ends at 307
D/ToeGuideParser( 2146): 13 = Tag found at 318, ends at 318
D/ToeGuideParser( 2146): 14 = Tag found at 322, ends at 322
D/ToeGuideParser( 2146): 15 = Tag found at 328, ends at 328
D/ToeGuideParser( 2146): 16 = Tag found at 350, ends at 350
D/ToeGuideParser( 2146): 17 = Tag found at 367, ends at 367
D/ToeGuideParser( 2146): 18 = Tag found at 376, ends at 376
W/ToeGuideParser( 2146): #########
W/ToeGuideParser( 2146): finished
I/ToeGuideParser( 2146): Time Taken: 1021 milli's
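On the regex that never matches: one likely cause is that a bare "\1" inside a Java string literal is an octal character escape (U+0001), not the regex backreference \1, so it has to be written as "\\1". The sketch below shows the split into a String array with plain java.util.regex. It assumes each separator is a single element whose only text is a run of asterisks, such as <p>*****</p>; the real page markup is not shown in the question, and the names GuideSplitter and splitGuide are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class GuideSplitter {
    // Assumption: a separator is one element containing only asterisks,
    // e.g. <p>*****</p> or <P class="sep">***</P>.
    // "\\1" is the backreference to the tag name captured by group 1;
    // [^>]* keeps the attribute match inside a single opening tag.
    private static final Pattern SEPARATOR = Pattern.compile(
            "<([A-Z][A-Z0-9]*)\\b[^>]*>\\s*[*]+\\s*</\\1>",
            Pattern.CASE_INSENSITIVE);

    // Splits the HTML at every separator element, drops the separators,
    // and returns the non-empty, trimmed pieces.
    public static String[] splitGuide(String html) {
        List<String> parts = new ArrayList<>();
        for (String piece : SEPARATOR.split(html)) {
            String trimmed = piece.trim();
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Made-up sample input, not taken from the real page.
        String sample = "<p>Intro</p><p>*****</p><p>Step one</p>"
                + "<P class=\"sep\">***</P><p>Step two</p>";
        for (String part : splitGuide(sample)) {
            System.out.println(part);
        }
    }
}
```

With this shape, the commented-out split(Guide) call in finishedParsing() could be replaced by something like String[] sections = GuideSplitter.splitGuide(Guide), provided the separators really have the assumed form. Regex-based splitting of HTML remains fragile when separators are nested or carry extra markup.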
[Comments]:
- Have you considered using jsoup? jsoup.org
- I have used it in several projects. It is clearly the best parser I have used.
Tags: java android parsing html-parsing