【Question title】: Retrieving HTML encoded text from XML using SAXParser
【Posted】: 2010-10-25 04:50:44
【Question description】:

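This is my first time using SAXParser (I'm using it on Android, but I don't think that matters for this particular issue), and I'm trying to read data from an RSS feed. So far it has worked well for the most part, but I run into trouble when it reaches a tag containing HTML-encoded text (e.g. &lt;a href="http://...). The characters() method reads only the &lt; as <, and then treats the next run of characters as a separate entity instead of delivering the whole content at once. I'd rather it just be read in as-is, without actually translating the HTML. The code I'm using for the document handler (shortened) is posted below: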

@Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
        if (localName.equalsIgnoreCase("channel")) {
            inChannel = true;
        }
        if (inChannel) {
            if (newFeed == null) newFeed = new Feed();

            if (localName.equalsIgnoreCase("image")) {
                if (feedImage == null) feedImage = new Image();
                inImage = true;
            }

            if (localName.equalsIgnoreCase("item")) {
                if (newItem == null) newItem = new Item();
                if (itemList == null) itemList = new ArrayList<Item>();
                inItem = true;
            }
        }   
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        if(!inItem) {
            if(!inImage) {
                if(inChannel) {
                    //Reached end of feed
                    if(localName.equalsIgnoreCase("channel")) {
                        newFeed.setItems((ArrayList<Item>)itemList);
                        finalFeed = newFeed;
                        newFeed = null;                     
                        inChannel = false;
                        return;
                    } else if(localName.equalsIgnoreCase("title")) {
                        newFeed.setTitle(currentValue); return;
                    } else if(localName.equalsIgnoreCase("link")) {
                        newFeed.setLink(currentValue); return;
                    } else if(localName.equalsIgnoreCase("description")) {
                        newFeed.setDescription(currentValue); return;
                    } else if(localName.equalsIgnoreCase("language")) {
                        newFeed.setLanguage(currentValue); return;
                    } else if(localName.equalsIgnoreCase("copyright")) {
                        newFeed.setCopyright(currentValue); return;
                    } else if(localName.equalsIgnoreCase("category")) {
                        newFeed.addCategory(currentValue); return;
                    }                       
                }
            }
            else { //is inImage
                //finished with feed image
                if(localName.equalsIgnoreCase("image")) {
                    newFeed.setImage(feedImage);
                    feedImage = null;
                    inImage = false;
                    return;
                } else if (localName.equalsIgnoreCase("url")) {
                    feedImage.setUrl(currentValue); return;
                } else if (localName.equalsIgnoreCase("title")) {
                    feedImage.setTitle(currentValue); return;
                } else if (localName.equalsIgnoreCase("link")) {
                    feedImage.setLink(currentValue); return;
                }
            }
        }
        else { //is inItem
            //finished with news item
            if (localName.equalsIgnoreCase("item")) {
                itemList.add(newItem);
                newItem = null;
                inItem = false;
                return;
            } else if (localName.equalsIgnoreCase("title")) {
                newItem.setTitle(currentValue); return;
            } else if (localName.equalsIgnoreCase("link")) {
                newItem.setLink(currentValue); return;
            } else if (localName.equalsIgnoreCase("description")) {
                newItem.setDescription(currentValue); return;
            } else if (localName.equalsIgnoreCase("author")) {
                newItem.setAuthor(currentValue); return;
            } else if (localName.equalsIgnoreCase("category")) {
                newItem.addCategory(currentValue); return;
            } else if (localName.equalsIgnoreCase("comments")) {
                newItem.setComments(currentValue); return;
            } /*else if (localName.equalsIgnoreCase("enclosure")) {
                 To be implemented later
            }*/ else if (localName.equalsIgnoreCase("guid")) {
                newItem.setGuid(currentValue); return;
            } else if (localName.equalsIgnoreCase("pubDate")) {
                newItem.setPubDate(currentValue); return;
            }           
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        currentValue = new String(ch, start, length);
    }

An example of an RSS feed I'm trying to parse is this one

Any ideas?

【Question discussion】:

    Tags: java android xml rss saxparser


    【Solution 1】:

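    Great. This solution confused me a little, and I couldn't get the value of localName the way you did, but I was still able to get the StringBuilder approach working.

    Rather than replacing this line in the method:

    public void characters(char[] ch, int start, int length) throws SAXException {

    tempVal = new String(ch,start,length);

    I added the following line to the method: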

    tempSB = tempSB.append(new String(ch, start, length));
    

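    where tempSB is a StringBuilder object. This meant I didn't need to change my whole parser and could simply switch to reading the StringBuilder where necessary. When I came to an element containing HTML, in startElement I used: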

    tempSB.delete(0, tempSB.length());
    

    and in endElement I used:

    tempText.setText(tempSB.toString()) ;
    

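    It's that simple. In my case there was no need for a complicated boolean system, and no need to access localName, which is a concept I struggled to understand. I seem to be able to access qName just fine.

    Thanks a lot, kcoppock, for posting the solution you found. I had been searching for hours, and this was the only post I could find that was clear and concise. The task I'm working on is very urgent, and without your help I might have failed.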
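For anyone who wants to try the approach described above in isolation, here is a minimal sketch, not the poster's actual parser. The class name `AccumulatingHandlerDemo`, the helper `parseText`, and the sample XML are made up for illustration. It matches on qName because a parser that is not namespace-aware (the default for `SAXParserFactory` on a desktop JVM) may report an empty localName, which might be what happened above.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class AccumulatingHandlerDemo {

    /** Collects the full text of one element (hypothetical target) across all characters() calls. */
    static class TextHandler extends DefaultHandler {
        private final String targetQName;
        private final StringBuilder tempSB = new StringBuilder();
        String result;

        TextHandler(String targetQName) {
            this.targetQName = targetQName;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) {
            // Empty the buffer whenever a new element starts.
            tempSB.setLength(0);
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Append instead of assign: one element's text may arrive in several chunks.
            tempSB.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            // Only at the end tag is the buffered text known to be complete.
            if (qName.equals(targetQName)) {
                result = tempSB.toString();
            }
        }
    }

    /** Parses the given XML string and returns the text of the last element named targetQName. */
    static String parseText(String xml, String targetQName) throws Exception {
        TextHandler handler = new TextHandler(targetQName);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.result;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<item><description>"
                + "&lt;a href=\"http://example.com\"&gt;link&lt;/a&gt;"
                + "</description></item>";
        System.out.println(parseText(xml, "description"));
    }
}
```

Note that clearing the buffer on every start tag is only safe when the element of interest has no child elements, which is the case for the RSS fields discussed here.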

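    【Discussion】:

    • Glad I could help! Thanks for explaining your improvement, too. :) Good luck with your project.
    【Solution 2】:

    In case it helps anyone, I was able to solve this by using a boolean for each field whose data I'm interested in. I then just keep appending to a StringBuilder until I reach an end tag, at which point I take the StringBuilder value, empty the builder, and set my boolean to false.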

    @Override
        public void startElement(String uri, String localName, String qName, Attributes attrs) throws SAXException {
            sb.delete(0, sb.length());
            if (localName.equalsIgnoreCase("channel")) {
                inChannel = true;
                newFeed = new Feed();
                itemList = new ArrayList<Item>();
            }
            if (inChannel) {            
                if (localName.equalsIgnoreCase("image")) {
                    feedImage = new Image();
                    inImage = true;
                    return;
                }           
                else if (localName.equalsIgnoreCase("item")) {
                    newItem = new Item();
                    inItem = true;
                    return;
                }
    
                if(inImage) { //set booleans for image elements
                    if (localName.equalsIgnoreCase("title")) imgTitle = true;
                    else if (localName.equalsIgnoreCase("link")) imgLink = true;
                    else if (localName.equalsIgnoreCase("url")) imgURL = true;
                    return;
                }           
                else if(inItem) { //set booleans for item elements
                    if (localName.equalsIgnoreCase("title")) iTitle = true;
                    else if (localName.equalsIgnoreCase("link")) iLink = true;
                    else if (localName.equalsIgnoreCase("description")) iDescription = true;
                    else if (localName.equalsIgnoreCase("author")) iAuthor = true;
                    else if (localName.equalsIgnoreCase("category")) iCategory = true;
                    else if (localName.equalsIgnoreCase("comments")) iComments = true;
                    else if (localName.equalsIgnoreCase("guid")) iGuid = true;
                    else if (localName.equalsIgnoreCase("pubdate")) iPubDate= true;
                    else if (localName.equalsIgnoreCase("source")) iSource = true;
                    return;
                } else { //set booleans for channel elements
                    if (localName.equalsIgnoreCase("title")) fTitle = true;
                    else if (localName.equalsIgnoreCase("link")) fLink = true;
                    else if (localName.equalsIgnoreCase("description")) fDescription = true;
                    else if (localName.equalsIgnoreCase("language")) fLanguage= true;
                    else if (localName.equalsIgnoreCase("copyright")) fCopyright = true;
                    else if (localName.equalsIgnoreCase("category")) fCategory = true;
                    return;
                }
            }       
        }
    
        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            if(inChannel) {
                if(inImage) {
                    if (localName.equalsIgnoreCase("title")) {
                        feedImage.setTitle(sb.toString());
                        sb.delete(0, sb.length());
                        imgTitle = false;
                        return;
                    }
                    else if (localName.equalsIgnoreCase("link")) {
                        feedImage.setLink(sb.toString());
                        sb.delete(0, sb.length());
                        imgLink = false;
                        return;
                    }
                    else if (localName.equalsIgnoreCase("url")) {
                        feedImage.setUrl(sb.toString());
                        sb.delete(0, sb.length());
                        imgURL = false;
                        return;
                    }
                    else return;
                } 
                else if(inItem) {
                    if (localName.equalsIgnoreCase("item")) {
                        itemList.add(newItem);
                        newItem = null;
                        inItem = false;
                        return;
                    } else if (localName.equalsIgnoreCase("title")) {
                        newItem.setTitle(sb.toString()); 
                        sb.delete(0, sb.length());
                        iTitle = false;
                        return;
                    } else if (localName.equalsIgnoreCase("link")) {
                        newItem.setLink(sb.toString()); 
                        sb.delete(0, sb.length());
                        iLink = false;
                        return;
                    } else if (localName.equalsIgnoreCase("description")) {
                        newItem.setDescription(sb.toString()); 
                        sb.delete(0, sb.length());
                        iDescription = false;
                        return;
                    } else if (localName.equalsIgnoreCase("author")) {
                        newItem.setAuthor(sb.toString()); 
                        sb.delete(0, sb.length());
                        iAuthor = false;
                        return;
                    } else if (localName.equalsIgnoreCase("category")) {
                        newItem.addCategory(sb.toString()); 
                        sb.delete(0, sb.length());
                        iCategory = false;
                        return;
                    } else if (localName.equalsIgnoreCase("comments")) {
                        newItem.setComments(sb.toString());
                        sb.delete(0, sb.length());
                        iComments = false;
                        return;
                    } /*else if (localName.equalsIgnoreCase("enclosure")) {
                         To be implemented later
                    }*/ else if (localName.equalsIgnoreCase("guid")) {
                        newItem.setGuid(sb.toString()); 
                        sb.delete(0, sb.length());
                        iGuid = false;
                        return;
                    } else if (localName.equalsIgnoreCase("pubDate")) {
                        newItem.setPubDate(sb.toString()); 
                        sb.delete(0, sb.length());
                        iPubDate = false;
                        return;
                    }
                } 
                else {
                    if(localName.equalsIgnoreCase("channel")) {
                        newFeed.setItems((ArrayList<Item>)itemList);
                        finalFeed = newFeed;
                        newFeed = null;                     
                        inChannel = false;
                        return;
                    } else if(localName.equalsIgnoreCase("title")) {
                        newFeed.setTitle(sb.toString());
                        sb.delete(0, sb.length());
                        fTitle = false;
                        return;
                    } else if(localName.equalsIgnoreCase("link")) {
                        newFeed.setLink(sb.toString());
                        sb.delete(0, sb.length());
                        fLink = false;
                        return;
                    } else if(localName.equalsIgnoreCase("description")) {
                        newFeed.setDescription(sb.toString());
                        sb.delete(0, sb.length());
                        fDescription = false;
                        return;
                    } else if(localName.equalsIgnoreCase("language")) {
                        newFeed.setLanguage(sb.toString());
                        sb.delete(0, sb.length());
                        fLanguage = false;
                        return;
                    } else if(localName.equalsIgnoreCase("copyright")) {
                        newFeed.setCopyright(sb.toString());
                        sb.delete(0, sb.length());
                        fCopyright = false;
                        return;
                    } else if(localName.equalsIgnoreCase("category")) {
                        newFeed.addCategory(sb.toString());
                        sb.delete(0, sb.length());
                        fCategory = false;
                        return;
                    }
                }
            }
        }
    
        @Override
        public void characters(char[] ch, int start, int length) {
            // characters() may be called several times per element, so append rather than assign
            sb.append(ch, start, length);
        }
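
To see why appending is needed at all, the small sketch below records every chunk the parser hands to characters(). The names `ChunkDemo`, `ChunkRecorder`, and `record` are hypothetical. How the text is split depends on the parser implementation, so the exact chunks are not guaranteed; the only thing to rely on is that joining all the chunks gives the complete text, while the last chunk alone (what the original `currentValue = new String(...)` kept) may be just a fragment.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ChunkDemo {

    /** Records each chunk delivered to characters() and also accumulates them. */
    static class ChunkRecorder extends DefaultHandler {
        final List<String> chunks = new ArrayList<>();
        final StringBuilder sb = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            chunks.add(new String(ch, start, length)); // what a single call sees
            sb.append(ch, start, length);              // what accumulation sees
        }
    }

    /** Parses the XML string and returns the recorder so the chunks can be inspected. */
    static ChunkRecorder record(String xml) throws Exception {
        ChunkRecorder handler = new ChunkRecorder();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler;
    }

    public static void main(String[] args) throws Exception {
        ChunkRecorder r = record(
                "<description>&lt;a href=\"#\"&gt;text&lt;/a&gt;</description>");
        // The chunk boundaries are parser-dependent; the accumulated text is not.
        System.out.println("chunks:      " + r.chunks);
        System.out.println("last chunk:  " + r.chunks.get(r.chunks.size() - 1));
        System.out.println("accumulated: " + r.sb);
    }
}
```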
    

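    【Discussion】:

      【Solution 3】:

      Special characters like these are wrapped in CDATA sections. You need to check whether they are preserved so that the SAX parser can then handle them correctly.

      【Discussion】: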
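
As a rough illustration of the CDATA point, the sketch below feeds the same markup to a SAX parser twice, once entity-escaped and once inside a CDATA section. The class `CdataDemo`, the helper `textOf`, and both sample strings are assumptions for the example, not taken from the feed in the question. Either way the content reaches the handler through characters(), so it still has to be accumulated.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class CdataDemo {

    /** Returns all character data in the document, accumulated across characters() calls. */
    static String textOf(String xml) throws Exception {
        final StringBuilder sb = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                sb.append(ch, start, length);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        // Same HTML fragment, once escaped with entities and once wrapped in CDATA.
        String escaped = "<description>&lt;a href=\"#\"&gt;x&lt;/a&gt;</description>";
        String cdata = "<description><![CDATA[<a href=\"#\">x</a>]]></description>";

        System.out.println(textOf(escaped));
        System.out.println(textOf(cdata));
        System.out.println(textOf(escaped).equals(textOf(cdata)));
    }
}
```

Both inputs should yield the same string, which suggests the feed's encoding style matters less than making sure the handler appends every chunk.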
