【Question title】: Add null elements for non-existing elements
【Posted】: 2014-09-16 14:51:52
【Question】:

I want to parse data from a web page that delivers divs like the following:

<div class="InseratDaten">
    <div class="Art">Rent</div>
    <div class="Ort">TestCity 3., Roads Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 950 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 72 m²</div>
    <div class="Zimmer"><span class='Label'>Zimmer:</span> 3</div>
</div>

However, sometimes the structure looks different, for example:

<div class="InseratDaten">
    <div class="Art">Rent</div>
    <div class="Ort">Test 3., Road Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 919 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 84 m²</div>
    <div class="Zimmer"><span class='Label'>Zimmer:</span> 3</div>
    <div class="EigTitel">weitere Eigenschaften:</div>
    <div class='EigListe'>Shower, Balcony, Dog</div>
</div>

<div class="InseratDaten">
    <div class="Art">Sale</div>
    <div class="Ort">Test 4., Road Street</div>
    <div class="Preis"><span class='Label'>Miete:</span> 919 EUR</div>
    <div class="Groesse"><span class='Label'>Fläche:</span> 84 m²</div>
</div>

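As you can see, the later snippets are either extended by <div class="EigTitel"> and <div class='EigListe'>, or are missing some elements (the last one has no <div class="Zimmer">).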

Currently I am parsing my data like this:

    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String html = htmlParseData.getHtml();
        Document doc = Jsoup.parseBodyFragment(html);
        Elements title = doc.select("div[class=Title]");
        Elements art = doc.select("div[class=Art]");
        Elements location = doc.select("div[class=Ort]");
        Elements price = doc.select("div[class=Preis]");
        Elements size = doc.select("div[class=Groesse]");
        Elements numberOfRooms = doc.select("div[class=Zimmer]");
        Elements furtherProperties = doc.select("div[class=EigListe]");

        /**
         * get each element as List
         */
        if (!(art.isEmpty()) && !(location.isEmpty()) && !(title.isEmpty()) && !(price.isEmpty())) {
            //iterate over art cause all elems have the same size
            titleList = new ArrayList<String>();
            artList = new ArrayList<String>();
            locationList = new ArrayList<String>();
            priceList = new ArrayList<String>();
            sizeList = new ArrayList<String>();
            numberOfRoomsList = new ArrayList<String>();
            furtherPropertiesList = new ArrayList<String>();

            //price
            for (Element element : price) {
                priceList.add(element.text().toString());
            }
            //size
            for (Element element : size) {
                sizeList.add(element.text().toString());
            }
            //numberOfRooms
            for (Element element : numberOfRooms) {
                numberOfRoomsList.add(element.text().toString());
            }
            //furtherProperties
            for (Element element : furtherProperties) {
                furtherPropertiesList.add(element.text().toString());
            }
            //location
            for (Element element : location) {
                locationList.add(element.text().toString());
            }   
            //art
            for (Element element : art) {
                artList.add(element.text().toString());
            }
            //title
            for (Element element : title) {
                titleList.add(element.text().toString());
            }

            log.info(ListstoString());

            //add everything to the main domain List
            for (int i = 0; i < locationList.size(); i++) {
                Property prop = new Property();
                //price
                prop.setPrice(priceList.get(i));
                //size
                prop.setSize(sizeList.get(i));
                //number of rooms
                prop.setNumberOfRooms(numberOfRoomsList.get(i));
                //furtherProperties
                prop.setFurtherProperties(furtherPropertiesList.get(i));
                //location
                prop.setLocation(locationList.get(i));
                //art
                prop.setTransactionType(artList.get(i));
                //title
                prop.setTitle(titleList.get(i));
                //set date
                prop.setCrawlingDate(new Date());
                list.add(prop);
            }
            log.info(list.toString());
        }
    }

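My problem is that in some cases my lists end up with different lengths because data can be missing, so reading index i from the shorter lists fails. The list sizes look like this: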

[sizeList=16, priceList=16, locationList=16, numberOfRoomsList=12, furtherPropertiesList=12]

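I would like to put null elements in the positions where a div has no such property, so that my data stays consistent. I guess this comes down to getting jsoup to put empty elements there? Any ideas how to achieve this?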

Thanks a lot for your answers!

【Comments】:

    Tags: java arraylist web-scraping jsoup web-crawler


    【Solution 1】:

    You can create the lists with a predefined size, for example:

    titleList = Arrays.asList(new String[locationList.size()]);
    

    Then set the elements by index:

    for (int i = 0; i < title.size(); i++) {
      titleList.set(i, title.get(i).text().toString());
    }
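
To make the behavior of the fixed-size list concrete, here is a minimal sketch that uses only java.util. The class name `FixedSizeListDemo`, the helper names, and the sample values are hypothetical and not part of the original answer. `Arrays.asList(new String[n])` gives a list of n null entries that supports `set` but not `add` or `remove`. Note that the values are written to positions 0..k-1 in document order, so this alone does not know which listing a value belongs to.

```java
import java.util.Arrays;
import java.util.List;

class FixedSizeListDemo {

    /** Returns a fixed-size list of the given length in which every slot starts as null. */
    public static List<String> nullFilled(int size) {
        return Arrays.asList(new String[size]);
    }

    /** Copies values into the front of a null-filled list; the remaining slots stay null. */
    public static List<String> padWithNulls(List<String> values, int size) {
        List<String> padded = nullFilled(size);
        for (int i = 0; i < values.size() && i < size; i++) {
            padded.set(i, values.get(i)); // set() is allowed on a fixed-size list
        }
        return padded;
    }

    public static void main(String[] args) {
        // Hypothetical sample: three listings, but only two of them had a "Zimmer" div.
        List<String> rooms = padWithNulls(Arrays.asList("Zimmer: 3", "Zimmer: 3"), 3);
        System.out.println(rooms);
    }
}
```

With lists padded this way, `get(i)` no longer throws for the shorter lists, but the null always sits at the end rather than at the listing that actually lacked the div.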
    

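    【Comments】:

    • Thanks for your answer! However, I think I would lose the order of the elements this way? How can I set just the unavailable elements to null?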
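
One way to address the ordering concern raised in the comment is to stop selecting over the whole document and instead iterate over each `div.InseratDaten` container, looking up the child divs inside that container only. The following is a sketch under assumptions, not the answerer's method: the class `ListingParser`, the helper `textOrNull`, and the row layout are hypothetical names introduced for illustration. It relies on `Elements.first()` returning null when nothing matched. The `Title` div is left out because the sample HTML does not show where it lives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ListingParser {

    /** Returns the text of the first match inside the listing, or null if no such div exists. */
    public static String textOrNull(Element listing, String cssQuery) {
        Element match = listing.select(cssQuery).first(); // null when nothing matched
        return match == null ? null : match.text();
    }

    /**
     * Parses each div.InseratDaten separately. Every row has the layout
     * {art, location, price, size, rooms, furtherProperties}, with null for missing divs.
     */
    public static List<String[]> parseListings(String html) {
        Document doc = Jsoup.parseBodyFragment(html);
        List<String[]> rows = new ArrayList<>();
        for (Element listing : doc.select("div.InseratDaten")) {
            rows.add(new String[] {
                textOrNull(listing, "div.Art"),
                textOrNull(listing, "div.Ort"),
                textOrNull(listing, "div.Preis"),
                textOrNull(listing, "div.Groesse"),
                textOrNull(listing, "div.Zimmer"),
                textOrNull(listing, "div.EigListe")
            });
        }
        return rows;
    }

    public static void main(String[] args) {
        // Two listings taken from the question; the second one has no "Zimmer" div.
        String html =
            "<div class=\"InseratDaten\"><div class=\"Art\">Rent</div>"
            + "<div class=\"Ort\">TestCity 3., Roads Street</div>"
            + "<div class=\"Zimmer\"><span class='Label'>Zimmer:</span> 3</div></div>"
            + "<div class=\"InseratDaten\"><div class=\"Art\">Sale</div>"
            + "<div class=\"Ort\">Test 4., Road Street</div></div>";
        for (String[] row : parseListings(html)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

Because every row is built from a single listing, a missing div becomes null in exactly that listing's row, and the rows stay in page order. In the question's code, each row could be mapped onto one `Property` through its setters instead of being kept as a `String[]`.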