【Question title】: Java website parser
【Posted】: 2014-10-29 21:05:42
【Question】:

I am trying to parse the following line from a site:

<div class="search-result__price">£2,995</div>

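I only want the 2995 part, but I am struggling to get it. Here is my code; it can currently parse every line that contains the £ sign and display all the currency values on the site. Please help!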

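import java.io.IOException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Scanner;

public class parser {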

    private static String string1 = "&pound";
    private String testURL = "http://www.autotrader.co.uk/search/used/cars/bmw/1_series/postcode/tn126bg/radius/1500/onesearchad/used%2Cnearlynew%2Cnew/quicksearch/true/page/2";
    private ArrayList<String> list = new ArrayList<String>();
    private ArrayList<Integer> prices = new ArrayList<Integer>();
    private int averagePrice;
    private int start;
    private int finish;

    public parser() throws IOException {

        URL url = new URL(testURL);
        Scanner scan = new Scanner(url.openStream());
        boolean alreadyHit = false;

        while (scan.hasNext()) {

            String line = scan.nextLine();

            if (line.contains(string1)) {

                list.add(line);

                start = line.indexOf("&pound;");
                line = line.substring(start);
                for (int i = 0; i < line.length(); i++) {

                    if (((line.charAt((i)) == ' ') || ((line.charAt((i)) == '<'))) && (alreadyHit == false)) {
                        finish = i;
                        alreadyHit = true;
                    }
                }
                alreadyHit = false;

                line = line.substring(0, finish);
                line = line.trim();
                line = line.replace("&pound;", "");
                line = line.replace(",", "");

                try {

                    int price = Integer.parseInt(line);
                    prices.add(price);
                } catch (Exception e) {

                }
            }
        }
    }

    public static void main(String args[]) throws IOException {

        parser p = new parser();

        for (Integer x : p.prices) {

            System.out.println(x);
        }
    }
}

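【Comments】:

  • If it can currently parse every line on the site and display the currency values, then what is the problem? Or did you mean "cannot"? If so, what is it doing instead?
  • +1 to what @Qix just said. Using regex to parse a non-regular language leads to madness.

Tags: java string parsing arraylist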


【Solution 1】:

You should use something like jsoup to get clean HTML content, rather than reading line by line with a Scanner or using regular expressions (!):

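import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

Document doc = Jsoup
    .connect(testURL)
    .userAgent("Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
    .timeout(60000).get(); // timeout is in milliseconds
// No space after "div": match <div> elements that carry the class themselves.
Elements elems = doc.select("div.search-result__price");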
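Selecting the elements still leaves the text "£2,995" rather than the number 2995 the question asks for. Below is a minimal sketch of that last step, using plain Java so it runs without jsoup. The class name PriceText, the helper parsePrice, and the sample strings are hypothetical; the strings stand in for what calling text() on each entry of elems would return. It assumes prices are whole pounds, since a value with pence such as "£2,995.50" would have its decimal point dropped as well.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalInt;

class PriceText {

    // Hypothetical helper: turns text such as "£2,995" into 2995.
    // Assumes whole-pound prices; a decimal point would be stripped too.
    static OptionalInt parsePrice(String text) {
        // Drop everything that is not an ASCII digit (currency sign, commas, spaces).
        String digits = text.replaceAll("[^0-9]", "");
        if (digits.isEmpty()) {
            return OptionalInt.empty(); // no number in this text
        }
        try {
            return OptionalInt.of(Integer.parseInt(digits));
        } catch (NumberFormatException e) {
            // More digits than fit in an int.
            return OptionalInt.empty();
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the text of each matched element.
        String[] texts = {"\u00A32,995", "\u00A312,450", "POA"};
        List<Integer> prices = new ArrayList<>();
        for (String text : texts) {
            OptionalInt price = parsePrice(text);
            if (price.isPresent()) {
                prices.add(price.getAsInt());
            }
        }
        System.out.println(prices); // prints [2995, 12450]
    }
}
```

The regular expression here only cleans up a short text value that has already been extracted; the HTML itself is still left to the parser.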

【Comments】:
