【Question title】: HtmlUnit Scraping XPath from Div
【Posted】: 2015-12-09 19:40:41
【Question】:

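I'm trying to scrape the Google Movies page, and I want each theater's name, address, and showtimes. As you can see on the Google Movies page, each block of that information sits in a div with the class theater, and that div contains the theater's name, address, and times.

So I used HtmlUnit to extract the list of theater divs: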

List<HtmlDivision> div =  (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");

When I print an item from the list, I get the expected result:

System.out.println(div.get(0).asText());

Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00 4:10 7:20 10:35pm

Now I want to split this information into name, address, and times. The problem is that when I do this:

System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));

the result is the name of every theater on the page:

Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]

If I call getByXPath on an object that doesn't even contain that information, how can I possibly get all the theaters?

【Discussion】:

    Tags: java xpath web-crawler htmlunit


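    【Solution 1】:

    You need to add a dot (.) at the start of the XPath to make it relative to the current context element, which here is the first div (div.get(0)). Without the dot, a path beginning with // ignores the context element and searches for matching elements from the document root: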

    div.get(0).getByXPath(".//div[@class='name']/a/text()")
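
    The difference between the two expressions can be reproduced without HtmlUnit. The following is a minimal sketch that uses the JDK's built-in javax.xml.xpath instead, on the assumption that the same XPath 1.0 context rules apply. The class name RelativeXPathDemo, the helper methods, and the two-theater sample markup are hypothetical and are not taken from the real page.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {

    // Hypothetical, well-formed stand-in for the real page markup.
    public static final String SAMPLE =
        "<html><body>"
        + "<div class='theater'><div class='name'><a>Regal Battery Park Stadium 11</a></div></div>"
        + "<div class='theater'><div class='name'><a>Cobble Hill Cinemas</a></div></div>"
        + "</body></html>";

    // Parses the sample string into a DOM document.
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));
    }

    // Evaluates the expression with the given node as context
    // and returns the text of every matching node.
    public static List<String> select(Node context, String expression) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expression, context, XPathConstants.NODESET);
        List<String> result = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            result.add(nodes.item(i).getTextContent());
        }
        return result;
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse(SAMPLE);
        NodeList theaters = (NodeList) XPathFactory.newInstance().newXPath()
            .evaluate("//div[@class='theater']", doc, XPathConstants.NODESET);
        Node first = theaters.item(0);

        // Starts at the document root, so the context node is ignored.
        // prints [Regal Battery Park Stadium 11, Cobble Hill Cinemas]
        System.out.println(select(first, "//div[@class='name']/a/text()"));

        // Starts at the context node, so only its descendants match.
        // prints [Regal Battery Park Stadium 11]
        System.out.println(select(first, ".//div[@class='name']/a/text()"));
    }
}
```

    To handle every theater rather than only the first, the same relative expression would be evaluated once per element of the theater list inside a loop.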
    

    【Comments】:

    • Exactly what I was looking for! Thanks.