[Posted]: 2015-12-09 19:40:41
[Question]:
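I'm trying to scrape the contents of the Google Movies page. I want each theater's name, address, and showtimes. As you can see on the Google Movies page, each block of that information sits in a div with the class theater, and that div contains the theater's name, address, and times.
So I used HtmlUnit to extract the list of theater divs: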
List<HtmlDivision> div = (List<HtmlDivision>) page.getByXPath("//div[@class='theater']");
When I print the contents of the list, I get the expected result:
System.out.println(div.get(0).asText());
Regal Battery Park Stadium 11
102 North End Avenue, New York, NY
1:00 4:10 7:20 10:35pm
Now I want to split this information into name, address, and times. The problem is that when I do this:
System.out.println("Theater " + div.get(0).getByXPath("//div[@class='name']/a/text()"));
the result is the name of every theater on the page:
Theater [Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, AMC Village 7, UA Court Street Stadium 12 & RPX, Cobble Hill Cinemas, AMC Loews 19th St. East 6, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, AMC Newport Centre 11, AMC Loews 34th Street 14, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, Regal Union Square Stadium 14, Cobble Hill Cinemas, Bow Tie Chelsea Cinemas, AMC Newport Centre 11, Regal Battery Park Stadium 11, UA Court Street Stadium 12 & RPX, City Cinemas Village East Cinema, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Pavilion Cinema, AMC Village 7, UA Court Street Stadium 12 & RPX, AMC Loews 19th St. East 6, AMC Newport Centre 11, AMC Loews 34th Street 14, AMC Loews Kips Bay 15, Regal E-Walk Stadium 13 & RPX, Frank Theatres - South Cove Stadium 12]
How can I possibly get every theater when I call getByXPath on an object that doesn't even contain that information?
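For context, in standard XPath an expression that begins with // is absolute: it is evaluated from the root of the document that contains the context node, not from the context node itself. A leading dot (.//) restricts the search to descendants of the context node. The sketch below illustrates that rule with the JDK's built-in javax.xml.xpath on made-up, simplified markup, not with HtmlUnit or the real page. The class name, helper methods, and sample theater names are all hypothetical, and it assumes HtmlUnit's getByXPath follows the same rule.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {
    // Hypothetical, simplified markup standing in for the real page.
    static final String SAMPLE =
        "<html><body>"
        + "<div class='theater'><div class='name'><a>Theater A</a></div></div>"
        + "<div class='theater'><div class='name'><a>Theater B</a></div></div>"
        + "</body></html>";

    // Evaluates expr with the given node as context and returns the
    // values of the selected text nodes.
    public static List<String> select(Node context, String expr) throws Exception {
        XPath xpath = XPathFactory.newInstance().newXPath();
        NodeList nodes = (NodeList) xpath.evaluate(expr, context, XPathConstants.NODESET);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getNodeValue());
        }
        return out;
    }

    // Parses SAMPLE and returns the first theater div in document order.
    public static Node firstTheater() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new InputSource(new StringReader(SAMPLE)));
        XPath xpath = XPathFactory.newInstance().newXPath();
        return (Node) xpath.evaluate("//div[@class='theater']", doc, XPathConstants.NODE);
    }

    public static void main(String[] args) throws Exception {
        Node first = firstTheater();
        // Absolute path: searched from the document root, so both names match.
        System.out.println(select(first, "//div[@class='name']/a/text()"));
        // Relative path: searched only below the first theater div.
        System.out.println(select(first, ".//div[@class='name']/a/text()"));
    }
}
```

If the same rule holds in HtmlUnit, the equivalent change to the call above would be to start the expression with .//div[@class='name'] instead of //div[@class='name'].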
[Comments]:
Tags: java xpath web-crawler htmlunit