Web scraping with R XML package - xpathSapply

【Question title】: Web scraping with R XML package - xpathSapply
【Posted】: 2020-09-09 07:28:37
【Question】:

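I am trying to extract all the shopping mall names (e.g. Cityplaza, Fashion Walk) from this website: https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city.html

Looking at the HTML, the mall names all appear to be stored under "h5" tags. I therefore tried the following code to extract them, but it does not return the text I want.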

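library(RCurl)  # provides getURL()
library(XML)    # provides htmlParse() and xpathSApply()

url <- "https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city.html"
txt <- getURL(url)
PARSED <- htmlParse(txt, asText = TRUE)
mall_text <- xpathSApply(PARSED, "//h5", xmlValue)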

I assume the problem is the path I pass to xpathSApply, since I know very little about HTML. Can anyone help?

【Comments】:

Tags: html r xml web-scraping

【Solution 1】:

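The mall recommendations are loaded dynamically, so unfortunately they cannot be retrieved this way: the HTML returned by getURL does not contain them, whatever XPath you use.
If you right-click the page in your browser, choose "Inspect Element", open the "Network" tab and refresh the page, you can see a number of JSON/XHR requests being made.

One of those requests goes to the URL used in the code below. Its response contains the information you want, in JSON format.

This can be loaded into R easily with the jsonlite package.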

library(jsonlite)

# Endpoint found in the browser's Network tab; it returns the mall tiles as JSON
url <- "https://www.discoverhongkong.com/eng/explore/shopping/major-shopping-malls-throughout-city/_jcr_content/root/responsivegrid/dhkContainer/container/recommendationtiles_.recommendation-tiles.recommendationtiles_.json?path=/content/dhk/intl/en/explore/shopping/major-shopping-malls-throughout-city"
result <- read_json(url)
# Each element of result$data describes one mall; keep only its title
sapply(result$data, function(x) x$title)

This gives:

 [1] "Cityplaza"                "Fashion Walk"            
 [3] "Horizon Plaza"            "Hysan Place"             
 [5] "ifc mall"                 "Island Beverley"         
 [7] "LANDMARK"                 "Lee Garden One - Six"    
 [9] "Lee Theatre and Leighton" "Lee Tung Avenue"         
[11] "Pacific Place"            "Peak Galleria"           
[13] "SOGO Causeway Bay Store"  "Times Square"            
[15] "Western Market"           "WTC"
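
For readers who want to see why indexing `result$data` and then `$title` works, here is a minimal offline sketch. The JSON string is a made-up stand-in for the real response: only the `data` array and the `title` field are taken from the answer above, and the `summary` field and its values are purely hypothetical. `vapply` is used in place of `sapply` only so that the result is guaranteed to be a character vector.

```r
library(jsonlite)

# Hypothetical payload: only "data" and "title" mirror the real response;
# "summary" and its values are invented for illustration.
sample_json <- '{
  "data": [
    {"title": "Cityplaza",    "summary": "placeholder text"},
    {"title": "Fashion Walk", "summary": "placeholder text"}
  ]
}'

# parse_json() keeps arrays as lists by default, which is the same nested
# list structure that read_json(url) returns for a remote document.
result <- parse_json(sample_json)

# Pull one string out of each list element.
titles <- vapply(result$data, function(x) x$title, character(1))
print(titles)
```

If the real response nests the titles differently, inspecting it with `str(result, max.level = 2)` should show which fields are available.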

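【Discussion】:

Thanks, this is very helpful! How did you know the mall text was loaded dynamically? Is there something in the HTML that indicates it?
You can see the text in your browser but not in the result of getURL(url), which means it must be loaded dynamically. That is the simplest way to check.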
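
The check described in the last comment can be sketched in a few lines of base R. The helper name `appears_in_source` and the two HTML strings are hypothetical and only illustrate the idea: search the raw page source for a string that you can see in the browser.

```r
# Hypothetical helper: TRUE if `needle` occurs verbatim in the raw HTML string.
# fixed = TRUE treats the needle as plain text rather than a regular expression.
appears_in_source <- function(html, needle) {
  grepl(needle, html, fixed = TRUE)
}

# Invented examples: a page that ships the text in its HTML, and a page that
# only ships an empty container to be filled in later by JavaScript.
static_html  <- "<html><body><h5>Fashion Walk</h5></body></html>"
dynamic_html <- "<html><body><div id='tiles'></div><script src='app.js'></script></body></html>"

in_static  <- appears_in_source(static_html,  "Fashion Walk")  # TRUE
in_dynamic <- appears_in_source(dynamic_html, "Fashion Walk")  # FALSE
print(c(static = in_static, dynamic = in_dynamic))
```

On the real page the same test would be `appears_in_source(getURL(url), "Fashion Walk")`. A FALSE result there suggests the text arrives through a later request, which is the point at which the browser's Network tab becomes useful.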