Web scraping tables in R

【Question title】: Web scraping tables in R
【Posted】: 2021-08-04 09:08:54
【Question】:

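Complete newbie here, trying to scrape the table on this page. The furthest I have gotten is loading the rvest package. My problems are:

I can't find the right element. The one I tried from the inspector is "table.w782.comm.lsjz", but it returns a list of length 0, and adding %>% .[[1]] after html_table(), i.e. fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table() %>% .[[1]], does not work either

(Error in .[[1]] : subscript out of bounds)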

fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)
fund_table <- fund_page %>% html_nodes("table.w782.comm.lsjz") %>% html_table()

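The table has multiple pages (113), but clicking page 2 does not reload the HTML, so I have no idea how to get all 113 pages of data scraped into one table...

I would really appreciate any pointers on what I can do...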

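【Comments】:

fundf10.eastmoney.com/… Found a simpler version of the site, but still... none of the code works
The table can't be found because, technically, no table exists in the page source. Instead, the source contains a script that creates the table. I know it looks like "basically the same thing", but it isn't. You first have to clean the source so that only a table is left, with no {{if}} statements or script information.
I thought it was a problem on my end, because Power Query worked and pulled a table out of the page, so it had to be there!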

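Tags: r web-scraping

【Solution 1】:

In your original question, the problem is that the table appears as a script rather than as a valid xml/html table. Using the API link you found is definitely the way to go.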
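To see why the original selector comes back empty, here is a small offline sketch. The HTML string is a made-up miniature of the page (only the `lsjzTable` id and the class names are taken from the question), so treat it as an illustration rather than the site's real markup. Markup inside a `<script>` element is kept as plain text by the parser, so the table selector should match nothing while the script selector still finds the template.

```r
library(rvest)

# Hypothetical, simplified stand-in for the fund page: the table markup
# lives inside a <script type="text/html"> template.
html <- paste0(
  '<html><body>',
  '<script id="lsjzTable" type="text/html">',
  '<table class="w782 comm lsjz"><tr><th>x</th></tr></table>',
  '</script>',
  '</body></html>'
)

doc <- read_html(html)

# Count how many nodes each selector matches.
n_tables    <- length(html_nodes(doc, "table.w782.comm.lsjz"))
n_templates <- length(html_nodes(doc, "script#lsjzTable"))

cat("table nodes:", n_tables, "\n")
cat("script template nodes:", n_templates, "\n")
```

Running the same two counts against the real `fund_page` is a quick way to check whether a table is present in the downloaded source or only in a template.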

library(rvest)

# You gave an API link and this is the best option for getting the data.
fund_link <- "https://fundf10.eastmoney.com/F10DataApi.aspx?type=lsjz&code=510300&page=1&sdate=2019-01-01&edate=2021-02-13&per=40"
fund_page <- read_html(fund_link)

# Any of these seem to work
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.comm") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.lsjz") %>% html_table() %>% .[[1]]
fund_table <- fund_page %>% html_nodes(css = "table.w782.comm.lsjz") %>% html_table() %>% .[[1]]


# Original Question:
fund_link <- "https://fundf10.eastmoney.com/jjjz_510300.html"
fund_page <- read_html(fund_link)

# The following doesn't work because the table you want is actually a script, not a table.
# <script id="lsjzTable" type="text/html">
#   {{if Data && Data.LSJZList}}
# <table class="w782 comm lsjz">
#   <thead>
#   <tr>
#   <th class="first">净值日期</th>
#   {{if ((Data.FundType!="004" && Data.FundType!="005") || "510300"=="511880")}}
# <th>单位净值</th>
#   <th>累计净值</th>
#   {{if Data.FundType=="100"}}
# <th>周增长率</th>
#   {{else}}
# <th>日增长率<img id="jjjzTip" style="position: relative; top: 3px; left: 3px;" data-html="true" data-placement="bottom" title="日增长率为空原因如下:<br>1、非交易日净值不参与日增长率计算(灰色数据行)。<br>2、上一交易日净值未披露,日增长率无法计算。" src="//j5.dfcfw.com/image/201307/20130708102440.gif"></th>
#   {{/if}}
fund_table <- fund_page %>% html_nodes(css = "table") %>% html_table() %>% .[[1]]

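# The following is a partial solution but isn't fully working.
# It strips the {{...}} template tags and the <script> wrapper, then parses
# what is left as HTML. The template presumably holds no data rows, so at
# best this recovers the table layout.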
fund_table <- fund_page %>% 
  html_nodes("script#lsjzTable") %>%
  as.character(.) %>%
  stringr::str_remove_all("\\{\\{.+?\\}\\}") %>%
  stringr::str_remove_all("\\<\\/?script.*?\\>") %>%
  read_html() %>%
  html_nodes("table") %>%
  html_table()
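
For the multi-page part of the question, one possible approach is to loop over the `page` parameter of the API link and stack the results. This is only a sketch: `build_fund_url()` and `scrape_fund_pages()` are hypothetical helper names, and it assumes that `page` selects the page of results, that every page returns the same columns, and that you already know how many pages to request. The `fetch` argument exists only so the function can be tried on local HTML.

```r
library(rvest)

# Build the API URL for one page. The query parameters mirror the link used
# above; that `page` selects the page of results is an assumption.
build_fund_url <- function(page, code = "510300",
                           sdate = "2019-01-01", edate = "2021-02-13",
                           per = 40) {
  sprintf(
    paste0("https://fundf10.eastmoney.com/F10DataApi.aspx",
           "?type=lsjz&code=%s&page=%d&sdate=%s&edate=%s&per=%d"),
    code, as.integer(page), sdate, edate, as.integer(per)
  )
}

# Fetch each page, take its first table, and stack the results by row.
# `pause` is a delay in seconds between requests, to go easy on the server.
scrape_fund_pages <- function(pages, fetch = read_html, pause = 1) {
  tables <- lapply(pages, function(p) {
    page_tables <- fetch(build_fund_url(p)) %>%
      html_nodes("table") %>%
      html_table()
    Sys.sleep(pause)
    # Skip pages where no table was found instead of failing on [[1]].
    if (length(page_tables) == 0) NULL else page_tables[[1]]
  })
  do.call(rbind, tables)
}

# Example call (needs network access, so it is left commented out):
# all_rows <- scrape_fund_pages(1:3)
```

Checking `length(page_tables)` before indexing avoids the "subscript out of bounds" error from the question when a page comes back without a table.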

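【Comments】:

>Using the API link you found is definitely the way to go. Thanks so much for this, man. It's really reassuring to hear that and not have to keep trying with the old link.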