Parsing an HTML table with missing attributes

【Question title】: parsing HTML table with missing attributes
【Posted】: 2013-12-12 02:28:50
【Question】:

I want to create a data.frame in R from the table found at http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more

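It consists of three columns. Each of the first two columns may or may not contain a flag image, and the third column is text. An extract is: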

<span class="listings">
  <table>
    <tr>
     <td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
     <td></td>
     <td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/70187567">1000         Ways to Die - Season 3</a> (2010)</b>&nbsp;&nbsp;<i style="font-size:small"> 3.6 stars, 1 Season&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=1000 Ways to Die - Season 3">imdb</a></i>
     </td>
    </tr> 
    <tr>
      <td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
      <td><img class="flag" src="http://bit.ly/WXvnLp" /></td>
      <td><b><a target="_blank" href="http://movies.netflix.com/WiMovie/100_Below_Zero/70273426?trkid=1889703">100 Below Zero</a> (2013)</b>&nbsp;&nbsp;<i style="font-size:small"> 2.8 stars, 1hr 28m&nbsp;&nbsp;<a target="_blank" href="http://www.imdb.com/search/title?title=100 Below Zero">imdb</a></i></td>
    </tr>    
 </table>
</span>

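So here the first row has an image in the first column only, and the second row has one in both. I can extract the text and the image URLs, but I cannot match them up in a way that accounts for the missing data. This is what I have done so far. The URL refers to the site above, and I show only an extract of the results.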

library(XML)
myURL <- "http://netflixcanadavsusa.blogspot.ca/2013/11/alphabetical-list-k-4-am-fri-nov-22-2013.html#more"


basicInfo <- htmlParse(myURL, isURL = TRUE)

### text
df <- readHTMLTable(myURL, header = c("flag1", "flag2", "movie"), stringsAsFactors = FALSE)[[1]]
head(df,2)
# V1 V2                                                             V3
# 1       1000 Ways to Die - Season 3 (2010)   3.6 stars, 1 Season  imdb
# 2                     100 Below Zero (2013)   2.8 stars, 1hr 28m  imdb    

### images
xpathSApply(basicInfo, "//*/span[@class='listings']/table/tr/td/img/@src")
#                   src                    src                    src                    
#"http://bit.ly/Y9CbVZ" "http://bit.ly/Y9CbVZ" "http://bit.ly/WXvnLp"

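So I have the images, but I do not know which row or column each one applies to. In this problem each column can only contain one specific image, so knowing whether it occurs is enough. A more general case might have a different src from row to row.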
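To make the goal concrete, the result for the two-row extract above might look like the hand-built data.frame below. This is only an illustration of the target shape, not parser output. The column names `flag1`, `flag2` and `movie` are taken from the `header` argument used above, and the movie text is shortened to the title and year.

```r
# Hand-built illustration of the target for the two-row extract.
# Nothing here is parsed from the page; the values are typed in from the extract.
target <- data.frame(
  flag1 = c(TRUE, TRUE),    # both rows have an image in the first cell
  flag2 = c(FALSE, TRUE),   # only the second row has one in the second cell
  movie = c("1000 Ways to Die - Season 3 (2010)", "100 Below Zero (2013)"),
  stringsAsFactors = FALSE
)
print(target)
```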

TIA

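【Comments】:

As always when someone asks about parsing HTML: stackoverflow.com/questions/1732348/…
Thanks. That link is the length of a novella! Any chance of something more specific?
Just a warning to be very careful when trying to parse HTML files :-)
OK. But I have used the functions above many times before without any problem. I just have not run into this specific issue.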

Tags: html r xml-parsing

【Solution 1】:

Here is how I did it. It is a bit long, but it gets the job done.

library(XML)
## myURL is the page address defined in the question
basicInfo <- htmlParse(myURL, isURL = TRUE, encoding = 'UTF-8')

## for some reason the data is divided between 2 html tags
rows1 <- xpathSApply(basicInfo, "//*/span[@class='listings']/table/tr")
rows2 <- xpathSApply(basicInfo, "//*/span[@id='listings']/*/tr")
## for each row in the list, create a small xml document and
## collect all of its tds
ll <- lapply(c(rows1, rows2), function(x) xpathSApply(xmlDoc(x), '//*/td'))
ull <- unlist(ll)
## function to parse the src of the img tag from a td;
## if the td does not contain an img it returns NA
parse.img <- function(x) {
  res <- xpathSApply(xmlDoc(x), '//img', xmlGetAttr, 'src')
  if (length(res) == 0) NA else res[[1]]
}

## the flattened list holds three tds per row, so every third element
## starting at position 1, 2 or 3 belongs to the same column
col1 <- unlist(lapply(ull[c(TRUE, FALSE, FALSE)], parse.img))
col2 <- unlist(lapply(ull[c(FALSE, TRUE, FALSE)], parse.img))
## the third column contains text, so use xmlValue to extract it
col3 <- unlist(lapply(ull[c(FALSE, FALSE, TRUE)],
               function(x) xpathSApply(xmlDoc(x), '//td', xmlValue)))

res <- data.frame(col1,col2,col3)

head(res)

                  col1                 col2                                                                          col3
1 http://bit.ly/Y9CbVZ                 <NA>                1000 Ways to Die - Season 3 (2010)   3.6 stars, 1 Season  imdb
2 http://bit.ly/Y9CbVZ                 <NA>                1000 Ways to Die - Season 3 (2010)   3.6 stars, 1 Season  imdb
3 http://bit.ly/Y9CbVZ http://bit.ly/WXvnLp                              100 Below Zero (2013)   2.8 stars, 1hr 28m  imdb
4 http://bit.ly/Y9CbVZ http://bit.ly/WXvnLp 100 Ghost Street: The Return of Richard Speck (2012)   3 stars, 1hr 23m  imdb
5                 <NA> http://bit.ly/WXvnLp                              100 Million BC (2008)   2.8 stars, 1hr 25m  imdb
6                 <NA> http://bit.ly/WXvnLp                           100 Years Of Evil (2012)   2.7 stars, 1hr 19m  imdb
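As a possible variation on the same idea, the cells can also be addressed per row with relative XPath expressions, which avoids copying each node into its own document with xmlDoc. The sketch below is a minimal example under stated assumptions, not a replacement for the code above. It runs on a trimmed inline copy of the two-row extract from the question rather than the live page, and `cell_src` and `res2` are hypothetical names introduced only for this illustration.

```r
library(XML)

# Trimmed inline copy of the two-row extract, so no network access is needed.
html <- '<span class="listings"><table>
<tr><td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td><td></td>
    <td><b>1000 Ways to Die - Season 3 (2010)</b></td></tr>
<tr><td><img class="flag" src="http://bit.ly/Y9CbVZ" /></td>
    <td><img class="flag" src="http://bit.ly/WXvnLp" /></td>
    <td><b>100 Below Zero (2013)</b></td></tr>
</table></span>'
doc <- htmlParse(html, asText = TRUE)

# One node per table row.
rows <- getNodeSet(doc, "//span[@class='listings']/table/tr")

# Hypothetical helper: src of the image in the i-th cell of a row,
# or NA when that cell holds no img.
cell_src <- function(row, i) {
  src <- xpathSApply(row, sprintf("./td[%d]/img", i), xmlGetAttr, "src")
  if (length(src) == 0) NA_character_ else src[[1]]
}

res2 <- data.frame(
  col1 = vapply(rows, cell_src, character(1), i = 1),
  col2 = vapply(rows, cell_src, character(1), i = 2),
  # text of the third cell of each row
  col3 = vapply(rows, function(r) xmlValue(getNodeSet(r, "./td[3]")[[1]]),
                character(1)),
  stringsAsFactors = FALSE
)
print(res2)
```

Because every lookup starts from the row node, an empty cell simply yields NA for that row and column, so rows and images stay aligned. If only presence matters, `!is.na(res2$col1)` gives a logical column.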

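【Comments】:

+1 Thanks for taking the time to solve this. It looks good, but I will do a bit more work to check whether any follow-up is needed before accepting. I think the data is divided because there is a further URL containing a subset of the data, hence the #more.
@pssguy I don't understand what you mean. What is wrong with this solution?
Nothing at all. I just had to get started using it. Thanks again.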