【Question title】: R - Scraping an HTML table with rvest when there are missing <tr> tags
【Posted】: 2015-09-08 11:17:57
【Question】:

I'm trying to scrape an HTML table from a website using rvest. The only problem is that the table I'm trying to scrape has no opening <tr> tags except on the first row. It looks like this:

<tr> 
  <td>6/21/2015 9:38 PM</td>
  <td>5311 Lake Park</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was causing a disturbance in the area.</td>
  <td>Name checked; no further action</td>
  <td>No</td>
</tr>

  <td>6/21/2015 10:37 PM</td>
  <td>5200 S Blackstone</td>
  <td>UCPD</td>
  <td>African American</td>
  <td>Male</td>
  <td>Subject was observed fighting in the McDonald's parking lot</td>
  <td>Warned; released</td>
  <td>No</td>
</tr>

And so on. As a result, the following code only gets the first row into my data frame:

library(rvest)
mydata <- html_session("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015") %>%
    html_node("table") %>%
    html_table(header = TRUE, fill=TRUE)

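How can I change this so that html_table understands that the rows are rows, even though they have no opening <tr> tag? Or is there a better way to approach this?

【Comments】:

  • Why not first replace every closing </tr> with </tr><tr>, and then remove the last trailing <tr>?

Tags: html r html-table rvest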


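【Solution 1】:
library(rvest)

# Parse the page once and reuse the parsed document
url_parse <- read_html("https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015")

# Column names come from the <th> cells
col_name <- url_parse %>%
  html_nodes("th") %>%
  html_text()

# Collect every <td> as text, ignoring the broken <tr> structure
mydata <- url_parse %>%
  html_nodes("td") %>%
  html_text()

# Refill the flat cell vector row by row; 7 matches the seven column headers
finaldata <- data.frame(matrix(mydata, ncol = 7, byrow = TRUE))

names(finaldata) <- col_name

finaldata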

                         Incident                                  Location        Reported                              Occurred
1                           Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM 5/31/15 to 6/1/15 8:00 PM to 12:00 PM
2                     Information                          5835 S. Kimbark   6/1/15 3:57 PM                        6/1/15 3:55 PM
3                     Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM                        6/2/15 2:18 AM
4 Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM                        6/2/15 8:00 AM
5     Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM             6/2/15 6:45 PM to 7:30 PM
                                                                                                                   Comments / Nature of Fire Disposition
1                                                                                       Bicycle secured to bike rack taken by unknown person        Open
2             Unknown person used staff member's personal information to file a fraudulent claim with U.S. Social Security Admin. / CPD case         CPD
3 Three unaffiliated individuals reported tampering with bicycles in bike rack / Subjects were given trespass warnings and sent on their way      Closed
4                                                                      Rear wiper blade assembly damaged on UC owned vehicle during car wash      Closed
5                                                           Unknown person(s) spray painted graffiti on north concrete wall of the structure        Open
  UCPDI#
1 E00344
2 E00345
3 E00346
4 E00347
5 E00348
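
The reshaping step above works only while the number of <td> cells is an exact multiple of the number of columns; otherwise matrix() recycles values and raises nothing stronger than a warning. A minimal base R sketch of the same idea with an explicit check might look like this. The sample cells, the three column names, and the helper name cells_to_df are made up for illustration and are not part of the answer.

```r
# Hypothetical flat cell vector, as html_text() on all <td> nodes might return
cells <- c("Theft", "1115 E. 58th St.", "Open",
           "Information", "5835 S. Kimbark", "CPD")
col_names <- c("Incident", "Location", "Disposition")

# Hypothetical helper: reshape a flat vector of cells into a data frame
cells_to_df <- function(cells, col_names) {
  n_col <- length(col_names)
  # Guard: matrix() would otherwise recycle values with only a warning
  if (length(cells) %% n_col != 0) {
    stop("cell count is not a multiple of the column count")
  }
  # byrow = TRUE fills one table row at a time, matching document order
  df <- data.frame(matrix(cells, ncol = n_col, byrow = TRUE),
                   stringsAsFactors = FALSE)
  names(df) <- col_names
  df
}

incidents <- cells_to_df(cells, col_names)
print(incidents)
```

Failing loudly here is a design choice: a silently recycled cell would shift every later value into the wrong column.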

【Comments】:

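    【Solution 2】:

    Slightly different from @user227710's approach, but roughly the same. Again, this relies on the number of TDs per row being uniform.

    However, this also gets all the incidents (it rbinds every page into a single incidents data frame).

    pblapply just gives you a progress bar, since this takes a few seconds. It is completely unnecessary unless you are in an interactive session.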

    library(rvest)
    library(stringr)
    library(dplyr)
    library(pbapply)
    
    url <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
    pg <- read_html(url)
    
    pg %>% 
      html_nodes("li.page-count") %>% 
      html_text() %>% 
      str_trim() %>% 
      str_split(" / ") %>%
      unlist %>% 
      as.numeric %>% 
      .[2] -> total_pages
    
    pblapply(1:(total_pages), function(j) {
    
      # get "column names"
      # NOTE that you get legit column names for use with "regular" 
      # data frames this way
    
      pg %>% 
        html_nodes("thead > tr > th") %>% 
        html_text() %>% 
        make.names -> tcols
    
      # get all the TDs
    
      pg %>% 
        html_nodes("td") %>%
        as_list() -> tds
    
      # how many rows do we have? (shld be 5, but you never know)
    
      trows <- length(tds) / 7
    
      # the basic idea is to grab all the TDs for each row
      # then cbind them together and then rbind the whole thing
      # while keeping decent column names
    
      bind_rows(lapply(1:trows, function(i) {
        setNames(cbind.data.frame(lapply(1:7, function(j) { 
          html_text(tds[[(i-1)*7 + j]])
        }), stringsAsFactors=FALSE), tcols)
      })) -> curr_tbl
    
      # get next url
    
      pg %>% 
        html_nodes("li.next > a") %>% 
        html_attr("href") -> next_url
    
      if (j < total_pages) {
        pg <<- read_html(sprintf("https://incidentreports.uchicago.edu/%s", next_url))
      }
    
      curr_tbl
    
    }) %>% bind_rows -> incidents
    
    incidents
    
    ## Source: local data frame [62 x 7]
    ## 
    ##                            Incident                                  Location        Reported
    ## 1                             Theft       1115 E. 58th St. (Walker Bike Rack) 6/1/15 12:18 PM
    ## 2                       Information                          5835 S. Kimbark   6/1/15 3:57 PM
    ## 3                       Information                  1025 E. 58th St. (Swift)  6/2/15 2:18 AM
    ## 4   Non-Criminal Damage to Property                850 E. 63rd St. (Car Wash)  6/2/15 8:48 AM
    ## 5       Criminal Damage to Property 5631 S. Cottage Grove (Parking Structure)  6/2/15 7:32 PM
    ## 6  Information / Aggravated Robbery                4701 S. Ellis (Public Way)  6/3/15 2:11 AM
    ## 7                     Lost Property           5800 S. University  (Main Quad)  6/3/15 8:30 AM
    ## 8       Criminal Damage to Property         5505 S. Ellis (Parking Structure) 5/29/15 5:00 PM
    ## 9       Information / Armed Robbery        6300 S. Cottage Grove (Public Way)  6/3/15 2:33 PM
    ## 10                    Lost Property                1414 E. 59th St. (I-House)  6/3/15 2:28 PM
    ## ..                              ...                                       ...             ...
    ## Variables not shown: Occurred (chr), Comments...Nature.of.Fire (chr), Disposition (chr), UCPDI. (chr)
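
The page-count step in the code above splits a label on " / " and takes the second piece. A small base R sketch of that parsing, with a check for labels that do not match, could look like the following. The label text "1 / 13" and the helper name parse_total_pages are assumptions for illustration; the real page's wording may differ.

```r
# Hypothetical label text, assumed to have the form "current / total"
page_label <- "  1 / 13 "

# Hypothetical helper: extract the total page count from the label
parse_total_pages <- function(label) {
  parts <- strsplit(trimws(label), "/", fixed = TRUE)[[1]]
  # parts[2] is NA when there is no "/", so the check below catches that too
  total <- suppressWarnings(as.integer(trimws(parts[2])))
  if (is.na(total)) {
    stop("could not read a page total from: ", label)
  }
  total
}

total_pages <- parse_total_pages(page_label)
print(total_pages)
```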
    

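    【Comments】:

      【Solution 3】:

      Thanks everyone! I ended up getting some help offline from another R user, who suggested the following solution. It fetches the HTML, saves it, adds the missing <tr> tags (much like @Bram Vanroy suggested), then converts it back into an HTML object that can be scraped into a data frame.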

      library(rvest)
      myurl <- "https://incidentreports.uchicago.edu/incidentReportArchive.php?startDate=06/01/2015&endDate=06/21/2015"
      download.file(myurl, destfile="myfile.html", method="curl")
      myhtml <- readChar("myfile.html", file.info("myfile.html")$size)
      myhtml <- gsub("</tr>", "</tr><tr>", myhtml, fixed = TRUE)
      mydata <- read_html(myhtml)  # html() is deprecated in newer rvest; read_html() replaces it
      
      mydf <- mydata %>%
        html_node("table") %>%
        html_table(fill = TRUE)
      
      mydf <- na.omit(mydf)
      

      The last line drops some odd NA rows that show up with this method.
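
The tag repair can also be tried offline on a small string, which makes it easier to see what each substitution does. The sketch below is a reduced, made-up version of the broken markup, not the real page, and it adds the second step from the comment on the question (removing the trailing <tr>). The regular expression in that step assumes the dangling tag sits directly before </table>.

```r
library(rvest)

# Hypothetical miniature of the broken markup: rows after the header
# have a closing </tr> but no opening <tr>
raw_html <- paste0(
  "<table>",
  "<tr><th>Location</th><th>Disposition</th></tr>",
  "<td>5311 Lake Park</td><td>Name checked</td></tr>",
  "<td>5200 S Blackstone</td><td>Warned; released</td></tr>",
  "</table>"
)

# Step 1: open a new row right after every closing tag
fixed_html <- gsub("</tr>", "</tr><tr>", raw_html, fixed = TRUE)
# Step 2: drop the dangling <tr> left just before </table>
fixed_html <- sub("<tr>\\s*</table>", "</table>", fixed_html)

# Parse the repaired string and read the table as usual
tbl <- read_html(fixed_html) %>%
  html_node("table") %>%
  html_table(header = TRUE)

print(tbl)
```

Working on the string in memory also avoids the temporary file, although saving the download can still be handy for debugging.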

      【Comments】:
