【Question title】: Parse HTML data using R
【Posted】: 2017-06-02 02:45:20
【Question description】:

I have an HTML data set like the one below, and I want to parse it and convert it into a table format I can work with.

<!DOCTYPE html>
<html>

<head>
    <title>Page Title</title>
</head>

<body>
    <div class="brewery" id="brewery">
        <ul class="vcard simple">
            <li class="name"> Bradley Farm / RB Brew, LLC</li>
            <li class="address">317 Springtown Rd </li>
            <li class="address_2">New Paltz, NY 12561-3020 | <a href='http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (845) 255-8769</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
    <div class="brewery">
        <ul class="vcard simple">
            <li class="name">(405) Brewing Co</li>
            <li class="address">1716 Topeka St </li>
            <li class="address_2">Norman, OK 73069-8224 | <a href='http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States' target='_blank'>Map</a> </li>
            <li class="telephone">Phone: (405) 816-0490</li>
            <li class="brewery_type">Type: Micro</li>
            <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
        </ul>
        <ul class="vcard simple col2"></ul>
    </div>
</body>

Below is the code I used. The problem I'm facing is that rvest converts the page to plain text, and I can't seem to get that into any useful format.

library(dplyr)
library(rvest)

url<-html("beer.html")
selector_name<-".brewery"
fnames<-html_nodes(x = url, css = selector_name) %>%
html_text()
head(fnames)
fnames

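Is this the right approach, or should I use another package to iterate over each div and its inner elements?

The output I would like to see is: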

No.  Name  Address Type Website

Thanks.

【Question discussion】:

    Tags: html r web-scraping rvest


    【Solution 1】:
    library(rvest)
    library(dplyr)
    
    html_file <- '<!DOCTYPE html>
    <html>
    
    <head>
        <title>Page Title</title>
    </head>
    
    <body>
        <div class="brewery" id="brewery">
            <ul class="vcard simple">
                <li class="name"> Bradley Farm / RB Brew, LLC</li>
                <li class="address">317 Springtown Rd </li>
                <li class="address_2">New Paltz, NY 12561-3020 | <a href="http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States" target="_blank">Map</a> </li>
                <li class="telephone">Phone: (845) 255-8769</li>
                <li class="brewery_type">Type: Micro</li>
                <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
            </ul>
            <ul class="vcard simple col2"></ul>
        </div>
        <div class="brewery">
            <ul class="vcard simple">
                <li class="name">(405) Brewing Co</li>
                <li class="address">1716 Topeka St </li>
                <li class="address_2">Norman, OK 73069-8224 | <a href="http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States" target="_blank">Map</a> </li>
                <li class="telephone">Phone: (405) 816-0490</li>
                <li class="brewery_type">Type: Micro</li>
                <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
            </ul>
            <ul class="vcard simple col2"></ul>
        </div>
    </body>'
    
    page <- read_html(html_file) 
    
    tibble(
      name = page %>% html_nodes(".vcard .name") %>% html_text(),
      address = page %>% html_nodes(".vcard .address") %>% html_text(),
      type = page %>% html_nodes(".vcard .brewery_type") %>% html_text() %>% stringr::str_replace_all("^Type: ", ""),
      website = page %>% html_nodes(".vcard .url a") %>% html_attr("href")
    )
    
    #> # A tibble: 2 x 4
    #>                           name            address  type                       website
    #>                          <chr>              <chr> <chr>                         <chr>
    #> 1  Bradley Farm / RB Brew, LLC 317 Springtown Rd  Micro http://www.raybradleyfarm.com
    #> 2             (405) Brewing Co    1716 Topeka St  Micro     http://www.405brewing.com
    

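    【Discussion】:

    • Thanks a lot @austensen. The only error I get is on `type` when I run this on the whole file. It seems to come up when we try to replace blank type values. ` Error: Column type must be length 1 or 7263, not 7147 `
    • Oh, it sounds like, unlike in your example, some breweries in your real data are missing the type field, so the columns of the data frame end up with different lengths. I'll have to think about how to fix that.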
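
One possible way around the length mismatch described above, as a rough sketch rather than the answerer's own fix: select each `div.brewery` first, then use `html_node()` (singular) inside it. Applied to a set of nodes, `html_node()` returns exactly one result per input node and yields `NA` where nothing matches, so every column keeps one entry per brewery. The sample HTML below is trimmed and made up so that the second brewery has no type, and `get_field` is a hypothetical helper name.

```r
library(rvest)

# Hypothetical sample: the second brewery has no brewery_type entry
html_file <- '<html><body>
<div class="brewery"><ul class="vcard simple">
  <li class="name">Bradley Farm / RB Brew, LLC</li>
  <li class="address">317 Springtown Rd </li>
  <li class="brewery_type">Type: Micro</li>
  <li class="url"><a href="http://www.raybradleyfarm.com">www.raybradleyfarm.com</a></li>
</ul></div>
<div class="brewery"><ul class="vcard simple">
  <li class="name">(405) Brewing Co</li>
  <li class="address">1716 Topeka St </li>
  <li class="url"><a href="http://www.405brewing.com">www.405brewing.com</a></li>
</ul></div>
</body></html>'

page <- read_html(html_file)

# One node per brewery; every column below is derived from this node set
breweries <- html_nodes(page, "div.brewery")

# Hypothetical helper: text of the first match inside each brewery, NA if absent
get_field <- function(nodes, css) {
  trimws(html_text(html_node(nodes, css)))
}

result <- data.frame(
  name    = get_field(breweries, ".name"),
  address = get_field(breweries, ".address"),
  type    = sub("^Type: ", "", get_field(breweries, ".brewery_type")),
  website = html_attr(html_node(breweries, ".url a"), "href"),
  stringsAsFactors = FALSE
)

print(result)
```

Because every column is built from the same `breweries` node set, all columns have the same length, and a missing field shows up as `NA` instead of shifting later rows upward.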
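    【Solution 2】:

    The problem is that the data is not a table, so it is not very easy to parse. It is just two lists, and the code below joins them into a single list. Also, FYI, take a look at the xml2 package for parsing html/xml.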

    library(dplyr)
    library(rvest)
    library(xml2)
    
    vcard <- 
      '<!DOCTYPE html>
      <html>
    
      <head>
      <title>Page Title</title>
      </head>
    
      <body>
      <div class="brewery" id="brewery">
      <ul class="vcard simple">
      <li class="name"> Bradley Farm / RB Brew, LLC</li>
      <li class="address">317 Springtown Rd </li>
      <li class="address_2">New Paltz, NY 12561-3020 | <a href=\'http://www.google.com/maps/place/317 Springtown Rd++New Paltz+NY+United States\' target=\'_blank\'>Map</a> </li>
      <li class="telephone">Phone: (845) 255-8769</li>
      <li class="brewery_type">Type: Micro</li>
      <li class="url"><a href="http://www.raybradleyfarm.com" target="_blank">www.raybradleyfarm.com</a> </li>
      </ul>
      <ul class="vcard simple col2"></ul>
      </div>
      <div class="brewery">
      <ul class="vcard simple">
      <li class="name">(405) Brewing Co</li>
      <li class="address">1716 Topeka St </li>
      <li class="address_2">Norman, OK 73069-8224 | <a href=\'http://www.google.com/maps/place/1716 Topeka St++Norman+OK+United States\' target=\'_blank\'>Map</a> </li>
      <li class="telephone">Phone: (405) 816-0490</li>
      <li class="brewery_type">Type: Micro</li>
      <li class="url"><a href="http://www.405brewing.com" target="_blank">www.405brewing.com</a> </li>
      </ul>
      <ul class="vcard simple col2"></ul>
      </div>
      </body>' %>% 
        read_html() %>%   # dropped the stray `html` argument, which is not a valid encoding
        xml_find_all("//ul[@class = 'vcard simple']")
    
    # xml2 functions are vectorised over node sets, so the <li> children of
    # both lists come back as one node set
    children <- xml_children(vcard)
    
    data.frame(
      class = xml_attr(children, "class"),  # class attribute of each <li>
      value = xml_text(children),           # text content of each <li>
      stringsAsFactors = FALSE
    )
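
The result above is in long form (one row per `<li>`), while the question asks for one row per brewery. A minimal base R sketch of that reshaping step follows. It assumes every brewery's entries begin with a `name` row, which holds for the sample but is not guaranteed for real data. The `long` data frame is a hand-typed stand-in for the output above, and `id` is a made-up column name.

```r
# Stand-in for the long class/value data frame (values typed by hand)
long <- data.frame(
  class = c("name", "address", "brewery_type",
            "name", "address", "brewery_type"),
  value = c("Bradley Farm / RB Brew, LLC", "317 Springtown Rd", "Type: Micro",
            "(405) Brewing Co", "1716 Topeka St", "Type: Micro"),
  stringsAsFactors = FALSE
)

# Assumption: each brewery starts with a "name" row, so a running count of
# "name" rows gives a brewery id
long$id <- cumsum(long$class == "name")

# Spread the class/value pairs into one column per class
wide <- reshape(long, idvar = "id", timevar = "class", direction = "wide")

# reshape() prefixes the new columns with "value."; strip that prefix
names(wide) <- sub("^value\\.", "", names(wide))

print(wide)
```

A brewery that lacks one of the classes gets `NA` in that column, since `reshape()` fills missing id/class combinations with `NA`.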
    

    【Discussion】:
