【Question title】: Read from HTML using rvest
【Posted】: 2018-04-03 04:37:39
【Question】:

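Is it possible to use the rvest package to read the text that follows an input type="radio" tag and is itself followed by a span class="glyphicon glyphicon-ok" tag? For example, I want to read "Carbohydrates and fats" into a character vector.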

R code (does not work; NA is stored in p_ans):

install.packages('rvest')
library('rvest')

url <- 'http://upscfever.com/upsc-fever/en/test/en-test-sci1.html'

webpage <- read_html(url)

p_ans <- webpage %>%
        html_nodes("input + glyphicon-ok") %>%
        html_text()

HTML code:

<div class="form-group" id="myform">
            <label for="usr">Q1: Energy giving foods are </label>
     </div>
    <div class="radio">
      <label><input type="radio" value="1" name="optradio0">Carbohydrates and fats<span class="glyphicon glyphicon-ok"></span></label>
    </div>
    <div class="radio">
      <label><input type="radio" id="opt1" value="-0.33" name="optradio0">Carbohydrates and Proteins<span id="sp1" class="glyphicon glyphicon-remove"></span></label>
    </div>

【Comments】:

  • A very clever way to capture all the correct answers for a given quiz ;-)

Tags: r web-scraping rvest


【Solution 1】:
library(rvest)

pg <- read_html("http://upscfever.com/upsc-fever/en/test/en-test-sci1.html")
html_nodes(pg, xpath=".//label[input and span[contains(@class, 'glyphicon glyphicon-ok')]]") %>% 
  html_text()
##  [1] "Carbohydrates and fats"                                         
##  [2] "saturated fatty acids"                                          
##  [3] "unsaturated fatty acids are good for health"                    
##  [4] "unsaturated fats"                                               
##  [5] "polypeptides"                                                   
##  [6] "Maerasmus"                                                      
##  [7] "Ribulose bisphosphate Carboxylase-Oxygenase "                   
##  [8] "Mercury"                                                        
##  [9] "Cadmium"                                                        
## [10] "Absorb free radicals"                                           
## [11] "A"                                                              
## [12] "Calcium - Goitre"                                               
## [13] "none"                                                           
## [14] "Excretion of undigested food"                                   
## [15] " complex components of food are broken into simpler substances."
## [16] "starch to sugar"                                                
## [17] "protection of stomach lining"                                   
## [18] "Liver"                                                          
## [19] "digestion of fats"                                              
## [20] "only HDC is good"                                               
## [21] "35-42"                                                          
## [22] "absorption of food"                                             
## [23] "digest cellulose"                                               
## [24] "meat is easily digested"                                        
## [25] "gall bladder"                  
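
As a hedged illustration of why the selector in the question finds nothing while the XPath above works, here is a minimal sketch that runs against an inline copy of the HTML fragment from the question rather than the live page. The variable names (`html`, `no_match`, `span_text`, `via_parent`) are made up for this example, and the `xml2::xml_parent()` variant is an alternative not used in the answer.

```r
library(rvest)

# Inline copy of the fragment from the question, so no network access is needed.
html <- '
<div class="radio">
  <label><input type="radio" value="1" name="optradio0">Carbohydrates and fats<span class="glyphicon glyphicon-ok"></span></label>
</div>
<div class="radio">
  <label><input type="radio" id="opt1" value="-0.33" name="optradio0">Carbohydrates and Proteins<span id="sp1" class="glyphicon glyphicon-remove"></span></label>
</div>'

pg <- read_html(html)

# Selector from the question: without a leading dot, "glyphicon-ok" is read
# as a tag name, so it looks for a <glyphicon-ok> element and matches nothing.
no_match <- pg %>% html_nodes("input + glyphicon-ok") %>% html_text()

# With the class written correctly the <span> is matched, but the span itself
# is empty; the answer text belongs to the enclosing <label>.
span_text <- pg %>% html_nodes("input + span.glyphicon-ok") %>% html_text()

# XPath from the answer: pick each <label> that contains both an <input>
# and a <span> carrying the "glyphicon glyphicon-ok" class.
p_ans <- pg %>%
  html_nodes(xpath = ".//label[input and span[contains(@class, 'glyphicon glyphicon-ok')]]") %>%
  html_text()

# Alternative (assumption, not from the answer): select the span with CSS,
# then step up to its parent <label> with xml2.
via_parent <- pg %>%
  html_nodes("span.glyphicon-ok") %>%
  xml2::xml_parent() %>%
  html_text()

print(p_ans)
```

On the full page some results carry leading or trailing spaces (entries [7] and [15] above); passing `trim = TRUE` to `html_text()` should strip them.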

【Comments】:

  • I hadn't thought of that.