【发布时间】:2019-11-09 00:01:38
【问题描述】:
我有一个带有 2 个p 标签的div。
我需要获取第二个 p 元素的文本。
<div class="fb-price-list">
<p class="fb-price">S/ 1,699 (Internet)</p>
<p class="fb-price">S/ 2,399 (Normal)</p>
</div>
预期结果:
S/ 2,399 (Normal)
我有这个但不工作:
tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")
product_price_actual <- tvs_url %>%
html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
html_text()
html:
<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/ 1,699 (Internet)</p><p class="fb-price">S/ 2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div class="fb-pod__rating" style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>
更新 1:
根据选择的答案,我使用ifelse 检查给定位置的字符数:
要监督的位置是第 4 个,当没有 precio_antes(价格之前)时,这个位置被另一个元素占据,所以我们需要在这些情况下输入 NA:
ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))
我如何构建最终的 df:
df <- data.frame(
brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
product = sapply(splitted, "[", 3),
precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
)
【问题讨论】:
-
在您的实际 URL 上是否有任何答案对您有用,因为当我将其应用于
"https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1" %>% read_html() %>% html_nodes(".fb-price-list p:nth-child(2)") %>% html_text()时,我得到character(0) -
@RonakShah 你是对的。它在 url 中不起作用,而只是在我处理的 html 部分中起作用。诡异的。请让我知道您是否可以帮助我解决这个问题。
-
你开放使用RSelenium吗?
-
@BigDataScientist 是的,我会使用 RSelenium,
标签: r web-scraping rvest rselenium