How to extract text in the second p element inside a div

【Question title】: How to extract text in second p element inside div
【Posted】: 2019-11-09 00:01:38
【Question description】:

I have a div with 2 p tags.

I need to get the text of the second p element.

<div class="fb-price-list">
      <p class="fb-price">S/  1,699 (Internet)</p>
      <p class="fb-price">S/  2,399 (Normal)</p>
</div>

Expected result:

S/  2,399 (Normal)

I have this, but it does not work:

tvs_url <- read_html("https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1")

product_price_actual <- tvs_url %>% 
  html_nodes('div.pod-group pod-group__large-pod div.pod-body div.fb-price-list p.fb-price:nth-child(2)') %>%
  html_text()

HTML:

<div class="pod-item"><div class="fb-form__input--checkbox fb-pod__item__compare"><input id="fb-pod__item__input-16754140" class="fb-pod__item__compare__input" type="checkbox" name="fb-pod__item__input-16754140" value="16754140"><label for="fb-pod__item__input-16754140" class="fb-pod__item__compare__label">Comparar</label></div><div class="pod-head"><a class="pod-head__image" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="content__image"><img src="//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0" alt="img" class="image"></div></a><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140" class="pod-head__stickerslink"><div class="pod-head__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div></a></div><div class="pod-body"><a class="section__pod-top" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="section__pod-top-brand">SAMSUNG</div><div class="section__pod-top-title"><div class="LinesEllipsis  ">LED UHD 4K 55" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class="section__pod-middle"><div class="section__pod-middle-content__stickers"><div class="fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff" data-discount-content="">29%</div></div><div class="section__information"><a class="section__information-link" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><div class="fb-price-list"><p class="fb-price">S/  1,699 (Internet)</p><p class="fb-price">S/  2,399 (Normal)</p></div></a></div><div class="section__pod-middle-content__button"><button class="btn-add-to-basket">AGREGAR A TU BOLSA</button></div></div><div class="section__pod-bottom"><div class="fb-pod__rating" style="visibility: hidden;"><a href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments"><div class="fb-rating-stars"><div class="fb-rating-stars__container"><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><div class="fb-rating-stars__holder"><span class=""><i class="icon-rating"></i></span></div><p class="fb-rating-stars__count">0 <span class="fb-rating-stars__count__max"> / 5</span></p></div></div></a></div><a class="section__pod-bottom-descriptionlink" href="/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140"><ul class="section__pod-bottom-description"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>

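Update 1:

Based on the accepted answer, I use ifelse to check the number of characters at a given position:

The position to watch is the 4th. When there is no precio_antes (previous price), that position is occupied by another element, so in those cases we need to enter NA: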

ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6))

How I build the final df:

df <- data.frame(
    brand = sapply(splitted, "[", 2), #We don't need the "comparar" text so we start from 2
    product = sapply(splitted, "[", 3),
    precio_antes = ifelse(nchar(sapply(splitted, "[", 4))>3, NA, sapply(splitted, "[", 6)),
    precio_actual = ifelse(nchar(sapply(splitted, "[", 4))<=3, sapply(splitted, "[", 5), sapply(splitted, "[", 4))
  )
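
A minimal sketch of the check above, run on two made-up pods. The contents of splitted are an assumption here: each element is taken to be one pod's text split into lines, with a short discount flag such as 29% in position 4 only when a previous price exists. The real vectors may look different.

```r
# Hypothetical pods: the first has a discount flag and a previous price,
# the second has neither, so its current price slides into position 4.
splitted <- list(
  c("Comparar", "SAMSUNG", "LED UHD 4K 55\" Smart TV", "29%",
    "S/ 1,699 (Internet)", "S/ 2,399 (Normal)"),
  c("Comparar", "LG", "LED 43\" Smart TV",
    "S/ 2,199 (Internet)", "AGREGAR A TU BOLSA")
)

# Position 4 is short (e.g. "29%") only when a previous price exists
pos4 <- sapply(splitted, "[", 4)

precio_antes  <- ifelse(nchar(pos4) > 3, NA, sapply(splitted, "[", 6))
precio_actual <- ifelse(nchar(pos4) <= 3, sapply(splitted, "[", 5), pos4)

df <- data.frame(
  brand = sapply(splitted, "[", 2),
  product = sapply(splitted, "[", 3),
  precio_antes = precio_antes,
  precio_actual = precio_actual,
  stringsAsFactors = FALSE
)
print(df)
```

Computing pos4 once keeps the two ifelse calls consistent with each other.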

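【Question discussion】:

Did any of the answers work for you on your actual URL? When I apply it as "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1" %>% read_html() %>% html_nodes(".fb-price-list p:nth-child(2)") %>% html_text(), I get character(0).
@RonakShah You are right. It does not work on the URL, only on the HTML fragment I was working with. Weird. Please let me know if you can help me solve this.
Are you open to using RSelenium?
@BigDataScientist Yes, I would use RSelenium.

Tags: r web-scraping rvest rselenium

【Solution 1】:

Since you are also considering RSelenium, here is a solution with that package.

You can find these elements by XPath. In your case, the XPath would be: /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div/p.

It is similar to @gersht's solution, but uses only RSelenium.

Reproducible example: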

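library(RSelenium)

# Same listing page as in the question
url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"

# Start a Selenium server and browser, then open the page
rD <- rsDriver()
remDr <- rD$client

remDr$navigate(url)

# Select the parent div of each price pair rather than the individual p elements
priceElems = remDr$findElements(
  using = "xpath",
  value = "/html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list']"
)

# Visible text of each price block
rawPrices = sapply(
  X = priceElems,
  FUN = function(elem) elem$getElementText()
)

# Split each block where the second price starts
splitted = sapply(
  X = rawPrices,
  FUN = strsplit,
  split = "\nS/"
)

# A block with only one price yields NA in the second column
prices = data.frame(
  internetPrices = sapply(splitted, "[", 1),
  normalPrices = sapply(splitted, "[", 2)
)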

Result/output:

> head(prices, 8)
       internetPrices    normalPrices
1 S/ 1,099 (Internet)  1,599 (Normal)
2 S/ 2,299 (Internet)  3,999 (Normal)
3 S/ 1,699 (Internet)  2,399 (Normal)
4   S/ 999 (Internet)  1,149 (Normal)
5   S/ 999 (Internet)  1,399 (Normal)
6 S/ 1,399 (Internet)  1,699 (Normal)
7 S/ 2,199 (Internet)            <NA>
8 S/ 2,699 (Internet)  4,999 (Normal)

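Setup:

If needed, see here for how to set up RSelenium: How to set up rselenium for R?.

Edit:

Following the remark in the comments about also capturing empty elements, I now fetch the parent element and then process the price text.

The parent element is /html/body/div/main/div/div/div/section/div/div/div/div/div/a/div[@class = 'fb-price-list'], which contains an empty string if one of the prices is not available.

【Discussion】:

How do I specify the XPath for the second p tag? In the XPath provided I only see one p tag. I need to get the text from both p tags in the HTML I provided, and build a data frame from those elements (also taking into account that sometimes a p tag has no text, in which case I need an NA to fill that row of the column).
See my edit above. If you want to capture the "missing prices", I would go for the parent element.
Ty, based on your answer I got what I needed done. You can see my update, which checks for when there is no precio_antes.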

【Solution 2】:

Here I use CSS to select the nodes with class fb-price-list, then select the second p child:

library(rvest)

"<div class=\"pod-item\"><div class=\"fb-form__input--checkbox fb-pod__item__compare\"><input id=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__input\" type=\"checkbox\" name=\"fb-pod__item__input-16754140\" value=\"16754140\"><label for=\"fb-pod__item__input-16754140\" class=\"fb-pod__item__compare__label\">Comparar</label></div><div class=\"pod-head\"><a class=\"pod-head__image\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"content__image\"><img src=\"//falabella.scene7.com/is/image/FalabellaPE/16754140?wid=544&amp;hei=544&amp;qlt=70&amp;anchor=750,750&amp;crop=0,0,0,0\" alt=\"img\" class=\"image\"></div></a><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\" class=\"pod-head__stickerslink\"><div class=\"pod-head__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div></a></div><div class=\"pod-body\"><a class=\"section__pod-top\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"section__pod-top-brand\">SAMSUNG</div><div class=\"section__pod-top-title\"><div class=\"LinesEllipsis  \">LED UHD 4K 55\" Smart TV UN55RU7100GXPE SERIE RU7100<wbr></div></div></a><div class=\"section__pod-middle\"><div class=\"section__pod-middle-content__stickers\"><div class=\"fb-responsive-flag fb-responsive-stylised-caps fb-pod__flag fb-pod__flag--percentoff\" data-discount-content=\"\">29%</div></div><div class=\"section__information\"><a class=\"section__information-link\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><div class=\"fb-price-list\"><p class=\"fb-price\">S/  1,699 (Internet)</p><p class=\"fb-price\">S/  2,399 (Normal)</p></div></a></div><div class=\"section__pod-middle-content__button\"><button 
class=\"btn-add-to-basket\">AGREGAR A TU BOLSA</button></div></div><div class=\"section__pod-bottom\"><div class=\"fb-pod__rating\" style=\"visibility: hidden;\"><a href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140#comments\"><div class=\"fb-rating-stars\"><div class=\"fb-rating-stars__container\"><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><div class=\"fb-rating-stars__holder\"><span class=\"\"><i class=\"icon-rating\"></i></span></div><p class=\"fb-rating-stars__count\">0 <span class=\"fb-rating-stars__count__max\"> / 5</span></p></div></div></a></div><a class=\"section__pod-bottom-descriptionlink\" href=\"/falabella-pe/product/16754140/LED-UHD-4K-55-Smart-TV-UN55RU7100GXPE-SERIE-RU7100/16754140\"><ul class=\"section__pod-bottom-description\"><li>Modelo: UN55RU7100GXPE</li><li>Tamaño de la pantalla: 55\"</li><li>Resolución: 4K Ultra HD</li><li>Tecnología: Led</li><li>Conexión bluetooth: Sí</li></ul></a></div></div></div>" %>% 
  read_html() %>% 
  html_nodes(".fb-price-list p:nth-child(2)") %>% 
  html_text()
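
The same selector can be checked on a much smaller fragment. The sketch below is not part of the answer above: it uses a made-up two-item fragment and calls html_node on each fb-price-list parent, so an item without a second p yields NA instead of being dropped.

```r
library(rvest)

# Hypothetical fragment: the first item has two prices, the second only one
snippet <- paste0(
  '<div class="fb-price-list">',
  '<p class="fb-price">S/  1,699 (Internet)</p>',
  '<p class="fb-price">S/  2,399 (Normal)</p>',
  '</div>',
  '<div class="fb-price-list">',
  '<p class="fb-price">S/  2,199 (Internet)</p>',
  '</div>'
)

doc <- read_html(snippet)

# html_nodes keeps only matches, so the item without a second p disappears
matches_only <- doc %>%
  html_nodes(".fb-price-list p:nth-child(2)") %>%
  html_text()

# html_node applied to each parent returns one result per parent
one_per_item <- doc %>%
  html_nodes(".fb-price-list") %>%
  html_node("p:nth-child(2)") %>%
  html_text()

print(matches_only)
print(one_per_item)
```

The second form keeps the result aligned with the list of products, which matters when the prices go into a data frame next to other columns.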

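【Discussion】:

【Solution 3】:

tl;dr

The content is loaded dynamically, but it is available as a string in the page source: a JavaScript dictionary that can be pulled out with a regex and then parsed with a JSON parser. This is the currently extracted JSON.

If you open the dev tools with F12 and inspect the page HTML, you will see a script tag containing a JavaScript dictionary, which can be extracted and processed with a JSON parser. This does mean you could target that script tag, extract the text from the node and take a substring, but I prefer a regex on the string (note that I extract the body as a string; regex is generally not recommended for HTML, but it is fine on a string).

Code output: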

json$state$searchItemList$resultList$prices

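gives you a list of length 32 containing data frames. You can see that in each data frame originalPrice contains the information you want (the row where the label column == (Normal)).

Not every item has an original price. Here is a simple, though not necessarily the most efficient, way to write out the values: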

l <- json$state$searchItemList$resultList$prices

for (i in l){
  if (length(i$originalPrice)>1){
    print(i$originalPrice[2])
  } else {
    print("No original price")
  }
}

library(rvest)
library(jsonlite)
library(stringr)

url = 'https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1'
r <- read_html(url) %>%
  html_node('body') %>%
  html_text() %>%
  toString()
x <- str_match_all(r,'fbra_browseProductListConfig = (.*);')
json <- jsonlite::fromJSON(x[[1]][,2])
print(json$state$searchItemList$resultList$prices)
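
The regex-then-parse step can be sketched offline. Everything in body_text below is a hypothetical, heavily shortened stand-in for the real page body; only the variable name fbra_browseProductListConfig and the path state$searchItemList$resultList$prices come from the answer, and the shape of each prices entry is an assumption.

```r
library(jsonlite)
library(stringr)

# Hypothetical page text: the config object sits on its own line
body_text <- paste(
  'var other = 1;',
  paste0(
    'var fbra_browseProductListConfig = {"state":{"searchItemList":{"resultList":[',
    '{"brand":"SAMSUNG","prices":[',
    '{"label":"(Internet)","originalPrice":"1,699"},',
    '{"label":"(Normal)","originalPrice":"2,399"}]},',
    '{"brand":"LG","prices":[',
    '{"label":"(Internet)","originalPrice":"2,199"}]}',
    ']}}};'
  ),
  'var another = 2;',
  sep = "\n"
)

# Capture everything between "= " and the last ";" on that line
m <- str_match(body_text, "fbra_browseProductListConfig = (.*);")
json <- fromJSON(m[, 2])

l <- json$state$searchItemList$resultList$prices

# Second originalPrice when present, NA otherwise
normal_price <- vapply(
  l,
  function(p) if (length(p$originalPrice) > 1) p$originalPrice[2] else NA_character_,
  character(1)
)
print(normal_price)
```

Returning a vector with NA for missing values, rather than printing inside a loop, makes it easier to bind the result to other columns.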

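Regex explanation: the pattern matches the literal text fbra_browseProductListConfig = and captures everything after it up to the last semicolon on the same line; the captured group is the JSON string.

【Discussion】:

Ty. Very interesting finding and solution. Apparently they use React on the front end and send the data as JSON. I need to investigate a bit further, because I found I can also get the product name and brand this way.
That is easy. The key is title for the name and brand for the brand.

【Solution 4】:

It looks dynamic, so the data comes from somewhere else. I looked for GET responses carrying the data as JSON, XML, etc., but did not find anything. At this point I would go with RSelenium. The following should extract the correct nodes. You can use whatever method you like to extract the numbers from the resulting strings: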

# install.packages("RSelenium")
library(RSelenium)
library(rvest)

driver <- rsDriver(4444L, "firefox")
fox_client <- driver$client

url <- "https://www.falabella.com.pe/falabella-pe/category/cat210477/TV-Televisores?page=1"
fox_client$navigate(url = url)

html <- fox_client$getPageSource()[[1]]

read_html(html) %>% 
    html_nodes(".fb-price:nth-child(2)") %>% 
    html_text()

#### OUTPUT ####

 [1] "S/  1,599 (Normal)"  "S/  3,999 (Normal)"  "S/  2,399 (Normal)"  "S/  1,149 (Normal)" 
 [5] "S/  1,399 (Normal)"  "S/  1,699 (Normal)"  "S/  4,999 (Normal)"  "S/  7,999 (Normal)" 
 [9] "S/  3,499 (Normal)"  "S/  12,999 (Normal)" "S/  9,798 (Normal)"  "S/  1,999 (Normal)" 
[13] "S/  2,499 (Normal)"  "S/  1,299 (Normal)"  "S/  2,499 (Normal)"  "S/  3,599 (Normal)" 
[17] "S/  8,999 (Normal)"  "S/  2,499 (Normal)"  "S/  8,599 (Normal)"  "S/  1,499 (Normal)" 
[21] "S/  2,199 (Normal)"  "S/  1,199 (Normal)"  "S/  699 (Normal)"    "S/  999 (Normal)"   
[25] "S/  29,999 (Normal)" "S/  499 (Normal)"    "S/  699 (Normal)"    "S/  4,999 (Normal)" 
[29] "S/  17,999 (Normal)" "S/  1,399 (Normal)"
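
One possible way to turn those strings into numbers, as a sketch only: it assumes the comma is always a thousands separator and that no other digits appear in the label. The input vector is a small sample copied from the output above.

```r
# A few strings in the same format as the scraped output
price_text <- c("S/  1,599 (Normal)", "S/  12,999 (Normal)", "S/  699 (Normal)")

# Drop everything except digits and the decimal point, then convert
price_num <- as.numeric(gsub("[^0-9.]", "", price_text))
print(price_num)
```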

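You can also use findElement and clickElement to navigate the page. For more information, see Issue scraping page with "Load more" button with rvest.

【Discussion】:

I had p.fb-price:nth-child(2) and you use just .fb-price:nth-child(2). Shouldn't I target both the tag and the class?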
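
A small sketch for comparing the selectors on a static fragment. The fragment is made up, and it assumes that pod-group and pod-group__large-pod are two classes on the same div. In CSS, a space is the descendant combinator, so pod-group__large-pod written after a space with no leading dot is read as an element name, not as a class.

```r
library(rvest)

# Hypothetical fragment; the outer div carrying both classes is an assumption
doc <- read_html(paste0(
  '<div class="pod-group pod-group__large-pod">',
  '<div class="fb-price-list">',
  '<p class="fb-price">S/  1,699 (Internet)</p>',
  '<p class="fb-price">S/  2,399 (Normal)</p>',
  '</div></div>'
))

pick <- function(css) doc %>% html_nodes(css) %>% html_text()

# Class only versus tag plus class: both target the same second child here
class_only <- pick(".fb-price:nth-child(2)")
tag_and_class <- pick("p.fb-price:nth-child(2)")

# As written in the question: looks for a <pod-group__large-pod> element
as_written <- pick("div.pod-group pod-group__large-pod p.fb-price:nth-child(2)")

# Two classes on one element are chained without a space
chained <- pick("div.pod-group.pod-group__large-pod p.fb-price:nth-child(2)")

print(list(class_only, tag_and_class, as_written, chained))
```

On static HTML the tag is optional; adding it only narrows the match. None of this helps against the live URL, where the prices are not in the HTML that read_html receives.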