【Question title】: BeautifulSoup not returning child elements
【Posted】: 2020-10-13 01:59:17
【Question】:

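I've tried a million different approaches and can't figure out why BeautifulSoup is as unpredictable as all my exes.

I'm just trying to copy a table into a pandas DataFrame. The table has about 280 rows.

Here is the URL:

https://www.brilliantearth.com/design-your-own-engagement-ring/?sid=3755106&dc=

Here is the part of my code that isn't working: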

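import requests
from bs4 import BeautifulSoup

# Placeholder: the headers used in the original script were not shown.
req_headers = {"User-Agent": "Mozilla/5.0"}

url = "https://www.brilliantearth.com/design-your-own-engagement-ring/?sid=3755106&dc="
with requests.Session() as s:
    r = s.get(url, headers=req_headers)

# Parse the response body into a soup object
soup = BeautifulSoup(r.content, 'lxml')
# Look up the container div; per the question, its child rows are missing
rows = soup.find_all("div", {"id": "diamonds_search_table"})
print(rows)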

This is the area of the page where the table is located:

What can I try next?

【Question discussion】:

  • Can you post a screenshot of the inspected div?
  • In what way does your code not work?

Tags: python html beautifulsoup html-parsing


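【Solution 1】:

The data is loaded dynamically via JavaScript, so it is not part of the HTML that requests receives from the page URL. You can use the requests module to call the same endpoint the page loads its data from.

For example: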

import json
import requests


search_parameters = {
    'shapes': "Round",
    'cuts': "Fair,Good,Very Good,Ideal,Super Ideal",
    'colors': "J,I,H,G,F,E,D",
    'clarities': "SI2,SI1,VS2,VS1,VVS2,VVS1,IF,FL",
    'polishes': "Good,Very Good,Excellent",
    'symmetries': "Good,Very Good,Excellent",
    'fluorescences': "Very Strong,Strong,Medium,Faint,None",
    'min_carat': "0.25",
    'max_carat': "11.58",
    'min_table': "50.00",
    'max_table': "86.00",
    'min_depth': "46.20",
    'max_depth': "629.00",
    'min_price': "420",
    'max_price': "1258930",
    'stock_number': "",
    'row': "0",
    'page': "1",
    'requestedDataSize': "200",
    'order_by': "price",
    'order_method': "asc",
    'currency': "$",
    'has_v360_video': "",
    'dedicated': "",
    'sid': "",
    'min_ratio': "1.00",
    'max_ratio': "2.75",
    'shipping_day': "",
    'MIN_PRICE': "420",
    'MAX_PRICE': "1258930",
    'MIN_CARAT': "0.25",
    'MAX_CARAT': "11.58",
    'MIN_TABLE': "45",
    'MAX_TABLE': "86",
    'MIN_DEPTH': "46.2",
    'MAX_DEPTH': "629"
}

data = requests.get('https://www.brilliantearth.com/loose-diamonds/list/', params=search_parameters).json()

# uncomment this to print all data:
# print(json.dumps(data, indent=4))

for d in data['diamonds']:
    print('{:<30} {:<15} {}'.format(d['title'], d['cut'], d['price']))

Prints:

0.30 Carat Round Diamond       Very Good       420
0.30 Carat Round Diamond       Very Good       420
0.30 Carat Round Diamond       Ideal           430
0.30 Carat Round Diamond       Ideal           430
0.30 Carat Round Diamond       Good            430
0.30 Carat Round Diamond       Ideal           430
0.30 Carat Round Diamond       Very Good       430
0.25 Carat Round Diamond       Super Ideal     430
0.30 Carat Round Diamond       Very Good       430
0.32 Carat Round Diamond       Ideal           430

... and so on.
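
Since the goal in the question is a pandas DataFrame, the list of records in the JSON response can probably be handed to pandas directly. The following is a minimal sketch, assuming each entry under 'diamonds' is a flat dict. Only the 'title', 'cut' and 'price' keys are known from the loop above; the sample records are copied from the printed output and stand in for the live response, the price type (integer) is a guess, and diamonds_to_frame is a hypothetical helper name.

```python
import pandas as pd

# Stand-in for the decoded JSON response. Only 'title', 'cut' and 'price'
# are known from the answer; the real payload likely has more fields, and
# the integer price type here is an assumption.
data = {
    "diamonds": [
        {"title": "0.30 Carat Round Diamond", "cut": "Very Good", "price": 420},
        {"title": "0.30 Carat Round Diamond", "cut": "Ideal", "price": 430},
        {"title": "0.25 Carat Round Diamond", "cut": "Super Ideal", "price": 430},
    ]
}


def diamonds_to_frame(payload):
    """Build a DataFrame with one row per diamond record."""
    return pd.DataFrame(payload["diamonds"])


df = diamonds_to_frame(data)
print(df[["title", "cut", "price"]])
```

With the live response, the same call would be diamonds_to_frame(data) on the dict returned by .json() above.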

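【Discussion】:

  • Man, this is awesome. Where did you get that URL? It returns some data, but not the data from the original URL/query.
  • @max I found this URL in the Firefox developer tools; the page loads its data from it. (Chrome has something similar.)
  • Got it. That's where I figured it came from. I hadn't found it yet. Still digging to figure out how to get the right data from that search-diamonds/list/ URL.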
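
On collecting the full result set: the parameters in the answer request 200 records ('requestedDataSize') for 'page' 1, while the table in the question has about 280 rows. One possible approach is to keep increasing 'page' until a batch comes back empty. This is only a sketch under assumptions the thread does not confirm: that the endpoint treats 'page' as a 1-based index, and that an exhausted query returns an empty 'diamonds' list. The names fetch_all, fetch_page and fake_fetch are hypothetical; the fetch function is injected so the loop can be tried without network access.

```python
def fetch_all(fetch_page, base_params, max_pages=50):
    """Collect records page by page until an empty batch is returned.

    fetch_page is any callable that takes a params dict and returns the
    decoded JSON as a dict with a 'diamonds' list (assumed shape).
    """
    records = []
    for page in range(1, max_pages + 1):
        # Copy the base parameters and override only the page number.
        params = dict(base_params, page=str(page))
        batch = fetch_page(params).get("diamonds", [])
        if not batch:
            break
        records.extend(batch)
    return records


# Offline stand-in for the endpoint: 5 fake records, served 2 per page.
FAKE = [{"title": "diamond %d" % i} for i in range(5)]


def fake_fetch(params):
    size = int(params["requestedDataSize"])
    start = (int(params["page"]) - 1) * size
    return {"diamonds": FAKE[start:start + size]}


all_rows = fetch_all(fake_fetch, {"requestedDataSize": "2", "page": "1"})
print(len(all_rows))
```

Against the live site, fetch_page could be a small wrapper around requests.get(url, params=params).json() with the URL and search_parameters from the answer. Whether the 'row' parameter also has to change between pages is not known from the thread.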
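【Solution 2】:

You can use Selenium to load the page in a browser, so that its JavaScript runs, and then parse the rendered HTML. You can try: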

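from bs4 import BeautifulSoup
from selenium import webdriver

# Open the page in a real browser so that its JavaScript runs
driver = webdriver.Firefox()
driver.get('https://www.brilliantearth.com/design-your-own-engagement-ring/?sid=3755106&dc=')

# page_source holds the DOM as rendered by the browser; if rows are missing,
# the table may not have finished loading and an explicit wait may be needed
html = driver.page_source
driver.quit()

# Name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'lxml')

rows = soup.find_all("div", {"id": "diamonds_search_table"})
print(rows)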

You will get all the rows, as shown below:

[<div class="search-table" id="diamonds_search_table" style="position: relative; height: 34000px;">
<div class="inner item" data-have="true" data-position="0" style="position: absolute; width: 100%; height: 34px;top:0px;"><a class="td-n2" href="/rings/cyorings/view_diamond/9361809/?sid=3755106&amp;first=diamond&amp;show_diamond_tab=true"></a><table border="0" cellpadding="0" cellspacing="0" class="table-striped table-hover search-result-table" width="100%"><tbody><tr class="search-item"><td data-id="9361809" onclick="dtl.stop_jump();" scope="col" width="7%"><div class="checkbox checkbox-ty4"><label><input class="hidden"/><span class="sr-only">checkbox</span><i class="icons-checkbox"></i></label></div></td><td scope="col" width="9%">Round</td><td scope="col" width="9%">0.30</td><td scope="col" width="8%">H</td><td scope="col" width="8%">SI2</td><td scope="col" width="12%">Very Good</td><td scope="col" width="8%">GIA</td><td scope="col" width="12%">Botswana Sort</td><td class="width_ratio_hide" scope="col" width="8%">1</td><td scope="col" width="10%">$420</td><td scope="col" width="7%"><span class="view">View</span></td></tr></tbody></table></div><div class="inner item" data-have="true" data-position="34" style="position: absolute; width: 100%; height: 34px;top:34px;"><a class="td-n2" href="/rings/cyorings/view_diamond/9391074/?sid=3755106&amp;first=diamond&amp;show_diamond_tab=true"></a><table border="0" cellpadding="0" cellspacing="0" class="table-striped table-hover search-result-table" width="100%"><tbody><tr class="search-item"><td data-id="9391074"


and so on...........]
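
To get from the rendered markup to tabular data, the cells of each tr.search-item row can be read out one by one. The following is a minimal sketch that uses a trimmed copy of the first row shown above as input. The column names are guesses based on the cell values and are not taken from the page, and parse_rows is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the first row from the output above.
html = """
<div class="search-table" id="diamonds_search_table">
<div class="inner item"><table><tbody><tr class="search-item">
<td data-id="9361809"><span class="sr-only">checkbox</span></td>
<td>Round</td><td>0.30</td><td>H</td><td>SI2</td><td>Very Good</td>
<td>GIA</td><td>Botswana Sort</td><td class="width_ratio_hide">1</td>
<td>$420</td><td><span class="view">View</span></td>
</tr></tbody></table></div>
</div>
"""

# Hypothetical column names, inferred from the values in each cell.
COLUMNS = ["shape", "carat", "color", "clarity", "cut",
           "report", "origin", "ratio", "price"]


def parse_rows(markup):
    """Return one dict per tr.search-item row inside the search table."""
    soup = BeautifulSoup(markup, "html.parser")
    table = soup.find("div", {"id": "diamonds_search_table"})
    records = []
    for tr in table.find_all("tr", class_="search-item"):
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        # Skip the leading checkbox cell and the trailing "View" cell.
        records.append(dict(zip(COLUMNS, cells[1:-1])))
    return records


rows = parse_rows(html)
print(rows)
```

The resulting list of dicts can be passed to pandas.DataFrame(rows) to build the DataFrame the question asks for.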

【Discussion】:
