【Question title】: Web scraping with Python, BeautifulSoup
【Posted】: 2020-10-04 21:45:51
【Question description】:

I'm having trouble parsing links with Python. Here is my code:

def get_content(html):
    soup = BeautifulSoup(html, 'lxml')
    items = soup.find_all('div', class_='grid-item___eaXVb')

    for item in items:
        link = item.find('a', class_='gl-product-card__details-link')
        print(link.get('href'))

I get this error:

Traceback (most recent call last):
  File "parser.py", line 32, in <module>
    parse()
  File "parser.py", line 27, in parse
    get_content(html.text)
  File "parser.py", line 21, in get_content
    print(link.get('href'))
AttributeError: 'NoneType' object has no attribute 'get'

But when I try this:

    for item in items:
        link = item.find('a', class_='gl-product-card__details-link')
        print(type(link))

I get output showing that every link has this type:

<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>
...
...
...
<class 'bs4.element.Tag'>
<class 'bs4.element.Tag'>

What am I doing wrong?

【Comments】:

Tags: python parsing web-scraping beautifulsoup lxml


【Solution 1】:

To get the titles and links of all products, you can use this example:

import requests
from bs4 import BeautifulSoup


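url = 'https://www.adidas.com/us/men-shoes?price=price%3C50.0'
# Send a browser-like User-Agent header with the request.
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}

soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

# Select the media link inside every <div> whose class attribute starts with "product-container".
for a in soup.select('div[class^="product-container"] a.gl-product-card__media-link'):
    # The product title is the next element after the link that has the class "gl-label".
    label = a.find_next(class_='gl-label')
    # Product hrefs are site-relative, so prepend the host to build a full URL.
    print('{:<50} {}'.format(label.text, 'https://www.adidas.com' + a['href']))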

Prints:

Adilette Lite Slides                               https://www.adidas.com/us/adilette-lite-slides/FU8299.html
Adilette Aqua Slides                               https://www.adidas.com/us/adilette-aqua-slides/F35550.html
U_Path Run Shoes                                   https://www.adidas.com/us/u_path-run-shoes/EE4466.html
adiease Shoes                                      https://www.adidas.com/us/adiease-shoes/BY4027.html
Nizza RF Slip-on Shoes                             https://www.adidas.com/us/nizza-rf-slip-on-shoes/EF1410.html
Adilette Slides                                    https://www.adidas.com/us/adilette-slides/280647.html
Goletto VII Turf Shoes                             https://www.adidas.com/us/goletto-vii-turf-shoes/FV8703.html
Adilette Comfort Slides                            https://www.adidas.com/us/adilette-comfort-slides/FW5337.html
Adilette Comfort Slides                            https://www.adidas.com/us/adilette-comfort-slides/FW5353.html
Adizero Spark MD Cleats                            https://www.adidas.com/us/adizero-spark-md-cleats/EF3476.html
CP Traxion Spikeless Shoes                         https://www.adidas.com/us/cp-traxion-spikeless-shoes/EE9206.html
CP Traxion Spikeless Shoes                         https://www.adidas.com/us/cp-traxion-spikeless-shoes/BB7900.html
CP Traxion Spikeless Shoes                         https://www.adidas.com/us/cp-traxion-spikeless-shoes/BD7138.html
CP Traxion Spikeless Shoes                         https://www.adidas.com/us/cp-traxion-spikeless-shoes/F34996.html
Adilette Lite Slides                               https://www.adidas.com/us/adilette-lite-slides/FU8296.html
Afterburner 6 Grail MD Cleats                      https://www.adidas.com/us/afterburner-6-grail-md-cleats/DB3106.html
Lite Racer CLN Shoes                               https://www.adidas.com/us/lite-racer-cln-shoes/EE8138.html

... and so on.
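
As for the `AttributeError` in the question: `find()` returns `None` when nothing matches, so one likely explanation is that at least one grid item has no matching `<a>` tag (the elided part of the `type(link)` output could hide a `NoneType` entry). Below is a minimal sketch of guarding against that. The sample HTML, `BASE_URL`, and the `get_links` helper are assumptions made up for illustration; only the class names come from the question, and the real page markup may differ.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup for illustration only. The class names are copied from
# the question; the structure of the real page is an assumption.
SAMPLE_HTML = """
<div class="grid-item___eaXVb">
  <a class="gl-product-card__details-link" href="/us/adilette-lite-slides/FU8299.html">Adilette Lite Slides</a>
</div>
<div class="grid-item___eaXVb">
  <span>tile without a product link</span>
</div>
<div class="grid-item___eaXVb">
  <a class="gl-product-card__details-link" href="/us/adiease-shoes/BY4027.html">adiease Shoes</a>
</div>
"""

BASE_URL = "https://www.adidas.com"  # assumed host for the relative hrefs


def get_links(html, base_url=BASE_URL):
    """Return absolute product URLs, skipping grid items that have no link."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for item in soup.find_all("div", class_="grid-item___eaXVb"):
        link = item.find("a", class_="gl-product-card__details-link")
        if link is None:
            # find() found no matching tag in this item, so there is nothing to read.
            continue
        href = link.get("href")
        if href:
            # Resolve the site-relative href against the base URL.
            links.append(urljoin(base_url, href))
    return links


if __name__ == "__main__":
    for product_url in get_links(SAMPLE_HTML):
        print(product_url)
```

With the sample markup above, the second grid item is skipped instead of raising the error, and the two remaining hrefs are printed as absolute URLs.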

【Comments】:
