【Question title】: Scraping with Beautiful Soup and Python: KeyError: 'href'
【Posted】: 2017-08-14 08:40:24
【Question】:

I am getting KeyError: 'href'. I think it is because my attribute is not defined. I have tried to find a solution, but so far without success. My code is as follows:

import requests
from bs4 import BeautifulSoup

main_url = "https://www.chapter-living.com/properties/highbury/"
re = requests.get(main_url)
soup = BeautifulSoup(re.text, "html.parser")
city_tags = soup.find_all('h2', class_="title")  # The section containing the links to the cities
cities_links = [main_url + tag['href'] for tag in city_tags]  # Iterates through city_tags and stores them in a [list]

The error is raised on the line that builds cities_links.

【Comments】:

    Tags: python web-scraping beautifulsoup screen-scraping


    【Answer 1】:
    import requests
    from bs4 import BeautifulSoup
    
    main_url = "http://www.chapter-living.com/properties/highbury"
    re = requests.get(main_url)
    soup = BeautifulSoup(re.text, "html.parser")
    city_tags = soup.find_all('h2', class_="title")
    cities_links = [main_url + tag.find('a').get('href','') if tag.find('a') else '' for tag in city_tags]
    print(cities_links)
    

    This produces:

    [u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-en-suite/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-en-suite/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-premium-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/gold-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/platinum-studio/', u'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/two-bed-flat/', '', '', '', '', '', '']
    

    Alternatively, you can use the lxml module, which can be roughly an order of magnitude faster than BeautifulSoup:

    import requests
    from lxml import html
    
    main_url = "http://www.chapter-living.com/properties/highbury"
    re = requests.get(main_url)
    root = html.fromstring(re.content)
    cities_links = [main_url + link for link in root.xpath('//h2[@class="title"]/a/@href')]
    print(cities_links)
    

    This produces:

    ['http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-en-suite/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-en-suite/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/bronze-premium-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/silver-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/gold-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/platinum-studio/', 'http://www.chapter-living.com/properties/highbury/properties/highbury/rooms/two-bed-flat/']
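
Note that the URLs in both outputs repeat the /properties/highbury segment, because each href already starts at the site root and is appended to main_url by plain string concatenation. A minimal sketch of one way to avoid this, using urljoin from the standard library; the href value below is an assumption inferred from the printed output, not fetched from the live page:

```python
from urllib.parse import urljoin

main_url = "http://www.chapter-living.com/properties/highbury"

# Assumed href value, inferred from the duplicated path in the printed output.
href = "/properties/highbury/rooms/bronze-en-suite/"

# Plain concatenation keeps the path of main_url and appends the href to it.
concatenated = main_url + href

# urljoin resolves a root-relative href against the host of the base URL.
joined = urljoin(main_url, href)

print(concatenated)
print(joined)
```

Whether the joined form is the address the site actually serves would need to be checked against the page itself.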
    

    【Comments】:

    • Thanks, much appreciated!
    【Answer 2】:

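    The h2 tag has no href attribute. That attribute belongs to the a tag nested inside it. This is why you get the error: you are trying to access an attribute that does not exist on the tag.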
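
The behaviour can be reproduced without touching the network. The following is a small sketch on hypothetical markup (SAMPLE_HTML is made up for illustration and only mimics the h2/a nesting discussed here); it shows that subscript access on the h2 raises KeyError, that .get() returns None instead, and that the href is found on the nested a tag:

```python
from bs4 import BeautifulSoup

# Hypothetical markup, made up for illustration; it only mimics the h2/a nesting.
SAMPLE_HTML = """
<h2 class="title"><a href="/rooms/bronze-en-suite/">Bronze En-Suite</a></h2>
<h2 class="title">No link here</h2>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_h2 = soup.find("h2", class_="title")

# Subscript access raises KeyError when the tag does not carry the attribute.
try:
    first_h2["href"]
    raised = False
except KeyError:
    raised = True

# .get() returns None for a missing attribute instead of raising.
h2_href = first_h2.get("href")

# The href lives on the nested <a>; headings without a link are skipped.
hrefs = []
for h2 in soup.find_all("h2", class_="title"):
    anchor = h2.find("a")
    if anchor is not None and anchor.has_attr("href"):
        hrefs.append(anchor["href"])

print(raised, h2_href, hrefs)
```

Skipping headings that have no link, rather than substituting an empty string, keeps the resulting list free of blank entries.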

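    【Comments】:

    • I'm not sure I agree. Where would you say they are stored...?
    • I would say they are stored in the a tag. That is why the answer above uses tag.find('a'): the h2 tag has no href attribute.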