【Title】:Trying to Extract Weblinks BeautifulSoup
【Posted】:2020-07-21 15:15:59
【Question】:

I am trying to extract all of the PDF links on this page.

My code is:

import requests
from bs4 import BeautifulSoup
from pprint import pprint

base_url = 'https://usda.library.cornell.edu'

url = 'https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items'

soup = BeautifulSoup(requests.get(url).pdf, 'html.parser')
b = []

page = 1
while True:
    pdf_urls = [a["href"] for a in soup.select('#release-items a[href$=".pdf"]')]
    pprint(pdf_urls)
    b.append(pdf_urls)

    m = soup.select_one('a[rel="next"][href]')
    if m and m['href'] != '#':
        soup = BeautifulSoup(requests.get(base_url + m['href']).pdf, 'html.parser')
    else:
        break

I get the following error:

AttributeError: 'Response' object has no attribute 'pdf'

Similar code works for text files. Where am I going wrong?

【Comments】:

    Tags: python pdf beautifulsoup python-requests


    【Solution 1】:

    A small change to your code should do the trick:

    import requests
    from bs4 import BeautifulSoup
    from pprint import pprint
    
    base_url = 'https://usda.library.cornell.edu'
    
    url = 'https://usda.library.cornell.edu/concern/publications/3t945q76s?locale=en#release-items'
    
    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    b = []
    
    page = 1
    while True:
        pdf_urls = [a["href"] for a in soup.select('#release-items a[href$=".pdf"]')]
        pprint(pdf_urls)
        b.append(pdf_urls)
    
        m = soup.select_one('a[rel="next"][href]')
        if m and m['href'] != '#':
            soup = BeautifulSoup(requests.get(base_url + m['href']).text, 'html.parser')
        else:
            break
    

    Change this:

    soup = BeautifulSoup(requests.get(url).pdf, 'html.parser')
    

    to:

    soup = BeautifulSoup(requests.get(url).text, 'html.parser')
    

    And this:

    soup = BeautifulSoup(requests.get(base_url + m['href']).pdf, 'html.parser')
    

    to this:

    soup = BeautifulSoup(requests.get(base_url + m['href']).text, 'html.parser')
    

    Output:

    ['https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/sb397x16q/b8516938c/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/g158c396h/8910kd95z/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/w6634p60m/2v23wd923/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/q237jb60d/8910kc45j/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/02871d57q/tx31r242v/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/pz50hc74s/pz50hc752/latest.pdf',
     'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/79408c82d/jw827v53v/latest.pdf',...
    

    and so on…
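One small follow-up, not part of the original answer: b.append(pdf_urls) stores one sub-list per page, so b ends up nested. If a single flat list of URLs is wanted, itertools.chain flattens it; the URLs below are made-up placeholders for the per-page results:

```python
from itertools import chain

# Hypothetical per-page results, as collected by b.append(pdf_urls)
b = [
    ["https://example.com/page1/a.pdf", "https://example.com/page1/b.pdf"],
    ["https://example.com/page2/c.pdf"],
]

# Flatten the list of per-page lists into one flat list of URLs
all_urls = list(chain.from_iterable(b))
print(all_urls)
```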

    【Discussion】:

      【Solution 2】:

      I get the following error:

      AttributeError: 'Response' object has no attribute 'pdf'
      

      The requests.get() method always returns a Response object:

      print(requests.get("https://stackoverflow.com/"))
      

      This will display:

      <Response [200]>
      

      If you inspect the available attributes with the dir() function, you will see that this Response object has no pdf attribute:

      ['__attrs__', '__bool__', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__enter__', '__eq__', '__exit__', '__format__', '__ge__', '__getattribute__', '__getstate__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__iter__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setstate__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_content', '_content_consumed', '_next', 'apparent_encoding', 'close', 'connection', 'content', 'cookies', 'elapsed', 'encoding', 'headers', 'history', 'is_permanent_redirect', 'is_redirect', 'iter_content', 'iter_lines', 'json', 'links', 'next', 'ok', 'raise_for_status', 'raw', 'reason', 'request', 'status_code', 'text', 'url']
      

      You need to use requests.get(url).content to make the soup:

      soup = BeautifulSoup(requests.get(url).content,'html.parser')
      
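As a side note, BeautifulSoup accepts either form: response.text is the decoded str, response.content is the raw bytes, and both parse to the same tree. A minimal offline sketch with a hard-coded HTML snippet standing in for the response (no network call):

```python
from bs4 import BeautifulSoup

html_text = '<a href="latest.pdf">PDF</a>'  # what response.text would give (str)
html_bytes = html_text.encode("utf-8")      # what response.content would give (bytes)

# Both inputs produce the same parsed tree
soup_from_text = BeautifulSoup(html_text, "html.parser")
soup_from_bytes = BeautifulSoup(html_bytes, "html.parser")
print(soup_from_text.a["href"], soup_from_bytes.a["href"])
```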

      I am trying to extract all of the PDF links on this page.

      Inspect the HTML body and you will see that every file entry has a "file_set" class. You can grab the "href" of each one directly with a list comprehension:

      pdf_urls = [x.a["href"] for x in soup.find_all(class_ = "file_set")]
      

      Printing pdf_urls gives all of the PDF links: print(pdf_urls)

      ['https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/sb397x16q/b8516938c/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/g158c396h/8910kd95z/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/w6634p60m/2v23wd923/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/q237jb60d/8910kc45j/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/02871d57q/tx31r242v/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/pz50hc74s/pz50hc752/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/79408c82d/jw827v53v/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/1544c4419/6108vs89v/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/k930cb595/8910k788h/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/st74d522v/qb98mv97t/latest.pdf', 'https://downloads.usda.library.cornell.edu/usda-esmis/files/3t945q76s/sb397x16q/b8516938c/latest.pdf']
      
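One caveat with x.a["href"]: if any "file_set" element lacked an <a> child, x.a would be None and indexing it would raise a TypeError. A guarded variant, sketched against a hypothetical hard-coded snippet rather than the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical markup mimicking the page structure
html = """
<div class="file_set"><a href="https://example.com/a/latest.pdf">A</a></div>
<div class="file_set"><a href="https://example.com/b/latest.pdf">B</a></div>
<div class="file_set"><span>no link here</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# Skip any file_set entry that has no <a> tag before indexing into it
pdf_urls = [x.a["href"] for x in soup.find_all(class_="file_set") if x.a is not None]
print(pdf_urls)
```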

      【Discussion】:
