【Question Title】: How can I scrape the URL of the PDF from the website?
【Posted】: 2021-09-08 12:00:16
【Question】:

Can anyone help me with the last list - EventLinks? I want to scrape the URLs of the PDFs along with the other data referenced in the code below. However, I am struggling to get the URLs from this page - https://ibbi.gov.in/public-announcement?ann=&title=&date=

    CompanyName = driver.find_elements_by_xpath('/html/body/div[5]/div/div/div/div/div/div/div/div[2]/table/tbody/tr/td[4]')
    Date = driver.find_elements_by_xpath('/html/body/div[5]/div/div/div/div/div/div/div/div[2]/table/tbody/tr/td[2]')
    EventType = driver.find_elements_by_xpath('/html/body/div[5]/div/div/div/div/div/div/div/div[2]/table/tbody/tr/td[1]')
    EvidenceLink = driver.find_elements_by_xpath('/html/body/div[5]/div/div/div/div/div/div/div/div[2]/table/tbody/tr/td[7]/a')

    for i in range(len(CompanyName)):
        print(CompanyName[i].text)
        Name_.append(CompanyName[i].text)

    for i in range(len(Date)):
        print(Date[i].text)
        Date_.append(Date[i].text)

    for i in range(len(EventType)):
        print(EventType[i].text)
        EventType_.append(EventType[i].text)

    for i in range(len(EvidenceLink)):
        print(EvidenceLink[i])
        EvidenceLink_.append(EvidenceLink[i])

The XPATH of the URL is - /html/body/div[5]/div/div/div/div/div/div/div/div[2]/table/tbody/tr[1]/td[7]/a

【Question Discussion】:

    Tags: python selenium web-scraping


    【Solution 1】:

    You need to extract the PDF URL from the 'onclick' attribute of the `a` tag inside the seventh `td`:

    import time

    import requests as rq
    from bs4 import BeautifulSoup as bs
    
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"}
    
    final_data = []
    
    for curr_page in range(1, 300): # loop through pages 1 to 299
        url = "https://ibbi.gov.in/public-announcement?ann=&title=&date=&page=%s" % curr_page
        resp = rq.get(url, headers=headers, verify=False)
        soup = bs(resp.content, "lxml")
    
        table = soup.find_all("div", {"class": "table-responsive"})[0].find('tbody')
    
        rows = table.find_all("tr")
        data = []
    
        for row in rows:
            row_data = []
            for (icol, col) in enumerate(row.find_all('td')):
                if icol == 6:
                    pdf_link = col.find('a')['onclick']
                    start = pdf_link.index('https://')
                    end = pdf_link.index('.pdf')
                    row_data.append(pdf_link[start:end+4])
                else:
                    row_data.append(col.text.strip())
            data.append(row_data)
    
        final_data.extend(data)
    
        time.sleep(2)
    

    This produces the following output:

    [['Public Announcement of Corporate Insolvency Resolution Process',
      '07-09-2021',
      '21-09-2021',
      'SUBHASHRI BIO-ENERGIES PRIVATE LIMITED',
      'Indian Overseas Bank',
      'Palanigounder Eswaramoorthy',
      'https://ibbi.gov.in//uploads/announcement/9c534bb61bb51c02ec1b59df7c9f416b.pdf',
      ''],
     ['Public Announcement of Corporate Insolvency Resolution Process',
      '06-09-2021',
      '17-09-2021',
      'VME PROPERTIES PRIVATE LIMITED',
      'ALCHEMIST ASSET RECONSTRUCTION COMPANY LIMITED',
      'Sapan Mohan Garg',
      'https://ibbi.gov.in//uploads/announcement/a9fc3a2266d4138b6fc693696dd1f6f9.pdf',
      ''],
    ...]
    

    【Discussion】:

    • Thank you, sir, but what if I want to scrape multiple pages? How can I do that?
    • With a simple loop in which you increment 'curr_page'; you can 'import time' and add 'sleep(2)' at the end of each iteration (to avoid a ban or other bad things).
    • Could you help me add the page loop to the same code? I am new to this and don't know how to do it.
    • Thanks for your help. However, I get an error - ValueError Traceback (most recent call last): 23 pdf_link = col.find('a')['onclick'] 24 start = pdf_link.index('https://') ---> 25 end = pdf_link.index('.pdf') 26 row_data.append(pdf_link[start:end+4]) 27 else: ValueError: substring not found
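
    The ValueError in the last comment occurs when a row's 'onclick' value contains no '.pdf' substring, because `str.index` raises on a missing substring. A minimal defensive sketch (the helper name `extract_pdf_url` is mine, not from the original answer) that returns None instead of raising:

```python
import re

def extract_pdf_url(onclick):
    """Return the first https://...pdf URL found in an onclick string,
    or None when the row carries no PDF link (unlike str.index, which
    raises ValueError on a missing substring)."""
    match = re.search(r"https://\S+?\.pdf", onclick)
    return match.group(0) if match else None

# In the answer's row loop, rows without a PDF link could then be
# recorded as an empty string instead of crashing the scrape:
# pdf = extract_pdf_url(col.find('a').get('onclick', ''))
# row_data.append(pdf if pdf else '')
```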