【Title】: Scraping excel from website using python with _doPostBack link url hidden
【Posted】: 2016-07-24 16:28:36
【Question】:

For the past few days I have been trying to scrape the website linked below, which provides a number of excel and pdf files in a table. I was able to do this successfully for the main page. There are 59 pages in total from which these excel/pdf files have to be scraped. On most websites I have seen so far, there is a query parameter in the URL that changes as you move from one page to the next. In this case there is a _doPostBack function instead, which is probably why the URL stays the same on every page you visit. I have looked at several solutions and posts suggesting that I inspect the parameters of the POST call and reuse them, but I cannot make sense of the parameters in that POST call (this is my first time scraping a website).

Could someone recommend resources to help me write code that moves from one page to the next in python? Details below:

网站链接 - http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx

My current code for extracting the CAP excel sheets from the main page (this works fine, just for reference):

from urllib.request import urlopen, urlretrieve
from bs4 import BeautifulSoup
import re

Base = "http://accord.fairfactories.org/ffcweb/Web"
html = urlopen("http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx")
bs = BeautifulSoup(html, "html.parser")  # name the parser explicitly
i = 1
# Links to the CAP workbooks have ids starting with "CAP".
for link in bs.findAll("a", {"id": re.compile(r"CAP(?!\w)")}):
    if 'href' in link.attrs:
        name = str(i) + ".xlsx"
        # hrefs are relative ("../...") - rebuild an absolute URL.
        path = link.attrs['href'].strip(".")
        urlretrieve(Base + path, name)
        i += 1

Please let me know if I have left out any information, and please don't downvote me - otherwise I won't be able to ask further questions.
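For background, `__doPostBack` is a small JavaScript helper that ASP.NET injects into the page: it copies its two arguments into the hidden `__EVENTTARGET` and `__EVENTARGUMENT` inputs and submits the form. "Moving to the next page" is therefore just a POST whose body echoes back the page's hidden fields. A stdlib-only sketch of how those fields can be collected (the sample markup below is invented for illustration; the live page's `__VIEWSTATE` value is a long opaque blob):

```python
from html.parser import HTMLParser

class HiddenFieldParser(HTMLParser):
    """Collect the hidden <input> fields that an ASP.NET page
    expects back with every __doPostBack submission."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input" and a.get("type") == "hidden" and "name" in a:
            self.fields[a["name"]] = a.get("value", "")

# A trimmed-down stand-in for the real page source.
sample = '''
<form method="post" action="InspectionReportsEnglish.aspx">
  <input type="hidden" name="__EVENTTARGET" value="" />
  <input type="hidden" name="__EVENTARGUMENT" value="" />
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTM4..." />
  <input type="hidden" name="__EVENTVALIDATION" value="/wEWAg..." />
</form>
'''

parser = HiddenFieldParser()
parser.feed(sample)
print(sorted(parser.fields))
```

Posting those names and values back, with `__EVENTTARGET`/`__EVENTARGUMENT` set to the paging target, is exactly what the accepted answer below automates.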

【Comments】:

    Tags: python web-scraping dopostback


    【Solution 1】:

    For aspx sites you need to pull hidden fields such as `__EVENTTARGET` and `__EVENTVALIDATION` out of the page and post those parameters with every request. The following fetches all the pages using `requests` and `bs4`:

    import requests
    from bs4 import BeautifulSoup
    from urlparse import urljoin  # Python 2; on Python 3: from urllib.parse import urljoin


    # All the keys need their values refreshed for each request bar
    # __EVENTTARGET, which stays the same.
    data = {
        "__EVENTTARGET": "gvFlex",
        "__VIEWSTATE": "",
        "__VIEWSTATEGENERATOR": "",
        "__VIEWSTATEENCRYPTED": "",
        "__EVENTVALIDATION": ""}


    def validate(soup, data):
        # update the hidden post values in data, skipping any
        # field the current page does not render.
        for k in data:
            if k != "__EVENTTARGET":
                hidden = soup.select_one("#{}".format(k))
                if hidden is not None:
                    data[k] = hidden["value"]


    def get_all_excel():
        base = "http://accord.fairfactories.org/ffcweb/Web"
        url = "http://accord.fairfactories.org/ffcweb/Web/ManageSuppliers/InspectionReportsEnglish.aspx"
        with requests.Session() as s:
            # Add a user agent for each subsequent request.
            s.headers.update({"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"})
            r = s.get(url)
            bs = BeautifulSoup(r.content, "lxml")
            # get links from the initial page.
            for xcl in bs.select('a[id*="CAP"]'):
                yield urljoin(base, xcl["href"])
            # re-validate the post data in our dict for each request.
            validate(bs, data)
            last = bs.select_one('a[href*="Page$Last"]')
            i = 2
            # keep going until the "last page" button is no longer visible.
            while last:
                # Increase the counter to target the next page.
                data["__EVENTARGUMENT"] = "Page${}".format(i)
                r = s.post(url, data=data)
                bs = BeautifulSoup(r.content, "lxml")
                for xcl in bs.select('a[id*="CAP"]'):
                    yield urljoin(base, xcl["href"])
                last = bs.select_one('a[href*="Page$Last"]')
                # again re-validate for the next request.
                validate(bs, data)
                i += 1


    for x in get_all_excel():
        print(x)
    

    If we run it for the first three pages, you can see we get the data you want:

    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9965
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9552
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10650
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11969
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10086
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10905
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10840
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9229
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11310
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9178
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9614
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9734
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10063
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10871
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9468
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9799
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9278
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12252
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9342
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9966
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11595
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9652
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10271
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10365
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10087
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9967
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11740
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12375
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11643
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10952
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12013
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9810
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10953
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10038
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9664
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12256
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9262
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9210
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9968
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9811
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11610
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9455
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11899
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10273
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9766
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9969
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10088
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10366
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9393
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9813
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11795
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9814
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11273
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=12187
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10954
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9556
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11709
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9676
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10251
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10602
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10089
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9908
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10358
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9469
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11333
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9238
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9816
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9817
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10736
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10622
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9394
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9818
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=10592
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=9395
    http://accord.fairfactories.org/Utilities/DownloadFile.aspx?id=11271
    
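    To turn the printed links into files on disk, the `id` query parameter makes a convenient, stable filename. A stdlib-only sketch in the same spirit as the question's `urlretrieve` approach (`excel_name` and `save_excel` are hypothetical helper names, not part of the answer above):

    ```python
    import os
    from urllib.parse import urlsplit, parse_qs
    from urllib.request import urlretrieve

    def excel_name(url, dest="downloads"):
        # DownloadFile.aspx?id=9965 -> downloads/9965.xlsx
        file_id = parse_qs(urlsplit(url).query).get("id", ["unknown"])[0]
        return os.path.join(dest, "{}.xlsx".format(file_id))

    def save_excel(url, dest="downloads"):
        # Fetch one workbook to an id-based local path.
        os.makedirs(dest, exist_ok=True)
        urlretrieve(url, excel_name(url, dest))
    ```

    Feeding each URL yielded by `get_all_excel()` into `save_excel` would then write one `.xlsx` per report.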

    【Discussion】:

    • Thanks a lot, Padraic. You're a star :)
    • Dear Padraic, I get the following error when I try to run the code. Could you help again:
    • Error: data[k] = soup.select_one("#{}".format(k))["value"] TypeError: 'NoneType' object is not callable
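    That `TypeError` on `select_one` usually means an old BeautifulSoup where `select_one` does not exist yet (attribute access returns `None`, which then gets called), or a page on which the looked-up hidden field is missing. A defensive variant of `validate` using `find`, which is available in every bs4 version, might look like this - a sketch, not tested against the live site:

    ```python
    from bs4 import BeautifulSoup

    def validate(soup, data):
        # Refresh the hidden post values, skipping any field the
        # current page does not render.
        for k in data:
            if k == "__EVENTTARGET":
                continue
            hidden = soup.find("input", {"id": k})  # find() exists in all bs4 versions
            if hidden is not None and hidden.has_attr("value"):
                data[k] = hidden["value"]

    # Quick check against a snippet that omits __VIEWSTATEENCRYPTED:
    page = '<input type="hidden" id="__VIEWSTATE" name="__VIEWSTATE" value="abc"/>'
    data = {"__EVENTTARGET": "gvFlex", "__VIEWSTATE": "", "__VIEWSTATEENCRYPTED": ""}
    validate(BeautifulSoup(page, "html.parser"), data)
    print(data["__VIEWSTATE"])  # -> abc
    ```

    Upgrading bs4 (`pip install -U beautifulsoup4`) is the other half of the fix if `select_one` itself is the missing piece.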