【Question Title】: Download all csv files from URL
【Posted】: 2016-08-19 07:44:55
【Question Description】:

I want to download all the csv files linked from this page — any idea how I can do that?

from bs4 import BeautifulSoup
import requests

html = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(html, 'html.parser')
for link in soup.findAll("a"):
    print link.get("href")

【Question Discussion】:

  • Do you mean you want to download all the csv files linked from one page? I don't think iterating over all the links and checking the file extension is a bad idea.
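The approach the comment describes — walk every `<a>` tag and keep only hrefs ending in `.csv` — can be sketched against a small inline page. The HTML snippet and link paths below are made up for illustration; note that `link.get("href")` can return `None` for anchors without an href, so the filter guards against that:

```python
from bs4 import BeautifulSoup

# A tiny stand-in page with two csv links and one non-csv link
html = '''
<a href="mmz4281/1516/E0.csv">Premier League</a>
<a href="mmz4281/1516/E1.csv">Championship</a>
<a href="englandm.php">England</a>
'''
soup = BeautifulSoup(html, "html.parser")
# Keep only hrefs that exist and end in .csv
csv_links = [a.get("href") for a in soup.find_all("a")
             if a.get("href") and a.get("href").endswith(".csv")]
print(csv_links)  # ['mmz4281/1516/E0.csv', 'mmz4281/1516/E1.csv']
```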

Tags: python-2.7 csv download beautifulsoup


【Solution 1】:

Something like this should work:

from bs4 import BeautifulSoup
from time import sleep
import requests


if __name__ == '__main__':
    html = requests.get('http://www.football-data.co.uk/englandm.php').text
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.findAll("a"):
        current_link = link.get("href")
        # Skip anchors without an href and non-csv links
        if current_link and current_link.endswith('.csv'):
            print('Downloading %s' % current_link)
            sleep(10)  # be polite to the server between requests
            response = requests.get('http://www.football-data.co.uk/%s' % current_link, stream=True)
            # Flatten e.g. 'mmz4281/1516/E0.csv' into 'mmz4281_1516_E0.csv'
            fn = current_link.replace('/', '_')
            with open(fn, "wb") as handle:
                for data in response.iter_content(chunk_size=8192):
                    handle.write(data)

【Discussion】:

【Solution 2】:

You just need to filter the hrefs, which you can do with the CSS selector a[href$=".csv"], which matches hrefs ending in .csv, then join each one to the base url, request it, and finally write the content:

from bs4 import BeautifulSoup
import requests
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse
from os.path import basename

base = "http://www.football-data.co.uk/"
html = requests.get('http://www.football-data.co.uk/englandm.php').text
soup = BeautifulSoup(html, 'html.parser')
for link in (urljoin(base, a["href"]) for a in soup.select('a[href$=".csv"]')):
    with open(basename(link), "wb") as f:
        f.write(requests.get(link).content)
    

This will give you five files, E0.csv, E1.csv, E2.csv, E3.csv and E4.csv, containing all the data.
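One caveat not raised in the answer: basename() keeps only the last path segment, so csv links from different seasons that share a filename, e.g. .../1516/E0.csv and .../1617/E0.csv, would overwrite each other on disk. A hypothetical helper (the paths below are illustrative) that flattens the whole relative path into the local filename avoids this:

```python
# Hypothetical helper: turn a relative href like 'mmz4281/1516/E0.csv'
# into a unique local filename, so files from different seasons with the
# same basename do not clobber each other.
def local_name(href):
    return href.replace("/", "_")

print(local_name("mmz4281/1516/E0.csv"))  # mmz4281_1516_E0.csv
print(local_name("mmz4281/1617/E0.csv"))  # mmz4281_1617_E0.csv
```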

【Discussion】:

• It only prints response 200 for me