Looping through url directory failing - Python 3.x

【Question title】: Looping through url directory failing - Python 3.x
【Posted】: 2021-07-09 01:31:06
【Question】:

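I'm trying to accomplish a fairly simple task...

I want to loop through all the .csv files in a given GitHub repository, specifically this one

The following minimal, complete, reproducible example should illustrate the problem: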

import pandas as pd, urllib, requests, os, glob
base_url = 'https://github.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'
# https://stackoverflow.com/questions/39065921/what-do-raw-githubusercontent-com-urls-represent
base_raw_url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

#base_dir = os.listdir(base_url)
#base_raw_dir = os.listdir(base_raw_url)

# https://stackoverflow.com/questions/61036695/import-multiple-csv-files-from-github-folder-python-covid-19
csv_files = glob.glob(base_raw_url+'/*.csv')
print(csv_files)

[]

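csv_files is an empty list, and both os.listdir() attempts raise:

OSError: [WinError 123] The filename, directory name, or volume label syntax is incorrect: 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/tree/master/csse_covid_19_data/csse_covid_19_time_series'

How can I simply loop through the directory? In the end I want the full path (url) of each .csv file.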

【Comments】:

Tags: python github url

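【Solution 1】:

You cannot access files this way through a URL. os.listdir() only works on your local machine, and the same is true of glob.glob(), which is why it returns an empty list instead of raising an error. What you are trying to do is called "web scraping", and you will want to try "bs4" for the task. You need to parse the HTML and extract the relevant link for each file.

A handy tutorial on BS4: https://realpython.com/beautiful-soup-web-scraper-python/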
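
The parsing step described above could look roughly like the sketch below. It is not the answerer's code: to stay dependency-free it uses the standard library's html.parser in place of bs4 (with bs4 the same idea would be a loop over soup.find_all("a")). The class CsvLinkParser, the function csv_raw_urls and the sample_html string are hypothetical names made up for this illustration, and the sample markup is hand-written; the real GitHub page may be structured differently or rendered by JavaScript, in which case the links would not be present in the downloaded HTML at all. The one fixed fact it relies on is that a raw file URL has the form https://raw.githubusercontent.com/<owner>/<repo>/<branch>/<path>, with no "tree" or "blob" segment.

```python
from html.parser import HTMLParser

RAW_HOST = "https://raw.githubusercontent.com"


class CsvLinkParser(HTMLParser):
    """Collect the href of every <a> tag whose target ends in .csv."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if href.endswith(".csv") and href not in self.hrefs:
            self.hrefs.append(href)


def csv_raw_urls(html):
    """Return raw-content URLs for the .csv links found in a listing page."""
    parser = CsvLinkParser()
    parser.feed(html)
    # Assumption: file links look like /<owner>/<repo>/blob/<branch>/<path>.
    # The raw URL is the same path without the "/blob" segment.
    return [RAW_HOST + href.replace("/blob/", "/", 1) for href in parser.hrefs]


# Hypothetical, hand-written markup standing in for the downloaded page,
# e.g. the text of requests.get(base_url).
sample_html = """
<a href="/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/README.md">README.md</a>
<a href="/CSSEGISandData/COVID-19/blob/master/csse_covid_19_data/csse_covid_19_time_series/time_series_covid19_confirmed_global.csv">time_series_covid19_confirmed_global.csv</a>
"""

for url in csv_raw_urls(sample_html):
    print(url)
```

Each URL produced this way can be passed straight to pd.read_csv(), since pandas accepts http(s) URLs as well as local paths.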

【Comments】:

The data-pjax="#repo-content-pjax-container" attribute should be enough to pick out the file links when scraping.