Scraping thousands of URLs: answers

【Question title】: Scraping thousands of URLs
【Posted】: 2021-11-15 16:13:01
【Question】:

I have a function that scrapes a list of URLs (200k of them) and it takes a very long time. Is there any way to speed this up?

import requests
from bs4 import BeautifulSoup

def get_odds(ids):
  headers = {"Referer": "https://www.betexplorer.com",
             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
  s = requests.Session()  # one session, reused for every request
  matches=[]
  for id in ids:
    url = f'https://www.betexplorer.com{id}'
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text,'html.parser')
    season = url.split('/')[5]

    "do stuff.."

ids is a list:

['/soccer/england/premier-league/brentford-norwich/vyLXXRcE/'
...]

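【Comments】:

The first step is to use a profiler to measure where most of the time is spent.
Each URL takes 1-2 seconds, but multiplied by 200k that is a lot of time.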
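
A minimal sketch of the profiling suggestion above, using the standard-library cProfile module. The functions fake_download, fake_parse and handle_one are hypothetical stand-ins (a sleep in place of the HTTP request, a small loop in place of HTML parsing), so the numbers only illustrate how to read the report, not how the real site behaves.

```python
import cProfile
import io
import pstats
import time


def fake_download():
    # Hypothetical stand-in for s.get(url, headers=headers): just waits.
    time.sleep(0.05)


def fake_parse():
    # Hypothetical stand-in for the BeautifulSoup parsing step: burns a little CPU.
    sum(i * i for i in range(20_000))


def handle_one():
    # One iteration of the scraping loop: download, then parse.
    fake_download()
    fake_parse()


def profile_call(func, top=5):
    """Run func under cProfile and return the `top` entries by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    func()
    profiler.disable()
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(top)
    return stream.getvalue()


report = profile_call(handle_one)
print(report)
```

If the download step dominates the cumulative time column, as it does in this stand-in, the job is waiting on the network rather than on the CPU, and running requests concurrently is the change most likely to help.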

Tags: python web-scraping beautifulsoup python-requests

【Solution 1】:

Yes, you can use multiprocessing.

Something like:

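from multiprocessing import Pool

if __name__ == "__main__":
    processes = 10  # number of worker processes, i.e. concurrent requests
    # map() calls get_odds once per element of ids, so get_odds must accept a single id.
    # Leaving the with-block terminates the pool, as p.terminate() did.
    with Pool(processes) as p:
        p.map(get_odds, ids)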

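Here ids is the list of ids and get_odds is the function you provided, modified to operate on a single id. Keep in mind that you will be sending 10 requests at a time to their server, which may lead to a temporary IP block (because you look hostile). Watch for this and either reduce the pool size or add sleep() logic.

The get_odds function should look like this: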

import requests
from bs4 import BeautifulSoup

def get_odds(id):
  # Handles exactly one id, e.g. '/soccer/england/premier-league/brentford-norwich/vyLXXRcE/'
  headers = {"Referer": "https://www.betexplorer.com",
             'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.107 Safari/537.36'}
  s = requests.Session()
  matches=[]
  url = f'https://www.betexplorer.com{id}'
  response = s.get(url, headers=headers)
  soup = BeautifulSoup(response.text,'html.parser')
  season = url.split('/')[5]

  "do stuff.."
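
The advice about adding sleep() logic could be sketched roughly as follows. The helper polite_get, its parameters base_delay and max_retries, and the choice of HTTP 429 and 503 as "slow down" signals are all assumptions made for illustration; nothing in the thread says how this particular site reports throttling.

```python
import random
import time


def polite_get(session, url, headers=None, base_delay=0.5, max_retries=3,
               sleep=time.sleep):
    """Fetch url with a jittered pause before each try and backoff on throttling.

    `session` is anything with a requests-style get() method, for example a
    requests.Session. `sleep` is injectable so the pauses can be skipped in tests.
    """
    response = None
    for attempt in range(max_retries + 1):
        # Pause before every request so the workers do not hit the server in lockstep.
        sleep(base_delay + random.uniform(0, base_delay))
        response = session.get(url, headers=headers, timeout=10)
        # Assumption: 429 / 503 mean "too many requests"; anything else is returned as-is.
        if response.status_code not in (429, 503):
            return response
        # Exponential backoff before the next attempt: base, 2*base, 4*base, ...
        sleep(base_delay * 2 ** attempt)
    return response  # still throttled after all retries; the caller decides what to do
```

Inside get_odds, the call s.get(url, headers=headers) would then become polite_get(s, url, headers=headers). With 10 workers and the default base_delay, this caps the rate at roughly 10 to 20 requests per second in total.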

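【Comments】:

Hm, this problem seems to be I/O-bound, so wouldn't you be better off using threading or asyncio?
It doesn't seem to work. I get season = url.split('/')[5] IndexError: list index out of range
@Timus Multiprocessing has a higher memory footprint than threading, but in my experience it leads to more readable code and therefore fewer bugs than threading. For web scraping, I find that you hit rate limits long before you hit memory problems. I have no experience with asyncio, so I can't comment on it.
@luka That is a problem with your get_odds function, not with the code I provided. Once your id is formatted into the url, it contains fewer than 6 forward slashes, so you get an index error. Can you share what your ids look like?
@PeterWhite '/soccer/england/premier-league/brentford-norwich/vyLXXRcE/' This is what the id list looks like. When I run the script without multithreading, I get no errors.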
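
One plausible explanation for the IndexError discussed above, offered as an assumption since the thread never shows the code that was actually run: if the original looping version of get_odds is passed to p.map unchanged, each call receives a single string, and "for id in ids" then iterates over its characters. The helper segments_for below is a hypothetical reduction of that loop to just the URL-splitting step.

```python
BASE = "https://www.betexplorer.com"
match_id = "/soccer/england/premier-league/brentford-norwich/vyLXXRcE/"


def segments_for(ids):
    """Replicates the loop from the question: build one URL per element of ids and split it."""
    return [f"{BASE}{id}".split("/") for id in ids]


# Called with a list, each element is a full path and index 5 exists.
print(segments_for([match_id])[0][5])   # premier-league

# Called with one string (what Pool.map hands to each call), the loop sees
# single characters, so the first "URL" is just the base plus "/".
print(segments_for(match_id)[0])        # ['https:', '', 'www.betexplorer.com', '']
```

The second list has only four elements, so indexing it with [5] raises IndexError, which would also explain why the same script runs cleanly without the pool.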

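【Solution 2】:

You can run them in parallel with multithreading. For example, create 10 threads and have each thread work out which ids it should scrape from the last digit of the id (0, 1, 2, 3, ...). This only helps if you have enough computing power and a stable internet connection.

Edit: since the ids are in a list, use each id's index to determine which thread should scrape which page.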
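
The index-based split described in the edit could look roughly like this, using the standard-library ThreadPoolExecutor. The names partition_by_index, scrape_chunk and scrape_all are hypothetical, and fetch stands in for a single-id function such as the get_odds from Solution 1; this is a sketch under those assumptions, not the answerer's own code.

```python
from concurrent.futures import ThreadPoolExecutor


def partition_by_index(items, n_workers):
    """Worker k receives the items at indexes k, k + n_workers, k + 2 * n_workers, ..."""
    return [items[k::n_workers] for k in range(n_workers)]


def scrape_chunk(chunk, fetch):
    """Process one worker's share sequentially; fetch handles a single id."""
    return [fetch(item) for item in chunk]


def scrape_all(ids, fetch, n_workers=10):
    """Split ids across n_workers threads by index and collect every result."""
    chunks = partition_by_index(ids, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # One task per chunk; results come back in chunk order, not original id order.
        per_chunk = pool.map(scrape_chunk, chunks, [fetch] * len(chunks))
        return [result for chunk_result in per_chunk for result in chunk_result]
```

A call would look like scrape_all(ids, get_odds, n_workers=10). Passing the ids straight to pool.map(fetch, ids) is a simpler alternative that lets the executor hand out work as threads become free, which balances the load better when some pages are slower than others.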

【Comments】: