【Question Title】: Scraping multiple web pages, but the results are overwritten by the last url
【Posted】: 2019-05-23 14:18:46
【Question】:

I want to scrape all the URLs from multiple web pages. It works, but only the results from the last web page are saved to the file.

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
import re
import requests

urls=['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2', '...']

for url in urls:
    response = requests.get(url)
    req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
    html_page = urlopen(req).read()
    soup = BeautifulSoup(html_page, features="html.parser")

links = []
for link in soup.findAll('a', attrs={'href': re.compile("^/movie/([a-zA-Z0-9\-])+$")}):
    links.append(link.get('href'))

filename = 'output.csv'

with open(filename, mode="w") as outfile:
    for s in links:
        outfile.write("%s\n" % s)

What am I missing here?

It would be even cooler if I could use a csv file containing all the URLs instead of a list. But nothing I have tried has come close to working...

【Comments】:

    Tags: python python-3.x web-scraping beautifulsoup urllib


    【Answer 1】:

    You are only using the soup from the last of your URLs. You should move your second loop inside the first one. Also, you are collecting every element that matches your regex, including elements outside the list you are actually trying to scrape.

    from bs4 import BeautifulSoup
    from urllib.request import Request, urlopen
    import re
    
    urls = ['https://www.metacritic.com/browse/movies/genre/date?page=0', 'https://www.metacritic.com/browse/movies/genre/date?page=2']
    
    links = []
    for url in urls:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html_page = urlopen(req).read()
        soup = BeautifulSoup(html_page, features="html.parser")
        # Only take movies from the product list; otherwise the "coming soon"
        # section would be scraped as well. That is why select_one is used here.
        for link in soup.select_one('ol.list_products').findAll('a', attrs={'href': re.compile(r"^/movie/([a-zA-Z0-9\-])+$")}):
            links.append(link.get('href'))
    
    
    filename = 'output.csv'
    
    with open(filename, mode="w") as outfile:
        for s in links:
            outfile.write("%s\n" % s)
    

    Here is the result.

    /movie/woman-at-war
    /movie/destroyer
    /movie/aquaman
    /movie/bumblebee
    /movie/between-worlds
    /movie/american-renegades
    /movie/mortal-engines
    /movie/spider-man-into-the-spider-verse
    /movie/the-quake
    /movie/once-upon-a-deadpool
    /movie/all-the-devils-men
    /movie/dead-in-a-week-or-your-money-back
    /movie/blood-brother-2018
    /movie/ghostbox-cowboy
    /movie/robin-hood-2018
    /movie/creed-ii
    /movie/outlaw-king
    /movie/overlord-2018
    /movie/the-girl-in-the-spiders-web
    /movie/johnny-english-strikes-again
    /movie/hunter-killer
    /movie/bullitt-county
    /movie/the-night-comes-for-us
    /movie/galveston
    /movie/the-oath-2018
    /movie/mfkz
    /movie/viking-destiny
    /movie/loving-pablo
    /movie/ride-2018
    /movie/venom-2018
    /movie/sicario-2-soldado
    /movie/black-water
    /movie/jurassic-world-fallen-kingdom
    /movie/china-salesman
    /movie/incredibles-2
    /movie/superfly
    /movie/believer
    /movie/oceans-8
    /movie/hotel-artemis
    /movie/211
    /movie/upgrade
    /movie/adrift-2018
    /movie/action-point
    /movie/solo-a-star-wars-story
    /movie/feral
    /movie/show-dogs
    /movie/deadpool-2
    /movie/breaking-in
    /movie/revenge
    /movie/manhunt
    /movie/avengers-infinity-war
    /movie/supercon
    /movie/love-bananas
    /movie/rampage
    /movie/ready-player-one
    /movie/pacific-rim-uprising
    /movie/tomb-raider
    /movie/gringo
    /movie/the-hurricane-heist
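
    As a side note, the inner loop in the snippet above could also filter the anchors with a CSS attribute selector instead of the regex. This is only a sketch, and it is slightly looser than the pattern above, since it keeps any href that merely starts with /movie/:

    # Sketch: keep the select_one scoping, but filter the anchors with a CSS attribute selector.
    # Note: [href^="/movie/"] only checks the prefix, so it is looser than the original regex.
    for link in soup.select_one('ol.list_products').select('a[href^="/movie/"]'):
        links.append(link.get('href'))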
    

    【Comments】:

    • Thank you very much. This is really useful, I hadn't thought of excluding the coming-soon section!
    【Answer 2】:

    Hey, this is my first answer, so I'll try my best to help.

    The data gets overwritten because you iterate over your urls in one loop and then iterate over the soup object in a separate loop.

    That will always leave you with only the last soup object once the loop finishes, so your best bet is either to append each soup object to a list inside the url loop, or to actually query each soup object inside the url loop:

    from bs4 import BeautifulSoup
    from urllib.request import Request, urlopen
    
    # Collect one soup object per url instead of overwriting the same variable.
    soup_obj_list = []
    for url in urls:
        req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
        html_page = urlopen(req).read()
        soup = BeautifulSoup(html_page, features="html.parser")
        soup_obj_list.append(soup)
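
    Each soup object can then be queried after the loop. A minimal sketch, continuing from the snippet above and reusing the regex and output file from the question:

    import re
    
    # Sketch: pull the matching links out of every collected soup object, then write them out.
    links = []
    for soup in soup_obj_list:
        for link in soup.findAll('a', attrs={'href': re.compile(r"^/movie/([a-zA-Z0-9\-])+$")}):
            links.append(link.get('href'))
    
    with open('output.csv', mode="w") as outfile:
        for s in links:
            outfile.write("%s\n" % s)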
    

    Hope that solves your first problem. Can't really help with the csv part, though.
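
    For the csv side of the original question, a minimal sketch of reading the url list from a file instead of hard-coding it (assuming a hypothetical urls.csv with one URL per row):

    import csv
    
    # Sketch: build the urls list from a hypothetical urls.csv that holds one URL per row.
    with open('urls.csv', newline='') as infile:
        urls = [row[0] for row in csv.reader(infile) if row]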

    【Comments】:
