【Question title】: Is it possible to iterate down a webpage when scraping in Python?
【Posted】: 2022-01-20 14:58:14
【Question】:

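I am trying to scrape football fixtures from https://www.skysports.com/premier-league-fixtures, and I am struggling to assign a date to each match, because the dates sit in separate elements of their own rather than each fixture being nested under its date.

What I have so far is: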

def get_fixtures(webpage):
    table = webpage.find('div', {'class': 'fixres__body'})
    dates = table.find_all('h4', {'class': 'fixres__header2'})
    fixtures = table.find_all('div', 'fixres__item')
    dates_list = []
    fixtures_list = []
    for date in dates:
        dates_list.append(date.text)

    for fixture in fixtures:
        a = fixture.find('a')
        home_span = a.find('span',
                      {'class': 'matches__item-col matches__participant matches__participant--side1'})
        home_team = home_span.find('span').text.strip()

        away_span = a.find('span',
                      {'class': 'matches__item-col matches__participant matches__participant--side2'})
        away_team = away_span.find('span').text.strip()
        match = [home_team, away_team]

        fixtures_list.append(match)

    return dates_list, fixtures_list

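This gives me a list of fixtures and a list of dates, but the two are not associated with each other. Is there a way to iterate down the webpage so that when I hit a date "div", I can pull all the fixtures that follow it, up to the next date "div"?

【Comments】:

    Tags: python beautifulsoup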


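    【Solution 1】:

    I would think about it slightly differently. Rather than getting a date and assigning matches to it, find each match and then assign its date. Do this by getting all the fixtures first and iterating through them. Once you have a fixture, have BeautifulSoup look backwards with .findPrevious() to see which date that fixture falls under. (.findPrevious() is the older camelCase alias of .find_previous() in BeautifulSoup 4.)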

    def get_fixtures(webpage):
        table = webpage.find('div', {'class': 'fixres__body'})
           
        dates_list = []
        fixtures_list = []
    
        fixtures = table.find_all('div', {'class':'fixres__item'})
        for fixture in fixtures:
            a = fixture.find('a')
            home_span = a.find('span',
                          {'class': 'matches__item-col matches__participant matches__participant--side1'})
            home_team = home_span.find('span').text.strip()
    
            away_span = a.find('span',
                          {'class': 'matches__item-col matches__participant matches__participant--side2'})
            away_team = away_span.find('span').text.strip()
            match = [home_team, away_team]
            
            fixtures_list.append(match)
            
            date = fixture.findPrevious('h4', {'class': 'fixres__header2'}).text
            dates_list.append(date)
    
        return dates_list, fixtures_list
    

    Output:

    ['Friday 21st January', 'Saturday 22nd January', 'Saturday 22nd January', 'Saturday 22nd January', 'Saturday 22nd January', 'Saturday 22nd January', 'Sunday 23rd January', 'Sunday 23rd January', 'Sunday 23rd January', 'Sunday 23rd January', 'Tuesday 8th February', 'Tuesday 8th February', 'Tuesday 8th February', 'Wednesday 9th February', 'Wednesday 9th February', 'Wednesday 9th February', 'Wednesday 9th February', 'Thursday 10th February', 'Thursday 10th February', 'Saturday 12th February', 'Saturday 12th February', 'Saturday 12th February', 'Saturday 12th February', 'Saturday 12th February', 'Saturday 12th February', 'Sunday 13th February', 'Sunday 13th February', 'Sunday 13th February', 'Sunday 13th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Saturday 19th February', 'Sunday 20th February', 'Sunday 20th February', 'Friday 25th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Saturday 26th February', 'Sunday 27th February', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 5th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 12th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 19th March', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd 
April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 2nd April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 9th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 16th April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 23rd April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 30th April', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Saturday 7th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 15th May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May', 'Sunday 22nd May']
    [['Watford', 'Norwich City'], ['Everton', 'Aston Villa'], ['Brentford', 'Wolverhampton Wanderers'], ['Leeds United', 'Newcastle United'], ['Manchester United', 'West Ham United'], ['Southampton', 'Manchester City'], ['Arsenal', 'Burnley'], ['Crystal Palace', 'Liverpool'], ['Leicester City', 'Brighton and Hove Albion'], ['Chelsea', 'Tottenham Hotspur'], ['Newcastle United', 'Everton'], ['West Ham United', 'Watford'], ['Burnley', 'Manchester United'], ['Manchester City', 'Brentford'], ['Norwich City', 'Crystal Palace'], ['Tottenham Hotspur', 'Southampton'], ['Aston Villa', 'Leeds United'], ['Liverpool', 'Leicester City'], ['Wolverhampton Wanderers', 'Arsenal'], ['Manchester United', 'Southampton'], ['Brentford', 'Crystal Palace'], ['Chelsea', 'Arsenal'], ['Everton', 'Leeds United'], ['Watford', 'Brighton and Hove Albion'], ['Norwich City', 'Manchester City'], ['Burnley', 'Liverpool'], ['Newcastle United', 'Aston Villa'], ['Tottenham Hotspur', 'Wolverhampton Wanderers'], ['Leicester City', 'West Ham United'], ['West Ham United', 'Newcastle United'], ['Arsenal', 'Brentford'], ['Aston Villa', 'Watford'], ['Brighton and Hove Albion', 'Burnley'], ['Crystal Palace', 'Chelsea'], ['Liverpool', 'Norwich City'], ['Southampton', 'Everton'], ['Manchester City', 'Tottenham Hotspur'], ['Leeds United', 'Manchester United'], ['Wolverhampton Wanderers', 'Leicester City'], ['Southampton', 'Norwich City'], ['Leeds United', 'Tottenham Hotspur'], ['Arsenal', 'Liverpool'], ['Brentford', 'Newcastle United'], ['Brighton and Hove Albion', 'Aston Villa'], ['Crystal Palace', 'Burnley'], ['Manchester United', 'Watford'], ['West Ham United', 'Wolverhampton Wanderers'], ['Everton', 'Manchester City'], ['Chelsea', 'Leicester City'], ['Aston Villa', 'Southampton'], ['Burnley', 'Chelsea'], ['Leicester City', 'Leeds United'], ['Liverpool', 'West Ham United'], ['Manchester City', 'Manchester United'], ['Newcastle United', 'Brighton and Hove Albion'], ['Norwich City', 'Brentford'], ['Tottenham 
Hotspur', 'Everton'], ['Watford', 'Arsenal'], ['Wolverhampton Wanderers', 'Crystal Palace'], ['Arsenal', 'Leicester City'], ['Brentford', 'Burnley'], ['Brighton and Hove Albion', 'Liverpool'], ['Chelsea', 'Newcastle United'], ['Crystal Palace', 'Manchester City'], ['Everton', 'Wolverhampton Wanderers'], ['Leeds United', 'Norwich City'], ['Manchester United', 'Tottenham Hotspur'], ['Southampton', 'Watford'], ['West Ham United', 'Aston Villa'], ['Aston Villa', 'Arsenal'], ['Burnley', 'Southampton'], ['Leicester City', 'Brentford'], ['Liverpool', 'Manchester United'], ['Manchester City', 'Brighton and Hove Albion'], ['Newcastle United', 'Crystal Palace'], ['Norwich City', 'Chelsea'], ['Tottenham Hotspur', 'West Ham United'], ['Watford', 'Everton'], ['Wolverhampton Wanderers', 'Leeds United'], ['Brighton and Hove Albion', 'Norwich City'], ['Burnley', 'Manchester City'], ['Chelsea', 'Brentford'], ['Crystal Palace', 'Arsenal'], ['Leeds United', 'Southampton'], ['Liverpool', 'Watford'], ['Manchester United', 'Leicester City'], ['Tottenham Hotspur', 'Newcastle United'], ['West Ham United', 'Everton'], ['Wolverhampton Wanderers', 'Aston Villa'], ['Arsenal', 'Brighton and Hove Albion'], ['Aston Villa', 'Tottenham Hotspur'], ['Brentford', 'West Ham United'], ['Everton', 'Manchester United'], ['Leicester City', 'Crystal Palace'], ['Manchester City', 'Liverpool'], ['Newcastle United', 'Wolverhampton Wanderers'], ['Norwich City', 'Burnley'], ['Southampton', 'Chelsea'], ['Watford', 'Leeds United'], ['Aston Villa', 'Liverpool'], ['Everton', 'Crystal Palace'], ['Leeds United', 'Chelsea'], ['Manchester United', 'Norwich City'], ['Newcastle United', 'Leicester City'], ['Southampton', 'Arsenal'], ['Tottenham Hotspur', 'Brighton and Hove Albion'], ['Watford', 'Brentford'], ['West Ham United', 'Burnley'], ['Wolverhampton Wanderers', 'Manchester City'], ['Arsenal', 'Manchester United'], ['Brentford', 'Tottenham Hotspur'], ['Brighton and Hove Albion', 'Southampton'], ['Burnley', 
'Wolverhampton Wanderers'], ['Chelsea', 'West Ham United'], ['Crystal Palace', 'Leeds United'], ['Leicester City', 'Aston Villa'], ['Liverpool', 'Everton'], ['Manchester City', 'Watford'], ['Norwich City', 'Newcastle United'], ['Aston Villa', 'Norwich City'], ['Everton', 'Chelsea'], ['Leeds United', 'Manchester City'], ['Manchester United', 'Brentford'], ['Newcastle United', 'Liverpool'], ['Southampton', 'Crystal Palace'], ['Tottenham Hotspur', 'Leicester City'], ['Watford', 'Burnley'], ['West Ham United', 'Arsenal'], ['Wolverhampton Wanderers', 'Brighton and Hove Albion'], ['Arsenal', 'Leeds United'], ['Brentford', 'Southampton'], ['Brighton and Hove Albion', 'Manchester United'], ['Burnley', 'Aston Villa'], ['Chelsea', 'Wolverhampton Wanderers'], ['Crystal Palace', 'Watford'], ['Leicester City', 'Everton'], ['Liverpool', 'Tottenham Hotspur'], ['Manchester City', 'Newcastle United'], ['Norwich City', 'West Ham United'], ['Aston Villa', 'Crystal Palace'], ['Everton', 'Brentford'], ['Leeds United', 'Brighton and Hove Albion'], ['Manchester United', 'Chelsea'], ['Newcastle United', 'Arsenal'], ['Southampton', 'Liverpool'], ['Tottenham Hotspur', 'Burnley'], ['Watford', 'Leicester City'], ['West Ham United', 'Manchester City'], ['Wolverhampton Wanderers', 'Norwich City'], ['Arsenal', 'Everton'], ['Brentford', 'Leeds United'], ['Brighton and Hove Albion', 'West Ham United'], ['Burnley', 'Newcastle United'], ['Chelsea', 'Watford'], ['Crystal Palace', 'Manchester United'], ['Leicester City', 'Southampton'], ['Liverpool', 'Wolverhampton Wanderers'], ['Manchester City', 'Aston Villa'], ['Norwich City', 'Tottenham Hotspur']]
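
The backwards lookup can also be tried offline on a small snippet. The following is a minimal sketch, assuming a simplified, hand-written HTML fragment (`SAMPLE_HTML` is hypothetical and far simpler than the real Sky Sports markup) that reuses only the `fixres__header2` and `fixres__item` class names from the code above:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup: date headers and fixtures are siblings,
# so a fixture is not nested under its date.
SAMPLE_HTML = """
<div class="fixres__body">
  <h4 class="fixres__header2">Friday 21st January</h4>
  <div class="fixres__item">Watford v Norwich City</div>
  <h4 class="fixres__header2">Saturday 22nd January</h4>
  <div class="fixres__item">Everton v Aston Villa</div>
  <div class="fixres__item">Brentford v Wolverhampton Wanderers</div>
</div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

pairs = []
for fixture in soup.find_all("div", {"class": "fixres__item"}):
    # Walk backwards through the document from this fixture and stop at
    # the nearest preceding date header.
    header = fixture.find_previous("h4", {"class": "fixres__header2"})
    pairs.append((header.get_text(strip=True), fixture.get_text(strip=True)))

for date, fixture_text in pairs:
    print(date, "|", fixture_text)
```

Each fixture ends up paired with the closest date header above it, which is why both Saturday fixtures share the same date.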
    

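    Also, just a side note: I would probably structure this differently. You end up with a lot of repeated values (namely the dates), and you have to take care that all the lists stay the same length. In this case there are only 2 lists, so it is not hard to manage, but it can become a headache.

    I would consider building a JSON-like structure such as the following, where you can include other values using key:value relationships: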
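
To make the parallel-lists concern concrete, here is a small sketch of how the two returned lists could be paired up and grouped after the fact. The sample values are the first three entries of the output shown above; `fixtures_by_date` is a name chosen here for illustration only:

```python
# First three entries of each list from the output above.
dates_list = ["Friday 21st January", "Saturday 22nd January", "Saturday 22nd January"]
fixtures_list = [
    ["Watford", "Norwich City"],
    ["Everton", "Aston Villa"],
    ["Brentford", "Wolverhampton Wanderers"],
]

# zip() silently stops at the shorter list, so check the lengths explicitly.
if len(dates_list) != len(fixtures_list):
    raise ValueError("dates_list and fixtures_list are out of sync")

fixtures_by_date = {}
for date, match in zip(dates_list, fixtures_list):
    # Create the list for a date the first time it is seen, then append.
    fixtures_by_date.setdefault(date, []).append(match)

print(fixtures_by_date)
```

The explicit length check is the bookkeeping that a single dictionary keyed by date makes unnecessary.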

    def get_fixtures(webpage):
        table = webpage.find('div', {'class': 'fixres__body'})
           
        data = {}
    
        fixtures = table.find_all('div', {'class':'fixres__item'})
        for fixture in fixtures:
            a = fixture.find('a')
            home_span = a.find('span',
                          {'class': 'matches__item-col matches__participant matches__participant--side1'})
            home_team = home_span.find('span').text.strip()
            home_score = a.find_all('span', {'class':'matches__teamscores-side'})[0].text.strip()
    
            away_span = a.find('span',
                          {'class': 'matches__item-col matches__participant matches__participant--side2'})
            away_team = away_span.find('span').text.strip()
            away_score = a.find_all('span', {'class':'matches__teamscores-side'})[-1].text.strip()
            
            game_time = a.find('span', {'class':'matches__date'}).text.strip()
            
            game_status = a['data-status']
            
            match = {'homeTeam':home_team, 
                     'homeScore':home_score,
                     'awayTeam':away_team,
                     'awayScore':away_score,
                     'gameTime':game_time,
                     'gameStatus':game_status}
            
            date = fixture.findPrevious('h4', {'class': 'fixres__header2'}).text
            if date not in data.keys():
                data[date] = []
            data[date].append(match)
    
        return data
    

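    Now you have a dictionary in which each date maps to a list of that day's matches, with the home and away teams identified, the start time, and the score (if the game is live/current). Notably, it also picks up the postponed flag for a game.

    【Discussion】:

    • Yes, to answer your question: it is possible to do it the way you asked in the question. You would basically have it find a date, then add every fixture tag that follows until it reaches the next date, and repeat until there are no more dates. I just think this is the simpler approach.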
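
As a usage sketch, a dictionary of that shape can be flattened into one row per match, which is convenient for writing JSON or CSV. The sample `data` below is hypothetical: the keys match the function above, but the time, score and status values are placeholders rather than real scraped values, and `flatten` is a helper name introduced here:

```python
import json

# Hypothetical sample shaped like the dictionary returned above.
# gameTime, scores and gameStatus are placeholders, not real page values.
data = {
    "Friday 21st January": [
        {"homeTeam": "Watford", "homeScore": "", "awayTeam": "Norwich City",
         "awayScore": "", "gameTime": "TBC", "gameStatus": "placeholder"},
    ],
    "Saturday 22nd January": [
        {"homeTeam": "Everton", "homeScore": "", "awayTeam": "Aston Villa",
         "awayScore": "", "gameTime": "TBC", "gameStatus": "placeholder"},
        {"homeTeam": "Brentford", "homeScore": "", "awayTeam": "Wolverhampton Wanderers",
         "awayScore": "", "gameTime": "TBC", "gameStatus": "placeholder"},
    ],
}


def flatten(data):
    """Return one dict per match, with the date copied into each row."""
    rows = []
    for date, matches in data.items():
        for match in matches:
            rows.append({"date": date, **match})
    return rows


rows = flatten(data)
print(json.dumps(rows, indent=2))
```

The date is stored once per group in `data` and only repeated when a flat view is actually needed.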
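
For completeness, the forward approach described in this comment could look roughly like the sketch below. It is not the answerer's code: it assumes a simplified, hand-written fragment (`SAMPLE_HTML`), and `group_by_date` is a hypothetical helper. It relies on `find_all` returning tags in document order, so a running "current date" can be carried from each header to the fixtures that follow it:

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup reusing the class names from the question.
SAMPLE_HTML = """
<div class="fixres__body">
  <h4 class="fixres__header2">Friday 21st January</h4>
  <div class="fixres__item"><a>Watford v Norwich City</a></div>
  <h4 class="fixres__header2">Saturday 22nd January</h4>
  <div class="fixres__item"><a>Everton v Aston Villa</a></div>
  <div class="fixres__item"><a>Brentford v Wolverhampton Wanderers</a></div>
</div>
"""


def group_by_date(table):
    """Walk down the table once, attaching each fixture to the last date seen."""
    grouped = {}
    current_date = None
    # find_all with a list of names yields matching tags in document order.
    for element in table.find_all(["h4", "div"]):
        classes = element.get("class", [])
        if element.name == "h4" and "fixres__header2" in classes:
            # A new date header: start a new group.
            current_date = element.get_text(strip=True)
            grouped[current_date] = []
        elif element.name == "div" and "fixres__item" in classes:
            # A fixture: it belongs to the most recent date header, if any.
            if current_date is not None:
                grouped[current_date].append(element.get_text(strip=True))
    return grouped


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
table = soup.find("div", {"class": "fixres__body"})
grouped = group_by_date(table)
print(grouped)
```

Compared with looking backwards from each fixture, this makes a single pass down the page but has to keep track of state (`current_date`) between elements.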