Beautiful Soup multiple findAll classes: answers

【Question title】: Beautiful Soup multiple findAll classes
【Posted】: 2020-04-08 23:41:11
【Question】:

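I'm working on a project to build a database of how many dogs are withdrawn before a race. I need to scrape the data and then write it to a CSV. My problem is that the data I'm scraping contains images instead of text (in the column between PLC and Greyhound on the web page). That means I run 2 separate loops to get the information I need, and it is then hard to join the results back together in the right place.

Here is the code.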

import csv

import requests
from bs4 import BeautifulSoup

URL = "https://www.thedogs.com.au/Racing/MeetResults.aspx?meetId=255268"
page = requests.get(URL)
soup = BeautifulSoup(page.text, 'html.parser')

#soup.findAll('td', class_='ResultsCenteredCellContents'):

# First loop: the src attribute of every image on the page
odds = []
dog = soup.findAll('img')
for a in dog:
    odds.append(a['src'].strip())

# Second loop: the text of every table cell on the page
odds1 = []
for b in soup.findAll('td'):
    odds1.append(b.text.strip())

So it would be great if I could run everything I need in a single loop and write the result to a CSV.

【Question comments】:

Tags: python-2.7 beautifulsoup

【Solution 1】:

Yes, those values are shown as images, but notice that each image is named after the number it represents: src="../Images/BoxNumber4.gif" is the image for 4.
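
As a minimal sketch of that idea in isolation, the number can be read straight from the filename. The helper name `box_number_from_src` and the regular expression are illustrative assumptions, not part of the original answer; they assume every box image follows the `BoxNumber<digits>.gif` pattern quoted above.

```python
import re

# Assumed pattern: filenames look like "BoxNumber4.gif" (taken from the example src above).
_BOX_RE = re.compile(r"BoxNumber(\d+)\.gif$", re.IGNORECASE)


def box_number_from_src(src):
    """Return the box number encoded in an image path, or None if the path does not match."""
    match = _BOX_RE.search(src.strip())
    return int(match.group(1)) if match else None


print(box_number_from_src("../Images/BoxNumber4.gif"))  # 4
print(box_number_from_src("../Images/logo.png"))        # None
```

Returning None for a non-matching path lets the caller decide what to write into the CSV when a cell holds some other image.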

import csv

import requests
from bs4 import BeautifulSoup


def SaveAsCsv(list_of_rows, file_name):
    """Append a single row to the CSV file."""
    try:
        print('\nSaving CSV Result')
        with open(file_name, 'a', newline='', encoding='utf-8') as outfile:
            writer = csv.writer(outfile)
            writer.writerow(list_of_rows)
            print("results saved successfully")
    except PermissionError:
        print(f"Please make sure {file_name} is closed \n")


URL = "https://www.thedogs.com.au/Racing/MeetResults.aspx?meetId=255268"
page = requests.get(URL, headers={'User-Agent': 'Mozilla/5.0'})

soup = BeautifulSoup(page.text, 'html.parser')
tables = soup.find_all('table', {'id': "gvRaceResults"})  # getting all 11 tabs
for table_index, table in enumerate(tables, 1):
    print(f'Getting tab {table_index} out of {len(tables)}')
    rows = table.find_all('tr')
    header_row = [th.text.strip() for th in rows[0].find_all('th')]
    SaveAsCsv(header_row, 'thedogs.csv')
    for index, row in enumerate(rows[1:], 1):
        print(f'Getting row {index} out of {len(rows[1:])}')
        row_list = []
        for col in row.find_all('td'):
            # Box-number cells hold an image instead of text;
            # .get() avoids a KeyError on cells without a class attribute
            if 'ResultsContentsNumber' in col.get('class', []):
                image_src = col.find('img').get('src')  # e.g. "../Images/BoxNumber4.gif"
                image_num = image_src.split('BoxNumber')[1].split('.')[0]  # "4"
                row_list.append(image_num)
                continue
            row_list.append(col.text.strip())
        SaveAsCsv(row_list, 'thedogs.csv')
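
The code above reopens thedogs.csv in append mode for every row, so running the script twice appends the same rows again. One possible variation, assuming all rows fit comfortably in memory, is to collect the rows first and write them in a single call. The helper `write_rows` and the sample rows below are made up for illustration; an in-memory buffer stands in for the real file so the sketch runs without touching disk.

```python
import csv
import io


def write_rows(rows, outfile):
    """Write every row to an already-open text file object in one call."""
    writer = csv.writer(outfile)
    writer.writerows(rows)


# Made-up rows standing in for the header and one scraped result row.
rows = [
    ["Plc", "Box", "Greyhound"],
    ["1", "4", "Example Dog"],
]

buffer = io.StringIO(newline="")  # newline="" leaves the csv module's line endings untouched
write_rows(rows, buffer)
print(buffer.getvalue())
```

With a real file, the same helper would be called inside `with open('thedogs.csv', 'w', newline='', encoding='utf-8') as outfile:`, where mode 'w' replaces the previous contents instead of appending to them.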

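【Discussion】:

Great, this works. Nice one @user:5794970. My only issue is that the web page has 11 tabs containing the results of the different races at that track. What is the best way to loop through all of those races?
@MBill Instead of .find('table',{'id':"gvRaceResults"}), use find_all and iterate over each table. Please accept the answer.
This workaround produces the following error. -> 1884 "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key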
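
That message is what Beautiful Soup raises when a method such as find_all is called on the list-like ResultSet returned by find_all, rather than on one of the tags inside it. This typically happens when find is swapped for find_all without adding a loop. The sketch below reproduces the situation on a small made-up HTML snippet (not the real page) and shows the loop that avoids the error.

```python
from bs4 import BeautifulSoup

# Made-up markup: two result tables standing in for the tabs on the real page.
html = """
<table id="gvRaceResults"><tr><td>Race 1</td></tr></table>
<table id="gvRaceResults"><tr><td>Race 2</td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table", {"id": "gvRaceResults"})  # a ResultSet, not a single Tag

# Calling a Tag method on the whole ResultSet fails.
raised = False
try:
    tables.find_all("tr")
except AttributeError:
    raised = True
print(raised)  # True

# Iterating gives one Tag at a time, and each Tag does support find_all.
first_cells = []
for table in tables:
    rows = table.find_all("tr")
    first_cells.append(rows[0].find("td").text.strip())
print(first_cells)  # ['Race 1', 'Race 2']
```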