Scraping a specific table from a website with BeautifulSoup

【Question title】: Scraping a specific table from a website with BeautifulSoup
【Posted】: 2021-12-02 11:46:32
【Question】:

I want to get a specific table named "Form table (last 8)" from https://www.soccerstats.com/pmatch.asp?league=italy&stats=145-7-5-2022, but I get AttributeError: 'NoneType' object has no attribute 'text'.

Code:

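  import requests
  from bs4 import BeautifulSoup

  link = 'https://www.soccerstats.com/pmatch.asp?league=italy&stats=145-7-5-2022'
  headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.212 Safari/537.36'}
  s = requests.Session()
  s.headers.update(headers)

  # Request through the session so the headers set above are sent
  response = s.get(link)
  soup = BeautifulSoup(response.text, 'html.parser')

  standings_forms = soup.find_all('table', border='0', cellspacing='0', cellpadding='0', width='100%')
  for t in standings_forms:
    # The AttributeError is raised here: t.find('b') is None for a table without a <b> tag
    if t.find('b').text == 'Form table (last 8)':
      print(t)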

【Question discussion】:
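
The AttributeError in the question comes from `t.find('b')`: `find` returns `None` when a tag has no matching descendant, so `.text` fails on the first matched table that has no `<b>`. Below is a minimal sketch of a guard. It runs on invented markup rather than the real page, and both the HTML and the helper name `find_table_by_title` are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page: one table without
# a <b> tag, and one whose <b> holds the title we are looking for.
HTML = """
<table width="100%"><tr><td>no bold title here</td></tr></table>
<table width="100%"><tr><td><b>Form table (last 8)</b></td></tr></table>
"""


def find_table_by_title(soup, title):
    """Return the first <table> whose first <b> text equals title, else None."""
    for table in soup.find_all("table", width="100%"):
        bold = table.find("b")  # None when the table has no <b> descendant
        if bold is not None and bold.get_text(strip=True) == title:
            return table
    return None


soup = BeautifulSoup(HTML, "html.parser")
match = find_table_by_title(soup, "Form table (last 8)")
print(match is not None)
```

The guard only avoids the crash; it still returns the outer table, so the rows would have to be pulled out of it in a second step.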

Tags: python web-scraping beautifulsoup

【Solution 1】:

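Try the following script to get the required information from that specific table. Before running it, upgrade bs4 with pip install bs4 --upgrade, because the script uses pseudo-CSS selectors, which bs4 supports only from version 4.7.0 onward.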

import requests
from bs4 import BeautifulSoup

link = 'https://www.soccerstats.com/pmatch.asp?league=italy&stats=145-7-5-2022'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
    res = s.get(link)
    soup = BeautifulSoup(res.text,"html.parser")
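    # Find the table whose direct row holds the bold 'Form table' title, then take
    # the rows of the table nested inside it; [1:] skips the header row.
    # Newer Soup Sieve releases prefer the spelling :-soup-contains over :contains.
    for item in soup.select("table:has(> tr > td > b:contains('Form table')) table > tr")[1:]:
        name = item.select("td")[0].get_text(strip=True)
        gp = item.select("td")[1].get_text(strip=True)
        pts = item.select("td")[2].get_text(strip=True)
        print((name,gp,pts))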

The script above produces the following output:

('Atalanta', '8', '20')
('Inter Milan', '8', '17')
('AC Milan', '8', '16')
('Napoli', '8', '15')
('Juventus', '8', '13')
('Bologna', '8', '13')
('Fiorentina', '8', '12')
('Sassuolo', '8', '12')
('Hellas Verona', '8', '12')
('AS Roma', '8', '10')
('Empoli', '8', '10')
('Lazio', '8', '10')
('Venezia', '8', '10')
('Torino', '8', '9')
('Sampdoria', '8', '9')
('Udinese', '8', '8')
('Spezia', '8', '7')
('Cagliari', '8', '6')
('Genoa', '8', '5')
('Salernitana', '8', '4')
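
To see how the selector in the script picks out one table, here is a small sketch on invented markup (the HTML and the team names are assumptions, not the real page). `:has(> tr > td > b...)` keeps only the table whose direct row contains the bold title, and the trailing `table > tr` then descends into the rows of the table nested inside it. The sketch uses `:-soup-contains`, the spelling that newer Soup Sieve releases prefer over `:contains`, so it assumes Soup Sieve 2.1 or later.

```python
from bs4 import BeautifulSoup

# Hypothetical page fragment: two titled outer tables, each wrapping a
# nested data table, loosely mimicking the layout the selector expects.
HTML = """
<table id="offense"><tr><td><b>Offense</b></td></tr>
  <tr><td><table>
    <tr><td>Team</td><td>GF</td></tr>
    <tr><td>Team A</td><td>20</td></tr>
  </table></td></tr></table>
<table id="form"><tr><td><b>Form table (last 8)</b></td></tr>
  <tr><td><table>
    <tr><td>Team</td><td>GP</td><td>Pts</td></tr>
    <tr><td>Team A</td><td>8</td><td>20</td></tr>
    <tr><td>Team B</td><td>8</td><td>17</td></tr>
  </table></td></tr></table>
"""

# html.parser keeps the markup as written (no <tbody> is inserted),
# so the direct-child steps "table > tr" match.
soup = BeautifulSoup(HTML, "html.parser")

selector = "table:has(> tr > td > b:-soup-contains('Form table')) table > tr"
rows = []
for item in soup.select(selector)[1:]:  # [1:] skips the header row
    cells = [td.get_text(strip=True) for td in item.select("td")]
    rows.append(tuple(cells))

print(rows)
```

Only the rows under the "Form table" title are returned; the Offense table is ignored because its `<b>` text does not match.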

【Discussion】:

I upgraded bs4 and it works now!
How can I do the same for the Offense table? I tried the code below but got IndexError: list index out of range.
  for item in soup.select("table:has(> tr > td > b:contains('Offense')) > tr")[2:]:
      team = item.select("td")[1].get_text(strip=True)
      gp = item.select("td")[2].get_text(strip=True)
      pt = item.select("td")[3].get_text(strip=True)
In my case, I tried exactly what you pasted in the comments and found that it works perfectly.
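
The thread does not settle why the Offense attempt raised IndexError for one user and not the other. One plausible cause (an assumption, not confirmed here) is that some of the selected rows, such as title or spacer rows, have fewer `<td>` cells than the indexes used. The sketch below shows a defensive loop on invented markup that skips short rows instead of indexing into them; the HTML and the helper `cells_at` are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical table: a title row, a header row, two data rows and a
# spacer row that has only one cell.
HTML = """
<table>
  <tr><td colspan="4"><b>Offense</b></td></tr>
  <tr><td></td><td>Team</td><td>GP</td><td>GF</td></tr>
  <tr><td>1</td><td>Team A</td><td>8</td><td>20</td></tr>
  <tr><td colspan="4"></td></tr>
  <tr><td>2</td><td>Team B</td><td>8</td><td>17</td></tr>
</table>
"""


def cells_at(row, indexes):
    """Return the text of the cells at indexes, or None if the row is too short."""
    cells = row.find_all("td", recursive=False)
    if len(cells) <= max(indexes):
        return None  # title/spacer row: indexing it would raise IndexError
    return tuple(cells[i].get_text(strip=True) for i in indexes)


soup = BeautifulSoup(HTML, "html.parser")
rows = soup.select("table:has(> tr > td > b:-soup-contains('Offense')) > tr")[2:]

results = []
for row in rows:
    values = cells_at(row, (1, 2, 3))
    if values is not None:
        results.append(values)

print(results)
```

Printing `len(item.select("td"))` for each row on the real page would show whether short rows are in fact the cause.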