【Posted at】: 2025-08-15 21:20:02
【Problem description】:
I have a script that scrapes a quirky website.
Sometimes while scraping I get ValueError("No tables found"), yet when I refresh the page manually in the browser, it loads fine.
How can I reproduce that refresh-and-retry behaviour in code?
My code is as follows:
import pandas as pd
from selenium import webdriver
from bs4 import BeautifulSoup as bs

browser = webdriver.Chrome()


class GameData:
    def __init__(self):
        self.date = []
        self.time = []
        self.game = []
        self.score = []
        self.home_odds = []
        self.draw_odds = []
        self.away_odds = []
        self.country = []
        self.league = []


def parse_data(url):
    browser.get(url)
    df = pd.read_html(browser.page_source, header=0)[0]
    html = browser.page_source
    soup = bs(html, "lxml")
    cont = soup.find('div', {'id': 'wrap'})
    content = cont.find('div', {'id': 'col-content'})
    # Both attributes belong in a single attrs dict; passing two separate
    # dicts to find() silently ignores the second one.
    content = content.find('table', {'class': 'table-main', 'id': 'tournamentTable'})
    main = content.find('th', {'class': 'first2 tl'})
    if main is None:
        return None
    count = main.findAll('a')
    country = count[1].text
    league = count[2].text
    game_data = GameData()
    game_date = None
    for row in df.itertuples():
        if not isinstance(row[1], str):
            continue
        elif ':' not in row[1]:
            # Rows without a time (no ':') carry the date header instead.
            game_date = row[1].split('-')[0]
            continue
        game_data.date.append(game_date)
        game_data.time.append(row[1])
        game_data.game.append(row[2])
        game_data.score.append(row[3])
        game_data.home_odds.append(row[4])
        game_data.draw_odds.append(row[5])
        game_data.away_odds.append(row[6])
        game_data.country.append(country)
        game_data.league.append(league)
    return game_data


urls = {
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/3/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/4/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/5/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/6/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/7/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/8/",
    "https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/9/",
}

if __name__ == '__main__':
    results = None
    for url in urls:
        try:
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
        except ValueError:
            # Naive retry: call parse_data once more on "No tables found".
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
        except AttributeError:
            # Same retry when the soup lookups come back as None.
            game_data = parse_data(url)
            if game_data is None:
                continue
            result = pd.DataFrame(game_data.__dict__)
            if results is None:
                results = result
            else:
                results = results.append(result, ignore_index=True)
Sometimes I get this error:
Traceback (most recent call last):
  File "C:/Users/harsh/AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_29.py", line 10098, in <module>
    game_data = parse_data(url)
  File "C:/Users/harsh/AppData/Roaming/JetBrains/PyCharmCE2021.1/scratches/scratch_29.py", line 37, in parse_data
    df = pd.read_html(browser.page_source, header=0)[0]
  File "C:\Python\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
    return func(*args, **kwargs)
  File "C:\Python\lib\site-packages\pandas\io\html.py", line 1100, in read_html
    displayed_only=displayed_only,
  File "C:\Python\lib\site-packages\pandas\io\html.py", line 913, in _parse
    raise retained
  File "C:\Python\lib\site-packages\pandas\io\html.py", line 893, in _parse
    tables = p.parse_tables()
  File "C:\Python\lib\site-packages\pandas\io\html.py", line 213, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "C:\Python\lib\site-packages\pandas\io\html.py", line 543, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found
My best guess is that I have not wired the handling of ValueError: No tables found into the code correctly.
How should I handle this?
【Discussion】:
- The error describes itself: pandas cannot read a table from that HTML. Within your try/except you could save the page source to an .html file and then inspect it. I would also strongly recommend a selenium wait; since you are already using selenium, and the data is rendered dynamically once the page loads, you have to wait until the object you need actually shows up in the page source.
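A minimal sketch of that explicit-wait idea (the helper name get_rendered_source and the 10-second timeout are illustrative choices; the tournamentTable id comes from the question's own soup lookup):

import pandas as pd
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

browser = webdriver.Chrome()

def get_rendered_source(url, timeout=10):
    # Navigate, then block until the dynamically rendered table is
    # present in the DOM; raises TimeoutException after `timeout` seconds.
    browser.get(url)
    WebDriverWait(browser, timeout).until(
        EC.presence_of_element_located((By.ID, "tournamentTable"))
    )
    return browser.page_source

df = pd.read_html(get_rendered_source(urls_page_1), header=0)[0]  # urls_page_1: any results URL

This way pd.read_html only ever parses a source that is known to contain the table.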
- @αԋɱҽԃαмєяιcαη That's interesting! How can I build that into the code here?
- My guess is to wait 2 seconds when I hit this error and then retry the URL. How do I code that?
- I already pointed you to the selenium-wait approach. But as a quick-and-dirty alternative, you can try time.sleep(5) right after browser.get(url).
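That quick-and-dirty version would look roughly like this (fetch_first_table is a hypothetical helper; the 5-second pause is the commenter's figure, not a tuned value):

import time
import pandas as pd
from selenium import webdriver

browser = webdriver.Chrome()

def fetch_first_table(url):
    # Load the page, then pause unconditionally so the JavaScript-rendered
    # table has time to appear before pandas parses the source.
    browser.get(url)
    time.sleep(5)
    return pd.read_html(browser.page_source, header=0)[0]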
- Isn't that inefficient, though? I am parsing 75K URLs, and waiting 2 seconds or time.sleep(5) on every page would take forever, haha. Would it help to do this only when I actually hit the error?
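A sketch of that retry-only-on-failure idea, built around the question's parse_data (the wrapper name parse_data_with_retry, the retry count, and the 2-second delay are illustrative choices):

import time

def parse_data_with_retry(url, retries=3, delay=2):
    # Call the question's parse_data; if pandas raises "No tables found"
    # because the page has not finished rendering, wait briefly and retry.
    for attempt in range(retries):
        try:
            return parse_data(url)
        except ValueError:
            if attempt == retries - 1:
                raise  # give up after the final attempt
            time.sleep(delay)

In the __main__ loop, game_data = parse_data_with_retry(url) would then replace the duplicated except blocks, so the sleep only happens on pages that actually fail.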
Tags: python selenium web-scraping beautifulsoup retry-logic