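You can pass a compiled regular expression as the href= argument to .find_all(). Only tags whose href matches the pattern are returned.
For example: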
import re
from pprint import pprint

import requests
from bs4 import BeautifulSoup

url = 'https://www.basketball-reference.com/players/'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

# Match player links and capture the player ID (the file name without .html)
r = re.compile(r'/players/.+/(.*?)\.html')

out = []
# find_all() returns only the <a> tags whose href matches the pattern
for a in soup.find('ul', class_="page_index").find_all('a', href=r):
    out.append('{}/{}'.format(a.get_text(strip=True), r.search(a['href']).group(1)))

pprint(out)
Prints:
['Kareem Abdul-Jabbar/abdulka01',
'Ray Allen/allenra02',
'LaMarcus Aldridge/aldrila01',
'Paul Arizin/arizipa01',
'Carmelo Anthony/anthoca01',
'Tiny Archibald/architi01',
'Charles Barkley/barklch01',
'Kobe Bryant/bryanko01',
'Larry Bird/birdla01',
'Walt Bellamy/bellawa01',
'Rick Barry/barryri01',
'Chauncey Billups/billuch01',
'Wilt Chamberlain/chambwi01',
'Vince Carter/cartevi01',
'Maurice Cheeks/cheekma01',
'Stephen Curry/curryst01',
...and so on.
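
To see what the pattern captures without hitting the site, here is a minimal sketch that runs the same regex over a few href strings. The href values and the player_id helper are made up for illustration (they are not taken from the page), but they follow the shape of the links in the output above:

```python
import re

# Same pattern as in the answer: group(1) is the file name without ".html"
pattern = re.compile(r'/players/.+/(.*?)\.html')

# Hypothetical href values, assumed to resemble the links on the page
hrefs = [
    '/players/a/abdulka01.html',  # player page: matches
    '/players/a/',                # letter index: no ".html", so no match
    '/players/c/curryst01.html',  # player page: matches
    '/leagues/',                  # unrelated link: no match
]

def player_id(href):
    """Return the captured player ID, or None if the href does not match."""
    m = pattern.search(href)
    return m.group(1) if m else None

ids = [player_id(h) for h in hrefs]
print(ids)  # ['abdulka01', None, 'curryst01', None]
```

Hrefs that do not match are the ones find_all('a', href=r) skips, which is why calling r.search(a['href']).group(1) inside the loop is safe there.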