To get the links, I could use
pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')
but that still matches links like
/wiki/United_States
so first I would use other functions to get only the <table> (or the column inside the table) that contains the links I need.
EDIT: it has a problem with /wiki/Bengt_R._Holmstr%C3%B6m (Bengt Holmström), whose link contains two _ and whose native character ö is converted to %C3%B6 in the link.
import requests
from bs4 import BeautifulSoup as BS
import re
r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')
pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')
all_tables = soup.find_all('table')
all_items = all_tables[1].find_all('a', {'href': pattern})
for item in all_items:
    print(item['href'], '|', item['title'])
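A side note on the Holmström case: the strict pattern can be loosened to also accept dots, more than one underscore, and percent-encoded bytes such as %C3%B6, and urllib.parse.unquote can turn the encoded link back into readable text. This is only a sketch of the idea (it still matches /wiki/United_States, which the row-by-row approach below avoids):

```python
import re
from urllib.parse import unquote

# Looser pattern: name parts start with a capital letter and may contain
# lowercase letters, dots, or percent-encoded bytes (e.g. %C3%B6),
# joined by one or more underscores.
part = r'[A-Z](?:[a-z]|%[0-9A-Fa-f]{2}|\.)*'
pattern = re.compile(r'^/wiki/' + part + r'(?:_' + part + r')+$')

href = '/wiki/Bengt_R._Holmstr%C3%B6m'
print(bool(pattern.match(href)))  # True - the encoded link now matches
print(unquote(href))              # /wiki/Bengt_R._Holmström
```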
Result:
/wiki/Ragnar_Frisch | Ragnar Frisch
/wiki/Jan_Tinbergen | Jan Tinbergen
/wiki/Paul_Samuelson | Paul Samuelson
/wiki/Simon_Kuznets | Simon Kuznets
/wiki/John_Hicks | John Hicks
/wiki/Kenneth_Arrow | Kenneth Arrow
/wiki/Wassily_Leontief | Wassily Leontief
/wiki/Gunnar_Myrdal | Gunnar Myrdal
/wiki/Friedrich_Hayek | Friedrich Hayek
/wiki/Leonid_Kantorovich | Leonid Kantorovich
/wiki/Tjalling_Koopmans | Tjalling Koopmans
/wiki/Milton_Friedman | Milton Friedman
/wiki/Bertil_Ohlin | Bertil Ohlin
/wiki/James_Meade | James Meade
/wiki/Theodore_Schultz | Theodore Schultz
/wiki/Lawrence_Klein | Lawrence Klein
/wiki/James_Tobin | James Tobin
/wiki/George_Stigler | George Stigler
/wiki/Richard_Stone | Richard Stone
/wiki/Franco_Modigliani | Franco Modigliani
/wiki/Robert_Solow | Robert Solow
/wiki/Maurice_Allais | Maurice Allais
/wiki/Trygve_Haavelmo | Trygve Haavelmo
/wiki/Harry_Markowitz | Harry Markowitz
/wiki/Merton_Miller | Merton Miller
/wiki/Ronald_Coase | Ronald Coase
/wiki/Gary_Becker | Gary Becker
/wiki/Robert_Fogel | Robert Fogel
/wiki/Douglass_North | Douglass North
/wiki/John_Harsanyi | John Harsanyi
/wiki/Reinhard_Selten | Reinhard Selten
/wiki/James_Mirrlees | James Mirrlees
/wiki/William_Vickrey | William Vickrey
/wiki/Myron_Scholes | Myron Scholes
/wiki/Amartya_Sen | Amartya Sen
/wiki/Robert_Mundell | Robert Mundell
/wiki/James_Heckman | James Heckman
/wiki/George_Akerlof | George Akerlof
/wiki/Michael_Spence | Michael Spence
/wiki/Joseph_Stiglitz | Joseph Stiglitz
/wiki/Daniel_Kahneman | Daniel Kahneman
/wiki/Clive_Granger | Clive Granger
/wiki/Robert_Aumann | Robert Aumann
/wiki/Thomas_Schelling | Thomas Schelling
/wiki/Edmund_Phelps | Edmund Phelps
/wiki/Leonid_Hurwicz | Leonid Hurwicz
/wiki/Eric_Maskin | Eric Maskin
/wiki/Roger_Myerson | Roger Myerson
/wiki/Paul_Krugman | Paul Krugman
/wiki/Elinor_Ostrom | Elinor Ostrom
/wiki/Peter_Diamond | Peter Diamond
/wiki/Lloyd_Shapley | Lloyd Shapley
/wiki/Eugene_Fama | Eugene Fama
/wiki/Jean_Tirole | Jean Tirole
/wiki/Angus_Deaton | Angus Deaton
/wiki/Richard_Thaler | Richard Thaler
/wiki/William_Nordhaus | William Nordhaus
/wiki/Paul_Romer | Paul Romer
/wiki/Abhijit_Banerjee | Abhijit Banerjee
/wiki/Esther_Duflo | Esther Duflo
/wiki/Michael_Kremer | Michael Kremer
EDIT:
To get rid of United_States, I decided to process each row separately and take only the link from the third column. There is a problem, though: the HTML uses colspan to merge columns across two/three rows, so in each row this link sits in a different column.
Instead, I decided to find the first link in each row that matches r'^/wiki/[^:]*$' (skipping image links such as /wiki/File:...). Because I use find() instead of find_all(), I only get the link to the laureate, not the link to United_States in a later column.
import requests
from bs4 import BeautifulSoup as BS
import re
r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')
all_tables = soup.find_all('table')
pattern = re.compile(r'^/wiki/[^:]*$')
for row in all_tables[0].find_all('tr'):
    item = row.find('a', {'href': pattern})
    if item:
        print(item['href'], '|', item['title'])
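Once the hrefs are collected, the standard library can turn them into absolute URLs (useful for follow-up requests) and readable names. A small sketch, using the Holmström link as sample input:

```python
from urllib.parse import urljoin, unquote

base = 'https://en.wikipedia.org/'
href = '/wiki/Bengt_R._Holmstr%C3%B6m'

# Absolute URL suitable for a follow-up requests.get()
full_url = urljoin(base, href)

# Decode percent-encoding and convert the path segment to a plain name
name = unquote(href).rsplit('/', 1)[-1].replace('_', ' ')

print(full_url)  # https://en.wikipedia.org/wiki/Bengt_R._Holmstr%C3%B6m
print(name)      # Bengt R. Holmström
```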