Regular expression: how to set two conditions at the same time? (Answers)

【Question title】: Regular expression: How to set two conditions at the same time?
【Posted】: 2020-06-20 22:55:47
【Question description】:

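I'm a beginner learning web scraping. I have a question about how to apply two conditions at the same time with a regular expression. From some research I understood that the code should look like (condition 1 +) (condition2), but I don't know why it isn't working for me.

Here is the website I want to scrape: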

https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics

I tried to use re.search and a loop to get only the laureates' names, because I found that the pattern is: starts with /wiki/ followed by Firstname_Lastname.

For example:

/wiki/Paul_Krugman

My idea is to set two conditions with regular expressions.

all_urls_regex_findname = []
for url in soup.find_all('a', {'href': re.compile(r'/wiki+')}):
    # make sure it starts with /wiki
    all_urls_regex_findname.append(url.get('href'))

all_urls_regex_findname_1 = []
for url in soup.find_all('a', {'href': re.compile(r'\_+')}):
    # make sure there's an underscore
    all_urls_regex_findname_1.append(url.get('href'))

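(r'/wiki+') tries to require that the URL starts with wiki; (r'\_+') tries to require the name (the pattern is Firstname_Lastname).

Each of the two works fine on its own. But what I want is "and" logic, so I tried to apply both conditions at once: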

all_urls_regex_findname_2 = []
for url in soup.find_all('a', {'href': re.compile(r'/wiki+')(r'\_+')}):
    all_urls_regex_findname_2.append(url.get('href'))
# but it didn't work, the result is an empty set.

Can anyone give me a hint about what is going on in my code? Thanks in advance!!

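【Comments】:

You could use find_all to get a list for each of the two conditions separately, and then use a list comprehension, common_elements = [x for x in list1 if x in list2], to get the list of common elements.
Didn't you get an error message? Always put the full error message (starting at the word "Traceback") in the question (not in a comment) as text (not a screenshot). It contains other useful information.
You have to put everything in a single string inside compile(), i.e. compile(r"/wiki/.*_+.*"). BTW: the ( ) are not part of the regex; they are normal function-call parentheses. So your regexes are r'/wiki+' and r'\_+', not (r'/wiki+') and (r'\_+').
Got it @shanylong, thanks.
Ah, I see, I'll try again! Thank you @furas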
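
The suggestions in the comments above can be sketched roughly as follows. This is a minimal illustration that runs on a hand-written list of hrefs rather than the live page; the sample hrefs and the variable names are assumptions made for the example. It shows three ways to express "and": one combined pattern, a lookahead, and intersecting two separately filtered lists.

```python
import re

# Hypothetical sample hrefs standing in for the values returned by url.get('href').
hrefs = [
    "/wiki/Paul_Krugman",
    "/wiki/Economics",
    "/w/index.php?title=Special_Page",
    "#cite_note-1",
]

# Option 1: a single pattern that requires "/wiki/" at the start and an underscore later.
combined = re.compile(r"^/wiki/.*_")

# Option 2: a lookahead checks for the underscore first, then "/wiki/" must start the string.
lookahead = re.compile(r"^(?=.*_)/wiki/")

# Option 3: filter with each condition separately, then keep the common elements.
starts_with_wiki = [h for h in hrefs if re.search(r"^/wiki/", h)]
has_underscore = [h for h in hrefs if re.search(r"_", h)]
common = [h for h in starts_with_wiki if h in has_underscore]

matches_combined = [h for h in hrefs if combined.search(h)]
matches_lookahead = [h for h in hrefs if lookahead.search(h)]

print(matches_combined)
print(matches_lookahead)
print(common)
```

Either compiled pattern could be passed to soup.find_all('a', {'href': ...}) in place of the two separate patterns. Note that re.compile(...)(...) calls the compiled pattern object as if it were a function, which is not a way to combine two patterns.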

Tags: python regex web-scraping beautifulsoup

【Solution 1】:

To get the links I could use

pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')

but this would still match links like

/wiki/United_States

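so first I would use other functions to get only the <table> with the wanted links (or even just the relevant column in that table).

EDIT: it has a problem with /wiki/Bengt_R._Holmstr%C3%B6m (Bengt Holmström), which has two _ in the link, and the native character ö in his name is converted to %C3%B6 in the link.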

import requests
from bs4 import BeautifulSoup as BS
import re

r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')

pattern = re.compile(r'^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$')

all_tables = soup.find_all('table')

all_items = all_tables[1].find_all('a', {'href': pattern})
for item in all_items:
    print(item['href'], '|', item['title'])

Result:

/wiki/Ragnar_Frisch | Ragnar Frisch
/wiki/Jan_Tinbergen | Jan Tinbergen
/wiki/Paul_Samuelson | Paul Samuelson
/wiki/Simon_Kuznets | Simon Kuznets
/wiki/John_Hicks | John Hicks
/wiki/Kenneth_Arrow | Kenneth Arrow
/wiki/Wassily_Leontief | Wassily Leontief
/wiki/Gunnar_Myrdal | Gunnar Myrdal
/wiki/Friedrich_Hayek | Friedrich Hayek
/wiki/Leonid_Kantorovich | Leonid Kantorovich
/wiki/Tjalling_Koopmans | Tjalling Koopmans
/wiki/Milton_Friedman | Milton Friedman
/wiki/Bertil_Ohlin | Bertil Ohlin
/wiki/James_Meade | James Meade
/wiki/Theodore_Schultz | Theodore Schultz
/wiki/Lawrence_Klein | Lawrence Klein
/wiki/James_Tobin | James Tobin
/wiki/George_Stigler | George Stigler
/wiki/Richard_Stone | Richard Stone
/wiki/Franco_Modigliani | Franco Modigliani
/wiki/Robert_Solow | Robert Solow
/wiki/Maurice_Allais | Maurice Allais
/wiki/Trygve_Haavelmo | Trygve Haavelmo
/wiki/Harry_Markowitz | Harry Markowitz
/wiki/Merton_Miller | Merton Miller
/wiki/Ronald_Coase | Ronald Coase
/wiki/Gary_Becker | Gary Becker
/wiki/Robert_Fogel | Robert Fogel
/wiki/Douglass_North | Douglass North
/wiki/John_Harsanyi | John Harsanyi
/wiki/Reinhard_Selten | Reinhard Selten
/wiki/James_Mirrlees | James Mirrlees
/wiki/William_Vickrey | William Vickrey
/wiki/Myron_Scholes | Myron Scholes
/wiki/Amartya_Sen | Amartya Sen
/wiki/Robert_Mundell | Robert Mundell
/wiki/James_Heckman | James Heckman
/wiki/George_Akerlof | George Akerlof
/wiki/Michael_Spence | Michael Spence
/wiki/Joseph_Stiglitz | Joseph Stiglitz
/wiki/Daniel_Kahneman | Daniel Kahneman
/wiki/Clive_Granger | Clive Granger
/wiki/Robert_Aumann | Robert Aumann
/wiki/Thomas_Schelling | Thomas Schelling
/wiki/Edmund_Phelps | Edmund Phelps
/wiki/Leonid_Hurwicz | Leonid Hurwicz
/wiki/Eric_Maskin | Eric Maskin
/wiki/Roger_Myerson | Roger Myerson
/wiki/Paul_Krugman | Paul Krugman
/wiki/Elinor_Ostrom | Elinor Ostrom
/wiki/Peter_Diamond | Peter Diamond
/wiki/Lloyd_Shapley | Lloyd Shapley
/wiki/Eugene_Fama | Eugene Fama
/wiki/Jean_Tirole | Jean Tirole
/wiki/Angus_Deaton | Angus Deaton
/wiki/Richard_Thaler | Richard Thaler
/wiki/William_Nordhaus | William Nordhaus
/wiki/Paul_Romer | Paul Romer
/wiki/Abhijit_Banerjee | Abhijit Banerjee
/wiki/Esther_Duflo | Esther Duflo
/wiki/Michael_Kremer | Michael Kremer
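
Regarding the /wiki/Bengt_R._Holmstr%C3%B6m case mentioned earlier, one possible workaround is to decode the percent-encoded href first and use a looser pattern. The sketch below is only an illustration on hand-written hrefs: the looser pattern is an assumption, not part of the original answer, and it would still accept links such as /wiki/United_States, so restricting the search to the right table remains necessary.

```python
import re
from urllib.parse import unquote

# Hypothetical sample hrefs; the File: entry is made up for illustration.
hrefs = [
    "/wiki/Paul_Krugman",
    "/wiki/Bengt_R._Holmstr%C3%B6m",
    "/wiki/File:Nobel_Prize.png",
]

# Strict pattern from the answer: exactly two capitalised ASCII words.
strict = re.compile(r"^/wiki/[A-Z][a-z]*_[A-Z][a-z]*$")

# Assumed looser pattern: two or more underscore-separated parts, no colon
# (so namespace links such as File: are skipped), any other characters allowed.
loose = re.compile(r"^/wiki/[^:_]+(?:_[^:_]+)+$")

# unquote() turns %C3%B6 back into the character it encodes (UTF-8 by default).
decoded = [unquote(h) for h in hrefs]

strict_hits = [h for h in decoded if strict.search(h)]
loose_hits = [h for h in decoded if loose.search(h)]

print(strict_hits)
print(loose_hits)
```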

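EDIT:

To get rid of United_States, I decided to process every row separately and take only the link from the third column. But there is a problem: the HTML uses colspan to join columns across two or three rows, so the link is in a different column in different rows.

So I decided to find the first link in each row that matches r'^/wiki/[^:]*$' (which skips the image links /wiki/File:...). Because I use find() instead of find_all(), I get only the link to the laureate and not the link to United States in the next column.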

import requests
from bs4 import BeautifulSoup as BS
import re

r = requests.get('https://en.wikipedia.org/wiki/List_of_Nobel_Memorial_Prize_laureates_in_Economics')
soup = BS(r.text, 'html.parser')

all_tables = soup.find_all('table')

pattern = re.compile(r'^/wiki/[^:]*$')

for row in all_tables[0].find_all('tr'):
    item = row.find('a', {'href': pattern})
    if item:
        print(item['href'], '|', item['title'])
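
The "first matching link per row" idea can be checked without network access. The following is a rough stand-in, assuming each row is represented simply as a list of its hrefs in document order; the rows and file names are hypothetical. Taking the first match with next() plays the role that row.find() plays in the code above.

```python
import re

# Same pattern as above: a /wiki/ link with no colon, which skips /wiki/File:... links.
pattern = re.compile(r"^/wiki/[^:]*$")

# Hypothetical rows: the hrefs found in each table row, in document order.
rows = [
    ["/wiki/File:Paul_Krugman.jpg", "/wiki/Paul_Krugman", "/wiki/United_States"],
    ["/wiki/File:Elinor_Ostrom.jpg", "/wiki/Elinor_Ostrom", "/wiki/United_States"],
    ["#cite_note-1"],  # a row with no matching link at all
]

first_links = []
for hrefs in rows:
    # Keep only the first href in the row that matches, like find() rather than find_all().
    first = next((h for h in hrefs if pattern.search(h)), None)
    if first is not None:
        first_links.append(first)

print(first_links)
```

Because only the first match in each row is kept, the country link that follows the laureate link is never collected.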

【Comments】: