There are several ways to do this, but I prefer the pandas approach:
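from io import StringIO
from urllib import request
import pandas as pd  # third-party: you need to install pandas (and lxml)
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
# Download the page and decode the bytes to text (assumes the page is UTF-8)
web_html = request.urlopen(url=base_url).read().decode('utf-8')
# read_html returns a list of DataFrames, one per matching <table>
web_tables = pd.read_html(StringIO(web_html), attrs={'class': 'members'})
web_df = web_tables[0].set_index(keys=['Name'])
# print(web_df)
user_name_to_find_in_table = 'SteveMoody'
# Look up the row whose index (the Name column) equals the user name
user_name_df = web_df.loc[user_name_to_find_in_table]
print(user_name_df)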
Well, there are many ways to do this: using only BeautifulSoup's find method or CSS selectors, or using re as Peter suggested.
Using BeautifulSoup's find method together with re, you can do it like this:
import re
from bs4 import BeautifulSoup as bs  # you need to install beautifulsoup4
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml')  # the lxml parser must be installed as well
user_name_to_find_in_table = 'SteveMoody'
# Find the first <td> whose text contains the name (case-insensitive),
# then climb to the enclosing <tr>; find() returns None if nothing matches
row_tag = page_soup.find(
    lambda t: t.name == "td"
    and re.search(re.escape(user_name_to_find_in_table), t.text, flags=re.I)
).find_parent(name="tr")
# Join the cell texts with spaces, trimming surrounding whitespace
print(row_tag.get_text(separator=' ', strip=True))
Using BeautifulSoup with a CSS selector (no re, just BeautifulSoup):
from bs4 import BeautifulSoup as bs  # you need to install beautifulsoup4
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
web_request = request.urlopen(url=base_url).read()
page_soup = bs(web_request, 'lxml')  # the lxml parser must be installed as well
user_name_to_find_in_table = 'SteveMoody'
# Select the <tr> that has a direct <td> child containing the name.
# ':-soup-contains' replaces the deprecated ':contains' (needs soupsieve 2.1+)
row_tag = page_soup.select_one(f'tr:has(> td:-soup-contains("{user_name_to_find_in_table}"))')
# Join the cell texts with spaces, trimming surrounding whitespace
print(row_tag.get_text(separator=' ', strip=True))
In your case I prefer the pandas example, because you keep the headers, can easily get the other statistics, and it runs quite fast.
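Using re:
The best input was Peter's comment (Link), so I just adapted it to Python code (happy to have it edited), since this solution doesn't require installing any extra libraries.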
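import re
from urllib import request
base_url = 'https://apps.foldingathome.org/teamstats/team3446.html'
# Decode the response so the pattern runs on text (assumes the page is UTF-8)
web_html = request.urlopen(url=base_url).read().decode('utf-8')
user_name_to_find_in_table = 'SteveMoody'
# Match a single <tr>...</tr> that contains the name, without crossing into the next row
re_pattern = rf'<tr>(?:(?!</tr>).)*?\b{re.escape(user_name_to_find_in_table)}\b.*?</tr>'
# DOTALL lets '.' match newlines, since a row may span several lines
res = re.search(pattern=re_pattern, string=web_html, flags=re.DOTALL)
print(res.group(0) if res else 'user not found')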
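To try the regex approach without hitting the network, here is a minimal sketch that runs the same kind of pattern against a small inline table and then splits the matched row into cells. The sample markup, its column names, and the helper name find_row_cells are all hypothetical; the real page's layout may differ (for example, its `<tr>` tags may carry attributes, which this pattern would not match).

```python
import re

# Hypothetical sample markup; the real page's columns may differ.
sample_html = """
<table class="members">
<tr><th>Rank</th><th>Name</th><th>Credit</th></tr>
<tr><td>1</td><td>OtherUser</td><td>2000</td></tr>
<tr><td>2</td><td>SteveMoody</td><td>1500</td></tr>
</table>
"""


def find_row_cells(html, user_name):
    """Return the cell texts of the first <tr> mentioning user_name, or None."""
    row_pattern = re.compile(
        # (?:(?!</tr>).)*? walks forward lazily but never past a closing </tr>
        rf"<tr>(?:(?!</tr>).)*?\b{re.escape(user_name)}\b.*?</tr>",
        flags=re.DOTALL,
    )
    match = row_pattern.search(html)
    if match is None:
        return None
    # Pull the text out of each <td>...</td> pair inside the matched row.
    return re.findall(r"<td[^>]*>(.*?)</td>", match.group(0), flags=re.DOTALL)


cells = find_row_cells(sample_html, "SteveMoody")
print(cells)
```

Note that the `\b` word boundaries only behave as intended when the name starts and ends with a letter, digit, or underscore. Regexes are also fragile against markup changes, which is one more reason to prefer the pandas or BeautifulSoup versions when installing libraries is an option.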
A helpful link on using a variable in a regular expression: Stack Overflow
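One thing to watch when putting a variable into a pattern: any regex metacharacters in the value are interpreted as regex syntax unless they are escaped. A small sketch using the standard library's re.escape follows; the helper name name_pattern and the example name "A.Smith" are made up for illustration.

```python
import re


def name_pattern(user_name):
    """Build a case-insensitive pattern that matches user_name literally."""
    return re.compile(re.escape(user_name), flags=re.IGNORECASE)


# Hypothetical name chosen because it contains a regex metacharacter (".").
unescaped = re.compile("A.Smith")
escaped = name_pattern("A.Smith")

# Without escaping, "." matches any character, so "AxSmith" is a false hit.
print(bool(unescaped.search("AxSmith")))
# With escaping, only a literal dot is accepted.
print(bool(escaped.search("AxSmith")))
print(bool(escaped.search("a.smith")))
```

For a plain alphanumeric name such as SteveMoody the escaping changes nothing, but it keeps the pattern safe if the name ever comes from user input.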