【Posted】: 2018-08-23 14:57:54
【Question】:
I want to extract all (in this case two) hashtags from a web page.
<html>
<head>
</head>
<body>
<div class="predefinition">
<p class="part1">
<span class="part1-head">Entries:</span>
<a class="pr" href="/go_somewhere/">#hashA with space</a>,
<a class="pr" href="/go_somewhere/">#hashBwithoutsace</a>,
</p>
<span class="part2">Boundaries:</span>
<p>some boundary statement</p>
</div>
<div class="wrapper"> <!-- I only want to search here -->
<p class="part1">
<span class="part1-head">Entries:</span>
<a class="pr" href="/go_somewhere/">#hash1 with space</a>, <!-- I only want to find this -->
<a class="pr" href="/go_somewhere/">#hash2withoutsace</a>, <!-- and this -->
</p>
<span class="part2">Boundaries:</span>
<p>some other boundary statement</p>
</div>
</body>
</html>
But I am only interested in the hashtags in one branch (in this example, the "wrapper" div): "#hash1 with space" and "#hash2withoutsace". Right now my code looks like this:
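from bs4 import BeautifulSoup
import io
import re

# Read the sample page shown above
with io.open("minimal.html", mode="r", encoding="utf-8") as f:
    contents = f.read()

soup = BeautifulSoup(contents, 'lxml')
# Matches every <a class="pr"> on the page, in both divs
mydivs = soup.find_all("a", {"class": "pr"})
for div in mydivs:
    # \w+ stops at the first space, so "#hash1 with space" yields only "#hash1"
    print(re.findall(r'(?i)\#\w+', str(div)))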
- How can I restrict the search to the "wrapper" div?
- And how can I include the spaces in the hashtags?
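One possible way to approach both points, sketched under the assumption that each hashtag is exactly the text of one `a.pr` link: scope the search with a CSS selector and read the link text directly instead of running a regex over the tag's string form. The function name `wrapper_hashtags` and the inlined `HTML` constant are illustrative only, and `html.parser` is used in place of `lxml` just so the sketch needs no extra parser installed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the sample page from the question (illustrative constant)
HTML = """
<div class="predefinition">
  <p class="part1">
    <a class="pr" href="/go_somewhere/">#hashA with space</a>,
    <a class="pr" href="/go_somewhere/">#hashBwithoutsace</a>,
  </p>
</div>
<div class="wrapper">
  <p class="part1">
    <a class="pr" href="/go_somewhere/">#hash1 with space</a>,
    <a class="pr" href="/go_somewhere/">#hash2withoutsace</a>,
  </p>
</div>
"""


def wrapper_hashtags(html):
    """Return the text of every <a class="pr"> inside <div class="wrapper">."""
    soup = BeautifulSoup(html, "html.parser")
    # CSS selector: only a.pr elements that are descendants of div.wrapper
    links = soup.select("div.wrapper a.pr")
    # get_text() returns the whole link text, so embedded spaces are kept
    return [a.get_text(strip=True) for a in links]


tags = wrapper_hashtags(HTML)
print(tags)
```

If a regex is still wanted, it could be applied to the text returned by `get_text()` rather than to `str(div)`, so that tag markup never enters the match.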
【Discussion】:
Tags: python regex beautifulsoup