【Question title】: Find hashtags within a specified part of a website
【Posted】: 2018-08-23 14:57:54
【Question】:

I want to extract all hashtags (two, in this example) from a web page.

<html>
    <head>
    </head>
    <body>
        <div class="predefinition">
            <p class="part1">
              <span class="part1-head">Entries:</span>
                <a class="pr" href="/go_somewhere/">#hashA with space</a>, 
                <a class="pr" href="/go_somewhere/">#hashBwithoutsace</a>,
            </p>
            <span class="part2">Boundaries:</span>
            <p>some boundary statement</p>
        </div>        
        <div class="wrapper"> <!-- I only want to search here -->
            <p class="part1">
              <span class="part1-head">Entries:</span>
                <a class="pr" href="/go_somewhere/">#hash1 with space</a>, <!-- I only want to find this -->
                <a class="pr" href="/go_somewhere/">#hash2withoutsace</a>, <!-- and this -->
            </p>
            <span class="part2">Boundaries:</span>
            <p>some other boundary statement</p>
        </div>        
    </body>
</html>

But I am only interested in the hashtags in one branch (in this example, the wrapper div): "#hash1 with space" and "#hash2withoutsace". Right now my code looks like this:

from bs4 import BeautifulSoup
import io
import re

f = io.open("minimal.html", mode="r", encoding="utf-8")
contents = f.read()
soup = BeautifulSoup(contents, 'lxml')
mydivs = soup.findAll("a", {"class": "pr"})

for div in mydivs:
    print(re.findall(r'(?i)\#\w+', str(div)))
  • How can I restrict the search to the "wrapper" div?
  • And how can I include the spaces in the hashtags?

【Question discussion】:

    Tags: python regex beautifulsoup


    【Solution 1】:

    You can locate the div with class wrapper first, then collect the text of every a tag with class pr inside it:

    from bs4 import BeautifulSoup as soup
    # `contents` is the HTML string read from minimal.html in the question
    wrapper = soup(contents, 'html.parser').find('div', {'class': 'wrapper'})
    results = [a.text for a in wrapper.find_all('a', {'class': 'pr'})]
    

    Output:

    ['#hash1 with space', '#hash2withoutsace']
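
The same scoping can probably also be written as a single CSS selector. The following is a minimal sketch, not part of the original answer: it assumes a bs4 version that ships `select()` with descendant selectors, and it inlines a trimmed copy of the question's page as a hypothetical `HTML` string instead of reading minimal.html. Taking the link text directly keeps the spaces, so no regular expression is needed.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the page from the question (hypothetical inline string
# used here instead of reading minimal.html).
HTML = """
<div class="predefinition">
  <p class="part1">
    <a class="pr" href="/go_somewhere/">#hashA with space</a>,
    <a class="pr" href="/go_somewhere/">#hashBwithoutsace</a>,
  </p>
</div>
<div class="wrapper">
  <p class="part1">
    <a class="pr" href="/go_somewhere/">#hash1 with space</a>,
    <a class="pr" href="/go_somewhere/">#hash2withoutsace</a>,
  </p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "div.wrapper a.pr" matches <a class="pr"> elements nested anywhere inside
# <div class="wrapper">, however many there are.
hashtags = [a.get_text(strip=True) for a in soup.select("div.wrapper a.pr")]
print(hashtags)  # ['#hash1 with space', '#hash2withoutsace']
```

The regex in the question, `\#\w+`, stops at the first space because `\w` does not match whitespace; reading the text of each matched element avoids that problem.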
    

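    【Discussion】:

    • OK, you got it :-) The problem is that I don't know how many tags "wrapper" and "predefinition" contain. So is it possible to restrict find_all to the "wrapper" part of the HTML?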
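
As a hedged illustration of the point raised in this comment: calling `find_all` on the element returned by `find` only searches inside that element, so the number of links in either div should not matter. The helper name `hashtags_in`, its parameters, and the sample `page` string below are hypothetical and chosen only for this sketch.

```python
from bs4 import BeautifulSoup


def hashtags_in(html, container_class, link_class="pr"):
    """Return the text of every <a class=link_class> inside the first
    <div class=container_class>, or an empty list if that div is absent."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=container_class)
    if container is None:  # find() returns None when nothing matches
        return []
    # find_all() on the container only looks at its descendants
    links = container.find_all("a", class_=link_class)
    return [a.get_text(strip=True) for a in links]


# Hypothetical page with a different number of links in each div
page = (
    '<div class="predefinition"><a class="pr" href="#">#a</a></div>'
    '<div class="wrapper">'
    '<a class="pr" href="#">#one</a>, '
    '<a class="pr" href="#">#two words</a>, '
    '<a class="pr" href="#">#three</a>'
    '</div>'
)

print(hashtags_in(page, "wrapper"))  # ['#one', '#two words', '#three']
print(hashtags_in(page, "missing"))  # []
```

The `None` check is a design choice for this sketch: without it, a page that lacks the wrapper div would raise an `AttributeError` on the `find_all` call.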