【Question title】: Python, Regex: Extract string after matching string
【Posted】: 2019-03-14 13:23:58
【Question】:

I want to use a regular expression to match a pattern and extract part of it.

I have scraped some HTML data; an illustrative snippet looks like this:

</script>
</li>
<li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
<span class="hide" itemprop="position">1</span>
<div class="result-heading">
<a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
<img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
<div class="result-heading-texts">
<a href="/projects/quickfixj/" itemprop="url" title="Find out more 
<a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
<div class="description">
<p class="description-inner">DESMO-<em>J</em> is a framework for 
<a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
</strong>
<strong>

A more representative subset of the find_all('a') output that highlights the problem:

<!-- Menu -->
<ul class="header-nav-menulist">
<li class="highlight social row">
<span class="social-label">Connect</span>
<span class="social-icons">
<span></span>
<a class="twitter" href="https://twitter.com/sourceforge" rel="nofollow" target="_blank">
<svg viewbox="0 0 1792 1792" xmlns="http://www.w3.org/2000/svg"><path d="M1684 408q-67 98-162 167 1 14 1 42 0 130-38 259.5t-115.5 248.5-184.5 210.5-258 146-323 54.5q-271 0-496-145 35 4 78 4 225 0 401-138-105-2-188-64.5t-114-159.5q33 5 61 5 43 0 85-11-112-23-185.5-111.5t-73.5-205.5v-4q68 38 146 41-66-44-105-115t-39-154q0-88 44-163 121 149 294.5 238.5t371.5 99.5q-8-38-8-74 0-134 94.5-228.5t228.5-94.5q140 0 236 102 109-21 205-78-37 115-142 178 93-10 186-50z"></path></svg></a>
<a class="facebook" href="https://www.facebook.com/sourceforgenet/" rel="nofollow" target="_blank">

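The HTML is currently stored as a BeautifulSoup object, i.e. it has already been parsed with:

html_soup = BeautifulSoup(response.text, 'html.parser')

I want to search the whole object for every instance of /projects/ and extract the string between the slashes that follow it. For example:

From "/projects/quickfixj/" I would like to store "quickfixj".

My first idea was to use re.findall() and try to match (/projects/./)*, but this does not work.

Any help is much appreciated.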
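
For reference, a minimal sketch of why the attempted pattern comes back empty, next to one capture-group alternative. The `sample` string is made up for illustration, and `[^/]+` is only one possible way to say "everything up to the next slash".

```python
import re

# Made-up sample input, not the real scraped page.
sample = '<a href="/projects/quickfixj/" itemprop="url">'

# `.` matches exactly one character, so `/projects/./` only fits one-letter
# names, and the trailing `*` allows the group to match zero times.
attempt = re.findall(r'(/projects/./)*', sample)
print(any(attempt))  # False: every match is an empty string

# Capture one or more non-slash characters between the slashes instead.
names = re.findall(r'/projects/([^/]+)/', sample)
print(names)  # ['quickfixj']
```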

【Question comments】:

    Tags: python regex web-scraping beautifulsoup


    【Solution 1】:

    You are already halfway there:

    a='''</script>
    </li>
    <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
    <span class="hide" itemprop="position">1</span>
    <div class="result-heading">
    <a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
    <img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
    <div class="result-heading-texts">
    <a href="/projects/quickfixj/" itemprop="url" title="Find out more 
    <a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
    <div class="description">
    <p class="description-inner">DESMO-<em>J</em> is a framework for 
    <a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
    </strong>
    <strong>'''
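    import re
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(a, "html.parser")
    for i in soup.find_all('a', href=True):  # href=True skips <a> tags without an href
        # capture the word characters between /projects/ and the next slash
        print(re.findall(r'/projects/(\w+)/', i['href']))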
    

    If you need the unique project names, change the last few lines to:

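    import re
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(a, "html.parser")
    project_set = set()
    for i in soup.find_all('a', href=True):  # skip <a> tags without an href
        # update() accepts an empty list; add(*[]) raises TypeError when nothing matches
        project_set.update(re.findall(r'/projects/(\w+)/', i['href']))
    
    print(project_set)  # {'desmoj', 'quickfixj'} (set order may vary)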
    

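    【Comments】:

    • Thanks for your answer. I ran into a problem using it, mainly because my sample data was not fully representative. I have updated the question to show this. You will notice in the new data that some of the a tags have an href that does not contain /projects/, so set.add() throws an error because there is nothing to add. I am new to Python and finding this hard to solve, so any help would be appreciated. Thanks!
    • No problem, just check the len of the result before adding it to the set.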
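
The len check suggested in the comment above could look roughly like this. It is a sketch, not the answerer's code: the `hrefs` list is made-up sample data standing in for the values of `i.get('href')`, and `set.update()` is swapped in for `set.add(*...)` so that a link with several matches does not raise either.

```python
import re

# Hypothetical hrefs: two project links, one unrelated link, one missing href.
hrefs = [
    '/projects/quickfixj/',
    'https://twitter.com/sourceforge',
    None,
    '/projects/desmoj/files/stats/timeline',
]

project_set = set()
for href in hrefs:
    if not href:
        continue  # <a> tag without an href attribute
    found = re.findall(r'/projects/(\w+)/', href)
    if len(found) > 0:  # skip links that do not contain /projects/<name>/
        project_set.update(found)

print(sorted(project_set))  # ['desmoj', 'quickfixj']
```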
    【Solution 2】:

    You can extract all the links first and then apply the regex:

    import re
    from bs4 import BeautifulSoup
    
    html = '''</script>
    </li>
    <li itemprop="itemListElement" itemscope="" itemtype="http://schema.org/ListItem">
    <span class="hide" itemprop="position">1</span>
    <div class="result-heading">
    <a class="project-icon show-outline" href="/projects/quickfixj/" title="Find out more about QuickFIX/J - Open Source Java FIX Engine">
    <img alt="QuickFIX/J - Open Source Java FIX Engine Icon" src="//a.fsdn.com/allura/p/quickfixj/icon?1533295730"/></a>
    <div class="result-heading-texts">
    <a href="/projects/quickfixj/" itemprop="url" title="Find out more 
    <a href="/projects/desmoj/" itemprop="url" title="Find out more about DESMO-J"><h2>DESMO-J</h2></a>
    <div class="description">
    <p class="description-inner">DESMO-<em>J</em> is a framework for 
    <a href="/projects/desmoj/files/stats/timeline" title="Downloads This Week">29 This Week</a>
    </strong>
    <strong>'''
    
    html_soup = BeautifulSoup(html, 'html.parser')
    
    links = [i.get('href') for i in html_soup.find_all('a', href=True)]
    

    This yields:

    ['/projects/quickfixj/', '/projects/quickfixj/', '/projects/desmoj/files/stats/timeline']
    

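    Then you can apply your regex:

    # re.search returns None for links without /projects/, so those are filtered out
    cleaned = [m.group(1) for m in (re.search(r'(?<=projects/)(.*?)/', i) for i in links) if m]
    

    This yields: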

    ['quickfixj', 'quickfixj', 'desmoj']
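
The list above still contains duplicates. If the order of first appearance matters, one possible sketch is to de-duplicate with `dict.fromkeys`. The `links` list below repeats the hrefs shown earlier plus one made-up non-project link, and `unique` is a hypothetical name introduced here.

```python
import re

links = [
    '/projects/quickfixj/',
    '/projects/quickfixj/',
    '/projects/desmoj/files/stats/timeline',
    'https://twitter.com/sourceforge',  # made-up link with no /projects/ part
]

pattern = re.compile(r'(?<=projects/)(.*?)/')

# search() returns None when a link has no /projects/ segment, so filter on m.
cleaned = [m.group(1) for m in map(pattern.search, links) if m]

# dict keys are unique and keep insertion order (Python 3.7+).
unique = list(dict.fromkeys(cleaned))
print(unique)  # ['quickfixj', 'desmoj']
```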
    

    【Comments】:

      【Solution 3】:

      A regex like this should do the trick: (?<=\/projects\/).+?(?=\/)

      It would work like this:

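      import re
      # lookbehind and lookahead keep the surrounding slashes out of the match
      regex = r"(?<=\/projects\/).+?(?=\/)"
      # single quotes outside, so the double quotes in the HTML need no escaping
      string = '<a href="/projects/quickfixj/" itemprop="url" title="Find out more....'
      matches = re.findall(regex, string)
      print(matches)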
      

      Output: ['quickfixj']
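
As a side-by-side sketch, the lookaround pattern and a plain capture group give the same names on input like this. The `text` string is made up for illustration; which form to prefer is mostly a matter of taste here.

```python
import re

# Made-up input with two project links.
text = ('<a href="/projects/quickfixj/"></a>'
        '<a href="/projects/desmoj/files/stats/timeline"></a>')

# Lookbehind/lookahead: the slashes are checked but not part of the match.
lookaround = re.findall(r'(?<=/projects/).+?(?=/)', text)

# Capture group: the slashes are consumed, findall returns only the group.
captured = re.findall(r'/projects/(.+?)/', text)

print(lookaround)  # ['quickfixj', 'desmoj']
print(captured)    # ['quickfixj', 'desmoj']
```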

      【Comments】:
