[Question title]: How to extract URLs matching a pattern
[Posted]: 2016-05-17 20:01:29
[Question description]:

I am trying to extract URLs that follow this pattern from a web page:

'http://www.realclearpolitics.com/epolls/????/governor/??/-.html'

My current code extracts every link. How can I change it to extract only the URLs that match the pattern? Thanks!

import requests
from bs4 import BeautifulSoup

def find_governor_races(url):
    # Download the page and parse it
    page = requests.get(url).text
    soup = BeautifulSoup(page, 'html.parser')
    links = []
    # Collect the href of every <a> tag (no filtering yet)
    for a in soup.find_all('a', href=True):
        links.append(a['href'])
    return links

find_governor_races('http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html')

[Question discussion]:

    Tags: python-2.7 web-scraping beautifulsoup python-requests


    [Solution 1]:

    You can pass a compiled regular expression pattern to .find_all() as the value of the href argument:

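    import re

    # Escape the literal dots; an unescaped "." matches any character
    pattern = re.compile(r"http://www\.realclearpolitics\.com/epolls/\d+/governor/.*?/.*?\.html")
    # find_all() returns Tag objects; read each URL from the tag's href attribute
    links = [a["href"] for a in soup.find_all("a", href=pattern)]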
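
The question does not say whether the page's links are absolute or relative. If some are relative, a pattern that starts with the host name would miss them. The following is a minimal sketch of one way to handle that, using only the standard library: resolve each href against the page URL, then require a full match. The helper name `filter_governor_urls`, the sample hrefs, and the reading of the wildcards as "four-digit year, one state segment, one file name" are all assumptions made for illustration, not details taken from the site.

```python
import re
from urllib.parse import urljoin

# Assumed reading of the question's wildcard pattern:
# a four-digit year, one path segment for the state, then a file name ending in .html.
GOVERNOR_URL = re.compile(
    r"http://www\.realclearpolitics\.com/epolls/\d{4}/governor/[^/]+/[^/]+\.html"
)

def filter_governor_urls(hrefs, base_url):
    """Resolve each href against base_url and keep only the full matches."""
    matches = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # leaves absolute URLs unchanged
        if GOVERNOR_URL.fullmatch(absolute):
            matches.append(absolute)
    return matches

# Made-up hrefs, for illustration only
sample_hrefs = [
    "/epolls/2010/governor/ca/some_race-1113.html",
    "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html",
    "http://www.realclearpolitics.com/epolls/2010/senate/ca/some_race-1094.html",
]
base = "http://www.realclearpolitics.com/epolls/2010/governor/2010_elections_governor_map.html"
result = filter_governor_urls(sample_hrefs, base)
print(result)
```

In the question's function, the list of `a['href']` values could be passed to this helper together with the page URL. Using `[^/]+` instead of `.*?` keeps each wildcard inside a single path segment, which is why the map page itself is not matched.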
    

    [Discussion]:
