【Question title】: BeautifulSoup [ ] contains no links
【Posted】: 2019-12-25 22:03:49
【Question】:

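I am trying to scrape links from this site: https://spotlightstockmarket.com/sv/market-overview/nyheter/

My program can't seem to find the links. I suspect this is a security measure and that the site doesn't want people retrieving its information (?).

Do I need to add an extra line to dig into the "li" tags?

I would appreciate any help with this.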

from bs4 import BeautifulSoup
import requests


result = requests.get("https://spotlightstockmarket.com/sv/market-overview/nyheter/")
src = result.content
soup = BeautifulSoup(src, 'lxml')

urls = []
for h2_tag in soup.find_all('li'):
    a_tag = h2_tag.find('a')
    urls.append(a_tag.attrs['href'])

print(urls)

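【Comments】:

    Tags: html python-3.x web-scraping beautifulsoup


    【Solution 1】:

    The page content is actually rendered by JavaScript, so the anchor tags for the news links are not present in the raw HTML that requests downloads.

    Here is a Selenium approach: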

    from selenium import webdriver
    from bs4 import BeautifulSoup
    from selenium.webdriver.firefox.options import Options
    
    options = Options()
    options.add_argument('--headless')
    
    driver = webdriver.Firefox(options=options)
    driver.get('https://spotlightstockmarket.com/sv/market-overview/nyheter/')
    html = driver.page_source
    soup = BeautifulSoup(html, 'html.parser')
    
    for item in soup.find_all('a', {'class': 'text'}):
        href = item.get("href")
        print(f"https://spotlightstockmarket.com{href}")
    
    driver.quit()
    
    

    Output:

    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54904&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54902&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54903&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54901&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54900&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54899&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54898&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54897&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54896&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54894&publisher=370
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26715&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26714&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26713&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=1880&publisher=372
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=1879&publisher=372
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26712&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26711&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26710&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26709&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=26708&publisher=371
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54808&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54809&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54790&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54776&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54747&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54741&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54721&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54720&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54707&publisher=369
    https://spotlightstockmarket.com/sv/market-overview/nyheter/nyhets-artikel/?id=54706&publisher=369
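
The snippet above builds absolute URLs by string concatenation, which works as long as every href is site-relative. As a hedged alternative, the sketch below uses urljoin from the standard library, which also handles hrefs that are already absolute. The names BASE_URL, to_absolute and sample_hrefs are made up for illustration, and the sample values are only stand-ins for what the page might return.

```python
from urllib.parse import urljoin

# Assumed base: the page the hrefs were scraped from.
BASE_URL = "https://spotlightstockmarket.com/sv/market-overview/nyheter/"


def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url, skipping missing (None or empty) values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# Illustrative inputs: a site-relative path, an already absolute URL, and a missing href.
sample_hrefs = [
    "/sv/market-overview/nyheter/nyhets-artikel/?id=54904&publisher=370",
    "https://spotlightstockmarket.com/sv/om-spotlight",
    None,
]

absolute_links = to_absolute(sample_hrefs)
for link in absolute_links:
    print(link)
```

With this approach an absolute href is passed through unchanged rather than being prefixed with the domain a second time.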
    

    As for the li elements, they are not rendered by JavaScript, so you can use:

    from bs4 import BeautifulSoup
    import requests
    
    
    r = requests.get(
        "https://spotlightstockmarket.com/sv/market-overview/nyheter/")
    soup = BeautifulSoup(r.text, 'html.parser')
    
    urls = set()
    for item in soup.find_all(lambda tag: tag.name == 'li' and not tag.attrs):
        for a_tag in item.find_all("a"):
            href = a_tag.get("href")
            if href:
                # Only store anchors that carry an href; otherwise None would end up in the set
                urls.add(f"https://spotlightstockmarket.com{href}")
    
    print(urls)
    

    Output:

    {'https://spotlightstockmarket.com/sv/om-spotlight/kontakt', 'https://spotlightstockmarket.com/sv/market-overview/rapportkalender', 'https://spotlightstockmarket.com/sv/redan-noterad/next', 'https://spotlightstockmarket.com/sv/bli-delaegare', 'https://spotlightstockmarket.com/sv/om-spotlight', 'https://spotlightstockmarket.com/sv/redan-noterad/regelverk', 'https://spotlightstockmarket.com/sv/medlemmar/medlemslista', 'https://spotlightstockmarket.com/sv/redan-noterad/i-fokus', 'https://spotlightstockmarket.com/sv/redan-noterad/information-foer-att-uppraetta-din-ir-sida', 'https://spotlightstockmarket.com/sv/redan-noterad/kapitalanskaffning', 'https://spotlightstockmarket.com/sv/market-overview/nyheter', 'https://spotlightstockmarket.com/sv/market-overview/kurser', 'https://spotlightstockmarket.com/sv/market-overview/bolagshaendelser', 'https://spotlightstockmarket.com/sv/market-overview', 'https://spotlightstockmarket.com/sv/market-overview/vaara-bolag', 'https://spotlightstockmarket.com/sv/redan-noterad/investor-relations', 'https://spotlightstockmarket.com/sv/market-overview/filmer', 'https://spotlightstockmarket.com/sv/om-spotlight/koncerninformation', 'https://spotlightstockmarket.com/en/market-overview/news', 'https://spotlightstockmarket.com/sv/bli-delaegare/hur-blir-jag-delaegare', 'https://spotlightstockmarket.com/sv/om-spotlight/oeppettider', 'https://spotlightstockmarket.com/sv/bli-noterad/go-public', 'https://spotlightstockmarket.com/sv/redan-noterad/disciplinnaemnden', 'https://spotlightstockmarket.com/sv/market-overview/noteringar', 'https://spotlightstockmarket.com/sv/medlemmar/regelverk-och-prislista', 'https://spotlightstockmarket.com/sv/redan-noterad', 'https://spotlightstockmarket.com/sv/bli-noterad/vaart-erbjudande', 'https://spotlightstockmarket.com/sv/redan-noterad/vaart-erbjudande', 'https://spotlightstockmarket.com/sv/market-overview/analyser', 'https://spotlightstockmarket.com/sv/bli-noterad', 
'https://spotlightstockmarket.com/sv/bli-noterad/hur-gaar-en-notering-till', 'https://spotlightstockmarket.com/sv/redan-noterad/vaegledning', 'https://spotlightstockmarket.com/sv/redan-noterad/boka-utbildning', 'https://spotlightstockmarket.com/sv/bli-noterad/spotlight-stories', 'https://spotlightstockmarket.com/sv/om-spotlight/pressbilder', 'https://spotlightstockmarket.com/sv/bli-noterad/varfoer-bli-noterad', 'https://spotlightstockmarket.com/sv/medlemmar', 'https://spotlightstockmarket.com/dk/market-overview/nyheder', 'https://spotlightstockmarket.com/sv/market-overview/spotlight-index', 'https://spotlightstockmarket.com/sv/bli-delaegare/varfoer-bli-delaegare', 'https://spotlightstockmarket.com/sv/market-overview/emissioner'}
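
To make the filter in that snippet easier to follow (li tags with no attributes, then every anchor inside them), here is a small offline sketch. It deliberately swaps BeautifulSoup for the standard library's html.parser so it runs without extra packages; it is not the answer's method, only an illustration of the same selection rule. The class name PlainLiLinkParser and the sample HTML are invented, and the sketch assumes each li has a closing tag.

```python
from html.parser import HTMLParser


class PlainLiLinkParser(HTMLParser):
    """Collect href values of <a> tags nested in attribute-less <li> tags."""

    def __init__(self):
        super().__init__()
        self.li_stack = []  # one bool per open <li>: True if it has no attributes
        self.hrefs = set()

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.li_stack.append(not attrs)
        elif tag == "a" and any(self.li_stack):
            href = dict(attrs).get("href")
            if href:  # skip anchors without an href
                self.hrefs.add(href)

    def handle_endtag(self, tag):
        if tag == "li" and self.li_stack:
            self.li_stack.pop()


# Invented markup: one plain <li>, one <li> with a class, one anchor without href.
sample_html = """
<ul>
  <li><a href="/sv/om-spotlight">Om</a></li>
  <li class="active"><a href="/sv/skip">Skip</a></li>
  <li><a>No href</a></li>
</ul>
"""

parser = PlainLiLinkParser()
parser.feed(sample_html)
plain_li_links = sorted(parser.hrefs)
print(plain_li_links)
```

Only the link inside the attribute-less li is kept; the one under the li with a class attribute and the anchor without an href are both skipped.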
    

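    【Comments】:

      【Solution 2】:

      The data is retrieved dynamically from a JavaScript object when the script runs in the browser. Because that object is present in response.text, you can simply extract the URLs with a regular expression, as follows. This avoids the overhead of driving a browser.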

      import requests, re
      
      p = re.compile(r'"url": "(.*?)",')
      r = requests.get('https://spotlightstockmarket.com/sv/market-overview/nyheter/')
      links = ['https://spotlightstockmarket.com' + link for link in p.findall(r.text)]
      print(links)
      

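      Regex: the pattern looks for the literal key "url": followed by a quoted string, and the lazy group (.*?) captures that string's contents up to the closing quote and comma.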
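
To see the pattern in isolation without hitting the network, the following sketch runs it against a small hard-coded string. The contents of sample_text are a hypothetical stand-in for the JavaScript object in the page source (the real object's field names and layout may differ), and extract_links is a helper name made up for this example.

```python
import re

# Hypothetical excerpt standing in for the JavaScript object embedded in the page source.
sample_text = '''
var news = [
    {"title": "First", "url": "/sv/market-overview/nyheter/nyhets-artikel/?id=54904&publisher=370", "date": "2019-12-23"},
    {"title": "Second", "url": "/sv/market-overview/nyheter/nyhets-artikel/?id=54902&publisher=370", "date": "2019-12-23"}
];
'''

# Same pattern as in the answer: lazily capture the value of each "url" key.
URL_PATTERN = re.compile(r'"url": "(.*?)",')


def extract_links(text, base="https://spotlightstockmarket.com"):
    """Return absolute links for every "url" value found in text."""
    return [base + path for path in URL_PATTERN.findall(text)]


links = extract_links(sample_text)
for link in links:
    print(link)
```

Note that the pattern depends on the exact spacing after the colon and on a trailing comma, so it would miss a "url" entry that is the last key in its object.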

      【Comments】:
