【Question title】: Beautiful Soup: extract a link's href from Google search results
【Posted】: 2012-05-09 05:39:48
【Question】:

A Google search returns the following HTML for the first result:

<h3 class="r"><a href="https://rads.stackoverflow.com/amzn/click/com/0470284889" rel="nofollow noreferrer" class="l vst" onmousedown="return rwt(this,'','','','1','AFQjCNEv1W9YC2jcSKYdEo2kNqBMJ-Utmg','k89K9hF4cVNpxQYHtEKiUQ','0CCoQFjAA',null,event)"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</a></h3>

I want to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from it, but when I extract the information with Beautiful Soup using

soup.find("h3").find("a").get("href")

I get the following string:

/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A

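I know the link is in there, and I could parse it out by removing /url?q= and everything after the & symbol, but I would like to know whether there is a cleaner solution.

Thanks!

【Comments on the question】:

    Tags: python html beautifulsoup google-search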


    【Solution 1】:

    You can use a combination of urlparse.urlparse and urlparse.parse_qs, for example:

    >>> import urlparse
    >>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'
    >>> data = urlparse.parse_qs(
    ...     urlparse.urlparse(url).query
    ... )
    >>> data
    {'ei': ['P2ycT6OoNuasiAL2ncV5'],
     'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'],
     'sa': ['U'],
     'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'],
     'ved': ['0CBIQFjAA']}
    >>> data['q'][0]
    'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'
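
The session above uses the Python 2 `urlparse` module; in Python 3 the same functions live in `urllib.parse`. Below is a minimal sketch of the same idea, assuming the redirect link always carries its target in the `q` parameter. The helper name `unwrap_google_url` is hypothetical and not part of the original answer.

```python
from urllib.parse import parse_qs, urlparse


def unwrap_google_url(href):
    """Return the target URL from a Google '/url?q=...' redirect link.

    Hypothetical helper: falls back to the href itself when the link
    has no 'q' query parameter.
    """
    query = parse_qs(urlparse(href).query)
    targets = query.get("q")
    return targets[0] if targets else href


# The redirect string quoted in the question.
href = (
    "/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business"
    "/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA"
    "&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A"
)
print(unwrap_google_url(href))
```

Using `parse_qs` rather than string slicing means the order of the query parameters does not matter, and percent-encoded characters in the target are decoded.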
    

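    【Comments】:

    • Thanks, that is exactly what I wanted! Just wondering: why does BeautifulSoup() parse the JavaScript into something different from what my web browser shows? Does that mean I have to use the html5lib parser to get the correct result?
    • @ejang: Sorry, I don't know how BeautifulSoup handles that :( If you like, post it as a new question; it would be an interesting one :)
    【Solution 2】: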

    To extract only the first result from the page, you can use the bs4 select_one() method with a CSS selector, or the find() method.

    Code and example in the online IDE:

    # lxml must be installed for the 'lxml' parser below, but it does not need to be imported
    import requests
    from bs4 import BeautifulSoup
    
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    # passing parameters in URLs
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {'q': 'Quantitative Trading How to Build Your Own Algorithmic - amazon'}
    
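    def bs4_get_first_googlesearch():
        html = requests.get('https://www.google.com/search', headers=headers, params=params, timeout=30).text
        soup = BeautifulSoup(html, 'lxml')
    
        # '.yuRUbf' was the result container class when this answer was written;
        # Google's markup changes often, so the selector may match nothing
        container = soup.select_one('.yuRUbf')
        if container is None or container.a is None:
            print('No result found; the selector may be outdated')
            return
        print(container.a['href'])
    
    bs4_get_first_googlesearch()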
    
    # output:
    '''
    https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
    '''
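
The answer mentions find() as an alternative to select_one() without showing it. The sketch below runs both against a small inline HTML string, so it needs no network access. The sample markup is hypothetical, modelled on the `<h3 class="r">` element and the `/url?q=...` link from the question, and is not what Google serves today.

```python
from urllib.parse import parse_qs, urlparse

from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the question: an <h3 class="r">
# wrapping a link whose href is a Google '/url?q=...' redirect.
html = """
<h3 class="r">
  <a href="/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&amp;sa=U&amp;ved=0CBIQFjAA">
    Quantitative Trading
  </a>
</h3>
"""

# 'html.parser' ships with Python, so no extra parser needs installing.
soup = BeautifulSoup(html, "html.parser")

# CSS selector: the first <a> inside an <h3> with class "r".
href_css = soup.select_one("h3.r a")["href"]

# Equivalent find() chain.
href_find = soup.find("h3", class_="r").find("a")["href"]

# Both approaches return the same raw redirect link.
assert href_css == href_find

# Unwrap the redirect by reading its 'q' query parameter.
target = parse_qs(urlparse(href_css).query)["q"][0]
print(target)
```

Both select_one() and find() return None when nothing matches, so real scraping code should check the result before indexing into it.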
    

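    Alternatively, you can do the same thing with the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches. See the playground.

    The main difference is that everything is already done for the end user: selecting elements, bypassing blocks, rotating proxies, and so on.

    Code to integrate: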

    from serpapi import GoogleSearch
    import os
    
    def serpapi_get_first_googlesearch():
        params = {
          "api_key": os.getenv("API_KEY"),
          "engine": "google",
          "q": "Quantitative Trading How to Build Your Own Algorithmic - amazon",
          "hl": "en",
        }
    
        search = GoogleSearch(params)
        results = search.get_dict()
        # [0] - first element from the search results
        first_link = results['organic_results'][0]['link']
        print(first_link)
    
    serpapi_get_first_googlesearch()
    
    # output:
    '''
    https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
    '''
    

    Disclaimer: I work for SerpApi.

    【Comments】:
