【Title】: Losing information when using BeautifulSoup
【Posted】: 2019-12-22 17:19:51
【Description】:

I'm following along with "Automate the Boring Stuff with Python", working on the exercise called 'Project: "I'm Feeling Lucky" Google Search'.

But the CSS selector returns nothing:

import requests, sys, webbrowser, bs4, pyperclip

if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

I have tested the same code in the IDLE shell. It seems that

linkElems = soup.select('.r')

returns nothing.

After inspecting the value returned by BeautifulSoup in

soup = bs4.BeautifulSoup(res.text, "html.parser")

I found that every class='r' and class='rc' had disappeared for no apparent reason, even though they are present in the raw HTML file.

Please tell me why this happens and how to avoid such problems.

【Discussion】:

    Tags: css python-3.x beautifulsoup css-selectors


    【Solution 1】:

    Google is blocking your request because the default requests user-agent is python-requests. Check what your user-agent is. An unrecognized user-agent gets the request blocked and results in completely different HTML, with different elements and selectors. Sometimes you can still receive different HTML, with different selectors, even when using a browser user-agent.

    Read more about the user-agent and HTTP request headers.
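    To see what your script is actually sending, requests exposes its baseline headers via `requests.utils.default_headers()`; a quick local check, no network needed:

```python
import requests

# The User-Agent sent when you don't override it; it identifies the
# client as a script (e.g. "python-requests/2.x"), which Google blocks.
default_ua = requests.utils.default_headers()["User-Agent"]
print(default_ua)
```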

    user-agent 传递给请求headers

    headers = {
        'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    requests.get('YOUR_URL', headers=headers)
    

    Try the lxml parser instead; it's faster.
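    The parser choice doesn't change what gets selected, only the speed. A minimal offline sketch (hypothetical snippet, and assuming lxml is installed) showing that both parsers yield the same match:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet imitating a result block with class="r".
html = '<div class="r"><a href="https://example.com">hit</a></div>'

# Run the same selector under both parsers; lxml is typically faster on real pages.
hrefs = [
    BeautifulSoup(html, parser).select_one(".r a")["href"]
    for parser in ("html.parser", "lxml")
]
print(hrefs)
```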


    Code and the full example in the online IDE:

    from bs4 import BeautifulSoup
    import requests
    
    headers = {
        'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    
    params = {
        "q": "My query goes here"
    }
    
    html = requests.get('https://www.google.com/search', headers=headers, params=params)
    soup = BeautifulSoup(html.text, 'lxml')
    
    for result in soup.select('.tF2Cxc'):
        link = result.select_one('.yuRUbf a')['href']
        print(link)
    
    -----
    
    '''
    https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
    https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
    https://www.exoscale.com/syslog/explaining-mysql-queries/
    https://blog.hubspot.com/marketing/sql-tutorial-introduction
    https://mode.com/sql-tutorial/sql-sub-queries/
    https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
    https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
    https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
    http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
    https://www.sommarskog.se/query-plan-mysteries.html
    '''
    

    Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

    The difference in your case is that you only need to extract the data you want from a JSON string, instead of figuring out how to scrape Google, maintain the parser, or bypass its blocks.

    Code to integrate:

    
    
    import os
    from serpapi import GoogleSearch
    
    params = {
        "engine": "google",
        "q": "My query goes here",
        "hl": "en",
        "api_key": os.getenv("API_KEY"),
    }
    
    search = GoogleSearch(params)
    results = search.get_dict()
    
    for result in results["organic_results"]:
        print(result['link'])
    
    -------
    '''
    https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
    https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
    https://www.exoscale.com/syslog/explaining-mysql-queries/
    https://blog.hubspot.com/marketing/sql-tutorial-introduction
    https://mode.com/sql-tutorial/sql-sub-queries/
    https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
    https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
    https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
    http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
    https://www.sommarskog.se/query-plan-mysteries.html
    '''
    

    Disclaimer: I work for SerpApi.

    【Discussion】:

      【Solution 2】:

      To get the version of the HTML in which class r is defined, you need to set the User-Agent in the headers:

      import requests
      from bs4 import BeautifulSoup
      
      address = 'linux'
      
      headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}
      
      res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
      res.raise_for_status()
      soup = BeautifulSoup(res.text,"html.parser")
      
      linkElems = soup.select('.r a')
      
      for a in linkElems:
          if a.text.strip() == '':
              continue
          print(a.text)
      

      Prints:

      Linux.orghttps://www.linux.org/
      Puhverdatud
      Tõlgi see leht
      Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
      Puhverdatud
      Sarnased
      Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux
      
      ...and so on.
      

      【Discussion】:

      • Thank you so much, it works! But I still don't know the reason.
      • @Tritium Some sites return different versions of the HTML depending on the User-Agent; Google is one of them. But yes, sometimes it is hard to discover, and it can change unpredictably.
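      The comment's point can be reproduced offline with two hypothetical variants of the same page: markup with the class="r" wrapper (what a browser gets) and markup without it (what a blocked script gets):

```python
from bs4 import BeautifulSoup

# Hypothetical markup: browser version vs. stripped version served to scripts.
browser_html = '<div class="r"><a href="https://example.com">hit</a></div>'
blocked_html = '<div><a href="https://example.com">hit</a></div>'

with_class = BeautifulSoup(browser_html, "html.parser").select(".r a")
without_class = BeautifulSoup(blocked_html, "html.parser").select(".r a")
print(len(with_class), len(without_class))  # same selector: 1 match vs. 0
```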