【Title】: Scraping Google search results was working yesterday, now it doesn't
【Posted】: 2019-12-18 03:30:39
【Question】:

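My program was working yesterday. I saved and closed it, and now it doesn't work. The first for loop is supposed to append website links from a Google search, but now the loop body never runs at all.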

import bs4
import requests


def Google(word):

    linkelem = []
    strlink = []
    httplink = []
    extractedhttp = []
    brokenlinks = []

    websiteheadlines = []
    websitebody = []

    res2 = requests.get(f'https://google.com/search?q={word}')
    res2.raise_for_status()

    soup2 = bs4.BeautifulSoup(res2.text, 'html.parser')
    #print(soup2)

    for div in soup2.find_all("div", {"class": "jfp3ef"}):
        for link in div.select("a"):
            linkelem.append(link)

I need it to append the links to the list "linkelem".

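This is the part that isn't working. There is more code, but all of it depends on this first part; I can add the rest if needed. I tried adding print statements inside the for loop, but nothing was printed. I don't know what to try after that.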
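One way to narrow this down, sketched on a static HTML string rather than a live Google response: `find_all` returns an empty list when nothing matches, and a for loop over an empty list never runs its body, which would explain why the print statements stay silent. The sample markup, the class "abc123", and the helper name `collect_links` are made up for illustration.

```python
import bs4

# Made-up markup standing in for a results page (assumption: the real
# page no longer contains any element with the class "jfp3ef").
SAMPLE_HTML = """
<div class="abc123"><a href="https://example.com/one">One</a></div>
<div class="abc123"><a href="https://example.com/two">Two</a></div>
"""


def collect_links(html, class_name):
    """Return the href of every <a> inside <div class=class_name>."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    divs = soup.find_all("div", {"class": class_name})
    if not divs:
        # Nothing matched, so the loop below has nothing to iterate over.
        print(f"No <div> with class {class_name!r} found")
    links = []
    for div in divs:
        for link in div.select("a"):
            links.append(link.get("href"))
    return links


old_class_links = collect_links(SAMPLE_HTML, "jfp3ef")
new_class_links = collect_links(SAMPLE_HTML, "abc123")
print(old_class_links)
print(new_class_links)
```

Checking the length of the `find_all` result before looping separates "the request failed" from "the selector no longer matches".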

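【Comments】:

  • Pretty sure jfp3ef is a randomized class name that changes from time to time to block bots like yours. Looking at the source of the page you are requesting, I can't find it anywhere.
  • Also, you shouldn't be scraping their website like this; they provide a proper API for that at developers.google.com/custom-search/v1/overview
  • Ah shoot, I didn't know that. This is very helpful, thank you so much! @Havenard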
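As a rough sketch of what the API route mentioned above could look like: the Custom Search JSON API takes `key`, `cx` and `q` query parameters and returns a JSON document whose `items` entries carry a `link` field. The credentials `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders, the helper names are hypothetical, and `urllib` from the standard library is used here instead of `requests` only to keep the sketch dependency-free.

```python
import json
import urllib.parse
import urllib.request

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_search_url(query, api_key, engine_id):
    """Build a Custom Search JSON API request URL."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urllib.parse.urlencode(params)}"


def extract_links(response):
    """Pull the result URLs out of a decoded API response."""
    return [item["link"] for item in response.get("items", [])]


def search(query, api_key, engine_id):
    """Send the request and return the result links (needs network access)."""
    url = build_search_url(query, api_key, engine_id)
    with urllib.request.urlopen(url, timeout=10) as reply:
        return extract_links(json.load(reply))


# Offline demonstration: placeholder credentials and a hand-written response.
demo_url = build_search_url("Minecraft", "YOUR_API_KEY", "YOUR_ENGINE_ID")
demo_links = extract_links(
    {"items": [{"title": "Example", "link": "https://example.com/"}]}
)
print(demo_url)
print(demo_links)
```

Because the response is structured JSON, this does not depend on any CSS class names in Google's HTML.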

Tags: python for-loop beautifulsoup request


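【Solution 1】:

The most obvious cause, as Havenard suggested, is that the class name has changed. It could also be that your request does not send a user-agent header, so it does not look like a visit from a real browser. List of user agents.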

headers = {
  "User-Agent":
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)
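To see what changes when the header is set, here is a small sketch that inspects the outgoing request without sending it. By default `requests` identifies itself as `python-requests/<version>`; building a prepared request shows the header and URL that would actually go over the wire. The search URL and query are taken from the example in this answer; nothing is sent over the network.

```python
import requests

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

# The User-Agent requests sends when no header is supplied.
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, to inspect headers and URL.
prepared = requests.Request(
    "GET",
    "https://www.google.com/search",
    headers=headers,
    params={"q": "Minecraft"},
).prepare()

print(default_agent)
print(prepared.headers["User-Agent"])
print(prepared.url)
```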

You can always check whether the class exists with an if statement:

# scrapes all titles from page result
# try to remove one letter from CSS selector and it will print "Nothing has been found."
if soup.select('.DKV0Md'):
  print('Found elements:')
  for result in soup.select('.DKV0Md'):
    print(result.text)
else:
  print('Nothing has been found.')

# output:
'''
Found elements:
Minecraft Official Site | Minecraft
Minecraft - Wikipedia
Minecraft - Apps on Google Play
Minecraft - YouTube
'''

Code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {'q': 'Minecraft'}

html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')

if soup.select('.DKV0Md'):
  print('Found elements:')
  for result in soup.select('.DKV0Md'):
    print(result.text)
else:
  print('Nothing has been found.')
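Since class names such as `.DKV0Md` can change without notice, a selector based on page structure may last longer. The sketch below assumes that each result title is an `<h3>` nested inside the result's `<a>` element; that is an assumption about Google's markup, not something guaranteed, and the sample HTML and the helper name `extract_results` are made up so the example runs offline.

```python
from bs4 import BeautifulSoup

# Made-up markup: two results with different title classes plus an
# unrelated link that should be ignored.
SAMPLE_HTML = """
<div><a href="https://www.minecraft.net/"><h3 class="DKV0Md">Minecraft Official Site</h3></a></div>
<div><a href="https://en.wikipedia.org/wiki/Minecraft"><h3 class="Zz9">Minecraft - Wikipedia</h3></a></div>
<div><a href="/preferences">Settings</a></div>
"""


def extract_results(html):
    """Return (title, link) pairs for every <h3> wrapped in a link."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for heading in soup.find_all("h3"):
        anchor = heading.find_parent("a")
        # Skip headings that are not inside a link with an href.
        if anchor is not None and anchor.get("href"):
            results.append((heading.get_text(strip=True), anchor["href"]))
    return results


found = extract_results(SAMPLE_HTML)
if found:
    for title, link in found:
        print(title, link)
else:
    print("Nothing has been found.")
```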

Alternatively, you can use the Google Search Engine Results API from SerpApi. It is a paid API with a free trial of 5,000 searches.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "Minecraft",
  "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
  # try/except is better in this case.
  # If a result has no title, it will just print 'Nothing has been found.'
  try:
    title = result['title']
    print('Found elements:')
    print(title)
  except KeyError:
    print('Nothing has been found.')

Disclaimer: I work for SerpApi.
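A variation on the loop above, sketched on hand-written dictionaries rather than a real API response: using `dict.get` with a default also covers the case where the `organic_results` key is missing entirely, which the try/except inside the loop does not catch. The helper name `extract_titles` and both sample dictionaries are hypothetical.

```python
def extract_titles(results):
    """Return the titles of all organic results, or an empty list."""
    titles = []
    for result in results.get("organic_results", []):
        title = result.get("title")
        if title:
            titles.append(title)
    return titles


# Hand-written stand-ins for the dictionary returned by search.get_dict().
sample_results = {
    "organic_results": [
        {"title": "Minecraft Official Site | Minecraft"},
        {"position": 2},  # a result without a title is skipped
        {"title": "Minecraft - Wikipedia"},
    ]
}
empty_results = {"search_metadata": {"status": "Success"}}

sample_titles = extract_titles(sample_results)
empty_titles = extract_titles(empty_results)
print(sample_titles or "Nothing has been found.")
print(empty_titles or "Nothing has been found.")
```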
