Not getting the correct URL with BeautifulSoup in Python

【Question title】: not getting correct url beautifulsoup python
【Posted】: 2018-11-26 13:49:54
【Question description】:

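I'm trying to scrape Google search results using Python and BeautifulSoup. In my first program, I just want to get all the links on the search results page. Eventually, I want to follow the links to other websites and scrape those sites. The problem is that when I look at the links my program gives me, they don't point to the correct URL. For example, the first website URL after searching Google for "what is python" is 'https://www.python.org/doc/essays/blurb/', but my program gives me '/url?q=https://www.python.org/doc/essays/blurb/&sa=U&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF'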

Looking at the BeautifulSoup documentation, I expected output similar to their example:

for link in soup.find_all('a'):
    print(link.get('href'))
# http://example.com/elsie
# http://example.com/lacie
# http://example.com/tillie

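Instead, I get a leading "/url?q=" before the website address and a lot of unexpected characters after it. Can someone explain why I'm not getting the expected output? Here is my code: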

import requests
from bs4 import BeautifulSoup

search_item = 'what is python'
url = "https://www.google.ca/search?q=" + search_item

response = requests.get(url)
soup = BeautifulSoup(response.text, "lxml")

for link in soup.find_all('a'):
    print(link.get('href'))

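【Question discussion】:

Possible duplicate of "Parse URL beautifulsoup"
Thanks, I think this might be my solution. But can anyone explain why I'm not getting the expected output? For example, looking at the BeautifulSoup documentation, I was expecting output similar to this: for link in soup.find_all('a'): print(link.get('href')) # example.com/elsie # example.com/lacie # example.com/tillie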

Tags: python web-scraping beautifulsoup

【Solution 1】:

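This happens because no user-agent is specified. The default requests user-agent is python-requests, so Google can tell that the request comes from a bot rather than a "real" user and responds differently: it may block the request or, as seems to be the case here, return a different HTML page in which result links are wrapped in /url?q=... redirects. Sending a browser-like user-agent in the HTTP request headers makes the request look like a visit from a regular user.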

Also, this code doesn't target the specific links you're looking for; it extracts every link from the HTML:

for link in soup.find_all('a'):
    print(link.get('href'))

What you're actually looking for are the links from the organic search results, for example:

# container with needed data (title, link, snippet, displayed link, etc.)
for result in soup.select('.tF2Cxc'):
  # grabbing just links from the container
  link = result.select_one('.yuRUbf a')['href']

Code:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "what does katana mean",  # query
  "gl": "us",                    # country to search from
  "hl": "en"                     # language
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)
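
As a side note on the params dictionary above, here is a minimal sketch of the query string it corresponds to. It uses only the standard library and sends no request; requests is expected to build an equivalent URL from params, but treat that as an assumption rather than a guarantee about its internals. The point is that the query gets URL-encoded for you instead of being concatenated into the URL by hand, as in the question's code.

```python
from urllib.parse import urlencode

params = {
    "q": "what does katana mean",  # query
    "gl": "us",                    # country to search from
    "hl": "en",                    # language
}

# urlencode escapes each value and joins the pairs with '&';
# spaces in the query become '+'.
query_string = urlencode(params)
search_url = "https://www.google.com/search?" + query_string

print(search_url)
# https://www.google.com/search?q=what+does+katana+mean&gl=us&hl=en
```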

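Alternatively, you can use the Google Organic Results API from SerpApi to achieve the same thing. It's a paid API with a free plan.

The difference in your case is that you only need to pull the data you want out of structured JSON quickly, rather than working out why something isn't behaving as expected.

Code to integrate: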


import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "what does katana mean",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['title'])
  print(result['link'])

Disclaimer: I work for SerpApi.

【Discussion】:

【Solution 2】:

I'd like to provide an update on this question. I found that by adding headers:

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 '
                         'Safari/537.36'}
r = requests.get(url, headers=headers)

Google gives me the correct links, and I don't need to do any string manipulation.
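
If the redirect-style links still show up (for example, when no user-agent header is sent), the string manipulation could look like the following minimal sketch. It assumes the href has the '/url?q=<target>&...' shape shown in the question; extract_target is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
from urllib.parse import urlparse, parse_qs


def extract_target(href):
    """Return the destination URL from a '/url?q=...' href.

    Hypothetical helper: any href that does not match that shape
    is returned unchanged.
    """
    parsed = urlparse(href)
    if parsed.path == "/url":
        # parse_qs maps each query key to a list of its values
        targets = parse_qs(parsed.query).get("q")
        if targets:
            return targets[0]
    return href


# The href reported in the question
raw = ("/url?q=https://www.python.org/doc/essays/blurb/&sa=U"
       "&ved=0ahUKEwirv7mZzNnbAhXD5YMKHdl0AFsQFggUMAA"
       "&usg=AOvVaw3Q2RD0gl-X3BiEJ-5HIxmF")

print(extract_target(raw))
# https://www.python.org/doc/essays/blurb/
```

Parsing the query string with urllib.parse rather than slicing the string by hand keeps the extraction independent of the order and number of the extra parameters (sa, ved, usg).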

【Discussion】: