How to Open a URL and Extract Information in Python: Answers

【Question title】: How to Open an URL and Extract information in Python
【Posted】: 2017-03-22 16:18:41
【Question description】:

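I wrote a web crawler to extract Google Scholar information. However, the convenient tools (e.g. urllib2 or requests) all fail. They give me a 503 error code.

I'm looking for another way to extract the information. Is it possible to have the program open the URL in a browser and then extract the information?

For example, here is a link:

'http://scholar.google.com/citations?user=lTCxlGYAAAAJ&hl=en'

And how do I go on to get the H-index and so on?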

【Question discussion】:

docs.python-guide.org/en/latest/scenarios/scrape
No, it doesn't work. Still a "503" error.

Tags: python scrape

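【Solution 1】:

Google Scholar appears to temporarily ban clients (with a 503 error code) that query too frequently or look automated. You were probably banned temporarily either because you queried too often or because it decided you were running from a script. You can use cookies to perform multiple queries within a single session. Otherwise, wait until the ban is lifted, wait between attempts, or write your script so that it appears to come from a web browser (change the "User-Agent" string it sends with the query).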
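The three suggestions above (keep cookies for the session, pause between attempts, send a browser-like User-Agent) can be combined in a few lines. What follows is only a minimal sketch using the standard library's urllib.request (the Python 3 successor of urllib2). The helper names, the 10-second default delay, and the User-Agent string are assumptions chosen for illustration, and none of this guarantees that Google Scholar will stop returning 503.

```python
import time
import urllib.request
from http.cookiejar import CookieJar

# Assumed browser-like User-Agent; any current browser string would serve the same purpose.
USER_AGENT = ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
              "(KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36")


def build_opener_with_cookies():
    """Return (opener, jar): the opener stores cookies in jar and sends USER_AGENT."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", USER_AGENT)]
    return opener, jar


def fetch_all(opener, urls, delay_seconds=10.0, sleep=time.sleep):
    """Fetch each URL with the same opener, pausing between consecutive requests."""
    pages = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(delay_seconds)  # wait between attempts, as suggested above
        with opener.open(url, timeout=30) as response:
            pages.append(response.read().decode("utf-8", errors="replace"))
    return pages
```

Typical use would be `opener, jar = build_opener_with_cookies()` followed by `fetch_all(opener, [url1, url2])`; a 503 still surfaces as `urllib.error.HTTPError`, which the caller has to handle.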

Do a Google search for "google scholar 503" to find plenty of information on this topic (that is all I did).

See also this thread: 503 error when trying to access Google Patents using python

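【Discussion】:

Well, I actually did some research on this, but found almost no useful solution.
You need to include in the original question what you researched, found, tried, and so on. For example, did you try the solutions in the answers and comments on the page I linked to (i.e. sleeping between requests, etc.)? Did they fail? Did you read the Retry-After header? Show us your code.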
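For reference, a Retry-After header carries either a number of seconds or an HTTP date. Below is a minimal sketch of turning it into a wait time; the function name and the 60-second fallback are assumptions, and whether Google Scholar actually sends this header with its 503 responses is not confirmed in this thread. With requests, the raw value would come from `response.headers.get('Retry-After')`.

```python
import email.utils
from datetime import datetime, timezone


def retry_after_seconds(header_value, now=None, default=60.0):
    """Convert a Retry-After header value into a number of seconds to wait.

    Falls back to `default` (an arbitrary assumption) when the header is
    missing or cannot be parsed.
    """
    if header_value is None:
        return default
    value = header_value.strip()
    if value.isdigit():
        # Form 1: delay in seconds, e.g. "120"
        return float(value)
    try:
        # Form 2: HTTP date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT"
        when = email.utils.parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return default
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

The result can be passed straight to `time.sleep()` before retrying.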

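【Solution 2】:

You are probably getting the 503 response code because Google detects your script as one that sends automated requests. You can always print the response code and text to see what is going on. It may be a limit on the number of requests per X amount of time, or something else.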
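As a small illustration of inspecting the response code, the sketch below maps a numeric status to its standard reason phrase with the standard library. The helper name is an assumption. With requests, the same information is available as `response.status_code` and `response.reason`, and `response.text` holds the body Google sent back with the error.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the status code with its standard reason phrase, if it has one."""
    try:
        status = HTTPStatus(code)
    except ValueError:
        return f"{code} (unknown status code)"
    return f"{status.value} {status.phrase}"


print(describe_status(503))
```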

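To avoid this, the first thing you can try is using a proxy.

Code to scrape the whole Cited by table (including the graph); it can also be tested in the online IDE (bs4 folder -> get_citedby_public_access.py):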

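from bs4 import BeautifulSoup
import requests, lxml, os, json  # lxml is imported only to make sure the parser is installed

# Browser-like User-Agent so the request does not look like it comes from a script
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Proxy addresses are read from environment variables. The target URL is https,
# so an 'https' entry is needed for the proxy to be used at all
# (the HTTPS_PROXY variable name is an assumption).
proxies = {
  'http': os.getenv('HTTP_PROXY'),
  'https': os.getenv('HTTPS_PROXY')
}

html = requests.get('https://scholar.google.com/citations?user=8Cuk5vYAAAAJ&hl=en', headers=headers, proxies=proxies, timeout=30).text
soup = BeautifulSoup(html, 'lxml')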

# Cited by and public access results
for cited_by_public_access in soup.select('.gsc_rsb'):
  citations_all = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std').text
  citations_since2016 = cited_by_public_access.select_one('tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std').text
  h_index_all = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std').text
  h_index_2016 = cited_by_public_access.select_one('tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std').text
  i10_index_all = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std').text
  i10_index_2016 = cited_by_public_access.select_one('tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std').text
  articles_num = cited_by_public_access.select_one('.gsc_rsb_m_a:nth-child(1) span').text.split(' ')[0]
  articles_link = cited_by_public_access.select_one('#gsc_lwp_mndt_lnk')['href']
  
  print('Citation info:')
  print(f'{citations_all}\n{citations_since2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n{articles_num}\nhttps://scholar.google.com{articles_link}\n')

# Graph results
years = [graph_year.text for graph_year in soup.select('.gsc_g_t')]
citations = [graph_citation.text for graph_citation in soup.select('.gsc_g_a')]

data = []

for year, citation in zip(years,citations):
  # Basic prints
  print(f'{year} {citation}\n')

  data.append({
    'year': year,
    'citation': citation,
  })

# JSON output, if needed
print(json.dumps(data, indent=2))

Partial output:

Citation info:
3208
2184
21
21
28
23
2
https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=8Cuk5vYAAAAJ

# Portion of the regular output
2007 24

2008 30

2009 46

# Portion of JSON
[
  {
    "year": "2007",
    "citation": "24"
  },
  {
    "year": "2008",
    "citation": "30"
  }
]

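Alternatively, you can use the Google Scholar Author Cited By API from SerpApi. It is a paid API with a free trial of 5,000 searches.

It does the same thing as the code above, except that you don't have to work around blocking or maintain the parser yourself.

Code to integrate: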

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google_scholar_author",
  "author_id": "m8dFEawAAAAJ",
}

search = GoogleSearch(params)
results = search.get_dict()

# Cited By and public access results
citations_all = results['cited_by']['table'][0]['citations']['all']
citations_2016 = results['cited_by']['table'][0]['citations']['since_2016']
h_index_all = results['cited_by']['table'][1]['h_index']['all']
h_index_2016 = results['cited_by']['table'][1]['h_index']['since_2016']
i10_index_all = results['cited_by']['table'][2]['i10_index']['all']
i10_index_2016 = results['cited_by']['table'][2]['i10_index']['since_2016']

print(f'{citations_all}\n{citations_2016}\n{h_index_all}\n{h_index_2016}\n{i10_index_all}\n{i10_index_2016}\n')

public_access_link = results['public_access']['link']
public_access_available_articles = results['public_access']['available']

print(f'{public_access_link}\n{public_access_available_articles}\n')

# Graph results
for graph_results in results['cited_by']['graph']:
  year = graph_results['year']
  citations = graph_results['citations']

  print(f'{year} {citations}\n')

Partial output:

946
563
17
12
27
18

https://scholar.google.com/citations?view_op=list_mandates&hl=en&user=m8dFEawAAAAJ
23

2004 6

2005 20

2006 11
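
The lookups in the code above depend on the position of each row in the 'table' list. A possible variation, shown only as a sketch, is to index the rows by metric name so that a missing or reordered row yields None instead of raising an error. The helper name is hypothetical, and the sample data simply mirrors the structure and values shown above; the value types in the real response are an assumption.

```python
def index_cited_by_table(table):
    """Merge the one-key rows of the table into a single dict keyed by metric name."""
    metrics = {}
    for row in table:
        metrics.update(row)
    return metrics


# Sample shaped like results['cited_by']['table'], using the values from the output above
sample_table = [
    {"citations": {"all": 946, "since_2016": 563}},
    {"h_index": {"all": 17, "since_2016": 12}},
    {"i10_index": {"all": 27, "since_2016": 18}},
]

metrics = index_cited_by_table(sample_table)
h_index_all = metrics.get("h_index", {}).get("all")  # None if the row is absent
print(h_index_all)
```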

Disclaimer: I work for SerpApi.

【Discussion】: