Google Scholar CAPTCHA verification problem: answers

【Question title】: Google Scholar CAPTCHA verification problem
【Posted】: 2011-05-30 20:47:51
【Question】:

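I am working on a project in which I need to extract some data from Google Scholar. My PHP program takes a string from my local machine, passes it to Google Scholar, then takes the first result on the search results page and saves it to a database.

I have to do this for nearly 90,000 strings/queries. The problem is that after a few hundred queries the program stops, because Google Scholar asks for CAPTCHA verification. What should I do?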

【Comments】:

Tags: captcha verification google-scholar

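【Answer 1】:

Since Google Scholar has no API, there is no documented way to do what you want. You shouldn't be scraping the data like this, which is why you are running into Google's bot protection. I think your only real option is to wait for Google to create an API.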

【Comments】:

Or talk to Google about what you're doing!
I doubt they would respond. There is already a Google Groups thread requesting API access.

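【Answer 2】:

You can try the Google Scholar Organic Results API from SerpApi. It is a paid API with a free plan.

It bypasses blocks from search engines using dedicated proxies and CAPTCHA-solving services, handles scaling, and removes the need to write a parser from scratch and maintain it over time.

Example code to integrate with PHP in the online IDE: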

<?php
ini_set('display_errors', 1);
ini_set('display_startup_errors', 1);
error_reporting(E_ALL);

require __DIR__ . '/vendor/autoload.php';

$queries = array(
    "moon",
    "pandas",
    "python",
    "data science",
    "ML",
    "AI",
    "animals",
    "amd",
    "nvidia",
    "intel",
    "asus",
    "robbery pi",
    "latex, tex",
    "amg",
    "blizzard",
    "world of warcraft",
    "cs go",
    "antarctica",
    "fifa",
    "amsterdam",
    "usa",
    "tesla",
    "economy",
    "ecology",
    "biology"
);

foreach ($queries as $query) {
    $params = [
        "engine" => "google_scholar",
        "q" => $query,
        "hl" => "en"
    ];

    $client = new GoogleSearch(getenv("API_KEY"));
    $response = $client->get_json($params);

    print_r("Extracting search query: {$query}\n");

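    // organic_results may be absent when a query returns no results,
    // so fall back to an empty array instead of iterating over null.
    foreach ($response->organic_results ?? [] as $result) {
        print_r("{$result->title}\n");
    }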
}
?>

Example code to integrate with Python:

from serpapi import GoogleScholarSearch
import os

queries = ["moon",
           "pandas",
           "python",
           "data science",
           "ML",
           "AI",
           "animals",
           "amd",
           "nvidia",
           "intel",
           "asus",
           "robbery pi",
           "latex, tex",
           "amg",
           "blizzard",
           "world of warcraft",
           "cs go",
           "antarctica",
           "fifa",
           "amsterdam",
           "usa",
           "tesla",
           "economy",
           "ecology",
           "biology"]

for query in queries:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar",
        "q": query,
        "hl": "en"
    }

    search = GoogleScholarSearch(params)
    results = search.get_dict()

    print(f"Extracting search query: {query}")

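    # "organic_results" may be missing when a query returns nothing,
    # so default to an empty list instead of raising KeyError.
    for result in results.get("organic_results", []):
        print(result["title"])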

Output:

Extracting search query: moon
Cellulose nanomaterials review: structure, properties and nanocomposites
Reflection in learning and professional development: Theory and practice

...

Extracting search query: biology
A new biology for a new century
The biology of mycorrhiza.
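
The question asks for only the first result of each query, saved to a database, across roughly 90,000 queries. A minimal sketch of that step in Python follows, assuming SQLite as the database and a response shaped like the one above (an "organic_results" list whose items carry a "title"). The names first_title, store_first_results, the first_result table and the scholar.db file are hypothetical, chosen for illustration only. The fetch argument is any function that maps a query string to a response dict, so the storage logic can be tried without network access.

```python
import sqlite3
from typing import Callable, Iterable, Optional


def first_title(results: dict) -> Optional[str]:
    """Return the title of the first organic result, or None if there is none."""
    organic = results.get("organic_results") or []
    return organic[0].get("title") if organic else None


def store_first_results(
    queries: Iterable[str],
    fetch: Callable[[str], dict],
    db_path: str = "scholar.db",  # hypothetical file name
) -> int:
    """Fetch every query that is not stored yet and save its first title.

    Returns the number of queries fetched during this run.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS first_result "
        "(query TEXT PRIMARY KEY, title TEXT)"
    )
    # Queries saved by an earlier run are skipped, so an interrupted
    # job can be restarted without repeating requests.
    done = {row[0] for row in conn.execute("SELECT query FROM first_result")}
    fetched = 0
    for query in queries:
        if query in done:
            continue
        title = first_title(fetch(query))
        conn.execute(
            "INSERT INTO first_result (query, title) VALUES (?, ?)",
            (query, title),
        )
        conn.commit()  # commit per query so an interruption loses at most one row
        done.add(query)
        fetched += 1
    conn.close()
    return fetched
```

With the Python client shown above, fetch could be something like lambda q: GoogleScholarSearch({"api_key": os.getenv("API_KEY"), "engine": "google_scholar", "q": q, "hl": "en"}).get_dict(). Queries with no results are stored with a NULL title, so they are not requested again on the next run.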

Disclaimer: I work for SerpApi.

【Comments】: