【Question title】: Google Scholar CAPTCHA verification problem
【Posted】: 2011-05-30 20:47:51
【Question description】:

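I'm working on a project where I need to extract some data from Google Scholar. My PHP program takes a string from my local machine, passes it to Google Scholar, then takes the first result on the search results page and saves it to a database.

I have to do this for nearly 90,000 strings/queries. The problem is that after a few hundred queries the program stops, because Google Scholar asks for CAPTCHA verification. What should I do?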

【Question comments】:

    Tags: captcha verification google-scholar


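    【Solution 1】:

    Since Google Scholar has no API, there is no documented way to do what you want. You shouldn't be scraping the data like this, which is why you are running into Google's bot-protection features. I think your only real option is to wait for Google to create an API.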

    【Comments】:

    • Or talk to Google about what you're doing!
    • I doubt they would respond. There is already a Google Groups thread requesting API access.
    【Solution 2】:

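    You could try the Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.

    It gets around search-engine blocks by using dedicated proxies and a CAPTCHA-solving service, handles scaling, and saves you from writing a parser from scratch and maintaining it over time.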
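
For a batch the size mentioned in the question (nearly 90,000 queries), it may also help to record which queries are already finished, so that an interrupted run can pick up where it stopped instead of starting over. The following is only a minimal sketch under stated assumptions: `fetch_first_title` is a hypothetical callable standing in for whatever lookup you use, and the checkpoint file name and JSON-lines layout are made up for illustration.

```python
import json
import os
import tempfile


def load_done(path):
    """Return the set of queries already recorded in the checkpoint file."""
    if not os.path.exists(path):
        return set()
    with open(path, encoding="utf-8") as fh:
        return {json.loads(line)["query"] for line in fh if line.strip()}


def run_batch(queries, fetch_first_title, path):
    """Process queries in order, skipping any already in the checkpoint.

    fetch_first_title is a hypothetical callable: query -> title or None.
    If it raises, the exception propagates; rows written so far stay on disk.
    Returns the number of queries processed in this run.
    """
    done = load_done(path)
    processed = 0
    with open(path, "a", encoding="utf-8") as fh:
        for query in queries:
            if query in done:
                continue
            title = fetch_first_title(query)
            # One JSON object per line, flushed immediately so a crash loses nothing.
            fh.write(json.dumps({"query": query, "title": title}) + "\n")
            fh.flush()
            done.add(query)
            processed += 1
    return processed


def fake_fetch(query):
    """Stand-in for a real lookup; makes no network calls."""
    return f"First result for {query}"


checkpoint = os.path.join(tempfile.mkdtemp(), "done.jsonl")
first_run = run_batch(["moon", "pandas"], fake_fetch, checkpoint)
second_run = run_batch(["moon", "pandas", "python"], fake_fetch, checkpoint)
print(first_run, second_run)
```

In the demo the second call only processes the one query that the first call had not seen. Swapping `fake_fetch` for a real lookup is left to the reader.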

    Code and example to integrate with PHP in the online IDE:

    <?php
    ini_set('display_errors', 1);
    ini_set('display_startup_errors', 1);
    error_reporting(E_ALL);
    
    require __DIR__ . '/vendor/autoload.php';
    
    $queries = array(
        "moon",
        "pandas",
        "python",
        "data science",
        "ML",
        "AI",
        "animals",
        "amd",
        "nvidia",
        "intel",
        "asus",
        "robbery pi",
        "latex, tex",
        "amg",
        "blizzard",
        "world of warcraft",
        "cs go",
        "antarctica",
        "fifa",
        "amsterdam",
        "usa",
        "tesla",
        "economy",
        "ecology",
        "biology"
    );
    
    // Create the client once and reuse it for every query.
    $client = new GoogleSearch(getenv("API_KEY"));
    
    foreach ($queries as $query) {
        $params = [
            "engine" => "google_scholar",
            "q" => $query,
            "hl" => "en"
        ];
    
        $response = $client->get_json($params);
    
        echo "Extracting search query: {$query}\n";
    
        // Fall back to an empty list when a query has no organic results.
        foreach ($response->organic_results ?? [] as $result) {
            echo "{$result->title}\n";
        }
    }
    ?>
    

    Code and example to integrate with Python:

    from serpapi import GoogleScholarSearch
    import os
    
    queries = ["moon",
               "pandas",
               "python",
               "data science",
               "ML",
               "AI",
               "animals",
               "amd",
               "nvidia",
               "intel",
               "asus",
               "robbery pi",
               "latex, tex",
               "amg",
               "blizzard",
               "world of warcraft",
               "cs go",
               "antarctica",
               "fifa",
               "amsterdam",
               "usa",
               "tesla",
               "economy",
               "ecology",
               "biology"]
    
    for query in queries:
        params = {
            "api_key": os.getenv("API_KEY"),
            "engine": "google_scholar",
            "q": query,
            "hl": "en"
            }
    
        search = GoogleScholarSearch(params)
        results = search.get_dict()
    
        print(f"Extracting search query: {query}")
    
        for result in results["organic_results"]:
            print(result["title"])
    

    Output:

    Extracting search query: moon
    Cellulose nanomaterials review: structure, properties and nanocomposites
    Reflection in learning and professional development: Theory and practice
    
    ...
    
    Extracting search query: biology
    A new biology for a new century
    The biology of mycorrhiza.
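
The question asks for only the first result of each query to be saved to a database, whereas the loops above print every title. A rough sketch of that last step, assuming the response is a dict shaped like the one used in the Python example (an `organic_results` list whose items have a `title`); the `first_results` table, the `save_first_result` helper and the use of SQLite are illustrative choices, not part of any library:

```python
import sqlite3


def save_first_result(conn, query, results):
    """Store the title of the first organic result for a query.

    results is assumed to be a dict with an optional "organic_results"
    list of dicts that each carry a "title" key.
    Returns the stored title, or None when there were no results.
    """
    organic = results.get("organic_results") or []
    title = organic[0].get("title") if organic else None
    conn.execute(
        "INSERT OR REPLACE INTO first_results (query, title) VALUES (?, ?)",
        (query, title),
    )
    conn.commit()
    return title


# In-memory database for the demo; a real run would use a file path.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_results (query TEXT PRIMARY KEY, title TEXT)")

# Hand-written sample response, not real API output.
sample = {
    "organic_results": [
        {"title": "A new biology for a new century"},
        {"title": "The biology of mycorrhiza."},
    ]
}
save_first_result(conn, "biology", sample)
save_first_result(conn, "no hits", {})

rows = conn.execute(
    "SELECT query, title FROM first_results ORDER BY query"
).fetchall()
print(rows)
```

Queries with no results are stored with a NULL title, so they are not retried by accident and can still be reviewed later.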
    

    Disclaimer: I work for SerpApi.
