How to parse Google custom search JavaScript output in Python?

【Question title】: How to parse Google custom search javascript output in python?
【Posted】: 2021-11-07 15:33:23
【Question description】:

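I am trying to fetch some articles from the ACL website based on an input keyword. The site uses the Google custom search API, and the output of the API is a JavaScript object.

How can I parse the returned object in Python and get the article title, URL, and abstract of each research paper from it?

The script I use to fetch the articles: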

import requests


params = (
    ('rsz', 'filtered_cse'),
    ('num', '10'),
    ('hl', 'en'),
    ('source', 'gcsc'),
    ('gss', '.com'),
    ('cselibv', 'cc267ab8871224bd'),
    ('cx', '000299513257099441687:fkkgoogvtaw'),
    ('q', 'multi-label text classification'),
    ('safe', 'off'),
    ('cse_tok', 'AJvRUv1dd6NHqw5GKAoRSg3lLILE:1636278007905'),
    ('sort', ''),
    ('exp', 'csqr,cc,4618906'),
    ('callback', 'google.search.cse.api12760'),
)

response = requests.get('https://cse.google.com/cse/element/v1', params=params)

print(response.headers['Content-Type'])
# 'application/javascript; charset=utf-8'

The output looks like this:

'/*O_o*/\ngoogle.search.cse.api12760({\n  "cursor": {\n    "currentPageIndex": 0,\n    "estimatedResultCount": "21600",\n    "moreResultsUrl": "http://www.google.com/cse?oe=utf8&ie=utf8&source=uds&q=multi-label+text+classification&safe=off&sort=&cx=000299513257099441687:fkkgoogvtaw&start=0",\n    "resultCount": "21,600",\n    "searchResultTime": "0.16",\n    "pages": [\n      {\n        "label": 1,\n        "start": "0"\n      },\n      {\n        "label": 2,\n        "start": "10"\n      },\n      {\n        "label":

However, when I run the search in the browser, the output shown in Chrome's Network tab is JSON.

How can I get the articles and their links from the JS object in Python?

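【Comments】:

Hope this article helps you: click here to navigate
Maybe if you skip callback, the response will be sent as pure JSON, and you can use the json module to convert it to a Python dictionary. For now, you can remove /*O_o*/\ngoogle.search.cse.api12760( from the start of the string and ); from the end, and you should have JSON that you can convert to a Python dictionary.
Do you really need to scrape the abstract part? If so, you can use an automation tool such as selenium to do it, because the API does not return the complete data.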

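Tags: python python-3.x web-scraping python-requests

【Solution 1】:

response.text gives you a string. If you remove /*O_o*/\ngoogle.search.cse.api12760( at the beginning and ); at the end, you are left with normal JSON, which you can convert to a Python dictionary with json.loads(). Then you can use [key] to get data from the dictionary.

Minimal working example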

import requests
import json

params = (
    ('rsz', 'filtered_cse'),
    ('num', '10'),
    ('hl', 'en'),
    ('source', 'gcsc'),
    ('gss', '.com'),
    ('cselibv', 'cc267ab8871224bd'),
    ('cx', '000299513257099441687:fkkgoogvtaw'),
    ('q', 'multi-label text classification'),
    ('safe', 'off'),
    ('cse_tok', 'AJvRUv1dd6NHqw5GKAoRSg3lLILE:1636278007905'),
    ('sort', ''),
    ('exp', 'csqr,cc,4618906'),
    ('callback', 'google.search.cse.api12760'),
)

response = requests.get('https://cse.google.com/cse/element/v1', params=params)

start = len('''/*O_o*/
google.search.cse.api12760(''')
end = len(');')

text = response.text[start:-end]
data = json.loads(text)

#print(data)

for item in data['results']:
    #print('keys:', item.keys())
    print('title:', item['title'])
    print('url:', item['url'])
    #print('content:', item['content'])
    #print('title:', item['titleNoFormatting'])
    #meta = item['richSnippet']['metatags']
    #if 'author' in meta:
    #    print('author:', meta['author'])
    print('---')

Result:

title: Large-Scale <b>Multi</b>-<b>Label Text Classification</b> on EU Legislation - ACL ...
url: https://www.aclweb.org/anthology/P19-1636/
---
title: <b>Label</b>-Specific Document Representation for <b>Multi</b>-<b>Label Text</b> ...
url: https://www.aclweb.org/anthology/D19-1044/
---
title: Initializing neural networks for hierarchical <b>multi</b>-<b>label text</b> ...
url: https://www.aclweb.org/anthology/W17-2339
---
title: TaxoClass: Hierarchical <b>Multi</b>-<b>Label Text Classification</b> Using Only ...
url: https://www.aclweb.org/anthology/2021.naacl-main.335/
---
title: NeuralClassifier: An Open-source Neural Hierarchical <b>Multi</b>-<b>label</b> ...
url: https://www.aclweb.org/anthology/P19-3015/
---
title: Extreme <b>Multi</b>-<b>Label</b> Legal <b>Text Classification</b>: A Case Study in EU ...
url: https://www.aclweb.org/anthology/W19-2209
---
title: Hierarchical Transfer Learning for <b>Multi</b>-<b>label Text Classification</b> ...
url: https://www.aclweb.org/anthology/P19-1633/
---
title: Global Model for Hierarchical <b>Multi</b>-<b>Label Text Classification</b> - ACL ...
url: https://www.aclweb.org/anthology/I13-1006
---
title: Hierarchical <b>Multi</b>-<b>label Classification</b> of <b>Text</b> with Capsule Networks ...
url: https://www.aclweb.org/anthology/P19-2045
---
title: Improving Pretrained Models for Zero-shot <b>Multi</b>-<b>label Text</b> ...
url: https://www.aclweb.org/anthology/2021.naacl-main.83.pdf
---
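
The fixed-length slice in the example depends on the exact callback name and on the response ending in );. As a rough alternative sketch (not part of the original answer), the wrapper can be removed by taking everything between the first ( and the last ). The helper name unwrap_jsonp and the sample string are hypothetical, and the sample only imitates the shape of the response shown in the question.

```python
import json


def unwrap_jsonp(text):
    """Return the payload of a JSONP-style response as a Python object.

    Hypothetical helper: it keeps whatever sits between the first '(' and
    the last ')', so it does not depend on the callback name or on the
    leading /*O_o*/ comment.
    """
    stripped = text.strip()
    # If the server already sent plain JSON, parse it directly.
    if stripped.startswith(('{', '[')):
        return json.loads(stripped)
    start = stripped.find('(')
    end = stripped.rfind(')')
    if start == -1 or end < start:
        raise ValueError('no callback wrapper found in response text')
    return json.loads(stripped[start + 1:end])


# Made-up sample shaped like the response in the question (assumption).
sample = (
    '/*O_o*/\ngoogle.search.cse.api12760('
    '{"results": [{"title": "A", "url": "https://example.org/a"}]});'
)
data = unwrap_jsonp(sample)
print(data['results'][0]['url'])
```

With the real request this would be called as unwrap_jsonp(response.text) in place of the start/end slicing.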

BTW:

If you display item.keys(), you can see what else is available:

'cacheUrl', 'clicktrackUrl', 'content', 'contentNoFormatting', 
'title', 'titleNoFormatting', 'formattedUrl', 'unescapedUrl', 'url', 
'visibleUrl', 'richSnippet', 'breadcrumbUrl'

Or you can use a for loop to display all keys and values

for item in data['results']:
    for key, value in item.items():
        print(f'{key}: {value}')
        print('---')
    print('===================================')

Some of them may contain sub-dictionaries - for example item['richSnippet']['metatags']['author']
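
Because keys such as richSnippet, metatags or author may be missing on some results, chaining .get() avoids a KeyError. The sketch below is only an illustration: summarize_result is a hypothetical helper, the sample item is made up, and it assumes the key names listed above. Note that content looks like a short search snippet rather than the paper's full abstract, which matches the comment that the API does not return the complete data.

```python
def summarize_result(item):
    """Pick title, URL, snippet and author out of one entry of data['results']."""
    # 'richSnippet' and 'metatags' are not guaranteed to exist on every
    # result, so fall back to empty dicts instead of raising KeyError.
    meta = item.get('richSnippet', {}).get('metatags', {})
    return {
        # Prefer the variants without <b> highlighting when they are present.
        'title': item.get('titleNoFormatting', item.get('title')),
        'url': item.get('url'),
        'snippet': item.get('contentNoFormatting', item.get('content')),
        'author': meta.get('author'),
    }


# Made-up item that reuses the key names listed above (values are invented).
item = {
    'title': '<b>Multi</b>-<b>Label</b> example',
    'titleNoFormatting': 'Multi-Label example',
    'url': 'https://example.org/paper',
    'content': 'We study <b>multi</b>-<b>label</b> ...',
    'richSnippet': {'metatags': {'author': 'Jane Doe'}},
}
row = summarize_result(item)
print(row['title'], row['url'], row['author'])
```

In the loop from the answer this would be used as summarize_result(item) for each item in data['results'].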

【Comments】: