Is there any way to scrape the links to all patents from a Google Patents search?

【Question title】: Is there any way to scrape the links to all patents from a Google Patents search?
【Posted】: 2021-06-02 16:47:52
【Question】:

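I would like to scrape the links to patents from a Google Patents search using BeautifulSoup, but I'm not sure whether Google converts its HTML into JavaScript that BeautifulSoup cannot parse, or what the problem is.

Here is some simple code: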

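import requests
from bs4 import BeautifulSoup

url = 'https://patents.google.com/?assignee=Roche&after=priority:20110602&type=PATENT&num=100'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

links = []
# Prints nothing: the HTML returned by this request contains no <a> tags
for link in soup.find_all('a', href=True):
    print(link['href'])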

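I would also like to append the links to a list, but nothing is printed because there are no 'a' tags in the soup. Is there any way to get the links to all of the patents?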

【Comments】:

Tags: python beautifulsoup google-patent-search

【Answer 1】:

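The data is rendered dynamically, so it is hard to get with bs4. Instead, open Chrome's developer tools.

Go to the Network tab, select the XHR filter, and reload the page. The requests will be listed under the Name column, and one of them returns all of the data in JSON format.

Copy the address of that request and call it with the requests module; you can then extract whatever data you want.

If you want the individual links, each one is built from the publication_number, which you can join with the original URL to get the link to the publication.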
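The address copied from the Network tab looks like the original search query, percent-encoded and passed as the `url` parameter of the `/xhr/query` endpoint. Below is a minimal sketch that builds it from the query string instead of copying it by hand. It assumes that pattern holds: the endpoint is undocumented and may change, and `build_xhr_url` is a hypothetical helper name, not part of any library.

```python
from urllib.parse import quote

# Assumption: endpoint as observed in the browser's Network tab; it is not a documented API
XHR_ENDPOINT = "https://patents.google.com/xhr/query"


def build_xhr_url(query: str) -> str:
    """Wrap a Google Patents search query in the XHR endpoint URL."""
    # safe="" makes quote() encode '=', '&' and ':' as well,
    # which matches the encoded form seen in the Network tab
    return f"{XHR_ENDPOINT}?url={quote(query, safe='')}&exp="


print(build_xhr_url("assignee=Roche&after=priority:20110602&type=PATENT&num=100"))
```

The printed address can then be passed to `requests.get(...)` like any other URL.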

import requests

main_url = "https://patents.google.com/"
params = "?assignee=Roche&after=priority:20110602&type=PATENT&num=100"

# XHR request found in the browser's Network tab; it returns the search results as JSON
res = requests.get("https://patents.google.com/xhr/query?url=assignee%3DRoche%26after%3Dpriority%3A20110602%26type%3DPATENT%26num%3D100&exp=")
main_data = res.json()
data = main_data['results']['cluster']

# Each result carries a publication number, which is enough to build the patent's URL
for result in data[0]['result']:
    num = result['patent']['publication_number']
    print(num)
    print(main_url + "patent/" + num + "/en" + params)

Output:

US10287352B2
https://patents.google.com/patent/US10287352B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10364292B2
https://patents.google.com/patent/US10364292B2/en?assignee=Roche&after=priority:20110602&type=PATENT&num=100
US10494633B2
.....
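The question also asked for the links to be collected in a list rather than printed. Below is a minimal sketch of that, assuming the JSON has the `results` → `cluster` → `result` → `patent` → `publication_number` nesting used in the code above. `collect_patent_links` is a hypothetical helper, and `sample` is a hand-made payload containing only those keys, not a real response.

```python
def collect_patent_links(payload: dict, base_url: str = "https://patents.google.com/") -> list:
    """Return one patent URL per result found in the parsed JSON payload."""
    links = []
    # Walk every cluster (the code above reads only the first one)
    for cluster in payload.get("results", {}).get("cluster", []):
        for item in cluster.get("result", []):
            number = item.get("patent", {}).get("publication_number")
            if number:  # skip entries without a publication number
                links.append(f"{base_url}patent/{number}/en")
    return links


# Hand-made stand-in for res.json(); a real response carries many more fields
sample = {
    "results": {
        "cluster": [
            {
                "result": [
                    {"patent": {"publication_number": "US10287352B2"}},
                    {"patent": {"publication_number": "US10364292B2"}},
                ]
            }
        ]
    }
}

print(collect_patent_links(sample))
# ['https://patents.google.com/patent/US10287352B2/en', 'https://patents.google.com/patent/US10364292B2/en']
```

With a live response, `collect_patent_links(res.json())` would take the place of the printing loop. The `.get(...)` lookups are a defensive choice so that a missing key yields an empty list instead of a `KeyError`.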

【Comments】: