Scrape data from NCBI books section? Answers

【Question title】: Scrape data from NCBI books section?
【Posted】: 2020-10-24 06:45:49
【Question】:

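I am currently writing a program that requires me to scrape articles from NCBI.

I am using the Entrez utilities to do this (https://www.ncbi.nlm.nih.gov/books/NBK25497/).

I have already worked out how to do this with PubMed data, using handle = Entrez.efetch(db='pubmed', id=pmid, retmode='text', rettype='abstract').

However, I would like to scrape data from the NCBI books section, because the PubMed records contain incomplete articles (compare, for example, https://pubmed.ncbi.nlm.nih.gov/20301533/ with https://www.ncbi.nlm.nih.gov/books/NBK1359/).

I have a list of all the GeneReviews IDs (e.g. NB1359, NB1400, etc.) in a text file, but I am not sure how to scrape this data, because handle = Entrez.esearch(db='books', term="NB1359", retmode='text') does not return the text of the article.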

【Question comments】:

Tags: python text-mining biopython ncbi

【Solution 1】:

I don't see a way to do this with Entrez.esearch either, but I would simply download the printable version of the page and parse it directly:

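import requests
from bs4 import BeautifulSoup

genereview_ids = ['NBK1359', 'NBK1400']

for genereview_id in genereview_ids:
    # Request the printable version of the book page.
    url = f"https://www.ncbi.nlm.nih.gov/books/{genereview_id}/?report=printable"
    r = requests.get(url)
    r.raise_for_status()  # stop on HTTP errors instead of parsing an error page
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser')
    # Print the page summary stored in the <meta name="description"> tag.
    print(soup.find('meta', {'name': 'description'})['content'])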

Output:

The purpose of this overview is to increase the awareness of clinicians regarding congenital diaphragmatic hernia and its genetic causes and management.
Cystinosis comprises three allelic phenotypes:
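
The loop above prints only each page's meta description, while the question asks for the article text and keeps its IDs in a text file. One possible extension is sketched here. It rests on assumptions that are not confirmed by the question or the answer: the ID file holds one ID per line, and the body text of the printable page sits inside <p> elements. The helper names (printable_url, read_ids, extract_paragraphs) are hypothetical. The parsing step uses the standard-library html.parser so that it can be tried on a saved page without network access; BeautifulSoup would work just as well.

```python
from html.parser import HTMLParser
from pathlib import Path


def printable_url(book_id):
    """Build the printable-page URL in the same form the answer uses."""
    return f"https://www.ncbi.nlm.nih.gov/books/{book_id}/?report=printable"


def read_ids(path):
    """Read IDs from a text file, one per line (assumed layout), skipping blanks."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


class ParagraphText(HTMLParser):
    """Collect the text of every <p> element, one string per paragraph."""

    def __init__(self):
        super().__init__()
        self._in_p = False
        self._chunks = []
        self.paragraphs = []

    def _flush(self):
        # Join the collected pieces and collapse runs of whitespace.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                # A new <p> implicitly closes an unclosed one.
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        # Text inside inline tags such as <b> still belongs to the paragraph.
        if self._in_p:
            self._chunks.append(data)


def extract_paragraphs(html_doc):
    """Return the paragraph texts found in an HTML document."""
    parser = ParagraphText()
    parser.feed(html_doc)
    parser.close()
    return parser.paragraphs
```

With these helpers, the answer's loop could iterate over read_ids("genereview_ids.txt") (a hypothetical file name), fetch printable_url(genereview_id) with requests.get as before, and pass r.text to extract_paragraphs. Whether the result covers the whole article depends on the page's actual markup, so it is worth checking against one page by hand first.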

【Discussion】: