Is there any way to get abstracts for a given list of pubmed ids? (Answers)

【Question title】: Is there any way to get abstracts for a given list of pubmed ids?
【Posted】: 2017-11-29 18:07:33
【Question】:

I have a list of pmids and I want to get the abstracts for both of them in a single URL request.

    pmids = [17284678, 9997]
    abstract_dict = {}
    url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?"
           "db=pubmed&id=17284678,9997&retmode=text&rettype=xml")

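My requirement is to get the result in this format:

    abstract_dict = {"pmid1": "abstract1", "pmid2": "abstract2"}

I can get this format by requesting each id in turn and updating the dictionary, but to save time I would like to pass all the ids in one URL, process the response, and keep only the abstract part.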
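
For illustration, the single-request URL described in the question can be assembled from the list rather than typed by hand. This is only a sketch: `build_efetch_url` and `EFETCH_URL` are names made up here, and the `retmode`/`rettype` defaults simply mirror the values used in the question rather than a recommendation.

```python
from urllib.parse import urlencode

# Endpoint taken from the question; the constant name is an assumption.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids, retmode="text", rettype="xml"):
    """Return one efetch URL that covers every PubMed ID in `pmids`."""
    params = {
        "db": "pubmed",
        # All ids go into a single comma-separated `id` parameter.
        "id": ",".join(str(pmid) for pmid in pmids),
        "retmode": retmode,
        "rettype": rettype,
    }
    # safe="," leaves the comma separators as-is instead of encoding them as %2C.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


print(build_efetch_url([17284678, 9997]))
```

The function only builds the string; fetching and parsing the response is a separate step.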

【Comments】:

Tags: biopython pubmed

【Solution 1】:

With BioPython, you can pass a comma-joined list of PubMed IDs to Entrez.efetch, which performs a single URL lookup:

from Bio import Entrez

Entrez.email = 'your_email@provider.com'

pmids = [17284678,9997]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']]


abstract_dict = dict(zip(pmids, abstracts))

The result looks like this:

{9997: 'Electron paramagnetic resonance and magnetic susceptibility studies of Chromatium flavocytochrome C552 and its diheme flavin-free subunit at temperatures below 45 degrees K are reported. The results show that in the intact protein and the subunit the two low-spin (S = 1/2) heme irons are distinguishable, giving rise to separate EPR signals. In the intact protein only, one of the heme irons exists in two different low spin environments in the pH range 5.5 to 10.5, while the other remains in a constant environment. Factors influencing the variable heme iron environment also influence flavin reactivity, indicating the existence of a mechanism for heme-flavin interaction.',
 17284678: 'Eimeria tenella is an intracellular protozoan parasite that infects the intestinal tracts of domestic fowl and causes coccidiosis, a serious and sometimes lethal enteritis. Eimeria falls in the same phylum (Apicomplexa) as several human and animal parasites such as Cryptosporidium, Toxoplasma, and the malaria parasite, Plasmodium. Here we report the sequencing and analysis of the first chromosome of E. tenella, a chromosome believed to carry loci associated with drug resistance and known to differ between virulent and attenuated strains of the parasite. The chromosome--which appears to be representative of the genome--is gene-dense and rich in simple-sequence repeats, many of which appear to give rise to repetitive amino acid tracts in the predicted proteins. Most striking is the segmentation of the chromosome into repeat-rich regions peppered with transposon-like elements and telomere-like repeats, alternating with repeat-free regions. Predicted genes differ in character between the two types of segment, and the repeat-rich regions appear to be associated with strain-to-strain variation.'}
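
One thing to keep in mind with `['AbstractText'][0]`: it takes only the first element of the `AbstractText` list. For abstracts that PubMed splits into labelled sections, that would be just the first section. A minimal sketch of joining all sections follows. `full_abstract` is a hypothetical helper, the sample data is a stand-in rather than real PubMed content, and reading a `Label` from an `attributes` dict on each section is an assumption about the parsed elements that should be checked against your BioPython version.

```python
def full_abstract(article):
    """Join every AbstractText section of one Article record into one string.

    Returns None when the article has no abstract.
    """
    if "Abstract" not in article:
        return None
    parts = []
    for section in article["Abstract"]["AbstractText"]:
        # Assumption: parsed sections may expose a Label (e.g. "METHODS") in an
        # `attributes` dict; plain strings have none, so default to an empty dict.
        label = getattr(section, "attributes", {}).get("Label")
        text = str(section)
        parts.append(f"{label}: {text}" if label else text)
    return " ".join(parts)


# Stand-in data shaped like a parsed Article record.
structured = {"Abstract": {"AbstractText": ["Background text.", "Methods text."]}}
print(full_abstract(structured))
```

With real records this would be called as `full_abstract(pubmed_article['MedlineCitation']['Article'])`.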

Edit:

If some of the pmids have no corresponding abstract, be careful with the fix you suggested:

abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             for pubmed_article in records['PubmedArticle']
             if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]

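Suppose you have the list of PubMed IDs pmids = [1, 2, 3], but pmid 2 has no abstract, so abstracts = ['abstract of 1', 'abstract of 3'].

This causes a problem in the last step, where I zip the two lists together to build a dictionary: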

>>> abstract_dict = dict(zip(pmids, abstracts))
>>> print(abstract_dict)
{1: 'abstract of 1', 
 2: 'abstract of 3'}

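Note that the abstracts are now out of sync with their PubMed IDs: you did not filter the pmids without an abstract out of pmids, and zip truncates to the shorter list.

Do this instead: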

abstract_dict = {}
without_abstract = []

for pubmed_article in records['PubmedArticle']:
    pmid = int(str(pubmed_article['MedlineCitation']['PMID']))
    article = pubmed_article['MedlineCitation']['Article']
    if 'Abstract' in article:
        abstract = article['Abstract']['AbstractText'][0]
        abstract_dict[pmid] = abstract
    else:
        without_abstract.append(pmid)

print(abstract_dict)
print(without_abstract)

【Comments】:

I tried your code. It works for some articles but raises a KeyError for others that have no abstract. The KeyError is attached below:
    KeyError Traceback (most recent call last) in ()
          1 abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
    ----> 2 for pubmed_article in records['PubmedArticle']
          3 ]
    KeyError: 'Abstract'
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0] for pubmed_article in records['PubmedArticle'] if 'Abstract' in pubmed_article['MedlineCitation']['Article'].keys()]
Note that you now also need to filter the PubMed IDs without an abstract out of pmids, otherwise abstract_dict will be out of sync. See my edit.
The code you added correctly separates the pmids with and without abstracts. I'm a bit curious here: a pmid without an abstract can still have an article title, so I would like to get the title for those pmids. I chose this approach so that no pmid without an abstract is left out.

【Solution 2】:

from Bio import Entrez

Entrez.email = 'your_email@provider.com'
pmids = [29090559, 29058482, 28991880, 28984387, 28862677, 28804631, 28801717,
         28770950, 28768831, 28707064, 28701466, 28685492, 28623948, 28551248]
handle = Entrez.efetch(db="pubmed", id=','.join(map(str, pmids)),
                       rettype="xml", retmode="text")
records = Entrez.read(handle)
# Use the first abstract section, or fall back to the article title
# when the record has no abstract.
abstracts = [pubmed_article['MedlineCitation']['Article']['Abstract']['AbstractText'][0]
             if 'Abstract' in pubmed_article['MedlineCitation']['Article']
             else pubmed_article['MedlineCitation']['Article']['ArticleTitle']
             for pubmed_article in records['PubmedArticle']]
# zip pairs by position, so this assumes one record per pmid, in request order.
abstract_dict = dict(zip(pmids, abstracts))
print(abstract_dict)
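
As a variation on the title fallback, each entry can be keyed by the PMID stored in the record itself instead of zipping against the input list, so the mapping does not depend on the order or number of records returned. This is a sketch only: `abstracts_or_titles` is a made-up name, and the sample `records` below is a hand-written stand-in with the same nesting as the parsed result, not real PubMed data.

```python
def abstracts_or_titles(records):
    """Map each PMID to its first abstract section, or to the title if none."""
    result = {}
    for pubmed_article in records["PubmedArticle"]:
        citation = pubmed_article["MedlineCitation"]
        # Read the id from the record so it cannot drift out of sync.
        pmid = int(str(citation["PMID"]))
        article = citation["Article"]
        if "Abstract" in article:
            result[pmid] = str(article["Abstract"]["AbstractText"][0])
        else:
            result[pmid] = str(article["ArticleTitle"])
    return result


# Stand-in for the parsed efetch result: pmid 2 has no abstract.
records = {
    "PubmedArticle": [
        {"MedlineCitation": {"PMID": "1", "Article": {
            "ArticleTitle": "Title of 1",
            "Abstract": {"AbstractText": ["abstract of 1"]}}}},
        {"MedlineCitation": {"PMID": "2", "Article": {
            "ArticleTitle": "Title of 2"}}},
    ]
}
print(abstracts_or_titles(records))
```

With real data, `records` would be the object returned by `Entrez.read(handle)`.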

【Comments】: