How to download PubMed articles and read them? Answers

【Question title】: How to download PubMed articles and read them?
【Posted】: 2026-01-11 03:30:02
【Question description】:

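I can't save PubMed articles and read them back. I saw on this page that there are some special file types, but none of them worked for me. I want to save the articles in a way that lets me retrieve the data by key afterwards. If I save an article as a text file, I don't know whether I can still use it. This is my code: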

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible for browsing the biological databases
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''
class DownloadArticles():
    def __init__(self):
        Entrez.email='myemail@gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method to read the data as text.'''
    def saveArticlesFilesInXMLMode(self,dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/"+ ids + ".fasta"
        # if not os.path.exists(directory):
        # os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())

I get this error: ValueError: No records found in handle. Can someone please help me?

Now my code looks like this. I am trying to write one function that saves to .fasta, as you did, and another that reads the .fasta file, as in the answer above.

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path  + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email='myemail@gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path ="/run/media/Dropbox/codigos/Codes/"+dbName
save_Articles_Files(dbName, idNum, rettypeName)

But my function doesn't work, and I need some help!

【Question comments】:

Tags: python-3.x io bioinformatics biopython

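【Solution 1】:

You are mixing up two concepts.

1) Entrez.efetch() is used to access NCBI. In your case, you are downloading an article from PubMed. The result you get from net_handle.read() looks like this: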

PMID- 26837606
OWN - NLM
STAT- In-Process
DA  - 20160203
LR  - 20160210
IS  - 2045-2322 (Electronic)
IS  - 2045-2322 (Linking)
VI  - 6
DP  - 2016 Feb 03
TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
PG  - 20315
LID - 10.1038/srep20315 [doi]
AB  - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted 
      genome modification in eukaryotic organisms from yeast to human cell lines. Its
      successful application in several plant species promises enormous potential for
      basic and applied plant research. However, extensive studies are still needed to 
      assess this system in other important plant species, to broaden its fields of
      application and to improve methods. Here we showed that the CRISPR/Cas9 system is
      efficient in petunia (Petunia hybrid), an important ornamental plant and a model 
      for comparative research. When PDS was used as target gene, transgenic shoot
      lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
      Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
      readily generated and identified in the first generation. A sequential
      transformation strategy--introducing Cas9 and sgRNA expression cassettes
      sequentially into petunia--can be used to make targeted mutations with short
      indels or chromosomal fragment deletions. Our results present a new plant species
      amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
      exploitation.
FAU - Zhang, Bin
AU  - Zhang B
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Xia
AU  - Yang X
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Yang, Chunping
AU  - Yang C
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Li, Mingyang
AU  - Li M
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
FAU - Guo, Yulong
AU  - Guo Y
AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
      Horticulture Science for Southern Mountainous Regions, Ministry of Education,
      College of Horticulture and Landscape Architecture, Southwest University,
      Chongqing 400716, China.
LA  - eng
PT  - Journal Article
PT  - Research Support, Non-U.S. Gov't
DEP - 20160203
PL  - England
TA  - Sci Rep
JT  - Scientific reports
JID - 101563288
SB  - IM
PMC - PMC4738242
OID - NLM: PMC4738242
EDAT- 2016/02/04 06:00
MHDA- 2016/02/04 06:00
CRDT- 2016/02/04 06:00
PHST- 2015/09/21 [received]
PHST- 2015/12/30 [accepted]
AID - srep20315 [pii]
AID - 10.1038/srep20315 [doi]
PST - epublish
SO  - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.

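2) SeqIO.read() is used to read and parse FASTA files. FASTA is a format for storing sequences. A sequence in FASTA format is represented as a series of lines. The first line starts with a ">" (greater-than) symbol and holds a unique description of the sequence. The lines after it contain the actual sequence in the standard one-letter code.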
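To make the format concrete, here is a minimal sketch that uses only the standard library. The helper name `looks_like_fasta` and both sample strings are hypothetical, introduced purely for illustration. It is a rough check on the first line, not how SeqIO itself validates its input.

```python
def looks_like_fasta(text):
    """Return True if the first non-blank line starts with '>' (a FASTA header)."""
    for line in text.splitlines():
        if line.strip():
            return line.startswith(">")
    return False  # empty input has no header line at all


# Hypothetical sample data, for illustration only.
fasta_text = ">seq1 example description\nACGTACGT\nACGT\n"
medline_text = "PMID- 26837606\nOWN - NLM\n"

print(looks_like_fasta(fasta_text))    # True
print(looks_like_fasta(medline_text))  # False
```

The MEDLINE text fails this check because its first line begins with a field tag (PMID) rather than a ">" header.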

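As you can see, the result returned by Entrez.efetch() (which I pasted above) does not look like a FASTA file. That is why SeqIO.read() raises the error saying it cannot find any sequence records in the file.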
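Since the question's code already imports Bio.Medline, one option is to parse the downloaded text with that module instead of SeqIO. The sketch below assumes Biopython is installed. It uses a shortened excerpt of the record above, held in the hypothetical variable `medline_text`, so that it runs without network access. With a real download, the handle returned by Entrez.efetch() could presumably be passed to Medline.read() in the same way.

```python
import io

from Bio import Medline

# Shortened excerpt of the MEDLINE record shown above (illustration only).
medline_text = (
    "PMID- 26837606\n"
    "TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.\n"
    "FAU - Zhang, Bin\n"
    "AU  - Zhang B\n"
    "FAU - Yang, Xia\n"
    "AU  - Yang X\n"
    "TA  - Sci Rep\n"
)

# Medline.read() expects a handle that contains a single record;
# io.StringIO stands in here for the handle returned by Entrez.efetch().
record = Medline.read(io.StringIO(medline_text))

# The record is a dict subclass keyed by the MEDLINE field tags.
print(record["PMID"])
print(record["TI"])
print(record["AU"])  # repeated tags such as AU are collected into a list
print(record.get("AB", "no abstract in this excerpt"))
```

The parsed record behaves like a dictionary keyed by field tag (PMID, TI, AU and so on), which matches the goal stated in the question of retrieving the data by key.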

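【Comments】:

So how do I get this output?
You already have all the code you need to get this output. As I wrote, it is the output of net_handle.read().
You can print() it or write it to a file, as you are already doing. Just don't parse it with SeqIO.read(). If you open the file /dataset/Pubmed/DatasetArticles/26837606.fasta, it probably already contains this output.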
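A minimal sketch of the saving step, assuming the downloaded data is available as any file-like handle. The function name `save_handle_text`, the `.txt` extension and the temporary directory are hypothetical choices made for illustration. An io.StringIO stands in for the handle returned by Entrez.efetch() so that the example runs offline.

```python
import io
import os
import tempfile


def save_handle_text(handle, directory, article_id):
    """Write everything readable from `handle` to <directory>/<article_id>.txt."""
    os.makedirs(directory, exist_ok=True)
    # os.path.join inserts the path separator between directory and file name.
    filename = os.path.join(directory, article_id + ".txt")
    with open(filename, "w", encoding="utf-8") as out_handle:
        out_handle.write(handle.read())
    return filename


# Stand-in for the network handle; the content is a two-line excerpt only.
fake_handle = io.StringIO("PMID- 26837606\nOWN - NLM\n")
out_dir = tempfile.mkdtemp()
saved_path = save_handle_text(fake_handle, out_dir, "26837606")

# Read the file back as plain text; no sequence parser is involved.
with open(saved_path, encoding="utf-8") as in_handle:
    saved_text = in_handle.read()
print(saved_text)
```

Using a `.txt` extension instead of `.fasta` is only a naming choice, but it avoids suggesting that the file holds sequence data.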
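My output is different; it is not as structured as yours.
Thank you so much!!! Can you tell me whether it captures all of the article's metadata?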