[Question title]: How to download PubMed articles and read them?
[Posted]: 2026-01-11 03:30:02
[Question description]:

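I am unable to save PubMed articles and read them back. I saw on this page that there are some special file types, but none of them worked for me. I want to save the articles in a way that lets me retrieve the data by key later on. If I save an article as a plain text file, I don't know whether I can still use it. This is my code: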

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

'''Class Crawler is responsible to browse the biological databases
from DownloadArticles import DownloadArticles
c = DownloadArticles()
c.articles_dataset_list
'''
class DownloadArticles():
    def __init__(self):
        Entrez.email='myemail@gmail.com'
        self.dataC = self.saveArticlesFilesInXMLMode('pubmed', '26837606')

    '''Method for reading the data in text form.'''
    def saveArticlesFilesInXMLMode(self,dbs, ids):
        net_handle = Entrez.efetch(db=dbs, id=ids, rettype="medline", retmode="txt")
        directory = "/dataset/Pubmed/DatasetArticles/"+ ids + ".fasta"
        # if not os.path.exists(directory):
        # os.makedirs(directory)
        # filename = directory + '/'
        # if not os.path.exists(filename):
        out_handle = open(directory, "w+")
        out_handle.write(net_handle.read())
        out_handle.close()
        net_handle.close()
        print("Saved")
        print("Parsing...")
        record = SeqIO.read(directory, "fasta")
        print(record)
        return(record.read())

I get this error: ValueError: No records found in handle. Can someone please help me?


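Now my code looks like this. I am trying to write one function that saves to a .fasta file, as you did, and another one that reads the .fasta file, as in the answer above.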

import sys
from Bio import Entrez
import re
import os
from Bio import Medline
from Bio import SeqIO

def save_Articles_Files(dbName, idNum, rettypeName):
    net_handle = Entrez.efetch(db=dbName, id=idNum, rettype=rettypeName, retmode="txt")
    filename = path  + idNum + ".fasta"
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

Entrez.email='myemail@gmail.com'
dbName = 'pubmed'
idNum = '26837606'
rettypeName = "medline"
path ="/run/media/Dropbox/codigos/Codes/"+dbName
save_Articles_Files(dbName, idNum, rettypeName)

But my function doesn't work, and I need some help!

[Question discussion]:

    Tags: python-3.x io bioinformatics biopython


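    [Solution 1]:

    You are mixing up two concepts.

    1) Entrez.efetch() is used to access NCBI. In your case, you are downloading an article from PubMed. The result you get from net_handle.read() looks like this: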

    PMID- 26837606
    OWN - NLM
    STAT- In-Process
    DA  - 20160203
    LR  - 20160210
    IS  - 2045-2322 (Electronic)
    IS  - 2045-2322 (Linking)
    VI  - 6
    DP  - 2016 Feb 03
    TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in Petunia.
    PG  - 20315
    LID - 10.1038/srep20315 [doi]
    AB  - Recently, CRISPR/Cas9 technology has emerged as a powerful approach for targeted 
          genome modification in eukaryotic organisms from yeast to human cell lines. Its
          successful application in several plant species promises enormous potential for
          basic and applied plant research. However, extensive studies are still needed to 
          assess this system in other important plant species, to broaden its fields of
          application and to improve methods. Here we showed that the CRISPR/Cas9 system is
          efficient in petunia (Petunia hybrid), an important ornamental plant and a model 
          for comparative research. When PDS was used as target gene, transgenic shoot
          lines with albino phenotype accounted for 55.6%-87.5% of the total regenerated T0
          Basta-resistant lines. A homozygous deletion close to 1 kb in length can be
          readily generated and identified in the first generation. A sequential
          transformation strategy--introducing Cas9 and sgRNA expression cassettes
          sequentially into petunia--can be used to make targeted mutations with short
          indels or chromosomal fragment deletions. Our results present a new plant species
          amenable to CRIPR/Cas9 technology and provide an alternative procedure for its
          exploitation.
    FAU - Zhang, Bin
    AU  - Zhang B
    AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
          Horticulture Science for Southern Mountainous Regions, Ministry of Education,
          College of Horticulture and Landscape Architecture, Southwest University,
          Chongqing 400716, China.
    FAU - Yang, Xia
    AU  - Yang X
    AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
          Horticulture Science for Southern Mountainous Regions, Ministry of Education,
          College of Horticulture and Landscape Architecture, Southwest University,
          Chongqing 400716, China.
    FAU - Yang, Chunping
    AU  - Yang C
    AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
          Horticulture Science for Southern Mountainous Regions, Ministry of Education,
          College of Horticulture and Landscape Architecture, Southwest University,
          Chongqing 400716, China.
    FAU - Li, Mingyang
    AU  - Li M
    AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
          Horticulture Science for Southern Mountainous Regions, Ministry of Education,
          College of Horticulture and Landscape Architecture, Southwest University,
          Chongqing 400716, China.
    FAU - Guo, Yulong
    AU  - Guo Y
    AD  - Chongqing Engineering Research Centre for Floriculture, Key Laboratory of
          Horticulture Science for Southern Mountainous Regions, Ministry of Education,
          College of Horticulture and Landscape Architecture, Southwest University,
          Chongqing 400716, China.
    LA  - eng
    PT  - Journal Article
    PT  - Research Support, Non-U.S. Gov't
    DEP - 20160203
    PL  - England
    TA  - Sci Rep
    JT  - Scientific reports
    JID - 101563288
    SB  - IM
    PMC - PMC4738242
    OID - NLM: PMC4738242
    EDAT- 2016/02/04 06:00
    MHDA- 2016/02/04 06:00
    CRDT- 2016/02/04 06:00
    PHST- 2015/09/21 [received]
    PHST- 2015/12/30 [accepted]
    AID - srep20315 [pii]
    AID - 10.1038/srep20315 [doi]
    PST - epublish
    SO  - Sci Rep. 2016 Feb 3;6:20315. doi: 10.1038/srep20315.
    

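    2) SeqIO.read() is used to read and parse FASTA files. FASTA is a format for storing sequences, in which each sequence is represented as a series of lines. The first line of a record starts with a ">" (greater-than) symbol and holds a unique description of the sequence. The lines after it contain the actual sequence in the standard one-letter code.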
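To make the FASTA layout concrete, here is a minimal sketch of what a FASTA reader looks for. It is a simplified standard-library stand-in, not Biopython's actual parser; the function name `read_fasta_records` and both sample strings are made up for illustration. It shows that text without any ">" header line yields zero records, which is the situation SeqIO.read() reports as "No records found in handle".

```python
import io


def read_fasta_records(handle):
    """Yield (header, sequence) pairs from a FASTA-formatted text handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            # Sequence lines only count once a header has been seen.
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Hypothetical FASTA text: one header line followed by sequence lines.
fasta_text = ">seq1 hypothetical example sequence\nATGCGT\nTTAGCA\n"
# The first lines of a MEDLINE record: there is no ">" line anywhere.
medline_text = "PMID- 26837606\nOWN - NLM\nSTAT- In-Process\n"

fasta_records = list(read_fasta_records(io.StringIO(fasta_text)))
medline_records = list(read_fasta_records(io.StringIO(medline_text)))

print(fasta_records)
print(len(medline_records))
```

The MEDLINE text is perfectly valid data; it is simply in a different format from the one a FASTA reader expects.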

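    As you can see, the result returned by Entrez.efetch() (pasted above) does not look like a FASTA file. That is why SeqIO.read() raises the error saying that no sequence records were found in the file.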
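Since the goal in the question is to look fields up by key, the MEDLINE text can be turned into a dictionary keyed by its tags (PMID, TI, AU, ...). Biopython's Bio.Medline module, which the question already imports, offers Medline.read() and Medline.parse() for this and returns dictionary-like records. The sketch below is only a standard-library illustration of the same idea, assuming the usual layout of a four-character tag, "- ", and six-space continuation lines; `parse_medline` and the abbreviated sample record are hypothetical.

```python
import io
import json


def parse_medline(handle):
    """Parse one MEDLINE-style record into a dict mapping tag -> list of values."""
    record = {}
    tag = None
    for line in handle:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        if line[4:6] == "- ":
            # New field: the tag sits in the first four columns.
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None and line.startswith("      "):
            # Continuation line: append to the most recent value of the tag.
            record[tag][-1] += " " + line.strip()
    return record


# Abbreviated sample in MEDLINE layout (title wrapped to show a continuation).
sample = (
    "PMID- 26837606\n"
    "TI  - Exploiting the CRISPR/Cas9 System for Targeted Genome Mutagenesis in\n"
    "      Petunia.\n"
    "FAU - Zhang, Bin\n"
    "AU  - Zhang B\n"
    "FAU - Yang, Xia\n"
    "AU  - Yang X\n"
    "TA  - Sci Rep\n"
)

record = parse_medline(io.StringIO(sample))
print(record["PMID"][0])
print(record["TI"][0])
print(record["AU"])

# A plain dict of lists can be stored as JSON and loaded back by key later.
as_json = json.dumps(record, indent=2)
restored = json.loads(as_json)
```

Every value is kept in a list here because tags such as AU and FAU repeat; this is a design choice of the sketch, not a description of how Bio.Medline stores its fields.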

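    [Discussion]:

    • So how do I get this output?
    • You already have all the code needed to get this output. As I wrote, it is the output of net_handle.read().
    • You can print() it or write it to a file, as you are already doing. Just don't parse it with SeqIO.read(). If you open the file /dataset/Pubmed/DatasetArticles/26837606.fasta, it probably already contains this output.
    • My output is different; it is not as well structured as yours.
    • Thank you very much!!! Can you tell me whether it captures all of the article's metadata?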
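As the comments point out, saving the fetched text is ordinary file I/O and needs no sequence parser. A minimal sketch, assuming the text has already been obtained from net_handle.read(): the helper name `save_record_text`, the ".txt" extension, and the temporary directory are illustrative choices, and a short literal string stands in for the downloaded record so the example runs offline. Building the path with pathlib also avoids dropping the separator between the directory and the file name.

```python
import tempfile
from pathlib import Path


def save_record_text(text, directory, pmid):
    """Write MEDLINE text to <directory>/<pmid>.txt and return the file path."""
    target_dir = Path(directory)
    target_dir.mkdir(parents=True, exist_ok=True)
    # The "/" operator inserts the path separator for us.
    target = target_dir / f"{pmid}.txt"
    target.write_text(text, encoding="utf-8")
    return target


# Stand-in for the string returned by net_handle.read().
medline_text = "PMID- 26837606\nOWN - NLM\n"

with tempfile.TemporaryDirectory() as tmp:
    saved = save_record_text(medline_text, Path(tmp) / "pubmed", "26837606")
    saved_name = saved.name
    round_trip = saved.read_text(encoding="utf-8")

print(saved_name)
```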