【Question title】:Extract info based on the Name attribute from an XML file with BeautifulSoup in Python
【Posted】:2023-03-14 06:51:01
【Question】:
  • In Python 3.5, I am using Biopython's Entrez module to pull some information from the PubMed biomedical site with database = pmc. Now, from this XML file:

    <DocSum>
    <Id>5412469</Id>
    <Item Name="PubDate" Type="Date">2017 Apr 22</Item>
    <Item Name="EPubDate" Type="Date">2017 Apr 22</Item>
    <Item Name="Source" Type="String">Int J Mol Sci</Item>
    <Item Name="AuthorList" Type="List">
        <Item Name="Author" Type="String">Guo Y</Item>
        <Item Name="Author" Type="String">Bao Y</Item>
        <Item Name="Author" Type="String">Yang W</Item>
    </Item>
    <Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
    <Item Name="Volume" Type="String">18</Item>
    <Item Name="Issue" Type="String">4</Item>
    <Item Name="Pages" Type="String">890</Item>
    <Item Name="ArticleIds" Type="List">
        <Item Name="pmid" Type="String">28441730</Item>
        <Item Name="doi" Type="String">10.3390/ijms18040890</Item>
        <Item Name="pmcid" Type="String">PMC5412469</Item>
    </Item>
    <Item Name="DOI" Type="String">10.3390/ijms18040890</Item>
    <Item Name="FullJournalName" Type="String">International Journal of Molecular Sciences</Item>
    <Item Name="SO" Type="String">2017 Apr 22;18(4):890</Item>
    

I want to extract the item with Name="Title", that is, exactly the line below:

 <Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>

How can I do this? I have tried the following code:

    for tag in soup.findAll("docsum"):  # I'm working with multiple articles in one file
        for a_tag in tag.findAll("item"):
            a_recs.append(a_tag.text)

    return a_recs

But it returns every value in a single list, whereas I only want the title. The output looks like this:

['2017 Apr 22', '2017 Apr 22', 'Int J Mol Sci', '\nGuo Y\nBao Y\nYang W\n', 'Guo Y', 'Bao Y', 'Yang W', 'Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis', '18', '4', '890', '\n28441730\n10.3390/ijms18040890\nPMC5412469\n', '28441730', '10.3390/ijms18040890', 'PMC5412469', '10.3390/ijms18040890', 'International Journal of Molecular Sciences', '2017 Apr 22;18(4):890']

【Comments】:

    Tags: python xml python-3.x beautifulsoup extract


    【Solution 1】:

    Try:

    >>> data = '''
    ... <DocSum>
    ... <Id>5412469</Id>
    ... <Item Name="PubDate" Type="Date">2017 Apr 22</Item>
    ... <Item Name="EPubDate" Type="Date">2017 Apr 22</Item>
    ... <Item Name="Source" Type="String">Int J Mol Sci</Item>
    ... <Item Name="AuthorList" Type="List">
    ...     <Item Name="Author" Type="String">Guo Y</Item>
    ...     <Item Name="Author" Type="String">Bao Y</Item>
    ...     <Item Name="Author" Type="String">Yang W</Item>
    ... </Item>
    ... <Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
    ... <Item Name="Volume" Type="String">18</Item>
    ... <Item Name="Issue" Type="String">4</Item>
    ... <Item Name="Pages" Type="String">890</Item>
    ... <Item Name="ArticleIds" Type="List">
    ...     <Item Name="pmid" Type="String">28441730</Item>
    ...     <Item Name="doi" Type="String">10.3390/ijms18040890</Item>
    ...     <Item Name="pmcid" Type="String">PMC5412469</Item>
    ... </Item>
    ... <Item Name="DOI" Type="String">10.3390/ijms18040890</Item>
    ... <Item Name="FullJournalName" Type="String">International Journal of Molecular Sciences</Item>
    ... <Item Name="SO" Type="String">2017 Apr 22;18(4):890</Item>'''
    >>> 
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(data, 'xml')
    
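    >>> a_recs = []
    >>> for tag in soup.find_all("DocSum"):
    ...     # find() returns the first matching Tag (or None), so test it rather than loop over it
    ...     title = tag.find("Item", {"Name": "Title"})
    ...     if title is not None:
    ...         a_recs.append(title.get_text())
    ... 
    >>> a_recs
    ['Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis']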
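
The same attribute-based lookup can also be sketched with the standard library's `xml.etree.ElementTree` instead of BeautifulSoup, which may be convenient when `lxml` is not installed. This is only an illustration under some assumptions: the `extract_titles` helper and the shortened `SAMPLE` string are hypothetical, and the `eSummaryResult` wrapper element is assumed here so that several `DocSum` records can sit under one root. Unlike BeautifulSoup's lenient parsing, ElementTree requires well-formed XML.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical sample; the wrapper root element is an assumption.
SAMPLE = """<eSummaryResult>
<DocSum>
<Id>5412469</Id>
<Item Name="Source" Type="String">Int J Mol Sci</Item>
<Item Name="AuthorList" Type="List">
    <Item Name="Author" Type="String">Guo Y</Item>
</Item>
<Item Name="Title" Type="String">Regulatory miRNAs in Colorectal Carcinogenesis and Metastasis</Item>
</DocSum>
</eSummaryResult>"""


def extract_titles(xml_text):
    """Return the text of the Title item of every DocSum in xml_text."""
    root = ET.fromstring(xml_text)
    titles = []
    for doc in root.iter("DocSum"):
        # The predicate matches only direct child Items whose Name is "Title".
        item = doc.find("Item[@Name='Title']")
        if item is not None and item.text:
            titles.append(item.text.strip())
    return titles


print(extract_titles(SAMPLE))
```

Because the path `Item[@Name='Title']` only looks at direct children of each `DocSum`, nested items such as the authors inside `AuthorList` are never considered.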
    

    【Comments】:
