How to create a nested list of dictionaries from an XML file in Python

【Question title】: How to create a nested list of dictionaries from an XML file in Python
【Posted】: 2020-12-16 04:57:35
【Question description】:

This XML sample represents a sample metabolite from the HMDB Serum Metabolites dataset.

<?xml version="1.0" encoding="UTF-8"?>
<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <version>4.0</version>
  <creation_date>2005-11-16 15:48:42 UTC</creation_date>
  <update_date>2019-01-11 19:13:56 UTC</update_date>
  <accession>HMDB0000001</accession>
  <status>quantified</status>
  <secondary_accessions>
    <accession>HMDB00001</accession>
    <accession>HMDB0004935</accession>
    <accession>HMDB0006703</accession>
    <accession>HMDB0006704</accession>
    <accession>HMDB04935</accession>
    <accession>HMDB06703</accession>
    <accession>HMDB06704</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
  <cs_description>1-Methylhistidine, also known as 1-mhis, belongs to the class of organic compounds known as histidine and derivatives. Histidine and derivatives are compounds containing cysteine or a derivative thereof resulting from reaction of cysteine at the amino group or the carboxy group, or from the replacement of any hydrogen of glycine by a heteroatom. 1-Methylhistidine has been found in human muscle and skeletal muscle tissues, and has also been detected in most biofluids, including cerebrospinal fluid, saliva, blood, and feces. Within the cell, 1-methylhistidine is primarily located in the cytoplasm. 1-Methylhistidine participates in a number of enzymatic reactions. In particular, 1-Methylhistidine and Beta-alanine can be converted into anserine; which is catalyzed by the enzyme carnosine synthase 1. In addition, Beta-Alanine and 1-methylhistidine can be biosynthesized from anserine; which is mediated by the enzyme cytosolic non-specific dipeptidase. In humans, 1-methylhistidine is involved in the histidine metabolism pathway. 1-Methylhistidine is also involved in the metabolic disorder called the histidinemia pathway.</cs_description>
  <description>One-methylhistidine (1-MHis) is derived mainly from the anserine of dietary flesh sources, especially poultry. The enzyme, carnosinase, splits anserine into b-alanine and 1-MHis. High levels of 1-MHis tend to inhibit the enzyme carnosinase and increase anserine levels. Conversely, genetic variants with deficient carnosinase activity in plasma show increased 1-MHis excretions when they consume a high meat diet. Reduced serum carnosinase activity is also found in patients with Parkinson's disease and multiple sclerosis and patients following a cerebrovascular accident. Vitamin E deficiency can lead to 1-methylhistidinuria from increased oxidative effects in skeletal muscle. 1-Methylhistidine is a biomarker for the consumption of meat, especially red meat.</description>
  <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate</synonym>
    <synonym>1 Methylhistidine</synonym>
    <synonym>1-Methyl histidine</synonym>
  </synonyms>
  <chemical_formula>C7H11N3O2</chemical_formula>
  <smiles>CN1C=NC(C[C@H](N)C(O)=O)=C1</smiles>
  <inchikey>BRMWTNUJHUMWMS-LURJTMIESA-N</inchikey>
  <diseases>
    <disease>
      <name>Kidney disease</name>
      <omim_id/>
      <references>
        <reference>
          <reference_text>McGregor DO, Dellow WJ, Lever M, George PM, Robson RA, Chambers ST: Dimethylglycine accumulates in uremia and predicts elevated plasma homocysteine concentrations. Kidney Int. 2001 Jun;59(6):2267-72.</reference_text>
          <pubmed_id>11380830</pubmed_id>
        </reference>
        <reference>
          <reference_text>Ehrenpreis ED, Salvino M, Craig RM: Improving the serum D-xylose test for the identification of patients with small intestinal malabsorption. J Clin Gastroenterol. 2001 Jul;33(1):36-40.</reference_text>
          <pubmed_id>11418788</pubmed_id>
        </reference>
      </references>
    </disease>
  </diseases>
</metabolite>
</hmdb>

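What I would like to do is run a nested loop and create a list of dictionaries.

Each dictionary represents one metabolite.

Each key in the dictionary is a selected node (by tag name).

The value for a key can be either a list of strings or a single string.

This is the structure I think I need (better ideas are welcome):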

[
    {
    "accession":"accession.value",
    "name": "name.value",
    "synonyms":[synonyms.value.1, synonyms.value.2, synonyms.value.3,... ],
    "chemical_formula":"chemical_formula.value",
    "smiles": "smiles.value",
    "inchikey":"inchikey.value",
    "biological_properties_pathways":[pathways.value1, pathways.value2, pathways.value3,.. ],
    "diseases":[disease.name.1, disease.name.2, disease.name.3,.. ],
    "pubmed_id's for disease.name.1":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ],
    "pubmed_id's for disease.name.2":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ],
    .
    .
    .
    },
    {
    "accession":"accession.value",
    "name": "name.value",
    "synonyms":[synonyms.value.1, synonyms.value.2, synonyms.value.3,... ],
    "chemical_formula":"chemical_formula.value",
    "smiles": "smiles.value",
    "inchikey":"inchikey.value",
    "biological_properties_pathways":[pathways.value1, pathways.value2, pathways.value3,.. ],
    "diseases":[disease.name.1, disease.name.2, disease.name.3,.. ],
    "pubmed_id's for disease.name.1":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ],
    "pubmed_id's for disease.name.2":[pubmed_id.value.1, pubmed_id.value.2, pubmed_id.value.3,... ],
    .
    .
    .
    },
    .
    .
    .
]

This is what I have done so far:

# Import packages
import xml.etree.ElementTree as et

# load data
data1 = et.parse('D:/path/to/my/Projects/HMDB/DataSets/saliva_metabolites/saliva_metabolites.xml')

# get the root element (<hmdb>) of the parsed tree
root = data1.getroot()

# create namespace map
ns = {"h": "http://www.hmdb.ca"}

# extract the first 3 metabolites only for easy work
metabolites = root.findall('./h:metabolite', ns)[0:3]

Now I run a nested loop over the 3 metabolites and select specific nodes (the first 2 that I need) into a dictionary.

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts={"accession":  subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts = {"name": subchild.text}
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)

I received this output:

>> print(newlist)
[{'name': '1-Methylhistidine'}, {'name': '2-Ketobutyric acid'}, {'name': '2-Hydroxybutyric acid'}]

Instead of:

[{'accession': 'HMDB0000001','name': '1-Methylhistidine' },
 {'accession': 'HMDB0000005', 'name': '2-Ketobutyric acid'},
 {'accession': 'HMDB0000008', 'name': '2-Hydroxybutyric acid'}]

Meaning that <name> overrides <accession>.

I also tried to put a list in as the value of a key:

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        # if subchild.tag=='{http://www.hmdb.ca}accession':
        #     dicts={"accession":  subchild.text}
        # if subchild.tag == '{http://www.hmdb.ca}name':
        #     dicts = {"name": subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                dicts = {"synonyms": synonym.text}
                print(synonym.text)
            innerlist.append(subchild.text)
            print(innerlist)

    newlist.append(dicts)


The output is overridden again:

>> print(newlist)
[{'synonyms': '1-Methylhistidine dihydrochloride'},
 {'synonyms': 'alpha-Ketobutyric acid, sodium salt'},
 {'synonyms': '2-Hydroxybutyric acid, monosodium salt, (+-)-isomer'}]

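Each of the 3 keys above holds only the last value of its list, instead of the whole list of values.

I should have received something like this (but with all the values for each synonyms list):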

>> print(newlist)
[{'synonyms': ['(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid',
               '1-Methylhistidine',
               ....
               '1-Methylhistidine dihydrochloride' ]},

 {'synonyms': ['2-Ketobutanoic acid',
               '2-Oxobutyric acid',
                ....
               'alpha-Ketobutyric acid, sodium salt']},

 {'synonyms': [ '2-Hydroxybutanoic acid',
                'alpha-Hydroxybutanoic acid',
                ....
                '2-Hydroxybutyric acid, monosodium salt, (+-)-isomer']}
]

I was using these questions to write the loops:

Create List of Dictionary Python - I think it is very similar, but I could not get it to work
How to create and fill a list of lists in a for loop
Python ElementTree - iterate through child nodes and text in order
Populating a dictionary using for loops (python) [duplicate]
Generating nested lists from XML doc

Any ideas, hints, or clues would be much appreciated.

【Question discussion】:

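Have you tried the xmltodict package?
I have been trying it, but in my case I only need a few nodes per metabolite (about 12-14), which is why I turned to this approach with conditional loops.
Serious question: why bother converting? Why not work with the native XML directly?
@Jack Fleeting, I'm new to Python and XML, so I'm not familiar with it. By "native XML", do you mean a native XML database? I have just read about it, but I'm not sure what it is or how it works. Thanks @Jack Fleeting.
You can either move to an XML database like BaseX or, staying with Python, use a library such as lxml that supports XPath. With both, you can use XPath to zero in on specific items. For example, //metabolite//synonyms/synonym[3] would output Pi-methylhistidine, etc.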
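
For what it's worth, the XPath idea from the last comment can also be tried with the standard library, since xml.etree.ElementTree supports a limited XPath subset that includes position predicates. The sketch below assumes a trimmed-down copy of the sample from the question (the `xml_text` string is illustrative only). Because the sample declares a default namespace, the path needs a prefix mapped to http://www.hmdb.ca rather than the bare `//metabolite` form.

```python
import xml.etree.ElementTree as et

# Trimmed-down copy of the sample from the question (illustrative only).
xml_text = """<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <accession>HMDB0000001</accession>
  <name>1-Methylhistidine</name>
  <synonyms>
    <synonym>(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid</synonym>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
  </synonyms>
</metabolite>
</hmdb>"""

root = et.fromstring(xml_text)

# Map a prefix of our choosing to the document's default namespace.
ns = {"h": "http://www.hmdb.ca"}

# [3] selects the third <synonym> child (XPath positions start at 1).
third = root.find("./h:metabolite/h:synonyms/h:synonym[3]", ns)
print(third.text)
```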

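Tags: python xml dictionary nested-loops

【Solution 1】:

The problem with the first code snippet is probably that a new dictionary is reassigned to the variable dicts on each match: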

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts={"accession":  subchild.text}
        if subchild.tag == '{http://www.hmdb.ca}name':
            # here the old value of dicts is overwritten with a new dictionary
            dicts = {"name": subchild.text}
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)

You should probably use an assignment of the form dict[key] = value instead:

newlist = []
for child in metabolites:
    innerlist = []
    dicts = {}
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts["accession"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts["name"] =  subchild.text
            innerlist.append(subchild.text)
            print(innerlist)
    newlist.append(dicts)
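
To see the difference in isolation, here is a minimal sketch with hard-coded values (the names `rebinding` and `updating` are only illustrative, they are not from the question). Binding the name to a new dictionary throws away the earlier key, while item assignment adds a key to the same dictionary.

```python
# Rebinding: each assignment creates a brand-new dict,
# so only the last one is left under the name.
rebinding = {}
rebinding = {"accession": "HMDB0000001"}
rebinding = {"name": "1-Methylhistidine"}

# Item assignment: both keys end up in the same dict.
updating = {}
updating["accession"] = "HMDB0000001"
updating["name"] = "1-Methylhistidine"

print(rebinding)
print(updating)
```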

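The second code snippet seems to have a similar problem. Here the synonyms are collected into a list first, and the list is then stored under the key: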

newlist = []
for child in metabolites:
    dicts = {}
    innerlist = []
    for subchild in child:
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                innerlist.append(synonym.text)
    dicts["synonyms"] = innerlist

    newlist.append(dicts)

However (as already pointed out), you could use a more convenient library instead of parsing the XML manually.

Here is the merged script:

newlist = []
for child in metabolites:
    dicts = {}
    innerlist = []
    for subchild in child:
        if subchild.tag=='{http://www.hmdb.ca}accession':
            dicts["accession"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}name':
            dicts["name"] =  subchild.text
        if subchild.tag == '{http://www.hmdb.ca}synonyms':
            for synonym in subchild:
                innerlist.append(synonym.text)
            dicts["synonyms"] = innerlist
    newlist.append(dicts)
   
print(newlist)

It produces the following output:

[{'accession': 'HMDB0000001', 'name': '1-Methylhistidine', 'synonyms': ['(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoic acid', '1-Methylhistidine', 'Pi-methylhistidine', '(2S)-2-amino-3-(1-Methyl-1H-imidazol-4-yl)propanoate', '1 Methylhistidine', '1-Methyl histidine']}]
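
As a possible alternative to comparing tag strings by hand, the same result can probably be reached with ElementTree's own find/findall and a namespace map. This is only a sketch under assumptions: the helper name `metabolite_to_dict`, the shortened `SAMPLE` string, and the choice to store diseases as a dictionary of disease name to PubMed IDs are mine, not part of the question or the answer above.

```python
import xml.etree.ElementTree as et

NS = {"h": "http://www.hmdb.ca"}

# Shortened stand-in for the real file (illustrative only).
SAMPLE = """<hmdb xmlns="http://www.hmdb.ca">
<metabolite>
  <accession>HMDB0000001</accession>
  <secondary_accessions>
    <accession>HMDB00001</accession>
  </secondary_accessions>
  <name>1-Methylhistidine</name>
  <synonyms>
    <synonym>1-Methylhistidine</synonym>
    <synonym>Pi-methylhistidine</synonym>
  </synonyms>
  <diseases>
    <disease>
      <name>Kidney disease</name>
      <references>
        <reference><pubmed_id>11380830</pubmed_id></reference>
        <reference><pubmed_id>11418788</pubmed_id></reference>
      </references>
    </disease>
  </diseases>
</metabolite>
</hmdb>"""


def metabolite_to_dict(metabolite):
    """Hypothetical helper: pick a few nodes out of one <metabolite>."""
    return {
        # A relative path like "h:accession" matches direct children only,
        # so the nested secondary accessions and disease names are skipped.
        "accession": metabolite.findtext("h:accession", namespaces=NS),
        "name": metabolite.findtext("h:name", namespaces=NS),
        "synonyms": [s.text for s in metabolite.findall("h:synonyms/h:synonym", NS)],
        # Assumed layout: one entry per disease, mapping its name to its PubMed IDs.
        "diseases": {
            d.findtext("h:name", namespaces=NS): [
                p.text for p in d.findall("h:references/h:reference/h:pubmed_id", NS)
            ]
            for d in metabolite.findall("h:diseases/h:disease", NS)
        },
    }


root = et.fromstring(SAMPLE)
records = [metabolite_to_dict(m) for m in root.findall("h:metabolite", NS)]
print(records)
```

Keeping the PubMed IDs nested under each disease name avoids generating keys like "pubmed_id's for disease.name.1", though a flat layout would work as well if that is what downstream code expects.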

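【Discussion】:

Thank you for your time! Both scripts work great. If it's OK, I'd like to ask another question. I'm trying to merge the two scripts you wrote into one, but I don't understand why in the first script innerlist.append(subchild.text) comes after the value is assigned to the dictionary (dicts["name"] = subchild.text), while in the second script innerlist.append(synonym.text) comes before it (before dicts["synonyms"] = innerlist).
I updated the answer with the merged script. Basically, in script 1 innerlist is only there for debugging purposes, while in the second script innerlist collects the synonyms.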