Parsing XML in Python 3

【Question title】: Parsing XML in Python 3
【Posted】: 2018-04-20 11:23:22
【Question】:

I have very large XML files that I need to parse, convert to JSON, and store in MongoDB.

The XML looks like this:

Headers
 <response>
    <tag1>sssss</tag1>
    <tag2>kkkkkk</tag2>
    <tag3>aaaaaa</tag3> 
 </response>
Footers

I only need the text between the two response tags. The problem occurs when I try to parse it. The code looks like this:

import pymysql
import re
import json
import xmltodict
from pymongo import MongoClient

# Open Database Connection.
db = pymysql.connect("hjj","fnddd","feoifh","fdfsddfs")

# prepare a cursor object
cursor = db.cursor()

# execute SQL query 
cursor.execute("SQL Query")

# Fetch all rows
data = cursor.fetchall()

a = (r'(?=<response>)(.*)(?<=</response>)')
def cleanxml(xml):
    file = re.findall(a, xml, re.DOTALL)
    return file
data = list(data)
for row in data:
    thexml = cleanxml(row[-1])
    jsonString = json.dumps(xmltodict.parse(thexml), indent = 4) #error here

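The code above gives me an error: a bytes-like object is required, not 'list'

I tried converting the list (thexml) to a str as follows:

thexml = ','.join(str(x) for x in thexml)

Parsing does not work after this either:

xmltodict.parse(thexml) #no element found: line 1, column 0

How do I go about this? Any help is appreciated. Thanks.

I solved the problem above only to run into another one. The code that solved it: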

a = (r'(?=<response>)(.*)(?<=</response>)')
def cleanxml(xml):
    if re.findall(a, xml, re.S):
        file = re.findall(a, xml, re.S)[0]
    else:
        file = "<response>NA</response>"
    return file
data = list(data)

for row in data:
    thexml  = cleanxml(row[1])
    jsonString = json.dumps(xmltodict.parse(thexml), indent = 4)
    d = json.loads(jsonString)
    newdict = {"caseid" : row[0]}
    newdict.update(d)
    jsondata = json.dumps(newdict, indent = 3)

Now the problem I face is how to insert this into MongoDB. I tried the following code, but it does not work and I do not know how to fix it:

client = MongoClient('localhost', 27017)
db = client.lexnex
collection = db['userdata']
collection.insert(newdict)

I get:

 DeprecationWarning: insert is deprecated. Use insert_one or insert_many instead.
  after removing the cwd from sys.path.

When I try to insert it in a loop, I still get an error saying it should be a SON object, etc. Can anybody help? The exact error: document must be an instance of dict, bson.son.SON, bson.raw_bson.RawBSONDocument, or a type that inherits from collections.MutableMapping
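
A minimal sketch of the insertion step, assuming the loop passed the JSON string (`jsondata`) rather than the dict (`newdict`); the post does not show the loop, so this is only one likely cause. pymongo's `insert_one` and `insert_many` take dicts, not JSON text. The helper names `build_document` and `store_documents` are hypothetical, and the collection is passed in so the logic can be tried without a running server.

```python
import json


def build_document(case_id, parsed):
    """Merge the case id and the parsed XML into one plain dict."""
    document = {"caseid": case_id}
    document.update(parsed)
    return document


def store_documents(collection, documents):
    """Insert dicts through any object exposing pymongo's insert_many()."""
    documents = list(documents)
    if documents:  # insert_many raises on an empty list
        collection.insert_many(documents)
    return len(documents)


# Stand-in rows: (caseid, already-parsed XML as a dict).
rows = [(1, {"response": {"tag1": "sssss"}}), (2, {"response": "NA"})]
documents = [build_document(case_id, parsed) for case_id, parsed in rows]

# json.dumps returns a str; pymongo rejects a str as a document.
as_text = json.dumps(documents[0])

# With a live server (assumed here to run on localhost:27017):
# from pymongo import MongoClient
# collection = MongoClient("localhost", 27017).lexnex["userdata"]
# store_documents(collection, documents)
```

Collecting the dicts first and calling `insert_many` once also avoids the deprecated `insert` call shown in the warning.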

【Comments】:

That data is not valid XML. XML requires a single element (<tag>) at the top level (i.e. the whole document is enclosed in <tag>...</tag>).
Hence the cleanxml function.
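While that is a good idea, "no element found" suggests your cleanxml function is not working. What happens if you print its result? Does it match what you expect?
Yes, it does, perfectly. Here is part of it: <response><Header><TransactionId>66215947R1376304</TransactionId> <Status>0</Status> </Header> <RecordCount>1</RecordCount><Records>
Please provide the XML content in a proper format.

Tags: json xml mongodb python-3.x parsing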
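
The extraction step discussed in these comments can be sketched with the standard library alone. `re.findall` always returns a list, which matches the first error in the question, and joining an empty list yields an empty string, which is consistent with the "no element found: line 1, column 0" message. The names `RESPONSE_RE` and `extract_response` are hypothetical; the fallback value is the one used in the asker's own fix.

```python
import re

# Non-greedy pattern: stops at the first closing tag instead of the last one.
RESPONSE_RE = re.compile(r"<response>.*?</response>", re.DOTALL)


def extract_response(raw, default="<response>NA</response>"):
    """Return the first <response>...</response> block in raw as a str."""
    match = RESPONSE_RE.search(raw)
    return match.group(0) if match else default


sample = "Headers\n <response>\n    <tag1>sssss</tag1>\n </response>\nFooters"

fragment = extract_response(sample)           # a single str, ready to parse
all_fragments = RESPONSE_RE.findall(sample)   # always a list, even for one match
no_match = ",".join(RESPONSE_RE.findall("Headers only"))  # empty list joins to ""
```

Using `search` returns one string per row, so the result can be handed to a parser directly without indexing into a list.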

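【Solution 1】:

You can use pyparsing to extract bits from poorly formed XML by defining only the parts you are interested in, then using searchString or scanString to find them while skipping over the unwanted junk: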

import pyparsing as pp

uglyxml = """
Headers
 <response>
    <tag1>sssss</tag1>
    <tag2>kkkkkk</tag2>
    <tag3>aaaaaa</tag3> 
 </response>
Footers
"""

# define pyparsing expressions for starting and ending tags
# (suppress them because the tags themselves aren't interesting,
# just the content between the tags)
t1, t1_end = map(pp.Suppress, pp.makeXMLTags('tag1'))
t2, t2_end = map(pp.Suppress, pp.makeXMLTags('tag2'))
t3, t3_end = map(pp.Suppress, pp.makeXMLTags('tag3'))
resp, resp_end = map(pp.Suppress, pp.makeXMLTags('response'))

parser = (resp 
            + t1 + pp.SkipTo(t1_end)('tag1') + t1_end 
            + t2 + pp.SkipTo(t2_end)('tag2') + t2_end 
            + t3 + pp.SkipTo(t3_end)('tag3') + t3_end
          + resp_end)

# use searchString to skip over unwanted stuff in input string
parsed_responses = parser.searchString(uglyxml)

# dump out the parsed structure
print(parsed_responses[0].dump())

# convert to a nested dict
print(parsed_responses[0].asDict())

# access the `tag1` result using object attribute form
t1 = parsed_responses[0].tag1

# print matched values by tag name - pyparsing's parsed results
# can work as mappings for str.format
print("tag1={tag1!r}, tag2={tag2!r}, tag3={tag3!r}".format(**parsed_responses[0]))

import json
print("as JSON:")
print(json.dumps(parsed_responses[0].asDict()))

Prints:

['sssss', 'kkkkkk', 'aaaaaa']
- tag1: 'sssss'
- tag2: 'kkkkkk'
- tag3: 'aaaaaa'
{'tag1': 'sssss', 'tag3': 'aaaaaa', 'tag2': 'kkkkkk'}
tag1='sssss', tag2='kkkkkk', tag3='aaaaaa'

as JSON:
{"tag2": "kkkkkk", "tag1": "sssss", "tag3": "aaaaaa"}

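【Comments】:

It is a big XML with a lot of tags. My problem is not extracting the useful bits; I have already done that. My problem is converting it to JSON.

【Solution 2】:

In your original code, change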

xmltodict.parse(thexml)

to

[xmltodict.parse(response) for response in thexml]
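
The same per-fragment idea can be sketched without third-party packages. This is a stand-in for illustration, not xmltodict: the hypothetical helpers `element_to_dict` and `parse_fragment` use `xml.etree.ElementTree`, ignore attributes and mixed text, and are only meant to show that each item of the list is parsed separately before the whole result goes to `json.dumps`.

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(element):
    """Convert an element to nested dicts; attributes are ignored."""
    children = list(element)
    if not children:
        return (element.text or "").strip()
    result = {}
    for child in children:
        value = element_to_dict(child)
        if child.tag in result:
            # Repeated tags are collected into a list.
            if not isinstance(result[child.tag], list):
                result[child.tag] = [result[child.tag]]
            result[child.tag].append(value)
        else:
            result[child.tag] = value
    return result


def parse_fragment(fragment):
    """Parse one XML string and key the result by its root tag."""
    root = ET.fromstring(fragment)
    return {root.tag: element_to_dict(root)}


# thexml is a list of strings, as returned by re.findall in the question.
thexml = [
    "<response><tag1>sssss</tag1><tag2>kkkkkk</tag2></response>",
    "<response>NA</response>",
]
parsed = [parse_fragment(fragment) for fragment in thexml]
json_string = json.dumps(parsed, indent=4)
```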

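【Comments】:

It works, but I get OrderedDict before every element? Sample result: [OrderedDict([('response', OrderedDict([('Individual', OrderedDict([('PhonesPluses', OrderedDict([('PhonesPlus', OrderedDict([('Address', OrderedDict([('StreetName', 'BELLEZZA')
And how do I convert it to JSON?
Presumably xmltodict.parse() returns its result as an OrderedDict, which is a subclass of dict. If you are seeing that, you must be looking at the result of the list comprehension rather than passing it to json.dumps(), so it sounds like you did not do what I suggested.
Never mind, I got it. I have another question about the same problem. Should I post a new question with my updated code?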
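I don't think so. If the other problem is unrelated to what you originally asked, then you should perhaps add an addendum at the bottom of your original post.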
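
The point made in the comments above, that an OrderedDict is a dict subclass and can go straight into `json.dumps()`, can be checked with a small standard-library sketch. The sample data is made up to mirror the shape of the output quoted in the comments.

```python
import json
from collections import OrderedDict

# Nested OrderedDicts, shaped like the result of the list comprehension.
parsed = [
    OrderedDict([
        ("response", OrderedDict([("tag1", "sssss"), ("tag2", "kkkkkk")])),
    ])
]

# json.dumps treats OrderedDict like any other dict.
as_json = json.dumps(parsed, indent=4)

# Round-tripping through JSON yields plain dicts, if those are preferred.
plain = json.loads(as_json)
```

The `OrderedDict(...)` wrappers only appear when the Python object itself is printed; the JSON text contains ordinary objects.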