How to parse a .txt file into .xml? Answers

【Question title】: How to parse a .txt file into .xml?
【Posted】: 2017-08-07 17:43:03
【Question】:

This is my txt file:

In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 3.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size:   Low:    66859   High:   0
Total Process time: 3.516000
Out File Size:  Low:    217268  High:   0

I am trying to parse it into XML in this format:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <filedata>
        <InFileName>File 1.m1</InFileName>
        <OutFileName>File 1.m2</OutFileName>
        <InFileSize>22636</InFileSize>
        <OutFileSize>77619</OutFileSize>
        <ProcessTime>1.859000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 2.m1</InFileName>
        <OutFileName>File 2.m2</OutFileName>
        <InFileSize>20673</InFileSize>
        <OutFileSize>94485</OutFileSize>
        <ProcessTime>3.094000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 3.m1</InFileName>
        <OutFileName>File 3.m2</OutFileName>
        <InFileSize>66859</InFileSize>
        <OutFileSize>217268</OutFileSize>
        <ProcessTime>3.516000</ProcessTime>
    </filedata>
</root>

This is the code I tried (I am using Python 2):

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size:   Low:
                       |Total Process time:
                       |Out File Size:  Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, uggly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

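But I am getting empty values. Is it possible to parse this txt into XML?

【Comments】:

Where are you getting empty values? Can you be more specific?
If a complete program does not give the expected result, split it into smaller parts and try each one separately. Here, you should first simply parse the input and print the parts you can find. Only then try to build the XML file.
Also, your regex and the subelement names do not match! Is that intentional?
I tried this program and I do get the XML structure, but it contains only the filedata tags. I took the code from an answer to another SO question and changed the regex to fit my structure.
@KeerthanaPrabhakaran Sorry, I was editing the text file before uploading it to SO. I will update the regex I used. I do not think it is correct, though.

Tags: python xml python-2.7 parsing elementtree

【Solution 1】:

A correction to your regex: it should be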

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

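instead of the one you gave. Your pattern is compiled with re.VERBOSE, which ignores unescaped whitespace inside the pattern, so In File Name: is treated as InFileName: and never matches a line of the file. The pattern above is used without that flag, so its spaces are matched literally, and " *" allows any number of spaces before Low.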
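One way to check the two patterns is to run both against a single line of the input. The sketch below uses a shortened version of the question's pattern; the names verbose_rex, plain_rex and sample are illustrative only, and the code is written to run on Python 3 as well as Python 2.7.

```python
import re

sample = r'In File Name:   C:\Users\naqushab\desktop\files\File 1.m1'

# Shortened version of the pattern from the question. Under re.VERBOSE the
# unescaped spaces are dropped, so this is compiled as "(InFileName:|...)".
verbose_rex = re.compile(r'''(?P<title>In File Name:
                                |Out File Name:
                             )
                             (?P<value>.*)
                             ''', re.VERBOSE)

# Same alternatives without re.VERBOSE: spaces are matched literally.
plain_rex = re.compile(r'(?P<title>In File Name|Out File Name):(?P<value>.*)')

verbose_match = verbose_rex.search(sample)
plain_match = plain_rex.search(sample)

print(verbose_match)                       # no match object is returned
print(plain_match.group('title'))          # the matched label
print(plain_match.group('value').strip())  # the rest of the line
```

The verbose pattern finds nothing because the text "InFileName:" does not occur in the line, while the plain pattern captures the label and the path.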

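A suggestion:

You can do this without a regex. xml.dom.minidom can be used to pretty-print your XML string.

I have added inline comments for better understanding!

Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.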
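As a minimal sketch of how toprettyxml() fits together with ElementTree, the snippet below builds a tiny tree by hand and pretty-prints it. The element contents are made up for illustration, and a two-space indent is an arbitrary choice.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring() returns the serialized tree on a single line.
compact = ET.tostring(root)

# Re-parse with minidom and ask for an indented version.
# Without an encoding argument, toprettyxml() returns text rather than bytes.
pretty = minidom.parseString(compact).toprettyxml(indent='  ', newl='\n')
print(pretty)
```

Passing encoding='utf-8' instead makes toprettyxml() return an encoded byte string and adds the encoding to the XML declaration.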

Edit

import itertools as it
[line[0] for line in it.groupby(lines)]
You can use groupby from the itertools package to collapse runs of consecutive duplicate entries in the list lines.
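To make the effect concrete, here is a small sketch on a made-up list of lines (the sample values are placeholders, not taken from the real file):

```python
import itertools as it

lines = ['', '', 'In File Name: a', 'Out File Name: b',
         '', '', '', 'In File Name: c', '']

# groupby() yields (key, group) pairs; keeping only the key collapses each
# run of identical consecutive lines into a single entry.
collapsed = [key for key, _group in it.groupby(lines)]
print(collapsed)
```

Each run of blank lines shrinks to one blank entry, so one empty entry still remains at the start and at the end of the list. Identical non-empty lines that happen to be adjacent would be collapsed as well.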

So,

import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file,
# with runs of identical consecutive lines collapsed into one
for line, _ in it.groupby(lines):
    # an empty line marks the break between two records
    if not line:
        # add the next subelement and use it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ":" to get the tag name
        tag = line.split(":")
        # format the tag name by removing the spaces
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()

        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # split() drops empty strings and returns a list on Python 2 and 3
            splist = line.split()
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1]
        el.text = tag

# prettify the XML; with an encoding, toprettyxml() returns a byte string
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()
# Display for debugging
print(formatedXML)

# write formatedXML to the file (binary mode, since it is already encoded)
with open("Performance.xml", "wb") as f:
    f.write(formatedXML)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>

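Hope this helps!

【Discussion】:

Perfect! Just one thing: how can I check for multiple newlines, since the generated txt can have some blank lines at the beginning and at the end?
groupby from itertools should do the trick! I have added an edit for that.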
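Note that groupby only collapses repeated blank lines, so a blank line at the very start or end of the file would still open an empty filedata element. One possible way around that, sketched here as an assumption rather than as part of the answer above, is to create each filedata element lazily when the first non-blank line of a record arrives. The function name build_tree and the shortened sample paths are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_tree(lines):
    """Build <root> from the report lines, one <filedata> per non-blank block."""
    root = ET.Element('root')
    celldata = None  # created lazily, so blank lines never yield empty records
    for line in lines:
        if not line.strip():
            celldata = None  # a blank line only closes the current record
            continue
        if celldata is None:
            celldata = ET.SubElement(root, 'filedata')
        # text before the first ":" becomes the tag name (spaces removed)
        name, _, rest = line.partition(':')
        el = ET.SubElement(celldata, name.replace(' ', ''))
        if 'File Name' in name:
            # keep only the last component of the Windows path
            el.text = line.split('\\')[-1].strip()
        elif 'File Size' in name:
            # take the number that follows "Low:"
            parts = rest.split()
            el.text = parts[parts.index('Low:') + 1]
        else:
            el.text = rest.strip()
    return root


sample = [
    '', '',
    r'In File Name:   C:\files\File 1.m1',
    'In File Size:   Low:    22636   High:   0',
    'Total Process time: 1.859000',
    '', '', '',
    r'In File Name:   C:\files\File 2.m1',
    '',
]
root = build_tree(sample)
print(len(root))  # number of filedata elements that were created
```

With this approach no groupby call is needed, because any number of blank lines between, before or after the records is simply skipped.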

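【Solution 2】:

From the documentation (emphasis mine):

re.VERBOSE
This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class or preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

Escape the spaces in your regex, or use the \s class.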
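Applied to the pattern from the question, that advice could look like the following sketch. It keeps re.VERBOSE, writes literal spaces as "\ " and uses \s+ where the number of spaces may vary. The two sample lines are copied from the question's input, the name results is illustrative, and the code assumes Python 3.

```python
import re

# In verbose mode "\ " is a literal space and "\s+" matches a run of whitespace.
rex = re.compile(r'''(?P<title>In\ File\ Name
                        |Out\ File\ Name
                        |In\ File\ Size:\s+Low
                        |Total\ Process\ time
                        |Out\ File\ Size:\s+Low
                     ):
                     (?P<value>.*)
                     ''', re.VERBOSE)

results = []
for line in ['In File Size:   Low:    22636   High:   0',
             'Total Process time: 1.859000']:
    m = rex.search(line)
    results.append((m.group('title'), m.group('value').strip()))

for title, value in results:
    print(title, '->', value)
```

The captured value for the size lines still contains the trailing High: field, so it needs further splitting before it is written into the XML.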

【Discussion】: