【Question title】: How to parse a .txt file into .xml?
【Posted】: 2017-08-07 17:43:03
【Question description】:

This is my txt file:

In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 3.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size:   Low:    66859   High:   0
Total Process time: 3.516000
Out File Size:  Low:    217268  High:   0

I am trying to parse it into an XML format like this:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <filedata>
        <InFileName>File 1.m1</InFileName>
        <OutFileName>File 1.m2</OutFileName>
        <InFileSize>22636</InFileSize>
        <OutFileSize>77619</OutFileSize>
        <ProcessTime>1.859000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 2.m1</InFileName>
        <OutFileName>File 2.m2</OutFileName>
        <InFileSize>20673</InFileSize>
        <OutFileSize>94485</OutFileSize>
        <ProcessTime>3.094000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 3.m1</InFileName>
        <OutFileName>File 3.m2</OutFileName>
        <InFileSize>66859</InFileSize>
        <OutFileSize>217268</OutFileSize>
        <ProcessTime>3.516000</ProcessTime>
    </filedata>
</root>

This is the code I tried to achieve that with (I am using Python 2):

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size:   Low:
                       |Total Process time:
                       |Out File Size:  Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, ugly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

But I am getting empty values. Is it possible to parse this txt into XML?

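【Comments】:

  • Where are you getting the empty values? Could you be more specific?
  • If a complete program does not give the expected result, split it into smaller parts and try each one separately. Here, you should first simply parse the input and print the parts you find. Only then try to build the XML file.
  • Also, your regex and the sub-element names don't match! Is that intentional?
  • I tried this program and I get the XML structure, but only the filedata tags. I took help from an SO question and changed the regex to fit my structure.
  • @KeerthanaPrabhakaran Sorry, I was editing the text file before uploading it to SO. I'll update the regex I used. I don't think it's correct, though.

Tags: python xml python-2.7 parsing elementtree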


【Solution 1】:

A correction to your regex: it should be

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

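instead of the one you gave. Your pattern is compiled with re.VERBOSE, which ignores unescaped whitespace inside the pattern, so an alternative such as In File Name: is actually compiled as InFileName: and never matches a line of the input. That is why only the empty filedata elements are produced.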
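
As a quick check, here is a minimal sketch (Python 3 assumed) that applies the corrected pattern to a few lines copied from the question's file. The names `pattern`, `sample_lines` and `matches` are only illustrative, and the pattern is written as a raw string split over two lines for readability.

```python
import re

# Corrected pattern from the answer, written as a raw string.
pattern = re.compile(
    r'(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)'
    r'|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)'
)

# Sample lines copied from the question's input file.
sample_lines = [
    r'In File Name:   C:\Users\naqushab\desktop\files\File 1.m1',
    'In File Size:   Low:    22636   High:   0',
    'Total Process time: 1.859000',
]

# Collect (title, value) pairs for every line the pattern matches.
matches = []
for line in sample_lines:
    m = pattern.search(line)
    if m:
        matches.append((m.group('title'), m.group('value').strip()))

for title, value in matches:
    print(title, '->', value)
```

Note that for the size lines the captured value still contains the `High:` part, so it needs further splitting before it can be used as element text.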

Suggestion:

You can do this without a regex. xml.dom.minidom can be used to prettify your XML string.

I've added inline comments to make it easier to follow!

Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

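Returns a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to \n.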
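
To illustrate the call, here is a small sketch (Python 3 assumed) that builds a tiny tree with ElementTree and passes it through minidom. The element names come from the question; the two-space indent is an arbitrary choice for this example.

```python
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

# Build a minimal tree: <root><filedata><InFileName>...</InFileName></filedata></root>
root = ET.Element('root')
filedata = ET.SubElement(root, 'filedata')
ET.SubElement(filedata, 'InFileName').text = 'File 1.m1'

# ET.tostring gives the compact serialization; minidom re-parses it
# and toprettyxml adds the indentation and line breaks.
pretty = minidom.parseString(ET.tostring(root)).toprettyxml(indent='  ', newl='\n')
print(pretty)
```

Without an `encoding` argument the result is a text string; passing `encoding='utf-8'` returns encoded bytes instead, which matters when writing the result to a file.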

Edit

import itertools as it
[line[0] for line in it.groupby(lines)]

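You can use groupby from the itertools package to collapse consecutive duplicates in the list lines, so that a run of several blank lines becomes a single blank line.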
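
A minimal sketch of that deduplication step (Python 3 assumed; the sample list is made up for illustration):

```python
import itertools as it

# Made-up input with runs of blank lines at the start, middle and end.
lines = ['', '', 'In File Name: a', '', '', '', 'In File Name: b', '']

# groupby yields one (key, group) pair per run of equal consecutive items,
# so keeping only the keys collapses each run to a single item.
deduped = [key for key, _group in it.groupby(lines)]
print(deduped)
```

Each run of blank lines shrinks to one blank line, but a blank line at the very beginning or end of the file is still present after this step.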

So,

import itertools as it
import xml.dom.minidom as minidom
import xml.etree.ElementTree as ET

root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

# add the first subelement
celldata = ET.SubElement(root, 'filedata')

# for every line in the input file;
# groupby collapses runs of consecutive identical lines into one
for line in it.groupby(lines):
    line = line[0]
    # an empty line marks the break between two records
    if not line:
        # add the next subelement and make it the current celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        # otherwise, split on ':' to get the tag name
        tag = line.split(":")
        # format the tag name by removing the spaces
        el = ET.SubElement(celldata, tag[0].replace(" ", ""))
        tag = ' '.join(tag[1:]).strip()

        # get the file name from the file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            # split on any whitespace; the value follows the 'Low:' token
            splist = line.split()
            tag = splist[splist.index('Low:') + 1]
            # splist[splist.index('High:') + 1] would give the High value
        el.text = tag

# prettify the xml
formatedXML = minidom.parseString(
    ET.tostring(root)).toprettyxml(indent=" ", encoding='utf-8').strip()
# Display for debugging
print(formatedXML.decode('utf-8'))

# write the formatedXML to file (binary mode, since it is already encoded)
with open("Performance.xml", "wb") as f:
    f.write(formatedXML)

Output: Performance.xml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>

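Hope this helps!

【Comments】:

  • Perfect! Just one thing: how do I check for multiple newlines, since the generated txt can have some blank lines at the beginning and at the end?
  • itertools' groupby should do the trick! I've added an edit for it.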
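
For blank lines at the very beginning or end of the file, one possible variation (not the answerer's code; a sketch assuming Python 3, with the hypothetical helper name `split_records`) is to group on "is this line blank?" and keep only the non-blank groups, so no empty filedata element is created:

```python
import itertools as it


def split_records(lines):
    """Yield lists of non-blank lines, one list per blank-line-separated block."""
    # The key is True for blank (or whitespace-only) lines, False otherwise,
    # so groupby alternates between blank runs and record blocks.
    for is_blank, group in it.groupby(lines, key=lambda s: not s.strip()):
        if not is_blank:
            yield list(group)


# Made-up input with leading, repeated and trailing blank lines.
lines = ['', '', 'In File Name: a', 'Total Process time: 1.0',
         '', '', 'In File Name: b', '', '']
records = list(split_records(lines))
print(records)
```

Each yielded list can then be turned into one filedata element, regardless of how many blank lines surround it.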
【Solution 2】:

From the documentation (emphasis mine):

re.VERBOSE
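This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.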

Escape the spaces in the regex, or use \s.
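
A minimal sketch of the difference (Python 3 assumed; the names `broken` and `fixed` are only illustrative), using one line from the question's file:

```python
import re

line = 'Total Process time: 1.859000'

# Unescaped spaces are dropped under re.VERBOSE, so the title part of this
# pattern is effectively 'TotalProcesstime:' and cannot match the line.
broken = re.compile(r'''(?P<title>Total Process time:)
                        (?P<value>.*)''', re.VERBOSE)

# Escaping each space (or using \s) keeps the whitespace significant.
fixed = re.compile(r'''(?P<title>Total\ Process\ time:)
                       \s*
                       (?P<value>.*)''', re.VERBOSE)

print(broken.search(line))                 # None, no match
print(fixed.search(line).group('value'))   # 1.859000
```

The same escaping applies to every alternative in the question's pattern, including the runs of spaces before `Low:`.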

【Comments】:
