Do not use string-processing tools to process XML! XML is not a regular format, and using str.replace(), sed, or any similar tool can lead to false positives and errors.
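To illustrate the problem, here is a minimal sketch comparing a plain string replacement with a parser-based update. The three sample documents and the helper names naive_update and parsed_update are made up for this illustration; they are not taken from your files. All three samples spell the same empty fileName element in different but valid ways.

```python
from xml.etree import ElementTree as ET

# Three spellings of the same structure; an XML parser treats them alike.
variants = [
    "<doc><fileName> </fileName></doc>",
    "<doc><fileName/></doc>",
    "<doc><fileName\n  ></fileName></doc>",
]


def naive_update(xml_text, name):
    # Plain string replacement only matches one exact spelling of the element.
    return xml_text.replace("<fileName> </fileName>",
                            f"<fileName>{name}</fileName>")


def parsed_update(xml_text, name):
    # The parser locates the element regardless of how it was written.
    root = ET.fromstring(xml_text)
    root.find(".//fileName").text = name
    return ET.tostring(root, encoding="unicode")


naive_results = [naive_update(v, "foo.tiff") for v in variants]
parsed_results = [parsed_update(v, "foo.tiff") for v in variants]

for naive, parsed in zip(naive_results, parsed_results):
    print(repr(naive), "|", repr(parsed))
```

The string replacement only changes the first variant and silently leaves the other two untouched, while the parser-based version updates all three.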
Use an XML parser instead; Python has xml.etree.ElementTree, which makes this task simple:
from pathlib import Path
from xml.etree import ElementTree as ET
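
for xmlfile in Path("directory_with_xml_files").glob("*.xml"):
    tree = ET.parse(xmlfile)
    # The root tag looks like "{namespace-uri}alto"; extract the URI part.
    namespace = tree.getroot().tag.partition('}')[0][1:]
    # Find the first <fileName> element anywhere in the tree.
    elem = tree.find(f".//a:fileName", {'a': namespace})
    elem.text = f"{xmlfile.stem}.tiff"
    # Write back in place, keeping the ALTO namespace as the default one.
    tree.write(xmlfile, default_namespace=namespace,
               encoding="UTF-8", xml_declaration=True)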
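The above processes all XML files in a given directory (using the pathlib module, with the Path.glob() method finding the XML files). For each file, it parses the XML data into an XML tree, finds the first <fileName> element in the tree with a simple XPath expression, updates the text of that element (using the filename stem, which is the base filename without the .xml extension), and writes the XML tree back to the original file.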
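Note that tree.find() returns None when no matching element exists, so the loop above would raise an AttributeError on a file without a <fileName> element. If that can happen in your data, a minimal sketch of a more defensive variant could look like the following; the function name update_file_name and the choice to skip such files are assumptions made for this example, not something your question specifies.

```python
from pathlib import Path
from xml.etree import ElementTree as ET


def update_file_name(xmlfile: Path) -> bool:
    """Set <fileName> to '<stem>.tiff'; return False if the element is absent."""
    tree = ET.parse(xmlfile)
    # Assumes the root element is namespaced, as it is in ALTO files.
    namespace = tree.getroot().tag.partition('}')[0][1:]
    elem = tree.find(".//a:fileName", {'a': namespace})
    if elem is None:
        # Nothing to update; leave the file untouched.
        return False
    elem.text = f"{xmlfile.stem}.tiff"
    tree.write(xmlfile, default_namespace=namespace,
               encoding="UTF-8", xml_declaration=True)
    return True


if __name__ == "__main__":
    for xmlfile in Path("directory_with_xml_files").glob("*.xml"):
        if not update_file_name(xmlfile):
            print(f"skipped {xmlfile}: no <fileName> element")
```

Files that are skipped are not rewritten at all, so they keep their original formatting.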
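You say you are using the ALTO schema, which uses XML namespaces to distinguish versions; the above should pick the right namespace to use from the root element, and then uses that namespace in the XPath query (with a as the prefix).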
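The namespace extraction relies on how ElementTree stores qualified names, namely as "{namespace-uri}localname". The sketch below shows that on a cut-down sample document (invented for this example). It also includes a hypothetical helper, namespace_of, that covers one case the one-liner does not: for a root element without any namespace, the partition expression would return a mangled piece of the tag name rather than an empty string. ALTO files are always namespaced, so this only matters if you reuse the code elsewhere.

```python
from xml.etree import ElementTree as ET

sample = ('<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#">'
          '<Description/></alto>')
root = ET.fromstring(sample)

# ElementTree stores qualified names as "{namespace-uri}localname".
print(root.tag)

# The expression used in the loop: take everything before the first '}'
# and drop the leading '{'.
namespace = root.tag.partition('}')[0][1:]
print(namespace)


def namespace_of(tag):
    """Return the namespace URI of a tag, or '' if the tag has none."""
    if tag.startswith("{"):
        return tag[1:].partition("}")[0]
    return ""


# The extracted URI can then be bound to any prefix for the search.
found = root.find(".//a:Description", {"a": namespace_of(root.tag)})
print(found is not None)
```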
Demo:
$ mkdir demo
$ cat << EOF > demo/foo.xml
> <?xml version="1.0" encoding="UTF-8"?>
> <alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
> <Description>
> <MeasurementUnit>pixel</MeasurementUnit>
> <sourceImageInformation>
> <fileName> </fileName>
> </sourceImageInformation>
> </Description>
> </alto>
> EOF
$ cp demo/foo.xml demo/bar.xml
$ cp demo/foo.xml demo/baz.xml
$ python3.7
Python 3.7.4 (default, Jul 9 2019, 19:45:08)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from pathlib import Path
>>> from xml.etree import ElementTree as ET
>>> for xmlfile in Path("demo").glob("*.xml"):
...     tree = ET.parse(xmlfile)
...     namespace = tree.getroot().tag.partition('}')[0][1:]
...     elem = tree.find(f".//a:fileName", {'a': namespace})
...     elem.text = f"{xmlfile.stem}.tiff"
...     tree.write(xmlfile, default_namespace=namespace,
...                encoding="UTF-8", xml_declaration=True)
...
>>> ^D
$ cat demo/*.xml
<?xml version='1.0' encoding='UTF-8'?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>bar.tiff</fileName>
</sourceImageInformation>
</Description>
</alto><?xml version='1.0' encoding='UTF-8'?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>baz.tiff</fileName>
</sourceImageInformation>
</Description>
</alto><?xml version='1.0' encoding='UTF-8'?>
<alto xmlns="http://www.loc.gov/standards/alto/ns-v3#" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.loc.gov/standards/alto/ns-v3# http://www.loc.gov/alto/v3/alto-3-0.xsd">
<Description>
<MeasurementUnit>pixel</MeasurementUnit>
<sourceImageInformation>
<fileName>foo.tiff</fileName>
</sourceImageInformation>
</Description>
</alto>