【Posted】: 2017-02-19 06:05:18
【Question】:
import xml.etree.ElementTree as ET
import csv
import re
import codecs
import io
xml = open('ipa110106.xml')
line_num=0
f = open('workfile.xml', 'w')
for line in xml:
    line_num+=1
    if line_num == 1:
        print (line)
    if '<?xml version="1.0" encoding="UTF-8"?>' in line and line_num !=1:
        count =count+1
        line = line.replace('<?xml version="1.0" encoding="UTF-8"?>', '')
    if '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>' in line:
        line = line.replace('<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>', '')
        count2+=1
    if "!DOCTYPE" in line:
        line=line.replace('<!DOCTYPE sequence-cwu SYSTEM "us-sequence-listing.dtd" [ ]>','')
    f.write(line)
f.close()
with open("workfile.xml") as f:
    xml = f.read()
tree = ET.fromstring(re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", xml) + "</root>")
root= tree.getroot()
Result:
<?xml version="1.0" encoding="UTF-8"?>
0
Traceback (most recent call last):
  File "<ipython-input-164-4d6fc9ea9aac>", line 1, in <module>
    runfile('C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py', wdir='C:/Users/Harshit/Downloads/ipa110106 (1)')
  File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "C:\Users\Harshit\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/Harshit/Downloads/ipa110106 (1)/parsing_test5.py", line 41, in <module>
    root= tree.getroot()
AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getroot'
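I am trying to parse USPTO XML files to extract relevant information. Each file is a concatenation of multiple XML documents, so, following the standard advice given on this forum, I removed the repeated instances of <?xml version="1.0" encoding="UTF-8"?> and <!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
because they were also causing an error: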
ParseError: not well-formed (invalid token): line 2, column 2.
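Finally, after removing these troublesome declarations from the XML, I wrapped the content in a synthetic parent root to turn the file into well-formed XML. However, when I try to parse this file and access its root, I get the AttributeError shown in the traceback. The code is included in this post.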
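The AttributeError in the traceback is consistent with how the standard library behaves: ET.fromstring returns the root Element directly, whereas getroot() is a method of ElementTree objects (for example, what ET.parse returns). A minimal sketch of the difference, assuming a tiny made-up input in place of workfile.xml (the `cleaned` string and its `id` attributes are illustrative and not taken from the USPTO data):

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the cleaned file: one XML declaration
# followed by several top-level elements (not well-formed by itself).
cleaned = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<us-patent-application id="a"/>\n'
    '<us-patent-application id="b"/>\n'
)

# Same wrapping idea as in the post: open <root> right after the
# declaration and close it at the very end.
wrapped = re.sub(r"(<\?xml[^>]+\?>)", r"\1<root>", cleaned) + "</root>"

# fromstring already hands back the root Element, so there is no
# getroot() to call on the result.
root = ET.fromstring(wrapped)
ids = [child.get("id") for child in root]

# If an ElementTree object is wanted, the Element can be wrapped in one.
tree = ET.ElementTree(root)
same_root = tree.getroot()
```

Under these assumptions, iterating over `root` visits each wrapped application element directly, so the `tree.getroot()` call in the post may simply be unnecessary.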
Also, the XML file is very large, so I can only share a link to it.
A small sample of a (similar) XML file:
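Since the file is large, one possible alternative to reading and wrapping everything at once is to split the stream at each XML declaration and parse one document at a time. This is only a sketch under the assumption that every concatenated document starts on a line beginning with `<?xml`; the helper name `iter_documents` and the two-document sample are made up for illustration:

```python
import io
import xml.etree.ElementTree as ET


def iter_documents(lines):
    """Yield each concatenated XML document as a single string.

    Assumption: every document begins on a line that starts with '<?xml'.
    """
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next document.
        if line.lstrip().startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


# Made-up miniature of the concatenated file (two documents back to back).
sample = io.StringIO(
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-application><doc-number>1</doc-number></us-patent-application>\n'
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>\n'
    '<us-patent-application><doc-number>2</doc-number></us-patent-application>\n'
)

# Each chunk is a complete document, so it is parsed without a wrapper root.
doc_numbers = [
    ET.fromstring(doc.encode("utf-8")).findtext("doc-number")
    for doc in iter_documents(sample)
]
print(doc_numbers)
```

With the real data, the `sample` object would be replaced by an open file handle, for example `open('ipa110106.xml', encoding='utf-8')`, so that only one document is held in memory at a time.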
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE us-patent-application SYSTEM "us-patent-application-v42-2006-08-23.dtd" [ ]>
<us-patent-application lang="EN" dtd-version="v4.2 2006-08-23" file="US20110000001A1-20110106.XML" status="PRODUCTION" id="us-patent-application" country="US" date-produced="20101222" date-publ="20110106">
<us-bibliographic-data-application lang="EN" country="US">
<publication-reference>
<document-id>
<country>US</country>
<doc-number>20110000001</doc-number>
<kind>A1</kind>
<date>20110106</date>
</document-id>
</publication-reference>
<application-reference appl-type="utility">
<document-id>
<country>US</country>
<doc-number>12838840</doc-number>
<date>20100719</date>
</document-id>
</application-reference>
<us-application-series-code>12</us-application-series-code>
<priority-claims>
<priority-claim sequence="01" kind="national">
<country>IL</country>
<doc-number>189088</doc-number>
<date>20080128</date>
</priority-claim>
</priority-claims>
<classifications-ipcr>
<classification-ipcr>
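(The sample is cut off at this point.) To illustrate the kind of extraction I am after, here is a rough sketch that pulls the publication reference out of one application. It assumes a shortened, closed-off fragment built from the elements in the sample; the `fragment` string and the `record` dictionary are illustrative names, not part of my script:

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed fragment based on the sample in this post.
fragment = """<us-patent-application country="US" date-publ="20110106">
  <us-bibliographic-data-application lang="EN" country="US">
    <publication-reference>
      <document-id>
        <country>US</country>
        <doc-number>20110000001</doc-number>
        <kind>A1</kind>
        <date>20110106</date>
      </document-id>
    </publication-reference>
  </us-bibliographic-data-application>
</us-patent-application>"""

app = ET.fromstring(fragment)

# Walk down to the document-id of the publication reference.
doc_id = app.find(
    "./us-bibliographic-data-application/publication-reference/document-id"
)

# Collect the child elements as a tag -> text mapping.
record = {child.tag: child.text for child in doc_id}
publication_date = app.get("date-publ")
```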
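【Comments】:
-
Where is the code? And an XML sample?
-
Please do not post code behind external links; include it in your post.
-
Hi, I have now posted the code and a link to the XML. Sorry that the post was incomplete.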