【Question title】: Parsing CDATA (one more)
【Posted】: 2021-01-21 22:11:15
【Question】:

I need to parse the CDATA from the following SVG document:

<?xml version='1.0' encoding='UTF-8'?>
<!-- This file was generated by dvisvgm 2.4 -->
<svg height='28.692695pt' version='1.1' viewBox='-72.000004 -70.904267 60.575314 28.692695' width='60.575314pt' xmlns='http://www.w3.org/2000/svg' xmlns:xlink='http://www.w3.org/1999/xlink'>

<style type='text/css'>
<![CDATA[
text.f0 {font-family:cmex10;font-size:11.955168px}
text.f1 {font-family:cmmi12;font-size:11.955168px}
text.f2 {font-family:cmr12;font-size:11.955168px}
]]>
</style>
<g id='page1'>
<text class='f1' x='-72.000004' y='-53.569135'>c</text>
<text class='f2' x='-63.641186' y='-53.569135'>=</text>
<text class='f0' x='-51.215706' y='-70.426073'></text>
<text class='f1' x='-42.415333' y='-60.891712'>a<tspan x='-25.754955'>b</tspan>
<tspan x='-41.861851' y='-46.445899'>c</tspan>
<tspan x='-26.307752'>d</tspan>
</text>
<text class='f0' x='-20.225063' y='-70.426073'></text>
</g>
</svg>

The code I am using is as follows:

import xml.dom.minidom

file_svg= "my_path"

doc = xml.dom.minidom.parse(file_svg)

style = doc.getElementsByTagName('style')

cdata = style[0].firstChild.wholeText

This gives me the text inside the CDATA (output of printing cdata):


text.f0 {font-family:cmex10;font-size:11.955168px}
text.f1 {font-family:cmmi12;font-size:11.955168px}
text.f2 {font-family:cmr12;font-size:11.955168px}

But I need to organize this text into something like this:

{"f0":"cmex10","f1":"cmmi12","f2":"cmr12"}

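I am sure there is a way to extract the data from the text, namely the class names f0, f1, f2 and the font-family values cmex10, cmmi12, cmr12, using standard xml.dom.minidom operations.

I tried: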

style[0].firstChild.nodeValue

but it produced an empty string.

Can you help me solve this problem?

【Comments】:

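  • The whole point of a CDATA block is that its content is outside the scope of XML parsing. Try a regular expression or some other search technique to extract the information from the CDATA and convert it into the format you want.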
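
The regular-expression route suggested in the comment above could be sketched as follows. This is only a minimal sketch: the function name `parse_font_families` and the pattern are illustrative assumptions, and the pattern assumes every rule looks like `text.<class> {font-family:<name>;...}` as in the question's stylesheet.

```python
import re

def parse_font_families(css):
    """Map each `text.<class>` selector to its font-family value.

    Hypothetical helper: assumes rules shaped like the ones in the question.
    """
    # group 1: class name after "text."; group 2: value after "font-family:"
    pattern = re.compile(r'text\.(\w+)\s*\{[^}]*font-family:\s*([^;}]+)')
    return {cls: family.strip() for cls, family in pattern.findall(css)}

css = """
text.f0 {font-family:cmex10;font-size:11.955168px}
text.f1 {font-family:cmmi12;font-size:11.955168px}
text.f2 {font-family:cmr12;font-size:11.955168px}
"""

print(parse_font_families(css))
```

Because the pattern searches the whole string rather than splitting it line by line, the blank lines that surround the CDATA content in the original file do not need special handling.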

Tags: python xml cdata


【Solution 1】:

The following (using ElementTree):

import xml.etree.ElementTree as ET


xml = '''<?xml version="1.0" encoding="UTF-8"?>
<!-- This file was generated by dvisvgm 2.4 -->
<svg xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink" height="28.692695pt" version="1.1" viewBox="-72.000004 -70.904267 60.575314 28.692695" width="60.575314pt">
   <style type="text/css"><![CDATA[text.f0 {font-family:cmex10;font-size:11.955168px}
text.f1 {font-family:cmmi12;font-size:11.955168px}
text.f2 {font-family:cmr12;font-size:11.955168px}]]></style>
   <g id="page1">
      <text class="f1" x="-72.000004" y="-53.569135">c</text>
      <text class="f2" x="-63.641186" y="-53.569135">=</text>
      <text class="f0" x="-51.215706" y="-70.426073"></text>
      <text class="f1" x="-42.415333" y="-60.891712">
         a
         <tspan x="-25.754955">b</tspan>
         <tspan x="-41.861851" y="-46.445899">c</tspan>
         <tspan x="-26.307752">d</tspan>
      </text>
      <text class="f0" x="-20.225063" y="-70.426073"></text>
   </g>
</svg>'''
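root = ET.fromstring(xml)
# The element lives in the SVG namespace, so the tag must be qualified
style = root.find('{http://www.w3.org/2000/svg}style')
cdata_lines = style.text.split('\n')
data = {}
for line in cdata_lines:
    if not line.strip():
        continue  # skip blank lines, e.g. newlines around the CDATA markers
    # class name: between the first '.' and the first space
    dot_idx = line.find('.') + 1
    space_idx = line.find(' ')
    f = line[dot_idx:space_idx]
    # font family: between the first ':' and the first ';'
    colon_idx = line.find(':') + 1
    other_idx = line.find(';')
    font_family = line[colon_idx:other_idx]
    data[f] = font_family
print(data)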

Output:

{'f0': 'cmex10', 'f1': 'cmmi12', 'f2': 'cmr12'}

【Comments】:

    【Solution 2】:

    As pointed out in the comments, the CDATA content should be parsed as plain text. Here is a simple parsing example:

    text = '''text.f0 {font-family:cmex10;font-size:11.955168px}
    text.f1 {font-family:cmmi12;font-size:11.955168px}
    text.f2 {font-family:cmr12;font-size:11.955168px}'''
    
    d = {}
    
    for line in text.split('\n'):
      value = line.split(':')[1].split(';')[0]
      key = line.split('.')[1].split(' ')[0]
      d[key] = value
      
    print(d)
    

    Output:

    {'f0': 'cmex10', 'f1': 'cmmi12', 'f2': 'cmr12'}
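
To connect this back to the minidom code in the question: a likely reason `style[0].firstChild.nodeValue` looked empty is that the first child of `<style>` is a whitespace text node (the newline before `<![CDATA[`), while the CDATA section is a separate sibling node. A minimal sketch under that assumption is shown below; the helper name `cdata_text` and the trimmed-down `SVG` sample are illustrative, not part of the original post.

```python
import xml.dom.minidom

# Trimmed-down version of the question's file, same <style> layout
SVG = """<svg xmlns='http://www.w3.org/2000/svg'>
<style type='text/css'>
<![CDATA[
text.f0 {font-family:cmex10;font-size:11.955168px}
text.f1 {font-family:cmmi12;font-size:11.955168px}
text.f2 {font-family:cmr12;font-size:11.955168px}
]]>
</style>
</svg>"""

def cdata_text(svg_text, tag='style'):
    """Return the concatenated CDATA content of the first <tag> element."""
    doc = xml.dom.minidom.parseString(svg_text)
    element = doc.getElementsByTagName(tag)[0]
    # Keep only CDATA section nodes; the whitespace text nodes are skipped
    return ''.join(node.data for node in element.childNodes
                   if node.nodeType == node.CDATA_SECTION_NODE)

fonts = {}
for line in cdata_text(SVG).splitlines():
    if not line.strip():
        continue  # the CDATA content starts and ends with a newline
    key = line.split('.')[1].split(' ')[0]
    value = line.split(':')[1].split(';')[0]
    fonts[key] = value

print(fonts)
```

Selecting nodes by `nodeType` avoids depending on the position of the CDATA section among the children, which changes with the whitespace around it.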
    

    【Comments】:
