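【Question title】: Parser for XML DTD file
【Posted】: 2014-03-31 10:02:34
【Question description】:

I am very new to implementing parsers, and I am trying to parse an XML DTD file in order to generate a context-free grammar for it. I tried pyparsing and yacc, but I still could not get any results. So I would appreciate it if someone could give me some tips or sample code for writing such a parser. Here is a sample DTD file: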

<!DOCTYPE PcSpecs [
<!ELEMENT PCS (PC*)>
<!ELEMENT PC (MODEL, PRICE, PROCESSOR, RAM, DISK+)>
<!ELEMENT MODEL (\#PCDATA)>
<!ELEMENT PRICE (\#PCDATA)>
<!ELEMENT PROCESSOR (MANF, MODEL, SPEED)>
<!ELEMENT MANF (\#PCDATA)>
<!ELEMENT MODEL (\#PCDATA)>
<!ELEMENT SPEED (\#PCDATA)>
<!ELEMENT RAM (\#PCDATA)>
<!ELEMENT DISK (HARDDISK | CD | DVD)>
<!ELEMENT HARDDISK (MANF, MODEL, SIZE)>
<!ELEMENT SIZE (\#PCDATA)>
<!ELEMENT CD (SPEED)>
<!ELEMENT DVD (SPEED)>
]>

Thanks in advance.

【Comments】:

    Tags: python xml-parsing yacc lexer pyparsing


    【Solution 1】:

    Here is a start. It parses the data into a ParseResults data structure, which you can then walk to create a parser for the defined document type:

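    from pyparsing import *

    # punctuation is matched but suppressed from the results
    LT,GT,EXCLAM,LBRACK,RBRACK,LPAR,RPAR = map(Suppress,"<>![]()")
    DOCTYPE = Keyword("DOCTYPE").suppress()
    ELEMENT = Keyword("ELEMENT").suppress()
    ident = Word(alphas, alphanums+"_")
    # an element name with an optional repetition marker
    elementRef = Group(ident("name") + Optional(oneOf("* +")("rep")))
    # operators are listed from highest to lowest precedence,
    # so ',' binds more tightly than '|'
    elementExpr = infixNotation(elementRef,
        [
        (',', 2, opAssoc.LEFT),
        ('|', 2, opAssoc.LEFT),
        ])
    # the sample spells this token with a leading backslash
    PCDATA = Literal(r"\#PCDATA")
    elementDefn = Group(LT+EXCLAM + ELEMENT + ident("name") +
                      LPAR + (elementExpr | PCDATA("PCDATA"))("contents") + RPAR + GT)
    # the outer parentheses let the expression continue onto the next line
    doctypeDefn = (LT+EXCLAM + DOCTYPE + ident("name") +
                   LBRACK + ZeroOrMore(elementDefn)("elements") + RBRACK + GT)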
    

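    At first I just used delimitedList for the list of elements in each ELEMENT definition, but then I noticed that ',' and '|' are actually operators, not just delimiters, and can even be mixed, as in "A,B,C|D,E". So I used pyparsing's infixNotation helper to allow for these kinds of definitions.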
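
    To make the operator-versus-delimiter point concrete without pyparsing, here is a rough hand-written recursive-descent sketch using only the standard library. It is not how infixNotation is implemented; it only mirrors the same precedence order (',' binds tighter than '|') and produces nested lists shaped like the ones in the output further down. All function names here (tokenize, parse_expr, parse_choice, parse_sequence, parse_atom) are made up for illustration.

```python
import re

# An identifier, or one of the punctuation tokens used in content models.
TOKEN = re.compile(r"\s*([A-Za-z][A-Za-z0-9_]*|[,|()*+])")

def tokenize(text):
    """Split a content model such as 'A,B|C' into a list of tokens."""
    text = text.rstrip()
    tokens, pos = [], 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError("unexpected character at offset %d" % pos)
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

def parse_binary(tokens, operator, parse_operand):
    """Parse operand (operator operand)* into a flat list like [a, ',', b]."""
    first, tokens = parse_operand(tokens)
    items = [first]
    while tokens and tokens[0] == operator:
        operand, tokens = parse_operand(tokens[1:])
        items += [operator, operand]
    # A lone operand is returned as-is rather than wrapped in another list.
    return (first if len(items) == 1 else items), tokens

def parse_choice(tokens):
    """Lowest precedence level: alternatives separated by '|'."""
    return parse_binary(tokens, "|", parse_sequence)

def parse_sequence(tokens):
    """Higher precedence level: sequences separated by ','."""
    return parse_binary(tokens, ",", parse_atom)

def parse_atom(tokens):
    """A parenthesized sub-expression, or a name with optional '*' or '+'."""
    if not tokens:
        raise ValueError("unexpected end of input")
    if tokens[0] == "(":
        tree, tokens = parse_choice(tokens[1:])
        if not tokens or tokens[0] != ")":
            raise ValueError("missing ')'")
        return tree, tokens[1:]
    name, tokens = tokens[0], tokens[1:]
    if not name[0].isalpha():
        raise ValueError("expected an element name, got %r" % name)
    ref = [name]
    if tokens and tokens[0] in ("*", "+"):
        ref.append(tokens[0])
        tokens = tokens[1:]
    return ref, tokens

def parse_expr(text):
    """Parse a whole content model and reject trailing tokens."""
    tree, rest = parse_choice(tokenize(text))
    if rest:
        raise ValueError("unexpected token %r" % rest[0])
    return tree

print(parse_expr("A,B,C|D,E"))
```

    With ',' handled at the inner level, "A,B,C|D,E" groups as a choice between the sequence A,B,C and the sequence D,E, which is why a plain delimited list is not enough here.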

    With your input sample, I can parse and display the results:

    # sample is a string holding the DTD text from the question
    doctype = doctypeDefn.parseString(sample)
    print(doctype.dump())
    for elem in doctype.elements:
        print(elem.dump())
    

    Gives:

    ['PcSpecs', ['PCS', ['PC', '*']], ['PC', [['MODEL'], ...
    - elements: [['PCS', ['PC', '*']], ['PC', [['MODEL'], ...
    - name: PcSpecs
    ['PCS', ['PC', '*']]
    - contents: ['PC', '*']
      - name: PC
      - rep: *
    - name: PCS
    ['PC', [['MODEL'], ',', ['PRICE'], ',', ['PROCESSOR'], ',', ['RAM'], ',', ['DISK', '+']]]
    - contents: [['MODEL'], ',', ['PRICE'], ',', ['PROCESSOR'], ',', ['RAM'], ',', ['DISK', '+']]
    - name: PC
    ['MODEL', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: MODEL
    ['PRICE', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: PRICE
    ['PROCESSOR', [['MANF'], ',', ['MODEL'], ',', ['SPEED']]]
    - contents: [['MANF'], ',', ['MODEL'], ',', ['SPEED']]
    - name: PROCESSOR
    ['MANF', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: MANF
    ['MODEL', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: MODEL
    ['SPEED', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: SPEED
    ['RAM', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: RAM
    ['DISK', [['HARDDISK'], '|', ['CD'], '|', ['DVD']]]
    - contents: [['HARDDISK'], '|', ['CD'], '|', ['DVD']]
    - name: DISK
    ['HARDDISK', [['MANF'], ',', ['MODEL'], ',', ['SIZE']]]
    - contents: [['MANF'], ',', ['MODEL'], ',', ['SIZE']]
    - name: HARDDISK
    ['SIZE', '\\#PCDATA']
    - PCDATA: \#PCDATA
    - contents: \#PCDATA
    - name: SIZE
    ['CD', ['SPEED']]
    - contents: ['SPEED']
      - name: SPEED
    - name: CD
    ['DVD', ['SPEED']]
    - contents: ['SPEED']
      - name: SPEED
    - name: DVD
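
    As one possible next step toward the context-free grammar the question asks for, the sketch below walks nested lists shaped like the output above and builds productions. It works on plain Python lists copied by hand from that output, so it does not depend on pyparsing. The helper names (expand, to_cfg), the PCDATA terminal name, and the _star / _plus helper nonterminals are assumptions made for illustration.

```python
def expand(node, rules):
    """Return the alternatives (each a list of symbols) for one contents node."""
    if isinstance(node, str):
        # The PCDATA literal: treat it as a single terminal symbol.
        return [["PCDATA"]]
    if isinstance(node[0], str):
        # Element reference: [name] or [name, rep].
        name = node[0]
        if len(node) == 1:
            return [[name]]
        if node[1] == "*":
            helper = name + "_star"
            rules[helper] = [[], [name, helper]]       # empty | name helper
        else:
            helper = name + "_plus"
            rules[helper] = [[name], [name, helper]]   # name | name helper
        return [[helper]]
    # Operator group: operands sit at even positions, the operator at odd ones.
    operands = [expand(child, rules) for child in node[0::2]]
    if node[1] == "|":
        # Choice: every alternative of every operand is an alternative.
        return [alt for alts in operands for alt in alts]
    # Sequence: concatenate one alternative from each operand, in order.
    result = [[]]
    for alts in operands:
        result = [left + right for left in result for right in alts]
    return result

def to_cfg(elements):
    """Map each [name, contents] pair to a list of productions."""
    rules = {}
    for name, contents in elements:
        rules[name] = expand(contents, rules)
    return rules

# A few entries copied by hand from the parsed output shown above.
elements = [
    ["PCS", ["PC", "*"]],
    ["PC", [["MODEL"], ",", ["PRICE"], ",", ["PROCESSOR"], ",", ["RAM"], ",", ["DISK", "+"]]],
    ["MODEL", "\\#PCDATA"],
    ["DISK", [["HARDDISK"], "|", ["CD"], "|", ["DVD"]]],
    ["CD", ["SPEED"]],
]

for lhs, alternatives in to_cfg(elements).items():
    body = " | ".join(" ".join(alt) or "<empty>" for alt in alternatives)
    print(lhs, "->", body)
```

    Repetition is expressed with right-recursive helper rules, so the result stays an ordinary CFG without any regex-style operators. An element that is declared twice, like MODEL in the sample, simply keeps its last definition.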
    

    【Comments】:
