Here is a start. It parses the data into a ParseResults data structure, which you can then walk to build a parser for the document type being defined:
from pyparsing import *

# punctuation is matched but left out of the results
LT,GT,EXCLAM,LBRACK,RBRACK,LPAR,RPAR = map(Suppress,"<>![]()")
DOCTYPE = Keyword("DOCTYPE").suppress()
ELEMENT = Keyword("ELEMENT").suppress()
ident = Word(alphas, alphanums+"_")

# an element name with an optional repetition marker
elementRef = Group(ident("name") + Optional(oneOf("* +")("rep")))
# ',' and '|' are binary operators; ',' is listed first, so it binds tighter
elementExpr = infixNotation(elementRef,
    [
    (',', 2, opAssoc.LEFT),
    ('|', 2, opAssoc.LEFT),
    ])
PCDATA = Literal(r"\#PCDATA")
elementDefn = Group(LT+EXCLAM + ELEMENT + ident("name") +
    LPAR + (elementExpr | PCDATA("PCDATA"))("contents") + RPAR + GT)
# the whole DOCTYPE declaration: a name plus a bracketed list of ELEMENT definitions
doctypeDefn = (LT+EXCLAM + DOCTYPE + ident("name") +
    LBRACK + ZeroOrMore(elementDefn)("elements") + RBRACK + GT)
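I started out using just a delimitedList for the list of elements in each ELEMENT definition, but then I noticed that ',' and '|' are actually operators, not just delimiters, and can even be mixed, as in "A,B,C|D,E". So I used pyparsing's infixNotation helper to allow for those kinds of definitions.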
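To make the difference concrete, here is a minimal sketch that runs only the expression part of the grammar on the mixed string mentioned above. The snake_case variable names are mine, chosen for illustration; the pyparsing calls are the same ones used in the grammar.

```python
from pyparsing import (Group, Optional, Word, alphas, alphanums,
                       infixNotation, oneOf, opAssoc)

ident = Word(alphas, alphanums + "_")
element_ref = Group(ident("name") + Optional(oneOf("* +")("rep")))

# Operators listed earlier bind tighter, so ',' groups before '|'.
element_expr = infixNotation(element_ref, [
    (",", 2, opAssoc.LEFT),
    ("|", 2, opAssoc.LEFT),
])

parsed = element_expr.parseString("A,B,C|D,E", parseAll=True)
nested = parsed.asList()

top = nested[0]          # the '|' level: [left, '|', right]
left, op, right = top    # each side is a ',' sequence of element refs
print(op)
print(left)
print(right)
```

A delimitedList would only hand back a flat list of names, whereas here the two ',' sequences come back as separate sublists on either side of the '|'.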
Using your input sample, I can parse and display the results:
doctype = doctypeDefn.parseString(sample)
print(doctype.dump())
for elem in doctype.elements:
    print(elem.dump())
which gives:
['PcSpecs', ['PCS', ['PC', '*']], ['PC', [['MODEL'], ...
- elements: [['PCS', ['PC', '*']], ['PC', [['MODEL'], ...
- name: PcSpecs
['PCS', ['PC', '*']]
- contents: ['PC', '*']
  - name: PC
  - rep: *
- name: PCS
['PC', [['MODEL'], ',', ['PRICE'], ',', ['PROCESSOR'], ',', ['RAM'], ',', ['DISK', '+']]]
- contents: [['MODEL'], ',', ['PRICE'], ',', ['PROCESSOR'], ',', ['RAM'], ',', ['DISK', '+']]
- name: PC
['MODEL', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: MODEL
['PRICE', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: PRICE
['PROCESSOR', [['MANF'], ',', ['MODEL'], ',', ['SPEED']]]
- contents: [['MANF'], ',', ['MODEL'], ',', ['SPEED']]
- name: PROCESSOR
['MANF', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: MANF
['MODEL', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: MODEL
['SPEED', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: SPEED
['RAM', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: RAM
['DISK', [['HARDDISK'], '|', ['CD'], '|', ['DVD']]]
- contents: [['HARDDISK'], '|', ['CD'], '|', ['DVD']]
- name: DISK
['HARDDISK', [['MANF'], ',', ['MODEL'], ',', ['SIZE']]]
- contents: [['MANF'], ',', ['MODEL'], ',', ['SIZE']]
- name: HARDDISK
['SIZE', '\\#PCDATA']
- PCDATA: \#PCDATA
- contents: \#PCDATA
- name: SIZE
['CD', ['SPEED']]
- contents: ['SPEED']
  - name: SPEED
- name: CD
['DVD', ['SPEED']]
- contents: ['SPEED']
  - name: SPEED
- name: DVD
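As one possible next step for walking these results, here is a rough sketch that turns a contents list into a regular expression over child element names. The lists are copied from the dump above as plain Python lists, so the sketch runs without pyparsing. The function name content_to_regex, the convention of writing each child name followed by a single space, and the treatment of a PCDATA string as "no child elements" are all my own assumptions for illustration.

```python
import re

def content_to_regex(node):
    """Build a regex matching a space-terminated sequence of child names."""
    if isinstance(node, str):
        # Assumption: a bare string such as '\\#PCDATA' means no child elements.
        return ""
    if isinstance(node[0], str):
        # Leaf element reference: ['NAME'] or ['NAME', '*'] / ['NAME', '+'].
        name = node[0]
        rep = node[1] if len(node) > 1 else ""
        return "(?:%s )%s" % (re.escape(name), rep)
    # Operator node: operands at even positions, the same operator at odd ones.
    op = node[1]
    parts = [content_to_regex(child) for child in node[::2]]
    joiner = "" if op == "," else "|"   # ',' is a sequence, '|' is a choice
    return "(?:" + joiner.join(parts) + ")"

pcs_contents = ['PC', '*']
pc_contents = [['MODEL'], ',', ['PRICE'], ',', ['PROCESSOR'], ',',
               ['RAM'], ',', ['DISK', '+']]
disk_contents = [['HARDDISK'], '|', ['CD'], '|', ['DVD']]

pcs_re = re.compile(content_to_regex(pcs_contents))
pc_re = re.compile(content_to_regex(pc_contents))
disk_re = re.compile(content_to_regex(disk_contents))

# Children of a <PC> element, written as names each followed by a space.
print(bool(pc_re.fullmatch("MODEL PRICE PROCESSOR RAM DISK DISK ")))
print(bool(pc_re.fullmatch("MODEL PRICE ")))
print(bool(disk_re.fullmatch("CD ")))
```

Because infixNotation groups each precedence level separately, every operator node holds a single kind of operator, which is why looking at node[1] alone is enough in this sketch.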