How can a Markdown list be parsed to a dictionary in Python? Answers

【Question title】: How can a Markdown list be parsed to a dictionary in Python?
【Posted】: 2014-10-07 08:19:08
【Question description】:

I have a list such as the following:

- launchers
   - say hello
      - command: echo "hello" | festival --tts
      - icon: sayHello.png
   - say world
      - command: echo "world" | festival --tts
      - icon: sayWorld.png
   - wait
      - command: for ((x = 0; x < 10; ++x)); do :; done
      - icon: wait.png

I want to parse it into a dictionary such as the following:

{
    "launchers": {
        "say hello": {
            "command": "echo \"hello\" | festival --tts",
            "icon": "sayHello.png"
        },
        "say world": {
            "command": "echo \"world\" | festival --tts",
            "icon": "sayWorld.png"
        },
        "wait": {
            "command": "for ((x = 0; x < 10; ++x)); do :; done",
            "icon": "wait.png"
        }
    }
}

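I have started writing some very manual code that counts leading spaces (for example, len(line.rstrip()) - len(line.rstrip().lstrip())), but I wonder whether there is a more sensible way to approach this. I know that JSON can be imported into Python, but that does not suit my purposes. So, how can a Markdown list in a file be parsed to a dictionary in Python? Is there an efficient way of doing this?

Here is some basic code I am using at the moment: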

with open("configuration.md", "r") as configuration:
    for line in configuration:
        if not line.strip():
            continue  # skip blank lines
        indentation = len(line.rstrip()) - len(line.rstrip().lstrip())
        # Remove only the leading "-" marker; values may contain "-" (e.g. --tts).
        list_item = line.strip().lstrip("-").strip()
        # Split on the first ":" only; values may contain ":" (e.g. "do :; done").
        key, _, value = list_item.partition(":")
        print(indentation, key.strip(), value.strip())

【Comments on the question】:

See this and this.
This looks a lot like [YAML](yaml.org) to me. Why not just use that?

Tags: python list parsing markdown

【Solution 1】:

I would adopt a stricter format and use a stack and a regular expression:

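import re

# inputtext is assumed to hold the Markdown list as a single string.
# Groups: leading spaces, item name, optional value after ': '.
line = re.compile(r'( *)- ([^:\n]+)(?:: ([^\n]*))?\n?')
depth = 0
stack = [{}]
for indent, name, value in line.findall(inputtext):
    indent = len(indent)
    if indent > depth:
        assert not stack[-1], 'unexpected indent'
    elif indent < depth:
        # dedent by one level: return to the parent dictionary
        stack.pop()
    stack[-1][name] = value or {}
    if not value:
        # new branch
        stack.append(stack[-1][name])
    depth = indent

result = stack[0]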

This produces:

>>> import re
>>> inputtext = '''\
... - launchers
...    - say hello
...       - command: echo "hello" | festival --tts
...       - icon: sayHello.png
...    - say world
...       - command: echo "world" | festival --tts
...       - icon: sayWorld.png
...    - wait
...       - command: for ((x = 0; x < 10; ++x)); do :; done
...       - icon: wait.png
... '''
>>> line = re.compile(r'( *)- ([^:\n]+)(?:: ([^\n]*))?\n?')
>>> depth = 0
>>> stack = [{}]
>>> for indent, name, value in line.findall(inputtext):
...     indent = len(indent)
...     if indent > depth:
...         assert not stack[-1], 'unexpected indent'
...     elif indent < depth:
...         stack.pop()
...     stack[-1][name] = value or {}
...     if not value:
...         # new branch
...         stack.append(stack[-1][name])
...     depth = indent
... 
{'command': 'echo "hello" | festival --tts', 'icon': 'sayHello.png'}
{'command': 'echo "world" | festival --tts', 'icon': 'sayWorld.png'}
>>> result = stack[0]
>>> from pprint import pprint
>>> pprint(result)
{'launchers': {'say hello': {'command': 'echo "hello" | festival --tts',
                             'icon': 'sayHello.png'},
               'say world': {'command': 'echo "world" | festival --tts',
                             'icon': 'sayWorld.png'},
               'wait': {'command': 'for ((x = 0; x < 10; ++x)); do :; done',
                        'icon': 'wait.png'}}}

from your input text.
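
The loop above pops the stack once per dedent, so it relies on the strict format: every dedent steps back exactly one level, and every item without a value is followed by children. A minimal sketch of a variant that relaxes this by storing the indentation width next to each dictionary on the stack follows. The names ITEM and parse_markdown_list are hypothetical, introduced only for illustration, and this is one possible adaptation rather than the answer's own code.

```python
import re
from pprint import pprint

# Same line shape as the regex above: optional leading spaces, "- ",
# a name, and an optional ": value" part.
ITEM = re.compile(r'^( *)- ([^:\n]+)(?:: (.*))?$')


def parse_markdown_list(text):
    """Parse an indented "- key: value" list into nested dictionaries."""
    root = {}
    # Each entry pairs an indentation width with the dictionary that
    # receives items nested deeper than that width.
    stack = [(-1, root)]
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue  # skip blank lines
        match = ITEM.match(line)
        if match is None:
            raise ValueError(f"not a list item: {raw!r}")
        indent = len(match.group(1))
        name = match.group(2).strip()
        value = match.group(3)
        # Close every branch that is at least as deep as this item,
        # which handles siblings and dedents of several levels at once.
        while stack[-1][0] >= indent:
            stack.pop()
        parent = stack[-1][1]
        if value:
            parent[name] = value.strip()
        else:
            # No value: open a new branch for the items nested below.
            parent[name] = {}
            stack.append((indent, parent[name]))
    return root


SAMPLE = '''\
- launchers
   - say hello
      - command: echo "hello" | festival --tts
      - icon: sayHello.png
   - wait
      - command: for ((x = 0; x < 10; ++x)); do :; done
      - icon: wait.png
'''

pprint(parse_markdown_list(SAMPLE))
```

Because the stack is unwound by comparing indentation widths, the indentation step does not have to be uniform; lines that do not look like list items raise a ValueError instead of being skipped silently.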

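【Discussion】:

This is great! But wouldn't it be easier to parse HTML? md to html, then html to dict.
@user3197452: I'd say that would be overkill for a format this simple.
@user3197452: Note that there is no such thing as a definition list in Markdown; there is no - some term: term definition, so you would of course still have to parse that out of the HTML.
Oh, okay :) Thanks for the info!
Right, put that in your Sylladex and Captchalogue it. This is excellent and concise. Thank you very much for your help.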

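【Solution 2】:

Have you considered parsing the Markdown and then sending the output to an HTML parser?

You can use the Markdown package to parse the Markdown into HTML.

Then you can use the built-in HTMLParser library to find the lists and parse out the values. Alternatively, you could use lxml to parse the HTML.

That way you don't have to worry about the different indentation levels. The Markdown library takes care of that for you and converts the text into a format on which you can easily do additional processing.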
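
As a rough sketch of the second step, the following uses the standard library's html.parser to turn nested ul/li elements into nested dictionaries. The class ListToDict and the function html_list_to_dict are hypothetical names introduced for illustration, and SAMPLE_HTML is hand-written to stand in for what a Markdown converter might emit; the conversion step itself is not shown. It is worth checking how the converter you choose treats the three-space indentation in the question, since some implementations may expect four spaces before they nest a list.

```python
from html.parser import HTMLParser


class ListToDict(HTMLParser):
    """Collect nested <ul>/<li> items into nested dictionaries."""

    def __init__(self):
        super().__init__()
        self.result = {}
        self.dict_stack = [self.result]  # dictionary for the current <ul>
        self.item_text = None            # text of the open <li>, if any

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.item_text = ""
        elif tag == "ul":
            if self.item_text is not None:
                # A list opening inside an item: the text read so far
                # is the name of a new branch.
                branch = {}
                self.dict_stack[-1][self.item_text.strip()] = branch
                self.item_text = None
            else:
                branch = self.dict_stack[-1]  # top-level list
            self.dict_stack.append(branch)

    def handle_data(self, data):
        if self.item_text is not None:
            self.item_text += data

    def handle_endtag(self, tag):
        if tag == "li" and self.item_text is not None:
            # A leaf item: split "key: value" on the first colon only.
            key, _, value = self.item_text.partition(":")
            self.dict_stack[-1][key.strip()] = value.strip()
            self.item_text = None
        elif tag == "ul" and len(self.dict_stack) > 1:
            self.dict_stack.pop()


def html_list_to_dict(html):
    parser = ListToDict()
    parser.feed(html)
    parser.close()
    return parser.result


# Hand-written stand-in for converter output (an assumption).
SAMPLE_HTML = """
<ul>
<li>launchers
<ul>
<li>say hello
<ul>
<li>command: echo "hello" | festival --tts</li>
<li>icon: sayHello.png</li>
</ul>
</li>
</ul>
</li>
</ul>
"""

print(html_list_to_dict(SAMPLE_HTML))
```

As the comments on the first answer point out, Markdown has no definition-list syntax, so the key: value split still has to be done by hand on each item's text.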

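【Discussion】:

That's a clever idea. Extracting the key-value pairs requires some additional parsing, but this is a good approach. Thank you very much for your help.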