[Question title]: Parse elements and sub-elements from wikitext with Python 3
[Posted]: 2015-08-21 02:23:55
[Question description]:

I am trying to parse some wikitext. Here is a sample of the text I need to parse:

== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
...

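The structure here is not complicated:
title: I believe there is at least one title in the whole document
subtopic: optional
element: each topic/subtopic must have at least one
sub-element: optional, and can be repeated

If sub-elements are repeated, I intend to join them using \ln.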

What I want to do is parse it into a dictionary with the following structure:

{
"title": "title",
"subtopic": "subtopic",
"main_text": "text_1",
"sub_text": "text dependent on text_1 \ln text_2 dependent on text_1"}

Do you know of any pythonic way, or have any ideas, to parse it into what I want? Thank you very much for your time.
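For a sample as regular as the one above, a line-by-line pass can produce records of the requested shape. The following is only a minimal sketch under stated assumptions: `parse_outline`, `HEADING` and `SAMPLE` are hypothetical names, the `sep` default of a newline is a guess at what `\ln` is meant to be, headings deeper than `===` are ignored, and real wiki pages contain templates, links and markup that this does not handle.

```python
import re

# A heading is the same run of '=' on both sides, e.g. "== title ==".
HEADING = re.compile(r"^(={2,})\s*(.*?)\s*\1\s*$")

SAMPLE = """
== title ==
=== subtopic ===
*text_1
**text dependent on text_1
**text_2 dependent on text_1
*text_2
**text dependent on text_2
=== other subtopic ===
*text_2
**text dependent on text_2
== other title ==
"""


def parse_outline(wikitext, sep="\n"):
    """Return one dict per '*' element, with its '**' lines joined by sep."""
    records = []
    title = subtopic = None
    current = None  # record of the most recent '*' element, if any
    for raw in wikitext.splitlines():
        line = raw.strip()
        if not line:
            continue
        heading = HEADING.match(line)
        if heading:
            level = len(heading.group(1))
            if level == 2:
                # A new title resets the subtopic.
                title, subtopic = heading.group(2), None
            elif level == 3:
                subtopic = heading.group(2)
            current = None  # bullets never carry over a heading
        elif line.startswith("**"):
            # Sub-elements without a preceding element are skipped.
            if current is not None:
                current["sub_text"].append(line[2:].strip())
        elif line.startswith("*"):
            current = {
                "title": title,
                "subtopic": subtopic,
                "main_text": line[1:].strip(),
                "sub_text": [],
            }
            records.append(current)
    for record in records:
        record["sub_text"] = sep.join(record["sub_text"])
    return records


records = parse_outline(SAMPLE)
```

Each entry of `records` is a flat dictionary, so an element without a subtopic simply has `"subtopic": None`, and an element without sub-elements has an empty `"sub_text"`.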

PS. This is the full file I am trying to parse and extract quotes from: Woody Allen

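[Comments]:

  • Possible duplicate of Parsing a Wikipedia dump
  • There doesn't seem to be a list matching your format on Woody Allen's Wikipedia page...
  • @poke, because that is the format of the Wikiquote page, see my answer.

Tags: parsing python-3.x text wikitext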


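[Solution 1]:

You say "quotes", but you linked to Wikipedia. Did you mean Wikiquote?

In any case, you should not parse the wikitext yourself. You can achieve your goal with the parse API, which you can access using a Python client.

For example, here is the list of sections (i.e. the works quoted) of his Wikiquote article, from https://en.wikiquote.org/w/api.php?action=parse&page=Woody_Allen&prop=sections: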

{
    "parse": {
        "title": "Woody Allen",
        "pageid": 80,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes",
                "number": "1",
                "index": "1",
                "fromtitle": "Woody_Allen",
                "byteoffset": 657,
                "anchor": "Quotes"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Getting Even</i> (1971)",
                "number": "1.1",
                "index": "2",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11322,
                "anchor": "Getting_Even_.281971.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "<i>My Philosophy</i>",
                "number": "1.1.1",
                "index": "3",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11471,
                "anchor": "My_Philosophy"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Everything You Always Wanted to Know About Sex* (*But Were Afraid to Ask)</i> (1972)",
                "number": "1.2",
                "index": "4",
                "fromtitle": "Woody_Allen",
                "byteoffset": 11814,
                "anchor": "Everything_You_Always_Wanted_to_Know_About_Sex.2A_.28.2ABut_Were_Afraid_to_Ask.29_.281972.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Sleeper</i> (1973)",
                "number": "1.3",
                "index": "5",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12364,
                "anchor": "Sleeper_.281973.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Love and Death</i> (1975)",
                "number": "1.4",
                "index": "6",
                "fromtitle": "Woody_Allen",
                "byteoffset": 12858,
                "anchor": "Love_and_Death_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Without Feathers</i> (1975)",
                "number": "1.5",
                "index": "7",
                "fromtitle": "Woody_Allen",
                "byteoffset": 14090,
                "anchor": "Without_Feathers_.281975.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Annie Hall</i> (1977)",
                "number": "1.6",
                "index": "8",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16485,
                "anchor": "Annie_Hall_.281977.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Side Effects</i> (1980)",
                "number": "1.7",
                "index": "9",
                "fromtitle": "Woody_Allen",
                "byteoffset": 16899,
                "anchor": "Side_Effects_.281980.29"
            },
            {
                "toclevel": 3,
                "level": "4",
                "line": "My Apology",
                "number": "1.7.1",
                "index": "10",
                "fromtitle": "Woody_Allen",
                "byteoffset": 17529,
                "anchor": "My_Apology"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Manhattan Murder Mystery</i> (1993)",
                "number": "1.8",
                "index": "11",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18579,
                "anchor": "Manhattan_Murder_Mystery_.281993.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Don't Drink the Water</i> (1994)",
                "number": "1.9",
                "index": "12",
                "fromtitle": "Woody_Allen",
                "byteoffset": 18960,
                "anchor": "Don.27t_Drink_the_Water_.281994.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Deconstructing Harry</i> (1997)",
                "number": "1.10",
                "index": "13",
                "fromtitle": "Woody_Allen",
                "byteoffset": 19228,
                "anchor": "Deconstructing_Harry_.281997.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Standup Comic</i> (1999)",
                "number": "1.11",
                "index": "14",
                "fromtitle": "Woody_Allen",
                "byteoffset": 21289,
                "anchor": "Standup_Comic_.281999.29"
            },
            {
                "toclevel": 2,
                "level": "3",
                "line": "<i>Mere Anarchy</i> (2007)",
                "number": "1.12",
                "index": "15",
                "fromtitle": "Woody_Allen",
                "byteoffset": 22463,
                "anchor": "Mere_Anarchy_.282007.29"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Attributed",
                "number": "2",
                "index": "16",
                "fromtitle": "Woody_Allen",
                "byteoffset": 24181,
                "anchor": "Attributed"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Others",
                "number": "3",
                "index": "17",
                "fromtitle": "Woody_Allen",
                "byteoffset": 25045,
                "anchor": "Others"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "Quotes about Allen",
                "number": "4",
                "index": "18",
                "fromtitle": "Woody_Allen",
                "byteoffset": 27525,
                "anchor": "Quotes_about_Allen"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "External links",
                "number": "5",
                "index": "19",
                "fromtitle": "Woody_Allen",
                "byteoffset": 29106,
                "anchor": "External_links"
            }
        ]
    }
}
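A response like the one above could be requested and flattened into an outline with the standard library alone. This is a rough sketch, not the answer's own code: `build_parse_url`, `fetch_json` and `outline` are hypothetical helper names, `format=json` is added to the URL shown earlier so the reply is machine-readable, and the User-Agent string is a placeholder.

```python
import json
import re
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API = "https://en.wikiquote.org/w/api.php"


def build_parse_url(page, prop="sections", **extra):
    """Build an action=parse request URL for the given page."""
    params = {"action": "parse", "page": page, "prop": prop, "format": "json"}
    params.update(extra)
    return API + "?" + urlencode(params)


def fetch_json(url):
    """Perform the HTTP request and decode the JSON reply (network access)."""
    request = Request(url, headers={"User-Agent": "wikitext-example/0.1 (demo)"})
    with urlopen(request) as response:
        return json.load(response)


def outline(sections):
    """Return (toclevel, number, title) per section, with HTML tags removed."""
    rows = []
    for section in sections:
        title = re.sub(r"<[^>]+>", "", section["line"])
        rows.append((section["toclevel"], section["number"], title))
    return rows


# Two entries copied from the response above, trimmed to the fields used here.
sample_sections = [
    {"toclevel": 1, "number": "1", "line": "Quotes"},
    {"toclevel": 2, "number": "1.1", "line": "<i>Getting Even</i> (1971)"},
]
rows = outline(sample_sections)
```

With network access, `outline(fetch_json(build_parse_url("Woody_Allen"))["parse"]["sections"])` would apply the same flattening to the live section list; `toclevel` then tells which work (level 2) belongs to which top-level heading (level 1).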

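[Comments]:

  • This doesn't give you the actual section text, and if you use the parse API for that, you get HTML, which also needs to be parsed. So you have only moved the "I need to parse this" problem from wikitext to HTML.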
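  • @poke, the OP never said they needed plain text. As for the content, I only included the section headings for brevity, but I will link the documentation that explains how to get the text they contain using the section and prop=text parameters.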
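
The two comments above can be combined into a small sketch: request one section's HTML with the `section` and `prop=text` parameters, then walk the HTML for list items. Everything named here (`section_text_url`, `section_html`, `ListItems`, `list_items`) is hypothetical, and the code assumes that quotes are rendered as `<li>` elements inside nested `<ul>` lists, which should be checked against the actual page before relying on it.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode

API = "https://en.wikiquote.org/w/api.php"


def section_text_url(page, section_index):
    """URL for the rendered HTML of one section (its "index" in the section list)."""
    params = {"action": "parse", "page": page, "section": section_index,
              "prop": "text", "format": "json"}
    return API + "?" + urlencode(params)


def section_html(response):
    """Pull the HTML out of a decoded reply; older JSON output nests it under "*"."""
    text = response["parse"]["text"]
    return text["*"] if isinstance(text, dict) else text


class ListItems(HTMLParser):
    """Collect [depth, text] for every <li>; depth counts the enclosing <ul>."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.items = []  # all items, in document order
        self.open = []   # stack of <li> items not yet closed

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.depth += 1
        elif tag == "li":
            item = [self.depth, ""]
            self.items.append(item)
            self.open.append(item)

    def handle_endtag(self, tag):
        if tag == "ul":
            self.depth -= 1
        elif tag == "li" and self.open:
            self.open.pop()

    def handle_data(self, data):
        # Text belongs to the innermost open <li>, so nested items stay separate.
        if self.open:
            self.open[-1][1] += data


def list_items(html):
    """Return (depth, stripped text) for each list item in the HTML."""
    parser = ListItems()
    parser.feed(html)
    parser.close()
    return [(depth, text.strip()) for depth, text in parser.items]


demo = "<ul><li>quote one<ul><li>source of quote one</li></ul></li><li>quote two</li></ul>"
items = list_items(demo)
```

Depth 1 items then play the role of the question's `*` elements and depth 2 items the role of its `**` sub-elements, so the HTML route ends up with the same two-level structure as the wikitext.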