【Question title】: ANTLR4 parsing a Wiktionary article fails weirdly
【Posted】: 2020-12-17 23:35:13
【Question】:

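I am trying to parse MediaWiki markup, specifically the markup used on the English Wiktionary.
It is not a programming language, its handling of spaces and newlines is somewhat odd, and every step feels like trial and (a lot of) error.

Here is the repo: https://github.com/WorDB/wikitext-parser

The test input file is the "pie" article: pie.txt
(https://en.wiktionary.org/wiki/pie)

Note: I am parsing the entire XML dump of Wiktionary, so I would rather find a solution that parses with ANTLR than get suggestions to use some online API or the like.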

wikitext.g4

grammar wikitext;

/**
 Grammar
 */

page: EOL? ((wikitem | bullet_line) EOL? )+ EOF;

wikitem:
      wikitem wikitem
    | title 
    | template
    | link
    | text
    ;

title: title2 | title3 | title4 | title5;
title5: '=====' text '=====';
title4: '====' text '====';
title3: '===' text '===';
title2: '==' text '==';

template: '{{' parameter ('|' parameter)* '}}';
link: '[[' parameter ('|' parameter)* ']]';

parameter: wikitem?; // parameter can be empty, I.E. {{a|}}

bullet: ('*'|'#'|'#:'|'#*');
bullet_line: WS? EOL WS? bullet WS? wikitem;

text: (CHAR | WS)+;

/**
 Lexicon
 */
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;  

Error:

> cd ./java && grun wikitext page -gui ../data/pie.txt

line 190:137 no viable alternative at input 'rom {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{inh|en|enm|pye}}, from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|fro|pie}}, from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|la|pīca}}, feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'feminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'eminine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'minine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'inine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'nine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ine of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'ne of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'e of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'of {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'f {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{m|la|pīcus||woodpecker}}, from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'from {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'rom {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'om {{der|en|ine-pro|*'
line 190:137 no viable alternative at input 'm {{der|en|ine-pro|*'
line 190:137 no viable alternative at input ' {{der|en|ine-pro|*'
line 190:137 extraneous input '*' expecting {'|', '}}'}
line 190:146 no viable alternative at input 's)peyk-|'
line 190:146 no viable alternative at input ')peyk-|'
line 190:146 no viable alternative at input 'peyk-|'
line 190:146 no viable alternative at input 'eyk-|'
line 190:146 no viable alternative at input 'yk-|'
line 190:146 no viable alternative at input 'k-|'
line 190:146 no viable alternative at input '-|'
line 190:146 mismatched input '|' expecting {<EOF>, '=====', '====', '===', '==', '{{', '[[', EOL, CHAR, WS}

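【Comments】:

  • In ANTLR, the lexer runs independently of the parser; there is no context-sensitive lexing. '{{' is returned by the lexer, but the parser rule text: (CHAR | WS)+; does not accept the '{{' on line 190. You need to either list all the multi-character tokens in the parser rule, or break the string literals up into a series of single characters.
  • There are options besides writing your own grammar or using an API. Writing your own ANTLR grammar for a markup language like MediaWiki is no easy task (if you do want to do it yourself, a PEG would be a more natural candidate). But I would look for an existing parser: mediawiki.org/wiki/Parsing
  • Someone has already tried writing a grammar (with ANTLR3): mediawiki.org/wiki/Markup_spec/ANTLR/draft. Looking at that grammar, I am guessing it did not go well: there are a great many predicates (containing target-specific code), and global backtracking is turned on. Again: try to find an existing parser for this.
  • @kaby76 That is fine, I do not want {{ parsed as text but as a template. Yet neither happens.
  • @BartKiers Yes, I have read that along with many other pages and projects, such as Parsoid, which made a cumbersome switch from JS to PHP; I took the reverse path 20 years ago and I am not going back. In any case, when installation instructions and dependencies are too complicated and the bug reports are numerous, I prefer to spend the time rolling my own, and what I learn along the way is a bonus. But thanks, I may reconsider revisiting PHP with Parsoid.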
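
The first comment's point about context-free lexing can be illustrated in isolation. In a combined grammar, every string literal used in a parser rule ('{{', '|', '*' and so on) becomes its own token type, and such a literal wins over CHAR when both match the same length. A '*' in the input is therefore never a CHAR, so a text rule built only from CHAR and WS rejects it. The grammar below is a minimal sketch under assumptions: the name TokenDemo and the reduced rule set are hypothetical, it is not the asker's grammar, and it has not been run against pie.txt.

```antlr
grammar TokenDemo;

// Hypothetical reduced rule set: one input line with an optional bullet.
line: bullet? (template | text)+ EOF;

// Using '*' and '#' here creates implicit token types for them.
bullet: '*' | '#';

// Same idea for '{{', '|' and '}}'.
template: '{{' text? ('|' text?)* '}}';

// Because '*' and '#' are tokens of their own, they never arrive as CHAR.
// Listing them here lets them also appear inside running text,
// e.g. the '*' in {{der|en|ine-pro|*(s)peyk-}}.
text: (CHAR | WS | '*' | '#')+;

CHAR: ~[ \t\f\r\n];
WS: [ \t]+;
```

Removing '*' from the text rule should reproduce the kind of "extraneous input '*'" error shown in the question when a template parameter starts with an asterisk.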

Tags: parsing antlr mediawiki antlr4 text-parsing


【Answer 1】:

I changed some of the rules. Could you check it?

grammar wikitext;

/**
 Grammar
 */

page: EOL? (wikitem EOL? )+ EOF;

wikitem:
      wikitem wikitem
    | title
    | template
    | link
    | text
    | bullet_line
    ;

title: title2 | title3 | title4 | title5;
title5: '=====' text '=====';
title4: '====' text '====';
title3: '===' text '===';
title2: '==' text '==';

template: '{{' parameter ('|' parameter)* '}}';
link: '[[' parameter ('|' parameter)* ']]';

parameter: wikitem?; // parameter can be empty, I.E. {{a|}}

bullet_line: WS? bullet=('*'|'#'|'#:'|'#*') WS? wikitem;

text: (CHAR | WS)+;

/**
 Lexicon
 */
EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;

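【Comments】:

  • Thanks! It does work well. Now I need to handle line breaks inside some wikitems; templates, for example, can have many parameters spread over different lines.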
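
For the follow-up about templates whose parameters span several lines, one possible direction is to allow an optional EOL around each '|' and before the closing '}}'. This is only a sketch under assumptions: MultilineTemplate is a hypothetical, reduced grammar (no titles, links or bullets) rather than the full grammar from the answer, and it has not been tested against the Wiktionary dump.

```antlr
grammar MultilineTemplate;

// Hypothetical reduced page: templates, text and line breaks only.
page: (template | text | EOL)+ EOF;

// EOL is tolerated after '{{', on either side of each '|', and before '}}'.
template: '{{' EOL? parameter (EOL? '|' EOL? parameter)* EOL? '}}';

// A parameter may be empty, as in {{a|}}, and may contain nested templates.
parameter: (template | text)*;

text: (CHAR | WS)+;

EOL: [\f\r\n]+;
CHAR: ~[ \t\f\r\n];
WS: [ \t]+;
```

With these rules, a template written with one parameter per line (each line starting with '|') should parse as a single template node instead of being cut at the first line break.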