How can lexing efficiency be improved?

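【Question title】: How can lexing efficiency be improved?
【Posted】: 2019-01-18 18:35:42
【Question description】:

When parsing a large 3 GB file with a DCG, efficiency is important.

The current version of my lexer mostly uses the or predicate ;/2, but I read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?

Details

Note: I changed some names before posting the source code in this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this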

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, the values with "" will not be matched.

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
       (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

alphanum([]).
alphanum([H|T]) :- code_type(H, alpha), alphanum(T).

identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.

Here are some helper predicates for development and testing.

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(_Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key,_Input) :-  % any other key: stop reading
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(_TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).

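read_user_input_for_lexing_and_user_review/0 lets the user enter a string at the terminal to lex it and review the tokens.

read_file_for_lexing_and_user_review/1 reads a file for lexing and shows the tokens for each line, one line at a time.

read_file_for_lexing_and_performance/2 reads a file for lexing, with a limit on the number of lines to lex. This is used to gather basic performance statistics to measure efficiency. It is meant to be used with time/1.

【Question comments】: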

Of interest: Choice points and Redo's in Prolog - how indexing affects the SWI-Prolog tracer.
Of interest: How is a integer created as a character code constant? - explains the use of Prolog character code constants, e.g. 0'\n
Of interest: Stack overflow in Prolog DCG grammar rule: how to handle large lists efficiently or lazily - a Q&A about parsing with DCGs; the answer has a section on taking advantage of indexing.
Of interest: GitHub SWI-Prolog swipl-devel/src/Tests/core/test_dcg.pl
Of interest: GitHub SWI-Prolog swipl-devel/src/Unicode/derived_core_properties.pl - a real-world example of parsing with DCGs.

Tags: performance prolog tokenize lexical-analysis

【Solution 1】:

What it means is that this is silly code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    ).

This is less silly code:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But it is still not so good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

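The last example also starts to look silly, but remember that now you have indexing on the first argument. You read once, with no choice points and no backtracking.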
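To make the point about first-argument indexing concrete, here is a minimal sketch. It assumes the input is a list of characters (the question sets double_quotes to chars), so the table is keyed on character atoms rather than on 0'c character codes. The names punct/2 and punct_token//1 are made up for this illustration and do not come from the question.

```prolog
% punct(+Char, -Token): hypothetical lookup table.
% The character is the FIRST argument, so clause indexing can pick
% the matching fact directly instead of trying each clause in turn.
punct('(', tokLParen).
punct(')', tokRParen).
punct('+', tokPlus).
punct('*', tokTimes).

% Read one character from the input, then dispatch on it.
punct_token(Token) --> [C], { punct(C, Token) }.

% Example queries (character lists written out explicitly):
% ?- phrase(punct_token(T), ['+'], Rest).
% ?- phrase(punct_token(T), [x], Rest).    % fails: x is not in the table
```

Because the character is already bound when punct/2 is called, the first query should succeed without leaving a choice point, which is the effect the answer describes.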

But if you really want to know how to write it efficiently, you need to measure where the time and space go. Have you measured?
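One way to start measuring, sketched here under the assumption that SWI-Prolog is used: the helper measure/1 below is hypothetical (it is not part of any library) and simply reports the CPU time and the number of inferences spent in a goal, using the cputime and inferences keys of statistics/2. The built-in time/1 prints similar figures.

```prolog
:- meta_predicate measure(0).

% measure(:Goal): run Goal once and print the CPU seconds and the
% number of inferences it used. Succeeds even if Goal fails, so a
% failing run can still be measured.
measure(Goal) :-
    statistics(cputime, T0),
    statistics(inferences, I0),
    ( call(Goal) -> true ; true ),
    statistics(cputime, T1),
    statistics(inferences, I1),
    T is T1 - T0,
    I is I1 - I0,
    format('~3f CPU seconds, ~D inferences~n', [T, I]).

% Example use with the helper from the question
% ('data.txt' is a placeholder file name):
% ?- measure(read_file_for_lexing_and_performance('data.txt', 10000)).
```

For a per-predicate breakdown rather than a single total, SWI-Prolog's profile/1 can be called on the same goal.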

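But if you really want to know how to fix it, you might read "The Craft of Prolog". I don't understand all of this book, but I remember it has a big section on DCGs.

But if you really want to parse large files in this format, maybe find an existing library in another language; it may be much faster than the fastest Prolog.

【Comments】:

【Solution 2】:

Solution:

You should replace the following: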

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta}
      ;  "}",       !, { Token = tokRMusta}
      ;  "\\",      !, { Token = tokSlash}
      ;  "->",      !, { Token = tokImpl}
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replaces the long ;/2 chain, which is searched blindly, with a call to the new nonterminal op_token//1, whose clauses are selected by standard Prolog clause indexing
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                  (mod, tokMod),
                                  (where, tokWhere)]),
            !
         ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)      --> ":".
op_token(tokLParen)     --> "(".
op_token(tokRParen)     --> ")".
op_token(tokLMusta)     --> "{".
op_token(tokRMusta)     --> "}".
op_token(tokBackSlash)  --> "\\".
op_token(tokImpl)       --> "->".
op_token(tokPlus)       --> "+".
op_token(tokMinus)      --> "-".
op_token(tokTimes)      --> "*".
op_token(tokEqual)      --> "=".
op_token(tokLt)         --> "<".
op_token(tokGt)         --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod)     --> ".".
op_token(tokSlash)      --> "/".
op_token(tokComma)      --> ",".
op_token(tokSemicolon)  --> ";".

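Edit by Guy Coder

I ran a test using the example data posted in the question, loaded into a list in which each item is one line of the data converted to character codes. Then, using time/1, the lexer was called on each item in the list, and the pass over the list was repeated 10,000 times. The data was loaded into the list and converted to character codes before calling time/1 so that those steps would not skew the results. Each run was repeated 5 times to check the consistency of the data.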
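The harness itself was not posted, so the following is only a sketch of what such a setup might look like in SWI-Prolog. All predicate names here (load_lines/2, lex_all/2, bench/3) are hypothetical. It converts each line with string_chars/2 to match the double_quotes=chars setting used by the lexer; a lexer written for character codes would use string_codes/2 instead.

```prolog
:- use_module(library(readutil)).

% load_lines(+Path, -Lines): read a file into a list of character lists,
% one per line, so that I/O and conversion happen before timing starts.
load_lines(Path, Lines) :-
    setup_call_cleanup(
        open(Path, read, In),
        read_all_lines(In, Lines),
        close(In)).

read_all_lines(In, Lines) :-
    read_line_to_string(In, Line),
    (   Line == end_of_file
    ->  Lines = []
    ;   string_chars(Line, Chars),
        Lines = [Chars|Rest],
        read_all_lines(In, Rest)
    ).

% lex_all(+Lexer, +Lines): call the nonterminal Lexer//1 on every line,
% discarding the tokens. A nonterminal lexer//1 is the predicate lexer/3,
% so it is called with the token list plus the two hidden list arguments.
lex_all(Lexer, Lines) :-
    forall(member(Chars, Lines), call(Lexer, _Tokens, Chars, [])).

% bench(+Lexer, +Path, +Times): time Times passes over the preloaded lines.
bench(Lexer, Path, Times) :-
    load_lines(Path, Lines),
    time(forall(between(1, Times, _), lex_all(Lexer, Lines))).

% Example use ('data.txt' is a placeholder file name):
% ?- bench(lexer, 'data.txt', 10000).
```

Running the same bench/3 call several times, as described above, gives a rough sense of how stable the numbers are.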

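In the runs below, all versions of the lexer were extended to cover all 7-bit ASCII characters, which significantly increased the number of cases for special characters.

The version of Prolog used for the following was SWI-Prolog 8.0.

For the version in the question.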

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version posted above in this answer

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

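Version 2 gives a dramatic improvement over Version 1 by using indexing: CPU time drops from about 151 seconds to about 77 seconds.

On further examination of the code, op_token is a DCG nonterminal and so has two hidden variables for implicitly passing the state. Using listing/1 shows: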

op_token(tokUnderscore,['_'|A], A).

Notice that the first argument is not the character being searched for, whereas in this answer the indexed code is written as

c_digit(0'0,0).

Here the first argument is the character being searched for and the second argument is the result.

So change this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with the indexed clauses written as

special_character_indexed( ';' ,tokSemicolon).
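One thing to watch with a single-character table: on its own it cannot recognise the two-character operator "->" from the question, which would come out as tokMinus followed by tokGt. The sketch below shows one possible way to handle that with a one-character lookahead after the table lookup. It lists only a few table entries, and the names special_token//1, after_special//2 and minus_or_impl//1 are hypothetical, not taken from the answer.

```prolog
% special_character_indexed(+Char, -Token): a few entries only.
% The character comes first so that clause indexing applies to it.
special_character_indexed(':', tokColon).
special_character_indexed(';', tokSemicolon).
special_character_indexed('-', tokMinus).
special_character_indexed('>', tokGt).

% special_token(-Token): consume one character, look it up in the table,
% then give multi-character operators a chance to extend the match.
special_token(Token) -->
    [S],
    { special_character_indexed(S, Token0) },
    after_special(Token0, Token).

% after_special(+Token0, -Token): only '-' needs a lookahead here.
after_special(tokMinus, Token) --> !, minus_or_impl(Token).
after_special(Token, Token)    --> [].

% A '-' followed by '>' is the implication arrow, otherwise a plain minus.
minus_or_impl(tokImpl)  --> ['>'], !.
minus_or_impl(tokMinus) --> [].

% Example queries (character lists written out explicitly):
% ?- phrase(special_token(T), ['-', '>'], Rest).
% ?- phrase(special_token(T), ['-', x], Rest).
```

In the lexer, special_token(Token) would take the place of [S], { special_character_indexed(S,Token) }. Whether the extra step costs anything measurable would need to be checked with the same timing setup.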

Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

Version 3 gives a slightly but consistently better result than Version 2.

Lastly, just changing the double_quotes flag to atom, as noted in a comment by AntonDanilov

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as Version 3.

Just looking at the CPU numbers, using indexing is faster, e.g. (Version: 1) 151.875 vs (Version: 3) 74.547

【Comments】:

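I didn't know what sky-scrapper meant in the comment % replace OR sky-scrapper with call to new predicate, so I asked about it as a separate question.
A skyscraper is an extremely tall, multi-story building. There are lots of them in the big cities of the USA =D
I would call it a separate chain.
Look at the first code snippet: it looks like the skyscrapers of Manhattan, New York, etc., doesn't it?
The correct spelling is "skyscraper" (in this case the single "p" matters for the long "a" sound, and it happens to be one word). But I have never heard that term used to refer to any kind of code structure before. Do you have a reference?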