【Question title】: Parse a Penn syntax tree to extract its grammar rules
【Posted】: 2017-06-22 20:24:17
【Question】:

I have a Penn syntax tree and I would like to recursively get all the rules that this tree contains.

(ROOT 
(S 
   (NP (NN Carnac) (DT the) (NN Magnificent)) 
   (VP (VBD gave) (NP ((DT a) (NN talk))))
)
)

My goal is to obtain grammar rules like these:

ROOT --> S
S --> NP VP
NP --> NN
...

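As I said, I need to do this recursively, without using the NLTK package or any other module, and without regular expressions. This is what I have so far. The parameter tree is the Penn tree split on every space.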

def extract_rules(tree):
    tree = tree[1:-1]
    print("\n\n")

    if len(tree) == 0:
        return

    root_node = tree[0]
    print("Current Root: "+root_node)

    remaining_tree = tree[1:]
    right_side = []

    temp_tree = list(remaining_tree)
    print("remaining_tree: ", remaining_tree)
    symbol = remaining_tree.pop(0)

    print("Symbol: "+symbol)

    if symbol not in ["(", ")"]:
        print("CASE: No Brackets")
        print("Rule: "+root_node+" --> "+str(symbol))

        right_side.append(symbol)

    elif symbol == "(":
        print("CASE: Opening Bracket")
        print("Temp Tree: ", temp_tree)
        cursubtree_end = bracket_depth(temp_tree)
        print("Subtree ends at position "+str(cursubtree_end)+" and Element is "+temp_tree[cursubtree_end])
        cursubtree_start = temp_tree.index(symbol)

        cursubtree = temp_tree[cursubtree_start:cursubtree_end+1]
        print("Subtree: ", cursubtree)

        rnode = extract_rules(cursubtree)
        if rnode:
            right_side.append(rnode)
            print("Rule: "+root_node+" --> "+str(rnode))

    print(right_side)
    return root_node


def bracket_depth(tree):
    counter = 0
    position = 0
    subtree = []

    for i, char in enumerate(tree):
        if char == "(":
            counter = counter + 1
        if char == ")":
            counter = counter - 1

        if counter == 0 and i != 0:
            counter = i
            position = i
            break

    subtree = tree[0:position+1]

    return position

Currently it works for the first subtree of S, but none of the other subtrees get parsed recursively. I would be glad for any help.

【Comments】:

    Tags: python python-3.x recursion nlp


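    【Solution 1】:

    I'm inclined to keep this as simple as possible rather than try to reinvent the parsing modules you're currently not allowed to use. Something like: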

    string = '''
        (ROOT
            (S
                (NP (NN Carnac) (DT the) (NN Magnificent))
                (VP (VBD gave) (NP (DT a) (NN talk)))
            )
        )
    '''
    
    def is_symbol_char(character):
        '''
        Predicate to test if a character is valid
        for use in a symbol, extend as needed.
        '''
    
        return character.isalpha() or character in '-=$!?.'
    
    def tokenize(characters):
        '''
        Process characters into a nested structure.  The original string
        '(DT the)' is passed in as ['(', 'D', 'T', ' ', 't', 'h', 'e', ')']
        '''
    
        tokens = []
    
        while characters:
            character = characters.pop(0)
    
            if character.isspace():
                pass  # nothing to do, ignore it
    
            elif character == '(':  # signals start of recursive analysis (push)
                characters, result = tokenize(characters)
                tokens.append(result)
    
            elif character == ')':  # signals end of recursive analysis (pop)
                break
    
            elif is_symbol_char(character):
                # if it looks like a symbol, collect all
                # subsequent symbol characters
                symbol = character
    
                # peek before popping so that input ending in the middle
                # of a symbol cannot raise an IndexError on an empty list
                while characters and is_symbol_char(characters[0]):
                    symbol += characters.pop(0)
    
                tokens.append(symbol)
    
        # Return whatever tokens we collected and any characters left over
        return characters, tokens
    
    def extract_rules(tokens):
        ''' Recursively walk tokenized data extracting rules. '''
    
        head, *tail = tokens
    
        print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail])
    
        for token in tail:  # recurse
            if isinstance(token, list):
                extract_rules(token)
    
    characters, tokens = tokenize(list(string))
    
    # After a successful tokenization, all the characters should be consumed
    assert not characters, "Didn't consume all the input!"
    
    print('Tokens:', tokens[0], 'Rules:', sep='\n\n', end='\n\n')
    
    extract_rules(tokens[0])
    

    Output

    Tokens:
    
    ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
    
    Rules:
    
    ROOT --> S
    S --> NP VP
    NP --> NN DT NN
    NN --> Carnac
    DT --> the
    NN --> Magnificent
    VP --> VBD NP
    VBD --> gave
    NP --> DT NN
    DT --> a
    NN --> talk
    

    Note

    I changed your original tree, because this clause:

    (NP ((DT a) (NN talk)))
    

    did not look correct: it produced an empty node in a syntax tree grapher available on the web. So I simplified it to:

    (NP (DT a) (NN talk))
    

    Adjust as you see fit.
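
If the doubled parentheses have to stay in the input, one option is to clean up the token list before extracting rules. The following is only a minimal sketch and not part of the answer above: normalize is a made-up helper name, and it assumes the nested-list shape that tokenize() produces (a list whose first item is the label). It drops empty nodes and splices unlabeled wrapper lists into their parent.

```python
def normalize(node):
    """Return a cleaned copy of a nested token list (hypothetical helper).

    Empty nodes are dropped and unlabeled wrapper lists, i.e. lists whose
    first item is itself a list, are spliced into their parent.
    """
    if not isinstance(node, list):
        return node  # a plain label or word

    cleaned = []
    for child in node:
        child = normalize(child)
        if child == []:
            continue  # drop empty nodes such as the result of "()"
        if isinstance(child, list) and isinstance(child[0], list):
            cleaned.extend(child)  # unlabeled wrapper: splice its children in
        else:
            cleaned.append(child)
    return cleaned


# Shape produced for "(NP ((DT a) (NN talk)) ())"
wrapped = ['NP', [['DT', 'a'], ['NN', 'talk']], []]
flat = normalize(wrapped)
print(flat)
```

Running extract_rules() on the cleaned list then avoids indexing into an empty list, which is one way the IndexError mentioned in the comments can arise.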

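    【Comments】:

    • First of all, thanks for the answer, it seems to work best. Could you explain or comment the key parts of the code? Also, the extract rules method gives me an error: File "nnn.py", line 52, in read_treebank extract_penn_rules(tokens[0]) File "nnn.py", line 88, in extract_penn_rules print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail]) File "nnn.py", line 88, in <listcomp> print(head, '-->', *[x[0] if isinstance(x, list) else x for x in tail]) IndexError: list index out of range
    • Secondly, some of my POS tags contain a dash '-' and the tokenize method seems to split them into several heads, for example: [['S', ['NP', 'SBJ', ['EX', 'There']], ['VP', ['VBZ', 'is'], ['NP', 'PRD', ['DT', 'no'], ['NN', 'asbestos']], ['PP', 'LOC', ['IN', 'in'], ['NP', ['PRP', 'our'], ['NNS', 'products']]], ['ADVP', 'TMP', ['RB', 'now']]], []]] 'PP' and 'LOC' should be 'PP-LOC'.
    • @SaifDeen, I've added dash handling, though you could have added it yourself. SO is here to help you over a hurdle, not to take care of every finishing detail, which is why I commented my code to explain my logic. As for the extract_rules() problem, please give me a sample input that raises the error and I'll see what I can do. Also, you didn't comment on my note about the syntax.
    • That was a great help. I could have done the dash handling myself after seeing it, but to be honest, I have been working on this problem for a week now and have lost my overview of it. About your note: I'm not sure, it is just one of many Penn trees I need to parse. I think you were right to fix it.
    • Regarding the extract_rules error, this is my input: ( S ( NP-SBJ ( EX There ) ) ( VP ( VBZ is ) ( NP-PRD ( DT no ) ( NN asbestos ) ) ( PP-LOC ( IN in ) ( NP ( PRP$ our ) ( NNS products ) ) ) ( ADVP-TMP ( RB now ) ) ) ( . . ) )
    【Solution 2】: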

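    This can be done in a much simpler way. Given that we know the structure of our grammar is CNF LR, we can parse the text with a recursive regex parser.

    There is a package called pyparsing (if you don't have it yet, you can install it with pip install pyparsing).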

    from pyparsing import nestedExpr
    
    astring = '''(ROOT 
    (S 
       (NP (NN Carnac) (DT the) (NN Magnificent)) 
       (VP (VBD gave) (NP ((DT a) (NN talk))))
    )
    )'''
    
    expr = nestedExpr('(', ')')
    result = expr.parseString(astring).asList()[0]
    print(result)
    

    This gives

    ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']], ['VP', ['VBD', 'gave'], ['NP', [['DT', 'a'], ['NN', 'talk']]]]]]
    

    So we have successfully turned our string into a hierarchy of lists. Now we need to write some code that walks the lists and extracts the rules.

    def get_rules(result, rules):
        for l in result[1:]:
            if isinstance(l, list) and not isinstance(l[0], list):
                rules.add((result[0], l[0]))  
                get_rules(l, rules)
    
            elif isinstance(l[0], list):
                rules.add((result[0], tuple([x[0] for x in l])))
            else:
                rules.add((result[0], l))
    
        return rules
    

    As I mentioned, we already know the structure of the grammar, so there is only a limited number of cases to handle here.

    Call the function like this:

    rules = get_rules(result, set())  # result was obtained earlier
    
    for i in rules:
        print(i)
    

    Output:

    ('ROOT', 'S')
    ('VP', 'NP')
    ('DT', 'the')
    ('NP', 'NN')
    ('NP', ('DT', 'NN'))
    ('NP', 'DT')
    ('S', 'VP')
    ('VBD', 'gave')
    ('NN', 'Carnac')
    ('NN', 'Magnificent')
    ('S', 'NP')
    ('VP', 'VBD')
    

    Order as needed.
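
The set above records parent/child pairs, so a rule such as S --> NP VP shows up as two separate entries and the order is lost. If whole right-hand sides in tree order are wanted, a variation along the following lines might do. This is a sketch rather than the code from the answer: get_productions is a made-up name, and it assumes every list starts with a string label, that is, the simplified (NP (DT a) (NN talk)) form without the extra parentheses.

```python
def get_productions(node, productions=None):
    """Collect (head, right-hand side) pairs depth-first, without duplicates.

    Hypothetical helper; assumes node is a nested list whose first item
    is always a string label.
    """
    if productions is None:
        productions = []

    head, children = node[0], node[1:]
    # Use the label of each child subtree, or the word itself for a leaf
    rhs = tuple(c[0] if isinstance(c, list) else c for c in children)
    if (head, rhs) not in productions:
        productions.append((head, rhs))

    for child in children:
        if isinstance(child, list):
            get_productions(child, productions)
    return productions


tree = ['ROOT', ['S', ['NP', ['NN', 'Carnac'], ['DT', 'the'], ['NN', 'Magnificent']],
                 ['VP', ['VBD', 'gave'], ['NP', ['DT', 'a'], ['NN', 'talk']]]]]
productions = get_productions(tree)

for head, rhs in productions:
    print(head, '-->', ' '.join(rhs))
```

A list is used instead of a set here so that the rules come out in the order they appear in the tree.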

    【Comments】:

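    • Sorry, I can't use any modules or regular expressions. I should have said so :(
    • Well, you might want to say that up front, so you don't waste other people's time.
    • Well, if you can figure out how to parse the string into a list, you can use get_rules to extract the rules.
    • Is there a way to extend my function so that it parses the other subtrees?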
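
On that last question: the function in the question pops a single symbol after the root and returns, so it never loops over the remaining siblings. Below is a minimal sketch of one way around that, using only built-in str and list operations (no modules, no regular expressions). The names split_tokens, parse and collect_rules are made up for this illustration, and the sketch assumes balanced parentheses and that every '(' is directly followed by a label.

```python
def split_tokens(text):
    """Split a bracketed tree into '(', ')' and symbol tokens."""
    return text.replace('(', ' ( ').replace(')', ' ) ').split()


def parse(tokens, pos=0):
    """Parse the subtree whose '(' sits at tokens[pos].

    Returns (label, children, next_pos). Each child is either a
    (label, children) tuple for a subtree or a plain string for a word.
    """
    assert tokens[pos] == '(', "a subtree must start with '('"
    label = tokens[pos + 1]
    pos += 2
    children = []
    while tokens[pos] != ')':  # visit ALL children, not only the first
        if tokens[pos] == '(':
            child_label, grandchildren, pos = parse(tokens, pos)
            children.append((child_label, grandchildren))
        else:
            children.append(tokens[pos])  # a terminal word
            pos += 1
    return label, children, pos + 1  # step past the closing ')'


def collect_rules(label, children, rules):
    """Append 'HEAD --> child child ...' for this node, then recurse."""
    rhs = [c[0] if isinstance(c, tuple) else c for c in children]
    rules.append(label + ' --> ' + ' '.join(rhs))
    for child in children:
        if isinstance(child, tuple):
            collect_rules(child[0], child[1], rules)
    return rules


text = ('( S ( NP-SBJ ( EX There ) ) ( VP ( VBZ is ) ( NP-PRD ( DT no ) '
        '( NN asbestos ) ) ( PP-LOC ( IN in ) ( NP ( PRP$ our ) '
        '( NNS products ) ) ) ( ADVP-TMP ( RB now ) ) ) ( . . ) )')
label, children, end = parse(split_tokens(text))
rules = collect_rules(label, children, [])
print('\n'.join(rules))
```

Passing the position along and returning the next unread position is what lets the while loop continue with the next sibling after each recursive call.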