【Question title】: Regex find list values within multiple multi-line strings python
【Posted】: 2021-03-18 15:41:52
【Question】:

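I am looking for help searching for the items of a list within a string that spans multiple lines but follows a repeating pattern. Each sub-query has the form as ( ), that is: as, space, opening parenthesis, the body, closing parenthesis, comma. I want to search each of these blocks for the items in sub and produce an output.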

sub = ['apple.apple','event.pear','strawberry']

And a multi-line string like this -

qry = 

with

qry_1 as ( select some code, var as var_1 from apple where code.. and code..
),
qry_2 as ( select some from_code some code, where var as var_2 from pear where code.. and code..
),
qry_3 as ( select some from_code some from strawberry join some code, from apple where var as var_3, )
)

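I would like to search these queries for the items in sub and identify which sub-queries each item appears in. I have something like the following, but I am not sure how to make it work.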

find_sub = [re.findall(sub, i)
     for i in re.findall(search separate text, qry)]
# create dictionary output
dict_sub = dict([k,dict_sub[k]) for k in sub])
dict_sub

The desired output looks like this -

'apple.apple' : ['qry_1','qry_3']
'event.pear' : ['qry_2']
'strawberry' : ['qry_3']

I think I am close and have had some help with this, but I am stuck at this point.

【Question comments】:

    Tags: python regex string


    【Solution 1】:

    You can find each sub-query name together with the tables it reads from, then build the desired dictionaries:

    import re, collections
    # Bare table names, assumed here so that the filter matches the output shown below
    sub = ['apple', 'pear', 'strawberry']
    qry = '\nwith\n\nqry_1 as ( select some code, var as var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, where var as var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple where var as var_3, )\n)\n'
    d, d1 = collections.defaultdict(list), {}
    # Split on the newline after each "),": one chunk per sub-query
    for i in re.split(r'(?<=\),)\n', qry):
        # First match is the sub-query name; the rest are the names that follow "from"
        a, *_b = re.findall(r'\w+(?=\sas\s\()|(?<=from\s)\w+', i)
        b = [j for j in _b if j in sub]
        for k in b:
            d[k].append(a)
        d1[a] = b
    
    print(dict(d))
    print(dict(d1))
    

    Output:

    {'apple': ['qry_1', 'qry_3'], 'pear': ['qry_2'], 'strawberry': ['qry_3']}
    {'qry_1': ['apple'], 'qry_2': ['pear'], 'qry_3': ['strawberry', 'apple']}
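The snippet above matches bare table names after from only. As a minimal sketch, and not the answerer's code, the same split-then-search idea can be extended to schema-qualified names such as apple.apple and to names after join. The function name map_tables_to_subqueries and the sample query are illustrative assumptions; the approach still assumes every sub-query ends with ")," followed by a newline, and it does not handle nested sub-queries, comments, or string literals.

```python
import re
from collections import defaultdict

def map_tables_to_subqueries(qry, wanted):
    """Map each name in `wanted` to the sub-queries whose FROM/JOIN clauses mention it."""
    wanted = set(wanted)
    found = defaultdict(list)
    # One chunk per sub-query: split on the newline that follows each "),".
    for chunk in re.split(r'(?<=\),)\n', qry):
        # Sub-query name: the word directly before "as (".
        header = re.search(r'(\w+)\s+as\s+\(', chunk, flags=re.IGNORECASE)
        if header is None:
            continue
        name = header.group(1)
        # Identifiers after FROM or JOIN; [\w.]+ keeps a schema prefix such as "apple.apple".
        for table in re.findall(r'\b(?:from|join)\s+([\w.]+)', chunk, flags=re.IGNORECASE):
            if table in wanted and name not in found[table]:
                found[table].append(name)
    return dict(found)

sub = ['apple.apple', 'event.pear', 'strawberry']
# Illustrative query, simplified from the one in the question.
sample_qry = """
with
qry_1 as ( select code, var as var_1 from apple.apple where code = 1
),
qry_2 as ( select from_code, var as var_2 from event.pear where code = 2
),
qry_3 as ( select s.code from strawberry s join apple.apple a on s.id = a.id
)
"""
result = map_tables_to_subqueries(sample_qry, sub)
print(result)
# {'apple.apple': ['qry_1', 'qry_3'], 'event.pear': ['qry_2'], 'strawberry': ['qry_3']}
```

Requiring whitespace after from/join means a column called from_code is not mistaken for a keyword, and a derived table such as from (select ...) is skipped rather than mis-read.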
    

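    Edit: since your query is complex, I recommend using the sqlparse package. sqlparse builds a navigable structure that can be traversed to extract the information you need.

    First, install sqlparse: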

    pip3 install sqlparse
    

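    Then parse and traverse the query. The function get_fields searches for identifiers that appear after a from or join keyword. These identifiers can be table names or sub-queries. The parameter all_identifiers makes the function collect every identifier, whether or not it follows from/join. In the context of this question, setting the parameter to True searches the fields chosen by the select block as well as the identifiers after from/join: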

    import re, collections
    import sqlparse
    from sqlparse import tokens as T
    sub = ['apple.apple','event.pear','strawberry']
    qry = """
    with qry_1 as (
       select a.* from apple.apple a
    ),
    with qry_2 as (
       select a.* from apple a join strawberry s on a.id = s.id
    ),
    with qry_3 as (
       select a.* from (select k.* from event.pear p) l join apple.apple a on l.id = a.id join (select x.* s from strawberry s where s.m = (select max(l) from ignore_field where l.id = s.id)) k3 on k3 = a.id
    )
    """
    def get_fields(block, all_identifiers=False):
        # seen_id turns True once a from/join keyword has been seen in this block
        seen_id = all_identifiers
        for i in getattr(block, 'tokens', []):
            if i.ttype == T.Keyword and i.value.lower() in {'from', 'join'}:
                seen_id = True
            if seen_id and isinstance(i, sqlparse.sql.Identifier):
                yield i.get_alias()
                # An identifier that wraps a parenthesis is a sub-query: recurse into it
                if any(isinstance(k, sqlparse.sql.Parenthesis) for k in getattr(i, 'tokens', [])):
                    yield from get_fields(i, all_identifiers=seen_id)
                else:
                    yield from re.findall(r'^[\w+\.]+|\w+', str(i))
            elif seen_id:
                yield from get_fields(i, all_identifiers=seen_id)
    
    p = sqlparse.parse(qry)
    k = {i.tokens[0].value:list(get_fields(i.tokens[-1])) for j in p for i in j.tokens if isinstance(i, sqlparse.sql.Identifier)}
    d1, d2 = collections.defaultdict(list), {}
    for a, _b in k.items():
        for i in (b:=[j for j in _b if j in sub]):
           d1[i].append(a)
        d2[a] = b
    
    print(dict(d1))
    print(dict(d2))
    

    Output:

    {'apple.apple': ['qry_1', 'qry_3'], 'strawberry': ['qry_2', 'qry_3'], 'event.pear': ['qry_3']}
    {'qry_1': ['apple.apple'], 'qry_2': ['strawberry'], 'qry_3': ['event.pear', 'apple.apple', 'strawberry']}
    

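    Notes:

    1. Currently only identifiers after the from/join keywords are searched. To also search the field names selected after the select keyword, use list(get_fields(i.tokens[-1], True))
    2. get_fields also yields sub-query/table aliases: if apple.apple a is present, a is yielded alongside apple.apple. If you do not want this behaviour, simply comment out yield i.get_alias().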
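Because get_fields can yield aliases, repeated names, and None, the filtering step after it does the real selection. The sketch below isolates that step using only the standard library. The helper name invert_matches and the input dictionary are hypothetical stand-ins for what a get_fields-style traversal might return, not output captured from sqlparse.

```python
from collections import defaultdict

def invert_matches(fields_by_query, wanted):
    """Return (table -> sub-queries, sub-query -> tables), keeping only names in `wanted`."""
    wanted = set(wanted)
    by_table = defaultdict(list)
    by_query = {}
    for query_name, fields in fields_by_query.items():
        # dict.fromkeys de-duplicates while keeping first-seen order;
        # None and aliases drop out because they are not in `wanted`.
        hits = list(dict.fromkeys(f for f in fields if f in wanted))
        by_query[query_name] = hits
        for table in hits:
            by_table[table].append(query_name)
    return dict(by_table), by_query

# Hypothetical traversal output: aliases ('a', 's'), a None alias and repeats included.
fields_by_query = {
    'qry_1': ['a', 'apple.apple', 'a'],
    'qry_2': [None, 'apple', 'a', 's', 'strawberry', 's'],
}
by_table, by_query = invert_matches(fields_by_query, ['apple.apple', 'event.pear', 'strawberry'])
print(by_table)   # {'apple.apple': ['qry_1'], 'strawberry': ['qry_2']}
print(by_query)   # {'qry_1': ['apple.apple'], 'qry_2': ['strawberry']}
```

One caveat of filtering by membership: an alias that happens to equal a wanted table name would be kept, so distinct alias names are assumed.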

    【Comments】:

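    • This works for the example above but fails on my query string... I have used r'' '(.+?) (?i)as \(' to identify the sub-queries, but substituting it into re.split also produces an error
    • @paranormaldist Could you post your full query string?
    • @paranormaldist Please see my latest edit.
    • @paranormaldist Sorry, I understand the problem now. Please see my latest edit; the code now filters the results of re.findall so that only the fields in sub are kept
    • @paranormaldist I think the best approach is to use a SQL parser. I will add a solution that uses one shortly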
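The first comment does not say which error was raised, so the following is only one possible cause. From Python 3.11 onward, a global inline flag such as (?i) placed in the middle of a pattern is rejected; earlier versions only emit a deprecation warning. A minimal sketch of two ways to match as case-insensitively without a mid-pattern global flag, using an illustrative input string:

```python
import re

# Option 1: pass the flag as an argument instead of embedding (?i) mid-pattern.
SUBQUERY_HEADER = re.compile(r'(\w+)\s+as\s+\(', re.IGNORECASE)

# Illustrative input, not the commenter's query.
text = "qry_1 AS ( select 1 ),\nqry_2 as ( select 2 )"
names = SUBQUERY_HEADER.findall(text)
print(names)  # ['qry_1', 'qry_2']

# Option 2: a scoped inline flag, which applies only to the group it wraps.
scoped = re.findall(r'(\w+)\s+(?i:as)\s+\(', text)
print(scoped)  # ['qry_1', 'qry_2']
```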