Regex find list values within multiple multi-line strings python

【Question title】: Regex find list values within multiple multi-line strings python
【Posted】: 2021-03-18 15:41:52
【Question description】:

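I'm looking for some help searching for a list of criteria inside a string that spans multiple lines but follows a similar pattern throughout. Each subquery starts with as ( and ends with ), so the pattern is "as", a space, "(" at the start and a space followed by ")," at the end. I want to search for the sub criteria inside each such block and produce an output.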

sub = ['apple.apple','event.pear','strawberry']

And a multi-line string like this:

qry = """
with

qry_1 as ( select some code, var as var_1 from apple where code.. and code..
),
qry_2 as ( select some from_code some code, where var as var_2 from pear where code.. and code..
),
qry_3 as ( select some from_code some from strawberry join some code, from apple where var as var_3, )
)
"""

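I'd like to search these queries for the entries of the sub list and determine where each one occurs. I have something like this, but I'm not sure how to make it work.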

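# attempt, does not run: "search separate text" stands for a pattern
# that separates qry into its subqueries
find_sub = [re.findall(sub, i)
     for i in re.findall(search separate text, qry)]
# create dictionary output
dict_sub = dict([(k, dict_sub[k]) for k in sub])
dict_sub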

What the output should look like:

'apple.apple' : ['qry_1','qry_3']
'event.pear' : ['qry_2']
'strawberry' : ['qry_3']

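I think I'm close and I've had some help with this, but I'm stuck at this point.

【Question discussion】:

Tags: python regex string

【Solution 1】:

You can find each subquery name and its associated fields, then build the desired dictionaries: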

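import re, collections
# assumption: bare table names, which is what produces the output shown below
# (the sub list in the question uses qualified names such as 'apple.apple')
sub = ['apple', 'pear', 'strawberry']
qry = '\nwith\n\nqry_1 as ( select some code, var as var_1 from apple where code.. and code..\n),\nqry_2 as ( select some code, where var as var_2 from pear where code.. and code..\n),\nqry_3 as ( select some code from strawberry join some code, from apple where var as var_3, )\n)\n'
d, d1 = collections.defaultdict(list), {}
# split after every ")," + newline so that each piece holds one subquery
for i in re.split(r'(?<=\),)\n', qry):
    # first match: the subquery name (the word before " as (");
    # remaining matches: the words that follow "from "
    a, *_b = re.findall(r'\w+(?=\sas\s\()|(?<=from\s)\w+', i)
    b = [j for j in _b if j in sub]  # keep only the names listed in sub
    for k in b:
        d[k].append(a)
    d1[a] = b

print(dict(d))
print(dict(d1))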

Output:

{'apple': ['qry_1', 'qry_3'], 'pear': ['qry_2'], 'strawberry': ['qry_3']}
{'qry_1': ['apple'], 'qry_2': ['pear'], 'qry_3': ['strawberry', 'apple']}
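
The pattern above only captures a single word after from, so a qualified name such as apple.apple would be cut at the dot and would never equal an entry of the question's sub list. A minimal sketch of one way to adapt it, assuming the tables in the real query are written with their qualified names: allow dots in the captured name, accept join as well as from, and skip any piece that has no subquery name instead of failing on unpacking. The sample qry and the names NAME, TABLE, tables_by_query and query_by_table are made up for illustration.

```python
import re
from collections import defaultdict

# Hypothetical sample data: qualified table names, unlike the answer's example.
sub = ['apple.apple', 'event.pear', 'strawberry']
qry = """
with
qry_1 as ( select code, var as var_1 from apple.apple where code > 1
),
qry_2 as ( select code, var as var_2 from event.pear where code > 2
),
qry_3 as ( select code from strawberry s join apple.apple a on s.id = a.id
)
"""

# Name of a subquery: a word followed by "as (".
NAME = re.compile(r'\w+(?=\s+as\s*\()', re.IGNORECASE)
# Possibly dotted table name that follows "from" or "join".
TABLE = re.compile(r'\b(?:from|join)\s+([\w.]+)', re.IGNORECASE)

tables_by_query = {}
query_by_table = defaultdict(list)
# Split after every ")," + newline, as in the answer above.
for block in re.split(r'(?<=\),)\n', qry):
    name = NAME.search(block)
    if name is None:
        continue  # no "<name> as (" in this piece, so skip it
    found = [t for t in TABLE.findall(block) if t in sub]
    tables_by_query[name.group()] = found
    for t in found:
        query_by_table[t].append(name.group())

print(dict(query_by_table))
print(tables_by_query)
```

On this sample the first dictionary has the shape asked for in the question. Nested subqueries, comments, or string literals containing from would still confuse a regex like this.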

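Edit: since your queries are complex, I recommend using the sqlparse package. sqlparse builds a navigable structure that can be traversed to get the desired information.

First, install sqlparse: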

pip3 install sqlparse

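Then parse and traverse the query. The function get_fields searches for identifiers that appear after a from or join keyword. These identifiers can be table names or queries. The parameter all_identifiers makes it pick up every identifier, whether or not it follows a from or join. For the problem in the question, setting this parameter to True searches the fields selected in a select block as well as the identifiers after from or join: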

import re, collections
import sqlparse
from sqlparse import tokens as T
sub = ['apple.apple','event.pear','strawberry']
qry = """
with qry_1 as (
   select a.* from apple.apple a
),
with qry_2 as (
   select a.* from apple a join strawberry s on a.id = s.id
),
with qry_3 as (
   select a.* from (select k.* from event.pear p) l join apple.apple a on l.id = a.id join (select x.* s from strawberry s where s.m = (select max(l) from ignore_field where l.id = s.id)) k3 on k3 = a.id
)
"""
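def get_fields(block, all_identifiers = False):
   # seen_id turns True once a from/join keyword has been seen
   # (or right away when all_identifiers is set)
   seen_id = all_identifiers
   for i in getattr(block, 'tokens', []):
      if i.ttype == T.Keyword and i.value.lower() in {'from', 'join'}:
         seen_id = True
      if seen_id and isinstance(i, sqlparse.sql.Identifier):
         yield i.get_alias()  # the alias of this identifier, if any
         if any(isinstance(k, sqlparse.sql.Parenthesis) for k in getattr(i, 'tokens', [])):
            # the identifier wraps a parenthesized subquery: recurse into it
            yield from get_fields(i, all_identifiers = seen_id)
         else:
            # plain identifier: the leading (possibly dotted) name, then any remaining words
            yield from re.findall(r'^[\w+\.]+|\w+', str(i))
      elif seen_id:
          # any other token group: keep looking inside it
          yield from get_fields(i, all_identifiers = seen_id)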

p = sqlparse.parse(qry)
k = {i.tokens[0].value:list(get_fields(i.tokens[-1])) for j in p for i in j.tokens if isinstance(i, sqlparse.sql.Identifier)}
d1, d2 = collections.defaultdict(list), {}
for a, _b in k.items():
    for i in (b:=[j for j in _b if j in sub]):
       d1[i].append(a)
    d2[a] = b

print(dict(d1))
print(dict(d2))

Output:

{'apple.apple': ['qry_1', 'qry_3'], 'strawberry': ['qry_2', 'qry_3'], 'event.pear': ['qry_3']}
{'qry_1': ['apple.apple'], 'qry_2': ['strawberry'], 'qry_3': ['event.pear', 'apple.apple', 'strawberry']}

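Notes:

Currently, only identifiers after the from/join keywords are searched. To also search the field names selected after a select keyword, use list(get_fields(i.tokens[-1], True)).
get_fields also yields subquery/table aliases, i.e. if apple.apple a is present, then a is yielded along with apple.apple. If you don't want this behavior, simply comment out yield i.get_alias().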

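【Discussion】:

This works for the example above but fails to run on my query string... I have used r'' '(.+?) (?i)as \(' to identify the subqueries, but substituting it into re.split also produces an error
@paranormaldist Can you post your full query string?
@paranormaldist Please see my recent edit.
@paranormaldist Sorry, I understand the problem now. Please see my recent edit; the code now filters the results from re.findall so that only the fields in sub are kept
@paranormaldist I think the best approach is to use an SQL parser. I will add a solution that uses one shortly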
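
On the (?i) pattern quoted in the first comment: one possible source of the error (an assumption, since the commenter's query and traceback are not shown) is where the inline flag sits. Python's re module has deprecated global inline flags that are not at the start of the pattern since 3.6 and rejects them since 3.11. A small sketch of the usual alternative, passing re.IGNORECASE through the flags argument, on a made-up query string:

```python
import re

# Hypothetical query text with a mixed-case "AS"; not the commenter's real query.
qry = "with qry_1 AS ( select 1 ),\nqry_2 as ( select 2 )"

# Case-insensitive match through the flags argument instead of an inline (?i)
# placed in the middle of the pattern.
names = re.findall(r'(\w+)\s+as\s*\(', qry, flags=re.IGNORECASE)
print(names)

# An inline flag is still accepted when it is the first thing in the pattern.
same = re.findall(r'(?i)(\w+)\s+as\s*\(', qry)
print(same)
```

Both calls return the same list of subquery names here.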