【Question title】: Python regex for parsing SQL statement
【Posted】: 2018-03-16 13:19:22
【Question】:

I need to use a regular expression to parse some information out of a SQL DDL statement. The statement looks like this:

CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)

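I need to parse it in Python and extract the columns named in the PARTITIONED BY clause. I came up with a regex that works once the newlines are stripped out, but I can't get it to work when the statement contains newlines. Here is some demo code: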

import re
def print_partition_columns_if_found(ddl_string):
    regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
    match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
    if match.group("pcol"):
        print(match.group("pcol").strip())
    else:
        print('did not find any pcols in {matches}'.format(matches=match.groups()))


ddl_string1 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
print_partition_columns_if_found(ddl_string1)

print("--------")

ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
print_partition_columns_if_found(ddl_string2)

This returns:

DATA2, DATA3
--------
did not find any pcols in (None, 'default.', 'table1 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)

Would any regex expert be willing to help me out?

【Question comments】:

    Tags: python regex


    【Solution 1】:

    Let's check the Python sqlparse documentation: Documentation - getting started

    >>> import sqlparse
    >>> ddl_string2 = """
    ... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
    ... USING parquet
    ... OPTIONS (
    ...   serialization.format '1'
    ... )
    ... PARTITIONED BY (DATA2, DATA3)
    ... """
    >>> ddl_string1 = """
    ... CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
    ... USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
    >>> def print_partition_columns_if_found(sql):
    ...     parse = sqlparse.parse(sql)
    ...     data = next(item for item in reversed(parse[0].tokens) if item.ttype is None)[1]
    ...     print(data)
    ...
    >>> print_partition_columns_if_found(ddl_string1)
    DATA2, DATA3
    >>> print_partition_columns_if_found(ddl_string2)
    DATA2, DATA3
    >>>
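
If adding a dependency is not an option, a regex-only route may also work. The likely reason the question's pattern fails on the multi-line statement is that `re.DOTALL` only changes what `.` matches. The literal spaces in ` +OPTIONS` and ` *PARTITIONED BY` still match space characters only, so the newline after `parquet` leaves both optional groups empty. (`re.MULTILINE` has no effect here, since the pattern contains no `^` or `$`.) Below is a minimal sketch that uses `\s` wherever whitespace may appear. The names `DDL_RE` and `partition_columns` and the character classes used for the database and table names are assumptions made for illustration, not part of the original post, and the pattern shares the original's limitation that a `)` inside the column list (for example `DECIMAL(10,2)`) ends the `col` group early.

```python
import re

# Same overall shape as the regex in the question, but every place where
# whitespace can occur uses \s (space, tab, newline) instead of a literal space.
DDL_RE = re.compile(
    r'CREATE\s+(?:TEMPORARY\s+)?TABLE\s+'
    r'(?P<db>[^\s.(]+\.)?(?P<table>[^\s(]+)\s*'
    r'\((?P<col>.*?)\)\s*'
    r'USING\s+(?P<fmt>\S+)'
    r'(?:\s+OPTIONS\s*\([^)]+\))?'
    r'(?:\s+PARTITIONED\s+BY\s*\((?P<pcol>.*?)\))?',
    re.DOTALL,
)


def partition_columns(ddl_string):
    """Return the PARTITIONED BY column names, or [] if none are found."""
    match = DDL_RE.search(ddl_string)
    if match is None or match.group('pcol') is None:
        return []
    # Split the captured clause on commas and drop surrounding whitespace.
    return [c.strip() for c in match.group('pcol').split(',') if c.strip()]


if __name__ == '__main__':
    ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
  serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
    print(partition_columns(ddl_string2))
```

Returning a list rather than printing keeps the helper easy to reuse. A real SQL parser such as sqlparse is still the safer choice when the DDL can contain nested parentheses, comments, or quoted identifiers.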
    

    【Comments】:

    • Interesting, I'll dig into that. Thanks.