【Posted】: 2018-03-16 13:19:22
【Question】:
I need to parse some information out of a SQL DDL statement using a regular expression. The SQL statement looks like this:
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
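I need to parse it in Python and extract the columns named in the PARTITIONED BY clause. I came up with a regex that works once the newlines are removed, but I can't get it to work when the newlines are present. Here is some demo code: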
import re


def print_partition_columns_if_found(ddl_string):
    # The pattern under discussion, unchanged from the original attempt.
    regex = r'CREATE +?(TEMPORARY +)?TABLE *(?P<db>.*?\.)?(?P<table>.*?)\((?P<col>.*?)\).*?USING +([^\s]+)( +OPTIONS *\([^)]+\))?( *PARTITIONED BY \((?P<pcol>.*?)\))?'
    match = re.search(regex, ddl_string, re.MULTILINE | re.DOTALL)
    if match.group("pcol"):
        print(match.group("pcol").strip())
    else:
        print('did not find any pcols in {matches}'.format(matches=match.groups()))


# Case 1: the clauses after the column list are all on one line.
ddl_string1 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet OPTIONS (serialization.format '1') PARTITIONED BY (DATA2, DATA3)"""
print_partition_columns_if_found(ddl_string1)
print("--------")

# Case 2: the same statement with each clause on its own line.
ddl_string2 = """
CREATE TABLE default.table1 (DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT)
USING parquet
OPTIONS (
serialization.format '1'
)
PARTITIONED BY (DATA2, DATA3)
"""
print_partition_columns_if_found(ddl_string2)
This returns:
DATA2, DATA3
--------
did not find any pcols in (None, 'default.', 'table1 ', 'DATA4 BIGINT, DATA5 BIGINT, DATA2 BIGINT, DATA3 BIGINT', 'parquet', None, None, None)
Would any regex experts be willing to help me out?
【Discussion】:
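One likely cause, offered as a sketch rather than a definitive answer: the pattern puts a literal space (` +` and ` *`) in front of OPTIONS and PARTITIONED BY, and a literal space never matches a newline. re.DOTALL only changes what `.` matches, and re.MULTILINE only changes `^` and `$`. Because both trailing groups are optional, the match still succeeds, just without them. Replacing the literal spaces with `\s` should let the clauses sit on separate lines. The names DDL_REGEX, find_partition_columns and the `fmt` group below are illustrative, not from the question, and the sketch assumes simple identifiers and no parentheses inside the column list or the OPTIONS body.

```python
import re

# Sketch: same idea as the original pattern, but whitespace between clauses is
# matched with \s (which includes newlines) instead of a literal space.
# Assumption: identifiers are plain \w+ names and neither the column list nor
# the OPTIONS body contains nested parentheses (e.g. DECIMAL(10,2) would break it).
DDL_REGEX = re.compile(
    r'CREATE\s+(?:TEMPORARY\s+)?TABLE\s+(?:(?P<db>\w+)\.)?(?P<table>\w+)\s*'
    r'\((?P<col>.*?)\)\s*'
    r'USING\s+(?P<fmt>\S+)'
    r'(?:\s+OPTIONS\s*\([^)]*\))?'
    r'(?:\s+PARTITIONED\s+BY\s*\((?P<pcol>.*?)\))?',
    re.DOTALL | re.IGNORECASE,
)


def find_partition_columns(ddl_string):
    """Return the PARTITIONED BY column names as a list, or [] if absent."""
    match = DDL_REGEX.search(ddl_string)
    if match is None or match.group("pcol") is None:
        return []
    return [name.strip() for name in match.group("pcol").split(",")]
```

Calling find_partition_columns on either ddl_string1 or ddl_string2 from the question should give the two partition column names. For anything beyond simple statements like these, a real SQL parser is probably a safer choice than a regex.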