Get INSERT INTO statement from SQL statements string in Python

【Question title】: Get INSERT INTO statement from SQL statements string in Python
【Posted】: 2021-12-27 20:12:40
【Question description】:

I have the following string:

sql = """DROP TABLE IF EXISTS table1;

ALTER TABLE table1 DROP PRIMARY KEY;

INSERT INTO table1 (id, created, name, telefonnummer, erPatient_id) VALUES
    (1, '2015-08-06 12;09:08', ' ', ' ', 16528),
    (2, '2015-08-06 12:43:11', ' ', ' ', 16529)
;

INSERT INTO table2 (comment, id) VALUES
('hello this is a semicolon ;', 2);"""

I want to extract the INSERT INTO table1 statement:

INSERT INTO table1 (id, created, name, telefonnummer, erPatient_id) VALUES
        (1, '2015-08-06 12;09:08', ' ', ' ', 16528),
        (2, '2015-08-06 12:43:11', ' ', ' ', 16529)
    ;

I can't split the string with sql.split(';') because the VALUES being inserted contain semicolons.

I tried a regular expression, without success:

import re
pattern_string = r"INSERT INTO table1[(]*[^)]+\)[^)]"
q = re.findall(pattern_string, sql, re.MULTILINE | re.DOTALL)

In the real string, thousands of values are inserted across dozens of tables.

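【Comments】:

If your data is irregular, a regular expression is the wrong tool. You need a parsing engine. This problem is not new; CSV and countless other formats have the same issue.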
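To illustrate what the comment means by a parsing approach, here is a minimal sketch of a quote-aware splitter that uses only the standard library. The function name split_sql_statements and the shortened sample string are hypothetical and not from the original post. It assumes string literals use single or double quotes with doubled-quote escaping, and it does not handle SQL comments, backslash escapes, or dialect-specific quoting, so treat it as a starting point rather than a full parser.

```python
import re


def split_sql_statements(script):
    """Split a SQL script on semicolons that lie outside quoted strings.

    Assumption: literals use ' or " and escape a quote by doubling it.
    Comments and backslash escapes are NOT handled.
    """
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, or None
    i = 0
    while i < len(script):
        ch = script[i]
        if quote:
            current.append(ch)
            if ch == quote:
                # A doubled quote ('' or "") is an escaped quote, not the end.
                if i + 1 < len(script) and script[i + 1] == quote:
                    current.append(quote)
                    i += 1
                else:
                    quote = None
        elif ch in ("'", '"'):
            quote = ch
            current.append(ch)
        elif ch == ";":
            # A semicolon outside any literal terminates the statement.
            statement = "".join(current).strip()
            if statement:
                statements.append(statement + ";")
            current = []
        else:
            current.append(ch)
        i += 1
    tail = "".join(current).strip()
    if tail:
        statements.append(tail)
    return statements


# Shortened, hypothetical version of the string from the question.
sample = """DROP TABLE IF EXISTS table1;

INSERT INTO table1 (id, created) VALUES
    (1, '2015-08-06 12;09:08');

INSERT INTO table2 (comment, id) VALUES
('hello this is a semicolon ;', 2);"""

statements = split_sql_statements(sample)
table1_inserts = [
    s for s in statements
    if re.match(r"INSERT\s+INTO\s+table1\b", s, re.IGNORECASE)
]
```

Because the splitter tracks whether it is inside a literal, the semicolons in '2015-08-06 12;09:08' and 'hello this is a semicolon ;' do not end a statement, regardless of where they fall on a line.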

Tags: python sql regex

【Solution 1】:

Use

import re

# `sql` is the string of SQL statements from the question.
pattern_string = r"\bINSERT INTO \w+\s.*?;\s*$"
q = re.findall(pattern_string, sql, re.MULTILINE | re.DOTALL)

See the regex proof.

Explanation

NODE                     EXPLANATION
--------------------------------------------------------------------------------
  \b                       the boundary between a word char (\w) and
                           something that is not a word char
--------------------------------------------------------------------------------
  INSERT INTO              'INSERT INTO '
--------------------------------------------------------------------------------
  \w+                      word characters (1 or more times
                           (matching the most amount possible))
--------------------------------------------------------------------------------
  \s                       whitespace (\n, \r, \t, \f, and " ")
--------------------------------------------------------------------------------
  .*?                      any character (0 or more times
                           (matching the least amount possible))
--------------------------------------------------------------------------------
  ;                        ';'
--------------------------------------------------------------------------------
  \s*                      whitespace (\n, \r, \t, \f, and " ") (0 or
                           more times (matching the most amount
                           possible))
--------------------------------------------------------------------------------
  $                        end of a line
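
As a usage sketch, the pattern above can be applied to a shortened version of the question's string. The helper find_inserts and its table parameter are hypothetical additions for illustration: passing a table name swaps \w+ for that literal name so only that table's statements are returned. Note that the pattern relies on the terminating semicolon being the last non-whitespace character on its line, so a semicolon inside a string literal that happens to sit at the end of a line would end the match early.

```python
import re

# Shortened, hypothetical version of the string from the question.
sql = """DROP TABLE IF EXISTS table1;

INSERT INTO table1 (id, created) VALUES
    (1, '2015-08-06 12;09:08'),
    (2, '2015-08-06 12:43:11')
;

INSERT INTO table2 (comment, id) VALUES
('hello this is a semicolon ;', 2);"""


def find_inserts(script, table=None):
    """Return INSERT INTO statements, optionally for one table only."""
    # Use the literal table name if given, otherwise any word characters.
    name = re.escape(table) if table else r"\w+"
    pattern = rf"\bINSERT INTO {name}\s.*?;\s*$"
    matches = re.findall(pattern, script, re.MULTILINE | re.DOTALL)
    # \s* may capture a trailing newline, so strip each match.
    return [m.strip() for m in matches]


all_inserts = find_inserts(sql)
table1_inserts = find_inserts(sql, "table1")
print(table1_inserts[0])
```

The semicolon in '2015-08-06 12;09:08' is skipped because it is followed by more characters on the same line, so `;\s*$` cannot match there and the lazy `.*?` keeps extending.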

【Comments】:

【Solution 2】:

You can use my library, SQLGlot.

import sqlglot
import sqlglot.expressions as exp

sql = ...

for expression in sqlglot.parse(sql):
    if isinstance(expression, exp.Insert):
        print(expression)

【Comments】: