PostgreSQL: regex replace first level square brackets with curly braces (answers)

【Question title】: PostgreSQL: regex replace first level square brackets with curly braces
【Posted】: 2014-11-22 06:07:18
【Question】:

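I have data in a PostgreSQL column of type TEXT on which I need to do some character replacement. Specifically, I want to replace square brackets with curly braces. The catch is that I only want to replace brackets that are no more than two levels deep, counting the outermost enclosing brackets. The strings can be quite long, so I think a regular expression (the regexp_replace function) is probably the way to go, but I am not good with regex. Here is an example of one such value: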

[0,0,0,[12,2],0,0,[12,[1,2,3]],12,0,[12,2,[2]],12,0,12,0,0]

So I would like this string to be changed to:

{0,0,0,{12,2},0,0,{12,[1,2,3]},12,0,{12,2,[2]},12,0,12,0,0}

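Thanks in advance!

【Question discussion】:

Wow, thanks for the detailed solutions. I will need a little time to evaluate them. I should mention, though, that performance is an important consideration. I had already written my own plpgsql solution, but using position() and substr() is far too slow on columns containing more than 10 million characters, as in my case. Sorry I did not mention that in my original post, @wildplasser.
I will further mention that I had started on a plpythonu solution, but that language extension is not currently installed, and trying to install it produces errors. I may have to bite the bullet and figure it out.

Tags: regex postgresql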

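【Solution 1】:

Using regex will be painful, as the PostgreSQL flavor possibly has no recursion available.

For a maximum of 2 levels of nesting depth, check whether the following double replacement works (I could not test it):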

regexp_replace(
  regexp_replace('str', E'\\[(([^][]|\\[([^][]|\\[[^][]*\\])*\\])*)\\]', E'{\\1}', 'g')
, E'\\[(([^][]|\\[([^][]|\\[[^][]*\\])*\\])*)\\]', E'{\\1}', 'g')

The idea is to match and replace the outermost [] in each of two passes. See the example at regex101:

pass 1:{0,0,0,[12,2],0,0,[12,[1,2,3]],12,0,[12,2,[2]],12,0,12,0,0}
pass 2:{0,0,0,{12,2},0,0,{12,[1,2,3]},12,0,{12,2,[2]},12,0,12,0,0}
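The two passes can be checked outside the database. The following is a minimal sketch using Python's `re` module rather than PostgreSQL's regex engine, so it only assumes that both engines treat this particular pattern the same way on this input. The names `PATTERN` and `replace_two_levels` are made up for the illustration.

```python
import re

# Pattern from the answer, without the extra SQL string escaping: an outer
# [...] whose content may hold up to two further levels of nested [...].
PATTERN = re.compile(r"\[(([^][]|\[([^][]|\[[^][]*\])*\])*)\]")


def replace_two_levels(s: str) -> str:
    """Apply the same replacement twice, as the nested regexp_replace does."""
    pass1 = PATTERN.sub(r"{\1}", s)      # converts the outermost brackets
    pass2 = PATTERN.sub(r"{\1}", pass1)  # converts what is now outermost
    return pass2


sample = "[0,0,0,[12,2],0,0,[12,[1,2,3]],12,0,[12,2,[2]],12,0,12,0,0]"
print(PATTERN.sub(r"{\1}", sample))  # result of pass 1
print(replace_two_levels(sample))    # result of pass 2
```

Each pass only rewrites brackets that are outermost at that point, because a match consumes its whole nested content and the scan resumes after it.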

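\[[^][]*\] (unescaped) matches an instance of [...]:

\[ an opening square bracket
[^][]* followed by any number of characters that are not square brackets
\] followed by a closing square bracket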

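Note that if the string always starts with [, ends with ], and represents a single level-0 instance (not several instances separated by ][), the first (inner) regexp_replace can also be done by replacing the [ at the start (^) and the ] at the end ($): use E'^\\[(.*)\\]$' as the pattern and E'{\\1}' as the replacement.

To add more nesting, here is an example for a maximum depth of 4 levels: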
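As a rough illustration of that shortcut and of its precondition, here is a small Python sketch (again using Python's `re`, not PostgreSQL; `first_pass_anchored` is a hypothetical helper name):

```python
import re


def first_pass_anchored(s: str) -> str:
    """Replace only the very first '[' and the very last ']' of the string."""
    return re.sub(r"^\[(.*)\]$", r"{\1}", s)


# One top-level list: the anchored pattern does the job of pass 1.
print(first_pass_anchored("[0,[1,[2]],3]"))

# Two top-level lists side by side: the precondition is violated and the
# result pairs up brackets that do not belong together.
print(first_pass_anchored("[1][2]"))
```

The second call shows why the shortcut is only safe when the whole value is known to be a single bracketed list.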

\[([^][]|    # outer
\[([^][]|    # lvl 1
\[([^][]|    # lvl 2
\[([^][]|    # lvl 3
\[[^][]*\]   # lvl 4
)*\]
)*\]
)*\]
)*\]

Wrapping what is inside the outer [] in a capture group, the 4-level pattern becomes:

\[(([^][]|\[([^][]|\[([^][]|\[([^][]|\[[^][]*\])*\])*\])*\])*)\]

For use in regexp_replace, additional escaping may be needed (doubled backslashes inside an E'' string):

\\[(([^][]|\\[([^][]|\\[([^][]|\\[([^][]|\\[[^][]*\\])*\\])*\\])*\\])*)\\]

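This can be used like the first pattern, in two passes, with E'{\\1}' as the replacement.

【Discussion】:

Johnny 5, this works! The other solutions offered look promising too, but this one succeeds using the regular expression I more or less asked for. Also, as I noted in later comments, I had already tried an approach like @wildplasser's, but it was far too slow on very long text values. On strings of a few hundred thousand characters, the position-and-substring approach did not finish even after hours of processing. pgAdmin reports that this regex approach takes 270 ms! I also really appreciate the detailed explanation of the regex and of how to go deeper if necessary. Nice job!
@PaulAngelno Well, if speed is a concern I would suggest using PL/Python, which is considerably faster, or a C extension like the example I posted. I do not think regex is the right tool for this job, even though it is what you asked for.
@CraigRinger, agreed, and I would be interested in trying this in plpythonu to see how it performs, especially since the data is already in Python list format. Unfortunately, the Python extension is not working in my environment, and searching on the problem suggests I would have to rebuild my PostgreSQL from source, etc. For now I will take the path of least resistance.
Paul, glad this works for you, since I could not test it :) And thanks to @CraigRinger for offering a different view on the problem and doing all the benchmarking.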
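Because the pattern grows in a regular way with each level, it can also be generated instead of typed by hand. The following Python sketch builds the unescaped pattern for a given number of nested levels; `nested_pattern` and `ATOM` are hypothetical names, and the result would still need its backslashes doubled before being pasted into an E'' string.

```python
import re

ATOM = r"[^][]"  # any single character that is not a square bracket


def nested_pattern(levels: int) -> str:
    """Pattern for one outer [...] allowing `levels` nested levels inside it.

    The content of the outer brackets is captured as group 1.
    """
    if levels < 1:
        raise ValueError("levels must be at least 1")
    # Innermost level: brackets that contain no brackets at all.
    inner = r"\[" + ATOM + r"*\]"
    # Each further level may contain plain characters or the level below it.
    for _ in range(levels - 1):
        inner = r"\[(" + ATOM + "|" + inner + r")*\]"
    # Outer brackets, with their content wrapped in a capture group.
    return r"\[((" + ATOM + "|" + inner + r")*)\]"


print(nested_pattern(2))  # the two-level pattern used in the double replace
print(nested_pattern(4))  # the four-level pattern shown above

# Quick check that the generated pattern is usable: one pass on a deeper value.
print(re.sub(nested_pattern(4), r"{\1}", "[1,[2,[3,[4,[5]]]]]"))
```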

【Solution 2】:

This is ugly, but it works (and avoids the complexity of regular expressions ;-) I hope I have covered all the corner cases...

CREATE OR REPLACE FUNCTION replbracket( _source text ) returns text
AS $func$
DECLARE
        pos_end INTEGER;
        pos_begin INTEGER;
        level INTEGER;
        result text;
BEGIN
        result = '' ;
        level = 0;
LOOP
        pos_begin = position ( '[' IN _source );
        pos_end = position ( ']' IN _source );
        -- raise notice 'Source=% Result=% Begin = % End=%'
                -- ,_source, result, pos_begin, pos_end;

        if (pos_begin < 1 AND pos_end < 1) THEN EXIT ;
        elsif (pos_begin < 1 ) THEN pos_begin =  pos_end + 1 ;
        elsif (pos_end < 1 ) THEN pos_end =  pos_begin + 1 ;
        end if;
        if (pos_begin < pos_end) THEN
                result = result || LEFT(_source, pos_begin-1);
                level = level + 1;
                if (level <= 2) THEN result = result || '{'; else result = result || '['; end if;
                _source = SUBSTR(_source, pos_begin+1);
        ELSE
                result = result || LEFT(_source, pos_end-1);
                level  = level - 1;
                if (level < 2) THEN result = result || '}'; else result = result || ']'; end if;
                _source = SUBSTR(_source, pos_end+1);
        END IF;
END LOOP;
        result = result || _source ;
        return result;
END

$func$ LANGUAGE plpgsql;

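【Discussion】:

【Solution 3】:

Just for fun, here is a solution entirely in SQL. It uses CTEs for notational clarity, but you could use subqueries in FROM instead; no recursive CTEs are involved.

Edit: added a simplified, faster SQL version, a PL/Python version, and a C version. The C version is a bit faster - about 250 times faster.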

create or replace function repl(text) 
returns text 
language sql
as $$
with 
chars(pos, ch) as (
    -- In PostgreSQL 9.4 this can be replaced with an UNNEST ... WITH ORDINALITY
    -- it turns the string into a list of chars accompanied by their position within
    -- the string.
    select row_number() OVER (), ch
    from regexp_split_to_table($1,'') ch
),
nesting(ch, pos, lvl) as (
    -- This query then determines how many levels of nesting of [s and ]s are
    -- in effect for each character.
    select ch, pos, 
        sum(case ch when '[' then 1 when ']' then -1 else 0 end) OVER (ORDER BY pos) 
        from chars
),
transformed(ch, pos) as (
    -- and this query transforms [s to {s or ]s to }s if the nesting
    -- level is appropriate. Note that we use one less level of nesting
    -- for closing brackets because the closing bracket itself has already
    -- reduced the nesting level.
    select 
      case
        when ch = '[' and lvl <= 2 then '{' 
        when ch = ']' and lvl <= 1 then '}' 
        else ch
      end,
      pos
    from nesting
)
-- Finally, reconstruct the new string from the (char, position) tuples
select 
  string_agg(ch, '' order by pos)
from transformed;
$$;

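However, it is slower than the other solutions.

Johnny 5's regex solution takes 450 ms for 10,000 iterations.
wildplasser's replbracket takes 950 ms for 10,000 iterations.
This CTE solution takes 2050 ms for 10,000 iterations.

Getting rid of the CTEs and using unnest ... with ordinality speeds it up to about 1400 ms: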

create or replace function repl(text) returns text language sql volatile as
$$
    select
      string_agg(ch, '' order by pos)
    from (
        select
          case
            when ch = '[' and sum(case ch when '[' then 1 when ']' then -1 else 0 end) OVER (ORDER BY pos) <= 2 then '{'
            when ch = ']' and sum(case ch when '[' then 1 when ']' then -1 else 0 end) OVER (ORDER BY pos) <= 1 then '}'
            else ch
          end,
          pos
        from unnest(regexp_split_to_array($1,'')) with ordinality as chars(ch, pos)
    ) as transformed(ch, pos)
$$;
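Both SQL versions rest on the same idea: a running sum over the characters gives the nesting level after each one, and the bracket is swapped when that level is low enough. As a rough plain-Python sketch of that logic only (not of how PostgreSQL executes it; `repl_running_sum` is a hypothetical name):

```python
from itertools import accumulate


def repl_running_sum(s: str) -> str:
    """Mirror the windowed sum: +1 for '[', -1 for ']', 0 otherwise."""
    deltas = [1 if ch == "[" else -1 if ch == "]" else 0 for ch in s]
    # Nesting level *after* each character, like sum(...) OVER (ORDER BY pos).
    levels = accumulate(deltas)
    out = []
    for ch, lvl in zip(s, levels):
        if ch == "[" and lvl <= 2:
            out.append("{")
        elif ch == "]" and lvl <= 1:
            # A closing bracket has already lowered the level by one,
            # hence the threshold of 1 instead of 2.
            out.append("}")
        else:
            out.append(ch)
    return "".join(out)


sample = "[0,0,0,[12,2],0,0,[12,[1,2,3]],12,0,[12,2,[2]],12,0,12,0,0]"
print(repl_running_sum(sample))
```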

If you want it fast, use a proper procedural language, or C. In PL/Python2:

create or replace function replpy(instr text) returns text language plpythonu as $$
def pyrepl(instr):
    level = 0
    for ch in instr:
        if ch == '[':
            level += 1
            if level <= 2:
                yield '{'
            else:
                yield '['
        elif ch == ']':
            if level <= 2:
                yield '}'
            else:
                yield ']'
            level -= 1
        else:
            yield ch

return ''.join(pyrepl(instr))
$$;

This takes 160 ms.

Well, flogging a dead horse, let's do it in C. The full source code is available as an extension, but here is the .c file:

#include "postgres.h"
#include "fmgr.h"
#include "utils/builtins.h"

PG_MODULE_MAGIC;

PG_FUNCTION_INFO_V1(replc);
Datum replc(PG_FUNCTION_ARGS);

PGDLLEXPORT Datum
replc(PG_FUNCTION_ARGS)
{
    /* Set `buf` to a palloc'd copy of the input string, deTOASTed if needed */
    char * const buf = text_to_cstring(PG_GETARG_TEXT_PP(0));
    char * ch = buf;
    int depth = 0;


    while (*ch != '\0')
    {
        switch (*ch)
        {
            case '[':
                depth++;
                if (depth <= 2)
                    *ch = '{';
                break;
            case ']':
                if (depth <= 2)
                    *ch = '}';
                depth--;
                break;
        }
        ch++;
    }
    if (depth != 0)
        ereport(WARNING,
                (errmsg("Opening and closing []s did not match, got %d extra [s", depth)));

    PG_RETURN_DATUM(CStringGetTextDatum(buf));
}

Runtime: 8 ms for 10,000 iterations. Nice: it is 250 times faster than the original, and that is with the overhead of forcing a subquery.

【Discussion】: