Why '['||chr(128)||'-'||chr(255)||']' doesn't work

【Question title】: Why '['||chr(128)||'-'||chr(255)||']' doesn't work
【Posted】: 2021-10-12 04:44:15
【Question】:

In an Oracle query, I want to find strings that contain characters above chr(127).

I have seen many suggestions that '['||chr(128)||'-'||chr(255)||']' should work, but it doesn't.

So the next query must return OK, but it doesn't:

select 'OK' as result from dual where regexp_like('why Ä ?', '['||chr(128)||'-'||chr(255)||']')

And the next query must not return OK, but it does:

select 'OK' as result from dual where regexp_like('why - ?', '['||chr(128)||'-'||chr(255)||']')

UPD: Sorry, in my case the capital A-umlaut is \xC4 (ISO 8859 Latin 1), but here it became Unicode chr(50052).

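【Comments】:

You are assuming that "above 127" means "between 128 and 255". That assumption is wrong. Many characters (the vast majority, in fact) have code points above 255. For example, the capital A-umlaut in your first string is chr(50052). And in your second string, which character do you think is not below chr(127)?
Regarding character ranges, you may be interested in: stackoverflow.com/questions/50914930/…
What is the value of NLS_CHARACTERSET? With a multi-byte character set there will be code points above 255. But you could consider using the TRANSLATE function to remove all characters up to 127 and then check the length of the remaining string. Or use not regexp_like(
What exactly are you trying to achieve? Why use code points?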
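
The NLS questions raised in these comments can be checked directly. A minimal diagnostic sketch, assuming the standard `nls_database_parameters` and `nls_session_parameters` dictionary views are readable by the current user. The last query builds the A-umlaut from its Unicode code point, so the result does not depend on the client encoding:

```sql
-- Database character set: decides how many bytes 'Ä' occupies.
select parameter, value
from nls_database_parameters
where parameter in ('NLS_CHARACTERSET', 'NLS_NCHAR_CHARACTERSET');

-- Session sort and comparison settings (these can affect regex range expressions).
select parameter, value
from nls_session_parameters
where parameter in ('NLS_SORT', 'NLS_COMP');

-- Bytes actually stored for U+00C4 in the database character set.
-- Expected to show c3,84 in AL32UTF8 and c4 in a Latin-1 set such as WE8ISO8859P1.
select dump(to_char(unistr('\00C4')), 1016) as a_umlaut
from dual;
```

In AL32UTF8 the two bytes c3,84 read as one number give 195 * 256 + 132 = 50052, which is the chr(50052) mentioned above.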

Tags: regex oracle

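【Solution 1】:

How about a different approach? Split the string into characters and check whether the maximum value is greater than 127.

For example: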

SQL> with test (col) as
  2    (select 'why Ä ?' from dual)
  3  select substr(col, level, 1) one_character,
  4         ascii(substr(col, level, 1)) ascii_of_one_character
  5  from test
  6  connect by level <= length(col);

ONE_ ASCII_OF_ONE_CHARACTER
---- ----------------------
w                       119
h                       104
y                       121
                         32
Ä                     50621         --> here it is!
                         32
?                        63

7 rows selected.

SQL>

Now move that into a subquery and fetch the result:

SQL> with test (col) as
  2    (select 'why Ä ?' from dual)
  3  select case when max(ascii_of_one_character) > 127 then 'OK'
  4              else 'Not OK'
  5         end result
  6  from (select substr(col, level, 1) one_character,
  7          ascii(substr(col, level, 1)) ascii_of_one_character
  8        from test
  9        connect by level <= length(col)
 10       );

RESULT
------
OK

Or:

SQL> with test (col) as
  2    (select 'why - ?' from dual)
  3  select case when max(ascii_of_one_character) > 127 then 'OK'
  4              else 'Not OK'
  5         end result
  6  from (select substr(col, level, 1) one_character,
  7          ascii(substr(col, level, 1)) ascii_of_one_character
  8        from test
  9        connect by level <= length(col)
 10       );

RESULT
------
Not OK

Millions of rows? Well, the query I posted wouldn't work properly even for two rows. Switch to:

SQL> with test (col) as
  2    (select 'why - ?' from dual union all
  3     select 'why Ä ?' from dual
  4    )
  5  select col,
  6         case when max(ascii_of_one_character) > 127 then 'OK'
  7              else 'Not OK'
  8         end result
  9  from (select col,
 10               substr(col, column_value, 1) one_character,
 11               ascii(substr(col, column_value, 1)) ascii_of_one_character
 12        from test cross join table(cast(multiset(select level from dual
 13                                                 connect by level <= length(col)
 14                                                ) as sys.odcinumberlist))
 15       )
 16  group by col;

COL      RESULT
-------- ------
why - ?  Not OK
why Ä ?  OK

SQL>

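How will it behave? I don't know; try it and tell us. Note that for large data sets, regular expressions might actually be slower than a simple substr option.

Yet another option: how about TRANSLATE? In that case you don't have to split anything. For example: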

SQL> with test (col) as
  2    (select 'why - ?' from dual union all
  3     select 'why Ä ?' from dual
  4    )
  5  select col,
  6         case when nvl(length(res), 0) > 0 then 'OK'
  7              else 'Not OK'
  8         end result
  9  from (select col,
 10        translate
 11        (col,
 12         '!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ',
 13         '!') res
 14        from test
 15       );

COL      RESULT
-------- ------
why - ?  Not OK
why Ä ?  OK

SQL>
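
One caveat with the TRANSLATE query above: because '!' is mapped to itself, any '!' in the input survives, so a pure-ASCII string such as 'why !' would be reported as OK. Below is a minimal sketch of a variant that removes the leftover '!' with REPLACE. The third test row is an added, hypothetical example. Control characters such as tab or newline are not in the list, so they still count as non-ASCII here.

```sql
with test (col) as
  (select 'why - ?' from dual union all
   select 'why Ä ?' from dual union all
   select 'why !'   from dual   -- hypothetical extra row: ASCII only, contains '!'
  )
select col,
       case when replace(
                   -- delete every printable ASCII character except '!' ...
                   translate(col,
                             '!"#$%&''()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ',
                             '!'),
                   -- ... then delete the '!' that TRANSLATE had to keep
                   '!') is not null
            then 'OK'      -- something outside printable ASCII is left over
            else 'Not OK'  -- nothing left (an empty string is NULL in Oracle)
       end result
from test;
```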

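【Comments】:

Interesting approach, I'll try it, but I have millions of fairly long rows, so I'm afraid that splitting each row into hundreds of rows will kill my Oracle, and then the DBA will kill me.
Right; well, you never mentioned the number of rows. I edited the answer, because the query has to be modified in that case. Please have a look. Oh, yes: which flowers do you like?
One more option, @Dmitry: TRANSLATE. Check the edited answer.
I don't know about flowers, but your answers seem to work for me. Sorry, I'll check them more carefully on Monday.
What flowers should I bring to the funeral when the DBA kills you? Oh, killing is no longer an option? Then, sorry, no flowers for you.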

【Solution 2】:

Here is another approach:

with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select 
  str,
  case 
     when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']') 
       then 'Ok' 
       else 'Not ok' 
  end as res,
  xmlcast(
    xmlquery(
       'count(string-to-codepoints(.)[. > 127])' 
       passing t.str 
       returning content)
    as int) cnt_over_127
from t;

Result:

STR        RES    CNT_OVER_127
---------- ------ ------------
why Ä ?    Ok                1
why - ?    Not ok            0
why - ? Ä  Ok                1
why        Not ok            0

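As you can see, I used xmlquery() with the string-to-codepoints XPath function, kept only the code points > 127, and returned their count().

You could also use the dump or utl_raw.cast_to_raw() functions, but that is a bit more complicated and I'm too lazy to write a complete solution with them. Here is just a draft: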

with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select 
  str,
  case 
     when regexp_like(str, '[^'||chr(1)||'-'||chr(127)||']') 
       then 'Ok' 
       else 'Not ok' 
  end as res,
  dump(str,1016) dmp,
  dump(str,1015) dmp,
  utl_raw.cast_to_raw(str) as_row,
  regexp_count(dump(str,1016)||',', '[89a-f][0-9a-f],') xs
from t;

Result:

STR        RES    DMP                                                                 DMP                                                                     AS_ROW               XS
---------- ------ ------------------------------------------------------------------- ----------------------------------------------------------------------- -------------------- --
why Ä ?    Ok     Typ=1 Len=8 CharacterSet=AL32UTF8: 77,68,79,20,c3,84,20,3f          Typ=1 Len=8 CharacterSet=AL32UTF8: 119,104,121,32,195,132,32,63         77687920C384203F      2
why - ?    Not ok Typ=1 Len=7 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f             Typ=1 Len=7 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63              776879202D203F        0
why - ? Ä  Ok     Typ=1 Len=10 CharacterSet=AL32UTF8: 77,68,79,20,2d,20,3f,20,c3,84   Typ=1 Len=10 CharacterSet=AL32UTF8: 119,104,121,32,45,32,63,32,195,132  776879202D203F20C384  2
why        Not ok Typ=1 Len=3 CharacterSet=AL32UTF8: 77,68,79                         Typ=1 Len=3 CharacterSet=AL32UTF8: 119,104,121                          776879                0

Note: because the data is Unicode (UTF-8), a first byte > 127 means the character is multi-byte, so 'Ä' is counted twice (c3,84): both of its bytes are above 127.
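
To count each non-ASCII character once rather than once per byte, one option is to count only UTF-8 lead bytes, whose first hex digit is c to f (continuation bytes start with 8 to b). A minimal sketch, assuming an AL32UTF8 database as in the dump output above. The `chars`, `bytes` and `lead_bytes` column names are made up for illustration. In a single-byte character set the byte comparison would not detect anything.

```sql
with t(str) as (
select 'why Ä ?' from dual union all
select 'why - ?' from dual union all
select 'why - ? Ä' from dual union all
select 'why' from dual
)
select
  str,
  length(str)  as chars,   -- length in characters
  lengthb(str) as bytes,   -- length in bytes
  -- in AL32UTF8 only ASCII characters take exactly one byte,
  -- so more bytes than characters means a non-ASCII character is present
  case
     when lengthb(str) > length(str)
       then 'Ok'
       else 'Not ok'
  end as res,
  -- count lead bytes only, so every multi-byte character is counted once
  regexp_count(dump(str,1016)||',', '[c-f][0-9a-f],') as lead_bytes
from t;
```

For 'why Ä ?' this should report 7 characters and 8 bytes, in line with Len=8 in the dump output.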

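【Comments】:

【Solution 3】:

I'm not sure why you want to use code points rather than character sets, but you can invert the logic and match anything that is not in 1-127, i.e. [^1-127]: DBFiddle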

select 'OK' as result 
from dual 
where regexp_like('why Ä ?', '[^'||chr(1)||'-'||chr(127)||']');
select regexp_substr('why Ä ?', '[^'||chr(1)||'-'||chr(127)||']') x from dual;

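Don't forget that some characters may be special, for example ], or even non-printable.

【Comments】:

Did you test your answer? Replace 'why Ä ?' with a plain 'why' and you see OK again. That is my original problem.
@DmitryPerfilyev Are you sure? dbfiddle.uk/…
Yes, I'm sure, but I ran this query in SQL Developer version 4.1.3.20. My results: query 1: OK; query 2: OK; query 3: w
@DmitryPerfilyev Configure your NLS settings correctly and show them. Also retry this example in a correctly configured SQL*Plus and show what this query returns: select dump('why Ä',1016) dmp from dual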