Is there a way to list all categories in perluniprops?

【Question title】: Is there a way to list all categories in perluniprops?
【Posted】: 2021-07-12 06:20:50
【Question description】:

perluniprops lists the Unicode properties available for the version of Unicode that Perl supports. For Perl 5.32.1, that is Unicode 13.0.0.

You can use unichars from Unicode::Tussle to get the list of characters that match a given category.

unichars '\p{Close_Punctuation}'

It also has built-in help:

$ unichars --help
Usage:
    unichars [*options*] *criterion* ...

    Each criterion is either a square-bracketed character class, a regex
    starting with a backslash, or an arbitrary Perl expression. See the
    EXAMPLES section below.

    OPTIONS:

     Selection Options:

        --bmp           include the Basic Multilingual Plane (plane 0) [DEFAULT]
        --smp           include the Supplementary Multilingual Plane (plane 1)
        --astral    -a  include planes above the BMP (planes 1-15)
        --unnamed   -u  include various unnamed characters (see DESCRIPTION)
        --locale    -l  specify the locale used for UCA functions

     Display Options:

        --category  -c  include the general category (GC=)
        --script    -s  include the script name (SC=)
        --block     -b  include the block name (BLK=)
        --bidi      -B  include the bidi class (BC=)
        --combining -C  include the canonical combining class (CCC=)
        --numeric   -n  include the numeric value (NV=)
        --casefold  -f  include the casefold status
        --decimal   -d  include the decimal representation of the code point

     Miscellaneous Options:

        --version   -v  print version information and exit
        --help      -h  this message
        --man       -m  full manpage
        --debug     -d  show debugging of criteria and examined code point span

     Special Functions:

         $_    is the current code point
         ord   is the current code point's ordinal

         NAME is charname::viacode(ord)
         NUM is Unicode::UCD::num(ord), not code point number
         CF is casefold->{status}
         NFD, NFC, NFKD, NFKC, FCD, FCC  (normalization)
         UCA, UCA1, UCA2, UCA3, UCA4 (binary sort keys)

         Singleton, Exclusion, NonStDecomp, Comp_Ex
         checkNFD, checkNFC, checkNFKD, checkNFKC, checkFCD, checkFCC
         NFD_NO, NFC_NO, NFC_MAYBE, NFKD_NO, NFKC_NO, NFKC_MAYBE

Other than reading the list of categories off the web page, is there a way to get all possible \p{...} categories programmatically?

【Comments on the question】:

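There's no simple way, I imagine. Check what uniprops does.
It literally parses perluniprops.pod.
What are you trying to accomplish?
Does that program actually accept \p{} expressions as input? Maybe I missed it, but it doesn't appear to, so why would knowing the list of Unicode properties help you? What you need to know is which characters each property matches, and you already know the properties. Unless you actually don't. If you use the regex module instead of the re module, you get real Unicode properties as well. The latest version even uses Unicode 13.0.0, like the latest Perl.
(I appreciate that you made the effort to figure out what the program does rather than just saying "I'm trying to translate this.")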

Tags: string perl unicode character-set

【Solution 1】:

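From the comments, I believe you are trying to port a Perl program that uses \p regex properties to Python. You don't need a list of all categories (whatever that means); you only need to know which code points are matched by each property the program uses.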

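Now, you could get the lists of code points from the Unicode database. But a simpler solution is to use Python's regex module instead of the re module. That gives you access to the same Unicode-defined properties that Perl exposes.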

At the time of writing, the latest version of the regex module even uses Unicode 13.0.0, the same as the latest Perl.
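For the first option mentioned above (deriving code point lists from the Unicode database), here is a minimal sketch that uses only Python's standard unicodedata module. The helper name chars_in_category and its planes parameter are made up for this illustration. It handles only General_Category values such as "Pe" (the short alias of Close_Punctuation), not scripts or binary properties, and the Unicode version it reflects is whatever your Python build ships with, which may differ from Perl's.

```python
import unicodedata


def chars_in_category(category, planes=range(1)):
    """Return every character whose General_Category equals `category`.

    `planes` selects which 65,536-code-point planes to scan; the default
    covers only the Basic Multilingual Plane (plane 0).
    (Hypothetical helper, not part of any library.)
    """
    matches = []
    for plane in planes:
        start = plane * 0x10000
        for cp in range(start, start + 0x10000):
            ch = chr(cp)
            if unicodedata.category(ch) == category:
                matches.append(ch)
    return matches


if __name__ == "__main__":
    # "Pe" is the two-letter alias of Close_Punctuation.
    close_punct = chars_in_category("Pe")
    print("Unicode data version:", unicodedata.unidata_version)
    print("Matches in the BMP:", len(close_punct))
    for ch in close_punct[:10]:
        # Print the code point and name rather than the raw character.
        print(f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}")
```

This is roughly what unichars '\p{Close_Punctuation}' does for the BMP; pass planes=range(17) to scan every plane, at the cost of a slower run.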

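Note that the program uses \p{IsAlnum}, which is a long way of writing \p{Alnum}. \p{Alnum} isn't a standard Unicode property but a Perl extension. It is the union of the Unicode properties \p{Alpha} and \p{Nd}. I don't know whether the regex module defines Alnum the same way, but it probably does.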

【Discussion】: