Finding out what characters a given font supports: answers

【Question title】: Finding out what characters a given font supports
【Posted】: 2011-05-26 09:55:57
【Question】:

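How can I extract the list of supported Unicode characters from a TrueType or Embedded OpenType font on Linux?

Is there a tool or library that can process a .ttf or .eot file and build a list of the code points the font provides (such as U+0123, U+1234, and so on)?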

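【Comments】:

Try fc-list :charset=1234, but double-check its output… (it works for me: it shows that Gentium has 2082 but not 2161)
@mirabilos That is not what the question asks. It shows the fonts that contain a given character (i.e. 1234).
Oh, right. But the two questions are intertwined (and you will find many answers to the wrong question in the "Answers" section).
@mirabilos Good point. I have edited the title slightly to make the intent of the question more obvious.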

【Solution 1】:

Here is an approach using the fontTools Python library (which you can install with something like pip install fonttools):

#!/usr/bin/env python
from itertools import chain
import sys

from fontTools.ttLib import TTFont
from fontTools.unicode import Unicode

with TTFont(
    sys.argv[1], 0, allowVID=0, ignoreDecompileErrors=True, fontNumber=-1
) as ttf:
    chars = chain.from_iterable(
        [y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables
    )
    if len(sys.argv) == 2:  # print all code points
        for c in chars:
            print(c)
    elif len(sys.argv) >= 3:  # search code points / characters
        code_points = {c[0] for c in chars}
        for i in sys.argv[2:]:
            code_point = int(i)   # search code point
            #code_point = ord(i)  # search character
            print(Unicode[code_point])
            print(code_point in code_points)

The script takes the font path and, optionally, the code points/characters to search for as arguments:

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf
(32, 'space', 'SPACE')
(33, 'exclam', 'EXCLAMATION MARK')
(34, 'quotedbl', 'QUOTATION MARK')
…

$ python checkfont.py /usr/share/fonts/**/DejaVuSans.ttf 65 12622  # A ㅎ
LATIN CAPITAL LETTER A
True
HANGUL LETTER HIEUH
False
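
The script above expects decimal code points on the command line, while the question writes them as U+0123. As a minimal sketch (the helper name parse_code_point and the set of accepted forms are assumptions for illustration, not part of the original script), an argument parser that accepts several notations could look like this:

```python
def parse_code_point(arg: str) -> int:
    """Turn a command-line argument into a Unicode code point.

    Accepted forms (an assumption of this sketch): "U+314E", "0x314e",
    a decimal number such as "12622", or one literal character such as "ㅎ".
    A single ASCII digit is treated as a decimal number, not as a character.
    """
    # A single non-digit character (including a space) is taken literally.
    if len(arg) == 1 and arg not in "0123456789":
        return ord(arg)
    text = arg.strip()
    if text.upper().startswith("U+"):
        return int(text[2:], 16)
    if text.lower().startswith("0x"):
        return int(text, 16)
    # Anything else must be a decimal number; int() raises ValueError otherwise.
    return int(text, 10)


if __name__ == "__main__":
    for example in ["U+314E", "0x41", "65", "ㅎ"]:
        print(example, "->", parse_code_point(example))
```

In the script, int(i) could then be replaced by parse_code_point(i), so that both search modes work without editing the source.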

【Comments】:

int(sys.argv[2], 0) will probably fail with "invalid literal" in most cases, because people may want to look up special characters. Use ord(sys.argv[2].decode('string_escape').decode('utf-8')) instead.
Anyway, this script based on python-fontconfig seems much faster: unix.stackexchange.com/a/268286/26952
@SkippyleGrandGourou That line looks right to me? It passes sys.argv[1] to TTFont()?
You can simplify chars = chain.from_iterable([y + (Unicode[y[0]],) for y in x.cmap.items()] for x in ttf["cmap"].tables) to chars = list(y + (Unicode[y[0]],) for x in ttf["cmap"].tables for y in x.cmap.items())

【Solution 2】:

The X program xfd can do this. To see all the characters of the "DejaVu Sans Mono" font, run:

xfd -fa "DejaVu Sans Mono"

It is included in the x11-utils package on Debian/Ubuntu, xorg-x11-apps on Fedora/RHEL, and xorg-xfd on Arch Linux.

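【Comments】:

xfd also gives the hex values, which you need in order to type the characters via Unicode input à la Ctrl+Shift+U.
Opening a GUI character map is not at all the same as listing the supported characters.
I wonder whether there is something similar for the built-in bitmap fonts, such as 6x13?
Unfortunately, this only works for installed fonts. It would be handy to get this list before installing a font.
This shows empty rectangles for unsupported characters.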

【Solution 3】:

The fontconfig commands can output the glyph list as a compact list of ranges, e.g.:

$ fc-match --format='%{charset}\n' OpenSans
20-7e a0-17f 192 1a0-1a1 1af-1b0 1f0 1fa-1ff 218-21b 237 2bc 2c6-2c7 2c9
2d8-2dd 2f3 300-301 303 309 30f 323 384-38a 38c 38e-3a1 3a3-3ce 3d1-3d2 3d6
400-486 488-513 1e00-1e01 1e3e-1e3f 1e80-1e85 1ea0-1ef9 1f4d 2000-200b
2013-2015 2017-201e 2020-2022 2026 2030 2032-2033 2039-203a 203c 2044 2070
2074-2079 207f 20a3-20a4 20a7 20ab-20ac 2105 2113 2116 2120 2122 2126 212e
215b-215e 2202 2206 220f 2211-2212 221a 221e 222b 2248 2260 2264-2265 25ca
fb00-fb04 feff fffc-fffd

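Use fc-query for .ttf files and fc-match for installed font names.

This probably does not involve installing any extra packages, and it does not involve translating a bitmap.

Use fc-match --format='%{file}\n' to check whether the correct font is being matched.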
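
The compact range list is easy to expand into individual code points. The following is a small sketch (the function name parse_charset is an assumption for illustration) that takes the text printed by %{charset}, where each token is either a single hex value or a hex range joined by a hyphen:

```python
from typing import List


def parse_charset(charset: str) -> List[int]:
    """Expand fontconfig's compact charset text into a sorted list of code points."""
    code_points = set()
    for token in charset.split():
        # "a0-17f" is an inclusive range; "192" is a single code point.
        start, _, end = token.partition("-")
        first = int(start, 16)
        last = int(end, 16) if end else first
        code_points.update(range(first, last + 1))
    return sorted(code_points)


if __name__ == "__main__":
    # First three tokens of the OpenSans output shown above.
    points = parse_charset("20-7e a0-17f 192")
    print(len(points))
    print(["U+%04X" % cp for cp in points[:3]])
```

The input can come from fc-match or fc-query alike, since both accept the same --format string.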

【Comments】:

This is a lie: it says "Gentium Italic" has "2150-2185" and so on, but it definitely does not have 2161.
@mirabilos I have Gentium 5.000 and it definitely contains 2161: ttx -t cmap -o - /usr/share/fonts/truetype/GentiumPlus-I.ttf | grep 0x2161 returns <map code="0x2161" name="uni2161"/>. FontConfig may be matching a different font. Before I installed gentium, fc-match 'Gentium Italic' returned FreeMono.ttf: "FreeMono" "Regular". If so, the output of --format=%{charset} would not show what you expect.
I have added a note mentioning the need to check that the correct font is matched.
Gentium Plus ≠ Gentium (I have all three installed, plain, Basic and Plus, but I was asking about Gentium). Ah, never mind, I see the problem: $ fc-match --format='%{file}\n' Gentium gives /usr/share/fonts/truetype/gentium/Gentium-R.ttf; $ fc-match --format='%{file}\n' Gentium\ Italic gives /usr/share/fonts/truetype/dejavu/DejaVuSans.ttf; $ fc-match --format='%{file}\n' Gentium:Italic gives /usr/share/fonts/truetype/gentium/Gentium-I.ttf. And fc-match --format='%{file} ⇒ %{charset}\n' Gentium:Italic does the right thing, great.
Glad that solved it for you. Good tip about Gentium:Italic rather than Gentium Italic, too. Thank you.

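【Solution 4】:

fc-query my-font.ttf will give you a map of the supported glyphs, along with all the locales the font is suitable for according to fontconfig.

Since almost all modern Linux applications are based on fontconfig, this is much more useful than a raw Unicode list.

The actual output format is discussed here: http://lists.freedesktop.org/archives/fontconfig/2013-September/004915.html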

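【Comments】:

【Solution 5】:

The character code points of ttf/otf fonts are stored in the CMAP table.

You can use ttx to generate an XML representation of the CMAP table. See here.

You can run the command ttx.exe -t cmap MyFont.ttf, and it should output a file MyFont.ttx. Open it in a text editor and it should show you all the character codes found in the font.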
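
Instead of reading the .ttx file by eye, the code points can be pulled out with a few lines of Python. This is a sketch under assumptions: the function name code_points_from_ttx is made up for illustration, and the sample XML below is a hand-written fragment in the shape ttx produces (map elements with a hexadecimal code attribute), not real tool output.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hand-written sample in the shape of a ttx cmap dump (illustrative only).
SAMPLE_TTX = """
<ttFont>
  <cmap>
    <tableVersion version="0"/>
    <cmap_format_4 platformID="0" platEncID="3" language="0">
      <map code="0x20" name="space"/>
      <map code="0x21" name="exclam"/>
    </cmap_format_4>
    <cmap_format_4 platformID="3" platEncID="1" language="0">
      <map code="0x20" name="space"/>
      <map code="0x2161" name="uni2161"/>
    </cmap_format_4>
  </cmap>
</ttFont>
"""


def code_points_from_ttx(xml_text: str) -> List[int]:
    """Collect the distinct code points from every <map code="..."> element."""
    root = ET.fromstring(xml_text)
    found = set()
    for entry in root.iter("map"):
        code = entry.get("code")
        if code is not None:  # skip entries that carry no "code" attribute
            found.add(int(code, 16))
    return sorted(found)


if __name__ == "__main__":
    print(["U+%04X" % cp for cp in code_points_from_ttx(SAMPLE_TTX)])
```

For a real dump, ET.parse("MyFont.ttx").getroot() can be used in place of ET.fromstring. The same code point usually appears in several subtables, which is why the sketch collects into a set.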

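【Comments】:

Note that ttx is part of the fonttools mentioned in the accepted answer. It is a Python script, so it is also available on Mac and Linux.
You can use -o - to make ttx write its output to STDOUT. For example, ttx -o - -t cmap myfont.ttf dumps the contents of the cmap table of the font myfont.ttf to STDOUT. You can then use that to see whether a given character is defined in a given font (e.g. $ ttx -o - -t cmap myfont.ttf | grep '5c81')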

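【Solution 6】:

Here is a ~~POSIX~~[1] shell script that, with the help of fc-match (mentioned in Neil Mayhew's answer), prints the code points and characters in a simple, readable way (it can even handle Unicode with up to 8 hex digits):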

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        n_hex=$(printf "%04x" "$n")
        # using \U for 5-hex-digits
        printf "%-5s\U$n_hex\t" "$n_hex"
        count=$((count + 1))
        if [ $((count % 10)) = 0 ]; then
            printf "\n"
        fi
    done
done
printf "\n"

You can pass a font name or anything else that fc-match accepts:

$ ls-chars "DejaVu Sans"

Update:

I learned that subshells are very time-consuming (the printf subshell in my script), so I managed to write an improved version that is 5-10 times faster!

#!/bin/bash
for range in $(fc-match --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"

Old version:

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m2.876s
user    0m2.203s
sys     0m0.888s

New version (the line count indicates 5910+ characters, in 0.4 seconds!):

$ time ls-chars "DejaVu Sans" | wc
    592   11269   52740

real    0m0.399s
user    0m0.446s
sys     0m0.120s

End of update

Sample output (it aligns better in my st terminal):

0020    0021 !  0022 "  0023 #  0024 $  0025 %  0026 &  0027 '  0028 (  0029 )
002a *  002b +  002c ,  002d -  002e .  002f /  0030 0  0031 1  0032 2  0033 3
0034 4  0035 5  0036 6  0037 7  0038 8  0039 9  003a :  003b ;  003c <  003d =
003e >  003f ?  0040 @  0041 A  0042 B  0043 C  0044 D  0045 E  0046 F  0047 G
...
1f61a😚 1f61b😛 1f61c😜 1f61d😝 1f61e😞 1f61f😟 1f620😠 1f621😡 1f622😢 1f623😣
1f625😥 1f626😦 1f627😧 1f628😨 1f629😩 1f62a😪 1f62b😫 1f62d😭 1f62e😮 1f62f😯
1f630😰 1f631😱 1f632😲 1f633😳 1f634😴 1f635😵 1f636😶 1f637😷 1f638😸 1f639😹
1f63a😺 1f63b😻 1f63c😼 1f63d😽 1f63e😾 1f63f😿 1f640🙀 1f643🙃

[1] It seems that \U in printf is not part of the POSIX standard?
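
Since \U in printf depends on the shell, the same table layout can also be produced in Python, where chr() handles any code point. A minimal sketch, assuming the code points have already been obtained (for example from the fc-match charset output); the function name format_table is hypothetical:

```python
def format_table(code_points, per_row=10):
    """Lay out code points as 'hex char' cells, tab-separated, per_row per line."""
    ordered = sorted(code_points)
    rows = []
    for i in range(0, len(ordered), per_row):
        chunk = ordered[i:i + per_row]
        rows.append("\t".join("%04x %s" % (cp, chr(cp)) for cp in chunk))
    return "\n".join(rows)


if __name__ == "__main__":
    # Printable ASCII from 0x21 to 0x34, as a stand-in for a real font's coverage.
    print(format_table(range(0x21, 0x35)))
```

This avoids spawning any subprocess per character, which was the bottleneck the update above worked around.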

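【Comments】:

#!/bin/sh => #!/bin/bash
@vatosarmat, right, it should be bash, thanks. I guess the former worked for me because the shell used the printf executable rather than the shell builtin.
Correction to my last comment: the #!/bin/sh shebang does not work for me either; maybe I never actually tried it. My bad.
\U probably expects 6 characters; \u takes 4. That is fairly typical for programming languages (otherwise it would be ambiguous), although some are a bit lax about it. It does make a difference at least on Ubuntu 20.04, where printf \U1f643 prints \u0001F643 (a surrogate pair?), but \U01f643 returns 🙃
Hmm, '\U0030' produces a '0', and '\U0030' produces '0'. '\U0030a' produces '\u030a' (leading zero, normalized to \u with 4 digits). But, as others have pointed out, this is the bash builtin, not POSIX printf. /usr/bin/printf '\U0030' gives 'missing hexadecimal number in escape', and /usr/bin/printf '\u0030' gives 'invalid universal character name \u0030', but that is only because it should be specified as '0'. gnu-coreutils.7620.n7.nabble.com/…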

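【Solution 7】:

I just ran into the same problem and wrote a HOWTO that goes one step further and bakes a regular expression of all the supported Unicode code points.

If you just want the array of code points, you can use the following while viewing the ttx XML in Chrome devtools, after running ttx -t cmap myfont.ttf and probably renaming myfont.ttx to myfont.xml to trigger Chrome's XML mode: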

function codepoint(node) { return Number(node.nodeValue); }
$x('//cmap/*[@platformID="0"]/*/@code').map(codepoint);

(This also relies on fonttools, from gilamesh's suggestion; on an Ubuntu system, use sudo apt-get install fonttools.)
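
The idea of baking a regular expression from the code points can be sketched in Python as well. This is not the HOWTO's method, only an illustration of the technique: merge consecutive code points into ranges, then emit a character class. The function names to_ranges and to_char_class are assumptions.

```python
import re


def to_ranges(code_points):
    """Merge code points into inclusive (first, last) runs of consecutive values."""
    ranges = []
    for cp in sorted(set(code_points)):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1][1] = cp          # extend the current run
        else:
            ranges.append([cp, cp])     # start a new run
    return [(first, last) for first, last in ranges]


def to_char_class(code_points):
    """Build a regex character class using \\UXXXXXXXX escapes (Python re syntax)."""
    parts = []
    for first, last in to_ranges(code_points):
        if first == last:
            parts.append(f"\\U{first:08x}")
        else:
            parts.append(f"\\U{first:08x}-\\U{last:08x}")
    return "[" + "".join(parts) + "]"


if __name__ == "__main__":
    supported = [0x41, 0x42, 0x43, 0x61, 0x1F600]
    pattern = re.compile(to_char_class(supported) + "+")
    print(to_char_class(supported))
    print(bool(pattern.fullmatch("ABCa")), bool(pattern.fullmatch("ABD")))
```

Writing every code point as an eight-digit escape keeps the class free of quoting problems with characters such as ] or ^. Other regex dialects use different escape syntax, so the output is specific to Python's re module.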

【Comments】:

【Solution 8】:

To add to @Oliver Lew's answer, I have added the option to query a local font file instead of a system font:

#!/bin/bash

# If the first argument is a font file, use fc-query instead of fc-match to
# display the font
[[ -f "$1" ]] && fc='fc-query' || fc='fc-match'

for range in $($fc --format='%{charset}\n' "$1"); do
    for n in $(seq "0x${range%-*}" "0x${range#*-}"); do
        printf "%04x\n" "$n"
    done
done | while read -r n_hex; do
    count=$((count + 1))
    printf "%-5s\U$n_hex\t" "$n_hex"
    [ $((count % 10)) = 0 ] && printf "\n"
done
printf "\n"

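【Comments】:

【Solution 9】:

Janus's answer above (https://stackoverflow.com/a/19438403/431528) works. But Python is too slow, especially for Asian fonts: a font with a 40 MB file size takes several minutes on my E5 computer.

So I wrote a small C++ program to do this. It depends on FreeType2 (https://www.freetype.org/). It is a VS2015 project, but it is easy to port to Linux because it is a console application.

The code can be found at https://github.com/zhk/AllCodePoints. For an Asian font with a 40 MB file size, it takes about 30 ms on my E5 computer.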

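【Comments】:

【Solution 10】:

You can do this in Perl on Linux using the Font::TTF module.

【Comments】:

Yes, that should be possible. But it is a complex set of modules with terrible documentation, so without an example of how to do it, this answer seems rather useless.

【Solution 11】:

If you just want to "see" the font, the following might help (if your terminal supports the font in question):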

#!/usr/bin/env python
import sys
from fontTools.ttLib import TTFont

with TTFont(sys.argv[1], 0, ignoreDecompileErrors=True) as ttf:
    for x in ttf["cmap"].tables:
        for (cp, _) in x.cmap.items():
            # Build the escape from the code point itself; glyph names are not
            # always of the form "uniXXXX" (e.g. "space").
            print("echo -e '\\U%08x'" % cp)

An unsafe but simple way to view them:

python font.py my-font.ttf | sh

Thanks to Janus's answer above (https://stackoverflow.com/a/19438403/431528).

【Comments】:

【Solution 12】:

If you want to get all the characters supported by a font, you can use the following (based on Janus's answer):

from fontTools.ttLib import TTFont

def get_font_characters(font_path):
    with TTFont(font_path) as font:
        characters = {chr(y[0]) for x in font["cmap"].tables for y in x.cmap.items()}
    return characters
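
A typical use of such a character set is checking whether a piece of text can be rendered by the font. The sketch below assumes the set has already been built; the helper name missing_characters and the stand-in set are illustrative only, so the example runs without a font file.

```python
def missing_characters(text, supported):
    """Return the characters of `text` not in `supported`, in first-seen order."""
    seen = set()
    missing = []
    for ch in text:
        if ch not in supported and ch not in seen:
            seen.add(ch)
            missing.append(ch)
    return missing


if __name__ == "__main__":
    # Stand-in for the result of get_font_characters("my-font.ttf").
    supported = set("ABCabc ")
    print(missing_characters("A cab ㅎㅎ", supported))  # prints ['ㅎ']
```

An empty result means every character of the text has an entry in the font's cmap tables.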

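【Comments】:

How would you modify this script to work with otf fonts?

【Solution 13】:

The FreeType project provides demo applications, one of which is called "ftdump". You can run "ftdump -V path-to-the-font-file" and you will get what you want. To look at the source code, you can get the sources here: https://www.freetype.org/developer.html

On Ubuntu it can be installed with "sudo apt install freetype2-demos".

Note: try "-c" instead of "-V". I have seen the arguments change between versions.

【Comments】: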