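Extract text with iText with embedded fonts

【Question title】: Extract text with iText with embedded fonts
【Posted】: 2019-07-12 16:04:32
【Question】:

I am trying to extract text from the following PDF using iTextSharp (v5.5.12.1): https://structure.mil.ru/files/morf/military/files/ENGV_1929.pdf

Unfortunately, they seem to have used a number of embedded custom fonts, and that is defeating my attempts.

Currently I have a working solution that uses OCR, but OCR can be imprecise: it misreads some characters and also adds extra spaces between characters. It would be ideal if I could extract the text directly.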

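using System;
using System.IO;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static string ExtractTextFromPdf(Stream pdfStream, bool addNewLineBetweenPages = false)
{
    using (PdfReader reader = new PdfReader(pdfStream))
    {
        // StringBuilder avoids repeated string copies on long documents.
        StringBuilder text = new StringBuilder();

        // iText page numbers are 1-based.
        for (int i = 1; i <= reader.NumberOfPages; i++)
        {
            text.Append(PdfTextExtractor.GetTextFromPage(reader, i));
            if (addNewLineBetweenPages && i != reader.NumberOfPages)
            {
                text.Append(Environment.NewLine);
            }
        }

        return text.ToString();
    }
}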

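【Comments】:

@mkl I have read your answer stackoverflow.com/questions/37748346/…, and I suspect my case may be somewhat similar. Any help or pointers would be much appreciated.
In short, the fonts in your file do not provide the information that the PDF specification describes as required for text extraction. They neither contain a ToUnicode map nor use a standard named encoding or standard glyph names in the Encoding Differences. Furthermore, even the embedded font programs contain no mapping to Unicode. Thus, standard text extraction in iText (or copy and paste in Adobe Reader) will not help you. If the non-standard glyph names happen to be used consistently, though, it may be possible to tweak iText to extract the text.
@mkl Thank you very much for your reply. I completely agree with everything you said. I was hoping there might be something I had missed. It seems to me that if a PDF viewer can display the PDF, then surely it must be possible to extract the text somehow... I have been experimenting with taking the characters that iText reads and replacing them with what I think they should be. No luck so far, though. One thing that worries me is that the first three characters are \u0001, \u0002, \u0003.
"It seems to me that if a PDF viewer can display the PDF, then surely it must be possible to extract the text somehow" - unfortunately, those are two entirely different things. For display, a viewer only needs to map the character codes in the PDF to a set of drawing instructions in the embedded font program, and that mapping need not be related at all to Unicode code points (which are what text extraction needs). I will try next week to see whether there is a way, though, as the mapping is not without system.
@mkl Thank you very much. Fingers crossed that you can find the system. I tried what seemed obvious to me, but it did not work.

Tags: c# pdf itext text-extraction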

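【Solution 1】:

The problem here is that the glyphs in the embedded font programs have non-standard glyph names (G00, G01, ...) and are identified by glyph name only. Thus, one has to establish a mapping from these glyph names to Unicode characters. You can do so, for example, by inspecting the font programs in the PDF (e.g. using FontForge) and visually recognizing which character each named glyph draws.

(The font in question has gaps in its glyph table; the missing glyphs are ones the document at hand does not use. Some of them you can guess, others not.)

Then you have to inject these mappings into iText. As the mappings are hidden (a private static member of the GlyphList class), you either have to patch iText itself or use reflection: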

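using System.Collections.Generic;
using System.Reflection;
using iTextSharp.text.pdf;

void InitializeGlyphs()
{
    // names2unicode is a private static field of GlyphList, so it is fetched via reflection.
    FieldInfo names2unicodeField = typeof(GlyphList).GetField("names2unicode", BindingFlags.NonPublic | BindingFlags.Static);
    Dictionary<string, int[]> names2unicode = (Dictionary<string, int[]>) names2unicodeField.GetValue(null);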
    names2unicode["G03"] = new int[] { ' ' };

    names2unicode["G0A"] = new int[] { '\'' };
    names2unicode["G0B"] = new int[] { '(' };
    names2unicode["G0C"] = new int[] { ')' };

    names2unicode["G0F"] = new int[] { ',' };
    names2unicode["G10"] = new int[] { '-' };
    names2unicode["G11"] = new int[] { '.' };
    names2unicode["G12"] = new int[] { '/' };
    names2unicode["G13"] = new int[] { '0' };
    names2unicode["G14"] = new int[] { '1' };
    names2unicode["G15"] = new int[] { '2' };
    names2unicode["G16"] = new int[] { '3' };
    names2unicode["G17"] = new int[] { '4' };
    names2unicode["G18"] = new int[] { '5' };
    names2unicode["G19"] = new int[] { '6' };
    names2unicode["G1A"] = new int[] { '7' };
    names2unicode["G1B"] = new int[] { '8' };
    names2unicode["G1C"] = new int[] { '9' };
    names2unicode["G1D"] = new int[] { ':' };

    names2unicode["G23"] = new int[] { '@' };
    names2unicode["G24"] = new int[] { 'A' };
    names2unicode["G25"] = new int[] { 'B' };
    names2unicode["G26"] = new int[] { 'C' };
    names2unicode["G27"] = new int[] { 'D' };
    names2unicode["G28"] = new int[] { 'E' };
    names2unicode["G29"] = new int[] { 'F' };
    names2unicode["G2A"] = new int[] { 'G' };
    names2unicode["G2B"] = new int[] { 'H' };
    names2unicode["G2C"] = new int[] { 'I' };
    names2unicode["G2D"] = new int[] { 'J' };
    names2unicode["G2E"] = new int[] { 'K' };
    names2unicode["G2F"] = new int[] { 'L' };
    names2unicode["G30"] = new int[] { 'M' };
    names2unicode["G31"] = new int[] { 'N' };
    names2unicode["G32"] = new int[] { 'O' };
    names2unicode["G33"] = new int[] { 'P' };
    names2unicode["G34"] = new int[] { 'Q' };
    names2unicode["G35"] = new int[] { 'R' };
    names2unicode["G36"] = new int[] { 'S' };
    names2unicode["G37"] = new int[] { 'T' };
    names2unicode["G38"] = new int[] { 'U' };
    names2unicode["G39"] = new int[] { 'V' };
    names2unicode["G3A"] = new int[] { 'W' };
    names2unicode["G3B"] = new int[] { 'X' };
    names2unicode["G3C"] = new int[] { 'Y' };
    names2unicode["G3D"] = new int[] { 'Z' };

    names2unicode["G42"] = new int[] { '_' };

    names2unicode["G44"] = new int[] { 'a' };
    names2unicode["G45"] = new int[] { 'b' };
    names2unicode["G46"] = new int[] { 'c' };
    names2unicode["G46._"] = new int[] { 'c' };
    names2unicode["G47"] = new int[] { 'd' };
    names2unicode["G48"] = new int[] { 'e' };
    names2unicode["G49"] = new int[] { 'f' };
    names2unicode["G4A"] = new int[] { 'g' };
    names2unicode["G4B"] = new int[] { 'h' };
    names2unicode["G4C"] = new int[] { 'i' };
    names2unicode["G4D"] = new int[] { 'j' };
    names2unicode["G4E"] = new int[] { 'k' };
    names2unicode["G4F"] = new int[] { 'l' };
    names2unicode["G50"] = new int[] { 'm' };
    names2unicode["G51"] = new int[] { 'n' };
    names2unicode["G52"] = new int[] { 'o' };
    names2unicode["G53"] = new int[] { 'p' };
    names2unicode["G54"] = new int[] { 'q' };
    names2unicode["G55"] = new int[] { 'r' };
    names2unicode["G56"] = new int[] { 's' };
    names2unicode["G57"] = new int[] { 't' };
    names2unicode["G58"] = new int[] { 'u' };
    names2unicode["G59"] = new int[] { 'v' };
    names2unicode["G5A"] = new int[] { 'w' };
    names2unicode["G5B"] = new int[] { 'x' };
    names2unicode["G5C"] = new int[] { 'y' };
    names2unicode["G5D"] = new int[] { 'z' };

    names2unicode["G62"] = new int[] { 'Ш' };
    names2unicode["G63"] = new int[] { 'Р' };
    names2unicode["G6A"] = new int[] { 'И' };
    names2unicode["G6B"] = new int[] { 'А' };
    names2unicode["G6C"] = new int[] { 'М' };
    names2unicode["G6D"] = new int[] { 'в' };
    names2unicode["G6E"] = new int[] { 'Ф' };
    names2unicode["G70"] = new int[] { 'Е' };
    names2unicode["G72"] = new int[] { 'Б' };
    names2unicode["G73"] = new int[] { 'Н' };
    names2unicode["G76"] = new int[] { 'С' };
    names2unicode["G7A"] = new int[] { 'К' };
    names2unicode["G7B"] = new int[] { 'В' };
    names2unicode["G7C"] = new int[] { 'О' };
    names2unicode["G7D"] = new int[] { 'к' };
    names2unicode["G7E"] = new int[] { 'З' };
    names2unicode["G80"] = new int[] { 'Г' };
    names2unicode["G81"] = new int[] { 'П' };
    names2unicode["G82"] = new int[] { 'у' };
    names2unicode["G85"] = new int[] { '»' };
    names2unicode["G88"] = new int[] { 'т' };
    names2unicode["G8D"] = new int[] { '’' };
    names2unicode["G90"] = new int[] { 'У' };
    names2unicode["G91"] = new int[] { 'Т' };
    names2unicode["GA1"] = new int[] { 'Ц' };
    names2unicode["GA2"] = new int[] { '№' };
    names2unicode["GAA"] = new int[] { 'э' };
    names2unicode["GAB"] = new int[] { 'я' };
    names2unicode["GAC"] = new int[] { 'і' };
    names2unicode["GAD"] = new int[] { 'б' };
    names2unicode["GAE"] = new int[] { 'й' };
    names2unicode["GAF"] = new int[] { 'р' };
    names2unicode["GB0"] = new int[] { 'с' };
    names2unicode["GB2"] = new int[] { 'х' };
    names2unicode["GB5"] = new int[] { '“' };
    names2unicode["GB9"] = new int[] { 'п' };
    names2unicode["GBA"] = new int[] { 'о' };
    names2unicode["GBD"] = new int[] { '«' };
    names2unicode["GC1"] = new int[] { 'ф' };
    names2unicode["GC8"] = new int[] { 'а' };
    names2unicode["GCB"] = new int[] { 'е' };
    names2unicode["GCE"] = new int[] { 'ж' };
    names2unicode["GCF"] = new int[] { 'з' };
    names2unicode["GD2"] = new int[] { 'и' };
    names2unicode["GD3"] = new int[] { 'н' };
    names2unicode["GDC"] = new int[] { '–' };
    names2unicode["GE3"] = new int[] { 'л' };
}

After executing this method, you can extract the text using your own method:

InitializeGlyphs();

using (FileStream pdfStream = new FileStream(@"ENGV_1929.pdf", FileMode.Open))
{
    string result = ExtractTextFromPdf(pdfStream, true);
    File.WriteAllText(@"ENGV_1929.txt", result);
    Console.WriteLine("\n\nENGV_1929.pdf\n");
    Console.WriteLine(result);
}

Result:

From Notices to Mariners
Edition No 29/2019
(English version)
Notiсes to Mariners from Seсtion II «Сharts Сorreсtion», based on the original sourсe information, and
NAVAREA XIII, XX and XXI navigational warnings are reprinted hereunder in English. Original Notiсes to
Mariners from Seсtion I «Misсellaneous Navigational Information» and from Seсtion III «Nautiсal
Publiсations Сorreсtion» may be only briefly annotated and/or a referenсe may be made to Notiсes from
other Seсtions. Information from Seсtion IV «Сatalogues of Сharts and Nautiсal Publiсations Сorreсtion»
сonсerning the issue of сharts and publiсations is presented with details.
Digital analogue of English version of the extracts from original Russian Notices to Mariners is available
by: http://structure.mil.ru/structure/forces/hydrographic/info/notices.htm
СНАRTS СОRRЕСTIОN
Вarents Sea
 3493 Сharts 18012, 17052, 15005, 15004
Amend 1. Light to light Fl G 4s 1M at
 front leading lightbeacon 69111’32.2“N 33129’48.0“E
 2. Light to light Fl G 4s 1M at
 rear leading lightbeacon 69111’34.85“N 33129’44.25“E
 Cancel coastal warning
 MURMANSK 71/19
...

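Note that you will frequently see similar-looking Cyrillic characters used in place of Latin ones. Apparently the document was created manually by someone who did not consider typographic correctness very important.

Thus, if you want to search in the text, you should first normalize both the text and the search terms (e.g. use the same character for Latin "c" and Cyrillic "с").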
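Such a normalization could be sketched roughly as follows. This is only an illustration under assumptions: the class name HomoglyphNormalizer, its methods, and the look-alike table are made up for this example and are not part of iText, and the table is certainly incomplete. Note that it also rewrites genuine Russian text, so it is meant for comparing strings, not for producing output.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of iText: folds Cyrillic look-alikes to Latin
// so that text and search terms compare equal regardless of which was typed.
public static class HomoglyphNormalizer
{
    // Assumed, incomplete table of Cyrillic letters that look like Latin ones.
    private static readonly Dictionary<char, char> CyrillicToLatin = new Dictionary<char, char>
    {
        { '\u0410', 'A' }, { '\u0412', 'B' }, { '\u0415', 'E' }, { '\u041A', 'K' },
        { '\u041C', 'M' }, { '\u041D', 'H' }, { '\u041E', 'O' }, { '\u0420', 'P' },
        { '\u0421', 'C' }, { '\u0422', 'T' }, { '\u0425', 'X' },
        { '\u0430', 'a' }, { '\u0435', 'e' }, { '\u043E', 'o' }, { '\u0440', 'p' },
        { '\u0441', 'c' }, { '\u0443', 'y' }, { '\u0445', 'x' }, { '\u0456', 'i' }
    };

    // Replaces every character found in the table; all others pass through unchanged.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char ch in text)
        {
            sb.Append(CyrillicToLatin.TryGetValue(ch, out char latin) ? latin : ch);
        }
        return sb.ToString();
    }

    // Case-insensitive search after normalizing both the text and the term.
    public static bool ContainsNormalized(string text, string term)
    {
        return Normalize(text).IndexOf(Normalize(term), StringComparison.OrdinalIgnoreCase) >= 0;
    }
}
```

With this, HomoglyphNormalizer.ContainsNormalized(result, "Charts Correction") should find the heading in the extracted text even though it was typed with Cyrillic "С" there.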

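【Discussion】:

One issue I have found is that they reuse the same code for different characters. The main one I am looking at is G14, which they have used for the character 1 as well as for the degree symbol. This shows up in your result: 69111’32.2“N. The first 1 should be °. Is there any way around this?
Hm, I indeed had not spotted that. So the mapping may differ from font to font... Well, you could change the names2unicode map each time a different font is used. To do so, you could register an extended version of the SetTextFont content operator which, depending on the font in question, also changes the names2unicode map. Alternatively, you could consider building ToUnicode maps for the fonts and adding them to the fonts before extracting the text with the standard classes. Either way, the complexity of the solution increases... ;)
Thanks again. I thought there must be a way to do it, but as you say, the complexity increases. For now, I use a Regex to recognize the coordinates and then replace the 1 with the degree character. It is working well so far.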
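The exact Regex used is not shown in the thread. A minimal sketch of the idea could look like the following, assuming coordinates always have the shape seen in the result above: degrees, a misread "1", two-digit minutes, ’ (U+2019), seconds, “ (U+201C), and a hemisphere letter. The class name CoordinateFixer and the pattern itself are assumptions made for this example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical post-processing step: restores the degree sign that was
// extracted as "1" inside coordinates such as 69111’32.2“N.
public static class CoordinateFixer
{
    // Group 1: degrees (1 to 3 digits, not preceded by another digit).
    // Literal "1": the misread degree sign.
    // Group 2: two-digit minutes, ’, seconds with optional fraction, “, hemisphere.
    private static readonly Regex Coordinate = new Regex(
        @"(?<!\d)(\d{1,3})1(\d{2}\u2019\d{1,2}(?:\.\d+)?\u201C[NSEW])");

    public static string FixDegrees(string text)
    {
        // Keep both groups and put a real degree sign (U+00B0) between them.
        return Coordinate.Replace(text, "${1}\u00B0${2}");
    }
}
```

Applied to the line from the result, CoordinateFixer.FixDegrees turns 69111’32.2“N 33129’48.0“E into 69°11’32.2“N 33°29’48.0“E. Text that does not match the assumed coordinate shape is left untouched, so other occurrences of G14 that really are the digit 1 stay as they are.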