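The issue here is that the glyphs in the embedded font program have non-standard glyph names (G00, G01, ...) and are identified only by those names. You therefore have to establish a mapping from these glyph names to Unicode characters yourself. One way to do so is to inspect the font program in the PDF (e.g. using FontForge) and visually recognize each named glyph.
(The glyph table of the font in question has some gaps: glyphs that are not used in the document at hand are missing. Some of the missing glyphs can be guessed, others cannot.)
Then you have to inject these mappings into iText. As the mappings are hidden (they are a private static member of the GlyphList class), you either have to patch iText itself or use reflection: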
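void InitializeGlyphs()
{
// Requires: using System.Collections.Generic; using System.Reflection;
// names2unicode is a private static field of GlyphList, so it is fetched via reflection.
FieldInfo names2unicodeField = typeof(GlyphList).GetField("names2unicode", BindingFlags.NonPublic | BindingFlags.Static);
Dictionary<string, int[]> names2unicode = (Dictionary<string, int[]>) names2unicodeField.GetValue(null);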
names2unicode["G03"] = new int[] { ' ' };
names2unicode["G0A"] = new int[] { '\'' };
names2unicode["G0B"] = new int[] { '(' };
names2unicode["G0C"] = new int[] { ')' };
names2unicode["G0F"] = new int[] { ',' };
names2unicode["G10"] = new int[] { '-' };
names2unicode["G11"] = new int[] { '.' };
names2unicode["G12"] = new int[] { '/' };
names2unicode["G13"] = new int[] { '0' };
names2unicode["G14"] = new int[] { '1' };
names2unicode["G15"] = new int[] { '2' };
names2unicode["G16"] = new int[] { '3' };
names2unicode["G17"] = new int[] { '4' };
names2unicode["G18"] = new int[] { '5' };
names2unicode["G19"] = new int[] { '6' };
names2unicode["G1A"] = new int[] { '7' };
names2unicode["G1B"] = new int[] { '8' };
names2unicode["G1C"] = new int[] { '9' };
names2unicode["G1D"] = new int[] { ':' };
names2unicode["G23"] = new int[] { '@' };
names2unicode["G24"] = new int[] { 'A' };
names2unicode["G25"] = new int[] { 'B' };
names2unicode["G26"] = new int[] { 'C' };
names2unicode["G27"] = new int[] { 'D' };
names2unicode["G28"] = new int[] { 'E' };
names2unicode["G29"] = new int[] { 'F' };
names2unicode["G2A"] = new int[] { 'G' };
names2unicode["G2B"] = new int[] { 'H' };
names2unicode["G2C"] = new int[] { 'I' };
names2unicode["G2D"] = new int[] { 'J' };
names2unicode["G2E"] = new int[] { 'K' };
names2unicode["G2F"] = new int[] { 'L' };
names2unicode["G30"] = new int[] { 'M' };
names2unicode["G31"] = new int[] { 'N' };
names2unicode["G32"] = new int[] { 'O' };
names2unicode["G33"] = new int[] { 'P' };
names2unicode["G34"] = new int[] { 'Q' };
names2unicode["G35"] = new int[] { 'R' };
names2unicode["G36"] = new int[] { 'S' };
names2unicode["G37"] = new int[] { 'T' };
names2unicode["G38"] = new int[] { 'U' };
names2unicode["G39"] = new int[] { 'V' };
names2unicode["G3A"] = new int[] { 'W' };
names2unicode["G3B"] = new int[] { 'X' };
names2unicode["G3C"] = new int[] { 'Y' };
names2unicode["G3D"] = new int[] { 'Z' };
names2unicode["G42"] = new int[] { '_' };
names2unicode["G44"] = new int[] { 'a' };
names2unicode["G45"] = new int[] { 'b' };
names2unicode["G46"] = new int[] { 'c' };
names2unicode["G46._"] = new int[] { 'c' };
names2unicode["G47"] = new int[] { 'd' };
names2unicode["G48"] = new int[] { 'e' };
names2unicode["G49"] = new int[] { 'f' };
names2unicode["G4A"] = new int[] { 'g' };
names2unicode["G4B"] = new int[] { 'h' };
names2unicode["G4C"] = new int[] { 'i' };
names2unicode["G4D"] = new int[] { 'j' };
names2unicode["G4E"] = new int[] { 'k' };
names2unicode["G4F"] = new int[] { 'l' };
names2unicode["G50"] = new int[] { 'm' };
names2unicode["G51"] = new int[] { 'n' };
names2unicode["G52"] = new int[] { 'o' };
names2unicode["G53"] = new int[] { 'p' };
names2unicode["G54"] = new int[] { 'q' };
names2unicode["G55"] = new int[] { 'r' };
names2unicode["G56"] = new int[] { 's' };
names2unicode["G57"] = new int[] { 't' };
names2unicode["G58"] = new int[] { 'u' };
names2unicode["G59"] = new int[] { 'v' };
names2unicode["G5A"] = new int[] { 'w' };
names2unicode["G5B"] = new int[] { 'x' };
names2unicode["G5C"] = new int[] { 'y' };
names2unicode["G5D"] = new int[] { 'z' };
names2unicode["G62"] = new int[] { 'Ш' };
names2unicode["G63"] = new int[] { 'Р' };
names2unicode["G6A"] = new int[] { 'И' };
names2unicode["G6B"] = new int[] { 'А' };
names2unicode["G6C"] = new int[] { 'М' };
names2unicode["G6D"] = new int[] { 'в' };
names2unicode["G6E"] = new int[] { 'Ф' };
names2unicode["G70"] = new int[] { 'Е' };
names2unicode["G72"] = new int[] { 'Б' };
names2unicode["G73"] = new int[] { 'Н' };
names2unicode["G76"] = new int[] { 'С' };
names2unicode["G7A"] = new int[] { 'К' };
names2unicode["G7B"] = new int[] { 'В' };
names2unicode["G7C"] = new int[] { 'О' };
names2unicode["G7D"] = new int[] { 'к' };
names2unicode["G7E"] = new int[] { 'З' };
names2unicode["G80"] = new int[] { 'Г' };
names2unicode["G81"] = new int[] { 'П' };
names2unicode["G82"] = new int[] { 'у' };
names2unicode["G85"] = new int[] { '»' };
names2unicode["G88"] = new int[] { 'т' };
names2unicode["G8D"] = new int[] { '’' };
names2unicode["G90"] = new int[] { 'У' };
names2unicode["G91"] = new int[] { 'Т' };
names2unicode["GA1"] = new int[] { 'Ц' };
names2unicode["GA2"] = new int[] { '№' };
names2unicode["GAA"] = new int[] { 'э' };
names2unicode["GAB"] = new int[] { 'я' };
names2unicode["GAC"] = new int[] { 'і' };
names2unicode["GAD"] = new int[] { 'б' };
names2unicode["GAE"] = new int[] { 'й' };
names2unicode["GAF"] = new int[] { 'р' };
names2unicode["GB0"] = new int[] { 'с' };
names2unicode["GB2"] = new int[] { 'х' };
names2unicode["GB5"] = new int[] { '“' };
names2unicode["GB9"] = new int[] { 'п' };
names2unicode["GBA"] = new int[] { 'о' };
names2unicode["GBD"] = new int[] { '«' };
names2unicode["GC1"] = new int[] { 'ф' };
names2unicode["GC8"] = new int[] { 'а' };
names2unicode["GCB"] = new int[] { 'е' };
names2unicode["GCE"] = new int[] { 'ж' };
names2unicode["GCF"] = new int[] { 'з' };
names2unicode["GD2"] = new int[] { 'и' };
names2unicode["GD3"] = new int[] { 'н' };
names2unicode["GDC"] = new int[] { '–' };
names2unicode["GE3"] = new int[] { 'л' };
}
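The Latin entries in the table above follow a regular pattern: the two hex digits in the glyph name plus 0x1D give the ASCII code (for example G13 is '0' and G24 is 'A'). The Cyrillic entries do not follow it. As a minimal sketch, assuming the pattern also holds for every glyph inside the ranges you pass in, those entries could be generated instead of typed out. The class and method names here are hypothetical and not part of iText:

```csharp
using System;
using System.Collections.Generic;

static class GlyphNameSketch
{
    // Offset read off the table above (assumption: it holds for the whole range requested).
    const int Offset = 0x1D;

    // Builds entries "Gxx" -> { xx + 0x1D } for all glyph numbers in [firstGlyph, lastGlyph].
    public static Dictionary<string, int[]> BuildLatinMappings(int firstGlyph, int lastGlyph)
    {
        var result = new Dictionary<string, int[]>();
        for (int glyph = firstGlyph; glyph <= lastGlyph; glyph++)
        {
            // Glyph names use two upper-case hex digits, e.g. 0x1C -> "G1C".
            string name = "G" + glyph.ToString("X2");
            result[name] = new int[] { glyph + Offset };
        }
        return result;
    }
}
```

For instance, BuildLatinMappings(0x13, 0x1C) covers the digits, BuildLatinMappings(0x24, 0x3D) the upper-case letters, and BuildLatinMappings(0x44, 0x5D) the lower-case letters. The resulting entries would then be copied into names2unicode, and the irregular Cyrillic and punctuation entries still have to be listed by hand.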
After executing that method, you can extract the text using your own method:
InitializeGlyphs();
using (FileStream pdfStream = new FileStream(@"ENGV_1929.pdf", FileMode.Open))
{
string result = ExtractTextFromPdf(pdfStream, true);
File.WriteAllText(@"ENGV_1929.txt", result);
Console.WriteLine("\n\nENGV_1929.pdf\n");
Console.WriteLine(result);
}
The result:
From Notices to Mariners
Edition No 29/2019
(English version)
Notiсes to Mariners from Seсtion II «Сharts Сorreсtion», based on the original sourсe information, and
NAVAREA XIII, XX and XXI navigational warnings are reprinted hereunder in English. Original Notiсes to
Mariners from Seсtion I «Misсellaneous Navigational Information» and from Seсtion III «Nautiсal
Publiсations Сorreсtion» may be only briefly annotated and/or a referenсe may be made to Notiсes from
other Seсtions. Information from Seсtion IV «Сatalogues of Сharts and Nautiсal Publiсations Сorreсtion»
сonсerning the issue of сharts and publiсations is presented with details.
Digital analogue of English version of the extracts from original Russian Notices to Mariners is available
by: http://structure.mil.ru/structure/forces/hydrographic/info/notices.htm
СНАRTS СОRRЕСTIОN
Вarents Sea
3493 Сharts 18012, 17052, 15005, 15004
Amend 1. Light to light Fl G 4s 1M at
front leading lightbeacon 69111’32.2“N 33129’48.0“E
2. Light to light Fl G 4s 1M at
rear leading lightbeacon 69111’34.85“N 33129’44.25“E
Cancel coastal warning
MURMANSK 71/19
...
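Beware: quite often you will see similar-looking Cyrillic characters used in place of Latin ones. Apparently the document was created manually by someone who did not consider typographic correctness very important.
Thus, if you want to search in the text, you should first normalize both the text and your search terms (e.g. use the same character for Latin "c" and Cyrillic "с").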
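Such a normalization could look roughly like the following sketch. It folds a hand-picked set of Cyrillic look-alikes onto their Latin counterparts. The class name, the method name and the choice of characters are assumptions for illustration, and the table is not exhaustive:

```csharp
using System;
using System.Collections.Generic;
using System.Text;

static class HomoglyphSketch
{
    // Cyrillic letters that look like Latin ones (illustrative, incomplete selection).
    static readonly Dictionary<char, char> CyrillicToLatin = new Dictionary<char, char>
    {
        { '\u0410', 'A' }, { '\u0412', 'B' }, { '\u0415', 'E' }, { '\u041A', 'K' },
        { '\u041C', 'M' }, { '\u041D', 'H' }, { '\u041E', 'O' }, { '\u0420', 'P' },
        { '\u0421', 'C' }, { '\u0422', 'T' }, { '\u0425', 'X' },
        { '\u0430', 'a' }, { '\u0435', 'e' }, { '\u043E', 'o' }, { '\u0440', 'p' },
        { '\u0441', 'c' }, { '\u0443', 'y' }, { '\u0445', 'x' }, { '\u0456', 'i' }
    };

    // Replaces every mapped Cyrillic character; all other characters pass through unchanged.
    public static string Normalize(string text)
    {
        var sb = new StringBuilder(text.Length);
        foreach (char c in text)
        {
            char latin;
            sb.Append(CyrillicToLatin.TryGetValue(c, out latin) ? latin : c);
        }
        return sb.ToString();
    }
}
```

Apply Normalize to both the extracted text and the search term before comparing them, for example with IndexOf and StringComparison.Ordinal. Note that this also rewrites letters in genuinely Russian words, so the normalized string is suitable for searching only, not for display.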