How to extract highlighted text from a PDF using iTextSharp?

【Question title】: How to extract highlighted text from PDF using iTextSharp?
【Posted】: 2014-12-26 11:01:44
【Question description】:

According to the following post: "iTextSharp PDF Reading highlighed text (highlight annotations) using C#"

this code:

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);
    if (annots!=null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

extracts the PDF annotations. But why does the same code, shown below, not work for highlights (specifically, PdfName.HIGHLIGHT does not work):

for (int i = pageFrom; i <= pageTo; i++) {
    PdfDictionary page = reader.GetPageN(i);
    PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);
    if (annots!=null) {
        foreach (PdfObject annot in annots.ArrayList) {
            PdfDictionary annotation = (PdfDictionary)PdfReader.GetPdfObject(annot);
            PdfString contents = annotation.GetAsString(PdfName.CONTENTS);
            // now use the String value of contents
        }
    }
}

【Question discussion】:

Tags: .net pdf itextsharp

【Solution 1】:

Here is a complete example of extracting highlighted text using iTextSharp:

    // Requires: using System; using System.IO;
    //           using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    public void GetRectAnno()
    {
        string appRootDir = new DirectoryInfo(Environment.CurrentDirectory).Parent.Parent.FullName;
        string filePath = appRootDir + "/PDFs/" + "anot.pdf";

        try
        {
            using (PdfReader reader = new PdfReader(filePath))
            {
                // Page numbers in iTextSharp are 1-based
                for (int i = 1; i <= reader.NumberOfPages; i++)
                {
                    PdfDictionary page = reader.GetPageN(i);
                    PdfArray annots = page.GetAsArray(PdfName.ANNOTS);
                    if (annots == null)
                        continue; // this page has no annotations

                    foreach (PdfObject annot in annots.ArrayList)
                    {
                        // Resolve the (usually indirect) reference to the annotation dictionary
                        PdfDictionary annotationDic = (PdfDictionary)PdfReader.GetPdfObject(annot);
                        PdfName subType = annotationDic.GetAsName(PdfName.SUBTYPE);

                        // Only process annotations whose subtype is Highlight
                        if (!PdfName.HIGHLIGHT.Equals(subType))
                            continue;

                        // Print the Rect and QuadPoints of the highlight
                        PdfArray coordinates = annotationDic.GetAsArray(PdfName.RECT);
                        Console.WriteLine("Highlight at Rectangle {0} with QuadPoints {1}",
                            coordinates, annotationDic.GetAsArray(PdfName.QUADPOINTS));

                        // Rect is [llx lly urx ury]; read the numbers directly instead of parsing strings
                        Rectangle rect = new Rectangle(
                            coordinates.GetAsNumber(0).FloatValue,
                            coordinates.GetAsNumber(1).FloatValue,
                            coordinates.GetAsNumber(2).FloatValue,
                            coordinates.GetAsNumber(3).FloatValue);

                        // Extract only the text that falls inside the rectangle
                        RenderFilter[] filter = { new RegionTextRenderFilter(rect) };
                        ITextExtractionStrategy strategy =
                            new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);

                        // Show the extracted text on the console
                        Console.WriteLine(PdfTextExtractor.GetTextFromPage(reader, i, strategy));
                    }
                }
            }
        }
        catch (Exception ex)
        {
            // Report the failure instead of silently swallowing it
            Console.Error.WriteLine("Failed to extract highlights: " + ex.Message);
        }
    }

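【Discussion】:

If a multi-line highlight starts or ends in the middle of a line, you will extract too much. Consider examining QuadPoints instead of Rect. For example, this question discusses that situation, albeit for a different library, and this answer discusses the details.

【Solution 2】:

Please take a look at Table 30 in ISO-32000-1 (aka the PDF reference). It is titled "Entries in a page object". Among these entries, you will find a key named Annots. Its value is: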

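(Optional) An array of annotation dictionaries that shall contain indirect references to all annotations associated with the page (see 12.5, "Annotations").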

You will not find an entry with a key such as Highlight, so it is only normal that the array returned is null when you have this line:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.HIGHLIGHT);

You need to get the annotations the way you already did:

PdfArray annots = page.GetAsArray(iTextSharp.text.pdf.PdfName.ANNOTS);

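Now you need to loop over this array and look for annotations whose Subtype equals Highlight. This type of annotation is listed in Table 169 of ISO-32000-1, titled "Annotation types".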
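The loop described above might be sketched as follows. This is only an illustration, assuming iTextSharp 5.x; the class name HighlightFilter and the method name GetHighlights are hypothetical and do not come from the library.

```csharp
using System.Collections.Generic;
using iTextSharp.text.pdf;

// Hypothetical helper (not part of iTextSharp): keeps only Highlight annotations.
public static class HighlightFilter
{
    // Returns the annotation dictionaries in 'annots' whose Subtype is Highlight.
    // 'annots' is the value of the page's Annots entry and may be null.
    public static List<PdfDictionary> GetHighlights(PdfArray annots)
    {
        var highlights = new List<PdfDictionary>();
        if (annots == null)
            return highlights; // the page has no Annots entry

        for (int i = 0; i < annots.Size; i++)
        {
            // GetAsDict resolves indirect references and returns null for non-dictionaries
            PdfDictionary annotation = annots.GetAsDict(i);
            if (annotation == null)
                continue;

            // Compare from the constant side so a missing Subtype is simply skipped
            if (PdfName.HIGHLIGHT.Equals(annotation.GetAsName(PdfName.SUBTYPE)))
                highlights.Add(annotation);
        }
        return highlights;
    }
}
```

It could be called as HighlightFilter.GetHighlights(reader.GetPageN(i).GetAsArray(PdfName.ANNOTS)) for each page number i.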

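In other words, your assumption that the page dictionary contains an entry with the key Highlight is wrong, and if you read the whole specification, you will discover another false assumption you have been making. You are wrongly assuming that the highlighted text is stored in the Contents entry of the annotation. This reveals a lack of understanding of the nature of annotations versus page content.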

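The text you are looking for is stored in the content stream of the page. The content stream of a page is independent of the page's annotations. Hence, to get the highlighted text, you need to get the coordinates stored in the Highlight annotation (in its QuadPoints array), and you need to use these coordinates to parse the text that is present in the page content at those coordinates.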
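As a rough sketch of the coordinate step: QuadPoints holds 8 numbers per quadrilateral (four x/y pairs), typically one quadrilateral per highlighted line. The helper below is plain C# with no iTextSharp dependency; the names QuadPointsHelper and ToBoundingBoxes are made up for illustration. It reduces each quadrilateral to an axis-aligned box, which is an assumption that only holds well for unrotated text.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: turns a flat QuadPoints array into one box per quadrilateral.
public static class QuadPointsHelper
{
    // Each returned array is { llx, lly, urx, ury } for one quadrilateral.
    public static List<float[]> ToBoundingBoxes(float[] quadPoints)
    {
        if (quadPoints == null)
            throw new ArgumentNullException("quadPoints");
        if (quadPoints.Length % 8 != 0)
            throw new ArgumentException("QuadPoints length must be a multiple of 8.");

        var boxes = new List<float[]>();
        for (int q = 0; q < quadPoints.Length; q += 8)
        {
            float minX = float.MaxValue, minY = float.MaxValue;
            float maxX = float.MinValue, maxY = float.MinValue;

            // Walk the four corner points; their order does not matter for min/max
            for (int p = 0; p < 8; p += 2)
            {
                float x = quadPoints[q + p];
                float y = quadPoints[q + p + 1];
                minX = Math.Min(minX, x);
                maxX = Math.Max(maxX, x);
                minY = Math.Min(minY, y);
                maxY = Math.Max(maxY, y);
            }
            boxes.Add(new[] { minX, minY, maxX, maxY });
        }
        return boxes;
    }
}
```

With iTextSharp 5.x, the input numbers could presumably be read from the annotation via GetAsArray(PdfName.QUADPOINTS) and GetAsNumber(k).FloatValue, and each resulting box passed as new Rectangle(llx, lly, urx, ury) to a RegionTextRenderFilter, in the same way Solution 1 does with Rect. Text chunks that straddle a box edge may still be included in full, so the result is an approximation.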

【Discussion】: