【Question title】:iTextSharp exception "Stack empty" when getting text from a PDF page
【Posted】:2017-06-18 04:03:12
【Question】:

I am trying to loop through every page of a PDF to search for specific keywords. The code works fine on other PDFs, with one exception.

My code:

Using oReader As New pdf.PdfReader(pdfFilename)

    For pCurrent = oReader.NumberOfPages To 1 Step -1
        Dim strategy As pdf.parser.ITextExtractionStrategy = New pdf.parser.SimpleTextExtractionStrategy()
        Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

        '
        'search for keywords
        '
        'FindVOI

    Next 'proceed next page

End Using

This is the line that throws the exception:

Dim pageText As String = pdf.parser.PdfTextExtractor.GetTextFromPage(oReader, pCurrent, strategy)

On page 98 of this PDF it throws the exception "Stack empty". What is wrong?

Full exception:

Exception thrown: 'System.InvalidOperationException' in System.dll
System.Transactions Critical: 0 : <TraceRecord xmlns="http://schemas.microsoft.com/2004/10/E2ETraceEvent/TraceRecord" Severity="Critical"><TraceIdentifier>http://msdn.microsoft.com/TraceCodes/System/ActivityTracing/2004/07/Reliability/Exception/Unhandled</TraceIdentifier><Description>Unhandled exception</Description><AppDomain>VipMonitorService.vshost.exe</AppDomain><Exception><ExceptionType>System.InvalidOperationException, mscorlib, Version=4.0.0.0, Culture=neutral, PublicKeyToken=b77a5c561934e089</ExceptionType><Message>Stack empty.</Message><StackTrace>   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</StackTrace><ExceptionString>System.InvalidOperationException: Stack empty.
   at System.ThrowHelper.ThrowInvalidOperationException(ExceptionResource resource)
   at System.Collections.Generic.Stack`1.Pop()
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener, IDictionary`2 additionalContentOperators)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at WatcherApp.VipMonitorService.PDFHelper.FindVOI(List`1 voiList, String pdfFilename, Boolean searchFromLast, Int32 searchNumberOfPagesInPercent) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\PDFHelper.vb:line 59
   at WatcherApp.VipMonitorService.Controller.ProcessAnnualReport(Announcement a) in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 456
   at WatcherApp.VipMonitorService.Controller.ProcessARInQueueThread() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 362
   at WatcherApp.VipMonitorService.Controller._Lambda$__40-0() in \\Mac\Dropbox\git\Personal\WatcherApp\VipMonitorService\Controller.vb:line 339
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state, Boolean preserveSyncCtx)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ThreadHelper.ThreadStart()</ExceptionString></Exception></TraceRecord>

【Comments】:

  • The stack trace seems to indicate that you have an end-marked-content instruction without a matching begin-marked-content instruction. I'll look at the PDF later.

Tags: .net pdf itext pdftotext


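【Answer 1】:

On page 98 of this PDF it throws the exception "Stack empty". Any ideas?

The stack trace shows that "Stack empty" is thrown in iTextSharp.text.pdf.parser.PdfContentStreamProcessor.EndMarkedContentC.Invoke. So we should look at the operators that begin and end marked content: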

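tag BMC: Begin a marked-content sequence terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence.

tag properties BDC: Begin a marked-content sequence with an associated property list, terminated by a balancing EMC operator. tag shall be a name object indicating the role or significance of the sequence. properties shall be either an inline dictionary containing the property list or a name object associated with it in the Properties subdictionary of the current resource dictionary (see 14.6.2, "Property Lists").

EMC: End a marked-content sequence begun by a BMC or BDC operator.

(Table 320, "Marked-content operators", ISO 32000-1)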

If you look at where marked content begins (BDC/BMC) and ends (EMC) on the page in question, you see:

/Artifact <</O /Layout >>BDC
EMC 
/Artifact <</O /Layout >>BDC  
EMC  
/Artifact <</O /Layout >>BDC  
EMC 
/Artifact <</BBox [0 33.8887 407.4289 0 ]/O /Layout >>BDC  
EMC 
EMC
...

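So there is one surplus EMC operator with no BMC or BDC operator whose marked content it could end.

Therefore, this document is not a valid PDF; in particular, its marked-content structure is broken.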
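The imbalance can also be confirmed mechanically. The following is a minimal Python sketch, not iTextSharp code. It assumes the content stream has already been decoded to text and that splitting on whitespace is enough to isolate the operators, which holds for the excerpt above but not for arbitrary content streams (strings or inline images could contain the same character sequences). The names `find_surplus_emc` and `SAMPLE` are made up for illustration.

```python
# The excerpt from the page's content stream quoted above (without the trailing "...").
SAMPLE = """
/Artifact <</O /Layout >>BDC
EMC
/Artifact <</O /Layout >>BDC
EMC
/Artifact <</O /Layout >>BDC
EMC
/Artifact <</BBox [0 33.8887 407.4289 0 ]/O /Layout >>BDC
EMC
EMC
"""


def find_surplus_emc(content):
    """Return the 0-based token indexes of EMC operators with no open BMC/BDC."""
    depth = 0      # number of currently open marked-content sequences
    surplus = []
    for index, token in enumerate(content.split()):
        # An operator may be glued to the closing ">>" of an inline dictionary.
        operator = token.rsplit(">>", 1)[-1]
        if operator in ("BMC", "BDC"):
            depth += 1
        elif operator == "EMC":
            if depth == 0:
                surplus.append(index)  # nothing left to close
            else:
                depth -= 1
    return surplus


print(find_surplus_emc(SAMPLE))
```

For the excerpt, the function reports exactly one surplus EMC, the last one.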


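That being said, it would be appropriate for iTextSharp to check the stack before calling Pop and then either throw a more telling exception or, optionally, ignore the EMC operator.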
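The kind of guard meant here can be sketched as follows. This is a toy Python model under stated assumptions, not iTextSharp's actual implementation: the class `MarkedContentStack`, its `lenient` flag, and the error message are all hypothetical. It only illustrates checking for an empty stack before popping.

```python
class MarkedContentStack:
    """Toy model of a processor's marked-content stack (hypothetical, not iTextSharp)."""

    def __init__(self, lenient=False):
        self.lenient = lenient  # True: ignore surplus EMC; False: raise a clear error
        self._stack = []

    def begin(self, tag):
        # Called for BMC/BDC: remember the tag of the sequence being opened.
        self._stack.append(tag)

    def end(self):
        # Called for EMC: check the stack first instead of popping blindly.
        if not self._stack:
            if self.lenient:
                return None  # silently skip the unmatched EMC
            raise ValueError("EMC operator without matching BMC/BDC operator")
        return self._stack.pop()


stack = MarkedContentStack(lenient=True)
stack.begin("/Artifact")
print(stack.end())  # the tag that was opened
print(stack.end())  # surplus EMC, ignored in lenient mode
```

In strict mode the same surplus EMC raises an error that names the actual problem rather than a generic "Stack empty".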

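【Comments】:

  • I think I may be running into a similar problem. How did you view the marked content in that PDF?
  • I used a browser for the internal structure of PDFs, such as iText RUPS or PDFBox PDFDebugger. Adobe Acrobat Pro Preflight also contains such a tool. Any of them can be used to inspect page content streams.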