【Question title】: Reading a PDF remotely using urllib2
【Posted】: 2017-06-11 10:23:02
【Question】:

I am trying to extract text from a remote PDF.

The URL is http://loc.gov/aba/publications/FreeLCC/A-text.pdf

My code is as follows:

import urllib2
import PyPDF2
import io

URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
remote_file = urllib2.urlopen(URL).read()
memory_file = io.BytesIO(remote_file)

read_pdf = PyPDF2.PdfFileReader(memory_file)
number_of_pages = read_pdf.getNumPages()

for i in range(0, number_of_pages):
    pageObj = read_pdf.getPage(i)
    page = pageObj.extractText()
    print (page)

I am getting an HTTP 403 error. What am I doing wrong?

【Comments】:

    Tags: python python-2.7 urllib2 http-status-code-403 pypdf2


    【Solution 1】:

    Source

    import urllib2
    import PyPDF2
    import io
    
    URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
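    # Send an explicit User-Agent header; the server appears to answer urllib2's default agent with 403.
    req = urllib2.Request(URL, headers={'User-Agent': "Magic Browser"})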
    remote_file = urllib2.urlopen(req).read()
    memory_file = io.BytesIO(remote_file)
    
    read_pdf = PyPDF2.PdfFileReader(memory_file)
    number_of_pages = read_pdf.getNumPages()
    
    for i in range(0, number_of_pages):
        pageObj = read_pdf.getPage(i)
        page = pageObj.extractText()
        print (page)
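
For readers on Python 3, where `urllib2` no longer exists, the same idea can be sketched with `urllib.request`. This is a minimal sketch under assumptions, not part of the original answer: the helper names `build_request` and `fetch_to_memory` and the `timeout` value are made up for illustration, and the `User-Agent` string is simply the one used above. It also surfaces the HTTP status code if the server still refuses the request.

```python
# Python 3 sketch (assumption: the code above targets Python 2.7 and urllib2).
import io
import urllib.error
import urllib.request

URL = 'http://loc.gov/aba/publications/FreeLCC/A-outline.pdf'
# Same User-Agent value as in the answer; any string the server accepts should do.
USER_AGENT = 'Magic Browser'


def build_request(url, user_agent=USER_AGENT):
    """Return a Request that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={'User-Agent': user_agent})


def fetch_to_memory(url, timeout=30):
    """Download url and return its bytes wrapped in a seekable BytesIO."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
            return io.BytesIO(resp.read())
    except urllib.error.HTTPError as err:
        # err.code holds the HTTP status, e.g. 403 when the server refuses the request.
        raise RuntimeError('HTTP %d while fetching %s' % (err.code, url)) from err
```

Calling `memory_file = fetch_to_memory(URL)` should then yield an in-memory file object that can be handed to the PDF reader in the same way as `memory_file` in the answer above.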
    

    【Comments】:
