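【Question title】: PDF text mining using Python
【Posted】: 2019-11-11 04:27:55
【Question description】:

I have multiple PDFs, each with multiple pages. I only want to extract the required information from the full text. I have managed to read the text and put it into a list, but I cannot find a way to extract the strings I need. Here is the code I have written so far: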

import PyPDF2
import io
import re
import pandas as pd

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
mypdf = open('C:/XXXX/XXXXX/Desktop/7-29-19 Office Availabilities 1.pdf', mode='rb')
pdf_document = PyPDF2.PdfFileReader(mypdf)

entry=[]
for page in PDFPage.get_pages(mypdf, 
                              caching=True,
                              check_extractable=True):
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle)
    page_interpreter = PDFPageInterpreter(resource_manager, converter)
    page_interpreter.process_page(page)
    text = fake_file_handle.getvalue()

    entry.append(text)
    # close open handles
    converter.close()
    fake_file_handle.close()


Flyer= [x.split() for x in entry if x.startswith('FL')]
print(Flyer)

This is the output I get so far:

["FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397PHotard RealtyCall to negotiate renovation needs. Located in the heart of downtown Sanford's historic district, over 5,000 square feet ofoffice space on the second floor above bustling First Street. Historical building with great potential.2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF Class B Loft/Creative Space Building  Renovated in 1997 Built in 1887Brenner Real Estate E.Charles E. Brenner (407) 677-1700Building Notes:-366Negotiable2nd / Suite 214VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate198Negotiable2nd / Suite 234VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,276Negotiable2nd / Suite 240VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate743Negotiable2nd / Suite 242VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,357Negotiable2nd / Suite 246VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate720Negotiable2nd / Suite 250VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real EstateCopyrighted report licensed to CBRE - 759852.7/29/2019Page 1\x0c",
 'FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible3) 734 N 3rd St - Leesburg, FL 347483rd Street Office Park-21,472 SF Class B Office Building  Built in 1974Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:City of Leesburg services and utilities including high speed DSL, fiber optic internet and phone systems all pre-wired and marked from central control room.ADA bathrooms & private executive and managerial offices.Security and central fire alarm system installed.1,564Negotiable1st / Suite Space 1VacantOffice/D$14.00/mgN29 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC4) 900 N 14th St - Leesburg, FL 34748Trophy Leesburg Offices-40,302 SF Class B Office Building  Built in 1981Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:Building has 24 hour access.Join GSA and the Social Security Administration in this great Leesburg Office.  Current layout is perfect for high density office user.  Small modifications can bemade to accommodate anything from a small college to a large medical tenant.Great Central Florida location along busy US 27 and not far from The Villages.5,150Negotiable2ndVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC2nd Floor maybe divided into smaller units11,248Negotiable3rdVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC3rd Floor has 3 units that maybe divided or combined.Copyrighted report licensed to CBRE - 759852.7/29/2019Page 2\x0c',

The desired output is:

['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible',
'FL 32771', '104-112 E 1st St - Sanford',   '40,000 SF',    'C',    'P 2nd',    '7000', '$7.80/mg', '1 yr', 'Vacant',   'Office/N', 'Hotard Realty',    'Marie Hotard (407) 467-5397',  '15 Mths',  'N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd/Suite 214',  '366',  '$22.00/fs ',   'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 234 ',   '198',  '$22.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 240',    '1276', '$16.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 242',    '743',  '$18.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 246',    '1357', '$16.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N',
'FL 32771', '110 W 1st St - Sanford',   '25,797 SF',    'B',    'P 2nd / Suite 250',    '720',  '$18.00/fs',    'Negotiable',   ' Vacant ', 'Office/D ',    'Brenner Real Estate',  'Charles E. Brenner (407) 677', '4 Wks',    ' N']

Please help!

【Comments】:

    Tags: python-3.x ocr


    【Solution 1】:
    list = ["FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397PHotard RealtyCall to negotiate renovation needs. Located in the heart of downtown Sanford's historic district, over 5,000 square feet ofoffice space on the second floor above bustling First Street. Historical building with great potential.2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF Class B Loft/Creative Space Building  Renovated in 1997 Built in 1887Brenner Real Estate E.Charles E. Brenner (407) 677-1700Building Notes:-366Negotiable2nd / Suite 214VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate198Negotiable2nd / Suite 234VacantOffice/D$22.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,276Negotiable2nd / Suite 240VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate743Negotiable2nd / Suite 242VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate1,357Negotiable2nd / Suite 246VacantOffice/D$16.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real Estate720Negotiable2nd / Suite 250VacantOffice/D$18.00/fsN4 WksCharles E. Brenner (407) 677-1700PBrenner Real EstateCopyrighted report licensed to CBRE - 759852.7/29/2019Page 1\x0c",
    'FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/TypeContactListedDivisible3) 734 N 3rd St - Leesburg, FL 347483rd Street Office Park-21,472 SF Class B Office Building  Built in 1974Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:City of Leesburg services and utilities including high speed DSL, fiber optic internet and phone systems all pre-wired and marked from central control room.ADA bathrooms & private executive and managerial offices.Security and central fire alarm system installed.1,564Negotiable1st / Suite Space 1VacantOffice/D$14.00/mgN29 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC4) 900 N 14th St - Leesburg, FL 34748Trophy Leesburg Offices-40,302 SF Class B Office Building  Built in 1981Grizzard Commercial Real EstateGroup, LLCDan Tatro (352) 396-9136Building Notes:Building has 24 hour access.Join GSA and the Social Security Administration in this great Leesburg Office.  Current layout is perfect for high density office user.  Small modifications can bemade to accommodate anything from a small college to a large medical tenant.Great Central Florida location along busy US 27 and not far from The Villages.5,150Negotiable2ndVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC2nd Floor maybe divided into smaller units11,248Negotiable3rdVacantPartial Build-OutOffice/D$14.00/fsN10 MthsDan Tatro (352) 396-9136PGrizzard Commercial Real EstateGroup, LLC3rd Floor has 3 units that maybe divided or combined.Copyrighted report licensed to CBRE - 759852.7/29/2019Page 2\x0c']
    Requirements = ['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible']
    OutputList = []
    for item in list:
        FlyerNumber = "FL" + (item.split("FL",1)[1])[:6]
        Address = (item.split(") ",1)[1]).split(",",1)[0]
        Squarefeet = str((item.split("SF Class",1)[0]).rsplit("-",1)[1]) + "SF"
        Class = (item.split("Class ",1)[1])[0]
        CollectedData = [FlyerNumber,Address,Squarefeet,Class]
        OutputList.extend(CollectedData)
    print(OutputList)
    FinalList = Requirements + OutputList
    print(FinalList)
    

    Printing OutputList gives you:

    ['FL 32771', '104-112 E 1st St - Sanford', '40,000 SF', 'C', 'FL 34748', '734 N 3rd St - Leesburg', '21,472 SF', 'B']
    

    Printing FinalList gives you:

    ['Flyer Number', 'Address', 'Total SF', 'Class', 'Suite/Bldg', 'SF available', 'Rent/SF/Year', 'Term', 'Occupancy', 'User/Type', 'Leasing company', 'Contact', 'Listed', 'Divisible', 'FL 32771', '104-112 E 1st St - Sanford', '40,000 SF', 'C', 'FL 34748', '734 N 3rd St - Leesburg', '21,472 SF', 'B']
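
FinalList is one flat list, so the 14 column names and the 4-value records are not aligned with each other. Below is a minimal sketch of one way to keep each listing as its own row, assuming only the four fields extracted so far. `to_rows` is a hypothetical helper introduced here for illustration, not part of the answer's code.

```python
# Hypothetical helper (not from the original answer): split a flat list
# into consecutive rows of a fixed width.
def to_rows(flat_values, width):
    """Return flat_values cut into consecutive chunks of `width` items."""
    return [flat_values[i:i + width] for i in range(0, len(flat_values), width)]


# Only the four columns that the answer's code actually fills in.
columns = ['Flyer Number', 'Address', 'Total SF', 'Class']

# The OutputList printed above.
output_list = ['FL 32771', '104-112 E 1st St - Sanford', '40,000 SF', 'C',
               'FL 34748', '734 N 3rd St - Leesburg', '21,472 SF', 'B']

rows = to_rows(output_list, len(columns))

# Pair every value with its column name so the rows are self-describing.
records = [dict(zip(columns, row)) for row in rows]
for record in records:
    print(record)
```

A list of dictionaries like `records` can usually be handed straight to a table library or to `csv.DictWriter` if a spreadsheet-style result is the goal.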
    

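    I found that it would take me a long time to work through every one of the required fields, so this covers half of them, from Flyer Number to Class. Make sure the layout of your text does not change from page to page, otherwise the output may change.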
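
The `split`-based code above picks up only the first listing on each page, although the sample pages contain two listings each. As an alternative sketch (a regular expression swapped in for the `split` calls, not the method used in this answer), `re.finditer` can walk every listing header on a page. The pattern assumes headers always look like `N) address, FL 12345Name-12,345 SF Class X`, as in the two sample pages; `LISTING_RE` and `parse_listings` are hypothetical names, and the sample text is an abbreviated excerpt of the first page.

```python
import re

# Assumed header shape: "<n>) <address>, FL <zip><building name>-<size> SF Class <letter>".
# The address group excludes ")" so that phone numbers such as "(407) 467-5397"
# are not mistaken for the start of a listing.
LISTING_RE = re.compile(
    r"(\d+)\) ([^)]+?), (FL \d{5})(.*?)-([\d,]+) SF Class ([A-Z])"
)


def parse_listings(page_text):
    """Return one dict per listing header found in page_text (may be empty)."""
    return [
        {
            "Flyer Number": match.group(3),
            "Address": match.group(2),
            "Total SF": match.group(5) + " SF",
            "Class": match.group(6),
        }
        for match in LISTING_RE.finditer(page_text)
    ]


# Abbreviated excerpt of page 1 from the question (middle text omitted).
sample_page = (
    "FloorSF AvailRent/SF/YrOccupancyTermBld OutLeasing CompanyUse/Type"
    "ContactListedDivisible1) 104-112 E 1st St - Sanford, FL 32771"
    "Rand Complex-40,000 SF Class C Office Building  Renovated in 1988 "
    "Built in 1910Hotard RealtyMarie Hotard (407) 467-5397Building Notes:-"
    "7,0001 yr2ndVacantOffice/N$7.80/mgN15 MthsMarie Hotard (407) 467-5397"
    "PHotard RealtyHistorical building with great potential."
    "2) 110 W 1st St - Sanford, FL 32771The Welaka Building-25,797 SF "
    "Class B Loft/Creative Space Building"
)

listings = parse_listings(sample_page)
for listing in listings:
    print(listing)
```

A page with no listing header simply yields an empty list instead of raising, which may be easier to work with than indexing into the result of `split`.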

    Applied to the code from the question:

    import PyPDF2
    import io
    import re
    import pandas as pd
    
    from pdfminer.converter import TextConverter
    from pdfminer.pdfinterp import PDFPageInterpreter
    from pdfminer.pdfinterp import PDFResourceManager
    from pdfminer.pdfpage import PDFPage
    mypdf = open('C:/Users/renu.sharma/Desktop/7-29-19 Office Availabilities 1.pdf', mode='rb')
    pdf_document = PyPDF2.PdfFileReader(mypdf)
    
    entry=[]
    for page in PDFPage.get_pages(mypdf,  
                                  caching=True, 
                                  check_extractable=True):
        resource_manager = PDFResourceManager()
        fake_file_handle = io.StringIO() 
        converter = TextConverter(resource_manager, fake_file_handle)
        page_interpreter = PDFPageInterpreter(resource_manager, converter) 
        page_interpreter.process_page(page)
        text = fake_file_handle.getvalue() 
    
        entry.append(text) 
        # close open handles
        converter.close()
        fake_file_handle.close()
    
    Requirements = ['Flyer Number',    'Address',  'Total SF', 'Class',    'Suite/Bldg',   'SF available', 'Rent/SF/Year', 'Term', 'Occupancy',    'User/Type',    'Leasing company',  'Contact',  'Listed',   'Divisible']
    OutputList = []
    for item in entry:
        FlyerNumber = "FL" + (item.split("FL",1)[1])[:6]
        Address = (item.split(") ",1)[1]).split(",",1)[0]
        Squarefeet = str((item.split("SF Class",1)[0]).rsplit("-",1)[1]) + "SF"
        Class = (item.split("Class ",1)[1])[0]
        CollectedData = [FlyerNumber,Address,Squarefeet,Class]
        OutputList.extend(CollectedData)
    FinalList = Requirements + OutputList 
    
    print(OutputList)
    

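    【Discussion】:

    • Now that you know the idea, I think you can finish the rest.
    • Thanks for the reply, but it gives the following error: ---> 31 FlyerNumber = "FL" + (item.split("FL",1)[1])[:5] 32 Address = (item.split(") ",1)[1]).split(",",1)[0] 33 Squarefeet = str((item.split("SF Class",1)[0]).rsplit("-",1)[1]) + "SF" IndexError: list index out of range. Please advise.
    • Which list are you using? I used the output you said you were able to get, and I do not hit any problem, even on the line your error points to. I think it comes from your input list. I just used the two-element list you provided in the question, and it works fine.
    • I am using the exact code you suggested, but it gives an index out of range error on the "FlyerNumber" and "Class" lines.
    • Hey @renu, my code works with the list above. I think the error occurs because your list has a different pattern. Share the complete list you are using, or at least part of it, so I can spot the difference.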
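
One plausible cause of the `IndexError` discussed above is a page whose text lacks one of the marker strings: `item.split("FL",1)` returns a single-element list when "FL" is absent, so `[1]` is out of range. The sketch below is only a diagnostic aid under that assumption. `REQUIRED_MARKERS`, `missing_markers` and `report_unparseable` are hypothetical names, and the second sample page is invented to stand in for a page without any listing.

```python
# Substrings the split-based code relies on being present in every page.
REQUIRED_MARKERS = ("FL", ") ", "SF Class", "Class ")


def missing_markers(page_text):
    """Return the markers that do not occur in page_text."""
    return [marker for marker in REQUIRED_MARKERS if marker not in page_text]


def report_unparseable(pages):
    """Return (page_index, missing_markers) for every page lacking a marker."""
    problems = []
    for index, text in enumerate(pages):
        missing = missing_markers(text)
        if missing:
            problems.append((index, missing))
    return problems


pages = [
    # Abbreviated header of page 1 from the question.
    "1) 104-112 E 1st St - Sanford, FL 32771Rand Complex-40,000 SF Class C Office Building",
    # Hypothetical page that carries only the footer and no listing.
    "Copyrighted report licensed to CBRE - 759852.7/29/2019Page 3\x0c",
]

for index, missing in report_unparseable(pages):
    print(f"page {index}: missing {missing}")
```

Running a check like this over `entry` before the extraction loop shows which pages need to be skipped or handled differently.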