【Question Title】: Extract paragraph with specific words between two similar titles
【Posted】: 2017-09-18 10:23:22
【Question】:

My text file contains paragraphs like this:

summary

A result oriented and dedicated professional with three years’ experience in Software Development. A proactive individual with a logical approach to challenges, performs effectively even within a highly pressurised working environment.

summary

Oct 28th, 2010 – Till date  Cognizant Technology Solutions      


Project #1

Title           Wealth Passport – R7.3
Client                    Northern Trust
Operating System    Windows XP
Technologies        J2EE, JSP, Struts, Oracle, PL/SQL
Team Size       3
Role            Team Member
Period                    22nd Aug’ 2013 - Till Date    
Project Description
Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities.
Role and Responsibilities
Handled dockets in SGY and PMM applications.
Done root cause analysis to existing issues in a short span of time.
Designed and developed enhancements in PMM application.
Preparing Unit Test cases for the developed Java modules and executing them.


Project #2
Title           PFS Development – WP Filecabinet and R7.2
Client                    Northern Trust
Operating System    Windows XP
Technologies        J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO
Team Size       1
Role            Team Member – JavaEE Developer
Period                   18th June’ 2013 – 21st Aug’ 2013   
Project Description
PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0
Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files.  Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system.

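Now I want to extract only the summary section that contains words such as "Project" and "Teamsize", and skip the other summary sections. I tried the code below, but it extracts the content of both summaries.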

import re
import os
with open ('9.txt', encoding='latin-1') as infile, open ('d.txt','w',encoding='latin-1') as outfile :
    copy = False 
    for line in infile:
        if line.strip() == 'summary':
            re.compile('\r\nproject*\r\n')
            copy = True
        elif line.strip() == "summary":
            copy =False 
        elif copy:
            outfile.write(line)
        #fh = open("d.txt",'r')
        contents = fh.read()
        len(contents)

I want the following content saved to a text file named d.txt:

 summary

    Oct 28th, 2010 – Till date  Cognizant Technology Solutions      


    Project #1

    Title           Wealth Passport – R7.3
    Client                    Northern Trust
    Operating System    Windows XP
    Technologies        J2EE, JSP, Struts, Oracle, PL/SQL
    Team Size       3
    Role            Team Member
    Period                    22nd Aug’ 2013 - Till Date    
    Project Description
    Wealth Passport R7.3 release aims at enhancements in four projects SGY, PMM, WPA and WPX. This primarily involves analysing existing issues in the four applications and enhancements to some of the functionalities.
    Role and Responsibilities
    Handled dockets in SGY and PMM applications.
    Done root cause analysis to existing issues in a short span of time.
    Designed and developed enhancements in PMM application.
    Preparing Unit Test cases for the developed Java modules and executing them.


    Project #2
    Title           PFS Development – WP Filecabinet and R7.2
    Client                    Northern Trust
    Operating System    Windows XP
    Technologies        J2EE, JSP, Struts, Weblogic Portal, Oracle, PL/SQL, UNIX, Hibernate, Spring, DOJO
    Team Size       1
    Role            Team Member – JavaEE Developer
    Period                   18th June’ 2013 – 21st Aug’ 2013   
    Project Description
    PFS Development project is to provide the development services for PFS capital projects: Wealth Passport, Private Passport 6.0 and Private Passport 7.0
    Wealth Passport Filecabinet provides functionality for users to store their files on our system. This enables users to create folders, upload files and view the uploaded files.  Batch upload/delete option is also available. Deleted files will be moved to Waste Bucket, from where users can restore should they wish. This project aims at improving the performance of Filecabinet which was mandated by increasing customer base and files handled by the system.

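【Comments】:

  • Do you control the format of the text files? If so, storing them as json, txt or csv (to name a few) would make them easier to parse.
  • What is the expected output in d.txt?
  • The summary section that contains the word Project.
  • Could you edit the question to show what that looks like for the example you gave?
  • So to confirm, you want to extract starting from the second summary, i.e. skip the first one? What should happen if there is a third summary?

Tags: python information-extraction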


【Solution 1】:

To extract every summary section that contains all of the words you are interested in:

split_on = 'summary\n\n'
must_contain = ['Project', 'Team Size']

with open('9.txt') as f_input, open('d.txt', 'w') as f_output:
    for part in f_input.read().split(split_on):
        if all(text in part for text in must_contain):
            f_output.write(split_on + part)
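
Since the comments mention files with varying layouts and keywords written as "teamsize", here is a hedged sketch of the same split-and-filter idea that ignores case and whitespace when matching. The names `normalize` and `extract_sections` are hypothetical helpers, not part of the original answer, and the sketch assumes each section starts at a line containing only the heading.

```python
import re

def normalize(text):
    """Lower-case the text and drop all whitespace so 'Team Size' matches 'teamsize'."""
    return re.sub(r'\s+', '', text).lower()

def extract_sections(text, must_contain, heading='summary'):
    """Return the sections that start at a heading line and contain every keyword."""
    # A heading is a line holding only the heading word (any case, optional spaces/tabs).
    # Assumes '\n' line endings, which is what open() in text mode yields by default.
    pattern = re.compile(r'^[ \t]*' + re.escape(heading) + r'[ \t]*$',
                         re.IGNORECASE | re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(text)]
    # Each section runs from one heading to the next (or to the end of the text).
    sections = [text[a:b] for a, b in zip(starts, starts[1:] + [len(text)])]
    wanted = [normalize(word) for word in must_contain]
    return [s for s in sections if all(w in normalize(s) for w in wanted)]

sample = ("summary\n\nA professional with experience.\n\n"
          "summary\n\nProject #1\nTeam Size   3\n")
matches = extract_sections(sample, ['project', 'teamsize'])
```

The matching sections could then be written out with something like `f_output.write(''.join(matches))`. Text that appears before the first heading is dropped in this sketch.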

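【Comments】:

  • I have many files with random layouts in which I need to check for specific words and extract the matching section; not all of the files look like the one above.
  • I want to extract any section of an arbitrary input text file that contains a specific set of words, e.g. {project, teamsize etc.,}
  • I have updated the script to filter all sections that contain the required list of words.
【Solution 2】:

The second conditional here never runs, because it tests the same condition as the first. This means that after the first occurrence of summary, copy will always be True.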

if line.strip() == 'summary':
    re.compile('\r\nproject*\r\n')
    copy = True
elif line.strip() == "summary":
    copy =False 
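
A small sketch can make this visible by recording the flag after every line. `trace_copy_flag` is a hypothetical helper written for illustration, and the stray `re.compile` call is left out because its result is never used.

```python
def trace_copy_flag(lines):
    """Return the value of `copy` after each line, using the question's if/elif chain."""
    copy = False
    states = []
    for line in lines:
        if line.strip() == 'summary':
            copy = True
        elif line.strip() == 'summary':  # unreachable: same test as the branch above
            copy = False
        states.append(copy)
    return states

states = trace_copy_flag(['summary', 'intro', 'summary', 'Project #1'])
print(states)  # [True, True, True, True]
```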

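I suggest using a single statement to detect the "summary" tag (I am assuming these mark the start and end of a comment block) and toggle copy.

To toggle a boolean, you can simply set it to its own negation: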

 a = True
 a = not a
 # a is now False

For example:

 if line.strip() == 'summary':
    copy = not copy
 elif copy:
    outfile.write(line)
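
As a self-contained sketch of this toggle, the loop below works on a list of lines instead of files. `copy_between_markers` is a hypothetical name used only for illustration. Because the toggle treats summary lines as open/close pairs, it keeps whatever lies between the first and second marker, so for a layout like the one in the question it would probably need to be combined with a keyword check to select the project section.

```python
def copy_between_markers(lines, marker='summary'):
    """Collect the lines that sit between an opening and a closing marker line."""
    copy = False
    kept = []
    for line in lines:
        if line.strip() == marker:
            copy = not copy  # flip the flag on every marker line
        elif copy:
            kept.append(line)
    return kept

sample = ['summary\n', 'intro text\n', 'summary\n', 'Project #1\n', 'Team Size 3\n']
kept = copy_between_markers(sample)
print(kept)  # ['intro text\n']
```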

【Comments】:
