[Posted]: 2015-08-26 09:31:59
[Question]:
I have a text file, and some snippets of it look like this:
Page 1 of 515
Closing Report for Company Name LLC
222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller
Acceptance: 01/01/2001 Closing date: 11/11/2011 Property type: Commercial Lease
MLS number: Sale price: $200,000 Commission: $1,500.00
Notes: 08/15/2000 02:30PM by Roger Lodge This property is a Commercial Lease handled by etc..
Seller: Company Name LLC
Company name: Company Name LLC
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Tomlinson, Ladainian
Address: 222 N 9th Street, #100 & 200, Las Vegas, NV, 89101
Home: Pager:
Business: 555-555-5555 Fax:
Mobile: Email:
Lessee Agent: Blank, Arthur
Company name: Sprockets Inc.
Address: 5001 Old Man Dr, North Las Vegas, NV, 89002
Home: (575) 222-3455 Pager:
Business: Fax: 999-9990
Mobile: (702) 600-3492 Email: sprockets@yoohoo.com
Leasing Agent: Van Uytnyck, Chameleon
Company name: Company Name LLC
Address:
Home: Pager:
Business: Fax: 909-222-2223
Mobile: 595-595-5959 Email:
(should be 2 spaces here.. this is not in normal text file)
Printed on Friday, June 12, 2015
Account owner: Roger Goodell
Page 2 of 515
Report for Adrian (Allday) Peterson
242 N 9th Street, #100 & 200
File number: Soap Status: Closed/Paid Primary closing party: Buyer
Acceptance: 01/10/2010 Closing date: 01/10/2010 Property type: RRR
MLS number: Sale price: $299,000 Commission: 33.00%
Seller: SOS, Bank
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: Fax:
Mobile: Email:
Buyer: Sabel, Aaron
Address:
Home: Pager:
Business: Fax:
Mobile: Email: sia@yoohoo.com
Escrow Co: Schneider, Patty
Company name: National Football League
Address: 242 N 9th Street, #100 & 200
Home: Pager:
Business: 800-2009 Fax: 800-1100
Mobile: Email:
Buyers Agent: Munchak, Mike
Company name: Commission Group
Address:
Home: Pager:
Business: Fax:
Mobile: 483374-3892 Email: donation@yoohoo.net
Listing Agent: Ricci, Christina
Company name: Other Guys
Address:
Home: Pager:
Business: Fax:
Mobile: 888-333-3333 Email: general.adama@cylon.net
Here is my code:
import re
file = open('file-path.txt','r')
# if there are more than two consecutive blank lines, then we start a new Entry
entries = []
curr = []
prev_blank = False
for line in file:
    line = line.rstrip('\n').strip()
    if (line == ''):
        if prev_blank == True:
            # end of the entry, create append the entry
            if(len(curr) > 0):
                entries.append(curr)
                print curr
                curr = []
                prev_blank = False
        else:
            prev_blank = True
    else:
        prev_blank = False
        # we need to parse the line
        line_list = line.split()
        str = ''
        start = False
        for item in line_list:
            if re.match('[a-zA-Z\s]+:.*',item):
                if len(str) > 0:
                    curr.append(str)
                str = item
                start = True
            elif start == True:
                str = str + ' ' + item
Here is the output:
['number: Jackie Grant', 'Status: Fell Thru Primary closing', 'Acceptance: 01/01/2001 Closing', 'date: 11/11/2011 Property', 'number: Sale', 'price: $200,000', 'Home:', 'Business:', 'Mobile:', 'Home:', 'Business: 555-555-5555', 'Mobile:', 'Home: (575) 222-3455', 'Business:', 'Mobile: (702) 600-3492', 'Home:', 'Business:', 'Mobile: 595-595-5959']
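My problems are as follows:
- First, there should be 2 records in the output, but I only get one.
- In the top block of text, my script cannot tell where the previous value ends and the next one starts: 'Status: Fell Thru' should be one value, and 'Primary closing party: Buyer', 'Acceptance: 01/10/2010', 'Closing date: 01/10/2010', 'Property type: RRR', 'MLS number:', 'Sale price: $299,000', 'Commission: 33.00%' should all be captured.
- Once that is parsed correctly, I need to parse again to separate the keys from the values (i.e. 'Closing date': 01/10/2010), ideally into a list of dicts.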
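On the first point, here is a minimal sketch of one way to split the text into records. It assumes (this is not stated in the question) that every record begins with a `Page N of M` line, and it targets Python 3. The names `PAGE_HEADER` and `split_records` are hypothetical. The final `if current:` is the step the original loop lacks: the last record is never followed by another separator, so it has to be flushed after the loop ends.

```python
import re

# Assumption: each record starts with a header line such as "Page 1 of 515".
PAGE_HEADER = re.compile(r'^Page \d+ of \d+$')

def split_records(lines):
    """Group non-blank lines into records, starting a new one at each page header."""
    records, current = [], []
    for raw in lines:
        line = raw.strip()
        if PAGE_HEADER.match(line):
            if current:
                records.append(current)
            current = []
        elif line:
            current.append(line)
    if current:  # flush the last record; nothing follows it in the file
        records.append(current)
    return records

# Shortened sample based on the snippet in the question.
sample = """Page 1 of 515
Closing Report for Company Name LLC
File number: Jackie Grant Status: Fell Thru Primary closing party: Seller

Printed on Friday, June 12, 2015
Page 2 of 515
Report for Adrian (Allday) Peterson
File number: Soap Status: Closed/Paid Primary closing party: Buyer
""".splitlines()

records = split_records(sample)
print(len(records))
```

If the real file has no reliable page header, the same flush-after-the-loop idea applies to the blank-line approach as well.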
Other than using a regex to pick out the keys and then grabbing the snippets of text that follow each one, I can't think of a better approach.
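The regex idea can be sketched roughly as follows, assuming the field names are known in advance. `KNOWN_KEYS` is a hypothetical, partial list copied from the sample; the real file would need all of its field names. `parse_line` is likewise a made-up helper. Each value is taken to be the text between one key and the next key on the same line, which handles multi-word keys such as `Primary closing party` and empty values such as `MLS number:`.

```python
import re

# Hypothetical, partial key list taken from the sample in the question.
KNOWN_KEYS = [
    'File number', 'Status', 'Primary closing party', 'Acceptance',
    'Closing date', 'Property type', 'MLS number', 'Sale price', 'Commission',
]

# Try longer keys first; a key must start at the line start or after whitespace.
KEY_RE = re.compile(
    r'(?<!\S)('
    + '|'.join(re.escape(k) for k in sorted(KNOWN_KEYS, key=len, reverse=True))
    + r'):'
)

def parse_line(line):
    """Return {key: value}, where each value runs up to the next known key."""
    matches = list(KEY_RE.finditer(line))
    fields = {}
    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(line)
        fields[m.group(1)] = line[m.end():end].strip()
    return fields

print(parse_line('File number: Soap Status: Closed/Paid Primary closing party: Buyer'))
print(parse_line('MLS number: Sale price: $299,000 Commission: 33.00%'))
```

One caveat with this sketch: free-text fields such as `Notes:` could contain something that looks like a key followed by a colon, and that would be split incorrectly.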
When finished, I'd like a CSV with a header row of keys that I can import into pandas with read_csv. I've spent quite a few hours on this one..
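For the CSV step, a minimal sketch using the standard library's `csv.DictWriter`, assuming the parsing has already produced a list of dicts. The `records` below are hypothetical stand-ins built from values in the sample. Since not every field appears on every record, the header is built as the union of all keys, and `restval=''` leaves missing fields empty. An in-memory buffer is used here only so the example runs without touching the disk.

```python
import csv
import io

# Hypothetical parsed records; a field missing from a record is simply absent.
records = [
    {'File number': 'Jackie Grant', 'Status': 'Fell Thru', 'Sale price': '$200,000'},
    {'File number': 'Soap', 'Status': 'Closed/Paid', 'Commission': '33.00%'},
]

# Header row: union of all keys, in order of first appearance.
fieldnames = []
for rec in records:
    for key in rec:
        if key not in fieldnames:
            fieldnames.append(key)

buffer = io.StringIO()  # stand-in for open('out.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(records)

csv_text = buffer.getvalue()
print(csv_text)
```

Values containing commas, such as `$200,000`, are quoted by the writer, so a file written this way should load with `pandas.read_csv` without a custom separator.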
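[Comments]:
-
Split on the colon? It seems your main problem is that a key can be one or two words. Maybe create a list of the two-word keys; anything else must be a one-word key. And whatever comes before a key must be the value of the previous key.
-
I don't understand the downvote - the question shows research, effort, and the path the code has taken, and it is laid out well.. I just need help moving forward, that's all.. thx
-
The reason your loop shows only one entry instead of two is the way you structured it. It loops, finds what matches the conditions you gave it, and then stops. It is doing exactly what it should. You would be better off iterating over the file, collecting each entry (as a whole), and then building a list.
-
Why not just use a regex and be done with it?
-
@sln You make it sound so simple.. there are about 50 field names in total, and many of them don't appear on every record.. if I could do this with a regex, I would.. something constructive would be nice to help me get through this.. thx
Tags: python regex parsing csv pyparsing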