【Posted】: 2014-04-08 05:39:00
【Question】:
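I am trying to read a list of files and extract the file ID and the abstract from each one. Every sentence of the abstract should be written to a CSV file as one row containing the file ID, the sentence number, and the sentence, separated by "|".
I was told to use NLTK's tokenizer. I have installed NLTK, but I don't know how to make it work with my code. My Python version is 3.2.2. Here is my code: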
import re, os, sys
import csv

# Walk the directory tree and collect the paths of the files to read.
# Raw string (r'...') so the backslashes in the Windows path are not treated as escapes.
topdir = r'E:\Grad\LIS\LIS590 Text mining\Part1\Part1'
matches = []
for root, dirnames, filenames in os.walk(topdir):
    for filename in filenames:
        if filename.endswith(('.txt','.pdf')):
            matches.append(os.path.join(root, filename))

# Collect the file IDs and the abstracts. Every abstract is one string in its list.
capturedfiles = []
capturedabstracts = []
for filepath in matches[:10]: # Testing with the first 10 files.
    with open(filepath, 'rt') as mytext:
        mytext = mytext.read()
        # Capture the file ID.
        matchFile = re.findall(r'File\s+\:\s+(\w\d{7})', mytext)[0]
        capturedfiles.append(matchFile)
        # Capture the abstract.
        matchAbs = re.findall(r'Abstract\s+\:\s+(\w.+)'+'\n', mytext)[0]
        capturedabstracts.append(matchAbs)
print(capturedabstracts)

# Write each abstract to its own row of the CSV file.
with open('Abstract.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for data in capturedabstracts:
        writer.writerow([data])
I am a Python beginner and may not understand your comments, so it would be great if you could post the modified code along with your comments.
【Discussion】:
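The output format described in the question (file ID, sentence number, and sentence, separated by "|") could be produced roughly as follows. This is only a minimal sketch under some assumptions: `naive_sentence_split` and `write_sentence_rows` are hypothetical helper names, the sample record is made up, and the regex splitter is a crude stand-in that will mis-split abbreviations such as "e.g.". The splitter is passed in as a parameter so that NLTK's `sent_tokenize` can replace it.

```python
import csv
import io
import re


def naive_sentence_split(text):
    """Stand-in splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]


def write_sentence_rows(records, out, split_sentences=naive_sentence_split):
    """Write one 'file_id|sentence_number|sentence' row per sentence.

    records: iterable of (file_id, abstract) pairs.
    out: an open text file or any other file-like object.
    split_sentences: any callable that maps a string to a list of sentences.
    """
    writer = csv.writer(out, delimiter='|', lineterminator='\n')
    for file_id, abstract in records:
        # Sentence numbers restart at 1 for every file.
        for number, sentence in enumerate(split_sentences(abstract), start=1):
            writer.writerow([file_id, number, sentence])


# Made-up sample data standing in for the captured file IDs and abstracts.
sample = [('a0000001', 'This is one sentence. Here is another! Is this a third?')]
buffer = io.StringIO()
write_sentence_rows(sample, buffer)
print(buffer.getvalue())
```

With the code from the question, the records would be `zip(capturedfiles, capturedabstracts)` and `out` would be a real file, for example one opened with `open('Abstract.csv', 'w', newline='')`. To use NLTK instead of the regex, import `sent_tokenize` from `nltk.tokenize` and pass it as `split_sentences=sent_tokenize`; NLTK's Punkt sentence data usually has to be downloaded once with `nltk.download` before that works.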