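【Question title】: How to write files in a CSV by sentence number, sentence (split by '|')?
【Posted】: 2014-04-08 05:39:00
【Question】:

I am trying to read a list of files and extract the file ID and the abstract from each one. Every sentence of the abstract should be written to a CSV file containing the file ID, the sentence number, and the sentence, separated by '|'.

I was told to use NLTK's tokenizer. I have installed NLTK, but I don't know how to make it work with my code. My Python version is 3.2.2. Here is my code: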

import re, os, sys
import csv
# Read into the list of files.
topdir = r'E:\Grad\LIS\LIS590 Text mining\Part1\Part1' # Top-level directory to search; the raw string keeps the backslashes literal.
matches = []
for root, dirnames, filenames in os.walk(topdir):
    for filename in filenames:
        if filename.endswith(('.txt','.pdf')):
            matches.append(os.path.join(root, filename))

# Collect the file IDs and the abstracts. Every abstract is one string in its list.
capturedfiles = []
capturedabstracts = []
for filepath in matches[:10]:  # Testing with the first 10 files.
    with open(filepath, 'rt') as f:
        mytext = f.read()

    # code to capture files
    matchFile = re.findall(r'File\s+\:\s+(\w\d{7})', mytext)[0]
    capturedfiles.append(matchFile)

    # code to capture abstracts
    matchAbs = re.findall(r'Abstract\s+\:\s+(\w.+)' + '\n', mytext)[0]
    capturedabstracts.append(matchAbs)
    print(capturedabstracts)

with open('Abstract.csv', 'w') as csvfile:
    writer = csv.writer(csvfile)
    for data in capturedabstracts:
        writer.writerow([data])

I am a Python beginner and may not understand your comments, so it would be great if you could post the modified code along with them.

【Comments】:

    Tags: python csv file-io


    【Answer 1】:

    As a first attempt, look at a sentence tokenizer to split the text into a list of sentences, then write each sentence to the CSV with writerow:

    import csv
    import nltk

    # Requires the punkt tokenizer data to be installed for NLTK.
    sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
    # text is assumed to hold one abstract as a string.
    list_of_sentences = sent_detector.tokenize(text.strip())

    with open('Abstract.csv', 'w', newline='') as outfile:
        writer = csv.DictWriter(outfile, fieldnames=['phrase'], delimiter='|',
                                quoting=csv.QUOTE_NONE, escapechar='\\')
        for phrase in list_of_sentences:
            writer.writerow({'phrase': phrase})
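
To get the layout the question actually asks for (file ID, sentence number, sentence), the tokenized sentences can be numbered with enumerate. The following is only a minimal sketch: `naive_split` and `write_sentences` are hypothetical helper names, and the regex splitter is a crude stand-in for a real sentence tokenizer so that the example runs without NLTK.

```python
import csv
import io
import re


def naive_split(text):
    """Hypothetical stand-in for a sentence tokenizer: split after '.', '!' or '?'."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]


def write_sentences(rows, outfile, split=naive_split):
    """Write one 'file_id|sentence_number|sentence' line per sentence.

    rows is an iterable of (file_id, abstract) pairs; split is any callable
    that turns a string into a list of sentences.
    """
    writer = csv.writer(outfile, delimiter='|',
                        quoting=csv.QUOTE_NONE, escapechar='\\')
    for file_id, abstract in rows:
        for number, sentence in enumerate(split(abstract), start=1):
            writer.writerow([file_id, number, sentence])


# Demo with made-up data, written to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_sentences([('a9000006', 'First sentence. Second one? Third!')], buffer)
print(buffer.getvalue())
```

With NLTK installed, passing `split=sent_detector.tokenize` should give better sentence boundaries than the regex, and `zip(capturedfiles, capturedabstracts)` would supply the rows if both lists are filled as in the question.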
    

    【Comments】:

      【Answer 2】:

      Try using writerow.

      Try something like this:

      with open('Abstract.csv', 'w') as csvfile:
          writer = csv.writer(csvfile)
          for data in capturedabstracts:
              writer.writerow([data])
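
Note that csv.writer separates fields with a comma unless told otherwise, and the loop above writes only the abstract. A small sketch of pairing each file ID with its abstract and switching the separator to '|' follows; the two lists hold made-up sample values, and an in-memory buffer stands in for the real file.

```python
import csv
import io

# Made-up sample data standing in for the lists built in the question.
capturedfiles = ['a0000001', 'a0000002']
capturedabstracts = ['First abstract.', 'Second, with a comma.']

buffer = io.StringIO()  # stands in for open('Abstract.csv', 'w', newline='')
writer = csv.writer(buffer, delimiter='|')
for file_id, abstract in zip(capturedfiles, capturedabstracts):
    # One row per file: ID first, then the whole abstract.
    writer.writerow([file_id, abstract])

print(buffer.getvalue())
```

Because '|' is now the delimiter, the comma inside the second abstract needs no quoting; a field that itself contained '|' would be wrapped in double quotes by the default quoting mode.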
      

      【Comments】:

      • "...should be written to a CSV file containing the file ID, the sentence number, and the sentence, separated by '|'."?