Python split text files by paragraph to csv: answers

【Question title】: Python split text files by paragraph to csv
【Posted】: 2022-01-14 06:42:41
【Question description】:

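I have several text files that I would like to split by paragraph and transpose into a csv file. Each paragraph in my text files is separated by a blank line, and some long paragraphs span several lines. Here is an example of one text file: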

"Hello world!
Blabla
(blank line)
This is the 2nd paragraph. Here is more text
and this is a very long paragraph."

I would like to get the following csv file:

filename	text
1.txt	Hello world! Blabla
1.txt	This is the 2nd paragraph. Here is more text and this is a very long paragraph.

This is my current code, but it gives only one row: "1.txt, [""Hello world!"", ""This is the 2nd paragraph. Here is more text. \nand this is a very long paragraph""]":

import os, csv
os.chdir('path where I have text files')
from pathlib import Path
with open('output.csv', 'w', newline="", encoding="utf-16") as out_file:
    csv_out = csv.writer(out_file)
    csv_out.writerow(['filename', 'Content'])
    for fileName in Path('.').glob('*.txt'):
        csv_out.writerow([str(fileName),open(str(fileName.absolute())).read().strip().split("\n\n")])
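
A minimal in-memory sketch of why this produces one row per file. The sample string and the `io.StringIO` buffer are stand-ins for illustration, not part of the script above. `split("\n\n")` returns a list, and when that list is passed as a single cell, `csv.writer` stores its string representation instead of writing one row per paragraph:

```python
import csv
import io

# Stand-in for the contents of 1.txt (assumed sample, not a real file).
text = (
    "Hello world!\nBlabla\n\n"
    "This is the 2nd paragraph. Here is more text\n"
    "and this is a very long paragraph."
)

paragraphs = text.strip().split("\n\n")  # a list holding two strings

buffer = io.StringIO()
writer = csv.writer(buffer)
# The whole list is passed as ONE cell, so csv.writer stores str(paragraphs).
writer.writerow(["1.txt", paragraphs])

# Read the CSV back to inspect what was actually written.
rows = list(csv.reader(io.StringIO(buffer.getvalue())))
print(len(rows))   # number of rows written for the file
print(rows[0][1])  # the cell holds the list's text form, not separate paragraphs
```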

【Question comments】:

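What do you mean by paragraph? A new line? What output do you get from your current code?
Don't simply report "it doesn't work"; always state the result you got versus what you expected.
Please elaborate on why you would read them into a pandas dataframe. You could add a preprocessing step that replaces multiple consecutive newlines, (\n){1,}, with a single \n.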

Tags: python csv text split paragraph

【Solution 1】:

Assuming a new paragraph is marked by \n\n, for example

Hello world!\n\nThis is the 2nd paragraph. Here is more text.

So you need to split the content with .split('\n\n') and then write the pieces out one row at a time.

Use the code below and update the path to your own:

import csv
import glob

with open('output.csv', 'w', newline="", encoding="utf-16") as out_file:
    csv_out = csv.writer(out_file)
    csv_out.writerow(['filename', 'Content'])
    # Visit every .txt file in the current directory
    for text_file in glob.iglob('*.txt'):
        with open(text_file, 'r') as txt:
            # A blank line ("\n\n") separates paragraphs; write one row per paragraph
            for paragraph in txt.read().split('\n\n'):
                csv_out.writerow([text_file, paragraph])

This gives the output you expect:

filename,Content
1.txt,Hello world!
1.txt,This is the 2nd paragraph. Here is more text.
2.txt,Hello world! from 2.txt
2.txt,This is the 2nd paragraph. Here is more text. from 2.txt

【Comments】:

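Thanks for your comment. It works, with one problem. When I have long paragraphs (text spanning several lines), the row continues on another line. I get: "1.txt, This is the 2nd paragraph. Here is more text", and on another line "the text is very long"
Could you show me an example of the content in the text file? I need to see the format.
I edited my post to improve my text file example. Thanks
@MGFern If the text contains \n, it will be displayed on the next line in the file or on screen. But that doesn't necessarily mean it is stored as a new row in the table - it can still be part of the previous row. If you don't want it to wrap onto the next line on screen, use .replace("\n", " ")
Use (\n){1,} instead of \n\n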
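
Pulling the comments above together, one possible sketch: split on blank lines, then collapse the line breaks inside each paragraph into spaces so that each paragraph becomes a single CSV row. The helper names `split_paragraphs` and `write_paragraph_csv`, the `\n\s*\n` pattern (which also tolerates "blank" lines containing stray spaces), and the UTF-8 input encoding are assumptions made for illustration; they are not part of the original answer.

```python
import csv
import re
from pathlib import Path


def split_paragraphs(text):
    """Split text on blank lines and join each paragraph's lines with spaces."""
    # Assumption: a "blank line" may contain stray whitespace.
    blocks = re.split(r"\n\s*\n", text.strip())
    # " ".join(block.split()) collapses newlines and repeated spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]


def write_paragraph_csv(folder, out_path):
    """Write one (filename, paragraph) row per paragraph of every *.txt file."""
    with open(out_path, "w", newline="", encoding="utf-16") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(["filename", "text"])
        for path in sorted(Path(folder).glob("*.txt")):
            # Assumption: the input files are UTF-8; adjust if they are not.
            content = path.read_text(encoding="utf-8")
            for paragraph in split_paragraphs(content):
                writer.writerow([path.name, paragraph])


sample = (
    "Hello world!\nBlabla\n\n"
    "This is the 2nd paragraph. Here is more text\n"
    "and this is a very long paragraph."
)
print(split_paragraphs(sample))
```

Calling `write_paragraph_csv('.', 'output.csv')` would then write the rows for all text files in the current directory.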