【Posted】: 2021-12-02 10:30:07
【Question】:
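I am trying to iterate over some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I am just trying to get the relevant data into a list and to understand how reading a file line by line (readline()) works in Python.
This is what the text looks like: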
Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python
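The same format repeats for many articles within a single file. So far I have figured out how to extract the lines that contain certain text. For example, I can loop through the file and put all of the article titles in a list like this: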
a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            titleList.append(line)
Now I want to do the following:
a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample,encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach the line that includes "Subject:". Ignore the "Subject:" line, stop the "Full text:" subloop, and add the concatenated full text to the list array.
            # 2. Continue the for loop within which all of this sits.
As a Python beginner, I have been googling this topic. Any pointers would be much appreciated.
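The two steps described above can be sketched as a single pass that remembers whether it is currently inside a "Full text:" block, so no nested loop is needed. This is only a minimal sketch under some assumptions: every marker sits at the start of its line, each article ends with a "Subject:" line, and continuation lines are joined with a single space. The function name `parse_articles` and the dictionary keys are made up for illustration, and an in-memory `io.StringIO` stands in for `sample.txt` so the example runs on its own.

```python
import io

TITLE = "Title:"
FULL_TEXT = "Full text:"
SUBJECT = "Subject:"


def parse_articles(lines):
    """Return a list of dicts with 'title' and 'full_text' for each article.

    `lines` can be any iterable of strings, such as an open file object.
    """
    articles = []
    title = None
    text_parts = None  # None means we are not inside a "Full text:" block

    for raw in lines:
        line = raw.strip()  # drop the trailing newline and stray spaces
        if line.startswith(TITLE):
            title = line[len(TITLE):].strip()
        elif line.startswith(FULL_TEXT):
            # Start collecting; keep whatever follows the marker on this line.
            text_parts = [line[len(FULL_TEXT):].strip()]
        elif line.startswith(SUBJECT):
            # The "Subject:" line itself is ignored; it only closes the article.
            if text_parts is not None:
                articles.append({"title": title, "full_text": " ".join(text_parts)})
            title, text_parts = None, None
        elif text_parts is not None and line:
            # A continuation line of the full text.
            text_parts.append(line)
    return articles


# Stand-in for the file contents shown in the question.
sample = io.StringIO(
    "Title: title of an article\n"
    "Full text: unfortunately the full text of each article,\n"
    "is on numerous lines. Each article has a differing number\n"
    "of lines. In this example, there are three..\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: again unfortunately the full text of each article,\n"
    "is on numerous lines.\n"
    "Subject: Python\n"
)

articles = parse_articles(sample)
for article in articles:
    print(article["title"], "|", article["full_text"])
```

With a real file, the same function should accept the open file object directly, for example `with open('sample.txt', encoding="utf8") as unstr: articles = parse_articles(unstr)`. Since the result is a list of dictionaries, passing it to `pandas.DataFrame(articles)` would give one row per article with `title` and `full_text` columns.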
【Comments】:
Tags: python pandas dataframe nlp readline