【Question title】: Python Readline Loop and Subloop
【Posted】: 2021-12-02 10:30:07
【Question】:

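I am trying to loop through some unstructured text data in Python. The end goal is to structure it in a DataFrame. For now I am just trying to get the relevant data into an array (a list) and to understand line-by-line reading with readline() in Python.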

This is what the text looks like:

Title: title of an article
Full text: unfortunately the full text of each article,
is on numerous lines. Each article has a differing number 
of lines. In this example, there are three..
Subject: Python
Title: title of another article
Full text: again unfortunately the full text of each article,
is on numerous lines.
Subject: Python

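The same format repeats for many articles in the same file. So far I have figured out how to pull out lines that contain certain text. For example, I can loop through the file and put all the article titles in a list like this: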

a = "Title:"
titleList = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    # iterating over the file object yields one line at a time
    for line in unstr:
        if a in line:
            titleList.append(line)

Now I want to do the following:

a = "Title:"
b = "Full text:"
d = "Subject:"
list = []
sample = 'sample.txt'
with open(sample, encoding="utf8") as unstr:
    for line in unstr:
        if a in line:
            list.append(line)
        if b in line:
            # 1. Concatenate this line with each line after it, until I reach
            #    the line that includes "Subject:". Ignore the "Subject:" line,
            #    stop the "Full text:" subloop, and add the concatenated full
            #    text to the list array.
            # 2. Continue the for loop within which all of this sits.
            pass  # placeholder: this is the part I do not know how to write

As a Python beginner, I have been googling this topic without much luck. Any pointers would be appreciated.

【Comments】:

    Tags: python pandas dataframe nlp readline


    【Solution 1】:

    Since your goal is to build a DataFrame, here is a re + numpy + pandas solution:

    import re
    import pandas as pd
    import numpy as np
    
    # read the whole file into one string
    with open('sample.txt', encoding="utf8") as f:
        text = f.read()
    
    
    keys = ['Subject', 'Title', 'Full text']
    
    regex = '(?:^|\n)(%s): ' % '|'.join(keys)
    
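    # split the text on the keys; the capturing group keeps each key in the result,
    # and [1:] drops the (empty) text before the first key
    chunks = re.split(regex, text)[1:]
    # reshape the flat [key, value, key, value, ...] list to (article, field, key/value)
    # so that each article becomes one dict; assumes every article has all the keys
    df = pd.DataFrame([dict(e) for e in np.array(chunks).reshape(-1, len(keys), 2)])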
    

    Output:

                          Title                                                                                                                                               Full text Subject
    0       title of an article  unfortunately the full text of each article,\nis on numerous lines. Each article has a differing number \nof lines. In this example, there are three..  Python
    1  title of another article                                                                               again unfortunately the full text of each article,\nis on numerous lines.  Python
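
    To see what the split in Solution 1 produces before NumPy and pandas reshape it, here is a stdlib-only sketch. It assumes the text is already held in a string and that every article has exactly the three fields; the names `sample_text`, `pairs` and `records` are illustrative and not part of the answer. Because the pattern contains a capturing group, `re.split` keeps the matched keys in its result, so the list alternates key, value, key, value.

```python
import re

# Shortened stand-in for the contents of sample.txt (an assumption for this sketch).
sample_text = (
    "Title: title of an article\n"
    "Full text: unfortunately the full text of each article,\n"
    "is on numerous lines.\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: again unfortunately the full text of each article,\n"
    "is on numerous lines.\n"
    "Subject: Python\n"
)

keys = ['Subject', 'Title', 'Full text']
regex = '(?:^|\n)(%s): ' % '|'.join(keys)

# The first element is whatever precedes the first key (an empty string here).
chunks = re.split(regex, sample_text)[1:]

# Pair each key with the value that follows it.
pairs = list(zip(chunks[0::2], chunks[1::2]))

# Group every len(keys) pairs into one dict per article.
records = [
    dict(pairs[i:i + len(keys)])
    for i in range(0, len(pairs), len(keys))
]

# The last value keeps the newline from the end of the text, so strip the values.
records = [{k: v.strip() for k, v in rec.items()} for rec in records]
```

    Passing `records` to `pd.DataFrame(...)` should then give one row per article, much like the reshape-based version.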
    

    【Comments】:

      【Solution 2】:

      If you want to stick with your for loop, you probably need something like this:

      titles = []
      texts = []
      subjects = []
      
      with open('sample.txt', encoding="utf8") as f:
          inside_fulltext = False
          for line in f:
              if line.startswith("Title:"):
                  inside_fulltext = False
                  titles.append(line)
              elif line.startswith("Full text:"):
                  inside_fulltext = True
                  full_text = line
              elif line.startswith("Subject:"):
                  inside_fulltext = False
                  texts.append(full_text)
                  subjects.append(line)
              elif inside_fulltext:
                  full_text += line
              else:
                  # Possibly throw a format error here?
                  pass
      

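      (A couple of things: Python is odd about names. When you write list = [], you are actually overwriting the label for the list class, which can cause you problems later. You should really treat list, set, and so on like keywords, even though Python technically does not, just to save yourself trouble. Also, given your description of the data, the startswith method is a bit more precise here.)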
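
      A small illustration of the naming point, as a sketch only. The function names `shadowed` and `not_shadowed` and the variable name `articles` are made up for this example. Once the name `list` is rebound to a list object, calling `list(...)` in that scope no longer reaches the built-in type.

```python
def shadowed():
    list = []                # the local name now hides the built-in type
    try:
        return list("abc")   # tries to call the empty list object
    except TypeError as exc:
        return str(exc)


def not_shadowed():
    articles = []            # a different name leaves the built-in usable
    articles.extend(list("abc"))
    return articles


message = shadowed()         # the TypeError text, not a list
letters = not_shadowed()     # ['a', 'b', 'c']
```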

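      Alternatively, you could wrap the file object in an iterator (i = iter(f), then next(i)). That causes some headaches with catching the StopIteration exception, but it would let you use a more classic while loop for the whole thing. Personally, I would stick with the state machine approach above and make it robust enough to handle all reasonably expected edge cases.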
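
      For comparison, here is a minimal sketch of the iterator variant with an explicit inner loop. It assumes the layout from the question (a "Full text:" line, then continuation lines, then a "Subject:" line). The function name `read_articles` and the use of `io.StringIO` in place of a real file are illustrative. Passing a default to `next()` is one way to sidestep catching StopIteration.

```python
import io


def read_articles(lines):
    """Collect (title, full_text, subject) tuples from an iterable of lines."""
    it = iter(lines)
    articles = []
    title = None
    line = next(it, None)  # None signals end of input instead of StopIteration
    while line is not None:
        if line.startswith("Title:"):
            title = line[len("Title:"):].strip()
        elif line.startswith("Full text:"):
            parts = [line[len("Full text:"):].strip()]
            # Inner loop: consume lines until the "Subject:" line or end of input.
            line = next(it, None)
            while line is not None and not line.startswith("Subject:"):
                parts.append(line.strip())
                line = next(it, None)
            subject = line[len("Subject:"):].strip() if line is not None else None
            articles.append((title, " ".join(parts), subject))
        line = next(it, None)
    return articles


# io.StringIO stands in for open('sample.txt', encoding="utf8") in this sketch.
sample = io.StringIO(
    "Title: title of an article\n"
    "Full text: first line,\n"
    "second line.\n"
    "Subject: Python\n"
    "Title: title of another article\n"
    "Full text: only line.\n"
    "Subject: Python\n"
)
articles = read_articles(sample)
```

      The inner `while` advances the same iterator as the outer loop, so the outer loop resumes right after the "Subject:" line. The sketch keeps the subject for convenience; drop it from the tuple if only the title and text are needed.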

      【Comments】:
