【Question Title】: Python Text to Data Frame with Specific Pattern
【Posted】: 2021-07-12 20:04:35
【Question】:

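I am trying to convert a batch of text files into a data frame using Pandas.

Thanks to Stack Overflow's amazing community, I almost have the desired output (original post: Python Text File to Data Frame with Specific Pattern).

Basically, I need to use Pandas to convert text that follows a specific pattern (but sometimes has missing data) into a data frame.

Here is an example: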

Number 01600 London                           Register  4314

Some random text...

************************************* B ***************************************
 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADDR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN:            ADDR: Streetname 4   2000
 5 SHARE: 23/1284
   James Klein
   BORN:            ADDR:      
   c 4222/1988 Supporting Text
   f 4222/2000 Extra Text
************************************* C ***************************************
More random text...

From the example above, I need to convert the text between ***B*** and ***C*** into a data frame with the following output:

Number  Register  City    Id  Share    Name          Born        Address              c                          f                     h          i
01600   4314      London  1   73/1284  John Smith    1960-01-01  Streetname 3/2 1000  NaN                        4222/2001             1334/2000  5774/2000
01600   4314      London  4   58/1284  Boris Morgan  NaN         Streetname 4 2000    NaN                        NaN                   NaN        NaN
01600   4314      London  5   23/1284  James Klein   NaN         NaN                  4222/1988 Supporting Text  4222/2000 Extra Text  NaN        NaN

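Some patterns:

  • The first line of a group contains the word SHARE; the value before this word is the Id and the value after it is the Share.

  • The second line contains the person's name (it should be extracted in full into the Name variable).

  • The third line contains the birth date (BORN) and the address (ADDR). This information is sometimes missing; in those cases the variables Born and Address should be NaN.

  • When present, the fourth line and the lines after it (up to the start of the next group) begin with a lowercase letter. Each of these lines should be extracted into a variable named after that lowercase letter, and the value should run to the end of the line.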
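The four rules above can be written as regular expressions. The following is only a minimal sketch, assuming each group has already been split into a list of stripped lines; the names `parse_group`, `SHARE_RE`, `BORN_RE`, and `EXTRA_RE` are hypothetical and do not come from the original post.

```python
import re

# Hypothetical patterns mirroring the rules listed above.
SHARE_RE = re.compile(r"^(\d+)\s+SHARE:\s*(\S+)$")        # "<Id> SHARE: <Share>"
BORN_RE = re.compile(r"^BORN:\s*(.*?)\s*ADDR:\s*(.*)$")   # either field may be empty
EXTRA_RE = re.compile(r"^([a-z])\s+(.*)$")                # "<letter> <rest of line>"


def parse_group(lines):
    """Turn the stripped lines of one group into a flat dict."""
    share = SHARE_RE.match(lines[0])
    record = {"Id": int(share.group(1)), "Share": share.group(2), "Name": lines[1]}

    born = BORN_RE.match(lines[2])
    # Empty captures become None so pandas treats them as missing values.
    record["Born"] = (born.group(1) or None) if born else None
    # Collapse runs of spaces inside the address.
    record["Address"] = (" ".join(born.group(2).split()) or None) if born else None

    for line in lines[3:]:
        extra = EXTRA_RE.match(line)
        if extra:  # lines that do not start with a single lowercase letter are skipped
            record[extra.group(1)] = extra.group(2)
    return record


group = [
    "5 SHARE: 23/1284",
    "James Klein",
    "BORN:            ADDR:",
    "c 4222/1988 Supporting Text",
    "f 4222/2000 Extra Text",
]
print(parse_group(group))
```

A list of such dicts can be passed to `pd.DataFrame`, which fills keys that are absent from a record with NaN.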

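The code below works when the birth date and address are available and when each line from the fourth onward contains only one block of information (in the example above, SHARE: 73/1284 for John Smith has lines f, h, and i, each with a single block, whereas SHARE: 23/1284 for James Klein has lines with several blocks).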

import pandas as pd

text = '''Number 01600 London                           Register  4314

Some random text...

************************************* B ***************************************
 1 SHARE: 73/1284
   John Smith
   BORN: 1960-01-01 ADDR: Streetname 3/2   1000
   f 4222/2001
   h 1334/2000
   i 5774/2000
 4 SHARE: 58/1284
   Boris Morgan
   BORN:            ADDR: Streetname 4   2000
 5 SHARE: 23/1284
   James Klein
   BORN:            ADDR:      
   c 4222/1988 Supporting Text
   f 4222/2000 Extra Text
************************************* C ***************************************
More random text...'''

text = [i.strip() for i in text.splitlines()] # create a list of lines

data = []

# extract metadata from first line
number = text[0].split()[1]
city = text[0].split()[2]
register = text[0].split()[4]

# create a list of the index numbers of the lines where new items start
indices = [text.index(i) for i in text if 'SHARE' in i]
# split the list by the retrieved indexes to get a list of lists of items
items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]

for i in items:
    d = {'Number': number, 'Register': register, 'City': city, 'Id': int(i[0].split()[0]), 'Share': i[0].split(': ')[1], 'Name': i[1], 'Born': i[2].split()[1], }
    items = list(s.split() for s in i[3:])
    merged_items = []

    for i in items:
        if len(i[0]) == 1 and i[0].isalpha():
            merged_items.append(i)
        else:
            merged_items[-1][-1] = merged_items[-1][-1] + i[0]
    d.update({name: value for name,value in merged_items})
    data.append(d)

#load the list of dicts as a dataframe
df = pd.DataFrame(data)
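Two details of the code above account for the limitation described in the question. The sketch below isolates them on single lines taken from the sample text; the variable names are illustrative only.

```python
line = "c 4222/1988 Supporting Text"

# Splitting on every space yields four tokens, so unpacking into
# (name, value) raises ValueError for lines with more than one block.
try:
    name, value = line.split()
except ValueError as exc:
    error = str(exc)

# Splitting only once keeps everything after the letter as a single value.
name, value = line.split(maxsplit=1)

# When both fields are empty, the second token of the BORN line is the
# "ADDR:" label itself, not a birth date.
born_line = "BORN:            ADDR:"
second_token = born_line.split()[1]

print(error)
print((name, value))
print(second_token)
```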

Does anyone know how to solve these problems? Thanks in advance.

【Comments】:

    Tags: python pandas dataframe


    【Solution 1】:

    I would not use the same variable i for both the inner and the outer loop. Changing your for loop to the following should be cleaner:

    for i in items:
        d = {'Number': number, 
             'Register': register, 
             'City': city, 
             'Id': int(i[0].split()[0]), 
             'Share': i[0].split(': ')[1], 
             'Name': i[1], 
             }
        
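        if "ADDR" in i[2]:
            born, address = i[2].split("ADDR:")
            # empty fields become None so pandas reports them as missing
            d['Born'] = born.replace("BORN:", "").strip() or None
            d['Address'] = " ".join(address.split()) or None
        else:
            # assignment, not an annotation: store whatever follows "BORN:"
            d['Born'] = i[2].replace("BORN:", "").strip() or None
        
        for j in i[3:]:
            # split once so the value keeps everything after the letter
            key, _, value = j.partition(" ")
            # skip delimiter rows and free text that trail the last group
            if len(key) == 1 and key.islower():
                d[key] = value.strip()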
        data.append(d)
    
    #load the list of dicts as a dataframe
    df = pd.DataFrame(data)
    

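    【Comments】:

      【Solution 2】:

      You can get the index numbers of the delimiter lines and slice the list so that it contains only the relevant values: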

      import pandas as pd
      
      text = '''Number 01600 London                           Register  4314
      
      Some random text...
      
      ************************************* B ***************************************
       1 SHARE: 73/1284
         John Smith
         BORN: 1960-01-01 ADDR: Streetname 3/2   1000
         f 4222/2001
         h 1334/2000
         i 5774/2000
       4 SHARE: 58/1284
         Boris Morgan
         BORN:            ADDR: Streetname 4   2000
       5 SHARE: 23/1284
         James Klein
         BORN:            ADDR:      
         c 4222/1988 Supporting Text
         f 4222/2000 Extra Text
      ************************************* C ***************************************
      More random text...'''
      
      text = [i.strip() for i in text.splitlines()] # create a list of lines
      
      # extract metadata from first line
      number = text[0].split()[1]
      city = text[0].split()[2]
      register = text[0].split()[4]
      
      # get index numbers of delimiter values and filter list
      start, end = [text.index(i) for i in text if '*****' in i]
      text = text[start+1:end]
      
      data = []
      
      # create a list of the index numbers of the lines where new items start
      indices = [text.index(i) for i in text if 'SHARE' in i]
      # split the list by the retrieved indexes to get a list of lists of items
      items = [text[i:j] for i, j in zip([0]+indices, indices+[None])][1:]
      
      for item in items:
          # Born is kept only when the token after "BORN:" starts with a four-digit year
          born = item[2].split()[1]
          d = {'Number': number, 'Register': register, 'City': city,
               'Id': int(item[0].split()[0]), 'Share': item[0].split(': ')[1],
               'Name': item[1], 'Born': born if born[:4].isnumeric() else None}
          merged_items = []
      
          for parts in (s.split() for s in item[3:]):
              if len(parts[0]) == 1 and parts[0].isalpha():
                  # a new lettered entry: keep everything after the letter
                  merged_items.append([parts[0], ' '.join(parts[1:])])
              else:
                  # continuation line: append it to the previous entry
                  merged_items[-1][-1] += ' ' + ' '.join(parts)
          d.update({name: value for name, value in merged_items})
          data.append(d)
      
      #load the list of dicts as a dataframe
      df = pd.DataFrame(data)
      
         Number  Register  City    Id  Share    Name          Born        f                     h          i          c
      0  01600   4314      London  1   73/1284  John Smith    1960-01-01  4222/2001             1334/2000  5774/2000  nan
      1  01600   4314      London  4   58/1284  Boris Morgan              nan                   nan        nan        nan
      2  01600   4314      London  5   23/1284  James Klein               4222/2000 Extra Text  nan        nan        4222/1988 Supporting Text
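The columns in the output above appear in the order in which their keys were first seen, which differs from the layout requested in the question. One possible way to get a fixed column order is `DataFrame.reindex`; the sketch below uses a small set of hypothetical records rather than the full parsed data.

```python
import pandas as pd

# Hypothetical records shaped like the dicts built in the loop above.
data = [
    {"Id": 1, "Name": "John Smith", "Born": "1960-01-01", "f": "4222/2001"},
    {"Id": 5, "Name": "James Klein", "Born": None, "c": "4222/1988 Supporting Text"},
]
df = pd.DataFrame(data)

# Columns listed here but absent from the data are created and filled with NaN.
wanted = ["Id", "Name", "Born", "Address", "c", "f", "h", "i"]
df = df.reindex(columns=wanted)
print(df.columns.tolist())
```

Listing every expected letter column up front also keeps the layout stable across files in which some letters never occur.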

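      【Comments】:

      • Thanks again, RJ Adriaansen. It works almost as expected. For the lowercase-letter variables (f, h, i, c), I need everything that follows the lowercase letter, so for James Klein the variable c should contain "4222/1988 Supporting Text" and f should contain "4222/2000 Extra Text". Since the number of words on these lines can vary, the value should run to the end of the line.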