【Question title】:Create Pandas DataFrame from txt file with specific pattern
【Posted】:2017-05-14 04:14:27
【Question】:

I need to create a Pandas DataFrame from a text file with the following structure:

Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]

Lines with "[edit]" are states, and lines with [number] are regions. I need to split these apart and then repeat the state name for each region name.

Index          State          Region Name
0              Alabama        Auburn...
1              Alabama        Florence...
2              Alabama        Jacksonville...
...
8              Alaska         Fairbanks...
9              Arizona        Flagstaff...
10             Arizona        Tempe...


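I'm not sure how to split the text file on "[edit]" and "[number]" or "(characters)" into the corresponding columns and repeat the state name for each region name. Could anyone give me a starting point for doing this?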

【Question comments】:

Tags: python regex pandas text extract


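【Solution 1】:

It looks like you are coming from Coursera's Introduction to Data Science course. This solution passed my tests. I'd suggest not copying the whole solution, but using it for reference only :)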

import pandas as pd

rows = []
state = None
with open('university_towns.txt') as f:
    for line in f:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if '[edit]' in line:
            # state header, e.g. "Alabama[edit]" -> "Alabama"
            state = line.replace('[edit]', '')
        else:
            # region line: keep only the text before "(" if there is one
            region = line.split('(')[0].rstrip()
            rows.append([state, region])

df = pd.DataFrame(rows, columns=["State", "RegionName"])

【Comments】:

  • A shorter, more readable solution would be to loop with a regexp. That would eliminate those if/elif statements.
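
A minimal sketch of the regexp-based loop this comment suggests, assuming the same file layout as in the question. The function name `parse_towns` and both patterns are illustrative and are not taken from the answer above:

```python
import re
import pandas as pd

STATE_RE = re.compile(r'^(.*?)\[edit\]')          # "Alabama[edit]" -> "Alabama"
REGION_RE = re.compile(r'^(.*?)\s*(?:\(|\[|$)')   # text before "(", "[" or end of line

def parse_towns(lines):
    """Return a DataFrame with one (State, RegionName) row per region line."""
    rows, state = [], None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        m = STATE_RE.match(line)
        if m:
            state = m.group(1)  # remember the most recent state header
        else:
            rows.append((state, REGION_RE.match(line).group(1)))
    return pd.DataFrame(rows, columns=['State', 'RegionName'])

sample = [
    "Alabama[edit]\n",
    "Auburn (Auburn University)[1]\n",
    "Florence (University of North Alabama)\n",
    "Alaska[edit]\n",
    "Fairbanks (University of Alaska Fairbanks)[2]\n",
]
df = parse_towns(sample)
```

With the file from the question, the same function could be called on the open file object, for example `parse_towns(open('university_towns.txt'))`.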
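【Solution 2】:

You can first use read_csv with the names parameter to create a DataFrame with a single column, Region Name. As the separator, choose a value that does not occur in the data (such as ;):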

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])

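Then insert a new column State, built by using extract on the rows that contain the text [edit], and use replace to remove everything from ( onward in the values of column Region Name.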

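df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
# regex=True is required in pandas >= 2.0, where str.replace matches literally by default
df['Region Name'] = df['Region Name'].str.replace(r' \(.+$', '', regex=True)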

Finally, remove the rows containing the text [edit] by boolean indexing, using a mask created by str.contains:

df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df)
      State   Region Name
0   Alabama        Auburn
1   Alabama      Florence
2   Alabama  Jacksonville
3   Alabama    Livingston
4   Alabama    Montevallo
5   Alabama          Troy
6   Alabama    Tuscaloosa
7   Alabama      Tuskegee
8    Alaska     Fairbanks
9   Arizona     Flagstaff
10  Arizona         Tempe
11  Arizona        Tucson

If you need the full values, the solution is simpler:

df = pd.read_csv('filename.txt', sep=";", names=['Region Name'])
df.insert(0, 'State', df['Region Name'].str.extract(r'(.*)\[edit\]', expand=False).ffill())
df = df[~df['Region Name'].str.contains(r'\[edit\]')].reset_index(drop=True)
print (df)
      State                                        Region Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

【Comments】:

【Solution 3】:

TL;DR
s.groupby(s.str.extract(r'(?P<State>.*?)\[edit\]', expand=False).ffill()).apply(pd.Series.tail, n=-1).reset_index(name='Region_Name').iloc[:, [0, 2]]


regex = r'(?P<State>.*?)\[edit\]'  # pattern to match
print(s.groupby(
    # will get nulls where we don't have "[edit]"
    # forward fill fills in the most recent line
    # where we did have an "[edit]"
    s.str.extract(regex, expand=False).ffill()  
).apply(
    # I still have all the original values
    # If I group by the forward filled rows
    # I'll want to drop the first one within each group
    pd.Series.tail, n=-1
).reset_index(
    # munge the dataframe to get columns sorted
    name='Region_Name'
)[['State', 'Region_Name']])

      State                                        Region_Name
0   Alabama                      Auburn (Auburn University)[1]
1   Alabama             Florence (University of North Alabama)
2   Alabama    Jacksonville (Jacksonville State University)[2]
3   Alabama         Livingston (University of West Alabama)[2]
4   Alabama           Montevallo (University of Montevallo)[2]
5   Alabama                          Troy (Troy University)[2]
6   Alabama  Tuscaloosa (University of Alabama, Stillman Co...
7   Alabama                  Tuskegee (Tuskegee University)[5]
8    Alaska      Fairbanks (University of Alaska Fairbanks)[2]
9   Arizona         Flagstaff (Northern Arizona University)[6]
10  Arizona                   Tempe (Arizona State University)
11  Arizona                     Tucson (University of Arizona)

Setup

txt = """Alabama[edit]
Auburn (Auburn University)[1]
Florence (University of North Alabama)
Jacksonville (Jacksonville State University)[2]
Livingston (University of West Alabama)[2]
Montevallo (University of Montevallo)[2]
Troy (Troy University)[2]
Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
Tuskegee (Tuskegee University)[5]
Alaska[edit]
Fairbanks (University of Alaska Fairbanks)[2]
Arizona[edit]
Flagstaff (Northern Arizona University)[6]
Tempe (Arizona State University)
Tucson (University of Arizona)
Arkansas[edit]"""

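from io import StringIO
import pandas as pd

# the squeeze=True argument was removed in pandas 2.0; squeeze the single column instead
s = pd.read_csv(StringIO(txt), sep='|', header=None).squeeze("columns")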

【Comments】:

    【Solution 4】:

    Assume you have the following DF:

    In [73]: df
    Out[73]:
                                                     text
    0                                       Alabama[edit]
    1                       Auburn (Auburn University)[1]
    2              Florence (University of North Alabama)
    3     Jacksonville (Jacksonville State University)[2]
    4          Livingston (University of West Alabama)[2]
    5            Montevallo (University of Montevallo)[2]
    6                           Troy (Troy University)[2]
    7   Tuscaloosa (University of Alabama, Stillman Co...
    8                   Tuskegee (Tuskegee University)[5]
    9                                        Alaska[edit]
    10      Fairbanks (University of Alaska Fairbanks)[2]
    11                                      Arizona[edit]
    12         Flagstaff (Northern Arizona University)[6]
    13                   Tempe (Arizona State University)
    14                     Tucson (University of Arizona)
    15                                     Arkansas[edit]
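
One way such a one-column DF could be built is sketched below. The column name `text` comes from the output above; the helper name `lines_to_frame` is hypothetical. Reading the lines directly avoids having to pick a separator that never occurs in the data:

```python
import io
import pandas as pd

def lines_to_frame(lines):
    """Build a one-column DataFrame ('text') with one row per non-empty line."""
    cleaned = [line.rstrip('\n') for line in lines if line.strip()]
    return pd.DataFrame({'text': cleaned})

# Stand-in for an open file; a real file object can be passed the same way.
sample = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "\n"
    "Alaska[edit]\n"
)
df = lines_to_frame(sample)
```

For a file on disk, something like `with open('filename.txt') as f: df = lines_to_frame(f)` should work, where `filename.txt` is a placeholder path.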
    

    You can use the Series.str.extract() method:

    In [117]: df['State'] = df.loc[df.text.str.contains('[edit]', regex=False), 'text'].str.extract(r'(.*?)\[edit\]', expand=False)
    
    In [118]: df['Region Name'] = df.loc[df.State.isnull(), 'text'].str.extract(r'(.*?)\s*[\(\[]+.*[\n]*', expand=False)
    
    In [120]: df.State = df.State.ffill()
    
    In [121]: df
    Out[121]:
                                                     text     State   Region Name
    0                                       Alabama[edit]   Alabama           NaN
    1                       Auburn (Auburn University)[1]   Alabama        Auburn
    2              Florence (University of North Alabama)   Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]   Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]   Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]   Alabama    Montevallo
    6                           Troy (Troy University)[2]   Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...   Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]   Alabama      Tuskegee
    9                                        Alaska[edit]    Alaska           NaN
    10      Fairbanks (University of Alaska Fairbanks)[2]    Alaska     Fairbanks
    11                                      Arizona[edit]   Arizona           NaN
    12         Flagstaff (Northern Arizona University)[6]   Arizona     Flagstaff
    13                   Tempe (Arizona State University)   Arizona         Tempe
    14                     Tucson (University of Arizona)   Arizona        Tucson
    15                                     Arkansas[edit]  Arkansas           NaN
    
    In [122]: df = df.dropna()
    
    In [123]: df
    Out[123]:
                                                     text    State   Region Name
    1                       Auburn (Auburn University)[1]  Alabama        Auburn
    2              Florence (University of North Alabama)  Alabama      Florence
    3     Jacksonville (Jacksonville State University)[2]  Alabama  Jacksonville
    4          Livingston (University of West Alabama)[2]  Alabama    Livingston
    5            Montevallo (University of Montevallo)[2]  Alabama    Montevallo
    6                           Troy (Troy University)[2]  Alabama          Troy
    7   Tuscaloosa (University of Alabama, Stillman Co...  Alabama    Tuscaloosa
    8                   Tuskegee (Tuskegee University)[5]  Alabama      Tuskegee
    10      Fairbanks (University of Alaska Fairbanks)[2]   Alaska     Fairbanks
    12         Flagstaff (Northern Arizona University)[6]  Arizona     Flagstaff
    13                   Tempe (Arizona State University)  Arizona         Tempe
    14                     Tucson (University of Arizona)  Arizona        Tucson
    

    【Comments】:

      【Solution 5】:

      You can first parse the file into tuples:

      import pandas as pd
      from collections import namedtuple
      
      Item = namedtuple('Item', 'state area')
      items = []
      
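      with open('unis.txt') as f:
          for line in f:
              l = line.rstrip('\n')
              if l.endswith('[edit]'):
                  # slice off the suffix; rstrip('[edit]') strips any of the
                  # characters "[edit]" and would turn "Vermont" into "Vermon"
                  state = l[:-len('[edit]')]
              else:
                  # keep the text before " (" (the whole line if there is none)
                  area = l.split(' (')[0]
                  items.append(Item(state, area))
      
      df = pd.DataFrame.from_records(items, columns=['State', 'Area'])
      
      print(df)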
      

      Output:

            State          Area
      0   Alabama        Auburn
      1   Alabama      Florence
      2   Alabama  Jacksonville
      3   Alabama    Livingston
      4   Alabama    Montevallo
      5   Alabama          Troy
      6   Alabama    Tuscaloosa
      7   Alabama      Tuskegee
      8    Alaska     Fairbanks
      9   Arizona     Flagstaff
      10  Arizona         Tempe
      11  Arizona        Tucson
      

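      【Comments】:

        【Solution 6】:

        You will probably need to do some extra processing on the file before putting it into a DataFrame.

        A starting point would be to split the file into lines and search each line for the string [edit]; when it is present, use the name as a dictionary key......

        I don't think Pandas has any built-in method that handles a file in this format.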
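
The dictionary idea described above could look roughly like this. It is only a sketch, assuming every state line ends with `[edit]` and every other non-empty line is a region; `towns_by_state` is a hypothetical helper name:

```python
import pandas as pd

def towns_by_state(lines):
    """Map each state name to the list of region lines that follow it."""
    result = {}
    state = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.endswith('[edit]'):
            state = line[:-len('[edit]')]
            result[state] = []  # a state without regions keeps an empty list
        elif state is not None:
            result[state].append(line)
    return result

sample = [
    'Alabama[edit]',
    'Auburn (Auburn University)[1]',
    'Florence (University of North Alabama)',
    'Alaska[edit]',
    'Fairbanks (University of Alaska Fairbanks)[2]',
    'Arkansas[edit]',
]
mapping = towns_by_state(sample)

# Flatten the dictionary into (State, Region Name) rows for the DataFrame.
df = pd.DataFrame(
    [(state, region) for state, regions in mapping.items() for region in regions],
    columns=['State', 'Region Name'],
)
```

States that have no regions listed (such as Arkansas in the sample) stay in the dictionary with an empty list but contribute no rows to the DataFrame.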

        【Comments】:
