【Question title】: Python for loop with if/else and append function
【Posted】: 2025-11-21 11:50:02
【Question description】:

From the list below, I have to create a DataFrame with "State" and "Region" columns:

Raw data:

 Alabama[edit]
 Auburn (Auburn University)[1]
 Florence (University of North Alabama)
 Jacksonville (Jacksonville State University)[2]
 Livingston (University of West Alabama)[2]
 Montevallo (University of Montevallo)[2]
 Troy (Troy University)[2]
 Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]
 Tuskegee (Tuskegee University)[5]
 Alaska[edit]
 Fairbanks (University of Alaska Fairbanks)[2]
 Arizona[edit]
 Flagstaff (Northern Arizona University)[6]
 Tempe (Arizona State University)

(The data is linked here.)

Expected output:

State   Region
Alabama Auburn
Alabama Florence
Alabama Jacksonville
Alabama Livingston
Alabama Montevallo
Alabama Troy
Alabama Tuscaloosa
Alabama Tuskegee
Alaska  Fairbanks
Arizona Flagstaff
Arizona Tempe

Code:

    df = pd.DataFrame(columns=['State', 'RegionName'])
    with open('university_towns.txt', 'r') as UniversityList:
            content = UniversityList.readlines()
            state_row = []
            region_row = []
            for row in content:
                if '[edit]' in row:
                    state_row.append(row)
                    region_row.append('region_to_be_repeated')
                else:
                    region_row.append(row)
                    state_row.append('state_to_be_repeated')

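How can I replace 'state_to_be_repeated' with the value that was appended when the if condition was true?

【Question comments】:

  • Can you provide an example of the original DataFrame and the result you want?
  • Please edit that into your question; it is hard to follow as a comment.

Tags: python python-3.x pandas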


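【Solution 1】:

You can find an example of cleaning this dataset in the tutorial Pythonic Data Cleaning With NumPy and Pandas.

Option 1: String processing in "pure Python"

You can use a single for loop over the lines of the file, remembering the last state seen, which loads the data in O(n) time: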

import pandas as pd

university_towns = []

with open('input/university_towns.txt') as file:
    for line in file:
        edit_pos = line.find('[edit]')
        if edit_pos != -1:
            # Remember this `state` until the next is found
            state = line[:edit_pos]
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            parens = line.find(' (')
            town = line[:parens] if parens != -1 else line.strip()
            university_towns.append((state, town))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])

Option 2: String processing via the Pandas API

Alternatively, you can load the raw lines first and clean them up afterwards with Pandas' .str accessor:

import pandas as pd

university_towns = []

with open('input/university_towns.txt') as file:
    for line in file:
        if '[edit]' in line:
            # Remember this `state` until the next is found
            state = line
        else:
            # Otherwise, we have a city; keep `state` as last-seen
            university_towns.append((state, line))

towns_df = pd.DataFrame(university_towns,
                        columns=['State', 'RegionName'])

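# regex=True is required: since pandas 2.0, str.replace treats the pattern as a literal string by default
towns_df['State'] = towns_df.State.str.replace(r'\[edit\]\n', '', regex=True)
towns_df['RegionName'] = towns_df.RegionName\
    .str.strip()\
    .str.replace(r' \(.*', '', regex=True)\
    .str.replace(r'\[.*', '', regex=True)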

Output:

>>> towns_df.head()
     State    RegionName
0  Alabama        Auburn
1  Alabama      Florence
2  Alabama  Jacksonville
3  Alabama    Livingston
4  Alabama    Montevallo
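
The explicit loop can also be replaced by a forward fill. The following is a minimal sketch of that alternative, not part of the original answer. It assumes the same line layout as the sample data in the question; the in-memory raw sample and the names lines and is_state are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical in-memory sample standing in for university_towns.txt
raw = io.StringIO(
    "Alabama[edit]\n"
    "Auburn (Auburn University)[1]\n"
    "Florence (University of North Alabama)\n"
    "Alaska[edit]\n"
    "Fairbanks (University of Alaska Fairbanks)[2]\n"
)

# One row per line of the file, with surrounding whitespace removed
lines = pd.Series(raw.read().splitlines()).str.strip()

# State lines are the ones carrying the "[edit]" marker
is_state = lines.str.contains("[edit]", regex=False)

towns_df = pd.DataFrame({
    # Keep the name on state lines, NaN elsewhere, then carry it forward
    "State": lines.where(is_state).str.replace("[edit]", "", regex=False).ffill(),
    # Drop everything from the first " (" or "[" onwards
    "RegionName": lines.str.replace(r"\s*[\(\[].*", "", regex=True),
})

# Discard the state header rows themselves
towns_df = towns_df[~is_state].reset_index(drop=True)
print(towns_df)
```

To run this against the real file, the StringIO object would be swapped for the opened file; the rest stays the same under the stated assumption about the layout.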

【Comments】:

    【Solution 2】:

    The shortest version I could come up with:

    import pandas as pd
    
    lst = list()
    
    with open('university_towns.txt', 'r', newline='\n') as infile:
        for line in infile.readlines():
            if '[edit]' in line:
                state = line.split('[')[0]
            else:
                # Split on ' (' rather than ' ' so multi-word town names stay intact
                lst.append([state, line.split(' (')[0].strip()])
    
    df = pd.DataFrame(lst, columns=['State', 'RegionName'])
    print(df)
    

    On my machine (Python 3.6) this produces:

          State    RegionName
    0   Alabama        Auburn
    1   Alabama      Florence
    2   Alabama  Jacksonville
    3   Alabama    Livingston
    4   Alabama    Montevallo
    5   Alabama          Troy
    6   Alabama    Tuscaloosa
    7   Alabama      Tuskegee
    8    Alaska     Fairbanks
    9   Arizona     Flagstaff
    10  Arizona         Tempe
    

    【Comments】:

      【Solution 3】:

      If I understand your question and the expected output correctly, you can do this:

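      import pandas as pd

      universitylist = []
      with open('university_towns.txt', 'r') as file:
          for line in file:
              if '[edit]' in line:
                  # Remember the current state until the next one appears
                  state = line
              else:
                  # Pair each town with the last-seen state
                  universitylist.append([state, line])

      df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])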
      

      If you don't want the '[edit]', '[1]' and similar parts, you can change the code to:

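      import pandas as pd

      universitylist = []
      with open('university_towns.txt', 'r') as file:
          for line in file:
              if '[edit]' in line:
                  # Drop '[edit]' and keep only the state name
                  state = line.split('[')[0].strip()
              else:
                  # Drop trailing references such as '[1]'
                  universitylist.append([state, line.split('[')[0].strip()])

      df = pd.DataFrame(universitylist, columns=['State', 'RegionName'])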
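
To reduce each line to the bare state or town name, as in the expected output of the question, the cleanup can also be done with one regular expression. This is a minimal sketch, assuming every annotation starts with either "(" or "["; the helper name strip_annotations is hypothetical and not from the original answer.

```python
import re

# Optional whitespace, then "(" or "[", then everything up to the end of the line
_ANNOTATION = re.compile(r"\s*[\(\[].*")


def strip_annotations(line):
    """Return the leading name of a line, without '(...)' or '[...]' parts."""
    return _ANNOTATION.sub("", line.strip())


print(strip_annotations("Alabama[edit]\n"))
print(strip_annotations(
    "Tuscaloosa (University of Alabama, Stillman College, Shelton State)[3][4]\n"
))
```

Inside the loop, the helper would be applied to both the state line and the town line before appending.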
      

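      【Comments】:

      • You're right, Brad. I didn't look closely at the original code in the question. This should make it more efficient.
      • Right, I think I should go to bed. Not so sharp anymore ;-). I was going to add a while True: loop, but using for line in file: is probably better.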