【Question title】: Python conditional filtering in csv file
【Posted】: 2014-12-01 23:55:31
【Question】:

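Please help! I've tried different things/packages to write a program that takes 4 inputs and, based on the combination of those inputs, returns writing-score statistics for one group from a csv file. This is my first project, so I would appreciate any insights/hints/tips!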

Here is a sample of the csv (200 rows in total):

id  gender  ses schtyp  prog        write
70  male    low public  general     52
121 female  middle  public  vocation    68
86  male    high    public  general     33
141 male    high    public  vocation    63      
172 male    middle  public  academic    47
113 male    middle  public  academic    44
50  male    middle  public  general     59
11  male    middle  public  academic    34      
84  male    middle  public  general     57      
48  male    middle  public  academic    57      
75  male    middle  public  vocation    60      
60  male    middle  public  academic    57  

Here is what I have so far:

import csv
import numpy

# reads file; newline='' lets the csv module handle line endings itself
with open('scores.csv', newline='') as f:
    csv_file_object = csv.reader(f)
    header = next(csv_file_object)  # skips header
    data = list(csv_file_object)  # loads the remaining rows for processing
data = numpy.array(data)

# asks for inputs
gender = input('Enter gender [male/female]: ')
schtyp = input('Enter school type [public/private]: ')
ses = input('Enter socioeconomic status [low/middle/high]: ')
prog = input('Enter program status [general/vocation/academic]: ')

# makes them lower case (input() already returns strings)
prog = prog.lower()
gender = gender.lower()
schtyp = schtyp.lower()
ses = ses.lower()

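What I'm missing is how to filter and get statistics for just one specific group. For example, say I enter male, public, middle, and academic: I want the average writing score for that subset. I tried pandas' groupby function, but that only gave me statistics for broad groups (e.g. public vs. private). I also tried a pandas DataFrame, but that only let me filter on one input, and I couldn't figure out how to get the writing scores. Any hints would be much appreciated!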

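【Comments】:

  • Start by reading this section and see how far you get; basically everything you are asking for can be done
  • This looks like a typical case of boolean indexing on multiple columns of a DataFrame. Could you try the approach outlined here

Tags: python csv pandas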


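【Solution 1】:

Agree with Ramon: Pandas is definitely the way to go, and once you get used to it, it has extraordinary filtering/subsetting capabilities. It can be hard to wrap your head around at first, though (or at least it was for me!), so I dug up some examples of the kind of subsetting you need from some of my old code. The variable itu below is a Pandas DataFrame containing data for different countries over time.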

# Subsetting by using True/False:
subset = itu['CntryName'] == 'Albania'  # returns True/False values
itu[subset]  # returns 1x144 DataFrame of only data for Albania
itu[itu['CntryName'] == 'Albania']  # one-line command, equivalent to the above two lines

# Pandas has many built-in functions like .isin() to provide params to filter on    
itu[itu.cntrycode.isin(['USA','FRA'])]  # returns where itu['cntrycode'] is 'USA' or 'FRA'
itu[itu.year.isin([2000,2001,2002])]  # Returns all of itu for only years 2000-2002
# Advanced subsetting can include logical operations:
itu[itu.cntrycode.isin(['USA','FRA']) & itu.year.isin([2000,2001,2002])]  # Both of above at same time

# Use .loc with two elements to simultaneously select by row/index & column:
itu.loc['USA','CntryName']
itu.iloc[204,0]
itu.loc[['USA','BHS'], ['CntryName', 'Year']]
itu.iloc[[204, 13], [0, 1]]

# Can do many operations at once, but this reduces "readability" of the code
itu[itu.cntrycode.isin(['USA','FRA']) & 
    itu.year.isin([2000,2001,2002])].loc[:, ['cntrycode','cntryname','year','mpen','fpen']]

# Finally, if you're comfortable with using map() and list comprehensions,
# you can do some advanced subsetting that includes evaluations & functions
# to determine what elements you want to select from the whole, such as all
# countries whose name begins with "United":
criterion = itu['CntryName'].map(lambda x: x.startswith('United'))
itu[criterion]['CntryName']  # gives us UAE, UK, & US
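
Applied to the data in the question, the same boolean-mask idea might look like the sketch below. It assumes the file is whitespace-separated with the column names shown in the sample; the helper name group_write_stats and the inline sample rows are made up for illustration and are not part of the original answer.

```python
import io
import pandas as pd

# A few rows copied from the question's sample, standing in for scores.csv.
sample = io.StringIO("""\
id gender ses schtyp prog write
70 male low public general 52
172 male middle public academic 47
113 male middle public academic 44
11 male middle public academic 34
121 female middle public vocation 68
""")
scores = pd.read_csv(sample, sep=r"\s+")


def group_write_stats(df, gender, schtyp, ses, prog):
    """Summary statistics of `write` for one combination of the four inputs."""
    # Each comparison yields a True/False Series; & keeps rows where all hold.
    mask = (
        (df["gender"] == gender)
        & (df["schtyp"] == schtyp)
        & (df["ses"] == ses)
        & (df["prog"] == prog)
    )
    # Select the matching rows and only the `write` column, then summarize.
    return df.loc[mask, "write"].describe()


stats = group_write_stats(scores, "male", "public", "middle", "academic")
print(stats)
```

Each comparison needs its own parentheses because & binds more tightly than == in Python. The values typed in at the prompts could be passed straight to the helper in place of the hard-coded strings.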

【Comments】:

  • Thank you so much, TC Allen! It worked. Thanks for giving me some great hints and tips while I'm just starting to learn this :)
【Solution 2】:

Take a look at pandas. I think it will cut down your csv parsing work and give you the subsetting functionality you asked for...

import pandas as pd
# split columns on runs of whitespace (delim_whitespace=True is deprecated in recent pandas)
data = pd.read_csv('fileName.txt', sep=r'\s+')

#get all of the male students
data[data['gender'] == 'male']
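
The question mentions that groupby only produced statistics for broad groups. As a hedged sketch, grouping on all four columns at once gives one row of statistics per combination, which can then be looked up by its key tuple. The inline sample rows and the names keys and summary are assumptions for illustration.

```python
import io
import pandas as pd

# A few rows copied from the question's sample, standing in for the real file.
sample = io.StringIO("""\
id gender ses schtyp prog write
70 male low public general 52
86 male high public general 33
50 male middle public general 59
84 male middle public general 57
48 male middle public academic 57
60 male middle public academic 57
""")
data = pd.read_csv(sample, sep=r"\s+")

# Group on every input column, so each distinct combination is its own group.
keys = ["gender", "schtyp", "ses", "prog"]
summary = data.groupby(keys)["write"].agg(["count", "mean", "min", "max"])

# Look up one combination; the tuple follows the order of `keys`.
row = summary.loc[("male", "public", "middle", "general")]
print(row)
```

Looking up a combination that does not occur in the data raises a KeyError, so a real program would probably want to catch that and report that no students matched.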

【Comments】:

    猜你喜欢
    • 2023-04-10
    • 1970-01-01
    • 1970-01-01
    • 1970-01-01
    • 2020-08-18
    • 2011-09-20
    • 1970-01-01
    • 2018-07-16
    • 1970-01-01
    相关资源
    最近更新 更多