【Question title】: How to filter rows based on cell contents in a row-based expression
【Posted】: 2021-12-29 20:30:02
【Question】:

I read some data from a file. Because of the 'xxx' in the first data row, the first column is assigned the `object` dtype:

import pandas as pd

tips = pd.read_csv("tips.csv")
print(tips.head())
print(tips.info())

  total_bill   tip     sex smoker  day    time  size
0        xxx  1.01  Female     No  Sun  Dinner     2    
1      10.34  1.66    Male     No  Sun  Dinner     3    
2      21.01  3.50    Male     No  Sun  Dinner     3    
3      23.68  3.31    Male     No  Sun  Dinner     2    
4      24.59  3.61  Female     No  Sun  Dinner     4    
<class 'pandas.core.frame.DataFrame'>    
RangeIndex: 244 entries, 0 to 243    
Data columns (total 7 columns):    
 #   Column      Non-Null Count  Dtype      
---  ------      --------------  -----      
 0   total_bill  244 non-null    object     
 1   tip         244 non-null    float64    
 2   sex         244 non-null    object     
 3   smoker      244 non-null    object     
 4   day         244 non-null    object     
 5   time        244 non-null    object     
 6   size        244 non-null    int64 

So the following fails, because the 'xxx' in the first data row should have been a number:

tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])

How can I rewrite the line above so that it filters out the bad rows without actually changing the contents of the DataFrame?

【Comments】:

    Tags: python python-3.x pandas


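    【Solution 1】:

    You can wrap the column that contains 'xxx' in pd.to_numeric with errors='coerce'. This converts any non-numeric string to NaN for the calculation only, so the arithmetic works and the existing columns of your DataFrame stay unchanged: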

    tips['tip_pct'] = tips['tip'] / (pd.to_numeric(tips['total_bill'],errors='coerce') - tips['tip'])
    
      total_bill   tip     sex smoker  day    time  size   tip_pct
    0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
    1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
    2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
    3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
    4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
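To make the "DataFrame stays unchanged" point concrete, here is a minimal sketch on a hypothetical three-row stand-in for tips.csv (the inline data and the names `bill` and `good` are assumptions for illustration). It shows that the coerced values live only in a temporary Series, and that the same Series can be used to select the good rows if you would rather drop the bad ones from a view:

```python
import pandas as pd

# Hypothetical miniature of tips.csv; only the columns needed here.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "21.01"],
    "tip": [1.01, 1.66, 3.50],
})

# Temporary numeric view of the column; non-numeric cells become NaN.
bill = pd.to_numeric(tips["total_bill"], errors="coerce")
tips["tip_pct"] = tips["tip"] / (bill - tips["tip"])

# The source column still holds the original strings.
print(tips["total_bill"].tolist())

# Optional: a filtered view that leaves out the rows that failed to parse.
good = tips[bill.notna()]
print(good)
```

The NaN in `tip_pct` marks the rows that could not be computed, which is usually easier to audit later than silently dropping them.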
    

    【Comments】:

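      【Solution 2】:

      Another way: build a boolean mask for the valid rows, cast total_bill to float on those rows only, and compute. The rows excluded by the mask end up as NaN because the assignment aligns on the index:

      m = tips['total_bill'] != 'xxx'
      tips['tip_pct'] = tips.loc[m, 'tip'] / (tips.loc[m, 'total_bill'].astype(float) - tips.loc[m, 'tip'])

        total_bill   tip     sex smoker  day    time  size   tip_pct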
      0        xxx  1.01  Female     No  Sun  Dinner     2       NaN
      1      10.34  1.66    Male     No  Sun  Dinner     3  0.191244
      2      21.01  3.50    Male     No  Sun  Dinner     3  0.199886
      3      23.68  3.31    Male     No  Sun  Dinner     2  0.162494
      4      24.59  3.61  Female     No  Sun  Dinner     4  0.172069
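The mask above only catches the literal string 'xxx'. If other placeholders might appear, one possible variation (a sketch on made-up inline data, not part of the original answer) is to derive the mask from pd.to_numeric, so that any cell that does not parse as a number is excluded:

```python
import pandas as pd

# Hypothetical data with two different bad values in total_bill.
tips = pd.DataFrame({
    "total_bill": ["xxx", "10.34", "n/a", "23.68"],
    "tip": [1.01, 1.66, 3.50, 3.31],
})

# True where total_bill parses as a number, False for any other string.
m = pd.to_numeric(tips["total_bill"], errors="coerce").notna()

# Compute on the valid rows only; index alignment leaves NaN elsewhere.
good = tips.loc[m]
tips["tip_pct"] = good["tip"] / (good["total_bill"].astype(float) - good["tip"])

print(tips)
```

The total_bill column itself is never overwritten here; only the new tip_pct column is added.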
      

      【Comments】:

        【Solution 3】:

        Handle it at load time, in read_csv:

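        import numpy as np
        import pandas as pd

        tips = pd.read_csv('tips.csv',
                           na_values=['xxx'],  # parse the placeholder as NaN so the float cast can succeed
                           dtype={'total_bill': np.float64})

        # total_bill is now numeric, so the original expression works as written
        tips['tip_pct'] = tips['tip'] / (tips['total_bill'] - tips['tip'])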
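For a read-time fix to work, the parser has to be told what to do with the 'xxx' cell. A minimal sketch, assuming 'xxx' is the only placeholder and using an inline string as a stand-in for tips.csv, is to list it in the na_values argument of read_csv. Note that this differs from the other answers in one respect: the loaded DataFrame holds NaN rather than the original 'xxx' text.

```python
import io

import pandas as pd

# Stand-in for tips.csv so the sketch runs without the file.
csv_text = """total_bill,tip,size
xxx,1.01,2
10.34,1.66,3
21.01,3.50,3
"""

# 'xxx' is read as a missing value, so total_bill is inferred as a float column.
tips = pd.read_csv(io.StringIO(csv_text), na_values=["xxx"])
tips["tip_pct"] = tips["tip"] / (tips["total_bill"] - tips["tip"])

print(tips.dtypes)
print(tips)
```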
        

        【Comments】:
