【Question title】: Select rows randomly based on condition pandas python
【Posted】: 2016-06-02 13:53:55
【Question description】:

I have a small sample of test data:

import pandas as pd

df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
  'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

It looks like this:

df
Out[4]: 
      Clone   ID  Length
0       0   H900      48
1       1   H901      42
2       2   H902      48
3       2             48
4       2  M1435      48
5       2   M149      48
6       2   M157      48
7       2             48
8       3   M699      48
9       3   M920      48
10      3             48
11      4   M789      48
12      4   M617      48
13      4   M991      48
14      5   H903      48
15      5   M730      48
16      6   M191      48

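I want a simple script that randomly selects, for example, 5 rows, but only from rows that contain an ID. Rows with an empty ID must never be selected.

My script: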

import pandas as pd
import numpy as np

df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
  'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
  'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}

df = pd.DataFrame(df)

rows = np.random.choice(df.index.values, 5)
sampled_df = df.ix[rows]

sampled_df.to_csv('sampled_df.txt', sep = '\t', index=False)

But this script sometimes picks rows that do not contain an ID.

【Question discussion】:

    Tags: python pandas random


    【Solution 1】:

    I think you need to filter out the rows with an empty ID using boolean indexing:

    import pandas as pd
    import numpy as np
    
    df = {'ID': ['H900','H901','H902','','M1435','M149','M157','','M699','M920','','M789','M617','M991','H903','M730','M191'],
      'Clone': [0,1,2,2,2,2,2,2,3,3,3,4,4,4,5,5,6],
      'Length': [48,42  ,48,48,48,48,48,48,48,48,48,48,48,48,48,48,48]}
    
    df = pd.DataFrame(df)
    print (df)
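    # Boolean indexing: keep only the rows whose ID is not an empty string
    df = df[df.ID != '']
    
    # Draw 5 index labels from the filtered frame, then select those rows by label
    rows = np.random.choice(df.index.values, 5)
    sampled_df = df.loc[rows]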
    print (sampled_df)
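
One caveat with the snippet above: `np.random.choice` samples with replacement by default, so the same row can be returned more than once. The following is a minimal sketch, assuming duplicates are not wanted and a repeatable draw is useful. The seed value `0` and the names `with_id` and `rng` are illustrative choices, not part of the original answer.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'ID': ['H900', 'H901', 'H902', '', 'M1435', 'M149', 'M157', '', 'M699',
           'M920', '', 'M789', 'M617', 'M991', 'H903', 'M730', 'M191'],
    'Clone': [0, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6],
    'Length': [48, 42, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48],
})

# Boolean indexing: keep only rows whose ID is not the empty string.
with_id = df[df['ID'] != '']

# Seeded generator (seed 0 is an arbitrary, illustrative choice).
rng = np.random.default_rng(0)

# replace=False guarantees five distinct index labels.
rows = rng.choice(with_id.index.to_numpy(), size=5, replace=False)
sampled_df = with_id.loc[rows]
print(sampled_df)
```

Dropping the seed (calling `np.random.default_rng()` with no argument) gives a different selection on every run.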
    

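    【Discussion】:

    • Stumbled upon this recently and realized that with pandas==1.2.2 `ix` no longer works (it was deprecated and then removed in pandas 1.0); use `loc` instead.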
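
To illustrate why `loc` is the natural replacement here: after filtering, the frame keeps its original index labels, so the labels have gaps, and values drawn from `df.index` are labels rather than positions. A small sketch with a shortened, made-up frame (not the data from the question):

```python
import pandas as pd

# Shortened, illustrative frame (not the full data from the question).
df = pd.DataFrame({'ID': ['H900', '', 'M149', '', 'M699'],
                   'Clone': [0, 2, 2, 2, 3]})

filtered = df[df['ID'] != '']        # remaining index labels: 0, 2, 4

by_label = filtered.loc[[2, 4]]      # label-based: rows labelled 2 and 4
by_position = filtered.iloc[[1, 2]]  # position-based: 2nd and 3rd remaining rows

# Both selections pick the same two rows, reached in different ways.
print(by_label.equals(by_position))
```

Passing labels taken from `filtered.index` to `iloc` instead would either select the wrong rows or raise an `IndexError`, which is why label-based `loc` fits this use.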
    【Solution 2】:

    In this case you can also use `query` to filter and then `sample`, like this:

    df = df.query('(ID != "")').sample(n=5)
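
A slightly expanded sketch of the same idea. `DataFrame.sample` draws without replacement by default, so the five rows are distinct. The `random_state=0` seed is an arbitrary value used only to make the draw repeatable. The second part rests on an assumption that is not in the question: if the data were loaded from a file, blank IDs might be NaN or whitespace rather than `''`.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': ['H900', 'H901', 'H902', '', 'M1435', 'M149', 'M157', '', 'M699',
           'M920', '', 'M789', 'M617', 'M991', 'H903', 'M730', 'M191'],
    'Clone': [0, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 6],
    'Length': [48, 42, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48, 48],
})

# query keeps the rows with a non-empty ID; sample then draws 5 of them.
# random_state=0 is an arbitrary seed, used only for repeatability.
sampled_df = df.query('ID != ""').sample(n=5, random_state=0)
print(sampled_df)

# Assumption: blank IDs might also appear as NaN or whitespace-only strings.
# This mask treats all of those as "no ID".
has_id = df['ID'].notna() & (df['ID'].str.strip() != '')
sampled_any = df[has_id].sample(n=5, random_state=0)
```

Assigning the result to a new name such as `sampled_df`, rather than back to `df`, keeps the full frame available for further draws.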
    

    【Discussion】:
