【Question title】: pandas read_csv with chunksize
【Posted】: 2018-06-21 14:11:50
【Question】:
# calculate CTR
count_all = 0
count_4 = 0
for df in pd.read_csv(open("%s/tianchi_fresh_comp_train_user.csv" % root_path, 'r'), chunksize=10000):
    try:
        count_user = df['behavior_type'].value_counts()
        count_all += count_user[1] + count_user[2] + count_user[3] + count_user[4]
        count_4 += count_user[4]
    except StopIteration:
        print("Iteration is stopped.")

# CTR
print(count_all)
print(count_4)

Error message

But if I change chunksize from 10000 to 100000 (chunksize = 100000), the code runs without any problem.

Why do I get an error when I set chunksize = 10000?

【Comments】:

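  • The issue is that with chunks of 10000 rows, some chunks will not contain any row with behavior_type 4.
  • Yes, you are right. But how do I fix this? Should I check whether each chunk contains 1, 2, 3 and 4?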
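
The point raised in the comments can be reproduced on a tiny frame. The sketch below uses a made-up chunk (not the real Tianchi data) that contains no rows with behavior_type 4. `value_counts()` only lists values that actually occur, so looking up the missing label raises a `KeyError`. One possible way to avoid checking each label by hand is `reindex` with `fill_value=0`; the label list `[1, 2, 3, 4]` is an assumption taken from the question's code.

```python
import pandas as pd

# Hypothetical chunk that happens to contain no rows with behavior_type 4
chunk = pd.DataFrame({"behavior_type": [1, 1, 2, 3, 1]})

counts = chunk["behavior_type"].value_counts()

# value_counts() only reports values that occur, so label 4 is absent
try:
    counts[4]
    missing_label_raised = False
except KeyError:
    missing_label_raised = True

# reindex() adds the missing labels and fills their counts with 0
safe_counts = counts.reindex([1, 2, 3, 4], fill_value=0)
count_all = int(safe_counts.sum())
count_4 = int(safe_counts.loc[4])
```

With a larger chunksize every chunk is more likely to contain all four values, which would explain why the error does not show up at 100000.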

Tags: pandas


【Solution 1】:
import pandas as pd

count_all = 0
count_4 = 0
# root_path is the directory that holds the CSV file (defined elsewhere)
for df in pd.read_csv("%s/tianchi_fresh_comp_train_user.csv" % root_path, chunksize=10000):
    count_user = df['behavior_type'].value_counts()
    for i in range(1, 5):
        if i not in count_user.index:
            # this chunk has no rows of type i, so record a zero count
            count_user[i] = 0
        else:
            count_all += count_user[i]
    count_4 += count_user[4]

I modified the code and it works now: with chunksize=10000 there is no error.
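
As an alternative to patching each chunk's counts, the per-chunk results could be accumulated into one running Series. This is only a sketch under stated assumptions: the CSV text below is made-up sample data standing in for the real file, and the label list `[1, 2, 3, 4]` is taken from the question's code. `Series.add` aligns on the index, and `fill_value=0` treats a label missing from a chunk as a zero count.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real CSV file, small enough to read here
csv_text = "behavior_type\n1\n1\n2\n3\n1\n4\n1\n2\n"

# Running totals, one entry per expected behavior type
totals = pd.Series(0, index=[1, 2, 3, 4])

for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=3):
    counts = chunk["behavior_type"].value_counts()
    # add() aligns on the index; fill_value=0 covers labels absent from this chunk
    totals = totals.add(counts, fill_value=0)

# Alignment can promote the counts to float, so convert back to integers
totals = totals.astype(int)
count_all = int(totals.sum())
count_4 = int(totals.loc[4])
```

Keeping the totals in a single Series means no chunk is ever indexed by a label it might not contain, so the result does not depend on the chunksize.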

【Discussion】:
