【Question title】: Scientific notation being read as string in pandas
【Posted】: 2017-05-24 13:11:25
【Question】:

I am trying to read a .csv file that contains a column of numbers in scientific notation. No matter what I do, they end up being read as strings:

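import numpy as np
import pandas as pd

def readData(path, cols):
    # Requested dtype for each selected column, keyed by column position.
    types  = [str, str, str, str, np.float32]
    t_dict = {key: value for (key, value) in zip(cols, types)}

    # With chunksize set, read_csv returns an iterator of DataFrames, not a single DataFrame.
    df = pd.read_csv(path, header=0, sep=';', encoding='latin1', usecols=cols, dtype=t_dict, chunksize=5000)

    return df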

c = [3, 6, 7, 9, 16]
df2017_chunks = readData('Data/2017.csv', c)

def preProcess(df, f):    
    df.columns = f
    df['id_client'] = df['id_client'].apply(lambda x: str(int(float(x))))

    return df

f = ['issue_date', 'channel', 'product', 'issue', 'id_client']

df = pd.DataFrame(columns=f)
for chunk in df2017_chunks:
    aux = preProcess(chunk, f)
    df = pd.concat([df, aux])

How can I read this data correctly?

【Comments】:

Tags: python csv pandas scientific-notation


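【Answer 1】:

Your preprocessing function applies a string conversion as its last step, after the float and int conversions (`str(int(float(x)))`), so the column ends up holding strings. Is that the intended behavior?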
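To make the point concrete, here is a minimal sketch of what the lambda in `preProcess` does to a numeric column. The values are hypothetical and stand in for the real `id_client` data; only the conversion chain comes from the question.

```python
import pandas as pd

# Hypothetical values standing in for the real id_client column.
df = pd.DataFrame({'id_client': [1.5e3, 2.0e2]})
print(df['id_client'].dtype)  # float64 before the conversion

# Same conversion chain as in preProcess: float -> int -> str.
converted = df['id_client'].apply(lambda x: str(int(float(x))))

# Every element is now a Python string, so the column is no longer numeric.
print(converted.tolist())
print(all(isinstance(v, str) for v in converted))
```

If string IDs are what you actually want, this result is expected rather than a parsing failure.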

You could try:

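# chunksize is omitted here so that read_csv returns a DataFrame rather than an iterator of chunks.
df = pd.read_csv(path, header=0, sep=';', encoding='latin1', usecols=cols)
df.columns = f  # rename the columns as preProcess does, so that 'id_client' exists
df["id_client"] = pd.to_numeric(df["id_client"])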
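If the file has to be read in chunks, as in the question, the same conversion can be applied to each chunk before concatenating. The following is only a sketch under stated assumptions: the inline CSV text, its two columns, and the small `chunksize` are hypothetical stand-ins for `Data/2017.csv`, and the column is deliberately read as `str` to mimic the situation described in the question.

```python
import io
import pandas as pd

# Hypothetical sample data standing in for 'Data/2017.csv'.
csv_text = (
    "date;client\n"
    "2017-01-02;1.2345E+4\n"
    "2017-01-03;6.7E+3\n"
    "2017-01-04;8.9E+2\n"
)

f = ['issue_date', 'id_client']

# Read the column as strings on purpose, to reproduce the reported symptom.
chunks = pd.read_csv(io.StringIO(csv_text), header=0, sep=';',
                     dtype={'client': str}, chunksize=2)

parts = []
for chunk in chunks:
    chunk.columns = f
    # Convert the scientific-notation strings to floats, chunk by chunk.
    chunk['id_client'] = pd.to_numeric(chunk['id_client'])
    parts.append(chunk)

# Concatenate once at the end instead of growing a DataFrame inside the loop.
df = pd.concat(parts, ignore_index=True)
print(df.dtypes)
```

Collecting the chunks in a list and calling `pd.concat` once avoids repeatedly copying the accumulated DataFrame.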

【Comments】:

    【Answer 2】:

    Example DataFrame:

    df = pd.DataFrame({'issue_date': [1920,1921,1922,1923,1924,1925,1926],
        'name': ['jon doe1','jon doe2','jon doe3','jon doe4','jon doe5','jon doe6','jon doe7'],
        'id_client': ['18.61', '17.60', '18.27', '16.18', '16.81', '16.37', '67.07']})
    

    You can check the dtypes of the DataFrame with:

    print(df.dtypes)


    Output:

    id_client     object
    issue_date     int64
    name          object
    dtype: object
    

    Convert the dtype of df['id_client'] from object to float64 with:

    df['id_client'] = pd.to_numeric(df['id_client'], errors='coerce')


    With errors='coerce', any item that cannot be converted becomes NaN. Running
    print(df.dtypes) again then produces the following output:

    id_client     float64
    issue_date      int64
    name           object
    dtype: object
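The sample above uses plain decimals, so here is a small sketch of the same call on strings in scientific notation, which is the case the question asks about. The values are hypothetical, and the final step (turning the numbers back into integer-like string IDs) is an assumption about what the question's `preProcess` is trying to achieve.

```python
import pandas as pd

# Hypothetical values: two in scientific notation, one that is not a number.
s = pd.Series(['1.5E+3', '2.0e+2', 'n/a'])

# Scientific notation parses as float; the invalid entry becomes NaN.
nums = pd.to_numeric(s, errors='coerce')
print(nums.dtype)            # float64
print(nums.isna().tolist())  # [False, False, True]

# To get integer-like string IDs, drop NaN and round before casting,
# so a value such as 1499.9999 cannot be truncated to 1499.
ids = nums.dropna().round().astype('int64').astype(str)
print(ids.tolist())          # ['1500', '200']
```

Checking `nums.isna()` after the conversion is a quick way to find rows that errors='coerce' silently turned into NaN.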
    

    【Comments】:
