pandas read_json reads large integers as strings incorrectly: answers

【Question title】: pandas read_json reads large integers as strings incorrectly
【Posted】: 2018-09-12 14:28:56
【Question description】:

I am trying to read tweets stored as JSON files, and I am using pandas to load the data. However, I found some strange behavior in the read_json function. An MCVE follows:

import pandas as pd

json_content="""
{
    "1": {
        "tid": "9999999999999998"
    },
    "2": {
        "tid": "9999999999999999"
    },
    "3": {
        "tid": "10000000000000001"
    },
    "4": {
        "tid": "10000000000000002"
    }
}
"""
df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.info())
print(df)

On my machine this prints the following:

<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2  10000000000000000
3  10000000000000000
4  10000000000000002

The tid column does not hold the correct values. Why does this happen?

Note: this should not be an overflow case. The tid column is stored as int64, whose limit is roughly 9 times higher than the tid I originally tested with (see below):

import sys
# original problem
tid_0 = 956677215197970432
print(sys.maxsize, tid_0, sys.maxsize // tid_0)    # 0 would mean overflow is possible
# minimal case
tid = 10000000000000001
print(sys.maxsize, tid, sys.maxsize // tid)    # 0 would mean overflow is possible

#Output
9223372036854775807 956677215197970432 9
9223372036854775807 10000000000000001 922

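Update:

The values are read correctly when the parameter dtype=int is specified explicitly, but I don't understand why. What changes when we specify the dtype?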

【Question comments】:

The same issue has been reported: github.com/pandas-dev/pandas/issues/20608

Tags: python json python-3.x pandas

【Solution 1】:

You can specify the dtype explicitly:

In [32]: df=pd.read_json(json_content,
    ...:                 orient='index', # read as transposed
    ...:                 convert_axes=False, # don't convert keys to dates
    ...:                 dtype='int64'   # <------- NOTE
    ...:         )
    ...: print(df.info())
    ...: print(df)
    ...:
<class 'pandas.core.frame.DataFrame'>
Index: 4 entries, 1 to 4
Data columns (total 1 columns):
tid    4 non-null int64
dtypes: int64(1)
memory usage: 64.0+ bytes
None
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

It also works as expected if the JSON contains integers instead of string values:

In [61]: %paste
json_content="""
{
    "1": {
        "tid": 9999999999999998
    },
    "2": {
        "tid": 9999999999999999
    },
    "3": {
        "tid": 10000000000000001
    },
    "4": {
        "tid": 10000000000000002
    }
}
"""

df=pd.read_json(json_content,
                orient='index', # read as transposed
                convert_axes=False, # don't convert keys to dates
        )
print(df.dtypes)
print(df)

## -- End pasted text --
tid    int64
dtype: object
                 tid
1   9999999999999998
2   9999999999999999
3  10000000000000001
4  10000000000000002

So it appears to be related to type inference, because dtype=True by default, which the documentation describes as: "If True, infer dtypes".
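
One plausible mechanism, consistent with the wrong values shown in the question, is that the inferred conversion passes the strings through float64 before casting to int64. That is an assumption about the inference path, not something confirmed in this thread. float64 represents every integer exactly only up to 2**53, so larger odd values get rounded to a neighbor. A minimal pure-Python sketch of that effect (the helper name via_float is hypothetical):

```python
# float64 has a 53-bit significand, so integers above 2**53 are not all
# exactly representable, even though they fit comfortably in int64.
FLOAT64_EXACT_LIMIT = 2 ** 53   # 9007199254740992
INT64_MAX = 2 ** 63 - 1         # 9223372036854775807


def via_float(text):
    """Parse a numeric string through float first, then cast to int.

    Hypothetical helper: it mimics a lossy string -> float64 -> int64
    conversion, which is an assumed explanation, not pandas' actual code.
    """
    return int(float(text))


tids = ["9999999999999998", "9999999999999999",
        "10000000000000001", "10000000000000002"]

for text in tids:
    exact = int(text)          # direct integer parse keeps every digit
    lossy = via_float(text)    # float round trip may round the value
    print(text, exact, lossy, exact == lossy)
```

The two odd IDs come back as 10000000000000000, matching the wrong rows in the question, while every ID stays far below the int64 limit. This would also explain why an explicit integer dtype or integer JSON values avoid the problem, if those paths skip the float step.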

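【Comments】:

Thanks for the information. I am also looking for an explanation of this behavior. I happened to find that specifying dtype explicitly works before you posted. It is a workaround, but it does not answer my question. (The downvote you just got is not from me.)
@UdayrajDeshmukh, it is related to type inference: if you pass integers instead of strings (e.g. "tid": 10000000000000002 rather than "tid": "10000000000000002"), it works correctly. PS: the default is dtype=True, "If True, infer dtypes".
Actually, I am reading from about 100 JSON files (a sample of a Twitter database), which already store the tid column as strings.
@UdayrajDeshmukh, yes, I am still looking for the reason; it seems to be some kind of "type inference".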
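
For files that already store tid as a string, one way to sidestep dtype inference entirely is to parse the JSON with the standard library and convert the IDs with int(). A minimal sketch, assuming the same layout as the question's example; the helper name parse_tids is hypothetical:

```python
import json


def parse_tids(json_text):
    """Return {record_key: tid as int} without any float round trip.

    Hypothetical helper; assumes each record is an object with a
    string-valued "tid" field, as in the question's example.
    """
    records = json.loads(json_text)
    return {key: int(record["tid"]) for key, record in records.items()}


json_content = """
{
    "1": {"tid": "9999999999999998"},
    "2": {"tid": "9999999999999999"},
    "3": {"tid": "10000000000000001"},
    "4": {"tid": "10000000000000002"}
}
"""

tids = parse_tids(json_content)
for key, tid in tids.items():
    print(key, tid)
```

The resulting dict can then be handed to pandas, for example with pd.Series(tids, name='tid'), though it is worth checking the resulting dtype on your pandas version.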