【Question title】: Loading a file with more than one line of JSON into Pandas
【Posted】: 2015-07-17 06:07:18
【Question description】:

I am trying to read a JSON file into a Python pandas (0.14.0) DataFrame. Here is the first line of the JSON file:

{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "P_Mk0ygOilLJo4_WEvabAA", "review_id": "OeT5kgUOe3vcN7H6ImVmZQ", "stars": 3, "date": "2005-08-26", "text": "This is a pretty typical cafe.  The sandwiches and wraps are good but a little overpriced and the food items are the same.  The chicken caesar salad wrap is my favorite here but everything else is pretty much par for the course.", "type": "review", "business_id": "Jp9svt7sRT4zwdbzQ8KQmw"}

I am trying to do the following: df = pd.read_json(path)

I get the following error (with the full traceback):

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/d/anaconda/lib/python2.7/site-packages/pandas/io/json.py", line 198, in read_json
    date_unit).parse()
  File "/Users/d/anaconda/lib/python2.7/site-packages/pandas/io/json.py", line 266, in parse
    self._parse_no_numpy()
  File "/Users/d/anaconda/lib/python2.7/site-packages/pandas/io/json.py", line 483, in _parse_no_numpy
    loads(json, precise_float=self.precise_float), dtype=None)
ValueError: Trailing data

What is the Trailing data error, and how do I read this file into a DataFrame?

Following some suggestions, here are a few lines of the .json file:

{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "P_Mk0ygOilLJo4_WEvabAA", "review_id": "OeT5kgUOe3vcN7H6ImVmZQ", "stars": 3, "date": "2005-08-26", "text": "This is a pretty typical cafe.  The sandwiches and wraps are good but a little overpriced and the food items are the same.  The chicken caesar salad wrap is my favorite here but everything else is pretty much par for the course.", "type": "review", "business_id": "Jp9svt7sRT4zwdbzQ8KQmw"}
{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "TNJRTBrl0yjtpAACr1Bthg", "review_id": "qq3zF2dDUh3EjMDuKBqhEA", "stars": 3, "date": "2005-11-23", "text": "I agree with other reviewers - this is a pretty typical financial district cafe.  However, they have fantastic pies.  I ordered three pies for an office event (apple, pumpkin cheesecake, and pecan) - all were delicious, particularly the cheesecake.  The sucker weighed in about 4 pounds - no joke.\n\nNo surprises on the cafe side - great pies and cakes from the catering business.", "type": "review", "business_id": "Jp9svt7sRT4zwdbzQ8KQmw"}
{"votes": {"funny": 0, "useful": 0, "cool": 0}, "user_id": "H_mngeK3DmjlOu595zZMsA", "review_id": "i3eQTINJXe3WUmyIpvhE9w", "stars": 3, "date": "2005-11-23", "text": "Decent enough food, but very overpriced. Just a large soup is almost $5. Their specials are $6.50, and with an overpriced soda or juice, it's approaching $10. A bit much for a cafe lunch!", "type": "review", "business_id": "Jp9svt7sRT4zwdbzQ8KQmw"}

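According to its specification, the .json file I am using contains one JSON object per line.

I tried the jsonlint.com website as suggested, but it gives the following error: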

Parse error on line 14:
...t7sRT4zwdbzQ8KQmw"}{    "votes": {
----------------------^
Expecting 'EOF', '}', ',', ']'

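【Comments】:

  • There is additional data in the file that is not part of the JSON object.
  • What do the last few lines of the JSON file look like?
  • This example reads fine for me in pandas 0.16.0. Which version of pandas are you using?
  • @user62198 Update to 0.16.0; it includes some fixes to read_json.
  • @Cornel Ghiban, I can load the whole file or read it line by line. Converting it to the format you mentioned seems a bit difficult, since there are more than 5 million such records.

Tags: python json python-2.7 pandas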


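【Solution 1】:

I ran into the same problem. It happens when your data is written as lines separated by a line terminator such as '\n'; you need to read the lines one at a time first and then convert each line to a Python built-in type. This is how I solved it: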

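import json
import pandas as pd

with open("/path/to/file") as f:
    # parse each non-empty line as its own JSON object
    # (json.loads is used instead of eval, which breaks on true/false/null and is unsafe)
    data = [json.loads(line) for line in f if line.strip()]

data = pd.DataFrame(data)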
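
To see why parsing the whole file fails while parsing line by line works, here is a minimal sketch that uses only the standard library json module. The two-record sample is made up for illustration. The standard library reports the condition as "Extra data", which corresponds to what the pandas parser calls "Trailing data": a complete JSON value followed by more content.

```python
import json

# Two hypothetical records, one JSON object per line, like the review file.
raw = '{"stars": 3, "type": "review"}\n{"stars": 4, "type": "review"}\n'

# Parsing the whole text as a single document fails: after the first
# object ends, there is still more data left in the string.
try:
    json.loads(raw)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg  # the standard library's message for this case

# Parsing each line separately works, because each line is valid JSON.
records = [json.loads(line) for line in raw.splitlines() if line.strip()]
```

The resulting list of dicts can then be passed to pd.DataFrame as in the code above.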

Good luck!

【Comments】:

    【Solution 2】:

    The following code helped me load the JSON content into a DataFrame:

    import json
    import pandas as pd
    
    with open('Appointment.json', encoding="utf8") as f:
        # parse each non-empty line into a dict
        data = [json.loads(line) for line in f if line.strip()]
    df = pd.DataFrame(data)  # build the DataFrame from the list of dicts
    

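    【Comments】:

      【Solution 3】:

      I ran into a similar problem.

      It turns out that pd.read_json('myfile.json') resolves a relative path against the current working directory, and if the file is not found there it returns this "Trailing data" error.

      I figured this out because when I tried the same thing with open('myfile.json', 'r') I got a FileNotFound error, so I checked the path.

      I had not moved myfile.json into the same folder as my notebook.

      Changing the call to pd.read_json('../myfile.json') made it work.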
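
One way to get a clearer failure is to check the path before handing it to pandas. The sketch below uses only pathlib; require_file is a hypothetical helper name, not part of pandas. Older pandas versions also accept literal JSON text as the first argument, so a path that does not exist can end up being parsed as JSON, which is likely why the failure shows up as a ValueError rather than a missing-file error.

```python
from pathlib import Path


def require_file(path_str):
    """Hypothetical helper: return the resolved path, or raise FileNotFoundError."""
    path = Path(path_str).expanduser().resolve()
    if not path.is_file():
        # Include the working directory, since relative paths are resolved against it.
        raise FileNotFoundError(f"No such file: {path} (cwd is {Path.cwd()})")
    return path
```

With this in place, a call such as pd.read_json(require_file('../myfile.json')) fails early with a message that names the resolved path.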

      【Comments】:

      • It's silly that it gives ValueError: Trailing data when it should give FileNotFound. This happened to me too.

      【Solution 4】:

      Starting with pandas 0.19.0, you can use the lines parameter, as follows:

      import pandas as pd
      
      data = pd.read_json('/path/to/file.json', lines=True)
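
For a file with millions of records, as mentioned in the question comments, lines=True can be combined with the chunksize parameter so the file is processed in pieces instead of being loaded at once. The following is a minimal sketch, assuming a recent pandas version in which the returned reader can be used as a context manager; the file name reviews.json and the ten sample records are made up for illustration.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample data standing in for the real review file.
records = [{"review_id": f"r{i}", "stars": i % 5 + 1} for i in range(10)]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reviews.json")
    with open(path, "w", encoding="utf8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")  # one JSON object per line

    chunk_sizes = []
    star_total = 0
    # With chunksize set, read_json returns an iterator of DataFrames
    # rather than a single DataFrame.
    with pd.read_json(path, lines=True, chunksize=4) as reader:
        for chunk in reader:
            chunk_sizes.append(len(chunk))
            star_total += int(chunk["stars"].sum())
```

Each chunk is an ordinary DataFrame, so it can be aggregated or filtered and then discarded, which keeps memory use bounded by the chunk size.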
      

      【Comments】:

      【Solution 5】:

      You have to read it line by line. For example, you can use the following code provided by ryptophan on reddit:

      import pandas as pd
      
      # read the entire file into a python array
      with open('your.json', 'rb') as f:
          data = f.readlines()
      
      # remove the trailing "\n" from each line
      data = map(lambda x: x.rstrip(), data)
      
      # each element of 'data' is an individual JSON object.
      # i want to convert it into an *array* of JSON objects
      # which, in and of itself, is one large JSON object
      # basically... add square brackets to the beginning
      # and end, and have all the individual business JSON objects
      # separated by a comma
      data_json_str = "[" + ','.join(data) + "]"
      
      # now, load it into pandas
      data_df = pd.read_json(data_json_str)
      

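      【Comments】:

      • Hi, I am trying to read a JSON file and store it in a DataFrame. However, when I use your code I get the error "TypeError: sequence item 0: expected str instance, bytes found". Do you know what is wrong?
      • Change 'rb' on line 4 to 'r' and you should no longer get the bytes error.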
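
Putting that comment together with the answer, a Python 3 variant of the same join-into-an-array approach might look like the sketch below. It is an illustration, not the original author's code: the two-line sample is made up, io.StringIO stands in for open('your.json', 'r'), and the final StringIO wrapper is there because recent pandas versions deprecate passing literal JSON text to read_json.

```python
import io

import pandas as pd

# Hypothetical two-line sample in the same one-object-per-line layout.
raw = '{"stars": 3, "type": "review"}\n{"stars": 5, "type": "review"}\n'

# Text mode yields str lines, so ','.join works on Python 3.
# io.StringIO(raw) stands in for open('your.json', 'r') here.
lines = [line.strip() for line in io.StringIO(raw) if line.strip()]

# Wrap the comma-separated objects in brackets to form one JSON array.
data_json_str = "[" + ",".join(lines) + "]"

# Pass a file-like object rather than the raw string.
data_df = pd.read_json(io.StringIO(data_json_str))
```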