【Question Title】: Convert JSON to CSV with pandas
【Posted】: 2018-11-06 13:23:21
【Question Description】:

I have a JSON file containing 46k+ tweets in English and other languages, and I want to save it as a CSV file. Below is part of the JSON file.

    [{"user_id": 938118866135343104, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…", "tweet_id": 976166125502427136}
    {"user_id": 959235642, "date_time": "03/20/2018 18:38:35", "tweet_content": "At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. \n\n#PZvQG \n#ABC", "tweet_id": 976166125535973378}
    {"user_id": 395163528, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali\nWatch ball by ball highlights at (link removed)\n\n#PZvQG #HBLPSL #PSL2018 @_crici…", "tweet_id": 976166126202839040}
    {"user_id": 3117825702, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.", "tweet_id": 976166126483902466}
    {"user_id": 3310967346, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…", "tweet_id": 976166126559354880}
    {"user_id": 701494826194354179, "date_time": "03/20/2018 18:38:35", "tweet_content": "I wanted a super over????\n#PZvQG", "tweet_id": 976166126836178944}
    {"user_id": 347132028, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG", "tweet_id": 976166126685171713}
    {"user_id": 3461853618, "date_time": "03/20/2018 18:38:35", "tweet_content": "RT @walterMiitty: It's harder than I thought to tell the truth\nIt's gonna leave you in pieces\nAll alone with your demons\nAnd I know that we…", "tweet_id": 976166126924201986}]

I followed this solution to convert it to CSV, but I get an invalid syntax error on the Urdu tweets. I also tried this:

    import json

    with open("PeshVsQuetta.json") as f:
        all_tweets = []
        for line in f:
            text_dict = json.loads(line)
            all_tweets.append(text_dict)

    print(all_tweets[0]['tweet_content'])

This gives me the following error.

    UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position 148: character maps to <undefined>

I even saved the JSON file as a txt file and tried this:

    import pandas as pd
    from ast import literal_eval
    columns = ['Tweet ID','Author ID','Tweet','Time']
    df1 = pd.DataFrame(columns = columns)
    f = open('PeshvsQuetta.txt',encoding = 'utf-8')
    counter = 1
    for line in f:
        # Skip the first line of the file
        if counter != 1:
            s1 = literal_eval(line)
            ser = pd.Series([s1['tweet_id'],s1['user_id'],s1['tweet_content'],s1["date_time"]],index=['Tweet ID','Author ID','Tweet','Time'])
            df1 = df1.append(ser,ignore_index=True)
        counter = counter + 1
    df1.to_csv('PeshVsQuetta1.csv', encoding='utf-8',index=False,columns = columns)

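But in the resulting CSV file, each series is saved in a single cell, there are many empty rows, and some tweets are spread across multiple lines. A screenshot is below.

Any help would be greatly appreciated.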

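【Comments】:

  • Have you tried pd.read_csv()?
  • I've closed it, but it isn't a 100% duplicate.
  • Hmm, I'm going to reopen this; the duplicate doesn't fit, in my view.
  • @coldspeed To me it is an exact duplicate: the error is the same and the solution is the same.
  • @eyllanesc I agree, but the OP's code has some other issues that need fixing. OK, I've used up my votes on this question, but I'll see whether I can get it closed.

Tags: python json python-3.x pandas csv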


【Solution 1】:

You should be able to use Pandas as follows:

    import pandas as pd

    with open('PeshVsQuetta.json', encoding='utf-8-sig') as f_input:
        df = pd.read_json(f_input)

    df.to_csv('PeshVsQuetta.csv', encoding='utf-8', index=False)

This assumes your JSON file begins with a BOM (byte order mark). For the data you gave above, this produces the following CSV file:

date_time,tweet_content,tweet_id,user_id
2018-03-20 18:38:35,RT @PTISPOfficial: پاکستان تحریک انصاف کے وائس چیئرمین شاہ محمود قریشی  بغیر کسی پروٹوکول کے پاکستان سپر لیگ کا میچ دیکھنے کے لئے اسٹیڈیم م…,976166125502427136,938118866135343104
2018-03-20 18:38:35,"At last, Pakistan Have Witnessed The Most Thrilling Match Of Cricket In Pakistan, The Home. 

#PZvQG 
#ABC",976166125535973378,959235642
2018-03-20 18:38:35,"RT @thePSLt20: SIX! 19.4 Liam Dawson to Anwar Ali
Watch ball by ball highlights at (link removed)

#PZvQG #HBLPSL #PSL2018 @_crici…",976166126202839040,395163528
2018-03-20 18:38:35,RT @JeremyMcLellan: Rumor has it Amir Liaquat isn’t allowed to play in #PSL2018 because he keeps switching teams every week.,976166126483902466,3117825702
2018-03-20 18:38:35,RT @daniel86cricket: Peshawar beat Quetta by 1 run in one of the best T20 thrillers. PSL played in front of full house in Lahore Pakistan i…,976166126559354880,3310967346
2018-03-20 18:38:35,"I wanted a super over?
#PZvQG",976166126836178944,701494826194354179
2018-03-20 18:38:35,RT @hinaparvezbutt: Congratulations Peshawar Zalmi over great win but Quetta Gladiators won our hearts ♥️  #PZvQG,976166126685171713,347132028
2018-03-20 18:38:35,"RT @walterMiitty: It's harder than I thought to tell the truth
It's gonna leave you in pieces
All alone with your demons
And I know that we…",976166126924201986,3461853618
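
On the encoding choice in the code above: the 'charmap' error in the question is typical of a file being decoded with a Windows code page such as cp1252, which open() can fall back to when no encoding is given. Below is a minimal sketch, assuming small stand-in files (the file names and the single record are made up for illustration), of how utf-8-sig reads UTF-8 text whether or not a BOM is present, while cp1252 fails on Urdu text.

```python
import json
import os
import tempfile

# Hypothetical stand-in record; the Urdu word is spelled with escapes so the
# exact code points are unambiguous (the last one encodes to bytes D9 81).
record = [{"tweet_content": "\u0627\u0646\u0635\u0627\u0641", "tweet_id": 1}]
payload = json.dumps(record, ensure_ascii=False)

results = {}
with tempfile.TemporaryDirectory() as tmp:
    for label, codec in (("with_bom", "utf-8-sig"), ("without_bom", "utf-8")):
        path = os.path.join(tmp, f"sample_{label}.json")
        # Writing with utf-8-sig prepends a BOM; plain utf-8 does not.
        with open(path, "w", encoding=codec) as f:
            f.write(payload)
        # Reading with utf-8-sig drops a leading BOM and is harmless without one.
        with open(path, encoding="utf-8-sig") as f:
            results[label] = json.load(f)

    # Decoding the same bytes as cp1252 fails: byte 0x81 has no mapping there.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_failed = False
    except UnicodeDecodeError:
        cp1252_failed = True

print(results["with_bom"] == record, results["without_bom"] == record, cp1252_failed)
```

Passing the encoding explicitly, rather than relying on the platform default, is what makes the read behave the same on every machine.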

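Note: some of your fields contain newlines, so the output may look a little odd. An application reading this file will handle it correctly (as long as you tell it on import that the encoding is UTF-8).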
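
To illustrate the note about newlines, here is a small sketch that uses Python's csv module as a stand-in for "an application reading this file". The two rows are hypothetical. It writes a field containing a newline, then parses the text back and checks that the record count and the field contents are unchanged.

```python
import csv
import io

# Hypothetical rows; the second tweet contains an embedded newline.
rows = [
    {"tweet_content": "single line", "tweet_id": "1"},
    {"tweet_content": "I wanted a super over?\n#PZvQG", "tweet_id": "2"},
]

buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=["tweet_content", "tweet_id"])
writer.writeheader()
writer.writerows(rows)  # the multi-line field is wrapped in double quotes
csv_text = buffer.getvalue()

# The header plus two records take up more physical lines than records ...
physical_lines = len(csv_text.splitlines())

# ... yet a CSV parser still recovers exactly two records, newline intact.
restored = list(csv.DictReader(io.StringIO(csv_text, newline="")))

print(physical_lines, len(restored), restored == rows)
```

The extra physical lines are only how the text looks in an editor; the quoting is what lets a parser keep each tweet in one record.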

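【Comments】:

  • Thanks. When I read the CSV file into a DataFrame it was handled correctly, but this data was collected for a research project, so I want it in the proper format.
  • The output I showed is correct. Why do you think it is not formatted correctly?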
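
If one physical line per tweet really is required, one option that is not part of the answer above is to replace the newlines inside the text before writing. This is a sketch under that assumption, with a made-up two-row DataFrame standing in for the result of pd.read_json. Note that it alters the tweet text, so it is a lossy step.

```python
import pandas as pd

# Hypothetical frame standing in for the DataFrame returned by pd.read_json.
df = pd.DataFrame({
    "tweet_content": [
        "I wanted a super over?\n#PZvQG",
        "At last.\n\n#PZvQG \n#ABC",
    ],
    "tweet_id": [976166126836178944, 976166125535973378],
})

# Collapse each newline, plus any whitespace around it, into a single space.
flat = df.copy()
flat["tweet_content"] = flat["tweet_content"].str.replace(
    r"\s*\n\s*", " ", regex=True
)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = flat.to_csv(index=False)
print(csv_text)
```

Keeping the original DataFrame untouched (via the copy) means the unmodified tweets can still be exported separately if the exact text matters for the research.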