【Posted】: 2021-06-24 01:10:25
【Question】:
I have a JSON file with the following structure (this is not the complete file, but the structure is the same):
{"data":[{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxx","id":"xxxxxxxxxxx"},{"referenced_tweets":[{"type":"retweeted","id":"xxxxxxxxxxxx"}],"text":"abcdefghijkl","created_at":"2020-03-09T00:11:41.000Z","author_id":"xxxxxxxx","id":"xxxxxxxxxxx"}]}
.....
//The rest of json continues with the same structure, but referenced_tweets is not always present
My question: how can I load this data into a DataFrame with the following columns: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id)?
What I can do so far: I can get the following columns:
| referenced_tweets | text | created_at | author_id | id (tweet id) |
|---|---|---|---|---|
| [{'type': 'xx', 'id': 'xxx'}] | xxx | xxxx | xxxxx | xxxxxxxxxxxx |
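The code that produces the table above is:
import json
from pandas import json_normalize  # available at the top level in pandas >= 1.0

# Load the raw JSON file into a Python dict
with open('Test_SampleRetweets.json') as json_file:
    data_list = json.load(json_file)

# One row per tweet; referenced_tweets remains a list of dicts in a single column
df1 = json_normalize(data_list, 'data')
df1.head()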
However, I want type and id (the ones inside referenced_tweets) in separate columns. So far I can get the following:
| type | id (referenced_tweet id) |
|---|---|
| xxxx | xxxxxxxxxxxxxxxxxxxxxxx |
This is the code that produces the table above:
df2 = json_normalize(data_list, record_path=['data','referenced_tweets'], errors='ignore')
df2.head()
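For reference, `json_normalize` also takes a `meta` argument that copies parent-level fields onto each nested record, which may be one way to combine the two tables. The following is only a minimal sketch: the `sample` dict is made-up data mirroring the structure above, and `record_prefix="referenced_"` is my own choice to keep the nested `id` from clashing with the tweet-level `id`. It assumes every tweet has a `referenced_tweets` key; as far as I can tell, a record without that key makes `record_path` raise a `KeyError`, and `errors='ignore'` only covers missing `meta` keys.

```python
import pandas as pd

# Hypothetical sample data with the same shape as the JSON in the question
sample = {
    "data": [
        {"referenced_tweets": [{"type": "retweeted", "id": "111"}],
         "text": "abc", "created_at": "2020-03-09T00:11:41.000Z",
         "author_id": "a1", "id": "t1"},
        {"referenced_tweets": [{"type": "quoted", "id": "222"}],
         "text": "def", "created_at": "2020-03-09T00:12:00.000Z",
         "author_id": "a2", "id": "t2"},
    ]
}

# One row per referenced tweet, with the parent tweet's fields repeated via `meta`.
# The prefix avoids a name conflict between the nested `id` and the tweet `id`.
df_meta = pd.json_normalize(
    sample["data"],
    record_path="referenced_tweets",
    meta=["text", "created_at", "author_id", "id"],
    record_prefix="referenced_",
)
print(df_meta.columns.tolist())
```

The resulting frame has one row per entry in `referenced_tweets`, so a tweet that references several tweets would appear several times.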
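So what is the problem? I want everything in a single table, i.e. like the first table above, but with type and id in separate columns (as in the second table). The final columns should therefore be: type, id (referenced_tweet id), text, created_at, author_id, and id (tweet id).
Any help is appreciated.
Thanks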
【Discussion】:
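Since the question notes that `referenced_tweets` is not always present, one option that sidesteps the missing-key issue is to flatten the records in plain Python before building the DataFrame. This is a sketch rather than a definitive answer: `flatten_tweets`, the `payload` sample, and the column name `referenced_tweet_id` are all hypothetical names introduced here for illustration. Tweets without `referenced_tweets` are kept as a single row with empty `type` and `referenced_tweet_id`.

```python
import pandas as pd

COLUMNS = ["type", "referenced_tweet_id", "text", "created_at", "author_id", "id"]

def flatten_tweets(payload):
    """Return one row per (tweet, referenced tweet) pair.

    Tweets lacking `referenced_tweets` still yield one row, with the
    reference columns left empty.
    """
    rows = []
    for tweet in payload.get("data", []):
        base = {key: tweet.get(key) for key in ("text", "created_at", "author_id", "id")}
        # Fall back to a single empty reference when the key is missing or empty
        refs = tweet.get("referenced_tweets") or [{}]
        for ref in refs:
            rows.append({
                "type": ref.get("type"),
                "referenced_tweet_id": ref.get("id"),
                **base,
            })
    return pd.DataFrame(rows, columns=COLUMNS)

# Hypothetical sample: the second tweet has no referenced_tweets key
payload = {
    "data": [
        {"referenced_tweets": [{"type": "retweeted", "id": "111"}],
         "text": "abc", "created_at": "2020-03-09T00:11:41.000Z",
         "author_id": "a1", "id": "t1"},
        {"text": "def", "created_at": "2020-03-09T00:12:00.000Z",
         "author_id": "a2", "id": "t2"},
    ]
}

df_all = flatten_tweets(payload)
print(df_all)
```

With the real file, the same function could be called on the dict returned by `json.load`. Renaming the nested `id` keeps the two id columns distinguishable, which a frame with two columns both called `id` would not.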
Tags: python json pandas nested tweets