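【Posted】: 2021-01-28 06:24:50
【Question】:
I would like to know how to clean up my data so that I can understand it better and filter it more easily. So far I have been able to download a public Google Sheets document and save it as a CSV file. But when I print the data, it is messy and hard to read. The data comes from a website, and when I open the browser's developer tools I can see that it is neatly organized.
Like this (screenshot: website data in inspect page mode)
But when I print it in a Jupyter notebook, it actually looks like this mess: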
b'/*O_o*/\ngoogle.visualization.Query.setResponse({"version":"0.6","reqId":"0output=csv","status":"ok","sig":"1241529276","table":{"cols":[{"id":"A","label":"Entity","type":"string"},{"id":"B","label":"Week","type":"number","pattern":"General"},{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"},{"id":"D","label":"Flights 2019 (Reference)","type":"number","pattern":"General"},{"id":"E","label":"Flights","type":"number","pattern":"General"},{"id":"F","label":"% vs 2019 (Daily)","type":"number","pattern":"General"},{"id":"G","label":"Flights (7-day moving average)","type":"number","pattern":"General"},{"id":"H","label":"% vs 2019 (7-day moving average)","type":"number","pattern":"General"},{"id":"I","label":"Day 2019","type":"date","pattern":"yyyy-mm-dd"},{"id":"J","label":"Day Previous Year","type":"date","pattern":"yyyy-mm-dd"},{"id":"K","label":"Flights Previous Year","type":"number","pattern":"General"}],"rows":[{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,2)","f":"2020-09-02"},{"v":92.0,"f":"92"},{"v":59.0,"f":"59"},{"v":-0.358695652173913,"f":"-0,3586956522"},{"v":70.0,"f":"70"},{"v":-0.300998573466476,"f":"-0,3009985735"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":"Date(2019,8,4)","f":"2019-09-04"},{"v":92.0,"f":"92"}]},{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,3)","f":"2020-09-03"},{"v":96.0,"f":"96"},{"v":67.0,"f":"67"},{"v":-0.302083333333333,"f":"-0,3020833333"},
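The printed bytes do not look like CSV: they appear to be a JSON object wrapped in a `google.visualization.Query.setResponse(...)` call. A minimal sketch of unwrapping it with the standard library, assuming the JSON payload is everything between the first `(` and the last `)`. The helper name `unwrap_gviz` and the shortened `sample` bytes are hypothetical and only stand in for the real `r.content`:

```python
import json

def unwrap_gviz(raw: bytes) -> dict:
    """Strip the JavaScript callback wrapper and parse the JSON payload."""
    text = raw.decode("utf-8")
    start = text.index("(") + 1   # first "(" opens setResponse(
    end = text.rindex(")")        # last ")" closes it
    return json.loads(text[start:end])

# Shortened stand-in for r.content (assumption: same shape as the real response).
sample = (
    b'/*O_o*/\ngoogle.visualization.Query.setResponse('
    b'{"version":"0.6","status":"ok","table":{"cols":['
    b'{"id":"A","label":"Entity","type":"string"},'
    b'{"id":"C","label":"Day","type":"date","pattern":"yyyy-mm-dd"}],'
    b'"rows":[{"c":[{"v":"Albania"},{"v":"Date(2020,8,1)","f":"2020-09-01"}]}]}});'
)

payload = unwrap_gviz(sample)
labels = [col["label"] for col in payload["table"]["cols"]]
print(labels)  # ['Entity', 'Day']
```

Once unwrapped, `payload["table"]["cols"]` describes the columns and `payload["table"]["rows"]` holds one entry per data row.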
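Is there a pandas way to tidy up this data?
Basically, what I want to do is extract three variables from the data: country, date, and a number.
Here you can see how the data starts out under the heading "rows":
(Screenshot: code in Jupyter showing how the data starts out)
Basically, each entry gives a country, a date, and then a bunch of related numbers.
What I want to get is the country name, a specific date, and a specific number.
For example, here is a sample section; this sequence repeats throughout the data: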
{"c":[{"v":"Albania"},{"v":36.0,"f":"36"},{"v":"Date(2020,8,1)","f":"2020-09-01"},{"v":129.0,"f":"129"},{"v":64.0,"f":"64"},{"v":-0.503875968992248,"f":"-0,503875969"},{"v":71.5714285714286,"f":"71,57142857"},{"v":-0.291371994342291,"f":"-0,2913719943"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":"Date(2019,8,3)","f":"2019-09-03"},{"v":129.0,"f":"129"}]},
From this section I only want to take out the country name, Albania; the date, "2020-09-01"; and the number, -0.5038.
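Picking those three values out of one row could be sketched as follows, assuming the row has already been parsed into a Python dict and that the cells keep the column order A to K shown in the response. The helper names `gviz_date` and `pick_fields` are hypothetical; the month in `Date(2020,8,1)` is treated as zero-based because the formatted value next to it reads `2020-09-01`:

```python
import re
from datetime import date

def gviz_date(value: str) -> date:
    """Convert 'Date(2020,8,1)' to a date; the month is zero-based."""
    match = re.fullmatch(r"Date\((\d+),(\d+),(\d+)\)", value)
    year, month, day = map(int, match.groups())
    return date(year, month + 1, day)

def pick_fields(row: dict) -> tuple:
    """Return (country, day, pct) from one entry of table['rows']."""
    cells = row["c"]
    # Positions follow the column ids in the response:
    # 0 = column A (country), 2 = column C (day), 5 = column F (daily % vs 2019).
    return cells[0]["v"], gviz_date(cells[2]["v"]), cells[5]["v"]

# The first six cells of the sample row from the question.
row = {"c": [{"v": "Albania"}, {"v": 36.0, "f": "36"},
             {"v": "Date(2020,8,1)", "f": "2020-09-01"},
             {"v": 129.0, "f": "129"}, {"v": 64.0, "f": "64"},
             {"v": -0.503875968992248, "f": "-0,503875969"}]}

country, day, pct = pick_fields(row)
print(country, day.isoformat(), round(pct, 4))  # Albania 2020-09-01 -0.5039
```

Applying `pick_fields` to every row gives a list of tuples that can be passed to `pandas.DataFrame(records, columns=["country", "day", "pct"])`. The sketch does not handle empty cells, which would need a check before indexing `["v"]`.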
This is the code I use to fetch the Google Sheets data and save it as CSV:
import requests
import pandas as pd

# Fetch the sheet through the gviz query endpoint.
r = requests.get('https://docs.google.com/spreadsheets/d/1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI/gviz/tq?usp=sharing&tqx=reqId%3A0output=csv')
data = r.content  # raw response body as bytes
print(data)
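One likely reason the output is not CSV is the query string: `tqx=reqId%3A0output=csv` seems to be read as a request id of `0output=csv` (it is echoed back as `"reqId":"0output=csv"` in the response), so no output format is actually requested. A sketch of asking for CSV instead, assuming the Google Visualization query endpoint's `tqx=out:csv` option applies to this sheet. `build_csv_url` and `parse_csv` are hypothetical helper names, `sample_text` is made-up data, and nothing is fetched here:

```python
import csv
import io
from urllib.parse import urlencode

SHEET_ID = "1GJ6CvZ_mgtjdrUyo3h2dU3YvWOahbYvPHpGLgovyhtI"

def build_csv_url(sheet_id: str) -> str:
    """Build a gviz URL that asks for CSV output instead of the JSON wrapper."""
    query = urlencode({"tqx": "out:csv"})
    return f"https://docs.google.com/spreadsheets/d/{sheet_id}/gviz/tq?{query}"

def parse_csv(text: str) -> list:
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

url = build_csv_url(SHEET_ID)
# With requests, the text would come from: requests.get(url).text

# Made-up stand-in for the downloaded text (assumption about the CSV shape).
sample_text = (
    '"Entity","Day","% vs 2019 (Daily)"\n'
    '"Albania","2020-09-01","-0,503875969"\n'
)
records = parse_csv(sample_text)
# Decimals may use a comma, as in the formatted values of the response.
pct = float(records[0]["% vs 2019 (Daily)"].replace(",", "."))
print(records[0]["Entity"], records[0]["Day"], pct)
```

If the endpoint does return CSV, `pandas.read_csv(url, decimal=",")` may be enough to load it directly into a DataFrame.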
Any and all suggestions would be great.
Thanks
【Comments】: