Making a histogram from JSON data: answers

【Question title】: Making a histogram from Json data
【Posted】: 2021-11-28 20:16:12
【Question description】:

I have data in a JSON-like format that looks like this:

{
   "ts": 1393631983,
   "visitor_uuid": "ade7e1f63bc83c66",
   "visitor_source": "external",
   "visitor_device": "browser",
   "visitor_useragent": "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.16",
   "visitor_ip": "b5af0ba608ab307c",
   "visitor_country": "BR",
   "visitor_referrer": "53c643c16e8253e7",
   "env_type": "reader",
   "env_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "event_type": "pagereadtime",
   "event_readtime": 1010,
   "subject_type": "doc",
   "subject_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "subject_page": 3
} {
    "ts": 1393631983,
    "visitor_uuid": "232eeca785873d35",
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
    "visitor_ip": "fcf9c67037f993f0",
    "visitor_country": "MX",
    "visitor_referrer": "63765fcd2ff864fd",
    "env_type": "stream",
    "env_ranking": 10,
    "env_build": "1.7.118-b946",
    "env_name": "explore",
    "env_component": "editors_picks",
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
    "subject_page": 1
}

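My task requires me to find the records whose subject_doc_id matches the user's input, and then display a histogram of the countries from which that document was viewed.

I can already read the data in my code, and I know how to plot a histogram, but I need help with counting the countries and showing those counts in the histogram.

For example, the data above contains "visitor_country": "MX" and "visitor_country": "BR", so I want a count for each country.

Any ideas on how to achieve this?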

【Question discussion】:

Tags: python json histogram

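【Solution 1】:

Your JSON file is not valid JSON. You need to add "[" at the beginning of the file and "]" at the end, and separate each "{}" block with a comma. Here is an example:

data.json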

[
    {
   "ts": 1393631983,
   "visitor_uuid": "ade7e1f63bc83c66",
   "visitor_source": "external",
   "visitor_device": "browser",
   "visitor_useragent": "Opera/9.80 (Windows NT 6.1) Presto/2.12.388 Version/12.16",
   "visitor_ip": "b5af0ba608ab307c",
   "visitor_country": "BR",
   "visitor_referrer": "53c643c16e8253e7",
   "env_type": "reader",
   "env_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "event_type": "pagereadtime",
   "event_readtime": 1010,
   "subject_type": "doc",
   "subject_doc_id": "140222143932-91796b01f94327ee809bd759fd0f6c76",
   "subject_page": 3
}, {
    "ts": 1393631983,
    "visitor_uuid": "232eeca785873d35",
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
    "visitor_ip": "fcf9c67037f993f0",
    "visitor_country": "MX",
    "visitor_referrer": "63765fcd2ff864fd",
    "env_type": "stream",
    "env_ranking": 10,
    "env_build": "1.7.118-b946",
    "env_name": "explore",
    "env_component": "editors_picks",
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
    "subject_page": 1
}, {
    "ts": 1393631983,
    "visitor_uuid": "232eeca785873d35",
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
    "visitor_ip": "fcf9c67037f993f0",
    "visitor_country": "PL",
    "visitor_referrer": "63765fcd2ff864fd",
    "env_type": "stream",
    "env_ranking": 10,
    "env_build": "1.7.118-b946",
    "env_name": "explore",
    "env_component": "editors_picks",
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
    "subject_page": 1
}
, {
    "ts": 1393631983,
    "visitor_uuid": "232eeca785873d35",
    "visitor_source": "internal",
    "visitor_device": "browser",
    "visitor_useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari/537.36",
    "visitor_ip": "fcf9c67037f993f0",
    "visitor_country": "PL",
    "visitor_referrer": "63765fcd2ff864fd",
    "env_type": "stream",
    "env_ranking": 10,
    "env_build": "1.7.118-b946",
    "env_name": "explore",
    "env_component": "editors_picks",
    "event_type": "impression",
    "subject_type": "doc",
    "subject_doc_id": "100713205147-2ee05a98f1794324952eea5ca678c026",
    "subject_page": 1
}
]
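
If editing the file by hand is impractical, one possible alternative is to parse the back-to-back objects directly. The sketch below uses json.JSONDecoder.raw_decode from the standard library; parse_concatenated_json is a hypothetical helper name, and the sample string is a shortened stand-in for the real records.

```python
import json


def parse_concatenated_json(text):
    """Parse a string holding several JSON objects written back to back."""
    decoder = json.JSONDecoder()
    records = []
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it by hand
        if text[pos].isspace():
            pos += 1
            continue
        # raw_decode returns the parsed object and the index where it ended
        record, pos = decoder.raw_decode(text, pos)
        records.append(record)
    return records


# Shortened stand-in for the original file contents (assumed sample data)
sample = '{"visitor_country": "BR"} {\n    "visitor_country": "MX"\n}'
records = parse_concatenated_json(sample)
print([r["visitor_country"] for r in records])  # ['BR', 'MX']
```

This yields the same list of dicts that json.load would return for the bracketed, comma-separated version of the file.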

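After that, for each element in data.json I check whether it matches the input subject_doc_id. When there is a match, I append its country to a list of matches, which collects the data for the histogram. Next, I want the number of bins to equal the number of unique countries, so I build a list of unique countries and take its length.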

import json

import matplotlib.pyplot as plt

with open("data.json") as json_file:
    data = json.load(json_file)

# Subject id used for this demonstration:
# 100713205147-2ee05a98f1794324952eea5ca678c026
subject_id = input("subject_doc_id: ")

# Collect the country of every record whose subject_doc_id matches the input
visitors = []
for record in data:
    if record["subject_doc_id"] == subject_id:
        print("got a match from {}".format(record["visitor_country"]))
        visitors.append(record["visitor_country"])

# Build the list of unique countries; its length is the number of bins
countries = []
for country in visitors:
    if country not in countries:
        countries.append(country)

try:
    plt.hist(visitors, bins=len(countries))
    plt.show()
except ValueError:
    # Reached when nothing matched, because the number of bins would be 0
    print("No matches for given subject_doc_id")
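
Since the country codes are categorical, another option is to count them explicitly and plot the counts as bars. Here is a minimal sketch using collections.Counter; the records list is a trimmed, made-up stand-in for the parsed data.json, and count_countries is a hypothetical helper name.

```python
from collections import Counter


def count_countries(records, subject_id):
    """Count visitor_country over records whose subject_doc_id matches."""
    return Counter(
        record["visitor_country"]
        for record in records
        if record.get("subject_doc_id") == subject_id
    )


# Trimmed stand-in for the parsed contents of data.json (assumed sample data)
records = [
    {"subject_doc_id": "doc-a", "visitor_country": "BR"},
    {"subject_doc_id": "doc-b", "visitor_country": "MX"},
    {"subject_doc_id": "doc-b", "visitor_country": "PL"},
    {"subject_doc_id": "doc-b", "visitor_country": "PL"},
]

counts = count_countries(records, "doc-b")
labels = list(counts)           # country codes, in first-seen order
values = list(counts.values())  # matching counts, in the same order
print(dict(counts))  # {'MX': 1, 'PL': 2}
```

With matplotlib available, plt.bar(labels, values) followed by plt.show() would draw one bar per country. An empty counts simply means that no record matched the given id.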

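If you want to group by continent, you first need to know which country belongs to which continent. My example:

# Two-letter country codes, in the same form as visitor_country
continents = {
    "europe": ["PL", "DE"],
    "south_america": ["BR"],
    "north_america": ["MX"]
}

I'm new to Python, so I don't know any fancy technique for sorting the previous list other than loops.

# Map every visitor country to its continent
continent_data = []
for continent, members in continents.items():
    for visitor_country in visitors:
        if visitor_country in members:
            continent_data.append(continent)
print(continent_data)

After that, you can reuse the earlier code to extract the unique values for the bins and create a histogram as in the example above.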
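
The continent grouping can also be sketched with a reverse lookup table, so that each visitor country is resolved in a single step. The mapping and the visitors list below are illustrative assumptions only, and "unknown" is a made-up fallback label for codes missing from the mapping.

```python
from collections import Counter

# Illustrative mapping only; a real one would need to cover every country code
continents = {
    "europe": ["PL", "DE"],
    "south_america": ["BR"],
    "north_america": ["MX"],
}

# Invert the mapping once: country code -> continent name
country_to_continent = {
    country: continent
    for continent, members in continents.items()
    for country in members
}

# Assumed sample of matched visitor countries
visitors = ["MX", "PL", "PL", "BR"]

# Count visitors per continent; unmapped codes fall under "unknown"
continent_counts = Counter(
    country_to_continent.get(country, "unknown") for country in visitors
)
print(dict(continent_counts))
# {'north_america': 1, 'europe': 2, 'south_america': 1}
```

The keys and values of continent_counts can then be plotted in the same way as the per-country counts.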

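【Discussion】:

Thanks for the correction, much appreciated. If I wanted to group the countries by continent and then show that in another histogram, do you know what I could do? Thanks again.
@user17534067 That sounds like a new question. Feel free to ask another question, since comments are meant for clarification.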

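【Solution 2】:

I had to modify the contents of your file slightly to make it valid JSON, and then saved it as "jsonExample.json" in my working directory.

The modified JSON data looks like this:

{
"visitor1": {[your data]},
"visitor2": {[your data]}
}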

Then, using the json library (https://docs.python.org/3/library/json.html), you just build a list of each visitor's country and count how many times each country appears:

import json

with open("jsonExample.json", 'r') as file:
    contents = file.read()
visitors = json.loads(contents)

countryList = []
for v in visitors.keys():
    if visitors[v]['subject_doc_id'] == "desired_subject_doc_id":
        countryList.append(visitors[v]['visitor_country'])

for country in set(countryList):
    print(f"Country {country} appears {countryList.count(country)} times")

The if visitors[v]['subject_doc_id'] statement checks whether subject_doc_id matches the specified value; just replace the right-hand side with the desired id.
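
To make the id easy to change, it can be kept in a variable and passed to a small function. This is a minimal sketch assuming the visitor-keyed layout of jsonExample.json; the visitors dict holds made-up sample records, and count_countries_for_doc is a hypothetical helper name.

```python
from collections import Counter


def count_countries_for_doc(visitors, doc_id):
    """Count countries across a visitor-keyed dict for one subject_doc_id."""
    return Counter(
        info["visitor_country"]
        for info in visitors.values()
        if info.get("subject_doc_id") == doc_id
    )


# Made-up records shaped like the visitor-keyed JSON layout (assumed data)
visitors = {
    "visitor1": {"subject_doc_id": "doc-a", "visitor_country": "BR"},
    "visitor2": {"subject_doc_id": "doc-b", "visitor_country": "MX"},
    "visitor3": {"subject_doc_id": "doc-b", "visitor_country": "MX"},
}

desired_subject_doc_id = "doc-b"  # could instead come from input()
for country, n in count_countries_for_doc(visitors, desired_subject_doc_id).items():
    print(f"Country {country} appears {n} times")
```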

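【Discussion】:

What was modified?
What if I only want to count the countries for a specific document id (subject_doc_id) that the user enters?
Edited to show how that can be done. Hard-coding is fine if you only ever want one id, but if you want it to be easy to change, the id can be stored in a variable.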