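【Posted】: 2016-03-08 18:37:49
【Question】:
I am working with the Yelp dataset and want to parse the review json file into a dictionary. I tried loading it into a pandas DataFrame and then building the dictionary from that, but the file is so large that this is very slow. I only want to keep the user_id and stars values. One line of the json file looks like this: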
{
"votes": {
"funny": 0, "useful": 2, "cool": 1},
"user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
"review_id": "15SdjuK7DmYqUAj6rjGowg", "stars": 5, "date": "2007-05-17",
"text": "dr. goldberg offers everything i look for in a general practitioner. he's nice and easy to talk to without being patronizing; he's always on time in seeing his patients; he's affiliated with a top-notch hospital (nyu) which my parents have explained to me is very important in case something happens and you need surgery; and you can get referrals to see specialists without having to see him first. really, what more do you need? i'm sitting here trying to think of any complaints i have about him, but i'm really drawing a blank.",
"type": "review", "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
}
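How can I iterate over each "field" (for lack of a better word)? So far I can only iterate over each line.
EDIT
The pandas code, as requested:
Reading the json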
import json
import pandas as pd
with open('yelp_academic_dataset_review.json') as f:
    df = pd.DataFrame(json.loads(line) for line in f)
Creating the dictionary
# Map (business_id, user_id) -> stars; named `ratings` to avoid shadowing the built-in dict
ratings = {}
for i, row in df.iterrows():
    business_id = row['business_id']
    user_id = row['user_id']
    rating = row['stars']
    key = (business_id, user_id)
    ratings[key] = rating
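For comparison, here is a minimal sketch of building the same (business_id, user_id) -> stars mapping straight from the file, without going through a DataFrame. It relies on the fact that json.loads turns each line into a plain Python dict, so every "field" is reachable by key. The helper name load_ratings and the two sample lines are made up for illustration; only the field names come from the record shown above.

```python
import json

def load_ratings(lines):
    """Build {(business_id, user_id): stars} from an iterable of JSON lines.

    `load_ratings` is a hypothetical helper, not part of any library.
    """
    ratings = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        review = json.loads(line)  # each line parses to an ordinary dict
        # Keep only the fields of interest; the long "text" field is dropped
        # as soon as `review` is rebound on the next iteration.
        ratings[(review["business_id"], review["user_id"])] = review["stars"]
    return ratings

# Tiny in-memory sample standing in for the real file (values are invented).
sample = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5, "text": "great"}',
    '{"user_id": "u2", "business_id": "b1", "stars": 3, "text": "ok"}',
]
ratings = load_ratings(sample)
```

Because an open file is itself an iterable of lines, the same function could be called as load_ratings(f) inside the with open(...) block, reading one review at a time instead of holding the whole dataset in a DataFrame first.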
【Comments】:
-
Is there some other way to do this using only pandas?
-
Show your pandas code for reading the json and converting it to a dictionary.
-
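Adding this as a general comment, since it isn't specific to my answer: you might consider whether it's time for a database. For large datasets, keeping everything in memory, in flat files, or in json files stops being practical, and it's time to use a database. I don't know whether you're at that point yet, but keep it in mind. Python has sqlite3, which you can also use together with sqlalchemy, for "simple" database needs.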
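As a rough sketch of what that could look like with the standard-library sqlite3 module: the table name reviews, the helper load_into_sqlite, and the sample rows are all assumptions made for illustration, and an in-memory database stands in for a real file on disk.

```python
import json
import sqlite3

def load_into_sqlite(lines, conn):
    """Insert (business_id, user_id, stars) from JSON lines into a `reviews` table.

    Both the helper and the table layout are hypothetical.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews "
        "(business_id TEXT, user_id TEXT, stars INTEGER)"
    )
    # Generator: rows are parsed and inserted one at a time, not all at once.
    rows = (
        (r["business_id"], r["user_id"], r["stars"])
        for r in (json.loads(line) for line in lines if line.strip())
    )
    conn.executemany("INSERT INTO reviews VALUES (?, ?, ?)", rows)
    conn.commit()

# In-memory database and invented sample lines, for demonstration only.
conn = sqlite3.connect(":memory:")
sample = [
    '{"user_id": "u1", "business_id": "b1", "stars": 5}',
    '{"user_id": "u1", "business_id": "b2", "stars": 4}',
    '{"user_id": "u2", "business_id": "b1", "stars": 3}',
]
load_into_sqlite(sample, conn)

# Count the distinct businesses each user has reviewed.
per_user = dict(
    conn.execute(
        "SELECT user_id, COUNT(DISTINCT business_id) FROM reviews GROUP BY user_id"
    ).fetchall()
)
```

Passing a file path instead of ":memory:" to sqlite3.connect would keep the table on disk, so the json file only has to be parsed once.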
-
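I'd like to stick with pandas, and I think a dictionary is the fastest data structure for the operations I want to do. Afterwards I want to find certain users, e.g. users who have reviewed more than 50 distinct places.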
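A minimal sketch of that last step, assuming a dictionary keyed by (business_id, user_id) as in the question. Since each key is a unique pair, counting how often a user_id appears among the keys gives the number of distinct places that user reviewed. The function name users_with_many_places and the sample data are invented for illustration.

```python
from collections import Counter

def users_with_many_places(ratings, min_places=50):
    """Return the user_ids that reviewed more than `min_places` distinct businesses.

    `ratings` is assumed to map (business_id, user_id) -> stars. Dictionary
    keys are unique, so each user_id occurrence is one distinct business.
    """
    places_per_user = Counter(user_id for _business_id, user_id in ratings)
    return {user for user, n in places_per_user.items() if n > min_places}

# Invented sample; a threshold of 1 is used so the small example has a result.
sample_ratings = {
    ("b1", "u1"): 5,
    ("b2", "u1"): 4,
    ("b1", "u2"): 3,
}
frequent = users_with_many_places(sample_ratings, min_places=1)
```

With the data still in a DataFrame, the same count could presumably be obtained with df.groupby('user_id')['business_id'].nunique() and a comparison against 50, which avoids building the dictionary at all.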
Tags: python dictionary pandas