Efficient way to parse tweets from JSON-formatted files

【Question title】: Efficient way to parse tweets from JSON-formatted files
【Posted】: 2017-09-12 01:30:43
【Question】:

I am parsing tweet data, which is in JSON format and compressed with gzip.

Here is my code:

###Preprocessing
##Importing:
import os
import gzip
import json
import pandas as pd
from pandas.io.json import json_normalize

##Variables:
#tweets: DataFrame for merging. empty
tweets = pd.DataFrame()
idx = 0

#Parser provides parsing the input data and return as pd.DataFrame format

###Directory reading:
##Reading whole directory from
for root, dirs, files in os.walk('D:/twitter/salathe-us-twitter/11April1'):
    for file in files:
        #file tracking, #Memory Checker:
        print(file, tweets.memory_usage())
        # ext represent the extension.
        ext = os.path.splitext(file)[-1]
        if ext == '.gz':
            with gzip.open(os.path.join(root, file), "rt") as tweet_file:
                # print(tweet_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": tweet["user"]["id"],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            #idx for DataFrame ix
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue
        else:
            with open(os.path.join(root, file), "r") as tweet_file:
                # print(tweets_file)
                for line in tweet_file:
                    try:
                        temp = line.partition('|')
                        #date
                        date = temp[0]
                        tweet = json.loads(temp[2])
                        if tweet['user']['lang'] == 'en' and tweet['place']['country_code'] == 'US':
                            # Mapping for memory.
                            # The index must be sequence like series.
                            # temporary solve by listlizing int values: id, retweet-count.
                            #print(tweet)
                            temp_dict = {"id": [tweet["user"]["id"]],
                                         "text": tweet["text"],
                                         "hashtags": tweet["entities"]["hashtags"][0]["text"],
                                         "date":[int(date[:8])]}
                            temp_DF = pd.DataFrame(temp_dict, index=[idx])
                            tweets = pd.concat([tweets, temp_DF])
                            idx += 1
                    except:
                        continue

##STORING PROCESS.
store = pd.HDFStore('D:/Twitter_project/mydata.h5')
store['11April1'] = tweets
store.close()

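My code has three parts: reading, processing to select columns, and storing. What I care about is parsing faster. So here is my question: it is too slow. How can I make it much faster? By reading with the pandas JSON reader? I guess that would be much faster than plain json.loads... But! My raw tweet data has multi-index values, so pandas read_json did not work. Overall, I am not sure whether I implemented my code well. Are there any problems, or is there a better way? I am fairly new to programming, so please teach me how to do it better.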

P.S. The computer just shut down while the code was running. Why would that happen? A memory problem?

Thanks for reading.

P.P.S.

20110331010003954|{"text":"#Honestly my toe still aint healed im suppose to be in that boot still!!!","truncated":false,"in_reply_to_user_id":null,"in_reply_to_status_id":null,"favorited":false,"source":"web","in_reply_to_screen_name":null,"in_reply_to_status_id_str":null,"id_str":"53320627431550976","entities":{"hashtags":[{"text":"Honestly","indices":[0,9]}],"user_mentions":[],"urls":[]},"contributors":null,"retweeted":false,"in_reply_to_user_id_str":null,"place":{"country_code":"US","country":"United States","bounding_box":{"type":"Polygon","coordinates":[[[-84.161625,35.849573],[-83.688543,35.849573],[-83.688543,36.067417],[-84.161625,36.067417]]]},"attributes":{},"full_name":"Knoxville, TN","name":"Knoxville","id":"6565298bcadb82a1","place_type":"city","url":"http:\/\/api.twitter.com\/1\/geo\/id\/6565298bcadb82a1.json"},"retweet_count":0,"created_at":"Thu Mar 31 05:00:02 +0000 2011","user":{"notifications":null,"profile_use_background_image":true,"default_profile":true,"profile_background_color":"C0DEED","followers_count":161,"profile_image_url":"http:\/\/a0.twimg.com\/profile_images\/1220577968\/RoadRunner_normal.jpg","is_translator":false,"profile_background_image_url":"http:\/\/a3.twimg.com\/a\/1301071706\/images\/themes\/theme1\/bg.png","default_profile_image":false,"description":"Cool & Calm Basically Females are the way of life and key to my heart...","screen_name":"FranklinOwens","verified":false,"time_zone":"Central Time (US & Canada)","friends_count":183,"profile_text_color":"333333","profile_sidebar_fill_color":"DDEEF6","location":"","id_str":"63499713","show_all_inline_media":true,"follow_request_sent":null,"geo_enabled":true,"profile_background_tile":false,"contributors_enabled":false,"lang":"en","protected":false,"favourites_count":8,"created_at":"Thu Aug 06 18:24:50 +0000 
2009","profile_link_color":"0084B4","name":"Franklin","statuses_count":5297,"profile_sidebar_border_color":"C0DEED","url":null,"id":63499713,"listed_count":0,"following":null,"utc_offset":-21600},"id":53320627431550976,"coordinates":null,"geo":null}

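That is just one line. I have more than 200 GB of data compressed as gzip files. I guess the number at the start is the date. I am not sure whether that is clear.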

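【Comments】:

Can you give an example of an input file?
Hmm.. I attached a link on the words "multi-index values". It is almost the same, except that there is a date at the very front. And I am not sure I can post it, because it is real data and might cause legal problems.
Sorry, this is not my native language, so it is hard for me to communicate. If this is hard to read, please tell me so I can fix it and make it clearer.
You don't have to post real data; make some dummy data (an example) in the same format as the real data. Replace all the important stuff with your favorite verses :)
@Taras OK. I posted it.

Tags: python json pandas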

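【Solution 1】:

First of all, congratulations. You become a better software engineer when you face real-world challenges like this one.

Now, about your solution. Every piece of software works in 3 stages: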

1. Input data.
2. Process data.
3. Output data (response).

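1. Input data

1.1. Boring stuff

Information should preferably come in one format. To achieve that we write parsers, APIs, wrappers, and adapters. The idea behind all of them is to convert data into the same format. This helps avoid problems when working with different data sources: if one of them breaks, you fix only that one adapter, and all the other data sources and your parser keep working.

1.2. Your case

Your data comes in the same scheme but in different file formats. You can either convert it to one format (read it all as json or txt), or extract the code that transforms the data into a separate function or module and reuse/call it twice. Example: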

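import gzip
import os

def process_data(tweet_file):
    for line in tweet_file:
        pass  # do your stuff

# root and file come from the os.walk loop in the question
if file.endswith('.gz'):
    with gzip.open(os.path.join(root, file), "rt") as tweet_file:
        process_data(tweet_file)
else:
    with open(os.path.join(root, file), "r") as tweet_file:
        process_data(tweet_file)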
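
One way to read both file types through a single code path is sketched below. The helper names `open_text` and `iter_lines` are made up for this illustration, and UTF-8 encoding is an assumption about the files, not something the question states.

```python
import gzip
import os


def open_text(path):
    """Open a .gz file or a plain file as a text stream (UTF-8 assumed)."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "r", encoding="utf-8")


def iter_lines(directory):
    """Yield every line of every file under `directory`, one at a time."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            with open_text(os.path.join(root, name)) as handle:
                for line in handle:
                    yield line.rstrip("\n")
```

Because `iter_lines` is a generator, only one line is held in memory at a time, and the processing code no longer needs to know whether a line came from a compressed file.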

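2. Process data

2.1. Boring stuff

This is most likely the bottleneck. Here your goal is to transform data from the given format into the desired one and perform some actions on it if required. This is where you get all the exceptions, all the performance problems, and all the business logic. This is where the SE craft comes in handy: you create an architecture and decide how many bugs to put into it.

2.2. Your case

The easiest way to deal with a problem is to know how to find it. If it is a performance problem, put timestamps in to track it. With experience, spotting problems gets easier. In this case, pd.concat is most likely what causes the performance drop. On every call it copies all the data into a new instance, so you have 2 objects in memory when you need only 1. Try to avoid concat: gather all the data into a list and then put it into a DataFrame.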
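
A minimal sketch of that advice, under some assumptions: `extract_row` and `build_frame` are hypothetical names, and the sample line is a trimmed-down version of the one in the question. Rows are gathered in a plain list and the DataFrame is built once at the end, with `time.perf_counter()` serving as the timestamp.

```python
import json
import time

import pandas as pd


def extract_row(line):
    """Return a dict for an English-language US tweet with a hashtag, else None."""
    stamp, _sep, payload = line.partition("|")
    try:
        tweet = json.loads(payload)
        if tweet["user"]["lang"] != "en" or tweet["place"]["country_code"] != "US":
            return None
        return {
            "id": tweet["user"]["id"],
            "text": tweet["text"],
            "hashtags": tweet["entities"]["hashtags"][0]["text"],
            "date": int(stamp[:8]),
        }
    except (ValueError, KeyError, IndexError, TypeError):
        # bad JSON, a missing key, no hashtag, or "place": null
        return None


def build_frame(lines):
    """Collect rows in a list, then build the DataFrame once."""
    rows = []
    for line in lines:
        row = extract_row(line)
        if row is not None:
            rows.append(row)
    return pd.DataFrame(rows, columns=["id", "text", "hashtags", "date"])


# Trimmed sample input (an assumption modelled on the question's example line).
sample = [
    '20110331010003954|{"text": "#Honestly my toe", "user": {"id": 63499713, "lang": "en"}, '
    '"place": {"country_code": "US"}, "entities": {"hashtags": [{"text": "Honestly"}]}}',
    '20110331010004000|{"text": "no place", "user": {"id": 1, "lang": "en"}, '
    '"place": null, "entities": {"hashtags": []}}',
    "not a tweet line",
]

start = time.perf_counter()
tweets = build_frame(sample)
elapsed = time.perf_counter() - start
print(f"{len(tweets)} rows in {elapsed:.4f} s")
```

Appending to a list does not copy the rows already collected, whereas calling `pd.concat` per row copies the growing frame every time.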

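Also, I wouldn't put all the data into a DataFrame at the start. You can gather it and write it to a csv file, then build a DataFrame from that; pandas handles csv files very well. Here is an example: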

import csv
import json

source_file = '11April1.txt'
result_file = 'output.csv'

with open(source_file, encoding='utf-8') as source:
    # newline='' lets the csv module control line endings (Python 3)
    with open(result_file, 'w', newline='', encoding='utf-8') as result:
        writer = csv.DictWriter(result, fieldnames=['id', 'text', 'hashtags', 'date', 'idx'])
        writer.writeheader()

        # get index together with a line
        for index, line in enumerate(source):
            # split only at the first '|': the tweet text may contain '|' too
            date, _, data = line.partition('|')
            try:
                tweet = json.loads(data)
                if tweet['user']['lang'] != 'en' or tweet['place']['country_code'] != 'US':
                    continue

                item = {"id": tweet["user"]["id"],
                        "text": tweet["text"],
                        "hashtags": tweet["entities"]["hashtags"][0]["text"],
                        "date": int(date[:8]),
                        "idx": index}
            except (ValueError, KeyError, IndexError, TypeError):
                # skip malformed lines and tweets without a place or hashtag
                continue

            # either write it to the csv or save into the array
            # tweets.append(item)
            writer.writerow(item)

print("done")

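3. Output data

3.1. Boring stuff

After your data has been processed and is in the right format, you need to see the results, right? This is where HTTP responses and page loads happen, where pandas builds graphs, and so on. You decide what kind of output you need; that is why you created the software, to get what you want out of a format you didn't want to go through yourself.

3.2. Your case

You have to find an efficient way to get the desired output from the processed files. Maybe you need to put the data into HDF5 format and process it on Hadoop, in which case your software's output becomes someone else's software's input, right? :D Jokes aside, gather all the processed data from the csv or the arrays and put it into HDF5 in chunks. This is important because you cannot load everything into RAM. RAM is called temporary memory for a reason: it is fast and very limited, so use it wisely. In my opinion, this is why your PC shut down. Or there may have been memory corruption due to the nature of some C libraries, which does happen sometimes.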
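
Processing in chunks could look roughly like the sketch below. `process_in_chunks` and `handle_chunk` are hypothetical names, and the chunk size is an arbitrary default; `pd.read_csv` with `chunksize` returns an iterator of DataFrames, so only one chunk is in memory at a time.

```python
import pandas as pd


def process_in_chunks(csv_path, handle_chunk, chunksize=100_000):
    """Pass the CSV to `handle_chunk` one DataFrame of up to `chunksize` rows at a time.

    Returns the total number of rows seen.
    """
    total = 0
    for chunk in pd.read_csv(csv_path, chunksize=chunksize):
        handle_chunk(chunk)
        total += len(chunk)
    return total
```

To write HDF5, the handler could call `store.append("tweets", chunk)` on a `pd.HDFStore` opened around the call. This needs the PyTables package, and string columns may need a `min_itemsize` large enough for the longest value in any later chunk, so treat it as something to try rather than a finished recipe.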

Overall, try experimenting, and come back to StackOverflow if anything comes up.

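【Comments】:

Thank you so much. I really appreciate your help. I have a question. In section 1.2 you said "call a function or module". I would have guessed that is much slower than my way, but is it not?
Second, you mean gathering all the tweets and converting them to csv format, then putting them into a DataFrame all at once, right? So the process is split into two parts: saving them to a csv file, then loading the csv file into a DataFrame, right? And that is faster?
Oh, wait. I need the data for text mining with the nltk Python package. Then I guess I don't need the DataFrame or the HDF5 format at all?
1) If you extract the functionality into a function, it makes it neither faster nor slower; it just lets you reuse the same functionality and avoid the bugs that come from duplication.
2) Right, split it into two parts. I guess it should be faster to gather everything into a list or csv file and then process it all at once. Processing every line from start to end is not a good idea. Once you get rid of combine, it will be faster.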