【发布时间】:2017-03-30 02:05:56
【问题描述】:
我正在尝试了解多处理和池来处理我在 MySQL 数据库中获得的一些推文。这是代码和错误消息。
import multiprocessing
import sqlalchemy
import pandas as pd
import config
from nltk import tokenize as token
q = multiprocessing.Queue()
engine = sqlalchemy.create_engine(config.sqlConnectionString)
def getRow(pandasSeries):
df = pd.DataFrame()
tweetTokenizer = token.TweetTokenizer()
print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
for tokens in tweetTokenizer.tokenize(pandasSeries.loc['BODY']):
df = df.append(pd.Series(data=[pandasSeries.loc['ID'], tokens, pandasSeries.loc['AUTHOR'],
pandasSeries.loc['RETWEET_COUNT'], pandasSeries.loc['FAVORITE_COUNT'],
pandasSeries.loc['FOLLOWERS_COUNT'], pandasSeries.loc['FRIENDS_COUNT'],
pandasSeries.loc['PUBLISHED_AT']],
index=['id', 'tweet', 'author', 'retweet', 'fav', 'followers', 'friends',
'published_at']), ignore_index=True)
df.to_sql(name="tweet_tokens", con=engine, if_exists='append')
if __name__ == '__main__':
##LOADING SQL INTO DATAFRAME##
databaseData = pd.read_sql_table(config.tweetTableName, engine)
pool = multiprocessing.Pool(6)
for row in databaseData.iterrows():
print(row)
pool.map(getRow, row)
pool.close()
q.close()
q.join_thread()
"""
OUPUT
C:\Users\Def\Anaconda3\python.exe C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py
(0, ID 3247
AUTHOR b'Elon Musk News'
RETWEET_COUNT 0
FAVORITE_COUNT 0
FOLLOWERS_COUNT 20467
FRIENDS_COUNT 14313
BODY Elon Musk Takes an Adorable 5th Grader's Idea ...
PUBLISHED_AT 2017-03-03 00:00:01
Name: 0, dtype: object)
Elon Musk Takes an Adorable 5th Grader's
<class 'str'>
multiprocessing.pool.RemoteTraceback:
Traceback (most recent call last):
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 44, in mapstar
return list(map(*args))
File "C:\Users\Def\Dropbox\Dissertation\testThreadCopy.py", line 16, in getRow
print(pandasSeries.loc['BODY'], "\n", type(pandasSeries.loc['BODY']))
AttributeError: 'numpy.int64' object has no attribute 'loc'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "C:/Users/Def/Dropbox/Dissertation/testThreadCopy.py", line 34, in <module>
pool.map(getRow, row)
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "C:\Users\Def\Anaconda3\lib\multiprocessing\pool.py", line 608, in get
raise self._value
AttributeError: 'numpy.int64' object has no attribute 'loc'
Process finished with exit code 1
"""
我不明白为什么它会打印出第一个系列然后崩溃?为什么它说 pandasSeries.loc['BODY'] 是 numpy.int64 类型,而打印出来的却是字符串类型?我敢肯定我在其他一些地方出错了,如果你能看到,请你指出来。 谢谢。
【问题讨论】:
-
我已经打印了类型,结果是 Str 而不是 numpy.int64。打印出来的类型在“Elon Musk Takes an Adorable 5th Grader's
”下面的输出中
标签: python python-3.x pandas numpy python-multiprocessing