【Posted】: 2020-09-09 09:54:47
【Question】:
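I am trying to follow a tutorial/example on how to import a WhatsApp chat text export into a Pandas dataframe, found here.
When I tried to run it, I hit an encoding problem (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1123: character maps to <undefined>) and a type error (TypeError: data argument can't be an iterator). I resolved both using this SO post.
However, for some reason, when I pass in the file exported from WhatsApp with encoding='utf8' (I tried other options, but the file is UTF-8), it only produces an empty dataframe.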
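For readers hitting the same two errors, here is a minimal sketch of what the workarounds usually look like. It assumes the 'charmap' error came from Python falling back to a legacy Windows code page such as cp1252, and that the TypeError came from handing a bare zip() iterator to the DataFrame constructor. The sample text and the names raw, rows and charmap_failed are illustrative only and not taken from the tutorial.

```python
# Hypothetical message text: the right double quote U+201D encodes to the
# UTF-8 bytes E2 80 9D, and byte 0x9d has no mapping in cp1252.
sample = '06/01/2016, 10:40 pm - Alice: she said \u201chi\u201d\n'
raw = sample.encode('utf-8')

# Decoding with the legacy code page fails on byte 0x9d.
try:
    raw.decode('cp1252')
    charmap_failed = False
except UnicodeDecodeError:
    charmap_failed = True

# Decoding as UTF-8 succeeds; open(path, encoding='utf-8') performs the
# same decoding when reading a file.
text = raw.decode('utf-8')

# zip() returns a one-shot iterator. Wrapping it in list() produces a
# concrete list of tuples, which can then be passed to pd.DataFrame.
timestamps = ['06/01/2016, 10:40 pm']
senders = ['Alice']
messages = ['she said \u201chi\u201d']
rows = list(zip(timestamps, senders, messages))
```

Neither change affects what the regex matches, so they would not by themselves explain an empty result.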
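When that didn't work, I found the Stack Overflow post the author created to get their code, namely this one. There, however, it seems to run seamlessly and without any errors.
Here is the code: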
import pandas as pd
import re

def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file('chat_data_anon.txt')
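To make the parsing steps easier to follow, here is a small sketch that runs the same regex and the same split logic on an in-memory string instead of a file. The chat lines and the names Alice and Bob are hypothetical. The input is assumed to use the dd/mm/yyyy, h:mm am/pm layout that the pattern expects.

```python
import re

# Same pattern as in parse_file: a message starts at a line beginning with
# dd/mm/yyyy and runs until the next such line or the end of the text.
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

# Hypothetical export with one message spanning two lines.
text = (
    '06/01/2016, 10:40 pm - Alice: first message\n'
    '07/01/2016, 6:14 pm - Bob: a message that\n'
    'continues on a second line\n'
    '07/01/2016, 6:20 pm - Alice: last message\n'
)

# Each match is one full message; embedded newlines become spaces.
data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(text)]

# Field extraction, mirroring the loop in parse_file.
timestamps = [row.split(' - ')[0] for row in data]
senders = [re.search('m - (.*?):', row).group(1) for row in data]
messages = [row.split(': ', 1)[1] for row in data]

for row in zip(timestamps, senders, messages):
    print(row)
```

If data is non-empty for a string like this but empty for the real export, the difference is most likely in the layout of the exported lines rather than in the parsing loop.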
My expected result is the same as what the author describes in their SO post:
I have this:
06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde
fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde
fghe
ijkl
07/01/2016, 7:58 pm - abcde
and want:
['06/01/2016, 10:40 pm - abcde\n',
'07/01/2016, 12:04 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 12:05 pm - abcde\n',
'07/01/2016, 6:14 pm - abcde\n\nfghe\n',
'07/01/2016, 6:20 pm - abcde\n',
'07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
'07/01/2016, 7:58 pm - abcde\n']
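... except that I only get an empty dataframe. When I break the code down into pieces, data appears to be empty. The file I'm passing in is exactly as WhatsApp exported it (a plain .txt file), with no changes.
Can anyone tell me what I'm missing?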
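One way to narrow this down is to check whether the lines in the export actually start with the dd/mm/yyyy layout the pattern requires. The sketch below is a diagnostic under assumptions: the post does not show the real file, so the two alternative layouts (short_year_style and bracket_style) are hypothetical examples of how an export might differ. count_messages and preview are made-up helper names.

```python
import re

# Pattern from the question: every message must begin with dd/mm/yyyy.
pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)

def count_messages(text):
    """Return how many messages the pattern finds in the text."""
    return sum(1 for _ in pat.finditer(text))

def preview(text, n=3):
    """Return repr() of the first n lines to expose hidden characters."""
    return [repr(line) for line in text.splitlines()[:n]]

# Layout the pattern was written for.
expected_style = (
    '06/01/2016, 10:40 pm - Alice: hi\n'
    '07/01/2016, 12:04 pm - Bob: hello\n'
)
# Hypothetical layout with one-digit day/month and a two-digit year.
short_year_style = (
    '1/6/16, 10:40 PM - Alice: hi\n'
    '1/7/16, 12:04 PM - Bob: hello\n'
)
# Hypothetical layout with the timestamp wrapped in brackets.
bracket_style = (
    '[06/01/2016, 22:40:12] Alice: hi\n'
    '[07/01/2016, 12:04:55] Bob: hello\n'
)

for name, sample in [('expected', expected_style),
                     ('short year', short_year_style),
                     ('bracketed', bracket_style)]:
    print(name, count_messages(sample), preview(sample, 1))
```

If the real file resembles one of the non-matching layouts, the regex and the format string given to pd.to_datetime would both need to be adapted to it.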
【Comments】:
- Does replacing return df with return df.tolist() work for you?