Did something change in the encoding of WhatsApp chat exports? Answers

【Question title】: Did something change in the encoding of WhatsApp chat exports?
【Posted】: 2020-09-09 09:54:47
【Question description】:

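I'm trying to follow a tutorial/example on how to import a WhatsApp chat text export into a Pandas dataframe, found here.

When I tried to run it, I hit an encoding problem (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1123: character maps to <undefined>) and a type error (TypeError: data argument can't be an iterator), both of which I resolved using this SO post.

However, for some reason, when I pass in the file exported from WhatsApp using encoding='utf8' (I tried other options, but the file is UTF-8), it only produces an empty dataframe.

When that didn't work, I tracked down the Stack Overflow post the author created to get their code, namely this one. There, however, it seems to run seamlessly and without any errors.

Here is the code: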

import pandas as pd
import re

def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file('chat_data_anon.txt')

My expected result is the same as what the author describes in their SO post:

I have this:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

and want:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

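... except that I only get an empty dataframe. When I break the code into pieces, data appears to be empty. The file I pass in is exactly as WhatsApp exported it (a plain .txt file), with no changes.

Can anyone tell me what I'm missing?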

【Question comments】:

Does replacing return df with return df.tolist() work for you?

Tags: python pandas

【Solution 1】:

My friend, what I did, and what worked for me, was to read my .txt file first. Example:

opened_file = open("file.txt", encoding="utf8").read()

You can then work with opened_file.
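The 'charmap' error from the question can be reproduced without a real chat export. This is only a minimal sketch, assuming the default codec on the asker's machine was cp1252 (the question does not say so): the closing curly quote U+201D is encoded in UTF-8 as the bytes E2 80 9D, and byte 0x9d has no mapping in Python's cp1252 codec. The sample line and the temporary file are made up for illustration.

```python
import os
import tempfile

# Write a one-line "export" containing curly quotes.
# The closing quote (U+201D) ends with byte 0x9d when encoded as UTF-8.
sample = "06/01/2016, 10:40 pm - abcde: \u201chello\u201d\n"
with tempfile.NamedTemporaryFile("w", encoding="utf8", suffix=".txt", delete=False) as tmp:
    tmp.write(sample)
    path = tmp.name

try:
    # Assumed default codec: cp1252 has no character for byte 0x9d.
    try:
        with open(path, encoding="cp1252") as f:
            f.read()
        cp1252_error = None
    except UnicodeDecodeError as exc:
        cp1252_error = exc

    # Passing the real encoding explicitly reads the file back intact.
    with open(path, encoding="utf8") as f:
        opened_file = f.read()
finally:
    os.remove(path)

print(cp1252_error)
print(opened_file == sample)
```

This only addresses the decoding error. An empty dataframe after a successful read points at the parsing pattern rather than the encoding.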

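【Comments】:

【Solution 2】:

I made 3 small changes and the code now runs fine for me:

1- The date format does not always have a two-digit day and month, but it always has a two-digit year. I adjusted the regex to reflect that: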

r'^(\d+/\d+/\d\d.*?)(?=^^\d+/\d+/\d\d,*?)'

2- The datetime field ends with an uppercase AM or PM:

s = re.search('M - (.*?):', row).group(1)

3- The datetime format is actually month/day/year:

df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')

import pandas as pd
import re

def parse_file(FULL_PATH):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d+\/\d+\/\d\d.*?)(?=^^\d+\/\d+\/\d\d\,\*?)', re.S | re.M)
    with open(FULL_PATH, encoding = 'utf8') as raw:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(raw.read())]
    
    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('M - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file(FULL_PATH)
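One thing to watch in the pattern above: its lookahead only succeeds when another dated line follows, so the final message in a file appears to be skipped. The sketch below compares the posted pattern with a variant that restores the |\Z alternative used in the question's original pattern. The sample text and the names SAMPLE, pat_answer, pat_with_end, and split_messages are made up for illustration.

```python
import re

# Hypothetical sample in the month/day/two-digit-year format this answer describes.
SAMPLE = (
    "9/8/20, 9:54 AM - Alice: hello\n"
    "second line of the same message\n"
    "9/8/20, 9:55 AM - Bob: hi\n"
)

flags = re.S | re.M

# Pattern as posted in this answer: a message ends only where another dated line starts.
pat_answer = re.compile(r'^(\d+\/\d+\/\d\d.*?)(?=^^\d+\/\d+\/\d\d\,\*?)', flags)

# Same idea, but the lookahead also accepts the end of the input (\Z).
pat_with_end = re.compile(r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)', flags)

def split_messages(pat, text):
    '''Return one string per message, with continuation lines joined by spaces.'''
    return [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(text)]

print(split_messages(pat_answer, SAMPLE))    # one message: Bob's line is missing
print(split_messages(pat_with_end, SAMPLE))  # two messages
```

If the last message matters, swapping the second pattern into parse_file should be enough; the rest of the function can stay as posted.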

【Comments】:

【Solution 3】:

Just ran into the same problem. It looks like the WhatsApp export format is different; at least for me it now looks like this:

[dd/mm/yy, hh:mm:ss:] sender: message
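A line-by-line parser for this bracketed layout could look roughly like the sketch below. It assumes a two-digit year, a 24-hour clock, and a header of the form "[dd/mm/yy, hh:mm:ss] sender: message", with the colon before the closing bracket treated as optional. The names HEADER and parse_bracketed and the sample text are hypothetical, and real exports may differ by phone and locale.

```python
import re
from datetime import datetime

# Assumed header shape: "[dd/mm/yy, hh:mm:ss] sender: message".
# The optional ":" before "]" tolerates the trailing colon shown above.
HEADER = re.compile(r'^\[(\d{1,2}/\d{1,2}/\d{2}), (\d{1,2}:\d{2}:\d{2}):?\] (.*?): (.*)$')

def parse_bracketed(text):
    '''Parse bracketed-timestamp chat text into a list of row dicts.'''
    rows = []
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            date, time, sender, message = m.groups()
            stamp = datetime.strptime(f'{date} {time}', '%d/%m/%y %H:%M:%S')
            rows.append({'timestamp': stamp, 'sender': sender, 'message': message})
        elif rows and line.strip():
            # No header: treat the line as a continuation of the previous message.
            rows[-1]['message'] += ' ' + line.strip()
    return rows

# Hypothetical sample text, not taken from a real export.
sample = (
    "[09/09/20, 09:54:47] Alice: hello\n"
    "second line\n"
    "[09/09/20, 09:55:02] Bob: hi\n"
)
rows = parse_bracketed(sample)
for row in rows:
    print(row)
```

The resulting list can be passed to pd.DataFrame(rows) to get the same timestamp, sender, and message columns as in the earlier answers.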

【Comments】: