【Question title】: Did something change in the encoding of WhatsApp chat exports?
【Posted】: 2020-09-09 09:54:47
【Question】:

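I'm trying to follow a tutorial/example on how to import a WhatsApp chat text export into a Pandas dataframe, found here.

When I tried to run it, I hit an encoding error (UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1123: character maps to <undefined>) and a type error (TypeError: data argument can't be an iterator). I resolved both using this SO post.

However, for some reason, when I pass in the file exported from WhatsApp with encoding='utf8' (I tried other options, but the file is UTF-8), the result is just an empty dataframe.

When that didn't work, I looked up the Stack Overflow post the author created to get their code, i.e. this one. There, though, it seems to run seamlessly and without any errors.

Here is the code: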

import pandas as pd
import re

def parse_file(text_file):
    '''Convert WhatsApp chat log text file to a Pandas dataframe.'''

    # some regex to account for messages taking up multiple lines
    pat = re.compile(r'^(\d\d\/\d\d\/\d\d\d\d.*?)(?=^^\d\d\/\d\d\/\d\d\d\d|\Z)', re.S | re.M)
    with open(text_file) as f:
        data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(f.read())]

    sender = []; message = []; datetime = []
    for row in data:

        # timestamp is before the first dash
        datetime.append(row.split(' - ')[0])

        # sender is between am/pm, dash and colon
        try:
            s = re.search('m - (.*?):', row).group(1)
            sender.append(s)
        except:
            sender.append('')

        # message content is after the first colon
        try:
            message.append(row.split(': ', 1)[1])
        except:
            message.append('')

    df = pd.DataFrame(zip(datetime, sender, message), columns=['timestamp', 'sender', 'message'])
    df['timestamp'] = pd.to_datetime(df.timestamp, format='%d/%m/%Y, %I:%M %p')

    # remove events not associated with a sender
    df = df[df.sender != ''].reset_index(drop=True)

    return df

df = parse_file('chat_data_anon.txt')

My expected result is the same as what the author describes in their SO post:

I have this:

06/01/2016, 10:40 pm - abcde
07/01/2016, 12:04 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 12:05 pm - abcde
07/01/2016, 6:14 pm - abcde

fghe
07/01/2016, 6:20 pm - abcde
07/01/2016, 7:58 pm - abcde

fghe

ijkl
07/01/2016, 7:58 pm - abcde

and want:

['06/01/2016, 10:40 pm - abcde\n',
 '07/01/2016, 12:04 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 12:05 pm - abcde\n',
 '07/01/2016, 6:14 pm - abcde\n\nfghe\n',
 '07/01/2016, 6:20 pm - abcde\n',
 '07/01/2016, 7:58 pm - abcde\n\nfghe\n\nijkl\n',
 '07/01/2016, 7:58 pm - abcde\n']

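... except that I only get an empty dataframe. When I break the function into pieces, data turns out to be empty. The file I'm passing in is exactly as WhatsApp exported it (a plain .txt file), with no changes.

Can anyone tell me what I'm missing?

【Comments】:

  • Does replacing return df with return df.tolist() work for you?

Tags: python pandas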


【Solution 1】:

My friend, what I did, and what worked for me, was to read my .txt file first. For example:

opened_file = open("file.txt", encoding="utf8").read()

You can then work with opened_file.
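
To see what the export actually contains, it can help to read it explicitly as UTF-8 and print the first few lines with repr(), which exposes invisible characters and the exact date layout. The 'charmap' error in the question is what a Windows default codec such as cp1252 reports for byte 0x9d, which occurs in the UTF-8 encoding of characters like the right curly quote. The following is only a sketch: the helper name peek_export and the sample text are made up for illustration, and 'utf-8-sig' is chosen on the assumption that the file might start with a byte-order mark (it reads plain UTF-8 too).

```python
import os
import tempfile


def peek_export(path, n=5):
    """Return the first n lines of a chat export, decoded as UTF-8."""
    # 'utf-8-sig' strips a leading BOM if present, otherwise behaves like 'utf-8'.
    with open(path, encoding="utf-8-sig") as f:
        text = f.read()
    return text.splitlines()[:n]


# Hypothetical sample standing in for a real export (note the curly quotes).
sample = "9/9/20, 9:54 AM - Alice: hello \u201cworld\u201d\n9/9/20, 9:55 AM - Bob: hi\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", encoding="utf-8", delete=False) as tmp:
    tmp.write(sample)

first_lines = peek_export(tmp.name)
os.remove(tmp.name)

for line in first_lines:
    print(repr(line))  # repr() makes hidden characters visible
```

If the printed lines do not start with a date in the layout the regex expects, the parser's data list will be empty even though the file was decoded correctly.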

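【Comments】:

    【Solution 2】:

    I made 3 small changes, and now the code runs fine for me:

    1- The date does not always have a two-digit day and month, but it always has a two-digit year. I adjusted the regex to reflect that: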

    r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)'

    2- The datetime field ends with an uppercase AM or PM:

    s = re.search('M - (.*?):', row).group(1)

    3- The date format is actually month/day/year:

    df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')

    import pandas as pd
    import re
    
    def parse_file(FULL_PATH):
        '''Convert WhatsApp chat log text file to a Pandas dataframe.'''
    
        # A message starts at a line beginning with a date and runs up to the
        # next such line, or to the end of the file (\Z) for the last message.
        pat = re.compile(r'^(\d+/\d+/\d\d.*?)(?=^\d+/\d+/\d\d,|\Z)', re.S | re.M)
        with open(FULL_PATH, encoding = 'utf8') as raw:
            data = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(raw.read())]
        
        sender = []; message = []; datetime = []
        for row in data:
    
            # timestamp is before the first dash
            datetime.append(row.split(' - ')[0])
    
            # sender is between AM/PM, dash and colon
            try:
                s = re.search('M - (.*?):', row).group(1)
                sender.append(s)
            except AttributeError:  # no match: event without a sender
                sender.append('')
    
            # message content is after the first colon
            try:
                message.append(row.split(': ', 1)[1])
            except IndexError:  # no ': ' separator in this row
                message.append('')
    
        # list(...) avoids "TypeError: data argument can't be an iterator"
        df = pd.DataFrame(list(zip(datetime, sender, message)), columns=['timestamp', 'sender', 'message'])
        df['timestamp'] = pd.to_datetime(df.timestamp, format='%m/%d/%y, %I:%M %p')
    
        # remove events not associated with a sender
        df = df[df.sender != ''].reset_index(drop=True)
    
        return df
    
    df = parse_file(FULL_PATH)
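
To check the three changes without a real export, the splitting and timestamp parsing can be tried on a few sample lines. This is a minimal sketch, not the answer's code: the sample text is invented, the standard library's datetime.strptime stands in for pd.to_datetime so that it runs without pandas, and the pattern keeps the |\Z alternative from the question's regex so that the last message is captured as well.

```python
import re
from datetime import datetime

# Hypothetical sample in the M/D/YY, 12-hour, uppercase AM/PM layout.
sample = (
    "9/9/20, 9:54 AM - Alice: first line\n"
    "second line\n"
    "9/9/20, 10:01 AM - Bob: reply\n"
    "12/31/20, 11:59 PM - Alice: last message\n"
)

# A message starts at a line beginning with a date and runs up to the
# next such line, or to the end of the text (\Z) for the final message.
pat = re.compile(r'^(\d+/\d+/\d\d,.*?)(?=^\d+/\d+/\d\d,|\Z)', re.S | re.M)
rows = [m.group(1).strip().replace('\n', ' ') for m in pat.finditer(sample)]

parsed = []
for row in rows:
    stamp, _, rest = row.partition(' - ')
    sender, sep, message = rest.partition(': ')
    if not sep:  # system notice with no "sender: " part
        sender, message = '', rest
    parsed.append((datetime.strptime(stamp, '%m/%d/%y, %I:%M %p'), sender, message))

for item in parsed:
    print(item)
```

If rows comes back empty for a real export, the date layout at the start of each line most likely differs from what the pattern expects.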
    
    

    【Comments】:

      【Solution 3】:

      Just ran into the same problem. It looks like the WhatsApp export format is different. At least for me it now looks like this:

      [dd/mm/yy, hh:mm:ss:] sender: message
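
For this bracketed layout, the same idea can be sketched with an adjusted pattern. Everything below is an assumption rather than a documented format: the sample lines are invented, the hour is taken to be 24-hour, the colon just before the closing bracket is treated as optional, and left-to-right marks (U+200E), which may show up at the start of some lines, are simply removed before matching.

```python
import re
from datetime import datetime

# Hypothetical sample in the bracketed layout; \u200e is a left-to-right mark.
sample = (
    "[09/09/20, 09:54:47] Alice: first line\n"
    "second line\n"
    "\u200e[09/09/20, 09:55:10] Bob: <attached: photo.jpg>\n"
    "[10/09/20, 18:03:00] Alice: bye\n"
)

# Drop invisible left-to-right marks so every message starts with '['.
text = sample.replace('\u200e', '')

# Group 1 is the timestamp, group 2 everything up to the next bracketed
# date at the start of a line, or up to the end of the text (\Z).
pat = re.compile(
    r'^\[(\d{1,2}/\d{1,2}/\d{2}, \d{1,2}:\d{2}:\d{2}):?\] (.*?)'
    r'(?=^\[\d{1,2}/\d{1,2}/\d{2}, |\Z)',
    re.S | re.M,
)

records = []
for m in pat.finditer(text):
    stamp = datetime.strptime(m.group(1), '%d/%m/%y, %H:%M:%S')
    body = m.group(2).strip().replace('\n', ' ')
    sender, sep, message = body.partition(': ')
    if not sep:  # system notice with no "sender: " part
        sender, message = '', body
    records.append((stamp, sender, message))

for record in records:
    print(record)
```

The resulting list of tuples can be passed to pd.DataFrame with columns=['timestamp', 'sender', 'message'] to get the same shape as in the other answers.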

      【Comments】:
