【Question title】:Cleanup phone numbers in dataframe column using a regex to fit a standard format
【Posted】:2026-01-16 13:20:02
【Question】:

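I need to use a regex to convert the values in a DataFrame column, which holds mobile numbers in several different formats, to a single format.

There are 5 formats in the table, and I want all of them to follow the first one:

  1. +63xxxxxxxxxx #correct format
  2. 63xxxxxxxxxx #add '+'
  3. 09xxxxxxxxx #remove the '0' and add '+63'
  4. 9xxxxxxxxx #add '+63'
  5. 09xx xxxx xxx #remove spaces

How can I do this? I tried using ifs and iterating over the whole column of values, but I keep getting a KeyError. I'm sure there is a better way to do this, so please help me.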

filename = "./section2/raw-website.csv"
website_df = pd.read_csv(filename)

clean_mobile_list = []

for i in website_df['mobile']: 
    if i[0:2] == "+63":
        clean_mobile_list.append(website_df['mobile'][i])
    if i[0] == "9":
        clean_mobile = re.sub("", "+63", website_df['mobile'][i], 1)
        clean_mobile_list.append(clean_mobile)
    if i[0:1] == "09":
        clean_mobile = re.sub("0", "+63", website_df['mobile'][i], 1)
        clean_mobile_list.append(clean_mobile)
    if i[0] == "6":
        clean_mobile = re.sub("", "+", website_df['mobile'][i], 1)
        clean_mobile_list.append(clean_mobile)
    if i[4] == " ":
        clean_mobile = re.sub(" ", "", website_df['mobile'][i])
        clean_mobile_list.append(clean_mobile)

clean_mobile_list
>>>
KeyError                                  Traceback (most recent call last)
<ipython-input-42-c3202695c4eb> in <module>
      8         clean_mobile_list.append(website_df['mobile'][i])
      9     if i[0] == "9":
---> 10         clean_mobile = re.sub("", "+63", website_df['mobile'][i], 1)
     11         clean_mobile_list.append(clean_mobile)
     12     if i[0:1] == "09":

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in __getitem__(self, key)
    851 
    852         elif key_is_scalar:
--> 853             return self._get_value(key)
    854 
    855         if is_hashable(key):

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/series.py in _get_value(self, label, takeable)
    959 
    960         # Similar to Index.get_value, but we do not fall back to positional
--> 961         loc = self.index.get_loc(label)
    962         return self.index._get_values_for_loc(self, loc, label)
    963 

~/opt/anaconda3/lib/python3.8/site-packages/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
    352                 except ValueError as err:
    353                     raise KeyError(key) from err
--> 354             raise KeyError(key)
    355         return super().get_loc(key, method=method, tolerance=tolerance)
    356 

KeyError: '9087091471'

Sample data from filename:

    email            fname     lname         mobile
0   3f@hotmail.com   DNLG      JSBEXJFJCEH   +639273710560
1   ec3d@yahoo.com   VJEZSAT   TQGTVEYAL     +639287703748
2   d7a8@protonmai...QCLCMOTQ  EJRNWDKVUQVX  09176971246
3   adb74@yahoo.com  TIPOSNZB  KXTL          9161832409

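【Question discussion】:

  • Can you provide sample data from filename.csv? Also, I think the code you posted and the code in the error message don't match!
  • @Xitiz Sorry about that! Fixed the error and added sample data from filename

Tags: python regex pandas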


【Solution 1】:

Here is a simple pipeline that does the job:

df['fixed_mobile'] = (df['mobile']
                        .str.replace(r'\s+', '', regex=True) # remove whitespace
                        .str.extract(r'^(?P<prefix>\+63)?0?(?P<number>\d+)') # extract prefix/number
                        .fillna({'prefix': '+63'}) # fill in a missing prefix
                        .apply(''.join, axis=1)    # join to form number
                     )

Output:

             email     fname         lname         mobile   fixed_mobile
0   3f@hotmail.com      DNLG   JSBEXJFJCEH  +639273710560  +639273710560
1   ec3d@yahoo.com   VJEZSAT     TQGTVEYAL  +639287703748  +639287703748
2   d7a8@protonmai  QCLCMOTQ  EJRNWDKVUQVX    09176971246  +639176971246
3  adb74@yahoo.com  TIPOSNZB          KXTL     9161832409  +639161832409
4  adb74@yahoo.com  TIPOSNZB          KXTL   9161 832 409  +639161832409
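
One case the pipeline above does not cover is format 2 from the question (63xxxxxxxxxx without the plus sign): the optional prefix group only matches a literal +63, so the leading 63 lands in the number group and +63 is prepended a second time. A minimal sketch of a variant that also accepts a bare 63, written with the standard `re` module; the names `MOBILE_RE` and `normalize_mobile` are illustrative, and the pattern assumes Philippine mobile numbers of the form 9 plus nine more digits:

```python
import re

# Optional "+63", "63" or a single leading "0", followed by the
# 10-digit number. Assumption: mobile numbers start with 9.
MOBILE_RE = re.compile(r'^(?:\+?63|0)?(9\d{9})$')


def normalize_mobile(raw):
    """Return raw in +63xxxxxxxxxx form, or None if it does not match."""
    digits = re.sub(r'\s+', '', raw)   # drop spaces first (format 5)
    match = MOBILE_RE.match(digits)
    return '+63' + match.group(1) if match else None


samples = ['+639273710560', '639273710560', '09176971246',
           '9161832409', '0917 6971 246']
cleaned = [normalize_mobile(s) for s in samples]
print(cleaned)
```

On a DataFrame this could be applied with `df['mobile'].map(normalize_mobile)`, assuming the column contains only strings. Values that do not fit any of the five formats come back as None instead of being silently rewritten.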

【Discussion】:

    【Solution 2】:

    You can try:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'mobile': ['+6364353435345', '63654645645','0945327943', '9463945759', '0932 5432 432']})
    # '63...' -> prepend '+'
    df.mobile=np.where(df.mobile.str.startswith('63'),'+'+ df.mobile,df.mobile)
    # '09...' -> replace the leading '0' with '+63'
    df.mobile=np.where(df.mobile.str.startswith('09'),'+63'+ df.mobile.str[1:],df.mobile)
    # '9...' -> prepend '+63'
    df.mobile=np.where(df.mobile.str.startswith('9'),'+63'+ df.mobile,df.mobile)
    # finally remove any spaces
    df.mobile = df.mobile.str.replace(' ', '')
    print(df)
    

    Input:

               mobile
    0  +6364353435345
    1     63654645645
    2      0945327943
    3      9463945759
    4   0932 5432 432
    

    Output:

               mobile
    0  +6364353435345
    1    +63654645645
    2    +63945327943
    3   +639463945759
    4   +639325432432
    

    Or, if some numbers do not start with 9:

    df.mobile=np.where(df.mobile.str.startswith('63'),'+'+ df.mobile,df.mobile)
    # any leading '0' is replaced with '+63'
    df.mobile=np.where(df.mobile.str.startswith('0'),'+63'+ df.mobile.str[1:],df.mobile)
    # everything still lacking the '+63' prefix gets it
    df.mobile=np.where(~df.mobile.str.startswith('+63'),'+63'+ df.mobile,df.mobile)
    df.mobile = df.mobile.str.replace(' ', '')
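
For reference, the KeyError in the question comes from the loop itself: `for i in website_df['mobile']` yields the values (the phone strings), so `website_df['mobile'][i]` looks up a phone number as an index label. The slices are also one character short (`i[0:2]` has two characters and can never equal "+63"), and because the `if`s are independent a row may be skipped or appended twice. A minimal sketch of the same prefix rules as a plain function; the name `add_prefix` is illustrative and not from the question:

```python
def add_prefix(mobile):
    """Apply the question's prefix rules to one phone string."""
    mobile = mobile.replace(' ', '')   # format 5: drop spaces first
    if mobile.startswith('+63'):       # format 1: already correct
        return mobile
    if mobile.startswith('63'):        # format 2: add '+'
        return '+' + mobile
    if mobile.startswith('0'):         # format 3: swap the leading 0 for +63
        return '+63' + mobile[1:]
    return '+63' + mobile              # format 4: bare number


raw = ['+639273710560', '639287703748', '09176971246',
       '9161832409', '0916 1832 409']
clean_mobile_list = [add_prefix(m) for m in raw]
print(clean_mobile_list)
```

With the DataFrame from the question this could be applied as `website_df['mobile'].map(add_prefix)`, assuming the column holds only strings. Each value is passed to the function directly, so no index lookup is involved.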
    

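    【Discussion】:

    • What if the phone number after the prefix/0 does not start with 9? NB: I suppose it should be fine, since the OP also matches on 9
    • Updated.. sorry, my answer is not as elegant as yours ;) but +1 for that
    • Well, you followed the OP's initial logic
    • Thank you very much!!
    【Solution 3】:

    Use the script below to clean up your mobile numbers with the phonenumbers Python package.

    Provide the country alpha-2 code, e.g. "GB", "IN", "US", etc.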

    import re
    import numpy as np
    import phonenumbers
    
    NULL_VALUES = {
        np.nan,
        float("NaN"),
        "#N/A",
        "#N/A N/A",
        "#NA",
        "-1.#IND",
        "-1.#QNAN",
        "-NaN",
        "-nan",
        "1.#IND",
        "1.#QNAN",
        "<NA>",
        "N/A",
        "NA",
        "NULL",
        "NaN",
        "n/a",
        "nan",
        "null",
        "",
        None,
    }
    
    PH_REGEX = re.compile(
        # loose pattern for numbers such as "+1 (555) 123-4567"; not used by clean_phone_number below
        r"^(\+\d{1,2}\s)?\(?\d{3}\)?[\s.-]\d{3}[\s.-]\d{4}$",
        re.IGNORECASE
    )
    
    
    def clean_phone_number(phoneNumber, country):
        """
        Clean a phone number and return it in E.164 format.

        Returns None for null-like or unparseable values.
        """

        if phoneNumber in NULL_VALUES:
            return None

        # remove special characters from the number, keeping '+' so that an
        # explicit international prefix is not lost
        phoneNumber = re.sub(r'[^+A-Za-z0-9]+', '', str(phoneNumber))

        try:
            phoneNumber = phonenumbers.format_number(phonenumbers.parse(phoneNumber, country),
                                                     phonenumbers.PhoneNumberFormat.E164)
        except phonenumbers.NumberParseException:
            # the input could not be parsed as a phone number
            return None

        return phoneNumber
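
Since clean_phone_number returns either None or an E.164 string, a cheap shape check can be handy in tests or before writing results back to the DataFrame. A minimal sketch using only the standard library; `E164_RE` and `looks_like_e164` are illustrative names, not part of phonenumbers, and the check covers the format only (a plus sign followed by at most 15 digits), not whether the number is actually assigned:

```python
import re

# E.164 shape: '+' followed by up to 15 digits, the first of them not 0.
E164_RE = re.compile(r'\+[1-9]\d{1,14}')


def looks_like_e164(value):
    """Return True if value has the E.164 shape (a format check only)."""
    return isinstance(value, str) and E164_RE.fullmatch(value) is not None


print(looks_like_e164('+639176971246'))   # cleaned number
print(looks_like_e164('09176971246'))     # raw local format
print(looks_like_e164(None))              # what the cleaner returns on failure
```

For real validation, the phonenumbers package itself remains the better tool; this only guards against obviously malformed strings.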
    

    If you like this script, please upvote.

    【Discussion】: