【Question title】: Extract all numbers from data frame string Python
【Posted】: 2018-02-21 02:59:35
【Question】:

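I have a very long string in a DataFrame column and need to extract all the numbers from it, and only the numbers. The values at the end, AW7S23211 and 7P0145, should not be included.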

Sample data:

id  rate
1   {"mileage": "42331", "pricing": [{"fees_tax_cents": 700, "start_fee_cents": 203159, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 75500}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 17776, "dealer_reserve_cents": 0, "monthly_payment_cents": 29033, "non_taxable_fees_cents": 78400, "expected_annual_mileage": 10000, "monthly_tax_payment_cents": 2540, "total_drive_off_tax_cents": 21017, "total_drive_off_cost_cents": 318592, "micro_ownership_premium_cents": 203159, "cost_per_additional_mile_cents": 13, "start_fee_without_cpo_premium_cents": 203159}, {"fees_tax_cents": 700, "start_fee_cents": 203159, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 75500}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 17776, "dealer_reserve_cents": 0, "monthly_payment_cents": 34450, "non_taxable_fees_cents": 78400, "expected_annual_mileage": 15000, "monthly_tax_payment_cents": 3014, "total_drive_off_tax_cents": 21491, "total_drive_off_cost_cents": 324009, "micro_ownership_premium_cents": 203159, "cost_per_additional_mile_cents": 13, "start_fee_without_cpo_premium_cents": 203159}], "stock_number": "AW7S23211"}
2   {"mileage": "3343", "pricing": [{"fees_tax_cents": 700, "start_fee_cents": 766343, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 0}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 67055, "dealer_reserve_cents": 0, "monthly_payment_cents": 101106, "non_taxable_fees_cents": 2900, "expected_annual_mileage": 12500, "monthly_tax_payment_cents": 8847, "total_drive_off_tax_cents": 76602, "total_drive_off_cost_cents": 878349, "micro_ownership_premium_cents": 766343, "cost_per_additional_mile_cents": 46, "start_fee_without_cpo_premium_cents": 766343}, {"fees_tax_cents": 700, "start_fee_cents": 766343, "non_taxable_fees": [{"name": "Electronic Vehicle Registration or Transfer Charge", "value_cents": 2900}, {"name": "Registration Fees (Transfer and Smog)", "value_cents": 0}], "cpo_premium_cents": 0, "taxable_fees_cents": 8000, "start_fee_tax_cents": 67055, "dealer_reserve_cents": 0, "monthly_payment_cents": 89436, "non_taxable_fees_cents": 2900, "expected_annual_mileage": 7500, "monthly_tax_payment_cents": 7826, "total_drive_off_tax_cents": 75581, "total_drive_off_cost_cents": 866679, "micro_ownership_premium_cents": 766343, "cost_per_additional_mile_cents": 46, "start_fee_without_cpo_premium_cents": 766343}], "stock_number": "7P0145"}

Expected output:

id   rate   
1    42331 700 203159 2900 75500 ......
2    3343  700 766343 2900 0 ......

The code below only works for simple strings, not for this very long one. Please advise.

import pandas as pd
df= pd.read_csv('C:/Users/Desktop/items.csv')
df=pd.DataFrame(df)
from ast import literal_eval
df['rate'] = df['rate'].apply(literal_eval)
s=df.rate.apply(pd.Series).set_index('id').stack().apply(pd.Series)

If I treat it as a JSON string and apply the regex instead, I get "error: look-behind requires fixed-width pattern". Why?

import re
import pandas as pd
df= pd.read_csv('C:/Users/Desktop/items.csv')
p = re.compile(r'(?<=\s+|")\d+(?!\w+)')
df.rate.apply(lambda x: re.findall(p, x))

【Comments】:

    Tags: python json pandas dataframe


    【Solution 1】:

    Use a recursive generator to walk the nested dictionary object and yield every value that looks like a number.

    import json
    from itertools import chain
    
    def gnum(d):
        if str(d).isdigit():
            yield int(d)
        elif isinstance(d, dict):
            for i in chain(*map(gnum, d.values())):
                yield i
        elif isinstance(d, list):
            for i in chain(*map(gnum, d)):
                yield i
    
    df.assign(rate=df.rate.apply(lambda x: list(gnum(json.loads(x)))))
    
       id                                               rate
    0   1  [42331, 700, 203159, 2900, 75500, 0, 8000, 177...
    1   2  [3343, 700, 766343, 2900, 0, 0, 8000, 67055, 0...
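For anyone who wants to try the recursive approach without the CSV, here is a minimal standalone sketch of the same idea. The function name `iter_numbers` and the shortened `sample` string are my own, made up for illustration; the sample is only a small cut-down of the first row. It uses `yield from` instead of `chain`, and `str.isdecimal()` instead of `str(d).isdigit()` so that `int()` is only called on strings it can parse.

```python
import json

def iter_numbers(node):
    """Yield every integer-like leaf of a decoded JSON structure, depth-first."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_numbers(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_numbers(item)
    elif isinstance(node, bool):
        return  # bool is a subclass of int, so skip true/false explicitly
    elif isinstance(node, int):
        yield node
    elif isinstance(node, str) and node.isdecimal():
        yield int(node)  # numeric strings such as "42331"

# Shortened, hypothetical excerpt of one `rate` value from the question.
sample = (
    '{"mileage": "42331", "pricing": [{"fees_tax_cents": 700, '
    '"non_taxable_fees": [{"name": "Registration Fees (Transfer and Smog)", '
    '"value_cents": 75500}]}], "stock_number": "AW7S23211"}'
)

numbers = list(iter_numbers(json.loads(sample)))
print(numbers)  # [42331, 700, 75500]
```

With a DataFrame, the same function should plug into the call shown above, e.g. `df.rate.apply(lambda x: list(iter_numbers(json.loads(x))))`. The stock number is skipped because the whole string is not numeric, which is what the question asks for.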
    

    【Comments】:

    • Using json.loads is probably better than literal_eval; this looks like JSON-encoded data.
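To illustrate the comment above: `literal_eval` parses Python literals, not JSON, so it breaks as soon as the data contains JSON-only tokens such as `true` or `null`. The sample rows in the question happen not to contain any, so this is a sketch of the general risk rather than of the asker's exact failure. The keys `certified` and `notes` are hypothetical and not part of the original data.

```python
import json
from ast import literal_eval

# Hypothetical record; "certified" and "notes" are invented for illustration.
text = '{"cpo_premium_cents": 0, "certified": true, "notes": null}'

# json.loads maps JSON true/null to Python True/None.
parsed = json.loads(text)

# literal_eval sees `true` and `null` as unknown names and refuses them.
try:
    literal_eval(text)
    literal_ok = True
except (ValueError, SyntaxError):
    literal_ok = False

print(parsed)      # {'cpo_premium_cents': 0, 'certified': True, 'notes': None}
print(literal_ok)  # False
```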
    【Solution 2】:

    Treat the JSON as a plain string and use the regular expression '(?<=\s|")\d+(?!\w+)' to extract all the numbers.

    import re
    # The look-behind must be fixed width in Python's re, so use \s rather than \s+.
    p = re.compile(r'(?<=\s|")\d+(?!\w+)')
    df.rate.apply(lambda x: re.findall(p, x))
    

    This finds all numbers except those of the form AW7S232111237P1234ABD342123.23. The result is a list of numbers for each row of the df.rate Series.
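Since the "look-behind requires fixed-width pattern" error comes up in the comments, here is a small sketch of what triggers it. Python's built-in `re` module only accepts look-behind assertions whose alternatives all match the same fixed number of characters, so `\s+` inside `(?<=...)` is rejected at compile time, while `\s|"` (one character either way) is accepted. The `sample` string is a shortened, hypothetical excerpt of the second row.

```python
import re

# Variable-width look-behind (\s+): rejected by Python's re at compile time.
try:
    re.compile(r'(?<=\s+|")\d+(?!\w+)')
    compiled_ok = True
except re.error:
    compiled_ok = False

# Fixed-width look-behind: both alternatives are exactly one character.
pattern = re.compile(r'(?<=\s|")\d+(?!\w+)')

# Shortened, hypothetical excerpt of one `rate` value from the question.
sample = (
    '{"mileage": "3343", "pricing": [{"fees_tax_cents": 700, '
    '"value_cents": 0}], "stock_number": "7P0145"}'
)

found = pattern.findall(sample)
print(compiled_ok)  # False
print(found)        # ['3343', '700', '0']
```

Note that `findall` returns strings, so convert with `int` if numeric values are needed. The pattern also works on raw text only, so a standalone number inside a name string would be picked up as well; parsing the JSON first avoids that.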

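    【Comments】:

    • Hi Haleemur, please see my updated question with your regex code. Why doesn't it work?
    • You got the error because you added a space between r'(?... There should be no space. Also, did you try to apply the regex after loading the data as JSON? That approach is bound to fail.
    • Hi Haleemur, I fixed that, and now there is a new error: "look-behind requires fixed-width pattern". Please advise.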