【Question title】: Validating a CSV using a JSON schema in Python
【Posted】: 2020-03-21 15:53:40
【Question】:

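I want to validate some data. I have written the code below using pandas_schema. Instead of defining the schema in pandas_schema, how can I pass a JSON file that contains all the validation rules and then apply it to the CSV file?

That is, which rule applies to which column must be read from the JSON file rather than from the pandas_schema definition, and an error file must be generated.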

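from decimal import Decimal, InvalidOperation

import pandas as pd
import pandas_schema
from pandas_schema import Column
from pandas_schema.validation import CustomElementValidation


def check_decimal(dec):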
    try:
        Decimal(dec)
    except InvalidOperation:
        return False
    return True


def check_int(num):
    try:
        int(num)
    except ValueError:
        return False
    return True


def do_validation():
    # read the data
    data = pd.read_csv('data.csv')

    # define validation elements
    decimal_validation = [CustomElementValidation(lambda d: check_decimal(d), 'is not decimal')]
    int_validation = [CustomElementValidation(lambda i: check_int(i), 'is not integer')]
    # the check must return True for valid (non-null) values
    null_validation = [CustomElementValidation(lambda d: not pd.isna(d), 'this field cannot be null')]

    # define validation schema

    schema = pandas_schema.Schema([
            Column('dec1', decimal_validation + null_validation),
            Column('dec2', decimal_validation),
            Column('dec3', decimal_validation),
            Column('dec4', decimal_validation),
            Column('dec5', decimal_validation),
            Column('dec6', decimal_validation),
            Column('dec7', decimal_validation),
            Column('company_id', int_validation + null_validation),
            Column('currency_id', int_validation + null_validation),
            Column('country_id', int_validation + null_validation)])


    # apply validation
    errors = schema.validate(data)
    errors_index_rows = [e.row for e in errors]
    data_clean = data.drop(index=errors_index_rows)

    # save data
    pd.DataFrame({'col':errors}).to_csv('errors55.csv')

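【Comments on the question】:

    Tags: json python-3.x validation schema jsonschema


    【Solution 1】:

    I really don't know anything about pandas_schema, but if you have the columns and their validators in a JSON file like this: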

    {
        "dec1": ["decimal", "null"],
        "dec2": ["decimal"],
        "dec3": ["decimal"],
        "dec4": ["decimal"],
        "dec5": ["decimal"],
        "dec6": ["decimal"],
        "dec7": ["decimal"],
        "company_id": ["int", "null"],
        "currency_id": ["int", "null"],
        "country_id": ["int", "null"]
    }
    

    You can then use a dictionary of validators and a list comprehension to generate the Column objects for the Schema:

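    import json
    from decimal import Decimal, InvalidOperation

    import pandas as pd
    import pandas_schema
    from pandas_schema import Column
    from pandas_schema.validation import CustomElementValidation


    def check_decimal(dec):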
        try:
            Decimal(dec)
        except InvalidOperation:
            return False
        return True
    
    
    def check_int(num):
        try:
            int(num)
        except ValueError:
            return False
        return True
    
    
    VALIDATORS = {
        'decimal': CustomElementValidation(lambda d: check_decimal(d), 'is not decimal'),
        'int': CustomElementValidation(lambda i: check_int(i), 'is not integer'),
        # the check must return True for valid (non-null) values
        'null': CustomElementValidation(lambda d: not pd.isna(d), 'this field cannot be null'),
    }
    
    def do_validation():
        # read the data
        data = pd.read_csv('data.csv')
        with open('my_json_schema.json', 'r') as my_json:
            json_schema = json.load(my_json)
    
        column_list = [Column(k, [VALIDATORS[v] for v in vals]) for k, vals in json_schema.items()]
        schema = pandas_schema.Schema(column_list)
    
        # apply validation
        errors = schema.validate(data)
        errors_index_rows = [e.row for e in errors]
        data_clean = data.drop(index=errors_index_rows)
    
        # save data
        pd.DataFrame({'col':errors}).to_csv('errors55.csv')
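
The name-to-validator lookup above does not depend on pandas_schema. As a rough illustration only, here is a minimal standard-library sketch of the same idea. The names `CHECKS`, `is_decimal`, `is_int`, and `validate_csv` are made up for this example, the inline strings stand in for `my_json_schema.json` and `data.csv`, and an empty cell is assumed to mean "null".

```python
import csv
import io
import json
from decimal import Decimal, InvalidOperation


def is_decimal(value):
    """Return True if the text parses as a decimal number."""
    try:
        Decimal(value)
    except InvalidOperation:
        return False
    return True


def is_int(value):
    """Return True if the text parses as an integer."""
    try:
        int(value)
    except ValueError:
        return False
    return True


# Hypothetical rule names mapped to (check function, error message).
# Assumption: an empty string in the CSV counts as a null value.
CHECKS = {
    "decimal": (is_decimal, "is not decimal"),
    "int": (is_int, "is not integer"),
    "null": (lambda v: v != "", "this field cannot be null"),
}


def validate_csv(csv_file, rules):
    """Apply the named rules to each column and return a list of error records."""
    errors = []
    for row_index, row in enumerate(csv.DictReader(csv_file)):
        for column, rule_names in rules.items():
            value = row[column]
            for name in rule_names:
                check, message = CHECKS[name]
                if not check(value):
                    errors.append({"row": row_index, "column": column,
                                   "value": value, "message": message})
    return errors


# Inline stand-ins for my_json_schema.json and data.csv.
rules = json.loads('{"dec1": ["decimal", "null"], "company_id": ["int", "null"]}')
data = io.StringIO("dec1,company_id\n1.5,7\nabc,\n")
errors = validate_csv(data, rules)
```

Each error record keeps the row index and column name, so it can be written out as an error file later. Adding a new rule only requires a new entry in the dictionary and a new name in the JSON.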
    

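    Edit:

    To use validators whose arguments are defined in the JSON, you need to change the JSON format and the code slightly. The following should work, but I have not been able to test it myself.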

    {
        "dec1": [["decimal"], ["null"]],
        "dec2": [["decimal"], ["range", 0, 10]],
        "dec3": [["decimal"]],
        "dec4": [["decimal"]],
        "dec5": [["decimal"]],
        "dec6": [["decimal"]],
        "dec7": [["decimal"]],
        "company_id": [["int"], ["null"]],
        "currency_id": [["int"], ["null"]],
        "country_id": [["int"], ["null"]]
    }
    
    
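    from pandas_schema.validation import InRangeValidation


    def get_validator(opts):
        # each entry maps a rule name to (validator class, fixed leading arguments)
        VALIDATORS = {
            'decimal': (CustomElementValidation, [lambda d: check_decimal(d), 'is not decimal']),
            'int': (CustomElementValidation, [lambda i: check_int(i), 'is not integer']),
            'null': (CustomElementValidation, [lambda d: not pd.isna(d), 'this field cannot be null']),
            'range': (InRangeValidation, []),
        }
        func, args = VALIDATORS[opts[0]]
        # append the parameters from the JSON, e.g. ["range", 0, 10] -> InRangeValidation(0, 10)
        return func(*args, *opts[1:])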
    
    
    def do_validation():
        # read the data
        data = pd.read_csv('data.csv')
        with open('my_json_schema.json', 'r') as my_json:
            json_schema = json.load(my_json)
    
        column_list = [Column(k, [get_validator(v) for v in vals]) for k, vals in json_schema.items()]
        schema = pandas_schema.Schema(column_list)
    
        # apply validation
        errors = schema.validate(data)
        errors_index_rows = [e.row for e in errors]
        data_clean = data.drop(index=errors_index_rows)
    
        # save data
        pd.DataFrame({'col':errors}).to_csv('errors55.csv')
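
The step of turning a JSON entry such as `["range", 0, 10]` into a configured check can be tried without pandas_schema. The sketch below is an illustration only, not the library's API: `FACTORIES`, `make_decimal_check`, `make_range_check`, and `build_check` are hypothetical names, and each factory returns a plain `(check function, message)` pair.

```python
from decimal import Decimal, InvalidOperation


def make_decimal_check():
    """Build a check that accepts text parseable as a decimal number."""
    def check(value):
        try:
            Decimal(value)
        except InvalidOperation:
            return False
        return True
    return check, "is not decimal"


def make_range_check(low, high):
    """Build a check that accepts numbers in the closed interval [low, high]."""
    def check(value):
        try:
            return low <= float(value) <= high
        except ValueError:
            return False
    return check, f"is not in range [{low}, {high}]"


# Hypothetical rule names mapped to factories; extra JSON items become arguments.
FACTORIES = {"decimal": make_decimal_check, "range": make_range_check}


def build_check(spec):
    """Turn a JSON entry like ["range", 0, 10] into a (check, message) pair."""
    name, *params = spec
    return FACTORIES[name](*params)


# Same shape as the "dec2" entry in the JSON above.
column_spec = [["decimal"], ["range", 0, 10]]
checks = [build_check(spec) for spec in column_spec]

# Collect the messages of every check that the value "12.5" fails.
failures = [message for check, message in checks if not check("12.5")]
```

Unpacking with `name, *params = spec` keeps rules without parameters (`["decimal"]`) and rules with parameters (`["range", 0, 10]`) in one code path.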
    

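    【Discussion】:

    • How can I print the index at which the error occurred in the saved CSV? @PyPingu
    • Right now it only prints the row when an error occurs. @PyPingu
    • I'm not sure I understand the question. Did it print the index and the row before? Where is it printed, and what is printed? There are no print statements in the code above. I'm guessing this is a pandas_schema thing?
    • How can I specify a range for any column in the JSON and apply that range to the CSV? @PyPingu
    • Assuming you want to use the pandas-schema InRangeValidation and you mean specifying the range in the JSON, you would need to change how this answer handles validators entirely, because right now there is no simple way to pass arguments to them.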
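
On the question of getting the row index into the saved error file: one possible approach is to write each error as its own CSV line with explicit columns. The sketch below only assumes that each error object has a `row` attribute, as used in the answer's code. The `column`, `value`, and `message` attributes are assumptions about the error objects, so they are read with `getattr` and a default. `write_error_report` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins so the example runs without pandas_schema.

```python
import csv
import io
from types import SimpleNamespace


def write_error_report(errors, out_file):
    """Write one CSV line per validation error, including the row index."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["row", "column", "value", "message"])
    for e in errors:
        # Attributes other than `row` are assumptions; fall back to "" if absent.
        writer.writerow([
            getattr(e, "row", ""),
            getattr(e, "column", ""),
            getattr(e, "value", ""),
            getattr(e, "message", ""),
        ])


# Stand-in error objects (not real pandas_schema warnings).
fake_errors = [
    SimpleNamespace(row=3, column="dec2", value="abc", message="is not decimal"),
    SimpleNamespace(row=7, column="company_id", value="x1", message="is not integer"),
]
report = io.StringIO()
write_error_report(fake_errors, report)
report_text = report.getvalue()
```

With a real file, the same function could be called as `write_error_report(errors, open('errors55.csv', 'w', newline=''))`.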