【Question title】: Splitting non-obvious messy strings in Python
【Posted】: 2019-02-28 10:25:10
【Question】:

I have this string:

Model:                ARIMA                                       BIC:                 417.2273
Dependent Variable:   D.Sales of shampoo over a three year period Log-Likelihood:      -196.17
Date:                 2018-09-24 13:20                            Scale:               1.0000
No. Observations:     35                                          Method:              css-mle
Df Model:             6                                           Sample:              02-01-1901
Df Residuals:         29                                                               12-01-1903
Converged:            1.0000                                      S.D. of innovations: 64.241
No. Iterations:       19.0000                                     HQIC:                410.098
AIC:                  406.3399

I want to turn it into a dictionary. I have already used split("\n"), which gives me:

Model: ARIMA BIC: 417.2273
Dependent Variable: D.Sales of shampoo over a three year period Log-Likelihood: -196.17
Date: 2018-09-24 13:20 Scale: 1.0000
No. Observations: 35 Method: css-mle
Df Model: 6 Sample: 02-01-1901
Df Residuals: 29 12-01-1903
Converged: 1.0000 S.D. of innovations: 64.241
No. Iterations: 19.0000 HQIC: 410.098
AIC: 406.3399

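But I don't see a good way to get this into a dictionary. Maybe I am missing something obvious?

Also, note the date format next to "Sample:", whose value continues on the following line.

I want something like: {"Model": "ARIMA", "BIC": 417.2273, ...}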

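【Question comments】:

You have not shown what the resulting dictionary should look like.
Do you have any options for how the string is imported? You could look at the examples of parsing fixed-width files here, which is what this appears to be.
No, I don't think so. At the end of the day, this is a string parsing problem.
Does the first line always contain the Model and BIC keys?
Do you have a list of all possible keys (Model, BIC, ...)? Can the ':' character appear in values?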

Tags: python string python-3.x split

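【Solution 1】:

The main problem is that there are several columns side by side. Since both keys and values contain spaces, you cannot split on whitespace. Instead, you must first separate the columns and then parse the data.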
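
To make the problem concrete, here is a small sketch using one line from the question. It only demonstrates why the two obvious splits fail; it is not part of the parsing approach itself.

```python
line = ("Dependent Variable:   D.Sales of shampoo over a three year period "
        "Log-Likelihood:      -196.17")

# Splitting on whitespace breaks multi-word keys and values apart.
tokens = line.split()
print(tokens[:3])

# Splitting on ':' fuses the first value with the second key.
parts = line.split(':')
print(parts[1].strip())
```

The second print shows the value of the first column and the key of the second column stuck together in one piece, which is why the columns have to be separated first. Values such as the one for Date also contain a colon themselves.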

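If the length of the columns is unknown

Use the first line to identify the length of the columns. Once the columns are separated, you can easily split keys from values at the colon.

If the position of the keys is stable, you can exploit the fact that neither the keys nor the values in the first line contain whitespace.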

lines = input_string.splitlines()
key_values = lines[0].split()  # split first line into keys and values
column_keys = key_values[::2]  # extract the keys by taking every second element
column_starts = [lines[0].find(key) for key in column_keys]  # index of each key

Once you have the column start indices, continue as if the column lengths had been known from the start.
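
As a quick check of the column detection, here is a minimal sketch that rebuilds only the first line of the sample. The widths 22, 44 and 21 are assumptions read off the text in the question, and find_column_starts is just a hypothetical name for the snippet above wrapped in a function.

```python
def find_column_starts(first_line):
    """Return the index at which each key of the first line starts."""
    tokens = first_line.split()   # e.g. ['Model:', 'ARIMA', 'BIC:', '417.2273']
    keys = tokens[::2]            # every second token is a key
    return [first_line.find(key) for key in keys]


# First line of the sample, rebuilt with explicit (assumed) field widths.
first_line = f"{'Model:':<22}{'ARIMA':<44}{'BIC:':<21}417.2273"

column_starts = find_column_starts(first_line)
print(column_starts)  # [0, 66]
```

Note that str.find returns the first occurrence, so this would pick the wrong index if a key also appeared as a substring earlier in the line. That does not happen in the sample, but it is worth keeping in mind for other inputs.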

If the length of the columns is known

Separate the columns at their starting indices.

column_ends = column_starts[1:] + [None]
# separate all key: value lines
key_values = [
    line[start:end]
    # ordering is important - need to parse column-first for the next step
    for start, end in zip(column_starts, column_ends)
    for line in lines
]
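
To see what the column-first ordering produces, here is a small sketch on three lines of the sample. The start indices [0, 66] and the field widths are assumptions taken from the text in the question, and the rstrip() call is added only to make the printed cells easier to read.

```python
column_starts = [0, 66]  # assumed known for this sample
lines = [
    f"{'Df Model:':<22}{'6':<44}{'Sample:':<21}02-01-1901",
    f"{'Df Residuals:':<22}{'29':<44}{'':<21}12-01-1903",
    f"{'AIC:':<22}406.3399",
]

column_ends = column_starts[1:] + [None]
# Outer loop over columns, inner loop over lines: all cells of the first
# column come before any cell of the second column.
cells = [
    line[start:end].rstrip()
    for start, end in zip(column_starts, column_ends)
    for line in lines
]

for cell in cells:
    print(repr(cell))
```

The cell holding 12-01-1903 starts with spaces and directly follows the Sample cell, which is what makes it possible to attach it to the right key in the next step. The last cell is empty because the AIC line has no second column.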

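Since Sample uses a multi-line value, we cannot simply split every line into key and value at the colon. Instead, we must keep track of the previously seen key so that values without a key can be appended to it.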

data = {}
for line in key_values:
    if not line.strip():  # skip empty and whitespace-only cells
        continue
    # check if there is a key at the start of the line
    if line[0] != ' ':
        # insert key/value pairs
        key, value = line.split(':', 1)
        data[key.strip()] = value.strip()
    else:
        # append dangling values
        value = line
        data[key.strip()] += '\n' + value.strip()

This gives you a dictionary mapping each key to its value as a string:

{'Model': 'ARIMA',
 'Dependent Variable': 'D.Sales of shampoo over a three year period',
 'Date': '2018-09-24 13:20',
 'No. Observations': '35',
 'Df Model': '6',
 'Df Residuals': '29',
 'Converged': '1.0000',
 'No. Iterations': '19.0000',
 'AIC': '406.3399',
 'BIC': '417.2273',
 'Log-Likelihood': '-196.17',
 'Scale': '1.0000',
 'Method': 'css-mle',
 'Sample': '02-01-1901\n12-01-1903',
 'S.D. of innovations': '64.241',
 'HQIC': '410.098'}
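
For reference, the steps above can be combined into one function. This is only a sketch: parse_summary and build_sample are hypothetical names, the sample is rebuilt with assumed field widths (22, 44, 21) read off the question, and a cell that starts with a key but has no colon would raise a ValueError.

```python
ROWS = [
    ("Model", "ARIMA", "BIC", "417.2273"),
    ("Dependent Variable", "D.Sales of shampoo over a three year period",
     "Log-Likelihood", "-196.17"),
    ("Date", "2018-09-24 13:20", "Scale", "1.0000"),
    ("No. Observations", "35", "Method", "css-mle"),
    ("Df Model", "6", "Sample", "02-01-1901"),
    ("Df Residuals", "29", "", "12-01-1903"),
    ("Converged", "1.0000", "S.D. of innovations", "64.241"),
    ("No. Iterations", "19.0000", "HQIC", "410.098"),
    ("AIC", "406.3399", "", ""),
]


def build_sample(rows):
    """Rebuild the two-column text from the question (widths are assumed)."""
    out = []
    for k1, v1, k2, v2 in rows:
        left_key = k1 + ":"
        right_key = k2 + ":" if k2 else ""
        out.append(f"{left_key:<22}{v1:<44}{right_key:<21}{v2}".rstrip())
    return "\n".join(out)


def parse_summary(text):
    """Parse side-by-side 'key: value' columns into a dict of strings."""
    lines = [line for line in text.splitlines() if line.strip()]
    keys = lines[0].split()[::2]
    starts = [lines[0].find(key) for key in keys]
    ends = starts[1:] + [None]
    data = {}
    for start, end in zip(starts, ends):
        key = None  # last key seen in this column
        for line in lines:
            cell = line[start:end]
            if not cell.strip():
                continue
            if cell[0] != " ":
                # the cell starts with a key
                key, value = cell.split(":", 1)
                key = key.strip()
                data[key] = value.strip()
            elif key is not None:
                # dangling value: append it to the previous key
                data[key] += "\n" + cell.strip()
    return data


data = parse_summary(build_sample(ROWS))
print(data["Sample"])
```

Resetting key at the start of each column is a small deviation from the loop above: it keeps a dangling value at the top of one column from being attached to the last key of the previous column.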

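If you need the values converted to something other than strings, I suggest converting each field explicitly. You can use a dispatch table that defines the conversion for each key.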

import time

converters = {
 'Model': str, 'Dependent Variable': str,
 'Date': lambda field: time.strptime(field, '%Y-%m-%d %H:%M'),
 'No. Observations': int, 'Df Model': int, 'Df Residuals': int,
 'Converged': float, 'No. Iterations': float, 'AIC': float,
 'BIC': float, 'Log-Likelihood': float, 'Scale': float,
 'Method': str, 'Sample': str, 'S.D. of innovations': float,
 'HQIC': float
}
converted_data = {key: converters[key](data[key]) for key in data}
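
A possible variation, sketched here under a few assumptions: unknown keys fall back to str instead of raising a KeyError, and the two-line Sample value is turned into a pair of dates. The '%m-%d-%Y' format is a guess based on the sample (it looks like monthly data), and parse_sample and convert are hypothetical helper names.

```python
from datetime import date, datetime


def parse_sample(field):
    """Turn 'start\\nend' into a (start, end) tuple of dates.

    The month-day-year order is an assumption based on the example data.
    """
    return tuple(
        datetime.strptime(part, "%m-%d-%Y").date()
        for part in field.split("\n")
    )


converters = {
    "No. Observations": int,
    "BIC": float,
    "Date": lambda field: datetime.strptime(field, "%Y-%m-%d %H:%M"),
    "Sample": parse_sample,
}


def convert(data, converters):
    """Apply the converter registered for each key, defaulting to str."""
    return {key: converters.get(key, str)(value) for key, value in data.items()}


data = {
    "Model": "ARIMA",
    "No. Observations": "35",
    "BIC": "417.2273",
    "Date": "2018-09-24 13:20",
    "Sample": "02-01-1901\n12-01-1903",
}
converted = convert(data, converters)
print(converted["Sample"])
```

The str fallback is convenient when the set of keys varies between inputs, but it also hides unexpected keys. Looking up converters[key] directly, as in the code above, fails loudly instead, which may be preferable.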

【Comments】: