Is there a nicer way to convert a string into a data set in Python? (Answers)

【Question title】: Is there a nicer way to convert a string into a data set in python?
【Posted】: 2011-06-28 04:00:20
【Question description】:

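I just finished an assignment for one of my classes in Python. It works well and I'm happy with it, but it looks ugly! I've already submitted the code, since we aren't marked on how it looks, only on whether it runs correctly. Still, I wouldn't mind some tips and pointers on how to convert a string into a data set for future projects.

The input is a grid made up of nodes and edges, for example: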

"4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"

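The first number, before the ":", is the size of the grid (4x4), and (1,2;4) denotes an edge from node 1 to node 2 with a cost of 4. The following code converts this into an array where array[0] is the grid size and array[1] is a dictionary of the form (node1,node2)=cost.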

def partitionData(line):
    finalDic = dict()
    #partition the data around the formatting
    line = line.split(":")
    line[1] = line[1].split("),(")
    #clean up data some more
    line[1][0] = line[1][0][1:]
    end = len(line[1])-1
    line[1][end] = line[1][end][:len(line[1][end])-2]
    #simplify data and organize into a list
    for i in range(len(line[1])):
        line[1][i] = line[1][i].split(",")
        line[1][i][1] = line[1][i][1].split(";")
        #clean up list
        for j in range(len(line[1][i])):
            line[1][i].append(line[1][i][1][j])
        del line[1][i][1]
    #convert everything to integer to simplify algorithm
    for i in range(len(line[1])):
        for j in range(len(line[1][i])):
            line[1][i][j] = int(line[1][i][j])
    line[0] = int(line[0])
    newData = dict()
    for i in range(len(line[1])):
        newData[(line[1][i][0],line[1][i][1])] = line[1][i][2]
    line[1] = newData
    for i in line[1]:
        if not ((min(i),max(i)) in finalDic):
            finalDic[(min(i),max(i))] = line[1][i]
        else:
            print "There is a edge referenced twice!"
            exit()
    line[1] = finalDic
    return line

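I had something cleaner at first, but it didn't account for numbers greater than 9. I feel this is ugly and that there must be a prettier way to do it.

【Comments】:

How robust does your parsing need to be? Could there be whitespace in the input?
The parser will only see input in exactly the format given, so it is not very robust.

Tags: python string dictionary coding-style

【Solution 1】: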

import re

# regular expression for matching a (node1,node2;cost)
EDGE = re.compile(r'\((\d+),(\d+);(\d+)\)')

def parse(s):
    # Separate size from the list of edges
    size, edges = s.split(':')

    # Build a dictionary
    edges = dict(
        # ...where key is (node1,node2) and value is (cost)
        # (all converted to integers)
        ((int(node1),int(node2)),int(cost))

        # ...by iterating the edges using the regular expression
        for node1,node2,cost in EDGE.findall(edges))

    return int(size),edges

Example:

>>> test = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"
>>> parse(test)
(4, {(1, 2): 4, (5, 9): 9, (2, 6): 3, (6, 7): 1, (4, 8): 1, (5, 6): 1, (6, 10): 2, (9, 10): 5, (13, 14): 4, (11, 15): 3, (10, 14): 6, (9, 13): 7, (12, 16): 3, (7, 11): 1, (3, 7): 15, (8, 12): 23, (15, 16): 7})

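【Comments】:

What sits between the ( ) of dict(...) is a generator expression. However, by using EDGE.findall(edges), which builds a complete list, you lose the benefit of the generator. It is better to write edges = dict( ((int(node1),int(node2)),int(cost)) for node1,node2,cost in (match.groups() for match in EDGE.finditer(edges)) )
Ah, regular expressions! Honestly, I thought of that, but I didn't expect there to be a library for it and wasn't sure how I would implement it myself. Thanks! Very helpful!
@WhiteDawn Even though you didn't know regular expressions, you don't seem interested in understanding why your own code is so clumsy; you only look at the solutions. You do need to improve your coding skills, though, because you will never use Python effectively if you keep writing code in such a convoluted way.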
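The finditer variant suggested in the first comment can be sketched as a complete function. This is only an illustration written for Python 3 (the thread's own code is Python 2); the name parse_lazy is made up here, and the EDGE pattern is the one from the answer.

```python
import re

# Same pattern as in the answer: (node1,node2;cost)
EDGE = re.compile(r'\((\d+),(\d+);(\d+)\)')

def parse_lazy(s):
    """Parse 'size:(a,b;c),...' using finditer, which yields matches one at a time."""
    size, edges = s.split(':')
    graph = {
        (int(m.group(1)), int(m.group(2))): int(m.group(3))
        for m in EDGE.finditer(edges)
    }
    return int(size), graph

print(parse_lazy("4:(1,2;4),(2,6;3),(10,14;6)"))
```

For an input this small the difference between findall and finditer is negligible; the lazy form mainly matters when the edge list is very long.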

【Solution 2】:

import re
data = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"
temp = data.split(":")    # split into grid size and rest
array = [int(temp[0]),{}] # first item: grid size
# split the rest of the string (from the second to the second-to-last character)
# along the delimiter "),("
for item in temp[1][1:-1].split("),("):
    numbers = re.split("[,;]", item)          # split item along delimiters , or ;
    k1, k2, v = (int(num) for num in numbers) # and convert to int
    array[1][(k1,k2)] = v                     # populate the array
print array

Result:

[4, {(1, 2): 4, (5, 9): 9, (2, 6): 3, (6, 7): 1, (4, 8): 1, (5, 6): 1, (6, 10): 2, (9, 10): 5, (13, 14): 4, (11, 15): 3, (10, 14): 6, (9, 13): 7, (12, 16): 3, (7, 11): 1, (3, 7): 15, (8, 12): 23, (15, 16): 7}]

【Comments】:

Another way of using regular expressions. Thanks a lot for this example!

【Solution 3】:

What you need is a simple parser. Your input can be described with the following extended BNF notation:

input := NUM ':' edge_defn (',' edge_defn)*
edge_defn := '(' NUM ',' NUM ';' NUM ')'
NUM := [0-9]+

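You can then write your own top-down parser or use a parser generator (such as ANTLR or yacc/bison).

Let's write our own parser. First you need to identify the tokens in the input. So far the only tokens are ( ) : , ; and numbers. We can simply use Python's split() method, as Peter Norvig does in his Lisp in Python: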

 input = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"
 tokens = input.replace(':', ' : ').replace(')',' ) ').replace('(',' ( ').replace(',',' , ').replace(';', ' ; ').split()

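I know, this looks ugly too, but it is the only place where we use this kind of hack. All we do is put spaces around the symbols and use the split method to get a list of all the tokens.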
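If the chain of replace() calls feels too hacky, a regular expression can produce the same token list. The sketch below is an alternative, not what the answer does; tokenize and TOKEN are hypothetical names, and the code is written for Python 3.

```python
import re

# A token is either a run of digits or one of the punctuation characters.
TOKEN = re.compile(r'\d+|[():,;]')

def tokenize(text):
    """Return the list of tokens found in text (hypothetical helper)."""
    return TOKEN.findall(text)

print(tokenize("4:(1,2;4),(10,14;6)"))
```

One difference to keep in mind: findall silently skips characters that match neither alternative, whereas the replace-and-split approach keeps them as part of a token, so stray characters would surface later as syntax errors.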

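Next we need a next_token function, and because edge_defn can repeat, we need to look one token ahead to detect the last edge. That is what the global look_ahead variable is for.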

look_ahead = None

def next_token(t):
    global look_ahead
    if look_ahead:
        temp = look_ahead
        try:
            look_ahead = t.next()
        except StopIteration:
            look_ahead = None
        return temp

Then, following the BNF notation, we write one function for each left-hand side of the definitions.

def match(t, tok):
    if next_token(t) != tok:
        print "Syntax error! Expecting: ", tok
        exit()

def read_num(t):
    return int(next_token(t))

def edge_defn(t):
    match(t, '(')
    a = read_num(t)
    match(t, ',')
    b = read_num(t)
    match(t, ';')
    c = read_num(t)
    print "%d,%d = %d" % (a,b,c)    # ..do whatever here..
    match(t, ')')

def input(t):
    global grid_size
    grid_size = read_num(t)
    match(t, ':')
    while True:
        edge_defn(t)
        if look_ahead:
            match(t, ',')
        else:
            return


t = iter(tokens)    # iterate over the token list built above
look_ahead = t.next()
input(t)

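After calling the first rule (input), the input has been parsed and you can perform your actions. While this is a good exercise in itself, it would be better to use a parser generator, though I'm not sure that would be accepted (it depends on the purpose of the assignment).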
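The same recursive-descent idea can also be packaged without global variables. What follows is a minimal sketch for Python 3, assuming edges are separated by commas as in the sample input; the Parser class, ParseError, and every method name are invented for this illustration and are not part of the answer.

```python
class ParseError(Exception):
    """Raised when the input does not follow the grammar."""

class Parser(object):
    def __init__(self, text):
        # Same tokenizing trick as above: pad the symbols with spaces, then split.
        for symbol in ':(),;':
            text = text.replace(symbol, ' %s ' % symbol)
        self.tokens = text.split()
        self.pos = 0

    def peek(self):
        """Return the next token without consuming it, or None at end of input."""
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        """Consume and return the next token (None at end of input)."""
        token = self.peek()
        self.pos += 1
        return token

    def expect(self, expected):
        token = self.advance()
        if token != expected:
            raise ParseError('expected %r, got %r' % (expected, token))

    def number(self):
        token = self.advance()
        if token is None or not token.isdigit():
            raise ParseError('expected a number, got %r' % (token,))
        return int(token)

    def edge_defn(self):
        # edge_defn := '(' NUM ',' NUM ';' NUM ')'
        self.expect('(')
        a = self.number()
        self.expect(',')
        b = self.number()
        self.expect(';')
        cost = self.number()
        self.expect(')')
        return (a, b), cost

    def parse(self):
        # NUM ':' followed by one or more comma-separated edges
        size = self.number()
        self.expect(':')
        edges = dict([self.edge_defn()])
        while self.peek() == ',':
            self.advance()
            key, cost = self.edge_defn()
            edges[key] = cost
        if self.peek() is not None:
            raise ParseError('unexpected trailing token %r' % (self.peek(),))
        return size, edges

print(Parser("4:(1,2;4),(2,6;3)").parse())
```

Raising an exception instead of calling exit() leaves the decision about what to do with bad input to the caller, and keeping the position in the object makes the one-token lookahead a simple peek().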

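【Comments】:

Goes deeper than the other replies and walks me through the whole process nicely. Thanks!

【Solution 4】:

Here is a different approach, which takes advantage of the fact that the edge list looks a lot like a bunch of tuples. In practice I would probably do what shang did, but that has already been done: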

import ast

def build_graph(line):
    size, content = line.split(':')
    size = int(size)
    content = content.replace(';',',')
    edges = ast.literal_eval(content)
    d = {}
    for v0, v1, cost in edges:
        pair = tuple(sorted([v0, v1]))
        if pair not in d:
            d[pair] = cost
        else:
            print "There is an edge referenced twice!"
            return
    return [size, d]


>>> line = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"
>>> build_graph(line)
[4, {(1, 2): 4, (5, 9): 9, (2, 6): 3, (6, 7): 1, (4, 8): 1, (5, 6): 1, (6, 10): 2, (9, 10): 5, (13, 14): 4, (11, 15): 3, (10, 14): 6, (9, 13): 7, (12, 16): 3, (7, 11): 1, (3, 7): 15, (8, 12): 23, (15, 16): 7}]

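As usual, the real headaches arrive when you care about error handling and rejecting invalid input, so I am going to ignore that issue entirely. :^) But literal_eval is a useful little function to keep in mind, and it does not carry the dangers of a straight "eval".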
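To make the point about literal_eval a little more concrete, here is a small sketch (Python 3) that adds the error handling the answer deliberately skips. The helper name parse_edges is hypothetical. It also covers one corner case that follows from how Python parses tuples: a single edge evaluates to one flat tuple rather than a tuple of tuples.

```python
import ast

def parse_edges(content):
    """Turn '(1,2;4),(2,6;3)' into a list of (node1, node2, cost) tuples."""
    try:
        edges = ast.literal_eval(content.replace(';', ','))
    except (ValueError, SyntaxError) as exc:
        # literal_eval only accepts literals; anything else ends up here.
        raise ValueError('malformed edge list: %s' % exc)
    if not isinstance(edges, tuple):
        raise ValueError('expected a tuple of edges, got %r' % (edges,))
    # A single edge such as "(1,2;4)" evaluates to one flat tuple of ints,
    # not to a tuple of tuples, so wrap it.
    if edges and not isinstance(edges[0], tuple):
        edges = (edges,)
    return list(edges)

print(parse_edges("(1,2;4),(2,6;3)"))
print(parse_edges("(1,2;4)"))

# Unlike eval(), literal_eval refuses to evaluate function calls.
try:
    parse_edges("__import__('os').getcwd()")
except ValueError as exc:
    print("rejected:", exc)
```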

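【Comments】:

Thanks for taking the time to show how to do it without the re library! I figured there was an easy way to convert the nodes into tuples, which is why I decided to use tuples in the first place; this is much cleaner than what I did.

【Solution 5】:

The solutions proposed so far use:

a parser: too long
a regular expression: I like it, but you need to know regexes
the ast module: interesting, but you also need to know it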

.

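I approached the problem in the simplest way, one that a beginner can understand. My solution also shows that Python's built-in features are enough to do the job.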

.

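First, WhiteDawn, here is your code with corrections, so that you can see the very basic points you need to understand and how Python's features can simplify them.

For example, if seq is a sequence, seq[len(seq)-1] is its last element, but so is seq[-1]. By the way, there is a bug in your code. I think it should be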

line[1][end] = line[1][end][:len(line[1][end])-1]
# not:
line[1][end] = line[1][end][:len(line[1][end])-2]

otherwise an error occurs at execution.

Also note the great function enumerate().

You should also study list slicing: if li = [45, 12, 78, 96], then li[2:3] = [2, 5, 8] turns li into [45, 12, 2, 5, 8, 96].
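The three idioms mentioned above (negative indices, enumerate() and slice assignment) can be tried in isolation. This is a small standalone sketch for Python 3; the sample values are made up apart from the li example taken from the text.

```python
seq = ["1,2;4", "2,6;3", "15,16;7)"]

# Negative indices count from the end, so these are the same element.
assert seq[len(seq) - 1] == seq[-1]

# Dropping the last character: [:-1] is the short form of [:len(s) - 1].
last = seq[-1]
assert last[:len(last) - 1] == last[:-1] == "15,16;7"

# enumerate() yields (index, item) pairs, avoiding range(len(...)).
for i, item in enumerate(seq):
    print(i, item)

# Slice assignment replaces a sub-list with any number of elements.
li = [45, 12, 78, 96]
li[2:3] = [2, 5, 8]
print(li)   # [45, 12, 2, 5, 8, 96]

# The same trick splits "node2;cost" in place.
parts = "2,6;3".split(",")       # ['2', '6;3']
parts[1:] = parts[1].split(";")  # ['2', '6', '3']
print(parts)
```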

y = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"


def partitionData(line):
    finalDic = dict()

    #partition the data around the formatting
    print 'line==',line
    line = line.split(":")
    print '\ninstruction :  line = line.split(":")'
    print 'line==',line
    print 'len of line==',len(line),'  (2 strings)'

    print '---------------------'
    line[1] = line[1].split("),(")
    print '\ninstruction :  line[1] = line[1].split("),(")'
    print 'line[1]==',line[1]

    #clean up data some more
    line[1][0] = line[1][0][1:]
    print 'instruction :  line[1][0] = line[1][0][1:]'
    line[1][-1] = line[1][-1][0:-1]
    print 'instruction :  line[1][-1] = line[1][-1][0:-1]'
    print 'line[1]==',line[1]

    print '---------------------'
    #simplify data and organize into a list
    for i,x in enumerate(line[1]):
        line[1][i] = x.split(",")
        line[1][i][1:] = line[1][i][1].split(";")
    print 'loop to clean the data in line[1]'
    print 'line[1]==',line[1]
    print '---------------------'
    #convert everything to integer to simplify algorithm
    print 'convert everything to integer to simplify algorithm'
    for i,x in enumerate(line[1]):
        line[1][i] = map(int,x)

    line[0] = int(line[0])
    print 'line==',line
    print '---------------------'
    newData = dict()
    for a,b,c in line[1]:
        newData[(a,b)] = c
    line[1] = newData
    print 'line==',line



    print '---------------------'
    for i in line[1]:
        print 'i==',i,'  (min(i),max(i))==',(min(i),max(i))
        if not ((min(i),max(i)) in finalDic):
            finalDic[(min(i),max(i))] = line[1][i]
        else:
            print "There is a edge referenced twice!"
            exit()
    line[1] = finalDic
    print '\nline==',line
    return line


print partitionData(y)

.

Second, my solution:

y = "4:(1,2;4),(2,6;3),(3,7;15),(4,8;1),(5,6;1),(6,7;1),(5,9;9),(6,10;2),(7,11;1),(8,12;23),(9,10;5),(9,13;7),(10,14;6),(11,15;3),(12,16;3),(13,14;4),(15,16;7)"


# line[1]== {(1, 2): 4, (5, 9): 9, (2, 6): 3, (6, 7): 1, (4, 8): 1, (5, 6): 1, (6, 10): 2, (9, 10): 5, (13, 14): 4, (11, 15): 3, (10, 14): 6, (9, 13): 7, (12, 16): 3, (7, 11): 1, (3, 7): 15, (8, 12): 23, (15, 16): 7}

def partitionData(line):
    finalDic = dict()
    #partition the data around the formatting
    print '\nline==',line

    line = line.split(":")
    print '\ninstruction:\n   line = line.split(":")'
    print 'result:\n   line==',line
    print '\n----------------------------------------------------'

    print '\nline[1]==',line[1]

    line[1] = line[1][1:-1].replace(";",",")
    print '\ninstruction:\n   line[1] = line[1][1:-1].replace(";",",")'
    print 'result:\n   line[1]==',line[1]

    line[1] = [ x.split(",") for x in line[1].split("),(") ]
    print '\ninstruction:\n   line[1] = [ x.split(",") for x in line[1].split("),(") ]'
    print 'result:\n   line[1]==',line[1]

    line = [int(line[0]),dict( ((int(a),int(b)),int(c)) for (a,b,c) in line[1] ) ]
    print '\ninstruction:\n   line = [int(line[0]),dict( ((int(a),int(b)),int(c)) for (a,b,c) in line[1] ) ]'
    print 'result:\n   line[1]==',line[1]         


    for i in line[1]:
        if not ((min(i),max(i)) in finalDic):
            finalDic[(min(i),max(i))] = line[1][i]
        else:
            print "There is a edge referenced twice!"
            exit()
    line[1] = finalDic
    print '\nline[1]==',line[1]


    return line


print partitionData(y)

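I did not touch the final part that builds finalDic, because I don't understand what that snippet is for. If i is a pair of integers, (min(i),max(i)) is the pair itself, at least whenever the pair is already in ascending order, as it is throughout the sample input.

【Comments】: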
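One plausible reading of that final loop, offered here as an assumption rather than something the question states, is that the edges are undirected: (2,1) and (1,2) should count as the same edge, so each key is stored as (smaller node, larger node) and a repeated edge is reported. A minimal Python 3 sketch of that idea, with the hypothetical name normalise_edges:

```python
def normalise_edges(edges):
    """Store each undirected edge under the key (smaller node, larger node).

    edges is an iterable of ((node1, node2), cost) pairs.
    """
    result = {}
    for (a, b), cost in edges:
        key = (min(a, b), max(a, b))
        if key in result:
            raise ValueError('edge %r is referenced twice' % (key,))
        result[key] = cost
    return result

# A reversed pair is stored in ascending order.
print(normalise_edges([((2, 1), 4), ((2, 6), 3)]))

# The same edge given in both directions is detected as a duplicate.
try:
    normalise_edges([((1, 2), 4), ((2, 1), 9)])
except ValueError as exc:
    print(exc)
```

With input where every pair is already ascending, as in the sample string, the normalisation changes nothing, which is consistent with the observation above.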