Representing a set of URLs in a list as a tree structure

【Question title】: Representing a set of URLs in a list as a tree structure
【Posted】: 2011-10-17 13:30:07
【Question】:

I have a list of dictionaries that stores URLs. Each dictionary has only two fields, title and url. Example:

[
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

However, I want to build a tree structure from this list of dictionaries. I am looking for something like this:

{ 'www.example.com': 
  [ 
    { 'something': 
      [ 
        { 'thisthing':
          [
            { 'title': 'Detail Page', 'url': 'detail.htm'}
          ]
        },
        [
          { 'title': 'Index Page', 'url': 'index.htm'},
          { 'title': 'Other Page', 'url': 'other.htm'}
        ]
      ]
    },
    { 'thatthing': 
      [ 
        { 'title': 'About Page', 'url': 'about.htm'}
      ]
    }
  ]
}

My first attempt was a soup of urlparse calls in a bunch of for loops, and I'm sure there is a better and faster way to do this.
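As a point of reference for the urlparse approach mentioned above, here is a minimal sketch of splitting one URL into its host, directory segments, and file name with the standard library. It assumes Python 3, where the module lives at `urllib.parse`; the helper name `split_url` is made up for illustration and is not part of the original question.

```python
from urllib.parse import urlsplit

def split_url(url):
    """Return (host, directory segments, file name) for one URL."""
    parts = urlsplit(url)
    # Drop empty strings produced by the leading slash or doubled slashes.
    segments = [s for s in parts.path.split('/') if s]
    if parts.path.endswith('/') or not segments:
        # The URL points at a directory, so there is no file name.
        return parts.netloc, segments, ''
    return parts.netloc, segments[:-1], segments[-1]

print(split_url('http://www.example.com/something/thisthing/detail.htm'))
# ('www.example.com', ['something', 'thisthing'], 'detail.htm')
```

The directory segments are the keys a tree builder would walk, and the file name is what ends up in the leaf's 'url' field.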

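I've seen people on SO work magic with list comprehensions, lambda functions, and so on. I'm still in the process of figuring that out.

(For Django developers: I'll be using this in my Django app. I store the URLs in a model named Page, which has two fields, name and title.)

【Question comments】:

Tags: python

【Solution 1】: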

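Third time's the charm... that's a nice structure you have there :). In your comment you mention that you "couldn't come up with a better tree format to represent data like this"... which made me again take the liberty of changing the output format (only slightly). To add child elements dynamically, a dictionary has to be created to hold them. For "leaf nodes", however, this dictionary is never filled. If needed, these can of course be removed by another loop, but that cannot happen during the iteration, because the empty dict has to be there for possible new nodes. The same goes for nodes without files: these will contain an empty list.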

ll = [
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'}, 
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
]

# First build a list of all url segments: final item is the title/url dict
paths = []
for item in ll:
    split = item['url'].split('/')
    paths.append(split[2:-1])
    paths[-1].append({'title': item['title'], 'url': split[-1]})

# Loop over these paths, building the format as we go along
root = {}
for path in paths:
    branch = root.setdefault(path[0], [{}, []])
    for step in path[1:-1]:
        branch = branch[0].setdefault(step, [{}, []])
    branch[1].append(path[-1])

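# As for the cleanup: because of the alternating lists and
# dicts it is a bit more complex, but the following works:
def walker(coll):
    """Yield the items of a list or the values of a dict."""
    if isinstance(coll, list):
        # Iterate over a copy so deleter() can remove items safely.
        for item in list(coll):
            yield item
    if isinstance(coll, dict):
        # values() works on Python 2 and 3 (itervalues() is Python 2 only).
        for item in list(coll.values()):
            yield item

def deleter(coll):
    """Remove empty lists and dicts from the nested lists, in place."""
    for data in walker(coll):
        if data == [] or data == {}:
            # In this tree, empty containers only occur inside the
            # [dict, list] pairs, so coll is a list here.
            coll.remove(data)
        deleter(data)

deleter(root)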

import pprint
pprint.pprint(root)

Output:

{'www.example.com':
    [
        {'something':
            [
                {'thisthing':
                    [
                        [
                            {'title': 'Detail Page', 'url': 'detail.htm'}
                        ]
                    ]
                },
                [
                    {'title': 'Index Page', 'url': 'index.htm'},
                    {'title': 'Other Page', 'url': 'other.htm'}
                ]
            ],
         'thatthing':
            [
                [
                    {'title': 'About Page', 'url': 'about.htm'}
                ]
            ]
        },
    ]
}
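The cleanup above edits the tree in place. As an alternative sketch, the same idea can be written as a function that returns a pruned copy and leaves the input untouched. This is not the answerer's code; the name `prune` and the small `tree` sample are assumptions for illustration. Unlike the in-place version, it also drops containers that only become empty after their children were pruned.

```python
def prune(node):
    """Return a copy of node without empty dicts or lists, at any depth."""
    if isinstance(node, dict):
        # Page dicts such as {'title': ..., 'url': ...} hold strings,
        # which pass through unchanged.
        cleaned = {key: prune(value) for key, value in node.items()}
        return {key: value for key, value in cleaned.items()
                if value not in ({}, [])}
    if isinstance(node, list):
        cleaned = [prune(item) for item in node]
        return [item for item in cleaned if item not in ({}, [])]
    return node

# Hypothetical sample using the [subdirectories, files] node shape.
tree = {'www.example.com': [
    {'something': [{}, [{'title': 'Index Page', 'url': 'index.htm'}]]},
    [],
]}
print(prune(tree))
# {'www.example.com': [{'something': [[{'title': 'Index Page', 'url': 'index.htm'}]]}]}
```

Returning a copy avoids removing items from a container while it is being walked, at the cost of allocating a second tree.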

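【Comments】:

This only seems to work for paths that are one level deep. I should have been more explicit. It doesn't work for URLs like http://www.example.com/thisthing/thisthing/about.htm.
Hi Jro. I'm not free to change the model, so that's out. The reason for doing this is to return all these records as JSON. You're right, checking whether a node is a list to see if it is a set of pages is ugly, but I haven't come up with a better tree format to represent data like this. I'm back to the original problem of trying to convert the list of URLs into the example data format. I really appreciate your help, but it would be a relief if you could somehow show me how to convert it. I've been banging my head against this with no luck. Thanks, Jro.
Aha. Thank you, Jro. I've accepted your answer, but one small thing: how can I remove all the empty dicts and lists? Do I need to walk the whole tree recursively?
I've added a cleanup suggestion to the answer.
Thanks, Jro. I also managed to cobble something together. Take a look at my answer below. It needs cleaning up, but it's a very different approach. I'll be using this function on a list of 5000+ items, so I wonder which one will run faster.

【Solution 2】:

Here is my solution. It seems to work. A very different approach from Jro's: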

import itertools
import pprint

pages = [
  {'title': 'Index Page', 'url': 'http://www.example.com/something/index.htm'},
  {'title': 'Other Page', 'url': 'http://www.example.com/something/other.htm'},
  {'title': 'About Page', 'url': 'http://www.example.com/thatthing/about.htm'},
  {'title': 'dtgtet Page', 'url': 'http://www.example.com/thatthing/'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/detail.htm'},
  {'title': 'Detail Page', 'url': 'http://www.example.com/something/thisthing/thisthing/detail.htm'},
]



def group_urls(url_set, depth=0):
    """
    Recursively group pages into a tree keyed by the URL segment at `depth`.
    Each page's 'url' must already be split into a list of segments.
    """
    url_set = sorted(url_set, key=lambda x: x['url'][depth])

    tree = []

    leaves = filter(lambda x: len(x['url']) - 1 == depth, url_set)
    for cluster, group in itertools.groupby(leaves, lambda x: x['url'][depth]):
        branch = list(group)
        tree.append({cluster: branch})

    twigs = filter(lambda x: len(x['url']) - 1 > depth, url_set)
    for cluster, group in itertools.groupby(twigs, lambda x: x['url'][depth]):
        branch = group_urls(list(group), depth+1)
        tree.append({cluster: branch})

    return tree

if __name__ == '__main__':
    for page in pages:
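        # str.strip('http://') strips a set of characters, not a prefix,
        # so drop the scheme and any trailing slash explicitly instead.
        page['url'] = page['url'].split('://', 1)[-1].rstrip('/').split('/')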

    pprint.pprint(group_urls(pages))

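I can't seem to figure out why I need to sort at the start of every recursion. I bet that if I could work that out, I could shave off a few more seconds.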
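On the sorting question: `itertools.groupby` only merges items whose keys are equal and adjacent, so unsorted input yields the same key more than once. A small sketch of that behavior, using a made-up `segments` list rather than the page data:

```python
import itertools

segments = ['something', 'thatthing', 'something']

# Unsorted: the two 'something' entries are not adjacent, so they end up
# in two separate groups.
unsorted_groups = [(key, len(list(group)))
                   for key, group in itertools.groupby(segments)]

# Sorted: equal keys are adjacent and collapse into one group each.
sorted_groups = [(key, len(list(group)))
                 for key, group in itertools.groupby(sorted(segments))]

print(unsorted_groups)  # [('something', 1), ('thatthing', 1), ('something', 1)]
print(sorted_groups)    # [('something', 2), ('thatthing', 1)]
```

In `group_urls`, sorting by the segment at one depth says nothing about the order of the segments at the next depth, which is why each level needs its input ordered again. One possible way to avoid the repeated sort, not measured here, would be to sort the pages once by their full segment list before the first call, since a lexicographic order keeps every group contiguous at each depth.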

【Comments】: