Read Pajek .net files using Graph-tool: answers

【Question title】: Read Pajek .net files using Graph-tool
【Posted】: 2020-10-28 16:44:51
【Question】:

I have a Pajek network file (an undirected network with weighted edges). Here is an example:

*Vertices 5
1  apple
2  cat
3  tree
4  nature
5  fire
*Edges
1  3  14
2  4  1

Node labels are given without quotes. Each edge is specified as node1, node2, edge weight.
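To make the layout concrete, here is a minimal standard-library sketch that splits text in this format into a label table and a list of weighted edges. The function name `parse_pajek_text` is hypothetical, and the default weight of 1.0 for edges that omit a weight is an assumption rather than something the file format guarantees. It does not build a graph-tool graph; it only illustrates the structure of the file.

```python
# The sample file from above, embedded as a string for illustration.
sample = """*Vertices 5
1  apple
2  cat
3  tree
4  nature
5  fire
*Edges
1  3  14
2  4  1
"""

def parse_pajek_text(text):
    """Return (labels, edges) from Pajek text that lists edges as node pairs."""
    labels = {}   # node id (1-based, as in the file) -> label
    edges = []    # (node1, node2, weight) tuples
    section = None
    for line in text.splitlines():
        fields = line.split()          # splits on any run of spaces or tabs
        if not fields:
            continue                   # skip blank lines
        if fields[0].startswith('*'):
            section = fields[0].lower()  # e.g. '*vertices', '*edges', '*arcs'
            continue
        if section == '*vertices':
            labels[int(fields[0])] = fields[1]
        elif section in ('*edges', '*arcs'):
            # Assumption: an edge without a weight column gets weight 1.0.
            weight = float(fields[2]) if len(fields) > 2 else 1.0
            edges.append((int(fields[0]), int(fields[1]), weight))
    return labels, edges

labels, edges = parse_pajek_text(sample)
print(labels)
print(edges)
```

Node 5 ("fire") appears in `labels` but in no edge, which is the isolated-node case the reader has to preserve.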

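I need to read this file into graph-tool as an undirected graph with node labels and a "weight" property on the edges. The function should also preserve isolated nodes.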

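Is there an efficient way to do this in Python? So far I have been reading the .net file with Networkx and then converting the result to graph-tool with a conversion function. I am looking for a way to speed up this process.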

【Comments on the question】:

Tags: python-3.x graph-tool

【Solution 1】:

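Each section of a Pajek file (vertices/edges) appears to be interpretable as a space-delimited CSV file, which means you can parse it with pandas.read_csv(). That function will be faster than the line-by-line parsing suggested in your pure-Python answer.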

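Moreover, initializing the edge list and the property lists all at once (as numpy arrays) is faster than setting each element individually in a Python loop.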

I think the implementation below should be reasonably close to optimal, but I haven't benchmarked it.

import re
from io import StringIO

import numpy as np
import pandas as pd

import graph_tool as gt
import graph_tool.stats  # load the submodule explicitly so gt.stats resolves

def pajek_to_gt(path, directed=False, remove_loops=False):
    """
    Load a Pajek .NET file[1] as a graph_tool.Graph.
    Supports files which specify their edges via node pairs.
    Does not support files which specify their edges via the
    'edgeslist' scheme (i.e. the neighbors-list scheme).

    Note:
        Vertices are renumbered to start with 0, per graph-tool
        conventions (not Pajek conventions, which start with 1).

    Author: Stuart Berg (github.com/stuarteberg)
    License: MIT

    [1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
    """
    # Load into RAM
    with open(path, 'r') as f:
        full_text = f.read()

    if '*edgeslist' in full_text:
        raise RuntimeError("Neighbor list format not supported.")

    # Erase comment lines
    full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)

    # Erase blank lines (including those created by erasing comments)
    full_text = re.sub(r'\n+', '\n', full_text)

    # Ensure delimiter is a single space
    full_text = re.sub(r'[ \t]+', ' ', full_text)

    num_vertices = int(StringIO(full_text).readline().split()[-1])

    # Split into vertex section and edges section
    # (Vertex section might be empty)
    vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]

    # Parse vertices (if present)
    v_df = None
    if vertex_text:
        v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c', names=['id', 'label'], header=None)
        assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
            "File does not list all vertices, or lists them out of order."

    # Parse edges
    e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
    if len(e_df.columns) == 2:
        e_df.columns = ['v1', 'v2']
    elif len(e_df.columns) == 3:
        e_df.columns = ['v1', 'v2', 'weight']
    else:
        raise RuntimeError("Can't understand edge list")

    e_df[['v1', 'v2']] -= 1

    # Free up some RAM
    del full_text, vertex_text, edges_text

    # Create graph
    g = gt.Graph(directed=directed)
    g.add_vertex(num_vertices)
    g.add_edge_list(e_df[['v1', 'v2']].values)

    # Add properties
    if 'weight' in e_df.columns:
        g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
    if v_df is not None:
        g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)

    if remove_loops:
        gt.stats.remove_self_loops(g)

    return g

Here is what it returns for your example file:

In [1]: from pajek_to_gt import pajek_to_gt

In [2]: g = pajek_to_gt('pajek-example.NET')

In [3]: g.get_vertices()
Out[3]: array([0, 1, 2, 3, 4])

In [4]: g.vertex_properties['label'].get_2d_array([0])
Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')

In [5]: g.get_edges()
Out[5]:
array([[0, 2],
       [1, 3]])

In [6]: g.edge_properties['weight'].get_array()
Out[6]: PropertyArray([14.,  1.])

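Note: This function does some preprocessing to convert double spaces to single spaces, because your example above uses double spaces between entries. Was that intentional? The Pajek file spec you linked to uses single spaces.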
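The effect of that preprocessing can be checked in isolation. The sketch below applies the same two regular expressions used in the function to a short string that mixes double spaces, tabs, and a blank line; the helper name `normalize_whitespace` is only for illustration.

```python
import re

def normalize_whitespace(text):
    """Collapse blank lines, then collapse runs of spaces/tabs to one space."""
    text = re.sub(r'\n+', '\n', text)      # drop blank lines
    return re.sub(r'[ \t]+', ' ', text)    # exactly one space between fields

# Edge lines with a double space, a tab, a blank line, and mixed space/tab runs.
raw = "1  3\t14\n\n2 \t 4  1\n"
print(repr(normalize_whitespace(raw)))
```

After this step every field is separated by a single space, which is what `pd.read_csv(..., delimiter=' ')` expects.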

Edit:

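After re-reading the Pajek file spec you linked to, I noticed that there are two possible formats for the edges section. The second format lists each node's neighbors in a variable-length list: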

*edgeslist
4941 386 395 451
1 3553 3586 3587 3637
2 3583
3 4930
4 88
5 13 120

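Clearly, my implementation above is not compatible with that format. I have edited the function to raise an exception if that format is used in the file.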
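For readers who do need the neighbor-list scheme, one possible approach is to expand each line into node pairs before building the graph. This is only a sketch and is not part of the function above. It assumes the first number on each line is the source node and the remaining numbers are its neighbors, and the helper name `edgeslist_to_pairs` is hypothetical.

```python
def edgeslist_to_pairs(lines):
    """Expand 'source nbr1 nbr2 ...' lines into (source, neighbor) pairs."""
    pairs = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank line, or a node listed without neighbors
        source = int(fields[0])
        pairs.extend((source, int(nbr)) for nbr in fields[1:])
    return pairs

# A few lines taken from the *edgeslist excerpt above.
sample = ["4941 386 395 451", "2 3583", "5 13 120"]
print(edgeslist_to_pairs(sample))
```

The resulting pairs still use Pajek's 1-based ids, so they would need the same `- 1` shift as the node-pair format before being passed to graph-tool.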

【Comments】:

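Thank you very much, Stuart! I tested it on a network of about 43,000 nodes, and your solution appears to be about 6 times faster than mine :) I am wary of loading the whole file into memory, though, in case the network file is very large.
The line gt.stats.remove_self_loops(g) raised an error, so I changed it to gts.remove_self_loops(g), after adding the import statement import graph_tool.stats as gts.
As for the delimiter, I see that files sometimes use a single space, sometimes multiple spaces (2 or more), and sometimes tabs. But it looks to me like your function works correctly in all of these cases?
Yes, the preprocessing regexes should handle multiple spaces or tabs.
I think that was my mistake: I forgot to include the import statements in the code! Sorry about that. (Fixed now.) As you can see, I used import graph_tool as gt rather than graph_tool.all. Maybe that's the difference.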

【Solution 2】:

Here is the solution I developed today:

import graph_tool.all as gt
import graph_tool.stats as gts

def pajTOgt(filepath, directed = False, removeloops = True):
  g = gt.Graph(directed=directed)

  #define edge and vertex properties
  g.edge_properties["weight"] = g.new_edge_property("double")
  g.vertex_properties["id"] = g.new_vertex_property("string")

  with open(filepath, encoding = "utf-8") as input_data:
    #create vertices
    for line in input_data:
        g.add_vertex(int(line.replace("*Vertices ", "").strip())) #add vertices
        break

    #label vertices
    for line in input_data: #keeps going for node labels
      if line.strip() not in ('*Edges', '*Arcs'):  # stop at either section header
        v_id = int(line.split()[0]) - 1
        g.vertex_properties["id"][g.vertex(v_id)] = "".join(line.split()[1:])
      else:
        break

    #create weighted edges
    for line in input_data: #keeps going for edges
      linesplit = line.split()
      linesplit = [int(x) for x in linesplit[:2]] + [float(linesplit[2])]
      if linesplit[2] > 0:
        n1 = g.vertex(linesplit[0]-1)
        n2 = g.vertex(linesplit[1]-1)
        e = g.add_edge(n1, n2)
        g.edge_properties["weight"][e] = linesplit[2]

    if removeloops:
      gts.remove_self_loops(g)

    return g

If you find a more efficient way, I would love to hear about it.

【Comments】: