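It looks like each section (vertices/edges) of a Pajek file can be interpreted as a space-delimited CSV file, which means you can parse it with pandas.read_csv(). Its C parser is much faster than the line-by-line parsing you proposed in your pure-Python answer.
Furthermore, it's faster to initialize the edge list and the property lists in one go (as numpy arrays) than to set each element individually in a Python loop.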
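To show those two ideas in isolation, here is a minimal sketch on a tiny in-memory edges section. The sample text and the variable names are made up for illustration; only pandas.read_csv() and the subtract-one renumbering come from the approach described above.

```python
from io import StringIO

import pandas as pd

# Hypothetical edges section: "source target weight", with 1-based vertex ids.
edges_text = "1 3 14\n2 4 1\n"

# Parse the whole section in one call instead of looping over lines.
e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', header=None,
                   names=['v1', 'v2', 'weight'])

# Shift to 0-based ids and pull out the full edge array in one go.
edge_array = e_df[['v1', 'v2']].to_numpy() - 1
weights = e_df['weight'].to_numpy(dtype=float)

print(edge_array)
print(weights)
```

Arrays like these can be handed to graph-tool in a single call each, which is what avoids the per-edge Python loop.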
I think the pajek_to_gt() implementation below should be fairly close to optimal, but I haven't benchmarked it.
    import re
    from io import StringIO

    import numpy as np
    import pandas as pd
    import graph_tool as gt
    import graph_tool.stats  # makes gt.stats available for remove_self_loops()

    def pajek_to_gt(path, directed=False, remove_loops=False):
        """
        Load a Pajek .NET file[1] as a graph_tool.Graph.

        Supports files which specify their edges via node pairs.
        Does not support files which specify their edges via the
        'edgeslist' scheme (i.e. the neighbors-list scheme).

        Note:
            Vertices are renumbered to start with 0, per graph-tool
            conventions (not Pajek conventions, which start with 1).

        Author: Stuart Berg (github.com/stuarteberg)
        License: MIT

        [1]: https://gephi.org/users/supported-graph-formats/pajek-net-format/
        """
        # Load into RAM
        with open(path, 'r') as f:
            full_text = f.read()

        if '*edgeslist' in full_text:
            raise RuntimeError("Neighbor list format not supported.")

        # Erase comment lines
        full_text = re.sub(r'^\s*%.*$', '', full_text, flags=re.MULTILINE)

        # Erase blank lines (including those created by erasing comments),
        # and any newline left at the very start of the file
        full_text = re.sub(r'\n+', '\n', full_text).lstrip('\n')

        # Ensure delimiter is a single space
        full_text = re.sub(r'[ \t]+', ' ', full_text)

        # The first line is the vertices header, e.g. "*Vertices 5"
        num_vertices = int(StringIO(full_text).readline().split()[-1])

        # Split into vertex section and edges section
        # (Vertex section might be empty)
        vertex_text, edges_text = re.split(r'\*[^\n]+\n', full_text)[1:]

        # Parse vertices (if present)
        v_df = None
        if vertex_text:
            v_df = pd.read_csv(StringIO(vertex_text), delimiter=' ', engine='c',
                               names=['id', 'label'], header=None)
            assert (v_df['id'] == np.arange(1, 1+num_vertices)).all(), \
                "File does not list all vertices, or lists them out of order."

        # Parse edges
        e_df = pd.read_csv(StringIO(edges_text), delimiter=' ', engine='c', header=None)
        if len(e_df.columns) == 2:
            e_df.columns = ['v1', 'v2']
        elif len(e_df.columns) == 3:
            e_df.columns = ['v1', 'v2', 'weight']
        else:
            raise RuntimeError("Can't understand edge list")

        # Convert Pajek's 1-based vertex ids to graph-tool's 0-based ids
        e_df[['v1', 'v2']] -= 1

        # Free up some RAM
        del full_text, vertex_text, edges_text

        # Create graph
        g = gt.Graph(directed=directed)
        g.add_vertex(num_vertices)
        g.add_edge_list(e_df[['v1', 'v2']].values)

        # Add properties
        if 'weight' in e_df.columns:
            g.edge_properties["weight"] = g.new_edge_property("double", e_df['weight'].values)
        if v_df is not None:
            g.vertex_properties["label"] = g.new_vertex_property("string", v_df['label'].values)

        if remove_loops:
            gt.stats.remove_self_loops(g)

        return g
Here's what it returns for your example file:
    In [1]: from pajek_to_gt import pajek_to_gt

    In [2]: g = pajek_to_gt('pajek-example.NET')

    In [3]: g.get_vertices()
    Out[3]: array([0, 1, 2, 3, 4])

    In [4]: g.vertex_properties['label'].get_2d_array([0])
    Out[4]: array([['apple', 'cat', 'tree', 'nature', 'fire']], dtype='<U6')

    In [5]: g.get_edges()
    Out[5]:
    array([[0, 2],
           [1, 3]])

    In [6]: g.edge_properties['weight'].get_array()
    Out[6]: PropertyArray([14., 1.])
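Note: This function does a little pre-processing to collapse runs of spaces (or tabs) into a single space, because your example above uses double spaces between entries. Was that intentional? The Pajek file spec you linked to uses single spaces.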
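To make that pre-processing concrete, here is a small sketch that runs the same three regex substitutions on a made-up snippet. The sample input is hypothetical and is not your actual file.

```python
import re

# Hypothetical input: a comment line, a blank line, and double-spaced entries.
raw = '*Vertices 2\n% a comment\n\n1  "apple"\n2 \t"cat"\n*Edges\n1  2  14\n'

text = re.sub(r'^\s*%.*$', '', raw, flags=re.MULTILINE)  # erase comment lines
text = re.sub(r'\n+', '\n', text)                        # drop blank lines
text = re.sub(r'[ \t]+', ' ', text)                      # collapse spaces/tabs

print(text)
```

After these steps every field is separated by exactly one space, which is what lets read_csv() be called with delimiter=' '.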
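Edit:
After re-reading the Pajek file spec you linked to, I noticed that the edges section has two possible formats. The second format lists each node's neighbors in a variable-length list: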
    *edgeslist
    4941 386 395 451
    1 3553 3586 3587 3637
    2 3583
    3 4930
    4 88
    5 13 120
Obviously, my implementation above is not compatible with that format. I've edited the function to raise an exception if the file uses it.
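If you do need that format, one possible approach is to expand each line into (source, target) pairs first. The following is only a rough sketch: edgeslist_to_pairs is a hypothetical helper, not part of the function above, and it assumes the first number on each line is the source vertex and the remaining numbers are its neighbors.

```python
def edgeslist_to_pairs(edgeslist_text):
    """Expand neighbor-list lines into 0-based (source, target) pairs.

    Assumption: each line is "<source> <neighbor> <neighbor> ...",
    with 1-based vertex ids, and the '*edgeslist' header already removed.
    """
    pairs = []
    for line in edgeslist_text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        source = int(fields[0]) - 1
        pairs.extend((source, int(target) - 1) for target in fields[1:])
    return pairs


# Made-up sample: vertex 1 has neighbors 3 and 4, vertex 2 has neighbor 3.
sample = "1 3 4\n2 3\n"
pairs = edgeslist_to_pairs(sample)
print(pairs)
```

The resulting list could then be passed to g.add_edge_list(). Because the rows have variable length, this uses a plain Python loop, so I would expect it to be slower than the read_csv() path; I haven't benchmarked it.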