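【Posted】: 2021-05-28 15:06:22
【Question】:
I need some clarification on an error I am facing.
corpus is a Python dictionary mapping a page name to the set of all pages linked to by that page.
page is a string representing a page.
When I try linkouts = corpus[page], I get:
TypeError: unhashable type: 'list'
When I print corpus[page], this is the output (corpus is a dictionary of sets):
{'3.html', '1.html'}
When I print(type(corpus[page])), the output is set.
I can iterate over corpus[page], but if I try len(corpus[page]) the same error occurs. Isn't corpus[page] a set? How should I resolve this error? Making a corpus[page].copy() runs into the same problem. Any advice and help is much appreciated, thank you all!
Code for pagelink.py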
import os
import random
import re
import sys

DAMPING = 0.85
SAMPLES = 10000


def main():
    if len(sys.argv) != 2:
        sys.exit("Usage: python pagerank.py corpus")
    corpus = crawl(sys.argv[1])
    ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
    print(f"PageRank Results from Sampling (n = {SAMPLES})")
    for page in sorted(ranks):
        print(f" {page}: {ranks[page]:.4f}")
    #ranks = iterate_pagerank(corpus, DAMPING)
    #print(f"PageRank Results from Iteration")
    for page in sorted(ranks):
        print(f" {page}: {ranks[page]:.4f}")


def crawl(directory):
    """
    Parse a directory of HTML pages and check for links to other pages.
    Return a dictionary where each key is a page, and values are
    a list of all other pages in the corpus that are linked to by the page.
    """
    pages = dict()

    # Extract all links from HTML files
    for filename in os.listdir(directory):
        if not filename.endswith(".html"):
            continue
        with open(os.path.join(directory, filename)) as f:
            contents = f.read()
            links = re.findall(r"<a\s+(?:[^>]*?)href=\"([^\"]*)\"", contents)
            pages[filename] = set(links) - {filename}

    # Only include links to other pages in the corpus
    for filename in pages:
        pages[filename] = set(
            link for link in pages[filename]
            if link in pages
        )

    return pages


def transition_model(corpus, page, damping_factor):
    """
    Return a probability distribution over which page to visit next,
    given a current page.
    With probability `damping_factor`, choose a link at random
    linked to by `page`. With probability `1 - damping_factor`, choose
    a link at random chosen from all pages in the corpus.
    """
    linkouts = set(corpus[page])
    output = {}
    for key in corpus:
        output[key] = 0.00
    dampvalue = damping_factor / len(linkouts)
    for link in linkouts:
        output[link] += dampvalue
    if linkouts:
        dampvalue = 1 - damping_factor
        dampvalue = dampvalue / len(corpus)
        for key in corpus:
            output[key] += dampvalue
    else:
        dampvalue = 1 / len(corpus)
        for key in corpus:
            output[key] += dampvalue
    return output


def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    samples = []
    first = random.choice(list(corpus))
    samples.append(first)
    for i in range(n-1):
        output = transition_model(corpus, first, damping_factor)
        second = random.choices(list(output), weights=(output.values()))
        samples.append(second)
        first = second
    output = {}
    for link in corpus:
        num = 0
        for sample in samples:
            if sample == link:
                num += 1
        output[link] = num / n
    return output


def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    raise NotImplementedError


if __name__ == "__main__":
    main()
The files 1.html and 2.html are in a folder (corpus0) located in the same folder as pagerank.py
1.html
<html lang="en">
<head>
<title>1</title>
</head>
<body>
<h1>1</h1>
<div>Links:</div>
<ul>
<li><a href="2.html">2</a></li>
</ul>
</body>
</html>
2.html
<!DOCTYPE html>
<html lang="en">
<head>
<title>2</title>
</head>
<body>
<h1>2</h1>
<div>Links:</div>
<ul>
<li><a href="1.html">1</a></li>
<li><a href="3.html">3</a></li>
</ul>
</body>
</html>
The program is run with python pagerank.py corpus0
Edit
linkouts = []
for i in corpus[page]:
    linkouts.append(i)
gives the same type of error, but if I replace linkouts.append(i) with print(i) there is no error, and i is of type str.
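One way to narrow this down: the message unhashable type: 'list' is raised when a list is used as a dictionary key, not because of the value stored under that key. The following is a minimal sketch, assuming a made-up three-page corpus; demo_corpus and lookup_error are hypothetical names that do not appear in the question's code. It shows that a str key works while a one-element list key fails.

```python
# Hypothetical stand-in for the real corpus: a dict of sets keyed by str.
demo_corpus = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": set(),
}


def lookup_error(corpus, page):
    """Return the TypeError message raised by corpus[page], or None if the lookup works."""
    try:
        corpus[page]
    except TypeError as exc:
        return str(exc)
    return None


print(lookup_error(demo_corpus, "2.html"))    # str key: the lookup succeeds
print(lookup_error(demo_corpus, ["2.html"]))  # list key: hashing the key fails
```

If the second call reproduces the message from the traceback, the value of page at the failing call is probably a list rather than a str, so printing type(page) just before the lookup may be more informative than printing type(corpus[page]).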
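【Comments】:
- It depends on the type of page. I think you may be indexing with a different page object and getting different results. Consider posting a minimal reproducible example (stackoverflow.com/help/minimal-reproducible-example) to get better feedback.
- @nneonneo Yes, added. But page is a string.
- The problem seems to be in sample_pagerank rather than transition_model.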
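The last comment can be checked directly. random.choices always returns a list (of length k, which defaults to 1), so in sample_pagerank the variable second is a one-element list, and on the next loop iteration it is passed to transition_model as page. Below is a minimal sketch of one possible fix, assuming the rest of the question's code stays unchanged; pick_next is a hypothetical helper name and the distribution is made up for illustration.

```python
import random


def pick_next(distribution):
    """Draw one page name (a str) from a {page: probability} dict."""
    pages = list(distribution)
    weights = list(distribution.values())
    # random.choices returns a list even for a single draw, so take element 0.
    return random.choices(pages, weights=weights, k=1)[0]


# Made-up distribution, only for illustration.
demo_distribution = {"1.html": 0.475, "2.html": 0.05, "3.html": 0.475}

raw = random.choices(list(demo_distribution), weights=list(demo_distribution.values()))
print(type(raw))                           # a list, which cannot be a dict key
print(type(pick_next(demo_distribution)))  # a str, which can
```

In the question's loop this would correspond to appending [0] to the random.choices(...) call before assigning to second. Separately, transition_model appears to divide by len(linkouts) before checking whether linkouts is empty, so a page with no outgoing links would likely raise ZeroDivisionError; that is a different issue from the TypeError.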