Python3 网络爬虫

1. 直接使用python3

一个简单的伪代码

以下这个简单的伪代码用到了set和queue这两种经典的数据结构, 集与队列. 集的作用是记录那些已经访问过的页面, 队列的作用是进行广度优先搜索.

 
1
2
3
4
5
6
7
8
9
10
11
queue Q
set S
Q.push(StartPoint)  # 经典的BFS开头
S.insert(StartPoint)  # 访问一个页面之前先标记他为已访问
while (Q.empty() == false)  # BFS循环体
  T = Q.top()  # 并且pop
  for point in PageUrl(T)  # PageUrl(T)是指页面T中所有url的集合, point是这个集合中的一个元素.
    if (point not in S)
      Q.push(point)
      S.insert(point)

这里用到的Set其内部原理是采用了Hash表, 传统的Hash对爬虫来说占用空间太大, 因此有一种叫做Bloom Filter的数据结构更适合用在这里替代Hash版本的set. 

 

简单的webSpider实现

 1 from html.parser import HTMLParser
 2 from urllib.request import urlopen
 3 from urllib import parse
 4 
 5 class LinkParser(HTMLParser):
 6     def handle_starttag(self, tag, attrs):
 7         if tag == 'a':
 8             for (key, value) in attrs:
 9                 if key == 'href':
10                     newUrl = parse.urljoin(self.baseUrl, value)
11                     self.links = self.links + [newUrl]
12     
13     def getLinks(self, url):
14         self.links = []
15         self.baseUrl =  url
16         response = urlopen(url)
17         if response.getheader('Content-Type')=='text/html':
18             htmlBytes = response.read()
19             htmlString = htmlBytes.decode("utf-8")
20             self.feed(htmlString)
21             return htmlString, self.links
22         else:
23             return "", []
24 
25 def spider(url, word, maxPages):
26     pagesToVisit = [url]
27     numberVisited = 0
28     foundWord = False
29     while numberVisited < maxPages and pagesToVisit != [] and not foundWord:
30         numberVisited = numberVisited + 1
31         url = pagesToVisit[0]
32         pagesToVisit = pagesToVisit[1:]
33         try:
34             print(numberVisited, "Visiting:", url)
35             parser = LinkParser()
36             data, links = parser.getLinks(url)
37             if data.find(word) > -1:
38                 foundWord = True
39             pagesToVisit = pagesToVisit + links
40             print("**Success!**")
41         except:
42             print("**Failed!**")
43         
44         if foundWord:
45             print("The word", word, "was found at", url)
46             return
47         else:
48             print("Word never found")
View Code

相关文章:

  • 2022-01-27
  • 2022-12-23
  • 2022-02-10
  • 2021-05-19
  • 2022-12-23
  • 2022-12-23
  • 2022-12-23
  • 2021-09-13
猜你喜欢
  • 2022-12-23
  • 2021-07-10
  • 2021-05-26
  • 2021-05-03
  • 2022-12-23
  • 2022-01-24
  • 2022-02-18
相关资源
相似解决方案