Web Page Parsing
1 Parsing with HTMLParser
This section introduces a basic way to parse a Web page's HTML using Python's built-in html.parser module. The main steps are:
- Create a new parser class that subclasses HTMLParser;
- Override methods such as handle_starttag() to implement the desired behavior;
- Instantiate the new parser and feed the HTML text to the instance.
Complete code
```python
from html.parser import HTMLParser

# An HTMLParser instance is fed HTML data and calls handler methods when start
# tags, end tags, text, comments, and other markup elements are encountered.
# Subclass HTMLParser and override its methods to implement the desired behavior.

class MyHTMLParser(HTMLParser):
    # attrs is the list of attributes set in the HTML start tag
    def handle_starttag(self, tag, attrs):
        print('Encountered a start tag:', tag)
        for attr in attrs:
            print('    attr:', attr)

    def handle_endtag(self, tag):
        print('Encountered an end tag :', tag)

    def handle_data(self, data):
        print('Encountered some data :', data)

parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1></body></html>'
            '<img src="python-logo.png" alt="The Python logo">')
```
The code first imports the module and derives a new parser class, then overrides the handler methods: when a start tag is encountered, it is printed, along with any attributes defined on it; end tags and data are printed in the same way.
Note: the attrs argument of handle_starttag() is a list of tuples built from the attributes of the start tag; the first element of each tuple is the attribute name and the second is the attribute value.
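The shape of attrs can be checked with a small sketch; the AttrEcho class name here is our own, chosen for illustration:

```python
from html.parser import HTMLParser

class AttrEcho(HTMLParser):
    # store (tag, attrs) pairs so they can be inspected after feeding
    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples, one per attribute
        self.seen.append((tag, attrs))

p = AttrEcho()
p.feed('<a href="/index.html" class="nav">home</a>')
print(p.seen)
# [('a', [('href', '/index.html'), ('class', 'nav')])]
```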
Output
```
Encountered a start tag: html
Encountered a start tag: head
Encountered a start tag: title
Encountered some data : Test
Encountered an end tag : title
Encountered an end tag : head
Encountered a start tag: body
Encountered a start tag: h1
Encountered some data : Parse me!
Encountered an end tag : h1
Encountered an end tag : body
Encountered an end tag : html
Encountered a start tag: img
    attr: ('src', 'python-logo.png')
    attr: ('alt', 'The Python logo')
```
As the output shows, the parser walked the HTML text and printed every tag it encountered, together with the attributes contained in the tags.
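The same handler pattern can collect data instead of printing it. As a minimal sketch (the TitleParser name is ours), a parser that records the text inside the title element might look like:

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.in_title = False   # True while we are inside <title>...</title>
        self.title = ''

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # handle_data fires for any text node; keep only title text
        if self.in_title:
            self.title += data

p = TitleParser()
p.feed('<html><head><title>Test</title></head><body></body></html>')
print(p.title)
# Test
```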
2 Parsing with BeautifulSoup
Next we introduce BeautifulSoup, a third-party HTML parsing package, and compare it with HTMLParser.
First, install BeautifulSoup as follows:

```
pip install beautifulsoup4
```
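After installation, a quick sanity check shows the basic workflow; this sketch uses the built-in 'html.parser' backend rather than html5lib (which is a separate install), and the sample HTML is our own:

```python
from bs4 import BeautifulSoup

html = '<body><h1>Parse me!</h1><a href="/a">A</a><a href="/b">B</a></body>'
soup = BeautifulSoup(html, 'html.parser')
# find_all('a') returns every anchor tag; tag['attr'] reads an attribute
hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)
# ['/a', '/b']
```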
Complete code
```python
from html.parser import HTMLParser
from io import StringIO
from urllib import request
from urllib.parse import urljoin   # urljoin lives in urllib.parse

from bs4 import BeautifulSoup, SoupStrainer
from html5lib import parse, treebuilders


URLs = ('http://python.org',
        'http://www.baidu.com')

def output(x):
    print('\n'.join(sorted(set(x))))

def simple_beau_soup(url, f):
    'simple_beau_soup() - use BeautifulSoup to parse all tags to get anchors'
    # BeautifulSoup returns a BeautifulSoup instance;
    # find_all() returns a bs4.element.ResultSet of bs4.element.Tag instances;
    # use tag['attr'] to get an attribute of a tag
    output(urljoin(url, x['href'])
           for x in BeautifulSoup(markup=f, features='html5lib').find_all('a'))

def faster_beau_soup(url, f):
    'faster_beau_soup() - use BeautifulSoup to parse only anchor tags'
    # SoupStrainer('a') restricts parsing to anchor tags
    output(urljoin(url, x['href'])
           for x in BeautifulSoup(markup=f, features='html5lib',
                                  parse_only=SoupStrainer('a')).find_all('a'))

def htmlparser(url, f):
    'htmlparser() - use HTMLParser to parse anchor tags'
    class AnchorParser(HTMLParser):
        def handle_starttag(self, tag, attrs):
            if tag != 'a':
                return
            if not hasattr(self, 'data'):
                self.data = []
            for attr in attrs:
                if attr[0] == 'href':
                    self.data.append(attr[1])
    parser = AnchorParser()
    parser.feed(f.read())
    # parser.data only exists if at least one anchor was seen
    output(urljoin(url, x) for x in getattr(parser, 'data', []))
    print('DONE')

def html5libparse(url, f):
    'html5libparse() - use html5lib to parse anchor tags'
    #output(urljoin(url, x.attributes['href']) for x in parse(f)
    #       if isinstance(x, treebuilders.etree.Element) and x.name == 'a')

def process(url, data):
    print('\n*** simple BeauSoupParser')
    simple_beau_soup(url, data)
    data.seek(0)
    print('\n*** faster BeauSoupParser')
    faster_beau_soup(url, data)
    data.seek(0)
    print('\n*** HTMLParser')
    htmlparser(url, data)
    data.seek(0)
    print('\n*** HTML5lib')
    html5libparse(url, data)
    data.seek(0)

if __name__ == '__main__':
    for url in URLs:
        f = request.urlopen(url)
        data = StringIO(f.read().decode())
        f.close()
        process(url, data)
```
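The payoff of SoupStrainer can be seen without any network access; in this sketch (the sample HTML and base URL are our own) parse_only discards everything except anchor tags before the tree is even built, which is cheaper on large pages:

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup, SoupStrainer

html = '<p>intro</p><a href="page1.html">one</a><div><a href="page2.html">two</a></div>'
# parse_only=SoupStrainer('a') keeps only <a> tags in the parsed tree
soup = BeautifulSoup(html, 'html.parser', parse_only=SoupStrainer('a'))
links = [urljoin('http://example.com/', a['href']) for a in soup.find_all('a')]
print(links)
# ['http://example.com/page1.html', 'http://example.com/page2.html']
```

urljoin() resolves each relative href against the page's base URL, which is what the complete code above does for the live pages it fetches.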