What is the best approach to parse a non-ordered HTML page with Python? Answers

【Question title】: What is the best approach to parse a non ordered HTML page with python?
【Posted】: 2012-07-02 15:37:03
【Question description】:

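I am trying to parse the following HTML pages with BeautifulSoup (I will be parsing a large number of pages).

I need to save all of the fields on each page, but the fields can change dynamically from page to page.

Here is an example page (Page 1) and a page with a different field order (Page 2).

I wrote the following code to parse a page.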

import requests
from bs4 import BeautifulSoup

PTiD = 7680560

url = "http://patft.uspto.gov/netacgi/nph-Parser?Sect1=PTO1&Sect2=HITOFF&d=PALL&p=1&u=/netahtml/PTO/srchnum.htm&r=1&f=G&l=50&s1=" + str(PTiD) + ".PN.&OS=PN/" + str(PTiD) + "&RS=PN/" + str(PTiD)

res = requests.get(url)

raw_html = res.content

print "Parser Started.. "

bs_html = BeautifulSoup(raw_html, "lxml")

#Initialize all the Search Lists
fonts = bs_html.find_all('font')
para = bs_html.find_all('p')
bs_text = bs_html.find_all(text=True)
onlytext = [x for x in bs_text if x != '\n' and x != ' ']

#Initialize the Indexes
AppNumIndex = onlytext.index('Appl. No.:\n')
FiledIndex = onlytext.index('Filed:\n  ')
InventorsIndex = onlytext.index('Inventors: ')
AssigneeIndex = onlytext.index('Assignee:')
ClaimsIndex = onlytext.index('Claims')
DescriptionIndex = onlytext.index(' Description')
CurrentUSClassIndex = onlytext.index('Current U.S. Class:')
CurrentIntClassIndex = onlytext.index('Current International Class: ')
PrimaryExaminerIndex = onlytext.index('Primary Examiner:')
AttorneyOrAgentIndex = onlytext.index('Attorney, Agent or Firm:')
RefByIndex = onlytext.index('[Referenced By]')

#~~Title~~
for a in fonts:
        # .get() returns None when the attribute is missing, so no has_key() check is needed
        if a.get('size') == '+1':
                d_title = a.string
print "title: " + d_title

#~~Abstract~~~
d_abstract = para[0].string
print "abstract: " + d_abstract

#~~Assignee Name~~
d_assigneeName = onlytext[AssigneeIndex +1]
print "as name: " + d_assigneeName

#~~Application number~~
d_appNum = onlytext[AppNumIndex + 1]
print "ap num: " + d_appNum

#~~Application date~~
d_appDate = onlytext[FiledIndex + 1]
print "ap date: " + d_appDate

#~~ Patent Number~~
d_PatNum = onlytext[0].split(':')[1].strip()
print "patnum: " + d_PatNum

#~~Issue Date~~
d_IssueDate = onlytext[10].strip('\n')
print "issue date: " + d_IssueDate

#~~Inventors Name~~
d_InventorsName = ''
for x in range(InventorsIndex+1, AssigneeIndex, 2):
    d_InventorsName += onlytext[x]
print "inv name: " + d_InventorsName

#~~Inventors City~~
d_InventorsCity = ''

for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsCity += onlytext[x].split(',')[0].strip().strip('(')

d_InventorsCity = d_InventorsCity.strip(',').strip().strip(')')
print "inv city: " + d_InventorsCity

#~~Inventors State~~
d_InventorsState = ''
for x in range(InventorsIndex+2, AssigneeIndex, 2):
    d_InventorsState += onlytext[x].split(',')[1].strip(')').strip() + ','

d_InventorsState = d_InventorsState.strip(',').strip()
print "inv state: " + d_InventorsState

#~~ Assignee City ~~
d_AssigneeCity = onlytext[AssigneeIndex + 2].split(',')[0].strip('\n').strip().strip('(')
print "asign city: " + d_AssigneeCity

#~~ Assignee State~~
d_AssigneeState = onlytext[AssigneeIndex + 2].split(',')[1].strip().strip('\n').strip(')')
print "asign state: " + d_AssigneeState

#~~Current US Class~~
d_CurrentUSClass = ''

for x in range(CurrentUSClassIndex + 1, CurrentIntClassIndex):
    d_CurrentUSClass += onlytext[x]
print "cur us class: " + d_CurrentUSClass

#~~ Current Int Class~~
d_CurrentIntlClass = onlytext[CurrentIntClassIndex +1]
print "cur intl class: " + d_CurrentIntlClass

#~~~Primary Examiner~~~
d_PrimaryExaminer = onlytext[PrimaryExaminerIndex +1]
print "prim ex: " + d_PrimaryExaminer

#~~d_AttorneyOrAgent~~
d_AttorneyOrAgent = onlytext[AttorneyOrAgentIndex +1]
print "agent: " + d_AttorneyOrAgent

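#~~ Referenced by ~~
d_ReferencedBy = ''
# Stop at the end of the list so short pages do not raise IndexError
for x in range(RefByIndex + 2, min(RefByIndex + 400, len(onlytext))):
    if (('Foreign' in onlytext[x]) or ('Primary' in onlytext[x])):
        break
    else:
        d_ReferencedBy += onlytext[x]
print "ref by: " + d_ReferencedBy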

#~~Claims~~
d_Claims = ''

for x in range(ClaimsIndex , DescriptionIndex):
    d_Claims += onlytext[x]
print "claims: " + d_Claims

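I put all of the text on the page into a list (using BeautifulSoup's find_all(text=True)). I then look up the index of each field name, walk the list from that position, and append the members to a string until I reach the index of the next field.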
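
The label-to-value walk described above can also be written without hard-coding the order of the fields. The following is a minimal sketch, not the code from the question: `split_by_labels`, the `LABELS` set and the sample fragment list are hypothetical, and it assumes that every label shows up as its own text fragment once surrounding whitespace is stripped. It is written for Python 3.

```python
# Hypothetical label set; extend it with every field name the pages can contain.
LABELS = {"Inventors:", "Assignee:", "Appl. No.:", "Filed:"}


def split_by_labels(fragments, labels=LABELS):
    """Group text fragments under the label that precedes them.

    Any known label ends the previous field, so the order in which the
    fields appear on a page does not matter.
    """
    fields = {}
    current = None  # label currently being filled; None before the first label
    for raw in fragments:
        text = raw.strip()
        if not text:
            continue  # skip whitespace-only fragments
        if text in labels:
            current = text
            fields[current] = []
        elif current is not None:
            fields[current].append(text)
    # Join the pieces so a value split into several fragments is restored.
    return {label: "".join(parts) for label, parts in fields.items()}


# Made-up fragments in the shape find_all(text=True) might return.
sample = ["Filed:\n  ", "June 1, 2007", "\n", "Appl. No.:\n", "12", "3"]
result = split_by_labels(sample)
print(result)
```

Joining with an empty string suits values such as numbers that were split by markup; for multi-part fields such as inventor lists, a separator like "; " may be a better choice.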

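When I tried the code on several different pages, I noticed that the structure of the list members changes, so I cannot find their indexes in the list. For example, I search for the index of "123", and on some pages it appears in the list as "12", "3".

Can you think of any other way to parse these pages generically?

Thanks.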

【Question discussion】:

As for the pattern, I have updated my post.

Tags: python parsing html-parsing web-crawler beautifulsoup

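【Solution 1】:

I think the simplest solution is to use the pyquery library: http://packages.python.org/pyquery/api.html

You can select elements of the page using jQuery selectors.

【Discussion】:

PyQuery ftw. Painless, fast web scraping :D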

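【Solution 2】:

If you use BeautifulSoup on the DOM <p>123</p> and call find_all(text=True), you get ['123'].

However, with the DOM <p>12<b>3</b></p>, which has the same meaning as before, BeautifulSoup gives you ['12', '3'].

Perhaps you can work out exactly which tag keeps you from getting ['123'], and then ignore or remove that tag first.

Some rough code showing how to remove the tag: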

import re
html = '<p>12<b>3</b></p>'
# Match opening and closing <b> tags only; \b keeps <br>, <body>, etc. from matching
reExp = r'</?b\b[^<>]*>'
print(re.sub(reExp, '', html))
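
As an alternative to deleting tags with a regular expression, the same merging can be done while the text is being collected. The sketch below swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages. `FragmentCollector`, `text_fragments` and the `INLINE_TAGS` set are hypothetical names, and which tags count as inline is an assumption to adjust for the real pages. It is written for Python 3.

```python
from html.parser import HTMLParser

# Assumed set of formatting tags whose boundaries should not split a value.
INLINE_TAGS = {"b", "i", "u", "em", "strong", "font", "span"}


class FragmentCollector(HTMLParser):
    """Collect text fragments, merging text across inline formatting tags."""

    def __init__(self):
        super().__init__()
        self.fragments = []
        self._buffer = []

    def _flush(self):
        text = "".join(self._buffer).strip()
        if text:
            self.fragments.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag not in INLINE_TAGS:
            self._flush()  # a non-inline tag ends the current fragment

    def handle_endtag(self, tag):
        if tag not in INLINE_TAGS:
            self._flush()

    def handle_data(self, data):
        self._buffer.append(data)

    def close(self):
        super().close()
        self._flush()  # keep any trailing text after the last tag


def text_fragments(html):
    collector = FragmentCollector()
    collector.feed(html)
    collector.close()
    return collector.fragments


fragments = text_fragments("<p>12<b>3</b></p><p>456</p>")
print(fragments)
```

With this approach the value "123" arrives as a single fragment, so a lookup such as onlytext.index(...) no longer depends on how the page happens to format it.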

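For the pattern, you can use this:

import re
# Sample input for illustration; replace it with the real page source
your_html = '<TD align=center>VALUES_TO_FIND</TD>'
patterns = r'<TD align=center>(?P<VALUES_TO_FIND>.*?)</TD>'
print(re.findall(patterns, your_html))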
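
When the markup before and after a value is fixed, the pattern can be built from those two literal strings. This is a minimal sketch under that assumption: `extract_between` and the sample HTML are hypothetical, and `re.escape` is used so that any special characters in the surrounding markup are matched literally.

```python
import re


def extract_between(html, before, after):
    """Return every piece of text found between the literal strings
    `before` and `after`, using a non-greedy match."""
    pattern = re.compile(
        re.escape(before) + r"(.*?)" + re.escape(after),
        re.DOTALL | re.IGNORECASE,  # span newlines, ignore tag-name case
    )
    return [match.strip() for match in pattern.findall(html)]


# Made-up snippet for illustration only.
sample_html = (
    "<tr><td align=center>\n 11/123,456 </td>"
    "<TD align=center>RE12345</TD></tr>"
)
values = extract_between(sample_html, "<TD align=center>", "</TD>")
print(values)
```

The non-greedy `(.*?)` stops at the first closing string, which keeps two adjacent cells from being captured as one value.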

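【Discussion】:

What about patterns? Suppose I want to find content by searching for what comes before and after it. For example, if I have the HTML code: Reissue of: VALUES_TO_FIND, and I am sure that the code before and after VALUES_TO_FIND will always be the same, how can I find it with RE? Thanks.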