Scraping a messy source page with Beautiful Soup: answers

【Question title】: Scraping messy source page with Beautiful Soup
【Posted】: 2014-02-08 06:04:53
【Question】:

I'm trying to do some web scraping with Python and Beautiful Soup, but the page source is not the prettiest. The snippet below is a small part of it:

...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...

I want to get the value '2' that follows the string 'birthdayFriends', but I can't figure out how. So far I have written the code below, but it only prints an empty list.

import urllib2
from bs4 import BeautifulSoup

# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='myWebpage',
                          user='myUsername',
                          passwd='myPassword')
opener = urllib2.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib2.install_opener(opener)
page = urllib2.urlopen('myWebpage')

soup = BeautifulSoup(page.read())

bf = soup.findAll('birthdayFriends')

print bf

>> []

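【Comments】:

BeautifulSoup is an HTML parser, and the snippet you show doesn't look like HTML at all. Is it inside a "script" tag?
Yes, it's inside a script tag. Is there anything that can be done then? Maybe a library other than Beautiful Soup?
Well, one way to get data out of a script tag is to use regular expressions: for example, locate the script element with BS, then parse the contents of the script tag with a regex.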
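
The regex suggestion in the last comment could look roughly like the sketch below. To stay self-contained it runs on a hard-coded `script_text` string copied from the question; in the asker's setup that text would presumably be the content of the script tag located with Beautiful Soup. The helper name `extract_int` is hypothetical, and the pattern assumes the value is a plain integer.

```python
import re

# Stand-in for the text of the <script> tag (copied from the question).
# In practice this would come from the tag found with Beautiful Soup.
script_text = '...717301758],"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0,...'

def extract_int(text, key):
    """Return the integer that follows "key": in text, or None if absent."""
    # Match the quoted key, a colon, and an optionally negative integer.
    pattern = r'"' + re.escape(key) + r'"\s*:\s*(-?\d+)'
    match = re.search(pattern, text)
    return int(match.group(1)) if match else None

birthday_friends = extract_int(script_text, "birthdayFriends")
print(birthday_friends)
```

This also shows why the original attempt printed an empty list: 'birthdayFriends' is a key inside JavaScript text, not an HTML tag name, so a tag search for it finds nothing.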

Tags: python-2.7 web-scraping beautifulsoup

【Solution 1】:

Suppose that somewhere in the HTML there is a script tag like the following:

<script>
var x = {"birthdayFriends":2,"lastActiveTimes":{"719317510":0,"719435783":0}}
</script>

Then your code could look something like this:

import json

script = soup.findAll('script')[0]  # or whichever index the script has in the page
# take the JSON part: everything after the first '='
j = script.text.split('=', 1)[1]

# load the JSON string into a dictionary
d = json.loads(j, strict=False)
print d["birthdayFriends"]

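If the content of the script tag is more complex, consider looping over the lines of the script, or see How can I parse Javascript variables using python?

For parsing JavaScript in Python, you can also take a look at pynoceros.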
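
For script content that is a little messier than the example, one possible alternative to splitting on '=' is the standard library's `json.JSONDecoder.raw_decode`, which parses a JSON value starting at a given index and ignores whatever follows it. The sketch below is only an illustration: the `script_text` sample, its "note" key, and the helper name `first_json_object` are hypothetical and not part of the original answer.

```python
import json

# Hypothetical script content: the trailing semicolon and the '=' inside a
# value would both break a plain split('=')[1] followed by json.loads.
script_text = '''
var x = {"birthdayFriends":2,"note":"a=b","lastActiveTimes":{"719317510":0,"719435783":0}};
'''

def first_json_object(text):
    """Decode the first JSON object in text that parses; None if none does."""
    decoder = json.JSONDecoder()
    start = text.find('{')
    while start != -1:
        try:
            # raw_decode returns the parsed value and the index where it ended.
            obj, _end = decoder.raw_decode(text, start)
            return obj
        except ValueError:
            # Not valid JSON at this brace (e.g. a function body); try the next.
            start = text.find('{', start + 1)
    return None

data = first_json_object(script_text)
print(data["birthdayFriends"])
```

This only works when the embedded object is strict JSON (double-quoted keys, no comments); anything closer to full JavaScript would need a real JavaScript parser.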

【Comments】: