Python can't read website HTML code: answers

[Question title]: python can't read website html code
[Posted]: 2015-11-09 12:59:32
[Question description]:

I can't read the HTML code of this website using urllib

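# Python 2 code; the import below assumes BeautifulSoup 4 (the bs4 package)
import urllib
from bs4 import BeautifulSoup

def tests(url):
    # Fetch the page and parse the returned HTML
    response = urllib.urlopen(url)
    soup = BeautifulSoup(response.read())
    # Collect every <a> tag with class "pin-link"
    universities = soup.findAll('a', {'class': 'pin-link'})
    print universities

if __name__ == '__main__':
    tests("https://pinshape.com/shop?page=3&is-free=true&type=-streamable")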

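Is it possible to read the page source?

[Comments]:

It isn't just plain HTML. There is JavaScript that activates a login box, which makes the page harder to parse.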
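
One way to tell whether the links are present in the raw HTML at all is to search the downloaded text for them directly. The following is a minimal sketch using only the Python 3 standard library. The class name pin-link comes from the question; PinLinkCollector, find_pin_links and the sample markup are made up for illustration and are not taken from the real site.

```python
from html.parser import HTMLParser


class PinLinkCollector(HTMLParser):
    """Collect href values of <a> tags whose class list contains 'pin-link'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        # The class attribute may be missing or empty, so fall back to "".
        classes = (attr_map.get("class") or "").split()
        if "pin-link" in classes:
            self.hrefs.append(attr_map.get("href"))


def find_pin_links(html_text):
    """Return the hrefs of all pin-link anchors found in html_text."""
    collector = PinLinkCollector()
    collector.feed(html_text)
    return collector.hrefs


# Hypothetical sample markup, not taken from the real site.
sample = '<div><a class="pin-link big" href="/items/1">One</a><a href="/login">Log in</a></div>'
print(find_pin_links(sample))
```

If a function like this returns an empty list for the downloaded page, the links are probably inserted by JavaScript after the page loads, and a plain HTTP fetch will not see them.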

Tags: python urllib

[Solution 1]:

You can try urllib.request. Here is a snippet from code I'm using; it works like this

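import urllib.request

# Download the page and decode the bytes into text before stripping newlines
with urllib.request.urlopen('https://pinshape.com/shop?page=2') as f:
    data = f.read().decode('utf-8').replace('\n', '')

# "w" creates the file if it is missing; "r+" fails when the file does not exist
with open("TestFile.txt", "w", encoding="utf-8") as myfile:
    myfile.write(data)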

[Comments]:

urllib.request is for Python 3 and above. Is there an equivalent for Python 2.7?
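
On the Python 2.7 question: in Python 2 the comparable function is urllib2.urlopen. Below is a minimal sketch that imports whichever one is available, so the same script runs on both versions. The helper name read_page is hypothetical and only here for illustration.

```python
try:
    # Python 3
    from urllib.request import urlopen
except ImportError:
    # Python 2.7: urllib2 provides a comparable urlopen
    from urllib2 import urlopen


def read_page(url):
    """Return the raw bytes of the response body (hypothetical helper)."""
    response = urlopen(url)
    try:
        return response.read()
    finally:
        # Close explicitly; Python 2 responses are not context managers.
        response.close()
```

The result is bytes on Python 3 and a byte string on Python 2, so it may still need decoding before further text processing.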

[Solution 2]:

The URL you are trying to access is HTTPS (note the "S"), so you need to establish a secure connection. HTTP and HTTPS requests are handled very differently.
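
For what it's worth, Python 3's urllib.request.urlopen accepts https:// URLs directly and performs the TLS handshake itself, so the calling code usually looks the same as for HTTP. The sketch below passes an explicit SSL context only to make that step visible. It assumes Python 3, and make_tls_context and fetch_https are hypothetical helper names.

```python
import ssl
from urllib.request import urlopen


def make_tls_context():
    """Default context: verifies the server certificate and hostname."""
    return ssl.create_default_context()


def fetch_https(url, timeout=10):
    """Hypothetical helper: fetch a page over HTTPS and decode it as UTF-8."""
    with urlopen(url, timeout=timeout, context=make_tls_context()) as response:
        # errors="replace" avoids an exception if the page is not valid UTF-8
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_https("https://pinshape.com/shop?page=2") would then return the page text, assuming the server accepts the request.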

[Comments]:

[Solution 3]:

Instead of urllib, you can try the requests library, which is friendlier for beginners.

For example, with requests you can fetch a web page like this

>>> import requests
>>> r = requests.get("https://pinshape.com/shop?page=2")
>>> r.text
u'<!DOCTYPE html>\n<html class=\'no-js\' lang=\'en\'>\n<head>\n<meta charset=\'utf-8\'> ...

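As a side note, BeautifulSoup is not very fast; you can take a look at

Based on the post above and my own experience, lxml is definitely faster than BeautifulSoup. You can check the following link for an XPath tutorial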

W3School: XPath Tutorial
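
To illustrate the lxml suggestion, here is a minimal sketch of the same pin-link lookup written as an XPath query. It assumes the third-party lxml package is installed. extract_pin_links and the sample markup are hypothetical and not taken from the real site.

```python
from lxml import html


def extract_pin_links(html_text):
    """Return the href of every <a> whose class attribute is exactly 'pin-link'."""
    tree = html.fromstring(html_text)
    return tree.xpath('//a[@class="pin-link"]/@href')


# Hypothetical sample markup, not taken from the real site.
sample = """
<div>
  <a class="pin-link" href="/items/1">One</a>
  <a class="other" href="/login">Log in</a>
</div>
"""

print(extract_pin_links(sample))
```

Note that @class="pin-link" matches only when the attribute is exactly that string; an element carrying several classes would need a contains()-style test instead.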

Hope this helps

[Comments]: