Missing source page information using urllib2

【Question title】: Missing source page information using urllib2
【Posted】: 2014-04-03 06:15:27
【Question description】:

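I am trying to scrape the "game tag" data (not the same as HTML tags) for games listed on the digital game distribution site Steam (store.steampowered.com). As far as I can tell, this information is not available through the Steam API.

Once I have the raw page source, I want to pass it to BeautifulSoup for further parsing, but I have a problem: urllib2 does not seem to read the information I want (requests does not work either), even though it is clearly in the page source when viewed in a browser. For example, I might download the page for the game "7 Days to Die" (http://store.steampowered.com/app/251570/). Viewing the page source in Chrome, I can see the following information about the game's "tags" near the end, starting at line 1615: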

<script type="text/javascript">
      $J( function() {
          InitAppTagModal( 251570,    
          {"tagid":1662,"name":"Survival","count":283,"browseable":true},
          {"tagid":1659,"name":"Zombies","count":274,"browseable":true},
          {"tagid":1702,"name":"Crafting","count":248,"browseable":true},...

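Inside InitAppTagModal there are tags such as "Survival", "Zombies", and "Crafting", which contain the information I want to collect.

But when I fetch the page source with urllib2: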

import urllib2  
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page  
page = urllib2.urlopen(url).read()

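the part of the page source I am interested in is not saved in my "page" variable. Instead, everything below line 1555 is blank until the closing body and html tags, which results in this (including the carriage returns):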

</div><!-- End Footer -->





</body>  
</html>

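The blank space is where the source code I need (along with other code) should be.
I have tried this on several different computers with different installations of Python 2.7 (Windows machines and a Mac), and I get the same result on all of them.

How can I get the data I am looking for?

Thank you for your consideration.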

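【Question comments】:

They may be returning a different page depending on the user agent. Try spoofing it as a browser.
Were you logged in when you viewed the source in your browser? I visited the page in a browser and did not see the game tags.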
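
The user-agent suggestion in the first comment could be tried along these lines. This is a minimal sketch, assuming Python 3, where urllib2 became urllib.request. The `BROWSER_UA` string and the `build_request` helper are hypothetical names introduced for illustration, and nothing in this thread establishes that Steam actually varies the page by user agent. The request is only built here, not sent.

```python
import urllib.request

# Hypothetical browser-like User-Agent string, for illustration only.
BROWSER_UA = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/33.0 Safari/537.36"
)


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries a custom User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


req = build_request("http://store.steampowered.com/app/251570/")

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))

# To actually fetch the page (network access required):
# page = urllib.request.urlopen(req).read()
```

Comparing the page fetched this way with the one fetched without the header would show whether the user agent makes any difference.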

Tags: python web-scraping beautifulsoup urllib2 steam

【Solution 1】:

Well, I don't know if I am missing something, but using requests worked for me:

import requests

# Getting html code
url = "http://store.steampowered.com/app/251570/"
html = requests.get(url).text

Also, the data you want is in JSON format, so it is easy to extract:

import json

# Extract the JavaScript array literal (a JSON-like object)
start_tag = 'InitAppTagModal( 251570,'
end_tag = '],'
startIndex = html.find(start_tag) + len(start_tag)
# Stop just after the closing ']' so the trailing comma is excluded
endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
raw_data = html[startIndex:endIndex]

# Parse the raw data into Python objects
data = json.loads(raw_data)

You will see a nice JSON object like this (this is the information you need, right?):

[
  {
    "count": 283,
    "browseable": true,
    "tagid": 1662,
    "name": "Survival"
 },
 {
    "count": 274,
    "browseable": true,
    "tagid": 1659,
    "name": "Zombies"
 },
 {
   "count": 248,
   "browseable": true,
   "tagid": 1702,
   "name": "Crafting"
 }......
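
The slicing above depends on the first `],` after the marker being the end of the tag list, and it fails silently if the marker is missing. A possible alternative, sketched here in Python 3, lets the JSON decoder find the end of the array itself through `json.JSONDecoder.raw_decode`. The `extract_tags` helper and the `sample` HTML are hypothetical. The sample is modeled on the snippet in the question, and the opening `[` is an assumption inferred from the `],` end tag and the list output shown above.

```python
import json


def extract_tags(html, app_id):
    """Return the tag list passed to InitAppTagModal, or [] if not found."""
    marker = "InitAppTagModal( %s," % app_id
    start = html.find(marker)
    if start == -1:
        return []  # marker missing, e.g. an age check page was returned
    bracket = html.find("[", start + len(marker))
    if bracket == -1:
        return []
    # raw_decode parses one JSON value starting at `bracket` and ignores
    # whatever follows it, so no end tag has to be searched for.
    tags, _end = json.JSONDecoder().raw_decode(html, bracket)
    return tags


# Hypothetical sample modeled on the snippet in the question.
sample = """<script type="text/javascript">
$J( function() {
    InitAppTagModal( 251570,
    [{"tagid":1662,"name":"Survival","count":283,"browseable":true},
     {"tagid":1659,"name":"Zombies","count":274,"browseable":true}],
    [] );
});
</script>"""

print([tag["name"] for tag in extract_tags(sample, 251570)])
# prints ['Survival', 'Zombies']
```

Returning an empty list when the marker is absent makes the age check case easy to notice instead of producing a JSON parsing error.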

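Hope this helps.

Update:

OK, I see your problem now. It seems the issue is with page 224600. In that case, the site asks you to confirm your age before showing the game information. In any case, this is easily solved by posting the age confirmation form. Here is the updated code (I wrote a function):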

import json
import requests

def extract_info_games(page_id):
    # Create a session (cookies persist across its requests)
    session = requests.session()

    # Get the initial HTML
    html = session.get("http://store.steampowered.com/app/%s/" % page_id).text

    # Detect the age check page by looking for its form in the HTML
    if ('<form action="http://store.steampowered.com/agecheck/app/%s/"' % page_id) in html:
        # We were redirected to the age check page; confirm an age with a POST
        post_data = {
            'snr': '1_agecheck_agecheck__age-gate',
            'ageDay': 1,
            'ageMonth': 'January',
            'ageYear': '1960'
        }
        html = session.post('http://store.steampowered.com/agecheck/app/%s/' % page_id, post_data).text

    # Extract the JavaScript array literal (a JSON-like object)
    start_tag = 'InitAppTagModal( %s,' % page_id
    end_tag = '],'
    startIndex = html.find(start_tag) + len(start_tag)
    endIndex = html.find(end_tag, startIndex) + len(end_tag) - 1
    raw_data = html[startIndex:endIndex]

    # Parse the raw data into Python objects
    data = json.loads(raw_data)
    return data

And to use it:

extract_info_games(224600)
extract_info_games(251570)

Enjoy!
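
The function above detects the age check by matching the form's exact HTML. A looser check, sketched here in Python 3, looks at the URL the request ended up at, which requests exposes as `response.url`. The `is_age_gate` helper is hypothetical, and the assumption that the store redirects to a path beginning with `/agecheck/` is inferred only from the form action used in this answer.

```python
from urllib.parse import urlparse


def is_age_gate(final_url):
    """Return True if the final URL looks like Steam's age check page.

    Assumes the age check lives under a path starting with /agecheck/,
    as the form action in the answer suggests.
    """
    return urlparse(final_url).path.startswith("/agecheck/")


print(is_age_gate("http://store.steampowered.com/agecheck/app/224600/"))
print(is_age_gate("http://store.steampowered.com/app/251570/"))
```

With requests this would be called as `is_age_gate(response.url)` after the initial GET.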

【Comments】:

【Solution 2】:

When using urllib2 and read(), you have to read repeatedly in chunks until you reach EOF in order to read the whole HTML source.

import urllib2  
url = "http://store.steampowered.com/app/224600/" #7 Days to Die page
url_handle = urllib2.urlopen(url)
data = ""
while True:
    chunk = url_handle.read()
    if not chunk:
        break
    data += chunk

Another approach is to use the requests module:

import requests
from bs4 import BeautifulSoup

r = requests.get('http://store.steampowered.com/app/251570/')
soup = BeautifulSoup(r.text, 'html.parser')

【Comments】:

This is not true. read() attempts to read the whole page into memory. Read the documentation.
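
The point in this comment can be illustrated without a network connection. The sketch below (Python 3) uses `io.BytesIO` as a stand-in for the response object, which is an assumption made for illustration only: it shows the file-like `read()` convention that HTTP response objects also follow, not an actual HTTP transfer.

```python
import io

# Stand-in for a response body; not a real HTTP response.
body = b"<html>" + b"x" * 100000 + b"</html>"
fake_response = io.BytesIO(body)

# With no size argument, read() returns everything up to EOF in one call.
first = fake_response.read()

# A second call has nothing left to return.
second = fake_response.read()

print(len(first) == len(body))
print(second == b"")
```

Under this convention the loop in the answer runs exactly twice: once to get the whole body and once to get the empty result that ends it.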