【Question Title】: Python Scraping Web with Session Cookie
【Posted】: 2013-09-24 06:51:44
【Question Description】:

Hi, I'm trying to scrape some data from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if the cookie and session data have not been set yet, you get redirected to the base URL (http://www.21cineplex.com/).

I tried it like this:

import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        # Build an opener that keeps cookies across requests
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        # Hit the base URL first so the session cookie gets set
        opener.open(baseurl)

        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        # Grab the movie-list block
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception, e:
        print "Error occurred in main block:", str(e)

However, I still failed to scrape that particular URL.

A quick check shows that the site sets a session ID (PHPSESSID) and copies it into a cookie on the client.

My question is: how do I handle a case like this?
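As an aside for later readers: a Set-Cookie header like the one this site sends can be inspected offline with the standard library. A minimal Python 3 sketch (the session ID value below is a made-up placeholder):

```python
from http.cookies import SimpleCookie

# A Set-Cookie header of the shape the server sends; the session ID
# value here is a made-up placeholder.
header = "PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/"

cookie = SimpleCookie()
cookie.load(header)

print(cookie["PHPSESSID"].value)    # the session ID the server assigned
print(cookie["PHPSESSID"]["path"])  # "/"
```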

PS: I tried to install request (via pip) and got nothing but a 404:

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

【Question Discussion】:

    Tags: python python-2.7 web-scraping session-cookies


    【Solution 1】:

    Thanks to @Chainik, I got it working now. I ended up modifying my code like this:

    import urllib2
    from cookielib import CookieJar

    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    baseurl = "http://www.21cineplex.com/"
    regex = '<ul class="w462">(.*?)</ul>'

    # First request sets the session cookie (PHPSESSID)
    opener.open(baseurl)
    urllib2.install_opener(opener)

    # Second request sends the cookie plus a Referer header
    request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
    request.add_header('Referer', baseurl)

    requestData = urllib2.urlopen(request)
    htmlText = requestData.read()
    

    Once the HTML text has been retrieved, all that's left is parsing its content.
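    The parsing step can be sketched offline. The HTML snippet below is a made-up stand-in for the real page (the actual markup will differ); it just shows what the regex from the code above extracts:

```python
import re

# Made-up HTML of the same shape as the listing page; real markup differs.
htmlText = '<ul class="w462"><li>Movie A</li><li>Movie B</li></ul>'

# re.DOTALL lets ".*?" also match across newlines in a real page
matches = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText, re.DOTALL)
print(matches)  # ['<li>Movie A</li><li>Movie B</li>']
```

    For anything beyond a quick grab, an HTML parser such as BeautifulSoup is more robust than a regex.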

    Cheers

    【Discussion】:

      【Solution 2】:

      Try setting the referring URL; see below.

      Without the referrer URL set (302 redirect):

      $ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
      HTTP/1.1 302 Moved Temporarily                       
      Server: nginx
      Date: Thu, 19 Sep 2013 09:19:19 GMT
      Content-Type: text/html
      Connection: keep-alive
      X-Powered-By: PHP/5.4.17
      Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
      Expires: Thu, 19 Nov 1981 08:52:00 GMT
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Pragma: no-cache
      Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
      Location: http://www.21cineplex.com/
      

      With the referrer URL set (HTTP/200):

      $ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
      HTTP/1.1 200 OK
      Server: nginx
      Date: Thu, 19 Sep 2013 09:19:24 GMT
      Content-Type: text/html
      Connection: keep-alive
      Vary: Accept-Encoding
      X-Powered-By: PHP/5.4.17
      Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
      Expires: Thu, 19 Nov 1981 08:52:00 GMT
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Pragma: no-cache
      Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
      

      To set the referrer URL with urllib, see this post.
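      For reference, here is a sketch in modern Python 3 (urllib.request and http.cookiejar replaced urllib2 and cookielib). The actual network calls are left commented out, since the point is just how the cookie jar and the Referer header are wired up:

```python
import urllib.request
from http.cookiejar import CookieJar

baseurl = "http://www.21cineplex.com/"
target = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"

# Opener that remembers cookies (PHPSESSID, kota) between requests
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Attach the Referer header to the request for the listing page
req = urllib.request.Request(target)
req.add_header("Referer", baseurl)
print(req.get_header("Referer"))  # http://www.21cineplex.com/

# opener.open(baseurl)            # 1) set the session cookie
# html = opener.open(req).read()  # 2) then fetch the listing page
```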

      --ab1

      【Discussion】:

      • Hi, thanks. In theory this should work, but I'm not sure about the expiry part. Anyway, I'll let you know.