【Question Title】: Python Scraping Web with Session Cookie
【Posted】: 2013-09-24 06:51:44
【Question Description】:

Hi, I'm trying to scrape some data from this URL:

http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1

As you may have noticed, if the cookie and session data have not been set yet, you get redirected to the base URL (http://www.21cineplex.com/).

I tried it like this:

import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        # Build an opener that keeps cookies across requests
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        # Hit the base URL first so the session cookie gets set
        opener.open(baseurl)

        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()

        # Grab the movie-list block
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)

        print splitSource

    except Exception, e:
        print "Error occurred in main block:", str(e)

However, I still failed to scrape that particular URL.

A quick check shows that the site sets a session ID (PHPSESSID) and copies it into a cookie on the client.

My question is: how do I handle a case like this?
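As an aside for later readers: a Set-Cookie header like the one this site sends can be inspected offline with the standard library. A minimal Python 3 sketch (the session ID value below is a made-up placeholder):

```python
from http.cookies import SimpleCookie

# A Set-Cookie header of the shape the server sends; the session ID
# value here is a made-up placeholder.
header = "PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/"

cookie = SimpleCookie()
cookie.load(header)

print(cookie["PHPSESSID"].value)    # the session ID the server assigned
print(cookie["PHPSESSID"]["path"])  # "/"
```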

PS: I tried to install request (via pip) and got nothing but a 404:

  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Getting page https://pypi.python.org/simple/
  URLs to search for versions for request:
  * https://pypi.python.org/simple/request/
  Getting page https://pypi.python.org/simple/request/
  Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
  Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
  Could not find any downloads that satisfy the requirement request

Cleaning up...

【Question Discussion】:

    Tags: python python-2.7 web-scraping session-cookies


    【Solution 1】:

    Thanks to @Chainik, I got it working now. I ended up modifying my code like this:

    import urllib2
    from cookielib import CookieJar

    cj = CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
    baseurl = "http://www.21cineplex.com/"
    regex = '<ul class="w462">(.*?)</ul>'

    # First request sets the session cookie (PHPSESSID)
    opener.open(baseurl)
    urllib2.install_opener(opener)

    # Second request sends the cookie plus a Referer header
    request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
    request.add_header('Referer', baseurl)

    requestData = urllib2.urlopen(request)
    htmlText = requestData.read()
    

    Once the HTML text has been retrieved, all that's left is parsing its content.
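    The parsing step can be sketched offline. The HTML snippet below is a made-up stand-in for the real page (the actual markup will differ); it just shows what the regex from the code above extracts:

```python
import re

# Made-up HTML of the same shape as the listing page; real markup differs.
htmlText = '<ul class="w462"><li>Movie A</li><li>Movie B</li></ul>'

# re.DOTALL lets ".*?" also match across newlines in a real page
matches = re.findall(r'<ul class="w462">(.*?)</ul>', htmlText, re.DOTALL)
print(matches)  # ['<li>Movie A</li><li>Movie B</li>']
```

    For anything beyond a quick grab, an HTML parser such as BeautifulSoup is more robust than a regex.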

    Cheers

    【Discussion】:

      【Solution 2】:

      Try setting the referring URL; see below.

      Without the referrer URL set (302 redirect):

      $ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
      HTTP/1.1 302 Moved Temporarily                       
      Server: nginx
      Date: Thu, 19 Sep 2013 09:19:19 GMT
      Content-Type: text/html
      Connection: keep-alive
      X-Powered-By: PHP/5.4.17
      Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
      Expires: Thu, 19 Nov 1981 08:52:00 GMT
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Pragma: no-cache
      Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
      Location: http://www.21cineplex.com/
      

      With the referrer URL set (HTTP/200):

      $ curl -I -e "http://www.21cineplex.com/" "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
      HTTP/1.1 200 OK
      Server: nginx
      Date: Thu, 19 Sep 2013 09:19:24 GMT
      Content-Type: text/html
      Connection: keep-alive
      Vary: Accept-Encoding
      X-Powered-By: PHP/5.4.17
      Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
      Expires: Thu, 19 Nov 1981 08:52:00 GMT
      Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
      Pragma: no-cache
      Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
      

      To set the referrer URL with urllib, see this post.
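      For reference, here is a sketch in modern Python 3 (urllib.request and http.cookiejar replaced urllib2 and cookielib). The actual network calls are left commented out, since the point is just how the cookie jar and the Referer header are wired up:

```python
import urllib.request
from http.cookiejar import CookieJar

baseurl = "http://www.21cineplex.com/"
target = "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"

# Opener that remembers cookies (PHPSESSID, kota) between requests
cj = CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

# Attach the Referer header to the request for the listing page
req = urllib.request.Request(target)
req.add_header("Referer", baseurl)
print(req.get_header("Referer"))  # http://www.21cineplex.com/

# opener.open(baseurl)            # 1) set the session cookie
# html = opener.open(req).read()  # 2) then fetch the listing page
```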

      --ab1

      【Discussion】:

      • Hi, thanks. In theory this should work, but I'm not sure about the expiry part. Anyway, I'll let you know.