【Question title】: Why can't I send cookies to the website with Scrapy or Selenium?
【Posted】: 2018-10-15 23:05:04
【Question description】:

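First of all, forgive me if my question has a very obvious solution. I am new to web scraping and Scrapy. This will be the third website I have scraped (if I can find a solution to the problem below).

What I want to achieve:

Scrape product data from the following website: https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler

However, the products are loaded dynamically according to the city you select after logging in.

So I thought I could log in with my own account, take the cookies from the request headers, and send them with a Scrapy Request. The problem, I guess, is that the website does not accept the cookies I send.

I also tried the same process with Selenium.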

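  1. Opened the page

  2. Logged in

  3. Selected the city

  4. Got the cookies (I also saved them with pickle to use later in Scrapy, but that did not work)

  5. Deleted all cookies from the website

  6. Sent the cookies from step 4 after refreshing the page

Again, the website did not accept the cookies.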
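
For reference, step 4 can be sketched roughly as follows. Selenium's `driver.get_cookies()` returns a list of dicts with `name` and `value` keys, while a Scrapy `Request` accepts a plain `{name: value}` dict through its `cookies` argument. The helper name and the sample values below are hypothetical placeholders, and stripping whitespace is only a precaution against copy-paste artifacts.

```python
import pickle


def selenium_cookies_to_dict(selenium_cookies):
    """Convert a list shaped like driver.get_cookies() into {name: value}."""
    return {c["name"].strip(): c["value"].strip() for c in selenium_cookies}


# Stand-in for driver.get_cookies(); the values are placeholders, not real cookies.
sample = [
    {"name": "JSESSIONID", "value": "placeholder-session ", "path": "/", "httpOnly": True},
    {"name": "district", "value": "placeholder-district", "path": "/"},
]

cookies = selenium_cookies_to_dict(sample)

# Round-trip through pickle, as one would when saving the cookies for a later run.
blob = pickle.dumps(cookies)
restored = pickle.loads(blob)
print(restored)
```

The restored dict is in the shape that `Request(url, cookies=restored)` expects. Whether the server then honours the session is a separate question.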

Note: Since I need to scrape all categories of the website every day, I need a fast scraping solution such as Scrapy. So scraping with Selenium is not an option for me.

Here are some logs and screenshots to support my question.

Request url and method

Request headers and cookie info

Data preview after I logged in and chose a city (note the 'sid:1885'; this is the store id that I want to scrape)

This is the output of the view(response) line from Scrapy

scrapy shell https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler
from scrapy import Request
mycookie = {'JSESSIONID ': 'yndMqXswzQYeUw1CsLtp9A0GBI7ZZE0yI1W0zPk4u4JJxpZES8RF!-1577658491 ', 'NSC_wjq_dt_iuuq_lbohvsvn_lxfc    ': '756ca3c16479c6cdde0681fa2edb1040d4786c1c0a6b2f3116d5fc7f605b4631d4d0f199 ','_dc_gtm_UA-1547459-1  ':'1','_ga':'GA1.3.219867582.1525198968','_gat_UA-1547459-1 ':'1','_gid':'GA1.3.1499846526.1525198968','current-currency    ':'TRY','customer':'ggB2MTVRWi76tWJwj2ZvbDa896G27N3YaH','district':'ac00a4001701ce63cc30626def','first-permission-impression    ':'1','ins-gaSSId   ':'cbf3cd92-3c71-e321-30ac-b2d89dbf3826_1525528747  ','insIsUserLoggedIn    ':'1','insTotalCartAmount187    ':'194.96   ','insUserDetails   ':'%22muharrem.akkaya96%40gmail.com%22  ','insdrSV':'285','scs':'%7B%22t%22%3A1%7D  ','spUID':'15251989688268402d4dc11.7edd9701 ','total-cart-amount    ':'120.78   '}
req = Request('https://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011',cookies = mycookie)
fetch(req)
view(response)

First lines of the log

2018-05-05 19:11:02 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: seleniumcrawler)
2018-05-05 19:11:03 [scrapy.utils.log] INFO: Versions: lxml 4.2.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.9.0, Python 2.7.14 (v2.7.14:84471935ed, Sep 16 2017, 20:19:30) [MSC v.1500 32 bit (Intel)], pyOpenSSL 17.5.0 (OpenSSL 1.1.0h  27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.16299
2018-05-05 19:11:03 [scrapy.crawler] INFO: Overridden settings: {'COOKIES_DEBUG': True, 'NEWSPIDER_MODULE': 'seleniumcrawler.spiders', 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter', 'SPIDER_MODULES': ['seleniumcrawler.spiders'], 'BOT_NAME': 'seleniumcrawler', 'LOGSTATS_INTERVAL': 0, 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36', 'FEED_EXPORT_ENCODING': 'utf-8'}
2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'seleniumcrawler.middlewares.seleniumcrawlerDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-05-05 19:11:03 [scrapy.middleware] INFO: Enabled item pipelines:
['seleniumcrawler.pipelines.JsonPipeline',
 'seleniumcrawler.pipelines.CsvPipeline']
2018-05-05 19:11:03 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-05 19:11:03 [scrapy.core.engine] INFO: Spider opened
2018-05-05 19:11:03 [migros] INFO: Spider opened: migros
2018-05-05 19:11:04 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.sanalmarket.com.tr/kweb/sclist/30011-tum-meyveler>
Set-Cookie: JSESSIONID=cMTfOnFTK1dPSPF2Qdi0d1EqqCXP3HW0S00BwxOwljYjaOMcAOqE!1083904106; path=/; HttpOnly
Set-Cookie: NSC_wjq_dt_iuuq_lbohvsvn_lxfc=0933a3df2cf252c6b4bd9a5784157b04f2a0c6e4b29bff73d54a79d474fdc48e85bdc9ec;path=/;secure;httponly

Remaining lines of the log

2018-05-05 19:19:32 [scrapy.downloadermiddlewares.cookies] DEBUG: Sending cookies to: <GET https://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011>
Cookie: customer=ggB2MTVRWi76tWJwj2ZvbDa896G27N3YaH; insIsUserLoggedIn=1; insUserDetails=%22muharrem.akkaya96%40gmail.com%22; district=ac00a4001701ce63cc30626def; spUID=15251989688268402d4dc11.7edd9701; ins-gaSSId=cbf3cd92-3c71-e321-30ac-b2d89dbf3826_1525528747; insTotalCartAmount187=194.96; _ga=GA1.3.219867582.1525198968; JSESSIONID=yndMqXswzQYeUw1CsLtp9A0GBI7ZZE0yI1W0zPk4u4JJxpZES8RF!-1577658491; current-currency=TRY; first-permission-impression=1; insdrSV=285; _gid=GA1.3.1499846526.1525198968; _gat_UA-1547459-1=1; total-cart-amount=120.78; _dc_gtm_UA-1547459-1=1; scs=%7B%22t%22%3A1%7D; NSC_wjq_dt_iuuq_lbohvsvn_lxfc=756ca3c16479c6cdde0681fa2edb1040d4786c1c0a6b2f3116d5fc7f605b4631d4d0f199; NSC_wjq_dt_iuuq_lbohvsvn_lxfc=0933a3df2cf252c6b4bd9a5784157b04f2a0c6e4b29bff73d54a79d474fdc48e85bdc9ec; JSESSIONID=cMTfOnFTK1dPSPF2Qdi0d1EqqCXP3HW0S00BwxOwljYjaOMcAOqE!1083904106
2018-05-05 19:19:32 [scrapy.downloadermiddlewares.cookies] DEBUG: Received cookies from: <200 https://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011>
Set-Cookie: JSESSIONID=ZvvfjByrDdrOTwmJX7QaaU0jWBv5nxKgfXvPVfvwSsCn63bkGH3m!-1577658491; path=/; HttpOnly
2018-05-05 19:19:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.sanalmarket.com.tr/kweb/getProductList.do?shopCategoryId=30011> (referer: None)

So how can I overcome this cookie situation?

【Question comments】:

    Tags: python cookies web-scraping scrapy scrapy-spider


    【Solution 1】:

    Cookies appear to be sent correctly by your Scrapy code. As far as I can tell, the problem is the cookie value for your JSESSIONID key.
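
    One way to inspect what was actually sent is to parse the Cookie header printed by COOKIES_DEBUG. The "Sending cookies" line in the question's log lists JSESSIONID twice (the pasted value and the one the site set on the first fetch), and which of the two the server uses is not known here. Below is a minimal standard-library sketch for spotting such repeats; the function names are hypothetical and the sample header uses placeholder values.

```python
from collections import defaultdict


def cookie_values_by_name(cookie_header):
    """Group the values of a Cookie request header by cookie name."""
    values = defaultdict(list)
    for pair in cookie_header.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # ignore fragments that contain no '='
            values[name].append(value)
    return dict(values)


def duplicated_names(cookie_header):
    """Return the cookie names that occur more than once, in first-seen order."""
    grouped = cookie_values_by_name(cookie_header)
    return [name for name, vals in grouped.items() if len(vals) > 1]


# Placeholder header with the same shape as the one in the log.
header = "customer=abc; JSESSIONID=pasted-value; current-currency=TRY; JSESSIONID=server-value"
print(duplicated_names(header))
```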

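    When I created my own session, set my city to "AFYON-Akmescit", and used that session ID, I got sid 1885 for AFYON-Akmescit as expected. But when I tried yours, or any other corrupted session ID (corrupted by randomly changing one character), I received sid 193. So somehow sid 193 is the default, and what the site does not accept is your JSESSIONID value, not the cookie information itself.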
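
    The same check can be scripted so that a rejected session is noticed early. This is only a sketch: it assumes the product-list response is JSON with a "sid" field somewhere in it (the question's data preview suggests so, but the exact layout is not shown), and both the helper names and the sample payloads are hypothetical.

```python
import json


def find_key(node, key):
    """Depth-first search for the first value stored under `key` in parsed JSON."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def session_accepted(body, expected_sid):
    """True if the response body reports the store id we expect."""
    sid = find_key(json.loads(body), "sid")
    return sid is not None and int(sid) == expected_sid


# Hypothetical payloads: one for an accepted session, one for the default store.
sample_ok = '{"data": {"sid": 1885, "products": []}}'
sample_default = '{"data": {"sid": 193, "products": []}}'
print(session_accepted(sample_ok, 1885))
print(session_accepted(sample_default, 1885))
```

    In a spider, a False result would be the signal to log in again instead of continuing with data for the default store.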

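    Anyway, as a side note to your question, you certainly should not rely on a session ID as a reliable source of identity when scraping; you may also want to automate the authentication process.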
