【Question title】: Scrapy authentication
【Posted】: 2013-11-11 06:49:44
【Question】:

I am trying to use Scrapy in a project. I cannot get past the authentication system at https://text.westlaw.com/signon/default.wl?RS=ACCS10.10&VR=2.0&newdoor=true&sotype=mup. To understand the problem, I wrote a simple request handler.

# Python 2 (cookielib and urllib2 were renamed in Python 3)
import cookielib
import urllib2

# Keep cookies across requests and send a browser-like User-Agent
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36'),]
url = 'https://text.westlaw.com/signon/default.wl?RS=ACCS10.10&VR=2.0&newdoor=true&sotype=mup'
r = opener.open(url)

# Save the raw response so it can be inspected
with open('code.html', 'wb') as f:
    f.write(r.read())

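The returned HTML does not contain any form elements. Does anyone know how to convince the server that I am not a fake browser, so that I can proceed with authentication?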
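One way to narrow this down is to count the form elements in the saved response rather than eyeballing it. The sketch below is a Python 3 take on the handler above, using only the standard library; `FormCounter`, `count_form_elements` and `make_opener` are hypothetical names introduced here, not part of the original code. If the count comes back as zero, one possible explanation (an assumption, not verified for this site) is that the form is built by JavaScript or served only after a redirect, in which case changing the User-Agent alone would not help.

```python
import http.cookiejar
import urllib.request
from html.parser import HTMLParser


class FormCounter(HTMLParser):
    """Counts <form> and <input> start tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.forms = 0
        self.inputs = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this method.
        if tag == "form":
            self.forms += 1
        elif tag == "input":
            self.inputs += 1


def count_form_elements(html):
    """Return (number of forms, number of inputs) found in an HTML string."""
    parser = FormCounter()
    parser.feed(html)
    parser.close()
    return parser.forms, parser.inputs


def make_opener(user_agent):
    """Build a cookie-aware opener that sends the given User-Agent."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    opener.addheaders = [("User-Agent", user_agent)]
    return opener, jar
```

To use it, fetch the page with `opener.open(url).read().decode("utf-8", errors="replace")` and pass the result to `count_form_elements`; checking `len(jar)` afterwards shows whether the server set any cookies.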

【Comments】:

Tags: python authentication web-scraping scrapy


【Solution 1】:

You can use InitSpider, which lets you run an extra processing step, such as logging in with a custom handler, before the crawl starts:

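import logging

from scrapy.http import FormRequest, Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import Rule
from scrapy.spiders.init import InitSpider


class CrawlpySpider(InitSpider):

    #...

    # Placeholder domain. `self` does not exist in the class body,
    # so the rule below refers to this name directly.
    allowed_domains = ["domain.tld"]

    # Make sure to add the logout page to the denied list.
    # Note: rules are only processed by CrawlSpider; a plain InitSpider
    # does not follow them on its own.
    rules = (
        Rule(
            LinkExtractor(
                allow_domains=allowed_domains,
                unique=True,
                deny=(r'logout\.php',),
            ),
            callback='parse',
            follow=True
        ),
    )

    def init_request(self):
        """This function is called before crawling starts."""

        # Do a login
        return Request(url="http://domain.tld/login.php", callback=self.login)

    def login(self, response):
        """Generate a login request."""

        # from_response pre-fills the fields found in the page's form;
        # formdata overrides or adds to them.
        return FormRequest.from_response(
            response,
            formdata={
                "username": "admin",
                "password": "very-secure",
                "required-field": "my-value"
            },
            method="POST",
            callback=self.check_login_response
        )

    def check_login_response(self, response):
        """Check the response returned by a login request to see if we are
        successfully logged in.
        """
        # response.text is a str; response.body is bytes on Python 3
        if "incorrect password" not in response.text:
            # Now the crawling can begin.
            logging.info('Login successful')
            return self.initialized()
        else:
            # Something went wrong, we couldn't log in, so nothing happens.
            logging.error('Unable to login')

    def parse(self, response):
        """Your stuff here"""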

I also just implemented a working example that does exactly what you are aiming for. Take a look: https://github.com/cytopia/crawlpy
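The success check in `check_login_response` is a substring test on the page body, which is easy to get subtly wrong: on Python 3 the raw body is `bytes`, and error messages vary in capitalisation. A minimal sketch of a more forgiving check follows; `login_failed` and its `markers` parameter are hypothetical names introduced here, and the marker strings are assumptions that need to be replaced with whatever the target site actually prints on a failed login.

```python
def login_failed(body, markers=("incorrect password",), encoding="utf-8"):
    """Return True if any failure marker appears in the response body.

    Accepts either bytes (a raw response body) or str (decoded text).
    The comparison is case-insensitive.
    """
    if isinstance(body, bytes):
        # Undecodable bytes are replaced rather than raising an error.
        text = body.decode(encoding, errors="replace")
    else:
        text = body
    lowered = text.lower()
    return any(marker.lower() in lowered for marker in markers)
```

Inside the spider this could be called as `login_failed(response.body)`, returning `self.initialized()` only when it is false. Matching on a failure message is a heuristic; checking for something that only appears when logged in, such as a logout link, may be more reliable on some sites.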

【Comments】:
