【Question title】: Web scraping with Python requests: downloading PDF files behind authentication
【Posted】: 2019-02-22 07:19:09
【Question】:

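I'm new to this and am trying to scrape a website. Some of the HTML text is publicly accessible, but I also need to download some PDF files from the site. I have login details for it.

These are the approaches I have tried.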

#Attempt 1:

import requests, lxml.html
from bs4 import BeautifulSoup

s = requests.session()

login = s.get('https://www.cottongrower.com.au/Member-Login.php')
login_html = lxml.html.fromstring(login.text)
hidden_inputs = login_html.xpath(r'//form//input[@type="hidden"]')
form = {x.attrib["name"]: x.attrib["value"] for x in hidden_inputs}
form['email'] = 'xxxxxxxx'
form['password'] = 'xxxxx'
form['contact'] = 'Log In'

s.post('https://www.cottongrower.com.au/Member-Login.php',data = form)
r = s.get('https://www.cottongrower.com.au/Content.php')

# check whether the PDF links changed from 'signupredirect.php' to the PDF URL
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')


for tag in tags:
     print(tag.get('href'))

#Attempt 2:

import requests
from bs4 import BeautifulSoup
from requests.auth import HTTPBasicAuth
auth = HTTPBasicAuth('xxxxxxx', 'xxxxxx')
s = requests.session()
login = s.post('https://www.cottongrower.com.au/Member-Login.php',auth=auth )
r = s.get('https://www.cottongrower.com.au/Content.php')

# check whether the PDF links changed from 'signupredirect.php' to the PDF URL
data = r.text
soup = BeautifulSoup(data, 'lxml')
tags = soup.find_all('a')
for tag in tags:
     print(tag.get('href'))

Before logging in, the link I need to scrape looks like this:

<td align="left" valign="top"><a target="_blank" href="signupredirect.php" class="issue_link">Increasing gossypol containing glands in cotton can boost plants natural defences</a><span class="smalltext"> &nbsp; (141kb)</span> </td>

After logging in, it should look like this:

<a target="_blank" href="images/articles/38ef71991e839fad5437d77bd5297e99.pdf" class="issue_link">Increasing gossypol containing glands in cotton can boost plants natural defences</a>

With both attempts, the printed links still point to signupredirect.php.

Any help would be appreciated.

【Comments】:

    Tags: python-3.x authentication python-requests


    【Answer 1】:

    You are making this too complicated; try this code. (I don't have an account, so it is untested.)

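    from requests import Session

    username = "username"
    password = "password"

    s = Session()
    # Visit the home page first so the session picks up any cookies.
    s.get("https://www.cottongrower.com.au/")

    # Form fields sent with the login request.
    data = {
        "email": username,
        "password": password,
        "button": ">",
        "redirecttocontent": "1",
        "website": "1",
    }

    # Post the credentials to ValidateLogin.php, not to Member-Login.php.
    s.post("https://www.cottongrower.com.au/ValidateLogin.php", data=data)

    # If the login succeeded, this page should now contain the real PDF links.
    r = s.get('https://www.cottongrower.com.au/Content.php')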
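
With a logged-in session, the PDF links on Content.php still have to be collected and fetched. The following is a minimal sketch of that step, not something tested against the site. The helper names `extract_pdf_links` and `download_pdf` are made up for illustration. Parsing uses only the standard library so the sketch has no extra dependencies (BeautifulSoup, as used in the question, would work just as well). It assumes the links are relative `href` values ending in `.pdf`, as in the post-login example in the question.

```python
import os
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkParser(HTMLParser):
    """Collect href values of <a> tags whose path ends in .pdf."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and urlparse(href).path.lower().endswith(".pdf"):
            self.hrefs.append(href)


def extract_pdf_links(html, base_url):
    """Return absolute URLs for every PDF link found in html."""
    parser = PdfLinkParser()
    parser.feed(html)
    # Relative links such as images/articles/x.pdf are resolved against the page URL.
    return [urljoin(base_url, href) for href in parser.hrefs]


def download_pdf(session, url, dest_dir="."):
    """Stream one PDF to dest_dir using an already logged-in requests session."""
    filename = os.path.basename(urlparse(url).path)
    path = os.path.join(dest_dir, filename)
    response = session.get(url, stream=True)
    response.raise_for_status()
    with open(path, "wb") as fh:
        for chunk in response.iter_content(chunk_size=8192):
            fh.write(chunk)
    return path
```

Assuming `s` and `r` are the session and response from the answer, usage would look like `for url in extract_pdf_links(r.text, r.url): download_pdf(s, url)`. If the list comes back empty and the page still contains signupredirect.php links, the login most likely did not succeed.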
    

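    【Comments】:

    • Thanks Liam, it worked like a charm. My mistake was posting the login details to the Member-Login page instead of directly to the ValidateLogin page. I would never have figured that out without you pointing it out. Thanks.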
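
The mistake described in this comment (posting to the page that displays the form rather than to the URL the form submits to) can usually be caught by reading the form's `action` attribute. Below is a rough sketch of that check using only the standard library. `find_forms` is a hypothetical helper, and the sample markup in the usage note is invented for illustration, since the thread does not show the real Member-Login.php source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FormParser(HTMLParser):
    """Record each <form>'s action, method and named <input> fields."""

    def __init__(self):
        super().__init__()
        self.forms = []
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": {},
            }
            self.forms.append(self._current)
        elif tag == "input" and self._current is not None and attrs.get("name"):
            self._current["fields"][attrs["name"]] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None


def find_forms(html, page_url):
    """Return the forms on a page with their action resolved to an absolute URL."""
    parser = FormParser()
    parser.feed(html)
    for form in parser.forms:
        # An empty action means the form submits back to the page itself.
        form["action"] = urljoin(page_url, form["action"])
    return parser.forms
```

For example, given invented markup such as `<form action="ValidateLogin.php" method="post">...</form>` fetched from Member-Login.php, `find_forms(login.text, login.url)[0]["action"]` would resolve to the ValidateLogin.php URL, which is where the credentials should be posted. The browser's network tab shows the same information when the form is submitted by hand.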