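The Course Hero front end sends a POST request to https://www.coursehero.com/api/v2/search and renders the search results with JavaScript.
So you can simply fetch the JSON yourself with an HTTP request. Full example below. I don't have a paid account, so the last part of the code is commented out and is only pseudocode.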
import requests

# Browser-like User-Agent header sent with the request
headers = {
    'User-Agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.3987.78 Safari/537.36'
}

# JSON body of the search request
payload = {
    "client": "web",
    "query": "scrape",
    "view": "list_w",
    "filters": {
        "type": ["document"],
        "doc_type": [],
    },
    "sort": "relevancy",
    "limit": 20,
    "offset": 0,
    "callout_types": ["textbook"]
}

response = requests.post(
    'https://www.coursehero.com/api/v2/search/', headers=headers, json=payload)
data = response.json()

# Print the title and page URL of every result
for result in data['results']:
    url = f"https://www.coursehero.com/file/{result['document']['db_filename']}"
    print(f"'{result['core']['title']}' URL: {url}")

    # Login and extract download URL from HTML
    #
    # from bs4 import BeautifulSoup
    # response = requests.get(url, headers=headers)
    # soup = BeautifulSoup(response.content, 'lxml')
    # download_url = soup.select('...')
    #
    # OR
    #
    # Download file via direct HTTP request if URL is returned via XHR request
    #
    # download_url = 'https://www.coursehero.com/...'
    # requests.get(download_url, headers=headers)
Output:
'Week 6 - Web Scraping.pptx' URL: https://www.coursehero.com/file/38748386
'Python web_scraping train.docx' URL: https://www.coursehero.com/file/70193727
'ScrAPES Book' URL: https://www.coursehero.com/file/6219095
'scrape.py' URL: https://www.coursehero.com/file/43396377
'scrAPES - Rain didn't Boost Lakes' URL: https://www.coursehero.com/file/10042922
'orders cannot scrape.docx' URL: https://www.coursehero.com/file/75016027
...
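
The request body above carries `limit` and `offset`, which suggests the endpoint pages its results. Here is a minimal sketch of how further pages might be requested and parsed. It assumes that `offset` advances in steps of `limit` and that every response keeps the shape used above (`results`, `core.title`, `document.db_filename`). I have not verified the paging behaviour against the live API, and `build_payload` and `extract_links` are hypothetical helper names, not part of any Course Hero client.

```python
import copy

# Same request body as in the answer above; only query/limit/offset vary per page.
BASE_PAYLOAD = {
    "client": "web",
    "query": "scrape",
    "view": "list_w",
    "filters": {"type": ["document"], "doc_type": []},
    "sort": "relevancy",
    "limit": 20,
    "offset": 0,
    "callout_types": ["textbook"],
}


def build_payload(query, page, limit=20):
    """Return a request body for the given zero-based page (assumed paging scheme)."""
    payload = copy.deepcopy(BASE_PAYLOAD)  # do not mutate the shared template
    payload["query"] = query
    payload["limit"] = limit
    payload["offset"] = page * limit  # assumption: offset counts results, not pages
    return payload


def extract_links(data):
    """Turn a decoded JSON response into a list of (title, url) tuples."""
    links = []
    for result in data.get("results", []):
        title = result.get("core", {}).get("title")
        filename = result.get("document", {}).get("db_filename")
        if title is None or filename is None:
            continue  # skip results that lack the fields used above
        links.append((title, f"https://www.coursehero.com/file/{filename}"))
    return links
```

To use it, post `build_payload("scrape", page)` with `requests.post(..., json=...)` exactly as in the example above, then pass `response.json()` to `extract_links`. Stopping when a page returns no results would be a reasonable end condition, again assuming the API behaves that way.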