How to get the destination URL from a URL that redirects multiple times in Python? Answers

【Question title】: How to get destination url from multiple time redirecting url in python?
【Posted】: 2019-12-15 18:05:55
【Question description】:

I am trying to build a web crawler. I want to get the destination URL from a query URL, but it redirects several times.

This is my URL:

https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO

The destination URL should be:

https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO

But I get https://www.jw.org/ins/library/videos/?item=pub-jwb_201812_16_VIDEO&appLanguage=INS as the redirected URL instead.

I tried this code:

import requests

url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'

s = requests.get(url)
print(s.url)

【Question comments】:

Tags: python-3.x redirect web-scraping python-requests

【Solution 1】:

The last redirect is performed by JavaScript.

It is not a server-side (HTTP) redirect, so requests does not follow it.
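One way to see why the last step has to happen in the browser: the URL the asker wants differs from the one requests reports mainly in its fragment (the part after `#`). Fragments are not sent to the server as part of an HTTP request, so that final address is most plausibly set by a script in the page. Below is a minimal sketch using only the standard library; both URLs are copied from the question, and the variable names are illustrative.

```python
from urllib.parse import urlsplit

# Both URLs are copied from the question.
reached = "https://www.jw.org/ins/library/videos/?item=pub-jwb_201812_16_VIDEO&appLanguage=INS"
wanted = "https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO"

reached_parts = urlsplit(reached)
wanted_parts = urlsplit(wanted)

# Same scheme, host and path: both URLs point at the same page on the server.
same_page = (
    (reached_parts.scheme, reached_parts.netloc, reached_parts.path)
    == (wanted_parts.scheme, wanted_parts.netloc, wanted_parts.path)
)

print("same page:", same_page)
print("query seen by the server:", reached_parts.query)
print("fragment handled in the browser:", wanted_parts.fragment)
```

If the two URLs name the same page and differ only in query versus fragment, an HTTP client alone is unlikely to produce the second one. Something has to execute the page's script.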

You can use Selenium to get the URL:

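from selenium import webdriver
import time

# Start a Chrome session controlled by Selenium.
browser = webdriver.Chrome()
url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'
browser.get(url)
# Give the page's JavaScript time to perform the client-side redirect.
time.sleep(5)
# current_url reflects the address after the script has navigated.
print(browser.current_url)
browser.quit()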

Output:

https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO

If you are building a scraper, I suggest looking at scrapy-splash https://github.com/scrapy-plugins/scrapy-splash or requests-html https://github.com/psf/requests-html

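【Comments】:

Thanks for your solution. The problem with Selenium is that it needs a webdriver such as Chromedriver or Firefoxdriver. If I package this script as an executable and install it on another computer, that computer needs the webdriver too. Is there any solution other than Selenium?
You could look at PyQt, in particular QWebEngineView. There are many examples showing how to use it for basic rendering of web pages that rely on JavaScript.
Thank you very much, I will check it out.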

【Solution 2】:

You can do this very easily with requests:

import requests

# This doi.org link redirects to the page of the research paper with the given DOI.
destination = requests.get("http://doi.org/10.1080/07435800.2020.1713802")
print(destination.url)
# This returns "https://www.tandfonline.com/doi/full/10.1080/07435800.2020.1713802", the redirect target of the initial doi.org link.

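【Comments】:

I tried this with a URL that has two redirects and it does not work; it gives you the URL of the first redirect, not the last one. TL;DR: it only works for one redirect.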
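For what it is worth, HTTP-level redirects are normally followed across several hops (requests follows up to 30 by default and records the intermediate responses in `response.history`), so a stop after the first hop more likely means a later hop is done by JavaScript, as in Solution 1. The sketch below checks the multi-hop case offline. It is an illustration under assumptions, not the setup from the question: it uses the standard library's `urllib` instead of requests so that it has no dependencies, and the local server, the `/a -> /b -> /c -> /final` chain, and all names in it are hypothetical.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical redirect chain for the demo: /a -> /b -> /c -> /final
REDIRECTS = {"/a": "/b", "/b": "/c", "/c": "/final"}


class ChainHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        target = REDIRECTS.get(self.path)
        if target is not None:
            # Server-side redirect: status 302 plus a Location header.
            self.send_response(302)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"done"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def final_path_after_redirects(start_path="/a"):
    """Serve the chain on localhost and return the path urllib ends up at."""
    server = HTTPServer(("127.0.0.1", 0), ChainHandler)  # port 0: any free port
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        base = f"http://127.0.0.1:{server.server_port}"
        with urllib.request.urlopen(base + start_path, timeout=5) as response:
            # geturl() is the URL of the last response, after all HTTP redirects.
            return response.geturl().replace(base, "")
    finally:
        server.shutdown()
        server.server_close()


print(final_path_after_redirects())  # prints "/final"
```

All three hops are followed because each one is an HTTP response with a `Location` header. A hop implemented in page script would not show up here, and would need a browser engine such as Selenium.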