Answers: get the URL from an href with BeautifulSoup, without the redirect link

【Question title】: Get the URL from an href with BeautifulSoup, without the redirect link
【Posted】: 2022-01-22 19:01:31
【Question】:

I want to get only the target URL, not the redirect link. My code is:

html = '<a class="css-10y60kr" href="/biz_redir?url=https%3A%2F%2Faceplumbingandrooter.com&amp;cachebuster=1642876680&amp;website_link_type=website&amp;src_bizid=hqjCHBGnEj4nECnLJBvjQw&amp;s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f" rel="noopener nofollow" role="link" target="_blank">https://aceplumbingandrooter.c…</a>'

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')

The tag's href attribute contains:

href="/biz_redir?url=https%3A%2F%2Faceplumbingandrooter.com&amp;cachebuster=1642876680&amp;website_link_type=website&amp;src_bizid=hqjCHBGnEj4nECnLJBvjQw&amp;s=2caa69aa7350cca9ad00f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f"

I only want the domain of the linked site: aceplumbingandrooter.com

Tags: python web-scraping beautifulsoup

【Solution 1】:

You can use the urllib.parse module. The URL you are after is one of the query parameters of /biz_redir, so the first step is to extract the 'url' parameter.

from urllib.parse import urlparse, parse_qs

# The href value as BeautifulSoup returns it (&amp; already decoded to &)
url = '/biz_redir?url=https%3A%2F%2Faceplumbingandrooter.com&' \
      'cachebuster=1642876680&website_link_type=website&' \
      'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
      'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'

# parse_qs maps each query parameter to a list of percent-decoded values
parsed_url = urlparse(url)
print(parse_qs(parsed_url.query)['url'][0])

This gives you the full URL, https://aceplumbingandrooter.com. You can then parse that URL again to get its netloc. The complete code:

from urllib.parse import urlparse, parse_qs

# The href value as BeautifulSoup returns it (&amp; already decoded to &)
url = '/biz_redir?url=https%3A%2F%2Faceplumbingandrooter.com&' \
      'cachebuster=1642876680&website_link_type=website&' \
      'src_bizid=hqjCHBGnEj4nECnLJBvjQw&s=2caa69aa7350cca9ad00' \
      'f1fd1d5a6346f341dd43e1ede874aa2eaa94d6a3458f'

# Step 1: pull the percent-decoded 'url' parameter out of the redirect link
parsed_url = urlparse(url)
target = parse_qs(parsed_url.query)['url'][0]
# Step 2: parse the target URL itself and keep only its host part
print(urlparse(target).netloc)

Output:

aceplumbingandrooter.com
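
The two steps above can be wrapped in a small helper so that links without a 'url' parameter do not raise a KeyError. This is only a minimal sketch: the function name redirect_target_domain and its param argument are hypothetical, not part of the original answer, and the html.unescape step is an assumption for the case where the href was copied from raw HTML source and still contains &amp;.

```python
from html import unescape
from typing import Optional
from urllib.parse import parse_qs, urlparse


def redirect_target_domain(href: str, param: str = "url") -> Optional[str]:
    """Return the domain of the URL carried in a redirect link's query string.

    Hypothetical helper for illustration; returns None when the link has
    no such parameter or the parameter holds no host.
    """
    # Turn any leftover HTML entities (&amp;) back into plain characters.
    query = urlparse(unescape(href)).query
    # parse_qs percent-decodes values and returns a list for each key.
    values = parse_qs(query).get(param)
    if not values:
        return None
    return urlparse(values[0]).netloc or None


href = "/biz_redir?url=https%3A%2F%2Faceplumbingandrooter.com&amp;cachebuster=1642876680"
print(redirect_target_domain(href))      # aceplumbingandrooter.com
print(redirect_target_domain("/about"))  # None
```

With the question's soup object, the call would be redirect_target_domain(soup.a['href']). BeautifulSoup already decodes entities in attribute values, so the unescape step matters mainly for raw strings. It could misread a parameter whose name matches an entity name (such as lt), so drop it if that is a concern.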
