Using Python to pull href tags: answers

【Question title】: Using Python to pull href tags
【Posted】: 2018-12-25 15:09:27
【Question description】:

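I want to pull the href links of the products on this web page. The code below extracts every href on the page except the ones for the listed products.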

from bs4 import BeautifulSoup
import requests

url = "https://www.neb.com/search#t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending"

response = requests.get(url)

data = response.text

soup = BeautifulSoup(data, 'lxml')

tags = soup.find_all('a')

for tag in tags:
    print(tag.get('href'))

【Question comments】:

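A quick look at the network inspector tells me that this record set contains 13323 records. Also, the results are formatted on the client side from a JSON payload, so I think you will have better luck working with the structured format. Either way, you will have to handle pagination.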
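
The pagination mentioned in this comment can be sketched as follows. This is only an illustration: the page size of 100 is an assumption, `page_offsets` is a hypothetical helper, and how each offset maps onto a request parameter depends on the API and is not shown here.

```python
def page_offsets(total, page_size):
    """Return the zero-based start index of every page needed to cover `total` records."""
    return list(range(0, total, page_size))


# 13323 is the record count quoted in the comment; 100 is an assumed page size.
offsets = page_offsets(13323, 100)
print(len(offsets), offsets[0], offsets[-1])
```

Each offset would then be sent with one request until the whole record set has been fetched.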

Tags: python beautifulsoup href

【Solution 1】:

I think you want something like this:

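from bs4 import BeautifulSoup
import urllib.request

# Request two result offsets. Note: everything after '#' is a URL fragment,
# which the client does not send to the server.
for numb in ('1', '100'):
    resp = urllib.request.urlopen("https://www.neb.com/search#first=" + numb + "&t=_483FEC15-900D-4CF1-B514-1B921DD055BA&sort=%40ftitle51880%20ascending")
    # Name the parser explicitly so bs4 does not have to guess one.
    soup = BeautifulSoup(resp, 'html.parser', from_encoding=resp.info().get_param('charset'))

    # Print the href of every <a> tag that has one.
    for link in soup.find_all('a', href=True):
        print(link['href'])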

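【Comments】:

【Solution 2】:

Try to verify whether the product hrefs are actually in the response you receive. I suggest this because if part of the product list is generated dynamically, for example by AJAX, a simple GET on the main page will not return it.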

Print the response and check whether the products are present in the HTML you received.
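
That check can be sketched roughly as below. It uses the standard library's `html.parser` so it runs without extra installs; BeautifulSoup's `find_all('a', href=True)` would do the same job. The `/products/` marker and both HTML snippets are made-up stand-ins, not taken from the real site. In practice you would pass `response.text` instead.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def product_links_present(html, marker):
    """Return True if any href in `html` contains `marker`."""
    collector = LinkCollector()
    collector.feed(html)
    return any(marker in href for href in collector.hrefs)


# Hypothetical HTML: what a plain GET might return versus the rendered page.
static_html = '<a href="/about-neb">About</a><div id="search-results"></div>'
rendered_html = static_html + '<a href="/products/m0491-example">Example</a>'

print(product_links_present(static_html, "/products/"))
print(product_links_present(rendered_html, "/products/"))
```

If the check fails on the raw response but the links are visible in the browser, the products are most likely being added by JavaScript after the page loads.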

【Comments】:

【Solution 3】:

The products are loaded dynamically through a REST API. The URL looks like this: https://international.neb.com/coveo/rest/v2/?sitecoreItemUri=sitecore%3A%2F%2Fweb%2F%7BA1D9D237-B272-4C5E-A23F-EC954EB71A26%7D%3Flang%3Den%26ver%3D1&siteName=nebinternational

Loading this response will give you the URLs.
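
As a rough sketch of pulling the URLs out of such a response: the field names `results`, `clickUri` and `uri` are assumptions about the payload's shape, and the sample JSON is invented for illustration. Inspect the real response in the network inspector and adjust the keys accordingly.

```python
import json


def extract_result_uris(payload):
    """Return one URL per entry under 'results', preferring 'clickUri' over 'uri'."""
    uris = []
    for item in payload.get("results", []):
        uri = item.get("clickUri") or item.get("uri")
        if uri:
            uris.append(uri)
    return uris


# Made-up payload standing in for the decoded JSON body of the REST response.
sample_payload = json.loads(
    '{"totalCount": 2, "results": ['
    '{"clickUri": "https://example.com/products/a"},'
    '{"uri": "https://example.com/products/b"}]}'
)

print(extract_result_uris(sample_payload))
```

With `requests`, the payload would typically come from `response.json()` on the REST URL rather than from a literal string.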

Next time, check your browser's network inspector to see whether any part of the page is loaded dynamically (or use Selenium).

【Comments】: