Can't print all the results all at once: answers

【Question title】: Can't print all the results all at once
【Posted】: 2021-01-08 21:14:04
【Question description】:

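I'm trying to create a script that fetches the title and description of products from this webpage. The landing page shows a single product. However, in the left-hand area there is a tab titled 17 products, and I'm trying to grab their title and description as well. The tab named 17 products doesn't actually do anything, because the 17 products are already in the page source.

I can fetch all 18 products in the following way. I had to use print twice to print all 18 products. If I append the results and print them together, the script will look messier.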

import requests
from bs4 import BeautifulSoup

link = 'https://www.3m.com/3M/en_US/company-us/all-3m-products/~/3M-Cubitron-II-Cut-Off-Wheel/?N=5002385+3290927385&preselect=8710644+3294059243&rt=rud'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")

    product_title = soup.select_one("h1[itemprop='name']").text
    specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30] #truncated for brevity
    print(product_title,specification)

    for additional_link in list(set([item.get("href") for item in soup.select(".js-row-results .allModelItemDetails a.SNAPS--actLink")])):
        res = s.get(additional_link)
        sauce = BeautifulSoup(res.text,"lxml")
        product_title = sauce.select_one("h1[itemprop='name']").text
        specification = sauce.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30] #truncated for brevity
        print(product_title,specification)

How can I print all the titles and descriptions of the products together?

【Question comments】:

Tags: python python-3.x web-scraping python-requests

【Solution 1】:

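Not sure I understand your question. You want to print all the titles and descriptions together, but you don't want to append them to a list because the script would get messy?

One option is to use a dictionary instead of a list. Define a dictionary near the top of your code, after the imports: products = {}, and replace your print statements with products[product_title] = specification.

After that, you can print the dictionary neatly with the pprint module, which is part of Python's standard library, like this: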

import pprint
some_random_dict = {'a': 123, 'b': 456}
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(some_random_dict)  # pretty-print the whole dictionary in one call

Replace some_random_dict with products.

If you care about tidiness, I would also refactor this bit into a separate function:
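
To see the pattern end to end without hitting the network, here is a minimal sketch. The titles and descriptions in `scraped` are made-up placeholders standing in for whatever the selectors return; only the dictionary and `pprint` usage is the point.

```python
import pprint

# Hypothetical (title, description) pairs standing in for scraped values.
scraped = [
    ("Cut-Off Wheel A", "Fast cutting, long life"),
    ("Cut-Off Wheel B", "For stainless steel"),
]

products = {}
for product_title, specification in scraped:
    # Store the pair instead of printing inside the loop.
    products[product_title] = specification

# pformat returns the same text that pprint would print.
report = pprint.pformat(products, indent=4)
print(report)
```

One caveat: a dictionary keeps a single value per key, so two products that share a title would overwrite each other. If that can happen, a list of (title, description) tuples may be the safer container.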

    r = s.get(link)
    soup = BeautifulSoup(r.text,"lxml")
    product_title = soup.select_one("h1[itemprop='name']").text
    specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30] #truncated for brevity

Maybe something like this:

def get_product(sess, link):
    """Fetch one product page; return its soup and a {title: description} dict."""
    info = {}
    r = sess.get(link)  # use the session passed in as an argument
    soup = BeautifulSoup(r.text,"lxml")
    product_title = soup.select_one("h1[itemprop='name']").text
    specification = soup.select_one(".MMM--tabHeader:contains('Product Details') + .tabContentContainer").get_text(strip=True)[:30] #truncated for brevity
    info[product_title] = specification
    return soup, info

Your code would then look like this:

import pprint

products = {}
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36'
    # Landing page: keep its soup, because it contains the links to the other products.
    soup, product_info = get_product(s, link)
    products.update(product_info)

    for additional_link in list(set([item.get("href") for item in soup.select(".js-row-results .allModelItemDetails a.SNAPS--actLink")])):
        sauce, product_info = get_product(s, additional_link)
        products.update(product_info)

pprint.pprint(products)  # print all titles and descriptions in one go

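You should always avoid pasting the same piece of code in several places. Refactoring that bit into a separate function helps readability and maintainability in the long run.

【Comments】: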
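
As a rough illustration of that point, the sketch below calls one helper for both the landing page and the additional pages. To keep it runnable offline, the HTTP request and HTML parsing are replaced by a lookup in `FAKE_PAGES`, a made-up stand-in for the site; `parse_product` and the example.com URLs are hypothetical names, not part of the original script.

```python
# Made-up stand-in for the site: URL -> (title, description, linked URLs).
FAKE_PAGES = {
    "https://example.com/main": (
        "Main product",
        "Main description",
        ["https://example.com/a", "https://example.com/b"],
    ),
    "https://example.com/a": ("Product A", "Description A", []),
    "https://example.com/b": ("Product B", "Description B", []),
}


def parse_product(link):
    """Return ({title: description}, linked URLs) for one page.

    In a real script this is where the session request and the
    BeautifulSoup selectors would go.
    """
    title, description, links = FAKE_PAGES[link]
    return {title: description}, links


products = {}
info, additional_links = parse_product("https://example.com/main")
products.update(info)

# The same helper handles every additional page; no parsing code is copied.
for additional_link in sorted(set(additional_links)):
    info, _ = parse_product(additional_link)
    products.update(info)

# Everything is printed in one place, after all pages are collected.
for title, description in products.items():
    print(title, description)
```

With the parsing logic in one function, a changed selector only has to be fixed once.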