【Question Title】:Python Web Scraping List from Webpage to Text File
【Posted】:2017-09-22 22:58:10
【Question】:

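I took a Python course in my third year of college but have forgotten a lot of it. For work, I have been asked to find a way to scrape some data from a website. I have a Python file that does something similar for a different site I use. Here is the code: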

from bs4 import BeautifulSoup
import io
import requests

# Download the page and parse it with the standard-library HTML parser
soup = BeautifulSoup(
    requests.get("https://servicenet.dewalt.com/Parts/Search?searchedNumber=N365763").content,
    "html.parser",
)

# Take the text of the first "td a" link in each table row
rows = soup.select("#customerList tbody tr")
with io.open("data.txt", "w", encoding="utf-8") as f:
    f.write(u", ".join([row.select_one("td a").text for row in rows]))

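This gets the list of model numbers for a power tool part from that site. Now I want to do basically the same thing on another site, but I don't know where to start. The site is https://www.powertoolreplacementparts.com/briggs-stratton-part-finder/#/s/BRG//498260/1/y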

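You click the "Where Used" button and a list of model numbers appears: "093412-0011-01", "093412-0011-02", and so on. I want to write those numbers to a text file, separated by commas, just as in my first script ("093412-0011-01, 093412-0011-02,..."). Any help is greatly appreciated. Thanks!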

【Comments】:

    Tags: python web-scraping beautifulsoup python-requests


    【Solution 1】:

    I used Selenium to navigate the page.

    Code:

    import io
    import time
    from selenium import webdriver
    from bs4 import BeautifulSoup
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support.ui import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    
    # Selenium initialization
    driver = webdriver.Chrome()
    driver.get('https://www.powertoolreplacementparts.com/briggs-stratton-part-finder/#/s/BRG//498260/1/y')
    wait = WebDriverWait(driver, 30)
    driver.maximize_window()
    
    # Locating the "Where Used" Button
    driver.find_element(By.XPATH, "//input[@id='aripartsSearch_whereUsedBtn_0'][@class='ariPartListWhereUsed ariImageOverride'][@title='Where Used']").click()
    wait.until(EC.visibility_of_element_located((By.XPATH, '//*[@id="ari_searchResults_Grid"]/ul')))
    
    
    # Initialize BS4 and look for the "Show More" button
    soup = BeautifulSoup(driver.page_source, "html.parser")
    show = soup.find('li', {'class': 'ari-search-showMore'})
    
    # Keep clicking the "Show More" button until it is no longer visible
    while show is not None:
        time.sleep(2)  # give the page time to load the next batch
        hidden_element = driver.find_element(By.CSS_SELECTOR, '#ari-showMore-unhide')
        if hidden_element.is_displayed():
            print("Element found")
            hidden_element.click()
            # Re-parse so the loop condition reflects the current page
            soup = BeautifulSoup(driver.page_source, "html.parser")
            show = soup.find('li', {'class': 'ari-search-showMore'})
        else:
            print("Element not found")
            break
    
    # Re-parse once more so the rows loaded by the clicks are included
    soup = BeautifulSoup(driver.page_source, "html.parser")
    
    # Write the parsed data to the text file "data.txt"
    with io.open("data.txt", "w", encoding="utf-8") as f:
        rows = soup.find_all('li', {'class': 'ari-ModelByPrompt'})
        for row in rows:
            # Drop the spaces and newlines surrounding each model number
            part = row.text.replace(" ", "").replace("\n", "")
            print(part)
            f.write(part + ",")
    

    Output:

    Element found
    Element found
    Element found
    Element not found
    093412-0011-01
    093412-0011-02
    093412-0015-01
    093412-0039-01
    093412-0060-01
    093412-0136-01
    093412-0136-02
    093412-0139-01
    093412-0150-01
    093412-0153-01
    093412-0154-01
    093412-0169-01
    093412-0169-02
    093412-0172-01
    093412-0174-01
    093412-0315-A1
    093412-0339-A1
    093412-0360-A1
    093412-0636-A1
    093412-0669-A1
    093412-1015-E1
    093412-1039-E1
    093412-1060-E1
    093412-1236-E1
    093412-1236-E2
    093412-1253-E1
    093412-1254-E1
    093412-1269-E1
    093412-1274-E1
    093412-1278-E1
    093432-0035-01
    093432-0035-02
    093432-0035-03
    093432-0036-01
    093432-0036-03
    093432-0036-04
    093432-0037-01
    093432-0038-01
    093432-0038-03
    093432-0041-01
    093432-0140-01
    093432-0145-01
    093432-0149-01
    093432-0152-01
    093432-0157-01
    093432-0158-01
    093432-0160-01
    093432-0192-B1
    093432-0335-A1
    093432-0336-A1
    093432-0337-A1
    093432-0338-A1
    093432-1035-B1
    093432-1035-E1
    093432-1035-E2
    093432-1035-E4
    093432-1036-B1
    093432-1036-E1
    093432-1037-E1
    093432-1038-B1
    093432-1038-E1
    093432-1240-B1
    093432-1240-E1
    093432-1257-E1
    093432-1258-E1
    093432-1280-B1
    093432-1280-E1
    093432-1281-B1
    093432-1281-E1
    093432-1282-B1
    093432-1282-E1
    093432-1286-B1
    093452-0049-01
    093452-0141-01
    093452-0168-01
    093452-0349-A1
    093452-1049-B1
    093452-1049-E1
    093452-1049-E5
    093452-1241-E1
    093452-1242-E1
    093452-1277-E1
    093452-1283-B1
    093452-1283-E1
    09A412-0267-E1
    09A413-0201-E1
    09A413-0202-E1
    09A413-0202-E2
    09A413-0202-E3
    09A413-0203-E1
    09A413-0522-E1
    09K432-0022-01
    09K432-0023-01
    09K432-0024-01
    09K432-0115-01
    09K432-0116-01
    09K432-0116-02
    09K432-0117-01
    09K432-0118-01
    120502-0015-E1
    

    File contents:

    093412-0011-01,093412-0011-02,093412-0015-01,093412-0039-01,093412-0060-01,093412-0136-01,093412-0136-02,093412-0139-01,093412-0150-01,093412-0153-01,093412-0154-01,093412-0169-01,093412-0169-02,093412-0172-01,093412-0174-01,093412-0315-A1,093412-0339-A1,093412-0360-A1,093412-0636-A1,093412-0669-A1,093412-1015-E1,093412-1039-E1,093412-1060-E1,093412-1236-E1,093412-1236-E2,093412-1253-E1,093412-1254-E1,093412-1269-E1,093412-1274-E1,093412-1278-E1,093432-0035-01,093432-0035-02,093432-0035-03,093432-0036-01,093432-0036-03,093432-0036-04,093432-0037-01,093432-0038-01,093432-0038-03,093432-0041-01,093432-0140-01,093432-0145-01,093432-0149-01,093432-0152-01,093432-0157-01,093432-0158-01,093432-0160-01,093432-0192-B1,093432-0335-A1,093432-0336-A1,093432-0337-A1,093432-0338-A1,093432-1035-B1,093432-1035-E1,093432-1035-E2,093432-1035-E4,093432-1036-B1,093432-1036-E1,093432-1037-E1,093432-1038-B1,093432-1038-E1,093432-1240-B1,093432-1240-E1,093432-1257-E1,093432-1258-E1,093432-1280-B1,093432-1280-E1,093432-1281-B1,093432-1281-E1,093432-1282-B1,093432-1282-E1,093432-1286-B1,093452-0049-01,093452-0141-01,093452-0168-01,093452-0349-A1,093452-1049-B1,093452-1049-E1,093452-1049-E5,093452-1241-E1,093452-1242-E1,093452-1277-E1,093452-1283-B1,093452-1283-E1,09A412-0267-E1,09A413-0201-E1,09A413-0202-E1,09A413-0202-E2,09A413-0202-E3,09A413-0203-E1,09A413-0522-E1,09K432-0022-01,09K432-0023-01,09K432-0024-01,09K432-0115-01,09K432-0116-01,09K432-0116-02,09K432-0117-01,09K432-0118-01,120502-0015-E1,
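
The file contents above end with a trailing comma and have no space after each separator, whereas the question asked for the form "093412-0011-01, 093412-0011-02,...". A minimal sketch of one way to get that form is to clean the values first and join them once. The helper names `clean_model` and `write_models` are hypothetical and are not part of the answer's code.

```python
import io


def clean_model(text):
    """Remove every whitespace character from a scraped list item."""
    return "".join(text.split())


def write_models(raw_rows, path="data.txt"):
    """Write the cleaned values to `path`, separated by ", ".

    Returns the cleaned list so the caller can inspect or print it.
    """
    models = [clean_model(raw) for raw in raw_rows]
    # Skip items that were empty or contained only whitespace
    models = [model for model in models if model]
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(u", ".join(models))
    return models
```

With the Selenium script above, this could be called as `write_models(row.text for row in rows)` in place of the final loop.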
    

    【Comments】:

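    • This is exactly what I wanted, but when I run your code I get an error: "FileNotFoundError: [WinError 2] The system cannot find the file specified" and another that says "selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH."
    • I think I need to install the "webdriver" and "time" packages, but when I try I get the error "Could not find a version that satisfies the requirement time (from versions: ) No matching distribution found for time". All other packages are up to date.
    • Hi Thomas, I believe the "time" library is built into Python 3. The only thing you need to install is Selenium, using pip: "pip install selenium". You also need to download chromedriver from link and add the executable to your PATH (I usually put it in the project directory). See this link for overall step-by-step instructions.
    • Ali, I added the file to C:\Users\Thomas\PycharmProjects\PowerToolSuperstore but I still get the error "FileNotFoundError: [WinError 2] The system cannot find the file specified". Did I put it in the wrong place?
    • Are you sure you downloaded the Windows version of the chromedriver executable and extracted chromedriver.exe from chromedriver_win32.zip into the project location? Please also refer to this answer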
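
The errors discussed in these comments come down to Selenium not finding the chromedriver executable. As a rough diagnostic sketch, the standard library can report where (or whether) the file is visible before Selenium is started. The function name `locate_chromedriver` is hypothetical, and checking the project directory before PATH is an assumption based on the commenter's habit of keeping the driver next to the script.

```python
import os
import shutil


def locate_chromedriver(project_dir="."):
    """Return the path of a chromedriver executable, or None if none is found.

    The project directory is checked first, then the directories on PATH.
    """
    exe_name = "chromedriver.exe" if os.name == "nt" else "chromedriver"

    # 1) A copy placed in the project directory
    candidate = os.path.join(os.path.abspath(project_dir), exe_name)
    if os.path.isfile(candidate):
        return candidate

    # 2) Anything reachable through PATH (returns None when nothing matches)
    return shutil.which("chromedriver")


if __name__ == "__main__":
    path = locate_chromedriver()
    print(path or "chromedriver not found in the project directory or on PATH")
```

If a path is found, it can be handed to Selenium explicitly: Selenium 3.x accepted `webdriver.Chrome(executable_path=path)`, while Selenium 4 expects `webdriver.Chrome(service=Service(path))` with `Service` imported from `selenium.webdriver.chrome.service`.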
    【Solution 2】:

    1) Open Chrome at https://www.powertoolreplacementparts.com/briggs-stratton-part-finder/#/s/BRG//498260/1/y

    2) Open the Network tab of the developer tools

    3) Click "Where Used"

    4) Look at the API call to the "GetModelSearchModelsForPrompt" endpoint

    5) Copy the URL https://partstream.arinet.com/Search/GetModelSearchModelsForPrompt?cb=jsonp1506134982932&arib=BRG&arisku=498260&modelName=&responsive=true&arik=AjydG6MJi4Y9noWP0hFB&aril=en-US&ariv=https%253A%252F%252Fwww.powertoolreplacementparts.com%252Fbriggs-stratton-part-finder%252F

    6) Open it with requests. You will need some clever thinking to parse the response, because it returns HTML inside "JSON".
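
Step 6 can be sketched as follows. The answer does not show the response body, so everything about its layout here is an assumption: the sketch supposes a JSONP wrapper named after the `cb` parameter around a JSON value whose strings contain the HTML. Rather than guessing key names, it walks every string in the decoded JSON and picks out text shaped like the model numbers listed in Solution 1. The `sample` payload and all function names are hypothetical.

```python
import json
import re

# Model numbers in Solution 1's output look like "093412-0011-01" or "09A413-0202-E1"
MODEL_RE = re.compile(r"\b[0-9A-Z]{6}-[0-9]{4}-[0-9A-Z]{2}\b")


def strip_jsonp(text):
    """Remove a `callback( ... )` wrapper and return the decoded JSON value."""
    start = text.index("(") + 1
    end = text.rindex(")")
    return json.loads(text[start:end])


def iter_strings(value):
    """Yield every string found anywhere inside a decoded JSON value."""
    if isinstance(value, str):
        yield value
    elif isinstance(value, dict):
        for item in value.values():
            yield from iter_strings(item)
    elif isinstance(value, list):
        for item in value:
            yield from iter_strings(item)


def extract_models(jsonp_text):
    """Return model numbers in order of first appearance, without duplicates."""
    models = []
    for chunk in iter_strings(strip_jsonp(jsonp_text)):
        for match in MODEL_RE.findall(chunk):
            if match not in models:
                models.append(match)
    return models


# Hypothetical response; the real payload layout was not shown in the answer
sample = (
    'jsonp1506134982932({"html": "<ul>'
    '<li class=\\"ari-ModelByPrompt\\"> 093412-0011-01 </li>'
    '<li class=\\"ari-ModelByPrompt\\"> 09A413-0202-E1 </li>'
    '</ul>"})'
)
print(", ".join(extract_models(sample)))
```

Against the live endpoint, `sample` would be replaced by `requests.get(url).text`, where `url` is the address copied in step 5. If the real payload turns out to have a stable key for the HTML, parsing that fragment with BeautifulSoup would be more precise than the regular expression used here.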

    【Comments】:
