Downloading JavaScript-loaded audio using Selenium and Python

【Question title】: Downloading JavaScript-loaded audio using Selenium and Python
【Posted】: 2019-07-13 05:57:26
【Question】:

I am trying to write a script with Python and Selenium that automatically downloads text and audio files from a website.

Website: https://learn.dict.naver.com/conversation#/korean-en/20190713 (yyyymmdd)

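from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

url = 'https://learn.dict.naver.com/conversation#/korean-en/20190713'

options = Options()
options.add_argument('-headless')  # run Firefox without a visible window

# Selenium 4 syntax; geckodriver must be discoverable (e.g. on PATH)
driver = webdriver.Firefox(options=options)
driver.get(url)

# Wait for the first play button instead of sleeping for a fixed 3 seconds
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.CLASS_NAME, 'btn_listen')))
button.click()  # for the first one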

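The audio plays/loads on click, but I don't know how to "capture" the file as it loads so that I can download it.

For example, the first play button loads this URL: https://dict-dn.pstatic.net/v?_lsu_sa_=3348a15dcd343766a69b01513e9444f36d1462055f0edfbd60a21c73bbe96741685d375f6b45b579a9df6f95d82950485fa22dddfc987cc04ba7a344d3daaff10b8f5ed218b169623e2b926412981ebffcd2ee2a025bbfea806ec1ee58c519fab30368be2e72c258347eb029646cd69ca0c931d102f1fcdef76df1a85dc49c52df2a6431603057d8f62c0c613ec86b1c

Copying that into a browser loads an audio file that can be downloaded manually. I would like to download it automatically (bonus points for being able to rename it dynamically).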
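A minimal sketch of the "download and rename" part, assuming the audio URL is already known and that the server returns the file for a plain GET request (not verified for this site). The helper names `audio_filename` and `download_audio`, the naming scheme, and the `.mp3` extension are illustrative assumptions; only the standard library is used:

```python
import shutil
import urllib.request
from pathlib import Path


def audio_filename(date: str, index: int, ext: str = ".mp3") -> str:
    """Build a name such as '20190713_01.mp3' (hypothetical naming scheme)."""
    return f"{date}_{index:02d}{ext}"


def download_audio(url: str, dest: Path, user_agent: str = "Mozilla/5.0") -> Path:
    """Stream the response body at `url` into `dest` and return the path."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    # Create the target folder if it does not exist yet
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(request, timeout=30) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)  # copy in chunks, not all at once
    return dest
```

It could be called as `download_audio(audio_url, Path("audio") / audio_filename("20190713", 1))`, where `audio_url` is whatever URL the play button loaded.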

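I have tried a few options.set_preference() settings, but they mostly seem to apply to files that are meant to be downloaded (i.e. a "click here to download" button), not to files that are merely played.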
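One possible way to "capture" the URL of a played file, sketched under assumptions: after the click, the browser's Resource Timing entries can be read through `driver.execute_script` and filtered by host. Whether the audio request actually shows up there depends on the browser and on how the page loads it. The host `dict-dn.pstatic.net` is taken from the example URL in the question, and the helper names are hypothetical:

```python
from urllib.parse import urlparse

AUDIO_HOST = "dict-dn.pstatic.net"  # host of the example audio URL in the question

# JavaScript run in the page: names (URLs) of all resources loaded so far.
RESOURCE_NAMES_JS = (
    "return performance.getEntriesByType('resource').map(e => e.name);"
)


def pick_audio_urls(resource_urls, host=AUDIO_HOST):
    """Keep URLs served from `host`, preserving order and dropping duplicates."""
    seen = set()
    picked = []
    for url in resource_urls:
        if urlparse(url).hostname == host and url not in seen:
            seen.add(url)
            picked.append(url)
    return picked


def captured_audio_urls(driver, host=AUDIO_HOST):
    """Ask a Selenium WebDriver for loaded resources and filter them by host."""
    return pick_audio_urls(driver.execute_script(RESOURCE_NAMES_JS), host)
```

Calling `captured_audio_urls(driver)` shortly after `click()` would then return the candidate audio URLs, if any were recorded.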

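Thanks!

【Question comments】:

Tags: javascript python selenium selenium-webdriver web-scraping

【Solution 1】:

You can use requests to download the mp3 files, and also get other useful information about the sentences on the page as text.
The code below is an example for https://learn.dict.naver.com/conversation#/korean-en/20190713. Inspect the JSON in the data variable to see what information is available to you.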

import json

import requests

# JSONP callback name sent to the gateway; the response comes back as callback(...)
callback = 'angular.callbacks._0'

headers = {
    'Referer': 'https://learn.dict.naver.com/conversation',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/75.0.3770.100 Safari/537.36',
    'DNT': '1',
}
params = {'callback': callback}

with requests.Session() as session:
    response = session.get('https://gateway.dict.naver.com/krdict/kr/koen/today/20190713/conversation.dict',
                           headers=headers, params=params)
    response.raise_for_status()

    # Unwrap the JSONP: keep what lies between the first "(" and the last ")".
    # (str.lstrip/rstrip strip sets of characters, not prefixes, so slicing is safer.)
    text = response.text
    data = json.loads(text[text.index('(') + 1:text.rindex(')')])["data"]
    sentences = data["sentences"]

    for sentence in sentences:
        audio_id = sentence["id"]
        sentence_pron_file = sentence["sentence_pron_file"]

        # Exchange the pronunciation file path for a downloadable audio URL
        response = session.post(f'https://learn.dict.naver.com/dictPronunciation.dict?filePaths={sentence_pron_file}')
        audio_url = response.json()["url"][0]
        audio_file = session.get(audio_url)

        # Save each sentence's audio under its id
        with open(f'./{audio_id}.mp3', 'wb') as f:
            f.write(audio_file.content)
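
The gateway response is JSONP, i.e. JSON wrapped in a call to the callback name. Trimming the wrapper with `str.lstrip`/`str.rstrip` is fragile, because those methods remove any of the given characters rather than a prefix or suffix. A stricter unwrapping might look like the sketch below; `unwrap_jsonp` is a hypothetical helper, and the sample payload is made up for illustration:

```python
import json


def unwrap_jsonp(text: str, callback: str):
    """Parse 'callback({...})' (optionally followed by ';') into a Python object."""
    body = text.strip()
    if body.endswith(";"):
        body = body[:-1].rstrip()
    prefix = f"{callback}("
    if not (body.startswith(prefix) and body.endswith(")")):
        raise ValueError("response is not JSONP for the expected callback")
    # Drop the 'callback(' prefix and the closing ')'
    return json.loads(body[len(prefix):-1])


# Made-up payload, shaped like a JSONP response
sample = 'angular.callbacks._0({"data": {"sentences": []}});'
print(unwrap_jsonp(sample, "angular.callbacks._0")["data"])  # {'sentences': []}
```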

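【Comments】:

That is... perfect. Thank you so much. Now I have to try to understand all of this, haha! How did you know about this conversation endpoint? It makes everything so much easier!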
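
To get oriented in the parsed response, one option is to print an outline of its keys and value types before picking fields. The sketch below is a hypothetical helper, not part of the answer; the sample dict reuses only the field names seen in the answer's code (`sentences`, `id`, `sentence_pron_file`) with placeholder values:

```python
def describe(value, indent=0, max_items=1):
    """Return lines outlining the keys and value types of nested JSON data."""
    pad = "  " * indent
    lines = []
    if isinstance(value, dict):
        for key, item in value.items():
            lines.append(f"{pad}{key}: {type(item).__name__}")
            if isinstance(item, (dict, list)):
                lines.extend(describe(item, indent + 1, max_items))
    elif isinstance(value, list):
        # Only look at the first few items; list entries usually share a shape
        for item in value[:max_items]:
            lines.extend(describe(item, indent, max_items))
    return lines


# Placeholder values; the real response contains more fields than shown here
sample = {"sentences": [{"id": "placeholder-id", "sentence_pron_file": "placeholder/path"}]}
print("\n".join(describe(sample)))
```

Running `describe(data)` on the real `data` variable would list every available field in the same way.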