Scraping the mp3 file from a Flash player after the conversion

【Question title】: Scraping the mp3 file from flash player after the conversion
【Posted】: 2015-04-27 21:29:44
【Question】:

The page has a textarea and a Synthesize button. The markup looks like this:

        <textarea id="ttstext" name="text" style="font-size: 130%; width: 100%;
        height: 120px; padding: 5px;"></textarea>
        ...
        <div id="audioplayer">
            <script>
                create_playback();
            </script><audio autoplay="" autobuffer="" controls=""></audio>
        </div>
        <input id="commitbtn" value="Synthesize" type="submit">

When I click the Synthesize button, the page's HTML changes as follows (an audio player is created):

<div id="audioplayer" style="display: block;"><embed width="370" height="20" flashvars="height=20&amp;width=370&amp;type=mp3&amp;file=http://services.abc.xyz.mp3&amp;showstop=true&amp;usefullscreen=false&amp;autostart=true" allowfullscreen="true" allowscriptaccess="always" quality="high" name="mpl" id="mpl" style="undefined" src="/demo/mediaplayer.swf" type="application/x-shockwave-flash"></div>

I want to get the generated mp3 file from Python code.

What I have tried so far:

#!/usr/bin/env python
# encoding: utf-8
from __future__ import print_function, unicode_literals
from contextlib import closing
from selenium.common.exceptions import TimeoutException
from selenium.webdriver import Firefox
from selenium.webdriver.support.ui import WebDriverWait
import BeautifulSoup
import time

url = "http://www..."

def textToSpeech():
  with closing(Firefox()) as browser:
    try:
      browser.get(url)
    except TimeoutException:
      print("timeout")
    # Type the text and submit it for synthesis.
    browser.find_element_by_id("ttstext").send_keys("Hello.")
    button = browser.find_element_by_id("commitbtn")
    button.click()
    time.sleep(10)
    # Wait until the player container is present in the DOM.
    WebDriverWait(browser, timeout=100).until(
      lambda x: x.find_element_by_id('audioplayer'))
    src = browser.page_source
    return src

def getAudio(source):
  soup = BeautifulSoup.BeautifulSoup(source)
  audio = soup.find("div", {"id": "audioplayer"})
  return audio.string


if __name__ == "__main__":
  print(getAudio(textToSpeech()))

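The key to success is getting the URL of the generated mp3 file. I don't know how to wait for the script to change the HTML (the contents of <div id="audioplayer">). My code returns None because it reads the result before the change has happened.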

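【Comments】:

Is the URL already in the div from the moment the div appears?
@Lawrence No. The URL of the mp3 file is generated from the textarea only after the Synthesize button is clicked.
And creating the mp3 could take a long time, right?
@Lawrence Usually a few seconds.

Tags: python selenium web-scraping beautifulsoup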

【Solution 1】:

When an element that already exists is changed later, waiting for the element to be present is not enough:

WebDriverWait(browser, timeout=100).until(
      lambda x: x.find_element_by_id('audioplayer'))

Instead, you need to wait until it satisfies some condition, using an ExpectedCondition. Here is something to get you started (untested):

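from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

wait_text = 'file=http://'
# driver is the WebDriver instance (browser in the question);
# "myDynamicElement" is a placeholder for the ID of the element to watch.
element = WebDriverWait(driver, 10).until(
        EC.text_to_be_present_in_element((By.ID, "myDynamicElement"), wait_text)
    )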

You can also see all the expected conditions here: http://selenium-python.readthedocs.org/en/latest/api.html?highlight=text_to_be_present_in_element#module-selenium.webdriver.support.expected_conditions
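
One caveat: text_to_be_present_in_element compares against the element's visible text, whereas in the question's markup the URL sits in the flashvars attribute of the embed element. The following is a minimal, untested-in-a-browser sketch of a custom wait condition that checks an attribute instead. The names attribute_contains and wait_for_player are hypothetical helpers made up for illustration, and the ID "mpl" and the text 'file=http://' are taken from the question's HTML.

```python
class attribute_contains(object):
    """Wait condition: truthy once the located element's attribute contains `text`.

    Hypothetical helper for illustration, not part of Selenium.
    """

    def __init__(self, locator, attribute, text):
        self.locator = locator      # e.g. ("id", "mpl"), the same tuple as (By.ID, "mpl")
        self.attribute = attribute  # e.g. "flashvars"
        self.text = text            # e.g. "file=http://"

    def __call__(self, driver):
        # If the element does not exist yet, find_element raises
        # NoSuchElementException, which WebDriverWait ignores by default
        # and treats as "keep polling".
        element = driver.find_element(*self.locator)
        value = element.get_attribute(self.attribute)
        if value and self.text in value:
            return element
        return False


def wait_for_player(driver, timeout=30):
    # Imported here so the condition class itself has no Selenium dependency.
    from selenium.webdriver.support.ui import WebDriverWait
    return WebDriverWait(driver, timeout).until(
        attribute_contains(("id", "mpl"), "flashvars", "file=http://"))
```

Under these assumptions, calling wait_for_player(browser) after button.click() would return the embed element once its flashvars contain the file URL, or raise TimeoutException if that never happens within the timeout.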

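【Discussion】:

Now it waits, but it ends with selenium.common.exceptions.TimeoutException. Are you sure the text is present in the element? It is actually part of an attribute of a child element.
I'm not sure; as I said, I haven't tested it. You need to work out which text you are waiting for, and I don't have a running example here.
I don't know where text_to_be_present_in_element searches for wait_text. This didn't work, but it helped me understand the logic. I tried the first wait with find_element_by_id('mpl'), and it worked.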
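
Once the embed element with id "mpl" exists, the mp3 URL still has to be pulled out of its flashvars attribute, since the element has no text content. Here is a minimal sketch of that step for Python 3, assuming the flashvars string has the query-string shape shown in the question. The function name mp3_url_from_flashvars is a hypothetical helper, and the sample string is the one from the question's HTML with &amp; already decoded to &.

```python
from urllib.parse import parse_qs


def mp3_url_from_flashvars(flashvars):
    """Return the value of the `file` parameter in a flashvars string, or None."""
    params = parse_qs(flashvars)
    values = params.get("file")
    return values[0] if values else None


# Sample value copied from the question's <embed> element.
flashvars = ("height=20&width=370&type=mp3&file=http://services.abc.xyz.mp3"
             "&showstop=true&usefullscreen=false&autostart=true")
print(mp3_url_from_flashvars(flashvars))  # http://services.abc.xyz.mp3
```

With Selenium, the input would presumably come from browser.find_element_by_id('mpl').get_attribute('flashvars'), which returns the attribute value with HTML entities already decoded. If the mp3 URL itself carried an unencoded query string containing &, this simple split would cut it short, so treat the result as a starting point.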