Scrape JSON data with Beautiful Soup: answers

【Question title】: Scrape json data with beautiful soup
【Posted】: 2016-02-17 20:16:34
【Question description】:

I am trying to scrape a piece of data nested inside a div, specifically the value of "url". I used video_link = self.soup.find('div', {'class': 'video-embed-big'}), but I cannot get at the url quoted inside that div.

<div class="video-embed-big video-embed-area bf_dom" id="video_buzz_element_4154403_7994283" rel:thumb="https://img.youtube.com/vi/_Ym0LW_uPPk/2.jpg" rel:bf_bucket_data='{"video": {"size": "big", "width":"625", "height":"376", "url":"https://youtube.com/watch?v=_Ym0LW_uPPk", "id":"4154403_7994283"}}'>
  <div style="position:relative;" id="video_wrapper_4154403_7994283">     
     <iframe id="yt_4154403_7994283" class="ytvideo" type="text/html" allowscriptaccess="always" allowfullscreen="true" width="625" height="376" src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&amp;hl=en&amp;fs=1&amp;enablejsapi=1&amp;origin=http://www.buzzfeed.com&amp;autoplay=0&amp;showinfo=0&amp;wmode=opaque" frameborder="0">
          </iframe>
     </div>
</div>

【Question comments】:

Tags: python web-scraping beautifulsoup

【Solution 1】:

How about:

video_div = self.soup.find('div', id=lambda d: d and d.startswith('video_wrapper_'))
video_link = video_div.find('iframe')['src']

which returns

In [5]: video_link
Out[5]: 'https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&hl=en&fs=1&enablejsapi=1&origin=http://www.buzzfeed.com&autoplay=0&showinfo=0&wmode=opaque'
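
The `d and ...` guard in the lambda matters because Beautiful Soup calls the filter with `None` for tags that have no `id`. Below is a minimal self-contained sketch of the same lookup, assuming `bs4` is installed; the trimmed `HTML` string and the helper name `is_video_wrapper` are made up for illustration and are not part of the original page or answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (an assumption made for this sketch).
HTML = """
<div class="video-embed-big" id="video_buzz_element_4154403_7994283">
  <div style="position:relative;" id="video_wrapper_4154403_7994283">
    <iframe class="ytvideo" src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&amp;hl=en"></iframe>
  </div>
  <div>no id here</div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")


def is_video_wrapper(id_value):
    # id_value is None for tags without an id attribute, hence the guard.
    return id_value is not None and id_value.startswith("video_wrapper_")


# find() returns the first div whose id passes the filter.
video_div = soup.find("div", id=is_video_wrapper)
video_link = video_div.find("iframe")["src"]
print(video_link)
```

Without the guard, the filter would raise an AttributeError as soon as it reached a div that has no id.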

If you want to go a step further, you can use urlparse to rebuild the URL of the actual YouTube page.

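# Python 3; in Python 2 these functions live in the urlparse module
from urllib.parse import urlparse, urlunparse

video_div = self.soup.find('div', id=lambda d: d and d.startswith('video_wrapper_'))
video_link = video_div.find('iframe')['src']
url = urlparse(video_link)
# url[2] is the path '/embed/<video id>'; the id is its last segment
youtube_url = urlunparse((url[0], url[1], "watch?v=" + url[2].split('/')[2], '', '', ''))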

This is the value of youtube_url:

In [15]: urlunparse((url[0], url[1], "watch?v=" + url[2].split('/')[2],'','',''))
Out[15]: 'https://www.youtube.com/watch?v=_Ym0LW_uPPk'
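
The rebuild above glues the query string onto the path element of the tuple. A slightly tidier variant, sketched here with only the standard library, puts the video id in the query component instead. The function name `embed_to_watch_url` is hypothetical, and the sketch assumes the embed URL always has the form `/embed/<video id>`.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def embed_to_watch_url(embed_url):
    """Turn a YouTube /embed/<id> URL into a /watch?v=<id> URL."""
    parts = urlsplit(embed_url)
    # The path looks like '/embed/<video id>'; take the last segment.
    video_id = parts.path.rstrip("/").split("/")[-1]
    # Put the id in the query component rather than appending it to the path.
    query = urlencode({"v": video_id})
    return urlunsplit((parts.scheme, parts.netloc, "/watch", query, ""))


embed = "https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&hl=en&fs=1"
print(embed_to_watch_url(embed))
```

The original embed parameters (version, hl, fs and so on) are dropped on purpose, since they only apply to the embedded player.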

【Comments】:

【Solution 2】:

video_link = self.soup.find('div',{'class':'video-embed-big'}).div.iframe['src']

You need to use the "." operator to descend into the div's child elements, then read the src attribute to get the URL.
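
A self-contained sketch of that navigation, assuming `bs4` is installed and using a trimmed copy of the question's markup. It also reads the "url" field the question asks about straight from the `rel:bf_bucket_data` attribute with `json.loads`. That part assumes the attribute value is valid JSON, written here in single quotes so the inner double quotes survive parsing; the real page may encode it differently.

```python
import json

from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (an assumption made for this sketch).
HTML = """
<div class="video-embed-big video-embed-area bf_dom"
     rel:bf_bucket_data='{"video": {"size": "big", "url": "https://youtube.com/watch?v=_Ym0LW_uPPk"}}'>
  <div style="position:relative;" id="video_wrapper_4154403_7994283">
    <iframe class="ytvideo" src="https://www.youtube.com/embed/_Ym0LW_uPPk?version=3&amp;hl=en"></iframe>
  </div>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")
outer = soup.find("div", {"class": "video-embed-big"})

# tag.name access returns the first descendant tag with that name.
embed_src = outer.div.iframe["src"]

# The attribute value is a JSON string, so decode it before indexing.
bucket = json.loads(outer["rel:bf_bucket_data"])
watch_url = bucket["video"]["url"]

print(embed_src)
print(watch_url)
```

Note that `outer.div` picks the first nested div it finds, so this shortcut depends on the wrapper div coming first inside the outer one.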

【Comments】: