Scraping a website: how to get a specific link (answers)

【Question title】: Scrape a web site: how to get a specific link
【Posted】: 2016-12-05 02:01:19
【Question description】:

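Hi guys, I'm scraping a website where each movie page has 3 movie links. I have code that gets all 3 links, but I want to pick one and print only that one, in this case the openload link. My code also prints the whole iframe, and I'd like to print just the clean link, like this: 'https://openload.co/embed/cosxf9mWZlg/'. I'm also including the output below so you can see what I get right now.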

import requests
from bs4 import BeautifulSoup

# Movie pages to scrape.
url= ('http://goldfilmesonline.com/goldstone-legendado-online/','http://goldfilmesonline.com/sob-a-sombra-legendado-online/','http://goldfilmesonline.com/fora-do-rumo-dublado-online/')
b=0

while b < len(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    a = requests.get(url[b], headers=headers)
    soup = BeautifulSoup(a.text,'html.parser')
    # Collect every <iframe> tag that has a src attribute.
    x = soup.find_all('iframe', src=True)
    print(x)
    b+=1

This is the output:

[<iframe allowfullscreen="" frameborder="0" src="https://www.youtube.com/embed/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="https://openload.co/embed/noK42_ITHiU/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="http://thevid.net/e/zqlcx3byxh/"></iframe>]
[<iframe allowfullscreen="" frameborder="0" src="https://www.youtube.com/embed/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="https://openload.co/embed/oMzqATsLLsw/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="http://thevid.net/e/rgt2kyrmzdqdbeocwjmspd6/"></iframe>]
[<iframe allowfullscreen="" frameborder="0" src="https://www.youtube.com/embed/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="https://openload.co/embed/cosxf9mWZlg/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="https://openload.co/embed/b85sRhsjJ3Q/"></iframe>, <iframe allowfullscreen="" frameborder="0" src="http://thevid.net/e/4mvpjkef43pqyhnmg/"></iframe>]

【Question comments】:

Tags: python python-2.7 beautifulsoup httprequest

【Solution 1】:

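If I understand your question, you only want to print the iframes whose src contains openload. If that's the case, all you need to do is loop over x and check whether openload is in that frame's src value. If it is, print that frame.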

while b < len(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    a = requests.get(url[b], headers=headers)
    soup = BeautifulSoup(a.text,'html.parser')
    # Only iframes that actually carry a src attribute.
    x = soup.find_all('iframe', src=True)
    for eachFrame in x:
        currentSRC = eachFrame['src']
        # Lowercase the src so the match is case-insensitive.
        if "openload" in currentSRC.lower():
            print(currentSRC)  # prints just the src link
            # print(eachFrame)  # use this instead to print the whole iframe
    b+=1
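
Since one of the comments on this answer mentions trouble turning the loop into a def, here is a minimal sketch of the same filter as a function. The function name `openload_links`, its `keyword` parameter, and the `sample_html` string are made up for illustration; the sample markup is modelled on the output shown in the question, and the sketch parses that hard-coded string instead of fetching a page. It uses Python 3 style `print()`, which is an assumption, as the question itself is tagged python-2.7.

```python
from bs4 import BeautifulSoup


def openload_links(page_html, keyword="openload"):
    """Return the src of every iframe whose src contains `keyword`."""
    soup = BeautifulSoup(page_html, "html.parser")
    links = []
    # src=True skips iframes that have no src attribute at all.
    for frame in soup.find_all("iframe", src=True):
        src = frame["src"]
        if keyword in src.lower():
            links.append(src)
    return links


# Hypothetical sample markup, modelled on the first line of the question's output.
sample_html = (
    '<iframe src="https://www.youtube.com/embed/"></iframe>'
    '<iframe src="https://openload.co/embed/noK42_ITHiU/"></iframe>'
    '<iframe src="http://thevid.net/e/zqlcx3byxh/"></iframe>'
)

print(openload_links(sample_html))
```

Returning a list rather than printing inside the loop keeps the filtering separate from the output, so the caller can decide what to do with the links. In the original loop it could be called as `openload_links(a.text)`.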

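【Comments】:

Thanks, I found some code that does it, but yours is better.
Thanks again, man. But there's an error in your code: I guess eachFrame should be "i". I also don't know why it prints both the thevid and the openload links.
I made sure it runs fine outside my project, but when I turn it into a def it doesn't work and prints both thevid and openload.
I've fixed the code, sorry. I originally used i as the variable name instead of eachFrame, and I forgot to update i['src'] to eachFrame['src']. That should be the problem you mentioned.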

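【Solution 2】:

OK guys, I have an answer of my own. It doesn't look right, but it works... If someone knows a simpler or better way using the same modules, please help. Thanks.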

import re
import requests
from bs4 import BeautifulSoup

# Movie pages to scrape.
url= ('http://goldfilmesonline.com/goldstone-legendado-online/','http://goldfilmesonline.com/sob-a-sombra-legendado-online/','http://goldfilmesonline.com/fora-do-rumo-dublado-online/')
b=0

while b < len(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'}
    a = requests.get(url[b], headers=headers)
    soup = BeautifulSoup(a.text,'html.parser')
    x = soup.find_all('iframe', src=True)
    # Take the second iframe, which is the openload one on these pages.
    c = x[1]
    # Pull the src value out of the tag's HTML with a regex.
    links = re.compile('src="(.+?)"').findall(str(c))
    print(links)
    b+=1
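
Picking `x[1]` depends on the order of the iframes, and the third page in the question's output has two openload iframes, so a positional pick would return only one of them. As a hedged alternative, the sketch below selects links by hostname instead of by position. It is not this answer's method: the helper name `is_openload` and the list `srcs` are hypothetical, the list is copied from the third line of the question's output, and the code uses Python 3's `urllib.parse` (in Python 2.7 the equivalent module is `urlparse`).

```python
from urllib.parse import urlparse


def is_openload(src):
    """True if the URL's hostname is openload.co or a subdomain of it."""
    host = urlparse(src).hostname or ""
    return host == "openload.co" or host.endswith(".openload.co")


# src values copied from the third line of the question's output.
srcs = [
    "https://www.youtube.com/embed/",
    "https://openload.co/embed/cosxf9mWZlg/",
    "https://openload.co/embed/b85sRhsjJ3Q/",
    "http://thevid.net/e/4mvpjkef43pqyhnmg/",
]

openload_srcs = [s for s in srcs if is_openload(s)]
print(openload_srcs)
```

In the scraping loop, the list of src values could be built with `[tag['src'] for tag in x]`, which reads the attribute directly and avoids the regex over `str(c)`.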

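【Comments】:

You shouldn't ask a new question in an answer.
Take a look at my answer. If it doesn't answer your question, please comment and let me know.
Thanks Michael, it really helped me a lot. I got the same result with different code, but I decided to go with yours. Thank you very much.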