【Question title】: How to get the direct download link inside a page?
【Posted】: 2013-09-02 00:43:40
【Question】:

I have this code:

import urllib
from bs4 import BeautifulSoup

f = open('log1.txt', 'w')

url ='http://www.brothersoft.com/tamil-font-513607.html'
pageUrl = urllib.urlopen(url)
soup = BeautifulSoup(pageUrl)

for a in soup.select("div.class1.coLeft a[href]"):
    try:
        suburl = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
        f.write ('http://www.brothersoft.com'+a['href']+'\n')
    except:
        print 'cannot read'
        f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

        pass

    content = urllib.urlopen(suburl)
    soup = BeautifulSoup(content)
    for a in soup.select("div.Sever1.coLeft a[href]"):
        try:
            suburl2 = ('http://www.brothersoft.com'+a['href']).encode('utf-8','replace')
            f.write ('http://www.brothersoft.com'+a['href']+'\n')
        except:
            print 'cannot read'
            f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

            pass

        content = urllib.urlopen(suburl2)
        soup = BeautifulSoup(content)
        for a in soup.select("span.p a[href]"):
            try:
                print (a['href']).encode('utf-8','replace')
                f.write ('http://www.brothersoft.com'+a['href']+'\n')
            except:
                print 'cannot read'
                f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')

                pass




f.close()

When I run it, I get this result:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brotherso
ft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
http://ask.brothersoft.com/ask-a-question/?topic=1
http://ask.brothersoft.com/
http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Fusfiles.brother
soft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
http://ask.brothersoft.com/ask-a-question/?topic=1
http://ask.brothersoft.com/

But all I need is the direct download link, like this:

http://www.brothersoft.com/d.php?soft_id=513607&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
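
For reference, the link above carries a percent-encoded file location in its `url` query parameter. The following is a minimal sketch of decoding it, assuming Python 3's standard library (in Python 2.7 the same two functions live in the `urlparse` module). The helper name `direct_file_url` is hypothetical and only used for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def direct_file_url(download_link):
    """Return the decoded 'url' query parameter of a link, or None if absent."""
    query = urlsplit(download_link).query
    # parse_qs percent-decodes the values and maps each key to a list.
    values = parse_qs(query).get("url")
    return values[0] if values else None


link = ("http://www.brothersoft.com/d.php?soft_id=513607"
        "&url=http%3A%2F%2Ffiles.brothersoft.com%2Fphotograph_graphics"
        "%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font")
print(direct_file_url(link))
# prints http://files.brothersoft.com/photograph_graphics/font_tools/keyman.exe
```

Whether the decoded address can be fetched directly depends on the site, which this sketch does not check.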

【Comments】:

    Tags: python html python-2.7 html-parsing beautifulsoup


    【Answer 1】:

    Instead of the last block:

        for a in soup.select("span.p a[href]"):
            try:
                print (a['href']).encode('utf-8','replace')
                f.write ('http://www.brothersoft.com'+a['href']+'\n')
            except:
                print 'cannot read'
                f.write('cannot read:'+'http://www.brothersoft.com'+a['href']+'\n')
    
                pass
    

    read the URL from the onload attribute of the body tag:

    print soup.find('body')['onload'][10:-2]
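
The slice `[10:-2]` depends on the exact length of the JavaScript wrapped around the URL. As a hedged alternative, the sketch below pulls the first quoted http(s) URL out of the attribute value with a regular expression. It assumes the `onload` value is a JavaScript call containing one quoted URL; the sample string and the helper name `url_from_onload` are hypothetical, since the real attribute value is not shown in this thread.

```python
import re


def url_from_onload(onload_value):
    """Return the first quoted http(s) URL in a JavaScript snippet, or None."""
    match = re.search(r"""['"](https?://[^'"]+)['"]""", onload_value)
    return match.group(1) if match else None


# Hypothetical onload value; the real page's attribute may look different.
sample = "download('http://files.example.com/tools/keyman.exe')"
print(url_from_onload(sample))
# For this sample the fixed slice happens to give the same string.
print(sample[10:-2])
```

The value passed in would be `soup.find('body')['onload']`, and the regular expression keeps working if the surrounding call changes length.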
    

    【Comments】:

    • Why do I get two download links? brothersoft.com/… ft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font brothersoft.com/…soft.com%2Fphotograph_graphics%2Ffont_tools%2Fkeyman.exe&name=Tamil%20Font
    • @wanmohdpayed Because the second step has two download mirrors. You can use soup.find("div.Sever1.coLeft a[href]") instead of the loop. Let me know if you have any problems. Thanks.
    • I get this error: Traceback (most recent call last): File "C:\Users\ext-chermo\Desktop\soup5.py", line 32, in <module> content = urllib.urlopen(suburl2) File "C:\Python27\lib\urllib.py", line 86, in urlopen return opener.open(url) File "C:\Python27\lib\urllib.py", line 179, in open fullurl = unwrap(toBytes(fullurl)) File "C:\Python27\lib\urllib.py", line 1056, in unwrap url = url.strip() AttributeError: 'NoneType' object has no attribute 'strip'
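
The traceback in the last comment is consistent with `suburl2` being `None` when it reaches `urlopen`. One likely cause (an assumption, since the modified script is not shown) is that `find()` was given a CSS selector: `find()` treats its first argument as a tag name, whereas `select()` is the method that understands CSS selectors. The sketch below illustrates the difference on hypothetical markup that mimics the two-mirror layout. It is written for Python 3 and passes the `html.parser` builder explicitly.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two mirrors; the real page may differ.
html = """
<div class="Sever1 coLeft">
  <a href="/mirror-1.html">Mirror 1</a>
  <a href="/mirror-2.html">Mirror 2</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
selector = "div.Sever1.coLeft a[href]"

# find() looks for a tag literally named like the selector, so nothing matches.
print(soup.find(selector))  # None

# select() parses the CSS selector; keep only the first mirror and
# guard against an empty result before using it.
links = soup.select(selector)
first_href = links[0]["href"] if links else None
print(first_href)  # /mirror-1.html
```

Checking for an empty result before building the next URL avoids passing `None` on to `urlopen`.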