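Answers: Replace a Beautiful Soup node with a string in Python

【Question title】: Replace the node of Beautiful Soup with string in python
【Posted】: 2016-06-10 15:31:47
【Question description】:

I have to download and save the web page at a given URL. I have already downloaded the page along with the js and css files it needs. The problem is changing the src and href values of those tags in the html source so that the saved page works.

My html source is: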

<link REL="shortcut icon" href="/commd/favicon.ico">
<script src="/commd/jquery.min.js"></script>
<script src="/commd/jquery-ui.min.js"></script>
<script src="/commd/slimScroll.min.js"></script>
<script src="/commd/ajaxstuff.js"></script>
<script src="/commd/jquery.nivo.slider.pack.js"></script>FCT0505
<script src="/commd/jquery.nivo.slider.pack.js"></script>
<link rel="stylesheet" type="text/css" href="/fonts/stylesheet.cssFCT0505"/>
<link rel="stylesheet" type="text/css" href="/commd/stylesheet.css"/>
<!--[if gte IE 6]>
<link rel="stylesheet" type="text/css" href="/commd/stylesheetIE.css" />
<![endif]-->
<link rel="stylesheet" type="text/css" href="/commd/accordion.css"/>
<link rel="stylesheet" href="/commd/nivo.css" type="text/css" media="screen" />
<link rel="stylesheet" href="/commd/nivo-slider.css" type="text/css" media="screen" />

I have found the links to all the css and js files and downloaded them with the following:

scriptsurl = soup3.find_all("script")
os.chdir(foldername)
for l in scriptsurl:
    if l.get("src") is not None:
        print(l.get("src"))
        script="http://www.iitkgp.ac.in"+l.get("src")
        print(script)
        file=l.get("src").split("/")[-1]
        l.get("src").replaceWith('./foldername/'+file)
        print(file)
        urllib.request.urlretrieve(script,file)
linksurl=soup3.find_all("link")
for l in linksurl:
    if l.get("href") is not None:
        print(l.get("href"))
        css="http://www.iitkgp.ac.in"+l.get("href")
        file=l.get("href").split("/")[-1]
        print(css)
        print(file)
        if(os.path.exists(file)):
            urllib.request.urlretrieve(css,file.split(".")[0]+"(1)."+file.split(".")[-1])
        else:
            urllib.request.urlretrieve(css,file)
os.chdir("..")

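It would be helpful if anyone could suggest a way to change the src/href text to local machine paths while these loops run. This is my first crawler task.

【Question comments】:

Tags: python beautifulsoup web-crawler

【Solution 1】:

Reading from the documentation:

You can add, remove, and modify a tag's attributes. Again, this is done by treating the tag as a dictionary:

So write something like this: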
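For reference, the URL building and file naming done in the loops above can be sketched with the standard library alone. This is only an illustration under assumptions: the helper names `absolute_url` and `local_name` are hypothetical, and the `(1)`, `(2)` suffix scheme merely mirrors the idea in the question. Unlike `file.split(".")[0]`, `posixpath.splitext` keeps multi-dot names such as `jquery.min.js` intact.

```python
import posixpath
from urllib.parse import urljoin, urlsplit

BASE_URL = "http://www.iitkgp.ac.in"  # site root taken from the question


def absolute_url(link):
    """Resolve a src/href value against the site root."""
    return urljoin(BASE_URL, link)


def local_name(link, taken):
    """Pick a local file name for a link (hypothetical helper).

    `taken` is a set of names already used; it is updated in place.
    On a collision, "(1)", "(2)", ... is inserted before the extension.
    """
    name = posixpath.basename(urlsplit(link).path) or "index"
    stem, ext = posixpath.splitext(name)
    candidate, n = name, 0
    while candidate in taken:
        n += 1
        candidate = f"{stem}({n}){ext}"
    taken.add(candidate)
    return candidate


taken = set()
print(absolute_url("/commd/jquery.min.js"))
print(local_name("/commd/stylesheet.css", taken))
print(local_name("/fonts/stylesheet.css", taken))
```

Tracking used names in a set, rather than calling `os.path.exists`, keeps the naming logic independent of the current working directory.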

l["src"] = os.path.join(os.getcwd(),foldername, file)

instead of

l.get("src").replaceWith('./foldername/'+file)

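I believe that will work.

【Comments】:

It gives the error: AttributeError: 'str' object has no attribute 'replaceWith'
Right, I didn't realize foldername was a variable. You threw me off ;) Why did you write replaceWith in the first place? Were you trying to write pseudocode? Check the edited answer.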
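A minimal sketch of the dictionary-style assignment, assuming the `bs4` package is installed. `tag.get("src")` returns a plain `str`, which has no `replaceWith` method, so the new value has to be assigned to the tag itself. The folder name `assets` and the helper `localize_assets` are hypothetical, and a relative path is used here (an assumption, unlike the absolute path above) so that the saved page keeps working if the folder is moved.

```python
import posixpath

from bs4 import BeautifulSoup

html = (
    '<script src="/commd/jquery.min.js"></script>'
    '<link rel="stylesheet" href="/commd/nivo.css"/>'
)
soup = BeautifulSoup(html, "html.parser")


def localize_assets(soup, foldername):
    """Point every script src and link href at a file inside foldername."""
    for tag_name, attr in (("script", "src"), ("link", "href")):
        for tag in soup.find_all(tag_name):
            value = tag.get(attr)  # a plain str, or None if the attribute is absent
            if value is None:
                continue
            filename = value.split("/")[-1]
            # Assign through the tag, treating it like a dictionary.
            tag[attr] = posixpath.join(foldername, filename)
    return soup


localize_assets(soup, "assets")  # "assets" is a hypothetical folder name
print(soup)
```

After the loop, writing `str(soup)` to the saved html file persists the rewritten attributes.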