Download a file using Python, BeautifulSoup and Selenium: answers

【Question title】: Download a file using Python, BeautifulSoup and Selenium
【Posted】: 2014-01-07 10:10:10
【Question】:

I want to download the first PDB file from the search results (the download link appears below the entry name). I am using Python, Selenium and BeautifulSoup. This is the code I have so far:

import urllib2
from BeautifulSoup import BeautifulSoup
from selenium import webdriver


uni_id = "P22216"

# set parameters
download_dir = "/home/home/Desktop/"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

print "url - ", url


# opening the url
text = urllib2.urlopen(url).read()

#print "text : ", text
soup = BeautifulSoup(text)
#print soup
print


table = soup.find("table", {"class": "queryBlue"})
#print "table : ", table

status = 0
rows = table.findAll('tr')
for tr in rows:
    try:
        cols = tr.findAll('td')
        if cols:
            link = cols[1].find('a').get('href')
            print "link : ", link
            if link:
                if status == 1:
                    main_url = "http://www.rcsb.org" + link
                    print "main_url-----", main_url
                    status = False
                    # NOTE: `browser` is never created anywhere in this snippet
                    browser.click(main_url)
        status += 1

    except:
        pass

The table comes back as None.
How can I download the first file in the search results (2YGV in this case)?

The download link is: /pdb/protein/P32447

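【Comments】:

Works for me. I get /pdb/explore/explore.do?structureId=2YGV. What is the problem? Can't you download it?
I get that too, but how do I download the file? That is my question.

Tags: python selenium beautifulsoup

【Solution 1】:

I'm not sure exactly what you want to download, but here is an example of how to download the 2YGV file: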

import urllib
import urllib2
from bs4 import BeautifulSoup

uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id

# fetch and parse the search results page
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)

# the download icon is a <span>; its parent element carries the href
link = soup.find("span", {"class": "iconSet-main icon-download"}).parent.get("href")

# name the local file after the value at the end of the link
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")

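This script downloads the file from the link on the page. It does not need Selenium; I use urllib to retrieve the file. You can read this post for more on downloading files with urllib.

Edit:

Or use this code to find the download link (it all depends on which files you want to download from which URL):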
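
The script above targets Python 2 (`urllib2`, `urllib.urlretrieve`). As a minimal sketch for Python 3, the URL and the local file name can be built with `urllib.parse` rather than string concatenation and `split`. It assumes the PDB ID is carried in a `structureId` query parameter, as in the link quoted in the comments; the actual download link may use a different path or parameter. `build_download` and `download` are hypothetical helper names, not part of any library.

```python
from urllib.parse import parse_qs, urljoin, urlparse
from urllib.request import urlretrieve

BASE_URL = "http://www.rcsb.org/"  # site root used in the answer


def build_download(link, base_url=BASE_URL, id_param="structureId"):
    """Return (absolute_url, local_filename) for a relative link.

    Assumption: the PDB ID is the value of the `id_param` query parameter.
    """
    # urljoin avoids the doubled slash that plain concatenation can produce
    absolute_url = urljoin(base_url, link)
    params = parse_qs(urlparse(absolute_url).query)
    values = params.get(id_param)
    if not values:
        raise ValueError("no %r parameter in link %r" % (id_param, link))
    return absolute_url, values[0] + ".pdb"


def download(link):
    """Fetch the file behind `link` into the current directory (network call)."""
    url, filename = build_download(link)
    urlretrieve(url, filename)
    return filename
```

Only `download` touches the network, so the URL handling in `build_download` can be checked on its own, for example with the link `/pdb/explore/explore.do?structureId=2YGV` from the comments.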

import urllib
import urllib2
from bs4 import BeautifulSoup


uni_id = "P22216"
url = "http://www.rcsb.org/pdb/search/smart.do?smartComparator=and&smartSearchSubtype_0=UpAccessionIdQuery&target=Current&accessionIdList_0=%s" % uni_id
text = urllib2.urlopen(url).read()
soup = BeautifulSoup(text)
table = soup.find( "table", {"class":"queryBlue"} )
link = table.find("a", {"class":"tooltip"}).get("href")
urllib.urlretrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")
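
`find` returns `None` when nothing matches, which is what the asker ran into with `table`, so chaining `.find(...).get(...)` fails on an empty results page. The following is a sketch of the same lookup with an explicit not-found result. It is written for Python 3 and uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without extra packages. The class and function names are hypothetical, and it assumes the results table and link carry the `queryBlue` and `tooltip` classes used in the answer.

```python
from html.parser import HTMLParser


class FirstResultLink(HTMLParser):
    """Record the href of the first <a class="tooltip"> inside <table class="queryBlue">."""

    def __init__(self):
        super().__init__()
        self.table_depth = 0  # greater than 0 while inside the results table
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "table":
            # start counting at the results table; also count tables nested in it
            if self.table_depth or "queryBlue" in classes:
                self.table_depth += 1
        elif tag == "a" and self.table_depth and self.href is None:
            if "tooltip" in classes and attrs.get("href"):
                self.href = attrs["href"]

    def handle_endtag(self, tag):
        if tag == "table" and self.table_depth:
            self.table_depth -= 1


def first_result_link(html):
    """Return the first result link in `html`, or None if there is none."""
    parser = FirstResultLink()
    parser.feed(html)
    return parser.href
```

A caller can then test the return value against `None` before downloading, instead of relying on an exception.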

Here is an example of how to do what you asked in the comments:

import mechanize
from bs4 import BeautifulSoup


SEARCH_URL = "http://www.rcsb.org/pdb/home/home.do"

l = ["YGL130W", "YDL159W", "YOR181W"]
browser = mechanize.Browser()

for item in l:
    browser.open(SEARCH_URL)
    browser.select_form(nr=0)
    browser["q"] = item
    html = browser.submit()

    soup = BeautifulSoup(html)
    table = soup.find("table", {"class":"queryBlue"})
    if table:
        link = table.find("a", {"class":"tooltip"}).get("href")
        browser.retrieve("http://www.rcsb.org/" + str(link), str(link.split("=")[-1]) + ".pdb")[0]
        print "Downloaded " + item + " as " + str(link.split("=")[-1]) + ".pdb"
    else:
        print item + " was not found"

Output:

Downloaded YGL130W as 3KYH.pdb
Downloaded YDL159W as 3FWB.pdb
YOR181W was not found

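【Comments】:

I read and understood your code, thanks. I have a list l = [YGL130W, YDL159W, YOR181W]. For it I have to go to rcsb.org/pdb/home/home.do, then take each ID and search for it on the site. The results page has a link to search PDB. I have to click it, and then I get either the PDB download page or multiple PDBs. If there are multiple PDBs, I have to download the first PDB in the search results.
Edited the answer. Hope this helps.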