【Posted】: 2017-03-18 18:23:55
【Problem description】:
I have a list of URLs in a text file. I want to download the images they point to into a specific folder. How can I do that? Is there a plugin available for Chrome, or any other program, that downloads images from a list of URLs?
【Question comments】:
Tags: image google-chrome
Create a folder on your machine.
Put the text file of image URLs in that folder.
cd into that folder and run wget -i images.txt
You will find all the downloaded files in the folder.
【Comments】:
I needed brew install wget first, but after that it was easy. Thanks a lot!
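If installing wget is not an option, the same steps can be sketched with only the Python standard library (the function name `download_from_list` and the file/folder names are just placeholders for this illustration):

```python
import os
import urllib.request

def download_from_list(list_path, dest_dir):
    """Download every URL listed (one per line) in list_path into dest_dir."""
    os.makedirs(dest_dir, exist_ok=True)
    with open(list_path) as f:
        urls = [line.strip() for line in f if line.strip()]
    for url in urls:
        # name the local file after the last path component of the URL
        filename = os.path.join(dest_dir, url.rstrip("/").split("/")[-1])
        urllib.request.urlretrieve(url, filename)
    return len(urls)
```

This behaves roughly like `wget -i images.txt`, minus wget's retries and resume support.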
This ought to be turned into a function with error handling, but it repeatedly downloads images for an image-classification project:
import pandas as pd
import requests

urls = pd.read_csv('cat_urls.csv')  # load the URL list as a DataFrame
rows = []
for index, i in urls.iterrows():
    rows.append(i.iloc[-1])  # the last column holds the URL

counter = 0
for i in rows:
    file_name = 'cat' + str(counter) + '.jpg'
    print(file_name)
    response = requests.get(i)
    with open(file_name, "wb") as file:
        file.write(response.content)
    counter += 1
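One weakness of the snippet above is that every file gets a hard-coded .jpg extension even when the URL points to a PNG or GIF. A small helper (the name `url_to_filename` is my own, not from the answer) can keep each URL's real extension while still numbering the files:

```python
import os
from urllib.parse import urlparse

def url_to_filename(url, counter, default_ext=".jpg"):
    """Build a local filename like 'cat0.png', preserving the URL's extension."""
    path = urlparse(url).path               # drop query strings like ?w=640
    ext = os.path.splitext(path)[1] or default_ext
    return "cat" + str(counter) + ext
```

Swap this in for the `file_name = 'cat' + str(counter) + '.jpg'` line to avoid mislabeled files.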
【讨论】:
import os
import time
import sys
import ssl
import urllib
from progressbar import ProgressBar

def get_raw_html(url):
    version = (3, 0)
    curr_version = sys.version_info
    if curr_version >= version:  # if the current version of Python is 3.0 or above
        import urllib.request  # urllib library for fetching web pages
        try:
            headers = {}
            headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
            request = urllib.request.Request(url, headers=headers)
            resp = urllib.request.urlopen(request)
            respData = str(resp.read())
            return respData
        except Exception as e:
            print(str(e))
    else:  # if the current version of Python is 2.x
        import urllib2
        try:
            headers = {}
            headers['User-Agent'] = "Mozilla/5.0 (X11; Linux i686) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.27 Safari/537.17"
            request = urllib2.Request(url, headers=headers)
            try:
                response = urllib2.urlopen(request)
            except urllib2.URLError:  # handle SSL certificate failure
                context = ssl._create_unverified_context()
                response = urllib2.urlopen(request, context=context)
            raw_html = response.read()
            return raw_html
        except:
            return "Page not found"

def next_link(s):
    start_line = s.find('rg_di')
    if start_line == -1:  # if no links are found, signal it
        end_quote = 0
        link = "no_links"
        return link, end_quote
    else:
        start_line = s.find('class="rg_meta"')
        start_content = s.find('"ou"', start_line + 1)
        end_content = s.find(',"ow"', start_content + 1)
        content_raw = str(s[start_content + 6:end_content - 1])
        return content_raw, end_content

def all_links(page):
    links = []
    while True:
        link, end_content = next_link(page)
        if link == "no_links":
            break
        else:
            links.append(link)  # append every link to the list named 'links'
            # time.sleep(0.1)  # a timer could be used to throttle the image requests
            page = page[end_content:]
    return links

def download_images(links, search_keyword):
    choice = input("Do you want to save the links? [y]/[n]: ")
    if choice == 'y' or choice == 'Y':
        # write all the links into a text file
        f = open('links.txt', 'a')  # open the text file called links.txt
        for link in links:
            f.write(str(link))
            f.write("\n")
        f.close()  # close the file
    num = input("Enter number of images to download (max 100): ")
    counter = 1
    errors = 0
    search_keyword = search_keyword.replace("%20", "_")
    directory = search_keyword + '/'
    if not os.path.isdir(directory):
        os.makedirs(directory)
    pbar = ProgressBar()
    for link in pbar(links):
        if counter <= int(num):
            file_extension = link.split(".")[-1]
            filename = directory + str(counter) + "." + file_extension
            # print("Downloading image: " + str(counter) + '/' + str(num))
            try:
                urllib.request.urlretrieve(link, filename)
            except urllib.error.HTTPError:
                errors += 1  # HTTPError is a subclass of URLError, so catch it first
            except urllib.error.URLError:
                errors += 1
            except IOError:
                errors += 1
            counter += 1
    return errors

def search():
    version = (3, 0)
    curr_version = sys.version_info
    if curr_version >= version:  # if the current version of Python is 3.0 or above
        import urllib.request  # urllib library for fetching web pages
    else:
        import urllib2  # if the current version of Python is 2.x
    search_keyword = input("Enter the search query: ")
    # download the image links
    links = []
    search_keyword = search_keyword.replace(" ", "%20")
    url = 'https://www.google.com/search?q=' + search_keyword + '&espv=2&biw=1366&bih=667&site=webhp&source=lnms&tbm=isch&sa=X&ei=XosDVaCXD8TasATItgE&ved=0CAcQ_AUoAg'
    raw_html = get_raw_html(url)
    links = links + all_links(raw_html)
    print("Total Image Links = " + str(len(links)))
    print("\n")
    errors = download_images(links, search_keyword)
    print("Download Complete.\n" + str(errors) + " errors while downloading.")

search()
【Comments】:
In this python project I search on unsplash.com, which gives me a list of URLs, and then save a number of them (predefined by the user) to a predefined folder. Check it out.
【Comments】:
On Windows 10/11 this is fairly simple to use:
for /F "eol=;" %f in (filelist.txt) do curl -O %f
Note the inclusion of eol=; — it lets us mask individual exclusions by adding ; at the start of any lines in filelist.txt that we don't want this time around. If you use the above in a batch file such as GetFileList.cmd, double those % values to %%.
Windows 7 has an FTP command, but it usually raises a firewall dialog that requires an authorizing response from the user.
If you are currently running Windows 7 and want to download a list of URLs without installing wget.exe or another dependency such as curl.exe (which would make for the simplest one-liner), the shortest compatible way is a PowerShell command (not my favorite for speed, but if needs must.)
The file with the URLs is filelist.txt, and IWR (Invoke-WebRequest) is the PowerShell near-equivalent of wget.
The SecurityProtocol command comes first to ensure we are using the modern TLS 1.2 protocol.
-OutF ... Split-Path ... means the filenames will be the same as the remote filenames, but saved in the CWD (current working directory); for scripting you can cd /d folder first if needed.
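For example, a filelist.txt using that convention might look like this (the URLs here are placeholders); the line starting with ; is skipped on this run:

```text
https://example.com/images/cat1.jpg
;https://example.com/images/cat2.jpg
https://example.com/images/cat3.jpg
```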
PS> [Net.ServicePointManager]::SecurityProtocol = "Tls12" ; GC filelist.txt | % {IWR $_ -OutF $(Split-Path $_ -Leaf)}
To run it from CMD, use a slightly different set of quotes around 'Tls12':
PowerShell -C "& {[Net.ServicePointManager]::SecurityProtocol = 'Tls12' ; GC filelist.txt | % {IWR $_ -OutF $(Split-Path $_ -Leaf)}}"
【Comments】:
On Windows, install wget - https://sourceforge.net/projects/gnuwin32/files/wget/1.11.4-1/
and add C:\Program Files (x86)\GnuWin32\bin to your environment PATH.
Create a folder containing a txt file listing all the images you want to download.
Type cmd in the location bar at the top of File Explorer.
When the command prompt opens, enter the following:
wget -i images.txt --no-check-certificate
【Comments】: