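【Posted】: 2015-02-02 06:47:31
【Question】:
I want to dynamically load lists/tuples from a settings file.
I need to write a crawler that crawls a website, but I want to know which files it finds, not which pages.
I let the user specify such file types in the settings.py file, like this: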
# Document Types during crawling
textFiles = ['.doc', '.docx', '.log', '.msg', '.pages', '.rtf', '.txt', '.wpd', '.wps']
dataFiles = ['.csv', '.dat', '.efx', '.gbr', '.key', '.pps', '.ppt', '.pptx', '.sdf', '.tax2010', '.vcf', '.xml']
audioFiles = ['.3g2','.3gp','.asf','.asx','.avi','.flv','.mov','.mp4','.mpg','.rm','.swf','.vob','.wmv']
# Which lists would you like to use?
fileLists = ['textFiles', 'dataFiles', 'audioFiles']
I import my settings file into crawler.py.
I use the beautifulsoup module to find links in the HTML content and process them as follows:
for item in soup.find_all("a"):
    # we dont want some of them because it is just a link to the current page or the startpage
    if item['href'] in dontWantList:
        continue
    #check if link is a file based on the fileLists from the settings
    urlpath = urlparse.urlparse(item['href']).path
    ext = os.path.splitext(urlpath)[1]
    file = False
    for list in settings.fileLists:
        if ext in settings.list:
            file = True
            #found file link
            if self.verbose:
                messenger("Found a file of type: %s" % ext, Colors.PURPLE)
            if ext not in fileLinks:
                fileLinks.append(item['href'])
    #Only add the link if it is not a file
    if file is not True:
        links.append(item['href'])
    else:
        #Do not add the file to the other lists
        continue
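The following snippet raises an error:
for list in settings.fileLists:
    if ext in settings.list:
Apparently this is because Python reads settings.list as an attribute literally named list, rather than using the value of the loop variable.
Is there a way to tell Python to look up the lists dynamically from the settings file?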
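For reference, one way such a by-name lookup could be written is with the built-in getattr, which fetches an attribute whose name is held in a string. This is only a minimal sketch in Python 3 (the code above uses the Python 2 urlparse module). The SimpleNamespace object is a hypothetical stand-in for the imported settings module, and is_file_link is an illustrative helper name that does not appear in the original crawler.

```python
import os
import types
from urllib.parse import urlparse

# Hypothetical stand-in for "import settings", so the sketch runs on its own.
settings = types.SimpleNamespace(
    textFiles=['.doc', '.docx', '.txt'],
    dataFiles=['.csv', '.xml'],
    fileLists=['textFiles', 'dataFiles'],
)

def is_file_link(href, settings):
    """Return True if the extension of href is in any list named in settings.fileLists."""
    ext = os.path.splitext(urlparse(href).path)[1]
    for list_name in settings.fileLists:
        # getattr(settings, "textFiles") is the dynamic form of settings.textFiles.
        # The default [] means an unknown name is skipped instead of raising AttributeError.
        extensions = getattr(settings, list_name, [])
        if ext in extensions:
            return True
    return False

print(is_file_link("http://example.com/report.csv?x=1", settings))
print(is_file_link("http://example.com/about.html", settings))
```

The loop variable is called list_name here rather than list, so the built-in list type stays usable inside the function.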
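【Comments】:
- Don't name your own variable list; you will shadow the built-in. Also, using a set makes membership tests more efficient.
- Where does settings.list come from?
- Thanks. I have changed my naming as well. My IDE wasn't very happy about it either :)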
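The set suggestion from the first comment could look roughly like the sketch below. The names TEXT_FILES, DATA_FILES, FILE_EXTENSIONS and is_known_extension are hypothetical and chosen only for illustration; the extension values are a shortened subset of the ones in the question.

```python
# Hypothetical settings values, stored as sets instead of lists.
TEXT_FILES = {'.doc', '.docx', '.txt'}
DATA_FILES = {'.csv', '.xml'}

# Merge all groups once, up front. Membership tests on a (frozen)set are
# hash lookups, so they do not scan every element the way a list does.
FILE_EXTENSIONS = frozenset().union(TEXT_FILES, DATA_FILES)

def is_known_extension(ext):
    """Return True if ext is one of the configured file extensions."""
    return ext in FILE_EXTENSIONS

print(is_known_extension('.csv'))
print(is_known_extension('.html'))
```

Merging the groups ahead of time also removes the inner loop over list names, at the cost of no longer knowing which group an extension came from.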
Tags: python list dynamic nested-lists