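【Posted】: 2017-03-03 23:28:45
【Question】:
I have a CSV file containing a large number of URLs, all with different domain extensions (.com, .eu, .org, and so on). I only want to crawl the domains with the .nl extension, and I am trying to do that in Python 2.7 with if '.nl' in row: as shown here: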
from selenium import webdriver
import csv

fieldnames = ['Website', '@media', 'googleadservices.com/pagead/conversion']

def csv_writerheader(path):
    with open(path, 'w') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writeheader()

def csv_writer(dictdata, path):
    with open(path, 'a') as csvfile:
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames, lineterminator='\n')
        writer.writerow(dictdata)

csv_output_file = 'output!.csv'
driver = webdriver.Chrome(executable_path=r'C:\Users\Jacob\PycharmProjects\Testing\chromedriver_win32\chromedriver.exe')
keywords = ['@media', 'googleadservices.com/pagead/conversion']
csv_writerheader(csv_output_file)

with open('top1m-edited.csv') as example_file:
    example_reader = csv.reader(example_file)
    for row in example_reader:
        # INITIALIZE DICT
        data = {'Website': row}
        if '.nl' in row:  # MAKING THE DOMAIN DISTINCTION HERE
            try:
                driver.get(row[0])
                html = driver.page_source
                for searchstring in keywords:
                    if searchstring.lower() in html.lower():
                        print (row, searchstring, 'FOUND!')
                        data[searchstring] = 'FOUND!'
                    else:
                        print (row, searchstring, 'not found')
                        data[searchstring] = 'not found'
                csv_writer(data, csv_output_file)
            except:
                pass
Printed output:
C:\Python27\python.exe "C:/Users/Jacob/PycharmProjects/Testing/fooling around 2.py"
Process finished with exit code 0
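So in this state my script does essentially nothing, apart from exporting a CSV file with almost no results in it.
However, when I simply leave out if '.nl' in row:, the script works perfectly.
What should I change so that the script only reads and crawls the .nl domain URLs?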
【Discussion】:
Tags: python csv selenium-webdriver web-crawler
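One likely cause, assuming each row holds the URL in its first column (as driver.get(row[0]) suggests): csv.reader yields each row as a list, and the in operator on a list checks whether some whole element equals '.nl'. It does not search inside the strings, so the condition is never true and nothing gets crawled. The bare except: pass would also hide any other error. Below is a minimal sketch of a check on the URL string itself; is_nl_domain is a hypothetical helper name and the sample URLs are made up for illustration.

```python
import csv

try:
    from urlparse import urlparse  # Python 2
except ImportError:
    from urllib.parse import urlparse  # Python 3


def is_nl_domain(url):
    """Return True if the URL's host name ends with '.nl' (hypothetical helper)."""
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if '://' not in url:
        url = 'http://' + url
    host = urlparse(url).hostname or ''
    return host.lower().endswith('.nl')


# Made-up sample data standing in for the CSV file; csv.reader accepts any
# iterable of lines, so a list of strings works here.
sample_lines = [
    'http://example.nl',
    'http://example.com',
    'http://www.example.nl/page',
]
rows = list(csv.reader(sample_lines))

# Each row is a list such as ['http://example.nl'], so this compares '.nl'
# against whole elements and is False.
print('.nl' in rows[0])

# Testing the first column's host name selects only the .nl entries.
nl_urls = [row[0] for row in rows if row and is_nl_domain(row[0])]
print(nl_urls)
```

In the original loop this would mean replacing if '.nl' in row: with something like if row and is_nl_domain(row[0]):. A plain '.nl' in row[0] would also pass the .nl URLs, but it would match '.nl' anywhere in the string, including the path.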