I. Assignment ①
- Requirement: pick a website (for example 中国气象网, the China Weather Network) and crawl all of the images on it, using both a single-threaded and a multi-threaded approach. (The number of images crawled is limited to the last three digits of the student ID.)
- Output: print the URL of each downloaded image to the console, save the downloaded images in the images subfolder, and provide screenshots.
(1) Single-threaded crawling
Gitee link: 作业3_1_1
1. Parsing the webpage
1.1 Page navigation
- Select some of the headlines on the homepage and locate the links they navigate to, as shown in the screenshot below.
- Construct a regular expression to extract those links:
```python
link = re.findall('a href="(http://.*?)"', resp.text)
```
1.2 Image links
- Construct a regular expression to extract the image URLs:
```python
imgurl = re.findall('src="(.*?)"', data)
```
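This `src` pattern also matches non-image resources (for example `.js` files, as noted in the reflections for Assignment ②). If needed, the matches can be narrowed to image files; a minimal sketch, with the extension list as an assumption:

```python
# Keep only URLs that look like image files; the extension list is an assumption.
IMG_EXTS = ('.jpg', '.jpeg', '.png', '.gif')
imgurl = [u for u in imgurl if u.lower().endswith(IMG_EXTS)]
```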
2. Fetching the page source: getHTMLText(url)
```python
def getHTMLText(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception as err:
        print(err)
        return ""  # return an empty string so callers can still treat the result as text
```
3. Extracting image links and downloading them locally
```python
def craw(html):
    reg = r'src="(.*?)"'
    img_list = re.findall(reg, html)
    global count  # number of images downloaded so far
    for imgurl in img_list:
        print(count, imgurl)
        # download the image with the requests library
        try:
            if count > 140:
                return 0
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/weather/weather/img/' + '第' + str(count) + '张图片' + '.jpg'
            with open(file_path, 'wb') as f:  # image data is binary, so open the file in 'wb' mode
                f.write(response.content)
            print('success')
        except Exception as err:
            print(err)
        count += 1
```
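The report only shows the two helper functions; a minimal sketch of a driver that ties them together (the 140-image limit, the homepage URL and the link regex from 1.1 follow the report, everything else is an assumption):

```python
import re
import requests

count = 1  # global counter used by craw()

if __name__ == '__main__':
    # assumes getHTMLText() and craw() from sections 2 and 3 are defined above
    start_url = 'http://www.weather.com.cn/'
    html = getHTMLText(start_url)
    # crawl the homepage itself, then follow the navigation links found in 1.1
    links = re.findall('a href="(http://.*?)"', html)
    craw(html)
    for link in links:
        if count > 140:
            break
        craw(getHTMLText(link))
```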
4. Results
- Console output
- Local folder
(2) Multi-threaded crawling
Gitee link: 作业3_1_2
1. Parsing the webpage
- The page parsing is the same as in the single-threaded version.
2. Main routine
```python
# main
threads = []
imageSpider(page, link)
for t in threads:
    t.join()
```
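The snippet assumes that `page`, `link`, `headers`, `count` and `threads` are already defined at module level; a fuller sketch of that setup (the homepage URL and the link regex follow section 1.1, the page count of 5 and the abbreviated user-agent are assumptions):

```python
import re
import threading
import time

import requests

# assumes imageSpider() and download() from sections 3 and 4 are defined above
headers = {'user-agent': 'Mozilla/5.0'}  # abbreviated; the report uses a full Chrome UA string
count = 1      # global image counter used by imageSpider()
threads = []   # download threads started by imageSpider()

# fetch the homepage and collect the page links to crawl
resp = requests.get('http://www.weather.com.cn/', headers=headers, timeout=30)
resp.encoding = resp.apparent_encoding
link = re.findall('a href="(http://.*?)"', resp.text)
page = 5  # number of linked pages to visit (assumed)

imageSpider(page, link)
for t in threads:
    t.join()
```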
3. Collecting image links: imageSpider(page, link)
```python
def imageSpider(page, link_list):
    global threads
    global count
    for i in range(page):
        try:
            start = time.perf_counter()
            urls = []
            url = link_list[i]
            # print(url)
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            resp.encoding = resp.apparent_encoding
            print(resp.text)
            reg = r'src="(.*?)"'
            img_list = re.findall(reg, resp.text)
            # print(img_list)
            for imgurl in img_list:
                try:
                    if count >= 140:
                        end = time.perf_counter()
                        print('final is in ', end - start)
                        return 0
                    elif imgurl not in urls:
                        urls.append(imgurl)  # remember the URL so duplicates are skipped
                        print(imgurl)
                        count += 1
                        # start a download thread for this image
                        T = threading.Thread(target=download, args=(imgurl, count))
                        T.setDaemon(False)
                        T.start()
                        threads.append(T)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
```
4. Downloading images locally: download(url, count)
```python
def download(url, count):
    try:
        response = requests.get(url)
        file_path = 'D:/PyCharm/InternetWorm/weather/weather/img_thread/' + '第' + str(count) + '张图片' + '.jpg'
        with open(file_path, 'wb') as f:  # image data is binary, so open the file in 'wb' mode
            f.write(response.content)
        print('success')
        print("downloaded " + str(count) + '.jpg')
    except Exception as err:
        print(err)
```
5. Results
- Console output
- Local folder
(3) Reflections
- Unlike earlier assignments, page navigation here is not driven by a particular parameter value; instead, the crawler follows links extracted from the page itself.
- The assignment also helped me become proficient with regular expressions.
II. Assignment ②
- Requirement: reproduce Assignment ① with the scrapy framework.
- Output: same as Assignment ①.
Gitee link: 作业3_2
1. Create a scrapy project
```
scrapy startproject weather
```
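Running this command generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):

```
weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```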
2. Edit settings.py
```python
BOT_NAME = 'weather'
SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'
ITEM_PIPELINES = {'weather.pipelines.WeatherPipeline': 300}
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
3. Write the item class in items.py
```python
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    imgurl = scrapy.Field()
```
4. Write the item pipeline class in pipelines.py
```python
import sqlite3

class WeatherPipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("img.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table img (wId varchar(4),"
                                    "wimgUrl varchar(128),"
                                    "constraint pk_img primary key (wId,"
                                    "wimgUrl));")
            except:
                self.cursor.execute("delete from img")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count - 1, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["imgurl"])
            if self.opened:
                self.cursor.execute("insert into img (wId,wimgUrl) "
                                    "values(?,?)",
                                    (self.count, item['imgurl']))
                self.count += 1
        except Exception as err:
            print(err)
        return item
```
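After a run, the contents of img.db can be checked with a short sqlite3 query; a minimal sketch, with table and column names taken from the pipeline above:

```python
import sqlite3

# print every (wId, wimgUrl) row stored by WeatherPipeline
con = sqlite3.connect("img.db")
for wid, wurl in con.execute("select wId, wimgUrl from img"):
    print(wid, wurl)
con.close()
```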
5. Write the Scrapy spider MySpider.py
```python
import re
import urllib.request

import scrapy
from bs4 import UnicodeDammit

from weather.items import WeatherItem


class MySpider(scrapy.Spider):
    # subclass of scrapy.Spider
    name = "weather"
    source_url = "http://www.weather.com.cn/"
    page = 0
    count = 1

    def start_requests(self):
        url = MySpider.source_url
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            try:
                dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                # print(data)
            except Exception as err:
                print(err)
            imgurl = re.findall('src="(.*?)"', data)
            MySpider.page += 1
            print("MySpider.page:", MySpider.page)
            for url in imgurl:
                # keep only image URLs, filtering out resources such as .js files
                if url.endswith('.jpg') or url.endswith('.JPG') or \
                        url.endswith('.png') or url.endswith('.PNG') or \
                        url.endswith('.gif') or url.endswith('.GIF'):
                    item = WeatherItem()
                    item['imgurl'] = url
                else:
                    continue
                yield item
                try:
                    if MySpider.count > 140:
                        return 0
                    imagename = 'D:/PyCharm/InternetWorm/weather/weather/images/' + '第' + str(MySpider.count) + '张图片' + '.jpg'
                    urllib.request.urlretrieve(str(url), filename=imagename)
                    print('success')
                    MySpider.count += 1
                except Exception as err:
                    print(err)
            # follow the links found on the page and parse them recursively
            link = re.findall('a href="(http://.*?)"', data)
            for i in range(5):
                link_ = link[i]
                url = response.urljoin(link_)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
```
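With the spider name set to "weather" above, the crawl is started from the project root:

```
scrapy crawl weather
```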
6. Results
- Console output
- Database screenshot
- Local folder
7. Reflections
- Image URLs with suffixes such as .js were filtered out.
- I gradually became familiar with the scrapy framework and with database operations.
III. Assignment ③
- Requirement: crawl Douban movie data with scrapy and xpath, store the results in a database, and save the cover images under the imgs path.
- Candidate website: https://movie.douban.com/top250
- Output:

| No. | Title | Director | Cast | Synopsis | Rating | Cover |
|---|---|---|---|---|---|---|
| 1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg |
| 2 | ... | ... | ... | ... | ... | ... |
Gitee link: 作业3_3
1. Parsing the webpage
1.1 Page navigation
Each page lists 25 movies, so page n corresponds to start=(n-1)*25:
Page 1: https://movie.douban.com/top250?start=0
Page 2: https://movie.douban.com/top250?start=25
1.2 Page structure
2. Write items.py
```python
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    director = scrapy.Field()
    actor = scrapy.Field()
    profile = scrapy.Field()
    score = scrapy.Field()
    imgurl = scrapy.Field()
```
3. Write pipelines.py
```python
import sqlite3
import urllib.request

class MoviePipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table movies (mId varchar(4),"
                                    "mName varchar(256),mDirector varchar(64),"
                                    "mActor varchar(64),mProfile varchar(256),"
                                    "mScore varchar(8),mimgUrl varchar(128),"
                                    "constraint pk_movies primary key (mId,"
                                    "mName));")
            except:
                self.cursor.execute("delete from movies")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["id"])
            print(item["name"])
            print(item["director"])
            print(item["actor"])
            print(item["profile"])
            print(item["score"])
            print(item["imgurl"])
            print()
            if self.opened:
                self.cursor.execute("insert into movies (mId,mName,mDirector,"
                                    "mActor,mProfile,mScore,mimgUrl) "
                                    "values(?,?,?,?,?,?,?)",
                                    (item['id'], item['name'],
                                     item['director'], item['actor'],
                                     item['profile'], item['score'],
                                     item['imgurl'],))
                self.count += 1
        except Exception as err:
            print(err)
        try:
            # also download the cover image to the local imgs folder
            url = item["imgurl"]
            imagename = 'D:/PyCharm/InternetWorm/movie/movie/imgs/' + '第' + str(self.count) + '张图片' + '.jpg'
            urllib.request.urlretrieve(str(url), filename=imagename)
            print('success')
        except Exception as err:
            print(err)
        return item
```
4. Write MySpider.py
4.1 Extracting information with xpath
```python
selector = scrapy.Selector(text=data)
movies = selector.xpath('//div[@class="info"]')
name = movies.xpath('div[@class="hd"]/a/span[position()=1]/text()').extract()
bd = movies.xpath('div[@class="bd"]/p/text()').extract()
director = re.findall('导演: (.*?) ', str(bd))
actor = re.findall('主演: (.*?) ', str(bd))
profile = movies.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
score = movies.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
img = selector.xpath('//div[@class="item"]')
id = img.xpath('div/em/text()').extract()
imgurl = img.xpath('div/a/img/@src').extract()
```
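The selectors above return parallel lists, one entry per movie on the page. A minimal sketch of how parse() might combine them into MovieItem objects and yield them to the pipeline (the field names follow items.py; everything else is an assumption):

```python
# Assumed continuation of parse(): zip the parallel lists into items.
for mid, mname, mdirector, mactor, mprofile, mscore, murl in zip(
        id, name, director, actor, profile, score, imgurl):
    item = MovieItem()
    item['id'] = mid
    item['name'] = mname
    item['director'] = mdirector
    item['actor'] = mactor
    item['profile'] = mprofile
    item['score'] = mscore
    item['imgurl'] = murl
    yield item
```

Note that zip stops at the shortest list, so any movie missing one of the fields (for example the quote) would be dropped from the output.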
4.2 start_requests(self)
```python
def start_requests(self):
    while MySpider.page < 5:
        MySpider.page += 1
        print("MySpider.page:", MySpider.page)
        url = MySpider.source_url + '?start=' + str((MySpider.page - 1) * 25)
        yield scrapy.Request(url=url, callback=self.parse)
```
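start_requests references class attributes that are not shown in this excerpt; presumably the spider class header looks roughly like the following (the spider name "movie" and the attribute values are assumptions inferred from the URLs in 1.1):

```python
class MySpider(scrapy.Spider):
    name = "movie"                                   # assumed spider name
    source_url = "https://movie.douban.com/top250"   # matches the Page 1 / Page 2 URLs above
    page = 0                                         # page counter used by start_requests
```

The crawl would then be started with `scrapy crawl movie` (again assuming that spider name).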
5. Results
- Console output
- Database screenshot
- Local folder
6. Reflections
- With page = 5 the spider should collect 125 records, but the console showed only 122; I have not yet resolved this discrepancy.
- This assignment made me familiar with extracting information using xpath.