I. Assignment ①
- Requirement: pick a website (for example 中国气象网, the China Weather Network) and crawl all of the images on it, using both a single-threaded and a multi-threaded approach. (The number of images crawled is limited to the last three digits of the student ID.)
- Output: print the URL of each downloaded image to the console, save the downloaded images in the images subfolder, and provide screenshots.
(1) Single-threaded crawling
Gitee link: 作业3_1_1
1. Parsing the webpage
1.1 Page navigation
- Select some of the headlines on the homepage and locate the links they navigate to, as shown in the screenshot below.
- Construct a regular expression to extract those links:
```python
link = re.findall('a href="(http://.*?)"', resp.text)
```
1.2 Image links
- Construct a regular expression to extract the image URLs:
```python
imgurl = re.findall('src="(.*?)"', data)
```
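This `src` pattern also matches non-image resources (for example `.js` files, as noted in the reflections for Assignment ②). If needed, the matches can be narrowed to image files; a minimal sketch, with the extension list as an assumption:

```python
# Keep only URLs that look like image files; the extension list is an assumption.
IMG_EXTS = ('.jpg', '.jpeg', '.png', '.gif')
imgurl = [u for u in imgurl if u.lower().endswith(IMG_EXTS)]
```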
2. Fetching the page source: getHTMLText(url)
```python
def getHTMLText(url):
    headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.85 Safari/537.36'}
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception as err:
        print(err)
        return ""  # return an empty string so callers can still treat the result as text
```
3. Extracting image links and downloading them locally
```python
def craw(html):
    reg = r'src="(.*?)"'
    img_list = re.findall(reg, html)
    global count  # number of images downloaded so far
    for imgurl in img_list:
        print(count, imgurl)
        # download the image with the requests library
        try:
            if count > 140:
                return 0
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/weather/weather/img/' + '第' + str(count) + '张图片' + '.jpg'
            with open(file_path, 'wb') as f:  # image data is binary, so open the file in 'wb' mode
                f.write(response.content)
            print('success')
        except Exception as err:
            print(err)
        count += 1
```
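The report only shows the two helper functions; a minimal sketch of a driver that ties them together (the 140-image limit, the homepage URL and the link regex from 1.1 follow the report, everything else is an assumption):

```python
import re
import requests

count = 1  # global counter used by craw()

if __name__ == '__main__':
    # assumes getHTMLText() and craw() from sections 2 and 3 are defined above
    start_url = 'http://www.weather.com.cn/'
    html = getHTMLText(start_url)
    # crawl the homepage itself, then follow the navigation links found in 1.1
    links = re.findall('a href="(http://.*?)"', html)
    craw(html)
    for link in links:
        if count > 140:
            break
        craw(getHTMLText(link))
```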
4. Results
- Console output
- Local folder
(2) Multi-threaded crawling
Gitee link: 作业3_1_2
1. Parsing the webpage
- The page parsing is the same as in the single-threaded version.
2. Main routine
```python
# main
threads = []
imageSpider(page, link)
for t in threads:
    t.join()
```
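The snippet assumes that `page`, `link`, `headers`, `count` and `threads` are already defined at module level; a fuller sketch of that setup (the homepage URL and the link regex follow section 1.1, the page count of 5 and the abbreviated user-agent are assumptions):

```python
import re
import threading
import time

import requests

# assumes imageSpider() and download() from sections 3 and 4 are defined above
headers = {'user-agent': 'Mozilla/5.0'}  # abbreviated; the report uses a full Chrome UA string
count = 1      # global image counter used by imageSpider()
threads = []   # download threads started by imageSpider()

# fetch the homepage and collect the page links to crawl
resp = requests.get('http://www.weather.com.cn/', headers=headers, timeout=30)
resp.encoding = resp.apparent_encoding
link = re.findall('a href="(http://.*?)"', resp.text)
page = 5  # number of linked pages to visit (assumed)

imageSpider(page, link)
for t in threads:
    t.join()
```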
3. Collecting image links: imageSpider(page, link)
```python
def imageSpider(page, link_list):
    global threads
    global count
    for i in range(page):
        try:
            start = time.perf_counter()
            urls = []
            url = link_list[i]
            # print(url)
            resp = requests.get(url, headers=headers, timeout=30)
            resp.raise_for_status()
            resp.encoding = resp.apparent_encoding
            print(resp.text)
            reg = r'src="(.*?)"'
            img_list = re.findall(reg, resp.text)
            # print(img_list)
            for imgurl in img_list:
                try:
                    if count >= 140:
                        end = time.perf_counter()
                        print('final is in ', end - start)
                        return 0
                    elif imgurl not in urls:
                        urls.append(imgurl)  # remember the URL so duplicates are skipped
                        print(imgurl)
                        count += 1
                        # start a download thread for this image
                        T = threading.Thread(target=download, args=(imgurl, count))
                        T.setDaemon(False)
                        T.start()
                        threads.append(T)
                except Exception as err:
                    print(err)
        except Exception as err:
            print(err)
```
4. Downloading images locally: download(url, count)
```python
def download(url, count):
    try:
        response = requests.get(url)
        file_path = 'D:/PyCharm/InternetWorm/weather/weather/img_thread/' + '第' + str(count) + '张图片' + '.jpg'
        with open(file_path, 'wb') as f:  # image data is binary, so open the file in 'wb' mode
            f.write(response.content)
        print('success')
        print("downloaded " + str(count) + '.jpg')
    except Exception as err:
        print(err)
```
5. Results
- Console output
- Local folder
(3) Reflections
- Unlike earlier assignments, page navigation here is not driven by a particular parameter value; instead, the crawler follows links extracted from the page itself.
- The assignment also helped me become proficient with regular expressions.
II. Assignment ②
- Requirement: reproduce Assignment ① with the scrapy framework.
- Output: same as Assignment ①.
Gitee link: 作业3_2
1. Create a scrapy project
```
scrapy startproject weather
```
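Running this command generates a project skeleton roughly like the following (the exact files can vary slightly between Scrapy versions):

```
weather/
    scrapy.cfg
    weather/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```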
2. Edit settings.py
```python
BOT_NAME = 'weather'
SPIDER_MODULES = ['weather.spiders']
NEWSPIDER_MODULE = 'weather.spiders'
ITEM_PIPELINES = {'weather.pipelines.WeatherPipeline': 300}
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'weather (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
```
3. Write the item class in items.py
```python
import scrapy

class WeatherItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    imgurl = scrapy.Field()
```
4. Write the item pipeline class in pipelines.py
```python
import sqlite3

class WeatherPipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("img.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table img (wId varchar(4),"
                                    "wimgUrl varchar(128),"
                                    "constraint pk_img primary key (wId,"
                                    "wimgUrl));")
            except:
                self.cursor.execute("delete from img")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count - 1, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["imgurl"])
            if self.opened:
                self.cursor.execute("insert into img (wId,wimgUrl) "
                                    "values(?,?)",
                                    (self.count, item['imgurl']))
                self.count += 1
        except Exception as err:
            print(err)
        return item
```
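After a run, the contents of img.db can be checked with a short sqlite3 query; a minimal sketch, with table and column names taken from the pipeline above:

```python
import sqlite3

# print every (wId, wimgUrl) row stored by WeatherPipeline
con = sqlite3.connect("img.db")
for wid, wurl in con.execute("select wId, wimgUrl from img"):
    print(wid, wurl)
con.close()
```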
5. Write the Scrapy spider MySpider.py
```python
import re
import urllib.request

import scrapy
from bs4 import UnicodeDammit

from weather.items import WeatherItem


class MySpider(scrapy.Spider):
    # subclass of scrapy.Spider
    name = "weather"
    source_url = "http://www.weather.com.cn/"
    page = 0
    count = 1

    def start_requests(self):
        url = MySpider.source_url
        yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        try:
            try:
                dammit = UnicodeDammit(response.body, ["utf-8", "gbk"])
                data = dammit.unicode_markup
                # print(data)
            except Exception as err:
                print(err)
            imgurl = re.findall('src="(.*?)"', data)
            MySpider.page += 1
            print("MySpider.page:", MySpider.page)
            for url in imgurl:
                # keep only image URLs, filtering out resources such as .js files
                if url.endswith('.jpg') or url.endswith('.JPG') or \
                        url.endswith('.png') or url.endswith('.PNG') or \
                        url.endswith('.gif') or url.endswith('.GIF'):
                    item = WeatherItem()
                    item['imgurl'] = url
                else:
                    continue
                yield item
                try:
                    if MySpider.count > 140:
                        return 0
                    imagename = 'D:/PyCharm/InternetWorm/weather/weather/images/' + '第' + str(MySpider.count) + '张图片' + '.jpg'
                    urllib.request.urlretrieve(str(url), filename=imagename)
                    print('success')
                    MySpider.count += 1
                except Exception as err:
                    print(err)
            # follow the links found on the page and parse them recursively
            link = re.findall('a href="(http://.*?)"', data)
            for i in range(5):
                link_ = link[i]
                url = response.urljoin(link_)
                yield scrapy.Request(url=url, callback=self.parse)
        except Exception as err:
            print(err)
```
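With the spider name set to "weather" above, the crawl is started from the project root:

```
scrapy crawl weather
```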
6. Results
- Console output
- Database screenshot
- Local folder
7. Reflections
- Image URLs with suffixes such as .js were filtered out.
- I gradually became familiar with the scrapy framework and with database operations.
III. Assignment ③
- Requirement: crawl Douban movie data with scrapy and xpath, store the results in a database, and save the cover images under the imgs path.
- Candidate website: https://movie.douban.com/top250
- Output:

| No. | Title | Director | Cast | Synopsis | Rating | Cover |
|---|---|---|---|---|---|---|
| 1 | 肖申克的救赎 | 弗兰克·德拉邦特 | 蒂姆·罗宾斯 | 希望让人自由 | 9.7 | ./imgs/xsk.jpg |
| 2 | ... | ... | ... | ... | ... | ... |
Gitee link: 作业3_3
1. Parsing the webpage
1.1 Page navigation
Each page lists 25 movies, so page n corresponds to start=(n-1)*25:
Page 1: https://movie.douban.com/top250?start=0
Page 2: https://movie.douban.com/top250?start=25
1.2 Page structure
2. Write items.py
```python
import scrapy

class MovieItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    id = scrapy.Field()
    name = scrapy.Field()
    director = scrapy.Field()
    actor = scrapy.Field()
    profile = scrapy.Field()
    score = scrapy.Field()
    imgurl = scrapy.Field()
```
3. Write pipelines.py
```python
import sqlite3
import urllib.request

class MoviePipeline:
    def open_spider(self, spider):
        print("opened")
        self.con = sqlite3.connect("movies.db")
        self.cursor = self.con.cursor()
        try:
            try:
                self.cursor.execute("create table movies (mId varchar(4),"
                                    "mName varchar(256),mDirector varchar(64),"
                                    "mActor varchar(64),mProfile varchar(256),"
                                    "mScore varchar(8),mimgUrl varchar(128),"
                                    "constraint pk_movies primary key (mId,"
                                    "mName));")
            except:
                self.cursor.execute("delete from movies")
            self.opened = True
            self.count = 1
        except Exception as err:
            print(err)
            self.opened = False

    def close_spider(self, spider):
        try:
            if self.opened:
                self.con.commit()
                self.con.close()
                self.opened = False
        except Exception as err:
            print(err)
        print("closed")
        print("总共爬取", self.count, "项信息")

    def process_item(self, item, spider):
        try:
            print(item["id"])
            print(item["name"])
            print(item["director"])
            print(item["actor"])
            print(item["profile"])
            print(item["score"])
            print(item["imgurl"])
            print()
            if self.opened:
                self.cursor.execute("insert into movies (mId,mName,mDirector,"
                                    "mActor,mProfile,mScore,mimgUrl) "
                                    "values(?,?,?,?,?,?,?)",
                                    (item['id'], item['name'],
                                     item['director'], item['actor'],
                                     item['profile'], item['score'],
                                     item['imgurl'],))
                self.count += 1
        except Exception as err:
            print(err)
        try:
            # also download the cover image to the local imgs folder
            url = item["imgurl"]
            imagename = 'D:/PyCharm/InternetWorm/movie/movie/imgs/' + '第' + str(self.count) + '张图片' + '.jpg'
            urllib.request.urlretrieve(str(url), filename=imagename)
            print('success')
        except Exception as err:
            print(err)
        return item
```
4. Write MySpider.py
4.1 Extracting information with xpath
```python
selector = scrapy.Selector(text=data)
movies = selector.xpath('//div[@class="info"]')
name = movies.xpath('div[@class="hd"]/a/span[position()=1]/text()').extract()
bd = movies.xpath('div[@class="bd"]/p/text()').extract()
director = re.findall('导演: (.*?) ', str(bd))
actor = re.findall('主演: (.*?) ', str(bd))
profile = movies.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
score = movies.xpath('div[@class="bd"]/div[@class="star"]/span[@class="rating_num"]/text()').extract()
img = selector.xpath('//div[@class="item"]')
id = img.xpath('div/em/text()').extract()
imgurl = img.xpath('div/a/img/@src').extract()
```
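The selectors above return parallel lists, one entry per movie on the page. A minimal sketch of how parse() might combine them into MovieItem objects and yield them to the pipeline (the field names follow items.py; everything else is an assumption):

```python
# Assumed continuation of parse(): zip the parallel lists into items.
for mid, mname, mdirector, mactor, mprofile, mscore, murl in zip(
        id, name, director, actor, profile, score, imgurl):
    item = MovieItem()
    item['id'] = mid
    item['name'] = mname
    item['director'] = mdirector
    item['actor'] = mactor
    item['profile'] = mprofile
    item['score'] = mscore
    item['imgurl'] = murl
    yield item
```

Note that zip stops at the shortest list, so any movie missing one of the fields (for example the quote) would be dropped from the output.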
4.2 start_requests(self)
```python
def start_requests(self):
    while MySpider.page < 5:
        MySpider.page += 1
        print("MySpider.page:", MySpider.page)
        url = MySpider.source_url + '?start=' + str((MySpider.page - 1) * 25)
        yield scrapy.Request(url=url, callback=self.parse)
```
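start_requests references class attributes that are not shown in this excerpt; presumably the spider class header looks roughly like the following (the spider name "movie" and the attribute values are assumptions inferred from the URLs in 1.1):

```python
class MySpider(scrapy.Spider):
    name = "movie"                                   # assumed spider name
    source_url = "https://movie.douban.com/top250"   # matches the Page 1 / Page 2 URLs above
    page = 0                                         # page counter used by start_requests
```

The crawl would then be started with `scrapy crawl movie` (again assuming that spider name).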
5. Results
- Console output
- Database screenshot
- Local folder
6. Reflections
- With page = 5 the spider should collect 125 records, but the console showed only 122; I have not yet resolved this discrepancy.
- This assignment made me familiar with extracting information using xpath.