I. Assignment ①
| 2020 Rank | Level | University | Total Score |
| --- | --- | --- | --- |
| 1 | Top 2% | 中国人民大学 | 1069.0 |
| 2 | ... | ... | ... |
1. Fetch the page source: getHTMLTextUrllib(url)

```python
import urllib.request

def getHTMLTextUrllib(url):
    try:
        headers = {"User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; "
                                 "en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre"}
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        data = resp.read()
        unicodeData = data.decode()
        # dammit = UnicodeDammit(data, ["utf-8", "gbk"])
        # unicodeData = dammit.unicode_markup
        return unicodeData
    except Exception as err:
        print(err)
```
2. Build a regular expression for each field
- 2020 rank:
  `rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>(\n\s*?\d*\s*?)</div></td>', html)`
- Level:
  `level = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<!----></td>', html)`
- University name:
  `name = re.findall(r'class="name-cn" data-v-b80b4d60>(.*?)</a>', html)`
- Total score:
  `score = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)</td>', html)`
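To see why the `\n\s*?` parts of these patterns matter, here is a minimal offline check against a hand-written snippet imitating the ranking markup (the HTML string is invented for illustration):

```python
import re

# Invented fragment in the style of the ranking page: the rank digit sits on
# its own line, surrounded by a newline and indentation.
html = ('<td data-v-68e330ae><div class="ranking" data-v-68e330ae>\n'
        '        1\n'
        '    </div></td>')

# The capture group must allow for the surrounding newline and whitespace;
# a plain (\d+) right after the opening tag would fail to match here.
rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>'
                  r'(\n\s*?\d*\s*?)</div></td>', html)
print([r.strip() for r in rank])  # → ['1']
```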
3. Store the extracted fields in ulist: fillUnivList(ulist, html)

```python
import re

def fillUnivList(ulist, html):
    try:
        rank = re.findall(r'<td data-v-68e330ae><div class="ranking" data-v-68e330ae>(\n\s*?\d*\s*?)</div></td>', html)
        level = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)<!----></td>', html)
        name = re.findall(r'class="name-cn" data-v-b80b4d60>(.*?)</a>', html)
        score = re.findall(r'<td data-v-68e330ae>(\n\s*?.*\s*?)</td>', html)
        for i in range(len(rank)):
            Rank = rank[i].strip()
            Level = level[i].strip()
            Name = name[i].strip()
            Score = score[i].strip()
            ulist.append([Rank, Level, Name, Score])
    except Exception as err:
        print(err)
```
4. Print the list: printUnivList(ulist, 20)

```python
def printUnivList(ulist, num):
    # When Chinese and Western text are mixed, pad with the full-width space chr(12288)
    tplt = "{0:^10}\t{1:{4}^8}\t{2:{4}^12}\t{3:^10}"
    # Argument 4, the full-width space chr(12288), is the fill character for the Chinese columns
    print(tplt.format("2020 Rank", "Level", "University", "Total Score", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], chr(12288)))
```
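As a quick offline check of the chr(12288) padding idea, the nested fill specifier can be applied to one invented row (values made up for illustration):

```python
# A Chinese column padded with the full-width space chr(12288):
# format argument 4 is plugged in as the fill character via the nested {4}.
tplt = "{0:^6}\t{1:^6}\t{2:{4}^10}\t{3:^8}"
row = tplt.format("1", "Top 2%", "中国人民大学", "1069.0", chr(12288))
print(row)
# The 6-character name is centered to width 10 with full-width spaces,
# so the column lines up with other Chinese entries.
```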
5. Output

6. Reflections
- I am not yet fluent with regular expressions; when matching the content of
  `<div class="ranking" data-v-68e330ae=""> 1 </div>`, I overlooked the newline and surrounding whitespace and wasted a lot of time...
- I need more practice writing regular expressions.
II. Assignment ②
- Requirement: use the requests and Beautiful Soup libraries to scrape the real-time AQI report from the air-quality data service.
- Output format:

| No. | City | AQI | PM2.5 | SO2 | NO2 | CO | Primary pollutant |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 北京 | 55 | 6 | 5 | 1.0 | 225 | - |
| 2 | ... | ... | ... | ... | ... | ... | ... |
1. Fetch the page source: getHTMLText(url)

```python
import requests

def getHTMLText(url):
    try:
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception:
        return 'request failed'
```
2. Extract the data with Beautiful Soup

```python
import bs4
from bs4 import BeautifulSoup

# Store the rows of the HTML page in ulist
def fillAQIList(ulist, html):
    soup = BeautifulSoup(html, "html.parser")
    i = 1  # row number
    for tr in soup.find('tbody').children:
        if isinstance(tr, bs4.element.Tag):
            # Keep only <tr> tags, skipping the plain strings between rows
            # (this check is why the bs4 module itself must be imported)
            tds = tr('td')
            # Append one row of AQI data to the list
            ulist.append([str(i), tds[0].text.strip(), tds[1].text.strip(),
                          tds[2].text.strip(), tds[4].text.strip(),
                          tds[5].text.strip(), tds[6].text.strip(),
                          tds[8].text.strip()])
            i += 1
```
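The parsing loop can be exercised offline with a hand-written table fragment. The markup below is invented, with nine `<td>` cells per row so that the column indices 0-2, 4-6 and 8 used above line up:

```python
import bs4
from bs4 import BeautifulSoup

# Invented fragment with the same shape as the AQI table.
html = """<table><tbody>
<tr><td>北京</td><td>55</td><td>6</td><td>44</td><td>5</td>
<td>1.0</td><td>225</td><td>33</td><td>-</td></tr>
</tbody></table>"""

ulist = []
soup = BeautifulSoup(html, "html.parser")
i = 1
for tr in soup.find('tbody').children:
    # .children also yields the "\n" strings between rows as NavigableStrings,
    # which is exactly why the isinstance check on bs4.element.Tag is needed.
    if isinstance(tr, bs4.element.Tag):
        tds = tr('td')
        ulist.append([str(i), tds[0].text.strip(), tds[1].text.strip(),
                      tds[2].text.strip(), tds[4].text.strip(),
                      tds[5].text.strip(), tds[6].text.strip(),
                      tds[8].text.strip()])
        i += 1
print(ulist)  # → [['1', '北京', '55', '6', '5', '1.0', '225', '-']]
```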
3. Print the list: printAQIList(ulist, num)

```python
def printAQIList(ulist, num):
    # When Chinese and Western text are mixed, pad with the full-width space chr(12288)
    tplt = "{0:^4}\t{1:{8}^6}\t{2:^4}\t{3:^6}\t{4:^4}\t{5:^4}\t{6:^4}\t{7:{8}^10}"
    # Argument 8, the full-width space chr(12288), is the fill character for the Chinese columns
    print(tplt.format("No.", "City", "AQI", "PM2.5", "SO2", "NO2", "CO",
                      "Primary pollutant", chr(12288)))
    for i in range(num):
        u = ulist[i]
        print(tplt.format(u[0], u[1], u[2], u[3], u[4], u[5], u[6], u[7],
                          chr(12288)))
```
4. Output

5. Reflections
- Having written a similar crawler before, I finished this assignment quickly and without difficulty.
III. Assignment ③
- Requirement: use urllib, requests, and re to crawl all the images from a given page, the Fuzhou University news site.
- Output: save every jpg file from the chosen page into a single folder.
1. Fetch the page source: getHTMLText(url)

```python
import urllib.request
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36',
}

# requests version
def getHTMLText(url):
    try:
        resp = requests.get(url, headers=headers, timeout=30)
        resp.raise_for_status()
        resp.encoding = resp.apparent_encoding
        return resp.text
    except Exception as err:
        return err

# urllib version of the same fetch
def getHTMLTextUrllib(url):
    try:
        req = urllib.request.Request(url, headers=headers)
        resp = urllib.request.urlopen(req)
        return resp.read().decode()
    except Exception as err:
        return err
```
2. Build the regex that matches jpg images

`reg = r'<img src="/(.*?)\.jpg"'`
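The pattern can be checked against invented `<img>` tags before crawling (the tags below are made up in the style of the news pages, with site-relative src paths):

```python
import re

# Made-up tags: one jpg with a site-relative path, one png that must not match.
html = ('<img src="/attach/2021/09/26/433747.jpg" alt="a">'
        '<img src="/logo.png">')

# The lazy (.*?) stops at the first .jpg"; the png tag is skipped entirely.
reg = r'<img src="/(.*?)\.jpg"'
img_list = re.findall(reg, html)
print(img_list)  # → ['attach/2021/09/26/433747']
```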
3. Save the images to a folder: SavePics(html)

```python
import re
import requests

def SavePics(html):
    reg = r'<img src="/(.*?)\.jpg"'
    img_list = re.findall(reg, html)
    i = 0  # image counter
    for imgurl in img_list:
        i += 1
        imgurl = 'http://news.fzu.edu.cn/' + imgurl + '.jpg'
        print(i, imgurl)
        # Download the image with requests
        try:
            response = requests.get(imgurl)
            file_path = 'D:/PyCharm/InternetWorm/News/picture' + str(i) + '.jpg'  # save path
            with open(file_path, 'wb') as f:  # image data is binary, so write in 'wb' mode
                f.write(response.content)
            print('success')
        except Exception as err:
            print(err)
```
4. Output

5. Reflections
- At first every download failed; the printed URL and error looked like:

```
1 http://attach/2021/09/26/433747.jpg
HTTPConnectionPool(host='attach', port=80): Max retries exceeded with url: /2021/09/26/433747.jpg (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x0000022D1D22DB50>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed'))
```
- After trying several fixes without success, I pasted the image URL into a browser, which also refused to open it; changing the original URL to
http://news.fzu.edu.cn/attach/2021/09/26/433747.jpg solved the problem...
- I am still not proficient with regular expressions and should practice more outside class to deepen my understanding.
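The lesson above — a site-relative src path must be resolved against the page's own URL — is exactly what the standard library's urllib.parse.urljoin does, which avoids hard-coding the site prefix:

```python
from urllib.parse import urljoin

page_url = 'http://news.fzu.edu.cn/'
# A site-relative path like the one the regex captured, missing its host.
relative = '/attach/2021/09/26/433747.jpg'

# urljoin resolves the path against the page URL, yielding an absolute URL.
absolute = urljoin(page_url, relative)
print(absolute)  # → http://news.fzu.edu.cn/attach/2021/09/26/433747.jpg
```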