I've recently been reviewing pyquery, which I learned a while back, while also reading 《剑来》 (of course I won't admit it's mostly the latter...). After all, no demand means no improvement! Without further ado.
Goals:
- Get the URL of every chapter of the novel
- Since each page lists only 40 chapters, iterate over all the pages
- Fetch each chapter's content from its URL
- Save it
Getting the chapter URLs
The screenshot is a bit broken; please bear with it. It just shows which node the chapter URLs sit under.
Next, let's figure out how to walk through the pages. There are two approaches: the first is to read the href attribute of the "next page" a tag and join it with the base URL; the second is to check whether the a tags' href values follow a pattern and then generate them with a for loop. This post takes the second approach, since the first would also require extracting the href attribute of the a tag, which is a bit more hassle.
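For reference, the first approach can be sketched with urljoin from the standard library; the href value below is a hypothetical example of what a "next page" link might hold:

```python
from urllib.parse import urljoin

base = 'https://m.80txt.com/6194/'
# hypothetical href extracted from the "next page" a tag
next_href = 'page-3.html'
next_url = urljoin(base, next_href)
print(next_url)  # https://m.80txt.com/6194/page-3.html
```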
The code for this part:
```python
# disguised header
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.36'
}

# Access the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        html = requests.get(url, headers=headers)
        doc = pq(html.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis
```
Getting the chapter content
Likewise, first work out where the text lives:
As you can see, it's under the div with id nr1, so we can grab that div's text content. And so that e-reader apps can split chapters automatically, we also grab the chapter title:
The title is in the h1 with id _bqgmb_h1; just extract it directly.
The code for this part:
```python
# get content
def get_content(url):
    """
    get the title and the content of a chapter
    :param url: relative chapter path, appended to 'https://m.80txt.com/6194'
    :return: (name, article)
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html = response.content.decode('utf-8')
        doc = pq(html)
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
        return name, article
    # fall back to empty strings so the caller can still iterate
    return '', ''
```
Saving
The code:
```python
# save
def write_to_file(content):
    """
    save
    :param content: the (name, article) tuple returned by get_content()
    :return: None
    """
    # the two returned strings form a tuple, so iterate over it
    for i in content:
        with open('剑来_1.txt', 'a', encoding='utf-8') as f:
            f.write('\n' + i)
```
Note:
Because get_content() above returns the two values name and article, Python automatically packs them into a tuple. write() cannot take a tuple as its argument, so we have to iterate over the tuple, i.e.:
for i in content:
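A quick illustration of that packing behavior (the strings are placeholders):

```python
def get_chapter():
    # returning two values implicitly packs them into a tuple
    return '第一章', '正文内容'

content = get_chapter()
print(type(content))  # <class 'tuple'>
# the tuple can also be unpacked directly
name, article = content
```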
The main function
Finally, write a main function:
```python
# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        time.sleep(random.randint(1, 5))
```
The time.sleep(random.randint(1, 5)) here is to keep the requests from coming too frequently and getting my IP banned by the site.
Done!
I'm certainly not buying the novel; just scrape it! Technology is innocent, after all!
Full source:
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2019/3/23 10:23
# @Site : E:\code
# @File : jianlai.py
# @Software: PyCharm
import requests
from pyquery import PyQuery as pq
import time
import random

# disguised header
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/58.0.3029.110 Safari/537.36'
}

# Access the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        html = requests.get(url, headers=headers)
        doc = pq(html.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis

# get content
def get_content(url):
    """
    get the title and the content of a chapter
    :param url: relative chapter path, appended to 'https://m.80txt.com/6194'
    :return: (name, article)
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    if response.status_code == 200:
        html = response.content.decode('utf-8')
        doc = pq(html)
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
        return name, article
    # fall back to empty strings so the caller can still iterate
    return '', ''

# save
def write_to_file(content):
    """
    save
    :param content: the (name, article) tuple returned by get_content()
    :return: None
    """
    # the two returned strings form a tuple, so iterate over it
    for i in content:
        with open('剑来_1.txt', 'a', encoding='utf-8') as f:
            f.write('\n' + i)

# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        time.sleep(random.randint(1, 5))

# run
if __name__ == '__main__':
    main()
```
This is purely for fun. If there are any mistakes, please correct me, everyone! 筱君 thanks you all!