Lately I've been brushing up on pyquery, which I learned a while back, while also reading the novel 剑来 (of course I won't admit it's mostly the latter...). After all, without a real need there's no improvement! Enough chatter, let's get to it.


Goals:

  1. Get the URL of every chapter of the novel
  2. Since each directory page lists only 40 chapters, walk through all the directory pages
  3. Fetch each chapter's text from its URL
  4. Save everything to a file

Getting the chapter URLs

[Screenshot: inspecting the directory page to find the chapter links]

The screenshot is a bit rough, so please bear with it; the point is only to see which node the chapter URLs sit under.

[Screenshot: the chapter links in the page source, under `.book_last dl dd a`]

Next, how do we walk through the directory pages? There are two options:

Option one: read the href attribute of the "next page" a tag and join it with the base URL. Option two: check whether the href values follow a pattern, and if so, generate them in a for loop. This post uses option two, because option one means extracting the href attribute first, which is a bit more hassle.
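For reference, option one would look roughly like this. This is a minimal sketch; the href value here is hypothetical, standing in for whatever the page's "next" link actually contains:

```python
from urllib.parse import urljoin

# Option one (not used below): read an a tag's href and join it
# with the current page's URL to get an absolute URL.
base = 'https://m.80txt.com/6194/'
href = 'page-3.html'  # hypothetical href value for illustration
print(urljoin(base, href))  # https://m.80txt.com/6194/page-3.html
```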

The code for this part:

import requests
from pyquery import PyQuery as pq

# disguised header so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# walk the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        # directory pages 2 to 16, 40 chapters per page
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        response = requests.get(url, headers=headers)
        doc = pq(response.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis

Getting the chapter content

As before, first work out where the text lives:

[Screenshot: the chapter text inside the div with id nr1]

As the screenshot shows, the text sits under the div with id nr1, so we can grab that div's text content. And so that e-reader apps can split chapters automatically, we also need to grab the chapter title:

[Screenshot: the chapter title in the h1 with id _bqgmb_h1]

The chapter title is in the h1 with id _bqgmb_h1; just extract it directly.

The code for this part:

# get content
def get_content(url):
    """
    get the title and the text of a chapter
    :param url: relative chapter url from get_directory()
    :return: name, article
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    # default to empty strings so a failed request doesn't crash
    name, article = '', ''
    if response.status_code == 200:
        doc = pq(response.content.decode('utf-8'))
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
    return name, article

Saving

The code:

# save
def write_to_file(content):
    """
    append the chapter title and text to the file
    :param content: the (name, article) tuple from get_content()
    :return: None
    """
    # get_content() returns two strings, so they form a tuple
    with open('剑来_1.txt', 'a', encoding='utf-8') as f:
        for i in content:
            f.write('\n' + i)

Notice:

Because get_content() returns two values, name and article, Python automatically packs them into a tuple. write() takes a single string, not a tuple, so we have to iterate over the tuple and write each string, i.e.:

for i in content:
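A minimal standalone demonstration of this (the function and its return values here are just for illustration):

```python
# Returning two values from a function really returns one tuple.
def get_pair():
    return 'chapter title', 'chapter text'

content = get_pair()
print(type(content).__name__)  # tuple

# write() wants a string, so iterate and handle each part separately:
parts = []
for i in content:
    parts.append('\n' + i)
print(repr(''.join(parts)))  # '\nchapter title\nchapter text'
```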

The main function

Finally, a main function to tie it all together:

# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        time.sleep(random.randint(1, 5))

I added time.sleep(random.randint(1, 5)) here so the requests don't come too fast and get my IP banned by the site.
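A quick sketch of the delay logic on its own (no network involved). Each pause is a random whole number of seconds, so the request rate varies instead of being a fixed, bot-like interval; random.uniform(1, 5) would give fractional pauses if you wanted finer granularity:

```python
import random

# Each pause is a random whole number of seconds in [1, 5].
delays = [random.randint(1, 5) for _ in range(5)]
print(delays)
print(all(1 <= d <= 5 for d in delays))  # True
```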

Done!

No buying novels here, just scrape them! After all, the technology itself is innocent!

Full source:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2019/3/23 10:23
# @Site    : E:\code
# @File    : jianlai.py
# @Software: PyCharm

import requests
from pyquery import PyQuery as pq
import time
import random

# disguised header so the request looks like a normal browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) '
                         'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
}

# walk the directory pages
def get_directory():
    """
    get the chapter urls from the directory pages
    :return: list of relative chapter urls
    """
    lis = []
    for i in range(15):
        # directory pages 2 to 16, 40 chapters per page
        url = 'https://m.80txt.com/6194/page-{}.html'.format(i + 2)
        response = requests.get(url, headers=headers)
        doc = pq(response.content.decode('utf-8'))
        items = doc('.book_last dl dd a')
        for item in items:
            lis.append(item.get('href'))
    return lis

# get content
def get_content(url):
    """
    get the title and the text of a chapter
    :param url: relative chapter url from get_directory()
    :return: name, article
    """
    url = 'https://m.80txt.com/6194' + str(url)
    response = requests.get(url, headers=headers)
    # default to empty strings so a failed request doesn't crash
    name, article = '', ''
    if response.status_code == 200:
        doc = pq(response.content.decode('utf-8'))
        name = doc('#_bqgmb_h1').text()
        article = doc('#nr #nr1').text()
    return name, article

# save
def write_to_file(content):
    """
    append the chapter title and text to the file
    :param content: the (name, article) tuple from get_content()
    :return: None
    """
    # get_content() returns two strings, so they form a tuple
    with open('剑来_1.txt', 'a', encoding='utf-8') as f:
        for i in content:
            f.write('\n' + i)

# main
def main():
    for url in get_directory():
        content = get_content(url)
        write_to_file(content)
        # random pause so the requests don't look automated
        time.sleep(random.randint(1, 5))

# run
if __name__ == '__main__':
    main()

Just for fun; if there are any mistakes, corrections from you pros are welcome! Thanks, everyone, from 筱君!
