1. First, create a Scrapy project.
Run the following in the command line:
scrapy startproject project_name
Here project_name is the name of your project; mine is py_scrapyjobbole. The generated directory layout is:

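The original screenshot of the layout is missing here; a typical run of scrapy startproject py_scrapyjobbole produces a structure like this (file names are Scrapy's defaults):

```
py_scrapyjobbole/
├── scrapy.cfg            # deploy configuration
└── py_scrapyjobbole/
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider / downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/
        └── __init__.py
```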
2. Create a new Spider.
Run the following in the command line:
scrapy genspider jobbole blog.jobbole.com
Here jobbole is the spider name and blog.jobbole.com is the start URL to crawl. The command generates a spider skeleton, which I edited into:
# -*- coding: utf-8 -*-
import scrapy


class JobboleSpider(scrapy.Spider):
    name = 'jobbole'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/111322/']

    def parse(self, response):
        re_select = response.xpath('//*[@id="post-111322"]/div[1]/h1')
        pass
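In Scrapy, response.xpath() returns a SelectorList, and you call .extract_first() (or .extract()) on it to get the matched strings. The underlying XPath idea can be sketched with the standard library's ElementTree, which supports a limited XPath subset (the HTML snippet below is a made-up stand-in for the real article page):

```python
import xml.etree.ElementTree as ET

# Miniature stand-in for the article page; the real markup is assumed.
html = '<div id="post-111322"><div><h1>Some Article Title</h1></div></div>'
root = ET.fromstring(html)

# ElementTree's find() takes a simple XPath-like expression; Scrapy's
# response.xpath() accepts the fuller expression used in parse() above
# and returns Selector objects instead of raw elements.
title = root.find('./div/h1').text
print(title)  # Some Article Title
```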
3. Configure the settings.py file (this step is important):
BOT_NAME = 'py_scrapyjobbole'
SPIDER_MODULES = ['py_scrapyjobbole.spiders']
NEWSPIDER_MODULE = 'py_scrapyjobbole.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
# USER_AGENT = 'py_scrapyjobbole (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
Be sure to set ROBOTSTXT_OBEY to False; otherwise Scrapy obeys the site's robots.txt and may filter out the request, so breakpoint debugging won't proceed normally.
4. Create a main.py file in the project root; we will debug from it shortly.
from scrapy.cmdline import execute
import sys
import os

# Add the project root to sys.path so breakpoint debugging of the spider works.
# A hardcoded path also works, e.g.:
# sys.path.append('D:PyCharmpy_scrapyjobbole')
sys.path.append(os.path.dirname(os.path.abspath(__file__)))
print(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running "scrapy crawl jobbole" on the command line.
execute(['scrapy', 'crawl', 'jobbole'])
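The sys.path line above computes the directory that contains main.py: os.path.abspath resolves __file__ to a full path, and os.path.dirname strips the file name off. A small standalone illustration (the path below is a made-up stand-in for __file__):

```python
import os

# Stand-in for __file__ inside main.py.
script = '/home/user/py_scrapyjobbole/main.py'

# dirname drops the final path component, leaving the project directory.
project_dir = os.path.dirname(script)
print(project_dir)  # /home/user/py_scrapyjobbole
```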
5. Start breakpoint debugging: set a breakpoint (for example on the re_select line in parse) and run main.py under the IDE's debugger.