Continuing from the previous post's Scrapy crawler.


Section 1  Extracting data with XPath

The spider, spiders/tiebaSpider.py:

import scrapy
from mySpiderOne.mySpiderOne.items import MyspideroneItem

class TiebaspiderSpider(scrapy.Spider):
    name = 'tiebaSpider'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1']

    def parse(self, response):
        # filename = "tieba.html"
        # open(filename, "wb+").write(response.body)
        items = []
        for each in response.xpath("//li[@class=' j_thread_list clearfix']//div[@class='threadlist_lz clearfix']"):
            print(each.extract())

            # TODO: wrap the extracted data in an `Item` object (filled in below)

        return items

Now run the spider:

scrapy crawl tiebaSpider

It fails immediately with a traceback:


C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne>scrapy crawl tiebaSpider
Traceback (most recent call last):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\Administrator\AppData\Local\Programs\Python\Python36-32\Scripts\scrapy.exe\__main__.py", line 9, in <module>
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\cmdline.py", line 148, in execute
    cmd.crawler_process = CrawlerProcess(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 243, in __init__
    super(CrawlerProcess, self).__init__(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 134, in __init__
    self.spider_loader = _get_spider_loader(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\crawler.py", line 330, in _get_spider_loader
    return loader_cls.from_settings(settings.frozencopy())
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 61, in from_settings
    return cls(settings)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 25, in __init__
    self._load_all_spiders()
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\spiderloader.py", line 47, in _load_all_spiders
    for module in walk_modules(name):
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\site-packages\scrapy\utils\misc.py", line 71, in walk_modules
    submod = import_module(fullpath)
  File "c:\users\administrator\appdata\local\programs\python\python36-32\lib\importlib\__init__.py", line 126, in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
  File "<frozen importlib._bootstrap>", line 978, in _gcd_import
  File "<frozen importlib._bootstrap>", line 961, in _find_and_load
  File "<frozen importlib._bootstrap>", line 950, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 655, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 678, in exec_module
  File "<frozen importlib._bootstrap>", line 205, in _call_with_frames_removed
  File "C:\Users\Administrator\PycharmProjects\mySpider\mySpiderOne\mySpiderOne\spiders\tiebaSpider.py", line 4, in <module>
    from mySpiderOne.mySpiderOne.items import MyspideroneItem
ModuleNotFoundError: No module named 'mySpiderOne.mySpiderOne'


Section 2  Fixing the import

The error says the module cannot be found, so let's change how we import it:

from ..items import MyspideroneItem

Here a single dot refers to the current package, and each additional dot goes up one level, so ..items resolves to items.py in the mySpiderOne package (one level above spiders/).
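
With the relative import in place, a minimal sketch of what the finished parse() could look like (the 'title' field and the inner .//a/text() XPath are assumptions for illustration; match them to your items.py and the actual page markup):

import scrapy
from ..items import MyspideroneItem

class TiebaspiderSpider(scrapy.Spider):
    name = 'tiebaSpider'
    allowed_domains = ['tieba.baidu.com']
    start_urls = ['https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1']

    def parse(self, response):
        items = []
        for each in response.xpath("//li[@class=' j_thread_list clearfix']//div[@class='threadlist_lz clearfix']"):
            item = MyspideroneItem()
            # 'title' is an assumed field name; use whatever items.py defines
            item['title'] = each.xpath(".//a/text()").extract_first()
            items.append(item)
        return items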

Run scrapy crawl tiebaSpider again. This time we finally get data:
2017-08-23 22:54:20 [scrapy.core.engine] INFO: Closing spider (finished)
2017-08-23 22:54:20 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 532,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 52996,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 23, 14, 54, 20, 98239),
 'log_count/DEBUG': 3,
 'log_count/INFO': 7,
 'response_received_count': 2,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2017, 8, 23, 14, 54, 18, 103125)}
2017-08-23 22:54:20 [scrapy.core.engine] INFO: Spider closed (finished)
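
Incidentally, tuning XPath expressions goes faster in Scrapy's interactive shell than by re-running the whole spider; a quick session against the same page (output elided):

scrapy shell "https://tieba.baidu.com/f?kw=%E5%9C%A8%E5%AE%B6%E8%B5%9A%E9%92%B1"
>>> sel = response.xpath("//li[@class=' j_thread_list clearfix']//div[@class='threadlist_lz clearfix']")
>>> len(sel)          # how many thread blocks the expression matched
>>> sel[0].extract()  # raw HTML of the first match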

Section 3  Running the spider directly from PyCharm
Starting the spider from the command line every time is a bit tedious, so let's configure PyCharm so that future runs only take a click of Run.




Create a new file, start.py, in the project root (next to scrapy.cfg) with the following content:


from scrapy import cmdline

cmdline.execute("scrapy crawl tiebaSpider".split())
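
Note that start.py needs to run with the project root (the directory containing scrapy.cfg) as the working directory, which PyCharm uses by default when the file lives there. As an alternative to shelling out through cmdline, Scrapy also provides a documented API for running a crawl in-process; a minimal sketch:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load settings.py so the project's pipelines and middlewares still apply
process = CrawlerProcess(get_project_settings())
process.crawl('tiebaSpider')  # the spider's `name` attribute, not the class
process.start()               # blocks until the crawl finishes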





Now you can run the spider simply by clicking the triangular Run button.


