Title: Scrapy - How to crawl a website & store the data in a Microsoft SQL Server database?
Posted: 2017-04-06 22:02:03
Question:

I am trying to extract content from a website our company created. I have created a table in MS SQL Server for the Scrapy data, and I have set up Scrapy and configured Python to crawl and extract web data. My question is: how do I export the data crawled by Scrapy into my local MS SQL Server database?

Here is the Scrapy code that extracts the data:

import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        'http://quotes.toscrape.com/page/1/',
        'http://quotes.toscrape.com/page/2/',
    ]

    def parse(self, response):
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').extract_first(),
                'author': quote.css('small.author::text').extract_first(),
                'tags': quote.css('div.tags a.tag::text').extract(),
            }

Comments:

    Tags: python sql-server scrapy web-crawler


    Solution 1:

    You can send the data to SQL Server with the pymssql module, like this:

    import pymssql
    
    class DataPipeline(object):
        def __init__(self):
            self.conn = pymssql.connect(host='host', user='user', password='passwd', database='db')
            self.cursor = self.conn.cursor()
    
        def process_item(self, item, spider):
            try:
                # 'tags' is a list, so join it into a single string
                # before inserting it into one column
                self.cursor.execute(
                    "INSERT INTO MYTABLE (text, author, tags) VALUES (%s, %s, %s)",
                    (item['text'], item['author'], ', '.join(item['tags'])))
                self.conn.commit()
            except pymssql.Error as e:
                spider.logger.error("Failed to insert item: %s", e)
    
            return item
    

    You also need to add 'spider_name.pipelines.DataPipeline': 300 to the ITEM_PIPELINES dict in your settings.
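    For example, the settings entry could look like this ("spider_name" is a placeholder for your actual Scrapy project module; substitute your own):

    ```python
    # settings.py -- register the pipeline so Scrapy runs it for every
    # yielded item. "spider_name" is a placeholder for your project module.
    ITEM_PIPELINES = {
        'spider_name.pipelines.DataPipeline': 300,  # lower numbers run earlier
    }
    ```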

    Discussion:

      Solution 2:

      I think the best practice is to save the data to a CSV file first, and then load the CSV into your SQL Server table.

      import csv
      import requests
      import bs4
      
      res = requests.get('http://www.ebay.com/sch/i.html?LH_Complete=1&LH_Sold=1&_from=R40&_sacat=0&_nkw=gerald%20ford%20autograph&rt=nc&LH_Auction=1&_trksid=p2045573.m1684')
      res.raise_for_status()
      soup = bs4.BeautifulSoup(res.text, "html.parser")  # pass an explicit parser
      
      # grab all the links and store its href destinations in a list
      links = [e['href'] for e in soup.find_all(class_="vip")]
      
      # grab all the bid spans and split its contents in order to get the number only
      bids = [e.span.contents[0].split(' ')[0] for e in soup.find_all("li", "lvformat")]
      
      # grab all the prices and store those in a list
      prices = [e.contents[0] for e in soup.find_all("span", "bold bidsold")]
      
      # zip each entry out of the lists we generated before in order to combine the entries
      # belonging to each other and write the zipped elements to a list
      l = [e for e in zip(links, prices, bids)]
      
      # write each entry of the rowlist `l` to the csv output file
      with open('ebay.csv', 'w', newline='') as csvfile:  # newline='' avoids blank rows on Windows
          w = csv.writer(csvfile)
          for e in l:
              w.writerow(e)
      

      import bs4
      import numpy as np
      import pandas as pd
      import requests
      
      res = requests.get('http://www.ebay.com/sch/i.html?LH_Complete=1&LH_Sold=1&_from=R40&_sacat=0&_nkw=gerald%20ford%20autograph&rt=nc&LH_Auction=1&_trksid=p2045573.m1684')
      res.raise_for_status()
      soup = bs4.BeautifulSoup(res.text, "lxml")
      
      # grabs the link, selling price, and # of bids from historical auctions
      df = pd.DataFrame()
      
      
      l = []
      p = []
      b = []
      
      
      for links in soup.find_all(class_="vip"):
          l.append(links)
      
      for bids in soup.find_all("li", "lvformat"):
          b.append(bids)
      
      for prices in soup.find_all("span", "bold bidsold"):
          p.append(prices)
      
      # the three lists may have different lengths, so build an object array
      x = np.array((l, b, p), dtype=object)
      z = x.transpose()
      df = pd.DataFrame(z)
      df.to_csv('/Users/toasteez/ebay.csv')
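
      The second half of this approach, loading the CSV into the database, is not shown above. Here is a minimal sketch of the pattern, using Python's built-in sqlite3 purely as a stand-in for SQL Server so it runs anywhere; with pymssql or pyodbc you would use the same executemany call against a real connection, with %s placeholders instead of ?. The sample CSV text and table name are invented for illustration:

      ```python
      import csv
      import io
      import sqlite3

      # Sample CSV content standing in for the ebay.csv file produced above.
      csv_text = (
          "link,price,bids\n"
          "http://example.com/a,$12.50,3\n"
          "http://example.com/b,$7.00,1\n"
      )

      conn = sqlite3.connect(":memory:")  # stand-in for the SQL Server connection
      conn.execute("CREATE TABLE auctions (link TEXT, price TEXT, bids TEXT)")

      reader = csv.reader(io.StringIO(csv_text))
      header = next(reader)   # skip the header row
      rows = list(reader)     # each row becomes a tuple-like list of strings

      # Parameter placeholders keep the insert safe against SQL injection;
      # pymssql uses %s placeholders where sqlite3 uses ?.
      conn.executemany("INSERT INTO auctions VALUES (?, ?, ?)", rows)
      conn.commit()

      count = conn.execute("SELECT COUNT(*) FROM auctions").fetchone()[0]
      print(count)  # 2
      ```

      On the real SQL Server table you would give the columns proper types (e.g. MONEY, INT) rather than storing everything as text.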
      

      Discussion:
