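[Posted]: 2020-01-21 12:35:48
[Question]:
I am trying to scrape a website called startup-India and collect each company's URL and name. To do that I need to locate the elements that hold them, but I don't know which locating method is correct. Please help.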
import logging
from bs4 import BeautifulSoup
import requests
import csv
import scrapy


class WebCrawlerPipeline(object):
    def process_item(self, item, spider):
        return item


class ProfileCrawlerPipeline(object):
    def open_spider(self, spider):
        # Collected results, to be filled in once the elements can be located.
        self.urls = list()
        self.companies = list()

    def process_item(self, item, spider):
        item = dict(item)
        url = item.get('item')
        # yield scrapy.Request(url=url, callback=self.parse_content)
        # logging.info(url)
        r = requests.get(url).content
        soup = BeautifulSoup(r, 'html.parser')
        # url_txt = soup.select('div.container')
        container = soup.find("div", class_="container")
        logging.info(container)
        # # self.write_to_csv()
        # A pipeline must return the item so that later pipelines receive it.
        return item

    def parse_content(self, response):
        logging.info(response.url)

    def close_spider(self, spider):
        pass

    def write_to_csv(self):
        pass
Any code would be greatly appreciated.
[Comments]:
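For what it's worth, here is a minimal sketch of one way to locate the links once the container has been found, continuing with the same soup.find approach used above. The markup in SAMPLE_HTML, the class name company-link, the base URL https://example.com, and the helper names extract_companies and write_to_csv are all assumptions made for illustration; the real startup-India pages may be structured differently (and may render their content with JavaScript), so the class name would need to be replaced with whatever the browser's inspector shows.

```python
import csv
import io
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical markup standing in for a listing page; not the real site's HTML.
SAMPLE_HTML = """
<div class="container">
  <div class="company-card">
    <a class="company-link" href="/profile/acme">Acme Labs</a>
  </div>
  <div class="company-card">
    <a class="company-link" href="https://example.com/beta">Beta Works</a>
  </div>
  <div class="company-card">
    <a class="other" href="/ignored">Not a company</a>
  </div>
</div>
"""


def extract_companies(html, base_url):
    """Return a list of {'name', 'url'} dicts for every company link found."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="container")
    if container is None:
        # Nothing to extract if the expected wrapper is missing.
        return []
    companies = []
    # Assumed class name; only anchors that actually carry an href are kept.
    for link in container.find_all("a", class_="company-link", href=True):
        companies.append({
            "name": link.get_text(strip=True),
            # Resolve relative hrefs against the page URL.
            "url": urljoin(base_url, link["href"]),
        })
    return companies


def write_to_csv(companies, fileobj):
    """Write the extracted rows to any file-like object as CSV."""
    writer = csv.DictWriter(fileobj, fieldnames=["name", "url"])
    writer.writeheader()
    writer.writerows(companies)


companies = extract_companies(SAMPLE_HTML, "https://example.com")
buffer = io.StringIO()
write_to_csv(companies, buffer)
print(buffer.getvalue())
```

In the pipeline above, the same extraction could be called from process_item with the downloaded page content, and write_to_csv could be called from close_spider with a file opened using newline="" as the csv module recommends.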
-
We recommend a simpler crawler framework. Here is an example: github.com/yiyedata/simplified-scrapy-demo/tree/master/…
-
FYI, the word is scrape (and scraping, scraped, scraper), not scrap.
Tags: python-3.x web-scraping beautifulsoup scrapy