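【Question Title】: Scraping all links from csv file
【Posted】: 2020-12-27 20:24:36
【Question】:

I am trying to scrape information from the links in my links.csv file. It has 71 links, but the spider only scrapes 25 of them (screenshot: https://i.stack.imgur.com/meKQG.png). What am I doing wrong? How can I feed all the links from the csv file into start_urls?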



import pandas as pd
import scrapy


class HurriyetEmlakPage(scrapy.Spider):
    name = 'hurriyetspider'
    n = 3
    page_number = 2
    df1 = pd.read_csv("C:/Users/Mert/Desktop/hurriyet/emlak/links.csv")
    
    start_urls = [str(df1.iloc[2 , 1])]



    custom_settings={ 'FEED_URI': "scrapped_pages.csv",
                       'FEED_FORMAT': 'csv'}


    def parse(self, response):
        il = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "short-info-list", " " ))]//li[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]/text()').extract()
        ilce = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "short-info-list", " " ))]//li[(((count(preceding-sibling::*) + 1) = 2) and parent::*)]/text()').extract()
        mahalle = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "short-info-list", " " ))]//li[(((count(preceding-sibling::*) + 1) = 3) and parent::*)]/text()').extract()
        fiyat = response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "price", " " ))]/text()').extract()
        baslik = response.css('.txt::text').extract()
        deger = response.css('.adv-info-list div span , .txt+ span::text').extract()

        scraped_info = {
            'İl': il,
            'İlçe' : ilce,
            'Mahalle' : mahalle,
            'Fiyat' : fiyat,
            'İlan Bilgileri - Başlık': baslik,
            'İlan Bilgileri - Değer' : deger
        }
        yield scraped_info
        df = HurriyetEmlakPage.df1
        x = HurriyetEmlakPage.n

        next_link = str(df.iloc[x,1])

        if HurriyetEmlakPage.n < len(df):

            HurriyetEmlakPage.n +=1

            yield response.follow(next_link,callback=self.parse)

【Comments】:

Tags: python csv web-scraping scrapy web-crawler


【Solution 1】:

The following script reads your csv and loads it into a dataframe.

import pandas as pd

df = pd.read_csv('links.csv').drop('Unnamed: 0', axis=1)

print(df) #print dataframe
print(df['Linkler'][0]) #print first link in dataframe

The project directory structure is:

  • root directory

    • links.csv
    • main.py

The csv file is:

,Linkler
0,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3491
1,https://www.hurriyetemlak.com/ankara-etimesgut-baglica-satilik/daire/111267-690
2,https://www.hurriyetemlak.com/antalya-kepez-gazi-satilik/daire/94212-293
3,https://www.hurriyetemlak.com/antalya-kepez-gunes-satilik/daire/94212-295
4,https://www.hurriyetemlak.com/antalya-serik-belek-satilik/villa/65665-706
5,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-altinoran-cankaya
6,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3487
7,https://www.hurriyetemlak.com/izmir-cigli-balatcik-satilik/daire/106987-28
8,https://www.hurriyetemlak.com/ankara-sincan-ertugrulgazi-satilik/daire/13848-3472
9,https://www.hurriyetemlak.com/projeler/firat-life-style/natura-batikent2
10,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/daire/77789-1398
11,https://www.hurriyetemlak.com/izmir-cesme-sifne-satilik/villa/119588-149
12,https://www.hurriyetemlak.com/istanbul-kadikoy-suadiye-satilik/daire/4369-36455
13,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/villa/77789-1400
14,https://www.hurriyetemlak.com/projeler/gozde-grubu/projelermyvia-wins-gozde-grubu
15,https://www.hurriyetemlak.com/ankara-sincan-osmanli-satilik/daire/13848-3445
16,https://www.hurriyetemlak.com/ankara-sincan-selcuklu-satilik/daire/13848-3477
17,https://www.hurriyetemlak.com/eskisehir-odunpazari-visnelik-satilik/daire/111946-185
18,https://www.hurriyetemlak.com/ankara-kecioren-pinarbasi-satilik/daire/101486-750
19,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-gokorman-sinpas-gyo
20,https://www.hurriyetemlak.com/antalya-alanya-guller-pinari-satilik/daire/119217-57
21,https://www.hurriyetemlak.com/projeler/sur-yapi/sur-yapi-antalya-sur-yapi
22,https://www.hurriyetemlak.com/ankara-kecioren-ovacik-satilik/daire/16354-18577
23,https://www.hurriyetemlak.com/antalya-muratpasa-guzeloba-satilik/daire/65665-735
24,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3491
25,https://www.hurriyetemlak.com/ankara-etimesgut-baglica-satilik/daire/111267-690
26,https://www.hurriyetemlak.com/antalya-kepez-gazi-satilik/daire/94212-293
27,https://www.hurriyetemlak.com/antalya-kepez-gunes-satilik/daire/94212-295
28,https://www.hurriyetemlak.com/antalya-serik-belek-satilik/villa/65665-706
29,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-altinoran-cankaya
30,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3487
31,https://www.hurriyetemlak.com/izmir-cigli-balatcik-satilik/daire/106987-28
32,https://www.hurriyetemlak.com/ankara-sincan-ertugrulgazi-satilik/daire/13848-3472
33,https://www.hurriyetemlak.com/projeler/firat-life-style/natura-batikent2
34,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/daire/77789-1398
35,https://www.hurriyetemlak.com/izmir-cesme-sifne-satilik/villa/119588-149
36,https://www.hurriyetemlak.com/istanbul-kadikoy-suadiye-satilik/daire/4369-36455
37,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/villa/77789-1400
38,https://www.hurriyetemlak.com/projeler/gozde-grubu/projelermyvia-wins-gozde-grubu
39,https://www.hurriyetemlak.com/ankara-sincan-osmanli-satilik/daire/13848-3445
40,https://www.hurriyetemlak.com/ankara-sincan-selcuklu-satilik/daire/13848-3477
41,https://www.hurriyetemlak.com/eskisehir-odunpazari-visnelik-satilik/daire/111946-185
42,https://www.hurriyetemlak.com/ankara-kecioren-pinarbasi-satilik/daire/101486-750
43,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-gokorman-sinpas-gyo
44,https://www.hurriyetemlak.com/antalya-alanya-guller-pinari-satilik/daire/119217-57
45,https://www.hurriyetemlak.com/projeler/sur-yapi/sur-yapi-antalya-sur-yapi
46,https://www.hurriyetemlak.com/ankara-kecioren-ovacik-satilik/daire/16354-18577
47,https://www.hurriyetemlak.com/antalya-muratpasa-guzeloba-satilik/daire/65665-735
48,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3491
49,https://www.hurriyetemlak.com/ankara-etimesgut-baglica-satilik/daire/111267-690
50,https://www.hurriyetemlak.com/antalya-kepez-gazi-satilik/daire/94212-293
51,https://www.hurriyetemlak.com/antalya-kepez-gunes-satilik/daire/94212-295
52,https://www.hurriyetemlak.com/antalya-serik-belek-satilik/villa/65665-706
53,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-altinoran-cankaya
54,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3487
55,https://www.hurriyetemlak.com/izmir-cigli-balatcik-satilik/daire/106987-28
56,https://www.hurriyetemlak.com/ankara-sincan-ertugrulgazi-satilik/daire/13848-3472
57,https://www.hurriyetemlak.com/projeler/firat-life-style/natura-batikent2
58,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/daire/77789-1398
59,https://www.hurriyetemlak.com/izmir-cesme-sifne-satilik/villa/119588-149
60,https://www.hurriyetemlak.com/istanbul-kadikoy-suadiye-satilik/daire/4369-36455
61,https://www.hurriyetemlak.com/balikesir-edremit-akcay-satilik/villa/77789-1400
62,https://www.hurriyetemlak.com/projeler/gozde-grubu/projelermyvia-wins-gozde-grubu
63,https://www.hurriyetemlak.com/ankara-sincan-osmanli-satilik/daire/13848-3445
64,https://www.hurriyetemlak.com/ankara-sincan-selcuklu-satilik/daire/13848-3477
65,https://www.hurriyetemlak.com/eskisehir-odunpazari-visnelik-satilik/daire/111946-185
66,https://www.hurriyetemlak.com/ankara-kecioren-pinarbasi-satilik/daire/101486-750
67,https://www.hurriyetemlak.com/projeler/sinpas-gyo/sinpas-gokorman-sinpas-gyo
68,https://www.hurriyetemlak.com/antalya-alanya-guller-pinari-satilik/daire/119217-57
69,https://www.hurriyetemlak.com/projeler/sur-yapi/sur-yapi-antalya-sur-yapi
70,https://www.hurriyetemlak.com/ankara-kecioren-ovacik-satilik/daire/16354-18577
71,https://www.hurriyetemlak.com/antalya-muratpasa-guzeloba-satilik/daire/65665-735
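A note on the listing above: it has 72 rows but only 24 distinct URLs, because rows 24 to 47 and rows 48 to 71 repeat rows 0 to 23. By default Scrapy drops requests for URLs it has already seen (unless the request is created with `dont_filter=True`), so a chain of follow-up requests that reaches a repeated link will likely stop there. That is an inference from the data, not something the original poster confirmed. A minimal sketch for checking a csv for repeats, using only the standard library; `SAMPLE_CSV` is a shortened, hypothetical stand-in for links.csv and `count_links` is an illustrative helper name:

```python
import csv
import io
from collections import Counter

# Shortened stand-in for links.csv (hypothetical): three links listed twice.
SAMPLE_CSV = """,Linkler
0,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3491
1,https://www.hurriyetemlak.com/ankara-etimesgut-baglica-satilik/daire/111267-690
2,https://www.hurriyetemlak.com/antalya-kepez-gazi-satilik/daire/94212-293
3,https://www.hurriyetemlak.com/ankara-sincan-saraycik-satilik/daire/13848-3491
4,https://www.hurriyetemlak.com/ankara-etimesgut-baglica-satilik/daire/111267-690
5,https://www.hurriyetemlak.com/antalya-kepez-gazi-satilik/daire/94212-293
"""


def count_links(csv_text, column="Linkler"):
    """Return (total rows, distinct links, {link: count} for repeated links)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    links = [row[column].strip() for row in reader if row[column]]
    counts = Counter(links)
    repeated = {link: n for link, n in counts.items() if n > 1}
    return len(links), len(counts), repeated


total, distinct, repeated = count_links(SAMPLE_CSV)
print(f"{total} rows, {distinct} distinct links, {len(repeated)} links repeated")
```

Running the same helper on the full file contents would show how many of the rows Scrapy can actually be expected to fetch with its default settings.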

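【Comments】:

  • After loading the links into the dataframe, how do I feed them into start_urls for scraping?
  • df['Linkler'] gives you the column of links, which you can iterate over. You can do: ``` for link in df['Linkler']: scrape(link) ``` assuming you have defined scrape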
【Solution 2】:

The spider-feeder Scrapy plugin provides built-in support for this.

【Comments】: