【Question title】: Extracting elements from a list using XPath
【Posted】: 2021-12-09 02:00:34
【Question description】:

I am trying to extract the list of cities from this TripAdvisor page: https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html

I want to do this using only Scrapy and XPath. Here is what I have tried:

def parse(self, response):
    cities = response.xpath('//div[@id="LOCATION_LIST"]')
    for links in cities:
        loader = ItemLoader(AdvisorItem(), selector=links)
        loader.add_xpath('cities', './/ul[@class="geoList"]/li/span[@class="state"]//text()')
        loader.add_xpath('cities_url', './/ul[@class="geoList"]/li/a//@href')
        yield loader.load_item()

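This returns only one result, and that result is West Yorkshire, which is not even on that page! So I am not sure where it comes from. What are the correct XPath expressions to get the link and the city name for every link on that page?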

【Comments】:

    Tags: python xpath scrapy


    【Solution 1】:

    You can select the list items with an XPath locator like this:

    //*[@class="geoList"]/li 
    

    This selects the list of <li> elements.

    ".//a/text()"  
    

    ".//a/@href"
    

    Evaluated relative to each <li>, these select the city name and the link, respectively.

    Here is a Scrapy implementation as an example:

    Script:

    import scrapy


    class TripSpider(scrapy.Spider):
        name = 'trip'
        allowed_domains = ["tripadvisor.co.uk"]
        start_urls = ['https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html']

        def parse(self, response):
            # Each <li> under the geoList element holds one city link.
            cities = response.xpath('//*[@class="geoList"]/li')
            for city in cities:
                # The href is site-relative, so prefix the site root.
                url = city.xpath(".//a/@href").get()
                abs_url = f'https://www.tripadvisor.co.uk{url}'
                yield {
                    'city': city.xpath(".//a/text()").get(),
                    'link': abs_url,
                }
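
The script builds the absolute URL by string concatenation, which works as long as every `href` is site-relative and starts with a slash. A slightly more defensive alternative, sketched here with the standard library, is to resolve the link against the page URL. `BASE_URL` and `absolute_link` are illustrative names, not part of the original answer.

```python
from urllib.parse import urljoin

# The listing page the spider starts from.
BASE_URL = "https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html"


def absolute_link(href):
    """Resolve a relative or absolute href against the listing page URL."""
    return urljoin(BASE_URL, href)


relative = absolute_link("/Restaurants-g186408-Bradford_West_Yorkshire_England.html")
unchanged = absolute_link("https://www.tripadvisor.co.uk/Restaurants-g186258-Plymouth_Devon_England.html")
print(relative)
print(unchanged)
```

Inside a spider callback, `response.urljoin(url)` performs the same resolution against the URL of the response.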
    

    Output:

    {'city': 'Bradford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186408-Bradford_West_Yorkshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Plymouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186258-Plymouth_Devon_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Southend-on-Sea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g503790-Southend_on_Sea_Essex_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Swansea Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186466-Swansea_Swansea_County_South_Wales_Wales.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Aberdeen Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186487-Aberdeen_Aberdeenshire_Scotland.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Coventry Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186403-Coventry_West_Midlands_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Portsmouth Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186298-Portsmouth_Hampshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Kingston-upon-Hull Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186317-Kingston_upon_Hull_East_Riding_of_Yorkshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Oxford Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186361-Oxford_Oxfordshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Isle of Wight Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186308-Isle_of_Wight_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Doncaster Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187067-Doncaster_South_Yorkshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Reading Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186363-Reading_Berkshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Cambridge Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186225-Cambridge_Cambridgeshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Milton Keynes Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187055-Milton_Keynes_Buckinghamshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Derby Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187048-Derby_Derbyshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Stockport Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g528793-Stockport_Greater_Manchester_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Northampton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186349-Northampton_Northamptonshire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Bolton Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187053-Bolton_Greater_Manchester_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Bath Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g186370-Bath_Somerset_England.html'}
    2021-12-08 23:58:50 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.tripadvisor.co.uk/Restaurants-g186216-oa20-United_Kingdom.html>
    {'city': 'Preston Restaurants', 'link': 'https://www.tripadvisor.co.uk/Restaurants-g187062-Preston_Lancashire_England.html'}
    2021-12-08 23:58:50 [scrapy.core.engine] INFO: Closing spider (finished)
    2021-12-08 23:58:50 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 345,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 103132,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'elapsed_time_seconds': 3.084321,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2021, 12, 8, 17, 58, 50, 809225),
     'httpcompression/response_bytes': 384303,
     'httpcompression/response_count': 1,
     'item_scraped_count': 20,
    

    【Comments】:

    • This works like a charm! Since I like using loaders, I updated my code with your changes.