[Question title]: scrapy, how to separate text within an HTML tag element
[Posted]: 2013-09-04 08:47:12
[Question]:

The HTML containing my data:

        <div id="content"><!-- InstanceBeginEditable name="EditRegion3" -->
      <div id="content_div">
    <div class="title" id="content_title_div"><img src="img/banner_outlets.jpg" width="920" height="157" alt="Outlets" /></div>
    <div id="menu_list">
<table border="0" cellpadding="5" cellspacing="5" width="100%">
    <tbody>
        <tr>
            <td valign="top">
                <p>
                    <span class="foodTitle">Century Square</span><br />
                    2 Tampines Central 5<br />
                    #01-44-47 Century Square<br />
                    Singapore 529509</p>
                <p>
                    <br />
                    <strong>Opening Hours:</strong><br />
                    7am to 12am (Sun-Thu &amp;&nbsp;PH)<br />
                    24 Hours (Fri &amp; Sat&nbsp;&amp;</p>
                <p>
                    Eve of PH)<br />
                    Telephone: 6789 0457</p>
            </td>
            <td valign="top">
                <img alt="Century Square" src="/assets/images/outlets/century_sq.jpg" style="width: 260px; height: 140px" /></td>
            <td valign="top">
                <span class="foodTitle">Liat Towers</span><br />
                541 Liat towers #01-01<br />
                Orchard Road<br />
                Singapore 238888<br />
                <br />
                <strong>Opening Hours: </strong><br />
                24 hours (Daily)<br />
                <br />
                Telephone: 6737 8036</td>
            <td valign="top">
                <img alt="Liat Towers" src="/assets/images/outlets/century_liat.jpg" style="width: 260px; height: 140px" /></td>
        </tr>

**I want to get:

Place name: Century Square, Liat Towers

Address: 2 Tampines Central 5, 541 Liat towers #01-01

Postal code: Singapore 529509, Singapore 238888

Opening hours: 7am to 12am, 24 hours (Daily)**

For example:

The first `<td>` contains 3 pieces of data I want (name, address, postal code). How do I split them apart?

Here is my spider code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
import re
from todo.items import wendyItem

class wendySpider(BaseSpider):
    name = "wendyspider"
    allowed_domains = ["wendys.com.sg"]
    start_urls = ["http://www.wendys.com.sg/outlets.php"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        values = hxs.select('//td')
        items = []
        for value in values:
            item = wendyItem()
            item['name'] = value.select('//span[@class="foodTitle"]/text()').extract()
            item['address'] = value.select().extract()
            item['postal'] = value.select().extract()
            item['hours'] = value.select().extract()
            item['contact'] = value.select().extract()
            items.append(item)
        return items

[Question discussion]:

    标签: python screen-scraping scrapy web-crawler


    [Solution 1]:

    I would select all the `<td valign="top">` cells that contain a `<span class="foodTitle">`:

    //div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]
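    As a quick sanity check, the same XPath can be run outside Scrapy with lxml (used here purely for illustration; Scrapy's selectors use the same XPath engine underneath) against a trimmed-down copy of the question's HTML:

    ```python
    from lxml import html

    # Trimmed-down copy of the question's markup: only the parts the XPath inspects
    doc = html.fromstring("""
    <div id="menu_list"><table><tbody><tr>
      <td valign="top"><span class="foodTitle">Century Square</span><br/>2 Tampines Central 5</td>
      <td valign="top"><img src="/assets/images/outlets/century_sq.jpg"/></td>
      <td valign="top"><span class="foodTitle">Liat Towers</span><br/>541 Liat towers #01-01</td>
    </tr></tbody></table></div>
    """)

    # The [.//span[@class="foodTitle"]] predicate keeps only outlet cells,
    # skipping the image-only cells in the same row
    cells = doc.xpath('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]')
    print(len(cells))  # 2
    ```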
    

    Then, for each `td` cell, get all the text nodes with:

    .//text()
    

    You'll get something like this:

    ['\n                ',
     '\n                    ',
     'Century Square',
     '\n                    2 Tampines Central 5',
     '\n                    #01-44-47 Century Square',
     '\n                    Singapore 529509',
     '\n                ',
     '\n                    ',
     'Opening Hours:',
     u'\n                    7am to 12am (Sun-Thu &\xa0PH)',
     u'\n                    24 Hours (Fri & Sat\xa0&',
     '\n                ',
     '\n                    Eve of PH)',
     '\n                    Telephone: 6789 0457',
     '\n            ']
    

    ['\n                ',
     'Liat Towers',
     '\n                541 Liat towers #01-01',
     '\n                Orchard Road',
     '\n                Singapore 238888',
     'Opening Hours: ',
     '\n                24 hours (Daily)',
     '\n                Telephone: 6737 8036']
    

    Some of those text nodes are whitespace-only strings, so strip them out, then look for the "Opening Hours" and "Telephone" keywords while looping over the lines:

    from scrapy.spider import BaseSpider
    from scrapy.selector import HtmlXPathSelector
    import re
    from todo.items import wendyItem
    
    class wendySpider(BaseSpider):
        name = "wendyspider"
        allowed_domains = ["wendys.com.sg"]
        start_urls = ["http://www.wendys.com.sg/outlets.php"]
    
        def parse(self, response):
            hxs = HtmlXPathSelector(response)
            cells = hxs.select('//div[@id="menu_list"]//td[@valign="top"][.//span[@class="foodTitle"]]')
            items = []
            for cell in cells:
                item = wendyItem()
    
                # get all text nodes
                # some lines are blank so .strip() them
                lines = cell.select('.//text()').extract()
                lines = [l.strip() for l in lines if l.strip()]
    
                # first non-blank line is the place name
                item['name'] = lines.pop(0)
    
                # for the other lines, check for "Opening hours" and "Telephone"
                # to store lines in correct list container
    
                address_lines = []
                hours_lines = []
                telephone_lines = []
    
                opening_hours = False
                telephone = False
    
                for line in lines:
                    if 'Opening Hours' in line:
                        opening_hours = True
                    elif 'Telephone' in line:
                        telephone = True
                    if telephone:
                        telephone_lines.append(line)
                    elif opening_hours:
                        hours_lines.append(line)
                    else:
                        address_lines.append(line)
    
                # last address line is the postal code + town name
                item['address'] = "\n".join(address_lines[:-1])
                item['postal'] = address_lines[-1]
    
                # omit "Opening Hours:" (first element in the list)
                item['hours'] = "\n".join(hours_lines[1:])
    
                item['contact'] = "\n".join(telephone_lines)
    
                items.append(item)
    
            return items
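    The line-splitting loop can also be exercised on its own, without running the spider. This sketch (the function name `split_outlet` is my own, not part of the answer) applies the same logic to the first cell's stripped text nodes shown earlier:

    ```python
    def split_outlet(lines):
        """Split a cell's stripped text lines into name/address/postal/hours/contact."""
        item = {'name': lines[0]}  # first non-blank line is the place name
        address, hours, phone = [], [], []
        opening_hours = telephone = False
        for line in lines[1:]:
            if 'Opening Hours' in line:
                opening_hours = True
            elif 'Telephone' in line:
                telephone = True
            if telephone:
                phone.append(line)
            elif opening_hours:
                hours.append(line)
            else:
                address.append(line)
        item['address'] = "\n".join(address[:-1])
        item['postal'] = address[-1]          # last address line is the postal code
        item['hours'] = "\n".join(hours[1:])  # drop the "Opening Hours:" label itself
        item['contact'] = "\n".join(phone)
        return item

    # The first cell's text nodes after stripping blanks (taken from the output above)
    lines = ['Century Square', '2 Tampines Central 5', '#01-44-47 Century Square',
             'Singapore 529509', 'Opening Hours:', '7am to 12am (Sun-Thu &\xa0PH)',
             '24 Hours (Fri & Sat\xa0&', 'Eve of PH)', 'Telephone: 6789 0457']
    item = split_outlet(lines)
    print(item['postal'])  # Singapore 529509
    ```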
    

    [Comments]:

    • Oh my god, thank you so much Paul! You really helped me a lot, great post. I don't have enough reputation to upvote, but to anyone reading this, please upvote him
    • Thanks @HeadAboutToExplode... but you have enough to accept the answer, don't you? ;-)
    • Hi @paultrmbrth, could you take a look at my question stackoverflow.com/questions/24109713/… Thanks for your support
    • @paultrmbrth could you help me with this question stackoverflow.com/questions/37815366/…