【Question title】: How to iterate through XML children node in scrapy with python?
【Posted】: 2020-06-03 20:21:11
【Question】:

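I want to scrape the comments on this page, but I can't figure out how to iterate over the children of the node that wraps the comments and pull out the data points.

Here is part of the HTML: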

        <div class="comment">
            <div class="comment-user">
                <div class="comment-user-avatar">
                    <a href="https://www.picuki.com/profile/alexandera_300">
                        <img src="https://scontent-yyz1-1.cdninstagram.com/v/t51.2885-19/s150x150/98342975_2815537605343770_6875611169034338304_n.jpg?_nc_ht=scontent-yyz1-1.cdninstagram.com&amp;_nc_ohc=VjMtcOxXuaQAX_ZCqee&amp;oh=4cf78fecbadcb57a81672c6edecc15a2&amp;oe=5F02D580" alt="alexandera_300">
                    </a>
                </div>
                <div class="comment-user-nickname">
                    <a href="https://www.picuki.com/profile/alexandera_300">@alexandera_300</a>
                </div>
            </div>
            <div class="comment-text">
                #followforfollowback
            </div>
        </div>
        <div class="comment">
            <div class="comment-user">
                <div class="comment-user-avatar">
                    <a href="https://www.picuki.com/profile/coxlogan2008">
                        <img src="https://scontent-yyz1-1.cdninstagram.com/v/t51.2885-19/s150x150/101229634_275138197009045_1475918829270859776_n.jpg?_nc_ht=scontent-yyz1-1.cdninstagram.com&amp;_nc_ohc=e4gTZqQGpEAAX_7U-Q0&amp;oh=36b7f5d1a0d7069f2447f4a318edec7d&amp;oe=5F004A54" alt="coxlogan2008">
                    </a>
                </div>
                <div class="comment-user-nickname">
                    <a href="https://www.picuki.com/profile/coxlogan2008">@coxlogan2008</a>
                </div>
            </div>
            <div class="comment-text">
                ????
            </div>
        </div>

This is the Python code snippet I am using:

    def parse_post(self, response):
        img_url = response.meta['img_url']
        caption = response.meta['caption']

        url = response.meta['url']

        comments = response.xpath('//div[@id="commantsPlace"]/text()')
        for comment in comments:
            likes = response.xpath('.//span[@class="icon-thumbs-up-alt"]/text()').get()
            # need to put a regex here to get just the number value:
            num_of_comments = response.xpath('.//span[@id="commentsCount"]/text()').get()

            comment_user_name = comment.xpath('.//*[@class="comment-user-nickname"]/a/text()').get()
            comment_text = comment.xpath('.//*[@class="comment-text"]/text()').get()

            yield {'img_url': img_url,
                   'caption': caption,
                   'url': url,
                   'likes': likes,
                   'num_of_comments': num_of_comments,
                   'comment_user_name': comment_user_name,
                   'comment_text': comment_text}

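However, when I run this, I only get the data for the first comment, repeated n times. Can someone help me solve this? I don't understand why the code does not iterate over the nodes.

Thanks in advance!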

【Comments on the question】:

    Tags: python scrapy instagram screen-scraping


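    【Solution 1】:

    I think your problem comes from the XPath used for `comments`. Because it ends in `/text()`, it selects text nodes rather than the comment elements, so there is nothing inside each result for the relative XPaths in the loop to search. The following changes made it work for me: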

    # The likes and the number of comments only need to be read once,
    # so they do not belong inside the loop.
    likes = response.xpath('.//span[@class="icon-thumbs-up-alt"]/text()').get()
    num_of_comments = response.xpath('.//span[@id="commentsCount"]/text()').get()
    # Select the comment elements themselves, not their text.
    comments = response.xpath('//div[@id="commantsPlace"]/*[@class="comment"]')
    for comment in comments:
        comment_user_name = comment.xpath('.//*[@class="comment-user-nickname"]/a/text()').get()
        comment_text = comment.xpath('.//*[@class="comment-text"]/text()').get()
        # yield the item here, as in the original parse_post
    

    【Comments】:
