【Question title】: Retrieve text from span using xpath or css query
【Posted】: 2016-08-13 14:38:10
【Question description】:

I need to retrieve the text from the following span element as a single string, without it being split into separate text parts.

<span class="a-size-base review-text">I purchased this from Fry's Electronics.
<br/>
<br/>
The picture is quite good after tweaking the settings.  An HDMI feed from my PC results in very clear text with no distortion.  Be sure to turn down the sharpness to avoid artifacts around text.  I think this screen may offer 4:4:4 chroma subsampling based on the attached test image.  I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.
<br/>
<br/>
I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed.  The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow.  The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files.  I would much rather just see a detailed list without thumbnails.  When you finally do find your desired movie the playback is very good.  If you keep the directory contents small (~10 items or fewer) you may not have any problems.
<br/>
<br/>
The unit is very thin and light and setup was a breeze.  You just have to put in 4 screws to attach the base and then you're ready to go.  The power adapter comes with a "brick" style converter.  The remote is well laid out and the menus are easy to navigate without feeling cumbersome.
<br/>
<br/>
The stand is 8" deep x 22.25" wide.  The TV stands 26.5" from table top to the top of the bezel with stand attached.  The TV is 42.75" wide from outside bezel edge to outside bezel edge.
<br/>
<br/>
Overall I'm very pleased with what this offers in the $400-500 range.  (I actually paid $398 but that was after some customer service adjustments at Fry's).
<br/>
<br/>
NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing.  Some of the strange patterns seen in the images are not present when viewing in person.
</span>

But when I apply my XPath query

//*[contains(concat( " ", @class, " "), concat(" ", "review-text", " "))]/text()

I get:

Text='I purchased this from Fry's Electronics.'
Text=''
Text='The picture is quite good after tweaking the settings.  An HDMI feed from my PC results in very clear text with no distortion.  Be sure to turn down the sharpness to avoid artifacts around text.  I think this screen may offer 4:4:4 chroma subsampling based on the attached test image.  I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.'
Text=''
Text='I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed.  The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow.  The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files.  I would much rather just see a detailed list without thumbnails.  When you finally do find your desired movie the playback is very good.  If you keep the directory contents small (~10 items or fewer) you may not have any problems.'
Text=''
Text='The unit is very thin and light and setup was a breeze.  You just have to put in 4 screws to attach the base and then you're ready to go.  The power adapter comes with a "brick" style converter.  The remote is well laid out and the menus are easy to navigate without feeling cumbersome.'
Text=''
Text='The stand is 8" deep x 22.25" wide.  The TV stands 26.5" from table top to the top of the bezel with stand attached.  The TV is 42.75" wide from outside bezel edge to outside bezel edge.'
Text=''
Text='Overall I'm very pleased with what this offers in the $400-500 range.  (I actually paid $398 but that was after some customer service adjustments at Fry's).'
Text=''
Text='NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing.  Some of the strange patterns seen in the images are not present when viewing in person.'

I want to retrieve the text as one unbroken block. I am using this XPath tester: http://www.freeformatter.com/xpath-tester.html

【Question discussion】:

    Tags: python-2.7 xpath scrapy


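    【Solution 1】:

    A handy feature of Scrapy selectors is selector chaining: you can start with a CSS selection and then apply an XPath string function such as string() or normalize-space().

    Here is an example Scrapy 1.1 shell session: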

    ~$ scrapy shell
    2016-08-16 12:20:57 [scrapy] INFO: Scrapy 1.1.1 started (bot: scrapybot)
    2016-08-16 12:20:57 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
    (...)
    In [1]: html = '''<span class="a-size-base review-text">I purchased this from Fry's Electronics.
       ...: <br/>
       ...: <br/>
       ...: The picture is quite good after tweaking the settings.  An HDMI feed from my PC results in very clear text with no distortion.  Be sure to turn down the sharpness to avoid artifacts around text.  I think this screen may offer 4:4:4 chroma subsampling based on the attached test image.  I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.
       ...: <br/>
       ...: <br/>
       ...: I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed.  The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow.  The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files.  I would much rather just see a detailed list without thumbnails.  When you finally do find your desired movie the playback is very good.  If you keep the directory contents small (~10 items or fewer) you may not have any problems.
       ...: <br/>
       ...: <br/>
       ...: The unit is very thin and light and setup was a breeze.  You just have to put in 4 screws to attach the base and then you're ready to go.  The power adapter comes with a "brick" style converter.  The remote is well laid out and the menus are easy to navigate without feeling cumbersome.
       ...: <br/>
       ...: <br/>
       ...: The stand is 8" deep x 22.25" wide.  The TV stands 26.5" from table top to the top of the bezel with stand attached.  The TV is 42.75" wide from outside bezel edge to outside bezel edge.
       ...: <br/>
       ...: <br/>
       ...: Overall I'm very pleased with what this offers in the $400-500 range.  (I actually paid $398 but that was after some customer service adjustments at Fry's).
       ...: <br/>
       ...: <br/>
       ...: NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing.  Some of the strange patterns seen in the images are not present when viewing in person.
       ...: </span>'''
    
    In [2]: import scrapy
    
    In [3]: selector = scrapy.Selector(text=html)
    
    In [4]: selector.css('span.review-text').xpath('string()').extract_first()
    Out[4]: 'I purchased this from Fry\'s Electronics.\n\n\nThe picture is quite good after tweaking the settings.  An HDMI feed from my PC results in very clear text with no distortion.  Be sure to turn down the sharpness to avoid artifacts around text.  I think this screen may offer 4:4:4 chroma subsampling based on the attached test image.  I\'m very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.\n\n\nI wasn\'t planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed.  The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow.  The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files.  I would much rather just see a detailed list without thumbnails.  When you finally do find your desired movie the playback is very good.  If you keep the directory contents small (~10 items or fewer) you may not have any problems.\n\n\nThe unit is very thin and light and setup was a breeze.  You just have to put in 4 screws to attach the base and then you\'re ready to go.  The power adapter comes with a "brick" style converter.  The remote is well laid out and the menus are easy to navigate without feeling cumbersome.\n\n\nThe stand is 8" deep x 22.25" wide.  The TV stands 26.5" from table top to the top of the bezel with stand attached.  The TV is 42.75" wide from outside bezel edge to outside bezel edge.\n\n\nOverall I\'m very pleased with what this offers in the $400-500 range.  (I actually paid $398 but that was after some customer service adjustments at Fry\'s).\n\n\nNOTE: If you see any strange distortion in the images it\'s likely a result of the camera, image compression, and resizing.  Some of the strange patterns seen in the images are not present when viewing in person.\n'
    
    In [5]: print(selector.css('span.review-text').xpath('string()').extract_first())
    I purchased this from Fry's Electronics.
    
    
    The picture is quite good after tweaking the settings.  An HDMI feed from my PC results in very clear text with no distortion.  Be sure to turn down the sharpness to avoid artifacts around text.  I think this screen may offer 4:4:4 chroma subsampling based on the attached test image.  I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing.
    
    
    I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed.  The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow.  The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files.  I would much rather just see a detailed list without thumbnails.  When you finally do find your desired movie the playback is very good.  If you keep the directory contents small (~10 items or fewer) you may not have any problems.
    
    
    The unit is very thin and light and setup was a breeze.  You just have to put in 4 screws to attach the base and then you're ready to go.  The power adapter comes with a "brick" style converter.  The remote is well laid out and the menus are easy to navigate without feeling cumbersome.
    
    
    The stand is 8" deep x 22.25" wide.  The TV stands 26.5" from table top to the top of the bezel with stand attached.  The TV is 42.75" wide from outside bezel edge to outside bezel edge.
    
    
    Overall I'm very pleased with what this offers in the $400-500 range.  (I actually paid $398 but that was after some customer service adjustments at Fry's).
    
    
    NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing.  Some of the strange patterns seen in the images are not present when viewing in person.
    
    
    In [6]: print(selector.css('span.review-text').xpath('normalize-space()').extract_first())
    I purchased this from Fry's Electronics. The picture is quite good after tweaking the settings. An HDMI feed from my PC results in very clear text with no distortion. Be sure to turn down the sharpness to avoid artifacts around text. I think this screen may offer 4:4:4 chroma subsampling based on the attached test image. I'm very pleased with the viewing angles and the screen is definitely usable for more than just straight ahead viewing. I wasn't planning on using the Smart features, but the Netflix app works really well and is responsive enough to not become annoyed. The wifi streaming playback is very smooth, but navigating the folder structure is horribly slow. The interface insists on creating thumbnails for each movie file, which takes forever if you have a directory with many files. I would much rather just see a detailed list without thumbnails. When you finally do find your desired movie the playback is very good. If you keep the directory contents small (~10 items or fewer) you may not have any problems. The unit is very thin and light and setup was a breeze. You just have to put in 4 screws to attach the base and then you're ready to go. The power adapter comes with a "brick" style converter. The remote is well laid out and the menus are easy to navigate without feeling cumbersome. The stand is 8" deep x 22.25" wide. The TV stands 26.5" from table top to the top of the bezel with stand attached. The TV is 42.75" wide from outside bezel edge to outside bezel edge. Overall I'm very pleased with what this offers in the $400-500 range. (I actually paid $398 but that was after some customer service adjustments at Fry's). NOTE: If you see any strange distortion in the images it's likely a result of the camera, image compression, and resizing. Some of the strange patterns seen in the images are not present when viewing in person.
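
To see why text() splits the review while string() does not, here is a minimal sketch that uses only the standard library instead of Scrapy. It is an analogue, not the Scrapy API: xml.etree.ElementTree stands in for the selector, itertext() plays the role of string(), and a split/join approximates normalize-space(). The markup is a shortened, made-up stand-in for the review.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical stand-in for the review markup in the question.
html = """<span class="a-size-base review-text">First paragraph.
<br/>
<br/>
Second   paragraph.
</span>"""

span = ET.fromstring(html)

# Roughly what text() selects: one string per text node between the <br/> tags.
text_nodes = [span.text] + [br.tail for br in span]

# Analogue of XPath string(): concatenate every descendant text node in order.
full_text = "".join(span.itertext())

# Approximation of XPath normalize-space(): trim and collapse whitespace runs.
normalized = " ".join(full_text.split())

print(text_nodes)
print(repr(full_text))
print(normalized)
```

The first print shows three separate pieces, which mirrors the fragmented result in the question; the other two show the single-string results.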
    

    【Discussion】:

    • Thanks @paul trmbrth. Great solution!
    【Solution 2】:

    Convert the entire <span> element to a string:

    string(
      //*[contains(concat( " ", @class, " " ), concat( " ", "review-text", " " ))]
    )
    

    Note that this only works for the first <span> element matching the condition. In XPath 2.0 you can use string-join(), which handles any number of <span> elements:

    string-join(
        //*[contains(concat( " ", @class, " " ), concat( " ", "review-text", " " ))]/text(),
        ""
    )
    

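    【Discussion】:

    • I am using lxml, which only supports XPath 1.0, so I cannot use string-join. If I convert the whole element with string(), the XPath query seems to return a single string instead of a list.
    • The following returns a list in the scrapy shell: response.xpath('//*[contains(concat( " ", @class, " " ), concat( " ", "review-text", " " ))]').extract()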
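
One way around the "single string instead of a list" limitation in XPath 1.0 is to select the elements first and then take the string value of each one in a loop. In Scrapy this would presumably mean iterating over the result of css('span.review-text') and calling xpath('string()') on each selector, as in the calls shown in Solution 1. The sketch below illustrates the same idea with the standard library only; the page fragment and the has_class helper are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment with two reviews; not taken from the question.
page = """<div>
<span class="a-size-base review-text">Good picture.
<br/>
Easy setup.</span>
<span class="a-size-base review-text">Slow menus.
<br/>
Fine otherwise.</span>
<span class="a-size-base">Not a review.</span>
</div>"""

root = ET.fromstring(page)

def has_class(element, name):
    # Same idea as the concat(" ", @class, " ") idiom: match a whole class token.
    return name in element.get("class", "").split()

# One whitespace-normalized string per matching element,
# rather than one string for the first match only.
reviews = [
    " ".join("".join(span.itertext()).split())
    for span in root.iter("span")
    if has_class(span, "review-text")
]

print(reviews)
```

This keeps one entry per review, so the result is still a list even though each entry is a single unbroken string.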
    【Solution 3】:

    I had to post-process the result with a Python regular expression to remove the HTML tags.

    import re
    re.sub(r'<span class="a-size-base review-text">|<br\s*/?>|</span>', "", text)
    

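    I tried @har07's suggestion, but:

    • Scrapy uses lxml, which only supports XPath 1.0, so I cannot take advantage of string-join from XPath 2.0.
    • When I tried string(), I could not get a list of selectors back from my XPath query.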
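
As a rough sketch of the regex approach, the snippet below applies a pattern like the one above to a made-up serialized element. The sample string, the TAGS name, and the strip_review_tags helper are all assumptions for illustration. It accepts both `<br>` and `<br/>`, since the spelling may differ between the page source and the serialized output, and it replaces each tag with a space so adjacent sentences do not run together.

```python
import re

# Hypothetical serialized element, similar in shape to the review markup.
text = '<span class="a-size-base review-text">First line.<br>Second line.<br/>Third line.</span>'

# Match the opening span, either spelling of the line break, and the closing span.
TAGS = re.compile(r'<span class="a-size-base review-text">|<br\s*/?>|</span>')

def strip_review_tags(markup):
    # Replace each tag with a space, then collapse runs of whitespace.
    return " ".join(TAGS.sub(" ", markup).split())

cleaned = strip_review_tags(text)
print(cleaned)
```

This stays tied to the exact class attribute string, so it is more brittle than the string() approach if the markup changes.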

    【Discussion】:
