【Question title】: Extract path and file name from <img> tag
【Posted】: 2011-01-29 23:47:46
【Question】:

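I have the source code of some web pages, and I need to find every <img> tag and extract the image's file name and location. For example, from <img src="../images/test.jpg" /> I need path="../images/" and file="test.jpg". How can I do this with a regular expression?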

【Comments】:

    Tags: python html html-parsing


    【Solution 1】:

    You should use lxml.html:

    >>> from urllib.request import urlopen  # Python 3; the original answer used urllib2 (Python 2)
    >>> from lxml import html
    >>> page = urlopen('http://www.amazon.co.uk/')
    >>> page_source = html.parse(page)
    >>> from pprint import pprint
    >>> pprint(page_source.xpath('//img/@src'))
    ['http://g-ecx.images-amazon.com/images/G/02/gno/images/orangeBlue/navPackedSprites-UK-15._V202471918_.png',
     'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/uk-marketing/xmas10/janbargains/uk-january-bargains-loz75._V175451391_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-loz-1._V173375114_.png',
     'http://g-ecx.images-amazon.com/images/G/02/uk-jw/homepage/uk-wtch-police-roto._V185455265_.png',
     'http://g-ecx.images-amazon.com/images/G/02/kindle/shasta/merch/gw/shasta-gw-bestselling-01a-470x265._V173993687_.jpg',
     'http://ecx.images-amazon.com/images/I/412wF8LJ-uL._SL135_.jpg',
     'http://ecx.images-amazon.com/images/I/51YC5H64AuL._SL135_.jpg',
     'http://ecx.images-amazon.com/images/I/41%2BdpTvM1FL._SL135_.jpg',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://g-ecx.images-amazon.com/images/G/01/x-locale/common/transparent-pixel._V42752373_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192199769_.gif',
     'http://ecx.images-amazon.com/images/I/51-kiOR0NwL._SL135_.jpg',
     'http://ecx.images-amazon.com/images/I/51DRc-7HuxL._SL135_.jpg',
     'http://ecx.images-amazon.com/images/I/51SK5htD22L._SL135_.jpg',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://z-ecx.images-amazon.com/images/G/02/x-locale/common/transparent-pixel._V192234675_.gif',
     'http://ecx.images-amazon.com/images/I/31POT%2BzL1tL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
     'http://ecx.images-amazon.com/images/I/41hkDkhjrTL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
     'http://ecx.images-amazon.com/images/I/41zDYiAWasL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
     'http://ecx.images-amazon.com/images/I/31HqB5H8j%2BL._SS120_RO10,1,201,225,243,255,255,255,15_.jpg',
     'http://g-ecx.images-amazon.com/images/G/02/uk-clothing/Lingerie/UK_APP_LingerieStore_50._V171062881_.png',
     'http://g-ecx.images-amazon.com/images/G/02/uk-pets/graphics/B000FVC1HE_50._V198692831_.jpg',
     'http://g-ecx.images-amazon.com/images/G/02/uk-grocery/images/illy_50._V198779066_.gif',
     'http://g-ecx.images-amazon.com/images/G/02/uk-electronics/MI_Store/UK_MIN_MILaunch_50._V191178779_.png',
     'http://g-ecx.images-amazon.com/images/G/02/uk-lighting/graphics/NoveltyLighting_50._V192237013_.jpg',
     'http://g-ecx.images-amazon.com/images/G/02/UK-Shoe/email/7_jan_11-amzn-sale-TCG-1._V173375108_.png',
     'http://g-ecx.images-amazon.com/images/G/02/gno/images/general/navAmazonLogoFooter._V192252709_.gif']
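
The XPath query above returns the full src values; the question also asks for the directory and the file name separately. A minimal sketch of that last step, using only the standard library, might look like the following. The helper name split_src is hypothetical, and the sketch assumes that for absolute URLs only the path component matters (scheme and host are dropped).

```python
import posixpath
from urllib.parse import urlsplit


def split_src(src):
    """Split an <img> src value into (path, file), keeping the trailing slash on path."""
    # Keep only the path component so a query string or fragment
    # does not end up in the file name. Scheme and host are dropped.
    url_path = urlsplit(src).path
    directory, filename = posixpath.split(url_path)
    # posixpath.split strips the trailing slash; add it back to match
    # the path="../images/" form used in the question.
    if directory and not directory.endswith("/"):
        directory += "/"
    return directory, filename


print(split_src("../images/test.jpg"))
# ('../images/', 'test.jpg')
```

Applied to each item of page_source.xpath('//img/@src'), this would give one (path, file) pair per image.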
    

    【Comments】:

      【Solution 2】:

      For the reasons listed in this answer, you should not use regular expressions to parse HTML. Use an HTML parser instead.
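
As an illustration of the parser-based approach, here is a minimal sketch that uses Python's built-in html.parser module rather than a third-party library. The class name ImgSrcCollector and the helper img_sources are hypothetical names chosen for this example, not something the answer prescribes.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <IMG SRC=...> matches too.
        # Self-closing tags such as <img ... /> are routed here by default as well.
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.sources.append(value)


def img_sources(html_text):
    """Return the src values of all <img> tags in html_text, in document order."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    parser.close()
    return parser.sources


sample = '<p><img src="../images/test.jpg" /><IMG alt="x" SRC=\'logo.png\'></p>'
print(img_sources(sample))
# ['../images/test.jpg', 'logo.png']
```

Unlike a regular expression, the parser handles attribute order, quoting style, and letter case without extra work.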

      【Comments】:

        【Solution 3】:

        There are many ways to do this. You can use a capture group:

        path=("[^"]+")
        

        or a lookbehind:

        (?<=path=)"[^"]+" 
        

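        There are probably plenty of other options. Either way, you should use an HTML parser for this job, as mentioned earlier. If you do go with a regular expression, you will probably need to extract the img tags first and then run one of the patterns above on each tag.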
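
The two-step approach described above (isolate the img tags first, then match the attribute inside each one) can be sketched with Python's re module. This is an illustration under stated assumptions, not the answer's exact patterns: it matches the src attribute, since that is the attribute in the question's example, whereas the patterns above are written against path=. It also places the capture group inside the quotes and handles only double-quoted values. The function name img_sources_regex is hypothetical.

```python
import re

# Step 1: find each <img ...> tag. Assumes no '>' appears inside attribute values.
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)
# Step 2: capture the double-quoted src value inside one tag.
# Requiring whitespace before "src" avoids matching attributes such as data-src.
SRC_ATTR = re.compile(r'\ssrc\s*=\s*"([^"]+)"', re.IGNORECASE)


def img_sources_regex(html_text):
    """Return the src values of <img> tags found by the two regex passes."""
    sources = []
    for tag in IMG_TAG.findall(html_text):
        match = SRC_ATTR.search(tag)
        if match:
            sources.append(match.group(1))
    return sources


sample = ('<a href="x.html"><img class="c" src="../images/test.jpg" /></a>'
          '<script src="app.js"></script>')
print(img_sources_regex(sample))
# ['../images/test.jpg']
```

Restricting the second pattern to the text of each img tag is what keeps the script's src out of the result; the sketch still breaks on single-quoted or unquoted attribute values, which is one reason a parser is the safer choice.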

        【Comments】:
