【Question title】: Python scrapy page is not working with xpath
【Posted】: 2016-05-22 20:53:47
【Question】:

I am using Python 3.5 to scrape an HTML string and extract the name it contains. My code is as follows:

from scrapy.selector import Selector

html_string = '''<html
    xmlns:v="urn:schemas-microsoft-com:vml"
    xmlns:o="urn:schemas-microsoft-com:office:office"
    xmlns:w="urn:schemas-microsoft-com:office:word"
    xmlns="http://www.w3.org/TR/REC-html40">
    <head>
        <link href="https://www.rentlinx.com/Templates/MainStyle.css" type="text/css" rel="stylesheet" ></link>
        <style>
            <!-- /* Style Definitions */ p.MsoNormal, li.MsoNormal, div.MsoNormal {margin:0in; margin-bottom:.0001pt; font-size:12.0pt;} -->
        </style>
    </head>
    <body>
        <table width="100%" cellpadding="12">
            <tr>
                <td width="100%" style="background: #e8f2f5;">
                    <img src="https://www.rentlinx.com/images/page-logo-v15.png" alt="new lead" style="padding-top: 2px; padding-bottom: 2px;" />
                </td>
            </tr>
        </table>
        <br />
        <table width="100%" cellpadding="12">
            <tr>
                <td>
                    <p style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666; font-weight: bold;">You have a new lead!</p>
                    <p style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;"> This 
                        <strong>basic (free)</strong> lead was generated for your property courtesy of RentLinx. 
                    </p>
                    <br />
                    <table cellpadding="7" style="width: 200px;">
                        <tr>
                            <td style="font-family: Tahoma, sans-serif; background: #009dc6; color: white; font-size: 20px; display: inline-block;"> Lead Details </td>
                        </tr>
                    </table>
                    <table cellpadding="0" cellspacing="0" border="0">
                        <tr>
                            <td style="border-right: 1px solid #CCC; width: 12px;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">From:</span>
                                <br />
                                <span style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;">Foo bar</span>
                            </td>
                        </tr>
                        <tr>
                            <td style="border-right: 1px solid #CCC;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">Date:</span>
                                <br />
                                <span style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;">5/21/2016 3:24:10 AM</span>
                            </td>
                        </tr>
                        <tr>
                            <td style="border-right: 1px solid #CCC;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">Regarding:</span>
                                <br />
                                <span style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;">My street and your street</span>
                            </td>
                        </tr>
                        <tr>
                            <td style="border-right: 1px solid #CCC;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">Contact Information:</span>
                                <br />
                                <span class="value" style="line-height: 28px; padding-top: 5px;">
                                    <a href="tel:1112223333" title="Call" style="color: #007998; font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; text-decoration: none;">(111) 222-3333</a>
                                    <br />
                                    <a href="mailto:foobar@gmail.com" title="Email" style="color: #007998; font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; text-decoration: none;">foobar@gmail.com</a>
                                </span>
                            </td>
                        </tr>
                        <tr>
                            <td style="border-right: 1px solid #CCC;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">Comments:</span>
                                <br />
                                <span style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;"> Hi, I like your apartment. Thanks </span>
                            </td>
                        </tr>
                        <tr>
                            <td style="border-right: 1px solid #CCC;"></td>
                            <td style="padding: 10px;">
                                <span style="font-family: Tahoma, sans-serif; color: #00aedb; font-size: 12px; font-weight: bold; text-transform: uppercase; line-height: 15px;">Lead From:</span>
                                <br />
                                <span style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;">
                                    <a href="https://www.marsplanet.com/13933360/" title="Lead from Mars Planet" style="font-family: Tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;">MarsPlanet</a>
                                </span>
                            </td>
                        </tr>
                    </table>
                </td>
            </tr>
        </table>
        <table width="100%" cellpadding="12">
            <tr>
                <td>
                    <p style="font-family: tahoma, sans-serif; font-size: 16px; line-height: 20px; color: #666;"> Thanks,
                        <br /> The RentLinx Team 
                    </p>
                    <p style="background-color: #3f3d5d; color: White; padding: 8px; "> Want more leads like this? Upgrade your property to RentLinx 
                        <strong>
                            <em>Plus!</em>
                        </strong> today! Just 
                        <a href="https://www.rentlinx.com" style="color: #fff;">login to RentLinx</a>, then click "Go Plus!" 
                    </p>
                    <p>
                        <a href="http://www.facebook.com/rentlinx">
                            <img src="https://www.rentlinx.com/images/facebook/FB-f-Logo__blue_29.png" width="29" height="29" style="margin: 8px; border: 0;" align="absmiddle" />
                        </a>Like RentLinx? Please like us on facebook! 
                        <a href="http://www.facebook.com/rentlinx">www.facebook.com/rentlinx</a>
                    </p>
                </td>
            </tr>
        </table>
        <img src="http://delivery.rentlinx.com/" alt="" width="1" height="1" border="0" style="height:1px !important;width:1px !important;border-width:0 !important;margin-top:0 !important;margin-bottom:0 !important;margin-right:0 !important;margin-left:0 !important;padding-top:0 !important;padding-bottom:0 !important;padding-right:0 !important;padding-left:0 !important;"/>
    </body>
</html>'''

s = Selector(text=html_string)
name = s.xpath('/html/body/table[2]/tbody/tr/td/table[2]/tbody/tr[1]/td[2]/span[2]/text()').extract()[0]

print(name)

It should print Foo bar, since that is the value I am trying to get, but it gives an empty string instead. Any idea what I am doing wrong?

【Comments】:

    Tags: python-3.x xpath web-scraping scrapy


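    【Answer 1】:

    I don't see any tbody tag in your HTML, yet your XPath steps through one anyway. I know it can be misleading, because a modern browser's inspector shows a tbody inside every table, but the browser adds that element itself; it is not in the source at all.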
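One way to locate the step at which a long positional path stops matching is to test each prefix of the path in turn. The sketch below is only an illustration: it uses the standard library's `xml.etree.ElementTree` rather than Scrapy so that it runs without extra packages, `SAMPLE` is a hypothetical, trimmed-down stand-in for the e-mail markup in the question, and `first_failing_step` is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down stand-in for the e-mail markup in the question.
SAMPLE = """
<html>
  <body>
    <table><tr><td>logo</td></tr></table>
    <table>
      <tr>
        <td>
          <table><tr><td>Lead Details</td></tr></table>
          <table>
            <tr>
              <td></td>
              <td><span>From:</span><br /><span>Foo bar</span></td>
            </tr>
          </table>
        </td>
      </tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element


def first_failing_step(root, steps):
    """Return (index, step) for the first step that matches nothing, or None."""
    for i in range(1, len(steps) + 1):
        # Try progressively longer prefixes of the path.
        if not root.findall("./" + "/".join(steps[:i])):
            return i - 1, steps[i - 1]
    return None


with_tbody = ["body", "table[2]", "tbody", "tr", "td",
              "table[2]", "tbody", "tr[1]", "td[2]", "span[2]"]
without_tbody = [step for step in with_tbody if step != "tbody"]

print(first_failing_step(root, with_tbody))     # (2, 'tbody')
print(first_failing_step(root, without_tbody))  # None
print(root.find("./" + "/".join(without_tbody)).text)  # Foo bar
```

The path breaks at the first `tbody` step, which agrees with the observation above: the element exists in the browser's inspector but not in the markup being parsed.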

    Actually, as of 11/05/2016 you should be using Selector (see the patch notes).

    Try this:

    name = response.xpath('/html/body/table[2]/tr/td/table[2]/tr[1]/td[2]/span[2]/text()').extract_first()
    
    print(name)
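A purely positional path like the one above stops working as soon as the e-mail template gains or loses a table. As an alternative, the value can be looked up by its label. The sketch below is a minimal illustration under stated assumptions: it again uses the standard library's `xml.etree.ElementTree` instead of Scrapy, `ROWS` is a hypothetical fragment shaped like the "Lead Details" rows in the question, and `lead_fields` is a helper name invented here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment shaped like the "Lead Details" rows in the question.
ROWS = """
<table>
  <tr><td></td><td><span>From:</span><br /><span>Foo bar</span></td></tr>
  <tr><td></td><td><span>Date:</span><br /><span>5/21/2016 3:24:10 AM</span></td></tr>
  <tr><td></td><td><span>Regarding:</span><br /><span>My street and your street</span></td></tr>
</table>
"""


def lead_fields(table):
    """Map each label (without its trailing colon) to the text of the span after it."""
    fields = {}
    for cell in table.iter("td"):
        spans = cell.findall("span")
        # A data cell holds a label span followed by a value span.
        if len(spans) >= 2 and spans[0].text:
            label = spans[0].text.strip().rstrip(":")
            fields[label] = (spans[1].text or "").strip()
    return fields


fields = lead_fields(ET.fromstring(ROWS))
print(fields["From"])  # Foo bar
```

This only reads the direct text of the value span, so rows whose value is wrapped in a link (such as the contact information) would come back empty and need extra handling. With a full XPath 1.0 engine, the same idea could probably be written as `//span[normalize-space()="From:"]/following-sibling::span[1]/text()`, though that expression has not been tested against the message in the question.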
    

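    【Discussion】:

    • Thanks, Rafael. Yes, tbody shows up in the inspector but is not in the HTML. It works. It even works with Selector if I use the XPath you suggested: name = Selector.xpath('/html/body/table[2]/tr/td/table[2]/tr[1]/td[2]/span[2]/text()').extract_first() Why do you suggest using response instead of Selector?
    • Selector is deprecated; it should say so when the crawl starts. If you found my answer helpful, please consider accepting it as the correct answer =) Edit -> This made me realize that I was actually wrong: Selector is the right choice!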