Iterate through all the rows in a table using Python lxml XPath

【Question title】: Iterate through all the rows in a table using python lxml xpath
【Posted】: 2014-11-15 20:03:17
【Question】:

This is the source code of the HTML page I want to extract data from.

Web page: http://gbgfotboll.se/information/?scr=table&ftid=51168 (the table is at the bottom of the page)

     <html>
               <table class="clCommonGrid" cellspacing="0">
                        <thead>
                            <tr>
                                <td colspan="3">Kommande matcher</td>
                            </tr>
                            <tr>
                                <th style="width:1%;">Tid</th>
                                <th style="width:69%;">Match</th>
                                <th style="width:30%;">Arena</th>
                            </tr>
                        </thead>

                        <tbody class="clGrid">

                    <tr class="clTrOdd">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
                        <td><a href="?scr=venue&amp;faid=847">Guldheden Södra 1 Konstgräs</a> </td>
                    </tr>

                    <tr class="clTrEven">
                        <td nowrap="nowrap" class="no-line-through">
                            <span class="matchTid"><span>2014-09-26<!-- br ok --> 13:00</span></span>



                        </td>
                        <td><a href="?scr=result&amp;fmid=2669176">Romelanda UF - IK Virgo</a></td>
                        <td><a href="?scr=venue&amp;faid=941">Romevi 1 Gräs</a> </td>
                    </tr>

                    <tr class="clTrOdd">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
                    <td><a href="?scr=venue&amp;faid=912">Kode IP 1 Gräs</a> </td>
                </tr>

                <tr class="clTrEven">
                    <td nowrap="nowrap" class="no-line-through">
                        <span class="matchTid"><span>2014-09-27<!-- br ok --> 14:00</span></span>



                    </td>
                    <td><a href="?scr=result&amp;fmid=2669147">Floda BoIF - Partille IF FK </a></td>
                    <td><a href="?scr=venue&amp;faid=218">Flodala IP 1</a> </td>
                </tr>


                        </tbody>
                </table>
        </html>

Right now I have this code, which actually produces the result I want:

import lxml.html

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
html = lxml.html.parse(url)
# Hard-coded row count: this is the fragile part the question is about.
for i in range(12):
    xpath1 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[1]/span/span//text()" % (i + 1)
    xpath2 = ".//*[@id='content-primary']/table[3]/tbody/tr[%d]/td[2]/a/text()" % (i + 1)
    # The date cell holds two text nodes: the date, then the time.
    time = html.xpath(xpath1)[1]
    date = html.xpath(xpath1)[0]
    teamName = html.xpath(xpath2)[0]
    if date == '2014-09-27':
        print(time, teamName)

This gives the result:

13:00 Romelanda UF - IK Virgo

13:00 Kode IF - IK Kongahälla

14:00 Floda BoIF - Partille IF FK

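Now to the question. I don't want to use a for loop with a range because it is not stable: the rows in that table can change, and if the loop goes out of range it will crash. So my question is how to iterate in a safe way, meaning it iterates through all the rows available in the table, no more and no less. Also, if you have any other suggestions for making the code better or faster, please go ahead.

【Question comments】:

Tags: python xpath web-scraping html-table lxml

【Solution 1】:

The following code iterates over any number of rows. rows_xpath filters directly on the target date, so only matching rows are returned. The XPath expressions are also compiled once, outside the for loop, so it should be faster.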

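import lxml.html
from lxml.etree import XPath

url = "http://gbgfotboll.se/information/?scr=table&ftid=51168"
date = '2014-09-27'

# Compile the XPath expressions once, outside the loop.
# rows_xpath selects only the rows whose date cell matches the target date.
rows_xpath = XPath("//*[@id='content-primary']/table[3]/tbody/tr[td[1]/span/span//text()='%s']" % (date))
# The time is the second text node in the cell (the one after the HTML comment).
time_xpath = XPath("td[1]/span/span//text()[2]")
team_xpath = XPath("td[2]/a/text()")

html = lxml.html.parse(url)

# Iterates over however many matching rows exist; no fixed range is needed.
for row in rows_xpath(html):
    time = time_xpath(row)[0].strip()
    team = team_xpath(row)[0]
    print(time, team)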
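
To try the same pattern without hitting the live site, here is a minimal sketch that runs against an inline snippet. Everything in it beyond the three XPath patterns is an assumption made for illustration: SAMPLE is a trimmed, hypothetical copy of the table from the question, matches_on is a hypothetical helper name, the table is selected by its class (the snippet contains only one table, so table[3] would not apply), and the date is passed as an XPath variable ($date) instead of being formatted into the expression string.

```python
import lxml.html
from lxml.etree import XPath

# Trimmed, hypothetical copy of the table from the question (two rows only).
SAMPLE = """
<div id="content-primary">
  <table class="clCommonGrid">
    <tbody class="clGrid">
      <tr class="clTrOdd">
        <td><span class="matchTid"><span>2014-09-26<!-- br ok --> 19:30</span></span></td>
        <td><a href="?scr=result&amp;fmid=2669197">Guldhedens IK - IF Warta</a></td>
      </tr>
      <tr class="clTrEven">
        <td><span class="matchTid"><span>2014-09-27<!-- br ok --> 13:00</span></span></td>
        <td><a href="?scr=result&amp;fmid=2669167">Kode IF - IK Kongahälla</a></td>
      </tr>
    </tbody>
  </table>
</div>
"""

# Compiled once and reused; $date is bound on each call.
rows_xpath = XPath(".//table[@class='clCommonGrid']/tbody/tr[td[1]/span/span/text()=$date]")
time_xpath = XPath("td[1]/span/span/text()[2]")  # text node after the comment
team_xpath = XPath("td[2]/a/text()")


def matches_on(root, date):
    """Return (time, teams) pairs for every row whose date cell equals `date`."""
    return [
        (time_xpath(row)[0].strip(), team_xpath(row)[0].strip())
        for row in rows_xpath(root, date=date)
    ]


root = lxml.html.fromstring(SAMPLE)
for time, team in matches_on(root, "2014-09-27"):
    print(time, team)
```

Because the loop runs over whatever rows_xpath returns, a date with no matches simply yields an empty list and the loop body never runs, so there is no index that can go out of range.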

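【Comments】:

I love you.... Thank you for the wonderful code. How does this work? Since I'm new to Stack Overflow: do I answer my own question with the code you gave me, or is it enough that you wrote this answer, so that the question gets closed? Also, what does .strip() do here? I tried running it without it and got the same result. Thanks again! @George Martin
:-) You should see a check mark to the left of my answer. If my answer works for you, click it...
OK, great :). One quick question: what does the .strip() method do here? I tried running it without it and got the same result. @George Martin
There is whitespace before the time text. " 13:00".strip() simply returns "13:00".
Thanks again :D!! @George Martin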
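
The .strip() point from the comments can be checked in isolation. This is a small sketch, assuming the time text node looks like " 13:00" as described above; raw_time and clean are hypothetical names. Using repr makes the leading space visible, which plain printed output tends to hide.

```python
# The text node after the HTML comment starts with a space.
raw_time = " 13:00"

# str.strip() removes leading and trailing whitespace.
clean = raw_time.strip()

# repr() shows the quotes, so the leading space is easy to see.
print(repr(raw_time), "->", repr(clean))
```

The only difference is that one leading space, which is probably why the output looked the same with and without .strip() when printed to a terminal.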