How can I extract table values using Python and lxml?

【Question title】: How can I extract table values using Python and lxml?
【Posted】: 2017-03-14 01:28:22
【Question】:

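I need to extract the list of IP addresses and port numbers, along with the other information, from the HTML table below. I am currently using Python 2.7 with lxml, but I don't know how to find the correct paths to these elements.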

Here is the address of the table: link to table

【Comments on the question】:

Does this table have multiple rows?

Tags: python-2.7 web-scraping lxml

【Solution 1】:

Assuming the table has multiple rows, you can locate each row and then extract the data from it.

import lxml.etree

doc = lxml.etree.parse('test.xml')

# We need to locate the <tr> objects somehow... I'm assuming
# there is a single <table><tbody>.. container and no other
# span/div tags in the way.

for tr in doc.xpath('//table/tbody[1]/tr'):
    proxy_ip = tr.xpath('td[@ng-bind="proxy.IP"]/text()')[0].strip()
    proxy_port = tr.xpath('td[@ng-bind="proxy.PORT"]/text()')[0].strip()
    proxy_country = tr.xpath('td[@ng-bind="proxy.country"]/text()')[0].strip()
    print(proxy_ip, proxy_port, proxy_country)
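
The loop above indexes `[0]` on every XPath result, so a row that lacks one of the cells raises an IndexError. Below is a minimal sketch of a more forgiving variant. The sample markup and the helper names `cell_text` and `extract_rows` are made up for illustration, and the real page's structure may differ. It selects rows with `//table//tr[td]` rather than `//table/tbody[1]/tr`, since a `<tbody>` shown by browser developer tools is not always present in the markup the parser sees.

```python
from lxml import html

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE = """
<html><body>
<table>
  <tr><th>#</th><th>IP</th><th>Port</th><th>Country</th></tr>
  <tr class="ng-scope">
    <td>1</td>
    <td ng-bind="proxy.IP"> 10.0.0.1 </td>
    <td ng-bind="proxy.PORT">8080</td>
    <td ng-bind="proxy.country">Testland</td>
  </tr>
  <tr class="ng-scope">
    <td>2</td>
    <td ng-bind="proxy.IP">10.0.0.2</td>
    <td ng-bind="proxy.PORT">3128</td>
  </tr>
</table>
</body></html>
"""


def cell_text(tr, binding):
    """Return the stripped text of the <td> whose ng-bind equals `binding`, or None."""
    # $name is an XPath variable; lxml fills it from the keyword argument.
    values = tr.xpath('td[@ng-bind=$name]/text()', name=binding)
    return values[0].strip() if values else None


def extract_rows(doc):
    """Collect (ip, port, country) for every row that contains <td> cells."""
    rows = []
    for tr in doc.xpath('//table//tr[td]'):  # skips the <th>-only header row
        rows.append((cell_text(tr, 'proxy.IP'),
                     cell_text(tr, 'proxy.PORT'),
                     cell_text(tr, 'proxy.country')))
    return rows


rows = extract_rows(html.fromstring(SAMPLE))
for row in rows:
    print(row)
```

A missing cell shows up as `None` in the tuple instead of stopping the loop, which makes it easier to see which rows are incomplete.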

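【Discussion】:

Unfortunately this does not output any values. I am investigating why that is. How did you arrive at the XPath above? This is the page I am trying to scrape: "hidester.com/proxylist"
I built a valid XML document based on your example, and the code does work against it. But I had to guess where the elements are. Perhaps you could post a more complete document... and the link just leads to a 404 Not Found page. You may want to trim most of the content down to something small but representative.
Thanks, here is the updated link to the page: https://hidester.com/proxylist/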

【Solution 2】:

BeautifulSoup can help here.

from bs4 import BeautifulSoup
import requests
import re
header = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                    'AppleWebKit/537.36 (KHTML, like Gecko) '
                    'Chrome/51.0.2704.103 Safari/537.36'}

url = str(raw_input("Enter URL: "))

# The custom User-Agent header is only needed if the site rejects the
# default one; otherwise a plain requests.get(url) is enough.
# The mechanize module is an alternative to requests for fetching the page.
req = requests.get(url, headers=header)
html = req.text

soup = BeautifulSoup(html,'html.parser')

for ip in soup.findAll("td", {"ng-bind": "proxy.IP"}):
    print "IP:      ", ip.get_text()

for ip_p in soup.findAll("td", {"ng-bind":"proxy.PORT"}):
    print "PORT:    ", ip_p.get_text()

for ip_c in soup.findAll("td", {"ng-bind":"proxy.country"}):
    print "COUNTRY: ", ip_c.get_text()
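
The three loops above print all IPs, then all ports, then all countries, so the values belonging to one proxy are never shown together. A minimal sketch of a per-row variant follows. The sample markup and the helper name `bound_text` are assumptions made for illustration; the code runs on an inline string rather than a downloaded page.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE = """
<table>
  <tr><th>IP</th><th>Port</th><th>Country</th></tr>
  <tr><td ng-bind="proxy.IP">10.0.0.1</td>
      <td ng-bind="proxy.PORT">8080</td>
      <td ng-bind="proxy.country">Testland</td></tr>
  <tr><td ng-bind="proxy.IP">10.0.0.2</td>
      <td ng-bind="proxy.PORT">3128</td>
      <td ng-bind="proxy.country">Examplia</td></tr>
</table>
"""

soup = BeautifulSoup(SAMPLE, 'html.parser')


def bound_text(tr, binding):
    """Return the text of the <td> in this row whose ng-bind equals `binding`, or None."""
    td = tr.find('td', attrs={'ng-bind': binding})
    return td.get_text(strip=True) if td is not None else None


proxies = []
for tr in soup.find_all('tr'):
    ip = bound_text(tr, 'proxy.IP')
    if ip is None:
        continue  # header row or any row without an IP cell
    proxies.append((ip, bound_text(tr, 'proxy.PORT'), bound_text(tr, 'proxy.country')))

for proxy in proxies:
    print(proxy)
```

Walking the `<tr>` elements first keeps each IP paired with the port and country from the same row.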

【Discussion】:

【Solution 3】:

If the proxy.IP, proxy.PORT and proxy.country values sit at the same cell position [n] in every row, you can select them by the position td[n] within each tr row:

from lxml import html

webpage = html.parse('lxml_test.html')

ip = webpage.xpath('//tr[@class="ng-scope"]/td[2]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[3]/text()')
proxy = webpage.xpath('//tr[@class="ng-scope"]/td[4]/text()')

Or, if you would rather target the cells by name:

ip = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.IP"]/text()')
port = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.PORT"]/text()')
proxy = webpage.xpath('//tr[@class="ng-scope"]/td[@ng-bind="proxy.country"]/text()')
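
Each of these queries returns one flat list per column, so the lists still have to be lined up to get one record per proxy. The following is a minimal sketch of that step, assuming every row has all three cells filled in; the sample markup and the `records` name are made up for illustration. Note that `text()` yields nothing for an empty cell, so the sketch checks the list lengths before zipping to avoid silently misaligned rows.

```python
from lxml import html

# Hypothetical sample markup (an assumption, not the real page).
SAMPLE = """
<html><body><table>
  <tr class="ng-scope">
    <td>1</td>
    <td ng-bind="proxy.IP">10.0.0.1</td>
    <td ng-bind="proxy.PORT">8080</td>
    <td ng-bind="proxy.country">Testland</td>
  </tr>
  <tr class="ng-scope">
    <td>2</td>
    <td ng-bind="proxy.IP">10.0.0.2</td>
    <td ng-bind="proxy.PORT">3128</td>
    <td ng-bind="proxy.country">Examplia</td>
  </tr>
</table></body></html>
"""

webpage = html.fromstring(SAMPLE)
row = '//tr[@class="ng-scope"]'  # matches only when class is exactly "ng-scope"

ips = webpage.xpath(row + '/td[@ng-bind="proxy.IP"]/text()')
ports = webpage.xpath(row + '/td[@ng-bind="proxy.PORT"]/text()')
countries = webpage.xpath(row + '/td[@ng-bind="proxy.country"]/text()')

# Unequal lengths mean at least one cell was empty or missing,
# in which case zipping would pair values from different rows.
if not (len(ips) == len(ports) == len(countries)):
    raise ValueError('column lists differ in length; extract row by row instead')

records = [(ip.strip(), int(port), country.strip())
           for ip, port, country in zip(ips, ports, countries)]

for record in records:
    print(record)
```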

Edit: to get the HTML from the web page, use the requests module:

import requests
page = requests.get('https://hidester.com/proxylist/')
webpage = html.fromstring(page.content)
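
If the XPath queries come back empty on the downloaded page, one possible cause (an assumption here, not something confirmed for this site) is that the `ng-bind` cells are filled in by JavaScript in the browser, so the HTML that requests receives contains only an empty template. A minimal sketch of a check for this follows. The helper name `summarize_bindings` and both sample documents are hypothetical; in practice `page.content` from the snippet above would be passed in.

```python
from lxml import html


def summarize_bindings(content):
    """Return (number of <td ng-bind=...> cells, number of those with non-blank text)."""
    tree = html.fromstring(content)
    cells = tree.xpath('//td[@ng-bind]')
    filled = [td for td in cells if (td.text or '').strip()]
    return len(cells), len(filled)


# Hypothetical unrendered template, as a server might send it.
TEMPLATE = (b'<html><body><table><tr ng-repeat="proxy in proxies">'
            b'<td ng-bind="proxy.IP"></td><td ng-bind="proxy.PORT"></td>'
            b'</tr></table></body></html>')

# Hypothetical markup after the browser has filled in the values.
RENDERED = (b'<html><body><table><tr class="ng-scope">'
            b'<td ng-bind="proxy.IP">10.0.0.1</td><td ng-bind="proxy.PORT">8080</td>'
            b'</tr></table></body></html>')

print(summarize_bindings(TEMPLATE))
print(summarize_bindings(RENDERED))
```

Cells that exist but carry no text suggest that the data arrives separately from the HTML, in which case no XPath against the raw response will find it.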

【Discussion】: