【Question Title】: HTML Table Specific Row Scraping
【Posted】: 2018-08-26 01:12:55
【Question】:

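I want to scrape data from specific rows of this table. I only want the orange/gold rows. Previously, I used this code, provided by SIM, to scrape the whole table, and then manipulated the result: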

from selenium.webdriver import Chrome
from contextlib import closing
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

URL = "https://www.n2yo.com/passes/?s=39090&a=1"

chrome_options = Options()  
chrome_options.add_argument("--headless")

with closing(Chrome(options=chrome_options)) as driver:  # newer Selenium releases accept options=, not chrome_options=
    driver.get(URL)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("#passestable tr"):
        data = [item.text for item in items.select("th,td")]
        print(data)

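I am not sure how to change this code to get only the orange/gold rows. I tried searching for the color code as a tag while parsing, but that did not work. Any and all suggestions are appreciated.

Thank you for your time.

【Comments】:

    Tags: python python-3.x selenium selenium-webdriver lxml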


    【Solution 1】:

    You can use a regular expression to match the `bgcolor` attribute of each row:

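    from selenium import webdriver
    from bs4 import BeautifulSoup as soup
    import re

    d = webdriver.Chrome()  # start a Chrome session
    d.get("https://www.n2yo.com/passes/?s=39090&a=1")
    s = soup(d.page_source, 'lxml')
    d.quit()  # close the browser once the rendered HTML has been captured
    # The pattern also contains '#FFFFFF' (white); drop that alternative
    # to keep only the '#FFFF33' and '#FFCC00' rows.
    data = [i.text for i in s.find_all('tr', {'bgcolor':re.compile('#FFFFFF|#FFFF33|#FFCC00')})]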
    

    Output:

    [u'16-Mar 20:34N12\xb020:42W265\xb079\xb020:48SSW199\xb0-Map and details', u'17-Mar 07:51S178\xb007:58W260\xb052\xb008:05NNW341\xb0-Map and details', u'17-Mar 20:00NNE19\xb020:08E102\xb050\xb020:14S180\xb0-Map and details', u'18-Mar 07:17SSE160\xb007:24E83\xb077\xb007:31N349\xb0-Map and details', u'18-Mar 08:58SW217\xb009:04W269\xb013\xb009:09NW323\xb0-Map and details', u'18-Mar 21:06N6\xb021:13WNW295\xb041\xb021:19SW217\xb0-Map and details', u'19-Mar 06:43SE142\xb006:50ENE67\xb038\xb006:57N356\xb0-Map and details', u'19-Mar 08:23SSW196\xb008:30W268\xb027\xb008:36NNW333\xb0-Map and details', u'19-Mar 20:32N12\xb020:39WNW286\xb084\xb020:46SSW198\xb0-Map and details', u'20-Mar 07:48S177\xb007:55WSW254\xb055\xb008:02NNW342\xb0-Map and details', u'20-Mar 19:58NNE20\xb020:05E98\xb047\xb020:12S178\xb0-Map and details', u'21-Mar 07:14SSE159\xb007:22NE58\xb072\xb007:28N349\xb0-Map and details', u'21-Mar 08:55SW216\xb009:01W272\xb014\xb009:07NW325\xb0-Map and details', u'21-Mar 21:03N6\xb021:10WNW288\xb043\xb021:17SW215\xb0-Map and details', u'22-Mar 06:41SE141\xb006:48ENE70\xb036\xb006:54N356\xb0-Map and details', u'22-Mar 08:20S194\xb008:27W265\xb029\xb008:34NNW335\xb0-Map and details', u'22-Mar 20:29N13\xb020:36N348\xb086\xb020:43SSW196\xb0-Map and details', u'23-Mar 07:46S176\xb007:53W265\xb059\xb008:00NNW343\xb0-Map and details', u'23-Mar 19:55NNE20\xb020:02E94\xb045\xb020:09S177\xb0-Map and details', u'24-Mar 07:12SSE157\xb007:19ENE71\xb069\xb007:26N350\xb0-Map and details', u'24-Mar 08:53SW214\xb008:59W270\xb015\xb009:04NW325\xb0-Map and details', u'24-Mar 21:01N7\xb021:08WNW292\xb046\xb021:14SW214\xb0-Map and details', u'25-Mar 06:38SE139\xb006:45ENE65\xb034\xb006:52N357\xb0-Map and details', u'25-Mar 08:18S193\xb008:24W263\xb030\xb008:31NNW335\xb0-Map and details', u'25-Mar 18:49NE39\xb018:54E87\xb010\xb018:59SE134\xb0-Map and details', u'25-Mar 20:27N13\xb020:34SSE161\xb086\xb020:41S195\xb0-Map and details']
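The output above joins all cells of a row into a single string, which is awkward to process further. A minimal sketch of keeping the cells separate follows, assuming the highlighted rows carry `bgcolor="#FFFF33"` or `bgcolor="#FFCC00"` as the answers here suggest. `SAMPLE_HTML` and `highlighted_rows` are hypothetical names made up for illustration; the sample markup only imitates the real table and stands in for `d.page_source`.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical stand-in for d.page_source; the real table has more columns.
SAMPLE_HTML = """
<table id="passestable">
  <tr><th>Date</th><th>Start</th><th>Az</th></tr>
  <tr bgcolor="#FFFFFF"><td>16-Mar</td><td>20:34</td><td>N12</td></tr>
  <tr bgcolor="#FFCC00"><td>17-Mar</td><td>07:51</td><td>S178</td></tr>
  <tr bgcolor="#FFFF33"><td>17-Mar</td><td>20:00</td><td>NNE19</td></tr>
</table>
"""

# Match only the two highlight colors, ignoring letter case.
HIGHLIGHT = re.compile(r"^#(FFFF33|FFCC00)$", re.IGNORECASE)


def highlighted_rows(html):
    """Return one list of cell texts per highlighted <tr>."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr", {"bgcolor": HIGHLIGHT})
    return [
        [cell.get_text(strip=True) for cell in row.find_all(["th", "td"])]
        for row in rows
    ]


rows = highlighted_rows(SAMPLE_HTML)
print(rows)
```

With a live browser session, the same function could be called as `highlighted_rows(d.page_source)`.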
    

    【Comments】:

      【Solution 2】:

      Try replacing this line

      for items in soup.select("#passestable tr"):
      

      with this one

      for items in soup.select("#passestable tr[bgcolor='#FFCC00'], #passestable tr[bgcolor='#FFFF33']"):
      

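      to iterate over only the tr nodes that have the required colors.

      Note that this returns all the orange nodes first and only then all the gold nodes. That grouping reflects the BeautifulSoup release in use at the time; releases that delegate CSS selection to soupsieve may return the matches in document order instead.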
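If the rows are needed in the order they appear on the page, one option is to select every row that has a `bgcolor` attribute and filter by color in Python. The sketch below assumes the same two color codes; `SAMPLE_HTML`, `WANTED`, and `rows_in_page_order` are hypothetical names, and the sample markup is invented to stand in for `driver.page_source`.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for driver.page_source.
SAMPLE_HTML = """
<table id="passestable">
  <tr><th>Date</th><th>Start</th></tr>
  <tr bgcolor="#FFFF33"><td>16-Mar</td><td>20:34</td></tr>
  <tr bgcolor="#FFCC00"><td>17-Mar</td><td>07:51</td></tr>
  <tr bgcolor="#FFFFFF"><td>17-Mar</td><td>20:00</td></tr>
  <tr bgcolor="#FFFF33"><td>18-Mar</td><td>07:17</td></tr>
</table>
"""

# Colors to keep, lower-cased so the comparison ignores letter case.
WANTED = {"#ffcc00", "#ffff33"}


def rows_in_page_order(html):
    """Collect the cells of orange/gold rows, walking rows top to bottom."""
    soup = BeautifulSoup(html, "html.parser")
    result = []
    for tr in soup.select("#passestable tr[bgcolor]"):
        if tr["bgcolor"].lower() in WANTED:
            cells = tr.find_all(["th", "td"])
            result.append([cell.get_text(strip=True) for cell in cells])
    return result


ordered = rows_in_page_order(SAMPLE_HTML)
print(ordered)
```

Because only a single selector is used, the order of the result does not depend on how a comma-separated selector list is evaluated.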

      【Comments】:

        【Solution 3】:

        You could try another approach that does not use selenium:

        from lxml.html import fromstring
        import requests

        URL = "https://www.n2yo.com/passes/?s=39090&a=1"  # same URL as in the question

        r = requests.get(URL)
        html = fromstring(r.content.decode('utf-8'))
        # only orange and yellow rows
        rows = html.xpath('//tr[@bgcolor="#FFFF33" or @bgcolor="#FFCC00"]')
        

        【Comments】:

        • This is dynamic data; a direct GET request to the URL mentioned will not return it.
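Given that comment, the XPath filter can still be reused if the HTML comes from a rendered browser session rather than a plain GET. The following is a sketch under that assumption, not a tested recipe for this site: `extract_rows`, `fetch_rendered_html`, and `SAMPLE` are hypothetical names, and the sample markup is invented so the parsing step can be tried without a browser.

```python
from lxml.html import fromstring

# XPath taken from the answer above: rows colored gold or orange.
ROW_XPATH = '//tr[@bgcolor="#FFFF33" or @bgcolor="#FFCC00"]'


def extract_rows(page_source):
    """Parse HTML text and return the cell texts of each highlighted row."""
    tree = fromstring(page_source)
    return [
        [cell.text_content().strip() for cell in tr.xpath("./th|./td")]
        for tr in tree.xpath(ROW_XPATH)
    ]


def fetch_rendered_html(url):
    """Load the page in headless Chrome and return the rendered HTML.

    Selenium is imported here so the parsing part works without it.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    options = Options()
    options.add_argument("--headless")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()


# Hypothetical sample standing in for the rendered page.
SAMPLE = """
<table id="passestable">
  <tr bgcolor="#FFFFFF"><td>16-Mar</td><td>20:34</td></tr>
  <tr bgcolor="#FFCC00"><td>17-Mar</td><td>07:51</td></tr>
  <tr bgcolor="#FFFF33"><td>17-Mar</td><td>20:00</td></tr>
</table>
"""

sample_rows = extract_rows(SAMPLE)
print(sample_rows)
```

For the live page the two steps would be chained as `extract_rows(fetch_rendered_html(URL))`, with `URL` as defined in the question.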