【Question title】: Scraping web page using BeautifulSoup Python
【Posted】: 2014-08-16 04:35:06
【Question】:

I am trying to scrape data from a table using BeautifulSoup. The problem is that I get output like [u'A Southern RV, Inc.1642 E New York AveDeland, FLPhone: (386) 734-5678Website: www.southernrvrentals.comEmail: mysouthernrv@yahoo.com\xa0\n'] from a table whose rows look like this:
<table id="ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable" border="0">
                            <tbody><tr style="background-color:#990000;">
                                <th align="left" colspan="3" style="margin-top:5px;margin-bottom:5px;"><span id="ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RSCount" style="color:White;">Your search results returned (85) records </span></th>
                            </tr><tr>
                                <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/AfterMarket2.gif" alt="After Market Member Logo" border="0"> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">A Southern RV, Inc.</span><br>1642 E New York Ave<br>Deland, FL<br>Phone: (386) 734-5678<br>Website: <a href="http://www.southernrvrentals.com/" target="_blank">www.southernrvrentals.com</a><br>Email: <a href="mailto:mysouthernrv@yahoo.com" target="_blank">mysouthernrv@yahoo.com</a></td><td class="ml15" align="left" valign="top">&nbsp;</td>
                            </tr><tr>
                                <td colspan="3"><hr></td>
                            </tr><tr>
                                <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/AfterMarket2.gif" alt="After Market Member Logo" border="0"> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">Alec's Truck Trailer &amp; RV</span><br>16960 S Dixie Hwy<br>Miami, FL<br>Phone: (305) 234-5444<br>Website: <a href="http://www.alecstruck.com/" target="_blank">www.alecstruck.com</a><br>Email: <a href="mailto:austins123@bellsouth.net" target="_blank">austins123@bellsouth.net</a></td><td class="ml15" align="left" valign="top">&nbsp;</td>
                            </tr><tr>
                                <td colspan="3"><hr></td>
                            </tr><tr>
                                <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/RVRAMember2.gif" alt="RVRA Member Logo" border="0"><br>  <img src="./RVDealers-Florida_files/GoRVDealer2.gif" alt="Go RV Dealer Logo" border="0"><br> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">All Star Coaches</span><br>131 NW 73rd Terraces, Bay 1117<br>Fort Lauderdale, FL<br>Phone: (866) 838-4465<br>Website: <a href="http://www.allstarcoaches.com/" target="_blank">www.allstarcoaches.com</a><br>Email: <a href="mailto:info@allstarcoaches.com" target="_blank">info@allstarcoaches.com</a></td><td class="ml15" align="left" valign="top">&nbsp;</td>
                            </tr><tr>
                                <td colspan="3"><hr></td>
                            </tr><tr>
                                <td class="ml15" align="left" valign="top"><img src="./RVDealers-Florida_files/RVDAMember2.gif" alt="RVDA Member Logo" border="0"><br>  <img src="./RVDealers-Florida_files/GoRVDealer2.gif" alt="Go RV Dealer Logo" border="0"><br> </td><td class="ml15" align="left" valign="top"><span style="font-weight:bold;">Alliance Coach</span><br>4505 Monaco Way<br>Wildwood, FL<br>Phone: (866) 888-8941<br>Website: <a href="http://www.alliancecoachonline.com/" target="_blank">www.alliancecoachonline.com</a><br>Email: <a href="mailto:ashap1@aol.com" target="_blank">ashap1@aol.com</a></td><td class="ml15" align="left" valign="top"><table width="100%" border="0" cellpadding="0" cellspacing="5"><tbody><tr><td valign="top" width="75" align="left"><img src="./RVDealers-Florida_files/Cert_web.jpg" height="75" width="75" alt="Certified RV Technician" border="0"></td> <td valign="top" style="font-size:8px;font-weight:bold;" align="left" nowrap=""><img src="./RVDealers-Florida_files/RVLCenter_web.jpg" height="33" width="93" alt="RV Learning Center Certifications" border="0"><br>&nbsp;Certifications:<ul><li style="font-size:7px;">&nbsp;Service Writer/Advisor</li><li style="font-size:7px;">&nbsp;Parts Specialist</li><li style="font-size:7px;">&nbsp;Parts Manager</li><li style="font-size:7px;">&nbsp;Warranty Administrator</li></ul></td></tr></tbody></table></td>
                            </tr><tr>
                                <td colspan="3"><hr></td>

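The problem is that when I scrape the data, everything is squashed into one long string with no spaces or line breaks between the fields. How can I fix this? I am using this code to extract the text from the table: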

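from mechanize import Browser
from bs4 import BeautifulSoup  # assumed import; the original post does not show which version is used

def extract(soup):
    table = soup.find("table", attrs={'id': 'ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable'})
    data = []
    # getText() joins all text in the row with no separator, so <br> breaks are lost
    for row in table.findAll("tr"):
        s = row.getText()
        data.append(s)
    return data

mech = Browser()
page = mech.open(BASE_URL_DIRECTORY)  # BASE_URL_DIRECTORY is defined elsewhere in the asker's script
html = page.read()
soup = BeautifulSoup(html)
data = extract(soup)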

【Comments】:

    Tags: python html web-scraping html-parsing beautifulsoup


    【Solution 1】:

    You can use replace_with() to replace each br tag with a newline character before extracting the text:

    def extract(soup):
        table = soup.find("table", attrs={'id':'ctl00_TemplateBody_WebPartManager1_gwpste_container_SearchForm_ciSearchForm_RTable'})
        for br in table.find_all('br'):
            br.replace_with('\n')
        return table.get_text().strip()
    

    For the HTML input you provided, it prints:

    A Southern RV, Inc.
    
    1642 E New York Ave
    Deland, FL
    Phone: (386) 734-5678
    Website: www.southernrvrentals.com
    Email: mysouthernrv@yahoo.com
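
To see why the replacement helps, here is a minimal self-contained sketch. It assumes the `bs4` package with Python's built-in `html.parser`; the sample string and the shortened table id `results` are hypothetical stand-ins for the real page. A `<br>` element contributes no text of its own, so `get_text()` runs the neighbouring fields together until each `<br>` is swapped for a newline.

```python
from bs4 import BeautifulSoup

# Hypothetical one-cell sample shaped like the rows in the question.
SAMPLE = (
    '<table id="results"><tr><td><span>A Southern RV, Inc.</span><br>'
    '1642 E New York Ave<br>Deland, FL<br>Phone: (386) 734-5678</td></tr></table>'
)

soup = BeautifulSoup(SAMPLE, "html.parser")
table = soup.find("table", attrs={"id": "results"})

# Before: <br> produces no text, so adjacent fields are concatenated.
squashed = table.get_text()

# After: each <br> becomes a newline string in the tree.
for br in table.find_all("br"):
    br.replace_with("\n")

lines = table.get_text().strip().splitlines()

print(squashed)
print(lines)
```

Splitting on newlines afterwards gives one list entry per field, which is usually easier to work with than a single multi-line string.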
    

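    【Discussion】:

    • I tried your solution, but it only produced the name (A Southern RV, Inc. in this example). I have included a more complete sample of the HTML I am working with; I would appreciate it if you could take a look.
    • @Apollo I tried it with the sample you provided, and it does display the results nicely with line breaks. Could you clarify what the problem is now? Thanks.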
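
The fuller HTML in the question also contains a header row, `<hr>` separator rows, and a nested certifications table. One possible way to get a separate record per dealer is to apply the same br replacement row by row. This is only a sketch under assumptions: the sample markup is a shortened, hypothetical version of the page, the table id `results` and the function name `extract_records` are made up for illustration, and it assumes the details always sit in the second `td` with class `ml15`.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened sample with the same row shapes as the question.
SAMPLE = """
<table id="results"><tbody>
<tr><th colspan="3">Your search results returned (2) records</th></tr>
<tr><td class="ml15"><img alt="logo"></td>
    <td class="ml15"><span>A Southern RV, Inc.</span><br>1642 E New York Ave<br>Deland, FL<br>Website: <a href="http://www.southernrvrentals.com/">www.southernrvrentals.com</a></td>
    <td class="ml15">&nbsp;</td></tr>
<tr><td colspan="3"><hr></td></tr>
<tr><td class="ml15"><img alt="logo"></td>
    <td class="ml15"><span>Alliance Coach</span><br>4505 Monaco Way<br>Wildwood, FL</td>
    <td class="ml15"><table><tbody><tr><td>Certifications:</td></tr></tbody></table></td></tr>
</tbody></table>
"""

def extract_records(html, table_id):
    """Return one list of text lines per dealer row."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", attrs={"id": table_id})
    records = []
    for row in table.find_all("tr"):
        # Only direct child cells, so rows of the nested table are ignored.
        cells = row.find_all("td", class_="ml15", recursive=False)
        if len(cells) < 2:
            continue  # header row, <hr> separator row, or nested-table row
        details = cells[1]
        for br in details.find_all("br"):
            br.replace_with("\n")
        lines = [ln.strip() for ln in details.get_text().splitlines() if ln.strip()]
        records.append(lines)
    return records

records = extract_records(SAMPLE, "results")
for record in records:
    print(record)
```

Working per cell rather than on the whole table keeps each dealer's lines together and leaves the logo and certification columns out of the result.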