【Question title】: Struggling to get a clean Excel file with Beautiful Soup
【Posted】: 2021-11-03 21:46:00
【Question description】:

I am trying to scrape a shop's opening hours from a website, but the result I get is very disappointing.

import requests
from bs4 import BeautifulSoup
import xlsxwriter

i = "90460"

URL = "https://www.tuodi.it/negozi-dettaglio.cfm?negozio=%s" % i
page = requests.get(URL)

soup = BeautifulSoup(page.content, "html.parser")

results = soup.find(id="orario" , style="width:50%;float:left")
orari = results.find_all("div", class_="tab", style="width:220px;line-height: 25px")

print(orari)

My output looks like this:

[<div class="tab" style="width:220px;line-height: 25px">
                            8,30 
                            - 20,00 
                            <br/>
                            
                            
                            
                            8,30 
                            - 20,00 
                            <br/>...

But I would rather have a result that I can export to Excel:

Excel result

Thanks in advance!

【Comments】:

    Tags: excel web-scraping beautifulsoup


    【Solution 1】:

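    To get the result, you can use .stripped_strings together with a list comprehension. .stripped_strings yields each text fragment with its surrounding whitespace removed, and ''.join(x.split()) removes the whitespace and line breaks left inside each fragment: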

    [''.join(x.split()) for x in orari[0].stripped_strings]
    

    This gives you a list that you can write to a file:

    ['8,30-20,00', '8,30-20,00', '8,30-20,00', '8,30-20,00', '8,30-20,00', '8,30-20,00', '8,00-13,00']
    
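To see why this works without hitting the site, here is a minimal sketch that runs the same expression on an inline HTML string. The markup is a shortened, hypothetical copy of the output printed in the question, not the real page, and the names `html`, `raw`, and `hours` are only for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample that mirrors the markup printed in the question.
html = """
<div class="tab" style="width:220px;line-height: 25px">
    8,30
    - 20,00
    <br/>

    8,00
    - 13,00
    <br/>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="tab")

# .stripped_strings yields every text node with leading and trailing
# whitespace removed, and skips nodes that contain only whitespace.
raw = list(div.stripped_strings)

# Each fragment still has a line break between the opening and closing time.
# str.split() with no arguments splits on any run of whitespace, so joining
# the pieces back together removes it.
hours = ["".join(fragment.split()) for fragment in raw]

print(raw)
print(hours)
```

The `<br/>` tags are what separate the text nodes, so each day ends up as its own fragment.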

    Example

    import requests
    from bs4 import BeautifulSoup
    import pandas as pd
    i = "90460"
    
    URL = "https://www.tuodi.it/negozi-dettaglio.cfm?negozio=%s" % i
    page = requests.get(URL)
    
    soup = BeautifulSoup(page.content, "html.parser")
    
    results = soup.find(id="orario" , style="width:50%;float:left")
    orari = results.find_all("div", class_="tab", style="width:220px;line-height: 25px")
    
    data = [''.join(x.split()) for x in orari[0].stripped_strings]
    
    pd.DataFrame([data]).to_excel('test.xlsx', index=False)
    
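If you would rather have one row per day than a single wide row, the list can be reshaped before writing. This is only a sketch under assumptions that the question does not confirm: that the seven entries run Monday through Sunday, and that every entry has the form "open-close". The `DAYS` list, the `build_table` helper, and the column names are hypothetical.

```python
import pandas as pd

# Values copied from the list shown above.
hours = ['8,30-20,00', '8,30-20,00', '8,30-20,00', '8,30-20,00',
         '8,30-20,00', '8,30-20,00', '8,00-13,00']

# Assumption: the page lists the days in this order (not confirmed by the source).
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def build_table(hours, days=DAYS):
    """Return a DataFrame with one row per day and separate open/close columns."""
    if len(hours) != len(days):
        raise ValueError(f"expected {len(days)} entries, got {len(hours)}")
    rows = []
    for day, span in zip(days, hours):
        # partition() splits on the first "-" only; an entry without a "-"
        # would leave the "closes" column empty instead of raising.
        opens, _, closes = span.partition("-")
        rows.append({"day": day, "opens": opens, "closes": closes})
    return pd.DataFrame(rows)


table = build_table(hours)
print(table)
```

Calling `table.to_excel('test.xlsx', index=False)` afterwards writes the table the same way as in the example above; pandas needs an Excel writer engine such as openpyxl or xlsxwriter installed for that step.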
