【Question Title】: Writing the news to a CSV file (Python 3, BeautifulSoup)
【Posted】: 2017-06-18 06:42:20
【Question】:

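I want Python 3.6 to write the output of the following code to a CSV file. Ideally there would be one row per article (from the news website) and four columns: "Title", "URL", "Category" [#Politik etc.], and "PublishedAt".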

from bs4 import BeautifulSoup
import requests

website = 'http://spiegel.de/schlagzeilen'
r = requests.get(website)
soup = BeautifulSoup((r.content), "lxml")

div = soup.find("div", {"class": "schlagzeilen-content schlagzeilen-overview"})

for a in div.find_all('a', title=True):
    print(a.text, a.find_next_sibling('span').text)
    print(a.get('href'))

For writing to the CSV file, I already have this...

import csv
import datetime

with open('%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f'), 'w', newline='',
          encoding='utf-8') as file:
    w = csv.writer(file, delimiter="|")
    w.writerow([...])

...and I need to know what to do next. Thanks in advance!

【Comments】:

    Tags: python python-3.x csv beautifulsoup


    【Solution 1】:

    You can collect all the fields you need to extract into a list of dictionaries and use csv.DictWriter to write them to the CSV file:

    import csv
    import datetime

    from bs4 import BeautifulSoup
    import requests


    website = 'http://spiegel.de/schlagzeilen'
    r = requests.get(website)
    soup = BeautifulSoup(r.content, "lxml")

    articles = []
    # Each headline link is followed by a sibling element holding "(Category, time)".
    for a in soup.select(".schlagzeilen-content.schlagzeilen-overview a[title]"):
        # Split on the first comma only, so an extra comma cannot break the unpacking.
        category, published_at = a.find_next_sibling(class_="headline-date").get_text().split(",", 1)

        articles.append({
            "Title": a.get_text(),
            "URL": a.get('href'),
            "Category": category.strip(" ()"),
            "PublishedAt": published_at.strip(" ()")
        })

    filename = '%s_schlagzeilen.csv' % datetime.datetime.now().strftime('%Y-%m-%d_%H-%M-%S.%f')
    # newline='' lets the csv module control line endings (avoids blank rows on Windows).
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=["Title", "URL", "Category", "PublishedAt"])

        writer.writeheader()
        writer.writerows(articles)
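
The question used "|" as the delimiter, while the code above writes comma-separated output. As a minimal sketch, csv.DictWriter accepts the same delimiter argument as csv.writer. The sample row below is hypothetical and stands in for the scraped articles, and the output goes to an in-memory buffer so the snippet runs without network access:

```python
import csv
import io

# Hypothetical sample row standing in for the scraped articles.
articles = [
    {"Title": "Example headline", "URL": "/politik/example.html",
     "Category": "Politik", "PublishedAt": "14:32"},
]

# An in-memory text buffer replaces the real file for this demonstration.
buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=["Title", "URL", "Category", "PublishedAt"],
    delimiter="|",  # same delimiter as in the question
)
writer.writeheader()
writer.writerows(articles)

csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass the file object from open(filename, 'w', newline='', encoding='utf-8') in place of the buffer.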
    

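    Note how we locate the category and the "published at" value: we go to the next sibling element, split its text on the comma, and strip the surrounding parentheses.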
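
The split-and-strip step can be tried on its own without fetching the page. The sketch below assumes the sibling element's text looks like "(Politik, 14:32)"; that sample string and the helper name parse_headline_meta are illustrative assumptions, not taken from the site or from the answer. It uses str.partition rather than split so that a missing comma yields an empty time instead of an unpacking error:

```python
def parse_headline_meta(text):
    """Split text such as '(Politik, 14:32)' into (category, published_at)."""
    # Remove surrounding whitespace first, then the enclosing parentheses.
    inner = text.strip().strip("()")
    # partition() splits on the first comma only and does not raise
    # when the comma is missing; the remainder is then an empty string.
    category, _, published_at = inner.partition(",")
    return category.strip(), published_at.strip()


print(parse_headline_meta("(Politik, 14:32)"))
print(parse_headline_meta("(Sport)"))
```

Whether to tolerate a missing comma or fail loudly is a design choice; failing loudly, as the unpacking in the answer does, makes layout changes on the site easier to notice.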

    【Discussion】:
