【Question Title】: Problem with scraped Bulgarian-language text in Excel using bs4
【Posted】: 2021-03-21 17:30:12
【Question Description】:
  1. I am trying to scrape a website that contains Bulgarian text. The scraping itself succeeds, but when I store the result in a CSV file the text is unreadable. Please see the code and images below for a better picture of my problem.

    import requests
    import csv
    import bs4

    res = requests.get('https://m.mobile.bg/results?pubtype=1&marka=Toyota&currency=%D0%BB%D0%B2.&sort=1&nup=0~1')

    soup = bs4.BeautifulSoup(res.text, 'lxml')
    file = open('cars.csv', 'w')
    writer = csv.writer(file)

    # write title row
    writer.writerow(['Car_Make', 'Price', 'info', 'date'])
    for i in soup.select('.listItem'):
        car_make = i.find('div', attrs={"class": "title"})
        arr = i.text
        print(arr)
        writer.writerow([arr.encode('utf-8')])

    file.close()
    

The output in the Jupyter notebook is as follows. I want it to be stored exactly like this in the CSV file.

This is what the output looks like in the CSV file:
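The unreadable cells are caused by passing bytes to the CSV writer: `csv.writer` calls `str()` on any non-string field, so `writerow([arr.encode('utf-8')])` stores the *repr* of the bytes object rather than the text. A minimal sketch (no scraping involved) shows the difference:

```python
# Minimal sketch: csv.writer stringifies non-str fields, so writing
# text.encode('utf-8') stores the bytes repr (b'\xd0...') instead of
# the readable Cyrillic text.
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["лв.".encode("utf-8")])   # bytes field -> repr ends up in the cell
writer.writerow(["лв."])                   # plain str field -> readable text

lines = buf.getvalue().splitlines()
print(lines[0])  # b'\xd0\xbb\xd0\xb2.'
print(lines[1])  # лв.
```

The fix is simply to write the string itself and choose the file encoding when opening the file, as the accepted solution below does.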

【Question Comments】:

  • Try utf-8-sig where it is supported.
  • utf-8-sig did not solve the problem.
  • Thank you very much, @barny. I did not know the terminology, as this is the first time I have done a task like this. Thanks for clearing up the terms.
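On the `utf-8-sig` suggestion: it does work, but only once the bytes-vs-str bug above is fixed. The codec prepends a byte-order mark (`EF BB BF`), which is the cue Excel uses to open a CSV as UTF-8 instead of the local ANSI code page. A minimal sketch (the file name `demo.csv` is arbitrary):

```python
# Write Cyrillic text with the 'utf-8-sig' codec; the codec prepends a BOM
# (EF BB BF) that tells Excel to decode the file as UTF-8.
import csv

with open("demo.csv", "w", newline="", encoding="utf-8-sig") as f:
    csv.writer(f).writerow(["Toyota Corolla", "12 500 лв."])

# Inspect the first three raw bytes: the UTF-8 BOM.
with open("demo.csv", "rb") as f:
    head = f.read(3)
print(head)  # b'\xef\xbb\xbf'
```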

Tags: python-3.x web-scraping beautifulsoup export-to-csv


【Solution 1】:
import requests
import csv
from bs4 import BeautifulSoup


def main(url):
    # Pass the Cyrillic currency value as a plain string and let requests
    # percent-encode it, instead of hand-building the query string.
    params = {
        "pubtype": "1",
        "marka": "Toyota",
        "currency": "лв.",
        "sort": "1",
        "nup": "0~1"
    }
    r = requests.get(url, params=params)
    soup = BeautifulSoup(r.text, 'lxml')
    # 'utf-8-sig' prepends a BOM so Excel opens the file as UTF-8;
    # the rows are written as str, never as pre-encoded bytes.
    with open('d.csv', 'w', newline='', encoding='utf-8-sig') as f:
        writer = csv.writer(f)
        writer.writerows([list(x.strings)
                          for x in soup.select('.listItem.TOPitem')])


main('https://m.mobile.bg/results')

Output:
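A note on the `params` design choice in the solution: handing the Cyrillic value to `requests` via `params` produces the same percent-encoded query string (`currency=%D0%BB%D0%B2.`) that the question's hand-built URL contains. The equivalence can be checked with the standard library:

```python
# urlencode percent-encodes the Cyrillic value exactly as it appears in the
# original hand-built URL (currency=%D0%BB%D0%B2.).
from urllib.parse import urlencode

qs = urlencode({"currency": "лв.", "marka": "Toyota"})
print(qs)  # currency=%D0%BB%D0%B2.&marka=Toyota
```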

【Discussion】:
