【Posted】:2020-07-08 10:49:51
【Question】:
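Silly question. I built my first scraper/crawler. It gives me what I want, but when I write the result to a csv file, the text shows up inside \n'] brackets. Whatever I try in order to remove them breaks my output in the csv file. The site is in Hebrew, but that should not be a problem. Take a look at the csv you get. Thanks in advance.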
import csv
import requests
from bs4 import BeautifulSoup as bs
import io

url='https://www.maariv.co.il/news/politics'
source = requests.get(url).text
soup = bs(source, 'html.parser')

file = io.open('maariv7.csv', 'w', encoding="utf-16")
csv_writer = csv.writer(file, delimiter='|')
csv_writer.writerow(['Headline', 'Summary', 'Text', 'name'])
file.close()

def single_page_scraper(url):
    source = requests.get(url).text
    soup = bs(source, 'html.parser')
    file = io.open('maariv7.csv', 'a', encoding="utf-16")
    csv_writer = csv.writer(file, delimiter='|')
    for article in soup.find_all(class_='article-title'):
        headline = article.h1.text
        print (headline,'\n')
    for article in soup.find_all(class_='article-description'):
        summary = article.h2.text
        print(summary,'\n')
    text=[]
    name=[]
    for par in soup.find_all(class_='article-body'):
        text.append(par.get_text())
        print(text)
    politics = io.open('politicians.txt', 'r', encoding="utf-8")
    my_list=politics.read().splitlines()
    my_file=str(text)
    for i in my_list:
        if i in my_file:
            name.append(i)
    name_list = ", ".join(name)
    print(name_list,'\n''\n''\n''\n')
    csv_writer.writerow([headline, summary, my_file, name_list])
    file.close()

for articles in soup.find_all(class_='three-articles-in-row'):
    link = articles.a['href']
    single_page_scraper(link)
【Discussion】:
-
This line raises an error: politics = io.open('politicians.txt', 'r', encoding="utf-8"), and the file does not exist.
-
Without being able to run the program, it is hard to see what is going on. Maybe
csv_writer.writerow([headline.strip(), summary.strip(), my_file.strip(), name_list.strip()]) would help?
-
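The site actually puts newline characters in its text, so you should strip them at the point where you append the text: instead of
text.append(par.get_text()) add a strip: text.append(par.get_text().strip()).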
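-
The comments above can be tried out without hitting the site. Below is a minimal sketch, assuming the brackets come from my_file=str(text) in the question, which turns the whole list into its printed form. The sample strings are made-up stand-ins for what par.get_text() returns, and the space used for joining is only one possible choice.

```python
import csv
import io

# Hypothetical stand-in for the pieces collected from 'article-body' elements.
text = ["First paragraph of the article.\n", "Second paragraph.\n"]

# What the question's code does: str() of a list gives the list's repr,
# so the brackets, quotes and escaped newlines all end up in the CSV cell.
as_repr = str(text)

# Alternative: strip each piece and join the pieces into one plain string.
as_plain = " ".join(part.strip() for part in text)

# Write one row into an in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="|")
writer.writerow(["Headline", "Summary", as_plain, "name"])
row = buffer.getvalue()

print(as_repr)
print(row)
```

If this matches what happens on the real pages, replacing my_file=str(text) with a join like the one above should keep the brackets out of the file, and the name lookup (if i in my_file) keeps working because my_file is still a string.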
Tags: python python-3.x csv web-scraping beautifulsoup