【Question Title】: Python webscraper & missing output data
【Posted】: 2019-02-24 04:44:08
【Question】:

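I am trying to use Python (3.7) and BeautifulSoup to scrape reviews from a website and store them in a CSV file. The scraping itself appears to succeed, but when I write to the file, only one column contains the full data; the others contain just the first character.

Any tips would be greatly appreciated, and apologies if this is obvious. It's a new hobby :)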

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

#URL to scrape
my_url = "https://www.indeed.com/cmp/Capital-One/reviews?fcountry=ALL&lang="

#open connection, grab page
uClient = uReq(my_url)
page_html = uClient

#html parsing
page_soup = soup(page_html, "lxml")

#grab all reviews on page
containers = page_soup.findAll("div",{"cmp-review-container"})
uClient.close()
#write to csv
filename = "indeedreviewtest.csv"
f=open(filename, "w")

headers = "review_id, review_score, role, review_text\n"

f.write(headers)

#loop through each review, collect review ID, rating, role & verbatim
for container in containers:
    reviewid_container = container.div["data-tn-entityid"]
    reviewid = reviewid_container[0]
    score_container = container.div.div.div.meta["content"]
    reviewscore = score_container[0]
    role_container = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
    reviewerrole = role_container[0]
    reviewtext_container = container.find("span", attrs={"class":"cmp-review-text"}).text
    reviewtext = reviewtext_container

    f.write(reviewid + "," + reviewscore + "," + reviewerrole.replace(",", "|") + "," + reviewtext.replace(",", "|") + "\n")

f.close()

Thanks!

【Question Discussion】:

    Tags: python python-3.x


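    【Solution 1】:

    Perhaps you are confusing find() with findAll().

    find() stops at the first element that matches the criteria, whereas findAll() returns all matching elements.

    By using role_container[0], you take only the first character of that element's text, because .text returns a string.

    You can try: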

    reviewerrole = container.find("span", attrs={"class":"cmp-reviewer-job-title"}).text
    reviewtext = container.find("span", attrs={"class":"cmp-review-text"}).text
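
To see why `[0]` behaves so differently depending on what it is applied to, here is a minimal sketch in plain Python. The sample strings are made up for illustration; they only stand in for what `.text` on a single `find()` result (a string) and the texts of several `findAll()` matches (a list) would hold.

```python
# Made-up values standing in for scraped data (assumptions, not real reviews).
role_text = "Customer Service Representative"  # like find(...).text: a single str
role_texts = ["Customer Service Representative", "Branch Manager"]  # like the text of each findAll(...) match: a list

# Indexing a str returns a single character ...
first_char = role_text[0]
# ... while indexing a list returns the whole first element.
first_item = role_texts[0]

print(repr(first_char))
print(repr(first_item))
```

This is why the columns built with `[0]` in the question end up one character wide, while `reviewtext`, which is not indexed, keeps the full text.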
    

    Beyond that, consider using the csv module to read and write CSV files. More info: https://docs.python.org/3/library/csv.html#csv.writer
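
As a rough sketch of what the csv suggestion could look like, the example below writes a header and a few rows with `csv.writer`. The sample rows and the helper name `write_reviews` are hypothetical, and an in-memory `io.StringIO` buffer is used instead of a real file so the snippet runs on its own.

```python
import csv
import io

# Hypothetical sample rows standing in for the scraped values.
rows = [
    ("abc123", "4.0", "Analyst, Risk", "Great team, long hours"),
    ("def456", "5.0", "Teller", 'Manager said "well done"'),
]

def write_reviews(fileobj, rows):
    """Write a header plus one CSV row per review to an open text file object."""
    writer = csv.writer(fileobj)
    writer.writerow(["review_id", "review_score", "role", "review_text"])
    writer.writerows(rows)

# An in-memory buffer keeps the example self-contained.
buffer = io.StringIO()
write_reviews(buffer, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Because the writer quotes fields that contain commas or quotation marks, the `replace(",", "|")` workaround from the question should no longer be needed. With a real file, the csv documentation recommends opening it with `newline=""`, for example `open(filename, "w", newline="")`.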

    【Discussion】:

    • Thank you so much, you kind person, it worked! Thanks also for the further-reading suggestion; it's at the top of my to-do list.