【Posted】: 2019-11-15 00:55:30
【Question】:
I took a Python script from this source and edited it to suit my needs. It writes the first 20 tweets from a specific page to a text file.
from urllib.request import urlopen
from bs4 import BeautifulSoup
file = "tweets.txt"
f = open(file, "w")
url = "https://twitter.com/BBCWorld"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Gets the tweet
tweets = soup.find_all("li", attrs = {"class":"js-stream-item"})
# Writes tweet fetched in file
for tweet in tweets:
    try:
        if tweet.find('p',{'tweet-text'}):
            tweet_text = tweet.find('p',{'tweet-text'}).text.encode('utf8').strip()
            # tweet_user = tweet.find('span',{"class":'username'}).text.strip()
            # replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
            # retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
            # String interpolation technique
            f.write(f'{tweet_text}\n')
    except AttributeError:
        # Skip items that lack the expected tweet markup
        pass
f.close()
However, when the tweets are written to the file, they look like this (using the BBCWorld feed as an example):
b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0'
b'Guernsey asbestos cancer compensation scheme to launchhttps://bbc.in/2qQD9OE\xc2\xa0'
b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'
b'US election 2020: Deval Patrick announces presidential bidhttps://bbc.in/32QbdHH\xc2\xa0'
b'Knottfield: Joseph Marshall indecent assault trial delayedhttps://bbc.in/2XcXYjg\xc2\xa0'
b"Hugo Carvajal: Venezuelan ex-spy chief's disappearance 'a scandal'https://bbc.in/34VIwdY\xc2\xa0"
b'What fate awaits those former members of Islamic State being expelled from Turkey?https://www.bbc.co.uk/news/50396607\xc2\xa0'
b"Notre Dame: Army general tells architect to 'shut his mouth'https://bbc.in/2qVP7pX\xc2\xa0"
b'Six years after a Boeing 737-500 crashed in Kazan, Russian investigators conclude that the pilot wasn\xe2\x80\x99t qualified to fly the plane & had used falsified documents to get his job with (now defunct) Tatarstan Airlines. 50 people were killed.'
b'South Africa rugby stars strip off for cancer challengehttps://bbc.in/2rJF2Nv\xc2\xa0'
b"Diabetes: UN to tackle 'overly expensive' insulin priceshttps://bbc.in/2Op0nUf\xc2\xa0"
b'US Senator blocks move to say Armenian mass killing was genocidehttps://bbc.in/2QfjjHr\xc2\xa0'
b"Turkey to extradite American IS suspect 'stranded on border'https://bbc.in/33OJ1X1\xc2\xa0"
b'Father and daughter ballet video breaks stereotypes, says teacherhttps://bbc.in/2KlFEPY\xc2\xa0'
b'Australia seeks to curb foreign interference in universitieshttps://bbc.in/2NJcb4i\xc2\xa0'
b'Washington teacher arrested for threatening to shoot studentshttps://bbc.in/2KlUk1o\xc2\xa0'
b'Denmark holds neo-Nazi over Jewish cemetery attackhttps://bbc.in/2pjrcR9\xc2\xa0'
b'Manus Island refugee author Behrouz Boochani arrives in New Zealandhttps://bbc.in/2NMI4cs\xc2\xa0'
b'Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xc2\xa0'
b'Condor Ferries bought by Swedish investment fundhttps://bbc.in/2NLw0rT\xc2\xa0'
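How do I remove the "b"? And if a tweet contains a link, as nearly all of these do, how can I remove the URL?
Also, why do sequences of digits and letters (such as \xc2\xa0) sometimes appear, and how can I fix or remove them?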
【Discussion】:
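A minimal sketch of one way to approach this, assuming the b'...' prefix comes from the `.encode('utf8')` call in the script: encoding turns the text into a `bytes` object, and an f-string formats `bytes` using its repr, which produces the `b` prefix and escapes such as `\xc2\xa0` (the UTF-8 bytes of a non-breaking space) or `\xc2\xa3` (the pound sign). Keeping the text as `str` avoids both. The helper name `clean_tweet` and the URL pattern are illustrative assumptions, not part of the original script, and the sample strings stand in for what `tweet.find(...).text` would return.

```python
import re

# Assumed pattern: an http/https link runs until the next whitespace character.
URL_PATTERN = re.compile(r"https?://\S+")


def clean_tweet(text: str) -> str:
    """Hypothetical helper: strip links and non-breaking spaces from tweet text."""
    text = text.replace("\xa0", " ")   # U+00A0 appears as \xc2\xa0 once UTF-8 encoded
    text = URL_PATTERN.sub("", text)   # remove any link, wherever it sits in the text
    return text.strip()


# Sample strings standing in for tweet.find('p', {'tweet-text'}).text
samples = [
    "Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xa0",
    "Construction firm fined \xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xa0",
]

cleaned = [clean_tweet(s) for s in samples]
for line in cleaned:
    print(line)
```

In the original loop this would probably mean dropping `.encode('utf8')`, passing the text through a helper like the one above, and opening the output file with `open(file, "w", encoding="utf-8")` so that characters such as the pound sign are written correctly.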
Tags: python python-3.x twitter