【Question title】: Python Detects URL and Removes It in Text File
【Posted】: 2019-11-15 00:55:30
【Question】:

I took a Python script from this source and edited it to my liking. It writes the first 20 tweets from a particular page to a text file.

from urllib.request import urlopen
from bs4 import BeautifulSoup

file = "tweets.txt"
f = open(file, "w")
url = "https://twitter.com/BBCWorld"
html = urlopen(url)
soup = BeautifulSoup(html, "html.parser")

# Gets the tweet
tweets = soup.find_all("li", attrs = {"class":"js-stream-item"})

# Writes tweet fetched in file
for tweet in tweets:
    try:
        if tweet.find('p',{'tweet-text'}):
            tweet_text = tweet.find('p',{'tweet-text'}).text.encode('utf8').strip()
            # tweet_user = tweet.find('span',{"class":'username'}).text.strip()
            # replies = tweet.find('span',{"class":"ProfileTweet-actionCount"}).text.strip()
            # retweets = tweet.find('span', {"class" : "ProfileTweet-action--retweet"}).text.strip()
            # String interpolation technique
            f.write(f'{tweet_text}\n')
    except AttributeError:
        pass
f.close()

However, when the tweets are written to the file, they look like this (using BBCWorld's feed as an example):

b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0' 

b'Guernsey asbestos cancer compensation scheme to launchhttps://bbc.in/2qQD9OE\xc2\xa0'

b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'

b'US election 2020: Deval Patrick announces presidential bidhttps://bbc.in/32QbdHH\xc2\xa0'

b'Knottfield: Joseph Marshall indecent assault trial delayedhttps://bbc.in/2XcXYjg\xc2\xa0'

b"Hugo Carvajal: Venezuelan ex-spy chief's disappearance 'a scandal'https://bbc.in/34VIwdY\xc2\xa0"

b'What fate awaits those former members of Islamic State being expelled from Turkey?https://www.bbc.co.uk/news/50396607\xc2\xa0'

b"Notre Dame: Army general tells architect to 'shut his mouth'https://bbc.in/2qVP7pX\xc2\xa0"

b'Six years after a Boeing 737-500 crashed in Kazan, Russian investigators conclude that the pilot wasn\xe2\x80\x99t qualified to fly the plane & had used falsified documents to get his job with (now defunct) Tatarstan Airlines. 50 people were killed.'

b'South Africa rugby stars strip off for cancer challengehttps://bbc.in/2rJF2Nv\xc2\xa0'

b"Diabetes: UN to tackle 'overly expensive' insulin priceshttps://bbc.in/2Op0nUf\xc2\xa0"

b'US Senator blocks move to say Armenian mass killing was genocidehttps://bbc.in/2QfjjHr\xc2\xa0'

b"Turkey to extradite American IS suspect 'stranded on border'https://bbc.in/33OJ1X1\xc2\xa0"

b'Father and daughter ballet video breaks stereotypes, says teacherhttps://bbc.in/2KlFEPY\xc2\xa0'

b'Australia seeks to curb foreign interference in universitieshttps://bbc.in/2NJcb4i\xc2\xa0'

b'Washington teacher arrested for threatening to shoot studentshttps://bbc.in/2KlUk1o\xc2\xa0'

b'Denmark holds neo-Nazi over Jewish cemetery attackhttps://bbc.in/2pjrcR9\xc2\xa0'

b'Manus Island refugee author Behrouz Boochani arrives in New Zealandhttps://bbc.in/2NMI4cs\xc2\xa0'

b'Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xc2\xa0'

b'Condor Ferries bought by Swedish investment fundhttps://bbc.in/2NLw0rT\xc2\xa0'

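How do I remove the "b"? And if a tweet contains a link, as in the examples above, how can I remove that URL?

Also, why do strings of numbers and letters sometimes appear, and how do I fix or remove them?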

【Question comments】:

    Tags: python python-3.x twitter


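    【Solution 1】:

    The b prefix appears because tweet_text is a bytes object (the result of .encode('utf8')). To remove it, decode the bytes back into a string: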

    str_tweet = tweet_text.decode('utf-8')
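To see where the b and the \xc2\xa0 sequences come from, it may help to look at one of the lines from the question as a bytes object. The sketch below uses that line as sample data, and the variable names are only illustrative. Formatting bytes into an f-string writes its repr, including the b prefix and \x escapes for every non-ASCII byte. Here \xc2\xa3 is the UTF-8 encoding of £ and \xc2\xa0 is a non-breaking space. Note that bytes.strip() only removes ASCII whitespace, which is why the trailing \xc2\xa0 survived in the question's output, while str.strip() on the decoded text removes it.

```python
# Sample line copied from the question's output (bytes, as produced by .encode('utf8')).
raw = b'Construction firm fined \xc2\xa310k for Jersey water pollutionhttps://bbc.in/2KgIk19\xc2\xa0'

# Formatting bytes into an f-string uses its repr, which is where the b'...' comes from.
as_written = f'{raw}'

# Decoding turns the UTF-8 bytes back into text: \xc2\xa3 becomes the pound sign,
# \xc2\xa0 becomes a non-breaking space (U+00A0).
text = raw.decode('utf-8')

# str.strip() treats U+00A0 as whitespace, so the trailing one is removed.
stripped = text.strip()

print(as_written)
print(stripped)
```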

    To remove the hyperlink at the end, you can do something like this, which is quick and dirty:

    only_tweet = str_tweet.split('https://')[0]
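Splitting on 'https://' misses links that start with http://, such as the Venice floods tweet in the question. A regular expression is one possible alternative. The helper name strip_urls below is hypothetical, and the pattern rests on a simple assumption (a URL runs until the next whitespace character); it is not a full URL parser, and any text glued directly after a link would be removed along with it.

```python
import re

# Assumption: a URL starts with http:// or https:// and runs to the next whitespace.
URL_PATTERN = re.compile(r'https?://\S+')

def strip_urls(text):
    """Remove http(s) URLs, then trim surrounding whitespace (including U+00A0)."""
    return URL_PATTERN.sub('', text).strip()

print(strip_urls('Italy to declare state of emergency over damage from Venice floodshttp://bbc.in/2OdDoeu\xa0'))
```

A tweet without any link, like the Boeing 737-500 one in the question, passes through unchanged apart from the whitespace trimming.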

    Then, of course, change your write statement to use the new variable. This results in output like:

    'Van crash in south-east Iran kills 28 Afghan nationals'

    instead of

    b'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xc2\xa0'
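Since .text already returns a str, another option is to drop the .encode('utf8') call altogether and write the cleaned string directly, opening the file with an explicit encoding. The following is only a sketch of how the write step might look: clean and write_tweets are hypothetical names, the sample list stands in for the values of tweet.find(...).text in the question's loop, and the file is written to the system temp directory purely for illustration.

```python
import os
import tempfile

def clean(text):
    """Cut the text at the first link, as in the quick split above, and trim whitespace."""
    for prefix in ('https://', 'http://'):
        text = text.split(prefix)[0]
    return text.strip()

def write_tweets(texts, path):
    """Write one cleaned tweet per line as UTF-8 text."""
    with open(path, 'w', encoding='utf-8') as f:
        for text in texts:
            f.write(f'{clean(text)}\n')

# Stand-in data: what tweet.find(...).text would return (str, not bytes).
sample = [
    'Van crash in south-east Iran kills 28 Afghan nationalshttps://bbc.in/2qcsg9P\xa0',
    'Construction firm fined £10k for Jersey water pollutionhttps://bbc.in/2KgIk19\xa0',
]
path = os.path.join(tempfile.gettempdir(), 'tweets.txt')
write_tweets(sample, path)
```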

    【Comments】:
