【Posted】: 2015-06-27 21:38:01
【Question】:
Right now the output in my file looks like this:
<b>Nov 22–24</b> <b>Nov 29–Dec 1</b> <b>Dec 6–8</b> <b>Dec 13–15</b> <b>Dec 20–22</b> <b>Dec 27–29</b> <b>Jan 3–5</b> <b>Jan 10–12</b> <b>Jan 17–19</b> <b><i>Jan 17–20</i></b> <b>Jan 24–26</b> <b>Jan 31–Feb 2</b> <b>Feb 7–9</b> <b>Feb 14–16</b> <b><i>Feb 14–17</i></b> <b>Feb 21–23</b> <b>Feb 28–Mar 2</b> <b>Mar 7–9</b> <b>Mar 14–16</b> <b>Mar 21–23</b> <b>Mar 28–30</b>
I want to remove all of the stray non-ASCII characters and the HTML tags (<b>, <i>). I tried the .remove and .replace functions, but I get an error:
SyntaxError: Non-ASCII character '\xc2' in file -- FILE NAME-- on line 70, but no encoding declared; see http://python.org/dev/peps/pep-0263/ for details
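For context, this error comes from Python 2 reading a source file that contains a non-ASCII byte without an encoding declaration (PEP 263). A minimal sketch of the two usual ways around it follows. It assumes the character pasted into the script was a non-breaking space (U+00A0, whose UTF-8 encoding starts with the byte 0xC2); that is a guess from the error message, and `normalize_spaces` is a hypothetical helper name.

```python
# -*- coding: utf-8 -*-
# PEP 263: the declaration above has to sit on line 1 or 2 of the source file.
# With it in place, non-ASCII literals in the file are accepted by Python 2.

# Assumption: the offending character is a non-breaking space (UTF-8: C2 A0).
# Writing it as an escape keeps the source pure ASCII, which avoids the
# SyntaxError even when the declaration is missing.
NBSP = u"\u00a0"


def normalize_spaces(text):
    """Return text with non-breaking spaces replaced by plain spaces."""
    return text.replace(NBSP, u" ")
```

If the stray character turns out to be something else, the same pattern applies with a different escape.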
The file output shown earlier comes from a list returned by a web-crawler function:
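import requests
from bs4 import BeautifulSoup

def getWeekend(item_url):
    dates = []
    # Insert the weekend-page query parameter at offset 37 of the URL.
    href = item_url[:37] + "page=weekend&" + item_url[37:]
    response = requests.get(href)
    soup = BeautifulSoup(response.content, "lxml")  # or BeautifulSoup(response.content, "html5lib")
    # Select the <b> elements that hold the weekend date ranges.
    date = soup.select('table.chart-wide > tr > td > nobr > font > a > b')
    return date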
I write it to a file like this:
for item in listOfDate:
    wr.writerow(item)
How can I remove all the tags and keep only the dates?
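To make the goal concrete, here is a rough sketch of the kind of cleanup being asked for, written with only the standard library so it runs without the scraper's dependencies. It assumes each item is available as its markup string (as in the file output above) and that the stray character is a non-breaking space; `TextExtractor` and `strip_tags` are hypothetical names. BeautifulSoup's Tag objects also provide `get_text()`, which would avoid the round trip through markup, but that variant is not shown here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect only the text nodes of a markup fragment, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for the text between tags; start and end tags are skipped.
        self.parts.append(data)


def strip_tags(markup):
    """Return the text of an HTML fragment such as '<b><i>Jan 17-20</i></b>'."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    text = "".join(parser.parts)
    # Assumption: the unwanted character is a non-breaking space (U+00A0).
    return text.replace("\u00a0", " ").strip()
```

Each cleaned string could then be written as a one-element row, for example `wr.writerow([strip_tags(str(item))])`, where `wr` and `item` are the names from the loop above.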
【Comments】:
- What is the page's encoding?
Tags: python beautifulsoup web-crawler