Remove 'u from a webscrape output: answers

【Question title】: Remove 'u from a webscrape output
【Posted】: 2014-03-02 15:12:05
【Question】:

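Hi, I'm using BeautifulSoup to parse a website and get names as output. But after running the script I get [u'word1', u'word2', u'word3'] as output. What I'm looking for is 'word1 word2 word3'. How do I get rid of this u' and turn the result into a single string?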

from bs4 import BeautifulSoup
import urllib2
import re

myfile = open("base/dogs.txt","w+")
myfile.close()

url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=soup.findAll('a',{'href':re.compile("dog")})
myfile = open("base/dogs.txt","w+")
for eachname in names:
    d = (str(eachname.string.split()))+"\n"
    print [x.encode('ascii') for x in d]
    myfile.write(d)

myfile.close()

【Comments on the question】:

print [str(x.encode('ascii')) for x in d]?
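Note that if your strings can contain multibyte characters, changing them from Unicode strings to ASCII strings will actually destroy data. Are you sure that's what you want to do?
Note that if you just print the strings (for example with print) or write them out directly (not as part of an object whose contents get stringified with repr()), they will show up as plain text, without the u'' decoration.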
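The data-loss warning in the comments above can be made concrete with a minimal sketch. It assumes Python 3 (the question itself uses Python 2), and the sample name is hypothetical, not taken from the scraped page.

```python
# Hypothetical sample text containing one non-ASCII character (e with acute accent).
name = u"Sacr\u00e9 bleu"

# Strict ASCII encoding refuses text it cannot represent.
try:
    name.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# errors="ignore" silently drops the accented character, so data is lost.
lossy = name.encode("ascii", errors="ignore").decode("ascii")

# UTF-8 can represent the whole string and round-trips it unchanged.
roundtrip = name.encode("utf-8").decode("utf-8")

print(strict_ok)
print(lossy)
print(roundtrip == name)
```

If the names can ever contain accented letters, keeping them as Unicode and encoding as UTF-8 only when writing is usually the safer choice.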

Tags: python web-scraping beautifulsoup

【Answer 1】:

BeautifulSoup and Unicode, Dammit!

>>> from bs4 import BeautifulSoup
>>> BeautifulSoup("Sacr&eacute; bleu!")
<html><body><p>Sacré bleu!</p></body></html>

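Isn't that nice? When the soup is made, the document is converted to Unicode, and HTML entities are converted to Unicode characters! So you get Unicode objects as results. That is intended. Nothing is wrong.

So your question is really about Unicode. Unicode is explained in this video. Don't like videos? Read Introduction to Unicode.

The u is short for "the following string is Unicode-encoded". You can now use all Unicode characters instead of only the 128 ASCII characters; at the moment there are more than 110,000 of them. The u is not saved to a file or database. It is visual feedback so that you can see you are dealing with a Unicode string. Use it like a normal string, because it is a normal string.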

The moral of this story:

☺ when you see `u'…'`

【Comments】:

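But when you print a list object, you will always see the u. When you print a list, repr() is called on each item in it instead of str(). However, it is list.__str__() that gets called when the list is printed, not list.__repr__() (if both are defined). So printing a list is different.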
When you import antigravity, you fly.
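The point about repr() versus str() can be seen in a short sketch. It assumes Python 3, where the list still shows quotes around each item but no longer shows the u prefix that Python 2 prints; the word list is the hypothetical one from the question.

```python
# Hypothetical scraped names, as in the question's example output.
words = [u"word1", u"word2", u"word3"]

# Converting the list itself formats every item with repr(), so the quotes
# (and, on Python 2, the u prefix) show up in the result.
as_list = str(words)

# Joining the items produces one plain string with no decoration.
as_text = u" ".join(words)

print(as_list)
print(as_text)
```

The decoration belongs to how the list displays its items, not to the strings themselves.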

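【Answer 2】:

The answers here that use .encode() will give you what you asked for, but probably not what you need. You can keep the Unicode encoding and simply avoid representing things in a way that shows their encoding or type. The values will still be [u'word1', u'word2', u'word3'] (which avoids breaking support for languages that cannot be represented in ASCII) but they will print as word1 word2 word3.

Just do: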

for eachname in names:
    d = ' '.join(eachname.string.split()) + '\n'
    print d
    myfile.write(d)
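The split-then-join idiom above can be tried without the network. The following is a minimal sketch: normalize_name is a hypothetical helper, and the stray whitespace in the raw strings is invented to imitate what a scraped link text might contain.

```python
def normalize_name(raw):
    """Collapse every run of whitespace into one space and trim both ends."""
    return u" ".join(raw.split())


# Hypothetical raw link texts with tabs, newlines and repeated spaces.
raw_names = [u"  DK'S   SEND ALL \n", u"\tFREEDOM\nROCK  "]

cleaned = [normalize_name(raw) for raw in raw_names]

for name in cleaned:
    print(name)
```

Because the values stay Unicode strings throughout, nothing is lost if a name contains non-ASCII characters.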

【Comments】:

【Answer 3】:

BeautifulSoup is a really great HTML parser. Use it to its full potential for parsing HTML. So just modify your code as follows:

names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]

This takes the text between the anchor tags, so you don't need d = (str(eachname.string.split()))+"\n"

So the final code is

from bs4 import BeautifulSoup
import urllib2
import re
import codecs
url="http://trackinfo.com/entries-race.jsp?raceid=GBR$20140302A01"
page=urllib2.urlopen(url)
soup = BeautifulSoup(page.read())
names=[texts.text for texts in soup.findAll('a',{'href':re.compile("dog")})]
myfile = codecs.open("base/dogs.txt","wb",encoding="Utf-8")
for eachname in names:
    eachname=re.sub(r"[\t\n]","",eachname)
    myfile.write(eachname+"\n")
myfile.close()

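If you just need the text in the file without the u, open a text file with codecs.open() or io.open() using an appropriate text encoding (i.e. encoding="...") instead of opening a byte file with open().

That is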

myfile = codecs.open("base/dogs.txt","w+",encoding="Utf-8")

in your case.

The output in the file will be

BARTSSHESWAYCOOL                            
DK'S SEND ALL                            
SHAKIN THINGS UP                            
FROSTED COOKIE                            
JD EMBELLISH                            
WW CASH N CARRY                            
FREEDOM ROCK                            
HVAC BUTCHIE
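As a rough illustration of the advice about io.open(), here is a minimal sketch that writes Unicode names to a UTF-8 text file and reads them back. The temporary path and the second, non-ASCII name are assumptions made for the example; they do not come from the original page.

```python
import io
import os
import tempfile

# One name from the output above plus a hypothetical non-ASCII one.
names = [u"FROSTED COOKIE", u"Sacr\u00e9 bleu"]

# Temporary location used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "dogs.txt")

# Text mode with an explicit encoding: Unicode strings go in, UTF-8 bytes land on disk.
with io.open(path, "w", encoding="utf-8") as out:
    for name in names:
        out.write(name + u"\n")

# Reading with the same encoding returns the original text, with no u'' decoration.
with io.open(path, "r", encoding="utf-8") as src:
    content = src.read()

print(len(content.splitlines()))
```

The file holds only the characters themselves; the u prefix never reaches the disk because it is part of repr(), not of the string.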

See also an almost identical problem that I asked about.

【Comments】:
