BeautifulSoup is not parsing <td> data properly: answer

【Question title】: BeautifulSoup is not parsing <td> data properly
【Posted】: 2018-01-10 23:59:54
【Question description】:

I am trying to parse a coinmarketcap.com historical-data page using BeautifulSoup4 with Python 2.7.5. My code looks like this:

url = "https://coinmarketcap.com/currencies/CRYPTO/historical-data/?start=20171124&end=20171130"
url.replace('CRYPTO', crypto['id'])
response = urllib2.urlopen(url)

data = response.read()
soup = BeautifulSoup(data, 'html5lib')

trs = soup.find(id="historical-data").findAll('tr')

where CRYPTO is replaced with "bitcoin" and so on.

Looking at the variables in PyCharm, everything looks fine except for the data in the table. Instead of seeing this:

<tr class="text-right">
<td class="text-left">Nov 30, 2017</td>
<td>9906.79</td>
<td>10801.00</td>
<td>9202.05</td>
<td>10233.60</td>
<td>8,310,690,000</td>
<td>165,537,000,000</td>
</tr>

which is what both Google Chrome's Inspect window and curl show me, BeautifulSoup shows me this:

<tr class="text-right">
<td class="text-left">Nov 30, 2017</td>
<td>0.009829</td>
<td>0.013792</td>
<td>0.009351</td>
<td>0.013457</td>
<td>152</td>
<td>119,171</td>
</tr>

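Why are the numbers different?

I have tried both urllib2 and requests. I have tried both response.text and response.read(). I have parsed with both lxml and html5lib. I have tried different encodings such as iso-8859 and ascii. Nothing works.

How do I get the correct numbers to show up?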

【Question discussion】:

Tags: python beautifulsoup html-parsing

【Solution 1】:

You need to do the following instead:

url = "https://coinmarketcap.com/currencies/CRYPTO/historical-data/?start=20171124&end=20171130"
response = urllib2.urlopen(url.replace('CRYPTO', crypto['id']))

...or, to be more explicit about what is happening:

url = "https://coinmarketcap.com/currencies/CRYPTO/historical-data/?start=20171124&end=20171130"
newurl = url.replace('CRYPTO', crypto['id'])
response = urllib2.urlopen(newurl)

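...because as your code stands now, the call url.replace('CRYPTO', crypto['id']) on its own does not change anything; it just creates a new string, and your code never does anything with that new string.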

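Your code does not change the url string, because that is not how str.replace(…) works, nor how Python strings work: strings are immutable, so replace returns a new string and leaves the original untouched.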
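The behavior can be checked in isolation, without any network access. The following is a minimal sketch; the variable names template, unchanged, and fixed are chosen here for illustration, and "bitcoin" stands in for crypto['id'].

```python
# str.replace returns a new string; the original string is never modified.
template = "https://coinmarketcap.com/currencies/CRYPTO/historical-data/"

# The return value is discarded here, so this line has no lasting effect.
template.replace("CRYPTO", "bitcoin")
unchanged = template  # still contains the literal placeholder "CRYPTO"

# Keeping the returned value is what actually yields the substituted URL.
fixed = template.replace("CRYPTO", "bitcoin")

print(unchanged)
print(fixed)
```

Only fixed contains the substituted name; template is the same string it was before the first call.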

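So what happens with your current code is that the CRYPTO substring in the URL has not been replaced by the time you call urllib2.urlopen(…). As a result, the data you get back comes from this URL: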

https://coinmarketcap.com/currencies/CRYPTO/historical-data/?start=20171124&end=20171130
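One way to make this kind of mistake harder to repeat is to build the URL in a small function that returns the finished string, so there is no separate replace step whose result can be forgotten. This is only a sketch: URL_TEMPLATE and build_history_url are hypothetical names introduced here, not part of the original code or of any library.

```python
# Hypothetical helper: the URL is built and returned in a single step.
URL_TEMPLATE = (
    "https://coinmarketcap.com/currencies/{crypto_id}"
    "/historical-data/?start={start}&end={end}"
)


def build_history_url(crypto_id, start, end):
    """Return the historical-data URL for one currency and date range."""
    if not crypto_id:
        # Fail early instead of silently requesting a meaningless URL.
        raise ValueError("crypto_id must be a non-empty string")
    return URL_TEMPLATE.format(crypto_id=crypto_id, start=start, end=end)


url = build_history_url("bitcoin", "20171124", "20171130")
print(url)
```

The returned string can then be passed directly to urllib2.urlopen(…) or requests.get(…), as in the corrected code above.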

【Discussion】:

Thanks for pointing out my stupidity. Yes, I know strings don't work that way. Just one of those...