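BeautifulSoup in Python not parsing right

【Question title】: BeautifulSoup in Python not parsing right
【Posted】: 2013-08-29 16:11:14
【Question】:

I'm running Python 2.7.5 and using the built-in html parser for what I'm about to describe.

The task I'm trying to accomplish is to take a chunk of html that is essentially a recipe. Here's an example.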

html_chunk = "<h1>Miniature Potato Knishes</h1>Posted by bettyboop50 at recipegoldmine.com May 10, 2001Makes about 42 miniature knishesThese are just yummy for your tummy!3 cups mashed potatoes (about &nbsp;&nbsp;&nbsp; 2 very large potatoes) 2 eggs, slightly beaten 1 large onion, diced 2 tablespoons margarine 1 teaspoon salt (or to taste) 1/8 teaspoon black pepper 3/8 cup Matzoh meal 1 egg yolk, beaten with 1 tablespoon waterPreheat oven to 400 degrees F.Sauté diced onion in a small amount of butter or margarine until golden brown.In medium bowl, combine mashed potatoes, sautéed onion, eggs, margarine, salt, pepper, and Matzoh meal.Form mixture into small balls about the size of a walnut. Brush with egg yolk mixture and place on a well-greased baking sheet and bake for 20 minutes or until well browned."

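The goal is to separate out the header, junk, ingredients, instructions, serving, and number of ingredients.

Here's my code that accomplishes that: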

from bs4 import BeautifulSoup

def list_to_string(list):
   joined = ""
   for item in list:
      joined += str(item)
   return joined

def get_ingredients(soup):
   for p in soup.find_all('p'):
      if p.find('br'):
         return p

def get_instructions(p_list, ingredient_index):
   instructions = []
   instructions += p_list[ingredient_index+1:]
   return instructions

def get_junk(p_list, ingredient_index):
   junk = []
   junk += p_list[:ingredient_index]
   return junk

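def get_serving(p_list):
   keywords = ("yield", "make", "serve", "serving")
   for item in p_list:
      item_str = str(item).lower()
      # Test each keyword separately; ("yield" or "make" ...) evaluates to just "yield".
      if any(keyword in item_str for keyword in keywords):
         yield_index = p_list.index(item)
         del p_list[yield_index]
         return item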

def ingredients_count(ingredients):
   ingredients_list = ingredients.find_all(text=True)
   return len(ingredients_list)

def get_header(soup):
   return soup.find('h1')

def html_chunk_splitter(soup):
   ingredients = get_ingredients(soup)
   if ingredients is None:
      error = 1
      header = ""
      junk_string = ""
      instructions_string = ""
      serving = ""
      count = ""
   else:
      p_list = soup.find_all('p')
      serving = get_serving(p_list)
      ingredient_index = p_list.index(ingredients)
      junk_list = get_junk(p_list, ingredient_index)
      instructions_list = get_instructions(p_list, ingredient_index)
      junk_string = list_to_string(junk_list)
      instructions_string = list_to_string(instructions_list)
      header = get_header(soup)
      error = ""
      count = ingredients_count(ingredients)
   return (header, junk_string, ingredients, instructions_string, 
   serving, count, error)

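It runs great, except when I have chunks containing strings like "Sauté", because soup = BeautifulSoup(html_chunk) causes Sauté to turn into SautÃ©. This is a problem because I have a huge csv file of recipes like the html_chunk, and I'm trying to structure all of them nicely and then get the output back into a database. I tried checking whether Sauté comes out right using this html previewer, and it still comes out as SautÃ©. I don't know what to do about it.

What's weird is that when I do what BeautifulSoup's documentation shows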

BeautifulSoup("Sacr&eacute; bleu!")
# <html><head></head><body>Sacré bleu!</body></html>

I get

# Sacr├⌐ bleu!

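But my colleague tried it on his Mac, running from the terminal, and he got exactly what the documentation shows.

I really appreciate all your help. Thank you.

【Comments】:

Are you running the script in a terminal/cmd, is your file encoded as UTF-8, and is # -*- coding: utf-8 -*- at the top of your file?
Where do you get SacrÃ© bleu!? In a browser? In a terminal?
I'm doing this on a PC running Windows 7. I'm using Wing IDE 101 4.1 (the free version). At first I thought that might be the problem, so I launched IDLE itself and tried it there, and I got the same result. I actually know very little about encoding, but I pretty much copied and pasted it from here. I get SacrÃ© bleu! in both IDLE and Wing IDE, so I guess it's the terminal.
So do you have # -*- coding: utf-8 -*- at the top?
What do you mean? There is no file. I just copied and pasted it from the link. I'm doing this because it will eventually run on a file of mine, but that file is a CSV and I don't know how to check its encoding. At the top of what?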

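Tags: python html encoding beautifulsoup

【Solution 1】:

This is not a parsing problem; it is about encoding.

Whenever you work with text that might contain non-ASCII characters (or in Python programs that contain such characters, e.g. in comments or docstrings), you should put a coding cookie on the first line or, if there is a shebang line, on the second line: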

#!/usr/bin/env python
# -*- coding: utf-8 -*-

...and make sure this matches the actual encoding of your file (in vim: :set fenc=utf-8).
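
To illustrate why this looks like an encoding mismatch rather than a parser bug, here is a small sketch that is not part of the original answer. It assumes the text is stored as UTF-8 bytes and is then displayed by something that interprets those bytes as cp1252 (a common Windows default) or cp437 (a common Windows console code page). The variable names are made up for the example.

```python
# -*- coding: utf-8 -*-
# Sketch: the same UTF-8 bytes, decoded with the wrong code page,
# produce the garbled strings reported in the question.

original = u"Sacr\u00e9 bleu!"          # the intended text, "Sacré bleu!"
utf8_bytes = original.encode("utf-8")   # "é" becomes the two bytes 0xC3 0xA9

# Assumption: a viewer that believes the bytes are cp1252 or cp437.
as_cp1252 = utf8_bytes.decode("cp1252")
as_cp437 = utf8_bytes.decode("cp437")

# Compare against escaped literals so nothing non-ASCII has to be printed.
print(as_cp1252 == u"Sacr\u00c3\u00a9 bleu!")   # True, displays as "SacrÃ© bleu!"
print(as_cp437 == u"Sacr\u251c\u2310 bleu!")    # True, displays as "Sacr├⌐ bleu!"

# Decoding with the encoding that was actually used restores the text.
print(utf8_bytes.decode("utf-8") == original)   # True
```

In both garbled cases the underlying bytes are intact; only the interpretation differs, which is consistent with the same code displaying correctly in a UTF-8 terminal on a Mac.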

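【Comments】:

【Solution 2】:

BeautifulSoup tries to guess the encoding and sometimes gets it wrong. You can specify the encoding yourself by adding the from_encoding parameter, for example: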

soup = BeautifulSoup(html_text, from_encoding="UTF-8")

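The encoding is usually declared in the header of the web page (for example in a <meta> tag or in the HTTP Content-Type header).

【Comments】: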
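
As a hedged illustration of the from_encoding suggestion, the sketch below parses UTF-8 bytes with the built-in html.parser. The sample markup and variable names are assumptions made for the example. Note that from_encoding only matters when the markup is passed as bytes; if you pass an already-decoded unicode string, Beautiful Soup ignores it.

```python
from bs4 import BeautifulSoup

# Hypothetical input: raw bytes as they might be read from a UTF-8 encoded file.
html_bytes = u"<p>Saut\u00e9 diced onion</p>".encode("utf-8")

# Tell Beautiful Soup how to decode the bytes instead of letting it guess.
soup = BeautifulSoup(html_bytes, "html.parser", from_encoding="utf-8")

text = soup.p.get_text()
print(text == u"Saut\u00e9 diced onion")  # True: the accented character survives
```

If the bytes come from a CSV file, reading the file in binary mode (or decoding it with the file's real encoding before parsing) keeps this step explicit.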