【Question title】:Python: splitting strings and converting them to a list that accounts for empty fields
【Posted】:2020-09-29 01:34:56
【Question】:

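I have spent the whole day trying to solve this without finding a solution, so I hope you can help me. I am extracting data from a website, but I don't know how to split the list so that 500g becomes 500,g. The problem is that on the website the quantity is sometimes 1 and sometimes 1 1/2 kg or something similar. I need to convert this into a CSV file and later load it into a MySQL database. In the end I want a CSV file with the following columns: ingredient ID, ingredient, quantity, and unit of the quantity. For example: 0,meat,500,g. Here is the code I already have for extracting the data from this website: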

import re
from bs4 import BeautifulSoup
import requests
import csv

urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
urls_urls = []
ingredients = []
menge = []

def read_recipes():
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            menge.append([*[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                ingredients.append([id2, *[ingredient.get_text(strip=True)]])
            else:
                ingredients.append([id2, *[re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))]])

read_recipes()

I hope you can help me. Thank you!

【Comments】:

    Tags: python mysql csv beautifulsoup export-to-csv


    【Solution 1】:

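    It looks like the strings that contain fractions use Unicode symbols for 1/2 and so on, so I think a good way to start is to replace those by looking up the specific code point and passing it to str.replace(). Splitting the unit and the amount is easy for this example, because they are separated by a space. If you run into other combinations, though, it may be necessary to generalize this further. The following code works for this specific example: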

    import re
    from bs4 import BeautifulSoup
    import requests
    import csv
    import pandas as pd
    
    urls_recipes = ['https://www.chefkoch.de/rezepte/3039711456645264/Ossobuco-a-la-Milanese.html']
    mainurl = "https://www.chefkoch.de/rs/s0e1n1z1b0i1d1,2,3/Rezepte.html"
    urls_urls = []
    ingredients = []
    menge = []
    einheit = []
    
    
    for url, id2 in zip(urls_recipes, range(len(urls_recipes))):
        # name the parser explicitly (the question uses "lxml") to avoid bs4's parser warning
        soup2 = BeautifulSoup(requests.get(url).content, "lxml")
        for ingredient in soup2.select('.td-left'):
            # get rid of multiple spaces and replace 1/2 unicode character
            raw_string = re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True)).replace(u'\u00BD', "0.5")
            # split into unit and number
            splitlist = raw_string.split(" ")
            menge.append(splitlist[0])
            if len(splitlist) == 2:
                einheit.append(splitlist[1])
            else:
                einheit.append('')
        for ingredient in soup2.select('.recipe-ingredients h3, .td-right'):
            if ingredient.name == 'h3':
                continue
            else:
                ingredients.append([id2, re.sub(r'\s{2,}', ' ', ingredient.get_text(strip=True))])
    
    result = pd.DataFrame(ingredients, columns=["ID", "Ingredients"])
    result.loc[:, "unit"] = einheit
    result.loc[:, "amount"] = menge
    

    Output:

     >>> result
         ID                                        Ingredients   unit amount
     0    0  Beinscheibe(n), vom Rind, ca. 4 cm dick geschn...             4
     1    0                                               Mehl         etwas
     2    0                                         Zwiebel(n)             1
     3    0                                   Knoblauchzehe(n)             2
     4    0                                         Karotte(n)             1
     5    0                                     Lauchstange(n)             1
     6    0                                    Staudensellerie           0.5
     7    0                                Tomate(n), geschält   Dose      1
     8    0                                        Tomatenmark     EL      1
     9    0                              Rotwein zum Ablöschen
     10   0                       Rinderfond oder Fleischbrühe  Liter    0.5
     11   0                                Olivenöl zum Braten
     12   0                                     Gewürznelke(n)             2
     13   0                                       Pimentkörner            10
     14   0                                  Wacholderbeere(n)             5
     15   0                                      Pfefferkörner
     16   0                                               Salz
     17   0                    Pfeffer, schwarz, aus der Mühle
     18   0                                            Thymian
     19   0                                           Rosmarin
     20   0                            Zitrone(n), unbehandelt             1
     21   0                                   Knoblauchzehe(n)             2
     22   0                                    Blattpetersilie   Bund      1
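
If other combinations such as "500g" or "1 ½ kg" do show up, the split could be generalized roughly along these lines. This is only a sketch under assumptions: `parse_amount`, `VULGAR` and `AMOUNT_RE` are hypothetical names that do not appear in the code above, the fraction table covers just a handful of symbols, and text without a leading number (for example "etwas") is returned entirely as the unit with an amount of `None`, which differs from the code above, where "etwas" ends up in the amount column.

```python
import re

# Unicode vulgar fractions rewritten as ASCII "a/b" (assumed subset, extend as needed).
VULGAR = {"\u00BD": "1/2", "\u00BC": "1/4", "\u00BE": "3/4",
          "\u2153": "1/3", "\u2154": "2/3"}

AMOUNT_RE = re.compile(
    r"^\s*(?:(?P<whole>\d+(?:[.,]\d+)?)(?![\d/]))?"  # "500", "0,5", "1.5"
    r"\s*(?:(?P<num>\d+)/(?P<den>[1-9]\d*))?"         # optional fraction "1/2"
    r"\s*(?P<unit>.*)$"                               # whatever is left over
)


def parse_amount(text):
    """Split a quantity string into (amount, unit); amount is None if no number is found."""
    # Turn "½" into " 1/2 " so that "1½kg" and "1 ½ kg" look the same.
    for char, ascii_form in VULGAR.items():
        text = text.replace(char, f" {ascii_form} ")
    # Collapse runs of whitespace (including newlines) into single spaces.
    text = re.sub(r"\s+", " ", text).strip()

    match = AMOUNT_RE.match(text)
    amount = None
    if match["whole"] is not None:
        # Accept a German decimal comma as well as a decimal point.
        amount = float(match["whole"].replace(",", "."))
    if match["num"] is not None:
        amount = (amount or 0.0) + int(match["num"]) / int(match["den"])
    return amount, match["unit"].strip()


print(parse_amount("500g"))       # (500.0, 'g')
print(parse_amount("1 \u00BD kg"))  # (1.5, 'kg')
print(parse_amount(""))           # (None, '')
```

Inside the scraping loop, `menge` and `einheit` could then be filled from the two returned values instead of from `splitlist`, which keeps empty fields as an empty unit and a missing amount.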
    

    【Comments】:
