【Question title】: How to extract text from all paragraphs on a website that contain a specific string
【Posted】: 2020-01-12 21:03:54
【Question】:

I'm having a problem with this site. I want to extract the proverbs in my local language, together with their meanings, into a table.

import requests
from bs4 import BeautifulSoup

res2 = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup2 = BeautifulSoup(res2.content,'html')

Yoruba = []
English = []
for ol in soup2.findAll('ol'):
   proverb = ol.find('li')
   Yoruba.append(proverb.text)

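I have managed to extract the proverbs in my local language into a list. I also want to extract every sentence that begins with the string Meaning: into another list, for example: ['Your status in life dictates your attitude towards your peers', 'Behave in a mature manner so as to avoid a bad reputation.', etc.]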

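【Comments】:

  • Could you be more specific about what the problem is? Also, variable and function names should follow the lower_case_with_underscores style.
  • I have already extracted all of the local-language proverbs into a list (the code above). Extracting their meanings from the same site into a second list is where I am stuck.
  • You could provide a snippet of equivalent HTML instead of pointing to the website itself. See how to create a minimal reproducible example.

Tags: python web-scraping beautifulsoup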


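【Solution 1】:

This script scrapes the proverbs, their translations, and their meanings, and builds a pandas DataFrame from them. The list of meanings is in data['Meaning']: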

import re
import requests
import pandas as pd
from bs4 import BeautifulSoup

res = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
soup = BeautifulSoup(res.content,'html.parser')

data = {'Yoruba':[], 'Translation':[], 'Meaning':[]}
# Pair each <ol> with the <p> right after it (translation) and the <p> after that (meaning)
for yoruba, translation, meaning in zip(soup.select('ol'), soup.select('ol + p'), soup.select('ol + p + p')):
    data['Yoruba'].append(yoruba.get_text(strip=True))
    data['Translation'].append(re.sub(r'Translation:\s*', '', translation.get_text(strip=True)))
    data['Meaning'].append(re.sub(r'Meaning:\s*', '', meaning.get_text(strip=True)))

# print(data['Meaning']) # <-- your meanings list

df = pd.DataFrame(data)
print(df)

Prints:

                                               Yoruba                                        Translation                                            Meaning
0                         Ile oba t'o jo, ewa lo busi  When a king's palace burns down, the re-built ...  Necessity is mother of invention, creativity i...
1   Gbogbo alangba lo d'anu dele, a ko mo eyi t'in...  All lizards lie flat on their stomach and it i...  Everyone looks the same on the outside but eve...
2                           Ile la ti n ko eso re ode                             Charity begins at Home  A man cannot give what he does not have good o...
3                        A pę ko to jęun, ki ję ibaję  The person that eat late, will not eat spoiled...  It is more profitable to exercise patience whi...
4        Eewu bę loko Longę, Longę fun ara rę eewu ni  There is danger at Longę's farm (Longę is a na...  You should be extremely careful of situations ...
5   Bi Ēēgun nla ba ni ohùn o ri gontò, gontò na a...  If a big masquerade claims it doesn't see the ...  If an important man does not respect those les...
6   Kò sí ęni tí ó ma gùn ęşin tí kò ní ju ìpàkó. ...  No one rides a horse without moving his head, ...  Your status in life dictates your attitude tow...
7               Bí abá so òkò sójà ará ilé eni ní bá;  He who throws a stone in the market will hit h...  Be careful what you do unto others it may retu...
8             Agba ki wa loja, ki ori omo titun o wo.     Do not go crazy, do not let the new baby look.  Behave in a mature manner so avoid bad reputat...
9                      Adìẹ funfun kò mọ ara rẹ̀lágbà         The white chicken does not realize its age                                   Respect yourself
10                           Ọbẹ̀ kìí gbé inú àgbà mì   The soup does not move round in an elder’s belly                 You should be able to keep secrets

... and so on
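
The `re.sub` calls in the script above remove the label wherever it occurs in the text. A minimal sketch of a stricter variant that only removes the label when it leads the string; the helper name `strip_label` is hypothetical and not part of the answer's script:

```python
import re

def strip_label(text, label):
    """Remove a leading 'label:' prefix and the whitespace after it."""
    # ^ anchors the match to the start of the string, so a later
    # occurrence of the label inside the sentence is left untouched.
    pattern = r'^\s*' + re.escape(label) + r':\s*'
    return re.sub(pattern, '', text).strip()

print(strip_label('Meaning: Respect yourself', 'Meaning'))
# Respect yourself
print(strip_label('Translation:  The white chicken does not realize its age', 'Translation'))
# The white chicken does not realize its age
```

Text without the label is returned unchanged (apart from surrounding whitespace), so the helper can be applied to every paragraph without a separate check.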

【Comments】:

    【Solution 2】:

    Just search all paragraphs and check whether each paragraph's text starts with "Meaning:".

    Try this:

    import requests
    from bs4 import BeautifulSoup
    
    res2 = requests.get('https://steemit.com/nigeria/@leopantro/50-yoruba-proverbs-and-idioms')
    soup2 = BeautifulSoup(res2.content,'html.parser')
    
    yoruba = []
    english = []
    for ol in soup2.findAll('ol'):
        proverb = ol.find('li')
        yoruba.append(proverb.text)
    
    for paragraph in soup2.findAll('p'):
        text = paragraph.get_text(strip=True)
        if text.startswith("Meaning:"):
            # Drop the label, then any whitespace left after it
            english.append(text[len("Meaning:"):].strip())

    print(english)
    

    Prints:

    ['Necessity is mother of invention, creativity is often achieved after overcoming many difficulties.',
     'Everyone looks the same on the outside but everyone has problems that are invisible to outsiders.',
    ...
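
The paragraph filter above can be tried offline on a small HTML snippet instead of the live page. This is a minimal sketch, assuming each proverb is an `<ol>` followed by a `Translation:` paragraph and a `Meaning:` paragraph. That layout is inferred from the selectors used in Solution 1 and has not been checked against the site. `SAMPLE_HTML` and `extract_meanings` are hypothetical names:

```python
from bs4 import BeautifulSoup

# Assumed page layout (not verified against the live site):
# <ol> with the proverb, then a "Translation:" <p>, then a "Meaning:" <p>.
SAMPLE_HTML = """
<ol><li>Adìẹ funfun kò mọ ara rẹ̀lágbà</li></ol>
<p>Translation: The white chicken does not realize its age</p>
<p>Meaning: Respect yourself</p>
<ol start="2"><li>Ọbẹ̀ kìí gbé inú àgbà mì</li></ol>
<p>Translation: The soup does not move round in an elder’s belly</p>
<p>Meaning: You should be able to keep secrets</p>
"""

def extract_meanings(html, label="Meaning:"):
    """Return the text of every <p> that starts with label, without the label."""
    soup = BeautifulSoup(html, "html.parser")
    meanings = []
    for paragraph in soup.find_all("p"):
        text = paragraph.get_text(strip=True)
        if text.startswith(label):
            # Slice off the label and trim the whitespace that follows it.
            meanings.append(text[len(label):].strip())
    return meanings

print(extract_meanings(SAMPLE_HTML))
# ['Respect yourself', 'You should be able to keep secrets']
```

Passing `label="Translation:"` would collect the translations the same way.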
    

    【Comments】:
