【Question Title】: How can I access a tag's value inside an id with BeautifulSoup in Python?
【Posted】: 2022-11-14 08:28:24
【Question Description】:

I am trying to extract data from a website with BeautifulSoup in Python, but the page layout confuses me a bit and I am not sure how to go about it. What I actually want is to pull out only certain pieces of data from each page: the title, the examples, the meaning and the origin. How can I do that?

I will share my own code below, but it is not correct:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import json

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
url = "https://www.englishclub.com/ref/Idioms/"


mylist = [
    "A",
    "B",
    "C",
    "D",
    "E",
    "F",
    "G",
    "H",
    "I",
    "J",
    "K",
    "L",
    "M",
    "N",
    "O",
    "P",
    "Q",
    "R",
    "S",
    "T",
    "U",
    "V",
    "W"
]

list = {}
idiomsUrls=[]


for i in range(23):

    list[mylist[i]] = []
    result = requests.get(url+mylist[i]+"/", headers = headers)
    doc = BeautifulSoup(result.text, "html.parser")
    idiomsUrls = doc.select('.linktitle a')

    for tag in idiomsUrls:
        result = requests.get(tag['href'])
        doc = BeautifulSoup(result.text,"html.parser")
        idioms = doc.select('main')
        

with open('idioms.json', 'w', encoding='utf-8') as f:
    json.dump(list, f, ensure_ascii=False, indent=4)

I have shared a screenshot of the data I want to capture.

The data I want to capture is the idiom title in the h1 tag (for example, "above board" here), then the meaning and the example section below it. Below the examples (the ul/li tags) there is also a section called origin, and I cannot figure out how to grab these parts.

【Question Comments】:

    Tags: python web-scraping beautifulsoup


    【Solution 1】:

    Keep it simple: select the more specific elements by tag, id or class, and avoid shadowing built-in names such as list with your own variables:

    data = []
    
    for i in mylist:
        # fetch the index page for one letter and collect the idiom links
        result = requests.get(url + i + "/", headers=headers)
        doc = BeautifulSoup(result.text, "html.parser")
    
        for tag in doc.select('.linktitle a'):
            # fetch each idiom page and pull title, meaning and examples
            result = requests.get(tag['href'], headers=headers)
            doc = BeautifulSoup(result.text, "html.parser")
            data.append({
                'idiom': doc.h1.get_text(strip=True),
                'meaning': doc.select_one('h1 ~ h2 + p').get_text(strip=True),
                'examples': [e.get_text(strip=True) for e in doc.select('main ul li')]
            })
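
    The question also asks about the origin section, which the selectors above do not capture. A minimal sketch of how it could be added inside the inner loop, assuming (this is an assumption, not verified against every idiom page) that the origin text sits in a <p> right after a heading whose text contains "Origin"; pages without such a section simply get None:
    
    # inside the inner loop, before data.append(...)
    # assumption: an h2/h3 heading containing "Origin" is followed by a <p> with the text
    origin_heading = doc.find(lambda t: t.name in ("h2", "h3")
                              and "origin" in t.get_text(strip=True).lower())
    origin_p = origin_heading.find_next_sibling("p") if origin_heading else None
    origin = origin_p.get_text(strip=True) if origin_p else None
    
    The resulting origin value can then be stored as one more key in the dict that is appended to data.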
    

    Example

    import requests
    from bs4 import BeautifulSoup
    
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
    url = "https://www.englishclub.com/ref/Idioms/"
    
    
    mylist = ["A"] #...
    
    data = []
    
    for i in mylist:
    
        # index page for one letter, e.g. https://www.englishclub.com/ref/Idioms/A/
        result = requests.get(url + i + "/", headers=headers)
        doc = BeautifulSoup(result.text, "html.parser")
    
        for tag in doc.select('.linktitle a'):
            # detail page of a single idiom
            result = requests.get(tag['href'], headers=headers)
            doc = BeautifulSoup(result.text, "html.parser")
            data.append({
                'idiom': doc.h1.get_text(strip=True),
                'meaning': doc.select_one('h1 ~ h2 + p').get_text(strip=True),
                'examples': [e.get_text(strip=True) for e in doc.select('main ul li')]
            })
    
    data
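    (The bare data expression on the last line only displays the result when run in a notebook or REPL; in a plain script you would print(data) or write it to a file, as sketched after the output below.)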
    

    Output

    [{'idiom': 'above board',
      'meaning': "If something is above board, it's been done in a legal and honest way.",
      'examples': ["I'm sure the deal was completely above board as I know James well and he'd never do anything illegal or corrupt.",
       'The minister claimed all the appointments were above board and denied claims that some positions had been given to his friends.']},
     {'idiom': 'above the law',
      'meaning': 'If someone is above the law, they are not subject to the laws of a society.',
      'examples': ["Just because his father is a rich and powerful man, he seems to think he's above the law and he can do whatever he likes.",
       'In a democracy, no-one is above the law - not even a president or a prime-minister.']},
     {'idiom': "Achilles' heel",
      'meaning': "An Achilles' heel is a weakness that could result in failure.",
      'examples': ["He's a good golfer, but his Achilles' heel is his putting and it's often made him lose matches.",
       "The country's dependence on imported oil could prove to be its Achilles' heel if prices keep on rising."]},...]
    

    【Comments】:

    • Oh, thank you very much.