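【Posted】: 2022-11-14 08:28:24
【Question】:
I am trying to extract data from a website using BeautifulSoup in Python, but the page structure confuses me a bit and I don't quite understand how to go about it. What I actually want is to extract only certain data: the title, meaning, examples, and origin from each idiom page. How can I do that?
I'll share my own code, though it is not correct yet: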
import json

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}
url = "https://www.englishclub.com/ref/Idioms/"
# Initial letters to crawl (A through W, as in the original list).
mylist = list("ABCDEFGHIJKLMNOPQRSTUVW")

# Renamed from `list`, which shadowed the built-in.
idioms_by_letter = {}
for letter in mylist:
    idioms_by_letter[letter] = []
    # Fetch the index page for this letter and collect the idiom links.
    result = requests.get(url + letter + "/", headers=headers)
    doc = BeautifulSoup(result.text, "html.parser")
    idiomsUrls = doc.select('.linktitle a')
    for tag in idiomsUrls:
        # Fetch each idiom's own page.
        result = requests.get(tag['href'], headers=headers)
        doc = BeautifulSoup(result.text, "html.parser")
        # The whole <main> element is selected here, but nothing is
        # extracted from it or stored yet; this is the part I am stuck on.
        idioms = doc.select('main')

with open('idioms.json', 'w', encoding='utf-8') as f:
    json.dump(idioms_by_letter, f, ensure_ascii=False, indent=4)
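I have shared a screenshot of the data I want to capture.
The data I want to capture is the idiom title in the h1 tag (here, for example, "above board"), followed by the meaning and the examples section below it. Below the ul and li tags of the examples there is also a section called origin. I can't figure out how to capture these parts.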
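One way to approach this is to parse each idiom page with a small helper that returns the four fields as a dict. The following is only a sketch under stated assumptions: the function name `parse_idiom` and the markup in `SAMPLE_HTML` are hypothetical, written to match the description above (an h1 title, a paragraph starting with "Meaning", a ul of examples, and a paragraph starting with "Origin"). The real page may label or nest these parts differently, so the selectors should be checked against the actual HTML in the browser's inspector.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the description above; NOT copied from the site.
SAMPLE_HTML = """
<main>
  <h1>above board</h1>
  <p><strong>Meaning:</strong> If something is above board, it is honest and legal.</p>
  <h3>For example</h3>
  <ul>
    <li>The deal was completely above board.</li>
    <li>Make sure everything is above board.</li>
  </ul>
  <p><strong>Origin:</strong> Possibly from card playing.</p>
</main>
"""


def parse_idiom(html):
    """Extract title, meaning, examples and origin from one idiom page.

    Assumes the layout shown in SAMPLE_HTML; missing parts come back
    as None (or an empty list for examples) instead of raising.
    """
    soup = BeautifulSoup(html, "html.parser")
    # Fall back to the whole document if there is no <main> element.
    main = soup.select_one("main") or soup

    h1 = main.find("h1")
    title = h1.get_text(strip=True) if h1 else None

    # Assumption: the first <ul> inside <main> holds the examples.
    ul = main.find("ul")
    examples = [li.get_text(" ", strip=True) for li in ul.find_all("li")] if ul else []

    meaning = None
    origin = None
    for p in main.find_all("p"):
        text = p.get_text(" ", strip=True)
        label = text.lower()
        # Assumption: the paragraphs start with a "Meaning:" / "Origin:" label.
        if label.startswith("meaning"):
            meaning = text.split(":", 1)[-1].strip()
        elif label.startswith("origin"):
            origin = text.split(":", 1)[-1].strip()

    return {"title": title, "meaning": meaning, "examples": examples, "origin": origin}


idiom = parse_idiom(SAMPLE_HTML)
print(idiom["title"], idiom["examples"])
```

In the crawl loop, the result of `parse_idiom(result.text)` could be appended to the list for the current letter before the dict is written out with `json.dump`, so that the JSON file contains the extracted fields rather than empty lists.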
【Discussion】:
Tags: python web-scraping beautifulsoup