Extracting an attribute value with BeautifulSoup: answers

【Question title】: Extracting an attribute value with beautifulsoup
【Posted】: 2011-02-06 10:33:23
【Question】:

I am trying to extract the content of a single "value" attribute from a specific "input" tag on a web page. I use the following code:

import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)

inputTag = soup.findAll(attrs={"name" : "stainfo"})

output = inputTag['value']

print str(output)

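I get TypeError: list indices must be integers, not str

From the BeautifulSoup documentation I understood that strings should not be a problem here... but I am no expert and may have misunderstood.

Any suggestions are much appreciated!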

【Comments】:

Tags: python parsing attributes beautifulsoup

【Solution 1】:

.find_all() returns a list of all elements found, so:

input_tag = soup.find_all(attrs={"name" : "stainfo"})

input_tag is a list (possibly containing only one element). Depending on exactly what you want, you should either do:

output = input_tag[0]['value']

or use the .find() method, which returns only one (the first) element found:

input_tag = soup.find(attrs={"name": "stainfo"})
output = input_tag['value']
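To make the difference concrete, here is a minimal sketch using bs4 on an inline string. The markup and the value "station-42" are made up for illustration; the real page from the question is not reproduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the page from the question.
html = '<form><input name="stainfo" value="station-42"/></form>'
soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like ResultSet, so it must be indexed with an integer first.
matches = soup.find_all(attrs={"name": "stainfo"})
# find returns a single Tag (or None when nothing matches).
first = soup.find(attrs={"name": "stainfo"})

print(matches[0]["value"])  # station-42
print(first["value"])       # station-42
```

Indexing matches directly with a string, as in matches["value"], is what produces the TypeError from the question.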

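【Comments】:

Great stuff, thanks! Now I have a problem parsing the output, which is a big pile of non-ASCII characters, but I will ask about that in a separate question.
Shouldn't the "value" be accessed as described in stackoverflow.com/questions/2616659/…? What makes the code above work in this case? I thought you had to access the value via output = inputTag[0].contents
@Seth - no, because he is looking for the "value" attribute of the input tag, whereas .contents returns the text enclosed by the tag (I am .contents). (Replying only now because I had to double-check what was going on; thought others might benefit.)
Great answer. However, I would use inputTag[0].get('value') instead of inputTag[0]['value'] to avoid an error in case the tag has no value attribute.
What about links that are not linked directly from the site's home page? How can I get all links of the web pages, whether they are linked directly or indirectly?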
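The two points raised in the comments above (.get() versus indexing, and attributes versus .contents) can be sketched as follows. The markup is invented for the example; it deliberately omits the value attribute.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the input has no "value" attribute at all.
html = '<input name="stainfo"/><p>hello</p>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.find("input")

missing = tag.get("value")          # None instead of an exception
fallback = tag.get("value", "n/a")  # a default can be supplied

raised = False
try:
    tag["value"]                    # indexing raises KeyError when the attribute is absent
except KeyError:
    raised = True

# .contents holds a tag's children (here the text node), not its attributes.
paragraph_children = soup.find("p").contents
print(missing, fallback, raised, paragraph_children)
```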

【Solution 2】:

In Python 3.x, simply call get(attr_name) on the tag objects returned by find_all:

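from bs4 import BeautifulSoup

# Read the whole XML file into memory.
with open('conf//test1.xml', 'r') as xmlFile:
    xmlData = xmlFile.read()

# html.parser lowercases tag names, hence 'repeatingelement' below.
xmlSoup = BeautifulSoup(xmlData, 'html.parser')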

repElemList = xmlSoup.find_all('repeatingelement')

for repElem in repElemList:
    print("Processing repElem...")
    repElemID = repElem.get('id')
    repElemName = repElem.get('name')

    print("Attribute id = %s" % repElemID)
    print("Attribute name = %s" % repElemName)

Against the XML file conf//test1.xml, which looks like:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <singleElement>
        <subElementX>XYZ</subElementX>
    </singleElement>
    <repeatingElement id="11" name="Joe"/>
    <repeatingElement id="12" name="Mary"/>
</root>

Prints:

Processing repElem...
Attribute id = 11
Attribute name = Joe
Processing repElem...
Attribute id = 12
Attribute name = Mary
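For readers who want to try this without a file on disk, here is a self-contained sketch of the same idea using an inline XML string and f-strings. Variable names are my own. Note that html.parser lowercases tag names, which is why the lookup uses "repeatingelement" rather than "repeatingElement".

```python
from bs4 import BeautifulSoup

# Inline copy of the two repeating elements from the sample file.
xml_data = """<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<root>
    <repeatingElement id="11" name="Joe"/>
    <repeatingElement id="12" name="Mary"/>
</root>"""

soup = BeautifulSoup(xml_data, "html.parser")

# Collect (id, name) pairs; .get() returns None if an attribute is absent.
rows = [(el.get("id"), el.get("name")) for el in soup.find_all("repeatingelement")]

for elem_id, elem_name in rows:
    print(f"Attribute id = {elem_id}, name = {elem_name}")
```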

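【Comments】:

Do you mind if I edit this to follow PEP 8 and use a more modern string formatting method?
That's fine, go ahead.
This is the most useful and clearest answer. It should be the accepted one.

【Solution 3】:

In my case, for markup like: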

<input id="color" value="Blue"/>

the value can be retrieved with the snippet below.

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.abcd.com")
soup = BeautifulSoup(page.content, 'html.parser')
colorName = soup.find(id='color')
print(colorName['value'])
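The same lookup can be checked offline. This is a small sketch that parses the input tag from an inline string instead of fetching a page; the surrounding form element and the None guard are my additions.

```python
from bs4 import BeautifulSoup

# The input tag from this answer, wrapped in a made-up form element.
html = '<form><input id="color" value="Blue"/></form>'
soup = BeautifulSoup(html, "html.parser")

color_input = soup.find(id="color")
# find returns None when no element matches, so guard before indexing.
color_value = color_input["value"] if color_input is not None else None
print(color_value)  # Blue
```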

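【Comments】:

Where do you define color?
I guess he forgot to write colorName['value'] instead of color['value'].

【Solution 4】:

If you want to retrieve multiple attribute values from the source above, you can use findAll and a list comprehension to get everything you need: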

import urllib
f = urllib.urlopen("http://58.68.130.147")
s = f.read()
f.close()

from BeautifulSoup import BeautifulStoneSoup
soup = BeautifulStoneSoup(s)

inputTags = soup.findAll(attrs={"name" : "stainfo"})
### You may be able to do findAll("input", attrs={"name" : "stainfo"})

output = [x["value"] for x in inputTags]  # read each tag's "value" attribute

print output
### This will print a list of the values.
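The snippet above targets Python 2 and BeautifulSoup 3. A rough Python 3 / bs4 equivalent, run against invented inline markup rather than the live page, might look like this:

```python
from bs4 import BeautifulSoup

# Hypothetical markup with two matching inputs and one that should be skipped.
html = """
<input name="stainfo" value="alpha"/>
<input name="stainfo" value="beta"/>
<input name="other" value="ignored"/>
"""
soup = BeautifulSoup(html, "html.parser")

# Restrict the search to <input> tags whose name attribute is "stainfo".
input_tags = soup.find_all("input", attrs={"name": "stainfo"})

# The list comprehension reads the "value" attribute of each match.
values = [tag["value"] for tag in input_tags]
print(values)  # ['alpha', 'beta']
```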

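【Comments】:

【Solution 5】:

If you know which kind of tag carries these attributes, I would actually suggest a time-saving approach.

Suppose a tag xyz has an attribute named "staininfo".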

full_tag = soup.findAll("xyz")

Note that full_tag is a list:

for each_tag in full_tag:
    staininfo_attrb_value = each_tag["staininfo"]
    print(staininfo_attrb_value)

This way you get the staininfo attribute value of every xyz tag.
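One caveat with the loop above is that each_tag["staininfo"] raises KeyError for any xyz tag that lacks the attribute. A possible variant, sketched here with made-up markup, asks bs4 to return only tags that actually have it by passing True as the attribute filter:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the middle xyz tag has no staininfo attribute.
html = '<xyz staininfo="a"></xyz><xyz></xyz><xyz staininfo="b"></xyz>'
soup = BeautifulSoup(html, "html.parser")

# attrs={"staininfo": True} matches tags that have the attribute, whatever its value.
tagged = soup.find_all("xyz", attrs={"staininfo": True})
values = [tag["staininfo"] for tag in tagged]
print(values)  # ['a', 'b']
```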

【Comments】:

【Solution 6】:

You can also use this:

import requests
from bs4 import BeautifulSoup
import csv

url = "http://58.68.130.147/"
r = requests.get(url)
data = r.text

soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("input", attrs={"name":"stainfo"})

for val in get_details:
    get_val = val["value"]
    print(get_val)

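【Comments】:

How is this different from the older answers that are already here?

【Solution 7】:

I use this with BeautifulSoup 4.8.1 to get the values of the class attribute of certain elements: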

from bs4 import BeautifulSoup

html = "<td class='val1'/><td col='1'/><td class='val2' />"

bsoup = BeautifulSoup(html, 'html.parser')

for td in bsoup.find_all('td'):
    if td.has_attr('class'):
        print(td['class'][0])

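Note that class is a multi-valued attribute, so indexing with the 'class' key returns a list even when the attribute holds only one value.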
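A short sketch of that behaviour, with invented markup: class comes back as a list, while an ordinary attribute such as id comes back as a plain string. The get_attribute_list method is used here to obtain a list in both cases.

```python
from bs4 import BeautifulSoup

# Hypothetical cell with two classes and an id.
html = '<td class="val1 wide" id="cell-1"></td>'
td = BeautifulSoup(html, "html.parser").find("td")

classes = td["class"]                  # multi-valued attribute: list of strings
cell_id = td["id"]                     # ordinary attribute: a single string
id_list = td.get_attribute_list("id")  # always a list, regardless of attribute kind
class_string = " ".join(classes)       # rebuild the original attribute text

print(classes, cell_id, id_list, class_string)
```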

【Comments】:

【Solution 8】:

You can try a newer, powerful package called requests_html:

from requests_html import HTMLSession
session = HTMLSession()

r = session.get("https://www.bbc.co.uk/news/technology-54448223")
date = r.html.find('time', first = True) # finding a "tag" called "time"
print(date)  # you will have: <Element 'time' datetime='2020-10-07T11:41:22.000Z'>
# To get the text inside the "datetime" attribute use:
print(date.attrs['datetime']) # you will get '2020-10-07T11:41:22.000Z'

【Comments】:

【Solution 9】:

Here is an example of how to extract the href attribute of every a tag:

import requests as rq 
from bs4 import BeautifulSoup as bs

url = "http://www.cde.ca.gov/ds/sp/ai/"
page = rq.get(url)
html = bs(page.text, 'lxml')

hrefs = html.find_all("a")
all_hrefs = []
for href in hrefs:
    # print(href.get("href"))
    links = href.get("href")
    all_hrefs.append(links)

print(all_hrefs)
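In the loop above, href.get("href") yields None for anchors that have no href, so None values end up in the list. A possible refinement, sketched on inline markup with a placeholder base URL, filters with href=True and resolves relative links using urljoin from the standard library:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

base_url = "http://example.com/docs/"  # placeholder base URL, not from the answer
html = '<a href="/a.html">A</a><a name="anchor">no href</a><a href="b.html">B</a>'
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually carry an href attribute.
hrefs = [a["href"] for a in soup.find_all("a", href=True)]

# Resolve relative links against the page they came from.
absolute = [urljoin(base_url, h) for h in hrefs]
print(hrefs)
print(absolute)
```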

【Comments】:

【Solution 10】:

You can try gazpacho:

Install it with pip install gazpacho

Fetch the HTML and make the Soup with:

from gazpacho import get, Soup

soup = Soup(get("http://ip.add.ress.here/"))  # get directly returns the html

inputs = soup.find('input', attrs={'name': 'stainfo'})  # Find all the input tags

if inputs:
    if type(inputs) is list:
        for input in inputs:
            print(input.attr.get('value'))
    else:
        print(inputs.attr.get('value'))
else:
    print('No <input> tag found with the attribute name="stainfo"')

【Comments】: