【Question title】: BeautifulSoup scraping in Python: I can't find "title"
【Posted】: 2021-08-02 20:03:13
【Question】:

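I want to scrape data from a link, but I'm running into difficulties: either I can't find what I need, or I don't know how to select specific lists and text within the page. I'm using BeautifulSoup for this: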

import requests
import bs4

# LINK is the URL of the page being scraped
response = requests.get(LINK)
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text, 'html.parser')


for select in soup.select("script",type="text/javascript"):
    print(select)

where LINK is an https URL. As output I get:

OTHER <script type="text/javascript"> WRITINGS 

<script type="text/javascript">
$(function () {
    $('#chart_t_2021').highcharts({
    chart: {
        ...
    },

    title: {
        text: 'I WANT TO PRINT THIS TEXT'
    },
    ...
  })
});
</script>
<script type="text/javascript">
$(function () {
    $('#chart_2021').highcharts({
    title: {
        text: '...'
    },
    yAxis: {
        ...
    },
    xAxis: {tickPositions: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30] <!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
    },
    legend: {
        layout: 'vertical',
        align: 'center',
        verticalAlign: 'bottom'
    },

    plotOptions: {
        series: {
            pointStart: 15
        }
    },

    series: [{
        name: 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE',
        data: [0,0,0,0,0,0,0,0,0,3,1,8,12,21,22,13]<!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
    }, {
        name: 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE',
        data: [0,0,0,0,0,0,0,0,0,3,1,7,12,21,19,13]<!--I WOULD LIKE TO TAKE THIS LIST AND PUT IT IN A VARIABLE-->
    }]
  })
});</script>

OTHER <script type="text/javascript"> WRITINGS 

I tried this:

for select1 in soup.select("script",type="text/javascript"):
    for select2 in select1.select("title"):
        print(select2)

But it prints nothing. Can someone help me print at least the first title from my output?

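【Comments】:

  • Does this previous answer address your question? -- stackoverflow.com/a/35956388/13261176
  • No, because that one asks for the HTML title in general, whereas I am asking for the title inside the <script>, or more precisely the text at: script > function () > $('#chart_t_2021').highcharts > title > text

Tags: python web-scraping beautifulsoup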


【Solution 1】:

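The information you are trying to extract is inside JavaScript. BeautifulSoup only parses the HTML, so it cannot navigate this part. One approach is to pull out the pieces you need with regular expressions and convert the text to Python values with ast.literal_eval().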
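To see why the nested select in the question prints nothing, here is a minimal sketch. The `html` string is a trimmed-down, hypothetical stand-in for one of the page's script blocks, not the real page. It shows that the contents of a `<script>` tag reach BeautifulSoup as one plain string rather than as nested tags.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for one script block from the page.
html = """<script type="text/javascript">
$(function () {
    $('#chart_t_2021').highcharts({
    title: {
        text: 'I WANT TO PRINT THIS TEXT'
    }
  })
});
</script>"""

soup = BeautifulSoup(html, "html.parser")
script = soup.select_one('script[type="text/javascript"]')

# The JavaScript is not parsed into tags, so no selector can match inside it.
nested = script.select("title")
print(nested)  # []

# The whole script body is available as a single string instead.
body = script.string
print(isinstance(body, str))  # True
print("I WANT TO PRINT THIS TEXT" in body)  # True
```

Because `title` here is just a word inside that string, the regular-expression approach mentioned above works on the string directly.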

For example:

from bs4 import BeautifulSoup
from ast import literal_eval
import re

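def extract(pattern, script, var):
    """Append every match of pattern in the script's text to var as a Python value."""
    # script.string is None when the tag has no text (e.g. an empty script tag)
    if script.string:
        for value in re.findall(pattern, script.string):
            # literal_eval safely turns "'abc'" or "[1, 2]" into a str or a list
            var.append(literal_eval(value))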
    
    
html = """<<script text copied from question>>"""

soup = BeautifulSoup(html, 'html.parser')
titles = []
tickpositions = []
names = []
data = []

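# select() takes a CSS selector, so the type filter belongs inside the selector
for script in soup.select('script[type="text/javascript"]'):
    # raw strings keep the \[ and \] escapes intact for the regex engine
    extract(r"text: ('.*?')", script, titles)
    extract(r"tickPositions: (\[.*?\])", script, tickpositions)
    extract(r"name: ('.*?')", script, names)
    extract(r"data: (\[.*?\])", script, data)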
            
print(titles)
print(tickpositions)
print(names)
print(data)

For the data you provided, this gives output like the following:

['I WANT TO PRINT THIS TEXT', '...']
[[15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
['I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE', 'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE']
[[0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 8, 12, 21, 22, 13], [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 7, 12, 21, 19, 13]]
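If it helps to keep each series name together with its values, the lists above can be paired up afterwards. The following is only a sketch: the variable names `series`, `ticks`, and `by_tick` are made up for illustration, and it assumes the tick positions line up one-to-one with the data points, which seems plausible here since `pointStart` is 15 and each list has 16 entries.

```python
# Values copied from the output shown above.
tickpositions = [[15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30]]
names = ['I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE',
         'I WOULD LIKE TO TAKE THIS TEXT AND PUT IT IN A VARIABLE']
data = [[0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 8, 12, 21, 22, 13],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 1, 7, 12, 21, 19, 13]]

# Pair each series name with its data list (a list of dicts, because the
# placeholder names are identical and would collide as dictionary keys).
series = [{"name": name, "data": values} for name, values in zip(names, data)]

# Assumption: the i-th tick position corresponds to the i-th data point.
ticks = tickpositions[0]
by_tick = [dict(zip(ticks, s["data"])) for s in series]

print(by_tick[0][26])  # 8
print(by_tick[1][26])  # 7
```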

【Comments】:
