【Title】: get title inside link tag in HTML using beautifulsoup
【Posted】: 2017-07-07 06:58:13
【Description】:

I am extracting data from https://data.gov.au/dataset?organization=reservebankofaustralia&_groups_limit=0&groups=business and got the output I wanted, but the problem now is that the output is truncated: I get "Business Support an..." and "Reserve Bank of Aus....", not the full text. I want to print the whole text instead of "..." for all of them. Following jezrael's answer to "Fetching content from html and write fetched content in a specific format in CSV", I replaced lines 9 and 10 of my code with `org = soup.find_all('a', {'class':'nav-item active'})[0].get('title')` and `groups = soup.find_all('a', {'class':'nav-item active'})[1].get('title')`. When I run that on its own I get the error: list index out of range. What should I use to extract the complete sentence? I also tried `org = soup.find_all('span', class_="filtered pill")`, which gives a string-type answer when run alone but fails when run with the whole code.
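A minimal sketch (using hypothetical HTML modelled on the page) suggests why the IndexError occurs: on that page the class `nav-item active` sits on the `<li>` elements, not on the `<a>` tags, so `find_all('a', {'class': 'nav-item active'})` returns an empty list and indexing `[0]` raises "list index out of range".

```python
# Hypothetical snippet imitating the page's facet markup; the class
# "nav-item active" is on <li>, not on <a>.
from bs4 import BeautifulSoup

html = """
<li class="nav-item active">
  <a href="/x" title="Reserve Bank of Australia">
    <span class="filtered pill">Reserve Bank of Aus...</span>
  </a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

# Searching <a> tags for that class finds nothing, so [0] would fail.
print(len(soup.find_all('a', {'class': 'nav-item active'})))   # 0
# Searching <li> tags finds the element, so [0] works.
print(len(soup.find_all('li', class_='nav-item active')))      # 1
```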

【Comments】:

    Tags: python beautifulsoup


    【Solution 1】:

    All the longer text is stored in the title attribute; the shorter text is in the tag's text. So add a double if:

    import urllib.request

    import pandas as pd
    from bs4 import BeautifulSoup

    dfs = []

    # webpage_urls is the list of dataset-page URLs built earlier
    for url in webpage_urls:
        page = urllib.request.urlopen(url)
        soup = BeautifulSoup(page, "lxml")

        lobbying = {}
        # there are always exactly 2 active li, so select the first by [0] and the second by [1]
        l = soup.find_all('li', class_="nav-item active")

        org = l[0].a.get('title')
        if org == '':
            org = l[0].span.get_text()

        groups = l[1].a.get('title')
        if groups == '':
            groups = l[1].span.get_text()

        prefix = "https://data.gov.au"
        data2 = soup.find_all('h3', class_="dataset-heading")
        for element in data2:
            lobbying[element.a.get_text()] = {
                "link": prefix + element.a["href"],
                "Organisation": org,
                "Group": groups,
            }

        # build the per-page DataFrame once, after the inner loop has finished
        df = pd.DataFrame.from_dict(lobbying, orient='index') \
               .rename_axis('Titles').reset_index()
        dfs.append(df)
    

    df = pd.concat(dfs, ignore_index=True)
    df1 = df.drop_duplicates(subset = 'Titles').reset_index(drop=True)
    
    # strip the trailing "(N)" counters from the facet labels
    df1['Organisation'] = df1['Organisation'].str.replace(r'\(\d+\)', '', regex=True)
    df1['Group'] = df1['Group'].str.replace(r'\(\d+\)', '', regex=True)
    

    print (df1.head())
    
                                                  Titles  \
    0                                     Banks – Assets   
    1  Consolidated Exposures – Immediate and Ultimat...   
    2  Foreign Exchange Transactions and Holdings of ...   
    3  Finance Companies and General Financiers – Sel...   
    4                   Liabilities and Assets – Monthly   
    
                                                    link  \
    0           https://data.gov.au/dataset/banks-assets   
    1  https://data.gov.au/dataset/consolidated-expos...   
    2  https://data.gov.au/dataset/foreign-exchange-t...   
    3  https://data.gov.au/dataset/finance-companies-...   
    4  https://data.gov.au/dataset/liabilities-and-as...   
    
                    Organisation                            Group  
    0  Reserve Bank of Australia  Business Support and Regulation  
    1  Reserve Bank of Australia  Business Support and Regulation  
    2  Reserve Bank of Australia  Business Support and Regulation  
    3  Reserve Bank of Australia  Business Support and Regulation  
    4  Reserve Bank of Australia  Business Support and Regulation  
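    The "double if" fallback above can be shown in isolation. This is a minimal, self-contained sketch with a hypothetical HTML snippet: take the full text from the `<a>` tag's title attribute, and fall back to the `<span>` text when the title is empty.

```python
# Hypothetical HTML: one link with an empty title, one with a full title.
from bs4 import BeautifulSoup

html = """
<li class="nav-item active">
  <a href="/a" title=""><span class="filtered pill">Short label</span></a>
</li>
<li class="nav-item active">
  <a href="/b" title="Business Support and Regulation"><span>Business Support an...</span></a>
</li>
"""
soup = BeautifulSoup(html, "html.parser")

labels = []
for li in soup.find_all('li', class_='nav-item active'):
    text = li.a.get('title')
    if text == '':
        text = li.span.get_text()   # fall back to the visible (short) text
    labels.append(text)

print(labels)  # ['Short label', 'Business Support and Regulation']
```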
    

    【Comments】:

    • Thank you very much. Could you explain what the logic `if org == '':` is doing?
    • Checking the HTML, some title attributes are empty, which is a problem, so the if is needed. If you omit it, those entries have no text.
    • @jezrael, :), have a nice day!
    【Solution 2】:

    I guess this is what you are trying to do. Each link has a title attribute, so here I simply check whether a title attribute exists and, if it does, print it.

    The blank lines appear because a few links have title=""; you can use a conditional to skip those and then collect all the titles.

    >>> l = soup.find_all('a')
    >>> for i in l:
    ...     if i.has_attr('title'):
    ...             print(i['title'])
    ... 
    Remove
    Remove
    Reserve Bank of Australia
    
    Business Support and Regulation
    
    
    
    
    
    
    
    
    
    
    
    
    
    Creative Commons Attribution 3.0 Australia
    >>> 
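    The conditional mentioned above can be sketched like this, using a hypothetical snippet: skip anchors whose title attribute is missing or an empty string, which removes the blank lines from the output.

```python
# Hypothetical snippet: anchors with a real title, an empty title, and no title.
from bs4 import BeautifulSoup

html = """
<a title="Remove">x</a>
<a title="">y</a>
<a title="Reserve Bank of Australia">z</a>
<a href="/no-title">w</a>
"""
soup = BeautifulSoup(html, "html.parser")

# keep only anchors with a non-empty title attribute
titles = [a['title'] for a in soup.find_all('a')
          if a.has_attr('title') and a['title'] != '']
print(titles)  # ['Remove', 'Reserve Bank of Australia']
```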
    

    【Comments】:

    • Thanks, it works for one URL; I have now run the program for all the URLs. Let's see what it outputs.
    • @shashank, when running it over multiple URLs I got the same result. I think it needs a loop inside a loop.
    • Can you elaborate on what you are trying to do? I mean, how do you intend to fetch the data?
    • @shashank, done, thanks for your attention. The link given in my question describes in detail what I want to do.