【Question title】: BeautifulSoup output does not match Chrome's Inspect Element when web scraping with Python
【Posted】: 2018-10-05 08:19:26
【Question】:

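I am currently trying to scrape protein sequences from the NCBI protein database. At this point, the user can search for a protein and I can get the link to the first result the database returns. However, when I run that page through BeautifulSoup, the soup does not match what Chrome's Inspect Element shows, and it does not contain the sequence at all.

Here is my current code: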

import string
import requests
from bs4 import BeautifulSoup

def getSequence():
    searchProt = input("Enter a Protein Name!:")
    if searchProt != '':
        searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
        page = requests.get(searchString)
        soup = BeautifulSoup(page.text, 'html.parser')
        soup = str(soup)
        accIndex = soup.find("a")
        accessionStart = soup.find('<dd>',accIndex)
        accessionEnd = soup.find('</dd>', accessionStart + 4)
        accession = soup[accessionStart + 4: accessionEnd]
        newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
        try:
            newPage = requests.get(newSearchString)
            #This is where it fails
            newSoup = BeautifulSoup(newPage.text, 'html.parser')
            aaList = []
            spaceCount = newSoup.count("ff_line")
            print(spaceCount)
            for i in range(spaceCount):
                startIndex = newSoup.find("ff_line")
                startIndex = newSoup.find(">", startIndex) + 2
                nextAA = newSoup[startIndex]
                while nextAA in string.ascii_lowercase:
                    aaList.append(nextAA)
                    startIndex += 1
                    nextAA = newSoup[startIndex]
            return aaList
        except:
            print("Please Enter a Valid Protein")

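I have been running it with the search term "p53" and got the link: here

I have looked through a number of web-scraping posts on this site and tried many things, including installing Selenium and using different parsers. I am still confused about why the two do not match. (Sorry if this is a duplicate question; I am very new to web scraping and currently have a concussion, so I am looking for feedback on my specific case.)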

【Comments】:

    Tags: python html web-scraping beautifulsoup python-requests


    【Solution 1】:

    The code below uses Selenium to extract the protein sequence you want. I modified your original code so that it gives the desired result.

    from bs4 import BeautifulSoup
    from selenium import webdriver
    import requests

    # Selenium drives a real browser, so the page's scripts run before the HTML is read.
    driver = webdriver.Firefox()

    def getSequence():
        searchProt = input("Enter a Protein Name!:")
        if searchProt != '':
            searchString = "https://www.ncbi.nlm.nih.gov/protein/?term=" + searchProt
            page = requests.get(searchString)
            soup = BeautifulSoup(page.text, 'html.parser')
            soup = str(soup)
            # Treat the contents of the first <dd>...</dd> pair in the raw HTML as the accession.
            accIndex = soup.find("a")
            accessionStart = soup.find('<dd>',accIndex)
            accessionEnd = soup.find('</dd>', accessionStart + 4)
            accession = soup[accessionStart + 4: accessionEnd]
            newSearchString = "https://www.ncbi.nlm.nih.gov/protein/" + accession
            try:
                # Load the record page in the browser and parse the rendered source.
                driver.get(newSearchString)
                html = driver.page_source
                newSoup = BeautifulSoup(html, "lxml")
                # Each chunk of the sequence sits in an element with class "ff_line".
                ff_tags = newSoup.find_all(class_="ff_line")
                aaList = []
                for tag in ff_tags:
                    # Remove the spaces that separate blocks of residues.
                    aaList.append(tag.text.strip().replace(" ",""))
                protSeq = "".join(aaList)
                return protSeq
            except Exception:
                print("Please Enter a Valid Protein")

    sequence = getSequence()
    print(sequence)
    # Close the browser window once the sequence has been printed.
    driver.quit()
    

    For the input "p53", it produces the following output:

    meepqsdlsielplsqetfsdlwkllppnnvlstlpssdsieelflsenvtgwledsggalqgvaaaaastaedpvtetpapvasapatpwplsssvpsyktfqgdygfrlgflhsgtaksvtctyspslnklfcqlaktcpvqlwvnstpppgtrvramaiykklqymtevvrrcphherssegdslappqhlirvegnlhaeylddkqtfrhsvvvpyeppevgsdcttihynymcnsscmggmnrrpiltiitledpsgnllgrnsfevricacpgrdrrteeknfqkkgepcpelppksakralptntssspppkkktldgeyftlkirgherfkmfqelnealelkdaqaskgsedngahssylkskkgqsasrlkklmikregpdsd
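
The answer does not say why Selenium succeeds where `requests` did not. A likely explanation, stated here as an assumption, is that the record page inserts the sequence lines with JavaScript after the initial load, so the HTML that `requests` receives never contains the `ff_line` elements, while the browser-rendered source does. A minimal sketch of how one might check this follows; `contains_marker`, `compare_sources`, and the two sample strings are hypothetical names and hand-written stand-ins, not real NCBI markup.

```python
def contains_marker(html: str, marker: str = "ff_line") -> bool:
    """Return True if the marker text (a class name here) appears in the HTML."""
    return marker in html


def compare_sources(static_html: str, rendered_html: str, marker: str = "ff_line") -> str:
    """Report where the marker appears: in the raw response, only after rendering, or nowhere."""
    if contains_marker(static_html, marker):
        return "static"         # the raw response already has it; requests alone is enough
    if contains_marker(rendered_html, marker):
        return "rendered-only"  # added by scripts; a browser (e.g. Selenium) is needed
    return "missing"            # wrong page, or the marker name is wrong


# Hypothetical stand-ins for the two HTML sources (not actual NCBI markup).
static_example = '<div id="viewer"></div>'
rendered_example = '<div id="viewer"><span class="ff_line">meepqsdlsi</span></div>'

print(compare_sources(static_example, rendered_example))  # prints "rendered-only"
```

In practice, `static_html` would be something like `requests.get(url).text` and `rendered_html` would be `driver.page_source`. A result of "rendered-only" would be consistent with the mismatch described in the question.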
    

    【Discussion】:
