【Question Title】: Unable to scrape the name from the inner page of each result using requests
【Posted】: 2020-04-17 21:05:04
【Question Description】:

I've created a script in Python that uses a POST http request to fetch search results from a webpage. To populate the results, it is necessary to click on the fields in the order shown here. A new page will then appear, and this is how the results get populated.

The first page has ten results, and the script below can parse them flawlessly.

What I'm trying to do now is make use of the results to reach their inner page in order to parse the Sole Proprietorship Name (English) from there.

website address

This is what I have tried so far:

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"

payload = {
    'QueryString': '0',
    'SourceAppCode': 'cambodia-br-soleproprietorships',
    'OriginalVersionIdentifier': '',
    '_CBASYNCUPDATE_': 'true',
    '_CBHTMLFRAG_': 'true',
    '_CBNAME_': 'buttonPush'
}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = res.url.split("&")[0].replace("view.", "update.")
    node = re.findall(r"nodeW\d.+?-Advanced",res.text)[0].strip()
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload[node] = 'N'
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[2]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'",res.text)[0].strip()

    res = s.post(target_url,data=payload)
    soup = BeautifulSoup(res.content, 'html.parser')
    for item in soup.find_all("span", class_="appReceiveFocus")[3:]:
        print(item.text)

How can I parse the Name (English) from the inner page of each result using requests?

【Question Comments】:

  • The question you linked to is not the same as the one I asked here @αԋɱҽԃ αмєяιcαη. This one is about scraping the name from a different depth. Thanks.
  • I believe I asked you about your end goal before and you confirmed you could handle the rest, but what I'm seeing now is you moving from one question to another, which suggests you need someone to keep writing the code for you.
  • @asmitu Is it really necessary to visit the inner pages to scrape the English name from there? Can't you scrape it from the appReceiveFocus elements? All the search results appear to include the English name in the link.
  • Yes, I noticed that while creating this post. The thing is, I will also be parsing other fields from that page, so visiting the inner pages is necessary.

Tags: python python-3.x web-scraping beautifulsoup python-requests


【Solution 1】:

This is one of the ways you can parse the names from the inner pages of the site, and then the email addresses from the Addresses tab. I added the .get_email() function only because I wanted to show you how to parse content from a different tab.

import re
import requests
from bs4 import BeautifulSoup

url = "https://www.businessregistration.moc.gov.kh/cambodia-master/service/create.html?targetAppCode=cambodia-master&targetRegisterAppCode=cambodia-br-soleproprietorships&service=registerItemSearch"
result_url = "https://www.businessregistration.moc.gov.kh/cambodia-master/viewInstance/update.html?id={}"
base_url = "https://www.businessregistration.moc.gov.kh/cambodia-br-soleproprietorships/viewInstance/update.html?id={}"

def get_names(s):
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
    res = s.get(url)
    target_url = result_url.format(res.url.split("id=")[1])
    soup = BeautifulSoup(res.text,"lxml")
    payload = {i['name']:i.get('value','') for i in soup.select('input[name]')}

    payload['QueryString'] = 'a'
    payload['SourceAppCode'] = 'cambodia-br-soleproprietorships'
    payload['_CBNAME_'] = 'buttonPush'
    payload['_CBHTMLFRAG_'] = 'true'
    payload['_VIKEY_'] = re.findall(r"viewInstanceKey:'(.*?)',", res.text)[0].strip()
    payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
    payload['_CBNODE_'] = re.findall(r"Callback\('(.*?)','buttonPush", res.text)[-1]
    payload['_CBHTMLFRAGNODEID_'] = re.findall(r"AsyncWrapper(W\d.+?)'",res.text)[0].strip()

    res = s.post(target_url,data=payload)
    soup = BeautifulSoup(res.text,"lxml")
    payload.pop('_CBHTMLFRAGNODEID_')
    payload.pop('_CBHTMLFRAG_')
    payload.pop('_CBHTMLFRAGID_')

    for item in soup.select("a[class*='ItemBox-resultLeft-viewMenu']"):
        payload['_CBNAME_'] = 'invokeMenuCb'
        payload['_CBVALUE_'] = ''
        payload['_CBNODE_'] = item['id'].replace('node','')

        res = s.post(target_url,data=payload)
        soup = BeautifulSoup(res.text,'lxml')
        address_url = base_url.format(res.url.split("id=")[1])
        node_id = re.findall(r"taba(.*)_",soup.select_one("a[aria-label='Addresses']")['id'])[0]
        payload['_CBNODE_'] = node_id
        payload['_CBHTMLFRAGID_'] = re.findall(r"guid:(.*?),", res.text)[0].strip()
        payload['_CBNAME_'] = 'tabSelect'
        payload['_CBVALUE_'] = '1'
        eng_name = soup.select_one(".appCompanyName + .appAttrValue").get_text()
        yield from get_email(s,eng_name,address_url,payload)

def get_email(s,eng_name,url,payload):
    res = s.post(url,data=payload)
    soup = BeautifulSoup(res.text,'lxml')
    email = soup.select_one(".EntityEmailAddresses:contains('Email') .appAttrValue").get_text()
    yield eng_name,email

if __name__ == '__main__':
    with requests.Session() as s:
        for item in get_names(s):
            print(item)

The output looks like:

('AMY GEMS', 'amy.n.company@gmail.com')
('AHARATHAN LIN LIANJIN FOOD FLAVOR', 'skykoko344@gmail.com')
('AMETHYST DIAMOND KTV', 'twobrotherktv@gmail.com')
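The moving parts in the approach above are the dynamic tokens (viewInstanceKey, the guid used as _CBHTMLFRAGID_, and the Callback node id) that the page embeds in inline JavaScript on every load. Here is a minimal, self-contained sketch of how those regexes pull the tokens out, run against a made-up HTML fragment (the fragment and token values are assumptions for illustration, not the site's actual markup):

```python
import re

# Hypothetical inline-script fragment mimicking what the regexes target;
# the real page embeds fresh values of these tokens on every GET.
sample = """
<script>
  someViewInit({viewInstanceKey:'cambodia-master_12345',
    guid:frag-67890, async:true});
  Callback('W42','buttonPush');
</script>
"""

# Same patterns as in the script above, applied to the sample fragment.
vikey = re.findall(r"viewInstanceKey:'(.*?)',", sample)[0].strip()
fragid = re.findall(r"guid:(.*?),", sample)[0].strip()
node = re.findall(r"Callback\('(.*?)','buttonPush", sample)[-1]

print(vikey)   # cambodia-master_12345
print(fragid)  # frag-67890
print(node)    # W42
```

Because these values change per session, they have to be re-scraped from each response before being sent back in the next POST payload.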

【Comments】:

  • Before running the script above, make sure your BeautifulSoup version is 4.7.0 or later so that it supports the pseudo-selector I used in the script to parse the names and emails from there.
【Solution 2】:

To get just the Name (English), you only need to replace print(item.text) with the line below, which prints e.g. AMY GEMS:

print(item.text.split('/')[1].split('(')[0].strip())
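As a sketch of what that one-liner does, assuming each result's link text joins the Khmer name, the English name, and a registration number in the form "khmer / ENGLISH (number)" (the sample string below is made up for illustration):

```python
# Hypothetical result text; the assumption is that the listing joins the
# Khmer and English names with "/" and appends a number in parentheses.
item_text = "អេមី ជែមស៍ / AMY GEMS (00012345)"

# Take the part after "/", drop the parenthesised number, trim spaces.
english = item_text.split('/')[1].split('(')[0].strip()
print(english)  # AMY GEMS
```

Note that this only recovers the name shown on the results listing; it does not visit the inner page, which is what the question asks for.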

【Comments】:

  • At least try reading the post before proposing a solution @Harish Vutukuri. If I wanted to parse that portion from the landing page, I could have done that in the first place. However, that is not what this post is about. Thanks.