Iterating through a nested list to append web scrape results

【Question title】: Iterating through nested list for appending web scrape result
【Posted】: 2021-02-17 11:17:09
【Question description】:

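I am trying to iterate through the list "company", run a Google search for each element in it, scrape the result, and append the Google result to each element.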

The company variable looks like this and consists of 895 lists:

company = [['24/7 CUSTOMER Private Limited'], ['3 K TECHNOLOGIES Limited'], ['3I INFOTECH B P O Limited'], ['3I INFOTECH CONSULTANCY SERVICES Limited'], ['3I INFOTECH Limited'], ['4D CORPORATION Private Limited'], ['8K MILES SOFTWARE SERVICES Limited'], ['A B P Private Limited']...]]

I would like the output to be:

[['24/7 CUSTOMER Private Limited', 'New Delhi India'], ['3 K TECHNOLOGIES Limited', 'Palo Alto United States'], ['3I INFOTECH B P O Limited', 'New Delhi India'], ['3I INFOTECH CONSULTANCY SERVICES Limited', 'New York United States'], ['3I INFOTECH Limited', 'New York United States'], ['4D CORPORATION Private Limited', 'Mumbai India'], ['8K MILES SOFTWARE SERVICES Limited', 'New Delhi India'], ['A B P Private Limited', 'New Delhi India']...]]

This is a function that takes a company name as its argument and returns the scraped result:

import re

import requests
from bs4 import BeautifulSoup

def scrape(row):
    query = "https://www.google.com/search?q=" + row + " headquarters"
    r = requests.get(query)
    html_doc = r.text
    soup = BeautifulSoup(html_doc, 'html.parser')
    cleanr = re.compile('<.*?>')
    snippett = re.sub(cleanr, '', str(soup.find_all('div', attrs={'class':'BNeawe s3v9rd AP7Wnd'})[0]))

    return snippett

I then call the function by iterating through the company list and appending the result:

for lst in company():
    for row in lst():    
        hq_result = scrape(row)
        row.append(hq_result)

This error occurs: IndexError: list index out of range

【Question comments】:

It looks like soup.find_all() is returning an empty list.
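
A minimal sketch of what this comment describes, assuming the response did not contain the expected class. The HTML string is made up for illustration and is not an actual Google response.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a response that lacks the expected <div>.
html_doc = "<html><body><div class='other'>no snippet here</div></body></html>"
soup = BeautifulSoup(html_doc, "html.parser")

# Same lookup as in the question: nothing matches, so the result is empty.
matches = soup.find_all("div", attrs={"class": "BNeawe s3v9rd AP7Wnd"})
print(len(matches))

# Indexing the empty result is what raises the reported error.
try:
    matches[0]
except IndexError as exc:
    error_message = str(exc)
    print("IndexError:", error_message)
```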

Tags: python list web-scraping iteration

【Solution 1】:

A few things:

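If you only want the first <div class="BNeawe s3v9rd AP7Wnd"> element (index 0), use .find() instead of .find_all(), since it returns only the first matching node.
There is no need for a regular expression to get the text content. Just use BeautifulSoup's .text attribute.
The class attribute BNeawe s3v9rd AP7Wnd appears to be dynamic. It may not show up for every query you send to Google search. Either make your lookup dynamic as well (adjust accordingly) or use a Google API to get the search results.
My last point (and the most important one) is that Google is sophisticated enough to recognize automated processes and bot scraping, so you may end up receiving the following response: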

Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot.
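
The points above can be sketched roughly as follows. This is only an illustration under assumptions: `extract_snippet`, `append_results`, `SNIPPET_CLASS` and `BLOCK_MARKER` are hypothetical names, the class string is the one from the question and may no longer match, and the block check simply looks for a phrase from the message above. Note also that the loop in the question calls `company()` and `lst()` as if they were functions and tries to append to a string; the loop here appends to each inner list instead.

```python
from bs4 import BeautifulSoup

SNIPPET_CLASS = "BNeawe s3v9rd AP7Wnd"  # class from the question; may change
BLOCK_MARKER = "unusual traffic"        # phrase from the block message (assumed stable)

def extract_snippet(html_doc):
    """Return the first snippet's text, or None if blocked or not found."""
    if BLOCK_MARKER in html_doc:
        return None  # Google served its robot-check page instead of results
    soup = BeautifulSoup(html_doc, "html.parser")
    # .find() returns the first matching node, or None when nothing matches.
    node = soup.find("div", attrs={"class": SNIPPET_CLASS})
    if node is None:
        return None
    # .text gives the content without tags, so no regex is needed.
    return node.text.strip()

def append_results(companies, lookup):
    """Append lookup(name) to each inner [name] list, in place."""
    for entry in companies:
        entry.append(lookup(entry[0]))
    return companies
```

Here `lookup` would be a function that fetches the page for a company name and passes its HTML to `extract_snippet`; entries for which nothing was found end up with `None` instead of raising an IndexError.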

So I would still suggest looking for an API-based approach.
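
The answer does not name a specific API. One candidate is Google's Custom Search JSON API; the sketch below assumes that choice, and assumes you have your own API key and search engine ID (the `api_key` and `engine_id` arguments are placeholders). The helper names are hypothetical, and only the standard library is used.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://www.googleapis.com/customsearch/v1"

def build_search_url(query, api_key, engine_id):
    """Build the request URL; urlencode escapes spaces and special characters."""
    params = {"key": api_key, "cx": engine_id, "q": query}
    return API_URL + "?" + urllib.parse.urlencode(params)

def first_snippet(payload):
    """Return the snippet of the first result, or None if there are no results."""
    items = payload.get("items") or []
    if not items:
        return None
    return items[0].get("snippet")

def search_headquarters(company_name, api_key, engine_id):
    """Query the API for '<company> headquarters' and return the first snippet."""
    url = build_search_url(company_name + " headquarters", api_key, engine_id)
    with urllib.request.urlopen(url, timeout=10) as response:
        payload = json.load(response)
    return first_snippet(payload)
```

The snippet returned this way is free text from the search result, so it may still need cleaning before it looks like the "city country" values in the desired output.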
