Web scraping - scraping data for multiple URLs gives None

【Question title】: Web scraping - scraping data for multiple URL's gives None
【Posted】: 2021-06-20 19:31:50
【Question description】:

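1) I am trying to scrape data for multiple URLs stored in a CSV file, but the result comes out as None.

2) I want to store the fetched data in a DataFrame named df as I go, but it only stores one row.

Here is my code (I have pasted it below from the point where data extraction starts) -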

import csv
df=pd.DataFrame()
with open('test1.csv', newline='', encoding='utf-8-sig' ) as f:
    reader = csv.reader(f)
    for line in reader:
        link = line[0]
        print(type(link))
        print(link)
        driver.get(link)
        height = driver.execute_script("return document.body.scrollHeight")
        for scrol in range(100,height,100):
            driver.execute_script(f"window.scrollTo(0,{scrol})")
            time.sleep(0.2)
        src = driver.page_source
        soup = BeautifulSoup(src, 'lxml')
        name_div = soup.find('div', {'class': 'flex-1 mr5'})
        name_loc = name_div.find_all('ul')
        name = name_loc[0].find('li').get_text().strip()
        loc = name_loc[1].find('li').get_text().strip()    
        connection = name_loc[1].find_all('li')
        connection = connection[1].get_text().strip()
        exp_section = soup.find('section', {'id': 'experience-section'})
        exp_section = exp_section.find('ul')
        div_tag = exp_section.find('div')
        a_tag = div_tag.find('a')
        job_title = a_tag.find('h3').get_text().strip()
        company_name = a_tag.find_all('p')[1].get_text().strip()
        joining_date = a_tag.find_all('h4')[0].find_all('span')[1].get_text().strip()
        exp = a_tag.find_all('h4')[1].find_all('span')[1].get_text().strip()
        df['name']=[name]
        df['location']=[loc]
        df['connection']=[connection]
        df['company_name']=[company_name]
        df['job_title']=[job_title]
        df['joining_date']=[joining_date]
        df['tenure']=[exp]
df

Output -

    name    location    connection  company_name    job_title   joining_date    tenure
0   None    None    None    None    None    None    None

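I am not sure whether the for loop is wrong or what exactly the problem is, but for a single URL it works fine.

This is my first time using Beautiful Soup, so I don't know it well. Please help me make the required changes. Thank you.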

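【Question comments】:

Where is the rest of the code? For example, you have Selenium code but no imports or driver instantiation. Also, which URLs are you using (a few examples)? Please see minimal reproducible example and How to Ask for guidance on posting. To help you, we need to be able to reproduce your problem on our own machines.
I am really sorry, sir, for not posting my full code; I thought such long code would be hard to read. Please don't delete the post because of my silly mistake. Here are the csv and python files - wetransfer.com/downloads/…
No sir, my links are actually stored in the csv, but I have given the WeTransfer link; there you will get the csv as well as my full code. Please go through it, sir, you should not run into any problem with the code.
@QHarr Sir, is that okay?
OK sir, let me try doing this for two URLs.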

Tags: python pandas web-scraping beautifulsoup jupyter-notebook

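【Solution 1】:

I don't think the end of your code adds new rows to the DataFrame. Assigning df["name"] = [name] replaces the whole column on every iteration, so only the last URL's values survive.

Try replacing df["name"] = [name] and the other similar lines with the following: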

new_line = {
    "name": [name],
    "location": [loc],
    "connection": [connection],
    "company_name": [company_name],
    "job_title": [job_title],
    "joining_date": [joining_date],
    "tenure": [exp],
}
temp_df = pd.DataFrame.from_dict(new_line)
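# DataFrame.append returned a new frame rather than modifying df in place,
# and it was removed in pandas 2.0, so concatenate and assign the result back.
df = pd.concat([df, temp_df], ignore_index=True)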
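
As a rough sketch of another way to get one row per URL: collect a dict per page in a plain list and build the DataFrame once, after the loop. The `scraped` list and its values below are hypothetical placeholders standing in for whatever the Selenium/BeautifulSoup code extracts; only the column names come from the question.

```python
import pandas as pd

# Hypothetical placeholder values standing in for the fields scraped from each URL.
scraped = [
    ("Alice", "Pune", "500+", "Acme", "Engineer", "Jan 2020", "1 yr"),
    ("Bob", "Delhi", "120", "Globex", "Analyst", "Mar 2019", "2 yrs"),
]

columns = ["name", "location", "connection", "company_name",
           "job_title", "joining_date", "tenure"]

rows = []
for values in scraped:
    # One dict per URL; appending to a list never overwrites earlier rows.
    rows.append(dict(zip(columns, values)))

# Build the DataFrame once, after the loop has finished.
df = pd.DataFrame(rows, columns=columns)
print(df)
```

Building the frame once also avoids repeatedly copying it inside the loop, which growing a DataFrame row by row tends to do.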

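【Comments】:

Thank you very much for your reply. I knew it was not storing the data correctly, and that last part is exactly what I wanted someone to correct. But sir, it gives me an error - ValueError: If using all scalar values, you must pass an index - on the line temp_df = pd.DataFrame.from_dict(new_line)
Sorry, I have updated my answer with the missing brackets; please try again with the updated code.
Sir, I request you not to apologize to me :)... you are my senior
Sir, it worked, it shows no errors now. Could you help me with the loop part for the csv?... that is the only thing left
I guess you could wrap your code in an outer loop that iterates over a list of csv file names: for csv_name in csv_list: ... with open(csv_name, newline ...).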
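
The outer loop suggested in the last comment could look roughly like the sketch below. It assumes, as in the question, that each CSV holds one link per row in its first column; `iter_links` is a hypothetical helper name, not part of the original code.

```python
import csv


def iter_links(csv_paths):
    """Yield the first-column value of every non-empty row in each CSV file."""
    for csv_path in csv_paths:
        # utf-8-sig strips a leading byte-order mark, matching the question's open() call.
        with open(csv_path, newline="", encoding="utf-8-sig") as f:
            for row in csv.reader(f):
                # Skip blank lines and rows whose first cell is empty.
                if row and row[0].strip():
                    yield row[0].strip()
```

The scraping loop would then read `for link in iter_links(csv_list):` followed by `driver.get(link)` and the extraction code, where `csv_list` is a list of file names such as `['test1.csv']`.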