【Question Title】: Searching web pages for keywords and extracting the results
【Posted】: 2019-09-24 16:23:01
【Question】:

I wrote some code that searches a website for certain keywords.

When I use print(url, count, the_word) it shows the results, but I can't turn them into a dataset I can export. I tried pandas, but it only outputs the last search result.

from urllib.request import urlopen
from bs4 import BeautifulSoup as bs
import requests

def getLinks(url):
    html_page = urlopen(url)
    soup = bs(html_page, 'lxml')
    links = []

    for link in soup.find_all('a', href=True):
        links.append(link.get('href'))

    # drop duplicate links, keeping the first occurrence
    newlist = [ii for n, ii in enumerate(links) if ii not in links[:n]]
    newlist.insert(0, url)

    return newlist[0:10]

the_words = ['20gb', '10gb']
total_words = []

for the_word in the_words:
    for url in getLinks('https://www.bt.com/'):
        r = requests.get(url, allow_redirects=False)
        soup = bs(r.content.lower(), 'lxml')
        words = soup.find_all(text=lambda text: text and the_word.lower() in text)
        count = len(words)
        words_list = [ele.strip() for ele in words]
        for word in words:
            total_words.append(word.strip())

        #print('\nUrl: {}\ncontains {} of word: {}'.format(url, count, the_word))
        print(url, count, the_word)

        results = url, count, the_word  # reassigned each iteration -- only the last result survives
        #df = pd.DataFrame(results, columns=[the_word])
        #df.to_csv(r'C:\Users\nn1\Downloads\Python\trial.csv')
        #print(total_words)

I would like to export the output of print(url, count, the_word) to a csv file as-is.
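The pandas attempt only shows the last row because `results` is reassigned on every loop iteration instead of being appended to a list. A minimal sketch of the fix, using hypothetical (url, count, word) rows in place of real scraped data:

```python
import pandas as pd

# Hypothetical rows standing in for (url, count, the_word) from the scraping loop
rows = []
for url, count, word in [("https://www.bt.com/", 3, "20gb"),
                         ("https://www.bt.com/help", 1, "10gb")]:
    rows.append({"url": url, "count": count, "word": word})  # append, don't overwrite

# Build one DataFrame from all rows, then write it out once
df = pd.DataFrame(rows, columns=["url", "count", "word"])
df.to_csv("trial.csv", index=False)
```

With `index=False` the csv contains only the three data columns, one row per (url, word) pair.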

【Comments】:

  • Please update your code block so it is more readable
  • This code does not run as posted. Please update your question with correctly formatted code

标签: python loops web-scraping beautifulsoup


【Solution 1】:

First, fix the indentation throughout and collect the results in a list; then read from that list and write it to a csv. Here is one way to do it.

from urllib.request import urlopen  # urllib2 on Python 2
from bs4 import BeautifulSoup as bs
import requests
import csv

def getLinks(url):
    html_page = urlopen(url)
    soup = bs(html_page, 'lxml')
    links = []

    for link in soup.find_all('a', href=True):
        links.append(link.get('href'))

    # de-duplicate once, after all links have been collected
    newlist = [ii for n, ii in enumerate(links) if ii not in links[:n]]
    newlist.insert(0, url)

    return newlist[0:10]

the_words = ['20gb', '10gb']
total_words = []
results = []

for the_word in the_words:
    for url in getLinks('https://www.bt.com/'):
        r = requests.get(url, allow_redirects=False)
        soup = bs(r.content.lower(), 'lxml')
        words = soup.find_all(text=lambda text: text and the_word.lower() in text)
        count = len(words)
        words_list = [ ele.strip() for ele in words]

        for word in words:
            total_words.append(word.strip())

        results.append([url,count,the_word]) #append results in list 

# write the results to an output csv file
with open("out.csv", "w", newline="") as writeFile:
    writer = csv.writer(writeFile, delimiter=',')  # create the writer once
    for result in results:
        writer.writerow(result)
# the with-block closes the file; no explicit close() needed
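A side note on the writing step: on Python 3 the file should be opened with newline='' so the csv module does not emit blank rows on Windows, and the writer only needs to be created once. A small self-contained sketch with dummy rows:

```python
import csv

# Dummy rows standing in for the scraped [url, count, word] results
results = [["https://www.bt.com/", 3, "20gb"],
           ["https://www.bt.com/help", 1, "10gb"]]

with open("out.csv", "w", newline="") as f:  # newline="" prevents extra blank lines on Windows
    writer = csv.writer(f)                   # one writer for the whole file
    writer.writerow(["url", "count", "word"])  # header row
    writer.writerows(results)                  # all data rows at once
```

`writerows` writes the whole list in one call, which replaces the per-row loop above.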

【Discussion】:
