【问题标题】:python website-scraping using variables from text filespython网站抓取使用来自文本文件的变量
【发布时间】:2017-01-02 13:28:45
【问题描述】:

有人知道我如何抓取网站从 .txt 读取 URL 列表 IE,然后使用 .txt 中的名称将每个 url 结果写入 .txt。因此,将有代码从 .txt 文件中读取和写入正文的 URL 和名称文件。我发现的最接近的代码是下面的代码,但是它将所有内容都保存到一个 .txt 文件中,该文件是一个固定名称而不是变量;并从列表中读取 URL。我猜循环将是最好的方法,但是我没有看到此类任务的代码或太多帮助。

import requests
from bs4 import BeautifulSoup
from collections import Counter
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]

with open('thisisanew.txt', 'w', encoding='utf-8') as outfile:
     for url in urls:
     website = requests.get(url)
     soup = BeautifulSoup(website.content)
     text = [''.join(s.findAll(text=True))for s in soup.findAll('p')]
     for item in text:
            print(item ,file=outfile,)

提前感谢您的帮助!

【问题讨论】:

  • 所以您还没有尝试自己编写任何代码,只是在找人为您编写代码?
  • Go google python 逐行读取txt文件; python将数据写入文件。

标签: python web-scraping web scrape


【解决方案1】:
import requests
from bs4 import BeautifulSoup
from pathlib import Path
urls = ["http://en.wikipedia.org/wiki/Wolfgang_Amadeus_Mozart","http://en.wikipedia.org/wiki/Golf"]
for url in urls:
    # make requests
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'lxml')
    text = '\n'.join(s.get_text() for s in soup.findAll('p'))  # add '\n' betweent each p
    # write to file 
    filename = Path(url).name   # get the last part of url
    out_file = Path(filename).with_suffix('.txt')  # add.txt to filename
    with out_file.open(mode='w') as f:
        f.writelines(text)

出来:

Golf.txt:

Golf is a club and ball sport in which players use various clubs to hit balls into a series of holes on a course in as few strokes as possible.
Golf, unlike most ball games, does not require a standardized playing area. The game is played on a course with an arranged progression of either nine or 18 holes. Each hole on the course must contain a tee box to start from, and a putting green containing the actual hole or cup (4.25 inches in width). There are other standard forms of terrain in between, such as the fairway, rough (long grass), sand traps, and hazards (water, rocks, fescue) but each hole on a course is unique in its specific layout and arrangement.
Golf is played for the lowest number of strokes by an individual, known as stroke play, or the lowest score on the most individual holes in a complete round by an individual or team, known as match play. Stroke play is the most commonly seen format at all levels.

Wolfgang_Amadeus_Mozart.txt

Wolfgang Amadeus Mozart (/ˈwʊlfɡæŋ æməˈdeɪəs ˈmoʊtsɑːrt/; MOHT-sart;[1] German: [ˈvɔlfɡaŋ amaˈdeːʊs ˈmoːtsaʁt]; 27 January 1756 – 5 December 1791), baptised as Johannes Chrysostomus Wolfgangus Theophilus Mozart,[2] was a prolific and influential composer of the Classical era.
Born in Salzburg, he showed prodigious ability from his earliest childhood. Already competent on keyboard and violin, he composed from the age of five and performed before European royalty. At 17, Mozart was engaged as a musician at the Salzburg court, but grew restless and traveled in search of a better position. While visiting Vienna in 1781, he was dismissed from his Salzburg position. He chose to stay in the capital, where he achieved fame but little financial security. During his final years in Vienna, he composed many of his best-known symphonies, concertos, and operas, and portions of the Requiem, which was largely unfinished at the time of his death. The circumstances of his early death have been much mythologized. He was survived by his wife Constanze and two sons.

【讨论】:

    猜你喜欢
    • 1970-01-01
    • 2017-06-07
    • 2016-04-06
    • 2017-01-05
    • 1970-01-01
    • 1970-01-01
    • 2021-10-25
    • 2018-11-12
    • 1970-01-01
    相关资源
    最近更新 更多