【Question title】: Beautiful Soup - Crawl Wiki Page
【Posted】: 2018-09-24 18:29:24
【Question description】:

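I am trying to scrape the list on the wiki page "https://en.wikipedia.org/wiki/Glossary_of_nautical_terms" and get the title/description of each nautical term. My first problem is correctly handling lists inside descriptions, as shown below: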

import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Glossary_of_nautical_terms'
page = requests.get(url)

get_title = []
get_desc = []
corrected_desc = []
output = ''

if page.status_code == 200:
    soup = BeautifulSoup(page.text, 'html.parser')
    get_title = soup.find_all('dt', class_='glossary')
    get_desc = soup.find_all('dd', class_='glossary')

    for i in get_desc:
        first_char = i.get_text()[:1]
        second_char = i.get_text()[1:2]

        if (first_char.isnumeric() and second_char == '.'):
            if(first_char == '1' and output):
                corrected_desc.append(output)
                output = ''
                output += '{} '.format(i.get_text())
                continue
            else:
                output += '{} '.format(i.get_text())
                continue

        if output:
            corrected_desc.append(output)
            output = ''
            corrected_desc.append(i.get_text())
        else:
            corrected_desc.append(i.get_text())
else:
    print('failed to get the page!')


print(str(len(get_title)) + ' - ' + str(len(corrected_desc)))
zipped = zip(get_title, corrected_desc)

for j in zipped:
    output = '{}, {}\n'.format(j[0].get_text(), j[1].strip())
    with open('test.txt', "a", encoding='utf-8') as myfile:
        myfile.write(output)

But I can't seem to figure out how to handle descriptions that contain both a list and a sentence.

Edit: The output I am looking for is:

"Title", "Description"
"Title", "Description"
"Title", "Description"
"Title", "Description"

But I am not sure how to adjust my code to handle the case where a description is a list plus a sentence.

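【Comments】:

  • The full code isn't needed here. Can you post exactly the part you are having trouble with? Posting the expected output would also help. Create a minimal reproducible example.
  • I guess this is more of a methodology question. I assumed it would be easier to grab all the descriptions/titles and zip them. Now I'm wondering whether it would be worth walking through every element, using the title tags as start/stop points, and parsing whatever lies between them?

Tags: python python-3.x beautifulsoup python-requests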


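【Solution 1】:

All titles are inside <dt> tags and the descriptions are inside <dd> tags. So the first step is to find all of these tags, which can be done with soup.find_all(['dt', 'dd']). Then iterate over the tags and check whether each one is a dt or a dd using if tag.name == 'dt'. If the tag is a dd, append its text to the curr_description variable; otherwise, print the title and description collected so far and start a new entry.

Full code: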

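import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Glossary_of_nautical_terms')
# The 'lxml' parser needs the lxml package; 'html.parser' also works.
soup = BeautifulSoup(r.text, 'lxml')

curr_title, curr_description = '', ''
# find_all() returns tags in document order, so each <dd> belongs
# to the most recent <dt>.
for tag in soup.find_all(['dt', 'dd']):
    if tag.name == 'dt':
        # A new title starts: print the entry collected so far.
        if curr_title:
            print('{}: {}'.format(curr_title, curr_description))
            curr_description = ''
        curr_title = tag.text.strip()
    else:
        curr_description = ' '.join((curr_description, tag.text.strip()))

# Print the last entry, which no following <dt> would otherwise flush.
if curr_title:
    print('{}: {}'.format(curr_title, curr_description))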

Partial output:

A-back:  A foresail when against the wind, used when tacking to help the vessel turn.[1]
Abaft:  Toward the stern, relative to some object ("abaft the fore hatch").
Abaft the beam:  Further aft than the beam: a relative bearing of greater than 90 degrees from the bow: "two points abaft the beam, starboard side". That would describe "an object lying 22.5 degrees toward the rear of the ship, as measured clockwise from a perpendicular line from the right side, center, of the ship, toward the horizon."[2]
Abandon ship!:  An imperative to leave the vessel immediately, usually in the face of some imminent overwhelming danger.[3] It is an order issued by the Master or a delegated person in command. (It must be a verbal order). It is usually the last resort after all other mitigating actions have failed or become impossible, and destruction or loss of the ship is imminent; and customarily followed by a command to "man the lifeboats" or life rafts.[3][4]
Abeam:  On the beam, a relative bearing at right angles to the ship's keel.[5]
Able seaman:  Also able-bodied seaman. A merchant seaman qualified to perform all routine duties, or a junior rank in some navies.
Aboard:  On or in a vessel. Synonymous with "on board." (See also close aboard.)
About:  "To go about is to change the course of a ship by tacking. Ready about, or boutship, is the order to prepare for tacking."[6]
Above board:  On or above the deck, in plain view, not hiding anything. Pirates would hide their crews below decks, thereby creating the false impression that an encounter with another ship was a casual matter of chance.[7]
Above-water hull:  The hull section of a vessel above the waterline, the visible part of a ship. Also, topsides.
Absentee pennant:  Special pennant flown to indicate absence of commanding officer, admiral, his chief of staff, or officer whose flag is flying (division, squadron, or flotilla commander).
Absolute bearing:  The bearing of an object in relation to north. Either true bearing, using the geographical or true north, or magnetic bearing, using magnetic north. See also bearing and relative bearing.
Accommodation ladder:  A portable flight of steps down a ship's side.
Accommodation ship (or accommodation hulk):  A ship or hulk used as housing, generally when there is a lack of quarters available ashore. An operational ship can be used, but more commonly a hulk modified for accommodation is used.
Act of Pardon or Act of Grace:  A letter from a state or power authorising action by a privateer. See also Letter of marque.
Action Stations:  See Battle stations.
Admiral:  Senior naval officer of Flag rank. In ascending order of seniority, Rear Admiral, Vice Admiral, Admiral and (until about 2001 when all UK five-star ranks were discontinued) Admiral of the Fleet (Royal Navy). Derivation Arabic, from Amir al-Bahr ("Ruler of the sea").
Admiralty:  1.  A high naval authority in charge of a state's Navy or a major territorial component. In the Royal Navy (UK) the Board of Admiralty, executing the office of the Lord High Admiral, promulgates Naval law in the form of Queen's (or King's) Regulations and Admiralty Instructions. 2.  Admiralty law
Admiralty law:  Body of law that deals with maritime cases. In the UK administered by the Probate, Divorce and Admiralty Division of the High Court of Justice or supreme court.
Adrift:  1.  Afloat and unattached in any way to the shore or seabed, but not under way. When referring to a vessel, it implies that the vessel is not under control and therefore goes where the wind and current take her (loose from moorings or out of place). 2.  Any gear not fastened down or put away properly. 3.  Any person or thing that is misplaced or missing. When applied to a member of the navy or marine corps, such a person is "absent without leave" (AWOL) or, in US Navy and US Marine Corps terminology, is guilty of an "unauthorized absence" (UA).[8]
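
The question asked for quoted "Title", "Description" rows rather than printed lines. One way to get there with the same dt/dd walk is sketched below. The SAMPLE_HTML snippet and the helper names clean_text, collect_terms and to_csv are hypothetical, made up for illustration; the sketch assumes the page keeps each term in a <dt> followed by one or more <dd> tags, and it parses an inline string rather than fetching the live page.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical sample markup mimicking the glossary layout:
# one <dt> per term, followed by one or more <dd> descriptions.
SAMPLE_HTML = """
<dl>
  <dt class="glossary">Abeam</dt>
  <dd class="glossary">On the beam.</dd>
  <dt class="glossary">Admiralty</dt>
  <dd class="glossary">1. A high naval authority.</dd>
  <dd class="glossary">2. Admiralty law</dd>
</dl>
"""


def clean_text(tag):
    """Return the tag's text with runs of whitespace collapsed."""
    return ' '.join(tag.get_text().split())


def collect_terms(html):
    """Return a list of (title, description) pairs from <dt>/<dd> tags."""
    soup = BeautifulSoup(html, 'html.parser')
    terms = []
    title, parts = None, []
    for tag in soup.find_all(['dt', 'dd']):
        if tag.name == 'dt':
            # A new title closes the previous entry, if there is one.
            if title is not None:
                terms.append((title, ' '.join(parts)))
            title, parts = clean_text(tag), []
        elif title is not None:
            # Every <dd> is attached to the most recent <dt>.
            parts.append(clean_text(tag))
    # Flush the final entry, which has no following <dt>.
    if title is not None:
        terms.append((title, ' '.join(parts)))
    return terms


def to_csv(terms):
    """Serialize the pairs as CSV text with every field quoted."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=csv.QUOTE_ALL).writerows(terms)
    return buffer.getvalue()


terms = collect_terms(SAMPLE_HTML)
csv_text = to_csv(terms)
print(csv_text)
```

Collecting the pairs in a list keeps parsing separate from output, so the same list could be written to a file with csv.writer instead of an in-memory buffer. Note that the csv module separates fields with a bare comma, without the space shown in the requested output.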

【Comments】:

  • Thanks, I was about to take a similar approach. I think I was just going about the whole thing the wrong way!