Use pandas to fetch and manipulate the table. The links are static; by that I mean they can easily be reconstructed from an offender's first and last name.
Then you can use requests and BeautifulSoup to scrape each offender's last statement, which can be quite moving.
Here's how:
import requests
import pandas as pd


def clean(first_and_last_name: list) -> str:
    name = "".join(first_and_last_name)
    # Strip suffixes and apostrophes *before* lowercasing and removing
    # spaces; otherwise ", Jr." can never match the normalized string
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()


base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

# read_html() returns a list of DataFrames, one per table on the page
df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

df.to_csv("offenders.csv", index=False)
That gives you a CSV of the full table, with a working Offender Information link and Last Statement URL for each offender.
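As a quick sanity check on the URL scheme, you can run the same `clean()` helper on a made-up name (the name below is purely illustrative, not a real offender) and confirm the URLs it produces:

```python
def clean(first_and_last_name: list) -> str:
    name = "".join(first_and_last_name)
    # Strip suffixes and apostrophes before lowercasing and removing spaces
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()


base_url = "https://www.tdcj.texas.gov/death_row"

# Hypothetical name, for illustration only
sample = ["Doe", "John"]
info_url = f"{base_url}/dr_info/{clean(sample)}.html"
statement_url = f"{base_url}/dr_info/{clean(sample)}last.html"

print(info_url)       # https://www.tdcj.texas.gov/death_row/dr_info/doejohn.html
print(statement_url)  # https://www.tdcj.texas.gov/death_row/dr_info/doejohnlast.html
```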
Edit:
I actually went ahead and added the code that fetches every offender's last statement as well.
import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

# XPath to the paragraph that holds the statement on each statement page
statement_xpath = '//*[@id="content_right"]/p[6]/text()'


def clean(first_and_last_name: list) -> str:
    name = "".join(first_and_last_name)
    # Strip suffixes and apostrophes *before* lowercasing and removing
    # spaces; otherwise ", Jr." can never match the normalized string
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()


def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    # Pages without a statement yield an empty XPath result
    text = next(iter(statement), "")
    return " ".join(text.split())


df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)
df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

offender_data = list(
    zip(
        df["First Name"],
        df["Last Name"],
        df["Last Statement URL"],
    )
)

statements = []
for item in offender_data:
    *names, url = item
    print(f"Fetching statement for {' '.join(names)}...")
    statements.append(get_last_statement(statement_url=url))
    # Randomized pause between requests so the server isn't hammered
    time.sleep(random.randint(1, 4))

df["Last Statement"] = statements
df.to_csv("offenders_data.csv", index=False)
This will take a few minutes, because the code sleeps for anywhere between 1 and 4 seconds after each request so the server doesn't get abused.
Once it's done, you'll end up with a .csv file containing all the offenders' data along with their last statements (if there were any).
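Once the file exists, you can slice it with pandas, e.g. to see how many offenders actually left a statement. A minimal sketch (the two rows below are made-up placeholders standing in for the real offenders_data.csv):

```python
import pandas as pd

# Tiny stand-in for pd.read_csv("offenders_data.csv"); values are placeholders
df = pd.DataFrame(
    {
        "First Name": ["A", "B"],
        "Last Name": ["X", "Y"],
        "Last Statement": ["Some statement text.", ""],
    }
)

# An empty string (or NaN after the CSV round-trip) means no statement was given
gave_statement = df[df["Last Statement"].fillna("") != ""]
print(f"{len(gave_statement)} of {len(df)} offenders left a statement")
```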