[Posted]: 2015-09-22 03:49:22
[Question]:
I have the beginnings of a Python pandas script that searches Google for values and scrapes any PDF links it can find on the first page of results.
I have two questions, listed below.
import pandas as pd
from bs4 import BeautifulSoup
import urllib2
import re

df = pd.DataFrame(["Shakespeare", "Beowulf"], columns=["Search"])

print "Searching for PDFs ..."

hdr = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
       "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
       "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
       "Accept-Encoding": "none",
       "Accept-Language": "en-US,en;q=0.8",
       "Connection": "keep-alive"}

def crawl(search):
    google = "http://www.google.com/search?q="
    url = google + search + "+" + "PDF"
    req = urllib2.Request(url, headers=hdr)
    pdf_links = None
    placeholder = None  # just a column placeholder
    try:
        page = urllib2.urlopen(req).read()
        soup = BeautifulSoup(page)
        cite = soup.find_all("cite", attrs={"class": "_Rm"})
        for link in cite:
            all_links = re.search(r".+", link.text).group().encode("utf-8")
            if all_links.endswith(".pdf"):
                pdf_links = re.search(r"(.+)pdf$", all_links).group()
                print pdf_links
    except urllib2.HTTPError, e:
        print e.fp.read()
    return pd.Series([pdf_links, placeholder])

df[["PDF links", "Placeholder"]] = df["Search"].apply(crawl)
df.to_csv(FileName, index=False, sep=",")  # to_csv takes "sep", not "delimiter"
The output of print pdf_links would be:
davidlucking.com/documents/Shakespeare-Complete%20Works.pdf
sparks.eserver.org/books/shakespeare-tempest.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
www.w3.org/People/maxf/.../hamlet.pdf
calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
www.yorku.ca/inpar/Beowulf_Child.pdf
www.yorku.ca/inpar/Beowulf_Child.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
https://is.muni.cz/el/1441/.../2._Beowulf.pdf
www.penguin.com/static/pdf/.../beowulf.pdf
www.neshaminy.org/cms/lib6/.../380/text.pdf
www.neshaminy.org/cms/lib6/.../380/text.pdf
sparks.eserver.org/books/beowulf.pdf
And the csv output would look like this:
Search PDF Links
Shakespeare calhoun.k12.il.us/teachers/wdeffenbaugh/.../Shakespeare%20Sonnets.pdf
Beowulf sparks.eserver.org/books/beowulf.pdf
Questions:
- Is there a way to write all of the results to the csv as rows, instead of just the bottom result? If possible, can each row also include the value in Search that it corresponds to ("Shakespeare" or "Beowulf")?
- How can I write out the full pdf links, rather than the long links that get automatically abbreviated with "..."?
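For the first question, one approach (shown here as a minimal sketch with a stubbed-out crawl that returns canned results instead of making a real Google request) is to have crawl return a *list* of links rather than overwriting a single variable, and then emit one (search, link) row per result, e.g. with the stdlib csv module:

```python
import csv
import io

def crawl(search):
    # Stub standing in for the real scraper: the actual function would
    # fetch the results page and append every ".pdf" link to a list
    # instead of reassigning a single pdf_links variable.
    fake_results = {
        "Shakespeare": ["sparks.eserver.org/books/shakespeare-tempest.pdf"],
        "Beowulf": ["www.yorku.ca/inpar/Beowulf_Child.pdf",
                    "sparks.eserver.org/books/beowulf.pdf"],
    }
    return fake_results.get(search, [])

searches = ["Shakespeare", "Beowulf"]

buf = io.StringIO()  # stands in for an open output file
writer = csv.writer(buf)
writer.writerow(["Search", "PDF links"])
for search in searches:
    for link in crawl(search):  # one output row per link found
        writer.writerow([search, link])

print(buf.getvalue())
```

Because the search term is carried through the loop, every row keeps its corresponding Search value, which also answers the second half of the question.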
[Comments]:
- What search terms are you using?
- Hi @PadraicCunningham! I am using "Shakespeare" and "Beowulf" as the search terms (from the DataFrame).
- Wrong link, pastebin.com/Z38X8hWU; unless you really want a DataFrame, this could also be done with the csv module.
- Thank you! That looks right :), but I can't include getting the DF inside the function, because in my original code I get the list of search terms from a csv selected with os.listdir. Hence my using this method of crawling the DF: df[["PDF links", "Placeholder"]] = df["Search"].apply(crawl)
- You are passing the df to the function. I will add an alternative when I get back to my computer.
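On the second question above: the "..." abbreviation happens because the cite element holds Google's display text for a result, not the actual URL; the full target typically sits in the result anchor's href, wrapped in a "/url?q=..." redirect. A hedged sketch of unwrapping such an href with the stdlib (the href shape is an assumption about Google's markup at the time, and the example.com path is hypothetical):

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href as it might appear in a Google result anchor:
# the real destination is carried in the "q" query parameter.
href = "/url?q=http://example.com/docs/hamlet.pdf&sa=U"

params = parse_qs(urlparse(href).query)  # {"q": [...], "sa": [...]}
full_link = params["q"][0]
print(full_link)
```

So instead of reading link.text from the cite tags, the scraper would select the result "a" tags and decode their href attributes this way.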
Tags: python pandas beautifulsoup urllib2