【Posted】: 2016-03-12 03:08:07
【Question】:
I'm having trouble using BeautifulSoup. Here is what I'm trying to do:
For every form in every HTML page I read, I want to get the URL that "action" points to.
Here is my code:
def myfunction(path):
    from bs4 import BeautifulSoup
    # Retrieve the HTML files from a folder (find_files is not shown here)
    pages = find_files(path, '.html')  # as a list
    for page in pages:
        # "rw" is not a valid mode in Python 3; the files are only read
        stream = open(page, "r")
        soup = BeautifulSoup(stream, "lxml")
        formsoup = soup.find('form', attrs={"method": u"post"})
        if formsoup is not None:
            # This is the call that comes back empty
            action = soup.find('form', attrs={"method": u"post"}).findAll("action")
            print("Action is => %s\n" % action)
            print("Source: %s\ncode: %s\n\n\n\n\n" % (page, formsoup))
        stream.close()
This is the output I get:
Action is => []
Source: mysource.html
code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... about 20 lines omitted that are not relevant here]
This is the output I should get:
Action is => http://actionIshouldget.com/
Source: mysource.html
code: <form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post"><div><div class="form-item form-type-textfield form-item-name">
[... about 20 lines omitted that are not relevant here]
I could not get this to work with for form in soup.find('form', attrs={"method":u"post"}), nor with regular expressions...
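One possible reading of the output above: in the form shown, action is an attribute of the <form> tag, whereas findAll("action") searches for descendant tags named "action", so an empty list is what one would expect. The following is a minimal sketch of reading the attribute instead, not a definitive fix. The sample markup, the form_actions helper, and the use of "html.parser" (chosen only so the sketch runs without lxml) are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modelled on the form shown in the question.
SAMPLE_HTML = """
<html><body>
<form accept-charset="UTF-8" action="http://actionIshouldget.com/" id="user-login" method="post">
  <div><input type="text" name="name"></div>
</form>
<form action="/search" method="get"><input type="text" name="q"></form>
</body></html>
"""


def form_actions(html, method=None):
    """Return the action attribute of each <form>, optionally filtered by method.

    Hypothetical helper, not part of the original code.
    """
    soup = BeautifulSoup(html, "html.parser")
    # find_all returns every matching tag; find returns only the first one.
    attrs = {"method": method} if method else {}
    forms = soup.find_all("form", attrs=attrs)
    # Tag.get reads an attribute and returns None when it is absent.
    return [form.get("action") for form in forms]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
first_form = soup.find("form", attrs={"method": "post"})
# Searching for a tag named "action" finds nothing, as in the question.
no_tags = first_form.find_all("action")

post_actions = form_actions(SAMPLE_HTML, method="post")
all_actions = form_actions(SAMPLE_HTML)
print("Action is => %s" % post_actions[0])
```

Iterating over the result of find_all visits each form in turn. Iterating over the single tag returned by find walks that tag's children instead, which may be why the for loop in the question did not behave as hoped.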
【Comments】:
- What does your HTML file look like?
Tags: python html css regex beautifulsoup