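【Posted】: 2019-02-21 08:49:54
【Question】:
I am trying to fetch the HTML from web pages. However, not all of the URLs are written correctly. Most of the invalid URLs in the list contain http, while the sites now use https. Some are missing the "www." and need it added, and some have a "www." that should not be there.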
import requests

def repl_www_http(url):
    x = url.replace("www.", "")
    y = x.replace("http", "https")
    return y

def repl_www(url):
    y = url.replace("www.", "")
    return y

def repl_http(url):
    y = url.replace("http", "https")
    return y

def repl_no_www(url):
    y = url.replace("//", "//www.")
    return y
def get_html(urllist):
    for i in urllist:
        html = ""
        try:
            html = requests.get(i)
            html = html.text
            return html
        except requests.exceptions.ConnectionError:
            try:
                html = requests.get(repl_http(i))
                html = html.text
                print("replaced // with //www.")
            except requests.exceptions.ConnectionError:
                try:
                    html = requests.get(repl_http(i))
                    html = html.text
                    print("replaced http with https")
                    return html
                except requests.exceptions.ConnectionError:
                    try:
                        html = requests.get(repl_www(i))
                        html = html.text
                        print("replaced www. with .")
                        return html
                    except requests.exceptions.ConnectionError:
                        try:
                            html = requests.get(repl_www_http(i))
                            html = html.text
                            print("replaced www with . and http with https")
                            return html
                        except requests.exceptions.ConnectionError:
                            return "no HTML found on this URL"
    print("gethtml finished", html)
This is the error I get:
Traceback (most recent call last):
  File "C:\replacer.py", line 76, in <module>
    html = get_html(i)
  File "C:\replacer.py", line 37, in get_html
    html = requests.get(repl_http(i))
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 72, in get
    return request('get', url, params=params, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\api.py", line 58, in request
    return session.request(method=method, url=url, **kwargs)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 498, in request
    prep = self.prepare_request(req)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\sessions.py", line 441, in prepare_request
    hooks=merge_hooks(request.hooks, self.hooks),
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 309, in prepare
    self.prepare_url(url, params)
  File "C:\Users\LorenzKort\AppData\Local\Programs\Python\Python37\lib\site-packages\requests-2.19.1-py3.7.egg\requests\models.py", line 383, in prepare_url
    raise MissingSchema(error)
requests.exceptions.MissingSchema: Invalid URL 'h': No schema supplied. Perhaps you meant http://h?
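One possible reading of the message Invalid URL 'h' (an assumption, since the calling code around line 76 is not shown): get_html loops over its argument, so if it is handed a single URL string instead of a list, the loop variable becomes one character at a time, starting with 'h'. The sketch below illustrates that behaviour with the standard library only; as_url_list is a hypothetical helper and example.com is a placeholder.

```python
from urllib.parse import urlsplit


def as_url_list(value):
    """Wrap a single URL string in a list so it is not iterated per character."""
    if isinstance(value, str):
        return [value]
    return list(value)


url = "http://example.com"  # placeholder URL, not from the question

# Iterating over a string yields its characters, not the string itself.
for item in url[:4]:
    print(repr(item), "has scheme:", bool(urlsplit(item).scheme))

# Wrapping the string keeps the whole URL together as one list item.
print(as_url_list(url))
```

If this is what happens, it would also be worth noting that MissingSchema is a different exception class from ConnectionError, so the except clauses in get_html would not catch it.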
How can I solve this so that the incorrect URLs get corrected?
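One way this could be approached, as a sketch under assumptions rather than a definitive fix: derive the variants by parsing the URL instead of using str.replace, then try them in a flat loop instead of nested try blocks. The names candidate_urls, first_successful, and the fetch and errors parameters are hypothetical, example.com is a placeholder, and fake_fetch stands in for a real download so the sketch runs without network access.

```python
from urllib.parse import urlsplit, urlunsplit


def candidate_urls(url):
    """Return the URL followed by its scheme/'www.' variants, without duplicates."""
    parts = urlsplit(url)
    if not parts.scheme or not parts.netloc:
        # Not an absolute URL (for example 'h'): nothing sensible to derive.
        return [url]
    host = parts.netloc
    # Toggle 'www.' only at the start of the host, not anywhere in the string.
    other_host = host[4:] if host.startswith("www.") else "www." + host
    candidates = [url]
    for scheme in (parts.scheme, "https"):
        for netloc in (host, other_host):
            variant = urlunsplit((scheme, netloc, parts.path, parts.query, parts.fragment))
            if variant not in candidates:
                candidates.append(variant)
    return candidates


def first_successful(url, fetch, errors=(OSError,)):
    """Try each variant in order; return (working_url, result) or (None, None)."""
    for candidate in candidate_urls(url):
        try:
            return candidate, fetch(candidate)
        except errors as exc:
            print(f"failed: {candidate} ({type(exc).__name__})")
    return None, None


def fake_fetch(url):
    """Hypothetical stand-in for a download: only one variant 'exists'."""
    if url != "https://www.example.com/page":
        raise ConnectionError(url)
    return "<html>ok</html>"


print(first_successful("http://example.com/page", fake_fetch))
```

With requests, fetch could be something like lambda u: requests.get(u, timeout=10).text, and errors could be (requests.exceptions.RequestException,), which covers both ConnectionError and MissingSchema.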
【Comments】:
-
What is repl_http?
-
def repl_www_http(url): x = url.replace("www.", "") y = x.replace("http", "https") return y def repl_www(url): y = url.replace("www.", "") return y def repl_http(url): y = url.replace("http", "https") return y def repl_no_www(url): y = url.replace("//", "//www.") return y
-
Can you put this in your question?
-
I did! This is my first question on Stack Overflow ;-)
-
Did you try printing the url you are analyzing? It may be that your repl_http function does not work as expected and returns only h as the url.
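Following that suggestion, a quick check of what repl_http returns for a few inputs might look like the sketch below. The function body is copied from the question; the sample inputs are placeholders chosen for illustration.

```python
def repl_http(url):
    # Same body as the helper in the question.
    y = url.replace("http", "https")
    return y


# Print each input next to what the helper turns it into.
for sample in ("h", "http://example.com", "https://example.com"):
    print(repr(sample), "->", repr(repl_http(sample)))
```

Since str.replace leaves a string without "http" untouched, an output of 'h' would suggest that 'h' was already the input. The third sample shows a separate pitfall: a URL that already uses https gains a second "s".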
Tags: python python-3.x url web-scraping python-requests