【Question Title】: crawl data from website into HDFS
【Posted】: 2015-03-23 06:35:28
【Question Description】:

I want to crawl data from a website, so I am using the API from openweathermap.org. I configured the Flume agent as follows:

weather.channels= memory-channel
weather.channels.memory-channel.capacity=10000
weather.channels.memory-channel.type = memory
weather.sinks = hdfs-write
weather.sinks.hdfs-write.channel=memory-channel
weather.sinks.hdfs-write.type = logger
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat=Text
weather.sinks.hdfs-write.hdfs.fileType=DataStream
weather.sources= Weather
weather.sources.Weather.bind =     api.openweathermap.org/data/2.5/forecast/city?id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a
weather.sources.Weather.username= abc
weather.sources.Weather.password= ********
weather.sources.Weather.channels=memory-channel
weather.sources.Weather.type = http
weather.sources.Weather.port = 11111

When I run the Flume agent with the following command: flume-ng agent -f weather.conf -n weather

I get the following error:

15/03/23 05:17:34 INFO node.PollingPropertiesFileConfigurationProvider: Reloading configuration file:weather.conf
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Added sinks: hdfs-write Agent: weather
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Processing:hdfs-write
15/03/23 05:17:34 INFO conf.FlumeConfiguration: Post-validation flume configuration contains configuration for agents: [weather]
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Creating channels
15/03/23 05:17:34 INFO channel.DefaultChannelFactory: Creating instance of channel memory-channel type memory
15/03/23 05:17:34 INFO node.AbstractConfigurationProvider: Created channel memory-channel
15/03/23 05:17:34 INFO source.DefaultSourceFactory: Creating instance of sourceWeather, type http
15/03/23 05:17:35 INFO sink.DefaultSinkFactory: Creating instance of sink: hdfs-write, type: logger
15/03/23 05:17:35 INFO node.AbstractConfigurationProvider: Channel memory-channel connected to [Weather, hdfs-write]
15/03/23 05:17:35 INFO node.Application: Starting new configuration:{     
sourceRunners:{Weather=EventDrivenSourceRunner: {    
source:org.apache.flume.source.http.HTTP
Source{name:Weather,state:IDLE} }} sinkRunners:{hdfs-write=SinkRunner: {   
policy:org.apache.flume.sink.DefaultSinkProcessor@529d1dd7 counterGroup:{    
name:null counters:{} } }} channels:{memory-   
channel=org.apache.flume.channel.MemoryChannel{name: memory-channel}} }
15/03/23 05:17:35 INFO node.Application: Starting Channel memory-channel
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Monitored  
countergroup for type: CHANNEL, name: memory-channel: Successfully  
registered new MBean.
15/03/23 05:17:35 INFO instrumentation.MonitoredCounterGroup: Component   
type: CHANNEL, name: memory-channel started
15/03/23 05:17:35 INFO node.Application: Starting Sink hdfs-write
15/03/23 05:17:35 INFO node.Application: Starting Source Weather
15/03/23 05:17:35 INFO mortbay.log: Logging to 
org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via   
org.mortbay.log.Slf4jLog
15/03/23 05:17:35 INFO mortbay.log: jetty-6.1.26
15/03/23 05:17:36 WARN mortbay.log: failed 
SelectChannelConnector@api.openweathermap.org/data/2.5/forecast/city?
id=285787&APPID=8ce9bbbe446da25b19242763bdddb90a:11111:   
java.net.SocketException: Unresolved address
15/03/23 05:17:36 WARN mortbay.log: failed Server@642c189d: 
java.net.SocketException: Unresolved address
15/03/23 05:17:36 ERROR http.HTTPSource: Error while starting HTTPSource.    
  Exception follows.java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open
    (SelectChannelConnector.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java_
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run
    (LifecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
   15/03/23 05:17:36 ERROR lifecycle.LifecycleSupervisor: Unable to start 
   EventDrivenSourceRunner: {   
   source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} } 
   - Exception follows.
   java.lang.RuntimeException: java.net.SocketException: Unresolved address
    at com.google.common.base.Throwables.propagate(Throwables.java:156)
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:224)
    at org.apache.flume.source.EventDrivenSourceRunner.start
    (EventDrivenSourceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.SocketException: Unresolved address
    at sun.nio.ch.Net.translateToSocketException(Net.java:157)
    at sun.nio.ch.Net.translateException(Net.java:183)
    at sun.nio.ch.Net.translateException(Net.java:189)
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:76)
    at org.mortbay.jetty.nio.SelectChannelConnector.open(SelectChannelConnec
    tor.java:216)
    at org.mortbay.jetty.nio.SelectChannelConnector.doStart(SelectChannelCon
    nector.java:315)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.mortbay.jetty.Server.doStart(Server.java:235)
    at org.mortbay.component.AbstractLifeCycle.start(AbstractLifeCycle.java:
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:220)
    ... 9 more
    Caused by: java.nio.channels.UnresolvedAddressException
    at sun.nio.ch.Net.checkAddress(Net.java:127)
    at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java
    at sun.nio.ch.ServerSocketAdaptor.bind(ServerSocketAdaptor.java:74)
    ... 15 more
    15/03/23 05:17:39 ERROR lifecycle.LifecycleSupervisor: Unable to start 
    EventDrivenSourceRunner: {   
    source:org.apache.flume.source.http.HTTPSource{name:Weather,state:IDLE} 
    } - Exception follows.
    java.lang.IllegalStateException: Running HTTP Server found in source:  
    Weather before I started one.Will not attempt to start.
    at com.google.common.base.Preconditions.checkState(Preconditions.java:14
    at org.apache.flume.source.http.HTTPSource.start(HTTPSource.java:189)
    at org.apache.flume.source.EventDrivenSourceRunner.start(EventDrivenSour
    ceRunner.java:44)
    at org.apache.flume.lifecycle.LifecycleSupervisor$MonitorRunnable.run(Li
    fecycleSupervisor.java:251)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:304)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    access$301(ScheduledThreadPoolExecutor.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.
    run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.
    java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor
    .java:615)
    at java.lang.Thread.run(Thread.java:745)
    ^C15/03/23 05:17:41 INFO lifecycle.LifecycleSupervisor: Stopping  
    lifecycle supervisor 10
    15/03/23 05:17:41 INFO node.PollingPropertiesFileConfigurationProvider:  
    Configuration provider stopping

Please help me solve this problem.

Or is there something else I have to do before configuring the Flume agent?

Or should I use Nutch to crawl the data, or should I use Storm instead?

Please advise: what is the best option for doing this?

Thanks in advance.

【Question Comments】:

    Tags: web-crawler hdfs nutch apache-storm flume


    【Solution 1】:

    The HTTPSource bind parameter specifies the IP address or hostname your agent will listen on for incoming data. It is not the endpoint to crawl, but the endpoint (together with the port) to which crawlers must send their data.
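    The "Unresolved address" in the log follows directly from this: the OS cannot resolve a full URL (with path and query string) as a hostname when binding a listening socket. A small Python sketch of the same bind attempt, using the URL from the question:

```python
import socket

def can_bind(host: str, port: int) -> bool:
    """Try to open a listening socket on (host, port), which is what
    Flume's HTTPSource does with its bind/port parameters."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind((host, port))
        return True
    except OSError:  # socket.gaierror ("unresolved address") is a subclass
        return False
    finally:
        s.close()

# A full API URL is not a hostname the OS can resolve, hence
# "java.net.SocketException: Unresolved address" in the Flume log:
print(can_bind("api.openweathermap.org/data/2.5/forecast/city?id=285787", 11111))  # False
# A local interface address works; port 0 lets the OS pick a free port:
print(can_bind("0.0.0.0", 0))  # True
```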

    That said, I would suggest using an Exec source to run a script that crawls openweathermap.org and prints the data to its output; that output then becomes the input data for your agent.
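    Putting the two suggestions together, a minimal sketch of such an agent (the script path /home/hadoop/fetch_weather.sh is an assumption, and note that the sink type must be hdfs rather than logger for the hdfs.* properties to take effect):

```properties
weather.sources = Weather
weather.channels = memory-channel
weather.sinks = hdfs-write

# Exec source: runs a local script; each line it prints becomes a Flume event
weather.sources.Weather.type = exec
weather.sources.Weather.command = /home/hadoop/fetch_weather.sh
weather.sources.Weather.channels = memory-channel

weather.channels.memory-channel.type = memory
weather.channels.memory-channel.capacity = 10000

# HDFS sink: the type must be hdfs (a logger sink ignores all hdfs.* settings)
weather.sinks.hdfs-write.type = hdfs
weather.sinks.hdfs-write.channel = memory-channel
weather.sinks.hdfs-write.hdfs.path = hdfs://localhost:8020/user/hadoop/flume/
weather.sinks.hdfs-write.hdfs.rollInterval = 1200
weather.sinks.hdfs-write.hdfs.writeFormat = Text
weather.sinks.hdfs-write.hdfs.fileType = DataStream
```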

    【Discussion】:

    • Hi, thanks for your reply. Does that mean that in the bind parameter I have to provide the Java code package whose code fetches input from the weather API? Or can I use an Exec source directly and write shell commands that fetch the data from the API and then feed it to Flume as a source? But I'm confused: in that process the data would already be in HDFS, so what exactly would Flume do?
    • I'll try to explain it better: when you open a listening port (such as the one HTTPSource uses), the OS has to create a binding between the listening port (transport layer) and an IP address (network layer). In other words, if the machine running the Flume agent has a (completely invented) IP address of 130.207.90.34, then the OS will create a binding between that IP address and the listening port (e.g. 80). If your machine has several IP addresses, you have to specify which of them will be used for the binding; that is what HTTPSource's bind parameter is for.
    • About your other doubt: the Exec source runs the code responsible for crawling the data, and the crawled data is printed to standard output; nothing has been put on HDFS yet. The Exec source then catches that output and starts the data flow inside the agent, you know: the data is converted into Flume events that are put into the channel, and those events are taken by the HDFS sink in order to persist the data in HDFS.
    • OK, I see. So in my code I first use exec to crawl the data and then feed it to the Flume source. But I have one more question. I have a shell script that fetches the data: $url = "api.weatherunlocked.com/api/forecast/29.36,47.97?app_id={your_app_id}&app_key={your_app_key}"; $req = [System.Net.WebRequest]::Create($url); $req.Method = "GET"; $req.ContentLength = 0; $req.Timeout = 600000; $resp = $req.GetResponse(); $reader = new-object System.IO.StreamReader($resp.GetResponseStream()); $reader.ReadToEnd() | Out-File weatherOutput.json
    • So with this script file the data is stored as weatherOutput.json on my local file system. How do I now configure the Flume agent so that it receives the data directly through this script file and hands it to the Flume agent's source?
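    Since the script already writes weatherOutput.json to the local file system, one way to wire it in (a sketch; the file location is an assumed example) is to let an Exec source stream the file as it is written:

```properties
# Assumed location of the file produced by the fetch script
weather.sources.Weather.type = exec
weather.sources.Weather.command = tail -F /home/hadoop/weatherOutput.json
weather.sources.Weather.channels = memory-channel
```

    Alternatively, Flume's Spooling Directory source (type = spooldir) can ingest complete files dropped into a watched directory, which may fit better if the script rewrites the whole JSON file on each run.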