【Question Title】: How to append search keyword to twitter json data?
【Posted】: 2019-04-12 05:18:25
【Question Description】:

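I am streaming Twitter data through Kafka. I have managed to stream the data and consume the Twitter JSON. But how do I now create a PySpark DataFrame that contains both the Twitter data and the search keyword?

I managed to create a DataFrame with the data I want from the Twitter object, but I don't know how to get the search keyword.

Here is how I wrote the Kafka producer: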

import sys
import time

from kafka import KafkaProducer
from pyspark.sql import SparkSession
from tweepy import OAuthHandler, Stream, StreamListener


class StdOutListener(StreamListener):
    def __init__(self, producer):
        super().__init__()
        self.producer_obj = producer

    # on_data is called with the raw JSON string of each message from the stream
    def on_data(self, data):
        try:
            self.producer_obj.send("twitterstreamingdata", data.encode('utf-8'))
            print(data)
            return True
        except BaseException as e:
            print("Error on_data: %s" % str(e))
        return True

    # When an error occurs
    def on_error(self, status):
        print(status)
        return True

    # When reach the rate limit
    def on_limit(self, track):
        # Print rate limiting error
        print("Rate limited, continuing")
        # Continue mining tweets
        return True

    # When timed out
    def on_timeout(self):
        # Print timeout message to stderr
        print('Timeout...', file=sys.stderr)
        # Wait 120 seconds
        time.sleep(120)
        return True  # To continue listening

    def on_disconnect(self, notice):
        # Called when twitter sends a disconnect notice
        return


if __name__ == '__main__':

    spark = SparkSession \
        .builder \
        .appName("Kafka Producer Application") \
        .getOrCreate()

    # This is the initialization of Kafka producer
    producer = KafkaProducer(bootstrap_servers='xx.xxx.xxx.xxx:9092')

    # This handles twitter auth and the conn to twitter streaming API
    # (consumer_key, consumer_secret, access_token and access_token_secret
    # are assumed to be defined elsewhere)
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)
    stream = Stream(auth, StdOutListener(producer))

    print("Kafka Producer Application: ")

    WORDS = input("Enter any words: ")
    print("Is this what you just said?", WORDS)
    word = [u for u in WORDS.split(',')]
    # This line filters the twitter stream to capture data by keywords
    stream.filter(track=word)

【Question Discussion】:

    Tags: python-3.x tweepy twitter-streaming-api kafka-python


    【Solution 1】:

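    One way to solve the problem is to change the StdOutListener class constructor to accept a "keyword" parameter and, in the "on_data" function, add "keyword" to the JSON that is sent to Kafka: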

    import json
    import sys
    import time
    
    from kafka import KafkaProducer
    from pyspark.sql import SparkSession
    from tweepy import StreamListener, Stream, OAuthHandler
    
    
    class StdOutListener(StreamListener):
    
        def __init__(self, producer: KafkaProducer = None, keyword=None):
            super().__init__()
            self.producer = producer
            self.keyword = keyword
    
        # on_data is called with the raw JSON string of each message from the stream
        def on_data(self, data):
            try:
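                data = json.loads(data)
                # Note: self.keyword is the full list of tracked terms,
                # not only the term that matched this particular tweet
                data['keyword'] = self.keyword
                data = json.dumps(data)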
                self.producer.send("twitterstreamingdata", data.encode('utf-8'))
                return True
            except Exception as e:  # Exception (not BaseException) so Ctrl-C still stops the stream
                print("Error on_data: %s" % str(e))
            return True
    
        # When an error occurs
        def on_error(self, status):
            print(status)
            return True
    
        # When reach the rate limit
        def on_limit(self, track):
            # Print rate limiting error
            print("Rate limited, continuing")
            # Continue mining tweets
            return True
    
        # When timed out
        def on_timeout(self):
            # Print timeout message to stderr
            print('Timeout...', file=sys.stderr)
            # Wait 120 seconds
            time.sleep(120)
            return True  # To continue listening
    
        def on_disconnect(self, notice):
            # Called when twitter sends a disconnect notice
            return
    
    
    if __name__ == '__main__':
        CONSUMER_KEY = 'YOUR CONSUMER KEY'
        CONSUMER_SECRET = 'YOUR CONSUMER SECRET'
        ACCESS_TOKEN = 'YOUR ACCESS TOKEN'
        ACCESS_SECRET = 'YOUR ACCESS SECRET'
    
        print("Kafka Producer Application: ")
        words = input("Enter any words: ")
        print("Is this what you just said?", words)
        # Split on commas and drop surrounding whitespace and empty entries
        word = [u.strip() for u in words.split(',') if u.strip()]
    
        spark = SparkSession \
            .builder \
            .appName("Kafka Producer Application") \
            .getOrCreate()
    
        # This is the initialization of Kafka producer
        kafka_producer = KafkaProducer(bootstrap_servers='35.240.157.219:9092')
        # This handles twitter auth and the conn to twitter streaming API
        auth = OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)
        auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
        stream = Stream(auth, StdOutListener(producer=kafka_producer, keyword=word))
        stream.filter(track=word)
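
The listener above stores the entire `track` list in the `keyword` field of every tweet. If only the terms that actually occur in a given tweet are wanted, a small helper could filter the list before it is attached. The following is a minimal sketch, not part of the original answer: the names `matched_keywords` and `tag_tweet` are hypothetical, and the plain case-insensitive substring check only approximates how Twitter matches `track` terms.

```python
import json


def matched_keywords(tweet: dict, tracked: list) -> list:
    """Return the tracked terms that occur in the tweet text (case-insensitive).

    This is a plain substring check and only approximates Twitter's own
    matching rules for the `track` parameter.
    """
    # Long tweets carry their untruncated text under extended_tweet.full_text.
    text = tweet.get("extended_tweet", {}).get("full_text") or tweet.get("text", "")
    text = text.lower()
    stripped = (term.strip() for term in tracked)
    return [term for term in stripped if term and term.lower() in text]


def tag_tweet(raw: str, tracked: list) -> str:
    """Parse a raw JSON message, add a `keyword` field, and serialize it again."""
    tweet = json.loads(raw)
    tweet["keyword"] = matched_keywords(tweet, tracked)
    return json.dumps(tweet)


if __name__ == "__main__":
    sample = json.dumps({"id_str": "1", "text": "Kafka and Spark streaming"})
    print(tag_tweet(sample, ["kafka", "flink"]))
```

Inside `on_data`, the result of `tag_tweet(data, self.keyword)` could then be encoded and sent to Kafka in place of the manually modified JSON.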
    

    Hope this helps!
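
On the consuming side, each Kafka message value now contains the tweet plus the `keyword` field, so it can be flattened into one row per tweet. The sketch below is an illustration under assumptions: `tweet_to_row` and `COLUMNS` are hypothetical names, and the selected fields (`id_str`, `created_at`, `user.screen_name`, `text`) are just one possible choice of columns.

```python
import json

# Column names for the flattened rows; an illustrative choice, not from the original post.
COLUMNS = ["id", "created_at", "user", "text", "keyword"]


def tweet_to_row(value: bytes):
    """Flatten one Kafka message value into a tuple matching COLUMNS.

    Returns None for messages that are not tweets (for example limit notices).
    """
    tweet = json.loads(value.decode("utf-8"))
    if "id_str" not in tweet:
        return None
    keyword = tweet.get("keyword") or []
    if isinstance(keyword, str):
        keyword = [keyword]
    return (
        tweet["id_str"],
        tweet.get("created_at"),
        tweet.get("user", {}).get("screen_name"),
        tweet.get("text"),
        ",".join(keyword),  # store the keyword list as one comma-separated string
    )


if __name__ == "__main__":
    message = json.dumps({
        "id_str": "42",
        "created_at": "Fri Apr 12 05:18:25 +0000 2019",
        "user": {"screen_name": "example_user"},
        "text": "hello kafka",
        "keyword": ["kafka", "spark"],
    }).encode("utf-8")
    print(tweet_to_row(message))
```

A list of such tuples (with the `None` entries filtered out) could then be passed to `spark.createDataFrame(rows, COLUMNS)` to obtain a PySpark DataFrame with a `keyword` column next to the tweet fields.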

    【Discussion】:
