Big Data Risk Control: Data Silos
Mandy Brown recently posted a thoughtful and well-articulated look at the importance of digital preservation. Her post was sparked by a flurry of sites that have recently been dissolved (Cork’d), “sunsetted” (Delicious), or removed from existence entirely (Geocities). She also mentions the infamous data loss at Ma.gnolia, where many people (me included) lost the bookmarks we had been carefully collecting and curating.
Her post, along with recent posts by Stephen Hay and Alistair Croll, has got me thinking quite extensively about hosting my own data, everything from bookmarks to tweets. Tantek Celik is already using his own home-brewed solution to do this, and he recently posted about his stance on hosting your own data.
This is what I mean by “own your data”. Your site should be the source and hub for everything you post online. This doesn’t exist yet, it’s a forward looking vision, and I and others are hard at work building it. It’s the future of the indie web.
Right now, our social networks are all essentially data silos. I post tweets to Twitter, status updates to Facebook, bookmarks to Delicious (at least I used to) and images to Flickr. I don’t own that data, nor do I have easy access to it all in one central location. Aside from backups, if one of those services were to disappear, it would take any data I had posted there with it.
Even if I am diligent with my backups, most silo-based sites today do not share a common exportable data format. That means that while I may be able to back my data up, in most cases I cannot easily move it to another service.
One solution is to post through your own site, as Tantek is doing. Then, using the appropriate protocols and semantic standards, post that data to the appropriate hub or service. That way, you have a local copy on your site, and you can continue to post even if, say, Twitter were to go down (which, you know, never happens).
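To make the idea concrete, here is a rough Python sketch of "post to your own site first, then syndicate". Everything in it is an assumption for illustration: the `post` function, the `archive` list standing in for my site, and the fake service callables are hypothetical, and nothing here talks to a real network or a real API.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a third-party service: a callable that accepts text.
Service = Callable[[str], None]


def post(text: str, archive: List[str], services: Dict[str, Service]) -> Dict[str, bool]:
    """Save the post locally first, then try to syndicate it to each service."""
    archive.append(text)  # the local copy exists no matter what happens next
    results: Dict[str, bool] = {}
    for name, send in services.items():
        try:
            send(text)
            results[name] = True
        except ConnectionError:
            # A silo being down does not stop me from posting on my own site.
            results[name] = False
    return results


def working_service(text: str) -> None:
    """Pretend service that accepts every post."""


def broken_service(text: str) -> None:
    """Pretend service that is currently down."""
    raise ConnectionError("service unavailable")


archive: List[str] = []
results = post(
    "Own your data",
    archive,
    {"service-up": working_service, "service-down": broken_service},
)
```

The ordering is the whole point: the write to `archive` happens before any syndication attempt, so an outage elsewhere costs me reach but never the post itself.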
The more I consider it, the more I am surprised that we ever settled for anything else. So often my posts into these silos are related to each other. I gain interest in a topic, so I research it and add several bookmarks to Evernote. I talk about it with some people on Twitter. All of this leads me to write a post on my blog. With all three of those types of information sectioned off from one another, each inherently loses some of its original value. If I have access to all the updates and replies from all of these different services in one location, however, I can see the actual progression of an idea.
This future version of the web does not come without problems, though. One such problem is how we capture the social aspect of these sites. For example, consider any conversation on Twitter. If I’m archiving my tweets, I will only get half of the conversation. Without somehow having access to the other half of the data, I no longer have a complete thought, and that conversation loses its value.
Thankfully, there are some solutions being actively developed. The first step is to use a protocol called PubSubHubbub, which allows you to specify a “hub” that third-party services, like Twitter or Google Buzz, can refer to for new content. Using this model, whenever you post something, you ping your hub with the new content. The hub in turn alerts any services that have subscribed to the feed of that content that there is a new update for them to publish.
This accomplishes a few things. First, it allows you to publish once and potentially update many services. Secondly, it operates at near real time as opposed to the current practice of repeated polling.
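The publish-once, notify-many flow can be sketched in a few lines of Python. This is only an in-memory toy under my own assumptions: the `Hub` class, the `make_service` helper, and the feed URL are hypothetical, and the real protocol works over HTTP with subscription verification that I am leaving out entirely.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple

# A subscriber is modelled as a callback taking (topic, content).
Callback = Callable[[str, str], None]


class Hub:
    """Toy in-memory hub: services subscribe to a topic, publishers ping it."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callback]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callback) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, content: str) -> int:
        """Push content to every subscriber of the topic; return how many were notified."""
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(topic, content)
        return len(callbacks)


delivered: List[Tuple[str, str]] = []


def make_service(name: str) -> Callback:
    """Build a pretend third-party service that records what it receives."""
    def on_update(topic: str, content: str) -> None:
        delivered.append((name, content))
    return on_update


FEED = "https://example.com/feed"  # placeholder topic URL
hub = Hub()
hub.subscribe(FEED, make_service("service-a"))
hub.subscribe(FEED, make_service("service-b"))
notified = hub.publish(FEED, "New post: own your data")
```

Notice that the subscribers never ask the hub whether anything is new. They are called the moment `publish` runs, which is the push model that replaces repeated polling.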
To resolve the conversation issue, services can implement the Salmon Protocol. In the model described by the Salmon Protocol, the source (your site or whatever you are using to publish content) pushes new content out via one consistent protocol (PubSubHubbub) to all the aggregator services (Google Buzz for example). When a comment or reply is posted on one of those services, that service will then push the comment back to the source.
At this point, you can choose to simply store the comment locally, or push it back out to your hub so other services can post the reply as well. If a large number of services get behind this technology, it offers tremendous potential. Imagine being able to maintain a conversation that spans several social networks!
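Here is how I picture that round trip, as a small Python sketch. The `Source` and `Aggregator` classes and their method names are hypothetical, invented for this illustration, and I am skipping the message signing that the actual Salmon Protocol specifies. It only models the direction the data flows.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Aggregator:
    """Pretend third-party service that displays posts and replies."""
    name: str
    timeline: List[str] = field(default_factory=list)

    def receive(self, text: str) -> None:
        self.timeline.append(text)

    def reply(self, source: "Source", post: str, text: str) -> None:
        # The reply shows up on this service, then swims back upstream.
        self.timeline.append(text)
        source.receive_reply(post, text, origin=self)


@dataclass
class Source:
    """My own site: the origin of posts and the archive of their replies."""
    aggregators: List[Aggregator] = field(default_factory=list)
    archive: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def publish(self, post: str) -> None:
        self.archive[post] = []
        for aggregator in self.aggregators:
            aggregator.receive(post)

    def receive_reply(self, post: str, text: str, origin: Aggregator,
                      republish: bool = True) -> None:
        # Always keep the reply locally, tagged with where it came from.
        self.archive[post].append((origin.name, text))
        if republish:
            # Optionally fan it out to every other service.
            for aggregator in self.aggregators:
                if aggregator is not origin:
                    aggregator.receive(text)


service_a = Aggregator("service-a")
service_b = Aggregator("service-b")
site = Source([service_a, service_b])
site.publish("Own your data")
service_a.reply(site, "Own your data", "Agreed!")
```

After this runs, my site holds both halves of the conversation, and `service_b` shows a reply that was originally written on `service_a`. Passing `republish=False` would cover the simpler "just store it locally" choice.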
Of course, the downside to all of this is that I may very well have to reconsider my aversion to Google Buzz. They already implement the PubSubHubbub protocol, and they claim to be actively working on implementing the Salmon Protocol as well.