Monitoring Microservices

Monitoring is a critical part of any software development life cycle (SDLC), and with the rise of microservices architecture and DevOps practices it has become both more important and more complex. To understand how to monitor microservices, we must first take a step back to the legacy monolith and how we used to monitor it.

Three Pyramids Monitoring Philosophy

In a monolith environment we collect metrics that tell us the status of our application. We usually start with the infrastructure, the physical hardware that hosts the application, and ask, for example: is my server up? Is my database up? Can the web server talk to the database?

Then we move up a step to inquire about the application itself and ask a different question: is my application process running?

Then we move another level up and monitor functionality and business capability, which leads to a different kind of question, such as: can a user place an order?
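To make the three levels concrete, here is a minimal sketch of one check per level in Python. The hostnames, ports, and the /health and /orders URLs are hypothetical placeholders, not anything from the original article.

```python
import socket
import urllib.request


def infrastructure_check(host: str, port: int) -> bool:
    """Is my server (or database) up? Try to open a TCP connection."""
    try:
        with socket.create_connection((host, port), timeout=2):
            return True
    except OSError:
        return False


def application_check(url: str) -> bool:
    """Is my application process running? Probe its HTTP endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except OSError:
        return False


def business_check(order_url: str) -> bool:
    """Can a user place an order? Exercise the business capability end to end."""
    try:
        req = urllib.request.Request(
            order_url,
            data=b'{"sku": "TEST"}',
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(req, timeout=5) as resp:
            return resp.status in (200, 201)
    except OSError:
        return False


if __name__ == "__main__":
    print("infrastructure:", infrastructure_check("db.internal", 5432))
    print("application:  ", application_check("http://shop.internal/health"))
    print("business:     ", business_check("http://shop.internal/orders"))
```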

These three levels (infrastructure, application, and business capability) are called the Monitoring Areas.

[Figure: Monitoring Areas Pyramid]

Let's move to a different perspective and change the questions a little. Let's check application health by asking: is my server up? Check application performance by asking: is there high CPU usage? And check capacity by asking: do I have enough disk space? By answering these three questions I get another set of metrics about the health, performance, and capacity of the system, and these are called the Monitoring Concerns.

There is a many-to-many relation between Monitoring Areas and Monitoring Concerns, and it depends on the combination of questions we ask. For example, if I ask: is my server up? Is there high CPU usage? Do I have enough disk space? Then I am targeting the health, performance, and capacity of my infrastructure. If I ask: is my application generating exceptions? How quickly does the system process messages? Can I handle the month-end batch job? Then I am targeting the health, performance, and capacity of the application layer. And if I change the questions again and ask: can users access the checkout cart? Are we meeting SLAs? What is the impact of adding another customer? Then I am targeting the health, performance, and capacity of the business capability layer.
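As a rough illustration of that many-to-many relation, the questions above can be restated as a simple mapping from question to (area, concern) pair; this is just the article's own examples laid out as data.

```python
# Each monitoring question targets one (area, concern) pair.
MONITORING_QUESTIONS = {
    "is my server up?": ("infrastructure", "health"),
    "is there high CPU usage?": ("infrastructure", "performance"),
    "do I have enough disk space?": ("infrastructure", "capacity"),
    "is my application generating exceptions?": ("application", "health"),
    "how quickly does the system process messages?": ("application", "performance"),
    "can I handle the month-end batch job?": ("application", "capacity"),
    "can users access the checkout cart?": ("business capability", "health"),
    "are we meeting SLAs?": ("business capability", "performance"),
    "what is the impact of adding another customer?": ("business capability", "capacity"),
}
```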

[Figure: Relation between Monitoring Areas & Monitoring Concerns]

There is also a third pyramid I want to introduce: the Interaction Types, which describe how I monitor the system (a sketch of all three follows this list).

  • Passive monitoring: you access the system dashboard and look at current and past values.
  • Reactive monitoring: the monitoring system alerts me when something happens, for example by sending an email when the queue length reaches 50.
  • Proactive monitoring: the monitoring system takes action automatically to repair the system, for example by scaling out another instance when the queue length reaches 50.
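Here is a minimal sketch of how the three interaction types might look in code, assuming hypothetical helpers `get_queue_length`, `send_alert_email`, and `scale_out` that stand in for whatever your broker, mail, and orchestration APIs actually provide.

```python
QUEUE_ALERT_THRESHOLD = 50


def passive(get_queue_length):
    """Passive: just expose the current value for a dashboard to display."""
    return {"queue_length": get_queue_length()}


def reactive(get_queue_length, send_alert_email):
    """Reactive: alert a human when the threshold is crossed."""
    length = get_queue_length()
    if length >= QUEUE_ALERT_THRESHOLD:
        send_alert_email(f"Queue length reached {length}")


def proactive(get_queue_length, scale_out):
    """Proactive: repair automatically, e.g. spin up another consumer instance."""
    length = get_queue_length()
    if length >= QUEUE_ALERT_THRESHOLD:
        scale_out(instances=1)
```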

[Figure: 3 Pyramids Monitoring Philosophy]

So again, there is a many-to-many relationship between the three monitoring pyramids. If I ask the first three questions from the beginning of the article (is my server up? Is my database up? Can the web server talk to the database?), then I am monitoring infrastructure health at a point in time, and it is of course passive monitoring.

So whenever you decide on a metric you want to monitor, keep in mind which Area you want to monitor, which Concern you want information about, and which Interaction Type you will use.

These three pyramids are a way of thinking about what you are monitoring and interrogating, and they are useful whether you are running a monolith or a distributed system.

What happens when we deal with a distributed system? The problem with a distributed system is that we start with a single point and carve off pieces of functionality that communicate over messaging protocols, and then we spin up a few more of them. We end up with more than one server to watch, each with its own database, which is a lot more infrastructure to worry about. On top of that comes the dynamic nature of microservices: what if I scale out one of my services and get four instances of it, all consuming one input queue, or perhaps even a distributed queue? Does it still make sense to monitor queue length? It is a little tricky; maybe yes, you should monitor it, maybe no. It only gets more complicated as you increase the dynamic nature of the systems you run, and there is so much information we could collect that it does not make sense to look at everything.

Let's take a look at the components of a distributed system and see how we can monitor them.

Queue Length is the simplest metric; every broker or queue technology has some way of reporting queue length (a polling sketch follows this list). What does this metric tell us?

  • Queue length is an indicator of work outstanding.
  • A high queue length doesn't necessarily mean there is a problem: if it is high but stable or decreasing, or there are only occasional spikes, that can be fine, but if it keeps increasing over time, it is a problem.
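A minimal sketch of such a periodic check, assuming a RabbitMQ broker and the pika client; the "orders" queue name is a hypothetical example. A passive queue declaration returns the current message count without changing the queue.

```python
import pika


def get_queue_length(queue: str, host: str = "localhost") -> int:
    """Return the current number of messages waiting in the queue."""
    connection = pika.BlockingConnection(pika.ConnectionParameters(host=host))
    try:
        channel = connection.channel()
        result = channel.queue_declare(queue=queue, passive=True)
        return result.method.message_count
    finally:
        connection.close()


if __name__ == "__main__":
    # A high value is not a problem by itself; what matters is the trend over time.
    print("orders queue length:", get_queue_length("orders"))
```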

So in terms of our pyramids we are monitoring infrastructure performance, but this alone does not give us clear insight, so let's look at another important metric here: Message Processing Time. We should measure the time from when a message reaches the front of the queue until it finishes its task, whatever that task is (uploading a file via FTP, or running some query against the database), and the message is removed from the queue (a timing sketch follows this list).

  • Processing time is the time taken to successfully process the message.
  • Processing time does not include error-handling time.
  • It is independent of queue waiting time.

Finishing the processing successfully is important here, because if an error is thrown during processing the message should not be removed; it can be sent to another pod to be handled again. And as with queue length, stable or decreasing values, or occasional spikes, can be fine, but if processing time keeps increasing over time, it is a problem.
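A minimal sketch of how processing time could be recorded, assuming hypothetical `handle` and `record_metric` callables; the clock starts when the handler picks up the message and a value is recorded only on success, so error-handling time is not counted.

```python
import time


def process_with_timing(message, handle, record_metric):
    """Run the business handler and record processing time on success only."""
    start = time.monotonic()
    try:
        handle(message)  # upload the file, run the database query, ...
    except Exception:
        # Failed attempt: record no processing time and re-raise so the
        # message stays on the queue and can be retried elsewhere.
        raise
    elapsed = time.monotonic() - start
    record_metric("processing_time", type(message).__name__, elapsed)
```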

This leads us to a new concept: Critical Time. It is a timer that starts when the message is raised and stops only after the message has reached the front of the queue and been processed. So what if there is network latency, or the instance that was going to process the message crashes and restarts, and there are many retries to deliver the message? Does the critical time stop? No, it is actually still counting. From that we can write a formula that describes Critical Time.

Critical Time = Time In Queue + Processing Time + Retries Time + Network Latency Time

And very much like the other metrics: if it is stable or decreasing, or there are only occasional spikes, that can be fine, but if it keeps increasing over time, it is a problem.
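One way this is commonly measured in practice, sketched below under the assumption that producer and consumer clocks are roughly synchronized: stamp the message with the time it was sent and compute the difference only when processing finally succeeds, so time in queue, retries, and network latency are all included automatically. `broker_publish` and `record_metric` are hypothetical stand-ins for your broker and metrics APIs.

```python
import time


def send(broker_publish, body: bytes) -> None:
    """Stamp the message with its send time; the stamp survives retries."""
    headers = {"sent_at": time.time()}
    broker_publish(body=body, headers=headers)


def on_processed_successfully(headers: dict, record_metric) -> None:
    """Critical time = now - time the message was originally sent."""
    critical_time = time.time() - headers["sent_at"]
    record_metric("critical_time", critical_time)
```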

Let's Put It All Together

  • Each of these metrics represents one part of the puzzle.
  • Look at them from the endpoint's perspective, not per message.
  • Looking at them together gives great insight into your system.

Let's look at some cases and analyze them.

Case 1:

What do we have here? A stable critical time, a spiky processing time, and a stable queue length over time. What does this tell us about the system? The system is more or less keeping up with all the messages coming in, and we are processing them, because the queue length is not increasing. But why is the processing time not stable? It could be a number of things causing it to jump around: there may be contention for resources, or a locking mechanism where the handler locks on receiving a message until it has updated some resource. It could also be that some messages handled by that endpoint go quickly and others don't, and you can use that information to isolate the slow ones into their own endpoint and scale that new endpoint out independently.

Case 2:

Here we have a high critical time, a high processing time, and a medium queue length, but everything is stable. What does this tell us? The system is keeping up with the load, but we are at the limit of its capacity, so as soon as there is any traffic spike the queue length will skyrocket and the critical time will as well. This may be a good indication that it is time to scale out those resources.

Case 3:

Here we have a high critical time, a low processing time, and a low queue length. What does this mean? There may be a problem in the network, because if you remember, the critical time equation includes network latency time. There may also be a lot of retries while processing the message, since we measure processing time only for successfully processed messages. So the problem is connectivity or retries.

So if you are monitoring a distributed system, how do you know there is a communication breakdown? The easy approach is a health check: if your service replies with a 200 status, it is up. But communication in a distributed system is usually done through brokers, and when an instance sends a message to the broker it does not know whether the message reached its destination. The easiest option would be to send back a read receipt when the message reaches its destination. Is that a good idea? It is not: we would turn our decoupled system back into a request/response system, and we would double the number of messages sent over the system. The solution here is peer-to-peer connectivity, which tells us whether an endpoint is actually processing the messages coming from another.
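One possible shape of such a peer-to-peer connectivity signal, sketched below as an assumption rather than a prescribed design: the receiving endpoint records when it last processed a message from each sending endpoint, and a link that is normally busy but has gone quiet for too long is flagged.

```python
import time

# Sending endpoint name -> timestamp of the last message we processed from it.
last_seen: dict[str, float] = {}


def on_message_processed(sender_endpoint: str) -> None:
    """Call this from the message handler after successful processing."""
    last_seen[sender_endpoint] = time.time()


def connectivity_report(max_age_seconds: float = 300) -> dict[str, bool]:
    """True means we have processed a message from that sender recently."""
    now = time.time()
    return {
        sender: (now - ts) <= max_age_seconds
        for sender, ts in last_seen.items()
    }
```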

What tools can we use? We have a bunch of tools that can collect and visualize metrics for us: Splunk, Kibana, D3, and Grafana are all suitable for monitoring.

How will we collect all this information? Critical time and processing time are per-message metrics: when we send a message, that message will have its processing time and critical time associated with it. Queue length and connectivity you might check periodically, every minute or every five minutes.

How do we store this? A good schema for storing it is: metric type, message type, timestamp, and value. But this is a very expensive way to store your metrics; there are different techniques for doing it better, but they are out of scope here.
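A minimal sketch of that schema, using SQLite purely for illustration; the table and column names are assumptions, and as noted, a row per observation gets expensive, so a real system would usually use a purpose-built time-series store.

```python
import sqlite3
import time

conn = sqlite3.connect("metrics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS metrics (
        metric_type  TEXT NOT NULL,   -- e.g. 'critical_time', 'queue_length'
        message_type TEXT NOT NULL,   -- e.g. 'PlaceOrder'
        timestamp    REAL NOT NULL,   -- unix time the value was observed
        value        REAL NOT NULL
    )
""")
conn.execute(
    "INSERT INTO metrics (metric_type, message_type, timestamp, value) VALUES (?, ?, ?, ?)",
    ("critical_time", "PlaceOrder", time.time(), 1.42),
)
conn.commit()
conn.close()
```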

How do we display the metrics? We can use the ELK stack for this; it is a suitable use case for it.

Conclusion:

Monitoring distributed systems is not an easy process, and the difficulty is directly proportional to how dynamic the system is. But by understanding the philosophy of monitoring and choosing the right metrics, you can analyze the system and keep it healthy :)

Translated from: https://levelup.gitconnected.com/monitoring-microservices-techniques-f554e32e5101
