zning

The alarm platform should add a “one-click Google” button

zning@newsletter.paragraph.com (zning) — Fri, 06 May 2022 12:24:48 GMT

It has been nearly a month since Shanghai went into city-wide lockdown. Between what I bought and what was handed out, though, I now have so much food at home that I have moved on from worrying about running out to racing to eat it before it spoils …

Recently I have been researching the architecture of monitoring and alarm systems. While cooking lunch today, a wild idea struck me: when an alarm system displays alarms, whether real-time or historical, why not put a button after each one called “One-click Google”?

To be honest, the idea comes partly from thinking about the design elements of monitoring and alarm systems every day, but mainly from a friend who ran into several operations problems over the past few days and phoned me at night for solutions. The problems were not complicated. The operations staff simply had never encountered them before, so when they saw some baffling error messages they did not know where to start.

This is quite normal, because the error we see first is usually the one the developers chose to return to the front end, and the real error is not shown to the user directly. This is especially common in consumer-facing (2C) systems. For one thing, non-professional users cannot understand a raw exception and would only be confused by it. For another, there is security to consider: if a black-hat attacker deliberately triggers exceptions through injection and uses the returned errors to mount a targeted attack, throwing raw exceptions at the user simply hands the attacker a free gift.

The users of a monitoring and alarm platform, however, are all professionals and all internal staff. The platform should therefore collect and centralize the structured and unstructured log information from system resources, middleware, databases, and container orchestration into one large body of data, and present it to operations staff through the logic designed by the DevOps platform developers, so that they can see an alarm and its detailed exception information right away.

So after seeing an error on the front end, you can use the time and the system involved to find the corresponding alarm on the monitoring and alarm platform, then deal with it, and the incident is normally resolved. The logic is simple and direct.
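The lookup by time and system can be sketched roughly as follows. This is only an illustration: the `Alert` record, its field names, the five-minute window, and the sample alarms are all assumptions made up for the example, not the schema of any particular platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    system: str          # hypothetical field: system that raised the alarm
    raised_at: datetime  # when the platform recorded it
    message: str         # detailed exception text collected by the platform


def find_alerts(alerts, system, seen_at, window=timedelta(minutes=5)):
    """Return alarms for `system` raised within `window` of `seen_at`, oldest first."""
    matches = [
        a for a in alerts
        if a.system == system and abs(a.raised_at - seen_at) <= window
    ]
    return sorted(matches, key=lambda a: a.raised_at)


# Invented sample data, for illustration only.
alerts = [
    Alert("orders", datetime(2022, 5, 6, 12, 0, 30), "connection pool exhausted"),
    Alert("orders", datetime(2022, 5, 6, 9, 15, 0), "disk usage above 90%"),
    Alert("billing", datetime(2022, 5, 6, 12, 1, 0), "upstream returned 502"),
]

# The user reported a front-end error in the "orders" system at 12:02.
hits = find_alerts(alerts, "orders", datetime(2022, 5, 6, 12, 2, 0))
for alert in hits:
    print(alert.raised_at.isoformat(), alert.message)
```

Only the 12:00:30 alarm from `orders` falls inside the window, so that is the one an operator would look at first.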

But what if the operations staff see the alarm information and still cannot solve the problem from their own experience?

At that point, operations staff generally do two things:

First: open a search engine and search for the exception that was thrown, to see whether someone has shared a solution;

Second: ask someone more experienced or more senior.

In practice, human nature being what it is, these two steps are usually taken one after the other. After all, if you can solve something yourself, you generally do not bother other people.

The pain point my idea addresses is exactly this human weakness: when you search on your own, what you search for may not be the real error.

If the monitoring and alarm platform had a “one-click Google” function, the platform would in effect be telling the operations staff: this is the error, now go and look it up.

As for how to build it, the simplest approach is to search directly for the text of the exception log when the button is clicked. The search source could be Google or an internal knowledge base.
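As a minimal sketch of that simplest version, the button only needs to turn the alarm text into a search URL. The function name, the 200-character cap, the choice to search only the first line of the log, and the internal knowledge base address are all assumptions for illustration; the sample message is made up.

```python
from urllib.parse import urlencode

GOOGLE_SEARCH = "https://www.google.com/search"
# Hypothetical internal knowledge base endpoint, for illustration only.
INTERNAL_KB_SEARCH = "https://kb.example.internal/search"


def one_click_search_url(alert_message, base_url=GOOGLE_SEARCH, max_chars=200):
    """Build a search URL from the first line of an alarm's exception text."""
    lines = alert_message.strip().splitlines()
    # Stack traces are long; the first line usually carries the error itself.
    first_line = lines[0] if lines else ""
    query = first_line[:max_chars]
    return base_url + "?" + urlencode({"q": query})


# Invented alarm text: an error line followed by a stack frame.
message = "OrderServiceError: connection pool exhausted\n    at orders.db.acquire()"
url = one_click_search_url(message)
print(url)
# https://www.google.com/search?q=OrderServiceError%3A+connection+pool+exhausted
```

Pointing the same function at an internal knowledge base is just a matter of passing a different `base_url`, assuming that service also accepts a `q` parameter.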

One step further, NLP could be used to learn the most important keywords in the log.
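To make the keyword idea concrete, here is a deliberately simple stand-in. It is not a trained NLP model: it just ranks the words of a log line by how rarely they appear in past logs (an IDF-style score). The tokenizer, the scoring formula, and the sample logs are assumptions for illustration.

```python
import math
import re
from collections import Counter

# Identifier-like tokens; bare numbers and punctuation are skipped.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_.]*")


def tokenize(line):
    """Return lowercased tokens with any trailing dots removed."""
    return [t.lower().rstrip(".") for t in TOKEN.findall(line)]


def top_keywords(line, history, k=3):
    """Rank the tokens of `line` by rarity across `history`, rarest first."""
    doc_freq = Counter()
    for past in history:
        doc_freq.update(set(tokenize(past)))
    n = len(history)
    scores = {
        tok: math.log((n + 1) / (doc_freq[tok] + 1))
        for tok in set(tokenize(line))
    }
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Invented log lines, for illustration only.
history = [
    "INFO request handled in 12 ms",
    "INFO request handled in 15 ms",
    "WARN request handled in 950 ms",
]
keywords = top_keywords("ERROR request failed: connection pool exhausted", history)
print(keywords)  # ['connection', 'error', 'exhausted']
```

The common word "request" drops out because it appears in every past log, which is the behaviour a real keyword model would also be expected to show.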

One step further still, logs could be correlated across systems to find the most important keywords of the root-cause log.
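One crude way to picture the cross-system step: if every system tags its logs with a shared correlation ID, the earliest error carrying that ID is a plausible candidate for the root cause. The `trace_id` field, the "earliest error wins" rule, and the sample entries are assumptions for illustration; real root cause analysis is far more involved, and clocks on different systems may not agree.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class LogEntry:
    trace_id: str  # hypothetical correlation ID shared across systems
    system: str
    level: str
    at: datetime
    message: str


def probable_root_cause(entries, trace_id):
    """Return the earliest ERROR entry carrying `trace_id`, or None if there is none."""
    errors = [e for e in entries if e.trace_id == trace_id and e.level == "ERROR"]
    return min(errors, key=lambda e: e.at, default=None)


# Invented log entries, for illustration only.
entries = [
    LogEntry("req-1", "gateway", "ERROR", datetime(2022, 5, 6, 12, 0, 3), "upstream timeout"),
    LogEntry("req-1", "orders", "ERROR", datetime(2022, 5, 6, 12, 0, 2), "query failed"),
    LogEntry("req-1", "database", "ERROR", datetime(2022, 5, 6, 12, 0, 1), "connection pool exhausted"),
    LogEntry("req-1", "database", "INFO", datetime(2022, 5, 6, 12, 0, 0), "query started"),
    LogEntry("req-2", "gateway", "ERROR", datetime(2022, 5, 6, 11, 59, 0), "bad request"),
]

root = probable_root_cause(entries, "req-1")
print(root.system, root.message)  # database connection pool exhausted
```

The message of the entry returned here is what would then be handed to the keyword step and the search button.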

At this point some people may find this funny: if you are already using NLP and root cause analysis, why not go straight to AIOps?

Well, I have always held that AI in operations can currently only do a good job of simple self-healing and of assisting alarm decisions. Whether an incident is really handled well still needs a person to make the final call.

Operations is not as fault-tolerant as other fields where AI is applied. If face recognition fails, you just try again. In operations, though, the only acceptable outcome is a return to normal, which means the failure recovery success rate has to be close to 100%. NLP in particular is one of the slower-developing areas of machine learning, and its accuracy has never reached the level that operations can tolerate. Hand this kind of work to AI, and operations may be digging a hole for itself.

So for this kind of analysis and judgment, I think it is more practical to offer the decision support as a feature that makes it easier for operations staff to look up problems.

Of course, I am not entirely negative about AIOps. After all, accuracy is reached gradually through accumulated computing power and samples. Perhaps one day the general accuracy of AIOps, not just its accuracy on specific problems, will reach the level that operations work can tolerate. That may truly be a very happy day for operations staff.

It may also be the day they have to look for a new job.

Alarm displays should add a “one-click Google” button | 源创库

zning@newsletter.paragraph.com (zning) — Fri, 06 May 2022 12:17:09 GMT

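Shanghai has been under city-wide lockdown for almost a month now. Thanks to everything I later bought and everything that was handed out, though, there is so much in my flat that I have gone from worrying about having nothing to eat to eating in a race against the speed at which the food goes bad …

I have recently been researching the architecture of monitoring and alarm systems. While cooking lunch today I had a sudden wild idea: when an alarm system displays alarms, real-time or historical, why can it not put a button after each one called “One-click Google”? (doge)

Honestly, this idea comes partly from thinking about the design elements of monitoring and alarm systems every day, and mainly from a friend who hit several operations problems these past few days and called in the evening to ask for solutions. It is not that the problems were complicated. They simply were not part of the operations staff's experience, so when some baffling errors appeared, it was not obvious where to begin.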

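This is perfectly normal, because the error we see first is usually one that the developers return to the front end through the program, while the real error is not shown to the user directly. This is especially common in 2C systems. On the one hand, non-professional users cannot read it, and throwing the exception straight at them would leave them bewildered. On the other hand, for security reasons, if a black hat deliberately causes system exceptions through injection in order to mount a targeted attack from the returned errors, throwing the exception directly is a free gift to the attacker.

However, since the users of a monitoring and alarm platform are all professionals, and all insiders, the platform should collect and centralize the structured and unstructured log information related to system resources, middleware, databases, and container orchestration into one very large body of data, and present it to operations staff through the logic designed by the DevOps platform developers, so that they can see an alarm and its detailed exception information immediately.

So once an error appears on the front end, you match the time and the system to the monitoring and alarm platform, look at the corresponding alarm information, and then handle it. Normally that resolves the incident. The logic is simple and direct.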

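But what about when operations staff see the alarm information and still cannot solve the problem with “operations experience”?

In that case, operations staff usually do two things:

First: open a search engine and search for the thrown exception to see whether anyone has shared a solution;

Second: ask someone experienced or more senior.

In general, though, human nature means these two options are carried out in series. After all, what you can solve yourself you usually will not trouble others with.

The pain point my question addresses is this very weakness of human nature: when you search in a search engine, what you search for may not be the real error.

If the monitoring and alarm platform added a “one-click Google” function, it would amount to the platform telling the operations staff: the error is this one, go and track it down.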

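As for how the function would work, the simplest approach is to search directly for the text of that exception log when the button is pressed. The search source can of course be Google or an internal knowledge base.

A step beyond that is to use NLP to learn the most important keywords of the log.

A step beyond that again is to correlate the systems and find the most important keywords of the root-cause log.

At this point some people may find it funny: you are already using NLP and root cause analysis, so why not just adopt AIOps?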

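How should I put it. My view has always been that AI in operations can, for now, only do well at simple self-healing and at assisting alarm decisions. Whether an incident is truly handled well still needs a person to make the final decision.

Operations does not have the fault tolerance that AI applications enjoy in other fields. If face recognition goes wrong, you try again. If a face-swap effect does not suit the scene, you shoot it again. But in operations the only solution is a return to normal, which means a failure recovery success rate of nearly 100%. NLP is currently one of the slower-developing areas of machine learning, and its accuracy has never reached the level that operations can tolerate. Give this kind of work to AI, and operations may be digging its own hole.

So for this kind of analysis and judgment, I think it is more practical to present the decision support as a feature that makes it easy for operations staff to look up problems.

Of course, I do not take an entirely negative view of AIOps. Accuracy is, after all, reached step by step through the accumulation of computing power and samples. Perhaps one day the general accuracy of AIOps, not its accuracy on specific problems, really will reach the level that operations work can tolerate. When that time comes, it may truly be a very happy day for operations staff.

It may also be the day to go and find a new job.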