Build cloud native observability capability suitable for organizations

2022-06-30 16:06:00 【Spruce network】

In its definition of cloud native [1], the CNCF names observability as an essential element. If we adopt a cloud native application architecture and enjoy the efficiency gains it brings, we also have to face the question of how to build observability to match. Today there is a vast puzzle of observability solutions, both open source and commercial; the CNCF Cloud Native Landscape [2] alone lists hundreds of related projects. This article summarizes a maturity model for observability capability, in the hope of helping organizations choose an observability solution that suits them.

1.0 | Pillars: basic observability

Back in 2017, a blog post by Peter Bourgon summarized the three pillars of observability: metrics, tracing, and logging [3]. Over the following years this view was widely accepted in the industry and became the baseline requirement for observability capability, with many mature solutions available for each pillar. Among open source components, for example, Prometheus, Telegraf, InfluxDB, and Grafana focus on metrics; SkyWalking, Jaeger, and OpenTracing focus on tracing; and Logstash, Elasticsearch, and Loki focus on logging.

Building the three pillars is the primary stage of observability capability. With open source components it is easy to stand up an out-of-the-box set of observability facilities for each business system. There are two main problems at this stage:

1) Data silos: When the team faces a business failure, it may need to jump back and forth between the metrics, tracing, and logging systems. Because the data in these systems is not well connected, troubleshooting depends heavily on people correlating information by hand, and sometimes the different people responsible for the different systems all have to be pulled into the investigation.

2) Redundant construction: Because collecting observation data depends on instrumentation with StatsD, tracing SDKs, and logging SDKs, observability at this stage is generally driven by the business development teams. Each business unit builds observation facilities only for itself, which leads to duplicated construction across units. In addition, out-of-the-box solutions often have scalability problems and are hard to grow into a basic service for all businesses.

2.0 | Service: unified observability

When connecting observation data and tuning the observation systems take up more and more of daily operations work, it is time to raise observability capability to the next level. This level is centered on the service, with the infrastructure team and the business development teams collaborating. The infrastructure team builds a unified observability platform for all businesses. The platform provides the infrastructure for collecting, storing, and retrieving metrics, tracing, and logging data, and it supports associating the different data types to eliminate silos. The business development teams act as consumers, using a unified SDK to inject observation data into the platform.

The first problem is how to associate different types of data at collection time. OpenTelemetry [4] aims to solve this by standardizing how the data is collected and transmitted. Following the OpenTelemetry specification, metrics can be linked to traces through exemplars, traces to logs through TraceID and SpanID, and logs to metrics through instance name and service name. The OpenTelemetry community has completed version 1.0 of the tracing specification and plans to complete the metrics specification in 2021 and the logging specification in 2022. The project is still developing rapidly, yet it has already received a lot of attention and recognition from the industry, which shows how long the industry has suffered from siloed observation data!
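The three links described above can be sketched in a few lines of plain Python. This is only an illustration of the correlation keys, not the OpenTelemetry API: the `Exemplar`, `Span`, and `LogRecord` classes, their field names, and the sample data are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Exemplar:
    """A sample attached to a metric that remembers which trace produced it."""
    metric: str
    value: float
    trace_id: str


@dataclass
class Span:
    trace_id: str
    span_id: str
    service_name: str
    instance_name: str


@dataclass
class LogRecord:
    trace_id: str
    span_id: str
    service_name: str
    instance_name: str
    message: str


def spans_for_exemplar(exemplar, spans):
    """Metrics -> Trace: follow the trace ID carried by the exemplar."""
    return [s for s in spans if s.trace_id == exemplar.trace_id]


def logs_for_span(span, logs):
    """Trace -> Log: match on both TraceID and SpanID."""
    key = (span.trace_id, span.span_id)
    return [r for r in logs if (r.trace_id, r.span_id) == key]


def metric_labels_for_log(record):
    """Log -> Metrics: the labels used to select the matching metric series."""
    return {"service_name": record.service_name,
            "instance_name": record.instance_name}


# Hypothetical sample data.
spans = [
    Span("t1", "s1", "checkout", "checkout-0"),
    Span("t1", "s2", "payment", "payment-1"),
    Span("t2", "s3", "checkout", "checkout-0"),
]
logs = [
    LogRecord("t1", "s2", "payment", "payment-1", "card declined"),
    LogRecord("t2", "s3", "checkout", "checkout-0", "ok"),
]
slow = Exemplar(metric="http_request_duration_seconds", value=2.7, trace_id="t1")

trace = spans_for_exemplar(slow, spans)
related_logs = [r for s in trace for r in logs_for_span(s, logs)]
labels = metric_labels_for_log(related_logs[0])
```

Starting from one slow metric sample, the walk reaches the spans of trace `t1`, then the log line written inside the `payment` span, and finally the labels needed to look up that instance's other metrics, with no manual correlation in between.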

Second, we face the storage of different types of data, which unfortunately OpenTelemetry does not address. Metrics data and trace/log data are quite different: a TSDB (such as InfluxDB) is usually used to store metrics, and a search engine (such as Elasticsearch) to store traces and logs. To offer a unified observability platform as a service, the system needs to scale horizontally. However, because of the high-cardinality problem, a TSDB usually struggles to store metrics as fine-grained as every microservice and every API, and because of full-text indexing, a search engine usually incurs high resource overhead. A common way to address both problems is to choose a real-time data warehouse based on sparse indexes, such as ClickHouse, and to use an object storage mechanism to separate hot and cold data.
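A rough piece of arithmetic shows why fine-grained metrics run into the high-cardinality problem. The sketch below assumes the worst case, in which every combination of label values actually occurs; the label names and counts are made up for illustration.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on distinct time series: the product of the number of
    distinct values per label (assumes every combination occurs)."""
    return prod(label_cardinalities.values())


# Hypothetical numbers for one request-latency metric.
per_service = {"service": 50, "status_code": 5}
per_api = {"service": 50, "instance": 20, "api": 200, "status_code": 5}

coarse = series_count(per_service)  # 50 * 5
fine = series_count(per_api)        # 50 * 20 * 200 * 5
```

Under these assumptions, moving from per-service to per-instance, per-API granularity takes a single metric from 250 series to 1,000,000, which is the kind of growth a TSDB index has to absorb.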

Beyond that, the bigger challenge for an observation system that wants to become a unified service is that it needs even stronger horizontal scalability than the business systems it observes. For example, in complex environments such as hybrid cloud and edge cloud, the observation system should be able to scale out across regions/AZs and edge data centers so that it can monitor complex businesses along the full link.

Once data collection and storage are solved, the infrastructure team can open the observation system to the business development teams as a unified service. However, two problems remain unsolved at this stage:

1) Team coupling: Observability delivered as a service must be actively called by the business development team, but the KPI for business assurance is borne by the operations team and does not fall directly on the development team. With cloud native applications iterating at high speed, can anyone guarantee that the observation service is called 100% of the time at every business release? Even if the development team strictly follows the rules, the operations team still has no initiative of its own. And from the development team's perspective, the business code has to be filled with all kinds of SDK calls mandated by the operations team.

2) Observation blind spots: Not every line of code in every software service in the application architecture is written by the development team, so intrusive code instrumentation is bound to hit blind spots. For example, the API gateway, iptables/ipvs, host vSwitch, SLB, Redis cache service, and MQ message queue service on the communication path between two microservices cannot be made to yield observation data by inserting code.

3.0 | Force: endogenous observability

When the coupling between development and operations teams begins to hold back the organization, and when observation blind spots in basic services begin to hold back further improvement of business SLOs, it is time to raise the level of observability again.

Since every cloud native application needs observability, can we let the infrastructure provide this capability endogenously, so that it is everywhere, like the Force? The direction is then clear: without inserting a single line of observation code into the business code, how much observability can we get? The main challenges at this stage come from data collection and data storage.

How can the infrastructure endogenously collect application observation data? One greenfield approach is the service mesh. Whether it is a pure service mesh such as Istio or a more radical application runtime such as Dapr, observability has been considered from the very beginning of the design. If all access paths between microservices pass through the service mesh, then the collection of observation data can be solved at the infrastructure level. The main challenge here is the change in application architecture: every application needs to be migrated to the service mesh architecture. And even with a service mesh, blind spots remain on middleware, databases, caches, message queues, and similar systems. Perhaps only when the service mesh becomes, like TCP, a layer of the network protocol stack [5] will we be able to achieve endogenous observability this way.

The other, brownfield approach is to use the zero-intrusion, observe-anywhere capability of BPF. BPF is an observation technology built into the Linux kernel. Classic BPF (cBPF) focuses mainly on filtering and capturing network traffic, and it was greatly enhanced in the kernel 4.x series (eBPF). With eBPF, every TCP/UDP function call (via kprobe) and every HTTP2/HTTPS function call (via uprobe) can be observed end to end without changing business code or restarting business processes. With cBPF, metrics, tracing, and logging observation data for every service access can be extracted from network traffic, and the performance of service-to-service communication can be observed as it passes through intermediate devices such as virtual machine NICs, host NICs, and SLBs. As Linux kernel 4.x becomes more widely used, we see the cloud monitoring leader Datadog recently releasing its eBPF-based zero-intrusion monitoring capability, Universal Service Monitoring (USM) [6]; the Alibaba Cloud ARMS team in China recently releasing Kubernetes Monitoring, an eBPF-based zero-intrusion monitoring product [7]; and the open source SkyWalking v9 starting to pay attention to eBPF [8]. Note, however, that relying only on eBPF brings a dependency on the 4.x Linux kernel, which may turn this back into a greenfield approach.
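To make the traffic-based idea concrete, here is a minimal sketch of how request-level observation data might be derived once packets have been captured. It does not perform any capture or use BPF itself: the `Packet` class, its fields, and the sample packets are assumptions for illustration, and only plain-text HTTP/1.x with one outstanding request per flow is handled.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    ts: float          # capture timestamp in seconds
    flow: tuple        # (client_ip, client_port, server_ip, server_port)
    from_client: bool  # True for client -> server packets
    payload: bytes


def http_calls(packets):
    """Pair each HTTP/1.x request with the next response on the same flow."""
    pending = {}  # flow -> (request timestamp, method, path)
    calls = []
    for p in sorted(packets, key=lambda pkt: pkt.ts):
        first_line = p.payload.split(b"\r\n", 1)[0].decode("ascii", "replace")
        parts = first_line.split(" ")
        is_request = (p.from_client and len(parts) == 3
                      and parts[2].startswith("HTTP/"))
        is_response = (not p.from_client and len(parts) >= 2
                       and parts[0].startswith("HTTP/") and parts[1].isdigit())
        if is_request:
            pending[p.flow] = (p.ts, parts[0], parts[1])
        elif is_response and p.flow in pending:
            start, method, path = pending.pop(p.flow)
            calls.append({
                "server": p.flow[2],
                "method": method,
                "path": path,
                "status": int(parts[1]),
                "latency_ms": round((p.ts - start) * 1000, 3),
            })
    return calls


# Hypothetical capture of one request/response pair.
flow = ("10.0.0.5", 40000, "10.0.1.9", 8080)
packets = [
    Packet(1.00, flow, True, b"GET /orders/42 HTTP/1.1\r\nHost: orders\r\n\r\n"),
    Packet(1.25, flow, False, b"HTTP/1.1 500 Internal Server Error\r\n\r\n"),
]
calls = http_calls(packets)
```

Each resulting record can serve as a request log entry, can be aggregated into latency and error-rate metrics, and, when the same flow is captured at several points on the path, can be compared hop by hop. None of this requires a change to the business code.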

The Force needs to be everywhere, and network traffic has long been everywhere!

The data storage challenge is really about full-link monitoring. Observability based on application code often considers only the business and application layers, so the network and the infrastructure become blind spots. To connect the observation data collected along the intermediate path (API gateway, iptables/ipvs, host vSwitch, SLB, Redis cache service, MQ message queue service) with the observation data at the application and business layers, we need to build a knowledge graph oriented to microservices. By synchronizing resource and service information from the cloud platform APIs, the K8s apiserver, and the service registry, we build multi-dimensional knowledge graph information for each microservice, such as region/availability zone, VPC/subnet, cloud server/host, container cluster/node/workload, and service name/method name. This information is attached to the observation data as tags, which connects the metrics, tracing, and logging data at every layer of the full link.
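The tagging step can be sketched as a simple lookup. The example assumes the resource information has already been synchronized into an in-memory table keyed by IP address; the `RESOURCES` table, the `tag.` prefix, and all field names and values are hypothetical.

```python
# Hypothetical resource table, as it might look after synchronizing from
# the cloud platform API, the K8s apiserver, and the service registry.
RESOURCES = {
    "10.0.1.9": {
        "region": "region-a", "az": "az-1",
        "vpc": "vpc-prod", "subnet": "subnet-app",
        "host": "node-3", "cluster": "k8s-prod",
        "workload": "orders", "service": "orders",
    },
}


def enrich(record, resources, ip_field="server"):
    """Return a copy of the record with resource tags attached.

    Records whose IP is unknown are returned without extra tags.
    """
    tags = resources.get(record.get(ip_field), {})
    return {**record, **{f"tag.{k}": v for k, v in tags.items()}}


# A network-level measurement and an application-level log for the same IP.
flow_metric = {"server": "10.0.1.9", "latency_ms": 250.0}
app_log = {"server": "10.0.1.9", "message": "order lookup failed"}

tagged_metric = enrich(flow_metric, RESOURCES)
tagged_log = enrich(app_log, RESOURCES)
untagged = enrich({"server": "192.0.2.1"}, RESOURCES)
```

Because both records now carry the same `tag.workload` and `tag.cluster` values, data collected on the network path and data emitted by the application can be queried together by those dimensions.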

At level 3, observability has become an endogenous capability of the cloud infrastructure. Like the Force, it is present in every running application system and in every application system that will be added in the future. It is an innate basic capability that does not depend on a "call" in the business code to trigger it. It is simply there.

To learn more about cloud native observability in practice, you are welcome to join the "Cloud native observability sharing meeting" series of live events hosted by Spruce network. On the evening of July 6, from 20:00 to 21:30, Li Qian, senior product manager at Spruce network, will give a talk titled "HD cloud observable full link tracking practice".

Event registration: https://www.slidestalk.com/m/960/OSCjishuwenzhang