Applying intelligent operations and maintenance: leaving the crisis of enterprise digital transformation behind

2022-07-07 00:20:00 Microservice spring cloud

Problems and challenges

Development history of the data center

China began building data centers in 2000, and they have since gone through three major stages.

Early stage: data centers were discrete. IT was driven by individual construction projects, so there was no overall planning and no dedicated operations and maintenance (O&M) management system. Each project was also operated and maintained independently, which made O&M inefficient.

Middle stage: every industry consolidated its applications on a large scale, gradually built production and disaster recovery centers according to standardized systems, and introduced the "two sites, three centers" model. A standardized O&M system emerged at this stage, typified by O&M frameworks such as the IT service system and the monitoring system. Processes also enabled cross-departmental O&M coordination, the boundaries between development, testing, and O&M became clear, and O&M tools in each specialist field flourished.

Later stage: since 2015, driven by the rapid development of IT and changing business needs, data centers have been evolving toward multi-active hybrid cloud environments, a trend that has spread from the financial industry to telecom operators, energy, government, the defense industry, and other sectors. Data centers began to provide O&M assurance centered on the supply of services and resources, and O&M work is moving toward integration and automation, and ultimately toward intelligence. The solution described in this article mainly targets automated O&M scenarios.

Data center status analysis

The O&M environment of most data centers is now fairly complex and the IT technology stack is diverse, so the set of objects under O&M keeps growing and the daily work of O&M staff becomes ever more complicated and tedious. According to statistics, 70% of this work is routine and repetitive, which drives up labor costs. Because standardized O&M workflows are lacking, the quality of the work can only be judged by the subjective standards of individual technicians. In addition, the knowledge accumulated by O&M staff in most enterprises cannot be reused effectively and handovers are a mere formality, so O&M results are generally mediocre.

Beyond the 70% of O&M work that is repetitive, the remaining 30% is complex and carries high operational risk. The personal factors of individual technicians often create hidden dangers for the business. Overall O&M efficiency is low, which leads to long business interruptions and ineffective emergency response.

Automated O&M challenges and best practices

In response to these data center O&M problems, Cloudwise has drawn on its many years of experience in automated O&M to summarize the challenges encountered in past automated O&M projects. Many Cloudwise customers have built automated O&M platforms, but the platforms themselves lack out-of-the-box scenarios. In addition, some customer projects take too long to build and have no industry best practices to refer to.

The Cloudwise automation platform has been delivered across many industries, and the automated O&M scenarios commonly used in those industries have been distilled into standard product components that are genuinely out of the box. Examples include a large library of inspection indicators, standardized orchestration of application releases, and best practices for disaster recovery switchover. These greatly shorten the time needed to build an automation platform and give enterprises best practices to refer to and choose from.

Most traditional automation platforms focus on script scheduling and lack remote collection and control mechanisms over the various agentless protocols. Drawing on its understanding of the middle-platform approach to O&M, Cloudwise created a dedicated full-stack collection and control center, cdc. Besides script scheduling and startup functions, it supports encapsulated collection and control API interfaces for hardware, virtualization, containers, microservices, business applications, and more. Examples include interfaces for creating, scaling out, and scaling in virtualized resources, the IPMI collection and control protocol for hardware, and K8S. Cloudwise uses a distributed big data architecture and an intelligent scheduling engine to handle high-concurrency processing, supporting millions of managed nodes efficiently and stably.

Earlier automation tools lacked standardized, out-of-the-box service invocation interfaces, and many other O&M tools today lack connectivity between scenarios, so automation easily becomes a data island. Relying on its own O&M middle platform, Cloudwise provides standardized service interfaces out of the box. Both directions are supported: other O&M tools calling the automation tools, and the automation tools accessing data from third-party O&M tools.

Introduction to solutions and functional scenarios

Architecture of the automated O&M platform

The figure below shows the architecture of the Cloudwise automated O&M platform, which is divided into the following layers:

  • Managed object layer: covers the full stack of objects handled in daily data center O&M, such as operating systems, databases, middleware, physical servers, business applications, network devices, storage, and cloud and virtualized resources.
  • Execution channel layer: for the managed objects shown in the figure, the Cloudwise collection and control center cdc provides an agent mode as well as agentless protocols such as ssh, ipmi, snmp, jdbc, smi-s, jmx, and various APIs.
  • Service management: Cloudwise provides standardized process management functions, such as unified script management, job orchestration, script execution management, scheduled tasks, and various query functions. These general-purpose functions provide the underlying support for the automated O&M scenarios above them.
  • O&M scenario layer: includes application release management, automated inspection, software installation, compliance setup, the O&M toolbox, emergency response, and so on.
  • Interconnection: the O&M tools in this module can be third-party tools. As a vendor of intelligent full-stack O&M, Cloudwise also offers general O&M tools beyond the automation module, including IT service management, monitoring, configuration, and visualization. It can therefore help enterprise customers establish a set of O&M best practices.
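
The execution channel layer can be pictured as a simple lookup from managed object to channel. The sketch below is only an illustration: the article lists the protocols but does not say which one serves which object type, so the mapping, the `ManagedObject` class, and the `select_channel` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical default protocol per object type. The article names the
# protocols but not how they are assigned, so this table is an assumption.
AGENTLESS_PROTOCOLS = {
    "operating_system": "ssh",
    "physical_server": "ipmi",
    "network_device": "snmp",
    "database": "jdbc",
    "storage": "smi-s",
    "middleware": "jmx",
}


@dataclass
class ManagedObject:
    name: str
    kind: str
    has_agent: bool = False


def select_channel(obj: ManagedObject) -> str:
    """Prefer an installed agent, else fall back to an agentless protocol."""
    if obj.has_agent:
        return "agent"
    # Object types missing from the table are assumed to expose an API.
    return AGENTLESS_PROTOCOLS.get(obj.kind, "api")
```

For example, `select_channel(ManagedObject("db01", "database"))` picks `jdbc`, while the same object with `has_agent=True` is routed through the agent.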

Functional scenarios

  • Efficient application release management

Traditional application releases rely mostly on manual work, so releasing a system takes roughly 1-2 hours. With Cloudwise automated releases, this can be cut to 10-30 minutes, which effectively improves release efficiency. The Cloudwise automated O&M platform is based on the DevOps philosophy: it aims to strengthen communication, collaboration, and integration among development, testing, and O&M, and to standardize application release and delivery. The platform's release model is designed around "environments" + "components" and provides visual editing. The Cloudwise visual orchestration engine builds on the general service orchestration capability of its O&M middle platform: it supports complex serial and parallel flows, lets nodes call different environments and components, offers global parameterization, and supports both fully automatic and semi-automatic scenarios. Finally, the platform supports general node-level handling such as skip, retry, and pause.

The Cloudwise automated O&M management platform also provides a release cockpit and various data dashboards that give a global overview. It draws on the platform's general capabilities, such as orchestration and centralized script management, supports both agent and agentless modes, and provides fine-grained, comprehensive permission control so that every O&M operation is safe and controllable.
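
The serial and parallel orchestration with node-level skip and retry described above can be sketched roughly as follows. This is not the Cloudwise engine: `Node`, `run_node`, and `run_pipeline` are hypothetical names, and "pause" is approximated by stopping after a failed stage so that an operator can step in.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable


@dataclass
class Node:
    name: str
    action: Callable[[dict], None]  # receives the global parameters
    skip: bool = False
    retries: int = 0  # extra attempts after the first failure


def run_node(node: Node, params: dict) -> str:
    """Run one node, honoring its skip flag and retry budget."""
    if node.skip:
        return "skipped"
    for _ in range(node.retries + 1):
        try:
            node.action(params)
            return "ok"
        except Exception:
            continue  # try again until the retry budget is used up
    return "failed"


def run_pipeline(stages: list[list[Node]], params: dict) -> dict[str, str]:
    """Stages run serially; nodes inside one stage run in parallel."""
    results: dict[str, str] = {}
    for stage in stages:
        with ThreadPoolExecutor() as pool:
            statuses = list(pool.map(lambda n: run_node(n, params), stage))
        results.update({n.name: s for n, s in zip(stage, statuses)})
        if "failed" in statuses:
            break  # stop here; later stages wait for an operator
    return results
```

Passing one `params` dictionary to every node is a simple stand-in for global parameterization; a semi-automatic flow could add a confirmation callback between stages.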

  • Convenient automated inspection

The Cloudwise automation platform has built-in full-stack inspection templates, covering common operating systems, databases, and middleware through to networks, hardware, storage, cloud, containers, and microservices, all available out of the box. Relying on the indicator system of the Cloudwise O&M middle platform, enterprises can also maintain and edit the templates themselves. Inspections can be triggered manually or automatically by scheduled tasks. A traditional manual inspection takes 30-60 minutes each time, whereas automated inspection reduces this to 1-2 minutes. The inspection report can also be sent automatically to management, with problem items flagged. Comparison of inspection indicators against baseline values relies on threshold management, which sits under indicator management in the Cloudwise O&M middle platform. It supports combining traditional static thresholds and dynamic thresholds with inspection indicators, as well as compound evaluation of inspection results, and inspections can be run by business system or by device type. Combined with the Cloudwise knowledge base, the platform can suggest how to handle abnormal inspection items. Abnormal items can also generate work orders, which enterprises can use as needed. The indicator system of the Cloudwise O&M middle platform has strong high-concurrency capability and can inspect millions of managed objects in parallel.
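
To make the difference between static and dynamic thresholds concrete, here is a minimal sketch. The article does not say how dynamic thresholds are computed, so the "mean plus k standard deviations" rule is an assumption chosen for illustration, and all function names are hypothetical.

```python
from statistics import mean, stdev


def check_static(value: float, upper: float) -> bool:
    """Return True when the value breaches a fixed upper limit."""
    return value > upper


def check_dynamic(value: float, history: list[float], k: float = 3.0) -> bool:
    """Return True when the value exceeds mean(history) + k * stdev(history).

    The mean + k*sigma baseline is an assumed rule, not the product's.
    """
    if len(history) < 2:
        return False  # not enough data to form a baseline
    return value > mean(history) + k * stdev(history)


def inspect(value: float, upper: float, history: list[float]) -> bool:
    """Flag the inspection item if either threshold is breached."""
    return check_static(value, upper) or check_dynamic(value, history)
```

With a history of CPU readings around 50%, a reading of 60% stays under a static limit of 90% but is still flagged by the dynamic check, which is the kind of case a static threshold alone would miss.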

  • Flexible O&M toolbox

The key to the O&M toolbox is its out-of-the-box atomic tools. Cloudwise has 10 years of automated O&M experience and ships a rich built-in set of atomic tools. Using this tool set, enterprise O&M managers only need to specify parameters (such as an IP address or a file system directory) for a tool to execute automatically; they can also call several tools at once or run a tool against multiple objects in parallel. The tool set can be edited and maintained later: enterprises can add their own commonly used atomic tools as needed, which can go live only after approval. Every automated job execution is logged, all operations can be audited afterwards, and the platform can be integrated with the enterprise bastion host. This reduces direct manual interaction with the production environment and lowers the production risk caused by human error.
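
The idea of a parameterized atomic tool that runs against several targets in parallel and leaves an audit trail might look like the sketch below. Everything here is hypothetical: `run_tool`, the in-memory `audit_log`, and the placeholder `disk_usage` tool are illustrations, not Cloudwise interfaces.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime, timezone
from typing import Callable

audit_log: list[dict] = []  # in-memory stand-in for a real audit store


def run_tool(tool: Callable[..., str], targets: list[str],
             operator: str, **params) -> dict[str, str]:
    """Run one tool against many targets in parallel and record each call."""

    def call(target: str) -> str:
        try:
            return tool(target, **params)
        except Exception as exc:
            return f"error: {exc}"

    with ThreadPoolExecutor(max_workers=8) as pool:
        outputs = list(pool.map(call, targets))

    # Log after the parallel run so entries keep the order of `targets`.
    for target, output in zip(targets, outputs):
        audit_log.append({
            "time": datetime.now(timezone.utc).isoformat(),
            "operator": operator,
            "tool": tool.__name__,
            "target": target,
            "params": params,
            "output": output,
        })
    return dict(zip(targets, outputs))


def disk_usage(ip: str, directory: str) -> str:
    # Placeholder: a real tool would query the host at `ip`.
    return f"checked {directory} on {ip}"
```

A call such as `run_tool(disk_usage, ["10.0.0.1", "10.0.0.2"], operator="alice", directory="/var")` returns one result per target and appends two audit records that can be reviewed afterwards.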

  • Safe and stable batch automation

Batch automation is mainly used for end-of-day batch runs in banking, so the automation platform must run safely and stably at all times. The entire batch run needs to be monitored end to end, and a disaster recovery mechanism must be in place in case of an extreme system failure. The Cloudwise automation platform can replace the functions of Control-M and, beyond the common functions, also supports batch topology analysis. When migrating from Control-M, the key element fields in the XML file exported from Control-M are compared and mapped to the Cloudwise platform and then converted into an exl field file. The platform supports continued use of the scripts from the original system. Once the converted exl file is imported into the Cloudwise automation platform, the topology view of the batch schedule is generated automatically, and the migration is completed after some parameter adjustments.
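
Batch topology analysis amounts to building a dependency graph of jobs and deriving a valid run order. The sketch below shows the idea with a deliberately simplified, hypothetical XML layout; it is not the real Control-M export schema, and the job names are invented. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+).

```python
import xml.etree.ElementTree as ET
from graphlib import TopologicalSorter

# Hypothetical, simplified job export. NOT the real Control-M schema.
SAMPLE = """
<jobs>
  <job name="load_accounts"/>
  <job name="calc_interest"><depends on="load_accounts"/></job>
  <job name="post_ledger"><depends on="calc_interest"/></job>
  <job name="daily_report"><depends on="post_ledger"/></job>
</jobs>
"""


def build_topology(xml_text: str) -> dict[str, set[str]]:
    """Map each job name to the set of jobs it depends on."""
    root = ET.fromstring(xml_text)
    return {
        job.get("name"): {dep.get("on") for dep in job.findall("depends")}
        for job in root.findall("job")
    }


def run_order(topology: dict[str, set[str]]) -> list[str]:
    """Return an order in which every job follows its dependencies."""
    return list(TopologicalSorter(topology).static_order())
```

`TopologicalSorter` raises `CycleError` if the jobs depend on each other in a loop, which is a useful sanity check to run before importing a migrated schedule.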

  • One-click disaster recovery switchover

The disaster recovery switchover scenario involves contingency plans, switchover models, checks, and more, so it is relatively complex. The focus is on switchover and switchover drills at the data center level; one-click switchover exists to improve the ability to respond to emergencies. The automated orchestration capability can handle complex switchover processes. Cloudwise provides functions for the checks involved in a switchover, such as environment, data consistency, network connectivity, and configuration consistency checks. Cloudwise additionally provides a sand-table drill function, and the overall disaster recovery model can be decoupled from its targets so that processes can be reused. A separate mobile pad serves as the control terminal for the switchover to deliver one-click switching, and all data during the switchover is monitored in real time and fed back to the switchover dashboard.

  • Security compliance audit

Cloudwise provides industry benchmarks out of the box, such as CIS, PCI DSS, and SOX. A single platform covers compliance audits not only for physical servers and virtual machines but also for data center resources such as databases, middleware, and networks. It also provides detailed reports on current and historical vulnerability risk trends, establishes configuration standards and monitors changes, and greatly shortens remediation time through more than 8,000 out-of-the-box automated operation processes. It also supports thousands of combinations of device types and models.

  • One-click market opening and closing (securities brokers)

In the securities industry, a series of business operations must be performed on schedule every day, such as the market opening and closing processes. These scenarios require O&M staff to operate application systems on different devices according to business rules. The process is complex: it combines serial steps, branches, conditions, parallel steps, aggregation, and loops, and it needs to be split into sub-processes to reduce complexity. Business rule evaluation is also complicated, since whether a process step failed has to be judged from business data. When an intermediate step fails, the error must be repaired manually or confirmed by an administrator before the process can continue. The Cloudwise automated O&M platform applies strict security controls to process execution, such as permission control and time control. Besides one-click opening and closing, it can also automate pre-clearing preparation for the clearing business of securities companies. Beyond daily O&M work, some business operations can likewise be implemented with automation tools.

  • Automated software management

The Cloudwise automated O&M platform includes its own software media management platform. Media versions of various software can be uploaded and managed, and installation and deployment steps can be executed in parallel on multiple targets. The platform provides rich parameter types, including interactive parameters, file parameters, and encrypted parameters. In addition to installation, the platform supports uninstallation, startup, and other operations, relying on the general capabilities of the collection and control center in the Cloudwise O&M middle platform.

  • Automated patch management

To prevent security risks caused by system vulnerabilities, system O&M staff must install operating system patches regularly. In the traditional O&M model, however, it is hard to see the patch status of each machine at a glance, so staff have to scan every machine for vulnerabilities and install patches one machine at a time according to the results. This manual process is both time-consuming and error-prone. Automated job products provide patch management, host scanning, patch installation, and related functions, so O&M staff can understand the health of each server and install missing patches according to the scan results, resolving security risks promptly.

Cloudwise automated patch management can be used together with the compliance audit function to fix missing patches. It focuses on batch concurrent execution and actively scans hosts to discover the current patch status of operating systems, databases, and middleware. Cloudwise also updates the patch library regularly.

  • Application release integration scenario

The application release integration scenario is an integration scenario for automated O&M that arises in real releases. It involves ITSM tools, configuration management tools, automated release tools, unified monitoring tools, and others, and is a typical case of integration, collaboration, and linkage. The flow is as follows. The task is synchronized from the imported production scheduling information to the ITSM system, and ITSM approval confirms the release task. When the task status changes from pending approval to pending release, ITSM notifies the monitoring system to skip monitoring of the business system during the release window. When the release time arrives, the automated release is triggered manually or automatically. During the release, the monitoring system filters out alarms for the business system. After the release, the task status is synchronized back to the ITSM system. ITSM then ends the release process and starts the configuration process, which requests, collects, and compares the system's configuration and updates the final configuration information in the configuration library. That completes the integration flow.
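
The release flow above is essentially a small state machine in which alarms for the business system are filtered only while the release is running. The sketch below is a hypothetical model of that flow: "pending approval" and "pending release" come from the text, while the `RELEASING` and `RELEASED` state names and the `ReleaseTask` class are assumptions.

```python
from enum import Enum


class State(Enum):
    PENDING_APPROVAL = "pending approval"
    PENDING_RELEASE = "pending release"
    RELEASING = "releasing"  # assumed name for the release window
    RELEASED = "released"    # assumed name for the final state


# Each state has exactly one successor in this simplified flow.
NEXT_STATE = {
    State.PENDING_APPROVAL: State.PENDING_RELEASE,
    State.PENDING_RELEASE: State.RELEASING,
    State.RELEASING: State.RELEASED,
}


class ReleaseTask:
    def __init__(self, system: str):
        self.system = system
        self.state = State.PENDING_APPROVAL
        self.alerts_suppressed = False

    def advance(self) -> State:
        """Move to the next state and update alarm suppression."""
        if self.state not in NEXT_STATE:
            raise ValueError("release already finished")
        self.state = NEXT_STATE[self.state]
        # Alarms are filtered only while the release is in progress.
        self.alerts_suppressed = self.state is State.RELEASING
        return self.state
```

In a real integration, each transition would also call out to ITSM, monitoring, and the configuration library; those calls are omitted here to keep the state logic visible.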

  • Process-as-a-service scenario

The process-as-a-service linkage scenario is about delivering all kinds of service requests automatically through the automation platform. Users select a service request in the service portal. Once the work order for the request is approved, the API service interface of the automation platform is triggered according to the business scenario, and ITSM synchronizes the parameters in the work order to the automation platform. The automation platform delivers automatically according to the business scenario and returns the result to the ITSM platform, which can significantly reduce delivery time. Examples include daily requests to scale virtualized resources out or in, standardized data changes, and standardized environment changes.

  • Integrated fault handling scenario

The integrated fault handling scenario involves contingency plans, orchestration of the handling process, and linkage with ITSM. It can be triggered manually by an administrator or automatically by a monitoring alarm. Once triggered, an ITSM incident or change work order is generated from a plan built into the system (such as starting and stopping a process or cleaning up disk space). Based on factors such as urgency and degree of impact, the work order is classified at a node of the ITSM process flow as an emergency change, a general change, or a standard change, and each type involves different approvers and approval processes. Approval can be automatic or manual. After final approval, the handling tool is triggered to perform automated handling, and when it finishes, the result is returned to ITSM for review of the work order.
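
The classification step can be illustrated with a small rule function. The article names the three change types and the two inputs (urgency and impact) but gives no scoring rules, so the 1 to 3 scales, the cut-offs, and the approval mapping below are purely illustrative assumptions.

```python
# Assumed mapping from change type to approval mode; the article only
# says that approval can be automatic or manual.
APPROVAL_MODE = {
    "emergency": "manual",
    "general": "manual",
    "standard": "automatic",
}


def classify_change(urgency: int, impact: int, has_builtin_plan: bool) -> str:
    """Classify a change work order.

    Scores run from 1 (low) to 3 (high). The cut-offs are illustrative.
    """
    if urgency >= 3:
        return "emergency"
    if has_builtin_plan and impact <= 1:
        return "standard"  # low-impact change covered by a built-in plan
    return "general"


def approval_for(urgency: int, impact: int, has_builtin_plan: bool) -> str:
    """Return the approval mode for a work order."""
    return APPROVAL_MODE[classify_change(urgency, impact, has_builtin_plan)]
```

Under these assumed rules, cleaning up disk space from a built-in plan with low impact would be a standard change and could be approved automatically, while anything urgent goes to a human approver.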

Case sharing

  • Typical case: a bank

Business background: the enterprise's data center manages more than 3,000 device objects. Its O&M technicians had to carry out several full inspections every day, each taking more than 1 hour. The inspection results were not presented well and were simply filled into standard forms. The requirement was very clear: fully automated inspection.

Solution: after a 3-month project, Cloudwise brought the bank's more than 3,000 software and hardware O&M objects under management. With the Cloudwise automation system, a full inspection can be completed in 1 minute. The results can be viewed on the Cloudwise inspection dashboard and data views, and a complete inspection report can also be generated. If anomalies are found during an inspection, they are pushed to the enterprise's integrated alarm platform to raise alarms. This was a short, fast project: relying on Cloudwise's built-in, out-of-the-box inspection indicators, testing and go-live were completed in a very short period, and the efficiency of daily O&M improved directly.

  • Typical case: a manufacturing group

Business background: during its digital transformation, the enterprise found that it lacked systematic, standardized O&M processes. Daily O&M relied too heavily on the ability and attitude of individual O&M staff, which carried high risk; a new hire whose skills were not up to standard could cause extremely high business risk. To solve these problems, the enterprise planned to use an automation platform to reduce its dependence on the technical ability of O&M staff and the risk arising from their attitude.

Solution: after about half a year of construction with Cloudwise, the enterprise's overall O&M efficiency improved by 70% and the operational risk of daily O&M fell by 30%. Application release manages 56 application systems, with a release automation rate above 90% and more than 80 routine releases per month. The network automation module manages more than 500 network devices, including switches, firewalls, routers, and load balancers, with an automation rate of 95% and an average of more than 40 routine network changes per month. Beyond network devices, the enterprise also manages more than 2,000 operating system, database, and middleware objects, with an automation rate of 98%. The O&M toolbox holds more than 2,000 atomic tools and is used more than 4,000 times per month. The enterprise has also implemented disaster recovery switchover covering its 30 most important systems: the switchover automation rate is 55%, and a data center level switchover can be completed within 60 minutes. For these 30 important business systems there are also more than 100 emergency response plans, which mainly codify everyday fault self-healing scenarios. The automation platform holds more than 100 software version media packages, and the automation rate of daily software installation exceeds 99%.

Value and advantages

  1. Full-stack collection and control capability for automated O&M

Relying on the full-stack collection and control capability of the Cloudwise O&M middle platform, the platform handles collection, control, and scheduling for all kinds of platform components, such as operating systems, databases, and middleware, and also supports a variety of heterogeneous automated jobs. Besides common scripts, it supports HTTP jobs, C/S architecture software, and AS/400 jobs (the AS/400 is an older system still found in the financial industry). Database SQL and stored procedure jobs, mail jobs, and FTP jobs are supported as well.

  2. Mature out-of-the-box automation scenarios

The Cloudwise automation platform comes with mature business scenarios out of the box, which can greatly shorten project construction time. The figure below shows 9 common scenarios, which are complemented by the integration scenarios: application release, automated inspection, O&M toolbox, disaster recovery switchover, batch, network, application handling, security compliance, and software installation.

  3. A secure and trusted technology platform

Cloudwise provides a secure and reliable technology platform. The full stack of Cloudwise products is developed in-house, which avoids potential security defects. It runs in domestic Xinchuang (IT application innovation) environments, including domestic CPUs (Kunpeng), domestic operating systems (Kylin, UnionTech), databases (Kingbase, Dameng), and middleware (TongTech, Baolande). China recently issued its Data Security Law, and the Cloudwise automation platform complies with national data security regulations across data transmission, storage, and analysis. The platform has gone through more than 10 years of continuous iteration, has a stability rate above five nines, and supports circuit breaking for various anomalies and disaster recovery in extreme cases.

Copyright notice
This article was written by [Microservice spring cloud]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/188/202207061642373493.html