
Combat tactics based on Ceph object storage

2022-06-12 09:34:00 YoungerChina

1. Know yourself and know your enemy, and you can fight a hundred battles without peril

Dissect the business I/O model

Understand the basic storage model of the business:

  • Maximum concurrency and maximum read/write bandwidth requirements.

    • Once you know the maximum number of concurrent requests a single RGW can handle, the overall concurrency determines how many RGW instances you need to support the load.

    • The maximum read/write bandwidth determines how many OSDs you need to sustain it; also check whether the bandwidth at the endpoint entrance can meet this demand.

  • Whether the clients sit inside or outside your network.

    • If clients are mainly on the public Internet, reads and writes are exposed to uncontrollable factors in that complex network environment; this is why public cloud object storage providers will not commit to bandwidth, latency, or concurrency figures.

    • If clients are on the intranet, minimize the number of routing hops between the endpoint entrance and the clients to guarantee bandwidth.

  • Read/write ratio and average request size.

    • If reads dominate, as in a CDN scenario, consider adding a read-cache component in front of the endpoint.

    • If writes dominate, as in data backup, consider throttling ingress bandwidth so that multiple services do not contend for write bandwidth at peak times and degrade overall service quality.

    • The average request size determines whether the object storage service should be optimized for large files or for small files.

      Veteran's tip: very often we do not know the composition of the business I/O model in detail up front. In that case, onboard the service at a small scale first and feed the access logs of the frontend endpoint (for example, nginx) into a log analysis stack such as ELK; with ELK it is very easy to work out the business I/O model. A rough offline version of this analysis is sketched below.
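A minimal sketch of that analysis, assuming the endpoint is fronted by nginx with the default "combined" access log format (the log path and format here are assumptions): it derives a rough I/O model, read/write ratio and average response size, straight from an access log before a full ELK pipeline is in place.

```python
import re
from collections import Counter

# Matches the default nginx "combined" access log format (an assumption about
# how the endpoint is fronted); adjust the pattern for a custom log_format.
LINE = re.compile(
    r'\S+ \S+ \S+ \[[^\]]+\] "(?P<method>\S+) \S+ [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def analyze(path):
    methods, total_bytes, requests = Counter(), 0, 0
    with open(path) as log:
        for line in log:
            m = LINE.match(line)
            if not m:
                continue
            requests += 1
            methods[m.group("method")] += 1
            if m.group("bytes") != "-":
                total_bytes += int(m.group("bytes"))
    reads = methods["GET"] + methods["HEAD"]
    writes = methods["PUT"] + methods["POST"]
    print(f"requests: {requests}, read:write = {reads}:{writes}")
    if requests:
        # $body_bytes_sent only covers responses; for upload sizes you would
        # also need $request_length in the log format.
        print(f"average response size: {total_bytes / requests / 1024:.1f} KB")

if __name__ == "__main__":
    analyze("/var/log/nginx/access.log")   # path is an assumption
```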

2. Before the troops move, the gateway leads the way

Every business idea ultimately has to be built on top of a hardware platform. When the business is small in the early days you may not pay much attention to hardware selection, but as it grows, choosing a reliable and cost-effective hardware platform becomes especially important.

Hardware resource experience at small scale

Small scale generally means clusters of no more than 20 machines or no more than 200 OSDs. At this stage, to save money, MON nodes can run on virtual machines, but two conditions must be met:

* Run at least 3 MONs and at most 5 (more is a waste of money), each with at least 2 cores and 4 GB of RAM.
* MONs must sit on different physical machines and be kept in sync with NTP.

As for OSD nodes, meet the following conditions:

* Do not put OSDs in virtual machines: replacing disks and troubleshooting becomes a hassle, and with the extra layer in the VM I/O stack, data safety is much worse.
* Mind the network bandwidth limit. On gigabit you can only do bonding; try to give every OSD 40-60 MB/s of network bandwidth (a quick calculation follows this list). From experience, if 10 GbE is available do not bother with gigabit bonds: upgrading the network later means stopping OSDs, which is a big pit.
* Reserve at least 1 core and 2 GB of RAM per OSD no matter how tight things are; otherwise memory will not hold up when something goes wrong.
* If you do not care about performance and have no SSD for the journal, then so be it.
* SSDs for the index pool are best; if you cannot afford them, live with it.
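A back-of-the-envelope check of the per-OSD bandwidth rule from the list above. The link speeds are nominal and the calculation ignores protocol overhead and replication traffic, so treat the output as a rough upper bound rather than a promise.

```python
# How many OSDs can a node's network feed at 40-60 MB/s per OSD?
def max_osds(link_gbit, per_osd_mb=50):
    link_mb = link_gbit * 1000 / 8        # Gbit/s -> MB/s (decimal, no overhead)
    return int(link_mb // per_osd_mb)

for desc, gbit in [("1 GbE", 1), ("2x1 GbE bond", 2), ("10 GbE", 10)]:
    print(f"{desc:14s} ~{max_osds(gbit)} OSDs per node")
# 1 GbE          ~2 OSDs per node
# 2x1 GbE bond   ~5 OSDs per node
# 10 GbE         ~25 OSDs per node
```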

RGW nodes are much easier:

* RGW can run in virtual machines or even Docker, because spinning up more instances improves the overall availability and concurrency of the service.
* Give each RGW at least 2 cores and 4 GB of RAM, because every request is buffered in memory before it is written to RADOS.
* Run at least 2 RGW service nodes, with nginx in front as a reverse proxy to improve concurrency.
* Without tuning, civetweb's concurrency is much worse than fastcgi's.

Hardware resource experience at medium scale

Medium scale generally means clusters of 20-40 machines or no more than 400 OSDs. At this stage the business has started to take shape, MON nodes are no longer a good fit for virtual machines, and a few points deserve attention:

* Run at least 3 MONs and at most 5, each with at least 4 cores and 8 GB of RAM; if possible, store the MON metadata on SSDs. Once the cluster reaches a certain scale, the LevelDB on the MON becomes a performance bottleneck, especially during data compaction.
* MONs must sit on different physical machines, and across multiple racks if possible, but be careful not to span multiple IP segments: if the network between segments fluctuates, it can easily trigger frequent MON elections (the election parameters can of course be tuned).

As for OSD nodes, meet the following conditions:

* Design the crushmap for failure-domain isolation. Use 3 replicas if you can; never drop to 2 replicas to save money, because once a whole batch of disks approaches end of life this becomes a major hidden danger.
* Do not put too many OSD disks on one physical node, and do not use overly large single disks. If you run 8 TB SATA disks and one that is more than 80% full goes bad, the data backfill is a long wait; you can throttle the number of concurrent backfills, but that affects the business, so weigh the trade-off yourself.
* Give every OSD an SSD journal; at this scale, saving the SSD money is no longer worth it.
* Put the index pool on SSDs; it is a qualitative leap in performance.

RGW nodes are much easier:

* RGW can still run in virtual machines or even Docker.
* Put a load-balancing layer at the front entrance, such as LVS or an nginx reverse proxy; both high availability and load balancing are required.
* Set rgw_override_bucket_index_max_shards according to your SSD count and failure-domain design to tune bucket index performance (a sizing sketch follows this list).
* You can consider deploying one RGW service on every OSD node, but make sure those nodes have enough CPU and memory.
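A rough sizing sketch for rgw_override_bucket_index_max_shards. The target of roughly 100,000 objects per index shard is a commonly cited rule of thumb, not a figure from this article; adjust it to your SSD count and failure-domain design.

```python
# Estimate a bucket index shard count from the expected bucket size,
# assuming ~100k objects per shard (a rule-of-thumb assumption).
def bucket_index_shards(max_objects_per_bucket, objects_per_shard=100_000):
    shards = -(-max_objects_per_bucket // objects_per_shard)   # ceiling division
    return max(shards, 1)

print(bucket_index_shards(5_000_000))    # -> 50 shards for a 5M-object bucket
```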

Hardware resource experience at medium-to-large scale

Medium-to-large scale generally means clusters of up to 50 machines or no more than 500 OSDs. At this stage the business has reached a certain scale, MON nodes must not run in virtual machines, and a few points deserve attention:

* Run at least 3 MONs and at most 5, on SSDs.
* MONs must sit on different physical machines and must be deployed across multiple racks, but be careful not to span IP segments; 10 GbE networking is best.
* Guaranteeing each MON 8 cores and 16 GB of RAM is basically enough.

As for OSD nodes, meet the following conditions:

* Design the crushmap failure-domain isolation in advance. 3 replicas are recommended; as for an EC scheme, it depends on whether you are confident you can handle EC data recovery after servers lose power in batches.
* Use a uniform disk configuration for OSDs; do not mix 4 TB and 8 TB disks, because weight control becomes cumbersome. Also tune the PG distribution of every OSD to avoid uneven distribution of performance pressure and capacity (a distribution check is sketched after this list); PG distribution tuning will be covered later.
* SSD journals and an SSD index pool are a must.
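A minimal sketch of the PG distribution check mentioned above: it parses `ceph osd df` JSON output and flags OSDs whose PG count deviates noticeably from the mean. The field names ("pgs", "utilization") match recent Ceph releases but may differ by version, so verify against your own cluster first.

```python
import json
import statistics
import subprocess

def pg_spread():
    # "ceph osd df -f json" lists per-OSD PG counts and utilization;
    # exact field names are an assumption that may vary by Ceph version.
    out = subprocess.check_output(["ceph", "osd", "df", "-f", "json"])
    nodes = json.loads(out)["nodes"]
    mean_pgs = statistics.mean(n["pgs"] for n in nodes)
    for n in sorted(nodes, key=lambda n: n["pgs"]):
        flag = "  <-- uneven" if abs(n["pgs"] - mean_pgs) > 0.2 * mean_pgs else ""
        print(f'{n["name"]:10s} pgs={n["pgs"]:4d} util={n["utilization"]:5.1f}%{flag}')

if __name__ == "__main__":
    pg_spread()
```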

RGW nodes are much easier:

* For RGW, play it straight at this scale; the money you save today will one day be paid back as tuition, in tears.
* The front-end load-balancing layer can take another round of optimization, for example more front-end bandwidth and concurrency, a read cache, or even DPDK.
* Use civetweb instead of fastcgi; it greatly improves deployment efficiency, and combined with Docker it enables rapid scale-out and elastic scaling.
* RGW can be deployed centrally on a few dedicated physical nodes, or you can consider co-locating it with OSDs.

3. Plan ahead and you stand; fail to plan and you fall

Hard disks are cheap; data is priceless. A storage system that handles data deserves a little extra awe. Be careful with high-risk operations, and do not carry the habit of casually restarting services and deleting data in the test environment into production, or you may one day pull off something you will never forget. In addition, do all kinds of testing before going live; if you wait until something breaks in production to cram, it may already be hopeless. The following tests are a must.

  • Failure drills and recovery: use Cosbench to generate reads and writes while simulating pulled disks, broken networks, rack power failures, and so on, to test both your crushmap failure-domain design and the basic skills of the operations staff. If you cannot get through this, the operations team is on its own once the system goes live.

  • Prepare independent client machines for performance stress tests; try not to mix clients and servers on the same machines, and isolate all network traffic generated by the stress test so that it does not affect the production environment.

  • The importance of the NTP service is worth a paragraph of its own. Check the clocks of all nodes before going live; raising mon_clock_drift_allowed to paper over drift is pure self-deception. A final reminder: for anything involving hardware replacement and maintenance (for example replacing a motherboard, memory, CPU, or RAID card after a shutdown), verify that the clock is correct before bringing the service back, or you are in for trouble.

  • Functional coverage testing depends on the strength of your QA. Honestly, neither the official test cases nor Cosbench will meet your expectations, so write the test cases yourself and, if conditions allow, keep a suite for the SDK of every language you support, because one day you will step on that pit. Before going live, put together an API compatibility list so you do not have to verify interface availability on the fly later. A minimal smoke test is sketched below.
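A minimal S3 smoke test against an RGW endpoint using boto3, as a starting point for that API compatibility list. The endpoint URL, credentials, and bucket name are placeholders, and a real suite should cover many more calls (multipart uploads, ACLs, lifecycle, and so on).

```python
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://rgw.example.com:7480",   # placeholder endpoint
    aws_access_key_id="ACCESS_KEY",               # placeholder credentials
    aws_secret_access_key="SECRET_KEY",
)

bucket, key, body = "smoke-test", "hello.txt", b"hello rgw"
s3.create_bucket(Bucket=bucket)
s3.put_object(Bucket=bucket, Key=key, Body=body)
assert s3.get_object(Bucket=bucket, Key=key)["Body"].read() == body
print([o["Key"] for o in s3.list_objects_v2(Bucket=bucket).get("Contents", [])])
s3.delete_object(Bucket=bucket, Key=key)
s3.delete_bucket(Bucket=bucket)
print("basic S3 calls OK")
```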

4. Devise strategies within the tent, win victories a thousand miles away

Once the system is live, making operations easy becomes an art in itself. Smart operations is seven parts tools and three parts experience. Faced with all the operations tools out there, writing your own scripts and reinventing the wheel is clearly unwise; worse, scripts alone cannot manage things at scale, and as staff turn over, the cost of script-based management keeps growing. I therefore recommend an operations framework suitable for small and medium-sized teams, built from the components described below.

Deployment tools

Ansible is recommended. Why not Puppet or SaltStack? To be honest, Puppet's Ruby syntax is not friendly to operations people, and Puppet is heavy: a client has to be deployed and maintained separately. It is widely used in production, but for operating Ceph it feels like using an ox cleaver to kill a chicken. As for SaltStack, its Python basis is familiar to most operations people, but the Ceph Calamari team and SaltStack lost out on both sides over version compatibility problems, and Calamari ended up an abandoned project, so I remain reserved about SaltStack. Ansible and Ceph were both acquired by Red Hat, and ansible is also what Ceph's official deployment tooling (ceph-ansible) is built on; it is SSH-based and agentless, similar to ceph-deploy, but ansible leans more toward engineering practice. ceph-deploy is fine at small scale, but for standardized operations ansible is the more reliable choice.

Log collection and management

ELK first. I will not introduce ELK's basic features; it is the go-to open source log management solution. If there is one thing to complain about, it is that learning GROK regular expressions is a bit painful, but you get used to it. Throw the MON/OSD/RGW/MDS logs into ELK, and the rest is down to your accumulated Ceph operations experience: get familiar with the Ceph logs, keep refining the conditions that trigger alerts for various anomalies, and integrate common hardware failure logs from disks, RAID cards and the like as well. Then you can quickly diagnose OSD disk failures from the logs instead of staring at a dead OSD with no idea what happened, and simply ask the data center to replace the disk for you.
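A minimal sketch of that log-shipping idea: convert Ceph daemon log lines into JSON documents that Filebeat or Logstash can forward to Elasticsearch, tagging lines that look like disk trouble. The keyword list is purely illustrative; build your own from the failure signatures you actually see in your logs.

```python
import json
import sys

# Illustrative keywords only (assumptions), not an exhaustive list of
# disk-failure signatures in Ceph or kernel logs.
SUSPECT = ("I/O error", "medium error", "FAILED assert")

def to_event(daemon, line):
    return {
        "daemon": daemon,
        "message": line.rstrip("\n"),
        "suspect_disk_failure": any(k.lower() in line.lower() for k in SUSPECT),
    }

if __name__ == "__main__":
    daemon, path = sys.argv[1], sys.argv[2]   # e.g. osd.3 /var/log/ceph/ceph-osd.3.log
    with open(path) as f:
        for line in f:
            print(json.dumps(to_event(daemon, line)))
```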

Asynchronous task scheduling

Why do we need asynchronous task scheduling? Because even after the ELK layer above has diagnosed the cause of a fault, someone still has to log into the specific machine to handle it. After introducing Celery, a distributed task scheduling middleware, the operations staff wrap the corresponding fault-handling operations into ansible playbooks. For example, when ELK detects a disk failure, it calls the operations playbook to mark the corresponding OSD out, umount it, use a tool such as MegaCli to light the disk's fault LED, and finally send an e-mail saying that the machine with IP XX in rack XX of data center XX needs a hard disk replaced. All that is left is to restore the data after the data center swaps the disk.
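A minimal sketch of that flow with Celery: a task that wraps a disk-replacement ansible playbook so an ELK alert (or a human) can trigger it asynchronously. The broker URL, playbook name, and variables below are hypothetical.

```python
import subprocess
from celery import Celery

app = Celery("ops", broker="redis://localhost:6379/0")   # broker URL is an assumption

@app.task(bind=True, max_retries=1)
def handle_failed_disk(self, host, osd_id, device):
    """Run a (hypothetical) playbook that marks the OSD out, unmounts it,
    and lights the drive's fault LED on the given host."""
    cmd = [
        "ansible-playbook", "replace_disk.yml",          # hypothetical playbook name
        "-l", host,
        "-e", f"osd_id={osd_id} device={device}",
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise self.retry(exc=RuntimeError(result.stderr), countdown=60)
    return result.stdout
```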

Message notification

Messaging tools such as WeChat greatly ease communication within the team, especially when you are off duty: triggering an ansible playbook through a chat bot to handle a failure is when you really taste the ups and downs of operations. When the data center has replaced the disk, they send an "operation complete" message to the WeChat bot; the bot creates an OSD recovery task and sends the corresponding execution request to the operations engineer, who only has to confirm the operation with the bot, and the bot then keeps them posted on the recovery progress by message. Handling a failure while chatting away is by no means a luxury.
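A minimal sketch of such a chat-driven flow using wxpy (linked below): the bot listens for a confirmation message and queues the recovery through the Celery task sketched above. The message format and the imported task module are hypothetical.

```python
from wxpy import Bot
from tasks import handle_failed_disk   # hypothetical module holding the Celery task above

bot = Bot(cache_path=True)             # scan the login QR code once, then reuse the session

@bot.register()                        # react to incoming messages
def on_message(msg):
    # Expected message format (an assumption): "recover <host> <osd_id> <device>"
    parts = msg.text.split()
    if len(parts) == 4 and parts[0] == "recover":
        handle_failed_disk.delay(parts[1], parts[2], parts[3])
        msg.reply("Recovery task queued for %s osd.%s" % (parts[1], parts[2]))

bot.join()                             # block and keep listening
```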

Finally, the recommended tools:

https://github.com/ceph/ceph-ansible
https://www.elastic.co/cn/products
http://docs.celeryproject.org/en/latest/index.html
http://wxpy.readthedocs.io/zh/latest/index.html
http://ansible-tran.readthedocs.io/en/latest/

Author: Qin Yangmu

Original: "Live-fire drill: combat tactics based on Ceph object storage", OSCHINA (Chinese open source technology community)

