I. Preface
We launched a campaign for the 618 shopping festival, but the launch did not go well. Performance problems hit that day: interfaces timed out, users could not open the page, and in the end we had to take the campaign offline temporarily. It took three days and two nights of refactoring the core back-end code before the campaign could go back online.
Looking back at my time log: from 8:25 p.m. on May 31, when I was getting ready for the release, through finding the anomaly, analyzing the cause, and refactoring the code, until I left the office at 11:54 p.m. on June 2, 51 hours and 29 minutes went by, with less than 5 hours of sleep in between. It was brutal.
No sooner had that wave passed than another rose. Because the campaign offered rewards, a large number of bonus hunters (the so-called "wool party") showed up, and scripts for the campaign were even being sold on a certain shopping site, which seriously hurt normal users' chances of collecting the rewards.
One customer's feedback: forget about collecting a bit of wool, they have now carried off the whole sheep!
For the next few days we fought a battle of wits with the script users, right up to the end of the campaign.
This article is a thorough retrospective of all that, in the hope of making future projects a little less painful.
II. A seemingly perfect project summary
When we review how the project went, we can find plenty of problems, for example:
- We were short-staffed, the requirements were too complex, and the development and testing workload was heavy.
- Front-end developers, back-end developers, and testers were borrowed from other teams and were unfamiliar with the project's business and technology.
- The team was formed ad hoc across teams; responsibilities were not clearly defined, and project control was lax.
- Developers were unfamiliar with the technology used in the project, and the code was not reviewed (CodeReview) by the original project members.
- Test sign-off was too sloppy, and the load-test plan was poorly designed.
...
With the problems listed, the improvement points almost write themselves:
- Strengthen project planning at the company level, avoid duplicate projects, and concentrate resources on a few key campaigns.
- Build up the team's capabilities and write summary documents for newcomers to learn from.
- Do CodeReview on the core code; when problems come up, the project manager coordinates senior developers to help solve them.
- Clearly define the responsibilities of the temporary team, and make sure every owner communicates clearly.
- Strictly control test quality; testing has veto power over the release.
...
This summary looks fine: it lists the problems, lists the improvements, and could even serve as a template. So are we done?
Of course not. Nothing it says is wrong in itself; the mistake is treating the premises of the problem as its causes.
Consider two statements.
- Next time we should assemble an experienced project team, to avoid quality problems.
- Next time we face an ad hoc, inexperienced project team, how do we avoid quality problems?
What is the difference between the two?
The first says that the quality problem was caused by "the team", so what we have to fix is "the team".
The second says that our team was formed ad hoc, and our developers and testers were unfamiliar with the new project's business and technology; the quality problem arose under that premise. So, given that premise, how do we avoid quality problems?
Being an ad hoc team and lacking experience are not causes of the problem. They are its premises, and they are objective facts.
It is like the joke that the fastest way to solve a problem is not to solve the problem itself but to get rid of the person who has it.
But we can hardly say: let's not solve the problem, let's just get rid of the team that has it.
Because of this misunderstanding, when a project has quality problems we often just pin the blame on poor team collaboration or a tight schedule, promise to improve next time, and leave it at that.
Such a catch-all answer sounds right, but it usually cannot be put into practice.
A tight schedule and a new team's lack of experience are plainly objective conditions. Without them there would have been no problem in the first place, so how can we "solve" them as if they were the problem itself?
1. Key causes of the quality problem
With that premise in mind, we can go back to the earlier summary and filter out the points that are truly valuable.
We can also ask: problems are inevitable, but why was our performance problem not exposed during the project?
Look at it from three angles:
- Project: the project process was not strictly followed. In particular, final testing was rushed, and the engineer issued the test report even though there were still many bugs.
- Development: nobody familiar with the business and technology was brought in to do CodeReview.
- Testing: the load-test plan was poorly designed and did not match the real scenario.
Let's analyze them one by one.
The incident described earlier was a back-end performance problem. From the project perspective, even a rigorous process would not have exposed it. During the project, the risk that surfaced was a shortage of front-end developers, and more people were added midway; functionally, back-end progress looked completely normal.
Now the development perspective. I did not list the developers' lack of experience, and that is not dodging responsibility. Like our temporary team's unfamiliarity with the business, it is an objective premise: when you take on a new project and use new technology, inexperience is a given.
The real question is how to get the job done when you are inexperienced, and doing CodeReview with colleagues who know the business and technology is the main means.
Finally, the testing perspective. Functional testing was fine, but the performance-related load-test plan was flawed, and nobody paid attention to it at the start. In the first load-test plan, development only supplied the interface and parameter documents and handed them straight to testing to run the load test. In hindsight, that was wrong.
So the key points of this quality problem can be summarized as follows.
Next time we face an ad hoc, inexperienced project team and a high-traffic business requirement, developers need to:
- Ask colleagues who know the business and technology to help with CodeReview.
- Design a load-test plan that matches the business scenario.
Both points are actionable. This is not to say that project management has nothing to improve; it is a matter of priority, of doing first what reduces risk most effectively.
There is not much more to say about CodeReview, so let's look at the improvements we made to the load-test plan.
2. Three rounds of load-test improvements
- Single user, single interface, two-machine load test
- Random users, multiple interfaces, full-scale load test
- Random users, interfaces grouped by function, full-scale load test
The first load-test plan used one user, two servers, and one cache shard, then simply multiplied the average per-server QPS by the number of machines deployed in production and took that as the load-test result.
In the scenario on the left of the figure below, where every server on the call chain can scale out elastically together, this approach is fine.
But in the scenario on the right, there is a bottleneck on the call chain, for example a database that is a single node and cannot scale out, and then the estimate breaks down.
Likewise, the problem in this project was that Redis became a single-node bottleneck. In addition, because the user ID was fixed, the cache was very likely to be reused, which made it hard to exercise the scenario where cache entries are created frequently.
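To make the flaw in the first estimate concrete, here is a minimal Python sketch. The tier names and QPS figures are invented for illustration and are not measurements from this project. It compares the naive estimate (average per-server QPS times machine count) with an estimate capped by the slowest tier on the call chain.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tier:
    """One hop on the call chain. All numbers used below are hypothetical."""
    name: str
    qps_per_node: float  # QPS a single node sustained under load
    nodes: int           # nodes of this tier deployed in production

    def capacity(self) -> float:
        return self.qps_per_node * self.nodes


def naive_estimate(qps_per_server: float, servers: int) -> float:
    """First-round estimate: per-server QPS multiplied by machine count."""
    return qps_per_server * servers


def chain_estimate(tiers: List[Tier]) -> float:
    """End-to-end throughput cannot exceed the slowest tier on the chain."""
    return min(t.capacity() for t in tiers)


def bottleneck(tiers: List[Tier]) -> Tier:
    """The tier with the lowest total capacity."""
    return min(tiers, key=lambda t: t.capacity())


if __name__ == "__main__":
    # Hypothetical deployment: 20 app servers in front of one cache shard.
    tiers = [Tier("app server", 500, 20), Tier("redis shard", 3000, 1)]
    print("naive estimate:", naive_estimate(500, 20))
    print("chain estimate:", chain_estimate(tiers),
          "limited by", bottleneck(tiers).name)
```

The point of the sketch is only that adding application servers raises the naive number while the chain estimate stays pinned to the tier that cannot scale.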
After the system was refactored, the improved load-test plan used random user IDs, polled the interfaces in batches, and scaled the test environment elastically so that it fully simulated the production deployment.
We also added degradation switches to temporarily turn off checks such as legal, risk control, and timeliness validation, so that load-test requests could run through the entire main flow.
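A degradation switch of this kind could look roughly like the following Python sketch. The check names, the `claim_reward` function, and the in-memory switch set are all assumptions made for illustration; the original system's implementation is not described here, and a real one would more likely read its switches from a configuration service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

Check = Callable[[str], bool]  # takes a user ID, returns True if the check passes


@dataclass
class DegradeSwitches:
    """Names of checks that are temporarily switched off (hypothetical design)."""
    disabled: Set[str] = field(default_factory=set)

    def is_enabled(self, check_name: str) -> bool:
        return check_name not in self.disabled


def claim_reward(user_id: str, checks: Dict[str, Check],
                 switches: DegradeSwitches) -> str:
    """Run every enabled check, then continue into the main flow."""
    for name, check in checks.items():
        if not switches.is_enabled(name):
            continue  # check is degraded, skip it
        if not check(user_id):
            return f"rejected by {name}"
    return "reward granted"  # stands in for the real main flow


if __name__ == "__main__":
    # Hypothetical checks: risk control rejects synthetic load-test users.
    checks = {
        "legal": lambda uid: True,
        "risk_control": lambda uid: not uid.startswith("loadtest-"),
        "timeliness": lambda uid: True,
    }
    print(claim_reward("loadtest-42", checks, DegradeSwitches()))
    print(claim_reward("loadtest-42", checks, DegradeSwitches({"risk_control"})))
```

With the switch off, synthetic users stop at the first failing check and never exercise the code behind it; with the switch on, the request reaches the main flow.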
Then, building on this plan, we grouped the interfaces, forged suitable data, wrote scripts that approximate real calling behavior, and ran the load test again.
In terms of execution, we also went from development supplying the data and testing bearing sole responsibility, to testing leading with development participating, and finally to development leading with testing assisting.
As a result, the load-test plan got closer and closer to the real scenario, and its conclusions became more reliable.
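The last two rounds can be sketched as a request-plan generator in Python. The interface paths, group names, and traffic weights below are hypothetical; in practice they would be derived from the real interfaces and expected user behavior. The sketch only builds the plan (which user calls which interface); sending the requests is left to whatever load-testing tool is in use.

```python
import random
from typing import List, Optional, Tuple

# Hypothetical interface groups: (paths in the group, share of total traffic).
INTERFACE_GROUPS = {
    "browse": (["/activity/home", "/activity/rewards"], 0.7),
    "claim": (["/activity/claim"], 0.2),
    "share": (["/activity/share", "/activity/invite"], 0.1),
}


def build_plan(total_requests: int, user_pool_size: int,
               seed: Optional[int] = None) -> List[Tuple[int, str, str]]:
    """Return (user_id, group, path) tuples for a load-test run."""
    rng = random.Random(seed)
    groups = list(INTERFACE_GROUPS)
    weights = [INTERFACE_GROUPS[g][1] for g in groups]
    plan = []
    for _ in range(total_requests):
        # Random user IDs keep one user's cache entry from being reused.
        user_id = rng.randrange(1, user_pool_size + 1)
        # Pick a functional group by weight, then one interface inside it.
        group = rng.choices(groups, weights=weights, k=1)[0]
        path = rng.choice(INTERFACE_GROUPS[group][0])
        plan.append((user_id, group, path))
    return plan


if __name__ == "__main__":
    for user_id, group, path in build_plan(5, user_pool_size=100_000, seed=1):
        print(user_id, group, path)
```

Compared with the first round, the user ID is no longer fixed, and the mix of interfaces follows weighted groups instead of hammering a single interface.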
3. Design for high-concurrency scenarios
Earlier I said that an unsound system design caused the performance problem. Let's analyze the root cause.
The first thing to understand is that a Redis cluster is made up of multiple shards, and which shard a piece of data is written to is determined by the hash of its key.
For example, suppose we keep a cache of user information in Redis and access it by ID.
If we use Redis's built-in Hash structure, it is written like this:
Write: redis.hset("userMap", ID, userInfo)
Read: redis.hget("userMap", ID)
Because the key is the constant userMap, all user information is written to a single shard.
A basic principle in distributed system design is to spread traffic as evenly as possible across the machines in the cluster.
A fixed key cannot take advantage of distribution. Under high concurrency, a single shard has to absorb all the traffic; and if there are hundreds of thousands of users and some operation reads all the data in one go, it is a disaster.
In practice, it is more efficient to treat the whole Redis cluster as one big hash table:
Write: redis.set("userMap" + ID, userInfo)
Read: redis.get("userMap" + ID)
Here key = "userMap" + ID. Different IDs produce different keys, which are scattered across shards, so requests are shared by the whole cluster and the distributed system can perform to its potential.
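The difference between the two key designs can be checked offline with a short Python sketch. It relies on the documented Redis Cluster rule that a key's hash slot is CRC16(key) mod 16384; the mapping from slots to three equally sized shards is an assumption for illustration, and hash tags (the `{...}` syntax) are ignored because the keys here do not use them.

```python
import binascii
from collections import Counter
from typing import Iterable

SLOTS = 16384  # Redis Cluster divides the key space into 16384 hash slots


def key_slot(key: str) -> int:
    """Hash slot of a key: CRC16 (XMODEM variant) modulo 16384.

    binascii.crc_hqx with an initial value of 0 uses the same 0x1021
    polynomial. Hash tags are not handled in this sketch.
    """
    return binascii.crc_hqx(key.encode("utf-8"), 0) % SLOTS


def shard_of(slot: int, shards: int) -> int:
    """Assumed layout: slots split into equal contiguous ranges per shard."""
    return slot * shards // SLOTS


def shard_load(keys: Iterable[str], shards: int = 3) -> Counter:
    """Count how many key accesses land on each shard."""
    return Counter(shard_of(key_slot(k), shards) for k in keys)


if __name__ == "__main__":
    user_ids = range(1000)
    # hset("userMap", ID, ...): every access uses the same top-level key.
    print("fixed key:   ", dict(shard_load("userMap" for _ in user_ids)))
    # set("userMap" + ID, ...): one top-level key per user.
    print("per-user key:", dict(shard_load(f"userMap{i}" for i in user_ids)))
```

With the fixed key, every access lands on one shard regardless of the user ID, because only the top-level key is hashed; with per-user keys the accesses are divided among the shards.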
III. Black-market operators and the wool party
After launch, another problem we had not paid attention to appeared: large numbers of black-market operators and bonus hunters (the "wool party"). The campaign rewards were all taken by these script users.
We had given far too little thought to black-market abuse in advance. We only did simple risk-control checks and had nowhere near enough detection of abnormal users, so the black-market operators could hit the interfaces with scripts at scale.
There are two lessons here:
- For campaigns that involve cash, cash equivalents, or high-value rewards, expect black-market abuse.
- In a big company, have professionals do professional work: based on the business scenario, talk to the risk-control team in advance.
On the first point: as long as something is worth a little money, the black market will not pass it up. It costs them nothing, and whatever they grab is pure profit. Put yourself in their shoes and think about how, given the business scenario, you would farm your own system.
Building on that, if the company has no risk-control team you have to do it yourself; but companies of any reasonable size generally have their own risk-control team, so make good use of existing resources.
Risk control mainly covers two aspects:
- If there is a risk-control team, integrate with their general-purpose risk-control model.
- Customize some risk-control models for the project's business scenario.
A general risk-control model basically builds user profiles from user behavior such as new versus old accounts, logins from unusual locations, and human-machine identification, using offline computation combined with real-time checks.
Custom models depend on the situation, for example setting up a campaign-specific blocklist so that the listed users cannot take part in this campaign.
Intercepted users are usually sent through a CAPTCHA or blocked outright. For the latter, don't forget to give the customer service team a heads-up so they can prepare what to tell customers.
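As a rough illustration of how a general model and a custom rule might combine at request time, here is a small Python sketch. The signals, score weights, and thresholds are invented for the example and do not describe any real risk-control system; a production model would be supplied and tuned by the risk-control team.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Set


class Action(Enum):
    ALLOW = "allow"
    CAPTCHA = "captcha"
    BLOCK = "block"


@dataclass
class Profile:
    """Hypothetical user profile built from offline and real-time signals."""
    user_id: str
    account_age_days: int
    unusual_login_location: bool
    looks_automated: bool  # outcome of some human-machine identification step


def decide(profile: Profile, campaign_blocklist: Set[str]) -> Action:
    """Custom rule first, then a simple score over the general signals."""
    # Custom model: users on the campaign blocklist cannot take part.
    if profile.user_id in campaign_blocklist:
        return Action.BLOCK
    # General model, reduced to an illustrative score (weights are made up).
    score = 0
    if profile.account_age_days < 7:
        score += 1
    if profile.unusual_login_location:
        score += 1
    if profile.looks_automated:
        score += 2
    if score >= 3:
        return Action.BLOCK
    if score >= 1:
        return Action.CAPTCHA
    return Action.ALLOW


if __name__ == "__main__":
    blocklist = {"u-666"}
    print(decide(Profile("u-1", 400, False, False), blocklist))
    print(decide(Profile("u-2", 2, False, False), blocklist))
    print(decide(Profile("u-3", 2, False, True), blocklist))
    print(decide(Profile("u-666", 400, False, False), blocklist))
```

Suspicious but uncertain users get a CAPTCHA, while blocklisted or strongly suspicious users are blocked outright, which matches the two handling paths described above.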
IV. Conclusion
Finally, a summary of the lessons from this project.
First, the premises:
- You face an ad hoc, inexperienced project team.
- You face high traffic and a campaign that includes cash or cash equivalents.
Then make sure to do these three things:
- Find developers who know this project's business and technology to take part in design and CodeReview.
- Have developers actively take part in load testing and design the load-test plan, simulating the real scenario as closely as possible.
- Be prepared to deal with black-market operators until the end of the promotion.
From a developer who worked 51 hours and 29 minutes of continuous overtime and was roasted by users for letting the whole sheep get taken.