
A Retrospective on a 618 Promotion Project

2020-11-07 20:19:00 【The beginning】

I. Preface

We launched a campaign for the 618 promotion, but the launch did not go well. Performance problems appeared on launch day: interfaces timed out, users could not open the page, and in the end we had to take the campaign offline temporarily. It took three days and two nights of refactoring the core back-end code before the campaign could go on.

I looked back at my time log. From 8:25 p.m. on May 31, when we were getting ready to go live, through finding the anomaly, analyzing the cause and refactoring the code, to leaving the office at 23:54 on June 2, it came to 51 hours 29 minutes, with less than 5 hours of sleep in between. I was running far beyond my normal limits.

No sooner had one problem settled than the next one arose. Because the campaign offered rewards, a large number of bonus hunters (the "wool party") showed up, along with people selling scripts on a well-known e-commerce site, and normal users were seriously squeezed out of collecting their share of the wool.

One customer complained: "Never mind collecting a bit of wool. They have carried off the whole sheep!"

For the next few days we were locked in a battle of wits with the script users, right up to the end of the campaign.

This article is an in-depth retrospective on all of this, in the hope of making future projects a little less painful.

II. A seemingly perfect project summary

When we review the project, we can find plenty of problems, such as:

We were understaffed, the requirements were too complex, and the development and testing workload was heavy.
Front-end and back-end developers and testers were borrowed from other teams and were unfamiliar with the business and technology of this project.
The team was assembled ad hoc across teams, responsibilities were not clearly defined, and project control was lax.
Developers were unfamiliar with the technology used in the project, and the code did not go through CodeReview by the original project members.
Test sign-off was too sloppy, and the load-test plan was poorly designed.

....

With the problems listed, the improvement points can quickly be written down one by one.

Strengthen project planning at the company level, avoid duplicate projects, and concentrate resources on a few key campaigns.
Build up the team's capabilities and write summary documents for newcomers to learn from.
Run CodeReview on core code; when problems arise, the project manager coordinates senior developers to help resolve them.
Define the responsibilities of the temporary team clearly, and make sure every owner communicates them clearly.
Strictly control test quality; testers have veto power over going live.

...

Nothing seems wrong with this summary. It lists the problems, it lists the improvements, and it could even serve as a template. So are we done?

Of course not. Nothing it says is wrong in itself. The mistake is that it treats the premises of the problem as the causes of the problem.

Consider two statements.

Next time we will build an experienced project team, so as to avoid quality problems.
Next time we face a temporarily assembled, inexperienced project team, how do we avoid quality problems?

What is the difference between the two statements?

The former says that "the team" caused the quality problems, so what we have to fix is "the team".

The latter says that our team was assembled temporarily and that our developers and testers were unfamiliar with the business and technology of the new project. Quality problems occurred under that premise, so the question is: given that premise, how do we avoid them?

Ad hoc assembly and inexperience are not the causes of the problem. They are its premises, and they are objective facts.

It is like the joke that the fastest way to deal with a problem is not to solve the problem but to deal with the person who has it.

We can hardly say the same here: that we will not solve the problem, we will just deal with the team that has it.

Because of this misconception, when a project has quality problems we often pin the blame on poor team collaboration or a tight schedule, promise to do better next time, and leave it at that.

Catch-all answers like these sound right, but they can rarely be put into practice.

A tight schedule and a new team's lack of experience working together are clearly objective facts. Without them there would have been no problem in the first place, so how can we treat them as the problem to be solved?

1. Key causes of the quality problem

With this premise in mind, we can go back to the earlier summary and filter out the points that are really valuable.

We can also ask: problems are inevitable, but why were our performance problems not exposed during the project?

Three angles ：

From the project perspective: the project process was not followed strictly. In the final phase in particular, testing was rushed, and the engineers issued the test report even while there were still many bugs.
From the development perspective: nobody familiar with the business and technology was asked to do a CodeReview.
From the testing perspective: the load-test plan was poorly designed and did not match the real scenario.

Let's analyze them one by one.

The incident described earlier was a back-end performance problem. From the project perspective, even a rigorous process would not have exposed it. During the project, the risk that surfaced was a shortage of front-end engineers, and more people were added midway. Judged by feature progress, the back end was completely on track.

Now the development perspective. I am not listing the developers' inexperience as a cause, and that is not shirking responsibility. Like the temporary team's unfamiliarity with the business, it is an objective premise. When you take on a new project or use a new technology, inexperience is a given.

The real question is how to get the job done when you are inexperienced, and the main way is to do a CodeReview with colleagues who know the business and the technology.

Then the testing perspective. Functional testing was fine, but the performance load-test plan was flawed, and nobody paid attention to it from the start. For the first load-test plan, development only supplied the interface and parameter documents and threw them over to the testers to run. In hindsight, that was wrong.

So the key points of this quality problem can be summarized as follows.

Next time we face a temporarily assembled, inexperienced project team and a high-traffic business requirement, developers need to:

Ask colleagues familiar with the business and technology to help with CodeReview.
Design a load-test plan that matches the business scenario.

Both points are actionable. This is not to say that project management has nothing to improve. Rather, making sure of these two points first reduces risk more effectively.

There is not much more to say about CodeReview, so let's look at the improvements we made to the load-test plan.

2. Three rounds of load-test improvements

Single user, single interface, two-machine load test
Random users, multiple interfaces, full-scale load test
Random users, interfaces grouped by function, full-scale load test

The first load-test plan used one user, two servers and one cache shard, and then simply multiplied the average per-server QPS by the number of machines deployed in production to get the load-test result.

This approach is fine for the scenario on the left of the figure below, where every server on the call chain can scale out elastically together.

But in the scenario on the right there is a bottleneck on the call chain, for example a single-node database that cannot scale out, and then the approach breaks down.

Likewise, the problem in this project was that Redis became a single-node bottleneck. In addition, because the user ID was constant, the cache was very likely to be reused, which made it hard to exercise the scenario of frequent cache creation.
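To make the gap concrete, here is a minimal sketch of the two ways of estimating capacity. Every number and function name in it is a hypothetical assumption for illustration, not a measurement from the project: the first function is the round-one extrapolation, the second caps it at the ceiling of a shared node that cannot scale out.

```python
def naive_capacity(per_server_qps: float, servers: int) -> float:
    """Round-one estimate: average per-server QPS times production server count."""
    return per_server_qps * servers


def bottleneck_capacity(per_server_qps: float, servers: int,
                        shared_node_qps: float) -> float:
    """Capacity when every request also passes through one non-scalable node."""
    return min(per_server_qps * servers, shared_node_qps)


if __name__ == "__main__":
    # Hypothetical figures for illustration only.
    per_server_qps = 500.0       # average measured on the test servers
    production_servers = 40
    single_shard_qps = 8_000.0   # assumed ceiling of the one shard every request hits

    print(naive_capacity(per_server_qps, production_servers))
    print(bottleneck_capacity(per_server_qps, production_servers, single_shard_qps))
```

Under these assumed figures the naive estimate is well above what the shared node can serve, and adding more application servers does not change the second number at all.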

After the system was refactored, we improved the load-test plan: random user IDs, polling the interfaces in batches, and elastic scaling of the test environment so that it simulated the production deployment as fully as possible.

We also added a degradation switch so that checks such as legality, risk control and timeliness verification could be turned off temporarily, letting load-test requests run through the entire main flow.
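A degradation switch of this kind can be sketched as follows. This is only an illustration under assumptions: the switch names, the request fields and the rejection messages are all hypothetical, and a real system would read the switches from a configuration service rather than a local object.

```python
from dataclasses import dataclass


@dataclass
class DegradeSwitches:
    """Hypothetical switches; True means the check is temporarily bypassed."""
    skip_legality_check: bool = False
    skip_risk_control: bool = False
    skip_timeliness_check: bool = False


def handle_request(request: dict, switches: DegradeSwitches) -> str:
    """Run the pre-checks that are still enabled, then the main flow."""
    if not switches.skip_legality_check and not request.get("legal", False):
        return "rejected: legality"
    if not switches.skip_risk_control and request.get("risky", False):
        return "rejected: risk control"
    if not switches.skip_timeliness_check and request.get("expired", False):
        return "rejected: expired"
    return "main flow executed"


if __name__ == "__main__":
    # A synthetic load-test request would normally be stopped by the pre-checks.
    synthetic = {"legal": False, "risky": True, "expired": True}
    print(handle_request(synthetic, DegradeSwitches()))
    print(handle_request(synthetic, DegradeSwitches(True, True, True)))
```

With the switches off, forged load-test traffic is rejected before it touches the code path that matters; with them on, the same traffic exercises the main flow. The switches need to be restored before real users arrive.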

Then, building on this plan, we grouped the interfaces, fabricated suitable data, wrote scripts close to real calling behavior, and ran the load test again.
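The idea of random users plus function-grouped interfaces can be sketched as a request plan that a load generator would replay. The interface paths, group names and weights below are hypothetical assumptions; the real grouping would come from the project's own interfaces and observed traffic.

```python
import random

# Hypothetical interface groups; each group models one user journey.
INTERFACE_GROUPS = {
    "browse": ["/activity/home", "/activity/tasks"],
    "claim": ["/activity/tasks", "/activity/claim", "/activity/rewards"],
}
# Hypothetical share of sessions that follow each journey.
GROUP_WEIGHTS = {"browse": 0.7, "claim": 0.3}


def build_request_plan(n_sessions: int, user_id_range: range, seed: int = 0):
    """Return a list of (user_id, path) pairs for the load generator to replay."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    groups = list(INTERFACE_GROUPS)
    weights = [GROUP_WEIGHTS[g] for g in groups]
    plan = []
    for _ in range(n_sessions):
        user_id = rng.choice(user_id_range)      # a random user per session
        group = rng.choices(groups, weights=weights, k=1)[0]
        for path in INTERFACE_GROUPS[group]:     # call the group's interfaces in order
            plan.append((user_id, path))
    return plan


if __name__ == "__main__":
    plan = build_request_plan(5, range(1, 100_001), seed=42)
    for user_id, path in plan:
        print(user_id, path)
```

Because user IDs vary from session to session, cache entries are created rather than reused, which is exactly the behavior the single-user plan could not exercise.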

In terms of execution, we also went from development supplying data with testing solely responsible, to testing leading with development participating, and finally to development leading with testing assisting.

As a result, the load-test plan came closer and closer to the real scenario, and its conclusions became more reliable.

3. Design for high-concurrency scenarios

I mentioned that poor system design led to the performance problem. Let's analyze the root cause.

The first thing to understand is that a Redis cluster consists of multiple shards, and which shard a piece of data is written to is determined by the hash of its key.

For example, suppose we keep a cache of user information in Redis and access it by ID.

If we use Redis's built-in hash structure, the calls look like this:

Write: redis.hset("userMap", ID, userInfo)

Read: redis.hget("userMap", ID)

Because the key is the constant userMap, all user information is written to a single shard.

A basic principle in designing distributed systems is to spread traffic as evenly as possible across the machines in the cluster.

A fixed key cannot take advantage of distribution. Under high concurrency a single shard has to absorb all the traffic, and if there are hundreds of thousands of users and some operation reads all the data at once, it is a disaster.

In practice, it is more efficient to treat the whole Redis cluster as one hash table.

Write: redis.set("userMap" + ID, userInfo)

Read: redis.get("userMap" + ID)

Here key = "userMap" + ID. Different IDs give different keys, which are scattered across the shards, so requests are shared by the whole cluster and the distributed system can perform to its full capacity.
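The difference between the two key designs can be checked with a small offline sketch, with no Redis server involved. It assumes the open-source Redis Cluster scheme, in which a key's hash slot is CRC16 of the key modulo 16384. The mapping from slots to shards in equal contiguous ranges, the shard count and the user IDs are illustrative assumptions, and a different sharding layer would hash differently while showing the same effect.

```python
from binascii import crc_hqx
from collections import Counter

TOTAL_SLOTS = 16384  # number of hash slots in Redis Cluster


def key_slot(key: str) -> int:
    """Hash slot of a key: CRC16 (XMODEM) mod 16384. Hash tags are not handled."""
    return crc_hqx(key.encode("utf-8"), 0) % TOTAL_SLOTS


def shard_of(key: str, num_shards: int) -> int:
    """Assumption: slots are split into equal contiguous ranges, one per shard."""
    return key_slot(key) * num_shards // TOTAL_SLOTS


def shard_load(keys, num_shards: int) -> Counter:
    """Count how many requests each shard would receive."""
    return Counter(shard_of(k, num_shards) for k in keys)


if __name__ == "__main__":
    user_ids = range(10_000)
    # hset/hget on "userMap": the routing key is always "userMap".
    print(shard_load(("userMap" for _ in user_ids), 8))
    # set/get on "userMap" + ID: one routing key per user.
    print(shard_load((f"userMap{uid}" for uid in user_ids), 8))
```

The first counter has a single entry because the field inside a hash plays no part in routing; only the key does. The second counter has an entry for each shard.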

III. The black market and the wool party

After the project went live, another problem we had not paid attention to appeared: large numbers of black-market operators and wool-party members, whose scripts took all the campaign rewards.

We had given too little thought to the black market. Only simple risk-control checks were in place, with nowhere near enough detection of abnormal users, so black-market operators could hit the interfaces with scripts at scale.

There are two lessons here:

For campaigns that offer cash, cash equivalents or high-value rewards, expect to face the black market.
In a large company, let professionals do professional work: based on the business scenario, talk to the risk-control team in advance.

On the first point: basically anything worth a little money will attract the black market. They get something for nothing, and whatever they grab is pure profit. Put yourself in their position and, given the business scenario, think about how you would exploit your own system.

Building on the first point: if the company has no risk-control team, you have to do it yourself. Companies of any real size generally have their own risk-control team, so make good use of existing resources.

Risk control mainly involves two things:

If there is a risk-control team, plug into their general risk-control model.
For the project's business scenario, customize some risk-control models.

The general risk-control model basically builds user profiles from behaviors such as new versus old accounts, logins from unusual locations, and human-machine identification, and applies them through offline computation and real-time verification.

The custom model depends on the situation: for example, maintaining a separate small blacklist whose users cannot take part in this campaign.

Intercepted users are usually sent through a verification code or blocked outright. For the latter, don't forget to give the customer service team a heads-up so they can prepare what to say to customers.
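How the general profile, the campaign blacklist and the two interception outcomes fit together can be sketched as a single real-time decision function. This is a toy illustration, not the risk-control team's model: the profile fields, thresholds and return values are all hypothetical assumptions.

```python
from dataclasses import dataclass


@dataclass
class UserProfile:
    """Hypothetical profile fields, assumed to be produced by offline computation."""
    user_id: str
    account_age_days: int
    login_city: str
    usual_city: str
    bot_score: float  # 0.0 (human-like) to 1.0 (script-like)


def risk_decision(profile: UserProfile, campaign_blacklist: set,
                  new_account_days: int = 7, bot_threshold: float = 0.9) -> str:
    """Real-time check returning 'block', 'captcha' or 'allow'."""
    if profile.user_id in campaign_blacklist:
        return "block"    # custom, campaign-specific blacklist
    if profile.bot_score >= bot_threshold:
        return "block"    # human-machine identification
    suspicious = (profile.account_age_days < new_account_days
                  or profile.login_city != profile.usual_city)
    return "captcha" if suspicious else "allow"


if __name__ == "__main__":
    regular = UserProfile("u1", 400, "Beijing", "Beijing", 0.1)
    fresh = UserProfile("u2", 1, "Beijing", "Beijing", 0.2)
    print(risk_decision(regular, set()))
    print(risk_decision(fresh, set()))
    print(risk_decision(regular, {"u1"}))
```

Keeping the thresholds as parameters makes it possible to tighten them during a campaign without redeploying the decision logic.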

IV. Conclusion

Finally, a summary of the lessons from this project.

First, the premises:

When you face a temporarily assembled, inexperienced project team.
When you face high traffic and a campaign that includes cash or cash equivalents.

Please do these three things well:

Find developers familiar with the project's business and technology to take part in the design and the CodeReview.
Have developers take an active part in load testing and in designing the load-test plan, simulating the real scenario as closely as possible.
Be prepared to deal with the black market until the end of the promotion.

From a developer who worked 51 hours 29 minutes of overtime in one stretch and was roasted by users for letting the whole sheep get carried off.

Copyright notice
This article was written by [The beginning]. Please include a link to the original when reposting. Thank you.