
Tiktok Payment in Practice: Issuing Coupons at 100,000 TPS

2022-06-23 23:13:00 ByteDance technical team


Background

During recent Spring Festivals, Tiktok has run a variety of holiday campaigns, with hundreds of millions of users taking part each year. For Spring Festival 2022, Tiktok payment also took part, issuing Tiktok payment coupons to a large number of users to give them a better experience during the festivities. This was a major challenge: the Tiktok payment team had never handled a Spring Festival scenario before, so the campaign was a serious test of the Tiktok payment marketing system.

Introduction to the Tiktok payment marketing system

The current layering and architecture of the marketing system are shown below:

9dd59bd8d759ff9b0122ad170d9f65c7.png

The marketing team's business falls into three areas: marketing delivery, marketing activities, and marketing assets. Marketing delivery is responsible for placing marketing benefits and exposing them to users. Marketing activities is responsible for building marketing mechanics and distributing benefits. Marketing assets manages users' marketing assets, such as the issuance and use of coupons and instant discounts. For Spring Festival coupon issuance, marketing activities integrates with the Spring Festival main venue and calls the marketing assets interface to issue payment coupons to users.

Challenges

  • Spring Festival campaigns have very high concurrency. Large numbers of users claim coupons at the same moment, so the system takes a heavy hit at peak times.

  • The campaign targets users nationwide, so the audience is very broad and user experience matters a great deal. The latency of the coupon issuance call therefore needs to be as low as possible.

  • With so many participants, a large volume of payment coupons is expected to be issued during the Spring Festival, so fund safety needs particular protection.

Solution

Performance guarantee

Asynchronous issuance - Improve interface response speed

Most payment coupons are used in Tiktok e-commerce scenarios, and users shop online relatively little during the Spring Festival, so the probability that a user spends a coupon immediately after claiming it is low. As long as the delay between claiming a coupon and actually perceiving it (viewing it or using it) is kept under control, the user experience is not affected. We therefore adopted asynchronous issuance: when the marketing system receives an issuance request from upstream, it persists the request record and returns immediately, telling upstream that the request was accepted.

c70ab9933a16eccb1d69eb4ef0c38b39.png

One problem with asynchronous issuance is that the user may see the claim succeed while the coupon is never actually issued, mostly because of insufficient inventory or risk-control interception. We discussed this with our operations colleagues: during the Spring Festival, campaign inventory would be allocated as generously as possible and risk-control interception would be kept to a minimum, blocking only abnormal fraudulent claims, to reduce the chance that an asynchronous issuance fails.

Two-tier local queues - Improve processing capacity, smooth traffic

Once issuance is asynchronous, the next concern is the shape of the traffic. Campaign traffic mostly looks like an electrocardiogram, with pronounced peaks and troughs, so we borrowed the producer-consumer model and introduced queues to smooth it. We first considered RocketMQ. However, adding another piece of middleware to the core coupon issuance path creates a dependency whose availability and disaster recovery plan must be considered separately. RocketMQ is also a remote queue, which makes the delay between producer and consumer hard to control. We therefore designed a local queue model that avoids these problems.

778824d60ebb6c8f06caa9d9a3f5e521.png

The local queue model is shown above. The queue consumer first acquires a token from the distributed rate limiter. Once it has a token, it takes a request from the queue and starts a new goroutine to run the campaign's award logic, then repeats the process.

A queue clips peaks by itself, but it does not let us control the consumption rate precisely: when upstream traffic is too high, the consumers' rate rises with it and may overwhelm the system. A rate-limiting layer is still needed for more precise flow control, and limiting at interface granularity alone is too coarse, so we added a layer of business-level rate limiting in the business logic layer. The distributed rate limiter here is a self-developed component. It takes tokens locally first, and when local tokens run short it pulls a batch of tokens from the remote side to the local instance.
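The self-developed limiter is not shown in the article, so the following is only a sketch of the "local first, refill in batches" idea. `RemoteTokens`, `LocalFirstLimiter`, and `fixedBudget` are hypothetical names, and the remote side is replaced by an in-memory budget.

```go
package main

import (
	"fmt"
	"sync"
)

// RemoteTokens is a hypothetical client for the remote token service.
// Take asks for up to n tokens and returns how many were granted.
type RemoteTokens interface {
	Take(n int) int
}

// LocalFirstLimiter serves tokens from a local balance and refills it
// from the remote side in batches when the balance is empty.
type LocalFirstLimiter struct {
	mu     sync.Mutex
	local  int
	batch  int
	remote RemoteTokens
}

func NewLocalFirstLimiter(remote RemoteTokens, batch int) *LocalFirstLimiter {
	return &LocalFirstLimiter{remote: remote, batch: batch}
}

// TryAcquire reports whether one token could be obtained.
// Simplification: the lock is held during the remote call.
func (l *LocalFirstLimiter) TryAcquire() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.local == 0 {
		// One remote call covers up to `batch` later requests.
		l.local = l.remote.Take(l.batch)
	}
	if l.local == 0 {
		return false
	}
	l.local--
	return true
}

// fixedBudget is an in-memory RemoteTokens with a finite budget (demo only).
type fixedBudget struct {
	left  int
	calls int
}

func (f *fixedBudget) Take(n int) int {
	f.calls++
	if n > f.left {
		n = f.left
	}
	f.left -= n
	return n
}

func main() {
	remote := &fixedBudget{left: 10}
	lim := NewLocalFirstLimiter(remote, 4)
	granted := 0
	for i := 0; i < 12; i++ {
		if lim.TryAcquire() {
			granted++
		}
	}
	fmt.Println("granted:", granted, "remote calls:", remote.calls)
}
```

Pulling tokens in batches trades a little precision for far fewer remote round trips, which matters when each consumer acquires a token per request.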

This model also uses two tiers of queues. The first-tier queue protects the marketing activities layer, and its rate limit is set from the processing capacity of marketing activities. After the activity decision passes, the request is placed in the second-tier queue, whose rate limit is set from the capacity of the marketing assets system. The two tiers absorb the capacity difference between marketing activities and marketing assets, so the throughput of both systems can be maximized.

Inventory deduction optimization - Reduce hot spots, lower pressure

For performance reasons, the coupon batch inventory of marketing assets is kept in Redis. The current inventory logic is: receive a user's coupon issuance request -> read Redis to check whether the batch has enough inventory -> write Redis to deduct the batch inventory.

To spread traffic evenly across the Redis cluster, inventory data for different coupon batches is distributed across different Redis shards. However, when one batch is issued intensively for a period of time, most of the traffic still lands on a single shard and creates a Redis hot key problem. If the many individual deductions against one coupon batch can be merged, the hotspot problem is greatly alleviated.

21224a9169e01481759405f26ae08de9.png

The merged issuance logic is shown in the figure above. Marketing activities first tries, without blocking, to take N coupon issuance requests from the second-tier queue. If it gets any, it packages them and sends them to marketing assets. Getting fewer than N means the queue is empty for the moment, so the requests in hand are sent right away rather than held back. If it gets nothing, it sleeps for a short random interval and tries again, and if a limited number of retries still yields nothing, it ends this cycle.

When marketing assets receives a merged issuance request, it tries to combine the requests that target the same coupon batch and deducts their inventory in one operation. For example, issuing coupon batch A to N different users previously deducted inventory by 1 each time and needed N Redis writes. With merged issuance, a single Redis write that deducts N is enough.
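The merge itself amounts to counting requests per batch before touching the store. In the sketch below, `Inventory` is a hypothetical in-memory stand-in for the Redis counters (its `DecrBy` only mimics a decrement-by-N write), and handling of insufficient stock is deliberately left out.

```go
package main

import "fmt"

// IssueRequest is a hypothetical coupon issuance request.
type IssueRequest struct {
	UserID  int64
	BatchID string
}

// MergeByBatch counts how many coupons each batch needs.
func MergeByBatch(reqs []IssueRequest) map[string]int64 {
	need := make(map[string]int64)
	for _, r := range reqs {
		need[r.BatchID]++
	}
	return need
}

// Inventory is an in-memory stand-in for the Redis stock counters.
// writes counts how many write operations were performed.
type Inventory struct {
	stock  map[string]int64
	writes int
}

// DecrBy subtracts n from a batch's stock and returns the new value.
func (inv *Inventory) DecrBy(batchID string, n int64) int64 {
	inv.writes++
	inv.stock[batchID] -= n
	return inv.stock[batchID]
}

// DeductMerged performs one write per batch instead of one per request.
// Shortage handling is omitted in this sketch.
func DeductMerged(inv *Inventory, reqs []IssueRequest) {
	for batchID, n := range MergeByBatch(reqs) {
		inv.DecrBy(batchID, n)
	}
}

func main() {
	inv := &Inventory{stock: map[string]int64{"A": 100, "B": 100}}
	reqs := []IssueRequest{
		{UserID: 1, BatchID: "A"},
		{UserID: 2, BatchID: "A"},
		{UserID: 3, BatchID: "B"},
		{UserID: 4, BatchID: "A"},
	}
	DeductMerged(inv, reqs)
	fmt.Println("stock:", inv.stock, "writes:", inv.writes)
}
```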

In addition, the check that precedes the deduction does not need to hit Redis every time. It is only a pre-check: whether the deduction finally succeeds depends on the result of the deduction itself. Its main purpose is to block pointless deductions when Redis inventory is insufficient, so it does not have to be precise. We therefore keep each coupon batch's inventory information in the application's local memory, synchronized from Redis. At issuance time, a simple check against local memory is enough, with no access to remote Redis.

Graceful exit - Improve system robustness

A major drawback of processing data in local queues is that memory is volatile, so the data cannot be stored durably. When the application is redeployed or upgraded, issuance requests still in the local queue may be lost and never processed.

To avoid losing in-memory data, the application must detect its exit signal and deal with the data before it exits. We looked into the lifecycle of application instances on ByteDance's cloud platform. When an instance is terminated, it is first removed from the service registry, which means it no longer receives new external traffic. A SIGINT signal is then sent to the business process. On receiving SIGINT, the application stops consuming the issuance requests remaining in the queue and instead sends them to a remote queue, where surviving instances consume them.
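A rough sketch of the hand-off on SIGINT, using the standard `os/signal` package. `RemoteQueue`, `FlushToRemote`, and `memQueue` are hypothetical names. The remote queue is simulated in memory, and the demo injects the signal by hand so that the program terminates on its own.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
)

// IssueRequest is a hypothetical coupon issuance request.
type IssueRequest struct {
	UserID  int64
	BatchID string
}

// RemoteQueue is a hypothetical client for the remote queue.
type RemoteQueue interface {
	Publish(IssueRequest) error
}

// FlushToRemote moves everything still buffered in the local queue to the
// remote queue and returns how many requests were moved. If Publish fails,
// the request in hand is not retried here; the scheduled compensation job
// is assumed to pick it up later.
func FlushToRemote(local <-chan IssueRequest, remote RemoteQueue) (int, error) {
	moved := 0
	for {
		select {
		case r, ok := <-local:
			if !ok {
				return moved, nil
			}
			if err := remote.Publish(r); err != nil {
				return moved, err
			}
			moved++
		default:
			return moved, nil // local queue is empty
		}
	}
}

// memQueue is an in-memory RemoteQueue used only for the demo.
type memQueue struct{ items []IssueRequest }

func (m *memQueue) Publish(r IssueRequest) error {
	m.items = append(m.items, r)
	return nil
}

func main() {
	local := make(chan IssueRequest, 8)
	local <- IssueRequest{UserID: 1, BatchID: "A"}
	local <- IssueRequest{UserID: 2, BatchID: "A"}

	sigs := make(chan os.Signal, 2)
	signal.Notify(sigs, os.Interrupt) // os.Interrupt is SIGINT
	defer signal.Stop(sigs)

	sigs <- os.Interrupt // demo only: inject the signal instead of waiting for one

	<-sigs // exit signal received: stop consuming and hand off the backlog
	remote := &memQueue{}
	moved, err := FlushToRemote(local, remote)
	fmt.Println("moved:", moved, "err:", err)
}
```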

fb388c21c8971ac75c67689acf904621.png

Fallback compensation - Ensure eventual consistency

Graceful exit does not cover every case. In extreme situations such as a panic, an OOM kill, or a physical machine failure, the application exits without ever receiving SIGINT and cannot run its graceful exit logic. We therefore added a compensation mechanism: a scheduled task scans the table and re-delivers records that have been stuck in an intermediate state for a long time to the local queue for processing.

622786c5e3305e0c0eed4b9c8cbf8991.png

With a scheduled task to compensate, is graceful exit still necessary? It is. Deployments restart applications frequently, and at those moments the local queues are likely to hold many unprocessed requests. If everything were left to the scheduled task, the delay between a user claiming a coupon and actually receiving it could become very long and hurt the user experience. Graceful exit and compensation are therefore complementary: graceful exit protects the user experience as far as possible, and fallback compensation guarantees eventual consistency of the data.

Green channel - Improve user experience

Asynchronous issuance assumes there is a buffer period between the user claiming a coupon and actually perceiving it. However, a user may go straight to the Spring Festival wallet to check after claiming, and if the asynchronous issuance has not finished by then, this may lead to complaints. We agreed with the upstream Spring Festival main venue that when a user opens the Spring Festival wallet to view coupons shortly after claiming one, upstream calls the issuance interface again with a green channel flag. On seeing this flag, we switch from asynchronous to synchronous issuance and issue the coupon to that user with priority, protecting the user experience.

Fund risk prevention and control

Beyond performance, fund safety also needs attention. For this Spring Festival coupon campaign, we mainly took the following prevention and control measures.

Idempotent check

Each coupon issuance action generates a globally unique serial number, which is written to the database under a unique index when the coupon is issued. When a user taps the claim button repeatedly or a network error triggers a retry, the same serial number cannot be persisted again because of the unique index conflict, which prevents fund loss from duplicate issuance.
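The effect of the unique index can be imitated with a map keyed by serial number. This is an in-memory sketch only: `CouponTable`, `Issue`, and `ErrDuplicateSerial` are hypothetical names, and a real system would depend on the database's unique constraint rather than a mutex.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrDuplicateSerial mimics a unique index conflict.
var ErrDuplicateSerial = errors.New("duplicate serial number")

// Coupon is a hypothetical issued-coupon row.
type Coupon struct {
	SerialNo string
	UserID   int64
	BatchID  string
}

// CouponTable is an in-memory stand-in for a table with a unique index
// on the serial number.
type CouponTable struct {
	mu       sync.Mutex
	bySerial map[string]Coupon
}

func NewCouponTable() *CouponTable {
	return &CouponTable{bySerial: make(map[string]Coupon)}
}

// Insert fails with ErrDuplicateSerial if the serial number already exists.
func (t *CouponTable) Insert(c Coupon) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	if _, exists := t.bySerial[c.SerialNo]; exists {
		return ErrDuplicateSerial
	}
	t.bySerial[c.SerialNo] = c
	return nil
}

// Len returns the number of stored coupons.
func (t *CouponTable) Len() int {
	t.mu.Lock()
	defer t.mu.Unlock()
	return len(t.bySerial)
}

// Issue reports whether a new coupon was created. A duplicate serial number
// is treated as an already handled request, not as an error.
func Issue(t *CouponTable, c Coupon) (bool, error) {
	err := t.Insert(c)
	if errors.Is(err, ErrDuplicateSerial) {
		return false, nil
	}
	return err == nil, err
}

func main() {
	table := NewCouponTable()
	c := Coupon{SerialNo: "sn-001", UserID: 42, BatchID: "A"}
	first, _ := Issue(table, c)
	retry, _ := Issue(table, c) // e.g. a repeated tap or a network retry
	fmt.Println("first:", first, "retry:", retry, "rows:", table.Len())
}
```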

Per-user claim limit

The serial number check solves part of the problem, but professional scalpers may get around it by forging serial numbers. To counter this, we maintain a record of each user's claimed coupons. At issuance time we check whether the user has reached the claim limit. If not, the coupon is issued normally and the user's claim record is updated. Otherwise the issuance is terminated.

Mutual exclusion within coupon batch groups

The per-user claim limit mainly prevents the same user from claiming the same batch multiple times. Across the whole Spring Festival campaign, however, operations staff may configure multiple coupon batches for different purposes, and their target audiences may overlap or even be identical. Issuing coupons from several batches to the same user can drive up marketing costs. We therefore introduced the concept of a coupon batch group. Batches in the same group serve essentially the same marketing purpose, such as acquiring new users, boosting activity, or retention. Once a user has received a coupon from any batch in a group, the user can no longer receive coupons from the other batches in that group. In other words, the batches in a group are mutually exclusive, which avoids subsidizing the same user twice.
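A small sketch of the group rule, assuming each batch belongs to at most one group and that claims are tracked per user and group. `GroupGuard` and `TryClaim` are hypothetical names, and state is held in memory purely for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// GroupGuard enforces that a user receives at most one coupon per batch group.
type GroupGuard struct {
	mu      sync.Mutex
	groupOf map[string]string         // batch ID -> group ID
	claimed map[int64]map[string]bool // user ID -> groups already used
}

func NewGroupGuard(groupOf map[string]string) *GroupGuard {
	return &GroupGuard{groupOf: groupOf, claimed: make(map[int64]map[string]bool)}
}

// TryClaim records the claim and returns true if the user has not yet
// received a coupon from the batch's group. Batches without a group are
// not restricted by this check.
func (g *GroupGuard) TryClaim(userID int64, batchID string) bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	group, ok := g.groupOf[batchID]
	if !ok {
		return true
	}
	if g.claimed[userID][group] {
		return false // another batch in the same group was already claimed
	}
	if g.claimed[userID] == nil {
		g.claimed[userID] = make(map[string]bool)
	}
	g.claimed[userID][group] = true
	return true
}

func main() {
	guard := NewGroupGuard(map[string]string{
		"A1": "new-user",
		"A2": "new-user",
		"B1": "retention",
	})
	fmt.Println(guard.TryClaim(1, "A1")) // first coupon in group "new-user"
	fmt.Println(guard.TryClaim(1, "A2")) // same group, rejected
	fmt.Println(guard.TryClaim(1, "B1")) // different group, allowed
}
```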

4d7eb3db5c9e9d2661a26b70a74e2028.png

Inventory oversell prevention

As mentioned above, coupon batch inventory is stored in Redis. A deduction against Redis may hit a network timeout, a failure, or another exception, leaving the result of the deduction unknown. In that case we choose to "tolerate" it: we treat the issuance as failed, end the issuance logic, and do not roll back. If the deduction succeeds, we then issue the coupon to the user. If that issuance fails, we can try to roll back the Redis inventory, because this request is known to have deducted it. If the rollback itself fails, we do not retry.

85579536ef3ebd2db13d6c5ec368f7b4.png

This scheme may leave some inventory unsold, but the conservative strategy effectively prevents overselling. It can be seen as a trade-off between data consistency and availability.

Risk control platform integration

The coupon issuance path is also integrated with ByteDance's internal risk control platform. The platform collects and analyzes user and device information, identifies scalpers and malicious users through risk assessment, and intercepts their coupon issuance to avoid potential asset loss.

Data monitoring and reconciliation

In addition to the fund controls above, we monitor the issuance campaign extensively, including the number of coupons issued per batch, the issuance rate per batch, local queue backlog, and local queue consumption rate. When a metric deviates abnormally year-on-year or period-on-period, an alert fires promptly and engineers step in to troubleshoot. After a coupon batch has finished issuing, we also check data consistency again, including whether the number of coupons issued to users matches the inventory consumed and whether any user has exceeded the batch's claim limit.

Summary

With the optimizations above, we successfully supported this year's Spring Festival main venue coupon campaign and achieved good results:

  • System

    • The marketing system as a whole can sustain a coupon issuance throughput of 100,000 TPS for external callers.

  • Business

    • During the Spring Festival, tens of millions of Tiktok payment and DOU Installment coupons were issued, supporting the campaign needs of the two core businesses, Tiktok payment and DOU Installment.

    • 99% of coupons reached the user's account within 0.5 s. The actual delay of asynchronous issuance was very low, the user experience was good, and business expectations were met.

Follow-up plans

This year's Spring Festival traffic gave the marketing system a good deal of experience and system capability, but some areas still need continued iteration and improvement:

  • Standardizing asynchronous issuance. Our first use of asynchronous coupon issuance in the Spring Festival campaign worked well, and we expect many scenarios in major promotions such as 618 and Double 11 to suit it too. We therefore plan to standardize the marketing system's external coupon issuance interface and make asynchronous issuance an optional capability that can be flexibly selected and configured per integrating party, scenario, and so on.

  • Generalizing the local queue model. The two-tier local queue designed and implemented here completed the issuance task successfully, with lower task execution latency than a remote queue. Supporting features such as queue tiering, rate limiting, graceful exit, and compensation also did a good job of protecting system robustness. We will later abstract this module into a small general-purpose framework so that it can support more business scenarios suited to asynchronous processing.

Original site

Copyright notice
This article was written by [ByteDance technical team]. Please include a link to the original when reposting.
https://yzsam.com/2022/174/202206231945147033.html