System Design: Seckill (Flash Sale) System
2022-08-04 04:32:00 【idle cat】
Seckill (flash sale)
"Seckill" (秒杀, literally "kill in a second") means defeating an opponent with an overwhelming advantage or in an extremely short time, such as one second. The term comes from online games, where it describes killing a game character instantly.
In e-commerce, a seckill is a scenario in which a limited quantity of goods is snapped up within a short time. It is a marketing strategy: the sales window is usually short, the price is somewhat lower, there is heavy promotion beforehand, and goods go to whoever comes first. The aim is to draw traffic, trading a price concession on a fixed quantity of goods for sufficient influence.
In architecture, a seckill refers to the architectural pattern for handling a very large number of requests in a short time, where resources go to whoever gets there first.
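E-commerce seckill page
Scenarios:
- Seckill page + product details + add to cart and place the order
- Seckill page + click the "seckill" button to place the order directly
An intuitive look
Click through to the details page; some sites place the order directly.
Then add to cart, place the order, and pay.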
Business preparation for a seckill
Although a seckill can be treated as a pattern, there are still many subtle differences between cases. As usual, start with what they have in common, then look at the differences.
To run a seckill event, the business side needs concrete planning: the chosen time; advance marketing (ads need a landing page for the event, ideally with bookmark and reminder functions); traffic growth predicted from actual business traffic data and past marketing experience (for example, how much revenue each 10,000 of past advertising spend brought in); strategies before and after the event (for example, whether prices rise before the event and whether the event is extended afterwards depending on sales); momentum building during the event (for example, a Double Eleven style sales display); and the organization of technical support.
Put bluntly, a seckill is a sales activity and technology only supports it. No single architecture or existing system can support every seckill event; each can only serve as a technical reference. The reason is that the marketing strategy, not the technology, decides what the seckill looks like. In some scenarios, technology can in turn influence sales strategy decisions.
Counter-example: "I want to run a seckill, the traffic is uncertain; can you design the system for me, and make it scale without limit as far as possible?" Analysis:
- A requirement like this cannot be handed directly to R&D for development.
- This kind of decision is more like a strategic task carried out by middle and senior management; it has to be broken down before it can be implemented.
- When to hold the seckill, the marketing strategy, metric forecasts for the event, product categories, logistics support, supply chain scheduling, warehouse shipping preparation, support from existing systems, and so on: a seckill is not something technology alone can deliver. It needs organization at a higher level, and some technical parameters depend on these non-technical factors.
- A good strategic or tactical move needs organizational backing. If a truly high-level requirement is passed straight to the technical team, the team will not be able to handle it. If you were the technical team's leader, what would you do in this situation?
The rest of this article looks at what work technical support needs to do for a typical seckill scenario.
Problem analysis
Flow:
- Display of the seckill event page and the product details page
- Add to cart and place the order, with a 15-minute payment countdown
- Payment
Analysis:
- The event page has the highest PV and is the entry point for most traffic
- Hot product details page display: event page PV × click-through rate
- Dynamic data on the product details page, such as price and number of reservations, must come from a backend interface
- Placing an order must lock the item, deduct inventory, and generate the order
- Inventory needs a lower limit. For example, with a current stock of 10,000 units, if another 1,000 units can be shipped or produced within 7 days when oversold, the inventory range can be set to 0 to 11,000, or to -1,000 to 10,000
- After the order is placed, lock the item for 15 minutes and release it if payment is not made
- Payment, which changes the order status
- Implicit requirement: the system must be stable and must not crash suddenly
- Performance should be as high as possible for a better user experience. The event page and details page must display quickly; ordering and payment can be somewhat slower, but they must remain available
- Even with user and PV forecasts, traffic may still exceed what the system supports; availability must be guaranteed and users given friendly messages
- In the worst case, even if the seckill system goes down, other functions must not be affected
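The inventory lower limit described in the analysis above can be illustrated in memory. This is only a minimal sketch under stated assumptions: `BoundedStock`, `tryReserve`, and `release` are hypothetical names that do not appear in the original, and a real seckill system would keep the counter in Redis or the database rather than in a single JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical in-memory stock counter with a configurable lower bound. */
final class BoundedStock {
    private final AtomicLong remaining;
    private final long floor; // e.g. -1000 tolerates 1,000 oversold units

    BoundedStock(long initial, long floor) {
        if (floor > initial) {
            throw new IllegalArgumentException("floor must not exceed initial stock");
        }
        this.remaining = new AtomicLong(initial);
        this.floor = floor;
    }

    /** Reserves qty units; returns false if that would push stock below the floor. */
    boolean tryReserve(long qty) {
        if (qty <= 0) {
            throw new IllegalArgumentException("qty must be positive");
        }
        while (true) {
            long current = remaining.get();
            long next = current - qty;
            if (next < floor) {
                return false; // would oversell beyond the tolerated amount
            }
            // Compare-and-set retries if another thread changed the value in between.
            if (remaining.compareAndSet(current, next)) {
                return true;
            }
        }
    }

    /** Returns units to stock, e.g. when the 15-minute payment window expires unpaid. */
    void release(long qty) {
        remaining.addAndGet(qty);
    }

    long remaining() {
        return remaining.get();
    }
}
```

With `new BoundedStock(10_000, -1_000)` the counter models the "-1,000 to 10,000" range; `new BoundedStock(11_000, 0)` models the equivalent "0 to 11,000" range.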
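Problems & solutions
Event page display
The event page can be static HTML plus poster images. To guarantee concurrency and speed, Nginx can serve the static files directly so that requests never reach the backend application.
Optimization: use a CDN where the situation warrants it.
Product details page
The high-concurrency solution for the details page: first separate out the event products and split the page into static and dynamic parts; then make the details page static and cache it on a CDN, with dynamic data fetched through an interface (data integration).
Separating event products: making every product page static and caching them all on a CDN is technically feasible but commercially unrealistic. The products taking part in the event need to be separated out, and only this part of the data processed.
Static/dynamic page separation: an ordinary details page loads a large amount of data through interfaces, or uses a template technology such as JSP that assembles the page on every visit, which costs time and resources. For hot products it is well worth making the page static and fetching only the dynamic part of the data through an interface.
Static pages cached on the CDN: even for a large mall, caching the static pages of all hot event products is cost-effective. What needs to be considered is which key to cache under in the CDN, expiry, and hit rate.
CDN invalidation: every event runs within a window controlled to the second, and the page should not be accessible outside that window. This invalidation control places high demands on the CDN system.
Data integration: once static and dynamic data are separated, how the front end assembles the page becomes a new question, mainly how dynamic data is loaded. There are usually two approaches: ESI (Edge Side Includes) and CSI (Client Side Include).
ESI: the web proxy server requests the dynamic data and inserts it into the static page, so the user receives a complete page. This demands more server-side performance but gives a better user experience.
CSI: the web proxy server returns only the static page, and the front end issues a separate asynchronous JS request for the dynamic data. This is friendlier to server-side performance but gives a slightly worse user experience.
Rule of thumb: the performance gain from static/dynamic separation boils down to two points. First, keep the data as small as possible to cut unnecessary requests; second, keep the path as short as possible to make each request more efficient.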
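Details data interface
- A high-concurrency read scenario
- Keep the call chain as short as possible and reduce dependencies
- Program directly against the servlet API when necessary
- Hot data can be cached in Redis
- Rate-limit the interface
- A blacklist can be added to block malicious traffic
- Intercept invalid requests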
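Hotspot data
- Hotspot data is divided into static and dynamic data
- Static data: data that can be determined before the promotion, such as participating products and merchants, or the TOP N products found through technical analysis.
- Dynamic data: data that cannot be predicted in advance. Hot and cold data shift as the scenario changes, so cold data that turns hot must be loaded into the cache based on actual visit counts and traffic statistics, to prevent Redis cache breakdown from sending traffic to the DB.
- Hotspot data discovery:
A common implementation approach:
1. Asynchronously collect hot key information at each stage of the transaction chain, for example Nginx collecting accessed URLs or an agent collecting hotspot logs (some middleware already has hotspot discovery built in), to identify potential hotspot data ahead of time
2. Aggregate and analyze the hotspot data; data that meets certain rules is pushed to the systems on the chain through subscription and distribution. Each system decides how to handle it according to its own needs, by rate limiting or caching, thereby protecting against hotspots
Points to note:
1. Collect hotspot data asynchronously: this keeps collection off the core transaction path and keeps the collection method general
2. Hotspot discovery should ideally be real-time at the second level, otherwise dynamic discovery has little value; this in turn places high demands on the data collection and analysis capacity of the core nodes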
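The aggregation step of hotspot discovery can be sketched as a per-window counter. This is a simplified illustration, not the mechanism of any particular middleware: `HotKeyDetector`, `record`, `rollWindow`, and the threshold rule are all assumptions introduced here, and the subscription/push step to downstream systems is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

/** Hypothetical hot-key detector: counts key accesses per time window. */
final class HotKeyDetector {
    private final long threshold;
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    HotKeyDetector(long threshold) {
        this.threshold = threshold;
    }

    /** Meant to be fed by an asynchronous collector (log agent, access-log consumer), not the trade path. */
    void record(String key) {
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /**
     * Meant to be called once per window (for example every second).
     * Returns the keys whose count reached the threshold, then resets all counters.
     * Counts recorded concurrently with the reset may be dropped, so the result is approximate.
     */
    List<String> rollWindow() {
        List<String> hot = new ArrayList<>();
        for (Map.Entry<String, LongAdder> e : counts.entrySet()) {
            if (e.getValue().sum() >= threshold) {
                hot.add(e.getKey());
            }
        }
        counts.clear();
        return hot;
    }
}
```

The keys returned by `rollWindow` are what a real pipeline would publish to subscribers, which then decide whether to cache or rate-limit them.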
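Hotspot isolation
Once hotspot data is identified, the first principle is to isolate it so that the 1% does not affect the other 99%. Isolation can be done at the following levels:
Business isolation. A seckill is a marketing activity for which sellers must register separately; technically, this lets the system pre-warm the cache for known hotspots.
System isolation. System isolation is runtime isolation: deploy in separate groups to split off from the other 99%. The seckill can also use its own domain name so that the entry layer routes requests to a different cluster.
Data isolation. As hotspot data, seckill data can use a dedicated cache cluster or DB service group, which makes horizontal or vertical scaling easier.
There are of course many other ways to isolate. For example, distinguish by user, assigning different cookies to different users so that the entry layer routes to different service interfaces; or keep the same domain name but have the backend call different service interfaces; or tag the data at the data layer. All these measures aim to separate identified hotspot requests from ordinary ones.
Hotspot optimization
With hotspot data isolated, it is easy to optimize specifically for that 1% of requests. There are essentially two ways:
Caching: hotspot caching is the most effective measure. If the hotspot data has been split into static and dynamic parts, the static data can be cached long term.
Rate limiting: traffic limits are more of a protective mechanism. Note that each service should constantly watch whether requests trigger rate limiting and review promptly.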
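High-concurrency reads
- Hotspot data under high-concurrency reads: use a cache
- Layered validation: filter requests at each layer of the request chain and do the real processing only at the very end of the "funnel", which shortens the path affected by the system bottleneck
- Read requests do not verify seckill eligibility, product status, quiz answers, CAPTCHAs, and so on; these checks are made only on writes, where data consistency must be most accurate
Hotspot operations
- Refreshing at midnight, ordering at midnight, adding to cart at midnight, and so on are all hotspot operations.
- Hotspot operations are user behavior and hard to change, but some protective limits can be applied, such as showing a prompt and blocking when a user refreshes the page too frequently.
- Repeated refreshes: the browser caches the JS, CSS, and HTML files along with cached dynamic data, so each refresh reads the local cache; dynamic data is requested on a timer; the backend caches request results lazily, for example caching whether a user has already seckilled the product for 3 s, and querying or processing in the database only when Redis has no data.
- Order button control: the front end can restrict ordering, greying out the button until the seckill starts (as reported by the interface) and enabling it afterwards.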
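The lazy 3-second result cache mentioned for repeated refreshes can be sketched as a small TTL cache in front of a slower lookup. This is an in-memory stand-in for Redis, written under assumptions: `ShortTtlCache`, the injected clock, and the `loader` callback (standing in for the database query) are hypothetical names not found in the original.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical short-lived cache: the loader runs only when no fresh entry exists. */
final class ShortTtlCache<K, V> {
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so expiry can be tested without sleeping

    ShortTtlCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value if still fresh; otherwise calls the loader and caches its result. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        // compute() is atomic per key, so concurrent refreshes for one key trigger a single load.
        Entry<V> entry = entries.compute(key, (k, old) ->
                (old != null && old.expiresAtMillis > now)
                        ? old
                        : new Entry<>(loader.apply(k), now + ttlMillis));
        return entry.value;
    }
}
```

A caller could create it as `new ShortTtlCache<String, Boolean>(3000, System::currentTimeMillis)` and key it by user and product, so that a burst of refreshes within 3 s hits the database at most once per key.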
Inventory consistency
When to deduct inventory:
- Deduct on order placement: someone may place orders without paying; inventory can be locked for a limited time, such as 15 minutes; the number of items each person can buy can also be limited
- Deduct after payment: easy to oversell; a buyer may find there is no stock after paying, which is a worse user experience
Deducting inventory directly in MySQL:
- update table1 set count = count - 1 where id = 101; leads to overselling
- update table1 set count = count - 1 where id = 101 and count > 1; prevents overselling, but with "> 1" the last unit can never be sold; the guard should be count > 0
- Querying first and then updating is not atomic
- Adding a distributed lock keeps throughput from scaling
- Database writes are the bottleneck
Redis
- redisClient.incrby(productId, -1); may oversell
- get, check whether the value is greater than 0, then incrby: not an atomic operation
- Adding a distributed lock limits performance
- Solution: a Lua script can make the above logic execute atomically on the Redis side
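import java.util.Collections;

// Redis runs a Lua script atomically, so the check and the decrement cannot interleave.
// Return values: -1 = key missing, 0 = sold out, 1 = unlimited stock (stored as -1),
// otherwise the stock value before this decrement (so a last-unit sale also returns 1).
StringBuilder lua = new StringBuilder();
lua.append("if (redis.call('exists', KEYS[1]) == 1) then");
lua.append(" local stock = tonumber(redis.call('get', KEYS[1]));");
lua.append(" if (stock == -1) then");
lua.append(" return 1;");
lua.append(" end;");
lua.append(" if (stock > 0) then");
lua.append(" redis.call('incrby', KEYS[1], -1);");
lua.append(" return stock;");
lua.append(" end;");
lua.append(" return 0;");
lua.append("end;");
lua.append("return -1;");
// Assumes a Jedis-style client whose eval takes (script, keys, args); productId becomes KEYS[1].
redisClient.eval(lua.toString(), Collections.singletonList(productId), Collections.emptyList());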
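High-concurrency writes
Validity checks
Against malicious order-brushing, the front end can add CAPTCHAs, quiz questions, and the like; the backend can limit the number of refreshes per user per minute and check whether the user has already placed an order.
Peak shaving and valley filling
Use an MQ to absorb bursts of traffic. A two-level MQ can be used, where the first-level MQ only writes the data and the second-level MQ performs the order placement. A compensating JOB handles problems such as lost and duplicated messages.
Rate limiting
If traffic exceeds system throughput (the combined capacity of web + MQ), rate limiting is required.
Limit types: per user; per IP; per interface
Implementation: Nginx; Redis on the backend; third-party components such as Sentinel
Delaying requests
Use CAPTCHAs and quiz questions to stretch out the time over which requests arrive
On the business side, the seckill can also be run in batches to spread the traffic
Plan B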
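The buffering idea behind peak shaving can be illustrated with a bounded in-process queue. This is a deliberately simplified sketch: the queue stands in for a real MQ, `OrderBuffer`, `submit`, and `drain` are hypothetical names, and durability, retries, and de-duplication (the job of the compensating JOB) are not modeled.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

/** Hypothetical bounded buffer between the request entry point and the order service. */
final class OrderBuffer {
    private final BlockingQueue<String> queue;

    OrderBuffer(int capacity) {
        this.queue = new ArrayBlockingQueue<>(capacity);
    }

    /**
     * Fast path at the entry point: only enqueue, never place the order here.
     * Returns false when the buffer is full so the caller can show a "busy, try again" message.
     */
    boolean submit(String orderRequest) {
        return queue.offer(orderRequest);
    }

    /** Consumer side: takes at most batchSize requests, at whatever pace the order service sustains. */
    List<String> drain(int batchSize) {
        List<String> batch = new ArrayList<>(batchSize);
        queue.drainTo(batch, batchSize);
        return batch;
    }
}
```

The burst is absorbed by `submit`, while a worker calling `drain` on a schedule turns it into a steady stream of order placements.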
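A per-key limit (the key being a user ID, an IP, or an interface name) can be sketched with a fixed-window counter. This is one simple technique chosen for illustration, not how Nginx, Redis, or Sentinel implement limiting: `FixedWindowLimiter` and `tryAcquire` are hypothetical names, state lives in a single JVM, and idle keys are never evicted.

```java
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical fixed-window rate limiter keyed by user, IP, or interface name. */
final class FixedWindowLimiter {
    private static final class Window {
        long startMillis;
        int count;
    }

    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();
    private final int limit;
    private final long windowMillis;

    FixedWindowLimiter(int limit, long windowMillis) {
        this.limit = limit;
        this.windowMillis = windowMillis;
    }

    /** Returns true if the request identified by key is allowed at time nowMillis. */
    boolean tryAcquire(String key, long nowMillis) {
        boolean[] allowed = {false};
        // compute() runs atomically per key, so the check and the increment cannot race.
        windows.compute(key, (k, w) -> {
            if (w == null || nowMillis - w.startMillis >= windowMillis) {
                w = new Window(); // start a fresh window with a zero count
                w.startMillis = nowMillis;
            }
            if (w.count < limit) {
                w.count++;
                allowed[0] = true;
            }
            return w;
        });
        return allowed[0];
    }
}
```

For example, `new FixedWindowLimiter(10, 60_000)` with keys such as `"user:" + userId` would cap each user at 10 requests per minute; a fixed window allows short bursts at window boundaries, which sliding-window or token-bucket limiters avoid.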
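High availability is really a systems engineering effort that runs through the entire life cycle of building a system.
Specifically, building for high availability involves the architecture, coding, testing, release, and operation phases, plus the handling of failures. Taking each in turn:
Architecture phase: consider scalability and fault tolerance and avoid single points of failure. For example, with unitized multi-site deployment, the system keeps running even if an IDC or a whole city fails.
Coding phase: make the code robust. For example, set a reasonable timeout on RPC calls so the system is not dragged down by others, and provide default handling for unexpected error returns.
Testing phase: ensure CI coverage and Sonar's fault tolerance rate, run a second check on basic quality, and regularly produce trend reports on overall quality.
Release phase: deployment is where errors are most easily exposed, so have a checklist template beforehand, an upstream/downstream notification mechanism during, and a rollback mechanism afterwards.
Operation phase: the system spends most of its time running. Real-time monitoring matters most: detect problems promptly, alert accurately, and provide detailed data for troubleshooting.
When a failure occurs: the first goal is to stop the loss promptly and keep the impact from spreading, then locate the cause, fix the problem, and finally restore service.
For day-to-day operations, high availability is mostly about the operation phase, which needs extra reinforcement, mainly by the following means:
Prevention: set up a routine load-testing regime, regularly running single-point and full-chain load tests to find the system's capacity watermark.
Control: put degradation, rate limiting, and circuit breaking in place for production. Note that all three cause business loss, so confirm with upstream and downstream business owners before acting. Taking rate limiting as an example: which business can be limited, under what conditions, for how long, and when to restore must all be confirmed repeatedly with the business side.
Monitoring: establish performance baselines and record performance trends; build an alerting system that warns promptly when problems are found.
Recovery: be able to stop the loss promptly when a failure occurs, and provide a tool for quick data correction; it does not have to be good, but it has to exist.
Mistakes can be made at every stage of the system's life cycle, and some cannot be remedied later or only at very high cost. High availability is therefore a systems engineering effort that must be considered across the whole life cycle. Given service growth, it also needs long-term planning and systematic construction.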
END