当前位置:网站首页>System design learning (III) design Amazon's sales rank by category feature
System design learning (III) design Amazon's sales rank by category feature
2022-07-06 13:09:00 【Geometer】
Old rules , First look at the question
Use cases
We’ll scope the problem to handle only the following use case
Service calculates the past week’s most popular products by category
User views the past week’s most popular products by category
Service has high availability
Out of scope
The general e-commerce site:Design components only for calculating sales rank Constraints and assumptions
State assumptions Traffic is not evenly distributed
Items can be in multiple categories
Items cannot change categories
There are no subcategories ie foo/bar/baz Results must be updated hourly
More popular products might need to be updated more frequently
10 million products
1000 categories
1 billion transactions per month
100 billion read requests per month
100:1 read to write ratio
Then the first step is to calculate , Clear understanding , If you want to calculate the rough usage
First, let's see how much storage space each transaction needs
creared_at 5bytes
product_at 8bytes
category_id 4bytes
seller_id 8bytes
buyer_id 8bytes
quantity 4bytes
total_price 5bytes
total 40bytes
Because he said one month 1billion Transactions , So every month it costs 40gb To store these transactions Guoheng
If it takes three years 1.44tb For storage
Then after completing the rough calculation, we should simply design this system
After completing the design of the system outline, we need to fill in the details
First , Since the service needs to count the most popular goods in the past week , Then he must count the information of all orders , We can use sale api Store the processed logs , stay Object Store Instead of managing our own file system
We assume that the log we see is as follows
timestamp product_id category_id qty total_price seller_id buyer_id
t1 product1 category1 2 20.00 1 1
t2 product1 category2 2 20.00 2 2
t2 product1 category2 1 10.00 2 3
t3 product2 category1 3 7.00 3 4
t4 product3 category2 7 2.00 4 5
t5 product4 category1 1 5.00 5 6
...
Sales Rank service have access to sale api Print the log as input data , Then use it internally mapreduce Conduct a distributed log statistics , At last, it comes to sales_rank Such information is stored in a SQL In the database
We will use a multipart mapreduce
1. Convert data to (category, product_id), sum(quantity)
2. Do a distributed sort
class SalesRanker(MRJob):
def within_past_week(self, timestamp):
"""Return True if timestamp is within past week, False otherwise."""
...
def mapper(self, _ line):
"""Parse each log line, extract and transform relevant lines. Emit key value pairs of the form: (category1, product1), 2 (category2, product1), 2 (category2, product1), 1 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
timestamp, product_id, category_id, quantity, total_price, seller_id, \
buyer_id = line.split('\t')
if self.within_past_week(timestamp):
yield (category_id, product_id), quantity
def reducer(self, key, value):
"""Sum values for each key. (category1, product1), 2 (category2, product1), 3 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
yield key, sum(values)
def mapper_sort(self, key, value):
"""Construct key to ensure proper sorting. Transform key and value to the form: (category1, 2), product1 (category2, 3), product1 (category1, 3), product2 (category2, 7), product3 (category1, 1), product4 The shuffle/sort step of MapReduce will then do a distributed sort on the keys, resulting in: (category1, 1), product4 (category1, 2), product1 (category1, 3), product2 (category2, 3), product1 (category2, 7), product3 """
category_id, product_id = key
quantity = value
yield (category_id, quantity), product_id
def reducer_identity(self, key, value):
yield key, value
def steps(self):
"""Run the map and reduce steps."""
return [
self.mr(mapper=self.mapper,
reducer=self.reducer),
self.mr(mapper=self.mapper_sort,
reducer=self.reducer_identity),
]
The result is an ordered list below , We can deposit it in sales_rank surface
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3
sales_rank The deconstruction of the table is as follows
id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)
We will be in id 、 category_id and product_id Create an index on to speed up the search ( Record the time instead of scanning the entire table ) And keep the data in memory .
In this case , When users want to access the most popular products of the past week ( In each category ) When
client Send a request to webserver, Perform reverse proxy ,
webserver Direct request read api,
read api Read data from sql In the database
We use a public rest api
$ curl https://amazon.com/api/v1/popular?category_id=1234
return :
{
"id": "100",
"category_id": "1234",
"total_sold": "100000",
"product_id": "50",
},
{
"id": "53",
"category_id": "1234",
"total_sold": "90000",
"product_id": "200",
},
{
"id": "75",
"category_id": "1234",
"total_sold": "80000",
"product_id": "3",
},
The next step is to extend the existing structure
边栏推荐
- 系统设计学习(一)Design Pastebin.com (or Bit.ly)
- 十分鐘徹底掌握緩存擊穿、緩存穿透、緩存雪崩
- TYUT太原理工大学往年数据库简述题
- [算法] 剑指offer2 golang 面试题9:乘积小于k的子数组
- How to ensure data consistency between MySQL and redis?
- Application architecture of large live broadcast platform
- Wechat applet development experience
- [untitled]
- Answer to "software testing" exercise: Chapter 1
- [algorithm] sword finger offer2 golang interview question 2: binary addition
猜你喜欢
[算法] 剑指offer2 golang 面试题2:二进制加法
Alibaba cloud side: underlying details in concurrent scenarios - pseudo sharing
[algorithm] sword finger offer2 golang interview question 4: numbers that appear only once
【GNSS数据处理】赫尔默特(helmert)方差分量估计解析及代码实现
[algorithm] sword finger offer2 golang interview question 9: subarray with product less than k
系统设计学习(二)Design a key-value cache to save the results of the most recent web server queries
2022国赛Re1 baby_tree
The port is occupied because the service is not shut down normally
面渣逆袭:Redis连环五十二问,三万字+八十图详解。
MySQL shutdown is slow
随机推荐
wsl常用命令
十分钟彻底掌握缓存击穿、缓存穿透、缓存雪崩
堆排序【手写小根堆】
MySQL backup -- common errors in xtrabackup backup
[算法] 剑指offer2 golang 面试题8:和大于或等于k的最短子数组
Record: Navicat premium can't connect to MySQL for the first time
《软件测试》习题答案:第一章
How do architects draw system architecture blueprints?
[untitled]
String class
[Chongqing Guangdong education] reference materials for regional analysis and planning of Pingdingshan University
Application architecture of large live broadcast platform
阿里云微服务(一)服务注册中心Nacos以及REST Template和Feign Client
TYUT太原理工大学2022软工导论考试题型大纲
[Yu Yue education] guide business reference materials of Wuxi Vocational and Technical College of Commerce
[algorithm] sword finger offer2 golang interview question 3: the number of 1 in the binary form of the first n numbers
染色法判定二分图
[算法] 剑指offer2 golang 面试题12:左右两边子数组的和相等
[algorithme] swordfinger offer2 golang question d'entrevue 2: addition binaire
TYUT太原理工大学2022“mao gai”必背