当前位置:网站首页>System design learning (III) design Amazon's sales rank by category feature
System design learning (III) design Amazon's sales rank by category feature
2022-07-06 13:09:00 【Geometer】
Old rules , First look at the question
Use cases
We’ll scope the problem to handle only the following use case
Service calculates the past week’s most popular products by category
User views the past week’s most popular products by category
Service has high availability
Out of scope
The general e-commerce site:Design components only for calculating sales rank Constraints and assumptions
State assumptions Traffic is not evenly distributed
Items can be in multiple categories
Items cannot change categories
There are no subcategories ie foo/bar/baz Results must be updated hourly
More popular products might need to be updated more frequently
10 million products
1000 categories
1 billion transactions per month
100 billion read requests per month
100:1 read to write ratio
Then the first step is to calculate , Clear understanding , If you want to calculate the rough usage
First, let's see how much storage space each transaction needs
creared_at 5bytes
product_at 8bytes
category_id 4bytes
seller_id 8bytes
buyer_id 8bytes
quantity 4bytes
total_price 5bytes
total 40bytes
Because he said one month 1billion Transactions , So every month it costs 40gb To store these transactions Guoheng
If it takes three years 1.44tb For storage
Then after completing the rough calculation, we should simply design this system
After completing the design of the system outline, we need to fill in the details
First , Since the service needs to count the most popular goods in the past week , Then he must count the information of all orders , We can use sale api Store the processed logs , stay Object Store Instead of managing our own file system
We assume that the log we see is as follows
timestamp product_id category_id qty total_price seller_id buyer_id
t1 product1 category1 2 20.00 1 1
t2 product1 category2 2 20.00 2 2
t2 product1 category2 1 10.00 2 3
t3 product2 category1 3 7.00 3 4
t4 product3 category2 7 2.00 4 5
t5 product4 category1 1 5.00 5 6
...
Sales Rank service have access to sale api Print the log as input data , Then use it internally mapreduce Conduct a distributed log statistics , At last, it comes to sales_rank Such information is stored in a SQL In the database
We will use a multipart mapreduce
1. Convert data to (category, product_id), sum(quantity)
2. Do a distributed sort
class SalesRanker(MRJob):
def within_past_week(self, timestamp):
"""Return True if timestamp is within past week, False otherwise."""
...
def mapper(self, _ line):
"""Parse each log line, extract and transform relevant lines. Emit key value pairs of the form: (category1, product1), 2 (category2, product1), 2 (category2, product1), 1 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
timestamp, product_id, category_id, quantity, total_price, seller_id, \
buyer_id = line.split('\t')
if self.within_past_week(timestamp):
yield (category_id, product_id), quantity
def reducer(self, key, value):
"""Sum values for each key. (category1, product1), 2 (category2, product1), 3 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
yield key, sum(values)
def mapper_sort(self, key, value):
"""Construct key to ensure proper sorting. Transform key and value to the form: (category1, 2), product1 (category2, 3), product1 (category1, 3), product2 (category2, 7), product3 (category1, 1), product4 The shuffle/sort step of MapReduce will then do a distributed sort on the keys, resulting in: (category1, 1), product4 (category1, 2), product1 (category1, 3), product2 (category2, 3), product1 (category2, 7), product3 """
category_id, product_id = key
quantity = value
yield (category_id, quantity), product_id
def reducer_identity(self, key, value):
yield key, value
def steps(self):
"""Run the map and reduce steps."""
return [
self.mr(mapper=self.mapper,
reducer=self.reducer),
self.mr(mapper=self.mapper_sort,
reducer=self.reducer_identity),
]
The result is an ordered list below , We can deposit it in sales_rank surface
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3
sales_rank The deconstruction of the table is as follows
id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)
We will be in id 、 category_id and product_id Create an index on to speed up the search ( Record the time instead of scanning the entire table ) And keep the data in memory .
In this case , When users want to access the most popular products of the past week ( In each category ) When
client Send a request to webserver, Perform reverse proxy ,
webserver Direct request read api,
read api Read data from sql In the database
We use a public rest api
$ curl https://amazon.com/api/v1/popular?category_id=1234
return :
{
"id": "100",
"category_id": "1234",
"total_sold": "100000",
"product_id": "50",
},
{
"id": "53",
"category_id": "1234",
"total_sold": "90000",
"product_id": "200",
},
{
"id": "75",
"category_id": "1234",
"total_sold": "80000",
"product_id": "3",
},
The next step is to extend the existing structure
边栏推荐
- Record: Navicat premium can't connect to MySQL for the first time
- [算法] 剑指offer2 golang 面试题4:只出现一次的数字
- TYUT太原理工大学往年数据库简述题
- Introduction pointer notes
- [algorithm] sword finger offer2 golang interview question 13: sum of numbers of two-dimensional submatrix
- TYUT太原理工大学2022数据库大题之数据库操作
- [rtklib] preliminary practice of using robust adaptive Kalman filter under RTK
- Usage differences between isempty and isblank
- 记录:下一不小心写了个递归
- [算法] 剑指offer2 golang 面试题12:左右两边子数组的和相等
猜你喜欢
阿里云一面:并发场景下的底层细节 - 伪共享问题
十分鐘徹底掌握緩存擊穿、緩存穿透、緩存雪崩
Rt-ppp test using rtknavi
How to ensure data consistency between MySQL and redis?
Dark chain lock (lca+ difference on tree)
[Chongqing Guangdong education] Shandong University College Physics reference materials
【GNSS数据处理】赫尔默特(helmert)方差分量估计解析及代码实现
[algorithm] sword finger offer2 golang interview question 1: integer division
图书管理系统小练习
Fairygui bar subfamily (scroll bar, slider, progress bar)
随机推荐
Dark chain lock (lca+ difference on tree)
[算法] 剑指offer2 golang 面试题4:只出现一次的数字
Record: Navicat premium can't connect to MySQL for the first time
十分钟彻底掌握缓存击穿、缓存穿透、缓存雪崩
MySQL 30000 word essence summary + 100 interview questions, hanging the interviewer is more than enough (Collection Series
阿里云微服务(一)服务注册中心Nacos以及REST Template和Feign Client
162. Find peak - binary search
记录:newInstance()过时的代替方法
闇の連鎖(LCA+树上差分)
[rtklib 2.4.3 B34] version update introduction I
记录:动态Web项目servlet访问数据库404错误之解决
TYUT太原理工大学往年数据库简述题
编辑距离(多源BFS)
平衡二叉树详解 通俗易懂
All in one 1405: sum and product of prime numbers
Problems and solutions of robust estimation in rtklib single point location spp
系统设计学习(二)Design a key-value cache to save the results of the most recent web server queries
10 minutes pour maîtriser complètement la rupture du cache, la pénétration du cache, l'avalanche du cache
Sharing ideas of on-chip transplantation based on rtklib source code
系统设计学习(三)Design Amazon‘s sales rank by category feature