当前位置:网站首页>System design learning (III) design Amazon's sales rank by category feature
System design learning (III) design Amazon's sales rank by category feature
2022-07-06 13:09:00 【Geometer】
Old rules , First look at the question
Use cases
We’ll scope the problem to handle only the following use case
Service calculates the past week’s most popular products by category
User views the past week’s most popular products by category
Service has high availability
Out of scope
The general e-commerce site:Design components only for calculating sales rank Constraints and assumptions
State assumptions Traffic is not evenly distributed
Items can be in multiple categories
Items cannot change categories
There are no subcategories ie foo/bar/baz Results must be updated hourly
More popular products might need to be updated more frequently
10 million products
1000 categories
1 billion transactions per month
100 billion read requests per month
100:1 read to write ratio
Then the first step is to calculate , Clear understanding , If you want to calculate the rough usage
First, let's see how much storage space each transaction needs
creared_at 5bytes
product_at 8bytes
category_id 4bytes
seller_id 8bytes
buyer_id 8bytes
quantity 4bytes
total_price 5bytes
total 40bytes
Because he said one month 1billion Transactions , So every month it costs 40gb To store these transactions Guoheng
If it takes three years 1.44tb For storage
Then after completing the rough calculation, we should simply design this system
After completing the design of the system outline, we need to fill in the details
First , Since the service needs to count the most popular goods in the past week , Then he must count the information of all orders , We can use sale api Store the processed logs , stay Object Store Instead of managing our own file system
We assume that the log we see is as follows
timestamp product_id category_id qty total_price seller_id buyer_id
t1 product1 category1 2 20.00 1 1
t2 product1 category2 2 20.00 2 2
t2 product1 category2 1 10.00 2 3
t3 product2 category1 3 7.00 3 4
t4 product3 category2 7 2.00 4 5
t5 product4 category1 1 5.00 5 6
...
Sales Rank service have access to sale api Print the log as input data , Then use it internally mapreduce Conduct a distributed log statistics , At last, it comes to sales_rank Such information is stored in a SQL In the database
We will use a multipart mapreduce
1. Convert data to (category, product_id), sum(quantity)
2. Do a distributed sort
class SalesRanker(MRJob):
def within_past_week(self, timestamp):
"""Return True if timestamp is within past week, False otherwise."""
...
def mapper(self, _ line):
"""Parse each log line, extract and transform relevant lines. Emit key value pairs of the form: (category1, product1), 2 (category2, product1), 2 (category2, product1), 1 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
timestamp, product_id, category_id, quantity, total_price, seller_id, \
buyer_id = line.split('\t')
if self.within_past_week(timestamp):
yield (category_id, product_id), quantity
def reducer(self, key, value):
"""Sum values for each key. (category1, product1), 2 (category2, product1), 3 (category1, product2), 3 (category2, product3), 7 (category1, product4), 1 """
yield key, sum(values)
def mapper_sort(self, key, value):
"""Construct key to ensure proper sorting. Transform key and value to the form: (category1, 2), product1 (category2, 3), product1 (category1, 3), product2 (category2, 7), product3 (category1, 1), product4 The shuffle/sort step of MapReduce will then do a distributed sort on the keys, resulting in: (category1, 1), product4 (category1, 2), product1 (category1, 3), product2 (category2, 3), product1 (category2, 7), product3 """
category_id, product_id = key
quantity = value
yield (category_id, quantity), product_id
def reducer_identity(self, key, value):
yield key, value
def steps(self):
"""Run the map and reduce steps."""
return [
self.mr(mapper=self.mapper,
reducer=self.reducer),
self.mr(mapper=self.mapper_sort,
reducer=self.reducer_identity),
]
The result is an ordered list below , We can deposit it in sales_rank surface
(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3
sales_rank The deconstruction of the table is as follows
id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)
We will be in id 、 category_id and product_id Create an index on to speed up the search ( Record the time instead of scanning the entire table ) And keep the data in memory .
In this case , When users want to access the most popular products of the past week ( In each category ) When
client Send a request to webserver, Perform reverse proxy ,
webserver Direct request read api,
read api Read data from sql In the database
We use a public rest api
$ curl https://amazon.com/api/v1/popular?category_id=1234
return :
{
"id": "100",
"category_id": "1234",
"total_sold": "100000",
"product_id": "50",
},
{
"id": "53",
"category_id": "1234",
"total_sold": "90000",
"product_id": "200",
},
{
"id": "75",
"category_id": "1234",
"total_sold": "80000",
"product_id": "3",
},
The next step is to extend the existing structure
边栏推荐
- [algorithm] sword finger offer2 golang interview question 1: integer division
- Fundamentals of UD decomposition of KF UD decomposition [1]
- 闇の連鎖(LCA+树上差分)
- 记录:Navicat Premium初次无法连接数据库MySQL之解决
- Answer to "software testing" exercise: Chapter 1
- [GNSS data processing] Helmert variance component estimation analysis and code implementation
- 162. Find peak - binary search
- Wechat applet development experience
- [algorithm] sword finger offer2 golang interview question 13: sum of numbers of two-dimensional submatrix
- 异常:IOException:Stream Closed
猜你喜欢
国企秋招经验总结
阿里云一面:并发场景下的底层细节 - 伪共享问题
十分钟彻底掌握缓存击穿、缓存穿透、缓存雪崩
《软件测试》习题答案:第一章
Ten minutes to thoroughly master cache breakdown, cache penetration, cache avalanche
Fairygui bar subfamily (scroll bar, slider, progress bar)
服务未正常关闭导致端口被占用
[算法] 剑指offer2 golang 面试题8:和大于或等于k的最短子数组
Detailed explanation of balanced binary tree is easy to understand
Problems and solutions of robust estimation in rtklib single point location spp
随机推荐
162. Find peak - binary search
Small exercise of library management system
TYUT太原理工大学2022“mao gai”必背
[algorithm] sword finger offer2 golang interview question 8: the shortest subarray with a sum greater than or equal to K
记录:下一不小心写了个递归
IText 7 generate PDF summary
Excel导入,导出功能实现
Record: I accidentally wrote a recursion next time
【GNSS数据处理】赫尔默特(helmert)方差分量估计解析及代码实现
2年经验总结,告诉你如何做好项目管理
MySQL backup -- common errors in xtrabackup backup
分支语句和循环语句
[Chongqing Guangdong education] reference materials for regional analysis and planning of Pingdingshan University
[algorithm] sword finger offer2 golang interview question 9: subarray with product less than k
The port is occupied because the service is not shut down normally
Several high-frequency JVM interview questions
RTKLIB: demo5 b34f. 1 vs b33
[算法] 劍指offer2 golang 面試題2:二進制加法
Edit distance (multi-source BFS)
Abstract classes and interfaces