Practice of Building a Knowledge Graph with Tens of Billions of Relationships on Nebula Graph
2022-06-28 14:13:00 【NebulaGraph】
This article was first published on the Nebula Graph Community official account.

1. Project background

Micro LAN is an application for querying the knowledge graph of technologies, industries, enterprises, scientific research institutions, disciplines, and the relationships among them. It contains tens of billions of relationships and billions of entities. To run this business well, after investigation and research we chose Nebula Graph as the primary database for our knowledge graph service. As the product iterated, we eventually settled on Nebula Graph v2.5.1 as the final version.
2. Why Nebula Graph?

There are undoubtedly many options among open-source graph databases, but to support a knowledge graph service at this data scale, Nebula Graph has the following advantages over the alternatives, which is why we chose it:
- Small memory footprint

In our business scenario, QPS is relatively low and does not spike sharply. At the same time, compared with other graph databases, Nebula Graph consumes less memory when idle, so we can run the Nebula Graph service on machines with a lower memory configuration, which undoubtedly saves costs.
- Multi-Raft consensus protocol

Compared with traditional Raft, Multi-Raft not only increases system availability but also delivers better performance. The performance of a consensus algorithm mainly depends on whether it allows log holes and on the granularity of partitioning; whether the application layer is a KV store or a SQL database, making good use of these two properties yields solid performance. Because Raft's serial commit is highly dependent on state-machine performance, even in a KV store a slow operation on one key noticeably slows down operations on other keys. The key to consensus-protocol performance is therefore how to make the state machine as parallel as possible. Even though Multi-Raft partitions at a relatively coarse granularity (compared with Paxos), it is still a huge improvement over a Raft protocol that does not allow holes.
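The Multi-Raft idea described above boils down to partitioning the key space into many independent Raft groups, so a slow commit in one group does not block commits in the others. A minimal sketch of the partitioning step (the hash-based placement here is an illustration, not Nebula Graph's actual partitioner):

```python
NUM_PARTS = 3  # number of partitions, i.e. independent Raft groups

def part_for_key(key: str, num_parts: int = NUM_PARTS) -> int:
    # Each key deterministically maps to one partition. Each partition
    # keeps its own serial commit log, so partitions commit in parallel
    # with each other even though each log is serial internally.
    return sum(key.encode()) % num_parts

# Route each write to its partition's log.
logs = {p: [] for p in range(NUM_PARTS)}
for op in ["put a", "put b", "put c", "put d"]:
    key = op.split()[1]
    logs[part_for_key(key)].append(op)
```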
- RocksDB as the storage engine

As a storage engine / embedded database, RocksDB is widely used as the storage layer of many databases. Even better, Nebula Graph exposes RocksDB's parameters, which can be tuned to improve database performance.
- Fast writes

Our business requires frequent bulk writes. Even for vertices carrying large amounts of long text (a cluster of 3 machines, 3 data replicas, 16 insertion threads), Nebula Graph reaches an insertion speed of 20,000 vertices/s; for edges without properties, the insertion speed reaches 350,000 edges/s.
3. What problems did we encounter with Nebula Graph?

In our knowledge graph business, many scenarios need to show users a paginated one-hop neighborhood. Our data also contains some super nodes, and in our business scenario a super node is precisely the node users are most likely to visit, so this cannot simply be dismissed as a long-tail problem. Moreover, because our user base is not large, the cache is not hit very often, so we needed a solution that reduces users' query latency.
For example: the business scenario is to query a technology's downstream technologies, sorted by a sort key we define. This sort key is a local sort key: an organization may rank very high in one field but not overall or in other fields. In this scenario we have to put the sort attribute on the edge. The global ranking values are fitted and standardized so that the data in each dimension has variance 1 and mean 0, and the result is used for local sorting, while pagination is also supported for user queries.
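The standardization mentioned above is ordinary z-score normalization applied per dimension before the value is written to the edge's `sort_value` property. A minimal sketch (the score list is illustrative):

```python
def standardize(values):
    """Z-score standardization: shift to mean 0, scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

# Global ranking scores for one dimension, before being written
# to the sort_value property on the edges.
scores = [3.0, 5.0, 7.0]
print(standardize(scores))  # resulting list has mean 0, variance 1
```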
The statement is as follows:

```ngql
MATCH (v1:technology)-[e:technologyLeaf]->(v2:technology) WHERE id(v2) == "foobar" \
RETURN id(v1), v1.name, e.sort_value AS sort ORDER BY sort | LIMIT 0,20;
```

This node has 130,000 neighbors. In this case, even with an index on the sort_value property, the query still takes nearly two seconds. That is obviously unacceptable.
We eventually chose Ant Group's open-source OceanBase database to support this part of the business. The data model is as follows:
technologydownstream
| technology_id | downstream_id | sort_value |
|---|---|---|
| foobar | id1 | 1.0 |
| foobar | id2 | 0.5 |
| foobar | id3 | 0.0 |
technology
| id | name | sort_value |
|---|---|---|
| id1 | aaa | 0.3 |
| id2 | bbb | 0.2 |
| id3 | ccc | 0.1 |
The query statement is as follows:

```sql
SELECT technology.name
FROM technology
INNER JOIN (
    SELECT technologydownstream.downstream_id
    FROM technologydownstream
    WHERE technologydownstream.technology_id = 'foobar'
    ORDER BY technologydownstream.sort_value DESC
    LIMIT 0,20
) AS t
WHERE t.downstream_id = technology.id;
```

This statement takes 80 milliseconds. The overall architecture is designed as follows:
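The plan behind this query is simple: filter and sort the edge table first, take the top of the page, then join back to the entity table for names. The same logic sketched in Python, using the example tables above:

```python
# Edge table rows: (technology_id, downstream_id, sort_value)
technologydownstream = [
    ("foobar", "id1", 1.0),
    ("foobar", "id2", 0.5),
    ("foobar", "id3", 0.0),
]
# Entity table: id -> name
technology = {"id1": "aaa", "id2": "bbb", "id3": "ccc"}

def downstream_names(tech_id, offset=0, limit=20):
    # Filter edges by source id, sort by sort_value descending,
    # paginate, then join back to the entity table for the names.
    rows = [r for r in technologydownstream if r[0] == tech_id]
    rows.sort(key=lambda r: r[2], reverse=True)
    page = rows[offset:offset + limit]
    return [technology[r[1]] for r in page]

print(downstream_names("foobar"))  # ['aaa', 'bbb', 'ccc']
```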

4. How did we tune Nebula Graph?

As mentioned above, one great advantage of Nebula Graph is that it can be tuned with native RocksDB parameters, which lowers the learning cost. We share the meanings of the tuning options and some tuning strategies below:
| RocksDB parameter | Meaning |
|---|---|
| max_total_wal_size | Once WAL files exceed max_total_wal_size, a new WAL file is forcibly created. With the default value 0, max_total_wal_size = write_buffer_size * max_write_buffer_number * 4 |
| delete_obsolete_files_period_micros | Period for deleting obsolete files, including SST files and WAL files; the default is 6 hours |
| max_background_jobs | Maximum number of background threads = max_background_flushes + max_background_compactions |
| stats_dump_period_sec | If non-zero, rocksdb.stats information is printed to the LOG file every stats_dump_period_sec seconds |
| compaction_readahead_size | Amount of data read ahead from disk during compaction. If RocksDB runs on spinning disks, for performance it should be set to at least 2 MB. If non-zero, it also forces new_table_reader_for_compaction_inputs=true |
| writable_file_max_buffer_size | Maximum buffer size used by WritableFileWriter, RocksDB's write cache. Tuning this parameter matters for Direct IO mode |
| bytes_per_sync | Amount of data per sync; once bytes_per_sync bytes accumulate, they are actively flushed to disk. This option applies to SST files; WAL files use wal_bytes_per_sync |
| wal_bytes_per_sync | Every time a WAL file grows by wal_bytes_per_sync bytes, the file is flushed by calling sync_file_range. The default value 0 disables this behavior |
| delayed_write_rate | When a Write Stall occurs, the write speed is limited to at most delayed_write_rate |
| avoid_flush_during_shutdown | By default, the DB flushes all memtables on close. If this option is set, the flush is not forced, which may cause data loss |
| max_open_files | Number of file handles RocksDB can keep open (mainly SST files), so subsequent accesses can use them directly without reopening. Once cached handles exceed max_open_files, some handles are closed. Note that when a handle is closed, the index cache and filter cache of the corresponding SST file are released too, because index blocks and filter blocks are cached on the heap, with their maximum count controlled by max_open_files. Given how an SST file's index_block is organized, the index_block is generally 1 to 2 orders of magnitude larger than a data_block, and every read must first load the index_block; that index data then sits on the heap and is not actively evicted. Heavy random reads can therefore cause severe read amplification, and may also cause RocksDB to occupy a large amount of physical memory for no obvious reason, so tuning this value is very important: trade off performance against memory usage based on your own workload. If this value is -1, RocksDB caches all open handles forever, at the cost of large memory overhead |
| stats_persist_period_sec | If non-zero, statistics are automatically persisted every stats_persist_period_sec seconds to the hidden column family ___rocksdb_stats_history___ |
| stats_history_buffer_size | If non-zero, statistics snapshots are taken periodically and stored in memory; the maximum memory used for snapshots is stats_history_buffer_size |
| strict_bytes_per_sync | For performance, RocksDB does not flush synchronously by default when writing to disk, so data loss is possible under abnormal conditions; to bound the amount of lost data, some parameters control the flush behavior. If this parameter is true, RocksDB flushes strictly according to wal_bytes_per_sync and bytes_per_sync, i.e. a complete file each time; if false, only part of the data is flushed each time. In other words, if you do not care about possible data loss you can set it to false, but true is recommended |
| enable_rocksdb_prefix_filtering | Whether to enable the prefix bloom filter. When enabled, a bloom filter is built in the memtable from the first rocksdb_filtering_prefix_length bytes of each written key |
| enable_rocksdb_whole_key_filtering | Builds a bloom filter in the memtable keyed on the complete key. This configuration conflicts with enable_rocksdb_prefix_filtering: if enable_rocksdb_prefix_filtering is true, this configuration does not take effect |
| rocksdb_filtering_prefix_length | See enable_rocksdb_prefix_filtering |
| num_compaction_threads | Maximum number of concurrent background compaction threads (in fact the maximum size of the thread pool); the compaction thread pool is low-priority by default |
| rate_limit | Records the parameters used in the code to create a rate limiter via NewGenericRateLimiter, so the rate_limiter can be built from them. The rate_limiter is RocksDB's tool for controlling the write rate of Compaction and Flush, since writing too fast hurts read performance. For example: rate_limit = {"id":"GenericRateLimiter"; "mode":"kWritesOnly"; "clock":"PosixClock"; "rate_bytes_per_sec":"200"; "fairness":"10"; "refill_period_us":"1000000"; "auto_tuned":"false";} |
| write_buffer_size | Maximum size of a memtable. Beyond this size, RocksDB turns it into an immutable memtable and creates a new memtable |
| max_write_buffer_number | Maximum number of memtables, including mem and imm. If full, RocksDB stops subsequent writes, usually because writes are too fast for Flush to keep up |
| level0_file_num_compaction_trigger | Parameter specific to Leveled Compaction: when the number of L0 files reaches this value, compaction of L0 and L1 is triggered. The larger the value, the smaller the write amplification and the larger the read amplification; at large values the behavior approaches Universal Compaction |
| level0_slowdown_writes_trigger | When the number of L0 files exceeds this value, writes are slowed down. Tune this parameter together with level0_stop_writes_trigger to solve Write Stall problems caused by too many L0 files |
| level0_stop_writes_trigger | When the number of L0 files exceeds this value, writes are rejected. Tune this parameter together with level0_slowdown_writes_trigger to solve Write Stall problems caused by too many L0 files |
| target_file_size_base | Size of SST files at L1. Increasing this value reduces the overall DB size. If tuning is needed, set target_file_size_base = max_bytes_for_level_base / 10, so that level 1 holds 10 SST files |
| target_file_size_multiplier | Makes SST files at the levels above L1 (L2 ... L6) target_file_size_multiplier times larger than those of the previous level |
| max_bytes_for_level_base | Maximum capacity of L1 (sum of all SST file sizes); exceeding this capacity triggers compaction |
| max_bytes_for_level_multiplier | Growth factor of each level's total file size relative to the previous level |
| disable_auto_compactions | Whether to disable automatic compaction |
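In Nebula Graph, RocksDB options like these are passed to the storage service through the nebula-storaged configuration file as JSON maps. A sketch of what the relevant section might look like (the values below are illustrative, not our production settings):

```ini
########## rocksdb options (illustrative values) ##########
# Options passed through to RocksDB DBOptions
--rocksdb_db_options={"max_background_jobs":"8","max_open_files":"50000","stats_dump_period_sec":"600"}
# Options passed through to RocksDB ColumnFamilyOptions
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","level0_file_num_compaction_trigger":"10"}
# Prefix bloom filter settings (Nebula-level flags)
--enable_rocksdb_prefix_filtering=true
--enable_rocksdb_whole_key_filtering=false
```

After editing the configuration, the storage service needs to be restarted for the options to take effect.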
Want to discuss graph database technology? To join the Nebula communication group, first fill in your Nebula card, and the Nebula assistant will add you to the group~