Practice of Building a Knowledge Graph with Ten Billion Relationships on Nebula Graph
2022-07-02 15:56:00 [Nebula Graph database]
This article was first published on the Nebula Graph Community official WeChat account

1. Project background
Micro LAN is a knowledge graph application for querying technologies, industries, enterprises, research institutions, and academic disciplines and their relationships. It contains tens of billions of relationships and billions of entities. To run this business smoothly, after investigation and research we adopted Nebula Graph as the primary database carrying our knowledge graph service, and as Nebula Graph iterated we finally settled on version v2.5.1.
2. Why did we choose Nebula Graph?
In the open-source graph database space there are undoubtedly many options, but to support a knowledge graph service over data at this scale, Nebula Graph offers the following advantages over other graph databases, and these are the reasons we chose it:
- Small memory footprint
In our business scenario, QPS is relatively low and does not fluctuate sharply. Compared with other graph databases, Nebula Graph consumes less memory when idle, so we can run the Nebula Graph service on machines with a lower memory configuration, which undoubtedly saves us costs.
- Uses the multi-raft consensus protocol
Compared with traditional Raft, multi-raft not only improves availability but also performs better. The performance of a consensus protocol depends largely on whether it allows log holes and on how finely it can be partitioned; at the application layer, whether in a KV database or SQL, making good use of these two properties keeps performance respectable. Because Raft's serial commit depends heavily on state machine performance, even in a KV store a slow op on one key will noticeably slow down other keys. The key to consensus-protocol performance is therefore making the state machine as parallel as possible. Even though multi-raft partitions at a relatively coarse granularity (compared with Paxos), for a Raft protocol that does not allow holes it is still a huge improvement.
- The storage side uses RocksDB as its engine
RocksDB, as a storage engine and embedded database, has been widely adopted as the storage layer of many databases. Better still, Nebula Graph lets you improve database performance by adjusting RocksDB parameters.
- Fast writes
Our business writes heavily and frequently. Even with vertices containing a lot of long text, Nebula Graph reached an insert rate of 20,000 vertices/s (a 3-machine cluster, 3 data replicas, 16 insert threads); for edges without properties, the insert rate reached 350,000 edges/s.
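For reference, here is a minimal sketch of the kind of batched write statements behind these numbers, reusing the technology tag and technologyLeaf edge type that appear later in this article; the property lists and IDs are illustrative assumptions, not our production schema:
// batch-insert vertices (each of the 16 threads sends batches like this concurrently)
INSERT VERTEX technology(name) VALUES
  "id1":("aaa"),
  "id2":("bbb");
// batch-insert edges carrying the local sort key
INSERT EDGE technologyLeaf(sort_value) VALUES
  "id1"->"foobar":(1.0),
  "id2"->"foobar":(0.5);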
3. What problems did we run into with Nebula Graph?
In our knowledge map business , Many scenarios need to show users the paged one degree relationship , At the same time, there are some super nodes in our data , But according to our business scenario , The super node must be the node with the highest user access probability , So this can't be simply classified as a long tail problem ; And because we don't have a large number of users , So the cache will not be hit very often , We need a solution to reduce the query latency of users .
For example: the business scenario is to query a technology's downstream technologies, sorted by a sort key we define. This sort key is local: an organization, say, may rank particularly high in one field but not overall or in other fields. In this scenario we have to store the sort attribute on the edge, and fit and standardize the global sort values so that each dimension has mean 0 and variance 1 (i.e., a z-score, (x − μ) / σ) before using them for local sorting; we also need pagination to make queries convenient for users.
The statement is as follows:
MATCH (v1:technology)-[e:technologyLeaf]->(v2:technology) WHERE id(v2) == "foobar" \
RETURN id(v1), v1.name, e.sort_value AS sort ORDER BY sort SKIP 0 LIMIT 20;
This node has 130,000 neighbors. In this case, even with an index on the sort_value property, the query still takes nearly two seconds, which is obviously unacceptable.
We finally chose OceanBase, the database open-sourced by Ant Group, to carry this part of the business. The data model is as follows:
technologydownstream
| technology_id | downstream_id | sort_value |
|---|---|---|
| foobar | id1 | 1.0 |
| foobar | id2 | 0.5 |
| foobar | id3 | 0.0 |
technology
| id | name | sort_value |
|---|---|---|
| id1 | aaa | 0.3 |
| id2 | bbb | 0.2 |
| id3 | ccc | 0.1 |
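Before the query, a hedged DDL sketch of this model in OceanBase's MySQL mode; the column types, lengths, and the composite index are our assumptions, not from the original article:
CREATE TABLE technologydownstream (
    technology_id VARCHAR(64) NOT NULL,
    downstream_id VARCHAR(64) NOT NULL,
    sort_value    DOUBLE,
    PRIMARY KEY (technology_id, downstream_id),
    -- assumed composite index: lets the paging subquery walk the index
    -- in sort order instead of sorting 130,000 rows at query time
    KEY idx_tech_sort (technology_id, sort_value)
);
CREATE TABLE technology (
    id         VARCHAR(64) NOT NULL PRIMARY KEY,
    name       VARCHAR(255),
    sort_value DOUBLE
);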
The query statement is as follows:
SELECT technology.name
FROM technology
INNER JOIN (
    SELECT technologydownstream.downstream_id
    FROM technologydownstream
    WHERE technologydownstream.technology_id = 'foobar'
    ORDER BY technologydownstream.sort_value DESC
    LIMIT 0,20
) AS t
WHERE t.downstream_id = technology.id;
This query takes 80 ms. The overall architecture design is shown below:

4. How did we tune Nebula Graph?
As mentioned earlier, one big advantage of Nebula Graph is that it can be tuned with native RocksDB parameters, which lowers the learning cost. Below we share the specific meaning of the tuning options and some tuning strategies:
| RocksDB parameter | Meaning |
|---|---|
| max_total_wal_size | Once the WAL files exceed max_total_wal_size, RocksDB forces the creation of a new WAL file. With the default value 0, max_total_wal_size = write_buffer_size * max_write_buffer_number * 4 |
| delete_obsolete_files_period_micros | Period for deleting obsolete files, which include SST files and WAL files. Defaults to 6 hours |
| max_background_jobs | Maximum number of background threads = max_background_flushes + max_background_compactions |
| stats_dump_period_sec | If non-zero, rocksdb.stats information is printed to the LOG file every stats_dump_period_sec seconds |
| compaction_readahead_size | Amount of data read ahead from disk during compaction. When running RocksDB on disk, it should be set to at least 2 MB for performance. If non-zero, it also forces new_table_reader_for_compaction_inputs=true |
| writable_file_max_buffer_size | Maximum buffer size used by WritableFileWriter, RocksDB's write cache. Tuning this parameter is important in Direct IO mode |
| bytes_per_sync | Amount of data per sync: once writes accumulate to bytes_per_sync, they are actively flushed to disk. This option applies to SST files; WAL files use wal_bytes_per_sync |
| wal_bytes_per_sync | Every time a WAL file grows by wal_bytes_per_sync bytes, sync_file_range is called to flush the file. The default value 0 disables this |
| delayed_write_rate | When a Write Stall occurs, the write rate is throttled to at most delayed_write_rate |
| avoid_flush_during_shutdown | By default, the DB flushes all memtables on shutdown. If this option is set, the flush is not forced, which may cause data loss |
| max_open_files | Number of file handles RocksDB may keep open (mainly SST files), so the next access can reuse them without reopening. When the cached handles exceed max_open_files, some handles are closed; when a handle is closed, the corresponding SST's index cache and filter cache are released with it, because index blocks and filter blocks are cached on the heap and their maximum count is controlled by max_open_files. Depending on how an SST's index_block is organized, an index_block is generally 1 to 2 orders of magnitude larger than a data_block, and every read must first load the index_block; this index data lives on the heap and is not proactively evicted. Heavy random reads can therefore cause severe read amplification and may make RocksDB occupy a large amount of physical memory for no obvious reason. Tuning this value is very important: trade off performance against memory usage according to your own workload. If set to -1, RocksDB always caches all open handles, at the cost of significant memory overhead |
| stats_persist_period_sec | If non-zero, statistics are automatically persisted every stats_persist_period_sec seconds to the hidden column family ___rocksdb_stats_history___ |
| stats_history_buffer_size | If non-zero, statistics snapshots are taken periodically and kept in memory; stats_history_buffer_size is the maximum memory used for these snapshots |
| strict_bytes_per_sync | For performance, RocksDB does not flush writes to disk synchronously by default, so data may be lost under abnormal conditions; to bound the amount of lost data, several parameters control the flush behavior. If this parameter is true, RocksDB flushes strictly according to wal_bytes_per_sync and bytes_per_sync, i.e., a complete file is flushed each time; if false, only part of the data is flushed each time. If you do not care about possible data loss you can set it to false, but true is recommended |
| enable_rocksdb_prefix_filtering | Whether to enable the prefix bloom filter. When enabled, a bloom filter is built in the memtable from the first rocksdb_filtering_prefix_length bytes of each written key |
| enable_rocksdb_whole_key_filtering | Builds a bloom filter in the memtable whose keys are the complete memtable keys. This conflicts with enable_rocksdb_prefix_filtering: if that option is true, this one has no effect |
| rocksdb_filtering_prefix_length | See enable_rocksdb_prefix_filtering |
| num_compaction_threads | Maximum number of concurrent background compaction threads, i.e., the maximum size of the thread pool; compaction threads run in the low-priority pool by default |
| rate_limit | Records the parameters used in the code to create a rate limiter via NewGenericRateLimiter, so the rate_limiter can be built from them. The rate_limiter is RocksDB's tool for throttling the write rate of Compaction and Flush, because writing too fast hurts read performance. For example: rate_limit = {"id":"GenericRateLimiter"; "mode":"kWritesOnly"; "clock":"PosixClock"; "rate_bytes_per_sec":"200"; "fairness":"10"; "refill_period_us":"1000000"; "auto_tuned":"false";} |
| write_buffer_size | Maximum memtable size. Once exceeded, RocksDB turns the memtable into an immutable memtable and creates a new one |
| max_write_buffer_number | Maximum number of memtables, including mem and imm. If it is reached, RocksDB stops subsequent writes; usually this means writes are coming in faster than Flush can drain them |
| level0_file_num_compaction_trigger | Trigger parameter specific to Leveled Compaction: when the number of L0 files reaches this value, compaction of L0 into L1 is triggered. The larger the value, the smaller the write amplification and the larger the read amplification; when it is very large, behavior approaches Universal Compaction |
| level0_slowdown_writes_trigger | When the number of L0 files exceeds this value, writes are slowed down. Tune this parameter together with level0_stop_writes_trigger to mitigate Write Stalls caused by too many L0 files |
| level0_stop_writes_trigger | When the number of L0 files exceeds this value, writes are rejected. Tune this parameter together with level0_slowdown_writes_trigger to mitigate Write Stalls caused by too many L0 files |
| target_file_size_base | SST file size at L1. Increasing this value reduces the overall DB size. If you need to tune it, a rule of thumb is target_file_size_base = max_bytes_for_level_base / 10, so that L1 holds 10 SST files |
| target_file_size_multiplier | Makes the SST file size at each level above L1 (L2…L6) target_file_size_multiplier times that of the level below |
| max_bytes_for_level_base | Maximum capacity of L1 (the total size of all its SST files); exceeding this capacity triggers Compaction |
| max_bytes_for_level_multiplier | Multiplier by which each level's total file size grows relative to the previous level |
| disable_auto_compactions | Whether to disable automatic Compaction |
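In Nebula Graph, most of these RocksDB options are not individual flags; they are passed as JSON maps through nebula-storaged.conf. A minimal sketch, assuming the v2.5.1 flag names, with purely illustrative values rather than recommendations:
# DB-level RocksDB options (JSON map; values are strings)
--rocksdb_db_options={"max_total_wal_size":"1073741824","max_background_jobs":"8","stats_dump_period_sec":"600"}
# column-family-level options such as memtable and L0 compaction settings
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","level0_file_num_compaction_trigger":"4"}
# memtable prefix bloom filter (see the table above)
--enable_rocksdb_prefix_filtering=true
--rocksdb_filtering_prefix_length=12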
Want to discuss graph database technology? To join the Nebula community group, first fill out your Nebula card, and the Nebula assistant will add you to the group~~