Practice of Building a Ten-Billion-Relationship Knowledge Graph Based on Nebula Graph
2022-07-02 15:56:00 [Graph database NebulaGraph]
This article was first published on the Nebula Graph Community official account.
1. Project Background
WeiLan (微澜) is a knowledge graph application for querying technologies, industries, enterprises, research institutions, disciplines, and the relationships among them. It contains tens of billions of relationships and billions of entities. To run this business well, after investigation and research we adopted Nebula Graph as the main database carrying our knowledge graph workload, and as the product iterated we eventually settled on Nebula Graph v2.5.1.
2. Why Choose Nebula Graph?
There are certainly many options in the open-source graph database field, but to support a knowledge graph service at this data scale, Nebula Graph offers the following advantages over other graph databases, which is why we chose it:
- Small memory footprint
In our business scenario, QPS is relatively low and does not fluctuate sharply. At the same time, compared with other graph databases, Nebula Graph consumes less memory when idle, so we can run the Nebula Graph service on machines with a lower memory configuration, which undoubtedly saves us cost.
- Multi-Raft consensus protocol
Compared with classic Raft, Multi-Raft not only improves availability but also delivers higher performance. The performance of a consensus algorithm mainly depends on whether it allows log holes and on the granularity of partitioning; at the application layer, whether for a KV store or a SQL database, making good use of these two properties keeps performance from being bad. Because Raft's serial commit depends heavily on state-machine performance, even in a KV store a slow op on one key noticeably slows down other keys. The key to consensus-protocol performance is therefore how much the state machine can be parallelized. Even though Multi-Raft partitions at a relatively coarse granularity (compared with Paxos), for a Raft protocol that does not allow log holes it is still a huge improvement.
- RocksDB as the storage engine on the storage side
As a storage engine / embedded database, RocksDB has been widely used as the storage layer of many databases. Even better, Nebula Graph lets you tune RocksDB parameters directly to improve database performance.
- Fast writes
Our business requires frequent bulk writes. Even for vertices carrying a lot of long text (a 3-machine cluster, 3 data replicas, 16 insert threads), Nebula Graph reaches an insert rate of 20,000 vertices/s; for edges without properties, the insert rate reaches 350,000 edges/s. A minimal write sketch follows.
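For context, writes like these are ordinary nGQL INSERT statements issued concurrently from client threads. The sketch below uses the `technology` tag that appears later in this post; the property list and the property-less edge type `relate` are our assumptions for illustration:

```ngql
# Vertex insert carrying long-text content (property names are assumptions).
INSERT VERTEX technology(name, description) VALUES
  "id1":("aaa", "a long description ..."),
  "id2":("bbb", "another long description ...");

# Property-less edge insert (edge type `relate` is hypothetical).
INSERT EDGE relate() VALUES
  "id1"->"id2":(),
  "id2"->"id3":();
```

In the benchmark above, 16 client threads issue such statements concurrently; batching multiple values into one statement is a common way to reach rates of this order.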
3. What Problems Did We Encounter with Nebula Graph?
In our knowledge map business , Many scenarios need to show users the paged one degree relationship , At the same time, there are some super nodes in our data , But according to our business scenario , The super node must be the node with the highest user access probability , So this can't be simply classified as a long tail problem ; And because we don't have a large number of users , So the cache will not be hit very often , We need a solution to reduce the query latency of users .
For example: the business scenario is to query the downstream technologies of a given technology, sorted by a sort key we define. This sort key is a local sort key: an institution may rank very high in one particular field but not overall or in other fields. In such a scenario the sort attribute must be stored on the edge, and the global sort values are fitted and standardized so that along each dimension the data has variance 1 and mean 0 (spelled out below), which enables local sorting; pagination is also supported to make queries convenient.
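The standardization mentioned above is the usual z-score transform (our notation, not from the original post):

$$
z_i = \frac{x_i - \mu}{\sigma},\qquad
\mu = \frac{1}{n}\sum_{i=1}^{n} x_i,\qquad
\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\bigl(x_i - \mu\bigr)^2}
$$

so that along each dimension the transformed sort values have mean 0 and variance 1.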
The statement is as follows:
MATCH (v1:technology)-[e:technologyLeaf]->(v2:technology) WHERE id(v2) == "foobar" \
RETURN id(v1), v1.name, e.sort_value AS sort ORDER BY sort SKIP 0 LIMIT 20;
This node has 130,000 neighbors. In this case, even with an index on the sort_value property (created roughly as sketched below), the query still takes nearly two seconds, which is obviously unacceptable.
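For reference, an edge index on that property would be created roughly like this; this is a sketch, and the index name is ours:

```ngql
# Index the sort property of the technologyLeaf edge type (index name is hypothetical).
CREATE EDGE INDEX IF NOT EXISTS technologyLeaf_sort_index ON technologyLeaf(sort_value);
# Build the index over data that already exists.
REBUILD EDGE INDEX technologyLeaf_sort_index;
```

Even with such an index, the query still has to expand all 130,000 edges of the super node before sorting and paginating, which is presumably where the latency comes from.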
We finally chose the OceanBase database, open-sourced by Ant Group, to carry this part of the business. The data model is as follows (a DDL sketch follows the two tables):
Table `technologydownstream`:

| technology_id | downstream_id | sort_value |
| --- | --- | --- |
| foobar | id1 | 1.0 |
| foobar | id2 | 0.5 |
| foobar | id3 | 0.0 |

Table `technology`:

| id | name | sort_value |
| --- | --- | --- |
| id1 | aaa | 0.3 |
| id2 | bbb | 0.2 |
| id3 | ccc | 0.1 |
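A DDL sketch for these two tables, assuming MySQL-compatible OceanBase syntax; the column types, lengths, and index name are our assumptions:

```sql
-- Sketch only: types and index name are illustrative assumptions.
CREATE TABLE technology (
    id         VARCHAR(64)  NOT NULL,
    name       VARCHAR(255) NOT NULL,
    sort_value DOUBLE       NOT NULL,  -- global (standardized) sort key
    PRIMARY KEY (id)
);

CREATE TABLE technologydownstream (
    technology_id VARCHAR(64) NOT NULL,
    downstream_id VARCHAR(64) NOT NULL,
    sort_value    DOUBLE      NOT NULL,  -- local sort key on the relation
    PRIMARY KEY (technology_id, downstream_id),
    -- Serves `WHERE technology_id = ? ORDER BY sort_value DESC LIMIT ?`
    -- without a full sort of the neighbor set.
    KEY idx_tech_sort (technology_id, sort_value)
);
```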
The query statement is as follows:
SELECT technology.name
FROM technology
INNER JOIN (
    SELECT technologydownstream.downstream_id
    FROM technologydownstream
    WHERE technologydownstream.technology_id = 'foobar'
    ORDER BY technologydownstream.sort_value DESC
    LIMIT 0, 20
) AS t ON t.downstream_id = technology.id;
This statement takes 80 milliseconds. The overall architecture design is shown in the figure below.

[Figure: overall architecture]
4. How Did We Tune Nebula Graph?
As mentioned earlier, one big advantage of Nebula Graph is that it can be tuned with native RocksDB parameters, which lowers the learning cost. We share the specific meaning of each tuning option and some tuning strategies below; a sample storaged configuration fragment follows the table:
| RocksDB parameter | Meaning |
| --- | --- |
| max_total_wal_size | Once the WAL files exceed max_total_wal_size, a new WAL file is forcibly created. With the default value 0, max_total_wal_size = write_buffer_size * max_write_buffer_number * 4 |
| delete_obsolete_files_period_micros | The period for deleting obsolete files, which include SST files and WAL files. The default is 6 hours |
| max_background_jobs | Maximum number of background threads = max_background_flushes + max_background_compactions |
| stats_dump_period_sec | If non-zero, rocksdb.stats information is dumped to the LOG file every stats_dump_period_sec seconds |
| compaction_readahead_size | The amount of data read ahead from disk during compaction. If RocksDB is running on spinning disks, for performance this should be set to at least 2 MB. If non-zero, it also forces new_table_reader_for_compaction_inputs=true |
| writable_file_max_buffer_size | Maximum buffer size used by WritableFileWriter, i.e., RocksDB's write cache. Tuning this parameter is important for Direct IO mode |
| bytes_per_sync | The amount of data per sync: once bytes_per_sync bytes have accumulated, they are actively flushed to disk. This option applies to SST files; WAL files use wal_bytes_per_sync |
| wal_bytes_per_sync | Each time the WAL file grows by wal_bytes_per_sync bytes, the file is flushed by calling sync_file_range. The default value 0 disables this |
| delayed_write_rate | When a write stall occurs, the write rate is throttled to at most delayed_write_rate |
| avoid_flush_during_shutdown | By default, the DB flushes all memtables on close. If this option is set, the flush is not forced, which may cause data loss |
| max_open_files | The number of file handles RocksDB may keep open (mainly SST files), so that the next access can reuse them without reopening. When the cached handles exceed max_open_files, some handles are closed; when a handle is closed, the corresponding SST's index cache and filter cache are released with it, because index blocks and filter blocks are cached on the heap and their maximum count is controlled by max_open_files. Depending on how an SST's index_block is organized, an index_block is generally 1 to 2 orders of magnitude larger than a data_block, and every read must first load the index_block; that index data lives on the heap and is never proactively evicted. Heavy random reads can therefore cause severe read amplification, and may also make RocksDB occupy a mysteriously large amount of physical memory. Tuning this value is thus very important: trade performance against memory usage according to your own workload. If the value is -1, RocksDB keeps all opened handles cached forever, at the cost of heavy memory overhead |
| stats_persist_period_sec | If non-zero, statistics are automatically persisted every stats_persist_period_sec seconds to the hidden column family ___rocksdb_stats_history___ |
| stats_history_buffer_size | If non-zero, statistics snapshots are taken periodically and kept in memory; stats_history_buffer_size is the maximum memory used by these snapshots |
| strict_bytes_per_sync | For performance, RocksDB does not flush writes to disk synchronously by default, so data may be lost under abnormal conditions; to bound the amount of lost data, several parameters control the flush behavior. If this parameter is true, RocksDB flushes strictly according to wal_bytes_per_sync and bytes_per_sync, i.e., a complete file at a time; if false, only part of the data is flushed each time. If you do not care about possible data loss you can set it to false, but true is recommended |
| enable_rocksdb_prefix_filtering | Whether to enable the prefix bloom filter. When enabled, a bloom filter is built in the memtable over the first rocksdb_filtering_prefix_length bytes of each written key |
| enable_rocksdb_whole_key_filtering | Builds a bloom filter in the memtable keyed on the complete memtable key. This conflicts with enable_rocksdb_prefix_filtering: if enable_rocksdb_prefix_filtering is true, this setting does not take effect |
| rocksdb_filtering_prefix_length | See enable_rocksdb_prefix_filtering |
| num_compaction_threads | Maximum number of concurrent background compaction threads, i.e., the maximum size of the compaction thread pool; compaction threads run at low priority by default |
| rate_limit | Records the parameters used in the code to create a rate limiter via NewGenericRateLimiter, so the rate_limiter can be built from them. The rate_limiter is RocksDB's tool for throttling the write rate of compaction and flush, since writing too fast hurts read performance. An example setting: rate_limit = {"id":"GenericRateLimiter"; "mode":"kWritesOnly"; "clock":"PosixClock"; "rate_bytes_per_sec":"200"; "fairness":"10"; "refill_period_us":"1000000"; "auto_tuned":"false";} |
| write_buffer_size | Maximum size of a memtable. Once exceeded, RocksDB turns it into an immutable memtable and creates a new one |
| max_write_buffer_number | Maximum number of memtables, including mutable and immutable ones. If they are all full, RocksDB stalls subsequent writes, usually because writes arrive faster than flushes can keep up |
| level0_file_num_compaction_trigger | A trigger specific to leveled compaction: when the number of L0 files reaches this value, compaction of L0 into L1 is triggered. The larger the value, the smaller the write amplification and the larger the read amplification; a very large value approaches the behavior of universal compaction |
| level0_slowdown_writes_trigger | When the number of L0 files exceeds this value, writes are slowed down. Tune this parameter together with level0_stop_writes_trigger to resolve write stalls caused by too many L0 files |
| level0_stop_writes_trigger | When the number of L0 files exceeds this value, writes are rejected. Tune this parameter together with level0_slowdown_writes_trigger to resolve write stalls caused by too many L0 files |
| target_file_size_base | SST file size at L1. Increasing this value reduces the overall DB size. If you need to tune it, target_file_size_base = max_bytes_for_level_base / 10 is a reasonable choice, i.e., 10 SST files at level 1 |
| target_file_size_multiplier | Makes the SST file size at the levels above L1 (L2…L6) target_file_size_multiplier times that of the level below |
| max_bytes_for_level_base | Maximum capacity of L1 (the sum of all its SST file sizes); exceeding it triggers compaction |
| max_bytes_for_level_multiplier | The growth factor of each level's total file size relative to the previous level |
| disable_auto_compactions | Whether to disable automatic compaction |
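In Nebula Graph, most of these RocksDB options are passed to storaged as JSON maps in its configuration file. Below is a sketch of what a tuned nebula-storaged.conf fragment might look like; the flag names exist in Nebula, but the specific values here are illustrative, not our production settings:

```
# nebula-storaged.conf (fragment); values are illustrative only.
# DBOptions-level settings go into rocksdb_db_options.
--rocksdb_db_options={"max_background_jobs":"8","stats_dump_period_sec":"200","delayed_write_rate":"33554432"}
# ColumnFamilyOptions-level settings go into rocksdb_column_family_options.
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","level0_file_num_compaction_trigger":"10"}
# BlockBasedTableOptions-level settings go into rocksdb_block_based_table_options.
--rocksdb_block_based_table_options={"block_size":"8192"}
# The memtable prefix bloom filter discussed above.
--enable_rocksdb_prefix_filtering=true
```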
Want to discuss graph database technology? To join the Nebula exchange group, please fill in your Nebula card first, and the Nebula assistant will bring you into the group~