Practice of Building a Knowledge Graph with Tens of Billions of Relationships on Nebula Graph
2022-06-28 14:13:00 【NebulaGraph】
This article was first published on the Nebula Graph Community official account.

1. Project background

Micro LAN is an application for querying the knowledge graph of technologies, industries, enterprises, scientific research institutions, disciplines, and the relationships among them. It contains tens of billions of relationships and billions of entities. To run this business well, after investigation and research we chose Nebula Graph as the primary database for our knowledge graph service. As the product iterated, we eventually settled on Nebula Graph v2.5.1 as the final version.
2. Why Nebula Graph?

There are undoubtedly many options among open-source graph databases, but to support a knowledge graph service at this data scale, Nebula Graph has the following advantages over the alternatives, which is why we chose it:
- Small memory footprint

In our business scenario, QPS is relatively low and does not spike sharply. At the same time, compared with other graph databases, Nebula Graph consumes less memory when idle, so we can run the Nebula Graph service on machines with a lower memory configuration, which undoubtedly saves costs.
- Multi-Raft consensus protocol

Compared with traditional Raft, Multi-Raft not only increases system availability but also delivers better performance. The performance of a consensus algorithm mainly depends on whether it allows log holes and on the granularity of partitioning; whether the application layer is a KV store or a SQL database, making good use of these two properties yields solid performance. Because Raft's serial commit is highly dependent on state-machine performance, even in a KV store a slow operation on one key noticeably slows down operations on other keys. The key to consensus-protocol performance is therefore how to make the state machine as parallel as possible. Even though Multi-Raft partitions at a relatively coarse granularity (compared with Paxos), it is still a huge improvement over a Raft protocol that does not allow holes.
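The Multi-Raft idea described above boils down to partitioning the key space into many independent Raft groups, so a slow commit in one group does not block commits in the others. A minimal sketch of the partitioning step (the hash-based placement here is an illustration, not Nebula Graph's actual partitioner):

```python
NUM_PARTS = 3  # number of partitions, i.e. independent Raft groups

def part_for_key(key: str, num_parts: int = NUM_PARTS) -> int:
    # Each key deterministically maps to one partition. Each partition
    # keeps its own serial commit log, so partitions commit in parallel
    # with each other even though each log is serial internally.
    return sum(key.encode()) % num_parts

# Route each write to its partition's log.
logs = {p: [] for p in range(NUM_PARTS)}
for op in ["put a", "put b", "put c", "put d"]:
    key = op.split()[1]
    logs[part_for_key(key)].append(op)
```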
- RocksDB as the storage engine

As a storage engine / embedded database, RocksDB is widely used as the storage layer of many databases. Even better, Nebula Graph exposes RocksDB's parameters, which can be tuned to improve database performance.
- Fast writes

Our business requires frequent bulk writes. Even for vertices carrying large amounts of long text (a cluster of 3 machines, 3 data replicas, 16 insertion threads), Nebula Graph reaches an insertion speed of 20,000 vertices/s; for edges without properties, the insertion speed reaches 350,000 edges/s.
3. What problems did we encounter with Nebula Graph?

In our knowledge graph business, many scenarios need to show users a paginated one-hop neighborhood. Our data also contains some super nodes, and in our business scenario a super node is precisely the node users are most likely to visit, so this cannot simply be dismissed as a long-tail problem. Moreover, because our user base is not large, the cache is not hit very often, so we needed a solution that reduces users' query latency.
For example: the business scenario is to query a technology's downstream technologies, sorted by a sort key we define. This sort key is a local sort key: an organization may rank very high in one field but not overall or in other fields. In this scenario we have to put the sort attribute on the edge. The global ranking values are fitted and standardized so that the data in each dimension has variance 1 and mean 0, and the result is used for local sorting, while pagination is also supported for user queries.
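The standardization mentioned above is ordinary z-score normalization applied per dimension before the value is written to the edge's `sort_value` property. A minimal sketch (the score list is illustrative):

```python
def standardize(values):
    """Z-score standardization: shift to mean 0, scale to variance 1."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / n
    std = variance ** 0.5
    return [(v - mean) / std for v in values]

# Global ranking scores for one dimension, before being written
# to the sort_value property on the edges.
scores = [3.0, 5.0, 7.0]
print(standardize(scores))  # resulting list has mean 0, variance 1
```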
The statement is as follows:

```ngql
MATCH (v1:technology)-[e:technologyLeaf]->(v2:technology) WHERE id(v2) == "foobar" \
RETURN id(v1), v1.name, e.sort_value AS sort ORDER BY sort | LIMIT 0,20;
```

This node has 130,000 neighbors. In this case, even with an index on the sort_value property, the query still takes nearly two seconds. That is obviously unacceptable.
We eventually chose Ant Group's open-source OceanBase database to support this part of the business. The data model is as follows:
technologydownstream
| technology_id | downstream_id | sort_value |
|---|---|---|
| foobar | id1 | 1.0 |
| foobar | id2 | 0.5 |
| foobar | id3 | 0.0 |
technology
| id | name | sort_value |
|---|---|---|
| id1 | aaa | 0.3 |
| id2 | bbb | 0.2 |
| id3 | ccc | 0.1 |
The query statement is as follows:

```sql
SELECT technology.name
FROM technology
INNER JOIN (
    SELECT technologydownstream.downstream_id
    FROM technologydownstream
    WHERE technologydownstream.technology_id = 'foobar'
    ORDER BY technologydownstream.sort_value DESC
    LIMIT 0,20
) AS t
WHERE t.downstream_id = technology.id;
```

This statement takes 80 milliseconds. The overall architecture is designed as follows:
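The plan behind this query is simple: filter and sort the edge table first, take the top of the page, then join back to the entity table for names. The same logic sketched in Python, using the example tables above:

```python
# Edge table rows: (technology_id, downstream_id, sort_value)
technologydownstream = [
    ("foobar", "id1", 1.0),
    ("foobar", "id2", 0.5),
    ("foobar", "id3", 0.0),
]
# Entity table: id -> name
technology = {"id1": "aaa", "id2": "bbb", "id3": "ccc"}

def downstream_names(tech_id, offset=0, limit=20):
    # Filter edges by source id, sort by sort_value descending,
    # paginate, then join back to the entity table for the names.
    rows = [r for r in technologydownstream if r[0] == tech_id]
    rows.sort(key=lambda r: r[2], reverse=True)
    page = rows[offset:offset + limit]
    return [technology[r[1]] for r in page]

print(downstream_names("foobar"))  # ['aaa', 'bbb', 'ccc']
```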

4. How did we tune Nebula Graph?

As mentioned above, one great advantage of Nebula Graph is that it can be tuned with native RocksDB parameters, which lowers the learning cost. We share the meanings of the tuning options and some tuning strategies below:
| RocksDB parameter | Meaning |
|---|---|
| max_total_wal_size | Once WAL files exceed max_total_wal_size, a new WAL file is forcibly created. With the default value 0, max_total_wal_size = write_buffer_size * max_write_buffer_number * 4 |
| delete_obsolete_files_period_micros | Period for deleting obsolete files, including SST files and WAL files; the default is 6 hours |
| max_background_jobs | Maximum number of background threads = max_background_flushes + max_background_compactions |
| stats_dump_period_sec | If non-zero, rocksdb.stats information is printed to the LOG file every stats_dump_period_sec seconds |
| compaction_readahead_size | Amount of data read ahead from disk during compaction. If RocksDB runs on spinning disks, for performance it should be set to at least 2 MB. If non-zero, it also forces new_table_reader_for_compaction_inputs=true |
| writable_file_max_buffer_size | Maximum buffer size used by WritableFileWriter, RocksDB's write cache. Tuning this parameter matters for Direct IO mode |
| bytes_per_sync | Amount of data per sync; once bytes_per_sync bytes accumulate, they are actively flushed to disk. This option applies to SST files; WAL files use wal_bytes_per_sync |
| wal_bytes_per_sync | Every time a WAL file grows by wal_bytes_per_sync bytes, the file is flushed by calling sync_file_range. The default value 0 disables this behavior |
| delayed_write_rate | When a Write Stall occurs, the write speed is limited to at most delayed_write_rate |
| avoid_flush_during_shutdown | By default, the DB flushes all memtables on close. If this option is set, the flush is not forced, which may cause data loss |
| max_open_files | Number of file handles RocksDB can keep open (mainly SST files), so subsequent accesses can use them directly without reopening. Once cached handles exceed max_open_files, some handles are closed. Note that when a handle is closed, the index cache and filter cache of the corresponding SST file are released too, because index blocks and filter blocks are cached on the heap, with their maximum count controlled by max_open_files. Given how an SST file's index_block is organized, the index_block is generally 1 to 2 orders of magnitude larger than a data_block, and every read must first load the index_block; that index data then sits on the heap and is not actively evicted. Heavy random reads can therefore cause severe read amplification, and may also cause RocksDB to occupy a large amount of physical memory for no obvious reason, so tuning this value is very important: trade off performance against memory usage based on your own workload. If this value is -1, RocksDB caches all open handles forever, at the cost of large memory overhead |
| stats_persist_period_sec | If non-zero, statistics are automatically persisted every stats_persist_period_sec seconds to the hidden column family ___rocksdb_stats_history___ |
| stats_history_buffer_size | If non-zero, statistics snapshots are taken periodically and stored in memory; the maximum memory used for snapshots is stats_history_buffer_size |
| strict_bytes_per_sync | For performance, RocksDB does not flush synchronously by default when writing to disk, so data loss is possible under abnormal conditions; to bound the amount of lost data, some parameters control the flush behavior. If this parameter is true, RocksDB flushes strictly according to wal_bytes_per_sync and bytes_per_sync, i.e. a complete file each time; if false, only part of the data is flushed each time. In other words, if you do not care about possible data loss you can set it to false, but true is recommended |
| enable_rocksdb_prefix_filtering | Whether to enable the prefix bloom filter. When enabled, a bloom filter is built in the memtable from the first rocksdb_filtering_prefix_length bytes of each written key |
| enable_rocksdb_whole_key_filtering | Builds a bloom filter in the memtable keyed on the complete key. This configuration conflicts with enable_rocksdb_prefix_filtering: if enable_rocksdb_prefix_filtering is true, this configuration does not take effect |
| rocksdb_filtering_prefix_length | See enable_rocksdb_prefix_filtering |
| num_compaction_threads | Maximum number of concurrent background compaction threads (in fact the maximum size of the thread pool); the compaction thread pool is low-priority by default |
| rate_limit | Records the parameters used in the code to create a rate limiter via NewGenericRateLimiter, so the rate_limiter can be built from them. The rate_limiter is RocksDB's tool for controlling the write rate of Compaction and Flush, since writing too fast hurts read performance. For example: rate_limit = {"id":"GenericRateLimiter"; "mode":"kWritesOnly"; "clock":"PosixClock"; "rate_bytes_per_sec":"200"; "fairness":"10"; "refill_period_us":"1000000"; "auto_tuned":"false";} |
| write_buffer_size | Maximum size of a memtable. Beyond this size, RocksDB turns it into an immutable memtable and creates a new memtable |
| max_write_buffer_number | Maximum number of memtables, including mem and imm. If full, RocksDB stops subsequent writes, usually because writes are too fast for Flush to keep up |
| level0_file_num_compaction_trigger | Parameter specific to Leveled Compaction: when the number of L0 files reaches this value, compaction of L0 and L1 is triggered. The larger the value, the smaller the write amplification and the larger the read amplification; at large values the behavior approaches Universal Compaction |
| level0_slowdown_writes_trigger | When the number of L0 files exceeds this value, writes are slowed down. Tune this parameter together with level0_stop_writes_trigger to solve Write Stall problems caused by too many L0 files |
| level0_stop_writes_trigger | When the number of L0 files exceeds this value, writes are rejected. Tune this parameter together with level0_slowdown_writes_trigger to solve Write Stall problems caused by too many L0 files |
| target_file_size_base | Size of SST files at L1. Increasing this value reduces the overall DB size. If tuning is needed, set target_file_size_base = max_bytes_for_level_base / 10, so that level 1 holds 10 SST files |
| target_file_size_multiplier | Makes SST files at the levels above L1 (L2 ... L6) target_file_size_multiplier times larger than those of the previous level |
| max_bytes_for_level_base | Maximum capacity of L1 (sum of all SST file sizes); exceeding this capacity triggers compaction |
| max_bytes_for_level_multiplier | Growth factor of each level's total file size relative to the previous level |
| disable_auto_compactions | Whether to disable automatic compaction |
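In Nebula Graph, RocksDB options like these are passed to the storage service through the nebula-storaged configuration file as JSON maps. A sketch of what the relevant section might look like (the values below are illustrative, not our production settings):

```ini
########## rocksdb options (illustrative values) ##########
# Options passed through to RocksDB DBOptions
--rocksdb_db_options={"max_background_jobs":"8","max_open_files":"50000","stats_dump_period_sec":"600"}
# Options passed through to RocksDB ColumnFamilyOptions
--rocksdb_column_family_options={"write_buffer_size":"67108864","max_write_buffer_number":"4","level0_file_num_compaction_trigger":"10"}
# Prefix bloom filter settings (Nebula-level flags)
--enable_rocksdb_prefix_filtering=true
--enable_rocksdb_whole_key_filtering=false
```

After editing the configuration, the storage service needs to be restarted for the options to take effect.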
Want to discuss graph database technology? To join the Nebula communication group, first fill in your Nebula card, and the Nebula assistant will add you to the group~