An introduction to LDBC SNB, a powerful tool for graph database performance and scenario testing

2022-06-27 15:49:00 【Huawei Cloud Developer Alliance】

Abstract: This article introduces the data generator for the interactive query workload (hereinafter "Datagen") and shows how LDBC SNB data is used in GES, Huawei's Graph Engine Service.

This article is shared from the Huawei Cloud community post 《[Graph database performance and scenario testing tool LDBC SNB] Series: Introduction to the data generator & application to the GES service》, author: Farce and ball

This article covers two topics: an introduction to the data generator for the interactive query workload (hereinafter "Datagen"), and how LDBC SNB data is used in GES, Huawei's Graph Engine Service. LDBC SNB predefines the nodes and relationships, the data generator, and the test cases for the system under test, which together form a logically self-consistent data "Wulin" (martial arts world). Graph database products tested against LDBC SNB are like the xiake (wandering swordsmen) who walk through it: all must follow the same "Wulin rules" (the test cases). Who can defeat all the masters and become leader of the alliance?

LDBC SNB overview

LDBC SNB stands for the Linked Data Benchmark Council's Social Network Benchmark (official website: http://ldbcouncil.org). LDBC is an industry consortium dedicated to advancing graph data management. It has developed a set of standard benchmarks for systematically measuring the functionality and performance of different graph database products. SNB is a set of benchmarks built around a social network scenario, consisting of the Interactive workload and the Business Intelligence workload.

The LDBC SNB project has 3 components: the data generator (Datagen), the test driver (Test Driver, used to run the benchmark), and the test case implementations (Reference Implementation, currently available in two query languages: Cypher (Neo4j) and SQL (PostgreSQL)).

LDBC SNB has two working modes:

1. Interactive workload: suited to transactional online query scenarios, such as basic create, read, update, and delete operations, shortest path, and multi-hop queries;

2. Business Intelligence workload: suited to complex queries and large-scale offline graph analysis based on enterprise business scenarios.

The Datagen, Test Driver, and test case implementation all differ between the two working modes.

Chapter overview

1. Introduction to Datagen

Data model
Data types
Data schema
Installation and operation of Datagen
Datagen parameter settings
General parameter settings
Scale factor
Serialization mode

2. Applying LDBC SNB in GES

1. Introduction to Datagen

Data model

Data types

Datagen supports the following property data types. Each attribute can be either single-valued or list-valued.

(Screenshot from the official documentation: http://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf)

Data schema

(Screenshot from the official documentation: http://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf)

As shown in the figure, the data generated by Datagen follows a predefined graph model, which includes:

8 types of nodes: organization, place, tag, tagClass, person, forum, post, and comment

15 types of relationships, listed in the following table:

These predefined nodes and relationships form a logically self-consistent data "Wulin". Graph database products tested against LDBC SNB are like the xiake who walk through it: all must follow the same "Wulin rules" (the test cases). Who can defeat all the masters and become leader of the alliance? Let's wait and see.

Installation and operation process

In Interactive workload mode, Datagen is built on Hadoop; in BI workload mode, it is built on Spark.

This investigation mainly uses the Datagen variant that runs on pseudo-distributed Hadoop.

1) Download the Hadoop-based LDBC Datagen

GitHub - ldbc/ldbc_snb_datagen_hadoop: The Hadoop-based variant of the SNB Datagen

2) Run it with pseudo-distributed Hadoop

cd ldbc_snb_datagen_hadoop/
cp params-csv-composite.ini params.ini
wget http://archive.apache.org/dist/hadoop/core/hadoop-3.2.1/hadoop-3.2.1.tar.gz
tar xf hadoop-3.2.1.tar.gz
export HADOOP_CLIENT_OPTS="-Xmx2G"
# set this to the Hadoop 3.2.1 directory
export HADOOP_HOME=`pwd`/hadoop-3.2.1
./run.sh

3) Resolving a missing JAR at compile time (the error is shown below)

Solution:

Download the JAR (here, from a Windows environment) from https://simulation.tudelft.nl/maven/dsol/dsol-xml/1.6.9/

Manually install the missing JAR into the local Maven repository

mvn install:install-file -Dfile=dsol-xml-1.6.9.jar -DgroupId=dsol -DartifactId=dsol-xml -Dversion=1.6.9 -Dpackaging=jar

4) Run again to complete the build

sh run.sh

The generated data files are stored in ${outputDir}/social_network.

Parameter settings

(The parameter descriptions below omit the prefix "ldbc.snb.datagen."; that is, the full form of each parameter is "ldbc.snb.datagen.xxx".)

1) General parameters

2) Scale factor

LDBC SNB can generate graph datasets of different sizes. The number of vertices and edges corresponding to each value of the generator.scaleFactor parameter is shown in the following table:

(Screenshot from the official documentation: http://ldbcouncil.org/ldbc_snb_docs/ldbc-snb-specification.pdf)
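As a rough sketch of how the scale factor could be changed without editing the file by hand: params.ini uses `key:value` lines (the same form as the serializer settings listed later in this article), so a small helper can replace or append the relevant line. The helper name `set_param` and the sample value `snb.interactive.1` are assumptions for illustration; check the specification table for the values that apply to your Datagen version.

```python
def set_param(ini_text: str, key: str, value: str) -> str:
    """Return ini_text with `key:value` set, replacing an existing line or appending one."""
    lines = ini_text.splitlines()
    new_line = f"{key}:{value}"
    for i, line in enumerate(lines):
        # Only an active line with exactly this key matches; "#key:..." does not.
        if line.split(":", 1)[0].strip() == key:
            lines[i] = new_line
            break
    else:
        # Key not present: append it at the end.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed parameter value, shown for illustration only.
    key = "ldbc.snb.datagen.generator.scaleFactor"
    sample = f"{key}:snb.interactive.0.1\n"
    print(set_param(sample, key, "snb.interactive.1"), end="")
```

Reading params.ini into a string, passing it through the helper, and writing it back before running run.sh would apply the change.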

3) Serialization mode

Datagen has 4 main serialization modes for CSV files, each producing a different data layout.

CsvBasic

The basic serialization mode. Each node type, and each relationship between node types, is written to its own CSV file, as shown in Figure 1:

Figure 1: Each node type and each relationship between node types has its own CSV file. The person_xx.csv files all hold attribute data for person nodes.

If an attribute has multiple values, for example the email attribute of person, the email records are written to a separate CSV file, with each email on its own line, as shown in Figure 2:

Figure 2: The email attribute of person is stored separately, with multiple emails shown as multiple records.

CsvComposite (the data generated in this mode is the closest to the CSV format supported by GES)

Building on CsvBasic, multi-valued attributes are combined with the other attributes into a single record, as shown in Figure 3, and the multiple values are merged into a semicolon-separated list, as shown in Figure 4.

Figure 3: The attribute records of person nodes are merged into person_0_0.csv.

Figure 4: The two list attributes, language and email, are each merged onto one line.
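The difference between the two layouts can be illustrated with a small in-memory sketch. This is not Datagen's own code: the function name `merge_multivalued`, the column names, and the sample values are hypothetical, and only the semicolon separator comes from the description above.

```python
from collections import defaultdict


def merge_multivalued(person_rows, email_rows, sep=";"):
    """Fold CsvBasic-style (person_id, email) rows into one list-valued field per person."""
    emails = defaultdict(list)
    for person_id, email in email_rows:
        emails[person_id].append(email)
    # Each output record keeps the person's attributes and gains a joined email field;
    # a person without any email gets an empty string.
    return [
        {**person, "email": sep.join(emails[person["id"]])}
        for person in person_rows
    ]


if __name__ == "__main__":
    persons = [{"id": "1", "firstName": "Ann"}, {"id": "2", "firstName": "Bo"}]
    basic_email_rows = [("1", "ann@example.org"), ("1", "ann@example.com")]
    for record in merge_multivalued(persons, basic_email_rows):
        print(record)
```

In CsvBasic the two email rows for person 1 live in their own file; in CsvComposite they end up in a single field of the person record.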

CsvMergeForeign

Building on CsvBasic, if a relationship between nodes is one-to-many, the relationship is merged into the node's attribute file as a foreign key, as shown in Figure 5:

Figure 5: The comment-hasCreator->person, comment-isLocatedIn->place, comment-replyOf->post, and comment-replyOf->comment relationships are merged into the comment attribute file.
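A minimal sketch of the foreign-key merge, under the assumption that each source node (for example a comment) has at most one target for the relationship being merged (for example its creator). The function name `merge_foreign_key` and the row layout are hypothetical and are not taken from Datagen.

```python
def merge_foreign_key(node_rows, edge_rows, column):
    """Attach a relationship to its source node rows as a foreign-key column."""
    target_of = {}
    for source_id, target_id in edge_rows:
        if source_id in target_of:
            # More than one target per source cannot be expressed as a single column.
            raise ValueError(f"{source_id} has more than one target for {column}")
        target_of[source_id] = target_id
    # Nodes without an edge get an empty foreign-key value.
    return [{**row, column: target_of.get(row["id"], "")} for row in node_rows]


if __name__ == "__main__":
    comments = [{"id": "c1", "content": "hi"}, {"id": "c2", "content": "ok"}]
    has_creator = [("c1", "p9"), ("c2", "p9")]
    print(merge_foreign_key(comments, has_creator, "creator"))
```

The separate comment_hasCreator_person edge file disappears in this layout; the creator id travels with each comment row instead.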

CsvCompositeMergeForeign

A combination of CsvComposite and CsvMergeForeign: list attributes are merged, and one-to-many relationships are folded in as foreign keys, as shown in Figure 6:

Figure 6: The place column is the foreign-key representation of the person-isLocatedIn->place relationship, while language and email are shown as lists.

The parameter values corresponding to each serialization mode are as follows:

CsvBasic

ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.CsvBasicDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.CsvBasicDynamicPersonSerializer
#ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.CsvBasicStaticSerializer

CsvComposite

ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.CsvCompositeDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.CsvCompositeDynamicPersonSerializer
ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.CsvCompositeStaticSerializer

CsvMergeForeign

ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.CsvMergeForeignDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.CsvMergeForeignDynamicPersonSerializer
ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.CsvMergeForeignStaticSerializer

CsvCompositeMergeForeign

ldbc.snb.datagen.serializer.dynamicActivitySerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.activity.CsvCompositeMergeForeignDynamicActivitySerializer
ldbc.snb.datagen.serializer.dynamicPersonSerializer:ldbc.snb.datagen.serializer.snb.csv.dynamicserializer.person.CsvCompositeMergeForeignDynamicPersonSerializer
ldbc.snb.datagen.serializer.staticSerializer:ldbc.snb.datagen.serializer.snb.csv.staticserializer.CsvCompositeMergeForeignStaticSerializer
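The twelve settings above follow one naming pattern, so they can be generated from the mode name. The sketch below only reproduces the strings listed in this article; the function name `serializer_params` is hypothetical.

```python
MODES = ("CsvBasic", "CsvComposite", "CsvMergeForeign", "CsvCompositeMergeForeign")
PREFIX = "ldbc.snb.datagen.serializer"


def serializer_params(mode: str) -> list:
    """Return the three params.ini lines that select the given CSV serialization mode."""
    if mode not in MODES:
        raise ValueError(f"unknown serialization mode: {mode}")
    base = f"{PREFIX}.snb.csv"
    return [
        f"{PREFIX}.dynamicActivitySerializer:"
        f"{base}.dynamicserializer.activity.{mode}DynamicActivitySerializer",
        f"{PREFIX}.dynamicPersonSerializer:"
        f"{base}.dynamicserializer.person.{mode}DynamicPersonSerializer",
        f"{PREFIX}.staticSerializer:"
        f"{base}.staticserializer.{mode}StaticSerializer",
    ]


if __name__ == "__main__":
    print("\n".join(serializer_params("CsvComposite")))
```

Generating the lines this way avoids mixing serializers from different modes when switching formats.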

2. Applying LDBC SNB in GES

The dataset generated by Datagen differs from the GES format in the following 3 ways:

Vertices with different labels may have duplicate ids;
The knows relationship is bidirectional;
There is no label column.

The DatagenToGES data conversion script (based on the CsvComposite serialization mode) converts LDBC data into the GES format. It needs to run in a Python 3.6 environment.

The DatagenToGES script does the following:

Maps the 8 node types to the numeric prefixes 1-8 and converts each original id into a new 20-byte id that starts with the numeric prefix, which resolves duplicate ids across vertices with different labels;
Adds reverse-edge data to the knows edge file;
Adds a label column.
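The three steps can be sketched as follows. This is not the DatagenToGES script itself, whose source is not shown in this article: the order in which labels are assigned to prefixes, the zero-padding scheme, the position of the label column, and all function names are assumptions made for illustration.

```python
# Hypothetical prefix assignment; the real script's label-to-prefix order is not given.
LABEL_PREFIX = {
    "organization": 1, "place": 2, "tag": 3, "tagClass": 4,
    "person": 5, "forum": 6, "post": 7, "comment": 8,
}
ID_LENGTH = 20  # 20 ASCII characters, i.e. 20 bytes


def to_ges_id(label, original_id):
    """Build a 20-character id: one-digit label prefix plus the zero-padded original id."""
    digits = str(original_id)
    if len(digits) > ID_LENGTH - 1:
        raise ValueError(f"id {digits} does not fit in {ID_LENGTH - 1} characters")
    return f"{LABEL_PREFIX[label]}{digits.zfill(ID_LENGTH - 1)}"


def add_reverse_knows(edges):
    """Return the knows edges followed by a reversed copy of each edge."""
    edges = list(edges)
    return edges + [(dst, src) for src, dst in edges]


def add_label(rows, label):
    """Insert the label right after the id column (assumed layout: id, label, properties)."""
    return [[row[0], label, *row[1:]] for row in rows]


if __name__ == "__main__":
    print(to_ges_id("person", 933), to_ges_id("post", 933))
    print(add_reverse_knows([("a", "b")]))
    print(add_label([["933", "Ann"]], "person"))
```

Because the prefix differs per label, a person and a post that share the same original id end up with distinct ids after conversion.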

File format before conversion (CsvComposite serialization mode):