当前位置:网站首页>Open source SPL eliminates tens of thousands of database intermediate tables
Open source SPL eliminates tens of thousands of database intermediate tables
2022-07-05 20:45:00 【It bond】
The intermediate table is a data table in the database that stores intermediate calculation results , It is often a summary table established in the database for faster or more convenient front-end query and statistics , Because it is an intermediate result processed from raw data , Therefore, it is called intermediate table . In some large institutions , The number of intermediate watches accumulated over the years is as high as tens of thousands , It causes a lot of trouble to the system and use .
Intermediate tables will occupy a lot of database storage space, resulting in insufficient database capacity , Facing expansion pressure . Database space is often expensive , The cost of expansion is very high , And database expansion is often limited , It is not a good way to store intermediate tables at high cost . meanwhile , Too many intermediate tables will also cause database performance problems , Intermediate tables do not exist in isolation , From the original data to the intermediate table, it needs a series of operations, which consumes database computing resources , And the frequency of processing intermediate tables is sometimes very high , A lot of resources of the database are consumed in the generation of intermediate tables , Serious cases will cause slow database queries 、 Slow trading and other issues .
Why are there so many intermediate tables ? The main reasons are as follows .
1. One step cannot be calculated
The original data table in the database needs complex calculation , Can be shown on the report . One SQL It is difficult to achieve such complex calculations . There should be multiple consecutive SQL Realization , The former generates an intermediate table for the latter SQL Use .
2. The waiting time for real-time calculation is too long
Because of the large amount of data or complex calculation , Report users wait too long . So run batch tasks every night , Calculate the data and store it in the intermediate table . Report users will query much faster based on the intermediate table .
3. Diverse data sources participate in the calculation
From file 、NOSQL、Web service And other external data , It doesn't have much computing power , Need to use the computing power of the database , Especially when you want to perform mixed calculation with the data in the database , The traditional method can only import the database to form an intermediate table .
4. The middle table is difficult to delete
Because the database usually adopts a flat structure that lacks hierarchy , Once created, the intermediate table may be used by multiple queries , Deleting may affect other queries . It's even hard to figure out which programs use an intermediate table , Not to mention deleting , It's not that I don't want to delete , But dare not delete . Accumulate over a long period , It's not surprising that there are tens of thousands of middle watches .
that , Why save intermediate data into the database to form an intermediate table ? Careful observation of the direct cause of the middle table can be seen , The main purpose of saving to the database is to continue to rely on the computing power of the database . Intermediate data will be further calculated when used , Sometimes the calculation is complicated , At present, there are only databases (SQL) Have more convenient computing ability . Although data storage forms such as files also have advantages ( Such as IO High performance 、 Compressible 、 Easy to parallel ), But the document has no computing power , If the calculation is hard coded in the application based on the file , Far from it SQL convenient . In order to make further use of the computing power of the database is the fundamental reason for the generation of intermediate tables .
Intermediate data is necessary in a sense , But just to obtain further computing power, it will occupy a lot of database resources , Obviously, it is not an ideal solution . If the file has the same ability as the database , Storing the intermediate table in the file system outside the database can solve various problems of the intermediate table in the database , The database can also be freed ( alleviate excessive burden ).
Open source SPL It can be achieved .
SPL Is an open source structured data computing engine , Data processing can be directly based on files , Make the file also have computing power .SPL Database independent , It provides professional structured data objects and rich operation class libraries on them , Have complete computing power , At the same time, it supports process control , It is also convenient to realize complex calculation , It can completely replace the database to complete the intermediate table generation and subsequent data processing tasks .

Document calculation
SPL Can be based on Csv、Excel Wait for documents to calculate , You can also calculate JSON/XML And so on , Easy to read and use . such , You can store intermediate table data into such files , Reuse SPL Processing . Here are some general operations :
| A | B | |
| 1 | =T("/data/scores.txt") | |
| 2 | =A1.select(CLASS==10) | Filter |
| 3 | =A1.groups(CLASS;min(English),max(Chinese),sum(Math)) | Group summary |
| 4 | =A1.sort(CLASS:-1) | Sort |
| 5 | =T("/data/students.txt").keys(SID) | |
| 6 | =A1.join(STUID,A5,SNAME) | relation |
| 7 | =A6.derive(English+ Chinese+ Math:TOTLE) | Append column |
In addition to the original SPL grammar ,SPL It also provides quite SQL92 The standard SQL Support , For familiar with using SQL People can use it directly SQL Query file .
$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
More complicated with All support :
$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t
left join ClientTable ct on t.Client=ct.Client
SPL Processing JSON/XML And so on ( file ) It also has advantages , Such as : According to the employee order information (json) Complete the calculation .
| A | ||
| 1 | =json(file("/data/EO.json").read()) | |
| 2 | =A1.conj(Orders) | |
| 3 | =A2.select(Amount>1000 && Amount<=3000 && [email protected](Client,"*s*")) | filter |
| 4 | =A2.groups(year(OrderDate);sum(Amount)) | Group summary |
| 5 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | Associated calculation |
You can see , Relative to others JSON library ( Such as JsonPath)SPL The implementation of is simpler .
Again , Use SQL You can also check JSON data :
$select * from {json(file("/data/EO.json").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
SPL Agile syntax and process computing are also very suitable for complex computing , For example, based on stock records (txt) Calculate the longest consecutive days of a stock It can be written like this :
| A | |
| 1 | =T("/data/stock.txt") |
| 2 | [email protected](price<price[-1]).max(~.len())-1 |
Another example , According to the user login record (csv) List the last login interval of each user :
| A | ||
| 1 | =T(“/data/ulogin.csv”) | |
| 2 | =A1.groups(uid;top(2,-logtime)) | Last 2 Login records |
| 3 | =A2.new(uid,#2(1).logtime-#2(2).logtime:interval) | Calculation interval |
Such calculations are even based on database usage SQL It's also hard to write ,SPL It is very convenient to realize .
With SPL Out of Library computing support , Originally, various problems caused by the intermediate table of the database can be effectively solved . File storage no longer takes up database storage space , The pressure of database expansion decreases , The database is more convenient to manage ; Out of Library computing no longer occupies database computing resources , Database load reduction can better serve other businesses .
High performance file format
Although text is a very common form of data storage , It has the advantages of versatility and readability , however , The performance of text is very poor ! It is difficult to achieve high performance based on text .
Text characters cannot be calculated directly , Need to convert to an integer 、 The set of real Numbers 、 date 、 String and other memory data types can be further processed , Text parsing is a very complex task ,CPU Time consuming . In general , The main time of external memory data access is the reading of the hard disk itself , However, the performance bottleneck of text files often occurs in CPU link . Because of the complexity of parsing ,CPU It is likely to take more time than the hard disk ( Especially when using high-performance solid-state drives ). Text is usually not used when high-performance processing of large amounts of data is required .
SPL Provides two high-performance data storage formats , Set files and group tables . The set file is SPL Binary data format provided , Compression technology is adopted ( Smaller footprint and faster reading ), Stored data type ( There is no need to parse the data type, and reading is faster ), It also supports the multiplication and segmentation mechanism of appendable data , It is easy to realize parallel computing by using segmentation strategy , Further improve computing performance .
Group table is SPL Provide inventory 、 File storage format of indexing mechanism , The number of columns involved in the calculation ( Field ) When there is less inventory, it will have great advantages . The group table supports column storage , Realized minmax Index outside , It also supports the multiplication and segmentation mechanism , In this way, we can not only enjoy the advantages of inventory , It is also easier to improve parallel computing performance .
SPL Storage is very convenient , Basically consistent with the use of text , For example, read the set file and calculate :
| A | B | |
| 1 | =T("/data/scores.btx") | Read in set file |
| 2 | =A1.select(CLASS==10) | Filter |
| 3 | =A1.groups(CLASS;min(English),max(Chinese),sum(Math)) | Group summary |
If the amount of data is large , It also supports cursor batch reading and multiple CPU Parallel computing :
=file("/data/scores.btx")[email protected]()
When using files as data storage , No matter what format the original data is , Finally, they must at least be converted into binary ( Such as set file ) Format , In this way, it will have more advantages in terms of space occupation and computing performance .
Manageability
After the intermediate table is transferred outside the library and stored by file , In addition to reducing the burden on the database , The intermediate table outside the library itself also has strong manageability . Files can be stored through the tree directory of the system , Easy to use and manage . Different systems 、 The intermediate tables used by different modules are stored in different directories very clearly , There will be no cross references , In this way, there will be no tight coupling problem before each system or module caused by the previous confusion of the use of intermediate tables in the database . If the corresponding function module is offline, you can safely delete the corresponding intermediate table data without worrying about the impact on other programs .
Multi data source support
In addition to file data sources ,SPL It also supports dozens of other data sources , You can not only connect and access , It can also complete mixed calculation .

After the intermediate table is stored in files, cross source calculation is involved in the full query with the real-time data in the database , Use SPL Complete such T+0 Inquiry is very convenient .
| A | ||
| 1 | =cold=file(“/data/orders.ctx”).open().cursor(area,customer,amount) | / Cold data from file system (SPL High performance storage ) To take , Yesterday's and previous data |
| 2 | =hot=db.cursor(“select area,customer,amount from orders where odate>=?”,date(now())) | / Heat data is taken from the production Library , Today's data |
| 3 | =[cold,hot].conjx() | |
| 4 | =A3.groups(area,customer;sum(amout):amout) | / Hybrid computing implementation T+0 |
Integration
SPL Provides standards JDBC and ODBC Call for interface supply . Specially , about Java Applications can put SPL Integrated into the application as an embedded engine , Make the application itself have intermediate ( data ) Table processing capacity .
JDBC call SPL Code example :
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
…
SPL It's the interpretation of execution , Natural support for hot switching . be based on SPL Data calculation logic writing 、 There is no need to restart for modification and operation and maintenance , In real time , Development, operation and maintenance are also more convenient .
With the ability to calculate outside the Library SPL, Move the intermediate table to the file system , It can help the database eliminate tens of thousands of intermediate tables , While reducing the burden on the database , Get more flexibility 、 Faster performance and stronger scalability .
SPL Information
Welcome to SPL Interested assistant (VX Number :SPL-helper), Into the SPL Technology exchange group
边栏推荐
- Make Jar, Not War
- Duchefa丨MS培养基含维生素说明书
- Abnova fluorescent dye 620-m streptavidin scheme
- [Yugong series] go teaching course in July 2022 004 go code Notes
- 资源道具化
- Propping of resources
- Duchefa p1001 plant agar Chinese and English instructions
- National Eye Care Education Conference, 2022 the Fourth Beijing International Youth eye health industry exhibition
- 2022北京眼睛健康用品展,护眼产品展,中国眼博会11月举办
- Rainbow 5.7.1 supports docking with multiple public clouds and clusters for abnormal alarms
猜你喜欢

Classic implementation of the basic method of intelligent home of Internet of things

Abnova丨荧光染料 620-M 链霉亲和素方案

Duchefa MS medium contains vitamin instructions

Introduction to dead letter queue (two consumers, one producer)

Abnova fluorescent dye 620-m streptavidin scheme
![Informatics Orsay all in one 1339: [example 3-4] find the post order traversal | Valley p1827 [usaco3.4] American Heritage](/img/f0/0f985425bd61d9852af0b5fd7307ee.png)
Informatics Orsay all in one 1339: [example 3-4] find the post order traversal | Valley p1827 [usaco3.4] American Heritage

王老吉药业“关爱烈日下最可爱的人”公益活动在南京启动
![[quick start of Digital IC Verification] 2. Through an example of SOC project, understand the architecture of SOC and explore the design process of digital system](/img/1d/22bf47bfa30b9bdc2e8fd348180f49.png)
[quick start of Digital IC Verification] 2. Through an example of SOC project, understand the architecture of SOC and explore the design process of digital system

Leetcode (695) - the largest area of an island

Duchefa丨MS培养基含维生素说明书
随机推荐
Graph embedding learning notes
Duchefa s0188 Chinese and English instructions of spectinomycin hydrochloride pentahydrate
Which securities is better for securities account opening? Is online account opening safe?
【愚公系列】2022年7月 Go教学课程 004-Go代码注释
Relationship between mongodb documents
线程池的使用
Applet page navigation
中国管理科学研究院凝聚行业专家,傅强荣获智库专家“十佳青年”称号
Abnova fluorescent dye 620-m streptavidin scheme
Usaco3.4 "broken Gong rock" band raucous rockers - DP
CCPC 2021 Weihai - G. shinyruo and KFC (combination number, tips)
Typhoon is coming! How to prevent typhoons on construction sites!
Use of form text box (II) input filtering (synthetic event)
Popular science | does poor English affect the NPDP exam?
[Yugong series] go teaching course in July 2022 004 go code Notes
Composition of applet code
Document method
物联网智能家居基本方法实现之经典
Abnova DNA marker high quality control test program
Abnova total RNA Purification Kit for cultured cells Chinese and English instructions