当前位置:网站首页>Open source SPL eliminates tens of thousands of database intermediate tables
Open source SPL eliminates tens of thousands of database intermediate tables
2022-07-05 20:45:00 【It bond】
The intermediate table is a data table in the database that stores intermediate calculation results , It is often a summary table established in the database for faster or more convenient front-end query and statistics , Because it is an intermediate result processed from raw data , Therefore, it is called intermediate table . In some large institutions , The number of intermediate watches accumulated over the years is as high as tens of thousands , It causes a lot of trouble to the system and use .
Intermediate tables will occupy a lot of database storage space, resulting in insufficient database capacity , Facing expansion pressure . Database space is often expensive , The cost of expansion is very high , And database expansion is often limited , It is not a good way to store intermediate tables at high cost . meanwhile , Too many intermediate tables will also cause database performance problems , Intermediate tables do not exist in isolation , From the original data to the intermediate table, it needs a series of operations, which consumes database computing resources , And the frequency of processing intermediate tables is sometimes very high , A lot of resources of the database are consumed in the generation of intermediate tables , Serious cases will cause slow database queries 、 Slow trading and other issues .
Why are there so many intermediate tables ? The main reasons are as follows .
1. One step cannot be calculated
The original data table in the database needs complex calculation , Can be shown on the report . One SQL It is difficult to achieve such complex calculations . There should be multiple consecutive SQL Realization , The former generates an intermediate table for the latter SQL Use .
2. The waiting time for real-time calculation is too long
Because of the large amount of data or complex calculation , Report users wait too long . So run batch tasks every night , Calculate the data and store it in the intermediate table . Report users will query much faster based on the intermediate table .
3. Diverse data sources participate in the calculation
From file 、NOSQL、Web service And other external data , It doesn't have much computing power , Need to use the computing power of the database , Especially when you want to perform mixed calculation with the data in the database , The traditional method can only import the database to form an intermediate table .
4. The middle table is difficult to delete
Because the database usually adopts a flat structure that lacks hierarchy , Once created, the intermediate table may be used by multiple queries , Deleting may affect other queries . It's even hard to figure out which programs use an intermediate table , Not to mention deleting , It's not that I don't want to delete , But dare not delete . Accumulate over a long period , It's not surprising that there are tens of thousands of middle watches .
that , Why save intermediate data into the database to form an intermediate table ? Careful observation of the direct cause of the middle table can be seen , The main purpose of saving to the database is to continue to rely on the computing power of the database . Intermediate data will be further calculated when used , Sometimes the calculation is complicated , At present, there are only databases (SQL) Have more convenient computing ability . Although data storage forms such as files also have advantages ( Such as IO High performance 、 Compressible 、 Easy to parallel ), But the document has no computing power , If the calculation is hard coded in the application based on the file , Far from it SQL convenient . In order to make further use of the computing power of the database is the fundamental reason for the generation of intermediate tables .
Intermediate data is necessary in a sense , But just to obtain further computing power, it will occupy a lot of database resources , Obviously, it is not an ideal solution . If the file has the same ability as the database , Storing the intermediate table in the file system outside the database can solve various problems of the intermediate table in the database , The database can also be freed ( alleviate excessive burden ).
Open source SPL It can be achieved .
SPL Is an open source structured data computing engine , Data processing can be directly based on files , Make the file also have computing power .SPL Database independent , It provides professional structured data objects and rich operation class libraries on them , Have complete computing power , At the same time, it supports process control , It is also convenient to realize complex calculation , It can completely replace the database to complete the intermediate table generation and subsequent data processing tasks .
Document calculation
SPL Can be based on Csv、Excel Wait for documents to calculate , You can also calculate JSON/XML And so on , Easy to read and use . such , You can store intermediate table data into such files , Reuse SPL Processing . Here are some general operations :
A | B | |
1 | =T("/data/scores.txt") | |
2 | =A1.select(CLASS==10) | Filter |
3 | =A1.groups(CLASS;min(English),max(Chinese),sum(Math)) | Group summary |
4 | =A1.sort(CLASS:-1) | Sort |
5 | =T("/data/students.txt").keys(SID) | |
6 | =A1.join(STUID,A5,SNAME) | relation |
7 | =A6.derive(English+ Chinese+ Math:TOTLE) | Append column |
In addition to the original SPL grammar ,SPL It also provides quite SQL92 The standard SQL Support , For familiar with using SQL People can use it directly SQL Query file .
$select * from d:/Orders.csv where Client in ('TAS','KBRO','PNS')
More complicated with All support :
$select t.Client, t.s, ct.Name, ct.address from
(select Client ,sum(amount) s from d:/Orders.csv group by Client) t
left join ClientTable ct on t.Client=ct.Client
SPL Processing JSON/XML And so on ( file ) It also has advantages , Such as : According to the employee order information (json) Complete the calculation .
A | ||
1 | =json(file("/data/EO.json").read()) | |
2 | =A1.conj(Orders) | |
3 | =A2.select(Amount>1000 && Amount<=3000 && [email protected](Client,"*s*")) | filter |
4 | =A2.groups(year(OrderDate);sum(Amount)) | Group summary |
5 | =A1.new(Name,Gender,Dept,Orders.OrderID,Orders.Client,Orders.Client,Orders.SellerId,Orders.Amount,Orders.OrderDate) | Associated calculation |
You can see , Relative to others JSON library ( Such as JsonPath)SPL The implementation of is simpler .
Again , Use SQL You can also check JSON data :
$select * from {json(file("/data/EO.json").read())}
where Amount>=100 and Client like 'bro' or OrderDate is null
SPL Agile syntax and process computing are also very suitable for complex computing , For example, based on stock records (txt) Calculate the longest consecutive days of a stock It can be written like this :
A | |
1 | =T("/data/stock.txt") |
2 | [email protected](price<price[-1]).max(~.len())-1 |
Another example , According to the user login record (csv) List the last login interval of each user :
A | ||
1 | =T(“/data/ulogin.csv”) | |
2 | =A1.groups(uid;top(2,-logtime)) | Last 2 Login records |
3 | =A2.new(uid,#2(1).logtime-#2(2).logtime:interval) | Calculation interval |
Such calculations are even based on database usage SQL It's also hard to write ,SPL It is very convenient to realize .
With SPL Out of Library computing support , Originally, various problems caused by the intermediate table of the database can be effectively solved . File storage no longer takes up database storage space , The pressure of database expansion decreases , The database is more convenient to manage ; Out of Library computing no longer occupies database computing resources , Database load reduction can better serve other businesses .
High performance file format
Although text is a very common form of data storage , It has the advantages of versatility and readability , however , The performance of text is very poor ! It is difficult to achieve high performance based on text .
Text characters cannot be calculated directly , Need to convert to an integer 、 The set of real Numbers 、 date 、 String and other memory data types can be further processed , Text parsing is a very complex task ,CPU Time consuming . In general , The main time of external memory data access is the reading of the hard disk itself , However, the performance bottleneck of text files often occurs in CPU link . Because of the complexity of parsing ,CPU It is likely to take more time than the hard disk ( Especially when using high-performance solid-state drives ). Text is usually not used when high-performance processing of large amounts of data is required .
SPL Provides two high-performance data storage formats , Set files and group tables . The set file is SPL Binary data format provided , Compression technology is adopted ( Smaller footprint and faster reading ), Stored data type ( There is no need to parse the data type, and reading is faster ), It also supports the multiplication and segmentation mechanism of appendable data , It is easy to realize parallel computing by using segmentation strategy , Further improve computing performance .
Group table is SPL Provide inventory 、 File storage format of indexing mechanism , The number of columns involved in the calculation ( Field ) When there is less inventory, it will have great advantages . The group table supports column storage , Realized minmax Index outside , It also supports the multiplication and segmentation mechanism , In this way, we can not only enjoy the advantages of inventory , It is also easier to improve parallel computing performance .
SPL Storage is very convenient , Basically consistent with the use of text , For example, read the set file and calculate :
A | B | |
1 | =T("/data/scores.btx") | Read in set file |
2 | =A1.select(CLASS==10) | Filter |
3 | =A1.groups(CLASS;min(English),max(Chinese),sum(Math)) | Group summary |
If the amount of data is large , It also supports cursor batch reading and multiple CPU Parallel computing :
=file("/data/scores.btx")[email protected]()
When using files as data storage , No matter what format the original data is , Finally, they must at least be converted into binary ( Such as set file ) Format , In this way, it will have more advantages in terms of space occupation and computing performance .
Manageability
After the intermediate table is transferred outside the library and stored by file , In addition to reducing the burden on the database , The intermediate table outside the library itself also has strong manageability . Files can be stored through the tree directory of the system , Easy to use and manage . Different systems 、 The intermediate tables used by different modules are stored in different directories very clearly , There will be no cross references , In this way, there will be no tight coupling problem before each system or module caused by the previous confusion of the use of intermediate tables in the database . If the corresponding function module is offline, you can safely delete the corresponding intermediate table data without worrying about the impact on other programs .
Multi data source support
In addition to file data sources ,SPL It also supports dozens of other data sources , You can not only connect and access , It can also complete mixed calculation .
After the intermediate table is stored in files, cross source calculation is involved in the full query with the real-time data in the database , Use SPL Complete such T+0 Inquiry is very convenient .
A | ||
1 | =cold=file(“/data/orders.ctx”).open().cursor(area,customer,amount) | / Cold data from file system (SPL High performance storage ) To take , Yesterday's and previous data |
2 | =hot=db.cursor(“select area,customer,amount from orders where odate>=?”,date(now())) | / Heat data is taken from the production Library , Today's data |
3 | =[cold,hot].conjx() | |
4 | =A3.groups(area,customer;sum(amout):amout) | / Hybrid computing implementation T+0 |
Integration
SPL Provides standards JDBC and ODBC Call for interface supply . Specially , about Java Applications can put SPL Integrated into the application as an embedded engine , Make the application itself have intermediate ( data ) Table processing capacity .
JDBC call SPL Code example :
…
Class.forName("com.esproc.jdbc.InternalDriver");
Connection conn =DriverManager.getConnection("jdbc:esproc:local://");
Statement st = connection.();
CallableStatement st = conn.prepareCall("{call splscript(?, ?)}");
st.setObject(1, 3000);
st.setObject(2, 5000);
ResultSet result=st.execute();
…
SPL It's the interpretation of execution , Natural support for hot switching . be based on SPL Data calculation logic writing 、 There is no need to restart for modification and operation and maintenance , In real time , Development, operation and maintenance are also more convenient .
With the ability to calculate outside the Library SPL, Move the intermediate table to the file system , It can help the database eliminate tens of thousands of intermediate tables , While reducing the burden on the database , Get more flexibility 、 Faster performance and stronger scalability .
SPL Information
Welcome to SPL Interested assistant (VX Number :SPL-helper), Into the SPL Technology exchange group
边栏推荐
- Welcome to the game and win rich bonuses: Code Golf Challenge officially launched
- Norgen AAV extractant box instructions (including features)
- Mysql频繁操作出现锁表问题
- 常用的视图容器类组件
- 小程序页面导航
- 小程序项目结构
- Informatics Olympiad 1340: [example 3-5] extended binary tree
- 解读协作型机器人的日常应用功能
- 重上吹麻滩——段芝堂创始人翟立冬游记
- Where is a good stock account? Is online account manager safe to open an account
猜你喜欢
[record of question brushing] 1 Sum of two numbers
鸿蒙系统控制LED的实现方法之经典
Abnova丨DNA 标记高质量控制测试方案
Analyze the knowledge transfer and sharing spirit of maker Education
Pytorch 1.12 was released, officially supporting Apple M1 chip GPU acceleration and repairing many bugs
CVPR 2022 | 常见3D损坏和数据增强
National Eye Care Education Conference, 2022 the Fourth Beijing International Youth eye health industry exhibition
全国爱眼教育大会,2022第四届北京国际青少年眼健康产业展会
如何让化工企业的ERP库存账目更准确
Graph embedding learning notes
随机推荐
2.8、项目管理过程基础知识
Duchefa丨MS培养基含维生素说明书
小程序页面导航
【UE4】UnrealInsight获取真机性能测试报告
Return to blowing marshland -- travel notes of zhailidong, founder of duanzhitang
Applet page navigation
Make Jar, Not War
AI 从代码中自动生成注释文档
How to renew NPDP? Here comes the operation guide!
插值查找的简单理解
Duchefa s0188 Chinese and English instructions of spectinomycin hydrochloride pentahydrate
Abnova丨E (DIII) (WNV) 重组蛋白 中英文说明书
Schema and model
Which securities is better for securities account opening? Is online account opening safe?
Duchefa cytokinin dihydrozeatin (DHZ) instructions
E. Singhal and numbers (prime factor decomposition)
Model method
IC popular science article: those things about Eco
手机开户股票开户安全吗?我家比较偏远,有更好的开户途径么?
Practical demonstration: how can the production research team efficiently build the requirements workflow?