KingbaseES (Jincang database) plug-in: ftutilx
2022-06-25 11:07:00 【Thousands of sails pass by the side of the sunken boat_】
1. Plug-in introduction
ftutilx is a KingbaseES extension. It is mainly used to extract text content from blob-type fields that store formatted files. The blob field can hold files in pdf, doc, docx, wps, xls, xlsx, ppt, and pptx format. The ftutilx plug-in does not support encrypted files.
2. Adding the plug-in
Before using ftutilx, add it to shared_preload_libraries in the kingbase.conf file and restart the KingbaseES database. Then create the extension.
shared_preload_libraries = 'ftutilx' # (change requires restart)
CREATE EXTENSION ftutilx;
3. Parameter configuration
ftutilx.max_string_length
Maximum length of the extraction result. The default value is 128M. This parameter takes effect immediately after it is set.
ftutilx.jvm_option_string
JVM initialization parameters. The default value is "-Xmx1024m,-Xms1024m,-Xmn256m,-XX:MetaspaceSize=64m,-XX:MaxMetaspaceSize=128m,-XX:CompressedClassSpaceSize=256m". This parameter takes effect only when the JVM is created, which happens the first time the extracttext function is called in a session process. Setting it again afterwards has no effect.
Under the database's default extension loading mechanism, after an extension has been created in one session, its dynamic library is not loaded when a new session starts. It is loaded only when an interface in the extension is called for the first time, so setting the extension's parameters in a new session before that point does not work. The solution is to add ftutilx to either shared_preload_libraries or session_preload_libraries in the database configuration file. The ftutilx dynamic library is then loaded as soon as a new session starts, and the extension parameters can be set.
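The value of ftutilx.jvm_option_string can be sanity-checked before it is written to the configuration file. The sketch below is a minimal client-side helper, not part of ftutilx: it assumes the comma-separated format seen in the default value above, and the function names parse_jvm_options and check_heap are hypothetical. It only reads the -Xmx, -Xms, and -Xmn heap options and flags an initial heap larger than the maximum heap, a combination the JVM refuses to start with.

```python
import re

# Multipliers that convert a JVM size suffix to MiB.
_SIZE_UNITS = {"k": 1 / 1024, "m": 1, "g": 1024}

# Default value quoted in the article.
DEFAULT_OPTIONS = (
    "-Xmx1024m,-Xms1024m,-Xmn256m,-XX:MetaspaceSize=64m,"
    "-XX:MaxMetaspaceSize=128m,-XX:CompressedClassSpaceSize=256m"
)


def parse_jvm_options(option_string):
    """Return the -Xmx/-Xms/-Xmn sizes (in MiB) found in a comma-separated option string."""
    sizes = {}
    for option in option_string.split(","):
        match = re.fullmatch(r"-(Xmx|Xms|Xmn)(\d+)([kKmMgG])", option.strip())
        if match:
            name, number, unit = match.groups()
            sizes[name] = int(number) * _SIZE_UNITS[unit.lower()]
    return sizes


def check_heap(option_string):
    """Return a list of problems found in the heap settings (empty if none)."""
    sizes = parse_jvm_options(option_string)
    problems = []
    if "Xmx" in sizes and "Xms" in sizes and sizes["Xms"] > sizes["Xmx"]:
        problems.append("-Xms exceeds -Xmx")
    return problems


if __name__ == "__main__":
    print(parse_jvm_options(DEFAULT_OPTIONS))
    print(check_heap(DEFAULT_OPTIONS))
```

Options other than the three heap flags are ignored rather than validated, so an empty problem list only means the heap sizes are mutually consistent.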
4. Using ftutilx
The ftutilx plug-in provides the extracttext function, which extracts the contents of a file stored in a blob-type field. extracttext() takes one blob-type argument holding the file contents and returns the extracted content as text.
CREATE TABLE tab (title text, body blob);
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title, length(extracttext(body)) FROM tab;
4.1. Using ftutilx together with full-text search
Extracting the content of electronic documents is slow. To improve full-text search performance, you can add a stored column to the table that holds either the extraction result or the word position list.
Scheme 1 (store the extracted text):
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);
ALTER TABLE tab ADD COLUMN content text GENERATED ALWAYS AS (extracttext(body)) STORED;
CREATE INDEX tab_idx ON tab USING GIN (to_tsvector('zhparsercfg', content));
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title FROM tab WHERE to_tsvector('zhparsercfg', content) @@ to_tsquery(' journal ');
Scheme 2 (store the word position list as a tsvector):
CREATE EXTENSION zhparser;
CREATE TEXT SEARCH CONFIGURATION zhparsercfg (PARSER = zhparser);
ALTER TEXT SEARCH CONFIGURATION zhparsercfg ADD MAPPING FOR n,v,a,i,e,l WITH simple;
CREATE EXTENSION ftutilx;
CREATE TABLE tab (title text, body blob);
ALTER TABLE tab ADD COLUMN tab_idx_col tsvector GENERATED ALWAYS AS (to_tsvector('zhparsercfg', extracttext(body))) STORED;
CREATE INDEX tab_idx ON tab USING GIN (tab_idx_col);
INSERT INTO tab VALUES ('test.doc', blob_import('/home/test/data.doc'));
SELECT title FROM tab WHERE tab_idx_col @@ to_tsquery(' journal ');
4.2. Notes
1) ftutilx depends on the jre-1.8.0 runtime environment. After deployment, set the LD_LIBRARY_PATH system environment variable so that it includes the path to the libjvm.so of jre-1.8.0.
2) The ftutilx.max_string_length parameter configures the maximum length of the extraction result. However, tsvector currently supports at most (1M-1), so when extracttext is combined with to_tsvector, the size of the word segmentation result cannot exceed (1M-1).
3) ftutilx needs to create a JVM, and the JVM occupies a considerable amount of memory. Lowering -Xmx in ftutilx.jvm_option_string limits the JVM's memory footprint, but a -Xmx value that is too small causes the JVM to throw an out-of-memory exception when parsing large files.
4) With the combined full-text search schemes above, on a system with little memory you need to limit the number of session processes that insert data in parallel, so that system memory is not exhausted.
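Note 4 can be enforced on the client side by capping how many inserting sessions run at once. The following is a minimal sketch under stated assumptions, not a KingbaseES API: load_documents and its insert_document callback are hypothetical names, and in real use the callback would open its own database session and run the INSERT with blob_import shown earlier. Here a stand-in callback only records how many calls overlap, so the cap can be observed without a database.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def load_documents(paths, insert_document, max_sessions=2):
    """Call insert_document(path) for every path, with at most max_sessions calls running at once."""
    with ThreadPoolExecutor(max_workers=max_sessions) as pool:
        # list() waits for all workers and re-raises the first exception, if any.
        return list(pool.map(insert_document, paths))


def make_fake_insert():
    """Build a stand-in for a real insert that tracks peak concurrency."""
    lock = threading.Lock()
    state = {"active": 0, "peak": 0}

    def fake_insert(path):
        with lock:
            state["active"] += 1
            state["peak"] = max(state["peak"], state["active"])
        time.sleep(0.01)  # stands in for the slow extraction done during the insert
        with lock:
            state["active"] -= 1
        return path

    return fake_insert, state


if __name__ == "__main__":
    fake_insert, state = make_fake_insert()
    paths = ["/home/test/data%d.doc" % i for i in range(8)]  # hypothetical file paths
    load_documents(paths, fake_insert, max_sessions=2)
    print("peak concurrent inserts:", state["peak"])
```

A fixed-size worker pool is used here because each worker maps naturally to one session, and therefore to at most one JVM. A suitable max_sessions value depends on the -Xmx setting and the memory available on the database host.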
5. Uninstalling the plug-in
DROP EXTENSION ftutilx;