当前位置:网站首页>4. Sensitive word filtering (prefix tree)
4. Sensitive word filtering (prefix tree)
2022-07-31 02:40:00 【A snicker】
The prefix tree root node is empty,Other nodes contain only one character.
from the root node to a node,Connect each path,is the string of the current node.
例如:Sensitive words form a prefix tree
3个指针:1指针指向根节点,2The pointer keeps moving forward,3Pointer round trip
1、定义前缀树
The data structure of the prefix tree,util包下的SensitiveFilter.java.
主要有两个成员变量,Keyword end marker and child nodes,Created in the node's data structurehashmapUsed to store structures.
private class TrieNode {
// Keyword end marker
private boolean isKeywordEnd = false;
// 子节点(key是下级字符,value是下级节点)
private Map<Character, TrieNode> subNodes = new HashMap<>(16);
public boolean isKeywordEnd() {
return isKeywordEnd;
}
public void setKeywordEnd(boolean keywordEnd) {
isKeywordEnd = keywordEnd;
}
/** * 添加子节点 * * @param c 字符 * @param node 前缀树节点 */
public void addSubNode(Character c, TrieNode node) {
subNodes.put(c, node);
}
/** * 获取子节点 * * @param c 字符 * @return 子节点的引用 */
public TrieNode getSubNode(Character c) {
return subNodes.get(c);
}
}
2、根据敏感词初始化前缀树
Read all sensitive words in the sensitive word file,Construct a prefix tree composed of sensitive words.Every time a sensitive word is read,added to the prefix tree.
The operation of adding a sensitive word to the prefix tree:Create a new child node first,Then go through this nodehashmap的put操作,Add child nodes to this nodemap中.例如,添加abc就需要新建a节点,根节点put(a,a节点),新建b节点…,新建c节点…;添加完abcafter sensitive words,再添加abd时,Just keep looping until b节点后,新建c节点,map对应关系即可.
@Component
public class SensitiveFilter {
private static final Logger LOGGER = LoggerFactory.getLogger(SensitiveFilter.class);
/** * 替换符 */
private static final String REPLACEMENT = "***";
/** * 根节点 */
private TrieNode rootNode = new TrieNode();
/** * 程序启动时,Or initialized when it is called for the first time,只需要初始化一次, * After the container is instantiated,调用构造器,这个方法就会被调用, * 这个beanInitialized when the service starts,So this method is when the service starts * 就会被调用. */
@PostConstruct
public void init() {
long start = System.currentTimeMillis();
try (
InputStream is = this.getClass().getClassLoader()
.getResourceAsStream("sensitive-words.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(is))
) {
String keyword;
while ((keyword = reader.readLine()) != null) {
// 添加到前缀树
this.addKeyword(keyword);
}
} catch (IOException e) {
LOGGER.error("加载敏感词文件失败:" + e.getMessage());
e.printStackTrace();
}
LOGGER.info("Build prefix tree time:" + String.valueOf(System.currentTimeMillis() - start) + "ms");
}
/** * 将一个敏感词添加到前缀树中 * * @param keyword */
private void addKeyword(String keyword) {
TrieNode tempNode = rootNode;
for (int i = 0; i < keyword.length(); i++) {
char c = keyword.charAt(i);
TrieNode subNode = tempNode.getSubNode(c);
if (subNode == null) {
subNode = new TrieNode();
tempNode.addSubNode(c, subNode);
}
// 指向子节点,进入下一轮循环
tempNode = subNode;
// 设置结束标志
if (i == keyword.length() - 1) {
tempNode.setKeywordEnd(true);
}
}
}
private boolean isSymbol(char c) {
// 0x2E80 到 0x9FFFfor East Asian script
return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
}
}
3、过滤敏感词
Filter sensitive words of a string,如果有,则转化为***.Create a new string here,Not sensitive words,直接append,is a sensitive word,append().
指针1为根节点,指针2一直往前走,指针3往返,新建一个字符串.
如果遇到符号,Symbol againbegin位置,则指针2、3都加一,否则指针3加一.
begin不动,position不断增加.
If a complete sensitive word is encountered(That is, the flag bit is true),将position-begin替换为*,added to the new string,然后begin = position+1.
If there are no sensitive words,Add that character to the new string,begin = position + 1.
/** * 过滤敏感词 * <p> * For sensitive words with symbols, the symbols are removed first * * @param text 待过滤的文本 * @return 过滤后的文本 */
public String filter(String text) {
if (StringUtils.isBlank(text)) {
return null;
}
// 指针1
TrieNode tempNode = rootNode;
// 指针2
int begin = 0;
// 指针3
int position = 0;
// 结果
StringBuilder sb = new StringBuilder();
while (begin < text.length()) {
if (position < text.length()) {
char c = text.charAt(position);
// 跳过符号
if (isSymbol(c)) {
// 若指针1处于根节点,将此符号计入结果,让指针2向下走一步
if (tempNode == rootNode) {
sb.append(c);
begin++;
}
// Whether the symbol is at the beginning or in the middle,指针3都向下走一步
position++;
continue;
}
// 检查下级节点
tempNode = tempNode.getSubNode(c);
if (tempNode == null) {
// 以begin开头的字符串不是敏感词
sb.append(text.charAt(begin));
// 进入下一个位置
position = ++begin;
// 重新指向根节点
tempNode = rootNode;
} else if (tempNode.isKeywordEnd()) {
// 发现敏感词,将begin- position字符串替换掉
sb.append(REPLACEMENT);
// 进入下一个位置
begin = ++position;
// 重新指向根节点
tempNode = rootNode;
} else {
// 检查下一个字符
position++;
}
}
// position遍历越界仍未匹配到敏感词
else {
sb.append(text.charAt(begin));
position = ++begin;
tempNode = rootNode;
}
}
return sb.toString();
}
4、难点
Encounter sensitive words forfabcd和abc,The last segment of the string to be detected isfabc
It turned out to be a pointer3position作为循环判断条件,此时指针2为f,pointer after the loop3为c,c不是fabcdThe sensitive word flag bit node of ,So jump right out,没有检测fcharacters after begin的情况.
So take pointers2begin作为循环条件.
边栏推荐
猜你喜欢
Introduction and use of Drools WorkBench
Discourse 自定义头部链接(Custom Header Links)
Drools Rule Properties, Advanced Syntax
基于opencv实现人脸检测
8. Unified exception handling (controller notifies @ControllerAdvice global configuration class, @ExceptionHandler handles exceptions uniformly)
How to design the changing system requirements
AI在医疗影像设备全流程应用
mmdetection训练一个模型相关命令
软件积累 -- 截图软件ScreenToGif
8、统一处理异常(控制器通知@ControllerAdvice全局配置类、@ExceptionHandler统一处理异常)
随机推荐
How to do a startup CTO?
CentOS7下mysql5.7.37的安装【完美方案】
BAT can't sell "Medical Cloud": Hospitals flee, mountains stand, and there are rules
LeetCode 每日一题 2022/7/25-2022/7/31
Teach you how to configure Jenkins automated email notifications
YOLOV5学习笔记(三)——网络模块详解
StringJoiner详解
The application of AI in the whole process of medical imaging equipment
Face detection based on opencv
Draw Your Cards
f.grid_sample
静态路由+PAT+静态NAT(讲解+实验)
【CV项目调试】CUDNN_CONVOLUTION_FWD_SPECIFY_WORKSPACE_LIMIT问题
Mathematics to solve the problem - circular linked list
BAT卖不动「医疗云」:医院逃离、山头林立、行有行规
多线程下类对象的服务承诺探讨
Detailed explanation of STP election (step + case)
图解lower_bound&upper_bound
[1154] How to convert string to datetime
To write good test cases, you must first learn test design