当前位置:网站首页>4. Sensitive word filtering (prefix tree)
4. Sensitive word filtering (prefix tree)
2022-07-31 02:40:00 【A snicker】
The prefix tree root node is empty,Other nodes contain only one character.
from the root node to a node,Connect each path,is the string of the current node.
例如:Sensitive words form a prefix tree
3个指针:1指针指向根节点,2The pointer keeps moving forward,3Pointer round trip
1、定义前缀树
The data structure of the prefix tree,util包下的SensitiveFilter.java.
主要有两个成员变量,Keyword end marker and child nodes,Created in the node's data structurehashmapUsed to store structures.
private class TrieNode {
// Keyword end marker
private boolean isKeywordEnd = false;
// 子节点(key是下级字符,value是下级节点)
private Map<Character, TrieNode> subNodes = new HashMap<>(16);
public boolean isKeywordEnd() {
return isKeywordEnd;
}
public void setKeywordEnd(boolean keywordEnd) {
isKeywordEnd = keywordEnd;
}
/** * 添加子节点 * * @param c 字符 * @param node 前缀树节点 */
public void addSubNode(Character c, TrieNode node) {
subNodes.put(c, node);
}
/** * 获取子节点 * * @param c 字符 * @return 子节点的引用 */
public TrieNode getSubNode(Character c) {
return subNodes.get(c);
}
}
2、根据敏感词初始化前缀树
Read all sensitive words in the sensitive word file,Construct a prefix tree composed of sensitive words.Every time a sensitive word is read,added to the prefix tree.
The operation of adding a sensitive word to the prefix tree:Create a new child node first,Then go through this nodehashmap的put操作,Add child nodes to this nodemap中.例如,添加abc就需要新建a节点,根节点put(a,a节点),新建b节点…,新建c节点…;添加完abcafter sensitive words,再添加abd时,Just keep looping until b节点后,新建c节点,map对应关系即可.
@Component
public class SensitiveFilter {
private static final Logger LOGGER = LoggerFactory.getLogger(SensitiveFilter.class);
/** * 替换符 */
private static final String REPLACEMENT = "***";
/** * 根节点 */
private TrieNode rootNode = new TrieNode();
/** * 程序启动时,Or initialized when it is called for the first time,只需要初始化一次, * After the container is instantiated,调用构造器,这个方法就会被调用, * 这个beanInitialized when the service starts,So this method is when the service starts * 就会被调用. */
@PostConstruct
public void init() {
long start = System.currentTimeMillis();
try (
InputStream is = this.getClass().getClassLoader()
.getResourceAsStream("sensitive-words.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(is))
) {
String keyword;
while ((keyword = reader.readLine()) != null) {
// 添加到前缀树
this.addKeyword(keyword);
}
} catch (IOException e) {
LOGGER.error("加载敏感词文件失败:" + e.getMessage());
e.printStackTrace();
}
LOGGER.info("Build prefix tree time:" + String.valueOf(System.currentTimeMillis() - start) + "ms");
}
/** * 将一个敏感词添加到前缀树中 * * @param keyword */
private void addKeyword(String keyword) {
TrieNode tempNode = rootNode;
for (int i = 0; i < keyword.length(); i++) {
char c = keyword.charAt(i);
TrieNode subNode = tempNode.getSubNode(c);
if (subNode == null) {
subNode = new TrieNode();
tempNode.addSubNode(c, subNode);
}
// 指向子节点,进入下一轮循环
tempNode = subNode;
// 设置结束标志
if (i == keyword.length() - 1) {
tempNode.setKeywordEnd(true);
}
}
}
private boolean isSymbol(char c) {
// 0x2E80 到 0x9FFFfor East Asian script
return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
}
}
3、过滤敏感词
Filter sensitive words of a string,如果有,则转化为***.Create a new string here,Not sensitive words,直接append,is a sensitive word,append().
指针1为根节点,指针2一直往前走,指针3往返,新建一个字符串.
如果遇到符号,Symbol againbegin位置,则指针2、3都加一,否则指针3加一.
begin不动,position不断增加.
If a complete sensitive word is encountered(That is, the flag bit is true),将position-begin替换为*,added to the new string,然后begin = position+1.
If there are no sensitive words,Add that character to the new string,begin = position + 1.
/** * 过滤敏感词 * <p> * For sensitive words with symbols, the symbols are removed first * * @param text 待过滤的文本 * @return 过滤后的文本 */
public String filter(String text) {
if (StringUtils.isBlank(text)) {
return null;
}
// 指针1
TrieNode tempNode = rootNode;
// 指针2
int begin = 0;
// 指针3
int position = 0;
// 结果
StringBuilder sb = new StringBuilder();
while (begin < text.length()) {
if (position < text.length()) {
char c = text.charAt(position);
// 跳过符号
if (isSymbol(c)) {
// 若指针1处于根节点,将此符号计入结果,让指针2向下走一步
if (tempNode == rootNode) {
sb.append(c);
begin++;
}
// Whether the symbol is at the beginning or in the middle,指针3都向下走一步
position++;
continue;
}
// 检查下级节点
tempNode = tempNode.getSubNode(c);
if (tempNode == null) {
// 以begin开头的字符串不是敏感词
sb.append(text.charAt(begin));
// 进入下一个位置
position = ++begin;
// 重新指向根节点
tempNode = rootNode;
} else if (tempNode.isKeywordEnd()) {
// 发现敏感词,将begin- position字符串替换掉
sb.append(REPLACEMENT);
// 进入下一个位置
begin = ++position;
// 重新指向根节点
tempNode = rootNode;
} else {
// 检查下一个字符
position++;
}
}
// position遍历越界仍未匹配到敏感词
else {
sb.append(text.charAt(begin));
position = ++begin;
tempNode = rootNode;
}
}
return sb.toString();
}
4、难点
Encounter sensitive words forfabcd和abc,The last segment of the string to be detected isfabc
It turned out to be a pointer3position作为循环判断条件,此时指针2为f,pointer after the loop3为c,c不是fabcdThe sensitive word flag bit node of ,So jump right out,没有检测fcharacters after begin的情况.
So take pointers2begin作为循环条件.
边栏推荐
- 完整复制虚拟机原理(云计算)
- 公司官网建站笔记(六):域名进行公安备案并将备案号显示在网页底部
- Software Testing Defect Reporting - Definition, Composition, Defect Lifecycle, Defect Tracking Post-Production Process, Defect Tracking Process, Purpose of Defect Tracking, Defect Management Tools
- coldfusion8 background scheduled tasks take shell
- 15、网站统计数据
- Brute Force/Adjacency List Breadth First Directed Weighted Graph Undirected Weighted Graph
- 系统需求多变如何设计
- 11、Redis实现关注、取消关注以及关注和粉丝列表
- 16、热帖排行
- 基于opencv实现人脸检测
猜你喜欢
Linux下redis7的安装,启动与停止
LeetCode 1161 The largest element in the layer and the LeetCode road of [BFS binary tree] HERODING
局域网电脑硬件信息收集工具
General introduction to the Unity interface
10 权限介绍
How to do a startup CTO?
Software testing basic interface testing - getting started with Jmeter, you should pay attention to these things
共模电感的仿真应用来了,满满的干货送给大家
STM32CUBEMX开发GD32F303(11)----ADC在DMA模式下扫描多个通道
Detailed explanation of STP election (step + case)
随机推荐
YOLOV5学习笔记(二)——环境安装+运行+训练
基于opencv实现人脸检测
4、敏感词过滤(前缀树)
Classic linked list OJ strong training problem - fast and slow double pointer efficient solution
【C语言基础】解决C语言error: expected ‘;‘, ‘,‘ or ‘)‘ before ‘&‘ token
Unity3D Button 鼠标悬浮进入与鼠标悬浮退出按钮事件
BAT卖不动「医疗云」:医院逃离、山头林立、行有行规
How to do a startup CTO?
Static route analysis (the longest mask matching principle + active and standby routes)
Go 项目实战-获取多级分类下的全部商品
YOLOV5学习笔记(三)——网络模块详解
How to build a private yum source
Refuse to work overtime, a productivity tool set developed by programmers
[1154] How to convert string to datetime
FPGA-based vending machine
The comprehensive result of the case statement, do you know it?[Verilog Advanced Tutorial]
JetPack组件Databinding
11、Redis实现关注、取消关注以及关注和粉丝列表
【Android】Room —— SQLite的替代品
自动化办公案例:如何自动生成期数据?