当前位置:网站首页>4. Sensitive word filtering (prefix tree)
4. Sensitive word filtering (prefix tree)
2022-07-31 02:40:00 【A snicker】

The prefix tree root node is empty,Other nodes contain only one character.
from the root node to a node,Connect each path,is the string of the current node.
例如:Sensitive words form a prefix tree
3个指针:1指针指向根节点,2The pointer keeps moving forward,3Pointer round trip
1、定义前缀树
The data structure of the prefix tree,util包下的SensitiveFilter.java.
主要有两个成员变量,Keyword end marker and child nodes,Created in the node's data structurehashmapUsed to store structures.
private class TrieNode {
// Keyword end marker
private boolean isKeywordEnd = false;
// 子节点(key是下级字符,value是下级节点)
private Map<Character, TrieNode> subNodes = new HashMap<>(16);
public boolean isKeywordEnd() {
return isKeywordEnd;
}
public void setKeywordEnd(boolean keywordEnd) {
isKeywordEnd = keywordEnd;
}
/** * 添加子节点 * * @param c 字符 * @param node 前缀树节点 */
public void addSubNode(Character c, TrieNode node) {
subNodes.put(c, node);
}
/** * 获取子节点 * * @param c 字符 * @return 子节点的引用 */
public TrieNode getSubNode(Character c) {
return subNodes.get(c);
}
}
2、根据敏感词初始化前缀树
Read all sensitive words in the sensitive word file,Construct a prefix tree composed of sensitive words.Every time a sensitive word is read,added to the prefix tree.
The operation of adding a sensitive word to the prefix tree:Create a new child node first,Then go through this nodehashmap的put操作,Add child nodes to this nodemap中.例如,添加abc就需要新建a节点,根节点put(a,a节点),新建b节点…,新建c节点…;添加完abcafter sensitive words,再添加abd时,Just keep looping until b节点后,新建c节点,map对应关系即可.
@Component
public class SensitiveFilter {
private static final Logger LOGGER = LoggerFactory.getLogger(SensitiveFilter.class);
/** * 替换符 */
private static final String REPLACEMENT = "***";
/** * 根节点 */
private TrieNode rootNode = new TrieNode();
/** * 程序启动时,Or initialized when it is called for the first time,只需要初始化一次, * After the container is instantiated,调用构造器,这个方法就会被调用, * 这个beanInitialized when the service starts,So this method is when the service starts * 就会被调用. */
@PostConstruct
public void init() {
long start = System.currentTimeMillis();
try (
InputStream is = this.getClass().getClassLoader()
.getResourceAsStream("sensitive-words.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(is))
) {
String keyword;
while ((keyword = reader.readLine()) != null) {
// 添加到前缀树
this.addKeyword(keyword);
}
} catch (IOException e) {
LOGGER.error("加载敏感词文件失败:" + e.getMessage());
e.printStackTrace();
}
LOGGER.info("Build prefix tree time:" + String.valueOf(System.currentTimeMillis() - start) + "ms");
}
/** * 将一个敏感词添加到前缀树中 * * @param keyword */
private void addKeyword(String keyword) {
TrieNode tempNode = rootNode;
for (int i = 0; i < keyword.length(); i++) {
char c = keyword.charAt(i);
TrieNode subNode = tempNode.getSubNode(c);
if (subNode == null) {
subNode = new TrieNode();
tempNode.addSubNode(c, subNode);
}
// 指向子节点,进入下一轮循环
tempNode = subNode;
// 设置结束标志
if (i == keyword.length() - 1) {
tempNode.setKeywordEnd(true);
}
}
}
private boolean isSymbol(char c) {
// 0x2E80 到 0x9FFFfor East Asian script
return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
}
}
3、过滤敏感词
Filter sensitive words of a string,如果有,则转化为***.Create a new string here,Not sensitive words,直接append,is a sensitive word,append().
指针1为根节点,指针2一直往前走,指针3往返,新建一个字符串.
如果遇到符号,Symbol againbegin位置,则指针2、3都加一,否则指针3加一.
begin不动,position不断增加.
If a complete sensitive word is encountered(That is, the flag bit is true),将position-begin替换为*,added to the new string,然后begin = position+1.
If there are no sensitive words,Add that character to the new string,begin = position + 1.
/** * 过滤敏感词 * <p> * For sensitive words with symbols, the symbols are removed first * * @param text 待过滤的文本 * @return 过滤后的文本 */
public String filter(String text) {
if (StringUtils.isBlank(text)) {
return null;
}
// 指针1
TrieNode tempNode = rootNode;
// 指针2
int begin = 0;
// 指针3
int position = 0;
// 结果
StringBuilder sb = new StringBuilder();
while (begin < text.length()) {
if (position < text.length()) {
char c = text.charAt(position);
// 跳过符号
if (isSymbol(c)) {
// 若指针1处于根节点,将此符号计入结果,让指针2向下走一步
if (tempNode == rootNode) {
sb.append(c);
begin++;
}
// Whether the symbol is at the beginning or in the middle,指针3都向下走一步
position++;
continue;
}
// 检查下级节点
tempNode = tempNode.getSubNode(c);
if (tempNode == null) {
// 以begin开头的字符串不是敏感词
sb.append(text.charAt(begin));
// 进入下一个位置
position = ++begin;
// 重新指向根节点
tempNode = rootNode;
} else if (tempNode.isKeywordEnd()) {
// 发现敏感词,将begin- position字符串替换掉
sb.append(REPLACEMENT);
// 进入下一个位置
begin = ++position;
// 重新指向根节点
tempNode = rootNode;
} else {
// 检查下一个字符
position++;
}
}
// position遍历越界仍未匹配到敏感词
else {
sb.append(text.charAt(begin));
position = ++begin;
tempNode = rootNode;
}
}
return sb.toString();
}
4、难点
Encounter sensitive words forfabcd和abc,The last segment of the string to be detected isfabc
It turned out to be a pointer3position作为循环判断条件,此时指针2为f,pointer after the loop3为c,c不是fabcdThe sensitive word flag bit node of ,So jump right out,没有检测fcharacters after begin的情况.
So take pointers2begin作为循环条件.
边栏推荐
猜你喜欢

知识蒸馏7:知识蒸馏代码详解

自动化办公案例:如何自动生成期数据?

二层广播风暴(产生原因+判断+解决)

Tower of Hanoi problem

Unity界面总体介绍

Discourse 自定义头部链接(Custom Header Links)

7. List of private messages

Fiddler captures packets to simulate weak network environment testing

mycat的主从关系 垂直分库 水平分表 以及mycat分片联表查询的配置详解(mysql5.7系列)

The final exam first year course
随机推荐
【C语言基础】解决C语言error: expected ‘;‘, ‘,‘ or ‘)‘ before ‘&‘ token
The difference between link and @import
MPPT太阳能充放电控制器数据采集-通过网关采集电池电压容量电量SOC,wifi传输
全流程调度——MySQL与Sqoop
基于opencv实现人脸检测
10 权限介绍
tcp框架需要解决的问题
Brute Force/Adjacency Matrix Breadth First Directed Weighted Graph Undirected Weighted Graph
12 磁盘相关命令
共模电感的仿真应用来了,满满的干货送给大家
Software accumulation -- Screenshot software ScreenToGif
Inter-vlan routing + static routing + NAT (PAT + static NAT) comprehensive experiment
【Android】Room —— SQLite的替代品
SQL注入 Less46(order by后的注入+rand()布尔盲注)
工程(五)——小目标检测tph-yolov5
拒绝加班,程序员开发的效率工具集
Validate XML documents
Intranet Infiltration - Privilege Escalation
ShardingJDBC基本介绍
There is a problem with the multiplayer-hlap package and the solution cannot be upgraded