当前位置:网站首页>4. Sensitive word filtering (prefix tree)
4. Sensitive word filtering (prefix tree)
2022-07-31 02:40:00 【A snicker】

The prefix tree root node is empty,Other nodes contain only one character.
from the root node to a node,Connect each path,is the string of the current node.
例如:Sensitive words form a prefix tree
3个指针:1指针指向根节点,2The pointer keeps moving forward,3Pointer round trip
1、定义前缀树
The data structure of the prefix tree,util包下的SensitiveFilter.java.
主要有两个成员变量,Keyword end marker and child nodes,Created in the node's data structurehashmapUsed to store structures.
private class TrieNode {
// Keyword end marker
private boolean isKeywordEnd = false;
// 子节点(key是下级字符,value是下级节点)
private Map<Character, TrieNode> subNodes = new HashMap<>(16);
public boolean isKeywordEnd() {
return isKeywordEnd;
}
public void setKeywordEnd(boolean keywordEnd) {
isKeywordEnd = keywordEnd;
}
/** * 添加子节点 * * @param c 字符 * @param node 前缀树节点 */
public void addSubNode(Character c, TrieNode node) {
subNodes.put(c, node);
}
/** * 获取子节点 * * @param c 字符 * @return 子节点的引用 */
public TrieNode getSubNode(Character c) {
return subNodes.get(c);
}
}
2、根据敏感词初始化前缀树
Read all sensitive words in the sensitive word file,Construct a prefix tree composed of sensitive words.Every time a sensitive word is read,added to the prefix tree.
The operation of adding a sensitive word to the prefix tree:Create a new child node first,Then go through this nodehashmap的put操作,Add child nodes to this nodemap中.例如,添加abc就需要新建a节点,根节点put(a,a节点),新建b节点…,新建c节点…;添加完abcafter sensitive words,再添加abd时,Just keep looping until b节点后,新建c节点,map对应关系即可.
@Component
public class SensitiveFilter {
private static final Logger LOGGER = LoggerFactory.getLogger(SensitiveFilter.class);
/** * 替换符 */
private static final String REPLACEMENT = "***";
/** * 根节点 */
private TrieNode rootNode = new TrieNode();
/** * 程序启动时,Or initialized when it is called for the first time,只需要初始化一次, * After the container is instantiated,调用构造器,这个方法就会被调用, * 这个beanInitialized when the service starts,So this method is when the service starts * 就会被调用. */
@PostConstruct
public void init() {
long start = System.currentTimeMillis();
try (
InputStream is = this.getClass().getClassLoader()
.getResourceAsStream("sensitive-words.txt");
BufferedReader reader = new BufferedReader(new InputStreamReader(is))
) {
String keyword;
while ((keyword = reader.readLine()) != null) {
// 添加到前缀树
this.addKeyword(keyword);
}
} catch (IOException e) {
LOGGER.error("加载敏感词文件失败:" + e.getMessage());
e.printStackTrace();
}
LOGGER.info("Build prefix tree time:" + String.valueOf(System.currentTimeMillis() - start) + "ms");
}
/** * 将一个敏感词添加到前缀树中 * * @param keyword */
private void addKeyword(String keyword) {
TrieNode tempNode = rootNode;
for (int i = 0; i < keyword.length(); i++) {
char c = keyword.charAt(i);
TrieNode subNode = tempNode.getSubNode(c);
if (subNode == null) {
subNode = new TrieNode();
tempNode.addSubNode(c, subNode);
}
// 指向子节点,进入下一轮循环
tempNode = subNode;
// 设置结束标志
if (i == keyword.length() - 1) {
tempNode.setKeywordEnd(true);
}
}
}
private boolean isSymbol(char c) {
// 0x2E80 到 0x9FFFfor East Asian script
return !CharUtils.isAsciiAlphanumeric(c) && (c < 0x2E80 || c > 0x9FFF);
}
}
3、过滤敏感词
Filter sensitive words of a string,如果有,则转化为***.Create a new string here,Not sensitive words,直接append,is a sensitive word,append().
指针1为根节点,指针2一直往前走,指针3往返,新建一个字符串.
如果遇到符号,Symbol againbegin位置,则指针2、3都加一,否则指针3加一.
begin不动,position不断增加.
If a complete sensitive word is encountered(That is, the flag bit is true),将position-begin替换为*,added to the new string,然后begin = position+1.
If there are no sensitive words,Add that character to the new string,begin = position + 1.
/** * 过滤敏感词 * <p> * For sensitive words with symbols, the symbols are removed first * * @param text 待过滤的文本 * @return 过滤后的文本 */
public String filter(String text) {
if (StringUtils.isBlank(text)) {
return null;
}
// 指针1
TrieNode tempNode = rootNode;
// 指针2
int begin = 0;
// 指针3
int position = 0;
// 结果
StringBuilder sb = new StringBuilder();
while (begin < text.length()) {
if (position < text.length()) {
char c = text.charAt(position);
// 跳过符号
if (isSymbol(c)) {
// 若指针1处于根节点,将此符号计入结果,让指针2向下走一步
if (tempNode == rootNode) {
sb.append(c);
begin++;
}
// Whether the symbol is at the beginning or in the middle,指针3都向下走一步
position++;
continue;
}
// 检查下级节点
tempNode = tempNode.getSubNode(c);
if (tempNode == null) {
// 以begin开头的字符串不是敏感词
sb.append(text.charAt(begin));
// 进入下一个位置
position = ++begin;
// 重新指向根节点
tempNode = rootNode;
} else if (tempNode.isKeywordEnd()) {
// 发现敏感词,将begin- position字符串替换掉
sb.append(REPLACEMENT);
// 进入下一个位置
begin = ++position;
// 重新指向根节点
tempNode = rootNode;
} else {
// 检查下一个字符
position++;
}
}
// position遍历越界仍未匹配到敏感词
else {
sb.append(text.charAt(begin));
position = ++begin;
tempNode = rootNode;
}
}
return sb.toString();
}
4、难点
Encounter sensitive words forfabcd和abc,The last segment of the string to be detected isfabc
It turned out to be a pointer3position作为循环判断条件,此时指针2为f,pointer after the loop3为c,c不是fabcdThe sensitive word flag bit node of ,So jump right out,没有检测fcharacters after begin的情况.
So take pointers2begin作为循环条件.
边栏推荐
- 15. Website Statistics
- Maximum area of solar panel od js
- 【Bank Series Phase 1】People's Bank of China
- Live Preview | KDD2022 Doctoral Dissertation Award Champion and Runner-up Dialogue
- 开题报告之论文框架
- 8、统一处理异常(控制器通知@ControllerAdvice全局配置类、@ExceptionHandler统一处理异常)
- What level of software testing does it take to get a 9K job?
- 如何搭建私有yum源
- mmdetection trains a model related command
- AI software development process in medical imaging field
猜你喜欢

8. Unified exception handling (controller notifies @ControllerAdvice global configuration class, @ExceptionHandler handles exceptions uniformly)

16. Registration Center-consul

Classic linked list OJ strong training problem - fast and slow double pointer efficient solution

Mathematics to solve the problem - circular linked list

To write good test cases, you must first learn test design

The whole process scheduling, MySQL and Sqoop

221. Largest Square

Live Preview | KDD2022 Doctoral Dissertation Award Champion and Runner-up Dialogue

全流程调度——MySQL与Sqoop

跨专业考研难度大?“上岸”成功率低?这份实用攻略请收下!
随机推荐
Introduction and use of Drools WorkBench
12 磁盘相关命令
mysql index
Clustering index, and what is the difference between a clustering index
User interaction + formatted output
Difference between CMOS and TTL?
Tower of Hanoi problem
7. List of private messages
英特尔软硬优化,赋能东软加速智慧医疗时代到来
Project development software directory structure specification
Basic introduction to ShardingJDBC
SQL注入 Less47(报错注入) 和Less49(时间盲注)
字体压缩神器font-spider的使用
关于 mysql8.0数据库中主键位id,使用replace插入id为0时,实际id插入后自增导致数据重复插入 的解决方法
全流程调度——MySQL与Sqoop
Intranet Infiltration - Privilege Escalation
Why is String immutable?
try-catch中含return
The application of AI in the whole process of medical imaging equipment
Uninstallation of mysql5.7.37 under CentOS7 [perfect solution]