当前位置:网站首页>Compress words (kmp/ string hash, double hash)
Compress words (kmp/ string hash, double hash)
2022-07-02 16:00:00 【Selvaggia】
compress words
- Connect several words , Avoid repeating fields at the end of a word and the beginning of the next word
- kmp Find the longest prefix
- String double hash
- Ideas :
- Question 1
- Question two
- Question 3
- Question 4
- Question five
- Question 6
- Question seven
- Code
- Double hash (So easy, First write a single hash , Then one by one m o d − > m o d [ k ] , b a s − > b a s [ k ] , h [ i ] − > h [ k ] [ i ] mod->mod[k],bas->bas[k],h[i]->h[k][i] mod−>mod[k],bas−>bas[k],h[i]−>h[k][i], When calculating the hash value, use a k loop , It should be noted that when judging the equality of strings, the hash value under each hash function must meet the conditions
Topic link
This is good , Huo , As soon as you open this , The catalogue is amazing , There are so many questions in one question , obviously , Someone's IQ is worrying
Connect several words , Avoid repeating fields at the end of a word and the beginning of the next word
kmp Find the longest prefix
Deliver it to coedforce On ,next Array error , It should be the same name , To be on the safe side, change to nex
Or right kmp The understanding of the algorithm is not deep enough , When the match is successful j=le,
You don't succeed , One is inequality , After all, the main string ratio is over, but it can't match all
Just look at the last one j Indicates where to match
If the last character of the main string fails to match s[0] Quite successful ,j It will become -1, Then add one 0
Imagine the matching process
abcdefgh
fgheg
#include <iostream>
#include <math.h>
using namespace std;
#include <string>
const int N=1e6+10;
string t,s;
int len;// The length of the subsequent input string
int nex[N];
void getNext(){
int i=0;
int j=-1;
nex[0]=-1;
while(i<len){
if(j==-1||s[i]==s[j]){
i++;
j++;
nex[i]=j;
}
else j=nex[j];
}
}
int kmp(int st){
int i=st,j=0;
int le=t.size();
while(i<le){
while(j!=-1&&t[i]!=s[j])j=nex[j];
i++;j++;
}
// while(i<le){
// if(j==-1||t[i]==s[j]){
// i++;
// j++;
// }
// else j=next[j];
// }
return j;
// Or right kmp The understanding of the algorithm is not deep enough , When the match is successful j=le,
// You don't succeed , One is inequality , After all, the main string ratio is over, but it can't match all
// Just look at the last one j Indicates where to match
// If the last character of the main string fails to match s[0] Quite successful ,j It will become -1, Then add one 0
// Imagine the matching process
// abcdefgh
// fgheg
}
int main(){
int n;
cin>>n;
cin>>t;
for(int i=1;i<n;i++){
cin>>s;
len=s.size();
getNext();
int st;// The starting matching position of the formed string , There is no need to go back and forth from 0 Start
st=t.size()-len;
st=max(0,st);
int pos=kmp(st);
t+=s.substr(pos,len-pos+1);
}
cout<<t;
return 0;
}
String double hash
Ideas :
Take the first word as the desired string t
Then enter a word s Then calculate the string prefix hash of this word ,
from On heavy Stack Of Long degree yes m i n ( s t r l e n ( t ) , s t r l e n ( s ) ) , Because the length of overlap is min(strlen(t),strlen(s)), from On heavy Stack Of Long degree yes min(strlen(t),strlen(s)),
stay s The length of this section , Compare separately s Prefix hash value of and
t The hash value of the corresponding suffix segment , Be sure to s Add a paragraph of t, Further find t The hash value of the complete segment is repeated getH Get the hash value of the suffix segment
WC I didn't expect this problem to take me so long
A pile of problems :
Question 1
Continue to splice a string s s s To the original string t t t On , The total length data range is 1 e 6 1e6 1e6, In order to find the hash value of the existing length string , Repeated use s t r l e n ( t + 1 ) strlen(t+1) strlen(t+1)
n n n The range is 1 e 5 1e5 1e5,strlen The time complexity is O ( n ) O(n) O(n), The outer loop + Inside s t r l e n , complex miscellaneous degree O ( n 2 ) strlen, Complexity O(n2) strlen, complex miscellaneous degree O(n2), There are also... In the cycle strlen It's really overtime to despair
strlen
scanf("%s",t+1);
int st=1;
for(int j=1;j<n;j++){
scanf("%s",s+1);
int len=strlen(t+1);
int le=strlen(s+1);
for(int i=st;i<=len;i++){
int id=getID(t[i]);
h[i]=(h[i-1]*bas+id)%mod;
}
……
……
……
strlen It's just the work of a counter , It comes from somewhere in memory ( It can start with a string , Somewhere in the middle , Even some uncertain memory area ) Start scanning , Until you hit the first string Terminator ’\0’ until , Then return the counter value ( Length does not contain ’\0’).
We use it cin>>s and scanf(%s) When entering a string , After accepting the string, add a ’\0’, That is to say NULL As a string Terminator .
So use strlen Function to find the length of a string , The complexity is O(n). If you use... In a loop
for (int i = 0; i < strlen(s); ++i)// Complexity O(n2) That's trouble , It may time out , terms of settlement :
(1) The first strlen(s) First store it with variables , Then put it into the cycle ;
(2) Or use s[i] != ‘\0’;
Question two
1、
ll mod[2]={1e9+7,999999998}; There's no problem with the compiler , Deliver it to codeforce On ,compile error, The reason is that 1e9+7 Will count as double Floating point number of type , 1.000000000 ∗ 1 0 7 1.000000000*10^7 1.000000000∗107
ll mod[2]={1000000007,999999998};
2、 Similar to that ,cin>>s+1, The error report says that there is no overloaded operator operator
3、 And before next Duplicate name of
Question 3
Double hash or multiple hash , You must have the same hash value under each hash function to judge that the two strings are equal , It's a && The relationship between , therefore
st=len+1;
int pos=0;
ll H[2]={
0,0};
for(int i=1;i<=le&&i<=len;i++){
int id=getID(s[i]);
int ok=1;
for(int k=0;k<2;k++){
H[k]=(H[k]*bas[k]+id)%mod[k];//s character string s[1--i] This hash value
// ll temp=((h[k][len]-h[k][len-i]*p[k][i])%mod[k]+mod[k])%mod[k];
if(H[k]!=getH(len-i+1,len,k))ok=0;
}
if(ok)pos=i;//12345
}
As long as the hash value under a hash function is not equal , It cannot be counted as string equality
Question 4
H[2][N] Optimize to H[2] Well , It's a temporary variable , It is specially used to record the prefix hash value of the next word , Keep typing words , therefore H Remember to initialize the array every time , It can't be finished after the global variable is defined ( Initialize only once ), because ==H[k]=(H[k]*bas[k]+id)%mod[k];== We need to use the last H【k】 value , It's a bit like the optimization of rolling arrays
If you don't need to scroll array optimization , There is no need to initialize again ,
H[k][i]=(H[k][i-1]*bas[k]+id)%mod[k];
It takes space
for(int i=1;i<=le&&i<=len;i++){
int id=getID(s[i]);
int ok=1;
for(int k=0;k<2;k++){
H[k][i]=(H[k][i-1]*bas[k]+id)%mod[k];//s character string s[1--i] This hash value
// ll temp=((h[k][len]-h[k][len-i]*p[k][i])%mod[k]+mod[k])%mod[k];
if(H[k][i]!=getH(len-i+1,len,k))ok=0;
}
if(ok)pos=i;//12345
}
Question five
Use string hash Sure O ( n ) Handle ,O ( 1 ) Compare . Because each string will only be scanned once , It can be direct violence , Preprocessing O ( n ) , violence O ( n ) O(n) Find the longest identical part .
Because the string is as long as 1 0 6 10^6 106
Monomode hash I can't do it anymore , Use a safer pair hash, take 2 A larger modulus mod1 and mod2, And a prime number p (p Don't be too big , Single-mode p Not too big , Modulus must be large )
ll bas[2]={131,2333333};
ll mod[2]={1000000007,999999998};
1e5 It can be used P take 131,mod take 1<<64, adopt unsigned long long To get the mold Combine Single hash ,
Once it's here 1e6 Or honest and practical double hash or even three hash , In fact, it's not difficult , hold p Array ,mod Array ,H Array ,h Array , All become k Wei is good , One more cycle for(int j=0;j<k;k++)…… Just fine
It is recommended to take the above two hash functions P and mod Combine , strong , The latter group can even be used for single hash in this problem AC
Question 6
The upper and lower case letters in this question are treated as unequal , At first, I understood the opposite , It's useless for me to write a
int getID(char ch){//65 97
if(ch>=‘A’&&ch<=‘Z’)ch+=32;
return ch;
}
This way , But it is not necessary to ,P Base number P It's big enough ,48——65——97——122, It is completely possible to stagger and take different values , Direct use s[i] Just ok
int getID(char c){
if(c>=‘a’&&c<=‘z’) return c-‘a’;
if(c>=‘0’&&c<=‘9’) return c-‘0’+26;
return c-‘A’+36;
}
Question seven
Poor complexity analysis , At first, timeout was thought to be fast power sum initialization P The choice of power initialization of
Test certificate , This problem initializes P Power array p Come faster than fast power
Anyway initialization P Power array p Namely O(n) Complexity , Will not cause substantial timeout , Use this
Code
Single hash
2333333
999999998
Even 131 and unsigned long long The combination of will also be stuck
#include <iostream>
#include <string.h>
using namespace std;
typedef long long ll;
const int N=1e6+10;
char t[N];
char s[N];
ll bas=2333333;
ll mod=999999998;
ll h[N];
ll H[N];
ll p[N];
void init(){
p[0]=1;
for(int i=1;i<N;i++){
//N The highest power of bit is N-1
p[i]=(p[i-1]*bas)%mod;
}
}
ll getH(int l,int r){
return ((h[r]-h[l-1]*p[r-l+1])%mod+mod)%mod;
}
int main(){
init();
int n;
cin>>n;
scanf("%s",t+1);
int len=strlen(t+1);
int st=1;
for(int j=1;j<n;j++){
scanf("%s",s+1);
int le=strlen(s+1);
for(int i=st;i<=len;i++){
h[i]=(h[i-1]*bas+t[i])%mod;
}
st=len+1;
int pos=0;
for(int i=1;i<=le&&i<=len;i++){
H[i]=(H[i-1]*bas+s[i])%mod;//s character string s[1--i] This hash value
if(H[i]==getH(len-i+1,len))pos=i;//12345
}
for(int i=pos+1;i<=le;i++){
t[++len]=s[i];
}
}
printf("%s",t+1);
return 0;
}
Double hash (So easy, First write a single hash , Then one by one m o d − > m o d [ k ] , b a s − > b a s [ k ] , h [ i ] − > h [ k ] [ i ] mod->mod[k],bas->bas[k],h[i]->h[k][i] mod−>mod[k],bas−>bas[k],h[i]−>h[k][i], When calculating the hash value, use a k loop , It should be noted that when judging the equality of strings, the hash value under each hash function must meet the conditions
Using two-dimensional arrays is much better than writing two arrays , It can also be supplemented according to the errors reported by the compiler k
#include <iostream>
#include <string.h>
using namespace std;
typedef long long ll;
const int N=1e6+10;
char t[N];
char s[N];
ll bas[2]={
131,2333333};
ll mod[2]={
1000000007,999999998};
ll h[2][N];
ll H[2][N];
ll p[2][N];
void init(){
p[0][0]=p[1][0]=1;
for(int i=1;i<N;i++){
//N The highest power of bit is N-1
for(int k=0;k<2;k++)
p[k][i]=(p[k][i-1]*bas[k])%mod[k];
}
}
ll getH(int l,int r,int k){
return ((h[k][r]-h[k][l-1]*p[k][r-l+1])%mod[k]+mod[k])%mod[k];
}
int main(){
init();
int n;
cin>>n;
scanf("%s",t+1);
int len=strlen(t+1);
int st=1;
for(int j=1;j<n;j++){
scanf("%s",s+1);
int le=strlen(s+1);
for(int i=st;i<=len;i++){
for(int k=0;k<2;k++)
h[k][i]=(h[k][i-1]*bas[k]+t[i])%mod[k];
}
st=len+1;
int pos=0;
for(int i=1;i<=le&&i<=len;i++){
int ok=1;
for(int k=0;k<2;k++){
H[k][i]=(H[k][i-1]*bas[k]+s[i])%mod[k];//s character string s[1--i] This hash value
if(H[k][i]!=getH(len-i+1,len,k))ok=0;
}
if(ok)pos=i;//12345
}
for(int i=pos+1;i<=le;i++){
t[++len]=s[i];
}
}
printf("%s",t+1);
return 0;
}
Finally, you can use the idea of rolling array to optimize the space
H[2][N]–>H[2]
ll H[2]={
0,0};
for(int i=1;i<=le&&i<=len;i++){
int ok=1;
for(int k=0;k<2;k++){
H[k]=(H[k]*bas[k]+s[i])%mod[k];//s character string s[1--i] This hash value
if(H[k]!=getH(len-i+1,len,k))ok=0;
}
if(ok)pos=i;//12345
}
边栏推荐
- Flink real-time data warehouse (7): Flink realizes the full pull module to extract data in MySQL
- /Bin/ld: cannot find -lxslt
- mysql 计算经纬度范围内的数据
- 2020.4.12 byte written test questions B DP D monotone stack
- /bin/ld: 找不到 -lxml2
- [development environment] install Visual Studio Ultimate 2013 development environment (download software | install software | run software)
- Digital collection system development (program development) - Digital Collection 3D modeling economic model system development source code
- 又是一年毕业季
- (Wanzi essence knowledge summary) basic knowledge of shell script programming
- 图数据库|Nebula Graph v3.1.0 性能报告
猜你喜欢
《大学“电路分析基础”课程实验合集.实验五》丨线性有源二端网络等效电路的研究
数仓中的维度表与事实表
GraphX 图计算实践之模式匹配抽取特定子图
Huawei ECS installs mysqlb for mysqld service failed because the control process exited with error code. See “sys
Aike AI frontier promotion (7.2)
Armv8-a programming guide MMU (4)
智联招聘的基于 Nebula Graph 的推荐实践分享
The sea of stars hidden behind the nebula graph
动态规划入门二(5.647.62)
Traversal before, during and after binary tree
随机推荐
QVariant与Json的各种纠葛——Qt
Wise target detection 23 - pytoch builds SSD target detection platform
制作p12证书[通俗易懂]
/bin/ld: 找不到 -lxslt
Target detection - make your own deep learning target detection data set with labelimg
数据库系统概论第一章简答题-期末考得怎么样?
Analysis of the difference between array and linked list
ssh/scp 使不提示 All activities are monitored and reported.
beforeEach
Usage of group by
Add an empty column to spark dataframe - add an empty column to spark dataframe
Why does the system convert the temp environment variable to a short file name?
fastjson List转JSONArray以及JSONArray转List「建议收藏」
Teach you how to build virtual machines locally and deploy microservices
Fiddler实现手机抓包——入门
[Xiaobai chat cloud] suggestions on container transformation of small and medium-sized enterprises
Huawei ECS installs mysqlb for mysqld service failed because the control process exited with error code. See “sys
奥比中光 astra: Could not open “2bc5/[email protected]/6“: Failed to set USB interface
Boot 连接 Impala数据库
处理gzip: stdin: not in gzip formattar: Child returned status 1tar: Error is not recoverable: exitin