当前位置：网站首页>Compress words (kmp/ string hash, double hash)

Compress words (kmp/ string hash, double hash)

2022-07-02 16:00:00 【Selvaggia】

compress words

Insert picture description here
Topic link
This is good , Huo , As soon as you open this , The catalogue is amazing , There are so many questions in one question , obviously , Someone's IQ is worrying

Connect several words , Avoid repeating fields at the end of a word and the beginning of the next word

kmp Find the longest prefix

Deliver it to coedforce On ,next Array error , It should be the same name , To be on the safe side, change to nex

  Or right kmp The understanding of the algorithm is not deep enough , When the match is successful j=le,
	 You don't succeed , One is inequality , After all, the main string ratio is over, but it can't match all 
	 Just look at the last one j Indicates where to match  
	 If the last character of the main string fails to match s[0] Quite successful ,j It will become -1, Then add one 0 
	 Imagine the matching process  
	abcdefgh
		 fgheg

#include <iostream>
#include <math.h>
using namespace std;
#include <string>
const int N=1e6+10;
string t,s;
int len;// The length of the subsequent input string  
int nex[N];
void getNext(){
    
	int i=0;
	int j=-1;
	nex[0]=-1;
	while(i<len){
    
		if(j==-1||s[i]==s[j]){
    
			i++;
			j++;
			nex[i]=j;
		}
		else j=nex[j];
	}
}
int kmp(int st){
    
	int i=st,j=0;
	int le=t.size();
	while(i<le){
    
		while(j!=-1&&t[i]!=s[j])j=nex[j];
		i++;j++;	
	}
// while(i<le){
    
// if(j==-1||t[i]==s[j]){
    
// i++;
// j++;
// }
// else j=next[j];
// }
	return j;
// Or right kmp The understanding of the algorithm is not deep enough , When the match is successful j=le,
//  You don't succeed , One is inequality , After all, the main string ratio is over, but it can't match all 
//  Just look at the last one j Indicates where to match  
//  If the last character of the main string fails to match s[0] Quite successful ,j It will become -1, Then add one 0 
//  Imagine the matching process  
// abcdefgh
// fgheg 
}
int main(){
    
	int n;
	cin>>n;
	cin>>t;
	for(int i=1;i<n;i++){
    
		cin>>s;
		len=s.size();
		getNext();
		int st;// The starting matching position of the formed string  , There is no need to go back and forth from 0 Start 
		st=t.size()-len; 
		st=max(0,st);
		int pos=kmp(st); 
		t+=s.substr(pos,len-pos+1);
	}
	cout<<t;
	return 0;
}

String double hash

Ideas ：

Take the first word as the desired string t
Then enter a word s Then calculate the string prefix hash of this word ,
$from On heavy Stack Of Long degree yes m i n (s t r l e n (t), s t r l e n (s)),$
stay s The length of this section , Compare separately s Prefix hash value of and
t The hash value of the corresponding suffix segment , Be sure to s Add a paragraph of t, Further find t The hash value of the complete segment is repeated getH Get the hash value of the suffix segment

WC I didn't expect this problem to take me so long
A pile of problems ：

Question 1

Continue to splice a string $s$ To the original string $t$ On , The total length data range is $1 e 6$ , In order to find the hash value of the existing length string , Repeated use $s t r l e n （ t + 1 ）$
$n$ The range is $1 e 5$ ,strlen The time complexity is $O (n)$ , The outer loop + Inside $s t r l e n, complex miscellaneous degree O (n 2)$ , There are also... In the cycle ｓｔｒｌｅｎ It's really overtime to despair
strlen

	scanf("%s",t+1);
	int st=1;
	for(int j=1;j<n;j++){
    
		scanf("%s",s+1);
		int len=strlen(t+1);
		int le=strlen(s+1);
		for(int i=st;i<=len;i++){
    
			int id=getID(t[i]);
			h[i]=(h[i-1]*bas+id)%mod;
		}
		……
		……
		……

strlen It's just the work of a counter , It comes from somewhere in memory （ It can start with a string , Somewhere in the middle , Even some uncertain memory area ） Start scanning , Until you hit the first string Terminator ’\0’ until , Then return the counter value ( Length does not contain ’\0’).
We use it cin>>s and scanf(%s) When entering a string , After accepting the string, add a ’\0’, That is to say NULL As a string Terminator .
So use strlen Function to find the length of a string , The complexity is O(n). If you use... In a loop
for (int i = 0; i < strlen(s); ++i)// Complexity O(n2) That's trouble , It may time out , terms of settlement ：
（1） The first strlen(s) First store it with variables , Then put it into the cycle ;
（2） Or use s[i] != ‘\0’;

Question two

1、
ll mod[2]={1e9+7,999999998}; There's no problem with the compiler , Deliver it to codeforce On ,compile error, The reason is that 1e9+7 Will count as double Floating point number of type , $1.000000000*10^7$
ll mod[2]={1000000007,999999998};
2、 Similar to that ,cin>>s+1, The error report says that there is no overloaded operator operator
3、 And before next Duplicate name of

Question 3

Double hash or multiple hash , You must have the same hash value under each hash function to judge that the two strings are equal , It's a && The relationship between , therefore

st=len+1;
int pos=0;
ll H[2]={
    0,0};
for(int i=1;i<=le&&i<=len;i++){
    
			int id=getID(s[i]);
			int ok=1;
			for(int k=0;k<2;k++){
    
			H[k]=(H[k]*bas[k]+id)%mod[k];//s character string s[1--i] This hash value 
// ll temp=((h[k][len]-h[k][len-i]*p[k][i])%mod[k]+mod[k])%mod[k];
			if(H[k]!=getH(len-i+1,len,k))ok=0;
			}
			if(ok)pos=i;//12345
		}

As long as the hash value under a hash function is not equal , It cannot be counted as string equality

Question 4

H[2][N] Optimize to H[2] Well , It's a temporary variable , It is specially used to record the prefix hash value of the next word , Keep typing words , therefore H Remember to initialize the array every time , It can't be finished after the global variable is defined （ Initialize only once ）, because ==H[k]=(H[k]*bas[k]+id)%mod[k];== We need to use the last H【k】 value , It's a bit like the optimization of rolling arrays
If you don't need to scroll array optimization , There is no need to initialize again ,
H[k][i]=(H[k][i-1]*bas[k]+id)%mod[k];
It takes space

	for(int i=1;i<=le&&i<=len;i++){
    
			int id=getID(s[i]);
			int ok=1;
			for(int k=0;k<2;k++){
    
			H[k][i]=(H[k][i-1]*bas[k]+id)%mod[k];//s character string s[1--i] This hash value 
// ll temp=((h[k][len]-h[k][len-i]*p[k][i])%mod[k]+mod[k])%mod[k];
			if(H[k][i]!=getH(len-i+1,len,k))ok=0;
			}
			if(ok)pos=i;//12345
		}

Question five

Use string hash Sure O ( n ) Handle ,O ( 1 ) Compare . Because each string will only be scanned once , It can be direct violence , Preprocessing O ( n ) , violence O ( n ) O(n) Find the longest identical part .

Because the string is as long as $10^6$
Monomode hash I can't do it anymore , Use a safer pair hash, take 2 A larger modulus mod1 and mod2, And a prime number p (p Don't be too big , Single-mode p Not too big , Modulus must be large )

ll bas[2]={131,2333333};
ll mod[2]={1000000007,999999998};

1e5 It can be used P take 131,mod take 1<<64, adopt unsigned long long To get the mold Combine Single hash ,
Once it's here 1e6 Or honest and practical double hash or even three hash , In fact, it's not difficult , hold p Array ,mod Array ,H Array ,h Array , All become k Wei is good , One more cycle for（int j=0;j<k;k++)…… Just fine

It is recommended to take the above two hash functions P and mod Combine , strong , The latter group can even be used for single hash in this problem AC

Question 6

The upper and lower case letters in this question are treated as unequal , At first, I understood the opposite , It's useless for me to write a
int getID(char ch){//65 97
if(ch>=‘A’&&ch<=‘Z’)ch+=32;
return ch;
}
This way , But it is not necessary to ,P Base number P It's big enough ,48——65——97——122, It is completely possible to stagger and take different values , Direct use s[i] Just ok
int getID(char c){
if(c>=‘a’&&c<=‘z’) return c-‘a’;
if(c>=‘0’&&c<=‘9’) return c-‘0’+26;
return c-‘A’+36;
}

Question seven

Poor complexity analysis , At first, timeout was thought to be fast power sum initialization P The choice of power initialization of
Test certificate , This problem initializes P Power array p Come faster than fast power
Anyway initialization P Power array p Namely O(n) Complexity , Will not cause substantial timeout , Use this

Code

Single hash
2333333
999999998
Even 131 and unsigned long long The combination of will also be stuck

#include <iostream>
#include <string.h>
using namespace std;
typedef long long ll;
const int N=1e6+10;
char t[N];
char s[N];
ll bas=2333333;
ll mod=999999998;
ll h[N];
ll H[N];
ll p[N];
void init(){
    
	p[0]=1;
	for(int i=1;i<N;i++){
    //N The highest power of bit is N-1
		p[i]=(p[i-1]*bas)%mod;
	}
}
ll getH(int l,int r){
    
	return ((h[r]-h[l-1]*p[r-l+1])%mod+mod)%mod;
}
int main(){
    
	init();
	int n;
	cin>>n;
	scanf("%s",t+1);
	int len=strlen(t+1);
	int st=1;
	for(int j=1;j<n;j++){
    
		scanf("%s",s+1);
		int le=strlen(s+1);
		for(int i=st;i<=len;i++){
    
			h[i]=(h[i-1]*bas+t[i])%mod;
		}
		st=len+1;
		int pos=0;
		for(int i=1;i<=le&&i<=len;i++){
    
			H[i]=(H[i-1]*bas+s[i])%mod;//s character string s[1--i] This hash value 
			if(H[i]==getH(len-i+1,len))pos=i;//12345
		}
		for(int i=pos+1;i<=le;i++){
    
			t[++len]=s[i];
		}
	}
	printf("%s",t+1);
	return 0;
}

Double hash （So easy, First write a single hash , Then one by one $m o d - > m o d [k], b a s - > b a s [k], h [i] - > h [k] [i]$ , When calculating the hash value, use a k loop , It should be noted that when judging the equality of strings, the hash value under each hash function must meet the conditions

Using two-dimensional arrays is much better than writing two arrays , It can also be supplemented according to the errors reported by the compiler k

#include <iostream>
#include <string.h>
using namespace std;
typedef long long ll;
const int N=1e6+10;
char t[N];
char s[N];
ll bas[2]={
    131,2333333};
ll mod[2]={
    1000000007,999999998};
ll h[2][N];
ll H[2][N];
ll p[2][N];
void init(){
    
	p[0][0]=p[1][0]=1;
	for(int i=1;i<N;i++){
    //N The highest power of bit is N-1
		for(int k=0;k<2;k++)
		p[k][i]=(p[k][i-1]*bas[k])%mod[k];
	}
}
ll getH(int l,int r,int k){
    
	return ((h[k][r]-h[k][l-1]*p[k][r-l+1])%mod[k]+mod[k])%mod[k];
}
int main(){
    
	init();
	int n;
	cin>>n;
	scanf("%s",t+1);
	int len=strlen(t+1);
	int st=1;
	for(int j=1;j<n;j++){
    
		scanf("%s",s+1);
		int le=strlen(s+1);
		for(int i=st;i<=len;i++){
    
			for(int k=0;k<2;k++)
			h[k][i]=(h[k][i-1]*bas[k]+t[i])%mod[k];
		}
		st=len+1;
		int pos=0;
		for(int i=1;i<=le&&i<=len;i++){
    
			int ok=1;
			for(int k=0;k<2;k++){
    
			H[k][i]=(H[k][i-1]*bas[k]+s[i])%mod[k];//s character string s[1--i] This hash value 
			if(H[k][i]!=getH(len-i+1,len,k))ok=0;
			}
			if(ok)pos=i;//12345
		}
		for(int i=pos+1;i<=le;i++){
    
			t[++len]=s[i];
		}
	}
	printf("%s",t+1);
	return 0;
}

Finally, you can use the idea of rolling array to optimize the space
H[2][N]–>H[2]

	ll H[2]={
    0,0};
		for(int i=1;i<=le&&i<=len;i++){
    
			int ok=1;
			for(int k=0;k<2;k++){
    
			H[k]=(H[k]*bas[k]+s[i])%mod[k];//s character string s[1--i] This hash value 
			if(H[k]!=getH(len-i+1,len,k))ok=0;
			}
			if(ok)pos=i;//12345
		}