当前位置:网站首页>Xmppmini project details: step by step from the principle of practical XMPP technology development 4. String decoding secrets and message package

Xmppmini project details: step by step from the principle of practical XMPP technology development 4. String decoding secrets and message package

2020-11-06 20:10:00 itread01

    This section is longer , But if you do want to decode it manually xmpp Message , I suggest you take a closer look , And it's not really complicated .


    After logging in successfully , We can send and receive messages . The format of the message packet is roughly as follows :

1 <message id="JgXN5-33" to="[email protected]" from="[email protected]/Spark" type="chat">
2   <body>hi, How do you do .</body>
3 </message>


    In fact, there are compatibility problems with packets , Most of all, each client or server will add its own Expansion Kit nodes . In fact, starting from simplifying the agreement , These extensions are better placed in the message body itself , It can also be compatible with other communication protocols , We xmppmini That's what projects do . But it's all a afterword , Our immediate priority is how to decode the packet when we receive it .
    Let's give you a surprise and reassurance : I guarantee that only one function can complete this decoding . Really? ! Let's take a look at how traditional message decoding works . General message stream decoding , In particular xml Or the source code analysis of development language is more often used, and each node will be separated from the stream of bits first , Then a tree structure is formed according to their own rules . First, it is more complicated , From an open source perspective , It's too time-consuming for individual developers to implement , This is also why, as long as it is related to encoding and decoding, it is generally on the third-party library . There is also a very big drawback , Namely xml The ambiguity of decoding is too serious , Include json So is the decoding library , Including many famous libraries, until now, there are many special cases that can't be decoded correctly . Not long before I wrote this article, I saw a golang Development team's bug Report , It's about not being able to decode correctly under certain circumstances xml Situation of .
    Another common method is upper regular expression , I mean, I mean, very personal . Not to mention its decoding errors and ambiguity are also very serious , Every time I rewrite that string, I have to check the regular grammar string , It turns me off when I look at it , It can be said that the code maintenance implemented by regular expression is very poor . In fact, these seemingly complex strings , Just use a string to separate the function . Now let's introduce .
This function was written by me many years ago as an email client 《eEmail》 When the , The purpose is to decode smtp/pop3 Message for . In fact, it can also be used for xmpp/xml Bag . And the principle is easy to understand , It's very worthwhile to introduce it to you in detail .
    First , Let's consider how to get the content of a message in the following format :

1 key=value
2 Key=value
3 key=value;
4 Key=value;


Be careful “key” There is case in , Because this is a very noticeable phenomenon in network packets , The case of a flag is not consistent across implementations . And pay attention to “value” Sometimes there will be “;” Symbols . Yes web Readers of front-end development should be very familiar with this situation .

And we need to design this handler to achieve the following effect :

1 get_value('key=value', '=', ';');  // Should be  value
2 get_value('key=value', '', '=');   // Should be  key
3 get_value('key=value;', '=', ';'); // Should be  value
4 get_value('key=value;', '', '=');  // Should be  key


    It's actually a string between two separators , And the two separators are not the same . At the same time, consider the case that there is no first separator or no second separator . You don't have to think about the above results , Because it's still a little complicated at this time , Especially without a certain separator, it is more difficult to handle .
    At first, I designed this function and it worked very well , Basically with some regular string query 、 Cutting function can solve the decoding problem . But there's a problem with this function , It's just that the design is too elaborate , After many years golang Language appears , When I tried to port code, I found , I have forgotten the thought of dealing with it , When I write it again , The results are not exactly the same ! It's obviously not appropriate , Because of this xmpp The library of C#、java、 Pure C Wait for multiple versions , I can't do it myself , How to introduce methods to others ? So I have to think about how to divide it into several simple logical processing functions to accomplish the same task . After nearly two days of tossing , Kungfu is not responsible for the heart , I do find that its processing can be broken down into several simple functions . Even better, these simple functions can be combined with a simpler function .
    It's no surprise that this layer of window paper is punctured , Just one of the simplest string separator functions . For example, string “123abc456” Split into two strings “123”、”456” That's all right. . Not good. ! The sharp eyed reader must have found something . This is a string splitting function , No need to write , Almost all development languages have ! That's right ! But there are many problems with these partitioning functions , And there are compatibility problems between them !
    With my earliest realization of Delphi Version and final implementation of golang Version as an example .Delphi By default, the separator function will put spaces without your knowledge 、tab Characters are also used as separators . So when you use its default string separator function, there will be a lot of unexpected results, which make you hard to debug . and golang There's also a problem with the separation of . Some languages are implemented in regular expressions , The result is sometimes even more impractical . The reason is very simple , Because these separators themselves are used to separate multiple string segments , It also considers the case of common separators . And what we need is a clear separation of the specified separator , And a function that only divides a string into two segments .
    Consider the following string :

“1,2,3,4,5”


    After we write our own functions, it needs to be divided into “1”、”2,3,4,5” Two parts . And if it is golang The default implementation of is likely to be “1”、”2”. And the back is gone , Because it puts the second “,” It's also used as a separator .
In fact, the string separator function we want to implement is simpler , It only deals with the first separator . So manual implementation is very simple , Any programmer can do it .
    The concrete implementation is very simple , First query the position of the separator string in the source string , Then cut it and remove the separator itself . This can be accomplished by using the string query and separation functions that are available in development languages . It's very simple , The pseudo code is as follows . However, it should be noted that the expression of string position in different languages is not exactly the same , Most languages set the first starting position of a string as 0 , And some are 1; Also pay attention to when cutting strings , Some development languages do different things when their length exceeds or falls short , Some return the entire string , Some return to empty , Some return as much as they have . Specifically, we need to pay more attention to the implementation .


 1 // A string is divided into two according to the first position of the separator 
 2 void sp_str_two(string in_s, sp, string out s_left, string out  s_right)
 3 {
 4   // Where to start copying 
 5   Int find_pos;      // The location you have queried 
 6   Int left_last_pos; // The position of the last character in the left string 
 7 
 8   find_pos = pos(lowercase(sp), lowercase(in_s)); // Don't be case sensitive 
 9 
10   if (Length(sp)<1 ) find_pos = 0; // No separator is treated as not found 
11 
12   if find_pos <= 0        // No separator found , Return immediately , Now on the left is the original string , On the right is an empty string , It is similar to that after being divided into arrays  【 Indexes 1】  and  【 Indexes 2】  Content in 
13   {
14     s_left = in_s;
15     s_right = '';
16     return;
17 
18   };
19 
20   left_last_pos = find_pos - 1; // Because the end sign itself is not needed , So the last character we want is to move the position forward one bit 
21 
22   // Take the left 
23   s_left = copy(in_s, 1, left_last_pos); 

/* Because delphi The string position is from 1 Starting to calculate , So the position of the character is the length of the whole string containing it , There is no need to add 1 Or minus 1 Such a calculation
Other languages need to modify this part of the code according to the actual situation . Most development languages generally start with 0 Start calculating the position of a string . */ 24 25 //---- 26 // Take the right side 27 find_pos = find_pos + (length(sp)); // The starting position also skips the length of the separator 28 s_right = copy(in_s, find_pos, length(in_s)); // Remove the part before the initial separator ( The separator itself should not be ) 29 30 }


    There's a point to note : The separated string no longer contains the separator . But in real work , In fact, sometimes it's necessary to take a separator . I wanted to add a default argument to decide whether to include a separator in the result . But it is not convenient to find it in practice , First of all, there's an extra argument , When you see this function in your work, you will break your mind and think about the difference ( Although it's a very subtle pause ). This is a big taboo in the process of flowing water ( At least for me ). Again , Now emerging languages like java、golang In order to avoid ambiguity, default arguments are not supported . Of course, it can be split into multiple functions to solve the problem , But in this case, the problem of interrupting thinking still exists . So in the end I decided to keep it simple , When we separate, we must know what the separator is , Just add it back where you need it . Although this method seems a bit silly , But in the actual development, it can maintain the clarity and simplicity of thinking logic .
    With this function , You can easily implement a function that takes the left part of a string separator , And a function that takes the right part of the string separator . The pseudo code is as follows :

 1 // Split the string in two , Do not use the system's own separator strings as array functions , Because in that case, you can't deal with multiple separators in a string 
 2 // This function separates the string where it first appears , If other places appear again, don't pay attention to , That's how to deal with  xml  This marks the case of multiple nests 
 3 //b_get_left  Take the left or right side of the separated string 
 4 string sp_str(string in_s, sp, bool b_get_left)
 5 {
 6   String s_left;         // The string on the left 
 7   String s_right;        // The string on the right 
 8 
 9   sp_str_two(in_s, sp, s_left, s_right);
10 
11   //----
12   result = s_left;
13   if (False = b_get_left)  result = s_right;
14 
15   return result;
16 };
17 
18 // Take the left side of the separator string 
19 string sp_str_left(string in_s, sp)
20 {
21   return sp_str(in_s, sp, true);
22 
23 }
24 
25 // Take the right side of the separator string 
26 string sp_str_right(string in_s, sp)
27 {
28   return sp_str(in_s, sp, false);
29 
30 }


Okay , Finally, we want to achieve get_value() The function itself . Here is something to pay special attention to . With the previous basic function , To achieve get_value() It's also very simple . But be sure to use the predicted results of the above function operation as test cases after completion , A slight change in the call order in the following code may cause different results . The code is as follows :

 1 string get_value_sp(string in_s, b_sp, e_sp)
 2 {
 3   Result = in_s;
 4 
 5   if (Length(b_sp)<1) // If the left separator is empty, it means that only the 
 6   {
 7     Result = sp_str_left(Result, e_sp);
 8     Return result;
 9   };
10 
11   if (Length(e_sp)<1) // If the right separator is empty, it means that only after the left separator 
12   {
13     Result = sp_str_right(Result, b_sp);
14     Return result;
15   };
16 
17   // If you have both, take the space between the separators 
18   Result = sp_str_right(Result, b_sp);
19   Result = sp_str_left(Result, e_sp);
20   //Result = sp_str_left(Result, b_sp);
21 
22   return result;
23 }


    With these functions , Let's see how we can easily decode the packet from the beginning of the article .
    First, we need to make sure that the complete message package is included in the string . This is directly using the functions in the previous chapters FindStr() Query whether there are substrings “/message>” That's all right. .
    The second step , Make sure that the contents of the buffer contain the full packet , You can call directly get_value() Got the message package .

1 s = get_value(gRecvBuf, '<message', '</message >');
2 
3 msg = get_value(s, '<body>', '</body>');


At this time s The content of is

“
id="JgXN5-33" to="[email protected]" from="[email protected]/Spark" type="chat">
  <body>hi, How do you do .</body>
”


and msg The content of

“
hi, How do you do .
”


Note that the starting separator for the first call location is “<message”, instead of “<message>” , This is because message There are also attribute nodes attached to the package . And when these ground nodes don't exist , Use separator “<message ” You can also get the required string . These nodes include the sender's address , Use get_value() Functions are also easy to obtain :

1 from = get_value(s, ' from="',  '"');


    Let's take a closer look at this line of code , The first separator must be preceded by a space . Because if you don't add it, you may get “afrom” perhaps “bfrom” The contents of these nodes .
    As you can see, we can easily decode this xmpp Message node for . Because xmpp The message is more regular and tidy, so it can be processed in this way . If it's used to decode handwriting xml You can add some preprocessing to the file : For example, remove continuous spaces ; Will tab、 Go back 、 Line breaks to spaces and so on , Of course, we have to think about “message” There are many levels of situations . It's not hard, in fact , But xmpp There is no such situation in , We'll just press the no table .
    There is a problem with this decoding method : It's decoding efficiency . It is mainly character cutting and redistribution memory that affects some processing speed . Here we mainly talk about the principle , Second, most of the readers are sure to develop the client side , There's no need to optimize execution speed . If it's a Server Developer , Then the direction of optimization is to directly achieve get_value, But if it's my own optimization, I won't use this method , Because I think code maintainability is more important . If it is C Language , You can change all of the above functions to versions that do not require memory redistribution , All of them are realized by indicators . Similar to golang The slicing operation in is based on the principle of the same block of memory .
    When it comes to optimization , I can't help but have some interesting things to share with you . When we first started learning programming and computers in the early years , When it comes to optimization, most of them refer to the optimization of compiled code , At that time, most of the optimizations would say what to change, which assembly instruction or function modification would speed up the code execution, and so on . Especially when you see the compilations , All of a sudden, I feel this kind of work is far away from myself . This is mainly due to the lack of relevant information , I have to be English , The development of native non English language to change the assembly optimization code , The possibility is really small . After working for many years, I found that , It's not like this , In fact, a change of algorithm or processing method may make the code run faster than a thousand times ! Really? , And I'm conservative . Take the simplest example , At home ( In fact, foreign countries are also ) A lot of development is not made by computer software or related majors , A very common problem is that they don't know what a binary query is ( I didn't even hear of ), This makes them fail to understand the importance of indexing and sorting when designing database and container data structures . In the design time often ignores , And give the system ( Especially for server type systems ) Adding a simple binary search can exponentially improve performance . Most of these algorithms are fixed , For example, there are netizens to analyze nginx The original code is said to be the red black tree algorithm ( I don't really remember , In short, it's a binary tree ) As like as two peas in classic tutorial . This type of module is just like assembly , It's unlikely to modify it – Do you think you can write a faster algorithm than quicksort ? It's not to say it's totally impossible , Instead, we should not focus on code optimization in our daily development . But it's not entirely traditional , Again nginx Example , Its list container is not a traditional concatenation , It's a big chunk of memory , Keep indicators in it . This is in the Delphi It's the same in China , I saw that in the same year Delphi This part of the code implementation is amazing , Because I've never seen or heard of such a list . This kind of list is small in quantity (1 Less than 10000 ) When , The speed is amazing , Because the whole operation of this memory is to operate on the whole list – Multiple operations require only one copy of memory code . But years later, I was in charge of restoring one Delphi The number of server versions found reached 2 The aging energy of this level will drop sharply , It's very slow to delete an element in it – Because this memory is too big .Nginx The solution is simple , It goes back to the traditional algorithm – If there is too much , It allocates another block of memory , Connect with a string of links , So it gets the benefits of both . But I didn't use it in the end nginx How to do it , First, it's a little complicated , More importantly, I only needed to optimize the deletion situation . My practice is to exchange the position of the last element with the deleted one , Because the total has decreased 1, The element that has been moved to the end will never be accessed . I want to tell you that , Optimization is not that difficult , Do it boldly . At the same time, we should learn more professional knowledge ; At the same time, understand what you can't do ; And understand , There's a lot I can't do yet , But within the scope of what I can do, it can also increase the efficiency by tens of millions of times .
    Let's go back to string optimization , Why? “ Experts ” When we manipulate a string, we say we operate on the same block of memory , Don't subtract by adding more than one memory ? Most developers know about this optimization . But what's the principle , Most people don't know , And more people don't know , In fact, the system has many optimizations for memory allocation , So I don't have to worry too much about it . I have learned about operating system , Or have a certain understanding of the operating system implementation should know , Allocating memory is an important basic operation of the operating system . What you don't know is , Even in this era , The speed with which the operating system allocates memory is really slow . In the development language ( At least C/C++、delphi Must be ) It's all about taking a chunk of memory first , And then when the program needs to be allocated . It is equivalent to replacing the memory allocation function provided by the operating system with its own memory allocation algorithm . There are even several memory allocated C/C++ Open source project , The purpose is to improve malloc/new It's just the speed of the operation , We can see the importance of increasing the speed of allocating memory . This, of course, also causes the speed of different systems to vary greatly , Since it is so difficult , So I don't allocate memory is the fastest – That's right ! This is what string operations use so-called no memory reallocation stringbuf Instead of string The theoretical basis of . stay java And the new version of golang There are even special ones like this “ String buffer class ”. I know that , We can also know that , Not everything needs to be replaced , There's no need for places that don't generate frequent memory operations . And modern string implementations actually have buffers .
    Do you understand , In fact, what I want to say is , Generally, memory allocation has been optimized in our development environment , And the string also has a certain buffer , So we use... Directly in our code string In fact, the problem is not too big .
    When it comes to memory allocation management , I can't help sharing another story . It was years ago , I work for a futures software supply company which claims to be one of the best in China – Their claims could be credible , Because I've heard about their software in another company that claims to be number one in the industry . One day they need to add the function of allocating memory first to the program , The reason is that their customers will execute a lot of clients , At this time, when multiple clients switch, there may be insufficient memory . If it happens to be our client's turn, the customer will say , How can your client pop up a dialog box saying there's not enough memory … … Okay , In order to avoid this situation, our boss asked the program to get all the memory needed as soon as it entered . This seemingly nonsense function is actually possible , After studying it, I found that it was not difficult , Just rewrite the memory allocator . It's not hard , It was finished in a few days . But there's a big problem : Too slow , It's really slow at least 10 To 100 times , Especially after the memory usage is large . The final solution is to study the original memory allocator in detail , In fact, according to the size of memory usage statistics in which several intervals , Then allocate several memory blocks of a fixed size to the larger interval . The connection between blocks is also the simplest two-way connection string . In fact, the most time spent is the size of the memory interval , For example, the first gear should be 100k Or 1m This is it . Can't imagine , You have to configure it with statistics . Finally, the allocator speed of this pre allocated memory is the same as that of the original ( Of course, I want to make it faster and more vain , But it is true that the original speed is also very good ).
Interestingly , By chance, I came back to this company . Found that they didn't understand the code , I have given up . In fact, when I write code, I am used to writing all ideas clearly – The main thing is that I write too much code , I'm afraid I can't understand – My notes should still be useful , At least they are “ Easy ” Is replaced back to the default memory allocation mechanism .
    So, is it optimized or not , It depends on the actual situation . We should also combine our own ability to make decisions and choices .
    There is another point : Test cases are really important . If there is no test case above , I found no mistakes in the subtleties when I rewrote them into other languages , It will produce serious bug !golang Of mime There are many error prone test cases in the source code of decoding module , It is necessary to modify the modules for such complex functions , Otherwise, you make an improvement that you think is very important, and the result is bug It will leave a serious aftereffect .

&n

版权声明
本文为[itread01]所创,转载请带上原文链接,感谢