[rust notes] 15 string and text (Part 1)

2022-07-05 06:05:00 phial03

15 - Strings and text

15.1-Unicode

15.1.1-ASCII, Latin-1, and Unicode

  • Unicode and ASCII agree on every ASCII code point (0 to 0x7f).

  • Unicode calls the code point range 0 to 0xff the Latin-1 code block (ISO/IEC 8859-1).

  • In other words, Unicode is a superset of Latin-1:

    • Converting Latin-1 to Unicode:

      fn latin1_to_char(latin1: u8) -> char {
        // Every u8 value is a valid code point, so the cast cannot fail.
        latin1 as char
      }

    • Converting Unicode to Latin-1:

      fn char_to_latin1(c: char) -> Option<u8> {
        // Only code points 0 to 0xff exist in Latin-1.
        if c as u32 <= 0xff {
          Some(c as u8)
        } else {
          None
        }
      }

15.1.2-UTF-8

  • Rust's String and str types represent text using the UTF-8 encoding. UTF-8 encodes each character as a sequence of 1 to 4 bytes.
  • UTF-8 places two restrictions on well-formed sequences:
    • Only the shortest encoding of a given code point is well formed: you cannot spend 4 bytes on a code point that fits in 3.
    • Well-formed UTF-8 must not encode the numbers 0xd800 to 0xdfff, or any number greater than 0x10ffff.
  • Important properties of UTF-8:
    • UTF-8 encodes code points 0 to 0x7f as the bytes 0 to 0x7f, so a run of bytes holding ASCII text is already valid UTF-8. The reverse also holds: a UTF-8 string containing only ASCII characters is valid ASCII. Latin-1 does not share this property; for example, Latin-1 encodes é as the single byte 0xe9, which UTF-8 treats as the first byte of a three-byte encoding.
    • The upper bits of any byte tell you whether it is the first byte of some character's UTF-8 encoding or a byte from the middle of one.
    • The upper bits of an encoding's first byte tell you the encoding's total length.
    • No encoding is longer than 4 bytes, so processing UTF-8 never needs an unbounded loop, which makes it safe to use on untrusted data.
    • In well-formed UTF-8, you can always tell where a character's encoding starts and ends, even from an arbitrary position in the middle of the bytes, because first bytes and continuation bytes are always distinct.
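
The "upper bits" properties above can be sketched in a few lines. This is only an illustration: the helper names utf8_len_from_first_byte and is_continuation_byte are made up for these notes and are not standard library functions.

```rust
/// Returns the total length of a UTF-8 encoding given only its first byte,
/// or None if the byte cannot start a well-formed encoding.
/// (Hypothetical helper, not part of the standard library.)
fn utf8_len_from_first_byte(b: u8) -> Option<usize> {
    match b {
        0x00..=0x7f => Some(1), // 0xxxxxxx: plain ASCII
        0xc2..=0xdf => Some(2), // 110xxxxx (0xc0 and 0xc1 would be overlong)
        0xe0..=0xef => Some(3), // 1110xxxx
        0xf0..=0xf4 => Some(4), // 11110xxx, capped so the code point stays <= 0x10ffff
        _ => None,              // continuation byte (10xxxxxx) or never valid
    }
}

/// True for bytes of the form 10xxxxxx, which only occur in the middle
/// of a multi-byte encoding.
fn is_continuation_byte(b: u8) -> bool {
    b & 0b1100_0000 == 0b1000_0000
}
```

Because the two byte classes never overlap, a decoder dropped into the middle of a buffer can skip continuation bytes until it finds the next first byte.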

15.1.5 - Text directionality

  • Some scripts (such as Latin, Cyrillic, and Thai) are written from left to right. Unicode stores characters in the order in which they are normally written or read.
  • Other scripts (such as Hebrew and Arabic) are written from right to left, so the first bytes of such a string encode the character that is displayed at the far right.

15.2 - Characters (char)

  • A char is a 32-bit value holding a Unicode code point.
  • Its value is always in the range 0 to 0xd7ff or 0xe000 to 0x10ffff.
  • char implements Copy and Clone, as well as the usual traits for comparison, hashing, and formatting.

15.2.1 - Character classification: methods for detecting character categories

  • ch.is_numeric(): a numeric character. The book describes this as the Unicode general categories "Number; digit" and "Number; letter" but not "Number; other"; the current standard library documentation defines it as all three categories (Nd, Nl, and No).
  • ch.is_alphabetic(): an alphabetic character, i.e. one with Unicode's "Alphabetic" derived property.
  • ch.is_alphanumeric(): either numeric or alphabetic, as defined above.
  • ch.is_whitespace(): a whitespace character, i.e. one with the Unicode character property "WSpace=Y".
  • ch.is_control(): a control character, i.e. one in Unicode's "Other, control" general category.
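
As a rough illustration of the classification methods, the sketch below tallies the characters of a string by category. The function name classify and the order of the checks are choices made for this example only.

```rust
/// Counts (alphabetic, numeric, whitespace, other) characters in `text`.
/// Hypothetical helper for illustration; each character is counted once,
/// under the first category that matches.
fn classify(text: &str) -> (usize, usize, usize, usize) {
    let mut counts = (0, 0, 0, 0);
    for ch in text.chars() {
        if ch.is_alphabetic() {
            counts.0 += 1;
        } else if ch.is_numeric() {
            counts.1 += 1;
        } else if ch.is_whitespace() {
            counts.2 += 1;
        } else {
            counts.3 += 1;
        }
    }
    counts
}
```

These methods work on the full Unicode range, so an accented letter or a non-Latin digit is classified the same way as its ASCII counterpart.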

15.2.2 - Handling digits

  • ch.to_digit(radix): decides whether ch is an ASCII digit in base radix. If so, it returns Some(num), where num is a u32; otherwise it returns None. radix must be in the range 2 to 36. For radix > 10, ASCII letters of either case count as digits with values 10 to 35.
  • std::char::from_digit(num, radix): converts the u32 digit value num to a char, returning an Option<char>. For radix > 10, the resulting letter is lowercase.
  • ch.is_digit(radix): returns true if ch is an ASCII digit in base radix. Equivalent to ch.to_digit(radix) != None.
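
One way to see to_digit at work is a small hexadecimal parser. This is a minimal sketch, assuming we want None for empty input, invalid digits, and overflow; parse_hex is a name invented here.

```rust
/// Parses a hexadecimal string into a u32 using char::to_digit.
/// Returns None for empty input, non-hex characters, or overflow.
/// (Hypothetical helper for illustration.)
fn parse_hex(s: &str) -> Option<u32> {
    if s.is_empty() {
        return None;
    }
    let mut value: u32 = 0;
    for ch in s.chars() {
        // to_digit accepts '0'-'9' and letters of either case.
        let digit = ch.to_digit(16)?;
        value = value.checked_mul(16)?.checked_add(digit)?;
    }
    Some(value)
}
```

In real code, u32::from_str_radix does the same job; the loop is spelled out only to show each character being converted.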

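15.2.3 - Character case conversion

  • ch.is_lowercase(): tests whether ch is a lowercase letter.
  • ch.is_uppercase(): tests whether ch is an uppercase letter.
  • ch.to_lowercase(): converts ch to lowercase. The result is an iterator over chars, because one character's lowercase form can be several characters long.
  • ch.to_uppercase(): converts ch to uppercase, likewise returning an iterator.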
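
Since the conversion methods return iterators, the usual way to get text out of them is to collect. A small sketch, with upper_string as a made-up helper name:

```rust
/// Collects the uppercase form of a single character into a String.
/// (Hypothetical helper; to_uppercase itself returns an iterator of chars.)
fn upper_string(c: char) -> String {
    c.to_uppercase().collect()
}
```

For example, the German sharp s (U+00DF) has the two-character uppercase form "SS", which is why a single char cannot be the return type.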

15.2.4 - Converting to and from integers

  • The as operator converts a char to any integer type, silently masking off the upper bits that do not fit.
  • The as operator converts any u8 value to a char, and char also implements From<u8>. For wider integer types, which may hold invalid code points, use std::char::from_u32, which returns Option<char>.
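
The difference between the truncating cast and the checked conversion can be sketched as follows. The wrapper names low_byte and checked_char are invented for this example; the calls inside them are the standard ones.

```rust
/// Truncating cast: keeps only the low 8 bits of the code point.
/// (Hypothetical wrapper for illustration.)
fn low_byte(c: char) -> u8 {
    c as u8
}

/// Checked conversion: None for surrogates (0xd800 to 0xdfff)
/// and for values above 0x10ffff.
fn checked_char(n: u32) -> Option<char> {
    std::char::from_u32(n)
}
```

The cast never fails but can silently lose information, so the checked form is the safer default whenever the integer did not originally come from a char.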

15.3-String and str

  • Rust's String and str types are guaranteed to hold only well-formed UTF-8.

  • String is a resizable buffer for holding text. It is essentially a wrapper around Vec<u8>.

  • str is for operating on string text in place, without resizing it.

  • String dereferences to &str, so every method defined on str can be called directly on a String.

  • Text-handling methods index text by byte offset and measure length in bytes, not in characters.

  • The method descriptions in the rest of this chapter use the variable's name to indicate its type:

    Variable name    Assumed type
    string           String
    slice            &str, or something that dereferences to &str, such as String or Rc<String>
    ch               char
    n                usize, a length
    i, j             usize, a byte offset
    range            A range of usize byte offsets: fully bounded (i..j), partly bounded (i.. or ..j), or unbounded (..)
    pattern          Any pattern type: char, String, &str, &[char], or FnMut(char) -> bool

15.3.1 - Creating String values

  • String::new(): returns a fresh, empty string. No heap buffer is allocated yet; one is allocated later as needed.

  • String::with_capacity(n): returns a fresh, empty string with a heap buffer pre-allocated to hold at least n bytes.

  • slice.to_string(): allocates a brand-new String whose contents are a copy of slice. Commonly used to create a String from a string literal.

  • iter.collect(): builds a String by concatenating all of an iterator's items, which may be char, &str, or String values. The following example removes the whitespace from a string:

    let spacey = "man hat tan";
    let spaceless: String = spacey.chars().filter(|c| !c.is_whitespace()).collect();
    assert_eq!(spaceless, "manhattan");

  • slice.to_owned(): returns a copy of slice as a freshly allocated String. The str type cannot implement Clone because it is unsized, so to_owned is how you get the effect of cloning.

15.3.2 - Simple inspection: getting basic information from string slices

  • slice.len(): returns the length of slice in bytes.

  • slice.is_empty(): returns true if slice.len() == 0.

  • slice[range]: returns a slice borrowing the given portion of slice.

  • You cannot index a string slice with a single position, as in slice[i]. Instead, create a chars iterator over the slice and let it parse one character's UTF-8:

    let par = "rust he";
    assert_eq!(par[6..].chars().next(), Some('e'));

  • slice.split_at(i): returns a tuple of two shared slices borrowed from slice: slice[..i] and slice[i..].

  • slice.is_char_boundary(i): returns true if the byte offset i falls on a boundary between characters, and is therefore usable as an offset into slice.

  • Slices can be compared for equality, ordered, and hashed.
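
Because offsets are in bytes, slicing at an arbitrary offset can panic if it lands inside a multi-byte character. The sketch below shows one way to use is_char_boundary to cut a string safely; safe_prefix is a hypothetical helper, not a standard method.

```rust
/// Returns the longest prefix of `s` that is at most `max_bytes` long
/// and ends on a character boundary. (Hypothetical helper.)
fn safe_prefix(s: &str, max_bytes: usize) -> &str {
    if max_bytes >= s.len() {
        return s;
    }
    let mut end = max_bytes;
    // Offset 0 is always a boundary, so this loop terminates.
    while !s.is_char_boundary(end) {
        end -= 1;
    }
    &s[..end]
}
```

With "élan" (5 bytes, 4 characters), a limit of 1 byte falls inside the two-byte é, so the helper backs off to the empty prefix instead of panicking.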

15.3.3 - Appending and inserting text

  • string.push(ch): appends the character ch to the end of string.

  • string.push_str(slice): appends the full contents of slice.

  • string.extend(iter): appends every item produced by the iterator iter to string. The iterator may produce char, str, or String values.

  • string.insert(i, ch): inserts the character ch at byte offset i in string. Everything after i is shifted back to make room.

  • string.insert_str(i, slice): inserts the full contents of slice at byte offset i in string.

  • String implements std::fmt::Write, so the write! and writeln! macros can append formatted text to a String. Both return a Result, so the error has to be handled, for example with the ? operator. The ? in the snippet below only compiles inside a function that returns a compatible Result; writing to a String never actually fails, so .unwrap() also works.

    use std::fmt::Write;

    let mut letter = String::new();
    writeln!(letter, "Whose {} these are I think I know", "rustabagas")?;

  • The + operator: concatenates strings when the left operand is a String and the right operand is a &str.
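
The appending methods above can be combined in one small function. This is a sketch for illustration only; build_greeting and the text it produces are invented here.

```rust
use std::fmt::Write;

/// Builds a greeting step by step, using each appending method once.
/// (Hypothetical example function.)
fn build_greeting(name: &str) -> String {
    let mut s = String::with_capacity(32);
    s.push('H');                     // one char at the end
    s.push_str("ello");              // a whole slice at the end
    s.extend(", ".chars());          // every item of an iterator
    s.insert(0, '>');                // one char at byte offset 0
    s.insert_str(1, "> ");           // a whole slice at byte offset 1
    write!(s, "{}!", name).unwrap(); // formatted text; cannot fail for String
    s + " Welcome."                  // String + &str consumes the left operand
}
```

Calling build_greeting("Rust") yields ">> Hello, Rust! Welcome.".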

15.3.4 - Removing text

  • string.shrink_to_fit(): releases unused buffer capacity, which is useful after removing text from a string.
  • string.clear(): resets string to the empty string.
  • string.truncate(n): discards all characters after byte offset n, leaving string at most n bytes long.
  • string.pop(): removes the last character from string, if any, and returns it as an Option<char>.
  • string.remove(i): removes the character at byte offset i from string and returns it; the characters after it shift forward.
  • string.drain(range): returns an iterator over the given range of byte indices, and removes those characters once the iterator is dropped.
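
To see how each removal method changes a string, the sketch below applies them in sequence and records the result after every step. The function deletion_steps is made up for this illustration.

```rust
/// Applies drain, pop, remove, truncate, and clear to "chocolate" in turn
/// and returns a snapshot after each step. (Hypothetical example function.)
fn deletion_steps() -> Vec<String> {
    let mut s = String::from("chocolate");
    let mut steps = Vec::new();

    let drained: String = s.drain(3..6).collect(); // takes out bytes 3..6
    steps.push(drained);   // "col"
    steps.push(s.clone()); // "choate"

    s.pop();               // drops the final 'e'
    steps.push(s.clone()); // "choat"

    s.remove(0);           // drops the leading 'c'
    steps.push(s.clone()); // "hoat"

    s.truncate(2);         // keeps only the first 2 bytes
    steps.push(s.clone()); // "ho"

    s.clear();
    steps.push(s.clone()); // ""
    steps
}
```

All offsets here are byte offsets; with non-ASCII text they must fall on character boundaries or the call panics.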

15.3.5 - Conventions for searching and iterating

The standard library functions for searching and iterating over text follow these naming conventions:

  • Most operations process text from left to right.
    • Operations whose names start with r work from right to left; for example, rsplit is the right-to-left counterpart of split.
    • Changing direction can affect not only the order in which values are produced, but also the values themselves.
  • Iterators whose names end with n limit themselves to a given number of matches.
  • Iterators whose names end with _indices produce, together with their usual iteration values, the byte offsets at which those values appear in the slice.
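
The three conventions can be seen side by side in a short sketch. The wrapper names below are invented; the methods they call (splitn, rsplitn, match_indices) are the standard ones.

```rust
/// Splits at the first '.' only (the "n" suffix limits the number of pieces).
fn fields_forward(s: &str) -> Vec<&str> {
    s.splitn(2, '.').collect()
}

/// Same limit, but working from the right (the "r" prefix).
fn fields_backward(s: &str) -> Vec<&str> {
    s.rsplitn(2, '.').collect()
}

/// Every '.' together with its byte offset (the "_indices" suffix).
fn dot_offsets(s: &str) -> Vec<(usize, &str)> {
    s.match_indices('.').collect()
}
```

For "a.b.c", the forward version produces "a" and "b.c" while the backward version produces "c" and "a.b": the direction changes the values themselves, not just their order.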

15.3.6 - Patterns for searching text

  • Patterns:

    • When a standard library function needs to search, match, split, or trim text, it accepts several different types of argument to describe what to look for. These types are called patterns.
    • A pattern can be any type that implements the std::str::pattern::Pattern trait.
  • The standard library supports four main kinds of pattern:

    • A char as a pattern matches that character.

    • A String, &str, or &&str as a pattern matches a substring equal to the pattern.

    • An FnMut(char) -> bool closure as a pattern matches a single character for which the closure returns true.

    • A &[char] as a pattern (a slice of char values) matches any single character that appears in the list.

      let code = "\t function noodle() { ";
      // trim_start_matches is the current name of the deprecated trim_left_matches.
      assert_eq!(code.trim_start_matches(&[' ', '\t'] as &[char]),
          "function noodle() { ");

      • The as operator converts the array literal to a &[char].
      • Without the cast, the expression has type &[char; n], a reference to a fixed-size array. In the Rust version the book covers this is not a pattern type; recent releases also accept char arrays as patterns.
      • &[' ', '\t'] as &[char] can also be written as &[' ', '\t'][..].
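
As a quick comparison of the four kinds of pattern, the sketch below passes one of each to find on the same text. The function name find_all_kinds is invented for this example.

```rust
/// Runs `find` with each of the four main pattern kinds and returns the
/// byte offsets found, in the order: char, &str, closure, &[char].
/// (Hypothetical example function.)
fn find_all_kinds(s: &str) -> [Option<usize>; 4] {
    [
        s.find('o'),                                // a single char
        s.find("noodle"),                           // a substring
        s.find(|c: char| c.is_ascii_punctuation()), // a closure on chars
        s.find(&['(', ')'][..]),                    // any char in a slice
    ]
}
```

Every search function that takes a pattern accepts all four kinds, so the same choice is available for split, trim_matches, replace, and the rest.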

15.3.7 - Searching and replacing

  • slice.contains(pattern): returns true if slice contains a match for pattern.

  • slice.starts_with(pattern) and slice.ends_with(pattern): return true if the text at the start or end of slice matches pattern.

    assert!("2017".starts_with(char::is_numeric));

  • slice.find(pattern) and slice.rfind(pattern): return Some(i) if slice contains a match for pattern, where i is the byte offset at which the match starts, and None otherwise. find returns the first match; rfind returns the last.

  • slice.replace(pattern, replacement): returns a new String in which every match for pattern has been replaced with replacement.

  • slice.replacen(pattern, replacement, n): the same, but replaces at most the first n matches.
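
A short sketch of find, rfind, replace, and replacen working on the same text. The helper names and the "***" replacement text are assumptions made for this example.

```rust
/// Byte offsets of the first and last occurrence of `word`, if any.
/// (Hypothetical helper.)
fn first_and_last(text: &str, word: &str) -> Option<(usize, usize)> {
    Some((text.find(word)?, text.rfind(word)?))
}

/// Replaces every occurrence of `word` with "***".
fn redact(text: &str, word: &str) -> String {
    text.replace(word, "***")
}

/// Replaces only the first occurrence of `word` with "***".
fn redact_first(text: &str, word: &str) -> String {
    text.replacen(word, "***", 1)
}
```

Both replace and replacen leave the original slice untouched and return a newly allocated String.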

15.3.8 - Iterating over text

  • slice.chars(): returns an iterator over the characters of slice.

  • slice.char_indices(): returns an iterator over the characters of slice together with their byte offsets.

    assert_eq!("élan".char_indices().collect::<Vec<_>>(),
        vec![(0, 'é'), (2, 'l'), (3, 'a'), (4, 'n')]);

  • slice.bytes(): returns an iterator over the individual bytes of slice, exposing the UTF-8 encoding.

    assert_eq!("élan".bytes().collect::<Vec<_>>(), vec![195, 169, b'l', b'a', b'n']);

  • slice.lines(): returns an iterator over the lines of slice. Lines are terminated by \n or \r\n. Each item is a &str borrowed from slice, and does not include the terminator.

  • slice.split(pattern): returns an iterator over the portions of slice separated by matches for pattern. Two adjacent matches, or a match at the very start or end of slice, produce an empty string.

  • slice.rsplit(pattern): the same, but scans slice from end to start, producing the portions in reverse order.

  • slice.split_terminator(pattern) and slice.rsplit_terminator(pattern): like split and rsplit, except that pattern is treated as a terminator rather than a separator. If pattern matches at the very end of slice, the iterator does not produce an empty slice for the empty string between that match and the end of slice.

  • slice.splitn(n, pattern) and slice.rsplitn(n, pattern): like split and rsplit, but split the string into at most n slices, at the first or last n-1 matches for pattern.

  • slice.split_whitespace(): returns an iterator over the whitespace-separated portions of slice. A run of consecutive whitespace characters counts as one separator, and leading and trailing whitespace is ignored. Whitespace is defined as in char::is_whitespace.

  • slice.matches(pattern) and slice.rmatches(pattern): return an iterator over the matches for pattern in slice; rmatches produces them from end to start.

  • slice.match_indices(pattern) and slice.rmatch_indices(pattern): the same, except that the items are (offset, match) pairs, where offset is the byte offset at which the match starts and match is the matching slice.
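
The difference between a separator and a terminator is easiest to see on input that ends with the delimiter. A minimal sketch, with wrapper names invented for this example:

```rust
/// Treats ';' as a separator: a trailing ';' yields a final empty field.
fn fields(s: &str) -> Vec<&str> {
    s.split(';').collect()
}

/// Treats ';' as a terminator: a trailing ';' yields no extra field.
fn terminated_fields(s: &str) -> Vec<&str> {
    s.split_terminator(';').collect()
}

/// Splits on runs of whitespace, ignoring leading and trailing whitespace.
fn words(s: &str) -> Vec<&str> {
    s.split_whitespace().collect()
}

/// Splits into lines, accepting both \n and \r\n line endings.
fn text_lines(s: &str) -> Vec<&str> {
    s.lines().collect()
}
```

All four return slices borrowed from the input, so no text is copied.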

15.3.9 - Trimming

  • Trimming a string:
    • Removes text, usually whitespace, from the start and end of the string.
    • Often used to clean up text read from a file, which may be indented or have unexpected whitespace at the end of a line, so that the result is tidier.
  • slice.trim(): returns a subslice of slice without any leading or trailing whitespace.
  • slice.trim_left(): omits only whitespace at the start of the slice. Deprecated in current Rust in favor of slice.trim_start().
  • slice.trim_right(): omits only whitespace at the end of the slice. Deprecated in current Rust in favor of slice.trim_end().
  • slice.trim_matches(pattern): returns a subslice of slice without the matches for pattern at its start and end.
  • slice.trim_left_matches(pattern): trims matches only at the start of the slice. Deprecated in current Rust in favor of slice.trim_start_matches(pattern).
  • slice.trim_right_matches(pattern): trims matches only at the end of the slice. Deprecated in current Rust in favor of slice.trim_end_matches(pattern).
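
A few trimming calls in context, using the current method names. The wrapper functions are hypothetical and exist only to give each call a descriptive name.

```rust
/// Strips whitespace (including a trailing \r\n) from both ends of a line.
fn clean_line(line: &str) -> &str {
    line.trim()
}

/// Strips leading '0' characters only.
fn strip_leading_zeros(digits: &str) -> &str {
    digits.trim_start_matches('0')
}

/// Strips '-' characters from both ends, leaving inner ones alone.
fn strip_dashes(s: &str) -> &str {
    s.trim_matches('-')
}
```

Each function returns a subslice of its argument, so trimming never allocates.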

15.3.10 - String case conversion

  • slice.to_uppercase(): returns a freshly allocated String holding the text of slice converted to uppercase. The result is not necessarily the same length as slice.
  • slice.to_lowercase(): the same, but holding the text of slice converted to lowercase.

15.3.11 - Parsing other types from strings

  • All the common types implement the std::str::FromStr trait, which provides a standard way to parse a value from a string slice.

    pub trait FromStr: Sized {
        type Err;
        fn from_str(s: &str) -> Result<Self, Self::Err>;
    }

  • std::net::IpAddr, an enum holding either an IPv4 or an IPv6 internet address, also implements FromStr. (The ? in these snippets only compiles inside a function that returns a compatible Result.)

    use std::net::IpAddr;
    use std::str::FromStr;  // brings from_str into scope

    let address = IpAddr::from_str("fe80::0000:3ea9:f4ff:fe34:7a50")?;
    assert_eq!(address, IpAddr::from([0xfe80, 0, 0, 0, 0x3ea9, 0xf4ff, 0xfe34, 0x7a50]));

  • String slices have a parse method that parses the slice into any type that implements FromStr. The target type usually has to be spelled out in the call.

    let address = "fe80::0000:3ea9:f4ff:fe34:7a50".parse::<IpAddr>()?;
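
Your own types can take part in parse by implementing FromStr. The sketch below does this for a hypothetical Point type written as "x,y"; the type, its text format, and the use of String as the error type are all assumptions made for illustration.

```rust
use std::str::FromStr;

/// A hypothetical 2D point, parsed from text of the form "x,y".
#[derive(Debug, PartialEq)]
struct Point {
    x: i32,
    y: i32,
}

impl FromStr for Point {
    type Err = String; // a simple error type chosen for this sketch

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        let mut parts = s.splitn(2, ',');
        let x = parts.next().ok_or("missing x")?;
        let y = parts.next().ok_or("missing y")?;
        // Reuse i32's own FromStr implementation for each coordinate.
        let x = x.trim().parse::<i32>().map_err(|e| e.to_string())?;
        let y = y.trim().parse::<i32>().map_err(|e| e.to_string())?;
        Ok(Point { x, y })
    }
}
```

With this in place, "3, -4".parse::<Point>() and Point::from_str("3, -4") are two spellings of the same call.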
    

15.3.12 - Converting other types to strings

  • Types that implement the std::fmt::Display trait, for human-readable printing, can be used with the {} format specifier in the format! macro.

    • Smart pointer types follow their target: if T implements Display, then so do Box<T>, Rc<T>, and Arc<T>, and their displayed form is simply that of the value they point to.
    • Containers such as Vec and HashMap do not implement Display.
  • If a type implements Display, the standard library automatically implements the std::string::ToString trait for it:

    • The trait's only method is to_string.
    • For your own types, prefer implementing Display over implementing ToString directly.
  • The standard library's common types implement the std::fmt::Debug trait:

    • It takes a value and formats it as a string intended for debugging.

    • The string Debug produces can be printed with the {:?} format specifier in format! and related macros.

    • Your own types can implement Debug too, and deriving it is the recommended way:

      #[derive(Copy, Clone, Debug)]
      struct Complex {
          r: f64,
          i: f64
      }
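
Display, unlike Debug, cannot be derived, so it has to be written by hand. Here is a minimal sketch for the Complex type; the "r + bi" output format is a choice made for this example, not something the notes prescribe.

```rust
use std::fmt;

#[derive(Copy, Clone, Debug)]
struct Complex {
    r: f64,
    i: f64,
}

impl fmt::Display for Complex {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print the sign separately so negative imaginary parts read "a - bi".
        let sign = if self.i < 0.0 { '-' } else { '+' };
        write!(f, "{} {} {}i", self.r, sign, self.i.abs())
    }
}
```

Implementing Display also provides to_string for free, through the blanket ToString implementation.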
      

15.3.13 - Borrowing as other text types: borrowing from slices

  • Slices and String implement AsRef<str>, AsRef<[u8]>, AsRef<Path>, and AsRef<OsStr>. Many standard library functions use these traits as bounds on their parameter types, so you can pass a slice or a String to them directly, even when what they actually need is some other type.
  • Slices and String also implement the std::borrow::Borrow<str> trait. HashMap and BTreeMap use Borrow so that String works well as a key type.
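
Both traits can be sketched with two small functions: one that accepts anything path-like through an AsRef<Path> bound, and one that looks up a String-keyed map with a plain &str. The function names are hypothetical.

```rust
use std::collections::HashMap;
use std::path::Path;

/// Accepts &str, String, &Path, and so on, thanks to the AsRef<Path> bound.
/// Returns the file extension, if there is one. (Hypothetical helper.)
fn extension_of<P: AsRef<Path>>(p: P) -> Option<String> {
    p.as_ref()
        .extension()
        .and_then(|e| e.to_str())
        .map(|e| e.to_string())
}

/// Looks up a String-keyed map with a &str; this works because
/// String implements Borrow<str>, so no temporary String is needed.
fn lookup(map: &HashMap<String, u32>, key: &str) -> Option<u32> {
    map.get(key).copied()
}
```

Without Borrow, every lookup would have to allocate a String just to compare it against the stored keys.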

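15.3.14 - Accessing text as UTF-8 (the bytes that represent the text)

  • slice.as_bytes(): borrows the bytes of slice as a &[u8]. Because this is not a mutable reference, slice can rely on its bytes remaining well-formed UTF-8.
  • string.into_bytes(): takes ownership of string and returns its bytes by value as a Vec<u8>. Because string no longer exists afterwards, the bytes no longer have to remain well-formed UTF-8, and the caller may modify the Vec<u8> freely.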

15.3.15 - Producing text from UTF-8 data

  • str::from_utf8(byte_slice): takes a &[u8] slice of bytes and returns a Result: Ok(&str) if byte_slice contains well-formed UTF-8, and an error otherwise.

  • String::from_utf8(vec): tries to construct a String from a Vec<u8> passed by value.

    • If vec holds well-formed UTF-8, from_utf8 returns Ok(string), where string has taken ownership of vec and uses it as its buffer.

    • If the bytes are not well-formed UTF-8, it returns Err(e), where e is a FromUtf8Error value. Calling e.into_bytes() gives back the original vector vec, so a failed conversion does not lose the original value.

      let good_utf8: Vec<u8> = vec![0xe9, 0x8c, 0x86];
      assert!(String::from_utf8(good_utf8).is_ok());  // Success

      let bad_utf8: Vec<u8> = vec![0x9f, 0xf0, 0xa6, 0x80];
      let result = String::from_utf8(bad_utf8);  // Failure
      assert!(result.is_err());
      assert_eq!(result.unwrap_err().into_bytes(),
          vec![0x9f, 0xf0, 0xa6, 0x80]);

  • String::from_utf8_lossy(byte_slice): constructs a String or a &str from a shared &[u8] slice of bytes. The conversion always succeeds, replacing any ill-formed UTF-8 with the Unicode replacement character; the result is a Cow<str> that borrows from byte_slice when no replacement was needed.

  • String::from_utf8_unchecked: wraps a Vec<u8> up as a String and returns it, without checking it. The caller must guarantee that the bytes are well-formed UTF-8, so it can only be called in an unsafe block.

  • str::from_utf8_unchecked: takes a &[u8] and returns it as a &str, likewise without checking that the bytes are well-formed UTF-8. It too can only be called in an unsafe block.
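
The lossy conversion is the one to reach for when input may be damaged but must still be shown. A minimal sketch, with decode and was_copied as made-up helper names:

```rust
use std::borrow::Cow;

/// Decodes bytes as UTF-8, substituting U+FFFD for anything ill-formed.
/// (Hypothetical wrapper around String::from_utf8_lossy.)
fn decode(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// True if decoding had to allocate a new String, which happens
/// only when a replacement was actually needed.
fn was_copied(text: &Cow<'_, str>) -> bool {
    matches!(text, Cow::Owned(_))
}
```

Well-formed input is returned as a borrowed &str with no allocation, so the lossy path costs nothing in the common case.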

15.3.16 - Putting off allocation

    fn get_name() -> String {
        std::env::var("USER").unwrap_or("whoever you are".to_string())
    }
    println!("Greetings, {}!", get_name());

  • The example above greets the user by name. It works on Unix, but on Windows the user name is stored in the USERNAME variable instead, so the lookup fails and the fallback text is printed.

  • std::env::var returns an owned String on success, while the fallback is a string literal. get_name therefore sometimes needs to return an owned String and sometimes a &'static str.

  • std::borrow::Cow ("clone on write") is designed for this: it can hold either owned data or borrowed data.

    use std::borrow::Cow;

    fn get_name() -> Cow<'static, str> {
        std::env::var("USER")
            .map(|v| Cow::Owned(v))
            .unwrap_or(Cow::Borrowed("whoever you are"))
    }
    println!("Greetings, {}!", get_name());

    • If reading the USER environment variable succeeds, map returns the resulting String as a Cow::Owned.
    • If it fails, unwrap_or returns the static &str as a Cow::Borrowed.
    • As long as T implements the std::fmt::Display trait, displaying a Cow<'a, T> produces the same result as displaying a T.
  • std::borrow::Cow is also useful when you may or may not need to modify text that you have borrowed.

    • As long as no modification is needed, you can keep borrowing it.

    • Cow's to_mut method makes sure the Cow is a Cow::Owned, applying the value's ToOwned implementation if necessary, and then returns a mutable reference to the value.

      fn get_title() -> Option<&'static str> { ... }

      let mut name = get_name();
      if let Some(title) = get_title() {
          name.to_mut().push_str(", ");
          name.to_mut().push_str(title);
      }
      println!("Greetings, {}!", name);

    • This way, memory is allocated only when it is actually needed.

  • The standard library gives Cow<'a, str> some special support for strings. It provides From and Into conversions from both String and &str, so get_name above can be written more tersely:

    fn get_name() -> Cow<'static, str> {
        std::env::var("USER")
            .map(|v| v.into())
            .unwrap_or("whoever you are".into())
    }

  • Cow<'a, str> also implements std::ops::Add and std::ops::AddAssign for strings, so the get_title() check can be shortened to:

    if let Some(title) = get_title() {
        name += ", ";
        name += title;
    }

  • Because a String can be the target of the write! macro, the code above is also equivalent to:

    use std::fmt::Write;

    if let Some(title) = get_title() {
        write!(name.to_mut(), ", {}", title).unwrap();
    }

  • Not every Cow<..., str> has to have the 'static lifetime: a Cow can keep borrowing previously computed text right up to the moment a copy becomes necessary.
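
As an example of a Cow whose lifetime is not 'static, the sketch below borrows from its argument and allocates only when the text really has to change. The function normalize_tabs and its four-space replacement are assumptions made for this illustration.

```rust
use std::borrow::Cow;

/// Replaces tabs with four spaces, but returns the input untouched
/// (and unallocated) when it contains no tabs. (Hypothetical helper.)
fn normalize_tabs(input: &str) -> Cow<'_, str> {
    if input.contains('\t') {
        Cow::Owned(input.replace('\t', "    "))
    } else {
        Cow::Borrowed(input)
    }
}
```

Callers can treat the result as a &str either way, and only pay for a copy on the inputs that needed one.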

15.3.17 - Strings as generic collections

  • String implements both std::default::Default and std::iter::Extend:

    • default returns an empty string.
    • extend appends characters, string slices, or strings to the end of a string.
  • The &str type implements Default as well:

    • It returns an empty slice.
    • This is handy in some corner cases, for example when deriving Default for a struct that contains string slices.
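
Both traits can be seen in a short sketch: a struct that derives Default even though it holds a string slice, and a generic function that fills a String through Extend. The Label type and collect_words function are invented for this example.

```rust
/// Deriving Default works here because both &str and String implement it.
/// (Hypothetical type for illustration.)
#[derive(Debug, Default, PartialEq)]
struct Label<'a> {
    prefix: &'a str,
    text: String,
}

/// Concatenates string slices using String's Default and Extend impls.
fn collect_words<'a, I: IntoIterator<Item = &'a str>>(words: I) -> String {
    let mut out = String::default(); // same as String::new()
    out.extend(words);
    out
}
```

Generic code written against Default and Extend can therefore build a String the same way it would build a Vec or a HashMap.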

See 《Rust Programming 》( Jim - Brandy 、 Jason, - By orendov , Translated by lisongfeng ) Chapter 17