Basic use of the text preprocessing library spaCy (quick start)
2022-06-29 15:00:00 【iioSnail】
A brief introduction to spaCy
spaCy (official website, GitHub link) is a Python library for text preprocessing in the NLP domain. It supports tokenization, part-of-speech tagging (POS tagging), dependency parsing, lemmatization, sentence boundary detection (SBD), named entity recognition (NER), and more. See the linked pages for the full list of supported features.
Features of spaCy:
- Supports a variety of languages
- Uses deep learning models for tasks such as tokenization
- Most languages come with models of different sizes, so you can pick one according to your actual needs. Models can be looked up via the website link or the GitHub link.
- Simple and easy to use
Installing spaCy
pip install spacy -i https://pypi.tuna.tsinghua.edu.cn/simple
To install the GPU version, refer to the official documentation.
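As a hedged example only (the extra name must match your CUDA version, so treat the option below as an assumption and verify it against the official install instructions for your spaCy release):
# assumption: CUDA 11.3 is installed; replace cuda113 with the extra matching your setup (newer releases also document cuda-autodetect)
pip install -U 'spacy[cuda113]' -i https://pypi.tuna.tsinghua.edu.cn/simple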
Basic usage of spaCy
For almost all tasks, spaCy follows the same four steps:
- Download the model
- Load the model
- Process the sentence
- Get the results
As an example, here is how to tokenize English text with spaCy:
1. First, download the model with the following command:
python -m spacy download en_core_web_sm
en_core_web_sm is the name of the model; you can search for available models at this link.
If downloads are slow (for example, from within mainland China), you can search for the model on GitHub instead and install it manually with pip install some_model.whl.
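A minimal sketch of the manual route, assuming the wheel is downloaded from the explosion/spacy-models releases page (the file name and version below are illustrative; pick the release that matches your spaCy version):
# 1. Download the .whl from https://github.com/explosion/spacy-models/releases
# 2. Install the downloaded file locally (file name shown here is illustrative)
pip install ./en_core_web_sm-3.4.0-py3-none-any.whl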
2. Load the model, run it on a sentence, and get the results
import spacy  # import the spaCy package
# Load the model
nlp = spacy.load("en_core_web_sm")
# Run the model by simply passing in a sentence
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")
# Get the tokenization result
print([token.text for token in doc])
The output is:
['Apple', 'is', 'looking', 'at', 'buying', 'U.K.', 'startup', 'for', '$', '1', 'billion']
Several important classes in spaCy
The previous section involved several key objects:
- nlp: an object of the spacy.Language class (official documentation link). spacy.load returns an object of this class, and calling nlp("...") is essentially calling Language.__call__.
- doc: an object of spacy.tokens.Doc (official documentation link). It holds the results of tokenization, part-of-speech tagging, lemmatization, and so on (see the link for details). doc is an iterable object.
- token: an object of spacy.tokens.token.Token (official documentation link). Through this object you can access the attributes of each word (its text, part of speech, etc.); see the link for details.
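A minimal sketch showing how these objects are used together (the attributes pos_, lemma_, ents, and label_ are part of spaCy's documented Token/Doc/Span API):
import spacy

nlp = spacy.load("en_core_web_sm")  # nlp is a spacy.Language object
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")  # doc is a spacy.tokens.Doc

# Each token exposes its text, part of speech, lemma, and so on
for token in doc:
    print(token.text, token.pos_, token.lemma_)

# The doc also exposes results of later pipeline components, e.g. named entities
for ent in doc.ents:
    print(ent.text, ent.label_)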
spaCy's processing pipeline (Processing Pipeline)
[Figure: the spaCy processing pipeline]
Calling nlp(...) runs the components in the order shown in the figure above (tokenization first, then part-of-speech tagging, and so on). Components you do not need can be excluded when loading the model:
nlp = spacy.load("en_core_web_sm", exclude=["ner"])
Or disabled (a disabled component is loaded but will not run):
nlp = spacy.load("en_core_web_sm", disable=["tagger", "parser"])
A disabled component can be enabled again later, when you actually need it:
nlp.enable_pipe("tagger")
The full list of built-in components can be found at the link.
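A short sketch of inspecting and toggling the pipeline at runtime, assuming spaCy 3.x (pipe_names and select_pipes are documented Language APIs):
import spacy

nlp = spacy.load("en_core_web_sm")
print(nlp.pipe_names)  # components currently enabled, in execution order

# Temporarily disable some components for a block of code
with nlp.select_pipes(disable=["parser", "ner"]):
    doc = nlp("Apple is looking at buying U.K. startup for $1 billion")

print(nlp.pipe_names)  # the disabled components are active again outside the with-block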
Hands-on example: Chinese word segmentation and word embeddings
1. First, find a suitable model in the Chinese section of the official documentation. Here we choose the smallest one.

2. Download the model
python -m spacy download zh_core_web_sm
3. Write the code
import spacy  # import the spaCy package
# Load the model and exclude the components we do not need
nlp = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
# Process the sentence
doc = nlp("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。它研究能实现人与计算机之间用自然语言进行有效通信的各种理论和方法。")
# Loop over the doc to get the vector of each token
for token in doc:
    # Only the first 5 dimensions are printed for readability; the model actually encodes each Chinese token as a 96-dimensional vector
    print(token.text, token.tensor[:5])
The official model includes a tok2vec component, which means it can produce embeddings directly, which is very convenient. The output is:
自然 [-0.16925007 -0.8783153 -1.4360809 0.14205566 -0.76843846]
语言 [ 0.4438781 -0.82981354 -0.8556605 -0.84820974 -1.0326502 ]
处理 [-0.16880168 -0.24469137 0.05714838 -0.8260342 -0.50666815]
是 [ 0.07762825 0.8785285 2.1840482 1.688557 -0.68410844]
... // omitted
和 [ 0.6057179 1.4358768 2.142096 -2.1428592 -1.5056412]
方法 [ 0.5175674 -0.57559186 -0.13569726 -0.5193214 2.6756258 ]
。 [-0.40098143 -0.11951387 -0.12609476 -1.9219975 0.7838618 ]
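As an extra sketch not in the original post: since token.tensor is an ordinary numpy-compatible vector, you can compare tokens directly, for example with cosine similarity (the cosine helper below is a hypothetical addition for illustration):
import numpy as np
import spacy

nlp = spacy.load("zh_core_web_sm", exclude=("tagger", "parser", "senter", "attribute_ruler", "ner"))
doc = nlp("自然语言处理是计算机科学领域与人工智能领域中的一个重要方向。")

def cosine(u, v):
    # plain cosine similarity between two vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

# compare the 96-dimensional tensors of the first two tokens
print(doc[0].text, doc[1].text, cosine(doc[0].tensor, doc[1].tensor))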