Machine learning -- handwritten English alphabet 1 -- classification process
2022-07-28 10:35:00 【Cute me】
1. Import data
Handwritten letters are stored as individual text files. Each file is comma-separated and contains four columns: a timestamp, the horizontal position of the pen, the vertical position of the pen, and the pen pressure. The timestamp is the number of milliseconds elapsed since the start of data collection. The other variables are expressed in normalized units (0 to 1). For pen position, 0 represents the lower and left edges of the writing surface, and 1 represents the top and right edges.
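As an illustration, the first few rows of a file such as J.txt might look like the following. The Time, X, and Y column names match the variables used in the code below; the name of the pressure column and all numeric values here are made up for illustration only.

```
Time,X,Y,Pressure
0,0.51,0.86,0.30
8,0.51,0.85,0.41
17,0.50,0.83,0.52
```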
letter = readtable("J.txt");
plot(letter.X,letter.Y)
axis equal
letter = readtable("M.txt");
plot(letter.X,letter.Y)
axis equal

2. Data processing
The pen positions in the handwriting data are measured in normalized units (0 to 1). However, the tablet used to record the data is not square: a vertical distance of 1 corresponds to 10 inches, while the same horizontal distance corresponds to 15 inches. To correct for this, the horizontal unit should be rescaled to the range [0 1.5] instead of [0 1].
letter = readtable("M.txt")
letter.X = 1.5*letter.X;
plot(letter.X,letter.Y)
axis equal

The raw time values have no physical meaning: they represent the number of milliseconds elapsed since the beginning of the data-collection session, which makes it difficult to interpret handwriting patterns over time. A more useful time variable is the elapsed time from the start of each letter, in seconds.
letter.Time = letter.Time - letter.Time(1)
letter.Time = letter.Time/1000
plot(letter.Time,letter.X)
plot(letter.Time,letter.Y)

3. Feature calculation
What aspects of these letters could be used to distinguish a J from an M or a V? The goal is not to use the raw signals themselves, but to compute values that distill each signal into simple, useful units of information, called features.
For the letters J and M, a simple feature might be the aspect ratio (the height of a letter relative to its width). A J is likely to be tall and narrow, whereas an M may be closer to square.
Compared with J and M, a V is written quickly, so the duration of the signal may also be a distinguishing feature.

letter = readtable("M.txt");
letter.X = letter.X*1.5;
letter.Time = (letter.Time - letter.Time(1))/1000
plot(letter.X,letter.Y)
axis equal
% Repeat the preprocessing from the previous sections
dur = letter.Time(end)
aratio = range(letter.Y)/range(letter.X)
4. Feature extraction
The MAT-file featuredata.mat contains a table of features extracted from 470 letters, written by many different people. The table features has three variables: AspectRatio and Duration (the two features computed in the previous section) and Character (the known letter).
load featuredata.mat
features
scatter(features.AspectRatio,features.Duration)
It is not obvious whether these features are sufficient to distinguish the three letters in the data set (J, M, and V). The gscatter function produces a grouped scatter plot, that is, a scatter plot in which the points are colored according to a grouping variable.
gscatter(features.AspectRatio,features.Duration,features.Character)
5. Build a model and make predictions

load featuredata.mat
features
testdata
knnmodel = fitcknn(features,"Character")
Once the model has been built from the data, it can be used to classify new observations. This requires only computing the features of the new observations and determining where they lie in the prediction space.
predictions = predict(knnmodel,testdata)
By default, fitcknn fits a kNN model with k = 1. That is, the model uses only the single closest known example to classify a given observation. This makes the model sensitive to any outliers in the training data, such as the outliers highlighted in the figure above. New observations that fall near an outlier are likely to be misclassified. A simple way to address this problem is to increase k, that is, to classify using the most common class among several neighbors.
knnmodel = fitcknn(features,"Character","NumNeighbors",5)
predictions = predict(knnmodel,testdata)
6. Evaluate a model
How good is the kNN model? The testdata table contains the known classes of the test observations. By comparing the known classes with the kNN model's predictions, you can see how well the model performs on new data.
load featuredata.mat
testdata
knnmodel = fitcknn(features,"Character","NumNeighbors",5);
predictions = predict(knnmodel,testdata)
iscorrect = predictions == testdata.Character
Calculate the proportion of correct predictions by dividing the number of correct predictions by the total number of predictions. Store the result in a variable named accuracy. You can use the sum function to count the correct predictions and the numel function to determine the total number of predictions.
accuracy = sum(iscorrect)/numel(predictions)
The misclassification rate is calculated in the same way:
iswrong = predictions ~= testdata.Character
misclassrate = sum(iswrong)/numel(predictions)
Accuracy and misclassification rate each summarize the model's overall performance in a single value, but a more detailed breakdown of which classes the model confuses can be more informative. A confusion matrix shows the number of observations for each combination of true class and predicted class.

Confusion matrices are usually visualized by coloring the elements according to their values. Typically, the diagonal elements (correct classifications) are shown in one color and the off-diagonal elements (misclassifications) in another. You can use the confusionchart function to visualize a confusion matrix.
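Continuing with the variables computed in the previous section (testdata and predictions), the confusion matrix for the three-letter test set can be displayed with:

```matlab
% Visualize true vs. predicted classes for the test set
confusionchart(testdata.Character,predictions);
```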

7. Review
You now have a simple two-feature model that handles three specific letters (J, M, and V) quite well. Does this approach also work for the entire alphabet? In this exercise, you will build and test the same kNN model, but with 13 letters (half of the English alphabet).
load featuredata13letters.mat
features
testdata
gscatter(features.AspectRatio,features.Duration,features.Character)
xlim([0 10])
knnmodel = fitcknn(features,"Character","NumNeighbors",5);
predictions = predict(knnmodel,testdata);
misclass = sum(predictions ~= testdata.Character)/numel(predictions)
confusionchart(testdata.Character,predictions);
