Thesis study - 7 Very Deep Convolutional Networks for Large-Scale Image Recognition (3/3)
2022-07-03 19:06:00 【Aki Unwzii】
In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth. In this section, we turn to the localisation task of the challenge, which we have won in 2014 with 25.3% error. It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class. For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications. Our method is described in Sect. A.1 and evaluated in Sect. A.2.
In the main body of the paper we evaluated ConvNets of different depths on the ILSVRC classification task. This part turns to the localisation task of the challenge, which we won in 2014 with a 25.3% error rate. Localisation can be viewed as a special case of object detection: one bounding box is predicted for each of the top-5 classes, regardless of how many objects of that class are actually present. We follow the approach of Sermanet et al. (2014), winners of the ILSVRC-2013 localisation challenge, with a few modifications. The method is described in Sect. A.1 and evaluated in Sect. A.2.
To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores. A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset). Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).
To perform object localisation, we use a very deep ConvNet whose last fully connected layer predicts the bounding box location rather than class scores. A bounding box is a 4-D vector holding the centre coordinates (x, y), the width (w), and the height (h). The box prediction can either be shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or be class-specific (per-class regression, PCR). In the former case the last layer is 4-D; in the latter it is 4000-D, since the dataset has 1000 classes. Apart from this last layer, we use ConvNet architecture D (Table 1), which has 16 weight layers and performed best on the classification task (Sect. 4).
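The difference between the 4-D (SCR) and 4000-D (PCR) output can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `select_box` and the class-major memory layout of the 4000-D vector are assumptions made for the example.

```python
import numpy as np

NUM_CLASSES = 1000  # number of ILSVRC classes, as stated in the text
BOX_DIMS = 4        # (center_x, center_y, width, height)


def select_box(last_layer_output, class_index=None):
    """Return the 4-D box from an SCR (4-D) or PCR (4000-D) output vector.

    Assumption: a PCR output stores the 4 box values of class 0 first,
    then those of class 1, and so on (class-major layout).
    """
    out = np.asarray(last_layer_output, dtype=float)
    if out.size == BOX_DIMS:
        # SCR: one box shared by all classes, the class index is ignored.
        return out.reshape(BOX_DIMS)
    if out.size == NUM_CLASSES * BOX_DIMS:
        # PCR: one box per class, so a class index is required.
        if class_index is None:
            raise ValueError("PCR output needs a class index")
        return out.reshape(NUM_CLASSES, BOX_DIMS)[class_index]
    raise ValueError("expected a 4-D or 4000-D output vector")


scr_box = select_box([0.5, 0.5, 0.2, 0.3])
pcr_box = select_box(np.arange(NUM_CLASSES * BOX_DIMS), class_index=2)
```

With PCR, the class index would come from the ground-truth label (first test protocol) or from the classification network (second protocol).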
Training. Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1). The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth. We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission). Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to $10^{-3}$. We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was initialised randomly and trained from scratch.
Training. Localisation ConvNets are trained in much the same way as the classification ConvNets (Sect. 3.1). The main difference is that the logistic regression objective is replaced with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground truth. We trained two localisation models, each on a single scale: S = 256 and S = 384 (because of time constraints, scale jittering was not used for the ILSVRC-2014 submission). Training was initialised with the corresponding classification models (trained at the same scales), with an initial learning rate of $10^{-3}$. Following Sermanet et al. (2014), we tried both fine-tuning all layers and fine-tuning only the first two fully connected layers. The last fully connected layer was initialised randomly and trained from scratch.
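A Euclidean loss on bounding box parameters can be sketched as below. The paper does not spell out the scaling, so the factor 1/2 and the averaging over the batch are assumptions (they follow a common convention); the function name `euclidean_loss` is hypothetical.

```python
import numpy as np


def euclidean_loss(pred_boxes, true_boxes):
    """Half the mean squared L2 distance between predicted and true boxes.

    Both inputs have shape (batch, 4): (center_x, center_y, width, height).
    The 0.5 factor and the batch mean are assumed conventions.
    """
    pred = np.asarray(pred_boxes, dtype=float)
    true = np.asarray(true_boxes, dtype=float)
    squared_dist = np.sum((pred - true) ** 2, axis=1)  # one value per box
    return 0.5 * float(np.mean(squared_dist))


loss = euclidean_loss([[0.0, 0.0, 2.0, 2.0]], [[0.0, 0.0, 0.0, 0.0]])
```

Unlike a classification loss, this objective is zero only when all four box parameters match the ground truth exactly.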
Testing. We consider two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image.
Testing. We consider two testing protocols. The first compares different network modifications on the validation set and considers only the bounding box predicted for the ground-truth class, which factors out classification errors. The bounding box is obtained by applying the network to the central crop of the image only.
The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2). The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions. To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
The second, fully fledged, testing procedure applies the localisation ConvNet densely over the whole image, as in the classification task (Sect. 3.2). The difference is that the last fully connected layer outputs a set of bounding box predictions instead of a class score map. To obtain the final prediction, we use the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions by averaging their coordinates and then rates them using the class scores from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
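The idea of greedily merging spatially close boxes by averaging their coordinates can be illustrated with the simplified sketch below. It is not the exact procedure of Sermanet et al. (2014): the closeness measure (distance between box centres), the stopping threshold `max_center_dist`, and the function name are assumptions made for illustration, and the later rating by class scores is left out.

```python
import numpy as np


def greedy_merge(boxes, max_center_dist):
    """Repeatedly average the two closest boxes until none are close.

    boxes: iterable of (center_x, center_y, width, height).
    Closeness is the Euclidean distance between centres (an assumption).
    """
    boxes = [np.asarray(b, dtype=float) for b in boxes]
    while len(boxes) > 1:
        # Find the pair of boxes with the smallest centre distance.
        best = None
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                d = np.linalg.norm(boxes[i][:2] - boxes[j][:2])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > max_center_dist:
            break  # no remaining pair is close enough to merge
        merged = (boxes[i] + boxes[j]) / 2.0  # average the coordinates
        boxes = [b for k, b in enumerate(boxes) if k not in (i, j)]
        boxes.append(merged)
    return boxes


merged = greedy_merge(
    [(10, 10, 4, 4), (12, 10, 4, 4), (100, 100, 8, 8)], max_center_dist=5.0
)
```

In this toy input the two nearby boxes collapse into one centred at (11, 10), while the distant box is left untouched.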
In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol). The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.
In this section we first determine the best-performing localisation setting (using the first test protocol) and then evaluate it in the fully fledged scenario (the second protocol). Localisation error is measured by the ILSVRC criterion (Russakovsky et al., 2014): a bounding box prediction is counted as correct if its intersection-over-union ratio with the ground-truth bounding box is above 0.5.
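The intersection-over-union test can be written out as a short sketch. It assumes boxes in the (centre x, centre y, width, height) form used above and continuous coordinates; the helper names are hypothetical, and any further details of the official ILSVRC evaluation are not reproduced here.

```python
def to_corners(box):
    """Convert (center_x, center_y, width, height) to (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    return cx - w / 2.0, cy - h / 2.0, cx + w / 2.0, cy + h / 2.0


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    ax0, ay0, ax1, ay1 = to_corners(box_a)
    bx0, by0, bx1, by1 = to_corners(box_b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0


def is_correct(pred_box, true_box, threshold=0.5):
    """A prediction counts as correct if its IoU is above the threshold."""
    return iou(pred_box, true_box) > threshold


half_shifted = iou((5, 5, 10, 10), (10, 5, 10, 10))  # overlap is half a box
```

Two equal boxes shifted by half their width share an intersection of 50 and a union of 150, so their IoU is 1/3 and the prediction would be counted as an error.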
Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)). In these experiments, the smallest image side was set to S = 384; the results with S = 256 exhibit the same behaviour and are not shown for brevity.
Settings comparison. As Table 8 shows, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), unlike in Sermanet et al. (2014), where SCR outperformed PCR. Fine-tuning all layers for the localisation task also gives noticeably better results than fine-tuning only the fully connected layers (as done in Sermanet et al., 2014). In these experiments the smallest image side was set to S = 384; the results with S = 256 show the same behaviour and are omitted for brevity.
Table 8: Localisation error for different modifications with the simplified testing protocol: the bounding box is predicted from a single central image crop, and the ground-truth class is used. All ConvNet layers (except for the last one) have the configuration D (Table 1), while the last layer performs either single-class regression (SCR) or per-class regression (PCR).
Table 8: Localisation error of the different modifications under the simplified testing protocol, in which the bounding box is predicted from a single central image crop and the ground-truth class is used. All ConvNet layers except the last have configuration D (Table 1); the last layer performs either single-class regression (SCR) or per-class regression (PCR).
Fully-fledged evaluation. Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014). As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth. Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks further improves the performance.
Fully fledged evaluation. With the best localisation setting fixed (PCR, fine-tuning of all layers), we apply it in the fully fledged scenario: the top-5 class labels are predicted by our best-performing classification system (Sect. 4.5), and multiple densely computed bounding box predictions are merged with the method of Sermanet et al. (2014). As Table 9 shows, applying the localisation ConvNet to the whole image substantially improves on the centre-crop results (Table 8), even though the top-5 predicted class labels are used instead of the ground truth. As in the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks improves performance further.
Comparison with the state of the art. We compare our best localisation result with the state of the art in Table 10. With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique. We envisage that better localisation performance can be achieved if this technique is incorporated into our method. This indicates the performance advancement brought by our very deep ConvNets: we got better results with a simpler localisation method, but a more powerful representation.
Comparison with the state of the art. Table 10 compares our best localisation result with the state of the art. With a 25.3% test error, our “VGG” team won the ILSVRC-2014 localisation challenge (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique. We expect that incorporating this technique into our method would improve localisation further. This shows the gain brought by very deep ConvNets: a simpler localisation method combined with a more powerful representation gives better results.
Table 10: Comparison with the state of the art in ILSVRC localisation. Our method is denoted as “VGG”.
In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods. In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4): configurations “Net-D” and “Net-E” (which we made publicly available).
In the previous sections we discussed training and evaluating very deep ConvNets on the ILSVRC dataset. Here we evaluate our ILSVRC-pre-trained ConvNets as feature extractors on other, smaller datasets, where training large models from scratch is infeasible because of over-fitting. This use case has attracted a lot of interest recently (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), because deep image representations learnt on ILSVRC turn out to generalise well to other datasets, where they outperform hand-crafted representations by a large margin. Following that line of work, we investigate whether our models outperform the shallower models used in state-of-the-art methods. We consider the two models with the best ILSVRC classification performance (Sect. 4): configurations “Net-D” and “Net-E”, which we made publicly available.
To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
To use the ILSVRC-pre-trained ConvNets for image classification on other datasets, we remove the last fully connected layer (which performs 1000-way ILSVRC classification) and use the 4096-D activations of the penultimate layer as image features, aggregated across multiple locations and scales. The resulting image descriptor is L2-normalised and combined with a linear SVM classifier trained on the target dataset. For simplicity, the pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).
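The L2 normalisation step can be sketched as follows. The function name `l2_normalise` and the small `eps` guard against an all-zero descriptor are assumptions added for the example; the linear SVM itself is not shown.

```python
import numpy as np


def l2_normalise(descriptor, eps=1e-12):
    """Scale a descriptor to unit Euclidean length.

    eps avoids division by zero for an all-zero descriptor (an added
    safeguard, not something the text specifies).
    """
    v = np.asarray(descriptor, dtype=float)
    return v / max(np.linalg.norm(v), eps)


unit = l2_normalise([3.0, 4.0])  # length-5 vector scaled to length 1
```

After this step every image descriptor has the same length, so a linear classifier trained on them compares directions in feature space rather than activation magnitudes.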
Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals Q, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image. As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales Q. The resulting multi-scale features can be either stacked or pooled across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality. We return to the discussion of this design choice in the experiments below. We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.
Features are aggregated in a similar way to our ILSVRC evaluation procedure (Sect. 3.2). An image is first rescaled so that its smallest side equals Q, and the network is then applied densely over the image plane (which is possible when all weight layers are treated as convolutional). Global average pooling over the resulting feature map yields a 4096-D image descriptor, which is then averaged with the descriptor of the horizontally flipped image. As shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features at several scales Q. The resulting multi-scale features can be either stacked or pooled across scales. Stacking lets the subsequent classifier learn how to combine image statistics over a range of scales optimally, at the cost of a higher descriptor dimensionality. We return to this design choice in the experiments below. We also assess late fusion of features computed with two networks, performed by stacking their respective image descriptors.
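The aggregation steps described here (global average pooling, averaging with the flipped image, then stacking or averaging across scales) can be sketched with NumPy. The function names and the (height, width, channels) feature-map layout are assumptions, and small toy arrays stand in for real 4096-channel feature maps.

```python
import numpy as np


def global_average_pool(feature_map):
    """Average a (height, width, channels) map over its spatial positions."""
    return np.asarray(feature_map, dtype=float).mean(axis=(0, 1))


def image_descriptor(feature_map, flipped_feature_map):
    """Average the pooled descriptors of an image and its horizontal flip."""
    return 0.5 * (global_average_pool(feature_map)
                  + global_average_pool(flipped_feature_map))


def combine_scales(descriptors, mode="stack"):
    """Combine per-scale descriptors by stacking or by averaging."""
    descs = [np.asarray(d, dtype=float) for d in descriptors]
    if mode == "stack":
        return np.concatenate(descs)   # dimensionality grows with the scales
    if mode == "average":
        return np.mean(descs, axis=0)  # dimensionality stays the same
    raise ValueError("mode must be 'stack' or 'average'")


# Toy feature maps with 4 channels; different scales give different map sizes.
small = image_descriptor(np.ones((2, 3, 4)), np.ones((2, 3, 4)))
large = image_descriptor(3 * np.ones((1, 2, 4)), 3 * np.ones((1, 2, 4)))
stacked = combine_scales([small, large], mode="stack")
averaged = combine_scales([small, large], mode="average")
```

Because the pooling removes the spatial dimensions, maps of different sizes (from different scales Q) produce descriptors of the same length, which is what makes both stacking and averaging possible.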
Table 11: Comparison with the state of the art in image classification on VOC-2007, VOC-2012, Caltech-101, and Caltech-256. Our models are denoted as “VGG”. Results marked with * were achieved using ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
**Table 11: Image classification results on VOC-2007, VOC-2012, Caltech-101, and Caltech-256 compared with the state of the art.** Our models are denoted “VGG”. Results marked with * were obtained with ConvNets pre-trained on the extended ILSVRC dataset (2000 classes).
Image Classification on VOC-2007 and VOC-2012. We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes.
Image Classification on VOC-2007 and VOC-2012. We start with the image classification task of the PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels from 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the VOC-2012 test data is not public; an official evaluation server is provided instead). Recognition performance is measured by mean average precision (mAP) across classes.
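Mean average precision can be illustrated with the sketch below. It uses a plain, non-interpolated average precision (the mean of the precision values at each positive in the ranked list); the official VOC evaluation code uses its own interpolation rules, which are not reproduced here, and the function names are hypothetical.

```python
import numpy as np


def average_precision(scores, labels):
    """Non-interpolated AP for one class: mean precision at each positive."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels).astype(bool)
    if not labels.any():
        return 0.0
    order = np.argsort(-scores, kind="stable")  # highest score first
    ranked = labels[order]
    hits = np.cumsum(ranked)                    # positives seen so far
    precision = hits / np.arange(1, len(ranked) + 1)
    return float(precision[ranked].mean())


def mean_average_precision(score_matrix, label_matrix):
    """Average the per-class AP over the columns (classes)."""
    score_matrix = np.asarray(score_matrix, dtype=float)
    label_matrix = np.asarray(label_matrix)
    aps = [average_precision(score_matrix[:, c], label_matrix[:, c])
           for c in range(score_matrix.shape[1])]
    return float(np.mean(aps))


# Three images, two classes; rows are images, columns are classes.
scores = [[0.9, 0.1], [0.8, 0.9], [0.7, 0.2]]
labels = [[1, 0], [0, 1], [1, 0]]
m_ap = mean_average_precision(scores, labels)
```

Since each image can carry several labels, every class is scored as its own ranking problem and the per-class results are averaged.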
Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking. We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit. Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: Q ∈ {256, 384, 512, 640, 768}. It is worth noting though that the improvement over a smaller range of {256, 384, 512} was rather marginal (0.3%).
Notably, on the VOC-2007 and VOC-2012 validation sets, aggregating the multi-scale image descriptors by averaging performs similarly to aggregating them by stacking. We hypothesise that this is because objects in the VOC datasets appear at a variety of scales, so there are no scale-specific semantics for a classifier to exploit. Since averaging does not inflate the descriptor dimensionality, we could aggregate image descriptors over a wide range of scales: Q ∈ {256, 384, 512, 640, 768}. The improvement over the smaller range {256, 384, 512} was, however, rather marginal (0.3%).
The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pretrained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. It also benefits from the fusion with an object detection-assisted classification pipeline.
Table 11 reports the test set performance and compares it with other approaches. Our networks “Net-D” and “Net-E” perform identically on the VOC datasets, and their combination slightly improves the results. Our methods set a new state of the art among image representations pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. Note that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset that includes an additional 1000 categories semantically close to those in the VOC datasets. It also benefits from fusion with an object detection-assisted classification pipeline.
Image Classification on Caltech-101 and Caltech-256. In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes. A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class). Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection.
Image Classification on Caltech-101 and Caltech-256. Here we evaluate very deep features on the Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images in 102 classes (101 object categories plus a background class); Caltech-256 is larger, with 31K images and 257 classes. The standard protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, measured by mean class recall (which compensates for the differing number of test images per class). Following Chatfield et al. (2014), Zeiler & Fergus (2013), and He et al. (2014), on Caltech-101 we generated 3 random splits, each with 30 training images per class and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each with 60 training images per class, the rest being used for testing. In each split, 20% of the training images served as a validation set for hyper-parameter selection.
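Mean class recall, and why it differs from plain accuracy when classes have unequal numbers of test images, can be shown with a small sketch. The function name and the toy labels are assumptions made for the example.

```python
import numpy as np


def mean_class_recall(true_labels, predicted_labels):
    """Average, over classes, of the fraction of each class predicted correctly."""
    y_true = np.asarray(true_labels)
    y_pred = np.asarray(predicted_labels)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(recalls))


# Class 0 has four test images, class 1 only one; the classifier always says 0.
y_true = [0, 0, 0, 0, 1]
y_pred = [0, 0, 0, 0, 0]
recall = mean_class_recall(y_true, y_pred)
accuracy = float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

In this toy case plain accuracy is 0.8, while the mean class recall is 0.5, because the rare class counts as much as the frequent one.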
We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling. This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales Q ∈ {256, 384, 512}. Our models are compared to each other and the state of the art in Table 11. As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance. On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
We found that, unlike on VOC, stacking descriptors computed over multiple scales performs better on the Caltech datasets than averaging or max-pooling. A likely explanation is that objects in Caltech images typically occupy the whole image, so multi-scale features are semantically different (capturing the whole object vs. object parts), and stacking lets a classifier exploit such scale-specific representations. We used three scales, Q ∈ {256, 384, 512}. Table 11 compares our models with each other and with the state of the art. The deeper 19-layer Net-E performs better than the 16-layer Net-D, and combining them improves performance further. On Caltech-101 our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256 our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).
Action Classification on VOC-2012. We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation. The results are compared to other approaches in Table 12.
Action Classification on VOC-2012. We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists of predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images labelled into 11 classes. As in the VOC-2012 object classification task, performance is measured by mAP. We considered two training settings: (i) computing ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing features on both the whole image and the provided bounding box, and stacking them into the final representation. Table 12 compares the results with other approaches.
Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes. Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.
Our representation reaches the state of the art on the VOC action classification task even without the provided bounding boxes, and the results improve further when both images and bounding boxes are used. Unlike other approaches, we did not incorporate any task-specific heuristics and relied instead on the representation power of very deep convolutional features.
Other Recognition Tasks. Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations. For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
**Other Recognition Tasks.** Since their public release, our models have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming shallower representations. For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over the shallower architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), and texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.
v1 Initial version. Presents the experiments carried out before the ILSVRC submission.
v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets. The models used for these experiments are publicly available.
v4 The paper is converted to ICLR-2015 submission format. Also adds experiments with multiple crops for classification.
v6 Camera-ready ICLR-2015 conference paper. Adds a comparison of the net B with a shallow net and the results on PASCAL VOC action classification benchmark.