Our education system is the result of hundreds of years of evolution, and thanks to it, research today is growing at an astonishing pace. Now we are making machines learn, and new robust, optimized models are trained day after day: from plain neural networks to CNNs to ViT. So, if we consider DL models as the students of the machine education system, one could ask: is ViT a Ph.D. student? This talk presents an analogy between the human education system and the deep learning system. Furthermore, different techniques dedicated to training transformers on small and mid-sized datasets, alongside a novel hybrid ViT-CNN model, are presented.
Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images by completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
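The finite-sample expressivity claim can be illustrated with a classic interpolation construction (a sketch in the spirit of the paper's argument, not its exact construction): a width-n, depth-two ReLU network f(t) = Σⱼ aⱼ·relu(t − bⱼ) can fit any n points with distinct 1-D inputs — including completely random labels — by solving a triangular linear system.

```python
import numpy as np

def fit_relu_net(x, y):
    """Width-n ReLU net f(t) = sum_j a_j * relu(t - b_j) interpolating n points."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    n = len(x)
    b = np.empty(n)
    b[0] = x[0] - 1.0          # any value strictly below the smallest input
    b[1:] = x[:-1]             # each later unit "switches on" after the previous point
    a = np.empty(n)
    for i in range(n):         # forward substitution on the triangular system
        acc = sum(a[j] * (x[i] - b[j]) for j in range(i))
        a[i] = (y[i] - acc) / (x[i] - b[i])
    return a, b

def relu_net(t, a, b):
    """Evaluate the depth-two ReLU network at a scalar input t."""
    return np.sum(a * np.maximum(t - b, 0.0))

rng = np.random.default_rng(0)
x = rng.permutation(8).astype(float)           # distinct inputs
y = rng.integers(0, 2, size=8).astype(float)   # a "random labeling"
a, b = fit_relu_net(x, y)
pred = np.array([relu_net(t, a, b) for t in x])
print(np.allclose(pred, y))  # True: the tiny net fits the random labels exactly
```

With n hidden units (2n parameters) the system is always solvable for distinct inputs, which is why parameter count exceeding data count suffices for perfect memorization.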
Image-to-image translation is a computer vision task aiming to learn the mapping from an input image in one domain to an output image in another domain, following the style or characteristics of the target domain. It can be applied to a wide range of applications, such as collection style transfer, object transfiguration, season transfer, and photo enhancement. Hue-Net is a deep learning framework for intensity-based image-to-image translation. It introduces a differentiable representation of (1D) cyclic and (2D) joint histograms and uses them for defining loss functions based on cyclic Earth Mover's Distance (EMD) and Mutual Information (MI). The strength of Hue-Net has been demonstrated on color transfer problems, where the aim is to paint a source image with the colors of a different target image.
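The idea of a differentiable histogram can be sketched with soft binning: each pixel contributes to all bins through a smooth kernel rather than a hard assignment, so gradients flow through the binning step. The Gaussian kernel, bandwidth, and the non-cyclic 1-D EMD below are illustrative assumptions, not Hue-Net's exact formulation.

```python
import numpy as np

def soft_histogram(values, n_bins=16, bandwidth=0.05):
    """Differentiable histogram on [0, 1]: Gaussian soft-assignment to bin centers."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # weight of each value for each bin; each row is normalized to sum to 1
    w = np.exp(-0.5 * ((values[:, None] - centers[None, :]) / bandwidth) ** 2)
    w = w / w.sum(axis=1, keepdims=True)
    return w.mean(axis=0)                      # normalized histogram, sums to 1

def emd_1d(p, q):
    """1-D Earth Mover's Distance between histograms (plain, non-cyclic variant)."""
    return np.abs(np.cumsum(p) - np.cumsum(q)).sum()

rng = np.random.default_rng(0)
img = rng.uniform(0.0, 1.0, size=1000)         # flattened intensity image
h = soft_histogram(img)
print(h.sum())                                 # ~1.0: a valid smooth histogram
```

Because every operation above is differentiable in `values`, such a histogram can sit inside a loss function and be optimized end-to-end.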
The Graph Attention Network (GAT) is a Graph Neural Network (GNN) that uses attention to represent the relative importance of neighboring nodes in a graph. GNNs are a special kind of neural network operating on graphs. At first glance, they do not seem very suitable for image analysis, but after an in-depth look at the Graph Attention Network, we will see an example of applying such a model to superpixel image classification.
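The core of GAT is the attention coefficient α_ij = softmax_j(LeakyReLU(aᵀ[Whᵢ ‖ Whⱼ])), computed only over a node's neighbors. A minimal single-head NumPy sketch (dense adjacency, random weights; a real layer would batch this and add multi-head concatenation):

```python
import numpy as np

def gat_attention(H, W, a, adj):
    """Single-head GAT layer: alpha_ij = softmax_j(LeakyReLU(a^T [Wh_i || Wh_j]))."""
    Z = H @ W                                   # (n, f') transformed node features
    n = Z.shape[0]
    e = np.full((n, n), -np.inf)                # -inf masks non-edges in the softmax
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                s = a @ np.concatenate([Z[i], Z[j]])
                e[i, j] = s if s > 0 else 0.2 * s   # LeakyReLU, slope 0.2
    alpha = np.exp(e - e.max(axis=1, keepdims=True))
    alpha /= alpha.sum(axis=1, keepdims=True)   # rows sum to 1 over neighbors
    return alpha @ Z, alpha                     # attention-weighted aggregation

rng = np.random.default_rng(0)
adj = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])   # tiny graph with self-loops
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))
a = rng.normal(size=4)
out, alpha = gat_attention(H, W, a, adj)
print(out.shape)  # (3, 2)
```

For superpixel classification, each node would be a superpixel (e.g. its mean color and centroid) and `adj` its spatial adjacency.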
Dynamic neural networks are an emerging research topic in deep learning. Compared to static models, which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc. In this survey, we comprehensively review this rapidly developing area by dividing dynamic networks into three main categories: 1) instance-wise dynamic models that process each instance with data-dependent architectures or parameters; 2) spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data; and 3) temporal-wise dynamic models that perform adaptive inference along the temporal dimension for sequential data such as videos and texts. The important research problems of dynamic networks, e.g., architecture design, decision-making schemes, optimization techniques, and applications, are reviewed systematically. Finally, we discuss the open problems in this field together with interesting future research directions.
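Instance-wise dynamics in its simplest form: a per-input gate decides whether to run an expensive branch at all. The norm-based gating rule below is an illustrative assumption (real models learn the gate, e.g. with early-exit classifiers), but it shows how the computational graph changes per sample.

```python
import numpy as np

def cheap_block(x):
    return np.maximum(x, 0.0)                  # light path: a single ReLU

def expensive_block(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2        # heavy path: a small MLP

def dynamic_forward(x, W1, W2, threshold=1.0):
    """Instance-wise dynamic inference: route each sample through one of two paths."""
    # the gate here is just the input norm; learned gates replace this in practice
    if np.linalg.norm(x) < threshold:
        return cheap_block(x), "cheap"
    return expensive_block(x, W1, W2), "expensive"

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 4))
y1, path1 = dynamic_forward(np.zeros(4), W1, W2)       # small input -> cheap path
y2, path2 = dynamic_forward(np.ones(4) * 10, W1, W2)   # large input -> expensive path
print(path1, path2)  # cheap expensive
```

Spatial-wise and temporal-wise variants apply the same idea per pixel region or per time step instead of per instance.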
The ability of an observer to perform a specific task on images produced by a given medical imaging system defines an objective measure of image quality. If the observer is “numerical”, can deep learning methods “do the job”? What do we find in the literature? Some papers raise this issue and propose approximating the Ideal Observer for detection and localization tasks.
“The synergy between the large datasets in the cloud and the numerous computers that power it has enabled remarkable advancements in machine learning, especially in DNNs. […] That changed in 2013 when a projection showed that Google users searching by voice for three minutes per day using speech recognition DNNs would double Google datacenters’ computation demands.” This presentation will introduce the concepts behind the hardware architectures used to support the current growth in machine learning, including GPUs and TPUs.
Traveling at the time of coronavirus is difficult with the restrictions set by governments all around the world, and that is why most international meetings and conferences are held online. On the other hand, deep learning has grown significantly in the past few years, especially for vision applications. Different architectures and models, from CNNs to Transformers, have been proposed. In this talk, we will not present another model; instead, we will list different techniques, layers, loss functions, and optimizers that can improve the performance of your model. Also, an analogy between travel and deep learning is presented at the beginning.
CNNs are now widely used, so it is necessary to implement them efficiently. To do so, CNNs are most commonly implemented on GPUs, and to a lesser extent on FPGAs. In this talk, without going into the details, we will list some problems arising when implementing CNN inference, especially on FPGAs. We will also link these problems to the CNN models themselves and highlight a few general recommendations extracted from the following papers.
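One recurring FPGA concern is that floating-point arithmetic is expensive in logic resources, so weights and activations are usually quantized to fixed-point before deployment. A minimal sketch of signed fixed-point quantization (the 8-bit format with 5 fractional bits is an illustrative choice, not a recommendation from the papers):

```python
import numpy as np

def quantize(x, n_bits=8, frac_bits=5):
    """Round to signed fixed-point: step 2^-frac_bits, saturating at the type's range."""
    step = 2.0 ** -frac_bits
    lo = -(2 ** (n_bits - 1)) * step           # most negative representable value
    hi = (2 ** (n_bits - 1) - 1) * step        # most positive representable value
    return np.clip(np.round(x / step) * step, lo, hi)

w = np.array([0.37, -1.2, 3.9, -5.0])
wq = quantize(w)
print(wq)  # [ 0.375  -1.1875  3.90625 -4.     ] (last value saturates)
```

The rounding error and the saturation of out-of-range values (here −5.0 clips to −4.0) are exactly the kinds of model/hardware interactions such talks examine.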
From ResNet and Highway Networks to DenseNet, adding more inter-layer connections besides the direct connections between adjacent layers has emerged as a popular approach to strengthen feature propagation among different layers. However, dense connections cause much redundancy, especially in the case of DenseNet. Another issue is that, with many dense connections from previous layers, the role played by the mainstream module is unclear. To address these issues, the authors introduce a gating mechanism, inspired by SENet, to model the layer relationships in densely connected blocks.
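The SENet-style gating being referred to can be sketched in a few lines: squeeze each channel to a scalar by global average pooling, pass the descriptor through a two-layer bottleneck, and use sigmoid outputs to rescale the channels. This is a generic SE block sketch, not the paper's exact placement of gates over dense connections.

```python
import numpy as np

def se_gate(x, W1, W2):
    """Squeeze-and-Excitation gating: squeeze (global pool), excite (2 FCs), rescale."""
    s = x.mean(axis=(1, 2))                    # squeeze: one descriptor per channel (C,)
    z = np.maximum(s @ W1, 0.0)                # excitation: bottleneck FC + ReLU
    g = 1.0 / (1.0 + np.exp(-(z @ W2)))        # sigmoid gates in (0, 1), one per channel
    return x * g[:, None, None]                # reweight each channel's feature map

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4, 4))                 # (channels, H, W) feature maps
W1 = rng.normal(size=(8, 2))                   # reduction to 2 hidden units
W2 = rng.normal(size=(2, 8))
y = se_gate(x, W1, W2)
print(y.shape)  # (8, 4, 4)
```

Applied to a densely connected block, such gates let the network learn how much each earlier layer's features should contribute, rather than concatenating all of them with equal weight.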
Nowadays, it is well established that ConvNets are able to achieve incredible performance on complex vision tasks such as classification, object recognition, or semantic segmentation. A common belief is that both humans and ConvNets solve these tasks by learning increasingly complex representations of object shapes. However, recent studies show that humans and ConvNets actually follow very different strategies and are not biased towards the same information in images. To this end, the authors propose a stylized version of ImageNet, making it easier for ConvNets to learn the shape-based image representations used by humans.
Image-to-image translation is a field aiming at transposing images from one representation to another, like generating an aerial map of a region based on a photograph. Results in this field have greatly improved since the arrival of GAN models in 2014. GANs (Generative Adversarial Nets) are neural networks specialized in sample generation. When applied to images, these models are able to generate convincing samples that are similar to images from a reference dataset while remaining completely original.
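The adversarial training behind GANs can be stated as the minimax game from the original 2014 formulation, where a discriminator D is trained to tell real samples from generated ones while a generator G is trained to fool it:

```latex
\min_G \max_D \; V(D, G) =
  \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

Image-to-image translation models build on this game, typically conditioning G on the input image and adding reconstruction or cycle-consistency losses on top of the adversarial term.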
BERT, which stands for Bidirectional Encoder Representations from Transformers, was published by a Google AI team in 2018. It has been presented as a new cutting-edge model for Natural Language Processing (NLP). Based on the Transformer architecture, it is designed to learn bidirectional representations by considering both the left and right contexts in all its layers. While being initially introduced for NLP tasks, it has recently been used to model other tasks such as action recognition.
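The bidirectional representations come from BERT's masked-language-modeling objective: roughly 15% of input tokens are selected for prediction, and of those, 80% are replaced by [MASK], 10% by a random token, and 10% are left unchanged. A minimal sketch of that corruption step (tokenization and the model itself are omitted):

```python
import random

def mask_tokens(tokens, vocab, p_select=0.15, seed=0):
    """BERT-style masking: of selected tokens, 80% -> [MASK], 10% random, 10% kept."""
    rng = random.Random(seed)
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < p_select:
            targets.append(tok)                 # the model must predict the original
            r = rng.random()
            if r < 0.8:
                masked.append("[MASK]")
            elif r < 0.9:
                masked.append(rng.choice(vocab))
            else:
                masked.append(tok)              # kept, but still predicted
        else:
            targets.append(None)                # no loss at unselected positions
            masked.append(tok)
    return masked, targets

tokens = "the quick brown fox jumps over the lazy dog".split()
masked, targets = mask_tokens(tokens, vocab=tokens)
print(masked)
```

Because the model sees the uncorrupted tokens on both sides of each masked position, every layer can attend to left and right context simultaneously, unlike a left-to-right language model.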
While the Transformer architecture has become the de-facto standard for natural language processing tasks, its applications to computer vision remain limited. In vision, attention is either applied in conjunction with convolutional networks, or used to replace certain components of convolutional networks while keeping their overall structure in place. It has been shown that this reliance on CNNs is not necessary and a pure transformer applied directly to sequences of image patches can perform very well on multiple image tasks.
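The key preprocessing step of such a pure-transformer approach is turning an image into a sequence: the image is cut into fixed-size non-overlapping patches, and each patch is flattened into a token vector (which is then linearly projected and given a position embedding, omitted here). A minimal sketch:

```python
import numpy as np

def patchify(img, patch=4):
    """Split an (H, W, C) image into flattened non-overlapping patch tokens."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    x = img.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)              # (nH, nW, patch, patch, C)
    return x.reshape(-1, patch * patch * C)     # sequence of patch "tokens"

img = np.arange(8 * 8 * 3, dtype=float).reshape(8, 8, 3)
seq = patchify(img)
print(seq.shape)  # (4, 48): 2x2 patches, each 4*4*3 values
```

From this point on, the transformer encoder treats the patch sequence exactly as it would a sequence of word embeddings, with no convolutions involved.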