dc.description |
A fundamental problem in perception-based systems is to define
and learn representations of the scene that are more robust and adaptive to several nuisance
factors. Over the recent past, for a variety of tasks involving images, learned representations have been empirically shown to outperform handcrafted ones.
However, their inability to generalize across varying data distributions poses the following
question: Do representations learned using deep networks just fit a given data distribution or do
they sufficiently model the underlying structure of the problem ? This question could be
understood using a simple example: If a learning algorithm is shown a number of images of a simple handwritten digit, then the representation learned should be generic enough to identify the same digit in
a different form. With regards to deep networks, although the learned representation has been shown
to be robust to various forms of synthetic distortions such as random noise, they fail in the presence of
more implicit forms of naturally occurring distortions. In this dissertation, we
propose approaches to mitigate the effect of such distortions and in the process, study some
vulnerabilities of deep networks to small imperceptible changes that occur in the given input.
The research problems that comprise this dissertation lie in the cross section of two open topics: (1)
Studying and developing methods that enable neural networks learn robust representations (2) Improving
generalization of neural nets across domains. The first part of the dissertation approaches the problem of robustness from two broad viewpoints: Robustness to external nuisance factors that occur in the data and robustness
(or a lack thereof) to perturbations of the learned feature space. In the second part, we focus on learning representations that are invariant to external covariate
shift, which is more commonly termed as domain shift.
Towards learning representations robust
to external nuisance factors, we propose an approach that couples a deep convolutional neural
network with a low-dimensional discriminative embedding learned using triplet probability
constraints to solve the unconstrained face analysis problem. While previous approaches in this area have proposed scalable yet ad-hoc solutions to this problem, we propose a principled and parameter free formulation which is based on maximum likelihood estimation. In addition, we employ the principle of transfer learning to realize a deep network architecture that can train faster and on lesser data yet significantly outperforms existing approaches on the unconstrained face verification task. We demonstrate the robustness
of the approach to challenges including age, pose, blur and clutter by performing clustering
experiments on challenging benchmarks.
Recent seminal works have shown that deep neural networks are susceptible to visually imperceptible perturbations of the input. In this dissertation, we build on their ideas in two unique ways: (a) We show that neural networks that perform pixel-wise semantic segmentation tasks also suffer from this vulnerability, despite being trained with more extra information compares to simple classification tasks. In addition, we present a novel self correcting mechanism in segmentation networks and provide an efficient way to generate such perturbations (b) We present a novel approach to regularize deep neural networks by perturbing intermediate layer activations in an efficient manner, thereby exploring the trade-off between conventional regularization and adversarial robustness within the context of very deep networks. Both of these works provide interesting directions towards understanding the secure nature of deep learning algorithms.
While humans find it extremely simple to generalize their knowledge across domains, machine learning algorithms including deep neural networks suffer from the problem of domain shift across what are commonly termed as 'source' (S)
and 'target' (T) distributions. Let the data that a learning algorithm
is trained on be sampled from S. If the real data used to evaluate the model is
then sampled from T, then the learnt model under-performs on the target data. This inability to generalize is characterized as domain shift.
Our attempt to address this problem involves learning a common
feature subspace, where distance between source and target distributions are minimized. Estimating the distance between different domains is highly non-trivial and is an open research
problem in itself. In our approach we parameterize the distance measure by using a Generative
Adversarial Network (GAN). A GAN involves a two player game between two mappings com-
monly termed as generator and discriminator. These mappings are learned simultaneously by
employing an adversarial game, i.e. by letting the generator fool the discriminator and enabling
the discriminator to outperform the generator. This adversarial game can be formulated as a
minimax problem. In our approach, we learn three mappings simultaneously: the generator,
discriminator and a feature mapping that contains information about both the content and the
domain of the input. We deploy a two-level minimax game, where the first level is a competition
between the generator and a discriminator similar to a GAN; the second level game is where the
feature mapping attempts to fool the discriminator thereby introducing domain invariance in
the learned feature representation. We have extensively evaluated this approach for different
tasks such as object classification and semantic segmentation, where we achieve state of the
art results across several real datasets. In addition to the conceptual novelty, our approach
presents a more efficient and scalable solution compared to other approaches that attempt to
solve the same problem.
In the final part of this dissertation, we describe some ongoing efforts and future directions of research. Inspired from the study of perturbations described above, we propose a novel metric on how to effectively choose pixels to label given an image, for a pixel-wise segmentation task. This has the potential to significantly reduce the labeling effort and our preliminary results for the task of semantic segmentation are encouraging. While the domain adaptation approach proposed above considered static images, we propose an extension to video data aided by the use of recurrent neural networks. Use of full temporal information, when available, provides the perceptual system additional context to disambiguate among smaller object classes that commonly occur in real scenes. |
|