A Survey of Recent Meta-Learning Perspectives, Approaches, and Applications

WARNING: This article was written by the author during high school, in a non-professional capacity.

Meta-learning, or learning to learn, is a paradigm of machine learning in which an algorithm generalizes itself with meta-knowledge of a certain form so that it can apply to various settings. While learning to learn was originally a hallmark of human intelligence, numerous meta-learning perspectives and approaches have sprung up in recent years. This paper provides an overview of recent meta-learning approaches, especially Model-Agnostic Meta-Learning (and its derivatives), Meta-Reinforcement Learning, and Few-Shot (or Zero/One-Shot) Learning, three methods that have emerged over the past five years. Model-Agnostic Meta-Learning is a meta-learning algorithm compatible with any model trained with gradient descent; Meta-Reinforcement Learning refers to applying meta-learning to reinforcement learning so as to develop RL agents that can solve novel tasks quickly and efficiently; and Few-Shot Learning is a few-sample specialization of meta-learning. Their mechanisms, applications, and performance are analyzed in detail.

Keywords: Meta-Learning, MAML, Meta-Reinforcement Learning, Few-Shot Learning

If they could, computers would certainly envy that Homo sapiens are all gifted learners. The ability to learn comes to us so naturally that we are apt to forget what a miracle it is. Kids learn by imitating their parents, high schoolers come to know historical events across space and time through their teachers’ lectures, and every human being unconsciously learns from experiences, be they sad or happy. In fact, not only do humans acquire concrete knowledge through education or behavioristic processes of trial and error, but more importantly, we learn how to learn — and this is what computers still struggle with. “Learning to learn” was first termed “meta-learning” by Donald B. Maudsley in educational psychology, and John Biggs later described it as “being aware of and taking control of one’s own learning” (Biggs, 1985). As computers may marvel at this hallmark of human intelligence, the concept of meta-learning emerged in the field of machine learning (ML), a discipline concentrated on how computers learn.

In the context of machine learning, meta-learning, according to Brazdil, refers to the “study of principled methods that exploit meta-knowledge to obtain efficient models and solutions by adapting machine learning and data mining processes” (Brazdil et al., 2009). Though multiple other definitions exist, the central underlying idea is that a meta-learning algorithm should be able to generalize itself with experience so that it can apply to various scenarios. This may challenge the common stereotype that an algorithm is oriented toward a fixed, procedural problem, or at least that it is bound to its training data if it bears the name “machine learning”; over the last 20 years, however, meta-learning has succeeded in doing exactly that.

In the past decade, the focus of machine learning research has shifted from deriving specific solutions for a single setting to general solutions for various settings, and future approaches will pursue even greater algorithmic generality. Deep convolutional neural networks like the famous AlexNet could handle the training speed and overfitting problems well while reaching considerable accuracy (Krizhevsky et al., 2017); however, such a network can only be trained toward a single task. To tackle this limitation, studies have proposed reinforcement learning to leverage more of the available training data. This approach proved itself sound when incorporated into AlphaGo, whose ability to learn in a behavioral manner stunned the public (Silver et al., 2016). Nonetheless, while deep reinforcement learning succeeds in some structured settings such as the chessboard, its ability to generalize still hinges on predefined task guidelines.

Facing the “arbitrary, intrinsically-complex, outside world” that ML algorithms need to tackle, Rich Sutton observed that, instead of trying to build simple ways to model the contents of minds, “we should build in only the meta-methods that can find and capture this arbitrary complexity” (2019). As it turns out, the concept of meta-learning seems a panacea. Although its definition is not yet unified, meta-learning is widely recognized as one of the most promising machine learning paradigms for future artificial intelligence advancement. For example, a robot hand has been shown to solve the Rubik’s cube without a hardcoded algorithm; what supported this astonishing feat was a technique called Automatic Domain Randomization (ADR), which was itself dictated by a central meta-learning algorithm (OpenAI et al., 2019). Innovations in meta-learning are still undergoing an explosion. To name a few, there are metric-based, model-based, and optimization-based approaches, as well as combinations with other existing methods, such as meta-reinforcement learning. Each of them brings a special perspective to this curious branch of machine learning, and each of them brings the next revolution of artificial intelligence closer to sight.

Given the increasing complexity of meta-learning research, this paper is intended to serve as a survey of recent meta-learning approaches. After a brief explanation of machine learning concepts and terminology, the literature of three state-of-the-art meta-learning areas, namely Model-Agnostic Meta-Learning (MAML), Meta-Reinforcement Learning (MRL), and Few-Shot (or Zero/One-Shot) Learning, will be analyzed together with their respective perspectives and applications. Considering meta-learning’s mission, the analysis focuses on three specific dimensions: training difficulty, generalizability, and accuracy.

Preliminaries

Although meta-learning is a broad concept that spans a variety of machine learning methods and tasks, understanding the principles of the base-learners (i.e., the lower-level machine learning algorithms that meta-learning is applied on, usually with fixed a priori assumptions) remains important. A brief overview of machine learning terminology and common tasks with their respective methods follows.

Machine learning, the general term, refers to a set of methods that can automatically detect patterns in data and then use the uncovered patterns to predict future data or to perform other kinds of decision making under uncertainty (Murphy, 2012, p. 4). To train a machine learning model so that it works as intended, a dataset in a certain form is always necessary. Every record in the dataset is an instance or a sample, and its properties are known as attributes, features, or covariates, whose values are called attribute values.

In a predictive or supervised learning approach, the learning algorithm takes in a labeled dataset of input-output pairs, the training set, and learns to produce response variables of a similar form. When the response variable is categorical, the problem is known as classification or pattern recognition. Common supervised classification methods are linear classifiers, support vector machines (Boser et al., 1992), decision trees (Quinlan, 1986), naive Bayes classifiers (Jensen, 1996), and neural networks. On the other hand, if the response variable is quantitative, the problem is known as regression, to which neural networks, random forests (Breiman, 2001), and polynomial regression can all be applied.

In a descriptive or unsupervised learning setting, by contrast, the dataset is not labeled (i.e., unpaired), and the algorithm runs automatically to discover interesting patterns. Common applications are cluster analysis, association rule mining, and dimensionality reduction, where methods like K-Nearest Neighbors (KNN), k-means, and hierarchical clustering can apply (Mucherino et al., 2009). These methods tend to generalize better but usually lack accuracy.

Another type of learning is Reinforcement Learning (RL). It differs from supervised learning in not needing a complete, paired dataset; instead, it learns much as a child does, through behavioral, trial-and-error processes. Reinforcement learning is closer to how a human learns, and it is also closely related to meta-learning. Examples are Q-learning and SARSA, both of which revolve around a state-action-reward-state(-action) update process, as sketched below.
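To make the update rule concrete, here is a minimal sketch of the tabular Q-learning backup in Python; the state and action counts, rewards, and hyperparameters are illustrative assumptions, not taken from any cited work.

```python
import jax.numpy as jnp

def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Q-learning backup: move Q(s, a) toward the observed reward plus the
    # discounted value of the best next action. (SARSA differs by using
    # the action actually taken in s_next rather than the max.)
    target = r + gamma * jnp.max(Q[s_next])
    return Q.at[s, a].set(Q[s, a] + alpha * (target - Q[s, a]))

# Toy usage: 5 states, 2 actions, one observed transition (s=0, a=1, r=1).
Q = q_update(jnp.zeros((5, 2)), s=0, a=1, r=1.0, s_next=3)
```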

Evaluating a trained model requires a validation set, and procedures like hold-out, cross-validation, and bootstrapping are usually adopted. The tests on validation sets yield metrics like classification accuracy, the confusion matrix (more detailed), logarithmic loss, Area Under the Curve (AUC), and F-measures for classification models, as well as Mean Absolute Error (MAE) and Root Mean Squared Error (RMSE) for regression models. Underfitting and overfitting are two major obstacles to successful learning: the former fails to sufficiently detect patterns in the training set, while the latter learns the training set too well but fails on the validation set.
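As a small illustration of these procedures, the sketch below generates plain k-fold cross-validation splits; the fold count and seeding are arbitrary choices for the example.

```python
import numpy as np

def k_fold_splits(n_samples, k=5, seed=0):
    # k-fold cross-validation: shuffle the indices once, cut them into k
    # folds, and let each fold serve as the validation set exactly once.
    idx = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(idx, k)
    for i in range(k):
        train = np.concatenate(folds[:i] + folds[i + 1:])
        yield train, folds[i]
```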

A Literature Survey

This section provides a review of recent meta-learning approaches, especially Model-Agnostic Meta-Learning (and its derivatives), Meta-Reinforcement Learning, and Few-Shot (or Zero/One-Shot) Learning, three methods that have risen in the past five years. The reviews also cover the perspectives they rely on as well as the applications they have enabled.

Granted, there are plenty of historical meta-learning approaches whose performance excels in certain areas – for example, bagging (Breiman, 1996) and boosting (Freund & Schapire, 1997) attempted to exploit variations in data and are considered meta-learning methods by Brazdil et al. (2009) – and many other algorithms implement the idea of meta-learning without taking the name. In fact, arguably the first meta-learning system was studied by Rice (1976) as “the algorithm selection problem”, which has since evolved into the meta-learning area of algorithm recommendation. A few further explored areas are dynamic bias selection, inductive transfer, and holistic meta-learning systems (Lemke et al., 2015).

However, most of the earlier methods in these areas have been thoroughly studied in previous literature reviews, such as those by Bhatt et al. (2012), Brazdil et al. (2009), Lemke et al. (2015), Vanschoren (2018), and Vilalta & Drissi (2002). Hence, this review instead concentrates on providing an updated survey of the selected new state-of-the-art methods – starting with MAML.

Model-Agnostic Meta-Learning (MAML)

First proposed by Finn et al. (2017), Model-Agnostic Meta-Learning is an algorithm compatible with any model trained with gradient descent and applicable to problems including, but not limited to, classification, regression, and reinforcement learning. The algorithm aims at rapid adaptation: it trains a model’s parameters such that only a few gradient updates are needed to learn a new task. Based on the intuition that some internal representations of the data are more transferable than others, the method defies conventional wisdom in that it neither ingests entire datasets nor relies on fixed feature embeddings.

As for the specific implementation, MAML considers a model represented by a parametrized function whose parameters are adapted when encountering new tasks. The meta-optimization is then performed over these parameters using updates calculated via stochastic gradient descent (SGD). Notably, this MAML meta-gradient update involves a gradient through a gradient, as the sketch below illustrates.
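The following is a minimal sketch of one MAML meta-update in Python (JAX), using a toy linear-regression loss; the support/query task layout follows the paper, but the model, learning rates, and function names are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Toy linear model; any differentiable model would do.
    return jnp.mean((x @ params - y) ** 2)

def adapted_loss(params, task, inner_lr=0.01):
    # Inner loop: one SGD step on the task's support set...
    x_s, y_s, x_q, y_q = task
    adapted = params - inner_lr * jax.grad(loss)(params, x_s, y_s)
    # ...then evaluate the adapted parameters on the task's query set.
    return loss(adapted, x_q, y_q)

def maml_outer_step(params, tasks, meta_lr=0.001):
    # Outer loop: differentiating adapted_loss with respect to the initial
    # parameters sends a gradient through the inner SGD step, which is
    # exactly the "gradient through a gradient" noted above.
    meta_grad = sum(jax.grad(adapted_loss)(params, t) for t in tasks)
    return params - meta_lr * meta_grad
```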

As the authors demonstrate, the algorithm applies to supervised regression and classification, as well as to reinforcement learning (where it enables quick acquisition of policies for new tasks from limited experience). For regression, experimental evaluations show that MAML optimizes the model parameters into a region amenable to rapid adaptation and sensitive to loss functions, without overfitting; this enables improvement within as little as one gradient step. For classification, evaluations on image recognition with the Omniglot and MiniImagenet datasets indicate that the learned convolutional model compares well with state-of-the-art results.

Moreover, the method uses fewer overall parameters than the meta-learner Long Short-Term Memory (LSTM) approach, and it outperforms that LSTM on 5-way classification for both datasets. Its limitations, nevertheless, are that it represents a narrower scope of application than prior memory-augmented neural networks and that it incurs a critical computational expense in deriving second derivatives when backpropagating the meta-gradient of the meta-objective. Its application to reinforcement learning has also proven sound: results on the 2D navigation and locomotion challenges show that MAML can learn a model that improves with a quick single gradient update (a better initialization) while continuing to improve with further updates.

The fundamental notion of MAML is to adapt to a large number of tasks and thus gain better generalizability. The core optimizing mechanism, the meta-gradient update, was also applied by Andrychowicz et al. (2016) in an algorithm for “learning to learn by gradient descent by gradient descent”, which similarly utilized LSTMs as part of the optimization. Andrychowicz’s algorithm is therefore worth understanding as well.

To construct learning algorithms that perform well on a class of optimization problems, Andrychowicz et al. proposed such an algorithm by casting the problem of transfer learning as one of generalization. By directly parameterizing the optimizer and training it with gradient descent, the meta-optimizer learns to minimize the loss of a given function. Because applying recurrent neural networks naively would require optimizing a vast number of parameters, Andrychowicz et al. incorporate a coordinatewise network architecture (an optimizer that operates on each parameter of the objective function independently), which simplifies the process to a single coordinate. The optimizer itself is implemented as a two-layer Long Short-Term Memory (LSTM) network.
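The sketch below shows the unrolled structure of this approach in JAX, with a toy linear recurrence standing in for the paper's two-layer coordinatewise LSTM; the optimizer parameters w and the unroll length are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def learned_update(g, h, w):
    # Stand-in for the coordinatewise LSTM: a tiny learned recurrence that
    # maps each coordinate's gradient (and a hidden state h) to an update.
    h = w[0] * h + w[1] * g
    return w[2] * g + w[3] * h, h

def unrolled_meta_loss(w, params, loss_fn, steps=5):
    # Meta-objective: the loss accumulated while the learned rule
    # optimizes the target problem over a short unrolled horizon.
    h = jnp.zeros_like(params)
    total = 0.0
    for _ in range(steps):
        update, h = learned_update(jax.grad(loss_fn)(params), h, w)
        params = params + update
        total = total + loss_fn(params)
    return total

# Meta-training the optimizer is then ordinary gradient descent on w:
#   jax.grad(unrolled_meta_loss)(w, init_params, loss_fn)
```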

The resulting learning curves show that optimizers learned by this procedure substantially outperform the baselines across many settings. The learned optimizer exhibits a steeper, faster-descending learning curve on quadratic convex optimization and on small neural networks trained on MNIST; generalizes to different architectures (such as MLPs with 40 hidden units, networks with two hidden layers, and networks using ReLU activations); boosts classification performance on the CIFAR-2, -5, and -10 datasets; and learns to optimize Neural Art (artistic style transfer using convolutional networks) after being trained on only a single style. The evidence all confirms that learned algorithms, implemented with LSTMs, can outperform their generic, hand-crafted competitors on various tasks.

Returning to MAML: despite its monumental achievements, there remains room for improvement over Finn’s original version. According to Antoniou et al. (2019), though the MAML algorithm sped up learning and improved generalization performance, it has many issues that make it problematic to put into practice. These include training instability, the second-order derivative cost, the absence of batch normalization statistic accumulation, batch normalization biases shared across steps, a shared inner loop learning rate, and a fixed outer loop learning rate.

To address these problems, Antoniou et al. (2019) proposed a stable, automated, and improved MAML, which they named MAML++. For each issue mentioned above, they devised a corresponding fix. Multi-step Loss Optimization (MSL) alleviates gradient instability by improving gradient propagation. Derivative-Order Annealing (DA) lowers the second-order derivative cost. Per-Step Batch Normalization Running Statistics (BNRS) resolves the absence of batch normalization statistic accumulation and potentially improves generalization performance. Per-Step Batch Normalization Weights and Biases (BNWB) removes the shared batch normalization bias while increasing convergence speed, stability, and generalization performance. Finally, Learning Per-Layer Per-Step Learning Rates and Gradient Directions (LSLR) and Cosine Annealing of the Meta-Optimizer Learning Rate (CA) solve the two learning rate problems, respectively.
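Of these fixes, MSL is the simplest to state; below is a sketch reusing the toy regression setup from the MAML sketch above. The per-step weights shown are placeholders (MAML++ anneals them so that later steps dominate).

```python
import jax
import jax.numpy as jnp

def loss(params, x, y):
    # Same toy regression loss as in the MAML sketch.
    return jnp.mean((x @ params - y) ** 2)

def multi_step_loss(params, task, inner_lr=0.01,
                    weights=(0.1, 0.2, 0.3, 0.4)):
    # MSL: instead of scoring only the final adapted parameters, take a
    # weighted sum of the query loss after *every* inner step, giving the
    # outer loop a gradient signal at each depth of the unroll.
    x_s, y_s, x_q, y_q = task
    total = 0.0
    for w in weights:
        params = params - inner_lr * jax.grad(loss)(params, x_s, y_s)
        total = total + w * loss(params, x_q, y_q)
    return total
```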

Empirical results indicate that MAML++ marks a new state of the art across all tested few-shot tasks. Across Omniglot and Mini-Imagenet, MAML++ reduces sensitivity to the inner loop hyperparameters, improves the generalization error, and stabilizes and speeds up MAML. While each proposed methodology individually outperforms MAML, BNRS and BNWB contribute most to the performance boost.

Besides the implementation improvements of MAML++, there is another adaptive version of MAML from Behl et al. (2019). To ease the use of MAML with less parameter tuning and thus reduce the need for grid search, and to make the algorithm converge in fewer iterations, they proposed an elegant optimization of the original method (named Alpha MAML) that adaptively tunes both the learning rate and the meta-learning rate. Based on the hypergradient descent (HD) algorithm, this solution updates the learning rates by performing gradient descent on them alongside the original optimization steps. Notably, no extra gradient needs to be computed, since the previous gradients can be kept in extra memory storage.
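Below is a sketch of the underlying hypergradient descent rule applied to plain SGD; Alpha MAML applies the same rule to both the inner learning rate and the meta-learning rate. The dot-product term follows the HD formulation of Baydin et al.; the function name and the beta value are illustrative assumptions.

```python
import jax.numpy as jnp

def hd_sgd_step(params, grad_fn, prev_grad, lr, beta=1e-4):
    # Hypergradient descent: the loss's derivative with respect to the
    # learning rate equals -(current gradient . previous gradient), so the
    # learning rate itself is updated by gradient descent before the
    # ordinary SGD step. Only the previous gradient must be stored.
    g = grad_fn(params)
    lr = lr + beta * jnp.vdot(g, prev_grad)  # adapt the learning rate
    return params - lr * g, g, lr
```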

The performance of Alpha MAML is tested on few-shot image recognition tasks on the Omniglot dataset. Comparing the behavior of Alpha MAML with MAML, the results show that even badly picked initial learning rate and meta-learning rate values can be tuned automatically by Alpha MAML’s online learning rate adaptation scheme. The algorithm makes the necessary adjustments to both learning rates in each iteration to optimize the loss. The results also show that Alpha MAML is less sensitive to the initial hyperparameter choice and therefore requires much less fine-tuning.

Meta-Reinforcement Learning

Reinforcement learning algorithms learn through experimental trials and feedback, enabling a behavioralist process that resembles a human’s. However, training an RL model is sometimes difficult. Meta-Reinforcement Learning refers to doing meta-learning in the field of reinforcement learning, so as to develop RL agents that can solve novel tasks fast and efficiently (Weng, 2019). Though similar ideas emerged decades ago, for example in the early meta-learning approach “Learning to Learn Using Gradient Descent” (Hochreiter et al., 2001), recent methods embodying this idea are plentiful. Below are selected approaches that have achieved state-of-the-art performance in areas like robotics, image classification, and physics simulation.

Deep networks generalize well if provided with massive labeled data, but for robots in a controlled laboratory setting, only limited supervision is available; meanwhile, robots can collect ample data from the unstructured real world. Finn et al. (2017) formalized this problem as semi-supervised reinforcement learning (SSRL): a reward function can be evaluated in some small set of labeled Markov decision processes (MDPs), but the resulting policy must be successful on a larger set of unlabeled MDPs for which the reward function is not known. What makes the setting distinct is that it uses experience from the unlabeled set without access to the reward function.

Finn et al. proposed and evaluated an algorithm for performing SSRL, called semi-supervised skill generalization (S3G). In this method, they first train an RL policy in settings where a reward function is available and then run an algorithm that simultaneously learns a reward and a more general policy in the wider, unstructured scenarios.

Considering the nature of this algorithm, the experimental evaluations center on domains where generalization is vital for success. On tasks like obstacle navigation, 2-link reacher, 2-link reacher with vision, and half-cheetah jump, S3G consistently improves the generalization of the learned neural network policy. Compared with supervised regression onto the reward labels, the inverse-RL objective achieves higher performance. All these results suggest the method is efficient enough for learning on physical systems such as robots.

In addition to semi-supervised methods, Hsu et al. (2019) devised an unsupervised meta-learning method that aims to learn a learning procedure applicable to a wide range of novel tasks. Although it digresses somewhat from Meta-RL, the underlying principle is enlightening: it derives an unsupervised meta-learning objective from learned representations. Their research shows that learning algorithms acquired through unsupervised meta-learning achieve better downstream performance than the representations they were built from, without any additional assumptions or supervision. This implies that it is possible to leverage unsupervised embeddings to propose tasks for meta-learning algorithms, creating unsupervised methods that are effective on human-specified downstream tasks.

Intending to leverage unlabeled data, Hsu et al. limit the accessible data to an untagged set and approach the problem by framing it as a task-construction process. Specifically, they construct classification tasks from the data and learn how to learn these tasks effectively. The task generation process is critical here: the tasks should be diverse and structured enough to be learned quickly and to transfer to their human-specified counterparts. Hence, they employ an adjusted k-means clustering method to group data points by salient features and then sample partitions, clusters, embeddings, and permutations of one-shot labels. They call this clustering to automatically construct tasks for unsupervised meta-learning (CACTUs), as sketched below.
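The sketch below illustrates the task-construction step; it assumes cluster_ids come from running k-means on the unsupervised embeddings, and it omits the paper's sampling over multiple partitions for brevity.

```python
import numpy as np

def sample_task(embeddings, cluster_ids, n_way=5, k_shot=1, seed=0):
    # CACTUs-style construction (sketch): k-means cluster indices over
    # unsupervised embeddings act as surrogate class labels, from which
    # an N-way, K-shot classification task is sampled.
    rng = np.random.default_rng(seed)
    chosen = rng.choice(np.unique(cluster_ids), size=n_way, replace=False)
    xs, ys = [], []
    for label, c in enumerate(chosen):
        members = np.flatnonzero(cluster_ids == c)
        idx = rng.choice(members, size=k_shot, replace=False)
        xs.append(embeddings[idx])
        ys.extend([label] * k_shot)
    return np.concatenate(xs), np.array(ys)
```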

Hsu et al. empirically demonstrated that this method improves upon the utility of the underlying unsupervised representations in learning downstream, human-specified tasks, and that this result holds across benchmark datasets and tasks. One particular concern, however, is that MNIST, Omniglot, and miniImageNet exhibit particular structures, perhaps because they were designed as supervised benchmarks, which may flatter the experimental evaluation.

Besides extending the semi-supervised nature of meta-RL, there are other optimizations of Meta-RL. For example, Lan et al. (2019) improved the conventional meta-RL methodology with task embeddings and a shared policy.

In reinforcement learning, meta-learning serves as a guide for taking actions such that the cumulative reward in an environment is maximized. However, an obstacle in most meta-RL methods is that they fail to adequately and explicitly model the individuality and commonness of tasks. To further investigate the application of meta-learning in RL domains, Lan et al. proposed to capture shared information across different tasks and task-specific information simultaneously. They introduced a new component into the existing meta-RL methodology, a task encoder, and developed a new approach with better performance in training on and acting upon novel tasks. Involving a task encoder with fast adaptation and a shared policy, the method is named TESP. TESP explicitly models the individuality and commonness of tasks: it learns a shared policy that characterizes the task commonness, which enables the meta-learner to quickly abstract the individuality of each new task.

Lan et al. evaluated the proposed method on four task families in the MuJoCo simulator. As the results reveal, TESP significantly outperforms all baselines on all four, indicating a better learning capacity. Moreover, unlike the baseline models, TESP maintains good performance on out-of-distribution tasks and avoids overfitting the training distribution. Combined, these results illustrate that TESP learns not only faster but also in a more generalizable way.

Other optimizations of meta-learning over RL have been proposed, for example guiding the policy search. Though promising at leveraging past experience to solve new tasks, meta-RL algorithms have another general defect: they require a large amount of on-policy experience during meta-training, so they are restricted to much simpler domains. Noting that meta-reinforcement learning does not actually entail reinforcement learning during the meta-training process, Mendonca et al. (2019) proposed to learn a reinforcement learning procedure through imitation of expert policies that solve previously seen tasks. This yields a meta-RL method that learns fast reinforcement learning via supervised imitation. By employing demonstrations during meta-training, this method can meta-learn adaptation skills with 10x fewer episodes of interaction, which enables more viable real-world applications.

To reduce the total training experience, Mendonca et al. separated meta-training into two phases: a phase that solves each meta-training task individually and a second phase that utilizes these solutions for meta-learning. In the first phase, the algorithm learns a policy for each of the meta-training tasks; in the second phase, it meta-learns using these policies as supervision. They named this process guided meta-policy search (GMPS).

Not only is the method theoretically shown to achieve near-optimal cumulative reward when supplied with near-optimal experts, it has also been experimentally shown to exploit extra supervision more easily. On sparse reward tasks, GMPS adapts to validation tasks better than a policy pre-trained with multi-task imitation. On vision-based tasks, GMPS appears more stable and achieves higher rewards than MAML, just as in the sparse reward case. These results imply promising applications in robotics.

Apart from mere optimizations, Du et al. (2019) proposed an interesting application of model-based reinforcement learning that learns task-agnostic dynamics priors from video inputs. It consists of two main steps: a frame prediction model (in this method, SpatialNet) is first pretrained, and this model is then used to initialize the dynamics model of an RL agent. The first step requires predicting physical behavior such that the model performs both isolation of entities and accurate modeling of localized spaces and their interactions. Du et al. chose a conceptually simple SpatialNet architecture, which they show is not biased and does not account for ego-dynamics. To incorporate the dynamics model into a reinforcement learning setup, the authors combined Mnih’s previous model with Proximal Policy Optimization (PPO); they call the result an Intuitive Physics Agent (IPA).

Experiments on physics video datasets suggest that SpatialNet encourages better dynamics generalization and can maintain background details, unlike RCNet. Trials on PhysWorld (a collection of 2D physics games) also indicate that IPA with SpatialNet performs better than JISP, PPO, and IPA with RCNet or ConvLSTM. Crucially, IPA encourages the policy to take into account the future physics of entities.

Few-Shot Learning

In real-world applications, a huge training dataset is not always feasible; that is, machine learning algorithms at times need to be trained on limited input. For instance, we sometimes need to classify images with only two or three examples per class. Whereas humans handle these “few-shot” tasks (i.e., learning from a few labeled examples) well through cognitive concepts and prototypes, machine learning methods are still far from human performance.

To accomplish few-shot learning, recent approaches generally utilize additional information from a large base dataset (Bennequin, 2019). It is therefore fair to say that many few-shot learning methods are in fact meta-learning methods that specialize in few-shot training. As cited by Bennequin, memory-augmented networks, metric learning, gradient-based meta-learners (like MAML), and data generation are examples of popular solutions.

Although Finn et al. only tested it on supervised regression, classification, and reinforcement learning tasks, MAML can be extended to many other scenarios necessitating the fast adaptation of a deep neural network, such as few-shot image recognition. Although theoretically applicable, experiments applying the Model-Agnostic Meta-Learning algorithm to the YOLO detector failed, doomed by the low prediction of the objectness confidence (Bennequin, 2019).

Identifying that MAML only learns an initial model, since its update rule is fixed to classic gradient descent, Jamal and Qi (2019) argued that such a biased initial model may not generalize well to an unseen task that deviates from the meta-training tasks. They therefore proposed a Task-Agnostic Meta-Learning (TAML) algorithm for few-shot learning. To solve the central issue of MAML, the discrepancy between future tasks and training tasks, they imposed an unbiased, task-agnostic prior on the initial model by preventing it from over-performing on some tasks; this allowed the meta-learner to achieve a more competitive update rule.

There are two novel paradigms of TAML: one based on entropy and one based on inequality-minimization measures. While the entropy-based approach is only amenable to discrete labels and thus less suited to regression and reinforcement learning, the inequality-minimization approach, using measures like the Theil index, the Atkinson index, and the Gini coefficient, applies to wider settings; a sketch of one such measure follows. Experimentation shows that the entropy-based approach performs somewhat better than the inequality-based one, and that both variants of TAML consistently outperform existing meta-learning algorithms, including MAML, on few-shot learning.
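As an illustration of the inequality-based idea, the sketch below computes a Gini coefficient over a batch of per-task losses; using it as a weighted regularizer on the meta-objective is a simplified reading of TAML, not the paper's exact formulation.

```python
import jax.numpy as jnp

def gini(task_losses):
    # Gini coefficient of the initial model's losses across a task batch:
    # mean absolute pairwise difference, normalized by twice the mean.
    diffs = jnp.abs(task_losses[:, None] - task_losses[None, :])
    return jnp.mean(diffs) / (2.0 * jnp.mean(task_losses) + 1e-8)

# Sketch of an inequality-regularized meta-objective (lam illustrative):
#   meta_loss = mean(adapted query losses) + lam * gini(initial_losses)
```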

Though recent deep learning algorithms, relying on gradient-based optimization, have made considerable breakthroughs, a persistent challenge remains in one-shot learning, a special case of few-shot learning. In the conventional approach to deep neural networks, a model must “forget” its trained parameters in order to adequately incorporate new information. However, Santoro et al. (2016) claim that neural networks with a memory capacity provide a promising approach to meta-learning in deep networks.

They demonstrated the capability of a highly capable memory-augmented neural network (MANN) to meta-learn in tasks that carry significant short- and long-term memory demands. They also showed that their approach combines the ability to slowly learn an abstract method for obtaining useful representations of raw data (via gradient descent) with the ability to rapidly bind new information after a single presentation (via an external memory module). This suggests that memory-augmented neural networks can indeed be effective for one-shot learning.

Experimental results on Omniglot classification and regression tasks confirmed that this model compares well with human baselines and with other models like feedforward networks and LSTMs. Such superior performance persists even when only sparse training data is available, as a result of gradual, incremental learning in conjunction with a more flexible memory resource.

At the other extreme of few-shot learning, Zero-Shot Learning (ZSL) approaches can tackle the challenging circumstance in which new categories appear after the learning stage. ZSL is inspired by the human ability to identify new objects by merely reading a description of them and relating it to previously learned concepts, and it is inherently a two-stage process. Romera-Paredes et al. (2017) studied a framework capable of integrating both stages and, based on that, proposed a ZSL approach that performs efficiently at both the training and inference stages and is extremely simple to implement – one line of code for each stage. It models the relationships between features, attributes, and classes as a network of two linear layers, where the weights of the top layer are not learned but given by the environment. The method is aptly named Embarrassingly Simple ZSL (ESZSL). The authors also provide an analysis of a bound on the generalization error of this approach.
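Below is a sketch of both stages following the paper's closed-form formulation; the matrix names and regularization constants are illustrative assumptions.

```python
import jax.numpy as jnp

def eszsl_train(X, Y, S, gamma=1.0, lam=1.0):
    # X: d x m training features, Y: m x z one-hot labels,
    # S: a x z attribute signatures of the z seen classes.
    # The closed-form, ridge-style solve below is the "one line" of training.
    d, a = X.shape[0], S.shape[0]
    A = X @ X.T + gamma * jnp.eye(d)
    B = S @ S.T + lam * jnp.eye(a)
    return jnp.linalg.solve(A, X @ Y @ S.T) @ jnp.linalg.inv(B)  # V: d x a

def eszsl_predict(V, x, S_unseen):
    # Inference: score each unseen class by its attribute signature.
    return jnp.argmax(x @ V @ S_unseen)
```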

Both synthetic and real-data experiments show that ESZSL significantly outperforms state-of-the-art algorithms like DAP and ZSRwUA. A modified version, ESZSL All Signatures (ESZSL-AS), performs even better in some cases. This comports with the authors’ claim that the method is robust to attributes having different discriminative capabilities for characterizing the classes.

Conclusion

This survey shows three exemplary ways in which meta-learning embodies itself. From Model-Agnostic Meta-Learning to Meta-Reinforcement Learning and Few-Shot Learning, each category contains varied implementations. The survey presented the iteration and evolution of algorithm series like MAML, Alpha MAML, and MAML++; the attempts at different meta-learning applications, like RL in robotics, image recognition, and simulation; and the specialization of algorithms, like few-shot learning as a few-sample adaptation of supervised meta-learning. Along with other recent meta-learning works, they show that exploiting meta-knowledge can proceed in many directions; in other words, there is a wide spectrum of approaches yet to be explored.

The No Free Lunch theorem dictates that no optimization algorithm can outperform even a baseline model across all possible problems (Wolpert & Macready, 1997); however, meta-learning optimizations do not fall within its setting. The fundamental structure of meta-learning is distinct from learning at the base level, given the presence of meta-knowledge; it more resembles the structure of human learning, where a large amount of meta-information can be stored. As Giraud-Carrier and Provost claimed, meta-learning offers a natural alternative for constructing a general-purpose learning algorithm, and though NFL may apply at the meta-level, the cost of our free lunch would be theoretically reasonable (Giraud-Carrier & Provost, 2005). Keeping an open view toward novel means of exploiting meta-knowledge may allow better insights. We have already encountered quirky methods like learning gradient descent by gradient descent and zero-shot learning in one line of code; so, while not expecting overly simple hacks, we should anticipate an epiphany – both in machines’ minds and in ours.

References