An active learner has a collection of data points, each with a label that is initially hidden but can be
obtained at some cost. Without spending too much, it wishes to find a classifier that will accurately map
points to labels. There are two common intuitions about how this learning process should be organized:
(i) by choosing query points that shrink the space of candidate classifiers as rapidly as possible; and (ii) by
exploiting natural clusters in the (unlabeled) data set. Recent research has yielded learning algorithms for
both paradigms that are efficient, work with generic hypothesis classes, and have rigorously characterized
labeling requirements. Here we survey these advances by focusing on two representative algorithms and
discussing their mathematical properties and empirical performance.
1 Introduction
As digital storage gets cheaper, and sensing devices proliferate, and the web grows ever larger, it gets easier
to amass vast quantities of unlabeled data – raw speech, images, text documents, and so on. But to build
classifiers from these data, labels are needed, and obtaining them can be costly and time-consuming. When
building a speech recognizer, for instance, the speech signal comes cheap, but thereafter a human must examine
the waveform and label the beginning and end of each phoneme within it. This is tedious and painstaking,
and requires expertise.
We will consider situations in which we are given a large set of unlabeled points from some domain X,
each of which has a hidden label, from a finite set Y, that can be queried. The idea is to find a good classifier,
a mapping h : X → Y from a pre-specified set H, without making too many queries. For instance, each
x ∈ X might be the description of a molecule, with its label y ∈ {+1, −1} denoting whether or not it binds
to a particular target of interest. If the x’s are vectors, a possible choice of H is the class of linear separators.
In this setting, a supervised learner would query a random subset of the unlabeled data and ignore the
rest. A semisupervised learner would do the same, but would keep around the unlabeled points and use them
to constrain the choice of classifier. Most ambitious of all, an active learner would try to get the most out
of a limited budget by choosing its query points in an intelligent and adaptive manner (Figure 1).
1.1 A model for analyzing sampling strategies
The practical necessity of active learning has resulted in a glut of different querying strategies over the past
decade. These differ in detail, but very often conform to the following basic paradigm:
Start with a pool of unlabeled data S ⊂ X
Pick a few points from S at random and get their labels
Repeat:
    Fit a classifier h ∈ H to the labels seen so far
    Query the unlabeled point in S closest to the boundary of h
    (or most uncertain, or most likely to decrease overall uncertainty, ...)
This high-level scheme (Figure 2) has a ready and intuitive appeal. But how can it be analyzed?
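To make the paradigm concrete, here is a minimal sketch in Python, using scikit-learn's logistic regression as the hypothesis class and distance to the decision boundary as the uncertainty measure. The oracle callback, seed size, and query budget are illustrative choices for the sketch, not part of any particular published scheme.

import numpy as np
from sklearn.linear_model import LogisticRegression

def uncertainty_sampling(X_pool, oracle, seed_size=10, budget=50, rng=None):
    """Query up to `budget` labels; oracle(i) returns the label of X_pool[i]."""
    rng = rng if rng is not None else np.random.default_rng(0)
    n = len(X_pool)
    # Pick a few points from the pool at random and get their labels.
    # (The sketch assumes this random seed set contains both classes.)
    queried = list(rng.choice(n, size=seed_size, replace=False))
    labels = {i: oracle(i) for i in queried}
    clf = LogisticRegression()
    for _ in range(budget - seed_size):
        # Fit a classifier to the labels seen so far.
        clf.fit(X_pool[queried], [labels[i] for i in queried])
        # Query the unlabeled point closest to the current boundary.
        candidates = [i for i in range(n) if i not in labels]
        margins = np.abs(clf.decision_function(X_pool[candidates]))
        chosen = candidates[int(np.argmin(margins))]
        labels[chosen] = oracle(chosen)
        queried.append(chosen)
    clf.fit(X_pool[queried], [labels[i] for i in queried])
    return clf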
1.2 Two faces of active learning
Assuming sampling bias is correctly handled, how exactly is active learning helpful? The recent literature
offers two distinct narratives for explaining this. The first has to do with efficient search through the
hypothesis space. Each time a new label is seen, the current version space – the set of classifiers that are still
“in the running” given the labels seen so far – shrinks somewhat. It is reasonable to try to explicitly select
points whose labels will shrink this version space as fast as possible (Figure 4, left). Most theoretical work
in active learning attempts to formalize this intuition. There are now several learning algorithms which, on
a variety of canonical examples, provably yield significantly lower label complexity than supervised learning
[9, 13, 10, 2, 3, 7, 15, 16, 12]. We will use one particular such scheme [12] as an illustration of the general
theory.
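The canonical example of this speed-up is learning a threshold on the line: the version space is an interval of candidate thresholds, and querying the point in its middle halves that interval with every label, so roughly log(1/ε) labels suffice where a passive learner needs on the order of 1/ε. The following toy sketch makes this explicit; it is not the scheme of [12], just the one-dimensional special case, and it assumes the realizable setting with at least one positive point.

import numpy as np

def threshold_binary_search(xs, oracle):
    """xs: 1-d points sorted in increasing order; oracle(i) -> label in {-1, +1}.
    Assumes labels are realizable by some threshold, with at least one +1."""
    lo, hi = 0, len(xs) - 1   # version space: the boundary index lies in [lo, hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2  # query the midpoint of the version space
        queries += 1
        if oracle(mid) == -1:
            lo = mid + 1      # boundary must lie to the right of mid
        else:
            hi = mid          # boundary is at mid or to its left
    return xs[lo], queries    # estimated threshold and number of labels used

On a pool of a million points, this locates the boundary with about twenty labels, whereas a passive learner drawing labeled points at random would need vastly more to pin it down comparably.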
The second argument for active learning has to do with exploiting cluster structure in data. Suppose, for
instance, that the unlabeled points form five nice clusters (Figure 4, right); with luck, these clusters will be
pure in their class labels and only five labels will be necessary! Of course, this is hopelessly optimistic. In
general, there may be no nice clusters, or there may be viable clusterings at many different resolutions. The
clusters themselves may only be mostly-pure, or they may not be aligned with labels at all. An ideal scheme
would do something sensible for any data distribution, but would especially be able to detect and exploit
any cluster structure (loosely) aligned with class labels, at whatever resolution. This type of active learning
is a bit of an unexplored wilderness as far as theory goes. We will describe one preliminary piece of work [11]
that solves some of the problems, clarifies some of the difficulties, and opens some doors to further research.
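As a rough illustration of the intuition (and emphatically not the algorithm of [11], which maintains a hierarchical clustering and handles the sampling statistics far more carefully), one might probe each cluster with a few random labels, propagate the majority label if the cluster looks pure, and split the cluster otherwise. The function name, the purity threshold, and the 2-way k-means split below are all invented for the sketch, which is not robust code.

import numpy as np
from sklearn.cluster import KMeans

def cluster_label(X, idx, oracle, out, probes=5, purity=0.9, rng=None):
    """Fill out[i] for every i in idx, querying oracle(i) sparingly."""
    rng = rng if rng is not None else np.random.default_rng(0)
    # Probe a few random points in this cluster.
    picks = rng.choice(idx, size=min(probes, len(idx)), replace=False)
    votes = [oracle(i) for i in picks]
    majority = max(set(votes), key=votes.count)
    if votes.count(majority) / len(votes) >= purity or len(idx) <= probes:
        for i in idx:               # cluster looks pure: propagate its label
            out[i] = majority
        return
    # Cluster looks mixed: split it in two and recurse on each half.
    halves = KMeans(n_clusters=2, n_init=10).fit_predict(X[idx])
    for h in (0, 1):
        cluster_label(X, idx[halves == h], oracle, out, probes, purity, rng)

Called as cluster_label(X, np.arange(len(X)), oracle, out={}), this spends only a handful of queries on regions where the clusters happen to be label-pure, and degrades toward querying everything when they are not.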