## Multi-Class Probabilistic Active Learning

### Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, Myra Spiliopoulou

Knowledge Management and Discovery Lab
Otto-von-Guericke-University Magdeburg, Germany

daniel.kottke@ovgu.de

www.daniel.kottke.eu/talks/2016_ECAI

## Active learning in times of Big data?

• "Big data" lead to a huge amount of data (mostly unlabeled)
• Results from unlabeled data mostly too general
• But annotations are expensive, exhausting, time-consuming
• Hence minimizing the annotation effort is preferred

$$\Rightarrow$$ Active Learning

## Random

Selects label candidates at random.

 The probability for each label candidate to be selected is equal.

## Uncertainty Sampling

Prefers label candidates near the decision boundary.

 Candidates in the darker green area (near the decision boundary) are preferred.

## Expected Error Reduction

Simulates the selection of each candidate and each class and estimates the expected error reduction on the data.

 Candidates in the darker green area (complex calculation) are preferred.

## Our method: Multi-Class Probabilistic Active Learning

Basic idea and main tools

## Probabilistic Active Learning [1]

• estimate the performance gain of acquiring a label

[1] "Optimised probabilistic active learning", by G. Krempl, D. Kottke, V. Lemaire.
Machine Learning, 2014.

## Probabilistic Active Learning [1]

• estimate the performance gain of acquiring a label by using local statistics around each candidate

[1] "Optimised probabilistic active learning", by G. Krempl, D. Kottke, V. Lemaire.
Machine Learning, 2014.

## Probabilistic Active Learning [1]

• estimate the performance gain of acquiring a label by using local statistics around each candidate
• frequency estimates (count the number of nearby labels)
$$\rightarrow$$ posterior, reliability

[1] "Optimised probabilistic active learning", by G. Krempl, D. Kottke, V. Lemaire.
Machine Learning, 2014.

## Probabilistic Active Learning [1]

• estimate the performance gain of acquiring a label by using local statistics around each candidate
• frequency estimates (count the number of nearby labels)
$$\rightarrow$$ posterior, reliability
• density estimates (share of instances in the neighborhood)
$$\rightarrow$$ impact

[1] "Optimised probabilistic active learning", by G. Krempl, D. Kottke, V. Lemaire.
Machine Learning, 2014.

## Posterior and its reliability

• Label frequencies summarize information about the posterior and the reliability of this posterior
• Provides local performance estimate by:
• modeling the true posterior $$\vec{p}$$ as a random experiment
• modeling all possible label realizations $$\vec{l}$$ as a random experiment $\operatorname{expPerf}(\vec{k},m) = \mathbb{E}_{\vec{p}}[ \mathbb{E}_{\vec{l}}[ \operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p}) ]]$ ($$m$$ is the number, how many additional hypothetical labels are considered)
• The difference of the current performance and the performance with additional labels builds the performance gain $\operatorname{perfGain}(\vec{k},m) = \frac{1}{m} \big(\operatorname{expPerf}(\vec{k},m) - \operatorname{expPerf}(\vec{k},0) \big)$

## The true posterior

Given the true posterior, the observed labels are Multinomial distributed.

The normalized likelihood function is:

$P(\vec{p} \mid \vec{k}) = \frac{ \operatorname{\Gamma}{\sum(k_i + 1)} }{ \prod\left(\operatorname{\Gamma}{k_i+1}\right) } \cdot \prod\left( p_i^{k_i} \right)$

Example: In the binary case, the normalized likelihood is a Beta-distribution (see figure).

• no labels $$\rightarrow$$ uniform
• more labels $$\rightarrow$$ less variance
• peak at the observed posterior

## The labeling vector

Describe all possible combinations of labels that could appear in that region given that $$m$$ labels are acquired.

Three-Class Example ($$m=2$$): $\vec{l} \in \{(2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1)\}$

This labeling vector is Multinomial distributed.$P(\vec{l} \mid \vec{p}) = \textrm{Multinomial}_{\vec{p}}(\vec{l}) = \frac{ \operatorname{\Gamma}((\sum l_i) + 1) }{ \prod\left(\operatorname{\Gamma}(l_i+1)\right) } \cdot \prod\left( p_i^{l_i} \right)$

## Impact of a candidate

• Idea: prefer candidates that would have a higher impact on the overall classification performance
• Here: density estimate

## Closed-form solution

The main contribution is to have a fast closed-form solution for the calculation of a candidate's usefulness.

$\operatorname{expPerf}(\vec{k}, m) = \mathbb{E}_{\vec{p}}[ \mathbb{E}_{\vec{l}}[ \operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p}) ]] \\ = \sum_{\vec{l}} \left(\prod_{j = \sum(k_i + 1)}^{\big(\sum(k_i + l_i + d_i + 1) \big) - 1} \frac{1}{j} \right) \cdot \prod \left( \prod_{j=k_i+1}^{k_i + l_i + d_i} j \right) \cdot \frac{ \operatorname{\Gamma}((\sum l_i) + 1) }{ \prod\left(\operatorname{\Gamma}(l_i+1)\right) }$

(details in the paper)

## Experimental Setup

• Algorithms:
• McPAL
• Conf, Entr, Random
• BvsSB, MaxECost, VOI
• Classifiers: Parzen Window, pKNN (real-valued frequencies)
• 100 random partitions into training and test set (trials)
• Mean + variance curves and tables
• Trial-wise comparisons: statistical tests on superiority (Wilcoxon signed rank test ($$p=.05$$) + Hommel procedure)

## Results summary

• Especially beneficial at the beginning (all results converge to the same level)
• After 20 acquisitions: Best mean performance in 11/12 cases (significantly better than all other methods in 9/12 cases)
• No clear 2nd winner
• Execution time: Much faster than EER, slightly slower than US.
• Analysis of the influence factors.

## Conclusions

McPAL combines well-known advantages by using local statistics:

• decision theoretic (expected error reduction)
• fast (uncertainty sampling)
• exploring (random sampling).

McPAL is superior in experimental evaluation.

Code is available: https://kmd.cs.ovgu.de/res/mcpal/
(ready to use in practice - cooperations welcome)

## Thank you for your attention!

Slides, Paper, Bibtex:
www.daniel.kottke.eu/talks/2016_ECAI

Supplemental material:
kmd.cs.ovgu.de/res/mcpal/

Workshop at iKNOW (Oct 18, 2016):
Active Learning: Applications, Foundations and Emerging Trends
http://i-know.tugraz.at/

Multi-Class Probabilistic Active Learning
Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, Myra Spiliopoulou
European Conference on Artificial Intelligence (ECAI)
The Hague, Netherlands, 2016.