Probabilistic Active Learning

Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, Myra Spiliopoulou

Knowledge Management and Discovery Lab

Otto-von-Guericke-University Magdeburg, Germany

daniel.kottke@ovgu.de

**www.daniel.kottke.eu/talks/2016_ECAI**

- "Big data" lead to a huge amount of data (mostly unlabeled)
- Results from unlabeled data mostly too general
- More information has to be added (labels, annotations)
- But annotations are expensive, exhausting, time-consuming
- Hence minimizing the annotation effort is preferred

**\(\Rightarrow\) Active Learning**

Basic techniques in active learning

| Technique | Strategy | Illustration |
| --- | --- | --- |
| Random sampling | Selects label candidates at random. | The probability for each label candidate to be selected is equal. |
| Uncertainty sampling | Prefers label candidates near the decision boundary. | Candidates in the darker green area (near the decision boundary) are preferred. |
| Expected error reduction | Simulates the selection of each candidate for each class and estimates the expected error reduction on the data. | Candidates in the darker green area (complex calculation) are preferred. |
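For concreteness, a minimal sketch of the first two strategies, assuming a NumPy array of unlabeled candidates and a scikit-learn-style classifier with a `predict_proba` method (both names are illustrative assumptions, not part of the poster):

```python
import numpy as np

def select_random(candidates, rng=None):
    # Random sampling: every unlabeled candidate is equally likely to be picked.
    rng = rng or np.random.default_rng()
    return int(rng.integers(len(candidates)))

def select_uncertain(candidates, classifier):
    # Uncertainty sampling (least confidence): prefer the candidate whose most
    # probable class has the lowest estimated posterior, i.e. a point close to
    # the decision boundary.
    proba = classifier.predict_proba(candidates)  # shape (n_candidates, n_classes)
    return int(np.argmin(proba.max(axis=1)))
```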

Multi-Class Probabilistic Active Learning

Basic idea and main tools

- estimate the **performance gain** of acquiring a label by using **local statistics** around each candidate:
  - frequency estimates (count the number of nearby labels) \(\rightarrow\) **posterior, reliability**
  - density estimates (share of instances in the neighborhood) \(\rightarrow\) **impact**

[1] "Optimised probabilistic active learning", by G. Krempl, D. Kottke, V. Lemaire. Machine Learning, 2014.
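A minimal sketch of how such local statistics could be computed with a Gaussian (Parzen window) kernel; the function name, bandwidth, and data layout are assumptions made for illustration:

```python
import numpy as np

def local_statistics(x, X_labeled, y_labeled, X_all, n_classes, bandwidth=1.0):
    # Gaussian (Parzen window) kernel weights relative to the candidate x.
    def kernel_weights(X):
        sq_dist = np.sum((X - x) ** 2, axis=1)
        return np.exp(-sq_dist / (2 * bandwidth ** 2))

    w = kernel_weights(X_labeled)
    # Frequency estimate k: kernel-weighted label counts per class
    # ("number of nearby labels" -> posterior and its reliability).
    k = np.array([w[y_labeled == c].sum() for c in range(n_classes)])

    # Density estimate: share of all instances in the neighborhood of x
    # (-> impact of a label at x on the overall performance).
    density = kernel_weights(X_all).sum() / len(X_all)
    return k, density
```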

- Label frequencies summarize information about the posterior and the reliability of this posterior
- They provide a local performance estimate by:
  - modeling the **true posterior** \(\vec{p}\) as a random experiment
  - modeling all **possible label realizations** \(\vec{l}\) as a random experiment

\[\operatorname{expPerf}(\vec{k},m) = \mathbb{E}_{\vec{p}}\big[ \mathbb{E}_{\vec{l}}[ \operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p}) ] \big]\]

(\(m\) is the number of additional hypothetical labels that are considered)

- The difference between the current performance and the performance with \(m\) additional labels yields the performance gain:

\[\operatorname{perfGain}(\vec{k},m) = \frac{1}{m} \big(\operatorname{expPerf}(\vec{k},m) - \operatorname{expPerf}(\vec{k},0) \big)\]
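The two nested expectations can be approximated numerically straight from this definition. The sketch below is a Monte Carlo illustration, assuming \(\operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p})\) is the true-posterior probability of the class predicted from the updated counts and drawing \(\vec{p}\) from the normalized likelihood \(P(\vec{p} \mid \vec{k})\) defined below; the paper derives a closed form instead of sampling:

```python
import numpy as np

def exp_perf_mc(k, m, n_samples=20_000, rng=None):
    # Monte Carlo estimate of expPerf(k, m) = E_p[ E_l[ acc(k + l | p) ] ].
    rng = rng or np.random.default_rng(0)
    k = np.asarray(k, dtype=float)
    # True posterior p ~ normalized likelihood of the observed counts k,
    # which is a Dirichlet(k + 1) distribution.
    p = rng.dirichlet(k + 1, size=n_samples)
    # Hypothetical label realizations l ~ Multinomial(m, p).
    l = np.array([rng.multinomial(m, pi) for pi in p])
    # acc(k + l | p): posterior mass of the class predicted from k + l.
    pred = np.argmax(k + l, axis=1)
    return float(p[np.arange(n_samples), pred].mean())

def perf_gain(k, m):
    # Average performance gain per additional hypothetical label (m >= 1).
    return (exp_perf_mc(k, m) - exp_perf_mc(k, 0)) / m
```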

Given the true posterior, the observed labels are multinomially distributed.

The normalized likelihood function is:

\[P(\vec{p} \mid \vec{k}) = \frac{ \Gamma\big(\sum(k_i + 1)\big) }{ \prod \Gamma(k_i+1) } \cdot \prod p_i^{k_i} \]

Example: In the binary case, the normalized likelihood is a Beta-distribution (see figure).

- no labels \(\rightarrow\) uniform
- more labels \(\rightarrow\) less variance
- peak at the observed posterior
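A quick numeric check of this binary-case example, assuming SciPy is available; the label counts are made-up illustrations:

```python
import numpy as np
from scipy.stats import beta

# Binary case: with observed counts k = (k_pos, k_neg), the normalized
# likelihood P(p | k) is the Beta(k_pos + 1, k_neg + 1) density over p = P(y=+).
p_grid = np.linspace(0, 1, 201)
for k_pos, k_neg in [(0, 0), (1, 2), (10, 20)]:
    dens = beta.pdf(p_grid, k_pos + 1, k_neg + 1)
    # (0, 0) gives the uniform density; more labels concentrate the density
    # around the observed posterior k_pos / (k_pos + k_neg).
    print((k_pos, k_neg), dens.max())
```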

The label vectors \(\vec{l}\) describe all possible combinations of labels that could appear in that region if \(m\) labels are acquired.

Three-Class Example (\(m=2\)): \[\vec{l} \in \{(2,0,0), (0,2,0), (0,0,2), (1,1,0), (1,0,1), (0,1,1)\} \]

The label vector is multinomially distributed:

\[P(\vec{l} \mid \vec{p}) = \textrm{Multinomial}_{\vec{p}}(\vec{l}) = \frac{ \Gamma\big((\sum l_i) + 1\big) }{ \prod \Gamma(l_i+1) } \cdot \prod p_i^{l_i} \]
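A small sketch that enumerates these label vectors and evaluates the multinomial probability above; the class count, \(m\), and the example posterior are arbitrary choices:

```python
from itertools import product
from math import gamma, prod

def label_vectors(n_classes, m):
    # All ways to distribute m hypothetical labels over n_classes classes.
    return [l for l in product(range(m + 1), repeat=n_classes) if sum(l) == m]

def multinomial_pmf(l, p):
    # P(l | p) = Gamma(sum(l) + 1) / prod(Gamma(l_i + 1)) * prod(p_i ** l_i)
    coeff = gamma(sum(l) + 1) / prod(gamma(li + 1) for li in l)
    return coeff * prod(pi ** li for pi, li in zip(p, l))

vectors = label_vectors(3, 2)
print(vectors)                                                     # the six vectors above
print(sum(multinomial_pmf(l, (0.5, 0.3, 0.2)) for l in vectors))   # sums to 1.0
```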

- Idea: prefer candidates that would have a higher impact on the overall classification performance
- Here: density estimate

The main contribution is to have a fast closed-form solution for the calculation of a candidate's usefulness.

\[\operatorname{expPerf}(\vec{k}, m) = \mathbb{E}_{\vec{p}}\big[ \mathbb{E}_{\vec{l}}[ \operatorname{acc}(\vec{k}+\vec{l} \mid \vec{p}) ] \big] = \sum_{\vec{l}} \left(\prod_{j = \sum(k_i + 1)}^{\left(\sum(k_i + l_i + d_i + 1)\right) - 1} \frac{1}{j} \right) \cdot \prod \left( \prod_{j=k_i+1}^{k_i + l_i + d_i} j \right) \cdot \frac{ \Gamma\big((\sum l_i) + 1\big) }{ \prod \Gamma(l_i+1) } \]

(details in the paper)
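Putting the pieces together: a hypothetical glue sketch that scores each candidate by its density-weighted performance gain, reusing the `local_statistics` and `perf_gain` sketches above. This illustrates the idea of combining impact and gain, not the paper's closed-form implementation:

```python
import numpy as np

def select_mcpal(candidates, X_labeled, y_labeled, X_all, n_classes, m=2):
    # Score every unlabeled candidate by density * perfGain and query the best.
    scores = []
    for x in candidates:
        k, density = local_statistics(x, X_labeled, y_labeled, X_all, n_classes)
        scores.append(density * perf_gain(k, m))
    return int(np.argmax(scores))
```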

- Algorithms:
  - McPAL
  - Conf, Entr, Random
  - BvsSB, MaxECost, VOI
- Classifiers: Parzen Window, pKNN (real-valued frequencies)
- 100 random partitions into training and test sets (trials)
- Mean + variance curves and tables
- Trial-wise comparisons: statistical tests on superiority (Wilcoxon signed rank test (\(p=.05\)) + Hommel procedure)
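For the trial-wise comparisons, a sketch of how such a superiority test could look with SciPy; the accuracy arrays are placeholder values, not the paper's results:

```python
import numpy as np
from scipy.stats import wilcoxon

# Per-trial accuracies (e.g. after 20 label acquisitions) over 100 random
# train/test partitions; placeholder numbers for illustration only.
acc_mcpal = np.random.default_rng(0).uniform(0.80, 0.90, size=100)
acc_other = np.random.default_rng(1).uniform(0.78, 0.88, size=100)

# One-sided Wilcoxon signed rank test for superiority at p = .05; with multiple
# competitors, the p-values are additionally adjusted via the Hommel procedure.
stat, p_value = wilcoxon(acc_mcpal, acc_other, alternative="greater")
print(p_value < 0.05)
```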

- Especially beneficial at the beginning (all results converge to the same level)
- After 20 acquisitions: best mean performance in 11/12 cases (significantly better than *all* other methods in 9/12 cases)
- No clear second winner
- Execution time: Much faster than EER, slightly slower than US.
- Analysis of the influence factors.

McPAL combines well-known advantages by using local statistics:

- **decision theoretic** (expected error reduction)
- **fast** (uncertainty sampling)
- **exploring** (random sampling)

McPAL is superior in the experimental evaluation.

Code is available: `https://kmd.cs.ovgu.de/res/mcpal/`

(ready to use in practice - collaborations welcome)

**Slides, Paper, Bibtex:**

www.daniel.kottke.eu/talks/2016_ECAI

**Supplemental material:**

kmd.cs.ovgu.de/res/mcpal/

**Workshop at iKNOW (Oct 18, 2016):**

Active Learning: Applications, Foundations and Emerging Trends

http://i-know.tugraz.at/

*Multi-Class Probabilistic Active Learning*
Daniel Kottke, Georg Krempl, Dominik Lang, Johannes Teschner, Myra Spiliopoulou
European Conference on Artificial Intelligence (ECAI), The Hague, Netherlands, 2016.