Probabilistic Active Learning in Datastreams

Daniel Kottke, Georg Krempl, Myra Spiliopoulou

Knowledge Management and Discovery Lab
Otto-von-Guericke-University Magdeburg, Germany

daniel.kottke@ovgu.de

www.daniel.kottke.eu/talks/2015_IDA

Pool vs. Stream Learning

  • Spatial components is extended with temporal information
    • Classification model might change (Drift)

  • Fast, endless instance generation
    • Efficient (on-line) algorithms required

  • Applications:
    • Data from twitter, sensors, bank transactions

Acquiring the next label?

Budget

  • How many labels to buy?

 

Spatial component

  • Where to buy (feature vectors)?

Temporal component

  • When to buy? (time)

Our method

Budgets in streams

  • Pools: absolute number (e.g. stop after 40 labels)
  • Streams: relative definition necessary (e.g. buy 10%)
     
  • Best budget fit with sequential acquisitions
  • Better: spend budget in expectation and control variance through tolerance window
  • Here: budget is constant over time for constant labeling capacities, permanently updated models, fair evaluation

Divide and Conquer

 

Divide and Conquer

 

Divide and Conquer

Measure spatial usefulness from Probabilistic Active Learning

Divide and Conquer

 

Divide and Conquer

Choose the most useful instances based on this value

Spatial Selection (Recap):
Probabilistic Active Learning


Evaluates each labeling candidate in its neighborhood using its label statistics:
  • Posterior probability (\(\hat{p}\))
  • Number of labels (\(n\))

Image: "Probabilistic Active Learning: Towards Combining Versatility, Optimality and Efficiency"
by G. Krempl, D. Kottke, M. Spiliopoulou, Proc. of the 17th Int. Conf. on Discovery Science, Bled, 2014. Springer.

Spatial Selection (Recap):
Probabilistic Active Learning

  • Probabilistic description of expected performance gain using the expectation over:
    • the possible label outcomes (\(y\)) of the candidate
    • the true posterior probability (\(p\))
\[ \mathrm{pgain}((n, \hat{p})) = \mathtt{E}_{p}\bigg[ \mathtt{E}_{y} \left[ \mathrm{gain}_{p}((n, \hat{p}),y) \right] \bigg] \\ = \int_0^1 { \mathrm{Beta}_{n \hat{p} + 1,n (1-\hat{p}) + 1}(p) \cdot \sum_{y \in \{0,1\}} \mathrm{Ber}_{p}(y) \cdot \mathrm{gain}_{p}((n, \hat{p}),y) } \, \mathrm{d} p \]

Temporal selection:
Problem Specification

  • Input: Single-value stream of spatial usefulness values
    • Arbitrary range
    • Arbitrary distribution
    • Changing distribution (dependent on previous decisions)
  • Task: Select the best b % of this value stream instantly
  • Subtask: Difference of current and target budget should not exceed a given tolerance window

Temporal selection:
Main Idea

  • Store the most recent usefulness values in a list
  • When a new value appears:
    • Determine its relative rank in that list
    • If it belongs to the better b % of that list,
      acquire the corresponding label

Temporal selection:
IQF - Properties

  • Selects the best b % of a single value stream in expectation
  • Only works for value streams coming from one single distribution
     
  • But: Spatial usefulness probably decrease as classification performance increases over time \(\rightarrow\) less labels were acquired
  • Idea: Threshold correction that ensures that the target budget is met w.r.t. a tolerance window

Temporal selection:
Balancing

  • Tolerance window (\(w_\textrm{tol}\)):
    maximal difference between current and the target budget
  • If there are label acquistions left (\(acq_\textrm{left}\) > 0)
    \(\rightarrow\) decrease threshold \(\theta\) (and vice versa)


\[ \theta_\textrm{bal} = \theta - f \cdot acq_\textrm{left} \]
How to define  \(f\)?

Temporal selection:
Balancing

Defining the normalization factor  \(f\):

  • Scales with the range of the data (\(\Delta\))
  • The higher the tolerance window (\(w_\textrm{tol}\)), the smaller the correction

\[ \theta_\textrm{bal} = \theta - \Delta \cdot \frac{acq_\textrm{left}}{w_\textrm{tol}} \]
  • \(\theta_\textrm{bal}\) - Balanced threshold
  • \(\theta\) - IQF acquisition threshold
  • \(\Delta\) - Data range of IQF window
  • \(w_\textrm{tol}\) - Tolerance window size

Evaluation

Experimental Settings

  • Datasets: 5 synthetic, 2 real
  • Algorithms:
    • PAL + BIQF
    • Random, Variable Uncertainty¹ (VarUncer), Split¹
    • Split + BIQF, VarUncer + BIQF
  • 100 random test and train streams
  • Different budgets  \(b \in \{0.02, 0,05, 0.1, 0.2, 0.3, 0.5, 0.7\}\)
  • BIQF parameter:  \(w = 100, w_{\mathrm{tol}} = 50\)
  • ¹ I. Zliobaite, A. Bifet, B. Pfahringer, G. Holmes: Active learning with drifting streaming data. IEEE Trans. Neural Netw. Learn. Syst. 25(1), 2014.

Evaluation of Active Learning

Evaluation of Active Learning

Evaluation of Active Learning

Conclusion

Active Learning algorithm for Datastreams that
 
  • is able to reach a defined budget within a given tolerance window
  • separates the spatial and temporal usefulness
  • uses Probabilistic Active Learning for spatial usefulness
  • uses the new Balanced Incremental Quantile Filter (BIQF) to determine the temporal usefulness of instances
  • is superior to state-of-the-art methods (esp. at the beginning)

Thank you for your attention!

Slides, Paper, Bibtex:
www.daniel.kottke.eu/talks/2015_IDA


Supplemental material:
kmd.cs.ovgu.de/res/pal/

Probabilistic Active Learning in Datastreams
Daniel Kottke, Georg Krempl, Myra Spiliopoulou
Fourteenth International Symposium on Intelligent Data Analysis (IDA)
Saint-Etienne, France, 2015. Springer.