Budget Optimization for Active Learning in Data Streams (Master’s Thesis)

Active learning is a subfield of machine learning that aims to reduce the amount of labeled data while achieving the same classification performance. As it is easy to capture unlabeled data from sensors but expensive to annotate it, the importance of active learning is growing rapidly. Personalized search engines, for instance, need user feedback to adapt their models. In a pool-based setting, active learning simply acquires the most useful instances one by one until the budget is exhausted. In stream scenarios, it is much more difficult to identify the best instances. This thesis addresses the drawbacks of recent stream-based active learning literature: it aims to find a usefulness measure that balances exploration and exploitation automatically, and to select the best values from this stream of usefulness scores.

Therefore, probabilistic active learning is applied to the stream-based setting, as it balances exploration and exploitation. Furthermore, three budgeting algorithms are proposed. The first method is the incremental percentile filter (IPF), which uses the most recent values to estimate a usefulness threshold based on a ranking strategy. Second, this algorithm is extended to correct for the trend of the usefulness curve, which typically decreases over time in active learning. The third approach combines the trend correction with the fast IPF method. An additional balancing framework ensures that the given budget is reliably met.
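To illustrate the core idea of the incremental percentile filter, the following Python sketch thresholds incoming usefulness values against a sliding window of recent scores. The class name, window size, and exact threshold rule are assumptions for illustration only, not the thesis' precise formulation (in particular, trend correction and the balancing framework are omitted).

from collections import deque

class IncrementalPercentileFilter:
    """Sketch of an IPF-style budgeting filter.

    Keeps a sliding window of the most recent usefulness values and
    labels an instance only if its usefulness ranks above the
    (1 - budget) percentile of that window.
    """

    def __init__(self, budget: float, window_size: int = 100):
        self.budget = budget                     # fraction of instances to label, e.g. 0.1
        self.window = deque(maxlen=window_size)  # most recent usefulness values

    def acquire(self, usefulness: float) -> bool:
        """Return True if the current instance should be queried for a label."""
        self.window.append(usefulness)
        ranked = sorted(self.window)
        # threshold at the (1 - budget) percentile of the recent window
        idx = min(int((1.0 - self.budget) * len(ranked)), len(ranked) - 1)
        return usefulness >= ranked[idx]

# Usage: decide for each incoming usefulness score whether to spend budget on it.
ipf = IncrementalPercentileFilter(budget=0.1, window_size=50)
decisions = [ipf.acquire(u) for u in (0.2, 0.9, 0.4, 0.95, 0.1)]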

An extensive evaluation of all budgeting approaches validates the theoretically discussed characteristics and shows that they reliably select the highest usefulness values. Using 16 data sets from real-world applications and a newly developed data stream generator, the superiority of probabilistic active learning with budgeting in data streams is demonstrated. On nearly all budget levels, it outperformed random and uncertainty-based methods.

Downloads: