prototype_selection

Reducing the amount of data behind an interpretable classifier

Introduction

Selecting a small number of “representative” samples from a large data set may be of greater interpretative value than generating some “optimal” linear combination of all the elements of a data set.

In this paper, we motivate a particular method for selecting prototypes in the classification setting.

Formulation as an optimization problem

The intuition

A good set of prototypes satisfies three properties:

  1. Every training example has a prototype of its own class in its neighborhood
  2. No point has a prototype of a different class in its neighborhood
  3. There are as few prototypes as possible

The notion of neighborhood:

For a given choice of Pl ⊆ X, we consider the set of ε-balls centered at each xj ∈ Pl. A good prototype set Pl for class l:

  1. Covers as many training points of class l as possible
  2. Covers as few training points as possible of classes other than l
  3. Is sparse
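The coverage notion above is easy to make concrete. A minimal sketch (the data and ε below are hypothetical, chosen only for illustration) that tests which training points fall inside at least one ε-ball around a candidate prototype:

```python
import numpy as np

def covered(X, prototypes, eps):
    """Boolean mask: True where a training point lies within eps of
    at least one prototype (i.e., inside some eps-ball)."""
    # Pairwise Euclidean distances: d[i, j] = ||X[i] - prototypes[j]||
    d = np.linalg.norm(X[:, None, :] - prototypes[None, :, :], axis=2)
    return (d <= eps).any(axis=1)

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
P = np.array([[0.5, 0.0]])
print(covered(X, P, eps=1.0))  # -> [ True  True False]
```

Applied per class, the same mask gives both the "covers class l" count and the "covers other classes" penalty.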

Set cover integer program

Set cover can be seen as a clustering problem in which we wish to find the smallest number of clusters such that every point is within ε of at least one cluster center.
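To make the clustering view concrete, here is the classic greedy heuristic for set cover, where each "set" plays the role of one ε-ball (the universe and subsets below are made-up toy data):

```python
def greedy_set_cover(universe, subsets):
    """Classic greedy set cover: repeatedly pick the subset that covers
    the most still-uncovered elements (ln n approximation guarantee)."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        i, s = max(enumerate(subsets), key=lambda t: len(t[1] & uncovered))
        if not s & uncovered:
            break  # remaining elements cannot be covered by any subset
        chosen.append(i)
        uncovered -= s
    return chosen

# Each subset stands in for the eps-ball of one candidate center
balls = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
print(greedy_set_cover({1, 2, 3, 4, 5, 6}, balls))  # -> [0, 2]
```

The prototype problem adds class labels and a penalty for covering the wrong class, but the covering skeleton is the same.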

From intuition to integer program.

Express the three properties above (the notion of neighborhood) as an integer program.
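A sketch of the resulting program, reconstructed from the three desiderata (the α, ξ, η, λ notation follows the paper, but the exact formulation there may differ in details):

```latex
\begin{aligned}
\min_{\alpha,\,\xi,\,\eta}\quad
  & \sum_{i} \xi_i \;+\; \sum_{i} \eta_i \;+\; \lambda \sum_{l}\sum_{j} \alpha_j^{(l)} \\
\text{s.t.}\quad
  & \xi_i \;\ge\; 1 - \sum_{j:\, x_i \in B(x_j,\varepsilon)} \alpha_j^{(y_i)}
    && \text{(cover each point with its own class)} \\
  & \eta_i \;\ge\; \sum_{l \ne y_i}\; \sum_{j:\, x_i \in B(x_j,\varepsilon)} \alpha_j^{(l)}
    && \text{(penalize wrong-class coverage)} \\
  & \alpha_j^{(l)} \in \{0,1\}, \qquad \xi_i,\ \eta_i \ge 0
\end{aligned}
```

Here α_j^{(l)} = 1 means x_j is selected as a prototype for class l, ξ_i is the slack paid when point i has no same-class prototype nearby, η_i counts wrong-class balls covering x_i, and λ trades off sparsity against coverage. Relaxing α_j^{(l)} ∈ [0,1] gives the LP used in the first algorithm.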

Solving the problem

We present two algorithms for approximately solving the problem, both based on standard approximation algorithms for set cover.

LP relaxation with randomized rounding.

A greedy approach.
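The greedy alternative adds prototypes one at a time, each step picking the candidate whose ε-ball most improves the objective. A minimal sketch under simplifying assumptions (Euclidean distance, every training point as a candidate, a made-up stopping cap; the paper's actual update rule may differ):

```python
import numpy as np

def greedy_prototypes(X, y, eps, max_protos=10):
    """Greedy sketch: repeatedly add the training point whose eps-ball
    newly covers the most same-class points minus wrong-class points;
    stop when no candidate has positive gain."""
    n = len(X)
    d = np.linalg.norm(X[:, None] - X[None, :], axis=2)
    in_ball = d <= eps            # in_ball[i, j]: x_i inside ball of x_j
    covered = np.zeros(n, dtype=bool)
    protos = []
    while len(protos) < max_protos:
        best_j, best_gain = None, 0
        for j in range(n):
            same = in_ball[:, j] & (y == y[j]) & ~covered
            wrong = in_ball[:, j] & (y != y[j])
            gain = same.sum() - wrong.sum()
            if gain > best_gain:
                best_j, best_gain = j, gain
        if best_j is None:
            break
        protos.append(best_j)
        covered |= in_ball[:, best_j] & (y == y[best_j])
    return protos

X = np.array([[0.0], [0.1], [5.0], [5.1]])
y = np.array([0, 0, 1, 1])
print(greedy_prototypes(X, y, eps=0.5))  # -> [0, 2]
```

On these two well-separated clusters, one prototype per class suffices, which is exactly the interpretable summary one hopes for.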

Examples on simulated and real data

Mixture of Gaussians simulation
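A sketch of how such simulated data can be generated (the blob centers, counts, and scale below are assumptions for illustration, not the paper's exact simulation settings):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_class(centers, n_per_center, scale=0.5):
    """Draw n_per_center 2-D Gaussian points around each center."""
    return np.vstack([c + scale * rng.standard_normal((n_per_center, 2))
                      for c in centers])

X0 = sample_class([(0, 0), (3, 3)], 50)   # class 0: two blobs
X1 = sample_class([(0, 3), (3, 0)], 50)   # class 1: two other blobs
X = np.vstack([X0, X1])
y = np.array([0] * len(X0) + [1] * len(X1))
print(X.shape, y.shape)  # (200, 2) (200,)
```

Because each class is itself a mixture, a single "mean" per class would be misleading, while a few prototypes per blob summarize the classes faithfully.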


ZIP code digits data


References

  • Bien, J. and Tibshirani, R. Prototype selection for interpretable classification. The Annals of Applied Statistics, 2011: 2403–2424.