
Feature selection
This protocol is extracted from research article:
Distinguishing cell phenotype using cell epigenotype

Procedure

Our framework is a hybrid of metric learning and supervised learning techniques. Thus, our objective function consists of a loss term based on the accuracy of the classifier described in Eq. 6 and a regularization term based on Eq. 5. To make explicit the impact of the set of features $S$, we rewrite Eq. 6 as

$$\hat{w}_{im}(c,S)=\mathrm{KNN}(z_m;c,S)=\frac{\sum_{n\in D_m}\left\|\lambda(c)\,(z_{\ell m}-x_{\ell n})\right\|^{2}_{\ell\in\{S\}}\,\delta_{ij}}{\sum_{n\in D_m}\left\|\lambda(c)\,(z_{\ell m}-x_{\ell n})\right\|^{2}_{\ell\in\{S\}}}\tag{9}$$

and note that $S$ is already explicit in Eq. 5. Letting $r^{S}_{ij}(p)=\mathbf{1}\left(b^{S}_{ii}(p)>b^{S}_{ij}(p)\right)$, our objective function is

$$F(S)=\sum_{c\in\{T\}}\sum_{m\in\{M\}}\left(\hat{w}_{cm}(c,S)-\delta_{cj}\right)^{2}+\gamma\sum_{\substack{i,j\\ i\neq j}}\sum_{p\in\{0,0.5,1\}}r^{S}_{ij}(p),\tag{10}$$

where $\gamma=0.5$ is a scalar regularization parameter that controls the strength of Eq. 5, giving it approximately half the importance of Eq. 6. Values of $\gamma\gg 1$ will select features solely based on satisfaction of Eq. 5, while $\gamma\ll 1$ will ignore this requirement in favor of Eq. 6.
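As a rough illustration of how the two terms of Eq. 10 combine, the following Python sketch evaluates the objective for a single feature set, assuming the classifier weights of Eq. 9 and the quantities $b$ of Eq. 5 have already been computed for that set. The function name `objective`, the array layouts, and the use of NumPy are assumptions made for this example and do not come from the article.

```python
import numpy as np

def objective(w_hat, labels, b, gamma=0.5):
    """Evaluate the two-term objective of Eq. 10 for one feature set S.

    w_hat  : (T, M) array; w_hat[c, m] is the classifier weight of class c
             for test sample m (assumed precomputed from Eq. 9).
    labels : (M,) integer array; true class index of each test sample.
    b      : (P, T, T) array; b[p, i, j] for each of the P sampled values
             of p (assumed precomputed from Eq. 5).
    gamma  : scalar regularization strength (0.5 in the text).
    """
    n_classes, n_samples = w_hat.shape

    # One-hot encoding of the true labels plays the role of the Kronecker delta.
    delta = np.zeros((n_classes, n_samples))
    delta[labels, np.arange(n_samples)] = 1.0
    loss = np.sum((w_hat - delta) ** 2)

    # Indicator r[p, i, j] = 1 if b[p, i, i] > b[p, i, j], summed over i != j.
    diag = np.diagonal(b, axis1=1, axis2=2)   # shape (P, T)
    r = diag[:, :, None] > b                  # shape (P, T, T)
    off_diag = ~np.eye(n_classes, dtype=bool)
    penalty = np.sum(r[:, off_diag])

    return loss + gamma * penalty

# Toy inputs: two classes, two perfectly classified samples, one value of p.
w_hat = np.array([[1.0, 0.0], [0.0, 1.0]])
labels = np.array([0, 1])
b = np.array([[[1.0, 0.0], [2.0, 1.0]]])
value = objective(w_hat, labels, b)
```

In this toy case the loss term vanishes and exactly one off-diagonal indicator is 1, so the value is determined entirely by `gamma`. Raising or lowering `gamma` shifts the balance between the two terms in the way the text describes.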

With the objective function defined, we describe the forward feature selection algorithm. Recall that $N$ is the number of features of the dataset. We first define $\{U_1\}=\{\{i\} : i\in\{1,\ldots,N\}\}$. Our scheme for dimension reduction proceeds by finding $S_1=\arg\min_{S\in\{U_1\}}F(S)$, then constructing $\{U_2\}=\{S_1\cup\{i\} : i\in\{1,\ldots,N\}\setminus S_1\}$. Continuing iteratively, sets of features $S_\ell$ of arbitrary length $\ell$ may be constructed. We continue until $\ell=50$, which is long after the addition of features has stopped improving the classification accuracy in the LOGO tests (Fig. 4).
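The greedy procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `forward_select`, `toy_objective`, and the tie-breaking rule (lowest feature index wins) are assumptions introduced here, and the toy objective merely stands in for $F(S)$.

```python
def forward_select(objective, n_features, max_size):
    """Greedy forward selection: grow the feature set one feature at a time.

    objective  : callable taking a list of feature indices, returning a score
                 to minimize (stand-in for F(S)).
    n_features : total number of features N.
    max_size   : stop once the selected set reaches this length.
    """
    selected = []
    history = []                      # (feature set, score) after each step
    remaining = set(range(n_features))

    while remaining and len(selected) < max_size:
        # Score every candidate set formed by adding one unused feature.
        scores = {i: objective(selected + [i]) for i in sorted(remaining)}
        # Pick the best candidate; ties go to the lowest index (an assumption).
        best = min(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
        history.append((list(selected), scores[best]))

    return selected, history


def toy_objective(features):
    # Hypothetical score: feature 2 helps most, feature 0 helps somewhat,
    # and every added feature carries a small cost.
    score = 3.0
    if 2 in features:
        score -= 2.0
    if 0 in features:
        score -= 1.0
    return score + 0.1 * len(features)


selected, history = forward_select(toy_objective, n_features=4, max_size=3)
```

Each iteration costs one objective evaluation per unused feature, so building a set of length $\ell$ takes on the order of $N\ell$ evaluations. Recording the score after each step, as `history` does here, makes it possible to see where adding features stops helping.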

