tSNE - Algorithm

Measure the distance from each point to every other point
Map each distance to a probability density under a normal (Gaussian) distribution centered on that point
Normalize these values so they sum to 1 (similar to softmax)
The mapping and normalization steps can be combined into one equation:
- $P_{j \mid i} = \frac{\mathrm{norm}(x_i - x_j)}{\sum_{k \neq i} \mathrm{norm}(x_i - x_k)}$
- $P_{j \mid i} = \frac{\exp(-\lVert x_i - x_j \rVert^2 / (2\sigma_i^2))}{\sum_{k \neq i} \exp(-\lVert x_i - x_k \rVert^2 / (2\sigma_i^2))}$
Convert the conditional probabilities into symmetric joint probabilities, where $n$ is the number of points
1. $P_{ij} = \frac{P_{j \mid i} + P_{i \mid j}}{2n}$
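
The conditional and joint probabilities above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names (`sq_dist`, `conditional_probs`, `joint_probs`), the toy data, and the fixed bandwidths are all assumptions made for the example. In practice each $\sigma_i$ is tuned per point rather than fixed.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def conditional_probs(X, sigmas):
    """Return C with C[i][j] = P_{j|i}; row i uses the bandwidth sigmas[i]."""
    n = len(X)
    C = []
    for i in range(n):
        # Unnormalized Gaussian weights; the k = i term is excluded (kept at 0).
        w = [0.0 if j == i
             else math.exp(-sq_dist(X[i], X[j]) / (2 * sigmas[i] ** 2))
             for j in range(n)]
        total = sum(w)
        C.append([wj / total for wj in w])  # each row sums to 1
    return C

def joint_probs(C):
    """Symmetrize: P_ij = (P_{j|i} + P_{i|j}) / (2n)."""
    n = len(C)
    return [[(C[i][j] + C[j][i]) / (2 * n) for j in range(n)]
            for i in range(n)]

# Toy data (hypothetical): two nearby points and one far away.
X = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]
C = conditional_probs(X, sigmas=[1.0, 1.0, 1.0])
P = joint_probs(C)
```

Each row of `C` sums to 1, while `P` sums to 1 over the whole matrix and is symmetric, which is what the division by $2n$ is for.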
$\sigma_i^2$ controls what “closeness” means for point $i$; these variances are chosen so that the perplexity of each conditional distribution (2 raised to its entropy in bits) equals a user-specified value
1. $\mathrm{Perplexity}(P_i) = 2^{-\sum_j P_{j \mid i} \log_2 P_{j \mid i}}$
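
One way to pick $\sigma_i$ for a target perplexity is a bisection search, since perplexity grows as $\sigma_i$ grows. The sketch below assumes that approach; the helper names, search bounds, and iteration count are illustrative choices, not taken from these notes. Subtracting the smallest distance before exponentiating is only a numerical-stability trick and leaves the normalized probabilities unchanged.

```python
import math

def row_probs(dists_sq, sigma):
    """P_{j|i} for one point i, given its squared distances to the other points."""
    d_min = min(dists_sq)  # shift so the largest weight is exp(0) = 1
    w = [math.exp(-(d - d_min) / (2 * sigma ** 2)) for d in dists_sq]
    total = sum(w)
    return [wj / total for wj in w]

def perplexity(p):
    """2 ** (Shannon entropy in bits); zero entries contribute nothing."""
    entropy = -sum(pj * math.log2(pj) for pj in p if pj > 0)
    return 2 ** entropy

def find_sigma(dists_sq, target, lo=1e-3, hi=1e3, iters=60):
    """Bisection on a log scale for the sigma whose perplexity hits `target`."""
    for _ in range(iters):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the bracket
        if perplexity(row_probs(dists_sq, mid)) < target:
            lo = mid  # distribution too peaked: widen the Gaussian
        else:
            hi = mid  # distribution too flat: narrow the Gaussian
    return math.sqrt(lo * hi)

# Hypothetical squared distances from one point to its four neighbours.
dists_sq = [1.0, 4.0, 9.0, 16.0]
sigma = find_sigma(dists_sq, target=2.0)
```

The target must lie between 1 and the number of neighbours, because perplexity behaves like an effective neighbour count.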

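Randomly place all points in the low-dimensional space (initialization)
Calculate the low-dimensional probability distribution with a Student t-distribution (one degree of freedom); in this and the following formulas, $x_i$ denotes the low-dimensional position of point $i$:
- $Q_{ij} = \frac{(1 + \lVert x_i - x_j \rVert^2)^{-1}}{\sum_{k \neq l} (1 + \lVert x_k - x_l \rVert^2)^{-1}}$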
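
A short sketch of the low-dimensional affinities, assuming the usual t-SNE convention of squared distances in the kernel and a single normalization over all pairs of distinct points. The names `low_dim_affinities` and `Y` (the list of low-dimensional positions) are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as sequences."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def low_dim_affinities(Y):
    """Return Q with Q[i][j] from a Student-t kernel (1 degree of freedom)."""
    n = len(Y)
    # Heavy-tailed kernel 1 / (1 + d^2); the diagonal is left at 0.
    W = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j]))
          for j in range(n)]
         for i in range(n)]
    total = sum(map(sum, W))  # one normalizer shared by all pairs
    return [[w / total for w in row] for row in W]

# Hypothetical 2-D embedding: two close points and one far away.
Y = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0]]
Q = low_dim_affinities(Y)
```

Because the kernel decays polynomially rather than exponentially, moderately distant pairs keep a non-negligible share of the probability mass.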

KL divergence measures the “distance” between two probability distributions
- $KL(P \parallel Q) = \sum_{i,j} P_{ij} \log\left(\frac{P_{ij}}{Q_{ij}}\right)$
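
The sum can be evaluated directly over two matrices of the same shape. In this sketch the `eps` floor on $Q_{ij}$ and the skipping of zero entries of $P$ are assumptions added to avoid `log(0)` and division by zero; the matrices are made-up toy values.

```python
import math

def kl_divergence(P, Q, eps=1e-12):
    """Sum of P_ij * log(P_ij / Q_ij) over all entries with P_ij > 0."""
    total = 0.0
    for p_row, q_row in zip(P, Q):
        for p, q in zip(p_row, q_row):
            if p > 0:  # terms with P_ij = 0 contribute nothing
                total += p * math.log(p / max(q, eps))
    return total

# Hypothetical joint distributions over 3 points (each matrix sums to 1).
P = [[0.0, 0.4, 0.1],
     [0.4, 0.0, 0.0],
     [0.1, 0.0, 0.0]]
U = [[0.0 if i == j else 1.0 / 6.0 for j in range(3)] for i in range(3)]

kl_pu = kl_divergence(P, U)  # cost of modelling P with the uniform U
kl_pp = kl_divergence(P, P)  # identical distributions give zero
```

The quotation marks around “distance” matter: swapping the arguments generally changes the value, so the KL divergence is not a true metric.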
Use gradient descent to minimize the sum of the KL divergence over all the points
Take the partial derivative of the cost function wrt every point. This partial derivative tells us how to move the points within the reduced dimensional space.
- $\frac{\partial KL(P \parallel Q)}{\partial x_i} = 4 \sum_j (P_{ij} - Q_{ij}) (1 + \lVert x_i - x_j \rVert^2)^{-1} (x_i - x_j)$
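
To make the update concrete, here is a minimal sketch of plain gradient descent on the low-dimensional positions. It assumes the standard t-SNE gradient with squared distances in the Student-t kernel. The names, learning rate, step count, and toy inputs are all hypothetical, and practical implementations typically add momentum and early exaggeration, which are omitted here.

```python
import math

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def low_dim_affinities(Y):
    """Student-t affinities Q_ij, normalized over all pairs i != j."""
    n = len(Y)
    W = [[0.0 if i == j else 1.0 / (1.0 + sq_dist(Y[i], Y[j]))
          for j in range(n)] for i in range(n)]
    total = sum(map(sum, W))
    return [[w / total for w in row] for row in W]

def kl_divergence(P, Q):
    return sum(p * math.log(p / q)
               for p_row, q_row in zip(P, Q)
               for p, q in zip(p_row, q_row) if p > 0)

def kl_gradient(P, Y):
    """grad[i] = 4 * sum_j (P_ij - Q_ij) * (Y_i - Y_j) / (1 + |Y_i - Y_j|^2)."""
    n, dim = len(Y), len(Y[0])
    Q = low_dim_affinities(Y)
    grad = [[0.0] * dim for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            coeff = 4.0 * (P[i][j] - Q[i][j]) / (1.0 + sq_dist(Y[i], Y[j]))
            for d in range(dim):
                grad[i][d] += coeff * (Y[i][d] - Y[j][d])
    return grad

def descend(P, Y, lr=0.05, steps=200):
    """Move every point against its gradient; returns a new list of positions."""
    Y = [row[:] for row in Y]  # copy so the caller's positions are untouched
    for _ in range(steps):
        g = kl_gradient(P, Y)
        for i in range(len(Y)):
            for d in range(len(Y[i])):
                Y[i][d] -= lr * g[i][d]
    return Y

# Hypothetical joint probabilities P and a fixed starting layout Y0.
P = [[0.0, 0.4, 0.1], [0.4, 0.0, 0.0], [0.1, 0.0, 0.0]]
Y0 = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
Y1 = descend(P, Y0)
kl_before = kl_divergence(P, low_dim_affinities(Y0))
kl_after = kl_divergence(P, low_dim_affinities(Y1))
```

Pairs with $P_{ij} > Q_{ij}$ are pulled together and pairs with $P_{ij} < Q_{ij}$ are pushed apart, so `kl_after` should come out lower than `kl_before` for a sufficiently small learning rate.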

tSNE - Other