Graph construction complexity

We look for the last selected column on average, \langle c_\text{last} \rangle explored by the Proximity Table based k nearest neighbours algorithm. We look at the central point of a Gaussian distribution of nodes centred at the origin, and assume that the statistics allow us to consider it as a general case.

The probability to find a node at a distance r from the origin is given by a half-Gaussian

p(r) = \frac{\sqrt{2}}{\sigma\sqrt{\pi}} \exp\left( -\frac{r^2}{2\sigma^2} \right)

The number of cells in a radius r is c(r)=ar (easily verified with a square grid). Hence the number of nodes (or objects) at radius r is given by

o(r) = \mathcal{N}c(r)p(r)

with \mathcal{N} a renormalisation factor to ensure that N be found in [0,\infty) and therefore verifies

\int_0^\infty \mathcal{N}c(r)p(r) = N

which in turn gives

\mathcal{N} = \frac{N\sqrt{\pi}}{a\sqrt{2}\sigma} \quad \Rightarrow \quad o(r) = \frac{N}{\sigma^2}r\exp\left( -\frac{r^2}{2\sigma^2} \right)

We look for the radius \tilde{r} such that k neighbours have been found. That is, we solve for \tilde{r}

\int_0^{\tilde{r}} \frac{N}{\sigma^2}r\exp\left( -\frac{r^2}{2\sigma^2} \right) dr = k

Giving \tilde{r}^2 = 2\sigma^2 \ln\left(\frac{N}{N-k}\right). Finally, we count the number of cells that were explored before reaching radius \tilde{r}, i.e. \langle c_\text{last}\rangle

\langle c_\text{last} \rangle =\int_0^{\tilde{r}} ardr = a\sigma^2\ln\left(\frac{N}{N-k}\right)

Note however that for large N , this tends to zero, when we know that \langle c_\text{last} \rangle \geq k. This is because the previous integral was taken from r = 0 when in truth there exists a minimal radius r_\text{min} before one may find neighbours. To correct for this, we write

\langle c_\text{last} \rangle = a\sigma^2\ln\left(\frac{N}{N-k}\right) + k

One last remark that must be made is that because we have a half-Gaussian PDF, the parameter \sigma is not the standard deviation of the distances, which is related to the true SD s as

\sigma = s\sqrt{\frac{\pi}{\pi-2}}

Fitting the standard deviation on the data with a logarithm (the number of available cells as we increase the radius increases exponentially, see Figure 1), we find that the average last selected column should be

\langle c_\text{last}\rangle = A\ln^2(N)\ln\left( \frac{N}{N-k} \right) + k

meaning that at first order \langle c_\text{last}\rangle = \mathcal{O}\left(\frac{\log^2(N)}{N}\right), and therefore that the complexity of the PT-KNN algorithm is \mathcal{O}\left( \log^2(N) \right).

Figure 1. Standard deviation of the radial norm of nodes.