To understand the performance of the algorithms of TDA, it
is important to investigate the model cases, where the
samples are random.
A typical example is the result by
Niyogi-Smale-Weinberger,
a stochastic version of the
results on the homotopy equivalence of off-sets of samples
approximating a manifold. In essence, this result asserts
that for balls of small enough (compared to the feature size)
radius $r$, the sample of size
$$n=\Theta_{k,\tau}\left[r^{-k}\left(
\log{r}+|\log\delta|\right)\right]
$$
is enough to have the union of balls sampled from the
Lebesgue measure on a smooth submanifold $M$ of $\Real^D$
homotopy equivalent to $M$, with probability at least $1-\delta$.
More precise results concerning the short bars in random
samples from manifolds are nontrivial. One can show that
most of the short-living classes in dimension $k$ have the
birth-death times near
$$
r_{k,d}(n)\approx c_{k,d} n^{-1/d}
$$
with last spurious classes holding for a $\log{n}$ longer
time (Adler-Bobrowski,
Kahle ...).
Remarkably, it seems that those short-living homologies
are coming in waves, in increasing dimensions.
One can show that for large $d$, the range of $r$ for which homologies are present behaves like $d^{-1/2}$.
Another witness to this conjectured behavior is the Euler curve describing the Euler characteristics (alternating sum of ranks of homology groups) in random unions of balls:
Samples from a manifold are known to faithfully
reconstruct the topology of the underlying space, at least
when the samples lie exactly on the manifold: the
Čech complex at a carefully chosen scale has the
homotopy type of the manifold. Adding noise complicates
matters.
In some situations, these complications
can be overcome: for example if the noise is Gaussian.
If not, a new phenomenon arises, the crackle:
A special case is of $0$-dimensional manifolds, where the highly connected core emerges. On the surface of this core protuberances flare.
Altogether, one has the following
As $N\to \infty$, the $\mathbf{P}H_0$ (supported on $[0,\cdot]$) converged on any compact subset of $\Real_+$ to a point process with the density $$ {e^{-y}}{(1-e^{-y})^2} $$
Assume that
In this case (Adler-Bobrowski-Weinberger)
$$
\frac{\beta_k([x,y])}{{\mathrm{vol}(M)}
\log{\mu}^{D-m-1}s^m}\to b_k(x,y).
$$
Here $\beta_k[x,y]$ is the rank of persistent
$k$-homology, and $b_k$ is a function depending on
the dimension and codimension of $M$ only.
Crackle happens also for subexponential distributions: even more scales are involved.
Let \[ \bbr:[0,1]\to \Real \] be a sample trajectory of the Brownian bridge (Brownian motion conditioned on $\bm(0)=\bm(1)=0$, or, equivalently, Brownian motion on the unit circle).
Denote by $N_k^f(a,b)$ the number of bars in $k$-th persistent homology of $f$ overlapping the interval $[a,b]$, and by $B(x,y)=\mathbb{E} N_o^\bbr (x,y)$. Then
for $0 \leq x \leq y $ one has $$ B(x,y)=\sum_{n\geq 1} e^{-2(n(y-x)+x)^2} $$
When $x$ is close to $y$, the number of the bars
overlapping
$[x,y], x \gt 0$ is close to
$$
(y-x)^{-1}\sqrt{\pi/2} \mathtt{erfc}(2x)
$$
In other words, the density $\beta_0$ of
the persistence point process near the diagonal is exploding like
$$
\beta_0^{\bbr}(x,y)\sim (y-x)^{-3}
$$
Similarly, for the Brownian motion with drift, $$ \bmv_t=\sigma\bm_t+vt, $$ the number of bars overlapping any given interval is finite a.s. and is geometrically distributed with parameter $$ \exp(-v(y-x)/\sigma^2). $$
As a corollary, if $f$ is smooth, then for the small perturbations of $f$ by the Brownian motion one has $$ \lim_{\sigma\to 0} \frac{1}{\sigma^2} \lim_{\Delta\to 0} \Delta \beta_0^{f+\sigma \bm}(x, x+\Delta)=\sum_{s:f(s)=x} \frac{1}{f'(s)}=\frac{f_*(dx)}{dx}. $$
In particular, the point process of short bars of a small perturbation of a polynomial $f$ fix this polynomial up to shifts and reflections.
Somewhat more generally, one can define the persistence dimension of a function $f$ on a manifold $M$ as $$ \dimph(f):=\inf\{k: \langle (y-x)^k \cdot \phpp \rangle \lt\infty\}, $$ and the persistence dimension of a class of functions $\class$ as the $\sup_{f\in \class} \dimph(f)$.
There is a general bound on the total $k$-persistence
for $d$-dimensional simplicial complexes implying for
them
$$
\dimph(f)\leq d.
$$
The results for the Brownian excursion show that it has
the persistence dimension $2$...
Classifying textures based on material is a well known and important problem.
Three scales of each of ten materials in the KTH-TIPS data set
Exemplars from SIPI Rotated Texture data set
Four exemplars from each of the 25 classes in the UIUCTex data set
The 61 materials in the CUReT data set
The topology comes into play through the observation done during the study of patches of natural images
Examples from the van Hateren Natural Images data set
It turned out (Carlsson et al) that most of the $3\times3$ patches concentrate around a 1-dimensional singular set ("three circles") spanned by two-dimensional cells forming a two-dimensional complex that has the topology of a Klein bottle.
Animation of an immersoin of the Klein bottle into 3-space (by Konard Polthier)
A model of the Klein bottle.
A lattice of patches in the Klein bottle model.
Using this fact and an appropriate orthonormal basis for $L_2(K)$ one can obtain a vector of estimated of Fourier coefficients of the distribution density of patches from a particular texture (Perea-Carlsson).
First 44 estimated $K$-Fourier coefficients for the texture of Bricks
The $K$-Fourier coefficients can be made rotation invariant. Combining this with patches of different size one obtains a rotation invariant, multiscale descriptor amenable to classical machine learning analyses.
Even observing a single trajectory of
a dynamical system leaves room for applications of
topology: one can detect periodicity, or quasiperiodicity
(toric behavior), or capture a strange
attractor.
Topological data analysis helps dramatically when there is
no model for the dynamics, and the scale at which the
fluctuations are expected to manifest themselves is
unknown.
A typical area where such phenomena arise are modern power
grids, increasingly supporting large number of volatile
active elements.
Below is an illustration oft this concept: a simulation of a 30+ bus
power network, subject to a disruption is shown. The measurements
represent an aggregated output of a collection of Phasor Measurement
Units (PMUs)
Given a trace, how to detect emergence of oscillations,
how to classify the oscillations into stable and unstable,
how to detect bifurcations, of the scale of the noise is
not known in advance?
Persistence homology helps.
The first step, however, is to lift the observed traces
using the Takens' embedding:
If $\{x(t)\}_{t=\ldots, -1,0,1,\ldots}, x(\cdot)\in\Real^d$ is a time series, the window $w$ Takens embedding of $x$ is the time series in $\Real^{wd}$ given by
$$ x_T(t):=(x(t), x(t+1),\ldots,x(t+w-1)). $$
In this way we might want to detect bifurcations on an undefined scale.
emergence of oscillations
period doubling
An implementation of the persistence homology for the Takens embeddings of noisy time series shows that one need not rely on the eye-balling (which is possible only for carefully chosen projections into low-dimensional spaces), like in the plots below:
Here, the plot on the left clearly shows emergence of a limit cycle in the traces of a process after a period of transition; the right plot exhibits the period doubling bifurcation.
The structure of the Internet has been studies for some time now.
More specifically, it is typically the network of the Aunotomous System Nodes whose structure is considered. It is a large-scale network consisting of large players, like ISPs, Universities, large companies and various unknown entities.
There about $2.5\times 10^4$ (self-registered) nodes, interacting using BGP.
Over the past few years, several authors (Boguñá, Papadopoulos, Krioukov ...) suggested that the Internet can be modelled on a sample from a hyperbolic space - more specifically, from a large disk in the hyperbolic plane.
Topological data analysis suggests a natural test for this conjecture: one should be able to check the dimension of a manifold from a dense enough sample from this manifold. If the model is realistic, the dimension of the Internet were $2$.
There is a standard way to measure the dimension of a manifold via local homology.
Recall that the (reduced) homology group of a sphere $S^k$ are $$ \bar{H}_l(S^k,\Int)=\left\{\begin{array}{lll}\Int, &\mathrm{if}& k=l,\\ 0 &\mathrm{otherwise.}&\\ \end{array}\right. $$
This allows one to recover the dimension of a manifold $M^m$ via checking the homologies $$ \bar{H}_*(M, M-\{x\}): $$ it should be of rank $1$ precisely in dimension $m$ and nowhere else.
Note that the dimension is a function of a point - a separate theorem asserts that it is a (locally) constant function.
One can also recognize the boundary points (in a manifold with boundary or, more generally, with corners): if $x\in\partial X$, $$ \bar{H}_*(M,M-x)\cong 0 $$ in all dimensions.
To recover the dimension of a manifold $M$ at a point $x$, we need to have a dense enough sample in a small ball around $x$. A "stochastic version" of Hausmann-Latschev's theorem tells us that in this case, the local homology will be equivalent those of a sphere.
This tell us what to expect from what to expect from local dimensions of the Rips complexes of the samples from the hyperbolic spaces $H^d$ (or any other manifold):
Here is an impressionistic view of the ASN
And here are some plots
It seems quite clear that the ASN networks are not sampled from the hyperbolic space - at least not from a regular enough density.
It seems safe to say that the Internet does not resemble a (homological) manifold.
It does not mean that the hyperbolic spaces cannot be used to embed networks - but they definitely are bad as a template for the underlying geometry.
On the other hand, the local homology is a brand new characteristic of a node - we should investigate those.
Topological Data Analysis is a young discipline, with a lot of appeal and dynamic trajectory.
Does it have
indeed a future as one of the essential components of
modern data science?