Catarina Moreira

Project HEIDI (or how to index 1 billion vectors)

The traditional methods for multimedia data indexing are usually based on metric tree indexes. The basic idea is to derive metrics from item properties to build a tree that can then be used to prune branches when processing queries, this way speeding up the search. However, these indexes are only effective when the number of features is small.

For higher dimensions, these trees will suffer from the curse of dimensionality. When the dimensionality increases, the volume of the space increases so fast that the available data becomes so sparse that the tree indexing structures do not provide any advantages in the retrieval process. So this leads to the dilemma. Either the number of features is reduced, making the searching process less discriminative and unreliable, or approximate queries are preformed leading to a relative error during the retrieval process.

The promise of the recently introduced subspace-tree has a logarithmic retrieval complexity for extremely high dimensional features. The subspace-tree indicates that the conjecture of "the curse of dimensionality" is false. The search in such a structure starts at the subspace with the lowest dimension. In this subspace, the set of all possible similar objects is determined. In the next subspace, additional metric information corresponding to a higher dimension is used to reduce this set. This process is then repeated. The theoretical estimation of temporal complexity of the subspace tree is logarithmic for evenly distributed data. It means that depending on the distribution of our data, we have to choose an ideal projection into the subspaces leading to an ideal hierarchy. In the project we will perform experiments on different databases, high dimensional data up to several thousand, and of size till 1 billion.

Project HEIDI (or how to index 1 billion vectors)

European Digital Mathematics Library (EuDML) Projects