HEIDI (HEIGH Dimensional Indexing) or how to index one billion of vectors


Period: Februar 2012 to February 2015

Funded by: FCT: PTDC/EIA-CCO/119722/2010

Traditional indexing of multimedia data leads to dilemma. Either the number of features has to be reduced or the quality of the results in unsatisfactory, or approximate queries is preformed leading to a relative error during retrieval. The promise of the recently introduced subspace-tree [1-4] is the logarithmic retrieval complexity of extremely high dimensional features. The subspace-tree indicates that the conjecture "the curse of dimensionality" is false. The search in such a structure starts at the subspace with the lowest dimension. In this subspace, the set of all possible similar objects is determined. In the next subspace, additional metric information corresponding to a higher dimension is used to reduce this set. This process is then repeated. The theoretical estimation of temporal complexity of the subspace tree is logarithmic for evenly distributed data. It means that depending on the distribution of our data, we have to choose an ideal projection into the subspaces leading to an ideal hierarchy. In the project we will perform experiments on different databases, high dimensional data up to several thousand, and of size till 1 billion.

Referencs

[1] Wichert A.: Content-based image retrieval by hierarchical linear subspace method, Journal of Intelligent Information Systems, 31 (1): 85-107, 2008

[2] Santos P., Galhardas H., Wichert A. . Panoramix: A Content-based Health Record System, Internal Journal of Computer Assisted Radiology and Surgery: 4 (Supplement): 152, 2009 doi:0.1007/s11548-009-0317-y

[3] Andreas Wichert, Pedro Teixeira, Pedro Santos, Helena Galhardas: Subspace tree: High dimensional multimedia indexing with logarithmic temporal complexity, Journal of Intelligent Information Systems, 35 (3): 495-516, 2010 doi:10.1007/s10844-009-0104-9

[4] Andreas Wichert and Pedro Santos: Fast multimedia querying formedical applications, Eds. Claudia Plant and Christian Bohm, Chapter in Database Technology for Life Sciences and Medicine,World Scientific Publishing, pp. 203-218: 2010


Project Members

Andreas Wichert (PI), Angelo Cardoso, Joao Sacramento, Catarina Pinto Moreira


Publications (Preprints)

Book

[B3] Iintelligent Big Multimedia Databases, A. Wichert, World Scientific, 2015. info

Journals

[J1] Andreas Wichert and Andre Verissimo: CBIR with a Subspacee-tree: Principal Component Analysis versus Averaging, Multimedia Systems, 18(4): 283-293, 2012 doi:10.1007/s00530-011-0248-7

[J2] Angelo Cardoso and Andreas Wichert: Iterative random projections for high-dimensional data clustering, Pattern Recognition Letters, 33(13): 1749-1755, 2012 doi:10.1016/j.patrec.2012.06.007

[J3] Catarina Moreira, Andreas Wichert: Finding Academic Experts on a MultiSensor Approach using Shannon's Entropy, Expert Systems with Applications, 40(14): 5740-5754, 2013, doi:10.1016/j.eswa.2013.04.001

[J4] Andreas Wichert and Catarina Moreira, Projection Based Operators in lp space for Exact Similarity Search, Fundamenta Informaticae, Annales Societatis Mathematicae Polonae, 136(6): 461-474, 2015 doi:10.3233/FI-2015-1166

Conference Papers

[C1] Wichert, A., Product quantization for vector retrieval with no error, Proceeedings of International Conference on Enterprise Information Systems, Wroclaw, Poland (ICEIS), Volume 1: 87-92, 2012

Software

[A1] Project-Heidi (Windows, C/C++, SQL-Server-2012)

[A2] Lime (Java/JAI/Jama, octave or Matlab)


Related Links

Prof. Vladimir Pestov (Curse of Dimensionality)

Prof. Christos Faloutsos (GEMINI)

Prof. Potr Indyk (LSH)

Approximate Nearest Neighbor (Weizmann Institute Course)

Locality Sensitive Hashing (LSH) Home Page

Approximate nearest neighbor search

Evaluation of Approximate nearest neighbors: large datasets