IndexSpan: Discovery of Large Item-Indexable Sequential Patterns
Method for the efficient discovery of sequential patterns in item-indexable sequences.An item-indexable sequence does not allow item repetitions and is commonly dense.
The target tasks for databases composed of item-indexable sequences benefit from lengthy patterns and tolerate local mismatches.
A pattern-merging procedure is made available for the efficient discovery of large noise-tolerant sequential patterns.
@article{, author = {Rui Henriques and C. Antunes and Sara C. Madeira}, title = {Methods for the Efficient Discovery of Large Item-Indexable Sequential Patterns}, booktitle = {ECML/PKDD New Frontiers to Mine Complex Patterns}, series = {Lecture Notes in Artificial Intelligence}, year = {2014}, volume = {8399}, pages = {94-108}, publisher = {Springer-Verlag}, }Synthetic datasets (non-exhaustive set):
- 500x50 (5% noise): datasetA, datasetB, datasetC, hiddenBicsA, hiddenBicsB, hiddenBicsC
- 1000x75 (5% noise): datasetA, datasetB, datasetC, hiddenBicsA, hiddenBicsB, hiddenBicsC
- 2000x100 (5% noise): datasetA, datasetB, datasetC, hiddenBicsA, hiddenBicsB, hiddenBicsC
- Datasets with no noise: datasetA, datasetB, datasetC, hiddenBicsA, hiddenBicsB, hiddenBicsC
- dlblc.arff (180 conditions, 660 genes)
- yeast.arff (17 conditions, 1884 genes)
- coloncancer.arff (62 conditions, 2000 genes)
- leukemia.arff (38 conditions, 7129 genes)
Software: IndexSpan JAR v3.1.0 (06-11-2013, 3.40 MB)
Others: Raw Results and Statistical Sheets
Tasks:
- Case 1: analysis of gene expression data
Consider the case where we have microarrays, where a set of genes have a level of expression to different conditions.
For instance, consider 4 conditions, (c1,c2,c3,c4), and two genes with following levels of expression across conditions: respectively g1=(-1,-1,2,0) and g2=(0,0,2,-1). We can see that the levels of expression for both g1 and g2 remain unchanged for conditions 1 and 2 (co-occurrence) and decrease for condition 3 (precedence). This order-preserving regularities are critical since different genes have different default baselines of expression (an analysis of their absolute values is misleading) and, therefore, local patterns in micro-arrays are best described by order-preserving regularities.
Consider the mapping of this case into SPM task over item-indexable sequences:
g1=(c1c2)c4c3 and g2=c4(c1c2)c3 //by ordering conditions according to the observed levels of expression
We can, for instance, see that the (c1c2)c4 and c4c3 are sequential patterns (local regularities) supported by both genes. - Case 2: analysis of preferences
Consider the analysis of ratings of interest. Increasingly, the selection of videos, the booking of hotels, the shopping of items, the booking of restaurants, among many other activities, are accompanied by ratings on large-scale platforms (e.g. IMDB, booking.com, Amazon etc.). It is, therefore, of high-interest to understand local ordering preferences for specific populations of customers. Since, movies, shopping items, hotels and restaurants have all unique IDs, the resulting databases of ratings are item-indexable.
Consider that 4 movies have the same ordering of preferences among a population of customers, this information can be used to support recommendations, or for the analysis of profiles. For instance, consider the following ratings across four movies: user1=(5,5,3,4) and user2=(4,4,1,2). They both respect ordering regularities (inclusion of both co-occurrences and precedences). - Case 3: analysis of quality/performance/surveys
Similarly to the previous case, the analysis of: 1) the quality of services, 2) the performance of people (e.g. increasingly accomplished in the activities of the different public ministries), and of 3) surveys; follow the same characteristics. Commonly, in these settings, a scale is considered for different surveyed aspects. Frequent orderings can be found using different discretizations of these ratings (e.g. L1-L5, L1-L10) with impact on the number of precedences vs. co-occurrences. - Case 4: analysis of planning and task analysis data
Consider the following problems of analyzing local patterns in the:
- the planning/scheduling of a list of tasks with different identifiers for a particular temporal granularity. This planning can be done in the context of operational tasks made by people in the industry or public sector, or for the analysis of decisions made, for instance, by a population of robots or computational methods). Frequent co-occurrences (selected tasks for a particular day) and precedences (tasks for different days) are discovered under item-indexable constraints (no repeated items since all tasks have unique identifiers);
- the academic accomplishments of students based on the finished courses per semester. Understanding local co-occurrences and precedences allows to characterize different subsets of students and to support decisions regarding the most adequate orders for the academic course offerings. This and similar cases rely on item-indexable data since one item appears one time per instance, in this case it corresponds to a course accomplished. - Case 5: analysis of (biological) networks
Biological networks typically describe the weight of relations among different molecular units (mi). Thus, each molecular unit is defined by a set of values characterizing the weights of the relations with all of the molecular units. Consider that the local ordering m31<m2=m202<...<m88 is observed for a subset of molecular units {m3,m28,...,m43}. This means that the relative degree of influence from the m31<...<m88 molecular units is preserved across {m3,m28,...,m43} units.
In fact, not only biological networks can be mined using local orderings, but many other types of networks.