|
Abstract
Pattern discovery in long sequences is of great importance in many
applications including computational biology study, consumer behavior
analysis, system performance analysis, etc. In a noisy environment, an
observed sequence may not accurately reflect the underlying behavior. For
example, in a protein sequence, the amino acid N is likely to mutate to D
with little impact to the biological function of the protein. It would be
desirable if the occurrence of D in the observation can be related to a
possible mutation from N in an appropriate manner. Unfortunately, the
support measure (i.e., the number of occurrences) of a pattern does not
serve this purpose. In this paper, the authors introduce the concept of
compatibility matrix as the means to provide a probabilistic connection
from the observation to the underlying true value. A new metric match is
also proposed to capture the "real support" of a pattern which would be
expected if a noise-free environment is assumed. In addition, in the
context the authors address, a pattern could be very long. The standard
pruning technique developed for the market basket problem may not work
efficiently. As a result, a novel algorithm that combines statistical
sampling and a new technique (namely border collapsing) is devised to
discover long patterns in a minimal number of scans of the sequence
database with sufficiently high confidence. Empirical results demonstrate
the robustness of the match model (with respect to the noise) and the
efficiency of the probabilistic algorithm.
Read the Presentation
Slides...
Referred Papers
|