edu.ucsb.nmsl.tools
Class LCS

java.lang.Object
  extended by edu.ucsb.nmsl.tools.LCS

public class LCS
extends java.lang.Object

This class is responsible for finding the Longest Common Subsequence (LCS) of a set. A subsequence of a set is defined as sequence, B of obtained by removing elements of sequence A. For example, if A = { 'A', 'B', 'C', 'B', 'D', 'A', 'B' } then {'A', 'B', 'B', 'B'} is a subsequence of A if you remove items 3, 5, 6 from the sequence. Furthermore, the longest common subsequence is a subsequence of two sequences containing the greatest number of common elements between the two sequences. Two sequences may have multiple common subsequences that are of the lognest length. The implementation provide in this class is based on the classic Dynamic Programming approach and achieves O(n2) performance.

The LCS class is capable of calculating an LCS for strings as well as collections of objects. Similarly, the sets returned by the LCS class are either collections of strings or collections of objects. Also, this implementation of the LCS offers a method called "getShortestEmbedding()". An embedding is defined a sequence of indices for a longest common subsequence. For example, an embedding, with respect to the first sequence, for the sequences { 'B', 'A', 'B' }, and { 'A', 'B', 'A' } is { 1, 2 } since { 'B', 'A' } is an LCS for the two sequences. Another embedding is { 2, 3 } since { 'A', 'B' } is also an LCS. The shortest embedding, however, is defined as the embedding with the shortest difference between the first and last index of an embedding. Again, it is possible for two sequences to have more than one shortest embedding, just as they can have more than one LCS.

The LCS algorithm is important to the alignment phase of the AutoCap process. Using this algorithm AutoCap is able to filter out words that do not match between what is recognized and what is actually in the transcript. Once all utterances have been collected during the speech recognition phase, the LCS is calculated between all the utterances and the transcript. The result is all the common words between the two, with the order retained. The time-stamps for the recognized words are then used to determine the time- stamp of each caption withing the transcript.

Future implementations of this algorithm could be done using the either the Hirschberg or Hunt-Szymanski Paradigm. In the case of AutoCap, the Hunt- Szymanski paradigm may be preferred since the Hirschberg paradigm can degrade to worse than O(n2) performance if the number of matches is not small compared to the size of each subsequence. Since this is not likely when using AutoCap, the Hunt-Szymanski paradigmn will perform better for this application

Version:
1.0
Author:
Allan Knight

Field Summary
protected static int[][] b
          Print matrix used to reconstruct an LCS of two sequences
protected static int[][] c
          Distance matrix used to hold the lengths of the LCS
protected static int DIAG
          Indicates a move should be diagnal within the Print matrix
protected static int LEFT
          Indicates a move should be to the left within the Print matrix
protected static char[] SYMBOLS
          Symbols used to print the Print matrix for debuggin purposes
protected static int UP
          Indicates a move should be up within the Print matrix
 
Constructor Summary
LCS()
           
 
Method Summary
protected static java.util.Collection buildLCS(java.util.Collection x)
          This method is a helper function that recursively builds the actual LCS for two given collections.
protected static void buildLCS(java.lang.Object[] x, java.util.Collection col, int i, int j)
          This recursive method is called by buildLCS and does the actual work of building the LCS from the print matrix.
static java.util.Collection getLCS(java.util.Collection x, java.util.Collection y)
          This method computes the LCS of two collections and returns it as a collection.
static java.util.Collection getLCSIndices(java.util.Collection x, java.util.Collection y)
          This method returns an embedding with respect to x for the sequences x and y.
protected static void getLCSIndices(java.util.Collection col, int i, int j)
          This method is a helper function that computes the indices of an LCS for a given collection.
static java.util.Collection getShortestEmbedding(java.util.Collection x, java.util.Collection y)
          This method returns the shortest embedding with respect to x between subsequences x and y.
private static java.util.Collection getShortestEmbeddingFrom(java.util.Collection x, java.util.Collection y, int startI, int startJ)
          This method returns the shortest embedding with respect to x between subsequences x and y starting from startI and startJ the distance matrix.
static java.lang.String getWordLCS(java.lang.String x, java.lang.String y)
          This method computes the LCS of two stringss and returns it as a string.
static void main(java.lang.String[] args)
          This main method is for testing and debugging purposes.
protected static void populateMatrix(java.util.Collection x, java.util.Collection y)
          This method is called to popluate both the distance and print matrix.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

c

protected static int[][] c
Distance matrix used to hold the lengths of the LCS


b

protected static int[][] b
Print matrix used to reconstruct an LCS of two sequences


DIAG

protected static final int DIAG
Indicates a move should be diagnal within the Print matrix

See Also:
Constant Field Values

UP

protected static final int UP
Indicates a move should be up within the Print matrix

See Also:
Constant Field Values

LEFT

protected static final int LEFT
Indicates a move should be to the left within the Print matrix

See Also:
Constant Field Values

SYMBOLS

protected static final char[] SYMBOLS
Symbols used to print the Print matrix for debuggin purposes

Constructor Detail

LCS

public LCS()
Method Detail

getLCS

public static java.util.Collection getLCS(java.util.Collection x,
                                          java.util.Collection y)
This method computes the LCS of two collections and returns it as a collection.

Parameters:
x - A sequence that will be compared to y to find their LCS.
y - A sequence that will be compared to x to find their LCS.
Returns:
A collection of objects that represent the LCS of x and y.

getWordLCS

public static java.lang.String getWordLCS(java.lang.String x,
                                          java.lang.String y)
This method computes the LCS of two stringss and returns it as a string. The LCS is calculated with respect to the words within the string, not the characters.

Parameters:
x - A String that will be compared to y to find their LCS.
y - A String that will be compared to x to find their LCS.
Returns:
A String that represent the LCS of x and y.

getLCSIndices

public static java.util.Collection getLCSIndices(java.util.Collection x,
                                                 java.util.Collection y)
This method returns an embedding with respect to x for the sequences x and y.

Parameters:
x - A sequence that will be compared to y to find their LCS.
y - A sequence that will be compared to x to find their LCS.
Returns:
A collection of Integers that contain the indices of a LCS with respect to x between subsequences x and y.

getShortestEmbeddingFrom

private static java.util.Collection getShortestEmbeddingFrom(java.util.Collection x,
                                                             java.util.Collection y,
                                                             int startI,
                                                             int startJ)
This method returns the shortest embedding with respect to x between subsequences x and y starting from startI and startJ the distance matrix. getShortestEmbeddingFrom is a helper method for getShortestEmbedding and should not be called anywhere else in your code.

Parameters:
x - A sequence that will be compared to y to find the shortest embedding.
y - A sequence that will be compared to x to find the shortest embedding.
startI - The starting row in the distance matrix for this call.
startJ - The starting column in the distance matrix for this call.
Returns:
A collection of indices for the shortest embedding for the indicated subsequences of x and y.

getShortestEmbedding

public static java.util.Collection getShortestEmbedding(java.util.Collection x,
                                                        java.util.Collection y)
This method returns the shortest embedding with respect to x between subsequences x and y.

Parameters:
x - A sequence that will be compared to y to find the shortest embedding.
y - A sequence that will be compared to x to find the shortest embedding.
Returns:
A collection of indices for the shortest embedding for the subsequences of x and y.

getLCSIndices

protected static void getLCSIndices(java.util.Collection col,
                                    int i,
                                    int j)
This method is a helper function that computes the indices of an LCS for a given collection.

Parameters:
col - A collection that holds the indices of an embedding.

buildLCS

protected static java.util.Collection buildLCS(java.util.Collection x)
This method is a helper function that recursively builds the actual LCS for two given collections.

Parameters:
x - One of the collections for which the LCS is being calculated.
Returns:
The collection that holds the LCS after it is built.

buildLCS

protected static void buildLCS(java.lang.Object[] x,
                               java.util.Collection col,
                               int i,
                               int j)
This recursive method is called by buildLCS and does the actual work of building the LCS from the print matrix.

Parameters:
x - An array of Objects that is one of the collections for which the LCS is being calculated.
col - The collection that holds the LCS after it is built.
i - The index of the row in the print matrix that is currently being inspected.
j - The index of the column in the print matrix that is currently being used inspected.

populateMatrix

protected static void populateMatrix(java.util.Collection x,
                                     java.util.Collection y)
This method is called to popluate both the distance and print matrix. These matrices are used in the construction of an LCS between two given subsequences.

Parameters:
x - A sequence that will be compared to y to find their LCS.
y - A sequence that will be compared to x to find their LCS.

main

public static void main(java.lang.String[] args)
This main method is for testing and debugging purposes. This method is not called during the normal operation of LCS nor AutoCap.