HOW A MATRIX IS USED FOR ALIGNMENT

MatrixGen lets researchers generate protein scoring matrices from their own aligned data sets. An alignment program can then use these matrices to produce a more accurate alignment.

There are many algorithms for determining protein similarity and aligning proteins, and many programs that implement them. MatrixGen is not an alignment program. It does, however, interoperate with alignment programs by supplying matrices that help them produce a more accurate alignment.

Many methods for aligning proteins rely on a weighted matrix that scores how likely one amino acid is to be substituted for another. The alignment with the highest total score is then judged the most correct alignment. These weighted matrices are generated by analyzing previously aligned sets of proteins to estimate the probability that a particular amino acid mutates in an evolutionary context. They are most commonly presented as a log-odds (lod) matrix. Each element of a lod matrix is the logarithm of a ratio of probabilities: the observed probability of a pair of amino acids being aligned, divided by the probability of that pair being aligned by chance (the expected probability).

MatrixGen calculates these probabilities as described by Henikoff and Henikoff, but it does not perform the block clustering described in that paper. The idea behind MatrixGen is to create a matrix specifically for the protein that you are aligning, using a data set of very similar proteins. MatrixGen also generates distance matrices and reports other useful statistics about your data set.
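
As a rough illustration of the log-odds calculation, here is a minimal Python sketch of the Henikoff and Henikoff style computation without block clustering. It is not MatrixGen's actual code: the function name log_odds_matrix, the gap characters, and the choice of base-2 logarithms are all assumptions made for this example, and real matrices are usually scaled and rounded to integers afterwards.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_matrix(aligned_seqs, gap_chars="-?"):
    """Return {(a, b): score_in_bits} for each unordered residue pair observed.

    aligned_seqs is a list of equal-length strings, one per aligned sequence.
    Pairs that never occur in the alignment are simply absent from the result.
    """
    # Count every unordered pair of residues found in the same column.
    pair_counts = Counter()
    for column in zip(*aligned_seqs):
        residues = [r for r in column if r not in gap_chars]
        for a, b in combinations(residues, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    if total == 0:
        raise ValueError("alignment contains no residue pairs")

    # Observed probability of each pair.
    observed = {pair: n / total for pair, n in pair_counts.items()}

    # Background probability of each residue, derived from the pair data.
    background = Counter()
    for (a, b), freq in observed.items():
        if a == b:
            background[a] += freq
        else:
            background[a] += freq / 2
            background[b] += freq / 2

    # Log of observed over expected; a mixed pair can arise in two orders.
    scores = {}
    for (a, b), obs in observed.items():
        expected = background[a] * background[b]
        if a != b:
            expected *= 2
        scores[(a, b)] = math.log2(obs / expected)
    return scores


if __name__ == "__main__":
    for pair, score in sorted(log_odds_matrix(["ACDA", "ACEA", "ACDA"]).items()):
        print(pair, round(score, 3))
```

Keys are stored as sorted tuples so that ("A", "C") and ("C", "A") share one entry, which mirrors the symmetry of the matrix.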

THE MEANING OF THE SCORES

By looking at any element of the matrix you can tell whether the corresponding pair of amino acids is:

1 - Aligned exactly as often as expected by chance (score of zero).

2 - Aligned less often than expected by chance (negative score).

3 - Aligned more often than expected by chance (positive score).
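
To show how the signs of the scores affect an alignment, here is a small Python sketch that scores two candidate ungapped alignments by summing matrix entries. The matrix values and the function name alignment_score are made up for illustration. They do not come from MatrixGen output, and real alignment programs also apply gap penalties.

```python
def alignment_score(seq_a, seq_b, matrix):
    """Sum the matrix entries for each aligned pair of residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    # Sort each pair so lookups work regardless of residue order.
    return sum(matrix[tuple(sorted(pair))] for pair in zip(seq_a, seq_b))


# Hypothetical scores: identical pairs are positive, the mixed pair negative.
toy_matrix = {("A", "A"): 2, ("A", "C"): -1, ("C", "C"): 3}

if __name__ == "__main__":
    candidates = [("AC", "AC"), ("AC", "CA")]
    best = max(candidates, key=lambda c: alignment_score(c[0], c[1], toy_matrix))
    print("best candidate:", best)
```

Positive entries raise the total and negative entries lower it, so the candidate that pairs residues seen together more often than chance wins.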

SUPPORTED INPUT FORMATS

There are many different file formats for aligned protein data sets. MatrixGen currently supports only NEXUS. Input must be a NEXUS file with a valid DATA or CHARACTERS block. If your data sets are in another format, I suggest that you use a utility such as seqVerter to convert your data sets into the NEXUS format. If you have a compelling reason for MatrixGen to support a different file format, submit an enhancement request on my SourceForge project page and I may consider it. Currently I am using a slightly modified version of the Nexus Class Library written by Paul O. Lewis to parse NEXUS files.
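
For readers unfamiliar with the format, the Python sketch below embeds a small NEXUS DATA block and pulls the aligned sequences out of its MATRIX command. It is only a simplified illustration, not the Nexus Class Library parser that MatrixGen uses. The taxon names, the sequences, and the function name read_nexus_matrix are invented for this example, and the reader ignores many NEXUS features, such as quoted taxon names.

```python
import re

# A made-up example file: two taxa, four characters each.
SAMPLE_NEXUS = """#NEXUS
begin data;
  dimensions ntax=2 nchar=4;
  format datatype=protein gap=-;
  matrix
    seq_one  ACDE
    seq_two  AC-E
  ;
end;
"""


def read_nexus_matrix(text):
    """Return {taxon_name: sequence} from the first MATRIX command in text."""
    # NEXUS comments are enclosed in square brackets; drop them first.
    text = re.sub(r"\[.*?\]", "", text, flags=re.S)
    # The MATRIX command runs from the keyword to the next semicolon.
    match = re.search(r"\bmatrix\b(.*?);", text, flags=re.S | re.I)
    if match is None:
        raise ValueError("no MATRIX command found")
    sequences = {}
    for line in match.group(1).splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # blank line, or a name with no residues
        name, residues = parts[0], "".join(parts[1:])
        # Appending lets repeated names (interleaved layout) accumulate.
        sequences[name] = sequences.get(name, "") + residues
    return sequences


if __name__ == "__main__":
    print(read_nexus_matrix(SAMPLE_NEXUS))
```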

BUGS

If you want to report a bug or make a suggestion, please do it from my SourceForge project page, and please include the data that causes the problem. If you do not include the data set, I will most likely not fix the problem. You can also email me directly with your suggestions and bug reports at: