Abstract:
Compression of large DNA sequences has been a subject of great interest ever since
genomic databases became available. Although two bits are sufficient to encode each of the
four DNA bases (namely A, G, T and C), the massive size of DNA sequences creates a need for
more efficient compression. In this article we propose an improved version of an existing
algorithm known as “GtEncseq”, which describes a procedure for storing multiple
biological sequences of variable character size, with customizable character
transformations, “wildcard” and “separator” support, and a diverse group of internal
representations optimized for different arrangements of wildcards and sequence lengths.
Our main goal is extensive compression of the data by eliminating the wildcard entries
from the encoded sequence while keeping them available for reuse. Keeping the time
required to encode the desired sequence low is a further consideration.
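
To make the two-bit idea concrete, the following is a minimal illustrative sketch in Python. It is not the GtEncseq format and not the improved algorithm proposed in this article; it only assumes that every character other than A, C, G and T is treated as a wildcard, that wildcards are removed from the packed stream, and that their positions are kept in a separate list so the original sequence can be restored. The names `encode`, `decode`, `BASE_TO_BITS` and `BITS_TO_BASE` are hypothetical.

```python
# Illustrative sketch only: 2-bit packing with wildcards stored out of band.
# The code assignment below is an assumption, not the one used by GtEncseq.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode(seq):
    """Return (packed bytes, wildcard list, original length)."""
    wildcards = []  # (position, character) for every non-ACGT symbol
    codes = []      # 2-bit codes of the remaining bases, in order
    for pos, ch in enumerate(seq):
        code = BASE_TO_BITS.get(ch)
        if code is None:
            wildcards.append((pos, ch))
        else:
            codes.append(code)
    # Four bases per byte, first base in the lowest two bits.
    packed = bytearray((len(codes) + 3) // 4)
    for i, code in enumerate(codes):
        packed[i // 4] |= code << (2 * (i % 4))
    return bytes(packed), wildcards, len(seq)


def decode(packed, wildcards, length):
    """Rebuild the original sequence, re-inserting the wildcards."""
    wild = dict(wildcards)
    out = []
    i = 0  # index of the next base in the packed stream
    for pos in range(length):
        if pos in wild:
            out.append(wild[pos])
        else:
            out.append(BITS_TO_BASE[(packed[i // 4] >> (2 * (i % 4))) & 3])
            i += 1
    return "".join(out)
```

In this sketch, a sequence such as `ACGTNNAC` packs its six regular bases into two bytes, and the two `N` entries are recorded only as positions 4 and 5. Storing each wildcard individually is the simplest choice; long wildcard runs would be cheaper to store as ranges.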
Description:
Supervised by
Prof. Dr. M. A. Mottalib,
Co-Supervisor,
Tareque Mohmud Chowdhury,
Assistant Professor,
Computer Science and Engineering (CSE),
Islamic University of Technology (IUT),
Board Bazar, Gazipur-1704, Bangladesh.