An Improved Data Structure for Efficient Storage of Multiple BIOsequences

Hasan, Md. Zahidul; Shimul, Anik Islam

URI: http://hdl.handle.net/123456789/1189

Date: 2012-11-15

Abstract:

Compression of large DNA sequences has been a subject of great interest since genomic databases became available. Although two bits are sufficient to encode each of the four DNA bases (A, G, T and C), the massive size of DNA sequences creates a need for efficient compression. In this article we propose an improved version of an existing approach known as “GtEncseq”, which stores multiple biological sequences of variable character size, with customizable character transformations, “wildcard” and “separator” support, and a diverse group of internal representations optimized for different arrangements of wildcards and sequence lengths. Our main target is extensive compression of the data by eliminating the wildcard entries from the sequence while keeping them available for reuse. Keeping the time required to encode a sequence low is a further consideration.
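The idea of two-bit base encoding with wildcards taken out of the sequence can be illustrated with a minimal Python sketch. It is not the GtEncseq format and not the method proposed in this work; the code table, the function names `encode` and `decode`, and the run-list layout `(start, length, symbol)` are all assumptions made for illustration only. Any character other than A, C, G or T is treated as a wildcard, removed from the packed data, and recorded separately so that the original sequence can be restored.

```python
from typing import List, Tuple

# Assumed 2-bit code assignment; the order is arbitrary for this sketch.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"

# Hypothetical side table: (start position, run length, symbol) per wildcard run.
Wildcards = List[Tuple[int, int, str]]


def encode(seq: str) -> Tuple[bytes, int, Wildcards]:
    """Pack the bases of seq at 2 bits each; list wildcard runs separately."""
    packed = bytearray()
    wildcards: Wildcards = []
    current = 0   # byte currently being filled
    filled = 0    # number of bases already placed in `current`
    n_bases = 0   # total number of packed bases
    for pos, ch in enumerate(seq.upper()):  # input is normalised to upper case
        if ch in CODE:
            current = (current << 2) | CODE[ch]
            filled += 1
            n_bases += 1
            if filled == 4:
                packed.append(current)
                current, filled = 0, 0
        else:
            last = wildcards[-1] if wildcards else None
            if last and last[2] == ch and last[0] + last[1] == pos:
                # Extend the run that ends directly before this position.
                wildcards[-1] = (last[0], last[1] + 1, ch)
            else:
                wildcards.append((pos, 1, ch))
    if filled:
        # Left-align the final, partially filled byte.
        packed.append(current << (2 * (4 - filled)))
    return bytes(packed), n_bases, wildcards


def decode(packed: bytes, n_bases: int, wildcards: Wildcards) -> str:
    """Unpack the bases and re-insert the wildcard runs at their positions."""
    bases = []
    for i in range(n_bases):
        shift = 2 * (3 - i % 4)
        bases.append(BASES[(packed[i // 4] >> shift) & 3])
    out: List[str] = []
    b = 0  # index of the next base to emit
    for start, length, sym in wildcards:  # runs are in ascending position order
        take = start - len(out)           # bases that precede this run
        out.extend(bases[b:b + take])
        b += take
        out.extend(sym * length)
    out.extend(bases[b:])
    return "".join(out)


packed, n_bases, wildcards = encode("ACGTNNNACGTA-AC")
assert decode(packed, n_bases, wildcards) == "ACGTNNNACGTA-AC"
```

In this sketch the packed data costs a quarter of a byte per base, and a long wildcard run costs one table entry regardless of its length, which is why sequences with few, long wildcard runs benefit most from this kind of separation.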

Description:

Supervised by Prof. Dr. M. A. Mottalib; Co-Supervisor: Tareque Mohmud Chowdhury, Assistant Professor, Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh.
