An Improved Data Structure for Efficient Storage of Multiple BIOsequences

Hasan, Md. Zahidul; Shimul, Anik Islam

dc.contributor.author	Hasan, Md. Zahidul
dc.contributor.author	Shimul, Anik Islam
dc.date.accessioned	2021-10-12T06:39:56Z
dc.date.available	2021-10-12T06:39:56Z
dc.date.issued	2012-11-15
dc.identifier.citation	[1] Sascha Steinbiss and Stefan Kurtz, “A New Efficient Data Structure for Storage and Retrieval of Multiple BIOsequences”. [2] Shanika Kuruppu, Bryan Beresford-Smith, Thomas Conway, and Justin Zobel, “Iterative Dictionary Construction for Compression of Large DNA Data Sets”. [3] Hieu Dinh and Sanguthevar Rajasekaran, “A memory-efficient data structure representing exact-match overlap graphs with application for next-generation DNA assembly”. [4] Sheng Bao, Shi Chen, Zhi-Qiang Jing, and Ran Ren, “A DNA Sequence Compression Algorithm Based on LUT and LZ77”. [5] D.A. Benson, I. Karsch-Mizrachi, D.J. Lipman, J. Ostell, and E.W. Sayers, “GenBank,” Nucleic Acids Research, vol. 38 (Database Issue), pp. D46-D51, 2010. [6] A. Morgulis, G. Coulouris, Y. Raytselis, T.L. Madden, R. Agarwala, and A.A. Schaffer, “Database Indexing for Production MegaBLAST Searches,” Bioinformatics, vol. 24, no. 16, pp. 1757-1764, 2008. [7] Srinivasa K. G., Jagadish M., Venugopal K. R., and L. M. Patnaik, “Efficient Compression of non-repetitive DNA sequences using Dynamic Programming”. [8] E. Rivals, J-P. Delahaye, M. Dauchet, and O. Delgrange, “A guaranteed compression scheme for repetitive DNA sequences,” LIFL Lille I University technical report, page 285, 1995. [9] Raffaele Giancarlo, Davide Scaturro, and Filippo Utro, “Textual data compression in computational biology: a synopsis,” Dipartimento di Matematica ed Applicazioni, Università di Palermo, Palermo, Italy. [10] Marty C. Brandon, Douglas C. Wallace, and Pierre Baldi, “Data structures and compression algorithms for genomic sequence data”. [11] Gergely Korodi and Ioan Tabus, “Compression of Annotated Nucleotide Sequences”. [12] “The NCBI C Toolkit,” ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools, 2011. [13] W.J. Kent, “BLAT-the BLAST-Like Alignment Tool,” Genome Research, vol. 12, no. 4, pp. 656-664, 2002. [14] A. Döring, D. Weese, T. Rausch, and K. Reinert, “SeqAn an Efficient, Generic C++ Library for Sequence Analysis,” BMC Bioinformatics, vol. 9, article 11, 2008.	en_US
dc.identifier.uri	http://hdl.handle.net/123456789/1189
dc.description	Supervised by Prof. Dr. M. A. Mottalib; Co-Supervisor: Tareque Mohmud Chowdhury, Assistant Professor, Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh.	en_US
dc.description.abstract	Compression of large DNA sequences has been a subject of great interest since genomic databases became available. Although two bits are sufficient to encode each of the four DNA bases (A, G, T and C), the massive size of DNA sequences creates a need for efficient compression. In this article we propose an improved version of an existing algorithm known as “GtEncseq”, which describes a procedure for storing multiple biological sequences of variable character size, with customizable character transformations, “wildcard” and “separator” support, and a diverse group of internal representations optimized for different arrangements of wildcards and sequence lengths. Our main goal is extensive compression of the data by eliminating the wildcard entries from the sequence while keeping them available for reuse. An efficient encoding time for the desired sequence is also a consideration.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.title	An Improved Data Structure for Efficient Storage of Multiple BIOsequences	en_US
dc.type	Thesis	en_US