Development of an Efficient Algorithm for DNA Sequence Alignment Based on Cosine Similarity

Ahsan, T.M. Ariq; Faize, MD Saleh

dc.contributor.author	Ahsan, T.M. Ariq
dc.contributor.author	Faize, MD Saleh
dc.date.accessioned	2021-10-12T04:54:01Z
dc.date.available	2021-10-12T04:54:01Z
dc.date.issued	2012-11-15
dc.identifier.uri	http://hdl.handle.net/123456789/1171
dc.description	Supervised by Prof. Dr. M. A. Mottalib, Co-Supervisor, Abid Hasan, Lecturer, Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704. Bangladesh.	en_US
dc.description.abstract	This thesis addresses approximate gene matching using cosine similarity. Although several gene matching algorithms have been developed since the post-Sanger era, relatively little progress has been made in this field. We devised a new formula for gene sequence matching, incorporated a gap algorithm into it, and evaluated it against several well-established methods (the dot-matrix method, dynamic programming, and the word method). We sacrificed efficiency for accuracy, but we think the running time was still reasonable. We intend to develop and assess the method further in the near future.	en_US
dc.language.iso	en	en_US
dc.publisher	Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), Board Bazar, Gazipur-1704, Bangladesh	en_US
dc.title	Development of an Efficient Algorithm for DNA Sequence Alignment Based on Cosine Similarity	en_US
dc.type	Thesis	en_US