Bag-of-Words Modeling for Hidden Web Crawler


dc.contributor.author Mohamed, Njingamndap Ousmanou
dc.date.accessioned 2021-01-27T09:22:42Z
dc.date.available 2021-01-27T09:22:42Z
dc.date.issued 2015-11-15
dc.identifier.citation
[1] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, “Accessing the deep web,” Commun. ACM, vol. 50, no. 5, pp. 94–101, May 2007. [Online]. Available: http://doi.acm.org/10.1145/1230819.1241670
[2] O. A. McBryan, “GENVL and WWWW: Tools for taming the web,” in Proceedings of the First International World Wide Web Conference, 1994, pp. 79–90.
[3] S. Brin and L. Page, “The anatomy of a large-scale hypertextual web search engine,” in Proceedings of the Seventh International Conference on World Wide Web (WWW7). Amsterdam, The Netherlands: Elsevier Science Publishers B. V., 1998, pp. 107–117. [Online]. Available: http://dl.acm.org/citation.cfm?id=297805.297827
[4] L. Barbosa and J. Freire, “Siphoning hidden-web data through keyword-based interfaces,” in SBBD, 2004, pp. 309–321.
[5] S. Raghavan and H. Garcia-Molina, “Crawling the hidden web,” in Proceedings of the 27th International Conference on Very Large Data Bases (VLDB ’01). San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 129–138. [Online]. Available: http://dl.acm.org/citation.cfm?id=645927.672025
[6] M. Soulemane, M. Rafiuzzaman, and H. Mahmud, “Crawling the hidden web: An approach to dynamic web indexing,” International Journal of Computer Applications (0975–8887), vol. 55, no. 1, October 2012.
[7] H. Yu, J. Guo, Z. Yu, Y. Xian, and X. Yan, “A novel method for extracting entity data from deep web precisely.”
[8] A. Ntoulas, “Downloading textual hidden web content through keyword queries,” in JCDL, 2005, pp. 100–109.
[9] S. W. Liddle, D. W. Embley, D. T. Scott, and S. H. Yau, “Extracting data behind web forms,” Lecture Notes in Computer Science, vol. 2784, pp. 402–413, Jan. 2003.
[10] Z. He, H. Luo, J. Fan, and X. Liu, “Extracting the semantic content of web pages via repeated structures.”
[11] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, vol. 280, no. 5360, pp. 98–100, 1998.
[12] M. K. Bergman, “The deep web: Surfacing hidden value,” September 2001. [Online]. Available: http://www.brightplanet.com/pdf/deepwebwhitepaper.pdf
[13] Z. Bar-Yossef and S. Rajagopalan, “Template detection via data mining and its applications,” in International Conference on World Wide Web, 2002.
[14] S.-H. Lin and J.-M. Ho, “Discovering informative content blocks from web documents,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2002.
[15] L. Yi, B. Liu, and X. Li, “Eliminating noisy information in web pages for data mining,” in ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[16] S. Debnath, P. Mitra, and C. L. Giles, “Automatic extraction of informative blocks from webpages,” in ACM SIGAPP SAC, 2005.
[17] T. Gottron, “Content code blurring: A new approach to content extraction,” in International Workshop on Text Information Retrieval, 2008, pp. 29–33.
[18] T. Weninger, W. H. Hsu, and J. Han, “CETR: Content extraction via tag ratios,” in International Conference on World Wide Web, 2010.
[19] S. Lawrence and C. L. Giles, “Searching the World Wide Web,” Science, vol. 280, no. 5360, pp. 98–100, 1998.
[20] M. K. Bergman, “The deep web: Surfacing hidden value,” September 2001. [Online]. Available: http://www.brightplanet.com/pdf/deepwebwhitepaper.pdf
[21] B. He, M. Patel, Z. Zhang, and K. C.-C. Chang, “Accessing the deep web,” Commun. ACM, vol. 50, no. 5, pp. 94–101, May 2007. [Online]. Available: http://doi.acm.org/10.1145/1230819.1241670
[22] S. M. Mirtaheri, M. E. Dinçtürk, S. Hooshmand, G. V. Bochmann, and G.-V. Jourdan, “A brief history of web crawlers,” arXiv:1405.0749v1 [cs.IR], 5 May 2014.
[23] J. Bau, E. Bursztein, D. Gupta, and J. Mitchell, “State of the art: Automated black-box web application vulnerability testing,” in 2010 IEEE Symposium on Security and Privacy (SP). IEEE, 2010, pp. 332–345.
[24] A. Doupé, M. Cova, and G. Vigna, “Why Johnny can’t pentest: An analysis of black-box web vulnerability scanners,” in Detection of Intrusions and Malware, and Vulnerability Assessment. Springer, 2010, pp. 111–131.
[25] J. Marini, Document Object Model, 1st ed. New York, NY, USA: McGraw-Hill, Inc., 2002.
[26] P. Liakos, A. Ntoulas, A. Labrinidis, and A. Delis, “Focused crawling for the hidden web,” Springer Science+Business Media New York, 16 April 2015.
[27] P. Schonhofen, “Identifying document topics using the Wikipedia category network,” in Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence (WI), Hong Kong, 2006, pp. 456–462.
[28] http://stackoverflow.com en_US
dc.identifier.uri http://hdl.handle.net/123456789/796
dc.description Supervised by Md. Kamrul Hasan, PhD, Associate Professor, Department of Computer Science and Engineering (CSE), Islamic University of Technology (IUT), The Organization of the Islamic Cooperation (OIC), Gazipur-1704, Dhaka, Bangladesh en_US
dc.description.abstract The Internet is an integral part of our daily life; many activities, such as gaming and education, now take place online. Before using any information online we must first find it, which is why web search engines such as Google, web applications that search the Internet for websites, have become very popular. Search engines appeared in the early days of the Internet and became the standard tool for searching. The component of a search engine that collects the information we see when searching is called the crawler. The crawler retrieves the content of each page whose link it knows, extracts all the links found in that content, and then repeats the process on the newly discovered links, continuing until no unvisited links remain. However, information on the Internet is not limited to the pages we can reach this way: a tremendous amount of information that we never see is stored inside databases, and this portion is called the hidden web. To access it, a user normally has to fill in a form; on an e-commerce website, for example, we type keywords into a search box to find information about a product. A hidden-web crawler therefore has to submit the right words to such forms in order to crawl the content stored in the underlying databases. Much research has addressed this problem of determining an optimal list of words with which to fill in the forms. This list of words is known as the bag-of-words (BOW). In this thesis, we propose a model for structuring the words inside the BOW that reduces the cost of accessing a word and its set of related words. We have successfully implemented the structure, integrated it into a crawler that we also implemented, and present experimental results that show the effect of the structure on the crawler. en_US
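The crawl loop described in the abstract can be sketched as follows. This is a minimal illustration only, not the thesis implementation: `fetch` and `extract_links` are hypothetical callables standing in for an HTTP client and an HTML link parser.

```python
from collections import deque

def crawl(seed_urls, fetch, extract_links, max_pages=1000):
    """Breadth-first crawl as described in the abstract: fetch each
    known page, harvest the links inside its content, and repeat on
    the new links until no unvisited links remain (or a page budget
    is exhausted).

    fetch(url) -> page content (or None on failure)     [assumed helper]
    extract_links(content, base) -> iterable of URLs    [assumed helper]
    """
    frontier = deque(seed_urls)   # links whose content is still to be retrieved
    visited = set()               # links already processed
    pages = {}                    # url -> retrieved content
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        content = fetch(url)
        if content is None:
            continue
        pages[url] = content
        # take all the links out of the content and queue the new ones
        for link in extract_links(content, base=url):
            if link not in visited:
                frontier.append(link)
    return pages
```

A hidden-web crawler extends this loop with an extra step: when a page contains a search form, it submits words drawn from the BOW as queries, so that pages generated from the database also enter the frontier.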
dc.language.iso en en_US
dc.publisher Department of Computer Science and Engineering, Islamic University of Technology, Gazipur, Bangladesh en_US
dc.title Bag-of-Words Modeling for Hidden Web Crawler en_US
dc.type Thesis en_US

