Writing a Web-page Crawler
The Internet Archive uses the Heritrix crawler to crawl web pages [38]. Heritrix is freely available, but one of its limitations is that it does not filter out irrelevant information before downloading webpages; consequently, a Heritrix crawl returns many unrelated webpages along with a few relevant ones. We will develop a smart crawler that correctly crawls news webpages associated with national security.

We will build our smart crawler on a focused crawling approach. A focused crawler selectively seeks out pages that are relevant to a pre-defined set of topics, where the topics are specified not with keywords but with exemplary documents. Rather than collecting and indexing all accessible Web documents in order to answer every possible ad-hoc query, a focused crawler analyzes its crawl boundary to find the links that are likely to be most relevant to the crawl and avoids irrelevant regions of the Web. This leads to significant savings in hardware and network resources and helps keep the crawl more up-to-date. The focused crawler is guided by a classifier, which learns to recognize relevance from examples embedded in a topic taxonomy, and a distiller, which identifies topical vantage points on the Web [39]. A best-first crawling loop that realizes this idea is sketched below.

Our smart focused crawler will combine and apply different focused crawling approaches in order to produce a collection with high precision and recall. First, our crawler will use an advanced ontology that models disease outbreak security events (see the ontology subsection); earlier research showed that an event model improves the performance of a focused crawler [40].
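To make the mechanism concrete, the Python sketch below shows a minimal best-first focused crawl under simplifying assumptions: `relevance_fn` stands in for the taxonomy-trained classifier (and anchor-text scoring loosely plays the distiller's role of prioritizing promising links). The function names and thresholds are hypothetical illustrations, not part of our actual implementation.

```python
import heapq
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup


def crawl_focused(seed_urls, relevance_fn, threshold=0.5, max_pages=100):
    """Best-first focused crawl: fetch pages in order of predicted
    relevance, and expand links only from pages that score above the
    threshold, pruning irrelevant regions of the Web."""
    # Max-heap via negated scores; seeds start with maximal priority.
    frontier = [(-1.0, url) for url in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    collected = []

    while frontier and len(collected) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            resp = requests.get(url, timeout=10)
        except requests.RequestException:
            continue  # skip unreachable pages
        soup = BeautifulSoup(resp.text, "html.parser")
        text = soup.get_text(" ", strip=True)

        score = relevance_fn(text)  # classifier's relevance estimate
        if score < threshold:
            continue  # prune: do not expand links from irrelevant pages
        collected.append((url, score))

        # Expand only relevant pages; anchor text guides link priority.
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).scheme in ("http", "https") and link not in seen:
                seen.add(link)
                link_score = relevance_fn(a.get_text(" ", strip=True) or link)
                heapq.heappush(frontier, (-link_score, link))

    return collected
```

For a quick experiment, `relevance_fn` could be as simple as a keyword-overlap score against the exemplary documents; in the envisioned system it would be the classifier trained on examples embedded in the topic taxonomy, as described above.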