Web Crawler Using ML/DL
Submitted by
KHUSHAL (06314803120)
KRISHNA KUMAR (06514803120)
I would also like to express my thanks to the faculty of the Department of Information Technology
for providing a conducive environment for learning and research. Their encouragement and
constructive feedback have significantly contributed to the successful completion of this project.
Special thanks to my family and friends for their unwavering support and encouragement during
this academic pursuit.
Finally, I would like to acknowledge the assistance provided by various resources and references
that enriched the content of this project.
KHUSHAL
Enrollment No: 06314803120
KRISHNA KUMAR
Enrollment No: 06514803120
CANDIDATE’S DECLARATION
I hereby declare that the work presented in this project report titled “WEB CRAWLER USING ML/DL”, submitted by me in partial fulfillment of the requirement for the award of the degree of Bachelor of Technology (B.Tech.) in the Department of Information Technology, Maharaja Agrasen Institute of Technology, is an authentic record of my project work carried out under the guidance of (Supervisor's name and affiliation).
The matter presented in this project report has not been submitted, either in part or in full, to any University or Institute for the award of any degree.
Krishna Kumar
Place: Delhi
Roll No.: 06514803120
SUPERVISOR’S CERTIFICATE
This is to certify that the project entitled “WEB CRAWLER USING ML/DL”, which is being submitted by Mr. Krishna Kumar and Mr. Khushal to the Maharaja Agrasen Institute of Technology, Rohini, in fulfillment of the requirement for the award of the degree of Bachelor of Technology (B.Tech.), is a record of bonafide project work carried out by them under my guidance and supervision.
Table of Contents
Chapter 1
1 Introduction
1.1 Architecture
1.2 How a Spider Interacts with a Web Server
1.3 Possible Stakeholders
1.4 Summary
Chapter 2
2 Parsing Web Page
2.1 HTML Tag Tree
2.2 Words Filtering
Chapter 3
3 Stemming
3.1 Introduction
3.2 Algorithms
3.3 Language Challenges
3.4 Error Metrics
3.5 Applications
Chapter 4
4 Clustering
4.1 Types of Clustering
4.2 Hierarchical Clustering
4.3 Partitional Clustering
4.4 Applications
4.5 Comparisons between Data Clustering Techniques
Chapter 5
5 Survey
5.1 Trellian SiteSpider 1.00.004
5.2 Icegiant Data Extractor
5.3 ETagFix by ISAPILabs
5.4 StreamBull by J.P.C. Development Services
5.5 Inspyder Site Map Creator
5.6 Conclusion
Chapter 6
6 Analysis & Design
6.1 Methods/Techniques
6.2 Use Case Model
6.3 DFD
6.4 ERD
Chapter 7
7 Implementation
7.1 Techniques/Tools
7.2 Class Diagram of Web Spider
7.3 Class Diagram of Web Portal
Chapter 8
8 Results & Conclusion
8.1 Conclusion
APPENDIX – A
A.1 Purpose of Vision Document
A.2 Product Overview
A.3 User Descriptions
A.4 Problem Statement
A.5 Product Position Statement
APPENDIX – B
B.1 Purpose
B.2 Document Conventions
B.3 Intended Audience and Reading Suggestions
B.4 Project Scope
B.5 Overall Description
B.6 System Features
B.7 Nonfunctional Requirements
APPENDIX – C
Appendix-C [Glossary]
APPENDIX – D
Appendix-D [References]
D.1 Some Other Resources
Chapter 1
[Introduction]
1 Introduction
A web crawler (also known as a web spider or web robot) is a program or automated script that
browses the World Wide Web in a methodical, automated manner. Other less frequently used
names for web crawlers are ants, automatic indexers, bots, and worms.
The process is called web crawling or spidering. Many sites, in particular search engines, use
spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy
of all the visited pages for later processing by a search engine that will index the downloaded pages
to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website,
such as checking links or validating HTML code. Also, crawlers can be used to gather specific
types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
1.1 Architecture
A crawler must not only have a good crawling strategy, but it should also have a highly optimized architecture. As Shkapenyuk and Suel (Shkapenyuk and Suel, 2002) noted: "While it is fairly easy to build a slow crawler that downloads a few pages per second for a short period of time, building a high-performance system that can download hundreds of millions of pages over several weeks presents a number of challenges in system design, I/O and network efficiency, and robustness and manageability."
Generally, the architecture of common web crawlers can be described as in Fig. 1.1.1. It shows that a crawler has two major components, i.e., the Scheduler and the Multi-threaded Downloader. The job of the Downloader is to initiate and send HTTP requests to servers (the Internet) for downloading the content of web pages. The URLs and the text/metadata are then separated from the downloaded pages: the text and metadata are saved into storage (the database), and the list of extracted URLs is passed back to the Scheduler for recursive download.
[Fig. 1.1.1] High-Level Architecture of a Standard Web Spider/Crawler
How Web Spiders Work
Web Spiders start by parsing a specified web page, noting any hypertext links on that page that
point to other web pages. They then parse those pages for new links, and so on, recursively. Web-Spider software doesn't actually move around to different computers on the Internet, as viruses or
intelligent agents do. A Spider resides on a single machine. The Spider simply sends HTTP
requests for documents to other machines on the Internet, just as a web browser does when the
user clicks on links. All the Spider really does is to automate the process of following links.
Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves some subsequent purpose. The most common use is to build an index for a web search engine, but Spiders are used for other purposes as well; e.g., Muscle Fish uses a Spider to search the Web for audio files. It turns out that searching for audio files is not very different from searching for any other kind of file.
1.2 How a Spider Interacts with a Web Server
Crawlers usually identify themselves to a web server by using the User-agent field of an HTTP request. Web site administrators typically examine their web server’s log and use the user agent field to determine which crawlers have visited the web server and how often. The user agent
field may include a URL where the Web site administrator may find out more information about
the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying
information in the user agent field, or they may mask their identity as a browser or other well-
known crawler. It is important for web crawlers to identify themselves so Web site administrators
can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler
trap or they may be overloading a web server with requests, and the owner needs to stop the
crawler. Identification is also useful for administrators who are interested in knowing when they may expect their web pages to be indexed by a particular search engine.
1.3 Possible Stakeholders
Possible stakeholders of this system include:
• E-Business
• Research
1.4 Summary
The Web Spider is a member of a software family that is responsible for crawling web pages and populating databases for search engines, using fast and intelligent techniques. The product also includes a web-based module that uses the database maintained by the spider.
Chapter 2
[Background & Technical Issues – Parsing Web Page]
2 Parsing Web Page
The crawler sends an HTTP request for any specified web URL. The received web page contains the data to be analyzed, and this data is organized by HTML tags. Our crawling application should therefore be able to identify HTML tags and fetch the textual information written inside them, since this fetched text is what allows us to keep a record of keywords and their associated URLs.
2.1 HTML Tag Tree
A crawler may assess the value of a URL or a content word by examining the HTML tag context in which it resides. For this, a crawler may need to utilize the tag tree or DOM structure of the HTML.
Figure 2.1.1 shows a tag tree corresponding to an HTML source. The <html> tag forms the root of
the tree and various tags and texts form nodes of the tree. Unfortunately, many Web pages contain
badly written HTML. For example, a start tag may not have an end tag (it may not be required by
the HTML specification), or the tags may not be properly nested. In many cases, the <html> tag
or the <body> tag is altogether missing from the HTML page. Thus, structure-based criteria often require the prior step of converting a dirty HTML document into a well-formed one, a process that
is called tidying an HTML page. This includes both insertion of missing tags and the reordering of
tags in the page. Tidying an HTML page is necessary for mapping the content of a page onto a tree
structure with integrity, where each node has a single parent. Hence, it is an essential precursor to
analyzing an HTML page as a tag tree. Note that analyzing the DOM structure is only necessary
if the topical crawler intends to use the HTML document structure in a nontrivial manner. For
example, if the crawler only needs the links within a page, and the text or portions of the text in
the page, one can use simpler HTML parsers. Such parsers are also readily available in many
languages.
An example:
<html>
<head>
<title>Projects</title>
</head>
<body>
<h4>Current Projects</h4>
<ul>
<li><a href="[Link]">LAMP</a> Linkage analysis with MP.</li>
<li><a href="[Link]">NI</a> Network Infrastructure.</li>
<li><a href="[Link]">ADNA</a> A DNA Algorithm.</li>
<li><a href="[Link]">DLT</a> A distributed, first-order logic theorem.</li>
</ul>
</body>
</html>
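When only the links and the text of a page are needed, a ready-made parser suffices. The following sketch uses the HTML parser bundled with the JDK; the class name PageParser is ours, for illustration. Run against a page like the example above, it prints the hyperlinks and the visible text.

import java.io.*;
import java.net.URL;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Extract hyperlinks and visible text from a page using the JDK's HTML parser.
public class PageParser {
    public static void main(String[] args) throws IOException {
        Reader in = new InputStreamReader(new URL(args[0]).openStream());
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            public void handleStartTag(HTML.Tag t, MutableAttributeSet a, int pos) {
                if (t == HTML.Tag.A) {                      // anchor tag: collect its href
                    Object href = a.getAttribute(HTML.Attribute.HREF);
                    if (href != null) System.out.println("LINK: " + href);
                }
            }
            public void handleText(char[] data, int pos) {  // visible text between tags
                System.out.println("TEXT: " + new String(data));
            }
        };
        new ParserDelegator().parse(in, callback, true);    // true: ignore charset in <meta>
    }
}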
2.2 Words Filtering
After fetching the textual information, our application has to filter the keywords by dropping useless words (like a, is, the, also, although, etc.). In our case the useless words are defined as "all words that are not derived from a noun". We have therefore built a list of such words, called the StopList, which is used to drop common/useless words.
Our crawler parses every word in the textual information and performs a search in the StopList: if the word exists in the StopList, the word is dropped; otherwise it is saved into the bucket of keywords.
[Fig. 2.2.1] Word-filtering flow: fetch a word; if it is a member of the StopList, drop it, otherwise save it as a keyword; repeat until the last word is processed.
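A sketch of this filtering step in Java follows; the stop words shown are a small illustrative subset of the actual StopList, not the project's full list.

import java.util.*;

// Words found in the StopList are dropped; everything else goes
// into the keyword bucket.
public class WordFilter {
    private static final Set STOP_LIST = new HashSet(
        Arrays.asList(new String[] {"a", "an", "the", "is", "are", "also", "although"}));

    public static List filterKeywords(String[] words) {
        List keywords = new ArrayList();
        for (int i = 0; i < words.length; i++) {
            String w = words[i].toLowerCase();
            if (!STOP_LIST.contains(w)) keywords.add(w);   // keep non-stop words
        }
        return keywords;
    }
}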
Chapter 3
[Background & Technical Issues - Stemming]
3 Stemming
3.1 Introduction
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base
or root form — generally a written word form. The stem need not be identical to the morphological
root of the word; it is usually sufficient that related words map to the same stem, even if this stem
is not in itself a valid root. Algorithms for stemming have been a long-standing problem in computer science;
the first paper on the subject was published in 1968. The process of stemming, often called
conflation, is useful in search engines for query expansion or indexing and other natural language
processing problems.
Examples
A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty", etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem".
History
The first ever published stemmer was written by Julie Beth Lovins in 1968. [3.2] This paper was remarkable for its early date and had great influence on later work in this area.
A later stemmer was written by Martin Porter and was published in the July 1980 issue of the
journal Program. This stemmer was very widely used and became the de facto standard algorithm
used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on
stemming and information retrieval. Many implementations of this algorithm were written and
freely distributed; however, many of these implementations contained subtle flaws. As a result,
these stemmers did not match their potential. To eliminate this source of error, Martin Porter
released an official free-software implementation of the algorithm around the year 2000. He
extended this work over the next few years by building Snowball, a framework for writing stemming algorithms, and implemented an improved English stemmer together with stemmers for several other languages.
3.2 Algorithms
There are several types of stemming algorithms, which differ with respect to performance, accuracy, and how certain stemming obstacles are overcome.
The term brute force is borrowed from a concept in artificial intelligence research and problem solving in mathematics denoted as brute force search. Brute force stemmers employ a lookup table
which contains relations between root forms and inflected forms. To stem a word, the table is
queried to find a matching inflection. If a matching inflection is found, the associated root form is
returned.
Brute force algorithms are initially very difficult to design given the immense amount of relations
that must be initially stored to produce an acceptable level of accuracy (the number can span well
into the millions). However, brute force algorithms are easy to improve in that decreasing the
stemming error is only a matter of adding more relations to the table. Someone with only minor experience in linguistics is capable of improving the algorithm, unlike the suffix stripping algorithms, whose improvement requires linguistic expertise.
The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner
morphological and inflexional endings from words in English. Its main use is as part of a term
normalization process that is usually done when setting up Information Retrieval systems. [3.4]
Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root
form relations. Instead, a typically smaller list of "rules" are stored which provide a path for the
algorithm, given an input word form, to find its root form. Some examples of the rules include:
• if the word ends in 'ed', remove the 'ed';
• if the word ends in 'ing', remove the 'ing';
• if the word ends in 'ly', remove the 'ly'.
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes
regarded as crude given the poor performance when dealing with exceptional relations (like 'ran'
and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical categories which have well-known suffixes, with few exceptions. This, however, is a problem, as
not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve
upon this
challenge. [3.1]
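As a toy illustration of the suffix rules listed above (not the full Porter algorithm, which adds measure conditions and recoding steps), a suffix stripper can be as small as the following sketch; the class name and length guards are our own assumptions.

// Toy suffix stripping following the three example rules above.
public class SuffixStripper {
    public static String stem(String word) {
        if (word.endsWith("ing") && word.length() > 4) return word.substring(0, word.length() - 3);
        if (word.endsWith("ed")  && word.length() > 3) return word.substring(0, word.length() - 2);
        if (word.endsWith("ly")  && word.length() > 3) return word.substring(0, word.length() - 2);
        return word;   // no rule applies: the word is its own stem
    }
}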
A more complex approach to the problem of determining a stem of a word is lemmatisation. This
process involves first determining the part of speech of a word, and applying different
normalization rules for each part of speech. The part of speech is first detected prior to attempting
to find the root since for some languages, the stemming rules change depending on a word's part
of speech.
This approach is highly conditional upon obtaining the correct lexical category (part of speech).
While there is overlap between the normalization rules for certain categories, identifying the wrong
category or being unable to produce the right category limits the added benefit of this approach
over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information
about the word to be stemmed, then we are able to more accurately apply normalization rules (which are, more or less, suffix stripping rules). [3.1]
Hybrid approaches use two or more of the approaches described above in unison. A simple example
is a suffix tree algorithm which first consults a lookup table using brute force. However, instead
of trying to store the entire set of relations between words in a given language, the lookup table is
kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If
the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.
[3.1]
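A sketch of this hybrid scheme, reusing the SuffixStripper sketch from above; the exception entries are illustrative, not an actual exception table.

import java.util.*;

// Hybrid stemming: a small table of frequent exceptions is consulted
// first, with rule-based suffix stripping as the fallback.
public class HybridStemmer {
    private static final Map EXCEPTIONS = new HashMap();
    static {
        EXCEPTIONS.put("ran", "run");    // irregular forms the rules can't handle
        EXCEPTIONS.put("mice", "mouse");
    }
    public static String stem(String word) {
        String w = word.toLowerCase();
        String root = (String) EXCEPTIONS.get(w);
        if (root != null) return root;   // frequent exception: table lookup
        return SuffixStripper.stem(w);   // otherwise fall back to suffix stripping
    }
}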
In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes,
several approaches also attempt to remove common prefixes. For example, given the word
indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same
approaches mentioned earlier apply, but go by the name affix stripping. [3.1]
Another family of algorithms uses a stem database (a set of documents that contain stem words). These stems need not be valid roots themselves; it is enough that they are common leading substrings of related words, and the length of the stem must of course lie within a specific range relative to the main word. For example, between the two words "be" and "beside", "be" is not accepted as the stem; but for "browse" and "browsing" the stem will be "brows".
3.3 Language Challenges
While much of the early academic work in this area focused on the English language (with significant use of the Porter Stemmer algorithm), many other languages have been investigated since. Hebrew and Arabic are still considered difficult research languages for stemming. English
stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person
singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but
stemmers become harder to design as the morphology, orthography, and character encoding of the
target language becomes more complex. For example, an Italian stemmer is more complex than
an English one (because of more possible verb inflections), a Russian one is more complex (more
possible noun declensions), and a Hebrew one is even more complex still (due to non-catenative morphology). Multilingual stemming applies morphological rules of two or more languages simultaneously, instead of rules for only a single language, when interpreting a search query. Commercial systems using multilingual stemming exist, but our project does not support multilingual stemming, as it targets English-language pages only.
3.4 Error Metrics
Overstemming is an error where two separate inflected words are stemmed to the same root, but
should not have been - a false positive. Understemming is an error where two separate inflected
words should be stemmed to the same root, but are not - a false negative. Stemming algorithms
attempt to minimize each type of error, although reducing one type can lead to increasing the other.
3.5 Applications
Stemmers are common elements in query systems such as Web search engines, since a user who
runs a query on "daffodils" would probably also be interested in documents that contain the word
"daffodil" (without the s). The effectiveness of stemming for English query systems were soon
found to be rather limited, however, and this has led early Information retrieval researchers to
deem stemming irrelevant in general. An alternative approach, based on searching for n-grams
rather than stems, may be used instead. Also, recent research has shown greater benefits for
English language words, and produces most accurate results to find root words. The stemmer is
applied only on keywords at web page, and then these stemmed words will be saved into database.
Chapter 4
[Background & Technical Issues - Clustering]
4 Clustering
Clustering is the classification of objects into different groups, or more precisely, the partitioning
of a data set into subsets (clusters), so that the data in each subset (ideally) share some common
trait - often proximity according to some defined distance measure. Data clustering is a common
technique for statistical data analysis, which is used in many fields, including machine learning, data mining, pattern recognition, image analysis, and bioinformatics. Besides the term data clustering (or just clustering), there are a number of terms with similar meanings, such as cluster analysis, automatic classification, and numerical taxonomy.
4.1 Types of Clustering
Hierarchical algorithms find successive clusters using previously established clusters, whereas partitional algorithms determine all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-
down"). Agglomerative algorithms begin with each element as a separate cluster and merge them
into successively larger clusters. Divisive algorithms begin with the whole set and proceed to divide it into successively smaller clusters.
Two-way clustering, co-clustering or biclustering are clustering methods where not only the
objects are clustered but also the features of the objects, i.e., if the data is represented in a data
matrix, the rows and columns are clustered simultaneously. Another important distinction is
whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is
that distances are symmetric (the distance from object A to B is the same as the distance from B to
A).
4.2 Hierarchical Clustering
Hierarchical clustering builds (agglomerative) or breaks up (divisive) a hierarchy of clusters. The traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements at one end and a single cluster containing every element at the other. Agglomerative algorithms begin at the leaves of the tree, whereas divisive algorithms begin at the root.
Cutting the tree at a given height will give a clustering at a selected precision.
In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}.
Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with a smaller number of larger clusters.
4.3 Partitional Clustering
K-means clustering
The K-means algorithm assigns each point to the cluster whose center (also called centroid) is
nearest. The center is the average of all the points in the cluster — that is, its coordinates are the
arithmetic mean for each dimension separately over all the points in the cluster.
Example: the data set has three dimensions and the cluster has two points, X = (x1, x2, x3) and Y = (y1, y2, y3). Then the centroid Z becomes Z = (z1, z2, z3), where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2, and z3 = (x3 + y3)/2.
The algorithm steps are:
I. Randomly generate k clusters and determine the cluster centers, or directly generate k random points as initial cluster centers.
II. Assign each point to the nearest cluster center.
III. Recompute the new cluster centers.
IV. Repeat steps II and III until some convergence criterion is met (usually that the assignment hasn't changed).
The main advantages of this algorithm are its simplicity and speed which allows it to run on large
datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting
clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does not ensure that the result has a global minimum of variance.
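A compact sketch of the algorithm on two-dimensional points (in our case, keyword frequency and URL frequency, as discussed at the end of this chapter); the class and method names are ours, for illustration.

import java.util.*;

// Sketch of k-means on 2-D points. Steps I-IV follow the list above.
public class KMeans {
    public static int[] cluster(double[][] points, int k, int maxIter) {
        Random rnd = new Random();
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++)                         // step I: random initial centers
            centers[c] = (double[]) points[rnd.nextInt(points.length)].clone();
        int[] assign = new int[points.length];
        Arrays.fill(assign, -1);                            // no point assigned yet
        for (int iter = 0; iter < maxIter; iter++) {
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {       // step II: assign to nearest center
                int best = 0;
                double bestDist = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[p][0] - centers[c][0];
                    double dy = points[p][1] - centers[c][1];
                    double d = dx * dx + dy * dy;
                    if (d < bestDist) { bestDist = d; best = c; }
                }
                if (assign[p] != best) { assign[p] = best; changed = true; }
            }
            if (!changed) break;                            // step IV: assignments stable
            double[][] sum = new double[k][2];
            int[] count = new int[k];
            for (int p = 0; p < points.length; p++) {       // step III: recompute centers
                sum[assign[p]][0] += points[p][0];
                sum[assign[p]][1] += points[p][1];
                count[assign[p]]++;
            }
            for (int c = 0; c < k; c++)
                if (count[c] > 0) {
                    centers[c][0] = sum[c][0] / count[c];
                    centers[c][1] = sum[c][1] / count[c];
                }
        }
        return assign;
    }
}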
Fuzzy c-means clustering
In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the cluster to a lesser degree than points in the center of the cluster. For each point x we have a coefficient giving the degree of being in the k-th cluster, u_k(x). Usually, the sum of those coefficients is defined to be 1:
for each x: sum over k of u_k(x) = 1.
With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree of belonging to the cluster:
center_k = sum over x of u_k(x)^m * x / sum over x of u_k(x)^m.
The degree of belonging is related to the inverse of the distance to the cluster center; the coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1:
u_k(x) = 1 / sum over j of ( d(center_k, x) / d(center_j, x) )^(2/(m-1)).
For m equal to 2, this is equivalent to normalizing the coefficients linearly to make their sum 1. When m is close to 1, the cluster center closest to the point is given much more weight than the others, and the algorithm is similar to k-means.
The fuzzy c -means algorithm is very similar to the k -means algorithm: I. Choose a number of
clusters.
II. Assign randomly to each point coefficients for being in the clusters.
III. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):
o Compute the centroid for each cluster, using the formula above .
o For each point, compute its coefficients of being in the clusters, using the formula above.
The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means: the minimum is a local minimum, and the results depend on the initial choice of weights. The expectation-maximization (EM) algorithm is a more statistically formalized method which includes some of these ideas: partial membership in classes. It has better convergence properties and is in general preferred to fuzzy c-means.
QT (quality threshold) clustering is an alternative method of partitioning data, invented for gene clustering. It requires more computing power than k-means, but does not require specifying the number of clusters a priori, and always returns the same result when run several times. The algorithm is:
I. The user chooses a maximum diameter for clusters.
II. Build a candidate cluster for each point by including the closest point, the next
closest, and so on, until the diameter of the cluster surpasses the threshold.
III. Save the candidate cluster with the most points as the first true cluster, and remove all points in the cluster from further consideration.
IV. Recurse with the reduced set of points.
The distance between a point and a group of points is computed using complete linkage, i.e., as the maximum distance from the point to any member of the group (see the agglomerative hierarchical clustering discussion above).
4.4 Applications
Some of the major applications of clustering are:
In the process of intelligent grouping of the files and websites, clustering may be used to create a
more relevant set of search results compared to normal search engines like Google. There are
currently a number of web-based clustering tools; some of them are discussed in Chapter 5 (Survey).
In the study of social networks, clustering may be used to recognize communities within large
groups of people.
Many data mining applications involve partitioning data items into related subsets; the marketing applications discussed above represent some examples. Another common application is the division of documents, such as World Wide Web pages, into genres.
Clustering can be used to group all the shopping items available on the web into a set of unique products. For example, all the items on eBay can be grouped into unique products (eBay doesn't have the concept of a SKU).
4.5 Comparisons between Data Clustering Techniques
A measure can be used to compare how well different data clustering algorithms perform on a set of data. Many such measures are derived from the matching matrix (also known as a confusion matrix).
In our case, we need to make clusters on the basis of two measures, i.e., the frequency of a keyword and the frequency of a web URL. If we put the frequency of keywords on the X-axis and the frequency of URLs on the Y-axis, then the K-means algorithm is the most suitable for implementation. Why? Because QT (quality threshold) clustering requires more computing power than K-means and Fuzzy c-means, and it always produces the same clustering results on each execution, which limits its usefulness for our purpose. As far as Fuzzy c-means is concerned, it is almost similar to K-means, so we chose K-means because of our prior knowledge of it.
Chapter 5
[Background & Technical Issues – Surveys]
5 Survey
A software survey was conducted before the development of the proposed system, ‘Search Engine with Intelligent Web Spider’. Many crawling software products were in the domain of this survey; some of them are discussed below.
5.1 Trellian SiteSpider 1.00.004
Trellian SiteSpider combines search, web crawling, and site mapping functions. SiteSpider can be used simply as a web browser; it works cooperatively with Internet Explorer to allow importing and utilization of our Favorites directory. Trellian SiteSpider can help to extract valuable data from any website [8].
5.2 Icegiant Data Extractor
Icegiant Data Extractor's multithreaded crawling engine allows the user to quickly download MP3 URLs from a website. MP3 Extractor ties directly into Microsoft's Internet Explorer as a toolbar for easy access while browsing. After the download is finished, the acquired URLs are displayed on a page for easy viewing [9].
5.3 ETagFix by ISAPILabs
ETagFix fixes the ETag header generated by IIS so that the etag does not change every time our computer reboots. A changing etag can result in web crawlers and user clients wanting to re-cache our web page when they don't need to [10].
5.4 StreamBull by J.P.C. Development Services
StreamBull downloads images from the web, taking the necessary internet addresses from a database residing on an internet server. The user selects his favorite categories (32 x 2 in all) and gets images accordingly. All images have been pre-screened to avoid duplicates or illegal content.
5.5 Inspyder Site Map Creator
Inspyder Site Map Creator is a site-mapping tool powerful enough to handle large sites. It can help us generate an accurate site map of our website.
5.6 Conclusion
All of the software tools mentioned above are for public usage. Anyone can buy and use them, but their usage is limited to MP3 extraction, site map creation, image searching, etc. If a person wants to scan a specific web page to get to know the keywords on that page, then these tools do not serve the purpose.
Our web spider, in contrast, is a tool that provides the functionality of scanning a web page to extract the keywords on it. These keywords are not only helpful for defining the type of content on the webpage but also for categorizing the webpage on the basis of its keywords.
Chapter 6
[Development Phases – Analysis & Design]
6 Analysis & Design
This chapter describes which artifacts are produced in the analysis phase and how the different system requirements were analyzed. The purpose of this phase is to collect and analyze the system requirements for this tool.
6.1 Methods/Techniques
To start the development of the system, we first formed a vision to analyze the situation clearly, and identified the problem, the stakeholders, and the users. After this analysis we were able to define the vision of the system, which we documented in APPENDIX-A.
We then identified and modeled all the requirements in document form, as an SRS (Software Requirements Specification); see APPENDIX-B.
We were then able to model the use cases, identifying the major use cases and the objects/actors acting on them. We not only identified the use cases and actors, but also defined their roles in achieving the goals of the system.
6.2 Use Case Model
Use Case: Filter Words
Actor: System (i.e., Web Spider)
Type: Primary
Description: This use case begins when the textual contents of the URL entered by the administrator are downloaded. All the predefined common words are dropped from the list of contents, and only the keywords remain for stemming.
Use Case: Stem Keywords
Actor: System (i.e., Web Spider)
Type: Primary
Description: This use case begins after the extraction of keywords. These keywords are converted to their root word (e.g., 'management' or 'managing' to 'manag').
Use Case: Cluster Words
Actor: System (i.e., Web Spider)
Type: Primary
Description: This use case begins when the parent use case provides a list of stemmed words and their URLs. These words and URLs are clustered on the basis of their frequencies.
Use Case: Store Record
Actor: System (i.e., Web Spider)
Type: Primary
Description: This use case begins when the parent use case provides a list of clustered pairs of words and URLs. These pairs are stored into the record by making a connection to the database. If a pair is already stored in the record, only the frequency of that pair is updated.
Use Case: Stem User Input
Actor: System (i.e., Web Portal)
Type: Primary
Description: This use case begins after taking the input text from the user. The keywords are extracted from the user's text, and these keywords are converted to their root word (e.g., 'management' or 'managing' to 'manag').
Use Case: Display Results
Actor: System (i.e., Web Portal)
Type: Primary
Description: This use case begins after searching the database for the URLs of the keywords. These URLs are displayed on the user's screen in descending order of frequency.
6.3 DFD
In the analysis phase we came to know that our system should have two levels of data flow diagrams.
6.4 ERD
Chapter 7
[Development Phases – Implementation]
7 Implementation
This chapter introduces the implementation phase of Search Engine with Intelligent Web Spider: the tools chosen for the development of this software system, the operating system used for the development environment, and the classes, their functionality, and their attribute variables, as shown in the class diagrams.
7.1 Techniques/Tools
The following tools were used for development:
I. Borland JBuilder 2005 — to write, execute, and test the Java source code.
II. WAMP Server 1.4.1 — to install the web server and to install and configure the MySQL databases.
Supporting components: JDK 1.4.0, MySQL Connector with JDBC 3.1.14, and Java Runtime Environment 1.4.
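As a hedged illustration of how the spider module can talk to MySQL through Connector/J: the driver class name is the real Connector/J 3.x class, but the database name (spider), table name (keywords), and credentials below are assumptions for the sketch, not taken from the project.

import java.sql.*;

// Store a stemmed keyword/URL pair into MySQL via JDBC.
public class KeywordStore {
    public static void save(String keyword, String url) throws Exception {
        Class.forName("com.mysql.jdbc.Driver");       // Connector/J 3.x driver class
        Connection con = DriverManager.getConnection(
                "jdbc:mysql://localhost/spider", "user", "password");
        PreparedStatement ps = con.prepareStatement(
                "INSERT INTO keywords (word, url) VALUES (?, ?)");
        ps.setString(1, keyword);
        ps.setString(2, url);
        ps.executeUpdate();                           // one row per keyword/URL pair
        ps.close();
        con.close();
    }
}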
7.2 Class Diagram of Web Spider
7.3 Class Diagram of Web Portal
The class diagram of the web portal shows two classes, db (the database wrapper) and PorterStemmer:
class db:
+ debug_called, + vardump_called, + show_errors = true
+ main(), + db(), + select(), + escape(), + print_error(), + show_errors(), + hide_errors(), + flush(), + query(), + get_var(), + get_row(), + get_col(), + get_results(), + get_col_info(), + vardump(), + dumpvar(), + debug()
class PorterStemmer:
- regex_consonant = '(?:[bcdfghjklm...', - regex_vowel = '(?:[aeiou]|(?<...
+ Stem(), + StemIt()
- step1ab(), - step1c(), - step2(), - step3(), - step4(), - step5(), - replace(), - m(), - doubleConsonant(), - cvc()
Chapter 8
[Results & Conclusion]
[Table-8.1] A sample of the results produced by the Spider: stemmed keywords with their URL, frequency, and cluster (e.g., 'card' … 39, 'touch' … 20, 'christma' …).
8.1 Conclusion
A small list of results produced by the Spider is shown in Table-8.1. These results were achieved by giving [Link] as the starting URL, and all the keywords in this domain were fetched out by the developed Spider. Not only were the stemmed keywords fetched, but the keywords were also categorized, on the basis of their frequency of occurrence, into different clusters. These keywords are now available to normal users in a database, so that users can get fast and accurate search results.
By observing the results, we can say that we have achieved our objectives to:
I. Make fast and automated scanning of web pages, and search by indexing. (The Java module is for scanning, and the PHP module is for search purposes.)
II. Maintain the log efficiently (by using the relational database).
III. Only authorized persons can operate it (protected by User ID and Password).
To achieve these objectives, we learnt many new tricks and technologies under the supervision of our supervisor. We are thankful to our teachers, especially our supervisor, who motivated us to complete this task.
APPENDIX – A
[Vision Document]
A.1 Purpose of Vision Document
This document presents the vision of the product, Search Engine with Intelligent Web Crawler. It focuses on the capabilities needed by the stakeholders and the target users, and why these needs exist. The details of how the specified system fulfills these needs are mentioned in the use-case and supplementary specifications.
A.2 Product Overview
The product will scan the textual contents of web pages and reduce the extracted keywords to a stemmed format. It saves this data into a manageable hierarchy by applying clustering techniques, and this saved data will be accessible to users, so that users can see a list of related web pages.
A.3 User Descriptions
Users are familiar with the use of computers. There are two types of users: the Administrator and the Search Performer. The key responsibility of the Administrator is to provide the starting web portal to the crawler and to maintain a record using this software; the responsibility of the Search Performer is to search the maintained record through the web portal. Among the requirements, all the keywords should be stored into their respective category/cluster.
A.4 Problem Statement
The problem of: scanning the textual contents of web pages and keeping a record of them for future use.
The impact of which is: time-consuming and incorrect log entries, and a non-logical hierarchy of the information retrieved.
A successful solution would: make fast and automated scanning, and accurate search by indexing; maintain the log efficiently; ensure only authorized persons can operate it.
APPENDIX – B
[Software Requirements Specification]
Appendix-B [SRS]
B.1 Purpose
This document specifies the software requirements of the project titled A Search Engine with Intelligent Web Spider (version 1). A complete description of the product is discussed in this document, and it covers all the areas of the product to be made.
B.3 Intended Audience and Reading Suggestions
B.3.1 System Analysts
The system analyst creates and maintains the vision, the use-case model overview, and the supplementary specifications, which serve as input to the SRS and are the communication medium between the customer and the development team.
B.3.2 Use-Case Specifiers
The use-case specifier creates and maintains the individual use cases of the use-case model and details their specifications.
B.3.3 Designers
Designers use the SRS Package as a reference when defining responsibilities, operations, and attributes of the classes, and when adjusting classes to the implementation environment.
B.3.4 Implementers
Implementers refer to the SRS Package for input when implementing classes, and the functionality
of those classes.
B.3.5 Project Managers
The Project Manager consults the document to keep an eye on the project; by doing so it is easier for him to manage it, and he also refers to the SRS Package for input when planning iterations.
B.3.6 Marketing Staff
This document is also beneficial for the Marketing Staff, because by consulting it they will be able to understand the features of the product, and thus make good publicity for it.
B.3.7 Testers
Testers refer to this document to identify the features to be tested and their expected behavior.
B.3.8 Users
By reading this document the user will get to know the features and importance of the product, and if he/she wants to add something more, he/she can communicate it to the development team.
B.4 Project Scope
The responsibility of the first module will be to move like a spider from one site to another, and to find the keywords of each web page. These keywords will be found by separating them from a list of common words like a, an, the, that, is, are, etc. The keywords will then be stemmed using the Stemming Algorithm; that means 'Management' will be converted into 'Manage', 'Meetings' into 'Meet', and 'Cats' into 'Cat', etc. These stemmed words will be stored in the database. While moving from site to site, this software will consider the privileges set for crawlers on the current site: it will examine the robots.txt file (a file usually placed on restricted domains that describes the privileges to use the content of that site). The software will also handle cycles while moving from site to site, and will count the number of times a certain web site is cited, i.e., it will also do statistical analysis. On the basis of these analyzed statistics, the system will arrange the found keywords into manageable clusters/categories.
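A minimal sketch of such a robots.txt check follows. A real crawler needs full per-agent rule matching; this sketch only handles Disallow rules in the "User-agent: *" block, and the class name is ours.

import java.io.*;
import java.net.URL;

public class RobotsCheck {
    // Returns false if a Disallow rule for all agents ("*") matches the path.
    public static boolean isAllowed(String host, String path) {
        try {
            BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL("http://" + host + "/robots.txt").openStream()));
            boolean appliesToUs = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                String lower = line.toLowerCase();
                if (lower.startsWith("user-agent:"))
                    appliesToUs = line.substring(11).trim().equals("*");
                else if (appliesToUs && lower.startsWith("disallow:")) {
                    String rule = line.substring(9).trim();
                    if (rule.length() > 0 && path.startsWith(rule)) return false;
                }
            }
            in.close();
        } catch (IOException e) {
            return true;   // no robots.txt found: crawling is not restricted
        }
        return true;
    }
}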
The second module will be web based, and its responsibility will be to search the database maintained by the first module. It will search by taking input from the user.
Before searching the database, this input will be stemmed using the same Stemming Algorithm, so that it matches the stemmed keywords stored by the first module.
B.5 Overall Description
This software product is a member of a software family that is responsible for populating databases for search engines, using fast and intelligent techniques. The product also includes a web-based search module.
1. The first module of the product will search the words in a web page.
2. It will filter out common words using the StopList.
3. It will stem the remaining keywords.
4. It will make clusters of keywords on the basis of their frequency in a web page.
5. It will store these keywords with the link of the web page in a database.
6. The second module of the product will take an input from a web user (i.e., based on one or more keywords), and will search these keywords in the database maintained by the first module.
7. The search operation will be performed on the basis of the clusters of the keywords, and the result will be a list of URLs displayed in descending order of frequency.
This program needs the under-mentioned tools for its execution: Java Runtime Environment 1.4, WAMP Server 1.4.1, and MySQL. And these tools are required for its development and administration: Borland JBuilder 2005, JDK 1.4.0, and MySQL Connector with JDBC 3.1.14.
The software requires a NIC or modem to connect to and work on a network; all other hardware requirements are those of a standard PC, with a minimum of 128 MB of RAM.
Performance also depends on the processing speed of the hardware and the storage capacity of the database, so the developers have to choose a server that provides maximum capacity for the database.
B.5.4 Tools
This software will be made using a Java 2 compiler (i.e., Borland JBuilder 10), which is the latest tool for the compilation of J2SE applications, so limitations are reduced by using version 10.0.
This program has to crawl from site to site, and some websites don't authorize crawlers to scan their pages. The system should respect these privileges and should not scan such restricted pages.
The development team provides a training session to the Administrators, and a user manual will also be embedded with the software module. This help manual contains all the information about the basic functionality, answers to "How do I perform this function?", and some FAQs.
B.6 System Features
The software module is responsible for visiting/crawling web pages from a starting web portal (i.e., one provided by the user). It will not only crawl from one web page to another but also fetch the keywords of each page. This is the high-priority, primary feature of the system, because without crawling the raw data can't be obtained.
1. The Administrator will enter the starting web portal URL, and the depth for crawling.
2. The System will download the textual contents of the current web page.
3. The System will make a list of URLs found at the current web page, for crawling purposes.
4. The System will find the logical words in this page, and send them to the Stemmer.
5. The System will store all stemmed words into their respective clusters.
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base
or root form — generally a written word form. The stem need not be identical to the morphological
root of the word; it is usually sufficient that related words map to the same stem, even if this stem is not in itself a valid root.
Stemming the keywords is useful in search engines for query expansion or indexing and other natural language processing problems.
B.7 Nonfunctional Requirements
B.7.1 Performance Requirements
The web crawling module of the software system will require an HTTP internet connection, and the maximum transfer rate of this connection will affect the performance of this module; the system should therefore be deployed on a fast connection.
B.7.2 Security Requirements
The web crawling module of the software system should only be accessible to authorized users, and no unauthorized individual should have access to the database maintained by the web crawler. To ensure security, the system implements a user recognition system that operates with a User ID and Password.
B.7.3 Software Quality Attributes
B.7.3.1 Availability
The software will be easily available from our website, and users will be able to purchase it by credit card.
B.7.3.2 Maintainability
Users will be able to download newer versions from our website, so that they can keep the product up to date.
B.7.3.3 Less Expensive
APPENDIX-C
[Glossary]
Appendix-C [Glossary]
1 SRS: Software Requirement Specification
APPENDIX-D
[References]
Appendix-D [References]
[1]. Gautam Pant, Padmini Srinivasan, and Filippo Menczer; "Crawling the Web", Chapter 1: Introduction. [Link]
[2]. Julie Beth Lovins (1968). "Development of a stemming algorithm". Mechanical Translation and Computational Linguistics 11: 22–31.
[5]. Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. ACM Press/Addison Wesley.
[6]. Basak S.C., Magnuson V.R., Niemi C.J., Regal R.R. (1988). "Determining Structural Similarity of Chemicals Using Graph Theoretic Indices". Discr. Appl. Math., 19: 17–44.
[7]. E. B. Fowlkes & C. L. Mallows (September 1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78 (383): 553–584.
[9]. [Link] [Last visited 28-01-2010]
IEEE Recommended Practice for Software Requirements Specifications, Software Engineering Standards Committee of the IEEE Computer Society, New York (1993).
Pearson Education.
D.1 Some Other Resources
[Link]
"Data Clustering", [Link]
[Link]
[Link]
Snapshots:
[Snapshot: Web Portal]
[Snapshot: Web Spider]
Project Schedule:
Phase | Duration
Specifying requirements and analyzing them | 5 weeks
Making the design of the product | 3 weeks
Inspection and testing of the design | 1 week
Implementation of the design | 3 weeks
Testing of the product by a user who is not a member of the development team | 1 week
Modification of faults found in testing | 2 weeks
Making the help manual and PowerPoint presentations for the product | 2 weeks