
A project report on

WEB CRAWLER USING ML/DL


submitted in partial fulfillment of the requirements for the award of the degree of

Bachelor of Technology (IT)


Under the supervision of

Er. Ajay K. Kaushik


Assistant Professor

Submitted by
KHUSHAL (06314803120)
KRISHNA KUMAR (06514803120)

DEPARTMENT OF INFORMATION TECHNOLOGY
MAHARAJA AGRASEN INSTITUTE OF TECHNOLOGY
SECTOR-22, ROHINI, DELHI – 110086
June 2024
ACKNOWLEDGEMENT
I extend my sincere gratitude to Er. Ajay K. Kaushik for his invaluable guidance and support
throughout the development of the project on "WEB CRAWLER USING ML/DL". His expertise
and insights have been instrumental in shaping this endeavor.

I would also like to express my thanks to the faculty of the Department of Information Technology
for providing a conducive environment for learning and research. Their encouragement and
constructive feedback have significantly contributed to the successful completion of this project.
Special thanks to my family and friends for their unwavering support and encouragement during
this academic pursuit.
Finally, I would like to acknowledge the assistance provided by various resources and references
that enriched the content of this project.

KHUSHAL
Enrollment No: 06314803120

KRISHNA KUMAR
Enrollment No: 06514803120
CANDIDATE’S DECLARATION

I hereby declare that the work presented in this project report titled, “WEB CRAWLER
USING ML/DL” submitted by me in partial fulfillment of the requirement for the award of
the degree of Bachelor of Technology ([Link].), submitted in the Department of
Information Technology, Maharaja Agrasen Institute of Technology, is an authentic record
of my project work carried out under the guidance of Er. Ajay K. Kaushik, Assistant Professor, Department of Information Technology.

The matter presented in this project report has not been submitted either in part or full to any
university or Institute for award of any degree.

Date : 08-05-2024 Khushal


Roll No.:
06314803120

Krishna Kumar
Place: Delhi Roll No.:
06514803120
SUPERVISOR’S CERTIFICATE

It is to certify that the project entitled “WEB CRAWLER USING ML/DL”, which is being
submitted by Mr. Krishna Kumar and Mr. Khushal to the Maharaja Agrasen Institute of
Technology, Rohini, in partial fulfillment of the requirement for the award of the degree of
Bachelor of Technology ([Link].), is a record of bonafide project work carried out by them
under my guidance and supervision.

Er. Ajay K. Kaushik Dr. Pooja Gupta


Assistant Professor H.O.D.

Department of Information Technology Department of Information Technology



Table of contents
Chapter 1 .......................................................................................................................................... 3
1 Introduction ............................................................................................................................... 3
1.1 Architecture ........................................................................................................................ 3
1.2 How a Spider interact with Web Server ............................................................................. 5
1.3 Possible Stakeholders ......................................................................................................... 6
1.4 Summary ............................................................................................................................ 6
Chapter 2 .......................................................................................................................................... 7
2 Parsing Web Page ...................................................................................................................... 7
2.1 HTML Tag Tree ................................................................................................................. 7
2.2 Words filtering ................................................................................................................... 9
Chapter 3 ........................................................................................................................................ 10
3 Stemming ................................................................................................................................ 10
3.1 Introduction ...................................................................................................................... 10
3.2 Algorithms ........................................................................................................................ 11
3.3 Language Challenges ....................................................................................................... 14
3.4 Error metrics .................................................................................................................... 15
3.5 Applications ..................................................................................................................... 15
Chapter 04 ...................................................................................................................................... 16
4 Clustering ................................................................................................................................ 16
4.1 Types of clustering ........................................................................................................... 16
4.3 Hierarchical clustering ..................................................................................................... 17
4.4 Partitional clustering ........................................................................................................ 18
1.2 Applications ..................................................................................................................... 20
1.3 Comparisons between data clustering techniques ............................................................ 21
Chapter 05 ...................................................................................................................................... 23
2 Survey ..................................................................................................................................... 23
2.1 Trellian SiteSpider 1.00.004............................................................................................. 23
2.2 Icegiant Data Extractor .................................................................................................... 23
2.3 ETagFix by ISAPILabs .................................................................................................... 24
2.4 StreamBull by J.P.C. Development Services ................................................................... 25
2.5 Inspyder Site Map Creator ............................................................................................... 25
2.6 Conclusion ....................................................................................................................... 26

Chapter 06 ...................................................................................................................................... 27
3 Analysis & Design .................................................................................................................. 27
3.1 Methods/Techniques ........................................................................................................ 27
3.2 Use Case Model ............................................................................................................... 28
3.3 DFD .................................................................................................................................. 33
6.4 ERD .................................................................................................................................. 34
Chapter 07 ...................................................................................................................................... 35
4 Implementation ....................................................................................................................... 35
4.1 Techniques/ tools ......................................................................................................... 35
4.2 Class Diagram of Web Spider .......................................................................................... 36
4.3 Class Diagram of Web Portal ........................................................................................... 37
Chapter 08 ....................................................................................................................................... 38
5 Results & Conclusion.............................................................................................................. 38
5.1 Conclusion ....................................................................................................................... 38
APPENDIX – A ............................................................................................................................. 40
1 Purpose of Vision Document .................................................................................................. 40
5.2 Product Overview ............................................................................................................ 40
5.3 User Descriptions ............................................................................................................. 40
5.4 Problem Statement ........................................................................................................... 41
5.5 Product Position Statement .............................................................................................. 41
APPENDIX – B ............................................................................................................................. 42
6 Appendix-B [SRS] .................................................................................................................. 42
6.1 Purpose ............................................................................................................................. 42
6.2 Document Conventions .................................................................................................... 42
6.3 Intended Audience and Reading Suggestions .................................................................. 42
6.4 Project Scope.................................................................................................................... 43
6.5 Overall Description .......................................................................................................... 45
6.6 System Features ............................................................................................................... 48
6.8 Nonfunctional Requirements ........................................................................................... 49
APPENDIX-C ................................................................................................................................ 50
7 Appendix-C [Glossary] ........................................................................................................... 50
APPENDIX-D ................................................................................................................................ 51
Appendix-D [References] .......................................................................................................... 51
7.1 Some other Resources ...................................................................................................... 51


Chapter 1
[Introduction]

1 Introduction
A web crawler (also known as a web spider or web robot) is a program or automated script that

browses the World Wide Web in a methodical, automated manner. Other less frequently used

names for web crawlers are ants, automatic indexers, bots, and worms.

The process is called web crawling or spidering. Many sites, in particular search engines, use

spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy

of all the visited pages for later processing by a search engine that will index the downloaded pages

to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website,

such as checking links or validating HTML code. Also, crawlers can be used to gather specific

types of information from Web pages, such as harvesting e-mail addresses (usually for spam).

A web crawler is one type of bot or software agent. In general, it starts with a list of URLs to visit,

called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and

adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are

recursively visited according to a set of policies.
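As a rough illustration of this seed/frontier loop, the sketch below keeps a queue of URLs to visit and a set of visited URLs. The fetchPage(), extractLinks() and process() helpers are hypothetical placeholders; a real crawler would add politeness delays, robots.txt handling and error handling.

import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawler {

    // Crawl frontier: URLs discovered but not yet visited.
    private final Queue<String> frontier = new ArrayDeque<>();
    // Remember visited URLs so the crawler does not loop forever.
    private final Set<String> visited = new HashSet<>();

    public void crawl(List<String> seeds, int maxPages) {
        frontier.addAll(seeds);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.poll();
            if (!visited.add(url)) {
                continue;                      // already processed
            }
            String html = fetchPage(url);      // download the page (hypothetical helper)
            for (String link : extractLinks(html, url)) {
                if (!visited.contains(link)) {
                    frontier.add(link);        // grow the crawl frontier
                }
            }
            process(url, html);                // e.g. extract text, keywords, metadata
        }
    }

    // Placeholder helpers; Chapter 2 discusses parsing and later chapters cover storage.
    private String fetchPage(String url) { return ""; }
    private List<String> extractLinks(String html, String baseUrl) { return List.of(); }
    private void process(String url, String html) { }
}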

1.1 Architecture
A crawler must not only have a good crawling strategy, but it should also have a highly optimized

architecture. As Shkapenyuk and Suel (2002) noted: "While it is fairly

easy to build a slow crawler that downloads a few pages per second for a short period of time,

building a high-performance system that can download hundreds of millions of pages over several

weeks presents a number of challenges in system design, I/O and network efficiency, and

robustness and manageability."

Generally, the architecture of common web crawlers can be described as in Fig. 1.1.1. It shows that
the crawlers have two major components, i.e. the Scheduler and the Multi-threaded Downloader.
The job of the downloader is to initiate and send HTTP requests to the server (or the Internet) to
download the content of web pages. The URLs and the text/metadata are then separated from the
downloaded content. This separated text/metadata is stored in a storage location (i.e. a database),
and the list of URLs is passed to the Scheduler to be downloaded recursively.

[Fig: 1.1.1] High-Level Architecture of a Standard Web Spider/Crawler

How Web Spiders Work

Web Spiders start by parsing a specified web page, noting any hypertext links on that page that

point to other web pages. They then parse those pages for new links, and so on, recursively. Web-

Spider software doesn't actually move around to different computers on the Internet, as viruses or

intelligent agents do. A Spider resides on a single machine. The Spider simply sends HTTP

requests for documents to other machines on the Internet, just as a web browser does when the

user clicks on links. All the Spider really does is to automate the process of following links.


[Fig: 1.2.1]Flow of a basic Sequential Crawler

Following links isn't greatly useful in itself, of course. The list of linked pages almost always serves

some subsequent purpose. The most common use is to build an index for a web search engine, but

Spiders are also used for many other purposes. For example, Muscle Fish uses a Spider to search the Web for

audio files. It turns out that searching for audio files is not very different from searching for any

other kind of file[1].

1.2 How a Spider Interacts with a Web Server


Web crawlers typically identify themselves to a web server by using the User-agent field of an

HTTP request. Web site administrators typically examine their web server’s log and use the user

agent field to determine which crawlers have visited the web server and how often. The user agent


field may include a URL where the Web site administrator may find out more information about

the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying

information in the user agent field, or they may mask their identity as a browser or other well-

known crawler. It is important for web crawlers to identify themselves so Web site administrators

can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler

trap or they may be overloading a web server with requests, and the owner needs to stop the

crawler. Identification is also useful for administrators that are interested in knowing when they

may expect their Web pages to be indexed by a particular search engine.
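As one possible illustration of such self-identification, the sketch below downloads a page with java.net.HttpURLConnection and sets the User-agent field; the agent string and contact URL in it are made-up examples, not the project's actual values.

import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PoliteFetcher {

    public static String fetch(String pageUrl) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(pageUrl).openConnection();
        // Identify the crawler and give administrators a way to contact its owner.
        conn.setRequestProperty("User-Agent",
                "MySpider/1.0 (+http://example.org/spider-info)");
        conn.setConnectTimeout(5000);
        conn.setReadTimeout(10000);
        try (InputStream in = conn.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } finally {
            conn.disconnect();
        }
    }
}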

1.3 Possible Stakeholders


The users of Web Spider with Search Engine are diverse. The major categories of users and

examples of their applications are given as follows:

1.3.1 General Public

Search for textual information related to their needs.

Guess the content type of a specific web page.

1.3.2 E-Business

Provides a Search Option in the web portal for users.

Maintains a hierarchy of their web portal.


1.3.3 Academia

Education/Learning (by searching different Articles, magazines etc.)

Research

1.4 Summary
Web Spider is a member of a software family that is responsible for crawling web pages and
populating databases for search engines, using fast and intelligent techniques. This product also
includes a web-based module that will use the database maintained by the spider.


Chapter 2
[Background & Technical Issues – Parsing Web Page]

2 Parsing Web Page


The crawler behaves like a client to web servers (as mentioned in Chapter 1) and sends an HTTP
request for any specified web URL. The received web page contains the data to be analyzed, and
that data is organized by HTML tags in the resultant page. So our crawling application must be
able to identify HTML tags and to fetch the textual information written inside those tags, since
the fetched textual information is used to keep a record of keywords and the associated URLs.

2.1 HTML Tag Tree


Crawlers assess the value of a URL or a content word by examining the HTML tag context in

which it resides. For this, a crawler may need to utilize the tag tree or DOM structure of the HTML.

Figure 2.1.1 shows a tag tree corresponding to an HTML source. The <html> tag forms the root of

the tree and various tags and texts form nodes of the tree. Unfortunately, many Web pages contain

badly written HTML. For example, a start tag may not have an end tag (it may not be required by

the HTML specification), or the tags may not be properly nested. In many cases, the <html> tag

or the <body> tag is altogether missing from the HTML page. Thus structure-based criteria often

require the prior step of converting a dirty HTML document into a well-formed one, a process that

is called tidying an HTML page. This includes both insertion of missing tags and the reordering of

tags in the page. Tidying an HTML page is necessary for mapping the content of a page onto a tree

structure with integrity, where each node has a single parent. Hence, it is an essential precursor to

analyzing an HTML page as a tag tree. Note that analyzing the DOM structure is only necessary

if the topical crawler intends to use the HTML document structure in a nontrivial manner. For

example, if the crawler only needs the links within a page, and the text or portions of the text in


the page, one can use simpler HTML parsers. Such parsers are also readily available in many

languages.

An Example:
<html>
  <head>
    <title>Projects</title>
  </head>
  <body>
    <h4>Current Projects</h4>
    <ul>
      <li> <a href="[Link]">LAMP</a> Linkage analysis with MP.</li>
      <li> <a href="[Link]">NI</a> Network Infrastructure.</li>
      <li> <a href="[Link]">ADNA</a> A DNA Algorithm.</li>
      <li> <a href="[Link]">DLT</a> A distributed, first-order logic theorem.</li>
    </ul>
  </body>
</html>

[Fig. 2.1.1] Tag tree of the HTML page shown above.
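The report does not name a specific HTML parser, but as one possible illustration the jsoup library builds exactly this kind of tag tree. The sketch below parses a shortened version of the example page, lists the hyperlinks (candidates for the crawl frontier) and extracts the visible text for later filtering and stemming; the choice of jsoup is an assumption for illustration only.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class TagTreeDemo {
    public static void main(String[] args) {
        String html = "<html><head><title>Projects</title></head><body>"
                + "<h4>Current Projects</h4><ul>"
                + "<li><a href=\"lamp.html\">LAMP</a> Linkage analysis with MP.</li>"
                + "<li><a href=\"ni.html\">NI</a> Network Infrastructure.</li>"
                + "</ul></body></html>";

        Document doc = Jsoup.parse(html);              // builds the tag tree, tidying bad HTML

        // Hyperlinks: candidates for the crawl frontier.
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("href") + " -> " + link.text());
        }

        // Visible text: input for word filtering and stemming (Chapters 2 and 3).
        System.out.println(doc.body().text());
    }
}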


2.2 Words filtering


The textual data fetched from the web page contains many keywords as well as useless words, and
our application has to filter the keywords by dropping the useless words (like a, is, the, also,
although, most, mostly, etc.).

In our case the useless words can be defined as "all words that are not derived from a noun".
So we have built a list of such words, called the StopList. This list is helpful for dropping
common/useless words.

Our crawler parses every word in the textual information and searches for it in the StopList;
if the word exists in the StopList it is dropped, otherwise it is saved in the bucket of keywords.

[Flowchart: for each word in the list of fetched words, check whether it is a member of the StopList; if yes, neglect the word, otherwise save it in the keywords bucket; repeat until the last word has been processed.]

[Fig. 2.2.1]Flow of Filtering Words
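A minimal sketch of this filtering step is given below; the StopList entries shown are only a tiny sample of the real list.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class WordFilter {

    // Tiny sample StopList; the real list contains all common/useless words.
    private static final Set<String> STOP_LIST = new HashSet<>(Arrays.asList(
            "a", "an", "is", "the", "also", "although", "most", "mostly"));

    /** Keeps only the words that are not in the StopList. */
    public static List<String> filterKeywords(List<String> fetchedWords) {
        List<String> keywords = new ArrayList<>();
        for (String word : fetchedWords) {
            if (!STOP_LIST.contains(word.toLowerCase())) {
                keywords.add(word);        // save in the keywords bucket
            }                              // otherwise the word is neglected
        }
        return keywords;
    }

    public static void main(String[] args) {
        System.out.println(filterKeywords(
                Arrays.asList("The", "crawler", "is", "fetching", "keywords")));
        // prints [crawler, fetching, keywords]
    }
}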


Chapter 3
[Background & Technical Issues - Stemming]

3 Stemming
3.1 Introduction
Stemming is the process for reducing inflected (or sometimes derived) words to their stem, base

or root form — generally a written word form. The stem need not be identical to the morphological

root of the word; it is usually sufficient that related words map to the same stem, even if this stem

is not in itself a valid root. The algorithm has been a long-standing problem in computer science;

the first paper on the subject was published in 1968. The process of stemming, often called

conflation, is useful in search engines for query expansion or indexing and other natural language

processing problems.

Stemming programs are commonly referred to as stemming algorithms or stemmers.

Examples

A stemmer for English, for example, should identify the string "cats" (and possibly

"catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming",

"stemmed" as based on "stem". A stemming algorithm reduces the words "fishing",

"fished", "fish", and "fisher" to the root word, "fish".

History

The first ever published stemmer was written by Julie Beth Lovins in 1968 [3.2]. This paper was

remarkable for its early date and had great influence on later work in this area.

A later stemmer was written by Martin Porter and was published in the July 1980 issue of the

journal Program. This stemmer was very widely used and became the de facto standard algorithm

used for English stemming. Dr. Porter received the Tony Kent Strix award in 2000 for his work on

stemming and information retrieval. Many implementations of this algorithm were written and


freely distributed; however, many of these implementations contained subtle flaws. As a result,

these stemmers did not match their potential. To eliminate this source of error, Martin Porter

released an official free-software implementation of the algorithm around the year 2000. He

extended this work over the next few years by building Snowball, a framework for writing

stemming algorithms, and implemented an improved English stemmer

together with stemmers for several other languages.[3.1]

3.2 Algorithms
There are several types of stemming algorithms, which differ with respect to performance and

accuracy, some of these are listed here:

I. Brute Force Algorithms


II. Porter’s Stemmer
III. Suffix Stripping Algorithms
IV. Lemmatisation Algorithms
V. Hybrid Approaches
VI. Affix Stemmers
VII. Matching Algorithms

3.2.1 Brute Force Algorithm

The term brute force is borrowed from a concept in artificial intelligence research and problem

solving in mathematics denoted as brute force search. Brute force stemmers employ a lookup table

which contains relations between root forms and inflected forms. To stem a word, the table is

queried to find a matching inflection. If a matching inflection is found, the associated root form is

returned.

Brute force algorithms are initially very difficult to design given the immense amount of relations

that must be initially stored to produce an acceptable level of accuracy (the number can span well

into the millions). However, brute force algorithms are easy to improve in that decreasing the

stemming error is only a matter of adding more relations to the table. Someone with only minor
experience in linguistics is capable of improving the algorithm, unlike the suffix stripping
approaches, which require a solid background in linguistics. [3.1]

3.2.2 Porter’s Stemmer

The Porter stemming algorithm (or ‘Porter stemmer’) is a process for removing the commoner

morphological and inflexional endings from words in English. Its main use is as part of a term

normalization process that is usually done when setting up Information Retrieval systems. [3.4]

3.2.3 Suffix Stripping Algorithm

Suffix stripping algorithms do not rely on a lookup table that consists of inflected forms and root

form relations. Instead, a typically smaller list of "rules" is stored which provides a path for the

algorithm, given an input word form, to find its root form. Some examples of the rules include:

I. if the word ends in 'ed', remove the 'ed'


II. if the word ends in 'ing', remove the 'ing'
III. if the word ends in 'ly', remove the 'ly'
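A toy version of these three rules is sketched below; a real suffix stripper such as Porter's applies many more rules and conditions, so this is only an illustration of the idea.

public class NaiveSuffixStripper {

    /** Applies the three example rules above, in order. */
    public static String strip(String word) {
        if (word.endsWith("ed")) {
            return word.substring(0, word.length() - 2);
        }
        if (word.endsWith("ing")) {
            return word.substring(0, word.length() - 3);
        }
        if (word.endsWith("ly")) {
            return word.substring(0, word.length() - 2);
        }
        return word;   // no rule matched
    }

    public static void main(String[] args) {
        System.out.println(strip("crawled"));   // crawl
        System.out.println(strip("crawling"));  // crawl
        System.out.println(strip("quickly"));   // quick
        System.out.println(strip("ran"));       // ran (the 'ran'/'run' exception is not handled)
    }
}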
Suffix stripping approaches enjoy the benefit of being much simpler to maintain than brute force

algorithms, assuming the maintainer is sufficiently knowledgeable in the challenges of linguistics

and morphology and encoding suffix stripping rules. Suffix stripping algorithms are sometimes

regarded as crude given the poor performance when dealing with exceptional relations (like 'ran'

and 'run'). The solutions produced by suffix stripping algorithms are limited to those lexical

categories which have well-known suffixes, with few exceptions. This, however, is a problem, as

not all parts of speech have such a well formulated set of rules. Lemmatisation attempts to improve

upon this

challenge. [3.1]


3.2.4 Lemmatisation Algorithm

A more complex approach to the problem of determining a stem of a word is lemmatisation. This

process involves first determining the part of speech of a word, and applying different

normalization rules for each part of speech. The part of speech is first detected prior to attempting

to find the root since for some languages, the stemming rules change depending on a word's part

of speech.

This approach is highly conditional upon obtaining the correct lexical category (part of speech).

While there is overlap between the normalization rules for certain categories, identifying the wrong

category or being unable to produce the right category limits the added benefit of this approach

over suffix stripping algorithms. The basic idea is that, if we are able to grasp more information

about the word to be stemmed, then we are able to more accurately apply normalization rules

(which are, more or less, suffix stripping

rules). [3.1]

3.2.5 Hybrid Approaches

Hybrid approaches use two or more of the approaches described above in unison. A simple example

is a suffix tree algorithm which first consults a lookup table using brute force. However, instead

of trying to store the entire set of relations between words in a given language, the lookup table is

kept small and is only used to store a minute amount of "frequent exceptions" like "ran => run". If

the word is not in the exception list, apply suffix stripping or lemmatisation and output the result.
[3.1]

3.2.6 Affix Stemmer

In linguistics, the term affix refers to either a prefix or a suffix. In addition to dealing with suffixes,

several approaches also attempt to remove common prefixes. For example, given the word


indefinitely, identify that the leading "in" is a prefix that can be removed. Many of the same

approaches mentioned earlier apply, but go by the name affix stripping. [3.1]

3.2.7 Matching Algorithm

In this algorithm we first have a set of documents containing candidate stems. These stems may
not be valid roots, because the common starting characters of related words are saved as the stem;
of course, the length of the stem of a word must lie within a specific range relative to the main
word. For example, between the two words "be" and "beside", "be" is not accepted as a stem, but
for "browse" and "browsing" the stem will be "brows".

3.3 Language Challenges


While much of the early academic work in this area was focused on the English language (with

significant use of the Porter Stemmer algorithm), many other languages have been investigated.

Hebrew and Arabic are still considered difficult research languages for stemming. English

stemmers are fairly trivial (with only occasional problems, such as "dries" being the third-person

singular present form of the verb "dry", "axes" being the plural of "axe" as well as "axis"); but

stemmers become harder to design as the morphology, orthography, and character encoding of the

target language becomes more complex. For example, an Italian stemmer is more complex than

an English one (because of more possible verb inflections), a Russian one is more complex (more

possible noun declensions), a Hebrew one is even more complex (due to non-catenative

morphology and a writing system without vowels), and so on.

3.3.1 Multilingual Stemming

Multilingual stemming applies morphological rules of two or more languages simultaneously

instead of rules for only a single language when interpreting a search query. Commercial systems

using multilingual stemming exist. However, our project does not support multilingual stemming,
as it supports only English.


3.4 Error metrics


There are two error measurements in stemming algorithms, overstemming and understemming.

Overstemming is an error where two separate inflected words are stemmed to the same root, but

should not have been - a false positive. Understemming is an error where two separate inflected

words should be stemmed to the same root, but are not - a false negative. Stemming algorithms

attempt to minimize each type of error, although reducing one type can lead to increasing the other.

3.5 Applications

3.5.1 Information Retrieval

Stemmers are common elements in query systems such as Web search engines, since a user who

runs a query on "daffodils" would probably also be interested in documents that contain the word

"daffodil" (without the s). The effectiveness of stemming for English query systems were soon

found to be rather limited, however, and this has led early Information retrieval researchers to

deem stemming irrelevant in general. An alternative approach, based on searching for n-grams

rather than stems, may be used instead. Also, recent research has shown greater benefits for

retrieval in other languages.

Usage in our Web Spider


This web spider applies stemming to the keywords found on web pages. The stemming algorithm
used is Porter's stemming algorithm, a well-established stemmer for English words that produces
accurate root forms. The stemmer is applied only to the keywords of a web page, and these
stemmed words are then saved into the database.
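The stemmer implementation itself is not reproduced here; the sketch below only illustrates where stemming sits in the spider's pipeline, with PorterStemmer.stemIt() standing in as a placeholder for the Porter implementation shown in the Chapter 7 class diagram.

import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class KeywordPipeline {

    /** Stems every keyword and counts how often each stem occurs on the page. */
    public static Map<String, Integer> stemAndCount(List<String> keywords) {
        Map<String, Integer> frequencies = new LinkedHashMap<>();
        for (String keyword : keywords) {
            String stem = PorterStemmer.stemIt(keyword.toLowerCase());
            frequencies.merge(stem, 1, Integer::sum);
        }
        return frequencies;   // later stored in the database together with the page URL
    }

    /** Placeholder for the project's Porter stemmer (see the Chapter 7 class diagram). */
    static class PorterStemmer {
        static String stemIt(String word) {
            // The real implementation applies Porter's steps 1-5, so that e.g.
            // "management" and "managing" would both become "manag".
            return word;
        }
    }

    public static void main(String[] args) {
        System.out.println(stemAndCount(
                Arrays.asList("management", "managing", "crawler")));
    }
}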


Chapter 04
[Background & Technical Issues - Clustering]

4 Clustering
Clustering is the classification of objects into different groups, or more precisely, the partitioning

of a data set into subsets (clusters), so that the data in each subset (ideally) share some common

trait - often proximity according to some defined distance measure. Data clustering is a common

technique for statistical data analysis, which is used in many fields, including machine learning,

data mining, pattern recognition, image analysis and bioinformatics.

[Fig: 4.2.1]Grouping of different points into 4 clusters

Besides the term data clustering (or just clustering), there are a number of terms with similar

meanings, including cluster analysis, automatic classification, numerical taxonomy, botryology

and typological analysis.

4.1 Types of clustering


Data clustering algorithms can be hierarchical or partitional. Hierarchical algorithms find

successive clusters using previously established clusters, whereas partitional algorithms determine

all clusters at once. Hierarchical algorithms can be agglomerative ("bottom-up") or divisive ("top-


down"). Agglomerative algorithms begin with each element as a separate cluster and merge them

into successively larger clusters. Divisive algorithms begin with the whole set and proceed to

divide it into successively smaller clusters.

Two-way clustering, co-clustering or biclustering are clustering methods where not only the

objects are clustered but also the features of the objects, i.e., if the data is represented in a data

matrix, the rows and columns are clustered simultaneously. Another important distinction is

whether the clustering uses symmetric or asymmetric distances. A property of Euclidean space is

that distances are symmetric (the distance from object A to B is the same as the distance from B to

A).

4.3 Hierarchical clustering

4.3.1 Creating clusters

Hierarchical clustering builds (agglomerative), or breaks up (divisive), a hierarchy of clusters. The

traditional representation of this hierarchy is a tree (called a dendrogram), with individual elements

at one end and a single cluster containing every element at the other. Agglomerative algorithms

begin at the leaves of the tree, whereas divisive algorithms begin at the root. (In the figure, the arrows

indicate an agglomerative clustering.)

Cutting the tree at a given height will give a clustering at a selected precision.

In the following example, cutting after the second row will yield clusters {a} {b c} {d e} {f}.

Cutting after the third row will yield clusters {a} {b c} {d e f}, which is a coarser clustering, with

a smaller number of larger clusters.


4.4 Partitional clustering


4.4.1 K-means and derivatives

[Link] K-means clustering

The K-means algorithm assigns each point to the cluster whose center (also called centroid) is

nearest. The center is the average of all the points in the cluster — that is, its coordinates are the

arithmetic mean for each dimension separately over all the points in the cluster...

Example: The data set has three dimensions and the cluster has two points:

X = (x1, x2, x3)    [[Link].1]

Y = (y1, y2, y3)    [[Link].2]

Then the centroid Z becomes:

Z = (z1, z2, z3)    [[Link].3]

where z1 = (x1 + y1)/2, z2 = (x2 + y2)/2 and z3 = (x3 + y3)/2.

The algorithm steps are: [3.2]

• Choose the number of clusters, k.

• Randomly generate k clusters and determine the cluster centers, or directly generate k random

points as cluster centers.

• Assign each point to the nearest cluster center.

• Recompute the new cluster centers.

• Repeat the two previous steps until some convergence criterion is met (usually that the assignment

hasn't changed).

The main advantages of this algorithm are its simplicity and speed which allows it to run on large

datasets. Its disadvantage is that it does not yield the same result with each run, since the resulting


clusters depend on the initial random assignments. It minimizes intra-cluster variance, but does

not ensure that the result has a global minimum of variance.
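A compact sketch of these steps for two-dimensional points (for example keyword frequency versus URL frequency, as used later in this chapter) is given below; the random initialization and the fixed iteration cap are simplifications.

import java.util.Arrays;
import java.util.Random;

public class KMeans {

    /** Runs k-means on 2-D points; returns the cluster index assigned to each point. */
    public static int[] cluster(double[][] points, int k, int maxIterations) {
        Random random = new Random();
        double[][] centers = new double[k][2];
        for (int c = 0; c < k; c++) {                       // pick k random points as initial centers
            centers[c] = points[random.nextInt(points.length)].clone();
        }
        int[] assignment = new int[points.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int i = 0; i < points.length; i++) {       // assign each point to the nearest center
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centers[c])
                            < squaredDistance(points[i], centers[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break;                                      // convergence: assignments are stable
            }
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {       // recompute each center as the mean
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) {
                    centers[c][0] = sums[c][0] / counts[c];
                    centers[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }

    private static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] points = {{39, 1}, {20, 2}, {18, 2}, {14, 3}, {12, 3}, {11, 3}};
        System.out.println(Arrays.toString(cluster(points, 3, 100)));
    }
}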

[Link] Fuzzy c-means clustering

In fuzzy clustering, each point has a degree of belonging to clusters, as in fuzzy logic, rather than
belonging completely to just one cluster. Thus, points on the edge of a cluster may be in the
cluster to a lesser degree than points in the center of the cluster. For each point x we have a
coefficient u_k(x) giving its degree of being in the k-th cluster. Usually, the sum of those
coefficients is defined to be 1:

Σ_k u_k(x) = 1.

With fuzzy c-means, the centroid of a cluster is the mean of all points, weighted by their degree
of belonging to the cluster:

center_k = ( Σ_x u_k(x)^m · x ) / ( Σ_x u_k(x)^m ).

The degree of belonging is related to the inverse of the distance to the cluster center; the
coefficients are then normalized and fuzzified with a real parameter m > 1 so that their sum is 1:

u_k(x) = 1 / Σ_j ( d(center_k, x) / d(center_j, x) )^(2/(m-1)).

For m equal to 2, this is equivalent to normalizing the coefficients linearly to make their sum 1.
When m is close to 1, the cluster center closest to the point is given much more weight than the
others, and the algorithm is similar to k-means.

The fuzzy c-means algorithm is very similar to the k-means algorithm:

I. Choose a number of clusters.

II. Assign randomly to each point coefficients for being in the clusters.

III. Repeat until the algorithm has converged (that is, the coefficients' change between two iterations is no more than ε, the given sensitivity threshold):

o Compute the centroid for each cluster, using the formula above.

o For each point, compute its coefficients of being in the clusters, using the formula above.

The algorithm minimizes intra-cluster variance as well, but has the same problems as k-means:
the minimum found is a local minimum, and the results depend on the initial choice of weights.
The expectation-maximization algorithm is a more statistically formalized method which includes
some of these ideas: partial membership in classes. It has better convergence properties and is in
general preferred to fuzzy c-means.

[Link] QT clustering algorithm

QT (quality threshold) clustering (Heyer et al., 1999) [4.2] is an alternative method of partitioning
data, invented for gene clustering. It requires more computing power than k-means, but does not
require specifying the number of clusters a priori, and always returns the same result when run
several times. The algorithm is: [4.2]

I. The user chooses a maximum diameter for clusters.

II. Build a candidate cluster for each point by including the closest point, the next

closest, and so on, until the diameter of the cluster surpasses the threshold.

III. Save the candidate cluster with the most points as the first true cluster, and remove

all points in the cluster from further consideration.

IV. Recurse with the reduced set of points.

The distance between a point and a group of points is computed using complete linkage, i.e. as the

maximum distance from the point to any member of the group (see the "Agglomerative

hierarchical clustering" section about distance between clusters).

1.2 Applications
Some of major common applications of clustering are:

1.2.1 Search Result Grouping

In the process of intelligent grouping of the files and websites, clustering may be used to create a

more relevant set of search results compared to normal search engines like Google. There are

currently a number of web-based clustering tools, which are discussed in Chapter 5 (i.e. the Survey).


1.2.2 Social Network Analysis

In the study of social networks, clustering may be used to recognize communities within large

groups of people.

1.2.3 Data mining

Many data mining applications involve partitioning data items into related subsets; the marketing

applications discussed above represent some examples. Another common application is the

division of documents, such as World Wide Web pages, into genres.

1.2.4 Grouping of Shopping Items

Clustering can be used to group all the shopping items available on the web into a set of unique

products. For example, all the items on eBay can be grouped into unique products. (eBay doesn't

have the concept of a SKU)

1.2.5 Mathematical Chemistry


To find structural similarity, etc., for example, 3000 chemical compounds were clustered in the

space of 90 topological indices . [4.1]

1.3 Comparisons between data clustering techniques


There have been several suggestions for a measure of similarity between two clusterings. Such a

measure can be used to compare how well different data clustering algorithms perform on a set of

data. Many of these measures are derived from the matching matrix (also known as confusion

matrix), e.g., the Rand measure and the Fowlkes-Mallows Bk measures[6].


In our case, we need to make clusters on the basis of two quantities, i.e. the frequency of a keyword
and the frequency of a web URL. If we place the frequency of keywords on the X-axis and the
frequency of URLs on the Y-axis, then the K-means algorithm is the most suitable for
implementation. Why?

Because QT (quality threshold) clustering requires more computing power than K-means and
Fuzzy c-means, and it always produces the same clustering results on each execution, which affects
the randomization of information; hence we rejected the QT algorithm. As far as Fuzzy c-means is
concerned, it is almost similar to K-means, so we chose K-means because of our prior knowledge of it.


Chapter 05
[Background & Technical Issues-Surveys]

2 Survey
A software survey was conducted before the development of the proposed system ‘Search Engine
with Intelligent Web Spider’. Many crawling software tools were within the domain of this survey.
Some of them are listed here with a brief description.

2.1 Trellian SiteSpider 1.00.004


Trellian SiteSpider is a powerful, yet simple to use program which is loaded with sophisticated

search, web crawling and site mapping functions. SiteSpider can be used simply as a web browser,
which works cooperatively with Internet Explorer to allow importing and use of the user's
Favorites directory. Trellian SiteSpider can help to extract valuable data from any website[8].

2.2 Icegiant Data Extractor


MP3 Extractor is a state-of-the-art internet MP3 download manager. Its easy-to-use design and
multithreaded crawling engine allow the user to quickly download MP3 URLs from a website. MP3
Extractor ties directly into Microsoft's Internet Explorer as a toolbar for easy access while
browsing. After the download is finished, the acquired URLs are displayed on a page for easy
viewing[9].

2.3 ETagFix by ISAPILabs


ETagFix is an ISAPI plugin for Internet Information Server (IIS) that changes the ETag produced
by IIS so that the ETag does not change every time the computer reboots. Without this fix, web
crawlers and user clients may want to re-cache a web page when they do not need to[10].


2.4 StreamBull by J.P.C. Development Services


StreamBull is a fast and easy-to-use multi-threaded adult picture and video downloader. It reads

the necessary internet addresses from a database residing on an internet server. The user selects his

favorite categories like teens, hardcore, asians… (32 x 2 in all) and gets images accordingly. All

images have been pre-screened to avoid multiples or illegal content, and to avoid the local

downloader having to spend time on filtering or searching.

2.5 Inspyder Site Map Creator


Inspyder Sitemap Creator is a web-crawling XML sitemap file generator. It's easy to use, yet

powerful enough to handle large sites. Inspyder Sitemap Creator can help us generate an accurate

sitemap and help us keep it up to date.


2.6 Conclusion
All of these software tools are intended for public usage. Anyone can buy and use them, but their
usage is limited to MP3 extraction, site map creation, image searching, etc. If a person wants to
scan a specific web page to find out the keywords on that page, then these tools might not be
suitable for him/her.

Our web spider is therefore a tool which provides the functionality of scanning a web page to
extract the keywords on it. These found keywords are not only helpful for defining the content
type of the web page but also for categorizing the web page on the basis of its keywords.


Chapter 06
[Development Phases – Analysis & Design]

3 Analysis & Design


In this chapter we introduce the analysis phase of Search Engine with Intelligent Web Spider:
which artifacts are produced in the analysis phase and how the different system requirements were
analyzed. The purpose of this phase is to collect and analyze the system requirements for this tool.

3.1 Methods/Techniques
To start the development of the system we formed a vision to analyze the situation clearly, and
identified the:

I. Possible users of the system

II. Key user needs

III. Problem statement

IV. Product position statement

V. Brief overview of the system

After this analysis we were able to define the vision of the system, and we documented the
vision in APPENDIX-A.

Then we identified and modeled all the requirements in document form, as an SRS (Software
Requirement Specification) in APPENDIX-B. This required us to think in a top-down approach
for each requirement and sub-requirement.

We were then able to model the use cases, identifying the major use cases and the objects/actors
acting on them. We not only identified use cases and actors, but also defined their roles to achieve
the purpose of our system.


3.2 Use Case Model


In the analysis phase we came to know that our system should be divided into two major
modules, i.e.
I. Web Spider, to maintain the database
II. Web Portal, to provide search facility from the database

3.2.1 Web Spider Use Case Diagram

Use Case Diagram of Web Spider – [Fig: [Link]]


3.2.2 Web Spider Use Case Description

Use Case: Scan Web Page


Actor Administrator
Type Primary
Description The use case begins when the administrator enters the URL of a web page.
The system then starts visiting the textual content of that web page
to extract keywords for stemming.

Use Case: Extract Key Words


Parent Use Case Scan Web Page
Actor System (i.e. Web Spider)
Type Primary
Description This use case begins when the textual contents of URL entered by
the administrator are downloaded. All the predefined common
words are dropped from the list of contents, and only the keywords
remains for stemming.

Use Case: Make Stemmed Words


Parent Use Case Scan Web Page
Actor System (i.e. Web Spider)
Type Primary
Description This use case begins after the extraction of keywords. And these
keywords are converted to their root word. (e.g. management or
managing to manag)

Use Case: Maintain Database


Actor Administrator
Type Primary
Description This use case begins when the administrator triggers an action to save the found
stemmed words from all the URLs. The frequency of occurrence of each word at each URL will
be analyzed to maintain the DB.


Use Case: Make Clusters


Parent Use Case Maintain Database
Actor System (i.e. Web Spider)
Type Primary
Description The use case begins when the parent use case provides a list of

stemmed words and their URLs. These words and URLs are

clustered on the basis of their frequency of occurrence

Use Case: Save In DB


Parent Use Case Maintain Database
Actor System (i.e. Web Spider)
Type Primary
Description The use case begins when the parent use case provides a list of clustered pairs of
words and URLs. These pairs are stored as records by making a connection to the database; if a
pair is already stored in the record, only its frequency is updated.
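A hedged sketch of this insert-or-update step using JDBC and MySQL is shown below; the table and column names (keyword_url, stem, url, frequency) and the connection details are assumptions for illustration, not the project's actual schema.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class KeywordStore {

    // Assumed schema: keyword_url(stem VARCHAR, url VARCHAR, frequency INT,
    //                              PRIMARY KEY (stem, url))
    private static final String UPSERT =
            "INSERT INTO keyword_url (stem, url, frequency) VALUES (?, ?, ?) "
          + "ON DUPLICATE KEY UPDATE frequency = frequency + VALUES(frequency)";

    public static void save(String stem, String url, int frequency) throws SQLException {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/spider", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(UPSERT)) {
            stmt.setString(1, stem);
            stmt.setString(2, url);
            stmt.setInt(3, frequency);
            stmt.executeUpdate();   // inserts the pair, or bumps the stored frequency
        }
    }
}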


3.2.3 Web Portal Use Case Diagram

[Fig: [Link]]Use Case Diagram of Web Portal


3.2.4 Web Portal Use Case Description

Use Case: Search


Actor User
Type Primary
Description The use case begins when a user enters a text to search in our
database. And it will only consider the important words in the text
entered by the user, to perform search.

Use Case: Make Stemmed Input


Parent Use Case Search
Actor System (i.e. Web Portal)
Type Primary
Description This use case begins after taking the input text form user. The
keywords are extracted from the text of user and these keywords are
converted to their root word. (e.g. management or managing to
manag)

Use Case: Display Results


Parent Use Case Search
Actor System (i.e. Web Portal)
Type Primary
Description This use case begins after the URLs for the keywords have been searched from the
database. These URLs will be displayed on the user's screen in descending order of frequency.
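For illustration, the underlying query could look like the sketch below; the portal module is actually written in PHP, but the query logic is shown here in Java/JDBC for consistency with the other sketches, and the table and column names are the same assumed ones as in the storage sketch above.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SearchResults {

    /** Returns the URLs for a stemmed query word, ordered by descending frequency. */
    public static List<String> urlsFor(String stem) throws SQLException {
        String sql = "SELECT url FROM keyword_url WHERE stem = ? ORDER BY frequency DESC";
        List<String> urls = new ArrayList<>();
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:mysql://localhost/spider", "user", "password");
             PreparedStatement stmt = conn.prepareStatement(sql)) {
            stmt.setString(1, stem);
            try (ResultSet rs = stmt.executeQuery()) {
                while (rs.next()) {
                    urls.add(rs.getString("url"));
                }
            }
        }
        return urls;
    }
}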


3.3 DFD
In the analysis phase we came to know that our system should be modelled with DFDs at more
than one level; these levels are shown below.

3.3.1 DFD Level 0

6.3.2 DFD Level 1


6.3.3 DFD Level 2

6.4 ERD


Chapter 07
[Development Phases - Implementation]

4 Implementation
In this chapter we introduce the implementation phase of Search Engine with Intelligent Web
Spider: which tools were chosen for the development of this software system and which operating
system was used for the development environment. The classes, their functionality and their
attribute variables are also shown in the class diagrams.

4.1 Techniques/ tools


We used the tools mentioned below for the development of both modules:

I. Borland JBuilder 2005: to write and execute the Java source code.

II. Sun Java Enterprise Studio 8.1: to test the source code written in Borland JBuilder 2005, checking the code's compatibility with different IDEs.

III. Macromedia Dreamweaver MX: to write the PHP source code.

IV. Internet Explorer, Opera & Mozilla Firefox web browsers: to test the PHP pages.

V. WAMP Server 1.4.1: to install the web server and to install and configure the MySQL databases.

VI. Some other supporting tools: JDK 1.4.0, MySQL Connector with JDBC 3.1.14, and Java Runtime Environment 1.4.

The purpose to choose these tools is that:


4.2 Class Diagram of Web Spider

[Fig: 7.2.1]Class Diagram of Web Spider


4.3 Class Diagram of Web Portal


[Class diagram "cd version 1": an index page uses a db class (database helpers such as query(), select(), get_row(), get_results(), escape() and error reporting) and a PorterStemmer class (Stem()/StemIt() plus the internal Porter steps step1ab() through step5(), m(), doubleConsonant() and cvc()).]

[Fig: 7.3.1]Class Diagram of Web Portal


Chapter 08
[Results & Conclusion]

5 Results & Conclusion

Keyword     URL       Frequency   Cluster Nr
Card        [Link]    39          0
touch       [Link]    20          2
christma    [Link]    18          2
ecard       [Link]    18          2
greet       [Link]    17          2
garag       [Link]    16          2
light       [Link]    16          2
love        [Link]    16          2
start       [Link]    16          2
ecard       [Link]    15          2
love        [Link]    14          3
snow        [Link]    14          3
kate        [Link]    14          3
love        [Link]    12          3
decor       [Link]    12          3
door        [Link]    12          3
music       [Link]    12          3
greet       [Link]    11          3

[Table 8.1 – Example of Results by Different Web Pages on a Single Domain]

5.1 Conclusion
A small list of results produced by the Spider is shown in Table 8.1. These results were achieved
by giving [Link] as the starting URL, and all the keywords in this domain
were fetched by the developed Spider. Not only were the stemmed keywords fetched, but the
keywords were also categorized into different clusters on the basis of their frequency of
occurrence. These keywords are now available to normal users in a database, so that a user can
get fast and accurate results for the required URL.

By observing the results, we can say that we have achieved our objectives of:

I. Make fast and automated scanning of Web Pages, and search by Indexing.

(Java module is for Scanning, and PHP module is for search purpose)

II. Maintain the log efficiently (by using the Relational Database).


III. Only authorized persons can operate it (protected by user ID and password).

To achieve these objectives, we have learnt many new tricks and technologies under the
supervision of our kind supervisor, e.g.:

I. Web Page Crawling and Parsing techniques

II. Word Stemming, and Clustering of objects

III. Good way of doing programming, i.e. fail safe programming

IV. Different tools e.g. Borland JBuilder, PHP, WAMP etc.

We are thankful to our teachers, especially our supervisor, who motivated us to complete this
task.


APPENDIX – A
[Vision Document]

Appendix-A [Vision Doc]

1 Purpose of Vision Document


The purpose of this document is to collect, analyze, and define high-level needs and features of A

Search Engine with Intelligent Web Crawler. It focuses on the capabilities needed by the

stakeholders and the target users, and why these needs exist. The details of how the specified

System fulfills these needs are mentioned in the use-case and supplementary specifications.

5.2 Product Overview


The proposed software system crawls through web sites and analyzes the data written in text
format. It saves this data in a manageable hierarchy by applying clustering techniques. The saved
data is then accessible to users, so that they can see a list of related web pages.

5.3 User Descriptions

5.3.1 User Profiles

Users are familiar with the use of computers. There are two types of users: Administrator and
Search Performer. The key responsibility of the Administrator is to provide the starting web page
to the crawler and maintain a record using this software, and the responsibility of the Search
Performer is to find the addresses of related web pages.

5.3.2 Key User Needs

I. User wants to save the record of large number of keywords.

II. Analyze the frequency of keywords on all scanned web pages.


III. All the keywords should be stored after stemming them.

IV. All the keywords should be stored into their respective category/cluster.


5.4 Problem Statement

The Problem Of: scanning the textual contents of web pages and keeping a record of them for future use.

Affects: Record-Keepers and Record-Users.

The Impact Of Which Is: time-consuming and incorrect log entries, and a non-logical hierarchy of retrieved information.

A Successful Solution Would:
• Make fast and automated scanning, and accurate search by indexing
• Maintain the log efficiently
• Ensure that only authorized persons can operate it

5.5 Product Position Statement

FOR: record users and administrators,

WHO: as administrators want to scan web pages for information extraction, and as users want to perform search operations,

SEARCH ENGINE WITH INTELLIGENT WEB CRAWLER: is an intelligent software system

THAT: crawls through web sites, analyzes the data written in text format, and saves this data in a manageable format.

UNLIKE: simple web crawlers,

OUR PRODUCT:
• Applies a stemming algorithm to the extracted keywords.
• Counts the frequencies of keywords occurring on a web page.
• Applies a clustering algorithm to save keywords into the appropriate cluster.


APPENDIX – B
[Software Requirements Specification]

6 Appendix-B [SRS]
6.1 Purpose
This document specifies the software requirements of a project, titled A Search Engine with

Intelligent Web Spider-(version 1). A complete description of the product is discussed in the

document, and this document completely covers the areas of the product to be made.

6.2 Document Conventions


While reading this document, note the following conventions:

I. Normal + Italic: Italic text is used to indicate keywords.

II. Normal + Bold: Bold text is used to indicate significant terms.

III. Normal + Bold + Italic: Such text is used for fourth-level headings.

6.3 Intended Audience and Reading Suggestions


The SRS is to be used by the following kinds of stakeholders.

6.3.1 System Analyst

The system analyst creates and maintains the vision, the use-case model overview and

supplementary specifications, which serve as input to the SRS and are the communication medium

between the system analyst, the customer, and other developers.

6.3.2 Use-Case Specifier

The use-case specifier creates and maintains the individual use cases of the use-case model and

other components of the SRS package.


6.3.3 Designers

Designers use the SRS Package as a reference when defining responsibilities, operations, and

attributes on classes, and when adjusting classes to the implementation environment.

6.3.4 Implementers

Implementers refer to the SRS Package for input when implementing classes, and the functionality

of those classes.

6.3.5 Project Managers

The project manager consults the document to keep an overview of the project, which makes it

easier to manage, and also refers to the SRS package for input when planning iterations.

6.3.6 Marketing Staff

This document is also useful for the marketing staff: by consulting it, they can understand the

product's features and publicize the product effectively.

6.3.7 Testers

The testers use the SRS Package to verify system compliance.

6.3.8 Users

By reading this document, users learn about the features and importance of the product, and if

they want something more, they can state their new requirements, which can be added to the

product in the next iteration.

6.4 Project Scope


This software project will cover two major areas:

1. Database population (with the Spider/Crawler)
2. A web module, to search from the database


The responsibility of the first module will be to move like a spider from one site to another, and to find

the keywords of each web page. These keywords will be found by separating them from a list of

stop words like a, an, the, that, is, are, etc. The keywords will then be stemmed using a

stemming algorithm; that is, Management will be converted into Manage, Meetings into Meet, and

Cats into Cat, etc. These stemmed words will be stored in the database. While crawling from site to

site, this software will consider the privileges set for users on the current site; that is, it will examine

the robots.txt file (i.e. a file usually placed on restricted domains that describes the privileges to use

the content of that site). The software will also handle cycles while moving from site to site and will

count the number of times a certain web site is cited, i.e. it will perform statistical analysis. On the

basis of these statistics, the system will organize the found keywords into manageable

clusters/categories.

A Search Engine with Intelligent Web Spider [Fig: A-2.4.1]
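To make the keyword-filtering and frequency-counting step of the first module concrete, the following is a minimal Java sketch. It is an illustration only, written against a current JDK for brevity: the stop-word list is a tiny sample, and the class and method names are our own rather than the module's actual code.

import java.util.*;

public class KeywordExtractor {

    // A tiny illustrative stop-word list; the real module would use a much larger one.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "the", "that", "is", "are", "of", "and"));

    // Splits page text into words, drops stop words, and counts keyword frequencies.
    public static Map<String, Integer> extractKeywords(String pageText) {
        Map<String, Integer> frequencies = new HashMap<>();
        for (String token : pageText.toLowerCase().split("[^a-z]+")) {
            if (token.isEmpty() || STOP_WORDS.contains(token)) {
                continue; // skip noise and stop words
            }
            frequencies.merge(token, 1, Integer::sum); // count occurrences of the keyword
        }
        return frequencies;
    }

    public static void main(String[] args) {
        String sample = "The management of the meetings is planned.";
        System.out.println(extractKeywords(sample)); // e.g. {management=1, meetings=1, planned=1}
    }
}

The stemming of these keywords, and their storage in the database, then follows as described in the SRS (Appendix B).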

The second module will be web based, and its responsibility will be to search from

the database maintained by the first module. It will search by taking input from the user.


Before searching the database, this input will be stemmed using the same stemming

algorithm used by the first module.
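As a hedged illustration of this flow, the sketch below stems the user's query term and looks it up over JDBC (the MySQL connector is listed among the tools in Appendix B). The table name keywords, its columns, the connection details, and the crude stem() routine are all assumptions made for the example; the real system would reuse the crawler's own stemmer and schema.

import java.sql.*;
import java.util.*;

public class SearchModule {

    // Stand-in for the stemming routine shared with the crawler module (illustrative only).
    static String stem(String word) {
        return word.toLowerCase().replaceAll("(ings|ing|ment|s)$", "");
    }

    // Stems the user's query term and looks up matching page URLs in the hypothetical keywords table.
    static List<String> search(Connection db, String queryTerm) throws SQLException {
        List<String> urls = new ArrayList<>();
        String sql = "SELECT page_url FROM keywords WHERE word = ? ORDER BY frequency DESC";
        try (PreparedStatement ps = db.prepareStatement(sql)) {
            ps.setString(1, stem(queryTerm));
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    urls.add(rs.getString("page_url"));
                }
            }
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical local database populated by the crawler module.
        Connection db = DriverManager.getConnection("jdbc:mysql://localhost/search_engine", "root", "");
        System.out.println(search(db, "meetings")); // stemmed to "meet" before the lookup
    }
}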

6.5 Overall Description

6.5.1 Product Perspective

This software product is a member of a software family responsible for populating databases

for search engines using fast and intelligent techniques. The product also includes a web-based

module that uses the database maintained by the spider.

6.5.2 Product Features

The significant functions of the product are:

1. The first module of the product will search for the words in a web page.

2. It will filter the keywords from the list of words found.

3. It will stem the keywords by using an appropriate algorithm.

4. It will make clusters of keywords on the basis of their frequency in a web page (see the sketch after this list).

5. It will store these keywords, with the link of the web page, in a database.

6. The second module of the product will take an input from a web user (i.e. one or more

keywords), and will search for these keywords in the database maintained by the first module.

7. The search operation will be performed on the basis of the keyword clusters, and the results

will be shown to web users.
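Feature 4 above groups keywords by how often they occur. As a simplified stand-in for whatever clustering method the product finally applies (the glossary mentions Quality Threshold clustering), the sketch below assigns each keyword to the first cluster whose representative frequency lies within a chosen threshold; all names here are invented for the illustration.

import java.util.*;

public class FrequencyClusterer {

    // Greedy threshold clustering: a keyword joins the first cluster whose representative
    // frequency is within the threshold, otherwise it starts a new cluster of its own.
    static List<List<String>> cluster(Map<String, Integer> frequencies, int threshold) {
        List<List<String>> clusters = new ArrayList<>();
        List<Integer> representatives = new ArrayList<>(); // frequency of each cluster's first member
        for (Map.Entry<String, Integer> e : frequencies.entrySet()) {
            boolean placed = false;
            for (int i = 0; i < clusters.size(); i++) {
                if (Math.abs(e.getValue() - representatives.get(i)) <= threshold) {
                    clusters.get(i).add(e.getKey());
                    placed = true;
                    break;
                }
            }
            if (!placed) {
                clusters.add(new ArrayList<>(Collections.singletonList(e.getKey())));
                representatives.add(e.getValue());
            }
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("crawl", 12); freq.put("spider", 11); freq.put("java", 3); freq.put("cat", 2);
        System.out.println(cluster(freq, 2)); // [[crawl, spider], [java, cat]]
    }
}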


6.5.3 User Classes and Characteristics

S/No  User Class         Technical                       Experience                Privilege

1.    Web Module Users   Not expert.                     No experience required.   Access to search from the
                                                                                   database, by using the web
                                                                                   interface.

2.    Administrators     Expert enough to handle all     Experienced.              Full access to the system;
                         technical issues related to                               can populate the database
                         the product.                                              by using the Web Crawler.

The users of this software product can be divided on the basis of their privilege level and, to some

extent, on the basis of their technical expertise.

6.5.4 Operating Environment

6.5.4.1 Software Requirements

This program needs the following tools for its execution:

1. Java Runtime Environment 1.4.2

2. Web browser (Internet Explorer, Mozilla Firefox, Opera, etc.)

The following tools are required for its development and administration:

1. Borland JBuilder 10.0 or Sun Java Studio Enterprise 8.1

2. WAMP Server 5.0 (with support for MySQL and PHP 5)

3. Java Development Kit 1.3

4. MySQL Connector/J (JDBC) 3.1.14


6.5.4.2 Hardware Requirements

The software requires a NIC or modem to connect to a network, plus all the other hardware

components that a standard computer needs, e.g.:

RAM: 128 MB
Processor: 1000 MHz
Motherboard: P-III / P-IV


6.5.5 Design and Implementation Constraints

6.5.5.1 Hardware limitations

These include the processing speed of the hardware and the storage capacity of the database, so the

developers have to choose a server that provides the maximum capacity for the database.

6.5.5.2 Tools

This software will be built using a Java 2 compiler (i.e. Borland JBuilder 10), which is the latest

tool for compiling J2SE applications, so limitations are reduced by using version 10.0.

6.5.5.3 Security considerations

This program has to crawl from site to site, and some web sites do not authorize crawlers to scan

their pages. So the system should respect these privileges and should not scan such sites.
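A minimal sketch of how such a check might look is given below: it downloads a site's robots exclusion file (robots.txt) and reports whether a path is disallowed for all user agents. Real robots.txt handling has more rules (per-agent sections, Allow lines, wildcards), so this is a deliberately simplified illustration with names of our own choosing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class RobotsChecker {

    // Returns true if the site's robots.txt disallows the path for all crawlers ("User-agent: *").
    public static boolean isDisallowed(String site, String path) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                new URL(site + "/robots.txt").openStream()))) {
            boolean appliesToUs = false;
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    appliesToUs = line.endsWith("*");
                } else if (appliesToUs && line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring("disallow:".length()).trim();
                    if (!rule.isEmpty() && path.startsWith(rule)) {
                        return true; // this path is off limits for generic crawlers
                    }
                }
            }
        } catch (Exception e) {
            // No robots.txt, or the site is unreachable: treat the path as allowed in this sketch.
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isDisallowed("https://example.com", "/private/page.html"));
    }
}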

6.5.6 User Documentation

The development team provides a training session for the administrators, and a user manual is also

bundled with the software module. This help manual contains all the information about the basic

functionality, answers questions of the form "How do I perform this function?", and includes some FAQs.


6.6 System Features

6.6.1 Scan Web Pages

6.6.1.1 Description and Priority

The software module is responsible for visiting/crawling web pages from a starting web portal (i.e.

provided by the user). It will not only crawl from one web page to another but also fetch the

data written in each page, in order to search for logical words.

It is the high-priority, primary feature of the system, because without crawling the raw data cannot

be collected to maintain the database.

6.6.1.2 Stimulus/Response Sequences

1. The administrator will enter the starting web portal URL and the depth for crawling.

2. The system will fetch the contents of the specified URL.

3. The system will make a list of URLs found on the current web page, for crawling purposes.

4. The system will find the logical words on this page and send them to the stemmer.

5. The system will store all stemmed words into their respective clusters.

6. Repeat steps 2-5 for all URLs found at this level (a sketch of this crawl loop is given below).
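The sequence above amounts to a breadth-first crawl with a visited set for cycle handling and a depth limit; the sketch below illustrates only that control flow. The link-extraction regex is deliberately crude, and fetch() and storeKeywords() are placeholders for the real parsing, stemming and database code.

import java.net.URL;
import java.util.*;
import java.util.regex.*;

public class CrawlerSketch {

    // Breadth-first crawl from a start URL down to maxDepth, visiting each URL only once.
    public static void crawl(String startUrl, int maxDepth) throws Exception {
        Set<String> visited = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        Queue<Integer> depths = new ArrayDeque<>();
        frontier.add(startUrl);
        depths.add(0);

        while (!frontier.isEmpty()) {
            String url = frontier.poll();
            int depth = depths.poll();
            if (depth > maxDepth || !visited.add(url)) {
                continue; // too deep, or already crawled (cycle handling)
            }
            String html = fetch(url);                 // step 2: fetch the page contents
            for (String link : extractLinks(html)) {  // step 3: list the URLs found on the page
                frontier.add(link);
                depths.add(depth + 1);
            }
            storeKeywords(url, html);                 // steps 4-5: stem the words and store them
        }
    }

    static String fetch(String url) throws Exception {
        try (Scanner s = new Scanner(new URL(url).openStream()).useDelimiter("\\A")) {
            return s.hasNext() ? s.next() : "";
        }
    }

    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = Pattern.compile("href=\"(http[^\"]+)\"").matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    static void storeKeywords(String url, String html) {
        // Placeholder: filter stop words, stem the remainder, and write keyword/URL rows to the database.
        System.out.println("Indexed " + url);
    }
}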

6.6.2 Stemming the Key Words

6.6.2.1 Description and Priority

Stemming is the process of reducing inflected (or sometimes derived) words to their stem, base

or root form — generally a written word form. The stem need not be identical to the morphological

root of the word; it is usually sufficient that related words map to the same stem, even if this stem

is not in itself a valid root.


Stemming the keywords is useful in search engines for query expansion or indexing and for other

natural language processing problems, so it has a high priority in our system.

6.6.2.2 Stimulus/Response Sequences

1. A list of logical words will be generated.

2. Each word will be converted to its base/root word.

3. Step 2 will be performed for all logical words (a sketch of such a stemmer is given below).
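A production stemmer such as Porter's or Lovins' algorithm [2] applies many ordered rules and conditions; the fragment below shows only the general suffix-stripping idea behind step 2, as a hedged illustration rather than the algorithm the product finally uses.

import java.util.*;

public class SimpleStemmer {

    // Suffixes tried longest-first; a real stemmer has far more rules and conditions than this.
    private static final String[] SUFFIXES = {"ings", "ment", "ing", "ness", "ies", "es", "s"};

    // Strips the longest matching suffix: a rough approximation of reducing a word to its stem.
    public static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : Arrays.asList("Management", "Meetings", "Cats")) {
            System.out.println(w + " -> " + stem(w)); // Management -> manage, Meetings -> meet, Cats -> cat
        }
    }
}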

6.8 Nonfunctional Requirements

6.8.1 Performance Requirements

The web crawling module of the software system will require an HTTP internet connection, and

the maximum transfer rate of this connection will affect the performance of this module. So the

bandwidth of the connection should be as high as possible to increase performance.

6.8.2 Security Requirements

The web crawling module of the software system should only be accessible to authorized users,

and no unauthorized individual should have access to the database maintained by the web crawler.

To ensure security, the system needs to implement a user-recognition (login) system that operates

before the crawling module starts.

6.8.3 Software Quality Attributes

6.8.3.1 Availability

The software will be easily available from our website, and users will be able to purchase it by credit
card.

6.8.3.2 Maintainability

Users will be able to download newer versions from our website, so that they can keep it maintained.

6.8.3.3 Less expensive

It will be inexpensive software, easily available in the software market.


APPENDIX-C
[Glossary]

7 Appendix-C [Glossary]
1 SRS: Software Requirements Specification

2 JRE: Java Runtime Environment

3 J2SE: Java 2 Standard Edition

4 RAM: Random Access Memory

5 HTTP: Hypertext Transfer Protocol

6 IDE: Integrated Development Environment

7 QT: Quality Threshold


APPENDIX-D
[References]
Appendix-D [References]

[1]. Gautam Pant, Padmini Srinivasan, and Filippo Menczer, “Crawling the Web – Chapter 1:
Introduction”. [Link]

[2]. Julie Beth Lovins (1968). “Development of a Stemming Algorithm”. Mechanical Translation and
Computational Linguistics 11: 22–31.

[3]. [Link] [Date visited 15-01-2010]

[4]. [Link] [Date visited 21-01-2010]

[5]. Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern Information Retrieval. ACM
Press/Addison Wesley.

[6]. Basak S.C., Magnuson V.R., Niemi C.J., Regal R.R. (1988). “Determining Structural Similarity of
Chemicals Using Graph Theoretic Indices”. Discrete Applied Mathematics, 19: 17–44.

[7]. E. B. Fowlkes and C. L. Mallows (September 1983). “A Method for Comparing Two
Hierarchical Clusterings”. Journal of the American Statistical Association 78 (383): 553–584.

[8]. [Link] [Date visited 28-01-2010]

[9]. [Link] [Date visited 28-01-2010]

[10]. [Link] [Date visited 28-01-2010]


7.1 Some other Resources
[1]. “Programming Spiders, Bots, and Aggregators in Java”, Sybex.

[2]. IEEE Std 830-1993, Recommended Practice for Software Requirements

Specifications, Software Engineering Standards Committee of the IEEE Computer Society, New

York (1993)


[3]. “Data Mining – Introductory and Advanced Topics” by Margaret H. Dunham,

Pearson Education

[4]. Search Engines and Legal Issues, available at:

[Link]

[5]. Osmar R. Zaïane: “Principles of Knowledge Discovery in Databases – Chapter 8: Data
Clustering”. [Link]

[6]. Tariq Rashid: “Clustering”. [Link]

Snapshots:

Web Portal


Web Spider


Project Schedule:

Phase                                                                           Duration

Specifying the requirements and analyzing them                                  5 weeks
Making the design of the product                                                3 weeks
Inspection and testing of the design                                            1 week
Implementation of the design                                                    3 weeks
Testing of the product by a user who is not a member of the development team    1 week
Modification of faults found during testing                                     2 weeks
Making the help manual and PowerPoint presentations for the product             2 weeks
Making the help manual and PowerPoint presentations for the product             Last week

Total estimated time: 18 weeks

