INTRODUCTION
1.1 Background:
WebPage URL Extraction technology lets organizations find, capture and store information from websites quickly and easily. It can extract data from nearly any website, including commerce sites, member list directories, search engines and others. This solution has been used successfully to collect content from sites. It collects both static and dynamic content from pages and converts it into a text file, an Excel spreadsheet, or a database format. It is fully customizable, so that extraction from a target website is fast and simple.
1.2 Objectives:
The main objective of this project is to design a website that extracts images, tags and meta tags for any given URL.
1.3 Purpose, Scope and Applicability:
1.3.1 Purpose
The main aim of the webpage URL extractor is to extract the URL addresses from a webpage. With this system, the user can store the URLs of a very large number of websites on a local drive or disk drive.
1.3.2 Scope
The user can enter a website address and find the links of that particular website. The scope also includes a facility to store the links in a database file or export them to an Excel file. This project will be useful to users who want to know the links in any particular website.
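As an illustration of the link-finding step described in the scope, the following is a minimal Java sketch and not the project's actual implementation. The class name LinkExtractor, the method extractLinks and the regular expression are assumptions made for this example. It works on an HTML string that has already been downloaded and only recognizes quoted href values inside anchor tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for this report: pulls href values out of anchor tags.
final class LinkExtractor {
    // Matches <a ... href="..."> or <a ... href='...'>, ignoring case.
    // Group 1 is the quote character, group 2 is the link target.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Returns the link targets in the order they appear in the HTML text.
    static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(2).trim());
        }
        return links;
    }
}
```

A regular expression is enough for a sketch. A real extractor would more likely use an HTML parser, because pages found on the web often contain malformed markup.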
1.3.3 Applicability:
It is applicable to all companies that face the problem of manual mistakes.
1.4 Achievements:
This project gave me an opportunity to improve my designing, coding, analyzing and testing skills.
LITERATURE SURVEY
2.1 Existing system
The Extract URL method is the primitive system. It facilitates the extraction of URLs from files on local drives and stores them in a text file without duplication (every stored URL is distinct), and it facilitates navigation to the appropriate URLs. However, it can store only a limited number of URLs/pages per session.
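The two properties described above, storage without duplication and a limited count per session, can be sketched in Java as follows. This is an illustration under assumptions, not the code of the existing tool: the class name UrlSession and the parameter maxPerSession are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory session: keeps distinct URLs up to a fixed limit.
final class UrlSession {
    // LinkedHashSet rejects duplicates and remembers insertion order.
    private final Set<String> urls = new LinkedHashSet<>();
    private final int maxPerSession;

    UrlSession(int maxPerSession) {
        this.maxPerSession = maxPerSession;
    }

    // Returns true if the URL was stored, false if it is a duplicate
    // or the session limit has already been reached.
    boolean add(String url) {
        if (urls.size() >= maxPerSession) {
            return false;
        }
        return urls.add(url);
    }

    // Returns a copy of the stored URLs in insertion order.
    List<String> stored() {
        return new ArrayList<>(urls);
    }
}
```

Because everything is held in memory, a design like this needs some upper bound, which is one plausible reason for a per-session limit.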
Overview
Web Link Extractor is a powerful link extractor utility. It extracts the link text and link URL from the web pages you specify and puts the result into a Text/CSV file that you can open with Notepad or Excel.
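Writing link text and link URL to a CSV file requires some care when the link text itself contains commas or quotes. The Java sketch below shows one common way to handle this, following the usual CSV quoting convention. The class name CsvLine and its methods are assumptions made for this example and do not describe how Web Link Extractor itself writes its output.

```java
// Hypothetical helper: formats one (link text, link URL) pair as a CSV line.
final class CsvLine {
    // Wraps a field in double quotes when it contains a comma, a quote or
    // a line break, and doubles any quotes inside it.
    static String escape(String field) {
        boolean needsQuotes = field.contains(",") || field.contains("\"")
                || field.contains("\n") || field.contains("\r");
        if (!needsQuotes) {
            return field;
        }
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Joins the two escaped fields with a comma.
    static String of(String linkText, String linkUrl) {
        return escape(linkText) + "," + escape(linkUrl);
    }
}
```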
Features
1. You may use URL fuzzy matching.
2. You may specify the minimum length of link text.
3. It may crawl pages one by one by following the Next Page link.
4. You may specify keywords to exclude some links.
5. The output data can be the link URL, the link text, or both.
6. You may save your task parameters to an INI file and load them next time.
Limitations of Existing System
The limit is 32,000 URLs/pages per session. If you need to extract more pages per session, then try Win Web Crawler - it saves extracted data directly to disk file, so there is no limit.
2.2 Proposed system
To overcome the drawback of the existing system, the proposed system removes the limit on the number of URLs/pages stored per session. It uses Win Web Crawler, which saves extracted data directly to a disk file, so there is no limit on the number of URLs/pages that can be stored. Win Web Crawler 2.0 is a Web search tool from winwebcrawler.com. It is a Web crawler utility for extracting URLs, meta tags, plain text, page size and last-modified date values from Web sites, Web directories, search results and lists of URLs from a file. It saves data directly to a disk file and has numerous filters to restrict a session, including a URL filter, text filter, data filter, domain filter and date-modified filter.
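The idea of saving extracted data directly to a disk file, rather than collecting it in memory, can be sketched in Java as follows. This is only an illustration of the approach. The class name DiskUrlSink and its methods are hypothetical and are not part of Win Web Crawler.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sink: appends each extracted URL to a file as soon as it is found,
// so memory use does not grow with the number of URLs in the session.
final class DiskUrlSink implements AutoCloseable {
    private final BufferedWriter out;
    private long written;

    DiskUrlSink(Path file) throws IOException {
        this.out = Files.newBufferedWriter(file, StandardCharsets.UTF_8);
    }

    // Writes one URL per line.
    void write(String url) throws IOException {
        out.write(url);
        out.newLine();
        written++;
    }

    // Number of URLs written so far in this session.
    long count() {
        return written;
    }

    @Override
    public void close() throws IOException {
        out.close();
    }
}
```

Used in a try-with-resources block, the sink is flushed and closed automatically at the end of the session. The practical limit then becomes free disk space rather than a fixed count.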
URL Filter
Include:
Set this option if you want to specify a list of keywords and tell the program that a link/URL must contain at least one of those keywords before its files are downloaded. You can enter one or more keywords, one per line. Every link or URL is checked before download.
Exclude:
Set this option if you want to specify a list of keywords and tell the program that a link/URL must NOT contain any of those keywords before its files are downloaded. You can enter one or more keywords, one per line. Every link or URL is checked before download. For example, if you do not want to process files from the folder "http://www.xyz.com/movies", enter "/movies" in the exclude box and the program will not download anything from that folder.
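The include and exclude rules above can be expressed as a small check that runs before each download. The Java sketch below is an assumption-based illustration: the class name UrlFilter is hypothetical, matching is a plain case-sensitive substring test, and an empty include list is treated as "no restriction", which the tool's documentation does not state.

```java
import java.util.List;

// Hypothetical URL filter combining an include list and an exclude list.
final class UrlFilter {
    private final List<String> include;
    private final List<String> exclude;

    UrlFilter(List<String> include, List<String> exclude) {
        this.include = include;
        this.exclude = exclude;
    }

    // Returns true if the URL may be downloaded.
    boolean accepts(String url) {
        // Exclude: the URL must not contain any excluded keyword.
        for (String keyword : exclude) {
            if (url.contains(keyword)) {
                return false;
            }
        }
        // Include: with no include keywords, everything else passes (assumption).
        if (include.isEmpty()) {
            return true;
        }
        // Otherwise the URL must contain at least one included keyword.
        for (String keyword : include) {
            if (url.contains(keyword)) {
                return true;
            }
        }
        return false;
    }
}
```

With "/movies" in the exclude list, a URL such as http://www.xyz.com/movies/b.html is rejected, while other pages of the same site are still accepted.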
Text Filter
Use the text filter to extract data only from those web pages that contain the text keywords you specify. You can specify one or more keywords in the box and set 'OR' / 'AND' logic. For example, if 'OR' is set, then WDE will evaluate to true if the web page being tested contains any of your keywords. On the other hand, if 'AND' is set, then all of your keywords must exist in that web page before WDE starts data extraction from that page.
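The 'OR' / 'AND' logic of the text filter can be sketched in Java as follows. The class name TextFilter, the enum Mode and the case-insensitive comparison are assumptions made for this example. The text above does not say whether the real filter is case-sensitive.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical text filter: decides whether a page should be processed.
final class TextFilter {
    enum Mode { OR, AND }

    // OR: at least one keyword must occur in the page text.
    // AND: every keyword must occur in the page text.
    static boolean matches(String pageText, List<String> keywords, Mode mode) {
        if (keywords.isEmpty()) {
            return true; // assumption: no keywords means no restriction
        }
        String text = pageText.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String keyword : keywords) {
            if (text.contains(keyword.toLowerCase(Locale.ROOT))) {
                hits++;
            }
        }
        return mode == Mode.AND ? hits == keywords.size() : hits > 0;
    }
}
```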
Domain Filter
Check this option so that all extracted data is verified against the domain list. It is checked by default.
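One way such a domain check could work is to compare the host part of each URL against the allowed domains. The Java sketch below is a guess at the behavior and not the tool's documented logic. The class name DomainFilter and the rule that subdomains of a listed domain are also allowed are assumptions.

```java
import java.net.URI;
import java.util.List;
import java.util.Locale;

// Hypothetical domain filter: accepts a URL only if its host is in the domain list.
final class DomainFilter {
    static boolean allowed(String url, List<String> domains) {
        String host;
        try {
            host = URI.create(url).getHost();
        } catch (IllegalArgumentException e) {
            return false; // the text is not a parseable URL
        }
        if (host == null) {
            return false; // relative links carry no host to verify
        }
        host = host.toLowerCase(Locale.ROOT);
        for (String domain : domains) {
            String d = domain.toLowerCase(Locale.ROOT);
            // Accept the domain itself and any of its subdomains (assumption).
            if (host.equals(d) || host.endsWith("." + d)) {
                return true;
            }
        }
        return false;
    }
}
```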
REQUIREMENT SPECIFICATION
Requirements
A requirement is a feature that must be included in the actual design and implementation. Getting to know the system to be implemented is of prime importance.
Main emphasis is on:
1. The inputs to the system.
2. The outputs expected from the system.
3. The people involved in the working of the system.
4. The volume of DATA (INPUTS) and the amount of INFORMATION (OUTPUTS) that will be involved.
System Environment:
3.1 Hardware Requirement Specifications (HRS)
Processor: Intel Core 2 Duo
Clock speed: 800 MHz or above
RAM: 128 MB or above
Hard disk: 20 GB
3.2 Software Requirement Specifications (SRS)
Operating System: Windows XP, Windows 7
Technology: Java
Language: Java
Database: Oracle
Browser: Any web browser (IE, Firefox, etc.)
ANALYSIS
4.1 Software Life Cycle Models
Various software development approaches have been defined and designed for use during the software development process. These approaches are also referred to as "Software Development Process Models". Each process model follows a particular life cycle in order to ensure success in the process of software development.
One such approach/process used in Software Development is "The Waterfall Model".
4.1.1 Waterfall Model
1. Requirement gathering and analysis
2. System design
3. Implementation
4. Testing
5. Deployment of system
6. Maintenance
Figure: General overview of the waterfall model
DESIGN PHASE ANALYSIS
UML DIAGRAMS: Unified Modelling Language (UML) is a standard language for specifying, visualizing, constructing, and documenting the artifacts of software systems, as well as for business modelling and other non-software systems. The UML represents a collection of best engineering practices that have proven successful in the modelling of large and complex systems. The UML is a very important part of developing object-oriented software and of the software development process. The UML uses mostly graphical notations to express the design of software projects. Using the UML helps project teams communicate, explore potential designs, and validate the architectural design of the software.
5.1 CLASS DIAGRAM:
Classes: HOME, URL's, Links
Get URL's: URL : varchar; Store in db : varchar; Retrieve existing : varchar; URL()
Get Links: Link : varchar; Store in db : varchar; Retrieve existing : varchar; Links()
Fig 5.1: Class diagram for user activities
5.2 USE CASE DIAGRAM:
A use case is a set of scenarios describing an interaction between a user and a system. A use case diagram displays the relationship among actors and use cases. The two main components of a use case diagram are use cases and actors.
[Use case diagram: actor User; use cases Enter URL, Extract URL's, Extract Links and Store in database; storage targets SQL DB and Excel.]
Fig 5.2: Use case diagram for user activities
5.3 STATE CHART DIAGRAM:
State chart diagrams model the dynamic behavior of individual classes or any other kind of object. They show the sequences of states that an object goes through, the events that cause a transition from one state to another, and the actions that result from a state change.
State chart diagrams are closely related to activity diagrams. The main difference between the two is that state chart diagrams are state centric, while activity diagrams are activity centric. A state chart diagram is typically used to model the discrete stages of an object's lifetime, whereas an activity diagram is better suited to model the sequence of activities in a process.
You can use the following tools on the state chart diagram toolbox to model state chart diagrams:
1. Decisions
2. Synchronizations
3. States
4. Transitions
5. Start states
6. End states
Fig 5.3.1: State chart diagram for extracting URLs
Fig 5.3.2: State chart diagram for extracting Links
5.4 SEQUENCE DIAGRAM:
A sequence diagram is a graphical view of a scenario that shows object interaction in a time-based sequence: what happens first, what happens next. Sequence diagrams establish the roles of objects and help provide essential information to determine class responsibilities and interfaces. This type of diagram is best used during the early analysis phases of design because it is simple and easy to comprehend. Sequence diagrams are normally associated with use cases.
A sequence diagram has two dimensions: typically, vertical placement represents time and horizontal placement represents different objects. The following tools located on the sequence diagram toolbox enable you to model sequence diagrams:
1. Object
2. Message Icons
3. Focus of control
4. Message to Self
5. Note
6. Note Anchor
Fig 5.4.1: Sequence diagram for extracting URLs
Fig 5.4.2: Sequence diagram for extracting Links
5.5 COLLABORATION DIAGRAM:
Objects: User, Extractor, Database
1: Enter URL
2: Extract URL's
3: Get URL's
4: Stores in DataBase
5: Database Updated
Fig 5.5.1: Collaboration diagram for extracting URLs
Objects: User, Extractor, Database
1: Enter URL
2: Extract Links
3: Get Links
4: Stores in DataBase
5: Database Updated
Fig 5.5.2: Collaboration diagram for extracting Links
IMPLEMENTATION
6.1 Technologies Used
HTML:
HTML, an initialism of Hypertext Markup Language, is the predominant markup language for web pages. It provides a means to describe the structure of text-based information in a document by denoting certain text as headings, paragraphs, lists, and so on and to supplement that text with interactive forms, embedded images, and other objects. HTML is written in the form of labels (known as tags), surrounded by angle brackets. HTML can also describe, to some degree, the appearance and semantics of a document, and can include embedded scripting language code which can affect the behavior of web browsers and other HTML processors.
Hypertext Markup Language (HTML), the language of the World Wide Web (WWW), allows users to produce Web pages that include text, graphics and pointers to other Web pages (hyperlinks). HTML is not a programming language. It is an application of ISO Standard 8879, SGML (Standard Generalized Markup Language), specialized to hypertext and adapted to the Web. The idea behind hypertext is that instead of reading text in a rigid linear structure, we can easily jump from one point to another. We can navigate through the information based on our interest and preference. A markup language is simply a series of elements, each delimited with special characters, that define how text or other items enclosed within the elements should be displayed. Hyperlinks are underlined or emphasized words that lead to other documents or to some portion of the same document.
HTML can be used to display any type of document on the host computer, which can be geographically at a different location. It is a versatile language and can be used on any platform or desktop. HTML provides tags (special codes) to make the document look attractive. HTML tags are not case-sensitive. Using graphics, fonts, different sizes, color, etc., can enhance the presentation of the document. Anything that is not a tag is part of the document itself.
Attributes:
The attributes of an element are name-value pairs, separated by "=", and written within the start tag of an element, after the element's name. The value should be enclosed in single or double quotes, although values consisting of certain characters can be left unquoted in HTML (but not XHTML). Leaving attribute values unquoted is considered unsafe. Most elements take any of several common attributes: id, class, style and title. Most also take language-related attributes: lang and dir.
Advantages
1. An HTML document is small and hence easy to send over the net. It is small because it does not include formatted information.
2. HTML is platform independent.
3. HTML tags are not case-sensitive.
Introduction to JAVA:
History of JAVA:
The Java language was developed by James Gosling and his team at Sun Microsystems and was formally released in 1995. Its former name was Oak. Java Development Kit 1.0 was released in 1996 to popularize Java and is freely available on the Internet.
Overview of JAVA:
Java is loosely based on C++ syntax and is meant to be object-oriented. The structure of Java is midway between an interpreted and a compiled language. Java programs are compiled by the Java compiler into byte codes, which are secure and portable across different platforms. These byte codes are essentially instructions for what is known as the Java Virtual Machine (JVM), which can reside in a standard browser.
The JVM verifies the integrity of these byte codes when they are downloaded by the browser. A JVM is available for almost all operating systems. The JVM converts the byte codes into machine-specific instructions at runtime.
Java Features:
The inventors of Java wanted to design a language that could offer solutions to some of the problems encountered in modern programming. They wanted the language to be not only reliable, portable and distributed but also simple, compact and interactive. Sun Microsystems officially describes Java with the following attributes:
1. Simple
2. Compiled and interpreted
3. Platform-independent and portable
4. Object-oriented
5. Robust and secure
6. Distributed
7. Familiar, simple and small
8. Multithreaded and interactive
9. High performance
10. Dynamic and extensible
Importance of Java to the Internet:
Java has had a profound effect on the Internet. This is because Java expands the universe of objects that can move about freely in cyberspace. In a network, two categories of objects are transmitted between the server and the personal computer: passive information and dynamic, active programs. Dynamic programs raise serious concerns in the areas of security and portability. Java addresses these concerns and, by doing so, has opened the door to an exciting new form of program called the applet.
JAVA and World Wide Web:
The World Wide Web is an open-ended information retrieval system designed to be used in a distributed environment. This system contains web pages that provide both information and controls. We can navigate to a new web page in any direction. This is made possible with HTML. Java was meant to be used in distributed environments such as the Internet, so it could easily be incorporated into the web system, and it is capable of supporting animation, graphics, games and other special effects. The web has become more dynamic and interactive with the support of Java. We can run a Java program on a remote machine over the Internet with the support of the web.
JAVA Environment:
The Java environment includes a large number of tools, which are part of the system known as the Java Development Kit (JDK), and hundreds of classes, methods and interfaces grouped into packages, which form part of the Java Standard Library (JSL).
JAVA Architecture:
Java architecture provides a portable, robust, high-performing environment for development. Java provides portability by compiling source code into byte codes for the Java Virtual Machine, which are then interpreted on each platform by the run-time environment. Java is a dynamic system, able to load code when needed from a machine in the same room or across the planet.
JAVA Virtual Machine:
When we compile the code, the Java compiler creates machine code (byte code) for a hypothetical machine called the Java Virtual Machine (JVM). The JVM executes the byte code and thereby overcomes the issue of portability. The code is written and compiled for one machine and interpreted on all other machines.
Paradigm of JAVA:
1. Dynamic downloading of applets (small application programs)
2. Elimination of the fatware phenomenon, that is, providing only those features of a product that the user needs at a given time; the remaining features of the product can remain on the server
3. Changing economic model of software
4. Up-to-date software availability
5. Supports network-centric computing
6. Supports CORBA & DCOM
Compilation of code
When you compile the code, the Java compiler creates machine code (called byte code) for a hypothetical machine called the Java Virtual Machine (JVM). The JVM is supposed to execute the byte code. The JVM was created to overcome the issue of portability. The code is written and compiled for one machine and interpreted on all machines.
[Figure: source code is compiled by a Java compiler (PC, Macintosh or SPARC) into platform-independent byte code, which a Java interpreter runs on each platform.]
Fig 3.3.1: Compiling and interpreting Java source code
During run-time the Java interpreter tricks the byte code file into thinking that it is running on a Java Virtual Machine. In reality this could be an Intel Pentium running Windows 95, a Sun SPARCstation running Solaris, or an Apple Macintosh running System, and all of them could receive code from any computer through the Internet and run the applets.
SERVLETS/JSP:
Introduction to Servlets:
Servlets provide a Java based solution used to address the problems currently associated with doing server-side programming, including inextensible scripting solutions, platform-specific APIs, and incomplete interfaces. Servlets are objects that conform to a specific interface that can be plugged into a Java-based server. Servlets are to the server-side what applets are to the client-side -- object byte codes that can be dynamically loaded off the net. They differ from applets in that they are faceless objects (without graphics or a GUI component). They serve as platform-independent, dynamically-loadable, pluggable helper byte code objects on the server side that can be used to dynamically extend server-side functionality.
What is a Servlet?
Servlets are modules that extend request/response-oriented servers, such as Java-enabled web servers. For example, a servlet might be responsible for taking data in an HTML order-entry form and applying the business logic used to update a company's order database.
Servlets are to servers what applets are to browsers. Unlike applets, however, servlets have no graphical user interface. Servlets can be embedded in many different servers because the Servlet API, which you use to write servlets, assumes nothing about the server's environment or protocol. Servlets have become most widely used within HTTP servers; many web servers support the Servlet API.
6.2 Module Explanation
Login Module
Search Module
6.3 Screen Shots
Screen-1:
Screen-2:
Screen-3:
TESTING
7.1 Testing Approach
Software Testing Techniques:
Software testing is a critical element of software quality assurance and represents the ultimate review of specification, design and coding.
Test Case Design:
Any engineering product can be tested in one of two ways:
White Box Testing:
This testing is also called glass box testing. Knowing the internal operation of a product, tests can be conducted to ensure that all gears mesh, that is, that the internal operation performs according to specification and all internal components have been adequately exercised. It is a test case design method that uses the control structure of the procedural design to derive test cases. Basis path testing is a white box testing technique.
Black Box Testing:
Knowing the specified function that a product has been designed to perform, tests can be conducted to demonstrate that each function is fully operational while at the same time searching for errors in each function. Black box testing fundamentally focuses on the functional requirements of the software. The methods used in black box test case design are:
1. Graph based testing methods
2. Equivalence partitioning
3. Boundary value analysis
4. Comparison testing
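Boundary value analysis, one of the methods listed above, can be illustrated with a small Java sketch. It assumes a hypothetical rule that a link is kept only if its link text has at least a minimum number of characters. The class name LinkTextRule, the method accepts and the minimum length of 3 are assumptions made for this example. The test inputs sit just below, exactly at, and just above the boundary, where off-by-one errors are most likely.

```java
// Hypothetical rule under test: keep a link only if its text is long enough.
final class LinkTextRule {
    static boolean accepts(String linkText, int minLength) {
        return linkText != null && linkText.length() >= minLength;
    }

    public static void main(String[] args) {
        int minLength = 3; // assumed setting for this example
        // Boundary value analysis: lengths minLength - 1, minLength, minLength + 1.
        String[] inputs = {"ab", "abc", "abcd"};
        boolean[] expected = {false, true, true};
        for (int i = 0; i < inputs.length; i++) {
            boolean actual = accepts(inputs[i], minLength);
            String verdict = (actual == expected[i]) ? "pass" : "FAIL";
            System.out.println(inputs[i] + " -> " + actual + " (" + verdict + ")");
        }
    }
}
```

The test cases are chosen from the specification alone, without looking at how accepts is written, which is what makes this a black box technique.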
Software Testing Strategies:
A software testing strategy provides a road map for the software developer. Testing is a set of activities that can be planned in advance and conducted systematically. For this reason a template for software testing, that is, a set of steps into which we can place specific test case design methods, should be defined for the software engineering process. Any software testing strategy should have the following characteristics:
1. Testing begins at the module level and works outward toward the integration of the entire computer based system.
2. Different testing techniques are appropriate at different points in time.
3. The developer of the software and an independent test group conduct testing.
4. Testing and debugging are different activities, but debugging must be accommodated in any testing strategy.
Unit Testing:
Unit testing focuses verification efforts on the smallest unit of software design (the module).
1. Unit test considerations
2. Unit test procedures
Integration Testing:
Integration testing is a systematic technique for constructing the program structure while conducting tests to uncover errors associated with interfacing. The approaches considered here are:
1. Top-Down Integration: Top-down integration is an incremental approach to the construction of program structures. Modules are integrated by moving downwards through the control hierarchy, beginning with the main control module.
2. Bottom-Up Integration: Bottom-up integration, as its name implies, begins construction and testing with atomic modules.
3. Regression Testing: In the context of an integration test strategy, regression testing is the re-execution of some subset of tests that have already been conducted, to ensure that changes have not propagated unintended side effects.
VALIDATION TESTING:
At the culmination of integration testing, the software is completely assembled as a package, interfacing errors have been uncovered and corrected, and a final series of software tests, validation testing, may begin. Validation can be defined in many ways, but a simple definition is that validation succeeds when the software functions in a manner that can be reasonably expected by the customer. Reasonable expectation is defined in the software requirement specification, a document that describes all user-visible attributes of the software. The specification contains a section titled Validation Criteria. Information contained in that section forms the basis for a validation testing approach.
MANUAL TESTING:
Manual testing is the oldest and most rigorous type of software testing. Manual testing requires a tester to perform manual test operations on the software under test without the help of test automation. Manual testing is a laborious activity that requires the tester to possess a certain set of qualities: to be patient, observant, speculative, creative, innovative, open-minded, resourceful, unopinionated, and skillful. Manual testing helps discover and record any software bugs or discrepancies related to the functionality of the product. A manual tester would typically perform the following steps:
1. Understand the functionality of the program.
2. Prepare a test environment.
3. Execute the test case(s) manually.
4. Verify the actual result.
Usage:
It involves testing of all the functions performed by the people while preparing the data and using these data from automated
CONCLUSION & FUTURE SCOPE OF PROJECT
CONCLUSION:
The project titled WEBPAGE URL EXTRACTION was studied and analyzed in depth, and its design, coding and testing with various methods were carried out under the guidance of an experienced project guide. The solution developed is free from all bugs and executes with all its modules to the utmost satisfaction of the client. All the current requirements and possibilities have been taken care of during the project. We feel that the solution provided will suit the needs of various clients in the same industry, but we do not rule out further upgrading of this solution with new and advanced technologies and additional client requirements. The documentation and the project report have been prepared so that they can be referred to as a user manual for this software solution.
FUTURE SCOPE OF PROJECT:
In future, the webpage URL extraction project can be used not only for extracting URLs and links but can also be extended into a search engine.