
PROJECT NAME

Automated Web Data Extraction Framework Using Java
A PROJECT REPORT

Submitted by

Kanika (23BCS12910)
Sidak Singh (23BCS13207)
Aryan Athwal (223BCS13302)
Harshit Jaryal (23BCS13109)

in partial fulfillment for the award of the degree of

BACHELOR OF ENGINEERING

IN

COMPUTER SCIENCE & ENGINEERING

Chandigarh University

Month, 2025
TABLE OF CONTENTS

List of Figures ..............................................................................................

CHAPTER 1. INTRODUCTION ..................................................................
1.1. Introduction to Project ....................................................................... 5
1.2. Identification of Problem ................................................................... 6

CHAPTER 2. BACKGROUND STUDY ...................................................... 7
2.1. Existing Solutions .............................................................................. 7
2.2. Problem Definition ............................................................................. 8
2.3. Goals/Objectives ................................................................................ 8

CHAPTER 3. DESIGN FLOW/PROCESS .................................................. 9
3.1. Evaluation & Selection of Specifications/Features ............................ 9
3.2. Analysis of Features and Finalization Subject to Constraints ........... 9
3.3. Design Flow ..................................................................................... 12

CHAPTER 4. RESULTS ANALYSIS AND VALIDATION ..................... 13
4.1. Implementation of Solution ............................................................. 13

CHAPTER 5. CONCLUSION AND FUTURE WORK ............................ 15
5.1. Conclusion ....................................................................................... 15
5.2. Future Work .........................................................................................


CHAPTER 1
INTRODUCTION

1.1 Introduction to Project


In today’s data-driven world, the ability to automatically extract useful information from web sources
has become critical. Web scraping is a technique for programmatically accessing and parsing
information from websites. Unlike web APIs, which provide structured access to data, scraping
directly reads the content as rendered in HTML.
Java, known for its robustness and vast ecosystem, offers several tools and libraries for scraping.
This project focuses on using Java to build a scraper capable of navigating to websites, extracting
relevant information (e.g., news headlines, product prices, stock data), and storing it in a structured
format like CSV or databases.
This project also deals with challenges such as handling JavaScript-rendered content, managing
delays, avoiding detection, and ensuring data accuracy. Java libraries like Jsoup (for parsing HTML),
Selenium WebDriver (for automation), and HttpClient (for making requests) are integrated in this
implementation.

Everything from product specifications to news headlines can be found on the internet, yet much of
it is dispersed across cluttered web pages. Our project, the Java Web Scraper, is a straightforward
desktop application that uses Java and the JSoup library to automatically extract specific data
from web pages.

Our scraper essentially performs the following tasks:

 Retrieves the HTML content from any website that is open to the public.
 Uses well-known CSS selectors (similar to those found in a website's stylesheet) to pinpoint the elements that matter to you, such as headlines and links.
 Formats the extracted information for ease of use and reading.

For developers, researchers, or students looking for a simple way to gather data without depending
on complex browser automation or Python-based tools, this application is ideal.
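As a minimal sketch of these tasks, the snippet below fetches a public page with Jsoup and prints heading and link elements selected by CSS selectors; the URL and the selectors are illustrative placeholders, not the project's literal values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HeadlineSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical target URL, used only for illustration.
        String url = "https://example.com";
        Document doc = Jsoup.connect(url).get();        // fetch and parse the HTML

        // CSS selectors pick out the elements of interest, e.g. headings and links.
        for (Element heading : doc.select("h1, h2")) {
            System.out.println("Headline: " + heading.text());
        }
        for (Element link : doc.select("a[href]")) {
            System.out.println("Link: " + link.attr("abs:href"));
        }
    }
}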

1.2 Identification of Problem


Manually gathering data from websites is a tedious and error-prone process, especially when large
volumes of data are required frequently. Traditional APIs are not always available or accessible for
every website. Some problems identified include:
 Lack of structured data access.
 Time-consuming nature of manual collection.
 Difficulty in extracting dynamic content rendered via JavaScript.
 Inconsistent HTML structures across different web sources.

Although there is a wealth of helpful information available online, it is rarely available in a
convenient, downloadable form. Many of the data scraping tools available today are built on Python,
which is not necessarily the best choice in a Java-focused environment. Additionally, browser-based
automation solutions can be overly complicated and slow. To overcome these obstacles, our project:
• Creates a lightweight scraper using only Java.
• Manages common problems such as badly formed HTML and network faults.
• Offers a user-friendly solution that is simple to use, works offline, and is easily customizable.

This project aims to overcome these problems by building a reusable Java-based web scraping
solution.

CHAPTER 2
BACKGROUND STUDY
2.1 Existing Solutions
Several programming languages and tools exist for web scraping. Among the most commonly used
are:
 Python: Known for libraries like BeautifulSoup, Scrapy, and Requests, Python is the most
popular language for web scraping due to its simplicity and vast community support.
 JavaScript (Node.js): Tools such as Puppeteer and Cheerio provide headless browsing and
DOM manipulation features.
 Java: Though less commonly used than Python for scraping, Java provides strong libraries
such as:
o Jsoup: For parsing HTML and traversing the DOM.
o Selenium WebDriver: For interacting with dynamic web pages rendered via
JavaScript.
o Apache HttpClient: For managing HTTP requests and sessions.
o HtmlUnit: A headless browser for Java-based automation.
Each tool has its own strengths and trade-offs. Java is particularly suitable for projects requiring
integration with enterprise-level systems, robust error handling, and multithreaded performance.

Although Selenium and other browser automation tools can simulate human browsing, they are
typically complex and resource-intensive.

Commercial solutions are also available, although they frequently involve charges or restrictions.

While each alternative has its benefits, many require a steady internet connection, are too
complicated for novices, or simply do not work well for offline or Java-based projects.

2.2 Problem Definition


The primary problem is that, despite the abundance of useful information available online, it is not
readily accessible. We require a tool that can:
• Download HTML from a website by connecting to it.
• Parse this HTML to extract only the relevant information, such as product names or headlines.
• Accomplish this effectively and consistently, regardless of a website's peculiarities.

Given the diverse nature of websites and the need for real-time data access, the problem can be
formally defined as:
Develop a modular and extensible web scraping application using Java that is capable of navigating
to websites, handling both static and dynamic content, extracting relevant data, and storing the data
in structured formats.
The solution should be:
 Scalable to multiple websites.
 Adaptable to changes in HTML structure.
 Capable of handling JavaScript-rendered content.
 Efficient in terms of speed and resource usage.

2.3 Goals/Objectives
The objective of this project is to fill the identified gap by developing a website data extractor with
the following key goals:
1. Design and Architecture: Define a modular and scalable design for the web scraper that allows
easy integration and future enhancements.
2. Tool and Library Selection: Evaluate and utilize the most suitable open-source Java libraries
such as Jsoup, Selenium WebDriver, and Apache HttpClient for effective web scraping.
3. Content Handling: Implement mechanisms to handle both static HTML content and dynamic
JavaScript-rendered content.
4. Data Extraction: Extract structured data from web pages using techniques such as CSS
selectors and XPath.
5. Data Storage: Store the scraped data in structured formats, including CSV files and relational
databases (a short CSV-export sketch follows this list).
6. Error Handling and Logging: Incorporate robust error detection, handling, and logging
mechanisms to ensure the stability of the scraper.
7. Performance Optimization: Introduce multi-threading and request optimization techniques
to enhance scraping speed and reduce resource usage.
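As a minimal sketch of the Data Storage objective above, the snippet below writes scraped records to a CSV file with OpenCSV; the file name, folder, and column values are hypothetical placeholders rather than the project's actual schema.

import com.opencsv.CSVWriter;
import java.io.FileWriter;
import java.nio.file.Files;
import java.nio.file.Paths;

public class CsvStorageSketch {
    public static void main(String[] args) throws Exception {
        Files.createDirectories(Paths.get("scraped_data"));   // make sure the output folder exists
        try (CSVWriter writer = new CSVWriter(new FileWriter("scraped_data/products.csv"))) {
            writer.writeNext(new String[] {"name", "price", "url"});                               // header row
            writer.writeNext(new String[] {"Sample Product", "19.99", "https://example.com/p/1"}); // one scraped record
        }
    }
}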

CHAPTER 3
DESIGN FLOW/PROCESS

3.1. Evaluation & Selection of Specifications/Features


To ensure the robustness of the scraper, a thorough evaluation of potential tools, libraries, and design
strategies was conducted before implementation began, in order to determine the most suitable
technologies. The goal was to select components that offered flexibility, ease of integration,
reliability, and robust support for both static and dynamic web content.
Selection Criteria:
 Compatibility with Java: All selected libraries had to be Java-based or provide seamless Java
integration.
 Support for HTML Parsing and DOM Traversal: HTML parsers needed to handle real-world
messy HTML structures.
 Support for Dynamic Content: Tools had to manage JavaScript-rendered pages and AJAX
requests.
 Ease of Data Export: Mechanisms for exporting scraped data to formats like CSV and
databases were a priority.
 Community and Documentation: Active support and proper documentation were essential for
future enhancements.
Selected Libraries and Features:
1. Jsoup
Chosen for parsing static HTML; it provides a fast and easy API for extracting and
manipulating data using CSS selectors.
2. Selenium WebDriver
Used to automate browser actions and extract data from dynamically rendered websites by
simulating real user interactions (a headless-browser sketch follows this list).
3. Apache HttpClient
Enabled handling of HTTP requests with session control, cookie management, and
customization of request headers.
4. OpenCSV
Provided a convenient way to write structured data to CSV files.
5. MySQL with JDBC
Integrated for persistent storage of scraped data in a relational format.
6. Log4j
Incorporated for detailed logging of execution flow, errors, and system behaviour.
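To illustrate how Selenium WebDriver can read JavaScript-rendered pages, here is a minimal sketch; it assumes a ChromeDriver binary is available on the system PATH, and the URL and CSS selector are placeholders chosen for illustration.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;

public class DynamicPageSketch {
    public static void main(String[] args) {
        ChromeOptions options = new ChromeOptions();
        options.addArguments("--headless");              // run without opening a browser window
        WebDriver driver = new ChromeDriver(options);
        try {
            driver.get("https://example.com");     // hypothetical JavaScript-heavy page
            // Elements rendered by JavaScript can be read once the page has loaded.
            for (WebElement item : driver.findElements(By.cssSelector("h2"))) {
                System.out.println(item.getText());
            }
        } finally {
            driver.quit();                               // always release the browser
        }
    }
}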

3.2. Analysis of Features and Finalization Subject to Constraints


The following features were finalized after considering real-world constraints, user interaction, and
operational efficiency:
1. Interactive Command-line Interface: The application begins with a welcome message and
prompts users to enter a target URL and select the type of content to scrape.
2. Support for Multiple Content Types: Users can choose to scrape:
o All paragraph text (<p> elements)
o Hyperlinks (<a> tags)
o Images (<img> tags)
o Entire HTML content of the page
3. Content Parsing using JSoup: Once the HTML is fetched, it is parsed using JSoup to extract
and format the selected content type.
4. Output Management: All scraped data is saved in an organized directory structure under
scraped_data/, with subfolders for different content types (e.g., images/). Unique filenames
are generated using timestamps and domain names.
5. URL Validation and Normalization: If users enter a URL without the HTTP prefix, the system
automatically prepends "https://" to prevent errors.
6. User-Agent Spoofing and Timeout Control: The scraper sets a custom user-agent and applies
a timeout (10 seconds) to avoid being blocked by servers or hanging indefinitely (a connection
sketch appears at the end of this section).
7. Image Downloading Capability: When scraping images, the system downloads each file,
infers its type, and stores it in a structured directory while logging the results.
8. Error Handling and Logging: Exception handling covers invalid input, connection failures,
parsing errors, and file handling issues. All exceptions are logged and reported back to the
user in real time.
Based on these factors, the final scraper design included the following core features:
 Support for both static and dynamic web pages.
 Custom user-agent headers and proxy support for anti-bot evasion.
 Modular HTML parsers that could be adapted for different site structures.
 Retry mechanisms and error logging to handle unexpected failures.
 Configurable scraping parameters through external property files (e.g., delay time, output
format, URLs).
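A minimal sketch of features 5 and 6 above follows; the prefix check and the exact user-agent string shown here are illustrative assumptions rather than the project's literal values.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ConnectionSketch {
    // Prepend https:// when the user omits the scheme (feature 5).
    static String normalize(String input) {
        String url = input.trim();
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
            url = "https://" + url;
        }
        return url;
    }

    public static void main(String[] args) throws Exception {
        String url = normalize("example.com");                              // hypothetical user input
        Document doc = Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; JavaWebScraper/1.0)")  // custom user-agent (feature 6)
                .timeout(10_000)                                            // 10-second timeout (feature 6)
                .get();
        System.out.println("Fetched: " + doc.title());
    }
}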
3.3. Design Flow
The flow of the application is divided into the following major steps:
1. Startup Initialization
 Display welcome message.
 Initialize necessary directories for output using Files.createDirectory().
2. Input Handling
 Prompt user to enter a URL or type "exit" to quit.
 Validate and normalize the URL by prepending https:// if missing.
3. Connection Setup
 Use JSoup's connect() method with a custom user-agent.
 Set a timeout of 10 seconds to avoid hanging.
4. HTML Document Fetching
 Use .get() to retrieve the web page content.
 Confirm the connection and display the page title.
5. User Menu for Scraping Options
 Display a menu with options to scrape:
o Paragraph text
o Hyperlinks
o Images
o Full HTML content
 Read the user's choice and parse it as an integer.
6. Content Extraction Module
 Use JSoup's select() method with appropriate tags:
o <p> for text
o <a href> for links
o <img src> for images
 Clean and structure the output using StringBuilder.
7. Data Output and File Writing
 Create filenames based on URL, content type, and timestamp.
 Use BufferedWriter or FileOutputStream to write content.
 Save images using Java InputStream and OutputStream.
8. Loop and Exit
 The loop continues for new URLs until the user types "exit".
 Close the Scanner and clean up resources on exit.
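A condensed sketch of steps 3 through 7 for the paragraph-text option is shown below; the target URL, user-agent string, and file-naming convention are illustrative assumptions rather than the project's exact code.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.io.BufferedWriter;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractionSketch {
    public static void main(String[] args) throws Exception {
        // Steps 3-4: connect with a custom user-agent and timeout, then fetch the document.
        Document doc = Jsoup.connect("https://example.com")     // hypothetical URL
                .userAgent("Mozilla/5.0")
                .timeout(10_000)
                .get();

        // Step 6: collect paragraph text into a StringBuilder.
        StringBuilder out = new StringBuilder();
        int lineNo = 1;
        for (Element p : doc.select("p")) {
            String text = p.text();
            if (!text.isEmpty()) {
                out.append(lineNo++).append(". ").append(text).append(System.lineSeparator());
            }
        }

        // Step 7: write the result to a timestamped file under scraped_data/.
        Files.createDirectories(Path.of("scraped_data"));
        Path file = Path.of("scraped_data", "paragraphs_" + System.currentTimeMillis() + ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(file)) {
            writer.write(out.toString());
        }
        System.out.println("Saved " + (lineNo - 1) + " paragraphs to " + file);
    }
}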

HIGH LEVEL FLOW
[Figure: high-level flow diagram]

DATA FLOW DIAGRAM
[Figure: data flow diagram]

CHAPTER 4
RESULTS ANALYSIS AND VALIDATION

4.1 Implementation Overview


The Java Web Scraper was implemented as a console-based tool that allows users to enter a URL
and select the type of content they wish to extract. Based entirely on Java Standard Edition, the
program requires no graphical user interface or external servers. The simplicity of the code
architecture allows for fast execution and ease of customization.
The key features tested include:
 Real-time interaction via command-line prompts
 Scraping of different HTML elements
 Downloading and saving of external image files
 Consistent file generation and storage structure
 Error reporting and handling
4.2 Testing Methodology
The scraper was tested in a controlled environment using websites from various domains such as
news portals, blogs, e-commerce platforms, and static HTML pages. These test cases were chosen to
validate the system’s behavior across websites with differing structures, content densities, and
response formats.
Test Environment
 Operating System: Windows 10 / Ubuntu 20.04
 JDK Version: OpenJDK 11
 Internet Speed: 30 Mbps average
 IDE Used: IntelliJ IDEA / Eclipse
 Test Duration: 3 weeks
Types of Tests Conducted
1. Functional Testing:
Verified that each scraping option (text, links, images, full HTML) performs the intended
action and generates output files as expected.
2. Usability Testing:
Ensured that the menu-driven interface responds appropriately to valid and invalid user
inputs.
3. Boundary Testing:
Checked scraper behavior for edge cases such as empty URLs, unsupported domains, and
malformed HTML content.
4. Exception Handling Testing:
Evaluated how the application responds to scenarios like:
o Network disconnection
o Invalid menu selections
o Malformed URLs
o I/O exceptions during file writing
5. Performance Testing:
Measured execution times for each scraping task based on website complexity and size.
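For reference, per-task execution time can be captured with a simple wrapper such as the one below; the page fetched here is a placeholder, and this is only one possible way to obtain figures like those reported in the next section.

import org.jsoup.Jsoup;

public class TimingSketch {
    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // Hypothetical scraping task: fetch a page and read its title.
        String title = Jsoup.connect("https://example.com").timeout(10_000).get().title();
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Fetched \"" + title + "\" in " + elapsedMs + " ms");
    }
}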

4.3 Test Results and Observations


The results of the tests are summarized in the following subsections.

A. Functional Accuracy

Scraping Type  | Expected Outcome                      | Actual Outcome                         | Status
Paragraph Text | All <p> tags extracted and saved      | Successfully extracted and formatted   | ✅ Passed
Hyperlinks     | All <a> tags with href captured       | Links saved with text + URL            | ✅ Passed
Images         | All <img> tags downloaded and logged  | Most images saved, minor skips         | ✅ Passed
Full HTML      | Complete page HTML stored in .txt     | Verified 1:1 replica of original HTML  | ✅ Passed

B. Runtime Performance

Website Type     | Content Size    | Avg. Execution Time | Scraping Mode
News Portal      | Medium          | ~2.1 sec            | Text & Links
E-commerce Site  | Large (images)  | ~6.3 sec            | Images
Blog Page        | Small           | ~1.7 sec            | Text
Static HTML File | Very Small      | ~0.5 sec            | HTML export

4.4 Output Validation


To validate the accuracy and consistency of the output files, manual inspection and automated
verification scripts were used. Here are the highlights:
1. Text Content Validation:
o All text files preserved paragraph order.
o Non-empty lines were numbered and formatted for readability.
2. Link Validation (see the sketch after this list):
o Absolute URLs were generated correctly using JSoup's abs:href.
o Output included both the hyperlink text and the actual URL.
3. Image File Validation:
o Files were saved with accurate extensions and valid file sizes.
o Logs correctly marked download success or failure.
4. HTML Export Validation:
o Files opened in browsers rendered identical to the original live page.
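The abs:href behaviour referenced in point 2 can be reproduced with a small check like the one below; the page URL is a placeholder.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkValidationSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://example.com").timeout(10_000).get();
        for (Element link : doc.select("a[href]")) {
            // attr("href") may be relative; attr("abs:href") resolves it against the page's base URL.
            System.out.println(link.text() + " -> " + link.attr("abs:href"));
        }
    }
}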

CHAPTER 5
CONCLUSION AND FUTURE WORK
5.1. Conclusion
This project successfully demonstrated the development of a modular, user-driven web scraping tool
using Java. By leveraging the JSoup library and native Java functionalities, the program allows users
to interactively extract textual, visual, and structural content from live websites.
The system was designed with clarity, simplicity, and scalability in mind. Its command-line interface
makes it accessible to technical users, while its modular codebase allows for easy maintenance and
future upgrades.
Moreover, the scraper promotes ethical use by honoring site structures and avoiding excessive
requests, making it suitable for both educational and analytical applications.
The Web Scraper Using Java project effectively illustrates how the JSoup library and basic Java
programming can be used to create a robust yet user-friendly tool for extracting data from web
pages.

Practical Value: The program provides academics and developers with real-world applications by
automating the extraction of valuable data from a range of websites.
Educational Merit: Network programming, exception management, and modular design concepts
are reinforced by the project.
Effective teamwork and project management were demonstrated by the substantial contributions
made by each member, ranging from architecture and coding to testing and documentation.

5.2. Future Work


While the current implementation meets the project goals, there are several areas for improvement
and expansion:
1. Support for JavaScript-rendered Websites: The current scraper handles static HTML
only. Future versions can integrate Selenium WebDriver or similar headless browsers to
extract content loaded dynamically through JavaScript and AJAX. This would significantly
extend the range of websites that can be effectively scraped.
2. Graphical User Interface (GUI): A GUI built with JavaFX or Swing can be developed to
provide an intuitive interface. This would allow non-technical users to input URLs, choose
scraping options, and export data without using the command line.
3. CAPTCHA Handling Mechanism: Many websites use CAPTCHA to block automated
bots. Future development could include integration with third-party CAPTCHA-solving
services or OCR libraries to bypass such verification systems where ethically permissible.
4. Multi-threaded Scraping for Large-scale Tasks: Implementing concurrent scraping
through multi-threading will reduce execution time and increase efficiency (a brief sketch
follows this list). This is particularly useful when scraping large numbers of pages or sites
that serve high-volume data.
5. Advanced Output Formats and Data Visualization: In addition to CSV and text files,
future versions can support data export to JSON, XML, or databases. Basic data
visualization tools can also be added to help users analyse the scraped data quickly and
effectively.
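As a rough sketch of the multi-threaded scraping idea in point 4, multiple pages could be fetched concurrently with an ExecutorService; the URLs and thread count below are illustrative assumptions.

import org.jsoup.Jsoup;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentScrapeSketch {
    public static void main(String[] args) throws Exception {
        List<String> urls = List.of("https://example.com", "https://example.org");  // hypothetical targets
        ExecutorService pool = Executors.newFixedThreadPool(4);             // 4 worker threads
        for (String url : urls) {
            pool.submit(() -> {
                try {
                    String title = Jsoup.connect(url).timeout(10_000).get().title();
                    System.out.println(url + " -> " + title);
                } catch (Exception e) {
                    System.err.println("Failed to scrape " + url + ": " + e.getMessage());
                }
            });
        }
        pool.shutdown();                                                    // stop accepting new tasks
        pool.awaitTermination(1, TimeUnit.MINUTES);                         // wait for running tasks to finish
    }
}

Any such concurrency would still need to respect per-site request limits, in line with the ethical-use considerations noted in the conclusion.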
