WEB STRUCTURE MINING
Definition of Web Structure Mining:
1. Web Structure Mining is one of the three types of techniques in Web
Mining, alongside Web Content Mining and Web Usage Mining.
2. Web Structure Mining is the technique of discovering structure
information from the web.
3. It uses graph theory to analyze nodes and connections in the structure
of a website.
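As a rough illustration of the graph view in point 3, the sketch below treats pages as nodes and hyperlinks as directed edges, then counts incoming and outgoing links per page. The page names and the `build_graph` helper are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical site: each pair is (source page, target page) for one hyperlink.
links = [
    ("home", "about"),
    ("home", "products"),
    ("products", "home"),
    ("about", "products"),
]

def build_graph(edges):
    """Return out-link and in-link adjacency sets for a directed link graph."""
    out_links = defaultdict(set)
    in_links = defaultdict(set)
    for src, dst in edges:
        out_links[src].add(dst)
        in_links[dst].add(src)
    return out_links, in_links

out_links, in_links = build_graph(links)
pages = sorted(set(out_links) | set(in_links))

# (in-degree, out-degree) per page: the simplest structural measures.
degrees = {page: (len(in_links[page]), len(out_links[page])) for page in pages}
```

In this toy graph "products" has the highest in-degree, which is the kind of signal the link-based algorithms later in these notes build on.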
Depending upon the type of Web Structural data, Web Structure Mining can
be categorised into two types:
1. Extracting patterns from the hyperlinks in the Web
2. Mining the document structure
1. Extracting patterns from the hyperlinks in the Web:
1. A hyperlink is a structural component that connects a web
page to a different location, either within the same page or on another page.
2. The Web operates using hyperlinks and Hypertext Transfer
Protocol (HTTP).
3. Any page can create a hyperlink to any other page, and that page can
in turn link to other pages.
4. This interconnected nature supports unique network analysis
algorithms.
5. The link structure of web pages is examined to analyze the hyperlink
patterns among them.
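Before hyperlink patterns can be analyzed, the links have to be pulled out of each page. The following is a minimal sketch of that step using Python's standard `html.parser` module. The `LinkExtractor` class, the `extract_links` helper, and the example URLs are assumptions made for illustration, not part of any particular mining tool.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collect the absolute targets of <a href="..."> tags in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))

def extract_links(base_url, html):
    """Return the list of link targets found in the given HTML text."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links

# Hypothetical page used only for illustration.
sample_html = '<p><a href="/about">About</a> <a href="https://example.org/">Ext</a></p>'
page_url = "https://example.com/"

# Each (source, target) pair is one edge of the Web's link graph.
edges = [(page_url, target) for target in extract_links(page_url, sample_html)]
```

Repeating this for every crawled page yields the edge list on which link-analysis algorithms operate.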
2. Mining the document structure:
This is the analysis of the tree-like structure of a web page to describe how
HTML or XML tags are used within it.
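One simple way to look at document structure is to count how often each tag occurs and how deeply tags are nested. The sketch below does this with the standard `html.parser` module. The `StructureProfiler` class and the sample page are hypothetical, and the depth tracking assumes reasonably well-formed markup.

```python
from collections import Counter
from html.parser import HTMLParser

# Void elements have no closing tag, so they must not deepen the tree.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

class StructureProfiler(HTMLParser):
    """Count tag usage and track the maximum nesting depth of a page."""

    def __init__(self):
        super().__init__()
        self.tag_counts = Counter()
        self.depth = 0
        self.max_depth = 0

    def handle_starttag(self, tag, attrs):
        self.tag_counts[tag] += 1
        if tag not in VOID_TAGS:
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.depth > 0:
            self.depth -= 1

def profile(html):
    """Return (tag usage counts, maximum nesting depth) for an HTML string."""
    profiler = StructureProfiler()
    profiler.feed(html)
    return profiler.tag_counts, profiler.max_depth

# Hypothetical page used only for illustration.
sample_html = "<html><body><h1>Title</h1><p>Text<br>more</p><p>End</p></body></html>"
tag_counts, max_depth = profile(sample_html)
```

For the sample page this reports two `p` tags, one `br` tag, and a maximum nesting depth of 3 (html, body, then h1 or p).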
Example of Web Structure Mining:
One example of a web structure mining technique is the PageRank
algorithm, used by Google Search to rank websites in its search
engine results.
A page's rank is decided by the number and quality of the links pointing
to it (the target node).
There are two main approaches to Web Structure Mining, also described as the
two basic strategic models for successful websites:
1] PageRank
2] Hubs and Authorities
1] PAGE RANK
PageRank (PR) is an algorithm used by Google Search to rank websites in
their search engine results.
PageRank was named after Larry Page, one of the founders of Google.
PageRank is a way of measuring the importance of website pages.
ACCORDING TO GOOGLE :
“PageRank works by counting the number and quality of links to a page to
determine a rough estimate of how important the website is.”
“The underlying assumption is that more important websites are likely to
receive more links from other websites.”
PageRank is not the only algorithm used by Google to order search engine
results, but it was the first algorithm used by the company, and it is the
best known.
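The following is a simplified version of PageRank:
Let u and v be Web pages.
Let B_u be the set of pages that point to u.
Let N_v be the number of links going out from v.
Let c < 1 be a factor used for normalization.
A simple ranking R, which is a simplified form of PageRank, can then be
defined as:
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
Each page v divides its rank evenly among its N_v outgoing links, and the rank
of u is the normalized sum of the shares it receives.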
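The simplified ranking can be computed by repeated substitution. The sketch below assumes the usual form of the simplified ranking, in which the rank of u is c times the sum of R(v) / N_v over all pages v that point to u. The function name `simple_pagerank`, the four-page graph, and the choice to realize c by rescaling the ranks to sum to 1 after each pass are assumptions for illustration. This is not Google's production algorithm and it omits the damping factor used in full PageRank.

```python
def simple_pagerank(out_links, iterations=50):
    """Iterate R(u) = c * sum(R(v) / N_v for v in B_u) on a small link graph.

    out_links maps each page to the list of pages it links to. The factor c
    is not fixed in advance: ranks are rescaled to sum to 1 after every pass,
    which is one way (an assumption here) to realize the normalization.
    """
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    rank = {page: 1.0 / len(pages) for page in pages}

    for _ in range(iterations):
        new_rank = {page: 0.0 for page in pages}
        for v, targets in out_links.items():
            if not targets:
                continue  # a page with no out-links passes on no rank
            share = rank[v] / len(targets)  # R(v) / N_v
            for u in targets:
                new_rank[u] += share
        total = sum(new_rank.values())
        if total == 0:
            break  # no links at all: keep the previous ranks
        rank = {page: value / total for page, value in new_rank.items()}
    return rank

# Hypothetical four-page web used only for illustration.
web = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
ranks = simple_pagerank(web)
```

In this toy graph, A and C end up with the highest (approximately equal) ranks, B gets about half as much, and D, which no page links to, ends at zero.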
2] HUBS AND AUTHORITIES
Hubs: Pages with many useful links, serving as access points for
diverse information. A good hub points to many good authorities.
Authorities: Pages with accurate, comprehensive information,
attracting many inbound links and user trust. A good authority is pointed
to by many good hubs.
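Hub and authority scores reinforce each other, so they are usually computed together by iteration. The sketch below follows the general idea of the HITS procedure: a page's authority score is the sum of the hub scores of pages linking to it, and its hub score is the sum of the authority scores of pages it links to. The `hits` function, the page names, and the fixed iteration count are assumptions for illustration.

```python
import math

def hits(out_links, iterations=50):
    """Return (hub, authority) score dictionaries for a small link graph."""
    pages = set(out_links)
    for targets in out_links.values():
        pages.update(targets)
    hub = {page: 1.0 for page in pages}
    auth = {page: 1.0 for page in pages}

    for _ in range(iterations):
        # Authority update: add up the hub scores of the pages linking in.
        auth = {page: 0.0 for page in pages}
        for src, targets in out_links.items():
            for dst in targets:
                auth[dst] += hub[src]
        # Hub update: add up the authority scores of the pages linked to.
        hub = {page: sum(auth[dst] for dst in out_links.get(page, ()))
               for page in pages}
        # Rescale both score vectors to unit length so values stay bounded.
        auth_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        hub_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {page: v / auth_norm for page, v in auth.items()}
        hub = {page: v / hub_norm for page, v in hub.items()}
    return hub, auth

# Hypothetical graph: "portal" and "blog" link out, "docs" and "news" are linked to.
web = {
    "portal": ["news", "docs"],
    "blog": ["docs"],
}
hub, auth = hits(web)
best_hub = max(hub, key=hub.get)
best_authority = max(auth, key=auth.get)
```

Here "portal" comes out as the strongest hub because it links to both content pages, and "docs" as the strongest authority because both hubs point to it.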
Applications of Web Structure Mining:
1. Finding relevant information in search engines.
2. Determining the relevance of each web page.
3. Information retrieval in social networks.
4. Measuring the completeness of websites.