Web Mining


Deeper Dive into Purpose-Built Search: A Bullet Point Journey
Core Concept

Tailored information retrieval systems designed for specific domains or user needs, offering superior relevance and efficiency compared to general-purpose search.
Key Benefits:

 Domain Expertise: Deep understanding of language, data structures, and search intent within a specific domain.
 Targeted Functionalities: Specialized features and operators tailored to the domain (e.g., legal citation search, product filtering).
 Streamlined Efficiency: Faster and more accurate results, saving time and effort.
Diverse Applications:

 E-commerce: Advanced product comparisons based on specific criteria.
 Legal Research: Efficient navigation of databases with specialized search operators.
 Enterprise Search: Role-specific search for internal documents and resources.
 Media & Entertainment: Granular search by genre, cast, release date, etc.
 Scientific Exploration: Domain-specific ranking algorithms for relevant research papers.
 Healthcare: Search medical databases based on symptoms, diagnoses, and medications.
 Education: Curated search experiences for students and educators across disciplines.
Technical Underpinnings:

 Advanced Indexing & Processing: Algorithms optimize data for specific domain searches.
 Specialized Query Understanding: Intent analysis tailored to the domain vocabulary and patterns.
 Domain-Specific Ranking: Prioritizes results based on relevance and search context within the domain (see the sketch below).
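
To make the ranking point concrete, here is a minimal Python sketch of a domain-specific re-ranker for scientific search. The field names (text_match, citations, year) and the weights are illustrative assumptions, not a description of any particular engine.

    # Minimal sketch: re-rank scientific search results so recent, highly cited
    # papers with strong text matches rise to the top. Fields/weights are illustrative.
    def rank_papers(results):
        def score(doc):
            recency = max(0, doc["year"] - 2000) / 25       # favour newer publications
            citations = min(doc["citations"], 1000) / 1000  # cap citation influence
            return 0.6 * doc["text_match"] + 0.25 * citations + 0.15 * recency
        return sorted(results, key=score, reverse=True)

    papers = [
        {"title": "Classic 1998 paper", "text_match": 0.90, "citations": 900, "year": 1998},
        {"title": "Recent 2023 survey", "text_match": 0.70, "citations": 150, "year": 2023},
    ]
    for paper in rank_papers(papers):
        print(paper["title"])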
Emerging Trends:

 AI-Powered Insights: Extracting deeper connections and patterns from search results.
 Cross-Domain Integration: Seamlessly search across specialized tools for broader exploration.
 Personalization & Adaptability: Intuitive interfaces learning from user habits and preferences.
Future Implications:

 Democratization of information access across various domains.
 Increased productivity and efficiency in knowledge-driven tasks.
 Personalized learning experiences and deeper understanding of complex topics.
Controlled Queries vs. Uncontrolled Queries in Web Mining
Concept

 Controlled queries: Formulated by the researcher with specific goals and requirements, often tailored to a particular domain or dataset. They leverage structured query languages (e.g., SQL, XPath) or web APIs to precisely retrieve relevant data (see the sketch after this list).
 Uncontrolled queries: Submitted by users (e.g., search keywords, reviews, forum posts) with varying levels of clarity, structure, and intent. They represent spontaneous information needs in diverse formats and require parsing, understanding, and interpretation.
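
A minimal Python sketch of the two query styles, using only the standard library; the hotel table, columns, and sample rows are illustrative assumptions.

    import sqlite3

    # Controlled query: precise retrieval against a known schema
    # (an in-memory table with illustrative columns keeps the sketch self-contained).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hotels (name TEXT, city TEXT, price REAL)")
    conn.executemany(
        "INSERT INTO hotels VALUES (?, ?, ?)",
        [("Airport Inn", "Lahore", 90.0), ("Grand Palace", "Lahore", 240.0)],
    )
    rows = conn.execute(
        "SELECT name, price FROM hotels WHERE city = ? AND price < ?",
        ("Lahore", 150),
    ).fetchall()
    print(rows)  # only the hotel matching the structured criteria

    # Uncontrolled query: free-form user text that must be parsed and interpreted
    # before it can drive any retrieval (here, a naive keyword split).
    user_query = "cheap hotel near the airport with free breakfast"
    keywords = [w for w in user_query.lower().split() if w not in {"the", "with", "near"}]
    print(keywords)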
Key Differences:
Relation to Web Mining:

 Controlled queries:
 Used to access well-organized data repositories (e.g., databases, websites with clean APIs)
 Support targeted extraction of specific data points for analysis or modeling
 Examples: Crawling product prices from e-commerce sites, extracting scientific literature through APIs
 Uncontrolled queries:
 Often require pre-processing, text analysis, and natural language processing (NLP) techniques
 Present challenges due to noise, subjectivity, and ambiguity
 Used for broader exploration, sentiment analysis, topic modeling, and understanding user behavior
 Examples: Analyzing customer reviews, mining social media trends, exploring unstructured knowledge bases (see the sentiment sketch after this list)
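
For the uncontrolled side, a minimal sentiment-analysis sketch over customer reviews using NLTK's VADER scorer; the reviews are made up, and the sketch assumes nltk is installed and can download the vader_lexicon resource.

    import nltk
    from nltk.sentiment import SentimentIntensityAnalyzer

    nltk.download("vader_lexicon", quiet=True)  # one-off lexicon download
    sia = SentimentIntensityAnalyzer()

    reviews = [
        "The room was spotless and the staff were wonderful.",
        "Terrible wifi and the breakfast was cold.",
    ]
    for review in reviews:
        # compound score in [-1, 1]; noise and subjectivity make this approximate
        print(f"{sia.polarity_scores(review)['compound']:+.2f}  {review}")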
Considerations:

 Choice between controlled and uncontrolled queries depends on research objectives, data availability, and resource constraints.
 Both approaches can be valuable, and often they are combined for comprehensive web mining.
 Uncontrolled queries offer broader insights but necessitate deeper understanding and careful processing.
Web Mining Examples:

 Travel website data:
 Controlled queries could be used to extract hotel listings based on specific criteria (location, price, amenities).
 Uncontrolled queries could analyze visitor reviews to understand sentiment and identify areas for improvement.
 News analysis:
 Controlled queries could retrieve articles on specific topics from credible sources.
 Uncontrolled queries could explore broader social media discussions to uncover emerging trends and public opinion.
Future Directions:

 Integration of semantic web technologies and advanced NLP techniques to better understand unstructured data.
 Development of adaptive mining methods that can dynamically switch between controlled and uncontrolled queries based on context and needs.
 Enhanced use of explainable AI (XAI) to make query interpretation and analysis more transparent.
Understanding Word Embedding and Word2Vec for Efficient Language Processing

https://www.youtube.com/watch?v=viZrOnJclY0

 Word embeddings and the Word2Vec model can be used to assign numerical representations to words based on their context, allowing for more efficient processing of language and understanding of word similarities.

 Key insights
• Word embeddings allow similar words to have similar numbers, making it easier to analyze and understand text data.
• Words with similar meanings and usage should be assigned similar numbers in word embedding to help neural networks learn more efficiently.
• Backpropagation is used to optimize the random initial values of the weights in a neural network, enabling the network to make accurate predictions.
• The word embedding model uses input words to predict the next word in a phrase, assigning higher values to the desired output word.
• Optimizing the weights of word embeddings can potentially improve the performance of natural language processing models by capturing semantic relationships between words.
• Using word embeddings can optimize the weights in a neural network, allowing it to learn how similar words are used and improve language processing.
• Word2Vec efficiently creates word embeddings by selectively optimizing weights for specific outputs, allowing multiple embedding values per word even for a large vocabulary (see the gensim sketch below).
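
A minimal gensim sketch of the ideas above, trained on a toy corpus; the sentences and hyperparameters are illustrative. The sg and negative parameters correspond to the skip-gram/continuous bag-of-words and negative-sampling strategies discussed in the Q&A that follows.

    from gensim.models import Word2Vec

    sentences = [
        ["the", "king", "rules", "the", "kingdom"],
        ["the", "queen", "rules", "the", "kingdom"],
        ["the", "cat", "sat", "on", "the", "mat"],
        ["the", "dog", "sat", "on", "the", "rug"],
    ]

    model = Word2Vec(
        sentences,
        vector_size=50,  # how many numbers represent each word
        window=2,        # context words considered on each side
        min_count=1,     # keep every word in this tiny corpus
        sg=1,            # 1 = skip-gram, 0 = continuous bag-of-words
        negative=5,      # negative sampling: update only a few output weights per step
        epochs=100,
    )

    print(model.wv["king"][:5])                  # part of the vector learned for "king"
    print(model.wv.similarity("king", "queen"))  # similar usage -> similar numbers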
Q&A

 What are word embeddings and Word2Vec?
 —Word embeddings and Word2Vec are methods used to convert words into numerical representations based on their context, making it easier to process language and understand word similarities in machine learning.
 How does a neural network determine word associations?
 —A simple neural network can determine the association between words and numbers based on their context in phrases, allowing for the prediction of the next word in a phrase.
 Why is training a neural network important for word embeddings?
 —Training a neural network is important for correctly predicting the next word in a phrase and adjusting word embeddings to make similar words more similar to each other based on their context.
 What strategies does Word2Vec use to increase context in word embeddings?
 —Word2Vec uses two strategies, continuous bag-of-words and skip-gram, to increase context in word embeddings by predicting surrounding words based on the middle word and vice versa.
 How does Word2Vec optimize training for word embeddings?
 —Word2Vec speeds up training by using negative sampling to optimize only for the words we want to predict, efficiently creating word embeddings by selecting a few words to predict and optimizing only a fraction of the total weights in the neural network.
Timestamped Summary


 00:00 Word embeddings and Word2Vec convert words into numbers, allowing similar words to have similar numerical representations for easier use in machine learning algorithms.
 02:38 Similar words should have similar numbers to help a neural network learn and apply knowledge, and a simple neural network can determine word-number associations based on context.
 04:54 We create a neural network with inputs for each unique word, connect them to activation functions, and optimize the weights through backpropagation to associate numbers with each word.
 06:20 Using word embeddings and the Word2Vec model, we can predict the next word in a phrase by training a neural network to assign values to input words, connect them to activation functions with weights, and run the outputs through the softmax function for classification (a minimal sketch follows this summary).
 08:18 Word embeddings are adjusted through backpropagation to


make words that appear in the same context more similar to each
other, and the neural network accurately predicts the next word
based on input.
 10:37 Training a neural network with Word2Vec can help process
language and understand how similar words are used by assigning
numbers to words based on their context.
 12:31 Word2Vec uses multiple activation functions and a large
vocabulary to efficiently create word embeddings by optimizing only
a fraction of the total weights in the neural network.
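
A minimal numpy sketch of the network described in the summary: one input per vocabulary word, a tiny embedding layer, a softmax over the vocabulary to predict the next word, and hand-written backpropagation. The toy corpus, embedding size, and learning rate are illustrative assumptions.

    import numpy as np

    corpus = ["embeddings", "make", "learning", "easier"]  # toy phrase
    vocab = sorted(set(corpus))
    idx = {w: i for i, w in enumerate(vocab)}
    V, D = len(vocab), 2                          # vocabulary size, embedding size

    rng = np.random.default_rng(0)
    W_in = rng.normal(scale=0.1, size=(V, D))     # word embeddings (input weights)
    W_out = rng.normal(scale=0.1, size=(D, V))    # output weights before the softmax

    pairs = [(idx[corpus[i]], idx[corpus[i + 1]]) for i in range(len(corpus) - 1)]

    for _ in range(500):                          # backpropagation loop
        for x, y in pairs:
            h = W_in[x]                           # look up the input word's embedding
            logits = h @ W_out
            p = np.exp(logits - logits.max())
            p /= p.sum()                          # softmax over the vocabulary
            grad = p.copy()
            grad[y] -= 1.0                        # cross-entropy gradient at the output
            grad_h = W_out @ grad
            W_out -= 0.1 * np.outer(h, grad)
            W_in[x] -= 0.1 * grad_h               # the embedding itself is adjusted too

    x = idx["make"]
    p = np.exp(W_in[x] @ W_out)
    print(vocab[int(p.argmax())])                 # most likely next word after "make"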
GOOGLE BERT

 https://jalammar.github.io/illustrated-bert/
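
As a quick, hedged sketch (not taken from the linked post), contextual embeddings from a pre-trained BERT checkpoint can be obtained with the Hugging Face transformers library; the example sentence is arbitrary.

    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    inputs = tokenizer("Purpose-built search relies on good embeddings.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # One 768-dimensional contextual vector per token: (batch, tokens, hidden_size)
    print(outputs.last_hidden_state.shape)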
How to download pre-trained models and corpora

 https://radimrehurek.com/gensim/auto_examples/howtos/run_downloader_api.html
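
A minimal sketch of the gensim downloader API documented at the link above; "glove-wiki-gigaword-50" is one of the pre-trained vector sets it serves, and the first call downloads it.

    import gensim.downloader as api

    print(list(api.info()["models"])[:5])         # a few of the downloadable pre-trained models

    vectors = api.load("glove-wiki-gigaword-50")  # downloads on first use, then loads
    print(vectors.most_similar("computer", topn=3))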
Pre-trained corpus

 A pre-trained corpus is a massive collection of text data that has already been used to train a language model. Think of it like a vast library of books that a language model has already read and learned from. This "reading" process lets the model understand the nuances of language, like how words are used together, sentence structure, and different writing styles.
What's in it?

 A pre-trained corpus can contain diverse sources like books, articles, code, websites, and even social media conversations.
 The size can vary, with some corpora containing billions of words!
Why is it used?

 Training a language model from scratch requires immense computing power and data.
 Pre-trained corpora save time and resources by providing a foundation of knowledge.
 The model can then be fine-tuned on specific tasks like summarizing text, translating languages, or writing different kinds of creative content.
Benefits:

 Faster training of language models.
 Improved performance on various NLP tasks.
 Adaptability to diverse domains by fine-tuning.
Examples:

 Well-known pre-trained corpora include Wikipedia, BookCorpus, and Common Crawl.
 Specialized corpora exist for legal documents, medical texts, or scientific papers.
