APIS
sequential model api
- lets you create models layer by layer; suits most problems
functional model api
- alternative way of creating models with more complexity (multiple inputs/outputs, branching); see the sketch below
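A minimal sketch contrasting the two Keras APIs (the 10-feature input and 3-class output are just placeholders):

```python
import tensorflow as tf

# Sequential API: a simple linear stack of layers.
seq_model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(10,)),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Functional API: layers are called on tensors, allowing branches and multiple inputs/outputs.
inputs = tf.keras.Input(shape=(10,))
x = tf.keras.layers.Dense(32, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(3, activation="softmax")(x)
func_model = tf.keras.Model(inputs=inputs, outputs=outputs)

seq_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
func_model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```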
AutoSxS
- evaluation tool that facilitates A/B testing for LLMs
- core component is the autorater
model garden
- quick and easy way to find and apply the right model
- foundation models (pretrained multitask models that can be fine-tuned using Vertex AI)
- task-specific models (pretrained to solve specific problems)
- fine-tunable models (open source models that can be fine-tuned using a custom notebook or pipeline)
Feature Store (feature repository)
- fully managed solution
- batch and streaming feature ingestion
- share and reuse ML features across use cases
- serve ML features at scale with low latency (offloads the operational overhead of handling infra)
- manages and scales the underlying infrastructure for you, such as storage and compute resources
- alleviates training-serving skew
Data Catalog
- catalogs native metadata on data assets
Dataprep
- can handle unstructured/structured datasets
- built on top of Dataflow
- autoscalable
- a flow is a sequence of recipes
- recipes are preprocessing steps from a library called Wranglers
- combine flows and their recipes to create your Dataflow pipeline
Dataplex
- enables organizations to centrally manage, monitor and govern their data across data lakes, data warehouses and data marts with consistent controls, thus providing access to trusted data and empowering analytics at scale
BigQuery
- serverless, fully managed data warehouse (not a NoSQL store)
- supports SQL syntax
- storage at scale with reduced latency
- efficient processing and seamless integration with other Google Cloud services
- ideal for analytics (visualisation?)
- needs to be at the end of the pipeline to store and analyse results (?)
- z-score normalization is easy
- minimises computational overhead
- can import TF models via BigQuery ML (see the sketch below)
- good for analytics and dashboards
- no hyperparameter tuning
- not end to end; needs other tools like Vertex AI (to serve models)
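A hedged sketch of importing an exported TensorFlow SavedModel into BigQuery ML and scoring it with SQL; the project, dataset, table, and bucket names are placeholders, not from the notes:

```python
from google.cloud import bigquery

client = bigquery.Client(project="my-project")  # placeholder project

# Register the exported SavedModel from Cloud Storage as a BigQuery ML model.
client.query("""
    CREATE OR REPLACE MODEL `my_dataset.imported_tf_model`
    OPTIONS (MODEL_TYPE = 'TENSORFLOW',
             MODEL_PATH = 'gs://my-bucket/exported_model/*')
""").result()

# Batch prediction directly over a table with ML.PREDICT.
rows = client.query("""
    SELECT *
    FROM ML.PREDICT(MODEL `my_dataset.imported_tf_model`,
                    (SELECT * FROM `my_dataset.input_table`))
""").result()
```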
Vertex AI
- model creation
- provides flexibility and scalability
- training and development
- lower infrastructure overhead
- don't need to refactor code much
- can train TensorFlow Estimator code
- distributed training (automatically handles distributing training jobs across many machines)
- automatic scaling of resources, saving costs compared to plain VMs
- helps solve
1.
Vertex AI custom containers
- use ML frameworks or non-ML dependencies that are not supported by the prebuilt Vertex containers (see the sketch below)
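A hedged sketch of submitting a custom-container training job with the Vertex AI Python SDK; project, region, image URI, and machine settings are assumed placeholders:

```python
from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1",
                staging_bucket="gs://my-staging-bucket")

# The container image bundles the training code plus any custom/non-ML dependencies.
job = aiplatform.CustomContainerTrainingJob(
    display_name="custom-train-job",
    container_uri="us-docker.pkg.dev/my-project/my-repo/trainer:latest",
)

# replica_count > 1 lets Vertex AI distribute the job across machines for you.
job.run(
    replica_count=1,
    machine_type="n1-standard-8",
    args=["--epochs", "10"],
)
```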
Vertex AI pipelines
- run modular containerised AI pipeline steps
Vertex AI model monitoring
- fully managed monitoring with minimal maintenance
Vertex AI Tensorboard
- compact and complete overview of training metrics over time
Dataflow
- data transformation and processing
- unified stream and batch data processing that's serverless, fast, and cost-effective
- useful for evaluating a model on a large dataset
- uses Apache Beam (see the sketch below)
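A minimal Apache Beam sketch; the same pipeline runs serverless on Dataflow by passing DataflowRunner options (the paths and parsing logic are placeholders):

```python
import apache_beam as beam

def parse_and_score(line: str):
    # Placeholder transform: parse a CSV row and compute a per-record value.
    fields = line.split(",")
    yield {"id": fields[0], "score": float(fields[1]) * 2.0}

# Add options=PipelineOptions(runner="DataflowRunner", project=..., region=...) to run on Dataflow.
with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read" >> beam.io.ReadFromText("gs://my-bucket/input/*.csv")
        | "Transform" >> beam.FlatMap(parse_and_score)
        | "Write" >> beam.io.WriteToText("gs://my-bucket/output/results")
    )
```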
tabular workflow
- sequential attention
- integrated, managed, scalable pipelines
- end to end ML with tabular data for regression and classification
automl
- AutoML Tables requires no code
- handles training/validation/test splits automatically when you specify a time column
automl nlp
- use when you have to build custom models for NLP
automl tables
- automates building of ML models from tabular data
cloud filestore
- faster data access than Cloud Storage (managed NFS file shares)
Kubeflow
- used to form end-to-end ML architectures
pipelines sdk
- best practice for orchestrating AI pipelines with modular steps (see the sketch below)
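A hedged sketch of modular pipeline steps with the Kubeflow Pipelines (KFP v2) SDK, compiled so it could be submitted to Vertex AI Pipelines; the component logic and names are placeholders:

```python
from kfp import dsl, compiler

@dsl.component
def preprocess(message: str) -> str:
    # Placeholder preprocessing step.
    return message.upper()

@dsl.component
def train(data: str) -> str:
    # Placeholder training step consuming the previous step's output.
    return f"model trained on: {data}"

@dsl.pipeline(name="demo-pipeline")
def demo_pipeline(message: str = "hello"):
    step1 = preprocess(message=message)
    train(data=step1.output)

# The compiled spec can then be submitted, e.g. via aiplatform.PipelineJob.
compiler.Compiler().compile(demo_pipeline, "demo_pipeline.json")
```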
cloud composer
- not cost efficient for a single pipeline because the environment is always active
- fully managed workflow orchestration
- used to automate machine learning workflows
- not as flexible and scalable as Kubeflow Pipelines
cloud vision api
- confidently detects large objects within an image
natural language api
- does sentiment analysis
- the NLP API gives you sentiment analysis out of the box
- AutoML vs NLP API
  - AutoML NLP requires custom training
automl nlp
- works well with small datasets
- uses transfer learning
cloud data fusion
- fully managed
- cloud native data integration service
- codeless interface
cloud function
- not good for computationally expensive/heavy data workflows
dataprep
- data preparation like cleaning, new column creation
cloud storage
- managed service for storing unstructured data (binary large objects)
- secured with data encryption
preemptible VMs
- VMs purchased at a steep discount (can be preempted at any time)
Responsible AI
7 Principles and 4 Areas not to pursue
- Be socially beneficial
- Avoid creating or reinforcing unfair bias
- Be built and tested for safety
- Be accountable to people
- Incorporate privacy design principles
- Uphold high standards of scientific excellence
- Be made available for uses that follow these principles
4 Areas
- likely to cause harm
- main purpose is to cause injury
- tech that gathers or uses information for surveillance
- purpose contradicts widely accepted laws and human rights
SHAP lib
- con is computational cost on large feature sets such as images
Language Interpretability Tool (LIT)
- mainly NLP but preliminary support for image and tabular
TCAV
- focuses on associating predictions with broader human-understandable concepts
XRAI
- a technique designed to highlight the specific pixels or regions of an image that are most influential in the model's decision
ACE
- about extracting high-level concepts from the model's decision-making process
Vertex Explainable AI
- used for model understanding
crossentropy
- sparse categorical crossentropy
  - use when classes are mutually exclusive (each sample belongs exclusively to one class)
  - requires labels to be integer encoded in a single vector
- categorical crossentropy
  - also assumes mutually exclusive classes, but labels are one-hot encoded (e.g. [1, 0, 0]) or soft probability distributions
  - for samples that can carry multiple labels, use binary crossentropy with sigmoid outputs instead
  - see the sketch below
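A minimal sketch of the two losses on toy 3-class data (the values are made up); both give the same result when the labels encode the same classes:

```python
import numpy as np
import tensorflow as tf

logits = tf.constant([[2.0, 1.0, 0.1], [0.5, 2.5, 0.3]])
probs = tf.nn.softmax(logits)

# Sparse categorical crossentropy: integer class labels in a single vector.
int_labels = np.array([0, 1])
sparse_loss = tf.keras.losses.SparseCategoricalCrossentropy()(int_labels, probs)

# Categorical crossentropy: one-hot (or soft probability) label vectors.
onehot_labels = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
cat_loss = tf.keras.losses.CategoricalCrossentropy()(onehot_labels, probs)

print(float(sparse_loss), float(cat_loss))  # identical values here
```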
precision
- increasing the threshold:
  1. increases precision
  2. reduces the number of false positives (predict car but no car)
  3. might reduce recall (ability to detect all cars)
recommendations
- use "frequently bought together" to increase revenue while following best practices
- overfitting
  indication: very high AUC ROC on training (relative to validation)
  1. dropout and L2 regularisation help (see the sketch below)
  2. increasing the size of the network (more neurons) makes it more complex; does not help
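A minimal Keras sketch of the dropout + L2 fix above (the layer sizes, rates, and 20-feature input are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(
        64, activation="relu", input_shape=(20,),
        kernel_regularizer=tf.keras.regularizers.l2(1e-4)),  # L2 weight penalty
    tf.keras.layers.Dropout(0.3),                            # randomly drop 30% of units
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam",
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC()])
```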
situations
- model trained long before, and the model's accuracy has decreased
  why? lack of retraining as the market changes
- streaming files which may contain PII, using the Cloud Data Loss Prevention API
  - make 3 buckets: quarantine, sensitive, non-sensitive; write all data to quarantine, do periodic scans using the API, and move the data to one of the other buckets (see the sketch below)
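A hedged sketch of the quarantine-bucket pattern with the Cloud DLP and Cloud Storage clients; the bucket names, project ID, and infoType list are placeholders, and error handling is omitted:

```python
from google.cloud import dlp_v2, storage

PROJECT = "my-project"  # placeholder
QUARANTINE, SENSITIVE, NONSENSITIVE = "quarantine-bkt", "sensitive-bkt", "nonsensitive-bkt"

dlp = dlp_v2.DlpServiceClient()
gcs = storage.Client()

def classify_and_route(blob_name: str) -> None:
    # Read the object from the quarantine bucket.
    blob = gcs.bucket(QUARANTINE).blob(blob_name)
    text = blob.download_as_bytes().decode("utf-8", errors="ignore")

    # Scan it for PII with the DLP API (infoTypes here are examples).
    response = dlp.inspect_content(
        request={
            "parent": f"projects/{PROJECT}",
            "inspect_config": {"info_types": [{"name": "EMAIL_ADDRESS"},
                                              {"name": "PHONE_NUMBER"}]},
            "item": {"value": text},
        }
    )

    # Route to the sensitive bucket if any findings, otherwise non-sensitive.
    target = SENSITIVE if response.result.findings else NONSENSITIVE
    gcs.bucket(QUARANTINE).copy_blob(blob, gcs.bucket(target), blob_name)
    blob.delete()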
datasets
- [Link] for input data in memory
- TFRecord (most efficient format for TensorFlow) for input data in a file / non-memory storage (see the sketch below)
[Link] is a hybrid of Apache Beam on Dataflow and TensorFlow
- the preprocessing function is a logical description of a transformation of the dataset
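A minimal tf.data sketch of the two input options above (the TFRecord file pattern and feature spec are assumed):

```python
import tensorflow as tf

# In-memory data -> build the dataset directly from tensors/arrays.
features = tf.random.uniform((100, 10))
labels = tf.random.uniform((100,), maxval=2, dtype=tf.int32)
mem_ds = tf.data.Dataset.from_tensor_slices((features, labels)).shuffle(100).batch(32)

# On-disk data -> store as TFRecord and parse serialized tf.train.Example records.
def parse_example(serialized):
    spec = {"feat": tf.io.FixedLenFeature([10], tf.float32),
            "label": tf.io.FixedLenFeature([], tf.int64)}
    parsed = tf.io.parse_single_example(serialized, spec)
    return parsed["feat"], parsed["label"]

file_ds = tf.data.TFRecordDataset(tf.io.gfile.glob("gs://my-bucket/data/*.tfrecord"))
file_ds = file_ds.map(parse_example).batch(32)
```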
regulated insurance company
- build a model that accepts or rejects insurance applications
- factors for the build?
  1. traceability (maintaining records of data for regulatory compliance)
  2. reproducibility (vital for validating reliability)
  3. explainability (model decisions can be easily explained)
TPU reduces bottlenecks and speeds up training
- interleave for reading data (helps parallelize data reading)
- set the prefetch option equal to the training batch (preload the data); see the sketch below
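A minimal sketch of the interleave + prefetch pattern (the file pattern, cycle_length, and batch size are assumed):

```python
import tensorflow as tf

BATCH_SIZE = 64
files = tf.data.Dataset.list_files("gs://my-bucket/train/*.tfrecord")

dataset = (
    files
    # Read several shard files in parallel instead of one after another.
    .interleave(tf.data.TFRecordDataset,
                cycle_length=4,
                num_parallel_calls=tf.data.AUTOTUNE)
    .batch(BATCH_SIZE)
    # Keep the next batch preloaded while the accelerator trains on the current one.
    .prefetch(1)
)
```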
model skews 6 months later due to a change in input data distribution, how to address?
- create alerts to monitor for skew, retrain the model
time series predictions
- always split by time (see the sketch below)
- randomly splitting will artificially increase accuracy (in production you can't borrow information from the future to make predictions)
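A minimal pandas sketch of a time-based split (the file, column names, and cutoff date are placeholders):

```python
import pandas as pd

df = pd.read_csv("sales.csv", parse_dates=["date"]).sort_values("date")

cutoff = pd.Timestamp("2023-01-01")
train_df = df[df["date"] < cutoff]   # train only on the past
test_df = df[df["date"] >= cutoff]   # evaluate on strictly later data, no future leakage
```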
aggregated data sent at the end of each day
- batch prediction
cnn vs rnn
cnn
- used for computer vision
rnn
- time series predictions
tips
- sigmoid activation: binary classification
- softmax activation: multi-class classification
- larger batch sizes require a smaller learning rate
- resource tagging/labelling: best way to manage ML resources for medium/big teams
- use nested cross-validation to avoid data leakage in time series data
feature crosses
- continuous features need to be binned (bucketized) before crossing
[Link] pipelines-and-cloud-build#cicd_architecture
- GCP recommends using Cloud Build when building Kubeflow pipelines
epochs
- fewer epochs mean less training, which affects model accuracy
learning rate
- converges faster with a higher learning rate, but might cause exploding gradients
batch size
- reducing it reduces the amount of memory required for each iteration
shape (-1, 2): any number of rows, 2 columns (2 elements per row); quick check below
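Quick NumPy check of that shape note:

```python
import numpy as np

a = np.arange(6)        # [0, 1, 2, 3, 4, 5]
b = a.reshape(-1, 2)    # -1 = infer the row count -> shape (3, 2)
print(b.shape)          # (3, 2)
```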
detect car or no car
true positive: predict car got car
true negative: predict no car no car
false positive: predict car no car
false negative: predict no car got car
precision= tp/(tp+fp)
- proportion of positive identifications that are actually correct
recall = tp/(tp+fn)
- proportion of actual positives that correctly identified
accuracy = (tp+tn)/(tp+fp+fn+tn)
- proportion of all correct predictions
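A quick scikit-learn check of the formulas above on toy labels (1 = car, 0 = no car):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]  # TP=3, FP=1, FN=1, TN=3

print(precision_score(y_true, y_pred))  # 3 / (3 + 1) = 0.75
print(recall_score(y_true, y_pred))     # 3 / (3 + 1) = 0.75
print(accuracy_score(y_true, y_pred))   # (3 + 3) / 8 = 0.75
```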
auc pr vs auc roc
auc pr
- useful for imbalanced datasets like fraud detection
- considers both precision and recall
auc roc
- less informative for imbalanced datasets because it weighs true and false positives equally
imbalanced dataset
- oversample minority, downsample majority
- upweight the minority class
too large to fit in a single machine - distributed training
a lot of dependencies not supported - custom containers
training data split into multiple files, reduce execution time of the input pipeline - parallel interleave
quickly test, build, deploy - AutoML
train from scratch if the model needs to adhere to PII regulations
use key-based hashes to tokenize (PII)
post-training quantization
- minimally decreases model performance
- reduces model latency when retraining is impossible (see the sketch below)
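A minimal sketch of post-training quantization with the TFLite converter (the SavedModel path is a placeholder):

```python
import tensorflow as tf

converter = tf.lite.TFLiteConverter.from_saved_model("exported_model/")
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enables post-training quantization
tflite_model = converter.convert()

with open("model_quantized.tflite", "wb") as f:
    f.write(tflite_model)
```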
adam optimiser
- good for large datasets
- a lot of parameters to adjust