Notes v1

The document outlines various tools and APIs for machine learning and data management, including the Sequential and Functional Model APIs, Auto SXS for A/B testing, and Vertex AI for model creation and monitoring. It also discusses the importance of Responsible AI principles, data preparation with Dataprep, and features of BigQuery and Dataflow for data processing. Additionally, it covers model evaluation metrics, handling imbalanced datasets, and techniques for ensuring model explainability and compliance with regulations.

Uploaded by

raneli9600

APIS

sequential model api
- allows you to create models layer by layer; suits most problems

functional model api
- alternative way of creating models; supports more complex architectures (shared layers, multiple inputs/outputs)
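A minimal sketch of the two styles, assuming TensorFlow's Keras (layer sizes are illustrative):

```python
import tensorflow as tf

# Sequential API: stack layers one after another (single input, single output).
seq_model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])

# Functional API: wire layers as a graph, which also allows shared layers,
# multiple inputs/outputs, and non-linear topologies.
inputs = tf.keras.Input(shape=(8,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)
fn_model = tf.keras.Model(inputs=inputs, outputs=outputs)
```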

Auto SxS
- evaluation tool that facilitates A/B testing for LLMs
- its core component is the autorater

model garden
- quick and easy way to find and apply the right model
- foundation models (pretrained multitask models that can be fine-tuned using Vertex AI)
- task-specific models (pretrained to solve specific problems)
- fine-tunable models (open-source models that can be fine-tuned using a custom notebook or pipeline)

Feature Store (feature repository)
- fully managed solution
- batch and streaming feature ingestion
- share and reuse ML features across use cases
- serve ML features at scale with low latency (offloads the operational overhead of handling infra)
- manages and scales the underlying infrastructure for you, such as storage and compute resources
- alleviates training-serving skew

Data Catalog
- catalogs native metadata on data assets

Dataprep
- can handle unstructured/structured datasets
- built on top of Dataflow
- autoscalable
- a flow is a sequence of recipes
- recipes are preprocessing steps from a library called Wranglers
- combine flows and their recipes to create your Dataflow pipeline
Dataplex
- enables organizations to centrally manage, monitor and govern their data across data lakes, data warehouses and data marts with consistent controls, providing access to trusted data and empowering analytics at scale

BigQuery
- serverless data warehouse that supports SQL syntax (not a NoSQL store)
- storage at scale with reduced latency
- efficient processing and seamless integration with other Google Cloud services
- ideal for analytics (visualisation?)
- needs to be at the end of a pipeline to store and analyse results (?)
- z-score normalization is easy
- minimises computational overhead
- can import TF models (BigQuery ML)
- good for analytics and dashboards

- no hyperparameter tuning
- not end to end; needs other tools like Vertex AI (e.g. to serve models)

Vertex AI
- model creation
- provides flexibility and scalability
- training and development
- lower infrastructure overhead
- doesn't require much code refactoring
- can train TensorFlow Estimator code
- distributed training (automatically handles distributing training jobs across many machines)
- automatic scaling of resources, saving costs compared to plain VMs

Vertex AI custom containers
- use ML frameworks and non-ML dependencies that are not supported by Vertex AI's prebuilt containers

Vertex AI pipelines
- run modular containerised AI pipeline steps

Vertex AI model monitoring
- fully managed model monitoring with minimal maintenance

Vertex AI TensorBoard
- compact yet complete overview of training metrics over time

Dataflow
- data transformation and processing
- unified stream and batch data processing that's serverless, fast, and cost-effective
- useful for evaluating a model on a large dataset
- uses Apache Beam

tabular workflow
- sequential attention
- integrated, managed, scalable pipelines
- end to end ML with tabular data for regression and classification

automl
- tables do not require code
- handles training/validation/test splits automatically when you specify a time column
automl nlp
- use when you need custom-built models for NLP
automl tables
- automates building of ML models from tabular data

cloud filestore
- faster than Cloud Storage for file access

Kubeflow
- forms an end-to-end architecture
pipelines sdk
- best practice for orchestrating AI pipelines with modular steps

cloud composer
- not cost-efficient for a single pipeline because the environment is always active
- fully managed workflow orchestration
- used to automate machine learning workflows
- lacks the flexibility and scalability of Kubeflow Pipelines

cloud vision api
- confidently detects large objects within an image

natural language api
- does sentiment analysis
- gives you sentiment analysis out of the box
- automl nlp vs nlp api: automl nlp requires custom training

automl nlp
- works well with small datasets
- uses transfer learning

cloud data fusion
- fully managed
- cloud-native data integration service
- codeless interface

cloud functions
- not suitable for computationally expensive/heavy data workflows

dataprep
- data preparation like cleaning, new column creation

cloud storage
- managed service for storing unstructured data (binary large objects)
- secured with data encryption

preemptible VM
- VM purchased at a steep discount; can be reclaimed (preempted) by Google at any time

Responsible AI
7 principles and 4 areas not to pursue
- Be socially beneficial
- Avoid creating or reinforcing unfair bias
- Be built and tested for safety
- Be accountable to people
- Incorporate privacy design principles
- Uphold high standards of scientific excellence
- Be made available for uses that accord with these principles

4 areas not to pursue
- tech likely to cause overall harm
- tech whose main purpose is to cause injury
- tech that gathers or uses information for surveillance violating internationally accepted norms
- tech whose purpose contravenes widely accepted principles of international law and human rights

SHAP library
- con: computational cost on large feature sets such as images

Learning Interpretability Tool (LIT)
- mainly NLP, but preliminary support for image and tabular data

TCAV
- associates predictions with broader human-understandable concepts

XRAI
- highlights the specific pixels or regions of an image that are most influential in the model's decision

ACE
- extracts high-level concepts from the model's decision-making process

Vertex Explainable AI
- used for model understanding

crossentropy
- sparse categorical crossentropy
    - use when classes are mutually exclusive (each sample belongs exclusively to one class)
    - requires labels to be integer-encoded in a single vector
- categorical crossentropy
    - use when one sample can have multiple classes or labels are soft probabilities
    - requires labels to be one-hot encoded (e.g. [1, 0])
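The difference is only the label encoding; a numpy sketch (probabilities are hypothetical) showing the two losses agree on hard labels:

```python
import numpy as np

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])          # model output: 2 samples, 3 classes
sparse_labels = np.array([0, 1])             # integer-encoded, one vector
onehot_labels = np.eye(3)[sparse_labels]     # one-hot encoded

# sparse categorical crossentropy: index the probability of the true class directly
sparse_ce = -np.log(probs[np.arange(len(sparse_labels)), sparse_labels])

# categorical crossentropy: sum label * log(prob) over classes (works for soft labels too)
cat_ce = -np.sum(onehot_labels * np.log(probs), axis=1)

print(np.allclose(sparse_ce, cat_ce))  # True: same loss, different label format
```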

precision
- increasing the classification threshold ->
1. increases precision
2. reduces the number of false positives (predict car but no car)
3. might reduce recall (ability to detect all cars)

recommendations
- use "frequently bought together" to increase revenue while following best practices

overfitting
indication: very high AUC ROC on training data
1. dropout and L2 regularisation help
2. increasing the network size (more neurons) makes the model more complex; doesn't help
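A Keras sketch of both fixes (assuming TensorFlow; the rates and layer sizes are illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    # L2 regularisation penalises large weights
    tf.keras.layers.Dense(64, activation="relu",
                          kernel_regularizer=tf.keras.regularizers.l2(1e-4)),
    # Dropout randomly zeroes activations during training
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
```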

situations
- model trained long ago; accuracy has decreased
    why? lack of retraining as the market changes
- streaming files which may contain PII, using the Cloud Data Loss Prevention API:
    make 3 buckets (quarantine, sensitive, non-sensitive); write all data to quarantine, run periodic scans with the API, and move data to the appropriate bucket

datasets
- [Link] for input data in memory
- tfrecord (most efficient format for TensorFlow) for input data in a file / non-memory storage

[Link] is a hybrid of Apache Beam on Dataflow and TensorFlow
- the preprocessing function is a logical description of a transformation of the dataset

regulated insurance company
- build a model that accepts or rejects insurance applications
factors for the build:
1. traceability (maintaining records of data for regulators)
2. reproducibility (vital for validating reliability)
3. explainability (model decisions can be easily explained)

TPUs reduce bottlenecks and speed up training
- interleave for reading data (helps to parallelize data reading)
- set the prefetch option equal to the training batch size (preloads the data)
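A self-contained tf.data sketch of both options (in-memory shards stand in for real input files):

```python
import tensorflow as tf

# Each row stands in for one input file/shard.
shards = tf.data.Dataset.from_tensor_slices([[1, 2, 3], [4, 5, 6]])

# interleave reads several shards concurrently, parallelizing data reading.
ds = shards.interleave(
    lambda s: tf.data.Dataset.from_tensor_slices(s),
    cycle_length=2,
    num_parallel_calls=tf.data.AUTOTUNE,
)

# prefetch preloads upcoming batches while the current one is training.
ds = ds.batch(3).prefetch(tf.data.AUTOTUNE)
```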

model skews 6 months later due to a change in the input data distribution; how to address?
- create alerts to monitor for skew, then retrain the model

time series predictions
- always split by time
- a random split will artificially inflate accuracy (you can't borrow info from the future to predict the future)

aggregated data sent at the end of each day
- batch prediction

cnn vs rnn
cnn
- used for computer vision
rnn
- time series predictions

tips
sigmoid activation - binary classification
softmax activation - multi-class classification
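A numpy sketch of why: sigmoid yields one independent probability, softmax yields a distribution over mutually exclusive classes:

```python
import numpy as np

def sigmoid(z):
    # binary classification: one logit -> probability of the positive class
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # multi-class classification: one logit per class -> probabilities summing to 1
    e = np.exp(z - np.max(z))
    return e / e.sum()

p_binary = sigmoid(0.0)                        # 0.5
p_classes = softmax(np.array([2.0, 1.0, 0.1])) # sums to 1.0
```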

- larger batch sizes require a smaller learning rate

resource tagging/labelling - best way to manage ML resources for medium/large teams

use nested cross-validation to avoid data leakage in time series data
feature crosses
- features need to be binned
[Link] (pipelines-and-cloud-build#cicd_architecture)
- GCP recommends using Cloud Build when building Kubeflow pipelines

epochs
- fewer epochs: less training, affects model accuracy
learning rate
- a higher learning rate converges faster but might cause exploding gradients

batch size
- reducing it reduces the amount of memory required for each iteration
shape (-1, 2): any number of rows, 2 columns (2 elements per row)
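A numpy sketch of the -1 convention:

```python
import numpy as np

a = np.arange(6)          # shape (6,)
b = a.reshape(-1, 2)      # -1 lets numpy infer the row count -> shape (3, 2)
print(b.shape)            # (3, 2)
```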

detect car or no car
true positive: predict car, there is a car
true negative: predict no car, there is no car
false positive: predict car, there is no car
false negative: predict no car, there is a car

precision = tp/(tp+fp)
- proportion of positive identifications that are actually correct
recall = tp/(tp+fn)
- proportion of actual positives that are correctly identified
accuracy = (tp+tn)/(tp+fp+fn+tn)
- proportion of all predictions that are correct
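Plugging hypothetical car-detector counts into these formulas:

```python
# hypothetical confusion-matrix counts for the car detector
tp, fp, fn, tn = 8, 2, 4, 6

precision = tp / (tp + fp)                   # 0.8   - of predicted cars, how many were cars
recall = tp / (tp + fn)                      # ~0.667 - of actual cars, how many were found
accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.7   - proportion of all correct predictions
```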

auc pr vs auc roc
auc pr
- useful for imbalanced datasets like fraud detection
- considers both precision and recall
auc roc
- less informative for imbalanced datasets because it weighs true and false positives equally

imbalanced dataset
- oversample the minority class, downsample the majority class
- upweight the minority class
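A numpy sketch of oversampling the minority class (toy data, fixed seed for reproducibility):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(-1, 1)
y = np.array([0] * 8 + [1] * 2)      # class 1 is the minority

# resample minority rows (with replacement) until the classes are balanced
minority_idx = np.flatnonzero(y == 1)
extra = rng.choice(minority_idx, size=(y == 0).sum() - len(minority_idx), replace=True)

X_bal = np.concatenate([X, X[extra]])
y_bal = np.concatenate([y, y[extra]])
print(np.bincount(y_bal))            # [8 8]
```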

too large to fit on a single machine - distributed training
lots of dependencies not supported - custom containers
training data split into multiple files; reduce input-pipeline execution time - parallel interleave
quickly test, build, deploy - automl

train from scratch if the model needs to adhere to PII regulations
- use key-based hashes to tokenize PII
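A stdlib sketch of key-based hashing (the key value here is illustrative; in practice it would live in a secret manager):

```python
import hmac
import hashlib

SECRET_KEY = b"example-key"  # hypothetical; never hard-code in real systems

def tokenize(pii: str) -> str:
    # Keyed hash: deterministic (so joins across tables still work),
    # but irreversible without the secret key.
    return hmac.new(SECRET_KEY, pii.encode("utf-8"), hashlib.sha256).hexdigest()

t1 = tokenize("alice@example.com")
t2 = tokenize("alice@example.com")
print(t1 == t2)  # True: same input always maps to the same token
```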

post-training quantization
- minimally decreases model performance
- reduces model latency when retraining is impossible
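A TensorFlow Lite sketch (the tiny model is a placeholder for an already-trained one; Optimize.DEFAULT enables post-training quantization):

```python
import tensorflow as tf

# placeholder model standing in for an already-trained one
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(2),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # post-training quantization
tflite_model = converter.convert()  # smaller, lower-latency model bytes
```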

adam optimiser
- good for large datasets
- handles a lot of parameters to adjust
