AN INTRODUCTION TO DEEP LEARNING
Understanding the Basics of How (and Why) it Works
A Guidebook by Dataiku | www.dataiku.com
While previously there wasn't a good way to train deep neural networks, advancements in machine learning (ML) algorithms and in deep learning chipsets mean that deep learning (DL) is now being actively implemented. It is being applied across industries, from healthcare to finance to retail and everything in between, and the global deep learning market is expected to reach $10.2 billion by 2025.¹
(Picture source²)
In fact, some of the controversy surrounding deep learning (and more particularly surrounding artificial intelligence,
or AI) is the fear of the “black box.” That is, how can anyone base a service or product on deep learning and trust
the decisions being made if no one knows how they’re being made?
This guidebook will unpack some of the nuances and intricacies to help uncover what makes DL such an effective
solution to some of today’s most complex problems. But on top of that, the goal is to take a deeper dive into how
certain aspects of DL work to build more trust and confidence around the technology with business leaders as well
as data teams. If you know how it works, it becomes less intimidating (and its use cases become more clear).
If you haven't had any previous exposure to ML, we recommend reading the illustrated Machine Learning Basics guidebook.³ In addition, less technically inclined readers might consider skipping from the first few sections (high-level definitions and use cases) straight to the last few sections (real-life applications), since the sections in between dive into the mechanics behind neural networks and are for people more interested in how deep learning is implemented.
You might already know that deep learning works because it imitates how the human brain works and how people learn. But to take that one step further, think about a child learning to associate words with objects. Her first word might be "cat." She then might point to any animal at all and say "cat." If what she's pointing to is, in fact, a cat, her father might confirm this by saying "yes, this is a cat."

However, if she points to a dog, her father will then say something like "no, that's not a cat - it's a dog." And gradually, subconsciously, the child learns what exactly makes a cat a cat, what makes it different from a dog, and adds more and more complex layers (for example, what makes a house cat different from a lion, which is also - technically - a cat).
Deep learning works similarly in that a computer takes inputs (data - often unstructured, like text, videos,
images, or even sound) and extracts useful information. It does this through a hierarchy of increasing
complexity and abstraction, continuously using knowledge and learning from previous layers, until it reaches
an accurate output. For those ready for more, not to worry - this guidebook will go into even more depth on
this definition later (see “Going Deep on Deep Learning” if you’re impatient).
Talking about deep learning is increasingly complex because it's often used alongside (or even interchangeably with) the terms machine learning and artificial intelligence (AI).

First, here's what you need to know: DL is a subset of ML, which is itself a subset of AI (picture three nested circles, with DL innermost).⁴
Before machine learning took off in the '80s, business decision rules were mostly hand-coded sets of instructions based on the knowledge of business experts. With machine learning, those rules are inferred from previously collected data - business expertise still plays a role (and is in fact required) for the feature engineering part.

Basically, the business expert determines which factors may impact the result you want to predict, and the algorithm automatically selects the optimal way to combine these factors. You "train" a model. The key question is: based on my data, what is the best rule I can create to solve my business problem?
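To make this concrete, here is a minimal sketch (not from this guidebook) of the classical ML workflow in Python with scikit-learn; the feature names and data are made up for illustration:

    # Classical ML: a business expert picks candidate features,
    # and the algorithm learns how to weight them.
    from sklearn.linear_model import LogisticRegression

    # Expert-chosen features: [age, income in $k, past purchases]
    X = [[25, 30, 2],
         [40, 80, 9],
         [35, 52, 4],
         [50, 95, 12]]
    y = [0, 1, 0, 1]  # did the customer buy the product?

    model = LogisticRegression().fit(X, y)  # "training" the model
    print(model.coef_)                      # learned weights, one per feature
    print(model.predict([[30, 60, 5]]))     # predicted class for a new customer

Note that the human still chose the three input factors; the algorithm only chose how to combine them.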
A DL algorithm is able to learn hidden patterns from the data by itself, combine them, and build much more efficient decision rules. That's why it can deal with problems that a human brain could not untangle - all the value of deep learning lies in this automatic pattern identification capability. This means handling more complex problems, such as understanding concepts in images, videos, text, sound, time series, and any other unstructured data you can think of.
But don't think of deep learning as a model that learns entirely by itself. You still need properly labeled data, an evaluation of the model's results, and of course an evaluation of the business value it will bring! In fact, the lack of precisely labeled data is one of the main reasons DL can have disappointing results in some business cases.
Of course, handling more complex data means more complex algorithms. And to extract sufficiently general patterns from complex data, you will need lots - read LOTS - of examples (many more than an ML model would need) - typically millions of labeled images for a classification task.
Since the feature engineering is done automatically by the machine, the interpretation is not obvious for a human, and DL "black-box" decision rules can be rejected by business analysts. In fact, DL model interpretability is one of today's biggest DL research challenges.
1. It requires lots of data (again).
2. It is based on a learning-by-failing strategy, so you need a not-too-critical use case (A/B testing, for instance) or a realistic simulation tool to train your model.
So don't worry, we are far from Terminator - when broken down, AI is really an extension of technologies that we're already using today.
Deep Learning Applications
Advancements in deep learning algorithms as well as hardware have resulted in an explosion of applications, both in the consumer sector and within the enterprise, that were not possible just five years ago.
Many of today’s use cases leverage computer vision and image detection. And because (as previously
emphasized) deep learning works best as the amount of data scales - that is, it needs massive amounts - its
most practical applications today are in the following industries:
• Automotive: On the enterprise side, many of the gains in deep learning in the automotive sector are in manufacturing (see above). But of course, deep learning technology - more specifically image recognition and computer vision - is also the cornerstone of self-driving cars. It is responsible for detecting lanes, traffic lights, and even people (and it often does this better - that is, faster - than a human could, especially at night or when something comes in front of the vehicle quickly).
• Hospitality: In an industry where exceptional customer service can make a customer for life, deep learning is centered around creating better and better customer service bots. Creating a bot that truly responds like a human, particularly one that reacts to emotional states,⁷ takes deep learning technology.
• Health Care: From drug discovery to image detection for early (or more accurate) disease detection
to insurance fraud prevention, health care is perhaps one of the industries poised to be changed the most
by advances in deep learning.
• Banking, Insurance, & Finance: As fraudsters get more advanced, techniques for
fraud detection need to advance along with them. Deep learning is ideal for this industry because it’s
often difficult to identify good features as fraud becomes increasingly difficult to detect.
• Agriculture: Computer vision and deep learning hold great promise to revolutionize farm machinery. Robots that can "see" weeds, for example, can eliminate them with a targeted approach.
• Entertainment: From advanced recommendation engines to fake news detection, deep learning is
already present across the sector. Upcoming trends include the so-called “Immersive Experience Industry,”
which will largely be based on DL technology.
• IT/Security: Malware detection is an increasingly important cyber security problem, and similar to
the challenges faced in the banking, insurance, and finance industry, detection methods must grow more
sophisticated along with their attackers. Deep learning is well suited because the models are robust enough
to handle natural variations in malware.
• Retail, Supply Chain & Logistics: Deep learning is changing the way retailers buy, stock, and sell products. Just one of the many examples of its applications is the use of computer vision in warehouses or on retail shelves to detect low stock.
This is one of the key reasons deep learning is more powerful than classical machine learning - it creates transferable solutions. That is, concepts like paw, tail, and ears, once learned (say, from images of cats), can be easily reused to understand what a dog is as well.
Deep learning algorithms are able to create transferable solutions through neural networks: that is, layers of
neurons/units.
For some, understanding that neurons make up neural networks, and those in turn allow machines (via deep
learning) to “learn” like humans, might be enough (if so - you might consider skipping ahead to learn about the
types of neural networks).
But to have a more robust understanding, it’s also important to understand how those underlying neurons
actually work.
The output is determined much the way you would make a decision: imagine you're deciding where to eat and you consider taste, location, and price. Each input has a different level of importance.
Well, a neuron similarly takes multiple inputs, each with a corresponding weight (importance). The weighted inputs are summed and passed through an activation function, which gives the final output (the class of the input). For example, if there's a high probability that you'll eat at Shake Shack based on your taste, the location of the nearest Shake Shack, and the price point, then the activation function will output Shake Shack as the final output.
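Here is a bare-bones sketch of a single neuron in plain Python; the inputs, weights, and bias are invented numbers that mirror the restaurant example:

    import math

    def sigmoid(z):
        # squashes any number into (0, 1), read here as a probability
        return 1 / (1 + math.exp(-z))

    inputs  = [0.9, 0.4, 0.7]   # taste, location, price (each scored out of 1)
    weights = [2.0, 1.0, 0.5]   # taste matters most to this diner
    bias    = -1.5

    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    print(sigmoid(z))  # ~0.74: a fairly high probability you'll eat at Shake Shack

The weighted sum plus a bias goes through the activation function (here a sigmoid), and the result is read as how strongly the neuron "fires."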
Deep learning problems often boil down to classification - whether binary (e.g., is this image a cat, or not a cat?) or multiclass (e.g., is this image a cat, a dog, a bird, etc.?). So finding the optimal features (variables) and parameters (weights) is key. DL is used for complex problems like medical diagnosis, but the underlying goal of finding boundaries (between positive and negative) can be thought of conceptually, like classifying purple and green points in a plane:
In this case, the “drawing” of a diagnosis boundary (our classification model) depends on gene 1 and gene 2
(our features). Points farther from the boundary are more likely to be in their respective class.
Adding layers lets the computer create more and more specific features that lead to a more complex final
output. For our example, adding more layers would let us create a more complex final boundary (straight line
-> simple curve -> complex curve).
If this were an image classification problem, more layers would allow us to identify more complex images
(blobs, edges -> noses, eyes, cheeks -> face).
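As an illustration (using Keras, an assumed framework here; the layer sizes are arbitrary), stacking layers is literally how such a network is written:

    from tensorflow import keras

    # Each added Dense layer lets the network draw a more complex decision
    # boundary (straight line -> simple curve -> complex curve).
    model = keras.Sequential([
        keras.Input(shape=(2,)),                      # two features: gene 1, gene 2
        keras.layers.Dense(8, activation="relu"),     # one hidden layer: simple bends
        keras.layers.Dense(8, activation="relu"),     # a deeper layer: more complex shapes
        keras.layers.Dense(1, activation="sigmoid"),  # output: purple vs. green
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")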
Understanding gradient descent is helpful for understanding deep learning because it's one of the most popular - if not the most popular - strategies for optimizing a model during training (that is, making sure it's "learning" correctly).
Remember that in deep learning, it’s the algorithm that finds the features for the most accurate classification
(instead of the human, as is the case in machine learning), so the computer needs a way to determine the
optimal features and weights (ones that lead to the most accurate final classification).
The Nitty-Gritty Details
This happens through choosing the features and weights that minimize some error/cost function. The error/cost function is the sum of per-point loss functions (each comparing a point's predicted value to its actual value) plus a regularization term. The regularization term penalizes models with many features to prevent overfitting (being accurate for a specific dataset but failing to generalize).
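In symbols, one common way to write this (the exact loss L and penalty R vary by model) is:

    E(w) = \sum_{i=1}^{N} L(\hat{y}_i, y_i) + \lambda \, R(w)

where the sum adds up the per-point losses (predicted vs. actual) and the \lambda R(w) term is the regularization penalty on the weights w.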
To minimize our error function, we use gradient descent: the computer chooses certain parameters (features and weights) and follows the negative gradient (the gradient is the direction of greatest increase, so the negative gradient is the direction of greatest decrease) of the error function until it finds the parameters that lead to a gradient of 0 (corresponding to a minimum of the error function). It works like getting to the lowest point on a mountain as quickly as possible: you walk in the direction of steepest descent until you hit a minimum. For example, here we keep adjusting the line until we have minimized the classification error (larger dots correspond to larger errors).
Visual Representation
Gradient descent is a little tricky to describe, but it's easier to understand visually how it works to minimize errors.
Gradient descent is an optimization algorithm used to find the best solutions to problems. Here we used it to
find the best features and weights, but gradient descent is also used for other optimization problems like finding
the best filters (we’ll talk about this later).
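As a concrete toy example (not from the guidebook), here is gradient descent minimizing the one-parameter error function E(w) = (w - 3)^2, whose minimum is at w = 3:

    def gradient(w):
        return 2 * (w - 3)   # derivative of (w - 3)^2

    w = 0.0                  # arbitrary starting parameter
    learning_rate = 0.1
    for _ in range(100):
        w -= learning_rate * gradient(w)   # step in the negative gradient direction

    print(w)  # ~3.0: the gradient here is 0, so we've hit the minimum

In a real network the single number w becomes millions of weights, but the update rule is the same.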
Feed Forward - Used in computer vision and speech recognition when classifying the target classes is complicated. Responsive to noisy data and easy to maintain.

Radial Basis - Considers the distance of a point with respect to a center. Used for power restoration systems, which are notoriously complicated.

Kohonen - Recognizes patterns in data. Used in medical analysis to cluster data into different categories (a Kohonen network was able to classify patients with a diseased glomerulus vs. a healthy one).

Recurrent - Feeds the output of a layer back in as input (a minimal sketch follows this list). Good for predicting the next word in a body of text, but harder to maintain.

Modular - A collection of different networks that work independently and contribute towards the final output. Increases computation speed (by breaking a complicated computational process into simpler computations), but processing time is subject to the number of neurons.
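Here is what the recurrent idea looks like as a minimal Keras sketch (an assumed framework; the layer sizes are arbitrary):

    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(20, 1)),   # a sequence of 20 steps, one value per step
        keras.layers.SimpleRNN(16),   # the layer's output is fed back in at each step
        keras.layers.Dense(1),        # e.g., predict the next value in the sequence
    ])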
In this guidebook we'll focus on convolutional neural networks (CNNs), which are similar to feed-forward neural networks but dominate computer vision because of their much higher accuracy.
For complicated problems like image identification, it’s difficult and time-consuming to try to identify the most
important variables before training (feature engineering). This is why deep learning instead applies feature
learning, where the machine learns the optimal features and weights on its own. Again, each layer corresponds
to more and more specific features (blobs, edges -> noses, eyes, cheeks -> face).
• The convolutional and pooling layer(s) extract the optimal features. Each feature is a filter that slides over the target image to break the image into simpler images.
• The fully connected layer identifies the class of the image by comparing it to different images and
finding the best match.
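Putting the two kinds of layers together, a minimal CNN might look like this in Keras (an assumed framework here; the image size and layer sizes are illustrative):

    from tensorflow import keras

    model = keras.Sequential([
        keras.Input(shape=(64, 64, 3)),                      # 64x64 color image
        keras.layers.Conv2D(16, (3, 3), activation="relu"),  # filters slide over the image
        keras.layers.MaxPooling2D((2, 2)),                   # pooling shrinks each region to one value
        keras.layers.Flatten(),
        keras.layers.Dense(10, activation="softmax"),        # fully connected: pick the best class
    ])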
The mapping of 1s and -1s to 4s and -4s relies on the type of filter the computer chooses (the best filter, like the best features for classification, is found through gradient descent!). In this problem, our computer chose a filter which adds (+) the number in the top left, subtracts (-) the top right, subtracts (-) the bottom left, and adds (+) the bottom right. So in our case, the filter starts in the top left quadrant (which corresponds to a \) and performs the operation + (1) - (-1) - (-1) + (1) to get the final value of 4. Then it moves to the top right quadrant (which corresponds to a /) and performs the operation + (-1) - (1) - (1) + (-1) to get the final value of -4. The filter reduces each 2x2 area to a 4 or a -4. That's why this filter works well - it outputs different values for the different patterns.
Each filter slides over each section (top left -> top right -> bottom left -> bottom right), outputting a 1 (firing) for a match or a -1 for the absence of the filter's pattern. The top row shows the outputs from sliding the \ filter over each section, and the second row shows the outputs for the / filter.
To reiterate, the convolution and pooling layer(s) slide 2x2 filters over each area of the image, reducing each 2x2 area to a number. The fully connected layer compares the image to different image filters. The image with the highest score is the best match!
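The quadrant arithmetic above is easy to reproduce; here is a sketch using NumPy (the pixel values are the same 1s and -1s from the example):

    import numpy as np

    # The chosen filter: +top-left, -top-right, -bottom-left, +bottom-right
    filt = np.array([[ 1, -1],
                     [-1,  1]])

    backslash = np.array([[ 1, -1],
                          [-1,  1]])   # a 2x2 patch containing a \ stroke
    slash     = np.array([[-1,  1],
                          [ 1, -1]])   # a 2x2 patch containing a / stroke

    print(np.sum(filt * backslash))  # 4: the filter "fires" on \
    print(np.sum(filt * slash))      # -4: and outputs the opposite for /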
Here, the only difference is that the filters are more complicated and that the network has many more layers to handle the increased complexity. Also, if the image is in color, it is initially represented as three stacked matrices (one for red, one for blue, one for green) instead of a single matrix.
Fine-tuning is a method of doing transfer learning. We take a pre-trained model, change the weights of the top layer(s), and "freeze" the other layers so that their weights don't change during learning. The number of layers we change depends on how similar our images are to the ones the model was trained on. If we had a pre-trained model of a cat and were trying to identify a specific cat, we wouldn't change very many layers. If we were instead trying to identify a lion, we would need to change more layers, because cats and lions don't have as much overlap.
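In code, freezing is a one-liner; here is a hedged sketch in Keras (MobileNetV2 is just one example of a pre-trained network, and the single new layer is illustrative):

    from tensorflow import keras

    # Load a network pre-trained on generic images, without its top layer
    base = keras.applications.MobileNetV2(include_top=False, pooling="avg",
                                          input_shape=(224, 224, 3))
    base.trainable = False   # "freeze": these weights won't change during training

    model = keras.Sequential([
        base,
        keras.layers.Dense(1, activation="sigmoid"),  # new top layer for our own task
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy")

To change more layers (the lion case), you would unfreeze some of the later layers of the base network instead of keeping them all frozen.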
The future of deep learning is bright because of its open source community and accessible platforms. Increasingly, leading corporations such as Apple, Facebook, and Google are making their technology accessible to the public.
“The main reason organizations make the switch to open-source is that it becomes easier to find deep learning talent. A
company could have developed the most amazing and efficient deep learning system, but if they don’t publish their research
and share their knowledge, talented data scientists and deep learning practitioners won’t be able to learn about their
system and apply it to their organization.”
Rodrigo Agundez, Lead Data Scientist @ GoDataDriven
Because of the shift towards open-source models, deep learning teams like Google Brain, Google DeepMind, and companies
like Facebook and Baidu are finding it easier to hire talented deep learning practitioners and become more cutting-edge.
In the near future, deep learning will significantly improve voice command systems (think Siri and Alexa), as well as health
care and image identification:
The model, based on the U-Net deep learning architecture, takes the MRI scan as input and outputs the corresponding volumes. Traditionally, this process is done manually by a doctor using hand-drawn diagrams, but this model greatly accelerates the process and improves its accuracy.
Sorting through all these photographs manually is very time-consuming, so GoDataDriven designed a deep learning system to automate photo quality checking. The system removed the need for tedious manual review, as it can accurately identify and sort pictures, even ones from different angles and devices.
Endnotes
1 Deep Learning Market Size Worth $10.2 Billion By 2025
2 Deep Learning Market Size, Share & Trends Analysis Report By Solution, By Hardware (CPU, GPU, FPGA, ASIC), By Service,