Deep Feedforward Networks

Syllabus : Probabilistic Theory of Deep Learning - Gradient Learning - Chain Rule and Backpropagation - Regularization : Dataset Augmentation - Noise Robustness - Early Stopping, Bagging and Dropout - Batch Normalization - VC Dimension and Neural Nets.

Contents
4.1 History of Deep Learning
4.2 A Probabilistic Theory of Deep Learning
4.3 Deep Networks
4.4 Challenges and Motivation of Deep Learning
4.5 Gradient Learning
4.6 Chain Rule and Backpropagation
4.7 Regularization : Dataset Augmentation
4.8 Bagging and Dropout
4.9 VC Dimension
4.10 Two Marks Questions with Answers

4.1 History of Deep Learning

• The history of deep learning can be traced back to 1943, when Walter Pitts and Warren McCulloch created a computer model based on the neural networks of the human brain. They used a combination of algorithms and mathematics they called "threshold logic" to mimic the thought process.
• Their basic aim was to mimic the human thought process; they used algorithms and mathematics to make threshold logic imitate human thought. Alan Turing, called the father of AI, concluded in 1951 that machines would not take much time before they started thinking on their own; at some point of time they would be able to talk to each other and take control of the universe.
• Since McCulloch and Pitts used this combination of mathematics and algorithms called threshold logic to mimic the thought process, deep learning has evolved steadily over the years, with two significant breaks in its development.
• The development of the basics of a continuous backpropagation model is credited to Henry J. Kelley in 1960. Stuart Dreyfus came up with a simpler version based only on the chain rule in 1962. The concept of backpropagation existed in the early 1960s but only became useful in 1985.
• The next significant evolutionary step for deep learning took place in 1999, when computers started becoming faster at processing data and graphics processing units (GPUs) were developed. Neural networks also have the advantage of continuing to improve as more training data is added.
• Around the year 2000, the vanishing gradient problem appeared. It was discovered that "features" formed in lower layers were not being learned by the upper layers, because no learning signal reached these layers.
• In 2001, a research report by META Group described the challenges and opportunities of data growth as three-dimensional : the increasing volume of data, the increasing speed of data, and the increasing range of data sources and types. This was a call to prepare for the onslaught of Big Data, which was just starting.
• In 2009, Fei-Fei Li, an AI professor at Stanford, launched ImageNet and assembled a database of more than 14 million labeled images. The Internet is, and was, full of images, but labeled images were needed to "train" neural nets.
• By 2011, the speed of GPUs had increased significantly, making it possible to train convolutional neural networks "without" the layer-by-layer pre-training.
• With the increase in computing speed, it became obvious that deep learning had significant advantages in terms of efficiency and speed.
• The Generative Adversarial Network (GAN) is a class of machine learning system invented by Ian Goodfellow and his colleagues in 2014.
• Continuing the history, in 2016 the Google DeepMind challenge match was played between AlphaGo and Lee Sedol; AlphaGo won the match series against the world champion Lee Sedol. AlphaGo and AlphaZero are computer programs developed by the artificial intelligence research company DeepMind in 2016 - 2017; they play the board game Go.
• The transformer, introduced in 2017, is a deep learning model used especially for Natural Language Processing (NLP).
• Although a large community has contributed to deep learning, Yann LeCun, Geoffrey Hinton and Yoshua Bengio received the Turing Award in 2018.

4.2 A Probabilistic Theory of Deep Learning

• Probabilistic modeling is the application of the principles of statistics to data analysis. It was one of the earliest forms of machine learning and is still widely used to this day. One of the best-known algorithms in this category is the Naive Bayes algorithm.
• Naive Bayes is a type of machine learning classifier based on applying Bayes' theorem while assuming that the features in the input data are all independent. This form of data analysis predates computers and was applied by hand decades before its first computer implementation.
• A closely related model is logistic regression, which is sometimes considered to be the "hello world" of modern machine learning. Much like Naive Bayes, logistic regression predates computing by a long time, yet it is still useful to this day.
• Bayes' theorem provides a way to calculate the probability of a hypothesis based on its prior probability, the probabilities of observing various data given the hypothesis, and the observed data itself. Bayes' theorem is a method to revise the probability of an event given additional information.
• Bayes' theorem calculates a conditional probability, called a posterior or revised probability. Bayes' theorem is a result in probability theory that relates conditional probabilities. If A and B denote two events, P(A|B) denotes the conditional probability of A occurring given that B occurs. The two conditional probabilities P(A|B) and P(B|A) are in general different.
• The theorem gives a relation between P(A|B) and P(B|A). An important application of Bayes' theorem is that it gives a rule for how to update or revise the strength of a belief in light of new evidence, i.e. how to obtain a posterior.
• A prior probability is an initial probability value originally obtained before any additional information is obtained.
• A posterior probability is a probability value that has been revised by using additional information that is later obtained.
• If A and B are two random variables,
  P(A|B) = P(B|A) P(A) / P(B)
• In the context of a classifier with hypothesis h and training data T,
  P(h|T) = P(T|h) P(h) / P(T)
  where
  P(h)   = prior probability of hypothesis h
  P(T)   = prior probability of training data T
  P(h|T) = probability of h given T
  P(T|h) = probability of T given h
  (A small numerical sketch of this update follows at the end of this section.)
• A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states. The way we describe probability distributions depends on whether the variables are discrete or continuous.
• A random variable, usually written X, is a variable whose possible values are numerical outcomes of a random phenomenon.
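To make the prior-to-posterior update concrete, here is a minimal numerical sketch of the formula P(h|T) = P(T|h) P(h) / P(T). The prior and likelihood values are illustrative assumptions, not numbers taken from the text.

```python
# Hedged sketch: posterior from Bayes' theorem, P(h|T) = P(T|h) * P(h) / P(T).
# The prior and likelihoods below are made-up illustrative numbers.

def posterior(prior_h, likelihood_t_given_h, likelihood_t_given_not_h):
    """Revise the probability of hypothesis h after observing training data T."""
    # Total probability of the data: P(T) = P(T|h)P(h) + P(T|~h)P(~h)
    p_t = likelihood_t_given_h * prior_h + likelihood_t_given_not_h * (1.0 - prior_h)
    return likelihood_t_given_h * prior_h / p_t

# Example: prior belief P(h) = 0.3, P(T|h) = 0.8, P(T|~h) = 0.2
print(posterior(0.3, 0.8, 0.2))   # posterior P(h|T) is roughly 0.632
```

The observed data raises the probability of the hypothesis from the prior of 0.3 to a revised (posterior) value of about 0.63, which is exactly the "update in light of new evidence" described above.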
4.3 Deep Networks

• The term "deep" usually refers to the number of hidden layers in the neural network. Deep learning is a subset of machine learning, which is predicated on the idea of learning from example. In machine learning, instead of teaching a computer a massive list of rules to solve the problem, we give it a model with which it can evaluate examples, and a small set of instructions to modify the model when it makes a mistake.
• The basic idea of deep learning is that repeated composition of functions can often reduce the requirements on the number of base functions (computational units) by a factor that is exponentially related to the number of layers in the network. (A minimal sketch of this layer-by-layer composition follows the list of methods below.)
• Deep learning eliminates some of the data pre-processing that is typically involved with machine learning.
• Fig. 4.3.1 shows the relation between AI, ML and deep learning.
  [Fig. 4.3.1 : Relation between AI, ML and deep learning]
• For example, say that we had a set of photos of different pets and we wanted to categorize them by "cat" and "dog". Deep learning algorithms can determine which features (e.g. ears) are most important to distinguish each animal from another. In machine learning, this hierarchy of features is established manually by a human expert.
• In deep learning, a computer model learns to perform classification tasks directly from images, text or sound. Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance. Models are trained by using a large set of labeled data and neural network architectures that contain many layers.
• Deep learning classifies information through layers of neural networks, which have a set of inputs that receive raw data. For example, if a neural network is trained with images of birds, it can be used to recognize images of birds. More layers enable more precise results, such as distinguishing a crow from a raven as compared to distinguishing a crow from a chicken.
• Deep learning consists of the following methods and their variations :
  a) Unsupervised learning systems such as Boltzmann machines for preliminary training, auto-encoders and generative adversarial networks.
  b) Supervised learning such as convolutional neural networks, which brought the technology of pattern recognition to a new level.
  c) Recurrent neural networks, which allow training on processes in time.
  d) Recursive neural networks, which allow feedback between circuit elements and chains.
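To make the "repeated composition of functions" idea concrete, the following is a minimal NumPy sketch of a forward pass through two hidden layers. The layer sizes, the random weights and the choice of ReLU activation are illustrative assumptions, not part of the original text; each layer is one affine transformation followed by a nonlinearity, and stacking the layers yields the composed function f3(f2(f1(x))).

```python
import numpy as np

# Hedged sketch: a deep feedforward network is a composition of simple layers.
# Sizes and random weights are illustrative only.
rng = np.random.default_rng(0)

def layer(x, w, b):
    # One computational unit: affine transform followed by a ReLU nonlinearity.
    return np.maximum(0.0, x @ w + b)

x = rng.normal(size=(4, 8))                        # a mini-batch of 4 inputs with 8 features
w1, b1 = rng.normal(size=(8, 16)), np.zeros(16)    # hidden layer 1
w2, b2 = rng.normal(size=(16, 16)), np.zeros(16)   # hidden layer 2
w3, b3 = rng.normal(size=(16, 3)), np.zeros(3)     # output layer (3 classes)

h1 = layer(x, w1, b1)
h2 = layer(h1, w2, b2)
scores = h2 @ w3 + b3                              # raw output scores, one row per input
print(scores.shape)                                # (4, 3)
```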
Reasons for Using Deep Learning

1. Analyzing unstructured data : Deep learning algorithms can be trained to look at text data by analyzing social media posts, news and surveys to provide valuable business and customer insights.
2. Data labelling : Deep learning requires labeled data for training. Once trained, it can label new data and identify different types of data on its own.
3. Feature engineering : A deep learning algorithm can save time because it does not require humans to extract features manually from raw data.
4. Efficiency : When a deep learning algorithm is properly trained, it can perform thousands of tasks over and over again, faster than humans.
5. Training : The neural networks used in deep learning have the ability to be applied to many different data types and applications. Additionally, a deep learning model can adapt by retraining it with new data.

Applications of Deep Learning

1. Aerospace and defense : Deep learning is utilized extensively to help satellites identify specific objects or areas of interest and classify them as safe or unsafe for soldiers.
2. Financial services : Financial institutions regularly use predictive analytics to drive algorithmic trading of stocks, assess business risks for loan approvals, detect fraud, and help manage credit and investment portfolios for clients.
3. Medical research : The medical research field uses deep learning extensively. For example, in ongoing cancer research, deep learning is used to detect the presence of cancer cells automatically.
4. Industrial automation : The heavy machinery sector is one that requires a large number of safety measures. Deep learning helps to improve worker safety in such environments by detecting any person or object that comes within the unsafe radius of a heavy machine.
5. Facial recognition : This feature utilizing deep learning is being used not just for a range of security purposes but will soon enable purchases at stores. Facial recognition is already being used extensively in airports to enable seamless, paperless check-ins.

Difference between Machine Learning and Deep Learning

Sr. No. | Machine Learning | Deep Learning
1. | Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned. | Deep learning is used to create an "artificial neural network" that can learn and make intelligent decisions on its own.
2. | Machine learning gives lesser accuracy. | Deep learning gives more accuracy.
3. | Machine learning requires less time for training. | Deep learning requires more time for training.
4. | Machine learning relies on features accurately identified by humans, i.e. human intervention. | Deep learning can create new features on its own.
5. | Machine learning models mostly require data in a structured form. | Deep learning models can work with both structured and unstructured data, as they rely on the layers of the artificial neural network.
6. | Algorithms are directed by data analysts to examine specific variables in data sets. | Algorithms are largely self-directed in data analysis once they are put into production.
7. | Machine learning can work on low-end machines. | A deep learning model needs a huge amount of data to work efficiently, so it needs GPUs and hence high-end machines.
8. | Feature extraction and classification are separate steps. | Feature extraction and classification are performed together by the network.
Difference between ML, AI and Data Science

Sr. No. | Machine Learning | Artificial Intelligence | Data Science
1. | Focuses on providing a means for algorithms and systems to learn from experience with data and use that experience to improve over time. | Focuses on giving machines cognitive and intellectual capabilities similar to those of humans. | Focuses on extracting information needles from data haystacks to aid in decision-making and planning.
2. | Machine learning uses statistical models. | Artificial intelligence uses logic and decision trees. | Data science works with structured data.
3. | The process of learning from data to find patterns. | Development of computerized applications that simulate human intelligence and interaction. | The process of using advanced analytics to extract relevant information from data.
4. | Objective is to maximize accuracy. | Objective is to maximize the chance of success. | Objective is to extract actionable insights from the data.
5. | ML can be done through supervised, unsupervised or reinforcement learning approaches. | AI encompasses a collection of intelligence concepts, including elements of perception, planning and prediction. | Uses statistics, mathematics, data wrangling, big data analytics, machine learning and various other methods to answer analytics questions.
6. | ML is concerned with knowledge accumulation. | AI is concerned with knowledge dissemination and conscious machine actions. | Data science is all about data engineering.

Difference between AI, ML and Deep Learning

Sr. No. | AI | ML | DL
1. | AI aims towards building machines that are capable of thinking like humans. | ML aims to learn through data to solve the problem. | DL aims to build neural networks that automatically discover patterns for feature detection.
2. | AI is a subset of data science. | ML is a subset of AI and data science. | DL is a subset of AI, ML and data science.
3. | All systems of artificial intelligence fall into three types : a) Artificial Narrow Intelligence, b) Artificial General Intelligence, c) Artificial Super Intelligence. | ML algorithms can be broadly classified into three categories : a) supervised learning, b) unsupervised learning, c) reinforcement learning. | Deep learning architectures are as follows : a) convolutional neural networks, b) recurrent neural networks, c) recursive neural networks.
4. | Making machines intelligent may or may not need high computational power. | These algorithms can work easily on normal, low-performance computers without GPUs. | Algorithms are dependent on high-performance hardware components that include GPUs.

Advantages and Disadvantages of Deep Learning

Advantages of Deep Learning
• No need for feature engineering.
• DL solves the problem on an end-to-end basis.
• Deep learning gives more accuracy.

Disadvantages of Deep Learning
• DL needs high-performance hardware.
• It needs much more time to train.
• It is difficult to assess its performance in real-world applications.
• It is very hard to understand.

4.4 Challenges and Motivation of Deep Learning

• The development of deep learning was motivated in part by the failure of traditional algorithms to generalize well on such AI tasks.

Curse of Dimensionality

• Many machine learning problems become exceedingly difficult when the number of dimensions in the data is high. This phenomenon is known as the curse of dimensionality.
• The curse of dimensionality refers to the phenomena that occur when classifying, organizing and analyzing high-dimensional data that do not occur in low-dimensional spaces, specifically the issues of data sparsity and "closeness" of data.
• The volume of the space represented grows so quickly that the data cannot keep up and thus becomes sparse, as shown in Fig. 4.4.1. The sparsity issue is a major one for anyone whose goal has some statistical significance.
  [Fig. 4.4.1 : As the number of relevant dimensions of the data increases, the number of regions grows : (a) 1-D, 4 regions; (b) 2-D, 16 regions; (c) 3-D, 64 regions]
• As the data space seen above moves from one dimension to two dimensions and finally to three dimensions, the given data fills less and less of the data space. In order to maintain an accurate representation of the space, the amount of data needed for analysis grows exponentially (a numerical sketch follows at the end of this subsection).
• The second issue that arises is related to sorting or classifying the data. In low-dimensional spaces data may appear very similar, but the higher the dimension, the further apart these data points may turn out to be.
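The growth pictured in Fig. 4.4.1 can be checked numerically: with 4 bins per axis there are 4^d cells in d dimensions, so a fixed-size dataset occupies a rapidly shrinking fraction of them. The sketch below counts occupied cells; the sample size and grid resolution are illustrative assumptions rather than values from the text.

```python
import numpy as np

# Hedged sketch: with 4 bins per axis there are 4**d cells in d dimensions.
# A fixed number of samples covers a rapidly shrinking fraction of them.
rng = np.random.default_rng(0)
n_samples, bins = 1000, 4

for d in (1, 2, 3, 6, 10):
    data = rng.uniform(size=(n_samples, d))                      # points in the unit hypercube
    cells = set(map(tuple, np.floor(data * bins).astype(int)))   # cells that contain a sample
    total = bins ** d
    print(f"d={d:2d}  cells={total:10d}  occupied fraction={len(cells) / total:.6f}")
```

For d = 1, 2, 3 every cell is typically hit by at least one of the 1000 samples; by d = 10 the grid has over a million cells and the occupied fraction collapses, which is the sparsity problem described above.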
Local Constancy and Smoothness Regularization

• In order to generalize well, machine learning algorithms need to be guided by prior beliefs about what kind of function they should learn. Among the most widely used priors is the smoothness or local constancy prior.
• There are many different ways to implicitly or explicitly express a prior belief that the learned function should be smooth or locally constant. All of these different methods are designed to encourage the learning process to learn a function f* that satisfies the condition f*(x) ≈ f*(x + ε) for small ε.
• If we know a good answer for an input x, then that answer is probably good in the neighborhood of x. If we have several good answers in some neighborhood, we would combine them to produce an answer that agrees with as many of them as possible.
• An extreme example of the local constancy approach is the k-nearest neighbors family of learning algorithms.
• The k-nearest neighbors algorithm copies the output from nearby training examples; most kernel machines interpolate between training set outputs associated with nearby training examples. An important class of kernels is the family of local kernels, where k(u, v) is large when u = v and decreases as u and v grow farther apart from each other.
• A local kernel can be thought of as a similarity function that performs template matching, by measuring how closely a test example x resembles each training example x^(i).
• Decision trees also suffer from the limitations of exclusively smoothness-based learning, because they break the input space into as many regions as there are leaves and use a separate parameter in each region.

Manifold Learning

• Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.
• Manifold learning was introduced in the case of continuous-valued data and the unsupervised learning setting, although this probability concentration idea can be generalized to both discrete data and the supervised learning setting. The key assumption remains that probability mass is highly concentrated.
• High-dimensional data can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
• The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data (a minimal sketch follows below). Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.
• When the data lies on a low-dimensional manifold, it can be most natural for machine learning algorithms to represent the data in terms of coordinates on the manifold, rather than in terms of coordinates in R^n.
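As a concrete illustration of the random projection mentioned above, the sketch below maps 50-dimensional data down to 2 dimensions with a random Gaussian matrix. The data, the dimensions and the scaling are illustrative assumptions; as the text notes, such a projection may discard exactly the structure that dedicated manifold-learning methods try to preserve.

```python
import numpy as np

# Hedged sketch: reduce dimension by multiplying with a random Gaussian matrix.
# The data and sizes are illustrative; proper manifold-learning methods instead
# try to find coordinates that follow the structure of the data.
rng = np.random.default_rng(0)

x = rng.normal(size=(200, 50))                  # 200 samples in 50 dimensions
proj = rng.normal(size=(50, 2)) / np.sqrt(50)   # random projection matrix
x_2d = x @ proj                                 # 2-D coordinates suitable for plotting
print(x_2d.shape)                               # (200, 2)
```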
4.5 Gradient Learning

• Designing and training a neural network is not much different from training any other machine learning model with gradient descent. The design choices for gradient learning are as follows :
  a) We must choose a cost function.
  b) We must choose how to represent the output of the model.
  c) We then revisit these design considerations for gradient-based optimizers.
• Neural networks are usually trained by using iterative, gradient-based optimizers. Gradient-based learning draws on the fact that it is generally much easier to minimize a reasonably smooth, continuous function than a discrete function.
• The loss function can be minimized by estimating the impact of small variations of the parameter values on the loss function. Convex optimization converges starting from any initial parameters. Stochastic gradient descent applied to non-convex loss functions has no such convergence guarantee and is sensitive to the values of the initial parameters.
• For feedforward neural networks, it is important to initialize all weights to small random values. The biases may be initialized to zero or to small positive values. Iterative gradient-based optimization algorithms are used to train feedforward networks and almost all other deep models.

4.5.1 Cost Function

• An important aspect of the design of deep neural networks is the cost function. Cost functions are similar to those for parametric models such as linear models. In most cases, a parametric model defines a distribution p(y|x ; θ) and we simply use the principle of maximum likelihood.
• This leads to using the cross-entropy between the training data and the model's predictions as the cost function. Most modern neural networks are trained using maximum likelihood, and the cost function is given by
  J(θ) = - E_{x,y ~ p̂_data} log p_model(y | x)
• This means the cost is simply the negative log-likelihood or, equivalently, the cross-entropy between the training set and the model distribution. The specific form of the cost function changes from model to model, depending on the form of log p_model.
• Cost function with a Gaussian model : if p_model(y|x) = N(y ; f(x ; θ), I), then using maximum likelihood the mean squared error cost is
  J(θ) = (1/2) E_{x,y ~ p̂_data} ||y - f(x ; θ)||² + const
  where "const" depends on the variance of the Gaussian.
• An advantage of this approach is that deriving the cost from maximum likelihood removes the burden of designing cost functions for each model.
• Desirable property of the gradient : The gradient must be large and predictable enough to serve as a good guide to the learning algorithm.
• Cross-entropy and regularization : A property of the cross-entropy cost used for MLE is that it does not have a minimum value. Discrete output variables cannot represent a probability of exactly zero or one but can come arbitrarily close; logistic regression is an example. For real-valued output variables it becomes possible to assign extremely high density to correct training set outputs, e.g. by learning the variance parameter of a Gaussian output, and the resulting cross-entropy approaches negative infinity.
• Learning conditional statistics : Instead of learning a full probability distribution, we often want to learn just one conditional statistic of y given x.
• Learning a function : If we have a sufficiently powerful neural network, we can think of it as being powerful enough to determine any function f, limited only by boundedness and continuity. From this point of view, the cost is a functional rather than a function : we view the cost as a functional, not a function, and think of learning as a task of choosing a function rather than a set of parameters. We can design our cost functional to have its minimum occur at a specific function we desire; for example, design the cost functional to have its minimum lie on the function that maps x to the expected value of y given x.
• Solving an optimization problem with respect to a function requires a mathematical tool called the calculus of variations.
• Mean squared error and mean absolute error often lead to poor results when used with gradient-based optimization. Some output units saturate and produce very small gradients when combined with these cost functions. This is one reason the cross-entropy cost is more popular.
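The equivalence stated above, that maximum likelihood under a unit-variance Gaussian model reduces to mean squared error plus a constant, can be verified numerically. The toy targets and predictions in the sketch below are illustrative assumptions, not data from the text.

```python
import numpy as np

# Hedged sketch: for p_model(y|x) = N(y; f(x), 1), the negative log-likelihood
# equals 0.5 * (y - f(x))**2 plus a constant that does not depend on the model.
rng = np.random.default_rng(0)

y = rng.normal(size=100)                  # toy targets
f_x = y + 0.1 * rng.normal(size=100)      # toy model predictions

nll = np.mean(0.5 * (y - f_x) ** 2 + 0.5 * np.log(2 * np.pi))
mse_half = np.mean(0.5 * (y - f_x) ** 2)
const = 0.5 * np.log(2 * np.pi)

print(nll, mse_half + const)              # identical: minimizing one minimizes the other
```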
Output Units

• The cross-entropy between the data distribution and the model distribution serves as the cost; the choice of how to represent the output then determines the form of the cross-entropy function. In logistic regression, the output is binary-valued.
• Any kind of neural network unit that may be used as an output can also be used as a hidden unit. The role of the output units is to complete the task that the network must perform. The common kinds of output units are : linear units for Gaussian output distributions, sigmoid units for Bernoulli output distributions, softmax units for Multinoulli output distributions, and other output types.

1. Linear units for Gaussian output distributions
• One simple kind of output unit is based on an affine transformation with no nonlinearity; these are often just called linear units. Given features h, a layer of linear output units produces a vector ŷ = Wᵀh + b.
• Linear units are often used to produce the mean of a conditional Gaussian distribution p(y|x) = N(y ; ŷ, I). Maximizing the log-likelihood is then equivalent to minimizing the mean squared error.
• Linear units can also be used to learn the covariance of a Gaussian, or to make the covariance a function of the input. However, the covariance needs to be constrained to be a positive definite matrix.

2. Sigmoid units for Bernoulli output distributions
• Many tasks require predicting the value of a binary variable y. Classification problems with two classes can be cast in this form. The maximum-likelihood approach is to define a Bernoulli distribution over y conditioned on x.
• A Bernoulli distribution is defined by a single number. The neural net needs to predict only P(y = 1 | x). For this number to be a valid probability, it must lie in the interval [0, 1].
• To ensure a strong gradient whenever the model has the wrong answer, use a sigmoid output unit. A sigmoid output unit has two components :
  a) A linear layer to compute z = wᵀh + b.
  b) A sigmoid activation function to convert z into a probability.
• Probability distribution using the sigmoid : the probability distribution over y is described using z = wᵀh + b, where y is the output and z is the input to the sigmoid.
• Probability distributions based on exponentiation and normalization are common throughout statistical modeling. The variable z defining such a distribution over binary variables is called a logit.
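Here is a minimal sketch of the two components listed above: an affine layer producing the logit z = wᵀh + b, followed by a sigmoid, evaluated against the Bernoulli negative log-likelihood (binary cross-entropy). The weights, hidden features and targets are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: a sigmoid output unit for a Bernoulli target.
# z (the logit) comes from an affine layer; sigmoid(z) plays the role of P(y = 1 | x).
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

h = rng.normal(size=(5, 8))        # outputs of the last hidden layer (5 examples)
w, b = rng.normal(size=8), 0.0     # illustrative output weights and bias
y = np.array([1, 0, 1, 1, 0])      # illustrative binary targets

z = h @ w + b                      # logits
p = sigmoid(z)                     # predicted P(y = 1 | x)

# Binary cross-entropy (negative Bernoulli log-likelihood)
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
print(p.round(3), bce)
```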
3. Softmax units for Multinoulli output distributions
• The sigmoid saturates when its argument is very positive or very negative, i.e. the function is insensitive to small changes in its input. Any time we want a probability distribution over a discrete variable with n values, we may use the softmax function.
• Compare it to the softplus function : ζ(x) = log(1 + exp(x)).
• Softmax functions are most often used as the output of a classifier, to represent the probability distribution over n different classes. Softmax functions can also be used inside the model itself, if we wish the model to choose between one of n different options for some internal variable.
• Like the sigmoid, the softmax activation can saturate. The sigmoid function has a single output that saturates when its input is extremely negative or extremely positive. In the case of the softmax, there are multiple output values. These output values can saturate when the differences between the input values become extreme. When the softmax saturates, many cost functions based on the softmax also saturate, unless they are able to invert the saturating activation function.
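The saturation issue described above is usually handled in practice by shifting the logits before exponentiation, which leaves the softmax output unchanged. The sketch below shows a numerically stable softmax next to a stable softplus ζ(x) = log(1 + exp(x)); the example logits are illustrative assumptions.

```python
import numpy as np

# Hedged sketch: a stable softmax subtracts max(z) so that large logits do not
# overflow; softplus is a smooth version of max(0, x).
def softmax(z):
    shifted = z - np.max(z)          # invariant: softmax(z) == softmax(z - c)
    e = np.exp(shifted)
    return e / e.sum()

def softplus(x):
    # Stable evaluation of log(1 + exp(x)) for both large positive and negative x.
    return np.log1p(np.exp(-np.abs(x))) + np.maximum(x, 0.0)

z = np.array([1000.0, 1001.0, 1002.0])   # extreme logits that would overflow a naive exp
print(softmax(z))                        # roughly [0.090, 0.245, 0.665]
print(softplus(np.array([-30.0, 0.0, 30.0])))
```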
