CS229: Machine Learning (Andrew Ng) - Lecture Notes and Class Notes

Class logistics. Piazza is the forum for the class; all official announcements and communication will happen over Piazza. Due to a large number of inquiries, please read the logistics section and the FAQ page for commonly asked questions before reaching out to the course staff. Lecture videos are organized in weeks: plan on watching around 10 videos (each roughly 10 minutes) every week, with a quiz (about 10 to 30 minutes to complete) at the end of every week. Make sure you are up to date, so you do not lose the pace of the class. Stay truthful, maintain the Honor Code, and keep learning. An adapted version of this course is also offered as part of the Stanford Artificial Intelligence Professional Program.

Syllabus and course schedule (being updated for Spring 2020; dates are subject to change as deadlines are figured out, so please check back). Lecture 0: Introduction and Logistics (class notes). Week 1, Lecture 1: Review of Linear Algebra (class notes). 6/22: Assignment: Problem Set 0. 10/28, Lecture 14: Weak supervision / unsupervised learning (live lecture notes, spring quarter; old draft, in lecture). 10/29: Midterm (details TBD). 11/2, Lecture 15: ML advice. Time and location vary by offering: Monday and Wednesday 4:30pm-5:50pm with links to lecture on Canvas, or Monday and Wednesday 10:00-11:20am on Zoom. Current quarter class videos are available for SCPD and non-SCPD students. I have access to the 2013 video lectures of CS229 from ClassX, and the publicly available 2008 version is great as well; all in all, we have the slides and the notes from the course website to learn the content. (I completed the online version as a freshman, and here I take the CS229 Stanford version.)

Related note write-ups:
[CS229] Lecture 4 Notes - Newton's Method / GLMs (14 Feb 2019)
[CS229] Lecture 5 Notes - Discriminative Learning vs. Generative Learning Algorithms (18 Feb 2019)
[CS229] Properties of Trace and Matrix Derivatives (04 Mar 2019)
[CS229] Lecture 6 Notes - Support Vector Machines I (05 Mar 2019)
Stanford Lecture Note Part I & II (machine learning); Stanford Lecture Note Part V (SVM) - KF
Stanford University – CS229: Machine Learning by Andrew Ng – Lecture Notes – Multivariate Linear Regression; Parameter Learning

Supervised learning. Let's start by talking about a few examples of supervised learning problems. Suppose we have a dataset giving the living areas and prices of 47 houses from Portland, Oregon:

    Living area (feet^2)    Price (1000$s)
    2104                    400
    1600                    330
    2400                    369
    1416                    232
    3000                    540
    ...

We can plot this data. Given data like this, how can we learn to predict the prices of other houses in Portland, as a function of the size of their living areas?
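For later reference, here is the listed part of the table as NumPy arrays. This is a minimal sketch; the array names and the intercept-column convention are choices made here, not notation from the notes, and the later code sketches reuse X and y.

    import numpy as np

    # The five listed rows of the Portland housing table: living area (feet^2) and price (1000$s).
    living_area = np.array([2104.0, 1600.0, 2400.0, 1416.0, 3000.0])
    price = np.array([400.0, 330.0, 369.0, 232.0, 540.0])

    # Design matrix with a leading column of ones for the intercept term, and the target vector.
    X = np.column_stack([np.ones_like(living_area), living_area])
    y = price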
To establish notation for future use, we'll use x(i) to denote the "input" variables (the living area in this example), also called input features, and y(i) to denote the "output" or target variable that we are trying to predict (the price). A pair (x(i), y(i)) is called a training example, and the dataset that we'll be using to learn, a list of n training examples {(x(i), y(i)); i = 1, ..., n}, is called a training set. Note that the superscript "(i)" in the notation is simply an index into the training set, and has nothing to do with exponentiation. We will also use X to denote the space of input values, and Y the space of output values; in this example, X = Y = R.

To describe the supervised learning problem slightly more formally, our goal is, given a training set, to learn a function h : X → Y so that h(x) is a "good" predictor for the corresponding value of y. (For the moment we leave unspecified just what it means for a hypothesis to be good or bad.) For historical reasons, this function h is called a hypothesis. Seen pictorially, the process is therefore: x → h → predicted y (the predicted price of the house).

When the target variable that we're trying to predict is continuous, such as in our housing example, we call the learning problem a regression problem. When y can take on only a small number of discrete values (such as if, given the living area, we wanted to predict if a dwelling is a house or an apartment, say), we call it a classification problem.

Linear regression. To perform supervised learning, we must decide how to represent the hypothesis h. As an initial choice, let's approximate y as a linear function of x: hθ(x) = θ0 + θ1 x1 (if the number of bedrooms were included as one of the input features as well, x would have a second component x2 with a corresponding parameter θ2). Given a training set, how do we pick, or learn, the parameters θ? One reasonable method seems to be to make h(x) close to y, at least for the training examples we have. To formalize this, we define the cost function

    J(θ) = (1/2) Σ_{i=1}^{n} (hθ(x(i)) − y(i))²,

which measures, for each value of the θ's, how close the hθ(x(i))'s are to the corresponding y(i)'s. If you've seen linear regression before, you may recognize this as the familiar least-squares cost function that gives rise to the ordinary least squares regression model. We want to choose θ so as to minimize J(θ). To do so, let's use a search algorithm that starts with some "initial guess" for θ, and that repeatedly changes θ to make J(θ) smaller, until hopefully we converge to a value of θ that minimizes J(θ). Specifically, consider the gradient descent algorithm, which starts with some initial θ and repeatedly performs the update

    θ_j := θ_j − α ∂J(θ)/∂θ_j    (this update is simultaneously performed for all values of j = 0, ..., d).

Here, α is called the learning rate. This is a very natural algorithm that repeatedly takes a step in the direction of steepest decrease of J. (We use the notation "a := b" to denote an operation, in a computer program, in which we set the value of a variable a to be equal to the value of b; in other words, this operation overwrites a with the value of b. In contrast, we will write "a = b" when we are asserting a statement of fact, that the value of a is equal to the value of b.)

To implement this algorithm, we have to work out what the partial derivative term on the right hand side is. Let's first work it out for the case where we have only one training example (x, y), so that we can neglect the sum in the definition of J. For a single training example, this gives the update rule

    θ_j := θ_j + α (y(i) − hθ(x(i))) x_j(i).

The rule is called the LMS update rule (LMS stands for "least mean squares"), and is also known as the Widrow-Hoff learning rule. This rule has several properties that seem natural and intuitive. For instance, the magnitude of the update is proportional to the error term (y(i) − hθ(x(i))); thus, if we are encountering a training example on which our prediction nearly matches the actual value of y(i), then we find that there is little need to change the parameters; in contrast, a larger change to the parameters will be made if our prediction hθ(x(i)) has a large error (i.e., if it is very far from y(i)).

We'd derived the LMS rule for when there was only a single training example. There are two ways to modify this method for a training set of more than one example. The first is to replace it with the following algorithm: repeat until convergence, updating each θ_j by the sum over all n training examples of α (y(i) − hθ(x(i))) x_j(i). By grouping the updates of the coordinates into an update of the vector θ, we can rewrite this in a slightly more succinct way. The reader can easily verify that the quantity in the summation is just ∂J(θ)/∂θ_j (for the original definition of J), so this is simply gradient descent on the original cost function J. This method looks at every example in the entire training set on every step, and is called batch gradient descent.
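The following is a minimal sketch of batch gradient descent with the LMS update written in vector form, reusing X and y from the earlier snippet. The function name, learning rate, and iteration count are illustrative assumptions; the living-area feature is left unscaled, so α has to be very small for the iteration to stay stable.

    import numpy as np

    def batch_gradient_descent(X, y, alpha=1e-8, num_iters=10000):
        n, d = X.shape
        theta = np.zeros(d)
        for _ in range(num_iters):
            # Error term (y(i) - h_theta(x(i))) for every training example at once.
            errors = y - X @ theta
            # Batch LMS update: theta_j := theta_j + alpha * sum_i errors_i * x_j(i).
            theta = theta + alpha * (X.T @ errors)
        return theta

    theta_gd = batch_gradient_descent(X, y)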
Stochastic gradient descent. There is an alternative to batch gradient descent that also works very well. In this algorithm, we repeatedly run through the training set, and each time we encounter a training example, we update the parameters according to the gradient of the error with respect to that single training example only:

    θ_j := θ_j + α (y(i) − hθ(x(i))) x_j(i),    applied for one example i at a time.

This algorithm is called stochastic gradient descent (also incremental gradient descent). Whereas batch gradient descent has to scan through the entire training set before taking a single step, a costly operation if n is large, stochastic gradient descent can start making progress right away, and continues to make progress with each example it looks at. Often, stochastic gradient descent gets θ "close" to the minimum much faster than batch gradient descent. (Note however that it may never "converge" to the minimum, and the parameters θ will keep oscillating around the minimum of J(θ); but in practice most of the values near the minimum will be reasonably good approximations to the true minimum. By slowly letting the learning rate α decrease to zero as the algorithm runs, it is also possible to ensure that the parameters will converge to the global minimum rather than merely oscillate around it.) For these reasons, particularly when the training set is large, stochastic gradient descent is often preferred over batch gradient descent.

The normal equations. Gradient descent gives one way of minimizing J. Let's discuss a second way of doing so, this time performing the minimization explicitly and without resorting to an iterative algorithm. In this method, we will minimize J by explicitly taking its derivatives with respect to the θ_j's, and setting them to zero. To enable us to do this without having to write reams of algebra and pages full of matrices of derivatives, let's introduce some notation for doing calculus with matrices. For a function f : R^{n×d} → R mapping from n-by-d matrices to the real numbers, we define the derivative of f with respect to A so that the gradient ∇_A f(A) is itself an n-by-d matrix, whose (i, j)-element is ∂f/∂A_ij; here, A_ij denotes the (i, j) entry of the matrix A.

Given a training set, define the design matrix X to be the n-by-d matrix (actually n-by-(d+1), if we include the intercept term) that contains the training examples' input values in its rows, and let ~y be the vector containing all the target values. Armed with the tools of matrix derivatives, let us now proceed to find, in closed form, the value of θ that minimizes J(θ). We rewrite J in matrix-vectorial notation and set its derivatives to zero; in the derivation, the third step uses the fact that a^T b = b^T a, and the fifth step uses the facts ∇_x b^T x = b and ∇_x x^T A x = 2Ax for a symmetric matrix A (for more details, see Section 4.3 of "Linear Algebra Review and Reference"). This yields the normal equations, and thus the value of θ that minimizes J(θ) is given in closed form by

    θ = (X^T X)^(-1) X^T ~y.

Note that in this step we are implicitly assuming that X^T X is an invertible matrix; this can be checked before calculating the inverse. If either the number of linearly independent examples is fewer than the number of features, or if the features are not linearly independent, then X^T X will not be invertible. Even in such cases, it is possible to "fix" the situation with additional techniques, which we skip here for the sake of simplicity. (For the housing data with both the living area and the number of bedrooms included as input features, the least-squares fit comes out to θ0 = 89.60, θ1 = 0.1392, θ2 = −8.738.)
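Here is a minimal sketch of the closed-form fit, reusing X and y from the first snippet. Solving the linear system rather than forming the inverse explicitly, and the rank check for the invertibility caveat above, are implementation choices made here, not steps prescribed by the notes.

    import numpy as np

    def normal_equations(X, y):
        XtX = X.T @ X
        # Guard for the implicit assumption that X^T X is invertible.
        if np.linalg.matrix_rank(XtX) < XtX.shape[0]:
            raise ValueError("X^T X is singular: too few linearly independent examples or features")
        # Solve X^T X theta = X^T y instead of explicitly inverting X^T X.
        return np.linalg.solve(XtX, X.T @ y)

    theta_ne = normal_equations(X, y)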
Probabilistic interpretation. When faced with a regression problem, why might linear regression, and specifically why might the least-squares cost function J, be a reasonable choice? In this section, we will give a set of probabilistic assumptions under which least-squares regression is derived as a very natural algorithm.

Let us assume that the target variables and the inputs are related via the equation y(i) = θ^T x(i) + ε(i), where ε(i) is an error term that captures either unmodeled effects (such as if there are some features very pertinent to predicting housing price that we'd left out of the regression) or random noise. Let us further assume that the ε(i) are distributed IID (independently and identically distributed) according to a Gaussian distribution with mean zero and some variance σ². We can write this assumption as "ε(i) ∼ N(0, σ²)"; i.e., the density of ε(i) is the Gaussian density with mean 0 and variance σ². This implies that the conditional distribution of y(i) is y(i) | x(i); θ ∼ N(θ^T x(i), σ²). The notation "p(y(i) | x(i); θ)" indicates that this is the distribution of y(i) given x(i) and parameterized by θ; note that we should not condition on θ ("p(y(i) | x(i), θ)"), since θ is not a random variable.

Given X (the design matrix, which contains all the x(i)'s) and θ, what is the distribution of the y(i)'s? The probability of the data is given by p(~y | X; θ). This quantity is typically viewed as a function of ~y (and perhaps X), for a fixed value of θ. When we wish to explicitly view it as a function of θ, we will instead call it the likelihood function L(θ). Note that by the independence assumption on the ε(i)'s (and hence also the y(i)'s given the x(i)'s), the likelihood can also be written as the product over i of p(y(i) | x(i); θ).

Now, given this probabilistic model relating the y(i)'s and the x(i)'s, what is a reasonable way of choosing our best guess of the parameters θ? The principle of maximum likelihood says that we should choose θ so as to make the data as high probability as possible; i.e., we should choose θ to maximize L(θ). Instead of maximizing L(θ), we can also maximize any strictly increasing function of L(θ); in particular, the derivations will be a bit simpler if we instead maximize the log likelihood ℓ(θ). Expanding ℓ(θ) under the Gaussian assumption shows that maximizing ℓ(θ) gives the same answer as minimizing (1/2) Σ_i (y(i) − θ^T x(i))², which we recognize as J(θ), our original least-squares cost function. Hence, under these probabilistic assumptions, least-squares regression is exactly the maximum likelihood estimator of θ.
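As a quick numerical illustration of that equivalence, the following sketch compares the Gaussian log likelihood and the least-squares cost for two arbitrary candidate parameter vectors, reusing X and y from the first snippet; the value of σ and the candidate θ's are illustrative assumptions.

    import numpy as np

    def least_squares_cost(theta, X, y):
        return 0.5 * np.sum((X @ theta - y) ** 2)

    def gaussian_log_likelihood(theta, X, y, sigma=50.0):
        resid = y - X @ theta
        n = len(y)
        # log of prod_i N(y(i); theta^T x(i), sigma^2)
        return -n * np.log(np.sqrt(2 * np.pi) * sigma) - np.sum(resid ** 2) / (2 * sigma ** 2)

    for theta in (np.array([10.0, 0.10]), np.array([50.0, 0.15])):
        print(least_squares_cost(theta, X, y), gaussian_log_likelihood(theta, X, y))
    # The candidate with the smaller cost J has the larger log likelihood.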
Locally weighted linear regression. As the discussion above suggests, the choice of features is important to ensuring good performance of a learning algorithm. In this section, let us briefly talk about the locally weighted linear regression (LWR) algorithm which, assuming there is sufficient training data, makes the choice of features less critical. (This treatment will be brief, since you'll get a chance to explore some of the properties of the LWR algorithm yourself in the homework.)

In the original linear regression algorithm, to make a prediction at a query point x (i.e., to evaluate h(x)), we would fit θ to minimize Σ_i (y(i) − θ^T x(i))² and then output θ^T x. In contrast, the locally weighted linear regression algorithm fits θ to minimize Σ_i w(i) (y(i) − θ^T x(i))² and then outputs θ^T x. Here, the w(i)'s are non-negative valued weights. Intuitively, if w(i) is large for a particular value of i, then in picking θ we'll try hard to make (y(i) − θ^T x(i))² small; if w(i) is small, then the (y(i) − θ^T x(i))² error term will be pretty much ignored in the fit.

A fairly standard choice for the weights is

    w(i) = exp(−(x(i) − x)² / (2τ²)).

(If x is vector-valued, this is generalized to w(i) = exp(−(x(i) − x)^T (x(i) − x)/(2τ²)), or w(i) = exp(−(x(i) − x)^T Σ^(-1) (x(i) − x)/2), for an appropriate choice of τ or Σ.) Note that the weights depend on the particular point x at which we're trying to evaluate h(x): if |x(i) − x| is small, then w(i) is close to 1, and if |x(i) − x| is large, then w(i) is small. Hence, θ is chosen giving a much higher "weight" to the (errors on) training examples close to the query point x. (Note also that while the formula for the weights takes a form that is cosmetically similar to the density of a Gaussian distribution, the w(i)'s do not directly have anything to do with Gaussians; in particular, the w(i) are not random variables, normally distributed or otherwise.) τ is called the bandwidth parameter, and it controls how quickly the weight of a training example falls off with the distance of its x(i) from the query point x.

Locally weighted linear regression is the first example we're seeing of a non-parametric algorithm. The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the θ_j's), which are fit to the data; once we've fit the θ_j's and stored them away, we no longer need to keep the training data around to make future predictions. In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around. The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis h grows linearly with the size of the training set.
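A minimal sketch of a locally weighted prediction at a single scalar query point, using the Gaussian-shaped weights above; τ, the function name, and the weighted closed-form solve are illustrative choices made here, and X, y are the arrays from the first snippet.

    import numpy as np

    def lwr_predict(x_query, X, y, tau=300.0):
        # Weight each training example by its closeness to the query point.
        dists = X[:, 1] - x_query              # column 1 holds the living area
        w = np.exp(-dists ** 2 / (2.0 * tau ** 2))
        W = np.diag(w)
        # Fit theta to minimize sum_i w(i) (y(i) - theta^T x(i))^2, then output theta^T x.
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return np.array([1.0, x_query]) @ theta

    print(lwr_predict(2000.0, X, y))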
Classification and logistic regression. Let's now talk about the classification problem. This is just like the regression problem, except that the values y we now want to predict take on only a small number of discrete values. For now, we will focus on the binary classification problem in which y can take on only two values, 0 and 1. (Most of what we say here will also generalize to the multiple-class case.) For instance, if we are trying to build a spam classifier for email, then x(i) may be some features of a piece of email, and y may be 1 if it is a piece of spam mail, and 0 otherwise. 0 is also called the negative class and 1 the positive class, and they are sometimes also denoted by the symbols "-" and "+". Given x(i), the corresponding y(i) is also called the label for the training example.

We could approach the classification problem ignoring the fact that y is discrete-valued, and use our old linear regression algorithm to try to predict y given x. However, it is easy to construct examples where this method performs very poorly. Intuitively, it also doesn't make sense for hθ(x) to take values larger than 1 or smaller than 0 when we know that y ∈ {0, 1}. So instead we change the form of the hypothesis to hθ(x) = g(θ^T x), where g(z) = 1/(1 + e^(−z)) is the logistic (sigmoid) function; this is the logistic regression model. In this classification setting, y | x; θ ∼ Bernoulli(φ), for an appropriate definition of φ as a function of x and θ.

Let us assume that P(y = 1 | x; θ) = hθ(x) and P(y = 0 | x; θ) = 1 − hθ(x). Note that this can be written more compactly as

    p(y | x; θ) = (hθ(x))^y (1 − hθ(x))^(1−y).

Assuming that the n training examples were generated independently, we can then write down the likelihood of the parameters as the product of these terms, and as before it will be easier to maximize the log likelihood ℓ(θ). How do we maximize the likelihood? Similar to our derivation in the case of linear regression, we can use gradient ascent. Written in vectorial notation, our updates will therefore be given by θ := θ + α ∇_θ ℓ(θ). (Note the positive rather than negative sign in the update formula, since we're maximizing, rather than minimizing, a function now.) Let's start by working with just one training example (x, y), and take derivatives to derive the stochastic gradient ascent rule; using the fact that g′(z) = g(z)(1 − g(z)), this gives

    θ_j := θ_j + α (y(i) − hθ(x(i))) x_j(i).

If we compare this to the LMS update rule, we see that it looks identical; but this is not the same algorithm, because hθ(x(i)) is now defined as a non-linear function of θ^T x(i). Nonetheless, it's a little surprising that we end up with the same update rule for a rather different algorithm and learning problem. Is this coincidence, or is there a deeper reason behind this? We'll answer this when we get to GLM models.
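A minimal sketch of logistic regression trained with the gradient ascent rule above. The binary labels, the standardized feature, the learning rate, and the iteration count are all illustrative assumptions introduced here (the housing table has no class labels); living_area comes from the first snippet.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Standardize the living area, attach an intercept column, and invent binary labels.
    x_std = (living_area - living_area.mean()) / living_area.std()
    Xc = np.column_stack([np.ones_like(x_std), x_std])
    y01 = np.array([1.0, 0.0, 0.0, 1.0, 1.0])      # hypothetical labels, for illustration only

    def logistic_regression(Xc, y01, alpha=0.1, num_iters=1000):
        theta = np.zeros(Xc.shape[1])
        for _ in range(num_iters):
            grad = Xc.T @ (y01 - sigmoid(Xc @ theta))   # gradient of the log likelihood
            theta = theta + alpha * grad                # ascent: note the plus sign
        return theta

    theta_logreg = logistic_regression(Xc, y01)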
The perceptron learning algorithm. As a digression, consider modifying the logistic regression method to "force" it to output values that are exactly either 0 or 1. To do so, it seems natural to change the definition of g to be the threshold function: g(z) = 1 if z ≥ 0, and g(z) = 0 otherwise. If we then let hθ(x) = g(θ^T x) as before but using this modified definition of g, and if we use the update rule θ_j := θ_j + α (y(i) − hθ(x(i))) x_j(i), then we have the perceptron learning algorithm. (See also the extra credit problem on Q3 of problem set 1.)

Newton's method. Returning to logistic regression, let's now look at a different algorithm for maximizing ℓ(θ). Newton's method gives a way of getting to f(θ) = 0 for a function f: starting from an initial guess, it repeatedly performs the update θ := θ − f(θ)/f′(θ). In the one-dimensional example plotted in the notes, the first iteration moves θ to the x-intercept of the tangent line, which is about 2.8; one more iteration updates θ to about 1.8; and after a few more iterations, we rapidly approach θ = 1.3, the zero of f. The maxima of ℓ correspond to points where its first derivative ℓ′(θ) is zero. So, by letting f(θ) = ℓ′(θ), we can use the same algorithm to maximize ℓ, and we obtain the update rule θ := θ − ℓ′(θ)/ℓ′′(θ). (Something to think about: how would this change if we wanted to use Newton's method to minimize rather than maximize a function?)

Lastly, in our logistic regression setting, θ is vector-valued, so we need to generalize Newton's method to this setting. The generalization of Newton's method to this multidimensional setting (also called the Newton-Raphson method) is given by

    θ := θ − H^(-1) ∇_θ ℓ(θ).

Here, ∇_θ ℓ(θ) is, as usual, the vector of partial derivatives of ℓ(θ) with respect to the θ_j's, and H is a d-by-d matrix (actually (d+1)-by-(d+1), assuming that we include the intercept term) called the Hessian, whose entries are given by the second partial derivatives H_jk = ∂²ℓ(θ)/∂θ_j ∂θ_k.

Newton's method typically enjoys faster convergence than (batch) gradient descent, and requires many fewer iterations to get very close to the minimum. One iteration of Newton's method can, however, be more expensive than one iteration of gradient descent, since it requires finding and inverting a d-by-d Hessian; but so long as d is not too large, it is usually much faster overall. When Newton's method is applied to maximize the logistic regression log likelihood function ℓ(θ), the resulting method is also called Fisher scoring.
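A minimal sketch of the multidimensional Newton update for the logistic regression log likelihood, reusing Xc and y01 from the previous snippet; the gradient and Hessian expressions are the standard ones for this model, and the iteration count is an arbitrary choice.

    import numpy as np

    def newton_logistic(Xc, y01, num_iters=10):
        theta = np.zeros(Xc.shape[1])
        for _ in range(num_iters):
            h = 1.0 / (1.0 + np.exp(-(Xc @ theta)))
            grad = Xc.T @ (y01 - h)                   # grad_theta ell(theta)
            H = -(Xc.T * (h * (1.0 - h))) @ Xc        # Hessian of ell(theta)
            theta = theta - np.linalg.solve(H, grad)  # theta := theta - H^{-1} grad
        return theta

    theta_newton = newton_logistic(Xc, y01)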
Generalized Linear Models and the exponential family. In the regression example, we had y | x; θ ∼ N(μ, σ²), and in the classification one, y | x; θ ∼ Bernoulli(φ), for some appropriate definitions of μ and φ as functions of x and θ. In this section, we will show that both of these methods are special cases of a broader family of models, called Generalized Linear Models (GLMs), and we will also show how other models in the GLM family can be derived and applied to other classification and regression problems. (See Jordan, Learning in Graphical Models (unpublished book draft), and also McCullagh and Nelder, Generalized Linear Models (2nd ed.).)

To work our way up to GLMs, we will begin by defining exponential family distributions. We say that a class of distributions is in the exponential family if it can be written in the form

    p(y; η) = b(y) exp(η^T T(y) − a(η)).

Here, η is called the natural parameter (also called the canonical parameter) of the distribution; T(y) is the sufficient statistic (for the distributions we consider, it will often be the case that T(y) = y); and a(η) is the log partition function. The quantity e^(−a(η)) essentially plays the role of a normalization constant, making sure the distribution p(y; η) sums/integrates over y to 1. A fixed choice of T, a and b defines a family (or set) of distributions that is parameterized by η; as we vary η, we then get different distributions within this family.

We now show that the Bernoulli and the Gaussian distributions are examples of exponential family distributions. The Bernoulli distribution with mean φ, written Bernoulli(φ), specifies a distribution over y ∈ {0, 1}, so that p(y = 1; φ) = φ and p(y = 0; φ) = 1 − φ. As we vary φ, we obtain Bernoulli distributions with different means. We now show that this class of Bernoulli distributions, the ones obtained by varying φ, is in the exponential family; i.e., that there is a choice of T, a and b so that the form above becomes exactly the class of Bernoulli distributions. Writing the Bernoulli probability as exp(y log(φ/(1−φ)) + log(1−φ)) exhibits this form, with natural parameter η = log(φ/(1−φ)), T(y) = y, a(η) = log(1 + e^η), and b(y) = 1; a similar calculation works for the Gaussian. This construction is also what answers the earlier question of why linear regression and logistic regression end up with updates of the same form: both are GLMs built on exponential family distributions.
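A small numerical check of the Bernoulli claim above (a sketch; the particular value of φ is arbitrary): with η = log(φ/(1−φ)), T(y) = y, a(η) = log(1 + e^η), and b(y) = 1, the exponential-family expression reproduces the Bernoulli probabilities.

    import numpy as np

    phi = 0.3
    eta = np.log(phi / (1.0 - phi))      # natural parameter
    a = np.log(1.0 + np.exp(eta))        # log partition function, equal to -log(1 - phi)

    for y_val in (0, 1):
        exp_family = np.exp(eta * y_val - a)                    # b(y) = 1
        bernoulli = phi ** y_val * (1.0 - phi) ** (1 - y_val)   # p(y; phi)
        print(y_val, exp_family, bernoulli)                     # the two values agree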
Part IV: Generative learning algorithms. So far, we've mainly been talking about learning algorithms that model p(y | x; θ), the conditional distribution of y given x; for instance, logistic regression modeled p(y | x; θ) as hθ(x) = g(θ^T x), where g is the sigmoid function. Algorithms of this kind are called discriminative. Generative learning algorithms instead model p(x | y) (and the class prior p(y)) and then use Bayes' rule to compute p(y | x). The main example developed in that set of notes is Gaussian discriminant analysis (GDA); its Section 2.1, "Why Gaussian discriminant analysis is like logistic regression," shows that the GDA model implies a logistic form for p(y = 1 | x), while making stronger assumptions about the data than logistic regression does.
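A minimal sketch of that generative recipe, using the standard Gaussian discriminant analysis setup with a shared covariance matrix; the maximum-likelihood parameter formulas are the usual ones, and the function names and data shapes (features of shape (n, d), binary labels of shape (n,)) are assumptions made here, not notation from the notes.

    import numpy as np

    def gda_fit(features, labels):
        phi = labels.mean()                          # class prior p(y = 1)
        mu0 = features[labels == 0].mean(axis=0)     # class-conditional means
        mu1 = features[labels == 1].mean(axis=0)
        centered = features - np.where(labels[:, None] == 1, mu1, mu0)
        sigma = centered.T @ centered / len(labels)  # shared covariance matrix
        return phi, mu0, mu1, sigma

    def gda_posterior(x, phi, mu0, mu1, sigma):
        # p(y = 1 | x) by Bayes' rule; the shared Gaussian normalizers cancel.
        sigma_inv = np.linalg.inv(sigma)
        score = lambda mu, prior: -0.5 * (x - mu) @ sigma_inv @ (x - mu) + np.log(prior)
        a1, a0 = score(mu1, phi), score(mu0, 1.0 - phi)
        return 1.0 / (1.0 + np.exp(a0 - a1))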
Part V: Support Vector Machines and kernel methods. This set of notes presents the Support Vector Machine (SVM) learning algorithm. SVMs are among the best (and many believe are indeed the best) "off-the-shelf" supervised learning algorithms. To tell the SVM story, we'll need to first talk about margins and the idea of separating data with a large "gap"; the rest of that note builds on finding a separator with a large margin between the two classes.

In the updated notes (CS229 Lecture Notes, Andrew Ng, updated by Tengyu Ma on April 21, 2019), Part V covers kernel methods, beginning with feature maps (Section 1.1). Recall that in our discussion about linear regression, we considered the problem of predicting the price of a house (denoted by y) from the living area of the house (denoted by x), and we fit a linear function of x to the training data. What if we instead want to fit a non-linear, say cubic, function of x? We can still use the machinery of linear regression by viewing the cubic function as a linear function over a different set of feature variables: define the feature map φ(x) = (1, x, x², x³), and fit a linear model in φ(x) instead of x.
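A minimal sketch of that feature-map idea, fitting a cubic function of living area by ordinary least squares on φ(x) = (1, x, x², x³). Scaling the living area to thousands of square feet and the choice of query point are numerical conveniences introduced here; living_area and price come from the first snippet.

    import numpy as np

    x_scaled = living_area / 1000.0                     # thousands of square feet
    Phi = np.column_stack([np.ones_like(x_scaled), x_scaled, x_scaled ** 2, x_scaled ** 3])
    theta_cubic, *_ = np.linalg.lstsq(Phi, price, rcond=None)   # least squares on phi(x)

    x_query = 2.0                                       # a 2000 square-foot house
    phi_query = np.array([1.0, x_query, x_query ** 2, x_query ** 3])
    print(phi_query @ theta_cubic)                      # predicted price (1000$s)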
Deep learning. The deep learning notes (CS229 Lecture Notes, Andrew Ng and Kian Katanforoosh, with backpropagation updated by Anand Avati) begin our study of deep learning: they give an overview of neural networks, discuss vectorization, and discuss training neural networks with backpropagation, starting small and slowly building up a neural network step by step.

Unsupervised learning: k-means and the EM algorithm (Part IX). In the clustering problem, we are given a training set {x(1), ..., x(m)} and want to group the data into a few cohesive "clusters." Here, each x(i) ∈ R^n as usual, but no labels y(i) are given, so this is an unsupervised learning problem. The k-means clustering algorithm handles this by alternating between two steps: assign each point to its closest cluster centroid, then move each centroid to the mean of the points currently assigned to it, repeating until convergence. In the previous set of notes, we talked about the EM algorithm as applied to fitting a mixture of Gaussians; in this set of notes (Part IX), we give a broader view of the EM algorithm, and show how it can be applied to a large family of estimation problems with latent variables.
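A minimal sketch of that k-means loop; k, the iteration count, the random initialization, and the reuse of the two housing columns as 2-D points are all illustrative choices, not details from the notes.

    import numpy as np

    def kmeans(points, k=2, num_iters=20, seed=0):
        rng = np.random.default_rng(seed)
        # Initialize the cluster centroids to k randomly chosen training points.
        centroids = points[rng.choice(len(points), size=k, replace=False)]
        for _ in range(num_iters):
            # Assignment step: index of the closest centroid for every point.
            dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Update step: move each centroid to the mean of its assigned points.
            for j in range(k):
                if np.any(labels == j):
                    centroids[j] = points[labels == j].mean(axis=0)
        return centroids, labels

    data = np.column_stack([living_area, price])   # reuse the housing table as 2-D points
    centroids, labels = kmeans(data)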