{"id":4845,"date":"2020-02-20T14:32:53","date_gmt":"2020-02-20T14:32:53","guid":{"rendered":"https:\/\/data36.com\/?p=4845"},"modified":"2022-09-13T11:38:02","modified_gmt":"2022-09-13T11:38:02","slug":"linear-regression-in-python-numpy-polyfit","status":"publish","type":"post","link":"https:\/\/data36.com\/linear-regression-in-python-numpy-polyfit\/","title":{"rendered":"Linear Regression in Python using numpy + polyfit (with code base)"},"content":{"rendered":"\n<p><strong>I always say that learning linear regression in Python is the best first step towards machine learning.<\/strong> Linear regression is simple and easy to understand even if you are relatively new to data science. So spend time on 100% understanding it! If you get a grasp on its logic, it will serve you as a great foundation for more complex machine learning concepts in the future.<\/p>\n\n\n\n<p>In this tutorial, I&#8217;ll show you everything you&#8217;ll need to know about it: the mathematical background, different use-cases and most importantly the implementation. We will do that in Python &#8212; by using <code>numpy<\/code> (<code>polyfit<\/code>).<\/p>\n\n\n\n<p><em>Note: This is a hands-on tutorial. I highly recommend doing the coding part with me! 
If you haven\u2019t done so yet, you might want to go through these articles first:<\/em><\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><a href=\"https:\/\/data36.com\/data-coding-101-install-python-sql-r-bash\/\">How to install Python, R, SQL and bash to practice data science!<\/a><\/li><li><a href=\"https:\/\/data36.com\/python-libraries-packages-data-scientists\/\">Python libraries and packages for Data Scientists<\/a><\/li><li><a href=\"https:\/\/data36.com\/learn-python-for-data-science-from-scratch\/\">Learn Python from Scratch<\/a><\/li><\/ol>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Download the code base!<\/strong><\/h2>\n\n\n\n<p>Find the whole code base for this article (in Jupyter Notebook format) here:<\/p>\n\n\n\n<p><a href=\"https:\/\/github.com\/tomimester\/linear-regression-in-python-tutorial\/blob\/master\/Linear%20Regression%20in%20Python%20-%20using%20numpy%20polyfit.ipynb\">Linear Regression in Python (using Numpy polyfit)<\/a><\/p>\n\n\n\n<p>Download it from: <a href=\"https:\/\/github.com\/tomimester\/linear-regression-in-python-tutorial\/archive\/master.zip\">here<\/a>.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The mathematical background<\/strong><\/h2>\n\n\n\n<p>Remember when you learned about <em>linear functions<\/em> in math classes?<br>I have good news: that knowledge will become useful after all!<\/p>\n\n\n\n<p><strong>Here&#8217;s a quick recap<\/strong>!<\/p>\n\n\n\n<p>For linear functions, we have this formula:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = a*x + b<\/pre>\n\n\n\n<p>In this equation, usually, <code>a<\/code> and <code>b<\/code> are given. 
E.g:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">a = 2<br>b = 5<\/pre>\n\n\n\n<p>So:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 2*x + 5<\/pre>\n\n\n\n<p>Knowing this, you can easily calculate all <code>y<\/code> values for given <code>x<\/code> values.<\/p>\n\n\n\n<p>E.g.<\/p>\n\n\n\n<figure class=\"wp-block-table\"><table><tbody><tr><td>when <code>x<\/code> is&#8230;<\/td><td><code>y<\/code> is&#8230;<\/td><\/tr><tr><td>0<\/td><td>2*0 + 5 = 5<\/td><\/tr><tr><td>1<\/td><td>2*1 + 5 = 7<\/td><\/tr><tr><td>2<\/td><td>2*2 + 5 = 9<\/td><\/tr><tr><td>3<\/td><td>2*3 + 5 = 11<\/td><\/tr><tr><td>4<\/td><td>2*4 + 5 = 13<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p>&#8230;<\/p>\n\n\n\n<p>If you put all the <code>x<\/code>&#8211;<code>y<\/code> value pairs on a graph, you&#8217;ll get a straight line:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img fetchpriority=\"high\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-1024x683.png\" alt=\"linear function example\" class=\"wp-image-4846\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example.png 1440w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>The relationship between <code>x<\/code> and <code>y<\/code> is <em>linear<\/em>.<\/p>\n\n\n\n<p>Using the equation of this specific line (<code>y = 2 * x + 5<\/code>), if you change <code>x<\/code> by <code>1<\/code>, <code>y<\/code> will always change by 
<code>2<\/code>.<\/p>\n\n\n\n<p>And it doesn&#8217;t matter what <code>a<\/code> and <code>b<\/code> values you use: your graph will always show the same characteristics: it will always be a straight line, only its position and slope change. It also means that <code>x<\/code> and <code>y<\/code> will always be in a linear relationship.<\/p>\n\n\n\n<p>In the linear function formula:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = a*x + b<\/pre>\n\n\n\n<ul class=\"wp-block-list\"><li>The <code>a<\/code> variable is often called <em>slope<\/em> because &#8211; indeed &#8211; it defines the slope of the red line.<\/li><li>The <code>b<\/code> variable is called the <em>intercept<\/em>. <code>b<\/code> is the value where the plotted line intersects the y-axis. (Or in other words, the value of <code>y<\/code> is <code>b<\/code> when <code>x = 0<\/code>.)<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-1024x683.png\" alt=\"linear function example slope intercept\" class=\"wp-image-4847\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-function-example-slope-intercept.png 1440w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>This is all you have to know about linear functions for now&#8230;<\/p>\n\n\n\n<p><strong>But why did I 
talk so much about them?<\/strong><\/p>\n\n\n\n<p><strong>Because linear regression is nothing else but finding the exact linear function equation (that is: finding the <code>a<\/code> and <code>b<\/code> values in the <code>y = a*x + b<\/code> formula) that fits your data points the best.<\/strong><\/p>\n\n\n\n<p><em>Note: Here&#8217;s some advice if you are not 100% sure about the math. The most intuitive way to understand the linear function formula is to play around with its values. Change the <\/em><code><em>a<\/em><\/code><em> and <\/em><code><em>b<\/em><\/code><em> variables above, calculate the new <\/em><code><em>x-y<\/em><\/code><em> value pairs and draw the new graph. Repeat this as many times as necessary. (Tip: try out what happens when <\/em><code><em>a = 0<\/em><\/code><em> or <\/em><code><em>b = 0<\/em><\/code><em>!) By seeing the changes in the value pairs and on the graph, sooner or later, everything will fall into place.<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>A typical linear regression example<\/strong><\/h2>\n\n\n\n<p><strong>Machine learning &#8211; just like statistics &#8211; is all about abstractions. You want to simplify reality so you can describe it with a mathematical formula. But to do so, you have to ignore natural variance &#8212; and thus compromise on the accuracy of your model.<\/strong><\/p>\n\n\n\n<p>If this sounds too theoretical or philosophical, here&#8217;s a typical linear regression example!<\/p>\n\n\n\n<p>We have 20 students in a class and we have data about a specific exam they have taken. 
Each student is represented by a blue dot on this scatter plot:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li>the <strong>X axis<\/strong> shows how many hours a student studied for the exam<\/li><li>the <strong>Y axis<\/strong> shows the scores that she eventually got<\/li><\/ul>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" src=\"https:\/\/lh5.googleusercontent.com\/UghAefe1aZXuj_vpNTkjLjMX90R1RGykhNH4UIQ5V9JOq_AKRXRJNpKV788xKsWXs3Sr-hfODnPV8adkdXBD76zpfwl11A38empJQtTsaXWCFYbtLDSThRQYXH06Uve4WuX6bJZm\" alt=\"\"\/><\/figure>\n\n\n\n<p>E.g. she studied 24 hours and her test result was 58%:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img decoding=\"async\" width=\"1024\" height=\"714\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-1024x714.png\" alt=\"linear regression scatter plot example one student\" class=\"wp-image-4851\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-1024x714.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-300x209.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-768x535.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-973x678.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student-508x354.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-scatter-plot-example-one-student.png 1142w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>We have 20 data points (20 students) here.<\/p>\n\n\n\n<p>By looking at the whole data set, you can intuitively tell that there must be a correlation between the two factors. If one studies more, she&#8217;ll get better results on her exam. 
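<\/p>\n\n\n\n<p><em>Note: you can back up this intuition with a number, too. The following sketch computes the Pearson correlation coefficient for this data set (the hours and test results below are the same 20 value pairs that appear later in this article):<\/em><\/p>\n\n\n\n

```python
import numpy as np

# The 20 students' study hours and test results (the same data
# set that is used throughout this article):
hours = [29, 9, 10, 38, 16, 26, 50, 10, 30, 33,
         43, 2, 39, 15, 44, 29, 41, 15, 24, 50]
test_results = [65, 7, 8, 76, 23, 56, 100, 3, 74, 48,
                73, 0, 62, 37, 74, 40, 90, 42, 58, 100]

# Pearson correlation coefficient: +1 would be a perfect increasing
# linear relationship, 0 would mean no linear relationship at all.
r = np.corrcoef(hours, test_results)[0, 1]
print(round(r, 2))
```

\n\n\n\n<p><em>A value this close to +1 confirms the strong positive relationship you can see on the scatter plot.<\/em><\/p>\n\n\n\n<p>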
But you can see the natural variance, too. For instance, these 3 students who studied for ~30 hours got very different scores: 74%, 65% and 40%.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-1024x683.png\" alt=\"scatter plot spread\" class=\"wp-image-4852\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/scatter-plot-spread.png 1080w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Anyway, let&#8217;s fit a line to our data set &#8212; using linear regression:<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-1024x683.png\" alt=\"linear regression fitted line\" class=\"wp-image-4853\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line.png 1080w\" 
sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>Nice, we got a line that we can describe with a mathematical equation &#8211; this time, with a linear function. The general formula was:&nbsp;<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = a * x + b<\/pre>\n\n\n\n<p>And in this specific case, the <code>a<\/code> and <code>b<\/code> values of this line are:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">a = 2.01\nb = -3.9<\/pre>\n\n\n\n<p>So the exact equation for the line that fits this dataset is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 2.01*x - 3.9<\/pre>\n\n\n\n<p>And how did I get these <code>a<\/code> and <code>b<\/code> values? By using machine learning.<\/p>\n\n\n\n<p><strong>If you know enough <code>x<\/code>&#8211;<code>y<\/code> value pairs in a dataset like this one, you can use linear regression machine learning algorithms to figure out the exact mathematical equation (so the <code>a<\/code> and <code>b<\/code> values) of your linear function.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Linear regression terminology<\/strong><\/h2>\n\n\n\n<p>Before we go further, I want to talk about the terminology itself &#8212; because I see that it confuses many aspiring data scientists. 
Let&#8217;s fix that here!<\/p>\n\n\n\n<p>Okay, so one last time, this was our linear function formula:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = a*x + b<\/pre>\n\n\n\n<p><strong><span style=\"text-decoration: underline;\">The <\/span><code><span style=\"text-decoration: underline;\">a<\/span><\/code><span style=\"text-decoration: underline;\"> and <\/span><code><span style=\"text-decoration: underline;\">b<\/span><\/code><span style=\"text-decoration: underline;\"> variables:<\/span><\/strong><\/p>\n\n\n\n<p>The <code>a<\/code> and <code>b<\/code> variables in this equation define the position of your regression line and I&#8217;ve already mentioned that the <code>a<\/code> variable is called <strong><em>slope<\/em><\/strong> (because it defines the slope of your line) and the <code>b<\/code> variable is called <strong><em>intercept<\/em><\/strong>.<\/p>\n\n\n\n<p>In the machine learning community the <code>a<\/code> variable (the <strong><em>slope<\/em><\/strong>) is also often called the <strong><em>regression coefficient<\/em><\/strong><em>.<\/em><\/p>\n\n\n\n<p><strong><span style=\"text-decoration: underline;\">The <\/span><code><span style=\"text-decoration: underline;\">x<\/span><\/code><span style=\"text-decoration: underline;\"> and <\/span><code><span style=\"text-decoration: underline;\">y<\/span><\/code><span style=\"text-decoration: underline;\"> variables:<\/span><\/strong><\/p>\n\n\n\n<p>The <code>x<\/code> variable in the equation is the <strong><em>input variable<\/em><\/strong> &#8212; and <code>y<\/code> is the <strong><em>output variable<\/em><\/strong>.<br>This is also a very intuitive naming convention. 
For instance, in this equation:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 2.01*x - 3.9<\/pre>\n\n\n\n<p>If your <strong><em>input value<\/em><\/strong> is <code>x = 1<\/code>, your <strong><em>output value<\/em><\/strong> will be <code>y = -1.89<\/code>.<\/p>\n\n\n\n<p>But in machine learning these <code>x-y<\/code> value pairs have many alternative names\u2026 which can cause some headaches. So here are a few common synonyms that you should know:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong><em>input variable<\/em><\/strong> (<code>x<\/code>) &#8211; <strong><em>output variable<\/em><\/strong> (<code>y<\/code>)<\/li><li><em><strong>independent variable<\/strong><\/em> (<code>x<\/code>) &#8211; <strong><em>dependent variable<\/em><\/strong> (<code>y<\/code>)<\/li><li><strong><em>predictor variable<\/em><\/strong> (<code>x<\/code>) &#8211; <strong><em>predicted variable<\/em><\/strong> (<code>y<\/code>)<\/li><li><strong><em>feature<\/em><\/strong> (<code>x<\/code>) &#8211; <strong><em>target<\/em><\/strong> (<code>y<\/code>)<\/li><\/ul>\n\n\n\n<p>See, the confusion is not an accident\u2026 But at least, now you have your linear regression dictionary here.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>How does linear regression become useful?<\/strong><\/h2>\n\n\n\n<p>Having a mathematical formula &#8211; even if it doesn&#8217;t 100% perfectly fit your data set &#8211; is useful for many reasons.<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li><strong>Predictions:<\/strong> Based on your linear regression model, if a student tells you how much she studied for the exam, you can come up with a pretty good estimate: you can predict her results even before she writes the test. 
Let&#8217;s say someone studied <code>20<\/code> hours; it means that her predicted test result will be <code>2.01&nbsp;* 20 - 3.9 = 36.3<\/code>.<br><\/li><li><strong>Outliers:<\/strong> If something unexpected shows up in your dataset &#8211; someone is way too far from the expected range&#8230;<\/li><\/ol>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-1024x683.png\" alt=\"otlier detection with linear regression\" class=\"wp-image-4854\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/otlier-detection-with-linear-regression.png 1080w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>\u2026 let&#8217;s say, someone who studied only 18 hours but got almost 100% on the exam&#8230; Well, that student is either a genius &#8212; or a cheater. But she&#8217;s definitely worth the teachers&#8217; attention, right? 
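<\/p>\n\n\n\n<p><em>Note: both use-cases are easy to sketch in code. The snippet below reuses the fitted values a = 2.01 and b = -3.9 from this example; the 40-point outlier threshold is an arbitrary choice for the demo, not something the model gives you:<\/em><\/p>\n\n\n\n

```python
# Illustrative sketch, reusing the fitted values from this example.
a, b = 2.01, -3.9

def predicted_score(hours):
    """Predict a test score (%) from the hours studied."""
    return a * hours + b

# 1) Prediction: a student who studied 20 hours
print(round(predicted_score(20), 1))  # 36.3

# 2) Outlier check: flag a student whose actual score is far from
#    the prediction (the 40-point threshold is an arbitrary choice)
def is_outlier(hours, actual_score, threshold=40):
    return abs(actual_score - predicted_score(hours)) > threshold

print(is_outlier(18, 98))  # True: genius... or cheater
print(is_outlier(24, 58))  # False: within the expected spread
```

\n\n\n\n<p>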
\ud83d\ude42 By the way, in machine learning, the official name of these data points is <strong><em>outliers<\/em><\/strong>.<\/p>\n\n\n\n<p>And both of these examples can be translated very easily to real life business use-cases, too!<\/p>\n\n\n\n<p><strong><span style=\"text-decoration: underline;\">Predictions<\/span><\/strong> are used for: sales predictions, budget estimations, in manufacturing\/production, in the stock market and in many other places. (Although, usually these fields use more sophisticated models than simple linear regression.)<\/p>\n\n\n\n<p>Finding <strong><span style=\"text-decoration: underline;\">outliers<\/span><\/strong> is great for fraud detection. And it&#8217;s widely used in the fintech industry. (E.g. preventing credit card fraud.)<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>The limitations of machine learning models<\/strong><\/h2>\n\n\n\n<p>It&#8217;s good to know that even if you find a very well-fitting model for your data set, you have to count on some limitations. 
<\/p>\n\n\n\n<p><em>Note: These are true for essentially all machine learning algorithms &#8212; not only for linear regression.<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Limitation #1: a model is never a perfect fit<\/strong><\/h4>\n\n\n\n<p>As I said, fitting a line to a dataset is always an abstraction of reality. Describing something with a mathematical formula is sort of like reading the short summary of Romeo and Juliet. You&#8217;ll get the essence\u2026 but you will miss out on all the interesting, exciting and charming details.&nbsp;<\/p>\n\n\n\n<p>Similarly in <a href=\"https:\/\/data36.com\/what-is-data-science\/\">data science<\/a>, &#8220;compressing&#8221; your data into one simple linear function comes with losing the whole complexity of the dataset: you&#8217;ll ignore natural variance.<\/p>\n\n\n\n<p>But in many business cases, that can be a good thing. Your mathematical model will be simple enough that you can use it for your predictions and other calculations.&nbsp;<\/p>\n\n\n\n<p><em>Note: One big challenge of being a data scientist is to find the right balance between a too-simple and an overly complex model &#8212; so the model can be as accurate as possible. (This problem even has a name: <\/em><strong><em><a href=\"http:\/\/www.r2d3.us\/visual-intro-to-machine-learning-part-2\/\">bias-variance tradeoff<\/a><\/em><\/strong><em>, and I&#8217;ll write more about this in a later article.)<\/em><\/p>\n\n\n\n<p><strong>But a machine learning model &#8211; by definition &#8211; will never be 100% accurate.<\/strong><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>Limitation #2: you can&#8217;t go beyond the range of your historical data<\/strong><\/h4>\n\n\n\n<p>Many data scientists try to extrapolate their models and go beyond the range of their data.<\/p>\n\n\n\n<p>For instance, in our case study above, you had data about students studying for 0-50 hours. 
The dataset hasn&#8217;t featured any student who studied 60, 80 or 100 hours for the exam. These values are out of the range of your data. If you wanted to use your model to predict test results for these &#8220;extreme&#8221; <code>x<\/code> values\u2026 well, you would get nonsensical <code>y<\/code> values:<\/p>\n\n\n\n<p>E.g. your model would say that someone who has studied <code>x = 80<\/code> hours would get:<\/p>\n\n\n\n<p><code>y = 2.01*80 - 3.9 = 156.9%<\/code> on the test.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-1024x683.png\" alt=\"extrapolate issue\" class=\"wp-image-4855\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/extrapolate-issue.png 1200w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>&#8230;but 100% is the obvious maximum, right?<\/p>\n\n\n\n<p><strong>The point is that you can&#8217;t extrapolate your regression model beyond the scope of the data that you used to create it. Well, in theory, at least&#8230;<\/strong><\/p>\n\n\n\n<p>Because I have to admit that in real life data science projects, sometimes, there is no way around it. If you have data about the last 2 years of sales &#8212; and you want to predict the next month, you have to extrapolate. Even so, we always try to be very careful and don&#8217;t look too far into the future. 
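<\/p>\n\n\n\n<p><em>Note: you can see this danger in numbers, too. A minimal sketch, again reusing the a = 2.01 and b = -3.9 values from this example, shows how predictions far outside the 0-50 hour range blow past the 100% maximum:<\/em><\/p>\n\n\n\n

```python
a, b = 2.01, -3.9  # the fitted values from this example

# Evaluate the model inside and far outside the 0-50 hour range:
for hours in [20, 50, 80, 100]:
    score = a * hours + b
    flag = "  <-- impossible, above 100%" if score > 100 else ""
    print(f"{hours:>3} hours -> predicted score: {score:.1f}%{flag}")
```

\n\n\n\n<p><em>In a real project you would either cap such predictions at a sensible maximum or, better, avoid predicting far outside the range of your training data.<\/em><\/p>\n\n\n\n<p>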
The further you get from your historical data, the worse your model&#8217;s accuracy will be.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Linear Regression in Python<\/strong><\/h2>\n\n\n\n<p><strong>Okay, now that you know the theory of linear regression, it&#8217;s time to learn how to get it done in Python!<\/strong><\/p>\n\n\n\n<p>Let&#8217;s see how you can fit a simple linear regression model to a data set!<\/p>\n\n\n\n<p>Well, in fact, there is more than one way of implementing linear regression in Python. Here, I\u2019ll present my favorite &#8212; and in my opinion the most elegant &#8212; solution. I&#8217;ll use <strong><code>numpy<\/code><\/strong> and its <strong><code>polyfit<\/code><\/strong> method.<\/p>\n\n\n\n<p>We will go through these 6 steps:<\/p>\n\n\n\n<ol class=\"wp-block-list\"><li>Importing the Python libraries we will use<\/li><li>Getting the data<\/li><li>Defining <code>x<\/code> values (the input variable) and <code>y<\/code> values (the output variable)<\/li><li>Machine Learning: fitting the model<\/li><li>Interpreting the results (coefficient, intercept) and calculating the accuracy of the model<\/li><li>Visualization (plotting a graph)<\/li><\/ol>\n\n\n\n<p><em>Note: You might ask: &#8220;Why isn&#8217;t Tomi using <\/em><code><em>sklearn<\/em><\/code><em> in this tutorial?&#8221; I know that (in online tutorials at least) <\/em><code><em>Numpy<\/em><\/code><em> and its <\/em><code><em>polyfit<\/em><\/code><em> method are less popular than the Scikit-learn alternative\u2026 true. But in my opinion, <\/em><code><em>numpy<\/em><\/code><em>&#8217;s <\/em><code><em>polyfit<\/em><\/code><em> is more elegant, easier to learn &#8212; and easier to maintain in production! <\/em><code><em>sklearn<\/em><\/code><em>&#8217;s linear regression function changes all the time, so if you implement it in production and you update some of your packages, it can easily break. I don&#8217;t like that. 
Besides, the way it&#8217;s built and the extra data-formatting steps it requires seem somewhat strange to me. In my opinion, <\/em><code><em>sklearn<\/em><\/code><em> is highly confusing for people who are just getting started with Python machine learning algorithms. (By the way, I had the <\/em><code><em>sklearn LinearRegression<\/em><\/code><em> solution in this tutorial&#8230; but I removed it. That&#8217;s how much I don&#8217;t like it. So trust me, you&#8217;ll like <\/em><code><em>numpy<\/em><\/code><em> + <\/em><code><em>polyfit<\/em><\/code><em> better, too. :-))<\/em><\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Linear Regression in Python \u2013 using numpy + polyfit<\/h2>\n\n\n\n<p>Fire up a Jupyter Notebook and follow along with me!<\/p>\n\n\n\n<p><em>Note: Find the code base <a href=\"https:\/\/github.com\/tomimester\/linear-regression-in-python-tutorial\/blob\/master\/Linear%20Regression%20in%20Python%20-%20using%20numpy%20polyfit.ipynb\">here<\/a> and download it from <\/em><a href=\"https:\/\/github.com\/tomimester\/linear-regression-in-python-tutorial\/archive\/master.zip\"><em>here<\/em><\/a><em>.<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #1 &#8211; Importing the Python libraries<\/strong><\/h4>\n\n\n\n<p>Before anything else, you want to <a href=\"https:\/\/data36.com\/python-import-built-in-modules-data-science\/\">import<\/a> a few common <a href=\"https:\/\/data36.com\/python-libraries-packages-data-scientists\/\">data science libraries<\/a> that you will use in this little project:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>numpy<\/code><\/li><li><code>pandas<\/code> (you will store your data in pandas DataFrames)<\/li><li><code>matplotlib.pyplot<\/code> (you will use <code>matplotlib<\/code> to plot the data)<\/li><\/ul>\n\n\n\n<p><em>Note: if you haven&#8217;t installed these libraries and packages to your remote server, find out how to do that in <a 
href=\"https:\/\/data36.com\/python-libraries-packages-data-scientists\/\">this article<\/a>.<\/em><\/p>\n\n\n\n<p>Start with these few lines:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">import numpy as np\nimport pandas as pd\nimport matplotlib.pyplot as plt\n%matplotlib inline<\/pre>\n\n\n\n<p>(The <code>%matplotlib inline<\/code> is there so you can plot the charts right into your Jupyter Notebook.)<\/p>\n\n\n\n<p>To be honest, I almost always import all these libraries and modules at the beginning of my Python data science projects, by default. But apart from these, you won&#8217;t need any extra libraries: <code>polyfit<\/code> &#8212; that we will use for the machine learning step &#8212; is already imported with <code>numpy<\/code>.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"220\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-1024x220.png\" alt=\"numpy linear regression import\" class=\"wp-image-4870\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-1024x220.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-300x64.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-768x165.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-973x209.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import-508x109.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/numpy-linear-regression-import.png 1418w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #2 &#8211; Getting the data<\/strong><\/h4>\n\n\n\n<p>The next step is to get the data that you&#8217;ll work with. 
In this case study, I prepared the data and you just have to copy-paste these two lines to your Jupyter Notebook:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">students = {'hours': [29, 9, 10, 38, 16, 26, 50, 10, 30, 33, 43, 2, 39, 15, 44, 29, 41, 15, 24, 50],\n            'test_results': [65, 7, 8, 76, 23, 56, 100, 3, 74, 48, 73, 0, 62, 37, 74, 40, 90, 42, 58, 100]}\n\nstudent_data = pd.DataFrame(data=students)<\/pre>\n\n\n\n<p>This is the very same data set that I used for demonstrating a typical linear regression example at the beginning of the article. You know, with the students, the hours they studied and the test scores.<\/p>\n\n\n\n<p>Just print the <code>student_data<\/code> DataFrame and you&#8217;ll see the two columns with the value-pairs we used.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"517\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-1024x517.png\" alt=\"getting the data for linear regression in python\" class=\"wp-image-4860\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-1024x517.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-300x151.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-768x388.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-973x491.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python-508x257.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/getting-the-data-for-linear-regression-in-python.png 1806w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<ul class=\"wp-block-list\"><li>the <code>hours<\/code> column shows how many hours each student 
studied<\/li><li>and the <code>test_results<\/code> column shows what their test results were<\/li><\/ul>\n\n\n\n<p>(So one line is one student.)<\/p>\n\n\n\n<p>Of course, in real life projects, we instead open <code>.csv<\/code> files (with the <strong><a href=\"https:\/\/data36.com\/pandas-tutorial-1-basics-reading-data-files-dataframes-data-selection\/\">read_csv<\/a><\/strong> function) or SQL tables (with <code>read_sql<\/code>)&#8230; Regardless, the final format of the cleaned and prepared data will be a similar dataframe.<\/p>\n\n\n\n<p>So this is your data, you will fine-tune it and make it ready for the machine learning step.<\/p>\n\n\n\n<p><em>Note: And another thought about real life machine learning projects\u2026 In this tutorial, we are working with a clean dataset. That&#8217;s quite uncommon in real life data science projects. A big part of the data scientist&#8217;s job is data cleaning and data wrangling: like filling in missing values, removing duplicates, fixing typos, fixing incorrect character coding, etc. Just so you know.<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #3 &#8211; Defining the feature and target values<\/strong><\/h4>\n\n\n\n<p>Okay, so we have the data set.<\/p>\n\n\n\n<p>But we have to tweak it a bit &#8212; so it can be processed by <code>numpy<\/code>&#8216;s linear regression function.<\/p>\n\n\n\n<p>The next required step is to break the dataframe into:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><strong><em>input (x) values:<\/em><\/strong> this will be the <code>hours<\/code> column<\/li><li><strong><em>and output (y) values:<\/em><\/strong> and this is the <code>test_results<\/code> column<\/li><\/ul>\n\n\n\n<p><code>polyfit<\/code> requires you to define your input and output variables in 1-dimensional format. 
For that, you can use pandas Series.<\/p>\n\n\n\n<p>Let&#8217;s type this into the next cell of your Jupyter notebook:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">x = student_data.hours\ny = student_data.test_results<\/pre>\n\n\n\n<p>Okay, the <strong><em>input<\/em><\/strong> and <strong><em>output<\/em><\/strong> &#8212; or, using their fancy machine learning names, the <strong><em>feature<\/em><\/strong> and <strong><em>target<\/em><\/strong> &#8212; values are defined.<\/p>\n\n\n\n<p>At this step, we can even put them onto a scatter plot, to visually understand our dataset.<\/p>\n\n\n\n<p>It&#8217;s only one extra line of code:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">plt.scatter(x,y)<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"638\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-1024x638.png\" alt=\"feature target pandas series\" class=\"wp-image-4872\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-1024x638.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-300x187.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-768x479.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-973x606.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series-508x317.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/feature-target-pandas-series.png 1518w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>And I want you to realize one more thing here: so far, we have done zero machine learning\u2026 This was only old-fashioned data preparation.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #4 &#8211; Machine Learning: Linear Regression (line fitting)<\/strong><\/h4>\n\n\n\n<p>We have the <code>x<\/code> and <code>y<\/code> 
values&#8230; So we can fit a line to them!<\/p>\n\n\n\n<p>The process itself is pretty easy.<\/p>\n\n\n\n<p>Type this one line:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">model = np.polyfit(x, y, 1)<\/pre>\n\n\n\n<p>This executes the <code>polyfit<\/code> method from the <code>numpy<\/code> library that we have imported before. It needs three parameters: the previously defined input and output variables <code>(x, y)<\/code> \u2014 and an integer, too: <code>1<\/code>. <strong>This latter number defines the degree of the <em>polynomial<\/em> you want to fit.<\/strong><\/p>\n\n\n\n<p>Using <code>polyfit<\/code>, you can fit second, third, etc\u2026 degree polynomials to your dataset, too. (That&#8217;s not called <em>linear<\/em> regression anymore &#8212; but <em>polynomial<\/em> regression. Anyway, more about this in a later article&#8230;)<\/p>\n\n\n\n<p>But for now, let&#8217;s stick with linear regression and linear models &#8211; which will be a first degree polynomial. So you should just put: <code>1<\/code>.<\/p>\n\n\n\n<p>When you hit enter, Python calculates every parameter of your linear regression model and stores it into the <code>model<\/code> variable.<\/p>\n\n\n\n<p><strong>This is it, you are done with the machine learning step!<\/strong><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"117\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-1024x117.png\" alt=\"machine learning step numpy polyfit\" class=\"wp-image-4873\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-1024x117.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-300x34.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-768x88.png 768w, 
https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-973x111.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit-508x58.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/machine-learning-step-numpy-polyfit.png 1576w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><figcaption>Machine Learning is only one line&#8230;<\/figcaption><\/figure>\n\n\n\n<p><em>Note: isn&#8217;t it fascinating, all the hype around machine learning &#8212; especially now that it turns out to be less than 10% of your code? (In real life projects, it&#8217;s more like less than 1%.) The real (data) science in machine learning is what comes before it (data preparation, data cleaning) and what comes after it (interpreting, testing, validating and fine-tuning the model).<\/em><\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #4 side-note: What&#8217;s the math behind the line fitting?<\/strong><\/h4>\n\n\n\n<p>Now, of course, fitting the model was only one line of code &#8212; but I want you to see what&#8217;s under the hood. 
How did <strong>polyfit<\/strong> fit that line?<\/p>\n\n\n\n<p><strong>It used the ordinary least squares method (which is often referred to with its short form: OLS).<\/strong> It is one of the most commonly used estimation methods for linear regression. There are a few more. But the ordinary least squares method is easy to understand and also good enough in 99% of cases.<\/p>\n\n\n\n<p><strong>Let&#8217;s see how OLS works!<\/strong><\/p>\n\n\n\n<p>When you fit a line to your dataset, for most <code>x<\/code> values there is a difference between the <code>y<\/code> value that your model estimates &#8212; and the real <code>y<\/code> value that you have in your dataset. In machine learning, this difference is called <strong><em>error<\/em><\/strong>.<\/p>\n\n\n\n<p>Let&#8217;s see an example!<\/p>\n\n\n\n<p>Here&#8217;s a visual of our dataset (blue dots) and the linear regression model (red line) that you have just created. (I&#8217;ll show you soon how to plot this graph in Python &#8212; but let&#8217;s focus on OLS for now.)<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"683\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-1024x683.png\" alt=\"linear regression fitted line\" class=\"wp-image-4853\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-1024x683.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-300x200.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-768x512.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-973x649.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line-508x339.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/linear-regression-fitted-line.png 1080w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" 
\/><\/figure>\n\n\n\n<p>Let&#8217;s take a data point from our dataset.<\/p>\n\n\n\n<p><code>x = 24<\/code><\/p>\n\n\n\n<p>In the original dataset, the <code>y<\/code> value for this datapoint was <code>y = 58<\/code>. But when you fit a simple linear regression model, the model itself estimates only <code>y = 44.4<\/code>. The difference between the two is the <strong><em>error<\/em><\/strong> for this specific data point.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"697\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-1024x697.png\" alt=\"error in machine learning ols\" class=\"wp-image-4864\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-1024x697.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-300x204.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-768x523.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-973x663.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols-508x346.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-in-machine-learning-ols.png 1498w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>So the ordinary least squares method has these 4 steps:<\/p>\n\n\n\n<p>1) Let&#8217;s calculate all the errors between all data points and the model.<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"677\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-1024x677.png\" alt=\"error ordinary least squares ols\" class=\"wp-image-4865\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-1024x677.png 1024w, 
https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-300x198.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-768x508.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-973x643.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols-508x336.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/error-ordinary-least-squares-ols.png 1210w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>2) Let&#8217;s square each of these error values!<\/p>\n\n\n\n<p>3) Then sum all these squared values!<\/p>\n\n\n\n<p>4) <strong>Find the line where this sum of the squared errors is the smallest possible value.<\/strong><\/p>\n\n\n\n<p>That&#8217;s OLS and that&#8217;s how line fitting works in <code>numpy polyfit<\/code>&#8216;s linear regression solution.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #5 &#8211; Interpreting the results<\/strong><\/h4>\n\n\n\n<p>Okay, so you&#8217;re done with the machine learning part. Let&#8217;s see what you got!<\/p>\n\n\n\n<p>First, you can query the regression coefficient and intercept values for your model. 
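<\/p>\n\n\n\n<p><em>By the way, the four OLS steps above are easy to reproduce in code. Here&#8217;s a purely illustrative sketch (<\/em><code><em>polyfit<\/em><\/code><em> does all of this internally): it computes the sum of the squared errors for the fitted line, then checks that nudging the slope in either direction only makes that sum bigger:<\/em><\/p>

```python
import numpy as np

hours = np.array([29, 9, 10, 38, 16, 26, 50, 10, 30, 33, 43, 2, 39, 15, 44, 29, 41, 15, 24, 50])
results = np.array([65, 7, 8, 76, 23, 56, 100, 3, 74, 48, 73, 0, 62, 37, 74, 40, 90, 42, 58, 100])

a, b = np.polyfit(hours, results, 1)

def sum_of_squared_errors(slope, intercept):
    errors = results - (slope * hours + intercept)  # step 1: all the errors
    return np.sum(errors ** 2)                      # steps 2-3: square them, sum them

best = sum_of_squared_errors(a, b)
# step 4: the fitted line is the one where this sum is minimal,
# so any other line (e.g. a slightly different slope) does worse
print(best < sum_of_squared_errors(a + 0.1, b))  # True
print(best < sum_of_squared_errors(a - 0.1, b))  # True
```

<p><em>Okay, back to our fitted model.<\/em><\/p>\n\n\n\n<p>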
You just have to type:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">model<\/pre>\n\n\n\n<p><em>Note: Remember, <\/em><code><em>model<\/em><\/code><em> is a variable that we used at STEP #4 to store the output of <\/em><code><em>np.polyfit(x, y, 1)<\/em><\/code>.<\/p>\n\n\n\n<p>The output is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">array([ 2.01467487, -3.9057602 ])<\/pre>\n\n\n\n<p>These are the <code>a<\/code> and <code>b<\/code> values we were looking for in the linear function formula.<\/p>\n\n\n\n<p><code>2.01467487<\/code> is the regression coefficient (the <code>a<\/code> value) and <code>-3.9057602<\/code> is the intercept (the <code>b<\/code> value).<\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"188\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-1024x188.png\" alt=\"regression coefficient and intercept python numpy polyfit\" class=\"wp-image-4875\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-1024x188.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-300x55.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-768x141.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-973x178.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1-508x93.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/regression-coefficient-and-intercept-python-numpy-polyfit-1.png 1342w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>So we finally got our equation that describes the fitted line. 
It is:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">y = 2.01467487 * x - 3.9057602<\/pre>\n\n\n\n<p>If a student tells you how many hours she studied, you can predict her estimated test result. Quite awesome!<\/p>\n\n\n\n<p>You can do the calculation &#8220;manually&#8221; using the equation.<\/p>\n\n\n\n<p>But there is a simple function for it in <code>numpy<\/code> &#8212; it&#8217;s called <code>poly1d()<\/code>:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">predict = np.poly1d(model)\nhours_studied = 20\npredict(hours_studied)<\/pre>\n\n\n\n<p>The result is: <code>36.38773723<\/code><\/p>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"219\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-1024x219.png\" alt=\"predict numpy polyfit linear regression in python\" class=\"wp-image-4876\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-1024x219.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-300x64.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-768x164.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-973x208.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python-508x108.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/predict-numpy-polyfit-linear-regression-in-python.png 1330w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p><em>Note: This is the exact same result that you&#8217;d have gotten if you put the <\/em><code><em>hours_studied<\/em><\/code><em> value in the place of the <\/em><code><em>x<\/em><\/code><em> in the <\/em><code><em>y = 2.01467487 * x - 3.9057602<\/em><\/code><em> 
equation.<\/em><\/p>\n\n\n\n<p>So from this point on, you can use these coefficient and intercept values &#8211; and the <code>poly1d()<\/code> method &#8211; to estimate unknown values.<\/p>\n\n\n\n<p><strong>And this is how you do predictions by using machine learning and simple linear regression in Python.<\/strong><\/p>\n\n\n\n<p>Well, okay, one more thing&#8230;<\/p>\n\n\n\n<p>There are a few methods to calculate <strong>the accuracy of your model<\/strong>. In this article, I&#8217;ll show you only one: the <strong>R-squared (R<\/strong><sup><strong>2<\/strong><\/sup><strong>) value<\/strong>. I won&#8217;t go into the math here (this article has gotten pretty long already)&#8230; it&#8217;s enough if you know that the R-squared value is a number between 0 and 1. And the closer it is to 1 the more accurate your linear regression model is.<\/p>\n\n\n\n<p>Unfortunately, <strong>R-squared calculation is not implemented in <code>numpy<\/code><\/strong>\u2026 so that one should be borrowed from <code>sklearn<\/code> (so we can&#8217;t completely ignore Scikit-learn after all :-)):<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">from sklearn.metrics import r2_score\nr2_score(y, predict(x))<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"144\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-1024x144.png\" alt=\"sklearn r squared r2score\" class=\"wp-image-4877\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-1024x144.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-300x42.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-768x108.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-973x137.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score-508x72.png 508w, 
https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/sklearn-r-squared-r2score.png 1334w\" sizes=\"(max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>And now we know our R-squared value is <code>0.877<\/code>.<\/p>\n\n\n\n<p>That&#8217;s pretty nice!<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><strong>STEP #6 &#8211; Plotting the linear regression model<\/strong><\/h4>\n\n\n\n<p>Visualization is an optional step but I like it because it always helps to understand the relationship between our model and our actual data.<\/p>\n\n\n\n<p>Thanks to the fact that <code>numpy<\/code> and <code>polyfit<\/code> can handle 1-dimensional objects, too, this won&#8217;t be too difficult.<\/p>\n\n\n\n<p>Type this into the next cell of your Jupyter Notebook:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted\">x_lin_reg = range(0, 51)\ny_lin_reg = predict(x_lin_reg)\nplt.scatter(x, y)\nplt.plot(x_lin_reg, y_lin_reg, c = 'r')<\/pre>\n\n\n\n<figure class=\"wp-block-image\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"699\" src=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-1024x699.png\" alt=\"dataviz linear regression in python plot numpy\" class=\"wp-image-4878\" srcset=\"https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-1024x699.png 1024w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-300x205.png 300w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-768x525.png 768w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-973x665.png 973w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy-508x347.png 508w, https:\/\/data36.com\/wp-content\/uploads\/2020\/02\/dataviz-linear-regression-in-python-plot-numpy.png 1350w\" sizes=\"(max-width: 1024px) 100vw, 
1024px\" \/><\/figure>\n\n\n\n<p>Here&#8217;s a quick explanation:<\/p>\n\n\n\n<ul class=\"wp-block-list\"><li><code>x_lin_reg = range(0, 51)<\/code><br>This sets the range you want to display the linear regression model over \u2014 in our case it\u2019s between 0 and 50 hours.<\/li><li><code>y_lin_reg = predict(x_lin_reg)<\/code><br>This calculates the <code>y<\/code> values for all the <code>x<\/code> values between <code>0<\/code> and <code>50<\/code>.<\/li><li><code>plt.scatter(x, y)<\/code><br>This plots your original dataset on a scatter plot. (The blue dots.)<\/li><li><code>plt.plot(x_lin_reg, y_lin_reg, c = 'r')<\/code><br>And this line eventually plots the linear regression model \u2014 based on the <code>x_lin_reg<\/code> and <code>y_lin_reg<\/code> values that we set in the previous two lines. (<code>c = 'r'<\/code> means that the color of the line will be <em>red<\/em>.)<\/li><\/ul>\n\n\n\n<p>Nice, you are done:<strong> this is how you create a linear regression model in Python using <code>numpy<\/code> and <code>polyfit<\/code>.<\/strong><\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>This was only your first step toward machine learning<\/strong><\/h2>\n\n\n\n<p>You are done with building a linear regression model!<\/p>\n\n\n\n<p>But this was only the first step. In fact, this was only <em>simple<\/em> linear regression: there is also <em>multiple<\/em> linear regression (where you can have multiple input variables), there is <a href=\"https:\/\/data36.com\/polynomial-regression-python-scikit-learn\/\"><em>polynomial<\/em> regression<\/a> (where you can fit higher degree polynomials) and many more regression models that you should learn. 
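<\/p>\n\n\n\n<p><em>For instance, polynomial regression needs no new tools at all &#8212; the very same <\/em><code><em>polyfit<\/em><\/code><em> call fits a higher degree polynomial if you change its last argument. A quick, purely illustrative sketch on the same data:<\/em><\/p>

```python
import numpy as np

hours = [29, 9, 10, 38, 16, 26, 50, 10, 30, 33, 43, 2, 39, 15, 44, 29, 41, 15, 24, 50]
results = [65, 7, 8, 76, 23, 56, 100, 3, 74, 48, 73, 0, 62, 37, 74, 40, 90, 42, 58, 100]

# degree 2 instead of 1 -> a quadratic model (polynomial regression)
quad_model = np.polyfit(hours, results, 2)
predict_quad = np.poly1d(quad_model)

print(len(quad_model))   # 3 coefficients now: a*x^2 + b*x + c
print(predict_quad(20))  # the quadratic model's estimate for 20 hours of studying
```

<p><em>But that&#8217;s a topic for another article.<\/em><\/p>\n\n\n\n<p>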
Not to mention the different <a href=\"https:\/\/data36.com\/random-forest-in-python\/\">classification models<\/a>, <a href=\"https:\/\/data36.com\/k-means-clustering-scikit-learn-python\/\">clustering methods<\/a> and so on\u2026<\/p>\n\n\n\n<p>Here, I haven&#8217;t covered the validation of a machine learning model (e.g. when you break your dataset into a training set and a test set), either. But I&#8217;m planning to write a separate tutorial about that, too.<\/p>\n\n\n\n<p>Anyway, I&#8217;ll get back to all of these here on the blog!<\/p>\n\n\n\n<p>So stay tuned!<\/p>\n\n\n\n<h2 class=\"wp-block-heading\"><strong>Conclusion<\/strong><\/h2>\n\n\n\n<p>Linear regression is the most basic <a href=\"https:\/\/data36.com\/learn-python-for-data-science-from-scratch\/#machine_learning\">machine learning model<\/a> that you should learn.<\/p>\n\n\n\n<p>If you understand every small bit of it, it&#8217;ll help you build the rest of your machine learning knowledge on a solid foundation.<\/p>\n\n\n\n<p>Knowing how to use linear regression <em>in Python<\/em> is especially important &#8212; since that&#8217;s the language that you&#8217;ll probably have to use in a real life data science project, too.<\/p>\n\n\n\n<p>This article was only your first step! 
So stay with me and join the <a href=\"https:\/\/data36.com\/inner-circle-data36-newsletter-free-data-science-resources\/\">Data36 Inner Circle<\/a> (it&#8217;s free).<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>If you want to learn more about how to become a data scientist, take my 50-minute video course: <a href=\"https:\/\/data36.com\/how-to-become-a-data-scientist\/\">How to Become a Data Scientist.<\/a>&nbsp;(It&#8217;s&nbsp;free!)<\/li>\n\n\n\n<li>Also check out my 6-week online course: <a href=\"https:\/\/data36.com\/jds\/\">The Junior Data Scientist\u2019s First Month video course.<\/a><\/li>\n<\/ul>\n\n\n\n<p><em>Cheers,<\/em><br><strong><em>Tomi Mester<\/em><\/strong><\/p>\n","protected":false},"excerpt":{"rendered":"<p>I always say that learning linear regression in Python is the best first step towards machine learning. Linear regression is simple and easy to understand even if you are relatively new to data science. So spend time on 100% understanding it! If you get a grasp on its logic, it will serve you as a [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":1418,"comment_status":"open","ping_status":"open","sticky":true,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[],"class_list":["post-4845","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-uncategorized"],"yoast_head":"<!-- This site is optimized with the Yoast SEO plugin v21.1 - https:\/\/yoast.com\/wordpress\/plugins\/seo\/ -->\n<title>Linear Regression in Python using numpy + polyfit (with code base)<\/title>\n<meta name=\"description\" content=\"Learning linear regression in Python is the best first step towards machine learning. 
Author: Tomi Mester — data analyst and researcher, author of the Data36 blog (@data36_com).