<p>Rick Wierenga · A blog about whatever interests me.</p>

<h1 id="building-a-multi-platform-app-with-swiftui"><a href="https://rickwierenga.com/blog/swiftui/xplatform">Building a Multi-platform App with SwiftUI</a></h1>
<p><em>2020-08-03</em></p>

<p>At WWDC 2020, Apple introduced a bunch of great new updates to SwiftUI to make it even easier for developers to write apps for Apple platforms. In this tutorial, you’ll learn how to use those new features to make your app work on both iOS and macOS. By the end of this tutorial you will have created a fully functional HackerNews reader.</p>
<p>You can download the <a href="https://github.com/rickwierenga/heartbeat-tutorials/tree/master/MultiplatformApp/">source code</a> on my GitHub.</p>

<h1 id="why-school-sucks"><a href="https://rickwierenga.com/blog/edu/sucks">Why school sucks</a></h1>
<p><em>2020-07-07</em></p>

<p>I’m a high school student, and like most students I have studied from home for the last four-ish months. I study in my bedroom, which is similar to my future workplace; maybe it even is my future workplace. My bedroom is equipped with a computer with an internet connection, the same setup my working parents have. Since school is meant to prepare us for our careers, my bedroom seems to be the perfect place to pursue an education.</p>
<p>Not having strict supervision allows for easy cheating. No one will know if you use a dictionary. No one will know if you use a calculator. And no one will know if you collaborate with your peers. This got me thinking. If my bedroom <em>is</em> my workplace <em>and</em> I can cheat so easily, does this mean I get a free ride to being a billionaire?</p>
<p>I don’t believe that is the case. Why not? I have come to the conclusion that cheating at school is nothing like cheating in real life. In fact, this “cheating” thing is actually appreciated by employers. Employers don’t care if you use a calculator or dictionary. Using all available tools to the fullest extent is a positive skill. Furthermore, employers value employees who collaborate and discuss difficult problems with other people.</p>
<p>At school, however, you will likely be disqualified for doing any of those things during a test. Weird, since I’m at school to prepare for my life. And my life just happens to provide all of those tools<sup>1</sup>.</p>
<h2 id="calculators-dictionaries-and-wikipedia">Calculators, dictionaries and Wikipedia</h2>
<p>Why do children spend countless hours doing basic arithmetic when they will be using a calculator for all serious decisions after they graduate? Seriously, who is going to do their taxes off the top of their head when they can easily use a calculator? I’m not advocating that we stop teaching children math. On the contrary, I actually do think teaching students <em>why</em> addition exists, <em>what</em> it is and <em>when</em> they should use it are important topics that everyone should know. But I don’t care whether they can add up 90234 and 87623 while leaving the calculator (their phone) in their pocket.</p>
<p>I’m also not advocating that we stop teaching languages at school and rely on dictionaries instead. I wouldn’t have been able to write this text if I didn’t know English. But I also believe a dictionary helped me improve this text. It’s a combination of practice and resources that made this text what it is. I’m sure native speakers will find some mistakes, and I would love to know how to improve this text, but at least we learned Shakespeare died in 1616.</p>
<p>Everyone has access to almost all of mankind’s knowledge, regardless of where they are, thanks to websites like Wikipedia. That’s a wonderful thing. This is no excuse, however, to stop education altogether, or humans would have been replaced already. Knowing how to apply something requires context, analytical skills and creativity: things a computer cannot yet do, but that we can teach children.</p>
<h2 id="cheating-is-bad">Cheating is bad</h2>
<p>Many people say you only fool yourself by cheating. I agree. Cheating is bad if it will hurt you later on. I don’t believe collaborating will ever hurt me. Nor do I believe using a dictionary will.</p>
<p>“But knowing something is quicker than looking it up.” Absolutely! I wish I knew how to provide first aid in pressing situations. I wish I hadn’t wasted time learning to memorize facts I don’t use. And in the rare event of me needing to know who painted some piece, I have Wikipedia right in my pocket.</p>
<p>It’s not fair to cheat on an exam and earn a certificate, tricking people into believing you can do something in isolation when in fact you cannot. So will I study for that exam? Probably not, but I will make sure I can get things done.</p>
<h2 id="what-school-should-do">What school should do</h2>
<p>Surprisingly, exercises that are impossible to cheat at are, besides being the fairest, possibly among the best exercises we can spend time on. Filling out a fancy survey (read: exam) may be a way to validate the products in an assembly line, but one would be stupid to believe human qualities can be captured by a number. We shouldn’t test humans the same way we test tools. But most importantly, we are only teaching students facts we know today, and that does not seem particularly responsible in a world evolving this rapidly.</p>
<p>An alternative to exams could be asking students to devise something new. After all, how could one cheat at doing something that does not exist yet? Writing a philosophical essay, debating issues in society or writing computer programs seem good places to start. Students should be free to use the tools that are available to professionals in the field, work within common constraints such as deadlines, and above all, be able to collaborate. Using a smart rotation system it would still be possible to extract individual student performance from a shared environment.</p>
<h2 id="final-words">Final words</h2>
<p>Exercises I can cheat at are <em>exercises</em> for what, exactly? I think school should pose exercises I can’t cheat at. If cheating takes time, that time could be made a constraint on the exercise. Looking up how to make an argument during a debate won’t sound very convincing. Sometimes math requires exact answers that computers can’t easily produce; fair enough, then we will have to do it by hand. There is nothing inherently bad about that.</p>
<p>If I can cheat at something and still get the correct answer or get a project done, I have found a loophole in education. It means education is not prepared for today’s technological advancements. School today doesn’t really prepare us for the future. Let’s change that.</p>
<hr />
<div class="footnotes">
<sup>1</sup> I’m not taking this great technology for granted, but I am planning on continuing to enjoy using it.
</div>
<hr />

<h1 id="an-intuitive-guide-to-neural-networks"><a href="https://rickwierenga.com/blog/ml-fundamentals/NN">An Intuitive Guide to Neural Networks</a></h1>
<p><em>2020-02-29</em></p>

<p>In this post you will build a classifier model to classify images of handwritten digits. This may sound like a rather complicated problem to solve (what is “the number 5”?). However, by using the power of machine learning, we do not have to define each number; the model will learn by itself. Along the way I will introduce you to the most powerful classifier yet: neural networks. We are entering <em>deep</em> learning for the first time.</p>
<p>I am aware of the fact that, due to their insane success, many tutorials have been written about neural networks. Many try to impress you by giving a vague proof of backpropagation. Others will confuse you with explanations of biological neurons. This post is not yet another copy of those, nor will I be ignoring the mathematical foundations (which would be contrary to the goal of this series). In this post I hope to give you an understanding of what a neural network actually is, and how they learn.</p>
<p><a href="https://github.com/rickwierenga/MLFundamentals/blob/master/4_NN.ipynb">Here</a> is the corresponding notebook where you can find a complete and dynamic implementation of backprop (see later) for any number of layers.</p>
<div class="warning">
I recommend you read the previous posts in this series before you continue reading, because each post builds upon previously explained principles. <a href="/blog/ml-fundamentals">Series homepage</a>.
</div>
<h2 id="the-dataset">The dataset</h2>
<p>As I just mentioned, in this post we will classify handwritten digits. To do so, we will use the MNIST dataset [1]. It consists of 60,000 grayscale images of 28 by 28 pixels, like the following:</p>
<p><img src="/assets/images/nn/mnist.png" alt="MNIST images" /></p>
<p>In computer vision, a subfield-ish of machine learning, each pixel represents a feature. The images in MNIST are grayscale, so we can use the raw pixel values directly. Other datasets might be in RGB format; in that case, each channel of a pixel is a separate feature.</p>
<p>It turns out that the classifiers we have seen thus far are not capable of classifying data with this many features ($28 \times 28 = 784$). For instance, the 4’s in the images above are quite different, most certainly when represented as a matrix.</p>
<p>This does not mean the problem cannot be solved. In fact, it is solved. To understand how, let’s learn about neural networks.</p>
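<p>Each image thus becomes a vector of $784$ pixel features. Below is a minimal preprocessing sketch of my own (using random dummy data in place of the real download); the pixel values are scaled to $[0, 1]$, which the training code in the notebook assumes:</p>

```python
import numpy as np

# dummy stand-in for the real MNIST training images (60000 x 28 x 28, uint8)
x_train = np.random.randint(0, 256, size=(100, 28, 28), dtype=np.uint8)

# flatten each 28x28 image into a 784-dimensional feature vector
# and scale the pixel values from [0, 255] to [0, 1]
x_train = x_train.reshape(len(x_train), -1).astype(np.float32) / 255.0

print(x_train.shape)  # (100, 784)
```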
<h2 id="classification-models-as-networks">Classification models as networks</h2>
<p>Think about a classification model as follows:</p>
<p><img src="/assets/images/nn/classifier.png" alt="classifier" /></p>
<p>where $x$ is an input vector, being mapped to a prediction vector $\hat{y}$.</p>
<p>If you were to visualize the individual elements of both vectors, you would get something like this:</p>
<p><img src="/assets/images/nn/classifier1.png" alt="classifier" /></p>
<p>Let’s look at the definition of the hypothesis function $h$ again (see <a href="/blog/ml-fundamentals/softmax.html">softmax regression</a>):</p>
<script type="math/tex; mode=display">h_\theta(x) = \sigma(X \cdot \theta) = \begin{bmatrix}p(y = 0 | x; \theta) \\ p(y = 1 | x; \theta) \\ p(y = 2 | x; \theta)\end{bmatrix} = \begin{bmatrix}
\frac{\exp(\theta_0^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)} \\
\frac{\exp(\theta_1^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)} \\
\frac{\exp(\theta_2^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)}
\end{bmatrix}</script>
<p>The most important thing to understand is that every feature $x_j$ is multiplied by a row $j$ in $\theta$ ($\theta_j$); each feature $x_j$ impacts the probability of the entire input $x$ belonging to a class. If we visualized the “impacts” in the schema, we would get this:</p>
<p><img src="/assets/images/nn/classifier2.png" alt="classifier" /></p>
<p>One thing this graph does not account for is the bias factor. Let’s add that next:</p>
<p><img src="/assets/images/nn/classifier3.png" alt="classifier" /></p>
<p>At a high level, the network is very representative of the underlying math. Please make sure you fully understand it before moving on.</p>
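<p>To make the schematic concrete, here is a small numpy sketch of the 3-class hypothesis above, with 4 features; the shapes and values are made up for illustration:</p>

```python
import numpy as np

def softmax(z):
    # subtracting the max is a standard numerical-stability trick;
    # it does not change the output
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

theta = np.zeros((4, 3))              # one column of weights per class
x = np.array([[1.0, 2.0, 3.0, 4.0]])  # a single input vector

h = softmax(x @ theta)                # the hypothesis h_theta(x)
# with all-zero weights, every class is equally likely: h == [[1/3, 1/3, 1/3]]
```

<p>Each entry of <code>h</code> is the probability of <code>x</code> belonging to one of the three classes, and the entries sum to 1.</p>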
<h3 id="nodes-in-the-graph">Nodes in the graph</h3>
<p>Let’s now look at an individual node in this network.</p>
<p><img src="/assets/images/nn/node.png" alt="node" /></p>
<p>The node for $\hat{y}_1$ multiplies the inputs $x_j$ by their respective parameters $\theta_j$ and, because we are dealing with vector multiplication, adds up the results. In most cases, an activation function is then applied.</p>
<script type="math/tex; mode=display">\hat{y}_1 = g\left(\displaystyle\sum_{j=0}^n \theta_j \cdot x_j\right)</script>
<p>This is the exact model we discussed in the <a href="/blog/ml-fundamentals/logistic-regression.html">logistic regression post</a>.</p>
<p>In the context of a classifier network, people call nodes such as $\hat{y}$ “neurons.” The complete network is, therefore, considered a “neural network.” For the sake of consistency I will also use those terms, but I will simply define “neuron” as “node.” Keep in mind that the biological definition of neuron is something else entirely.</p>
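<p>In code, a single neuron is nothing more than a dot product followed by an activation function. A minimal sketch, with made-up numbers and sigmoid as $g$:</p>

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-z))

theta = np.array([0.5, -1.0, 2.0])  # this neuron's parameters; theta_0 is the bias weight
x = np.array([1.0, 3.0, 0.5])       # input vector with x_0 = 1 for the bias

y_hat = g(np.sum(theta * x))        # equivalently: g(theta @ x)
```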
<h2 id="extending-the-network">Extending the network</h2>
<p>Simple networks like these (really just a logistic classifier represented as a network) fail at large classification tasks because they have too few parameters. The entire model is too simple and it will never be able to learn from more interesting datasets such as MNIST. However, logistic regression can still be used for this problem. In fact, the main intuition I will present in this post is that a neural network is just a chain of logistic classifiers.</p>
<p>The graph I presented before can be modified to include another layer, or multiple other layers: the so-called “hidden layers.” These layers are placed between the input layer and the output layer. The reason they are called “hidden” is that we do not directly use their values; they simply exist to forward values through the network. Note that these layers also have a bias factor.</p>
<p><img src="/assets/images/nn/nn.png" alt="node" /></p>
<p>Because hidden layers (blue) are not directly tied to the number of features (the input layer, in red) or the number of classes (the output layer, in green), they can have an arbitrary number of neurons, and a neural network can have an arbitrary number of layers (denoted $L$).</p>
<h3 id="weights">Weights</h3>
<p>This schema looks promising, but how does it really work?</p>
<p>Let’s start by looking at how we represent the weights. In the first two posts (<a href="/blog/ml-fundamentals/polynomial-regression.html">polynomial regression</a> and <a href="/blog/ml-fundamentals/logistic-regression.html">logistic regression</a>) the weights were stored in a vector $\theta \in \mathbb{R}^n$. With <a href="/blog/ml-fundamentals/softmax.html">softmax regression</a> we started using a matrix $\theta \in \mathbb{R}^{n \times K}$ because we wanted an output vector instead of a scalar.</p>
<p>Neural networks use a list to store the weights, often denoted $\Theta$ (capital $\theta$), with each item $\Theta^{(l)}$ being a weight matrix. Unlike in the schematic, the sizes of the hidden layers often change throughout the network, so storing all weights in a single matrix would be inconvenient. Each weight matrix has a corresponding input and output layer. A particular weight matrix, for layer $l$, is $\Theta^{(l)} \in \mathbb{R}^{(n_{\text{in}} + 1) \times n_{\text{out}}}$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the number of nodes in its input and output layers.</p>
<p>A Python function initializing the weights for a neural network:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># inspired by:
# https://github.com/google/jax/blob/master/examples/mnist_classifier_fromscratch.py
</span><span class="k">def</span> <span class="nf">init_random_params</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">,</span> <span class="n">rng</span><span class="o">=</span><span class="n">npr</span><span class="p">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">0</span><span class="p">)):</span>
<span class="k">return</span> <span class="p">[</span><span class="n">rng</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="n">nodes_in</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">nodes_out</span><span class="p">)</span> <span class="o">*</span> <span class="n">np</span><span class="p">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mi">2</span> <span class="o">/</span> <span class="p">(</span><span class="n">nodes_in</span> <span class="o">+</span> <span class="n">nodes_out</span><span class="p">))</span>
<span class="k">for</span> <span class="n">nodes_in</span><span class="p">,</span> <span class="n">nodes_out</span><span class="p">,</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="n">layer_sizes</span><span class="p">[:</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="n">layer_sizes</span><span class="p">[</span><span class="mi">1</span><span class="p">:])]</span>
</code></pre></div></div>
<p>This function takes a parameter <code class="language-plaintext highlighter-rouge">layer_sizes</code>, which is a list of the number of nodes in each layer. For this problem I build a neural network with 2 hidden layers, each with 500 neurons. Note that the number of nodes in the input and output layer are determined by the dataset.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">weights</span> <span class="o">=</span> <span class="n">init_random_params</span><span class="p">([</span><span class="mi">784</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">500</span><span class="p">,</span> <span class="mi">10</span><span class="p">])</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">init_random_params</code> automatically accounts for a bias factor, so we get the following shapes:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="p">[</span><span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="k">for</span> <span class="n">x</span> <span class="ow">in</span> <span class="n">weights</span><span class="p">]</span>
<span class="p">[(</span><span class="mi">785</span><span class="p">,</span> <span class="mi">500</span><span class="p">),</span> <span class="p">(</span><span class="mi">501</span><span class="p">,</span> <span class="mi">500</span><span class="p">),</span> <span class="p">(</span><span class="mi">501</span><span class="p">,</span> <span class="mi">10</span><span class="p">)]</span>
</code></pre></div></div>
<h3 id="computing-predictions-feedforward">Computing predictions: feedforward</h3>
<p>To compute the predictions, given the input features, we use a (very simple) algorithm called “feedforward.” We loop over each layer, add a bias factor, compute the output by multiplying by the corresponding weight matrix, and finally apply the activation function.</p>
<p>It’s easier in Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="n">weights</span><span class="p">,</span> <span class="n">inputs</span><span class="p">):</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">inputs</span>
<span class="c1"># loop over layers
</span> <span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">weights</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">add_bias</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">@</span> <span class="n">w</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">g</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">return</span> <span class="n">x</span>
</code></pre></div></div>
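<p>The <code>forward</code> function above relies on two helpers that live in the notebook: <code>add_bias</code>, which prepends the constant bias node to every example, and the activation function <code>g</code> (sigmoid in this post). Minimal versions might look like this (a sketch; the notebook’s implementations may differ):</p>

```python
import numpy as np

def add_bias(x):
    # prepend a column of ones: the first node of every layer is the bias, defined as 1
    return np.hstack([np.ones((x.shape[0], 1)), x])

def g(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-z))
```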
<p>Given the list of weights $\Theta$ and the input $x = a^{(1)}$, the activations in layer 2 are $a^{(2)} = g(x \cdot \Theta^{(2)})$. The activations in layer 3 are $g(a^{(2)} \cdot \Theta^{(3)})$. Generally, the activations in layer $l$ are $g(a^{(l-1)} \cdot \Theta^{(l)})$. Note that the code example above uses <code class="language-plaintext highlighter-rouge">x</code> as the only variable name for convenience, but that variable does not always hold the input $x$.</p>
<p>In short, $h_\theta(x)$ can take on other forms than a single matrix multiplication.</p>
<h2 id="training-backpropagation">Training: backpropagation</h2>
<p>Next, let’s discuss how to train a neural network. People often think this is a very complicated process. If that’s you, forget everything you’ve heard so far, because it’s actually quite intuitive.</p>
<p>Once you realize that the values of $\Theta$ deterministically yield the activations in the entire neural network given some input, it becomes clear that not every activation in a hidden layer is desirable, even though hidden activations are not directly interpreted (in fact, interpreting the values of hidden layers is an active research problem). This implies that in order to get good predictions, we also need good activations in the hidden layers.</p>
<p>We would like to compute how we need to change each value of $\Theta$ so that we get the correct activations in the output layer. By doing that, we also change the values in the hidden layers. This means that the hidden layers also carry an error term (denoted $\delta^{(l)}$), while that’s not directly obvious if you only think about the final output layer. The same thing in reverse: by inspecting the error in each hidden layer, we can compute the change (gradient) for the weight matrices.</p>
<h3 id="computing-errors">Computing errors</h3>
<p>The only layer for which we immediately know the error term is the output layer. Because it serves as a prediction layer we can compare its output to the labels. The error for layer $L$ is given as</p>
<script type="math/tex; mode=display">\delta^{(L)} = \hat{y} - y</script>
<p>For all previous layers we can compute the error term with the following formula:</p>
<script type="math/tex; mode=display">\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .* g'(a^{(l)})</script>
<p>Because $\delta^{(l)}$ depends on $\delta^{(l+1)}$, the error terms must be computed in reverse order. $\delta^{(l)}$ also depends on $a^{(l)}$, the activations of layer $l$, so in order to compute it we first need to know the exact activations in that layer. That’s the reason a full forward propagation is usually computed before starting backpropagation, saving the activations along the way.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">x</span> <span class="o">=</span> <span class="n">inputs</span>
<span class="n">activations</span> <span class="o">=</span> <span class="p">[</span><span class="n">inputs</span><span class="p">]</span>
<span class="k">for</span> <span class="n">w</span> <span class="ow">in</span> <span class="n">weights</span><span class="p">:</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">add_bias</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">x</span> <span class="o">@</span> <span class="n">w</span>
<span class="n">activations</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">g</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">x</span>
</code></pre></div></div>
<p>Now to compute the error term, we start by computing the error in the final layer. The error is transposed to match the format of the other errors we will compute.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">final_error</span> <span class="o">=</span> <span class="p">(</span><span class="n">predictions</span> <span class="o">-</span> <span class="n">y</span><span class="p">).</span><span class="n">T</span>
<span class="n">errors</span> <span class="o">=</span> <span class="p">[</span><span class="n">final_error</span><span class="p">]</span>
</code></pre></div></div>
<p>We will compute the other errors in a loop. A few things to note:</p>
<ul>
<li>
<p>We index our activations by <code class="language-plaintext highlighter-rouge">[1:-1]</code> so we skip the first layer (the input has no error term) and the output layer (whose error we have already computed).</p>
</li>
<li>
<p>We skip the first node in each layer; it is defined as 1.</p>
</li>
<li>
<p>Finally, <code class="language-plaintext highlighter-rouge">weights[-(i+1)]</code> is the weight matrix indexed from the back (<code class="language-plaintext highlighter-rouge">-1</code> because <code class="language-plaintext highlighter-rouge">i</code> starts at 0).</p>
</li>
</ul>
<p>These details are important to keep in mind when implementing backprop, but in this context they are easy to follow.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">act</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="mi">1</span><span class="p">:</span><span class="o">-</span><span class="mi">1</span><span class="p">]):</span>
<span class="c1"># ignore the first weight because we don't adjust the bias
</span> <span class="n">error</span> <span class="o">=</span> <span class="n">weights</span><span class="p">[</span><span class="o">-</span><span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)][</span><span class="mi">1</span><span class="p">:,</span> <span class="p">:]</span> <span class="o">@</span> <span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">*</span> <span class="n">g_</span><span class="p">(</span><span class="n">act</span><span class="p">).</span><span class="n">T</span>
<span class="n">errors</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">error</span><span class="p">)</span>
</code></pre></div></div>
<p>This snippet uses the derivative of sigmoid <code class="language-plaintext highlighter-rouge">g_</code>. It is defined as:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">g_</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
<span class="s">""" derivative sigmoid """</span>
<span class="k">return</span> <span class="n">g</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">g</span><span class="p">(</span><span class="n">z</span><span class="p">))</span>
</code></pre></div></div>
<p>or</p>
<script type="math/tex; mode=display">\frac{d}{dz} g(z) = g(z) \cdot (1 - g(z))</script>
<p>Finally, we flip the errors so they are arranged like the layers:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">errors</span> <span class="o">=</span> <span class="n">reverse</span><span class="p">(</span><span class="n">errors</span><span class="p">)</span>
</code></pre></div></div>
<p>For the sake of completeness:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reverse</span><span class="p">(</span><span class="n">l</span><span class="p">):</span>
<span class="k">return</span> <span class="n">l</span><span class="p">[::</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>
</code></pre></div></div>
<h3 id="computing-gradients">Computing gradients</h3>
<p>Recall that a gradient is a multidimensional step for a weight matrix to take in order to decrease the error.</p>
<p>We now know the error for each layer. Going back to logistic regression, the building block of neural networks, you can think of these error terms as the loss of each layer. This means we can use the same equation we developed in a previous post to compute the gradient for each weight matrix corresponding to each error term, except the loss is now $\delta^{(l + 1)}$ and the input is $a^{(l)}$:</p>
<script type="math/tex; mode=display">\Delta^{(l)} = \frac{1}{m} \cdot \delta^{(l + 1)} a^{(l)}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">grads</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">errors</span><span class="p">)):</span>
<span class="n">grad</span> <span class="o">=</span> <span class="p">(</span><span class="n">errors</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">@</span> <span class="n">add_bias</span><span class="p">(</span><span class="n">activations</span><span class="p">[</span><span class="n">i</span><span class="p">]))</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">))</span>
<span class="n">grads</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">grad</span><span class="p">)</span>
</code></pre></div></div>
<p>Congratulations! You now know backpropagation. Fun note: because we are learning multiple layers, we are doing <em>deep</em> learning. Just so you know.</p>
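<p>For reference, the snippets above can be assembled into the <code>backward</code> function that the training loop below calls. This is my own sketch following the equations in this post (it saves both the post-activation values $a^{(l)}$, used for the gradients, and the pre-activation values, used for $g'$); the notebook’s version may differ in details:</p>

```python
import numpy as np

def g(z):
    """Sigmoid activation."""
    return 1 / (1 + np.exp(-z))

def g_(z):
    """Derivative of sigmoid."""
    return g(z) * (1 - g(z))

def add_bias(x):
    """Prepend the constant bias node (1) to every example."""
    return np.hstack([np.ones((x.shape[0], 1)), x])

def backward(inputs, y, weights):
    """Return one gradient per weight matrix via feedforward + backpropagation."""
    # forward pass, saving the activations a and the pre-activation values z
    a = [inputs]
    zs = []
    for w in weights:
        z = add_bias(a[-1]) @ w
        zs.append(z)
        a.append(g(z))
    predictions = a[-1]

    # error terms, computed from the output layer backwards
    errors = [(predictions - y).T]  # delta^(L)
    for l in range(len(weights) - 1, 0, -1):
        # skip the bias row of the weight matrix: no error flows to the constant 1
        delta = (weights[l][1:, :] @ errors[-1]) * g_(zs[l - 1]).T
        errors.append(delta)
    errors = errors[::-1]  # arrange the error terms like the layers

    # gradient for each weight matrix: Delta^(l) = delta^(l+1) a^(l) / m
    m = len(y)
    return [errors[l] @ add_bias(a[l]) / m for l in range(len(weights))]
```

<p>Note that each returned gradient is transposed relative to its weight matrix, which is why the update step applies <code>grads[j].T</code>.</p>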
<h3 id="backpropagation-for-individual-neurons">Backpropagation for individual neurons</h3>
<p>The way backpropagation is usually taught is by presenting it as a method for finding the derivative of the cost function. A great intuition for backpropagation from that perspective was written by Andrej Karpathy in his Stanford course. I’ll let him explain it <a href="http://cs231n.github.io/optimization-2/">himself</a>.</p>
<h3 id="the-training-loop">The training loop</h3>
<p>Because the entire dataset consists of $60000 \cdot 28 \cdot 28 \approx 4.7 \cdot 10^7$ elements, it’s too big for many computers to fit in RAM at once. That’s the reason we use “batching”, loading only a small number of examples at a time; for simplicity, the training loop below uses batches of a single example.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">lr</span> <span class="o">=</span> <span class="mf">0.001</span>
<span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">2</span><span class="p">):</span>
    <span class="k">print</span><span class="p">(</span><span class="s">'Starting epoch'</span><span class="p">,</span> <span class="n">epoch</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">x_train</span><span class="p">)):</span>
        <span class="n">inputs</span> <span class="o">=</span> <span class="n">x_train</span><span class="p">[</span><span class="n">i</span><span class="p">][</span><span class="n">np</span><span class="p">.</span><span class="n">newaxis</span><span class="p">,</span> <span class="p">:]</span>
        <span class="n">labels</span> <span class="o">=</span> <span class="n">T</span><span class="p">([</span><span class="n">y_train</span><span class="p">[</span><span class="n">i</span><span class="p">]],</span> <span class="n">K</span><span class="o">=</span><span class="mi">10</span><span class="p">)</span>
        <span class="n">grads</span> <span class="o">=</span> <span class="n">backward</span><span class="p">(</span><span class="n">inputs</span><span class="p">,</span> <span class="n">labels</span><span class="p">,</span> <span class="n">weights</span><span class="p">)</span>
        <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">weights</span><span class="p">)):</span>
            <span class="n">weights</span><span class="p">[</span><span class="n">j</span><span class="p">]</span> <span class="o">-=</span> <span class="n">lr</span> <span class="o">*</span> <span class="n">grads</span><span class="p">[</span><span class="n">j</span><span class="p">].</span><span class="n">T</span>
        <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">5000</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">stats</span><span class="p">(</span><span class="n">weights</span><span class="p">))</span>
</code></pre></div></div>
<p>This should get you an accuracy of $90\%$.*</p>
<p>For the complete code, refer to the <a href="https://github.com/rickwierenga/MLFundamentals/blob/master/4_NN.ipynb">notebook</a>.</p>
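<p>The batching idea itself can be illustrated with a short self-contained NumPy sketch; the array and batch size below are made up for illustration and are not part of the original notebook:</p>

```python
import numpy as np

# a toy "dataset" of 6 examples with 2 features each
x = np.arange(12).reshape(6, 2)

batch_size = 2
# slice the dataset into consecutive mini-batches
batches = [x[i:i + batch_size] for i in range(0, len(x), batch_size)]

print(len(batches))      # 3
print(batches[0].shape)  # (2, 2)
```

Each gradient update then only ever touches one batch, keeping memory usage constant regardless of dataset size.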
<h2 id="deep-learning-frameworks">Deep learning frameworks</h2>
<p>You might be wondering if you need to implement everything we did today when you are building a neural network. Fortunately, that’s not the case. Many deep learning libraries exist, <a href="https://pytorch.org">PyTorch</a> and <a href="https://tensorflow.org">TensorFlow + Keras</a> being the most popular.</p>
<p>While this series focuses on the fundamentals, I would like to show you an example in Keras because, in my opinion, it’s the easiest to read. (You should easily be able to find MNIST tutorials for all other frameworks if you’re interested.)</p>
<p>You can define a model as just a list of <code class="language-plaintext highlighter-rouge">tf.keras.layers</code> objects, and Keras will automatically initialize the weights.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">=</span> <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">Sequential</span><span class="p">([</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">500</span><span class="p">,</span>
                          <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">,</span>
                          <span class="n">input_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">28</span> <span class="o">*</span> <span class="mi">28</span><span class="p">,)),</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">500</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">),</span>
    <span class="n">tf</span><span class="p">.</span><span class="n">keras</span><span class="p">.</span><span class="n">layers</span><span class="p">.</span><span class="n">Dense</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">activation</span><span class="o">=</span><span class="s">'sigmoid'</span><span class="p">)</span>
<span class="p">])</span>
</code></pre></div></div>
<h2 id="whats-next">What’s next?</h2>
<p>*This model does not reach the accuracy some other models do because I skipped a few techniques to keep the post concise. I will present those techniques in a future post.</p>
<p>Apart from the ability to stack logistic classifiers, another thing that makes neural networks powerful is that we can combine different kinds of layers. Today we have looked at “dense” layers: layers consisting of a matrix multiplication and an activation function. Convolutional and dropout layers, for example, are other interesting types of layers I will cover in a future post.</p>
<p>In the next post I will cover optimization algorithms so we can train larger neural networks much faster.</p>
<h2 id="learn-more">Learn more</h2>
<p>You should check out the <a href="https://playground.tensorflow.org/">TensorFlow Playground</a>, a website where you can play around with neural networks in a very visual way to get an even better intuition for how feedforward works.</p>
<p>I added [2] as a reference to learn more about backpropagation as a technique to differentiate the loss function for neural networks.</p>
<h2 id="references">References</h2>
<p>[1] LeCun, Y., Cortes, C., & Burges, C. (2010). MNIST handwritten digit database. ATT Labs [Online]. Available: <a href="http://yann.lecun.com/exdb/mnist">http://yann.lecun.com/exdb/mnist</a>, 2.</p>
<p>[2] Atilim Gunes Baydin and Barak A. Pearlmutter and Alexey Andreyevich Radul (2015). Automatic differentiation in machine learning: a survey. CoRR, <a href="http://arxiv.org/abs/1502.05767">http://arxiv.org/abs/1502.05767</a>.</p>In this post you will build a classifier model to classify images of handwritten digits. This may sound like a rather complicated problem to solve (what is “the number 5”?). However, by using the power of machine learning we do not have to define each number; it will learn by itself. Along the way I will introduce you to the most powerful classifier yet: neural networks. Entering deep learning for the first time.Softmax Regression from Scratch in Python2020-02-22T00:00:00+00:002020-02-22T00:00:00+00:00https://rickwierenga.com/blog/ml-fundamentals/softmax<p><a href="/blog/ml-fundamentals/logistic-regression.html">Last time</a> we looked at classification problems and how to classify breast cancer with logistic regression, a binary classification problem. In this post we will consider another type of classification: multiclass classification. In particular, I will cover one hot encoding, the softmax activation function and negative log likelihood.</p>
<div class="warning">
I recommend you read the previous posts in this series before you continue reading because each post builds upon the previously explained principles. <a href="/blog/ml-fundamentals">Series homepage</a>.
</div>
<h2 id="revisiting-classification">Revisiting classification</h2>
<p>Recall from the previous post that classification is discrete regression. The target $y^{(i)}$ can take on values from a discrete and finite set. In binary classification we only considered sets of size $2$, but classification can be extended beyond that. Let’s look at the complete picture, where $y^{(i)} \in \{0, 1, \ldots, K-1\}$ and $K$ is the number of classes.</p>
<p>The model we build for logistic regression could be intuitively understood by looking at the decision boundary. By forcing the model to predict values as distant from the decision boundary as possible through the logistic loss function, we were able to build theoretically very stable models. The model outputted probabilities for each instance belonging to the positive class.</p>
<p>However, in multiclass classification it’s hard to think about a decision boundary splitting the feature space in more than 2 parts. In fact, such a plane does not even exist. Furthermore, the log loss function does not work with more than two classes because it depends on the fact that if an instance belongs to one class, it does not belong to the other. So we need something else.</p>
<p>Let’s look at where we are thus far. A schematic of polynomial regression:</p>
<p><img src="/assets/images/softmax/polynomial.png" alt="polynomial regression diagram" /></p>
<p>A corresponding diagram for logistic regression:</p>
<p><img src="/assets/images/softmax/logistic.png" alt="logistic regression diagram" /></p>
<p>In this post we will build another model, which is very similar to logistic regression. The key difference in the hypothesis function is that we use softmax, $\sigma$, instead of sigmoid, $g$:</p>
<p><img src="/assets/images/softmax/softmax.png" alt="softmax regression diagram" /></p>
<h2 id="one-hot-encoding">One hot encoding</h2>
<p>As I just mentioned, we can’t measure distances over a single “class dimension” (by which I mean the probability of an instance belonging to the positive class). Instead, for multiclass classification we think about each class as a separate channel, or dimension if you will. All of these channels are accumulated in an output vector, $\hat{y} \in \mathbb{R}^K$.</p>
<p>Let’s take a look at what such a vector would look like. For convenience, we define a function $T(y): \mathbb{R} \rightarrow \mathbb{R}^K$ which maps labels from their integer representation (in $\{0, 1, \ldots, K-1\}$) to a one hot encoded representation. This function takes into account the total number of classes, $3$ in this case.</p>
<script type="math/tex; mode=display">T(0) = \begin{bmatrix}1 \\ 0 \\ 0 \end{bmatrix} \quad T(1) = \begin{bmatrix}0 \\ 1 \\ 0 \end{bmatrix}</script>
<p>An instance in class $0$ has a $100\%$ chance of belonging to class $0$, and a $0\%$ chance for all other classes. We would like $h$ to yield similar values.</p>
<p>In Python $T$ could be implemented as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">T</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">K</span><span class="p">):</span>
    <span class="s">""" one hot encoding """</span>
    <span class="n">one_hot</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">zeros</span><span class="p">((</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">),</span> <span class="n">K</span><span class="p">))</span>
    <span class="n">one_hot</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="nb">len</span><span class="p">(</span><span class="n">y</span><span class="p">)),</span> <span class="n">y</span><span class="p">]</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="k">return</span> <span class="n">one_hot</span>
</code></pre></div></div>
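<p>Here is a quick self-contained check of <code class="language-plaintext highlighter-rouge">T</code> (the definition is repeated so the snippet runs on its own; the example labels are made up):</p>

```python
import numpy as np

def T(y, K):
    """ one hot encoding """
    one_hot = np.zeros((len(y), K))
    one_hot[np.arange(len(y)), y] = 1
    return one_hot

# three made-up labels from a 3-class problem
print(T(np.array([0, 2, 1]), K=3))
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```

Each row has exactly one $1$, in the column of the example’s class.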
<p>If you don’t yet see why this would be useful, hang on.</p>
<h2 id="building-the-model-the-softmax-function">Building the model: the softmax function</h2>
<p>Up until now $x \cdot \theta$ has always had a scalar output, an output in one dimension. In this case, however, the resulting value will be a vector where each row corresponds to a certain class, as we have just seen. While this can be achieved by initializing $\theta$ as an $n \times K$ dimensional matrix, which is exactly what we will do, the raw dot product by itself would be of little meaning.</p>
<p>That’s the reason we define another activation function, $\sigma$. As you may remember from last post, $g$ is the general symbol for activation functions. But as you will learn in the neural networks post (stay tuned) the softmax activation function is a bit of an outlier compared to the other ones. So we use $\sigma$.</p>
<p>For $z\in\mathbb{R}^k$, the $i$-th element of $\sigma$ is defined as</p>
<script type="math/tex; mode=display">\sigma(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^k \exp(z_j)}</script>
<p>which gives</p>
<script type="math/tex; mode=display">p(y = i | x; \theta) = \frac{\exp(\theta_i^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)}</script>
<p>where $\theta_i \in \mathbb{R}^n$ is the vector of weights corresponding to class $i$, and $p$ denotes the probability. For more details on why that is, refer to <a href="http://cs229.stanford.edu/notes2019fall/cs229-notes1.pdf">this document</a>, section 9.3.</p>
<p>The hypothesis function $h$ yields a vector $\hat{y}$ where each row is the probability of the input $x$ belonging to a class. For $K = 3$ we have</p>
<script type="math/tex; mode=display">h_\theta(x) = \sigma(X \cdot \theta) = \begin{bmatrix}p(y = 0 | x; \theta) \\ p(y = 1 | x; \theta) \\ p(y = 2 | x; \theta)\end{bmatrix} = \begin{bmatrix}
\frac{\exp(\theta_0^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)} \\
\frac{\exp(\theta_1^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)} \\
\frac{\exp(\theta_2^Tx)}{\sum_{j=1}^k \exp(\theta_j^Tx)}
\end{bmatrix}</script>
<p>See how $T$ fits into the picture?</p>
<p>To get a final class prediction, we don’t check if the number exceeds a certain threshold ($0.5$ in the last post), but we take the channel with the highest probability. In mathematical terms:</p>
<script type="math/tex; mode=display">\text{class} = \arg\max \hat{y}</script>
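<p>In NumPy, $\arg\max$ corresponds directly to <code class="language-plaintext highlighter-rouge">np.argmax</code>; a tiny illustration with a made-up probability vector:</p>

```python
import numpy as np

y_hat = np.array([0.1, 0.7, 0.2])   # hypothetical class probabilities
predicted_class = np.argmax(y_hat)  # index of the highest probability
print(predicted_class)  # 1
```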
<p>One thing I would like to point out here is that</p>
<script type="math/tex; mode=display">\displaystyle\sum_{j=1}^k h_\theta(x)_j = 1</script>
<p>This is obvious because the sum of the numerators in $h$ is, by definition, equal to the denominator. Intuitively, the model is $100\%$ sure each instance belongs to one of the predefined classes. Furthermore, because $\exp$ is an exponential function, large elements in $\theta^Tx$ will be “intensified” by $\sigma$, receiving a disproportionately high probability, also relatively speaking.</p>
<p>A vectorized python implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="c1"># sum over the class axis so this also works row-wise on a batch</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">z</span><span class="p">),</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
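<p>You can verify the sum-to-one property, and the way $\exp$ favours the largest logit, with a self-contained check. The snippet repeats the softmax definition (with an explicit axis so it works row-wise on a batch) and uses made-up logits:</p>

```python
import numpy as np

def softmax(z):
    return np.exp(z) / np.sum(np.exp(z), axis=-1, keepdims=True)

z = np.array([[1.0, 2.0, 4.0]])  # made-up logits for one example
p = softmax(z)
print(p)        # the largest logit gets a disproportionately large share
print(p.sum())  # 1.0 (up to floating point)
```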
<h3 id="numerical-stability">Numerical stability</h3>
<p>When implementing softmax, $\sum_{j=1}^k \exp(\theta_j^Tx)$ may be very high which leads to numerically unstable programs. To avoid this problem, we normalize each value $\theta_j^Tx$ by subtracting the largest value.</p>
<p>The implementation now becomes</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">softmax</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="n">z</span> <span class="o">=</span> <span class="n">z</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="nb">max</span><span class="p">(</span><span class="n">z</span><span class="p">,</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>  <span class="c1"># avoid mutating the caller's array</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">z</span><span class="p">)</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="n">z</span><span class="p">),</span> <span class="n">axis</span><span class="o">=-</span><span class="mi">1</span><span class="p">,</span> <span class="n">keepdims</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
</code></pre></div></div>
<p>This normalization step has no further impact on the outcomes.</p>
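<p>A small self-contained demonstration of why the shift matters; the logits are (arbitrarily) chosen large enough that $\exp$ would overflow in float64 without it:</p>

```python
import numpy as np

def softmax(z):
    z = z - np.max(z, axis=-1, keepdims=True)
    return np.exp(z) / np.sum(np.exp(z), axis=-1, keepdims=True)

z = np.array([1000.0, 1001.0])  # np.exp(1000) alone overflows to inf
p = softmax(z)
print(p)  # finite probabilities, identical to softmax([0.0, 1.0])
```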
<h3 id="the-hypothesis">The hypothesis</h3>
<p>For the sake of completeness, here is the final implementation for the hypothesis function:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">softmax</span><span class="p">(</span><span class="n">X</span> <span class="o">@</span> <span class="n">theta</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="negative-log-likelihood">Negative log likelihood</h2>
<p>The loss function is used to measure how bad our model is. Thus far, that meant the distance of a prediction to the target value because we have only looked at 1-dimensional output spaces. In multidimensional output spaces, we need another way to measure badness.</p>
<p>Negative log likelihood is yet another loss function suitable for these kinds of measurements. It is defined as:</p>
<script type="math/tex; mode=display">J(\theta) = -\displaystyle\sum_{i = 1}^m \log p(y^{(i)}|x^{(i)};\theta)</script>
<p>When I first encountered this function it was extremely confusing to me. But it turns out that the idea behind it is actually brilliant and even intuitive.</p>
<p>Let’s first look at the plot of the negative log likelihood for some arbitrary probabilities.</p>
<p><img src="/assets/images/softmax/nll.png" alt="plot of negative log likelihood" /></p>
<p>As the probability increases, the loss decreases. Because we only take the negative log likelihood of the correct class ($y^{(i)}$ in the formula) into account, this is exactly the property we want. Moreover, when we maximize the probability of the correct class, we automatically decrease the probabilities for the other classes because the sum is always equal to $1$. This is an implicit side effect which might not be obvious at first.</p>
<p>We always want to get the loss as low as possible. Because the <em>negative</em> log likelihood is the log likelihood multiplied by $-1$, minimizing it is the same as maximizing the log likelihood:</p>
<script type="math/tex; mode=display">\displaystyle\sum_{i = 1}^m \log p(y^{(i)}|x^{(i)};\theta)</script>
<p>Because machine learning optimizers are generally designed for minimization rather than maximization, we use the <em>negative</em> log likelihood instead of the log likelihood itself.</p>
<p>Finally, here is a vectorized Python implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">J</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="nb">sum</span><span class="p">(</span><span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">preds</span><span class="p">[</span><span class="n">np</span><span class="p">.</span><span class="n">arange</span><span class="p">(</span><span class="n">m</span><span class="p">),</span> <span class="n">y</span><span class="p">]))</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">preds[np.arange(m), y]</code> indexes <code class="language-plaintext highlighter-rouge">preds</code> at the true class of each example, discarding the other probabilities.</p>
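<p>A short self-contained example of this indexing and the resulting loss; the predicted probabilities below are made up:</p>

```python
import numpy as np

preds = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])  # made-up predicted probabilities
y = np.array([0, 1])                 # true classes
m = len(y)

picked = preds[np.arange(m), y]      # [0.7, 0.8]: probability of each true class
loss = np.sum(-np.log(picked))
print(loss)  # -log(0.7) - log(0.8)
```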
<h2 id="conclusion">Conclusion</h2>
<p>If you would like to play around with the concepts I introduced, I recommend you check out the <a href="https://github.com/rickwierenga/MLFundamentals/blob/master/3_Softmax_Regression.ipynb">corresponding notebook</a>.</p>
<p>You now have all the necessary knowledge to learn about neural networks, the topic of next week’s post.</p>Last time we looked at classification problems and how to classify breast cancer with logistic regression, a binary classification problem. In this post we will consider another type of classification: multiclass classification. In particular, I will cover one hot encoding, the softmax activation function and negative log likelihood.Generating docs for your Swift Package and hosting on GitHub Pages2020-02-19T00:00:00+00:002020-02-19T00:00:00+00:00https://rickwierenga.com/blog/apple/swift-packages-ci<p>Swift Packages are one of the most exciting applications of the Swift programming language. Packages allow for development beyond your usual app — it even works on Linux!</p>
<p>It’s well-known that projects without documentation don’t get the attention they deserve. And rightfully so, because it’s very hard for new “users” (developers who consume your package) to get started with your project if documentation is lacking. However, writing and maintaining documentation is often seen as a boring task. Besides, your documentation will quickly be outdated if you don’t update it.</p>
<p>Luckily, great tools exist to <em>generate documentation for you</em>. In this post, I’d like to give you a quick introduction to <code class="language-plaintext highlighter-rouge">jazzy</code>: a Realm project to automatically generate great documentation pages for your project. And it even works with Objective-C.</p>
<p>We’ll also look at how to host the generated documentation on GitHub Pages, for free! Simply put, a deployed version of your documentation is much better than just sending your users some HTML files.</p>
<p>And as icing on the cake, you’ll also learn how to use GitHub Actions to generate new docs each time you deploy a new version of your package, format your code through SwiftLint, and run your tests. All automated and completely free!</p>Swift Packages are one of the most exciting applications of the Swift programming language. Packages allow for development beyond your usual app — it even works on Linux!Logistic Regression from Scratch in Python2020-02-08T00:00:00+00:002020-02-08T00:00:00+00:00https://rickwierenga.com/blog/ml-fundamentals/logistic-regression<p>Classification is one of the biggest problems machine learning explores. Where we used polynomial regression to predict values in a continuous output space, logistic regression is an algorithm for discrete regression, or classification, problems.</p>
<p>In <a href="/blog/ml-fundamentals/polynomial-regression.html">the previous post</a> I explained polynomial regression problems based on a task to predict the salary of a person given certain aspects of that person. I also discussed basic machine learning terminology. In this post I describe logistic regression and classification problems. By working through another example, predicting breast cancer, you will learn how to build your own classification model. I will also cover alternative metrics for measuring the accuracy of a machine learning model.</p>
<div class="warning">
I recommend you read the previous posts in this series before you continue reading because each post builds upon the previously explained principles. <a href="/blog/ml-fundamentals">Series homepage</a>.
</div>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">%</span><span class="n">pylab</span> <span class="n">inline</span>
</code></pre></div></div>
<p>The code is available in <a href="https://github.com/rickwierenga/MLFundamentals">the corresponding GitHub repository for this series</a> (leave a star :)). I encourage you to run the notebook alongside this post.</p>
<h2 id="classification">Classification</h2>
<p>In a classification problem, the target values are called <em>labels</em>. Each label corresponds to a certain <em>class</em> such as “car”, “blue” or “malignant.” Each instance belongs to a certain class*, thus having a label. Both the labels and classes have to be unique, but more than one instance per class is allowed—it’s actually strongly encouraged. Usually the first class gets the label 0 and following classes get labels of 1, 2, 3, etc. This gives us the property that <script type="math/tex">y^{(i)} \in \{0, 1, \ldots, K-1 \}</script> where $K$ is the number of classes.</p>
<p>With the terminology of the previous post, we could state that for binary classification <script type="math/tex">y \in \{0, 1\}^m</script> where <script type="math/tex">0</script> and <script type="math/tex">1</script> are labels. Further, in binary classification the instances belonging to the $0$-class are the “<em>negatives</em>”, often indicating the absence of something, and the instances in the $1$-class are the “<em>positives</em>”. Another way to look at this is that for negatives there is a $0\%$ chance of something occurring, whereas for positives there’s a $100\%$ chance. These percentages will, hopefully, be the output of a logistic regression model.</p>
<p>*If you wish to classify instances as not belonging to a certain class, you assign a “not classified” class.</p>
<h2 id="the-dataset">The Dataset</h2>
<p>The dataset we are working with today is the <a href="https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)">Breast Cancer Data Set</a> [1]. For this dataset we have the following properties:</p>
<script type="math/tex; mode=display">m = 669 \quad \quad n = 9</script>
<p>To keep the blog post concise and focused, I explain in the notebook how to load and clean the data with pandas.</p>
<h2 id="modelling">Modelling</h2>
<p>Much like the previous problem we need a way to map input values to output values. We will once again call this model <script type="math/tex">h_\theta</script>.</p>
<p>Remember the model from <a href="/blog/ml-fundamentals/polynomial-regression.html">polynomial regression</a>:</p>
<script type="math/tex; mode=display">h_\theta(x) = \theta^Tx</script>
<p>We will make one small change to this model to work with classification.</p>
<p>Note that the model can output values much greater than $1$ and much smaller than $0$. This is unnatural and not desired—we would like to get values between $0$ and $1$ so we can interpret the output as probabilities. To achieve this we use the <em>sigmoid</em> function:</p>
<script type="math/tex; mode=display">g(z) = \frac {1}{1+e^{-z}}</script>
<blockquote>
<p>Note: $g$ is a general symbol for an activation function (more on that in future posts); it does not specifically mean sigmoid.</p>
</blockquote>
<p>While this function looks complicated, it’s easy to see why it’s used when you look at its graph:</p>
<p><img src="/assets/images/log/sigmoid.png" alt="simgoid plot" /></p>
<p>For values $x < 0$ we have that $g(x) < 0.5$ and for $x > 0$ we have $g(x) > 0.5$.</p>
<p>In Python you can just copy the formula over:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">g</span><span class="p">(</span><span class="n">z</span><span class="p">):</span>
    <span class="s">""" sigmoid """</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="mi">1</span> <span class="o">+</span> <span class="n">np</span><span class="p">.</span><span class="n">exp</span><span class="p">(</span><span class="o">-</span><span class="n">z</span><span class="p">))</span>
</code></pre></div></div>
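<p>A few quick sanity checks on $g$ (the definition is repeated so the snippet runs on its own):</p>

```python
import numpy as np

def g(z):
    """ sigmoid """
    return 1 / (1 + np.exp(-z))

print(g(0))    # 0.5
print(g(10))   # close to 1
print(g(-10))  # close to 0
```

Any real input is squashed into $(0, 1)$, which is exactly what lets us read the output as a probability.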
<p>As you can see, if we modify $h$ to be</p>
<script type="math/tex; mode=display">h_\theta(x) = g(\theta^Tx)</script>
<p>we have a model that outputs probabilities of an example $x$ belonging to the positive class, or $P(y = 1|x; \theta)$ in mathematical terms. For negative classes we have $P(y = 0|x; \theta) = 1 - P(y = 1|x; \theta)$.</p>
<p>During training we modify the values in $\theta$ to yield high values for $\theta^Tx$ when $x$ is a positive example, and vice versa.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">g</span><span class="p">(</span><span class="n">X</span> <span class="o">@</span> <span class="n">theta</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="decision-boundary">Decision boundary</h3>
<p>To make a final decision about which class a certain example $x$ belongs to, we define a certain threshold. If $h$ exceeds that threshold we predict $1$, otherwise $0$. Most commonly we use $0.5$, meaning we predict the positive class when we are more than $50\%$ sure, but the right threshold depends on the context.*</p>
<p>If we are looking for labels, we have:</p>
<script type="math/tex; mode=display">h_\theta(x) = g(\theta^Tx) > 0.5</script>
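<p>With NumPy, applying the threshold to a vector of probabilities is a one-liner; the probabilities below are made up:</p>

```python
import numpy as np

probs = np.array([0.2, 0.8, 0.55])      # hypothetical outputs of h
predictions = (probs > 0.5).astype(int)  # threshold at 0.5
print(predictions)  # [0 1 1]
```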
<p>If you think about the feature space as an $n$-dimensional space, the decision boundary is an $(n-1)$-dimensional surface that divides the feature space into two parts. If you cross that surface, the prediction about your class changes. The further you move away from the decision boundary, the more certain you are about the class you’re in.</p>
<p>* More on that later in this post.</p>
<h2 id="the-loss-function">The loss function</h2>
<p>Let’s look at <a href="/blog/ml-fundamentals/polynomial-regression.html#loss">mean squared error loss function</a> again:</p>
<script type="math/tex; mode=display">J(\theta) = \frac{1}{m}\displaystyle\sum_i^m (h_\theta(x^{(i)}) - y^{(i)})^2</script>
<p>With this function you can minimize the Euclidean distance between a prediction and the target. While this function would also work for logistic regression, it turns out to be very hard to optimize with an optimization algorithm like gradient descent. The reason is that the loss function becomes very non-convex as a result of the nonlinearity $g$. In other words, there is more than one minimum, and we cannot be sure a particular minimum is the best fit for our model.</p>
<p>That’s the reason we use another loss function:</p>
<script type="math/tex; mode=display">% <![CDATA[
J(\theta) = \begin{cases}
-\log(1 - h_\theta(x)) & \text{if } y = 0 \\
-\log(h_\theta(x)) & \text{if } y = 1
\end{cases} %]]></script>
<p>Let’s break that down by looking at a few examples. Suppose we have the label $y^{(i)}$ and the prediction $h_\theta(x^{(i)})$:</p>
<p>If $y^{(i)} = h_\theta(x^{(i)}) = 0$ we have a loss of $-\log(1 - h_\theta(x)) = -\log(1 - 0) = -\log1 = 0$</p>
<p>If $y^{(i)} = h_\theta(x^{(i)}) = 1$ we have a loss of $-\log(h_\theta(x)) = -\log1 = 0$</p>
<p>If $y^{(i)} = 0$ but $h_\theta(x^{(i)}) = 1$ we have a loss of $-\log(1 - h_\theta(x)) = -\log(1 - 1) = -\log0 = \infty$</p>
<p>If $y^{(i)} = 1$ but $h_\theta(x^{(i)}) = 0$ we have a loss of $-\log(h_\theta(x)) = -\log0 = \infty$</p>
<p>In general, as the model moves closer to the wrong prediction, the loss gets progressively higher. Let’s look at the loss as a function of $z$, the input to the sigmoid $g$:</p>
<p><img src="/assets/images/log/losses.png" alt="losses" /></p>
<p>For $y = 0$, as $z$ approaches $-\infty$, $g(z)$ approaches $0$ so the loss approaches $0$ as well. For $y = 1$, as $z$ approaches $\infty$, $g(z)$ approaches $1$, so the loss approaches $0$.</p>
<p>Another way to understand this is that this loss function pushes the model to be very sure about its predictions. The model will always be penalized, even if it predicts the correct class, except when it has $100\%$ certainty. For example, suppose the model is $51\%$ sure ($h(x) = 0.51$) that an example belongs to class $1$, thus predicting class $1$; it will still have a loss of $- \log 0.51 \approx 0.67$.</p>
<p>Because we know that $y^{(i)} \in \{0, 1\}$, there’s a shorter way of writing this function:</p>
<script type="math/tex; mode=display">J(\theta) = -\frac{1}{m}\left[\displaystyle\sum_{i=1}^{m}y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]</script>
<p>If $y^{(i)} = 1$, the first term is multiplied by $1$ while $(1-y^{(i)}) = 1 - 1 = 0$, so the second term vanishes. The opposite holds when $y^{(i)} = 0$.</p>
<p>In Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">J</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">1</span><span class="o">/</span><span class="n">m</span> <span class="o">*</span> <span class="p">(</span><span class="o">-</span><span class="n">y</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="n">preds</span><span class="p">)</span> <span class="o">-</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">@</span> <span class="n">np</span><span class="p">.</span><span class="n">log</span><span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">preds</span><span class="p">))</span>
</code></pre></div></div>
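<p>As a sanity check (my own, not part of the derivation), the vectorized form can be compared against the piecewise definition averaged over a few hand-made predictions:</p>

```python
import numpy as np

preds = np.array([0.9, 0.2, 0.7])   # hypothetical model outputs
y = np.array([1.0, 0.0, 1.0])       # matching labels
m = len(y)

# vectorized cross-entropy, as in J above
combined = 1/m * (-y @ np.log(preds) - (1 - y) @ np.log(1 - preds))

# piecewise definition, averaged over the examples
piecewise = np.mean([-np.log(p) if t == 1 else -np.log(1 - p)
                     for p, t in zip(preds, y)])

assert np.isclose(combined, piecewise)
```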
<h3 id="derivative">Derivative</h3>
<p>Just like polynomial regression, we will use the derivative of the loss function to calculate a gradient descent step.</p>
<p>The vectorized derivative for $J$ is given as:</p>
<script type="math/tex; mode=display">\nabla J(\theta) = \frac{1}{m} X^T (h_\theta(X) - y)</script>
<p>In Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">compute_gradient</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="n">preds</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="n">gradient</span> <span class="o">=</span> <span class="mi">1</span><span class="o">/</span><span class="n">m</span> <span class="o">*</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">(</span><span class="n">preds</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span>
<span class="k">return</span> <span class="n">gradient</span>
</code></pre></div></div>
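<p>One way to gain confidence in this derivative (a check of my own, assuming the sigmoid hypothesis <code class="highlighter-rouge">h</code> from earlier in the post) is to compare it against a central finite-difference approximation on a tiny random problem:</p>

```python
import numpy as np

def g(z):
    return 1 / (1 + np.exp(-z))   # sigmoid

def h(X, theta):
    return g(X @ theta)

def J(theta, X, y):
    m = len(y)
    p = h(X, theta)
    return 1/m * (-y @ np.log(p) - (1 - y) @ np.log(1 - p))

def compute_gradient(theta, X, y):
    m = len(y)
    return 1/m * X.T @ (h(X, theta) - y)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = np.array([0., 1., 1., 0., 1.])
theta = rng.normal(size=3)

# central differences: (J(theta + eps*e_i) - J(theta - eps*e_i)) / (2*eps)
eps = 1e-6
numeric = np.array([
    (J(theta + eps * np.eye(3)[i], X, y) - J(theta - eps * np.eye(3)[i], X, y)) / (2 * eps)
    for i in range(3)
])
assert np.allclose(numeric, compute_gradient(theta, X, y), atol=1e-6)
```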
<h2 id="training">Training</h2>
<p>We will again use gradient descent as our optimization algorithm. For more information, refer to the <a href="/blog/ml-fundamentals/polynomial-regression.html#regression-with-gradient-descent">first post in this series</a>.</p>
<p>A basic training loop would look like this:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">gradient</span> <span class="o">=</span> <span class="n">compute_gradient</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">theta</span> <span class="o">-=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">gradient</span>
</code></pre></div></div>
<p>We will later update it to print out statistics about the performance of the model.</p>
<h2 id="measuring-performance">Measuring performance</h2>
<p>One of the only ways we could measure the performance of the polynomial regression model was through a loss function. Because classification problems are discrete, new possibilities for measuring performance open up, two of which I will discuss now.</p>
<h3 id="accuracy">Accuracy</h3>
<p>The first one is accuracy: the percentage of examples we correctly predicted the class for.</p>
<p>Implementing this in Python is easy: count the number of instances we get correct and divide by the total number of items. In mathematical terms, that’s the average.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">preds</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="p">((</span><span class="n">preds</span> <span class="o">></span> <span class="mf">0.5</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
</code></pre></div></div>
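<p>For intuition, here is the same computation on a tiny hand-made batch (the arrays are invented for illustration):</p>

```python
import numpy as np

probs  = np.array([0.9, 0.3, 0.6, 0.1])  # hypothetical sigmoid outputs
labels = np.array([1,   0,   0,   0])

# threshold at 0.5, compare to the labels, and average
acc = ((probs > 0.5) == labels).mean()
assert acc == 0.75   # 3 of the 4 predictions are correct
```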
<p>We can update the training loop to print out the accuracy and loss every 10 epochs:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hist</span> <span class="o">=</span> <span class="p">{</span><span class="s">'loss'</span><span class="p">:</span> <span class="p">[],</span> <span class="s">'acc'</span><span class="p">:</span> <span class="p">[]}</span>
<span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.1</span>
<span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
<span class="n">gradient</span> <span class="o">=</span> <span class="n">compute_gradient</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">theta</span> <span class="o">-=</span> <span class="n">alpha</span> <span class="o">*</span> <span class="n">gradient</span>
<span class="c1"># loss
</span> <span class="n">preds</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="n">loss</span> <span class="o">=</span> <span class="n">J</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">hist</span><span class="p">[</span><span class="s">'loss'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">loss</span><span class="p">)</span>
<span class="c1"># acc
</span> <span class="n">acc</span> <span class="o">=</span> <span class="p">((</span><span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span> <span class="o">></span> <span class="p">.</span><span class="mi">5</span><span class="p">)</span> <span class="o">==</span> <span class="n">y</span><span class="p">).</span><span class="n">mean</span><span class="p">()</span>
<span class="n">hist</span><span class="p">[</span><span class="s">'acc'</span><span class="p">].</span><span class="n">append</span><span class="p">(</span><span class="n">acc</span><span class="p">)</span>
<span class="c1"># print stats
</span> <span class="k">if</span> <span class="n">i</span> <span class="o">%</span> <span class="mi">10</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span> <span class="k">print</span><span class="p">(</span><span class="n">loss</span><span class="p">,</span> <span class="n">acc</span><span class="p">)</span>
</code></pre></div></div>
<p>This also keeps track of the loss and accuracy during training. If we plot the arrays we get the following graphs:</p>
<p><img src="/assets/images/log/eval.png" alt="train history evaluation" /></p>
<h3 id="the-f1-score">The F1 score</h3>
<p>While accuracy might seem like the perfect metric, it can be misleading. For instance, suppose we had very few negatives in our dataset. A model that always predicts $1$ would have a high accuracy while not actually being very useful. A better measure is the F1 score: a single scalar indicating model performance.</p>
<p>To understand it, let’s first look at the following table:</p>
<table>
<thead>
<tr>
<th> </th>
<th>label: 1</th>
<th>label: 0</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>prediction: 0</strong></td>
<td>false negative</td>
<td>true negative</td>
</tr>
<tr>
<td><strong>prediction: 1</strong></td>
<td>true positive</td>
<td>false positive</td>
</tr>
</tbody>
</table>
<p>The <em>precision</em> of a model is defined as:</p>
<script type="math/tex; mode=display">p = \text{precision} = \frac{\text{True positives}}{\text{True positives}+\text{False positives}}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">precision</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
  <span class="n">tp</span> <span class="o">=</span> <span class="p">((</span><span class="n">preds</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">labels</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)).</span><span class="nb">sum</span><span class="p">()</span>
  <span class="n">fp</span> <span class="o">=</span> <span class="p">((</span><span class="n">preds</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">labels</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)).</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">return</span> <span class="n">tp</span> <span class="o">/</span> <span class="p">(</span><span class="n">tp</span> <span class="o">+</span> <span class="n">fp</span><span class="p">)</span>
</code></pre></div></div>
<p>The <em>recall</em> of a model is defined as:</p>
<script type="math/tex; mode=display">r = \text{recall} = \frac{\text{True positives}}{\text{True positives}+\text{False negatives}}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">recall</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
  <span class="n">tp</span> <span class="o">=</span> <span class="p">((</span><span class="n">preds</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">labels</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)).</span><span class="nb">sum</span><span class="p">()</span>
  <span class="n">fn</span> <span class="o">=</span> <span class="p">((</span><span class="n">preds</span> <span class="o">==</span> <span class="mi">0</span><span class="p">)</span> <span class="o">&amp;</span> <span class="p">(</span><span class="n">labels</span> <span class="o">==</span> <span class="mi">1</span><span class="p">)).</span><span class="nb">sum</span><span class="p">()</span>
<span class="k">return</span> <span class="n">tp</span> <span class="o">/</span> <span class="p">(</span><span class="n">tp</span> <span class="o">+</span> <span class="n">fn</span><span class="p">)</span>
</code></pre></div></div>
<blockquote>
<p>“$p$ is the number of correct positive results divided by the number of all positive results returned by the classifier, and $r$ is the number of correct positive results divided by the number of all relevant samples (all samples that should have been identified as positive)” <a href="https://en.wikipedia.org/wiki/F1_score">source: Wikipedia</a></p>
</blockquote>
<p>The F1 score is defined as the <a href="https://en.wikipedia.org/wiki/Harmonic_mean">harmonic mean</a> of the precision and recall:</p>
<script type="math/tex; mode=display">F_{1}=\left({\frac {2}{\mathrm {recall} ^{-1}+\mathrm {precision} ^{-1}}}\right)=2\cdot {\frac {\mathrm {precision} \cdot \mathrm {recall} }{\mathrm {precision} +\mathrm {recall} }}</script>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">f1</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">):</span>
<span class="k">return</span> <span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">precision</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span> <span class="o">*</span> <span class="n">recall</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">))</span> <span class="o">/</span> <span class="p">(</span><span class="n">precision</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">)</span> <span class="o">+</span> <span class="n">recall</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">labels</span><span class="p">))</span>
</code></pre></div></div>
<p>The optimal value is $1$, where we have perfect precision and recall. The worst value is, as you might expect, $0$.</p>
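<p>To make the definitions concrete, here they are on a small invented batch, counting the confusion-matrix cells directly:</p>

```python
import numpy as np

preds  = np.array([1, 1, 0, 0, 1, 0])   # hypothetical thresholded predictions
labels = np.array([1, 0, 0, 1, 1, 0])

tp = ((preds == 1) & (labels == 1)).sum()   # 2 true positives
fp = ((preds == 1) & (labels == 0)).sum()   # 1 false positive
fn = ((preds == 0) & (labels == 1)).sum()   # 1 false negative

p = tp / (tp + fp)          # precision = 2/3
r = tp / (tp + fn)          # recall    = 2/3
f1 = 2 * p * r / (p + r)    # harmonic mean = 2/3
assert np.isclose(f1, 2/3)
```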
<h2 id="optimizing-model-performance">Optimizing model performance</h2>
<p>Back to our goal of classifying breast cancer in patients given certain statistics about the patient. Our model has quite a low recall of $0.5$. This means that our model has many false negatives: it did not predict cancer was present, while actually the patient did have cancer. Given our so-called <em>domain knowledge</em>, we might try to optimize for recall instead of accuracy: we would rather predict too many patients have cancer than too few.</p>
<p>Let’s make a plot of how the recall changes as we update the threshold:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">recalls</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">p</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
  <span class="n">preds</span> <span class="o">=</span> <span class="p">(</span><span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span> <span class="o">></span> <span class="n">p</span> <span class="o">/</span> <span class="mi">100</span><span class="p">)</span>
<span class="n">r</span> <span class="o">=</span> <span class="n">recall</span><span class="p">(</span><span class="n">preds</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="n">recalls</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">r</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/log/recalls.png" alt="recall plot" /></p>
<p>While our recall is already quite good, we can do a little better if we choose a threshold of $52\%$ as opposed to $50\%$.</p>
<h2 id="classification-vs-regression">Classification vs Regression</h2>
<p>One could argue that the problem in the previous post could be regarded as a classification problem where the income is a label. This is a valid argument, and we could build a classification model with no modifications to the dataset. However, given the context it is more natural to solve that problem with a regression model: salary is a continuous variable (continuous enough, that is). For breast cancer, deciding what kind of model to build is easy too: patients do not have “degrees of cancer.”</p>
<h2 id="whats-next">What’s next?</h2>
<p>This was the second post of the <a href="/blog/ml-fundamentals">“ML from the Fundamentals”</a> series. In a future post I will be discussing neural networks: a more sophisticated solution for classification problems. We will also look at unsupervised learning (without targets). Be sure to follow me on Twitter so you stay up to date with the series. I’m <a href="https://twitter.com/rickwierenga">@rickwierenga</a>.</p>
<h2 id="sources">Sources</h2>
<p>[1] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. 2017. url: http://archive.ics.uci.edu/ml.</p>Classification is one of the biggest problems machine learning explores. Where we used polynomial regression to predict values in a continuous output space, logistic regression is an algorithm for discrete regression, or classification, problems.Google Code In 2019/2020 (TensorFlow) - A Review2020-01-23T00:00:00+00:002020-01-23T00:00:00+00:00https://rickwierenga.com/blog/gci/GCI<p>Each year Google organizes Google Code-In: a programming competition for teenagers aged 13 to 17. Different organizations offer a wide variety of tasks for students from all around the world to complete. These tasks take 3 to 10 hours to complete, depending on the requirements and creativity of the student. They receive feedback from mentors and get a chance to incorporate the feedback in their work. When they are done the mentors can accept the task. Now the student can claim another task. And repeat! And repeat!</p>
<p><img src="/assets/images/gci/header.png" alt="gci banner" /></p>
<p>About a month before the contest started <a href="https://twitter.com/bradlarson/">Brad Larson</a> from the Swift for TensorFlow team and <a href="https://twitter.com/DynamicWebPaige">Paige Bailey</a>, the TensorFlow product manager, emailed me suggesting to take part. Gladly! I couldn’t wait for the contest to start.</p>
<p>Today is the last day to work on tasks. Time has flown by! I have completed 29 tasks and learned an incredible amount about TensorFlow and machine learning but also communication and open source. The competition had very interesting tasks encouraging me to explore things I wouldn’t even have known about. In this post I would like to share some of my favorite moments of the contest.</p>
<h2 id="claiming-tasks">Claiming tasks</h2>
<p>Let me start by giving a quick overview of how the Code-In platform works.</p>
<p>When you open the website you see the dashboard:
<img src="/assets/images/gci/dashboard.png" alt="dashboard" /></p>
<p>The top left corner used to show the current task, but since I have completed my last task it shows a little message instead.</p>
<p>Each task has its own page where you can read the task description, talk to the mentors and submit your task. When your task is submitted the mentors can review it and send it back requesting more work or approve the task.</p>
<h2 id="swift-for-tensorflow">Swift for TensorFlow</h2>
<p>I started off by completing most of the Swift for TensorFlow tasks (I later completed all of them). While working on these tasks I decided to curate a list of all of my work in a GitHub repository: <a href="https://github.com/rickwierenga/s4tf-notebooks">s4tf-notebooks</a>.</p>
<p><img src="/assets/images/gci/s4tf-notebooks.png" alt="s4tf-notebooks" /></p>
<p>One task was to create a new tutorial about the framework. I wrote a tutorial for beginners on how to get started with the framework: <a href="https://rickwierenga.com/blog/s4tf/s4tf-mnist.html">“Your first Swift for TensorFlow model”</a>. Writing this was a little counterintuitive at first because I had not written much Swift in a notebook before, but it was still a lot of fun! <a href="https://twitter.com/rickwierenga/status/1202531433010671616">The tweet</a> has more than 30K views and even got shared by <a href="https://twitter.com/clattner_llvm">Chris Lattner</a> himself!</p>
<p>While I was at it I decided to write another tutorial about Swift for TensorFlow (not a task): <a href="https://rickwierenga.com/blog/s4tf/s4tf-gan.html">“An introduction to Generative Adversarial Networks (in Swift for TensorFlow)”</a> where I provide in-depth explanations about GANs and showcase how to build a <em>deep convolutional generative adversarial network</em> in Swift.</p>
<p><img src="https://rickwierenga.com/assets/images/dcgan-arc.png" alt="DCGAN architecture" /></p>
<p>Because S4TF’s model garden did not have this model yet, I decided to create a PR. After some really helpful feedback on my code it got merged: <a href="https://github.com/tensorflow/swift-models/tree/master/DCGAN">https://github.com/tensorflow/swift-models/tree/master/DCGAN</a>! <a href="https://twitter.com/rickwierenga/status/1204335520849039365">The corresponding tweet</a> was also very popular—getting shared by <a href="https://twitter.com/jeremyphoward">Jeremy Howard</a> and multiple members of the Swift for TensorFlow team at Google.</p>
<p><img src="/assets/images/gci/dcgan.png" alt="DCGAN PR" /></p>
<p>Swift, being my first programming language, has always been one of my favorite languages to work with. Seeing it being adopted by Google for machine learning is very exciting because it allows me to combine two of my favorite things: Swift and machine learning. I’m definitely planning on continuing to contribute to the libraries.</p>
<h2 id="howpretty">HowPretty</h2>
<p>Another task was to deploy a TensorFlow model to iOS or Android. I have built other apps using machine learning before, but TensorFlow was different. The amount of freedom compared to something like Apple’s Core ML was astounding. The documentation was also very good.</p>
<p>While I could have made a simple classification app, I decided to build a very brutal app called <a href="https://github.com/rickwierenga/HowPretty">HowPretty</a>: an app that tells you how pretty you are! At this point the app is a prototype fulfilling the task requirements, but I am planning on polishing this app and putting it in the App Store soon.</p>
<p><img src="https://github.com/rickwierenga/HowPretty/raw/master/.github/howprettybanner.jpg" alt="HowPretty" />
(I’m not very pretty according to my app 😅)</p>
<p>I started off by looking for a dataset with faces. I found <a href="http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html">CelebA</a>, a dataset with more than 200,000 faces labeled with different attributes, including one called “Attractive.” I used a <a href="https://arxiv.org/abs/1704.04861">MobileNet</a> with transfer learning to train a model. The main advantages of this model are speed and efficiency. Because I wanted the app to run without internet access to preserve the users’ privacy, this was a crucial feature. Luckily, running a model locally is just as easy as running it in Firebase.</p>
<p>At first I was skeptical about whether the model would even learn, because prettiness is subjective. It turns out that neural networks can! I got about 80% validation accuracy, which is not great, but better than the 50% I had expected. When I put this app into production I will retrain the model on the full dataset, focusing more on model performance.</p>
<p>Other things I will improve when this app goes into production:</p>
<ul>
<li>Use bounding boxes of face to crop the image for higher accuracy. Tell the user to move closer or further away to get the perfect resolution (150 by 150).</li>
<li>Make an Android version, possibly using React Native or Flutter.</li>
<li>Add a share-on-social-media button</li>
</ul>
<p>For more details, see the <a href="https://github.com/rickwierenga/HowPretty/blob/master/README.md">README</a>.</p>
<h2 id="tensorflowjs">TensorFlow.js</h2>
<p>Machine learning is more than Python and that’s what GCI has made very clear. Throughout the competition I have used 3 different programming languages in combination with machine learning.</p>
<p>The TensorFlow.js task was to export a Keras model to TensorFlow.js, load it into JavaScript and describe the differences between the different TF.js APIs.</p>
<p>The following snippet shows how to export a model to be used on the web:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="s">'model.h5'</span><span class="p">)</span>
<span class="err">!</span><span class="n">tensorflowjs_converter</span> <span class="o">--</span><span class="n">input_format</span><span class="o">=</span><span class="n">keras</span> <span class="o">/</span><span class="n">content</span><span class="o">/</span><span class="n">model</span><span class="p">.</span><span class="n">h5</span> <span class="o">/</span><span class="n">content</span><span class="o">/</span><span class="n">model</span><span class="o">/</span>
</code></pre></div></div>
<p><img src="/assets/images/gci/tfjs.png" alt="tfjs" /></p>
<h2 id="the-100-layer-tiramisu-implementing-a-paper">The 100 layer Tiramisu: implementing a paper</h2>
<p>This task required us to segment images using Tiramisu: a U-Net like neural network. I could not find an implementation I really liked so I decided to implement the model myself. Implementing a full paper was very exciting. There were two papers I used to implement the full model: <a href="https://arxiv.org/pdf/1611.09326.pdf">“The One Hundred Layers Tiramisu:
Fully Convolutional DenseNets for Semantic Segmentation”</a> and <a href="https://arxiv.org/pdf/1608.06993.pdf">“Densely Connected Convolutional Networks”</a>.</p>
<p><img src="/assets/images/gci/tiramisu.png" alt="tiramisu" /></p>
<p>I also submitted <a href="https://github.com/keras-team/keras-applications/issues/163">a PR</a> for this model to <a href="https://github.com/keras-team/keras-applications/">keras-team/keras-applications</a>, but the team seems to be moving slowly.</p>
<p>You can view the full implementation in <a href="https://colab.research.google.com/drive/1I2taXqYg6sgxA9vjfRmdp0J_8kE-ryYz">Google Colab</a>.</p>
<p><img src="/assets/images/gci/tiramisu2.png" alt="tiramisu2" /></p>
<h2 id="rickwierengatensorflow-tutorials">rickwierenga/TensorFlow-Tutorials</h2>
<p>I took the task “[Advanced] Upgrade a TensorFlow 1.x tutorial using tf.keras to TF 2.0” a bit further and decided to update every tutorial in <a href="https://github.com/nlintz/TensorFlow-Tutorials">nlintz/TensorFlow-Tutorials</a>, a repository with more than 5.8K stars, to TensorFlow 2.</p>
<p>You can view the repository <a href="https://github.com/rickwierenga/TensorFlow-Tutorials">here</a>.</p>
<p><img src="/assets/images/gci/tf-tutorials.png" alt="Screenshot of the repository README" /></p>
<h2 id="heatmap">Heatmap</h2>
<p>This task was very satisfying because I could clearly see the similarities between convolutional neural networks and human brains like my own. I think it is very interesting how humans can focus on one object, and choose where they want to focus without external supervision. Apparently, artificial neural networks do the same.</p>
<p>To generate a heatmap we take the output of a convolutional layer, given the input image, and weigh every channel by the gradient of its class activation. In other words, we take the activation of each channel of the input image in the final convolutional layer and weigh it with how class-like (in our case cheetah-like) the image is.</p>
<p><img src="/assets/images/gci/heatmap.png" alt="heatmap" />
This image is available under the Creative Commons Attribution-Share Alike 4.0 International license. <a href="https://commons.wikimedia.org/wiki/File:Cheetah_(Acinonyx_jubatus)_female_2.jpg">source</a></p>
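<p>The weighting step can be sketched in NumPy. The arrays below are random placeholders standing in for the real activations and gradients, which you would extract from the network with your framework of choice:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
conv_out = rng.random((7, 7, 512))      # final conv layer activations (H, W, C)
grads = rng.normal(size=(7, 7, 512))    # d(class score) / d(conv_out)

weights = grads.mean(axis=(0, 1))             # one importance weight per channel
heatmap = np.maximum(conv_out @ weights, 0)   # weighted channel sum, then ReLU
heatmap /= heatmap.max()                      # normalize to [0, 1] for overlaying

assert heatmap.shape == (7, 7)
```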
<h2 id="auto-encoders">Auto encoders</h2>
<p>Auto encoders were entirely new to me. I was glad I chose the task “Build a simple Auto-encoder model using tf.keras” because I learned a lot about them. They are probably the most interesting application of machine learning I have used so far.</p>
<p>I wrote a <a href="https://colab.research.google.com/drive/15ORxHNtUaTspOujGnxI5iqwPUhht-zKx">Colab tutorial</a> detailing:</p>
<ul>
<li><strong>Auto encoders</strong>: 96.2% data compression.</li>
<li><strong>Convolutional auto encoders</strong>: unsupervised classification.</li>
<li><strong>Denoising auto encoders</strong>: clean crappy data.</li>
</ul>
<p><img src="/assets/images/gci/autoencoder.png" alt="autoencoder" /></p>
<p>This tutorial was very well received by the mentors <a href="https://twitter.com/rickwierenga/status/1216801014004797446">as well as Twitter</a>:</p>
<p><img src="/assets/images/gci/autoencoder1.png" alt="autoencoder" /></p>
<h2 id="polynomial-regression">Polynomial regression</h2>
<p>During Christmas I wrote a blog post for the task: <a href="https://rickwierenga.com/blog/machine%20learning/polynomial-regression.html">“Tutorial for Polynomial Regression.”</a>. My blog post covers the basics of machine learning, the mathematical theory behind polynomial regression along with an implementation in Python. It also discusses feature selection and over/underfitting.</p>
<p><img src="/assets/images/gci/pol.png" alt="polynomial regression" /></p>
<p>This post was featured on the front page of Hacker News for more than 18 hours getting more than <a href="https://news.ycombinator.com/item?id=21879374">140 points</a>. Hacker News users provided a lot of good feedback improving future content on this site. In total the post attracted over 6000 new users!</p>
<p><img src="/assets/images/gci/hn.png" alt="hn post" /></p>
<h2 id="contributing-to-open-source">Contributing to open source</h2>
<p>Contributing to open source projects was very exciting because I got to work with many smart people from all around the world including Googlers.</p>
<p>Some merged PRs I made during Code-In:</p>
<ul>
<li><a href="https://github.com/tensorflow/tensorflow/pull/36056">Add usage example to pad_to_bounding_box #36056</a> in tensorflow/tensorflow</li>
<li><a href="https://github.com/tensorflow/tensorflow/pull/36091">Add usage example to tf.keras.utils.to_categorical #36091</a> in tensorflow/tensorflow</li>
<li><a href="https://github.com/vvmnnnkv/SwiftCV/pull/2">Add docs to README #2</a> in vvmnnnkv/SwiftCV</li>
<li><a href="https://github.com/Ayush517/S4TF-Tutorials/pull/16">Add contents section to README #16</a> in Ayush517/S4TF-Tutorials</li>
<li><a href="https://github.com/tensorflow/swift-models/pull/261">Add DCGAN #261</a> in tensorflow/swift-models</li>
</ul>
<p><img src="/assets/images/gci/goss.png" alt="google open source logo" /></p>
<h2 id="thanks-to-the-mentors-admins-and-organizers">Thanks to the mentors, admins and organizers</h2>
<p>Before wrapping up this post I would like to take a moment to thank the awesome mentors, admins and the organizers of Code-In for this amazing event. I really learned a lot by taking part in it.</p>
<p>I would like to thank the TensorFlow mentors in particular for investing their time into helping us, the students, with valuable feedback and counselling. This event would not have been possible without you.</p>
<p>I’m happy I got a chance to work with these mentors:</p>
<ul>
<li>Mohit Uniyal</li>
<li><a href="https://twitter.com/mantis0604">Ayush Agrawal</a></li>
<li><a href="https://twitter.com/kyscg7">Yasaswi</a></li>
<li><a href="https://twitter.com/RisingSayak">Sayak Paul</a></li>
<li>“freedom”</li>
<li>Nishant</li>
<li>Param Bhavsar</li>
<li><a href="https://twitter.com/gauravsaha0">Gaurav Saha</a></li>
<li><a href="https://twitter.com/HunarBatra">Hunar Batra</a></li>
<li><a href="https://twitter.com/utkarshsinha">Utkarsh Sinha</a></li>
<li>Sundaram Dubey</li>
<li><a href="https://twitter.com/GOVINDDIXIT05">Govind Dixit</a></li>
<li>Arun</li>
<li>Saket Prag</li>
<li>adityastic</li>
<li>Satyam Kumar</li>
<li>Sourav Das</li>
<li>Arthjain</li>
<li><a href="https://twitter.com/kurianbenoy2">kurianbenoy</a></li>
</ul>
<p>If you are a mentor (thanks!) and your Twitter handle is missing, my DMs are open!</p>
<p><strong>Y’all are awesome!</strong></p>
<h2 id="final-words">Final words</h2>
<p>The deadline just passed. I guess I’ll have to wait patiently until the winners are announced on February 10th! I guess I won’t get to sleep very much…</p>
<p><img src="/assets/images/gci/timeline.png" alt="timeline" /></p>
<h2 id="update-february-10th">Update: February 10th</h2>
<p><a href="https://twitter.com/GoogleOSS/status/1226972986370068481">Google just announced the winners</a> and I’m super proud to be one of them! I can’t wait to visit California again :)</p>
<p><img src="/assets/images/gci/winner.jpg" alt="Winner" /></p>Each year Google organizes Google Code-In: a programming competition for teenagers aged 13 to 17. Different organizations offer a wide variety of tasks for students from all around the world to complete. These tasks take 3 to 10 hours to complete, depending on the requirements and creativity of the student. They receive feedback from mentors and get a chance to incorporate the feedback in their work. When they are done the mentors can accept the task. Now the student can claim another task. And repeat! And repeat!Speech recognition and speech synthesis on iOS with Swift2020-01-22T00:00:00+00:002020-01-22T00:00:00+00:00https://rickwierenga.com/blog/apple/speech-ios<p>Everyone knows Siri, and many people use it every day. Why? Because Siri provides a very fast and user-friendly way of interacting with an iOS device.</p>
<p>Convenience is not the only motivation for this type of interaction, though. The combination of speech recognition and speech synthesis feels more personal than using a touch screen. On top of that, the option for verbal communication enables visually impaired people to interact with your app.</p>
<p>As you probably already know, Siri’s communication mechanism can be split up into two main components: speaking and listening. Speaking is formally known as “speech synthesis” whereas listening is often referred to as “speech recognition.” Although the tasks look very different in code, they have one thing in common: both are powered by machine learning.</p>
<p>Luckily, Apple’s speech synthesis and speech recognition APIs aren’t private — everyone has access to their cutting-edge technology. In this tutorial, you’ll build an app that uses those APIs to speak and listen to you.</p>Everyone knows Siri, and many people use it every day. Why? Because Siri provides a very fast and user-friendly way of interacting with an iOS device.Polynomial Regression from Scratch in Python2019-12-22T00:00:00+00:002019-12-22T00:00:00+00:00https://rickwierenga.com/blog/ml-fundamentals/polynomial-regression<p>Machine learning is one of the hottest topics in computer science today. And not without a reason: it has helped us do things that couldn’t be done before like image classification, image generation and natural language processing. But all of it boils down to a really simple concept: you give the computer data and the computer then finds patterns in that data. This is called “learning” or “training”, depending on your point of view. These learnt patterns can be extrapolated to make predictions. How? That’s what we are looking at today.</p>
<p>By working through a real world example you will learn how to build a polynomial regression model to predict salaries based on job position. Polynomial regression is one of the core concepts that underlies machine learning. I will discuss the mathematical motivations behind each concept. We will also look at <em>overfitting</em> and <em>underfitting</em> and why you want to avoid both.</p>
<h2 id="the-data">The data</h2>
<p>The first thing to always do when starting a new machine learning model is to load and inspect the data you are working with. As I mentioned in the introduction, we are trying to predict the salary based on job position. To do so we have access to the following dataset:</p>
<iframe src="https://docs.google.com/spreadsheets/d/e/2PACX-1vQZ9sDZaTdJGBoCWlLB2B-Soiwhx3zf3mHxySGLieZsS_yTbdzb2vbcqtq1XtkHtOgFvWQNJpYsNpuj/pubhtml?gid=1079365787&single=true&widget=true&headers=false"></iframe>
<p>As you can see we have three columns: position, level and salary. Position and level represent the same thing, but in a different representation. Because it’s easier for computers to work with numbers than with text, we usually map text to numbers.</p>
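<p>For instance, such a mapping could be expressed as a simple dictionary (the titles below are illustrative, not the exact values from the dataset):</p>

```python
# Hypothetical mapping from position titles to numeric levels.
position_to_level = {
    "Business Analyst": 1,
    "Junior Consultant": 2,
    "Senior Consultant": 3,
}

# Replace each title by its numeric representation.
positions = ["Senior Consultant", "Business Analyst"]
levels = [position_to_level[p] for p in positions]
print(levels)  # [3, 1]
```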
<p>In this case the levels are the <em>input data</em> to the model, and the salaries are the <em>output data</em>. With this data we will build a model at <em>training time</em>, when both are available. At <em>inference time</em>, we will only have input data. Our job as machine learning engineers is to build a model that makes good predictions at <em>inference time</em>.</p>
<p>The input data is usually called $X \in \mathbb{R}^{m \times n}$ where $m$ is the number of <em>training examples</em>, $10$ in our case, and $n$ the dimensionality, or number of features, 1 in our case. A training example is a row in the <em>input dataset</em> which has <em>features</em>, or aspects, which we are using to make predictions.</p>
<p>The output data is called $\vec{y} \in \mathbb{R}^m$, a vector because it typically has only one column.</p>
<p>So in our case</p>
<script type="math/tex; mode=display">X = \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ 5 \\ 6 \\ 7 \\ 8 \\ 9 \\10 \end{bmatrix} \quad\quad\quad y = \begin{bmatrix} 45000 \\ 50000 \\ 60000 \\ 80000 \\ 110000 \\ 150000 \\ 200000 \\ 300000 \\ 500000 \\ 1000000\end{bmatrix}</script>
<p>Of course</p>
<script type="math/tex; mode=display">\dim(\vec{y}) = m</script>
<p>In Python:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([[</span><span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">,</span> <span class="mi">8</span><span class="p">,</span> <span class="mi">9</span><span class="p">,</span> <span class="mi">10</span><span class="p">]]).</span><span class="n">T</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">array</span><span class="p">([</span><span class="mi">45000</span><span class="p">,</span> <span class="mi">50000</span><span class="p">,</span> <span class="mi">60000</span><span class="p">,</span> <span class="mi">80000</span><span class="p">,</span> <span class="mi">110000</span><span class="p">,</span> <span class="mi">150000</span><span class="p">,</span> <span class="mi">200000</span><span class="p">,</span> <span class="mi">300000</span><span class="p">,</span> <span class="mi">500000</span><span class="p">,</span> <span class="mi">1000000</span><span class="p">])</span>
<span class="n">m</span><span class="p">,</span> <span class="n">n</span> <span class="o">=</span> <span class="n">X</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>
<p>The $i$th training example is $X^{(i)}, y^{(i)}$. The $j$th feature is $X_j$.</p>
<p>We can inspect our training set in a plot (since $n = 1$):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">,</span> <span class="s">'rx'</span><span class="p">)</span>
</code></pre></div></div>
<p><img src="/assets/images/pyr/1.png" alt="untrained fit" /></p>
<h2 id="the-hypothesis-function">The hypothesis function</h2>
<p>To predict output values from input values we use a hypothesis function called $h$, parameterized by $\theta \in \mathbb{R}^{n+1}$. We will fit $h$ to our datapoints so that it can be extrapolated to new values of $x$.</p>
<script type="math/tex; mode=display">h_\theta(x) = \theta_0 + \theta_1 x_1</script>
<p>In order to ease the computation later on we usually add a column of $1$’s at $X_0$ giving</p>
<script type="math/tex; mode=display">% <![CDATA[
X = \begin{bmatrix} 1 && 1 \\ 1 && 2 \\ 1 && 3 \\ 1 && 4 \\ 1 && 5 \\ 1 && 6 \\ 1 && 7 \\ 1 && 8 \\ 1 && 9 \\ 1 && 10 \end{bmatrix} %]]></script>
<p>so that</p>
<script type="math/tex; mode=display">h_\theta(x) = \theta^Tx</script>
<p>Because these $1$s change the hypothesis independently of the input $x$, this column is sometimes called the bias term. It is also the reason $\theta \in \mathbb{R}^{n+1}$ and not $\theta \in \mathbb{R}^n$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Add a bias factor to X.
</span><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span><span class="n">np</span><span class="p">.</span><span class="n">ones</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span> <span class="n">X</span><span class="p">))</span>
</code></pre></div></div>
<p>By changing the values of $\theta$ we can change the hypothesis $h_\theta(x)$.</p>
<h3 id="adding-polynomial-features">Adding polynomial features</h3>
<p>As you will probably have noticed $h$ is a polynomial of degree $1$ while our dataset is nonlinear. This function will always be a bad fit, no matter which values of $\theta$ we use.</p>
<p>To fix that we will add polynomial features to $X$, which, of course, also increases $n$.</p>
<p>By inspecting the plot we learn that adding polynomial features like $(X_j)^2$ could fit our dataset. Nonpolynomial features like $\sqrt{X_j}$ are also allowed, but not used in this tutorial because it’s called “<strong>polynomial</strong> regression.”</p>
<p>In this model I added 3 additional polynomial features, increasing $n$ to $4$.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">hstack</span><span class="p">((</span>
<span class="n">X</span><span class="p">,</span>
<span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">**</span> <span class="mi">2</span><span class="p">).</span><span class="n">reshape</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span>
<span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">**</span> <span class="mi">3</span><span class="p">).</span><span class="n">reshape</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">)),</span>
<span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">]</span> <span class="o">**</span> <span class="mi">4</span><span class="p">).</span><span class="n">reshape</span><span class="p">((</span><span class="n">m</span><span class="p">,</span> <span class="mi">1</span><span class="p">))</span>
<span class="p">))</span>
</code></pre></div></div>
<p>You should try adding or removing polynomial features yourself.</p>
<h3 id="normalization">Normalization</h3>
<p>When we added the features a new problem emerged: their ranges are very different from $X_1$. Every feature $X_j$ has an associated weight $\theta_j$ (more in that later). This means that a small change in a weight associated with a generally large feature has a much bigger impact than the same change has on a generally small feature. This causes problems when we are fitting the values $\theta$ later on.</p>
<p>To fix this problem we use a technique called <em>normalization</em>, defined as</p>
<script type="math/tex; mode=display">X_j := \frac{X_j - \mu_j}{\sigma_j} \quad \text{for } j \text{ in } 1\ldots n</script>
<p>where $\mu_j$ and $\sigma_j$ are the mean and standard deviation of $X_j$ respectively. Normalization sets the mean close to $0$ and the standard deviation to $1$, which generally benefits training. Note that we don’t normalize $X_0$: every entry equals its mean, so $\sigma_0 = 0$ and the division would be undefined.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span> <span class="o">=</span> <span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:]</span> <span class="o">-</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">))</span> <span class="o">/</span> <span class="n">np</span><span class="p">.</span><span class="n">std</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">:],</span> <span class="n">axis</span><span class="o">=</span><span class="mi">0</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="initializing-theta-and-making-predictions">Initializing $\theta$ and making predictions</h2>
<p>Before we make a prediction I would like to make a small change to the hypothesis function. Remember $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1$. Note that it only supports one feature, and a bias. We can generalize it as follows:</p>
<script type="math/tex; mode=display">h_\theta(x) = \displaystyle\sum_{i=0}^n \theta_i x_i = \theta_0 x_0 + \theta_1 x_1 + \ldots + \theta_n x_n</script>
<p>Here you can see the link between $X_j$ and $\theta_j$.</p>
<p>Because we will be using the hypothesis function many times, it should be very fast. Right now $h$ can only compute the prediction for one training example at a time.</p>
<p>We can change that by <em>vectorizing</em> it. If we implemented the sum by looping over each $x$ with its associated $\theta$, it would take a very long time. With vectorization you compute the outputs for an entire matrix, or vector, at once. While you technically compute the same values, good linear algebra libraries such as numpy optimize the use of the available hardware to speed up the process. A vectorized implementation of $h$:</p>
<script type="math/tex; mode=display">h_\theta(X) = X\theta</script>
<p>You can validate it works by writing down a few examples. This function takes the whole matrix $X$ as an input and produces the prediction $\hat{y}$ in one computation.</p>
<p>In Python $h_\theta(X)$ can be implemented as:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">):</span>
<span class="k">return</span> <span class="n">X</span> <span class="o">@</span> <span class="n">theta</span>
</code></pre></div></div>
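<p>As a sanity check, you can verify on a small example that the matrix product computes the same values as an explicit loop over each $x_i$ and $\theta_i$ (the numbers below are made up):</p>

```python
import numpy as np

# Two training examples, bias column included.
X_small = np.array([[1.0, 2.0], [1.0, 3.0]])
theta_small = np.array([0.5, 2.0])

# Vectorized: one matrix-vector product for all examples at once.
vectorized = X_small @ theta_small

# Looped: sum theta_i * x_i per training example.
looped = np.array([sum(t * x for t, x in zip(theta_small, row)) for row in X_small])

print(vectorized)  # [4.5 6.5]
assert np.allclose(vectorized, looped)
```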
<p>Before we can make predictions we need to initialize $\theta$. By convention we fill it with random numbers, but it does not make a difference in this program*.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">random</span><span class="p">(</span><span class="n">n</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span>
<span class="n">predictions</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
</code></pre></div></div>
<p>In a graph:</p>
<p><img src="/assets/images/pyr/2.png" alt="untrained fit" /></p>
<p>* Random initialization is crucial for symmetry breaking in neural networks.</p>
<h2 id="loss">Loss</h2>
<p>As you can see our current predictions are frankly quite bad. But what does “bad” mean? It’s much too vague for mathematicians.</p>
<p>To measure our model’s accuracy we use a loss function, in this case mean squared error, or MSE for short. While many loss functions exist, MSE is proven to be one of the best for regression problems like ours. It is defined as</p>
<script type="math/tex; mode=display">J(\theta) = \frac{1}{m}\displaystyle\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2</script>
<p>$J$ is a function of the current state of the model—the parameters $\theta$ that make up the model. For each example $i$ it takes the difference between our prediction and the actual label, and squares it (so signs do not matter). This number is the squared distance from our prediction to the actual datapoint. We take the average of these “distances”.</p>
<p>A vectorized Python implementation:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">J</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
<span class="k">return</span> <span class="n">np</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">np</span><span class="p">.</span><span class="n">square</span><span class="p">(</span><span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span> <span class="o">-</span> <span class="n">y</span><span class="p">))</span>
</code></pre></div></div>
<p>In my case I had a loss of $142\ 911\ 368\ 743$, which may vary slightly as a result of the random initialization.</p>
<h2 id="regression-with-gradient-descent">Regression with gradient descent</h2>
<p>We can improve our model, and decrease our loss, by changing the parameters $\theta$. We do that using an algorithm called gradient descent.</p>
<p>Gradient descent calculates the <em>gradient</em> of the cost function using its partial derivatives. This gives the slope of the cost function at our current position ($\theta$), indicating in which direction we should move.</p>
<p>This gradient is multiplied by a learning rate, often denoted as $\alpha$, to control the pace of learning*. The result of this multiplication is then subtracted from the weights to decrease the loss of further predictions.</p>
<p>Below is a plot of the loss function. The gradient decreases as $J$ approaches the minimum. <a href="https://www.quora.com/Whats-the-difference-between-gradient-descent-and-stochastic-gradient-descent">source</a></p>
<p><img src="/assets/images/gd.png" alt="plot of J" /></p>
<p>More formally, the partial derivative of $J$ with respect to the parameters $\theta$ is</p>
<script type="math/tex; mode=display">\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}X_j^T(X\theta -y)</script>
<p>In vectorized form for all $X_j$</p>
<script type="math/tex; mode=display">\nabla J(\theta) = \frac{1}{m}X^T(X\theta -y)</script>
<p>The gradient descent step is</p>
<script type="math/tex; mode=display">\theta := \theta - \alpha \nabla J(\theta) = \theta - \alpha \frac{1}{m}X^T(X\theta -y)</script>
<p>We repeat this computation many times. This is called <strong>training</strong>.</p>
<p>*Choosing a value of $\alpha$ is an interesting topic in itself, so I’m not going to discuss it in this article. If you’re interested you can learn more <a href="https://heartbeat.fritz.ai/an-empirical-comparison-of-optimizers-for-machine-learning-models-b86f29957050">here</a>.</p>
<h3 id="in-python">In Python</h3>
<p>A typical value for $\alpha$ is $0.01$. It’s interesting to play around with this value yourself.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">alpha</span> <span class="o">=</span> <span class="mf">0.01</span>
</code></pre></div></div>
<p>A gradient descent step can be implemented as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">m</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">((</span><span class="n">X</span> <span class="o">@</span> <span class="n">theta</span><span class="p">)</span> <span class="o">-</span> <span class="n">y</span><span class="p">))</span>
</code></pre></div></div>
<p>While training we often keep track of the loss to make sure it decreases as we progress. A training loop is a fancy term for performing multiple gradient descent steps. Our training loop:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">losses</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">500</span><span class="p">):</span>
<span class="n">theta</span> <span class="o">=</span> <span class="n">theta</span> <span class="o">-</span> <span class="n">alpha</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span><span class="o">/</span><span class="n">m</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="p">((</span><span class="n">X</span> <span class="o">@</span> <span class="n">theta</span><span class="p">)</span> <span class="o">-</span> <span class="n">y</span><span class="p">))</span>
<span class="n">losses</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">J</span><span class="p">(</span><span class="n">theta</span><span class="p">,</span> <span class="n">X</span><span class="p">,</span> <span class="n">y</span><span class="p">))</span>
</code></pre></div></div>
<p>We train for $500$ <em>epochs</em>.</p>
<p>Looking at our fit again:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">predictions</span> <span class="o">=</span> <span class="n">h</span><span class="p">(</span><span class="n">X</span><span class="p">,</span> <span class="n">theta</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">predictions</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'predictions'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">plot</span><span class="p">(</span><span class="n">X</span><span class="p">[:,</span> <span class="mi">1</span><span class="p">],</span> <span class="n">y</span><span class="p">,</span> <span class="s">'rx'</span><span class="p">,</span> <span class="n">label</span><span class="o">=</span><span class="s">'labels'</span><span class="p">)</span>
<span class="n">plt</span><span class="p">.</span><span class="n">legend</span><span class="p">()</span>
</code></pre></div></div>
<p><img src="/assets/images/pyr/3.png" alt="fitted" /></p>
<p>That looks much more promising than what we had before.</p>
<p>Let’s look at how loss decreased during training:</p>
<p><img src="/assets/images/pyr/4.png" alt="fitted" /></p>
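<p>A curve like the one above comes from plotting the <code>losses</code> list collected in the training loop. A self-contained sketch of the idea on toy data (the toy dataset and hyperparameters here are made up for illustration):</p>

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative only: a few gradient descent steps on a tiny linear dataset.
X_toy = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y_toy = np.array([2.0, 3.0, 4.0])
theta_toy = np.zeros(2)
alpha, m_toy = 0.1, len(y_toy)

losses = []
for _ in range(50):
    theta_toy -= alpha * (1 / m_toy) * (X_toy.T @ (X_toy @ theta_toy - y_toy))
    losses.append(np.mean(np.square(X_toy @ theta_toy - y_toy)))

# The loss should decrease monotonically for a small enough alpha.
plt.plot(losses)
plt.xlabel('epoch')
plt.ylabel('loss (MSE)')
plt.show()
```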
<p>We still have a loss of $2\ 596\ 116\ 902$. While this may seem like a huge number, it’s an improvement of almost $98.2\%$. Since we are working with huge numbers in this project we expect the loss to be high. This is one of the reasons you need to be familiar with the data you are working with.</p>
<p>Now that we have fitted $\theta$, we can make predictions by passing new values of $x$ to $h_\theta(x)$.</p>
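<p>As a sketch, this is what inference could look like end to end. Note that a new input has to go through the same pipeline as the training data: polynomial features, the bias term, and normalization with the <em>training set’s</em> mean and standard deviation, saved here as <code>mu</code> and <code>sigma</code> (I use a zero initialization instead of a random one so the result is reproducible):</p>

```python
import numpy as np

# Rebuild the pipeline on the raw levels, saving the normalization
# statistics so they can be reused at inference time.
levels = np.arange(1, 11).reshape(-1, 1).astype(float)
salaries = np.array([45000, 50000, 60000, 80000, 110000, 150000,
                     200000, 300000, 500000, 1000000], dtype=float)

features = np.hstack([levels ** d for d in range(1, 5)])  # degrees 1..4
mu, sigma = features.mean(axis=0), features.std(axis=0)
X = np.hstack((np.ones((len(levels), 1)), (features - mu) / sigma))

# Train as before: 500 gradient descent steps with alpha = 0.01.
theta = np.zeros(X.shape[1])
for _ in range(500):
    theta -= 0.01 * (1 / len(salaries)) * (X.T @ (X @ theta - salaries))

# Inference: a new level must go through the same feature pipeline.
x_new = 6.5
f_new = np.array([x_new ** d for d in range(1, 5)])
x_row = np.concatenate(([1.0], (f_new - mu) / sigma))
prediction = x_row @ theta
print(round(prediction))
```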
<h2 id="the-normal-equation">The normal equation</h2>
<p>Even though it is a very popular choice, gradient descent is not the only way to find values for $\theta$. Another method called the <em>normal equation</em> also exists. With this formula you can compute the optimal values for $\theta$ without choosing $\alpha$ and without iterating.</p>
<p>The normal equation is defined as</p>
<script type="math/tex; mode=display">\theta = (X^TX)^{-1}X^Ty</script>
<p>For more information on where this comes from check out <a href="https://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression">this</a> post.</p>
<p>The biggest advantage of this is that you always find the optimal value of $\theta$. Note that this is the best fit for the model ($h$) you built, and might not be the best solution for your problem in general.</p>
<p>A drawback of using this method over gradient descent is the computational cost. Computing the inverse $(X^TX)^{-1}$ is $O(n^3)$ so when you have many features, it might be very expensive. In cases where $n$ is large, think $n > 10\ 000$, you would probably want to switch to gradient descent or another training algorithm.</p>
<p>Implementing the normal equation in Python is just a matter of implementing the formula:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">theta</span> <span class="o">=</span> <span class="n">np</span><span class="p">.</span><span class="n">linalg</span><span class="p">.</span><span class="n">pinv</span><span class="p">(</span><span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">X</span><span class="p">)</span> <span class="o">@</span> <span class="n">X</span><span class="p">.</span><span class="n">T</span> <span class="o">@</span> <span class="n">y</span>
</code></pre></div></div>
<p>Note that we use the pseudoinverse instead of the real inverse because $X^TX$ might be noninvertible.</p>
<p><img src="/assets/images/pyr/7.png" alt="normal fit" /></p>
<p>Another risk of using the normal equation is overfitting, which I will cover in the next section.</p>
<h2 id="overfitting-underfitting-and-some-tips">Overfitting, underfitting, and some tips</h2>
<p>Before I wrap up this tutorial I would like to share a little more theory behind machine learning.</p>
<h3 id="splitting-datasets">Splitting datasets</h3>
<p>In this tutorial we used every available training example. In real world applications you would want to split your data into three categories:</p>
<ul>
<li><strong>Training data</strong>: is the data you train your model on.</li>
<li><strong>Validation data</strong>: is used to optimize hyperparameters, such as $\alpha$ and the number of epochs. While not very important in regression models, it is a crucial part of <em>deep learning</em>*. You do not use validation data during training because this subset is designed to optimize hyperparameters instead of weights. Confusing them leads to worse performance.</li>
<li><strong>Test data</strong>: is used to get a sense of how a model would perform in production. This dataset must not be used to improve the model. Changing parameters based on the test dataset will invalidate your ideas about its performance—use the validation set for this.</li>
</ul>
<p>* Deep learning is a subfield of machine learning.</p>
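<p>A split like this can be sketched with numpy. The 60/20/20 ratio below is a common convention rather than a rule, and the seed is fixed only for reproducibility:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

data = np.arange(100)                 # stand-in for the rows of (X, y)
indices = rng.permutation(len(data))  # shuffle before splitting

# 60% train, 20% validation, 20% test.
n_train = int(0.6 * len(data))
n_val = int(0.2 * len(data))

train = data[indices[:n_train]]
val = data[indices[n_train:n_train + n_val]]
test = data[indices[n_train + n_val:]]

print(len(train), len(val), len(test))  # 60 20 20
```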
<h3 id="feature-selection">Feature selection</h3>
<p>No two features in a dataset should be dependent on each other. If they were, it would put an excessive emphasis on the underlying cause, leading to worse accuracy. For example, we could add a “years of experience” feature to our dataset, but “main task” would not be a good idea since it’s highly correlated with the job position.</p>
<p>Remember that machine learning models are not magic. As a rule of thumb, a human must be able to draw the same conclusions given the same input data. The color of someone’s hair is most likely not related to their salary, so adding it to the dataset would only confuse gradient descent. The model could still find random correlations, which do not generalize well in production.</p>
<h3 id="overfitting--underfitting">Overfitting & underfitting</h3>
<p>Underfitting, but particularly overfitting, are perhaps the biggest problems in machine learning today. To really understand them we have to go back to the fundamental concept of machine learning: learning from data to make predictions about the future, using a model.</p>
<p>When your model is fit too specifically to your dataset it’s overfitted. While it has a very low loss, in extreme cases even $0$, it does not generalize well in the real world. The normal equation we used earlier actually overfitted the dataset: it found a function which passes through our training values very closely, but does not represent a true function of position to salary. For example, notice how it predicts a higher salary for the lowest position.</p>
<p><img src="/assets/images/pyr/7.png" alt="overfitted" /></p>
<p>Overfitting can occur when you train your model for too long. Another cause for overfitting is having too many features. To reduce overfitting you should try training your model for fewer epochs, or removing some features. You always want $m > n$.</p>
<p>Underfitting is the opposite of overfitting. When your model is too simple, for example when you try fitting a one degree polynomial to a multidegree dataset, it will underfit.</p>
<p><img src="/assets/images/pyr/6.png" alt="underfitted" /></p>
<p>To reduce underfitting you should try adding polynomial features to your dataset. Another cause might be a bad train/test split—you always want your data to be distributed evenly over your train and test sets. For example, we could put some randomly selected job positions in a test set. But if we used only the highest positions for testing, and everything else for training, the model would underfit because it has not seen the full environment it will be used in.</p>
<p>Another technique to reduce overfitting is called <em>data augmentation</em>. With data augmentation you can create more training examples without actively gathering more data, which might not even be available. If you were working with images, for example, you could try flipping them horizontally or cropping them.</p>
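<p>For image data, a horizontal flip is just a reversal of the column axis (a sketch on a tiny made-up array standing in for a grayscale image):</p>

```python
import numpy as np

# A fake 2x3 grayscale "image".
image = np.array([[1, 2, 3],
                  [4, 5, 6]])

# Horizontal flip: reverse the columns. Each flipped image is a new,
# equally valid training example.
flipped = np.fliplr(image)
print(flipped)  # [[3 2 1]
                #  [6 5 4]]
```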
<p>You can view the complete code for this project <a href="https://github.com/rickwierenga/MLFundamentals/blob/master/1_Polynomial_Regression.ipynb">here</a>.</p>
<h2 id="whats-next">What’s next?</h2>
<p>This concludes the first post in the <a href="/blog/ml-fundamentals">“ML from the Fundamentals” series</a>.</p>
<p>Machine learning is not just predicting salaries based on job titles, or even predicting any number based on input data. Predicting values in a continuous space, as we’ve done today, is called regression, a form of <em>supervised learning</em>, because we had labelled data (we knew $y$) available at training time.</p>
<p>Another form of <em>supervised learning</em> is classification where your goal is to assign a label to an input. For example, classifying images of handwritten digits would be a classification problem.</p>
<p><em>Unsupervised learning</em> is the other major subfield of machine learning: learning from data without labels. Grouping items based on similarity is one example. Recommendation systems like the YouTube algorithm also use machine learning under the hood.</p>
<p>If you are interested in learning more about machine learning, I recommend you check out the <a href="/blog/ml-fundamentals">series page</a> where I will post all blog posts in this ongoing series.</p>Machine learning is one of the hottest topics in computer science today. And not without a reason: it has helped us do things that couldn’t be done before like image classification, image generation and natural language processing. But all of it boils down to a really simple concept: you give the computer data and the computer then finds patterns in that data. This is called “learning” or “training”, depending on your point of view. These learnt patterns can be extrapolated to make predictions. How? That’s what we are looking at today.Deploying Core ML models using Vapor2019-12-20T00:00:00+00:002019-12-20T00:00:00+00:00https://rickwierenga.com/blog/apple/coremlapi<p>Core ML is Apple’s framework for machine learning. With Core ML, everyone can use machine learning in their apps—as long as that app runs on an Apple platform, and Apple platforms only. Core ML cannot be used with Android, Windows, or on websites. This is very unfortunate because Core ML is such a great piece of technology.</p>
<p>It would be great if we could use Core ML with Python, for example. And as a programmer, you should know that if you want something bad enough, you can make it happen. And that’s what I’m going to show you in this post.</p>
<p>We’ll be building a REST API wrapper around a Core ML model using a new framework called <a href="https://vapor.codes/">Vapor</a>. This API will be running on a remote server, accepting requests <em>from every platform</em>. This particular API will classify images, but any Core ML model should do.
Building an API around your Core ML model is not only beneficial when you want your model to work cross platform—even if your app only supports Apple devices, you might still want to consider using this approach over sending a copy of your Core ML model to each individual device.</p>
<p>First, your app size will decrease dramatically — Core ML models can be quite big. Second, you don’t need to update your app every time you improve your model — deploying new models can be done without Apple’s intervention.</p>
<p>I’ll start by introducing some web programming terminology you’ll need to know before writing a web app. Then we’ll write our own web app that uses Core ML in Vapor, because interfacing with Core ML is easiest with Swift, and because Swift is such a nice language to code with. Finally, we’ll also look at how to consume the API in Python.</p>
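<p>As a preview of the Python side, here is a minimal client sketch using only the standard library. The route name (<code>/classify</code>), request format (raw image bytes), and JSON response shape are assumptions for illustration; match them to whatever routes your Vapor app actually registers.</p>

```python
import json
import urllib.request

def classify(image_path, url="http://localhost:8080/classify"):
    """POST raw image bytes to the API and return the parsed JSON reply.

    The endpoint path, port, and response fields are hypothetical here;
    adjust them to the routes defined in your Vapor app.
    """
    with open(image_path, "rb") as f:
        body = f.read()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/octet-stream"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Example (requires the server to be running):
# classify("cat.jpg")
```

<p>Because the server speaks plain HTTP, the same call works from any language or platform, which is exactly the point of wrapping the Core ML model in an API.</p>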
<p>The final project can be viewed <a href="https://github.com/rickwierenga/CoreML-API">on my GitHub</a>. Don’t forget to leave a ⭐️ ;)</p>Core ML is Apple’s framework for machine learning. With Core ML, everyone can use machine learning in their apps—as long as that app runs on an Apple platform, and Apple platforms only. Core ML cannot be used with Android, Windows, or on websites. This is very unfortunate because Core ML is such a great piece of technology.