<h1>CS-525 Foundations and tools for tree-structured data</h1>
<p>Maxime Kjaer, 2018-09-18, <a href="https://kjaer.io/tree-structured-data">https://kjaer.io/tree-structured-data</a></p>
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#xpath" id="markdown-toc-xpath">XPath</a> <ul>
<li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
</ul>
</li>
<li><a href="#xml-schemas" id="markdown-toc-xml-schemas">XML Schemas</a> <ul>
<li><a href="#dtd" id="markdown-toc-dtd">DTD</a></li>
<li><a href="#xml-schema" id="markdown-toc-xml-schema">XML Schema</a> <ul>
<li><a href="#criticism" id="markdown-toc-criticism">Criticism</a></li>
</ul>
</li>
<li><a href="#relax-ng" id="markdown-toc-relax-ng">Relax NG</a></li>
<li><a href="#schematron" id="markdown-toc-schematron">Schematron</a></li>
</ul>
</li>
<li><a href="#xml-information-set" id="markdown-toc-xml-information-set">XML Information Set</a></li>
<li><a href="#xslt" id="markdown-toc-xslt">XSLT</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#default-templates" id="markdown-toc-default-templates">Default templates</a></li>
<li><a href="#example" id="markdown-toc-example">Example</a></li>
</ul>
</li>
<li><a href="#xquery" id="markdown-toc-xquery">XQuery</a> <ul>
<li><a href="#syntax" id="markdown-toc-syntax">Syntax</a></li>
<li><a href="#creating-xml-content" id="markdown-toc-creating-xml-content">Creating XML content</a></li>
<li><a href="#sequences" id="markdown-toc-sequences">Sequences</a></li>
<li><a href="#flwor" id="markdown-toc-flwor">FLWOR</a></li>
<li><a href="#conditional-expressions" id="markdown-toc-conditional-expressions">Conditional expressions</a></li>
<li><a href="#quantified-expressions" id="markdown-toc-quantified-expressions">Quantified expressions</a></li>
<li><a href="#functions" id="markdown-toc-functions">Functions</a></li>
<li><a href="#modules" id="markdown-toc-modules">Modules</a></li>
<li><a href="#updating-xml-content" id="markdown-toc-updating-xml-content">Updating XML Content</a></li>
<li><a href="#advanced-features" id="markdown-toc-advanced-features">Advanced features</a></li>
<li><a href="#coding-guidelines" id="markdown-toc-coding-guidelines">Coding guidelines</a></li>
</ul>
</li>
<li><a href="#xml-based-webapps" id="markdown-toc-xml-based-webapps">XML Based Webapps</a> <ul>
<li><a href="#xml-databases" id="markdown-toc-xml-databases">XML Databases</a></li>
<li><a href="#rest" id="markdown-toc-rest">REST</a></li>
<li><a href="#oppidum" id="markdown-toc-oppidum">Oppidum</a></li>
</ul>
</li>
<li><a href="#foundations-of-xml-types" id="markdown-toc-foundations-of-xml-types">Foundations of XML types</a> <ul>
<li><a href="#tree-grammars" id="markdown-toc-tree-grammars">Tree Grammars</a> <ul>
<li><a href="#dtd--local-tree-grammars" id="markdown-toc-dtd--local-tree-grammars">DTD & Local tree grammars</a></li>
<li><a href="#xml-schema--single-type-tree-grammars" id="markdown-toc-xml-schema--single-type-tree-grammars">XML Schema & Single-Type tree grammars</a></li>
<li><a href="#relax-ng--regular-tree-grammars" id="markdown-toc-relax-ng--regular-tree-grammars">Relax NG & Regular tree grammars</a></li>
</ul>
</li>
<li><a href="#tree-automata" id="markdown-toc-tree-automata">Tree automata</a> <ul>
<li><a href="#definition" id="markdown-toc-definition">Definition</a></li>
<li><a href="#example-1" id="markdown-toc-example-1">Example</a></li>
<li><a href="#properties" id="markdown-toc-properties">Properties</a></li>
</ul>
</li>
<li><a href="#validation" id="markdown-toc-validation">Validation</a> <ul>
<li><a href="#inclusion" id="markdown-toc-inclusion">Inclusion</a></li>
<li><a href="#closure" id="markdown-toc-closure">Closure</a></li>
<li><a href="#emptiness" id="markdown-toc-emptiness">Emptiness</a></li>
<li><a href="#type-inclusion" id="markdown-toc-type-inclusion">Type inclusion</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#dealing-with-non-textual-content" id="markdown-toc-dealing-with-non-textual-content">Dealing with non-textual content</a> <ul>
<li><a href="#mathml" id="markdown-toc-mathml">MathML</a></li>
<li><a href="#tables" id="markdown-toc-tables">Tables</a></li>
</ul>
</li>
<li><a href="#xml-processing" id="markdown-toc-xml-processing">XML Processing</a> <ul>
<li><a href="#dom" id="markdown-toc-dom">DOM</a></li>
<li><a href="#sax" id="markdown-toc-sax">SAX</a></li>
<li><a href="#dom-and-web-applications" id="markdown-toc-dom-and-web-applications">DOM and web applications</a></li>
<li><a href="#xforms-an-alternative-to-html-forms" id="markdown-toc-xforms-an-alternative-to-html-forms">XForms: an alternative to HTML forms</a></li>
</ul>
</li>
<li><a href="#web-services" id="markdown-toc-web-services">Web Services</a> <ul>
<li><a href="#web-service-description-language-wsdl" id="markdown-toc-web-service-description-language-wsdl">Web Service Description Language (WSDL)</a></li>
<li><a href="#simple-object-access-protocol-soap" id="markdown-toc-simple-object-access-protocol-soap">Simple Object Access Protocol (SOAP)</a></li>
<li><a href="#universal-description-discovery-and-integration-uddi" id="markdown-toc-universal-description-discovery-and-integration-uddi">Universal Description, Discovery and Integration (UDDI)</a></li>
</ul>
</li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<!-- More -->
<h2 id="xpath">XPath</h2>
<p>XPath is the W3C standard language for traversal and navigation in XML trees.</p>
<p>For navigation, we use the <strong>location path</strong> to identify nodes or content. A location path is a sequence of location steps separated by a <code class="highlighter-rouge">/</code>:</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>child::chapter/descendant::section/child::para</pre></td></tr></tbody></table></code></pre></figure>
<p>Every location step has an axis, the separator <code class="highlighter-rouge">::</code>, and a node test. Starting from a context node, a location step returns a node-set; every selected node in that node-set is then used as the context node for the next step.</p>
<p>An XPath expression that starts with <code class="highlighter-rouge">/</code> is evaluated from the root; this is known as an <strong>absolute path</strong>.</p>
<p>XPath defines 13 axes allowing navigation, including <code class="highlighter-rouge">self</code>, <code class="highlighter-rouge">parent</code>, <code class="highlighter-rouge">child</code>, <code class="highlighter-rouge">following-sibling</code>, <code class="highlighter-rouge">ancestor-or-self</code>, etc. There is a special <code class="highlighter-rouge">attribute</code> axis to select attributes of the context node, which are not really in the child hierarchy. Similarly, <code class="highlighter-rouge">namespace</code> selects namespace nodes.</p>
<p>A nodetest filters nodes:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Test</th>
<th style="text-align: left">Semantics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">node()</code></td>
<td style="text-align: left">lets any node pass</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">text()</code></td>
<td style="text-align: left">selects only text nodes</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">comment()</code></td>
<td style="text-align: left">selects only comment nodes</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">name</code></td>
<td style="text-align: left">selects only <strong>elements/attributes</strong> with that name</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">*</code></td>
<td style="text-align: left">selects every <strong>element/attribute</strong></td>
</tr>
</tbody>
</table>
<p>At each navigation step, nodes can be filtered using qualifiers.</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>axis::nodetest[qualifier][qualifier]</pre></td></tr></tbody></table></code></pre></figure>
<p>For instance:</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>following-sibling::para[position()=last()]</pre></td></tr></tbody></table></code></pre></figure>
<p>A qualifier filters a node-set depending on the axis. Each node in a node-set is kept only if the evaluation of the qualifier returns true.</p>
<p>Qualifiers may include comparisons (<code class="highlighter-rouge">=</code>, <code class="highlighter-rouge"><</code>, <code class="highlighter-rouge"><=</code>, …). The comparison is done on the <code class="highlighter-rouge">string-value()</code>, which is the concatenation of all descendant text nodes in <em>document order</em>. But there’s a catch here! Comparison between node-sets is under existential semantics: there only needs to be one pair of nodes for which the comparison is true. Thus, when negating, we can get universal quantification.</p>
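<p>A sketch of this existential semantics in Python, where the two node-sets are represented by hypothetical string-values:</p>

```python
# XPath compares node-sets by string-value with existential semantics:
# "A = B" is true as soon as ONE pair of values matches.
left = {"10", "20"}   # string-values of node-set A
right = {"20", "30"}  # string-values of node-set B

equal = any(a == b for a in left for b in right)
not_equal = any(a != b for a in left for b in right)

# Both are true: "!=" is existential too, not the logical negation of "=".
assert equal and not_equal

# Universal quantification ("no value of A equals any value of B")
# is obtained by negating the existential comparison: not(A = B).
assert not any(a == b for a in {"10"} for b in {"30", "40"})
```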
<p>XPaths can be a union of location paths separated by <code class="highlighter-rouge">|</code>. Qualifiers can include boolean expressions (<code class="highlighter-rouge">or</code>, <code class="highlighter-rouge">not</code>, <code class="highlighter-rouge">and</code>, …).</p>
<p>There are a few basic functions: <code class="highlighter-rouge">last()</code>, <code class="highlighter-rouge">position()</code>, <code class="highlighter-rouge">count(node-set)</code>, <code class="highlighter-rouge">concat(string, string, ...)</code>, <code class="highlighter-rouge">contains(str1, str2)</code>, etc. These can be used within a qualifier.</p>
<p>XPath also supports abbreviated syntax. For instance, <code class="highlighter-rouge">child::</code> is the default axis and can be omitted, <code class="highlighter-rouge">@</code> is a shorthand for <code class="highlighter-rouge">attribute::</code>, <code class="highlighter-rouge">[4]</code> is a shorthand for <code class="highlighter-rouge">[position()=4]</code> (note that positions start at 1).</p>
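<p>Python’s standard-library <code class="highlighter-rouge">xml.etree.ElementTree</code> implements a subset of this abbreviated syntax, which makes for a quick way to experiment (the document below is made up for the example):</p>

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<chapter><section><para id='p1'>one</para></section>"
    "<para id='p2'>two</para></chapter>")

# 'para' means child::para (the default axis), so only direct children match
assert doc.find("para").get("id") == "p2"

# './/' abbreviates descendant-or-self, returning nodes in document order
assert [p.get("id") for p in doc.findall(".//para")] == ["p1", "p2"]

# '@' abbreviates attribute:: and can be used in predicates
assert doc.find(".//para[@id='p1']").text == "one"
```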
<p>XPath is used in XSLT, XQuery, XPointer, XLink, XML Schema, XForms, …</p>
<h3 id="evaluation">Evaluation</h3>
<p>To evaluate an XPath expression, we have in our state:</p>
<ul>
<li>The context node</li>
<li>Context size: number of nodes in the node-set</li>
<li>Context position: index of the context node in the node-set</li>
<li>A set of variable bindings</li>
</ul>
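<p>The context position and size are what <code class="highlighter-rouge">position()</code> and <code class="highlighter-rouge">last()</code> evaluate to. ElementTree supports the numeric shorthand and <code class="highlighter-rouge">last()</code> in predicates:</p>

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring("<doc><para>a</para><para>b</para><para>c</para></doc>")

# [1] is shorthand for [position()=1]; positions start at 1
assert doc.find("para[1]").text == "a"
# last() refers to the context size, i.e. the final para here
assert doc.find("para[last()]").text == "c"
```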
<h2 id="xml-schemas">XML Schemas</h2>
<p>There are three classes of languages that constrain XML content:</p>
<ul>
<li>Constraints expressed by <strong>a description</strong> of each element, and potentially related attributes (DTD, XML Schema)</li>
<li>Constraints expressed by <strong>patterns</strong> defining the admissible elements, attributes and text nodes using regexes (Relax NG)</li>
<li>Constraints expressed by <strong>rules</strong> (Schematron)</li>
</ul>
<h3 id="dtd">DTD</h3>
<p>Document Type Definitions (DTDs) are XML’s native schema system. They define document classes, using a declarative approach to describe the logical structure of a document.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="cp"><!ELEMENT recipe (title, comment*, item+, picture?, nbPers)></span>
<span class="cp"><!ATTLIST recipe difficulty (easy|medium|difficult) #IMPLIED></span>
<span class="cp"><!ELEMENT title (#PCDATA)></span>
<span class="cp"><!ELEMENT comment (#PCDATA)></span>
<span class="cp"><!ELEMENT item (header?,((ingredient+, step+) | (ingredient+, step)+))></span>
<span class="cp"><!ELEMENT header (#PCDATA)></span>
<span class="cp"><!ELEMENT ingredient (#PCDATA)></span>
<span class="cp"><!ELEMENT step (#PCDATA)></span>
<span class="cp"><!ELEMENT picture EMPTY></span>
<span class="cp"><!ATTLIST picture source CDATA #REQUIRED format (jpeg | png) #IMPLIED ></span>
<span class="cp"><!ELEMENT nbPers (#PCDATA)></span></pre></td></tr></tbody></table></code></pre></figure>
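<p>A DTD content model is essentially a regular expression over child-element names. As a sketch, the model for <code class="highlighter-rouge">recipe</code> can be written as a Python regex over a space-separated list of children:</p>

```python
import re

# (title, comment*, item+, picture?, nbPers) as a regex over child names
model = re.compile(r"title (comment )*(item )+(picture )?nbPers")

assert model.fullmatch("title item nbPers")
assert model.fullmatch("title comment comment item item picture nbPers")
assert not model.fullmatch("title nbPers")  # item+ requires at least one item
```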
<h3 id="xml-schema">XML Schema</h3>
<p>XML Schema is a <a href="http://www.w3.org/TR/xmlschema-0/">W3C standard</a> that goes beyond the native DTDs. XML Schema descriptions are valid XML documents themselves.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsd:schema</span> <span class="na">xmlns:xsd=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span><span class="nt">></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"RecipesCollection"</span><span class="nt">></span>
<span class="nt"><xsd:complexType></span>
<span class="nt"><xsd:sequence</span> <span class="na">minOccurs=</span><span class="s">"0"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"Recipe"</span> <span class="na">type=</span><span class="s">"RecipeType"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"></xsd:element></span>
...
<span class="nt"></xsd:schema></span></pre></td></tr></tbody></table></code></pre></figure>
<p>To declare an element, we do as follows; by default, the <code class="highlighter-rouge">author</code> element as defined below may only contain string values:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"author"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>But we can define other types of elements, that aren’t just strings. Built-in types include <code class="highlighter-rouge">string</code>,
<code class="highlighter-rouge">boolean</code>, <code class="highlighter-rouge">decimal</code>, <code class="highlighter-rouge">float</code>, <code class="highlighter-rouge">duration</code>, <code class="highlighter-rouge">time</code>, <code class="highlighter-rouge">date</code>, <code class="highlighter-rouge">anyURI</code>, … The values are still string-encoded and must be extracted by the XML application, but declared types help verify consistency.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"year"</span> <span class="na">type=</span><span class="s">"xsd:date"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can bound the number of occurrences of an element. Below, the <code class="highlighter-rouge">character</code> element may be repeated 0 to ∞ times (this is equivalent to something like <code class="highlighter-rouge">character*</code> in a regex). Absence of <code class="highlighter-rouge">minOccurs</code> and <code class="highlighter-rouge">maxOccurs</code> implies exactly once (like in a regex).</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"character"</span> <span class="na">minOccurs=</span><span class="s">"0"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can define more complex types using <strong>type constructors</strong>.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="code"><pre><span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"Characters"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"character"</span> <span class="na">minOccurs=</span><span class="s">"1"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"Prolog"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"series"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"author"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"characters"</span> <span class="na">type=</span><span class="s">"Characters"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"prolog"</span> <span class="na">type=</span><span class="s">"Prolog"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This defines a <code class="highlighter-rouge">Prolog</code> type containing a sequence of <code class="highlighter-rouge">series</code>, <code class="highlighter-rouge">author</code>, and a <code class="highlighter-rouge">characters</code> element of type <code class="highlighter-rouge">Characters</code>, which amounts to <code class="highlighter-rouge">character+</code>.</p>
<p>Using the <code class="highlighter-rouge">mixed="true"</code> attribute on an <code class="highlighter-rouge">xsd:complexType</code> allows for mixed content: attributes, elements, and text can be mixed (like we know in HTML, where you can do <code class="highlighter-rouge"><p>hello <em>world</em>!</p></code>).</p>
<p>There are more type constructor primitives that cover much of what regexes do: <code class="highlighter-rouge">xsd:sequence</code>, which we’ve seen above, but also <code class="highlighter-rouge">xsd:choice</code> (for enumerated alternatives) and <code class="highlighter-rouge">xsd:all</code> (for unordered elements).</p>
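<p>These constructors also map onto familiar regex operations over child-element names. A sketch, with elements abbreviated to single letters:</p>

```python
import re

sequence  = re.compile(r"ab")     # <xsd:sequence>: a then b
choice    = re.compile(r"a|b")    # <xsd:choice>: a or b
unordered = re.compile(r"ab|ba")  # <xsd:all>: every ordering, spelled out

assert sequence.fullmatch("ab")
assert choice.fullmatch("b")
assert unordered.fullmatch("ba") and unordered.fullmatch("ab")
# needing the explicit expansion is why unordered content scales badly here
```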
<p>Attributes can also be declared within their owner element:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"strip"</span><span class="nt">></span>
<span class="nt"><xsd:attribute</span> <span class="na">name=</span><span class="s">"copyright"</span><span class="nt">/></span>
<span class="nt"><xsd:attribute</span> <span class="na">name=</span><span class="s">"year"</span> <span class="na">type=</span><span class="s">"xsd:gYear"</span><span class="nt">/></span>
<span class="nt"></xsd:element></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Because writing complex types from scratch can be tedious, they can be derived by extension or restriction from existing base types:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
</pre></td><td class="code"><pre><span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"BookType"</span><span class="nt">></span>
<span class="nt"><xsd:complexContent></span>
<span class="nt"><xsd:extension</span> <span class="na">base=</span><span class="s">"Publication"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name =</span><span class="s">"ISBN"</span> <span class="na">type=</span><span class="s">"xsd:string"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name =</span><span class="s">"Publisher"</span> <span class="na">type=</span><span class="s">"xsd:string"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:extension></span>
<span class="nt"></xsd:complexContent></span>
<span class="nt"></xsd:complexType></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Additionally, it is possible to define our own types:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="nt"><xsd:simpleType</span> <span class="na">name=</span><span class="s">"Car"</span><span class="nt">></span>
<span class="nt"><xsd:restriction</span> <span class="na">base=</span><span class="s">"xsd:string"</span><span class="nt">></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"Audi"</span><span class="nt">/></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"BMW"</span><span class="nt">/></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"VW"</span><span class="nt">/></span>
<span class="nt"></xsd:restriction></span>
<span class="nt"></xsd:simpleType></span>
<span class="nt"><xsd:simpleType</span> <span class="na">name=</span><span class="s">"WeakPasswordType"</span><span class="nt">></span>
<span class="nt"><xsd:restriction</span> <span class="na">base=</span><span class="s">"xsd:string"</span><span class="nt">></span>
<span class="nt"><xsd:pattern</span> <span class="na">value=</span><span class="s">"[a-zA-Z0-9]{8}"</span><span class="nt">/></span>
<span class="nt"></xsd:restriction></span>
<span class="nt"></xsd:simpleType></span></pre></td></tr></tbody></table></code></pre></figure>
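<p>Note that XML Schema patterns are implicitly anchored: the whole value must match. A Python sketch of an eight-character alphanumeric facet like the intended <code class="highlighter-rouge">WeakPasswordType</code>, using <code class="highlighter-rouge">fullmatch</code> to emulate that anchoring:</p>

```python
import re

# the XSD facet [a-zA-Z0-9]{8} matches the entire value, like fullmatch
weak_password = re.compile(r"[a-zA-Z0-9]{8}")

assert weak_password.fullmatch("abcd1234")
assert not weak_password.fullmatch("abc")        # too short
assert not weak_password.fullmatch("abcd12345")  # too long
assert not weak_password.fullmatch("abcd 123")   # space not allowed
```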
<h4 id="criticism">Criticism</h4>
<p>There have been some criticisms levelled at XML Schema:</p>
<ul>
<li>The specification is very difficult to understand</li>
<li>It requires a high level of expertise to avoid surprises, as there are many complex and unintuitive behaviors</li>
<li>The choice between element and attribute is largely a matter of the taste of the designer, but XML Schema provides separate functionality for them, distinguishing them strongly</li>
<li>There is only weak support for unordered content. SGML had the <code class="highlighter-rouge">&</code> operator: <code class="highlighter-rouge">A & B</code> means <code class="highlighter-rouge">A</code> followed by <code class="highlighter-rouge">B</code> or vice versa (order doesn’t matter), and it composed with repetition: <code class="highlighter-rouge">A & B*</code> requires the occurrences of <code class="highlighter-rouge">B</code> to appear grouped together. XML Schema is too limited to enforce such content models.</li>
<li>
<p>The datatypes (strings, dates, etc) are tied to <a href="https://www.w3.org/TR/xmlschema-2/">a single collection of datatypes</a>, which can be a little too limited for certain domain-specific datatypes.</p>
<p>But XML Schema 1.1 addressed this with two new features: co-occurrence constraints and assertions on simple types.</p>
<p>Co-occurrence constraints make the presence of an attribute or element, or the values allowed for it, depend on the value or presence of other attributes or elements.</p>
<p>Assertions on simple types introduce a new facet, called an assertion, that expresses constraints using XPath expressions, as in the example below, which requires <code class="highlighter-rouge">min</code> to be at most <code class="highlighter-rouge">max</code>.</p>
</li>
</ul>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xs:schema</span> <span class="na">xmlns:xs=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span><span class="nt">></span>
<span class="nt"><xs:element</span> <span class="na">name=</span><span class="s">"NbOfAttempts"</span><span class="nt">></span>
<span class="nt"><xs:complexType></span>
<span class="nt"><xs:attribute</span> <span class="na">name=</span><span class="s">"min"</span> <span class="na">type=</span><span class="s">"xs:int"</span><span class="nt">/></span>
<span class="nt"><xs:attribute</span> <span class="na">name=</span><span class="s">"max"</span> <span class="na">type=</span><span class="s">"xs:int"</span><span class="nt">/></span>
<span class="nt"><xs:assert</span> <span class="na">test=</span><span class="s">"@min le @max"</span><span class="nt">/></span>
<span class="nt"></xs:complexType></span>
<span class="nt"></xs:element></span>
<span class="nt"></xs:schema></span>
</pre></td></tr></tbody></table></code></pre></figure>
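<p>What a validator checks for an assertion like the one above can be sketched by hand with ElementTree: evaluate the test with each element as the context node:</p>

```python
import xml.etree.ElementTree as ET

def check(xml):
    # the assert test "@min le @max", evaluated manually
    e = ET.fromstring(xml)
    return int(e.get("min")) <= int(e.get("max"))

assert check('<NbOfAttempts min="1" max="3"/>')
assert not check('<NbOfAttempts min="5" max="3"/>')  # would fail validation
```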
<p>Therefore, some members of the original W3C XML Schema committee have gone on to create alternatives, some of which we will see below.</p>
<h3 id="relax-ng">Relax NG</h3>
<p>Pronounced “relaxing”. Relax NG’s goals are:</p>
<ul>
<li>Be easier to learn and use</li>
<li>Provide an XML syntax that is more readable and compact</li>
<li>Provide a theoretically sound language (based on tree automata, which we’ll talk about later)</li>
<li>Have the schema follow the structure of the document</li>
</ul>
<p>The reference book for Relax NG is <a href="http://books.xmlschemata.org/relaxng/">Relax NG by Eric van der Vlist</a>.</p>
<p>As the example below shows, Relax NG is much more legible:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
</pre></td><td class="code"><pre><span class="nt"><element</span> <span class="na">name=</span><span class="s">"AddressBook"</span><span class="nt">></span>
<span class="nt"><zeroOrMore></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Card"</span><span class="nt">></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Name"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Email"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><optional></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Note"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"></optional></span>
<span class="nt"></element></span>
<span class="nt"></zeroOrMore></span>
<span class="nt"></element></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Another example shows a little more advanced functionality; here, a card can either contain a single <code class="highlighter-rouge">Name</code>, or (exclusive or) both a <code class="highlighter-rouge">GivenName</code> and <code class="highlighter-rouge">FamilyName</code>.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="nt"><element</span> <span class="na">name=</span><span class="s">"Card"</span><span class="nt">></span>
<span class="nt"><choice></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Name"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><group></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"GivenName"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"FamilyName"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"></group></span>
<span class="nt"></choice></span>
<span class="nt"></element></span></pre></td></tr></tbody></table></code></pre></figure>
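<p>Since Relax NG is grounded in regular tree grammars, this content model corresponds directly to a regex over child-element names: <code class="highlighter-rouge"><choice></code> is alternation, <code class="highlighter-rouge"><group></code> is concatenation. A sketch:</p>

```python
import re

# <choice> -> alternation, <group> -> concatenation (over child names)
card = re.compile(r"Name|GivenName FamilyName")

assert card.fullmatch("Name")
assert card.fullmatch("GivenName FamilyName")
assert not card.fullmatch("Name FamilyName")  # the choice is exclusive
```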
<p>Some other tags include:</p>
<ul>
<li><code class="highlighter-rouge"><choice></code> allows only one of the enumerated children to occur</li>
<li><code class="highlighter-rouge"><interleave></code> allows child elements to occur in any order (like <code class="highlighter-rouge">xsd:all</code> in XML Schema)</li>
<li><code class="highlighter-rouge"><attribute></code> inside an <code class="highlighter-rouge"><element></code> specifies the schema for attributes. By itself, it’s considered required, but it can be wrapped in an <code class="highlighter-rouge"><optional></code> too.</li>
<li><code class="highlighter-rouge"><group></code> allows us to logically group elements, as the name implies. This is especially useful inside <code class="highlighter-rouge"><choice></code> elements, as in the example above.</li>
</ul>
<p>The Relax NG book has a more detailed overview of these in <a href="http://books.xmlschemata.org/relaxng/relax-CHP-3-SECT-2.html">Chapter 3.2</a>.</p>
<p>Relax NG allows referencing externally defined datatypes, such as <a href="https://www.w3.org/2001/XMLSchema-datatypes">those defined in XML Schema</a>. To include such a reference, we can specify a <code class="highlighter-rouge">datatypeLibrary</code> attribute on the root <code class="highlighter-rouge"><grammar></code> element:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><grammar</span> <span class="na">xmlns=</span><span class="s">"http://relaxng.org/ns/structure/1.0"</span>
<span class="na">xmlns:a=</span><span class="s">"http://relaxng.org/ns/compatibility/annotations/1.0"</span>
<span class="na">datatypeLibrary=</span><span class="s">"http://www.w3.org/2001/XMLSchema-datatypes"</span><span class="nt">></span>
<span class="nt"><start></span>
...
<span class="nt"></start></span>
<span class="nt"></grammar></span></pre></td></tr></tbody></table></code></pre></figure>
<p>In addition to datatypes, we can also express admissible XML <em>content</em> using regexes, but (and this is important!) <strong>we cannot express cardinality constraints or uniqueness constraints</strong>.</p>
<p>If we need to express those, we can make use of Schematron.</p>
<h3 id="schematron">Schematron</h3>
<p><a href="http://schematron.com">Schematron</a> is an assertion language making use of XPath for node selection and for encoding predicates. It is often used <em>in conjunction</em> with Relax NG to express more complicated constraints, that aren’t easily expressed (or can’t be expressed at all) in Relax NG. The common pattern is to build the structure of the schema in Relax NG, and the business logic in Schematron.</p>
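<p>The rule/assert model is simple to sketch by hand: the rule’s <code class="highlighter-rouge">context</code> XPath selects nodes, and the <code class="highlighter-rouge">test</code> is evaluated with each of them as the context node. A Python sketch over a made-up document:</p>

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    "<books>"
    "<book><author>Knuth</author><title>TAOCP</title></book>"
    "<book><title>Anonymous</title></book>"
    "</books>")

# rule context="book", assert test="count(author) != 0"
failures = [b for b in doc.findall(".//book")
            if len(b.findall("author")) == 0]

assert len(failures) == 1  # the second book violates the assertion
```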
<p>They can be combined in the same file by declaring different namespaces. For instance, the example below allows us to write a Relax NG schema as usual, and some Schematron rules under the <code class="highlighter-rouge">sch</code> namespace.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><grammar</span> <span class="na">xmlns=</span><span class="s">"http://relaxng.org/ns/structure/1.0"</span>
<span class="na">xmlns:a=</span><span class="s">"http://relaxng.org/ns/compatibility/annotations/1.0"</span>
<span class="na">xmlns:sch=</span><span class="s">"http://purl.oclc.org/dsdl/schematron"</span>
<span class="na">datatypeLibrary=</span><span class="s">"http://www.w3.org/2001/XMLSchema-datatypes"</span><span class="nt">></span>
...
<span class="nt"></grammar></span></pre></td></tr></tbody></table></code></pre></figure>
<p>As we can see in the example below, a Schematron schema is built from a series of assertions:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre><span class="nt"><schema</span> <span class="na">xmlns=</span><span class="s">"http://purl.oclc.org/dsdl/schematron"</span> <span class="nt">></span>
<span class="nt"><title></span>A Schema for Books<span class="nt"></title></span>
<span class="nt"><ns</span> <span class="na">prefix=</span><span class="s">"bk"</span> <span class="na">uri=</span><span class="s">"http://www.example.com/books"</span> <span class="nt">/></span>
<span class="nt"><pattern</span> <span class="na">id=</span><span class="s">"authorTests"</span><span class="nt">></span>
<span class="nt"><rule</span> <span class="na">context=</span><span class="s">"bk:book"</span><span class="nt">></span>
<span class="nt"><assert</span> <span class="na">test=</span><span class="s">"count(bk:author)!= 0"</span><span class="nt">></span>
A book must have at least one author
<span class="nt"></assert></span>
<span class="nt"></rule></span>
<span class="nt"></pattern></span>
<span class="nt"><pattern</span> <span class="na">id=</span><span class="s">"onLoanTests"</span><span class="nt">></span>
<span class="nt"><rule</span> <span class="na">context=</span><span class="s">"bk:book"</span><span class="nt">></span>
<span class="nt"><report</span> <span class="na">test=</span><span class="s">"@on-loan and not(@return-date)"</span><span class="nt">></span>
Every book that is on loan must have a return date
<span class="nt"></report></span>
<span class="nt"></rule></span>
<span class="nt"></pattern></span>
<span class="nt"></schema></span></pre></td></tr></tbody></table></code></pre></figure>
<p>A short description of the different Schematron elements follows:</p>
<ul>
<li><code class="highlighter-rouge"><ns></code>: specifies to which namespace a prefix is bound. In the above example, the <code class="highlighter-rouge">bk</code> prefix, used as <code class="highlighter-rouge">bk:book</code>, is bound to <code class="highlighter-rouge">http://www.example.com/books</code>. This prefix is used by XPath in the elements below.</li>
<li><code class="highlighter-rouge"><pattern></code>: a pattern contains a list of rules, and is used to group similar assertions. This isn’t just for better code organization, but also allows us to execute groups at different stages in the validation</li>
<li><code class="highlighter-rouge"><rule></code>: a rule contains <code class="highlighter-rouge"><assert></code> and <code class="highlighter-rouge"><report></code> elements. It has a <code class="highlighter-rouge">context</code> attribute, which is an XPath specifying the element on which we’re operating; all nodes matching the XPath expression are tested for all the assertions and reports of a rule</li>
<li><code class="highlighter-rouge"><assert></code>: provides a mechanism to check if an assertion is true. If it isn’t, a validation error occurs</li>
<li><code class="highlighter-rouge"><report></code>: same as an assertion, but the validation doesn’t fail; instead, a warning is issued.</li>
</ul>
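<p>To make these semantics concrete, here is a small Python sketch. It is <em>not</em> a Schematron processor; it merely evaluates the two predicates from the schema above against a hypothetical input document, using the standard <code class="highlighter-rouge">xml.etree.ElementTree</code> module:</p>

```python
import xml.etree.ElementTree as ET

ns = {"bk": "http://www.example.com/books"}
doc = ET.fromstring(
    '<bk:library xmlns:bk="http://www.example.com/books">'
    '<bk:book on-loan="yes"><bk:author>Doe</bk:author></bk:book>'
    '<bk:book/>'
    '</bk:library>')

messages = []
for book in doc.findall("bk:book", ns):  # rule context="bk:book"
    # <assert test="count(bk:author) != 0">: an error when the test is false
    if len(book.findall("bk:author", ns)) == 0:
        messages.append("error: a book must have at least one author")
    # <report test="@on-loan and not(@return-date)">: a warning when true
    if "on-loan" in book.attrib and "return-date" not in book.attrib:
        messages.append("warning: every book on loan must have a return date")

print(messages)
```

<p>The first book triggers the report (it is on loan without a return date), and the second triggers the assertion (it has no author).</p>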
<h2 id="xml-information-set">XML Information Set</h2>
<p>The purpose of <a href="https://msdn.microsoft.com/en-us/library/aa468561.aspx">XML Information Set</a>, or Infoset, is to “provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document<sup id="fnref:infoset-spec"><a href="#fn:infoset-spec" class="footnote">1</a></sup>”.</p>
<p>It specifies a standardized, abstract model to represent the properties of XML trees. The goal is to provide a standardized viewpoint for the implementation and description of various XML technologies.</p>
<p>It functions like an AST for XML documents. It’s abstract in the sense that it abstracts away from the concrete encoding of data, and just retains the meaning. For instance, it doesn’t distinguish between the two forms of an empty element, or between single and double quotes around attribute values; the following are considered equivalent (pairwise):</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="nt"><element></element></span>
<span class="nt"><element/></span>
<span class="nt"><element</span> <span class="na">attr=</span><span class="s">"example"</span><span class="nt">/></span>
<span class="nt"><element</span> <span class="na">attr=</span><span class="s">'example'</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>The Information Set is described as a tree of information items, which are simply blocks of information about a node in the tree; every information item is an abstract representation of a component in an XML document.</p>
<p>As such, at the root we have a document information item, which, most importantly, contains a list of children, which is a list of information items, in document order. Information items for elements contain a local name, the name of the namespace, a list of attribute information items, which contain the key and value of the attribute, etc.</p>
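<p>As a rough illustration (using Python’s standard <code class="highlighter-rouge">xml.etree.ElementTree</code> module, not an actual Infoset implementation), the equivalent serializations above all yield element information items with the same local name, attributes and children:</p>

```python
import xml.etree.ElementTree as ET

# Two serializations of the same element information item
a = ET.fromstring('<element attr="example"></element>')
b = ET.fromstring("<element attr='example'/>")

# Same local name, same attributes, same (empty) list of children
print(a.tag == b.tag and a.attrib == b.attrib and list(a) == list(b))
```
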
<h2 id="xslt">XSLT</h2>
<h3 id="motivation">Motivation</h3>
<p>XSLT is part of a more general language, XSL. The hierarchy is as follows:</p>
<ul>
<li><strong>XSL</strong>: eXtensible Stylesheet Language
<ul>
<li><strong>XSLT</strong>: XSL Transformation</li>
<li><strong>XSL-FO</strong>: XSL Formatting Objects</li>
</ul>
</li>
</ul>
<p>An XSLT Stylesheet allows us to transform XML input into other formats. An XSLT Processor takes an XML input and an XSLT stylesheet, and produces a result in XML, XHTML, LaTeX, or another format.</p>
<p>XSLT is a <strong>declarative</strong> and <strong>functional</strong> language, which uses XML and XPath. It’s a <a href="https://www.w3.org/TR/xslt/all/">W3C recommendation</a>, often used for generating HTML views of XML content.</p>
<p>The XSLT Stylesheet consists of a set of templates. Each of them matches specific elements in the XML input, and participates in the generation of data in the resulting output.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsl:stylesheet</span> <span class="na">xmlns:xsl=</span><span class="s">"http://www.w3.org/1999/XSL/Transform"</span> <span class="na">xmlns:xd=</span><span class="s">"http://oxygenxml.com/ns/doc/xsl"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"a"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"b"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"c"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"d"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"></xsl:stylesheet></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s take a look at an individual XSLT template:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"e"</span><span class="nt">></span>
result: <span class="nt"><xsl:apply-templates/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<ul>
<li><code class="highlighter-rouge">e</code> is an XPath expression that selects the nodes the XSLT processor will apply the template to</li>
<li><code class="highlighter-rouge">result</code> specifies the content to be produced in the output for each node selected by <code class="highlighter-rouge">e</code></li>
<li><code class="highlighter-rouge">xsl:apply-templates</code> indicates that templates are to be applied on the selected nodes, in document order; to select nodes, it may have a <code class="highlighter-rouge">select</code> attribute, which is an XPath expression defaulting to <code class="highlighter-rouge">child::node()</code>.</li>
</ul>
<p>The XSLT execution is roughly as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre>def process(node):
    # 1. find the most specific template matching the node
    # 2. instantiate the template: create the result fragment
    # 3. recursively process the nodes selected by its instructions
    for instruction in template:
        for new_node in instruction.selected_nodes:
            process(new_node)

process(xml.root)</pre></td></tr></tbody></table></code></pre></figure>
<p>Recursion stops when no more source nodes are selected.</p>
<h3 id="default-templates">Default templates</h3>
<p>XSLT Stylesheets contain <strong>default templates</strong>:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"/ | *"</span><span class="nt">></span>
<span class="nt"><xsl:apply-templates/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This recursively drives the matching process, starting from the root node. If templates are associated with the root node, then this default template is overridden; if the overriding version doesn’t contain any <code class="highlighter-rouge"><xsl:apply-templates/></code> elements, then the matching process stops.</p>
<p>Another default template is:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"text()|@*"</span><span class="nt">></span>
<span class="nt"><xsl:value-of</span> <span class="na">select=</span><span class="s">"self::node()"</span><span class="nt">/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This copies text and attribute nodes in the output.</p>
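<p>The practical effect of these first two defaults is that, absent any user-defined templates, a document is reduced to its concatenated text content, in document order. A rough Python analogue of that effect, on a hypothetical input:</p>

```python
import xml.etree.ElementTree as ET

# Running a stylesheet with only the default templates over this input
# would output the concatenated text nodes, much like itertext() does:
doc = ET.fromstring("<a>Hello <b>default</b> templates</a>")
print("".join(doc.itertext()))
```
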
<p>A third default is:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"processing-instruction()|comment()"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is a template that specifically matches processing instructions and comments; it is empty, so it does not generate anything for them.</p>
<h3 id="example">Example</h3>
<p>To get an idea of what XSLT could do, let’s consider the following example of XML data representing a catalog of books and CDs:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
</pre></td><td class="code"><pre><span class="nt"><Catalog></span>
<span class="c"><!-- Book Sample --></span>
<span class="nt"><Product></span>
<span class="nt"><ProductNo></span>bk-005<span class="nt"></ProductNo></span>
<span class="nt"><Book</span> <span class="na">Language=</span><span class="s">"FR"</span><span class="nt">></span>
<span class="nt"><Price></span>
<span class="nt"><Value></span>19<span class="nt"></Value></span>
<span class="nt"><Currency></span>EUR<span class="nt"></Currency></span>
<span class="nt"></Price></span>
<span class="nt"><Title></span>Profecie<span class="nt"></Title></span>
<span class="nt"><Authors></span>
<span class="nt"><Author></span>
<span class="nt"><FirstName></span>Jonathan<span class="nt"></FirstName></span>
<span class="nt"><LastName></span>Zimmermann<span class="nt"></LastName></span>
<span class="nt"></Author></span>
<span class="nt"></Authors></span>
<span class="nt"><Year></span>2015<span class="nt"></Year></span>
<span class="nt"><Cover></span>profecie<span class="nt"></Cover></span>
<span class="nt"></Book></span>
<span class="nt"></Product></span>
<span class="c"><!-- CD sample --></span>
<span class="nt"><Product></span>
<span class="nt"><ProductNo></span>cd-003<span class="nt"></ProductNo></span>
<span class="nt"><CD></span>
<span class="nt"><Price></span>
<span class="nt"><Value></span>18.90<span class="nt"></Value></span>
<span class="nt"><Currency></span>EUR<span class="nt"></Currency></span>
<span class="nt"></Price></span>
<span class="nt"><Title></span>Witloof Bay<span class="nt"></Title></span>
<span class="nt"><Interpret></span>Witloof Bay<span class="nt"></Interpret></span>
<span class="nt"><Year></span>2010<span class="nt"></Year></span>
<span class="nt"><Sleeve></span>witloof<span class="nt"></Sleeve></span>
<span class="nt"><Opinion></span>
<span class="nt"><Parag></span>Original ce groupe belge.<span class="nt"></Parag></span>
<span class="nt"><Parag></span>Une véritable prouesse technique.<span class="nt"></Parag></span>
<span class="nt"></Opinion></span>
<span class="nt"></CD></span>
<span class="nt"></Product></span>
<span class="nt"></Catalog></span></pre></td></tr></tbody></table></code></pre></figure>
<p>For our example of books and CDs, we can create the following template:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsl:stylesheet</span> <span class="na">xmlns:xsl=</span><span class="s">"http://www.w3.org/1999/XSL/Transform"</span>
<span class="na">xmlns:xs=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span>
<span class="na">exclude-result-prefixes=</span><span class="s">"xs"</span>
<span class="na">version=</span><span class="s">"2.0"</span><span class="nt">></span>
<span class="nt"><xsl:output</span> <span class="na">method=</span><span class="s">"html"</span><span class="nt">/></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"/"</span><span class="nt">></span>
<span class="nt"><html></span>
<span class="nt"><head></span>...<span class="nt"></head></span>
<span class="nt"><body></span>
<span class="nt"><h2></span>Welcome to our catalog<span class="nt"></h2></span>
<span class="nt"><h3></span>Books<span class="nt"></h3></span>
<span class="nt"><ul></span>
<span class="nt"><xsl:apply-templates</span> <span class="na">select=</span><span class="s">"Catalog/Product/Book/Title"</span><span class="nt">></span>
<span class="nt"><xsl:sort</span> <span class="na">select=</span><span class="s">"."</span><span class="nt">/></span>
<span class="nt"></xsl:apply-templates></span>
<span class="nt"></ul></span>
<span class="nt"></body></span>
<span class="nt"></html></span>
<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"Title"</span><span class="nt">></span>
<span class="nt"><li></span>
<span class="nt"><xsl:value-of</span> <span class="na">select=</span><span class="s">"."</span><span class="nt">/></span>
<span class="nt"></li></span>
<span class="nt"></xsl:template></span>
<span class="nt"></xsl:stylesheet></span></pre></td></tr></tbody></table></code></pre></figure>
<p>In the above, the <code class="highlighter-rouge">xsl:sort</code> element has the following possible attributes:</p>
<ul>
<li><code class="highlighter-rouge">select</code>: here, the attribute is <code class="highlighter-rouge">.</code>, which refers to the title in this context</li>
<li><code class="highlighter-rouge">data-type</code>: gives the kind of order (e.g. text or number)</li>
<li><code class="highlighter-rouge">order</code>: <code class="highlighter-rouge">ascending</code> or <code class="highlighter-rouge">descending</code></li>
</ul>
<h2 id="xquery">XQuery</h2>
<p>XQuery is a <strong>strongly typed</strong> and <strong>functional</strong> language that offers features to operate on XML input for searching, selecting, filtering, transforming, restructuring information, etc. It is an SQL-like language for XML. It wasn’t defined with the same goals as XSLT, but has some overlap that we’ll discuss later.</p>
<p>It does not use the XML syntax. Instead, it offers a general purpose (Turing-complete) language that can be used for developing XML based applications.</p>
<p>XQuery is a <a href="https://www.w3.org/TR/xquery/all/">W3C Recommendation</a>, and is therefore closely linked to <a href="#xml-schema">XML Schema</a>, as it uses the XML Schema type system. Note that there is no support for XQuery with Relax NG or other non-W3C schema languages. A nice book on XQuery is <a href="http://shop.oreilly.com/product/0636920035589.do">available at O’Reilly</a>.</p>
<h3 id="syntax">Syntax</h3>
<p>A query is made up of three parts:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>(: Comments are written in these smiley-like delimiters :)

(: 1. Optional version declaration :)
xquery version "3.0";

(: 2. Optional query prolog :)
(: This contains declarations such as namespaces, variables, etc. :)
declare namespace html = "http://www.w3.org/1999/xhtml";

(: 3. Query body :)
substring("Welcome to the world of XML", 1, 7)</pre></td></tr></tbody></table></code></pre></figure>
<p>A query takes some kind of XML content: an XML file, an XML fragment retrieved online, a native XML database, etc. The output is a sequence of values, which are often XML elements (this is important: not a document, but elements). But it could also be an XML Schema type, such as a string, a list of integers, etc.</p>
<p>The output can be serialized to a document, or just kept in-memory in the application for further processing.</p>
<p>Queries are evaluated by an XQuery processor, which works in two phases. First, the analysis phase may raise errors (that do not depend on the input, only on the query). Then, there is an evaluation phase, which may raise dynamic errors (e.g. missing input).</p>
<p>A query consists of one or more comma-separated <strong>XQuery expressions</strong>, which are composed of the following:</p>
<ul>
<li>Primary expressions (literals, variables, function calls, etc)</li>
<li>Arithmetic expressions</li>
<li>Logical expressions</li>
<li>XPath (with <code class="highlighter-rouge">collection</code> and <code class="highlighter-rouge">doc</code> functions used to access resources)</li>
<li>XML constructors</li>
<li>Sequence constructors</li>
<li><a href="https://en.wikipedia.org/wiki/FLWOR">FLWOR statements</a> (pronounced “flower”: for, let, where, order by, return).</li>
<li>Conditional expressions</li>
<li>Quantified expressions</li>
</ul>
<h3 id="creating-xml-content">Creating XML content</h3>
<p>To build XML content, we can embed “escaped” XQuery code in curly brackets within our template, as follows:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><report</span> <span class="na">year=</span><span class="s">"2018"</span><span class="nt">></span>
The value is {round (3.14)}
<span class="nt"></report></span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="sequences">Sequences</h3>
<p>A sequence is an ordered collection of items, which may be of any type (atomic value, node, etc). Duplicates are allowed. A sequence can contain zero (empty), one (singleton) or many items. Sequences are comma-separated. We can add parentheses for clarity, but not for nesting; a sequence is always flat (even if we nest parentheses):</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre>1, 2, <example/>
(1, 2, <example/>)</pre></td></tr></tbody></table></code></pre></figure>
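<p>This flattening has no exact Python counterpart, but it corresponds to recursively splicing nested lists into their parent; a hypothetical helper, for illustration only:</p>

```python
# XQuery sequences never nest: (1, (2, 3), ()) is just the sequence 1, 2, 3.
# A Python sketch of that flattening behaviour:
def flatten(items):
    out = []
    for item in items:
        if isinstance(item, (list, tuple)):
            out.extend(flatten(item))  # nested "sequences" are spliced in
        else:
            out.append(item)
    return out

print(flatten([1, [2, [3]], []]))
```
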
<h3 id="flwor">FLWOR</h3>
<p>A FLWOR expression is constructed as follows:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>flwor ::= ((for | let) expr)+ (where expr)? (order by expr)? return expr</pre></td></tr></tbody></table></code></pre></figure>
<p>XQuery also has support for variables, denoted <code class="highlighter-rouge">$x</code> (which are more like constants, since they are single-assignment). For instance:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre>let $FREvents := /RAS/Events/Event[Canton/text() = "FR"],
    $FRTopics := $FREvents/TopicRef/text()

return /RAS/Members/Member[Topics/TopicRef/text() = $FRTopics]/Email</pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 This gives us the email addresses of reporters who may deal with events in the canton of Fribourg. See exercises 01 for more context.</p>
</blockquote>
<p>Let’s take a look at another XQuery expression:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>for $book in /Catalog/Product/Book
where $book/@Language = "EN"
return $book/Title

(: equivalently written as :)

for $book in /Catalog/Product/Book[@Language = "EN"]
return $book/Title</pre></td></tr></tbody></table></code></pre></figure>
<p>This returns the titles of the English books in the document:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><Title></span>XSLT<span class="nt"></Title></span>
<span class="nt"><Title></span>Electronic Publishing<span class="nt"></Title></span>
<span class="nt"><Title></span>Making Sense of NoSQL<span class="nt"></Title></span></pre></td></tr></tbody></table></code></pre></figure>
<p>As we can see above, there is some overlap between XQuery and XPath; the <code class="highlighter-rouge">where</code> condition can also be written as an XPath selection condition. Which to use is a question of style; there is no difference in performance.</p>
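<p>For comparison, the predicate form of this selection can be reproduced with Python’s <code class="highlighter-rouge">xml.etree.ElementTree</code>, whose limited XPath subset supports attribute predicates (catalog data inlined here for illustration):</p>

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<Catalog>"
    "<Product><Book Language='EN'><Title>XSLT</Title></Book></Product>"
    "<Product><Book Language='FR'><Title>Profecie</Title></Book></Product>"
    "</Catalog>")

# Equivalent of /Catalog/Product/Book[@Language = "EN"]/Title
titles = [t.text for t in catalog.findall("Product/Book[@Language='EN']/Title")]
print(titles)
```
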
<p>The <code class="highlighter-rouge">order by</code> and <code class="highlighter-rouge">where</code> keywords work just like in SQL, so I won’t go into details here.</p>
<h3 id="conditional-expressions">Conditional expressions</h3>
<p>Like in any templating language, we can create conditional statements. It is mandatory to specify an <code class="highlighter-rouge">else</code> branch for every <code class="highlighter-rouge">if</code>, but if we do not want to return anything, we can return the empty sequence <code class="highlighter-rouge">()</code>.</p>
<p>The condition of an <code class="highlighter-rouge">if</code> must be a boolean or a sequence. Empty sequences are falsey, and sequences of one or more elements are truthy.</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre>for $book in /catalog/product/book
order by $book/title
return
<title>
{$book/title/text()}
{if ($book/@Language = 'EN') then '[English]' else ()}
</title></pre></td></tr></tbody></table></code></pre></figure>
<p>This returns:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="nt"><title></span>Electronic Publishing [English]<span class="nt"></title></span>
<span class="nt"><title></span>Making Sense of NoSQL [English]<span class="nt"></title></span>
<span class="nt"><title></span>Profecie<span class="nt"></title></span>
<span class="nt"><title></span>XML - le langage et ses applications<span class="nt"></title></span>
<span class="nt"><title></span>XSLT [English]<span class="nt"></title></span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="quantified-expressions">Quantified expressions</h3>
<p>A quantified expression allows us to express universal or existential quantifiers using <code class="highlighter-rouge">some</code> and <code class="highlighter-rouge">every</code>. The predicate is given with the keyword <code class="highlighter-rouge">satisfies</code>, as below:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre>some $dept in doc("catalog.xml")//product/@dept
satisfies ($dept = "ACC")</pre></td></tr></tbody></table></code></pre></figure>
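<p>These quantifiers behave like Python’s <code class="highlighter-rouge">any</code> and <code class="highlighter-rouge">all</code> built-ins; with a hypothetical list of department codes:</p>

```python
# XQuery: some  $dept in ... satisfies ($dept = "ACC")  ->  any(...)
#         every $dept in ... satisfies ($dept = "ACC")  ->  all(...)
depts = ["MKT", "ACC", "PRD"]
print(any(d == "ACC" for d in depts))
print(all(d == "ACC" for d in depts))
```
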
<h3 id="functions">Functions</h3>
<p>User defined functions can be declared as follows:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre>declare function local:discountPrice(
$price as xs:decimal?,
$discount as xs:decimal?,
$maxDiscountPct as xs:integer?) as xs:decimal?
{
let $maxDiscount := ($price * $maxDiscountPct) div 100
let $actualDiscount := min(($maxDiscount, $discount))
return ($price - $actualDiscount)
};</pre></td></tr></tbody></table></code></pre></figure>
<p>The types are sequence types, which specify both the number and the type of the items. For instance, <code class="highlighter-rouge">xs:string?</code> means a sequence of zero or one string. The return type is optional, but is strongly encouraged for readability, error checking and optimization.</p>
<p>Functions can be overloaded with a different number of parameters.</p>
<p>The body is enclosed in curly braces. It does not have to contain a <code class="highlighter-rouge">return</code> clause; it just needs to be an XQuery expression.</p>
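<p>A direct Python transcription of <code class="highlighter-rouge">local:discountPrice</code> (using <code class="highlighter-rouge">decimal.Decimal</code> to mirror <code class="highlighter-rouge">xs:decimal</code>) makes the capping logic explicit:</p>

```python
from decimal import Decimal

def discount_price(price, discount, max_discount_pct):
    # cap the discount at max_discount_pct percent of the price
    max_discount = price * max_discount_pct / 100
    actual_discount = min(max_discount, discount)
    return price - actual_discount

# A discount of 30 on a price of 100, capped at 20%, yields 80
print(discount_price(Decimal("100"), Decimal("30"), 20))
```
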
<h3 id="modules">Modules</h3>
<p>Functions can be grouped into modules, which declare the target namespace and bind it to a prefix (here, the <code class="highlighter-rouge">strings</code> prefix):</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>module namespace strings = "https://example.com/strings"</pre></td></tr></tbody></table></code></pre></figure>
<p>Anything declared under that prefix can be accessed from the outside, when importing the module.</p>
<p>Modules can be imported at a location using the <code class="highlighter-rouge">at</code> clause:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>import module namespace search = "https://example.com/search" at "search.xqm"</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="updating-xml-content">Updating XML Content</h3>
<p>Unlike SQL, standard XQuery only offers ways of querying data, and not of inserting, deleting or updating data. That’s why the W3C developed an extension to XQuery called the <a href="https://www.w3.org/TR/xquery-update-10/">XQuery Update Facility</a>.</p>
<p>Like SQL, the implementation of this Update Facility is often tied to specific database systems. In this course, we will use the <a href="http://exist-db.org/exist/apps/homepage/index.html">eXist-db</a> variant. Updates are executed by specifying the <code class="highlighter-rouge">update</code> keyword in the <code class="highlighter-rouge">return</code> clause.</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre>let $catalog := doc('db/catalog.xml')
return update insert
<product>...</product>
into $catalog</pre></td></tr></tbody></table></code></pre></figure>
<p>The keyword <code class="highlighter-rouge">into</code> places content after the last child of the element. We can also use <code class="highlighter-rouge">following</code>, placing it as the next sibling, or <code class="highlighter-rouge">preceding</code> to place it as the previous sibling.</p>
<p>Instead of <code class="highlighter-rouge">update insert</code>, we can also do an <code class="highlighter-rouge">update delete</code>, or an <code class="highlighter-rouge">update replace XPATH with ELEMENT</code>.</p>
<p>Updates can be chained as a sequence:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>let $cd := doc('db/catalog.xml')/Product[ProductNo = $no]/CD
return
(
update replace $cd/Price/Value with <value>18</value>,
update replace $cd/Year with <year>2010</year>
)</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="advanced-features">Advanced features</h3>
<p>As we mentioned earlier, XQuery is Turing complete. You can define your own functions, which may be grouped into modules, and may be higher-order functions.</p>
<p>Schema awareness is an optional feature; if it is supported, the <code class="highlighter-rouge">validate</code> expression may be used, which is useful for optimization and error checking. However, as we mentioned earlier, there is only support for W3C standardized schemas, not Relax NG.</p>
<p>While XQuery is mainly associated with XML, it is possible in newer versions to deal with text documents (like CSV, name/value config files, etc. since 3.0) and even JSON (since 3.1).</p>
<h3 id="coding-guidelines">Coding guidelines</h3>
<p>MarkLogic has some <a href="https://developer.marklogic.com/blog/xquery-coding-guidelines">XQuery coding guidelines</a> that are good to follow.</p>
<p>For robustness, it is important to handle missing values (empty sequences) and data variations.</p>
<h2 id="xml-based-webapps">XML Based Webapps</h2>
<p>We’ve now learned to model (with schemas), transform (with XSLT), and query and process (with XQuery). How can we develop an XML based webapp combining these?</p>
<p>We will take a look at the <a href="https://github.com/ssire/oppidum">Oppidum framework</a>, which targets the development of XML-REST-XQuery (XRX) applications, using the eXist-db XML database.</p>
<h3 id="xml-databases">XML Databases</h3>
<p>An XML database looks quite a lot like a normal database; for instance, it uses a traditional, B-tree based indexing system, has a querying language, etc. The main difference is simply that data is XML instead of a table, and that we use XQuery instead of SQL.</p>
<h3 id="rest">REST</h3>
<p>REST stands for REpresentational State Transfer. It’s an architectural style created by Roy Fielding in <a href="https://www.ics.uci.edu/~fielding/">his PhD thesis</a>.</p>
<p>In REST, we have resources, located by a URL on Web-based REST, that may be processed by a client. A collection is simply a set of resources. Interaction with a REST API happens with classical CRUD (Create, Read, Update, Delete) on URLs, which in HTTP are the <code class="highlighter-rouge">POST</code>, <code class="highlighter-rouge">GET</code>, <code class="highlighter-rouge">PUT</code> and <code class="highlighter-rouge">DELETE</code> requests.</p>
<h3 id="oppidum">Oppidum</h3>
<p><a href="https://github.com/ssire/oppidum">Oppidum</a> is an open source framework to build XML Web-based applications with an MVC approach. The <a href="https://ssire.github.io/oppidum/docs/fr/guide.html">documentation</a> is only in French, but the core idea is as follows: HTTP requests are handed to Oppidum by eXist. The application logic is then detailed in a pipeline consisting of:</p>
<ul>
<li><strong>Model</strong>: XQuery script (<code class="highlighter-rouge">*.xql</code>) returning relevant XML content</li>
<li><strong>View</strong>: XSLT transformation (<code class="highlighter-rouge">*.xsl</code>)</li>
<li><strong>Epilogue</strong>: XQuery script (<code class="highlighter-rouge">epilogue.xql</code>) for templating common content in HTML pages; this works using tags with the <code class="highlighter-rouge">site</code> namespace</li>
</ul>
<p>To specify the REST architecture, Oppidum has a DSL that allows us to define the set of resources and actions, determine the URLs and associated HTTP verbs (<code class="highlighter-rouge">GET</code>, <code class="highlighter-rouge">POST</code>, etc) recognized by the application, and so on.</p>
<h2 id="foundations-of-xml-types">Foundations of XML types</h2>
<p>We’ve seen XML tools for validation (DTD, XML Schema, Relax NG), navigation and extraction (XPath) and transformation (XQuery, XSLT).</p>
<p>Some essential questions about these tools are:</p>
<ul>
<li><strong>Expressive power</strong>: can I express requirement X using XML type language Y?</li>
<li><strong>Operations over XML types</strong>: can I check forward-compatibility when my XML file format evolves? Type inclusion?</li>
<li><strong>Static type-checking</strong>: can we guarantee that my XML-manipulating programs will never output an invalid document?</li>
</ul>
<p>To answer this, we must know more about XML types, and dive into the theoretical foundations of XML types.</p>
<h3 id="tree-grammars">Tree Grammars</h3>
<p>XML documents can be modelled by finite, ordered, labeled trees of unbounded depth and arity. To describe a tree, we use a tree language, which can be specified by a tree grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>Person = person[Name, Gender, Children?]
Name = name[String]
Gender = gender[Male | Female]
Male = male[]
Female = female[]
Children = children[Person+]</pre></td></tr></tbody></table></code></pre></figure>
<p>By convention, capitalized variables are <strong>type variables</strong> (non-terminals), and non-capitalized are terminals.</p>
<p>A tree grammar defines a set of legal trees. As any grammar, tree grammars are defined within an alphabet $\Sigma$, with a set of type variables $X := \left\{X_1 ::= T_1, \dots, X_N ::= T_n\right\}$. A tree grammar is defined by the pair $(E, X)$, where $E$ represents the starting type variable. Each $T_i$ is a tree type expression:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>T ::=
l[T] // l ∈ Σ with content model T
| () // empty sequence
| T1, T2 // concatenation
| T1 | T2 // choice
| X // reference</pre></td></tr></tbody></table></code></pre></figure>
<p>The usual regex operators <code class="highlighter-rouge">?</code>, <code class="highlighter-rouge">+</code> and <code class="highlighter-rouge">*</code> are syntactic sugar.</p>
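<p>For instance, the desugaring can be written in the core syntax (using a fresh type variable <code class="highlighter-rouge">X'</code> for the repetition; this spelling-out is ours, not from the course):</p>

```
T?  =  T | ()
T*  =  X'      where X' = T, X' | ()
T+  =  T, X'   where X' = T, X' | ()
```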
<p>To ensure that our tree grammar remains regular, we must introduce a syntactic restriction: every recursive use of a type variable $X$ (unless it is within the content model) must be in the tail. For instance, the following grammars are not acceptable:</p>
<script type="math/tex; mode=display">\left\{ X = a, X, b \right\} \\
\left\{ X = a, Y, b; \quad Y = X \right\} \\</script>
<p>But the following are fine:</p>
<script type="math/tex; mode=display">\left\{ X = a, c[X], b \right\} \\
\left\{ X = a, Y; \quad Y = b, X | \epsilon \right\} \\</script>
<p>A small reminder on regular vs. context-free grammars: inclusion between regular languages is decidable (we can check it with DFAs), while inclusion between context-free languages is undecidable. For instance, $a^n b^n$ is context-free but not regular: no DFA can recognize it.</p>
<p>Within the class of context-free tree grammars, there are three subclasses of interest, in order of specificity (each of these is a subset of the classes above it):</p>
<ol>
<li>Context-free</li>
<li>Regular</li>
<li>Single Type</li>
<li>Local</li>
</ol>
<p>Each subclass is defined by additional restrictions compared to its parent. The more restrictions we add, the more expressive power we lose. It turns out that these classes correspond to different XML technologies:</p>
<ol>
<li><strong>Context-free</strong>: ?</li>
<li><strong>Regular</strong>: Relax NG</li>
<li><strong>Single Type</strong>: XML Schema</li>
<li><strong>Local</strong>: DTD</li>
</ol>
<h4 id="dtd--local-tree-grammars">DTD & Local tree grammars</h4>
<p>As we said previously, the expressive power of a grammar class is defined by which restrictions have been imposed. In DTD, the restriction is that each element name is associated with a single regex. This means that for each $a[T_1]$ and $a[T_2]$ occurring in $X$, the content models are identical: $T_1 = T_2$.</p>
<p>In other words, in DTDs, the content of an XML tag cannot depend on the context of the tag. This removes some expressive power.</p>
<p>To construct a DTD validator, we just use a word automaton associated with each terminal. This automaton is a DFA, as DTD requires regular expressions to be deterministic. That is, which branch of the regex matches must be determinable without looking ahead at the next symbol. <code class="highlighter-rouge">a(bc | bb)</code> is not deterministic, but <code class="highlighter-rouge">ab(c | b)</code> is.</p>
<p>As a corollary, the union of two DTDs may not be a DTD. Indeed, the two DTDs could define different content models for the same terminal, which would be illegal. We say that the class is not closed under composition (here, we showed that it isn’t closed under union).</p>
<h4 id="xml-schema--single-type-tree-grammars">XML Schema & Single-Type tree grammars</h4>
<p>In XML Schema, it is possible to have different content models for elements of the same name when they are in different contexts (unlike for DTD). But still, for each $a[T_1]$ and $a[T_2]$ occurring <em>under the same parent</em>, the content models must be identical ($T_1 = T_2$).</p>
<p>Still, this brings us more expressive power, so we have $\mathcal{L}_{\text{DTD}} \subset \mathcal{L}_{\text{xmlschema}}$. This inclusion is strict, as we can construct grammars in XML Schema that are single-type but not local:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>Dealer = dealer[Used, New]
Used = used[UsedCar]
New = new[NewCar]
UsedCar = car[Model, Year] // here, car can have different content models
NewCar = car[Model] // this is allowed as they have different parents
...</pre></td></tr></tbody></table></code></pre></figure>
<p>But XML schemas also have weaknesses: we cannot encode more advanced restrictions. For instance, with our car dealership example, we cannot encode something like “at least one car has a discount”, as it is not <em>single-type</em>: we would require two different content models for a car within the same parent.</p>
<p>Consequently, this class is still not closed under union.</p>
<h4 id="relax-ng--regular-tree-grammars">Relax NG & Regular tree grammars</h4>
<p>Relax NG does not have any of the previously discussed restrictions. The content model does not have to depend on the label of the parent; it can also depend on the ancestor’s siblings, for instance. This allows for much more expressive power. Relax NG places itself in the class of regular tree grammars, and $\mathcal{L}_{\text{xmlschema}} \subset \mathcal{L}_{\text{relaxng}}$.</p>
<p>For instance, we can now encode what we couldn’t with XML Schema:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre>Dealer = dealer[Used, New]
Used = used[UsedCar]
New = new[NewCar, DNewCar]
UsedCar = car[Model, Year]
NewCar = car[Model] // the same terminal used within 'new'
DNewCar = car[Model, Discount] // but with different content models
...</pre></td></tr></tbody></table></code></pre></figure>
<p>Regular tree grammars are more robust (closed under set operations like union and intersection) and give us high expressive power, while still remaining simply defined and well-characterized (membership can still be verified in linear time).</p>
<h3 id="tree-automata">Tree automata</h3>
<h4 id="definition">Definition</h4>
<p>A tree automaton (plural automata) is a state machine dealing with tree structure instead of strings (like a word automaton would). Introducing these will allow us to provide a general framework for XML type languages by giving us a tool with which we can reason about regular tree languages.</p>
<p>A ranked tree can be thought of as the AST representation of a function call. For instance, <code class="highlighter-rouge">f(a, b)</code> can be represented as a tree with parent node <code class="highlighter-rouge">f</code> and two children <code class="highlighter-rouge">a</code> and <code class="highlighter-rouge">b</code> (in that order). We can also represent more complex trees with these notations (<code class="highlighter-rouge">f(g(a, b, c), h(i))</code> gives us the full structure of a tree, for instance).</p>
<p>We define a ranked alphabet symbol as a formalization of a function call. It is a symbol $a$, associated with an integer representing the number of children, $\text{arity}(a)$. We write $a^{(k)}$ for the symbol $a$ with $\text{arity}(a) = k$.</p>
<p>This allows us to fix an arity to different tree symbols. Our alphabet could then be, for instance, $\left\{ a^{(2)}, b^{(2)}, c^{(3)}, \sharp^{(0)} \right\}$. In this alphabet, <code class="highlighter-rouge">#</code> would always be the leaves.</p>
<p>A ranked tree automaton A consists of:</p>
<ul>
<li>$F$, a finite ranked alphabet of symbols</li>
<li>$Q$, a finite set of states</li>
<li>$\Delta$, a finite set of transition rules</li>
<li>$Q_f \subseteq Q$, a finite set of final states</li>
</ul>
<p>In a word automaton, we write transitions as $\text{even} \overset{1}{\rightarrow} \text{odd}$. In a (bottom-up) tree automaton, the transitions go from the children’s states to the parent’s state. If a tree node has arity 2, a transition could be $(q_0, q_1) \overset{a}{\rightarrow} q_0$. If the arity is $k=0$, we write $\epsilon \overset{a^{(0)}}{\rightarrow} q$.</p>
<h4 id="example-1">Example</h4>
<p>As an example, we can think of a tree of boolean expressions. Let’s consider the following:</p>
<script type="math/tex; mode=display">((0 \land 1) \lor (1 \lor 0)) \land ((0 \lor 1) \land (1 \land 1))</script>
<p>We can construct this as a binary tree by treating the logical operators as infix notation of a function call:</p>
<script type="math/tex; mode=display">\land(\lor(\land(0, 1), \lor(1, 0)), \land(\lor(0, 1), \land(1, 1)))</script>
<p>In this case, our alphabet is $F = \left\{\land, \lor, 0, 1\right\}$. Our states are $Q = \left\{ q_0, q_1\right\}$ (either true or false). The accepting state is $Q_f = \left\{ q_1 \right\}$. Our transition rules are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\epsilon \overset{0}{\rightarrow} q_0 & \quad & \epsilon \overset{1}{\rightarrow} q_1 \\
(q_1, q_1) \overset{\land}{\rightarrow} q_1 & \quad & (q_1, q_1) \overset{\lor}{\rightarrow} q_1 \\
(q_0, q_1) \overset{\land}{\rightarrow} q_0 & \quad & (q_0, q_1) \overset{\lor}{\rightarrow} q_1 \\
(q_1, q_0) \overset{\land}{\rightarrow} q_0 & \quad & (q_1, q_0) \overset{\lor}{\rightarrow} q_1 \\
(q_0, q_0) \overset{\land}{\rightarrow} q_0 & \quad & (q_0, q_0) \overset{\lor}{\rightarrow} q_0
\end{align} %]]></script>
<p>With these rules in place, we can evaluate binary expressions with a tree automaton.</p>
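<p>As a sketch, the bottom-up run of this automaton can be implemented directly (in Python; the nested-tuple encoding of trees is our own convention, not part of the course):</p>

```python
# A bottom-up run of the boolean-expression tree automaton above.
# Trees are encoded as nested tuples:
# ('and', l, r), ('or', l, r), ('0',), ('1',).

LEAF = {'0': 'q0', '1': 'q1'}  # ε-transitions for the arity-0 symbols

def step(symbol, child_states):
    # One transition: from the children's states to the parent's state.
    if symbol == 'and':
        return 'q1' if child_states == ('q1', 'q1') else 'q0'
    if symbol == 'or':
        return 'q0' if child_states == ('q0', 'q0') else 'q1'
    return LEAF[symbol]

def run(tree):
    # Evaluate bottom-up: children first, then the parent's transition.
    symbol, *children = tree
    return step(symbol, tuple(run(c) for c in children))

def accepts(tree, final=frozenset({'q1'})):
    return run(tree) in final
```

Running it on the expression above leaves the root in state $q_1$, so the tree is accepted.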
<h4 id="properties">Properties</h4>
<p>The language of A is the set of trees accepted by A. For a tree automaton, the language is a <strong>regular tree language</strong>.</p>
<p>A tree automaton is <strong>deterministic</strong> as long as there are no two rules with the same left-hand side leading to different states:</p>
<script type="math/tex; mode=display">(q_1, \dots q_k) \overset{a^{(k)}}{\rightarrow} q, \quad
(q_1, \dots q_k) \overset{a^{(k)}}{\rightarrow} q'
\qquad q \ne q'</script>
<p>With word automata, we know that we can build a DFA from any NFA. The same applies to tree automata: from a given non-deterministic (bottom-up) tree automaton, we can build a deterministic tree automaton.</p>
<p>As a corollary, this tells us that non-determinism does not add expressive power: deterministic and non-deterministic tree automata recognize the same languages. However, non-deterministic automata tend to represent languages more compactly (determinization can turn a non-deterministic tree automaton of size $N$ into a deterministic tree automaton of size $\mathcal{O}(2^N)$), so we’ll use them freely.</p>
<h3 id="validation">Validation</h3>
<h4 id="inclusion">Membership</h4>
<p>Given a tree automaton A and a tree t, how do we check $t\in\text{Language}(A)$?</p>
<p>We just mechanically apply the transition rules, bottom-up. If the automaton is non-deterministic, we keep track of the set of possible states, and check whether the root can be assigned a final state.</p>
<p>This mechanism of membership checking is linear in the size of the tree.</p>
<h4 id="closure">Closure</h4>
<p>Tree automata are closed under set-theoretic operations: for instance, the product construction yields an automaton for the intersection of two languages, and a similar construction works for union.</p>
<h4 id="emptiness">Emptiness</h4>
<p>We can also do emptiness checking with tree automata (that is, checking if $\text{Language}(A) = \emptyset$). To do so, we compute the set of reachable states, and see if any of them are in $Q_f$. This process is linear in the size of the automaton.</p>
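<p>A sketch of this saturation procedure (in Python; the triple encoding of transition rules is our own convention):</p>

```python
# Emptiness check by saturation: a state becomes reachable as soon as some
# rule produces it from already-reachable child states; leaf rules seed the
# set. A rule is a triple (symbol, child_states, result_state); leaves use ().

def reachable(rules):
    seen = set()
    changed = True
    while changed:
        changed = False
        for _symbol, children, result in rules:
            if result not in seen and all(c in seen for c in children):
                seen.add(result)
                changed = True
    return seen

def is_empty(rules, final):
    # The language is empty iff no final state is reachable.
    return not (reachable(rules) & set(final))
```

Each rule fires at most once per added state, which is why the whole check stays linear in the size of the automaton.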
<h4 id="type-inclusion">Type inclusion</h4>
<p>Given two automata $A_1$ and $A_2$, how can we check $\text{Language}(A_1) \subseteq \text{Language}(A_2)$?</p>
<p>Containment of non-deterministic automata can be decided in exponential time. We do this by checking whether $\text{Language}(A_1) \cap \overline{\text{Language}(A_2)} = \emptyset$. For this, we must make $A_2$ deterministic before complementing it (which is an exponential process).</p>
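<p>A sketch of this check (in Python), under the simplifying assumption that $A_2$ is already <em>deterministic and complete</em>, so its complement is obtained by just swapping final and non-final states; in general $A_2$ must first be determinized, which is the exponential step:</p>

```python
# Type inclusion: L(A1) ⊆ L(A2) iff L(A1) ∩ complement(A2) is empty.
# An automaton is (rules, final), where rules maps (symbol, child_states)
# to a result state.

def product_rules(r1, r2):
    # The product automaton simulates both automata in lockstep.
    rules = {}
    for (sym1, ch1), q1 in r1.items():
        for (sym2, ch2), q2 in r2.items():
            if sym1 == sym2 and len(ch1) == len(ch2):
                rules[(sym1, tuple(zip(ch1, ch2)))] = (q1, q2)
    return rules

def reachable(rules):
    # Saturate: a state is reachable once all children of some rule are.
    seen, changed = set(), True
    while changed:
        changed = False
        for (_sym, children), q in rules.items():
            if q not in seen and all(c in seen for c in children):
                seen.add(q)
                changed = True
    return seen

def included(a1, a2):
    (r1, f1), (r2, f2) = a1, a2
    # "Bad" product states: accepted by A1, rejected by (complete) A2.
    bad = {(p, q) for p in f1 for q in set(r2.values()) - set(f2)}
    return not (reachable(product_rules(r1, r2)) & bad)
```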
<h2 id="dealing-with-non-textual-content">Dealing with non-textual content</h2>
<p>So far, we’ve just been dealing with text. In the following, we’ll see how we can deal with images, graphics, sound, video, animations, etc. For these types of data, semi-structured tree data is commonly used for its flexibility, while retaining rigorous structures and data typing.</p>
<p>For instance, there are many application-specific markup languages (MathML, CML for chemistry, GraphML, SVG tables, etc).</p>
<h3 id="mathml">MathML</h3>
<p>MathML actually has two possible structures: a presentation structure, telling us how to display math, and a content structure, telling us how to apply or compute the result of a mathematical expression. It’s possible to go from the content structure to the presentation structure, but not the other way around (the presentation form is too ambiguous; the mapping is not a bijection).</p>
<h3 id="tables">Tables</h3>
<p>This distinction between content and presentation also exists within tables. For instance, creating the presentation and layout of a calendar, or of a complex table, is quite difficult because of the discrepancy between the presentation and structural forms.</p>
<p>The main issues with tables are:</p>
<ul>
<li>How can we model it in such a way that variations in presentation only depend on the values of formatting attributes?</li>
<li>How can we edit a table? (How do we modify the structure and update the backing content?)</li>
</ul>
<p>From a logical point of view, we can view a table as a d-dimensional space. A simple row-column table is 2D, but we can “add dimensions” by adding subdivision headers. Each cell in the table is described by a d-dimensional tuple of coordinates. How can we use a tree model to represent this?</p>
<p>We could use a tree of height $d$, but more efficiently (or at least, more flatly), we can encode each dimension as a direct child of the root, and link each data point to the relevant axes.</p>
<p>This is what HTML 4 proposes:</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="nt"><tr></span>
<span class="nt"><th></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a2"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Meals<span class="nt"></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a3"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Hotels<span class="nt"></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a4"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Transport<span class="nt"></th></span>
<span class="nt"><td></span>subtotals<span class="nt"></td></span>
<span class="nt"></tr></span>
<span class="nt"><tr></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a6"</span> <span class="na">axis=</span><span class="s">"location"</span><span class="nt">></span>San Jose<span class="nt"></th></span>
<span class="nt"><th></th></span>
<span class="nt"><th></th></span>
<span class="nt"><th></th></span>
<span class="nt"><td></td></span>
<span class="nt"></tr></span>
<span class="nt"><tr></span>
<span class="nt"><td</span> <span class="na">id=</span><span class="s">"a7"</span> <span class="na">axis=</span><span class="s">"date"</span><span class="nt">></span>25-Aug-97<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a2"</span><span class="nt">></span>37.74<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a3"</span><span class="nt">></span>112.00<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a4"</span><span class="nt">></span>45.00<span class="nt"></td></span>
<span class="nt"><td></td></span>
<span class="nt"></tr></span></pre></td></tr></tbody></table></code></pre></figure>
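<p>The same logical model — each cell addressed by a tuple of axis coordinates, mirroring the <code class="highlighter-rouge">headers="a6 a7 a2"</code> linkage above — can be sketched as follows (in Python; the axis names and values are the ones from the example):</p>

```python
# A 3-dimensional table as a mapping from coordinate tuples to values:
# (location, date, expense) -> amount.
cells = {
    ("San Jose", "25-Aug-97", "Meals"): 37.74,
    ("San Jose", "25-Aug-97", "Hotels"): 112.00,
    ("San Jose", "25-Aug-97", "Transport"): 45.00,
}

def total(cells, location=None, date=None, expense=None):
    # Sum every cell matching the given coordinates (None matches anything),
    # e.g. the per-day subtotal shown in the table's last column.
    return sum(value for (loc, day, exp), value in cells.items()
               if location in (None, loc)
               and date in (None, day)
               and expense in (None, exp))
```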
<h2 id="xml-processing">XML Processing</h2>
<p>When working with XML, there’s no need to write a parser. General-purpose XML parsers are widely available (e.g. Apache Xerces). Incidentally, an XML parser can be validating or non-validating.</p>
<p>XML parsers can communicate the XML tree structure to applications using it; there are two approaches for this:</p>
<ul>
<li>DOM: the parser stores the XML input to a fixed data structure, and exposes an API</li>
<li>SAX: the parser triggers events. The input isn’t stored; the application must specify how to store and process the events triggered by the parser.</li>
</ul>
<h3 id="dom">DOM</h3>
<p>DOM (Document Object Model) is a W3C standard. An application generates DOM library calls to manipulate the parsed XML input. There are multiple DOM levels, that have been introduced successively to expand the capabilities of DOM.</p>
<ul>
<li>DOM Level 1 provided basic API to access and manipulate tree structures (<code class="highlighter-rouge">getParentNode()</code>, <code class="highlighter-rouge">getFirstChild()</code>, <code class="highlighter-rouge">insertBefore()</code>, <code class="highlighter-rouge">replaceChild()</code>, …)</li>
<li>DOM Level 2 introduces specialized interfaces dedicated to XML and namespace-related methods, dynamic access and update of the content of style sheets, an event system, …</li>
<li>DOM Level 3 introduces the ability to dynamically load the content of an XML document into a DOM document, serialize DOM into XML, dynamically update the content while ensuring validation, access the DOM using XPath, …</li>
</ul>
<p>DOM allows us to abstract away from the syntactical details of the XML structure, and allows us to ensure well-formedness (no missing tags, non-matching tags, etc). Thanks to that, document manipulation is considerably simplified.</p>
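<p>As a sketch of the DOM approach, using Python’s standard <code class="highlighter-rouge">xml.dom.minidom</code> (any DOM implementation would look similar; the catalog document is illustrative):</p>

```python
from xml.dom import minidom

# DOM sketch: the parser materializes the whole document as a tree,
# which we can then navigate and modify through the standard API.
doc = minidom.parseString(
    "<catalog><cd><title>Abbey Road</title></cd></catalog>")

# Random access anywhere in the tree.
titles = [t.firstChild.nodeValue for t in doc.getElementsByTagName("title")]

# Level-1-style manipulation: create and insert a new element.
cd = doc.getElementsByTagName("cd")[0]
year = doc.createElement("year")
year.appendChild(doc.createTextNode("1969"))
cd.appendChild(year)
```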
<p>However, the DOM approach is not without its flaws. The main disadvantage is that we must maintain a data structure representing the whole XML input, which can be problematic for big documents. To remedy this situation, we can preprocess to filter the document, reducing its overall size, but that only takes us so far. Alternatively, we can use a different approach for XML processing: SAX.</p>
<h3 id="sax">SAX</h3>
<p>SAX, the <a href="http://www.saxproject.org/">Simple API for XML</a> is not a W3C standard; it’s more of a de facto standard that started out as a Java-only API.</p>
<p>It’s very efficient, using only constant space, regardless of the XML input size. However, it means that we must also write more code. Indeed, we must specify callbacks for certain events, write our own code to store what we need, etc.</p>
<p>The SAX processor reads the input sequentially (while the DOM afforded us with random access), and once only. It sends events like <code class="highlighter-rouge">startDocument</code>, <code class="highlighter-rouge">startElement</code>, <code class="highlighter-rouge">characters</code>, etc. White spaces and tabs are reported too, so this also potentially means more code to write.</p>
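<p>As a sketch of the SAX approach, using Python’s standard <code class="highlighter-rouge">xml.sax</code> (the handler and document here are illustrative):</p>

```python
import xml.sax

# SAX sketch: the parser pushes events to a handler and stores nothing
# itself, so memory use is constant in the input size. The handler
# decides what (little) state to keep.
class ElementCounter(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag, in document order.
        self.counts[name] = self.counts.get(name, 0) + 1

def count_elements(data):
    handler = ElementCounter()
    xml.sax.parseString(data, handler)
    return handler.counts
```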
<h3 id="dom-and-web-applications">DOM and web applications</h3>
<p>DOM is language and platform independent, with DOM APIs for all major programming languages. The most common, though, is the DOM API used with JavaScript.</p>
<h3 id="xforms-an-alternative-to-html-forms">XForms: an alternative to HTML forms</h3>
<p>XForms give us a declarative approach to capture information from the user and place it into XML documents, with constraint checking. XForms are a W3C standard, but are not natively implemented in browsers.</p>
<h2 id="web-services">Web Services</h2>
<p>Service-oriented architecture (SOA) is an architectural pattern in software design, in which each component provides services to other components via communication protocols. XML has an answer to this:</p>
<h3 id="web-service-description-language-wsdl">Web Service Description Language (WSDL)</h3>
<p>WSDL is a language to create descriptions of a web service. That is, describe the operations it can perform, the structure of its messages, communication mechanisms it understands, etc. This is a <a href="https://www.w3.org/TR/2007/REC-wsdl20-20070626/">W3C recommendation</a> since 2007.</p>
<p>Inside a <code class="highlighter-rouge"><wsdl:description></code> tag, we can use:</p>
<ul>
<li>Optional documentation (<code class="highlighter-rouge"><wsdl:documentation/></code>), with a human readable description of the web service</li>
<li>Definition of data types (<code class="highlighter-rouge"><wsdl:types/></code>) exchanged between client and web service</li>
<li>Description of the interface (<code class="highlighter-rouge"><wsdl:interface/></code>), i.e. what operations and messages are defined</li>
<li>Binding (<code class="highlighter-rouge"><wsdl:binding/></code>) describing how the web service is accessed over the network</li>
<li>Service tag (<code class="highlighter-rouge"><wsdl:service/></code>) describing where the service can be accessed</li>
</ul>
<h3 id="simple-object-access-protocol-soap">Simple Object Access Protocol (SOAP)</h3>
<p>SOAP is a W3C standard protocol, with strict rules and advanced security features. However, it comes with substantial complexity, leading to slow page load times.</p>
<figure class="highlight"><pre><code class="language-plain" data-lang="plain"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre>POST /Quotation HTTP/1.0
Host: www.xyz.org
Content-Type: text/xml; charset = utf-8
Content-Length: nnn

<?xml version = "1.0"?>
<SOAP-ENV:Envelope
xmlns:SOAP-ENV = "http://www.w3.org/2001/12/soap-envelope"
SOAP-ENV:encodingStyle = "http://www.w3.org/2001/12/soap-encoding">
<SOAP-ENV:Body xmlns:m = "http://www.xyz.org/quotations">
<m:GetQuotation>
<m:QuotationsName>MiscroSoft</m:QuotationsName>
</m:GetQuotation>
</SOAP-ENV:Body>
</SOAP-ENV:Envelope></pre></td></tr></tbody></table></code></pre></figure>
<p>Note that REST is an architectural style, while SOAP is a protocol. REST allows using any format (HTML, JSON, XML, plain text, …), while SOAP explicitly only allows XML.</p>
<p>In the days of web apps (instead of web services), the idea of WSDL may be a little outdated. WADL (Web <strong>Application</strong> Description Language) may be an answer to WSDL: it is a more concise language, with support for Relax NG, but it also has a slightly different goal. <a href="https://www.openapis.org/">Open API</a> is another contender in this field, specifying a way to describe web services in JSON or YAML, with documentation in Markdown.</p>
<h3 id="universal-description-discovery-and-integration-uddi">Universal Description, Discovery and Integration (UDDI)</h3>
<p>UDDI defines a standard method for <strong>publishing</strong> and <strong>discovering</strong> the software components of a service-oriented architecture. This mechanism still exists, although it never had the success people had been hoping for. Nowadays, it’s mostly just used internally for some XML based applications, instead of in public UDDI repositories of components.</p>
<div class="footnotes">
<ol>
<li id="fn:infoset-spec">
<p><a href="https://www.w3.org/TR/xml-infoset/">XML Information Set specification</a>, W3C Recommendation <a href="#fnref:infoset-spec" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
⚠ Work in progressCS-443 Machine Learning2018-09-18T00:00:00+00:002018-09-18T00:00:00+00:00https://kjaer.io/ml
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<p>The course follows a few books:</p>
<ul>
<li>Christopher Bishop, <a href="https://www.springer.com/us/book/9780387310732">Pattern Recognition and Machine Learning</a></li>
<li>Kevin Patrick Murphy, <a href="https://www.cs.ubc.ca/~murphyk/MLbook/">Machine Learning: a Probabilistic Perspective</a></li>
<li>Michael Nielsen, <a href="http://neuralnetworksanddeeplearning.com/">Neural Networks and Deep Learning</a></li>
</ul>
<p>The repository for code labs and lecture notes is <a href="https://github.com/epfml/ML_course">on GitHub</a>. A useful website for this course is <a href="http://www.matrixcalculus.org/">matrixcalculus.org</a>.</p>
<!-- More -->
<ul id="markdown-toc">
<li><a href="#linear-regression" id="markdown-toc-linear-regression">Linear regression</a> <ul>
<li><a href="#simple-linear-regression" id="markdown-toc-simple-linear-regression">Simple linear regression</a></li>
<li><a href="#multiple-linear-regression" id="markdown-toc-multiple-linear-regression">Multiple linear regression</a></li>
<li><a href="#the-d--n-problem" id="markdown-toc-the-d--n-problem">The $D > N$ problem</a></li>
</ul>
</li>
<li><a href="#cost-functions" id="markdown-toc-cost-functions">Cost functions</a> <ul>
<li><a href="#properties" id="markdown-toc-properties">Properties</a></li>
<li><a href="#good-cost-functions" id="markdown-toc-good-cost-functions">Good cost functions</a> <ul>
<li><a href="#mse" id="markdown-toc-mse">MSE</a></li>
<li><a href="#mae" id="markdown-toc-mae">MAE</a></li>
</ul>
</li>
<li><a href="#convexity" id="markdown-toc-convexity">Convexity</a></li>
</ul>
</li>
<li><a href="#optimization" id="markdown-toc-optimization">Optimization</a> <ul>
<li><a href="#learning--estimation--fitting" id="markdown-toc-learning--estimation--fitting">Learning / Estimation / Fitting</a></li>
<li><a href="#grid-search" id="markdown-toc-grid-search">Grid search</a></li>
<li><a href="#optimization-landscapes" id="markdown-toc-optimization-landscapes">Optimization landscapes</a> <ul>
<li><a href="#local-minimum" id="markdown-toc-local-minimum">Local minimum</a></li>
<li><a href="#global-minimum" id="markdown-toc-global-minimum">Global minimum</a></li>
<li><a href="#strict-minimum" id="markdown-toc-strict-minimum">Strict minimum</a></li>
</ul>
</li>
<li><a href="#smooth-differentiable-optimization" id="markdown-toc-smooth-differentiable-optimization">Smooth (differentiable) optimization</a> <ul>
<li><a href="#gradient" id="markdown-toc-gradient">Gradient</a></li>
<li><a href="#gradient-descent" id="markdown-toc-gradient-descent">Gradient descent</a></li>
<li><a href="#gradient-descent-for-linear-mse" id="markdown-toc-gradient-descent-for-linear-mse">Gradient descent for linear MSE</a></li>
<li><a href="#stochastic-gradient-descent-sgd" id="markdown-toc-stochastic-gradient-descent-sgd">Stochastic gradient descent (SGD)</a></li>
<li><a href="#mini-batch-sgd" id="markdown-toc-mini-batch-sgd">Mini-batch SGD</a></li>
</ul>
</li>
<li><a href="#non-smooth-non-differentiable-optimization" id="markdown-toc-non-smooth-non-differentiable-optimization">Non-smooth (non-differentiable) optimization</a> <ul>
<li><a href="#subgradients" id="markdown-toc-subgradients">Subgradients</a></li>
<li><a href="#subgradient-descent" id="markdown-toc-subgradient-descent">Subgradient descent</a></li>
<li><a href="#stochastic-subgradient-descent" id="markdown-toc-stochastic-subgradient-descent">Stochastic subgradient descent</a></li>
</ul>
</li>
<li><a href="#comparison" id="markdown-toc-comparison">Comparison</a></li>
<li><a href="#constrained-optimization" id="markdown-toc-constrained-optimization">Constrained optimization</a> <ul>
<li><a href="#convex-sets" id="markdown-toc-convex-sets">Convex sets</a></li>
<li><a href="#projected-gradient-descent" id="markdown-toc-projected-gradient-descent">Projected gradient descent</a></li>
<li><a href="#turning-constrained-problems-into-unconstrained-problems" id="markdown-toc-turning-constrained-problems-into-unconstrained-problems">Turning constrained problems into unconstrained problems</a></li>
</ul>
</li>
<li><a href="#implementation-issues-in-gradient-methods" id="markdown-toc-implementation-issues-in-gradient-methods">Implementation issues in gradient methods</a> <ul>
<li><a href="#stopping-criteria" id="markdown-toc-stopping-criteria">Stopping criteria</a></li>
<li><a href="#optimality" id="markdown-toc-optimality">Optimality</a></li>
<li><a href="#step-size" id="markdown-toc-step-size">Step size</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#least-squares" id="markdown-toc-least-squares">Least squares</a> <ul>
<li><a href="#normal-equations" id="markdown-toc-normal-equations">Normal equations</a></li>
<li><a href="#single-parameter-linear-regression" id="markdown-toc-single-parameter-linear-regression">Single parameter linear regression</a></li>
<li><a href="#multiple-parameter-linear-regression" id="markdown-toc-multiple-parameter-linear-regression">Multiple parameter linear regression</a> <ul>
<li><a href="#simplest-way" id="markdown-toc-simplest-way">Simplest way</a></li>
<li><a href="#directly-verify-the-definition" id="markdown-toc-directly-verify-the-definition">Directly verify the definition</a></li>
<li><a href="#compute-the-hessian" id="markdown-toc-compute-the-hessian">Compute the Hessian</a></li>
</ul>
</li>
<li><a href="#geometric-interpretation" id="markdown-toc-geometric-interpretation">Geometric interpretation</a></li>
<li><a href="#closed-form" id="markdown-toc-closed-form">Closed form</a></li>
<li><a href="#invertibility-and-uniqueness" id="markdown-toc-invertibility-and-uniqueness">Invertibility and uniqueness</a></li>
</ul>
</li>
<li><a href="#maximum-likelihood" id="markdown-toc-maximum-likelihood">Maximum likelihood</a> <ul>
<li><a href="#gaussian-distribution" id="markdown-toc-gaussian-distribution">Gaussian distribution</a></li>
<li><a href="#a-probabilistic-model-for-least-squares" id="markdown-toc-a-probabilistic-model-for-least-squares">A probabilistic model for least squares</a></li>
<li><a href="#defining-cost-with-log-likelihood" id="markdown-toc-defining-cost-with-log-likelihood">Defining cost with log-likelihood</a></li>
<li><a href="#maximum-likelihood-estimator-mle" id="markdown-toc-maximum-likelihood-estimator-mle">Maximum likelihood estimator (MLE)</a> <ul>
<li><a href="#properties-of-mle" id="markdown-toc-properties-of-mle">Properties of MLE</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#overfitting-and-underfitting" id="markdown-toc-overfitting-and-underfitting">Overfitting and underfitting</a> <ul>
<li><a href="#underfitting-with-linear-models" id="markdown-toc-underfitting-with-linear-models">Underfitting with linear models</a></li>
<li><a href="#extended-feature-vectors" id="markdown-toc-extended-feature-vectors">Extended feature vectors</a></li>
<li><a href="#reducing-overfitting" id="markdown-toc-reducing-overfitting">Reducing overfitting</a></li>
</ul>
</li>
<li><a href="#regularization" id="markdown-toc-regularization">Regularization</a> <ul>
<li><a href="#l_2-regularization-ridge-regression" id="markdown-toc-l_2-regularization-ridge-regression">$L_2$-Regularization: Ridge Regression</a> <ul>
<li><a href="#ridge-regression" id="markdown-toc-ridge-regression">Ridge regression</a></li>
<li><a href="#ridge-regression-to-fight-ill-conditioning" id="markdown-toc-ridge-regression-to-fight-ill-conditioning">Ridge regression to fight ill-conditioning</a></li>
</ul>
</li>
<li><a href="#l_1-regularization-the-lasso" id="markdown-toc-l_1-regularization-the-lasso">$L_1$-Regularization: The Lasso</a></li>
</ul>
</li>
<li><a href="#model-selection" id="markdown-toc-model-selection">Model selection</a> <ul>
<li><a href="#probabilistic-setup" id="markdown-toc-probabilistic-setup">Probabilistic setup</a></li>
<li><a href="#training-error-vs-generalization-error" id="markdown-toc-training-error-vs-generalization-error">Training Error vs. Generalization Error</a></li>
<li><a href="#splitting-the-data" id="markdown-toc-splitting-the-data">Splitting the data</a></li>
<li><a href="#generalization-error-vs-test-error" id="markdown-toc-generalization-error-vs-test-error">Generalization error vs test error</a></li>
<li><a href="#method-and-criteria-for-model-selection" id="markdown-toc-method-and-criteria-for-model-selection">Method and criteria for model selection</a> <ul>
<li><a href="#grid-search-on-hyperparameters" id="markdown-toc-grid-search-on-hyperparameters">Grid search on hyperparameters</a></li>
<li><a href="#model-selection-based-on-test-error" id="markdown-toc-model-selection-based-on-test-error">Model selection based on test error</a></li>
</ul>
</li>
<li><a href="#cross-validation" id="markdown-toc-cross-validation">Cross-validation</a></li>
<li><a href="#bias-variance-decomposition" id="markdown-toc-bias-variance-decomposition">Bias-Variance decomposition</a> <ul>
<li><a href="#data-generation-model" id="markdown-toc-data-generation-model">Data generation model</a></li>
<li><a href="#error-decomposition" id="markdown-toc-error-decomposition">Error Decomposition</a></li>
<li><a href="#interpretation-of-the-decomposition" id="markdown-toc-interpretation-of-the-decomposition">Interpretation of the decomposition</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#classification" id="markdown-toc-classification">Classification</a> <ul>
<li><a href="#linear-classifier" id="markdown-toc-linear-classifier">Linear classifier</a></li>
<li><a href="#is-classification-a-special-case-of-regression" id="markdown-toc-is-classification-a-special-case-of-regression">Is classification a special case of regression?</a></li>
<li><a href="#nearest-neighbor" id="markdown-toc-nearest-neighbor">Nearest neighbor</a></li>
<li><a href="#linear-decision-boundaries" id="markdown-toc-linear-decision-boundaries">Linear decision boundaries</a></li>
<li><a href="#optimal-classification-for-a-known-generating-model" id="markdown-toc-optimal-classification-for-a-known-generating-model">Optimal classification for a known generating model</a></li>
</ul>
</li>
<li><a href="#logistic-regression" id="markdown-toc-logistic-regression">Logistic regression</a> <ul>
<li><a href="#training" id="markdown-toc-training">Training</a></li>
<li><a href="#conditions-of-optimality" id="markdown-toc-conditions-of-optimality">Conditions of optimality</a></li>
<li><a href="#gradient-descent-1" id="markdown-toc-gradient-descent-1">Gradient descent</a></li>
<li><a href="#newtons-method" id="markdown-toc-newtons-method">Newton’s method</a> <ul>
<li><a href="#hessian-of-the-cost" id="markdown-toc-hessian-of-the-cost">Hessian of the cost</a></li>
<li><a href="#closed-form-for-newtons-method" id="markdown-toc-closed-form-for-newtons-method">Closed form for Newton’s method</a></li>
</ul>
</li>
<li><a href="#regularized-logistic-regression" id="markdown-toc-regularized-logistic-regression">Regularized logistic regression</a></li>
</ul>
</li>
<li><a href="#generalized-linear-models" id="markdown-toc-generalized-linear-models">Generalized Linear Models</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#exponential-family" id="markdown-toc-exponential-family">Exponential family</a> <ul>
<li><a href="#link-function" id="markdown-toc-link-function">Link function</a></li>
<li><a href="#example-bernoulli" id="markdown-toc-example-bernoulli">Example: Bernoulli</a></li>
<li><a href="#example-poisson" id="markdown-toc-example-poisson">Example: Poisson</a></li>
<li><a href="#example-gaussian" id="markdown-toc-example-gaussian">Example: Gaussian</a></li>
<li><a href="#properties-1" id="markdown-toc-properties-1">Properties</a></li>
</ul>
</li>
<li><a href="#application-in-ml" id="markdown-toc-application-in-ml">Application in ML</a> <ul>
<li><a href="#maximum-likelihood-parameter-estimation" id="markdown-toc-maximum-likelihood-parameter-estimation">Maximum Likelihood Parameter Estimation</a></li>
<li><a href="#conditions-of-optimality-1" id="markdown-toc-conditions-of-optimality-1">Conditions of optimality</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#nearest-neighbor-classifiers-and-the-curse-of-dimensionality" id="markdown-toc-nearest-neighbor-classifiers-and-the-curse-of-dimensionality">Nearest neighbor classifiers and the curse of dimensionality</a> <ul>
<li><a href="#k-nearest-neighbor-knn" id="markdown-toc-k-nearest-neighbor-knn">K Nearest Neighbor (KNN)</a></li>
<li><a href="#analysis" id="markdown-toc-analysis">Analysis</a></li>
</ul>
</li>
<li><a href="#support-vector-machines" id="markdown-toc-support-vector-machines">Support Vector Machines</a> <ul>
<li><a href="#definition" id="markdown-toc-definition">Definition</a></li>
<li><a href="#alternative-formulation-duality" id="markdown-toc-alternative-formulation-duality">Alternative formulation: Duality</a> <ul>
<li><a href="#how-do-we-find-a-suitable-function-g" id="markdown-toc-how-do-we-find-a-suitable-function-g">How do we find a suitable function G?</a></li>
<li><a href="#when-is-it-ok-to-switch-min-and-max" id="markdown-toc-when-is-it-ok-to-switch-min-and-max">When is it OK to switch min and max?</a></li>
<li><a href="#when-is-the-dual-easier-to-optimize-than-the-primal" id="markdown-toc-when-is-the-dual-easier-to-optimize-than-the-primal">When is the dual easier to optimize than the primal?</a></li>
</ul>
</li>
<li><a href="#kernel-trick" id="markdown-toc-kernel-trick">Kernel trick</a> <ul>
<li><a href="#alternative-formulation-of-ridge-regression" id="markdown-toc-alternative-formulation-of-ridge-regression">Alternative formulation of ridge regression</a></li>
<li><a href="#representer-theorem" id="markdown-toc-representer-theorem">Representer theorem</a></li>
<li><a href="#kernelized-ridge-regression" id="markdown-toc-kernelized-ridge-regression">Kernelized ridge regression</a></li>
<li><a href="#kernel-functions" id="markdown-toc-kernel-functions">Kernel functions</a> <ul>
<li><a href="#trivial-kernels" id="markdown-toc-trivial-kernels">Trivial kernels</a></li>
<li><a href="#polynomial-kernel" id="markdown-toc-polynomial-kernel">Polynomial kernel</a></li>
<li><a href="#radial-basis-function-kernel" id="markdown-toc-radial-basis-function-kernel">Radial basis function kernel</a></li>
<li><a href="#new-kernel-functions-from-old-ones" id="markdown-toc-new-kernel-functions-from-old-ones">New kernel functions from old ones</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#classifying-with-the-kernel" id="markdown-toc-classifying-with-the-kernel">Classifying with the kernel</a></li>
<li><a href="#properties-of-kernels" id="markdown-toc-properties-of-kernels">Properties of kernels</a></li>
</ul>
</li>
<li><a href="#unsupervised-learning" id="markdown-toc-unsupervised-learning">Unsupervised learning</a> <ul>
<li><a href="#k-means" id="markdown-toc-k-means">K-Means</a> <ul>
<li><a href="#coordinate-descent-interpretation" id="markdown-toc-coordinate-descent-interpretation">Coordinate descent interpretation</a></li>
<li><a href="#matrix-factorization-interpretation" id="markdown-toc-matrix-factorization-interpretation">Matrix factorization interpretation</a></li>
<li><a href="#probabilistic-interpretation" id="markdown-toc-probabilistic-interpretation">Probabilistic interpretation</a></li>
<li><a href="#issues-with-k-means" id="markdown-toc-issues-with-k-means">Issues with K-means</a></li>
</ul>
</li>
<li><a href="#gaussian-mixture-model-gmm" id="markdown-toc-gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</a> <ul>
<li><a href="#clustering-with-gaussians" id="markdown-toc-clustering-with-gaussians">Clustering with Gaussians</a></li>
<li><a href="#soft-clustering" id="markdown-toc-soft-clustering">Soft clustering</a></li>
<li><a href="#likelihood" id="markdown-toc-likelihood">Likelihood</a></li>
<li><a href="#marginal-likelihood" id="markdown-toc-marginal-likelihood">Marginal likelihood</a></li>
<li><a href="#maximum-likelihood-1" id="markdown-toc-maximum-likelihood-1">Maximum likelihood</a></li>
</ul>
</li>
<li><a href="#em-algorithm" id="markdown-toc-em-algorithm">EM algorithm</a> <ul>
<li><a href="#expectation-step" id="markdown-toc-expectation-step">Expectation step</a></li>
<li><a href="#maximization-step" id="markdown-toc-maximization-step">Maximization step</a></li>
<li><a href="#interpretation" id="markdown-toc-interpretation">Interpretation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#matrix-factorization" id="markdown-toc-matrix-factorization">Matrix Factorization</a> <ul>
<li><a href="#prediction-using-a-matrix-factorization" id="markdown-toc-prediction-using-a-matrix-factorization">Prediction using a matrix factorization</a></li>
<li><a href="#choosing-k" id="markdown-toc-choosing-k">Choosing K</a></li>
<li><a href="#regularization-1" id="markdown-toc-regularization-1">Regularization</a></li>
<li><a href="#stochastic-gradient-descent" id="markdown-toc-stochastic-gradient-descent">Stochastic gradient descent</a></li>
<li><a href="#alternating-least-squares-als" id="markdown-toc-alternating-least-squares-als">Alternating least squares (ALS)</a> <ul>
<li><a href="#no-missing-entries" id="markdown-toc-no-missing-entries">No missing entries</a></li>
<li><a href="#missing-entries" id="markdown-toc-missing-entries">Missing entries</a></li>
</ul>
</li>
<li><a href="#text-representation-learning" id="markdown-toc-text-representation-learning">Text representation learning</a> <ul>
<li><a href="#co-occurrence-matrix" id="markdown-toc-co-occurrence-matrix">Co-occurrence matrix</a></li>
<li><a href="#motivation-1" id="markdown-toc-motivation-1">Motivation</a></li>
<li><a href="#bag-of-words" id="markdown-toc-bag-of-words">Bag of words</a></li>
<li><a href="#word2vec" id="markdown-toc-word2vec">Word2vec</a></li>
<li><a href="#glove" id="markdown-toc-glove">GloVe</a></li>
<li><a href="#fasttext" id="markdown-toc-fasttext">FastText</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#svd-and-pca" id="markdown-toc-svd-and-pca">SVD and PCA</a> <ul>
<li><a href="#motivation-2" id="markdown-toc-motivation-2">Motivation</a></li>
<li><a href="#svd" id="markdown-toc-svd">SVD</a></li>
<li><a href="#svd-and-dimensionality-reduction" id="markdown-toc-svd-and-dimensionality-reduction">SVD and dimensionality reduction</a> <ul>
<li><a href="#svd-and-matrix-factorization" id="markdown-toc-svd-and-matrix-factorization">SVD and matrix factorization</a></li>
</ul>
</li>
<li><a href="#pca-and-decorrelation" id="markdown-toc-pca-and-decorrelation">PCA and decorrelation</a></li>
<li><a href="#computing-the-svd-efficiently" id="markdown-toc-computing-the-svd-efficiently">Computing the SVD efficiently</a></li>
<li><a href="#pitfalls-of-pca" id="markdown-toc-pitfalls-of-pca">Pitfalls of PCA</a></li>
</ul>
</li>
<li><a href="#neural-networks" id="markdown-toc-neural-networks">Neural Networks</a> <ul>
<li><a href="#motivation-3" id="markdown-toc-motivation-3">Motivation</a></li>
<li><a href="#structure" id="markdown-toc-structure">Structure</a></li>
<li><a href="#how-powerful-are-neural-nets" id="markdown-toc-how-powerful-are-neural-nets">How powerful are neural nets?</a></li>
<li><a href="#approximation-in-average" id="markdown-toc-approximation-in-average">Approximation in average</a> <ul>
<li><a href="#other-activation-functions" id="markdown-toc-other-activation-functions">Other activation functions</a></li>
</ul>
</li>
<li><a href="#popular-activation-functions" id="markdown-toc-popular-activation-functions">Popular activation functions</a> <ul>
<li><a href="#sigmoid" id="markdown-toc-sigmoid">Sigmoid</a></li>
<li><a href="#tanh" id="markdown-toc-tanh">Tanh</a></li>
<li><a href="#relu" id="markdown-toc-relu">ReLU</a></li>
<li><a href="#leaky-relu" id="markdown-toc-leaky-relu">Leaky ReLU</a></li>
<li><a href="#maxout" id="markdown-toc-maxout">Maxout</a></li>
</ul>
</li>
<li><a href="#sgd-and-backpropagation" id="markdown-toc-sgd-and-backpropagation">SGD and Backpropagation</a></li>
<li><a href="#regularization-2" id="markdown-toc-regularization-2">Regularization</a></li>
<li><a href="#dataset-augmentation" id="markdown-toc-dataset-augmentation">Dataset augmentation</a></li>
<li><a href="#dropout" id="markdown-toc-dropout">Dropout</a></li>
<li><a href="#convolutional-nets" id="markdown-toc-convolutional-nets">Convolutional nets</a> <ul>
<li><a href="#structure-1" id="markdown-toc-structure-1">Structure</a></li>
<li><a href="#padding" id="markdown-toc-padding">Padding</a></li>
<li><a href="#channels" id="markdown-toc-channels">Channels</a></li>
<li><a href="#training-1" id="markdown-toc-training-1">Training</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#bayes-nets" id="markdown-toc-bayes-nets">Bayes Nets</a> <ul>
<li><a href="#from-distribution-to-graphs" id="markdown-toc-from-distribution-to-graphs">From distribution to graphs</a></li>
<li><a href="#cyclic-graphs" id="markdown-toc-cyclic-graphs">Cyclic graphs</a></li>
<li><a href="#conditional-independence" id="markdown-toc-conditional-independence">Conditional independence</a> <ul>
<li><a href="#tail-to-tail" id="markdown-toc-tail-to-tail">Tail-to-tail</a></li>
<li><a href="#head-to-tail" id="markdown-toc-head-to-tail">Head-to-tail</a></li>
<li><a href="#head-to-head" id="markdown-toc-head-to-head">Head-to-head</a></li>
<li><a href="#d-separation" id="markdown-toc-d-separation">D-separation</a></li>
<li><a href="#examples" id="markdown-toc-examples">Examples</a></li>
</ul>
</li>
<li><a href="#markov-blankets" id="markdown-toc-markov-blankets">Markov blankets</a></li>
<li><a href="#sampling-and-marginalizing" id="markdown-toc-sampling-and-marginalizing">Sampling and marginalizing</a></li>
<li><a href="#factor-graphs" id="markdown-toc-factor-graphs">Factor graphs</a></li>
</ul>
</li>
</ul>
<p>In this course, we’ll always denote the dataset as an $N \times D$ matrix $\mathbf{X}$, where $N$ is the number of data points and $D$ is the dimensionality, or the number of features. We’ll always use subscript $n$ for data points, and $d$ for features. The labels, if any, are denoted in a $\mathbf{y}$ vector, and the weights are denoted by $\mathbf{w}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\newcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\frobnorm}[1]{\norm{#1}_{\text{Frob}}}
\newcommand{\expect}[1]{\mathbb{E}\left[#1\right]}
\newcommand{\expectsub}[2]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\cost}[1]{\mathcal{L}\left(#1\right)}
\newcommand{\normal}[1]{\mathcal{N}\left(#1\right)}
\newcommand{\diff}[2]{\frac{\partial #1}{\partial #2}}
\newcommand{\difftwo}[3]{\frac{\partial^2 #1}{\partial #2 \partial #3}}
\newcommand{\Strain}{S_{\text{train}}}
\newcommand{\Stest}{S_{\text{test}}}
\DeclareMathOperator*{\argmax}{\arg\!\max}
\DeclareMathOperator*{\argmin}{\arg\!\min}
\vec{w}=\begin{bmatrix}
w_1 \\ w_2 \\ \vdots \\ w_D
\end{bmatrix},
\quad
\vec{y}=\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix},
\quad
\vec{X}=\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1D} \\
x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{ND} \\
\end{bmatrix} %]]></script>
<p>Vectors are denoted in bold and lowercase (e.g. $\vec{y}$ or $\vec{x}_n$), and matrices are bold and uppercase (e.g. $\vec{X}$). Scalars and functions are in normal font weight<sup id="fnref:here-be-dragons"><a href="#fn:here-be-dragons" class="footnote">1</a></sup>.</p>
<h2 id="linear-regression">Linear regression</h2>
<p>A linear regression is a model that assumes a linear relationship between inputs and the output. We will study three types of methods:</p>
<ol>
<li>Grid search</li>
<li>Iterative optimization algorithms</li>
<li>Least squares</li>
</ol>
<h3 id="simple-linear-regression">Simple linear regression</h3>
<p>For a single input dimension ($D=1$), we can use a simple linear regression, which is given by:</p>
<script type="math/tex; mode=display">y_n \approx f(x_n) := w_0 + w_1 x_{n1}</script>
<p>$\vec{w} = (w_0, w_1)$ are the parameters of the model.</p>
<h3 id="multiple-linear-regression">Multiple linear regression</h3>
<p>If our data has multiple input dimensions, we obtain multivariate linear regression:</p>
<script type="math/tex; mode=display">y_n \approx
f(\vec{x}_n) := w_0 + w_1 x_{n1} + \dots + w_D x_{nD}
= w_0 + \vec{x}_n^T \begin{bmatrix}
w_1 \\
\vdots \\
w_D \\
\end{bmatrix}
= \tilde{\vec{x}}_n^T \tilde{\vec{w}}</script>
<blockquote>
<p>👉 If we wanted to be a little more strict, we should write $f_{\vec{w}}(\vec{x}_n)$, as the model of course also depends on the weights.</p>
</blockquote>
<p>The tilde notation means that we have included the offset term $w_0$, also known as the <strong>bias</strong>:</p>
<script type="math/tex; mode=display">\tilde{\vec{x}}_n=\begin{bmatrix}1 \\ x_{n1} \\ \vdots \\ x_{nD} \end{bmatrix} \in \mathbb{R}^{D+1},
\quad
\tilde{\vec{w}} = \begin{bmatrix}w_0 \\ w_1 \\ \vdots \\ w_D\end{bmatrix} \in \mathbb{R}^{D+1}</script>
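<p>To make the tilde notation concrete, here is a minimal NumPy sketch (the toy values and variable names are ours, purely illustrative) that prepends a column of ones to the data so that the bias $w_0$ is handled by the same dot product as the other weights:</p>

```python
import numpy as np

# Toy data: N = 3 points, D = 2 features (illustrative values only)
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [4.0, 3.0]])
w_tilde = np.array([0.5, 2.0, -1.0])  # [w_0, w_1, w_2]: bias first

# Prepend a column of ones so the bias rides along in the dot product
X_tilde = np.hstack([np.ones((X.shape[0], 1)), X])

# f(x_n) = x̃_n^T w̃, computed for every n at once
y_hat = X_tilde @ w_tilde
```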
<h3 id="the-d--n-problem">The $D > N$ problem</h3>
<p>If the number of parameters exceeds the number of data examples, we say that the task is <em>under-determined</em>. This can be solved by regularization, which we’ll get to more precisely later.</p>
<h2 id="cost-functions">Cost functions</h2>
<p>$\vec{x}_n$ is the data, and where it comes from is clear enough. But how does one find a good $\vec{w}$ from the data?</p>
<p>A <strong>cost function</strong> (also called a loss function) is used to learn parameters that explain the data well. It quantifies how well our model does by assigning a penalty score to its errors. Our goal is to find parameters that minimize the loss function.</p>
<h3 id="properties">Properties</h3>
<p>Desirable properties of cost functions are:</p>
<ul>
<li><strong>Symmetry around 0</strong>: that is, being off by a positive or negative amount is equivalent; what matters is the amplitude of the error, not the sign.</li>
<li><strong>Robustness</strong>: penalizes large errors at about the same rate as very large errors. This is a way to make sure that outliers don’t completely dominate our regression.</li>
</ul>
<h3 id="good-cost-functions">Good cost functions</h3>
<h4 id="mse">MSE</h4>
<p>Probably the most commonly used cost function is Mean Square Error (MSE):</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MSE}}(\vec{w}) := \frac{1}{N} \sum_{n=1}^N \left(y_n - f(\vec{x}_n)\right)^2
\label{def:mse}</script>
<p>MSE is symmetrical around 0, but also tends to penalize outliers quite harshly (because it squares the error): MSE is not robust. In practice, this is problematic, because outliers occur more often than we’d like.</p>
<p>Note that we often use MSE with a factor $\frac{1}{2N}$ instead of $\frac{1}{N}$. This is because it makes for a cleaner derivative, but we’ll get into that later. Just know that for all intents and purposes, it doesn’t really change anything about the behavior of the models we’ll study.</p>
<h4 id="mae">MAE</h4>
<p>When outliers are present, Mean Absolute Error (MAE) tends to fare better:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MAE}}(\vec{w}) := \frac{1}{N} \sum_{n=1}^N \left| y_n - f(\vec{x}_n)\right|</script>
<p>Instead of squaring, we take the absolute value. This is more robust. Note that MAE isn’t differentiable at 0, but we’ll talk about that later.</p>
<p>There are other cost functions that are even more robust; these are available as additional reading, but are not exam material.</p>
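<p>The difference in robustness is easy to see numerically. The following sketch (with made-up toy values of our own) evaluates both costs on predictions that are good everywhere except for one outlier:</p>

```python
import numpy as np

def mse(y, y_hat):
    # Mean Square Error: squares each residual, so outliers dominate
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):
    # Mean Absolute Error: linear in each residual, hence more robust
    return np.mean(np.abs(y - y_hat))

y       = np.array([1.0, 2.0, 3.0, 4.0])
good    = np.array([1.1, 1.9, 3.2, 4.0])   # small errors everywhere
outlier = np.array([1.1, 1.9, 3.2, 40.0])  # one wildly wrong prediction

mse_good, mse_bad = mse(y, good), mse(y, outlier)
mae_good, mae_bad = mae(y, good), mae(y, outlier)
```

<p>The single bad prediction blows up the MSE by orders of magnitude, while the MAE only grows linearly with the size of the error.</p>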
<h3 id="convexity">Convexity</h3>
<p>A function is <strong>convex</strong> iff a line joining two points never intersects with the function anywhere else. More strictly defined, a function $f(\vec{u})$ with $\vec{u}\in\chi$ is <em>convex</em> if, for any $\vec{u}, \vec{v} \in\chi$, and for any $0 \le\lambda\le 1$, we have:</p>
<script type="math/tex; mode=display">f(\lambda\vec{u}+(1-\lambda)\vec{v})\le\lambda f(\vec{u}) +(1-\lambda)f(\vec{v})</script>
<p>A function is <strong>strictly convex</strong> if the above inequality is strict ($<$) for $\vec{u} \ne \vec{v}$ and $0 < \lambda < 1$. This inequality is known as <em>Jensen’s inequality</em>.</p>
<p>A strictly convex function has a unique global minimum $\vec{w}^*$. For convex functions, every local minimum is a global minimum. This makes convexity a desirable property for loss functions, since it means that optimizing the cost function is guaranteed to find the global minimum.</p>
<p>Linear (and affine) functions are convex, sums of convex functions are convex, and composing a convex function with an affine one preserves convexity. Since $x^2$ and $\abs{x}$ are convex, it follows that MSE and MAE are convex.</p>
<p>We’ll see another way of characterizing convexity for differentiable functions <a href="#non-smooth-non-differentiable-optimization">later in the course</a>.</p>
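<p>As a quick sanity check (our own illustration, not part of the course), we can numerically spot-check Jensen’s inequality for a convex function such as the squared Euclidean norm:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def f(u):
    # A convex function: the squared Euclidean norm
    return float(np.sum(u ** 2))

# Spot-check Jensen's inequality at random pairs of points and random lambdas
violations = 0
for _ in range(1000):
    u, v = rng.normal(size=3), rng.normal(size=3)
    lam = rng.uniform()
    lhs = f(lam * u + (1 - lam) * v)
    rhs = lam * f(u) + (1 - lam) * f(v)
    violations += lhs > rhs + 1e-9  # small slack for floating-point error
```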
<h2 id="optimization">Optimization</h2>
<h3 id="learning--estimation--fitting">Learning / Estimation / Fitting</h3>
<p>Given a cost function (or loss function) $\cost{\vec{w}}$, we wish to find $\vec{w}^*$ which minimizes the cost:</p>
<script type="math/tex; mode=display">\min_{\vec{w}}{\cost{\vec{w}}}, \quad\text{ subject to } \vec{w} \in \mathbb{R}^D</script>
<p>This is what we call <strong>learning</strong>: learning is simply an optimization problem, and as such, we’ll use an optimization algorithm to solve it – that is, find a good $\vec{w}$.</p>
<h3 id="grid-search">Grid search</h3>
<p>This is one of the simplest optimization algorithms, although far from being the most efficient one. It can be described as “try all the values”, a kind of brute-force algorithm; you can think of it as nested for-loops over the individual $w_i$ weights.</p>
<p>For instance, if our weights are $\vec{w} = (w_1, w_2)$, then we can try, say, 4 values for $w_1$ and 4 values for $w_2$, for a total of 16 evaluations of $\mathcal{L}(\vec{w})$.</p>
<p>But obviously, the complexity is exponential, $\mathcal{O}(a^D)$ (where $a$ is the number of values to try per parameter), which is really bad, especially since $D$ can be in the millions. Additionally, grid search has no guarantee of finding an optimum; it just returns the best value among those it tried.</p>
<p>If grid search sounds bad for optimization, that’s because it is. In practice, it is not used for optimization of parameters, but it <em>is</em> used to tune hyperparameters.</p>
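<p>As an illustration (with a tiny toy dataset of our own), grid search for a two-weight linear model really is just nested loops over candidate values:</p>

```python
import numpy as np

# Tiny 1-feature dataset with a bias column; labels generated by w = (1, 2)
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
y = np.array([1.0, 3.0, 5.0])

def mse(w):
    e = y - X @ w
    return float(np.mean(e ** 2))

# Nested loops over candidate values for each weight: O(a^D) evaluations
candidates = np.linspace(-3.0, 3.0, 25)   # a = 25 values per weight
best = min(
    ((w0, w1) for w0 in candidates for w1 in candidates),
    key=lambda w: mse(np.array(w)),
)
```

<p>Here the true weights happen to lie on the grid, so grid search recovers them exactly; in general it only finds the best point among those tried.</p>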
<h3 id="optimization-landscapes">Optimization landscapes</h3>
<h4 id="local-minimum">Local minimum</h4>
<p>A vector $\vec{w}^*$ is a <em>local minimum</em> of a function $\mathcal{L}$ (we state this for cost functions, whose minimum we denote by $\vec{w}^*$, but it holds for any function) if $\exists \epsilon > 0$ such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}(\vec{w}^*) \le \mathcal{L}(\vec{w}), \quad \forall \vec{w} : \norm{\vec{w} -\vec{w}^*} < \epsilon %]]></script>
<p>In other words, the local minimum $\vec{w}^*$ is better than all the neighbors in some non-zero radius.</p>
<h4 id="global-minimum">Global minimum</h4>
<p>The global minimum $\vec{w}^*$ is defined by getting rid of the radius $\epsilon$ and comparing to all other values:</p>
<script type="math/tex; mode=display">\cost{\vec{w}^*} \le \cost{\vec{w}}, \qquad \forall\vec{w}\in\mathbb{R}^D</script>
<h4 id="strict-minimum">Strict minimum</h4>
<p>A minimum is said to be <strong>strict</strong> if the corresponding inequality is strict for $\vec{w} \ne \vec{w}^*$, that is, there is only one minimum value.</p>
<script type="math/tex; mode=display">% <![CDATA[
\cost{\vec{w}^*} < \cost{\vec{w}}, \qquad \forall\vec{w}\in\mathbb{R}^D\setminus\set{\vec{w}^*} %]]></script>
<h3 id="smooth-differentiable-optimization">Smooth (differentiable) optimization</h3>
<h4 id="gradient">Gradient</h4>
<p>The gradient at a given point gives the slope of the tangent to the function at that point, and it points in the direction of the function’s largest increase. By following the gradient in the opposite direction (because we’re searching for a minimum, not a maximum), we can find the minimum.</p>
<p><img src="/images/ml/mse-mae.png" alt="Graphs of MSE and MAE" /></p>
<p>Gradient is defined by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\nabla \mathcal{L}(\vec{w}) := \begin{bmatrix}
\diff{\cost{\vec{w}}}{w_1} &
\diff{\cost{\vec{w}}}{w_2} &
\cdots &
\diff{\cost{\vec{w}}}{w_D} \\
\end{bmatrix}^T %]]></script>
<p>This is a vector, i.e. $\nabla\cost{\vec{w}}\in\mathbb R^D$. Each entry $d$ of the vector indicates how fast the cost $\mathcal{L}$ changes depending on the weight $w_d$.</p>
<h4 id="gradient-descent">Gradient descent</h4>
<p>Gradient descent is an iterative algorithm. We start from an initial candidate $\vec{w}^{(0)}$, and iterate:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)}:=\vec{w}^{(t)} - \gamma \nabla\mathcal{L}\left(\vec{w}^{(t)}\right)</script>
<p>As stated previously, we’re adding the negative gradient to find the minimum, hence the subtraction.</p>
<p>$\gamma$ is known as the <strong>step-size</strong>, which is a small value (maybe 0.1). You don’t want to be too aggressive with it, or you might risk overshooting in your descent. In practice, the step-size that makes the learning as fast as possible is often found by trial and error 🤷🏼‍♂️.</p>
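<p>The update rule translates directly into a loop. Here is a minimal sketch (our own, with an arbitrary toy objective, not part of the course) of gradient descent on $\cost{\vec{w}} = \frac{1}{2}\norm{\vec{w}}^2$, whose gradient is simply $\vec{w}$:</p>

```python
import numpy as np

def gradient_descent(grad, w0, gamma=0.1, num_iters=200):
    # Repeatedly apply the update w ← w − γ ∇L(w)
    w = np.asarray(w0, dtype=float)
    for _ in range(num_iters):
        w = w - gamma * grad(w)
    return w

# L(w) = ||w||²/2 has gradient w, and its minimum is at w = 0
w_star = gradient_descent(lambda w: w, w0=[4.0, -2.0])
```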
<p>As an example, we will take an analytical look at a gradient descent, in order to understand its behavior and components. We will do gradient descent on a 1-parameter model ($D=1$ and $\vec{w} = [w_0]$), in which we minimize the MSE, which is defined as follows:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(w_0\right)=\frac{1}{2N}\sum_{n=1}^N{\left(y_n - w_0\right)^2}</script>
<p>Note that we’re dividing by 2 on top of the regular MSE; it has no impact on finding the minimum, but when we will compute the gradient below, it will conveniently cancel out the $\frac{1}{2}$.</p>
<p>The gradient of $\cost{w_0}$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\cost{w_0}
& = \frac{\partial}{\partial w_0}\cost{w_0} \\
& = \frac{1}{2N}\sum_{n=1}^N{-2(y_n - w_0)} \\
& = w_0 - \bar{y}
\end{align} %]]></script>
<p>Where $\bar{y}$ denotes the average of all $y_n$ values. And thus, our gradient descent is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
w_0^{(t+1)}
&:= w_0^{(t)} - \gamma\nabla\mathcal{L}\left(w_0^{(t)}\right) \\
& = w_0^{(t)} - \gamma(w_0^{(t)} - \bar{y}) \\
& = (1-\gamma)w_0^{(t)} + \gamma\bar{y},
\qquad\text{where } \bar{y}:=\sum_{n}{\frac{y_n}{N}}
\end{align} %]]></script>
<p>In this case, gradient descent has led us to the analytical solution of this exact problem. The sequence is guaranteed to converge to $\vec{w}^* = \bar{y}$<sup id="fnref:optimality-linear-mse"><a href="#fn:optimality-linear-mse" class="footnote">2</a></sup>, which sets the gradient to zero: we have reached the minimum.</p>
<p>The choice of $\gamma$ has an influence on the algorithm’s outcome:</p>
<ul>
<li>If we pick $\gamma=1$, we would get to the optimum in one step</li>
<li>If we pick $\gamma < 1$, we would get a little closer in every step, eventually converging to $\bar{y}$</li>
<li>If we pick $\gamma > 1$, we are going to overshoot $\bar{y}$. Slightly bigger than 1 (say, 1.5) would still converge; $\gamma=2$ would loop infinitely between two points; $\gamma > 2$ diverges.</li>
</ul>
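<p>These three regimes are easy to check numerically. Below is a minimal NumPy sketch of this 1-parameter descent on a small made-up data vector; the update is exactly $w_0^{(t+1)} = (1-\gamma)w_0^{(t)} + \gamma\bar{y}$ from above.</p>

```python
import numpy as np

# Made-up labels for illustration; the optimum is w0* = y_bar = 2.5.
y = np.array([1.0, 2.0, 3.0, 4.0])
y_bar = y.mean()

def descend(gamma, w0=0.0, steps=100):
    w = w0
    for _ in range(steps):
        grad = w - y_bar      # gradient of the (1/2N)-scaled MSE
        w = w - gamma * grad  # gradient descent step
    return w

print(descend(1.0, steps=1))  # gamma = 1: reaches y_bar in a single step
print(descend(0.1))           # gamma < 1: converges gradually to y_bar
print(descend(1.5))           # 1 < gamma < 2: oscillates but still converges
```

Trying $\gamma = 2$ in this sketch bounces between $w_0$ and $2\bar{y} - w_0$ forever, and $\gamma > 2$ blows up, as described above.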
<h4 id="gradient-descent-for-linear-mse">Gradient descent for linear MSE</h4>
<p>Our linear regression is given by a line $\vec{y}$ that is a regression for some data $\vec{X}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\vec{y}=\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix},
\quad
\vec{X}=\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1D} \\
x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{ND} \\
\end{bmatrix} %]]></script>
<p>We make predictions by multiplying the data by the weights, so our model is:</p>
<script type="math/tex; mode=display">f_{\vec{w}}(\vec{x}_n)=\vec{x}_n^T \vec{w}</script>
<p>We define the error vector by:</p>
<script type="math/tex; mode=display">\vec{e}=\vec{y} - \vec{Xw},
\quad \text{ or } \quad
e_n = y_n - \vec{x}_n^T\vec{w}</script>
<p>The MSE can then be restated as follows:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\vec{w}\right)
:= \frac{1}{2N}\sum_{n=1}^N{\left( y_n - \vec{x}_n^T \vec{w}\right)^2}
= \frac{1}{2N}\vec{e}^T\vec{e}</script>
<p>And the gradient is, component-wise:</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial\vec{w}_d} \cost{\vec{w}}
= -\frac{1}{2N} \sum_{n=1}^N {2(y_n - \vec{x}_n^T \vec{w}) x_{nd}}
= -\frac{1}{N} (\vec{X}_{:d})^T \vec{e}</script>
<p>We’re using column notation $\vec{X}_{:d}$ to signify column $d$ of the matrix $X$.</p>
<p>And thus, all in all, our gradient is:</p>
<script type="math/tex; mode=display">\nabla\cost{\vec{w}} = -\frac{1}{N}\vec{X}^T\vec{e}</script>
<p>To compute this expression, we must compute:</p>
<ul>
<li>The error $\vec{e}$, which takes $N(2D - 1)$ floating point operations (flops) for the matrix-vector multiplication, plus $N$ for the subtraction, for a total of $2N\cdot D$ flops, which is $\mathcal{O}(N\cdot D)$</li>
<li>The gradient $\nabla\mathcal{L}$, which similarly costs about $2N\cdot D$ flops, which is $\mathcal{O}(N\cdot D)$.</li>
</ul>
<p>In total, this process is $\mathcal{O}(N\cdot D)$ at every step. This is not too bad: it’s equivalent to the cost of reading the data once.</p>
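<p>As a quick illustration, here is a minimal NumPy sketch of full gradient descent for linear MSE, computing $\vec{e} = \vec{y} - \vec{Xw}$ and $\nabla\mathcal{L} = -\frac{1}{N}\vec{X}^T\vec{e}$ at every step. The data, step size, and iteration count are made-up choices.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 3
X = rng.normal(size=(N, D))          # data matrix
w_true = np.array([1.0, -2.0, 0.5])  # model used to generate noiseless labels
y = X @ w_true

def gradient(X, y, w):
    e = y - X @ w                # error vector, O(N*D)
    return -(X.T @ e) / len(y)   # gradient, also O(N*D)

w = np.zeros(D)
gamma = 0.1
for _ in range(500):
    w = w - gamma * gradient(X, y, w)
print(w)  # approaches w_true
```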
<h4 id="stochastic-gradient-descent-sgd">Stochastic gradient descent (SGD)</h4>
<p>In ML, most cost functions are formulated as a sum of:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\vec{w}\right) = \frac{1}{N}\sum_{n=1}^N{\mathcal{L}_n(\vec{w})}</script>
<p>In practice, this sum can be expensive to compute, so the solution is to sample a single training point $n\in\set{1, \dots, N}$ uniformly at random, which lets us drop the sum.</p>
<p>The stochastic gradient descent step is thus:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)}:=\vec{w}^{(t)} - \gamma \nabla\mathcal{L}_n\left({\vec{w}^{(t)}}\right)</script>
<p>Why is it allowed to pick just one $n$ instead of the full thing? We won’t give a full proof, but the intuition is that:</p>
<script type="math/tex; mode=display">\expect{\nabla\mathcal{L}_n(\vec{w})}
= \frac{1}{N} \sum_{n=1}^N{\nabla\mathcal{L}_n(\vec{w})}
= \nabla\left(\frac{1}{N} \sum_{n=1}^N{\mathcal{L}_n(\vec{w})}\right)
\equiv \nabla\mathcal{L}\left(\vec{w}\right)</script>
<p>For linear MSE, the cost and gradient of a single training point $n$ are:</p>
<script type="math/tex; mode=display">\mathcal{L}_n(\vec{w}) = \frac{1}{2} \left(y_n -\vec{x}_n^T \vec{w}\right)^2 \\
\nabla\mathcal{L}_n(\vec{w}) = -(y_n-\vec{x}_n^T \vec{w}) \, \vec{x}_n</script>
<p>Note that $\vec{x}_n \in\mathbb{R}^D$, and $e_n = (y_n-\vec{x}_n^T \vec{w})\in\mathbb{R}$. The computational complexity of this is $\mathcal{O}(D)$.</p>
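<p>A minimal sketch of this stochastic update for linear MSE, with made-up data: each step touches a single random sample, so the per-step cost is $\mathcal{O}(D)$ rather than $\mathcal{O}(N\cdot D)$.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 200, 3
X = rng.normal(size=(N, D))
w_true = np.array([2.0, -1.0, 0.5])
y = X @ w_true  # noiseless labels, for illustration

w = np.zeros(D)
gamma = 0.05
for _ in range(5000):
    n = rng.integers(N)         # sample one training point uniformly
    e_n = y[n] - X[n] @ w       # scalar error of that point
    w = w + gamma * e_n * X[n]  # step along -grad of L_n, i.e. e_n * x_n
print(w)  # close to w_true
```

With noisy labels and a fixed step size, SGD hovers around the optimum rather than converging exactly; here the labels are noiseless, so it settles down.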
<h4 id="mini-batch-sgd">Mini-batch SGD</h4>
<p>But perhaps just picking a <strong>single</strong> value is too extreme; there is an intermediate version in which we choose a subset $B\subseteq \set{1, \dots, N}$ instead of a single point.</p>
<script type="math/tex; mode=display">\vec{g} := \frac{1}{|B|}\sum_{n\in B}{\nabla\mathcal{L}_n(\vec{w}^{(t)})} \\
\vec{w}^{(t+1)} := \vec{w}^{(t)} - \gamma\vec{g}</script>
<p>Note that if $\abs{B} = N$ then we’re performing a full gradient descent.</p>
<p>The computation of $\vec{g}$ can be parallelized easily over $\abs{B}$ GPU threads, which is quite common in practice; $\abs{B}$ is thus often dictated by the number of available threads.</p>
<p>Computational complexity is $\mathcal{O}(\abs{B}\cdot D)$.</p>
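<p>A sketch of the mini-batch variant, again on made-up data; the loop body averages the single-sample gradients over a random batch $B$ before stepping.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
N, D, batch_size = 200, 4, 16
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true

w = np.zeros(D)
gamma = 0.1
for _ in range(2000):
    idx = rng.choice(N, size=batch_size, replace=False)  # the batch B
    e = y[idx] - X[idx] @ w                              # batch errors
    g = -(X[idx].T @ e) / batch_size                     # averaged batch gradient
    w = w - gamma * g
print(w)  # close to w_true
```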
<h3 id="non-smooth-non-differentiable-optimization">Non-smooth (non-differentiable) optimization</h3>
<p>We’ve defined <a href="#convexity">convexity previously</a>, but we can also use the following alternative characterization of convexity, for differentiable functions:</p>
<script type="math/tex; mode=display">\cost{\vec{u}} \ge \cost{\vec{w}} + \nabla \cost{\vec{w}}^T (\vec{u} - \vec{w})
\quad \forall \vec{u}, \vec{w}
\iff \mathcal{L} \text{ convex}</script>
<p>Meaning that the function must always lie above its linearization (which is the first-order Taylor expansion) to be convex.</p>
<p><img src="/images/ml/convex-above-linearization.png" alt="A convex function lies above its linearization" /></p>
<h4 id="subgradients">Subgradients</h4>
<p>A vector $\vec{g}\in\mathbb{R}^D$ such that:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\vec{u}\right) \ge \mathcal{L}\left(\vec{w}\right) + \vec{g}^T(\vec{u} - \vec{w}) \quad \forall \vec{u}</script>
<p>is called a <strong>subgradient</strong> to the function $\mathcal{L}$ at $\vec{w}$. The subgradient forms a line that is always below the curve, somewhat like the gradient of a convex function.</p>
<p><img src="/images/ml/subgradient-below-function.png" alt="The subgradient lies below the function" /></p>
<p>This definition is valid even for an arbitrary $\mathcal{L}$ that may not be differentiable, and not even necessarily convex.</p>
<p>If the function $\mathcal{L}$ is differentiable at $\vec{w}$, then the <em>only subgradient</em> at $\vec{w}$ is $\vec{g} = \nabla\mathcal{L}\left(\vec{w}\right)$.</p>
<h4 id="subgradient-descent">Subgradient descent</h4>
<p>This is exactly like gradient descent, except for the fact that we use the <em>subgradient</em> $\vec{g}$ at the current iterate $\vec{w}^{(t)}$ instead of the <em>gradient</em>:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)} := \vec{w}^{(t)} - \gamma\vec{g}</script>
<p>For instance, MAE is not differentiable at 0, so we must use the subgradient.</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{Let }h: \mathbb{R} \rightarrow \mathbb{R}, \quad h(e) := |e| \\
\text{At } e, \text{the subgradient }
g \in \partial h = \begin{cases}
-1 & \text{if } e < 0 \\
[-1, 1] & \text{if } e = 0 \\
1 & \text{if } e > 0 \\
\end{cases} %]]></script>
<p>Here, $\partial h$ is somewhat confusing notation for the set of all possible subgradients at our position.</p>
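<p>A tiny sketch of this case: one valid choice of subgradient for $h(e) = |e|$ (picking $g = 0$ at $e = 0$, though any value in $[-1, 1]$ would do), along with a check of the subgradient inequality at a few points.</p>

```python
def abs_subgradient(e):
    """One valid subgradient of h(e) = |e| at e."""
    if e < 0:
        return -1.0
    if e > 0:
        return 1.0
    return 0.0  # any g in [-1, 1] is a valid subgradient at e = 0

# Verify the defining inequality |u| >= |e| + g * (u - e) at sample points:
for e in (-2.0, 0.0, 3.0):
    g = abs_subgradient(e)
    assert all(abs(u) >= abs(e) + g * (u - e) for u in (-5.0, -0.5, 0.0, 2.0))
```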
<p>For linear regressions, the (sub)gradient is easy to compute using the <em>chain rule</em>.</p>
<p>Let $h$ be non-differentiable, $q$ differentiable, and $\mathcal{L}\left(\vec{w}\right) = h(q(\vec{w}))$. The chain rule tells us that, at $\vec{w}$, a subgradient is given by:</p>
<script type="math/tex; mode=display">g \in \partial h(q(\vec{w})) \cdot \nabla q(\vec{w})</script>
<h4 id="stochastic-subgradient-descent">Stochastic subgradient descent</h4>
<p>This is still commonly abbreviated SGD.</p>
<p>It’s exactly the same, except that $\vec{g}$ is a subgradient to the randomly selected $\mathcal{L}_n$ at the current iterate $\vec{w}^{(t)}$.</p>
<h3 id="comparison">Comparison</h3>
<table>
<thead>
<tr>
<th> </th>
<th>Smooth</th>
<th>Non-smooth</th>
</tr>
</thead>
<tbody>
<tr>
<td>Full gradient descent</td>
<td>Gradient of <script type="math/tex">\mathcal{L}</script> <br />Complexity is $\mathcal{O}(N\cdot D)$</td>
<td>Subgradient of $\mathcal{L}$<br />Complexity is $\mathcal{O}(N\cdot D)$</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td>Gradient of $\mathcal{L}_n$</td>
<td>Subgradient of $\mathcal{L}_n$</td>
</tr>
</tbody>
</table>
<h3 id="constrained-optimization">Constrained optimization</h3>
<p>Sometimes, optimization problems come posed with an additional constraint.</p>
<h4 id="convex-sets">Convex sets</h4>
<p>We’ve seen convexity for functions, but we can also define it for sets. A set $\mathcal{C}$ is convex iff the line segment between any two points of $\mathcal{C}$ lies in $\mathcal{C}$. That is, $\forall \vec{u}, \vec{v} \in \mathcal{C}, \quad \forall 0 \le \theta \le 1$, we have:</p>
<script type="math/tex; mode=display">\theta \vec{u} + (1 - \theta)\vec{v} \in \mathcal{C}</script>
<p>This means that the line between any two points in the set $\mathcal{C}$ must also be fully contained within the set.</p>
<p><img src="/images/ml/convex-sets.png" alt="Examples of convex and non-convex sets" /></p>
<p>A couple of properties of convex sets:</p>
<ul>
<li>Intersection of convex sets is also convex.</li>
<li>Projections onto convex sets are <strong>unique</strong> (and often efficient to compute).</li>
</ul>
<h4 id="projected-gradient-descent">Projected gradient descent</h4>
<p>When dealing with constrained problems, we have two options. The first one is to add a projection onto $\mathcal{C}$ in every step:</p>
<script type="math/tex; mode=display">P_\mathcal{C}(\vec{w}') := \argmin_{\vec{v}\in\mathcal{C}}{\norm{\vec{v}-\vec{w}'}}</script>
<p>The rule for gradient descent can thus be updated to become:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)} := P_\mathcal{C}\left(\vec{w}^{(t)} - \gamma \nabla \cost{\vec{w}^{(t)}} \right)</script>
<p>This means that at every step, we compute the new $w^{(t+1)}$ normally, but apply a projection on top of that. In other words, if the regular gradient descent sets our weights outside of the constrained space, we project them back.</p>
<figure>
<img alt="Steps of projected SGD" src="/images/ml/projected-sgd.png" />
<figcaption>Here, $\vec{w}'$ is the result of regular SGD, i.e. $\vec{w}' = \vec{w}^{(t)} - \gamma \nabla\cost{\vec{w}^{(t)}}$</figcaption>
</figure>
<p>This is the same for stochastic gradient descent, and we have the same convergence properties.</p>
<p>Note that the computational cost of the projection is very important here, since it is performed at every step.</p>
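<p>As a concrete sketch, consider projecting onto the $L_2$ ball of radius $r$, whose projection has a cheap closed form: scale $\vec{w}'$ back to the boundary whenever it lands outside. The quadratic objective below is a made-up example.</p>

```python
import numpy as np

def project_l2_ball(w, r):
    """Project w onto the convex set {v : ||v|| <= r}."""
    norm = np.linalg.norm(w)
    return w if norm <= r else w * (r / norm)

# Minimize L(w) = 0.5 * ||w - target||^2 subject to ||w|| <= 1.
target = np.array([3.0, 4.0])  # unconstrained optimum, outside the ball
w = np.zeros(2)
gamma = 0.2
for _ in range(200):
    grad = w - target                             # gradient of L
    w = project_l2_ball(w - gamma * grad, r=1.0)  # step, then project back
print(w)  # converges to target / ||target|| = [0.6, 0.8]
```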
<h4 id="turning-constrained-problems-into-unconstrained-problems">Turning constrained problems into unconstrained problems</h4>
<p>If projection as described above is approach A, this is approach B.</p>
<p>We use a <strong>penalty function</strong>, such as the “brick wall” indicator function below:</p>
<script type="math/tex; mode=display">% <![CDATA[
I_\mathcal{C}(\vec{w}) = \begin{cases}
0 & \vec{w} \in \mathcal{C} \\
+\infty & \vec{w} \notin \mathcal{C}
\end{cases} %]]></script>
<p>We could also use a penalty with a less drastic error value than $+\infty$ if we don’t need to enforce the constraint quite as strictly.</p>
<p>Note that this is similar to regularization, which we’ll talk about later.</p>
<p>Now, instead of directly solving $\min_{\vec{w}\in\mathcal{C}}{\mathcal{L}(\vec{w})}$, we solve for:</p>
<script type="math/tex; mode=display">\min_{\vec{w}\in \mathbb{R}^D} {
\mathcal{L}(\vec{w}) + I_\mathcal{C}(\vec{w})
}</script>
<h3 id="implementation-issues-in-gradient-methods">Implementation issues in gradient methods</h3>
<h4 id="stopping-criteria">Stopping criteria</h4>
<p>A common criterion is the norm of the gradient: when $\norm{\nabla\mathcal{L}(\vec{w})}$ is zero (or close to zero), we are often close to the optimum, and can stop.</p>
<h4 id="optimality">Optimality</h4>
<p>For a convex optimization problem, a <em>necessary</em> condition for optimality is that the gradient is 0 at the optimum:</p>
<script type="math/tex; mode=display">\text{optimum at }\vec{w}^*, \quad \mathcal{L} \text{ convex}
\implies
\nabla\cost{\vec{w}^*} = 0</script>
<p>For convex functions, if the gradient is 0, then we’re at an optimum:</p>
<script type="math/tex; mode=display">\nabla\cost{\vec{w}^*} = 0, \quad \mathcal{L} \text{ convex}
\implies
\text{optimum at }\vec{w}^*</script>
<p>This tells us when $\vec{w}^*$ is an optimum, but says nothing about whether it’s a minimum or a maximum. To know about that, we must look at the second derivative, or in the general case where $D > 1$, the Hessian. The Hessian is the matrix of second derivatives, defined as follows:</p>
<script type="math/tex; mode=display">\vec{H}_{ij} = \difftwo{\mathcal{L}}{w_i}{w_j}</script>
<p>If the Hessian of the optimum is <a href="https://en.wikipedia.org/wiki/Positive-definite_matrix">positive semi-definite</a>, then it is a minimum (and not a maximum or a saddle point):</p>
<script type="math/tex; mode=display">\vec{H}(\vec{w}^*) := \difftwo{\cost{\vec{w}^*}}{\vec{w}}{\vec{w}^T} \text{ positive semidefinite}
\implies
\vec{w}^* \text{ is a minimum}</script>
<p>The Hessian is also related to convexity; it is positive semi-definite on its entire domain (i.e. all its eigenvalues are non-negative) if and only if the function is convex.</p>
<script type="math/tex; mode=display">\vec{H} \text{ positive semidefinite}
\iff
\mathcal{L} \text{ convex}</script>
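<p>This condition is easy to check numerically. For linear MSE, for example, the Hessian turns out to be $\frac{1}{N}\vec{X}^T\vec{X}$ (as derived further down), so a sketch of the convexity check, on a made-up data matrix, is just an eigenvalue computation:</p>

```python
import numpy as np

# Hessian of linear MSE: H = (1/N) X^T X, for a made-up data matrix X.
rng = np.random.default_rng(3)
X = rng.normal(size=(50, 4))
H = X.T @ X / len(X)

eigenvalues = np.linalg.eigvalsh(H)  # eigvalsh: H is symmetric
print(eigenvalues.min() >= 0)        # True: H is PSD, so the cost is convex
```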
<h4 id="step-size">Step size</h4>
<p>If $\gamma$ is too big, we might diverge (<a href="#gradient-descent">as seen previously</a>). But if it is too small, we might be very slow! Convergence is only guaranteed when $\gamma$ stays below a critical value that depends on the problem.</p>
<h2 id="least-squares">Least squares</h2>
<h3 id="normal-equations">Normal equations</h3>
<p>In some rare cases, we can take an analytical approach to computing the optimum of the cost function, rather than a computational one; for instance, for linear regression with MSE, as we’ve done previously. These types of equations are sometimes called <strong>normal equations</strong>. This is one of the most popular methods for data fitting, called <strong>least squares</strong>.</p>
<p>How do we get these normal equations?</p>
<p>First, we show that the problem is convex. If that is the case, then according to the <a href="#optimality">optimality conditions</a> for convex functions, the point at which the derivative is zero is the optimum:</p>
<script type="math/tex; mode=display">\nabla\cost{\vec{w}^*}=\vec{0}</script>
<p>This gives us a system of $D$ equations known as the normal equations.</p>
<h3 id="single-parameter-linear-regression">Single parameter linear regression</h3>
<p>Let’s try this for a single parameter linear regression (where $D = 1$), with MSE as the cost function. We will start by accepting that the cost function is convex in the $w_0$ parameter<sup id="fnref:mse-is-convex"><a href="#fn:mse-is-convex" class="footnote">3</a></sup>.</p>
<p>As <a href="#gradient-descent">proven previously</a>, we know that for the single parameter model, the derivative is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\mathcal{L}\left(\vec{w}\right)
& = \frac{\partial}{\partial w_0}\mathcal{L} \\
& = \frac{1}{2N}\sum_{n=1}^N{-2(y_n - w_0)} \\
& = w_0 - \bar{y}
\end{align} %]]></script>
<p>This means that the derivative is 0 for $w_0 = \bar{y}$. This allows us to define our optimum parameter $\vec{w}^*$ as $\vec{w}^* = \begin{bmatrix}\bar{y}\end{bmatrix}$.</p>
<h3 id="multiple-parameter-linear-regression">Multiple parameter linear regression</h3>
<p>Having done $D=1$, let’s look at the general case where $D \ge 1$. As we know by now, the cost function for linear regression with MSE is:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\vec{w}\right)
:= \frac{1}{2N}\sum_{n=1}^N{\left( y_n - \vec{x}_n^T \vec{w}\right)^2}
= \frac{1}{2N}(\vec{y-Xw})^T(\vec{y-Xw})</script>
<p>Where the matrices are defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\vec{y}=\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix},
\quad
\vec{X}=\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1D} \\
x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{ND} \\
\end{bmatrix} %]]></script>
<p>We denote the $i^\text{th}$ row of $X$ by $x_i^T$. Each $x_i^T$ represents a different data point.</p>
<p>We claim that this cost function is <em>convex</em> in $\vec{w}$. We can prove that in any of the following ways:</p>
<hr />
<h4 id="simplest-way">Simplest way</h4>
<p>The cost function is the sum of many convex functions, and is thus also convex.</p>
<h4 id="directly-verify-the-definition">Directly verify the definition</h4>
<script type="math/tex; mode=display">\forall \lambda\in [0,1],
\quad \forall \vec{w}, \vec{w}',
\qquad
\mathcal{L}\left(\lambda\vec{w} + \left(1-\lambda\right)\vec{w}'\right)
- \left(\lambda\mathcal{L}(\vec{w}) + \left( 1-\lambda \right) \mathcal{L}(\vec{w}')\right) \le 0</script>
<p>The left-hand side of the inequality reduces to:</p>
<script type="math/tex; mode=display">-\frac{1}{2N}\lambda(1-\lambda)\norm{\vec{X}(\vec{w}-\vec{w}')}_2^2</script>
<p>which indeed is $\le 0$.</p>
<h4 id="compute-the-hessian">Compute the Hessian</h4>
<p>As <a href="#optimality">we’ve seen previously</a>, if the Hessian is positive semidefinite, then the function is convex. For our case, the Hessian is given by:</p>
<script type="math/tex; mode=display">\frac{1}{N}\vec{X}^T\vec{X}</script>
<p>This is indeed positive semi-definite, as its eigenvalues are the squared singular values of $\vec{X}$ (scaled by $\frac{1}{N}$), and must therefore be non-negative.</p>
<hr />
<p>Knowing that the function is convex, we can find the minimum. If we take the gradient of this expression, we get:</p>
<script type="math/tex; mode=display">\nabla\mathcal{L}(\vec{w}) = -\frac{1}{N}\vec{X}^T(\vec{y-Xw})</script>
<p>We can set this to 0 to get the normal equations for linear regression, which are:</p>
<script type="math/tex; mode=display">\vec{X}^T(\vec{y-Xw}) =: \vec{X}^T\vec{e} = \vec{0}</script>
<p>This proves that the normal equations for linear regression are given by $\vec{X}^T\vec{e} = \vec{0}$.</p>
<h3 id="geometric-interpretation">Geometric interpretation</h3>
<p>The above normal equations are given by $\vec{X}^T\vec{e} = \vec{0}$. How can we visualize that?</p>
<p>The error is given by:</p>
<script type="math/tex; mode=display">\vec{e} := \vec{y} - \vec{Xw}</script>
<p>By the normal equations, this error vector is orthogonal to all columns of $\vec{X}$. Intuitively, it tells us how far above or below the span the label vector $\vec{y}$ lies.</p>
<p>The <strong>span</strong> of $\vec{X}$ is the space spanned by the columns of $\vec{X}$. Every element of the span can be written as $\vec{u} = \vec{Xw}$ for some choice of $\vec{w}$.</p>
<p>For the normal equations, we must pick an optimal $\vec{w}^*$ for which the gradient is 0. Picking a $\vec{w}^*$ is equivalent to picking an optimal $\vec{u}^* = \vec{Xw}^*$ from the span of $\vec{X}$.</p>
<p>But which element of $\text{span}(\vec{X})$ shall we take, which one is the optimal one? The normal equations tell us that the optimum choice for $\vec{u}$, called <script type="math/tex">\vec{u}^*</script> is the element such that <script type="math/tex">\vec{y} - \vec{u}^*</script> is orthogonal to $\text{span}(X)$.</p>
<p>In other words, we should pick $\vec{u}^*$ to be the projection of $\vec{y}$ onto $\text{span}(\vec{X})$.</p>
<p><img src="/images/ml/geometric-interpretation-normal-equations.png" alt="Geometric interpretation of the normal equations" /></p>
<h3 id="closed-form">Closed form</h3>
<p>All we’ve done so far is to solve the same old problem of a matrix equation:</p>
<script type="math/tex; mode=display">Ax = b</script>
<p>But we’ve always done so with a bit of a twist; there may not be an exact value of $x$ satisfying exact equality, but we could find one that gets us as close as possible:</p>
<script type="math/tex; mode=display">Ax \approx b</script>
<p>This is also what least squares does: it attempts to minimize the MSE to get $Ax$ as close as possible to $b$.</p>
<p>In this course, we often denote the data matrix $A$ as $\vec{X}$, the weights $x$ as $\vec{w}$, and $b$ as $\vec{y}$; in other words, we’re trying to solve:</p>
<script type="math/tex; mode=display">\vec{X}\vec{w} \approx \vec{y}</script>
<p>In least squares, we multiply this whole equation by $\vec{X}^T$ on the left, and attempt to find the $\vec{w}^*$ that makes the error as small as possible. In other words, we’re trying to solve:</p>
<script type="math/tex; mode=display">\left( \vec{X}^T\vec{X} \right) \vec{w} \approx \vec{X}^T\vec{y}</script>
<p>One way to solve this problem would simply be to invert the $A$ matrix, which in our case is $\vec{X}^T\vec{X}$:</p>
<script type="math/tex; mode=display">\vec{w}^* = (\vec{X}^T\vec{X})^{-1} \vec{X}^T y</script>
<p>As such, we can use this model to predict values for unseen data points:</p>
<script type="math/tex; mode=display">\hat{y}_m := \vec{x}_m^T \vec{w}^* = \vec{x}_m^T (\vec{X}^T\vec{X})^{-1} \vec{X}^T y</script>
<h3 id="invertibility-and-uniqueness">Invertibility and uniqueness</h3>
<p>Note that the Gram matrix, defined as $\vec{X}^T\vec{X} \in \mathbb{R}^{D\times D}$, is invertible <strong>if and only if</strong> $\vec{X}$ has <strong>full column rank</strong>, or in other words, $\text{rank}(\vec{X}) = D$.</p>
<script type="math/tex; mode=display">\vec{X}^T\vec{X} \in \mathbb{R}^{D\times D} \text{ invertible}
\iff
\text{rank}(\vec{X}) = D</script>
<p>Unfortunately, in practice, our data matrix $\vec{X}\in\mathbb{R}^{N\times D}$ is often <strong>rank-deficient</strong>.</p>
<ul>
<li>If $D>N$, we always have $\text{rank}(\vec{X}) < D$ (since column and row rank are the same, which implies that $\text{rank}(\vec{X}) \le N < D$).</li>
<li>
<p>If $D \le N$, but some of the columns $\vec{X}_{:d}$ are collinear (or in practice, nearly collinear), then the matrix is <strong>ill-conditioned</strong>. This leads to numerical issues when solving the linear system.</p>
<p>To know how bad things are, we can compute the condition number, which is the maximum eigenvalue of the Gram matrix divided by the minimum one; see the course contents of Numerical Methods.</p>
</li>
</ul>
<p>If our data matrix is rank-deficient or ill-conditioned (which is practically always the case), we certainly shouldn’t be inverting it directly! We’ll introduce high numerical errors that falsify our output.</p>
<p>That doesn’t mean we can’t do least squares in practice. We can still use a linear solver. In Python, that means you should use <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html"><code class="highlighter-rouge">np.linalg.solve</code></a>, which uses an LU decomposition internally and thus avoids the worst numerical errors. In any case, do not directly invert the matrix as we have done above!</p>
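<p>A minimal sketch of this approach, on made-up data with a little Gaussian label noise: solve the normal equations $(\vec{X}^T\vec{X})\vec{w} = \vec{X}^T\vec{y}$ with a linear solver instead of forming the inverse.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 100, 5
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.01 * rng.normal(size=N)  # small Gaussian label noise

gram = X.T @ X
w_star = np.linalg.solve(gram, X.T @ y)  # no explicit matrix inverse
print(np.linalg.norm(w_star - w_true))   # small: on the order of the noise
```

For a rank-deficient $\vec{X}$, <code class="highlighter-rouge">np.linalg.lstsq</code>, which is based on an SVD, is an even safer choice.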
<h2 id="maximum-likelihood">Maximum likelihood</h2>
<p>Maximum likelihood offers a second interpretation of least squares, but starting with a probabilistic approach.</p>
<h3 id="gaussian-distribution">Gaussian distribution</h3>
<p>A Gaussian random variable in $\mathbb{R}$ has mean $\mu$ and variance $\sigma^2$. Its distribution is given by:</p>
<script type="math/tex; mode=display">\normal{y \mid \mu, \sigma^2} =
\frac{1}{\sqrt{2\pi\sigma^2}}
\exp{\left[ -\frac{(y-\mu)^2}{2\sigma^2} \right]}</script>
<p>For a Gaussian random <em>vector</em>, we have $\vec{y} \in \mathbb{R}^N$ (instead of a single random variable in $\mathbb{R}$). The vector has mean $\pmb{\mu}$ and covariance $\pmb{\Sigma}$ (which is positive semi-definite), and its distribution is given by:</p>
<script type="math/tex; mode=display">\pmb{\mathcal{N}}(\vec{y} \mid \pmb{\mu}, \pmb{\Sigma}) =
\frac{1}
{\sqrt{(2\pi)^D \text{ det}(\pmb{\Sigma})}}
\exp{\left[ -\frac{1}{2} (\vec{y} - \pmb{\mu})^T \pmb{\Sigma}^{-1} (\vec{y} - \pmb{\mu}) \right]}</script>
<p>As another reminder, two variables $x$ and $y$ are said to be <strong>independent</strong> when $p(x, y) = p(x)p(y)$.</p>
<h3 id="a-probabilistic-model-for-least-squares">A probabilistic model for least squares</h3>
<p>We assume that our data is generated by a linear model $\vec{x}_n^T\vec{w}$, with added Gaussian noise $\epsilon_n$:</p>
<script type="math/tex; mode=display">y_n = \vec{x}_n^T\vec{w} + \epsilon_n</script>
<p>This is often a realistic assumption in practice.</p>
<p><img src="/images/ml/gaussian-noise.png" alt="Noise generated by a Gaussian source" /></p>
<p>The noise is $\epsilon_n \overset{\text{i.i.d.}}{\sim}\normal{\epsilon_n \mid \mu = 0, \sigma^2}$ for each sample $n$. In other words, it is centered at 0, has a certain variance, and the noise of each sample is independent of that of all the other samples.</p>
<p>The model $\vec{w}$ is, as always, unknown. But we can try a thought experiment: if we knew the model $\vec{w}$ and the data $\vec{X}$, in a system without the noise $\epsilon_n$, we would know the labels $\vec{y}$ with 100% certainty. The only thing that prevents that is the noise $\epsilon_n$; therefore, given the model and data, the probability distribution of seeing a certain $\vec{y}$ is determined entirely by the noise sources $\epsilon_n$. Since these are generated independently for each sample, we can take the product of their densities.</p>
<p>Therefore, given $N$ samples, the <strong>likelihood</strong> of the data vector $\vec{y} = (y_1, \dots, y_N)$ given the model $\vec{w}$ and the input $\vec{X}$ is:</p>
<script type="math/tex; mode=display">p(\vec{y} \mid \vec{X}, \vec{w})
= \prod_{n=1}^N {p(y_n \mid \vec{x}_n, \vec{w})}
= \prod_{n=1}^N {\normal{y_n \mid \vec{x}_n^T\vec{w}, \sigma^2}}</script>
<p>Intuitively, we’d like to maximize this likelihood over the choice of model $\vec{w}$: the best model is the one under which the observed data is most likely.</p>
<h3 id="defining-cost-with-log-likelihood">Defining cost with log-likelihood</h3>
<p>The log-likelihood (LL) is given by:</p>
<script type="math/tex; mode=display">\mathcal{L}_{LL}(\vec{w}) := \log{p(\vec{y} \mid \vec{X}, \vec{w})}
= - \frac{1}{2\sigma^2} \sum_{n=1}^N{\left(y_n - \vec{x}_n^T\vec{w}\right)^2} + \text{ cnst}</script>
<p>Taking the log allows us to get away from the nasty product, and get a nice sum instead. Notice that this definition looks pretty similar to MSE:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MSE}}(\vec{w}) := \frac{1}{N} \sum_{n=1}^N \left(y_n - \vec{x}_n^T\vec{w}\right)^2</script>
<p>Note that we would like to minimize MSE, but we want the log-likelihood to be as high as possible (intuitively, we can look at the sign to understand that).</p>
<h3 id="maximum-likelihood-estimator-mle">Maximum likelihood estimator (MLE)</h3>
<p>Maximizing the log-likelihood (and thus the likelihood) will be equivalent to minimizing the MSE; this gives us another way to design cost functions. We can describe the whole process as:</p>
<script type="math/tex; mode=display">\argmin_{\vec{w}}{\mathcal{L}_\text{MSE}(\vec{w})} =
\argmax_{\vec{w}}{\mathcal{L}_\text{LL}(\vec{w})}</script>
<p>The maximum likelihood estimator (MLE) can be understood as finding the model under which the observed data is most likely to have been generated from (probabilistically). This interpretation has some advantages that we discuss below.</p>
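<p>A quick numerical sketch of this equivalence, on made-up data: since the log-likelihood is a decreasing affine function of the MSE, both criteria rank any set of candidate models identically.</p>

```python
import numpy as np

rng = np.random.default_rng(5)
N, D = 50, 2
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=N)
sigma2 = 0.1 ** 2  # assumed noise variance

def mse(w):
    return np.mean((y - X @ w) ** 2)

def log_likelihood(w):
    e = y - X @ w
    return -0.5 * np.sum(e**2) / sigma2 - 0.5 * N * np.log(2 * np.pi * sigma2)

candidates = [rng.normal(size=D) for _ in range(20)]
best_by_mse = min(candidates, key=mse)
best_by_ll = max(candidates, key=log_likelihood)
print(np.array_equal(best_by_mse, best_by_ll))  # True: same model wins
```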
<h4 id="properties-of-mle">Properties of MLE</h4>
<p>MLE is a <em>sample</em> approximation to the <em>expected log-likelihood</em>. In other words, with an infinite amount of data, the MLE objective would exactly equal the true expected value of the log-likelihood:</p>
<script type="math/tex; mode=display">\mathcal{L}_{LL}(\vec{w})
\approx \expectsub{p(y, \vec{x})}{\log{p(y \mid \vec{x}, \vec{w})}}</script>
<p>This means that MLE is <strong>consistent</strong>, i.e. it gives us the correct model assuming we have enough data. This means it converges in probability<sup id="fnref:convergence-prob-distrib"><a href="#fn:convergence-prob-distrib" class="footnote">4</a></sup> to the true value:</p>
<script type="math/tex; mode=display">\vec{w}_\text{MLE} \overset{p}{\longrightarrow} \vec{w}_\text{true}</script>
<p>MLE is asymptotically normal, meaning that the difference between the estimate and the true weights, scaled by $\sqrt{N}$, converges in distribution<sup id="fnref:convergence-prob-distrib:1"><a href="#fn:convergence-prob-distrib" class="footnote">4</a></sup> to a normal distribution centered at 0, with covariance given by the inverse Fisher information at the true value:</p>
<script type="math/tex; mode=display">\sqrt{N} \, (\vec{w}_{\text{MLE}} - \vec{w}_{\text{true}})
\overset{d}{\longrightarrow}
\normal{\vec{w} \mid \vec{0}, \vec{F}^{-1}(\vec{w}_{\text{true}})}</script>
<p>Where the Fisher information<sup id="fnref:fisher-information"><a href="#fn:fisher-information" class="footnote">5</a></sup> is:</p>
<script type="math/tex; mode=display">\vec{F}(\vec{w})
= -\expectsub{p(\vec{y})}{
\frac{\partial^2\mathcal{L}}{\partial\vec{w}\partial\vec{w}^T}
}</script>
<p>This sounds amazing, but the catch is that this all is under the assumption that the noise $\epsilon$ indeed was generated under a Gaussian model, which may not always be true. We’ll relax this assumption later when we talk about <a href="#exponential-family">exponential families</a>.</p>
<h2 id="overfitting-and-underfitting">Overfitting and underfitting</h2>
<p>Models can be too limited; when we can’t find a function that fits the data well, we say that we are <em>underfitting</em>. But on the other hand, models can also be too rich: in this case, we don’t just model the data, but also the underlying noise. This is called <em>overfitting</em>. Knowing exactly where we are on this spectrum is difficult, since all we have is data, and we don’t know a priori what is signal and what is noise.</p>
<p>Sections 3 and 5 of Pedro Domingos’ paper <a href="https://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf"><em>A Few Useful Things to Know about Machine Learning</em></a> are a good read on this topic.</p>
<h3 id="underfitting-with-linear-models">Underfitting with linear models</h3>
<p>Linear models can very easily underfit; as soon as the data itself is given by anything more complex than a line, fitting a linear model will underfit: the model is too simple for the data, and we’ll have huge errors.</p>
<p>But we can also easily overfit, where our model learns the specificities of the data too intimately. And this happens quite easily with linear combination of high-degree polynomials.</p>
<h3 id="extended-feature-vectors">Extended feature vectors</h3>
<p>We can actually get high-degree linear combinations of polynomials, but still keep our linear model. Instead of making the model more complex, we simply “augment” the input to become degree $M$. If the input is one-dimensional, we can add a polynomial basis to the input:</p>
<script type="math/tex; mode=display">% <![CDATA[
\pmb{\phi}(x_n) =
\begin{bmatrix}
1 & x_n & x_n^2 & x_n^3 & \dots & x_n^M
\end{bmatrix} %]]></script>
<p>Note that this is basically a <a href="https://en.wikipedia.org/wiki/Vandermonde_matrix">Vandermonde matrix</a>.</p>
<p>We then fit a linear model to this extended feature vector $\pmb{\phi}(x_n)$:</p>
<script type="math/tex; mode=display">y_n \approx w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_M x_n^M =: \pmb{\phi}(x_n)^T\vec{w}</script>
<p>Here, $\vec{w}\in\mathbb{R}^{M+1}$. In other words, there are $M+1$ parameters in a degree $M$ extended feature vector. One should be careful with this degree; too high may overfit, too low may underfit.</p>
<p>If it is important to distinguish the original input $\vec{x}$ from the augmented input $\pmb{\phi}(\vec{x})$ then we will use the $\pmb{\phi}(\vec{x})$ notation. But often, we can just consider this as a part of the pre-processing, and simply write $\vec{x}$ as the input, which will save us a lot of notation.</p>
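<p>A short sketch of this augmentation with NumPy, using a made-up cubic ground truth: <code class="highlighter-rouge">np.vander</code> builds exactly the $[1, x_n, \dots, x_n^M]$ basis, after which we fit a plain linear model on the extended features.</p>

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, size=40)
y = 1 + 2 * x - 3 * x**3 + 0.01 * rng.normal(size=40)  # cubic + small noise

M = 3
Phi = np.vander(x, M + 1, increasing=True)   # columns: 1, x, x^2, x^3
w = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)  # least squares on phi(x)
print(np.round(w, 1))  # close to [1, 2, 0, -3]
```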
<h3 id="reducing-overfitting">Reducing overfitting</h3>
<p>To reduce overfitting, we can choose a less complex model (in the above, we can pick a lower degree $M$), but we could also just add more data:</p>
<p><img src="/images/ml/reduce-overfit-add-data.png" alt="An overfitted model acts more reasonably when we add a bunch of data" /></p>
<h2 id="regularization">Regularization</h2>
<p>To prevent overfitting, we can introduce <strong>regularization</strong> to penalize complex models. This can be applied to any model.</p>
<p>The idea is to not only minimize cost, but also minimize a regularizer:</p>
<script type="math/tex; mode=display">\min_{\vec{w}} {\mathcal{L}(\vec{w}) + \Omega(\vec{w})}</script>
<p>The $\Omega$ function is the regularizer, measuring the complexity of the model. We’ll see some good candidates for the regularizer below.</p>
<h3 id="l_2-regularization-ridge-regression">$L_2$-Regularization: Ridge Regression</h3>
<p>The most frequently used regularizer is the standard Euclidean norm ($L_2$-norm):</p>
<script type="math/tex; mode=display">\Omega(\vec{w}) = \lambda \norm{\vec{w}}^2_2</script>
<p>Where $\lambda \in \mathbb{R}$. The value of $\lambda$ will affect the fit; $\lambda \rightarrow 0$ can have overfitting, while $\lambda \rightarrow \infty$ can have underfitting.</p>
<p>The norm is given by:</p>
<script type="math/tex; mode=display">\norm{\vec{w}}_2^2 = \sum_i{w_i^2}</script>
<p>The main effect of this is that large model weights $w_i$ will be penalized, while small ones won’t affect our minimization too much.</p>
<h4 id="ridge-regression">Ridge regression</h4>
<p>Depending on the values we choose for $\mathcal{L}$ and $\Omega$, we get into some special cases. For instance, choosing MSE for $\mathcal{L}$ is called <strong>ridge regression</strong>, in which we optimize the following:</p>
<script type="math/tex; mode=display">\min_{\vec{w}} {\left(\frac{1}{N} \sum_{n=1}^N \left[y_n - f(\vec{x}_n)\right]^2 \quad + \quad \Omega(\vec{w})\right)}</script>
<p>Least squares is also a special case of ridge regression, with $\lambda = 0$.</p>
<p>We can find an explicit solution for $\vec{w}$ in ridge regression by differentiating the cost and regularizer, and setting them to zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla \mathcal{L}(\vec{w}) & = -\frac{1}{N} \vec{X}^T (\vec{y} - \vec{Xw}) \\ \\
\nabla \Omega(\vec{w}) & = 2\lambda \vec{w} \\
\end{align} %]]></script>
<p>We can now set the full cost to zero, which gives us the result:</p>
<script type="math/tex; mode=display">\vec{w}^*_\text{ridge} = (\vec{X}^T\vec{X} + \lambda' \vec{I})^{-1}\vec{X}^T\vec{y}</script>
<p>Where $\frac{\lambda'}{2N} = \lambda$. Note that for $\lambda = 0$, we have the least squares solution.</p>
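<p>The closed-form solution above is straightforward to implement; a small sketch (with made-up data) using NumPy, where <code>lam</code> plays the role of $\lambda$ and $\lambda' = 2N\lambda$:</p>

```python
import numpy as np

def ridge_solve(X, y, lam):
    """Closed form w* = (X^T X + lam' I)^{-1} X^T y, with lam' = 2 N lam."""
    N, D = X.shape
    lam_p = 2 * N * lam
    return np.linalg.solve(X.T @ X + lam_p * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=N)

w_ridge = ridge_solve(X, y, lam=0.01)
w_ls = ridge_solve(X, y, lam=0.0)  # lam = 0 recovers least squares
```

<p>As expected, the regularized weights come out with a smaller norm than the least-squares ones.</p>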
<h4 id="ridge-regression-to-fight-ill-conditioning">Ridge regression to fight ill-conditioning</h4>
<p>This formulation of $\vec{w}^*$ is quite nice, because adding the identity matrix gives us a matrix that is always invertible; in cases where we have ill-conditioned matrices, it also means that we can invert with more stability.</p>
<p>We’ll prove that the matrix indeed is invertible. The gist is that the eigenvalues of $(\vec{X}^T\vec{X} + \lambda' \vec{I})$ are all at least $\lambda'$.</p>
<p>To prove it, we’ll write the singular value decomposition (SVD) of $\vec{X}^T\vec{X}$ as $\vec{USU}^T$. We then have:</p>
<script type="math/tex; mode=display">\vec{X}^T\vec{X} + \lambda'\vec{I} = \vec{USU}^T + \lambda'\vec{UIU}^T = \vec{U}(\vec{S} + \lambda'\vec{I})\vec{U}^T</script>
<p>Each singular value is “lifted” by an amount $\lambda'$. There’s an alternative proof in the class notes, but we won’t go into that.</p>
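<p>We can check this lifting numerically; a small sketch (arbitrary random data) showing that adding $\lambda'\vec{I}$ shifts every eigenvalue of $\vec{X}^T\vec{X}$ by exactly $\lambda'$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(10, 4))
A = X.T @ X          # symmetric and positive semi-definite
lam_p = 0.5

eig_A = np.linalg.eigvalsh(A)
eig_reg = np.linalg.eigvalsh(A + lam_p * np.eye(4))
# A and A + lam_p * I share eigenvectors, so each eigenvalue is shifted by lam_p.
```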
<h3 id="l_1-regularization-the-lasso">$L_1$-Regularization: The Lasso</h3>
<p>We can use a different norm as an alternative measure of complexity. The combination of $L_1$-norm and MSE is known as <strong>The Lasso</strong>:</p>
<script type="math/tex; mode=display">\min_{\vec{w}} {\frac{1}{2N} \sum_{n=1}^N \left[y_n - f(\vec{x}_n)\right]^2 + \lambda \norm{\vec{w}}_1}</script>
<p>Where the $L_1$-norm is defined as</p>
<script type="math/tex; mode=display">\norm{\vec{w}}_1 := \sum_i{\abs{w_i}}</script>
<p>If we draw out a constant value of the $L_1$ norm, we get a sort of “ball”. Below, we’ve graphed $\set{\vec{w} : \norm{\vec{w}}_1 \le 5}$.</p>
<p><img src="/images/ml/lasso.png" alt="Graph of the lasso" /></p>
<p>To keep things in the following, we’ll just claim that $\vec{X}^T\vec{X}$ is invertible. We’ll also claim that the following set is an ellipsoid which scales around the origin as we change $\alpha$:</p>
<script type="math/tex; mode=display">\set{\vec{w} : \norm{\vec{y} - \vec{Xw}}^2 = \alpha}</script>
<p>The slides have a formal proof for this, but we won’t get into it.</p>
<p>Note that the above definition of the set corresponds to the set of points with equal loss (which we can assume is MSE, for instance):</p>
<script type="math/tex; mode=display">\set{\vec{w} : \cost{\vec{w}} = \alpha}</script>
<p>Under these assumptions, we claim that for $L_1$ regularization, the optimum solution will likely be sparse (many zero components) compared to $L_2$ regularization.</p>
<p>To prove this, suppose we know the $L_1$ norm of the optimum solution. Visualizing that ball, we know that our optimum solution $\vec{w}^*$ will be somewhere on the surface of that ball. We also know that there are ellipsoids, all with the same mean and rotation, describing the equal error surfaces. The optimum solution is where the “smallest” of these ellipsoids just touches the
$L_1$ ball.</p>
<p><img src="/images/ml/ball-ellipse.png" alt="Intersection of the L1 ball and the cost ellipses" /></p>
<p>Due to the geometry of this ball, this point is more likely to be at one of the “corner” points, where some of the components are zero. In turn, sparsity is desirable, since it leads to a “simple” model.</p>
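<p>The course doesn’t give an algorithm for the Lasso at this point, but one simple way to solve it is proximal gradient descent (ISTA), where the $L_1$ term is handled by element-wise soft-thresholding. A sketch with made-up data, to see the sparsity claim in action:</p>

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, iters=5000):
    """Minimize (1/2N) ||y - Xw||^2 + lam ||w||_1 by proximal gradient."""
    N, D = X.shape
    step = N / np.linalg.norm(X, 2) ** 2  # 1/L for the smooth part
    w = np.zeros(D)
    for _ in range(iters):
        grad = -X.T @ (y - X @ w) / N
        w = soft_threshold(w - step * grad, step * lam)
    return w

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
w_true = np.zeros(10)
w_true[:3] = [2.0, -1.5, 1.0]        # sparse ground truth
y = X @ w_true + 0.05 * rng.normal(size=100)

w_hat = lasso_ista(X, y, lam=0.1)    # most components come out exactly zero
```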
<h2 id="model-selection">Model selection</h2>
<p>As we’ve seen in ridge regression, we have a <em>regularization parameter</em> $\lambda > 0$ that can be tuned to reduce overfitting by reducing model complexity. We say that the parameter $\lambda$ is a <strong>hyperparameter</strong>.</p>
<p>We’ve also seen ways to enrich model complexity, like <a href="#extended-feature-vectors">polynomial feature expansion</a>, in which the degree $M$ is also a hyperparameter.</p>
<p>We’ll now see how best to choose these hyperparameters; this is called the <strong>model selection</strong> problem.</p>
<h3 id="probabilistic-setup">Probabilistic setup</h3>
<p>We assume that there is an (unknown) underlying distribution $\mathcal{D}$ producing the dataset, with range $\mathcal{X}\times\mathcal{Y}$. The dataset $S$ we see is produced from samples from $\mathcal{D}$:</p>
<script type="math/tex; mode=display">S = \set{
(\vec{x}_n, y_n) \overset{\text{i.i.d}}{\sim} \mathcal{D}
}_{n=1}^N</script>
<p>Based on this, the <em>learning algorithm</em> $\mathcal{A}$ chooses the “best” model using the dataset $S$, under the parameters of the algorithm. The resulting prediction function is $f_S = \mathcal{A}(S)$. To indicate that $f_S$ sometimes depends on hyperparameters, we can write the prediction function as $f_{S, \lambda}$.</p>
<h3 id="training-error-vs-generalization-error">Training Error vs. Generalization Error</h3>
<p>Given a model $f$, how can we assess if $f$ is any good? We already have the loss function, but its result is highly dependent on the error in the data, not on how good the model is. Instead, we can compute the <em>expected error</em> over all samples chosen according to $\mathcal{D}$:</p>
<script type="math/tex; mode=display">L_\mathcal{D}(f) = \expectsub{\mathcal{D}}{\mathcal{l}(y, f(\vec{x}))}</script>
<p>Where $\mathcal{l}(\cdot, \cdot)$ is our loss function; e.g. for ridge regression, $\mathcal{l}(y, f(\vec{x})) = \frac{1}{2}(y-f(\vec{x}))^2$.</p>
<p>The quantity $L_\mathcal{D}(f)$ has many names, including <strong>generalization error</strong> (or true/expected error/risk/loss). This is the quantity that we are fundamentally interested in, but we cannot compute it since $\mathcal{D}$ is unknown.</p>
<p>What we do know is the data subset<sup id="fnref:data-subset-training-data"><a href="#fn:data-subset-training-data" class="footnote">6</a></sup> $S$. It’s therefore natural to compute the equivalent <em>empirical</em> quantity, which is the average loss:</p>
<script type="math/tex; mode=display">L_S(f) = \frac{1}{\abs{S}} \sum_{(\vec{x}_n, y_n)\in S} {\mathcal{l}(y_n, f(\vec{x}_n))}</script>
<p>But again, we run into trouble. The function $f$ is itself a function of the data $S$, so what we really do is to compute the quantity:</p>
<script type="math/tex; mode=display">L_S(f_S) = \frac{1}{\abs{S}} \sum_{(\vec{x}_n, y_n)\in S} {\mathcal{l}(y_n, f_S(\vec{x}_n))}</script>
<p>$f_S$ is the trained model. This is called the <strong>training error</strong>. Usually, the training error is smaller than the generalization error, because overfitting can happen (even with regularization, because the hyperparameter may still be too low).</p>
<h3 id="splitting-the-data">Splitting the data</h3>
<p>To avoid validating the model on the same data subset we trained it on (which is conducive to overfitting), we can split the data into a <strong>training set</strong> and a <strong>test set</strong> (aka <em>validation set</em>), which we call $\Strain$ and $\Stest$, so that $S = \Strain \oplus \Stest$. A typical split could be 80% for training and 20% for testing.</p>
<p>We apply the learning algorithm $\mathcal{A}$ on the training set $\Strain$, and compute the function $f_{\Strain}$. We then compute the error on the test set, which is the <strong>test error</strong>:</p>
<script type="math/tex; mode=display">L_{\Stest}(f_{\Strain}) = \frac{1}{\abs{\Stest}} \sum_{(\vec{x}_n, y_n)\in \Stest} {\mathcal{l}(y_n, f_{\Strain}(\vec{x}_n))}</script>
<p>If we have duplicates in our data, then this could be a bit dangerous. Still, in general, this really helps us with the problem of overfitting since $\Stest$ is a “fresh” sample, which means that we can hope that $L_{\Stest}(f_{\Strain})$ defined above is close to the quantity $L_\mathcal{D}(f_{\Strain})$. Indeed, <em>in expectation</em> both are the same:</p>
<script type="math/tex; mode=display">L_\mathcal{D}(f_{\Strain})
= \expectsub{\Stest\sim\mathcal{D}}{
L_{\Stest}(f_{\Strain})
}</script>
<p>The subscript on the expectation means that the expectation is over samples of the test set, and not for a particular test set (which could give a different result due to the randomness of the selection of $\Stest$).</p>
<p>This is a quite nice property, but we paid a price for it: we had to split the data and thus reduce the size of our training set. But we will see that this can be mitigated using cross-validation.</p>
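<p>A minimal sketch of such a split (80/20, with made-up data), computing both the training and the test error:</p>

```python
import numpy as np

rng = np.random.default_rng(3)
N = 200
X = rng.normal(size=(N, 2))
y = X @ np.array([1.0, -1.0]) + 0.1 * rng.normal(size=N)

# Shuffle the indices, then take 80% for training and 20% for testing.
perm = rng.permutation(N)
split = int(0.8 * N)
train_idx, test_idx = perm[:split], perm[split:]

w, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)

def half_mse(X, y, w):
    # the (1/2) squared-loss convention used in these notes
    return np.mean((y - X @ w) ** 2) / 2

train_err = half_mse(X[train_idx], y[train_idx], w)
test_err = half_mse(X[test_idx], y[test_idx], w)
```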
<h3 id="generalization-error-vs-test-error">Generalization error vs test error</h3>
<p>Assume that we have a model $f$ and that our loss function $\mathcal{l}(\cdot, \cdot)$ is bounded in $[a, b]$. We are given a test set $\Stest$ chosen i.i.d. from the underlying distribution $\mathcal{D}$.</p>
<p>How far apart is the empirical test error from the true generalization error? As we’ve seen above, they are the same in expectation. But we need to worry about the variation, about how far off from the true error we typically are:</p>
<p>We claim that:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\abs{L_\mathcal{D}(f) - L_{\Stest}(f)}
\ge
\sqrt{\frac{(b-a)^2 \ln{(2/\delta)}}{2\abs{\Stest}}}
\right]
\le \delta
\label{eq:loss-bound}
\tag{loss-bound}</script>
<p>Where $\delta > 0$ is a quality parameter. This gives us an upper bound on how far away our empirical loss is from the true loss.</p>
<p>This bound gives us some nice insights. Error decreases in the size of the test set as $\mathcal{O}(1/\sqrt{\abs{\Stest}})$, so the more data points we have, the more confident we can be in the empirical loss being close to the true loss.</p>
<p>We’ll prove $\ref{eq:loss-bound}$. We assumed that each sample in the test set is chosen independently. Therefore, given a model $f$, the associated losses $\mathcal{l}(y_n, f(\vec{x}_n))$ are also i.i.d. random variables, taking values in $[a, b]$ by assumption. We can call each such loss $\Theta_n$:</p>
<script type="math/tex; mode=display">\Theta_n = \mathcal{l}(y_n, f(\vec{x}_n))</script>
<p>This is just a naming alias; since the underlying value is that of the loss function, the expected value of $\Theta_n$ is simply that of the loss function, which is the true loss:</p>
<script type="math/tex; mode=display">\expect{\Theta_n} = \expect{\mathcal{l}(y_n, f(\vec{x}_n))} = L_\mathcal{D}(f)</script>
<p>The empirical loss on the other hand is equal to the average of $\abs{\Stest}$ such i.i.d. values.</p>
<p>The formula of $\ref{eq:loss-bound}$ gives us the probability that empirical loss $L_{\Stest}(f)$ diverges from the true loss by more than a given constant, which is a classical problem addressed in the following lemma (which we’ll just assert, not prove).</p>
<p><strong>Chernoff Bound</strong>: Let $\Theta_1, \dots, \Theta_N$ be a sequence of i.i.d random variables with mean $\expect{\Theta}$ and range $[a, b]$. Then, for any $\epsilon > 0$:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\abs{\frac{1}{N}\sum_{n=1}^N {\Theta_n - \expect{\Theta}}}
\ge
\epsilon
\right]
\le
2\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}
\label{eq:Chernoff}
\tag{Chernoff}</script>
<p>Using $\ref{eq:Chernoff}$ we can show $\ref{eq:loss-bound}$. By setting $\delta = 2\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}$, we find that $\epsilon = \sqrt{\frac{(b-a)^2 \ln{(2/\delta)}}{2\abs{\Stest}}}$ as claimed.</p>
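<p>It can be useful to plug numbers into this bound; a tiny sketch (the values of $a$, $b$ and $\delta$ are arbitrary) computing the deviation $\epsilon$ for different test-set sizes:</p>

```python
import math

def loss_bound_eps(a, b, delta, n_test):
    """Deviation eps such that P[|L_D(f) - L_Stest(f)| >= eps] <= delta."""
    return math.sqrt((b - a) ** 2 * math.log(2 / delta) / (2 * n_test))

# For losses in [0, 1] and a 95% guarantee, the deviation shrinks
# like 1/sqrt(|S_test|): 100x more data gives a 10x tighter bound.
eps_100 = loss_bound_eps(0, 1, delta=0.05, n_test=100)
eps_10000 = loss_bound_eps(0, 1, delta=0.05, n_test=10000)
```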
<h3 id="method-and-criteria-for-model-selection">Method and criteria for model selection</h3>
<h4 id="grid-search-on-hyperparameters">Grid search on hyperparameters</h4>
<p>Our main goal was to look for a way to select the hyperparameters of our model. Given a finite set of values $\lambda_k$ for $k=1, \dots, K$ of a hyperparameter $\lambda$, we can run the learning algorithm $K$ times on the same training set $\Strain$, and compute the $K$ prediction functions $f_{\Strain, \lambda_k}$. For each such prediction function we compute the test error, and choose the $\lambda_k$ which minimizes the test error.</p>
<p><img src="/images/ml/cross-validation.png" alt="Grid search on lambda" /></p>
<p>This is essentially a grid search on $\lambda$ using the test error function.</p>
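<p>A sketch of this grid search (made-up data; the grid and split are arbitrary), reusing the closed-form ridge solution with $\lambda'$ directly as the grid variable:</p>

```python
import numpy as np

rng = np.random.default_rng(4)
N, D = 120, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.2 * rng.normal(size=N)

# one fixed 80/20 train/test split
perm = rng.permutation(N)
split = int(0.8 * N)
tr, te = perm[:split], perm[split:]

def ridge(X, y, lam_p):
    return np.linalg.solve(X.T @ X + lam_p * np.eye(X.shape[1]), X.T @ y)

lambdas = np.logspace(-4, 2, 13)   # hypothetical grid of K = 13 values
test_errs = []
for lam_p in lambdas:
    w = ridge(X[tr], y[tr], lam_p)
    test_errs.append(np.mean((y[te] - X[te] @ w) ** 2) / 2)

best_lam = lambdas[int(np.argmin(test_errs))]
```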
<h4 id="model-selection-based-on-test-error">Model selection based on test error</h4>
<p>How do we know that, for a fixed function $f$, $L_{\Stest}(f)$ is a good approximation to $L_\mathcal{D}(f)$? If we’re doing a grid search on hyperparameters to minimize the test error $L_{\Stest}(f)$, we may pick a model that obtains a lower test error, but that may increase $\abs{L_\mathcal{D}(f) - L_{\Stest}(f)}$.</p>
<p>We’ll therefore try to see how much the bound increases if we pick a false positive, a model that has lower test error but that actually strays further away from the generalization error.</p>
<p>The answer to this follows the same idea as when we talked about <a href="#generalization-error-vs-test-error">generalization vs test error</a>, but we now assume that we have $K$ models $f_k$ for $k=1, \dots, K$. We assume again that the loss function is bounded in $[a, b]$, and that we’re given a test set whose samples are chosen i.i.d. in $\mathcal{D}$.</p>
<p>How far is each of the $K$ (empirical) test errors $L_{\Stest}(f_k)$ from the true $L_\mathcal{D}(f_k)$? As before, we can bound the deviation for all $k$ candidates, by:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\max_k {\abs{L_\mathcal{D}(f_k) - L_{\Stest}(f_k)}}
\ge
\sqrt{\frac{(b-a)^2 \ln{(2K/\delta)}}{2\abs{\Stest}}}
\right]
\le \delta</script>
<p>A bit of intuition of where this comes from: for a general $K$, we check the deviations for $K$ independent samples and ask for the probability that for at least one such sample we get a deviation of at least $\epsilon$ (this is what the $\ref{eq:Chernoff}$ bound answers). Then by the <a href="https://en.wikipedia.org/wiki/Boole%27s_inequality">union bound</a> this probability is at most $K$ times as large as in the case where we are only concerned with a single instance. Thus the upper bound in Chernoff becomes $2K\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}$, which gives us $\epsilon = \sqrt{\frac{(b-a)^2 \ln{(2K/\delta)}}{2\abs{\Stest}}}$ as above.</p>
<p>As before, this tells us that error decreases in $\mathcal{O}(1/\sqrt{\abs{\Stest}})$.</p>
<p>However, now that we test $K$ hyperparameters, our error bound only grows by a factor on the order of $\sqrt{\ln{K}}$ compared to $\ref{eq:loss-bound}$, which we proved for the special case of $K = 1$. So we can reasonably do grid search, knowing that in the worst case, the error will only increase by a tiny amount.</p>
<h3 id="cross-validation">Cross-validation</h3>
<p>Splitting the data once into two parts (one for training and one for testing) is not the most efficient way to use the data. Cross-validation is a better way.</p>
<p>K-fold cross-validation is a popular variant. We randomly partition the data into $K$ groups, and train $K$ times. Each time, we use one of the $K$ groups as our test set, and the remaining $K−1$ groups for training.</p>
<p>To get a common result, we average out the $K$ results: we take the average of the $K$ test errors over the folds.</p>
<p>Cross-validation returns an unbiased estimate of the generalization error and its variance.</p>
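<p>A sketch of $K$-fold cross-validation (made-up data, least squares as the learning algorithm), averaging the $K$ test errors:</p>

```python
import numpy as np

def kfold_cv_error(X, y, K, fit, loss, seed=0):
    """Average test error over K folds; each fold serves once as the test set."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        w = fit(X[tr], y[tr])
        errs.append(loss(X[te], y[te], w))
    return float(np.mean(errs))

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.0, -1.0]) + 0.1 * rng.normal(size=100)

fit = lambda X, y: np.linalg.lstsq(X, y, rcond=None)[0]
loss = lambda X, y, w: np.mean((y - X @ w) ** 2) / 2
cv_err = kfold_cv_error(X, y, K=5, fit=fit, loss=loss)
```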
<h3 id="bias-variance-decomposition">Bias-Variance decomposition</h3>
<p>When we perform model selection, there is an inherent <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias–variance</a> trade-off.</p>
<figure>
<img src="/images/ml/bias-variance.png" alt="Bullseye representation of bias vs variance" />
<figcaption>Graphical illustration of bias and variance. Taken from <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Scott Fortmann-Roe's website</a></figcaption>
</figure>
<p>If we were to build the same model over and over again with re-sampled datasets, our predictions would change because of the randomness in the used datasets. Bias tells us how far off from the correct value our predictions are in general, while variance tells us about the variability in predictions for a given point in-between realizations of the models.</p>
<p>For now, we’ll just look at “high-bias & low-variance” models, and “high-variance & low-bias” models.</p>
<ul>
<li><strong>High-bias & low-variance</strong>: the model is too simple. It’s underfit, has a large bias, and the variance of $L_\mathcal{D}(f_S)$ is small (the variations due to the random sample $S$).</li>
<li><strong>High-variance & low-bias</strong>: the model is too complex. It’s overfit, has a small bias and large variance of $L_\mathcal{D}(f_S)$ (the error depends largely on the exact choice of $S$; a single addition of a data point is likely to change the prediction function $f_S$ considerably).</li>
</ul>
<p>Consider a linear regression with one-dimensional input and <a href="#extended-feature-vectors">polynomial feature expansion</a> of degree $d$. The former (high bias) can be achieved by picking too low a value for $d$, the latter (high variance) by picking $d$ too high. The same principle applies to other hyperparameters, such as $\lambda$ in ridge regression.</p>
<h4 id="data-generation-model">Data generation model</h4>
<p>Let’s assume that our data is generated by some arbitrary, unknown function $f$, and a noise source $\epsilon$ with distribution $\mathcal{D}_\epsilon$ (i.i.d. from sample to sample, and independent from the data). We can think of $f$ representing the precise, hypothetical function that perfectly produced the data. We assume that the noise has mean zero (without loss of generality, as a non-zero mean could be encoded into $f$).</p>
<script type="math/tex; mode=display">y = f(\vec{x}) + \epsilon</script>
<p>We assume that $\vec{x}$ is generated according to some fixed but unknown distribution $\mathcal{D}_{\vec{x}}$. We’ll be working with square loss $\mathcal{l}(y, f(\vec{x})) = \frac{1}{2}(y-f(\vec{x}))^2$. We will denote the joint distribution on pairs $(\vec{x}, y)$ as $\mathcal{D}$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\epsilon & \sim \mathcal{D}_\epsilon \\
\vec{x} & \sim \mathcal{D}_x \\
(\vec{x}, y) & \sim \mathcal{D} \\
\end{align} %]]></script>
<h4 id="error-decomposition">Error Decomposition</h4>
<p>As always, we have a training set $\Strain$, which consists of $N$ i.i.d. samples from $\mathcal{D}$. Given our learning algorithm $\mathcal{A}$, we compute the prediction function $f_{\Strain} = \mathcal{A}(\Strain)$. The square loss of a single prediction for a fixed element $\vec{x}_0$ is given by the computation of:</p>
<script type="math/tex; mode=display">\mathcal{l}(y_0, f_{\Strain}(\vec{x}_0))
=
\bigl( y_0 - f_{\Strain}(\vec{x}_0) \bigr)^2
=
\bigl( f(\vec{x}_0) + \epsilon - f_{\Strain}(\vec{x}_0) \bigr)^2</script>
<p>Our experiment was to create $\Strain$, learn $f_{\Strain}$, and then evaluate the performance by computing the square loss for a fixed element $\vec{x}_0$. If we run this experiment many times, the expected value is written as:</p>
<script type="math/tex; mode=display">\expectsub{\Strain \sim \mathcal{D},\ \epsilon\sim\mathcal{D}_\epsilon}{
\left( f(\vec{x}_0) + \epsilon - f_{\Strain}(\vec{x}_0) \right)^2
}</script>
<p>This expectation is over randomly selected training sets of size $N$, and over noise sources. We will now show that this expression can be rewritten as a sum of three non-negative terms:</p>
<script type="math/tex; mode=display">% <![CDATA[
\newcommand{\otherconstantterm}{\expectsub{\Strain'\sim\mathcal{D}}{f_{\Strain'}(\vec{x}_0)}}
\begin{align}
& \expectsub{\Strain \sim \mathcal{D},\ \epsilon\sim\mathcal{D}_\epsilon} {
\left( f(\vec{x}_0) + \epsilon - f_{\Strain}(\vec{x}_0) \right)^2
} \\
\overset{(a)}{=}\ &
\expectsub{\epsilon\sim\mathcal{D}_\epsilon} {
\epsilon^2
}
+ \expectsub{\Strain \sim \mathcal{D}} {
\bigl(f(\vec{x}_0) - f_{\Strain}(\vec{x}_0)\bigl)^2
} \\
\overset{(b)}{=}\ &
\text{Var}_{\epsilon\sim\mathcal{D}_\epsilon}\left[\epsilon\right]
+ \expectsub{\Strain \sim \mathcal{D}}{
\bigl(f(\vec{x}_0) - f_{\Strain}(\vec{x}_0)\bigl)^2
} \\
\overset{(c)}{=}\ &
\underbrace{
\text{Var}_{\epsilon\sim\mathcal{D}_\epsilon}\left[\epsilon\right]
}_\text{noise variance} \\
& + \underbrace{
\left( f(\vec{x}_0) - \otherconstantterm \right)^2
}_\text{bias} \\
& + \expectsub{\Strain\sim\mathcal{D}} {
\underbrace{
\left( \otherconstantterm - f_{\Strain}(\vec{x}_0) \right)^2
}_\text{variance}
} \\
\end{align} %]]></script>
<p>Note that here, $\Strain'$ is a second training set, also sampled from $\mathcal{D}$, that is independent of the training set $\Strain$. It has the same expectation, but it is different and thus produces a different trained model $f_{\Strain'}$.</p>
<p>Step $(a)$ uses $(u+v)^2 = u^2 + 2uv + v^2$ as well as linearity of expectation to produce $\expect{(u+v)^2} = \expect{u^2} + 2\expect{uv} + \expect{v^2}$. Note that the $2uv$ part is zero as the noise $\epsilon$ is independent from $\Strain$.</p>
<p>Step $(b)$ uses the definition of variance as:</p>
<script type="math/tex; mode=display">\text{Var}(X) = \expect{(X - \expect{X})^2} = \expect{X^2} - \expect{X}^2</script>
<p>Seeing that our noise $\epsilon$ has mean zero, we have $\expect{\epsilon}^2 = 0$ and therefore $\text{Var}(\epsilon) = \expect{\epsilon^2}$.</p>
<p>In step $(c)$, we add and subtract the constant term $\otherconstantterm$ to the expression like so:</p>
<script type="math/tex; mode=display">\expectsub{\Strain \sim \mathcal{D}}{\left(
\underbrace{f(\vec{x}_0) - \otherconstantterm}_u
+ \underbrace{\otherconstantterm - f_{\Strain}(\vec{x}_0)}_v
\right)^2}</script>
<p>We can then expand the square $(u+v)^2 = u^2 + 2uv + v^2$, where $u^2$ becomes the bias, and $v^2$ the variance. We can drop the expectation around $u^2$ as it is over $\Strain$, while $u^2$ is only defined in terms of $\Strain'$, which is independent from $\Strain$. The $2uv$ part of the expansion is zero, as we show below:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& 2 \cdot \expectsub{\Strain\sim\mathcal{D}} {
\left(
f(\vec{x}_0) - \otherconstantterm
\right) \cdot \left(
\otherconstantterm - f_{\Strain}(\vec{x}_0)
\right)
} \\
& = 2 \cdot \left(
f(\vec{x}_0) - \otherconstantterm
\right) \cdot \expectsub{\Strain\sim\mathcal{D}} {
\otherconstantterm - f_{\Strain}(\vec{x}_0)
} \\
& = 2 \cdot \left(
f(\vec{x}_0) - \otherconstantterm
\right) \cdot \left(
\otherconstantterm - \expectsub{\Strain\sim\mathcal{D}}{f_{\Strain}(\vec{x}_0)}
\right) \\
& = 0 \\
\end{align} %]]></script>
<p>In the first step, we can pull $u$ out of the expectation as it is a constant term with regards to $\Strain$. The same reasoning applies to $\otherconstantterm$ in the second step. Finally, we get zero in the third step by realizing that:</p>
<script type="math/tex; mode=display">\otherconstantterm = \expectsub{\Strain\sim\mathcal{D}}{f_{\Strain}(\vec{x}_0)}</script>
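<p>We can make this decomposition concrete with a small simulation (all choices here — the true function, noise level and degrees — are made up): repeatedly re-sample a training set, fit a polynomial, and look at the spread of the predictions at a fixed point $\vec{x}_0$:</p>

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(3 * x)   # the "true" unknown function
sigma = 0.1                   # noise standard deviation
x0 = 0.5                      # fixed test input

def fit_poly(deg, n=30):
    """Sample a fresh training set and fit a degree-deg polynomial."""
    x = rng.uniform(-1, 1, size=n)
    y = f(x) + sigma * rng.normal(size=n)
    return np.polynomial.polynomial.polyfit(x, y, deg)

results = {}
for deg in (1, 5):
    preds = np.array([np.polynomial.polynomial.polyval(x0, fit_poly(deg))
                      for _ in range(2000)])
    # squared bias and variance of the prediction at x0, over training sets
    results[deg] = ((f(x0) - preds.mean()) ** 2, preds.var())
```

<p>The degree-1 model shows a large squared bias at $\vec{x}_0$ (a line cannot represent the sine), while the degree-5 model’s bias is close to zero.</p>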
<h4 id="interpretation-of-the-decomposition">Interpretation of the decomposition</h4>
<p>Each of the three terms is non-negative, so each of them is a lower bound on the expected loss when we predict the value for the input $\vec{x}_0$.</p>
<ul>
<li>When the data contains <strong>noise</strong>, then that imposes a strict lower bound on the error we can achieve.</li>
<li>The <strong>bias term</strong> is a non-negative term that tells us how far we are from the true value, in expectation. It’s the square loss between the true value $f(\vec{x}_0)$ and the expected prediction $\otherconstantterm$, where the expectation is over the training sets. As <a href="#bias-variance-decomposition">we discussed above</a>, with a simple model we will not find a good fit on average, which means the bias will be large, which adds to the error we observe.</li>
<li>The <strong>variance term</strong> is the variance of the prediction function. For complex models, small variations in the data set can produce vastly different models, and our prediction will vary widely, which also adds to our total error.</li>
</ul>
<h2 id="classification">Classification</h2>
<p>When we did regression, our data was of the form:</p>
<script type="math/tex; mode=display">\Strain = \set{(\vec{x}_n, y_n)}_{n=1}^N,
\qquad \vec{x}_n \in \mathbb{R}^d,\ y_n \in\mathbb{R}</script>
<p>With <strong>classification</strong>, our prediction is no longer continuous, but discrete. Now, $y_n\in\set{\mathcal{C}_0, \dots, \mathcal{C}_{K-1}}$. If it can only take two values (i.e. $K=2$), then it is called <strong>binary classification</strong>. If it can take more than two values, it is <strong>multi-class classification</strong>.</p>
<p>There is no ordering among these classes, so we may sometimes denote these labels as $y\in\set{0, 1, 2, \dots, K-1}$.</p>
<p>If we knew the underlying distribution $\mathcal{D}$, then it would be clear how we could measure the probability of error. We have a correct prediction when $y - f(\vec{x}) = 0$, and an incorrect one otherwise, so:</p>
<script type="math/tex; mode=display">\expectsub{\mathcal{D}}{\mathbb{I}\set{y-f(\vec{x}) \ne 0}} = \mathbb{P}(y-f(\vec{x}) \ne 0)</script>
<p>Where $\mathbb{I}$ is an indicator function that returns 1 when the condition is correct, and 0 otherwise. If we don’t know the distribution, we could just take an empirical sum, and use that instead.</p>
<p>A classifier will divide the input space into a collection of regions belonging to each class; the boundaries are called <strong>decision boundaries</strong>.</p>
<h3 id="linear-classifier">Linear classifier</h3>
<p>A linear classifier splits the input with a line in 2D, a plane in 3D, or more generally, a hyperplane. But a linear classifier can also classify more complex shapes if we allow for <a href="#extended-feature-vectors">feature augmentation</a>. For instance (in 2D), if we augment the input to degree $M=2$ and a constant factor, our linear classifier can also detect ellipsoids. So without loss of generality, we’ll simply study linear classifiers and allow feature augmentation.</p>
<h3 id="is-classification-a-special-case-of-regression">Is classification a special case of regression?</h3>
<p>From the initial definition of classification, we see that it is a special case of regression, where the output $y$ is restricted to a small discrete set instead of a continuous spectrum.</p>
<p>We could construct classification from regression by simply rounding to the nearest $\mathcal{C}_i$ value. For instance, if we have $y\in\left\{0, 1\right\}$, we can use (regularized) least-squares to learn a prediction function $f_{\Strain}$ for this regression problem. We can then convert the regression to a classification by rounding: we decide on $\mathcal{C}_1=0$ if $f_{\Strain}(\vec{x})<0.5$ and $\mathcal{C}_2=1$ if $f_{\Strain}(\vec{x})>0.5$.</p>
<p>But this is somewhat questionable as an approach. MSE penalizes points that are far away from the result <strong>before rounding</strong>, even though they would be correct <strong>after rounding</strong>.</p>
<p>This means that if we have a small loss with MSE, we can guarantee a small classification error, but crucially, the opposite is not true: a regression function can have a very high MSE even though the classification error is very small.</p>
<p>It also means that the regression line will likely not be very good. With MSE, the “position” of the line defined by $f_{\Strain}$ will depend crucially on how many points are in each class, and where the points lie. This is not desirable for classification: instead of minimizing the cost function, we’d like for the fraction of misclassified cases to be small. The mean-squared error turns out to be only loosely related to this.</p>
<p><img src="/images/ml/regression-for-classification.png" alt="Example of a regression being skewed by the number of points in each class" /></p>
<p>So instead of building classification as a special case of regression, let’s take a look at some basic alternative ideas to perform classification.</p>
<h3 id="nearest-neighbor">Nearest neighbor</h3>
<p>In some cases it is reasonable to postulate that there is some spatial correlation between points of the same class: inputs that are “close” are also likely to have the same label. Closeness may be measured by Euclidean distance, for instance.</p>
<p>This can be generalized easily: instead of taking the single nearest neighbor, a process very prone to being swayed by outliers, we can take the $k$ nearest neighbors (which we’ll talk about <a href="#k-nearest-neighbor-knn">later in the course</a>), or a weighted linear combination of elements in the neighborhood (<a href="https://en.wikipedia.org/wiki/Kernel_smoother">smoothing kernels</a>, which we won’t talk about).</p>
<p>But this idea fails miserably in high dimensions, where the geometry renders the idea of “closeness” meaningless. High-dimensional space is a very lonely place; in a high-dimensional space, if we grow the area around a point, we’re likely to see no one for a very long time, and then once we get close to the boundaries of the space, 💥, everyone is there at once. This is known as the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>.</p>
<p>The idea also fails when we have too little data, especially in high dimensions, where the closest point may actually be far away and a very bad indicator of the local situation.</p>
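<p>For completeness, a nearest-neighbor classifier is only a few lines; a sketch with a made-up, well-separated toy dataset:</p>

```python
import numpy as np

def nn_classify(X_train, y_train, x):
    """Predict the label of x as that of its nearest training point."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances
    return y_train[np.argmin(dists)]

# Two well-separated clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
y_train = np.array([0, 0, 1, 1])
```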
<h3 id="linear-decision-boundaries">Linear decision boundaries</h3>
<p>As a starting point, we can assume that decision boundaries are linear (hyperplanes). To keep things simple, we can assume that there is a separating hyperplane, i.e. a hyperplane so that no point in the training set is misclassified.</p>
<p>There may be many such lines, so which one do we pick? This may be a little hand-wavy, but the intuition is the most “robust”, or the one that offers the greatest “margin”: we want to be able to “wiggle” the inputs (by changing the training set) as much as possible while keeping the numbers of misclassifications low. This idea will lead us to <a href="#support-vector-machines"><em>support vector machines</em> (SVMs)</a>.</p>
<p>But the linear decision boundaries are limited, and in many cases too strong of an assumption. We can augment the feature vector with some non-linear functions, which is what we do with <a href="#kernel-trick">the kernel trick</a>, which we will talk about later. Another option is to use neural networks to find an appropriate non-linear transform of the inputs.</p>
<h3 id="optimal-classification-for-a-known-generating-model">Optimal classification for a known generating model</h3>
<p>To find a solution, we can gain some insights if we assume that we know the joint distribution $p(\vec{x}, y)$ that created the data (where $y$ takes values in a discrete set $\mathcal{Y}$). In practice, we don’t know the model, but this is just a thought experiment. We can assume that the data was generated from a model $(\vec{x}, y)\sim\mathcal{D}$, with $y=g(\vec{x})+\epsilon$, where $\epsilon$ is noise.</p>
<p>Given the fact that there is noise, a perfect solution may not always be possible. But if we see an input $\vec{x}$, how can we pick an optimal choice $\hat{y}(\vec{x})$ for this distribution? We want to maximize the probability of guessing the correct label, so we should choose according to the rule:</p>
<script type="math/tex; mode=display">\hat{y}(\vec{x}) = \argmax_{y\in\mathcal{Y}}{p(y\mid\vec{x})}</script>
<p>This is known as the maximum a-posteriori (MAP) criterion, since we maximize the posterior probability (the probability of a class label <em>after</em> having observed the input).</p>
<p>The probability of a correct guess is thus the average over all inputs of the MAP, i.e.:</p>
<script type="math/tex; mode=display">\mathbb{P}(\hat{y}(\vec{x}) = y) = \int{p(\vec{x})p(\hat{y}(\vec{x})\mid \vec{x})d\vec{x}}</script>
<p>In practice we of course do not know the joint distribution, but we could use this approach by using the data itself to learn the distribution (perhaps under the assumption that it is Gaussian, and just fitting the $\mu$ and $\sigma$ parameters).</p>
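As a toy illustration (with a made-up discrete joint distribution, not one from the course), the MAP rule and the resulting probability of a correct guess can be computed directly:

```python
import numpy as np

# Hypothetical joint distribution p(x, y) over 3 input values and 2 labels.
# Rows: inputs x in {0, 1, 2}; columns: labels y in {0, 1}.
joint = np.array([
    [0.30, 0.05],
    [0.10, 0.20],
    [0.05, 0.30],
])

# Posterior p(y | x): normalize each row by p(x).
posterior = joint / joint.sum(axis=1, keepdims=True)

# MAP rule: for each x, pick the label with the highest posterior.
y_hat = posterior.argmax(axis=1)

# Probability of a correct guess: sum over x of p(x) * max_y p(y | x).
p_correct = (joint.sum(axis=1) * posterior.max(axis=1)).sum()
```

With these numbers, the MAP classifier picks label 0 for the first input and label 1 for the other two, and guesses correctly 80% of the time.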
<h2 id="logistic-regression">Logistic regression</h2>
<p>Recall that <a href="#is-classification-a-special-case-of-regression">we discussed</a> what happens if we look at binary classification as a regression. We also discussed that it is tempting to interpret the predicted value as a probability (i.e. if the regression says 0.8, we could interpret it as 80% certainty of class $\mathcal{C}_1$, with label 1, and 20% probability of class $\mathcal{C}_0$, with label 0). But this leads to problems: the predicted values may not be in $[0, 1]$, and may even largely surpass these bounds; such predictions then contribute to the MSE error even though they indicate high certainty.</p>
<p>So the natural idea is to <em>transform</em> the prediction, which can take values in $(-\infty, \infty)$, into a true probability in $[0, 1]$. This is done by applying an appropriate function<sup id="fnref:squishification-function"><a href="#fn:squishification-function" class="footnote">7</a></sup>, one of which is the <em>logistic function</em>, or <em>sigmoid function</em><sup id="fnref:logistic-implementation"><a href="#fn:logistic-implementation" class="footnote">8</a></sup>:</p>
<script type="math/tex; mode=display">\sigma(z) := \frac{e^z}{1+e^z} = \frac{1}{1+e^{-z}}</script>
<p>How do we use this? Let’s consider binary classification, with labels 0 and 1. Given a training set, we learn a weight vector $\vec{w}$. Given a new feature vector $\vec{x}$, the <em>probability</em> of the class labels given $\vec{x}$ are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(1 \mid \vec{x}) & = \sigma(\vec{x}^T\vec{w}) \\
p(0 \mid \vec{x}) & = 1 - \sigma(\vec{x}^T\vec{w}) \\
\end{align} %]]></script>
<p>This allows us to predict a certainty, which is a real value and not a label, which is why logistic regression is called regression, even though it is still part of a classification scheme. The second step of the scheme would be to quantize this value to a binary value. For binary classification, we’d pick 0 if the value is less than 0.5, and 1 otherwise.</p>
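A minimal sketch of this two-step scheme (the helper names below are our own, not from the course):

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + exp(-z)); clipping avoids overflow in exp
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def predict_proba(X, w):
    # p(y = 1 | x) = sigma(x^T w), applied row-wise to the data matrix X
    return sigmoid(X @ w)

def predict_label(X, w, threshold=0.5):
    # second step: quantize the certainty into a hard 0/1 label
    return (predict_proba(X, w) >= threshold).astype(int)
```

For instance, with $\vec{w} = (1, -1)$, the input $(2, 0)$ gets certainty $\sigma(2) \approx 0.88$ and is quantized to label 1.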
<h3 id="training">Training</h3>
<p>To train the classifier, the intuition is that we’d like to maximize the likelihood of our weight vector explaining the data:</p>
<script type="math/tex; mode=display">\argmax_{\vec{w}}{p(\vec{y}, \vec{X} \mid \vec{w})}</script>
<p>We know that <a href="#properties-of-mle">maximizing the likelihood</a> is <strong>consistent</strong>, it gives us the correct model assuming we have enough data. Using the chain rule for probabilities, the probability becomes:</p>
<script type="math/tex; mode=display">p(\vec{y}, \vec{X} \mid \vec{w}) = p(\vec{X}\mid\vec{w})p(\vec{y} \mid \vec{X}, \vec{w}) = p(\vec{X})p(\vec{y} \mid \vec{X}, \vec{w})</script>
<p>As we’re trying to get the argmax over the weights, we can discard $p(\vec{X})$ as it doesn’t depend on $\vec{w}$. Therefore:</p>
<script type="math/tex; mode=display">\argmax_{\vec{w}}{p(\vec{y}, \vec{X} \mid \vec{w})} = \argmax_{\vec{w}}{p(\vec{y} \mid \vec{X}, \vec{w})}</script>
<p>Using the fact that the samples in the dataset are independent, and given the above formulation of the class probabilities, we can express the maximum likelihood criterion (still for the binary case $K=2$):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(\vec{y} \mid \vec{X}, \vec{w})
& = p(y_1, \dots, y_N \mid \vec{x}_1, \dots, \vec{x}_N, \vec{w}) \\
& = \prod_{n=1}^N{p(y_n \mid \vec{x}_n, \vec{w})} \\
& = \prod_{n=1}^N{\sigma(\vec{x}_n^T \vec{w})^{y_n} (1-\sigma(\vec{x}_n^T \vec{w}))^{1-y_n}} \\
\end{align} %]]></script>
<p>But this product is nasty, so we’ll remove it by taking the log. We also multiply by $-1$, which means we also need to be careful about taking the minimum instead of the maximum. The resulting cost function is thus:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{\vec{w}}
& = -\sum_{n=1}^N{\left[
y_n \log{(\sigma(\vec{x}_n^T \vec{w}))} + (1-y_n)\log{(1-\sigma(\vec{x}_n^T \vec{w}))}
\right]} \\
& = \sum_{n=1}^N{\log{(1+\exp{(\vec{x}_n^T \vec{w})})} - y_n \vec{x}_n^T \vec{w}}
\tag{Log-Likelihood}\label{eq:log-likelihood}
\end{align} %]]></script>
<h3 id="conditions-of-optimality">Conditions of optimality</h3>
<p>As we discussed above, we’d like to minimize the cost $\cost{\vec{w}}$. Let’s look at the stationary points of our cost function by computing its gradient and setting it to zero.</p>
<p>It just turns out that taking the gradient of the logarithm in the inner part of the sum above gives us back the logistic function (by the chain rule):</p>
<script type="math/tex; mode=display">\diff{\log{(1+\exp{(\vec{x}_n^T \vec{w})})}}{\vec{w}} = \vec{x}_n \sigma(\vec{x}_n^T \vec{w})</script>
<p>Therefore, the whole derivative is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\cost{\vec{w}}
& = \sum_{n=1}^N {\vec{x}_n (\sigma(\vec{x}_n^T\vec{w}) - y_n)} \\
& = \vec{X}^T \left[ \sigma(\vec{Xw}) - \vec{y} \right]
\end{align} %]]></script>
<p>The matrix $\vec{X}$ is $N\times D$; $\vec{y}$ is a column vector of length $N$, and $\vec{w}$ a column vector of length $D$. To simplify notation, we let $\sigma(\vec{Xw})$ represent element-wise application of the sigmoid function on the size $N$ vector resulting from $\vec{Xw}$.</p>
<p>There is no closed-form solution for this, so we’ll discuss how to solve it in an iterative fashion by using gradient descent or the Newton method.</p>
<h3 id="gradient-descent-1">Gradient descent</h3>
<p>$\ref{eq:log-likelihood}$ is convex in the weight vector $\vec{w}$. We can therefore do gradient descent on this cost function as we’ve always done:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)} := \vec{w}^{(t)} - \gamma^{(t)}\nabla\cost{\vec{w}^{(t)}}</script>
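A minimal sketch of this descent, using the gradient $\vec{X}^T\left[\sigma(\vec{Xw}) - \vec{y}\right]$ derived above (the toy data and step size are made up; we divide the gradient by $N$ purely so that a fixed step size behaves well):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def logistic_gd(X, y, gamma=1.0, iters=2000):
    """Gradient descent on the logistic cost.
    Gradient of the sum: X^T (sigma(Xw) - y); averaged over N here."""
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        w -= gamma * X.T @ (sigmoid(X @ w) - y) / n
    return w

# Toy data with a linear structure (made up for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X @ np.array([2.0, -1.0]) > 0).astype(float)

w = logistic_gd(X, y)
train_acc = ((sigmoid(X @ w) >= 0.5) == y).mean()
```

The learned weights recover the sign pattern of the generating direction $(2, -1)$, and the training accuracy is high.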
<h3 id="newtons-method">Newton’s method</h3>
<p>Gradient descent is a <em>first-order</em> method, using only the first derivative of the cost function. We can get a more powerful optimization algorithm using the second derivative. This is based on the idea of Taylor expansions. The 2<sup>nd</sup> order Taylor expansion of the cost, around $\vec{w}^*$, is:</p>
<script type="math/tex; mode=display">\cost{\vec{w}} \approx \cost{\vec{w}^*} + \nabla\cost{\vec{w}^*}^T(\vec{w}-\vec{w}^*) + \frac{1}{2}(\vec{w}-\vec{w}^*)^T \vec{H}(\vec{w}^*)(\vec{w}-\vec{w}^*)</script>
<p>Where $\vec{H}$ denotes the Hessian, the $D\times D$ symmetric matrix with entries:</p>
<script type="math/tex; mode=display">\vec{H}_{i, j} = \frac{\partial^2\cost{\vec{w}}}{\partial w_i \partial w_j}</script>
<h4 id="hessian-of-the-cost">Hessian of the cost</h4>
<p>Let’s compute this Hessian matrix. We’ve already computed the gradient of the cost function <a href="#conditions-of-optimality">in the section above</a>, where we saw that the gradient of a single term is:</p>
<script type="math/tex; mode=display">\vec{x}_n \left( \sigma(\vec{x}_n^T\vec{w}) - y_n \right)</script>
<p>Each term only depends on $\vec{w}$ through the $\sigma(\vec{x}_n^T \vec{w})$ factor. Therefore, the Hessian associated with one term is:</p>
<script type="math/tex; mode=display">\vec{x}_n(\nabla\sigma(\vec{x}_n^T\vec{w}))^T</script>
<p>Given that the derivative of the sigmoid is $\sigma'(x) = \sigma(x)(1-\sigma(x))$, by the <a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule</a>, each term of the sum gives rise to the Hessian:</p>
<script type="math/tex; mode=display">\vec{x}_n\vec{x}_n^T\sigma(\vec{x}_n^T \vec{w})(1 - \sigma(\vec{x}_n^T \vec{w}))</script>
<p>This is the Hessian for a single term; if we sum up over all terms, we get to the following matrix product:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\vec{H}(\vec{w})
& = \sum_{n=1}^N{\nabla^2\mathcal{L}_n(\vec{w})} \\
& = \sum_{n=1}^N{
\underbrace{\vec{x}_n \vec{x}_n^T}_{D\times D}
\sigma(\vec{x}_n^T \vec{w})
\bigl(1 - \sigma(\vec{x}_n^T \vec{w}) \bigr)
} \\
& = \underbrace{\ \vec{X}^T \ }_{D\times N} \
\underbrace{\ \vec{S} \ }_{N\times N} \
\underbrace{\ \vec{X} \ }_{N\times D} \\
\end{align} %]]></script>
<p>The $\vec{S}$ matrix is diagonal, with positive entries, which means that the Hessian is positive semi-definite, and therefore that the problem indeed is convex. The entries are:</p>
<script type="math/tex; mode=display">S_{n, n} = \sigma(\vec{x}_n^T \vec{w})\bigl(1 - \sigma(\vec{x}_n^T \vec{w}) \bigr)</script>
<h4 id="closed-form-for-newtons-method">Closed form for Newton’s method</h4>
<p>In this model, we’ll assume that the Taylor expansion above denotes the cost function exactly instead of approximately. In other words, we’re assuming strict equality $=$ instead of approximation $\approx$ as above. This is only an assumption; it isn’t strictly true, but it’s a decent approximation. Where does this take minimum value? To know that, let’s set the gradient of the Taylor expansion to zero. This yields:</p>
<script type="math/tex; mode=display">H(\vec{w}^*)^{-1} \nabla\cost{\vec{w}^*} = \vec{w}^* - \vec{w}</script>
<p>If we solve for $\vec{w}$, this gives us an iterative algorithm for finding the optimum:</p>
<script type="math/tex; mode=display">\vec{w}^{(t+1)} = \vec{w}^{(t)} - \gamma^{(t)} \vec{H}\left(\vec{w}^{(t)}\right)^{-1} \nabla\cost{\vec{w}^{(t)}}</script>
<p>The trade-off for the Newton method is that while we need fewer iterations, each of them is more costly. In practice, which one to use depends, but at least we have another option with the Newton method.</p>
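A sketch of these damped Newton steps, on made-up <em>non-separable</em> toy data (on separable data the weights diverge, as the next section explains); the tiny ridge added to $\vec{H}$ is our own addition, purely for numerical safety:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def logistic_newton(X, y, iters=8, gamma=1.0):
    """Newton's method: w <- w - gamma * H^{-1} grad, with
    grad = X^T (sigma(Xw) - y) and H = X^T S X, where S is diagonal
    with entries sigma(x_n^T w) (1 - sigma(x_n^T w))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        S = np.diag(p * (1.0 - p))
        H = X.T @ S @ X + 1e-8 * np.eye(X.shape[1])  # tiny ridge for safety
        w -= gamma * np.linalg.solve(H, grad)
    return w

# Noisy (hence non-separable) toy data drawn from a logistic model
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (rng.random(200) < sigmoid(X @ np.array([1.0, -2.0]))).astype(float)

w = logistic_newton(X, y)
```

A handful of iterations suffice here, whereas plain gradient descent would need many more; each iteration, however, solves a $D \times D$ linear system.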
<h3 id="regularized-logistic-regression">Regularized logistic regression</h3>
<p>If the data is linearly separable, there is no finite weight vector that minimizes the cost: running the iterative algorithm will make the weights diverge to infinity. To avoid this, we can regularize with a penalty term:</p>
<script type="math/tex; mode=display">\argmin_w{-\sum_{n=1}^N{\log{p(y_n \mid \vec{x}_n^T\vec{w})}} + \frac{\lambda}{2}\norm{\vec{w}}^2}</script>
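In terms of the iterative algorithm, the penalty simply adds a $\lambda\vec{w}$ term to the gradient. A sketch on a tiny separable example (data made up for illustration), where the unpenalized weights would diverge but the penalized ones stay finite:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30, 30)))

def logistic_gd_ridge(X, y, lam=0.1, gamma=0.5, iters=4000):
    # Averaged logistic gradient, plus the gradient lam * w of the
    # penalty term (lambda / 2) * ||w||^2
    w = np.zeros(X.shape[1])
    n = len(y)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ w) - y) / n + lam * w
        w -= gamma * grad
    return w

# Perfectly separable 1D data: without the penalty, w -> infinity
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = logistic_gd_ridge(X, y)
```

The penalized problem is strongly convex, so the iteration settles on a finite weight instead of growing without bound.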
<h2 id="generalized-linear-models">Generalized Linear Models</h2>
<p>Previously, with <a href="#a-probabilistic-model-for-least-squares">least squares</a>, we assumed that our data was of the form:</p>
<script type="math/tex; mode=display">y = \vec{x}^T \vec{w} + z, \quad \text{with } z\sim\normal{0, \sigma^2}</script>
<p>This is a linear model in $D$ dimensions. When talking about generalized linear models, we’re still talking about something linear, but we allow the noise $z$ to be something else than a Gaussian distribution.</p>
<h3 id="motivation">Motivation</h3>
<p>The motivation for this is that while standard logistic regression only allows for binary outputs<sup id="fnref:binary-logistic-regression"><a href="#fn:binary-logistic-regression" class="footnote">9</a></sup>, we may want to have something equivalently computationally efficient for, say, $y\in\mathbb{N}$. To do so, we introduce a different class of distributions, called the <em>exponential family</em>, with which we can revisit logistic regression and get other properties.</p>
<p>This will be useful in adding a degree of freedom. Previously, we most often used linear models, in which we model the data as a line, plus zero-mean Gaussian noise. As we saw, this leads to least squares. When the data is more complex than a simple line, we saw that we could augment the features (e.g. with $x^2$, $x^3$), and still use a linear model. The idea was to augment the feature space $x$. This gave us an added degree of freedom, and allowed us to use linear models for higher-degree problems.</p>
<p>These linear models predicted the mean of the distribution from which we assumed the data to be sampled. When talking about mean here, we mean what we assume the data to be modeled after, without the noise. In this section, we’ll see how we can use the linear model to predict a different quantity than the mean. This will allow us to add another degree of freedom, and use linear models to get other predictions than just the shape of the data.</p>
<p>We’ve actually already done this, without knowing it. In <a href="#logistic-regression">(binary) logistic regression</a>, the probability of the classes was:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y = 1 \mid \eta) & = \sigma(\eta) \\
p(y = 0 \mid \eta) & = 1 - \sigma(\eta) \\
\end{align} %]]></script>
<p>We’re using $\eta$ as a shorthand for $\vec{x}^T\vec{w}$, and will do so in this section. More compactly, we can write this in a single formula:</p>
<script type="math/tex; mode=display">p(y\mid\eta) = \frac{e^{\eta y}}{1 + e^\eta} = \exp{\left[
\eta y - \log{(1 + e^\eta)}
\right]}, \qquad y\in\set{0, 1}</script>
<p>Note that the linear model $\vec{x}^T\vec{w}$ does not predict the mean, which we’ll denote $\mu$ (don’t get confused by this notation; in this section, $\mu$ is not a scalar, but represents the “real values” that the data is modeled after, without the noise). Instead, our linear model predicts $\eta = \vec{x}^T\vec{w}$, which is transformed into the mean by using the $\sigma$ function:</p>
<script type="math/tex; mode=display">\mu = \sigma(\eta)</script>
<p>This relation between $\mu$ and $\eta$ is known as the <strong>link function</strong>. It is a nonlinear function that makes it possible to use a linear model to predict something else than the mean $\mu$.</p>
<h3 id="exponential-family">Exponential family</h3>
<p>In general, the form of a distribution in the exponential family is:</p>
<script type="math/tex; mode=display">p(y\mid\pmb{\eta}) = h(y)\exp{\left[\pmb{\eta}^T\pmb{\phi}(y) - A(\pmb{\eta})\right]}</script>
<p>Let’s take a look at the various components of this distribution:</p>
<ul>
<li>$\pmb{\phi}(y)$ is called a <strong>sufficient statistic</strong>. It’s usually a vector. Its name stems from the fact that its empirical average is all we need to estimate $\pmb{\eta}$</li>
<li>$A(\pmb{\eta})$ is the <strong>log-partition function</strong>, or the <strong>cumulant</strong>.</li>
</ul>
<p>The domain of $y$ can vary: we could choose $y\in\mathbb{R}$, $y\in\left\{0, 1\right\}$, $y\in\mathbb{N}$, etc. Depending on this, we may have to do sums or integrals in the following.</p>
<p>We require that the probability be non-negative, so we need to ensure that $h(y) \ge 0$. Additionally, a probability distribution needs to integrate to 1, so we also require that:</p>
<script type="math/tex; mode=display">\int_y{h(y)\exp{\left[\pmb{\eta}^T\pmb{\phi}(y) - A(\pmb{\eta})\right]}} dy = 1</script>
<p>This can be rewritten to:</p>
<script type="math/tex; mode=display">\int_y{h(y)\exp{\left[\pmb{\eta}^T\pmb{\phi}(y)\right]}} dy = \exp{A(\pmb{\eta})}</script>
<p>The role of $A(\pmb{\eta})$ is thus only to ensure a proper normalization. To create a member of the exponential family, we can choose the factor $h(y)$, the vector $\pmb{\phi}(y)$ and the parameter $\pmb{\eta}$; the cumulant $A(\pmb{\eta})$ is then determined for each such choice, and ensures that the expression is properly normalized. From the above, it follows that $A(\pmb{\eta})$ is defined as:</p>
<script type="math/tex; mode=display">A(\pmb{\eta}) = \log{\left[\int_y{h(y)\exp{\left[\pmb{\eta}^T\pmb{\phi}(y)\right]}} dy\right]}</script>
<p>We exclude the case where the integral is infinite, as we cannot compute a real $A(\pmb{\eta})$ for that case.</p>
<h4 id="link-function">Link function</h4>
<p>There is a relationship between the mean $\pmb{\mu}$ and $\pmb{\eta}$ using the link function $g$:</p>
<script type="math/tex; mode=display">\pmb{\eta} = g(\pmb{\mu}) \iff \pmb{\mu} = g^{-1}(\pmb{\eta})</script>
<p>The link function is a 1-to-1 transformation from the <strong>usual parameters</strong> $\pmb{\mu}$ (e.g. $\pmb{\mu} = \set{\mu, \sigma^2}$ for Gaussian distributions) to the <strong>natural parameters</strong> $\pmb{\eta}$ (e.g. $\pmb{\eta} = \set{\frac{\mu}{\sigma^2}, -\frac{1}{2\sigma^2}}$ for Gaussian distributions).</p>
<p>For a list of such functions, consult the chapter on Generalized Linear Models in <a href="https://www.cs.ubc.ca/~murphyk/MLbook/">the KPM book</a>.</p>
<h4 id="example-bernoulli">Example: Bernoulli</h4>
<p>The Bernoulli distribution is a member of the exponential family. Its probability density is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y\mid\mu)
& = \mu^y(1-\mu)^{1-y}, \quad \text{where } \mu\in(0, 1) \\
& = \exp{\left[
\left( \log{\frac{\mu}{1-\mu}} \right) y +
\log{(1 - \mu)}
\right]} \\
& = \exp{\left[\eta \phi(y) - A(\eta)\right]}
\end{align} %]]></script>
<p>The parameters are thus:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y) & = 1 \\
\phi(y) & = y \\
\eta & = \log{\frac{\mu}{1-\mu}} \\
A(\eta) & = -\log{(1-\mu)}=\log{(1 + e^{\eta})} \\
\end{align} %]]></script>
<p>Here, $\phi(y)$ is a scalar, which means that the family only depends on a single parameter. Note that $\eta$ and $\mu$ are linked:</p>
<script type="math/tex; mode=display">\eta
= g(\mu)
= \log{\frac{\mu}{1-\mu}}
\iff
\mu
= g^{-1}(\eta)
= \frac{e^{\eta}}{1+e^{\eta}}
= \sigma(\eta)</script>
<p>The link function is the same sigmoid function we encountered in logistic regression.</p>
<h4 id="example-poisson">Example: Poisson</h4>
<p>The Poisson distribution with mean $\mu$ is given by:</p>
<script type="math/tex; mode=display">p(y\mid\mu) = \frac{\mu^y e^{-\mu}}{y!} = \frac{1}{y!}\exp{\left[
y \log{(\mu)} - \mu
\right]} = h(y)\exp{\left[
\eta \phi(y) - A(\eta)
\right]}</script>
<p>Where the parameters of the exponential family are given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y) & = \frac{1}{y!} \\
\phi(y) & = y \\
\eta & = g(\mu) = \log{(\mu)} \\
A(\eta) & = \mu = g^{-1}(\eta) = e^\eta
\end{align} %]]></script>
<h4 id="example-gaussian">Example: Gaussian</h4>
<p>Notation for Gaussian distributions can be a little confusing, so we’ll make sure to distinguish the notation of the usual parameter vectors $\pmb{\mu}$ (in bold), from the parameters themselves, which are the Gaussian mean $\mu$ and variance $\sigma^2$.</p>
<p>The density of a Gaussian $\normal{\mu, \sigma^2}$ is:</p>
<script type="math/tex; mode=display">p(y\mid\mu,\sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp{-\frac{(y-\mu)^2}{2\sigma^2}},
\qquad \mu\in\mathbb{R},
\quad \sigma\in\mathbb{R}^+</script>
<p>There are two parameters to choose in a Gaussian, $\mu$ and $\sigma$, so we can expect something of degree 2 in exponential form. Let’s rewrite the above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y\mid\mu,\sigma^2) & = \exp{\left[
- \frac{y^2}{2\sigma^2}
+ \frac{\mu y}{\sigma^2}
- \underbrace{
\left( \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log{(2\pi\sigma^2)} \right)
}_{A(\pmb{\eta})}
\right]} \\
& = \exp{\left[
\pmb{\eta}^T \pmb{\phi}(y) - A(\pmb{\eta})
\right]}
\end{align} %]]></script>
<p>Where:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y) & = 1 \\
\pmb{\phi}(y) & = \begin{bmatrix}
y \\
y^2 \\
\end{bmatrix} \\
\pmb{\eta} & = \begin{bmatrix}
\eta_1 \\
\eta_2 \\
\end{bmatrix} = \begin{bmatrix}
\frac{\mu}{\sigma^2} \\
-\frac{1}{2\sigma^2} \\
\end{bmatrix} \\
A(\pmb{\eta}) & = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log{(2\pi\sigma^2)}
= -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log{(-\eta_2/\pi)}
\end{align} %]]></script>
<p>Indeed, this time $\pmb{\phi}(y)$ is a vector of dimension 2, which reflects that the distribution depends on 2 parameters. As the formulation of $\pmb{\eta}$ shows, we have a 1-to-1 correspondence between $\pmb{\eta}=(\eta_1, \eta_2)$ and the $(\mu, \sigma^2)$ parameters:</p>
<script type="math/tex; mode=display">\eta_1 = \frac{\mu}{\sigma^2},\ \eta_2 = -\frac{1}{2\sigma^2}
\quad \iff \quad
\mu = -\frac{\eta_1}{2\eta_2},\ \sigma^2 = -\frac{1}{2\eta_2}</script>
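This correspondence is easy to encode; a small sketch of the two maps (function names are our own):

```python
def to_natural(mu, sigma2):
    # (mu, sigma^2) -> (eta1, eta2) = (mu / sigma^2, -1 / (2 sigma^2))
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def from_natural(eta1, eta2):
    # inverse map: mu = -eta1 / (2 eta2), sigma^2 = -1 / (2 eta2)
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)
```

Round-tripping $(\mu, \sigma^2) = (3, 4)$ through the natural parameters and back returns the same pair, confirming the maps are inverses.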
<h4 id="properties-1">Properties</h4>
<ol>
<li>$A(\pmb{\eta})$ is convex</li>
<li>$\nabla_{\pmb{\eta}} A(\pmb{\eta}) = \expect{\pmb{\phi}(y)}$</li>
<li>$\nabla_{\pmb{\eta}}^2 A(\pmb{\eta}) = \expect{\pmb{\phi}(y)\pmb{\phi}(y)^T} - \expect{\pmb{\phi}(y)}\expect{\pmb{\phi}(y)}^T$</li>
<li>$\pmb{\mu} := \expect{\pmb{\phi}(y)}$</li>
</ol>
<p>Proofs for the first 3 properties are in the lecture notes. The last property is given without proof.</p>
<h3 id="application-in-ml">Application in ML</h3>
<p>We use $\eta_n = \vec{x}_n^T\vec{w}$, or equivalently, $\pmb{\eta} = \vec{X}\vec{w}$.</p>
<h4 id="maximum-likelihood-parameter-estimation">Maximum Likelihood Parameter Estimation</h4>
<p>Assume that we have samples composing our training set, $\Strain = \set{(\vec{x}_n, y_n)}_{n=1}^N$ i.i.d. from some distribution, which we assume is some exponential family. Assume we have picked a model, i.e. that we fixed $h(y)$ and $\pmb{\phi}(y)$, but that $\pmb{\eta}$ is unknown. How can we find an optimal $\pmb{\eta}$?</p>
<p>We said previously that $\pmb{\phi}(y)$ is a sufficient statistic, and that we could find $\pmb{\eta}$ from its empirical average; this is what we’ll do here. We can use the maximum likelihood principle to find this parameter, meaning that we want to minimize the negative log-likelihood:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{L}_{LL}(\pmb{\eta})
& = -\log{(p(y \mid \pmb{\eta}))} \\
& = \sum_{n=1}^N{\left(
-\log{\left[h(y_n)\right]} - \eta_n^T\pmb{\phi}(y_n) + A(\eta_n)
\right)}
\end{align} %]]></script>
<p>This is a convex function in $\pmb{\eta}$: the $h(y)$ term does not depend on $\pmb{\eta}$, $\pmb{\eta}^T\pmb{\phi}(y_n)$ is linear, $A(\pmb{\eta})$ has the <a href="#properties-1">property of being convex</a>.</p>
<p>If we assume that we have the link function already, we can get $\pmb{\eta}$ by setting the gradient of the cost to 0. We also multiply by $\frac{1}{N}$ to get a more convenient form, i.e. with $\expect{\pmb{\phi}(y)}$ instead of $N\cdot\expect{\pmb{\phi}(y)}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{1}{N} \nabla\cost{\pmb{\eta}}
& = -\frac{1}{N}\sum_{n=1}^N{\bigl[\pmb{\phi}(y_n)
- \nabla A(\eta_n)\bigr]} \\
& = -\frac{1}{N}\left( \sum_{n=1}^N{\pmb{\phi}(y_n)} \right)
+ \expect{\pmb{\phi}(y)} \\
& = 0
\end{align} %]]></script>
<p>Since $\pmb{\mu} := \expect{\pmb{\phi}(y)}$, we get:</p>
<script type="math/tex; mode=display">\pmb{\mu} := \expect{\pmb{\phi}(y)} = \frac{1}{N} \sum_{n=1}^N{\pmb{\phi}(y_n)}</script>
<p>Therefore, we can get $\pmb{\eta}$ by using the link function:</p>
<script type="math/tex; mode=display">\pmb{\eta}
= g(\pmb{\mu})
= g\left( \frac{1}{N}\sum_{n=1}^N{\pmb{\phi}(y_n)} \right)</script>
<p>With this, we can see the justification for calling $\pmb{\phi}(y)$ a sufficient statistic.</p>
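For instance, for the Poisson family above ($\phi(y) = y$, $g(\mu) = \log{\mu}$), the recipe amounts to taking the sample mean and applying the link. A quick sketch with simulated data (the true mean of 5 is made up for illustration):

```python
import numpy as np

# Simulated Poisson samples with true mean mu = 5
rng = np.random.default_rng(42)
y = rng.poisson(lam=5.0, size=10_000)

mu_hat = y.mean()            # empirical average of phi(y) = y
eta_hat = np.log(mu_hat)     # eta = g(mu) = log(mu) for the Poisson family
```

The empirical average alone recovers $\mu$, and hence $\eta$, which is exactly what makes $\phi(y)$ a sufficient statistic.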
<h4 id="conditions-of-optimality-1">Conditions of optimality</h4>
<p>If we assume that our samples follow the distribution of an exponential family, we can construct a <em>generalized linear model</em>. As we’ve explained previously, this is a generalization of the model we used for logistic regression.</p>
<p>For such a model, the maximum likelihood problem, as described above, is easy to solve. As we’ve noted above, the cost function is convex, so a greedy, iterative algorithm should work well. Let’s look at the gradient of the cost in terms of $\vec{w}$ (instead of $\pmb{\eta} = \vec{x}^T\vec{w}$ as previously):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{\vec{w}}
& = -\sum_{n=1}^N{
\log{(h(y_n))} + \vec{x}_n^T\vec{w} \pmb{\phi}(y_n) - A(\vec{x}_n^T\vec{w})
} \\
\nabla_{\vec{w}}\cost{\vec{w}}
& = -\sum_{n=1}^N{
\vec{x}_n \pmb{\phi}(y_n) - \nabla_{\vec{w}} A(\vec{x}_n^T\vec{w})
}
\end{align} %]]></script>
<p>Let’s recall that the derivative of the cumulant is:</p>
<script type="math/tex; mode=display">\frac{\partial A(\pmb{\eta})}{\partial \pmb{\eta}} = \expect{\pmb{\phi}(y)} = g^{-1}(\pmb{\eta})</script>
<p>Hence the gradient of the cost function is:</p>
<script type="math/tex; mode=display">\nabla_{\vec{w}}\cost{\vec{w}}
= - \sum_{n=1}^N {\vec{x}_n \pmb{\phi}(y_n)
- \vec{x}_n g^{-1}(\vec{x}_n^T\vec{w})}</script>
<p>Setting this to zero gives us the condition of optimality. Using matrix notation, we can rewrite this sum as follows:</p>
<script type="math/tex; mode=display">\nabla_{\vec{w}}\cost{\vec{w}}
= \vec{X}^T\left( g^{-1}(\vec{Xw}) - \pmb{\phi}(\vec{y}) \right)
= 0</script>
<p>Note that this is a more general form of the formula we had <a href="#conditions-of-optimality">for logistic regression</a>. At this point, seeing that the function is convex, we can use a greedy iterative algorithm like gradient descent to find the minimum.</p>
<h2 id="nearest-neighbor-classifiers-and-the-curse-of-dimensionality">Nearest neighbor classifiers and the curse of dimensionality</h2>
<p>For simplicity, let’s assume that we’re operating in a $d$-dimensional box, that is, in the domain $\mathcal{X} = [0, 1]^d$. As always, we have a training set $\Strain=\set{(\vec{x}_n, y_n)}$.</p>
<h3 id="k-nearest-neighbor-knn">K Nearest Neighbor (KNN)</h3>
<p>Given a “fresh” input $\vec{x}$, we can make a prediction using $\text{nbh}_{\Strain,\ k}(\vec{x})$: the set of (indices of) the $k$ training inputs that are closest to $\vec{x}$.</p>
<p>For the regression problem, we can take the average of the k nearest neighbors:</p>
<script type="math/tex; mode=display">f(\vec{x}) = \frac{1}{k}\sum_{n\in\text{nbh}_{\Strain,\ k}(\vec{x})}{y_n}</script>
<p>For binary classification, we take the majority element in the $k$-neighborhood. It’s a good idea to pick $k$ to be odd so that there is a clear winner.</p>
<script type="math/tex; mode=display">f(\vec{x}) = \text{maj}\set{y_n : n \in \text{nbh}_{\Strain, k}(\vec{x})}</script>
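A minimal sketch of both prediction rules (the toy 2D data below is made up):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, classify=True):
    """k-NN: find the k training points closest to x (Euclidean distance)
    and majority-vote (classification) or average (regression) their labels."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nbh = np.argsort(dists)[:k]          # indices of the k nearest neighbors
    if classify:
        # majority label; an odd k avoids ties in the binary case
        return int(np.round(y_train[nbh].mean()))
    return y_train[nbh].mean()

# Two tight clusters of labeled points (made up for illustration)
X_train = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                    [1.0, 1.0], [0.9, 1.0], [1.0, 0.9]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```

A query near the origin is classified 0, one near $(1, 1)$ is classified 1, and in regression mode the average over all six neighbors is simply the mean label.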
<p>If we pick a large value of $k$, then we are smoothing over a large area. Therefore, a large $k$ gives us a simple model, with simpler boundaries, while a small $k$ is a more complex model. In other words, complexity is inversely proportional to $k$. As we saw when we talked about <a href="#bias-variance-decomposition">bias and variance</a>, if we pick a small value of $k$ we can expect a small bias but huge variance. If we pick a large $k$ we can expect large bias but small variance.</p>
<h3 id="analysis">Analysis</h3>
<p>We’ll analyze the simplest setting, a binary KNN model (that is, there are only two output labels, 0 and 1). Let’s start by simplifying our notation. We’ll introduce the following function:</p>
<script type="math/tex; mode=display">\eta(\vec{x}) = \mathbb{P}\left\{y=1\mid\vec{x}\right\}</script>
<p>This is the conditional probability that the label is 1, given that the input is $\vec{x}$. If this probability is to be meaningful at all, we must have some correlation between the “position” $\vec{x}$ and the associated label; knowing the labels close by must give us some information. This means that we need an assumption on the distribution $\mathcal{D}$:</p>
<script type="math/tex; mode=display">\abs{\eta(\vec{x}) - \eta(\vec{x}')} \le \mathcal{c}\norm{\vec{x} - \vec{x}'}
\label{eq:lipschitz-bound}\tag{Lipschitz bound}</script>
<p>On the right-hand side we have Euclidean distance. In other words, we ask that the conditional probability $\mathbb{P}\left\{y=1\mid\vec{x}\right\}$, denoted by $\eta(x)$, be <a href="https://en.wikipedia.org/wiki/Lipschitz_continuity">Lipschitz continuous</a> with Lipschitz constant $\mathcal{c}$. We will use this assumption later on to prove a performance bound for our KNN model.</p>
<p>Let’s assume for a moment that we know the actual underlying distribution. This is not something that we actually know in practice, but it is useful for deriving a formulation for the optimal model. Knowing the probability distribution, our optimum decision rule is given by the classifier:</p>
<script type="math/tex; mode=display">f_*(\vec{x}) = \mathbb{I}\left[ \eta(\vec{x}) > \frac{1}{2} \right]</script>
<p>The idea of this classifier is that with two labels, we’ll pick the label that is likely to happen more than half of the time. The intuition is that if we were playing heads or tails and knew the probability in advance, we would always pick the option that has probability more than one half, and that is the best strategy we can use. This is known as the <strong>Bayes classifier</strong>, also called <strong>maximum a posteriori (MAP) classifier</strong>. It is optimal, in that it has the smallest probability of misclassification of any classifier, namely:</p>
<script type="math/tex; mode=display">\cost{f_*} = \expectsub{\vec{x}\sim\mathcal{D}}{
\min{\set{ \eta(\vec{x}), 1-\eta(\vec{x}) }}
}</script>
<p>Let’s compare this to the probability of misclassification of our actual classifier, here the $k=1$ nearest neighbor model:</p>
<script type="math/tex; mode=display">\cost{f_{\Strain,\ k=1}} = \expect{\mathbb{I}\left[ f_{\Strain}(\vec{x}) \ne y \right]}</script>
<p>This tells us that the risk (that is, the error probability of our $k=1$ nearest neighbor classifier) is the above expectation. It’s hard to find a closed form for that expectation, but we can place a bound on it by comparing the ideal, theoretical model to the actual model. We’ll state the following lemma:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{f_{\Strain}}
& \le 2 \cost{f_*} + \mathcal{c} \expectsub{\Strain, \vec{x}\sim\mathcal{D}}{\norm{\vec{x} - \text{nbh}_{\Strain, 1}(\vec{x})}} \\
& \le 2 \cost{f_*} + 4\mathcal{c}\sqrt{d} N^{-\frac{1}{d+1}} \\
\end{align} %]]></script>
<p>Before we see where this comes from, let’s just interpret it. The above gives us a bound on the real classifier, compared to the optimal one. The actual classifier is upper bounded by twice the risk of the optimal classifier (this is good), plus a geometric term reflecting dimensionality (it depends on $d$: this will cause us some trouble).</p>
<p>This second term of the sum is the average distance of a randomly chosen point to the nearest point in the training set, times the Lipschitz constant $\mathcal{c}$. It intuitively makes sense to incorporate this factor into our bound: if we are basing our prediction on a point that is very close, we’re more likely to be right, and if it’s far away, less so. If we’re in a box of $[0, 1]^d$, then the distance between two corners would be $\sqrt{d}$ (by Pythagoras’ theorem). The term $N^{-\frac{1}{d+1}}$ indicates that the closest data point may be closer than the opposite corner of the cube: if we have more data, we’ll probably not have to go that far. However, for large dimensions, we need much more data to have something that’ll probably be close.</p>
<p>Let’s prove where this geometric term comes from by considering the cube $[0, 1]^d$, the space of inputs containing $\vec{x}$. We can cut this large cube into small cubes of side length $\epsilon$. Consider the small cube containing $\vec{x}$. If we are lucky, this small cube also contains a neighboring data point at distance at most $\sqrt{d}\epsilon$ (at the opposite corner of the small cube; we use Pythagoras’ theorem as above). However, if we’re less lucky, the closest neighbor may be at the other corner of the big cube, at distance $\sqrt{d}$. So what is the probability of a point not having a neighbor in its small $\epsilon$ cube?</p>
<p>Let’s denote the probability of $\vec{x}$ landing in a particular box by $\mathbb{P}_i$. The chance that none of the N training points are in the box is $(1-\mathbb{P}_i)^N$. We don’t know the distribution $\mathcal{D}$, so we can’t really express $\mathbb{P}_i$ in a closed form, but that doesn’t matter, this notation allows us to abstract over that. The rest of the proof is calculus, carefully choosing the right scaling for $\epsilon$ in order to get a good bound.</p>
<p>Now, let’s understand where the term $2\cost{f_*}$ comes from. If we flip two independent coins $y$ and $y’$, each coming up 1 with probability $p$, what is the probability of the two outcomes being different?</p>
<script type="math/tex; mode=display">\mathbb{P}\left\{y \ne y' \right\} = 2p(1-p)</script>
<p>Now, let’s consider two points $\vec{x}$ and $\vec{x}’$, both elements of $[0, 1]^d$. Their labels are $y$ and $y’$, respectively. The probability of these two labels being different is roughly the same as above (although the probabilities of the two events may not be the same in general):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}\left\{ y \ne y'\right\}
= & \eta(\vec{x})(1-\eta(\vec{x}')) + \eta(\vec{x}')(1-\eta(\vec{x})) \\
= & 2\eta(\vec{x})(1-\eta(\vec{x})) + (2\eta(\vec{x})-1)(\eta(\vec{x})-\eta(\vec{x}')) \\
\le & 2\eta(\vec{x})(1-\eta(\vec{x})) + (\eta(\vec{x}) - \eta(\vec{x}')) \\
\le & 2\eta(\vec{x})(1-\eta(\vec{x})) + \mathcal{c}\norm{\vec{x}-\vec{x}'}
\end{align} %]]></script>
<p>The second-to-last step uses the fact that $\eta(\vec{x})$ is a probability, so $-1 \le 2\eta(\vec{x})-1 \le 1$. The last step uses the $\ref{eq:lipschitz-bound}$.</p>
<p>Therefore, we can confirm the following bound:</p>
<script type="math/tex; mode=display">\mathbb{P}\left\{ y\ne y' \right\} \le 2\eta(\vec{x})(1-\eta(\vec{x})) + \mathcal{c}\norm{\vec{x} - \vec{x}'}</script>
<p>But we are still one step away from explaining how we can compare this to the optimal estimator. In the above, we derived a bound for two labels being different. How is this related to our KNN model? The probability of getting a wrong prediction from KNN with $k=1$ (which we denoted $\expectsub{\Strain}{\cost{f_{\Strain}}}$) is the probability of the predicted label being different from the solution label.</p>
<p>We get to our lemma by the following reasoning:</p>
<script type="math/tex; mode=display">2\eta(\vec{x})(1-\eta(\vec{x}))
\le 2\min{\left\{ \eta(\vec{x}), 1-\eta(\vec{x}) \right\}}
= 2\cost{f_*}</script>
<p>Additionally, the average of the term $\mathcal{c}\norm{\vec{x} - \vec{x}’}$ is $\mathcal{c}\expectsub{\Strain, \vec{x}\sim\mathcal{D}}{\norm{\vec{x} - \text{nbh}_{\Strain, 1}(\vec{x})}}$.</p>
<p>If we had assumed a ball instead of a cube, we would’ve gotten slightly different results. But that’s beside the point: the main insight is that the bound depends on the dimension, and that for low dimensions at least, we still have a fairly good classifier. But finding a closest neighbor in high dimension can quickly become meaningless.</p>
<h2 id="support-vector-machines">Support Vector Machines</h2>
<h3 id="definition">Definition</h3>
<p>Let’s re-consider binary classification. In the following it will be more convenient to consider $y_n\in\set{\pm 1}$. This is equivalent to what we’ve done previously, under the mapping $0 \mapsto -1$ and $1\mapsto 1$. Note that this mapping can be done continuously in the range $[0, 1] \mapsto [-1, 1]$ by computing $\tilde{y}_n = 2y_n - 1$, and back with $y_n = \frac{1}{2}(\tilde{y}_n + 1)$.</p>
<p>Previously, we used MSE or logistic loss. MSE is symmetric, so overshooting the label is punished at the same rate as undershooting it. With logistic loss, we always incur some loss, but it is asymmetric, shrinking the further we go in the correct direction.</p>
<p>If we instead use hinge loss (as defined below), with an additional regularization term, we get <strong>Support Vector Machines</strong> (SVM).</p>
<script type="math/tex; mode=display">\text{Hinge}(z, y) = [1-yz]_+ = \max{\left\{ 0, 1-yz \right\}}</script>
<p>Here, we use $z$ as shorthand for $\vec{x}^T \vec{w}$. The function multiplies the prediction with the actual label, which produces a positive result if they are of the same sign, and a negative result if they have different signs (this is why we wanted our labels in $\set{\pm 1}$). When the prediction is correct and above one, $1-yz$ becomes negative, and hinge loss returns 0. This makes hinge loss a linear function when predictions are incorrect or below one; it does not punish correct predictions above one, which pushes us to give predictions that we can be very confident about (above one).</p>
<p><img src="/images/ml/hinge-mse-logistic.png" alt="Graph of hinge loss, MSE and logistic" /></p>
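<p>As a minimal sketch of how hinge loss behaves, here is a NumPy version (the example values are mine, purely illustrative):</p>

```python
import numpy as np

def hinge(z, y):
    """Hinge loss [1 - y*z]_+ for raw predictions z = x^T w and labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y * z)

z = np.array([-0.5, 0.3, 1.2, 2.0])  # raw predictions x^T w
y = np.array([1.0, 1.0, 1.0, 1.0])   # all true labels are +1
losses = hinge(z, y)  # wrong or unconfident predictions (z < 1) incur a loss; z >= 1 incurs none
```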
<p>SVMs correspond to the following optimization problem:</p>
<script type="math/tex; mode=display">\min_{\vec{w}}{\sum_{n=1}^N{\left[ 1 - y_n \vec{x}_n^T \vec{w}\right]_+} + \frac{\lambda}{2}\norm{\vec{w}}^2}</script>
<p>What does this optimization problem correspond to, intuitively?</p>
<p><img src="/images/ml/margin.png" alt="Margin of a dataset" /></p>
<p>In the figure above, the pink region represents the “margin” created by the SVM. The center of the margin is the separating hyperplane; its direction is perpendicular to $\vec{w}$, the normal vector defining the hyperplane. The margin’s total width is $2/\norm{\vec{w}}$.</p>
<p>Points inside the margin are feature vectors $\vec{x}$ for which $\abs{\vec{x}^T\vec{w}} < 1$. These points incur a cost with hinge loss. Any points outside the margin, for which $\abs{\vec{x}^T\vec{w}} \ge 1$, do not incur any cost, as long as they’re on the correct side. Thus, depending on the $\vec{w}$ that we choose, the orientation and size of the margin will change; there will be a different number of points in it, and the cost will change.</p>
<p>How can we pick a good margin? Let’s assume $\lambda$ is small; we won’t make this precise, the main point is that we pick a $\vec{w}$ with the following priorities (in order):</p>
<ol>
<li>We want a separating hyperplane</li>
<li>We want a scaling of $\vec{w}$ so that no point of the data is in the margin</li>
<li>We want the margin to be as wide as possible</li>
</ol>
<p>With conditions 1 and 2, we can ensure that there is no cost incurred in the first expression (the sum over $[1 - y_n \vec{x}_n^T \vec{w}]_+$). The third condition is ensured by the fact that we’re minimizing $\norm{\vec{w}}^2$. Since the size of the margin is inversely proportional to that, we’re maximizing the margin.</p>
<p>We’ve introduced SVMs for the general case, where the data is not necessarily linearly separable, which is the <em>soft-margin</em> formulation. In the <em>hard-margin</em> formulation, the data is linearly separable by a separating hyperplane. Maximizing the margin size in the hard-margin formulation implies that some points will lie exactly on the margin boundary (on the correct side). These points are called <strong>essential support vectors</strong>. For the soft-margin case, this interpretation becomes a little more muddled.</p>
<h3 id="alternative-formulation-duality">Alternative formulation: Duality</h3>
<p>Now that we know what function we’re optimizing, let’s look at how we can optimize it efficiently. The function is convex, and has a subgradient in $\vec{w}$, which means we can use SGD with subgradients. This is good news! We’ll discuss an alternative, but equivalent formulation via the concept of <em>duality</em>, which can lead us to a more efficient implementation in some cases. More importantly though, the dual problem can point us to a more general formulation, called the <a href="#kernel-trick">kernel trick</a>.</p>
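<p>As a sketch of the subgradient SGD approach on the primal objective (the toy data and hyperparameters are my own choices, purely illustrative):</p>

```python
import numpy as np

def svm_subgradient_sgd(X, y, lam=0.1, lr=0.01, epochs=100, seed=0):
    """Minimize sum_n [1 - y_n x_n^T w]_+ + (lam/2) ||w||^2 by SGD,
    using a subgradient of the (non-differentiable) hinge term."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        for n in rng.permutation(N):
            # subgradient of the hinge term: -y_n x_n if the margin is violated, else 0
            g = -y[n] * X[n] if y[n] * (X[n] @ w) < 1 else np.zeros(D)
            w -= lr * (g + lam * w / N)  # spread the regularizer over the N samples
    return w

# toy linearly separable data
X = np.array([[2., 2.], [1., 3.], [-2., -1.], [-1., -2.]])
y = np.array([1., 1., -1., -1.])
w = svm_subgradient_sgd(X, y)
```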
<p>Let’s say that we’re interested in minimizing a cost function $\cost{\vec{w}}$. Let’s assume this can be defined through an auxiliary function $G$, such that:</p>
<script type="math/tex; mode=display">\cost{\vec{w}} = \max_{\pmb{\alpha}}{G(\vec{w}, \pmb{\alpha})}</script>
<p>The minimization in question is thus:</p>
<script type="math/tex; mode=display">\min_{\vec{w}}{\cost{\vec{w}}}
= \min_{\vec{w}}{\max_{\pmb{\alpha}}{G(\vec{w}, \pmb{\alpha})}}</script>
<p>We call this the <strong>primal problem</strong>. In some cases though, it may be easier to find this in the other direction:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}}{\min_{\vec{w}}{G(\vec{w}, \pmb{\alpha})}}</script>
<p>We call this the <strong>dual problem</strong>. This leads us to a few questions:</p>
<h4 id="how-do-we-find-a-suitable-function-g">How do we find a suitable function G?</h4>
<p>There’s a general theory on this topic (see <a href="http://www.athenasc.com/nonlinbook.html">Nonlinear Programming</a> by Dimitri Bertsekas). In the case of SVMs though, finding the function $G$ is rather straightforward, once we restate the hinge loss as follows:</p>
<script type="math/tex; mode=display">[z]_+ = \max{\left\{ 0, z \right\}} = \max_{\alpha}{\alpha z}, \qquad \text{with } \alpha\in[0, 1]</script>
<p>The SVM problem then becomes:</p>
<script type="math/tex; mode=display">\min_{\vec{w}}{\max_{\pmb{\alpha}\in[0, 1]^N}{
\underbrace{
\sum_{n=1}^N{
\alpha_n (1 - y_n \vec{x}_n^T \vec{w})
} + \frac{\lambda}{2}\norm{\vec{w}}^2
}_{G(\vec{w}, \pmb{\alpha})}
}}
\label{eq:svm-primal}\tag{Primal problem}</script>
<p>Note that G is convex in $\vec{w}$, and linear, hence concave, in $\pmb{\alpha}$.</p>
<h4 id="when-is-it-ok-to-switch-min-and-max">When is it OK to switch min and max?</h4>
<p>It is always true that:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}}{\min_{\vec{w}}{G(\vec{w}, \pmb{\alpha})}}
\le
\min_{\vec{w}}{\max_{\pmb{\alpha}}{G(\vec{w}, \pmb{\alpha})}}</script>
<p>This is proven by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\min_{\vec{w}'}{G(\vec{w}', \pmb{\alpha})}
& \le G(\vec{w}, \pmb{\alpha})
\quad \forall \vec{w}, \pmb{\alpha}
& \iff \\
\max_{\pmb{\alpha}}{\min_{\vec{w}'}{G(\vec{w}', \pmb{\alpha})}}
& \le \max_{\pmb{\alpha}}{G(\vec{w}, \pmb{\alpha})}
\quad \forall \vec{w}
& \iff \\
\max_{\pmb{\alpha}}{\min_{\vec{w}'}{G(\vec{w}', \pmb{\alpha})}}
& \le \min_{\vec{w}} \max_{\pmb{\alpha}}{G(\vec{w}, \pmb{\alpha})}
& \\
\end{align} %]]></script>
<p>Equality is achieved when the function looks like a saddle: when $G$ is a continuous function that is convex in $\vec{w}$, concave in $\pmb{\alpha}$, and the domains of both are compact and convex.</p>
<p><img src="/images/ml/saddle.png" alt="Saddle function" /></p>
<p>For SVMs, this condition is fulfilled, and the switch between min and max can be done. The alternative formulation of SVMs is:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}\in[0, 1]^N}{\min_{\vec{w}}{
\underbrace{
\sum_{n=1}^N{
\alpha_n (1 - y_n \vec{x}_n^T \vec{w})
} + \frac{\lambda}{2}\norm{\vec{w}}^2
}_{G(\vec{w}, \pmb{\alpha})}
}}
\label{eq:svm-dual}\tag{Dual problem}</script>
<p>We can take the derivative with respect to $\vec{w}$:</p>
<script type="math/tex; mode=display">\nabla_{\vec{w}}G(\vec{w}, \pmb{\alpha})
= -\sum_{n=1}^N{\alpha_n y_n \vec{x}_n} + \lambda\vec{w}</script>
<p>We’ll set this to zero to find a formulation of $\vec{w}$ in terms of $\pmb{\alpha}$. We get:</p>
<script type="math/tex; mode=display">\vec{w}(\pmb{\alpha}) = \frac{1}{\lambda}\sum_{n=1}^N{\alpha_n y_n \vec{x}_n} = \frac{1}{\lambda}\vec{X}^T\vec{Y}\pmb{\alpha}</script>
<p>Where $\vec{Y} := \text{diag}(\vec{y})$. If we plug this into $\ref{eq:svm-dual}$, we get the following dual problem, in quadratic form:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \max_{\pmb{\alpha}\in[0, 1]^N}{
\sum_{n=1}^N \alpha_n(1 - \frac{1}{\lambda}y_n \vec{x}_n^T \vec{X}^T\vec{Y}\pmb{\alpha}) + \frac{\lambda}{2}\norm{\frac{1}{\lambda}\vec{X}^T\vec{Y}\pmb{\alpha}}^2
} \\
& = \max_{\pmb{\alpha}\in[0, 1]^N}{
\pmb{\alpha}^T\vec{1} - \frac{1}{2\lambda}\pmb{\alpha}^T\vec{YXX}^T\vec{Y}\pmb{\alpha}
} \label{eq:svm-quadratic-form} \tag{Quadratic form}
\end{align} %]]></script>
<h4 id="when-is-the-dual-easier-to-optimize-than-the-primal">When is the dual easier to optimize than the primal?</h4>
<ol>
<li>When the dual is a differentiable quadratic problem (as SVM is). This is a problem that takes the same $\ref{eq:svm-quadratic-form}$ as above. In this case, we can optimize by using <strong>coordinate descent</strong> (or more precisely, ascent, as we’re searching for the maximum). Crucially, this method only changes one $\alpha_n$ variable at a time.</li>
<li>In the $\ref{eq:svm-quadratic-form}$ above, the data enters the formula in the form $\vec{K} = \vec{XX}^T$. This is called the <strong>kernel</strong>. We say this formulation is <em>kernelized</em>. Using this representation is called the <em>kernel trick</em>, and gives us some nice consequences that we’ll discuss later.</li>
<li>Typically, the solution $\pmb{\alpha}$ is sparse, being non-zero only in the training examples that are instrumental in determining the decision boundary. If we recall how we defined $\alpha$ in <a href="#how-do-we-find-a-suitable-function-g">an alternative formulation</a> of $[z]_+$, we can see that there are three distinct cases to consider:
<ol>
<li>Examples that lie on the correct side, and outside the margin, for which $\alpha_n = 0$. These are <strong>non-support vectors</strong></li>
<li>Examples that are on the correct side and just on the margin, for which $y_n \vec{x}_n^T \vec{w} = 1$, so $\alpha_n \in (0, 1)$. These $\vec{x}_n$ are <strong>essential support vectors</strong></li>
<li>Examples that are strictly within the margin, or on the wrong side have $\alpha_n = 1$, and are called <strong>bound support vectors</strong></li>
</ol>
</li>
</ol>
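<p>A minimal sketch of coordinate ascent on the dual: each coordinate update is the closed-form optimum of the one-dimensional quadratic in $\alpha_n$, clipped to $[0, 1]$ (the toy data and $\lambda$ are illustrative choices of mine):</p>

```python
import numpy as np

def svm_dual_coordinate_ascent(X, y, lam=0.1, epochs=50):
    """Maximize alpha^T 1 - 1/(2 lam) alpha^T (Y K Y) alpha over alpha in [0,1]^N,
    updating one coordinate at a time. K = X X^T is the linear kernel."""
    N = X.shape[0]
    Q = np.outer(y, y) * (X @ X.T)  # Q = Y K Y
    alpha = np.zeros(N)
    for _ in range(epochs):
        for n in range(N):
            rest = Q[n] @ alpha - Q[n, n] * alpha[n]
            # unconstrained optimum of the 1-D quadratic in alpha_n, clipped to the box
            alpha[n] = np.clip((lam - rest) / Q[n, n], 0.0, 1.0)
    w = (X.T @ (alpha * y)) / lam  # w(alpha) = (1/lam) X^T Y alpha
    return alpha, w

X = np.array([[2., 2.], [1., 3.], [-2., -1.], [-1., -2.]])
y = np.array([1., 1., -1., -1.])
alpha, w = svm_dual_coordinate_ascent(X, y)
```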
<h3 id="kernel-trick">Kernel trick</h3>
<p>We saw previously that our data only enters $\ref{eq:svm-quadratic-form}$ in the form of a kernel, $\vec{K} = \vec{XX}^T$. We’ll see now that when we’re using the kernel, we can easily go to a much larger dimensional space (even infinite dimensional space) without adding any complexity. This isn’t always applicable though, so we’ll also see which kernel functions are admissible for this trick.</p>
<h4 id="alternative-formulation-of-ridge-regression">Alternative formulation of ridge regression</h4>
<p>Let’s recall that least squares is a special case of ridge regression (where $\lambda = 0$). Ridge regression corresponds to the following optimization problem:</p>
<script type="math/tex; mode=display">\vec{w}^* = \argmin_{\vec{w}}{\sum_{n=1}^N{(y_n - \vec{x}_n^T \vec{w})^2} + \frac{\lambda}{2}\norm{\vec{w}}^2}</script>
<p>We saw that the solution has a closed form:</p>
<script type="math/tex; mode=display">\vec{w}^* = (\vec{X}^T\vec{X} + \lambda\vec{I}_D)^{-1} \vec{X}^T \vec{y}</script>
<p>We claim that this can be alternatively written as:</p>
<script type="math/tex; mode=display">\vec{w}^* =
\vec{X}^T
(\underbrace{\vec{XX}^T\vec{X} + \lambda\vec{I}_N}_{N\times N})^{-1}
\vec{y}</script>
<p>The original formulation’s runtime is $\mathcal{O}(D^3 + ND^2)$, while the alternative is $\mathcal{O}(N^3 + DN^2)$. Which is more efficient depends on $D$ and $N$.</p>
<details><summary><p>Proof</p>
</summary><div class="details-content">
<p>We can prove this formulation by using the following identity. Let $\vec{P}$ be an $N\times M$ matrix, and $\vec{Q}$ an $M\times N$ matrix. Then:</p>
<script type="math/tex; mode=display">\vec{P}(\vec{QP} + \vec{I}_M) = \vec{PQP} + \vec{P} = (\vec{PQ} + \vec{I}_N)\vec{P}</script>
<p>Assuming that $(\vec{QP} + \vec{I}_M)$ and $(\vec{PQ} + \vec{I}_N)$ are invertible, we have the identity:</p>
<script type="math/tex; mode=display">(\vec{PQ}+\vec{I}_N)^{-1}\vec{P} = \vec{P}(\vec{QP}+\vec{I}_M)^{-1}</script>
<p>To derive the formula, we can let $\vec{P} = \vec{X}^T$ and $\vec{Q} = \frac{1}{\lambda}\vec{X}$.</p>
</div></details>
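<p>A quick numerical check that the two closed forms agree (random data; the dimensions and $\lambda$ are arbitrary choices):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 8, 3, 0.5
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# original formulation: (X^T X + lam I_D)^{-1} X^T y, cost O(D^3 + N D^2)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)
# alternative formulation: X^T (X X^T + lam I_N)^{-1} y, cost O(N^3 + D N^2)
w_dual = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(N), y)

assert np.allclose(w_primal, w_dual)  # both give the same w*
```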
<h4 id="representer-theorem">Representer theorem</h4>
<p>The representer theorem generalizes what we just saw about ridge regression. For a $\vec{w}^*$ minimizing the following, for any cost $\mathcal{L}_n$,</p>
<script type="math/tex; mode=display">\min_{\vec{w}}{\sum_{n=1}^N{
\mathcal{L}_n(\vec{x}_n^T \vec{w}, y_n) + \frac{\lambda}{2}\norm{\vec{w}}^2
}}</script>
<p>there exists $\pmb{\alpha^*}$ such that $\vec{w}^* = \vec{X}^T \pmb{\alpha}^*$.</p>
<h4 id="kernelized-ridge-regression">Kernelized ridge regression</h4>
<p>The above theorem gives us a new way of searching for $\vec{w}^*$: we can first search for $\pmb{\alpha^*}$, which might be easier, and then get back to the optimal weights by using the identity $\vec{w}^* = \vec{X}^T \pmb{\alpha}^*$.</p>
<p>Therefore, for ridge regression, we can equivalently optimize our alternative formula in terms of $\alpha$:</p>
<script type="math/tex; mode=display">\pmb{\alpha}^* = \argmin_{\pmb{\alpha}}{
\frac{1}{2}\pmb{\alpha}^T(\vec{XX}^T + \lambda \vec{I}_N)\pmb{\alpha}
- \pmb{\alpha}^T \vec{y}}</script>
<p>We see that our data enters in kernel form. How do we get the solution to this minimization problem? We can, as always, take the gradient of the cost function according to $\pmb{\alpha}$ and set it to zero:</p>
<script type="math/tex; mode=display">\nabla_{\pmb{\alpha}}\cost{\pmb{\alpha}}
= (\vec{XX}^T + \lambda \vec{I}_N)\pmb{\alpha} - \vec{y} = 0</script>
<p>Solving for $\alpha$ results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\pmb{\alpha}^* & = (\vec{XX}^T + \lambda \vec{I}_N)^{-1} \vec{y} \\
\vec{w}^* & = \vec{X}^T\pmb{\alpha}^*
= \vec{X}^T(\vec{XX}^T + \lambda \vec{I}_N)^{-1} \vec{y}
\end{align} %]]></script>
<p>We’ve effectively gotten back to our claimed alternative formulation for the optimal weights.</p>
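<p>A sketch of kernelized ridge regression in code (with the linear kernel it must reproduce the direct solution for $\vec{w}^*$; the function names and data are mine):</p>

```python
import numpy as np

def kernel_ridge_fit(K, y, lam):
    """alpha* = (K + lam I_N)^{-1} y, where K[i, j] = kappa(x_i, x_j)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(k_new, alpha):
    """Predict using kernel evaluations only: y = sum_n kappa(x_new, x_n) alpha_n."""
    return k_new @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = rng.normal(size=6)
alpha = kernel_ridge_fit(X @ X.T, y, lam=0.5)  # linear kernel K = X X^T
w = X.T @ alpha                                # recover w* = X^T alpha*
```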
<h4 id="kernel-functions">Kernel functions</h4>
<p>The kernel is defined as $\vec{K} = \vec{XX}^T$. We’ll call this the <strong>linear kernel</strong>. The elements are defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\vec{K} = \vec{XX}^T = \begin{bmatrix}
\vec{x}_1^T\vec{x}_1 & \vec{x}_1^T\vec{x}_2 & \cdots & \vec{x}_1^T\vec{x}_N \\
\vec{x}_2^T\vec{x}_1 & \vec{x}_2^T\vec{x}_2 & \cdots & \vec{x}_2^T\vec{x}_N \\
\vdots & \vdots & \ddots & \vdots \\
\vec{x}_N^T\vec{x}_1 & \vec{x}_N^T\vec{x}_2 & \cdots & \vec{x}_N^T\vec{x}_N \\
\end{bmatrix} %]]></script>
<p>The kernel matrix is a $N\times N$ matrix. Now, assume that we had first augmented the feature space with $\phi(\vec{x})$; the elements of the kernel would then be:</p>
<script type="math/tex; mode=display">% <![CDATA[
\vec{K} = \pmb{\Phi}\pmb{\Phi}^T = \begin{bmatrix}
\phi(\vec{x}_1)^T\phi(\vec{x}_1) & \phi(\vec{x}_1)^T\phi(\vec{x}_2) & \cdots & \phi(\vec{x}_1)^T\phi(\vec{x}_N) \\
\phi(\vec{x}_2)^T\phi(\vec{x}_1) & \phi(\vec{x}_2)^T\phi(\vec{x}_2) & \cdots & \phi(\vec{x}_2)^T\phi(\vec{x}_N) \\
\vdots & \vdots & \ddots & \vdots \\
\phi(\vec{x}_N)^T\phi(\vec{x}_1) & \phi(\vec{x}_N)^T\phi(\vec{x}_2) & \cdots & \phi(\vec{x}_N)^T\phi(\vec{x}_N) \\
\end{bmatrix} %]]></script>
<p>Using this formulation allows us to keep the size of $\vec{K}$ the same, regardless of how much we augment. In other words, we can now solve a problem where the size is independent of the feature space.</p>
<p>The feature augmentation goes from $\vec{x}_n \in \mathbb{R}^D$ to $\phi(\vec{x}_n) \in \mathbb{R}^{D’}$ with $D’ \gg D$, or even to an infinite dimension.</p>
<p>The big advantage of using kernels is that rather than first augmenting the feature space and then computing the kernel by taking the dot product, we can do both steps together, and we can do it more efficiently.</p>
<p>Let’s define a kernel function $\kappa(\vec{x}, \vec{x}’)$. We’ll let entries in the kernel $K$ be defined by:</p>
<script type="math/tex; mode=display">K_{i, j} = \kappa(\vec{x}_i, \vec{x}_j)</script>
<p>We can pick different kernel functions and get some interesting results. If we pick the right kernel, it can be equivalent to augmenting the features with some $\phi(\vec{x})$, and then computing the inner product:</p>
<script type="math/tex; mode=display">\kappa(\vec{x}, \vec{x}') = \phi(\vec{x})^T\phi(\vec{x}')</script>
<p>Hopefully, $\kappa$ is a simple enough function that it’ll still be easier to compute than going to the higher-dimensional space via $\phi$ and then computing the dot product.</p>
<p>Let’s take a look at a few examples of choices for $\kappa$ and see what happens. In the following, we’ll go the other way around, picking a $\kappa$ and showing that it’s equivalent to a particular feature augmentation $\phi$.</p>
<h5 id="trivial-kernels">Trivial kernels</h5>
<p>This is the trivial example, in which there is no feature augmentation. The following definition of $\kappa$ is equivalent to the identity “augmentation”:</p>
<script type="math/tex; mode=display">\kappa(\vec{x}_1, \vec{x}_2) = \vec{x}_1^T\vec{x}_2 \implies \phi(\vec{x}) = \vec{x}</script>
<p>Another trivial example assumes that $x_1, x_2 \in \mathbb{R}$. We’ll define the following kernel function, which is equivalent to the feature augmentation that takes the square:</p>
<script type="math/tex; mode=display">\kappa(x_1, x_2) = (x_1 \cdot x_2)^2 \implies \phi(x) = x^2</script>
<h5 id="polynomial-kernel">Polynomial kernel</h5>
<p>Let’s assume that $\vec{x}, \vec{x}’ \in\mathbb{R}^3$. Let’s define the kernel function as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\kappa(\vec{x}, \vec{x}')
& = \left(x_1 x'_1 + x_2 x'_2 + x_3 x'_3\right)^2 \\
& = \left( x_1 x'_1 \right)^2
+ \left( x_2 x'_2 \right)^2
+ \left( x_3 x'_3 \right)^2
+ 2 x_1 x'_1 x_2 x'_2
+ 2 x_1 x'_1 x_3 x'_3
+ 2 x_2 x'_2 x_3 x'_3
\end{align} %]]></script>
<p>What is the $\phi$ corresponding to this? The above is exactly the inner product $\phi(\vec{x})^T\phi(\vec{x}’)$, where $\phi$ is defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\phi(\vec{x}) = \begin{bmatrix}
\sqrt{2} x_1 x_2 &
\sqrt{2} x_1 x_3 &
\sqrt{2} x_2 x_3 &
x_1^2 &
x_2^2 &
x_3^2
\end{bmatrix} %]]></script>
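<p>We can verify this equivalence numerically (a small sketch; the test vectors are arbitrary):</p>

```python
import numpy as np

def kappa_poly(x, xp):
    """Polynomial kernel kappa(x, x') = (x^T x')^2 for x, x' in R^3."""
    return (x @ xp) ** 2

def phi(x):
    """Explicit 6-dimensional feature map matching the kernel above."""
    x1, x2, x3 = x
    return np.array([np.sqrt(2) * x1 * x2, np.sqrt(2) * x1 * x3, np.sqrt(2) * x2 * x3,
                     x1 ** 2, x2 ** 2, x3 ** 2])

x = np.array([1.0, 2.0, 3.0])
xp = np.array([0.5, -1.0, 2.0])
assert np.isclose(kappa_poly(x, xp), phi(x) @ phi(xp))  # both equal 20.25 here
```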
<h5 id="radial-basis-function-kernel">Radial basis function kernel</h5>
<p>The following kernel corresponds to an infinite feature map:</p>
<script type="math/tex; mode=display">\kappa(\vec{x}, \vec{x}') = \exp{\left[-(\vec{x} - \vec{x}')^T(\vec{x} - \vec{x}')\right]}</script>
<p>This is called the <em>radial basis function</em> (RBF) kernel.</p>
<p>Consider the special case in which $\vec{x}$ and $\vec{x}’$ are scalars; we’ll look at the Taylor expansion of the function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\kappa(x, x')
& = \exp{\left[-(x - x')^2\right]} \\
& = \exp{\left[-(x^2 + (x')^2 - 2xx')\right]} \\
& = e^{-x^2} e^{-(x')^2} e^{2xx'} \\
& = e^{-x^2} e^{-(x')^2}
\sum_{k=0}^\infty{\frac{2^k(x)^k(x')^k}{k!}}
\end{align} %]]></script>
<p>We can think of this infinite sum as the dot-product of two infinite vectors, whose $k$-th components are equal to, respectively:</p>
<script type="math/tex; mode=display">e^{-x^2} \sqrt{\frac{2^k}{k!}} x^k
\quad \text{and} \quad
e^{-(x')^2} \sqrt{\frac{2^k}{k!}} (x')^k</script>
<p>Although it isn’t obvious, we’ll state that this kernel cannot be represented as an inner product in finite-dimensional space; it is inherently the product of infinite dimensional vectors.</p>
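<p>For scalar inputs, we can check the expansion numerically by truncating the infinite feature map (a sketch; the truncation length and test values are arbitrary):</p>

```python
import numpy as np
from math import factorial

def rbf(x, xp):
    """kappa(x, x') = exp(-(x - x')^2) for scalar inputs."""
    return np.exp(-(x - xp) ** 2)

def phi_truncated(x, k_max=25):
    """First k_max components of the infinite feature map:
    component k is e^{-x^2} sqrt(2^k / k!) x^k."""
    k = np.arange(k_max)
    fact = np.array([factorial(i) for i in k], dtype=float)
    return np.exp(-x ** 2) * np.sqrt(2.0 ** k / fact) * x ** k

x, xp = 0.7, -0.3
approx = phi_truncated(x) @ phi_truncated(xp)  # converges to rbf(x, xp) as k_max grows
```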
<h5 id="new-kernel-functions-from-old-ones">New kernel functions from old ones</h5>
<p>We can simply construct a new kernel as a linear combination of old kernels:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\kappa(\vec{x}, \vec{x'})
& = a\kappa_1(\vec{x}, \vec{x'}) + b\kappa_2(\vec{x}, \vec{x'}),
& \quad \forall a, b \ge 0 \\
\kappa(\vec{x}, \vec{x'})
& = \kappa_1(\vec{x}, \vec{x'}) \kappa_2(\vec{x}, \vec{x'}) \\
\kappa(\vec{x}, \vec{x'})
& = \kappa_1(f(\vec{x}), f(\vec{x'})),
& \quad f: \mathbb{R}^D \rightarrow \mathbb{R}^D \\
\kappa(\vec{x}, \vec{x}')
& = f(\vec{x})f(\vec{x}'),
& \text{in which case } \phi(\vec{x}) = f(\vec{x}) \\
\end{align} %]]></script>
<p>Proofs are in the lecture notes. If we accept these rules, we can combine them to construct and validate much more complex kernel functions.</p>
<h3 id="classifying-with-the-kernel">Classifying with the kernel</h3>
<p>So far, we’ve seen how to compute the optimal parameter $\pmb{\alpha}$ using only the kernel, without having to go to the extended feature space. This also allows us to have infinite feature spaces. Now, let’s see how to use all of this to create predictions using only the kernel.</p>
<p>Recall that the classifier predicts $y_m = \phi(\vec{x}_m)^T\vec{w}^*$, and that by the representer theorem, $\vec{w}^* = \pmb{\Phi}^T \pmb{\alpha}^*$ in the augmented feature space. This leads us to:</p>
<script type="math/tex; mode=display">y_m = \phi(\vec{x}_m)^T \pmb{\Phi}^T \pmb{\alpha}^*
= \sum_{n=1}^N{\kappa(\vec{x}_m, \vec{x}_n)\alpha_n^*}</script>
<h3 id="properties-of-kernels">Properties of kernels</h3>
<p>How can we ensure that there exists a feature augmentation $\phi$ corresponding to a given kernel $\vec{K}$? A kernel function must be an inner-product in some feature space. Mercer’s condition states that we have this iff the following conditions are fulfilled:</p>
<ol>
<li>$K$ is symmetric, i.e. $\kappa(\vec{x}, \vec{x}’) = \kappa(\vec{x}’, \vec{x})$</li>
<li>For any arbitrary input set $\set{\vec{x}_n}$ and all $N$, $K$ is positive semi-definite</li>
</ol>
<h2 id="unsupervised-learning">Unsupervised learning</h2>
<p>So far, all we’ve done is supervised learning: we’ve started from a training set of feature vectors and labels, and we’ve wanted to output a classification or a regression.</p>
<p>There is a second very important framework in ML called <em>unsupervised</em> learning. Here, the training set is only composed of the feature vectors; there are no associated labels:</p>
<script type="math/tex; mode=display">\Strain = \set{(\vec{x}_n)}_{n=1}^N</script>
<p>We would then like to learn from this dataset without having access to the training labels. The two main directions in unsupervised learning are:</p>
<ul>
<li>Representation learning & feature learning</li>
<li>Density estimation & generative models</li>
</ul>
<p>Let’s take a bird’s eye view of the existing techniques through some examples.</p>
<ol>
<li><strong>Matrix factorization</strong>: can be used for both supervised and unsupervised. We’ll give an example for each
<ol>
<li><strong>Netflix, collaborative filtering</strong>: this is an example of supervised learning. We have a large, sparse matrix with rows of users, columns of movies, containing ratings. If we can approximate the matrix reasonably well by a matrix of rank one (i.e. outer product of two vectors), then this extracts useful features both for the users and the movies; it might group movies by genres, and users by type.</li>
<li><strong>word2vec</strong>: this is an example of unsupervised learning. The idea is to map every word from a large corpus to a vector $w_i \in \mathbb{R}^K$, where $K$ is relatively large. This would allow us to represent natural language in some numeric space. To get this, we build an $N\times N$ matrix, with $N$ being the number of words in the corpus. We then factorize the matrix by means of two matrices of rank $K$ to give us the desired representation. The results are pretty astounding, as <a href="https://www.tensorflow.org/tutorials/representation/word2vec">this article</a> shows; closely related words are close in the vector space, and it’s easy to get a mapping from concepts to associated concepts (say, countries to capitals).</li>
</ol>
</li>
<li><strong>PCA and SVD</strong> (Principal Component Analysis and Singular Value Decomposition): Features are vectors in $\mathbb{R}^d$ for some $d$. If we wanted to “compress” this down to one dimension (this doesn’t have to be an existing feature, it could be a newly generated one from the existing ones), we could ask that the variance of the projected data be as large as possible. This will lead us to PCA, which we compute using SVD.</li>
<li><strong>Clustering</strong>: to reveal structure in data, we can cluster points given some similarity measure (e.g. Euclidean distance) and the number of clusters we want. We can also ask clusters to be hierarchical (clusters within clusters).</li>
<li><strong>Generative models</strong>: a generative model models the distribution of the data
<ol>
<li><strong>Auto-encoders</strong>: these are a form of compression algorithm, trying to find good weights for encoding and compressing the data</li>
<li><strong>Generative Adversarial Networks</strong> (GANs): the idea is to use two neural nets, one that tries to generate samples that look like the data we get, and another that tries to distinguish the real samples from the fake ones. The aim is that after sufficient training, a classifier cannot distinguish real samples from artificial ones. If we achieve that, then we have built a good model.</li>
</ol>
</li>
</ol>
<h3 id="k-means">K-Means</h3>
<p>A common algorithm for unsupervised learning is called K-means (the same idea is known as vector quantization in signal processing, and a similar alternation appears in the Baum-Welch algorithm for hidden Markov models). The aim of this algorithm is to cluster the data: we want to find a partition such that every point is in exactly one group, and such that within a group, the (Euclidean) distance between points is much smaller than across groups.</p>
<p>In K-means, we find these clusters in terms of cluster centers $\pmb{\mu}$ (also called means). Each center dictates the partition: which cluster a point belongs to depends on which center is closest to the point. In other words, we’re minimizing the distance over all $N$ points and $K$ clusters:</p>
<script type="math/tex; mode=display">\min_{\pmb{\mu}, \vec{z}}{\mathcal{L}_{\text{K-means}}(\vec{z}, \pmb{\mu})}
= \min_{\set{\pmb{\mu}_k}, \set{z_{nk}}}{
\sum_{n=1}^N{\sum_{k=1}^K{
z_{nk} \norm{\vec{x}_n - \pmb{\mu}_k}^2
}}
}</script>
<p>Here, $z_{nk}$ is the k<sup>th</sup> entry of the vector $\vec{z}_n$, which is a one-hot vector encoding the cluster assignment. Every datapoint $\vec{x}_n$ has an associated vector $\vec{z}_n$ of length K, that takes value 1 at the index of the cluster to which $\vec{x}_n$ belongs, and 0 everywhere else. Mathematically, we can write this constraint as:</p>
<script type="math/tex; mode=display">z_{nk} \in \set{0, 1}, \quad \sum_{k=1}^K{z_{nk}} = 1</script>
<p>To recap, we have the following vectors:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\vec{z}_n & = \left[z_{n1}, z_{n2}, \dots, z_{nK} \right]^T \\
\vec{z} & = \left[\vec{z}_1, \vec{z}_2, \dots, \vec{z}_N\right]^T \\
\pmb{\mu} & = \left[\pmb{\mu}_1, \pmb{\mu}_2, \dots, \pmb{\mu}_K\right]^T \\
\end{align} %]]></script>
<p>This formulation of the problem gives rise to two conditions, which will give us an intuitive algorithm for solving this iteratively. We see that there are two sets of variables to optimize under: $\pmb{\mu}_k$ and $z_{nk}$. The idea is to fix one and optimize the other.</p>
<p>First, let’s fix the centers $\set{\pmb{\mu}_k}$ and work on the assignments. To minimize the sum:</p>
<script type="math/tex; mode=display">% <![CDATA[
z_{nk} = \begin{cases}
1, & k = \argmin_{k'}{\norm{\vec{x}_n - \pmb{\mu}_{k'}}^2} \\
0, & \text{otherwise}
\end{cases} %]]></script>
<p>Intuitively, this means that we’re grouping the points by the closest center.</p>
<p>Having computed this, we can fix the assignments $z_{nk}$ and compute the optimal centers $\pmb{\mu}_k$. Each center should be the centroid of its cluster, as this minimizes the total squared distance from the cluster’s points to their center.</p>
<script type="math/tex; mode=display">\pmb{\mu}_k = \frac{\sum_{n=1}^N{z_{nk} \vec{x}_n}}{\sum_{n=1}^N{z_{nk}}}</script>
<p>Note that in this formulation, $k$ is fixed by $\pmb{\mu}_k$, and $n$ varies in the sum. This gives us some kind of average: the sum of all the positions of the points in the cluster, divided by the number of points in the cluster.</p>
<p>How did we get to this formulation? If we take the derivative of the cost function and set it to zero, and then solve it for $\pmb{\mu}_k$, we get to the above.</p>
<script type="math/tex; mode=display">\nabla_{\pmb{\mu}_k}\mathcal{L}_{\text{K-means}}
= \sum_{n=1}^N{2 z_{nk} \pmb{\mu}_k - 2 z_{nk} \vec{x}_n}
= 0</script>
<p>Solving this confirms that taking the average position in the cluster indeed is the best way to optimize our cost.</p>
<p>These observations give rise to an algorithm:</p>
<ol>
<li>Initialize the centers $\set{\pmb{\mu}_k^{(0)}}$. In practice, the algorithm’s convergence may depend on this choice, but there is no general best strategy. As such, they can in general be initialized randomly.</li>
<li>Repeat until convergence:
<ol>
<li>Choose $\vec{z}^{(t+1)}$ given $\pmb{\mu}^{(t)}$</li>
<li>Choose $\pmb{\mu}^{(t+1)}$ given $\vec{z}^{(t+1)}$</li>
</ol>
</li>
</ol>
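<p>The two alternating steps above can be sketched in a few lines of NumPy. Note that the deterministic initialization and the toy data below are illustrative choices, not part of the algorithm itself:</p>

```python
import numpy as np

def kmeans(X, K, n_iters=100):
    """Alternate between assigning points to the nearest center
    and recomputing each center as the mean of its cluster."""
    N, D = X.shape
    # Illustrative deterministic initialization: evenly spaced datapoints.
    # In practice, centers are often initialized randomly.
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    for _ in range(n_iters):
        # Step 1: z_n = argmin_k ||x_n - mu_k||^2
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, K)
        z = dists.argmin(axis=1)                                     # (N,)
        # Step 2: mu_k = mean of the points assigned to cluster k
        for k in range(K):
            if (z == k).any():
                mu[k] = X[z == k].mean(axis=0)
    return mu, z

# Two well-separated blobs: K-means should separate them cleanly
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
mu, z = kmeans(X, K=2)
```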
<p>Each of these two steps will only make the partitioning better, if possible. Still, this may get stuck at a local minimum; there’s no guarantee of it converging to the global optimum, since it’s a greedy algorithm.</p>
<h4 id="coordinate-descent-interpretation">Coordinate descent interpretation</h4>
<p>There are other ways to look at K-means. One way is to think of it as a coordinate descent, minimizing a cost function by finding parameters $\pmb{\mu}$ and $\vec{z}$ iteratively:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\vec{z}^{(t+1)} & = \argmin_{\vec{z}} \cost{\vec{z}, \pmb{\mu}^{(t)}} \\
\pmb{\mu}^{(t+1)} & = \argmin_{\pmb{\mu}} \cost{\vec{z}^{(t+1)}, \pmb{\mu}}
\end{align} %]]></script>
<p>This doesn’t actually give us much new insight, but it’s a nice way to think about it.</p>
<h4 id="matrix-factorization-interpretation">Matrix factorization interpretation</h4>
<p>Another way to think about it is as a matrix factorization. We can rewrite K-means as the following minimization:</p>
<script type="math/tex; mode=display">\min_{\pmb{\mu}, \vec{z}}{\mathcal{L}_{\text{K-means}}(\vec{z}, \pmb{\mu})}
= \min_{\vec{M}, \vec{Z}}{\frobnorm{\vec{X}^T - \vec{M} \vec{Z}^T}^2}</script>
<p>A few notes on this notation:</p>
<ul>
<li>$\vec{X}$ is, as always, the $N\times D$ data matrix</li>
<li>$\vec{M}$ is a $D\times K$ matrix representing the mean, the $\pmb{\mu}_k$ vectors; each column represents a different center</li>
<li>$\vec{Z}^T$ is the $K\times N$ assignment matrix containing the $\vec{z}_n$ vectors. This means that the columns of $\vec{Z}^T$ are one-hot vectors, i.e. that exactly one element of each column of $\vec{Z}^T$ is 1</li>
<li>$\vec{X}^T - \vec{M} \vec{Z}^T$ computes a matrix whose columns are the vectors from each point to its corresponding cluster center.</li>
<li>The $\frobnorm{\cdot}$ norm here is the <a href="https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm">Frobenius norm</a>, the square root of the sum of the squares of all elements in the matrix. Squaring it therefore gives a sum of squared errors, which should be reminiscent of most loss functions we’ve used so far</li>
</ul>
<p>This is indeed a matrix factorization as we’re trying to find two matrices $\vec{M}$ and $\vec{Z}$ that minimize the above criterion.</p>
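<p>We can sanity-check this equivalence numerically on a tiny made-up example (the points, centers and assignments below are arbitrary):</p>

```python
import numpy as np

X = np.array([[0., 0.], [1., 0.], [10., 10.], [11., 10.]])  # (N, D) data matrix
M = np.array([[0.5, 10.5], [0.0, 10.0]])                    # (D, K): columns are centers
Z = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])              # (N, K): rows are one-hot z_n

# Sum formulation of the K-means cost
cost_sum = sum(Z[n, k] * np.sum((X[n] - M[:, k]) ** 2)
               for n in range(4) for k in range(2))

# Matrix-factorization formulation: squared Frobenius norm of X^T - M Z^T
cost_frob = np.linalg.norm(X.T - M @ Z.T, 'fro') ** 2

assert np.isclose(cost_sum, cost_frob)
```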
<h4 id="probabilistic-interpretation">Probabilistic interpretation</h4>
<p>A probabilistic interpretation of K-means will lead us to <a href="#gaussian-mixture-model-gmm">Gaussian Mixture Models (GMMs)</a>. Having a probabilistic approach is useful because it allows us to account for the model that we think generated the data.</p>
<p>The assumption is that we have generated the data by using $K$ separate $D$-dimensional Gaussian distributions. Each sample $\vec{x}_n$ comes from one of the $K$ distributions uniformly at random. These distributions are assumed to have means $\set{\pmb{\mu}_k}$, and the identity matrix as their covariance matrix (that is, variance 1 in each dimension, with independent dimensions).</p>
<p>Let’s write down the likelihood of a sample $\vec{x}_n$. It’s the Gaussian density function of the cluster to which the sample belongs:</p>
<script type="math/tex; mode=display">p(\vec{x}_n \mid \pmb{\mu}, \vec{z}) = \prod_{k=1}^K{\left(
\frac{1}{(2\pi)^{D/2}} \exp{\frac{-\norm{\vec{x}_n - \pmb{\mu}_k}^2}{2}}
\right)^{z_{nk}}}</script>
<p>The density assuming that we know that the points are from a given $k$ is what’s inside of the large parentheses. We use $z_{nk}$ in the exponent to cancel out the contributions of the clusters to which $\vec{x}_n$ does not belong, keeping only the contribution of its cluster.</p>
<p>Now, if we want the likelihood for the whole set instead of for a single sample, assuming that the samples are i.i.d, we can take the product over all samples:</p>
<script type="math/tex; mode=display">p(\vec{X}\mid\pmb{\mu},\vec{z})
= \prod_{n=1}^N{p(\vec{x}_n \mid \pmb{\mu}, \vec{z})}
= \prod_{n=1}^N{\prod_{k=1}^K{\left(
\frac{1}{(2\pi)^{D/2}} \exp{\frac{-\norm{\vec{x}_n - \pmb{\mu}_k}^2}{2}}
\right)^{z_{nk}}}}</script>
<p>This is the likelihood, which we want to maximize. Equivalently, we can minimize the negative log-likelihood. We’ll also remove the constant factor, as it has no influence on our minimization.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
-\log{p(\vec{X}\mid\pmb{\mu},\vec{z})}
& = -\log{\prod_{n=1}^N{p(\vec{x}_n \mid \pmb{\mu}, \vec{z})}} \\
& = -\log{\prod_{n=1}^N{\prod_{k=1}^K{\left(
\exp{\frac{-\norm{\vec{x}_n - \pmb{\mu}_k}^2}{2}}
\right)^{z_{nk}}}}} \\
& = \sum_{n=1}^N{\sum_{k=1}^K{z_{nk} \norm{\vec{x}_n - \pmb{\mu}_k}^2}}
\end{align} %]]></script>
<p>And this is of course the cost function we were optimizing before.</p>
<h4 id="issues-with-k-means">Issues with K-means</h4>
<ol>
<li>Computation may be heavy for large values of $N$, $D$ and $K$</li>
<li>Clusters are forced to be spherical (and cannot be elliptical for instance)</li>
<li>Each input can belong to only one cluster (this is known as “hard” cluster assignment, as opposed to “soft” assignment which allows for weighted memberships in different clusters)</li>
</ol>
<h3 id="gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</h3>
<p>So now that we’ve expressed K-means from a probabilistic view, let’s view the probabilistic generalization, which is called a Gaussian Mixture Model.</p>
<h4 id="clustering-with-gaussians">Clustering with Gaussians</h4>
<p>To generalize the previous, what if our data comes from Gaussian sources that aren’t perfectly circularly symmetric<sup id="fnref:isotropic"><a href="#fn:isotropic" class="footnote">10</a></sup>, that don’t have the identity matrix as variance? A more general solution is to allow for an arbitrary covariance matrix $\pmb{\Sigma}_k$. This adds another parameter that we need to optimize over, but can help us model the data more accurately.</p>
<h4 id="soft-clustering">Soft clustering</h4>
<p>Another extension concerns the assignments: previously, each point was forced to come from exactly one distribution. This is called hard clustering. We can generalize this to soft clustering, where a point can be associated with multiple clusters. In soft clustering, we model $z_n$ as a random variable taking values in $\set{1, \dots, K}$, instead of a one-hot vector $\vec{z}_n$.</p>
<p>This assignment is given by a certain distribution. We denote by $\pi_k$ the prior probability that the sample comes from the k<sup>th</sup> Gaussian $\normal{\pmb{\mu}_k, \pmb{\Sigma}_k}$:</p>
<script type="math/tex; mode=display">p(z_n = k) = \pi_k,
\quad \text{where } \pi_k > 0 \, \forall k \text{ and } \sum_{k=1}^K{\pi_k} = 1</script>
<h4 id="likelihood">Likelihood</h4>
<p>The likelihood in this extended model is then (still under the assumption that the samples are independently distributed):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(\vec{X}, \vec{z} \mid \pmb{\mu}, \pmb{\Sigma}, \pmb{\pi})
& = \prod_{n=1}^N{p(z_n \mid \pmb{\pi}) \normal{\vec{x}_n \mid z_n, \pmb{\mu}, \pmb{\Sigma}}} \\
& = \prod_{n=1}^N{
\prod_{k=1}^K{\left(\pi_k \normal{\vec{x}_n\mid\pmb{\mu}_k, \pmb{\Sigma}_k}\right)^{z_{nk}}}
} \\
\end{align} %]]></script>
<p>Our notation here maybe isn’t the best; we’re still using $z_{nk}$ as an indicator, but also $z_n$ as a random variable, and not a one-hot vector anymore. Therefore, to be clear, we should define $z_{nk} = \mathbb{I}\set{z_n = k}$.</p>
<p>This is the model that we’ll use. It’s not something that we aim to prove or not prove, it’s just what we chose to base ourselves on. We’ll want to optimize over $\pmb{\mu}$ and $\pmb{\Sigma}$.</p>
<p>The $\vec{z}_n$ variable is what’s known as a <strong>latent variable</strong>; it’s not something that we observe directly, it’s just something that we use to make our model more complex. The parameters of the model are $\pmb{\theta} := \set{\pmb{\mu}, \pmb{\Sigma}, \pmb{\pi}}$.</p>
<h4 id="marginal-likelihood">Marginal likelihood</h4>
<p>The advantage of treating $z_n$ as latent variables instead of parameters is that we can marginalize them out to get a cost function that doesn’t depend on them. If we’re not interested in these latent variables, we can integrate over them to get the <strong>marginal likelihood</strong>:</p>
<script type="math/tex; mode=display">p(\vec{X}\mid\pmb{\theta}) =
\prod_{n=1}^N{p(\vec{x}_n \mid \pmb{\theta})} =
\prod_{n=1}^N{\sum_{k=1}^K{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}}</script>
<figure>
<img alt="2D view of weighted gaussians forming a single distribution" src="/images/ml/gmm-multiple-gaussians.png" />
<figcaption>Multiple Gaussians form a single distribution in GMM</figcaption>
</figure>
<p>This is a weighted sum of all the models. The weights sum up to one, so we have a valid density. In other words, we are now able to model much more complex distribution functions by building up our distribution from $K$ Gaussian distributions.</p>
<figure>
<img alt="Weighted Gaussian bell curves" src="/images/ml/weighted-gaussians.svg" />
<figcaption>The $\pi_k$ factors allow us to weigh multiple Gaussian distributions</figcaption>
</figure>
<p>Assuming that $D, K \ll N$, the number of parameters in the model was $\mathcal{O}(N)$, because we had an assignment $\vec{z}_n$ for each of the $N$ datapoints. Now, assignments are no longer a parameter, so the number of parameters grows in $\mathcal{O}(D^2 K)$, since we have $K$ covariance matrices, which are $D \times D$, and $K$ $D$-dimensional clusters. Under our assumption that $D, K \ll N$, having $\mathcal{O}(D^2 K)$ parameters is much better.</p>
<h4 id="maximum-likelihood-1">Maximum likelihood</h4>
<p>We can optimize the fit of the model by changing the parameters of $\pmb{\theta}$ and optimizing the log likelihood of the above, which is:</p>
<script type="math/tex; mode=display">\hat{\pmb{\theta}} = \max_{\pmb{\theta}}{
\sum_{n=1}^N{
\log{\left(
\sum_{k=1}^K{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
\right)}
}
}</script>
<p>This can be optimized over $\pi_k, \pmb{\mu}_k, \pmb{\Sigma}_k$. Unfortunately, we now have the log of a sum of Gaussians (which are exponentials), which isn’t a very nice formula. We’ll use this as an excuse to talk about another algorithm, the EM algorithm.</p>
<h3 id="em-algorithm">EM algorithm</h3>
<p>In GMM, we had the following set of parameters:</p>
<script type="math/tex; mode=display">\pmb{\theta}^{(t)} := \set{
\set{\pmb{\mu}_k^{(t)}}_{k=1}^K,
\set{\pmb{\Sigma}_k^{(t)}}_{k=1}^K,
\set{\pi_k^{(t)}}_{k=1}^K
}</script>
<p>We wanted to optimize these parameters under the following maximization problem:</p>
<script type="math/tex; mode=display">\max_{\pmb{\theta}} \cost{\pmb{\theta}} =
\max_{\pmb{\theta}}{
\sum_{n=1}^N{
\log{\left(
\sum_{k=1}^K{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
\right)}
}
}</script>
<p>Note that in this problem, we’re maximizing the cost function instead of minimizing it as we’re used to. This is strictly equivalent to minimizing its negative, and we’ll use the two formulations interchangeably.</p>
<p>This is not an easy optimization problem, because we need to optimize the logarithm of a sum over all choices of $\pmb{\theta}$.</p>
<p>The <strong>expectation-maximization (EM) algorithm</strong> provides us with a general method to tackle this kind of problem. It uses an iterative two-step algorithm: at every step, we try to go from a set of parameters $\pmb{\theta}^{(t)}$ to a better set of parameters $\pmb{\theta}^{(t+1)}$.</p>
<p>In the following, we’ll consider an arbitrary probability distribution $q_n^{(t)}$ over $K$ members. Since it is a probability distribution, we have:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} \ge 0, \quad \sum_{k=1}^K{q_{nk}^{(t)}} = 1</script>
<p>The EM algorithm consists of optimizing for $q_{nk}$ and $\pmb{\theta}$ alternatively. Note that while every step improves the cost, there is no guarantee that this will converge to the global optimum.</p>
<p>We start by initializing $\pmb{\mu}^{(0)}, \pmb{\Sigma}^{(0)}, \pmb{\pi}^{(0)}$. Then, we iterate between the E and M steps until $\cost{\pmb{\theta}}$ stabilizes.</p>
<h4 id="expectation-step">Expectation step</h4>
<p>In the expectation step, we compute how well we’re doing:</p>
<script type="math/tex; mode=display">\cost{\pmb{\theta}^{(t)}} =
\sum_{n=1}^N{\log{\left(
\sum_{k=1}^K{\pi_k^{(t)} \normal{\vec{x}_n \mid \pmb{\mu}_k^{(t)}, \pmb{\Sigma}_k^{(t)}}}
\right)}}</script>
<p>We can then choose the new $q_{nk}^{(t)}$ values:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} = \frac{
\pi_k^{(t)} \normal{\vec{x}_n \mid \pmb{\mu}_k^{(t)}, \pmb{\Sigma}_k^{(t)}}
}{
\sum_{k=1}^K{\pi_k^{(t)} \normal{\vec{x}_n \mid \pmb{\mu}_k^{(t)}, \pmb{\Sigma}_k^{(t)}}}
}</script>
<p>This gives us a new lower bound on the cost:</p>
<script type="math/tex; mode=display">\cost{\pmb{\theta}^{(t+1)}}
\ge
\sum_{n=1}^N{\sum_{k=1}^K}{q_{nk}^{(t+1)} \log{\left(
\frac{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}{q_{nk}^{(t+1)}}
\right)}}</script>
<p>Getting a lower bound means that we have a monotonically non-decreasing cost over the steps $t$. Since we’re maximizing the cost, this is a good guarantee: it tells us that the E-step improves (or at least doesn’t worsen) the objective at every step.</p>
<p>This value is actually the expected value, hence the name of the E-step. We’ll see this in the interpretation section below.</p>
<details><summary><p>Derivation</p>
</summary><div class="details-content">
<p>Due to the concavity of the log function, we can apply <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a> to each term of the cost function to get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\log{\left( \sum_{k=1}^K{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}} \right)}
& = \log{\left(
\sum_{k=1}^K{
q_{nk}^{(t)}
\frac{
\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}
}{
q_{nk}^{(t)}
}
} \right)} \\
& \ge \sum_{k=1}^K{
q_{nk}^{(t)}
\log{\frac{
\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}
}{
q_{nk}^{(t)}
}}
} \\
\end{align} %]]></script>
<p>Just like in the <a href="https://en.wikipedia.org/wiki/Log_sum_inequality">log-sum inequality</a>, we have equality when the terms in the log are equal for all members of the sum. If that is the case, it means that all these terms are the same scalar, and therefore that the numerator and denominator are proportional:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} \propto \pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}</script>
<p>Since $q_{nk}$ is a probability, it must sum up to 1 so we have:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} = \frac{
\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}
}{
\sum_{k=1}^K{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
}</script>
</div></details>
<h4 id="maximization-step">Maximization step</h4>
<p>We update the parameters $\pmb{\theta}$ as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\pmb{\mu}_k^{(t+1)} & := \frac{\sum_n{q_{nk}^{(t)} \vec{x}_n}}{\sum_n{q_{nk}^{(t)}}} \\ \\
\pmb{\Sigma}_k^{(t+1)} & := \frac{
\sum_n{q_{nk}^{(t)} (\vec{x}_n - \pmb{\mu}_k^{(t+1)}) (\vec{x}_n - \pmb{\mu}_k^{(t+1)})^T}
}{
\sum_n{q_{nk}^{(t)}}
} \\ \\
\pi_k^{(t+1)} & := \frac{1}{N}\sum_n{q_{nk}^{(t)}}
\end{align} %]]></script>
<details><summary><p>Derivation</p>
</summary><div class="details-content">
<p>We had previously let $q_{nk}$ be an abstract, undefined distribution. We now freeze the $q_n^{(t)}$ assignments, and optimize over $\pmb{\theta}$.</p>
<p>In the E step, we derived a lower bound for the cost function. In general, the lower bound is not equal to the original cost. We can however carefully choose $q_{nk}$ to achieve equality. And since we want to maximize the original cost function, it makes sense to maximize this lower bound. Thus, we’ll work under this locked assignment of $q_{nk}$ (thus achieving equality for the lower bound). Seeing that we have equality, our objective function (which we want to maximize) is:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N \sum_{k=1}^K{
q_{nk}^{(t)}
\log{\frac{
\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}
}{
q_{nk}^{(t)}
}}
}</script>
<p>This leads us to maximizing the expression:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K}{
q_{nk}^{(t)} \left[
\log{\pi_k} - \log{q_{nk}^{(t)}} + \log{\normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
\right]
}</script>
<p>The $\pi_k$ should sum up to one, so we’re dealing with a constrained optimization problem. We add a Lagrange-multiplier term to turn it into an unconstrained problem, and then maximize the following over $\pmb{\theta}$:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K}{
q_{nk}^{(t)} \left[
\log{\pi_k} - \log{q_{nk}^{(t)}} + \log{\normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
\right] + \lambda \sum_{k=1}^K{\pi_k}
}</script>
<p>Differentiating with respect to $\pi_k$, and setting the result to 0 yields:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{q_{nk}^{(t)}} \frac{1}{\pi_k} + \lambda = 0</script>
<p>Solving for $\pi_k$ gives us:</p>
<script type="math/tex; mode=display">\pi_k = -\frac{1}{\lambda} \sum_{n=1}^N{q_{nk}^{(t)}}</script>
<p>We can choose $\lambda$ so that this leads to a proper normalization ($\pi_k$ summing up to 1); this leads us to $\lambda = -N$. Hence, we have:</p>
<script type="math/tex; mode=display">\pi_k^{(t+1)} := \frac{1}{N}\sum_{n=1}^N {q_{nk}^{(t)}}</script>
<p>This is our first update rule. Let’s see how to derive the others. The term $\log{\normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}$ has the form:</p>
<script type="math/tex; mode=display">-\frac{D}{2}\log{(2\pi)}
+\frac{1}{2}\log{\abs{\pmb{\Sigma}^{-1}}}
-\frac{1}{2}(\vec{x} - \pmb{\mu}_k)^T\pmb{\Sigma}^{-1}(\vec{x} - \pmb{\mu}_k)</script>
<p>We used the fact that for an invertible matrix, $\abs{\pmb{\Sigma}} = 1/\abs{\pmb{\Sigma}^{-1}}$. Differentiating the cost function with respect to $\pmb{\mu}_k$ and setting the result to 0 yields:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N {q_{nk}^{(t)} \pmb{\Sigma}^{-1}(\vec{x}_n - \pmb{\mu}_k)} = 0</script>
<p>We can multiply this by $\pmb{\Sigma}$ on the left to get rid of the $\pmb{\Sigma}^{-1}$, and solve for $\pmb{\mu}_k$ to get:</p>
<script type="math/tex; mode=display">\pmb{\mu}_k^{(t+1)} := \frac{
\sum_n q_{nk}^{(t)}\vec{x}_n
}{
\sum_n{q_{nk}^{(t)}}
}</script>
<p>Finally, for the $\pmb{\Sigma}$ update rule, we take the derivative with respect to $\pmb{\Sigma}_k^{-1}$ and set the result to 0, yielding:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{q_{nk}^{(t)} \frac{1}{2} \pmb{\Sigma}^T_k}
- \frac{1}{2}\sum_{n=1}^N{q_{nk}^{(t)}(\vec{x}_n - \pmb{\mu}_k)(\vec{x}_n - \pmb{\mu}_k)^T}
= 0</script>
<p>Solving for $\pmb{\Sigma}$ yields:</p>
<script type="math/tex; mode=display">\pmb{\Sigma}_k^{(t+1)} := \frac{
\sum_n{q_{nk}^{(t)} (\vec{x}_n - \pmb{\mu}_k^{(t+1)}) (\vec{x}_n - \pmb{\mu}_k^{(t+1)})^T}
}{
\sum_n{q_{nk}^{(t)}}
}</script>
<p>We’re using the following fact, which we won’t prove here:</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial \vec{A}} \log{\abs{\vec{A}}} = \vec{A}^{-T}</script>
</div></details>
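<p>Putting the E and M steps together, a minimal NumPy sketch of EM for a GMM could look as follows. The density helper, the deterministic initialization and the fixed iteration count are simplifying assumptions made for illustration:</p>

```python
import numpy as np

def gaussian_pdf(X, mu, Sigma):
    """Density N(x | mu, Sigma) evaluated at each row of X."""
    D = X.shape[1]
    diff = X - mu
    quad = np.einsum('nd,de,ne->n', diff, np.linalg.inv(Sigma), diff)
    norm = np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
    return np.exp(-0.5 * quad) / norm

def em_gmm(X, K, n_iters=50):
    N, D = X.shape
    # Simplified initialization: evenly spaced points as means,
    # identity covariances, uniform mixture weights
    mu = X[np.linspace(0, N - 1, K).astype(int)].copy()
    Sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iters):
        # E-step: responsibilities q_nk proportional to pi_k * N(x_n | mu_k, Sigma_k)
        q = np.stack([pi[k] * gaussian_pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)  # (N, K)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: the closed-form updates derived above
        Nk = q.sum(axis=0)                         # effective cluster sizes
        pi = Nk / N
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            diff = X - mu[k]
            Sigma[k] = (q[:, k, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma, q

# Two well-separated blobs: EM should recover means near 0 and 5
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (100, 2)), rng.normal(5, 0.3, (100, 2))])
pi, mu, Sigma, q = em_gmm(X, K=2)
```

A real implementation would work with log-densities to avoid numerical underflow, and would stop once the cost stabilizes rather than after a fixed number of iterations.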
<h4 id="interpretation">Interpretation</h4>
<p>The original model for GMM was that our data points are i.i.d. from a mixture model with $K$ Gaussian components. This led us to the following choice of prior distribution:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(\vec{x}_n \mid \pmb{\theta})
& = \sum_{k=1}^K {p(\vec{x}_n, z_n = k \mid \pmb{\theta})}
= \sum_{k=1}^K {p(z_n = k \mid \pmb{\theta}) p(\vec{x}_n \mid z_n = k, \pmb{\theta})} \\
& = \sum_{k=1}^K {\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
\end{align} %]]></script>
<p>Note that we can generalize the EM algorithm to other choices of $p(\vec{x}_n, z_n = k \mid \pmb{\theta})$, but that this is the one we used here.</p>
<p>This probability is an expectation based on the prior $\pi_k$. Let’s now look at the posterior distribution of $z_n$, given the datapoints $\vec{x}_n$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(z_n = k \mid \vec{x}_n, \pmb{\theta})
& = \frac{p(z_n = k, \vec{x}_n, \pmb{\theta})}
{p(\vec{x}_n, \pmb{\theta})}
= \frac{p(z_n = k, \vec{x}_n \mid \pmb{\theta})}
{p(\vec{x}_n \mid \pmb{\theta})} \\
& = \frac{p(z_n = k \mid \pmb{\theta})p(\vec{x}_n \mid z_n = k, \pmb{\theta})}
{p(\vec{x}_n \mid \pmb{\theta})} \\
& = \frac{p(z_n = k \mid \pmb{\theta})p(\vec{x}_n \mid z_n = k, \pmb{\theta})}
{\sum_{j=1}^K p(z_n = j\mid\pmb{\theta})p(\vec{x}_n\mid z_n = j, \pmb{\theta})} \\
& = \frac{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}
{\sum_{j=1}^K{\pi_j \normal{\vec{x}_n \mid \pmb{\mu}_j, \pmb{\Sigma}_j}}} =: q_{nk}
\end{align} %]]></script>
<p>The distribution that we previously just explained as an abstract, unknown distribution is in fact the posterior $p(z_n = k \mid \vec{x}_n, \pmb{\theta})$.</p>
<p>We can now explain why the E step is the <em>expectation</em> step. Assume that we know the $q_{nk}$ (as a thought experiment, imagine a genie told us the assignment probabilities of each sample $\vec{x}_n$ to a component $k$, which is exactly what the $q_{nk}$ quantities are).</p>
<p>As a reminder, the log-likelihood is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\log{p(\vec{x}_n, z_n = k \mid \pmb{\theta})}
& = \log{\left(
p(z_n = k \mid \pmb{\theta}) p(\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k)
\right)} \\
& = \log{\left(
\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}
\right)}
\end{align} %]]></script>
<p>Given the parameters $\pmb{\theta}$, the expected value of the above log-likelihood, over the distribution of $z_n$, is:</p>
<script type="math/tex; mode=display">\expectsub{z_n}{\log{p(\vec{x}_n, z_n = k \mid \pmb{\theta})}} =
\sum_{k=1}^K{q_{nk} \log{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}}</script>
<p>Summing this over all samples $\vec{x}_n$, we find the cost</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K{q_{nk} \log{\pi_k \normal{\vec{x}_n \mid \pmb{\mu}_k, \pmb{\Sigma}_k}}}}</script>
<p>This is almost the same as the expression we maximized in the derivation for the M step, modulo the terms $-q_{nk} \log{(q_{nk})}$, which are just constants for the maximization.</p>
<p>With this probabilistic interpretation, we can write the whole EM algorithm compactly as:</p>
<script type="math/tex; mode=display">\pmb{\theta}^{(t+1)} = \argmax_{\pmb{\theta}}{\expectsub{p\left(z_n \mid \vec{x}_n, \pmb{\theta}^{(t)}\right)}{\log{p(\vec{x}_n, z_n \mid \pmb{\theta})}}}</script>
<h2 id="matrix-factorization">Matrix Factorization</h2>
<p>Matrix factorization is a form of unsupervised learning. A well-known example in which matrix factorization was used is the Netflix prize. The goal was to predict ratings of users for movies, given a very sparse matrix of ratings. We’ll study the method that achieved the best error.</p>
<p>Let’s describe the data a little more formally. Given movies $d = 1, 2, \dots, D$ and users $n = 1, 2, \dots, N$, we define $\vec{X}$ as the $D\times N$ matrix<sup id="fnref:inverted-matrix-notation"><a href="#fn:inverted-matrix-notation" class="footnote">11</a></sup> containing all rating entries; that is, $x_{dn}$ is the rating of the n<sup>th</sup> user for the d<sup>th</sup> movie. We don’t have any additional information on the users or on the movies, apart from the ID that’s been assigned to them. In practice, we had $D=20’000$ movies and $N=500’000$ users, with 99.98% of the entries unobserved.</p>
<p>We want to give a prediction for all the unobserved entries, so that we can give the top entries (say, top 10 movies) for each user.</p>
<h3 id="prediction-using-a-matrix-factorization">Prediction using a matrix factorization</h3>
<p>We will aim to find $\vec{W}$ and $\vec{Z}$ such that:</p>
<script type="math/tex; mode=display">\vec{X} \approx \vec{W}\vec{Z}^T</script>
<p>The hope is to “explain” each rating $x_{dn}$ by a numerical representation of the corresponding movie and user.</p>
<p>Here, we have a “tall” matrix $W\in\mathbb{R}^{D\times K}$, and $\vec{Z}\in\mathbb{R}^{N\times K}$, forming a “flat matrix” $\vec{Z}^T \in \mathbb{R}^{K\times N}$. In practice, compared to the size of $N$ or $D$, $K$ will be relatively small (maybe 50 or so).</p>
<p>We’ll assign a cost function that we’re trying to optimize:</p>
<script type="math/tex; mode=display">\min_{\vec{W}, \vec{Z}} \cost{\vec{W}, \vec{Z}}
:= \min_{\vec{W}, \vec{Z}} \frac{1}{2} \sum_{(d, n)\in\Omega}{\left[
x_{dn} - (\vec{WZ}^T)_{dn}
\right]^2}</script>
<p>Here, $\Omega\subseteq [D]\times[N]$ is given; it collects the indices of the observed ratings in the input matrix $\vec{X}$. The cost function compares the number of stars $x_{dn}$ a user assigned to a movie with the prediction $(\vec{WZ}^T)_{dn}$ of our model, using squared error.</p>
<p>To optimize this cost function, we need to know whether it is jointly <em>convex</em> with respect to $\vec{W}$ and $\vec{Z}$, and whether it is <em>identifiable</em> (there is a unique minimum).</p>
<p>We won’t go into the full proof, but the answer is that the minimum is not unique. Since $\vec{WZ}^T$ is a product, we could just divide one factor by 10 and multiply the other by 10 to get a different solution with the same cost.</p>
<p>And in fact, it’s not even convex. In the scalar case, we could compute the Hessian of the product $w \cdot z$, which is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
0 & 1 \\
1 & 0
\end{bmatrix} %]]></script>
<p>This isn’t positive semi-definite, and therefore the product isn’t convex.</p>
<p>If we think of $W$ and $Z$ as numbers (or as $1\times 1$ matrices), we can give a simpler explanation, that also gives us the intuition for why this isn’t convex. The function $w\cdot z$ <a href="https://www.wolframalpha.com/input/?i=xy">looks like a saddle function</a>, and therefore isn’t convex.</p>
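<p>We can verify the saddle-point claim numerically: the Hessian above has one negative and one positive eigenvalue, so it is indefinite rather than positive semi-definite:</p>

```python
import numpy as np

# Hessian of f(w, z) = w * z with respect to (w, z)
H = np.array([[0., 1.],
              [1., 0.]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order
eigenvalues = np.linalg.eigvalsh(H)

# A mix of negative and positive eigenvalues: H is indefinite,
# so f is not convex (it is a saddle)
assert eigenvalues[0] < 0 < eigenvalues[1]
```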
<h3 id="choosing-k">Choosing K</h3>
<p>$K$ is the number of <em>latent features</em>. This is comparable to the K we chose in K-means, defining the number of clusters. Large values of K facilitate overfitting.</p>
<h3 id="regularization-1">Regularization</h3>
<p>We can add a regularizer and minimize the following cost:</p>
<script type="math/tex; mode=display">\cost{\vec{W}, \vec{Z}} =
\frac{1}{2} \sum_{(d, n)\in\Omega}{\left[
x_{dn} - (\vec{WZ}^T)_{dn}
\right]^2}
+ \frac{\lambda_w}{2}\frobnorm{\vec{W}}^2
+ \frac{\lambda_z}{2}\frobnorm{\vec{Z}}^2</script>
<p>With scalars $\lambda_w, \lambda_z > 0$.</p>
<h3 id="stochastic-gradient-descent">Stochastic gradient descent</h3>
<p>With our cost functions in place, we can look at our standard algorithm for minimization. We’ll define loss as a sum of many individual loss functions:</p>
<script type="math/tex; mode=display">\cost{\vec{W}, \vec{Z}} =
\sum_{(d, n)\in\Omega}{f_{d, n}(\vec{W}, \vec{Z})}
= \sum_{(d, n)\in\Omega}{\frac{1}{2}\left[
x_{dn} - (\vec{WZ}^T)_{dn}
\right]^2}</script>
<p>Let’s derive the stochastic gradient for an individual loss function (which is what we need to compute when doing SGD). Matrix calculus is not easy, but understanding it starts with understanding the following sentence: <em>a gradient with respect to a matrix is a matrix of gradients</em>. If we compute the gradient of a function $f$ with respect to a matrix $\vec{X}\in\mathbb{R}^{D\times N}$, we get a gradient matrix $\vec{g}\in\mathbb{R}^{D\times N}$, where each element $g_{a, b}$ is the derivative of $f$ with respect to the $(a, b)$ element of $\vec{X}$:</p>
<script type="math/tex; mode=display">g_{a, b} = \diff{f}{x_{a, b}}</script>
<p>Before we find the stochastic gradient, let’s start by just looking at the dimensions of what we’re going to compute:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla_{\vec{W}} f_{d, n} & \in \mathbb{R}^{D\times K} \\
\nabla_{\vec{Z}} f_{d, n} & \in \mathbb{R}^{N\times K}
\end{align} %]]></script>
<p>Luckily, we’re not doing the full gradient here, but only the <em>stochastic</em> gradient, which only requires computing a single entry in the gradient matrix. Therefore, for a fixed pair $(d, n)$ (that is, a rating from user $n$ of movie $d$), we will compute a single entry $(d’, k)$ in the $\vec{W}$ derivative:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left(\nabla_{\vec{W}} f_{d, n}\right)_{(d', k)}
= \diff{f_{d, n}}{w_{d', k}}(\vec{W}, \vec{Z})
= \begin{cases}
- \left[x_{dn} - (\vec{WZ}^T)_{dn} \right] z_{n, k} & \text{if } d' = d \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>The same goes for the derivative by $\vec{Z}$. We’ll compute a single entry $(n’, k)$ in $\nabla_{\vec{Z}} f_{d, n}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\left(\nabla_{\vec{Z}} f_{d, n}\right)_{(n', k)}
= \diff{f_{d, n}}{z_{n', k}}(\vec{W}, \vec{Z})
= \begin{cases}
- \left[x_{dn} - (\vec{WZ}^T)_{dn} \right] w_{d, k} & \text{if } n' = n \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>With these, we have the formulation for the whole matrices.</p>
<p>It turns out that computing this is very cheap: $\mathcal{O}(K)$. This is the greatest advantage of using SGD for this. There are no guarantees that this works though; this is still an open research question. But in practice, it works really well.</p>
<p>The update step is then:</p>
<script type="math/tex; mode=display">\vec{W}^{(t+1)} = \vec{W}^{(t)} - \gamma \nabla_{\vec{W}} f_{d, n} \\
\vec{Z}^{(t+1)} = \vec{Z}^{(t)} - \gamma \nabla_{\vec{Z}} f_{d, n} \\</script>
<p>With stochastic gradient descent, we only compute the gradient of a single $f_{d, n}$ instead of the whole cost $\mathcal{L}$. Therefore, each step only updates the d<sup>th</sup> row of $\vec{W}$, and the n<sup>th</sup> row of $\vec{Z}$.</p>
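<p>A sketch of these $\mathcal{O}(K)$ updates in NumPy; the learning rate, initialization scale and toy low-rank data are arbitrary choices for illustration:</p>

```python
import numpy as np

def sgd_mf(X, mask, K, lr=0.03, lam=0.0, n_epochs=2000, seed=0):
    """SGD on the matrix-factorization cost: each observed entry (d, n)
    updates only row d of W and row n of Z, at O(K) cost per step."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.normal(scale=0.1, size=(D, K))
    Z = rng.normal(scale=0.1, size=(N, K))
    observed = np.argwhere(mask)  # the index set Omega
    for _ in range(n_epochs):
        rng.shuffle(observed)
        for d, n in observed:
            err = X[d, n] - W[d] @ Z[n]        # x_dn - (W Z^T)_dn
            grad_W = -err * Z[n] + lam * W[d]  # stochastic gradient w.r.t. row d of W
            grad_Z = -err * W[d] + lam * Z[n]  # stochastic gradient w.r.t. row n of Z
            W[d] -= lr * grad_W
            Z[n] -= lr * grad_Z
    return W, Z

# Recover a small rank-2 matrix from all of its entries
rng = np.random.default_rng(1)
X_true = rng.normal(size=(6, 2)) @ rng.normal(size=(2, 8))
W, Z = sgd_mf(X_true, np.ones_like(X_true, dtype=bool), K=2)
```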
<h3 id="alternating-least-squares-als">Alternating least squares (ALS)</h3>
<p>The alternating minimization algorithm alternates between optimizing $\vec{Z}$ and $\vec{W}$. ALS is a special case of this, with square error.</p>
<h4 id="no-missing-entries">No missing entries</h4>
<p>For simplicity, let’s just assume that there are no missing entries in the data matrix, that is $\Omega = [D]\times[N]$ (instead of $\subseteq$). This makes our life a little easier, and we’ll be able to find a closed-form solution (indeed, if $\Omega$ is the whole set, the problem is pretty easy to solve; if it’s an arbitrary subset, it becomes an NP-hard problem). Our cost is then:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{\vec{W}, \vec{Z}}
& = \frac{1}{2}\sum_{d=1}^D\sum_{n=1}^N{\left[
x_{dn} - (\vec{WZ}^T)_{dn}
\right]^2}
+ \frac{\lambda_w}{2} \frobnorm{\vec{W}}^2
+ \frac{\lambda_z}{2} \frobnorm{\vec{Z}}^2 \\
& = \frac{1}{2}\frobnorm{\vec{X} - \vec{WZ}^T}^2
+ \frac{\lambda_w}{2} \frobnorm{\vec{W}}^2
+ \frac{\lambda_z}{2} \frobnorm{\vec{Z}}^2
\end{align} %]]></script>
<p>ALS then does a <strong>coordinate descent</strong> to minimize the cost (plus a regularizer). First, we fix $\vec{W}$ and compute the minimum with respect to $\vec{Z}$ (we ignore the other regularizer, as minimization is the same with or without an added constant):</p>
<script type="math/tex; mode=display">\min_{\vec{Z}}{
\frac{1}{2} \frobnorm{\vec{X} - \vec{WZ}^T}^2}
+ \frac{\lambda_z}{2} \frobnorm{\vec{Z}}^2</script>
<p>Then, we alternate, minimizing $\vec{W}$ and fixing $\vec{Z}$:</p>
<script type="math/tex; mode=display">\min_{\vec{W}}{
\frac{1}{2}\frobnorm{\vec{X} - \vec{WZ}^T}^2}
+ \frac{\lambda_w}{2} \frobnorm{\vec{W}}^2</script>
<p>These are two least squares problems. The only difference is that we’re searching for a whole matrix in this case, unlike in least squares where we searched for a vector. Still, we can find a closed form for it by setting the gradient with respect to $\vec{W}$ and then $\vec{Z}$ to 0, which will give:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
(\vec{Z}^*)^T & := (\vec{W}^T \vec{W} + \lambda_z \vec{I}_K)^{-1} \vec{W}^T \vec{X} \\
(\vec{W}^*)^T & := (\vec{Z}^T \vec{Z} + \lambda_w \vec{I}_K)^{-1} \vec{Z}^T \vec{X}^T \\
\end{align} %]]></script>
<p>Note that the regularization helps ensure that the matrix we invert is indeed invertible (since we’re adding a scaled identity matrix). This means that we can find a closed form solution if we don’t have any missing entries.</p>
<p>The cost of finding the solution in each step is then, per column, $\mathcal{O}(N)$ and $\mathcal{O}(D)$, which is not quite as good as the $\mathcal{O}(K)$ of SGD. Additionally, we need to construct $\vec{W}^T\vec{W}$ and $\vec{Z}^T\vec{Z}$, which is $\mathcal{O}(D^2)$. The inversion isn’t too bad: we’re only inverting a $K\times K$ matrix, which is much nicer than dealing with $D$ or $N$. Also note that there is no step size to tune, which makes ALS easier to deal with (though slower!).</p>
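<p>A minimal numpy sketch of these alternating closed-form updates, on a hypothetical toy matrix (all entries observed):</p>

```python
import numpy as np

def als(X, K=2, lam=0.1, n_iters=50, seed=0):
    """ALS for X ~ W Z^T with no missing entries (illustrative sketch)."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, K))
    Z = rng.standard_normal((N, K))
    I = np.eye(K)
    for _ in range(n_iters):
        # Fix W, solve (W^T W + lam I) Z^T = W^T X for Z
        Z = np.linalg.solve(W.T @ W + lam * I, W.T @ X).T
        # Fix Z, solve (Z^T Z + lam I) W^T = Z^T X^T for W
        W = np.linalg.solve(Z.T @ Z + lam * I, Z.T @ X.T).T
    return W, Z

X = np.arange(12, dtype=float).reshape(3, 4)   # a rank-2 toy matrix
W, Z = als(X, K=2)
err = np.linalg.norm(X - W @ Z.T)
```

Since the toy matrix has rank 2, the rank-2 factorization recovers it up to the small shrinkage introduced by the regularizers.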
<h4 id="missing-entries">Missing entries</h4>
<p>As before, we can derive the ALS updates for the more general setting, where we only have certain ratings $(d, n)\in\Omega$. The idea is to compute the gradient with respect to each group of variables, and set it to zero.</p>
<h3 id="text-representation-learning">Text representation learning</h3>
<h4 id="co-occurrence-matrix">Co-occurrence matrix</h4>
<p>To attempt to get the meaning of words, we can start by constructing co-occurrence counts from a big corpus of text. This is a matrix $\vec{N}$ in which $n_{ij}$ is the number of contexts where word $w_i$ occurs together with word $w_j$. A context is a window of words occurring together (it could be a document, paragraph, sentence, or a window of $n$ words).</p>
<p>For a vocabulary $\nu = \set{w_1, \dots, w_D}$ and context words $w_n$, $n = 1, 2, \dots, N$, the co-occurrence matrix is a very sparse $D\times N$ matrix.</p>
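<p>As a small illustrative sketch (a hypothetical toy corpus, context defined as a symmetric window), the counts could be gathered like this:</p>

```python
from collections import Counter

def cooccurrence_counts(corpus, window=2):
    """Count pairs (w_i, w_j) occurring within +/- `window` positions of each other."""
    counts = Counter()
    for sentence in corpus:
        for i, w in enumerate(sentence):
            lo, hi = max(0, i - window), min(len(sentence), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(w, sentence[j])] += 1
    return counts

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]
counts = cooccurrence_counts(corpus, window=1)
```

Most word pairs never co-occur, which is why the resulting matrix is very sparse.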
<h4 id="motivation-1">Motivation</h4>
<p>We can’t plug string-encoded words directly into our learning models. Can we find a meaningful numerical representation for all of our data? We’d like to find a mapping, or <strong>embedding</strong>, for each word $w_i$:</p>
<script type="math/tex; mode=display">w_i \mapsto \vec{w}_i \in \mathbb{R}^K</script>
<p>To construct a word embedding, we want to find a factorization of the co-occurrence matrix $\vec{N}$. Typically, we actually use $\vec{X} = \log{\vec{N}}$ as the element-wise log of the co-occurrence matrix, i.e. $x_{dn} := \log{(n_{dn})}$. We’ll find a factorization such that:</p>
<script type="math/tex; mode=display">\vec{X} \approx \vec{W}\vec{Z}^T</script>
<p>As before, we let $\Omega\subseteq [D] \times [N]$ collect the indices of non-zero counts in $\vec{X}$. In other words, $\Omega$ contains indices of word pairs that have been observed in the same context.</p>
<p>For each pair of observed words $(w_d, w_n) \in \Omega$, we’ll try to explain their co-occurrence count by a numerical representation of the two words; the d<sup>th</sup> row of $\vec{W}$ is the representation of a word $w_d$, and n<sup>th</sup> row of $\vec{Z}$ is the representation of a context word $w_n$.</p>
<h4 id="bag-of-words">Bag of words</h4>
<p>The naive approach would be to pick $K$ to be the size of the vocabulary, $K = \abs{\nu}$. We can then encode words $w_i$ as one-hot vectors taking value 1 at index $i$. This works nicely, but has high dimensionality, and cannot capture the order of the words, which is why it’s called the <strong>bag of words</strong> approach.</p>
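<p>A quick sketch of this one-hot encoding on a hypothetical four-word vocabulary, showing why word order is lost:</p>

```python
import numpy as np

vocab = ["cat", "dog", "sat", "the"]

def one_hot(word):
    """One-hot encoding: K = |vocab| dimensions, 1 at the word's index."""
    v = np.zeros(len(vocab))
    v[vocab.index(word)] = 1.0
    return v

def bag_of_words(sentence):
    """Sum of one-hot vectors: word counts; any information about order is lost."""
    return sum(one_hot(w) for w in sentence)

bow = bag_of_words(["the", "cat", "sat"])
```

Any permutation of the sentence yields exactly the same vector, hence the name “bag” of words.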
<p>But we can do this in a smarter way. The idea is to pick a much lower $K$, and try to group semantically similar words in this $K$-dimensional space.</p>
<p><img src="/images/ml/semantic-hyperspace.png" alt="Words with different semantic meanings in different areas of hyperspace" /></p>
<h4 id="word2vec">Word2vec</h4>
<p><a href="https://code.google.com/archive/p/word2vec/">word2vec</a> is an implementation of the skip-gram model. This model uses binary classification (like logistic regression) to separate real word pairs $(w_d, w_n)$ appearing together in a context window, from fake word pairs $(w_d, w_{n’})$.</p>
<p>It does so by computing the inner product score of the words; $\vec{w}_d^T \vec{w}_n$ is real, and must be distinguished from the fake $\vec{w}_d^T \vec{w}_{n’}$.</p>
<h4 id="glove">GloVe</h4>
<p>In the following, we’ll give an overview of the method known as <a href="https://nlp.stanford.edu/projects/glove/">GloVe (Global Vectors)</a>, which offers an alternative to word2vec.</p>
<p>To do this, we do the following cost minimization:</p>
<script type="math/tex; mode=display">\min_{\vec{W}, \vec{Z}} \cost{\vec{W}, \vec{Z}}
:= \min_{\vec{W}, \vec{Z}} \frac{1}{2} \sum_{(d, n)\in\Omega} f_{dn} \left( x_{dn} - (\vec{W}\vec{Z}^T)_{dn} \right)^2</script>
<p>The GloVe embedding uses a little trick to weight the importance of each entry. It computes a weight $f_{dn}$ used in the cost above, according to the following function:</p>
<script type="math/tex; mode=display">f_{dn} = \min\set{1, \left(\frac{n_{dn}}{ n_{\text{max}} }\right)^\alpha},
\quad \alpha\in[0, 1], \text{ e.g. } \alpha = \frac{3}{4}</script>
<p>Where $n_{\text{max}}$ is a parameter to be tuned, and $n_{dn}$ is the count of $w_d$ and $w_n$ appearing together (not the log, just the raw count). This function was carefully chosen by the GloVe creators; we can also choose $f_{dn} := 1$ if we don’t want to weight the entries, but GloVe achieves good results with this choice.</p>
<p><img src="/images/ml/glove-weight-function.png" alt="Glove weight function" /></p>
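<p>The weight function plotted above is simple enough to sketch directly (hypothetical parameter values):</p>

```python
def glove_weight(n_dn, n_max=100.0, alpha=0.75):
    """GloVe weight f_dn = min(1, (n_dn / n_max)^alpha); n_max and alpha are tunable."""
    return min(1.0, (n_dn / n_max) ** alpha)
```

Rare pairs get down-weighted, while anything at or above $n_{\text{max}}$ is capped at weight 1.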
<p>For $K$, we can just choose a value, say 50, 100 or 200. Trial and error will serve us well here.</p>
<p>We can train the factorization with <a href="#stochastic-gradient-descent">SGD</a> or <a href="#alternating-least-squares-als">ALS</a>.</p>
<h4 id="fasttext">FastText</h4>
<p>This is another matrix factorization approach to learn document or sentence representations. Unlike the two previous approaches, <a href="https://github.com/facebookresearch/fastText">FastText</a> is a supervised algorithm.</p>
<p>A sentence $s_n$ is composed of $m$ words: $s_n = \set{w_1, w_2, \dots, w_m}$. We try to optimize over the following cost function:</p>
<script type="math/tex; mode=display">\min_{\vec{W}, \vec{Z}} \cost{\vec{W}, \vec{Z}} :=
\min_{\vec{W}, \vec{Z}} \sum_{s_n \text{ a sentence}} f(y_n \vec{WZ}^T\vec{x}_n)</script>
<p>Where:</p>
<ul>
<li>$\vec{W}\in\mathbb{R}^{1\times K}$ and $\vec{Z}\in\mathbb{R}^{\abs{\nu}\times K}$ are the factorization</li>
<li>$\vec{x}_n\in\mathbb{R}^{\abs{\nu}}$ is the bag-of-words representation of sentence $s_n$</li>
<li>$f$ is a linear classifier loss function, such as the logistic function or hinge loss</li>
<li>$y_n\in\set{\pm 1}$ is the classification label for sentence $s_n$</li>
</ul>
<h2 id="svd-and-pca">SVD and PCA</h2>
<h3 id="motivation-2">Motivation</h3>
<p><strong>Principal Component Analysis</strong> (PCA) is a popular <em>dimensionality reduction</em> method. Given a data matrix, we’re looking for a way to linearly map the original $D$ dimensions into $K$ dimensions, with $K \le D$. The criterion for a good mapping is that the $K$-dimensional representation should represent the original data well.</p>
<p>There are different ways to think of PCA:</p>
<ul>
<li>It <em>compresses data</em> from $D$ to $K$ dimensions</li>
<li>It <em>decorrelates data</em>, finding a $K$-dimensional space with maximum variance</li>
</ul>
<p>For machine learning, it’s often best not to compress data in this manner, but it may be necessary in certain situations (for reasons of interpretability for example).</p>
<p>In our subsequent discussion, $\vec{X}$ is the $D \times N$ data matrix, whose $N$ columns represent the feature vectors in $D$-dimensional space.</p>
<p>The PCA will be computed from the data matrix $\vec{X}$ using singular value decomposition.</p>
<h3 id="svd">SVD</h3>
<p>The <strong>singular value decomposition</strong> (SVD) of a $D \times N$ matrix $\vec{X}$ is:</p>
<script type="math/tex; mode=display">\vec{X} = \vec{USV}^T</script>
<p>The matrices</p>
<ul>
<li>$\vec{U}$ is a $D \times D$ orthonormal<sup id="fnref:orthonormal"><a href="#fn:orthonormal" class="footnote">12</a></sup> matrix</li>
<li>$\vec{V}$ is a $N \times N$ orthonormal matrix</li>
<li>$\vec{S}$ is a $D\times N$ diagonal matrix (with $D$ diagonal entries)</li>
</ul>
<p>One useful property of unitary matrices (like $\vec{U}$ and $\vec{V}$, which are real orthogonal, a special case) is that they preserve norms (they don’t change the length of the vectors being transformed), meaning that we can think of them as rotations. A small proof of this follows:</p>
<script type="math/tex; mode=display">\frobnorm{\vec{Ux}}^2 = \vec{x}^T\vec{U}^T\vec{Ux} = \vec{x}^T\vec{I}\vec{x} = \frobnorm{\vec{x}}^2</script>
<p>We’ll assume $D < N$ without loss of generality (we could just take the transpose of $\vec{X}$ otherwise). This is an arbitrary choice, but helps us tell the dimensions apart.</p>
<p>The diagonal entries in $\vec{S}$ are the <em>singular values</em> in descending order:</p>
<script type="math/tex; mode=display">s_1 \ge s_2 \ge \dots \ge s_D \ge 0</script>
<p>The columns of $\vec{U}$ and $\vec{V}$ are the <em>left</em> and <em>right singular vectors</em>.</p>
<h3 id="svd-and-dimensionality-reduction">SVD and dimensionality reduction</h3>
<p>Suppose we want to compress a $D\times N$ data matrix $\vec{X}$ to a $K\times N$ matrix $\tilde{\vec{X}}$, where $1 \le K \le D$. We’ll define this transformation from $\vec{X}$ to $\tilde{\vec{X}}$ by the $K\times D$ compression matrix $\vec{C}$. The decompression (or reconstruction) from $\tilde{\vec{X}}$ back to $\vec{X}$ is given by the $D\times K$ reconstruction matrix $\vec{R}$.</p>
<p>Can we find good matrices? Our criterion is that the error introduced when compressing and reconstructing should be small, over all choices of compression and reconstruction matrices:</p>
<script type="math/tex; mode=display">\frobnorm{\vec{X} - \vec{R}\vec{C}\vec{X}}^2</script>
<p>There are other ways of measuring the quality of a compression and reconstruction, but for the sake of simplicity, we’ll stick to this one.</p>
<p>We can actually place a bound on the reconstruction error using the following lemma.</p>
<hr />
<p><strong>Lemma</strong>: For any $D \times N$ matrix $\vec{X}$ and any $D\times N$ rank-K matrix $\hat{\vec{X}}$:</p>
<script type="math/tex; mode=display">\frobnorm{\vec{X} - \hat{\vec{X}}}^2 \ge \frobnorm{\vec{X} - \vec{U}_K \vec{U}_K^T \vec{X}} = \sum_{i \ge K+1}{s_i^2}</script>
<p>Where:</p>
<ul>
<li>$\vec{X} = \vec{U}\vec{S}\vec{V}^T$ is the SVD of $\vec{X}$</li>
<li>$s_i$ are the singular values of $\vec{X}$</li>
<li>$\vec{U}_K$ is the $D\times K$ matrix of the first $K$ columns of $\vec{U}$</li>
</ul>
<hr />
<p>If we use $\vec{C} = \vec{U}_K^T$ as our compression matrix, and $\vec{R} = \vec{U}_K$ as the reconstruction matrix, we get a better (or equal) error than any other choice of reconstruction $\hat{\vec{X}}$. This tells us that the best compression to dimension $K$ is a projection onto the first $K$ columns of $\vec{U}$, which are the first $K$ left singular vectors.</p>
<p>Note that the reconstruction error is the sum of the singular values after the cut-off $K$; intuitively, we can think of the error as coming from the singular values we ignored.</p>
<p>This also tells us that the left singular vectors are ordered in decreasing order of importance. In other words, the above choice of compression uses the <em>principal</em> components, the most important ones. This is what really defines PCA.</p>
<p>The term $\vec{U}_K \vec{U}_K^T \vec{X}$ has another simple interpretation. Let $\vec{S}^{(K)}$ be the $D\times N$ diagonal matrix corresponding to a truncated version of $\vec{S}$. It is of the same size, but only has the $K$ first diagonal values of $\vec{S}$, and is zero everywhere else. We claim that:</p>
<script type="math/tex; mode=display">\vec{U}_K \vec{U}_K^T \vec{X} = \vec{U}_K \vec{U}_K^T \vec{USV}^T = \vec{US}^{(K)}\vec{V}^T</script>
<blockquote>
<p>👉 It’s okay to drop the $K$ subscript on the $\vec{U}$ matrix because $\vec{S}^{(K)}$ already takes care of selecting the first $K$ columns</p>
</blockquote>
<p>This tells us that the <em>best</em> rank $K$ approximation of a matrix is obtained by computing its SVD, and truncating it at $K$.</p>
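<p>We can check both claims numerically with a small sketch (random toy matrix, numpy): the truncated SVD gives the best rank-$K$ approximation, and its error is exactly the sum of the discarded squared singular values.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D, N, K = 4, 6, 2
X = rng.standard_normal((D, N))

# Compact SVD: U is D x D here (D < N), s holds the D singular values
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Truncate at K: the best rank-K approximation U_K S^(K) V^T
X_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Reconstruction error equals the sum of the squared singular values after the cut-off
err2 = np.linalg.norm(X - X_K, "fro") ** 2
tail = np.sum(s[K:] ** 2)
```

The match between <code>err2</code> and <code>tail</code> is exact (up to floating point).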
<h4 id="svd-and-matrix-factorization">SVD and matrix factorization</h4>
<p>Expressing $\vec{X}$ as an SVD allows us to easily get a matrix factorization.</p>
<script type="math/tex; mode=display">\vec{X}
= \vec{USV}^T
= \underbrace{\vec{U}}_{\vec{W}} \underbrace{\vec{SV}^T}_{\vec{Z}^T}
= \vec{WZ}^T</script>
<p>This is clearly a special case of the matrix factorization as we saw it previously. In this form, the matrix factorization is an exact equality, not an approximation—though in all fairness, this one uses $K = D$. We get a less perfect (but still optimal) factorization with lower values of $K$.</p>
<p>There are two differences from the general case:</p>
<ul>
<li>We don’t need to preselect the rank $K$ from the start. We can compute the full SVD, and control $K$ at any time later, letting it range from 1 to $\min(D, N)$.</li>
<li>Matrix factorization started with a $\vec{X}$ with many missing entries; the idea was that the factorization should model the existing entries well, so that we can predict the missing values. This is not something that the SVD can do.</li>
</ul>
<p>As we’ve discussed previously, this is the <em>best</em> rank K approximation that we can find, as the Frobenius norm of the difference between the approximation and the true value is the smallest possible (sum of the squares of the singular values).</p>
<p>In response to the first point above, note that we still can preselect $K$ and compute the matrix factorization that defines our dimensionality reduction:</p>
<script type="math/tex; mode=display">\vec{X}_K
= \vec{U}_K \vec{S}^{(K)} \vec{V}^T
= \underbrace{\vec{U}_K}_{\vec{W}}
\underbrace{\vec{S}^{(K)}\vec{V}^T}_{\vec{Z}^T}
= \vec{W}\vec{Z}^T</script>
<h3 id="pca-and-decorrelation">PCA and decorrelation</h3>
<p>Assume that we have $N$ $D$-dimensional points in a $D\times N$ matrix $\vec{X}$. We can compute the empirical mean and covariance by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\bar{\vec{x}} & = \frac{1}{N} \sum_{n=1}^N {\vec{x}_n} \\
\vec{K} & = \frac{1}{N} \sum_{n=1}^N (\vec{x}_n - \bar{\vec{x}}) (\vec{x}_n - \bar{\vec{x}})^T
\end{align} %]]></script>
<p>The covariance matrix $\vec{K}$ is a $D \times D$ matrix, built as a sum of $N$ rank-1 terms. If our data consists of i.i.d. samples, then these empirical values converge to the true values as $N \rightarrow \infty$.</p>
<p>Before we do PCA, we need to <em>center the data around the mean</em>. Let’s assume our data matrix $\vec{X}$ has been preprocessed as such. Using the SVD, we can rewrite the empirical covariance matrix as:</p>
<script type="math/tex; mode=display">N\vec{K}
= \sum_{n=1}^N {(\vec{x}_n \vec{x}_n^T)}
= \vec{X}\vec{X}^T
= \vec{U}\vec{S}\vec{V}^T \vec{V}\vec{S}^T \vec{U}^T
= \vec{U}\vec{S}\vec{S}^T \vec{U}^T
= \vec{U}\vec{S}_D^2 \vec{U}^T</script>
<p>This works because $\vec{V}$ is an orthogonal matrix, so $\vec{V}^T\vec{V} = I_N$, and $\vec{S}$ is diagonal, so $\vec{SS}^T = \vec{S}_D^2$, where $\vec{S}_D$ is the $D\times D$ diagonal matrix formed by the first $D$ columns of $\vec{S}$.</p>
<p>PCA finds orthogonal axes centered at the mean, that represent the most variance, in decreasing order of variance. Starting with orthogonal axes, it finds the rotation $\vec{U}^T$ so that the axes point in the direction of maximum variance. This can be seen in <a href="http://setosa.io/ev/principal-component-analysis/">this visual explanation of PCA</a>.</p>
<p>With this intuition about PCA in mind, let’s try to decompose the covariance again, but this time considering the transformed, compressed data $\tilde{\vec{X}} = \vec{U}_K^T\vec{X}$. The empirical covariance along these transformed axes is (using the full matrix $\vec{U}$ here for simplicity):</p>
<script type="math/tex; mode=display">N \tilde{\vec{K}}
= \tilde{\vec{X}} \tilde{\vec{X}}^T
= \vec{U}^T\vec{X}\vec{X}^T\vec{U}
= \vec{U}^T\vec{US}_D^2\vec{U}^T\vec{U}
= \vec{S}_D^2</script>
<p>Here, the empirical covariance is <em>diagonal</em>. This means that through PCA, we’ve transformed our data to make the various components <strong>uncorrelated</strong>. This gives us some intuition of why it may be useful to first transform the data with the rotation $\vec{U}^T\vec{X}$.</p>
<p>Additionally, by the definition of SVD, the singular values are in decreasing order (so the first one, $s_1$, is the greatest one). Since we have a diagonal matrix as our empirical variance, it means that the variance of the first component is $s_1^2$, which proves the property of PCA’s axes being in decreasing order of variance.</p>
<p>Assume that we’re doing classification. Intuitively, it makes sense that classifying features with a larger variance would be easier (when the variance is 0, all data is the same and it becomes impossible to classify using that component). From this point of view, it makes intuitive sense to only keep the first $K$ rows of $\tilde{\vec{X}}$ when we perform dimensionality reduction; we keep the features that have high variance and are uncorrelated, and we discard all features with variance close to 0 as they’re hard to classify.</p>
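<p>A small numpy sketch (hypothetical toy data with three features of very different variances) illustrating the decorrelation property:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K = 3, 500, 2
# Features with variances of very different magnitudes
X = rng.standard_normal((D, N)) * np.array([[3.0], [1.0], [0.1]])
X = X - X.mean(axis=1, keepdims=True)      # center the data around the mean

U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_tilde = U[:, :K].T @ X                   # compressed K x N representation

C = (X_tilde @ X_tilde.T) / N              # empirical covariance of X_tilde
```

The off-diagonal entries of <code>C</code> vanish, and the diagonal entries are $s_i^2 / N$, in decreasing order.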
<h3 id="computing-the-svd-efficiently">Computing the SVD efficiently</h3>
<p>To compute the SVD of a matrix $\vec{X}$, we must compute the matrices $\vec{U}$ and $\vec{S}$. Let’s see how we can do this efficiently.</p>
<p>Let’s consider the $D\times D$ matrix $\vec{XX}^T$. As before, since $\vec{V}$ is orthogonal, we can use the SVD to get:</p>
<script type="math/tex; mode=display">\vec{X}\vec{X}^T
= \vec{USV}^T\vec{VS}^T\vec{U}^T
= \vec{USS}^T\vec{U}^T
= \vec{U} \vec{S}_D^2 \vec{U}^T</script>
<p>Let $\vec{u}_j$ denote the j<sup>th</sup> column of $\vec{U}$.</p>
<script type="math/tex; mode=display">\vec{XX}^T \vec{u}_j = \vec{U}\vec{S}_D^2 \vec{U}^T \vec{u}_j = s_j^2 \vec{u}_j</script>
<p>We see that the j<sup>th</sup> column of $\vec{U}$ is the j<sup>th</sup> eigenvector of $\vec{XX}^T$, with eigenvalue $s_j^2$. Therefore, finding the eigenvalues and eigenvectors for $\vec{XX}^T$ gives us a way to compute $\vec{U}$ and $\vec{S}$.</p>
<p>There’s a subtle point to be made here about the sign of the eigenvector. If $\vec{u}_j$ is an eigenvector, then so is $-\vec{u}_j$. But if our goal is simply to use that decomposition to do PCA, then it doesn’t matter as the sign of the columns of $\vec{U}_K^T$ disappear when computing $\vec{U}_K\vec{U}_K^T$. However, if the goal is simply to do SVD, we must fix some choice of signs, and be consistent in $\vec{V}$.</p>
<p>To compute this decomposition, we can either work with $\vec{X}^T\vec{X}$ or $\vec{XX}^T$. This is practical, as it allows us to pick the smaller of the two and work in dimension $D$ or $N$.</p>
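<p>A sketch of this eigendecomposition route (numpy, toy dimensions), checked against a direct SVD; note the sign ambiguity discussed above, which is why the comparison is on absolute values:</p>

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 8
X = rng.standard_normal((D, N))

# Eigendecomposition of the smaller D x D matrix X X^T
evals, evecs = np.linalg.eigh(X @ X.T)     # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]            # reorder descending, as in the SVD
s = np.sqrt(evals[order])                  # singular values s_j = sqrt(eigenvalues)
U = evecs[:, order]                        # left singular vectors, up to sign

U_svd, s_svd, _ = np.linalg.svd(X, full_matrices=False)
```

Since $D = 3 < N = 8$, working with $\vec{XX}^T$ means eigendecomposing a $3\times 3$ matrix rather than an $8\times 8$ one.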
<h3 id="pitfalls-of-pca">Pitfalls of PCA</h3>
<p>Unfortunately, PCA is no miracle cure. The SVD is not invariant under scalings of the features in the original matrix $\vec{X}$. This is why it’s so important to normalize features. But there are many ways of doing this, and the result of PCA is highly dependent on how we do this, and there is a large degree of arbitrariness.</p>
<p>Still, the conventional approach for PCA is to remove the mean and normalize the variance to 1.</p>
<h2 id="neural-networks">Neural Networks</h2>
<h3 id="motivation-3">Motivation</h3>
<p>We’ve seen that simple linear classification schemes like logistic regression can work well, but also have their limitations. They work best when we add well chosen features to the original data matrix, but this can be a difficult task: a priori, we don’t know which features are useful.</p>
<p>We could add a ton of polynomial features and hope that some of them are useful, but this quickly becomes computationally infeasible, and leads to overfitting. To mitigate the computational complexity, we can use the kernel trick; to solve the feature selection task, we could collaborate with domain experts to pick just a few good features.</p>
<p>But what if we could <em>learn</em> the features instead of having to construct them manually? This is what neural networks allow us to do.</p>
<h3 id="structure">Structure</h3>
<p>As always in supervised learning, we start with a dataset $\Strain = \set{(\vec{x}_n, y_n)}$, with $\vec{x}_n \in\mathbb{R}^D$.</p>
<p>Let’s take a look at a simple multilayer perceptron neural network. It has an <strong>input layer</strong> of size $D$ (one for each dimension of the data), $L$ <strong>hidden layers</strong> of size $K$, and one <strong>output layer</strong>.</p>
<p><img src="/images/ml/nn.svg" alt="Fully connected multilayer perceptron" /></p>
<p>This is a <em>feedforward</em> network: the computation is performed from left to right, with no feedback loop. Each node in the hidden layer $l$ is connected to all nodes in the previous layer $l-1$ via a weighted edge $w_{i, j}^{(l)}$. The number $L$ and size $K$ of hidden layers are hyperparameters to be tuned.</p>
<p>A node outputs a non-linear function of a weighted sum of all the nodes in the previous layer, plus a bias term. For instance, the output of node $j$ at layer $l$ is given by:</p>
<script type="math/tex; mode=display">x_j^{(l)} = \phi\left( \sum_{i=1}^K w_{i, j}^{(l)} x_i^{(l - 1)} + b_j^{(l)} \right)</script>
<p>The actual learning consists of choosing all these weights appropriately for the task. The $\phi$ function is called the <strong>activation function</strong>. It’s very important that this is non-linear; otherwise, the whole neural net’s global function is just a linear function, which defeats the idea of having a complicated, layered function.</p>
<p>A typical choice for this function is the sigmoid function:</p>
<script type="math/tex; mode=display">\phi(x) = \frac{1}{1+e^{-x}}</script>
<p>The layered structure of our neural net means that there are on the order of $K^2 L$ weight parameters.</p>
<h3 id="how-powerful-are-neural-nets">How powerful are neural nets?</h3>
<p>This chapter somewhat follows <a href="http://neuralnetworksanddeeplearning.com/chap4.html">Chapter 4 of Nielsen’s book</a>. See that for a more in-depth explanation of this argument.</p>
<p>We’ll state the following lemma without proof. Let $f: \mathbb{R}^D \rightarrow \mathbb{R}$, where its Fourier transform is:</p>
<script type="math/tex; mode=display">\tilde{f}(\omega) = \int_{\mathbb{R}^D} {f(\vec{x}) e^{-j\omega^T\vec{x}}} d\vec{x}</script>
<p>We also assume that:</p>
<script type="math/tex; mode=display">\int_{\mathbb{R}^D} {\abs{\omega} \abs{\tilde{f}(\omega)}} d\omega \le C</script>
<p>Essentially, these assumptions just say that our function is “sufficiently smooth” (the $C$ has to do with the smoothness; as long as it is real, the function can be shown to be continuously differentiable). Then, for all $n \ge 1$, there exists a function $f_n$ of the form:</p>
<script type="math/tex; mode=display">f_n(\vec{x}) = \sum_{j=1}^n {c_j \phi(\vec{x}^T\vec{w}_j + b_j)} + c_0</script>
<p>This is a function that is representable by a neural net with one hidden layer with $n$ nodes and “sigmoid-like” activation functions (this is more general than just sigmoid, but includes sigmoid) such that:</p>
<script type="math/tex; mode=display">\int_{\abs{\vec{x}} \le r} {(f(\vec{x}) - f_n(\vec{x}))^2} d\vec{x}
\le
\frac{(2Cr)^2}{n}</script>
<p>This tells us that the error goes down with a rate of $\frac{1}{n}$. Note that this only guarantees us a good approximation in a ball of radius $r$ around the center. The larger the bounded domain, the more nodes we’ll need to approximate a function to the same level (the upper bound grows in terms of $r^2$).</p>
<p>In fact, we’ll see that if we have enough nodes in the network, then we can approximate the underlying distribution function. There is no limit, and no real lower bounds, but we do have the property that neural nets have significant expressive power provided that they’re large enough; we’ll give an intuitive explanation of this below.</p>
<h3 id="approximation-in-average">Approximation in average</h3>
<p>We’ll give a simple and intuitive, albeit a little hand-wavy explanation as to why neural nets with sigmoid activation function and at most two hidden layers already have a large expressive power. We’re searching for an approximation “in average”, i.e. so that the integral over the absolute value of the difference is small.</p>
<p>In the following, we let $f: \mathbb{R} \rightarrow \mathbb{R}$ be a scalar function on a bounded domain. This discussion generalizes to functions that are $\mathbb{R}^D \rightarrow \mathbb{R}$, but in these notes we’ll just cover the simple scalar function case (see Nielsen book and lecture notes for the generalization).</p>
<p>$f$ is Riemann integrable, meaning that it can be approximated arbitrarily precisely (with error at most $\epsilon$, for arbitrary $\epsilon > 0$) by a finite number of rectangles.</p>
<figure>
<img src="/images/ml/riemann.png" alt="Riemann integrals of a function" />
<figcaption>Lower and upper Riemann sums</figcaption>
</figure>
<p>It follows that a finite number of hidden nodes can approximate any such function arbitrarily closely, since we can build approximate rectangles out of smooth step functions of the form:</p>
<script type="math/tex; mode=display">f(x) = \phi(w(x-b))</script>
<p>Indeed, this function takes on value $\frac{1}{2}$ at $x=b$; we can think of this as the “transition point”. The larger the value of the weight $w$, the faster the transition from 0 to 1 happens. So if we set $b=0$, the transition from 0 to 1 happens at $x=0$. At this point, the derivative of $f$ is $w/4$, so the width of the transition is of the order of $4/w$.</p>
<p>All of the above says that we can create a rectangle that jumps from 0 to 1 at $x=a$ and jumps back to 0 at $x=b$, with $a < b$, with the following, taking a very large value for $w$:</p>
<script type="math/tex; mode=display">\phi(w(x-a)) - \phi(w(x-b))</script>
<p>A few of these rectangles are graphed below:</p>
<figure>
<img src="/images/ml/nn-rectangles.png" />
<figcaption>Approximate rectangles for $w=10, 20, 50$, respectively</figcaption>
</figure>
<p>This special rectangle formula has a simple representation in the form of a neural net. This network creates a rectangle from $a$ to $b$ with transition weight $w$ and height $h$: the output of the nodes in the hidden layer is $\phi(w(x - a))$ and $\phi(w(x - b))$, respectively.</p>
<p><img src="/images/ml/small-nn.svg" alt="A neural net implementation of the above rectangle function" /></p>
<p>Scaling this up, we can create the number of rectangles we need to do a Riemann approximation of the function.</p>
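<p>A minimal numpy sketch of this rectangle construction (hypothetical choice of $a$, $b$, $w$): the difference of two shifted sigmoids is close to 1 inside $[a, b]$ and close to 0 outside.</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rectangle(x, a, b, w=50.0, h=1.0):
    """Approximate rectangle of height h on [a, b]: h * (phi(w(x-a)) - phi(w(x-b)))."""
    return h * (sigmoid(w * (x - a)) - sigmoid(w * (x - b)))

x = np.linspace(0, 1, 5)                 # [0, 0.25, 0.5, 0.75, 1]
vals = rectangle(x, a=0.3, b=0.7, w=50.0)
```

A Riemann-style approximation of $f$ is then just a sum of such rectangles, one per subinterval, with heights given by the function values.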
<p>Note that doing the Riemann integral is rarely, if ever, the best way to approximate a function. We wouldn’t want to approximate a smooth function with horrible squares. The argument here isn’t that this is an efficient approach, just that NNs are <em>capable</em> of doing this.</p>
<h4 id="other-activation-functions">Other activation functions</h4>
<p>The same argument also holds under other activation functions. For instance, let’s try to work it out with the rectified linear unit (ReLU) function:</p>
<script type="math/tex; mode=display">(x)_+ = \max{\set{0, x}}</script>
<p>Let $f(x)$ be the function we’re trying to approximate. The Stone-Weierstrass theorem tells us that for every $\epsilon > 0$, there’s a polynomial $p(x)$ locally approximating it arbitrarily precisely; that is, for all $x\in[0, 1]$, we have:</p>
<script type="math/tex; mode=display">% <![CDATA[
\abs{f(x) - p(x)} < \epsilon %]]></script>
<p>This function $f(x)$ can also be approximated in $L_\infty$ norm by a piecewise linear function of the form:</p>
<script type="math/tex; mode=display">% <![CDATA[
q(x) = \sum_{i=1}^m (a_i x + b_i) \mathbb{I}_{\set{r_{i-1} \le x < r_i}} %]]></script>
<p>Where $0 = r_0 < r_1 < \dots < r_m = 1$ is a suitable partition of $[0, 1]$. This continuity imposes the constraint:</p>
<script type="math/tex; mode=display">a_i r_i + b_i = a_{i+1}r_i + b_{i+1}, \quad i = 1, \dots, m-1</script>
<p>This allows us to rewrite the $q(x)$ function as follows:</p>
<script type="math/tex; mode=display">q(x) = \tilde{a}_1 x + \tilde{b}_1 + \sum_{i = 2}^m{\tilde{a}_i(x - \tilde{b}_i)_+}</script>
<p>Where:</p>
<script type="math/tex; mode=display">a_1 = \tilde{a}_1,
\quad
a_i = \sum_{j=1}^{i}{\tilde{a}_j},
\quad
\tilde{b}_i = r_{i - 1}</script>
<h3 id="popular-activation-functions">Popular activation functions</h3>
<h4 id="sigmoid">Sigmoid</h4>
<p>The sigmoid function $\sigma(x)$ has a range of $[0, 1]$. The main problem with sigmoid is its gradient for large absolute values of $x$, which gets very close to zero. This is known as the “vanishing gradient problem”, which may make learning slow.</p>
<script type="math/tex; mode=display">\phi(x) = \sigma(x) = \frac{1}{1+e^{-x}}</script>
<h4 id="tanh">Tanh</h4>
<p>The hyperbolic tangent has a range of $[-1, 1]$. It suffers from the same “vanishing gradient problem”.</p>
<script type="math/tex; mode=display">\phi(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1</script>
<h4 id="relu">ReLU</h4>
<p>Rectified linear unit (ReLU) is a very popular choice, and is what works best in most cases.</p>
<script type="math/tex; mode=display">\phi(x) = (x)_+ = \max{\set{0, x}}</script>
<p>ReLU is never negative, and is unbounded. A nice property is that its derivative is 1 (and does not vanish) for positive values of $x$. It has 0 derivative for negative values, though.</p>
<h4 id="leaky-relu">Leaky ReLU</h4>
<p>Leaky ReLU solves the 0-derivative problem of ReLU by adding a very small slope $\alpha$ (a hyper-parameter that can be optimized) for negative values:</p>
<script type="math/tex; mode=display">\phi(x) = \max{\set{\alpha x, x}}</script>
<h4 id="maxout">Maxout</h4>
<p>Finally, maxout is a generalization of ReLU and leaky ReLU. Again, the constants can be optimized. Note that this is quite different from previous cases, where we computed the activation function of a weighted sum. Here, we compute $k \ge 2$ different weighted sums, and then choose the maximum.</p>
<script type="math/tex; mode=display">\phi(\vec{x}) = \max{\set{\vec{x}^T \vec{w}_1 + b_1, \dots, \vec{x}^T \vec{w}_k + b_k}}</script>
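<p>The activation functions above can be sketched in a few lines of numpy (the maxout weights in the test are hypothetical toy values):</p>

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    return 2.0 * sigmoid(2.0 * x) - 1.0    # identity: tanh(x) = 2*sigma(2x) - 1

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

def maxout(x, Ws, bs):
    """Maxout: the maximum over k different affine maps of the input."""
    return max(x @ W + b for W, b in zip(Ws, bs))
```

Note that maxout takes the whole input vector plus $k$ weight vectors, whereas the others act element-wise on a pre-computed weighted sum.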
<h3 id="sgd-and-backpropagation">SGD and Backpropagation</h3>
<p>Remember that the value of every node is computed by:</p>
<script type="math/tex; mode=display">x_j^{(l)} = \phi\left( \sum_{i=1}^K w_{i, j}^{(l)} x_i^{(l - 1)} + b_j^{(l)} \right)</script>
<p>We’d like to optimize this process. Let’s assume that we want to do a regression. Let’s denote the output of the neural net by the function $f$. Our cost function would then simply be:</p>
<script type="math/tex; mode=display">\mathcal{L} = \frac{1}{N} \sum_{n=1}^N{(y_n - f(\vec{x}_n))^2}</script>
<p>We’ll omit regularization for the simplicity of our explanation, but it can trivially be added in, without loss of generality.</p>
<p>To optimize our cost, we’d like to do a gradient descent. Unfortunately, this problem is not convex<sup id="fnref:convexity-nn"><a href="#fn:convexity-nn" class="footnote">13</a></sup>, and we expect it to have many local minima, so there is no guarantee of finding an optimal solution. But the good news is that SGD is <em>stable</em> when applied to a neural net, which means that the outcome won’t be too dependent on the training set. SGD is still the state-of-the-art in neural nets.</p>
<p>Let’s do a stochastic gradient descent on a single data point. We need to compute the derivative of the cost of this single point, which is:</p>
<script type="math/tex; mode=display">\frac{\partial \mathcal{L}_n}{\partial w_{i, j}^{(l)}},
\qquad
\frac{\partial \mathcal{L}_n}{\partial b_j^{(l)}}</script>
<p>We can gain a more general formula by restating the problem in vector form. Generally, a layer of neurons is computed by:</p>
<script type="math/tex; mode=display">\vec{x}^{(l)}
= f^{(l)}(\vec{x}^{(l - 1)})
= \phi\left(
\left(\vec{W}^{(l)}\right)^T \vec{x}^{(l - 1)} + \vec{b}^{(l)}
\right)</script>
<p>The overall function of the neural net is thus something taking the input layer $\vec{x}^{(0)}$, and passing it through all hidden layers:</p>
<script type="math/tex; mode=display">\vec{y} = f(\vec{x}^{(0)}) = f^{(L+1)} \circ \dots \circ f^{(2)} \circ f^{(1)}(\vec{x}^{(0)})</script>
<p>To make things more convenient, we’ll introduce notation for the linear part of the computation of a layer. The computation below corresponds to our <strong>forward pass</strong>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\vec{z}^{(l)} & = \left(\vec{W}^{(l)}\right)^T \vec{x}^{(l - 1)} + \vec{b}^{(l)} \\
\vec{x}^{(l)} & = \phi(\vec{z}^{(l)})
\end{align} %]]></script>
<p>To be formal, we’ll just quickly state that our notation here means that we’re applying $\phi$ component-wise. We see that to compute a $\vec{x}^{(l)}$, we need $\vec{x}^{(l - 1)}$; we therefore need to start from the input layer and compute our way forward, until the last layer, which is why this is called the forward pass.</p>
<p>Note that the full chain of computation that gets us to the output costs $\mathcal{O}(K^2 L)$ operations, which is not too bad.</p>
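<p>As a sketch, the forward pass is a loop over layers (the layer sizes and the choice of $\phi = \tanh$ here are illustrative):</p>

```python
import numpy as np

def forward(x, Ws, bs, phi=np.tanh):
    """Forward pass: x is the input layer x^(0); Ws[l], bs[l] hold W^(l+1), b^(l+1)."""
    zs, xs = [], [x]
    for W, b in zip(Ws, bs):
        z = W.T @ xs[-1] + b   # linear part z^(l)
        zs.append(z)
        xs.append(phi(z))      # activation x^(l)
    return zs, xs
```

Each iteration only needs the previous layer's activations, which is why the computation must proceed from the input layer forward.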
<p>For the <strong>backwards pass</strong>, let’s remember that the cost of a single data-point is:</p>
<script type="math/tex; mode=display">\mathcal{L}_n = (y_n - f^{(L+1)} \circ \dots \circ f^{(2)} \circ f^{(1)}(\vec{x}^{(0)}))^2</script>
<p>we’ll want to compute the following intermediate quantity, which will give us the derivatives with respect to both $w_{i, j}^{(l)}$ and $b_j^{(l)}$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\delta_j^{(l)}
& = \frac{\partial\mathcal{L}_n}{\partial z_j^{(l)}} \\
& = \sum_k
\frac{\partial\mathcal{L}_n}{\partial z_k^{(l+1)}}
\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}} \\
& = \sum_k \delta_k^{(l+1)} \vec{W}_{j, k}^{(l+1)} \phi'\left( z_j^{(l)} \right)
\end{align} %]]></script>
<p>We can write this more compactly using $\odot$, which is the <a href="https://en.wikipedia.org/wiki/Hadamard_product_(matrices)">Hadamard product</a> (element-wise multiplication of vectors):</p>
<script type="math/tex; mode=display">\pmb{\delta}^{(l)} = \left(\vec{W}^{(l+1)} \pmb{\delta}^{(l+1)}\right) \odot \phi'\left(\vec{z}^{(l)}\right)</script>
<p>Here, to compute a $\pmb{\delta}^{(l)}$, we need $\pmb{\delta}^{(l+1)}$; we must therefore start from the output, and compute our way back to layer 0, which is why we call this a backwards pass. Speaking of which, we will need a $\delta^{(L+1)}$ to start with on the right side. Therefore, we set:</p>
<script type="math/tex; mode=display">\delta^{(L+1)} = -2\left(y_n - x^{(L+1)}\right) \phi'\left(z^{(L+1)}\right)</script>
<p>Note that $z^{(L+1)}$, $\delta^{(L+1)}$ and $x^{(L+1)}$ are denoted as scalars because we assumed that our neural net only had a single output node.</p>
<p>Now that we have both $\vec{z}^{(l)}$ and $\pmb{\delta}^{(l)}$, let’s go back to our initial goal, which is to compute the following:</p>
<script type="math/tex; mode=display">\frac{\partial \mathcal{L}_n}{\partial w_{i, j}^{(l)}}
= \sum_k
\frac{\partial\mathcal{L}_n}{\partial z_k^{(l)}}
\frac{\partial z_k^{(l)}}{\partial w_{i, j}^{(l)}}
= \frac{\partial\mathcal{L}_n}{\partial z_j^{(l)}}
\frac{\partial z_j^{(l)}}{\partial w_{i, j}^{(l)}}
= \delta_j^{(l)} x_i^{(l - 1)}</script>
<p>We were able to re-express this as a product of elements that we already have. We were able to drop the sum because changing a single weight $w_{i, j}^{(l)}$ <em>only</em> changes the single sum $z_j$; all other sums stay unchanged, and therefore do not enter into the derivative with respect to $w_{i, j}^{(l)}$. In other words, the term $\frac{\partial z_k^{(l)}}{\partial w_{i, j}^{(l)}}$ is only non-zero when $j=k$.</p>
<p>We’ve thus found the result of the two derivatives we wanted to originally find:</p>
<script type="math/tex; mode=display">\frac{\partial \mathcal{L}_n}{\partial w_{i, j}^{(l)}}
= \delta_j^{(l)} x_i^{(l - 1)},
\qquad
\frac{\partial \mathcal{L}_n}{\partial b_j^{(l)}}
= \delta_j^{(l)}</script>
<h3 id="regularization-2">Regularization</h3>
<p>To regularize the weights, we can add $\Omega(\vec{W})$ to the cost function. Typically, we don’t include bias terms in the regularization (experience shows that it just doesn’t work quite as well). Therefore, the regularization term is expressed as something like:</p>
<script type="math/tex; mode=display">\Omega(\vec{W}) = \frac{1}{2} \sum_{l=1}^{L+1} \mu^{(l)} \frobnorm{\vec{W}^{(l)}}^2</script>
<p>We have different weights $\mu^{(l)} \ge 0$ for each layer. With the right constants $\mu^{(l)}$, this regularization will favor small weights and can help us avoid overfitting.</p>
<p>Let $\Theta = w_{i, j}^{(l)}$ denote the weight that we’re updating, and let $\eta$ be the step size. Assuming that we use the same weight $\mu^{(l)} = \mu$ for all layers $l$, the gradient descent rule becomes:</p>
<script type="math/tex; mode=display">\Theta^{(t+1)} = \Theta^{(t)} - \eta (\nabla_{\Theta}\mathcal{L} + \mu \Theta^{(t)}) = \Theta^{(t)} (1 - \eta\mu) - \eta \nabla_{\Theta}\mathcal{L}</script>
<p>Usual GD deducts the step size $\eta$ times the gradient from the variable, but here, we additionally shrink the weights by a factor $(1 - \eta\mu)$; we call this <em>weight decay</em>.</p>
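<p>The equivalence of the two forms of the update is easy to check numerically (the values below are arbitrary):</p>

```python
import numpy as np

theta = np.array([0.5, -1.2, 2.0])   # current weights Theta^(t)
grad = np.array([0.1, 0.3, -0.2])    # gradient of the unregularized loss
eta, mu = 0.05, 0.01                 # step size and regularization weight

step_a = theta - eta * (grad + mu * theta)     # penalized-gradient form
step_b = theta * (1 - eta * mu) - eta * grad   # weight-decay form
assert np.allclose(step_a, step_b)
```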
<h3 id="dataset-augmentation">Dataset augmentation</h3>
<p>The more data we have, the better we can train. In some instances we can generate new data from the data we are given. For instance, with the classic <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST database of handwritten digits</a>, we could generate new data by generating rotated characters from the existing dataset. That way, we can also train our network to become invariant to these transformations. We could also add a small amount of noise to our data (by means of compression to degree $K$ with PCA, for instance).</p>
<h3 id="dropout">Dropout</h3>
<p>We define $p_i^{(l)}$ to be the probability of keeping node $i$ in layer $l$ of the network at a given step. A typical value would be $p_i^{(l)} = 0.8$, which means an 80% chance of keeping a given node. This defines a different <em>subnetwork</em> at every step of SGD.</p>
<p>There are many variations of dropout; we talked about dropping nodes, but one could also drop edges. To predict, we can generate $K$ subnets and take the average prediction. Alternatively, we could use the whole network for the prediction, but scaling the output of node $i$ at layer $l$ by $p_i^{(l)}$, which guarantees that the expected input at each node stays the same as during training.</p>
<p>Dropout is a method to avoid overfitting, as nodes cannot “rely” on other nodes being present. It allows us to do a kind of model averaging, as there’s an exponential number of subnetworks, and we’re averaging the training over several of them. Averaging over many models is a standard ML trick called <em>bagging</em>, which usually leads to improved performance.</p>
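<p>A minimal sketch of node dropout, with the whole-network prediction scaled by the keep probability; the layer of ones is a toy stand-in for activations:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.8                       # probability of keeping a node
h = np.ones(10_000)           # hypothetical activations of one wide layer

# Training: sample a binary mask, defining a random subnetwork.
mask = rng.random(h.shape) < p
h_train = h * mask

# Prediction: keep all nodes but scale by p, so the expected input
# to the next layer matches what it saw during training.
h_test = h * p

assert abs(h_train.mean() - h_test.mean()) < 0.02
```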
<h3 id="convolutional-nets">Convolutional nets</h3>
<p>The basic idea in convolutions is to slide a small window (called a <em>filter</em>) over an array, computing the dot product between the filter and the elements it overlaps at every position in the array. A good introduction to the subject can be found on <a href="https://eli.thegreenplace.net/2018/depthwise-separable-convolutions-for-machine-learning/">Eli Bendersky’s website</a>.</p>
<h4 id="structure-1">Structure</h4>
<p>Classically, we’ve defined our networks as fully connected graphs, where every node in layer $l$ is connected to every node in layer $l-1$. This means that if we have $K$ nodes in each of the two layers, we have $K^2$ edges, and thus parameters, between them. Convolutional nets allow us to have somewhat more sparse networks.</p>
<p>In some scenarios, it makes sense that a more local processing of data should suffice. For instance, convolutions are commonly used in signal processing, where we have a discrete-time system (e.g. audio samples forming an audio stream), which is denoted by $x^{(0)}[n]$. To process the stream we run it through a linear filter $f[n]$, which produces an output $x^{(1)}[n]$. This filter is often “local”, looking at a window of size $k$ around a central value:</p>
<script type="math/tex; mode=display">x^{(1)}[n] = \sum_k f[k]x^{(0)}[n - k]</script>
<p>We have the same scenario if we think of a 2D picture, where the signal is $x^{(0)}[n, m]$. The filter can bring out various aspects, either smoothing features by averaging, or enhancing them by taking a so-called “high-pass” filter.</p>
<script type="math/tex; mode=display">x^{(1)}[n, m] = \sum_{k, l} f[k, l]x^{(0)}[n-k, m-l]</script>
<p>The output $x^{(1)}$ of the filter at position $[n, m]$ only depends on the values of the input $x^{(0)}$ at positions close to $[n, m]$. This is more sparse and local than a fully connected network. This also implies that we use the <em>same filter</em> at every position, which drastically reduces the number of parameters.</p>
<p>In ML, we do something similar. We have a filter with a fixed size $K_1 \times K_2$ with coefficients for every item in the filter. We move the filter over the input matrix, and compute a weighted sum for every position in the matrix.</p>
<h4 id="padding">Padding</h4>
<p>To handle border cases, we can either do:</p>
<ul>
<li><em>Zero padding</em>, where we give the input a default value (usually 0) when the filter goes over the edges.</li>
<li><em>Valid padding</em>, where we are careful only to run the filter within the bounds of the matrix. This results in a smaller output matrix.</li>
</ul>
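<p>As a sketch, the sliding-window computation and both padding modes look as follows. Note that, matching the description above, the filter is not flipped (strictly, this is a cross-correlation), and the zero-padding branch preserves the input size only for odd filter dimensions:</p>

```python
import numpy as np

def conv2d(x, f, padding="valid"):
    """Slide filter f over image x, taking a dot product at each position."""
    k1, k2 = f.shape
    if padding == "zero":
        # pad the borders with zeros (exact size-preservation for odd k1, k2)
        x = np.pad(x, ((k1 // 2, k1 // 2), (k2 // 2, k2 // 2)))
    n, m = x.shape
    out = np.empty((n - k1 + 1, m - k2 + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + k1, j:j + k2] * f)
    return out

# Valid padding shrinks the output; zero padding preserves the size.
assert conv2d(np.ones((3, 3)), np.ones((2, 2))).shape == (2, 2)
assert conv2d(np.ones((3, 3)), np.ones((3, 3)), padding="zero").shape == (3, 3)
```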
<h4 id="channels">Channels</h4>
<p>A picture naturally has at least three channels: every pixel has a red, green and blue component. So a 2D picture can actually be represented as a 3D cube with a depth of 3. Each layer in the depth represents the same 2D image in red, green and blue, respectively. Each such layer is called a <em>channel</em>.</p>
<p>Channels can also stem from the convolution itself. If we’re doing a convolution on a 2D picture, we may want to use multiple filters in the same model. Each of them produces a different output; these outputs are also <em>channels</em>. If we produce multiple 2D outputs with multiple filters, we can stack them into a 3D cube.</p>
<p>As we get deeper and deeper into a CNN, we tend to add more and more channels, but the 2D size of the picture typically gets smaller and smaller, either due to valid padding or subsampling. This leads to a pyramid shaped structure, as below.</p>
<p><img src="/images/ml/cnn.svg" alt="Example of a CNN getting deeper and deeper" /></p>
<h4 id="training-1">Training</h4>
<p>CNNs are different from fully connected neural nets in that only some of the edges are present, and in that they use weight sharing. The former makes our weight matrices sparser, but doesn’t require any changes in SGD or backpropagation; the latter requires a small modification in the backpropagation algorithm.</p>
<p>With CNNs, we run backpropagation ignoring that some weights are shared, considering each weight on each edge to be an independent variable. We then sum up the gradients of all edges that share the same weight, which gives us the gradient for the network with weight sharing.</p>
<p>Why we do this may seem a little counterintuitive at first, but we’ll attempt to give the mathematical intuition for it. Let’s consider a simple example, in which we let $f(x, y, z)$ be a function from $\mathbb{R}^3 \rightarrow \mathbb{R}$. If we let $g(x, y) = f(x, y, x)$, then $z$ is no longer an independent variable, but is instead fixed to $z = x$. The gradients of $g$ and $f$ are given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla g(x, y) & = \left(
\diff{g(x, y)}{x}, \quad
\diff{g(x, y)}{y}
\right) \\
\nabla f(x, y, z) & = \left(
\diff{f(x, y, z)}{x}, \quad
\diff{f(x, y, z)}{y}, \quad
\diff{f(x, y, z)}{z}
\right) \\
\end{align} %]]></script>
<p>To compute the gradient of $g$, we can first compute that of $f$, and then realize that:</p>
<script type="math/tex; mode=display">\left(
\diff{g(x, y)}{x}, \;
\diff{g(x, y)}{y}
\right)
=
\left(
\diff{f(x, y, z)}{x} + \diff{f(x, y, z)}{z}, \;
\diff{f(x, y, z)}{y}
\right) \\</script>
<p>This is a general property: we can add up the derivatives of the shared weights to compute the value of a single derivative.</p>
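<p>We can verify this property numerically on a toy function (the choice of $f$ below is arbitrary):</p>

```python
# "Weight sharing": g(x, y) = f(x, y, x) ties z to x, so the derivative of g
# with respect to x is the sum of f's derivatives with respect to x and z.

def f(x, y, z):
    return x * y + z ** 2 + x * z   # an arbitrary smooth function R^3 -> R

def g(x, y):
    return f(x, y, x)

x, y = 1.5, -0.7
z = x                   # the tied value
df_dx = y + z           # analytic df/dx = y + z
df_dz = 2 * z + x       # analytic df/dz = 2z + x

# Numerical derivative of g with respect to x
eps = 1e-6
dg_dx = (g(x + eps, y) - g(x - eps, y)) / (2 * eps)
assert abs(dg_dx - (df_dx + df_dz)) < 1e-5
```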
<h2 id="bayes-nets">Bayes Nets</h2>
<p>We’ve often seen in this course that there are multiple ways of thinking of the same things; for instance, we’ve often seen different models as variations of least squares, and seen different ways of getting back to least squares (e.g. the probabilistic approach assuming linear model with Gaussian noise, in which we maximize likelihood, or the approach in which we try to minimize MSE, etc).</p>
<p>But these have often been based on very simple assumptions. To model more complex patterns of causality, we turn to <em>graphical models</em>. They allow us to use a graphical depiction of the relationships between random variables. The most prominent ones are <em>Bayes Nets</em>, <em>Markov Random Fields</em> and <em>Factor Graphs</em>.</p>
<h3 id="from-distribution-to-graphs">From distribution to graphs</h3>
<p>Assume that we’re given a large set of random variables $X_1, \dots, X_D$ and that we’re interested in their relationships (e.g. whether $X_1$ and $X_2$ are independent given $X_3$). It doesn’t matter if these are discrete or continuous; we’ll just think of them as being discrete, and consider $p(\cdot)$ to be the density.</p>
<p>The most generic way to write down this model is to write it as a generic distribution over a vector of random variables. The chain rule tells us:</p>
<script type="math/tex; mode=display">p(X_1, \dots, X_D) = p(X_1)p(X_2 \mid X_1) \cdots p(X_D \mid X_1, \dots, X_{D-1})</script>
<p>In the above, we used the natural ordering $X_1, X_2, \dots, X_D$, but we could just as well have used any of the $D!$ orders: this degree of freedom will be important later. Each variable in this chain rule formulation is conditioned on other variables. For instance, for $D=4$, we have:</p>
<script type="math/tex; mode=display">p(X_1, X_2, X_3, X_4) = p(X_1)p(X_2 \mid X_1)p(X_3 \mid X_1, X_2)p(X_4 \mid X_1, X_2, X_3)</script>
<p>A way to represent this expansion of the chain rule is to draw which variables are conditioned on which. In Bayes nets, we draw an arrow from each variable to the variables that are conditioned on it.</p>
<p><img src="/images/ml/bayes-net-1.svg" alt="The Bayes net corresponding to the above" /></p>
<p>It’s important not to interpret this as causality, because the ordering that we picked in the chain rule is arbitrary, and could lead to many kinds of arrows in the Bayes net representation. If we just have $D=2$, we could have an arrow from $X_1$ to $X_2$ just as well as the other way around. The arrows are a necessary condition for dependence, but not a sufficient one: they allow for dependence, but don’t guarantee it.</p>
<p>Still, when we know that two variables are (conditionally) independent, we can remove edges from the graph. Perhaps we have $p(X_3 \mid X_1, X_2) = p(X_3 \mid X_2)$, in which case we can draw the same graph, but without the edge from $X_1$ to $X_3$.</p>
<p><img src="/images/ml/bayes-net-2.svg" alt="The Bayes net where X1 is independent from X3 conditional on X2" /></p>
<p>This is suddenly much more interesting. Being able to remove edges between independent variables means that we can have many different graphs. If we couldn’t do that, we would always generate the same graph with the chain rule, in the sense that it would always have the same topology; the exact ordering could still change depending on how we apply the chain rule. This is what will allow us to get information on independence from a graph.</p>
<h3 id="cyclic-graphs">Cyclic graphs</h3>
<p><img src="/images/ml/bayes-net-3.svg" alt="Bayes net with a cycle" /></p>
<p>The above net would correspond to the factorization:</p>
<script type="math/tex; mode=display">p(X_1 \mid X_2) p(X_2 \mid X_3) p(X_3 \mid X_1)</script>
<p>This is clearly not something that could stem from the chain rule, and therefore, the graph is not valid. In fact, we can state a stronger assertion:</p>
<p>Valid Bayes nets are always DAGs (directed acyclic graphs). There exists a valid distribution (a valid chain rule factorization) <strong>iff</strong> there are no cycles in the graph.</p>
<h3 id="conditional-independence">Conditional independence</h3>
<p>Now, assume that we are given an acyclic graph. We’d like to find an appropriate ordering in the chain rule in order to find the distribution. A few things to note before we start:</p>
<ul>
<li>Every acyclic graph has at least one <em>source</em>, that is, a node that has no incoming edges</li>
<li>Two random variables $X$ and $Y$ are independent if $p(X, Y) = p(X)p(Y)$</li>
<li>$X$ is independent of $Y$ given $Z$ (which we denote by $X \bot Y \mid Z$) if $p(X, Y \mid Z) = p(X \mid Z) p(Y \mid Z)$</li>
<li>When we talk about <em>path</em> in the following, we mean an undirected path</li>
</ul>
<p>Let’s look at some simple graphs involving three variables, which will help us clarify the concept of <strong>D-separation</strong>. We’ll always ask the two same questions:</p>
<ul>
<li>Is $X_1 \bot X_2$ ?</li>
<li>Is $X_1 \bot X_2 \mid X_3$ ?</li>
</ul>
<p>These examples are named for how the two edges of the path meet at the middle node $X_3$: at an arrow’s head (its destination) or at its tail (its origin). We then ask about the (conditional) independence of $X_1$ and $X_2$.</p>
<h4 id="tail-to-tail">Tail-to-tail</h4>
<figure>
<img alt="Tail-to-tail Bayes net" src="/images/ml/bayes-net-4.svg" />
<figcaption>$X_3$ is tail-to-tail with respect to the path from $X_1$ to $X_2$</figcaption>
</figure>
<p>$X_3$ is the source of this graph, so the factorization is:</p>
<script type="math/tex; mode=display">p(X_1, X_2, X_3) = p(X_3)p(X_1 \mid X_3)p(X_2 \mid X_3)</script>
<p>Intuitively, $X_1$ and $X_2$ are not independent here, as $X_3$ influences them both; it would be easy to construct something where they are both correlated (e.g. if we let them be fully dictated by $X_3$).</p>
<p>To know if they are conditionally independent, let’s look at the conditioned quantity $p(X_1, X_2 \mid X_3)$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(X_1, X_2 \mid X_3)
& = \frac{p(X_1, X_2, X_3)}{p(X_3)} \\
& = \frac{p(X_3)p(X_1 \mid X_3)p(X_2 \mid X_3)}{p(X_3)} \\
& = p(X_1 \mid X_3) p(X_2 \mid X_3)
\end{align} %]]></script>
<p>This proves $X_1 \bot X_2 \mid X_3$.</p>
<p>Let’s try to look at it in more general terms. We have a path between $X_1$ and $X_2$, which in general is worrisome as it may indicate some kind of relationship. But if we know what the value of $X_3$ is, then the knowledge of $X_3$ “blocks” that dependence.</p>
<h4 id="head-to-tail">Head-to-tail</h4>
<figure>
<img alt="Head-to-tail Bayes net" src="/images/ml/bayes-net-5.svg" />
<figcaption>$X_3$ is head-to-tail with respect to the path from $X_1$ to $X_2$</figcaption>
</figure>
<p>$X_1$ is the source of the graph, so the factorization is:</p>
<script type="math/tex; mode=display">p(X_1, X_2, X_3) = p(X_1) p(X_3 \mid X_1) p(X_2 \mid X_3)</script>
<p>We can clearly construct a case where $X_1$ and $X_2$ are dependent (e.g. if we pick $X_1 = X_3 = X_2$). So again, $X_1$ and $X_2$ are not independent.</p>
<p>To know if they are conditionally independent, let’s look at the conditioned quantity $p(X_1, X_2 \mid X_3)$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(X_1, X_2 \mid X_3)
& = \frac{p(X_1, X_2, X_3)}{p(X_3)} \\
& = \frac{p(X_1) p(X_3 \mid X_1) p(X_2 \mid X_3)}{p(X_3)} \\
& = \frac{p(X_1) p(X_3) p(X_1 \mid X_3) p(X_2 \mid X_3)}{p(X_1) p(X_3)} \\
& = p(X_1 \mid X_3) p(X_2 \mid X_3)
\end{align} %]]></script>
<p>This proves $X_1 \bot X_2 \mid X_3$. Again, conditioned on $X_3$ we block the path from $X_1$ to $X_2$.</p>
<h4 id="head-to-head">Head-to-head</h4>
<figure>
<img alt="Head-to-head Bayes net" src="/images/ml/bayes-net-6.svg" />
<figcaption>$X_3$ is head-to-head with respect to the path from $X_1$ to $X_2$</figcaption>
</figure>
<p>Here, $X_3$ is the sink of the graph, and the factorization is:</p>
<script type="math/tex; mode=display">p(X_1, X_2, X_3) = p(X_1) p(X_2) p(X_3 \mid X_1, X_2)</script>
<p>In this example, $X_1$ and $X_2$ are independent. But if we condition on $X_3$, they become dependent. So, contrary to the two previous cases, conditioning on $X_3$ creates a dependence. This phenomenon is called <a href="https://www.eecs.qmul.ac.uk/~norman/BBNs/The_notion_of__explaining_away__evidence.htm"><em>explaining away</em></a>.</p>
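<p>Explaining away can be checked by exact enumeration on a toy distribution (here, hypothetically, $X_3 = X_1 \text{ XOR } X_2$ with fair coins):</p>

```python
from itertools import product

# Joint p(x1, x2, x3) = p(x1) p(x2) p(x3 | x1, x2), with fair coins
# and X3 deterministically equal to X1 XOR X2.
joint = {(x1, x2, x1 ^ x2): 0.25 for x1, x2 in product([0, 1], repeat=2)}

def marg(**fixed):
    """Marginal probability of the fixed variable assignments."""
    names = ("x1", "x2", "x3")
    return sum(p for vals, p in joint.items()
               if all(vals[names.index(k)] == v for k, v in fixed.items()))

# Unconditionally, X1 and X2 are independent:
assert marg(x1=1, x2=1) == marg(x1=1) * marg(x2=1)

# Conditioned on X3 = 0, they become dependent:
p1 = marg(x1=1, x3=0) / marg(x3=0)
p2 = marg(x2=1, x3=0) / marg(x3=0)
p12 = marg(x1=1, x2=1, x3=0) / marg(x3=0)
assert p12 != p1 * p2
```

Intuitively, once we know $X_3 = 0$, learning $X_1$ fully determines $X_2$.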
<h4 id="d-separation">D-separation</h4>
<p>Instead of determining independence manually as we did above, we can use the two following criteria to decide on (conditional) independence graphically. We’ll give a series of nested definitions that will eventually lead to the criteria. Note that these definitions talk about sets of random variables, but this also applies to single random variables (which we can consider as a set of one).</p>
<ul>
<li>Let $X$, $Y$ and $Z$ be sets of random variables. $X \bot Y \mid Z$ if $X$ and $Y$ are <em>D-separated</em> by $Z$.</li>
<li>We say that $X$ and $Y$ are <strong>D-separated</strong> by $Z$ <strong>iff</strong> every path from any element of $X$ to any element of $Y$ is <em>blocked by</em> $Z$.</li>
<li>We say that a path from node $X$ to node $Y$ is <strong>blocked</strong> by $Z$ <strong>iff</strong> it contains a variable $U$ such that either:
<ul>
<li>$U$ is in $Z$ and is <a href="#head-to-tail">head-to-tail</a></li>
<li>$U$ is in $Z$ and is <a href="#tail-to-tail">tail-to-tail</a></li>
<li>$U$ is <a href="#head-to-head">head-to-head</a> and <em>neither</em> $U$ nor any of its <em>descendants</em> are in $Z$</li>
</ul>
</li>
</ul>
<p><strong>Descendant</strong> means that there exists a <em>directed</em> path from parent to descendant.</p>
<h4 id="examples">Examples</h4>
<p>Let’s do lots of examples to make sure that we understand this. We’ll be working on the following graph, and ask about different combinations of random variables.</p>
<p><img src="/images/ml/bayes-net-7.svg" alt="Example of a Bayes net containing all 3 kinds of relationship" /></p>
<ul>
<li>
<p>Is $X_1 \bot X_3 \mid X_2$?</p>
<p>First, let’s try to understand the idea of <em>paths</em>. There is only one path between $X_1$ and $X_3$: from $X_1$ to $X_2$ to $X_3$. In general, it doesn’t have to be a directed path, although this one happens to be so.</p>
<p>For every such path—and in this case, there is just one, so it’s easy—we’ll check if it contains a variable in $Z = \set{X_2}$ that blocks it. This is the case: $X_2$ is head-to-tail with respect to this path. This means that the only path is <em>blocked</em> by $X_2$, and therefore that $X_1 \bot X_3 \mid X_2$.</p>
</li>
<li>
<p>Is $X_3 \bot X_1 \mid X_2$?</p>
<p>This is the same as above, except that the independence is stated in reverse. We know that independence is commutative, and it also follows from the D-separation lemma, since paths are not directed.</p>
</li>
<li>
<p>Is $X_4 \bot X_1 \mid X_2$?</p>
<p>There’s only one path from $X_4$ to $X_1$. We’ll check if it contains a variable $U\in Z = \set{X_2}$: the only node that fits this is quite trivially $U = X_2$, which is head-to-tail with respect to the path. It therefore blocks the path, and we have $X_4 \bot X_1 \mid X_2$.</p>
</li>
<li>
<p>Is $X_4 \bot X_1 \mid X_3$?</p>
<p>There’s only one path from $X_4$ to $X_1$, and it doesn’t contain any head-to-tail or tail-to-tail nodes in $Z$. It does however contain a head-to-head node, $X_3$. While $X_3$ has no descendants, we still have $X_3 \in Z = \set{X_3}$, and therefore, the lemma does not apply. The answer is therefore no.</p>
</li>
<li>
<p>Is $X_4 \bot X_1 \mid X_3, X_2$?</p>
<p>In this case, we have $Z = \set{X_2, X_3}$. There’s still only one path from $X_4$ to $X_1$. We saw previously that we cannot apply the lemma with $X_3$, so let’s try with $X_2$: this node is head-to-tail with respect to the path, and belongs to $Z$. Therefore, $X_2$ blocks the path, and we have a D-separation, which means that the answer is yes.</p>
</li>
<li>
<p>Is $X_4 \bot X_1$?</p>
<p>There’s only one path between them, and it is blocked by $X_3$: it is head-to-head, we have $X_3 \notin Z = \emptyset$, and it has no descendants (so none of them are in $Z$). Therefore, the answer is yes.</p>
</li>
</ul>
<h3 id="markov-blankets">Markov blankets</h3>
<p>Given a node $X_i$, we can ask if there is a minimal set such that every random variable outside this set is conditionally independent of $X_i$, given the set. The answer to this is the Markov blanket.</p>
<p>The <strong>Markov blanket</strong> of $X_i$ is the set of parents, children, and co-parents of $X_i$. By co-parent, we mean the other parents of the children of $X_i$.</p>
<figure>
<img src="/images/ml/markov-blanket.svg" alt="Example of a Markov blanket" />
<figcaption>The Markov blanket of $X_1$ is colored in gray</figcaption>
</figure>
<h3 id="sampling-and-marginalizing">Sampling and marginalizing</h3>
<p>So far we’ve seen how to recognize independence relationships from a Bayes net. Another possible task is to sample given a Bayes net, or to compute marginals from a Bayes net. As it turns out, these two tasks are related.</p>
<p>First, let’s assume we know how to sample from a Bayes net. Let’s assume that we have a set of $D$ binary random variables, $X_i \in \set{0, 1}$. We can then generate $N$ independent samples $\set{\vec{x}_n}_{n=1}^N = \set{(X_{1n}, \dots, X_{Dn})}_{n=1}^N$. To get the marginal for $X_i$, we estimate $\expect{X_i}$ by computing the empirical quantity $\frac{1}{N}\sum_{n=1}^N x_{in}$. As $N\rightarrow\infty$, we know that this converges to the true mean.</p>
<p>Conversely, assume we know how to efficiently compute marginals from any Bayes net, and that we’d like to sample from the joint distribution. We can then compute the marginal of the net with respect to a certain variable $X_i$, and then flip a coin according to the marginal probability we’ve computed.</p>
<p>The problem is that neither of these can be done efficiently, except for some special cases. The chain rule tells us that $X_i$ is conditioned on $X_1, \dots, X_{i-1}$, which means we’d need to have a table of $2^{i-1}$ conditional probabilities. In general, the storage requirement is exponential in the largest number of parents any node in the Bayes net has.</p>
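<p>As a sketch, ancestral sampling on a tiny chain $X_1 \rightarrow X_2$ recovers the exact marginal (the probabilities below are made up):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
p_x1 = 0.3                      # p(X1 = 1)
p_x2 = {0: 0.9, 1: 0.2}         # p(X2 = 1 | X1)

# Sample in topological order: parents first, then children.
N = 200_000
x1 = (rng.random(N) < p_x1).astype(int)
x2 = rng.random(N) < np.where(x1 == 1, p_x2[1], p_x2[0])

# Empirical marginal of X2 vs the exact one from the sum rule.
exact = p_x1 * p_x2[1] + (1 - p_x1) * p_x2[0]
assert abs(x2.mean() - exact) < 0.01
```

This works well for two variables, but the conditional probability tables grow exponentially with the number of parents, as noted above.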
<h3 id="factor-graphs">Factor graphs</h3>
<p>Assume we have a function $f$ that can be factorized as follows:</p>
<script type="math/tex; mode=display">f(X_1, X_2, X_3, X_4) = f_a(X_1) f_b(X_2, X_3) f_c(X_3, X_4)</script>
<p>A very natural representation is another graphical representation. Each variable $X_i$ gets a node, and each factor $f_j$ gets a factor node.</p>
<p><img src="/images/ml/factor-graph.svg" alt="Factor graph of the above function" /></p>
<p>If the factor graph is a bipartite tree (i.e. no cycles), then we can marginalize very efficiently with a <a href="https://en.wikipedia.org/wiki/Factor_graph#Message_passing_on_factor_graphs">message-passing algorithm</a>, which runs in linear time in the number of edges, instead of exponential complexity in the size of the network.</p>
<p>Sadly, very few probability distributions do us the favor of producing a tree in the factor graph. But it turns out that there are many probability distributions where the factorization’s terms are fairly small, and despite cycles in the graph, we can still run the algorithm, and it works approximately.</p>
<div class="footnotes">
<ol>
<li id="fn:here-be-dragons">
<p>I’ve done my best to respect this notational convention everywhere in these notes, but a few mistakes may have slipped through. If you see any, please correct me in the comments below! <a href="#fnref:here-be-dragons" class="reversefootnote">↩</a></p>
</li>
<li id="fn:optimality-linear-mse">
<p>To understand why, see the sections on <a href="#optimality">optimality conditions</a> and on <a href="#single-parameter-linear-regression">single parameter linear regressions</a> <a href="#fnref:optimality-linear-mse" class="reversefootnote">↩</a></p>
</li>
<li id="fn:mse-is-convex">
<p>We accept this without a formal proof for now, but it should be clear from the <a href="#convexity">section on convexity</a> that MSE is convex. Otherwise, the section on <a href="#multiple-parameter-linear-regression">normal equations for multi-parameter linear regression</a> has more complete proofs. <a href="#fnref:mse-is-convex" class="reversefootnote">↩</a></p>
</li>
<li id="fn:convergence-prob-distrib">
<p>Convergence in probability means that the actual realizations of $X$ converges to that of $Y$ (i.e. $\mathbb{P}(X=Y)\rightarrow 1$), while convergence in distribution means that the distribution function of $X$ converges to that of $Y$ (but without any guarantee that the actual realizations will be the same). Convergence in probability implies convergence in distribution, and is therefore a stronger assertion. <a href="#fnref:convergence-prob-distrib" class="reversefootnote">↩</a> <a href="#fnref:convergence-prob-distrib:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
<li id="fn:fisher-information">
<p>Fisher information is a way of measuring the information that a random variable carries about an unknown parameter. See <a href="https://en.wikipedia.org/wiki/Fisher_information">the Wikipedia article for Fisher information</a>. <a href="#fnref:fisher-information" class="reversefootnote">↩</a></p>
</li>
<li id="fn:data-subset-training-data">
<p>We say “data subset” here, because, as <a href="#splitting-the-data">we’ll see later</a>, the data available to the learning algorithm $\mathcal{A}$ is often a subset of the whole dataset, called the training set. In this subsection, $S$ actually corresponds to $\Strain$. <a href="#fnref:data-subset-training-data" class="reversefootnote">↩</a></p>
</li>
<li id="fn:squishification-function">
<p>Because this function squeezes inputs in $(-\infty, \infty)$ into a true probability in $[0, 1]$, I like the name “squishification function” that <a href="https://www.youtube.com/watch?v=aircAruvnKk">3Blue1Brown uses</a>, but other people also call it a “squashing” function. <a href="#fnref:squishification-function" class="reversefootnote">↩</a></p>
</li>
<li id="fn:logistic-implementation">
<p>Note that this function applies the exponential function to rather large values, so we should be careful when implementing this. <a href="#fnref:logistic-implementation" class="reversefootnote">↩</a></p>
</li>
<li id="fn:binary-logistic-regression">
<p>We have only studied binary logistic regression, which is the basic form of logistic regression. Generalized linear models will lead us to more complex extensions, such as <a href="https://en.wikipedia.org/wiki/Multinomial_logistic_regression">multinomial logistic regression</a>. <a href="#fnref:binary-logistic-regression" class="reversefootnote">↩</a></p>
</li>
<li id="fn:isotropic">
<p>The word that expresses this idea is <em>isotropic</em>, meaning “uniform in all directions”. <a href="#fnref:isotropic" class="reversefootnote">↩</a></p>
</li>
<li id="fn:inverted-matrix-notation">
<p>Usually, the data matrix is $N \times D$, but here, we define it as the transpose, a $D \times N$ matrix. Don’t ask me why, because I have no clue 🤷‍♂️ <a href="#fnref:inverted-matrix-notation" class="reversefootnote">↩</a></p>
</li>
<li id="fn:orthonormal">
<p>The columns of an orthonormal matrix are orthogonal unit vectors (they have norm 1). The transpose is equal to the inverse, meaning that if $\vec{U}$ is orthogonal, then $\vec{U}^T\vec{U} = \vec{UU}^T = \vec{I}$ <a href="#fnref:orthonormal" class="reversefootnote">↩</a></p>
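<p>As a quick numerical sanity check (a toy example, not from the course), a 2×2 rotation matrix is the classic orthonormal matrix, and we can verify $\vec{U}^T\vec{U} = \vec{I}$ directly:</p>

```scala
// A rotation matrix has orthogonal unit-norm columns, so UᵀU should be
// the identity matrix, up to floating-point error.
val t = 0.3
val u = Array(
  Array(math.cos(t), -math.sin(t)),
  Array(math.sin(t),  math.cos(t))
)
// (UᵀU)(i)(j) = Σ_k U(k)(i) * U(k)(j)
val utu = Array.tabulate(2, 2)((i, j) =>
  (0 until 2).map(k => u(k)(i) * u(k)(j)).sum
)
// utu is approximately [[1, 0], [0, 1]]
```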
</li>
<li id="fn:convexity-nn">
<p>The cost function is no longer convex, as $f$ is now a forward pass through a neural net, including multiple applications of the non-linear activation function. <a href="#fnref:convexity-nn" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
The course follows a few books: Christopher Bishop, Pattern Recognition and Machine Learning; Kevin Patrick Murphy, Machine Learning: a Probabilistic Perspective; Michael Nielsen, Neural Networks and Deep Learning. The repository for code labs and lecture notes is on GitHub. A useful website for this course is matrixcalculus.org.CS-452 Foundations of Software2018-09-18T00:00:00+00:002018-09-18T00:00:00+00:00https://kjaer.io/fos
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#writing-a-parser-with-parser-combinators" id="markdown-toc-writing-a-parser-with-parser-combinators">Writing a parser with parser combinators</a> <ul>
<li><a href="#boilerplate" id="markdown-toc-boilerplate">Boilerplate</a></li>
<li><a href="#the-basic-idea" id="markdown-toc-the-basic-idea">The basic idea</a></li>
<li><a href="#simple-parser-primitives" id="markdown-toc-simple-parser-primitives">Simple parser primitives</a></li>
<li><a href="#parser-combinators" id="markdown-toc-parser-combinators">Parser combinators</a></li>
<li><a href="#shorthands" id="markdown-toc-shorthands">Shorthands</a></li>
<li><a href="#example-json-parser" id="markdown-toc-example-json-parser">Example: JSON parser</a></li>
<li><a href="#the-trouble-with-left-recursion" id="markdown-toc-the-trouble-with-left-recursion">The trouble with left-recursion</a></li>
</ul>
</li>
<li><a href="#arithmetic-expressions--abstract-syntax-and-proof-principles" id="markdown-toc-arithmetic-expressions--abstract-syntax-and-proof-principles">Arithmetic expressions — abstract syntax and proof principles</a> <ul>
<li><a href="#basics-of-induction" id="markdown-toc-basics-of-induction">Basics of induction</a></li>
<li><a href="#mathematical-representation-of-syntax" id="markdown-toc-mathematical-representation-of-syntax">Mathematical representation of syntax</a> <ul>
<li><a href="#mathematical-representation-1" id="markdown-toc-mathematical-representation-1">Mathematical representation 1</a></li>
<li><a href="#mathematical-representation-2" id="markdown-toc-mathematical-representation-2">Mathematical representation 2</a></li>
<li><a href="#mathematical-representation-3" id="markdown-toc-mathematical-representation-3">Mathematical representation 3</a></li>
<li><a href="#comparison-of-the-representations" id="markdown-toc-comparison-of-the-representations">Comparison of the representations</a></li>
</ul>
</li>
<li><a href="#induction-on-terms" id="markdown-toc-induction-on-terms">Induction on terms</a></li>
<li><a href="#inductive-function-definitions" id="markdown-toc-inductive-function-definitions">Inductive function definitions</a> <ul>
<li><a href="#what-is-a-function" id="markdown-toc-what-is-a-function">What is a function?</a></li>
<li><a href="#induction-example-1" id="markdown-toc-induction-example-1">Induction example 1</a></li>
<li><a href="#induction-example-2" id="markdown-toc-induction-example-2">Induction example 2</a></li>
</ul>
</li>
<li><a href="#operational-semantics-and-reasoning" id="markdown-toc-operational-semantics-and-reasoning">Operational semantics and reasoning</a> <ul>
<li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
<li><a href="#derivations" id="markdown-toc-derivations">Derivations</a></li>
<li><a href="#inversion-lemma" id="markdown-toc-inversion-lemma">Inversion lemma</a></li>
</ul>
</li>
<li><a href="#abstract-machines" id="markdown-toc-abstract-machines">Abstract machines</a></li>
<li><a href="#normal-forms" id="markdown-toc-normal-forms">Normal forms</a> <ul>
<li><a href="#values-that-are-normal-form" id="markdown-toc-values-that-are-normal-form">Values that are normal form</a></li>
<li><a href="#values-that-are-not-normal-form" id="markdown-toc-values-that-are-not-normal-form">Values that are not normal form</a></li>
</ul>
</li>
<li><a href="#multi-step-evaluation" id="markdown-toc-multi-step-evaluation">Multi-step evaluation</a></li>
<li><a href="#termination-of-evaluation" id="markdown-toc-termination-of-evaluation">Termination of evaluation</a></li>
</ul>
</li>
<li><a href="#lambda-calculus" id="markdown-toc-lambda-calculus">Lambda calculus</a> <ul>
<li><a href="#pure-lambda-calculus" id="markdown-toc-pure-lambda-calculus">Pure lambda calculus</a> <ul>
<li><a href="#scope" id="markdown-toc-scope">Scope</a></li>
<li><a href="#operational-semantics" id="markdown-toc-operational-semantics">Operational semantics</a></li>
<li><a href="#evaluation-strategies" id="markdown-toc-evaluation-strategies">Evaluation strategies</a></li>
</ul>
</li>
<li><a href="#classical-lambda-calculus" id="markdown-toc-classical-lambda-calculus">Classical lambda calculus</a> <ul>
<li><a href="#confluence-in-full-beta-reduction" id="markdown-toc-confluence-in-full-beta-reduction">Confluence in full beta reduction</a></li>
<li><a href="#alpha-conversion" id="markdown-toc-alpha-conversion">Alpha conversion</a></li>
</ul>
</li>
<li><a href="#programming-in-lambda-calculus" id="markdown-toc-programming-in-lambda-calculus">Programming in lambda-calculus</a> <ul>
<li><a href="#multiple-arguments" id="markdown-toc-multiple-arguments">Multiple arguments</a></li>
<li><a href="#booleans" id="markdown-toc-booleans">Booleans</a></li>
<li><a href="#pairs" id="markdown-toc-pairs">Pairs</a></li>
<li><a href="#numbers" id="markdown-toc-numbers">Numbers</a></li>
<li><a href="#lists" id="markdown-toc-lists">Lists</a></li>
</ul>
</li>
<li><a href="#recursion-in-lambda-calculus" id="markdown-toc-recursion-in-lambda-calculus">Recursion in lambda-calculus</a></li>
<li><a href="#equivalence-of-lambda-terms" id="markdown-toc-equivalence-of-lambda-terms">Equivalence of lambda terms</a></li>
</ul>
</li>
<li><a href="#types" id="markdown-toc-types">Types</a> <ul>
<li><a href="#properties-of-the-typing-relation" id="markdown-toc-properties-of-the-typing-relation">Properties of the Typing Relation</a> <ul>
<li><a href="#inversion-lemma-1" id="markdown-toc-inversion-lemma-1">Inversion lemma</a></li>
<li><a href="#canonical-form" id="markdown-toc-canonical-form">Canonical form</a></li>
<li><a href="#progress-theorem" id="markdown-toc-progress-theorem">Progress Theorem</a></li>
<li><a href="#preservation-theorem" id="markdown-toc-preservation-theorem">Preservation Theorem</a></li>
</ul>
</li>
<li><a href="#messing-with-it" id="markdown-toc-messing-with-it">Messing with it</a> <ul>
<li><a href="#removing-a-rule" id="markdown-toc-removing-a-rule">Removing a rule</a></li>
<li><a href="#changing-type-checking-rule" id="markdown-toc-changing-type-checking-rule">Changing type-checking rule</a></li>
<li><a href="#adding-bit" id="markdown-toc-adding-bit">Adding bit</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#simply-typed-lambda-calculus" id="markdown-toc-simply-typed-lambda-calculus">Simply typed lambda calculus</a> <ul>
<li><a href="#type-annotations" id="markdown-toc-type-annotations">Type annotations</a></li>
<li><a href="#typing-rules" id="markdown-toc-typing-rules">Typing rules</a></li>
<li><a href="#inversion-lemma-2" id="markdown-toc-inversion-lemma-2">Inversion lemma</a></li>
<li><a href="#canonical-form-1" id="markdown-toc-canonical-form-1">Canonical form</a></li>
<li><a href="#progress" id="markdown-toc-progress">Progress</a></li>
<li><a href="#preservation" id="markdown-toc-preservation">Preservation</a> <ul>
<li><a href="#weakening-lemma" id="markdown-toc-weakening-lemma">Weakening lemma</a></li>
<li><a href="#permutation-lemma" id="markdown-toc-permutation-lemma">Permutation lemma</a></li>
<li><a href="#substitution-lemma" id="markdown-toc-substitution-lemma">Substitution lemma</a></li>
<li><a href="#proof" id="markdown-toc-proof">Proof</a></li>
</ul>
</li>
<li><a href="#erasure" id="markdown-toc-erasure">Erasure</a></li>
<li><a href="#curry-howard-correspondence" id="markdown-toc-curry-howard-correspondence">Curry-Howard Correspondence</a></li>
<li><a href="#extensions-to-stlc" id="markdown-toc-extensions-to-stlc">Extensions to STLC</a> <ul>
<li><a href="#base-types" id="markdown-toc-base-types">Base types</a></li>
<li><a href="#unit-type" id="markdown-toc-unit-type">Unit type</a></li>
<li><a href="#sequencing" id="markdown-toc-sequencing">Sequencing</a></li>
<li><a href="#ascription" id="markdown-toc-ascription">Ascription</a></li>
<li><a href="#pairs-1" id="markdown-toc-pairs-1">Pairs</a></li>
<li><a href="#tuples" id="markdown-toc-tuples">Tuples</a></li>
<li><a href="#records" id="markdown-toc-records">Records</a></li>
</ul>
</li>
<li><a href="#sums-and-variants" id="markdown-toc-sums-and-variants">Sums and variants</a> <ul>
<li><a href="#sum-type" id="markdown-toc-sum-type">Sum type</a></li>
<li><a href="#sums-and-uniqueness-of-type" id="markdown-toc-sums-and-uniqueness-of-type">Sums and uniqueness of type</a></li>
<li><a href="#variants" id="markdown-toc-variants">Variants</a></li>
</ul>
</li>
<li><a href="#recursion" id="markdown-toc-recursion">Recursion</a></li>
<li><a href="#references" id="markdown-toc-references">References</a> <ul>
<li><a href="#mutability" id="markdown-toc-mutability">Mutability</a></li>
<li><a href="#aliasing" id="markdown-toc-aliasing">Aliasing</a></li>
<li><a href="#typing-rules-1" id="markdown-toc-typing-rules-1">Typing rules</a></li>
<li><a href="#evaluation-1" id="markdown-toc-evaluation-1">Evaluation</a></li>
<li><a href="#store-typing" id="markdown-toc-store-typing">Store typing</a></li>
<li><a href="#safety" id="markdown-toc-safety">Safety</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#type-reconstruction-and-polymorphism" id="markdown-toc-type-reconstruction-and-polymorphism">Type reconstruction and polymorphism</a> <ul>
<li><a href="#constraint-based-typing-algorithm" id="markdown-toc-constraint-based-typing-algorithm">Constraint-based Typing Algorithm</a> <ul>
<li><a href="#constraint-generation" id="markdown-toc-constraint-generation">Constraint generation</a></li>
<li><a href="#soundness-and-completeness" id="markdown-toc-soundness-and-completeness">Soundness and completeness</a></li>
<li><a href="#substitutions" id="markdown-toc-substitutions">Substitutions</a></li>
<li><a href="#unification" id="markdown-toc-unification">Unification</a></li>
<li><a href="#strong-normalization" id="markdown-toc-strong-normalization">Strong normalization</a></li>
</ul>
</li>
<li><a href="#polymorphism" id="markdown-toc-polymorphism">Polymorphism</a> <ul>
<li><a href="#explicit-polymorphism" id="markdown-toc-explicit-polymorphism">Explicit polymorphism</a></li>
<li><a href="#implicit-polymorphism" id="markdown-toc-implicit-polymorphism">Implicit polymorphism</a></li>
<li><a href="#alternative-hindley-milner" id="markdown-toc-alternative-hindley-milner">Alternative Hindley Milner</a></li>
</ul>
</li>
<li><a href="#principal-types" id="markdown-toc-principal-types">Principal types</a></li>
</ul>
</li>
<li><a href="#subtyping" id="markdown-toc-subtyping">Subtyping</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#rules" id="markdown-toc-rules">Rules</a> <ul>
<li><a href="#general-rules" id="markdown-toc-general-rules">General rules</a></li>
<li><a href="#records-1" id="markdown-toc-records-1">Records</a></li>
<li><a href="#arrow-types" id="markdown-toc-arrow-types">Arrow types</a></li>
<li><a href="#top-type" id="markdown-toc-top-type">Top type</a></li>
<li><a href="#aside-structural-vs-declared-subtyping" id="markdown-toc-aside-structural-vs-declared-subtyping">Aside: structural vs. declared subtyping</a></li>
</ul>
</li>
<li><a href="#properties-of-subtyping" id="markdown-toc-properties-of-subtyping">Properties of subtyping</a> <ul>
<li><a href="#safety-1" id="markdown-toc-safety-1">Safety</a></li>
<li><a href="#inversion-lemma-for-subtyping" id="markdown-toc-inversion-lemma-for-subtyping">Inversion lemma for subtyping</a></li>
<li><a href="#inversion-lemma-for-typing" id="markdown-toc-inversion-lemma-for-typing">Inversion lemma for typing</a></li>
<li><a href="#preservation-1" id="markdown-toc-preservation-1">Preservation</a></li>
</ul>
</li>
<li><a href="#subtyping-features" id="markdown-toc-subtyping-features">Subtyping features</a> <ul>
<li><a href="#casting" id="markdown-toc-casting">Casting</a></li>
<li><a href="#variants-1" id="markdown-toc-variants-1">Variants</a></li>
<li><a href="#covariance" id="markdown-toc-covariance">Covariance</a></li>
<li><a href="#invariance" id="markdown-toc-invariance">Invariance</a></li>
</ul>
</li>
<li><a href="#algorithmic-subtyping" id="markdown-toc-algorithmic-subtyping">Algorithmic subtyping</a></li>
</ul>
</li>
<li><a href="#objects" id="markdown-toc-objects">Objects</a> <ul>
<li><a href="#dynamic-dispatch" id="markdown-toc-dynamic-dispatch">Dynamic dispatch</a></li>
<li><a href="#encapsulation" id="markdown-toc-encapsulation">Encapsulation</a></li>
<li><a href="#inheritance" id="markdown-toc-inheritance">Inheritance</a></li>
<li><a href="#this" id="markdown-toc-this">This</a></li>
<li><a href="#using-this" id="markdown-toc-using-this">Using <code class="highlighter-rouge">this</code></a></li>
</ul>
</li>
<li><a href="#featherweight-java" id="markdown-toc-featherweight-java">Featherweight Java</a> <ul>
<li><a href="#structural-vs-nominal-type-systems" id="markdown-toc-structural-vs-nominal-type-systems">Structural vs. Nominal type systems</a></li>
<li><a href="#representing-objects" id="markdown-toc-representing-objects">Representing objects</a></li>
<li><a href="#syntax" id="markdown-toc-syntax">Syntax</a></li>
<li><a href="#evaluation-2" id="markdown-toc-evaluation-2">Evaluation</a></li>
<li><a href="#typing" id="markdown-toc-typing">Typing</a></li>
<li><a href="#properties" id="markdown-toc-properties">Properties</a> <ul>
<li><a href="#progress-1" id="markdown-toc-progress-1">Progress</a></li>
<li><a href="#preservation-2" id="markdown-toc-preservation-2">Preservation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#foundations-of-scala" id="markdown-toc-foundations-of-scala">Foundations of Scala</a> <ul>
<li><a href="#modeling-lists" id="markdown-toc-modeling-lists">Modeling Lists</a></li>
<li><a href="#abstract-class" id="markdown-toc-abstract-class">Abstract class</a></li>
<li><a href="#dot" id="markdown-toc-dot">DOT</a> <ul>
<li><a href="#evaluation-3" id="markdown-toc-evaluation-3">Evaluation</a></li>
</ul>
</li>
<li><a href="#abstract-types" id="markdown-toc-abstract-types">Abstract types</a></li>
<li><a href="#progress-and-preservation" id="markdown-toc-progress-and-preservation">Progress and preservation</a></li>
</ul>
</li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<h2 id="writing-a-parser-with-parser-combinators">Writing a parser with parser combinators</h2>
<p>In Scala, you can (ab)use operator overloading to create an embedded DSL (EDSL) for grammars. While a grammar may look as follows in a grammar description language (Bison, Yacc, ANTLR, …):</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>Expr ::= Term {'+' Term | '−' Term}
Term ::= Factor {'∗' Factor | '/' Factor}
Factor ::= Number | '(' Expr ')'</pre></td></tr></tbody></table></code></pre></figure>
<p>In Scala, we can model it as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"−"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"∗"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is perhaps a little less elegant, but allows us to encode it directly into our language, which is often useful for interop.</p>
<p>The <code class="highlighter-rouge">~</code>, <code class="highlighter-rouge">|</code>, <code class="highlighter-rouge">rep</code> and <code class="highlighter-rouge">opt</code> are <strong>parser combinators</strong>. These are primitives with which we can construct a full parser for the grammar of our choice.</p>
<h3 id="boilerplate">Boilerplate</h3>
<p>First, let’s define a class <code class="highlighter-rouge">ParseResult[T]</code> as an ad-hoc monad; parsing can either succeed or fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="nc">case</span> <span class="k">class</span> <span class="nc">Success</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">result</span><span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">msg</span> <span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 <code class="highlighter-rouge">Nothing</code> is the bottom type in Scala; it has no values, nothing can extend it, and it is a subtype of every other type</p>
</blockquote>
<p>Let’s also define the tokens produced by the lexer (which we won’t define) as case classes extending <code class="highlighter-rouge">Token</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Keyword</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">NumericLit</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">StringLit</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Identifier</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Input into the parser is then a lazy stream of tokens (with positions for error diagnostics, which we’ll omit here):</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">type</span> <span class="kt">Input</span> <span class="o">=</span> <span class="nc">Reader</span><span class="o">[</span><span class="kt">Token</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can then define a standard, sample parser which looks as follows on the type-level:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">class</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Parser</span> <span class="o">=</span> <span class="nc">Input</span> <span class="k">=></span> <span class="nc">ParseResult</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="the-basic-idea">The basic idea</h3>
<p>For each language (defined by a grammar symbol <code class="highlighter-rouge">S</code>), define a function <code class="highlighter-rouge">f</code> that, given an input stream <code class="highlighter-rouge">i</code> (with tail <code class="highlighter-rouge">i'</code>):</p>
<ul>
<li>if a prefix of <code class="highlighter-rouge">i</code> is in <code class="highlighter-rouge">S</code>, return <code class="highlighter-rouge">Success(Pair(x, i'))</code>, where <code class="highlighter-rouge">x</code> is a result for <code class="highlighter-rouge">S</code></li>
<li>otherwise, return <code class="highlighter-rouge">Failure(msg, i)</code>, where <code class="highlighter-rouge">msg</code> is an error message string</li>
</ul>
<p>The first is called <em>success</em>, the second is <em>failure</em>. We can compose operations on this somewhat conveniently, like we would on a monad (like <code class="highlighter-rouge">Option</code>).</p>
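<p>As a minimal sketch of such a function <code class="highlighter-rouge">f</code> for the one-token language <code class="highlighter-rouge">{a}</code> (simplified to <code class="highlighter-rouge">String</code> input and <code class="highlighter-rouge">Either</code>, rather than the <code class="highlighter-rouge">Reader</code>-based types above):</p>

```scala
// Sketch: Right(result, rest) stands in for Success, Left(msg) for Failure.
def parseA(in: String): Either[String, (Char, String)] =
  if (in.startsWith("a")) Right(('a', in.drop(1)))  // consume the prefix
  else Left("'a' expected")                         // fail, input untouched
```

<p>For instance, <code class="highlighter-rouge">parseA("abc")</code> yields <code class="highlighter-rouge">Right(('a', "bc"))</code>, while <code class="highlighter-rouge">parseA("xbc")</code> fails without consuming anything.</p>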
<h3 id="simple-parser-primitives">Simple parser primitives</h3>
<p>All of the above boilerplate allows us to define a parser that succeeds if the first token in the input satisfies a given predicate <code class="highlighter-rouge">pred</code>. When it succeeds, it returns the token’s character string as the result, along with the remaining input.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">token</span><span class="o">(</span><span class="n">kind</span><span class="k">:</span> <span class="kt">String</span><span class="o">)(</span><span class="n">pred</span><span class="k">:</span> <span class="kt">Token</span> <span class="o">=></span> <span class="kt">Boolean</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">pred</span><span class="o">(</span><span class="n">in</span><span class="o">.</span><span class="n">head</span><span class="o">))</span> <span class="nc">Success</span><span class="o">(</span><span class="n">in</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">chars</span><span class="o">,</span> <span class="n">in</span><span class="o">.</span><span class="n">tail</span><span class="o">)</span>
<span class="k">else</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">kind</span> <span class="o">+</span> <span class="s">" expected "</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can use this to define a keyword parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="k">implicit</span> <span class="k">def</span> <span class="n">keyword</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"'"</span> <span class="o">+</span> <span class="n">chars</span> <span class="o">+</span> <span class="s">"'"</span><span class="o">)</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Keyword</span><span class="o">(</span><span class="n">chars1</span><span class="o">)</span> <span class="k">=></span> <span class="n">chars</span> <span class="o">==</span> <span class="n">chars1</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="kc">false</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Marking it as <code class="highlighter-rouge">implicit</code> allows us to write keywords as normal strings, where we can omit the <code class="highlighter-rouge">keyword</code> call (this helps us simplify the notation in our DSL; we can write <code class="highlighter-rouge">"if"</code> instead of <code class="highlighter-rouge">keyword("if")</code>).</p>
<p>We can make other parsers for our other case classes quite simply:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">numericLit</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"number"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">NumericLit</span><span class="o">])</span>
<span class="k">def</span> <span class="n">stringLit</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"string literal"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">StringLit</span><span class="o">])</span>
<span class="k">def</span> <span class="n">ident</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"identifier"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">Identifier</span><span class="o">])</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="parser-combinators">Parser combinators</h3>
<p>We are going to define the following parser combinators:</p>
<ul>
<li><code class="highlighter-rouge">~</code>: sequential composition</li>
<li><code class="highlighter-rouge"><~</code>, <code class="highlighter-rouge">~></code>: sequential composition, keeping left / right only</li>
<li><code class="highlighter-rouge">|</code>: alternative</li>
<li><code class="highlighter-rouge">opt(X)</code>: option (like a <code class="highlighter-rouge">?</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">rep(X)</code>: repetition (like a <code class="highlighter-rouge">*</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">repsep(P, Q)</code>: interleaved repetition</li>
<li><code class="highlighter-rouge">^^</code>: result conversion (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>)</li>
<li><code class="highlighter-rouge">^^^</code>: constant result (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>, but returning a constant value regardless of result)</li>
</ul>
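<p>To get a feel for the two less obvious ones, <code class="highlighter-rouge">^^</code> and <code class="highlighter-rouge">repsep</code>, here is a self-contained sketch over simplified string parsers (plain functions with made-up names <code class="highlighter-rouge">mapResult</code> and <code class="highlighter-rouge">repSep</code>, rather than the real combinator methods):</p>

```scala
// A simplified parser: consume a prefix of the input, or fail with None.
type P[T] = String => Option[(T, String)]

// Parse a single digit character into its Int value.
val digit: P[Int] = in =>
  if (in.nonEmpty && in.head.isDigit) Some((in.head.asDigit, in.tail)) else None

// ^^ : result conversion, i.e. a map over the parse result.
def mapResult[T, U](p: P[T])(f: T => U): P[U] =
  in => p(in).map { case (t, rest) => (f(t), rest) }

// repsep(p, sep): one or more p's separated by sep, e.g. "1,2,3".
def repSep[T](p: P[T], sep: Char): P[List[T]] = in =>
  p(in).map { case (first, r0) =>
    def loop(rest: String, acc: List[T]): (List[T], String) =
      if (rest.headOption.contains(sep))
        p(rest.tail) match {
          case Some((t, r)) => loop(r, t :: acc)
          case None         => (acc, rest)
        }
      else (acc, rest)
    val (acc, rest) = loop(r0, List(first))
    (acc.reverse, rest)
  }
```

<p>With these, <code class="highlighter-rouge">repSep(digit, ',')("1,2,3")</code> yields <code class="highlighter-rouge">Some((List(1, 2, 3), ""))</code>, and <code class="highlighter-rouge">mapResult(digit)(_ * 2)("7!")</code> yields <code class="highlighter-rouge">Some((14, "!"))</code>.</p>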
<p>But first, we’ll write some very basic parser combinators: <code class="highlighter-rouge">success</code> and <code class="highlighter-rouge">failure</code>, that respectively always succeed and always fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">success</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">result</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Success</span><span class="o">(</span><span class="n">result</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">failure</span><span class="o">(</span><span class="n">msg</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">msg</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>All of the above are methods on the <code class="highlighter-rouge">Parser[T]</code> class. Thanks to infix notation in Scala, we can write <code class="highlighter-rouge">x.y(z)</code> as <code class="highlighter-rouge">x y z</code>, which simplifies our DSL notation: for instance, <code class="highlighter-rouge">A ~ B</code> corresponds to <code class="highlighter-rouge">A.~(B)</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
</pre></td><td class="code"><pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="c1">// An abstract method that defines the parser function
</span> <span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span><span class="k">:</span> <span class="kt">ParseResult</span>
<span class="k">def</span> <span class="o">~[</span><span class="kt">U</span><span class="o">](</span><span class="n">rhs</span><span class="k">:</span> <span class="o">=></span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span> <span class="kt">~</span> <span class="kt">U</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">tail</span><span class="o">)</span> <span class="k">=></span> <span class="n">rhs</span><span class="o">(</span><span class="n">tail</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">rest</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Success</span><span class="o">(</span><span class="k">new</span> <span class="o">~(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">),</span> <span class="n">rest</span><span class="o">)</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">failure</span>
<span class="o">}</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">failure</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">|(</span><span class="n">rhs</span><span class="k">:</span> <span class="o">=></span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">s1</span> <span class="k">@</span> <span class="nc">Success</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=></span> <span class="n">s1</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">rhs</span><span class="o">(</span><span class="n">in</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">^^[</span><span class="kt">U</span><span class="o">](</span><span class="n">f</span><span class="k">:</span> <span class="kt">T</span> <span class="o">=></span> <span class="n">U</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">tail</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Success</span><span class="o">(</span><span class="n">f</span><span class="o">(</span><span class="n">x</span><span class="o">),</span> <span class="n">tail</span><span class="o">)</span>
<span class="k">case</span> <span class="n">x</span> <span class="k">=></span> <span class="n">x</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">^^^[</span><span class="kt">U</span><span class="o">](</span><span class="n">r</span><span class="k">:</span> <span class="kt">U</span><span class="o">)</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span> <span class="k">=</span> <span class="o">^^(</span><span class="n">x</span> <span class="k">=></span> <span class="n">r</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 In Scala, <code class="highlighter-rouge">T ~ U</code> is syntactic sugar for <code class="highlighter-rouge">~[T, U]</code>, which is the type of the case class we’ll define below</p>
</blockquote>
<p>For the <code class="highlighter-rouge">~</code> combinator, when everything works, we’re using <code class="highlighter-rouge">~</code>, a case class that is equivalent to <code class="highlighter-rouge">Pair</code>, but prints the way we want to and allows for the concise type-level notation above.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">case</span> <span class="k">class</span> <span class="nc">~</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">_1</span> <span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">_2</span> <span class="k">:</span> <span class="kt">U</span><span class="o">)</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">toString</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">+</span> <span class="n">_1</span> <span class="o">+</span> <span class="s">" ~ "</span> <span class="o">+</span> <span class="n">_2</span> <span class="o">+</span><span class="s">")"</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>At this point, we thus have <strong>two</strong> different meanings for <code class="highlighter-rouge">~</code>: a <em>function</em> <code class="highlighter-rouge">~</code> that produces a <code class="highlighter-rouge">Parser</code>, and the <code class="highlighter-rouge">~(a, b)</code> <em>case class</em> pair that this parser returns (all of this is encoded in the function signature of the <code class="highlighter-rouge">~</code> function).</p>
<p>Note that the <code class="highlighter-rouge">|</code> combinator takes the right-hand side parser as a call-by-name argument. This is because we don’t want to evaluate it unless it is strictly needed—that is, if the left-hand side fails.</p>
<p><code class="highlighter-rouge">^^</code> is like a <code class="highlighter-rouge">map</code> operation on <code class="highlighter-rouge">Option</code>: <code class="highlighter-rouge">P ^^ f</code> succeeds iff <code class="highlighter-rouge">P</code> succeeds, in which case it applies the transformation <code class="highlighter-rouge">f</code> to the result of <code class="highlighter-rouge">P</code>. Otherwise, it fails.</p>
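<p>To make these definitions concrete, here is a minimal, self-contained sketch of the combinators over a token-list input. The names <code class="highlighter-rouge">MiniParsers</code> and <code class="highlighter-rouge">token</code> are our own additions for the demonstration, and <code class="highlighter-rouge">Input</code> is assumed to be a <code class="highlighter-rouge">List[String]</code> of tokens:</p>

```scala
// A minimal, self-contained sketch of the combinators above.
// Assumptions: tokens are plain strings, and Input is a List[String];
// the `token` helper is our own addition for the demonstration.
object MiniParsers {
  type Input = List[String]

  sealed trait ParseResult[+T]
  case class Success[T](result: T, rest: Input) extends ParseResult[T]
  case class Failure(msg: String, rest: Input) extends ParseResult[Nothing]

  // Pair case class, printed as (a ~ b)
  case class ~[+T, +U](_1: T, _2: U) {
    override def toString = "(" + _1 + " ~ " + _2 + ")"
  }

  abstract class Parser[+T] { self =>
    def apply(in: Input): ParseResult[T]

    // Sequence: run `self`, then `rhs` on the remaining input
    def ~[U](rhs: => Parser[U]): Parser[T ~ U] = new Parser[T ~ U] {
      def apply(in: Input) = self(in) match {
        case Success(x, tail) => rhs(tail) match {
          case Success(y, rest) => Success(new ~(x, y), rest)
          case fail: Failure    => fail
        }
        case fail: Failure => fail
      }
    }

    // Alternative: try `rhs` only if `self` fails
    def |[U >: T](rhs: => Parser[U]): Parser[U] = new Parser[U] {
      def apply(in: Input) = self(in) match {
        case s @ Success(_, _) => s
        case _                 => rhs(in)
      }
    }

    // Transformation: map the result on success
    def ^^[U](f: T => U): Parser[U] = new Parser[U] {
      def apply(in: Input) = self(in) match {
        case Success(x, tail) => Success(f(x), tail)
        case fail: Failure    => fail
      }
    }
  }

  // Elementary parser recognizing one given token
  def token(t: String): Parser[String] = new Parser[String] {
    def apply(in: Input) = in match {
      case `t` :: rest => Success(t, rest)
      case _           => Failure("expected " + t, in)
    }
  }
}
```

<p>For instance, <code class="highlighter-rouge">token("a") ~ token("b")</code> succeeds on <code class="highlighter-rouge">List("a", "b")</code> and fails on <code class="highlighter-rouge">List("a", "c")</code>.</p>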
<h3 id="shorthands">Shorthands</h3>
<p>We can now define shorthands for common combinations of parser combinators:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">opt</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Option</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span> <span class="n">p</span> <span class="o">^^</span> <span class="nc">Some</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">None</span><span class="o">)</span>
<span class="k">def</span> <span class="n">rep</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
<span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="n">x</span> <span class="o">::</span> <span class="n">xs</span> <span class="o">}</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">Nil</span><span class="o">)</span>
<span class="k">def</span> <span class="n">repsep</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">q</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
<span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">q</span> <span class="o">~></span> <span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">r</span> <span class="o">~</span> <span class="n">rs</span> <span class="k">=></span> <span class="n">r</span> <span class="o">::</span> <span class="n">rs</span> <span class="o">}</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">Nil</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Note that none of the above can fail. They may, however, return <code class="highlighter-rouge">None</code> or <code class="highlighter-rouge">Nil</code> wrapped in <code class="highlighter-rouge">success</code>.</p>
<p>As an exercise, we can implement the <code class="highlighter-rouge">rep1(P)</code> parser combinator, which corresponds to the <code class="highlighter-rouge">+</code> regex quantifier:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">rep1</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span> <span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="n">x</span> <span class="o">::</span> <span class="n">xs</span> <span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="example-json-parser">Example: JSON parser</h3>
<p>Let’s define a JSON parser. Scala’s parser combinator library has a <code class="highlighter-rouge">StandardTokenParsers</code> trait that gives us a variety of utility methods for lexing, like <code class="highlighter-rouge">lexical.delimiters</code>, <code class="highlighter-rouge">lexical.reserved</code>, <code class="highlighter-rouge">stringLit</code> and <code class="highlighter-rouge">numericLit</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">JSON</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">+=</span> <span class="o">(</span><span class="s">"{"</span><span class="o">,</span> <span class="s">"}"</span><span class="o">,</span> <span class="s">"["</span><span class="o">,</span> <span class="s">"]"</span><span class="o">,</span> <span class="s">":"</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">reserved</span> <span class="o">+=</span> <span class="o">(</span><span class="s">"null"</span><span class="o">,</span> <span class="s">"true"</span><span class="o">,</span> <span class="s">"false"</span><span class="o">)</span>
<span class="c1">// Return Map
</span> <span class="k">def</span> <span class="n">obj</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"{"</span> <span class="o">~></span> <span class="n">repsep</span><span class="o">(</span><span class="n">member</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span> <span class="o"><~</span> <span class="s">"}"</span> <span class="o">^^</span> <span class="o">(</span><span class="n">ms</span> <span class="k">=></span> <span class="nc">Map</span><span class="o">()</span> <span class="o">++</span> <span class="n">ms</span><span class="o">)</span>
<span class="c1">// Return List
</span> <span class="k">def</span> <span class="n">arr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"["</span> <span class="o">~></span> <span class="n">repsep</span><span class="o">(</span><span class="n">value</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span> <span class="o"><~</span> <span class="s">"]"</span>
<span class="c1">// Return name/value pair:
</span> <span class="k">def</span> <span class="n">member</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">stringLit</span> <span class="o">~</span> <span class="s">":"</span> <span class="o">~</span> <span class="n">value</span> <span class="o">^^</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">name</span> <span class="o">~</span> <span class="s">":"</span> <span class="o">~</span> <span class="n">value</span> <span class="k">=></span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// Return correct Scala type
</span> <span class="k">def</span> <span class="n">value</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="o">(</span>
<span class="n">obj</span>
<span class="o">|</span> <span class="n">arr</span>
<span class="o">|</span> <span class="n">stringLit</span>
<span class="o">|</span> <span class="n">numericLit</span> <span class="o">^^</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toInt</span><span class="o">)</span>
<span class="o">|</span> <span class="s">"null"</span> <span class="o">^^^</span> <span class="kc">null</span>
<span class="o">|</span> <span class="s">"true"</span> <span class="o">^^^</span> <span class="kc">true</span>
<span class="o">|</span> <span class="s">"false"</span> <span class="o">^^^</span> <span class="kc">false</span> <span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="the-trouble-with-left-recursion">The trouble with left-recursion</h3>
<p>Parser combinators work top-down and therefore do not allow for left-recursion. For example, the following would go into an infinite loop: to parse <code class="highlighter-rouge">expr</code>, the parser immediately recurses into <code class="highlighter-rouge">expr</code> again, without ever consuming a token:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">expr</span> <span class="k">=</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s take a look at an arithmetic expression parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">Arithmetic</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">++=</span> <span class="nc">List</span><span class="o">(</span><span class="s">"("</span><span class="o">,</span> <span class="s">")"</span><span class="o">,</span> <span class="s">"+"</span><span class="o">,</span> <span class="s">"-"</span><span class="o">,</span> <span class="s">"*"</span><span class="o">,</span> <span class="s">"/"</span><span class="o">)</span>
<span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"*"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This definition of <code class="highlighter-rouge">expr</code>, namely <code class="highlighter-rouge">term ~ rep("-" ~ term)</code>, does not directly produce the left-leaning tree we want for left-associative operators; instead, it yields a flat structure. For instance, <code class="highlighter-rouge">1 - 2 - 3</code> produces <code class="highlighter-rouge">1 ~ List("-" ~ 2, "-" ~ 3)</code>.</p>
<p>The solution is to combine calls to <code class="highlighter-rouge">rep</code> with a final <code class="highlighter-rouge">foldLeft</code> on the resulting list:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">Arithmetic</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">++=</span> <span class="nc">List</span><span class="o">(</span><span class="s">"("</span><span class="o">,</span> <span class="s">")"</span><span class="o">,</span> <span class="s">"+"</span><span class="o">,</span> <span class="s">"-"</span><span class="o">,</span> <span class="s">"*"</span><span class="o">,</span> <span class="s">"/"</span><span class="o">)</span>
<span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span> <span class="o">^^</span> <span class="n">reduceList</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"*"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span> <span class="o">^^</span> <span class="n">reduceList</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">reduceList</span><span class="o">(</span><span class="n">list</span><span class="k">:</span> <span class="kt">Int</span> <span class="kt">~</span> <span class="kt">List</span><span class="o">[</span><span class="kt">String</span> <span class="kt">~</span> <span class="kt">Int</span><span class="o">])</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="n">list</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="o">(</span><span class="n">xs</span> <span class="n">foldLeft</span> <span class="n">x</span><span class="o">)(</span><span class="n">reduce</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">reduce</span><span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">r</span><span class="k">:</span> <span class="kt">String</span> <span class="kt">~</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">r</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="s">"+"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">-</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"*"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">/</span> <span class="n">y</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">MatchError</span><span class="o">(</span><span class="s">"illegal case: "</span> <span class="o">+</span> <span class="n">r</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
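<p>To see the left fold in isolation, here is a small self-contained sketch (with its own local copy of the <code class="highlighter-rouge">~</code> pair class) that reduces the flat shape produced for <code class="highlighter-rouge">1 - 2 - 3</code>:</p>

```scala
// Self-contained sketch of the left-fold reduction, with a local copy of
// the ~ pair class. Reducing 1 ~ List("-" ~ 2, "-" ~ 3) left-associatively
// computes (1 - 2) - 3.
object FoldDemo {
  case class ~[T, U](_1: T, _2: U)

  def reduce(x: Int, r: String ~ Int): Int = r match {
    case "+" ~ y => x + y
    case "-" ~ y => x - y
    case "*" ~ y => x * y
    case "/" ~ y => x / y
    case _       => throw new MatchError("illegal case: " + r)
  }

  def reduceList(list: Int ~ List[String ~ Int]): Int = list match {
    case x ~ xs => (xs foldLeft x)(reduce)
  }

  // The flat result shape for "1 - 2 - 3": one leading term, then a
  // list of (operator, operand) pairs
  val flat: Int ~ List[String ~ Int] =
    new ~(1, List(new ~("-", 2), new ~("-", 3)))
}
```

<p>Here <code class="highlighter-rouge">FoldDemo.reduceList(FoldDemo.flat)</code> evaluates to <code class="highlighter-rouge">-4</code>, the left-associative reading <code class="highlighter-rouge">(1 - 2) - 3</code>.</p>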
<blockquote>
<p>👉 It used to be that the standard library contained parser combinators, but those are now a <a href="https://github.com/scala/scala-parser-combinators">separate module</a>. This module contains a <code class="highlighter-rouge">chainl</code> (chain-left) method that reduces after a <code class="highlighter-rouge">rep</code> for you.</p>
</blockquote>
<h2 id="arithmetic-expressions--abstract-syntax-and-proof-principles">Arithmetic expressions — abstract syntax and proof principles</h2>
<p>This section follows Chapter 3 in TAPL.</p>
<h3 id="basics-of-induction">Basics of induction</h3>
<p>Ordinary induction is simply:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose P is a predicate on natural numbers.
Then:
If P(0)
and, for all i, P(i) implies P(i + 1)
then P(n) holds for all n
</code></pre></div></div>
<p>We can also do complete induction:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose P is a predicate on natural numbers.
Then:
If for each natural number n,
given P(i) for all i < n we can show P(n)
then P(n) holds for all n
</code></pre></div></div>
<p>It proves exactly the same thing as ordinary induction; it is simply a restated version. They’re <em>interderivable</em>: assuming one, we can prove the other. Which one to use is simply a matter of style or convenience. We’ll see more equivalent styles as we go along.</p>
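<p>To sketch one direction of the interderivability (our own outline, not from the lecture): assuming ordinary induction, complete induction follows by strengthening the predicate.</p>

```latex
% Deriving complete induction from ordinary induction (sketch).
% Define the strengthened predicate:
Q(n) \overset{\text{def}}{=} \text{``} P(i) \text{ holds for all } i < n \text{''}
\\
% Base case: there is no i < 0, so Q(0) holds vacuously.
% Inductive step: Q(n) gives P(i) for all i < n; the complete-induction
% hypothesis then yields P(n), and together these give Q(n + 1).
% Ordinary induction on Q gives Q(n) for all n, and hence P(n) for all n.
```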
<h3 id="mathematical-representation-of-syntax">Mathematical representation of syntax</h3>
<p>Let’s assume the following grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>t ::=
true
false
if t then t else t
0
succ t
pred t
iszero t</pre></td></tr></tbody></table></code></pre></figure>
<p>What does this really define? A few suggestions:</p>
<ul>
<li>A set of character strings</li>
<li>A set of token lists</li>
<li>A set of abstract syntax trees</li>
</ul>
<p>It depends on how you read it; a grammar like the one above contains information about all three.</p>
<p>However, we are mostly interested in the ASTs. The above grammar is therefore called an <strong>abstract grammar</strong>. Its main purpose is to suggest a mapping from character strings to trees.</p>
<p>We won’t be too strict in our use of these grammars. For instance, we’ll freely use parentheses to disambiguate which tree we mean to describe, even though they’re not strictly supported by the grammar. What matters to us here isn’t strict implementation semantics, but rather having a framework to talk about ASTs. For our purposes, we’ll consider two terms that produce the same AST to be basically the same; still, we’ll distinguish terms that only have the same evaluation result, as they don’t necessarily have the same AST.</p>
<p>How can we express our grammar as mathematical expressions? A grammar describes the legal <em>set</em> of terms in a program by offering a recursive definition. While recursive definitions may seem obvious and simple to a programmer, we have to go through a few hoops to make sense of them mathematically.</p>
<h4 id="mathematical-representation-1">Mathematical representation 1</h4>
<p>We can use a set $\mathcal{T}$ of terms. The grammar is then the smallest set such that:</p>
<ol>
<li>$\left\{ \text{true}, \text{false}, 0 \right\} \subseteq \mathcal{T}$,</li>
<li>If $t_1 \in \mathcal{T}$ then $\left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \right\} \subseteq \mathcal{T}$,</li>
<li>If $t_1, t_2, t_3 \in \mathcal{T}$ then we also have $\text{if } t_1 \text{ then } t_2 \text{ else } t_3 \in \mathcal{T}$.</li>
</ol>
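<p>In a programming language, this smallest-set definition corresponds directly to an algebraic data type: each clause becomes a constructor, and <code class="highlighter-rouge">sealed</code> captures the “smallest set” requirement, since no constructors other than these can exist. The names below, and the <code class="highlighter-rouge">size</code> function illustrating structural recursion, are our own sketch:</p>

```scala
// The set T of terms as a Scala ADT: one constructor per clause of the
// inductive definition; `sealed` means these are the only constructors.
object Terms {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  case class Succ(t: Term) extends Term
  case class Pred(t: Term) extends Term
  case class IsZero(t: Term) extends Term
  case class If(cond: Term, thn: Term, els: Term) extends Term

  // Structural recursion mirrors the inductive structure of the set:
  // each case handles one clause of the definition.
  def size(t: Term): Int = t match {
    case True | False | Zero => 1
    case Succ(t1)            => 1 + size(t1)
    case Pred(t1)            => 1 + size(t1)
    case IsZero(t1)          => 1 + size(t1)
    case If(t1, t2, t3)      => 1 + size(t1) + size(t2) + size(t3)
  }
}
```

<p>For instance, the term <code class="highlighter-rouge">if iszero 0 then true else false</code> is the value <code class="highlighter-rouge">If(IsZero(Zero), True, False)</code>.</p>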
<h4 id="mathematical-representation-2">Mathematical representation 2</h4>
<p>We can also write this somewhat more graphically:</p>
<script type="math/tex; mode=display">\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\if}{\text{if }}
\newcommand{\then}{\text{ then }}
\newcommand{\else}{\text{ else }}
\newcommand{\ifelse}{\if t_1 \then t_2 \else t_3}
\newcommand{\defeq}{\overset{\text{def}}{=}}
\newenvironment{rcases}
{\left.\begin{aligned}}
{\end{aligned}\right\rbrace}
\text{true } \in \mathcal{T}, \quad
\text{false } \in \mathcal{T}, \quad
0 \in \mathcal{T} \\ \\
\frac{t_1 \in \mathcal{T}}{\text{succ } t_1 \in \mathcal{T}}, \quad
\frac{t_1 \in \mathcal{T}}{\text{pred } t_1 \in \mathcal{T}}, \quad
\frac{t_1 \in \mathcal{T}}{\text{iszero } t_1 \in \mathcal{T}} \\ \\
\frac{t_1 \in \mathcal{T}, \quad t_2 \in \mathcal{T}, \quad t_3 \in \mathcal{T}}{\ifelse \in \mathcal{T}}</script>
<p>This is exactly equivalent to representation 1, but we have just introduced a different notation. Note that “the smallest set closed under…” is often not stated explicitly, but implied.</p>
<h4 id="mathematical-representation-3">Mathematical representation 3</h4>
<p>Alternatively, we can build up our set of terms as an infinite union:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{S}_0 = & & \emptyset \\
\mathcal{S}_{i+1} =
& & \set{\text{true}, \text{ false}, 0} \\
& \cup & \set{\text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \mid t_1 \in \mathcal{S}_i} \\
& \cup & \set{\ifelse \mid t_1, t_2, t_3 \in \mathcal{S}_i}
\end{align} %]]></script>
<p>We can thus build our final set as follows:</p>
<script type="math/tex; mode=display">\mathcal{S} = \bigcup_i{\mathcal{S}_i}</script>
<p>Note that we can “pull out” the definition into a generating function $F$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{S}_0 & = \emptyset \\
\mathcal{S}_{i+1} & = F(\mathcal{S}_i) \\
\mathcal{S} & = \bigcup_i{\mathcal{S}_i} \\
\end{align} %]]></script>
<p>The generating function is thus defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
F_1(U) & = \set{\text{true}} \\
F_2(U) & = \set{\text{false}} \\
F_3(U) & = \set{0} \\
F_4(U) & = \set{\text{succ } t_1 \mid t_1 \in U} \\
F_5(U) & = \set{\text{pred } t_1 \mid t_1 \in U} \\
F_6(U) & = \set{\text{iszero } t_1 \mid t_1 \in U} \\
F_7(U) & = \set{\ifelse \mid t_1, t_2, t_3 \in U} \\
\end{align} \\
F(U) = \bigcup_{i=1}^7{F_i(U)} %]]></script>
<p>Each function takes a set of terms $U$ as input and produces “terms justified by $U$” as output; that is, all terms that have the items of $U$ as subterms.</p>
<p>The set $U$ is said to be <strong>closed under F</strong> or <strong>F-closed</strong> if $F(U) \subseteq U$.</p>
<p>The set of terms $T$ as defined above is the smallest F-closed set. If $O$ is another F-closed set, then $T \subseteq O$.</p>
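<p>Since each rule is finitely branching, we can actually compute the first few sets $\mathcal{S}_i$. A sketch in Scala, assuming a hypothetical ADT with one constructor per production (the names are our own):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// The generating function F: all terms justified by the set u.
def F(u: Set[Term]): Set[Term] =
  Set[Term](True, False, Zero) ++
  u.flatMap(t => Set[Term](Succ(t), Pred(t), IsZero(t))) ++
  (for (t1 <- u; t2 <- u; t3 <- u) yield If(t1, t2, t3))

val s1 = F(Set.empty) // S_1 = {true, false, 0}
val s2 = F(s1)        // S_2: 3 constants + 3·3 unary terms + 3³ conditionals
```

On this tiny prefix we can check that $\mathcal{S}_1 \subseteq \mathcal{S}_2$, i.e. that the sequence of sets is increasing.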
<h4 id="comparison-of-the-representations">Comparison of the representations</h4>
<p>We’ve seen essentially two ways of defining the set (as representation 1 and 2 are equivalent, but with different notation):</p>
<ol>
<li>The smallest set that is closed under certain rules. This is compact and easy to read.</li>
<li>The limit of a series of sets. This gives us an <em>induction principle</em> on which we can prove things on terms by induction.</li>
</ol>
<p>The first one defines the set “from above”, by intersecting F-closed sets.</p>
<p>The second one defines it “from below”, by starting with $\emptyset$ and getting closer and closer to being F-closed.</p>
<p>These are equivalent (we won’t prove it, but Proposition 3.2.6 in TAPL does so), but can serve different uses in practice.</p>
<h3 id="induction-on-terms">Induction on terms</h3>
<p>First, let’s define depth: the <strong>depth</strong> of a term $t$ is the smallest $i$ such that $t\in\mathcal{S_i}$.</p>
<p>The way we defined $\mathcal{S}_i$, it gets larger and larger for increasing $i$; the depth of a term $t$ gives us the step at which $t$ is introduced into the set.</p>
<p>We see that if a term $t$ is in <script type="math/tex">\mathcal{S}_i</script>, then all of its immediate subterms must be in $\mathcal{S}_{i-1}$, meaning that they must have smaller depth.</p>
<p>This justifies the principle of <strong>induction on terms</strong>, or <strong>structural induction</strong>. Let P be a predicate on a term:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>If, for each term s,
given P(r) for all immediate subterms r of s we can show P(s)
then P(t) holds for all t
</code></pre></div></div>
<p>All this says is that if we can prove the induction step from subterms to terms (under the induction hypothesis), then we have proven the induction.</p>
<p>We can also express this structural induction using generating functions, which we <a href="#mathematical-representation-3">introduced previously</a>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose T is the smallest F-closed set.
If, for each set U,
from the assumption "P(u) holds for every u ∈ U",
we can show that "P(v) holds for every v ∈ F(U)"
then
P(t) holds for all t ∈ T
</code></pre></div></div>
<p>Why can we use this?</p>
<ul>
<li>We assumed that $T$ was the smallest F-closed set, which means that $T\subseteq O$ for any other F-closed set $O$.</li>
<li>Showing the pre-condition (“for each set $U$, from the assumption…”) amounts to showing that the set of all terms satisfying $P$ (call it $O$) is itself an F-closed set.</li>
<li>Since $T\subseteq O$, every element of $T$ satisfies $P$.</li>
</ul>
<h3 id="inductive-function-definitions">Inductive function definitions</h3>
<p>An <a href="https://en.wikipedia.org/wiki/Recursive_definition">inductive definition</a> is used to define the elements in a set recursively, as we have done above. The <a href="https://en.wikipedia.org/wiki/Recursion#The_recursion_theorem">recursion theorem</a> states that a well-formed inductive definition defines a function. To understand what being well-formed means, let’s take a look at some examples.</p>
<p>Let’s define our grammar function a little more formally. Constants are the basic values that can’t be expanded further; in our example, they are <code class="highlighter-rouge">true</code>, <code class="highlighter-rouge">false</code>, <code class="highlighter-rouge">0</code>. As such, the set of constants appearing in a term $t$, written $\text{Consts}(t)$, is defined recursively as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{Consts}(\text{true}) & = \set{\text{true}} \\
\text{Consts}(\text{false}) & = \set{\text{false}} \\
\text{Consts}(0) & = \set{0} \\
\text{Consts}(\text{succ } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\text{pred } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\text{iszero } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\ifelse) & = \text{Consts}(t_1) \cup \text{Consts}(t_2) \cup \text{Consts}(t_3) \\
\end{align} %]]></script>
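<p>Being structurally recursive, this definition translates directly into a terminating function. A sketch in Scala, assuming a hypothetical ADT with one constructor per production (the names are our own):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// Consts(t): the set of constants appearing in t.
def consts(t: Term): Set[Term] = t match {
  case True | False | Zero => Set(t)
  case Succ(t1)            => consts(t1)
  case Pred(t1)            => consts(t1)
  case IsZero(t1)          => consts(t1)
  case If(t1, t2, t3)      => consts(t1) ++ consts(t2) ++ consts(t3)
}
```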
<p>This seems simple, but these semantics aren’t perfect. First off, a mathematical definition simply assigns a convenient name to some previously known thing. But here, we’re defining the thing in terms of itself, recursively. And the semantics above also allow us to define ill-formed inductive definitions:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{BadConsts}(\text{true}) & = \set{\text{true}} \\
\text{BadConsts}(\text{false}) & = \set{\text{false}} \\
\text{BadConsts}(0) & = \set{0} \\
\text{BadConsts}(0) & = \set{} = \emptyset \\
\text{BadConsts}(\text{succ } t_1) & = \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{pred } t_1) & = \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{iszero } t_1) & = \text{BadConsts}(\text{iszero iszero }t_1) \\
\end{align} %]]></script>
<p>The last rule is problematic: it defines $\text{BadConsts}$ of a term in terms of a strictly larger term, so the recursion never bottoms out (if we implemented it, we’d expect some kind of stack overflow). We’re also missing the rule for if-statements, and we have a second, conflicting rule for <code class="highlighter-rouge">0</code>, producing the empty set.</p>
<p>How do we tell the difference between a well-formed inductive definition, and an ill-formed one as above? What is well-formedness anyway?</p>
<h4 id="what-is-a-function">What is a function?</h4>
<p>A relation over $T, U$ is a subset of $T \times U$, where the Cartesian product is defined as:</p>
<script type="math/tex; mode=display">T\times U = \set{(t, u) : t\in T, u\in U}</script>
<p>A function $f$ from $A$ (domain) to $B$ (co-domain) can be viewed as a two-place relation, albeit with two additional properties:</p>
<ul>
<li>It is <strong>total</strong>: $\forall a \in A, \exists b \in B : (a, b) \in f$</li>
<li>It is <strong>deterministic</strong>: $(a, b_1) \in f, (a, b_2) \in f \implies b_1 = b_2$</li>
</ul>
<p>Totality ensures that the whole domain $A$ is covered, while determinism means that the function always produces the same result for a given input.</p>
<h4 id="induction-example-1">Induction example 1</h4>
<p>As previously stated, $\text{Consts}$ is a <em>relation</em>. It maps terms ($A$) into the sets of constants that they contain ($B$). The recursion theorem states that it is also a <em>function</em>. The proof is as follows.</p>
<p>$\text{Consts}$ is total and deterministic: for each term $t$ there is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$<sup id="fnref:in-relation-notation"><a href="#fn:in-relation-notation" class="footnote">1</a></sup> . The proof is done by induction on $t$.</p>
<p>To be able to apply the induction principle for terms, we must first show that for an arbitrary term $t$, under the following induction hypothesis:</p>
<blockquote>
<p>For each immediate subterm $s$ of $t$, there is exactly one set of terms $C_s$ such that $(s, C_s) \in \text{Consts}$</p>
</blockquote>
<p>Then the following needs to be proven as an induction step:</p>
<blockquote>
<p>There is <strong>exactly one</strong> set of terms $C$ such that $(t, C) \in \text{Consts}$</p>
</blockquote>
<p>We proceed by cases on $t$:</p>
<ul>
<li>
<p>If $t$ is $0$, $\text{true}$ or $\text{false}$</p>
<p>We can immediately see from the definition of $\text{Consts}$ that there is exactly one set of terms $C = \set{t}$ such that $(t, C) \in \text{Consts}$.</p>
<p>This constitutes our base case.</p>
</li>
<li>
<p>If $t$ is $\text{succ } t_1$, $\text{pred } t_1$ or $\text{iszero } t_1$</p>
<p>The immediate subterm of $t$ is $t_1$, and the induction hypothesis tells us that there is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$. But then it is clear from the definition that there is exactly one set of terms $C = C_1$ such that $(t, C) \in \text{Consts}$.</p>
</li>
<li>
<p>If $t$ is $\ifelse$</p>
<p>The induction hypothesis tells us:</p>
<ul>
<li>There is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$</li>
<li>There is exactly one set of terms $C_2$ such that $(t_2, C_2) \in \text{Consts}$</li>
<li>There is exactly one set of terms $C_3$ such that $(t_3, C_3) \in \text{Consts}$</li>
</ul>
<p>It is clear from the definition of $\text{Consts}$ that there is exactly one set $C = C_1 \cup C_2 \cup C_3$ such that $(t, C) \in \text{Consts}$.</p>
</li>
</ul>
<p>This proves that $\text{Consts}$ is indeed a function.</p>
<p>But what about $\text{BadConsts}$? It is also a relation, but it isn’t a function. For instance, we have $\text{BadConsts}(0) = \set{0}$ and $\text{BadConsts}(0) = \emptyset$, which violates determinism. To reformulate this in terms of the above, there are two sets $C$ such that $(0, C) \in \text{BadConsts}$, namely $C = \set{0}$ and $C = \emptyset$.</p>
<p>Note that there are many other problems with $\text{BadConsts}$, but this is sufficient to prove that it isn’t a function.</p>
<h4 id="induction-example-2">Induction example 2</h4>
<p>Let’s introduce another inductive definition:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{size}(\text{true}) & = 1 \\
\text{size}(\text{false}) & = 1 \\
\text{size}(0) & = 1 \\
\text{size}(\text{succ}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{pred}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{iszero}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\ifelse) & = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3)\\
\end{align} %]]></script>
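<p>Like $\text{Consts}$, this definition is structurally recursive and translates directly into code. A Scala sketch over a hypothetical ADT with one constructor per production (names are our own):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// size(t), mirroring the equations above.
def size(t: Term): Int = t match {
  case True | False | Zero => 1
  case Succ(t1)            => size(t1) + 1
  case Pred(t1)            => size(t1) + 1
  case IsZero(t1)          => size(t1) + 1
  case If(t1, t2, t3)      => size(t1) + size(t2) + size(t3)
}

// size(if (iszero 0) then (succ 0) else 0) = 2 + 2 + 1 = 5
```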
<p>We’d like to prove that the number of distinct constants in a term is at most the size of the term. In other words, that $\abs{\text{Consts}(t)} \le \text{size}(t)$</p>
<p>The proof is by induction on $t$:</p>
<ul>
<li>
<p>$t$ is a constant; $t=\text{true}$, $t=\text{false}$ or $t=0$</p>
<p>The proof is immediate. For constants, the number of constants and the size are both one: $\abs{\text{Consts}(t)} = \abs{\set{t}} = 1 = \text{size}(t)$</p>
</li>
<li>
<p>$t$ is a function; $t = \text{succ}\ t_1$, $t = \text{pred}\ t_1$ or $t = \text{iszero}\ t_1$</p>
<p>By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$.</p>
<p>We can then prove the proposition as follows: $\abs{\text{Consts}(t)} = \abs{\text{Consts}(t_1)} \overset{\text{IH}}{\le} \text{size}(t_1) = \text{size}(t) - 1 < \text{size}(t)$</p>
</li>
<li>
<p>$t$ is an if-statement: $t = \ifelse$</p>
<p>By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$, $\abs{\text{Consts}(t_2)} \le \text{size}(t_2)$ and $\abs{\text{Consts}(t_3)} \le \text{size}(t_3)$.</p>
<p>We can then prove the proposition as follows:</p>
</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\abs{\text{Consts}(t)}
& = \abs{\text{Consts}(t_1)\cup\text{Consts}(t_2)\cup\text{Consts}(t_3)} \\
& \le \abs{\text{Consts}(t_1)}+\abs{\text{Consts}(t_2)}+\abs{\text{Consts}(t_3)} \\
& \overset{\text{IH}}{\le} \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) \\
& = \text{size}(t)
\end{align} %]]></script>
<h3 id="operational-semantics-and-reasoning">Operational semantics and reasoning</h3>
<h4 id="evaluation">Evaluation</h4>
<p>Suppose we have the following syntax</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre>t ::= // terms
true // constant true
false // constant false
if t then t else t // conditional</pre></td></tr></tbody></table></code></pre></figure>
<p>The evaluation relation $t \longrightarrow t’$ is the smallest relation closed under the following rules.</p>
<p>The following are <em>computation rules</em>, defining the “real” computation steps:</p>
<script type="math/tex; mode=display">\begin{align}
\text{if true then } t_2 \else t_3 \longrightarrow t_2
\tag{E-IfTrue}
\label{eq:e-iftrue} \\
\text{if false then } t_2 \else t_3 \longrightarrow t_3
\tag{E-IfFalse}
\label{eq:e-iffalse} \\
\end{align}</script>
<p>The following is a <em>congruence rule</em>, defining where the computation rule is applied next:</p>
<script type="math/tex; mode=display">\frac{t_1 \longrightarrow t_1'}
{\ifelse \longrightarrow \if t_1' \then t_2 \else t_3}
\tag{E-If}
\label{eq:e-if}</script>
<p>We evaluate the condition before the branches: until the condition is known, we don’t know which branch’s value is needed, so evaluating a branch up front could be wasted work.</p>
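<p>These three rules translate into a one-step evaluator: a partial function that returns the next term, or nothing when no rule applies. A sketch in Scala (our own illustration, not a reference implementation):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// One small step: Some(t') if t ⟶ t', None if t is a normal form.
def step(t: Term): Option[Term] = t match {
  case If(True, t2, _)  => Some(t2)                              // E-IfTrue
  case If(False, _, t3) => Some(t3)                              // E-IfFalse
  case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))  // E-If
  case _                => None                                  // true/false don't step
}
```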
<h4 id="derivations">Derivations</h4>
<p>We can describe the evaluation logically from the above rules using derivation trees. Suppose we want to evaluate the following (with parentheses added for clarity): <code class="highlighter-rouge">if (if true then true else false) then false else true</code>.</p>
<p>In an attempt to make all this fit onto the screen, <code class="highlighter-rouge">true</code> and <code class="highlighter-rouge">false</code> have been abbreviated <code class="highlighter-rouge">T</code> and <code class="highlighter-rouge">F</code> in the derivation below, and the <code class="highlighter-rouge">then</code> keyword has been replaced with a parenthesis notation for the condition.</p>
<script type="math/tex; mode=display">\frac{
\frac{
\if (T)\ T \else F
\longrightarrow
T
\quad (\ref{eq:e-iftrue})
}{
\if (\if (T)\ T \else F) \ F \else T
\longrightarrow
\if (T) \ F \else T
\quad (\ref{eq:e-if})
}
\qquad
\small{
\if (T) \ F \else T
\longrightarrow
F
\quad (\ref{eq:e-iftrue})
}
}{
\if (\if (T) \ T \else F) \ F \else T
\longrightarrow
T
}</script>
<p>The final statement is a <strong>conclusion</strong>. We say that the derivation is a <strong>witness</strong> for its conclusion (or a <strong>proof</strong> for its conclusion). The derivation records all reasoning steps that lead us to the conclusion.</p>
<h4 id="inversion-lemma">Inversion lemma</h4>
<p>We can introduce the <strong>inversion lemma</strong>, which tells us how we got to a term.</p>
<p>Suppose we are given a derivation $\mathcal{D}$ witnessing the pair $(t, t’)$ in the evaluation relation. Then either:</p>
<ol>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if \text{true} \then t_2 \else t_3$ and $t’=t_2$ for some $t_2$ and $t_3$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if \text{false} \then t_2 \else t_3$ and $t’=t_3$ for some $t_2$ and $t_3$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \if t_1 \then t_2 \else t_3$ and $t’ = \if t_1’ \then t_2 \else t_3$, for some $t_1, t_1’, t_2, t_3$. Moreover, the immediate subderivation of $\mathcal{D}$ witnesses $(t_1, t_1’) \in \longrightarrow$.</li>
</ol>
<p>This is super boring, but we do need to acknowledge the inversion lemma before we can do induction proofs on derivations. Thanks to the inversion lemma, given an arbitrary derivation $\mathcal{D}$ with conclusion $t \longrightarrow t’$, we can proceed with a case-by-case analysis on the final rule used in the derivation tree.</p>
<p>Let’s recall our <a href="#induction-example-2">definition of the size function</a>. In particular, we’ll need the rule for if-statements:</p>
<script type="math/tex; mode=display">\text{size}(\ifelse) = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3)</script>
<p>We want to prove that if $t \longrightarrow t’$, then $\text{size}(t) > \text{size}(t’)$.</p>
<ol>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if \text{true} \then t_2 \else t_3$ and $t’=t_2$, and the result is immediate from the definition of $\text{size}$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if \text{false} \then t_2 \else t_3$ and $t’=t_3$, and the result is immediate from the definition of $\text{size}$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \ifelse$ and $t’ = \if t_1’ \then t_2 \else t_3$. In this case, $t_1 \longrightarrow t_1’$ is witnessed by a derivation $\mathcal{D}_1$. By the induction hypothesis, $\text{size}(t_1) > \text{size}(t_1’)$, and the result is then immediate from the definition of $\text{size}$</li>
</ol>
<h3 id="abstract-machines">Abstract machines</h3>
<p>An abstract machine consists of:</p>
<ul>
<li>A set of <strong>states</strong></li>
<li>A <strong>transition</strong> relation of states, written $\longrightarrow$</li>
</ul>
<p>$t \longrightarrow t’$ means that $t$ evaluates to $t’$ in one step. Note that $\longrightarrow$ is a relation, and that $t \longrightarrow t’$ is shorthand for $(t, t’) \in \longrightarrow$. Often, this relation is a partial function: it doesn’t necessarily cover the whole domain, and there is at most one possible next state. But in general, there may be many possible next states; determinism isn’t a requirement here.</p>
<h3 id="normal-forms">Normal forms</h3>
<p>A normal form is a term that cannot be evaluated any further. More formally, a term $t$ is a normal form if there is no $t’$ such that $t \longrightarrow t’$. A normal form is a state where the abstract machine is halted; we can regard it as the result of a computation.</p>
<h4 id="values-that-are-normal-form">Values that are normal form</h4>
<p>Previously, we intended for our values (true and false) to be exactly that, the result of a computation. Did we get that right?</p>
<p>Let’s prove that a term $t$ is a value $\iff$ it is in normal form.</p>
<ul>
<li>The $\implies$ direction is immediate from the definition of the evaluation relation $\longrightarrow$.</li>
<li>
<p>The $\impliedby$ direction is more conveniently proven as its contrapositive: if $t$ is not a value, then it is not a normal form, which we can prove by induction on the term $t$.</p>
<p>Since $t$ is not a value, it must be of the form $\ifelse$. If $t_1$ is directly <code class="highlighter-rouge">true</code> or <code class="highlighter-rouge">false</code>, then $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ apply, and we are done.</p>
<p>Otherwise, if $t = \ifelse$ where $t_1$ isn’t a value, by the induction hypothesis, there is a $t_1’$ such that $t_1 \longrightarrow t_1’$. Then rule $\ref{eq:e-if}$ yields $\if t_1’ \then t_2 \else t_3$, which proves that $t$ is not in normal form.</p>
</li>
</ul>
<h4 id="values-that-are-not-normal-form">Values that are not normal form</h4>
<p>Let’s introduce new syntactic forms, with new evaluation rules.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>t ::= // terms
0 // constant 0
succ t // successor
pred t // predecessor
iszero t // zero test
v ::= nv // values
nv ::= // numeric values
0 // zero value
succ nv // successor value</pre></td></tr></tbody></table></code></pre></figure>
<p>The evaluation rules are given as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 \longrightarrow t_1'}{\text{succ } t_1 \longrightarrow \text{succ } t_1'}
\tag{E-Succ} \label{eq:e-succ}
\\ \\
& \text{pred } 0 \longrightarrow 0
\tag{E-PredZero} \label{eq:e-predzero}
\\ \\
& \text{pred succ } nv_1 \longrightarrow nv_1
\tag{E-PredSucc} \label{eq:e-predsucc}
\\ \\
& \frac{t_1 \longrightarrow t_1'}{\text{pred } t_1 \longrightarrow \text{pred } t_1'}
\tag{E-Pred} \label{eq:e-pred}
\\ \\
& \text{iszero } 0 \longrightarrow \text{true}
\tag{E-IszeroZero} \label{eq:e-iszerozero}
\\ \\
& \text{iszero succ } nv_1 \longrightarrow \text{false}
\tag{E-IszeroSucc} \label{eq:e-iszerosucc}
\\ \\
& \frac{t_1 \longrightarrow t_1'}{\text{iszero } t_1 \longrightarrow \text{iszero } t_1'}
\tag{E-Iszero} \label{eq:e-iszero} \\
\end{align} %]]></script>
<p>All values are still normal forms. But are all normal forms values? Not in this case. For instance, <code class="highlighter-rouge">succ true</code>, <code class="highlighter-rouge">iszero true</code>, etc, are normal forms. These are <strong>stuck terms</strong>: they are in normal form, but are not values. In general, these correspond to some kind of type error, and one of the main purposes of a type system is to rule these kinds of situations out.</p>
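<p>Extending a one-step evaluator with these rules makes stuck terms concrete: they are inputs on which no rule fires, even though the term isn’t a value. A sketch in Scala (our own illustration, over a hypothetical ADT):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term

def isNumericVal(t: Term): Boolean = t match {
  case Zero     => true
  case Succ(nv) => isNumericVal(nv)
  case _        => false
}

def step(t: Term): Option[Term] = t match {
  case Pred(Zero)                           => Some(Zero)              // E-PredZero
  case Pred(Succ(nv)) if isNumericVal(nv)   => Some(nv)                // E-PredSucc
  case IsZero(Zero)                         => Some(True)              // E-IszeroZero
  case IsZero(Succ(nv)) if isNumericVal(nv) => Some(False)             // E-IszeroSucc
  case Succ(t1)                             => step(t1).map(Succ(_))   // E-Succ
  case Pred(t1)                             => step(t1).map(Pred(_))   // E-Pred
  case IsZero(t1)                           => step(t1).map(IsZero(_)) // E-Iszero
  case _                                    => None
}

// `succ true` is stuck: step returns None, yet it is not a value.
```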
<h3 id="multi-step-evaluation">Multi-step evaluation</h3>
<p>Let’s introduce the <em>multi-step evaluation</em> relation, $\longrightarrow^*$. It is the reflexive, transitive closure of single-step evaluation, i.e. the smallest relation closed under these rules:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t\longrightarrow t'}{t \longrightarrow^* t'} \\ \\
t \longrightarrow^* t \\ \\
\frac{t \longrightarrow^* t' \qquad t' \longrightarrow^* t''}{t \longrightarrow^* t''}
\end{align}</script>
<p>In other words, it corresponds to any number of single consecutive evaluations.</p>
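<p>Operationally, multi-step evaluation is just iterating the single-step relation until no rule applies. A sketch in Scala on the boolean fragment (the single-step function is re-sketched inline; names are our own):</p>

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

def step(t: Term): Option[Term] = t match {
  case If(True, t2, _)  => Some(t2)                              // E-IfTrue
  case If(False, _, t3) => Some(t3)                              // E-IfFalse
  case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))  // E-If
  case _                => None
}

// t ⟶* t': take single steps until a normal form is reached.
def eval(t: Term): Term = step(t) match {
  case Some(next) => eval(next)
  case None       => t
}
```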
<h3 id="termination-of-evaluation">Termination of evaluation</h3>
<p>We’ll prove that evaluation terminates, i.e. that for every term $t$ there is some normal form $t’$ such that $t\longrightarrow^* t’$.</p>
<p>First, let’s <a href="#induction-example-2">recall our proof</a> that $t\longrightarrow t’ \implies \text{size}(t) > \text{size}(t’)$. Now, for our proof by contradiction, assume that we have an infinite-length sequence $t_0, t_1, t_2, \dots$ such that:</p>
<script type="math/tex; mode=display">t_0 \longrightarrow t_1 \longrightarrow t_2 \longrightarrow \dots
\quad \implies \quad
\text{size}(t_0) > \text{size}(t_1) > \text{size}(t_2) > \dots</script>
<p>But this sequence cannot exist: since $\text{size}(t_0)$ is a finite, natural number, we cannot construct this infinite descending chain from it. This is a contradiction.</p>
<p>Most termination proofs have the same basic form. We want to prove that the relation $R\subseteq X \times X$ is terminating — that is, there are no infinite sequences $x_0, x_1, x_2, \dots$ such that $(x_i, x_{i+1}) \in R$ for each $i$. We proceed as follows:</p>
<ol>
<li>Choose a well-founded set $W$ with partial order $<$, i.e. one in which there are no infinite descending chains $w_0 > w_1 > w_2 > \dots$. Also choose a function $f: X \rightarrow W$.</li>
<li>Show $f(x) > f(y) \quad \forall (x, y) \in R$</li>
<li>Conclude that there are no infinite sequences $(x_0, x_1, x_2, \dots)$ such that $(x_i, x_{i+1}) \in R$ for each $i$. If there were, we could construct an infinite descending chain in $W$.</li>
</ol>
<p>As a side-note, <strong>partial order</strong> is defined as the following properties:</p>
<ol>
<li><strong>Anti-symmetry</strong>: $\neg(x < y \land y < x)$</li>
<li><strong>Transitivity</strong>: $x<y \land y<z \implies x < z$</li>
</ol>
<p>We can add a third property to achieve <strong>total order</strong>, namely $x \ne y \implies x <y \lor y<x$.</p>
<h2 id="lambda-calculus">Lambda calculus</h2>
<p>Lambda calculus is Turing complete, and is higher-order (functions are data). In lambda calculus, all computation happens by means of function abstraction and application.</p>
<p>In terms of computational power, lambda calculus is equivalent to Turing machines.</p>
<p>Suppose we wanted to write a function <code class="highlighter-rouge">plus3</code> in our previous language:</p>
<figure class="highlight"><pre><code class="language-linenos" data-lang="linenos">plus3 x = succ succ succ x</code></pre></figure>
<p>The way we write this in lambda calculus is:</p>
<script type="math/tex; mode=display">\text{plus3 } = \lambda x. \text{ succ}(\text{succ}(\text{succ}(x)))</script>
<p>$\lambda x. t$ is written <code class="highlighter-rouge">x => t</code> in Scala, or <code class="highlighter-rouge">fun x -> t</code> in OCaml. Application of our function, say <code class="highlighter-rouge">plus3(succ 0)</code>, can be written as:</p>
<script type="math/tex; mode=display">(\lambda x. \text{succ succ succ } x)(\text{succ } 0)</script>
<p>Abstraction over functions is possible using higher-order functions, which we call $\lambda$-abstractions. An example of such an abstraction is the function $g$ below, which takes an argument $f$ and uses it in the function position.</p>
<script type="math/tex; mode=display">g = \lambda f. f(f(\text{succ } 0))</script>
<p>If we apply $g$ to an argument like $\text{plus3}$, we can just use the substitution rule to see how that defines a new function.</p>
<p>Another example: the double function below takes two arguments, as a curried function would. First, it takes the function to apply twice, then the argument on which to apply it, and then returns $f(f(y))$.</p>
<script type="math/tex; mode=display">\text{double} = \lambda f. \lambda y. f(f(y))</script>
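<p>These abstractions can be mirrored in Scala, using <code class="highlighter-rouge">Int</code> and an explicit <code class="highlighter-rouge">succ</code> as stand-ins for the numeric primitives (an assumption of ours to make the sketch runnable):</p>

```scala
val succ: Int => Int = _ + 1                 // stand-in for the succ primitive
val plus3: Int => Int = x => succ(succ(succ(x)))

// g applies its argument f in function position.
val g: (Int => Int) => Int = f => f(f(succ(0)))

// double, curried: first the function, then the argument.
val double: (Int => Int) => Int => Int = f => y => f(f(y))
```

For instance, $g \text{ plus3}$ reduces to $\text{plus3}(\text{plus3}(\text{succ } 0)) = 7$, and $\text{double plus3 } 0 = 6$.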
<h3 id="pure-lambda-calculus">Pure lambda calculus</h3>
<p>Once we have $\lambda$-abstractions, we can actually throw out all other language primitives like booleans and other values; all of these can be expressed as functions, as we’ll see below. In pure lambda-calculus, <em>everything</em> is a function.</p>
<p>Variables will always denote a function, functions always take other functions as parameters, and the result of an evaluation is always a function.</p>
<p>The syntax of lambda-calculus is very simple:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre>t ::= // terms, also called λ-terms
x // variable
λx. t // abstraction, also called λ-abstractions
t t // application</pre></td></tr></tbody></table></code></pre></figure>
<p>A few rules and syntactic conventions:</p>
<ul>
<li>Application associates to the left, so $t\ u\ v$ means $(t\ u)\ v$, not $t\ (u\ v)$.</li>
<li>Bodies of lambda abstractions extend as far to the right as possible, so $\lambda x. \lambda y.\ x\ y$ means $\lambda x.\ (\lambda y. x\ y)$, not $\lambda x.\ (\lambda y.\ x)\ y$</li>
</ul>
<h4 id="scope">Scope</h4>
<p>The lambda expression $\lambda x.\ t$ <strong>binds</strong> the variable $x$, with a <strong>scope</strong> limited to $t$. Occurrences of $x$ inside of $t$ are said to be <em>bound</em>, while occurrences outside are said to be <em>free</em>.</p>
<p>Let $\text{fv}(t)$ be the set of free variables in a term $t$. It’s defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{fv}(x) & = \set{x} \\
\text{fv}(\lambda x.\ t_1) & = \text{fv}(t_1) \setminus \set{x} \\
\text{fv}(t_1 \ t_2) & = \text{fv}(t_1)\cup\text{fv}(t_2) \\
\end{align} %]]></script>
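<p>The definition of $\text{fv}$ again translates into a structurally recursive function. A sketch in Scala over a minimal λ-term ADT (the constructor names are our own):</p>

```scala
sealed trait Term
case class Var(name: String) extends Term               // x
case class Abs(param: String, body: Term) extends Term  // λx. t
case class App(t1: Term, t2: Term) extends Term         // t t

def fv(t: Term): Set[String] = t match {
  case Var(x)       => Set(x)
  case Abs(x, body) => fv(body) - x      // x is bound inside body
  case App(t1, t2)  => fv(t1) ++ fv(t2)
}

// λx. (λy. x) y — the outer y is free, x is bound.
val t = Abs("x", App(Abs("y", Var("x")), Var("y")))
```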
<h4 id="operational-semantics">Operational semantics</h4>
<p>As we saw with our previous language, the rules could be distinguished into <em>computation</em> and <em>congruence</em> rules. For lambda calculus, the only computation rule is:</p>
<script type="math/tex; mode=display">(\lambda x. t_{12})\ v_2 \longrightarrow \left[ x \mapsto v_2 \right] t_{12}
\tag{E-AppAbs}\label{eq:e-appabs}</script>
<p>The notation $\left[ x \mapsto v_2 \right] t_{12}$ means “the term that results from substituting free occurrences of $x$ in $t_{12}$ with $v_2$”.</p>
<p>The congruence rules are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 \longrightarrow t_1'}{t_1\ t_2 \longrightarrow t_1'\ t_2} \tag{E-App1}\label{eq:e-app1} \\ \\
& \frac{t_2 \longrightarrow t_2'}{t_1\ t_2 \longrightarrow t_1\ t_2'} \tag{E-App2}\label{eq:e-app2} \\
\end{align} %]]></script>
<p>A lambda-expression applied to a value, $(\lambda x.\ t)\ v$, is called a <strong>reducible expression</strong>, or <strong>redex</strong>.</p>
<h4 id="evaluation-strategies">Evaluation strategies</h4>
<p>There are alternative evaluation strategies. In the above, we have chosen call by value (which is the standard in most mainstream languages), but we could also choose:</p>
<ul>
<li><strong>Full beta-reduction</strong>: any redex may be reduced at any time. This imposes no restrictions, but in practice, implementations go with a fixed strategy like the ones below, because a deterministic strategy is easier to implement than non-deterministic reduction.</li>
<li><strong>Normal order</strong>: the leftmost, outermost redex is always reduced first. This strategy allows reduction inside unapplied lambda terms</li>
<li><strong>Call-by-name</strong>: allows no reductions inside lambda abstractions. Arguments are not reduced before being substituted in the body of lambda terms when applied. Haskell uses an optimized version of this, call-by-need (aka lazy evaluation).</li>
</ul>
<h3 id="classical-lambda-calculus">Classical lambda calculus</h3>
<p>Classical lambda calculus allows for full beta reduction.</p>
<h4 id="confluence-in-full-beta-reduction">Confluence in full beta reduction</h4>
<p>The congruence rules allow us to apply in different ways; we can choose between $\ref{eq:e-app1}$ and $\ref{eq:e-app2}$ every time we reduce an application, and this offers many possible reduction paths.</p>
<p>While the path is non-deterministic, is the result also non-deterministic? This question took a very long time to answer, but after some 25 years, it was proven that the result is always the same. This is known as the <strong>Church-Rosser confluence theorem</strong>:</p>
<p>Let $t, t_1, t_2$ be terms such that $t \longrightarrow^* t_1$ and $t \longrightarrow^* t_2$. Then there exists a term $t_3$ such that $t_1 \longrightarrow^* t_3$ and $t_2 \longrightarrow^* t_3$</p>
<h4 id="alpha-conversion">Alpha conversion</h4>
<p>Substitution is actually trickier than it looks! For instance, in the expression $\lambda x.\ (\lambda y.\ x)\ y$, the first occurrence of $y$ is bound (it refers to a parameter), while the second is free (it does not refer to a parameter). This is comparable to scope in most programming languages, where we should understand that these are two different variables in different scopes, $y_1$ and $y_2$.</p>
<p>The above example had a variable name that occurs both bound and free, which is something that we’ll try to avoid. Requiring bound and free variable names to be distinct is called a hygiene condition.</p>
<p>We can transform an unhygienic expression into a hygienic one by renaming bound variables before performing the substitution. This is known as <strong>alpha conversion</strong>. Alpha conversion is given by the following conversion rule:</p>
<script type="math/tex; mode=display">\frac{y \notin \text{fv}(t)}{(\lambda x.\ t) =_\alpha (\lambda y.\ \left[ x\mapsto y\right]\ t)}
\tag{$\alpha$}
\label{eq:alpha-conv}</script>
<p>And these equivalence rules, which make $=_\alpha$ symmetric and transitive (which, together with reflexivity, are the defining properties of an equivalence relation):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 =_\alpha t_2}{t_2 =_\alpha t_1}
\tag{$\alpha \text{-Symm}$}
\label{eq:alpha-sym}
\\ \\
& \frac{t_1 =_\alpha t_2 \quad t_2 =_\alpha t_3}{t_1 =_\alpha t_3}
\tag{$\alpha \text{-Trans}$}
\label{eq:alpha-trans}
\\
\end{align} %]]></script>
<p>The congruence rules are as usual.</p>
<h3 id="programming-in-lambda-calculus">Programming in lambda-calculus</h3>
<h4 id="multiple-arguments">Multiple arguments</h4>
<p>The way to handle multiple arguments is by currying: $\lambda x.\ \lambda y.\ t$</p>
<h4 id="booleans">Booleans</h4>
<p>The fundamental, universal operator on booleans is if-then-else, which is what we’ll replicate to model booleans. We’ll denote our booleans as $\text{tru}$ and $\text{fls}$ to be able to distinguish these pure lambda-calculus abstractions from the true and false values of our previous toy language.</p>
<p>We want <code class="highlighter-rouge">true</code> to be equivalent to <code class="highlighter-rouge">if (true)</code>, and <code class="highlighter-rouge">false</code> to <code class="highlighter-rouge">if (false)</code>. The terms $\text{tru}$ and $\text{fls}$ <em>represent</em> boolean values, in that we can use them to test the truth of a boolean value:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{tru } & = \lambda t.\ \lambda f.\ t \\
\text{fls } & = \lambda t.\ \lambda f.\ f \\
\end{align} %]]></script>
<p>We can consider these as booleans. Equivalently <code class="highlighter-rouge">tru</code> can be considered as a function performing <code class="highlighter-rouge">(t1, t2) => if (true) t1 else t2</code>. To understand this, let’s try to apply $\text{tru}$ to two arguments:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& && \text{tru } v\ w \\
& = && (\lambda t.\ (\lambda f.\ t))\ v\ w \\
& \longrightarrow && (\lambda f.\ v)\ w \\
& \longrightarrow && v \\
\end{align} %]]></script>
<p>This works equivalently for <code class="highlighter-rouge">fls</code>.</p>
<p>We can also do inversion, conjunction and disjunction with lambda calculus, which can be read as particular if-else statements:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{not } & = \lambda b.\ b\ \text{fls}\ \text{tru} \\
\text{and } & = \lambda b.\ \lambda c.\ b\ c\ \text{fls} \\
\text{or } & = \lambda b.\ \lambda c.\ b\ \text{tru}\ c \\
\end{align} %]]></script>
<ul>
<li><code class="highlighter-rouge">not</code> is a function that is equivalent to <code class="highlighter-rouge">not(b) = if (b) false else true</code>.</li>
<li><code class="highlighter-rouge">and</code> is equivalent to <code class="highlighter-rouge">and(b, c) = if (b) c else false</code></li>
<li><code class="highlighter-rouge">or</code> is equivalent to <code class="highlighter-rouge">or(b, c) = if (b) true else c</code></li>
</ul>
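<p>These encodings can be checked directly in any language with first-class functions. Here is a Python sketch; the names <code class="highlighter-rouge">not_</code>, <code class="highlighter-rouge">and_</code>, <code class="highlighter-rouge">or_</code> and the <code class="highlighter-rouge">to_bool</code> decoder are our own (chosen to avoid Python keywords), not part of the calculus:</p>

```python
# Church booleans as plain Python lambdas, mirroring tru/fls above.
tru = lambda t: lambda f: t
fls = lambda t: lambda f: f

not_ = lambda b: b(fls)(tru)          # not(b)    = if (b) false else true
and_ = lambda b: lambda c: b(c)(fls)  # and(b, c) = if (b) c else false
or_  = lambda b: lambda c: b(tru)(c)  # or(b, c)  = if (b) true else c

# Decode a Church boolean into a native bool for inspection (our helper).
to_bool = lambda b: b(True)(False)

print(to_bool(not_(tru)))        # False
print(to_bool(and_(tru)(fls)))   # False
print(to_bool(or_(fls)(tru)))    # True
```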
<h4 id="pairs">Pairs</h4>
<p>The fundamental operations are construction <code class="highlighter-rouge">pair(a, b)</code>, and selection <code class="highlighter-rouge">pair._1</code> and <code class="highlighter-rouge">pair._2</code>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{pair } & = \lambda f.\ \lambda s.\ \lambda b.\ b\ f\ s\\
\text{fst } & = \lambda p.\ p\ \text{tru} \\
\text{snd } & = \lambda p.\ p\ \text{fls} \\
\end{align} %]]></script>
<ul>
<li><code class="highlighter-rouge">pair</code> is equivalent to <code class="highlighter-rouge">pair(f, s) = (b => b f s)</code></li>
<li>When <code class="highlighter-rouge">tru</code> is applied to <code class="highlighter-rouge">pair</code>, it selects the first element, by definition of the boolean, and that is therefore the definition of <code class="highlighter-rouge">fst</code></li>
<li>Equivalently for <code class="highlighter-rouge">fls</code> applied to <code class="highlighter-rouge">pair</code>, it selects the second element</li>
</ul>
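<p>The pair encoding also transcribes directly to Python; the test values below are ours, for inspection only:</p>

```python
# Church pairs: a pair stores f and s, and hands them to a Church boolean b,
# which selects one of the two.
tru = lambda t: lambda f: t
fls = lambda t: lambda f: f

pair = lambda f: lambda s: lambda b: b(f)(s)
fst = lambda p: p(tru)
snd = lambda p: p(fls)

p = pair(1)(2)
print(fst(p))  # 1
print(snd(p))  # 2
```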
<h4 id="numbers">Numbers</h4>
<p>We’ve actually been representing numbers as lambda-calculus numbers all along! Our <code class="highlighter-rouge">succ</code> function represents what’s more formally called <strong>Church numerals</strong>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
c_0 & = \lambda s.\ \lambda z.\ z \\
c_1 & = \lambda s.\ \lambda z.\ s\ z \\
c_2 & = \lambda s.\ \lambda z.\ s\ (s\ z) \\
c_3 & = \lambda s.\ \lambda z.\ s\ (s\ (s\ z)) \\
\end{align} %]]></script>
<p>Note that $c_0$’s implementation is the same as that of $\text{fls}$ (just with renamed variables).</p>
<p>Every number $n$ is represented by a term $c_n$ taking two arguments, which are $s$ and $z$ (for “successor” and “zero”), and applies $s$ to $z$, $n$ times. Fundamentally, a number is equivalent to the following:</p>
<script type="math/tex; mode=display">c_n = \lambda s.\ \lambda z.\ \underbrace{s\ (s\ (\dots\ (s}_{n \text{ times}}\ z) \dots ))</script>
<p>With this in mind, let us implement some functions on numbers.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{scc } & = \lambda n.\ \lambda s.\ \lambda z.\ s\ (n\ s\ z) \\
\text{add } & = \lambda m.\ \lambda n.\ \lambda s.\ \lambda z.\ m\ s\ (n\ s\ z) \\
\text{mul } & = \lambda m.\ \lambda n.\ m\ (\text{add } n)\ c_0 \\
\text{sub } & = \lambda m.\ \lambda n.\ n\ \text{pred}\ m \\
\text{iszero } & = \lambda m.\ m\ (\lambda x.\ \text{fls})\ \text{tru}
\end{align} %]]></script>
<ul>
<li><strong>Successor</strong> $\text{scc}$: we apply the successor function $s$ one more time to $n$ (which has been correctly instantiated with $s$ and $z$)</li>
<li><strong>Addition</strong> $\text{add}$: we pass the instantiated $n$ as the zero of $m$</li>
<li><strong>Multiplication</strong> $\text{mul}$: instead of the successor function, we pass the addition-by-$n$ function, starting from $c_0$.</li>
<li><strong>Subtraction</strong> $\text{sub}$: we apply $\text{pred}$ $n$ times to $m$</li>
<li><strong>Zero test</strong> $\text{iszero}$: zero has the same implementation as false, so we can lean on that to build an iszero function. An alternative understanding is that we’re building a number, in which we use true for the zero value $z$. If we have to apply the successor function $s$ once or more, we want to get false, so for the successor function we use a function ignoring its input and returning false if applied.</li>
</ul>
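<p>A Python sketch of these operations; the <code class="highlighter-rouge">to_int</code> decoder (instantiating $s$ with <code class="highlighter-rouge">+1</code> and $z$ with <code class="highlighter-rouge">0</code>) is our own helper for reading results back:</p>

```python
# Church numerals: c_n applies s to z, n times.
c0 = lambda s: lambda z: z
c1 = lambda s: lambda z: s(z)
c2 = lambda s: lambda z: s(s(z))
c3 = lambda s: lambda z: s(s(s(z)))

scc = lambda n: lambda s: lambda z: s(n(s)(z))
add = lambda m: lambda n: lambda s: lambda z: m(s)(n(s)(z))
mul = lambda m: lambda n: m(add(n))(c0)

tru = lambda t: lambda f: t
fls = lambda t: lambda f: f
iszero = lambda m: m(lambda x: fls)(tru)

# Decode a numeral by instantiating s with +1 and z with 0 (our helper).
to_int = lambda n: n(lambda x: x + 1)(0)

print(to_int(scc(c2)))      # 3
print(to_int(add(c2)(c3)))  # 5
print(to_int(mul(c2)(c3)))  # 6
```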
<p>What about predecessor? This is a little harder, and it’ll take a few steps to get there. The main idea is that we find the predecessor by rebuilding the whole succession up until our number. At every step, we must generate the number and its predecessor: zero is $(c_0, c_0)$, and all other numbers are $(c_{n-1}, c_n)$. Once we’ve reconstructed this pair, we can get the predecessor by taking the first element of the pair.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{zz} & = \text{pair } c_0 \ c_0 \\
\text{ss} & = \lambda p.\ \text{pair } (\text{snd } p)\ (\text{scc } (\text{snd } p)) \\
\text{prd} & = \lambda m.\ \text{fst } (m\ \text{ss zz}) \\
\end{align} %]]></script>
<details><summary><p>Sidenote</p>
</summary><div class="details-content">
<p>The story goes that Church was stumped by predecessors for a long time. This solution finally came to him while he was at the barber, and he jumped out half shaven to write it down.</p>
</div></details>
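<p>The predecessor construction can also be checked in Python, reusing the pair encoding from before; <code class="highlighter-rouge">to_int</code> is again our own decoding helper:</p>

```python
# Predecessor via pairs: rebuild (c_{n-1}, c_n) step by step from (c0, c0),
# then take the first component.
tru = lambda t: lambda f: t
fls = lambda t: lambda f: f
pair = lambda f: lambda s: lambda b: b(f)(s)
fst = lambda p: p(tru)
snd = lambda p: p(fls)

c0 = lambda s: lambda z: z
scc = lambda n: lambda s: lambda z: s(n(s)(z))

zz = pair(c0)(c0)
ss = lambda p: pair(snd(p))(scc(snd(p)))
prd = lambda m: fst(m(ss)(zz))

to_int = lambda n: n(lambda x: x + 1)(0)
c3 = scc(scc(scc(c0)))
print(to_int(prd(c3)))  # 2
print(to_int(prd(c0)))  # 0 (the predecessor of zero is zero in this encoding)
```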
<h4 id="lists">Lists</h4>
<p>Now what about lists? A list follows the same pattern as booleans and pairs: it is a function taking a handler $f$ for the non-empty case and a handler $g$ for the empty case:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{nil} & = \lambda f.\ \lambda g.\ g \\
\text{cons} & = \lambda x.\ \lambda xs.\ (\lambda f.\ \lambda g.\ f\ x\ xs) \\
\text{head} & = \lambda xs.\ xs\ (\lambda y.\ \lambda ys.\ y)\ \text{nil} \\
\text{isEmpty} & = \lambda xs.\ xs\ (\lambda y.\ \lambda ys.\ \text{fls})\ \text{tru} \\
\end{align} %]]></script>
<p>The second argument in $\text{head}$ and $\text{isEmpty}$ is the result for the empty list; taking the head of the empty list has no meaningful answer, so we arbitrarily return $\text{nil}$ in that case.</p>
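<p>Sketching the list encoding in Python: a list takes a non-empty handler and an empty-case result, and <code class="highlighter-rouge">cons</code> hands its head and tail to the former. The empty-case defaults below (returning <code class="highlighter-rouge">nil</code> from <code class="highlighter-rouge">head</code>) are an arbitrary choice of ours:</p>

```python
# Church-encoded lists: a list takes a non-empty handler f and an
# empty-case result g.
tru = lambda t: lambda f: t
fls = lambda t: lambda f: f

nil = lambda f: lambda g: g
cons = lambda x: lambda xs: lambda f: lambda g: f(x)(xs)
head = lambda xs: xs(lambda y: lambda ys: y)(nil)
is_empty = lambda xs: xs(lambda y: lambda ys: fls)(tru)

xs = cons(1)(cons(2)(nil))
print(head(xs))                    # 1
print(is_empty(nil)(True)(False))  # True
print(is_empty(xs)(True)(False))   # False
```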
<h3 id="recursion-in-lambda-calculus">Recursion in lambda-calculus</h3>
<p>Let’s start by taking a step back. We talked about normal forms and terms for which we terminate; does lambda calculus always terminate? It’s Turing complete, so it must be able to loop infinitely (otherwise, we’d have solved the halting problem!).</p>
<p>The trick to recursion is self-application:</p>
<script type="math/tex; mode=display">\lambda x.\ x\ x</script>
<p>From a type-level perspective, we would cringe at this. This should not be possible in the typed world, but in the untyped world we can do it. We can construct a simple infinite loop in lambda calculus as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\Omega
& = & (\lambda x.\ x\ x)\ (\lambda x.\ x\ x) \\
& \longrightarrow & \ (\lambda x.\ x\ x)\ (\lambda x.\ x\ x)
\end{align} %]]></script>
<p>The expression evaluates to itself in one step; it never reaches a normal form: it loops infinitely, i.e. it diverges. This is not a stuck term though; evaluation is always possible.</p>
<p>In fact, there are no stuck terms in pure lambda calculus. Every term is either a value or reduces further.</p>
<p>So it turns out that $\Omega$ isn’t so terribly useful on its own. Let’s try to construct something more practical:</p>
<script type="math/tex; mode=display">Y_f = (\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x))</script>
<p>Now, the divergence is a little more interesting:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
Y_f & = & (\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x)) \\
& \longrightarrow & f\ ((\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x))) \\
& = & f\ (Y_f) \\
& \longrightarrow & \dots \\
& = & f\ (f\ (Y_f)) \\
\end{align} %]]></script>
<p>This $Y_f$ term is an instance of the <strong>Y combinator</strong>. It still unfolds infinitely under full beta reduction (and note that under call-by-value it diverges immediately, since the argument $(\lambda x.\ f\ (x\ x))$ keeps being applied to itself before $f$ is ever called), so let’s try to build something more useful.</p>
<p>To delay the infinite recursion, we could build something like a poison pill:</p>
<script type="math/tex; mode=display">\text{poisonpill} = \lambda y.\ \Omega</script>
<p>It can be passed around (after all, it’s just a value), but evaluating it will cause our program to loop infinitely. This is the core idea we’ll use for defining the <strong>fixed-point combinator</strong> $\text{fix}$ (also known as the call-by-value Y combinator), which allows us to do recursion. It’s defined as follows:</p>
<script type="math/tex; mode=display">\text{fix} = \lambda f.\ (\lambda x.\ f\ (\lambda y.\ x\ x\ y))\ (\lambda x.\ f\ (\lambda y.\ x\ x\ y))</script>
<p>This looks a little intricate, and we won’t need to fully understand the definition. What’s important is mostly how it is used to define a recursive function. For instance, if we wanted to define a modulo function in our toy language, we’d do it as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">mod</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">)</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">y</span> <span class="o">></span> <span class="n">x</span><span class="o">)</span> <span class="n">x</span>
<span class="k">else</span> <span class="n">mod</span><span class="o">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="o">,</span> <span class="n">y</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>In lambda calculus, we’d define this as:</p>
<script type="math/tex; mode=display">\text{mod} = \text{fix } (\lambda f.\ \lambda x.\ \lambda y.\
(\text{gt } y\ x)\ x\ (f\ (\text{sub } x\ y)\ y)
)</script>
<p>We’ve assumed that a greater-than $\text{gt}$ function was available here.</p>
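<p>Because $\text{fix}$ delays the self-application behind a lambda, it works under strict evaluation, so it transcribes directly into Python. The sketch below uses native numbers and comparison instead of Church encodings, purely for readability:</p>

```python
# The call-by-value fixed-point combinator (fix, a.k.a. the Z combinator).
fix = lambda f: (lambda x: f(lambda y: x(x)(y)))(lambda x: f(lambda y: x(x)(y)))

# mod = fix (λf. λx. λy. if y > x then x else f (x - y) y), curried.
mod = fix(lambda f: lambda x: lambda y: x if y > x else f(x - y)(y))

print(mod(17)(5))  # 2
print(mod(4)(5))   # 4
```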
<p>More generally, we can define a recursive function as:</p>
<script type="math/tex; mode=display">\text{fix } \bigl(\lambda f.\ (\textit{recursion on } f)\bigr)</script>
<h3 id="equivalence-of-lambda-terms">Equivalence of lambda terms</h3>
<p>We’ve seen how to define Church numerals and successor. How can we prove that $\text{scc } c_n$ is equal to $c_{n+1}$?</p>
<p>The naive approach unfortunately doesn’t work; they do not evaluate to the same value.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{scc } c_2
& = (\lambda n.\ \lambda s.\ \lambda z.\ s\ (n\ s\ z))\ (\lambda s.\ \lambda z.\ s\ (s\ z)) \\
& \longrightarrow \lambda s.\ \lambda z.\ s\ ((\lambda s.\ \lambda z.\ s\ (s\ z))\ s\ z) \\
& \neq \lambda s.\ \lambda z.\ s\ (s\ (s\ z)) \\
& = c_3 \\
\end{align*} %]]></script>
<p>This still seems very close: if we could reduce a little further under the binders, we see that the two terms would become the same.</p>
<p>The intuition behind the Church numeral representation was that a number $n$ is represented as a term that “does something $n$ times to something else”. $\text{scc}$ takes a term that “does something $n$ times to something else”, and returns a term that “does something $n+1$ times to something else”.</p>
<p>What we really care about is that $\text{scc } c_2$ <em>behaves</em> the same as $c_3$ when applied to two arguments. We want <em>behavioral equivalence</em>. But what does that mean? Roughly, two terms $s$ and $t$ are behaviorally equivalent if there is no “test” that distinguishes $s$ and $t$.</p>
<p>Let’s define this notion of “test” a little more precisely, and specify how we’re going to observe the results of a test. We can use the notion of <strong>normalizability</strong> to define a simple notion of a test:</p>
<blockquote>
<p>Two terms $s$ and $t$ are said to be <strong>observationally equivalent</strong> if they are either both normalizable (i.e. they reach a normal form after a finite number of evaluation steps), or both diverge.</p>
</blockquote>
<p>In other words, we observe a term’s behavior by running it and seeing if it halts. Note that this is not decidable (by the halting problem).</p>
<p>For instance, $\Omega$ and $\text{tru}$ are not observationally equivalent (one diverges, one halts), while $\text{tru}$ and $\text{fls}$ are (they both halt).</p>
<p>Observational equivalence isn’t strong enough of a test for what we need; we need behavioral equivalence.</p>
<blockquote>
<p>Two terms $s$ and $t$ are said to be <strong>behaviorally equivalent</strong> if, for every finite sequence of values $v_1, v_2, \dots, v_n$ the applications $s\ v_1\ v_2\ \dots\ v_n$ and $t\ v_1\ v_2\ \dots\ v_n$ are observationally equivalent.</p>
</blockquote>
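<p>We can check this concretely for $\text{scc } c_2$ and $c_3$: the two terms are distinct, yet applying both to the same arguments (here our own <code class="highlighter-rouge">inc</code> and <code class="highlighter-rouge">0</code>) yields the same observable result:</p>

```python
# scc c2 and c3 are different terms, but behave identically when applied.
c2 = lambda s: lambda z: s(s(z))
c3 = lambda s: lambda z: s(s(s(z)))
scc = lambda n: lambda s: lambda z: s(n(s)(z))

inc = lambda x: x + 1
print(scc(c2)(inc)(0))  # 3
print(c3(inc)(0))       # 3
print(scc(c2) is c3)    # False: distinct terms, same behavior
```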
<p>This allows us to assert that true and false are indeed different:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{tru}\ x\ \Omega & \longrightarrow x \\
\text{fls}\ x\ \Omega & \longrightarrow \Omega \\
\end{align} %]]></script>
<p>The former returns a normal form, while the latter diverges.</p>
<h2 id="types">Types</h2>
<p>As previously, to define a language, we start with a <em>set of terms</em> and <em>values</em>, as well as an <em>evaluation relation</em>. But now, we’ll also define a set of <strong>types</strong> (denoted with a first capital letter) classifying values according to their “shape”. We can define a <em>typing relation</em> $t:\ T$. We must check that the typing relation is <em>sound</em> in the sense that:</p>
<script type="math/tex; mode=display">\frac{t: T \qquad t\longrightarrow^* v}{v: T}
\qquad\text{and}\qquad
\frac{t: T}{\exists t' \text{ such that } t\longrightarrow t'}</script>
<p>These rules represent some kind of safety and liveness, but are more commonly referred to as <a href="#properties-of-the-typing-relation">progress and preservation</a>, which we’ll talk about later. The first one states that types are preserved throughout evaluation, while the second says that if we can type-check, then evaluation of $t$ will not get stuck.</p>
<p>In our previous toy language, we can introduce two types, booleans and numbers:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>T ::= // types
Bool // type of booleans
Nat // type of numbers</pre></td></tr></tbody></table></code></pre></figure>
<p>Our typing rules are then given by:</p>
<script type="math/tex; mode=display">\begin{align}
\text{true } : \text{ Bool}
\tag{T-True} \label{eq:t-true} \\ \\
\text{false } : \text{ Bool}
\tag{T-False} \label{eq:t-false} \\ \\
0: \text{ Nat}
\tag{T-Zero} \label{eq:t-zero} \\ \\
\frac{t_1: \text{Bool} \quad t_2 : T \quad t_3: T}{\ifelse}
\tag{T-If} \label{eq:t-if} \\ \\
\frac{t_1: \text{Nat}}{\text{succ } t_1: \text{Nat}}
\tag{T-Succ} \label{eq:t-succ} \\ \\
\frac{t_1: \text{Nat}}{\text{pred } t_1: \text{Nat}}
\tag{T-Pred} \label{eq:t-pred} \\ \\
\frac{t_1: \text{Nat}}{\text{iszero } t_1: \text{Bool}}
\tag{T-IsZero} \label{eq:t-iszero} \\ \\
\end{align}</script>
<p>With these typing rules in place, we can construct typing derivations to justify every pair $t: T$ (which we can also denote as a $(t, T)$ pair) in the typing relation, as we have done previously with evaluation. Proofs of properties about the typing relation often proceed by induction on these typing derivations.</p>
<p>Like other static program analyses, type systems are generally imprecise. They do not always predict exactly what kind of value will be returned, but simply give a conservative approximation. For instance, <code class="highlighter-rouge">if true then 0 else false</code> cannot be typed with the above rules, even though it will certainly evaluate to a number. We could of course add a typing rule for <code class="highlighter-rouge">if true</code> statements, but there is a question of how useful that would be, and of how much complexity it adds to the type system and especially to proofs. Indeed, the inversion lemma below becomes much more tedious when we have more rules.</p>
<h3 id="properties-of-the-typing-relation">Properties of the Typing Relation</h3>
<p>The safety (or soundness) of this type system can be expressed by the following two properties:</p>
<ul>
<li>
<p><strong>Progress</strong>: A well-typed term is not stuck.</p>
<p>If $t\ :\ T$ then either $t$ is a value, or else $t\longrightarrow t’$ for some $t’$.</p>
</li>
<li>
<p><strong>Preservation</strong>: Types are preserved by one-step evaluation.</p>
<p>If $t\ :\ T$ and $t\longrightarrow t’$, then $t’\ :\ T$.</p>
</li>
</ul>
<p>We will prove these later, but first we must state a few lemmas.</p>
<h4 id="inversion-lemma-1">Inversion lemma</h4>
<p>Again, for types we need to state the same (boring) inversion lemma:</p>
<ol>
<li>If $\text{true}: R$, then $R = \text{Bool}$.</li>
<li>If $\text{false}: R$, then $R = \text{Bool}$.</li>
<li>If $\ifelse: R$, then $t_1: \text{ Bool}$, $t_2: R$ and $t_3: R$</li>
<li>If $0: R$ then $R = \text{Nat}$</li>
<li>If $\text{succ } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$</li>
<li>If $\text{pred } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$</li>
<li>If $\text{iszero } t_1: R$ then $R = \text{Bool}$ and $t_1: \text{Nat}$</li>
</ol>
<p>From the inversion lemma, we can directly derive a typechecking algorithm:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">Expr</span><span class="o">)</span><span class="k">:</span> <span class="kt">T</span> <span class="o">=</span> <span class="n">t</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">True</span> <span class="o">|</span> <span class="nc">False</span> <span class="k">=></span> <span class="nc">Bool</span>
<span class="k">case</span> <span class="nc">If</span><span class="o">(</span><span class="n">t1</span><span class="o">,</span> <span class="n">t2</span><span class="o">,</span> <span class="n">t3</span><span class="o">)</span> <span class="k">=></span>
<span class="k">val</span> <span class="n">type1</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span>
<span class="k">val</span> <span class="n">type2</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t2</span><span class="o">)</span>
<span class="k">val</span> <span class="n">type3</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t3</span><span class="o">)</span>
<span class="k">if</span> <span class="o">(</span><span class="n">type1</span> <span class="o">==</span> <span class="nc">Bool</span> <span class="o">&&</span> <span class="n">type2</span> <span class="o">==</span> <span class="n">type3</span><span class="o">)</span> <span class="n">type2</span>
<span class="k">else</span> <span class="k">throw</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
<span class="k">case</span> <span class="nc">Zero</span> <span class="k">=></span> <span class="nc">Nat</span>
<span class="k">case</span> <span class="nc">Succ</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
<span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Nat</span>
<span class="k">else</span> <span class="k">throw</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
<span class="k">case</span> <span class="nc">Pred</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
<span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Nat</span>
<span class="k">else</span> <span class="k">throw</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
<span class="k">case</span> <span class="nc">IsZero</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
<span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Bool</span>
<span class="k">else</span> <span class="k">throw</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h4 id="canonical-form">Canonical form</h4>
<p>A simple lemma that will be useful later is that of canonical forms. Given a type, it tells us what kind of values we can expect:</p>
<ol>
<li>If $v$ is a value of type Bool, then $v$ is either $\text{true}$ or $\text{false}$</li>
<li>If $v$ is a value of type Nat, then $v$ is a numeric value</li>
</ol>
<p>The proof is somewhat immediate from the syntax of values.</p>
<h4 id="progress-theorem">Progress Theorem</h4>
<p><strong>Theorem</strong>: suppose that $t$ is a well-typed term of type $T$. Then either $t$ is a value, or else there exists some $t’$ such that $t\longrightarrow t’$.</p>
<p><strong>Proof</strong>: by induction on a derivation of $t: T$.</p>
<ul>
<li>The $\ref{eq:t-true}$, $\ref{eq:t-false}$ and $\ref{eq:t-zero}$ cases are immediate, since $t$ is a value in these cases.</li>
<li>
<p>For $\ref{eq:t-if}$, we have $t=\ifelse$, with $t_1: \text{Bool}$, $t_2: T$ and $t_3: T$. By the induction hypothesis, either $t_1$ is a value, or there is some $t_1’$ such that $t_1 \longrightarrow t_1’$.</p>
<p>If $t_1$ is a value, then rule 1 of the <a href="#canonical-form">canonical form lemma</a> tells us that $t_1$ must be either $\text{true}$ or $\text{false}$, in which case $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ applies to $t$.</p>
<p>Otherwise, if $t_1 \longrightarrow t_1’$, then by $\ref{eq:e-if}$, $t\longrightarrow \if t_1’ \then t_2 \text{ else } t_3$</p>
</li>
<li>
<p>For $\ref{eq:t-succ}$, we have $t = \text{succ } t_1$.</p>
<p>If $t_1$ is a value, then by rule 5 of the <a href="#inversion-lemma-1">inversion lemma</a> and rule 2 of the <a href="#canonical-form">canonical form</a> lemma, $t_1 = nv$ for some numeric value $nv$, and therefore $\text{succ } t_1$ is a value. Otherwise, if $t_1 \longrightarrow t_1’$, then $t\longrightarrow \text{succ }t_1’$.</p>
</li>
<li>The cases for $\ref{eq:t-pred}$ and $\ref{eq:t-iszero}$ are similar.</li>
</ul>
<h4 id="preservation-theorem">Preservation Theorem</h4>
<p><strong>Theorem</strong>: Types are preserved by one-step evaluation. If $t: T$ and $t\longrightarrow t’$, then $t’: T$.</p>
<p><strong>Proof</strong>: by induction on the given typing derivation</p>
<ul>
<li>For $\ref{eq:t-true}$ and $\ref{eq:t-false}$, the precondition doesn’t hold (no reduction is possible), so the statement is trivially true. Indeed, $t$ is already a value, either $t=\text{true}$ or $t=\text{false}$.</li>
<li>For $\ref{eq:t-if}$, there are three evaluation rules by which $t\longrightarrow t’$ can be derived, depending on $t_1$
<ul>
<li>If $t_1 = \text{true}$, then by $\ref{eq:e-iftrue}$ we have $t’=t_2$, and from rule 3 of the <a href="#inversion-lemma-1">inversion lemma</a> and the assumption that $t: T$, we have $t_2: T$, that is $t’: T$</li>
<li>If $t_1 = \text{false}$, then by $\ref{eq:e-iffalse}$ we have $t’=t_3$, and from rule 3 of the <a href="#inversion-lemma-1">inversion lemma</a> and the assumption that $t: T$, we have $t_3: T$, that is $t’: T$</li>
<li>If $t_1 \longrightarrow t_1’$, then by the induction hypothesis, $t_1’: \text{Bool}$. Combining this with the assumption that $t_2: T$ and $t_3: T$, we can apply $\ref{eq:t-if}$ to conclude $\if t_1’ \then t_2 \else t_3: T$, that is $t’: T$</li>
</ul>
</li>
</ul>
<h3 id="messing-with-it">Messing with it</h3>
<h4 id="removing-a-rule">Removing a rule</h4>
<p>What if we remove $\ref{eq:e-predzero}$? Then <code class="highlighter-rouge">pred 0</code> type checks, but it is stuck and is not a value; the <a href="#progress-theorem">progress theorem</a> fails.</p>
<h4 id="changing-type-checking-rule">Changing type-checking rule</h4>
<p>What if we change the $\ref{eq:t-if}$ to the following?</p>
<script type="math/tex; mode=display">\frac{
t_1 : \text{Bool} \quad
t_2 : \text{Nat} \quad
t_3 : \text{Nat}
}{
(\ifelse) : \text{Nat}
}
\tag{T-If 2}
\label{eq:t-if2}</script>
<p>This doesn’t break our type system. It’s still sound, but it rejects if-else expressions that return other things than numbers (e.g. booleans). But that is an expressiveness problem, not a soundness problem; our type system disallows things that would otherwise be fine by the evaluation rules.</p>
<h4 id="adding-bit">Adding bit</h4>
<p>We could add a boolean to natural function <code class="highlighter-rouge">bit(t)</code>. We’d have to add it to the grammar, add some evaluation and typing rules, and prove progress and preservation.</p>
<script type="math/tex; mode=display">\begin{align}
\text{bit true} \longrightarrow 1 \\ \\
\text{bit false} \longrightarrow 0 \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{bit }t_1 \longrightarrow \text{bit }t_1'}
\\ \\
\frac{t : \text{Bool}}{\text{bit } t : \text{Nat}}
\end{align}</script>
<p>We’ll do something similar below, so the full proof is omitted here.</p>
<h2 id="simply-typed-lambda-calculus">Simply typed lambda calculus</h2>
<p>Simply Typed Lambda Calculus (STLC) is also denoted $\lambda_\rightarrow$. The “pure” form of STLC is not very interesting on the type-level (unlike for the term-level of pure lambda calculus), so we’ll allow base values that are not functions, like booleans and integers. To talk about STLC, we always begin with some set of “base types”:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>T ::= // types
Bool // type of booleans
T -> T // type of functions</pre></td></tr></tbody></table></code></pre></figure>
<p>In the following examples, we’ll work with a mix of our previously defined toy language, and lambda calculus. This will give us a little syntactic sugar.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre>t ::= // terms
x // variable
λx. t // abstraction
t t // application
true // constant true
false // constant false
if t then t else t // conditional
v ::= // values
λx. t // abstraction value
true // true value
false // false value</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="type-annotations">Type annotations</h3>
<p>We will annotate lambda-abstractions with the expected type of the argument, as follows:</p>
<script type="math/tex; mode=display">\lambda x: T_1 .\ t_1</script>
<p>We could also omit it, and let type inference do the job (as in OCaml), but for now, we’ll do the above. This will make it simpler, as we won’t have to discuss inference just yet.</p>
<h3 id="typing-rules">Typing rules</h3>
<p>In STLC, we’ve introduced abstraction. To add a typing rule for it, we need to encode the concept of an environment $\Gamma$, which is a set of variable type assignments. We also introduce the “turnstile” symbol $\vdash$, meaning that the typing on the right-hand side can be derived from the environment $\Gamma$.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{
\bigl( \Gamma \cup (x_1 : T_1) \bigr) \vdash t_2 : T_2
}{ \Gamma\vdash(\lambda x: T_1.\ t_2): T_1 \rightarrow T_2 }
\tag{T-Abs} \label{eq:t-abs} \\ \\
\frac{x: T \in \Gamma}{\Gamma\vdash x: T}
\tag{T-Var} \label{eq:t-var} \\ \\
\frac{
\Gamma\vdash t_1 : T_{11}\rightarrow T_{12}
\quad
\Gamma\vdash t_2 : T_{11}
}{\Gamma\vdash t_1\ t_2 : T_{12}}
\tag{T-App} \label{eq:t-app}
\end{align}</script>
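<p>These rules translate directly into a typechecker that threads the environment $\Gamma$ through recursive calls. Below is a minimal Python sketch; the tuple-based term and type representations are a hypothetical choice of ours, made to mirror $\ref{eq:t-var}$, $\ref{eq:t-abs}$ and $\ref{eq:t-app}$ (plus the boolean constants from earlier):</p>

```python
# A minimal STLC typechecker sketch; the tuple-based representations of
# terms and types are hypothetical, chosen to mirror the typing rules.
Bool = ("Bool",)
def Arrow(t1, t2):
    return ("Arrow", t1, t2)

def typeof(t, env):
    tag = t[0]
    if tag in ("true", "false"):     # T-True / T-False
        return Bool
    if tag == "var":                 # T-Var: look the variable up in the environment
        return env[t[1]]
    if tag == "abs":                 # T-Abs: type the body under env extended with x: T1
        _, x, t1, body = t
        return Arrow(t1, typeof(body, {**env, x: t1}))
    if tag == "app":                 # T-App: argument type must match the domain
        _, fun, arg = t
        tf, ta = typeof(fun, env), typeof(arg, env)
        if tf[0] == "Arrow" and tf[1] == ta:
            return tf[2]
        raise TypeError("not typable")
    raise TypeError("unknown term")

# λx: Bool. x  has type  Bool -> Bool
ident = ("abs", "x", Bool, ("var", "x"))
print(typeof(ident, {}))                      # ('Arrow', ('Bool',), ('Bool',))
print(typeof(("app", ident, ("true",)), {}))  # ('Bool',)
```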
<p>This additional concept must be taken into account in our definition of progress and preservation:</p>
<ul>
<li><strong>Progress</strong>: If $\Gamma\vdash t : T$, then either $t$ is a value or else $t\longrightarrow t’$ for some $t’$</li>
<li><strong>Preservation</strong>: If $\Gamma\vdash t : T$ and $t\longrightarrow t’$, then $\Gamma\vdash t’ : T$</li>
</ul>
<p>To prove these, we must take the same steps as above. We’ll introduce the inversion lemma for typing relations, and restate the canonical forms lemma in order to prove the progress theorem.</p>
<h3 id="inversion-lemma-2">Inversion lemma</h3>
<p>Let’s start with the inversion lemma.</p>
<ol>
<li>If $\Gamma\vdash\text{true} : R$ then $R = \text{Bool}$</li>
<li>If $\Gamma\vdash\text{false} : R$ then $R = \text{Bool}$</li>
<li>If $\Gamma\vdash\ifelse : R$ then $\Gamma\vdash t_1 : \text{Bool}$, $\Gamma\vdash t_2: R$ and $\Gamma\vdash t_3: R$.</li>
<li>If $\Gamma\vdash x: R$ then $x: R \in\Gamma$</li>
<li>If $\Gamma\vdash\lambda x: T_1 .\ t_2 : R$ then $R = T_1 \rightarrow R_2$ for some $R_2$ with $\Gamma\cup(x: T_1)\vdash t_2: R_2$</li>
<li>If $\Gamma\vdash t_1\ t_2 : R$ then there is some type $T_{11}$ such that $\Gamma\vdash t_1 : T_{11} \rightarrow R$ and $\Gamma\vdash t_2 : T_{11}$.</li>
</ol>
<h3 id="canonical-form-1">Canonical form</h3>
<p>The canonical forms are given as follows:</p>
<ol>
<li>If $v$ is a value of type Bool, then it is either $\text{true}$ or $\text{false}$</li>
<li>If $v$ is a value of type $T_1 \rightarrow T_2$ then $v$ has the form $\lambda x: T_1 .\ t_2$</li>
</ol>
<h3 id="progress">Progress</h3>
<p>Finally, we get to prove the progress by induction on typing derivations.</p>
<p><strong>Theorem</strong>: Suppose that $t$ is a closed, well typed term (that is, $\Gamma\vdash t: T$ for some type $T$). Then either $t$ is a value, or there is some $t’$ such that $t\longrightarrow t’$.</p>
<ul>
<li>For boolean constants, the proof is immediate as $t$ is a value</li>
<li>For variables, the proof is immediate as $t$ is closed, and the precondition therefore doesn’t hold</li>
<li>For abstraction, the proof is immediate as $t$ is a value</li>
<li>
<p>Application is the only case we must treat.</p>
<p>Consider $t = t_1\ t_2$, with $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2: T_{11}$.</p>
<p>By the induction hypothesis, $t_1$ is either a value, or it can make a step of evaluation. The same goes for $t_2$.</p>
<p>If $t_1$ can reduce, then rule $\ref{eq:e-app1}$ applies to $t$. Otherwise, if it is a value, and $t_2$ can take a step, then $\ref{eq:e-app2}$ applies. Otherwise, if they are both values, then the canonical forms lemma above tells us that $t_1$ has the form $\lambda x: T_{11}.\ t_{12}$, and so rule $\ref{eq:e-appabs}$ applies to $t$.</p>
</li>
</ul>
<h3 id="preservation">Preservation</h3>
<p><strong>Theorem</strong>: If $\Gamma\vdash t: T$ and $t \longrightarrow t’$ then $\Gamma\vdash t’: T$.</p>
<p><strong>Proof</strong>: by induction on typing derivations. We proceed on a case-by-case basis, as we have done so many times before. But one case is hard: application.</p>
<p>For $t = t_1\ t_2$, such that $\Gamma\vdash t_1 : T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2 : T_{11}$, and where $T=T_{12}$, we want to show $\Gamma\vdash t’ : T_{12}$.</p>
<p>To do this, we must use the <a href="#inversion-lemma">inversion lemma for evaluation</a> (note that we haven’t written it down for STLC, but the idea is the same). There are three subcases for it, starting with the following:</p>
<p>The left-hand side is $t_1 = \lambda x: T_{11}.\ t_{12}$, and the right-hand side of application $t_2$ is a value $v_2$. In this case, we know that the result of the evaluation is given by $t’ = \left[ x\mapsto v_2 \right] t_{12}$.</p>
<p>And here, we already run into trouble, because we do not know about how types act under substitution. We will therefore need to introduce some lemmas.</p>
<h4 id="weakening-lemma">Weakening lemma</h4>
<p>Weakening tells us that we can <em>add</em> assumptions to the context without losing any true typing statements:</p>
<p>If $\Gamma\vdash t: T$, and the environment $\Gamma$ has no information about $x$—that is, $x\notin \text{dom}(\Gamma)$—then the initial assumption still holds if we add information about $x$ to the environment:</p>
<script type="math/tex; mode=display">\bigl(\Gamma \cup (x: S)\bigr)\vdash t: T</script>
<p>Moreover, the latter $\vdash$ derivation has the same depth as the former.</p>
<h4 id="permutation-lemma">Permutation lemma</h4>
<p>Permutation tells us that the order of assumptions in $\Gamma$ does not matter.</p>
<p>If $\Gamma \vdash t: T$ and $\Delta$ is a permutation of $\Gamma$, then $\Delta\vdash t: T$.</p>
<p>Moreover, the latter $\vdash$ derivation has the same depth as the former.</p>
<h4 id="substitution-lemma">Substitution lemma</h4>
<p>Substitution tells us that types are preserved under substitution.</p>
<p>That is, if $\Gamma\cup(x: S) \vdash t: T$ and $\Gamma\vdash s: S$, then $\Gamma\vdash \left[x\mapsto s\right] t: T$.</p>
<p>The proof goes by induction on the derivation of $\Gamma\cup(x: S) \vdash t: T$, that is, by cases on the final typing rule used in the derivation.</p>
<ul>
<li>
<p>Case $\ref{eq:t-app}$: in this case, $t = t_1\ t_2$.</p>
<p>Thanks to typechecking, we know that the environment validates $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_1: T_2 \rightarrow T_1$ and $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_2: T_2$. In this case, the resulting type of the application is $T=T_1$.</p>
<p>By the induction hypothesis, $\Gamma\vdash[x\mapsto s]t_1 : T_2 \rightarrow T_1$, and $\Gamma\vdash[x\mapsto s]t_2 : T_2$.</p>
<p>By $\ref{eq:t-app}$, the environment then also verifies the application of these two substitutions as $T$: $\Gamma\vdash[x\mapsto s]t_1\ [x\mapsto s]t_2: T$. We can factorize the substitution to obtain the conclusion, i.e. $\Gamma\vdash \left[x\mapsto s\right](t_1\ t_2): T$</p>
</li>
<li>Case $\ref{eq:t-var}$: if $t=z$ ($t$ is a simple variable $z$) where $z: T \in \bigl(\Gamma\cup (x: S)\bigr)$. There are two subcases to consider here, depending on whether $z$ is $x$ or another variable:
<ul>
<li>If $z=x$, then $\left[x\mapsto s\right] z = s$. The result is then $\Gamma\vdash s: S$, which is among the assumptions of the lemma</li>
<li>If $z\ne x$, then $\left[x\mapsto s\right] z = z$, and the desired result is immediate</li>
</ul>
</li>
<li>
<p>Case $\ref{eq:t-abs}$: if $t=\lambda y: T_2.\ t_1$, with $T=T_2\rightarrow T_1$, and $\bigl(\Gamma\cup (x: S)\cup (y: T_2)\bigr)\vdash t_1 : T_1$.</p>
<p>Based on our <a href="#alpha-conversion">hygiene convention</a>, we may assume $x\ne y$ and $y \notin \text{fv}(s)$.</p>
<p>Using <a href="#permutation-lemma">permutation</a> on the first given subderivation in the lemma ($\Gamma\cup(x: S) \vdash t: T$), we obtain $\bigl(\Gamma\cup (y: T_2)\cup (x: S)\bigr)\vdash t_1 : T_1$ (we have simply changed the order of $x$ and $y$).</p>
<p>Using <a href="#weakening-lemma">weakening</a> on the other given derivation in the lemma ($\Gamma\vdash s: S$), we obtain $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash s: S$.</p>
<p>By the induction hypothesis, $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash\left[x\mapsto s\right] t_1: T_1$.</p>
<p>By $\ref{eq:t-abs}$, we have $\Gamma\vdash(\lambda y: T_2.\ [x\mapsto s]t_1): T_2 \rightarrow T_1$</p>
<p>By the definition of substitution, this is $\Gamma\vdash([x\mapsto s]\lambda y: T_2.\ t_1): T_2 \rightarrow T_1$.</p>
</li>
</ul>
<h4 id="proof">Proof</h4>
<p>We’ve now proven the following lemmas:</p>
<ul>
<li>Weakening</li>
<li>Permutation</li>
<li>Type preservation under substitution</li>
<li>Type preservation under reduction (i.e. preservation)</li>
</ul>
<p>We won’t actually carry out the proof here; we’ve just set up the pieces we need for it.</p>
<h3 id="erasure">Erasure</h3>
<p>Type annotations do not play any role in evaluation. In STLC, we don’t do any run-time checks, we only run compile-time type checks. Therefore, types can be removed before evaluation. This often happens in practice, where types do not appear in the compiled form of a program; they’re typically encoded in an untyped fashion. The semantics of this conversion can be formalized by an erasure function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{erase}(x) & = x \\
\text{erase}(\lambda x: T_1. t_2) & = \lambda x. \text{erase}(t_2) \\
\text{erase}(t_1\ t_2) & = \text{erase}(t_1)\ \text{erase}(t_2)
\end{align} %]]></script>
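<p>To make erasure concrete, here is a minimal sketch in Scala over a toy term AST (the constructor names and the string representation of types are my own, purely for illustration):</p>

```scala
// Typed STLC terms: variables, annotated abstractions, applications.
sealed trait Term
case class Var(name: String) extends Term
case class Abs(param: String, tpe: String, body: Term) extends Term
case class App(fun: Term, arg: Term) extends Term

// Untyped terms: same shape, minus the annotation on abstractions.
sealed trait UTerm
case class UVar(name: String) extends UTerm
case class UAbs(param: String, body: UTerm) extends UTerm
case class UApp(fun: UTerm, arg: UTerm) extends UTerm

// erase drops the annotation and recurses; term structure is untouched.
def erase(t: Term): UTerm = t match {
  case Var(x)        => UVar(x)
  case Abs(x, _, t2) => UAbs(x, erase(t2))
  case App(t1, t2)   => UApp(erase(t1), erase(t2))
}
```

<p>For instance, <code class="highlighter-rouge">erase(Abs("x", "Bool", Var("x")))</code> yields <code class="highlighter-rouge">UAbs("x", UVar("x"))</code>.</p>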
<h3 id="curry-howard-correspondence">Curry-Howard Correspondence</h3>
<p>The Curry-Howard correspondence tells us that there is a correspondence between propositional logic and types.</p>
<p>An implication $P\supset Q$ (which could also be written $P\implies Q$) can be proven by transforming evidence for $P$ into evidence for $Q$. A conjunction $P\land Q$ is a <a href="#pairs-1">pair</a> of evidence for $P$ and evidence for $Q$. For more examples of these correspondences, see the <a href="https://en.wikipedia.org/wiki/Brouwer–Heyting–Kolmogorov_interpretation">Brouwer–Heyting–Kolmogorov (BHK) interpretation</a>.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Logic</th>
<th style="text-align: left">Programming languages</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Propositions</td>
<td style="text-align: left">Types</td>
</tr>
<tr>
<td style="text-align: left">$P \supset Q$</td>
<td style="text-align: left">Type $P\rightarrow Q$</td>
</tr>
<tr>
<td style="text-align: left">$P \land Q$</td>
<td style="text-align: left"><a href="#pairs-1">Pair type</a> $P\times Q$</td>
</tr>
<tr>
<td style="text-align: left">$P \lor Q$</td>
<td style="text-align: left"><a href="#sum-type">Sum type</a> $P+Q$</td>
</tr>
<tr>
<td style="text-align: left">$\exists x\in S: \phi(x)$</td>
<td style="text-align: left">Dependent type $\sum{x: S, \phi(x)}$</td>
</tr>
<tr>
<td style="text-align: left">$\forall x\in S: \phi(x)$</td>
<td style="text-align: left">$\forall (x:S): \phi(x)$</td>
</tr>
<tr>
<td style="text-align: left">Proof of $P$</td>
<td style="text-align: left">Term $t$ of type $P$</td>
</tr>
<tr>
<td style="text-align: left">$P$ is provable</td>
<td style="text-align: left">Type $P$ is inhabited</td>
</tr>
<tr>
<td style="text-align: left">Proof simplification</td>
<td style="text-align: left">Evaluation</td>
</tr>
</tbody>
</table>
<p>In Scala, all types are inhabited except for the bottom type <code class="highlighter-rouge">Nothing</code>. Singleton types are only inhabited by a single term.</p>
<p>As an example of the equivalence, we’ll see that application is equivalent to <a href="https://en.wikipedia.org/wiki/Modus_ponens">modus ponens</a>:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash t_1 : P \supset Q \quad \Gamma\vdash t_2 : P}{\Gamma\vdash t_1\ t_2 : Q}</script>
<p>This also tells us that if we can prove something, we can evaluate it.</p>
<p>How can we prove the following? Remember that $\rightarrow$ is right-associative.</p>
<script type="math/tex; mode=display">(A \land B) \rightarrow C \rightarrow ((C\land A)\land B)</script>
<p>The proof is actually a somewhat straightforward conversion to lambda calculus:</p>
<script type="math/tex; mode=display">\lambda p: A\times B.\ \lambda c: C.\ \text{pair} (\text{pair} (c\ \text{fst}(p)) \text{snd}(p))</script>
<h3 id="extensions-to-stlc">Extensions to STLC</h3>
<h4 id="base-types">Base types</h4>
<p>Up until now, we’ve defined our base types (such as $\text{Nat}$ and $\text{Bool}$) manually: we’ve added them to the syntax of types, with associated constants ($\text{zero}, \text{true}, \text{false}$) and operators ($\text{succ}, \text{pred}$), as well as associated typing and evaluation rules.</p>
<p>This is a lot of minutiae though, especially for theoretical discussions. For those, we can often ignore the term-level inhabitants of the base types, and just treat them as uninterpreted constants: we don’t really need the distinction between constants and values. For theory, we can just assume that some generic base types (e.g. $B$ and $C$) exist, without defining them further.</p>
<h4 id="unit-type">Unit type</h4>
<p>In C-like languages, this type is usually called <code class="highlighter-rouge">void</code>. To introduce it, we do not add any computation rules. We must only add it to the grammar, values and types, and then add a single typing rule that trivially verifies units.</p>
<script type="math/tex; mode=display">\Gamma\vdash\text{unit}:\text{Unit}
\label{eq:t-unit} \tag{T-Unit}</script>
<p>Units are not too interesting, but <em>are</em> quite useful in practice, in part because they allow for other extensions.</p>
<h4 id="sequencing">Sequencing</h4>
<p>We can define sequencing as two statements following each other:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
t1; t2</pre></td></tr></tbody></table></code></pre></figure>
<p>This implies adding some evaluation and typing rules, defined below:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \longrightarrow t_1'}{t_1;\ t_2 \longrightarrow t_1';\ t_2}
\label{eq:e-seq}\tag{E-Seq} \\ \\
(\text{unit};\ t_2) \longrightarrow t_2
\label{eq:e-seqnext}\tag{E-SeqNext} \\ \\
\frac{\Gamma\vdash t_1 : \text{Unit} \quad \Gamma\vdash t_2: T_2}{\Gamma\vdash t_1;\ t_2 : T_2}
\label{eq:t-seq}\tag{T-Seq} \\
\end{align}</script>
<p>But there’s another way that we could define sequencing: simply as syntactic sugar, a derived form for something else. In this way, we define an external language, that is transformed to an internal language by the compiler in the desugaring step.</p>
<script type="math/tex; mode=display">t_1;\ t_2 \defeq (\lambda x: \text{Unit}.\ t_2)\ t_1
\qquad \text{where } x\notin\text{ FV}(t_2)</script>
<p>This is useful to know, because it makes proving soundness much easier: we do not need to re-state the inversion lemma or re-prove preservation and progress. We can simply rely on the proof for the underlying internal language.</p>
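<p>As a sanity check on the derived form: when $t_1$ is already $\text{unit}$, a single $\beta$-reduction step discards it and leaves $t_2$, exactly mirroring $\ref{eq:e-seqnext}$ (using $x\notin\text{FV}(t_2)$):</p>
<script type="math/tex; mode=display">(\text{unit};\ t_2) \defeq (\lambda x: \text{Unit}.\ t_2)\ \text{unit} \longrightarrow [x\mapsto\text{unit}]\, t_2 = t_2</script>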
<h4 id="ascription">Ascription</h4>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
t as T</pre></td></tr></tbody></table></code></pre></figure>
<p>Ascription allows us to have a compiler type-check a term as really being of the correct type:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash t_1 : T}{\Gamma\vdash t_1 \text{ as } T: T}
\label{eq:t-ascribe}\tag{T-Ascribe}</script>
<p>This seems like it preserves soundness, but instead of doing the whole proof over again, we’ll just propose a simple desugaring, in which an ascription is equivalent to applying an identity function, typed to return $T$, to the term $t$:</p>
<script type="math/tex; mode=display">t \text{ as } T \defeq (\lambda x: T.\ x)\ t</script>
<p>Alternatively, we could do the whole proof over again, and institute a simple evaluation rule that ignores the ascription.</p>
<script type="math/tex; mode=display">v_1 \text{ as } T \longrightarrow v_1
\label{eq:e-ascribe}\tag{E-Ascribe} \\</script>
<h4 id="pairs-1">Pairs</h4>
<p>We can introduce pairs into our grammar.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre>t ::=
...
{t, t} // pair
t.1 // first projection
t.2 // second projection
v ::=
...
{v, v} // pair value
T ::=
...
T1 x T2 // product types</pre></td></tr></tbody></table></code></pre></figure>
<p>We can also introduce evaluation rules for pairs:</p>
<script type="math/tex; mode=display">\begin{align}
\set{v_1, v_2}.1 \longrightarrow v_1
\tag{E-PairBeta1}\label{eq:e-pairbeta1} \\ \\
\set{v_1, v_2}.2 \longrightarrow v_2
\tag{E-PairBeta2}\label{eq:e-pairbeta2} \\ \\
\frac{t_1 \longrightarrow t_1'}{t_1.1\longrightarrow t_1'.1}
\tag{E-Proj1}\label{eq:e-proj1} \\ \\
\frac{t_1 \longrightarrow t_1'}{t_1.2\longrightarrow t_1'.2}
\tag{E-Proj2}\label{eq:e-proj2} \\ \\
\frac{t_1 \longrightarrow t_1'}{\set{t_1, t_2} \longrightarrow \set{t_1', t_2}}
\tag{E-Pair1}\label{eq:e-pair1} \\ \\
\frac{t_2 \longrightarrow t_2'}{\set{t_1, t_2} \longrightarrow \set{t_1, t_2'}}
\tag{E-Pair2}\label{eq:e-pair2} \\ \\
\end{align}</script>
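<p>These rules impose left-to-right evaluation on pairs. As a small worked example (with numerals as sugar for $\text{succ}$-terms): the first step uses $\ref{eq:e-proj1}$ with $\ref{eq:e-pair1}$, the second $\ref{eq:e-proj1}$ with $\ref{eq:e-pair2}$, and the last $\ref{eq:e-pairbeta1}$.</p>
<script type="math/tex; mode=display">\begin{align}
\set{\text{pred } 4,\ \text{if true then false else false}}.1
&\longrightarrow \set{3,\ \text{if true then false else false}}.1 \\
&\longrightarrow \set{3,\ \text{false}}.1 \\
&\longrightarrow 3
\end{align}</script>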
<p>The typing rules are then:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{
\Gamma\vdash t_1: T_1 \quad \Gamma\vdash t_2: T_2
}{
\Gamma\vdash \set{t_1, t_2} : T_1 \times T_2
} \label{eq:t-pair} \tag{T-Pair} \\ \\
\frac{\Gamma\vdash t_1 : T_{11}\times T_{12}}{\Gamma\vdash t_1.1:T_{11}}
\label{eq:t-proj1}\tag{T-Proj1} \\ \\
\frac{\Gamma\vdash t_1 : T_{11}\times T_{12}}{\Gamma\vdash t_1.2:T_{12}}
\label{eq:t-proj2}\tag{T-Proj2} \\ \\
\end{align}</script>
<p>Pairs have to be added “the hard way”: we do not really have a way to define them in a derived form, as we have no existing language features to piggyback onto.</p>
<h4 id="tuples">Tuples</h4>
<p>Tuples are like pairs, except that we do not restrict them to 2 elements; we allow any number of elements from 1 to $n$. We can use pairs to encode tuples: <code class="highlighter-rouge">(a, b, c)</code> can be encoded as <code class="highlighter-rouge">(a, (b, c))</code>. In practice, though, most languages implement tuples natively for performance and convenience.</p>
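<p>The nesting trick is easy to see with ordinary Scala tuples standing in for the pairs of the calculus (a sketch, not part of the formal development):</p>

```scala
// Encode the triple (1, 2, 3) as the nested pair (1, (2, 3)).
val triple: (Int, (Int, Int)) = (1, (2, 3))

// Projections compose: elements past the first live in the inner pair.
val first: Int  = triple._1
val second: Int = triple._2._1
val third: Int  = triple._2._2
```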
<h4 id="records">Records</h4>
<p>We can easily generalize tuples to records by annotating each field with a label. A record is a bundle of values with labels: a map from labels to values and types. The order of fields in a record doesn’t matter; the only index is the label.</p>
<p>If we allow numeric labels, then we can encode a tuple as a record, where the index implicitly encodes the numeric label of the record representation.</p>
<p>No mainstream language has language-level support for records (two case classes in Scala may have the same arguments but a different constructor, so it’s not quite the same; records are more like anonymous objects). This is because they’re often quite inefficient in practice, but we’ll still use them as a theoretical abstraction.</p>
<h3 id="sums-and-variants">Sums and variants</h3>
<h4 id="sum-type">Sum type</h4>
<p>A sum type $T = T_1 + T_2$ is a <em>disjoint</em> union of $T_1$ and $T_2$. Pragmatically, we can have sum types in Scala with case classes extending an abstract object:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">+T</span><span class="o">]</span>
<span class="nc">case</span> <span class="k">class</span> <span class="nc">Some</span><span class="o">[</span><span class="kt">+T</span><span class="o">]</span> <span class="nc">extends</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">object</span> <span class="nc">None</span> <span class="k">extends</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>In this example, <code class="highlighter-rouge">Option = Some + None</code>. We say that $T_1$ is on the left, and $T_2$ on the right. Disjointness is ensured by the tags $\text{inl}$ and $\text{inr}$. We can <em>think</em> of these as functions that inject into the left or right of the sum type $T$:</p>
<script type="math/tex; mode=display">\text{inl}: T_1 \rightarrow T_1 + T_2 \\
\text{inr}: T_2 \rightarrow T_1 + T_2</script>
<p>Still, these aren’t really functions; they don’t actually have function type. Instead, we use them to tag the left and right side of a sum type, respectively.</p>
<p>Another way to think of these stems from <a href="/#curry-howard-correspondence">Curry-Howard correspondence</a>. Recall that in the <a href="https://en.wikipedia.org/wiki/Brouwer%E2%80%93Heyting%E2%80%93Kolmogorov_interpretation">BHK interpretation</a>, a proof of $P \lor Q$ is a pair <code class="highlighter-rouge"><a, b></code> where <code class="highlighter-rouge">a</code> is 0 (also denoted $\text{inl}$) and <code class="highlighter-rouge">b</code> a proof of $P$, <em>or</em> <code class="highlighter-rouge">a</code> is 1 (also denoted $\text{inr}$) and <code class="highlighter-rouge">b</code> is a proof of $Q$.</p>
<p>To use elements of a sum type, we can introduce a <code class="highlighter-rouge">case</code> construct that allows us to pattern-match on a sum type, distinguishing the left type from the right one.</p>
<p>We need to introduce these three special forms in our syntax:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>t ::= ... // terms
inl t // tagging (left)
inr t // tagging (right)
case t of inl x => t | inr x => t // case
v ::= ... // values
inl v // tagged value (left)
inr v // tagged value (right)
T ::= ... // types
T + T // sum type</pre></td></tr></tbody></table></code></pre></figure>
<p>This also leads us to introduce some new evaluation rules:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\begin{rcases}
\text{case } (& \text{inl } v_0) \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2 \\
\end{rcases} \longrightarrow [x_1 \mapsto v_0] t_1
\label{eq:e-caseinl}\tag{E-CaseInl} \\ \\
\begin{rcases}
\text{case } (& \text{inr } v_0) \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2 \\
\end{rcases} \longrightarrow [x_2 \mapsto v_0] t_2
\label{eq:e-caseinr}\tag{E-CaseInr} \\ \\
\frac{t_0 \longrightarrow t_0'}{
\begin{rcases}
\text{case } & t_0 \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2
\end{rcases} \longrightarrow \begin{cases}
\text{case } & t_0' \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2
\end{cases}
} \label{eq:e-case}\tag{E-Case} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{inl }t_1 \longrightarrow \text{inl }t_1'}
\label{eq:e-inl}\tag{E-Inl} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{inr }t_1 \longrightarrow \text{inr }t_1'}
\label{eq:e-inr}\tag{E-Inr} \\ \\
\end{align} %]]></script>
<p>And we’ll also introduce three typing rules:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t_1 : T_1}{\Gamma\vdash\text{inl } t_1 : T_1 + T_2}
\label{eq:t-inl}\tag{T-Inl} \\ \\
\frac{\Gamma\vdash t_1 : T_2}{\Gamma\vdash\text{inr } t_1 : T_1 + T_2}
\label{eq:t-inr}\tag{T-Inr} \\ \\
\frac{
\Gamma\vdash t_0 : T_1 + T_2 \quad
\Gamma\cup(x_1: T_1) \vdash t_1 : T \quad
\Gamma\cup(x_2: T_2) \vdash t_2 : T
}{
\Gamma\vdash\text{case } t_0 \text{ of inl } x_1 \Rightarrow t_1 \mid \text{inr } x_2 \Rightarrow t_2 : T
}
\label{eq:t-case}\tag{T-Case} \\
\end{align}</script>
<h4 id="sums-and-uniqueness-of-type">Sums and uniqueness of type</h4>
<p>The rules $\ref{eq:t-inr}$ and $\ref{eq:t-inl}$ may seem confusing at first. We only have one type to deduce from, so what do we assign to $T_2$ and $T_1$, respectively? These rules mean that we have lost uniqueness of types: if $t$ has type $T$, then $\text{inl } t$ has type $T+U$ <strong>for every</strong> $U$.</p>
<p>There are a couple of solutions to this:</p>
<ol>
<li>We can infer $U$ as needed during typechecking</li>
<li>Give constructors different names and only allow each name to appear in one sum type. This requires generalization to <a href="#variants">variants</a>, which we’ll see next. OCaml adopts this solution.</li>
<li>Annotate each inl and inr with the intended sum type.</li>
</ol>
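<p>As a Scala analogy (not the calculus itself), <code class="highlighter-rouge">Either</code> is a binary sum whose injections carry the distinct names <code class="highlighter-rouge">Left</code> and <code class="highlighter-rouge">Right</code>, and the intended sum type is typically written out as an annotation, in the spirit of options 2 and 3:</p>

```scala
// Either[String, Int] plays the role of String + Int;
// Left is "inl", Right is "inr", both disambiguated by the return type.
def stringOrInt(b: Boolean): Either[String, Int] =
  if (b) Left("hello") else Right(42)

// Pattern matching plays the role of T-Case: both branches yield a String.
def describe(e: Either[String, Int]): String = e match {
  case Left(s)  => s"left: $s"
  case Right(n) => s"right: $n"
}
```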
<p>For now, we don’t want to look at type inference and variants, so we’ll choose the third approach for simplicity. We’ll introduce these annotations as ascriptions on the injection operators in our grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre>t ::=
...
inl t as T
inr t as T
v ::=
...
inl v as T
inr v as T</pre></td></tr></tbody></table></code></pre></figure>
<p>The evaluation rules would be exactly the same as before, but with the ascriptions in the syntax. The injection operators now also specify <em>which</em> sum type we’re injecting into, for the sake of uniqueness of type.</p>
<h4 id="variants">Variants</h4>
<p>Just as we generalized binary products to labeled records, we can generalize binary sums to labeled variants. We can label the members of the sum type, so that we write $\langle l_1: T_1, l_2: T_2 \rangle$ instead of $T_1 + T_2$ ($l_1$ and $l_2$ are the labels).</p>
<p>As a motivating example, we’ll show a useful idiom that is possible with variants, the optional value. We’ll use this to create a table. The example below is just like in OCaml.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="nc">OptionalNat</span> <span class="k">=</span> <span class="o"><</span><span class="n">none</span><span class="k">:</span> <span class="kt">Unit</span><span class="o">,</span> <span class="n">some</span><span class="k">:</span> <span class="kt">Nat></span><span class="o">;</span>
<span class="nc">Table</span> <span class="k">=</span> <span class="nc">Nat</span> <span class="o">-></span> <span class="nc">OptionalNat</span><span class="o">;</span>
<span class="n">emptyTable</span> <span class="k">=</span> <span class="n">λt</span><span class="k">:</span> <span class="kt">Nat.</span> <span class="kt"><none</span><span class="o">=</span><span class="n">unit</span><span class="o">></span> <span class="n">as</span> <span class="nc">OptionalNat</span><span class="o">;</span>
<span class="n">extendTable</span> <span class="k">=</span>
<span class="n">λt</span><span class="k">:</span> <span class="kt">Table.</span> <span class="kt">λkey:</span> <span class="kt">Nat.</span> <span class="kt">λval:</span> <span class="kt">Nat.</span>
<span class="kt">λsearch:</span> <span class="kt">Nat.</span>
<span class="kt">if</span> <span class="o">(</span><span class="kt">equal</span> <span class="kt">search</span> <span class="kt">key</span><span class="o">)</span> <span class="kt">then</span> <span class="kt"><some</span><span class="o">=</span><span class="k">val</span><span class="o">></span> <span class="n">as</span> <span class="nc">OptionalNat</span>
<span class="k">else</span> <span class="o">(</span><span class="n">t</span> <span class="n">search</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The implementation works a bit like a linked list, with linear look-up. We can use the result from the table by distinguishing the outcome with a <code class="highlighter-rouge">case</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="n">x</span> <span class="k">=</span> <span class="k">case</span> <span class="n">t</span><span class="o">(</span><span class="mi">5</span><span class="o">)</span> <span class="n">of</span>
<span class="o"><</span><span class="n">none</span><span class="k">=</span><span class="n">u</span><span class="o">></span> <span class="k">=></span> <span class="mi">999</span>
<span class="o">|</span> <span class="o"><</span><span class="n">some</span><span class="k">=</span><span class="n">v</span><span class="o">></span> <span class="k">=></span> <span class="n">v</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="recursion">Recursion</h3>
<p>In STLC, all programs terminate. We’ll <a href="#strong-normalization">go into a little more detail later</a>, but the main idea is that evaluation of a well-typed program is guaranteed to halt; we say that the well-typed terms are <em>normalizable</em>.</p>
<p>Indeed, the infinite recursions from untyped lambda calculus (terms like $\text{omega}$ and $\text{fix}$) are not typable, and thus cannot appear in STLC. Since we can’t express $\text{fix}$ as a term in STLC, we can instead add it as a primitive to get recursion.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
fix t</pre></td></tr></tbody></table></code></pre></figure>
<p>We’ll need to add evaluation rules recreating its behavior, and a typing rule that restricts its use to the intended use-case.</p>
<script type="math/tex; mode=display">\begin{align}
\text{fix } (\lambda x: T_1.\ t_2) \longrightarrow \left[
x\mapsto (\text{fix }(\lambda x: T_1.\ t_2))
\right] t_2
\label{eq:e-fixbeta}\tag{E-FixBeta} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{fix }t_1 \longrightarrow \text{fix }t_1'}
\label{eq:e-fix}\tag{E-Fix} \\ \\
\frac{\Gamma\vdash t_1 : T_1 \rightarrow T_1}{\Gamma\vdash\text{fix }t_1:T_1}
\label{eq:t-fix}\tag{T-Fix}
\end{align}</script>
<p>In order for a function to be recursive, it needs to map a type to the same type, hence the restriction to $T_1 \rightarrow T_1$. The type $T_1$ will itself be a function type if we’re doing a recursion. Still, note that the type system doesn’t enforce this; there will actually be situations in which it is handy to use something other than a function type inside a fix operator.</p>
<p>Seeing that this fixed-point notation can be a little involved, we can introduce some nice syntactic sugar to work with it:</p>
<script type="math/tex; mode=display">\text{letrec } x: T_1 = t_1 \text{ in } t_2
\quad \defeq \quad
\text{let } x = \text{fix } (\lambda x: T_1.\ t_1) \text{ in } t_2</script>
<p>Here, $t_1$ can refer to $x$; that’s the convenience offered by the construct. Although we don’t strictly need to introduce typing rules (it’s syntactic sugar, so we’re relying on existing constructs), a typing rule for this could be:</p>
<script type="math/tex; mode=display">\frac{\Gamma\cup(x:T_1)\vdash t_1:T_1 \quad \Gamma\cup(x: T_1)\vdash t_2:T_2}{\Gamma\vdash\text{letrec } x: T_1 = t_1 \text{ in } t_2:T_2}</script>
<p>In Scala, a common error message is that a recursive function needs an explicit return type, for the same reasons as the typing rule above.</p>
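<p>The behavior of $\ref{eq:e-fixbeta}$ can be sketched in Scala; note the eta-expansion, which delays the unfolding so that the definition terminates in a strict language (my own illustration, not part of the formal development):</p>

```scala
// fix f is a fixed point of f: fix(f) behaves like f(fix(f)).
// The wrapper `x => ...` delays the recursive unfolding until application.
def fix[A, B](f: (A => B) => (A => B)): A => B =
  x => f(fix(f))(x)

// letrec factorial: Int => Int = ... , written with fix:
val factorial: Int => Int =
  fix[Int, Int](fact => n => if (n == 0) 1 else n * fact(n - 1))
```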
<h3 id="references">References</h3>
<h4 id="mutability">Mutability</h4>
<p>In most programming languages, variables are (or can be) mutable. That is, variables can provide a name referring to a previously calculated value, as well as a way of overwriting this value with another (under the same name). How can we model this in STLC?</p>
<p>Some languages (e.g. OCaml) actually formally separate variables from mutation. In OCaml, variables are only for naming: the binding between a variable and a value is immutable. However, there is the concept of <em>mutable values</em>, also called <em>reference cells</em> or <em>references</em>. This is the style we’ll study, as it is easier to work with formally. A mutable value is represented at the type level as a <code class="highlighter-rouge">Ref T</code> (or perhaps even a <code class="highlighter-rouge">Ref(Option T)</code>, since a null pointer cannot produce a value).</p>
<p>The basic operations are allocation with the <code class="highlighter-rouge">ref</code> operator, dereferencing with <code class="highlighter-rouge">!</code> (in C, we use the <code class="highlighter-rouge">*</code> prefix), and assignment with <code class="highlighter-rouge">:=</code>, which updates the content of the reference cell. Assignment returns a <code class="highlighter-rouge">unit</code> value.</p>
<h4 id="aliasing">Aliasing</h4>
<p>Two variables can reference the same cell: we say that they are <em>aliases</em> for the same cell. Aliasing is when we have different references (under different names) to the same cell. Modifying the value of the reference cell through one alias modifies the value for all other aliases.</p>
<p>The possibility of aliasing is all around us: in object references, explicit pointers (in C), arrays, communication channels, I/O devices; there’s practically no way around it. Yet alias analysis is complex and costly, and often makes it hard for compilers to do the optimizations they would like to do.</p>
<p>With mutability, the order of operations now matters; <code class="highlighter-rouge">r := 1; r := 2</code> isn’t the same as <code class="highlighter-rouge">r := 2; r := 1</code>. If we recall the <a href="#confluence-in-full-beta-reduction">Church-Rosser theorem</a>, we’ve lost the principle that all reduction paths lead to the same result. Therefore, some language designers disallow mutability (e.g. Haskell). But there are benefits to allowing it, too: efficiency, dependency-driven data flow (e.g. in GUIs), shared resources for concurrency (locks), etc. Therefore, most languages provide it.</p>
<p>Still, languages without mutability have come up with a bunch of abstractions that allow us to have some of the benefits of mutability, like monads and lenses.</p>
<h4 id="typing-rules-1">Typing rules</h4>
<p>We’ll introduce references as a type <code class="highlighter-rouge">Ref T</code> to represent a variable of type <code class="highlighter-rouge">T</code>. We can construct a reference as <code class="highlighter-rouge">r = ref 5</code>, and access the contents of the reference using <code class="highlighter-rouge">!r</code> (this would return <code class="highlighter-rouge">5</code> instead of <code class="highlighter-rouge">ref 5</code>).</p>
<p>Let’s define references in our language:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>t ::= // terms
unit // unit constant
x // variable
λx: T. t // abstraction
t t // application
ref t // reference creation
!t // dereference
t := t // assignment</pre></td></tr></tbody></table></code></pre></figure>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t_1 : T_1}{\Gamma\vdash \text{ref } t_1 : \text{Ref } T_1}
\label{eq:t-ref}\tag{T-Ref} \\ \\
\frac{\Gamma\vdash t_1: \text{Ref } T_1}{\Gamma\vdash !t_1 : T_1}
\label{eq:t-deref}\tag{T-Deref} \\ \\
\frac{\Gamma\vdash t_1 : \text{Ref } T_1 \quad \Gamma\vdash t_2: T_1}{\Gamma\vdash t_1 := t_2 : \text{Unit}}
\label{eq:t-assign}\tag{T-Assign} \\ \\
\end{align}</script>
<h4 id="evaluation-1">Evaluation</h4>
<p>What is the <em>value</em> of <code class="highlighter-rouge">ref 0</code>? The crucial observation is that evaluating <code class="highlighter-rouge">ref 0</code> must <em>do</em> something. Otherwise, the following two snippets would behave the same:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>

<span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">r</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Evaluating <code class="highlighter-rouge">ref 0</code> should allocate some storage, and return a reference (or pointer) to that storage. A reference names a location in the <strong>store</strong> (also known as the <em>heap</em>, or just <em>memory</em>). Concretely, the store could be an array of 8-bit bytes, indexed by 32-bit integers. More abstractly, it’s an array of values, or even more abstractly, a partial function from locations to values.</p>
<p>We can introduce this idea of locations in our syntax. This syntax is exactly the same as the previous one, but adds the notion of locations:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="code"><pre>v ::= // values
unit // unit constant
λx: T. t // abstraction value
l // store location

t ::= // terms
unit // unit constant
x // variable
λx: T. t // abstraction
t t // application
ref t // reference creation
!t // dereference
t := t // assignment
l // store location </pre></td></tr></tbody></table></code></pre></figure>
<p>This doesn’t mean that we’ll allow programmers to write explicit locations in their programs. We just use this as a modeling trick; we’re enriching the internal language to include some run-time structures.</p>
<p>With this added notion of stores and locations, the result of an evaluation now depends on the store in which it is evaluated, which we need to reflect in our evaluation rules. Evaluation must now include terms $t$ <strong>and</strong> store $\mu$:</p>
<script type="math/tex; mode=display">t \mid \mu \longrightarrow t' \mid \mu'</script>
<p>Let’s take a look for the evaluation rules for STLC with references, operator by operator.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1'\mid\mu'}{t_1 := t_2 \mid \mu \longrightarrow t_1' := t_2 \mid \mu'}
\label{eq:e-assign1}\tag{E-Assign1} \\ \\
\frac{t_2 \mid \mu \longrightarrow t_2'\mid\mu'}{t_1 := t_2 \mid \mu \longrightarrow t_1 := t_2' \mid \mu'}
\label{eq:e-assign2}\tag{E-Assign2} \\ \\
l := v_2 \mid \mu \longrightarrow \text{unit}\mid[l\mapsto v_2]\mu
\label{eq:e-assign}\tag{E-Assign} \\ \\
\end{align}</script>
<p>The assignments $\ref{eq:e-assign1}$ and $\ref{eq:e-assign2}$ evaluate terms until they become values. When both have been reduced, we can do the actual assignment: as per $\ref{eq:e-assign}$, we update the store and return <code class="highlighter-rouge">unit</code>.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1' \mid \mu'}{\text{ref } t_1 \mid \mu \longrightarrow \text{ref } t_1' \mid \mu'}
\label{eq:e-ref}\tag{E-Ref} \\ \\
\frac{l \notin \text{dom}(\mu)}{\text{ref } v_1 \mid \mu \longrightarrow l \mid (\mu \cup (l\mapsto v_1))}
\label{eq:e-refv}\tag{E-RefV}
\end{align}</script>
<p>A reference $\text{ref }t_1$ first evaluates $t_1$ until it is a value ($\ref{eq:e-ref}$). To evaluate the reference operator itself ($\ref{eq:e-refv}$), we find a fresh location $l$ in the store, bind $v_1$ to it, and return the location $l$.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1' \mid \mu'}{!t_1 \mid \mu \longrightarrow !t_1' \mid \mu'}
\label{eq:e-deref}\tag{E-Deref} \\ \\
\frac{\mu(l) = v}{!l\mid\mu \longrightarrow v\mid\mu}
\label{eq:e-derefloc}\tag{E-DerefLoc}
\end{align}</script>
<p>We find the same congruence rule as usual in $\ref{eq:e-deref}$, where a term $!t_1$ first evaluates $t_1$ until it is a value. Once it has been reduced to a location, we look up and return the value at that location in the current store using $\ref{eq:e-derefloc}$.</p>
<p>The evaluation rules for abstraction and application are augmented with stores, but otherwise unchanged.</p>
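<p>As a sketch of how these rules fit together, here is a small evaluator for just the reference fragment (no abstraction or application), with the store $\mu$ represented as a Python dict from locations to values; the tuple encoding of terms is our own:</p>

```python
import itertools

_fresh = itertools.count()  # source of fresh locations l ∉ dom(μ)

def eval_term(t, store):
    """Evaluate t in store μ; returns (value, new store), as in t|μ → t'|μ'."""
    op = t[0]
    if op in ("unit", "loc"):          # values evaluate to themselves
        return t, store
    if op == "ref":                    # E-Ref, then E-RefV
        v, store = eval_term(t[1], store)
        l = next(_fresh)               # fresh location, bound to v
        return ("loc", l), {**store, l: v}
    if op == "deref":                  # E-Deref, then E-DerefLoc
        (_, l), store = eval_term(t[1], store)
        return store[l], store
    if op == "assign":                 # E-Assign1, E-Assign2, then E-Assign
        (_, l), store = eval_term(t[1], store)
        v, store = eval_term(t[2], store)
        return ("unit",), {**store, l: v}
    raise ValueError(f"unknown term {op}")
```

<p>For instance, evaluating <code class="highlighter-rouge">("deref", ("ref", ("unit",)))</code> in an empty store allocates a fresh location, then immediately reads <code class="highlighter-rouge">("unit",)</code> back out of it.</p>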
<h4 id="store-typing">Store typing</h4>
<p>What is the type of a location? The answer to this depends on what is in the store. Unless we specify it, a store could contain anything at a given location, which is problematic for typechecking. The solution is to type the locations themselves. This leads us to a typed store:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mu = (& l_1 \mapsto \text{Nat}, \\
& l_2 \mapsto \lambda x: \text{Unit}. x)
\end{align} %]]></script>
<p>As a first attempt at a typing rule, we can just say that the type of a location is given by the type of the value in the store at that location:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash\mu(l) : T_1}{\Gamma\vdash l : \text{Ref } T_1}</script>
<p>This is problematic though; in the following, the typing derivation for $!l_2$ would be infinite because we have a cyclic reference:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mu =\ (& l_1 \mapsto \lambda x: \text{Nat}.\ !l_2\ x, \\
& l_2 \mapsto \lambda x: \text{Nat}.\ !l_1\ x)
\end{align} %]]></script>
<p>The core of the problem here is that we would need to recompute the type of a location every time. But that shouldn’t be necessary. Seeing that references are strongly typed as <code class="highlighter-rouge">Ref T</code>, we know exactly what type of value we can place in a given store location. Indeed, the typing rules we chose for references guarantee that a given location in the store is always used to hold values of the same type.</p>
<p>So to fix this problem, we need to introduce a <strong>store typing</strong>. This is a partial function from locations to types, which we’ll denote by $\Sigma$.</p>
<p>Suppose we’re given a store typing $\Sigma$ describing the store $\mu$. We can use $\Sigma$ to look up the types of locations, without doing a lookup in $\mu$:</p>
<script type="math/tex; mode=display">\frac{\Sigma(l) = T_1}{\Gamma\mid\Sigma\vdash l : \text{Ref } T_1}
\label{eq:t-loc}\tag{T-Loc}</script>
<p>This tells us how to check the store typing, but how do we create it? We can start with an empty typing $\Sigma = \emptyset$, and add a typing relation with the type of $v_1$ when a new location is created during evaluation of $\ref{eq:e-refv}$.</p>
<p>The rest of the typing rules remain the same, but are augmented with the store typing. So in conclusion, we have updated our evaluation rules with a <em>store</em> $\mu$, and our typing rules with a <em>store typing</em> $\Sigma$.</p>
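<p>To see how the store typing is used, here is a sketch of a checker for the reference fragment. Note that $\ref{eq:t-loc}$ consults $\Sigma$ (<code class="highlighter-rouge">sigma</code>) rather than the store itself, which is exactly what breaks the cycle in the example above. The tuple encoding of terms and types is our own:</p>

```python
def typecheck(t, ctx, sigma):
    """Type a term under context Γ (ctx) and store typing Σ (sigma).
    Types are ("Unit",) or ("Ref", T); raises TypeError on ill-typed terms."""
    op = t[0]
    if op == "unit":
        return ("Unit",)
    if op == "var":
        return ctx[t[1]]
    if op == "loc":                      # T-Loc: look up Σ, not the store
        return ("Ref", sigma[t[1]])
    if op == "ref":                      # T-Ref
        return ("Ref", typecheck(t[1], ctx, sigma))
    if op == "deref":                    # T-Deref
        ty = typecheck(t[1], ctx, sigma)
        if ty[0] != "Ref":
            raise TypeError("dereferencing a non-reference")
        return ty[1]
    if op == "assign":                   # T-Assign: t2 must match the cell type
        ty1 = typecheck(t[1], ctx, sigma)
        ty2 = typecheck(t[2], ctx, sigma)
        if ty1 != ("Ref", ty2):
            raise TypeError("assignment type mismatch")
        return ("Unit",)
    raise ValueError(f"unknown term {op}")
```

<p>Since locations are typed by a single lookup in $\Sigma$, even mutually referencing cells pose no problem: there is no recursion through the store.</p>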
<h4 id="safety">Safety</h4>
<p>Let’s take a look at progress and preservation in this new type system. Preservation turns out to be more interesting, so let’s look at that first.</p>
<p>We’ve added a store and a store typing, so we need to add both to the statement of preservation. Naively, we’d write:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma\vdash t: T \text{ and }
t\mid\mu\longrightarrow t'\mid\mu'
\quad \implies \quad
\Gamma\mid\Sigma\vdash t': T</script>
<p>But this would be wrong! In this statement, $\Sigma$ and $\mu$ would not be constrained to be correlated at all, which they need to be. This constraint can be defined as follows:</p>
<p>A store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$ (which we denote by $\Gamma\mid\Sigma\vdash\mu$) if the following is satisfied:</p>
<script type="math/tex; mode=display">\text{dom}(\mu) = \text{dom}(\Sigma)
\quad \text{and} \quad
\Gamma\mid\Sigma\vdash\mu(l) : \Sigma(l),\ \forall l\in\text{dom}(\mu)</script>
<p>This gets us closer, and we can write the following preservation statement:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma \vdash t : T \text{ and }
t\mid\mu \longrightarrow t'\mid\mu' \text{ and }
\Gamma\mid\Sigma \vdash \mu
\quad \implies \quad
\Gamma\mid\Sigma\vdash t' : T</script>
<p>But this is still wrong! When we create a new cell with $\ref{eq:e-refv}$, we would break the correspondence between store typing and store.</p>
<p>The correct version of the preservation theorem is the following:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma \vdash t : T \text{ and }
t\mid\mu \longrightarrow t'\mid\mu' \text{ and }
\Gamma\mid\Sigma \vdash \mu
\quad \implies \quad
\text{for some } \Sigma' \supseteq \Sigma, \;\;
\Gamma\mid\Sigma'\vdash t' : T</script>
<p>This preservation theorem asserts that there is <em>some</em> store typing $\Sigma’ \supseteq \Sigma$ (agreeing with $\Sigma$ on the types of all old locations, though it may also add new locations), such that $t’$ is well typed in $\Sigma’$.</p>
<p>The progress theorem must also be extended with stores and store typings:</p>
<p>Suppose that $t$ is a closed, well-typed term; that is, $\emptyset\mid\Sigma\vdash t: T$ for some type $T$ and some store typing $\Sigma$. Then either $t$ is a value or else, for any store $\mu$ such that $\emptyset\mid\Sigma\vdash\mu$<sup id="fnref:well-typed-store-notation"><a href="#fn:well-typed-store-notation" class="footnote">2</a></sup>, there is some term $t’$ and store $\mu’$ with $t\mid\mu \longrightarrow t’\mid\mu’$.</p>
<h2 id="type-reconstruction-and-polymorphism">Type reconstruction and polymorphism</h2>
<p>In type checking, we wanted to, given $\Gamma$, $t$ and $T$, check whether $\Gamma\vdash t: T$. So far, for type checking to take place, we required explicit type annotations.</p>
<p>In this section, we’ll look into <strong>type reconstruction</strong>, which allows us to infer types when type annotations aren’t present: given $\Gamma$ and $t$, we want to find a type $T$ such that $\Gamma\vdash t:T$.</p>
<p>Immediately, we can see potential problems with this idea:</p>
<ul>
<li>Abstractions without the parameter type annotation seem complicated to reconstruct (a parameter could almost have any type)</li>
<li>A term can have many types</li>
</ul>
<p>To solve these problems, we’ll introduce polymorphism into our type system.</p>
<h3 id="constraint-based-typing-algorithm">Constraint-based Typing Algorithm</h3>
<p>The idea is to split the work in two: first, we want to generate and record constraints, and then, unify them (that is, attempt to satisfy the constraints).</p>
<p>In the following, we’ll denote constraints as a set of equations $\set{T_i \hat{=} U_i}_{i=1, \dots, m}$, constraining type variables $T_i$ to actual types $U_i$.</p>
<h4 id="constraint-generation">Constraint generation</h4>
<p>The constraint generation algorithm can be described as the following function $TP: \text{Judgement } \rightarrow \text{Equations}$</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& TP(\Gamma\vdash t: T) & = \quad & \text{case } t \text{ of} & & \\
& & x & : \quad & \set{\Gamma(x) \hat{=} T} & \\ \\
& & \lambda x.\ t' & : \quad & \text{let } a, b \text{ fresh in} & \\
& & & & \set{(a \rightarrow b) \hat{=} T} \cup TP(\Gamma\cup(x: a)\vdash t': b) & \\ \\
& & t\ t' & : \quad & \text{let } a \text{ fresh in} & \\
& & & & TP(\Gamma\vdash t: a \rightarrow T) \cup TP(\Gamma\vdash t':a) & \\
\end{align} %]]></script>
<p>This creates a set of constraints between type variables and the expected types.</p>
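<p>A direct transcription of this $TP$ function into Python might look as follows; the tuple encodings <code class="highlighter-rouge">("x", name)</code>, <code class="highlighter-rouge">("lam", x, body)</code>, <code class="highlighter-rouge">("app", f, arg)</code> for terms and <code class="highlighter-rouge">("var", n)</code>, <code class="highlighter-rouge">("arrow", T, U)</code> for types are our own:</p>

```python
import itertools

_fresh = itertools.count()

def fresh():
    """A fresh type variable, represented as ("var", n)."""
    return ("var", next(_fresh))

def TP(gamma, t, T):
    """Generate the constraint set for the judgement Γ ⊢ t : T.
    Constraints are pairs (T, U), meaning T ≐ U."""
    op = t[0]
    if op == "x":                         # variable: Γ(x) ≐ T
        return {(gamma[t[1]], T)}
    if op == "lam":                       # λx. t': (a → b) ≐ T, plus body at b
        a, b = fresh(), fresh()
        return {(("arrow", a, b), T)} | TP({**gamma, t[1]: a}, t[2], b)
    if op == "app":                       # t t': function at (a → T), argument at a
        a = fresh()
        return TP(gamma, t[1], ("arrow", a, T)) | TP(gamma, t[2], a)
    raise ValueError(f"unknown term {op}")
```

<p>For the identity $\lambda x.\ x$ typed against a goal variable, this yields two constraints: one forcing the goal to be an arrow type $a \rightarrow b$, and one forcing $a \hat{=} b$.</p>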
<h4 id="soundness-and-completeness">Soundness and completeness</h4>
<p>In general a type reconstruction algorithm $\mathcal{A}$ assigns to an environment $\Gamma$ and a term $t$ a set of types $\mathcal{A}(\Gamma, t)$.</p>
<p>The algorithm is <strong>sound</strong> if for every type $T\in \mathcal{A}(\Gamma, t)$ we can prove the judgment $\Gamma\vdash t: T$.</p>
<p>The algorithm is <strong>complete</strong> if for every provable judgment $\Gamma\vdash t: T$ we have $T\in\mathcal{A}(\Gamma, t)$.</p>
<p>Soundness and completeness are the two directions of the following implication:</p>
<script type="math/tex; mode=display">\text{the algorithm can prove it} \iff \text{it holds}</script>
<p>Soundness and completeness are about the $\Leftarrow$ and $\Rightarrow$ directions of the above, respectively. The TP function we defined previously for STLC is sound and complete, and the relationship is thus $\iff$. We can write this mathematically as follows:</p>
<script type="math/tex; mode=display">\Gamma\vdash t: T \iff \exists \bar{b} \text{ s.t. } [T / a] EQNS</script>
<p>Where:</p>
<ul>
<li>$a$ is a new type variable</li>
<li>$EQNS = TP(\Gamma\vdash t: a)$ is the set of type constraints</li>
<li>$\bar{b} = \text{tv}(EQNS)\setminus\text{tv}(\Gamma)$, where $\text{tv}$ denotes the set of free type variables.</li>
<li>$[T / a] EQNS$ is notation for replacing $a$ with $T$ in $EQNS$</li>
</ul>
<p>What this means is that the judgement $\Gamma\vdash t: T$ is provable if and only if there is some assignment to the type variables $\bar{b}$ under which all the equations in $[T / a] EQNS$ hold. In other words, the constraints generated by $TP$ characterize exactly the types $T$ that can be derived for $t$.</p>
<h4 id="substitutions">Substitutions</h4>
<p>Now that we’ve generated constraints in the form $\set{T_i\ \hat{=}\ U_i}_{i=1, \dots, m}$, we’d like a way to substitute these constraints into real types. We must generate a set of substitutions:</p>
<script type="math/tex; mode=display">\set{a_j \mapsto T_j'}_{j=1, \dots, n}</script>
<p>These substitutions cannot be cyclic: a type variable may not appear (directly or indirectly) in the types on its right-hand side. We can write this requirement as:</p>
<script type="math/tex; mode=display">a_j \notin \text{tv}(T_k') \quad \text{for } j=1,\dots, n, \ k = j, \dots, n</script>
<p>This substitution is an idempotent mapping from type variables to types, mapping all but a finite number of type variables to themselves. We can think of a substitution as a set of equations:</p>
<script type="math/tex; mode=display">\set{a\ \hat{=}\ T}, \quad a \notin \text{tv}(T)</script>
<p>Alternatively, we can think of it as a function transforming types (based on the set of equations). Substitution is applied in a straightforward way:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
s(X) & = \begin{cases}
T & \text{if } (X \mapsto T) \in s \\
X & \text{otherwise}
\end{cases} \\
s(\text{Nat}) & = \text{Nat} \\
s(\text{Bool}) & = \text{Bool} \\
s(T \rightarrow U) & = sT \rightarrow sU \\
\end{align} %]]></script>
<p>Substitution has two properties:</p>
<ul>
<li><strong>Idempotence</strong>: $s(s(T)) = s(T)$</li>
<li><strong>Composition</strong>: the composition $f \circ g$ of two substitutions, defined by $(f \circ g)\ x = f(g\ x)$, is also a substitution</li>
</ul>
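<p>Applying a substitution could be sketched as follows, with substitutions as dicts from type-variable names to types, and types encoded as tuples such as <code class="highlighter-rouge">("var", a)</code> and <code class="highlighter-rouge">("arrow", T, U)</code> (the encoding is our own):</p>

```python
def apply_subst(s, T):
    """Apply substitution s to a type, structurally:
    s(T → U) = sT → sU, and base types map to themselves."""
    if T[0] == "var":
        return s.get(T[1], T)  # X ↦ sX if mapped, else X itself
    if T[0] == "arrow":
        return ("arrow", apply_subst(s, T[1]), apply_subst(s, T[2]))
    return T                   # base types: Nat, Bool, ...
```

<p>Idempotence then corresponds to the acyclicity requirement above: since no mapped variable reappears on a right-hand side, applying <code class="highlighter-rouge">apply_subst</code> a second time changes nothing.</p>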
<h4 id="unification">Unification</h4>
<p>We present a unification algorithm based on Robinson’s 1965 unification algorithm:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \text{mgu} & :\quad & (\text{Type }\hat{=}\text{ Type})\rightarrow\text{Subst}\rightarrow\text{Subst} \\ \\
& \text{mgu}(T\ \hat{=}\ U)\ s
& =\ & \text{mgu}'(sT\ \hat{=}\ sU)\ s
\\ \\
& \text{mgu}'(a\ \hat{=}\ a)\ s
& =\ & s
\\
& \text{mgu}'(a\ \hat{=}\ T)\ s
& =\ & s \cup \set{a \mapsto T}
\quad \text{if } a\notin \text{tv}(T) \\
& \text{mgu}'(T\ \hat{=}\ a)\ s
& =\ & s \cup \set{a \mapsto T}
\quad \text{if } a\notin \text{tv}(T) \\
& \text{mgu}'(T\rightarrow T'\ \hat{=}\ U\rightarrow U')\ s
& =\ & (\text{mgu}(T'\ \hat{=}\ U')\circ\text{mgu}(T\ \hat{=}\ U))\ s
\\
& \text{mgu}'(K[T_1, \dots, T_n]\ \hat{=}\ K[U_1, \dots, U_n])\ s
& =\ & (\text{mgu}(T_n\ \hat{=}\ U_n)\circ\dots\circ\text{mgu}(T_1\ \hat{=}\ U_1))\ s
\\ \\
& \text{mgu}(T\ \hat{=}\ U)\ s
& =\ & \text{error}
\quad \text{otherwise} \\
\end{align} %]]></script>
<p>This function is called $\text{mgu}$, which stands for most general unifier.</p>
<p>A substitution $u$ is a <strong>unifier</strong> of a set of equations $\set{T_i\ \hat{=}\ U_i}$ if $uT_i = uU_i,\, \forall i$. This means that it can find an assignment to the type variables in the constraints so that all equations are trivially true.</p>
<p>The substitution is a <strong>most general unifier</strong> if for every other unifier $u’$ of the same equations, there exists a substitution $s$ such that $u’ = s\circ u$. In other words, it must be less specific (or more general) than all other unifiers.</p>
<p>We won’t prove this, but just state it as a theorem: if we are given a set of constraints $\text{EQNS}$ which has a unifier, then $\text{mgu}\ \text{EQNS}\ \set{}$ (that is, $\text{mgu}$ applied to the constraints, starting from the empty substitution) computes the most general unifier of the constraints. If the constraints do not have a unifier, it fails.</p>
<p>In other words, the TP function is sound and complete.</p>
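<p>Here is a sketch of such a unifier in Python, in the destructive style common in implementations rather than the purely functional presentation above; the tuple encoding of types is our own. The side conditions $a\notin\text{tv}(T)$ appear as the <em>occurs check</em>:</p>

```python
def occurs(a, T):
    """Check a ∈ tv(T): does type variable a occur in type T?"""
    if T[0] == "var":
        return T[1] == a
    if T[0] == "arrow":
        return occurs(a, T[1]) or occurs(a, T[2])
    return False

def subst(s, T):
    """Apply substitution s to T, following chains of bindings."""
    if T[0] == "var":
        return subst(s, s[T[1]]) if T[1] in s else T
    if T[0] == "arrow":
        return ("arrow", subst(s, T[1]), subst(s, T[2]))
    return T

def unify(T, U, s):
    """Extend s in place so that it unifies T and U, or raise TypeError."""
    T, U = subst(s, T), subst(s, U)
    if T == U:                           # mgu'(a ≐ a), or identical base types
        return
    if T[0] == "var" and not occurs(T[1], U):
        s[T[1]] = U                      # mgu'(a ≐ T), with occurs check
    elif U[0] == "var" and not occurs(U[1], T):
        s[U[1]] = T                      # mgu'(T ≐ a)
    elif T[0] == "arrow" and U[0] == "arrow":
        unify(T[1], U[1], s)             # unify arguments...
        unify(T[2], U[2], s)             # ...then results, under the updated s
    else:
        raise TypeError(f"cannot unify {T} and {U}")

def mgu(eqs):
    """Most general unifier of a list of equations [(T, U), ...]."""
    s = {}
    for T, U in eqs:
        unify(T, U, s)
    return s
```

<p>The occurs check is what rejects self-application: unifying $a$ with $a\rightarrow \text{Nat}$ fails rather than producing an infinite type.</p>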
<h4 id="strong-normalization">Strong normalization</h4>
<p>With this type inference in place, we might be tempted to run it on the diverging $\Omega$ that <a href="#recursion-in-lambda-calculus">we defined much earlier</a>, or perhaps on the <a href="#recursion-in-lambda-calculus">Y combinator</a>. But as we said before, self-application is not typable. In fact, we can state a stronger assertion:</p>
<p><strong>Strong Normalization Theorem</strong>: if $\vdash t: T$, then there is a value $V$ such that $t \longrightarrow^* V$.</p>
<p>In other words, if we can type it, it reduces to a value. In the case of the infinite recursion, we cannot type it, and it does not evaluate to a value (instead, it diverges). So looping infinitely isn’t possible in STLC, which leads us to the corollary of this theorem: <strong>STLC is not Turing complete</strong>.</p>
<h3 id="polymorphism">Polymorphism</h3>
<p>There are multiple forms of polymorphism:</p>
<ul>
<li><strong>Universal polymorphism</strong> (aka <em>generic types</em>): the ability to instantiate type variables</li>
<li><strong>Inclusion polymorphism</strong> (aka <em>subtyping</em>): the ability to treat a value of a subtype as a value of one of its supertypes</li>
<li><strong>Ad-hoc</strong> (aka <em>overloading</em>): the ability to define several versions of the same function name with different types.</li>
</ul>
<p>We’ll concentrate on universal polymorphism, of which there are two variants: explicit and implicit.</p>
<h4 id="explicit-polymorphism">Explicit polymorphism</h4>
<p>In STLC, a term can have many types, but a variable or parameter only has one type. With polymorphism, we open this up: we allow functions to be applied to arguments of many types.</p>
<p>To do this, we introduce a new polymorphic type $\forall a.T$, which can be used as any other type. The typing rules are:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t: \forall a.T}{\Gamma\vdash t[U] : [a \mapsto U] T}
\label{eq:polymorphic-app}\tag{$\forall$E} \\ \\
\frac{\Gamma\vdash t: T}{\Gamma\vdash\Lambda a.t : \forall a.T}
\label{eq:polymorphic-abs}\tag{$\forall$I} \\ \\
\end{align}</script>
<p>The $\Lambda$ symbol represents a type abstraction. It corresponds to <code class="highlighter-rouge">[T]</code> or <code class="highlighter-rouge">&lt;T&gt;</code> in most programming languages. For instance, the signature of <code class="highlighter-rouge">map</code> could be written as follows in Scala:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">map</span><span class="o">[</span><span class="kt">A</span><span class="o">,</span> <span class="kt">B</span><span class="o">](</span><span class="n">f</span><span class="k">:</span> <span class="kt">A</span> <span class="o">=></span> <span class="n">B</span><span class="o">)(</span><span class="n">xs</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">A</span><span class="o">])</span> <span class="k">=</span> <span class="o">...</span></pre></td></tr></tbody></table></code></pre></figure>
<p>In lambda calculus we’d write:</p>
<script type="math/tex; mode=display">\Lambda X.\ \Lambda Y.\ \lambda f: X\rightarrow Y.\ \lambda xs: List[X].\ \dots</script>
<h4 id="implicit-polymorphism">Implicit polymorphism</h4>
<p>Implicit polymorphism does not require annotations for parameter types. The idea is that inference treats unannotated terms as polymorphic types. To have this feature, we must introduce the notion of <strong>type schemes</strong>. These are not fully general types, but are an internal construct used to type named values (<code class="highlighter-rouge">val</code> or <code class="highlighter-rouge">let ... in ...</code> statements). A type scheme has the following syntax:</p>
<script type="math/tex; mode=display">S ::= T \mid \forall a. S</script>
<p>This feature is called implicit polymorphism or let-polymorphism. The resulting type system is called the Hindley/Milner system. Its typing rules are:</p>
<script type="math/tex; mode=display">\begin{align}
\Gamma \cup (x: S) \cup \Gamma' \vdash x: S, \quad x\notin\text{dom}(\Gamma')
\label{eq:hm-var}\tag{Var} \\ \\
\frac{\Gamma\vdash t: \forall a. T}{\Gamma\vdash t: [a \mapsto U]T}
\label{eq:hm-forall-e}\tag{$\forall E$} \\ \\
\frac{\Gamma\vdash t: T \quad a\notin \text{tv}(\Gamma)}{\Gamma\vdash t: \forall a.T}
\label{eq:hm-forall-i}\tag{$\forall I$} \\ \\
\frac{\Gamma\vdash t: S \quad \Gamma\cup(x: S)\vdash t' : T}
{\Gamma \vdash \text{let } x = t \text{ in } t': T}
\label{eq:hm-let}\tag{Let} \\ \\
\frac{\Gamma\cup(x: T)\vdash t: U}{\Gamma\vdash\lambda x. t: T \rightarrow U}
\label{eq:hm-arrow-i}\tag{$\rightarrow I$} \\ \\
\frac{\Gamma\vdash t_1: T \rightarrow U \quad \Gamma\vdash t_2: T}{\Gamma\vdash t_1\ t_2: U}
\label{eq:hm-arrow-e}\tag{$\rightarrow E$} \\
\end{align}</script>
<p>$\ref{eq:hm-var}$ means that we can verify $x: S$ if $(x: S)$ is in the environment and it isn’t overwritten later (in $\Gamma’$). This allows us to have some concept of scoping of variables.</p>
<p>$\ref{eq:hm-forall-e}$ allows to verify specific instances of a polymorphic type, and $\ref{eq:hm-forall-i}$ allows to generalize to a polymorphic type (with a hygiene condition telling us that the type variable we choose isn’t already in the environment).</p>
<p>$\ref{eq:hm-let}$ is fairly straightforward. $\ref{eq:hm-arrow-i}$ and $\ref{eq:hm-arrow-e}$ are simply as in STLC.</p>
<h4 id="alternative-hindley-milner">Alternative Hindley Milner</h4>
<p>A let-in statement can be regarded as shorthand for a substitution:</p>
<script type="math/tex; mode=display">\text{let } x = t \text{ in } t'
\quad \equiv \quad
[x\mapsto t] t'</script>
<p>We can use this to get a revised Hindley/Milner system which we call HM’, where $\ref{eq:hm-let}$ is replaced by the following:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash t: T \quad \Gamma\vdash [x\mapsto t] t' : U}
{\Gamma \vdash \text{let } x = t \text{ in } t': U}
\label{eq:hm-let-prime}\tag{Let'}</script>
<p>In essence, this only changes the typing rule for <code class="highlighter-rouge">let</code> so that it performs a step of evaluation before calculating the types. This is equivalent to the previous HM system; we’ll state that as a theorem, without proof.</p>
<p><strong>Theorem</strong>: $\Gamma\vdash_{\text{HM}} t: S \iff \Gamma\vdash_{\text{HM}’} t: S$</p>
<p>The corollary to this theorem is that, if we let $t^*$ be the result of expanding all <code class="highlighter-rouge">let</code>s in $t$ using the substitution above, then:</p>
<script type="math/tex; mode=display">\Gamma\vdash_{\text{HM}} t: T \Longrightarrow \Gamma\vdash_{F_1} t^* : T</script>
<p>The converse is true if every let-bound name is used at least once:</p>
<script type="math/tex; mode=display">\Gamma\vdash_{\text{HM}} t: T \Longleftarrow \Gamma\vdash_{F_1} t^* : T</script>
<h3 id="principal-types">Principal types</h3>
<p>A type $T$ is a <strong>generic instance</strong> of a type scheme $S = \forall \alpha_1, \dots, \forall \alpha_n. T’$ if there is a substitution $s$ on $\alpha_1, \dots, \alpha_n$ such that $T = sT’$. In this case, we write $S \le T$.</p>
<p>A type scheme $S’$ is a <strong>generic instance</strong> of a type scheme $S$ iff for all types $T$:</p>
<script type="math/tex; mode=display">S' \le T \implies S \le T</script>
<p>In this case, we write $S \le S’$.</p>
<p>A type scheme $S$ is <strong>principal</strong> (or <em>most general</em>) for $\Gamma$ and $t$ iff:</p>
<ul>
<li>$\Gamma\vdash t: S$</li>
<li>$\Gamma\vdash t: S’ \implies S \le S’$</li>
</ul>
<p>A type system TS has the <strong>principal typing property</strong> iff, whenever $\Gamma\vdash_{\text{TS}} t: S$, there exists a principal type scheme for $\Gamma$ and $t$.</p>
<p>In other words, a type system with principal types is one where the type engine doesn’t make any choices; it always finds the most general solution. The type checker may fail if it cannot advance without making a choice (e.g. for $\lambda x. x+x$, where the typechecker would have to choose between $\text{Int} \rightarrow \text{Int}$, $\text{Float} \rightarrow \text{Float}$, etc).</p>
<p>The following can be stated as a theorem:</p>
<ol>
<li>HM’ without <code class="highlighter-rouge">let</code> has the principal typing property</li>
<li>HM’ with <code class="highlighter-rouge">let</code> has the principal typing property</li>
<li>HM has the principal typing property</li>
</ol>
<h2 id="subtyping">Subtyping</h2>
<h3 id="motivation">Motivation</h3>
<p>Under $\ref{eq:t-app}$, the following is not well typed:</p>
<script type="math/tex; mode=display">(\lambda r: \set{x: \text{Nat}}.\ r.x)\ \set{x=0, y=1}</script>
<p>We’re passing a record to a function that selects its <code class="highlighter-rouge">x</code> member. This is not well typed, but would still evaluate just fine; after all, we’re passing the function a <em>better</em> argument than it needs.</p>
<p>In general, we’d like to be able to define hierarchies of classes, with descendants having richer interfaces. These should still be usable instead of their ancestors. We solve this using subtyping.</p>
<p>We achieve this by introducing a subtyping relation $S <: T$, and a <strong>subsumption rule</strong>:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\Gamma\vdash t: S \quad S <: T}{\Gamma\vdash t: T}
\label{eq:t-sub}\tag{T-Sub} %]]></script>
<p>This rule tells us that if $S <: T$, then any value of type $S$ can also be regarded as having type $T$. With this rule in place, we just need to define the rules for when we can assert $S <: T$.</p>
<h3 id="rules">Rules</h3>
<h4 id="general-rules">General rules</h4>
<p>Subtyping is reflexive and transitive:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
S <: S
\label{eq:s-refl}\tag{S-Refl} \\ \\
\frac{S <: U \quad U <: T}{S <: T}
\label{eq:s-trans}\tag{S-Trans} \\ \\
\end{align} %]]></script>
<h4 id="records-1">Records</h4>
<p>To solve our previous example, we can introduce subtyping between record types:</p>
<script type="math/tex; mode=display">% <![CDATA[
\set{x: \text{Nat}, y: \text{Nat}} <: \set{x: \text{Nat}} %]]></script>
<p>Using $\ref{eq:t-sub}$, we can see that our example is now well-typed. Of course, the subtyping rule we introduced here is too specific; we need something more general. We can do this by introducing three rules for subtyping of record types:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\set{l_i: {T_i}^{i\in 1\dots n+k}} <: \set{l_i: {T_i}^{i\in 1\dots n}}
\label{eq:s-rcdwidth}\tag{S-RcdWidth} \\ \\
\frac{
\set{k_j : {S_j}^{j\in 1 \dots n}} \text{ is a permutation of } \set{l_i : {T_i}^{i\in 1 \dots n}}
}{
\set{k_j : {S_j}^{j\in 1 \dots n}} <: \set{l_i : {T_i}^{i\in 1 \dots n}}
}
\label{eq:s-rcdperm}\tag{S-RcdPerm} \\ \\
\frac{
\forall i \ S_i <: T_i
}{
\set{l_i : {S_i}^{i\in 1\dots n}} <: \set{l_i: {T_i}^{i\in 1\dots n}}
}
\label{eq:s-rcddepth}\tag{S-RcdDepth} \\ \\
\end{align} %]]></script>
<p>$\ref{eq:s-rcdwidth}$ tells us that a record is a supertype of a record with additional fields to the right. Intuitively, the reason that the record with <em>more</em> fields is a <em>subtype</em> of the record with fewer fields is that it places a stronger constraint on values, and thus describes fewer values (think of the Venn diagram of possible values).</p>
<p>Of course, adding fields to the right only is not a strong enough rule, as order in a record shouldn’t matter. We fix this with $\ref{eq:s-rcdperm}$, which allows us to reorder the record so that all additional fields are on the right: together, $\ref{eq:s-rcdperm}$, $\ref{eq:s-rcdwidth}$ and $\ref{eq:s-trans}$ allow us to drop arbitrary fields anywhere within a record.</p>
<p>Finally, $\ref{eq:s-rcddepth}$ allows for the types of individual fields to be subtypes of the supertype record’s fields.</p>
<p>Note that real languages often choose not to adopt these <a href="#aside-structural-vs-declared-subtyping">structural record subtyping</a> rules. For instance, Java has no depth subtyping (a subclass may not change the argument or result types of a method of its superclass), no permutation for classes (single inheritance means that each member can be assigned a single index; new members can be added as new indices “on the right”), but has permutation for interfaces (multiple inheritance of interfaces is allowed).</p>
<h4 id="arrow-types">Arrow types</h4>
<p>Function types are contravariant in the argument and covariant in the return type. The rule is therefore:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{T_1 <: S_1 \quad S_2 <: T_2}{S_1 \rightarrow S_2 <: T_1 \rightarrow T_2}
\label{eq:s-arrow}\tag{S-Arrow} %]]></script>
<h4 id="top-type">Top type</h4>
<p>For convenience, we have a top type that everything can be a subtype of. In Java, this corresponds to <code class="highlighter-rouge">Object</code>.</p>
<script type="math/tex; mode=display">% <![CDATA[
S <: \text{Top}
\label{eq:s-top}\tag{S-Top} %]]></script>
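<p>These structural rules can be sketched as a small subtype checker. Representing records as dicts makes $\ref{eq:s-rcdperm}$ hold for free, so the single record case below combines $\ref{eq:s-rcdwidth}$, $\ref{eq:s-rcdperm}$ and $\ref{eq:s-rcddepth}$; the tuple encoding of types is our own:</p>

```python
def is_subtype(S, T):
    """Structural check for S <: T.
    Types: ("Top",), ("Base", name), ("arrow", S1, S2),
    and records ("rcd", {label: type, ...})."""
    if T == ("Top",):                        # S-Top
        return True
    if S == T:                               # S-Refl
        return True
    if S[0] == "arrow" and T[0] == "arrow":  # S-Arrow: contravariant argument,
        return (is_subtype(T[1], S[1])       # covariant result
                and is_subtype(S[2], T[2]))
    if S[0] == "rcd" and T[0] == "rcd":
        # S-RcdWidth + S-RcdPerm + S-RcdDepth: every field required by T
        # must be present in S, at a subtype of T's field type
        return all(l in S[1] and is_subtype(S[1][l], T[1][l]) for l in T[1])
    return False
```

<p>On the motivating example, this confirms $\set{x: \text{Nat}, y: \text{Nat}} <: \set{x: \text{Nat}}$, while the converse check fails.</p>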
<h4 id="aside-structural-vs-declared-subtyping">Aside: structural vs. declared subtyping</h4>
<p>The <a href="#records-1">subtype relation we defined for records</a> is <em>structural</em>: we decide whether $S$ is a subtype of $T$ by examining the structure of $S$ and $T$. By contrast, most OO languages (e.g. Java) use <em>declared</em> subtyping: $S$ is only a subtype of $T$ if the programmer has stated that it should be (with <code class="highlighter-rouge">extends</code> or <code class="highlighter-rouge">implements</code>).</p>
<p>We’ll come back to this when we talk about <a href="#featherweight-java">Featherweight Java</a>.</p>
<h3 id="properties-of-subtyping">Properties of subtyping</h3>
<h4 id="safety-1">Safety</h4>
<p>The problem with subtyping is that it changes how we do proofs. They become a bit more involved, as the typing relation is no longer syntax directed; when we’re proving things, we need to start making choices, as the rule $\ref{eq:t-sub}$ could appear anywhere. Still, the proofs are possible.</p>
<h4 id="inversion-lemma-for-subtyping">Inversion lemma for subtyping</h4>
<p>Before we can prove safety and preservation, we’ll introduce the inversion lemma for subtyping.</p>
<p><strong>Inversion Lemma</strong>: If $U <: T_1 \rightarrow T_2$, then $U$ has the form $U_1 \rightarrow U_2$ with $T_1 <: U_1$ and $U_2 <: T_2$.</p>
<p>The proof is by induction on subtyping derivations:</p>
<ul>
<li>Case $\ref{eq:s-arrow}$, $U=U_1 \rightarrow U_2$: immediate, as $U$ already has the correct form, and as we can deduce $T_1 <: U_1$ and $U_2 <: T_2$ from $\ref{eq:s-arrow}$.</li>
<li>Case $\ref{eq:s-refl}$, $U=T_1 \rightarrow T_2$: by applying $\ref{eq:s-refl}$ twice, we get $T_1 <: T_1$ and $T_2 <: T_2$, as required.</li>
<li>
<p>Case $\ref{eq:s-trans}$, $U <: W$ and $W <: T_1 \rightarrow T_2$</p>
<p>By the IH on the second subderivation, we find that $W$ has the form $W_1 \rightarrow W_2$ with $T_1 <: W_1$ and $W_2 <: T_2$.</p>
<p>Applying the IH again to the first subderivation, we find that $U$ has the form $U_1 \rightarrow U_2$ with $W_1 <: U_1$ and $U_2 <: W_2$</p>
<p>By $\ref{eq:s-trans}$, we get $T_1 <: U_1$, and by $\ref{eq:s-trans}$ again, $U_2 <: T_2$ as required</p>
</li>
</ul>
<h4 id="inversion-lemma-for-typing">Inversion lemma for typing</h4>
<p>We’ll introduce another lemma, but this time for typing (not subtyping):</p>
<p><strong>Inversion lemma</strong>: if $\Gamma\vdash\lambda x: S_1. s_2 : T_1 \rightarrow T_2$, then $T_1 <: S_1$ and $\Gamma\cup(x: S_1)\vdash s_2: T_2$.</p>
<p>Again, the proof is by induction on typing derivations:</p>
<ul>
<li>Case $\ref{eq:t-abs}$, where $T_1 = S_1$, $T_2 = S_2$ and $\Gamma\cup(x: S_1)\vdash s_2 : S_2$: the result is immediate (using $\ref{eq:s-refl}$ to get $T_1 <: S_1$ from $T_1 = S_1$).</li>
<li>
<p>Case $\ref{eq:t-sub}$, $\Gamma\vdash\lambda x: S_1.\ s_2: U$ and $U <: T_1 \rightarrow T_2$</p>
<p>By the <a href="#inversion-lemma-for-subtyping">inversion lemma for subtyping</a>, we have $U = U_1 \rightarrow U_2$, with $T_1 <: U_1$ and $U_2 <: T_2$.</p>
<p>By the IH, we then have $U_1 <: S_1$ and $\Gamma\cup(x: S_1)\vdash s_2 : U_2$.</p>
<p>We can apply $\ref{eq:s-trans}$ to $U_1 <: S_1$ and $T_1 <: U_1$ to get $T_1 <: S_1$.</p>
<p>We can apply $\ref{eq:t-sub}$ to the assumptions that $\Gamma\cup(x: S_1)\vdash s_2: U_2$ and $U_2 <: T_2$ to conclude $\Gamma\cup(x: S_1)\vdash s_2: T_2$</p>
</li>
</ul>
<h4 id="preservation-1">Preservation</h4>
<p>Remember that preservation states that if $\Gamma\vdash t: T$ and $t\longrightarrow t’$ then $\Gamma\vdash t’: T$.</p>
<p>The proof is by induction on typing derivations:</p>
<ul>
<li>
<p>Case $\ref{eq:t-sub}$: $t: S$ and $S <: T$.</p>
<p>By the IH, $\Gamma\vdash t’: S$.</p>
<p>By $\ref{eq:t-sub}$, $\Gamma\vdash t’: T$.</p>
</li>
<li>
<p>Case $\ref{eq:t-app}$: $t = t_1\ t_2$, $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$, $\Gamma\vdash t_2: T_{11}$ and $T = T_{12}$. By the inversion lemma for evaluation<sup id="fnref:inversion-lemma-evaluation-lambda"><a href="#fn:inversion-lemma-evaluation-lambda" class="footnote">3</a></sup>, there are three rules by which $t\longrightarrow t’$ can be derived:</p>
<ul>
<li>Subcase $\ref{eq:e-app1}$: $t_1 \longrightarrow t_1’$ and $t’ = t_1’\ t_2$. The result follows from the IH and $\ref{eq:t-app}$</li>
<li>Subcase $\ref{eq:e-app2}$: $t_1 = v_1$, $t_2 \longrightarrow t_2’$ and $t’ = v_1\ t_2’$. The result follows from the IH and $\ref{eq:t-app}$</li>
<li>
<p>Subcase $\ref{eq:e-appabs}$: $t_1 = \lambda x: S_{11}.\ t_{12}$, $t_2 = v_2$ and $t’ = [x\mapsto v_2]t_{12}$.</p>
<p>By the <a href="#inversion-lemma-for-typing">inversion lemma for typing</a>, $T_{11} <: S_{11}$ and $\Gamma\cup (x: S_{11})\vdash t_{12}: T_{12}$.</p>
<p>By $\ref{eq:t-sub}$, $\Gamma\vdash t_2: S_{11}$</p>
<p>By the <a href="#substitution-lemma">substitution lemma</a>, $\Gamma\vdash t’: T_{12}$.</p>
</li>
</ul>
</li>
</ul>
<h3 id="subtyping-features">Subtyping features</h3>
<h4 id="casting">Casting</h4>
<p>In languages like Java and C++, ascription is a little more interesting than <a href="#ascription">what we previously defined it as</a>. In these languages, ascription serves as a casting operator.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t_1 : S}{\Gamma\vdash t_1 \text{ as } T : T}
\label{eq:t-cast}\tag{T-Cast} \\ \\
\frac{\vdash_r v_1: T}{v_1 \text{ as } T \longrightarrow v_1}
\label{eq:e-cast}\tag{E-Cast} \\
\end{align}</script>
<p>Contrary to $\ref{eq:t-ascribe}$, the $\ref{eq:t-cast}$ rule allows the ascription to be of a different type than the term. This allows the programmer to have an escape hatch, and get around the type checker. However, this <em>laissez-faire</em> solution means that a run-time check is necessary, as $\ref{eq:e-cast}$ shows.</p>
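<p>In Java, this pair of rules corresponds to upcasts (always safe) and downcasts (checked at run time, with a possible <code class="highlighter-rouge">ClassCastException</code>). A small sketch (the names are mine):</p>

```java
public class CastDemo {
    static Object upcast(String s) { return s; }            // always safe
    static String downcast(Object o) { return (String) o; } // needs a run-time check, as in E-Cast

    public static void main(String[] args) {
        System.out.println(downcast(upcast("hello"))); // succeeds: the value really is a String
        try {
            downcast(Integer.valueOf(1)); // the static type checker lets this through...
        } catch (ClassCastException e) {
            System.out.println("run-time check failed"); // ...but the run-time check rejects it
        }
    }
}
```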
<h4 id="variants-1">Variants</h4>
<p>The subtyping rules for <a href="#variants">variants</a> are almost identical to those of records, with the main difference being that the width rule allows variants to be <em>added</em>, not dropped:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\langle l_i : {T_i}^{i\in 1\dots n} \rangle
<:
\langle l_i : {T_i}^{i\in 1\dots n+k} \rangle
\label{eq:s-variantwidth}\tag{S-VariantWidth} \\ \\
\frac{\forall i \ S_i <: T_i}{
\langle l_i : {S_i}^{i\in 1 \dots n} \rangle
<:
\langle l_i : {T_i}^{i\in 1 \dots n} \rangle
} \label{eq:s-variantdepth}\tag{S-VariantDepth} \\ \\
\frac{
\langle k_j : {S_j}^{j\in 1 \dots n} \rangle
\text{ is a permutation of }
\langle l_i : {T_i}^{i\in 1 \dots n} \rangle
}{
\langle k_j : {S_j}^{j\in 1 \dots n} \rangle
<:
\langle l_i : {T_i}^{i\in 1 \dots n} \rangle
}
\label{eq:s-variantperm}\tag{S-VariantPerm} \\ \\
\frac{
\Gamma\vdash t_1 : T_1
}{
\Gamma\vdash \langle l_1 = t_1 \rangle : \langle l_1 : T_1 \rangle
} \label{eq:t-variant}\tag{T-Variant}
\end{align} %]]></script>
<p>The intuition for $\ref{eq:s-variantwidth}$ is that a tagged expression $\langle l = t \rangle$ belongs to a variant type $\langle l_i : {T_i}^{i\in 1\dots n} \rangle$ if the label $l$ is <em>one of the possible labels</em> $\set{l_i}$. This is easy to understand if we consider the <a href="#variants"><code class="highlighter-rouge">Option</code> example that we used previously</a>: <code class="highlighter-rouge">some</code> and <code class="highlighter-rouge">none</code> are subtypes of <code class="highlighter-rouge">Option</code>.</p>
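<p>The <code class="highlighter-rouge">Option</code> intuition can be sketched in Java, where the subclasses play the role of the tagged variants (the class names are mine, chosen to mirror the example):</p>

```java
public class VariantDemo {
    // Option is the variant type; Some and None are its tags.
    abstract static class Option {}
    static final class Some extends Option { final int value; Some(int v) { value = v; } }
    static final class None extends Option {}

    // A consumer of Option must be prepared to handle every possible tag.
    static String describe(Option o) {
        if (o instanceof Some) return "some(" + ((Some) o).value + ")";
        return "none";
    }

    public static void main(String[] args) {
        Option a = new Some(42); // Some <: Option
        Option b = new None();   // None <: Option
        System.out.println(describe(a)); // prints some(42)
        System.out.println(describe(b)); // prints none
    }
}
```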
<h4 id="covariance">Covariance</h4>
<p><code class="highlighter-rouge">List</code> is an example of a covariant type constructor: we want <code class="highlighter-rouge">List[None]</code> to be a subtype of <code class="highlighter-rouge">List[Option]</code>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{S_1 <: T_1}{\text{List } S_1 <: \text{List } T_1}
\label{eq:s-list}\tag{S-List} %]]></script>
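<p>Java’s <code class="highlighter-rouge">List</code> is invariant, but an <code class="highlighter-rouge">extends</code>-wildcard recovers exactly this covariant, read-only view. A sketch (the method name is mine):</p>

```java
import java.util.Arrays;
import java.util.List;

public class CovarianceDemo {
    // List<Integer> is not a subtype of List<Number> in Java,
    // but it is a subtype of List<? extends Number>, mirroring S-List for reads.
    static double sum(List<? extends Number> xs) {
        double total = 0;
        for (Number n : xs) total += n.doubleValue();
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(Arrays.asList(1, 2, 3)));  // a List<Integer>, prints 6.0
        System.out.println(sum(Arrays.asList(1.5, 2.5))); // a List<Double>, prints 4.0
    }
}
```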
<h4 id="invariance">Invariance</h4>
<p>References are neither covariant nor contravariant: they are <em>invariant</em>. The <a href="#references">reference</a> is thus an example of an invariant constructor.</p>
<ul>
<li>When a reference is <em>read</em>, the context expects a $T_1$, so giving it an $S_1$ with $S_1 <: T_1$ is fine</li>
<li>When a reference is <em>written</em>, the context provides a $T_1$. If the actual type of the reference is $\text{Ref } S_1$, someone may later use the $T_1$ as an $S_1$, so we need $T_1 <: S_1$</li>
</ul>
<p>Similarly, arrays are invariant, for the same reason:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{S_1 <: T_1 \quad T_1 <: S_1}{\text{Array } S_1 <: \text{Array } T_1}
\label{eq:s-array}\tag{S-Array} %]]></script>
<p>Java, however, has covariant arrays:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{S_1 <: T_1}{\text{Array } S_1 <: \text{Array } T_1}
\label{eq:s-arrayjava}\tag{S-ArrayJava} %]]></script>
<p>This is because the Java language designers felt that they needed to be able to write a sort routine for mutable arrays, and implemented this as a quick fix. However, it turned out to be a mistake that even the Java designers regret.</p>
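<p>The unsoundness of covariant arrays is easy to trigger; Java compensates with a run-time check on every array write, throwing <code class="highlighter-rouge">ArrayStoreException</code> when it fails. A sketch (the names are mine):</p>

```java
public class ArrayStoreDemo {
    // Tries to store a Double into the given array; reports whether the write succeeded.
    static boolean storeDouble(Number[] nums) {
        try {
            nums[0] = 3.14; // statically fine: 3.14 boxes to Double, a subtype of Number
            return true;
        } catch (ArrayStoreException e) {
            return false;   // rejected at run time if the array is really an Integer[]
        }
    }

    public static void main(String[] args) {
        Integer[] ints = {1, 2, 3};
        Number[] nums = ints; // allowed by Java's covariant arrays (S-ArrayJava)
        System.out.println(storeDouble(nums));             // prints false
        System.out.println(storeDouble(new Number[] {1})); // prints true
    }
}
```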
<p>The solution to this invariance problem is based on the following observation: a <code class="highlighter-rouge">Ref T</code> can be used either for reading or writing. To be able to have covariant reading and contravariant writing, we can split a <code class="highlighter-rouge">Ref T</code> in three:</p>
<ul>
<li><code class="highlighter-rouge">Source T</code>: a reference with read capability</li>
<li><code class="highlighter-rouge">Sink T</code>: a reference cell with write capability</li>
<li><code class="highlighter-rouge">Ref T</code>: a reference cell with both capabilities</li>
</ul>
<p>The typing rules then limit dereference to sources, and assignment to sinks:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{
\Gamma \mid \Sigma \vdash t_1 : \text{Source } T_{11}
}{
\Gamma \mid \Sigma \vdash !t_1 : T_{11}
} \label{eq:t-derefsource}\tag{T-DerefSource} \\ \\
\frac{
\Gamma \mid \Sigma \vdash t_1 : \text{Sink } T_{11}
\quad
\Gamma \mid \Sigma \vdash t_2 : T_{11}
}{
\Gamma \mid \Sigma \vdash t_1 := t_2 : \text{Unit}
}
\label{eq:t-assignsink}\tag{T-AssignSink} \\
\end{align}</script>
<p>The subtyping rules establish sources as covariant constructors, sinks as contravariant, and a reference as a subtype of both:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{S_1 <: T_1}{\text{Source } S_1 <: \text{Source } T_1}
\label{eq:s-source}\tag{S-Source} \\ \\
\frac{T_1 <: S_1}{\text{Sink } S_1 <: \text{Sink } T_1}
\label{eq:s-sink}\tag{S-Sink} \\ \\
\text{Ref } T_1 <: \text{Source } T_1
\label{eq:s-refsource}\tag{S-RefSource} \\ \\
\text{Ref } T_1 <: \text{Sink } T_1
\label{eq:s-refsink}\tag{S-RefSink} \\
\end{align} %]]></script>
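<p>Java’s standard library has a close analogue of this split: <code class="highlighter-rouge">Supplier</code> is a read capability (usable covariantly) and <code class="highlighter-rouge">Consumer</code> a write capability (usable contravariantly). A sketch (the helper names are mine):</p>

```java
import java.util.function.Consumer;
import java.util.function.Supplier;

public class SourceSinkDemo {
    // Source side: we only read, so any supplier of a Number subtype is fine (cf. S-Source).
    static Number read(Supplier<? extends Number> source) {
        return source.get();
    }

    // Sink side: we only write Integers, so any consumer of an Integer supertype is fine (cf. S-Sink).
    static void write(Consumer<? super Integer> sink, int value) {
        sink.accept(value);
    }

    public static void main(String[] args) {
        Supplier<Integer> intSource = () -> 42;
        System.out.println(read(intSource)); // prints 42

        Consumer<Number> numSink = n -> System.out.println("got " + n);
        write(numSink, 7); // prints got 7
    }
}
```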
<h3 id="algorithmic-subtyping">Algorithmic subtyping</h3>
<p>So far, in STLC, our typing rules were <em>syntax directed</em>. This means that for every form of a term, a specific rule applied; which rule to choose was always straightforward.</p>
<p>The reason the choice is so straightforward is because we can divide the positions of a typing relation like $\ref{eq:t-app}$ into input positions ($\Gamma$ and $t$), and output positions ($T_{11}$, $T_{12}$).</p>
<p>However, by introducing subtyping, we introduced rules that break this: $\ref{eq:t-sub}$ and $\ref{eq:s-trans}$ apply to <em>any</em> kind of term, and can appear at any point of a derivation. Every time our type checking algorithm encounters a term, it must decide which rule to apply. $\ref{eq:s-trans}$ also introduces the problem of having to pick an intermediary type $U$ (which is neither an input nor an output position), for which there can be multiple choices. $\ref{eq:s-refl}$ also overlaps with the conclusions of other rules, although this is a less severe problem.</p>
<p>But this excess flexibility isn’t strictly needed; we don’t need 1000 ways to prove a given typing or subtyping statement, one is enough. The solution to these problems is to replace the ordinary, <em>declarative</em> typing and subtyping relations with <em>algorithmic</em> relations, whose sets of rules are syntax directed. This implies proving that the algorithmic relations are equivalent to the original ones.</p>
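<p>To make the idea concrete, here is a minimal sketch of an algorithmic subtype checker for the fragment with base types, $\text{Top}$ and arrow types. Reflexivity and transitivity are not separate rules; each check is decided purely by the shapes of the two types (all names are mine):</p>

```java
public class AlgorithmicSubtyping {
    abstract static class Type {}
    static final class Top extends Type {}
    static final class Base extends Type {          // a named base type, e.g. Nat
        final String name;
        Base(String name) { this.name = name; }
    }
    static final class Arrow extends Type {         // T1 -> T2
        final Type from, to;
        Arrow(Type from, Type to) { this.from = from; this.to = to; }
    }

    // Syntax-directed: exactly one case applies per shape, so no search is needed.
    static boolean subtype(Type s, Type t) {
        if (t instanceof Top) return true;                       // S-Top
        if (s instanceof Base && t instanceof Base)
            return ((Base) s).name.equals(((Base) t).name);      // reflexivity, only on base types
        if (s instanceof Arrow && t instanceof Arrow) {
            Arrow a = (Arrow) s, b = (Arrow) t;
            return subtype(b.from, a.from)                       // contravariant argument
                && subtype(a.to, b.to);                          // covariant result
        }
        return false;
    }

    public static void main(String[] args) {
        Type nat = new Base("Nat"), top = new Top();
        // Top -> Nat <: Nat -> Top: the argument type goes down, the result type goes up.
        System.out.println(subtype(new Arrow(top, nat), new Arrow(nat, top))); // prints true
    }
}
```

Here transitivity is admissible rather than a rule of its own; showing that this checker agrees with the declarative relation is exactly the equivalence proof mentioned above.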
<h2 id="objects">Objects</h2>
<p>For simple objects and classes, we can easily use a translational analysis, converting ideas like dynamic dispatch, state, inheritance, into derived forms from lambda calculus such as (higher-order) functions, records, references, recursion, subtyping. However, for more complex features (like <code class="highlighter-rouge">this</code>), we’ll need a more direct treatment.</p>
<p>In lambda calculus, we can represent an object as a record inside a let:</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="kd">class</span> <span class="nc">Counter</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">1</span><span class="o">;</span>
<span class="kt">int</span> <span class="nf">get</span><span class="o">()</span> <span class="o">{</span> <span class="k">return</span> <span class="n">x</span><span class="o">;</span> <span class="o">}</span>
<span class="kt">void</span> <span class="nf">inc</span><span class="o">()</span> <span class="o">{</span> <span class="n">x</span><span class="o">++;</span> <span class="o">}</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<script type="math/tex; mode=display">% <![CDATA[
\text{let } x = \text{ref } 1 \text{ in}
\set{\begin{align}
\text{get} & = \lambda \text{_}: \text{Unit}.\ !x \\
\text{inc} & = \lambda \text{_}: \text{Unit}.\ x := \text{succ}(!x) \\
\end{align}} %]]></script>
<p>To create an object, we can just do the following:</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{newCounter} = \lambda \text{_}: \text{Unit}.\
\text{let } x = \text{ref } 1 \text{ in}
\set{\begin{align}
\text{get} & = \lambda \text{_}: \text{Unit}.\ !x \\
\text{inc} & = \lambda \text{_}: \text{Unit}.\ x := \text{succ}(!x) \\
\end{align}} %]]></script>
<p>This defines $\text{newCounter}$, a function of type $\text{Unit} \rightarrow \text{Counter}$ that creates counter objects, where $\text{Counter} = \set{\text{get}: \text{Unit} \rightarrow \text{Nat},\ \text{inc}: \text{Unit}\rightarrow\text{Unit}}$.</p>
<p>More generally, the state may consist of more than a single reference cell, so we can let the state be represented by a variable <code class="highlighter-rouge">r</code> corresponding to a record with (potentially) multiple fields.</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{newCounter} = \lambda \text{_}: \text{Unit}.\
\text{let } r = \set{x=\text{ref } 1} \text{ in}
\set{\begin{align}
\text{get} & = \lambda \text{_}: \text{Unit}.\ !(r.x) \\
\text{inc} & = \lambda \text{_}: \text{Unit}.\ r.x := \text{succ}(!(r.x)) \\
\end{align}} %]]></script>
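<p>This record-of-closures encoding translates almost directly into Java: the record of methods becomes an interface, and the hidden state <code class="highlighter-rouge">r</code> becomes a local variable captured by an anonymous implementation. A sketch (the names are mine, mirroring the encoding above):</p>

```java
public class ClosureCounter {
    interface Counter {
        int get();
        void inc();
    }

    // newCounter : Unit -> Counter; each call allocates fresh state r = {x = ref 1}.
    static Counter newCounter() {
        int[] x = {1}; // a one-cell "ref", shared by both methods below
        return new Counter() {
            public int get() { return x[0]; }
            public void inc() { x[0] = x[0] + 1; }
        };
    }

    public static void main(String[] args) {
        Counter c = newCounter();
        c.inc();
        c.inc();
        System.out.println(c.get());            // prints 3: started at 1, incremented twice
        System.out.println(newCounter().get()); // prints 1: fresh state per object
    }
}
```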
<h3 id="dynamic-dispatch">Dynamic dispatch</h3>
<p>When an operation is invoked on an object, the ensuing behavior depends on the object itself; indeed, two objects of the same type may be implemented internally in completely different ways.</p>
<p>This is late binding for function calls: the call is bound to the corresponding function at run time, based on the actual class of the receiver rather than its static type.</p>
<h3 id="encapsulation">Encapsulation</h3>
<p>In most OO languages, each object consists of some internal state. The state is directly accessible to the methods, but inaccessible from the outside. It’s a form of information hiding.</p>
<p>Note that this information hiding is different from that provided by abstract data types (ADTs), which do not offer dynamic dispatch.</p>
<p>In Java, the encapsulation can be enabled with <code class="highlighter-rouge">protected</code>.</p>
<p>The type of an object is just the set of operations that can be performed on it. It doesn’t include the internal state.</p>
<h3 id="inheritance">Inheritance</h3>
<p>Subtyping is a way to talk about types. Inheritance is more focused on the idea of sharing behavior, on avoiding duplication of code. The basic mechanism of inheritance is classes. Classes can be <em>instantiated</em> to create new objects (“instances”), or <em>refined</em> to create new classes (“subclasses”). Subclasses are subtypes of their parent classes. We’ll talk about both here, but it’s important to know the distinction.</p>
<p>We saw previously that a record A with more fields than B is a subtype of B. As an example, let’s try to look at a <code class="highlighter-rouge">ResetCounter</code> inheriting from <code class="highlighter-rouge">Counter</code>, adding a <code class="highlighter-rouge">reset</code> method that sets <code class="highlighter-rouge">x</code> back to its initial value of 1.</p>
<p>Initially, we can just try to do this by copying the code, and adding a method. But this goes against the DRY principle from software engineering. Another thing that we could try is to take a <code class="highlighter-rouge">Counter</code> as an argument in the object generator, but this is problematic because we’re not sharing the state; we’ve got two separate counts in <code class="highlighter-rouge">Counter</code> and <code class="highlighter-rouge">ResetCounter</code>, and they can not access each other’s state.</p>
<p>To avoid these problems, we must separate the definitions of the methods, from the act of binding these methods to a particular set of instance variables, in the object generator. Here, we use the age-old computer science adage of “every problem can be solved with an additional level of indirection”.</p>
<p>We’ll first have to introduce the notion of <code class="highlighter-rouge">super</code>. We know this construct from Java, for instance. Java’s <code class="highlighter-rouge">super</code> gives us a mechanism to avoid dynamic dispatch. We can call specifically the methods of the class we’re inheriting from through <code class="highlighter-rouge">super</code>.</p>
<p>To define a subclass, the idea is then to instantiate the super, and bind the methods of the object to the super’s methods. Both classes have access to the same state through the use of references.</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{resetCounterClass} =
\lambda r: \text{CounterRep}.\
\text{let super} = \text{counterClass } r \text{ in } \{ \\
\begin{align}
\text{get} & = \text{super.get}, \\
\text{inc} & = \text{super.inc}, \\
\text{reset} & = \lambda _: \text{Unit}.\ r.x := 1 \\
\end{align}
\} %]]></script>
<p>This also allows us to call <code class="highlighter-rouge">super</code> in redefined methods (so <code class="highlighter-rouge">inc</code> could call <code class="highlighter-rouge">super.inc</code> if it needs to).</p>
<p>Our record $r$ can even contain more variables than the superclass needs, as records with more fields are subtypes of those with a subset of fields. This allows us to have more instance variables in the subclass.</p>
<p>Note that to be more rigorous, we’d have to define this more precisely. In most OO languages, things aren’t subtypes of each other just because they have the same methods; it’s because we declare them to be so. We’d need to be more rigorous to model this.</p>
<h3 id="this">This</h3>
<p>OO languages provide access to <code class="highlighter-rouge">this</code>, the current method receiver. It may be an instance of a subclass, not of the class we’re currently looking at. So <code class="highlighter-rouge">this</code>’s actual class (at run time) must be able to override the definitions.</p>
<figure class="highlight"><pre><code class="language-java" data-lang="java"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="kd">class</span> <span class="nc">E</span> <span class="o">{</span>
<span class="kd">protected</span> <span class="kt">int</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span><span class="o">;</span>
<span class="kt">int</span> <span class="nf">m</span><span class="o">()</span> <span class="o">{</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">+</span><span class="mi">1</span><span class="o">;</span> <span class="k">return</span> <span class="n">x</span><span class="o">;</span> <span class="o">}</span>
<span class="kt">int</span> <span class="nf">n</span><span class="o">()</span> <span class="o">{</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">-</span><span class="mi">1</span><span class="o">;</span> <span class="k">return</span> <span class="k">this</span><span class="o">.</span><span class="na">m</span><span class="o">();</span> <span class="o">}</span>
<span class="o">}</span>
<span class="kd">class</span> <span class="nc">F</span> <span class="kd">extends</span> <span class="n">E</span> <span class="o">{</span>
<span class="kt">int</span> <span class="nf">m</span><span class="o">()</span> <span class="o">{</span> <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="o">+</span><span class="mi">100</span><span class="o">;</span> <span class="k">return</span> <span class="n">x</span><span class="o">;</span> <span class="o">}</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Above, we saw how to call the parent class through <code class="highlighter-rouge">super</code>. To call methods between each other, we need to add <code class="highlighter-rouge">this</code>.</p>
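<p>Running the classes above makes the role of <code class="highlighter-rouge">this</code> concrete: in <code class="highlighter-rouge">n()</code>, the call <code class="highlighter-rouge">this.m()</code> dispatches on the receiver’s run-time class, so it can land in the subclass override. A self-contained copy (the wrapper class name is mine):</p>

```java
public class DispatchDemo {
    static class E {
        protected int x = 0;
        int m() { x = x + 1; return x; }
        int n() { x = x - 1; return this.m(); } // `this` may be an instance of a subclass
    }

    static class F extends E {
        int m() { x = x + 100; return x; } // overrides E.m
    }

    public static void main(String[] args) {
        System.out.println(new E().n()); // prints 0: x goes 0 -> -1 -> 0 via E.m
        System.out.println(new F().n()); // prints 99: this.m() dispatches to F.m, so -1 + 100
    }
}
```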
<p>In an initial attempt at implementing this in lambda calculus, we can add a fix operator to the class definition, so that we can call ourselves.</p>
<script type="math/tex; mode=display">
\text{setCounterClass} =
\lambda r: \text{CounterRep}.\
\text{fix } (\lambda \text{this}: \text{SetCounter}.\ \{ \\
\begin{align}
\text{get} & = \lambda \text{_}: \text{Unit}.\ !(r.x), \\
\text{set} & = \lambda i: \text{Nat}.\ r.x := i, \\
\text{inc} & = \lambda \text{_}: \text{Unit}.\ \text{this.set}\ (\text{succ}\ (\text{this.get unit})) \\
\end{align}
\})</script>
<p>But the fixed point here is “closed”. We have “tied the knot” when we built the record. So this does not model the behavior of <code class="highlighter-rouge">this</code> in OO. To solve this, we can move the application of <code class="highlighter-rouge">fix</code> from the class definition to the object creation function (essentially switching the order of $\text{fix}$ and $\lambda r: \text{CounterRep}$):</p>
<script type="math/tex; mode=display">
\text{setCounterClass} =
\lambda r: \text{CounterRep}.\
\lambda \text{this}: \text{SetCounter}.\ \{ \\
\begin{align}
\text{get} & = \lambda \text{_}: \text{Unit}.\ !(r.x), \\
\text{set} & = \lambda i: \text{Nat}.\ r.x := i, \\
\text{inc} & = \lambda \text{_}: \text{Unit}.\ \text{this.set}\ (\text{succ}\ (\text{this.get unit})) \\
\end{align}
\} \\ \\
\text{newSetCounter} =
\lambda \text{_}: \text{Unit}.\
\text{let } r = \set{x = \text{ref } 1} \text{ in fix } (\text{setCounterClass } r)</script>
<p>Note that this changes the type signature: $\text{setCounterClass}$ now takes $\text{this}$ as an extra argument, so its type becomes $\text{CounterRep} \rightarrow \text{SetCounter} \rightarrow \text{SetCounter}$ rather than $\text{CounterRep} \rightarrow \text{SetCounter}$.</p>
<h3 id="using-this">Using <code class="highlighter-rouge">this</code></h3>
<p>Let’s continue the example from above by defining a new class of counter object, keeping count of the number of times <code class="highlighter-rouge">set</code> has been called. We’ll call this an “instrumented counter”.</p>
<figure class="highlight"><pre><code class="language-plain" data-lang="plain"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="code"><pre>InstrCounter = {
get: Unit -> Nat,
set: Nat -> Unit,
inc: Unit -> Unit,
accesses: Unit -> Nat
}
InstrCounterRep = {
x: Ref Nat,
a: Ref Nat
}
instrCounterClass =
λr: InstrCounterRep.
λthis: InstrCounter.
let super = setCounterClass r this in
{get = super.get,
set = λi: Nat. (r.a := succ(!(r.a)); super.set i),
inc = super.inc,
accesses = λ_: Unit. !(r.a)};</pre></td></tr></tbody></table></code></pre></figure>
<p>But this implementation is not very useful, as the object creator diverges. Intuitively, the problem is the “unprotected” use of <code class="highlighter-rouge">this</code>. A solution is to “delay” it by putting a dummy abstraction in front of it. This effectively replaces call-by-value with call-by-name. Now, <code class="highlighter-rouge">this</code> is of type $\text{Unit} \rightarrow \text{SetCounter}$.</p>
<p>This works, but very slowly. All the delaying we added has a side effect: instead of computing the method table just once, we now re-compute it on every method invocation, since under call-by-name the delayed <code class="highlighter-rouge">this</code> is re-evaluated every time it is needed. The solution is to use a lazy value. In lambda calculus, we can represent a lazy value with a reference, along with a flag recording whether it has already been computed. Section 18.12 describes this in more detail.</p>
<h2 id="featherweight-java">Featherweight Java</h2>
<p>We have now covered the topics related to the essence of objects, but there are still certain things missing compared to Java. With objects, we’ve captured the runtime aspect of classes, but we haven’t really talked about classes as types. We’re also missing a discussion of named types with declared subtyping (we’ve only done structural subtyping), of recursive types (like the ones we need for list tails, for instance), and of run-time type analysis. Additionally, most type systems have escape hatches known as casts, which we haven’t talked about either.</p>
<p>Seeing that we have plenty to talk about, let’s try to define a model for Java. Remember that a model always abstracts details away, so there’s no such thing as a perfect model. It’s always a question of which tradeoffs we choose for our specific use-case. Seeing that Java has a lot of different purposes, we are going to have lots of different models. For instance, some of the choices we need to make are:</p>
<ul>
<li>Source-level vs. bytecode level</li>
<li>Large (inclusive) vs small (simple) models</li>
<li>Type system vs. run-time</li>
<li>Models of specific features</li>
</ul>
<p>Featherweight Java was proposed as a tool for analyzing GJ (Java with generics), and has since been used to study proposed Java extensions. It aims to be very simple, modelling just the core OO features and their types, <em>and nothing else</em>. It models classes, objects, methods, method invocation, fields, field access, inheritance and casting, but leaves out more complex topics such as reflection, concurrency, exceptions, loops, assignment and overloading.</p>
<p>The model aims to be very explicit.</p>
<ul>
<li>Every class must declare a parent class</li>
<li>All classes must have a constructor</li>
<li>All fields must be represented 1-to-1 in the constructor</li>
<li>The constructor must call <code class="highlighter-rouge">super()</code></li>
<li>Always explicitly name receiver object in method invocation or field access (using <code class="highlighter-rouge">this.x</code> or <code class="highlighter-rouge">that.x</code>)</li>
<li>Methods are just a single <code class="highlighter-rouge">return</code> expression</li>
</ul>
<h3 id="structural-vs-nominal-type-systems">Structural vs. Nominal type systems</h3>
<p>There’s a big dichotomy in the world of programming languages.</p>
<p>On one hand, we have <em>structural</em> type systems, where the names are convenient but inessential abbreviations. What really matters about a type in a structural type system is its structure. It’s somewhat cleaner and more elegant, easier to extend, but once we need to talk about recursive types, some of the elegance falls away.</p>
<p>On the other hand, what’s used in almost all mainstream programming languages is <em>nominal</em> type systems. Here, recursive types are much simpler, and using names everywhere makes type checking much simpler. Having named types is also useful at run-time for casting, type testing, reflection, etc.</p>
<h3 id="representing-objects">Representing objects</h3>
<p>How can we represent an object? What defines it? Two objects are different if their constructors are different, or if their constructors have been passed different arguments. This observation leads us to the idea that we can identify an object fully by looking at the <code class="highlighter-rouge">new</code> expression. Here, having omitted assignments makes our life much easier.</p>
<h3 id="syntax">Syntax</h3>
<p>We’ll use the notation $\bar{C}$ to mean arbitrary repetition of $C$ (a constructor) or $c$ (a variable or value). The notation $\bar{C}\ \bar{f}$ means we’ve “zipped” the two together, like $(C_1\ f_1, \dots, C_n\ f_n)$.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
CL & ::= \text{class } C \text{ extends } C\ \{\bar{C}\ \bar{f};\ K\ \bar{M}\} \\
K & ::= C(\bar{C}\ \bar{f})\ \{\text{super}(\bar{f});\ \text{this}.\bar{f} = \bar{f};\} \\
M & ::= C\ m(\bar{C}\ \bar{x})\ \{\text{return } t;\} \\
t & ::= x \mid t.f \mid t.m(\bar{t}) \mid \text{new } C(\bar{t}) \mid (C)\ t \\
v & ::= \text{new } C(\bar{v})
\end{align} %]]></script>
<h3 id="evaluation-2">Evaluation</h3>
<p>FJ uses call-by-value, like lambda calculus and Java.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\text{fields}(C) = \bar{C}\ \bar{f}}{(\text{new } C(\bar{v})).f_i \longrightarrow v_i}
\label{eq:e-projnew}\tag{E-ProjNew} \\ \\
\frac{\text{mbody}(m, C) = (\bar{x}, t_0)}{(\text{new } C(\bar{v})).m(\bar{u}) \longrightarrow [\bar{x} \mapsto \bar{u},\ \text{this} \mapsto \text{new } C(\bar{v})]\ t_0}
\label{eq:e-invknew}\tag{E-InvkNew} \\ \\
\frac{C <: D}{(D)(\text{new } C(\bar{v})) \longrightarrow \text{new } C(\bar{v})}
\label{eq:e-castnew}\tag{E-CastNew} \\
\end{align} %]]></script>
<h3 id="typing">Typing</h3>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\frac{\Gamma\vdash t_0 : C_0 \quad \text{fields}(C_0) = \bar{C}\ \bar{f}}{\Gamma\vdash t_0.f_i : C_i}
\label{eq:t-field}\tag{T-Field} \\ \\
\frac{\text{fields}(C) = \bar{D}\ \bar{f} \quad \Gamma\vdash \bar{t} : \bar{C} \quad \bar{C} <: \bar{D}}{\Gamma\vdash \text{new } C(\bar{t}) : C}
\label{eq:t-new}\tag{T-New} \\ \\
\frac{\Gamma\vdash t_0 : D \quad D <: C}{\Gamma\vdash (C)\ t_0 : C}
\label{eq:t-ucast}\tag{T-UCast} \\ \\
\frac{\Gamma\vdash t_0 : D \quad C <: D \quad C \neq D}{\Gamma\vdash (C)\ t_0 : C}
\label{eq:t-dcast}\tag{T-DCast} \\
\end{align} %]]></script>
<p>We have two rules for casting: one for subtypes, and one for supertypes. We do not allow casting to an unrelated type, because FJ complies with Java, and Java doesn’t allow it.</p>
<p>For methods and classes, we want to make sure that overrides are valid, that we pass the correct arguments to the superclass constructor.</p>
<p>Also note that our typing rules often have subsumption built into them, instead of having a separate subsumption rule. This allows us to have algorithmic subtyping, which we need for two reasons:</p>
<ol>
<li>To perform static overloading resolution (picking between different overloaded methods at compile-time), we need to be able to speak about the type of an expression (and we need one single type, not several of them)</li>
<li>We’d run into trouble typing conditional expressions. This is not something that we have included in FJ, but regular Java has it, and we may wish to include it as an extension to FJ</li>
</ol>
<p>Let’s talk about this problem with conditionals in a little more detail. If we have a conditional (or a ternary expression) $t_1 ? t_2 : t_3$, with $t_1: T_1$, $t_2: T_2$, $t_3: T_3$, what is the return type of the expression? The simple solution is the least common supertype (this corresponds to the lowest common ancestor in the class hierarchy), but that becomes problematic with interfaces, which allow for multiple inheritance (for instance, if $T_2$ and $T_3$ both implement $I_2$ and $I_3$, we wouldn’t know which one to pick).</p>
<p>The actual Java rule that’s used is that the return type is $\min (T_2, T_3)$. Scala solves this (in Dotty) with union types, where the result type is $T_2 \mid T_3$.</p>
<h3 id="properties">Properties</h3>
<h4 id="progress-1">Progress</h4>
<p>We can’t actually prove progress, as well-typed programs can get stuck because of casting: a cast can fail at run time. The solution is to weaken the statement of progress. We’ll instead prove that a well-typed FJ term is either a value, reduces to some term, or gets stuck at a failing cast.</p>
<p>To formalize this, we need a little more work. Indeed, since casts are done at runtime, we need to describe the evaluation context.</p>
<script type="math/tex; mode=display">% <![CDATA[
E ::= [\,] \mid E.f \mid E.m(\bar{t}) \mid v.m(\bar{v}, E, \bar{t}) \mid \text{new } C(\bar{v}, E, \bar{t}) \mid (C)\ E %]]></script>
<p>We can now restate progress more formally. Suppose $t$ is a closed, well-typed normal form. Then either:</p>
<ol>
<li>$t$ is a value</li>
<li>$t \longrightarrow t’$ for some $t’$</li>
<li>For some evaluation context $E$, we can express $t$ as $t = E[(C)(\text{new } D(\bar{v}))]$, with $\neg (D <: C)$</li>
</ol>
<h4 id="preservation-2">Preservation</h4>
<p><strong>Theorem</strong> (preservation): if $\Gamma\vdash t : T$ and $t \longrightarrow t’$, then $\Gamma\vdash t’ : T’$ for some $T’ <: T$.</p>
<p>But preservation doesn’t actually hold as stated. Because we allow casts to go both up and down, we can upcast to <code class="highlighter-rouge">Object</code> before downcasting to another, unrelated type; after a few reduction steps, this can leave a cast $(C)(\text{new } D(\bar{v}))$ where $C$ and $D$ are unrelated, which neither cast rule can type. Because FJ must model Java, we need to introduce an additional rule for this case, which types such “stupid casts” but emits a “stupid warning”.</p>
<h2 id="foundations-of-scala">Foundations of Scala</h2>
<h3 id="modeling-lists">Modeling Lists</h3>
<p>If we’d like to apply everything we’ve learned so far to model Scala, we’ll run into problems. Say we’d like to model a <code class="highlighter-rouge">List</code>; immediately, we run into these problems:</p>
<ul>
<li>It’s parameterized</li>
<li>It’s recursive</li>
<li>It can be invariant or covariant</li>
</ul>
<p>To solve this, we need a way to express type constructors:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="o">*</span> <span class="c1">// kind of normal types (Boolean, Int, ...)
</span><span class="o">*</span> <span class="o">-></span> <span class="o">*</span> <span class="c1">// unary type constructor: something that takes a type, returns one
</span><span class="o">*</span> <span class="o">-></span> <span class="o">*</span> <span class="o">-></span> <span class="o">*</span>
<span class="o">...</span> </pre></td></tr></tbody></table></code></pre></figure>
<p>We need some way to express these, so we’ll introduce $\mu$, which works like $\lambda$ but for types. This allows us to have constructors for recursive types, $\mu t.\ T(t)$. While this is possible, it introduces problems for dealing with subtyping and equality (e.g. if <code class="highlighter-rouge">T</code> is defined as $\mu t.\ \text{Int} \rightarrow t$, how do <code class="highlighter-rouge">T</code> and <code class="highlighter-rouge">Int -> T</code> relate?).</p>
<p>We can deal with variance by expressing definition site variance as use-site variance, using Java wildcards:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre><span class="c1">// definition site variance:
</span><span class="k">trait</span> <span class="nc">List</span><span class="o">[</span><span class="kt">+T</span><span class="o">]</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
<span class="k">trait</span> <span class="nc">Function1</span><span class="o">[</span><span class="kt">-T</span>, <span class="kt">+U</span><span class="o">]</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
<span class="nc">List</span><span class="o">[</span><span class="kt">C</span><span class="o">]</span>
<span class="nc">Function1</span><span class="o">[</span><span class="kt">D</span>, <span class="kt">E</span><span class="o">]</span>
<span class="c1">// use-site variance:
</span><span class="k">trait</span> <span class="nc">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span> <span class="o">...</span> <span class="o">}</span>
<span class="k">trait</span> <span class="nc">Function1</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">]</span>
<span class="nc">List</span><span class="o">[</span><span class="k">_</span> <span class="k"><:</span> <span class="kt">C</span><span class="o">]</span>
<span class="nc">Function1</span><span class="o">[</span><span class="k">_</span> <span class="k">>:</span> <span class="kt">D</span>, <span class="k">_</span> <span class="k"><:</span> <span class="kt">E</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p><code class="highlighter-rouge">Function1[_ >: D, _ <: E]</code> is the type of functions from some (unknown) supertype of <code class="highlighter-rouge">D</code> to some (unknown) subtype of <code class="highlighter-rouge">E</code>, which corresponds to an existential type. This is one possible way of modeling it, but it gets messy rather quickly. Can we find a nicer way of expressing this?</p>
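<p>As a small, self-contained illustration of use-site variance (the names <code class="highlighter-rouge">Fruit</code> and <code class="highlighter-rouge">Apple</code> are illustrative), the caller chooses covariance at the use site with a wildcard:</p>

```scala
class Fruit
class Apple extends Fruit

// The wildcard makes this accept a List of any subtype of Fruit:
def firstFruit(xs: List[_ <: Fruit]): Fruit = xs.head

// A List[Apple] is accepted where a List[_ <: Fruit] is expected:
val f: Fruit = firstFruit(List(new Apple))
```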
<p>Scala has type members, so we can re-formulate the list as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
</pre></td><td class="code"><pre><span class="k">trait</span> <span class="nc">List</span> <span class="o">{</span> <span class="n">self</span> <span class="k">=></span>
<span class="k">type</span> <span class="kt">T</span>
<span class="k">def</span> <span class="n">isEmpty</span><span class="k">:</span> <span class="kt">Boolean</span>
<span class="k">def</span> <span class="n">head</span><span class="k">:</span> <span class="kt">T</span>
<span class="k">def</span> <span class="n">tail</span><span class="k">:</span> <span class="kt">List</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">T</span> <span class="k"><:</span> <span class="kt">self.T</span> <span class="o">}</span> <span class="c1">// refinement handling co-variance
</span><span class="o">}</span>
<span class="k">def</span> <span class="nc">Cons</span><span class="o">[</span><span class="kt">X</span><span class="o">](</span><span class="n">hd</span><span class="k">:</span> <span class="kt">X</span><span class="o">,</span> <span class="n">tl</span><span class="k">:</span> <span class="kt">List</span> <span class="o">{</span><span class="k">type</span> <span class="kt">T</span> <span class="k"><:</span> <span class="kt">X</span><span class="o">})</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">List</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">T</span> <span class="o">=</span> <span class="n">X</span>
<span class="k">def</span> <span class="n">isEmpty</span> <span class="k">=</span> <span class="kc">false</span>
<span class="k">def</span> <span class="n">head</span> <span class="k">=</span> <span class="n">hd</span>
<span class="k">def</span> <span class="n">tail</span> <span class="k">=</span> <span class="n">tl</span>
<span class="o">}</span>
<span class="o">//</span> <span class="n">analogous</span> <span class="k">for</span> <span class="nc">Nil</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Using these path-dependent types <code class="highlighter-rouge">self.T</code>, we can avoid using existential types.</p>
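<p>Here is a runnable sketch of this encoding, following the trait above (renamed to <code class="highlighter-rouge">MyList</code> and <code class="highlighter-rouge">Nil0</code> to avoid clashing with the standard library):</p>

```scala
trait MyList { self =>
  type T
  def isEmpty: Boolean
  def head: T
  def tail: MyList { type T <: self.T } // refinement handling co-variance
}

def Cons[X](hd: X, tl: MyList { type T <: X }): MyList { type T = X } =
  new MyList {
    type T = X
    def isEmpty = false
    def head = hd
    def tail = tl
  }

def Nil0[X]: MyList { type T = X } = new MyList {
  type T = X
  def isEmpty = true
  def head = sys.error("head of empty list")
  def tail = sys.error("tail of empty list")
}

val ints = Cons(1, Nil0[Int]) // ints.head: Int, via the type member
```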
<h3 id="abstract-class">Abstract class</h3>
<p>Abstract types are types without a concrete implementation. They may have an upper and/or lower bound (as <code class="highlighter-rouge">type L >: T <: U</code>).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre><span class="c1">// Abstract type:
</span><span class="k">trait</span> <span class="nc">KeyGen</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Key</span>
<span class="k">def</span> <span class="n">key</span><span class="o">(</span><span class="n">s</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span><span class="k">:</span> <span class="kt">this.Key</span>
<span class="o">}</span>
<span class="c1">// Implementation
</span><span class="k">object</span> <span class="nc">HashKeyGen</span> <span class="k">extends</span> <span class="nc">KeyGen</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Key</span> <span class="o">=</span> <span class="nc">Int</span>
<span class="k">def</span> <span class="n">key</span><span class="o">(</span><span class="n">s</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="n">s</span><span class="o">.</span><span class="n">hashCode</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can reference the <code class="highlighter-rouge">Key</code> type of a term <code class="highlighter-rouge">k</code> as <code class="highlighter-rouge">k.Key</code>, which is a <em>path-dependent</em> type. For instance:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">mapKeys</span><span class="o">(</span><span class="n">k</span><span class="k">:</span> <span class="kt">KeyGen</span><span class="o">,</span> <span class="n">ss</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">String</span><span class="o">])</span><span class="k">:</span> <span class="kt">List</span><span class="o">[</span><span class="kt">k.Key</span><span class="o">]</span> <span class="k">=</span> <span class="n">ss</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">s</span> <span class="k">=></span> <span class="n">k</span><span class="o">.</span><span class="n">key</span><span class="o">(</span><span class="n">s</span><span class="o">))</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The function <code class="highlighter-rouge">mapKeys</code> has a <em>dependent function type</em>. This is an interesting type, because it has an internal dependency: <code class="highlighter-rouge">(k: KeyGen, ss: List[String]) -> List[k.Key]</code>. In Scala 2, we can’t express this directly (we’d have to go through a trait with an <code class="highlighter-rouge">apply</code> method). Scala 3 (dotty) <a href="http://dotty.epfl.ch/docs/reference/new-types/dependent-function-types.html">introduces these dependent function types</a> at the language level; it’s done with a trick similar to what we just saw. In dotty, the intention was to have everything map to a simple object type; this has been formalized in a calculus called DOT, (path-)Dependent Object Types.</p>
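<p>The Scala 2 workaround mentioned above can be sketched as follows: a trait whose <code class="highlighter-rouge">apply</code> has a dependent <em>method</em> type (which Scala 2 does support) stands in for the dependent function type (<code class="highlighter-rouge">MapKeys</code> is our own name for the trait):</p>

```scala
trait KeyGen {
  type Key
  def key(s: String): Key
}

object HashKeyGen extends KeyGen {
  type Key = Int
  def key(s: String) = s.hashCode
}

// The dependency lives in apply's signature: the result type
// mentions the parameter k.
trait MapKeys {
  def apply(k: KeyGen, ss: List[String]): List[k.Key]
}

val mapKeys: MapKeys = new MapKeys {
  def apply(k: KeyGen, ss: List[String]): List[k.Key] = ss.map(s => k.key(s))
}
```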
<h3 id="dot">DOT</h3>
<p>The DOT syntax is described in the <a href="http://lampwww.epfl.ch/~amin/dot/fool.pdf">DOT paper</a>. Types are in uppercase, terms in lowercase. Note that recursive types $\mu (x: T)$ are different from what we’ve talked about, but we’ll get to that later.</p>
<p>As a small technicality, DOT imposes the restriction of only allowing member selection and application on variables, and not on values or full terms. This is equivalent, because we could just assign the value to a variable before selection or application. This way of writing programs is also called <em>administrative normal form</em> (ANF).</p>
<p>To simplify things, we can introduce a programmer-friendly notation with ASCII versions of the DOT constructs (the full correspondence table is in the course slides).</p>
<p>Our calculus does not have generic types, because we can encode them as dependent function types. For instance, the polymorphic type of the twice method, $\forall X.\ (X \rightarrow X) \rightarrow X \rightarrow X$ is represented as:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="o">(</span><span class="n">cX</span><span class="k">:</span> <span class="o">{</span><span class="kt">A:</span> <span class="kt">Nothing..Any</span><span class="o">})</span> <span class="o">-></span> <span class="o">(</span><span class="n">cX</span><span class="o">.</span><span class="n">A</span> <span class="o">-></span> <span class="n">cX</span><span class="o">.</span><span class="n">A</span><span class="o">)</span> <span class="o">-></span> <span class="n">cX</span><span class="o">.</span><span class="n">A</span> <span class="o">-></span> <span class="n">cX</span><span class="o">.</span><span class="n">A</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The <code class="highlighter-rouge">cX</code> parameter is a kind of cell containing a type variable X (hence the name <code class="highlighter-rouge">cX</code>).</p>
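<p>This encoding can be mimicked in plain Scala, which may help build intuition: a “cell” trait with a type member plays the role of the <code class="highlighter-rouge">{A: Nothing..Any}</code> parameter (<code class="highlighter-rouge">TypeCell</code> and <code class="highlighter-rouge">IntCell</code> are our own illustrative names):</p>

```scala
trait TypeCell { type A }

// twice, with its "type parameter" passed as a value:
def twice(cX: TypeCell)(f: cX.A => cX.A)(x: cX.A): cX.A = f(f(x))

// Instantiating the cell at Int:
object IntCell extends TypeCell { type A = Int }

val four = twice(IntCell)(n => n + 2)(0) // applies (_ + 2) twice
```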
<p>As an example, let’s see how Church Booleans could be implemented:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre><span class="c1">// Define an abstract "if type" IFT
</span><span class="k">type</span> <span class="kt">IFT</span> <span class="o">=</span> <span class="o">{</span> <span class="k">if:</span> <span class="o">(</span><span class="kt">x:</span> <span class="o">{</span><span class="kt">A:</span> <span class="kt">Nothing..Any</span><span class="o">})</span> <span class="kt">-></span> <span class="kt">x.A</span> <span class="kt">-></span> <span class="kt">x.A</span> <span class="kt">-></span> <span class="kt">x.A</span> <span class="o">}</span>
<span class="n">let</span> <span class="n">bool</span> <span class="k">=</span>
<span class="n">let</span> <span class="n">boolImpl</span> <span class="k">=</span>
<span class="k">new</span><span class="o">(</span><span class="n">b</span><span class="k">:</span> <span class="o">{</span> <span class="kt">Boolean:</span> <span class="kt">IFT..IFT</span> <span class="o">}</span> <span class="kt">&</span>
<span class="o">{</span> <span class="kt">true:</span> <span class="kt">IFT</span> <span class="o">}</span> <span class="kt">&</span>
<span class="o">{</span> <span class="kt">false:</span> <span class="kt">IFT</span> <span class="o">})</span>
<span class="o">{</span> <span class="nc">Boolean</span> <span class="k">=</span> <span class="nc">IFT</span> <span class="o">}</span> <span class="o">&</span>
<span class="o">{</span> <span class="kc">true</span> <span class="k">=</span> <span class="o">{</span> <span class="k">if</span> <span class="k">=</span> <span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="o">{</span><span class="kt">A:</span> <span class="kt">Nothing..Any</span><span class="o">})</span> <span class="k">=></span> <span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="n">t</span> <span class="o">}</span> <span class="o">&</span>
<span class="o">{</span> <span class="kc">false</span> <span class="k">=</span> <span class="o">{</span> <span class="k">if</span> <span class="k">=</span> <span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="o">{</span><span class="kt">A:</span> <span class="kt">Nothing..Any</span><span class="o">})</span> <span class="k">=></span> <span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="n">f</span> <span class="o">}</span>
<span class="n">in</span> <span class="o">...</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can hide the implementation details of this with a small wrapper to which we apply <code class="highlighter-rouge">boolImpl</code>. This is all a little long-winded, so we can introduce some abbreviations (the exact rules are in the course slides).</p>
<p>With these in place, we can give an abbreviated definition:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="n">let</span> <span class="n">bool</span> <span class="k">=</span>
<span class="k">new</span> <span class="o">{</span> <span class="n">b</span> <span class="k">=></span>
<span class="k">type</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="o">{</span><span class="k">if:</span> <span class="o">(</span><span class="kt">x:</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span> <span class="o">})</span> <span class="kt">-></span> <span class="o">(</span><span class="kt">t:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="kt">-></span> <span class="o">(</span><span class="kt">f:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="kt">-></span> <span class="kt">x.A</span><span class="o">}</span>
<span class="kc">true</span> <span class="k">=</span> <span class="o">{</span><span class="k">if:</span> <span class="o">(</span><span class="kt">x:</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span> <span class="o">})</span> <span class="o">=></span> <span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="n">t</span><span class="o">}</span>
<span class="kc">false</span> <span class="k">=</span> <span class="o">{</span><span class="k">if:</span> <span class="o">(</span><span class="kt">x:</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span> <span class="o">})</span> <span class="o">=></span> <span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="o">(</span><span class="n">f</span><span class="k">:</span> <span class="kt">x.A</span><span class="o">)</span> <span class="k">=></span> <span class="n">f</span><span class="o">}</span>
<span class="o">}</span><span class="k">:</span> <span class="o">{</span> <span class="kt">b</span> <span class="o">=></span> <span class="k">type</span> <span class="kt">Boolean</span><span class="o">;</span> <span class="kc">true</span><span class="k">:</span> <span class="kt">b.Boolean</span><span class="o">;</span> <span class="kc">false</span><span class="k">:</span> <span class="kt">b.Boolean</span> <span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We’ve now introduced all the concepts we need to define the covariant list in DOT (see slides). This concept of hiding the implementation is what nominality amounts to: a nominal type such as <code class="highlighter-rouge">List</code> is simply an abstract type with a hidden implementation. This shows that nominal and structural types aren’t completely separate; we can encode nominal types within a structural setting if we have these constructs.</p>
<h4 id="evaluation-3">Evaluation</h4>
<p>Evaluation is interesting, because we’d like it to keep terms in ANF.</p>
<h3 id="abstract-types">Abstract types</h3>
<p>Abstract types turn out to be both the most interesting and most difficult part of this, so let’s take a quick look at it before we go on.</p>
<p>Abstract types can be used to encode type parameters (as in <code class="highlighter-rouge">List</code>), hide information (as in <code class="highlighter-rouge">KeyGen</code>), and also to resolve some puzzlers like this one:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
</pre></td><td class="code"><pre><span class="k">trait</span> <span class="nc">Animal</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">Food</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span>
<span class="k">trait</span> <span class="nc">Cow</span> <span class="k">extends</span> <span class="nc">Animal</span> <span class="k">with</span> <span class="nc">Food</span> <span class="o">{</span>
<span class="c1">// error: does not override Animal.eat because of contravariance
</span> <span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">Grass</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span>
<span class="k">trait</span> <span class="nc">Lion</span> <span class="k">extends</span> <span class="nc">Animal</span> <span class="o">{</span>
<span class="c1">// error: does not override Animal.eat because of contravariance
</span> <span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">Cow</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span>
<span class="k">trait</span> <span class="nc">Food</span>
<span class="k">trait</span> <span class="nc">Grass</span> <span class="k">extends</span> <span class="nc">Food</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Scala disallows this, but Eiffel, Dart and TypeScript allow it. The trade-off the latter languages choose is modeling power over soundness, though some have eventually come back around and tried to fix this (Dart has a strict mode, Eiffel proposed some data flow analysis, …).</p>
<p>In Scala, this contravariance can be solved with abstract types:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="code"><pre><span class="k">trait</span> <span class="nc">Animal</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Diet</span> <span class="k"><:</span> <span class="kt">Food</span>
<span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">Diet</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span>
<span class="k">trait</span> <span class="nc">Cow</span> <span class="k">extends</span> <span class="nc">Animal</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Diet</span> <span class="k"><:</span> <span class="kt">Grass</span>
<span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">this.Diet</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span>
<span class="k">object</span> <span class="nc">Milka</span> <span class="k">extends</span> <span class="nc">Cow</span> <span class="o">{</span>
<span class="k">type</span> <span class="kt">Diet</span> <span class="o">=</span> <span class="nc">AlpineGrass</span>
<span class="k">def</span> <span class="n">eat</span><span class="o">(</span><span class="n">food</span><span class="k">:</span> <span class="kt">AlpineGrass</span><span class="o">)</span><span class="k">:</span> <span class="kt">Unit</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
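<p>A self-contained version of this pattern can be run directly (the <code class="highlighter-rouge">Food</code>/<code class="highlighter-rouge">Grass</code>/<code class="highlighter-rouge">AlpineGrass</code> hierarchy is declared here to make it compile; the <code class="highlighter-rouge">meals</code> counter is just for demonstration):</p>

```scala
trait Food
trait Grass extends Food
trait AlpineGrass extends Grass

trait Animal {
  type Diet <: Food
  def eat(food: Diet): Unit
}

trait Cow extends Animal {
  type Diet <: Grass // tighten the bound, instead of overriding eat
}

object Milka extends Cow {
  type Diet = AlpineGrass
  var meals = 0
  def eat(food: AlpineGrass): Unit = meals += 1
}

Milka.eat(new AlpineGrass {}) // fine: AlpineGrass is Milka's Diet
// Milka.eat(new Food {})     // does not compile: Food is not AlpineGrass
```

<p>Each subtype narrows <code class="highlighter-rouge">Diet</code> instead of contravariantly overriding <code class="highlighter-rouge">eat</code>, so the compiler statically rejects feeding a cow anything but grass.</p>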
<h3 id="progress-and-preservation">Progress and preservation</h3>
<p>Progress, as previously stated, actually fails here. Here’s a counterexample:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="n">t</span> <span class="k">=</span> <span class="n">let</span> <span class="n">x</span> <span class="k">=</span> <span class="o">(</span><span class="n">y</span><span class="k">:</span> <span class="kt">Bool</span><span class="o">)</span> <span class="k">=></span> <span class="n">y</span> <span class="n">in</span> <span class="n">x</span></pre></td></tr></tbody></table></code></pre></figure>
<p>But we can extend our definition of progress. Instead of values, we’ll just want to get answers, which we define as variables, values or let-bindings.</p>
<p>But this is difficult (and it’s what took 8 years to prove), because we always need inversion lemmas, and the subtyping relation is user-definable. This is not a problem for simple type bounds:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">type</span> <span class="kt">T</span> <span class="k">>:</span> <span class="kt">S</span> <span class="k"><:</span> <span class="kt">U</span></pre></td></tr></tbody></table></code></pre></figure>
<p>But it becomes complex for non-sensical bounds:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">type</span> <span class="kt">T</span> <span class="k">>:</span> <span class="kt">Any</span> <span class="k"><:</span> <span class="kt">Nothing</span></pre></td></tr></tbody></table></code></pre></figure>
<p>It would mean that <code class="highlighter-rouge">Any <: Nothing</code>, so by transitivity all types are subtypes of each other. This is bad because it means that inversion fails, as we cannot tell anything from the types anymore.</p>
<p>We might say that this should be easy to disallow in the compiler, but it isn’t. The compiler cannot always tell.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="c1">// S and T are both good:
</span><span class="k">type</span> <span class="kt">S</span> <span class="o">=</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span><span class="o">;</span> <span class="k">type</span> <span class="kt">B</span> <span class="k">>:</span> <span class="kt">A</span> <span class="k"><:</span> <span class="kt">Bot</span> <span class="o">}</span>
<span class="k">type</span> <span class="kt">T</span> <span class="o">=</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span> <span class="k">>:</span> <span class="kt">Top</span> <span class="k"><:</span> <span class="kt">B</span><span class="o">;</span> <span class="k">type</span> <span class="kt">B</span> <span class="o">}</span>
<span class="c1">// But their intersection is bad
</span><span class="k">type</span> <span class="kt">S</span> <span class="kt">&</span> <span class="kt">T</span> <span class="o">=</span><span class="k">=</span> <span class="o">{</span> <span class="k">type</span> <span class="kt">A</span> <span class="k">>:</span> <span class="kt">Top</span> <span class="k"><:</span> <span class="kt">Bot</span><span class="o">;</span> <span class="k">type</span> <span class="kt">B</span> <span class="k">>:</span> <span class="kt">Top</span> <span class="k"><:</span> <span class="kt">Bot</span> <span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Bad bounds can arise from intersecting types with good bounds. This isn’t too bad in and of itself, as we could just check all intersection types, written or inferred, for these bad bounds. But there’s a final problem: bad bounds can arise at run-time. By preservation, if $\Gamma\vdash t: T$ and $t\longrightarrow u$ then $\Gamma\vdash u: T$. Because of subsumption, $u$ may also have a type $S$ which is a true subtype of $T$, and that type $S$ could have bad bounds (from an intersection for instance).</p>
<p>To solve this, the idea is to reason about environments $\Gamma$ arising from an actual computation in the preservation rule. This environment corresponds to an evaluated <code class="highlighter-rouge">let</code> binding, binding variables to values. Values are guaranteed to have good bounds because all type members are aliases.</p>
<p>In other words, the <code class="highlighter-rouge">let</code> prefix acts like a store, a set of bindings $x = v$ of variables to values. Evaluation will then relate terms <em>and</em> stores:</p>
<script type="math/tex; mode=display">s \mid t \longrightarrow s' \mid t'</script>
<p>For the theorems of progress and preservation, we need to relate environment and store. We’ll introduce a definition:</p>
<blockquote>
<p>An environment $\Gamma$ <em>corresponds</em> to a store $s$, written $\Gamma \sim s$ if for every binding $x=v$ there is an entry $\Gamma\vdash x: T$ where $\Gamma \vdash_{!} v: T$.</p>
</blockquote>
<p>Here $\vdash_{!}$ denotes an exact typing relation, whose typing derivation ends with <code class="highlighter-rouge">All-I</code> or <code class="highlighter-rouge">{}-I</code> (so no subsumption or structural rules).</p>
<p>By restating our theorems as follows, we can then prove them.</p>
<ul>
<li><strong>Preservation</strong>: If $\Gamma\vdash t: T$ and $\Gamma\sim s$ and $s \mid t \longrightarrow s’ \mid t’$ then there exists an environment $\Gamma’ \supseteq \Gamma$ such that $\Gamma’ \vdash t’ : T$ and $\Gamma’ \sim s’$.</li>
<li><strong>Progress</strong>: if $\Gamma\vdash t: T$ and $\Gamma\sim s$ then either $t$ is a normal form, or $s\mid t \longrightarrow s’ \mid t’$ for some store $s’$ and term $t’$.</li>
</ul>
<div class="footnotes">
<ol>
<li id="fn:in-relation-notation">
<p>$(t, C) \in \text{Consts}$ is equivalent to $\text{Consts}(t) = C$ <a href="#fnref:in-relation-notation" class="reversefootnote">↩</a></p>
</li>
<li id="fn:well-typed-store-notation">
<p>Recall that this notation is used to say a store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$, as defined in the section on <a href="#safety">safety in STLC with stores</a>. <a href="#fnref:well-typed-store-notation" class="reversefootnote">↩</a></p>
</li>
<li id="fn:inversion-lemma-evaluation-lambda">
<p>Both the course and TAPL only specify the inversion lemma for evaluation <a href="#inversion-lemma">for the toy language with if-else and booleans</a>, but the same reasoning applies to get an inversion lemma for evaluation for pure lambda calculus, in which three rules can be used: $\ref{eq:e-app1}$, $\ref{eq:e-app2}$ and $\ref{eq:e-appabs}$. <a href="#fnref:inversion-lemma-evaluation-lambda" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
Aside: structural vs. declared subtyping Properties of subtyping Safety Inversion lemma for subtyping Inversion lemma for typing Preservation Subtyping features Casting Variants Covariance Invariance Algorithmic subtyping Objects Dynamic dispatch Encapsulation Inheritance This Using this Featherweight Java Structural vs. Nominal type systems Representing objects Syntax Evaluation Typing Properties Progress Preservation Foundations of Scala Modeling Lists Abstract class DOT Evaluation Abstract types Progress and preservation ⚠ Work in progressWriting a parser with parser combinatorsIn Scala, you can (ab)use the operator overload to create an embedded DSL (EDSL) for grammars. While a grammar may look as follows in a grammar description language (Bison, Yak, ANTLR, …):123Expr ::= Term {'+' Term | '−' Term}Term ::= Factor {'∗' Factor | '/' Factor}Factor ::= Number | '(' Expr ')'In Scala, we can model it as follows:123def expr: Parser[Any] = term ~ rep("+" ~ term | "−" ~ term)def term: Parser[Any] = factor ~ rep("∗" ~ factor | "/" ~ factor)def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLitThis is perhaps a little less elegant, but allows us to encode it directly into our language, which is often useful for interop.The ~, |, rep and opt are parser combinators. 
These are primitives with which we can construct a full parser for the grammar of our choice.
Boilerplate
First, let’s define a class ParseResult[T] as an ad-hoc monad; parsing can either succeed or fail (ParseResult must be covariant in T so that Failure, which extends ParseResult[Nothing], can be used as a result of any parse):

sealed trait ParseResult[+T]
case class Success[T](result: T, in: Input) extends ParseResult[T]
case class Failure(msg: String, in: Input) extends ParseResult[Nothing]

👉 Nothing is the bottom type in Scala; it contains no members, and nothing can extend it
Let’s also define the tokens produced by the lexer (which we won’t define) as case classes extending Token:

sealed trait Token
case class Keyword(chars: String) extends Token
case class NumericLit(chars: String) extends Token
case class StringLit(chars: String) extends Token
case class Identifier(chars: String) extends Token

Input into the parser is then a lazy stream of tokens (with positions for error diagnostics, which we’ll omit here):

type Input = Reader[Token]

We can then define a standard, sample parser which looks as follows on the type-level:

class StandardTokenParsers {
  type Parser = Input => ParseResult
}

The basic idea
For each language (defined by a grammar symbol S), define a function f that, given an input stream i (with tail i'): if a prefix of i is in S, return Success(Pair(x, i')), where x is a result for S; otherwise, return Failure(msg, i), where msg is an error message string. The first case is called success, the second failure. We can compose operations on this somewhat conveniently, like we would on a monad (like Option).
Simple parser primitives
All of the above boilerplate allows us to define a parser, which succeeds if the first token in the input satisfies some given predicate pred.
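As a minimal, self-contained sketch of this idea, here is a parser for a one-symbol language using characters instead of tokens (the names digit, ok and err are illustrative, not part of the lecture's API):

```scala
// Input is a stream of characters here, rather than tokens
type Input = List[Char]

sealed trait ParseResult[+T]
case class Success[T](result: T, rest: Input) extends ParseResult[T]
case class Failure(msg: String, in: Input) extends ParseResult[Nothing]

// A parser for the language S = { single digits }: if a prefix of the input
// is in S, succeed with that prefix and the remaining input; otherwise fail.
def digit(in: Input): ParseResult[Char] = in match {
  case c :: rest if c.isDigit => Success(c, rest)
  case _                      => Failure("digit expected", in)
}

val ok  = digit("1+2".toList)  // succeeds, consuming '1'
val err = digit("+12".toList)  // fails, leaving the input untouched
```

Note how the failure case returns the original input unchanged, which is what lets a combinator like | try the alternative parser from the same position.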
When it succeeds, it reads the token string, and splits the input there.

def token(kind: String)(pred: Token => Boolean) = new Parser[String] {
  def apply(in: Input) =
    if (pred(in.head)) Success(in.head.chars, in.tail)
    else Failure(kind + " expected ", in)
}

We can use this to define a keyword parser:

implicit def keyword(chars: String) = token("'" + chars + "'") {
  case Keyword(chars1) => chars == chars1
  case _ => false
}

Marking it as implicit allows us to write keywords as normal strings, where we can omit the keyword call (this helps us simplify the notation in our DSL; we can write "if" instead of keyword("if")).
We can make other parsers for our other case classes quite simply:

def numericLit = token("number")(_.isInstanceOf[NumericLit])
def stringLit = token("string literal")(_.isInstanceOf[StringLit])
def ident = token("identifier")(_.isInstanceOf[Identifier])

Parser combinators
We are going to define the following parser combinators:
~: sequential composition
<~, ~>: sequential composition, keeping left / right only
|: alternative
opt(X): option (like a ? quantifier in a regex)
rep(X): repetition (like a * quantifier in a regex)
repsep(P, Q): interleaved repetition
^^: result conversion (like a map on an Option)
^^^: constant result (like a map on an Option, but returning a constant value regardless of result)
But first, we’ll write some very basic parser combinators: success and failure, that respectively always succeed and always fail:

def success[T](result: T) = new Parser[T] {
  def apply(in: Input) = Success(result, in)
}
def failure(msg: String) = new Parser[Nothing] {
  def apply(in: Input) = Failure(msg, in)
}

All of the above are methods on a Parser[T] class.
Thanks to infix space notation in Scala, we can denote x.y(z) as x y z, which allows us to simplify our DSL notation; for instance A ~ B corresponds to A.~(B).

abstract class Parser[T] {
  // An abstract method that defines the parser function
  def apply(in: Input): ParseResult

  def ~[U](rhs: Parser[U]) = new Parser[T ~ U] {
    def apply(in: Input) = Parser.this(in) match {
      case Success(x, tail) =>
        rhs(tail) match {
          case Success(y, rest) => Success(new ~(x, y), rest)
          case failure => failure
        }
      case failure => failure
    }
  }

  def |(rhs: => Parser[T]) = new Parser[T] {
    def apply(in: Input) = Parser.this(in) match {
      case s1 @ Success(_, _) => s1
      case failure => rhs(in)
    }
  }

  def ^^[U](f: T => U) = new Parser[U] {
    def apply(in: Input) = Parser.this(in) match {
      case Success(x, tail) => Success(f(x), tail)
      case x => x
    }
  }

  def ^^^[U](r: U): Parser[U] = ^^(x => r)
}

👉 In Scala, T ~ U is syntactic sugar for ~[T, U], which is the type of the case class we’ll define below
For the ~ combinator, when everything works, we’re using ~, a case class that is equivalent to Pair, but prints the way we want to and allows for the concise type-level notation above.

case class ~[T, U](_1: T, _2: U) {
  override def toString = "(" + _1 + " ~ " + _2 + ")"
}

At this point, we thus have two different meanings for ~: a function ~ that produces a Parser, and the ~(a, b) case class pair that this parser returns (all of this is encoded in the function signature of the ~ function).
Note that the | combinator takes the right-hand side parser as a call-by-name argument. This is because we don’t want to evaluate it unless it is strictly needed; that is, if the left-hand side fails.
^^ is like a map operation on Option; P ^^ f succeeds iff P succeeds, in which case it applies the transformation f on the result of P.
Otherwise, it fails.
Shorthands
We can now define shorthands for common combinations of parser combinators:

def opt[T](p: Parser[T]): Parser[Option[T]] =
  p ^^ Some | success(None)
def rep[T](p: Parser[T]): Parser[List[T]] =
  p ~ rep(p) ^^ { case x ~ xs => x :: xs } | success(Nil)
def repsep[T, U](p: Parser[T], q: Parser[U]): Parser[List[T]] =
  p ~ rep(q ~> p) ^^ { case r ~ rs => r :: rs } | success(Nil)

Note that none of the above can fail. They may, however, return None or Nil wrapped in success.
As an exercise, we can implement the rep1(P) parser combinator, which corresponds to the + regex quantifier:

def rep1[T](p: Parser[T]): Parser[List[T]] =
  p ~ rep(p) ^^ { case x ~ xs => x :: xs }

Example: JSON parser
Let’s define a JSON parser. Scala’s parser combinator library has a StandardTokenParsers that gives us a variety of utility methods for lexing, like lexical.delimiters, lexical.reserved, stringLit and numericLit.

object JSON extends StandardTokenParsers {
  lexical.delimiters += ("{", "}", "[", "]", ":", ",")
  lexical.reserved += ("null", "true", "false")

  // Return Map
  def obj: Parser[Any] =
    "{" ~> repsep(member, ",") <~ "}" ^^ (ms => Map() ++ ms)

  // Return List
  def arr: Parser[Any] =
    "[" ~> repsep(value, ",") <~ "]"

  // Return name/value pair:
  def member: Parser[Any] =
    stringLit ~ ":" ~ value ^^ { case name ~ ":" ~ value => (name, value) }

  // Return correct Scala type
  def value: Parser[Any] =
    obj | arr | stringLit | numericLit ^^ (_.toInt) |
    "null" ^^^ null | "true" ^^^ true | "false" ^^^ false
}

The trouble with left-recursion
Parser combinators work top-down and therefore do not allow for left-recursion.
For example, the following would go into an infinite loop, where the parser keeps recursing into expr without consuming any token:

def expr = expr ~ "-" ~ term

Let’s take a look at an arithmetic expression parser:

object Arithmetic extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "+", "−", "∗", "/")
  def expr: Parser[Any] = term ~ rep("+" ~ term | "−" ~ term)
  def term: Parser[Any] = factor ~ rep("∗" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLit
}

This definition of expr, namely term ~ rep("+" ~ term | "−" ~ term), produces a right-leaning tree. For instance, 1 - 2 - 3 produces 1 ~ List("-" ~ 2, "-" ~ 3).
The solution is to combine calls to rep with a final foldLeft on the list:

object Arithmetic extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "+", "−", "∗", "/")
  def expr: Parser[Any] = term ~ rep("+" ~ term | "−" ~ term) ^^ reduceList
  def term: Parser[Any] = factor ~ rep("∗" ~ factor | "/" ~ factor) ^^ reduceList
  def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLit

  private def reduceList(list: Int ~ List[String ~ Int]): Int = list match {
    case x ~ xs => (xs foldLeft x)(reduce)
  }

  private def reduce(x: Int, r: String ~ Int) = r match {
    case "+" ~ y => x + y
    case "−" ~ y => x − y
    case "∗" ~ y => x ∗ y
    case "/" ~ y => x / y
    case _ => throw new MatchError("illegal case: " + r)
  }
}

👉 It used to be that the standard library contained parser combinators, but those are now a separate module.
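To see what the final foldLeft computes, here is the same reduction in plain Scala, without the parser machinery (ordinary tuples stand in for the ~ pairs):

```scala
// The parse of 1 - 2 - 3 arrives as a head term plus a list of (op, term)
// pairs, mirroring the result shape of term ~ rep("-" ~ term).
val head = 1
val tail = List(("-", 2), ("-", 3))

// reduceList's left fold: combine left-to-right, giving left associativity
val result = tail.foldLeft(head) {
  case (acc, ("+", y)) => acc + y
  case (acc, ("-", y)) => acc - y
  case (acc, (op, _))  => sys.error("illegal case: " + op)
}
// (1 - 2) - 3 = -4, not the right-associative 1 - (2 - 3) = 2
```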
This module contains a chainl (chain-left) method that reduces after a rep for you.
Arithmetic expressions — abstract syntax and proof principles
This section follows Chapter 3 in TAPL.
Basics of induction
Ordinary induction is simply: Suppose P is a predicate on natural numbers. Then: If P(0) and, for all i, P(i) implies P(i + 1), then P(n) holds for all n.
We can also do complete induction: Suppose P is a predicate on natural numbers. Then: If for each natural number n, given P(i) for all i < n we can show P(n), then P(n) holds for all n.
It proves exactly the same thing as ordinary induction; it is simply a restated version. They’re interderivable: assuming one, we can prove the other. Which one to use is simply a matter of style or convenience. We’ll see some more equivalent styles as we go along.
Mathematical representation of syntax
Let’s assume the following grammar:

t ::= true
      false
      if t then t else t
      0
      succ t
      pred t
      iszero t

What does this really define? A few suggestions: a set of character strings, a set of token lists, or a set of abstract syntax trees. It depends on how you read it; a grammar like the one above contains information about all three.
However, we are mostly interested in the ASTs. The above grammar is therefore called an abstract grammar. Its main purpose is to suggest a mapping from character strings to trees.
We won’t be too strict with it: for instance, we’ll freely use parentheses to disambiguate what tree we mean to describe, even though they’re not strictly supported by the grammar. What matters to us here aren’t strict implementation semantics, but rather that we have a framework to talk about ASTs. For our purposes, we’ll consider that two terms producing the same AST are basically the same; still, we’ll distinguish terms that only have the same evaluation result, as they don’t necessarily have the same AST.
How can we express our grammar as mathematical expressions?
A grammar describes the legal set of terms in a program by offering a recursive definition. While recursive definitions may seem obvious and simple to a programmer, we have to go through a few hoops to make sense of them mathematically.
Mathematical representation 1
We can use a set $\mathcal{T}$ of terms. The grammar is then the smallest set such that: $\left\{ \text{true}, \text{false}, 0 \right\} \subseteq \mathcal{T}$; if $t_1 \in \mathcal{T}$ then $\left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \right\} \subseteq \mathcal{T}$; if $t_1, t_2, t_3 \in \mathcal{T}$ then we also have $\text{if } t_1 \text{ then } t_2 \text{ else } t_3 \in \mathcal{T}$.
Mathematical representation 2
We can also write this somewhat more graphically, as inference rules:

$$\frac{}{\text{true} \in \mathcal{T}} \qquad \frac{}{\text{false} \in \mathcal{T}} \qquad \frac{}{0 \in \mathcal{T}}$$

$$\frac{t_1 \in \mathcal{T}}{\text{succ } t_1 \in \mathcal{T}} \qquad \frac{t_1 \in \mathcal{T}}{\text{pred } t_1 \in \mathcal{T}} \qquad \frac{t_1 \in \mathcal{T}}{\text{iszero } t_1 \in \mathcal{T}} \qquad \frac{t_1 \in \mathcal{T} \quad t_2 \in \mathcal{T} \quad t_3 \in \mathcal{T}}{\text{if } t_1 \text{ then } t_2 \text{ else } t_3 \in \mathcal{T}}$$

This is exactly equivalent to representation 1, but we have just introduced a different notation. Note that “the smallest set closed under…” is often not stated explicitly, but implied.
Mathematical representation 3
Alternatively, we can build up our set of terms as an infinite union:

$$\mathcal{S}_0 = \emptyset$$

$$\mathcal{S}_{i+1} = \left\{ \text{true}, \text{false}, 0 \right\} \cup \left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \mid t_1 \in \mathcal{S}_i \right\} \cup \left\{ \text{if } t_1 \text{ then } t_2 \text{ else } t_3 \mid t_1, t_2, t_3 \in \mathcal{S}_i \right\}$$

We can thus build our final set as follows:

$$\mathcal{S} = \bigcup_i \mathcal{S}_i$$

Note that we can “pull out” the definition into a generating function $F$:

$$\mathcal{S}_{i+1} = F(\mathcal{S}_i)$$

The generating function is thus defined as:

$$F(U) = \left\{ \text{true}, \text{false}, 0 \right\} \cup \left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \mid t_1 \in U \right\} \cup \left\{ \text{if } t_1 \text{ then } t_2 \text{ else } t_3 \mid t_1, t_2, t_3 \in U \right\}$$

It takes a set of terms $U$ as input and produces “terms justified by $U$” as output; that is, all terms that have the items of $U$ as subterms.
The set $U$ is said to be closed under F or F-closed if $F(U) \subseteq U$.
The set of terms $T$ as defined above is the smallest F-closed set. If $O$ is another F-closed set, then $T \subseteq O$.
Comparison of the representations
We’ve seen essentially two ways of defining the set (as representation 1 and 2 are equivalent, but with different notation): The smallest set that is closed under certain rules. This is compact and easy to read. The limit of a series of sets.
This gives us an induction principle on which we can prove things on terms by induction.
The first one defines the set “from above”, by intersecting F-closed sets.
The second one defines it “from below”, by starting with $\emptyset$ and getting closer and closer to being F-closed.
These are equivalent (we won’t prove it, but Proposition 3.2.6 in TAPL does so), but can serve different uses in practice.
Induction on terms
First, let’s define depth: the depth of a term $t$ is the smallest $i$ such that $t\in\mathcal{S_i}$.
The way we defined $\mathcal{S}_i$, it gets larger and larger for increasing $i$; the depth of a term $t$ gives us the step at which $t$ is introduced into the set.
We see that if a term $t$ is in $\mathcal{S}_i$, then all of its immediate subterms must be in $\mathcal{S}_{i-1}$, meaning that they must have smaller depth.
This justifies the principle of induction on terms, or structural induction. Let P be a predicate on a term:
If, for each term s, given P(r) for all immediate subterms r of s we can show P(s), then P(t) holds for all t.
All this says is that if we can prove the induction step from subterms to terms (under the induction hypothesis), then we have proven the induction.
We can also express this structural induction using generating functions, which we introduced previously.
Suppose T is the smallest F-closed set.
If, for each set U, from the assumption "P(u) holds for every u ∈ U", we can show that "P(v) holds for every v ∈ F(U)", then P(t) holds for all t ∈ T.
Why can we use this? We assumed that $T$ was the smallest F-closed set, which means that $T\subseteq O$ for any other F-closed set $O$. Showing the pre-condition (“for each set $U$, from the assumption…”) amounts to showing that the set of all terms satisfying $P$ (call it $O$) is itself an F-closed set. Since $T\subseteq O$, every element of $T$ satisfies $P$.
Inductive function definitions
An inductive definition is used to define the elements in a set recursively, as we have done above.
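As a small Scala sketch of the “from below” construction: model terms as an ADT, implement the generating function F, and enumerate $\mathcal{S}_1$ and $\mathcal{S}_2$ (the names here are ours, not the course's):

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// One application of the generating function F: constants, unary forms over
// U, and conditionals over all triples drawn from U.
def F(u: Set[Term]): Set[Term] =
  Set[Term](True, False, Zero) ++
    u.flatMap(t => Set[Term](Succ(t), Pred(t), IsZero(t))) ++
    (for (t1 <- u; t2 <- u; t3 <- u) yield If(t1, t2, t3))

val s1 = F(Set.empty) // S_1: just the three constants
val s2 = F(s1)        // S_2: 3 constants + 9 unary terms + 27 conditionals
```

Each $\mathcal{S}_i$ is finite, and the chain is increasing: $\mathcal{S}_1 \subseteq \mathcal{S}_2 \subseteq \dots$, with every term appearing at the step given by its depth.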
The recursion theorem states that a well-formed inductive definition defines a function. To understand what being well-formed means, let’s take a look at some examples.
Let’s define our grammar function a little more formally. Constants are the basic values that can’t be expanded further; in our example, they are true, false, 0. As such, the set of constants appearing in a term $t$, written $\text{Consts}(t)$, is defined recursively as follows:

$$\begin{align*}
\text{Consts}(\text{true}) &= \set{\text{true}} \\
\text{Consts}(\text{false}) &= \set{\text{false}} \\
\text{Consts}(0) &= \set{0} \\
\text{Consts}(\text{succ } t_1) &= \text{Consts}(t_1) \\
\text{Consts}(\text{pred } t_1) &= \text{Consts}(t_1) \\
\text{Consts}(\text{iszero } t_1) &= \text{Consts}(t_1) \\
\text{Consts}(\text{if } t_1 \text{ then } t_2 \text{ else } t_3) &= \text{Consts}(t_1) \cup \text{Consts}(t_2) \cup \text{Consts}(t_3)
\end{align*}$$

This seems simple, but these semantics aren’t perfect. First off, a mathematical definition simply assigns a convenient name to some previously known thing. But here, we’re defining the thing in terms of itself, recursively. And the semantics above also allow us to define ill-formed inductive definitions:

$$\begin{align*}
\text{BadConsts}(\text{true}) &= \set{\text{true}} \\
\text{BadConsts}(\text{false}) &= \set{\text{false}} \\
\text{BadConsts}(0) &= \set{0} \\
\text{BadConsts}(0) &= \emptyset \\
\text{BadConsts}(\text{succ } t_1) &= \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{pred } t_1) &= \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{iszero } t_1) &= \text{BadConsts}(\text{iszero } (\text{iszero } t_1))
\end{align*}$$

The last rule produces infinitely large rules (if we implemented it, we’d expect some kind of stack overflow). We’re missing the rules for if-statements, and we have a useless rule for 0, producing empty sets.
How do we tell the difference between a well-formed inductive definition, and an ill-formed one as above? What is well-formedness anyway?
What is a function?
A relation over $T, U$ is a subset of $T \times U$, where the Cartesian product is defined as:

$$T \times U = \set{(t, u) \mid t \in T, u \in U}$$

A function $f$ from $A$ (domain) to $B$ (co-domain) can be viewed as a two-place relation, albeit with two additional properties: it is total, $\forall a \in A, \exists b \in B : (a, b) \in f$; and it is deterministic, $(a, b_1) \in f, (a, b_2) \in f \implies b_1 = b_2$.
Totality ensures that the domain $A$ is covered, while being deterministic just means that the function always produces the same result for a given input.
Induction example 1
As previously stated, $\text{Consts}$ is a relation. It maps terms (A) into the set of constants that they contain (B). The recursion theorem states that it is also a function. The proof is as follows.
$\text{Consts}$ is total and deterministic: for each term $t$ there is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$.
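The well-formed definition of Consts corresponds directly to a structurally recursive Scala function over the term ADT (a sketch with our own names; the ADT mirrors the grammar):

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// One clause per syntactic form, each recursing only on immediate subterms:
// this is exactly what makes the inductive definition well-formed.
def consts(t: Term): Set[Term] = t match {
  case True | False | Zero => Set(t)
  case Succ(t1)            => consts(t1)
  case Pred(t1)            => consts(t1)
  case IsZero(t1)          => consts(t1)
  case If(t1, t2, t3)      => consts(t1) ++ consts(t2) ++ consts(t3)
}
```

A BadConsts-style clause like `case IsZero(t1) => badConsts(IsZero(IsZero(t1)))` would recurse on a *larger* term, and indeed would overflow the stack, matching the discussion above.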
The proof is done by induction on $t$.
To be able to apply the induction principle for terms, we must first show that for an arbitrary term $t$, under the following induction hypothesis: for each immediate subterm $s$ of $t$, there is exactly one set of terms $C_s$ such that $(s, C_s) \in \text{Consts}$.
Then the following needs to be proven as an induction step: there is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$.
We proceed by cases on $t$:
If $t$ is $0$, $\text{true}$ or $\text{false}$, we can immediately see from the definition of $\text{Consts}$ that there is exactly one set of terms $C = \set{t}$ such that $(t, C) \in \text{Consts}$. This constitutes our base case.
If $t$ is $\text{succ } t_1$, $\text{pred } t_1$ or $\text{iszero } t_1$, the immediate subterm of $t$ is $t_1$, and the induction hypothesis tells us that there is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$. But then it is clear from the definition that there is exactly one set of terms $C = C_1$ such that $(t, C) \in \text{Consts}$.
If $t$ is $\ifelse$, the induction hypothesis tells us: there is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$; there is exactly one set of terms $C_2$ such that $(t_2, C_2) \in \text{Consts}$; and there is exactly one set of terms $C_3$ such that $(t_3, C_3) \in \text{Consts}$. It is clear from the definition of $\text{Consts}$ that there is exactly one set $C = C_1 \cup C_2 \cup C_3$ such that $(t, C) \in \text{Consts}$.
This proves that $\text{Consts}$ is indeed a function.
But what about $\text{BadConsts}$? It is also a relation, but it isn’t a function. For instance, we have $\text{BadConsts}(0) = \set{0}$ and $\text{BadConsts}(0) = \emptyset$, which violates determinism.
To reformulate this in terms of the above, there are two sets $C$ such that $(0, C) \in \text{BadConsts}$, namely $C = \set{0}$ and $C = \emptyset$.
Note that there are many other problems with $\text{BadConsts}$, but this is sufficient to prove that it isn’t a function.
Induction example 2
Let’s introduce another inductive definition, the size of a term:

$$\begin{align*}
\text{size}(\text{true}) &= 1 \\
\text{size}(\text{false}) &= 1 \\
\text{size}(0) &= 1 \\
\text{size}(\text{succ } t_1) &= \text{size}(t_1) + 1 \\
\text{size}(\text{pred } t_1) &= \text{size}(t_1) + 1 \\
\text{size}(\text{iszero } t_1) &= \text{size}(t_1) + 1 \\
\text{size}(\text{if } t_1 \text{ then } t_2 \text{ else } t_3) &= \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) + 1
\end{align*}$$

We’d like to prove that the number of distinct constants in a term is at most the size of the term. In other words, that $\abs{\text{Consts}(t)} \le \text{size}(t)$.
The proof is by induction on $t$:
If $t$ is a constant ($t=\text{true}$, $t=\text{false}$ or $t=0$), the proof is immediate. For constants, the number of constants and the size are both one: $\abs{\text{Consts}(t)} = \abs{\set{t}} = 1 = \text{size}(t)$.
If $t = \text{succ}\ t_1$, $t = \text{pred}\ t_1$ or $t = \text{iszero}\ t_1$, by the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$. We can then prove the proposition as follows: $\abs{\text{Consts}(t)} = \abs{\text{Consts}(t_1)} \overset{\text{IH}}{\le} \text{size}(t_1) < \text{size}(t_1) + 1 = \text{size}(t)$.
If $t$ is an if-statement, $t = \ifelse$, by the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$, $\abs{\text{Consts}(t_2)} \le \text{size}(t_2)$ and $\abs{\text{Consts}(t_3)} \le \text{size}(t_3)$. We can then prove the proposition as follows:

$$\abs{\text{Consts}(t)} = \abs{\text{Consts}(t_1) \cup \text{Consts}(t_2) \cup \text{Consts}(t_3)} \le \abs{\text{Consts}(t_1)} + \abs{\text{Consts}(t_2)} + \abs{\text{Consts}(t_3)} \overset{\text{IH}}{\le} \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) < \text{size}(t)$$
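The size function, and the property just proven, can be checked on concrete terms in Scala (a self-contained sketch; the ADT and names are ours):

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class IsZero(t: Term) extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

def consts(t: Term): Set[Term] = t match {
  case True | False | Zero => Set(t)
  case Succ(t1)            => consts(t1)
  case IsZero(t1)          => consts(t1)
  case If(t1, t2, t3)      => consts(t1) ++ consts(t2) ++ consts(t3)
}

// size counts every node of the AST, so it bounds the number of constants
def size(t: Term): Int = t match {
  case True | False | Zero => 1
  case Succ(t1)            => size(t1) + 1
  case IsZero(t1)          => size(t1) + 1
  case If(t1, t2, t3)      => size(t1) + size(t2) + size(t3) + 1
}

val sample = If(IsZero(Zero), True, Succ(Zero))
```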
We can then prove the proposition as follows: Operational semantics and reasoningEvaluationSuppose we have the following syntax1234t ::= // terms true // constant true false // constant false if t then t else t // conditionalThe evaluation relation $t \longrightarrow t’$ is the smallest relation closed under the following rules.The following are computation rules, defining the “real” computation steps:The following is a congruence rule, defining where the computation rule is applied next:We want to evaluate the condition before the conditional clauses in order to save on evaluation; we’re not sure which one should be evaluated, so we need to know the condition first.DerivationsWe can describe the evaluation logically from the above rules using derivation trees. Suppose we want to evaluate the following (with parentheses added for clarity): if (if true then true else false) then false else true.In an attempt to make all this fit onto the screen, true and false have been abbreviated T and F in the derivation below, and the then keyword has been replaced with a parenthesis notation for the condition.The final statement is a conclusion. We say that the derivation is a witness for its conclusion (or a proof for its conclusion). The derivation records all reasoning steps that lead us to the conclusion.Inversion lemmaWe can introduce the inversion lemma, which tells us how we got to a term.Suppose we are given a derivation $\mathcal{D}$ witnessing the pair $(t, t’)$ in the evaluation relation. 
Then either: If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if true \then t_2 \else t_3$ and $t’=t_2$ for some $t_2$ and $t_3$. If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if false \then t_2 \else t_3$ and $t’=t_3$ for some $t_2$ and $t_3$. If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \if t_1 \then t_2 \else t_3$ and $t’ = \if t_1’ \then t_2 \else t_3$, for some $t_1, t_1’, t_2, t_3$. Moreover, the immediate subderivation of $\mathcal{D}$ witnesses $(t_1, t_1’) \in \longrightarrow$.
This is super boring, but we do need to acknowledge the inversion lemma before we can do induction proofs on derivations. Thanks to the inversion lemma, given an arbitrary derivation $\mathcal{D}$ with conclusion $t \longrightarrow t’$, we can proceed with a case-by-case analysis on the final rule used in the derivation tree.
Let’s recall our definition of the size function. In particular, we’ll need the rule for if-statements:

$$\text{size}(\text{if } t_1 \text{ then } t_2 \text{ else } t_3) = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) + 1$$

We want to prove that if $t \longrightarrow t’$, then $\text{size}(t) > \text{size}(t’)$. If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if true \then t_2 \else t_3$ and $t’=t_2$, and the result is immediate from the definition of $\text{size}$. If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if false \then t_2 \else t_3$ and $t’=t_3$, and the result is immediate from the definition of $\text{size}$. If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \ifelse$ and $t’ = \if t_1’ \then t_2 \else t_3$. In this case, $t_1 \longrightarrow t_1’$ is witnessed by a derivation $\mathcal{D}_1$.
By the induction hypothesis, $\text{size}(t_1) > \text{size}(t_1’)$, and the result is then immediate from the definition of $\text{size}$.
Abstract machines
An abstract machine consists of: a set of states, and a transition relation on states, written $\longrightarrow$.
$t \longrightarrow t’$ means that $t$ evaluates to $t’$ in one step. Note that $\longrightarrow$ is a relation, and that $t \longrightarrow t’$ is shorthand for $(t, t’) \in \longrightarrow$. Often, this relation is a partial function: it need not cover the whole domain, and there is at most one possible next state. But in general, there may be many possible next states; determinism isn’t a criterion here.
Normal forms
A normal form is a term that cannot be evaluated any further. More formally, a term $t$ is a normal form if there is no $t’$ such that $t \longrightarrow t’$. A normal form is a state where the abstract machine is halted; we can regard it as the result of a computation.
Values that are normal form
Previously, we intended for our values (true and false) to be exactly that, the result of a computation. Did we get that right?
Let’s prove that a term $t$ is a value $\iff$ it is in normal form.
The $\implies$ direction is immediate from the definition of the evaluation relation $\longrightarrow$.
The $\impliedby$ direction is more conveniently proven as its contrapositive: if $t$ is not a value, then it is not a normal form, which we can prove by induction on the term $t$. Since $t$ is not a value, it must be of the form $\ifelse$. If $t_1$ is directly true or false, then $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ apply, and we are done. Otherwise, if $t = \ifelse$ where $t_1$ isn’t a value, by the induction hypothesis, there is a $t_1’$ such that $t_1 \longrightarrow t_1’$. Then rule $\ref{eq:e-if}$ yields $\if t_1’ \then t_2 \else t_3$, which proves that $t$ is not in normal form.
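The three evaluation rules and the value/normal-form correspondence can be checked with a tiny Scala interpreter (a sketch; the ADT names are ours):

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case class If(t1: Term, t2: Term, t3: Term) extends Term

// One small step of evaluation, if any rule applies; None means normal form.
def step(t: Term): Option[Term] = t match {
  case If(True, t2, _)  => Some(t2)                    // E-IfTrue
  case If(False, _, t3) => Some(t3)                    // E-IfFalse
  case If(t1, t2, t3)   => step(t1).map(If(_, t2, t3)) // E-If (congruence)
  case _                => None                        // values: no rule applies
}

// Multi-step evaluation: iterate single steps until a normal form is reached
def eval(t: Term): Term = step(t) match {
  case Some(u) => eval(u)
  case None    => t
}

// if (if true then true else false) then false else true
val ex = If(If(True, True, False), False, True)
```

Running eval on the derivation example first steps the condition (E-If over E-IfTrue), then applies E-IfTrue at the top, ending in the value False; and step returns None exactly on the values True and False, matching the proof above.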
Values that are not normal form
Let’s introduce new syntactic forms, with new evaluation rules:

t ::=            // terms
    0            // constant 0
    succ t       // successor
    pred t       // predecessor
    iszero t     // zero test

v ::= nv         // values

nv ::=           // numeric values
    0            // zero value
    succ nv      // successor value

The evaluation rules are given as follows:

$$\frac{t_1 \longrightarrow t_1'}{\text{succ } t_1 \longrightarrow \text{succ } t_1'} \tag{E-Succ}$$

$$\text{pred } 0 \longrightarrow 0 \tag{E-PredZero}$$

$$\text{pred } (\text{succ } nv_1) \longrightarrow nv_1 \tag{E-PredSucc}$$

$$\frac{t_1 \longrightarrow t_1'}{\text{pred } t_1 \longrightarrow \text{pred } t_1'} \tag{E-Pred}$$

$$\text{iszero } 0 \longrightarrow \text{true} \tag{E-IsZeroZero}$$

$$\text{iszero } (\text{succ } nv_1) \longrightarrow \text{false} \tag{E-IsZeroSucc}$$

$$\frac{t_1 \longrightarrow t_1'}{\text{iszero } t_1 \longrightarrow \text{iszero } t_1'} \tag{E-IsZero}$$

All values are still normal forms. But are all normal forms values? Not in this case. For instance, succ true, iszero true, etc., are normal forms. These are stuck terms: they are in normal form, but are not values. In general, these correspond to some kind of type error, and one of the main purposes of a type system is to rule these kinds of situations out.
Multi-step evaluation
Let’s introduce the multi-step evaluation relation, $\longrightarrow^*$. It is the reflexive, transitive closure of single-step evaluation, i.e. the smallest relation closed under these rules:

$$\frac{t \longrightarrow t'}{t \longrightarrow^* t'} \qquad \frac{}{t \longrightarrow^* t} \qquad \frac{t \longrightarrow^* t' \quad t' \longrightarrow^* t''}{t \longrightarrow^* t''}$$

In other words, it corresponds to any number of single consecutive evaluations.
Termination of evaluation
We’ll prove that evaluation terminates, i.e. that for every term $t$ there is some normal form $t’$ such that $t\longrightarrow^* t’$.
First, let’s recall our proof that $t\longrightarrow t’ \implies \text{size}(t) > \text{size}(t’)$. Now, for our proof by contradiction, assume that we have an infinite-length sequence $t_0, t_1, t_2, \dots$ such that:

$$t_0 \longrightarrow t_1 \longrightarrow t_2 \longrightarrow \dots \quad \text{and thus} \quad \text{size}(t_0) > \text{size}(t_1) > \text{size}(t_2) > \dots$$

But this sequence cannot exist: since $\text{size}(t_0)$ is a finite, natural number, we cannot construct this infinite descending chain from it. This is a contradiction.
Most termination proofs have the same basic form. We want to prove that the relation $R\subseteq X \times X$ is terminating — that is, there are no infinite sequences $x_0, x_1, x_2, \dots$ such that $(x_i, x_{i+1}) \in R$ for each $i$. We proceed as follows: Choose a well-suited set $W$ with partial order $<$ such that there are no infinite descending chains $w_0 > w_1 > w_2 > \dots$ in $W$. Also choose a function $f: X \rightarrow W$.
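The arithmetic extension, including stuck terms, can be sketched by extending the small-step interpreter from before (names are ours; a self-contained sketch):

```scala
sealed trait Term
case object True extends Term
case object False extends Term
case object Zero extends Term
case class Succ(t: Term) extends Term
case class Pred(t: Term) extends Term
case class IsZero(t: Term) extends Term

// numeric values: 0, succ 0, succ (succ 0), ...
def isNv(t: Term): Boolean = t match {
  case Zero     => true
  case Succ(t1) => isNv(t1)
  case _        => false
}

def step(t: Term): Option[Term] = t match {
  case Pred(Zero)                   => Some(Zero)  // E-PredZero
  case Pred(Succ(nv)) if isNv(nv)   => Some(nv)    // E-PredSucc
  case IsZero(Zero)                 => Some(True)  // E-IsZeroZero
  case IsZero(Succ(nv)) if isNv(nv) => Some(False) // E-IsZeroSucc
  case Succ(t1)                     => step(t1).map(Succ(_))   // E-Succ
  case Pred(t1)                     => step(t1).map(Pred(_))   // E-Pred
  case IsZero(t1)                   => step(t1).map(IsZero(_)) // E-IsZero
  case _                            => None        // normal form
}

def eval(t: Term): Term = step(t).fold(t)(eval)
```

Here eval(IsZero(Pred(Succ(Zero)))) reaches the value True, while Succ(True) takes no step at all: it is a normal form that is not a value, i.e. a stuck term.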
Show $f(x) > f(y) \quad \forall (x, y) \in R$. Conclude that there are no infinite sequences $(x_0, x_1, x_2, \dots)$ such that $(x_i, x_{i+1}) \in R$ for each $i$; if there were, we could construct an infinite descending chain in $W$.
As a side-note, partial order is defined by the following properties: anti-symmetry, $\neg(x < y \land y < x)$; and transitivity, $x<y \land y<z \implies x < z$. We can add a third property to achieve total order, namely $x \ne y \implies x <y \lor y<x$.
Lambda calculus
Lambda calculus is Turing complete, and is higher-order (functions are data). In lambda calculus, all computation happens by means of function abstraction and application. Lambda calculus is equivalent in computational power to Turing machines.
Suppose we wanted to write a function plus3 in our previous language:

plus3 x = succ succ succ x

The way we write this in lambda calculus is:

$$\text{plus3} = \lambda x.\ \text{succ } (\text{succ } (\text{succ } x))$$

$\lambda x. t$ is written x => t in Scala, or fun x -> t in OCaml. Application of our function, say plus3(succ 0), can be written as:

$$(\lambda x.\ \text{succ } (\text{succ } (\text{succ } x)))\ (\text{succ } 0)$$

Abstraction over functions is possible using higher-order functions, which we call $\lambda$-abstractions. An example of such an abstraction is the function $g$ below, which takes an argument $f$ and uses it in the function position. If we apply $g$ to an argument like $\text{plus3}$, we can just use the substitution rule to see how that defines a new function.
Another example: the double function below takes two arguments, as a curried function would. First, it takes the function to apply twice, then the argument on which to apply it, and then returns $f(f(y))$:

$$\text{double} = \lambda f.\ \lambda y.\ f\ (f\ y)$$

Pure lambda calculus
Once we have $\lambda$-abstractions, we can actually throw out all other language primitives like booleans and other values; all of these can be expressed as functions, as we’ll see below.
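Incidentally, the plus3 and double examples above translate directly into Scala, where $\lambda x.\ t$ is written x => t:

```scala
def succ(x: Int): Int = x + 1

// plus3 = λx. succ (succ (succ x))
val plus3: Int => Int = x => succ(succ(succ(x)))

// double = λf. λy. f (f y): takes the function to apply twice, then the argument
val double: (Int => Int) => Int => Int = f => y => f(f(y))

val a = plus3(succ(0))   // plus3 (succ 0)
val b = double(plus3)(0) // applying double to plus3, then to 0
```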
In pure lambda-calculus, everything is a function. Variables will always denote a function, functions always take other functions as parameters, and the result of an evaluation is always a function.
The syntax of lambda-calculus is very simple:

t ::=        // terms, also called λ-terms
    x        // variable
    λx. t    // abstraction, also called λ-abstraction
    t t      // application

A few rules and syntactic conventions: Application associates to the left, so $t\ u\ v$ means $(t\ u)\ v$, not $t\ (u\ v)$. Bodies of lambda abstractions extend as far to the right as possible, so $\lambda x.\ \lambda y.\ x\ y$ means $\lambda x.\ (\lambda y.\ x\ y)$, not $\lambda x.\ (\lambda y.\ x)\ y$.
Scope
The lambda expression $\lambda x.\ t$ binds the variable $x$, with a scope limited to $t$. Occurrences of $x$ inside of $t$ are said to be bound, while occurrences outside are said to be free.
Let $\text{fv}(t)$ be the set of free variables in a term $t$. It’s defined as follows:

$$\begin{align*}
\text{fv}(x) &= \set{x} \\
\text{fv}(\lambda x.\ t_1) &= \text{fv}(t_1) \setminus \set{x} \\
\text{fv}(t_1\ t_2) &= \text{fv}(t_1) \cup \text{fv}(t_2)
\end{align*}$$

Operational semantics
As we saw with our previous language, the rules could be distinguished into computation and congruence rules. For lambda calculus, the only computation rule is:

$$(\lambda x.\ t_{12})\ v_2 \longrightarrow \left[ x \mapsto v_2 \right] t_{12} \tag{E-AppAbs} \label{eq:e-appabs}$$

The notation $\left[ x \mapsto v_2 \right] t_{12}$ means “the term that results from substituting free occurrences of $x$ in $t_{12}$ with $v_2$”.
The congruence rules are:

$$\frac{t_1 \longrightarrow t_1'}{t_1\ t_2 \longrightarrow t_1'\ t_2} \tag{E-App1} \label{eq:e-app1}$$

$$\frac{t_2 \longrightarrow t_2'}{v_1\ t_2 \longrightarrow v_1\ t_2'} \tag{E-App2} \label{eq:e-app2}$$

A lambda-expression applied to a value, $(\lambda x.\ t)\ v$, is called a reducible expression, or redex.
Evaluation strategies
There are alternative evaluation strategies. In the above, we have chosen call by value (which is the standard in most mainstream languages), but we could also choose: Full beta-reduction: any redex may be reduced at any time. This offers no restrictions, but in practice, we go with a set of restrictions like the ones below (because coding a fixed way is easier than coding probabilistic behavior). Normal order: the leftmost, outermost redex is always reduced first.
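The fv function above is a straightforward structural recursion; here is a Scala sketch over a small λ-term ADT (names are ours):

```scala
sealed trait LTerm
case class Var(name: String) extends LTerm
case class Abs(param: String, body: LTerm) extends LTerm
case class App(t1: LTerm, t2: LTerm) extends LTerm

// fv(x) = {x}; fv(λx. t) = fv(t) \ {x}; fv(t1 t2) = fv(t1) ∪ fv(t2)
def fv(t: LTerm): Set[String] = t match {
  case Var(x)       => Set(x)
  case Abs(x, body) => fv(body) - x
  case App(t1, t2)  => fv(t1) ++ fv(t2)
}

// λx. (λy. x) y : the first y occurrence is bound, the outer one is free
val example = Abs("x", App(Abs("y", Var("x")), Var("y")))
```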
This strategy allows reducing inside unapplied lambda terms. Call-by-name: allows no reductions inside lambda abstractions. Arguments are not reduced before being substituted in the body of lambda terms when applied. Haskell uses an optimized version of this, call-by-need (aka lazy evaluation).
Classical lambda calculus
Classical lambda calculus allows for full beta reduction.
Confluence in full beta reduction
The congruence rules allow us to apply in different ways; we can choose between $\ref{eq:e-app1}$ and $\ref{eq:e-app2}$ every time we reduce an application, and this offers many possible reduction paths.
While the path is non-deterministic, is the result also non-deterministic? This question took a very long time to answer, but after 25 years or so, it was proven that the result is always the same. This is known as the Church-Rosser confluence theorem:
Let $t, t_1, t_2$ be terms such that $t \longrightarrow^* t_1$ and $t \longrightarrow^* t_2$. Then there exists a term $t_3$ such that $t_1 \longrightarrow^* t_3$ and $t_2 \longrightarrow^* t_3$.
Alpha conversion
Substitution is actually trickier than it looks! For instance, in the expression $\lambda x.\ (\lambda y.\ x)\ y$, the first occurrence of $y$ is bound (it refers to a parameter), while the second is free (it does not refer to a parameter). This is comparable to scope in most programming languages, where we should understand that these are two different variables in different scopes, $y_1$ and $y_2$.
The above example had a variable that is both bound and free, which is something that we’ll try to avoid. This is called a hygiene condition.
We can transform an unhygienic expression into a hygienic one by renaming bound variables before performing the substitution. This is known as alpha conversion.
Alpha conversion is given by the following conversion rule:

$$\lambda x.\ t =_\alpha \lambda y.\ \left[ x \mapsto y \right] t \quad \text{if } y \notin \text{fv}(t)$$

And these equivalence rules (in mathematics, an equivalence is defined by reflexivity, symmetry and transitivity):

$$\frac{}{t =_\alpha t} \qquad \frac{t_1 =_\alpha t_2}{t_2 =_\alpha t_1} \qquad \frac{t_1 =_\alpha t_2 \quad t_2 =_\alpha t_3}{t_1 =_\alpha t_3}$$

The congruence rules are as usual.
Programming in lambda-calculus
Multiple arguments
The way to handle multiple arguments is by currying: $\lambda x.\ \lambda y.\ t$.
Booleans
The fundamental, universal operator on booleans is if-then-else, which is what we’ll replicate to model booleans. We’ll denote our booleans as $\text{tru}$ and $\text{fls}$ to be able to distinguish these pure lambda-calculus abstractions from the true and false values of our previous toy language:

$$\begin{align*}
\text{tru} &= \lambda t.\ \lambda f.\ t \\
\text{fls} &= \lambda t.\ \lambda f.\ f
\end{align*}$$

We want true to be equivalent to if (true), and false to if (false). The terms $\text{tru}$ and $\text{fls}$ represent boolean values, in that we can use them to test the truth of a boolean value. We can consider these as booleans. Equivalently, tru can be considered as a function performing (t1, t2) => if (true) t1 else t2. To understand this, let’s try to apply $\text{tru}$ to two arguments:

$$\text{tru}\ v\ w = (\lambda t.\ \lambda f.\ t)\ v\ w \longrightarrow (\lambda f.\ v)\ w \longrightarrow v$$

This works equivalently for fls.
We can also do inversion, conjunction and disjunction with lambda calculus, which can be read as particular if-else statements: not is a function that is equivalent to not(b) = if (b) false else true; and is equivalent to and(b, c) = if (b) c else false; or is equivalent to or(b, c) = if (b) true else c:

$$\begin{align*}
\text{not} &= \lambda b.\ b\ \text{fls}\ \text{tru} \\
\text{and} &= \lambda b.\ \lambda c.\ b\ c\ \text{fls} \\
\text{or} &= \lambda b.\ \lambda c.\ b\ \text{tru}\ c
\end{align*}$$

Pairs
The fundamental operations are construction pair(a, b), and selection pair._1 and pair._2. pair is equivalent to pair(f, s) = (b => b f s). When tru is applied to pair, it selects the first element, by definition of the boolean, and that is therefore the definition of fst; equivalently for fls applied to pair, it selects the second element:

$$\begin{align*}
\text{pair} &= \lambda f.\ \lambda s.\ \lambda b.\ b\ f\ s \\
\text{fst} &= \lambda p.\ p\ \text{tru} \\
\text{snd} &= \lambda p.\ p\ \text{fls}
\end{align*}$$

Numbers
We’ve actually been representing numbers as lambda-calculus numbers all along!
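The boolean and pair encodings can be played with directly in Scala; here is an untyped-style sketch using Any (the names mirror the λ-terms, the casts are just to appease the type checker):

```scala
// A Church boolean selects one of its two arguments
type CB = (Any, Any) => Any

val tru: CB = (t, f) => t // selects its first argument
val fls: CB = (t, f) => f // selects its second argument

def not(b: CB): CB = b(fls, tru).asInstanceOf[CB]      // if (b) false else true
def and(b: CB, c: CB): CB = b(c, fls).asInstanceOf[CB] // if (b) c else false
def or(b: CB, c: CB): CB = b(tru, c).asInstanceOf[CB]  // if (b) true else c

// pair(f, s) = b => b f s; selection applies a boolean to the pair
type CPair = CB => Any
def pair(f: Any, s: Any): CPair = b => b(f, s)
def fst(p: CPair): Any = p(tru)
def snd(p: CPair): Any = p(fls)
```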
Our succ function represents what’s more formally called Church numerals.Note that $c_0$’s implementation is the same as that of $\text{fls}$ (just with renamed variables).Every number $n$ is represented by a term $c_n$ taking two arguments, which are $s$ and $z$ (for “successor” and “zero”), and applies $s$ to $z$, $n$ times. Fundamentally, a number is equivalent to the following:With this in mind, let us implement some functions on numbers. Successor $\text{scc}$: we apply the successor function to $n$ (which has been correctly instantiated with $s$ and $z$) Addition $\text{add}$: we pass the instantiated $n$ as the zero of $m$ Subtraction $\text{sub}$: we apply $\text{pred}$ $n$ times to $m$ Multiplication $\text{mul}$: instead of the successor function, we pass the addition by $n$ function. Zero test $\text{iszero}$: zero has the same implementation as false, so we can lean on that to build an iszero function. An alternative understanding is that we’re building a number, in which we use true for the zero value $z$. If we have to apply the successor function $s$ once or more, we want to get false, so for the successor function we use a function ignoring its input and returning false if applied.What about predecessor? This is a little harder, and it’ll take a few steps to get there. The main idea is that we find the predecessor by rebuilding the whole succession up until our number. At every step, we must generate the number and its predecessor: zero is $(c_0, c_0)$, and all other numbers are $(c_{n-1}, c_n)$. Once we’ve reconstructed this pair, we can get the predecessor by taking the first element of the pair.SidenoteThe story goes that Church was stumped by predecessors for a long time. This solution finally came to him while he was at the barber, and he jumped out half shaven to write it down.ListsNow what about lists?Recursion in lambda-calculusLet’s start by taking a step back. 
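Before moving on, the Church-numeral operations above can be transcribed into Scala as a sketch. Again the monomorphic `CNat` type and the `toInt` observer are our own additions, needed only because Scala is typed.

```scala
object ChurchNumerals {
  // A Church numeral applies s to z, n times (monomorphised at Int).
  type CNat = (Int => Int, Int) => Int

  val c0: CNat = (s, z) => z       // zero: apply s zero times
  val c1: CNat = (s, z) => s(z)    // one: apply s once

  // scc: apply s once more to the already-instantiated n
  def scc(n: CNat): CNat = (s, z) => s(n(s, z))
  // add: pass the instantiated n as the zero of m
  def add(m: CNat, n: CNat): CNat = (s, z) => m(s, n(s, z))
  // mul: instead of the successor function, pass "add n" as the step
  def mul(m: CNat, n: CNat): CNat = (s, z) => m(x => n(s, z - z) + x - x, z) match {
    case _ => m((acc: Int) => n(s, acc), z) // fold "apply n" m times over z
  }

  // Read a numeral back by counting with the ordinary successor
  def toInt(n: CNat): Int = n(_ + 1, 0)
}
```

For example, `toInt(scc(c1))` counts two applications of `_ + 1` to `0`.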
We talked about normal forms and terms for which we terminate; does lambda calculus always terminate? It’s Turing complete, so it must be able to loop infinitely (otherwise, we’d have solved the halting problem!).The trick to recursion is self-application:From a type-level perspective, we would cringe at this. This should not be possible in the typed world, but in the untyped world we can do it. We can construct a simple infinite loop in lambda calculus as follows:The expression evaluates to itself in one step; it never reaches a normal form, it loops infinitely, diverges. This is not a stuck term though; evaluation is always possible.In fact, there are no stuck terms in pure lambda calculus. Every term is either a value or reduces further.So it turns out that $\text{omega}$ isn’t so terribly useful. Let’s try to construct something more practical:Now, the divergence is a little more interesting:This $Y_f$ function is known as a Y combinator. It still loops infinitely (though note that while it works in classical lambda calculus, it blows up in call-by-name), so let’s try to build something more useful.To delay the infinite recursion, we could build something like a poison pill:It can be passed around (after all, it’s just a value), but evaluating it will cause our program to loop infinitely. This is the core idea we’ll use for defining the fixed-point combinator $\text{fix}$ (also known as the call-by-value Y combinator), which allows us to do recursion. It’s defined as follows:This looks a little intricate, and we won’t need to fully understand the definition. What’s important is mostly how it is used to define a recursive function. 
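As a concrete sketch of that usage, here is an eta-expanded fixed-point combinator in Scala (names are ours; this is typed Scala's stand-in for $\text{fix}$, not the untyped term itself, which is not typable):

```scala
object Fix {
  // Call-by-value fixed point: eta-expansion delays the recursive
  // unfolding, so this terminates under strict evaluation.
  def fix[A, B](f: (A => B) => (A => B)): A => B =
    (x: A) => f(fix(f))(x)

  // Tie the knot for a non-recursive "generator" of factorial:
  val factorial: Int => Int =
    fix[Int, Int](rec => n => if (n == 0) 1 else n * rec(n - 1))
}
```

`fix` receives a function that, given "the rest of the recursion" `rec`, produces one step of the computation; applying `fix` closes the loop.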
For instance, if we wanted to define a modulo function in our toy language, we'd do it as follows:

```scala
def mod(x, y) = if (y > x) x else mod(x - y, y)
```

In lambda calculus, we'd define this as follows (we've assumed that a greater-than $\text{gt}$ function was available here). More generally, we can define a recursive function as:

Equivalence of lambda terms

We've seen how to define Church numerals and successor. How can we prove that $\text{succ } c_n$ is equal to $c_{n+1}$? The naive approach unfortunately doesn't work; they do not evaluate to the same value. This still seems very close: if we could simplify a little further, we do see how they would be the same. The intuition behind the Church numeral representation was that a number $n$ is represented as a term that "does something $n$ times to something else". $\text{scc}$ takes a term that "does something $n$ times to something else", and returns a term that "does something $n+1$ times to something else". What we really care about is that $\text{scc } c_2$ behaves the same as $c_3$ when applied to two arguments. We want behavioral equivalence. But what does that mean? Roughly, two terms $s$ and $t$ are behaviorally equivalent if there is no "test" that distinguishes $s$ and $t$.

Let's define this notion of "test" a little more precisely, and specify how we're going to observe the results of a test. We can use the notion of normalizability to define a simple notion of a test: two terms $s$ and $t$ are said to be observationally equivalent if they are either both normalizable (i.e. they reach a normal form after a finite number of evaluation steps), or both diverge. In other words, we observe a term's behavior by running it and seeing if it halts. 
Note that this is not decidable (by the halting problem). For instance, $\text{omega}$ and $\text{tru}$ are not observationally equivalent (one diverges, one halts), while $\text{tru}$ and $\text{fls}$ are (they both halt). Observational equivalence isn't strong enough of a test for what we need; we need behavioral equivalence. Two terms $s$ and $t$ are said to be behaviorally equivalent if, for every finite sequence of values $v_1, v_2, \dots, v_n$, the applications $s\ v_1\ v_2\ \dots\ v_n$ and $t\ v_1\ v_2\ \dots\ v_n$ are observationally equivalent. This allows us to assert that true and false are indeed different: the former returns a normal form, while the latter diverges.

Types

As previously, to define a language, we start with a set of terms and values, as well as an evaluation relation. But now, we'll also define a set of types (denoted with a capitalized first letter) classifying values according to their "shape". We can define a typing relation $t:\ T$. We must check that the typing relation is sound in the sense that:

These rules represent some kind of safety and liveness, but are more commonly referred to as progress and preservation, which we'll talk about later. The first one states that types are preserved throughout evaluation, while the second says that if we can type-check, then evaluation of $t$ will not get stuck.

In our previous toy language, we can introduce two types, booleans and numbers:

```
T ::=      // types
    Bool   // type of booleans
    Nat    // type of numbers
```

Our typing rules are then given by:

With these typing rules in place, we can construct typing derivations to justify every pair $t: T$ (which we can also denote as a $(t, T)$ pair) in the typing relation, as we have done previously with evaluation. Proofs of properties about the typing relation often proceed by induction on these typing derivations.

Like other static program analyses, type systems are generally imprecise. 
They do not always predict exactly what kind of value will be returned, but simply a conservative approximation. For instance, if true then 0 else false cannot be typed with the above rules, even though it will certainly evaluate to a number. We could of course add a typing rule for if true statements, but there is still a question of how useful this is, and how much complexity it adds to the type system, and especially for proofs. Indeed, the inversion lemma below becomes much more tedious when we have more rules.Properties of the Typing RelationThe safety (or soundness) of this type system can be expressed by the following two properties: Progress: A well-typed term is not stuck. If $t\ :\ T$ then either $t$ is a value, or else $t\longrightarrow t’$ for some $t’$. Preservation: Types are preserved by one-step evaluation. If $t\ :\ T$ and $t\longrightarrow t’$, then $t’\ :\ T$. We will prove these later, but first we must state a few lemmas.Inversion lemmaAgain, for types we need to state the same (boring) inversion lemma: If $\text{true}: R$, then $R = \text{Bool}$. If $\text{false}: R$, then $R = \text{Bool}$. 
If $\ifelse: R$, then $t_1: \text{Bool}$, $t_2: R$ and $t_3: R$.
If $0: R$, then $R = \text{Nat}$.
If $\text{succ } t_1: R$, then $R = \text{Nat}$ and $t_1: \text{Nat}$.
If $\text{pred } t_1: R$, then $R = \text{Nat}$ and $t_1: \text{Nat}$.
If $\text{iszero } t_1: R$, then $R = \text{Bool}$ and $t_1: \text{Nat}$.

From the inversion lemma, we can directly derive a typechecking algorithm:

```scala
def typeof(t: Expr): T = t match {
  case True | False => Bool
  case If(t1, t2, t3) =>
    val type1 = typeof(t1)
    val type2 = typeof(t2)
    val type3 = typeof(t3)
    if (type1 == Bool && type2 == type3) type2
    else throw Error("not typable")
  case Zero => Nat
  case Succ(t1) =>
    if (typeof(t1) == Nat) Nat
    else throw Error("not typable")
  case Pred(t1) =>
    if (typeof(t1) == Nat) Nat
    else throw Error("not typable")
  case IsZero(t1) =>
    if (typeof(t1) == Nat) Bool
    else throw Error("not typable")
}
```

Canonical form

A simple lemma that will be useful later is that of canonical forms. Given a type, it tells us what kind of values we can expect:

If $v$ is a value of type Bool, then $v$ is either $\text{true}$ or $\text{false}$.
If $v$ is a value of type Nat, then $v$ is a numeric value.

The proof is somewhat immediate from the syntax of values.

Progress Theorem

Theorem: suppose that $t$ is a well-typed term of type $T$. Then either $t$ is a value, or else there exists some $t'$ such that $t\longrightarrow t'$.

Proof: by induction on a derivation of $t: T$. The $\ref{eq:t-true}$, $\ref{eq:t-false}$ and $\ref{eq:t-zero}$ cases are immediate, since $t$ is a value in these cases. For $\ref{eq:t-if}$, we have $t=\ifelse$, with $t_1: \text{Bool}$, $t_2: T$ and $t_3: T$. By the induction hypothesis, either $t_1$ is a value, or there is some $t_1'$ such that $t_1 \longrightarrow t_1'$. If $t_1$ is a value, then rule 1 of the canonical forms lemma tells us that $t_1$ must be either $\text{true}$ or $\text{false}$, in which case $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ applies to $t$. 
Otherwise, if $t_1 \longrightarrow t_1'$, then by $\ref{eq:e-if}$, $t\longrightarrow \if t_1' \then t_2 \text{ else } t_3$. For $\ref{eq:t-succ}$, we have $t = \text{succ } t_1$. If $t_1$ is a value, then by rule 5 of the inversion lemma and rule 2 of the canonical forms lemma, $t_1 = nv$ for some numeric value $nv$, and therefore $\text{succ }(t_1)$ is a value. If $t_1 \longrightarrow t_1'$, then $t\longrightarrow \text{succ }t_1'$. The cases for $\ref{eq:t-zero}$, $\ref{eq:t-pred}$ and $\ref{eq:t-iszero}$ are similar.

Preservation Theorem

Theorem: Types are preserved by one-step evaluation. If $t: T$ and $t\longrightarrow t'$, then $t': T$.

Proof: by induction on the given typing derivation. For $\ref{eq:t-true}$ and $\ref{eq:t-false}$, the precondition doesn't hold (no reduction is possible), so it's trivially true. Indeed, $t$ is already a value, either $t=\text{true}$ or $t=\text{false}$. For $\ref{eq:t-if}$, there are three evaluation rules by which $t\longrightarrow t'$ can be derived, depending on $t_1$:

If $t_1 = \text{true}$, then by $\ref{eq:e-iftrue}$ we have $t'=t_2$, and from rule 3 of the inversion lemma and the assumption that $t: T$, we have $t_2: T$, that is $t': T$.
If $t_1 = \text{false}$, then by $\ref{eq:e-iffalse}$ we have $t'=t_3$, and from rule 3 of the inversion lemma and the assumption that $t: T$, we have $t_3: T$, that is $t': T$.
If $t_1 \longrightarrow t_1'$, then by the induction hypothesis, $t_1': \text{Bool}$. Combining this with the assumption that $t_2: T$ and $t_3: T$, we can apply $\ref{eq:t-if}$ to conclude $\if t_1' \then t_2 \else t_3: T$, that is $t': T$.

Messing with it

Removing a rule

What if we remove $\ref{eq:e-predzero}$? Then pred 0 type checks, but it is stuck and is not a value; the progress theorem fails.

Changing a type-checking rule

What if we change $\ref{eq:t-if}$ to the following? This doesn't break our type system. It's still sound, but it rejects if-else expressions that return things other than numbers (e.g. booleans). 
But that is an expressiveness problem, not a soundness problem; our type system disallows things that would otherwise be fine by the evaluation rules.

Adding bit

We could add a boolean-to-natural function bit(t). We'd have to add it to the grammar, add some evaluation and typing rules, and prove progress and preservation. We'll do something similar below, so the full proof is omitted.

Simply typed lambda calculus

Simply Typed Lambda Calculus (STLC) is also denoted $\lambda_\rightarrow$. The "pure" form of STLC is not very interesting on the type level (unlike the term level of pure lambda calculus), so we'll allow base values that are not functions, like booleans and integers. To talk about STLC, we always begin with some set of "base types":

```
T ::=      // types
    Bool   // type of booleans
    T -> T // type of functions
```

In the following examples, we'll work with a mix of our previously defined toy language and lambda calculus. This will give us a little syntactic sugar.

```
t ::=                  // terms
    x                  // variable
    λx. t              // abstraction
    t t                // application
    true               // constant true
    false              // constant false
    if t then t else t // conditional

v ::=                  // values
    λx. t              // abstraction value
    true               // true value
    false              // false value
```

Type annotations

We will annotate lambda-abstractions with the expected type of the argument, as follows:

We could also omit it, and let type inference do the job (as in OCaml), but for now, we'll do the above. This will make it simpler, as we won't have to discuss inference just yet.

Typing rules

In STLC, we've introduced abstraction. To add a typing rule for it, we need to encode the concept of an environment $\Gamma$, which is a set of variable assignments. 
We also introduce the "turnstile" symbol $\vdash$, meaning that the environment can verify the right-hand side typing, or that $\Gamma$ must imply the right-hand side. This additional concept must be taken into account in our definition of progress and preservation:

Progress: If $\Gamma\vdash t : T$, then either $t$ is a value or else $t\longrightarrow t'$ for some $t'$.
Preservation: If $\Gamma\vdash t : T$ and $t\longrightarrow t'$, then $\Gamma\vdash t' : T$.

To prove these, we must take the same steps as above. We'll introduce the inversion lemma for typing relations, and restate the canonical forms lemma in order to prove the progress theorem.

Inversion lemma

Let's start with the inversion lemma.

If $\Gamma\vdash\text{true} : R$ then $R = \text{Bool}$.
If $\Gamma\vdash\text{false} : R$ then $R = \text{Bool}$.
If $\Gamma\vdash\ifelse : R$ then $\Gamma\vdash t_1 : \text{Bool}$ and $\Gamma\vdash t_2, t_3: R$.
If $\Gamma\vdash x: R$ then $x: R \in\Gamma$.
If $\Gamma\vdash\lambda x: T_1 .\ t_2 : R$ then $R = T_1 \rightarrow R_2$ for some $R_2$ with $\Gamma\cup(x: T_1)\vdash t_2: R_2$.
If $\Gamma\vdash t_1\ t_2 : R$ then there is some type $T_{11}$ such that $\Gamma\vdash t_1 : T_{11} \rightarrow R$ and $\Gamma\vdash t_2 : T_{11}$.

Canonical form

The canonical forms are given as follows:

If $v$ is a value of type Bool, then it is either $\text{true}$ or $\text{false}$.
If $v$ is a value of type $T_1 \rightarrow T_2$, then $v$ has the form $\lambda x: T_1 .\ t_2$.

Progress

Finally, we get to prove progress by induction on typing derivations.

Theorem: Suppose that $t$ is a closed, well-typed term (that is, $\Gamma\vdash t: T$ for some type $T$). Then either $t$ is a value, or there is some $t'$ such that $t\longrightarrow t'$. 
For boolean constants, the proof is immediate, as $t$ is a value. For variables, the proof is immediate, as $t$ is closed and the precondition therefore doesn't hold. For abstraction, the proof is immediate, as $t$ is a value. Application is the only case we must treat. Consider $t = t_1\ t_2$, with $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2: T_{11}$. By the induction hypothesis, $t_1$ is either a value, or it can make a step of evaluation. The same goes for $t_2$. If $t_1$ can reduce, then rule $\ref{eq:e-app1}$ applies to $t$. Otherwise, if it is a value, and $t_2$ can take a step, then $\ref{eq:e-app2}$ applies. Otherwise, if they are both values, then the canonical forms lemma above tells us that $t_1$ has the form $\lambda x: T_{11}.\ t_{12}$, and so rule $\ref{eq:e-appabs}$ ($\beta$-reduction) applies to $t$.

Preservation

Theorem: If $\Gamma\vdash t: T$ and $t \longrightarrow t'$ then $\Gamma\vdash t': T$.

Proof: by induction on typing derivations. We proceed on a case-by-case basis, as we have done so many times before. But one case is hard: application. For $t = t_1\ t_2$, such that $\Gamma\vdash t_1 : T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2 : T_{11}$, and where $T=T_{12}$, we want to show $\Gamma\vdash t' : T_{12}$. To do this, we must use the inversion lemma for evaluation (note that we haven't written it down for STLC, but the idea is the same). There are three subcases for it, starting with the following: the left-hand side is $t_1 = \lambda x: T_{11}.\ t_{12}$, and the right-hand side of the application, $t_2$, is a value $v_2$. In this case, we know that the result of the evaluation is given by $t' = \left[ x\mapsto v_2 \right] t_{12}$. And here, we already run into trouble, because we do not know how types act under substitution. 
We will therefore need to introduce some lemmas.Weakening lemmaWeakening tells us that we can add assumptions to the context without losing any true typing statements:If $\Gamma\vdash t: T$, and the environment $\Gamma$ has no information about $x$—that is, $x\notin \text{dom}(\Gamma)$—then the initial assumption still holds if we add information about $x$ to the environment:Moreover, the latter $\vdash$ derivation has the same depth as the former.Permutation lemmaPermutation tells us that the order of assumptions in $\Gamma$ does not matter.If $\Gamma \vdash t: T$ and $\Delta$ is a permutation of $\Gamma$, then $\Delta\vdash t: T$.Moreover, the latter $\vdash$ derivation has the same depth as the former.Substitution lemmaSubstitution tells us that types are preserved under substitution.That is, if $\Gamma\cup(x: S) \vdash t: T$ and $\Gamma\vdash s: S$, then $\Gamma\vdash \left[x\mapsto s\right] t: T$.The proof goes by induction on the derivation of $\Gamma\cup(x: S) \vdash t: T$, that is, by cases on the final typing rule used in the derivation. Case $\ref{eq:t-app}$: in this case, $t = t_1\ t_2$. Thanks to typechecking, we know that the environment validates $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_1: T_2 \rightarrow T_1$ and $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_2: T_2$. In this case, the resulting type of the application is $T=T_1$. By the induction hypothesis, $\Gamma\vdash[x\mapsto s]t_1 : T_2 \rightarrow T_1$, and $\Gamma\vdash[x\mapsto s]t_2 : T_2$. By $\ref{eq:t-app}$, the environment then also verifies the application of these two substitutions as $T$: $\Gamma\vdash[x\mapsto s]t_1\ [x\mapsto s]t_2: T$. We can factorize the substitution to obtain the conclusion, i.e. $\Gamma\vdash \left[x\mapsto s\right](t_1\ t_2): T$ Case $\ref{eq:t-var}$: if $t=z$ ($t$ is a simple variable $z$) where $z: T \in \bigl(\Gamma\cup (x: S)\bigr)$. 
There are two subcases to consider here, depending on whether $z$ is $x$ or another variable:

If $z=x$, then $\left[x\mapsto s\right] z = s$. The result is then $\Gamma\vdash s: S$, which is among the assumptions of the lemma.
If $z\ne x$, then $\left[x\mapsto s\right] z = z$, and the desired result is immediate.

Case $\ref{eq:t-abs}$: if $t=\lambda y: T_2.\ t_1$, with $T=T_2\rightarrow T_1$, and $\bigl(\Gamma\cup (x: S)\cup (y: T_2)\bigr)\vdash t_1 : T_1$. Based on our hygiene convention, we may assume $x\ne y$ and $y \notin \text{fv}(s)$. Using permutation on the first given subderivation in the lemma ($\Gamma\cup(x: S) \vdash t: T$), we obtain $\bigl(\Gamma\cup (y: T_2)\cup (x: S)\bigr)\vdash t_1 : T_1$ (we have simply changed the order of $x$ and $y$). Using weakening on the other given derivation in the lemma ($\Gamma\vdash s: S$), we obtain $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash s: S$. By the induction hypothesis, $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash\left[x\mapsto s\right] t_1: T_1$. By $\ref{eq:t-abs}$, we have $\Gamma\vdash(\lambda y: T_2.\ [x\mapsto s]t_1): T_2 \rightarrow T_1$. By the definition of substitution, this is $\Gamma\vdash([x\mapsto s]\lambda y: T_2.\ t_1): T_2 \rightarrow T_1$.

Proof

We've now proven the lemmas of weakening, permutation, and type preservation under substitution; together, these are the pieces needed for type preservation under reduction (i.e. preservation). We won't actually do that proof, we've just set up the pieces we need for it.

Erasure

Type annotations do not play any role in evaluation. In STLC, we don't do any run-time checks, we only run compile-time type checks. Therefore, types can be removed before evaluation. This often happens in practice, where types do not appear in the compiled form of a program; they're typically encoded in an untyped fashion. 
The semantics of this conversion can be formalized by an erasure function:

Curry-Howard Correspondence

The Curry-Howard correspondence tells us that there is a correspondence between propositional logic and types. An implication $P\supset Q$ (which could also be written $P\implies Q$) can be proven by transforming evidence for $P$ into evidence for $Q$. A conjunction $P\land Q$ is a pair of evidence for $P$ and evidence for $Q$. For more examples of these correspondences, see the Brouwer–Heyting–Kolmogorov (BHK) interpretation.

| Logic                      | Programming languages                |
| :------------------------- | :----------------------------------- |
| Propositions               | Types                                |
| $P \supset Q$              | Type $P\rightarrow Q$                |
| $P \land Q$                | Pair type $P\times Q$                |
| $P \lor Q$                 | Sum type $P+Q$                       |
| $\exists x\in S: \phi(x)$  | Dependent type $\sum{x: S, \phi(x)}$ |
| $\forall x\in S: \phi(x)$  | $\forall (x:S): \phi(x)$             |
| Proof of $P$               | Term $t$ of type $P$                 |
| $P$ is provable            | Type $P$ is inhabited                |
| Proof simplification       | Evaluation                           |

In Scala, all types are inhabited except for the bottom type Nothing. Singleton types are only inhabited by a single term. As an example of the equivalence, we'll see that application is equivalent to modus ponens. This also tells us that if we can prove something, we can evaluate it. How can we prove the following? Remember that $\rightarrow$ is right-associative. The proof is actually a somewhat straightforward conversion to lambda calculus:

Extensions to STLC

Base types

Up until now, we've defined our base types (such as $\text{Nat}$ and $\text{Bool}$) manually: we've added them to the syntax of types, with associated constants ($\text{zero}, \text{true}, \text{false}$) and operators ($\text{succ}, \text{pred}$), as well as associated typing and evaluation rules. This is a lot of minutiae though, especially for theoretical discussions. For those, we can often ignore the term-level inhabitants of the base types, and just treat them as uninterpreted constants: we don't really need the distinction between constants and values. For theory, we can just assume that some generic base types (e.g. 
$B$ and $C$) exist, without defining them further.

Unit type

In C-like languages, this type is usually called void. To introduce it, we do not add any computation rules. We must only add it to the grammar, values and types, and then add a single typing rule that trivially verifies units. Units are not too interesting, but are quite useful in practice, in part because they allow for other extensions.

Sequencing

We can define sequencing as two statements following each other:

```
t ::= ...
      t1; t2
```

This implies adding some evaluation and typing rules, defined below. But there's another way that we could define sequencing: simply as syntactic sugar, a derived form for something else. In this way, we define an external language that is transformed to an internal language by the compiler in the desugaring step. This is useful to know, because it makes proving soundness much easier. We do not need to re-state the inversion lemma, or re-prove preservation and progress. We can simply rely on the proof for the underlying internal language.

Ascription

```
t ::= ...
      t as T
```

Ascription allows us to have the compiler type-check a term as really being of the correct type. This seems like it preserves soundness, but instead of doing the whole proof over again, we'll just propose a simple desugaring, in which an ascription is equivalent to applying the identity function, typed to return $T$, to the term $t$. Alternatively, we could do the whole proof over again, and institute a simple evaluation rule that ignores the ascription.

Pairs

We can introduce pairs into our grammar.

```
t ::= ...
      {t, t}  // pair
      t.1     // first projection
      t.2     // second projection

v ::= ...
      {v, v}  // pair value

T ::= ...
      T1 x T2 // product type
```

We can also introduce evaluation rules for pairs. The typing rules are then given below. Pairs have to be added "the hard way": we do not really have a way to define them in a derived form, as we have no existing language features to piggyback onto.

Tuples

Tuples are like pairs, except that we do not restrict them to 2 elements; we allow an arbitrary number from 1 to n. We can use pairs to encode tuples: (a, b, c) can be encoded as (a, (b, c)). Though for performance and convenience, most languages implement them natively.

Records

We can easily generalize tuples to records by annotating each field with a label. A record is a bundle of values with labels; it's a map of labels to values and types. The order of a record's fields doesn't matter; the only index is the label. If we allow numeric labels, then we can encode a tuple as a record, where the index implicitly encodes the numeric label of the record representation. No mainstream language has language-level support for records (two case classes in Scala may have the same arguments but a different constructor, so it's not quite the same; records are more like anonymous objects). This is because they're often quite inefficient in practice, but we'll still use them as a theoretical abstraction.

Sums and variants

Sum type

A sum type $T = T_1 + T_2$ is a disjoint union of $T_1$ and $T_2$. Pragmatically, we can have sum types in Scala with case classes extending a sealed trait:

```scala
sealed trait Option[+T]
case class Some[+T](value: T) extends Option[T]
case object None extends Option[Nothing]
```

In this example, Option = Some + None. We say that $T_1$ is on the left, and $T_2$ on the right. Disjointness is ensured by the tags $\text{inl}$ and $\text{inr}$. We can think of these as functions that inject into the left or right of the sum type $T$. Still, these aren't really functions; they don't actually have function type. 
Instead, we use them to tag the left and right side of a sum type, respectively. Another way to think of these stems from the Curry-Howard correspondence. Recall that in the BHK interpretation, a proof of $P \lor Q$ is a pair <a, b> where a is 0 (also denoted $\text{inl}$) and b is a proof of $P$, or a is 1 (also denoted $\text{inr}$) and b is a proof of $Q$. To use elements of a sum type, we can introduce a case construct that allows us to pattern-match on a sum type, allowing us to distinguish the left type from the right one. We need to introduce these three special forms in our syntax:

```
t ::= ...                               // terms
      inl t                             // tagging (left)
      inr t                             // tagging (right)
      case t of inl x => t | inr x => t // case

v ::= ...   // values
      inl v // tagged value (left)
      inr v // tagged value (right)

T ::= ...   // types
      T + T // sum type
```

This also leads us to introduce some new evaluation rules, and we'll also introduce three typing rules.

Sums and uniqueness of type

The rules $\ref{eq:t-inr}$ and $\ref{eq:t-inl}$ may seem confusing at first. We only have one type to deduce from, so what do we assign to $T_2$ and $T_1$, respectively? These rules mean that we have lost uniqueness of types: if $t$ has type $T$, then $\text{inl } t$ has type $T+U$ for every $U$. There are a couple of solutions to this:

Infer $U$ as needed during typechecking.
Give constructors different names and only allow each name to appear in one sum type. This requires a generalization to variants, which we'll see next. OCaml adopts this solution.
Annotate each inl and inr with the intended sum type.

For now, we don't want to look at type inference and variants, so we'll choose the third approach for simplicity. We'll introduce these annotations as ascriptions on the injection operators in our grammar:

```
t ::= ...
      inl t as T
      inr t as T

v ::= ...
      inl v as T
      inr v as T
```

The evaluation rules would be exactly the same as previously, but with ascriptions in the syntax. 
The injection operators now also specify which sum type we're injecting into, for the sake of uniqueness of type.

Variants

Just as we generalized binary products to labeled records, we can generalize binary sums to labeled variants. We can label the members of the sum type, so that we write $\langle l_1: T_1, l_2: T_2 \rangle$ instead of $T_1 + T_2$ ($l_1$ and $l_2$ are the labels). As a motivating example, we'll show a useful idiom that is possible with variants, the optional value. We'll use this to create a table. The example below is just like in OCaml.

```
OptionalNat = <none: Unit, some: Nat>;
Table = Nat -> OptionalNat;

emptyTable = λt: Nat. <none=unit> as OptionalNat;

extendTable =
  λt: Table. λkey: Nat. λval: Nat.
    λsearch: Nat.
      if (equal search key) then <some=val> as OptionalNat
      else (t search)
```

The implementation works a bit like a linked list, with linear look-up. We can use the result from the table by distinguishing the outcome with a case:

```
x = case t(5) of
      <none=u> => 999
    | <some=v> => v
```

Recursion

In STLC, all programs terminate. We'll go into a little more detail later, but the main idea is that evaluation of a well-typed program is guaranteed to halt; we say that the well-typed terms are normalizable. Indeed, the infinite recursions from untyped lambda calculus (terms like $\text{omega}$ and $\text{fix}$) are not typable, and thus cannot appear in STLC. Since we can't express $\text{fix}$ in STLC, we can add it as a primitive, instead of defining it as a term in the language, to get recursion.

```
t ::= ...
      fix t
```

We'll need to add evaluation rules recreating its behavior, and a typing rule that restricts its use to the intended use-case. In order for a function to be recursive, the function needs to map a type to the same type, hence the restriction of $T_1 \rightarrow T_1$. The type $T_1$ will itself be a function type if we're doing a recursion. Still, note that the type system doesn't enforce this. 
There will actually be situations in which it will be handy to use something other than a function type inside a fix operator. Seeing that this fixed-point notation can be a little involved, we can introduce some nice syntactic sugar to work with it. This $t_1$ can now refer to $x$; that's the convenience offered by the construct. Although we don't strictly need to introduce typing rules (it's syntactic sugar, we're relying on existing constructs), a typing rule for this could be as given below. In Scala, a common error message is that a recursive function needs an explicit return type, for the same reasons as the typing rule above.

References

Mutability

In most programming languages, variables are (or can be) mutable. That is, variables can provide a name referring to a previously calculated value, as well as a way of overwriting this value with another (under the same name). How can we model this in STLC? Some languages (e.g. OCaml) actually formally separate variables from mutation. In OCaml, variables are only for naming; the binding between a variable and a value is immutable. However, there is the concept of mutable values, also called reference cells or references. This is the style we'll study, as it is easier to work with formally. A mutable value is represented at the type level as a Ref T (or perhaps even a Ref(Option T), since the null pointer cannot produce a value). The basic operations are allocation with the ref operator, dereferencing with ! (in C, we use the * prefix), and assignment with :=, which updates the content of the reference cell. Assignment returns a unit value.

Aliasing

Two variables can reference the same cell: we say that they are aliases for the same cell. Aliasing is when we have different references (under different names) to the same cell. 
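Aliasing can be sketched in Scala with a hypothetical mutable `Ref` cell (the class and method names here are illustrative, modeling `!` as `deref` and `:=` as `assign`; they are not library types):

```scala
// A reference cell holding a mutable value of type T.
class Ref[T](private var contents: T) {
  def deref: T = contents                 // models !r
  def assign(v: T): Unit = contents = v   // models r := v; result is unit
}

// Two names, one cell: r and s are aliases.
val r = new Ref(0)
val s = r
s.assign(2) // assigning through one alias...
val observed = r.deref // ...is visible through the other
```

Here `observed` is `2`: mutating the cell through `s` changes what `r` dereferences to, which is exactly the aliasing behavior described above.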
Modifying the value of the reference cell through one alias modifies the value for all other aliases.

The possibility of aliasing is all around us: in object references, explicit pointers (in C), arrays, communication channels, I/O devices; there's practically no way around it. Yet alias analysis is quite complex and costly, and often makes it hard for compilers to do the optimizations they would like to do.

With mutability, the order of operations now matters; r := 1; r := 2 isn't the same as r := 2; r := 1. If we recall the Church-Rosser theorem, we've lost the principle that all reduction paths lead to the same result. Therefore, some language designers disallow mutability (Haskell). But there are benefits to allowing it, too: efficiency, dependency-driven data flow (e.g. in GUIs), shared resources for concurrency (locks), etc. Therefore, most languages provide it.

Still, languages without mutability have come up with a number of abstractions that give us some of the benefits of mutability, like monads and lenses.

### Typing rules

We'll introduce references as a type Ref T to represent a variable of type T. We can construct a reference as r = ref 5, and access the contents of the reference using !r (this would return 5 instead of ref 5).

Let's define references in our language:

```
t ::=               // terms
      unit          // unit constant
      x             // variable
      λx: T. t      // abstraction
      t t           // application
      ref t         // reference creation
      !t            // dereference
      t := t        // assignment
```

### Evaluation

What is the value of ref 0? The crucial observation is that evaluating ref 0 must do something. Otherwise, the two following programs would behave the same:

```
r = ref 0
s = ref 0

r = ref 0
s = r
```

Evaluating ref 0 should allocate some storage, and return a reference (or pointer) to that storage. A reference names a location in the store (also known as the heap, or just memory). Concretely, the store could be an array of 8-bit bytes, indexed by 32-bit integers.
More abstractly, it's an array of values, or even more abstractly, a partial function from locations to values.

We can introduce this idea of locations in our syntax. This syntax is exactly the same as the previous one, but adds the notion of locations:

```
v ::=               // values
      unit          // unit constant
      λx: T. t      // abstraction value
      l             // store location

t ::=               // terms
      unit          // unit constant
      x             // variable
      λx: T. t      // abstraction
      t t           // application
      ref t         // reference creation
      !t            // dereference
      t := t        // assignment
      l             // store location
```

This doesn't mean that we'll allow programmers to write explicit locations in their programs. We just use this as a modeling trick; we're enriching the internal language to include some run-time structures.

With this added notion of stores and locations, the result of an evaluation now depends on the store in which it is evaluated, which we need to reflect in our evaluation rules. Evaluation must now include both the term $t$ and the store $\mu$:

Let's take a look at the evaluation rules for STLC with references, operator by operator.

The assignment rules $\ref{eq:e-assign1}$ and $\ref{eq:e-assign2}$ evaluate terms until they become values. When they have been reduced, we can do the actual assignment: as per $\ref{eq:e-assign}$, we update the store and return unit.

A reference $\text{ref }t_1$ first evaluates $t_1$ until it is a value ($\ref{eq:e-ref}$). To evaluate the reference operator, we find a fresh location $l$ in the store, bind $v_1$ to it, and return the location $l$.

We find the same congruence rule as usual in $\ref{eq:e-deref}$, where a term $!t_1$ first evaluates $t_1$ until it is a value. Once it is a value, we can return the value in the current store using $\ref{eq:e-derefloc}$.

The evaluation rules for abstraction and application are augmented with stores, but otherwise unchanged.

### Store typing

What is the type of a location? The answer to this depends on what is in the store.
Unless we specify it, a store could contain anything at a given location, which is problematic for typechecking. The solution is to type the locations themselves. This leads us to a typed store:

As a first attempt at a typing rule, we can just say that the type of a location is given by the type of the value in the store at that location:

This is problematic, though; in the following, the typing derivation for $!l_2$ would be infinite because we have a cyclic reference:

The core of the problem here is that we would need to recompute the type of a location every time. But that shouldn't be necessary. Seeing that references are strongly typed as Ref T, we know exactly what type of value we can place in a given store location. Indeed, the typing rules we chose for references guarantee that a given location in the store is always used to hold values of the same type.

So to fix this problem, we need to introduce a store typing. This is a partial function from locations to types, which we'll denote by $\Sigma$.

Suppose we're given a store typing $\Sigma$ describing the store $\mu$. We can use $\Sigma$ to look up the types of locations, without doing a lookup in $\mu$:

This tells us how to check the store typing, but how do we create it? We can start with an empty typing $\Sigma = \emptyset$, and add a typing relation with the type of $v_1$ when a new location is created during evaluation of $\ref{eq:e-refv}$.

The rest of the typing rules remain the same, but are augmented with the store typing. So in conclusion, we have updated our evaluation rules with a store $\mu$, and our typing rules with a store typing $\Sigma$.

### Safety

Let's take a look at progress and preservation in this new type system. Preservation turns out to be more interesting, so let's look at that first.

We've added a store and a store typing, so we need to update the statement of preservation to include these. Naively, we'd write:

But this would be wrong!
In this statement, $\Sigma$ and $\mu$ would not be constrained to be correlated at all, which they need to be. This constraint can be defined as follows:

A store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$ (which we denote by $\Gamma\mid\Sigma\vdash\mu$) if the following is satisfied:

This gets us closer, and we can write the following preservation statement:

But this is still wrong! When we create a new cell with $\ref{eq:e-refv}$, we would break the correspondence between store typing and store.

The correct version of the preservation theorem is the following:

This preservation theorem just asserts that there is some store typing $\Sigma’ \supseteq \Sigma$ (agreeing with $\Sigma$ on the types of all old locations, but possibly adding new locations), such that $t’$ is well typed in $\Sigma’$.

The progress theorem must also be extended with stores and store typings:

Suppose that $t$ is a closed, well-typed term; that is, $\emptyset\mid\Sigma\vdash t: T$ for some type $T$ and some store typing $\Sigma$. Then either $t$ is a value or else, for any store $\mu$ such that $\emptyset\mid\Sigma\vdash\mu$, there is some term $t’$ and store $\mu’$ with $t\mid\mu \longrightarrow t’\mid\mu’$.

## Type reconstruction and polymorphism

In type checking, we wanted, given $\Gamma$, $t$ and $T$, to check whether $\Gamma\vdash t: T$.
So far, for type checking to take place, we required explicit type annotations.

In this section, we'll look into type reconstruction, which allows us to infer types when type annotations aren't present: given $\Gamma$ and $t$, we want to find a type $T$ such that $\Gamma\vdash t:T$.

Immediately, we can see potential problems with this idea:

- Abstractions without a parameter type annotation seem complicated to reconstruct (a parameter could have almost any type)
- A term can have many types

To solve these problems, we'll introduce polymorphism into our type system.

### Constraint-based Typing Algorithm

The idea is to split the work in two: first, we generate and record constraints, and then we unify them (that is, attempt to satisfy the constraints).

In the following, we'll denote constraints as a set of equations $\set{T_i \hat{=} U_i}_{i=1, \dots, m}$, constraining type variables $T_i$ to actual types $U_i$.

### Constraint generation

The constraint generation algorithm can be described as the following function $TP: \text{Judgement } \rightarrow \text{Equations}$:

This creates a set of constraints between type variables and the expected types.

### Soundness and completeness

In general, a type reconstruction algorithm $\mathcal{A}$ assigns to an environment $\Gamma$ and a term $t$ a set of types $\mathcal{A}(\Gamma, t)$.

The algorithm is sound if for every type $T\in \mathcal{A}(\Gamma, t)$ we can prove the judgment $\Gamma\vdash t: T$.

The algorithm is complete if for every provable judgment $\Gamma\vdash t: T$ we have $T\in\mathcal{A}(\Gamma, t)$.

Soundness and completeness are the two directions of the following implication:

Soundness and completeness are about the $\Leftarrow$ and $\Rightarrow$ directions of the above, respectively. The TP function we defined previously for STLC is sound and complete, and the relationship is thus $\iff$.
We can write this mathematically as follows:

Where:

- $a$ is a new type variable
- $EQNS = TP(\Gamma\vdash t: a)$ is the set of type constraints
- $\bar{b} = \text{tv}(EQNS)\setminus\text{tv}(\Gamma)$, where $\text{tv}$ denotes the set of free type variables
- $[T / a] EQNS$ is notation for replacing $a$ with $T$ in $EQNS$

What this means is still a little unclear to me, but it seems to say that we can prove the judgement $\Gamma\vdash t: T$ if and only if we have some type variables in the constraints ??? todo

### Substitutions

Now that we've generated constraints of the form $\set{T_i\ \hat{=}\ U_i}_{i=1, \dots, m}$, we'd like a way to substitute these constraints into real types. We must generate a set of substitutions:

These substitutions cannot be cyclical: the type variables may not appear recursively on their right-hand side (directly or indirectly). We can write this requirement as:

A substitution is an idempotent mapping from type variables to types, mapping all but a finite number of type variables to themselves. We can think of a substitution as a set of equations:

Alternatively, we can think of it as a function transforming types (based on the set of equations). Substitution is applied in a straightforward way:

Substitution has two properties:

- Idempotence: $s(s(T)) = s(T)$
- Composition: $(f \circ g)\ x = f(g\ x)$, the composition of two substitutions, is also a substitution

### Unification

We present a unification algorithm based on Robinson's 1965 unification algorithm:

This function is called $\text{mgu}$, which stands for most general unifier.

A substitution $u$ is a unifier of a set of equations $\set{T_i\ \hat{=}\ U_i}$ if $uT_i = uU_i,\, \forall i$. This means that it finds an assignment to the type variables in the constraints such that all equations become trivially true.

A substitution is a most general unifier if for every other unifier $u’$ of the same equations, there exists a substitution $s$ such that $u’ = s\circ u$.
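To make the two phases concrete, here is a hedged Python sketch of constraint generation followed by Robinson-style unification. The term and type encodings are my own: type variables are strings, arrow types are `("->", T, U)`, and terms are tagged tuples.

```python
counter = 0
def fresh():
    """Generate a fresh type variable name."""
    global counter
    counter += 1
    return f"a{counter}"

def tp(env, term, expected):
    """Constraint generation: equations making env |- term : expected provable.
    Terms: ("var", x) | ("abs", x, body) | ("app", f, arg)."""
    tag = term[0]
    if tag == "var":
        return [(env[term[1]], expected)]
    if tag == "abs":
        _, x, body = term
        a, b = fresh(), fresh()
        return [(expected, ("->", a, b))] + tp({**env, x: a}, body, b)
    _, f, arg = term                      # application
    a = fresh()
    return tp(env, f, ("->", a, expected)) + tp(env, arg, a)

def is_var(t):
    return isinstance(t, str)

def subst(s, t):
    """Apply substitution s to type t."""
    if is_var(t):
        return subst(s, s[t]) if t in s else t
    return ("->", subst(s, t[1]), subst(s, t[2]))

def occurs(a, t):
    return a == t if is_var(t) else occurs(a, t[1]) or occurs(a, t[2])

def mgu(eqns, s=None):
    """Most general unifier of the equations, accumulated in s; fails if none."""
    s = dict(s or {})
    for lhs, rhs in eqns:
        t, u = subst(s, lhs), subst(s, rhs)
        if t == u:
            continue
        if is_var(t) and not occurs(t, u):
            s[t] = u
        elif is_var(u) and not occurs(u, t):
            s[u] = t
        elif not is_var(t) and not is_var(u):
            s = mgu([(t[1], u[1]), (t[2], u[2])], s)
        else:
            raise TypeError("no unifier (occurs check failed)")
    return s

# Reconstruct the type of the identity \x. x against a fresh variable t:
eqns = tp({}, ("abs", "x", ("var", "x")), "t")
s = mgu(eqns)
```

Here `subst(s, "t")` yields an arrow type with equal domain and codomain, i.e. the expected `a -> a` shape, while a cyclic constraint like `a = a -> a` makes `mgu` fail the occurs check.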
In other words, it must be less specific (or more general) than all other unifiers.

We won't prove this, but just state it as a theorem: if we get a set of constraints $\text{EQNS}$ which has a unifier, then $\text{mgu EQNS} \set{}$ computes the most general unifier of the constraints. If the constraints do not have a unifier, it fails.

In other words, the TP function is sound and complete.

### Strong normalization

With this type inference in place, we could be tempted to run it on the diverging $\Omega$ that we defined much earlier, or perhaps on the Y combinator. But as we said before, self-application is not typable. In fact, we can state a stronger assertion:

Strong Normalization Theorem: if $\vdash t: T$, then there is a value $V$ such that $t \longrightarrow^* V$.

In other words, if we can type it, it reduces to a value. In the case of the infinite recursion, we cannot type it, and it does not evaluate to a value (instead, it diverges). So looping infinitely isn't possible in STLC, which leads us to the corollary of this theorem: STLC is not Turing complete.

### Polymorphism

There are multiple forms of polymorphism:

- Universal polymorphism (aka generic types): the ability to instantiate type variables
- Inclusion polymorphism (aka subtyping): the ability to treat a value of a subtype as a value of one of its supertypes
- Ad-hoc polymorphism (aka overloading): the ability to define several versions of the same function name with different types

We'll concentrate on universal polymorphism, of which there are two variants: explicit and implicit.

### Explicit polymorphism

In STLC, a term can have many types, but a variable or parameter only has one type. With polymorphism, we open this up: we allow functions to be applied to arguments of many types.

To do this, we introduce a new polymorphic type $\forall a.T$, which can be used as any other type. The typing rules are:

The $\Lambda$ symbol represents a type abstraction. It corresponds to [T] or <T> in most programming languages.
For instance, the signature of map could be written as follows in Scala:

```scala
def map[A, B](f: A => B)(xs: List[A]) = ...
```

In lambda calculus we'd write:

### Implicit polymorphism

Implicit polymorphism does not require annotations for parameter types. The idea is that inference treats unannotated terms as polymorphic types. To have this feature, we must introduce the notion of type schemes. These are not fully general types, but an internal construct used to type named values (val or let ... in ... statements). A type scheme has the following syntax:

This feature is called implicit polymorphism or let-polymorphism. The resulting type system is called the Hindley/Milner system. Its typing rules are:

$\ref{eq:hm-var}$ means that we can verify $x: S$ if $(x: S)$ is in the environment and isn't overwritten later (in $\Gamma’$). This gives us some concept of scoping of variables.

$\ref{eq:hm-forall-e}$ allows us to verify specific instances of a polymorphic type, and $\ref{eq:hm-forall-i}$ allows us to generalize to a polymorphic type (with a hygiene condition telling us that the type variable we choose isn't already in the environment).

$\ref{eq:hm-let}$ is fairly straightforward. $\ref{eq:hm-arrow-i}$ and $\ref{eq:hm-arrow-e}$ are simply as in STLC.

### Alternative Hindley Milner

A let-in statement can be regarded as shorthand for a substitution:

We can use this to get a revised Hindley/Milner system, which we call HM’, where $\ref{eq:hm-let}$ is replaced by the following:

In essence, it only changes the typing rule for let so that it performs a step of evaluation before calculating the types.
This is equivalent to the previous HM system; we'll state that as a theorem, without proof.

Theorem: $\Gamma\vdash_{\text{HM}} t: S \iff \Gamma\vdash_{\text{HM}’} t: S$

The corollary to this theorem is that, if we let $t^*$ be the result of expanding all lets in $t$ using the substitution above, then:

The converse is true if every let-bound name is used at least once:

### Principal types

A type $T$ is a generic instance of a type scheme $S = \forall \alpha_1, \dots, \forall \alpha_n. T’$ if there is a substitution $s$ on $\alpha_1, \dots, \alpha_n$ such that $T = sT’$. In this case, we write $S \le T$.

A type scheme $S’$ is a generic instance of a type scheme $S$ iff for all types $T$:

In this case, we write $S \le S’$.

A type scheme $S$ is principal (or most general) for $\Gamma$ and $t$ iff:

- $\Gamma\vdash t: S$
- $\Gamma\vdash t: S’ \implies S \le S’$

A type system TS has the principal typing property iff, whenever $\Gamma\vdash_{\text{TS}} t: S$, there exists a principal type scheme for $\Gamma$ and $t$.

In other words, a type system with principal types is one where the type engine doesn't make any choices; it always finds the most general solution. The type checker may fail if it cannot advance without making a choice (e.g. for $\lambda x. x+x$, where the typechecker would have to choose between $\text{Int} \rightarrow \text{Int}$, $\text{Float} \rightarrow \text{Float}$, etc.).

The following can be stated as a theorem:

- HM’ without let has the principal typing property
- HM’ with let has the principal typing property
- HM has the principal typing property

## Subtyping

### Motivation

Under $\ref{eq:t-app}$, the following is not well typed:

We're passing a record to a function that selects its x member. This is not well typed, but would still evaluate just fine; after all, we're passing the function a better argument than it needs.

In general, we'd like to be able to define hierarchies of classes, with descendants having richer interfaces. These should still be usable instead of their ancestors.
We solve this using subtyping.

We achieve this by introducing a subtyping relation $S <: T$, and a subsumption rule:

This rule tells us that if $S <: T$, then any value of type $S$ can also be regarded as having type $T$. With this rule in place, we just need to define the rules for when we can assert $S <: T$.

### Rules

#### General rules

Subtyping is reflexive and transitive:

#### Records

To solve our previous example, we can introduce subtyping between record types:

Using $\ref{eq:t-sub}$, we can see that our example is now well typed. Of course, the subtyping rule we introduced here is too specific; we need something more general. We can do this by introducing three rules for subtyping of record types:

$\ref{eq:s-rcdwidth}$ tells us that a record is a supertype of a record with additional fields to the right. Intuitively, the reason the record with more fields is a subtype of the record with fewer fields is that it places a stronger constraint on values, and thus describes fewer values (think of the Venn diagram of possible values).

Of course, adding fields to the right only is not strong enough a rule, as order in a record shouldn't matter. We fix this with $\ref{eq:s-rcdperm}$, which allows us to reorder the record so that all additional fields are on the right: $\ref{eq:s-rcdperm}$, $\ref{eq:s-rcdwidth}$ and $\ref{eq:s-trans}$ together allow us to drop arbitrary fields within records.

Finally, $\ref{eq:s-rcddepth}$ allows the types of individual fields to be subtypes of the supertype record's fields.

Note that real languages often choose not to adopt these structural record subtyping rules.
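As a hedged sketch, the width, permutation, and depth rules (plus reflexivity for base types and a Top rule) can be combined into a single structural check in Python; the dict encoding of record types is my own.

```python
def is_subtype(s, t):
    """Structural subtyping check: S <: T for base types, records, and Top."""
    if t == "Top":
        return True                 # everything is a subtype of Top
    if isinstance(s, dict) and isinstance(t, dict):
        # width + permutation + depth: every field of T must appear in S
        # (in any order) at a subtype of T's field type.
        return all(l in s and is_subtype(s[l], t[l]) for l in t)
    return s == t                   # reflexivity for base types

point3d = {"x": "Nat", "y": "Nat", "z": "Nat"}
point2d = {"y": "Nat", "x": "Nat"}
```

With these definitions, `point3d` is a subtype of `point2d` (width plus permutation), but not the other way around.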
For instance, Java has no depth subtyping (a subclass may not change the argument or result types of a method of its superclass), and no permutation for classes (single inheritance means that each member can be assigned a single index; new members can be added as new indices "on the right"), but it has permutation for interfaces (multiple inheritance of interfaces is allowed).

#### Arrow types

Function types are contravariant in the argument and covariant in the return type. The rule is therefore:

#### Top type

For convenience, we have a top type that everything is a subtype of. In Java, this corresponds to Object.

#### Aside: structural vs. declared subtyping

The subtype relation we defined for records is structural: we decide whether $S$ is a subtype of $T$ by examining the structure of $S$ and $T$. By contrast, most OO languages (e.g. Java) use declared subtyping: $S$ is only a subtype of $T$ if the programmer has stated that it should be (with extends or implements).

We'll come back to this when we talk about Featherweight Java.

### Properties of subtyping

#### Safety

The problem with subtyping is that it changes how we do proofs. They become a bit more involved, as the typing relation is no longer syntax directed; when we're proving things, we need to start making choices, as the rule $\ref{eq:t-sub}$ could appear anywhere. Still, the proofs are possible.

#### Inversion lemma for subtyping

Before we can prove progress and preservation, we'll introduce the inversion lemma for subtyping.

Inversion lemma: if $U <: T_1 \rightarrow T_2$, then $U$ has the form $U_1 \rightarrow U_2$ with $T_1 <: U_1$ and $U_2 <: T_2$.

The proof is by induction on subtyping derivations:

- Case $\ref{eq:s-arrow}$, $U=U_1 \rightarrow U_2$: immediate, as $U$ already has the correct form, and as we can deduce $T_1 <: U_1$ and $U_2 <: T_2$ from $\ref{eq:s-arrow}$.
- Case $\ref{eq:s-refl}$, $U=T_1 \rightarrow T_2$: by applying $\ref{eq:s-refl}$ twice, we get $T_1 <: T_1$ and $T_2 <: T_2$, as required.
- Case $\ref{eq:s-trans}$, $U <: W$ and $W <: T_1 \rightarrow T_2$:
  - By the IH on the second subderivation, we find that $W$ has the form $W_1 \rightarrow W_2$ with $T_1 <: W_1$ and $W_2 <: T_2$.
  - Applying the IH again to the first subderivation, we find that $U$ has the form $U_1 \rightarrow U_2$ with $W_1 <: U_1$ and $U_2 <: W_2$.
  - By $\ref{eq:s-trans}$, we get $T_1 <: U_1$, and by $\ref{eq:s-trans}$ again, $U_2 <: T_2$, as required.

#### Inversion lemma for typing

We'll introduce another lemma, but this time for typing (not subtyping):

Inversion lemma: if $\Gamma\vdash\lambda x: S_1.\ s_2 : T_1 \rightarrow T_2$, then $T_1 <: S_1$ and $\Gamma\cup(x: S_1)\vdash s_2: T_2$.

Again, the proof is by induction on typing derivations:

- Case $\ref{eq:t-abs}$, where $T_1 = S_1$, $T_2 = S_2$ and $\Gamma\cup(x: S_1)\vdash s_2 : S_2$: the result is immediate (using $\ref{eq:s-refl}$ to get $T_1 <: S_1$ from $T_1 = S_1$).
- Case $\ref{eq:t-sub}$, $\Gamma\vdash\lambda x: S_1.\ s_2: U$ and $U <: T_1 \rightarrow T_2$:
  - By the inversion lemma for subtyping, we have $U = U_1 \rightarrow U_2$, with $T_1 <: U_1$ and $U_2 <: T_2$.
  - By the IH, we then have $U_1 <: S_1$ and $\Gamma\cup(x: S_1)\vdash s_2 : U_2$.
  - We can apply $\ref{eq:s-trans}$ to $U_1 <: S_1$ and $T_1 <: U_1$ to get $T_1 <: S_1$.
  - We can apply $\ref{eq:t-sub}$ to $\Gamma\cup(x: S_1)\vdash s_2: U_2$ and $U_2 <: T_2$ to conclude $\Gamma\cup(x: S_1)\vdash s_2: T_2$.

#### Preservation

Remember that preservation states that if $\Gamma\vdash t: T$ and $t\longrightarrow t’$ then $\Gamma\vdash t’: T$.

The proof is by induction on typing derivations:

- Case $\ref{eq:t-sub}$: $t: S$ and $S <: T$. By the IH, $\Gamma\vdash t’: S$. By $\ref{eq:t-sub}$, $\Gamma\vdash t’: T$.
- Case $\ref{eq:t-app}$: $t = t_1\ t_2$, $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$, $\Gamma\vdash t_2: T_{11}$ and $T = T_{12}$.
By the inversion lemma for evaluation, there are three rules by which $t\longrightarrow t’$ can be derived:
  - Subcase $\ref{eq:e-app1}$: $t_1 \longrightarrow t_1’$ and $t’ = t_1’\ t_2$. The result follows from the IH and $\ref{eq:t-app}$.
  - Subcase $\ref{eq:e-app2}$: $t_1 = v_1$, $t_2 \longrightarrow t_2’$ and $t’ = v_1\ t_2’$. The result follows from the IH and $\ref{eq:t-app}$.
  - Subcase $\ref{eq:e-appabs}$: $t_1 = \lambda x: S_{11}.\ t_{12}$, $t_2 = v_2$ and $t’ = [x\mapsto v_2]t_{12}$. By the inversion lemma for typing, $T_{11} <: S_{11}$ and $\Gamma\cup (x: S_{11})\vdash t_{12}: T_{12}$. By $\ref{eq:t-sub}$, $\Gamma\vdash t_2: S_{11}$. By the substitution lemma, $\Gamma\vdash t’: T_{12}$.

### Subtyping features

#### Casting

In languages like Java and C++, ascription is a little more interesting than what we previously defined it as. In these languages, ascription serves as a casting operator.

Contrary to $\ref{eq:t-ascribe}$, the $\ref{eq:t-cast}$ rule allows the ascription to be of a different type than the term. This gives the programmer an escape hatch to get around the type checker. However, this laissez-faire solution means that a run-time check is necessary, as $\ref{eq:e-cast}$ shows.

#### Variants

The subtyping rules for variants are almost identical to those of records, the main difference being that the width rule allows variants to be added, not dropped:

The intuition for $\ref{eq:s-variantwidth}$ is that a tagged expression $\langle l = t \rangle$ belongs to a variant type $\langle l_i : {T_i}^{i\in 1\dots n} \rangle$ if the label $l$ is one of the possible labels $\set{l_i}$. This is easy to understand if we consider the Option example that we used previously: some and none are subtypes of Option.

#### Covariance

List is an example of a covariant type constructor: we want List[None] to be a subtype of List[Option].

#### Invariance

References are neither covariant nor contravariant: a reference is an example of an invariant constructor.
- When a reference is read, the context expects a $T_1$, so giving an $S_1 <: T_1$ is fine.
- When a reference is written, the context provides a $T_1$. If the actual type of the reference is $\text{Ref } S_1$, someone may later use the $T_1$ as an $S_1$, so we need $T_1 <: S_1$.

Similarly, arrays are invariant, for the same reason:

Instead, Java has covariant arrays:

This is because the Java language designers felt that they needed to be able to write a sort routine for mutable arrays, and implemented this as a quick fix. It turned out to be a mistake that even the Java designers regret.

The solution to this invariance problem is based on the following observation: a Ref T can be used either for reading or for writing. To be able to have covariant reading and contravariant writing, we can split Ref T in three:

- Source T: a reference cell with read capability
- Sink T: a reference cell with write capability
- Ref T: a reference cell with both capabilities

The typing rules then limit dereferencing to sources, and assignment to sinks:

The subtyping rules establish sources as covariant constructors, sinks as contravariant, and a reference as a subtype of both:

### Algorithmic subtyping

So far, in STLC, our typing rules were syntax directed. This means that for every form of a term, a specific rule applied; which rule to choose was always straightforward.

The reason the choice is so straightforward is that we can divide the positions of a typing relation like $\ref{eq:t-app}$ into input positions ($\Gamma$ and $t$) and output positions ($T_{11}$, $T_{12}$).

However, by introducing subtyping, we introduced rules that break this: $\ref{eq:t-sub}$ and $\ref{eq:s-trans}$ apply to any kind of term, and can appear at any point of a derivation. Every time our type checking algorithm encounters a term, it must decide which rule to apply.
$\ref{eq:s-trans}$ also introduces the problem of having to pick an intermediate type $U$ (which is neither in an input nor an output position), for which there can be multiple choices. $\ref{eq:s-refl}$ also overlaps with the conclusions of other rules, although this is a less severe problem.

But this excess flexibility isn't strictly needed; we don't need 1000 ways to prove a given typing or subtyping statement, one is enough. The solution to these problems is to replace the ordinary, declarative typing and subtyping relations with algorithmic relations, whose sets of rules are syntax directed. This implies proving that the algorithmic relations are equivalent to the original ones.

## Objects

For simple objects and classes, we can easily use a translational analysis, converting ideas like dynamic dispatch, state and inheritance into forms derived from lambda calculus: (higher-order) functions, records, references, recursion, subtyping. However, for more complex features (like this), we'll need a more direct treatment.

Consider the following counter class:

```java
class Counter {
    protected int x = 1;
    int get() { return x; }
    void inc() { x++; }
}
```

In lambda calculus, we can represent such an object as a record inside a let. To create an object, we can just do the following:

This returns a newCounter object of type $\text{Unit} \rightarrow \text{Counter}$, where $\text{Counter} = \set{\text{get}: \text{Unit} \rightarrow \text{Nat},\ \text{inc}: \text{Unit}\rightarrow\text{Unit}}$.

More generally, the state may consist of more than a single reference cell, so we let the state be represented by a variable r corresponding to a record with (potentially) multiple fields.

### Dynamic dispatch

When an operation is invoked on an object, the ensuing behavior depends on the object itself; indeed, two objects of the same type may be implemented internally in completely different ways.

This is late binding for function calls. The idea is to bind a call to the corresponding function at runtime.
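As a hedged sketch of the record-inside-a-let encoding and of dispatch (the Python rendering is my own): the object is just a record (here a dict) of closures over a shared, hidden state, and invoking a method looks it up in the record at run time.

```python
def new_counter(_=None):
    """Unit -> Counter: allocate fresh state and close the methods over it."""
    x = [1]  # the representation record r = {x = ref 1}, hidden from callers
    def get(_=None):       # get: Unit -> Nat
        return x[0]
    def inc(_=None):       # inc: Unit -> Unit
        x[0] = x[0] + 1
    return {"get": get, "inc": inc}  # the object: a record of methods

c = new_counter()
c["inc"]()
c["inc"]()
```

Two calls to `inc` make `get` return 3, while a second `new_counter()` starts independently at 1; the state list `x` is reachable only through the method closures, which is the encapsulation described below.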
todo

### Encapsulation

In most OO languages, each object consists of some internal state. The state is directly accessible to the methods, but inaccessible from the outside: a form of information hiding.

Note that this information hiding is different from that of abstract data types (ADTs), which do not offer dynamic dispatch.

In Java, this encapsulation can be enabled with protected.

The type of an object is just the set of operations that can be performed on it. It doesn't include the internal state.

### Inheritance

Subtyping is a way to talk about types. Inheritance is more focused on the idea of sharing behavior, on avoiding duplication of code. The basic mechanism of inheritance is classes. Classes can be instantiated to create new objects ("instances"), or refined to create new classes ("subclasses"). Subclasses are subtypes of their parent classes. We'll talk about both here, but it's important to know the distinction.

We saw previously that a record A with more fields than B is a subtype of B. As an example, let's look at a ResetCounter inheriting from Counter, adding a reset method that sets x to 0.

Initially, we can try to do this just by copying the code and adding a method, but this goes against the DRY principle from software engineering. Another thing that we could try is to take a Counter as an argument in the object generator, but this is problematic because we're not sharing the state: we'd have two separate counts in Counter and ResetCounter, and they cannot access each other's state.

To avoid these problems, we must separate the definitions of the methods from the act of binding these methods to a particular set of instance variables in the object generator. Here, we use the age-old computer science adage that every problem can be solved with an additional level of indirection.

We'll first have to introduce the notion of super. We know this construct from Java, for instance. Java's super gives us a mechanism to avoid dynamic dispatch.
We can specifically call the methods of the class we're inheriting from through super.

To define a subclass, the idea is then to instantiate the super, and bind the methods of the object to the super's methods. Both classes have access to the same value through the use of references.

This also allows us to call super in redefined methods (so inc could call super.inc if it needed to).

Our record $r$ can even contain more variables than the superclass needs, as records with more fields are subtypes of those with a subset of the fields. This allows us to have more instance variables in the subclass.

Note that to be more rigorous, we'd have to define this more precisely. In most OO languages, things aren't subtypes of each other just because they have the same methods; it's because we declare them to be so. We'd need to be more rigorous to model this.

### This

OO languages provide access to this, the current method receiver. It may be an instance of a subclass, not the class we're currently looking at. So this's actual class (at runtime) must be able to override the definitions.

```java
class E {
    protected int x = 0;
    int m() { x = x + 1; return x; }
    int n() { x = x - 1; return this.m(); }
}

class F extends E {
    int m() { x = x + 100; return x; }
}
```

Above, we saw how to call the parent class through super. To call methods between each other, we need to add this.

In an initial attempt at implementing this in lambda calculus, we can add a fix operator to the class definition, so that we can call ourselves.

But the fixed point here is "closed": we have "tied the knot" when we built the record. So this does not model the behavior of this in OO.
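For intuition, here is a Python sketch of the E/F example with the recursion left open (the encoding is my own): each class abstracts over the state and over `this`, and the knot is tied only at object creation, behind a thunk, so that `this.m()` reaches the subclass's override.

```python
def e_class(state, this):
    """Methods of E, abstracted over the receiver `this`."""
    def m(_=None):
        state["x"] += 1
        return state["x"]
    def n(_=None):
        state["x"] -= 1
        return this()["m"]()        # late-bound call through `this`
    return {"m": m, "n": n}

def f_class(state, this):
    """F extends E, overriding m; super's methods share the same state."""
    sup = e_class(state, this)      # super = E's methods on the same state
    def m(_=None):
        state["x"] += 100
        return state["x"]
    return {**sup, "m": m}

def new(cls):
    """Tie the knot at object creation; `this` is delayed by a thunk."""
    state = {"x": 0}
    def this():
        return cls(state, this)
    return this()

obj = new(f_class)
```

Calling `obj["n"]()` decrements `x` and then dispatches `this.m()` to F's override, just as in the Java example; the thunk around `this` is the call-by-name delay the text discusses, recomputing the method table on every use.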
To solve this, we can move the application of fix from the class definition to the object creation function (essentially switching the order of $\text{fix}$ and $\lambda r: \text{CounterRep}$):

Note that this changes the type signature: todo (slide 50)

### Using this

Let's continue the example from above by defining a new class of counter object, keeping count of the number of times set has been called. We'll call this an "instrumented counter".

```
InstrCounter = {
    get: Unit -> Nat,
    set: Nat -> Unit,
    inc: Unit -> Unit,
    accesses: Unit -> Nat
};

InstrCounterRep = {
    x: Ref Nat,
    a: Ref Nat
};

instrCounterClass =
    λr: InstrCounterRep.
        λthis: InstrCounter.
            let super = setCounterClass r this in
            {get = super.get,
             set = λi: Nat. (r.a := succ(!(r.a)); super.set i),
             inc = super.inc,
             accesses = λ_: Unit. !(r.a)};
```

But this implementation is not very useful, as the object creator diverges. Intuitively, the problem is the "unprotected" use of this. A solution is to "delay" this by putting a dummy abstraction in front of it. This essentially replaces call-by-value with call-by-name. Now, this is of type $\text{Unit} \rightarrow \text{SetCounter}$.

This works, but very slowly. The delaying we added has a side effect: instead of computing the method table just once, we now re-compute it on every method invocation, since we're in call-by-name. The solution here is to use a lazy value. In lambda calculus, we represent lazy values with a reference, along with a flag recording whether we've computed the value yet. Section 18.12 describes this in more detail.

## Featherweight Java

We have now covered the topics related to the essence of objects, but there are still certain things missing compared to Java. With objects, we've captured the runtime aspect of classes, but we haven't really talked about classes as types.
<p>We’re also missing a discussion of named types with declared subtyping (we’ve only done structural subtyping), of recursive types (like the ones we need for list tails, for instance), and of run-time type analysis. Additionally, most type systems have escape hatches known as casts, which we haven’t talked about either.</p>
<p>Seeing that we have plenty to talk about, let’s try to define a model for Java. Remember that a model always abstracts details away, so there’s no such thing as a perfect model; it’s always a question of which trade-offs we choose for our specific use case. Seeing that Java has a lot of different purposes, there are lots of possible models. For instance, some of the choices we need to make are:</p>
<ul>
  <li>Source level vs. bytecode level</li>
  <li>Large (inclusive) vs. small (simple) models</li>
  <li>Type system vs. run-time</li>
  <li>Models of specific features</li>
</ul>
<p>Featherweight Java (FJ) was proposed as a tool for analyzing GJ (Java with generics), and has since been used to study proposed Java extensions. It aims to be very simple, modeling just the core OO features and their types, and nothing else. It models classes, objects, methods, method invocation, fields, field access, inheritance and casting, but leaves out more complex topics such as reflection, concurrency, exceptions, loops, assignment and overloading.</p>
<p>The model aims to be very explicit:</p>
<ul>
  <li>Every class must declare a parent class</li>
  <li>All classes must have a constructor</li>
  <li>All fields must be represented 1-to-1 in the constructor</li>
  <li>The constructor must call <code>super()</code></li>
  <li>The receiver object is always explicitly named in method invocation and field access (using <code>this.x</code> or <code>that.x</code>)</li>
  <li>Methods are just a single return expression</li>
</ul>
<h3 id="structural-vs-nominal-type-systems">Structural vs. nominal type systems</h3>
<p>There’s a big dichotomy in the world of programming languages. On one hand, we have structural type systems, where names are convenient but inessential abbreviations; what really matters about a type in a structural type system is its structure.</p>
<p>Structural systems are somewhat cleaner and more elegant, and easier to extend, but once we need to talk about recursive types, some of the elegance falls away. On the other hand, what’s used in almost all mainstream programming languages is nominal type systems. Here, recursive types are much simpler, and using names everywhere makes type checking much simpler. Having named types is also useful at run time for casting, type testing, reflection, etc.</p>
<h3 id="representing-objects">Representing objects</h3>
<p>How can we represent an object? What defines it? Two objects are different if their constructors are different, or if their constructors have been passed different arguments. This observation leads us to the idea that we can identify an object fully by looking at its <code>new</code> expression. Here, having omitted assignments makes our life much easier.</p>
<h3 id="syntax-fj">Syntax</h3>
<p>We’ll use the notation $\bar{C}$ to mean arbitrary repetition of $C$ (a constructor) or $c$ (a variable or value). The notation $\bar{C}\ \bar{f}$ means we’ve “zipped” the two together, as in $(C_1\ f_1, \dots, C_n\ f_n)$.</p>
<p>todo</p>
<h3 id="evaluation-fj">Evaluation</h3>
<p>FJ uses call-by-value, like lambda calculus and Java.</p>
<h3 id="typing-fj">Typing</h3>
<p>We have two rules for casting: one for subtypes, and one for supertypes. We do not allow casting to an unrelated type, because FJ complies with Java, and Java doesn’t allow it. For methods and classes, we want to make sure that overrides are valid, and that we pass the correct arguments to the superclass constructor.</p>
<p>Also note that our typing rules often have subsumption built into them, instead of having a separate subsumption rule. This gives us algorithmic subtyping, which we need for two reasons:</p>
<ul>
  <li>To perform static overloading resolution (picking between different overloaded methods at compile time), we need to be able to speak about <em>the</em> type of an expression (one single type, not several of them)</li>
  <li>We’d run into trouble typing conditional expressions.</li>
</ul>
<p>Conditional expressions are not something we have included in FJ, but regular Java has them, and we may wish to include them as an extension to FJ. Let’s talk about this problem with conditionals in a little more detail. If we have a conditional (or ternary) expression $t_1\ ?\ t_2 : t_3$, with $t_1: T_1$, $t_2: T_2$, $t_3: T_3$, what is the type of the whole expression? The simple solution is the least common supertype (this corresponds to the lowest common ancestor in the class hierarchy), but that becomes problematic with interfaces, which allow for multiple inheritance: if $T_2$ and $T_3$ both implement $I_2$ and $I_3$, we wouldn’t know which one to pick.</p>
<p>The actual Java rule is that the type of the expression is $\min(T_2, T_3)$. Scala (in Dotty) solves this with union types, where the result type is $T_2 \mid T_3$.</p>
<h3 id="properties-fj">Properties</h3>
<h4 id="progress-fj">Progress</h4>
<p>We can’t actually prove progress, as well-typed programs can get stuck because of casting: a cast can fail, and we’d get stuck. The solution is to weaken the statement of progress. We’ll instead try to prove that a well-typed FJ term is either a value, reduces to one, or gets stuck at a failing cast.</p>
<p>To formalize this, we need a little more work. Indeed, since casts are checked at run time, we need to describe the evaluation context.</p>
<p>todo</p>
<p>We can now restate progress more formally. Suppose $t$ is a closed, well-typed term. Then either:</p>
<ul>
  <li>$t$ is a value,</li>
  <li>$t \longrightarrow t’$ for some $t’$, or</li>
  <li>for some evaluation context $E$, we can express $t$ as $t = E[(C)(\text{new } D(\bar{v}))]$, with $\neg (D <: C)$.</li>
</ul>
<h4 id="preservation-fj">Preservation</h4>
<p>Theorem: todo</p>
<p>But preservation doesn’t actually hold here. Because we allow casts to go both up and down, we can upcast to <code>Object</code> before downcasting to another, unrelated type. Because FJ must model Java, we need to introduce a rule for this; in this new rule, we emit a “stupid warning”.</p>
<h2 id="foundations-of-scala">Foundations of Scala</h2>
<h3 id="modeling-lists">Modeling Lists</h3>
<p>If we’d like to apply everything we’ve learned so far to model Scala, we’ll run into problems.</p>
<p>Say we’d like to model a <code>List</code>; immediately, we run into these problems:</p>
<ul>
  <li>It’s parameterized</li>
  <li>It’s recursive</li>
  <li>It can be invariant or covariant</li>
</ul>
<p>To solve this, we need a way to express type constructors:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">*            // kind of normal types (Boolean, Int, ...)
* -> *       // unary type constructor: something that takes a type, returns one
* -> * -> *
...</code></pre></figure>
<p>We need some way to express these, so we’ll introduce $\mu$, which works like $\lambda$ but for types. This allows us to have constructors for recursive types, $\mu t.\ T(t)$. While this is possible, it introduces problems for dealing with subtyping and equality (e.g. how do <code>T</code> and <code>Int -> T</code> relate?).</p>
<p>We can deal with variance by expressing definition-site variance as use-site variance, using Java wildcards:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// definition-site variance:
trait List[+T] { ... }
trait Function1[-T, +U] { ... }

List[C]
Function1[D, E]

// use-site variance:
trait List[T] { ... }
trait Function1[T, U]

List[_ <: C]
Function1[_ >: D, _ <: E]</code></pre></figure>
<p><code>Function1[_ >: D, _ <: E]</code> is the type of functions from some (unknown) supertype of <code>D</code> to some (unknown) subtype of <code>E</code>, which corresponds to an existential type. This is one possible way of modeling it, but it gets messy rather quickly. Can we find a nicer way of expressing this?</p>
<p>Scala has type members, so we can re-formulate the list as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">trait List { self =>
  type T
  def isEmpty: Boolean
  def head: T
  def tail: List { type T <: self.T } // refinement handling co-variance
}

def Cons[X](hd: X, tl: List { type T <: X }) = new List {
  type T = X
  def isEmpty = false
  def head = hd
  def tail = tl
}
// analogous for Nil</code></pre></figure>
<p>Using these path-dependent types <code>self.T</code>, we can avoid using existential types.</p>
<h3 id="abstract-types">Abstract types</h3>
<p>Abstract types are types without a concrete implementation.</p>
<p>They may have an upper and/or lower bound (as in <code>type L >: T <: U</code>).</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Abstract type:
trait KeyGen {
  type Key
  def key(s: String): this.Key
}

// Implementation
object HashKeyGen extends KeyGen {
  type Key = Int
  def key(s: String) = s.hashCode
}</code></pre></figure>
<p>We can reference the <code>Key</code> type of a term <code>k</code> as <code>k.Key</code>, which is a path-dependent type. For instance:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def mapKeys(k: KeyGen, ss: List[String]): List[k.Key] = ss.map(s => k.key(s))</code></pre></figure>
<p>The function <code>mapKeys</code> has a <em>dependent function type</em>. This is an interesting type, because it has an internal dependency: <code>(k: KeyGen, ss: List[String]) -> List[k.Key]</code>. In Scala 2, we can’t express this directly (we’d have to go through a trait with an <code>apply</code> method). Scala 3 (Dotty) introduces dependent function types at the language level, using a trick similar to what we just saw. In Dotty, the intention was to have everything map to a simple object type; this has been formalized in a calculus called DOT, for (path-)Dependent Object Types.</p>
<h3 id="dot">DOT</h3>
<p>The DOT syntax is described in the DOT paper. Types are written in uppercase, terms in lowercase. Note that recursive types $\mu (x: T)$ are different from the recursive types we’ve talked about previously, but we’ll get to that later.</p>
<p>As a small technicality, DOT imposes the restriction of only allowing member selection and application on variables, and not on values or full terms. This is equivalent, because we can always assign a value to a variable before selection or application. This way of writing programs is also called administrative normal form (ANF).</p>
<p>To simplify things, we can introduce a programmer-friendly notation with ASCII versions of the DOT constructs.</p>
<p>Our calculus does not have generic types, because we can encode them as dependent function types.</p>
<p>For instance, the polymorphic type of the <code>twice</code> method, $\forall X.\ (X \rightarrow X) \rightarrow X \rightarrow X$, is represented as:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">(cX: {A: Nothing..Any}) -> (cX.A -> cX.A) -> cX.A -> cX.A</code></pre></figure>
<p>The <code>cX</code> parameter is a kind of cell containing a type variable <code>X</code> (hence the name <code>cX</code>).</p>
<p>As an example, let’s see how Church booleans could be implemented:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">// Define an abstract "if type" IFT
type IFT = { if: (x: {A: Nothing..Any}) -> x.A -> x.A -> x.A }

let boolImpl =
    new(b: { Boolean: IFT..IFT } &
           { true: IFT } &
           { false: IFT })
        { Boolean = IFT } &
        { true  = { if = (x: {A: Nothing..Any}) => (t: x.A) => (f: x.A) => t } } &
        { false = { if = (x: {A: Nothing..Any}) => (t: x.A) => (f: x.A) => f } }
in ...</code></pre></figure>
<p>We can hide the implementation details of this with a small wrapper to which we apply <code>boolImpl</code>. This is all a little long-winded, so we can introduce some abbreviations:</p>
<p>…</p>
<p>With these in place, we can give an abbreviated definition:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">let bool = new { b =>
    type Boolean = { if: (x: { type A }) -> (t: x.A) -> (f: x.A) -> x.A }
    true  = { if: (x: { type A }) => (t: x.A) => (f: x.A) => t }
    false = { if: (x: { type A }) => (t: x.A) => (f: x.A) => f }
} : { b => type Boolean; true: b.Boolean; false: b.Boolean }</code></pre></figure>
<p>We’ve now introduced all the concepts we need to actually define the covariant list in DOT (see slides). This concept of hiding the implementation is <em>nominality</em>: a nominal type such as <code>List</code> is simply an abstract type with a hidden implementation.</p>
<p>This shows that nominal and structural types aren’t completely separate; we can do nominal types within a structural setting if we have these constructs.</p>
<h3 id="evaluation-dot">Evaluation</h3>
<p>Evaluation is interesting, because we’d like it to keep terms in ANF.</p>
<h3 id="abstract-types-dot">Abstract types</h3>
<p>Abstract types turn out to be both the most interesting and most difficult part of this, so let’s take a quick look at them before we go on. Abstract types can be used to encode type parameters (as in <code>List</code>), to hide information (as in <code>KeyGen</code>), and also to resolve some puzzlers like this one:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">trait Animal {
  def eat(food: Food): Unit
}

trait Cow extends Animal with Food {
  // error: does not override Animal.eat because of contravariance
  def eat(food: Grass): Unit
}

trait Lion extends Animal {
  // error: does not override Animal.eat because of contravariance
  def eat(food: Cow): Unit
}

trait Food
trait Grass extends Food</code></pre></figure>
<p>Scala disallows this, but Eiffel, Dart and TypeScript allow it. The trade-off that the latter languages choose is modeling power over soundness, though some of them have eventually come back around and tried to fix this (Dart has a strict mode, Eiffel proposed some data-flow analysis, …).</p>
<p>In Scala, this contravariance problem can be solved with abstract types:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">trait Animal {
  type Diet <: Food
  def eat(food: Diet): Unit
}

trait Cow extends Animal {
  type Diet <: Grass
  def eat(food: this.Diet): Unit
}

object Milka extends Cow {
  type Diet = AlpineGrass
  def eat(food: AlpineGrass): Unit
}</code></pre></figure>
<h3 id="progress-and-preservation-dot">Progress and preservation</h3>
<p>Progress as previously stated is actually wrong. Here’s a counter-example:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">t = let x = (y: Bool) => y in x</code></pre></figure>
<p>But we can extend our definition of progress. Instead of values, we’ll just want to get <em>answers</em>, which we define as variables, values or let-bindings. Proving this is difficult (it’s what took 8 years to prove), because we always need an inversion, and the subtyping relation is user-definable.</p>
<p>This is not a problem for simple type bounds:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">type T >: S <: U</code></pre></figure>
<p>But it becomes complex for nonsensical bounds:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">type T >: Any <: Nothing</code></pre></figure>
<p>By transitivity, this bound would mean that <code>Any <: Nothing</code>, and hence that all types are subtypes of each other. This is bad because it means that inversion fails: we cannot tell anything from the types anymore.</p>
<p>We might say that this should be easy to disallow in the compiler, but it isn’t; the compiler cannot always tell.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// S and T are both good:
type S = { type A; type B >: A <: Bot }
type T = { type A >: Top <: B; type B }

// But their intersection is bad:
type S & T == { type A >: Top <: Bot; type B >: Top <: Bot }</code></pre></figure>
<p>Bad bounds can arise from intersecting types with good bounds. This isn’t too bad in and of itself, as we could just check all intersection types, written or inferred, for bad bounds. But there’s a final problem: bad bounds can arise at run time. By preservation, if $\Gamma\vdash t: T$ and $t\longrightarrow u$ then $\Gamma\vdash u: T$. Because of subsumption, $u$ may also have a type $S$ which is a true subtype of $T$, and that type $S$ could have bad bounds (from an intersection, for instance).</p>
<p>To solve this, the idea is to only reason about environments $\Gamma$ arising from an actual computation in the preservation rule. Such an environment corresponds to an evaluated let binding, binding variables to values. Values are guaranteed to have good bounds because all their type members are aliases.</p>
<p>In other words, the let prefix acts like a store, a set of bindings $x = v$ of variables to values. Evaluation will then relate terms and stores. For the theorems of progress and preservation, we need to relate the environment and the store.</p>
<p>We’ll introduce a definition: an environment $\Gamma$ <em>corresponds</em> to a store $s$, written $\Gamma \sim s$, if for every binding $x=v$ in $s$ there is an entry $\Gamma\vdash x: T$ where $\Gamma \vdash_{!} v: T$. Here, $\vdash_{!}$ denotes an exact typing relation, whose typing derivation ends with All-I or {}-I (so no subsumption or structural rules).</p>
<p>By restating our theorems as follows, we can then prove them:</p>
<ul>
  <li><strong>Preservation</strong>: if $\Gamma\vdash t: T$ and $\Gamma\sim s$ and $s \mid t \longrightarrow s’ \mid t’$, then there exists an environment $\Gamma’ \supseteq \Gamma$ such that $\Gamma’ \vdash t’ : T$ and $\Gamma’ \sim s’$.</li>
  <li><strong>Progress</strong>: if $\Gamma\vdash t: T$ and $\Gamma\sim s$, then either $t$ is a normal form, or $s\mid t \longrightarrow s’ \mid t’$ for some store $s’$ and term $t’$.</li>
</ul>
<div class="footnotes">
<ol>
  <li>$(t, C) \in \text{Consts}$ is equivalent to $\text{Consts}(t) = C$. ↩</li>
  <li>Recall that this notation is used to say that a store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$, as defined in the section on safety in STLC with stores. ↩</li>
  <li>Both the course and TAPL only specify the inversion lemma for evaluation for the toy language with if-else and booleans, but the same reasoning applies to get an inversion lemma for evaluation for pure lambda calculus, in which three rules can be used: $\ref{eq:e-app1}$, $\ref{eq:e-app2}$ and $\ref{eq:e-appabs}$. ↩</li>
</ol>
</div>
<h1 id="cs-451-distributed-algorithms">CS-451 Distributed Algorithms</h1>
<p>2018-09-18T00:00:00+00:00 · <a href="https://kjaer.io/distributed-algorithms">https://kjaer.io/distributed-algorithms</a></p>
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#links" id="markdown-toc-links">Links</a> <ul>
<li><a href="#fair-loss-link-fll" id="markdown-toc-fair-loss-link-fll">Fair loss link (FLL)</a></li>
<li><a href="#stubborn-link" id="markdown-toc-stubborn-link">Stubborn link</a></li>
<li><a href="#perfect-link" id="markdown-toc-perfect-link">Perfect link</a></li>
</ul>
</li>
<li><a href="#impossibility-of-consensus" id="markdown-toc-impossibility-of-consensus">Impossibility of consensus</a> <ul>
<li><a href="#solvable-atomicity-problem" id="markdown-toc-solvable-atomicity-problem">Solvable atomicity problem</a></li>
<li><a href="#unsolvable-atomicity-problem" id="markdown-toc-unsolvable-atomicity-problem">Unsolvable atomicity problem</a></li>
</ul>
</li>
<li><a href="#failure-detection" id="markdown-toc-failure-detection">Failure detection</a></li>
</ul>
</li>
<li><a href="#reliable-broadcast" id="markdown-toc-reliable-broadcast">Reliable broadcast</a> <ul>
<li><a href="#best-effort-broadcast" id="markdown-toc-best-effort-broadcast">Best-effort broadcast</a></li>
<li><a href="#reliable-broadcast-1" id="markdown-toc-reliable-broadcast-1">Reliable broadcast</a></li>
<li><a href="#uniform-reliable-broadcast" id="markdown-toc-uniform-reliable-broadcast">Uniform reliable broadcast</a></li>
</ul>
</li>
<li><a href="#causal-order-broadcast" id="markdown-toc-causal-order-broadcast">Causal order broadcast</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#causality" id="markdown-toc-causality">Causality</a></li>
<li><a href="#algorithm" id="markdown-toc-algorithm">Algorithm</a></li>
</ul>
</li>
<li><a href="#total-order-broadcast" id="markdown-toc-total-order-broadcast">Total order broadcast</a></li>
<li><a href="#consensus" id="markdown-toc-consensus">Consensus</a> <ul>
<li><a href="#consensus-algorithm" id="markdown-toc-consensus-algorithm">Consensus algorithm</a></li>
<li><a href="#uniform-consensus-algorithm" id="markdown-toc-uniform-consensus-algorithm">Uniform consensus algorithm</a></li>
<li><a href="#uniform-consensus-algorithm-with-eventually-perfect-failure-detector" id="markdown-toc-uniform-consensus-algorithm-with-eventually-perfect-failure-detector">Uniform consensus algorithm with eventually perfect failure detector</a></li>
</ul>
</li>
<li><a href="#atomic-commit" id="markdown-toc-atomic-commit">Atomic commit</a> <ul>
<li><a href="#non-blocking-atomic-commit-nbac" id="markdown-toc-non-blocking-atomic-commit-nbac">Non-Blocking Atomic Commit (NBAC)</a></li>
<li><a href="#2-phase-commit" id="markdown-toc-2-phase-commit">2-Phase Commit</a></li>
</ul>
</li>
<li><a href="#terminating-reliable-broadcast-trb" id="markdown-toc-terminating-reliable-broadcast-trb">Terminating reliable broadcast (TRB)</a></li>
<li><a href="#group-membership" id="markdown-toc-group-membership">Group membership</a></li>
<li><a href="#view-synchronous-vs-communication" id="markdown-toc-view-synchronous-vs-communication">View-Synchronous (VS) communication</a></li>
<li><a href="#from-message-passing-to-shared-memory" id="markdown-toc-from-message-passing-to-shared-memory">From message passing to Shared memory</a></li>
<li><a href="#byzantine-failures" id="markdown-toc-byzantine-failures">Byzantine failures</a></li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<!-- More -->
<h2 id="introduction">Introduction</h2>
<ul>
<li><a href="http://dcl.epfl.ch/site/education/da">Website</a></li>
<li>Course follows the book <em>Introduction to Reliable (and Secure) Distributed Programming</em></li>
<li>Final exam is 60%</li>
<li>Projects in teams of 2-3 are 40%
<ul>
<li>The project is the implementation of a blockchain</li>
<li>Send team members to matej.pavlovic@epfl.ch</li>
</ul>
</li>
<li>No midterm</li>
</ul>
<p>Distributed algorithms are between the application and the channel.</p>
<p>We have a few commonly used abstractions:</p>
<ul>
<li><strong>Processes</strong> abstract computers</li>
<li><strong>Channels</strong> abstract networks</li>
<li><strong>Failure detectors</strong> abstract time</li>
</ul>
<p>When defining a problem, there are two important properties that we care about:</p>
<ul>
<li><strong>Safety</strong> states that nothing bad should happen</li>
<li><strong>Liveness</strong> states that something good should happen</li>
</ul>
<p>Safety is trivially implemented by doing nothing, so we also need liveness to make sure that the correct things actually happen.</p>
<h3 id="links">Links</h3>
<p>Two nodes can communicate through a link by passing messages. However, this message passing can be faulty: it can drop messages or repeat them. How can we ensure correct and reliable message passing under such conditions?</p>
<p>A link has two basic types of events:</p>
<ul>
<li>Send</li>
<li>Deliver</li>
</ul>
<h4 id="fair-loss-link-fll">Fair loss link (FLL)</h4>
<p>A fair loss link is a link that may lose or repeat some packets. This is the weakest type of link we can assume. In practice, it corresponds to UDP.</p>
<p>Deliver can be thought of as a reception event on the receiver end. The terminology used here (“deliver”) implies that the link delivers to the client, but this can equally be thought of as the client receiving from the link.</p>
<p>For a link to be considered a fair-loss link, we must respect the following three properties:</p>
<ul>
<li><strong>Fair loss</strong>: if the sender sends infinitely many times, the receiver must deliver infinitely many times. This does not guarantee that all messages get through, but at least ensures that some messages get through.</li>
<li><strong>No creation</strong>: every delivery must be the result of a send; no message must be created out of the blue.</li>
<li><strong>Finite duplication</strong>: a message can only be repeated by the link a finite number of times.</li>
</ul>
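<p>As a toy illustration of these properties (my own sketch, not course material), here is a simulated fair-loss link that randomly drops and duplicates each message; a sender that retransmits many times still gets some copy through, and nothing is delivered that was never sent:</p>

```python
import random

class FairLossLink:
    """Simulated fair-loss link: may lose or (finitely) duplicate a message."""
    def __init__(self, rng):
        self.rng = rng
        self.delivered = []

    def send(self, m):
        if self.rng.random() < 0.5:          # the message may be lost...
            copies = self.rng.randint(1, 3)  # ...or duplicated a finite number of times
            self.delivered.extend([m] * copies)

rng = random.Random(42)
link = FairLossLink(rng)

# Retransmitting many times: by the fair-loss property, some copy gets through.
for _ in range(100):
    link.send("hello")

assert "hello" in link.delivered         # at least one delivery
assert set(link.delivered) == {"hello"}  # no creation: only sent messages appear
```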
<h4 id="stubborn-link">Stubborn link</h4>
<p>A stubborn link is one that stubbornly delivers messages; that is, it ensures that the message is received, with no regard to performance.</p>
<p>A stubborn link can be implemented with a FLL as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="n">upon</span> <span class="n">send</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
<span class="n">FLL</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">FLL</span><span class="o">.</span><span class="n">deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="n">trigger</span> <span class="n">deliver</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The above uses generic pseudocode, but the syntax we’ll use in this course is as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">StubbornLinks</span> <span class="p">(</span><span class="n">sp2p</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">FairLossLinks</span> <span class="p">(</span><span class="n">flp2p</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">sp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
<span class="k">while</span> <span class="bp">True</span> <span class="n">do</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">flp2p</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">flp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">sp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Note that this piece of code is meant to sit between two abstraction levels; it is between the channel and the application. As such, it receives sends from the application and forwards them to the link, and receives delivers from the link and forwards them to the application.</p>
<p>It must respect the interface of the underlying FLL, and as such, only specifies send and receive hooks.</p>
<h4 id="perfect-link">Perfect link</h4>
<p>Here again, we respect the send/deliver interface. The properties are:</p>
<ul>
<li><strong>Validity</strong> or reliable delivery: if both peers are correct, then every message sent is eventually delivered</li>
<li><strong>No duplication</strong></li>
<li><strong>No creation</strong></li>
</ul>
<p>This is the type of link that we usually use: TCP is a perfect link, although it also has more guarantees (notably on message ordering, which this definition of a perfect link does not have). TCP keeps retransmitting a message stubbornly until it gets an acknowledgement, at which point it can stop transmitting. Acknowledgements aren’t actually needed <em>in theory</em>: the link would still work without them, but we would completely flood the network. Acknowledgements are thus a practical consideration for performance; just note that the theorists don’t care about them.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">PerfectLinks</span> <span class="p">(</span><span class="n">pp2p</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">StubbornLinks</span> <span class="p">(</span><span class="n">sp2p</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span><span class="p">;</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">pp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">sp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">sp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
<span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span> <span class="n">then</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">pp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>
<span class="n">add</span> <span class="n">m</span> <span class="n">to</span> <span class="n">delivered</span><span class="p">;</span></pre></td></tr></tbody></table></code></pre></figure>
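<p>The pseudocode above can be transliterated into a runnable sketch (the Python names are mine, not the course’s; the callbacks stand in for the sp2p/pp2p events):</p>

```python
class PerfectLink:
    """Perfect link built on a stubborn link: filters out duplicate deliveries."""
    def __init__(self, stubborn_send, deliver_callback):
        self.stubborn_send = stubborn_send  # plays the role of sp2pSend
        self.deliver = deliver_callback     # plays the role of pp2pDeliver
        self.delivered = set()              # <Init>: delivered := Ø

    def send(self, dest, m):                # pp2pSend
        self.stubborn_send(dest, m)

    def on_stubborn_deliver(self, src, m):  # sp2pDeliver
        if m not in self.delivered:
            self.delivered.add(m)
            self.deliver(src, m)

# The stubborn link may deliver the same message many times;
# the perfect link hands it to the application exactly once.
received = []
pl = PerfectLink(lambda dest, m: None, lambda src, m: received.append(m))
for _ in range(3):
    pl.on_stubborn_deliver("p1", "hello")
assert received == ["hello"]
```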
<h3 id="impossibility-of-consensus">Impossibility of consensus</h3>
<p>Suppose we’d like to compute prime numbers on a distributed system. Let <em>P</em> be the producer of prime numbers. Whenever it finds one, it notifies two servers, <em>S1</em> and <em>S2</em> about it. A client <em>C</em> may request the full list of known prime numbers from either server.</p>
<p>As in any distributed system, we want the servers to behave as a single (abstract) machine.</p>
<h4 id="solvable-atomicity-problem">Solvable atomicity problem</h4>
<p><em>P</em> finds 1013 as a new prime number, and sends it to <em>S1</em>, which receives it immediately, and <em>S2</em>, which receives it after a long delay. In the meantime, before both servers have received the update, we have an atomicity problem: one server has a different list from the other. In this time window, <em>C</em> will get different results from <em>S1</em> (which has numbers up to 1013) and <em>S2</em> (which only has numbers up to 1009, which is the previous prime).</p>
<p>A simple way to solve this is to have <em>C</em> send the new number (1013) to the other servers; if it requested from <em>S1</em> it’ll send the update to <em>S2</em> as a kind of write back, to make sure that <em>S2</em> also has it for the next request. We haven’t strictly defined the problem or its requirements, but this may need to assume a link that guarantees delivery and order (i.e. TCP, not UDP).</p>
<h4 id="unsolvable-atomicity-problem">Unsolvable atomicity problem</h4>
<p>Now assume that we have two prime number producers <em>P1</em> and <em>P2</em>. This introduces a new atomicity problem: the updates may not reach all servers atomically in order, and the servers cannot agree on the order.</p>
<p>This is <strong>impossible</strong> to solve; we won’t prove it, but universality of Turing is lost (unless we make very strong assumptions). This is known as the <a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf"><em>impossibility of consensus</em></a>.</p>
<h3 id="failure-detection">Failure detection</h3>
<p>A <strong>failure detector</strong> is a distributed oracle that provides processes with suspicions about crashed processes. There are two kinds of failure detectors, with the following properties</p>
<ul>
<li><strong>Perfect</strong>
<ul>
<li><strong>Strong completeness</strong>: eventually, every process that crashed is permanently suspected by every correct process</li>
<li><strong>Strong accuracy</strong>: no process is suspected before it crashes</li>
</ul>
</li>
<li><strong>Eventually perfect</strong>
<ul>
<li><strong>Strong completeness</strong></li>
<li><strong>Eventual strong accuracy</strong>: eventually, no correct process is ever suspected</li>
</ul>
</li>
</ul>
<p>An eventually perfect detector may make mistakes and may operate under a delay. But eventually, it will tell us the truth.</p>
<p>A failure detector can be implemented by the following algorithm:</p>
<ol>
<li>Processes periodically send heartbeat messages</li>
<li>A process sets a timeout based on worst case round trip of a message exchange</li>
<li>A process suspects another process has failed if it timeouts that process</li>
<li>A process that delivers a message from a suspected process revises its suspicion and doubles the time-out</li>
</ol>
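<p>The four steps above can be sketched as a small simulated failure detector (the time loop and process names are illustrative assumptions, not from the course):</p>

```python
class EventuallyPerfectFD:
    """Suspects a process when no heartbeat arrives within its timeout;
    revises the suspicion and doubles the timeout on a late heartbeat."""
    def __init__(self, processes, initial_timeout):
        self.timeout = {p: initial_timeout for p in processes}
        self.last_heartbeat = {p: 0 for p in processes}
        self.suspected = set()

    def on_heartbeat(self, p, now):
        if p in self.suspected:       # we suspected wrongly:
            self.suspected.remove(p)  # revise the suspicion and
            self.timeout[p] *= 2      # double the timeout
        self.last_heartbeat[p] = now

    def tick(self, now):
        for p, last in self.last_heartbeat.items():
            if now - last > self.timeout[p]:
                self.suspected.add(p)

fd = EventuallyPerfectFD(["p1", "p2"], initial_timeout=2)
fd.on_heartbeat("p1", now=1)
fd.tick(now=3)                   # p2 has sent nothing: it times out
assert fd.suspected == {"p2"}
fd.on_heartbeat("p2", now=4)     # late heartbeat: revise, double the timeout
assert fd.suspected == set()
assert fd.timeout["p2"] == 4
```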
<p>Failure detection algorithms are all designed under certain <strong>timing assumptions</strong>. The following timing assumptions are possible:</p>
<ul>
<li><strong>Synchronous</strong>
<ul>
<li><strong>Processing</strong>: the time it takes for a process to execute is bounded and known.</li>
<li><strong>Delays</strong>: there is a known upper bound limit on the time it takes for a message to be received</li>
<li><strong>Clocks</strong>: the drift between a local clock and the global, real-time clock is bounded and known</li>
</ul>
</li>
<li><strong>Eventually synchronous</strong>: the timing assumptions hold eventually</li>
<li><strong>Asynchronous</strong>: no assumptions</li>
</ul>
<p>These 3 possible assumption levels mean that the world is divided into 3 kinds of failure detection algorithms. The algorithm above is based on the eventually synchronous assumption (I think?).</p>
<details><summary><p>Not exam material</p>
</summary><div class="details-content">
<h2 id="mathematically-robust-distributed-systems">Mathematically robust distributed systems</h2>
<p>Some bugs in distributed systems can be very difficult to catch (it could involve long and costly simulation; with $n$ computers, it takes time $2^n$ to simulate all possible cases), and can be very costly when it happens.</p>
<p>The only way to be sure that there are no bugs is to <em>prove</em> it formally and mathematically.</p>
<h3 id="definition-of-the-distributed-system-graph">Definition of the distributed system graph</h3>
<p>Let $G(V, E)$ be a graph, where $V$ is the set of process nodes, and $E$ is the set of channel edges connecting the processes.</p>
<p>Two nodes $p$ and $q$ are <strong>neighbors</strong> if and only if there is an edge $\left\{ p, q \right\} \in E$.</p>
<p>Let $X \subseteq V$ be the set of <strong>crashed nodes</strong>. The other nodes are <strong>correct nodes</strong>.</p>
<p>We’ll define a <strong>path</strong> as a sequence of nodes $(p_1, p_2, \dots, p_n)$ such that $\forall i \in \left\{1, \dots, n-1\right\}$, $p_i$ and $p_{i+1}$ are neighbors.</p>
<p>Two nodes $p$ and $q$ are <strong>connected</strong> if there is a path $(p_1, p_2, \dots, p_n)$ such that $p_1 = p$ and $p_n = q$.</p>
<p>They are <strong>n-connected</strong> if there are $n$ disjoint paths connecting them; two paths $A = \left\{ p_1, \dots, p_n \right\}$ and $B = \left\{ q_1, \dots, q_m \right\}$ between $p$ and $q$ are disjoint if $A \cap B = \left\{ p, q \right\}$ (i.e. $p$ and $q$ are the only two nodes the paths have in common).</p>
<p>The graph is <strong>k-connected</strong> if, $\forall \left\{ p, q \right\} \subseteq V$ there are $k$ disjoint paths between $p$ and $q$.</p>
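<p>As a small executable check of this disjointness definition (the helper name is made up for illustration):</p>

```python
def disjoint(path_a, path_b):
    """Two paths between the same endpoints p and q are disjoint iff
    their only common nodes are p and q themselves."""
    p, q = path_a[0], path_a[-1]
    assert (path_b[0], path_b[-1]) == (p, q), "paths must share endpoints"
    return set(path_a) & set(path_b) == {p, q}
```

<p>For example, <code>p – a – q</code> and <code>p – b – q</code> are disjoint, while two paths sharing an intermediate node are not.</p>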
<h3 id="example-on-a-simple-algorithm">Example on a simple algorithm</h3>
<p>Each node $p$ holds a message $m_p$ and a set $p.R$. The goal is for two nodes $p$ and $q$ to have $(p, m_p) \in q.R$ and $(q, m_q) \in p.R$; that is, they want to exchange messages, to <em>communicate reliably</em>. The algorithm is as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="k">for</span> <span class="n">each</span> <span class="n">node</span> <span class="n">p</span><span class="p">:</span>
<span class="n">initially</span><span class="p">:</span>
<span class="n">send</span> <span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">m</span><span class="p">(</span><span class="n">p</span><span class="p">))</span> <span class="n">to</span> <span class="nb">all</span> <span class="n">neighbors</span>
<span class="n">upon</span> <span class="n">reception</span> <span class="n">of</span> <span class="n">of</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
<span class="n">add</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> <span class="n">to</span> <span class="n">p</span><span class="o">.</span><span class="n">R</span>
<span class="n">send</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> <span class="n">to</span> <span class="nb">all</span> <span class="n">neighbors</span></pre></td></tr></tbody></table></code></pre></figure>
<h4 id="reliable-communication">Reliable communication</h4>
<p>Now, let’s prove that if two nodes $p$ and $q$ are connected, then they communicate reliably. We’ll do this by induction; formally, we’d like to prove that the proposition $\mathcal{P}_k$, defined as “$p_k \text{ receives } (p, m_p)$”, is true for $k\in \left\{ 2, \dots, n \right\}$.</p>
<ul>
<li>
<p><strong>Base case</strong></p>
<p>According to the algorithm, $p=p_1$ initially sends $(p, m_p)$ to $p_2$. So $p_2$ receives $(p, m_p)$ from $p_1$, and $\mathcal{P}_2$ is true.</p>
</li>
<li>
<p><strong>Induction step</strong></p>
<p>Suppose that the induction hypothesis $\mathcal{P}_k$ is true for some $k \in \left\{2, \dots, n-1 \right\}$.</p>
<p>Then, according to the algorithm, $p_k$ sends $(p, m_p)$ to $p_{k+1}$, meaning that $p_{k+1}$ receives $(p, m_p)$ from $p_k$, which means that $\mathcal{P}_{k+1}$ is true.</p>
</li>
</ul>
<p>Thus $\mathcal{P}_k$ is true for all $k \in \left\{ 2, \dots, n \right\}$; in particular, $p_n = q$ receives $(p, m_p)$. By the same argument, $p$ receives $(q, m_q)$, so $p$ and $q$ communicate reliably.</p>
<h3 id="robustness-property">Robustness property</h3>
<p>If at most $k$ nodes are crashed, and the graph is $(k+1)$-connected, then all correct nodes <strong>communicate reliably</strong>.</p>
<p>We prove this by contradiction. Suppose the opposite: the graph is $(k+1)$-connected, at most $k$ nodes are crashed, and yet there are two correct nodes $p$ and $q$ that do <em>not</em> communicate reliably. Deriving a contradiction from this will prove the property.</p>
<p>As the graph is $(k+1)$-connected, there exist $k+1$ disjoint paths $(P_1, P_2, \dots, P_{k+1})$ connecting $p$ and $q$. If $p$ and $q$ do not communicate reliably, every one of these paths must be “cut” by at least one crashed node. As the paths are disjoint, this requires at least $k+1$ crashed nodes.</p>
<p>This is a contradiction: we assumed that at most $k$ nodes were crashed, but concluded that at least $k+1$ nodes were crashed. This disproves the assumption, and proves that all correct nodes communicate reliably.</p>
<h3 id="random-failures">Random failures</h3>
<p>Let’s assume that $p$ and $q$ are connected by a single path, separated only by one intermediate node. If each node has a probability $f$ of crashing, then the probability of communicating reliably is $1-f$.</p>
<p>Now, suppose that the path contains $n$ intermediate nodes; the probability of communicating reliably is the probability that none of these nodes crash. Individually, that is $1-f$, so for the whole chain, the probability is $(1-f)^n$.</p>
<p>However, if we have $n$ paths with one intermediate node each (that is, instead of setting them up serially like previously, we set them up in parallel), the probability of <em>not</em> communicating reliably is that of all intermediary nodes crashing, which is $f^n$; thus, the probability of actually communicating reliably is $1-f^n$.</p>
<p>If our nodes are connected by $n$ disjoint paths with $m$ intermediate nodes each, the probability of not communicating reliably is that of all lines being cut. The probability of a single line being cut is $1 - (1 - f)^m$, so the probability of all $n$ lines being cut is $(1 - (1 - f)^m)^n$; the probability of communicating reliably is therefore $1 - (1 - (1 - f)^m)^n$.</p>
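<p>We can sanity-check this closed-form expression against a Monte Carlo simulation (a sketch, under the stated model: $n$ disjoint paths with $m$ intermediate nodes each, every node crashing independently with probability $f$):</p>

```python
import random

def reliability(f, m, n):
    # closed form: 1 - P(all n lines cut), a line being cut with prob. 1 - (1-f)^m
    return 1 - (1 - (1 - f) ** m) ** n

def simulate(f, m, n, trials=100_000, seed=42):
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        # a path survives iff none of its m intermediate nodes crashed
        if any(all(rng.random() > f for _ in range(m)) for _ in range(n)):
            ok += 1
    return ok / trials
```

<p>With $f=0.1$, $m=2$, $n=3$, the closed form gives $1 - 0.19^3 \approx 0.9931$, and the simulation agrees to a few decimal places.</p>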
<h3 id="example-proof">Example proof</h3>
<p>Assume an infinite 2D grid of nodes. Nodes $p$ and $q$ are connected, with the distance in the shortest path being $D$. What is the probability of communicating reliably when this distance tends to infinity?</p>
<script type="math/tex; mode=display">\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\lim_{D \rightarrow \infty} = \dots</script>
<p>First, let’s define a sequence of grids $G_k$. $G_0$ is a single node, $G_{k+1}$ is built from 9 grids $G_k$.</p>
<p>$G_{k+1}$ is <strong>correct</strong> if at least 8 of its 9 grids are correct.</p>
<p>We’ll introduce the concept of a “meta-correct” node; this is not really anything official, just something we’re making up for the purpose of this proof. Consider a grid $G_n$. A node $p$ is “meta-correct” if:</p>
<ul>
<li>It is in a correct grid $G_n$, and</li>
<li>It is in a correct grid $G_{n-1}$, and</li>
<li>It is in a correct grid $G_{n-2}$, …</li>
</ul>
<p>For the sake of this proof, let’s just admit that all meta-correct nodes are connected; if you take two nodes $p$ and $q$ that are both meta-correct, there will be a path of nodes connecting them.</p>
<h4 id="step-1">Step 1</h4>
<p>If $x$ is the probability that $G_k$ is correct, what is the probability $P(x)$ that $G_{k+1}$ is correct?</p>
<p>$G_{k+1}$ is built up of 9 subgrids $G_k$. Let $P_i$ be the probability of exactly $i$ subgrids being incorrect; the probability of $G_{k+1}$ being correct is the probability that at most one subgrid is incorrect.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P_0 & = x^9 \\
P_1 & = 9(1-x)x^8 \\
P(x) & = P_0 + P_1 = x^9 + 9(1-x)x^8 \\
\end{align} %]]></script>
<h4 id="step-2">Step 2</h4>
<p>Let $\alpha = 0.9$, and $z(x) = 1 + \alpha (x-1)$.</p>
<p>We will admit the following: if $x \in [0.99, 1]$ then $z(x) \le P(x)$.</p>
<p>Let $P_k$ be the result of applying $P$ (as defined in step 1) to $1-f$, $k$ times: $P_k = P(P(P(\dots P(1-f))))$. We will prove that $P_k \ge 1 - \alpha^k, \forall k \ge 0$, by induction:</p>
<ul>
<li><strong>Base case</strong>: $P_0 = 1-f = 0.99$ and $1-\alpha^0 = 1-1 = 0$, so $P_0 \ge 1-\alpha^0$.</li>
<li>
<p><strong>Induction step</strong>:</p>
<p>Let’s suppose that $P_k \ge 1-\alpha^k$. We want to prove this for $k+1$, namely $P_{k+1} \ge 1 - \alpha^{k+1}$.</p>
<script type="math/tex; mode=display">P_{k+1} \ge P(P_k) \ge z(P_k) \ge z(1 - \alpha^k) \\
P_{k+1} \ge 1 + \alpha(1 - \alpha^k - 1) \\
P_{k+1} \ge 1 - \alpha^{k+1}</script>
</li>
</ul>
<p>This proves the result that $\forall k, P_k \ge 1 - \alpha^k$.</p>
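<p>This can also be checked numerically for $f = 0.01$ and $\alpha = 0.9$ (a sanity check of the claim, not a proof):</p>

```python
def P(x):
    # probability that G_{k+1} is correct, given each subgrid is correct w.p. x
    return x ** 9 + 9 * (1 - x) * x ** 8

alpha, f = 0.9, 0.01
p_k = 1 - f                       # P_0
for k in range(1, 20):
    p_k = P(p_k)                  # P_k = P applied k times to 1 - f
    assert p_k >= 1 - alpha ** k  # the bound proved above
```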
<h4 id="step-3">Step 3</h4>
<p>Todo.</p>
</div></details>
<h2 id="reliable-broadcast">Reliable broadcast</h2>
<p>Broadcast is useful for applications with pubsub-like mechanisms, where the subscribers might need some reliability guarantees from the publisher (sometimes called quality of service, QoS).</p>
<h3 id="best-effort-broadcast">Best-effort broadcast</h3>
<p>Best-effort broadcast (beb) has the following properties:</p>
<ul>
<li><strong>BEB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>BEB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>BEB3 No creation</strong>: no message is delivered unless it was broadcast</li>
</ul>
<p>The broadcasting machine may still crash in the middle of a broadcast, where it hasn’t broadcast the message to everyone yet. It offers no guarantee against that.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">PerfectLinks</span> <span class="p">(</span><span class="n">pp2p</span><span class="p">)</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">forall</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span><span class="p">,</span> <span class="n">the</span> <span class="nb">set</span> <span class="n">of</span> <span class="nb">all</span> <span class="n">nodes</span> <span class="ow">in</span> <span class="n">the</span> <span class="n">system</span><span class="p">,</span> <span class="n">do</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">pp2pSend</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">pp2pDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is not the most efficient algorithm, but we’re not concerned about that. We just care about whether it’s correct, which we’ll sketch out a proof for:</p>
<ul>
<li><strong>Validity</strong>: By the validity property of perfect links and the very facts that:
<ul>
<li>the sender sends the message to all</li>
<li>every correct process that <code class="highlighter-rouge">pp2pDelivers</code> delivers a message to, <code class="highlighter-rouge">bebDelivers</code> it too</li>
</ul>
</li>
<li><strong>No duplication</strong>: by the no duplication property of perfect links</li>
<li><strong>No creation</strong>: by the no creation property of perfect links</li>
</ul>
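<p>The beb algorithm can be exercised in a toy, single-threaded simulation, with perfect links modeled as an in-memory FIFO queue (all names here are made up for illustration):</p>

```python
from collections import deque

class Network:
    """Perfect point-to-point links: no loss, no duplication, no creation."""
    def __init__(self):
        self.processes = []
        self.queue = deque()

    def pp2p_send(self, src, dst, m):
        self.queue.append((src, dst, m))

    def run(self):
        # deliver every in-flight message
        by_name = {p.name: p for p in self.processes}
        while self.queue:
            src, dst, m = self.queue.popleft()
            by_name[dst].beb_deliver(src, m)

class BebProcess:
    def __init__(self, name, network):
        self.name = name
        self.network = network
        self.delivered = []

    def beb_broadcast(self, m):
        # pp2pSend to every process in the system, including ourselves
        for p in self.network.processes:
            self.network.pp2p_send(self.name, p.name, m)

    def beb_deliver(self, sender, m):
        self.delivered.append((sender, m))
```

<p>If the broadcaster runs to completion, every process delivers the message exactly once, which is all that best-effort broadcast promises.</p>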
<h3 id="reliable-broadcast-1">Reliable broadcast</h3>
<p>Reliable broadcast has the following properties:</p>
<ul>
<li><strong>RB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>RB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>RB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>RB4 Agreement</strong>: for any message $m$, if a <strong>correct</strong> process delivers $m$, then every correct process delivers $m$</li>
</ul>
<p>Notice that RB has the same properties as best-effort, but also adds a guarantee RB4: even if the broadcaster crashes in the middle of a broadcast and is unable to send to other processes, we’ll honor the agreement property. This is done by distinguishing receiving and delivering; the broadcaster may not have sent to everyone, but in that case, reliable broadcast makes sure that no one delivers.</p>
<p>Note that a process may still deliver and crash before others deliver; it is then incorrect, and we have no guarantees that the message will be delivered to others.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span>
<span class="n">BestEfforBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
<span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span>
<span class="n">forall</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span> <span class="n">do</span><span class="p">:</span>
<span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span> <span class="c1"># application tells us to broadcast
</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="c1"># deliver to itself
</span> <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="c1"># broadcast to others using beb
</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="k">if</span> <span class="n">pi</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">correct</span><span class="p">:</span> <span class="c1"># echo if sender not in correct
</span> <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="k">else</span><span class="p">:</span>
<span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="n">forall</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span> <span class="ow">in</span> <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="n">do</span><span class="p">:</span> <span class="c1"># echo all previous messages from crashed pi
</span> <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>The idea is to echo all messages from a node that has crashed. By the time we get the crash notification from the oracle, we may already have received messages from the crashed node without knowing it had crashed, since the crash notification only arrives eventually. To solve this, upon detecting the crash we also re-broadcast all the old messages we received from it.</p>
<p>We’ll sketch a proof for the properties:</p>
<ul>
<li><strong>Validity</strong>: as above</li>
<li><strong>No duplication</strong>: as above</li>
<li><strong>No creation</strong>: as above</li>
<li><strong>Agreement</strong>: Assume some correct process $p_i$ <code class="highlighter-rouge">rbDelivers</code> a message $m$ that was broadcast through <code class="highlighter-rouge">rbBroadcast</code> by some process $p_k$. If $p_k$ is correct, then by the validity property of best-effort broadcast, all correct processes will get the message through <code class="highlighter-rouge">bebDeliver</code>, and then deliver $m$ through <code class="highlighter-rouge">rbDeliver</code>. If $p_k$ crashes, then by the completeness property of the failure detector $P$, $p_i$ detects the crash and broadcasts $m$ with <code class="highlighter-rouge">bebBroadcast</code> to all. Since $p_i$ is correct, then by the validity property of best-effort broadcast, all correct processes <code class="highlighter-rouge">bebDeliver</code> and then <code class="highlighter-rouge">rbDeliver</code> $m$.</li>
</ul>
<p>Note that the proof only uses the completeness property of the failure detector, not the accuracy. Therefore, the failure detector can be either perfect or eventually perfect.</p>
<h3 id="uniform-reliable-broadcast">Uniform reliable broadcast</h3>
<p>Uniform broadcast satisfies the following properties:</p>
<ul>
<li><strong>URB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>URB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>URB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>URB4 Uniform agreement</strong>: for any message $m$, if a process delivers $m$, then every correct process delivers $m$</li>
</ul>
<p>We’ve removed the word “correct” in the agreement property, and this changes everything. This is the strongest guarantee of the three: if <em>any</em> process delivers a message, even one that crashes right afterwards, then every correct process must deliver it too.</p>
<p>The algorithm is given by:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">uniformBroadcast</span> <span class="p">(</span><span class="n">urb</span><span class="p">)</span><span class="o">.</span>
<span class="n">Uses</span><span class="p">:</span>
<span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span><span class="o">.</span>
<span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span><span class="o">.</span>
<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span> <span class="c1"># set of correct nodes, initiated to all nodes
</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># set of delivered and already forwarded messages
</span> <span class="n">ack</span><span class="p">[</span><span class="n">Message</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># set of nodes that have acknowledged Message
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="c1"># before broadcasting, save message in forward
</span><span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">urbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="n">U</span> <span class="p">{[</span><span class="bp">self</span><span class="p">,</span><span class="n">m</span><span class="p">]}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="bp">self</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="c1"># if I haven't sent the message, echo it
# if I've already sent it, don't do it again
</span><span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
<span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="k">if</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">forward</span><span class="p">:</span>
<span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]};</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="c1"># deliver the message when we know that all correct processes have delivered
# (and if we haven't delivered already)
</span><span class="n">upon</span> <span class="n">event</span> <span class="p">(</span><span class="k">for</span> <span class="nb">any</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span> <span class="ow">in</span> <span class="n">forward</span><span class="p">)</span> <span class="n">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="ow">and</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">urbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="k">def</span> <span class="nf">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="k">return</span> <span class="n">correct</span> <span class="err">⊆</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>To prove the correctness, we must first have a simple lemma: if a correct process $p_i$ <code class="highlighter-rouge">bebDeliver</code>s a message $m$, then $p_i$ eventually <code class="highlighter-rouge">urbDeliver</code>s the message $m$.</p>
<p>This can be proven as follows: any process that <code class="highlighter-rouge">bebDeliver</code>s $m$ <code class="highlighter-rouge">bebBroadcast</code>s $m$. By the completeness property of the failure detector $P$, and the validity property of best-effort broadcasting, there is a time at which $p_i$ <code class="highlighter-rouge">bebDeliver</code>s $m$ from every correct process and hence <code class="highlighter-rouge">urbDeliver</code>s it.</p>
<p>The proof is then:</p>
<ul>
<li><strong>Validity</strong>: If a correct process $p_i$ <code class="highlighter-rouge">urbBroadcast</code>s a message $m$, then $p_i$ eventually <code class="highlighter-rouge">bebBroadcast</code>s and <code class="highlighter-rouge">bebDeliver</code>s $m$. By our lemma, $p_i$ <code class="highlighter-rouge">urbDeliver</code>s it.</li>
<li><strong>No duplication</strong>: as best-effort</li>
<li><strong>No creation</strong>: as best-effort</li>
<li><strong>Uniform agreement</strong>: Assume some process $p_i$ <code class="highlighter-rouge">urbDeliver</code>s a message $m$. By the algorithm and the completeness <em>and</em> accuracy properties of the failure detector, every correct process <code class="highlighter-rouge">bebDeliver</code>s $m$. By our lemma, every correct process will <code class="highlighter-rouge">urbDeliver</code> $m$.</li>
</ul>
<p>Unlike previous algorithms, this relies on perfect failure detection. But under the assumption that the majority of processes stay correct, we can do with an eventually perfect failure detector. To do so, we remove the crash event above, and replace the <code class="highlighter-rouge">can_deliver</code> method with the following:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
<span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">])</span> <span class="o">></span> <span class="n">N</span><span class="o">/</span><span class="mi">2</span></pre></td></tr></tbody></table></code></pre></figure>
<h2 id="causal-order-broadcast">Causal order broadcast</h2>
<h3 id="motivation">Motivation</h3>
<p>So far, we didn’t consider ordering among messages. In particular, we considered messages to be independent. Two messages from the same process might not be delivered in the order they were broadcast.</p>
<h3 id="causality">Causality</h3>
<p>The above means that <strong>causality</strong> can be broken: a message $m_1$ that causes $m_2$ might be delivered by some process after $m_2$.</p>
<p>Let $m_1$ and $m_2$ be any two messages. $m_1\longrightarrow m_2$ ($m_1$ <strong>causally precedes</strong> $m_2$) if and only if at least one of the following holds:</p>
<ul>
<li><strong>C1 (FIFO Order)</strong>: Some process $p_i$ broadcasts $m_1$ before broadcasting $m_2$</li>
<li><strong>C2 (Causal Order)</strong>: Some process $p_i$ delivers $m_1$ and then broadcasts $m_2$</li>
<li><strong>C3 (Transitivity)</strong>: There is a message $m_3$ such that $m_1 \longrightarrow m_3$ and $m_3 \longrightarrow m_2$.</li>
</ul>
<p>The <strong>causal order property (CO)</strong> is given by the following: if any process $p_i$ delivers a message $m_2$, then $p_i$ must have delivered every message $m_1$ such that $m_1 \longrightarrow m_2$.</p>
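<p>The CO property can be checked mechanically on a delivery log. Below is a small Python sketch (my own illustration, not part of the course material): <code>satisfies_causal_order</code> takes one process’s delivery log and a set of known causal pairs $(m_1, m_2)$, and verifies that every causally preceding message was delivered first.</p>

```python
def satisfies_causal_order(delivery_log, causal_pairs):
    """Return True iff for every delivered m2, every m1 with m1 -> m2
    was delivered before it."""
    position = {m: i for i, m in enumerate(delivery_log)}
    for m1, m2 in causal_pairs:
        if m2 in position:  # m2 was delivered...
            if m1 not in position or position[m1] > position[m2]:
                return False  # ...but m1 was not delivered first
    return True

# With m1 -> m2, delivering m2 before m1 violates CO:
assert satisfies_causal_order(["m1", "m2"], [("m1", "m2")])
assert not satisfies_causal_order(["m2", "m1"], [("m1", "m2")])
```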
<h3 id="algorithm">Algorithm</h3>
<p>We get reliable causal broadcast by using reliable broadcast, and uniform causal broadcast by using uniform reliable broadcast.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableCausalOrderBroadcast</span> <span class="p">(</span><span class="n">rco</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rcoBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">past</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pastm</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
<span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
<span class="k">for</span> <span class="p">[</span><span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="p">]</span> <span class="ow">in</span> <span class="n">pastm</span><span class="p">:</span>
<span class="k">if</span> <span class="n">n</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="o">></span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">n</span><span class="p">}</span>
<span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="p">]}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
<span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This algorithm ensures causal reliable broadcast. The idea is to re-broadcast all past messages every time, making sure we don’t deliver twice. This is obviously not efficient, but it works in theory.</p>
<p>To improve this, we can implement a form of garbage collection. We can delete the <code class="highlighter-rouge">past</code> only when all others have delivered. To do this, we need a perfect failure detector.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
</pre></td><td class="code"><pre><span class="n">Implements</span> <span class="n">GarbageCollection</span> <span class="o">+</span> <span class="n">previous</span> <span class="n">algorithm</span>
<span class="n">Uses</span><span class="p">:</span>
<span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span> <span class="c1"># set of all nodes
</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># forall m
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span><span class="p">:</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="n">upon</span> <span class="k">for</span> <span class="n">some</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">,</span> <span class="bp">self</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]:</span>
<span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="bp">self</span><span class="p">}</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">ACK</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="p">[</span><span class="n">ACK</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
<span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="k">if</span> <span class="n">correct</span><span class="o">.</span><span class="n">forall</span><span class="p">(</span><span class="k">lambda</span> <span class="n">pj</span><span class="p">:</span> <span class="n">pj</span> <span class="ow">in</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]):</span> <span class="c1"># if all correct in ack
</span> <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> \ <span class="p">{[</span><span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span> <span class="c1"># remove message from past</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We need the perfect failure detector’s strong accuracy property to prove the causal order property. We don’t need the failure detector’s completeness property; if we don’t know that a process is crashed, it has no impact on correctness, only on performance, since it just means that we won’t delete the past.</p>
<p>Another algorithm is given below. It uses a <a href="https://en.wikipedia.org/wiki/Vector_clock">“vector clock” VC</a> as an alternative, more efficient encoding of the past. A VC is updated under the following rules:</p>
<ul>
<li>Initially, all clocks are at zero.</li>
<li>Each time a process sends a message, it increments its own entry in the vector by one and then sends a copy of its vector along with the message.</li>
<li>Each time a process receives a message, it increments its own entry in the vector by one and updates every element of its vector to the maximum of its own value and the corresponding value in the received vector.</li>
</ul>
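<p>These rules can be written down directly. The following is a small Python sketch (class and method names are my own, not the course’s):</p>

```python
class VectorClock:
    def __init__(self, processes):
        self.clock = {p: 0 for p in processes}  # initially all zero

    def send(self, me):
        # increment own entry, then ship a copy along with the message
        self.clock[me] += 1
        return dict(self.clock)

    def receive(self, me, received):
        # increment own entry, then take the element-wise maximum
        self.clock[me] += 1
        for p, t in received.items():
            self.clock[p] = max(self.clock[p], t)

a = VectorClock(["p1", "p2"])
b = VectorClock(["p1", "p2"])
msg = a.send("p1")    # a's clock is now {p1: 1, p2: 0}
b.receive("p2", msg)  # b's clock is now {p1: 1, p2: 1}
```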
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableCausalOrderBroadcast</span> <span class="p">(</span><span class="n">rco</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
<span class="k">for</span> <span class="nb">all</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span><span class="p">:</span>
<span class="n">VC</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="mi">0</span>
<span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">upon</span> <span class="n">event</span><span class="o"><</span><span class="n">rcoBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VC</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>
<span class="n">VC</span><span class="p">[</span><span class="bp">self</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">VC</span><span class="p">[</span><span class="bp">self</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1"># we have seen the message, so increment VC
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
<span class="k">if</span> <span class="n">pj</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">:</span>
<span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="n">pending</span> <span class="n">U</span> <span class="p">(</span><span class="n">pj</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span>
<span class="n">deliver</span><span class="o">-</span><span class="n">pending</span><span class="p">()</span>
<span class="k">def</span> <span class="nf">deliver</span><span class="o">-</span><span class="n">pending</span><span class="p">():</span>
<span class="k">while</span> <span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span> <span class="ow">in</span> <span class="n">pending</span><span class="p">:</span>
<span class="n">forall</span> <span class="n">pk</span> <span class="n">such</span> <span class="n">that</span> <span class="p">(</span><span class="n">VC</span><span class="p">[</span><span class="n">pk</span><span class="p">]</span> <span class="o"><=</span> <span class="n">VCm</span><span class="p">[</span><span class="n">pk</span><span class="p">]):</span>
<span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="n">pending</span> <span class="n">U</span> <span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">VC</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">VC</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span></pre></td></tr></tbody></table></code></pre></figure>
<h2 id="total-order-broadcast">Total order broadcast</h2>
<p>In <a href="#reliable-broadcast">reliable broadcast</a>, the processes are free to deliver messages in any order they wish. In <a href="#causal-broadcast">causal broadcast</a>, the processes must deliver in causal order. But causal order is only a partial order: two causally unrelated messages may be delivered in different orders by different processes.</p>
<p>In <strong>total order</strong> broadcast, the processes must deliver all messages according to the same order. Note that this is orthogonal to causality, or even FIFO ordering. It can be <em>made</em> to respect causal or FIFO ordering, but at its core, it is only concerned with all processes delivering in the same order.</p>
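<p>The property can be phrased as a check on delivery logs: restricted to the messages that both processes delivered, the two logs must agree on the relative order. A small Python sketch of that check (my own illustration):</p>

```python
def same_total_order(log_a, log_b):
    """True iff the two delivery logs agree on the relative order of
    every message they both delivered."""
    common = set(log_a) & set(log_b)
    return [m for m in log_a if m in common] == [m for m in log_b if m in common]

assert same_total_order(["m1", "m2", "m3"], ["m1", "m2"])  # consistent
assert not same_total_order(["m1", "m2"], ["m2", "m1"])    # order conflict
```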
<p>One application of total order broadcast is Bitcoin: for the blockchain, we want to make sure that everybody gets the messages in the same order, for consistency.</p>
<p>The properties are:</p>
<ul>
<li><strong>RB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>RB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>RB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>RB4 Agreement</strong>: for any message $m$, if a <strong>correct</strong> process delivers $m$, then every correct process delivers $m$</li>
<li><strong>TO1 (Uniform) Total Order</strong>: Let $m$ and $m’$ be any two messages. Let $p_i$ be any (correct) process that delivers $m$ without having delivered $m’$ before. Then no (correct) process delivers $m’$ before $m$</li>
</ul>
<p>The algorithm can be implemented as:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">TotalOrder</span> <span class="p">(</span><span class="n">to</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span>
<span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">Consensus</span> <span class="p">(</span><span class="n">cons</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">init</span><span class="o">></span><span class="p">:</span>
<span class="n">unordered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># two sets
</span> <span class="n">wait</span> <span class="p">:</span><span class="o">=</span> <span class="bp">False</span>
<span class="n">sn</span> <span class="p">:</span><span class="o">=</span> <span class="mi">1</span> <span class="c1"># sequence number
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">toBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="ow">and</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
<span class="n">unordered</span><span class="o">.</span><span class="n">add</span><span class="p">((</span><span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>
<span class="n">upon</span> <span class="n">unordered</span> <span class="ow">not</span> <span class="n">empty</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">wait</span><span class="p">:</span>
<span class="n">wait</span> <span class="p">:</span><span class="o">=</span> <span class="bp">True</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">propose</span><span class="p">,</span> <span class="n">unordered</span><span class="o">></span> <span class="k">with</span> <span class="n">sn</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">decide</span><span class="p">,</span> <span class="n">decided</span><span class="o">></span> <span class="k">with</span> <span class="n">sn</span><span class="p">:</span>
<span class="n">unordered</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="n">decided</span><span class="p">)</span>
<span class="n">ordered</span> <span class="o">=</span> <span class="n">sort</span><span class="p">(</span><span class="n">decided</span><span class="p">)</span>
<span class="k">for</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">ordered</span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">toDeliver</span><span class="p">,</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
<span class="n">delivered</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
<span class="n">sn</span> <span class="o">+=</span> <span class="mi">1</span>
<span class="n">wait</span> <span class="o">=</span> <span class="bp">False</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Our total order broadcast is based on consensus, which we describe below.</p>
<h2 id="consensus">Consensus</h2>
<p>In the (uniform) consensus problem, the processes all propose values, and need to agree on one of these. This gives rise to two basic events: a proposition, and a decision. Solving consensus is key to solving many problems in distributed computing (total order broadcast, atomic commit, …).</p>
<p>The properties that we would like to see are:</p>
<ul>
<li><strong>C1 Validity</strong>: if a value is decided, it has been proposed</li>
<li><strong>C2 (Uniform) Agreement</strong>: no two correct (any) processes decide differently</li>
<li><strong>C3 Termination</strong>: every correct process eventually decides</li>
<li><strong>C4 Integrity</strong>: Every process decides at most once</li>
</ul>
<p>If C2 is Uniform Agreement, then we talk about uniform consensus.</p>
<p>Todo: write about consensus and fairness, does it violate validity?</p>
<p>We can build consensus using total order broadcast, which is described above. Conversely, total order broadcast can be built from consensus, as we did above. It turns out that <strong>consensus and total order broadcast are equivalent problems in a system with reliable channels</strong>.</p>
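<p>To see the first direction concretely, here is a minimal Python sketch (my own illustration, with an in-memory list standing in for a real total order broadcast): to propose, a process <em>to</em>-broadcasts its value, and it decides on the first value it delivers. Since every process delivers in the same order, all processes decide the same value.</p>

```python
global_order = []  # the single order a real TO broadcast would enforce

def to_broadcast(value):
    global_order.append(value)

class Process:
    def __init__(self):
        self.decision = None

    def propose(self, value):
        to_broadcast(value)

    def to_deliver(self, value):
        if self.decision is None:  # decide on the first delivery only
            self.decision = value

p1, p2 = Process(), Process()
p1.propose("a")
p2.propose("b")
for v in global_order:  # both processes deliver in the same order
    p1.to_deliver(v)
    p2.to_deliver(v)
assert p1.decision == p2.decision == "a"
```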
<p>Blockchain is based on consensus. Bitcoin mining is actually about solving consensus: a leader is chosen to decide on the broadcast order, and this leader gains the block reward (originally 50 bitcoins, halving periodically). Seeing that this is a lot of money, many people want to be the leader; but we only want a single leader. Nakamoto’s solution is to choose the leader by handing out a hard problem. It can only be solved by brute force; there are no known shortcuts. So people put <a href="https://digiconomist.net/bitcoin-energy-consumption">enormous amounts of energy</a> towards solving it. Usually, only a single miner wins a given block; ties are unlikely, but the <a href="https://bitcoin.org/bitcoin.pdf">original Bitcoin paper</a> specifies that we should wait a little before rewarding the winner, in case there are two winners.</p>
<h3 id="consensus-algorithm">Consensus algorithm</h3>
<p>Suppose that there are $n$ processes. At the beginning, every process proposes a value; to decide, the processes go through $n$ rounds incrementally. At each round, the process with the id corresponding to the round number is the leader of the round. Note that the rounds are not global time; we may make them so in examples for the sake of simplicity, but rounds are simply a local thing, which are somewhat synchronized by message passing from the leader.</p>
<p>The leader decides its current proposal and broadcasts it to all. A process that is not the leader waits. It can either deliver the proposal of the leader to adopt it, or suspect the leader. In any case, we can move on to the next round at that moment. Note that processes don’t need to move on at the same time, they can do so at different moments.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="n">todo</span></pre></td></tr></tbody></table></code></pre></figure>
<p>correctness argument todo</p>
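<p>The round structure described above can be sketched as a synchronous, single-threaded simulation (my own illustration, not the course’s pseudocode; crashes are modelled as a fixed set of processes that never act):</p>

```python
def hierarchical_consensus(proposals, crashed):
    """proposals: initial value of each process; crashed: set of ids
    that crashed before the run (a simplification)."""
    n = len(proposals)
    current = list(proposals)  # each process's current proposal
    decisions = [None] * n
    for leader in range(n):    # round i is led by process i
        if leader in crashed:
            continue           # correct processes suspect it and move on
        v = current[leader]    # the leader decides its current proposal...
        decisions[leader] = v
        for p in range(n):     # ...and broadcasts it; correct processes adopt it
            if p not in crashed:
                current[p] = v
    return decisions

# With process 0 crashed, everyone decides process 1's proposal:
assert hierarchical_consensus(["a", "b", "c"], {0}) == [None, "b", "b"]
```

<p>Intuitively, every correct process ends up deciding the proposal of the first correct leader, since all later leaders re-broadcast that adopted value.</p>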
<h3 id="uniform-consensus-algorithm">Uniform consensus algorithm</h3>
<p>The idea here is to do the same thing, but instead of each process deciding at the beginning of its own round, every process waits until round $n$ to decide.</p>
<p>not taking notes today, don’t feel like it.</p>
<h3 id="uniform-consensus-algorithm-with-eventually-perfect-failure-detector">Uniform consensus algorithm with eventually perfect failure detector</h3>
<p>This assumes a correct majority, and an eventually perfect failure detector.</p>
<p>When you suspect a process, you send it a message. When a new leader takes over, it asks what the previous value was, and at least one process of the correct majority will respond.</p>
<h2 id="atomic-commit">Atomic commit</h2>
<p>The unit of data processing in a distributed system is the <em>transaction</em>. A transaction describes the actions to be taken, and can be terminated either by <strong>committing</strong> or <strong>aborting</strong>.</p>
<h3 id="non-blocking-atomic-commit-nbac">Non-Blocking Atomic Commit (NBAC)</h3>
<p>The <strong>nonblocking atomic commit (NBAC)</strong> abstraction is used to agree on the outcome of a transaction in a reliable way. As in consensus, every process proposes an initial value of 0 or 1 (no or yes), and must decide on a final value of 0 or 1 (abort or commit). Unlike consensus, the processes here seek to decide 1, but every process has a veto right.</p>
<p>The properties of NBAC are:</p>
<ul>
<li><strong>NBAC1. Agreement</strong>: no two processes decide differently</li>
<li><strong>NBAC2. Termination</strong>: every correct process eventually decides</li>
<li><strong>NBAC3. Commit-validity</strong>: 1 can only be decided if all processes propose 1</li>
<li><strong>NBAC4. Abort-validity</strong>: 0 can only be decided if some process crashes or votes 0</li>
</ul>
<p>Note that here, NBAC must decide to abort if some process crashes, even though all processes have proposed 1 (commit).</p>
<p>We can implement NBAC using three underlying abstractions:</p>
<ul>
<li>A perfect failure detector P</li>
<li>Uniform consensus</li>
<li>Best-effort broadcast BEB</li>
</ul>
<p>It works as follows: every process $p$ broadcasts its initial vote (0 or 1, abort or commit) to all other processes using BEB. It waits to hear something from every process $q$ in the system; this is either done through <em>beb</em>-delivery from $q$, or by detecting the crash of $q$. At this point, two situations are possible:</p>
<ul>
<li>If $p$ gets 0 (abort) from any other process, or if it detects a crash, it invokes consensus with a proposal to abort (0).</li>
<li>Otherwise, if it receives the vote to commit (1) from all processes, then it invokes consensus with a proposal to commit (1).</li>
</ul>
<p>Once the consensus is over, every process nbac decides according to the outcome of the consensus.</p>
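<p>The local vote-combination step can be sketched as follows (my own illustration; the function name is made up): multiplying 0/1 votes implements a logical AND, and a detected crash forces the proposal to 0.</p>

```python
def nbac_proposal(votes, any_crash_detected):
    """votes: the 0/1 votes beb-delivered from the processes we heard from."""
    prop = 1
    for v in votes:
        prop *= v            # a single 0 vote drives prop to 0
    if any_crash_detected:
        prop = 0             # a detected crash also forces abort
    return prop              # this value is proposed to uniform consensus

assert nbac_proposal([1, 1, 1], any_crash_detected=False) == 1
assert nbac_proposal([1, 0, 1], any_crash_detected=False) == 0
```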
<p>We can write this more formally:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
</pre></td><td class="code"><pre><span class="n">Events</span><span class="p">:</span>
<span class="n">Request</span><span class="p">:</span> <span class="o"><</span><span class="n">Propose</span><span class="p">,</span> <span class="n">v1</span><span class="o">></span>
<span class="n">Indication</span><span class="p">:</span> <span class="o"><</span><span class="n">Decide</span><span class="p">,</span> <span class="n">v2</span><span class="o">></span>
<span class="n">Properties</span><span class="p">:</span>
<span class="n">NBAC1</span><span class="p">,</span> <span class="n">NBAC2</span><span class="p">,</span> <span class="n">NBAC3</span><span class="p">,</span> <span class="n">NBAC4</span>
<span class="n">Implements</span><span class="p">:</span> <span class="n">nonBlockingAtomicCommit</span> <span class="p">(</span><span class="n">nbac</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span>
<span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
<span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
<span class="n">UniformConsensus</span> <span class="p">(</span><span class="n">uc</span><span class="p">)</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
<span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="mi">1</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">all_processes</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span><span class="p">:</span>
<span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Propose</span><span class="p">,</span> <span class="n">v</span><span class="o">></span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">v</span><span class="o">></span><span class="p">:</span>
<span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
<span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="n">prop</span> <span class="o">*</span> <span class="n">v</span>
<span class="n">upon</span> <span class="n">event</span> <span class="n">correct</span> \ <span class="n">delivered</span> <span class="o">=</span> <span class="err">Ø</span><span class="p">:</span>
<span class="k">if</span> <span class="n">correct</span> <span class="o">!=</span> <span class="n">all_processes</span><span class="p">:</span>
<span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="mi">0</span>
<span class="n">trigger</span> <span class="o"><</span><span class="n">ucPropose</span><span class="p">,</span> <span class="n">prop</span><span class="o">></span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">ucDecide</span><span class="p">,</span> <span class="n">decision</span><span class="o">></span><span class="p">:</span>
<span class="n">trigger</span> <span class="o"><</span><span class=&qu