<h1>CS-525 Foundations and tools for tree-structured data</h1>
<p>Maxime Kjaer, 2018-09-18. <a href="https://kjaer.io/tree-structured-data">https://kjaer.io/tree-structured-data</a></p>
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#xpath" id="markdown-toc-xpath">XPath</a> <ul>
<li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
</ul>
</li>
<li><a href="#xml-schemas" id="markdown-toc-xml-schemas">XML Schemas</a> <ul>
<li><a href="#dtd" id="markdown-toc-dtd">DTD</a></li>
<li><a href="#xml-schema" id="markdown-toc-xml-schema">XML Schema</a> <ul>
<li><a href="#criticism" id="markdown-toc-criticism">Criticism</a></li>
</ul>
</li>
<li><a href="#relax-ng" id="markdown-toc-relax-ng">Relax NG</a></li>
<li><a href="#schematron" id="markdown-toc-schematron">Schematron</a></li>
</ul>
</li>
<li><a href="#xml-information-set" id="markdown-toc-xml-information-set">XML Information Set</a></li>
<li><a href="#xslt" id="markdown-toc-xslt">XSLT</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#default-templates" id="markdown-toc-default-templates">Default templates</a></li>
<li><a href="#example" id="markdown-toc-example">Example</a></li>
</ul>
</li>
<li><a href="#xquery" id="markdown-toc-xquery">XQuery</a> <ul>
<li><a href="#syntax" id="markdown-toc-syntax">Syntax</a></li>
<li><a href="#creating-xml-content" id="markdown-toc-creating-xml-content">Creating XML content</a></li>
<li><a href="#sequences" id="markdown-toc-sequences">Sequences</a></li>
<li><a href="#flwor" id="markdown-toc-flwor">FLWOR</a></li>
<li><a href="#conditional-expressions" id="markdown-toc-conditional-expressions">Conditional expressions</a></li>
<li><a href="#quantified-expressions" id="markdown-toc-quantified-expressions">Quantified expressions</a></li>
<li><a href="#functions" id="markdown-toc-functions">Functions</a></li>
<li><a href="#modules" id="markdown-toc-modules">Modules</a></li>
<li><a href="#updating-xml-content" id="markdown-toc-updating-xml-content">Updating XML Content</a></li>
<li><a href="#advanced-features" id="markdown-toc-advanced-features">Advanced features</a></li>
<li><a href="#coding-guidelines" id="markdown-toc-coding-guidelines">Coding guidelines</a></li>
</ul>
</li>
<li><a href="#xml-based-webapps" id="markdown-toc-xml-based-webapps">XML Based Webapps</a> <ul>
<li><a href="#xml-databases" id="markdown-toc-xml-databases">XML Databases</a></li>
<li><a href="#rest" id="markdown-toc-rest">REST</a></li>
<li><a href="#oppidum" id="markdown-toc-oppidum">Oppidum</a></li>
</ul>
</li>
<li><a href="#foundations-of-xml-types" id="markdown-toc-foundations-of-xml-types">Foundations of XML types</a> <ul>
<li><a href="#tree-grammars" id="markdown-toc-tree-grammars">Tree Grammars</a> <ul>
<li><a href="#dtd--local-tree-grammars" id="markdown-toc-dtd--local-tree-grammars">DTD & Local tree grammars</a></li>
<li><a href="#xml-schema--single-type-tree-grammars" id="markdown-toc-xml-schema--single-type-tree-grammars">XML Schema & Single-Type tree grammars</a></li>
<li><a href="#relax-ng--regular-tree-grammars" id="markdown-toc-relax-ng--regular-tree-grammars">Relax NG & Regular tree grammars</a></li>
</ul>
</li>
<li><a href="#tree-automata" id="markdown-toc-tree-automata">Tree automata</a> <ul>
<li><a href="#definition" id="markdown-toc-definition">Definition</a></li>
<li><a href="#example-1" id="markdown-toc-example-1">Example</a></li>
<li><a href="#properties" id="markdown-toc-properties">Properties</a></li>
</ul>
</li>
<li><a href="#validation" id="markdown-toc-validation">Validation</a> <ul>
<li><a href="#inclusion" id="markdown-toc-inclusion">Inclusion</a></li>
<li><a href="#closure" id="markdown-toc-closure">Closure</a></li>
<li><a href="#emptiness" id="markdown-toc-emptiness">Emptiness</a></li>
<li><a href="#type-inclusion" id="markdown-toc-type-inclusion">Type inclusion</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#dealing-with-non-textual-content" id="markdown-toc-dealing-with-non-textual-content">Dealing with non-textual content</a> <ul>
<li><a href="#mathml" id="markdown-toc-mathml">MathML</a></li>
<li><a href="#tables" id="markdown-toc-tables">Tables</a></li>
</ul>
</li>
<li><a href="#xml-processing" id="markdown-toc-xml-processing">XML Processing</a> <ul>
<li><a href="#dom" id="markdown-toc-dom">DOM</a></li>
<li><a href="#sax" id="markdown-toc-sax">SAX</a></li>
<li><a href="#dom-and-web-applications" id="markdown-toc-dom-and-web-applications">DOM and web applications</a></li>
<li><a href="#xforms-an-alternative-to-html-forms" id="markdown-toc-xforms-an-alternative-to-html-forms">XForms: an alternative to HTML forms</a></li>
</ul>
</li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<!-- More -->
<h2 id="xpath">XPath</h2>
<p>XPath is the W3C standard language for traversal and navigation in XML trees.</p>
<p>For navigation, we use the <strong>location path</strong> to identify nodes or content. A location path is a sequence of location steps separated by a <code class="highlighter-rouge">/</code>:</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="code"><pre>child::chapter/descendant::section/child::para</pre></td></tr></tbody></table></code></pre></figure>
<p>Every location step has an axis, followed by <code class="highlighter-rouge">::</code> and then a node test. Starting from a context node, a location step returns a node-set. Every node in that node-set is then used as the context node for the next step.</p>
<p>You can start an XPath expression with <code class="highlighter-rouge">/</code> to start from the root; this is known as an <strong>absolute path</strong>.</p>
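<p>The step-by-step evaluation described above can be sketched in Python with the standard <code class="highlighter-rouge">xml.etree.ElementTree</code> module. The document and the helpers <code class="highlighter-rouge">child</code> and <code class="highlighter-rouge">descendant</code> are made up for this illustration; they only mimic two axes with a name test and are not a full XPath engine.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
DOC = """
<book>
  <chapter>
    <section>
      <para>p1</para>
      <section><para>p2</para></section>
    </section>
  </chapter>
  <appendix><section><para>p3</para></section></appendix>
</book>
"""

def child(nodes, name):
    """child::name, applied to every context node in turn."""
    return [c for n in nodes for c in n if c.tag == name]

def descendant(nodes, name):
    """descendant::name, applied to every context node (the node itself is excluded)."""
    out = []
    for n in nodes:
        for d in n.iter(name):
            if d is not n and d not in out:   # a node-set holds no duplicates
                out.append(d)
    return out

book = ET.fromstring(DOC)                 # initial context node
step1 = child([book], "chapter")          # child::chapter
step2 = descendant(step1, "section")      # descendant::section
step3 = child(step2, "para")              # child::para
result = [p.text for p in step3]
```

<p>Each step receives the node-set produced by the previous one, so the <code class="highlighter-rouge">para</code> under <code class="highlighter-rouge">appendix</code> is never reached.</p>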
<p>XPath defines 13 axes allowing navigation, including <code class="highlighter-rouge">self</code>, <code class="highlighter-rouge">parent</code>, <code class="highlighter-rouge">child</code>, <code class="highlighter-rouge">following-sibling</code>, <code class="highlighter-rouge">ancestor-or-self</code>, etc. There is a special <code class="highlighter-rouge">attribute</code> axis to select attributes of the context node, which are not really in the child hierarchy. Similarly, <code class="highlighter-rouge">namespace</code> selects namespace nodes.</p>
<p>A nodetest filters nodes:</p>
<table>
<thead>
<tr>
<th style="text-align: left">Test</th>
<th style="text-align: left">Semantics</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">node()</code></td>
<td style="text-align: left">let any node pass</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">text()</code></td>
<td style="text-align: left">select only text nodes</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">comment()</code></td>
<td style="text-align: left">preserve only comment nodes</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">name</code></td>
<td style="text-align: left">preserves only <strong>elements/attributes</strong> with that name</td>
</tr>
<tr>
<td style="text-align: left"><code class="highlighter-rouge">*</code></td>
<td style="text-align: left">preserves every <strong>element/attribute</strong></td>
</tr>
</tbody>
</table>
<p>At each navigation step, nodes can be filtered using qualifiers.</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="code"><pre>axis::nodetest[qualifier][qualifier]</pre></td></tr></tbody></table></code></pre></figure>
<p>For instance:</p>
<figure class="highlight"><pre><code class="language-xpath" data-lang="xpath"><table class="rouge-table"><tbody><tr><td class="code"><pre>following-sibling::para[position()=last()]</pre></td></tr></tbody></table></code></pre></figure>
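<p>A qualifier filters the node-set selected by the step. Each node in the node-set is kept only if the qualifier evaluates to true for it. Positions are counted along the direction of the axis, so on a reverse axis such as <code class="highlighter-rouge">ancestor</code> or <code class="highlighter-rouge">preceding-sibling</code>, position 1 is the node closest to the context node.</p>
<p>Qualifiers may include comparisons (<code class="highlighter-rouge">=</code>, <code class="highlighter-rouge"><</code>, <code class="highlighter-rouge"><=</code>, …). Nodes are compared by their string-value, which for an element is the concatenation of all descendant text nodes in <em>document order</em>. But there’s a catch here! Comparison between node-sets has existential semantics: it is true as soon as there is one pair of nodes for which the comparison is true. Negating an existential comparison therefore gives universal quantification: <code class="highlighter-rouge">not(a = b)</code> holds only when no pair is equal, whereas <code class="highlighter-rouge">a != b</code> holds as soon as one pair differs.</p>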
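<p>The existential semantics can be illustrated with a small Python sketch, where a node-set is modelled as a plain list of string-values. The function names <code class="highlighter-rouge">eq</code> and <code class="highlighter-rouge">ne</code> and the sample values are assumptions made for this example.</p>

```python
def eq(left, right):
    """XPath `left = right` on two node-sets: true if SOME pair is equal."""
    return any(l == r for l in left for r in right)

def ne(left, right):
    """XPath `left != right` on two node-sets: true if SOME pair differs."""
    return any(l != r for l in left for r in right)

# String-values of two hypothetical node-sets.
authors = ["Goscinny", "Uderzo"]
wanted = ["Goscinny"]

some_equal = eq(authors, wanted)        # one pair matches
some_different = ne(authors, wanted)    # "Uderzo" differs, so this holds too
none_equal = not eq(authors, wanted)    # not(a = b): no pair may match
```

<p>Here both <code class="highlighter-rouge">=</code> and <code class="highlighter-rouge">!=</code> are true at the same time, which is why <code class="highlighter-rouge">!=</code> is not the negation of <code class="highlighter-rouge">=</code> on node-sets.</p>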
<p>XPaths can be a union of location paths separated by <code class="highlighter-rouge">|</code>. Qualifiers can include boolean expressions (<code class="highlighter-rouge">or</code>, <code class="highlighter-rouge">not</code>, <code class="highlighter-rouge">and</code>, …).</p>
<p>There are a few basic functions: <code class="highlighter-rouge">last()</code>, <code class="highlighter-rouge">position()</code>, <code class="highlighter-rouge">count(node-set)</code>, <code class="highlighter-rouge">concat(string, string, ...)</code>, <code class="highlighter-rouge">contains(str1, str2)</code>, etc. These can be used within a qualifier.</p>
<p>XPath also supports abbreviated syntax. For instance, <code class="highlighter-rouge">child::</code> is the default axis and can be omitted, <code class="highlighter-rouge">@</code> is a shorthand for <code class="highlighter-rouge">attribute::</code>, <code class="highlighter-rouge">[4]</code> is a shorthand for <code class="highlighter-rouge">[position()=4]</code> (note that positions start at 1).</p>
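<p>Some of these abbreviations can be tried out with Python's <code class="highlighter-rouge">xml.etree.ElementTree</code>, which supports a limited subset of XPath (abbreviated child steps, positional qualifiers and attribute tests, but not the full set of axes). The document below is invented for the example.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
DOC = """
<section>
  <para type="warning">first</para>
  <para>second</para>
  <note>aside</note>
  <para type="warning">third</para>
</section>
"""
section = ET.fromstring(DOC)

# "para" is short for child::para
all_paras = [p.text for p in section.findall("para")]
# "[2]" is short for [position()=2]; positions start at 1
second = [p.text for p in section.findall("para[2]")]
# [last()] selects the last para child
last = [p.text for p in section.findall("para[last()]")]
# "@type" is short for attribute::type
warnings = [p.text for p in section.findall("para[@type='warning']")]
```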
<p>XPath is used in XSLT, XQuery, XPointer, XLink, XML Schema, XForms, …</p>
<h3 id="evaluation">Evaluation</h3>
<p>To evaluate an XPath expression, we have in our state:</p>
<ul>
<li>The context node</li>
<li>Context size: number of nodes in the node-set</li>
<li>Context position: index of the context node in the node-set</li>
<li>A set of variable bindings</li>
</ul>
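<p>As a rough sketch of how this state is used, the Python code below evaluates <code class="highlighter-rouge">following-sibling::para[position()=last()]</code> by hand. The helpers <code class="highlighter-rouge">following_sibling</code> and <code class="highlighter-rouge">apply_qualifier</code>, as well as the sample document, are assumptions made for the illustration; variable bindings are left out.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
sec = ET.fromstring(
    "<sec><title/><para>a</para><note/><para>b</para><para>c</para></sec>")

def following_sibling(parent, node, name):
    """following-sibling::name for `node`, which must be a child of `parent`."""
    children = list(parent)
    start = children.index(node) + 1
    return [c for c in children[start:] if c.tag == name]

def apply_qualifier(node_set, predicate):
    """Keep the nodes for which predicate(node, position, size) is true."""
    size = len(node_set)                                   # context size
    return [n for pos, n in enumerate(node_set, start=1)   # context position
            if predicate(n, pos, size)]

title = sec.find("title")                                  # context node
paras = following_sibling(sec, title, "para")
last_para = apply_qualifier(paras, lambda n, pos, size: pos == size)
selected = [p.text for p in last_para]
```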
<h2 id="xml-schemas">XML Schemas</h2>
<p>There are three classes of languages that constrain XML content:</p>
<ul>
<li>Constraints expressed by <strong>a description</strong> of each element, and potentially related attributes (DTD, XML Schema)</li>
<li>Constraints expressed by <strong>patterns</strong> defining the admissible elements, attributes and text nodes using regexes (Relax NG)</li>
<li>Constraints expressed by <strong>rules</strong> (Schematron)</li>
</ul>
<h3 id="dtd">DTD</h3>
<p>Document Type Definitions (DTDs) are XML’s native schema system. They allow us to define document classes, using a declarative approach to describe the logical structure of a document.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="cp"><!ELEMENT recipe (title, comment*, item+, picture?, nbPers)></span>
<span class="cp"><!ATTLIST recipe difficulty (easy|medium|difficult) #IMPLIED></span>
<span class="cp"><!ELEMENT title (#PCDATA)></span>
<span class="cp"><!ELEMENT comment (#PCDATA)></span>
<span class="cp"><!ELEMENT item (header?,((ingredient+, step+) | (ingredient+, step)+))></span>
<span class="cp"><!ELEMENT header (#PCDATA)></span>
<span class="cp"><!ELEMENT ingredient (#PCDATA)></span>
<span class="cp"><!ELEMENT step (#PCDATA)></span>
<span class="cp"><!ELEMENT picture EMPTY></span>
<span class="cp"><!ATTLIST picture source CDATA #REQUIRED format (jpeg | png) #IMPLIED ></span>
<span class="cp"><!ELEMENT nbPers (#PCDATA)></span></pre></td></tr></tbody></table></code></pre></figure>
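<p>A DTD content model is essentially a regular expression over the names of the child elements. The Python sketch below checks only the <code class="highlighter-rouge">recipe</code> declaration from the DTD above, by rewriting its content model as a regex; it does not recurse into children and is not a real DTD validator. The function name <code class="highlighter-rouge">recipe_is_valid</code> and the sample documents are assumptions.</p>

```python
import re
import xml.etree.ElementTree as ET

# (title, comment*, item+, picture?, nbPers) rewritten as a regex over the
# child names, each name being followed by a single space.
RECIPE_MODEL = re.compile(r"title (comment )*(item )+(picture )?nbPers ")
DIFFICULTIES = {"easy", "medium", "difficult"}

def recipe_is_valid(recipe):
    """Check the children and the difficulty attribute of one recipe element."""
    names = "".join(child.tag + " " for child in recipe)
    difficulty = recipe.get("difficulty")     # #IMPLIED: the attribute may be absent
    return (RECIPE_MODEL.fullmatch(names) is not None
            and (difficulty is None or difficulty in DIFFICULTIES))

good = ET.fromstring(
    '<recipe difficulty="easy"><title/><item/><item/><nbPers/></recipe>')
no_item = ET.fromstring('<recipe><title/><nbPers/></recipe>')     # item+ not satisfied
bad_attr = ET.fromstring(
    '<recipe difficulty="hard"><title/><item/><nbPers/></recipe>')
```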
<h3 id="xml-schema">XML Schema</h3>
<p>XML Schema is a <a href="http://www.w3.org/TR/xmlschema-0/">W3C standard</a> that goes beyond the native DTDs. XML Schema descriptions are valid XML documents themselves.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsd:schema</span> <span class="na">xmlns:xsd=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span><span class="nt">></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"RecipesCollection"</span><span class="nt">></span>
<span class="nt"><xsd:complexType></span>
<span class="nt"><xsd:sequence</span> <span class="na">minOccurs=</span><span class="s">"0"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"Recipe"</span> <span class="na">type=</span><span class="s">"RecipeType"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"></xsd:element></span>
...
<span class="nt"></xsd:schema></span></pre></td></tr></tbody></table></code></pre></figure>
<p>To declare an element, we do as follows. With no <code class="highlighter-rouge">type</code> attribute, the <code class="highlighter-rouge">author</code> element below gets the default type <code class="highlighter-rouge">xsd:anyType</code>, so its content is not constrained:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"author"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>To constrain the content, we give the element a type. Built-in simple types include <code class="highlighter-rouge">string</code>, <code class="highlighter-rouge">boolean</code>, <code class="highlighter-rouge">decimal</code>, <code class="highlighter-rouge">float</code>, <code class="highlighter-rouge">duration</code>, <code class="highlighter-rouge">time</code>, <code class="highlighter-rouge">date</code>, <code class="highlighter-rouge">anyURI</code>, … The values are still string-encoded and must be extracted by the XML application, but this helps verify the consistency.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"year"</span> <span class="na">type=</span><span class="s">"xsd:date"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can bound the number of occurrences of an element. Below, the <code class="highlighter-rouge">character</code> element may be repeated 0 to ∞ times (this is equivalent to something like <code class="highlighter-rouge">character*</code> in a regex). Absence of <code class="highlighter-rouge">minOccurs</code> and <code class="highlighter-rouge">maxOccurs</code> implies exactly once (like in a regex).</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"character"</span> <span class="na">minOccurs=</span><span class="s">"0"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
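<p>The meaning of the occurrence bounds can be sketched as a simple count check in Python. The helper <code class="highlighter-rouge">occurs_ok</code> is hypothetical, and <code class="highlighter-rouge">None</code> stands in for <code class="highlighter-rouge">unbounded</code>; a real validator also checks order and types.</p>

```python
import xml.etree.ElementTree as ET

def occurs_ok(parent, name, min_occurs=1, max_occurs=1):
    """Check how many `name` children `parent` has.

    The defaults mirror the absence of minOccurs/maxOccurs (exactly once);
    max_occurs=None plays the role of "unbounded".
    """
    count = len(parent.findall(name))
    return count >= min_occurs and (max_occurs is None or count <= max_occurs)

# Hypothetical instance fragments.
cast = ET.fromstring("<characters><character/><character/></characters>")
empty = ET.fromstring("<characters/>")
```

<p>With <code class="highlighter-rouge">min_occurs=0</code> and <code class="highlighter-rouge">max_occurs=None</code>, both fragments pass; with the defaults, neither does.</p>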
<p>We can define more complex types using <strong>type constructors</strong>.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"Characters"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"character"</span> <span class="na">minOccurs=</span><span class="s">"1"</span> <span class="na">maxOccurs=</span><span class="s">"unbounded"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"Prolog"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"series"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"author"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"characters"</span> <span class="na">type=</span><span class="s">"Characters"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"prolog"</span> <span class="na">type=</span><span class="s">"Prolog"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This defines a <code class="highlighter-rouge">Prolog</code> type containing a sequence of a <code class="highlighter-rouge">series</code> element, an <code class="highlighter-rouge">author</code> element, and a <code class="highlighter-rouge">characters</code> element of type <code class="highlighter-rouge">Characters</code>, which is <code class="highlighter-rouge">character+</code>.</p>
<p>Using the <code class="highlighter-rouge">mixed="true"</code> attribute on an <code class="highlighter-rouge">xsd:complexType</code> allows for mixed content: text can appear between the child elements (like we know in HTML, where you can do <code class="highlighter-rouge"><p>hello <em>world</em>!</p></code>).</p>
<p>There are more type constructor primitives that allow to do much of what regexes do: <code class="highlighter-rouge">xsd:sequence</code>, which we’ve seen above, but also <code class="highlighter-rouge">xsd:choice</code> (for enumerated elements) and <code class="highlighter-rouge">xsd:all</code> (for unordered elements).</p>
<p>Attributes are declared inside the complex type of their owner element:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:element</span> <span class="na">name=</span><span class="s">"strip"</span><span class="nt">></span>
<span class="nt"><xsd:complexType></span>
<span class="nt"><xsd:attribute</span> <span class="na">name=</span><span class="s">"copyright"</span><span class="nt">/></span>
<span class="nt"><xsd:attribute</span> <span class="na">name=</span><span class="s">"year"</span> <span class="na">type=</span><span class="s">"xsd:gYear"</span><span class="nt">/></span>
<span class="nt"></xsd:complexType></span>
<span class="nt"></xsd:element></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Because writing complex types can be tedious, complex types can be derived by extension or restriction from existing base types:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:complexType</span> <span class="na">name=</span><span class="s">"BookType"</span><span class="nt">></span>
<span class="nt"><xsd:complexContent></span>
<span class="nt"><xsd:extension</span> <span class="na">base=</span><span class="s">"Publication"</span><span class="nt">></span>
<span class="nt"><xsd:sequence></span>
<span class="nt"><xsd:element</span> <span class="na">name =</span><span class="s">"ISBN"</span> <span class="na">type=</span><span class="s">"xsd:string"</span><span class="nt">/></span>
<span class="nt"><xsd:element</span> <span class="na">name =</span><span class="s">"Publisher"</span> <span class="na">type=</span><span class="s">"xsd:string"</span><span class="nt">/></span>
<span class="nt"></xsd:sequence></span>
<span class="nt"></xsd:extension></span>
<span class="nt"></xsd:complexContent></span>
<span class="nt"></xsd:complexType></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Additionally, it is possible to define custom simple types by restricting an existing type, for instance with an enumeration or a pattern:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><xsd:simpleType</span> <span class="na">name=</span><span class="s">"Car"</span><span class="nt">></span>
<span class="nt"><xsd:restriction</span> <span class="na">base=</span><span class="s">"xsd:string"</span><span class="nt">></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"Audi"</span><span class="nt">/></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"BMW"</span><span class="nt">/></span>
<span class="nt"><xsd:enumeration</span> <span class="na">value=</span><span class="s">"VW"</span><span class="nt">/></span>
<span class="nt"></xsd:restriction></span>
<span class="nt"></xsd:simpleType></span>
<span class="nt"><xsd:simpleType</span> <span class="na">name=</span><span class="s">"WeakPasswordType"</span><span class="nt">></span>
<span class="nt"><xsd:restriction</span> <span class="na">base=</span><span class="s">"xsd:string"</span><span class="nt">></span>
<span class="nt"><xsd:pattern</span> <span class="na">value=</span><span class="s">"[a-zA-Z0-9]{8}"</span><span class="nt">/></span>
<span class="nt"></xsd:restriction></span>
<span class="nt"></xsd:simpleType></span></pre></td></tr></tbody></table></code></pre></figure>
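<p>The two restrictions can be mimicked in Python as follows. This is a sketch under the assumption that the password pattern is meant to accept exactly eight alphanumeric characters; the function names are hypothetical.</p>

```python
import re

CAR_VALUES = {"Audi", "BMW", "VW"}               # the xsd:enumeration facets
WEAK_PASSWORD = re.compile(r"[a-zA-Z0-9]{8}")    # assumed intent of the xsd:pattern facet

def is_car(value):
    """A value is a Car if it is one of the enumerated strings."""
    return value in CAR_VALUES

def is_weak_password(value):
    """XSD patterns must match the whole value, hence fullmatch rather than search."""
    return WEAK_PASSWORD.fullmatch(value) is not None
```

<p>For instance, <code class="highlighter-rouge">is_weak_password("abc12345")</code> holds, while a seven- or nine-character string is rejected.</p>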
<h4 id="criticism">Criticism</h4>
<p>There have been some criticisms addressed to XML Schema:</p>
<ul>
<li>The specification is very difficult to understand</li>
<li>It requires a high level of expertise to avoid surprises, as there are many complex and unintuitive behaviors</li>
<li>The choice between element and attribute is largely a matter of the taste of the designer, but XML Schema provides separate functionality for them, distinguishing them strongly</li>
<li>There is only weak support for unordered content. SGML had the <code class="highlighter-rouge">&</code> operator: <code class="highlighter-rouge">A & B</code> means that we must have <code class="highlighter-rouge">A</code> followed by <code class="highlighter-rouge">B</code> or vice-versa (order doesn’t matter). We could also write <code class="highlighter-rouge">A & B*</code>, which requires the sequence of <code class="highlighter-rouge">B</code> elements to stay grouped together, either before or after <code class="highlighter-rouge">A</code>. XML Schema is too limited to enforce such things.</li>
<li>
<p>The datatypes (strings, dates, etc) are tied to <a href="https://www.w3.org/TR/xmlschema-2/">a single collection of datatypes</a>, which can be a little too limited for certain domain-specific datatypes.</p>
<p>But XML Schema 1.1 addressed this with two new features: co-occurrence constraints and assertions on simple types.</p>
<p>Co-occurrence constraints make the presence of an attribute or element, or the values allowed for it, depend on the value or presence of other attributes or elements.</p>
<p>Assertions on simple types introduce a new facet for simple types, called an assertion, which specifies constraints using XPath expressions.</p>
</li>
</ul>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xs:schema</span> <span class="na">xmlns:xs=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span><span class="nt">></span>
<span class="nt"><xs:element</span> <span class="na">name=</span><span class="s">"NbOfAttempts"</span><span class="nt">></span>
<span class="nt"><xs:complexType></span>
<span class="nt"><xs:attribute</span> <span class="na">name=</span><span class="s">"min"</span> <span class="na">type=</span><span class="s">"xs:int"</span><span class="nt">/></span>
<span class="nt"><xs:attribute</span> <span class="na">name=</span><span class="s">"max"</span> <span class="na">type=</span><span class="s">"xs:int"</span><span class="nt">/></span>
<span class="nt"><xs:assert</span> <span class="na">test=</span><span class="s">"@min le @max"</span><span class="nt">/></span>
<span class="nt"></xs:complexType></span>
<span class="nt"></xs:element></span>
<span class="nt"></xs:schema></span>
</pre></td></tr></tbody></table></code></pre></figure>
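<p>What the assertion in this schema checks can be approximated in Python. The function <code class="highlighter-rouge">attempts_ok</code> is a hypothetical stand-in for an XSD 1.1 validator; treating a missing attribute as a failed assertion is an assumption based on <code class="highlighter-rouge">le</code> yielding an empty sequence when one operand is empty.</p>

```python
import xml.etree.ElementTree as ET

def attempts_ok(elem):
    """Approximate the assertion `@min le @max` on a NbOfAttempts element."""
    low, high = elem.get("min"), elem.get("max")
    if low is None or high is None:
        # Assumption: an empty operand makes the test evaluate to false.
        return False
    return int(low) <= int(high)

# Hypothetical instance documents.
ok = ET.fromstring('<NbOfAttempts min="1" max="3"/>')
inverted = ET.fromstring('<NbOfAttempts min="5" max="3"/>')
partial = ET.fromstring('<NbOfAttempts min="5"/>')
```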
<p>Because of these criticisms, some members of the original W3C XML Schema committee have gone on to create alternatives, some of which we will see below.</p>
<h3 id="relax-ng">Relax NG</h3>
<p>Pronounced “relaxing”. Relax NG’s goals are:</p>
<ul>
<li>Be easier to learn and use</li>
<li>Provide an XML syntax that is more readable and compact</li>
<li>Provide a theoretically sound language (based on tree automata, which we’ll talk about later)</li>
<li>Let the schema follow the structure of the document</li>
</ul>
<p>The reference book for Relax NG is <a href="http://books.xmlschemata.org/relaxng/">Relax NG by Eric van der Vlist</a>.</p>
<p>As the example below shows, Relax NG is much more legible:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><element</span> <span class="na">name=</span><span class="s">"AddressBook"</span><span class="nt">></span>
<span class="nt"><zeroOrMore></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Card"</span><span class="nt">></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Name"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Email"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><optional></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Note"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"></optional></span>
<span class="nt"></element></span>
<span class="nt"></zeroOrMore></span>
<span class="nt"></element></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Another example shows a little more advanced functionality; here, a card can either contain a single <code class="highlighter-rouge">Name</code>, or (exclusive or) both a <code class="highlighter-rouge">GivenName</code> and <code class="highlighter-rouge">FamilyName</code>.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><element</span> <span class="na">name=</span><span class="s">"Card"</span><span class="nt">></span>
<span class="nt"><choice></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"Name"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><group></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"GivenName"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"><element</span> <span class="na">name=</span><span class="s">"FamilyName"</span><span class="nt">></span>
<span class="nt"><text/></span>
<span class="nt"></element></span>
<span class="nt"></group></span>
<span class="nt"></choice></span>
<span class="nt"></element></span></pre></td></tr></tbody></table></code></pre></figure>
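<p>The effect of <code class="highlighter-rouge"><choice></code> combined with <code class="highlighter-rouge"><group></code> in this schema can be sketched in Python by listing the admissible sequences of child names. The helper <code class="highlighter-rouge">card_ok</code> is hypothetical and only looks at the names of the children of a <code class="highlighter-rouge">Card</code>, not at their content.</p>

```python
import xml.etree.ElementTree as ET

ALTERNATIVES = [
    ["Name"],                        # first branch of <choice>
    ["GivenName", "FamilyName"],     # the <group> branch, in this order
]

def card_ok(card):
    """True if the children of `card` match exactly one branch of the choice."""
    return [child.tag for child in card] in ALTERNATIVES

# Hypothetical instance fragments.
single = ET.fromstring("<Card><Name>Ada Lovelace</Name></Card>")
split = ET.fromstring(
    "<Card><GivenName>Ada</GivenName><FamilyName>Lovelace</FamilyName></Card>")
both = ET.fromstring(
    "<Card><Name>Ada Lovelace</Name><GivenName>Ada</GivenName>"
    "<FamilyName>Lovelace</FamilyName></Card>")
```

<p>The third card mixes the two branches and is rejected, which is the exclusive-or behaviour described above.</p>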
<p>Some other tags include:</p>
<ul>
<li><code class="highlighter-rouge"><choice></code> allows only one of the enumerated children to occur</li>
<li><code class="highlighter-rouge"><interleave></code> allows child elements to occur in any order (like <code class="highlighter-rouge">xsd:all</code> in XML Schema)</li>
<li><code class="highlighter-rouge"><attribute></code> inside an <code class="highlighter-rouge"><element></code> specifies the schema for attributes. By itself, it’s considered required, but it can be wrapped in an <code class="highlighter-rouge"><optional></code> too.</li>
<li><code class="highlighter-rouge"><group></code> allows us, as the name implies, to logically group elements. This is especially useful inside <code class="highlighter-rouge"><choice></code> elements, as in the example above.</li>
</ul>
<p>The Relax NG book has a more detailed overview of these in <a href="http://books.xmlschemata.org/relaxng/relax-CHP-3-SECT-2.html">Chapter 3.2</a>.</p>
<p>Relax NG allows us to reference externally defined datatypes, such as <a href="https://www.w3.org/2001/XMLSchema-datatypes">those defined in XML Schema</a>. To include such a reference, we can specify a <code class="highlighter-rouge">datatypeLibrary</code> attribute on the root <code class="highlighter-rouge"><grammar></code> element:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><grammar</span> <span class="na">xmlns=</span><span class="s">"http://relaxng.org/ns/structure/1.0"</span>
<span class="na">xmlns:a=</span><span class="s">"http://relaxng.org/ns/compatibility/annotations/1.0"</span>
<span class="na">datatypeLibrary=</span><span class="s">"http://www.w3.org/2001/XMLSchema-datatypes"</span><span class="nt">></span>
<span class="nt"><start></span>
...
<span class="nt"></start></span>
<span class="nt"></grammar></span></pre></td></tr></tbody></table></code></pre></figure>
<p>In addition to datatypes, we can also express admissible XML <em>content</em> using regexes, but (and this is important!) <strong>we cannot express cardinality constraints or uniqueness constraints</strong>.</p>
<p>If we need to express those, we can make use of Schematron.</p>
<h3 id="schematron">Schematron</h3>
<p><a href="http://schematron.com">Schematron</a> is an assertion language making use of XPath for node selection and for encoding predicates. It is often used <em>in conjunction</em> with Relax NG to express more complicated constraints, that aren’t easily expressed (or can’t be expressed at all) in Relax NG. The common pattern is to build the structure of the schema in Relax NG, and the business logic in Schematron.</p>
<p>They can be combined in the same file by declaring different namespaces. For instance, the example below allows us to write a Relax NG schema as usual, and some Schematron rules under the <code class="highlighter-rouge">sch</code> namespace.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><grammar</span> <span class="na">xmlns=</span><span class="s">"http://relaxng.org/ns/structure/1.0"</span>
<span class="na">xmlns:a=</span><span class="s">"http://relaxng.org/ns/compatibility/annotations/1.0"</span>
<span class="na">xmlns:sch=</span><span class="s">"http://purl.oclc.org/dsdl/schematron"</span>
<span class="na">datatypeLibrary=</span><span class="s">"http://www.w3.org/2001/XMLSchema-datatypes"</span><span class="nt">></span>
...
<span class="nt"></grammar></span></pre></td></tr></tbody></table></code></pre></figure>
<p>As we can see in the example below, a Schematron schema is built from a series of assertions:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre><span class="nt"><schema</span> <span class="na">xmlns=</span><span class="s">"http://purl.oclc.org/dsdl/schematron"</span> <span class="nt">></span>
<span class="nt"><title></span>A Schema for Books<span class="nt"></title></span>
<span class="nt"><ns</span> <span class="na">prefix=</span><span class="s">"bk"</span> <span class="na">uri=</span><span class="s">"http://www.example.com/books"</span> <span class="nt">/></span>
<span class="nt"><pattern</span> <span class="na">id=</span><span class="s">"authorTests"</span><span class="nt">></span>
<span class="nt"><rule</span> <span class="na">context=</span><span class="s">"bk:book"</span><span class="nt">></span>
<span class="nt"><assert</span> <span class="na">test=</span><span class="s">"count(bk:author)!= 0"</span><span class="nt">></span>
A book must have at least one author
<span class="nt"></assert></span>
<span class="nt"></rule></span>
<span class="nt"></pattern></span>
<span class="nt"><pattern</span> <span class="na">id=</span><span class="s">"onLoanTests"</span><span class="nt">></span>
<span class="nt"><rule</span> <span class="na">context=</span><span class="s">"bk:book"</span><span class="nt">></span>
<span class="nt"><report</span> <span class="na">test=</span><span class="s">"@on-loan and not(@return-date)"</span><span class="nt">></span>
Every book that is on loan must have a return date
<span class="nt"></report></span>
<span class="nt"></rule></span>
<span class="nt"></pattern></span>
<span class="nt"></schema></span></pre></td></tr></tbody></table></code></pre></figure>
<p>A short description of the different Schematron elements follows:</p>
<ul>
<li><code class="highlighter-rouge"><ns></code>: specifies to which namespace a prefix is bound. In the above example, the <code class="highlighter-rouge">bk</code> prefix, used as <code class="highlighter-rouge">bk:book</code>, is bound to <code class="highlighter-rouge">http://www.example.com/books</code>. This prefix is used by XPath in the elements below.</li>
<li><code class="highlighter-rouge"><pattern></code>: a pattern contains a list of rules, and is used to group similar assertions. This isn’t just for better code organization; it also allows groups to be executed at different stages of the validation.</li>
<li><code class="highlighter-rouge"><rule></code>: a rule contains <code class="highlighter-rouge"><assert></code> and <code class="highlighter-rouge"><report></code> elements. It has a <code class="highlighter-rouge">context</code> attribute, which is an XPath specifying the element on which we’re operating; all nodes matching the XPath expression are tested for all the assertions and reports of a rule.</li>
<li><code class="highlighter-rouge"><assert></code>: provides a mechanism to check if an assertion is true. If it isn’t, a validation error occurs.</li>
<li><code class="highlighter-rouge"><report></code>: the counterpart of an assertion: its message is emitted when the test is true (rather than false). It is typically used to issue a warning instead of failing the validation.</li>
</ul>
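<p>To make the difference between <code class="highlighter-rouge"><assert></code> and <code class="highlighter-rouge"><report></code> concrete, here is a minimal Python sketch that mimics the two rules of the book schema above with the standard library’s ElementTree. It is not a Schematron processor: the function name <code class="highlighter-rouge">check_books</code>, the <code class="highlighter-rouge">bk:library</code> wrapper and the sample document are assumptions made for illustration, and the XPath tests are translated to Python by hand.</p>

```python
import xml.etree.ElementTree as ET

BK = "http://www.example.com/books"
NS = {"bk": BK}

# Hypothetical sample document, not taken from the course material
SAMPLE = """
<bk:library xmlns:bk="http://www.example.com/books">
  <bk:book><bk:author>A. Writer</bk:author></bk:book>
  <bk:book on-loan="yes"/>
</bk:library>
"""

def check_books(xml_text):
    """Return (errors, warnings) for every bk:book element (the rule context)."""
    root = ET.fromstring(xml_text)
    errors, warnings = [], []
    for index, book in enumerate(root.iter("{%s}book" % BK), start=1):
        # <assert test="count(bk:author) != 0">: message when the test is false
        if len(book.findall("bk:author", NS)) == 0:
            errors.append(f"book {index}: A book must have at least one author")
        # <report test="@on-loan and not(@return-date)">: message when the test is true
        if "on-loan" in book.attrib and "return-date" not in book.attrib:
            warnings.append(f"book {index}: Every book that is on loan must have a return date")
    return errors, warnings

errors, warnings = check_books(SAMPLE)
print(errors)
print(warnings)
```

<p>Only the second book triggers messages here: it has no author (the assertion fails) and it is on loan without a return date (the report fires).</p>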
<h2 id="xml-information-set">XML Information Set</h2>
<p>The purpose of <a href="https://msdn.microsoft.com/en-us/library/aa468561.aspx">XML Information Set</a>, or Infoset, is “to provide a consistent set of definitions for use in other specifications that need to refer to the information in a well-formed XML document<sup id="fnref:infoset-spec"><a href="#fn:infoset-spec" class="footnote">1</a></sup>”.</p>
<p>It specifies a standardized, abstract model to represent the properties of XML trees. The goal is to provide a standardized viewpoint for the implementation and description of various XML technologies.</p>
<p>It functions like an AST for XML documents. It’s abstract in the sense that it abstracts away from the concrete encoding of data, and just retains the meaning. For instance, it doesn’t distinguish between the two forms of the empty element; the following are considered equivalent (pairwise):</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="nt"><element></element></span>
<span class="nt"><element/></span>
<span class="nt"><element</span> <span class="na">attr=</span><span class="s">"example"</span><span class="nt">/></span>
<span class="nt"><element</span> <span class="na">attr=</span><span class="s">'example'</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>The Information Set is described as a tree of information items, which are simply blocks of information about a node in the tree; every information item is an abstract representation of a component in an XML document.</p>
<p>As such, at the root we have a document information item, which, most importantly, contains a list of children, which is a list of information items, in document order. Information items for elements contain a local name, the name of the namespace, a list of attribute information items, which contain the key and value of the attribute, etc.</p>
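<p>As a rough illustration of this abstraction, the Python sketch below reduces a parsed document to nested tuples and compares the pairs of spellings shown above. ElementTree’s data model is only an approximation of the Infoset, and the helper name <code class="highlighter-rouge">info_items</code> is an assumption for this example.</p>

```python
import xml.etree.ElementTree as ET

def info_items(xml_text):
    """Reduce a document to nested (name, attributes, text, children) tuples."""
    def describe(element):
        return (
            element.tag,
            dict(element.attrib),
            element.text or "",                      # "no text" and "empty text" are alike
            [describe(child) for child in element],  # children in document order
        )
    return describe(ET.fromstring(xml_text))

# The two spellings of an empty element carry the same information.
print(info_items("<element></element>") == info_items("<element/>"))
# So do the two ways of quoting an attribute value.
print(info_items('<element attr="example"/>') == info_items("<element attr='example'/>"))
```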
<h2 id="xslt">XSLT</h2>
<h3 id="motivation">Motivation</h3>
<p>XSLT is part of a more general language, XSL. The hierarchy is as follows:</p>
<ul>
<li><strong>XSL</strong>: eXtensible Stylesheet Language
<ul>
<li><strong>XSLT</strong>: XSL Transformation</li>
<li><strong>XSL-FO</strong>: XSL Formatting Objects</li>
</ul>
</li>
</ul>
<p>An XSLT Stylesheet allows us to transform XML input into other formats. An XSLT Processor takes an XML input and an XSLT stylesheet, and produces a result in XML, XHTML, LaTeX, …</p>
<p>XSLT is a <strong>declarative</strong> and <strong>functional</strong> language, which uses XML and XPath. It’s a <a href="https://www.w3.org/TR/xslt/all/">W3C recommendation</a>, often used for generating HTML views of XML content.</p>
<p>The XSLT Stylesheet consists of a set of templates. Each of them matches specific elements in the XML input, and contributes to generating the resulting output.</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsl:stylesheet</span> <span class="na">xmlns:xsl=</span><span class="s">"http://www.w3.org/1999/XSL/Transform"</span> <span class="na">xmlns:xd=</span><span class="s">"http://oxygenxml.com/ns/doc/xsl"</span> <span class="na">version=</span><span class="s">"1.0"</span><span class="nt">></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"a"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"b"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"c"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"d"</span><span class="nt">></span>...<span class="nt"></xsl:template></span>
<span class="nt"></xsl:stylesheet></span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s take a look at an individual XSLT template:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"e"</span><span class="nt">></span>
result: <span class="nt"><xsl:apply-templates/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<ul>
<li><code class="highlighter-rouge">e</code> is an XPath expression that selects the nodes the XSLT processor will apply the template to</li>
<li><code class="highlighter-rouge">result</code> specifies the content to be produced in the output for each node selected by <code class="highlighter-rouge">e</code></li>
<li><code class="highlighter-rouge">xsl:apply-templates</code> indicates that templates are to be applied on the selected nodes, in document order; to select nodes, it may have a <code class="highlighter-rouge">select</code> attribute, which is an XPath expression defaulting to <code class="highlighter-rouge">child::node()</code>.</li>
</ul>
<p>The XSLT execution is roughly as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="k">def</span> <span class="nf">process</span><span class="p">(</span><span class="n">node</span><span class="p">):</span>
<span class="n">find</span> <span class="n">most</span> <span class="n">specific</span> <span class="n">pattern</span>
<span class="c1"># instantiate template:
</span> <span class="n">create</span> <span class="n">result</span> <span class="n">fragment</span>
<span class="k">for</span> <span class="p">(</span><span class="n">instruction</span> <span class="n">selecting</span> <span class="n">other</span> <span class="n">nodes</span><span class="p">)</span> <span class="ow">in</span> <span class="n">template</span><span class="p">:</span>
<span class="k">for</span> <span class="n">new_node</span> <span class="ow">in</span> <span class="n">instruction</span><span class="p">:</span>
<span class="n">process</span><span class="p">(</span><span class="n">new_node</span><span class="p">)</span>
<span class="n">process</span><span class="p">(</span><span class="n">xml</span><span class="o">.</span><span class="n">root</span><span class="p">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Recursion stops when no more source nodes are selected.</p>
<h3 id="default-templates">Default templates</h3>
<p>XSLT Stylesheets contain <strong>default templates</strong>:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"/ | *"</span><span class="nt">></span>
<span class="nt"><xsl:apply-templates/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This recursively drives the matching process, starting from the root node. If a template is associated with the root node, then this default template is overridden; if the overriding template doesn’t contain any <code class="highlighter-rouge"><xsl:apply-templates/></code> element, then the matching process stops there.</p>
<p>Another default template is:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"text()|@*"</span><span class="nt">></span>
<span class="nt"><xsl:value-of</span> <span class="na">select=</span><span class="s">"self::node()"</span><span class="nt">/></span>
<span class="nt"></xsl:template></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This copies the value of text and attribute nodes to the output.</p>
<p>A third default is:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"processing-instruction()|comment()"</span><span class="nt">/></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is a template that specifically matches processing instructions and comments; it is empty, so it does not generate anything for them.</p>
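<p>The processing loop and the default templates can be sketched together in a few lines of Python. This is a simplified model rather than an XSLT processor: “finding the most specific pattern” is reduced to a lookup by element name, the <code class="highlighter-rouge">templates</code> dictionary and its two entries are hypothetical, and comments and processing instructions never reach the templates because ElementTree’s default parser drops them.</p>

```python
import xml.etree.ElementTree as ET

def child_nodes(element):
    """Yield text (str) and element children in document order."""
    if element.text:
        yield element.text
    for child in element:
        yield child
        if child.tail:
            yield child.tail

def process(node, templates):
    """Instantiate the template for one node and return the output it produces."""
    if isinstance(node, str):
        return node  # default rule for text(): copy the text through
    template = templates.get(node.tag)
    if template is None:
        return apply_templates(node, templates)  # default rule for "/ | *"
    return template(node, templates)

def apply_templates(element, templates):
    """Counterpart of <xsl:apply-templates/>: process children in document order."""
    return "".join(process(child, templates) for child in child_nodes(element))

templates = {
    # Hypothetical user template, in the spirit of match="Title"
    "Title": lambda node, t: "<li>" + apply_templates(node, t) + "</li>",
    # A template that never applies templates stops the recursion for its subtree
    "Price": lambda node, t: "",
}

doc = ET.fromstring(
    "<Product><Book><Price><Value>19</Value></Price><Title>Profecie</Title></Book></Product>"
)
print(process(doc, templates))
```

<p>Since <code class="highlighter-rouge">Product</code> and <code class="highlighter-rouge">Book</code> have no template of their own, the default rule simply descends through them; without the <code class="highlighter-rouge">Price</code> entry, the default text rule would also copy <code class="highlighter-rouge">19</code> to the output.</p>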
<h3 id="example">Example</h3>
<p>To get an idea of what XSLT could do, let’s consider the following example of XML data representing a catalog of books and CDs:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
</pre></td><td class="code"><pre><span class="nt"><Catalog></span>
<span class="c"><!-- Book Sample --></span>
<span class="nt"><Product></span>
<span class="nt"><ProductNo></span>bk-005<span class="nt"></ProductNo></span>
<span class="nt"><Book</span> <span class="na">Language=</span><span class="s">"FR"</span><span class="nt">></span>
<span class="nt"><Price></span>
<span class="nt"><Value></span>19<span class="nt"></Value></span>
<span class="nt"><Currency></span>EUR<span class="nt"></Currency></span>
<span class="nt"></Price></span>
<span class="nt"><Title></span>Profecie<span class="nt"></Title></span>
<span class="nt"><Authors></span>
<span class="nt"><Author></span>
<span class="nt"><FirstName></span>Jonathan<span class="nt"></FirstName></span>
<span class="nt"><LastName></span>Zimmermann<span class="nt"></LastName></span>
<span class="nt"></Author></span>
<span class="nt"></Authors></span>
<span class="nt"><Year></span>2015<span class="nt"></Year></span>
<span class="nt"><Cover></span>profecie<span class="nt"></Cover></span>
<span class="nt"></Book></span>
<span class="nt"></Product></span>
<span class="c"><!-- CD sample --></span>
<span class="nt"><Product></span>
<span class="nt"><ProductNo></span>cd-003<span class="nt"></ProductNo></span>
<span class="nt"><CD></span>
<span class="nt"><Price></span>
<span class="nt"><Value></span>18.90<span class="nt"></Value></span>
<span class="nt"><Currency></span>EUR<span class="nt"></Currency></span>
<span class="nt"></Price></span>
<span class="nt"><Title></span>Witloof Bay<span class="nt"></Title></span>
<span class="nt"><Interpret></span>Witloof Bay<span class="nt"></Interpret></span>
<span class="nt"><Year></span>2010<span class="nt"></Year></span>
<span class="nt"><Sleeve></span>witloof<span class="nt"></Sleeve></span>
<span class="nt"><Opinion></span>
<span class="nt"><Parag></span>Original ce groupe belge.<span class="nt"></Parag></span>
<span class="nt"><Parag></span>Une véritable prouesse technique.<span class="nt"></Parag></span>
<span class="nt"></Opinion></span>
<span class="nt"></CD></span>
<span class="nt"></Product></span>
<span class="nt"></Catalog></span></pre></td></tr></tbody></table></code></pre></figure>
<p>For our example of books and CDs, we can create the following stylesheet:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
</pre></td><td class="code"><pre><span class="cp"><?xml version="1.0" encoding="UTF-8"?></span>
<span class="nt"><xsl:stylesheet</span> <span class="na">xmlns:xsl=</span><span class="s">"http://www.w3.org/1999/XSL/Transform"</span>
<span class="na">xmlns:xs=</span><span class="s">"http://www.w3.org/2001/XMLSchema"</span>
<span class="na">exclude-result-prefixes=</span><span class="s">"xs"</span>
<span class="na">version=</span><span class="s">"2.0"</span><span class="nt">></span>
<span class="nt"><xsl:output</span> <span class="na">method=</span><span class="s">"html"</span><span class="nt">/></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"/"</span><span class="nt">></span>
<span class="nt"><html></span>
<span class="nt"><head></span>...<span class="nt"></head></span>
<span class="nt"><body></span>
<span class="nt"><h2></span>Welcome to our catalog<span class="nt"></h2></span>
<span class="nt"><h3></span>Books<span class="nt"></h3></span>
<span class="nt"><ul></span>
<span class="nt"><xsl:apply-templates</span> <span class="na">select=</span><span class="s">"Catalog/Product/Book/Title"</span><span class="nt">></span>
<span class="nt"><xsl:sort</span> <span class="na">select=</span><span class="s">"."</span><span class="nt">/></span>
<span class="nt"></xsl:apply-templates></span>
<span class="nt"></ul></span>
<span class="nt"></body></span>
<span class="nt"></html></span>
<span class="nt"></xsl:template></span>
<span class="nt"><xsl:template</span> <span class="na">match=</span><span class="s">"Title"</span><span class="nt">></span>
<span class="nt"><li></span>
<span class="nt"><xsl:value-of</span> <span class="na">select=</span><span class="s">"."</span><span class="nt">/></span>
<span class="nt"></li></span>
<span class="nt"></xsl:template></span>
<span class="nt"></xsl:stylesheet></span></pre></td></tr></tbody></table></code></pre></figure>
<p>In the above, the <code class="highlighter-rouge">xsl:sort</code> element has the following possible attributes:</p>
<ul>
<li><code class="highlighter-rouge">select</code>: here, the attribute is <code class="highlighter-rouge">.</code>, which refers to the title in this context</li>
<li><code class="highlighter-rouge">data-type</code>: gives the kind of order (e.g. text or number)</li>
<li><code class="highlighter-rouge">order</code>: <code class="highlighter-rouge">ascending</code> or <code class="highlighter-rouge">descending</code></li>
</ul>
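<p>The effect of <code class="highlighter-rouge">data-type</code> and <code class="highlighter-rouge">order</code> is easiest to see on a small list. The Python sketch below is only an analogy: it compares strings by code point, whereas an XSLT processor may apply language-specific collation rules, and the sample values are picked for illustration.</p>

```python
titles = ["XSLT", "Profecie", "Electronic Publishing"]
prices = ["19", "18.90", "100"]

# data-type="text", order="ascending": compare the values as strings
print(sorted(titles))
# data-type="number", order="descending": compare the values as numbers
print(sorted(prices, key=float, reverse=True))
# Sorting numbers as text gives a different (usually unwanted) order
print(sorted(prices))
```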
<h2 id="xquery">XQuery</h2>
<p>XQuery is a <strong>strongly typed</strong> and <strong>functional</strong> language that offers features to operate on XML input for searching, selecting, filtering, transforming, restructuring information, etc. It is an SQL-like language for XML. It wasn’t defined with the same goals as XSLT, but has some overlap that we’ll discuss later.</p>
<p>It does not use the XML syntax. Instead, it offers a general purpose (Turing-complete) language that can be used for developing XML based applications.</p>
<p>XQuery is a <a href="https://www.w3.org/TR/xquery/all/">W3C Recommendation</a>, and is therefore closely linked to <a href="#xml-schema">XML Schema</a>, as it uses the XML Schema type system. Note that there is no support for XQuery with Relax NG or other non-W3C schema languages. A nice book on XQuery is <a href="http://shop.oreilly.com/product/0636920035589.do">available at O’Reilly</a>.</p>
<h3 id="syntax">Syntax</h3>
<p>A query is made up of three parts:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>(: Comments are written in these smiley-like delimiters :)
(: 1. Optional version declaration :)
xquery version "3.0";
(: 2. Optional query prolog :)
(: This contains declarations such as namespaces, variables, etc. :)
declare namespace html = "http://www.w3.org/1999/xhtml";
(: 3. Query body :)
substring("Welcome to the world of XML", 1, 7)</pre></td></tr></tbody></table></code></pre></figure>
<p>A query takes some kind of XML content: an XML file, an XML fragment retrieved online, a native XML database, etc. The output is a sequence of values, which are often XML elements (this is important: not a document, but elements). But it could also be an XML Schema type, such as a string, a list of integers, etc.</p>
<p>The output can be serialized to a document, or just kept in-memory in the application for further processing.</p>
<p>Queries are evaluated by an XQuery processor, which works in two phases. First, the analysis phase may raise static errors (which depend only on the query, not on the input). Then, the evaluation phase may raise dynamic errors (e.g. missing input).</p>
<p>A query consists of one or more comma-separated <strong>XQuery expressions</strong>, which are composed of the following:</p>
<ul>
<li>Primary expressions (literals, variables, function calls, etc)</li>
<li>Arithmetic expressions</li>
<li>Logical expressions</li>
<li>XPath (with <code class="highlighter-rouge">collection</code> and <code class="highlighter-rouge">doc</code> functions used to access resources)</li>
<li>XML constructors</li>
<li>Sequence constructors</li>
<li><a href="https://en.wikipedia.org/wiki/FLWOR">FLWOR statements</a> (pronounced “flower”: for, let, where, order by, return).</li>
<li>Conditional expressions</li>
<li>Quantified expressions</li>
</ul>
<h3 id="creating-xml-content">Creating XML content</h3>
<p>To build XML content, we can embed “escaped” XQuery code using curly brackets, within our template file, as follows:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><report</span> <span class="na">year=</span><span class="s">"2018"</span><span class="nt">></span>
The value is {round (3.14)}
<span class="nt"></report></span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="sequences">Sequences</h3>
<p>A sequence is an ordered collection of items, which may be of any type (atomic value, node, etc). Duplicates are allowed. A sequence can contain zero (empty), one (singleton) or many items. Sequences are comma-separated. We can add parentheses for clarity, but not for nesting; a sequence is always flat (even if we nest parentheses):</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre>1, 2, <example/>
(1, 2, <example/>)</pre></td></tr></tbody></table></code></pre></figure>
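<p>The flattening rule can be mimicked in Python, with tuples standing in for parenthesized sequences. The <code class="highlighter-rouge">sequence</code> helper is a hypothetical name used only for this sketch.</p>

```python
def sequence(*items):
    """Build a flat sequence: nested sequences are spliced in, never nested."""
    flat = []
    for item in items:
        if isinstance(item, tuple):
            flat.extend(sequence(*item))
        else:
            flat.append(item)
    return tuple(flat)

# Nested parentheses disappear, and empty sequences contribute nothing
print(sequence(1, (2, (3, 4)), (), "a"))
print(len(sequence((), ((), ()))))
```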
<h3 id="flwor">FLWOR</h3>
<p>A FLWOR expression is constructed as follows:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>flwor ::= ((for | let) expr)+ (where expr)? (order by expr)? return expr</pre></td></tr></tbody></table></code></pre></figure>
<p>XQuery also has support for variables, denoted <code class="highlighter-rouge">$x</code> (which are more like constants). For instance:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre>let $FREvents := /RAS/Events/Event[Canton/text() = "FR"],
$FRTopics := $FREvents/TopicRef/text()
return /RAS/Members/Member[Topics/TopicRef/text() = $FRTopics]/Email</pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 This gives us the email addresses of reporters who may deal with events in the canton of Fribourg. See exercises 01 for more context.</p>
</blockquote>
<p>Let’s take a look at another XQuery expression:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>for $book in /Catalog/Product/Book
where $book/@Language = "EN"
return $book/Title
(: equivalently written as :)
for $book in /Catalog/Product/Book[@Language = "EN"]
return $book/Title</pre></td></tr></tbody></table></code></pre></figure>
<p>This returns the titles of the English-language books in the document:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="nt"><Title></span>XSLT<span class="nt"></Title></span>
<span class="nt"><Title></span>Electronic Publishing<span class="nt"></Title></span>
<span class="nt"><Title></span>Making Sense of NoSQL<span class="nt"></Title></span></pre></td></tr></tbody></table></code></pre></figure>
<p>As we can see above, there is some overlap between XQuery and XPath; the <code class="highlighter-rouge">where</code> condition can also be written as an XPath selection condition. Which to use is a question of style; there is no difference in performance.</p>
<p>The <code class="highlighter-rouge">order by</code> and <code class="highlighter-rouge">where</code> keywords work just like in SQL, so I won’t go into details here.</p>
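<p>For readers more familiar with general-purpose languages, the clauses of a FLWOR expression map fairly directly onto ordinary list processing. The Python sketch below walks through <code class="highlighter-rouge">for</code>, <code class="highlighter-rouge">where</code>, <code class="highlighter-rouge">order by</code> and <code class="highlighter-rouge">return</code> on a small catalog; the sample document and the function name <code class="highlighter-rouge">english_titles</code> are assumptions, and the function returns title strings rather than <code class="highlighter-rouge">Title</code> elements.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature catalog in the shape used throughout this section
CATALOG = """
<Catalog>
  <Product><Book Language="EN"><Title>XSLT</Title></Book></Product>
  <Product><Book Language="FR"><Title>Profecie</Title></Book></Product>
  <Product><Book Language="EN"><Title>Electronic Publishing</Title></Book></Product>
</Catalog>
"""

def english_titles(xml_text):
    catalog = ET.fromstring(xml_text)
    books = catalog.findall("Product/Book")                  # for $book in /Catalog/Product/Book
    books = [b for b in books if b.get("Language") == "EN"]  # where $book/@Language = "EN"
    books.sort(key=lambda b: b.findtext("Title"))            # order by $book/Title
    return [b.findtext("Title") for b in books]              # return the title text

print(english_titles(CATALOG))
```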
<h3 id="conditional-expressions">Conditional expressions</h3>
<p>Like in any templating language, we can create conditional statements. It is mandatory to specify an <code class="highlighter-rouge">else</code> to every <code class="highlighter-rouge">if</code>, but if we do not want to return anything, we can return the empty sequence <code class="highlighter-rouge">()</code>.</p>
<p>The condition of an <code class="highlighter-rouge">if</code> is reduced to its effective boolean value. A boolean evaluates to itself, the empty sequence is falsey, and a sequence whose first item is a node is truthy.</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre>for $book in /Catalog/Product/Book
order by $book/Title
return
<title>
{$book/Title/text()}
{if ($book/@Language = 'EN') then '[English]' else ()}
</title></pre></td></tr></tbody></table></code></pre></figure>
<p>This returns:</p>
<figure class="highlight"><pre><code class="language-xml" data-lang="xml"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="nt"><title></span>Electronic Publishing [English]<span class="nt"></title></span>
<span class="nt"><title></span>Making Sense of NoSQL [English]<span class="nt"></title></span>
<span class="nt"><title></span>Profecie<span class="nt"></title></span>
<span class="nt"><title></span>XML - le langage et ses applications<span class="nt"></title></span>
<span class="nt"><title></span>XSLT [English]<span class="nt"></title></span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="quantified-expressions">Quantified expressions</h3>
<p>A quantified expression allows us to express universal or existential quantifiers using <code class="highlighter-rouge">some</code> and <code class="highlighter-rouge">every</code>. The predicate is given with the keyword <code class="highlighter-rouge">satisfies</code>, as below:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
</pre></td><td class="code"><pre>some $dept in doc("catalog.xml")//product/@dept
satisfies ($dept = "ACC")</pre></td></tr></tbody></table></code></pre></figure>
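<p>The two quantifiers behave like Python’s <code class="highlighter-rouge">any</code> and <code class="highlighter-rouge">all</code>, including on the empty sequence. In the sketch below, the list of department codes is made up for illustration.</p>

```python
# Hypothetical values of //product/@dept
depts = ["ACC", "WMN", "ACC", "MEN"]

# some $dept in ... satisfies ($dept = "ACC")
print(any(dept == "ACC" for dept in depts))
# every $dept in ... satisfies ($dept = "ACC")
print(all(dept == "ACC" for dept in depts))
# Over an empty sequence, "some" is false and "every" is true
print(any(dept == "ACC" for dept in []), all(dept == "ACC" for dept in []))
```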
<h3 id="functions">Functions</h3>
<p>User-defined functions can be declared as follows:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre>declare function local:discountPrice(
$price as xs:decimal?,
$discount as xs:decimal?,
$maxDiscountPct as xs:integer?) as xs:decimal?
{
let $maxDiscount := ($price * $maxDiscountPct) div 100
let $actualDiscount := min(($maxDiscount, $discount))
return ($price - $actualDiscount)
};</pre></td></tr></tbody></table></code></pre></figure>
<p>The types are sequence types, which specify both the number and the type of items. For instance, <code class="highlighter-rouge">xs:string?</code> means a sequence of zero or one string. The return type is optional, but is strongly encouraged for readability, error checking and optimization.</p>
<p>Functions can be overloaded with a different number of parameters.</p>
<p>The body is enclosed in curly braces. It does not have to contain a <code class="highlighter-rouge">return</code> clause; it just needs to be an XQuery expression.</p>
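<p>To see what the <code class="highlighter-rouge">?</code> occurrence indicators imply, here is a Python transcription of <code class="highlighter-rouge">local:discountPrice</code>, with <code class="highlighter-rouge">None</code> standing in for the empty sequence. It is a sketch under the assumption that arithmetic on an empty operand yields the empty sequence and that <code class="highlighter-rouge">min</code> ignores empty operands; the snake_case names are ours.</p>

```python
from decimal import Decimal
from typing import Optional

def discount_price(price: Optional[Decimal],
                   discount: Optional[Decimal],
                   max_discount_pct: Optional[int]) -> Optional[Decimal]:
    # let $maxDiscount := ($price * $maxDiscountPct) div 100
    if price is None or max_discount_pct is None:
        max_discount = None
    else:
        max_discount = price * max_discount_pct / 100
    # let $actualDiscount := min(($maxDiscount, $discount))
    candidates = [d for d in (max_discount, discount) if d is not None]
    actual_discount = min(candidates) if candidates else None
    # return ($price - $actualDiscount)
    if price is None or actual_discount is None:
        return None
    return price - actual_discount

# A discount of 30 on a price of 100 is capped at 20% of the price
print(discount_price(Decimal("100"), Decimal("30"), 20))
# A missing price propagates as a missing result
print(discount_price(None, Decimal("5"), 20))
```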
<h3 id="modules">Modules</h3>
<p>Functions can be grouped into modules, which declare the target namespace and bind it to a prefix (here, the <code class="highlighter-rouge">strings</code> prefix):</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>module namespace strings = "https://example.com/strings";</pre></td></tr></tbody></table></code></pre></figure>
<p>Anything declared under that prefix can be accessed from the outside, when importing the module.</p>
<p>Modules can be imported at a location using the <code class="highlighter-rouge">at</code> clause:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre>import module namespace search = "https://example.com/search" at "search.xqm";</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="updating-xml-content">Updating XML Content</h3>
<p>Unlike SQL, standard XQuery only offers ways of querying data, and not of inserting, deleting or updating data. That’s why the W3C developed an extension to XQuery called the <a href="https://www.w3.org/TR/xquery-update-10/">XQuery Update Facility</a>.</p>
<p>Like SQL, the implementation of this Update Facility is often tied to specific database systems. In this course, we will use the <a href="http://exist-db.org/exist/apps/homepage/index.html">eXist-db</a> variant. Updates are executed by specifying the <code class="highlighter-rouge">update</code> keyword in the <code class="highlighter-rouge">return</code> clause.</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre>let $catalog := doc('db/catalog.xml')
return update insert
<product>...</product>
into $catalog</pre></td></tr></tbody></table></code></pre></figure>
<p>The keyword <code class="highlighter-rouge">into</code> places content after the last child of the element. We can also use <code class="highlighter-rouge">following</code>, placing it as the next sibling, or <code class="highlighter-rouge">preceding</code> to place it as the previous sibling.</p>
<p>Instead of <code class="highlighter-rouge">update insert</code>, we can also do an <code class="highlighter-rouge">update delete</code>, or an <code class="highlighter-rouge">update replace XPATH with ELEMENT</code>.</p>
<p>Updates can be chained as a sequence:</p>
<figure class="highlight"><pre><code class="language-xquery" data-lang="xquery"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre>let $cd := doc('db/catalog.xml')/Product[ProductNo = $no]/CD
return
(
update replace $cd/Price/Value with <Value>18</Value>,
update replace $cd/Year with <Year>2010</Year>
)</pre></td></tr></tbody></table></code></pre></figure>
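<p>The effect of these update expressions on the tree can be pictured with in-memory operations. The Python sketch below uses ElementTree on a hypothetical miniature catalog; it only mirrors what happens to the tree and says nothing about how eXist-db stores or persists the change.</p>

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature catalog
catalog = ET.fromstring(
    "<Catalog><Product><ProductNo>cd-003</ProductNo>"
    "<CD><Price><Value>18.90</Value></Price><Year>2009</Year></CD></Product></Catalog>"
)

# "update insert ... into": the new element goes after the last child
catalog.append(ET.fromstring("<Product><ProductNo>bk-005</ProductNo></Product>"))

# "update replace ... with": the new element takes the place of the old one
price = catalog.find("Product/CD/Price")
old_value = price.find("Value")
position = list(price).index(old_value)
price.remove(old_value)
price.insert(position, ET.fromstring("<Value>18</Value>"))

print(ET.tostring(catalog, encoding="unicode"))
```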
<h3 id="advanced-features">Advanced features</h3>
<p>As we mentioned earlier, XQuery is Turing complete. You can define your own functions, which may be grouped into modules, and may be higher-order functions.</p>
<p>Schema awareness is an optional feature; if it is supported, the <code class="highlighter-rouge">validate</code> expression may be used, which is useful for optimization and error checking. However, as we mentioned earlier, there is only support for W3C standardized schemas, not Relax NG.</p>
<p>While XQuery is mainly associated with XML, it is possible in newer versions to deal with text documents (like CSV, name/value config files, etc. since 3.0) and even JSON (since 3.1).</p>
<h3 id="coding-guidelines">Coding guidelines</h3>
<p>MarkLogic has some <a href="https://developer.marklogic.com/blog/xquery-coding-guidelines">XQuery coding guidelines</a> that are good to follow.</p>
<p>For robustness, it is important to handle missing values (empty sequences) and data variations.</p>
<h2 id="xml-based-webapps">XML Based Webapps</h2>
<p>We’ve now learned to model (with schemas), transform (with XSLT), and query and process (with XQuery). How can we develop an XML based webapp combining these?</p>
<p>We will take a look at the <a href="https://github.com/ssire/oppidum">Oppidum framework</a>, which targets the development of XML-REST-XQuery (XRX) applications, using the eXist-db XML database.</p>
<h3 id="xml-databases">XML Databases</h3>
<p>An XML database looks quite a lot like a normal database; for instance, it uses a traditional, B-tree based indexing system, has a querying language, etc. The main difference is simply that data is XML instead of a table, and that we use XQuery instead of SQL.</p>
<h3 id="rest">REST</h3>
<p>REST stands for REpresentational State Transfer. It’s an architectural style created by Roy Fielding in <a href="https://www.ics.uci.edu/~fielding/">his PhD thesis</a>.</p>
<p>In REST, we have resources that may be processed by a client; in Web-based REST, each resource is located by a URL. A collection is simply a set of resources. Interaction with a REST API happens with the classical CRUD operations (Create, Read, Update, Delete) on URLs, which in HTTP correspond to the <code class="highlighter-rouge">POST</code>, <code class="highlighter-rouge">GET</code>, <code class="highlighter-rouge">PUT</code> and <code class="highlighter-rouge">DELETE</code> requests, respectively.</p>
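<p>To make the CRUD-to-verb mapping concrete, here is a minimal in-memory sketch in Python. The <code class="highlighter-rouge">Collection</code> class and its <code class="highlighter-rouge">handle</code> method are hypothetical names invented for illustration (they are not part of Oppidum or eXist-db), and creating a resource by sending <code class="highlighter-rouge">POST</code> directly to its URL is a simplification.</p>

```python
class Collection:
    """Hypothetical in-memory collection of resources, addressed by URL path."""

    def __init__(self):
        self.resources = {}

    def handle(self, method, url, body=None):
        """Return an (HTTP status code, body) pair for a request."""
        if method == "POST":  # Create
            if url in self.resources:
                return 409, None  # Conflict: the resource already exists
            self.resources[url] = body
            return 201, body
        if url not in self.resources:
            return 404, None  # Not Found
        if method == "GET":  # Read
            return 200, self.resources[url]
        if method == "PUT":  # Update
            self.resources[url] = body
            return 200, body
        if method == "DELETE":  # Delete
            del self.resources[url]
            return 204, None
        return 405, None  # Method Not Allowed


catalog = Collection()
catalog.handle("POST", "/products/1", "<product>CD</product>")
status, body = catalog.handle("GET", "/products/1")
```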
<h3 id="oppidum">Oppidum</h3>
<p><a href="https://github.com/ssire/oppidum">Oppidum</a> is an open source framework to build XML Web-based applications with an MVC approach. The <a href="https://ssire.github.io/oppidum/docs/fr/guide.html">documentation</a> is only in French, but the core idea is as follows: HTTP requests are handed to Oppidum by eXist. The application logic is then detailed in a pipeline consisting of:</p>
<ul>
<li><strong>Model</strong>: XQuery script (<code class="highlighter-rouge">*.xql</code>) returning relevant XML content</li>
<li><strong>View</strong>: XSLT transformation (<code class="highlighter-rouge">*.xsl</code>)</li>
<li><strong>Epilogue</strong>: XQuery script (<code class="highlighter-rouge">epilogue.xql</code>) for templating common content in HTML pages; this works using tags with the <code class="highlighter-rouge">site</code> namespace</li>
</ul>
<p>To specify the REST architecture, Oppidum has a DSL that allows us to define the set of resources and actions, determine the URLs and associated HTTP verbs (<code class="highlighter-rouge">GET</code>, <code class="highlighter-rouge">POST</code>, etc) recognized by the application, and so on.</p>
<h2 id="foundations-of-xml-types">Foundations of XML types</h2>
<p>We’ve seen XML tools for validation (DTD, XML Schema, Relax NG), navigation and extraction (XPath) and transformation (XQuery, XSLT).</p>
<p>Some essential questions about these tools are:</p>
<ul>
<li><strong>Expressive power</strong>: can I express requirement X using XML type language Y?</li>
<li><strong>Operations over XML types</strong>: can I check forward-compatibility when my XML file format evolves? Type inclusion?</li>
<li><strong>Static type-checking</strong>: can we make sure that my XML-manipulating programs will never output an invalid document?</li>
</ul>
<p>To answer this, we must know more about XML types, and dive into the theoretical foundations of XML types.</p>
<h3 id="tree-grammars">Tree Grammars</h3>
<p>XML documents can be modelled by finite, ordered, labeled trees of unbounded depth and arity. To describe a tree, we use a tree language, which can be specified by a tree grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="code"><pre>Person = person[Name, Gender, Children?]
Name = name[String]
Gender = gender[Male | Female]
Male = male[]
Female = female[]
Children = children[Person+]</pre></td></tr></tbody></table></code></pre></figure>
<p>By convention, capitalized variables are <strong>type variables</strong> (non-terminals), and non-capitalized are terminals.</p>
<p>A tree grammar defines a set of legal trees. As any grammar, tree grammars are defined within an alphabet $\Sigma$, with a set of type variable definitions $X := \left\{X_1 ::= T_1, \dots, X_n ::= T_n\right\}$. A tree grammar is defined by the pair $(E, X)$, where $E$ represents the starting type variable. Each $T_i$ is a tree type expression:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="code"><pre>T ::=
l[T] // l ∈ Σ with content model T
| () // empty sequence
| T1, T2 // concatenation
| T1 | T2 // choice
| X // reference</pre></td></tr></tbody></table></code></pre></figure>
<p>The usual regex operators <code class="highlighter-rouge">?</code>, <code class="highlighter-rouge">+</code> and <code class="highlighter-rouge">*</code> are syntactic sugar.</p>
<p>To ensure that our tree grammar remains regular, we must introduce a syntactic restriction: every recursive use of a type variable $X$ (unless it is within the content model) must be in the tail. For instance, the following grammars are not acceptable:</p>
<script type="math/tex; mode=display">\left\{ X = a, X, b \right\} \\
\left\{ X = a, Y, b; \quad Y = X \right\} \\</script>
<p>But the following are fine:</p>
<script type="math/tex; mode=display">\left\{ X = a, c[X], b \right\} \\
\left\{ X = a, Y; \quad Y = b, X | \epsilon \right\} \\</script>
<p>A small reminder on regular vs. context-free grammars: membership in a regular language can be checked with a DFA, whereas a context-free language such as $a^n b^n$ cannot be recognized by any DFA. Moreover, language inclusion is decidable for regular grammars, but undecidable in general for context-free grammars.</p>
<p>There are four classes of grammars of interest, ordered from most to least expressive (each class is a subset of the ones above it):</p>
<ol>
<li>Context-free</li>
<li>Regular</li>
<li>Single Type</li>
<li>Local</li>
</ol>
<p>Each subclass is defined by additional restrictions compared to its parent. The more restrictions we add, the more expressive power we lose. It turns out that these classes correspond to different XML technologies:</p>
<ol>
<li><strong>Context-free</strong>: ?</li>
<li><strong>Regular</strong>: Relax NG</li>
<li><strong>Single Type</strong>: XML Schema</li>
<li><strong>Local</strong>: DTD</li>
</ol>
<h4 id="dtd--local-tree-grammars">DTD & Local tree grammars</h4>
<p>As we said previously, the expressive power of a grammar class is defined by which restrictions have been imposed. In DTD, the restriction is that each element name is associated with a single regex. This means that for each $a[T_1]$ and $a[T_2]$ occurring in $X$, the content models are identical: $T_1 = T_2$.</p>
<p>In other words, in DTDs, the content of an XML tag cannot depend on the context of the tag. This removes some expressive power.</p>
<p>To construct a DTD validator, we just use a word automaton associated with each terminal. This automaton is a DFA, as DTD requires regular expressions to be deterministic. That is, when reading a symbol, it must be possible to determine which position of the regex it matches without looking ahead to the next symbol. <code class="highlighter-rouge">a(bc | bb)</code> is not deterministic, but <code class="highlighter-rouge">ab(c | b)</code> is.</p>
<p>As a corollary, the union of two DTDs may not be a DTD. Indeed, the two DTDs could define different content models for the same terminal, which would be illegal. We say that the class is not closed under composition (here, we showed that it isn’t closed under union).</p>
<h4 id="xml-schema--single-type-tree-grammars">XML Schema & Single-Type tree grammars</h4>
<p>In XML Schema, it is possible to have different content models for elements of the same name when they are in different contexts (unlike for DTD). But still, for each $a[T_1]$ and $a[T_2]$ occurring <em>under the same parent</em>, the content models must be identical ($T_1 = T_2$).</p>
<p>Still, this brings us more expressive power, so we have $\mathcal{L}_{\text{DTD}} \subset \mathcal{L}_{\text{xmlschema}}$. This inclusion is strict, as we can construct grammars that are single-type (and not local) in XML Schema:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="code"><pre>Dealer = dealer[Used, New]
Used = used[UsedCar]
New = new[NewCar]
UsedCar = car[Model, Year] // here, car can have different content models
NewCar = car[Model] // this is allowed as they have different parents
...</pre></td></tr></tbody></table></code></pre></figure>
<p>But XML schemas also have weaknesses: we cannot encode more advanced restrictions in them. For instance, with our car dealership example, we cannot encode something like “at least one car has a discount”, as such a grammar is not <em>single-type</em>; we would require two different content models for a car within the same parent.</p>
<p>Consequently, this class is still not closed under union.</p>
<h4 id="relax-ng--regular-tree-grammars">Relax NG & Regular tree grammars</h4>
<p>Relax NG does not have any of the previously discussed restrictions. The content model does not have to depend only on the label of the parent; it can also depend on the ancestor’s siblings, for instance. This allows us to have much more expressive power. Relax NG places itself in the class of regular tree grammars, and $\mathcal{L}_{\text{xmlschema}} \subset \mathcal{L}_{\text{r}}$.</p>
<p>For instance, we can now encode what we couldn’t with XML Schema:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="code"><pre>Dealer = dealer[Used, New]
Used = used[UsedCar]
New = new[NewCar, DNewCar]
UsedCar = car[Model, Year]
NewCar = car[Model] // the same terminal used within 'new'
DNewCar = car[Model, Discount] // but with different content models
...</pre></td></tr></tbody></table></code></pre></figure>
<p>Regular tree grammars are more robust (closed under set operations like union and intersection) and give us high expressive power, while still remaining simply defined and well-characterized (membership of a tree in the language can still be checked in linear time with a tree automaton).</p>
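<p>The three classes can be told apart mechanically. The Python sketch below makes simplifying assumptions: a grammar is a dictionary mapping each type variable to its element label and the list of type variables used in its content model (regex operators are ignored), text-only types get an empty content model, and any two distinct type variables with the same label are assumed to have different content models. The function names are made up for illustration.</p>

```python
from itertools import combinations

# The single-type car dealer grammar: type variable -> (label, content variables).
SINGLE_TYPE = {
    "Dealer": ("dealer", ["Used", "New"]),
    "Used": ("used", ["UsedCar"]),
    "New": ("new", ["NewCar"]),
    "UsedCar": ("car", ["Model", "Year"]),
    "NewCar": ("car", ["Model"]),
    "Model": ("model", []),
    "Year": ("year", []),
}

# The regular variant, where two kinds of car share the same parent.
REGULAR = dict(SINGLE_TYPE)
REGULAR["New"] = ("new", ["NewCar", "DNewCar"])
REGULAR["DNewCar"] = ("car", ["Model", "Discount"])
REGULAR["Discount"] = ("discount", [])


def competing(grammar, a, b):
    """Two distinct type variables compete if they produce the same label."""
    return a != b and grammar[a][0] == grammar[b][0]


def is_local(grammar):
    """Local (DTD): no two type variables compete anywhere in the grammar."""
    return not any(competing(grammar, a, b) for a, b in combinations(grammar, 2))


def is_single_type(grammar):
    """Single-type (XML Schema): no two competing variables in one content model."""
    return not any(
        competing(grammar, a, b)
        for _label, content in grammar.values()
        for a, b in combinations(content, 2)
    )


def classify(grammar):
    if is_local(grammar):
        return "local (DTD)"
    if is_single_type(grammar):
        return "single-type (XML Schema)"
    return "regular (Relax NG)"
```

<p>With these definitions, <code class="highlighter-rouge">classify(SINGLE_TYPE)</code> reports a single-type grammar, and <code class="highlighter-rouge">classify(REGULAR)</code> reports a regular one.</p>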
<h3 id="tree-automata">Tree automata</h3>
<h4 id="definition">Definition</h4>
<p>A tree automaton (plural automata) is a state machine dealing with tree structure instead of strings (like a word automaton would). Introducing these will allow us to provide a general framework for XML type languages by giving us a tool with which we can reason about regular tree languages.</p>
<p>A ranked tree can be thought of as the AST representation of a function call. For instance, <code class="highlighter-rouge">f(a, b)</code> can be represented as a tree with parent node <code class="highlighter-rouge">f</code> and two children <code class="highlighter-rouge">a</code> and <code class="highlighter-rouge">b</code> (in that order). We can also represent more complex trees with these notations (<code class="highlighter-rouge">f(g(a, b, c), h(i))</code> gives us the full structure of a tree, for instance).</p>
<p>We define a ranked alphabet symbol as a formalization of a function call. It is a symbol $a$, associated with an integer representing the number of children, $\text{arity}(a)$. We write $a^{(k)}$ for the symbol $a$ with $\text{arity}(a) = k$.</p>
<p>This allows us to fix an arity to different tree symbols. Our alphabet could then be, for instance, $\left\{ a^{(2)}, b^{(2)}, c^{(3)}, \sharp^{(0)} \right\}$. In this alphabet, <code class="highlighter-rouge">#</code> would always be the leaves.</p>
<p>A ranked tree automaton A consists of:</p>
<ul>
<li>$F$, a finite ranked alphabet of symbols</li>
<li>$Q$, a finite set of states</li>
<li>$\Delta$, a finite set of transition rules</li>
<li>$Q_f \subseteq Q$, a finite set of final states</li>
</ul>
<p>In a word automaton, we write transitions as $\text{even} \overset{1}{\rightarrow} \text{odd}$. In a (bottom-up) tree automaton, the transitions are from the children’s states to the parent’s state. If a tree node has arity 2, a transition could be $(q_0, q_1) \overset{a}{\rightarrow} q_0$. If the arity is $k=0$, we write $\epsilon \overset{a^{(0)}}{\rightarrow} q$.</p>
<h4 id="example-1">Example</h4>
<p>As an example, we can think of a tree of boolean expressions. Let’s consider the following:</p>
<script type="math/tex; mode=display">((0 \land 1) \lor (1 \lor 0)) \land ((0 \lor 1) \land (1 \land 1))</script>
<p>We can construct this as a binary tree by treating the logical operators as infix notation of a function call:</p>
<script type="math/tex; mode=display">\land(\lor(\land(0, 1), \lor(1, 0)), \land(\lor(0, 1), \land(1, 1)))</script>
<p>In this case, our alphabet is $F = \left\{\land, \lor, 0, 1\right\}$. Our states are $Q = \left\{ q_0, q_1\right\}$ (either true or false). The accepting state is $Q_f = \left\{ q_1 \right\}$. Our transition rules are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\epsilon \overset{0}{\rightarrow} q_0 & \quad & \epsilon \overset{1}{\rightarrow} q_1 \\
(q_1, q_1) \overset{\land}{\rightarrow} q_1 & \quad & (q_1, q_1) \overset{\lor}{\rightarrow} q_1 \\
(q_0, q_1) \overset{\land}{\rightarrow} q_0 & \quad & (q_0, q_1) \overset{\lor}{\rightarrow} q_1 \\
(q_1, q_0) \overset{\land}{\rightarrow} q_0 & \quad & (q_1, q_0) \overset{\lor}{\rightarrow} q_1 \\
(q_0, q_0) \overset{\land}{\rightarrow} q_0 & \quad & (q_0, q_0) \overset{\lor}{\rightarrow} q_0 \\
\\
\end{align} %]]></script>
<p>With these rules in place, we can evaluate boolean expressions with a tree automaton.</p>
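<p>As a minimal Python sketch of this automaton, we can encode trees as nested tuples and run the rules bottom-up. The encoding, the symbol names <code class="highlighter-rouge">and</code>/<code class="highlighter-rouge">or</code>, and the function names are choices made for this illustration; the leaf <code class="highlighter-rouge">0</code> is assumed to map to $q_0$ (false) and the leaf <code class="highlighter-rouge">1</code> to $q_1$ (true).</p>

```python
# Transition rules: (symbol, tuple of child states) -> state.
DELTA = {("0", ()): "q0", ("1", ()): "q1"}
for left in ("q0", "q1"):
    for right in ("q0", "q1"):
        both_true = (left, right) == ("q1", "q1")
        both_false = (left, right) == ("q0", "q0")
        DELTA[("and", (left, right))] = "q1" if both_true else "q0"
        DELTA[("or", (left, right))] = "q0" if both_false else "q1"

FINAL = {"q1"}


def run(tree):
    """Return the state reached at the root of a tree (symbol, child, ...)."""
    symbol, *children = tree
    child_states = tuple(run(child) for child in children)
    return DELTA[(symbol, child_states)]


def accepts(tree):
    return run(tree) in FINAL


ZERO, ONE = ("0",), ("1",)
# ((0 and 1) or (1 or 0)) and ((0 or 1) and (1 and 1))
EXPR = (
    "and",
    ("or", ("and", ZERO, ONE), ("or", ONE, ZERO)),
    ("and", ("or", ZERO, ONE), ("and", ONE, ONE)),
)
```

<p>Running the automaton on <code class="highlighter-rouge">EXPR</code> reaches $q_1$ at the root, so the tree is accepted: the expression evaluates to true.</p>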
<h4 id="properties">Properties</h4>
<p>The language of A is the set of trees accepted by A. For a tree automaton, the language is a <strong>regular tree language</strong>.</p>
<p>A tree automaton is <strong>deterministic</strong> as long as there aren’t two rules with the same left-hand side pointing us to different states, i.e. no pair of rules of the form:</p>
<script type="math/tex; mode=display">(q_1, \dots q_k) \overset{a^{(k)}}{\rightarrow} q, \quad
(q_1, \dots q_k) \overset{a^{(k)}}{\rightarrow} q'
\qquad q \ne q'</script>
<p>With word automata, we know that we can build a DFA from any NFA. The same applies to tree automata: from a given non-deterministic (bottom-up) tree automaton, we can build a deterministic tree automaton.</p>
<p>As a corollary, this tells us that non-deterministic tree automata do not give us more expressive power; deterministic and non-deterministic automata recognize the same languages. However, non-deterministic automata tend to allow us to represent languages more compactly (conversion can turn a non-deterministic tree automaton of size $N$ into a deterministic tree automaton of size $\mathcal{O}(2^N)$), so we’ll use those freely.</p>
<h3 id="validation">Validation</h3>
<h4 id="inclusion">Inclusion</h4>
<p>Given a tree automaton A and a tree t, how do we check $t\in\text{Language}(A)$?</p>
<p>To do this, we mechanically apply the transition rules bottom-up. If the automaton is non-deterministic, we keep track of the set of possible states at each node, and check whether the set computed at the root of the tree contains a final state.</p>
<p>This mechanism of membership checking is linear in the size of the tree.</p>
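<p>The set-tracking idea can be sketched in Python as follows. The example automaton is an assumption made up for this illustration: over the symbols <code class="highlighter-rouge">f</code> (arity 2), <code class="highlighter-rouge">a</code> and <code class="highlighter-rouge">b</code> (arity 0), it non-deterministically guesses one <code class="highlighter-rouge">b</code> leaf and so accepts the trees containing at least one <code class="highlighter-rouge">b</code>.</p>

```python
from itertools import product

# Non-deterministic rules: (symbol, child states) -> set of possible states.
DELTA = {
    ("a", ()): {"q"},
    ("b", ()): {"q", "qb"},  # either ignore this b, or mark it as "the" b
    ("f", ("q", "q")): {"q"},
    ("f", ("qb", "q")): {"qb"},
    ("f", ("q", "qb")): {"qb"},
}
FINAL = {"qb"}


def possible_states(tree):
    """Set of states the automaton may reach at the root of the tree."""
    symbol, *children = tree
    child_sets = [possible_states(child) for child in children]
    states = set()
    # Try every combination of possible child states (one empty combination for leaves).
    for combo in product(*child_sets):
        states |= DELTA.get((symbol, combo), set())
    return states


def accepts(tree):
    return bool(possible_states(tree) & FINAL)
```

<p>Each node is visited once, and for a fixed automaton the work per node is bounded by a constant, which is where the linear bound comes from.</p>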
<h4 id="closure">Closure</h4>
<p>Regular tree languages are closed under set-theoretic operations: given the tuples defining two tree automata, we can compute automata for the union, intersection (using a product construction) and complement of their languages.</p>
<h4 id="emptiness">Emptiness</h4>
<p>We can also do emptiness checking with tree automata (that is, checking if $\text{Language}(A) = \emptyset$). To do so, we compute the set of reachable states, and see if any of them are in $Q_f$. This process is linear in the size of the automaton.</p>
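<p>The reachability computation can be sketched in Python as a fixpoint iteration. Rules are assumed to be stored as a dictionary from (symbol, child states) to a set of target states, and the function names are made up for this illustration. This naive loop rescans all rules on every pass, so it is quadratic rather than linear; a linear version would keep a worklist of newly reached states.</p>

```python
def reachable_states(delta):
    """Least fixpoint: a rule fires once all of its child states are reachable."""
    reachable = set()
    changed = True
    while changed:
        changed = False
        for (_symbol, children), targets in delta.items():
            if all(c in reachable for c in children) and not targets <= reachable:
                reachable |= targets
                changed = True
    return reachable


def is_empty(delta, final):
    """The language is empty iff no final state is reachable."""
    return not (reachable_states(delta) & final)


# "qf" is reachable through f(a, a), so the language is not empty.
NON_EMPTY = {("a", ()): {"q"}, ("f", ("q", "q")): {"qf"}}
# "p" can only be reached from "p" itself, so no tree ever reaches it.
EMPTY = {("a", ()): {"q"}, ("f", ("p", "p")): {"p"}}
```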
<h4 id="type-inclusion">Type inclusion</h4>
<p>Given two automata $A_1$ and $A_2$, how can we check $\text{Language}(A_1) \subseteq \text{Language}(A_2)$?</p>
<p>Containment for non-deterministic automata can be decided in exponential time. We do this by checking whether $\text{Language}(A_1 \cap \bar{A_2}) = \emptyset$. To compute the complement $\bar{A_2}$, we must first make $A_2$ deterministic (which is an exponential process).</p>
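<p>One way to sketch this check in Python is to determinize both automata on the fly with a subset construction, and to explore the reachable pairs of state sets; this is a variation on building $A_1 \cap \bar{A_2}$ explicitly, not the only way to do it. The data layout (an automaton as a pair of a rule dictionary and a set of final states, plus a dictionary of symbol arities) and all names are assumptions for this illustration.</p>

```python
from itertools import product


def step(delta, symbol, child_sets):
    """Subset-construction step: states reachable from any choice of child states."""
    result = set()
    for combo in product(*child_sets):
        result |= delta.get((symbol, combo), set())
    return frozenset(result)


def is_included(arities, a1, a2):
    """Check Language(a1) ⊆ Language(a2); each automaton is (delta, final)."""
    (delta1, final1), (delta2, final2) = a1, a2
    reached = set()  # pairs (state set in a1, state set in a2) reached by some tree
    changed = True
    while changed:
        changed = False
        for symbol, arity in arities.items():
            for kids in product(list(reached), repeat=arity):
                pair = (
                    step(delta1, symbol, [kid[0] for kid in kids]),
                    step(delta2, symbol, [kid[1] for kid in kids]),
                )
                if pair not in reached:
                    reached.add(pair)
                    changed = True
    # A counterexample is a tree accepted by a1 but rejected by a2.
    return not any(s1 & final1 and not (s2 & final2) for s1, s2 in reached)


ARITIES = {"f": 2, "a": 0, "b": 0}
# Accepts every tree over the alphabet.
ANY_TREE = ({("a", ()): {"q"}, ("b", ()): {"q"}, ("f", ("q", "q")): {"q"}}, {"q"})
# Accepts the trees with at least one b leaf (non-deterministic).
HAS_B = (
    {
        ("a", ()): {"q"},
        ("b", ()): {"q", "qb"},
        ("f", ("q", "q")): {"q"},
        ("f", ("qb", "q")): {"qb"},
        ("f", ("q", "qb")): {"qb"},
    },
    {"qb"},
)
```

<p>Here every tree with a <code class="highlighter-rouge">b</code> leaf is a tree, so <code class="highlighter-rouge">is_included(ARITIES, HAS_B, ANY_TREE)</code> holds, while the converse fails (the single leaf <code class="highlighter-rouge">a</code> is a counterexample). The number of reachable pairs can be exponential in the number of states, which matches the exponential bound.</p>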
<h2 id="dealing-with-non-textual-content">Dealing with non-textual content</h2>
<p>So far, we’ve just been dealing with text. In the following, we’ll see how we can deal with images, graphics, sound, video, animations, etc. For these types of data, semi-structured tree data is commonly used for its flexibility, while retaining rigorous structures and data typing.</p>
<p>For instance, there are many application-specific markup languages (MathML, CML for chemistry, GraphML, SVG tables, etc).</p>
<h3 id="mathml">MathML</h3>
<p>MathML actually has two possible structures: a presentation structure, telling us how to display math, and a mathematical structure, telling us how to apply or compute the result of a mathematical expression. It’s possible to go from mathematical to presentation structure, but not the other way (the other way is too ambiguous, it’s not a bijection).</p>
<h3 id="tables">Tables</h3>
<p>This distinction between content and presentation also exists within tables. For instance, creating the presentation and layout of a calendar, or of a complex table, is quite difficult because of the discrepancy between the presentation and structural forms.</p>
<p>The main issues with tables are:</p>
<ul>
<li>How can we model it in such a way that variations in presentation depend only on the values of the formatting attributes?</li>
<li>How can we edit a table? (How do we modify the structure and update the backing content?)</li>
</ul>
<p>From a logical point of view, we can view a table as a d-dimensional space. A simple row-column table is 2D, but we can “add dimensions” by adding subdivision headers. Each cell in the table is described by a d-dimensional tuple of coordinates. How can we use a tree model to represent this?</p>
<p>We can use a tree of height d, but more efficiently (or at least, flatly), we could encode each dimension as a direct child of the root, and link each data point to the relevant axes.</p>
<p>This is what HTML 4 proposes:</p>
<figure class="highlight"><pre><code class="language-html" data-lang="html"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="nt"><tr></span>
<span class="nt"><th></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a2"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Meals<span class="nt"></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a3"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Hotels<span class="nt"></th></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a4"</span> <span class="na">axis=</span><span class="s">"expenses"</span><span class="nt">></span>Transport<span class="nt"></th></span>
<span class="nt"><td></span>subtotals<span class="nt"></td></span>
<span class="nt"></tr></span>
<span class="nt"><tr></span>
<span class="nt"><th</span> <span class="na">id=</span><span class="s">"a6"</span> <span class="na">axis=</span><span class="s">"location"</span><span class="nt">></span>San Jose<span class="nt"></th></span>
<span class="nt"><th></th></span>
<span class="nt"><th></th></span>
<span class="nt"><th></th></span>
<span class="nt"><td></td></span>
<span class="nt"></tr></span>
<span class="nt"><tr></span>
<span class="nt"><td</span> <span class="na">id=</span><span class="s">"a7"</span> <span class="na">axis=</span><span class="s">"date"</span><span class="nt">></span>25-Aug-97<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a2"</span><span class="nt">></span>37.74<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a3"</span><span class="nt">></span>112.00<span class="nt"></td></span>
<span class="nt"><td</span> <span class="na">headers=</span><span class="s">"a6 a7 a4"</span><span class="nt">></span>45.00<span class="nt"></td></span>
<span class="nt"><td></td></span>
<span class="nt"></tr></span></pre></td></tr></tbody></table></code></pre></figure>
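<p>To see how the <code class="highlighter-rouge">headers</code> and <code class="highlighter-rouge">axis</code> attributes give each data cell its coordinates, here is a minimal Python sketch using the standard <code class="highlighter-rouge">html.parser</code> module. The table is a trimmed-down version of the one above, and the <code class="highlighter-rouge">AxisCollector</code> class and <code class="highlighter-rouge">coordinates</code> function are names made up for this illustration; keying data points by their text is a shortcut that only works because the values are distinct.</p>

```python
from html.parser import HTMLParser

TABLE = """
<table>
<tr><th></th><th id="a2" axis="expenses">Meals</th><th id="a3" axis="expenses">Hotels</th></tr>
<tr><th id="a6" axis="location">San Jose</th><th></th><th></th></tr>
<tr><td id="a7" axis="date">25-Aug-97</td>
    <td headers="a6 a7 a2">37.74</td><td headers="a6 a7 a3">112.00</td></tr>
</table>
"""


class AxisCollector(HTMLParser):
    """Collect every th/td cell as a dict of its attributes plus its text."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in ("th", "td"):
            self._current = dict(attrs, text="")
            self.cells.append(self._current)

    def handle_data(self, data):
        if self._current is not None:
            self._current["text"] += data.strip()

    def handle_endtag(self, tag):
        if tag in ("th", "td"):
            self._current = None


def coordinates(html):
    """Map each data cell's text to its {axis name: header text} coordinates."""
    parser = AxisCollector()
    parser.feed(html)
    by_id = {cell["id"]: cell for cell in parser.cells if "id" in cell}
    return {
        cell["text"]: {by_id[ref]["axis"]: by_id[ref]["text"]
                       for ref in cell["headers"].split()}
        for cell in parser.cells
        if "headers" in cell
    }


points = coordinates(TABLE)
```

<p>Each data point ends up with one coordinate per axis (location, date and expenses), which is the 3-dimensional view of the table described above.</p>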
<h2 id="xml-processing">XML Processing</h2>
<p>When working with XML, there’s no need to write a parser. General-purpose XML parsers are widely available (e.g. Apache Xerces). Incidentally, an XML parser can be validating or non-validating.</p>
<p>XML parsers can communicate the XML tree structure to applications using it; there are two approaches for this:</p>
<ul>
<li>DOM: the parser stores the XML input in a fixed data structure, and exposes an API</li>
<li>SAX: the parser triggers events. The input isn’t stored; the application must specify how to store and process the events triggered by the parser.</li>
</ul>
<h3 id="dom">DOM</h3>
<p>DOM (Document Object Model) is a W3C standard. An application generates DOM library calls to manipulate the parsed XML input. There are multiple DOM levels, that have been introduced successively to expand the capabilities of DOM.</p>
<ul>
<li>DOM Level 1 provides a basic API to access and manipulate tree structures (<code class="highlighter-rouge">getParentNode()</code>, <code class="highlighter-rouge">getFirstChild()</code>, <code class="highlighter-rouge">insertBefore()</code>, <code class="highlighter-rouge">replaceChild()</code>, …)</li>
<li>DOM Level 2 introduces specialized interfaces dedicated to XML and namespace-related methods, dynamic access and update of the content of style sheets, an event system, …</li>
<li>DOM Level 3 introduces the ability to dynamically load the content of an XML document into a DOM document, serialize DOM into XML, dynamically update the content while ensuring validation, access the DOM using XPath, …</li>
</ul>
<p>DOM allows us to abstract away from the syntactical details of the XML structure, and allows us to ensure well-formedness (no missing tags, non-matching tags, etc). Thanks to that, document manipulation is considerably simplified.</p>
<p>However, the DOM approach is not without its flaws. The main disadvantage is that we must maintain a data structure representing the whole XML input, which can be problematic for big documents. To remedy this situation, we can preprocess to filter the document, reducing its overall size, but that only takes us so far. Alternatively, we can use a different approach for XML processing: SAX.</p>
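<p>As a small illustration of the DOM style, here is a Python sketch using the standard <code class="highlighter-rouge">xml.dom.minidom</code> module, which exposes the tree through attributes such as <code class="highlighter-rouge">firstChild</code> rather than Java-style getters. The catalog document and the <code class="highlighter-rouge">add_product_first</code> helper are made up for this example.</p>

```python
from xml.dom import minidom


def add_product_first(xml_text, name):
    """Parse the whole document into memory, then insert a product up front."""
    doc = minidom.parseString(xml_text)
    catalog = doc.documentElement
    product = doc.createElement("product")
    product.appendChild(doc.createTextNode(name))
    # Random access: we can navigate to any node and edit the tree in place.
    catalog.insertBefore(product, catalog.firstChild)
    return doc


doc = add_product_first("<catalog><product>CD</product></catalog>", "Book")
names = [p.firstChild.data for p in doc.getElementsByTagName("product")]
serialized = doc.documentElement.toxml()
```

<p>Because the result is serialized from the tree rather than assembled by hand, the output is well-formed by construction.</p>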
<h3 id="sax">SAX</h3>
<p>SAX, the <a href="http://www.saxproject.org/">Simple API for XML</a>, is not a W3C standard; it’s more of a de facto standard that started out as a Java-only API.</p>
<p>It’s very efficient, using only constant space, regardless of the XML input size. However, it means that we must also write more code. Indeed, we must specify callbacks for certain events, write our own code to store what we need, etc.</p>
<p>The SAX processor reads the input sequentially (whereas DOM afforded us random access), and once only. It sends events like <code class="highlighter-rouge">startDocument</code>, <code class="highlighter-rouge">startElement</code>, <code class="highlighter-rouge">characters</code>, etc. White spaces and tabs are reported too, so this also potentially means more code to write.</p>
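<p>The event-driven style can be sketched in Python with the standard <code class="highlighter-rouge">xml.sax</code> module. The <code class="highlighter-rouge">ProductCounter</code> handler and the sample document are made up for this example; note that the handler has to decide for itself what to keep, including what to do with whitespace-only text.</p>

```python
import xml.sax


class ProductCounter(xml.sax.ContentHandler):
    """Count product elements without ever building a tree in memory."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.count = 0
        self.whitespace_events = 0

    def startDocument(self):
        self.events.append("startDocument")

    def startElement(self, name, attrs):
        self.events.append("startElement:" + name)
        if name == "product":
            self.count += 1

    def characters(self, content):
        # Indentation and newlines between tags are reported as text too.
        if content.strip():
            self.events.append("characters:" + content.strip())
        else:
            self.whitespace_events += 1


handler = ProductCounter()
xml.sax.parseString(
    b"<catalog>\n  <product>CD</product>\n  <product>Book</product>\n</catalog>",
    handler,
)
```

<p>After parsing, <code class="highlighter-rouge">handler.count</code> holds the number of products; nothing else about the document has been stored.</p>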
<h3 id="dom-and-web-applications">DOM and web applications</h3>
<p>DOM is language and platform independent, with DOM APIs for all major programming languages. The most common, though, is the DOM API used with JavaScript.</p>
<h3 id="xforms-an-alternative-to-html-forms">XForms: an alternative to HTML forms</h3>
<p>XForms give us a declarative approach to capture information from the user, and place it into XML documents, with constraint checking. XForms are a W3C standard, but are not implemented in the browsers.</p>
<div class="footnotes">
<ol>
<li id="fn:infoset-spec">
<p><a href="https://www.w3.org/TR/xml-infoset/">XML Information Set specification</a>, W3C Recommendation <a href="#fnref:infoset-spec" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
⚠ Work in progressCS-443 Machine Learning2018-09-18T00:00:00+00:002018-09-18T00:00:00+00:00https://kjaer.io/ml
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#linear-regression" id="markdown-toc-linear-regression">Linear regression</a> <ul>
<li><a href="#simple-linear-regression" id="markdown-toc-simple-linear-regression">Simple linear regression</a></li>
<li><a href="#multiple-linear-regression" id="markdown-toc-multiple-linear-regression">Multiple linear regression</a></li>
<li><a href="#the-d--n-problem" id="markdown-toc-the-d--n-problem">The $D > N$ problem</a></li>
</ul>
</li>
<li><a href="#cost-functions" id="markdown-toc-cost-functions">Cost functions</a> <ul>
<li><a href="#properties" id="markdown-toc-properties">Properties</a></li>
<li><a href="#good-cost-functions" id="markdown-toc-good-cost-functions">Good cost functions</a> <ul>
<li><a href="#mse" id="markdown-toc-mse">MSE</a></li>
<li><a href="#mae" id="markdown-toc-mae">MAE</a></li>
</ul>
</li>
<li><a href="#convexity" id="markdown-toc-convexity">Convexity</a></li>
</ul>
</li>
<li><a href="#optimization" id="markdown-toc-optimization">Optimization</a> <ul>
<li><a href="#learning--estimation--fitting" id="markdown-toc-learning--estimation--fitting">Learning / Estimation / Fitting</a></li>
<li><a href="#grid-search" id="markdown-toc-grid-search">Grid search</a></li>
<li><a href="#optimization-landscapes" id="markdown-toc-optimization-landscapes">Optimization landscapes</a> <ul>
<li><a href="#local-minimum" id="markdown-toc-local-minimum">Local minimum</a></li>
<li><a href="#global-minimum" id="markdown-toc-global-minimum">Global minimum</a></li>
<li><a href="#strict-minimum" id="markdown-toc-strict-minimum">Strict minimum</a></li>
</ul>
</li>
<li><a href="#smooth-differentiable-optimization" id="markdown-toc-smooth-differentiable-optimization">Smooth (differentiable) optimization</a> <ul>
<li><a href="#gradient" id="markdown-toc-gradient">Gradient</a></li>
<li><a href="#gradient-descent" id="markdown-toc-gradient-descent">Gradient descent</a></li>
<li><a href="#gradient-descent-for-linear-mse" id="markdown-toc-gradient-descent-for-linear-mse">Gradient descent for linear MSE</a></li>
<li><a href="#stochastic-gradient-descent-sgd" id="markdown-toc-stochastic-gradient-descent-sgd">Stochastic gradient descent (SGD)</a></li>
<li><a href="#mini-batch-sgd" id="markdown-toc-mini-batch-sgd">Mini-batch SGD</a></li>
</ul>
</li>
<li><a href="#non-smooth-non-differentiable-optimization" id="markdown-toc-non-smooth-non-differentiable-optimization">Non-smooth (non-differentiable) optimization</a> <ul>
<li><a href="#subgradients" id="markdown-toc-subgradients">Subgradients</a></li>
<li><a href="#subgradient-descent" id="markdown-toc-subgradient-descent">Subgradient descent</a></li>
<li><a href="#stochastic-subgradient-descent" id="markdown-toc-stochastic-subgradient-descent">Stochastic subgradient descent</a></li>
</ul>
</li>
<li><a href="#comparison" id="markdown-toc-comparison">Comparison</a></li>
<li><a href="#constrained-optimization" id="markdown-toc-constrained-optimization">Constrained optimization</a> <ul>
<li><a href="#convex-sets" id="markdown-toc-convex-sets">Convex sets</a></li>
<li><a href="#projected-gradient-descent" id="markdown-toc-projected-gradient-descent">Projected gradient descent</a></li>
<li><a href="#turning-constrained-problems-into-unconstrained-problems" id="markdown-toc-turning-constrained-problems-into-unconstrained-problems">Turning constrained problems into unconstrained problems</a></li>
</ul>
</li>
<li><a href="#implementation-issues-in-gradient-methods" id="markdown-toc-implementation-issues-in-gradient-methods">Implementation issues in gradient methods</a> <ul>
<li><a href="#stopping-criteria" id="markdown-toc-stopping-criteria">Stopping criteria</a></li>
<li><a href="#optimality" id="markdown-toc-optimality">Optimality</a></li>
<li><a href="#step-size" id="markdown-toc-step-size">Step size</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#least-squares" id="markdown-toc-least-squares">Least squares</a> <ul>
<li><a href="#normal-equations" id="markdown-toc-normal-equations">Normal equations</a></li>
<li><a href="#single-parameter-linear-regression" id="markdown-toc-single-parameter-linear-regression">Single parameter linear regression</a></li>
<li><a href="#multiple-parameter-linear-regression" id="markdown-toc-multiple-parameter-linear-regression">Multiple parameter linear regression</a> <ul>
<li><a href="#simplest-way" id="markdown-toc-simplest-way">Simplest way</a></li>
<li><a href="#directly-verify-the-definition" id="markdown-toc-directly-verify-the-definition">Directly verify the definition</a></li>
<li><a href="#compute-the-hessian" id="markdown-toc-compute-the-hessian">Compute the Hessian</a></li>
</ul>
</li>
<li><a href="#geometric-interpretation" id="markdown-toc-geometric-interpretation">Geometric interpretation</a></li>
<li><a href="#least-squares-1" id="markdown-toc-least-squares-1">Least squares</a></li>
<li><a href="#invertibility-and-uniqueness" id="markdown-toc-invertibility-and-uniqueness">Invertibility and uniqueness</a></li>
</ul>
</li>
<li><a href="#maximum-likelihood" id="markdown-toc-maximum-likelihood">Maximum likelihood</a> <ul>
<li><a href="#gaussian-distribution" id="markdown-toc-gaussian-distribution">Gaussian distribution</a></li>
<li><a href="#a-probabilistic-model-for-least-squares" id="markdown-toc-a-probabilistic-model-for-least-squares">A probabilistic model for least squares</a></li>
<li><a href="#defining-cost-with-log-likelihood" id="markdown-toc-defining-cost-with-log-likelihood">Defining cost with log-likelihood</a></li>
<li><a href="#maximum-likelihood-estimator-mle" id="markdown-toc-maximum-likelihood-estimator-mle">Maximum likelihood estimator (MLE)</a> <ul>
<li><a href="#properties-of-mle" id="markdown-toc-properties-of-mle">Properties of MLE</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#overfitting-and-underfitting" id="markdown-toc-overfitting-and-underfitting">Overfitting and underfitting</a> <ul>
<li><a href="#underfitting-with-linear-models" id="markdown-toc-underfitting-with-linear-models">Underfitting with linear models</a></li>
<li><a href="#extended-feature-vectors" id="markdown-toc-extended-feature-vectors">Extended feature vectors</a></li>
<li><a href="#reducing-overfitting" id="markdown-toc-reducing-overfitting">Reducing overfitting</a></li>
</ul>
</li>
<li><a href="#regularization" id="markdown-toc-regularization">Regularization</a> <ul>
<li><a href="#l_2-regularization-ridge-regression" id="markdown-toc-l_2-regularization-ridge-regression">$L_2$-Regularization: Ridge Regression</a> <ul>
<li><a href="#ridge-regression" id="markdown-toc-ridge-regression">Ridge regression</a></li>
<li><a href="#ridge-regression-to-fight-ill-conditioning" id="markdown-toc-ridge-regression-to-fight-ill-conditioning">Ridge regression to fight ill-conditioning</a></li>
</ul>
</li>
<li><a href="#l_1-regularization-the-lasso" id="markdown-toc-l_1-regularization-the-lasso">$L_1$-Regularization: The Lasso</a></li>
</ul>
</li>
<li><a href="#model-selection" id="markdown-toc-model-selection">Model selection</a> <ul>
<li><a href="#probabilistic-setup" id="markdown-toc-probabilistic-setup">Probabilistic setup</a></li>
<li><a href="#training-error-vs-generalization-error" id="markdown-toc-training-error-vs-generalization-error">Training Error vs. Generalization Error</a></li>
<li><a href="#splitting-the-data" id="markdown-toc-splitting-the-data">Splitting the data</a></li>
<li><a href="#generalization-error-vs-test-error" id="markdown-toc-generalization-error-vs-test-error">Generalization error vs test error</a></li>
<li><a href="#model-selection-1" id="markdown-toc-model-selection-1">Model selection</a> <ul>
<li><a href="#model-selection-based-on-test-error" id="markdown-toc-model-selection-based-on-test-error">Model selection based on test error</a></li>
</ul>
</li>
<li><a href="#cross-validation" id="markdown-toc-cross-validation">Cross-validation</a></li>
<li><a href="#bias-variance-decomposition" id="markdown-toc-bias-variance-decomposition">Bias-Variance decomposition</a> <ul>
<li><a href="#data-generation-model" id="markdown-toc-data-generation-model">Data generation model</a></li>
<li><a href="#error-decomposition" id="markdown-toc-error-decomposition">Error Decomposition</a></li>
<li><a href="#interpretation-of-the-decomposition" id="markdown-toc-interpretation-of-the-decomposition">Interpretation of the decomposition</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#classification" id="markdown-toc-classification">Classification</a> <ul>
<li><a href="#linear-classifier" id="markdown-toc-linear-classifier">Linear classifier</a></li>
<li><a href="#is-classification-a-special-case-of-regression" id="markdown-toc-is-classification-a-special-case-of-regression">Is classification a special case of regression?</a></li>
<li><a href="#nearest-neighbor" id="markdown-toc-nearest-neighbor">Nearest neighbor</a></li>
<li><a href="#linear-decision-boundaries" id="markdown-toc-linear-decision-boundaries">Linear decision boundaries</a></li>
<li><a href="#optimal-classification-for-a-known-generating-model" id="markdown-toc-optimal-classification-for-a-known-generating-model">Optimal classification for a known generating model</a></li>
</ul>
</li>
<li><a href="#logistic-regression" id="markdown-toc-logistic-regression">Logistic regression</a> <ul>
<li><a href="#training" id="markdown-toc-training">Training</a></li>
<li><a href="#conditions-of-optimality" id="markdown-toc-conditions-of-optimality">Conditions of optimality</a></li>
<li><a href="#gradient-descent-1" id="markdown-toc-gradient-descent-1">Gradient descent</a></li>
<li><a href="#newtons-method" id="markdown-toc-newtons-method">Newton’s method</a> <ul>
<li><a href="#hessian-of-the-log-likelihood" id="markdown-toc-hessian-of-the-log-likelihood">Hessian of the log-likelihood</a></li>
<li><a href="#newtons-method-1" id="markdown-toc-newtons-method-1">Newton’s method</a></li>
</ul>
</li>
<li><a href="#regularized-logistic-regression" id="markdown-toc-regularized-logistic-regression">Regularized logistic regression</a></li>
</ul>
</li>
<li><a href="#generalized-linear-models" id="markdown-toc-generalized-linear-models">Generalized Linear Models</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#exponential-family" id="markdown-toc-exponential-family">Exponential family</a> <ul>
<li><a href="#example-bernoulli" id="markdown-toc-example-bernoulli">Example: Bernoulli</a></li>
<li><a href="#example-gaussian" id="markdown-toc-example-gaussian">Example: Gaussian</a></li>
<li><a href="#properties-1" id="markdown-toc-properties-1">Properties</a></li>
<li><a href="#link-function" id="markdown-toc-link-function">Link function</a></li>
</ul>
</li>
<li><a href="#application-in-ml" id="markdown-toc-application-in-ml">Application in ML</a> <ul>
<li><a href="#maximum-likelihood-parameter-estimation" id="markdown-toc-maximum-likelihood-parameter-estimation">Maximum Likelihood Parameter Estimation</a></li>
<li><a href="#generalized-linear-models-1" id="markdown-toc-generalized-linear-models-1">Generalized linear models</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#nearest-neighbor-classifiers-and-the-curse-of-dimensionality" id="markdown-toc-nearest-neighbor-classifiers-and-the-curse-of-dimensionality">Nearest neighbor classifiers and the curse of dimensionality</a> <ul>
<li><a href="#k-nearest-neighbor-knn" id="markdown-toc-k-nearest-neighbor-knn">K Nearest Neighbor (KNN)</a></li>
<li><a href="#analysis" id="markdown-toc-analysis">Analysis</a></li>
</ul>
</li>
<li><a href="#support-vector-machines" id="markdown-toc-support-vector-machines">Support Vector Machines</a> <ul>
<li><a href="#definition" id="markdown-toc-definition">Definition</a></li>
<li><a href="#alternative-formulation-duality" id="markdown-toc-alternative-formulation-duality">Alternative formulation: Duality</a> <ul>
<li><a href="#how-do-we-find-a-suitable-function-g" id="markdown-toc-how-do-we-find-a-suitable-function-g">How do we find a suitable function G?</a></li>
<li><a href="#when-is-it-ok-to-switch-min-and-max" id="markdown-toc-when-is-it-ok-to-switch-min-and-max">When is it OK to switch min and max?</a></li>
<li><a href="#when-is-the-dual-easier-to-optimize-than-the-primal" id="markdown-toc-when-is-the-dual-easier-to-optimize-than-the-primal">When is the dual easier to optimize than the primal?</a></li>
</ul>
</li>
<li><a href="#kernel-trick" id="markdown-toc-kernel-trick">Kernel trick</a> <ul>
<li><a href="#alternative-formulation-of-ridge-regression" id="markdown-toc-alternative-formulation-of-ridge-regression">Alternative formulation of ridge regression</a></li>
<li><a href="#representer-theorem" id="markdown-toc-representer-theorem">Representer theorem</a></li>
<li><a href="#kernelized-ridge-regression" id="markdown-toc-kernelized-ridge-regression">Kernelized ridge regression</a></li>
<li><a href="#kernel-functions" id="markdown-toc-kernel-functions">Kernel functions</a></li>
<li><a href="#kernel-trick-1" id="markdown-toc-kernel-trick-1">Kernel trick</a> <ul>
<li><a href="#radial-basis-function" id="markdown-toc-radial-basis-function">Radial basis function</a></li>
</ul>
</li>
<li><a href="#new-kernel-functions-from-old-ones" id="markdown-toc-new-kernel-functions-from-old-ones">New kernel functions from old ones</a></li>
</ul>
</li>
<li><a href="#properties-of-kernels" id="markdown-toc-properties-of-kernels">Properties of kernels</a></li>
</ul>
</li>
<li><a href="#unsupervised-learning" id="markdown-toc-unsupervised-learning">Unsupervised learning</a> <ul>
<li><a href="#k-means" id="markdown-toc-k-means">K-Means</a> <ul>
<li><a href="#coordinate-descent-interpretation" id="markdown-toc-coordinate-descent-interpretation">Coordinate descent interpretation</a></li>
<li><a href="#matrix-factorization-interpretation" id="markdown-toc-matrix-factorization-interpretation">Matrix factorization interpretation</a></li>
<li><a href="#probabilistic-interpretation" id="markdown-toc-probabilistic-interpretation">Probabilistic interpretation</a></li>
</ul>
</li>
<li><a href="#gaussian-mixture-model-gmm" id="markdown-toc-gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</a></li>
<li><a href="#em-algorithm" id="markdown-toc-em-algorithm">EM algorithm</a> <ul>
<li><a href="#expectation-step" id="markdown-toc-expectation-step">Expectation step</a></li>
<li><a href="#maximization-step" id="markdown-toc-maximization-step">Maximization step</a></li>
<li><a href="#interpretation" id="markdown-toc-interpretation">Interpretation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#matrix-factorization" id="markdown-toc-matrix-factorization">Matrix Factorization</a> <ul>
<li><a href="#prediction-using-a-matrix-factorization" id="markdown-toc-prediction-using-a-matrix-factorization">Prediction using a matrix factorization</a></li>
<li><a href="#choosing-k" id="markdown-toc-choosing-k">Choosing K</a></li>
<li><a href="#regularization-1" id="markdown-toc-regularization-1">Regularization</a></li>
<li><a href="#stochastic-gradient-descent" id="markdown-toc-stochastic-gradient-descent">Stochastic gradient descent</a></li>
<li><a href="#alternating-least-squares-als" id="markdown-toc-alternating-least-squares-als">Alternating least squares (ALS)</a> <ul>
<li><a href="#no-missing-entries" id="markdown-toc-no-missing-entries">No missing entries</a></li>
<li><a href="#missing-entries" id="markdown-toc-missing-entries">Missing entries</a></li>
</ul>
</li>
<li><a href="#text-representation-learning" id="markdown-toc-text-representation-learning">Text representation learning</a> <ul>
<li><a href="#motivation-1" id="markdown-toc-motivation-1">Motivation</a></li>
<li><a href="#co-occurrence-matrix" id="markdown-toc-co-occurrence-matrix">Co-occurrence matrix</a></li>
<li><a href="#learning-word-representations" id="markdown-toc-learning-word-representations">Learning word representations</a></li>
<li><a href="#skip-gram-model" id="markdown-toc-skip-gram-model">Skip-gram model</a></li>
<li><a href="#fasttext" id="markdown-toc-fasttext">FastText</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#svd-and-pca" id="markdown-toc-svd-and-pca">SVD and PCA</a> <ul>
<li><a href="#motivation-2" id="markdown-toc-motivation-2">Motivation</a></li>
<li><a href="#svd" id="markdown-toc-svd">SVD</a> <ul>
<li><a href="#dimensionality-reduction" id="markdown-toc-dimensionality-reduction">Dimensionality reduction</a></li>
<li><a href="#svd-and-matrix-factorization" id="markdown-toc-svd-and-matrix-factorization">SVD and matrix factorization</a></li>
</ul>
</li>
<li><a href="#pca-and-decorrelation" id="markdown-toc-pca-and-decorrelation">PCA and decorrelation</a></li>
<li><a href="#computing-the-svd-efficiently" id="markdown-toc-computing-the-svd-efficiently">Computing the SVD efficiently</a></li>
<li><a href="#pitfalls-of-pca" id="markdown-toc-pitfalls-of-pca">Pitfalls of PCA</a></li>
</ul>
</li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<!-- More -->
<p>We’ll always use subscript $n$ for data point, and $d$ for feature. $N$ is the data size and $D$ is the dimensionality.</p>
<p>Recommended website: <a href="http://www.matrixcalculus.org/">http://www.matrixcalculus.org/</a></p>
<h2 id="linear-regression">Linear regression</h2>
<p>A linear regression is a model that assumes a linear relationship between inputs and the output. We will study three types of methods:</p>
<ol>
<li>Grid search</li>
<li>Iterative optimization algorithms</li>
<li>Least squares</li>
</ol>
<h3 id="simple-linear-regression">Simple linear regression</h3>
<p>For a single input dimension ($D=1$), we can use a simple linear regression, which is given by:</p>
<script type="math/tex; mode=display">\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\set}[1]{\left\{#1\right\}}
\newcommand{\norm}[1]{\left\lVert#1\right\rVert}
\newcommand{\frobnorm}[1]{\left\lVert#1\right\rVert_{\text{Frob}}}
\newcommand{\expect}[1]{\mathbb{E}\left[#1\right]}
\newcommand{\expectsub}[2]{\mathbb{E}_{#1}\left[#2\right]}
\newcommand{\cost}[1]{\mathcal{L}\left(#1\right)}
\newcommand{\Strain}{S_{\text{train}}}
\newcommand{\Stest}{S_{\text{test}}}
\DeclareMathOperator*{\argmax}{\arg\!\max}
\DeclareMathOperator*{\argmin}{\arg\!\min}
y_n \approx f(x_n) := w_0 + w_1 x_{n1}</script>
<p>$w = (w_0, w_1)$ are the parameters of the model.</p>
<h3 id="multiple-linear-regression">Multiple linear regression</h3>
<p>If our data has multiple input dimensions, we obtain multiple linear regression:</p>
<script type="math/tex; mode=display">y_n \approx
f(\pmb{x}_n) := w_0 + w_1 x_{n1} + \dots + w_D x_{nD}
= w_0 + \pmb{x}_n^T \begin{bmatrix}
w_1 \\
\vdots \\
w_D \\
\end{bmatrix}
= \tilde{\pmb{x}}_n ^T \tilde{\pmb{w}}</script>
<blockquote>
<p>👉🏼 If we wanted to be a little more strict, we should write $f_{\pmb{w}}(\pmb{x}_n)$, as the model of course also depends on the weights.</p>
</blockquote>
<p>The tilde notation means that we have included the offset term $w_0$, also known as the <strong>bias</strong>:</p>
<script type="math/tex; mode=display">\tilde{\pmb{x}}_n=\begin{bmatrix}1 \\ x_{n1} \\ \vdots \\ x_{nD} \end{bmatrix} \in \mathbb{R}^{D+1},
\quad
\tilde{\pmb{w}} = \begin{bmatrix}w_0 \\ w_1 \\ \vdots \\ w_D\end{bmatrix} \in \mathbb{R}^{D+1}</script>
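<p>As a minimal sketch of the tilde notation, assuming NumPy and a made-up toy dataset: prepending a column of ones to the data matrix lets a single matrix-vector product compute $w_0 + \pmb{x}_n^T \pmb{w}$ for every data point. The helper name <code>add_bias_column</code> is hypothetical.</p>

```python
import numpy as np

def add_bias_column(X):
    """Prepend a column of ones, so that X_tilde @ w_tilde == w0 + X @ w."""
    N = X.shape[0]
    return np.hstack([np.ones((N, 1)), X])

# Toy data (invented for illustration): N = 2 data points, D = 2 features
X = np.array([[2.0, 3.0],
              [4.0, 5.0]])
w_tilde = np.array([1.0, 0.5, -1.0])   # (w0, w1, w2)

X_tilde = add_bias_column(X)           # shape (N, D + 1)
predictions = X_tilde @ w_tilde        # one prediction per data point
```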
<h3 id="the-d--n-problem">The $D > N$ problem</h3>
<p>If the number of parameters exceeds the number of data examples, we say that the task is <em>under-determined</em>. This can be solved by regularization, which we’ll get to more precisely later.</p>
<h2 id="cost-functions">Cost functions</h2>
<p>$\pmb{x}_n$ is the data, so it’s easy to understand where it comes from. But how does one find a good $\pmb{w}$ from the data?</p>
<p>A <strong>cost function</strong> (also called loss function) is used to learn parameters that explain the data well. It quantifies how well our model does by assigning a penalty to its errors. Our goal is to find weights that minimize the cost function.</p>
<h3 id="properties">Properties</h3>
<p>Desirable properties of cost functions are:</p>
<ul>
<li><strong>Symmetry around 0</strong>: that is, being off by a positive or negative amount is equivalent; what matters is the amplitude of the error, not the sign.</li>
<li><strong>Robustness</strong>: penalizes large errors at about the same rate as very large errors. This is a way to make sure that outliers don’t completely dominate our regression.</li>
</ul>
<h3 id="good-cost-functions">Good cost functions</h3>
<h4 id="mse">MSE</h4>
<p>Probably the most commonly used cost function is Mean Square Error (MSE):</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MSE}}(\pmb{w}) := \frac{1}{N} \sum_{n=1}^N \left(y_n - f(\pmb{x}_n)\right)^2
\label{def:mse}</script>
<p>MSE is symmetrical around 0, but also tends to penalize outliers quite harshly (because it squares the error): MSE is not robust. In practice, this is problematic, because outliers occur more often than we’d like.</p>
<p>Note that we often use MSE with a factor $\frac{1}{2N}$ instead of $\frac{1}{N}$. This is because it makes for a cleaner derivative, but we’ll get into that later. Just know that for all intents and purposes, it doesn’t really change anything about the behavior of the models we’ll study.</p>
<h4 id="mae">MAE</h4>
<p>When outliers are present, Mean Absolute Error (MAE) tends to fare better:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MAE}}(\pmb{w}) := \frac{1}{N} \sum_{n=1}^N \left| y_n - f(\pmb{x}_n)\right|</script>
<p>Instead of squaring, we take the absolute value. This is more robust. Note that MAE isn’t differentiable at 0, but we’ll talk about that later.</p>
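<p>To make the robustness difference concrete, here is a small sketch (assuming NumPy, with invented numbers) that evaluates both cost functions on the same predictions, once with clean targets and once with a single outlier. The function names <code>mse</code> and <code>mae</code> are just illustrative.</p>

```python
import numpy as np

def mse(y, y_pred):
    """Mean square error."""
    return np.mean((y - y_pred) ** 2)

def mae(y, y_pred):
    """Mean absolute error."""
    return np.mean(np.abs(y - y_pred))

y_pred = np.array([1.0, 2.0, 3.0, 4.0])
y_clean = np.array([1.5, 2.5, 2.5, 4.5])      # every error has magnitude 0.5
y_outlier = np.array([1.5, 2.5, 2.5, 14.0])   # the last error has magnitude 10

mse_clean, mse_outlier = mse(y_clean, y_pred), mse(y_outlier, y_pred)
mae_clean, mae_outlier = mae(y_clean, y_pred), mae(y_outlier, y_pred)
# mse_clean = 0.25, mse_outlier = 25.1875
# mae_clean = 0.5,  mae_outlier = 2.875
```

<p>In this toy example, the single outlier multiplies the MSE by roughly 100, but the MAE by less than 6.</p>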
<p>There are other cost functions that are even more robust; these are available as additional reading, but are not exam material.</p>
<h3 id="convexity">Convexity</h3>
<p>A function is <strong>convex</strong> iff the line segment joining any two points on its graph never lies below the graph. More strictly defined, a function $f(\pmb{u})$ with $\pmb{u}\in\chi$ is <em>convex</em> if, for any $\pmb{u}, \pmb{v} \in\chi$, and for any $0 \le\lambda\le 1$, we have:</p>
<script type="math/tex; mode=display">f(\lambda\pmb{u}+(1-\lambda)\pmb{v})\le\lambda f(\pmb{u}) +(1-\lambda)f(\pmb{v})</script>
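<p>A function is <strong>strictly convex</strong> if the above inequality is strict ($<$) for $\pmb{u} \ne \pmb{v}$ and $0 < \lambda < 1$.</p>
<p>A strictly convex function has at most one global minimum $\pmb{w}^*$: if a minimum exists, it is unique. For convex functions, every local minimum is a global minimum. This makes convexity a desirable property for loss functions, since an optimization algorithm that reaches a local minimum has in fact found a global one.</p>
<p>Sums of convex functions are also convex. Each term of MSE and MAE is a convex function (the square or the absolute value) applied to an expression that is affine in $\pmb{w}$, so each term is convex. Therefore, MSE and MAE are convex.</p>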
<h2 id="optimization">Optimization</h2>
<h3 id="learning--estimation--fitting">Learning / Estimation / Fitting</h3>
<p>Given a cost function (or loss function) $\mathcal{L}(\pmb{w})$, we wish to find $\pmb{w}^*$ which minimizes the cost:</p>
<script type="math/tex; mode=display">\min_{\pmb w}{\mathcal{L}(\pmb w)}, \quad\text{ subject to } \pmb w \in \mathbb R^D</script>
<p>This is what we call <strong>learning</strong>: learning is simply an optimization problem, and as such, we’ll use an optimization algorithm to solve it – that is, find a good $\pmb w$.</p>
<h3 id="grid-search">Grid search</h3>
<p>This is one of the simplest optimization algorithms, although far from being the most efficient one. It can be described as “try all the values”, a kind of brute-force algorithm; you can think of it as nested for-loops over the individual $w_i$ weights.</p>
<p>For instance, if our weights are $\pmb{w} = \begin{bmatrix}w_1 \\ w_2\end{bmatrix}$, then we can try, say, 4 values for $w_1$ and 4 values for $w_2$, for a total of 16 values of $\mathcal{L}(\pmb{w})$.</p>
<p>But obviously, complexity is exponential $\mathcal{O}(a^D)$ (where $a$ is the number of values to try), which is really bad, especially when we can have $D\approx$ millions of parameters. Additionally, grid search has no guarantees that it’ll find an optimum; it’ll just find the best value we tried.</p>
<p>If grid search sounds bad for optimization, that’s because it is. In practice, it is not used for optimization of parameters, but it <em>is</em> used to tune hyperparameters.</p>
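<p>The “nested for-loops” view of grid search can be sketched as follows, assuming NumPy, a simple linear model with two weights, and an invented dataset. With $a = 4$ candidate values per weight and $D = 2$ weights, the loop evaluates the cost $a^D = 16$ times.</p>

```python
import itertools
import numpy as np

def mse(w0, w1, x, y):
    """MSE of the simple linear model f(x) = w0 + w1 * x."""
    return np.mean((y - (w0 + w1 * x)) ** 2)

# Toy data (invented): generated with w0 = 1, w1 = 2 and no noise
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

candidates = [0.0, 1.0, 2.0, 3.0]   # a = 4 values to try per weight
best_w, best_cost = None, np.inf
# itertools.product is equivalent to one nested for-loop per weight
for w0, w1 in itertools.product(candidates, repeat=2):
    cost = mse(w0, w1, x, y)
    if cost < best_cost:
        best_w, best_cost = (w0, w1), cost
```

<p>Here the true weights happen to lie on the grid, so the search finds them exactly. In general it only returns the best of the values that were tried.</p>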
<h3 id="optimization-landscapes">Optimization landscapes</h3>
<h4 id="local-minimum">Local minimum</h4>
<p>We’re interested in the minima of cost functions $\mathcal{L}$, which we denote by $\pmb{w}^*$ (as opposed to any other value $\pmb{w}$), but the definitions below hold for any function. A vector $\pmb{w}^*$ is a <em>local minimum</em> of a function $\mathcal{L}$ if $\exists \epsilon > 0$ such that</p>
<script type="math/tex; mode=display">% <![CDATA[
\mathcal{L}(\pmb{w}^*) \le \mathcal{L}(\pmb{w}), \quad \forall\pmb w : \norm{\pmb{w} -\pmb{w}^*} < \epsilon %]]></script>
<p>In other words, the local minimum $\pmb{w}^*$ is better than all the neighbors in some non-zero radius.</p>
<h4 id="global-minimum">Global minimum</h4>
<p>The global minimum is defined by getting rid of the radius $\epsilon$ and comparing to all other values:</p>
<script type="math/tex; mode=display">\mathcal{L}(\pmb{w}^*) \le \mathcal{L}(\pmb{w}), \qquad \forall\pmb{w}\in\mathbb{R}^D</script>
<h4 id="strict-minimum">Strict minimum</h4>
<p>A minimum is said to be <strong>strict</strong> if the corresponding inequality is strict for $\pmb{w} \ne \pmb{w}^*$, that is, the minimum value is attained at $\pmb{w}^*$ only.</p>
<h3 id="smooth-differentiable-optimization">Smooth (differentiable) optimization</h3>
<h4 id="gradient">Gradient</h4>
<p>A gradient at a given point is the slope of the tangent to the function at that point. It points to the direction of largest increase of the function. By following the gradient (in the opposite direction, because we’re searching for a minimum and not a maximum), we can find the minimum.</p>
<p><img src="/images/ml/mse-mae.png" alt="Graphs of MSE and MAE" /></p>
<p>Gradient is defined by:</p>
<script type="math/tex; mode=display">\nabla \mathcal{L}(\pmb{w}) := \begin{bmatrix}
\frac{\partial\mathcal{L}(\pmb{w})}{\partial w_1} \\
\vdots \\
\frac{\partial\mathcal{L}(\pmb{w})}{\partial w_D} \\
\end{bmatrix}</script>
<p>This is a vector, i.e. $\nabla\mathcal{L}(\pmb{w})\in\mathbb R^D$. Each dimension $i$ of the vector indicates how fast the cost $\mathcal{L}$ changes depending on the weight $w_i$.</p>
<h4 id="gradient-descent">Gradient descent</h4>
<p>Gradient descent is an iterative algorithm. We start from an initial candidate $\pmb{w}^{(0)}$, and repeatedly apply the following update:</p>
<script type="math/tex; mode=display">\pmb{w}^{(t+1)}:=\pmb{w}^{(t)} - \gamma \nabla\mathcal{L}\left(\pmb{w}^{(t)}\right)</script>
<p>As stated previously, we’re adding the negative gradient to find the minimum, hence the subtraction.</p>
<p>$\gamma$ is known as the <strong>step-size</strong>, which is a small value (maybe 0.1). You don’t want to be too aggressive with it, or you might risk overshooting in your descent. In practice, the step-size that makes the learning as fast as possible is often found by trial and error 🤷🏼♂️.</p>
<p>As an example, we will take an analytical look at a gradient descent, in order to understand its behavior and components. We will do gradient descent on a 1-parameter model, in which we minimize the MSE, which is defined as follows:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(w_0\right)=\frac{1}{2N}\sum_{n=1}^N{\left(y_n - w_0\right)^2}</script>
<p>Note that we’re dividing by 2 on top of the regular MSE; it has no impact on finding the minimum, but when we compute the gradient below, the $\frac{1}{2}$ will conveniently cancel out the factor 2 that comes from differentiating the square.</p>
<p>The gradient of $\mathcal{L}\left(w_0\right)$ is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\mathcal{L}\left(w_0\right)
& = \frac{\partial}{\partial w_0}\mathcal{L} \\
& = \frac{1}{2N}\sum_{n=1}^N{-2(y_n - w_0)} \\
& = w_0 - \bar{y}
\end{align} %]]></script>
<p>And thus, our gradient descent is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
w_0^{(t+1)}
&:= w_0^{(t)} - \gamma\nabla\mathcal{L}\left(w_0^{(t)}\right) \\
& = w_0^{(t)} - \gamma(w_0^{(t)} - \bar{y}) \\
& = (1-\gamma)w_0^{(t)} + \gamma\bar{y},
\qquad\text{where } \bar{y}:=\sum_{n}{\frac{y_n}{N}}
\end{align*} %]]></script>
<p>For $0 < \gamma < 2$, this sequence is guaranteed to converge to $w_0^* = \bar{y}$ (so the solution to this exact problem can be extracted analytically from gradient descent). At that point the gradient $w_0 - \bar{y}$ is 0, which is where the cost function reaches its minimum.</p>
<p>The choice of $\gamma$ has an influence on the algorithm’s outcome:</p>
<ul>
<li>If we pick $\gamma=1$, we would get to the optimum in one step</li>
<li>If we pick $\gamma < 1$, we would get a little closer in every step, eventually converging to $\bar{y}$</li>
<li>If we pick $\gamma > 1$, we are going to overshoot $\bar{y}$. Slightly bigger than 1 (say, 1.5) would still converge; $\gamma=2$ would loop infinitely between two points; $\gamma > 2$ diverges.</li>
</ul>
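<p>The effect of the step-size on this 1-parameter model can be checked numerically. The following is a small sketch, assuming NumPy and an invented dataset whose mean is $\bar{y} = 3$; the function name <code>gradient_descent_1d</code> is hypothetical.</p>

```python
import numpy as np

def gradient_descent_1d(y, gamma, w0=0.0, steps=50):
    """Gradient descent on L(w0) = 1/(2N) * sum_n (y_n - w0)^2."""
    y_bar = np.mean(y)
    w = w0
    for _ in range(steps):
        w = w - gamma * (w - y_bar)   # the gradient is w - y_bar
    return w

y = np.array([1.0, 2.0, 3.0, 6.0])    # toy data with mean y_bar = 3

one_step = gradient_descent_1d(y, gamma=1.0, steps=1)   # reaches y_bar in one step
slow = gradient_descent_1d(y, gamma=0.5)                # gets closer at every step
overshoot = gradient_descent_1d(y, gamma=1.5)           # overshoots, still converges
cycling = gradient_descent_1d(y, gamma=2.0)             # alternates between 0 and 6
diverging = gradient_descent_1d(y, gamma=2.5)           # moves away from y_bar
```

<p>The behavior follows from the update rule: the distance to $\bar{y}$ is multiplied by $\abs{1 - \gamma}$ at every step, which shrinks it only when $0 < \gamma < 2$.</p>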
<h4 id="gradient-descent-for-linear-mse">Gradient descent for linear MSE</h4>
<p>Our linear regression is fitted to $N$ outputs, collected in a vector $\pmb{y}$, and the corresponding inputs, collected in a matrix $\pmb X$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\pmb{y}=\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix},
\quad
\pmb{X}=\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1D} \\
x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{ND} \\
\end{bmatrix} %]]></script>
<p>Our model is:</p>
<script type="math/tex; mode=display">f_{\pmb{w}}(\pmb{x}_n)=\pmb{x}_n^T \pmb{w}</script>
<p>We define the error vector by:</p>
<script type="math/tex; mode=display">\pmb{e}=\pmb{y} - \pmb{Xw},
\quad \text{ or } \quad
e_n = y_n - \pmb{x}_n^T\pmb{w}</script>
<p>The MSE can then be restated as follows:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\pmb{w}\right)
:= \frac{1}{2N}\sum_{n=1}^N{\left( y_n - \pmb{x}_n^T \pmb{w}\right)^2}
= \frac{1}{2N}\pmb{e}^T\pmb{e}</script>
<p>And the gradient is, component-wise:</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial w_d}\mathcal{L}
= -\frac{1}{2N}\sum_{n=1}^N {2(y_n - \pmb{x}_n^T \pmb{w}) x_{nd}}
= -\frac{1}{N} (\pmb{X}_{:d})^T \pmb{e}</script>
<p>We’re using column notation $\pmb{X}_{:d}$ to signify column $d$ of the matrix $\pmb{X}$.</p>
<p>And thus, all in all, our gradient is:</p>
<script type="math/tex; mode=display">\nabla\mathcal{L}\left(\pmb{w}\right) = -\frac{1}{N}\pmb{X}^T\pmb{e}</script>
<p>To compute this expression, we must compute:</p>
<ul>
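<li>The error $\pmb e$: the matrix-vector product $\pmb{Xw}$ takes $N(2D - 1)$ floating point operations (flops), since each of the $N$ entries needs $D$ multiplications and $D - 1$ additions, and the subtraction takes $N$ more, for a total of $2N\cdot D$, which is $\mathcal{O}(N\cdot D)$.</li>
<li>The gradient $\nabla\mathcal{L}$: the product $\pmb{X}^T\pmb{e}$ takes $D(2N - 1)$ flops and the scaling by $-\frac{1}{N}$ takes $D$ more, for a total of $2N\cdot D$, which is also $\mathcal{O}(N\cdot D)$.</li>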
</ul>
<p>In total, this process is $\mathcal{O}(N\cdot D)$ at every step. This is not too bad: it’s equivalent to reading the data once.</p>
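<p>The two computations above translate almost directly into code. This is a minimal sketch assuming NumPy and synthetic, noiseless data (so that the weights used to generate the targets are also the minimizer); the function names, the step-size and the number of steps are illustrative choices, not prescribed values.</p>

```python
import numpy as np

def mse_gradient(y, X, w):
    """Gradient of L(w) = 1/(2N) * e^T e, with e = y - Xw."""
    e = y - X @ w                  # error vector, O(N * D)
    return -X.T @ e / len(y)       # gradient, O(N * D)

def gradient_descent(y, X, gamma=0.1, steps=2000):
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - gamma * mse_gradient(y, X, w)
    return w

# Synthetic data (assumption): N = 100 points, D = 3 features, no noise
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = gradient_descent(y, X)     # should recover w_true on this toy problem
```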
<h4 id="stochastic-gradient-descent-sgd">Stochastic gradient descent (SGD)</h4>
<p>In ML, most cost functions are formulated as a sum (or average) over the training examples:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\pmb{w}\right) = \frac{1}{N}\sum_{n=1}^N{\mathcal{L}_n(\pmb{w})}</script>
<p>In practice, this can be expensive to compute, so the solution is to pick a single $n$ uniformly at random in $\left\{1, \dots, N\right\}$ at every step, which makes the sum go away.</p>
<p>The stochastic gradient descent is thus:</p>
<script type="math/tex; mode=display">\pmb{w}^{(t+1)}:=\pmb{w}^{(t)} - \gamma \nabla\mathcal{L}_n\left({\pmb{w}^{(t)}}\right)</script>
<p>Why is it allowed to pick just one $n$ instead of the full thing? We won’t give a full proof, but the intuition is that:</p>
<script type="math/tex; mode=display">\expect{\nabla\mathcal{L}_n(\pmb{w})}
= \frac{1}{N} \sum_{n=1}^N{\nabla\mathcal{L}_n(\pmb{w})}
= \nabla\left(\frac{1}{N} \sum_{n=1}^N{\mathcal{L}_n(\pmb{w})}\right)
\equiv \nabla\mathcal{L}\left(\pmb{w}\right)</script>
<p>For the MSE, the cost and gradient of a single $n$ are:</p>
<script type="math/tex; mode=display">\mathcal{L}_n(\pmb{w}) = \frac{1}{2} \left(y_n -\pmb{x}_n^T \pmb{w}\right)^2 \\
\nabla\mathcal{L}_n(\pmb{w}) = -\pmb{x}_n (y_n-\pmb{x}_n^T \pmb{w})</script>
<p>Note that $\pmb{x}_n \in\mathbb{R}^D$, and $(y_n-\pmb{x}_n^T \pmb{w})\in\mathbb{R}$. Computational complexity for this is $\mathcal{O}(D)$.</p>
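<p>A sketch of SGD for the MSE, assuming NumPy and the same kind of synthetic, noiseless data as one might use to test full gradient descent. Each step touches a single row of the data, so it costs $\mathcal{O}(D)$. The function name, step-size and number of steps are illustrative assumptions.</p>

```python
import numpy as np

def sgd(y, X, gamma=0.01, steps=20000, seed=0):
    """Stochastic gradient descent on the MSE of a linear model."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(steps):
        n = rng.integers(N)                   # uniform in {0, ..., N-1}
        grad_n = -X[n] * (y[n] - X[n] @ w)    # gradient of L_n, O(D)
        w = w - gamma * grad_n
    return w

# Synthetic data (assumption): noiseless, so every L_n is minimized by w_true
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = sgd(y, X)
```

<p>Because the targets here contain no noise, the stochastic gradients vanish at the optimum and a constant step-size is enough. With noisy data, the iterates would keep fluctuating around the minimum unless the step-size is decreased over time.</p>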
<h4 id="mini-batch-sgd">Mini-batch SGD</h4>
<p>But perhaps just picking a <strong>single</strong> value is too extreme; there is an intermediate version in which we choose a subset $B\subseteq \left[N\right]$ of $\abs{B}$ points, instead of a single point.</p>
<script type="math/tex; mode=display">\pmb{g} := \frac{1}{|B|}\sum_{n\in B}{\nabla\mathcal{L}_n(\pmb{w}^{(t)})} \\
\pmb{w}^{(t+1)} := \pmb{w}^{(t)} - \gamma\pmb{g}</script>
<p>Note that if $\abs{B} = N$ then we’re performing a full gradient descent.</p>
<p>The computation of $\pmb{g}$ can be parallelized easily over $\abs{B}$ GPU threads, which is quite common in practice; $\abs{B}$ is thus often dictated by the number of available threads.</p>
<p>Computational complexity is $\mathcal{O}(\abs{B}\cdot D)$.</p>
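<p>The mini-batch variant only changes how the update direction $\pmb{g}$ is computed. Below is a sketch under the same assumptions as before (NumPy, MSE cost, synthetic noiseless data); the batch size, step-size and function name are arbitrary illustrative choices.</p>

```python
import numpy as np

def minibatch_sgd(y, X, batch_size=10, gamma=0.05, steps=5000, seed=0):
    """Mini-batch SGD on the MSE of a linear model."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(steps):
        B = rng.choice(N, size=batch_size, replace=False)   # random subset B
        e_B = y[B] - X[B] @ w                               # errors on the batch
        g = -X[B].T @ e_B / batch_size                      # average of |B| gradients
        w = w - gamma * g                                   # O(|B| * D) per step
    return w

# Synthetic data (assumption): N = 100 points, D = 3 features, no noise
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true

w_hat = minibatch_sgd(y, X)
```

<p>Setting <code>batch_size</code> to $N$ makes every batch the full dataset, which recovers full gradient descent.</p>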
<h3 id="non-smooth-non-differentiable-optimization">Non-smooth (non-differentiable) optimization</h3>
<p>We’ve defined convexity previously, but we can also use the following alternative characterization of convexity:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\pmb u\right) \ge \mathcal{L}\left(\pmb w\right) + \nabla \mathcal{L}\left(\pmb w\right)^T(\pmb{u} - \pmb{w}) \quad \forall \pmb{u}, \pmb{w}
\iff \mathcal{L} \text{ convex}</script>
<p>Meaning that the function must always lie above its linearization (which is the first-order Taylor expansion) to be convex.</p>
<p><img src="/images/ml/convex-above-linearization.png" alt="A convex function lies above its linearization" /></p>
<h4 id="subgradients">Subgradients</h4>
<p>A vector $\pmb{g}\in\mathbb{R}^D$ such that:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\pmb u\right) \ge \mathcal{L}\left(\pmb w\right) + \pmb{g}^T(\pmb u - \pmb w) \quad \forall \pmb{u}</script>
<p>is called a <strong>subgradient</strong> to the function $\mathcal{L}$ at $\pmb w$. The subgradient forms a line that is always below the curve, somewhat like the gradient of a convex function.</p>
<p><img src="/images/ml/subgradient-below-function.png" alt="The subgradient lies below the function" /></p>
<p>This definition is valid even for an arbitrary $\mathcal{L}$ that may not be differentiable, and not even necessarily convex.</p>
<p>If the function $\mathcal{L}$ is differentiable at $\pmb w$, then the <em>only subgradient</em> at $\pmb{w}$ is $\pmb{g} = \nabla\mathcal{L}\left(\pmb{w}\right)$.</p>
<h4 id="subgradient-descent">Subgradient descent</h4>
<p>This is exactly like gradient descent, except for the fact that we use the <em>subgradient</em> $\pmb{g}$ at the current iterate $\pmb{w}^{(t)}$ instead of the <em>gradient</em>:</p>
<script type="math/tex; mode=display">\pmb{w}^{(t+1)} := \pmb{w}^{(t)} - \gamma\pmb{g}</script>
<p>For instance, MAE is not differentiable at 0, so we must use the subgradient.</p>
<script type="math/tex; mode=display">% <![CDATA[
\text{Let }h: \mathbb{R} \rightarrow \mathbb{R}, \quad h(e) := |e| \\
\text{At } e, \text{the subgradient }
g \in \partial h = \begin{cases}
-1 & \text{if } e < 0 \\
[-1, 1] & \text{if } e = 0 \\
1 & \text{if } e > 0 \\
\end{cases} %]]></script>
<p>Here, $\partial h$ is somewhat confusing notation for the set of all possible subgradients at our position.</p>
<p>For linear regressions, the (sub)gradient is easy to compute using the <em>chain rule</em>.</p>
<p>Let $h$ be non-differentiable, $q$ differentiable, and $\mathcal{L}\left(\pmb{w}\right) = h(q(\pmb{w}))$. The chain rule tells us that, at $\pmb w$, our subgradient is:</p>
<script type="math/tex; mode=display">g \in \partial h(q(\pmb{w})) \cdot \nabla q(\pmb{w})</script>
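<p>As an illustration of the chain rule, here is a sketch of a subgradient of the MAE for a linear model, assuming NumPy. With $h(e) = \abs{e}$ and $q(\pmb{w}) = y_n - \pmb{x}_n^T\pmb{w}$, we have $\nabla q(\pmb{w}) = -\pmb{x}_n$, so one valid subgradient of each term is $-\operatorname{sign}(e_n)\,\pmb{x}_n$. The function name and the toy data are invented for the example.</p>

```python
import numpy as np

def mae_subgradient(y, X, w):
    """One subgradient of L(w) = 1/N * sum_n |y_n - x_n^T w|."""
    e = y - X @ w
    # np.sign returns 0 where e == 0, which is a valid choice within [-1, 1]
    return -X.T @ np.sign(e) / len(y)

# Toy data (invented): fitted exactly by w = (1, 2)
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
y = np.array([1.0, 2.0, 3.0])

g_start = mae_subgradient(y, X, np.zeros(2))             # all errors are positive
g_optimum = mae_subgradient(y, X, np.array([1.0, 2.0]))  # all errors are zero
```

<p>At the optimum every error is zero, and picking $0 \in [-1, 1]$ for each term gives the zero vector, so subgradient descent stays put.</p>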
<h4 id="stochastic-subgradient-descent">Stochastic subgradient descent</h4>
<p>This is still commonly abbreviated SGD.</p>
<p>It’s exactly the same, except that $\pmb g$ is a subgradient to the randomly selected $\mathcal{L}_n$ at the current iterate $\pmb{w}^{(t)}$.</p>
<h3 id="comparison">Comparison</h3>
<table>
<thead>
<tr>
<th> </th>
<th>Smooth</th>
<th>Non-smooth</th>
</tr>
</thead>
<tbody>
<tr>
<td>Full gradient descent</td>
<td>Gradient of <script type="math/tex">\mathcal{L}</script> <br />Complexity is $\mathcal{O}(N\cdot D)$</td>
<td>Subgradient of $\mathcal{L}$<br />Complexity is $\mathcal{O}(N\cdot D)$</td>
</tr>
<tr>
<td>Stochastic gradient descent</td>
<td>Gradient of $\mathcal{L}_n$<br />Complexity is $\mathcal{O}(D)$</td>
<td>Subgradient of $\mathcal{L}_n$<br />Complexity is $\mathcal{O}(D)$</td>
</tr>
</tbody>
</table>
<h3 id="constrained-optimization">Constrained optimization</h3>
<p>Sometimes, optimization problems come posed with an additional constraint.</p>
<h4 id="convex-sets">Convex sets</h4>
<p>We’ve seen convexity for functions, but we can also define it for sets. A set $\mathcal{C}$ is convex iff the line segment between any two points of $\mathcal{C}$ lies in $\mathcal{C}$. That is, $\forall \pmb{u}, \pmb{v} \in \mathcal{C}, \quad \forall 0 \le \theta \le 1$, we have:</p>
<script type="math/tex; mode=display">\theta \pmb{u} + (1 - \theta)\pmb{v} \in \mathcal{C}</script>
<p>This means that the line between any two points in the set $\mathcal{C}$ must also be fully contained within the set.</p>
<p><img src="/images/ml/convex-sets.png" alt="Examples of convex and non-convex sets" /></p>
<p>A couple of properties of convex sets:</p>
<ul>
<li>Intersection of convex sets is also convex.</li>
<li>Projections onto convex sets are <strong>unique</strong> (and often efficient to compute).</li>
</ul>
<h4 id="projected-gradient-descent">Projected gradient descent</h4>
<p>When dealing with constrained problems, we have two options. The first one is to add a projection onto $\mathcal{C}$ in every step:</p>
<script type="math/tex; mode=display">P_\mathcal{C}(\pmb{w}') := \arg{\min_{\pmb{v}\in\mathcal{C}}}\norm{\pmb{v}-\pmb{w}'}</script>
<p>The rule for gradient descent can thus be updated to become:</p>
<script type="math/tex; mode=display">w^{(t+1)} := P_\mathcal{C}\left(w^{(t)} - \gamma \nabla \mathcal{L}(w^{(t)}) \right)</script>
<p>This means that at every step, we compute the new $w^{(t+1)}$ normally, but apply a projection on top of that. In other words, if the regular gradient descent sets our weights outside of the constrained space, we project them back.</p>
<figure>
<img alt="Steps of projected SGD" src="/images/ml/projected-sgd.png" />
<figcaption>Here, $w'$ is the result of regular SGD, i.e. $w' = w^{(t)} - \gamma \nabla \mathcal{L}(w^{(t)})$</figcaption>
</figure>
<p>This is the same for stochastic gradient descent, and we have the same convergence properties.</p>
<p>Note that the computational cost of the projection is very important here, since it is performed at every step.</p>
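<p>A sketch of projected gradient descent, assuming NumPy and, purely for illustration, a constraint set $\mathcal{C}$ chosen to be the Euclidean ball of a given radius (for which the projection is a simple rescaling). The cost function, function names and parameters below are assumptions made for the example.</p>

```python
import numpy as np

def project_onto_l2_ball(w, radius=1.0):
    """Projection onto C = {v : ||v|| <= radius}: rescale if outside."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def projected_gradient_descent(grad, w_init, gamma=0.1, steps=500, radius=1.0):
    w = project_onto_l2_ball(np.asarray(w_init, dtype=float), radius)
    for _ in range(steps):
        w_prime = w - gamma * grad(w)                  # regular gradient step
        w = project_onto_l2_ball(w_prime, radius)      # project back onto C
    return w

# Toy cost (assumption): L(w) = 0.5 * ||w - target||^2, with target outside C
target = np.array([3.0, 4.0])
grad = lambda w: w - target

w_star = projected_gradient_descent(grad, np.zeros(2))
# The constrained optimum is the point of the unit ball closest to target: (0.6, 0.8)
```

<p>For this particular set, the projection costs only $\mathcal{O}(D)$, so it doesn’t dominate the cost of a step. For more complicated sets, the projection can itself be an optimization problem.</p>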
<h4 id="turning-constrained-problems-into-unconstrained-problems">Turning constrained problems into unconstrained problems</h4>
<p>If projection as described above is approach A, this is approach B.</p>
<p>We use a <strong>penalty function</strong>, such as the “brick wall” indicator function below:</p>
<script type="math/tex; mode=display">% <![CDATA[
I_\mathcal{C}(\pmb w) = \begin{cases}
0 & \pmb{w} \in \mathcal{C} \\
+\infty & \pmb{w} \notin \mathcal{C}
\end{cases} %]]></script>
<p>We could also use a penalty with a less drastic value than $+\infty$, if we don’t need the constraint to be enforced quite as strictly.</p>
<p>Note that this is similar to regularization, which we’ll talk about later.</p>
<p>Now, instead of directly solving $\min_{\pmb{w}\in\mathcal{C}}{\mathcal{L}(\pmb{w})}$, we solve for:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}\in \mathbb{R}^D} {
\mathcal{L}(\pmb{w}) + I_\mathcal{C}(\pmb{w})
}</script>
<h3 id="implementation-issues-in-gradient-methods">Implementation issues in gradient methods</h3>
<h4 id="stopping-criteria">Stopping criteria</h4>
<p>When $\norm{\nabla\mathcal{L}(\pmb{w})}$ is zero (or close to zero), we are often close to the optimum.</p>
<h4 id="optimality">Optimality</h4>
<p>If the gradient is zero at a point and the second-order derivative there is positive (or, in the general case $D\ge1$, the Hessian is positive definite), then that point is a (possibly local) minimum. If the function is also convex, then any point with zero gradient is necessarily a global minimum. That is:</p>
<script type="math/tex; mode=display">\nabla \mathcal{L}(\pmb{w}) = 0, \quad \mathcal{L} \text{ convex}
\implies
\text{optimum at }\pmb{w}</script>
<h4 id="step-size">Step size</h4>
<p>If $\gamma$ is too big, we might diverge (<a href="#gradient-descent">as seen previously</a>). But if it is too small, we might be very slow! Convergence is only guaranteed for $\gamma < \gamma_{min}$, which is a value that depends on the problem.</p>
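<p>To make the effect of the step size concrete, here is a small sketch on the one-dimensional quadratic $\mathcal{L}(w) = \frac{1}{2} a w^2$, whose gradient is $aw$. The function name and the chosen values are assumptions made for illustration; for this particular quadratic, the iterates shrink only when $\gamma$ is below $2/a$:</p>

```python
def gradient_descent_1d(gamma, curvature=4.0, w0=1.0, num_steps=50):
    """Gradient descent on L(w) = 0.5 * curvature * w^2, starting from w0."""
    w = w0
    for _ in range(num_steps):
        # The gradient of L at w is curvature * w.
        w = w - gamma * curvature * w
    return w


w_small = gradient_descent_1d(gamma=0.001)  # converges, but very slowly
w_good = gradient_descent_1d(gamma=0.25)    # reaches the optimum w = 0
w_big = gradient_descent_1d(gamma=0.6)      # step too large: iterates blow up
```

<p>With a curvature of 4, the threshold for this toy problem is $\gamma = 0.5$: the first two runs approach 0 (at very different speeds), while the last one diverges.</p>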
<h2 id="least-squares">Least squares</h2>
<h3 id="normal-equations">Normal equations</h3>
<p>In some rare cases, we can take an analytical approach to computing the optimum of the cost function, rather than a computational one; for instance, for linear regression with MSE, as we’ve done previously. These types of equations are sometimes called <strong>normal equations</strong>. This is one of the most popular methods for data fitting, called <strong>least squares</strong>.</p>
<p>How do we get these normal equations?</p>
<p>First, we show that the problem is convex. If that is the case, then according to the <a href="#optimality">optimality conditions</a> for convex functions, the point at which the derivative is zero is the optimum:</p>
<script type="math/tex; mode=display">\nabla\mathcal{L}(\pmb{w}^*)=\pmb{0}</script>
<p>This gives us a system of $D$ equations known as the normal equations.</p>
<h3 id="single-parameter-linear-regression">Single parameter linear regression</h3>
<p>Let’s try this for a single parameter linear regression, with MSE as the cost function.</p>
<p>First, we will just accept that the cost function is convex in the $w_0$ parameter.</p>
<p>As <a href="#gradient-descent">proven previously</a>, we know that for the single parameter model, the derivative is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\mathcal{L}\left(\pmb{w}\right)
& = \frac{\partial}{\partial w_0}\mathcal{L} \\
& = \frac{1}{2N}\sum_{n=1}^N{-2(y_n - w_0)} \\
& = w_0 - \bar{y}
\end{align} %]]></script>
<p>This means that the derivative is 0 for $w_0 = \bar{y}$. This allows us to define our optimum parameter $\pmb{w}^*$ as $\pmb{w}^* = \begin{bmatrix}\bar{y}\end{bmatrix}$.</p>
<h3 id="multiple-parameter-linear-regression">Multiple parameter linear regression</h3>
<p>As we know by now, the cost function for linear regression with MSE is:</p>
<script type="math/tex; mode=display">\mathcal{L}\left(\pmb{w}\right)
:= \frac{1}{2N}\sum_{n=1}^N{\left( y_n - \pmb{x}_n^T \pmb{w}\right)^2}
= \frac{1}{2N}(\pmb{y-Xw})^T(\pmb{y-Xw})</script>
<p>Where the matrices are defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\pmb{y}=\begin{bmatrix}
y_1 \\ y_2 \\ \vdots \\ y_N
\end{bmatrix},
\quad
\pmb{X}=\begin{bmatrix}
x_{11} & x_{12} & \dots & x_{1D} \\
x_{21} & x_{22} & \dots & x_{2D} \\
\vdots & \vdots & \ddots & \vdots \\
x_{N1} & x_{N2} & \dots & x_{ND} \\
\end{bmatrix} %]]></script>
<p>We denote the $i^\text{th}$ row of $X$ by $x_i^T$. Each $x_i^T$ represents a different data point.</p>
<p>We claim that this cost function is <em>convex</em> in $\pmb{w}$. We can prove that in any of the following ways:</p>
<hr />
<h4 id="simplest-way">Simplest way</h4>
<p>The cost function is the sum of many convex functions, and is thus also convex.</p>
<h4 id="directly-verify-the-definition">Directly verify the definition</h4>
<script type="math/tex; mode=display">\forall \lambda\in [0,1],
\quad \forall \pmb{w}, \pmb{w}',
\qquad
\mathcal{L}\left(\lambda\pmb{w} + \left(1-\lambda\right)\pmb{w}'\right)
- \left(\lambda\mathcal{L}(\pmb{w}) + \left( 1-\lambda \right) \mathcal{L}(\pmb{w}')\right) \le 0</script>
<p>The left-hand side of the inequality reduces to:</p>
<script type="math/tex; mode=display">-\frac{1}{2N}\lambda(1-\lambda)\norm{\pmb{X}(\pmb{w}-\pmb{w}')}_2^2</script>
<p>which is indeed non-positive.</p>
<h4 id="compute-the-hessian">Compute the Hessian</h4>
<p>The Hessian is the matrix of second derivatives, defined as follows:</p>
<script type="math/tex; mode=display">H_{ij} = \frac{\partial^2\mathcal{L}}{\partial w_i \partial w_j}</script>
<p>If the Hessian is positive semidefinite (i.e. all its eigenvalues are non-negative), then the function is convex.</p>
<p>For our case, the Hessian is given by:</p>
<script type="math/tex; mode=display">\frac{1}{N}\pmb{X}^T\pmb{X}</script>
<p>This is indeed positive semi-definite, as its eigenvalues are the squares of the singular values of $\pmb{X}$ (divided by $N$), and must therefore be non-negative.</p>
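<p>A quick numerical sanity check of this claim, on randomly generated data (the data and variable names are assumptions made for illustration): the eigenvalues of $\frac{1}{N}\pmb{X}^T\pmb{X}$ should be non-negative and equal to the squared singular values of $\pmb{X}$ divided by $N$.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 50, 3
X = rng.normal(size=(N, D))  # hypothetical data matrix

# Hessian of the MSE cost for linear regression.
hessian = X.T @ X / N

# eigvalsh is meant for symmetric matrices; it returns eigenvalues in ascending order.
eigenvalues = np.linalg.eigvalsh(hessian)

# Singular values of X, returned in descending order.
singular_values = np.linalg.svd(X, compute_uv=False)
```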
<hr />
<p>Knowing that the function is convex, we can find the minimum. If we take the gradient of this expression, we get:</p>
<script type="math/tex; mode=display">\nabla\mathcal{L}(\pmb{w}) = -\frac{1}{N}\pmb{X}^T(\pmb{y-Xw})</script>
<p>We can set this to 0 to get the normal equations for linear regression, which are:</p>
<script type="math/tex; mode=display">\pmb{X}^T(\pmb{y-Xw}) =: \pmb{X}^T\pmb{e} = \pmb{0}</script>
<p>This proves that the normal equations for linear regression are given by $\pmb{X}^T\pmb{e} = \pmb{0}$.</p>
<h3 id="geometric-interpretation">Geometric interpretation</h3>
<p>The normal equations derived above are $\pmb{X}^T\pmb{e} = \pmb{0}$. How can we visualize that?</p>
<p>The error is given by:</p>
<script type="math/tex; mode=display">\pmb{e} := \pmb{y} - \pmb{Xw}</script>
<p>The normal equations say that, at the optimum, this error vector is orthogonal to all columns of $\pmb{X}$. It tells us how far the target $\pmb{y}$ is from our prediction $\pmb{Xw}$, which lies in the span of $\pmb{X}$.</p>
<p>The <strong>span</strong> of $\pmb{X}$ is the space spanned by the columns of $\pmb{X}$. Every element of the span can be written as $\pmb{u} = \pmb{Xw}$ for some choice of $\pmb{w}$.</p>
<p>For the normal equations, we must pick an optimal $\pmb{w}^*$ for which the gradient is 0. Picking a $\pmb{w}^*$ is equivalent to picking an optimal $\pmb{u}^* = \pmb{Xw}^*$ from the span of $\pmb{X}$.</p>
<p>But which element of $\text{span}(\pmb{X})$ is the optimal one? The normal equations tell us that the optimum choice for $\pmb{u}$, called <script type="math/tex">\pmb{u}^*</script>, is the element such that <script type="math/tex">\pmb{y} - \pmb{u}^*</script> is orthogonal to $\text{span}(\pmb{X})$.</p>
<p>In other words, we should pick $\pmb{u}^*$ to be the projection of $\pmb{y}$ onto $\text{span}(\pmb{X})$.</p>
<p><img src="/images/ml/geometric-interpretation-normal-equations.png" alt="Geometric interpretation of the normal equations" /></p>
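<p>The orthogonality can be checked numerically. The sketch below uses random data (an assumption for illustration) and NumPy’s <code class="highlighter-rouge">np.linalg.lstsq</code> to obtain $\pmb{w}^*$, then verifies that the error is orthogonal to the columns of $\pmb{X}$:</p>

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))  # hypothetical data matrix
y = rng.normal(size=20)       # hypothetical targets

# Least squares solution w*, and its image u* = X w* in span(X).
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
u_star = X @ w_star

# Error vector e = y - X w*; the normal equations state that X^T e = 0.
e = y - u_star
orthogonality = X.T @ e

# Since u* is the orthogonal projection of y, Pythagoras holds:
# ||y||^2 = ||u*||^2 + ||e||^2.
pythagoras_gap = y @ y - (u_star @ u_star + e @ e)
```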
<h3 id="least-squares-1">Least squares</h3>
<p>All we’ve done so far is to solve the same old problem of a matrix equation:</p>
<script type="math/tex; mode=display">Ax = b</script>
<p>But we’ve always done so with a bit of a twist; there may not be an exact value of $x$ satisfying exact equality, but we could find one that gets us as close as possible:</p>
<script type="math/tex; mode=display">Ax \approx b</script>
<p>This is also what least squares does. It attempts to minimize the MSE to get $Ax$ as close as possible to $b$.</p>
<p>In this course, we often denote the data matrix $A$ as $\pmb{X}$, the weights $x$ as $\pmb{w}$, and $b$ as $y$; in other words, we’re trying to solve:</p>
<script type="math/tex; mode=display">\pmb{X}\pmb{w} \approx \pmb{y}</script>
<p>In least squares, we multiply this whole equation by $\pmb{X}^T$ on the left, which yields the normal equations. We attempt to find $\pmb{w}^*$, the weight vector that makes the error as small as possible. In other words, we’re trying to solve:</p>
<script type="math/tex; mode=display">\left( \pmb{X}^T\pmb{X} \right) \pmb{w} = \pmb{X}^T\pmb{y}</script>
<p>One way to solve this problem would simply be to invert the $A$ matrix, which in our case is $\pmb{X}^T\pmb{X}$:</p>
<script type="math/tex; mode=display">\pmb{w}^* = (\pmb{X}^T\pmb{X})^{-1} \pmb{X}^T \pmb{y}</script>
<p>As such, we can use this model to predict values for unseen data points:</p>
<script type="math/tex; mode=display">\hat{y}_m := \pmb{x}_m^T \pmb{w}^* = \pmb{x}_m^T (\pmb{X}^T\pmb{X})^{-1} \pmb{X}^T \pmb{y}</script>
<h3 id="invertibility-and-uniqueness">Invertibility and uniqueness</h3>
<p>Note that the Gram matrix, defined as $\pmb{X}^T\pmb{X} \in \mathbb{R}^{D\times D}$, is invertible <strong>if and only if</strong> $\pmb{X}$ has <strong>full column rank</strong>, or in other words, $\text{rank}(\pmb{X}) = D$.</p>
<script type="math/tex; mode=display">\pmb{X}^T\pmb{X} \in \mathbb{R}^{D\times D} \text{ invertible}
\iff
\text{rank}(\pmb{X}) = D</script>
<p>Unfortunately, in practice, our data matrix $\pmb{X}\in\mathbb{R}^{N\times D}$ is often <strong>rank-deficient</strong>.</p>
<ul>
<li>If $D>N$, we always have $\text{rank}(\pmb{X}) < D$ (since column and row rank are the same).</li>
<li>
<p>If $D \le N$, but some of the columns $\pmb{x}_{:d}$ are collinear (or in practice, nearly collinear), then the matrix is <strong>ill-conditioned</strong>. This leads to numerical issues when solving the linear system.</p>
<p>To know how bad things are, we can compute the condition number, which is the maximum eigenvalue of the Gram matrix divided by the minimum eigenvalue. See the course contents of Numerical Methods.</p>
</li>
</ul>
<p>If our data matrix is rank-deficient or ill-conditioned (which is practically always the case), we certainly shouldn’t be inverting the Gram matrix directly! We’d introduce large numerical errors that corrupt our output.</p>
<p>That doesn’t mean we can’t do least squares in practice. We can still use a linear solver. In Python, that means you should use <a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.linalg.solve.html"><code class="highlighter-rouge">np.linalg.solve</code></a>, which uses an LU decomposition internally and thus avoids the worst numerical errors. In any case, do not directly invert the matrix as we have done above!</p>
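<p>A minimal sketch of this advice, on synthetic noiseless data (the function name <code class="highlighter-rouge">least_squares</code> and the data are assumptions made for illustration). It solves the normal equations with <code class="highlighter-rouge">np.linalg.solve</code> instead of forming an explicit inverse, and reports the condition number of the Gram matrix:</p>

```python
import numpy as np


def least_squares(X, y):
    """Solve the normal equations (X^T X) w = X^T y without an explicit inverse."""
    gram = X.T @ X
    return np.linalg.solve(gram, X.T @ y)


rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))           # hypothetical, well-conditioned data
w_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ w_true                          # noiseless targets, for illustration only

w_hat = least_squares(X, y)

# For the symmetric Gram matrix, the 2-norm condition number is the ratio
# of its largest to its smallest eigenvalue.
condition_number = np.linalg.cond(X.T @ X)
```

<p>If the Gram matrix is exactly singular, <code class="highlighter-rouge">np.linalg.solve</code> may raise a <code class="highlighter-rouge">LinAlgError</code>; in that situation <code class="highlighter-rouge">np.linalg.lstsq</code>, applied to $\pmb{X}$ and $\pmb{y}$ directly, is one possible alternative.</p>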
<h2 id="maximum-likelihood">Maximum likelihood</h2>
<p>Maximum likelihood offers a second interpretation of least squares, but starting with a probabilistic approach.</p>
<h3 id="gaussian-distribution">Gaussian distribution</h3>
<p>A Gaussian random variable in $\mathbb{R}$ has mean $\mu$ and variance $\sigma^2$.</p>
<script type="math/tex; mode=display">\mathcal{N}(y \mid \mu, \sigma^2) =
\frac{1}{\sqrt{2\pi\sigma^2}}
\exp{\left[ -\frac{(y-\mu)^2}{2\sigma^2} \right]}</script>
<p>For a Gaussian random <em>vector</em> (instead of a single random variable) with mean $\pmb{\mu}$ and covariance $\pmb{\Sigma}$ (which is positive semi-definite), the density is:</p>
<script type="math/tex; mode=display">\pmb{\mathcal{N}}(\pmb{y} \mid \pmb{\mu}, \pmb{\Sigma}) =
\frac{1}
{\sqrt{(2\pi)^N \text{ det}(\pmb{\Sigma})}}
\exp{\left[ -\frac{1}{2} (\pmb{y} - \pmb{\mu})^T \pmb{\Sigma}^{-1} (\pmb{y} - \pmb{\mu}) \right]}</script>
<p>Remember that $\pmb{y} \in \mathbb{R}^N$.</p>
<p>As another reminder, two variables $x$ and $y$ are said to be <strong>independent</strong> when $p(x, y) = p(x)p(y)$.</p>
<h3 id="a-probabilistic-model-for-least-squares">A probabilistic model for least squares</h3>
<p>We assume that our data is generated by a linear model $\pmb{x}_n^T\pmb{w}$, with added Gaussian noise $\epsilon_n$:</p>
<script type="math/tex; mode=display">y_n = \pmb{x}_n^T\pmb{w} + \epsilon_n</script>
<p>This is often a realistic assumption in practice.</p>
<p><img src="/images/ml/gaussian-noise.png" alt="Noise generated by a Gaussian source" /></p>
<p>The noise is $\epsilon_n \overset{\text{i.i.d.}}{\sim} \mathcal{N}(\epsilon_n \mid \mu = 0, \sigma^2)$ for each sample $n$. In other words, it is centered at 0, has a certain variance, and the error in each sample is independent of that in the other samples. The model $\pmb{w}$ is, as always, unknown.</p>
<p>Given $N$ samples, the <strong>likelihood</strong> of the data vector $\pmb{y} = (y_1, \dots, y_N)$ given the model $\pmb{w}$ and the input $\pmb{X}$ (where each row is one input) is:</p>
<script type="math/tex; mode=display">p(\pmb{y} \mid \pmb{X}, \pmb{w})
= \prod_{n=1}^N {p(y_n \mid \pmb{x}_n, \pmb{w})}
= \prod_{n=1}^N {\mathcal{N}(y_n \mid \pmb{x}_n^T\pmb{w}, \sigma^2)}</script>
<p>Intuitively, we’d like to pick the model $\pmb{w}$ that maximizes this likelihood: the best model is the one under which the observed data is most probable.</p>
<h3 id="defining-cost-with-log-likelihood">Defining cost with log-likelihood</h3>
<p>The log-likelihood (LL) is given by:</p>
<script type="math/tex; mode=display">\mathcal{L}_{LL} := \log{p(\pmb{y} \mid \pmb{X}, \pmb{w})}
= - \frac{1}{2\sigma^2} \sum_{n=1}^N{\left(y_n - \pmb{x}_n^T\pmb{w}\right)^2} + \text{ cnst}</script>
<p>Taking the log allows us to get away from the nasty product, and get a nice sum instead.</p>
<p>Notice that this definition looks pretty similar to MSE:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{MSE}}(\pmb{w}) := \frac{1}{N} \sum_{n=1}^N \left(y_n - f(\pmb{x}_n)\right)^2</script>
<p>Note that we would like to minimize MSE, but we want LL to be as high as possible (intuitively, we can look at the sign to understand that).</p>
<h3 id="maximum-likelihood-estimator-mle">Maximum likelihood estimator (MLE)</h3>
<p>Maximizing the log-likelihood (and thus the likelihood) will be equivalent to minimizing the MSE; this gives us another way to design cost functions. We can describe the whole process as:</p>
<script type="math/tex; mode=display">\argmin_{\pmb{w}}{\mathcal{L}_\text{MSE}(\pmb{w})} =
\argmax_{\pmb{w}}{\mathcal{L}_\text{LL}(\pmb{w})}</script>
<p>The maximum likelihood estimator (MLE) can be understood as finding the model under which the observed data is most likely to have been generated from (probabilistically). This interpretation has some advantages that we discuss below.</p>
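<p>The equivalence can be illustrated numerically. The sketch below assumes a one-parameter linear model with Gaussian noise of known $\sigma$, and evaluates both criteria on a grid of candidate weights; all names and values are illustrative assumptions:</p>

```python
import numpy as np


def gaussian_log_likelihood(w, X, y, sigma):
    """Log-likelihood of y under y_n = x_n^T w + Gaussian noise of std sigma."""
    residuals = y - X @ w
    n = y.shape[0]
    constant = -0.5 * n * np.log(2 * np.pi * sigma ** 2)
    return constant - np.sum(residuals ** 2) / (2 * sigma ** 2)


def mse(w, X, y):
    """Mean squared error of the linear model w."""
    return np.mean((y - X @ w) ** 2)


rng = np.random.default_rng(3)
X = rng.normal(size=(200, 1))
y = X @ np.array([2.0]) + 0.1 * rng.normal(size=200)  # true weight is 2.0

candidates = np.linspace(0.0, 4.0, 81)
ll_values = [gaussian_log_likelihood(np.array([c]), X, y, sigma=0.1) for c in candidates]
mse_values = [mse(np.array([c]), X, y) for c in candidates]

best_by_ll = candidates[int(np.argmax(ll_values))]
best_by_mse = candidates[int(np.argmin(mse_values))]
```

<p>Because the log-likelihood is a constant minus a positive multiple of the MSE, both criteria select the same candidate.</p>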
<h4 id="properties-of-mle">Properties of MLE</h4>
<p>The log-likelihood that MLE maximizes is a <em>sample</em> approximation to the <em>expected log-likelihood</em>. In other words, if we had an infinite amount of data, the (averaged) log-likelihood would be exactly equal to its expected value.</p>
<script type="math/tex; mode=display">\mathcal{L}_{LL}(\pmb{w})
\approx \expectsub{p(y, \pmb{x})}{\log{p(y \mid \pmb{x}, \pmb{w})}}</script>
<p>This means that MLE is <strong>consistent</strong>, i.e. it gives us the correct model assuming we have enough data. In probability, we can write this as:</p>
<script type="math/tex; mode=display">\pmb{w}_\text{MLE} \longrightarrow^p \pmb{w}_\text{true}</script>
<p>This sounds amazing, but the catch is that this all is under the assumption that the noise $\epsilon$ indeed was generated under a Gaussian model.</p>
<h2 id="overfitting-and-underfitting">Overfitting and underfitting</h2>
<h3 id="underfitting-with-linear-models">Underfitting with linear models</h3>
<p>Linear models can very easily underfit; as soon as the data itself is given by anything more complex than a line, fitting a linear model will underfit: the model is too simple for the data, and we’ll have huge errors.</p>
<p>But we can also easily overfit, where our model learns the specificities of the data too intimately. This happens quite easily with linear combinations of high-degree polynomials.</p>
<h3 id="extended-feature-vectors">Extended feature vectors</h3>
<p>We can actually get high-degree linear combinations of polynomials, but still keep our linear model. Instead of making the model more complex, we simply “augment” the input to become degree $M$. If the input is one-dimensional, we can add a polynomial basis to the input:</p>
<script type="math/tex; mode=display">% <![CDATA[
\pmb{\phi}(x_n) =
\begin{bmatrix}
1 & x_n & x_n^2 & x_n^3 & \dots & x_n^M
\end{bmatrix} %]]></script>
<p>Note that stacking these row vectors for all $n$ gives what is basically a <a href="https://en.wikipedia.org/wiki/Vandermonde_matrix">Vandermonde matrix</a>.</p>
<p>We then fit a linear model to this extended feature vector $\pmb{\phi}(x_n)$:</p>
<script type="math/tex; mode=display">y_n \approx w_0 + w_1 x_n + w_2 x_n^2 + \dots + w_M x_n^M =: \pmb{\phi}(x_n)^T\pmb{w}</script>
<p>Here, $\pmb{w}\in\mathbb{R}^{M+1}$. In other words, there are $M+1$ parameters in a degree $M$ extended feature vector. One should be careful with this degree; too high may overfit, too low may underfit.</p>
<p>If it is important to distinguish the original input $\pmb{x}$ from the augmented input $\pmb{\phi}(\pmb{x})$ then we will use the $\pmb{\phi}(\pmb{x})$ notation. But often, we can just consider this as a part of the pre-processing, and simply write $\pmb{x}$ as the input, which will save us a lot of notation.</p>
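<p>As a sketch of this pre-processing step, the polynomial expansion of a one-dimensional input can be built with <code class="highlighter-rouge">np.vander</code>. The function name <code class="highlighter-rouge">polynomial_features</code> and the toy data are assumptions made for illustration:</p>

```python
import numpy as np


def polynomial_features(x, degree):
    """Map each scalar x_n to the row [1, x_n, x_n^2, ..., x_n^degree]."""
    return np.vander(x, N=degree + 1, increasing=True)


# Hypothetical data generated by a degree-2 polynomial, without noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x - 0.5 * x ** 2

# The model stays linear in w; only the input has been augmented.
Phi = polynomial_features(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
```

<p>Here <code class="highlighter-rouge">Phi</code> has $M + 1 = 3$ columns, and the fitted weights recover the coefficients $(1, 2, -0.5)$ of the generating polynomial.</p>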
<h3 id="reducing-overfitting">Reducing overfitting</h3>
<p>To reduce overfitting, we can choose a less complex model (in the above, we can pick a lower degree $M$), but we could also just add more data:</p>
<p><img src="/images/ml/reduce-overfit-add-data.png" alt="An overfitted model acts more reasonably when we add a bunch of data" /></p>
<h2 id="regularization">Regularization</h2>
<p>To prevent overfitting, we can introduce <strong>regularization</strong> to penalize complex models. This can be applied to any model.</p>
<p>The idea is to not only minimize cost, but also minimize a regularizer:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}} {\mathcal{L}(\pmb{w}) + \Omega(\pmb{w})}</script>
<p>The $\Omega$ function is the regularizer, measuring the complexity of the model. We’ll see some good candidates for the regularizer below.</p>
<h3 id="l_2-regularization-ridge-regression">$L_2$-Regularization: Ridge Regression</h3>
<p>The most frequently used regularizer is the standard Euclidean norm ($L_2$-norm):</p>
<script type="math/tex; mode=display">\Omega(\pmb{w}) = \lambda \norm{\pmb{w}}^2_2</script>
<p>Where $\lambda \ge 0$. The value of $\lambda$ will affect the fit; $\lambda \rightarrow 0$ can lead to overfitting, while $\lambda \rightarrow \infty$ can lead to underfitting.</p>
<p>The norm is given by:</p>
<script type="math/tex; mode=display">\norm{\pmb{w}}_2^2 = \sum_i{w_i^2}</script>
<p>The main effect of this is that large model weights $w_i$ will be penalized, while small ones won’t affect our minimization too much.</p>
<h4 id="ridge-regression">Ridge regression</h4>
<p>Depending on the values we choose for $\mathcal{L}$ and $\Omega$, we get into some special cases. For instance, choosing MSE for $\mathcal{L}$ and the $L_2$ regularizer for $\Omega$ is called <strong>ridge regression</strong>, in which we optimize the following:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}} {\left(\frac{1}{2N} \sum_{n=1}^N \left[y_n - f(\pmb{x}_n)\right]^2 \quad + \quad \Omega(\pmb{w})\right)}</script>
<p>Least squares is also a special case of ridge regression, where $\lambda = 0$.</p>
<p>We can find an explicit solution for $\pmb{w}$ in ridge regression by differentiating the cost and the regularizer, and setting the sum of their gradients to zero:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla \mathcal{L}(\pmb{w}) & = -\frac{1}{N} \pmb{X}^T (\pmb{y} - \pmb{Xw}) \\ \\
\nabla \Omega(\pmb{w}) & = 2\lambda \pmb{w} \\
\end{align} %]]></script>
<p>We can now set the gradient of the full cost to zero, which gives us the result:</p>
<script type="math/tex; mode=display">\pmb{w}^*_\text{ridge} = (\pmb{X}^T\pmb{X} + \lambda' \pmb{I})^{-1}\pmb{X}^T\pmb{y}</script>
<p>Where $\frac{\lambda'}{2N} = \lambda$. Note that for $\lambda = 0$, we indeed recover the least squares solution.</p>
<h4 id="ridge-regression-to-fight-ill-conditioning">Ridge regression to fight ill-conditioning</h4>
<p>This formulation of $\pmb{w}^*$ is quite nice, because adding the identity matrix helps us get something that always is invertible; in cases where we have ill-conditioned matrices, it also means that we can invert with more stability.</p>
<p>We’ll prove that the matrix indeed is invertible. The gist is that the eigenvalues of $(\pmb{X}^T\pmb{X} + \lambda' \pmb{I})$ are all at least $\lambda'$.</p>
<p>To prove it, we’ll write the singular value decomposition (SVD) of $\pmb{X}^T\pmb{X}$ as $\pmb{USU}^T$. We then have:</p>
<script type="math/tex; mode=display">\pmb{X}^T\pmb{X} + \lambda'\pmb{I} = \pmb{USU}^T + \lambda'\pmb{UIU}^T = \pmb{U}(\pmb{S} + \lambda'\pmb{I})\pmb{U}^T</script>
<p>Every singular value is “lifted” by an amount $\lambda'$. There’s an alternative proof in the class notes, but we won’t go into that.</p>
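<p>The sketch below illustrates both points on a deliberately collinear data matrix, for which the plain Gram matrix is singular. The function name <code class="highlighter-rouge">ridge_regression</code> and the data are assumptions made for illustration; the code uses $\lambda' = 2N\lambda$ as defined above:</p>

```python
import numpy as np


def ridge_regression(X, y, lam):
    """Closed-form ridge solution, using lambda' = 2 * N * lambda."""
    N, D = X.shape
    lam_prime = 2 * N * lam
    return np.linalg.solve(X.T @ X + lam_prime * np.eye(D), X.T @ y)


rng = np.random.default_rng(4)
column = rng.normal(size=(30, 1))
X = np.hstack([column, column])  # two identical columns: X^T X is singular
y = rng.normal(size=30)

lam = 0.01
w_ridge = ridge_regression(X, y, lam)

# Every eigenvalue of the regularized matrix is lifted to at least lambda'.
lam_prime = 2 * X.shape[0] * lam
regularized = X.T @ X + lam_prime * np.eye(2)
min_eigenvalue = np.linalg.eigvalsh(regularized).min()

# The gradient of the full ridge cost vanishes at the solution.
full_gradient = -X.T @ (y - X @ w_ridge) / X.shape[0] + 2 * lam * w_ridge
```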
<h3 id="l_1-regularization-the-lasso">$L_1$-Regularization: The Lasso</h3>
<p>We can use a different norm as an alternative measure of complexity. The combination of $L_1$-norm and MSE is known as <strong>The Lasso</strong>:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}} {\frac{1}{2N} \sum_{n=1}^N \left[y_n - f(\pmb{x}_n)\right]^2 \quad + \quad \lambda \norm{w}_1}</script>
<p>Where the $L_1$-norm is defined as</p>
<script type="math/tex; mode=display">\norm{w}_1 := \sum_i{\abs{w_i}}</script>
<p>If we draw out a constant value of the $L_1$ norm, we get a sort of “ball”. Below, we’ve graphed $\left\{ \pmb{w} \mid \norm{\pmb{w}}_1 \le 5 \right\}$.</p>
<p><img src="/images/ml/lasso.png" alt="Graph of the lasso" /></p>
<p>To keep things simple in the following, we’ll just claim that $\pmb{X}^T\pmb{X}$ is invertible. We’ll also claim that the following set is an ellipsoid which scales around its center as we change $\alpha$:</p>
<script type="math/tex; mode=display">\left\{
\pmb{w} \mid \norm{\pmb{y} - \pmb{Xw}}^2 = \alpha
\right\}</script>
<p>The slides have a formal proof for this, but we won’t get into it.</p>
<p>Note that the above definition of the set corresponds to the set of points with equal loss (which we can assume is MSE, for instance):</p>
<script type="math/tex; mode=display">\left\{
\pmb{w} \mid \mathcal{L}(\pmb{w}) = \alpha
\right\}</script>
<p>Under these assumptions, we claim that for $L_1$ regularization, the optimum solution will likely be sparse (many zero components) compared to $L_2$ regularization.</p>
<p>To see why, suppose we know the $L_1$ norm of the optimum solution. Visualizing that ball, we know that our optimum solution $\pmb{w}^*$ will be somewhere on the surface of that ball. We also know that there are ellipsoids, all with the same center and orientation, describing the equal error surfaces. The optimum solution is where the “smallest” of these ellipsoids just touches the $L_1$ ball.</p>
<p><img src="/images/ml/ball-ellipse.png" alt="Intersection of the L1 ball and the cost ellipses" /></p>
<p>Due to the geometry of this ball, this point is more likely to be one of the “corner” points, where some components are exactly zero. In turn, sparsity is desirable, since it leads to a “simple” model.</p>
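<p>To see the sparsity effect numerically, here is a small sketch that solves the Lasso with ISTA (proximal gradient descent with soft-thresholding). Note that ISTA is a technique brought in for this illustration, not something introduced in these notes, and the data is a hypothetical design with orthogonal columns chosen so that the result is easy to verify by hand:</p>

```python
import numpy as np


def soft_threshold(v, threshold):
    """Shrink each component towards 0 by threshold, clipping at 0."""
    return np.sign(v) * np.maximum(np.abs(v) - threshold, 0.0)


def lasso_ista(X, y, lam, num_steps=500):
    """Minimize (1 / 2N) * ||y - Xw||^2 + lam * ||w||_1 with ISTA."""
    N, D = X.shape
    # Step size 1 / L, where L = sigma_max(X)^2 / N bounds the gradient's Lipschitz constant.
    gamma = N / np.linalg.norm(X, 2) ** 2
    w = np.zeros(D)
    for _ in range(num_steps):
        grad = -X.T @ (y - X @ w) / N
        w = soft_threshold(w - gamma * grad, gamma * lam)
    return w


# Hypothetical design with orthogonal columns, so that X^T X / N = I.
X = np.array([[1.0, 1.0], [1.0, -1.0], [1.0, 1.0], [1.0, -1.0]])
y = X @ np.array([3.0, 0.1])
lam = 0.5

w_lasso = lasso_ista(X, y, lam)

# Ridge with the same lambda, for comparison (closed form, lambda' = 2 * N * lambda).
N, D = X.shape
w_ridge = np.linalg.solve(X.T @ X + 2 * N * lam * np.eye(D), X.T @ y)
```

<p>On this toy problem the Lasso sets the small second weight exactly to zero, giving $(2.5, 0)$, whereas ridge only shrinks both weights, giving $(1.5, 0.05)$.</p>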
<h2 id="model-selection">Model selection</h2>
<p>As we’ve seen in ridge regression, we have a <em>regularization parameter</em> $\lambda > 0$ that can be tuned to reduce overfitting by reducing model complexity. We say that the parameter $\lambda$ is a <strong>hyperparameter</strong>.</p>
<p>We’ve also seen ways to enrich model complexity, like <a href="#extended-feature-vectors">polynomial feature expansion</a>, in which the degree $M$ is also a hyperparameter.</p>
<p>We’ll now see how best to choose these hyperparameters; this is called the <strong>model selection</strong> problem.</p>
<h3 id="probabilistic-setup">Probabilistic setup</h3>
<p>We assume that there is an (unknown) underlying distribution $\mathcal{D}$ producing the dataset, with range $\mathcal{X}\times\mathcal{Y}$. The dataset $\mathcal{S}$ we see is produced from samples from $\mathcal{D}$:</p>
<script type="math/tex; mode=display">S = \left\{(\pmb{x}_n, y_n)
\overset{\text{i.i.d}}{\sim}
\mathcal{D}\right\}_{n=1}^N</script>
<p>Based on this, the <em>learning algorithm</em> chooses the “best” model, under the parameters of the algorithm.</p>
<p>We write $f_S = \mathcal{A}(S)$, where $\mathcal{A}$ denotes the learning algorithm. It depends on the data subset we’re given, and $f_S$ is the resulting prediction function of our model.</p>
<p>To indicate that $f_S$ sometimes depends on hyperparameters, we’ll write $f_{S, \lambda}$.</p>
<h3 id="training-error-vs-generalization-error">Training Error vs. Generalization Error</h3>
<p>Given a model $f$, how can we assess if $f$ is any good? We already have the loss function, but its value on a particular dataset depends heavily on the noise in that data, not only on how good the model is. Instead, we can compute the <em>expected error</em> over all samples chosen according to $\mathcal{D}$.</p>
<script type="math/tex; mode=display">L_\mathcal{D}(f) = \expectsub{\mathcal{D}}{\mathcal{l}(y, f(\pmb{x}))}</script>
<p>Where $\mathcal{l}(\cdot, \cdot)$ is our loss function; e.g. for ridge regression, $\mathcal{l}(y, f(\pmb{x})) = \frac{1}{2}(y-f(\pmb{x}))^2$.</p>
<p>The quantity $L_\mathcal{D}(f)$ has many names, including <strong>generalization error</strong> (or true/expected error/risk/loss). This is the quantity that we are fundamentally interested in, but we cannot compute it since $\mathcal{D}$ is unknown.</p>
<p>What we do know is the data subset $\mathcal{S}$. It’s therefore natural to compute the equivalent <em>empirical</em> quantity, which is the average loss:</p>
<script type="math/tex; mode=display">L_S(f) = \frac{1}{\abs{S}} \sum_{(\pmb{x}_n, y_n)\in S} {\mathcal{l}(y_n, f(\pmb{x}_n))}</script>
<p>But again, we run into trouble. The function $f$ is itself a function of the data $S$, so what we really do is to compute the quantity:</p>
<script type="math/tex; mode=display">L_S(f_S) = \frac{1}{\abs{S}} \sum_{(\pmb{x}_n, y_n)\in S} {\mathcal{l}(y_n, f_S(\pmb{x}_n))}</script>
<p>$f_S$ is the trained model. This is called the <strong>training error</strong>. Usually, the training error is smaller than the generalization error, because overfitting can happen (even with regularization, because the hyperparameter may still be too low).</p>
<h3 id="splitting-the-data">Splitting the data</h3>
<p>To avoid validating the model on the same data subset we trained it on (which is conducive to overfitting), we can split the data into a <strong>training set</strong> and a <strong>test set</strong> (aka <em>validation set</em>), which we call $\Strain$ and $\Stest$.</p>
<p>We apply the learning algorithm $\mathcal{A}$ on the training set $\Strain$, and compute the function $f_{\Strain}$. We then compute the error on the test set:</p>
<script type="math/tex; mode=display">L_{\Stest}(f_{\Strain}) = \frac{1}{\abs{\Stest}} \sum_{(\pmb{x}_n, y_n)\in \Stest} {\mathcal{l}(y_n, f_{\Strain}(\pmb{x}_n))}</script>
<p>If we have duplicates in our data, then this could be a bit dangerous. Still, in general, this really helps us with the problem of overfitting since $\Stest$ is a “fresh” sample, which means that we can hope that $L_{\Stest}(f_{\Strain})$ defined above is close to the quantity $L_\mathcal{D}(f_{\Strain})$. Indeed, <em>in expectation</em> both are the same:</p>
<script type="math/tex; mode=display">L_\mathcal{D}(f_{\Strain})
= \expectsub{\Stest\sim\mathcal{D}}{
L_{\Stest}(f_{\Strain})
}</script>
<p>This is quite a nice property, but it comes at a cost: we paid a price by splitting the data and thus reducing the size of our training data, though this can be mitigated using cross-validation, which we’ll see later.</p>
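<p>A minimal sketch of such a split, assuming synthetic data and the squared loss $\frac{1}{2}(y - f(\pmb{x}))^2$ used above. The helper names <code class="highlighter-rouge">train_test_split</code> and <code class="highlighter-rouge">average_loss</code> are illustrative assumptions:</p>

```python
import numpy as np


def train_test_split(X, y, test_fraction, seed):
    """Randomly split the data into a training set and a test set."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(X.shape[0])
    num_test = int(round(test_fraction * X.shape[0]))
    test_idx, train_idx = indices[:num_test], indices[num_test:]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]


def average_loss(w, X, y):
    """Empirical loss L_S(f) for the linear model f(x) = x^T w."""
    return np.mean(0.5 * (y - X @ w) ** 2)


rng = np.random.default_rng(5)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.1 * rng.normal(size=100)

X_train, y_train, X_test, y_test = train_test_split(X, y, test_fraction=0.2, seed=0)

# Train on the training set only, then evaluate on both sets.
w, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)
train_error = average_loss(w, X_train, y_train)
held_out_error = average_loss(w, X_test, y_test)
```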
<h3 id="generalization-error-vs-test-error">Generalization error vs test error</h3>
<p>Assume that we have a model $f$ and that our loss function $\mathcal{l}(\cdot, \cdot)$ is bounded in $[a, b]$. We are given a test set $\Stest$ chosen i.i.d. from the underlying distribution $\mathcal{D}$.</p>
<p>How far apart is the test error (empirical) from the true generalization error? As we’ve seen above, they are the same in expectation. But we need to worry about the variation, about how far off from the true error we typically are:</p>
<p>We claim that:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\abs{L_\mathcal{D}(f) - L_{\Stest}(f)}
\ge
\sqrt{\frac{(b-a)^2 \ln{(2/\delta)}}{2\abs{\Stest}}}
\right]
\le \delta
\label{eq:loss-bound}
\tag{loss-bound}</script>
<p>Where $\delta > 0$ is a quality parameter. This gives us an upper bound on how far away our empirical loss is from the true loss.</p>
<p>This bound gives us some nice insights. Error decreases in the size of the test set as $\mathcal{O}(1/\sqrt{\abs{\Stest}})$, so the more data points we have, the more confident we can be in the empirical loss being close to the true loss.</p>
<p>We’ll prove $\ref{eq:loss-bound}$. We assumed that each sample in the test set is chosen independently. Therefore, given a model $f$, the associated losses $\mathcal{l}(y_n, f(\pmb{x}_n))$ are also i.i.d. random variables, taking values in $[a, b]$ by assumption. We can call each such loss $\Theta_n$:</p>
<script type="math/tex; mode=display">\Theta_n = \mathcal{l}(y_n, f(\pmb{x}_n))</script>
<p>This is just a naming alias; since the underlying value is that of the loss function, the expected value of $\Theta_n$ is simply that of the loss function, which is the true loss:</p>
<script type="math/tex; mode=display">\expect{\Theta_n} = \expect{\mathcal{l}(y_n, f(\pmb{x}_n))} = L_\mathcal{D}(f)</script>
<p>The empirical loss on the other hand is equal to the average of $\abs{\Stest}$ such i.i.d. values.</p>
<p>The formula of $\ref{eq:loss-bound}$ gives us the probability that empirical loss $L_{\Stest}(f)$ diverges from the true loss by more than a given constant, which is a classical problem addressed in the following lemma (which we’ll just assert, not prove).</p>
<p><strong>Chernoff Bound</strong>: Let $\Theta_1, \dots, \Theta_N$ be a sequence of i.i.d. random variables with mean $\expect{\Theta}$ and range $[a, b]$. Then, for any $\epsilon > 0$:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\abs{\frac{1}{N}\sum_{n=1}^N {\Theta_n - \expect{\Theta}}}
\ge
\epsilon
\right]
\le
2\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}
\label{eq:Chernoff}
\tag{Chernoff}</script>
<p>Using $\ref{eq:Chernoff}$ we can show $\ref{eq:loss-bound}$. By setting $\delta = 2\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}$ with $N = \abs{\Stest}$ and solving for $\epsilon$, we find that $\epsilon = \sqrt{\frac{(b-a)^2 \ln{(2/\delta)}}{2\abs{\Stest}}}$ as claimed.</p>
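<p>To get a feeling for the numbers, the bound can be evaluated directly. The function name <code class="highlighter-rouge">deviation_bound</code> and the chosen values (a loss bounded in $[0, 1]$, $\delta = 0.05$) are assumptions made for illustration:</p>

```python
import math


def deviation_bound(a, b, num_test, delta):
    """Deviation epsilon exceeded with probability at most delta, per the bound above."""
    return math.sqrt((b - a) ** 2 * math.log(2 / delta) / (2 * num_test))


eps_1k = deviation_bound(a=0.0, b=1.0, num_test=1000, delta=0.05)
eps_4k = deviation_bound(a=0.0, b=1.0, num_test=4000, delta=0.05)
```

<p>With 1000 test samples the deviation is at most about 0.043 with probability at least 95%; quadrupling the test set halves that value, in line with the $1/\sqrt{\abs{\Stest}}$ rate.</p>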
<h3 id="model-selection-1">Model selection</h3>
<p>Our main goal was to look for a way to select the hyperparameters of our model. Given a finite set of values $\lambda_k$ for $k=1, \dots, K$ of a hyperparameter $\lambda$, we can run the learning algorithm $K$ times on the same training set $\Strain$, and compute the $K$ prediction functions $f_{\Strain, \lambda_k}$. For each such prediction function we compute the test error, and choose the $\lambda_k$ which minimizes the test error.</p>
<p>This is essentially a grid search on $\lambda$ using the test error function.</p>
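<p>A sketch of this grid search for ridge regression, on synthetic data. The helper names, the candidate values of $\lambda$ and the fixed train/test split are all assumptions made for illustration:</p>

```python
import numpy as np


def ridge_fit(X, y, lam):
    """Closed-form ridge solution, using lambda' = 2 * N * lambda."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + 2 * N * lam * np.eye(D), X.T @ y)


def average_loss(w, X, y):
    """Empirical loss for the linear model f(x) = x^T w."""
    return np.mean(0.5 * (y - X @ w) ** 2)


def grid_search(lambdas, X_train, y_train, X_test, y_test):
    """Train once per candidate lambda and keep the one with lowest test error."""
    errors = [
        average_loss(ridge_fit(X_train, y_train, lam), X_test, y_test)
        for lam in lambdas
    ]
    best = int(np.argmin(errors))
    return lambdas[best], errors


rng = np.random.default_rng(6)
X = rng.normal(size=(60, 5))
y = X @ np.array([1.0, 0.0, -1.0, 0.5, 2.0]) + 0.5 * rng.normal(size=60)
X_train, y_train = X[:40], y[:40]
X_test, y_test = X[40:], y[40:]

lambdas = [0.0, 0.001, 0.01, 0.1, 1.0]
best_lam, errors = grid_search(lambdas, X_train, y_train, X_test, y_test)
```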
<h4 id="model-selection-based-on-test-error">Model selection based on test error</h4>
<p>How do we know that, for a fixed function $f$, $L_{\Stest}(f)$ is a good approximation to $L_\mathcal{D}(f)$?</p>
<p>The answer to this follows the same idea as when we talked about <a href="#generalization-error-vs-test-error">generalization vs test error</a>, but we now assume that we have $K$ models $f_k$ for $k=1, \dots, K$. We assume again that the loss function is bounded in $[a, b]$, and that we’re given a test set whose samples are chosen i.i.d. in $\mathcal{D}$.</p>
<p>How far is each of the $K$ (empirical) test errors $L_{\Stest}(f_k)$ from the true $L_\mathcal{D}(f_k)$? As before, we can bound the deviation for all $K$ candidates, by:</p>
<script type="math/tex; mode=display">\mathbb{P}\left[
\max_k {\abs{L_\mathcal{D}(f_k) - L_{\Stest}(f_k)}}
\ge
\sqrt{\frac{(b-a)^2 \ln{(2K/\delta)}}{2\abs{\Stest}}}
\right]
\le \delta</script>
<p>A bit of intuition about where this comes from: for a general $K$, we check the deviations of the $K$ test errors and ask for the probability that at least one of them deviates by at least $\epsilon$. The Chernoff bound answers this question for a single instance; by the union bound, the probability for $K$ instances is at most $K$ times as large. Thus the upper bound in Chernoff becomes $2K\exp{\left(\frac{-2N\epsilon^2}{(b-a)^2}\right)}$, and setting it equal to $\delta$ gives us $\epsilon = \sqrt{\frac{(b-a)^2 \ln{(2K/\delta)}}{2\abs{\Stest}}}$ as above.</p>
<p>As before, this tells us that error decreases in $\mathcal{O}(1/\sqrt{\abs{\Stest}})$. Now that we test $K$ hyperparameters, the bound only grows slowly, on the order of $\sqrt{\ln{(K)}}$. For $K = 1$, we recover $\ref{eq:loss-bound}$, which we proved above as a special case.</p>
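<p>To get a feeling for how slowly the bound grows with $K$, here is a small sketch that evaluates the $\epsilon$ from the formula above. The function name <code>deviation_bound</code> and the chosen numbers are illustrative assumptions:</p>

```python
import math

def deviation_bound(n_test, delta, num_models=1, a=0.0, b=1.0):
    """Epsilon such that, with probability at least 1 - delta, all num_models
    test errors are within epsilon of their true errors (loss bounded in [a, b])."""
    return math.sqrt((b - a) ** 2 * math.log(2 * num_models / delta) / (2 * n_test))

# Same test set size and confidence, increasing number of candidate models.
for k in (1, 10, 100, 1000):
    print(k, round(deviation_bound(n_test=10_000, delta=0.05, num_models=k), 4))
```

<p>Quadrupling the size of the test set halves the bound, whereas going from one candidate to a thousand candidates less than doubles it.</p>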
<h3 id="cross-validation">Cross-validation</h3>
<p>Splitting the data once into two parts (one for training and one for testing) is not the most efficient way to use the data. Cross-validation is a better way.</p>
<p>K-fold cross-validation is a popular variant. We randomly partition the data into $K$ groups, and train $K$ times. Each time, we use one of the $K$ groups as our test set, and the remaining $K-1$ groups for training.</p>
<p>To get a common result, we average out the $K$ results. This means we’ll use the average weights to get the average test error over the $K$ folds.</p>
<p>Cross-validation returns an unbiased estimate of the generalization error and its variance.</p>
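<p>The procedure can be sketched as follows. This is only an illustration under simple assumptions: the model is a one-dimensional least-squares fit through the origin, the data is synthetic, and the per-fold test errors are what gets averaged. The names <code>k_fold_indices</code> and <code>cross_validate</code> are hypothetical helpers:</p>

```python
import random

def k_fold_indices(n, k, rng):
    """Randomly partition range(n) into k folds of (almost) equal size."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, k, fit, loss, seed=0):
    """Average test loss over k folds; each fold serves as the test set once."""
    folds = k_fold_indices(len(xs), k, random.Random(seed))
    fold_losses = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        fold_losses.append(
            loss(model, [xs[i] for i in test_idx], [ys[i] for i in test_idx])
        )
    return sum(fold_losses) / k, fold_losses

# Illustrative model (an assumption): 1D least squares through the origin.
def fit(xs, ys):
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def loss(w, xs, ys):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)
xs = [rng.uniform(-1, 1) for _ in range(100)]
ys = [3 * x + rng.gauss(0, 0.1) for x in xs]
avg_loss, per_fold = cross_validate(xs, ys, k=5, fit=fit, loss=loss)
print(avg_loss, per_fold)
```

<p>The spread of the per-fold losses gives a rough idea of the variance of the estimate, in addition to its average.</p>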
<h3 id="bias-variance-decomposition">Bias-Variance decomposition</h3>
<p>When we perform model selection, there is an inherent <a href="https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff">bias–variance</a> trade-off.</p>
<figure>
<img src="/images/ml/bias-variance.png" alt="Bullseye representation of bias vs variance" />
<figcaption>Graphical illustration of bias and variance. Taken from <a href="http://scott.fortmann-roe.com/docs/BiasVariance.html">Scott Fortmann-Roe's website</a></figcaption>
</figure>
<p>For now, we’ll just look at “high-bias & low-variance” models, and “high-variance & low-bias” models.</p>
<ul>
<li><strong>High-bias & low-variance</strong>: the model is too simple. It’s underfit, has a large bias, and the variance of $L_\mathcal{D}(f_S)$ is small.</li>
<li><strong>High-variance & low-bias</strong>: the model is too complex. It’s overfit, has a small bias and large variance of $L_\mathcal{D}(f_S)$ (as a single addition of a data point is likely to change the prediction function $f_S$ considerably).</li>
</ul>
<p>Consider a linear regression with one-dimensional input and <a href="#extended-feature-vectors">polynomial feature expansion</a> of degree $d$. The former can be achieved by picking a too low value for $d$, while the latter by picking $d$ too high. The same principle applies for other parameters, such as ridge regression with hyperparameter $\lambda$.</p>
<h4 id="data-generation-model">Data generation model</h4>
<p>Let’s assume that our data is generated by some arbitrary, unknown function $f$, and a noise source with distribution $\mathcal{D}_\epsilon$ (i.i.d. from sample to sample, and independent from the data). We can think of $f$ representing the precise, hypothetical function that perfectly produced the data. We assume that the noise has mean zero (without loss of generality, as a non-zero mean could be encoded into $f$).</p>
<script type="math/tex; mode=display">y = f(\pmb{x}) + \epsilon</script>
<p>We assume that $\pmb{x}$ is generated according to some fixed but unknown distribution $\mathcal{D}_{\pmb{x}}$. We’ll be working with square loss as our loss function $\ell(\cdot, \cdot)$. We will denote the joint distribution on pairs $(\pmb{x}, y)$ as $\mathcal{D}$.</p>
<h4 id="error-decomposition">Error Decomposition</h4>
<p>As always, we have a training set $\Strain$, which consists of $N$ i.i.d. samples from $\mathcal{D}$. Given our learning algorithm $\mathcal{A}$, we compute the prediction function $f_{\Strain} = \mathcal{A}(\Strain)$. The square loss of a single prediction for a fixed element $\pmb{x}_0$ is given by the computation of:</p>
<script type="math/tex; mode=display">\bigl( y_0 - f_{\Strain}(\pmb{x}_0) \bigr)^2
=
\bigl( f(\pmb{x}_0) + \epsilon - f_{\Strain}(\pmb{x}_0) \bigr)^2</script>
<p>Our experiment was to create $\Strain$, learn $f_{\Strain}$, and then evaluate the performance by computing the square loss for a fixed element $\pmb{x}_0$. If we run this experiment many times, the expected value is written as:</p>
<script type="math/tex; mode=display">\expectsub{\Strain \sim \mathcal{D},\ \epsilon\sim\mathcal{D}_\epsilon}{
\left( f(\pmb{x}_0) + \epsilon - f_{\Strain}(\pmb{x}_0) \right)^2
}</script>
<p>We will now show that this expression can be rewritten as a sum of three non-negative terms, each of which we interpret further below:</p>
<script type="math/tex; mode=display">% <![CDATA[
\newcommand{\otherconstantterm}{\mathbb{E}_{S'_\text{train}\sim\mathcal{D}}\left[f_{S'_\text{train}}(\pmb{x}_0)\right]}
\begin{align}
& \expectsub{\Strain \sim \mathcal{D},\ \epsilon\sim\mathcal{D}_\epsilon} {
\left( f(\pmb{x}_0) + \epsilon - f_{\Strain}(\pmb{x}_0) \right)^2
} \\
\overset{(a)}{=}\ &
\expectsub{\epsilon\sim\mathcal{D}_\epsilon} {
\epsilon^2
}
+ \expectsub{\Strain \sim \mathcal{D}} {
\bigl(f(\pmb{x}_0) - f_{\Strain}(\pmb{x}_0)\bigr)^2
} \\
\overset{(b)}{=}\ &
\text{Var}_{\epsilon\sim\mathcal{D}_\epsilon}\left[\epsilon\right]
+ \expectsub{\Strain \sim \mathcal{D}}{
\bigl(f(\pmb{x}_0) - f_{\Strain}(\pmb{x}_0)\bigr)^2
} \\
\overset{(c)}{=}\ &
\underbrace{
\text{Var}_{\epsilon\sim\mathcal{D}_\epsilon}\left[\epsilon\right]
}_\text{noise variance} \\
& + \underbrace{
\left( f(\pmb{x}_0) - \otherconstantterm \right)^2
}_\text{bias} \\
& + \underbrace{
\expectsub{\Strain\sim\mathcal{D}} {
\left( \otherconstantterm - f_{\Strain}(\pmb{x}_0) \right)^2
}
}_\text{variance} \\
\end{align} %]]></script>
<p>Note that here, $S'_\text{train}$ is a second training set, also sampled from $\mathcal{D}$, that is independent of the training set $\Strain$. It follows the same distribution, but it is a different sample and thus produces a different trained model $f_{S'_\text{train}}$.</p>
<p>Step $(a)$ uses $(u+v)^2 = u^2 + 2uv + v^2$ as well as linearity of expectation to produce $\expect{(u+v)^2} = \expect{u^2} + 2\expect{uv} + \expect{v^2}$. Note that the $2\expect{uv}$ part is zero, as the noise $\epsilon$ has mean zero and is independent from $\Strain$.</p>
<p>Step $(b)$ uses the definition of variance as:</p>
<script type="math/tex; mode=display">\text{Var}(X) = \expect{(X - \expect{X})^2} = \expect{X^2} - \expect{X}^2</script>
<p>Seeing that our noise $\epsilon$ has mean zero, we have $\expect{\epsilon}^2 = 0$ and therefore $\text{Var}(\epsilon) = \expect{\epsilon^2}$.</p>
<p>In step $(c)$, we add and subtract the constant term $\otherconstantterm$ to the expression like so:</p>
<script type="math/tex; mode=display">\expectsub{S\sim \mathcal{D}}{\left(
\underbrace{f(\pmb{x}_0) - \otherconstantterm}_u + \underbrace{\otherconstantterm - f_S(\pmb{x}_0)}_v
\right)^2}</script>
<p>We can then expand the square. The $2uv$ part of the expansion is zero, as we show below:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \expectsub{S \sim \mathcal{D}} {
\left(
f(\pmb{x}_0) - \otherconstantterm
\right) \cdot \left(
\otherconstantterm - f_S(\pmb{x}_0)
\right)
} \\
& = \left(
f(\pmb{x}_0) - \otherconstantterm
\right) \cdot \expectsub{S\sim\mathcal{D}} {
\otherconstantterm - f_S(\pmb{x}_0)
} \\
& = \left(
f(\pmb{x}_0) - \otherconstantterm
\right) \cdot \left(
\otherconstantterm - \expectsub{S\sim\mathcal{D}}{f_S(\pmb{x}_0)}
\right) \\
& = 0 \\
\end{align} %]]></script>
<h4 id="interpretation-of-the-decomposition">Interpretation of the decomposition</h4>
<p>Each of the three terms is non-negative, so each of them is a lower bound on the expected loss when we predict the value for the input $\pmb{x}_0$.</p>
<ul>
<li>When the data contains <strong>noise</strong>, then that imposes a strict lower bound on the error we can achieve.</li>
<li>The <strong>bias term</strong> is a non-negative term that tells us how far we are from the true value, in expectation. It’s the square loss between the actual value $f(\pmb{x}_0)$ and the expected prediction, where the expectation is over the training sets. As <a href="#bias-variance-decomposition">we discussed above</a>, with a simple model we will not find a good fit on average, which means the bias will be large, which adds to the error we observe.</li>
<li>The <strong>variance term</strong> is the variance of the prediction function. For complex models, small variations in the data set can produce vastly different models, and our prediction will vary widely, which also adds to our total error.</li>
</ul>
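<p>The decomposition can be checked numerically by repeating the experiment many times. The sketch below is an illustration only: the true function, the noise level, and the two models (a constant predictor as the “too simple” model, and a nearest-neighbor predictor as the “too complex” one) are all assumptions chosen to keep the code short.</p>

```python
import math
import random
import statistics

def f(x):
    """True function (an arbitrary choice for illustration)."""
    return math.sin(2 * math.pi * x)

def sample_train(rng, n=20, noise_std=0.3):
    """Draw one training set of n noisy samples of f."""
    xs = [rng.random() for _ in range(n)]
    ys = [f(x) + rng.gauss(0, noise_std) for x in xs]
    return xs, ys

def constant_model(xs, ys, x0):
    """Very simple model: ignore x0 and predict the mean label."""
    return statistics.fmean(ys)

def nearest_neighbor_model(xs, ys, x0):
    """Very flexible model: copy the label of the closest training input."""
    return min(zip(xs, ys), key=lambda p: abs(p[0] - x0))[1]

def decompose(model, x0, runs=2000, seed=0):
    """Estimate squared bias and variance of the prediction at x0
    by retraining on many independent training sets."""
    rng = random.Random(seed)
    preds = [model(*sample_train(rng), x0) for _ in range(runs)]
    mean_pred = statistics.fmean(preds)
    bias_sq = (f(x0) - mean_pred) ** 2
    variance = statistics.pvariance(preds, mu=mean_pred)
    # Expected squared error without the noise term; adding the noise
    # variance (0.3 ** 2 here) gives the full expected loss.
    noise_free_error = statistics.fmean((f(x0) - p) ** 2 for p in preds)
    return bias_sq, variance, noise_free_error

for model in (constant_model, nearest_neighbor_model):
    bias_sq, variance, err = decompose(model, x0=0.25)
    print(model.__name__, bias_sq, variance, err)
```

<p>For each model, the squared bias and the variance add up to the noise-free error, and the constant model trades a large bias for a small variance while the nearest-neighbor model does the opposite.</p>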
<h2 id="classification">Classification</h2>
<p>When we did regression, our data was of the form:</p>
<script type="math/tex; mode=display">\mathcal{S}_\text{train} = \left\{(x_n, y_n)\right\}_{n=1}^N,
\qquad x_n \in \mathbb{R}^d,\ y_n \in\mathbb{R}</script>
<p>With <strong>classification</strong>, our output is no longer continuous, but discrete. Now, $y_n\in\left\{\mathcal{C}_0, \dots, \mathcal{C}_{K-1} \right\}$. If it can only take two values (i.e. $K=2$), then it is called <strong>binary classification</strong>. If it can take more than two values, it is <strong>multi-class classification</strong>.</p>
<p>There is no ordering among these classes, so we may sometimes denote these labels as $y\in\left\{0, 1, 2, \dots, K-1\right\}$.</p>
<p>If we have an underlying distribution $\mathcal{D}$, then we can write:</p>
<script type="math/tex; mode=display">\expect{\mathbb{I}\left\{ y-f(x) \ne 0 \right\}} = \mathbb{P}(y-f(x) \ne 0)</script>
<p>Where $\mathbb{I}$ is an indicator function that returns 1 when the condition is true, and 0 otherwise.</p>
<h3 id="linear-classifier">Linear classifier</h3>
<p>A classifier will divide the input space into a collection of regions belonging to each class; the boundaries are called <strong>decision boundaries</strong>.</p>
<p>A linear classifier splits the input with a line in 2D, a plane in 3D, or more generally, a hyperplane. But a linear classifier can also classify more complex shapes if we allow for <a href="#extended-feature-vectors">feature augmentation</a>. For instance (in 2D), if we augment the input to degree $M=2$ and a constant factor, our linear classifier can also detect ellipsoids. So without loss of generality, we’ll simply study linear classifiers and allow feature augmentation.</p>
<h3 id="is-classification-a-special-case-of-regression">Is classification a special case of regression?</h3>
<p>From the initial definition of classification, we see that it is a special case of regression, where the output $y$ is restricted to a small discrete set instead of a continuous spectrum.</p>
<p>We could construct classification from regression by simply rounding to the nearest $\mathcal{C}_i$ value. For instance, if we have $y\in\left\{0, 1\right\}$, we can use (regularized) least-squares to learn a prediction function $f_{\Strain}$ for this regression problem. We can then convert the regression to a classification by rounding: we decide on $\mathcal{C}_1=0$ if $f_{\Strain}(\pmb{x})<0.5$ and $\mathcal{C}_2=1$ if $f_{\Strain}(\pmb{x})>0.5$.</p>
<p>But this is somewhat questionable as an approach. MSE penalizes points that are far away from the result <strong>before rounding</strong>, even though they would be correct <strong>after rounding</strong>. This means that the resulting decision boundary will likely be poorly placed.</p>
<p>With MSE, the “position” of the line defined by $f_{\Strain}$ will depend crucially on how many points are in each class, and where the points lie. This is not desirable for classification: instead of minimizing the cost function, we’d like for the fraction of misclassified cases to be small. The mean-squared error turns out to be only loosely related to this.</p>
<p>So instead of building classification as a special case of regression, let’s take a look at some basic alternative ideas to perform classification.</p>
<h3 id="nearest-neighbor">Nearest neighbor</h3>
<p>In some cases it is reasonable to postulate that there is some spatial correlation between points of the same class: inputs that are “close” are also likely to have the same label. Closeness may be measured by Euclidean distance, for instance. A nearest-neighbor classifier thus predicts, for a new input, the label of the closest training point.</p>
<p>This can be generalized easily: instead of taking the single nearest neighbor, a process very prone to being swayed by outliers, we can take the $k$ nearest neighbors (<a href="https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm">k-NN</a>), or a weighted linear combination of elements in the neighborhood (<a href="https://en.wikipedia.org/wiki/Kernel_smoother">smoothing kernels</a>).</p>
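<p>A minimal sketch of a k-NN classifier with Euclidean distance and majority vote. The function name <code>knn_predict</code> and the toy data are made up for illustration:</p>

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to query."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Toy data: two clusters, plus one outlier labeled 0 inside the cluster of 1s.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (0.95, 0.95)]
labels = [0, 0, 0, 1, 1, 1, 0]

print(knn_predict(points, labels, (0.96, 0.96), k=1))  # follows the outlier
print(knn_predict(points, labels, (0.96, 0.96), k=3))  # outlier is outvoted
```

<p>With $k=1$ the prediction is dictated by the single outlier, whereas with $k=3$ its two closest neighbors from the other class outvote it.</p>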
<p>But this idea fails miserably in high dimensions, where the geometry renders the idea of “closeness” meaningless: in a high-dimensional space, if we grow the area around a point, we’re likely to see no one for a very long time, and then 💥, everyone. This is known as the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a>. The idea also fails when we have too little data, especially in high dimensions, where the closest point may actually be far away and a very bad indicator of the local situation.</p>
<h3 id="linear-decision-boundaries">Linear decision boundaries</h3>
<p>As a starting point, we can assume that decision boundaries are linear (hyperplanes). To keep things simple, we can assume that there is a separating hyperplane, i.e. a hyperplane so that no point in the training set is misclassified.</p>
<p>There may be many such hyperplanes, so which one do we pick? This may be a little hand-wavy, but the intuition is to pick the most “robust” one, i.e. the one that offers the greatest “margin”: we want to be able to “wiggle” the inputs as much as possible while keeping the number of misclassifications low. This idea will lead us to <em>support vector machines</em> (SVMs).</p>
<p>But the linear decision boundaries are limited, and in many cases too strong of an assumption. We can augment the feature vector with some non-linear functions, which is what we do with the kernel trick, which we will talk about later. Another option is to use neural networks to find an appropriate non-linear transform of the inputs.</p>
<h3 id="optimal-classification-for-a-known-generating-model">Optimal classification for a known generating model</h3>
<p>To find a solution, we can gain some insights if we assume that we know the joint distribution $p(\pmb{x}, y)$ that created the data (where $y$ takes values in a discrete set $\mathcal{Y}$). In practice, we don’t know the model, but this is just a thought experiment. We can assume that the data was generated from a model $(\pmb{x}, y)\sim\mathcal{D}$, where $y=g(\pmb{x})+\epsilon$, where $\epsilon$ is noise.</p>
<p>Given the fact that there is noise, a perfect solution may not always be possible. But if we see an input $\pmb{x}$, how can we pick an optimal choice $\hat{y}(\pmb{x})$ for this distribution? We want to maximize the probability of guessing the correct label, so we should choose according to the rule:</p>
<script type="math/tex; mode=display">\hat{y}(\pmb{x}) = \argmax_{y\in\mathcal{Y}}{p(y\mid\pmb{x})}</script>
<p>This is known as the maximum a-posteriori (MAP) criterion, since we maximize the posterior probability (the probability of a class label <em>after</em> having observed the input).</p>
<p>The probability of a correct guess is thus the average over all inputs of the MAP, i.e.:</p>
<script type="math/tex; mode=display">\mathbb{P}(\hat{y}(\pmb{x}) = y) = \int{p(\pmb{x})p(\hat{y}(\pmb{x})\mid \pmb{x})dx}</script>
<p>In practice we of course do not know the joint distribution, but we could use this approach by using the data itself to learn the distribution (perhaps under the assumption that it is Gaussian, and just fitting the $\mu$ and $\sigma$ parameters).</p>
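<p>As an illustration of the MAP rule, here is a sketch for a generating model that we pretend to know: two classes with fixed priors and one-dimensional Gaussian class-conditional densities. The priors, means and standard deviations are arbitrary assumptions, and the posterior is obtained through Bayes’ rule:</p>

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma**2)) / math.sqrt(2 * math.pi * sigma**2)

# Assumed known generating model: class priors p(y) and Gaussian p(x | y).
PRIORS = {0: 0.7, 1: 0.3}
CLASS_PARAMS = {0: (0.0, 1.0), 1: (2.0, 1.0)}  # (mu, sigma) per class

def posterior(x):
    """p(y | x) for every class y, by Bayes' rule."""
    joint = {y: PRIORS[y] * gaussian_pdf(x, *CLASS_PARAMS[y]) for y in PRIORS}
    evidence = sum(joint.values())
    return {y: p / evidence for y, p in joint.items()}

def map_predict(x):
    """MAP rule: pick the class with the largest posterior probability."""
    post = posterior(x)
    return max(post, key=post.get)

for x in (-1.0, 1.0, 1.5, 3.0):
    print(x, map_predict(x), posterior(x))
```

<p>Because class 0 has the larger prior, the decision threshold is not at the midpoint $x=1$ between the two means, but shifted towards the mean of class 1.</p>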
<h2 id="logistic-regression">Logistic regression</h2>
<p>Recall that <a href="#is-classification-a-special-case-of-regression">we discussed</a> what happens if we look at binary classification as a regression. We also discussed that it is tempting to look at the predicted value as a probability (i.e. if the regression says 0.8, we could interpret it as 80% probability of $\mathcal{C}_2 = 1$ and 20% probability of $\mathcal{C}_1 = 0$). But this leads to problems, as the predicted values may not be in $[0, 1]$, even largely surpassing these bounds, and this contributes to the error in MSE even though they indicate high certainty.</p>
<p>So the natural idea is to <em>transform</em> the prediction, which can take values in $(-\infty, \infty)$, into a true probability in $[0, 1]$. This is done by applying an appropriate function, one of which is the <em>logistic</em> function:</p>
<script type="math/tex; mode=display">\sigma(z) := \frac{e^z}{1+e^z}</script>
<p>How do we use this? Let’s consider binary classification, with labels 0 and 1. Given a training set, we learn a weight vector $\pmb{w}$, and a “shift” scalar $w_0$.</p>
<p>Note that $w_0$ can, for the sake of simpler notation, be folded into $\pmb{w}$ by adding a constant feature to $\pmb{x}$. <a href="#multiple-linear-regression">As before</a>, we’ll use this notation to keep things concise.</p>
<p>Given a new feature vector $\pmb{x}$, the <em>probabilities</em> of the class labels given $\pmb{x}$ are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(1 \mid \pmb{x}) & = \sigma(\pmb{x}^T\pmb{w}) \\
p(0 \mid \pmb{x}) & = 1 - \sigma(\pmb{x}^T\pmb{w}) \\
\end{align} %]]></script>
<p>This allows us to predict a certainty, which is a real value and not a label, which is why logistic regression is called regression, even though it is still part of a classification scheme. Indeed, we typically use logistic regression as the first step of a classifier.</p>
<h3 id="training">Training</h3>
<p>To train the classifier, the intuition is that we’d like to maximize the likelihood of our weight vector explaining the data:</p>
<script type="math/tex; mode=display">\argmax_{\pmb{w}}{p(\pmb{y}, \pmb{X} \mid \pmb{w})}</script>
<p><a href="#properties-of-mle">As with MLE</a>, this is <strong>consistent</strong>: it gives us the correct model assuming we have enough data. Using the chain rule for probabilities, and the fact that the inputs $\pmb{X}$ do not depend on $\pmb{w}$, the probability becomes:</p>
<script type="math/tex; mode=display">p(\pmb{y}, \pmb{X} \mid \pmb{w}) = p(\pmb{X}\mid\pmb{w})p(\pmb{y} \mid \pmb{X}, \pmb{w}) = p(\pmb{X})p(\pmb{y} \mid \pmb{X}, \pmb{w})</script>
<p>As we’re trying to get the argmax over the weights, we can discard $p(\pmb{X})$ as it doesn’t depend on $\pmb{w}$. Therefore:</p>
<script type="math/tex; mode=display">\argmax_{\pmb{w}}{p(\pmb{y}, \pmb{X} \mid \pmb{w})} = \argmax_{\pmb{w}}{p(\pmb{y} \mid \pmb{X}, \pmb{w})}</script>
<p>Using the fact that the samples in the dataset are independent, and given the above formulation of the class probabilities $p(y \mid \pmb{x})$, we can express the maximum likelihood criterion for the general case (in the previous section, we had only done it for the binary case $N=2$):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(\pmb{y} \mid \pmb{X}, \pmb{w})
& = p(y_1, \dots, y_N \mid \pmb{x}_1, \dots, \pmb{x}_N, \pmb{w}) \\
& = \prod_{n=1}^N{p(y_n \mid \pmb{x}_n, \pmb{w})} \\
& = \prod_{n=1}^N{\sigma(x_n^T \pmb{w})^{y_n} \bigl(1-\sigma(x_n^T \pmb{w})\bigr)^{1-y_n}} \\
\end{align} %]]></script>
<p>But this product is nasty, so we’ll remove it by taking the log. We also multiply by $-1$, which means we also need to be careful about taking the minimum instead of the maximum. The resulting cost function is thus:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{\pmb{w}}
& = -\sum_{n=1}^N{\left[
y_n \log{(\sigma(x_n^T w))} + (1-y_n)\log{(1-\sigma(x_n^T w))}
\right]} \\
& = \sum_{n=1}^N{\log{(1+\exp{(x_n^T w)})} - y_n x_n^T w}
\tag{Log-Likelihood}\label{eq:log-likelihood}
\end{align} %]]></script>
<h3 id="conditions-of-optimality">Conditions of optimality</h3>
<p>As we discussed above, we’d like to minimize the cost $\cost{\pmb{w}}$. Let’s look at the stationary points of our cost function by computing its gradient and setting it to zero.</p>
<p>It just turns out that taking the derivative of the logarithm in the inner part of the sum above, with respect to its argument $z = x_n^T w$, gives us the logistic function:</p>
<script type="math/tex; mode=display">\frac{\partial \log{(1+\exp{(z)})}}{\partial z} = \sigma(z)</script>
<p>Therefore, the whole derivative is:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\nabla\cost{\pmb{w}}
& = \sum_{n=1}^N {\pmb{x}_n \bigl( \sigma(\pmb{x}_n^T\pmb{w}) - y_n \bigr)} \\
& = \pmb{X}^T \left[ \sigma(\pmb{Xw}) - \pmb{y} \right]
\end{align} %]]></script>
<p>The matrix $\pmb{X}$ is $N\times D$; $\pmb{y}$ is a column vector of length $N$, and $\pmb{w}$ is a column vector of length $D$. To simplify notation, we let $\sigma(\pmb{Xw})$ represent element-wise application of the sigmoid function on the size $N$ vector resulting from $\pmb{Xw}$.</p>
<p>There is no closed-form solution for this, so we’ll discuss how to solve it in an iterative fashion by using gradient descent or the Newton method.</p>
<h3 id="gradient-descent-1">Gradient descent</h3>
<p>$\ref{eq:log-likelihood}$ is convex in the weight vector $\pmb{w}$. We can therefore do gradient descent on this cost function as we’ve always done:</p>
<script type="math/tex; mode=display">\pmb{w}^{(t+1)} := \pmb{w}^{(t)} - \gamma^{(t)}\nabla\cost{\pmb{w}^{(t)}}</script>
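<p>A minimal sketch of gradient descent on this cost, written with plain Python lists. The constant step size, the number of iterations, and the toy data are assumptions for illustration; the gradient is the sum over samples of $\pmb{x}_n(\sigma(\pmb{x}_n^T\pmb{w}) - y_n)$:</p>

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cost(X, y, w):
    """Negative log-likelihood: sum_n log(1 + exp(x_n^T w)) - y_n x_n^T w."""
    total = 0.0
    for x_n, y_n in zip(X, y):
        z = dot(x_n, w)
        # log(1 + e^z) computed stably as max(z, 0) + log(1 + e^{-|z|})
        total += max(z, 0.0) + math.log1p(math.exp(-abs(z))) - y_n * z
    return total

def gradient(X, y, w):
    """X^T (sigma(Xw) - y), accumulated sample by sample."""
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        residual = sigmoid(dot(x_n, w)) - y_n
        for d, x_nd in enumerate(x_n):
            grad[d] += residual * x_nd
    return grad

def gradient_descent(X, y, step=0.1, iterations=500):
    """Plain gradient descent with a constant step size, starting from w = 0."""
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        g = gradient(X, y, w)
        w = [w_d - step * g_d for w_d, g_d in zip(w, g)]
    return w

# Toy, non-separable data (made up); the first feature is the constant 1.
X = [[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]]
y = [0, 0, 1, 0, 1, 1]
w = gradient_descent(X, y)
print(w, cost(X, y, w), gradient(X, y, w))
```

<p>The data is deliberately not linearly separable, so that a finite minimizer exists and the gradient at the returned weights is close to zero.</p>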
<h3 id="newtons-method">Newton’s method</h3>
<p>Gradient descent is a <em>first-order</em> method, using only the first derivative of the cost function. We can get a more powerful optimization algorithm using the second derivative. This is based on the idea of Taylor expansions. The order 2 Taylor expansion of the cost, around $\pmb{w}^*$, is:</p>
<script type="math/tex; mode=display">\cost{\pmb{w}} \approx \cost{\pmb{w}^*} + \nabla\cost{\pmb{w}^*}^T(\pmb{w}-\pmb{w}^*) + \frac{1}{2}(\pmb{w}-\pmb{w}^*)^T H(\pmb{w}^*)(\pmb{w}-\pmb{w}^*)</script>
<p>Where H denotes the Hessian, the $D\times D$ symmetric matrix with entries:</p>
<script type="math/tex; mode=display">H_{i, j} = \frac{\partial^2\cost{\pmb{w}}}{\partial w_i \partial w_j}</script>
<h4 id="hessian-of-the-log-likelihood">Hessian of the log-likelihood</h4>
<p>Let’s compute this Hessian matrix. We’ve already computed the gradient of the cost function <a href="#conditions-of-optimality">in the section above</a>. Looking at the sum form (not the matrix form), we see that each term only depends on $\pmb{w}$ in the $\sigma(\pmb{x}_n^T w)$ term. Therefore, the Hessian associated to one term is:</p>
<script type="math/tex; mode=display">\pmb{x}_n(\nabla\sigma(\pmb{x}_n^T\pmb{w}))^T</script>
<p>Given that the derivative of the sigmoid is $\sigma'(x) = \sigma(x)(1-\sigma(x))$, by the <a href="https://en.wikipedia.org/wiki/Chain_rule">chain rule</a>, each term of the sum gives rise to the Hessian:</p>
<script type="math/tex; mode=display">\pmb{x}_n\pmb{x}_n^T\sigma(\pmb{x}_n^T \pmb{w})(1 - \sigma(\pmb{x}_n^T \pmb{w}))</script>
<h4 id="newtons-method-1">Newton’s method</h4>
<p>In this model, we’ll assume that the Taylor expansion above denotes the cost function exactly instead of approximately. In other words, we’re assuming strict equality $=$ instead of approximation $\approx$ as above. This is only an assumption; it isn’t strictly true, but it’s a decent approximation. Where does this take minimum value? To know that, let’s set the gradient of the Taylor expansion to zero. This yields:</p>
<script type="math/tex; mode=display">H(\pmb{w}^*)^{-1} \nabla\cost{\pmb{w}^*} = \pmb{w}^* - \pmb{w}</script>
<p>If we solve for $\pmb{w}$, and scale the update by a step size $\gamma^{(t)}$, this gives us an iterative algorithm for finding the optimum:</p>
<script type="math/tex; mode=display">w^{(t+1)} = w^{(t)} - \gamma^{(t)} H(w^{(t)})^{-1} \nabla\cost{w^{(t)}}</script>
<p>In this iterative algorithm, the current iterate $w^{(t)}$ (starting with $w^{(0)}$) plays the role of the expansion point $w^*$, and the minimizer of the approximation becomes the next iterate $w^{(t+1)}$.</p>
<p>The above skips a few steps, and is just meant to give the intuition of how we get to our result. In any case, the important thing to remember is the formula for the descent, and the fact that the Hessian can be computed as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
H(w)
& = \sum_{n=1}^N{\nabla^2\mathcal{L}_n(w)} \\
& = \sum_{n=1}^N{
\underbrace{\pmb{x}_n \pmb{x}_n^T}_{D\times D}
\sigma(x_n^T w)
\bigl(1 - \sigma(x_n^T w) \bigr)
} \\
\end{align} %]]></script>
<p>This can also be written as:</p>
<script type="math/tex; mode=display">H(w) =
\underbrace{\ X^T \ }_{D\times N}\
\underbrace{\ S\ }_{N\times N}\
\underbrace{\ X \ }_{N\times D}</script>
<p>The S matrix is diagonal, where:</p>
<script type="math/tex; mode=display">S_{n, n} = \sigma(x_n^T w)\bigl(1 - \sigma(x_n^T w) \bigr)</script>
<p>The trade-off for the Newton method is that while we need fewer iterations, each of them is more costly, since we have to compute the $D\times D$ Hessian and solve a linear system with it. Which method to use in practice depends on the problem, but at least we have another option with the Newton method.</p>
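<p>For comparison, here is a sketch of Newton’s method for logistic regression, restricted to $D = 2$ features so that the linear system can be solved by hand. The helper names, the unit step size and the toy data are assumptions for illustration; the Hessian is accumulated as $X^T S X$:</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gradient_and_hessian(X, y, w):
    """Gradient X^T (sigma(Xw) - y) and Hessian X^T S X for D = 2 features."""
    g = [0.0, 0.0]
    H = [[0.0, 0.0], [0.0, 0.0]]
    for x_n, y_n in zip(X, y):
        p = sigmoid(x_n[0] * w[0] + x_n[1] * w[1])
        s = p * (1.0 - p)  # diagonal entry S_nn
        for i in range(2):
            g[i] += (p - y_n) * x_n[i]
            for j in range(2):
                H[i][j] += s * x_n[i] * x_n[j]
    return g, H

def solve_2x2(H, g):
    """Solve H d = g by Cramer's rule (no explicit matrix inverse)."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [
        (g[0] * H[1][1] - H[0][1] * g[1]) / det,
        (H[0][0] * g[1] - g[0] * H[1][0]) / det,
    ]

def newton(X, y, step=1.0, iterations=10):
    """Newton iterations w <- w - step * H(w)^{-1} grad(w), starting from w = 0."""
    w = [0.0, 0.0]
    for _ in range(iterations):
        g, H = gradient_and_hessian(X, y, w)
        d = solve_2x2(H, g)
        w = [w[0] - step * d[0], w[1] - step * d[1]]
    return w

# Toy, non-separable data (made up); the first feature is the constant 1.
X = [[1, -2.0], [1, -1.0], [1, -0.5], [1, 0.5], [1, 1.0], [1, 2.0]]
y = [0, 0, 1, 0, 1, 1]
w = newton(X, y)
print(w, gradient_and_hessian(X, y, w)[0])
```

<p>On this small problem a handful of Newton iterations is enough to drive the gradient to numerical zero, where gradient descent with a small constant step needs hundreds. For larger $D$ one would solve the linear system with a linear algebra library rather than by hand.</p>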
<h3 id="regularized-logistic-regression">Regularized logistic regression</h3>
<p>If the data is linearly separable, there is no finite weight vector that minimizes the cost. Running the iterative algorithm will make the weights diverge to infinity. To avoid this, we can regularize with a penalty term.</p>
<script type="math/tex; mode=display">\argmin_w{-\sum_{n=1}^N{\log{p(y_n \mid \pmb{x}_n^T\pmb{w})}} + \frac{\lambda}{2}\norm{\pmb{w}}^2}</script>
<h2 id="generalized-linear-models">Generalized Linear Models</h2>
<p>Previously, with <a href="#least-squares">least squares</a>, our data was of the form:</p>
<script type="math/tex; mode=display">y = x^T w + z, \quad \text{with } z\sim\mathcal{N}(0, \sigma^2)</script>
<p>This is a D-linear model. When talking about generalized linear models, we’re still talking about something linear, but we allow the noise $z$ to be something other than a Gaussian distribution.</p>
<h3 id="motivation">Motivation</h3>
<p>The motivation for this is that while logistic regression allows for, say, binary outputs, we may want to have something equivalently computationally efficient for, say, $y\in\mathbb{N}$. To do so, we introduce different classes of distributions, called the <em>exponential family</em>, with which we can revisit logistic regression and get other properties.</p>
<p>This will be useful in adding a degree of freedom. Previously, we most often used linear models, in which we model the data as a line, plus zero-mean Gaussian noise. As we saw, this leads to least squares. When the data is more complex than a simple line, we saw that we could augment the features (e.g. with $x^2$, $x^3$), and still use a linear model. The idea was to augment the feature space $x$. This gave us an added degree of freedom, and allowed us to use linear models for higher-degree problems.</p>
<p>These linear models predicted the mean of the distribution from which we assumed the data to be sampled. When talking about mean here, we mean what we assume the data to be modeled after, without the noise. In this section, we’ll see how we can use the linear model to predict a different quantity than the mean. This will allow us to add another degree of freedom, and use linear models to get other predictions than just the shape of the data.</p>
<p>We’ve actually already done this, without knowing it. In (binary) logistic regression, the probability of the classes was:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y = 1 \mid \eta) & = \sigma(\eta) \\
p(y = 0 \mid \eta) & = 1 - \sigma(\eta) \\
\end{align} %]]></script>
<p>We’re using $\eta$ as a shorthand for $\pmb{x}^T\pmb{w}$, and will do so in this section. More compactly, we can write this in a single formula:</p>
<script type="math/tex; mode=display">p(y\mid\eta) = \frac{e^{\eta y}}{1 + e^\eta} = \exp{\left[
\eta y - \log{(1 + e^\eta)}
\right]}, \qquad y\in\left\{0, 1\right\}</script>
<p>Note that this linear model does not predict the mean, which we’ll denote $\mu$ (don’t get confused by this notation; $\mu$ is in general not a scalar, it represents the “real values” that the data is modeled after, without the noise). Instead, our linear model predicts $\eta = \pmb{x}^T\pmb{w}$, which is transformed into the mean by using the $\sigma$ function:</p>
<script type="math/tex; mode=display">\mu = \sigma(\eta)</script>
<p>This relation between $\mu$ and $\eta$ is known as the <strong>link function</strong>. It is a nonlinear function that makes it possible to use a linear model to predict something other than the mean $\mu$.</p>
<h3 id="exponential-family">Exponential family</h3>
<p>In general, the form of a distribution in the exponential family is:</p>
<script type="math/tex; mode=display">p(y\mid\pmb{\eta}) = h(y)\exp{\left[\pmb{\eta}^T\phi(y) - A(\pmb{\eta})\right]}</script>
<p>Let’s take a look at the various components of this distribution:</p>
<ul>
<li>$\pmb{\eta}$ is a shorthand for $\pmb{x}^T\pmb{w}$</li>
<li>$\phi(y)$ is called a <strong>sufficient statistic</strong>. It’s usually a vector. Its name stems from the fact that its empirical average is all we need to estimate $\pmb{\eta}$</li>
<li>$A(\pmb{\eta})$ is the <strong>log-partition function</strong>, or the <strong>cumulant</strong>.</li>
</ul>
<p>The domain of $y$ can vary: we could choose $y\in\mathbb{R}$, $y\in\left\{0, 1\right\}$, $y\in\mathbb{N}$, etc. Depending on this, we may have to do sums or integrals in the following.</p>
<p>We require that the probability be non-negative, so we need to ensure that $h(y) \ge 0$. Additionally, a probability distribution needs to integrate to 1, so we also require that:</p>
<script type="math/tex; mode=display">\int_y{h(y)\exp{\left[\pmb{\eta}^T\phi(y) - A(\pmb{\eta})\right]}} dy = 1</script>
<p>This can be rewritten to:</p>
<script type="math/tex; mode=display">\int_y{h(y)\exp{\left[\pmb{\eta}^T\phi(y)\right]}} dy = \exp{A(\eta)}</script>
<p>The role of $A(\pmb{\eta})$ is thus only to ensure a proper normalization. To create a member of the exponential family, we can choose the factor $h(y)$, the vector $\phi(y)$ and the parameter $\pmb{\eta}$; the cumulant $A(\pmb{\eta})$ is then determined for each such choice, and ensures that the expression is properly normalized. From the above, it follows that $A(\pmb{\eta})$ is defined as:</p>
<script type="math/tex; mode=display">A(\eta) = \log{\left[\int_y{h(y)\exp{\left[\pmb{\eta}^T\phi(y)\right]}} dy\right]}</script>
<p>We exclude the case where the integral is infinite, as we cannot compute a real $A(\eta)$ for that case.</p>
<h4 id="example-bernoulli">Example: Bernoulli</h4>
<p>The Bernoulli distribution is a member of the exponential family. Its probability mass function is given by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y\mid\mu)
& = \mu^y(1-\mu)^{1-y}, \quad \text{where } \mu\in(0, 1) \\
& = \exp{\left[
\left( \log{\frac{\mu}{1-\mu}} \right) y +
\log{(1 - \mu)}
\right]} \\
& = \exp{\left[\eta\phi(y) - A(\eta)\right]}
\end{align} %]]></script>
<p>The parameters are thus:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\phi(y) & = y \\
\eta & = \log{\frac{\mu}{1-\mu}} \\
A(\eta) & = -\log{(1-\mu)}=\log{(1 + e^\eta)} \\
h(y) & = 1
\end{align} %]]></script>
<p>Here, $\phi(y)$ is a scalar, which means that the family only depends on a single parameter. Note that $\eta$ and $\mu$ are linked:</p>
<script type="math/tex; mode=display">\eta = g(\mu) = \log{\frac{\mu}{1-\mu}} \iff \mu = g^{-1}(\eta) = \frac{e^\eta}{1+e^\eta} = \sigma(\eta)</script>
<p>The inverse of the link function is the same sigmoid function we encountered in logistic regression.</p>
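<p>These relations are easy to check numerically. The sketch below uses illustrative function names and an arbitrary value of $\mu$; it verifies that the exponential form sums to 1 over $y\in\{0, 1\}$, and that the derivative of $A(\eta)$ (approximated by a finite difference) equals the mean $\mu$:</p>

```python
import math

def log_partition(eta):
    """A(eta) = log(1 + e^eta) for the Bernoulli distribution."""
    return math.log1p(math.exp(eta))

def bernoulli_pmf(y, eta):
    """p(y | eta) = exp(eta * y - A(eta)), with h(y) = 1 and phi(y) = y."""
    return math.exp(eta * y - log_partition(eta))

def link(mu):
    """eta = g(mu) = log(mu / (1 - mu))."""
    return math.log(mu / (1.0 - mu))

def inverse_link(eta):
    """mu = g^{-1}(eta) = e^eta / (1 + e^eta), i.e. the sigmoid."""
    return math.exp(eta) / (1.0 + math.exp(eta))

mu = 0.8  # arbitrary mean chosen for the check
eta = link(mu)

total = bernoulli_pmf(0, eta) + bernoulli_pmf(1, eta)
h = 1e-5
dA = (log_partition(eta + h) - log_partition(eta - h)) / (2 * h)

print(total)                                     # normalization check
print(bernoulli_pmf(1, eta), inverse_link(eta))  # both recover mu up to rounding
print(dA)                                        # finite-difference derivative of A
```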
<h4 id="example-gaussian">Example: Gaussian</h4>
<p>The density of a Gaussian $\mathcal{N}(\mu, \sigma^2)$ is:</p>
<script type="math/tex; mode=display">p(y) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp{-\frac{(y-\mu)^2}{2\sigma^2}},
\qquad \mu\in\mathbb{R},
\quad \sigma\in\mathbb{R}^+</script>
<p>There are two parameters to choose in a Gaussian, $\mu$ and $\sigma$, so we can expect something of degree 2 in exponential form. Let’s rewrite the above:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
p(y) & = \exp{\left[
-\frac{y^2}{2\sigma^2}
+ \frac{\mu y}{\sigma^2}
- \underbrace{\left(
\frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log{(2\pi\sigma^2)}
\right)}_{A(\eta)}
\right]} \\
& = \exp{\left[
\eta^T\phi(y) - A(\eta)
\right]}
\end{align} %]]></script>
<p>Where:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
h(y) & = 1 \\
\phi(y) & = \begin{bmatrix}
y \\
y^2 \\
\end{bmatrix} \\
\eta & = \begin{bmatrix}
\eta_1 \\
\eta_2 \\
\end{bmatrix} = \begin{bmatrix}
\frac{\mu}{\sigma^2} \\
-\frac{1}{2\sigma^2} \\
\end{bmatrix} \\
A(\eta) & = \frac{\mu^2}{2\sigma^2} + \frac{1}{2}\log{(2\pi\sigma^2)}
= -\frac{\eta_1^2}{4\eta_2} - \frac{1}{2}\log{(-\eta_2/\pi)}
\end{align} %]]></script>
<p>Indeed, this time $\phi(y)$ is a vector of dimension 2, which reflects that the distribution depends on 2 parameters. As the formulation of $\eta$ shows, we have a 1-to-1 correspondence between $\pmb{\eta}=(\eta_1, \eta_2)$ and the $(\mu, \sigma^2)$ parameters:</p>
<script type="math/tex; mode=display">\eta_1 = \frac{\mu}{\sigma^2},\ \eta_2 = -\frac{1}{2\sigma^2}
\quad \iff \quad
\mu = -\frac{\eta_1}{2\eta_2},\ \sigma^2 = -\frac{1}{2\eta_2}</script>
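<p>The correspondence can be checked numerically. The sketch below converts between the two parameterizations and computes $A(\eta)$ directly from the normalization requirement, by numerical integration rather than from a closed form. The integration range and step count are assumptions that suit moderate values of $\mu$ and $\sigma^2$.</p>

```python
import math

def to_natural(mu, sigma2):
    """(mu, sigma^2) -> (eta_1, eta_2)."""
    return mu / sigma2, -1.0 / (2.0 * sigma2)

def to_moments(eta1, eta2):
    """(eta_1, eta_2) -> (mu, sigma^2); requires eta_2 < 0."""
    return -eta1 / (2.0 * eta2), -1.0 / (2.0 * eta2)

def gaussian_pdf(y, mu, sigma2):
    return math.exp(-(y - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

def log_normalizer(eta1, eta2, lo=-40.0, hi=40.0, steps=20_000):
    """A(eta) = log of the integral of exp(eta^T phi(y)) dy, with phi(y) = (y, y^2), h(y) = 1.

    Uses the midpoint rule on [lo, hi]; the range is an assumption, not a general choice.
    """
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * width
        total += math.exp(eta1 * y + eta2 * y * y)
    return math.log(total * width)

mu, sigma2 = 1.0, 2.0
eta1, eta2 = to_natural(mu, sigma2)
A = log_normalizer(eta1, eta2)
y = 0.3
# density in exponential-family form next to the usual Gaussian density
print(math.exp(eta1 * y + eta2 * y * y - A), gaussian_pdf(y, mu, sigma2))
```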
<h4 id="properties-1">Properties</h4>
<ol>
<li>$A(\eta)$ is convex</li>
<li>$\nabla_\eta A(\eta) = \expect{\phi(y)}$</li>
<li>$\nabla_\eta^2 A(\eta) = \expect{\phi(y)\phi(y)^T} - \expect{\phi(y)}\expect{\phi(y)}^T$</li>
</ol>
<p>Proofs are in the lecture notes.</p>
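<p>These properties can be illustrated on the Bernoulli family, where $A(\eta) = \log{(1 + e^\eta)}$. The sketch below uses finite differences (the step sizes are arbitrary choices) to compare the first derivative of $A$ with the mean, and the second derivative with the variance, which is non-negative as convexity requires.</p>

```python
import math

def cumulant(eta):
    """A(eta) for the Bernoulli family."""
    return math.log1p(math.exp(eta))

def sigmoid(eta):
    return 1.0 / (1.0 + math.exp(-eta))

def first_derivative(f, x, step=1e-5):
    """Central finite difference approximation of f'(x)."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite difference approximation of f''(x)."""
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / (step * step)

eta = 0.7
mu = sigmoid(eta)
print(first_derivative(cumulant, eta), mu)               # both approximate E[phi(y)]
print(second_derivative(cumulant, eta), mu * (1 - mu))   # both approximate Var[phi(y)]
```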
<h4 id="link-function">Link function</h4>
<p>As we’ve seen above, there is a relationship between the mean $\pmb{\mu} := \expect{\phi(y)}$ and $\pmb{\eta}$ using the link function $g$:</p>
<script type="math/tex; mode=display">\pmb{\eta} = g(\pmb{\mu}) \iff \pmb{\mu} = g^{-1}(\pmb{\eta})</script>
<p>For a list of such functions, consult the chapter on Generalized Linear Models in <a href="https://www.cs.ubc.ca/~murphyk/MLbook/">the KPM book</a>.</p>
<h3 id="application-in-ml">Application in ML</h3>
<h4 id="maximum-likelihood-parameter-estimation">Maximum Likelihood Parameter Estimation</h4>
<p>Assume that we have samples composing our training set, $\Strain = \left\{(x_n, y_n)\right\}_{n=1}^N$, drawn i.i.d. from some distribution, which we assume belongs to an exponential family. Assume we have picked a model, i.e. that we fixed $h(y)$ and $\phi(y)$, but that $\eta$ is unknown. How can we find an optimal $\eta$?</p>
<p>We said previously that $\phi(y)$ is a sufficient statistic, and that we could find $\eta$ from its empirical average; this is what we’ll do here. We can use the maximum likelihood principle to find this parameter, meaning that we want to minimize the negative log-likelihood:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{L}_{LL}(\pmb{\eta})
& = -\log{(p(\pmb{y} \mid \pmb{\eta}))} \\
& = \sum_{n=1}^N{\left(
-\log{h(y_n)} - \pmb{\eta}^T\phi(y_n) + A(\pmb{\eta})
\right)}
\end{align} %]]></script>
<p>This is a convex function in $\eta$: the $h(y)$ term does not depend on $\eta$, $\eta^T\phi(y_n)$ is linear, $A(\eta)$ has the <a href="#properties-1">property of being convex</a>.</p>
<p>If we assume that we have the link function already, we can get $\pmb{\eta}$ by setting the gradient of the cost to 0:</p>
<script type="math/tex; mode=display">\frac{1}{N} \nabla\cost{\pmb{\eta}}
= -\left( \frac{1}{N} \sum_{n=1}^N{\phi(y_n)} \right)
+ \expect{\phi(y)}
= 0</script>
<p>Since $\pmb{\mu} := \expect{\phi(y)}$ and $\pmb{\eta} = g(\pmb{\mu})$, we get:</p>
<script type="math/tex; mode=display">\pmb{\eta} = g\left( \frac{1}{N}\sum_{n=1}^N{\phi(y_n)} \right)</script>
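<p>For the Bernoulli family, $\phi(y) = y$ and the map from the mean to the natural parameter is the logit, so the maximum likelihood estimate is the logit of the empirical mean. A minimal sketch, with a synthetic sample and a seed that are purely illustrative:</p>

```python
import math
import random

def logit(mu):
    """Maps a mean in (0, 1) to the Bernoulli natural parameter."""
    return math.log(mu / (1.0 - mu))

def fit_natural_parameter(samples):
    """Maximum likelihood eta for Bernoulli samples in {0, 1}.

    The empirical average of phi(y) = y is mapped to eta.
    Undefined when the samples are all 0 or all 1.
    """
    mean = sum(samples) / len(samples)
    return logit(mean)

random.seed(0)
true_mu = 0.3
samples = [1 if random.random() < true_mu else 0 for _ in range(10_000)]
eta_hat = fit_natural_parameter(samples)
print(eta_hat, logit(true_mu))  # the estimate should be close to the true parameter
```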
<h4 id="generalized-linear-models-1">Generalized linear models</h4>
<p>If we assume that our samples follow the distribution of an exponential family, we can construct a <em>generalized linear model</em>. As we’ve explained previously, this is a generalization of the model we used for logistic regression.</p>
<p>For such a model, the maximum likelihood problem, as described above, is easy to solve. As we’ve noted above, the cost function is convex, so a greedy, iterative algorithm should work well. Let’s look at the gradient of the cost:</p>
<script type="math/tex; mode=display">\nabla_{\pmb{w}}\cost{\pmb{w}} = - \sum_{n=1}^N {\left( \pmb{x}_n \phi(y_n) - \pmb{x}_n A'(\pmb{x}_n^T\pmb{w}) \right)}</script>
<p>Let’s recall that the derivative of the cumulant is:</p>
<script type="math/tex; mode=display">\frac{\partial A(\eta)}{\partial \eta} = \expect{\phi(y)} = g^{-1}(\eta)</script>
<p>Hence the gradient of the cost function is:</p>
<script type="math/tex; mode=display">\nabla_{\pmb{w}}\cost{\pmb{w}} = - \sum_{n=1}^N {\left( \pmb{x}_n \phi(y_n) - \pmb{x}_n g^{-1}(\pmb{x}_n^T\pmb{w}) \right)}</script>
<p>Setting this to zero gives us the condition of optimality. Using matrix notation, we can rewrite this sum as follows:</p>
<script type="math/tex; mode=display">\nabla\cost{\pmb{w}} = \pmb{X}^T\left( g^{-1}(\pmb{Xw}) - \phi(\pmb{y}) \right) = 0</script>
<p>Note that this is a more general form of the formula we had <a href="#conditions-of-optimality">for logistic regression</a>.</p>
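<p>To illustrate, here is a minimal sketch of gradient descent using the gradient $\pmb{X}^T(g^{-1}(\pmb{Xw}) - \phi(\pmb{y}))$, specialized to the logistic case where $g^{-1}$ is the sigmoid and $\phi(y) = y$. The toy data, step size and iteration count are assumptions chosen for the example.</p>

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def glm_gradient(X, y, w, inverse_link=sigmoid):
    """X^T (g^{-1}(Xw) - phi(y)) with phi(y) = y, for rows X[n] and labels y[n]."""
    grad = [0.0] * len(w)
    for x_n, y_n in zip(X, y):
        residual = inverse_link(sum(x_i * w_i for x_i, w_i in zip(x_n, w))) - y_n
        for i, x_i in enumerate(x_n):
            grad[i] += x_i * residual
    return grad

def fit(X, y, step=0.1, iterations=2000):
    """Plain gradient descent; at convergence the gradient is (close to) zero."""
    w = [0.0] * len(X[0])
    for _ in range(iterations):
        grad = glm_gradient(X, y, w)
        w = [w_i - step * g_i for w_i, g_i in zip(w, grad)]
    return w

# Toy data: a constant feature plus one input. The labels are not linearly
# separable, so the optimum is finite.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 0.5]]
y = [0, 0, 1, 0, 1, 1]
w_hat = fit(X, y)
print(w_hat, glm_gradient(X, y, w_hat))
```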
<h2 id="nearest-neighbor-classifiers-and-the-curse-of-dimensionality">Nearest neighbor classifiers and the curse of dimensionality</h2>
<p>For simplicity, let’s assume that we’re operating in a d-dimensional box, that is, in the domain $\chi = [0, 1]^d$. As always, we have a training set $\Strain=\left\{(x_n, y_n)\right\}$.</p>
<h3 id="k-nearest-neighbor-knn">K Nearest Neighbor (KNN)</h3>
<p>Given a “fresh” input x, we can make a prediction using $\text{nbh}_{S_\text{train},\ k}(\pmb{x})$. This is a function returning the $k$ inputs in the training set that are closest to $\pmb{x}$.</p>
<p>For the regression problem, we can take the average of the k nearest neighbors:</p>
<script type="math/tex; mode=display">f(x) = \frac{1}{k}\sum_{n\in\text{nbh}_{\Strain,\ k}(x)}{y_n}</script>
<p>For binary classification, we take the majority element in the k-neighborhood. It’s a good idea to pick k odd so that there is a clear winner.</p>
<script type="math/tex; mode=display">f(x) = \text{maj}\left\{y_n : n \in \text{nbh}_{\Strain,\ k}(x) \right\}</script>
<p>If we pick a large value of k, then we are smoothing over a large area. Therefore, a large k gives us a simple model, with simpler boundaries, while a small k is a more complex model. In other words, complexity is inversely proportional to k.</p>
<p>If we pick k small we can expect a small bias but huge variance. If we pick a large k we can expect large bias but small variance.</p>
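<p>The two prediction rules can be sketched in a few lines. This is a brute-force version that sorts the whole training set by Euclidean distance; the helper names and the toy data are illustrative.</p>

```python
import math
from collections import Counter

def nearest_neighbors(train, x, k):
    """Return the k training pairs (x_n, y_n) whose inputs are closest to x."""
    return sorted(train, key=lambda pair: math.dist(pair[0], x))[:k]

def knn_regress(train, x, k):
    """Average of the labels in the k-neighborhood."""
    neighbors = nearest_neighbors(train, x, k)
    return sum(y_n for _, y_n in neighbors) / k

def knn_classify(train, x, k):
    """Majority label in the k-neighborhood (pick k odd for binary labels)."""
    neighbors = nearest_neighbors(train, x, k)
    return Counter(y_n for _, y_n in neighbors).most_common(1)[0][0]

train = [((0.1, 0.2), 0), ((0.2, 0.1), 0), ((0.8, 0.9), 1), ((0.9, 0.8), 1), ((0.7, 0.7), 1)]
print(knn_classify(train, (0.15, 0.15), k=3))
print(knn_regress(train, (0.75, 0.8), k=3))
```

<p>With $k$ equal to the size of the training set, the regression output no longer depends on the input at all, which is the extreme case of the smoothing described above.</p>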
<h3 id="analysis">Analysis</h3>
<p>We’ll analyze the simplest setting, a binary KNN model (that is, there are only two output labels, 0 and 1). Let’s start by simplifying our notation. We’ll introduce the following function:</p>
<script type="math/tex; mode=display">\eta(\pmb{x}) = \mathbb{P}\left\{y=1\mid\pmb{x}\right\}</script>
<p>This is the conditional probability that the label is 1, given that the input is $\pmb{x}$. If this probability is to be meaningful at all, we must have some correlation between the “position” x and the associated label; knowing the labels close by must give us some information. This means that we need an assumption on the distribution $\mathcal{D}$:</p>
<script type="math/tex; mode=display">\abs{\eta(\pmb{x}) - \eta(\pmb{x}')} \le \mathcal{c}\norm{\pmb{x} - \pmb{x}'}
\label{eq:lipschitz-bound}\tag{Lipschitz bound}</script>
<p>On the right-hand side we have Euclidean distance. In other words, we ask that the conditional probability $\mathbb{P}\left\{y=1\mid\pmb{x}\right\}$, denoted by $\eta(x)$, be <a href="https://en.wikipedia.org/wiki/Lipschitz_continuity">Lipschitz continuous</a> with Lipschitz constant $\mathcal{c}$. We will use this assumption later on to prove a performance bound for our KNN model.</p>
<p>Let’s assume for a moment that we know the actual underlying distribution. This is not something that we actually know in practice, but is useful for deriving a formulation for the optimal model. Knowing the probability distribution, our optimum decision rule is given by the classifier:</p>
<script type="math/tex; mode=display">f_*(\pmb{x}) = \mathbb{I}\left[ \eta(\pmb{x}) > \frac{1}{2} \right]</script>
<p>The idea of this classifier is that with two labels, we’ll pick the label that is likely to happen more than half of the time. The intuition is that if we were playing heads or tails and knew the probability in advance, we would always pick the option that has probability more than one half, and that is the best strategy we can use. This is known as the <strong>Bayes classifier</strong>, also called <strong>maximum a posteriori (MAP) classifier</strong>. It is optimal, in that it has the smallest probability of misclassification of any classifier, namely:</p>
<script type="math/tex; mode=display">\cost{f_*} = \expectsub{\pmb{x}\sim\mathcal{D}}{
\min{\left\{ \eta(\pmb{x}), 1-\eta(\pmb{x}) \right\}}
}</script>
<p>Let’s compare this to the probability of misclassification of the real model:</p>
<script type="math/tex; mode=display">\cost{f_{\Strain,\ k=1}} = \expect{\mathbb{I}\left[ f_{\Strain}(\pmb{x}) \ne y \right]}</script>
<p>This tells us that the risk (that is, the error probability of our $k=1$ nearest neighbor classifier) is the above expectation. It’s hard to find a closed form for that expectation, but we can place a bound on it by comparing the ideal, theoretical model to the actual model. We’ll state the following lemma:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\cost{f_{\Strain}}
& \le 2 \cost{f_*} + \mathcal{c} \expectsub{\Strain, \pmb{x}\sim\mathcal{D}}{\norm{\pmb{x} - \text{nbh}_{\Strain, 1}(\pmb{x})}} \\
& \le 2 \cost{f_*} + 4\mathcal{c}\sqrt{d} N^{-\frac{1}{d+1}} \\
\end{align} %]]></script>
<p>Before we see where this comes from, let’s just interpret it. The above gives us a bound on the risk of the real classifier, compared to the optimal one. The risk of the actual classifier is upper bounded by twice the risk of the optimal classifier (this is good), plus a geometric term reflecting dimensionality (it depends on $d$: this will cause us some trouble).</p>
<p>This second term of the sum is the average distance of a randomly chosen point to the nearest point in the training set, times the Lipschitz constant $\mathcal{c}$. It intuitively makes sense to incorporate this factor into our bound: if we are basing our prediction on a point that is very close, we’re more likely to be right, and if it’s far away, less so. If we’re in a box of $[0, 1]^d$, then the distance between two corners would be $\sqrt{d}$ (by Pythagoras’ theorem). The term $N^{-\frac{1}{d+1}}$ indicates that the closest data point may be closer than the opposite corner of the cube: if we have more data, we’ll probably not have to go that far. However, for large dimensions, we need much more data to have something that’ll probably be close.</p>
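<p>To get a feel for how quickly this gets out of hand, the sketch below evaluates the geometric term $4\mathcal{c}\sqrt{d} N^{-\frac{1}{d+1}}$ and inverts it to find how many samples are needed to push it below a given target. The Lipschitz constant of 1 and the target of 0.5 are arbitrary assumptions.</p>

```python
import math

def geometric_term(n_samples, dim, lipschitz=1.0):
    """4 c sqrt(d) N^(-1/(d+1)), the second term of the bound."""
    return 4.0 * lipschitz * math.sqrt(dim) * n_samples ** (-1.0 / (dim + 1))

def samples_needed(target, dim, lipschitz=1.0):
    """Value of N (as a real number) at which the geometric term equals `target`."""
    return (4.0 * lipschitz * math.sqrt(dim) / target) ** (dim + 1)

for d in (1, 2, 5, 10, 20):
    # the term for one million samples, and the N needed to reach 0.5
    print(d, geometric_term(10**6, d), samples_needed(0.5, d))
```

<p>The required $N$ grows exponentially with $d$, which is the curse of dimensionality in numbers.</p>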
<p>Let’s prove where this geometric term comes from. Let’s consider the cube $[0, 1]^d$, the space of inputs containing $\pmb{x}$. We cut this large cube into small cubes of side length $\epsilon$, and consider the small cube containing $\pmb{x}$. If we are lucky, this small cube also contains a neighboring data point at distance at most $\sqrt{d}\epsilon$ (as per Pythagoras’ theorem, as stated above). However, if we’re less lucky, the closest neighbor may be at the other corner of the big cube, at distance $\sqrt{d}$. So what is the probability of a point not having a neighbor in its small $\epsilon$ cube?</p>
<p>Let’s denote the probability of $\pmb{x}$ landing in a particular box by $\mathbb{P}_i$. The chance that none of the N training points are in the box is $(1-\mathbb{P}_i)^N$. We don’t know the distribution $\mathcal{D}$, so we can’t really express $\mathbb{P}_i$ in a closed form, but that doesn’t matter, this notation allows us to abstract over that. The rest of the proof is calculus, carefully choosing the right scaling for $\epsilon$ in order to get a good bound.</p>
<p>Now, let’s understand where the term $2\cost{f_*}$ comes from. If we independently flip two coins, $y$ and $y'$, that each land on heads with probability $p$, what is the probability of the outcomes being different?</p>
<script type="math/tex; mode=display">\mathbb{P}\left\{y \ne y' \right\} = 2p(1-p)</script>
<p>Now, let’s consider two points $\pmb{x}$ and $\pmb{x}'$, both elements of $[0, 1]^d$. Their labels are $y$ and $y'$, respectively. The probability of these two labels being different is roughly the same as above (although the probabilities of the two events may not be the same in general):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathbb{P}\left\{ y \ne y'\right\}
= & \eta(\pmb{x})(1-\eta(\pmb{x}')) + \eta(\pmb{x}')(1-\eta(\pmb{x})) \\
= & 2\eta(\pmb{x})(1-\eta(\pmb{x})) + (2\eta(\pmb{x})-1)(\eta(\pmb{x})-\eta(\pmb{x}')) \\
\le & 2\eta(\pmb{x})(1-\eta(\pmb{x})) + \abs{\eta(\pmb{x}) - \eta(\pmb{x}')} \\
\le & 2\eta(\pmb{x})(1-\eta(\pmb{x})) + \mathcal{c}\norm{\pmb{x}-\pmb{x}'}
\end{align} %]]></script>
<p>The last step uses $\ref{eq:lipschitz-bound}$.</p>
<p>Therefore, we can confirm the following bound:</p>
<script type="math/tex; mode=display">\mathbb{P}\left\{ y\ne y' \right\} \le 2\eta(\pmb{x})(1-\eta(\pmb{x})) + \mathcal{c}\norm{\pmb{x} - \pmb{x}'}</script>
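<p>The algebra behind this bound is easy to check numerically. Writing $a = \eta(\pmb{x})$ and $b = \eta(\pmb{x}')$, the sketch below verifies on a grid that the rewritten expression equals the probability of disagreement, and that it is bounded by $2a(1-a)$ plus the absolute difference of $a$ and $b$. The grid resolution is an arbitrary choice.</p>

```python
def disagreement(a, b):
    """P{y != y'} for independent labels with P{y = 1} = a and P{y' = 1} = b."""
    return a * (1 - b) + b * (1 - a)

def rewritten(a, b):
    """Same quantity, rearranged around 2a(1 - a)."""
    return 2 * a * (1 - a) + (2 * a - 1) * (a - b)

def upper_bound(a, b):
    """Uses |2a - 1| <= 1 to bound the second term by |a - b|."""
    return 2 * a * (1 - a) + abs(a - b)

grid = [i / 20 for i in range(21)]
worst_gap = max(abs(disagreement(a, b) - rewritten(a, b)) for a in grid for b in grid)
bound_holds = all(disagreement(a, b) <= upper_bound(a, b) + 1e-12 for a in grid for b in grid)
print(worst_gap, bound_holds)
```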
<p>But we are still one step away from explaining how we can compare this to the optimal estimator. In the above, we derived a bound for two labels being different. How is this related to our KNN model? The probability of getting a wrong prediction from KNN with $k=1$ (which we denoted $\cost{f_{\Strain}}$) is the probability of the predicted label, that is, the label of the nearest neighbor, being different from the true label.</p>
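<p>We get to our lemma by taking the expectation over $\pmb{x}$. For the first term:</p>
<script type="math/tex; mode=display">\expectsub{\pmb{x}\sim\mathcal{D}}{2\eta(\pmb{x})(1-\eta(\pmb{x}))}
\le 2\,\expectsub{\pmb{x}\sim\mathcal{D}}{\min{\left\{ \eta(\pmb{x}), 1-\eta(\pmb{x}) \right\}}}
= 2\cost{f_*}</script>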
<p>Additionally, when $\pmb{x}'$ is the nearest neighbor of $\pmb{x}$ in the training set, the average of the term $\mathcal{c}\norm{\pmb{x} - \pmb{x}'}$ is $\mathcal{c}\expectsub{\Strain, \pmb{x}\sim\mathcal{D}}{\norm{\pmb{x} - \text{nbh}_{\Strain, 1}(\pmb{x})}}$.</p>
<p>If we had assumed that it was a ball instead of a cube, we would’ve gotten slightly different results. But that’s beside the point: the main insight from this is that it depends on the dimension, and that for low dimensions at least, we still have a fairly good classifier. But finding a closest neighbor in high dimension can quickly become meaningless.</p>
<h2 id="support-vector-machines">Support Vector Machines</h2>
<h3 id="definition">Definition</h3>
<p>Let’s re-consider binary classification. In the following it will be more convenient to consider $y_n\in\left\{\pm 1 \right\}$. This is equivalent under the mapping $0 \mapsto -1$ and $1\mapsto 1$. This can be done continuously in the range by adding a constant feature 1 and scaling the weights by $\frac{1}{2}$.</p>
<p>Previously, we used MSE or logistic loss. MSE is symmetric, so something being positive or negative is punished at an equal rate. With logistic regression, we always have a loss, but its value is asymmetric, shrinking the further we go right.</p>
<p>If we instead use hinge loss (as defined below), with an additional regularization term, we get <strong>Support Vector Machines</strong> (SVM).</p>
<script type="math/tex; mode=display">\text{Hinge}(z, y) = [1-yz]_+ = \max{\left\{ 0, 1-yz \right\}}</script>
<p>Here, we use $z$ as shorthand for $\pmb{x}^T \pmb{w}$. This function is linear in $yz$ when $yz$ is below one, and is zero when $yz$ is above one: it does not punish correct predictions of magnitude above one. This pushes us to give predictions that we can be very confident about (magnitude above one).</p>
<p><img src="/images/ml/hinge-mse-logistic.png" alt="Graph of hinge loss, MSE and logistic" /></p>
<p>SVMs correspond to the following optimization problem:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}}{\sum_{n=1}^N{\left[ 1 - y_n \pmb{x}_n^T \pmb{w}\right]_+} + \frac{\lambda}{2}\norm{\pmb{w}}^2}</script>
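<p>As a small illustration, the hinge loss and the SVM objective can be evaluated as follows. The toy data and the value of $\lambda$ are assumptions chosen so that exactly one point lies inside the margin.</p>

```python
def hinge(z, y):
    """[1 - y z]_+ for a prediction z and a label y in {-1, +1}."""
    return max(0.0, 1.0 - y * z)

def svm_objective(X, y, w, lam):
    """Sum of hinge losses plus (lambda / 2) ||w||^2."""
    data_term = sum(
        hinge(sum(x_i * w_i for x_i, w_i in zip(x_n, w)), y_n)
        for x_n, y_n in zip(X, y)
    )
    return data_term + 0.5 * lam * sum(w_i * w_i for w_i in w)

X = [[2.0, 0.0], [0.5, 0.0], [-1.0, 0.0]]
y = [1, 1, -1]
# Only the second point (prediction 0.5, label +1) is inside the margin.
print(svm_objective(X, y, [1.0, 0.0], lam=0.1))
```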
<p>The regularizing term allows us to give an incentive to pick a separating hyperplane as far as possible from the points. In the figure below, the pink region represents the “margin” created by the regularizer. Any correctly classified points outside the margin, that is, points $\pmb{x}_n$ such that $y_n\pmb{x}_n^T\pmb{w}\ge 1$, do not incur any cost. The margin here does not only depend on the direction of $\pmb{w}$, but also its norm: the width of the margin is $2/\norm{\pmb{w}}$.</p>
<p><img src="/images/ml/margin.png" alt="Margin of a dataset" /></p>
<p>Thus, depending on the $\pmb{w}$ that we choose, the orientation and size of the margin will change; there will be a different number of points in it, and the cost will change.</p>
<p>Let’s assume $\lambda$ is small; we won’t define that further, the main point is just that our priorities are as follows:</p>
<ol>
<li>We want a separating hyperplane</li>
<li>We want a scaling of $\pmb{w}$ so that no point of the data is in the margin</li>
<li>We want the margin to be as wide as possible</li>
</ol>
<p>With conditions 1 and 2, we can ensure that there is no cost incurred in the first expression (the sum over $[1 - y_n \pmb{x}_n^T \pmb{w}]_+$). The third condition is ensured by the fact that we’re minimizing $\norm{\pmb{w}}^2$. Since the width of the margin is inversely proportional to $\norm{\pmb{w}}$, we’re maximizing the margin.</p>
<h3 id="alternative-formulation-duality">Alternative formulation: Duality</h3>
<p>Previously, we computed $[z]_+$. This could be restated as the following:</p>
<script type="math/tex; mode=display">[z]_+ = \max{\left\{ 0, z \right\}} = \max_{\alpha}{\alpha z}, \qquad \text{with } \alpha\in[0, 1]</script>
<p>Let’s say that we’re interested in minimizing a cost function $\cost{\pmb{w}}$. Let’s assume this can be defined through an auxiliary function $G$, such that:</p>
<script type="math/tex; mode=display">\cost{\pmb{w}} = \max_{\pmb{\alpha}}{G(\pmb{w}, \pmb{\alpha})}</script>
<p>The minimization in question is thus:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}}{\max_{\pmb{\alpha}}{G(\pmb{w}, \pmb{\alpha})}}</script>
<p>We call this the <strong>primal problem</strong>. In some cases though, it may be easier to find this in the other direction:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}}{\min_{\pmb{w}}{G(\pmb{w}, \pmb{\alpha})}}</script>
<p>We call this the <strong>dual problem</strong>. This leads us to a few questions:</p>
<h4 id="how-do-we-find-a-suitable-function-g">How do we find a suitable function G?</h4>
<p>There’s a general theory on this topic (see <a href="http://www.athenasc.com/nonlinbook.html">Nonlinear Programming</a> by Dimitri Bertsekas). In the case of SVMs though, finding the function $G$ is rather straightforward. Using the restatement of $[z]_+$ above with one $\alpha_n$ per training example, we get:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}}{\sum_{n=1}^N{\left[ 1 - y_n \pmb{x}_n^T \pmb{w}\right]_+} + \frac{\lambda}{2}\norm{\pmb{w}}^2}
= \min_{\pmb{w}}{\max_{\pmb{\alpha}\in[0, 1]^N}{\underbrace{\sum_{n=1}^N{\alpha_n\left( 1 - y_n \pmb{x}_n^T \pmb{w}\right)} + \frac{\lambda}{2}\norm{\pmb{w}}^2}_{G(\pmb{w}, \pmb{\alpha})}}}</script>
<p>Note that $G$ is convex in $\pmb{w}$, and linear (hence concave) in $\pmb{\alpha}$.</p>
<h4 id="when-is-it-ok-to-switch-min-and-max">When is it OK to switch min and max?</h4>
<p>It is always true that:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}}{\min_{\pmb{w}}{G(\pmb{w}, \pmb{\alpha})}}
\le
\min_{\pmb{w}}{\max_{\pmb{\alpha}}{G(\pmb{w}, \pmb{\alpha})}}</script>
<p>This is proven by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\min_{\pmb{w}'}{G(\pmb{w}', \pmb{\alpha})} & \le G(\pmb{w}, \pmb{\alpha}) \quad \forall \pmb{w}, \pmb{\alpha} & \implies \\
\max_{\pmb{\alpha}}{\min_{\pmb{w}'}{G(\pmb{w}', \pmb{\alpha})}} & \le \max_{\pmb{\alpha}}{G(\pmb{w}, \pmb{\alpha})} \quad \forall \pmb{w} & \implies \\
\max_{\pmb{\alpha}}{\min_{\pmb{w}'}{G(\pmb{w}', \pmb{\alpha})}} & \le \min_{\pmb{w}} \max_{\pmb{\alpha}}{G(\pmb{w}, \pmb{\alpha})} & \\
\end{align} %]]></script>
<p>Equality is achieved when the function looks like a saddle: when $G$ is a continuous function that is convex in $\pmb{w}$, concave in $\pmb{\alpha}$, and the domains of both are compact and convex.</p>
<p>For SVMs, this condition is fulfilled, and the switch between min and max can be done. The alternative formulation of SVMs is:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}\in[0, 1]^N}{\min_{\pmb{w}}{\sum_{n=1}^N{\alpha_n\left( 1 - y_n \pmb{x}_n^T \pmb{w}\right)} + \frac{\lambda}{2}\norm{\pmb{w}}^2}}</script>
<p>We can take the derivative with respect to $\pmb{w}$:</p>
<script type="math/tex; mode=display">\nabla_{\pmb{w}}G(\pmb{w}, \pmb{\alpha})
= -\sum_{n=1}^N{\alpha_n y_n \pmb{x}_n} + \lambda\pmb{w}</script>
<p>We’ll set this to zero to find a formulation of $\pmb{w}$ in terms of $\alpha$. We get:</p>
<script type="math/tex; mode=display">\pmb{w}(\pmb{\alpha}) = \frac{1}{\lambda}\sum_{n=1}^N{\alpha_n y_n \pmb{x}_n} = \frac{1}{\lambda}\pmb{X}^T\pmb{Y\alpha}</script>
<p>Where $\pmb{Y} := \text{diag}(\pmb{y})$.</p>
<p>If we plug this into the formulation of SVM, we get the following dual problem:</p>
<script type="math/tex; mode=display">\max_{\pmb{\alpha}\in[0, 1]^N}{\vec{1}^T\alpha - \frac{1}{2\lambda}\pmb{\alpha}^T\pmb{YXX}^T\pmb{Y\alpha}}
\label{eq:svm-quadratic-form} \tag{Quadratic form}</script>
<h4 id="when-is-the-dual-easier-to-optimize-than-the-primal">When is the dual easier to optimize than the primal?</h4>
<ol>
<li>When the dual is a differentiable quadratic problem (as SVM is). This is a problem that takes the same $\ref{eq:svm-quadratic-form}$ as above. In this case, we can optimize by using <strong>coordinate descent</strong> (or more precisely, ascent, as we’re searching for the maximum). Crucially, this method only changes one $\alpha_n$ variable at a time.</li>
<li>In the $\ref{eq:svm-quadratic-form}$ above, the data enters the formula in the form $\pmb{K} = \pmb{XX}^T$. This is called the <strong>kernel</strong>. We say this formulation is <em>kernelized</em>. Using this representation is called the <em>kernel trick</em>, and gives us some nice consequences that we’ll discuss later.</li>
<li>Typically, the solution $\pmb{\alpha}$ is sparse, being non-zero only in the training examples that are instrumental in determining the decision boundary. If we recall how we defined $\alpha$ as an alternative formulation of $[z]_+$, we can see that there are three distinct cases to consider:
<ol>
<li>Examples that lie on the correct side, and outside the margin, for which $\alpha_n = 0$. These are <strong>non-support vectors</strong></li>
<li>Examples that are on the correct side and just on the margin, for which $y_n \pmb{x}_n^T \pmb{w} = 1$, so $\alpha_n \in (0, 1)$. These $\pmb{x}_n$ are <strong>essential support vectors</strong></li>
<li>Examples that are strictly within the margin, or on the wrong side have $\alpha_n = 1$, and are called <strong>bound support vectors</strong></li>
</ol>
</li>
</ol>
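<p>The first point can be sketched concretely. Below is a minimal coordinate ascent on the dual, which maximizes exactly along one $\alpha_n$ at a time and clips the result to $[0, 1]$, while keeping $\pmb{w}(\pmb{\alpha})$ up to date. This is an illustrative sketch on toy data rather than a production solver; the data, $\lambda$ and the number of sweeps are assumptions.</p>

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weights(X, y, alpha, lam):
    """w(alpha) = (1 / lambda) sum_n alpha_n y_n x_n."""
    w = [0.0] * len(X[0])
    for x_n, y_n, a_n in zip(X, y, alpha):
        for i, x_i in enumerate(x_n):
            w[i] += a_n * y_n * x_i / lam
    return w

def primal(X, y, w, lam):
    """Sum of hinge losses plus (lambda / 2) ||w||^2."""
    return sum(max(0.0, 1.0 - y_n * dot(x_n, w)) for x_n, y_n in zip(X, y)) + 0.5 * lam * dot(w, w)

def dual(X, y, alpha, lam):
    """1^T alpha - (1 / (2 lambda)) alpha^T Y X X^T Y alpha, written via w(alpha)."""
    w = weights(X, y, alpha, lam)
    return sum(alpha) - 0.5 * lam * dot(w, w)

def coordinate_ascent(X, y, lam, sweeps=200):
    alpha = [0.0] * len(X)
    w = [0.0] * len(X[0])
    for _ in range(sweeps):
        for n, (x_n, y_n) in enumerate(zip(X, y)):
            sq_norm = dot(x_n, x_n)
            if sq_norm == 0.0:
                continue
            # Exact maximizer of the (concave, quadratic) dual along coordinate n,
            # clipped to the feasible interval [0, 1].
            new_alpha = alpha[n] + lam * (1.0 - y_n * dot(x_n, w)) / sq_norm
            new_alpha = min(1.0, max(0.0, new_alpha))
            delta = new_alpha - alpha[n]
            alpha[n] = new_alpha
            w = [w_i + delta * y_n * x_i / lam for w_i, x_i in zip(w, x_n)]
    return alpha, w

X = [[2.0, 1.0], [1.0, 2.0], [-1.0, -1.0], [-2.0, -1.0]]
y = [1, 1, -1, -1]
lam = 0.1
alpha, w = coordinate_ascent(X, y, lam)
print(alpha, w, primal(X, y, w, lam), dual(X, y, alpha, lam))
```

<p>On this toy problem most $\alpha_n$ end up at zero, which illustrates the sparsity mentioned in the third point, and the primal and dual objectives coincide at the solution.</p>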
<h3 id="kernel-trick">Kernel trick</h3>
<p>We saw previously that our data only enters the problem in the form of a kernel, $\pmb{K} = \pmb{XX}^T$. We’ll see now that when we’re using the kernel, we can easily go to a much larger dimensional space (even infinite dimensional space) without adding any complexity. This isn’t always applicable though, so we’ll also see which kernel functions are admissible for this trick.</p>
<h4 id="alternative-formulation-of-ridge-regression">Alternative formulation of ridge regression</h4>
<p>Let’s recall that least squares is a special case of ridge regression (where $\lambda = 0$). Ridge regression corresponds to the following optimization problem:</p>
<script type="math/tex; mode=display">\pmb{w}^* = \argmin_{\pmb{w}}{\frac{1}{2}\sum_{n=1}^N{(y_n - \pmb{x}_n^T \pmb{w})^2} + \frac{\lambda}{2}\norm{\pmb{w}}^2}</script>
<p>We saw that the solution has a closed form:</p>
<script type="math/tex; mode=display">\pmb{w}^* = (\pmb{X}^T\pmb{X} + \lambda\pmb{I}_D)^{-1} \pmb{X}^T y</script>
<p>We claim that this can be alternatively written as:</p>
<script type="math/tex; mode=display">\pmb{w}^* =
\pmb{X}^T
(\underbrace{\pmb{XX}^T + \lambda\pmb{I}_N}_{N\times N})^{-1}
y</script>
<p>The original formulation’s runtime is $\mathcal{O}(D^3 + ND^2)$, while the alternative is $\mathcal{O}(N^3 + DN^2)$. Which is more efficient depends on $D$ and $N$.</p>
<details><summary><p>Proof</p>
</summary><div class="details-content">
<p>We can prove this formulation by using the following identity. Let $P$ be an $N\times M$ matrix, and $Q$ be $M\times N$. Then:</p>
<script type="math/tex; mode=display">P(QP + I_M) = PQP + P = (PQ + I_N)P</script>
<p>Assuming that $(QP + I_M)$ and $(PQ + I_N)$ are invertible, we have the identity:</p>
<script type="math/tex; mode=display">(PQ+I_N)^{-1}P = P(QP+I_M)^{-1}</script>
<p>To derive the formula, we can let $P = X^T$ and $Q = \frac{1}{\lambda}X$.</p>
</div></details>
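<p>The equivalence of the two formulations can also be checked numerically. The sketch below solves both linear systems on a small example with a hand-written Gauss-Jordan elimination, to stay within the standard library; the helper functions and the data are illustrative.</p>

```python
def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [m_r - factor * m_c for m_r, m_c in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def add_ridge(M, lam):
    """M + lambda * I for a square matrix M."""
    return [[M[i][j] + (lam if i == j else 0.0) for j in range(len(M))] for i in range(len(M))]

def ridge_primal(X, y, lam):
    """w = (X^T X + lambda I_D)^{-1} X^T y: a D x D system."""
    Xt = transpose(X)
    rhs = [sum(x * t for x, t in zip(col, y)) for col in Xt]
    return solve(add_ridge(matmul(Xt, X), lam), rhs)

def ridge_dual(X, y, lam):
    """w = X^T (X X^T + lambda I_N)^{-1} y: an N x N system."""
    alpha = solve(add_ridge(matmul(X, transpose(X)), lam), y)
    return [sum(a * x for a, x in zip(alpha, col)) for col in transpose(X)]

X = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # N = 3 samples, D = 2 features
y = [1.0, 2.0, 3.0]
print(ridge_primal(X, y, 0.5))
print(ridge_dual(X, y, 0.5))
```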
<h4 id="representer-theorem">Representer theorem</h4>
<p>The representer theorem generalizes what we just saw about least squares. For any cost function $\mathcal{L}_n$, consider a minimizer $\pmb{w}^*$ of:</p>
<script type="math/tex; mode=display">\min_{\pmb{w}}{\sum_{n=1}^N{
\mathcal{L}_n(\pmb{x}_n^T \pmb{w}, y_n)} + \frac{\lambda}{2}\norm{\pmb{w}}^2
}</script>
<p>Then there exists $\pmb{\alpha}^*$ such that $\pmb{w}^* = \pmb{X}^T \pmb{\alpha}^*$.</p>
<h4 id="kernelized-ridge-regression">Kernelized ridge regression</h4>
<p>The above theorem gives us a new way of searching for $\pmb{w}^*$: we can first search for $\pmb{\alpha^*}$, which might be easier, and then get back to the optimal weights by using the identity $\pmb{w}^* = \pmb{X}^T \pmb{\alpha}^*$.</p>
<p>Therefore, for ridge regression, we can equivalently optimize our alternative formula in terms of $\alpha$:</p>
<script type="math/tex; mode=display">\pmb{\alpha}^* = \argmin_{\pmb{\alpha}}{\frac{1}{2}\pmb{\alpha}^T(\pmb{XX}^T + \lambda I_N)\pmb{\alpha} - \pmb{\alpha}^T \pmb{y}}</script>
<p>We see that our data enters in kernel form. How do we get the solution to this minimization problem? We can, as always, take the gradient of the cost function with respect to $\pmb{\alpha}$ and set it to zero:</p>
<script type="math/tex; mode=display">\nabla_{\alpha}\cost{\alpha} = (\pmb{XX}^T + \lambda I)\alpha - y = 0</script>
<p>Solving for $\alpha$ results in:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\alpha^* & = (\pmb{XX}^T + \lambda I_N)^{-1} y \\
\pmb{w}^* & = X^T(\pmb{XX}^T + \lambda I_N)^{-1} y
\end{align} %]]></script>
<p>We’ve effectively gotten back to our claimed alternative formulation for the optimal weights.</p>
<h4 id="kernel-functions">Kernel functions</h4>
<p>The kernel is defined as $K = XX^T$. We’ll call this the linear kernel. The elements are defined as:</p>
<script type="math/tex; mode=display">K_{i, j} = \pmb{x}_i^T\pmb{x}_j</script>
<p>Assume that we had first augmented the feature space with $\phi(\pmb{x})$; the elements of the kernel would then be:</p>
<script type="math/tex; mode=display">K_{i, j} = \phi(\pmb{x}_i)^T\phi(\pmb{x}_j)</script>
<p>Using this formulation allows us to keep the size of $K$ the same ($N\times N$), regardless of how much we augment. In other words, we can now solve a problem where the size is independent of the dimension of the feature space.</p>
<p>We’ll see a few examples where we augment the feature space by a few dimensions, which of course isn’t terribly useful (going to $N\times N$ just to be able to augment the feature space by 3 is rarely a good trade-off), but if we can augment our feature space by a lot and still keep our kernel finite (as we’ll see later with RBF kernels), then the trade-off becomes worth it.</p>
<p>The feature augmentation goes from $x_n \in \mathbb{R}^d$ to $\phi(x_n) \in \mathbb{R}^{d'}$ with $d' \gg d$, or even to an infinite dimension. Depending on $d'$, it may or may not be worth it.</p>
<h4 id="kernel-trick-1">Kernel trick</h4>
<p>The big advantage of using kernels is that rather than first augmenting the feature space and then computing the kernel, we can do both steps together, and we can do it more efficiently.</p>
<p>Let’s define a kernel function $\kappa(\pmb{x}, \pmb{x}')$. We’ll let entries in the kernel $K$ be defined by:</p>
<script type="math/tex; mode=display">K_{i, j} = \kappa(\pmb{x}_i, \pmb{x}_j)</script>
<p>We can pick different kernel functions and get some interesting results. If we pick the right kernel, it can be equivalent to augmenting the features with some $\phi(\pmb{x})$, and then computing the inner product:</p>
<script type="math/tex; mode=display">\kappa(\pmb{x}, \pmb{x}') = \phi(\pmb{x})^T\phi(\pmb{x}')</script>
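<p>To make this equivalence concrete, here is a small sketch using a quadratic kernel on $\mathbb{R}^2$. This particular kernel and its feature map are a standard illustration added here as an assumption, not an example taken from the lecture notes. The kernel $(\pmb{x}^T\pmb{x}')^2$ gives the same value as the inner product of the explicit features $(x_1^2, \sqrt{2}x_1x_2, x_2^2)$, without ever building them.</p>

```python
import math

def feature_map(x):
    """Explicit quadratic features for x in R^2: (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)

def quadratic_kernel(x, x_prime):
    """kappa(x, x') = (x^T x')^2, computed directly in the original space."""
    return sum(a * b for a, b in zip(x, x_prime)) ** 2

def kernel_matrix(points, kappa):
    """K[i][j] = kappa(x_i, x_j)."""
    return [[kappa(p, q) for q in points] for p in points]

x, x_prime = (1.0, 2.0), (3.0, -1.0)
explicit = sum(a * b for a, b in zip(feature_map(x), feature_map(x_prime)))
print(quadratic_kernel(x, x_prime), explicit)  # the two values should agree up to rounding
```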
<p>Let’s take a look at a few examples of choices for $\kappa$ and see what happens:</p>
<h5 id="radial-basis-function">Radial basis function</h5>
<p>The following kernel corresponds to an infinite feature map:</p>
<script type="math/tex; mode=display">\kappa(\pmb{x}, \pmb{x}') = \exp{\left[-(\pmb{x} - \pmb{x}')^T(\pmb{x} - \pmb{x}')\right]}</script>
<p>This is called the <em>radial basis function</em> (RBF) kernel. Let’s look at this in more detail. Consider the special case in which $\pmb{x}$ and $\pmb{x}'$ are scalars; we’ll look at the Taylor expansion of the function:</p>
<script type="math/tex; mode=display">\kappa(x, x') = e^{-(x)^2} e^{-(x')^2} \sum_{k=0}^\infty{\frac{2^k(x)^k(x')^k}{k!}}</script>
<p>We can think of this infinite sum as the dot-product of two infinite vectors, whose $k$-th components are equal to, respectively:</p>
<script type="math/tex; mode=display">e^{-(x)^2} \sqrt{\frac{2^k}{k!}} (x)^k
\quad \text{and} \quad
e^{-(x')^2} \sqrt{\frac{2^k}{k!}} (x')^k</script>
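<p>We can check this expansion numerically for the scalar kernel $e^{-(x - x')^2}$: truncating the infinite feature vectors after a handful of components already reproduces the kernel value to high precision. The number of terms and the test points are arbitrary choices.</p>

```python
import math

def rbf(x, x_prime):
    """Scalar RBF kernel exp(-(x - x')^2)."""
    return math.exp(-(x - x_prime) ** 2)

def feature(x, k):
    """k-th component of the infinite feature vector."""
    return math.exp(-x * x) * math.sqrt(2.0 ** k / math.factorial(k)) * x ** k

def truncated_kernel(x, x_prime, terms):
    """Dot product of the first `terms` components of the two feature vectors."""
    return sum(feature(x, k) * feature(x_prime, k) for k in range(terms))

x, x_prime = 0.5, -0.3
for terms in (1, 2, 5, 20):
    # the truncated value approaches the exact kernel as more terms are kept
    print(terms, truncated_kernel(x, x_prime, terms), rbf(x, x_prime))
```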
<h4 id="new-kernel-functions-from-old-ones">New kernel functions from old ones</h4>
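<p>We can construct new kernels from old ones: a linear combination with non-negative coefficients, a product, or a kernel applied to transformed inputs $f(\pmb{x})$ all give valid kernels:</p>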
<script type="math/tex; mode=display">\kappa(\pmb{x}, \pmb{x'}) = a\kappa_1(\pmb{x}, \pmb{x'}) + b\kappa_2(\pmb{x}, \pmb{x'}), \quad \forall a, b \ge 0 \\
\kappa(\pmb{x}, \pmb{x'}) = \kappa_1(\pmb{x}, \pmb{x'}) \kappa_2(\pmb{x}, \pmb{x'}) \\
\kappa(\pmb{x}, \pmb{x'}) = \kappa_1(f(\pmb{x}), f(\pmb{x'})) \\</script>
<p>Proofs are in the lecture notes. If we accept these, we can combine them to prove that much more complex functions are valid kernels.</p>
<h3 id="properties-of-kernels">Properties of kernels</h3>
<p>A kernel function must be an inner-product in some feature space. Mercer’s condition states that this is true iff the following conditions are fulfilled:</p>
<ol>
<li>$K$ should be symmetric, i.e. $\kappa(\pmb{x}, \pmb{x}') = \kappa(\pmb{x}', \pmb{x})$</li>
<li>For any arbitrary input set $\left\{\pmb{x}_n\right\}$ and all $N$, $K$ should be positive semi-definite</li>
</ol>
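<p>Both conditions can be probed numerically on a given input set. The sketch below checks symmetry exactly, and tests positive semi-definiteness by evaluating $\pmb{z}^T K \pmb{z}$ for random vectors $\pmb{z}$. This randomized test is only a necessary check (passing it does not prove the property), and the candidate functions and sample points are assumptions made for the example.</p>

```python
import math
import random

def gram(points, kappa):
    """K[i][j] = kappa(x_i, x_j)."""
    return [[kappa(p, q) for q in points] for p in points]

def is_symmetric(K, tol=1e-12):
    return all(abs(K[i][j] - K[j][i]) <= tol for i in range(len(K)) for j in range(len(K)))

def quadratic_form(K, z):
    return sum(z[i] * K[i][j] * z[j] for i in range(len(K)) for j in range(len(K)))

def looks_psd(K, trials=1000, tol=1e-9, seed=0):
    """Randomized necessary check: z^T K z >= 0 for random z."""
    rng = random.Random(seed)
    n = len(K)
    return all(
        quadratic_form(K, [rng.gauss(0.0, 1.0) for _ in range(n)]) >= -tol
        for _ in range(trials)
    )

def linear(x, x_prime):
    return x * x_prime

def rbf(x, x_prime):
    return math.exp(-(x - x_prime) ** 2)

def negative_distance(x, x_prime):
    """Symmetric, but not positive semi-definite, so not a valid kernel."""
    return -abs(x - x_prime)

points = [0.0, 0.5, 1.5, -1.0]
for kappa in (linear, rbf, negative_distance):
    K = gram(points, kappa)
    print(kappa.__name__, is_symmetric(K), looks_psd(K))
```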
<h2 id="unsupervised-learning">Unsupervised learning</h2>
<p>So far, all we’ve done is supervised learning: we’ve gone from a training set with feature vectors and labels, and we wanted to output a classification or a regression.</p>
<p>There is a second very important framework in ML called <em>unsupervised</em> learning. Here, the training set is only composed of the feature vectors:</p>
<script type="math/tex; mode=display">S_{\text{train}} = \left\{ \pmb{x}_n \right\}_{n=1}^N</script>
<p>We would then like to learn from this dataset without having access to the training labels. The two main directions in unsupervised learning are:</p>
<ul>
<li>Representation learning & feature learning</li>
<li>Density estimation & generative models</li>
</ul>
<p>Let’s take a bird’s eye view of the existing techniques through some examples.</p>
<ol>
<li><strong>Matrix factorization</strong>: can be used for both supervised and unsupervised. We’ll give an example for each
<ol>
<li><strong>Netflix, collaborative filtering</strong>: this is an example of supervised learning. We have a large, sparse matrix with rows of users, columns of movies, containing ratings. If we can approximate the matrix reasonably well by a matrix of rank one (i.e. outer product of two vectors), then this extracts useful features both for the users and the movies; it might group movies by genres, and users by type.</li>
<li><strong>word2vec</strong>: this is an example of unsupervised learning. The idea is to map every word from a large corpus to a vector $w_i \in \mathbb{R}^K$, where K is relatively large. This would allow us to represent natural language in some numeric space. To get this, we build a matrix $N\times N$, with $N$ being the number of words in the corpus. We then factorize the matrix by means of two matrices of rank $K$ to give us the desired representation. The results are pretty astounding, as <a href="https://www.tensorflow.org/tutorials/representation/word2vec">this article</a> shows; closely related words are close in the vector space, and it’s easy to get a mapping from concepts to associated concepts (say, countries to capitals).</li>
</ol>
</li>
<li><strong>PCA and SVD</strong> (Principal Component Analysis and Singular Value Decomposition): Features are vectors in $\mathbb{R}^d$ for some d. If we wanted to “compress” this down to one dimension (this doesn’t have to be an existing feature, it could be a newly generated one from the existing ones), we could ask that the variance of the projected data be as large as possible. This will lead us to PCA, which we compute using SVD.</li>
<li><strong>Clustering</strong>: to reveal structure in data, we can cluster points given some similarity measure (e.g. Euclidean distance) and the number of clusters we want. We can also ask clusters to be hierarchical (clusters within clusters).</li>
<li><strong>Generative models</strong>: a generative model models the distribution of the data
<ol>
<li><strong>Auto-encoders</strong>: these are a form of compression algorithm, trying to find good weights for encoding and compressing the data</li>
<li><strong>Generative Adversarial Networks</strong> (GANs): the idea is to use two neural nets, one that tries to generate samples that look like the data we get, and another that tries to distinguish the real samples from the fake ones. The aim is that after sufficient training, a classifier cannot distinguish real samples from artificial ones. If we achieve that, then we have built a good model.</li>
</ol>
</li>
</ol>
<h3 id="k-means">K-Means</h3>
<p>A common algorithm for unsupervised learning is called K-means (also called vector quantization in signal processing). The aim of this algorithm is to cluster the data: we want to find a partition such that every point is in exactly one group, and such that within a group, the (Euclidean) distance between points is much smaller than across groups.</p>
<p>In K-means, we find these clusters in terms of cluster centers (also called means). Each center dictates the partition: which cluster a point belongs to depends on which center is closest to the point. In other words, we’re minimizing the distance over all $N$ points and $K$ clusters:</p>
<script type="math/tex; mode=display">\mathcal{L}_{\text{K-means}} = \min_{\left\{\mu_k\right\}, \left\{z_{nk}\right\}}{\sum_{n=1}^N{\sum_{k=1}^K{z_{nk} \norm{x_n - \mu_k}^2}}}</script>
<p>The $z_{nk}$ is the k<sup>th</sup> number in the $z_n$ vector, which is a one-hot vector encoding the cluster assignment: if we’re looking at the n<sup>th</sup> datapoint, we have a vector $z_n$ of length K consisting of zeros and ones, which has exactly one 1. Mathematically, we can write this constraint as:</p>
<script type="math/tex; mode=display">z_{nk} \in \left\{0, 1\right\}, \quad \sum_{k=1}^K{z_{nk}} = 1</script>
<p>This formulation of the problem gives rise to two conditions, which will give us an intuitive algorithm for solving this iteratively. We see that there are two sets of variables to optimize over: $\mu_k$ and $z_{nk}$. The idea is to fix one and optimize the other.</p>
<p>First, let’s fix the centers $\left\{\mu_k \right\}$ and work on the assignments. To minimize the sum:</p>
<script type="math/tex; mode=display">% <![CDATA[
z_{nk} = \begin{cases}
1, & k = \argmin_{k'}{\norm{x_n - \mu_{k'}}^2} \\
0, & \text{ otherwise}
\end{cases} %]]></script>
<p>Intuitively, this means that we’re grouping the points by the closest center.</p>
<p>Having computed this, we can fix the assignments $z_{nk}$ to compute optimal centers $\mu_k$. Each center should be the centroid of its cluster, as this minimizes the sum of squared distances from the points of the cluster to the center.</p>
<script type="math/tex; mode=display">\mu_k = \frac{\sum_{n=1}^N{z_{nk} x_n}}{\sum_{n=1}^N{z_{nk}}}</script>
<p>Note that in this formulation, $k$ is fixed, and $n$ varies in the sum. This gives us some kind of average: the sum of all the positions of the points in the cluster, divided by the number of points in the cluster.</p>
<p>How did we get to this formulation? If we take the derivative of the cost function and set it to zero, and then solve it for $\mu_k$, we get to the above.</p>
<script type="math/tex; mode=display">\nabla_{\mu_k}\mathcal{L}_{\text{K-means}} = \sum_{n=1}^N{2z_{nk}\mu_k - 2z_{nk}x_n} = 0</script>
<p>Solving this confirms that taking the average position in the cluster indeed is the best way to optimize our cost.</p>
<p>These observations give rise to an algorithm:</p>
<ol>
<li>Initialize the centers $\left\{\mu_k^{(0)}\right\}$. In practice, the algorithm’s convergence may depend on this choice, but there is no general best strategy. As such, they can in general be initialized randomly.</li>
<li>Repeat until convergence:
<ol>
<li>Choose $z^{(t+1)}$ given $\mu^{(t)}$</li>
<li>Choose $\mu^{(t+1)}$ given $z^{(t+1)}$</li>
</ol>
</li>
</ol>
<p>Each of these two steps can only decrease the cost or leave it unchanged. Still, the algorithm may get stuck at a local minimum: it’s a greedy algorithm, so there’s no guarantee that it converges to the global optimum.</p>
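<p>The two alternating steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function names, the initialization from randomly chosen data points, the stopping test and the handling of empty clusters are all assumptions made for the example.</p>

```python
import numpy as np

def assign(X, mu):
    """Closest-center assignment for each row of X, and the resulting cost."""
    sq_dists = np.sum((X[:, None, :] - mu[None, :, :]) ** 2, axis=2)  # N x K
    return np.argmin(sq_dists, axis=1), float(sq_dists.min(axis=1).sum())

def kmeans(X, K, n_iters=100, seed=0):
    """Alternate the assignment and mean updates on the rows of X (N x D)."""
    rng = np.random.default_rng(seed)
    # Initialization: K distinct data points picked at random (one common choice)
    mu = X[rng.choice(X.shape[0], size=K, replace=False)].copy()
    costs = []
    for _ in range(n_iters):
        # Step 1: fix the centers, assign each point to its closest center
        z, cost = assign(X, mu)
        costs.append(cost)
        # Step 2: fix the assignments, move each center to the mean of its cluster
        # (an empty cluster keeps its previous center)
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    z, _ = assign(X, mu)
    return mu, z, costs

# Toy data: two blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
mu, z, costs = kmeans(X, K=2)
print(mu, costs)
```

<p>The recorded costs never increase from one iteration to the next, which is the greedy behavior described above. Running the function with different seeds is a simple way to observe the dependence on the initialization.</p>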
<h4 id="coordinate-descent-interpretation">Coordinate descent interpretation</h4>
<p>There are other ways to look at K-means. One way is to think of it as a coordinate descent, minimizing a cost function by finding parameters $\mu$ and $z$. This doesn’t actually give us much new insight, but it’s a nice way to think about it.</p>
<h4 id="matrix-factorization-interpretation">Matrix factorization interpretation</h4>
<p>Another way to think about it is as a matrix factorization. We can rewrite K-means as the minimization of the following:</p>
<script type="math/tex; mode=display">\min_{M, Z}{\norm{X^T - M Z^T}_F^2}</script>
<p>A few notes on this notation:</p>
<ul>
<li>$X$ is as always our data matrix; since it’s transposed, $X^T$ is $D\times N$.</li>
<li>$M$ is a $D\times K$ matrix containing the means; each column represents a different center.</li>
<li>$Z^T$ is the $K\times N$ assignment matrix; each of its columns picks a single column from the means. This means that exactly one element of each column of $Z^T$ (i.e. of each row of $Z$) is 1.</li>
<li>The norm here is the <a href="https://en.wikipedia.org/wiki/Matrix_norm#Frobenius_norm">Frobenius norm</a>, whose square is the sum of the squares of all elements in the matrix.</li>
</ul>
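<p>To make the notation concrete, the following small check compares the double-sum form of the K-means cost with the squared Frobenius norm of $X^T - MZ^T$ on random inputs. The sizes and the random assignments are arbitrary choices for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, K = 6, 2, 3
X = rng.normal(size=(N, D))            # data matrix, one point per row
M = rng.normal(size=(D, K))            # one center per column
labels = rng.integers(0, K, size=N)    # arbitrary cluster index for each point
Z = np.eye(K)[labels]                  # N x K one-hot assignment matrix

# K-means cost written as the double sum over points and clusters
cost_sum = sum(Z[n, k] * np.sum((X[n] - M[:, k]) ** 2)
               for n in range(N) for k in range(K))
# Same cost written as a squared Frobenius norm
cost_frob = np.linalg.norm(X.T - M @ Z.T, ord="fro") ** 2

print(cost_sum, cost_frob)
```

<p>Column $n$ of $MZ^T$ is the center assigned to point $n$, which is why the two values agree up to rounding.</p>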
<h4 id="probabilistic-interpretation">Probabilistic interpretation</h4>
<p>A probabilistic interpretation of K-means will lead us to GMMs. Having a probabilistic approach is useful because it allows us to account for the model that we think generated the data.</p>
<p>The assumption is that the samples $x_n$ come from one out of $K$ D-dimensional Gaussian distributions. These distributions are assumed to have means $\left\{ \mu_k \right\}$, and a covariance matrix that is the identity matrix (that is, variance 1 in each dimension, and the dimensions are independent).</p>
<p>Let’s write down the likelihood. The Gaussian density function of a sample is:</p>
<script type="math/tex; mode=display">p(x_n \mid \mu, z) = \prod_{k=1}^K{\left(
\frac{1}{(2\pi)^{D/2}} \exp{\frac{-\norm{x_n - \mu_k}^2}{2}}
\right)^{z_{nk}}}</script>
<p>What’s inside of the large parentheses is the density assuming that we know that the point comes from a given source $k$. Since we don’t know which of the $K$ distributions the point comes from, we take the product over all $k$, with each factor raised to the power $z_{nk}$ (this is the same trick we used previously; it allows us to cancel all sources other than the assigned one).</p>
<p>Now, if we want to do this for the whole set instead of for a single sample, assuming that the samples are i.i.d, we can write this as a product:</p>
<script type="math/tex; mode=display">\prod_{n=1}^N{p(x_n \mid \mu, z)} = \prod_{n=1}^N{\prod_{k=1}^K{\left(
\frac{1}{(2\pi)^{D/2}} \exp{\frac{-\norm{x_n - \mu_k}^2}{2}}
\right)^{z_{nk}}}}</script>
<p>This is the likelihood, which we want to maximize. We could equivalently minimize the negative log-likelihood. We’ll also drop the constant factor $(2\pi)^{-D/2}$, as it has no influence on our minimization.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
-\log{\prod_{n=1}^N{p(x_n \mid \mu, z)}}
& = -\log{\prod_{n=1}^N{\prod_{k=1}^K{\left(
\exp{\frac{-\norm{x_n - \mu_k}^2}{2}}
\right)^{z_{nk}}}}} \\
& = \frac{1}{2}\sum_{n=1}^N{\sum_{k=1}^K{z_{nk} \norm{x_n - \mu_k}^2}}
\end{align} %]]></script>
<p>Up to the factor $\frac{1}{2}$, which doesn’t change the minimizer, this is of course the cost function we were optimizing before.</p>
<h3 id="gaussian-mixture-model-gmm">Gaussian Mixture Model (GMM)</h3>
<p>So now that we’ve expressed K-means from a probabilistic view, let’s view the probabilistic generalization, which is called a Gaussian Mixture Model.</p>
<p>To generalize the previous, what if our data comes from Gaussian sources that aren’t perfectly circularly symmetric, that don’t have the identity matrix as covariance? A more general solution is to allow for a general covariance matrix $\Sigma_k$ for each source. This will add another parameter that we need to optimize over, but can help us more accurately model the data.</p>
<p>Another extension is that each point was previously forced to belong to exactly one distribution. This is called hard clustering. We can generalize this to soft clustering, where a point can be associated to multiple clusters. In soft clustering, we model $z_n$ as a random variable taking values in $\left\{1, \dots, K\right\}$, instead of a one-hot vector.</p>
<p>This assignment is given by a certain distribution. We denote the a priori probability that the sample comes from the k<sup>th</sup> Gaussian, $\mathcal{N}(\mu_k, \Sigma_k)$ by $\pi_k$:</p>
<script type="math/tex; mode=display">P(z_n = k) = \pi_k</script>
<p>What we’re trying to maximize in this extended model is then the likelihood (still under the assumption that the samples are drawn independently from one of the $K$ Gaussian sources):</p>
<script type="math/tex; mode=display">\prod_{n=1}^N{p(z_n \mid \pi) \mathcal{N}(x_n \mid z_n, \mu, \Sigma)}</script>
<p>This is the model that we’ll use. It’s not something that we aim to prove or not prove, it’s just what we chose to base ourselves on. We’ll want to optimize over $\pi$, $\mu$ and $\Sigma$.</p>
<p>The $z_n$ variable is what’s known as a <strong>latent variable</strong>; it’s not something that we observe directly, it’s just something that we use to make our model more expressive. If we’re not interested in the latent variables, we can marginalize them out by summing over their possible values. This gives us:</p>
<script type="math/tex; mode=display">\prod_{n=1}^N{\sum_{k=1}^K{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}}</script>
<p>This is a weighted sum of all the models. The weights sum up to one, so we have a valid density. In other words, we are now able to model much more complex distribution functions by building up our distribution from K Gaussian distributions; we can optimize the fit of the model by changing the weights, means and covariances to minimize the negative log-likelihood of the above, which is:</p>
<script type="math/tex; mode=display">-\sum_{n=1}^N{\log{\left( \sum_{k=1}^K{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \right)}}</script>
<p>This can be optimized over $\pi_k, \mu_k, \Sigma_k$. Unfortunately, we now have the log of a sum of Gaussians (which are exponentials), which isn’t a very nice formula. We’ll use this as an excuse to talk about another algorithm, the EM algorithm.</p>
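<p>Before turning to EM, here is a sketch of how the negative log-likelihood above could be evaluated numerically. The function names are hypothetical, and the log-sum-exp shift is an addition for numerical stability that is not part of the formula itself.</p>

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X (N x D)."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def gmm_neg_log_likelihood(X, pi, mus, Sigmas):
    """Negative log-likelihood for weights pi (K,), means (K x D), covariances (K x D x D)."""
    # log(pi_k) + log N(x_n | mu_k, Sigma_k), stored as an N x K array
    log_terms = np.stack([np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
                          for k in range(len(pi))], axis=1)
    # log of the sum over k, shifted by the row maximum for numerical stability
    row_max = log_terms.max(axis=1, keepdims=True)
    log_mix = row_max[:, 0] + np.log(np.exp(log_terms - row_max).sum(axis=1))
    return float(-log_mix.sum())

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 2))
I2 = np.eye(2)
# One standard Gaussian, and the same Gaussian split into two equally weighted copies
nll_single = gmm_neg_log_likelihood(X, np.array([1.0]), np.zeros((1, 2)), I2[None])
nll_split = gmm_neg_log_likelihood(X, np.array([0.5, 0.5]), np.zeros((2, 2)),
                                   np.stack([I2, I2]))
print(nll_single, nll_split)
```

<p>The two values coincide, since splitting a component into identical copies whose weights sum to one does not change the mixture density.</p>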
<h3 id="em-algorithm">EM algorithm</h3>
<p>The expectation-maximization (EM) algorithm provides us with a general method to tackle this sort of problem.</p>
<p>Previously, in GMM, we had a set of parameters:</p>
<script type="math/tex; mode=display">\theta^{(t)} = \set{
\set{\mu_k^{(t)}}_{k=1}^K,
\set{\Sigma_k^{(t)}}_{k=1}^K,
\set{\pi_k^{(t)}}_{k=1}^K
}</script>
<p>Recall that we want to maximize:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\log{\left( \sum_{k=1}^K{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \right)}}</script>
<p>In this problem, we’re maximizing the cost function (instead of minimizing it as we’re used to). This is strictly equivalent to minimizing its negative, so we’ll use the two formulations interchangeably.</p>
<p>We’re maximizing over all choices of $\theta$. At every step, we try to go from a set of parameters $\theta^{(t)}$ to a better set of parameters $\theta^{(t+1)}$.</p>
<p>The EM algorithm consists of optimizing for $q_{nk}$ (the soft assignment of point $n$ to cluster $k$) and $\theta$ alternately.</p>
<h4 id="expectation-step">Expectation step</h4>
<p>In the expectation step, we compute how well we’re doing with the current parameters:</p>
<script type="math/tex; mode=display">\cost{\theta^{(t)}} = \sum_{n=1}^N{\log{\left( \sum_{k=1}^K{\pi_k^{(t)} \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})} \right)}}</script>
<p>We can then choose the new $q_{nk}$ values:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} = \frac{
\pi_k^{(t)} \mathcal{N}(x_n \mid \mu_k^{(t)}, \Sigma_k^{(t)})
}{
\sum_{j=1}^K{\pi_j^{(t)} \mathcal{N}(x_n \mid \mu_j^{(t)}, \Sigma_j^{(t)})}
}</script>
<p>This gives us a lower bound on the cost, which holds for any $\theta$ and is tight at $\theta = \theta^{(t)}$:</p>
<script type="math/tex; mode=display">\cost{\theta} \ge \sum_{n=1}^N{\sum_{k=1}^K}{q_{nk}^{(t)} \log{\left(
\frac{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{q_{nk}^{(t)}}
\right)}}</script>
<p>Because the bound is tight at $\theta^{(t)}$, maximizing it over $\theta$ (which is what the M-step does) can only increase the cost or leave it unchanged: we get a monotonically non-decreasing cost over the steps $t$. This is a good guarantee because we’re maximizing the cost: it tells us that every iteration improves on the previous one, or at least does no worse.</p>
<p>This value is actually the expected value, hence the name of the E-step. We’ll see this in the interpretation section below.</p>
<details><summary><p>Derivation</p>
</summary><div class="details-content">
<p>Consider any probability distribution $q_n^{(t)}$. Since it is a probability distribution, we have:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} \ge 0, \quad \sum_{k=1}^K{q_{nk}^{(t)}} = 1</script>
<p>Due to the concavity of the log function, and because the $q_{nk}$ sum up to 1 (being a probability distribution), we can apply <a href="https://en.wikipedia.org/wiki/Jensen%27s_inequality">Jensen’s inequality</a> recursively to the cost function to get:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\log{\left( \sum_{k=1}^K{\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)} \right)}
& = \log{\left(
\sum_{k=1}^K{
q_{nk}^{(t)}
\frac{
\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
}{
q_{nk}^{(t)}
}
} \right)} \\
& \ge \sum_{k=1}^K{
q_{nk}^{(t)}
\log{\frac{
\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
}{
q_{nk}^{(t)}
}}
} \\
\end{align} %]]></script>
<p>We have equality when the ratio inside the log is the same for every $k$, that is when:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} \propto \pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)</script>
<p>Since it is a probability distribution, it must sum up to 1, so we have:</p>
<script type="math/tex; mode=display">q_{nk}^{(t)} = \frac{
\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
}{
\sum_{j=1}^K{\pi_j \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}
}</script>
</div></details>
<h4 id="maximization-step">Maximization step</h4>
<p>We update the $\theta$ parameters as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mu_k^{(t+1)} & := \frac{\sum_n{q_{nk}^{(t)} x_n}}{\sum_n{q_{nk}^{(t)}}} \\
\Sigma_k^{(t+1)} & := \frac{
\sum_n{q_{nk}^{(t)} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T}
}{
\sum_n{q_{nk}^{(t)}}
} \\
\pi_k^{(t+1)} & := \frac{1}{N}\sum_n{q_{nk}^{(t)}}
\end{align} %]]></script>
<details><summary><p>Derivation</p>
</summary><div class="details-content">
<p>We had previously let $q_{nk}$ be an abstract, undefined distribution. We now freeze the $q_n^{(t)}$ assignments, and optimize over $\theta$.</p>
<p>In the E step, we derived a lower bound for the cost function. In general, the lower bound is not equal to the original cost. We can however carefully choose $q_{nk}$ to achieve equality. And since we want to maximize the original cost function, it makes sense to maximize this lower bound. Thus, we’ll work under this locked assignment of $q_{nk}$ (thus achieving equality for the lower bound). Seeing that we have equality, our objective function (which we want to maximize) is:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N \sum_{k=1}^K{
q_{nk}^{(t)}
\log{\frac{
\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)
}{
q_{nk}^{(t)}
}}
}</script>
<p>This leads us to maximizing the expression:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K}{
q_{nk}^{(t)} \left[
\log{\pi_k} - \log{q_{nk}^{(t)}} + \log{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
\right]
}</script>
<p>The $\pi_k$ should sum up to one, so we’re dealing with a constrained optimization problem. We therefore add a Lagrange multiplier term to turn it into an unconstrained problem, and maximize:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K}{
q_{nk}^{(t)} \left[
\log{\pi_k} - \log{q_{nk}^{(t)}} + \log{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}
\right]
} + \lambda \left( \sum_{k=1}^K{\pi_k} - 1 \right)</script>
<p>Differentiating with respect to $\pi_k$, and setting the result to 0 yields:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{q_{nk}^{(t)} \frac{1}{\pi_k}} + \lambda = 0</script>
<p>Solving for $\pi_k$ gives us:</p>
<script type="math/tex; mode=display">\pi_k = -\frac{1}{\lambda} \sum_{n=1}^N{q_{nk}^{(t)}}</script>
<p>We can choose $\lambda$ so that this leads to a proper normalization ($\pi_k$ summing up to 1); this leads us to $\lambda = -N$. Hence, we have:</p>
<script type="math/tex; mode=display">\pi_k^{(t+1)} := \frac{1}{N}\sum_{n=1}^N {q_{nk}^{(t)}}</script>
<p>This is our first update rule. Let’s see how to derive the others. The term $\log{\mathcal{N}(x_n \mid \mu_k, \Sigma_k)}$ has the form:</p>
<script type="math/tex; mode=display">-\frac{D}{2}\log{(2\pi)}
+\frac{1}{2}\log{\abs{\pmb{\Sigma}_k^{-1}}}
-\frac{1}{2}(\pmb{x}_n - \pmb{\mu}_k)^T\pmb{\Sigma}_k^{-1}(\pmb{x}_n - \pmb{\mu}_k)</script>
<p>We used the fact that for an invertible matrix, $\abs{\pmb{\Sigma}_k} = 1/\abs{\pmb{\Sigma}_k^{-1}}$. Differentiating the cost function with respect to $\pmb{\mu}_k$ and setting the result to 0 yields:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N {q_{nk}^{(t)} \pmb{\Sigma}_k^{-1}(\pmb{x}_n - \pmb{\mu}_k)} = 0</script>
<p>We can multiply this by $\pmb{\Sigma}_k$ on the left to get rid of the $\pmb{\Sigma}_k^{-1}$, and solve for $\pmb{\mu}_k$ to get:</p>
<script type="math/tex; mode=display">\pmb{\mu}_k^{(t+1)} := \frac{
\sum_n q_{nk}^{(t)}\pmb{x}_n
}{
\sum_n{q_{nk}^{(t)}}
}</script>
<p>Finally, for the $\Sigma$ update rule, we take the derivative with respect to $\pmb{\Sigma}_k^{-1}$ and set the result to 0, yielding:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{q_{nk}^{(t)} \frac{1}{2} \pmb{\Sigma}^T_k}
- \frac{1}{2}\sum_{n=1}^N{q_{nk}^{(t)}(\pmb{x}_n - \pmb{\mu}_k)(\pmb{x}_n - \pmb{\mu}_k)^T}
= 0</script>
<p>Solving for $\Sigma$ yields:</p>
<script type="math/tex; mode=display">\Sigma_k^{(t+1)} := \frac{
\sum_n{q_{nk}^{(t)} (x_n - \mu_k^{(t+1)}) (x_n - \mu_k^{(t+1)})^T}
}{
\sum_n{q_{nk}^{(t)}}
}</script>
<p>We’re using the following fact, which I won’t go into details to prove:</p>
<script type="math/tex; mode=display">\frac{\partial}{\partial \pmb{A}} \log{\abs{\pmb{A}}} = \pmb{A}^{-T}</script>
</div></details>
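<p>Putting the E-step and the M-step together, the loop can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the initialization (uniform weights, means picked among the data points, identity covariances), the fixed number of iterations and the small <code>reg</code> term added to each covariance are choices made for the example, not part of the update rules above.</p>

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log density of N(mu, Sigma) evaluated at each row of X (N x D)."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.sum(diff * np.linalg.solve(Sigma, diff.T).T, axis=1)
    return -0.5 * (D * np.log(2.0 * np.pi) + logdet + maha)

def e_step(X, pi, mus, Sigmas):
    """Return the assignments q (N x K) and the log-likelihood of the current parameters."""
    log_terms = np.stack([np.log(pi[k]) + log_gaussian(X, mus[k], Sigmas[k])
                          for k in range(len(pi))], axis=1)
    row_max = log_terms.max(axis=1, keepdims=True)
    log_mix = row_max + np.log(np.exp(log_terms - row_max).sum(axis=1, keepdims=True))
    return np.exp(log_terms - log_mix), float(log_mix.sum())

def m_step(X, q, reg=1e-6):
    """Closed-form updates of pi, mu and Sigma for fixed assignments q."""
    N, D = X.shape
    Nk = q.sum(axis=0)                   # effective number of points per component
    pi = Nk / N
    mus = (q.T @ X) / Nk[:, None]
    Sigmas = []
    for k in range(q.shape[1]):
        diff = X - mus[k]
        # reg * I is an added safeguard against singular covariances (an assumption)
        Sigmas.append((q[:, k, None] * diff).T @ diff / Nk[k] + reg * np.eye(D))
    return pi, mus, np.array(Sigmas)

def fit_gmm(X, K, n_iters=50, seed=0):
    rng = np.random.default_rng(seed)
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mus = X[rng.choice(N, size=K, replace=False)].copy()
    Sigmas = np.array([np.eye(D) for _ in range(K)])
    log_likelihoods = []
    for _ in range(n_iters):
        q, ll = e_step(X, pi, mus, Sigmas)
        log_likelihoods.append(ll)
        pi, mus, Sigmas = m_step(X, q)
    return pi, mus, Sigmas, log_likelihoods

# Toy data: two blobs in the plane
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
               rng.normal(5.0, 0.5, size=(50, 2))])
pi, mus, Sigmas, log_likelihoods = fit_gmm(X, K=2)
print(pi, mus)
```

<p>Plotting <code>log_likelihoods</code> against the iteration number is a simple sanity check: up to the tiny perturbation introduced by <code>reg</code>, the sequence should never decrease.</p>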
<h4 id="interpretation">Interpretation</h4>
<p>The cost that we computed is indeed an expectation based on the prior $\pi_k$. Let’s now take a look at the posterior distribution.</p>
<script type="math/tex; mode=display">p(z = k \mid x, \theta)
= \frac{p(z=k, x \mid \theta)}{p(x\mid\theta)}
= \frac{
\pi_k \mathcal{N}(x \mid \mu_k, \Sigma_k)
}{
\sum_j{\pi_j \mathcal{N}(x \mid \mu_j, \Sigma_j)}
}</script>
<p>This may look suspiciously familiar; in fact, it is the $q_{nk}$. The distribution that we previously just introduced as an abstract, unknown distribution is in fact the posterior. If we have these posteriors, then it seems natural to optimize the expected value of the joint log-likelihood under the posterior. This is exactly the quantity that the E-step sets up, as the expected value is then given by:</p>
<script type="math/tex; mode=display">\sum_{n=1}^N{\sum_{k=1}^K{q_{nk} \log{\left(\pi_k \mathcal{N}(x_n \mid \mu_k, \Sigma_k)\right)}}}</script>
<p>We can think of the whole EM algorithm as:</p>
<script type="math/tex; mode=display">\theta^{(t+1)} = \argmax_{\theta}{\expectsub{p(z\mid x, \theta^{(t)})}{\log{p(z, x\mid\theta)}}}</script>
<h2 id="matrix-factorization">Matrix Factorization</h2>
<p>Matrix factorization is a form of unsupervised learning. A well-known example in which matrix factorization was used is the Netflix prize. The goal was to predict ratings of users for movies, given a very sparse matrix of ratings. We’ll study the method that achieved the best error.</p>
<p>Let’s describe the data a little more formally. Given movies $d = 1, 2, \dots, D$ and users $n = 1, 2, \dots, N$, we define $X$ as the $D\times N$ matrix containing all rating entries; that is, $x_{dn}$ is the rating of the n<sup>th</sup> user for the d<sup>th</sup> movie. We don’t have any additional information on the users or on the movies, apart from the ID that’s been assigned to them. In practice, the matrix had $D = 20\,000$ movies and $N = 500\,000$ users, and 99.98% of its entries were unobserved.</p>
<p>We want to give a prediction for all the unobserved entries, so that we can give the top entries (say, top 10 movies) for each user.</p>
<h3 id="prediction-using-a-matrix-factorization">Prediction using a matrix factorization</h3>
<p>We will aim to find $\pmb{W}$ and $\pmb{Z}$ such that:</p>
<script type="math/tex; mode=display">\pmb{X} \approx \pmb{W}\pmb{Z}^T</script>
<p>The hope is to “explain” each rating $x_{dn}$ by a numerical representation of the corresponding movie and user.</p>
<p>Here, $\pmb{Z}\in\mathbb{R}^{N\times K}$, so $\pmb{Z}^T$ is a “flat matrix”, and $\pmb{W}\in\mathbb{R}^{D\times K}$ is a “tall matrix”. The number $K$ is something that we must choose. In practice, compared to the size of $N$ or $D$, $K$ will be relatively small (maybe 50 or so).</p>
<p>We’ll assign a cost function that we’re trying to optimize:</p>
<script type="math/tex; mode=display">\min_{\pmb{W}, \pmb{Z}} \cost{\pmb{W}, \pmb{Z}} := \frac{1}{2} \sum_{(d, n)\in\Omega}{\left[
x_{dn} - (\pmb{WZ}^T)_{dn}
\right]^2}</script>
<p>Here, $\Omega\subseteq [D]\times[N]$ is given. It collects the indices of the observed ratings of the input matrix $\pmb{X}$. Our cost function here compares the number of stars $x_{dn}$ a user assigned to a movie, to the prediction of our model $(\pmb{WZ}^T)_{dn}$, using the squared error.</p>
<p>To optimize this cost function, we need to know whether it is jointly <em>convex</em> with respect to $\pmb{W}$ and $\pmb{Z}$, and whether it is <em>identifiable</em> (there is a unique minimum).</p>
<p>We won’t go into the full proof, but the answer is that the minimum is not unique. Since $\pmb{WZ}^T$ is a product, we could just divide one factor by 10 and multiply the other by 10 to get a different solution with the same cost.</p>
<p>And in fact, it’s not even convex. If we think of $W$ and $Z$ as numbers (or as $1\times 1$ matrices), we can get the intuition for why this is. We could compute the Hessian of the product $w\cdot z$, which has eigenvalues $+1$ and $-1$ and is therefore not positive semi-definite:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{bmatrix}
0 & 1 \\
1 & 0
\end{bmatrix} %]]></script>
<p>More simply though, we can just apply Jensen’s inequality. The function $w\cdot z$ <a href="https://www.wolframalpha.com/input/?i=xy">looks like a saddle function</a>, and just isn’t convex.</p>
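<p>Both arguments can be checked numerically on the scalar product $w\cdot z$. The sketch below computes the eigenvalues of the Hessian and exhibits two points for which the convexity inequality fails at the midpoint; the two points are arbitrary choices made for the example.</p>

```python
import numpy as np

def f(w, z):
    """Scalar analogue of the factorization: the product w * z."""
    return w * z

# Hessian of f with respect to (w, z); it is the same at every point
hessian = np.array([[0.0, 1.0], [1.0, 0.0]])
eigenvalues = np.linalg.eigvalsh(hessian)  # returned in ascending order

# Convexity would require f(midpoint) to be at most the average of the endpoint values
a, b = (1.0, -1.0), (-1.0, 1.0)
midpoint_value = f(0.5 * (a[0] + b[0]), 0.5 * (a[1] + b[1]))
average_value = 0.5 * (f(*a) + f(*b))

print(eigenvalues, midpoint_value, average_value)
```

<p>The Hessian has one negative eigenvalue, and the value at the midpoint (0) is larger than the average of the endpoint values (-1), so the function cannot be convex.</p>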
<h3 id="choosing-k">Choosing K</h3>
<p>$K$ is the number of <em>latent features</em>. This is comparable to the K we chose in K-means, defining the number of clusters. Large values of K facilitate overfitting.</p>
<h3 id="regularization-1">Regularization</h3>
<p>We can add a regularizer and minimize the following cost:</p>
<script type="math/tex; mode=display">\frac{1}{2} \sum_{(d, n)\in\Omega}{\left[
x_{dn} - (\pmb{WZ}^T)_{dn}
\right]^2}
+ \frac{\lambda_w}{2}\norm{\pmb{W}}^2_{\text{Frob}}
+ \frac{\lambda_z}{2}\norm{\pmb{Z}}^2_{\text{Frob}}</script>
<p>With scalars $\lambda_w, \lambda_z > 0$.</p>
<h3 id="stochastic-gradient-descent">Stochastic gradient descent</h3>
<p>With our cost functions in place, we can look at our standard algorithm for minimization. We’ll define loss as a sum of many individual loss functions:</p>
<script type="math/tex; mode=display">\sum_{(d, n)\in\Omega}{f_{dn}}
= \sum_{(d, n)\in\Omega}{\frac{1}{2}\left[
x_{dn} - (\pmb{WZ}^T)_{dn}
\right]^2}</script>
<p>Let’s derive the stochastic gradient for an individual loss function (which is what we need to compute when doing SGD). When taking the gradient with respect to a matrix, we expect to get a matrix of the same shape. Let’s just look at the dimensions of what we’re computing:</p>
<script type="math/tex; mode=display">\nabla_W f_{dn} \in \mathbb{R}^{D\times K} \\
\nabla_Z f_{dn} \in \mathbb{R}^{N\times K}</script>
<p>For a fixed pair $(d, n)$, we will compute a single entry $(d’, k)$ in $\pmb{W}$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\partial}{\partial w_{d', k}} f_{d, n}(\pmb{W}, \pmb{Z})
= \begin{cases}
- \left[x_{dn} - (\pmb{WZ}^T)_{dn} \right] z_{n, k} & \text{if } d' = d \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>The same goes for the derivative by $\pmb{Z}$. We’ll compute a single entry $(n’, k)$ in the $\pmb{Z}$ derivative:</p>
<script type="math/tex; mode=display">% <![CDATA[
\frac{\partial}{\partial z_{n', k}} f_{d, n}(\pmb{W}, \pmb{Z})
= \begin{cases}
- \left[x_{dn} - (\pmb{WZ}^T)_{dn} \right] w_{d, k} & \text{if } n' = n \\
0 & \text{otherwise}
\end{cases} %]]></script>
<p>It turns out that computing this is very cheap; $\mathcal{O}(K)$. This is the greatest advantage of using SGD for this. There are no guarantees that this works though; this is still an open research question. But in practice, it works really well.</p>
<p>The update step is then:</p>
<script type="math/tex; mode=display">W^{(t+1)} = W^{(t)} - \gamma \nabla_W f_{dn} \\
Z^{(t+1)} = Z^{(t)} - \gamma \nabla_Z f_{dn} \\</script>
<p>We only update the d<sup>th</sup> row of W, and the n<sup>th</sup> row of Z.</p>
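<p>As a sketch of what this looks like in code, the function below performs one SGD update for a single observed entry, touching only one row of each factor. The synthetic low-rank data, the step size, the number of epochs and all function names are assumptions made for the example, and no regularizer is included.</p>

```python
import numpy as np

def sgd_step(W, Z, d, n, x_dn, gamma):
    """One SGD update for the observed entry (d, n); modifies W and Z in place."""
    error = x_dn - W[d] @ Z[n]     # x_dn - (W Z^T)_dn, computed in O(K)
    grad_w = -error * Z[n]         # row d of the gradient with respect to W
    grad_z = -error * W[d]         # row n of the gradient with respect to Z
    W[d] -= gamma * grad_w
    Z[n] -= gamma * grad_z

def observed_cost(W, Z, entries):
    """Half the sum of squared errors over the observed entries."""
    return 0.5 * sum((x - W[d] @ Z[n]) ** 2 for d, n, x in entries)

rng = np.random.default_rng(0)
D, N, K = 8, 10, 2
X_true = rng.normal(size=(D, K)) @ rng.normal(size=(N, K)).T  # synthetic low-rank ratings
# Observe roughly half of the entries; this plays the role of Omega
entries = [(d, n, X_true[d, n]) for d in range(D) for n in range(N)
           if rng.random() > 0.5]

W = 0.1 * rng.normal(size=(D, K))
Z = 0.1 * rng.normal(size=(N, K))
initial_cost = observed_cost(W, Z, entries)
for epoch in range(200):
    for i in rng.permutation(len(entries)):
        d, n, x = entries[i]
        sgd_step(W, Z, d, n, x, gamma=0.02)
final_cost = observed_cost(W, Z, entries)
print(initial_cost, final_cost)
```

<p>Both gradients are computed before either factor is updated, so the update of $Z$ uses the old row of $W$, matching the update step above.</p>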
<h3 id="alternating-least-squares-als">Alternating least squares (ALS)</h3>
<p>The alternating minimization algorithm alternates between optimizing $\pmb{Z}$ and $\pmb{W}$. ALS is a special case of this, with squared error.</p>
<h4 id="no-missing-entries">No missing entries</h4>
<p>For simplicity, let’s just assume that there are no missing entries in the data matrix, that is $\Omega = [D]\times[N]$ (instead of $\subseteq$). This makes our life a little easier, and we’ll be able to find a closed form solution (indeed, if $\Omega$ is the whole set, the problem is pretty easy to solve; if it’s an arbitrary subset, it becomes an NP-hard problem). Our cost is then:</p>
<script type="math/tex; mode=display">\frac{1}{2}\sum_{d=1}^D\sum_{n=1}^N \left[
x_{dn} - (\pmb{WZ}^T)_{dn}
\right]^2 + \frac{\lambda_W}{2} \frobnorm{\pmb{W}}^2 + \frac{\lambda_Z}{2} \frobnorm{\pmb{Z}}^2 \\
= \frac{1}{2}\frobnorm{\pmb{X} - \pmb{WZ}^T}^2 + \frac{\lambda_W}{2} \frobnorm{\pmb{W}}^2 + \frac{\lambda_Z}{2} \frobnorm{\pmb{Z}}^2</script>
<p>ALS then does a <strong>coordinate descent</strong> to minimize the cost (plus a regularizer). First, we fix $\pmb{W}$ and compute the minimum with respect to $\pmb{Z}$ (we ignore the other regularizer, as minimization is the same with or without an added constant):</p>
<script type="math/tex; mode=display">\min_{\pmb{Z}}{\frac{1}{2}\frobnorm{\pmb{X} - \pmb{WZ}^T}^2} + \frac{\lambda_Z}{2} \frobnorm{\pmb{Z}}^2</script>
<p>Then, we alternate, minimizing over $\pmb{W}$ and fixing $\pmb{Z}$:</p>
<script type="math/tex; mode=display">\min_{\pmb{W}}{\frac{1}{2}\frobnorm{\pmb{X} - \pmb{WZ}^T}^2} + \frac{\lambda_W}{2} \frobnorm{\pmb{W}}^2</script>
<p>The update rule is thus given by:</p>
<script type="math/tex; mode=display">(\pmb{Z}^*)^T := (\pmb{W}^T \pmb{W} + \lambda_Z \pmb{I}_K)^{-1} \pmb{W}^T \pmb{X} \\
(\pmb{W}^*)^T := (\pmb{Z}^T \pmb{Z} + \lambda_W \pmb{I}_K)^{-1} \pmb{Z}^T \pmb{X}^T \\</script>
<p>Note that the regularization helps us make sure that the matrix we invert is indeed invertible (since we’re adding a positive multiple of the identity matrix). Note that we can find a closed form solution if we don’t have any missing entries. We do this by taking the derivative, and setting it to zero. The cost of finding the solution is then per column, $\mathcal{O}(N)$ and $\mathcal{O}(D)$, which is not quite as good as the $\mathcal{O}(K)$ with SGD. Still, it’s not too bad: we’re only inverting a $K\times K$ matrix, which is much nicer than dealing with D or N. Also note that there is no step size to tune, which makes it easier to deal with (although this approach is slower).</p>
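<p>For the fully observed case, the two closed-form updates can be sketched as follows. The sketch assumes that each regularizer is weighted by $\lambda/2$, as in the regularized cost of the Regularization section, so that the updates are exact minimizers of the tracked cost. The function names, the random initialization, the values of $\lambda$ and the synthetic data are choices made for the example.</p>

```python
import numpy as np

def cost(X, W, Z, lambda_w, lambda_z):
    """Squared error plus the two Frobenius regularizers, each weighted by lambda / 2."""
    return (0.5 * np.sum((X - W @ Z.T) ** 2)
            + 0.5 * lambda_w * np.sum(W ** 2)
            + 0.5 * lambda_z * np.sum(Z ** 2))

def als(X, K, lambda_w=0.1, lambda_z=0.1, n_iters=20, seed=0):
    """Alternating least squares on a fully observed D x N matrix X."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    W = rng.normal(size=(D, K))
    Z = rng.normal(size=(N, K))
    I = np.eye(K)
    costs = []
    for _ in range(n_iters):
        # Fix W and solve (W^T W + lambda_z I) Z^T = W^T X for Z^T
        Z = np.linalg.solve(W.T @ W + lambda_z * I, W.T @ X).T
        # Fix Z and solve (Z^T Z + lambda_w I) W^T = Z^T X^T for W^T
        W = np.linalg.solve(Z.T @ Z + lambda_w * I, Z.T @ X.T).T
        costs.append(float(cost(X, W, Z, lambda_w, lambda_z)))
    return W, Z, costs

# Synthetic data: a rank-2 matrix plus a little noise
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 2)) @ rng.normal(size=(9, 2)).T + 0.01 * rng.normal(size=(6, 9))
W, Z, costs = als(X, K=2)
print(costs[0], costs[-1])
```

<p>Using <code>np.linalg.solve</code> rather than forming the inverse explicitly is a common numerical choice; only a $K\times K$ system is solved at each update, and no step size appears anywhere.</p>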
<h4 id="missing-entries">Missing entries</h4>
<p>As before, we can derive the ALS updates for the more general setting, where we only have certain ratings $(d, n)\in\Omega$. The idea is to compute the gradient with respect to each group of variables, and set it to zero.</p>
<h3 id="text-representation-learning">Text representation learning</h3>
<h4 id="motivation-1">Motivation</h4>
<p>We can’t plug string-encoded words directly into our learning models. Can we find a meaningful numerical representation for all of our data? We’d like to find a mapping, or <strong>embedding</strong>, for each word $w_i$:</p>
<script type="math/tex; mode=display">w_i \mapsto \pmb{w}_i \in \mathbb{R}^K</script>
<p>The naive, first approach would be to pick $K$ to be the size of the vocabulary; we can then encode words as one-hot vectors. This works nicely, but has high dimensionality, and cannot capture the order of the words (which is why it’s also called the “bag of words” approach).</p>
<p>But we can do this in a smarter way. The idea is to pick a much lower $K$, and try to group semantically similar words in this K-dimensional space.</p>
<h4 id="co-occurrence-matrix">Co-occurrence matrix</h4>
<p>To attempt to get the meaning of words, we can construct co-occurrence counts from a big corpus of text. This is a matrix in which $n_{ij}$ is the number of contexts where word $w_i$ occurs together with word $w_j$. A context is a window of words occurring together (it could be a document, paragraph, sentence, or a window of $n$ words).</p>
<p>For a vocabulary $\nu = \set{w_1, \dots, w_D}$ and context words $w_n$, $n = 1, 2, \dots, N$, the co-occurrence matrix is a very sparse $D\times N$ matrix.</p>
<h4 id="learning-word-representations">Learning word representations</h4>
<p>To construct a word embedding, we want to find a factorization of the co-occurrence matrix. This typically uses the log of the actual counts, i.e. $x_{dn} := \log{(n_{dn})}$. We’ll find a factorization s.t:</p>
<script type="math/tex; mode=display">\pmb{X} \approx \pmb{WZ}^T</script>
<p>For each pair of words $(w_d, w_n)$, we’ll try to explain their co-occurrence count by a numerical representation of the two words; $\pmb{W}$ is the representation of a word, while $\pmb{Z}$ is the representation of a context word.</p>
<p>The GloVe embedding (an alternative to word2vec) uses a little trick to give importance to each entry. It computes a weight $f_{dn}$ according to the following function:</p>
<script type="math/tex; mode=display">f_{dn} = \min\set{1, \left(\frac{n_{dn}}{ n_{\text{max}} }\right)^\alpha},
\quad \alpha\in[0, 1]</script>
<p>We can also choose $f_{dn} := 1$ if we don’t want to weight the entries, but GloVe achieves good results with the weighting above. For $K$, we can just choose a value, say 50, 100 or 200. Trial and error will serve us well. We can train the factorization with SGD or ALS.</p>
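<p>The weighting function is straightforward to compute elementwise. In the sketch below, the function name and the default values of <code>n_max</code> and <code>alpha</code> are illustrative assumptions; the text above only requires $\alpha\in[0, 1]$.</p>

```python
import numpy as np

def glove_weights(counts, n_max=100.0, alpha=0.75):
    """f_dn = min(1, (n_dn / n_max) ** alpha), applied elementwise to a count matrix."""
    return np.minimum(1.0, (counts / n_max) ** alpha)

# A tiny co-occurrence count matrix, made up for the example
counts = np.array([[0.0, 1.0, 10.0],
                   [100.0, 250.0, 50.0]])
weights = glove_weights(counts)
print(weights)
```

<p>Rare pairs get a small weight, pairs that never co-occur get weight zero, and every pair with a count of at least <code>n_max</code> is capped at weight one.</p>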
<h4 id="skip-gram-model">Skip-gram model</h4>
<p>This model uses binary classification to separate real word pairs $(w_d, w_n)$, appearing together in a context window, from fake word pairs, sampled randomly.</p>
<h4 id="fasttext">FastText</h4>
<p>This is another matrix factorization approach, used to learn to classify documents. It combines bag of words and.</p>
<p>A sentence can be represented as $\pmb{x}_n \in\mathbb{R}^{\abs{\nu}}$, a vector that is 1 in the position of its words. We try to optimize over the following cost function.</p>
<script type="math/tex; mode=display">\min_{\pmb{W}, \pmb{Z}} \cost{\pmb{W}, \pmb{Z}} := \sum f(y_n \pmb{WZ}^T\pmb{x}_n)</script>
<p>Where:</p>
<ul>
<li>$\pmb{W}\in\mathbb{R}^{1\times K}$ and $\pmb{Z}\in\mathbb{R}^{\abs{\nu}\times K}$ are the two factors of the factorization</li>
<li>$\pmb{x}_n\in\mathbb{R}^{\abs{\nu}}$ is the bag-of-words representation of the n<sup>th</sup> training sentence</li>
<li>$f$ is a linear classifier loss function</li>
<li>$y_n\in\set{\pm 1}$ is the classification label for sentence $\pmb{x}_n$</li>
</ul>
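<p>To make the shapes concrete, the cost can be evaluated as in the following sketch. The logistic loss is an assumption here (the notes only say that $f$ is a linear classifier loss), and the names <code class="highlighter-rouge">fasttext_cost</code> and <code class="highlighter-rouge">logistic_loss</code>, as well as the random toy data, are made up for the example.</p>

```python
import numpy as np


def logistic_loss(margin):
    """f(m) = log(1 + exp(-m)), computed in a numerically stable way."""
    return np.logaddexp(0.0, -margin)


def fasttext_cost(W, Z, X, y):
    """Sum over sentences of f(y_n * W Z^T x_n).

    W: (1, K), Z: (V, K), X: (V, N) with one sentence per column,
    y: (N,) with entries in {-1, +1}.
    """
    scores = (W @ Z.T @ X).ravel()  # one real-valued score per sentence
    return float(np.sum(logistic_loss(y * scores)))


V, K, N = 6, 2, 4  # vocabulary size, embedding dimension, number of sentences
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(V, N)).astype(float)  # toy bag-of-words columns
y = np.array([1.0, -1.0, 1.0, -1.0])
W = rng.normal(size=(1, K))
Z = rng.normal(size=(V, K))
cost = fasttext_cost(W, Z, X, y)
```

<p>With all-zero factors every score is 0, so the cost is $N \log 2$, which is a handy sanity check before training.</p>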
<h2 id="svd-and-pca">SVD and PCA</h2>
<h3 id="motivation-2">Motivation</h3>
<p><strong>Principal Component Analysis</strong> (PCA) is a popular dimensionality reduction method. There are two properties that we get from PCA.</p>
<ul>
<li>It can be used to <em>compress data</em> from $D$ dimensions to $K$ dimensions, with $K \le D$. For machine learning, it’s often best not to compress data in this manner, but it can be necessary in certain situations (for reasons of interpretability for example)</li>
<li>It gives us a <em>linear transformation</em> that decorrelates the data; it finds $K$ new features from the initial $D$</li>
</ul>
<p>The PCA will be computed from the data matrix $\pmb{X}$ using <strong>singular value decomposition</strong> (SVD).</p>
<h3 id="svd">SVD</h3>
<p>The SVD decomposition of a $D \times N$ matrix $\pmb{X}$ is:</p>
<script type="math/tex; mode=display">\pmb{X} = \pmb{USV}^T</script>
<p>In the following, we assume $D < N$, but the decomposition still exists when that is not the case (we could just take the transpose).</p>
<ul>
<li>$\pmb{U}$ is a $D \times D$ orthonormal<sup id="fnref:orthonormal"><a href="#fn:orthonormal" class="footnote">1</a></sup> matrix</li>
<li>$\pmb{V}$ is a $N \times N$ orthonormal<sup id="fnref:orthonormal:1"><a href="#fn:orthonormal" class="footnote">1</a></sup> matrix</li>
<li>$\pmb{S}$ is a $D\times N$ diagonal matrix (with $D$ diagonal entries)</li>
</ul>
<p>One useful property of orthonormal matrices like $\pmb{U}$ and $\pmb{V}$ (the real-valued case of unitary matrices) is that they preserve norms (they don’t change the length of the vectors being transformed), meaning that we can think of them as rotations, possibly combined with a reflection. A small proof of this follows:</p>
<script type="math/tex; mode=display">\frobnorm{\pmb{Ux}}^2 = \pmb{x}^T\pmb{U}^T\pmb{Ux} = \pmb{x}^T\pmb{I}\pmb{x} = \frobnorm{\pmb{x}}^2</script>
<p>The diagonal entries in $\pmb{S}$ are the <em>singular values</em> in descending order: $s_1 \ge s_2 \ge \dots \ge s_D \ge 0$. The columns of $\pmb{U}$ and $\pmb{V}$ are the <em>left</em> and <em>right singular vectors</em>.</p>
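<p>These properties are easy to check numerically. The sketch below uses NumPy’s <code class="highlighter-rouge">np.linalg.svd</code> on a small random matrix; the sizes and the random seed are arbitrary choices for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 5
X = rng.normal(size=(D, N))

# full_matrices=True returns U (D x D), the D singular values, and V^T (N x N).
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the D x N "diagonal" matrix S from the vector of singular values.
S = np.zeros((D, N))
S[:D, :D] = np.diag(s)

# An arbitrary vector, used to check that U preserves norms.
x = rng.normal(size=D)
norm_before = np.linalg.norm(x)
norm_after = np.linalg.norm(U @ x)
```

<p>Here <code class="highlighter-rouge">U @ S @ Vt</code> recovers <code class="highlighter-rouge">X</code> up to floating-point error, <code class="highlighter-rouge">s</code> is sorted in descending order, and the two norms agree.</p>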
<h4 id="dimensionality-reduction">Dimensionality reduction</h4>
<p>Suppose we want to compress a $D\times N$ data matrix $\pmb{X}$ to a $K\times N$ matrix $\tilde{\pmb{X}}$, where $1 \le K \le D$. We’ll define this transformation from $\pmb{X}$ to $\tilde{\pmb{X}}$ by the $K\times D$ compression matrix $\pmb{C}$, so that $\tilde{\pmb{X}} = \pmb{CX}$. The decompression (or reconstruction) from $\tilde{\pmb{X}}$ back to an approximation of $\pmb{X}$ is given by the $D\times K$ matrix $\pmb{R}$.</p>
<p>Can we find good matrices? Our criterion is that the error introduced when compressing and reconstructing should be as small as possible over all choices of compression and reconstruction matrices:</p>
<script type="math/tex; mode=display">\frobnorm{\pmb{X} - \pmb{R}\pmb{C}\pmb{X}}^2</script>
<p>There are other ways of measuring the quality of a compression and reconstruction, but for the sake of simplicity, we’ll stick to this one.</p>
<p>We can actually place a bound on the reconstruction error using the following lemma.</p>
<hr />
<p><strong>Lemma</strong>: For any $D \times N$ matrix $\pmb{X}$ and any $D\times N$ rank-K matrix $\hat{\pmb{X}}$:</p>
<script type="math/tex; mode=display">\frobnorm{\pmb{X} - \hat{\pmb{X}}}^2 \ge \frobnorm{\pmb{X} - \pmb{U}_K \pmb{U}_K^T \pmb{X}}^2 = \sum_{i \ge K+1}{s_i^2}</script>
<p>Where $\pmb{X} = \pmb{U}\pmb{S}\pmb{V}^T$ is the SVD of $\pmb{X}$, $s_i$ are the singular values of $\pmb{X}$, and $\pmb{U}_K$ is the $D\times K$ matrix of the first $K$ columns of $\pmb{U}$.</p>
<p>This lemma tells us that if we use $\pmb{C} = \pmb{U}_K^T$ as our compression matrix, and $\pmb{R} = \pmb{U}_K$ as the reconstruction matrix, then we achieve an error that is at least as small as that of any other rank-K reconstruction $\hat{\pmb{X}}$.</p>
<p>Note that the reconstruction error is the sum of the squared singular values after the cut-off $K$; intuitively, we can think of the error as coming from the singular values we ignored.</p>
<hr />
<p>Using this, we’ve managed to compress down to $K$ dimensions, and we know that no other rank-K reconstruction achieves a smaller reconstruction error.</p>
<p>This fact really defines PCA. This lemma tells us that the most important information about $\pmb{X}$ is contained in the first K left singular vectors, i.e. the first K columns of $\pmb{U}$.</p>
<p>The term $\pmb{U}_K \pmb{U}_K^T \pmb{X}$ has another simple interpretation. Let $S^{(K)}$ be the $D\times N$ diagonal matrix corresponding to a truncated version of $S$. It is of the same size, but only has the K first diagonal values of S, and is zero everywhere else. We claim that:</p>
<script type="math/tex; mode=display">\pmb{U}_K \pmb{U}_K^T \pmb{X} = \pmb{U}_K \pmb{U}_K^T \pmb{USV}^T = \pmb{US}^{(K)}\pmb{V}^T</script>
<blockquote>
<p>👉🏼 It’s okay to drop the K subscript on the U matrix because $\pmb{S}^{(K)}$ already takes care of selecting the first K columns of $\pmb{U}$ (all the other columns are multiplied by zero)</p>
</blockquote>
<p>This tells us that the <em>best</em> rank K approximation of a matrix is obtained by computing its SVD, and truncating it at K (i.e., setting all the singular values at position $j \ge K + 1$ to zero).</p>
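<p>The lemma and the truncation claim can both be checked on a small example. This is a sketch on random data with arbitrary sizes; the variable names mirror the notation above ($\pmb{C}$, $\pmb{R}$, $\pmb{U}_K$), and the random competitor <code class="highlighter-rouge">A @ B</code> is only there to illustrate the inequality.</p>

```python
import numpy as np

rng = np.random.default_rng(1)
D, N, K = 4, 6, 2
X = rng.normal(size=(D, N))

# With full_matrices=False and D smaller than N: U is D x D, Vt is D x N.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_K = U[:, :K]            # first K columns of U

C = U_K.T                 # K x D compression matrix
R = U_K                   # D x K reconstruction matrix
X_tilde = C @ X           # K x N compressed data
X_hat = R @ X_tilde       # D x N rank-K reconstruction

error = np.linalg.norm(X - X_hat, "fro") ** 2
tail = np.sum(s[K:] ** 2)  # sum of the squared discarded singular values

# The same reconstruction, obtained by truncating the SVD at K.
X_trunc = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]

# Some other rank-K matrix, for comparison.
A = rng.normal(size=(D, K))
B = rng.normal(size=(K, N))
other_error = np.linalg.norm(X - A @ B, "fro") ** 2
```

<p>In this run <code class="highlighter-rouge">error</code> matches <code class="highlighter-rouge">tail</code>, <code class="highlighter-rouge">X_hat</code> matches <code class="highlighter-rouge">X_trunc</code>, and the random competitor does no better, as the lemma predicts.</p>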
<h4 id="svd-and-matrix-factorization">SVD and matrix factorization</h4>
<p>Expressing $\pmb{X}$ as an SVD decomposition allows us to easily get a matrix factorization.</p>
<script type="math/tex; mode=display">\pmb{X}
= \pmb{USV}^T
= \underbrace{\pmb{U}}_{\pmb{W}} \underbrace{\pmb{SV}^T}_{\pmb{Z}^T}
= \pmb{W}\pmb{Z}^T</script>
<p>This is clearly a special case of the matrix factorization as we saw it previously. One of the nice things about this approach is that we don’t need to pre-select the rank K from the start: we can control it at any time later, and let it range from 1 to $\min(D, N)$.</p>
<p>As we’ve discussed, truncating at K gives the <em>best</em> rank K approximation that we can find, as the squared Frobenius norm of the difference (between the approximation and the true value) is the smallest possible: the sum of the squares of the discarded singular values.</p>
<p>We can also just pre-choose our value K and compute the matrix factorization that defines our dimensionality reduction:</p>
<script type="math/tex; mode=display">\pmb{X}_K
= \pmb{U} \pmb{S}^{(K)} \pmb{V}^T
= \underbrace{\pmb{U}_K}_{\pmb{W}}
\underbrace{\pmb{U}_K^T\pmb{X}}_{\pmb{Z}^T}
= \pmb{W}\pmb{Z}^T</script>
<p>Note however that the SVD approach doesn’t allow us to deal with incomplete data matrices.</p>
<h3 id="pca-and-decorrelation">PCA and decorrelation</h3>
<p>Assume that we have N D-dimensional points in a $D\times N$ matrix $\pmb{X}$. We can compute the empirical mean and co-variance by:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\bar{\pmb{x}} & = \frac{1}{N} \sum_{n=1}^N \pmb{x}_n \\
\pmb{K} & = \frac{1}{N} \sum_{n=1}^N (\pmb{x}_n - \bar{\pmb{x}}) (\pmb{x}_n - \bar{\pmb{x}})^T
\end{align} %]]></script>
<p>The covariance matrix $\pmb{K}$ is a $D \times D$ matrix, built as a sum of $N$ rank-1 matrices. If our data is from i.i.d. samples then these empirical values will converge to the true values when $N \rightarrow \infty$.</p>
<p>Before we do PCA, we need to center the data around the mean. Let’s assume our data matrix $\pmb{X}$ has been pre-processed as such. Using the SVD, we can rewrite the empirical covariance matrix as:</p>
<script type="math/tex; mode=display">N\pmb{K}
= \sum_{n=1}^N {(\pmb{x}_n \pmb{x}_n^T)}
= \pmb{X}\pmb{X}^T
= \pmb{U}\pmb{S}\pmb{V}^T \pmb{V}\pmb{S}^T \pmb{U}^T
= \pmb{U}\pmb{S}\pmb{S}^T \pmb{U}^T
= \pmb{U}\pmb{S}_D^2 \pmb{U}^T</script>
<p>This works because $\pmb{V}$ is an orthogonal matrix, so $\pmb{V}^T\pmb{V} = \pmb{I}_N$, and $\pmb{S}$ is diagonal, so $\pmb{SS}^T = \pmb{S}_D^2$, where $\pmb{S}_D^2$ is the $D\times D$ diagonal matrix whose entries are the squared singular values $s_1^2, \dots, s_D^2$ (the square of the first D columns of $\pmb{S}$).</p>
<p>Let’s try this again, but this time considering the transformed data $\tilde{\pmb{X}} = \pmb{U}^T\pmb{X}$. Remember that PCA finds orthogonal axes that represent the most variance (with the origin at the mean). Starting with orthogonal axes, it finds the rotation $\pmb{U}^T$ so that the axes point in the direction of maximum variance. The empirical covariance of the transformed data is:</p>
<script type="math/tex; mode=display">N \tilde{\pmb{K}} = \tilde{\pmb{X}} \tilde{\pmb{X}}^T = \pmb{U}^T\pmb{X}\pmb{X}^T\pmb{U} = \pmb{U}^T\pmb{US}_D^2\pmb{U}^T\pmb{U} = \pmb{S}_D^2</script>
<p>Here, the empirical co-variance is <em>diagonal</em>. This means that through PCA, we’ve transformed our data to make the various components <strong>uncorrelated</strong>. This gives us some intuition of why it may be useful to first transform the data with the rotation $\pmb{U}^T\pmb{X}$.</p>
<p>Additionally, by the convention of the SVD, the singular values are in decreasing order (so the first one, $s_1$, is the greatest one). Since our empirical covariance $\tilde{\pmb{K}} = \frac{1}{N}\pmb{S}_D^2$ is diagonal, the variance of the first component is $s_1^2 / N$, that of the second is $s_2^2 / N$, and so on, which proves the property of PCA’s axes being in decreasing order of variance.</p>
<p>Assume that we’re doing classification. Intuitively, it makes sense that classifying features with a larger variance would be easier (when the variance is 0, all data is the same and it becomes impossible to classify using that component). From this point of view, it makes intuitive sense to only keep the first K rows of $\tilde{\pmb{X}}$ when we perform dimensionality reduction; we keep the features that have high variance and are uncorrelated, and we discard all features with variance close to 0 as they’re hard to classify.</p>
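<p>The decorrelation property can be observed on synthetic data. The sketch below builds correlated features by mixing independent Gaussians with a random matrix (an arbitrary choice for the example), centers them, and compares the empirical covariance before and after the rotation $\pmb{U}^T$.</p>

```python
import numpy as np

rng = np.random.default_rng(2)
D, N = 3, 500

# Correlated toy data: mix independent Gaussians with a random D x D matrix.
A = rng.normal(size=(D, D))
X = A @ rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)  # center each feature (row)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

K_emp = X @ X.T / N                    # covariance of the original data
X_tilde = U.T @ X                      # rotated data
K_tilde = X_tilde @ X_tilde.T / N      # covariance after the rotation

variances = np.diag(K_tilde)           # equal to s**2 / N, in decreasing order
```

<p>The matrix <code class="highlighter-rouge">K_emp</code> generally has non-zero off-diagonal entries, while <code class="highlighter-rouge">K_tilde</code> is diagonal up to rounding, with the variances sorted from largest to smallest.</p>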
<h3 id="computing-the-svd-efficiently">Computing the SVD efficiently</h3>
<p>To compute the SVD decomposition of a matrix $\pmb{X}$, we must compute the matrices $\pmb{U}$ and $\pmb{S}$. Let’s see how we can do this efficiently.</p>
<p>Let’s consider the $D\times D$ matrix $\pmb{XX}^T$. As before, since $\pmb{V}$ is orthogonal, we can use the SVD decomposition to get:</p>
<script type="math/tex; mode=display">\pmb{XX}^T = \pmb{USS}^T\pmb{U}^T = \pmb{U} \pmb{S}_D^2 \pmb{U}^T</script>
<p>Let $\pmb{u}_j$ denote the j<sup>th</sup> column of $\pmb{U}$.</p>
<script type="math/tex; mode=display">\pmb{XX}^T \pmb{u}_j = \pmb{U}\pmb{S}_D^2 \pmb{U}^T \pmb{u}_j = s_j^2 \pmb{u}_j</script>
<p>We see that the j<sup>th</sup> column of $\pmb{U}$ is the j<sup>th</sup> eigenvector of $\pmb{XX}^T$, with eigenvalue $s_j^2$. Therefore, finding the eigenvalues and eigenvectors for $\pmb{XX}^T$ gives us a way to compute $\pmb{U}$ and $\pmb{S}$.</p>
<p>There’s a subtle point to be made here about the sign of the eigenvector. If $\pmb{u}_j$ is an eigenvector, then so is $-\pmb{u}_j$. But if our goal is simply to use that decomposition to do PCA, then it doesn’t matter, as the signs of the columns of $\pmb{U}_K$ cancel when computing $\pmb{U}_K\pmb{U}_K^T$. However, if the goal is to compute the full SVD, we must fix some choice of signs, and be consistent with it in $\pmb{V}$.</p>
<p>To compute this decomposition, we can either work with $\pmb{X}^T\pmb{X}$ or $\pmb{XX}^T$. This is practical, as it allows us to pick the smaller of the two and work in dimension D or N.</p>
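<p>As a sketch of this route, we can recover the singular values and left singular vectors from an eigendecomposition of $\pmb{XX}^T$ and compare them with NumPy’s SVD. It uses <code class="highlighter-rouge">np.linalg.eigh</code>, which applies because $\pmb{XX}^T$ is symmetric; the data is random and the sizes are arbitrary.</p>

```python
import numpy as np

rng = np.random.default_rng(3)
D, N = 3, 8
X = rng.normal(size=(D, N))

# eigh returns the eigenvalues of a symmetric matrix in ascending order.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
order = np.argsort(eigvals)[::-1]      # reorder to descending
# Clip tiny negative values caused by rounding before taking the square root.
s_from_eig = np.sqrt(np.clip(eigvals[order], 0.0, None))
U_from_eig = eigvecs[:, order]

# Reference decomposition.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Columns may differ by a sign, but the projector U_K U_K^T does not.
K = 2
P_eig = U_from_eig[:, :K] @ U_from_eig[:, :K].T
P_svd = U[:, :K] @ U[:, :K].T
```

<p>The singular values agree, and although individual columns of <code class="highlighter-rouge">U_from_eig</code> may be flipped relative to <code class="highlighter-rouge">U</code>, the projectors used for PCA are identical.</p>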
<h3 id="pitfalls-of-pca">Pitfalls of PCA</h3>
<p>Unfortunately, PCA is no miracle cure. The SVD is not invariant under scalings of the features in the original matrix $\pmb{X}$. This is why it’s so important to normalize features. But there are many ways of doing this, the result of PCA depends heavily on which one we pick, and the choice is to a large degree arbitrary. Still, the conventional approach for PCA is to remove the mean and normalize the variance of each feature to 1.</p>
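<p>The sensitivity to scaling is easy to reproduce. In this sketch, one feature of a toy data set is multiplied by 1000 (think of a change of units, a hypothetical scenario), and the first principal direction is computed with and without standardization; the helper names are assumptions made for the example.</p>

```python
import numpy as np

rng = np.random.default_rng(4)
D, N = 2, 200
X = rng.normal(size=(D, N))
X = X - X.mean(axis=1, keepdims=True)


def first_direction(data):
    """First left singular vector, i.e. the first principal direction."""
    U, _, _ = np.linalg.svd(data, full_matrices=False)
    return U[:, 0]


def standardize(data):
    """Remove the mean of each feature (row) and scale its variance to 1."""
    centered = data - data.mean(axis=1, keepdims=True)
    return centered / centered.std(axis=1, keepdims=True)


X_scaled = X.copy()
X_scaled[1] *= 1000.0  # same data, second feature in different units

u_scaled = first_direction(X_scaled)            # dominated by feature 2
u_std = first_direction(standardize(X))
u_std_scaled = first_direction(standardize(X_scaled))
```

<p>Without normalization the first direction is almost entirely aligned with the rescaled feature; after standardization both versions of the data give the same direction (up to sign).</p>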
<div class="footnotes">
<ol>
<li id="fn:orthonormal">
<p>The columns of an orthonormal matrix are orthogonal and have unit norm (norm 1). The transpose is equal to the inverse, meaning that if $\pmb{U}$ is orthonormal, then $\pmb{U}^T\pmb{U} = \pmb{UU}^T = \pmb{I}$ <a href="#fnref:orthonormal" class="reversefootnote">↩</a> <a href="#fnref:orthonormal:1" class="reversefootnote">↩<sup>2</sup></a></p>
</li>
</ol>
</div>
CS-452 Foundations of Software (2018-09-18, https://kjaer.io/fos)
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#writing-a-parser-with-parser-combinators" id="markdown-toc-writing-a-parser-with-parser-combinators">Writing a parser with parser combinators</a> <ul>
<li><a href="#boilerplate" id="markdown-toc-boilerplate">Boilerplate</a></li>
<li><a href="#the-basic-idea" id="markdown-toc-the-basic-idea">The basic idea</a></li>
<li><a href="#simple-parser-primitives" id="markdown-toc-simple-parser-primitives">Simple parser primitives</a></li>
<li><a href="#parser-combinators" id="markdown-toc-parser-combinators">Parser combinators</a></li>
<li><a href="#shorthands" id="markdown-toc-shorthands">Shorthands</a></li>
<li><a href="#example-json-parser" id="markdown-toc-example-json-parser">Example: JSON parser</a></li>
<li><a href="#the-trouble-with-left-recursion" id="markdown-toc-the-trouble-with-left-recursion">The trouble with left-recursion</a></li>
</ul>
</li>
<li><a href="#arithmetic-expressions--abstract-syntax-and-proof-principles" id="markdown-toc-arithmetic-expressions--abstract-syntax-and-proof-principles">Arithmetic expressions — abstract syntax and proof principles</a> <ul>
<li><a href="#basics-of-induction" id="markdown-toc-basics-of-induction">Basics of induction</a></li>
<li><a href="#mathematical-representation-of-syntax" id="markdown-toc-mathematical-representation-of-syntax">Mathematical representation of syntax</a> <ul>
<li><a href="#mathematical-representation-1" id="markdown-toc-mathematical-representation-1">Mathematical representation 1</a></li>
<li><a href="#mathematical-representation-2" id="markdown-toc-mathematical-representation-2">Mathematical representation 2</a></li>
<li><a href="#mathematical-representation-3" id="markdown-toc-mathematical-representation-3">Mathematical representation 3</a></li>
<li><a href="#comparison-of-the-representations" id="markdown-toc-comparison-of-the-representations">Comparison of the representations</a></li>
</ul>
</li>
<li><a href="#induction-on-terms" id="markdown-toc-induction-on-terms">Induction on terms</a></li>
<li><a href="#inductive-function-definitions" id="markdown-toc-inductive-function-definitions">Inductive function definitions</a> <ul>
<li><a href="#what-is-a-function" id="markdown-toc-what-is-a-function">What is a function?</a></li>
<li><a href="#induction-example-1" id="markdown-toc-induction-example-1">Induction example 1</a></li>
<li><a href="#induction-example-2" id="markdown-toc-induction-example-2">Induction example 2</a></li>
</ul>
</li>
<li><a href="#operational-semantics-and-reasoning" id="markdown-toc-operational-semantics-and-reasoning">Operational semantics and reasoning</a> <ul>
<li><a href="#evaluation" id="markdown-toc-evaluation">Evaluation</a></li>
<li><a href="#derivations" id="markdown-toc-derivations">Derivations</a></li>
<li><a href="#inversion-lemma" id="markdown-toc-inversion-lemma">Inversion lemma</a></li>
</ul>
</li>
<li><a href="#abstract-machines" id="markdown-toc-abstract-machines">Abstract machines</a></li>
<li><a href="#normal-forms" id="markdown-toc-normal-forms">Normal forms</a> <ul>
<li><a href="#values-that-are-normal-form" id="markdown-toc-values-that-are-normal-form">Values that are normal form</a></li>
<li><a href="#values-that-are-not-normal-form" id="markdown-toc-values-that-are-not-normal-form">Values that are not normal form</a></li>
</ul>
</li>
<li><a href="#multi-step-evaluation" id="markdown-toc-multi-step-evaluation">Multi-step evaluation</a></li>
<li><a href="#termination-of-evaluation" id="markdown-toc-termination-of-evaluation">Termination of evaluation</a></li>
</ul>
</li>
<li><a href="#lambda-calculus" id="markdown-toc-lambda-calculus">Lambda calculus</a> <ul>
<li><a href="#pure-lambda-calculus" id="markdown-toc-pure-lambda-calculus">Pure lambda calculus</a> <ul>
<li><a href="#scope" id="markdown-toc-scope">Scope</a></li>
<li><a href="#operational-semantics" id="markdown-toc-operational-semantics">Operational semantics</a></li>
<li><a href="#evaluation-strategies" id="markdown-toc-evaluation-strategies">Evaluation strategies</a></li>
</ul>
</li>
<li><a href="#classical-lambda-calculus" id="markdown-toc-classical-lambda-calculus">Classical lambda calculus</a> <ul>
<li><a href="#confluence-in-full-beta-reduction" id="markdown-toc-confluence-in-full-beta-reduction">Confluence in full beta reduction</a></li>
<li><a href="#alpha-conversion" id="markdown-toc-alpha-conversion">Alpha conversion</a></li>
</ul>
</li>
<li><a href="#programming-in-lambda-calculus" id="markdown-toc-programming-in-lambda-calculus">Programming in lambda-calculus</a> <ul>
<li><a href="#multiple-arguments" id="markdown-toc-multiple-arguments">Multiple arguments</a></li>
<li><a href="#booleans" id="markdown-toc-booleans">Booleans</a></li>
<li><a href="#pairs" id="markdown-toc-pairs">Pairs</a></li>
<li><a href="#numbers" id="markdown-toc-numbers">Numbers</a></li>
<li><a href="#lists" id="markdown-toc-lists">Lists</a></li>
</ul>
</li>
<li><a href="#recursion-in-lambda-calculus" id="markdown-toc-recursion-in-lambda-calculus">Recursion in lambda-calculus</a></li>
<li><a href="#equivalence-of-lambda-terms" id="markdown-toc-equivalence-of-lambda-terms">Equivalence of lambda terms</a></li>
</ul>
</li>
<li><a href="#types" id="markdown-toc-types">Types</a> <ul>
<li><a href="#properties-of-the-typing-relation" id="markdown-toc-properties-of-the-typing-relation">Properties of the Typing Relation</a> <ul>
<li><a href="#inversion-lemma-1" id="markdown-toc-inversion-lemma-1">Inversion lemma</a></li>
<li><a href="#canonical-form" id="markdown-toc-canonical-form">Canonical form</a></li>
<li><a href="#progress-theorem" id="markdown-toc-progress-theorem">Progress Theorem</a></li>
<li><a href="#preservation-theorem" id="markdown-toc-preservation-theorem">Preservation Theorem</a></li>
</ul>
</li>
<li><a href="#messing-with-it" id="markdown-toc-messing-with-it">Messing with it</a> <ul>
<li><a href="#removing-a-rule" id="markdown-toc-removing-a-rule">Removing a rule</a></li>
<li><a href="#changing-type-checking-rule" id="markdown-toc-changing-type-checking-rule">Changing type-checking rule</a></li>
<li><a href="#adding-bit" id="markdown-toc-adding-bit">Adding bit</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#simply-typed-lambda-calculus" id="markdown-toc-simply-typed-lambda-calculus">Simply typed lambda calculus</a> <ul>
<li><a href="#type-annotations" id="markdown-toc-type-annotations">Type annotations</a></li>
<li><a href="#typing-rules" id="markdown-toc-typing-rules">Typing rules</a></li>
<li><a href="#inversion-lemma-2" id="markdown-toc-inversion-lemma-2">Inversion lemma</a></li>
<li><a href="#canonical-form-1" id="markdown-toc-canonical-form-1">Canonical form</a></li>
<li><a href="#progress" id="markdown-toc-progress">Progress</a></li>
<li><a href="#preservation" id="markdown-toc-preservation">Preservation</a> <ul>
<li><a href="#weakening-lemma" id="markdown-toc-weakening-lemma">Weakening lemma</a></li>
<li><a href="#permutation-lemma" id="markdown-toc-permutation-lemma">Permutation lemma</a></li>
<li><a href="#substitution-lemma" id="markdown-toc-substitution-lemma">Substitution lemma</a></li>
<li><a href="#proof" id="markdown-toc-proof">Proof</a></li>
</ul>
</li>
<li><a href="#erasure" id="markdown-toc-erasure">Erasure</a></li>
<li><a href="#curry-howard-correspondence" id="markdown-toc-curry-howard-correspondence">Curry-Howard Correspondence</a></li>
<li><a href="#extensions-to-stlc" id="markdown-toc-extensions-to-stlc">Extensions to STLC</a> <ul>
<li><a href="#base-types" id="markdown-toc-base-types">Base types</a></li>
<li><a href="#unit-type" id="markdown-toc-unit-type">Unit type</a></li>
<li><a href="#sequencing" id="markdown-toc-sequencing">Sequencing</a></li>
<li><a href="#ascription" id="markdown-toc-ascription">Ascription</a></li>
<li><a href="#pairs-1" id="markdown-toc-pairs-1">Pairs</a></li>
<li><a href="#tuples" id="markdown-toc-tuples">Tuples</a></li>
<li><a href="#records" id="markdown-toc-records">Records</a></li>
</ul>
</li>
<li><a href="#sums-and-variants" id="markdown-toc-sums-and-variants">Sums and variants</a> <ul>
<li><a href="#sum-type" id="markdown-toc-sum-type">Sum type</a></li>
<li><a href="#sums-and-uniqueness-of-type" id="markdown-toc-sums-and-uniqueness-of-type">Sums and uniqueness of type</a></li>
<li><a href="#variants" id="markdown-toc-variants">Variants</a></li>
</ul>
</li>
<li><a href="#recursion" id="markdown-toc-recursion">Recursion</a></li>
<li><a href="#references" id="markdown-toc-references">References</a> <ul>
<li><a href="#mutability" id="markdown-toc-mutability">Mutability</a></li>
<li><a href="#aliasing" id="markdown-toc-aliasing">Aliasing</a></li>
<li><a href="#typing-rules-1" id="markdown-toc-typing-rules-1">Typing rules</a></li>
<li><a href="#evaluation-1" id="markdown-toc-evaluation-1">Evaluation</a></li>
<li><a href="#store-typing" id="markdown-toc-store-typing">Store typing</a></li>
<li><a href="#safety" id="markdown-toc-safety">Safety</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<h2 id="writing-a-parser-with-parser-combinators">Writing a parser with parser combinators</h2>
<p>In Scala, you can (ab)use operator overloading to create an embedded DSL (EDSL) for grammars. While a grammar may look as follows in a grammar description language (Bison, Yacc, ANTLR, …):</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>Expr ::= Term {'+' Term | '-' Term}
Term ::= Factor {'*' Factor | '/' Factor}
Factor ::= Number | '(' Expr ')'</pre></td></tr></tbody></table></code></pre></figure>
<p>In Scala, we can model it as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"*"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is perhaps a little less elegant, but allows us to encode it directly into our language, which is often useful for interop.</p>
<p>The <code class="highlighter-rouge">~</code>, <code class="highlighter-rouge">|</code>, <code class="highlighter-rouge">rep</code> and <code class="highlighter-rouge">opt</code> are <strong>parser combinators</strong>. These are primitives with which we can construct a full parser for the grammar of our choice.</p>
<h3 id="boilerplate">Boilerplate</h3>
<p>First, let’s define a class <code class="highlighter-rouge">ParseResult[T]</code> as an ad-hoc monad; parsing can either succeed or fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">+T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Success</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">result</span><span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">msg</span> <span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 <code class="highlighter-rouge">Nothing</code> is the bottom type in Scala; it has no values, it is a subtype of every other type, and nothing can extend it</p>
</blockquote>
<p>Let’s also define the tokens produced by the lexer (which we won’t define) as case classes extending <code class="highlighter-rouge">Token</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
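</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Token</span> <span class="o">{</span> <span class="k">def</span> <span class="n">chars</span><span class="k">:</span> <span class="kt">String</span> <span class="o">}</span>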
<span class="k">case</span> <span class="k">class</span> <span class="nc">Keyword</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">NumericLit</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">StringLit</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Identifier</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Token</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Input into the parser is then a lazy stream of tokens (with positions for error diagnostics, which we’ll omit here):</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">type</span> <span class="kt">Input</span> <span class="o">=</span> <span class="nc">Reader</span><span class="o">[</span><span class="kt">Token</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can then define a standard, sample parser which looks as follows on the type-level:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">class</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
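<span class="k">type</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">=</span> <span class="nc">Input</span> <span class="k">=></span> <span class="nc">ParseResult</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>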
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="the-basic-idea">The basic idea</h3>
<p>For each language (defined by a grammar symbol <code class="highlighter-rouge">S</code>), define a function <code class="highlighter-rouge">f</code> that, given an input stream <code class="highlighter-rouge">i</code> (with tail <code class="highlighter-rouge">i'</code>):</p>
<ul>
<li>if a prefix of <code class="highlighter-rouge">i</code> is in <code class="highlighter-rouge">S</code>, return <code class="highlighter-rouge">Success(x, i')</code>, where <code class="highlighter-rouge">x</code> is a result for <code class="highlighter-rouge">S</code></li>
<li>otherwise, return <code class="highlighter-rouge">Failure(msg, i)</code>, where <code class="highlighter-rouge">msg</code> is an error message string</li>
</ul>
<p>The first is called <em>success</em>, the second is <em>failure</em>. We can compose operations on this somewhat conveniently, like we would on a monad (like <code class="highlighter-rouge">Option</code>).</p>
<h3 id="simple-parser-primitives">Simple parser primitives</h3>
<p>All of the above boilerplate allows us to define a parser, which succeeds if the first token in the input satisfies some given predicate <code class="highlighter-rouge">pred</code>. When it succeeds, it reads the token string, and splits the input there.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">token</span><span class="o">(</span><span class="n">kind</span><span class="k">:</span> <span class="kt">String</span><span class="o">)(</span><span class="n">pred</span><span class="k">:</span> <span class="kt">Token</span> <span class="o">=></span> <span class="kt">Boolean</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="n">pred</span><span class="o">(</span><span class="n">in</span><span class="o">.</span><span class="n">head</span><span class="o">))</span> <span class="nc">Success</span><span class="o">(</span><span class="n">in</span><span class="o">.</span><span class="n">head</span><span class="o">.</span><span class="n">chars</span><span class="o">,</span> <span class="n">in</span><span class="o">.</span><span class="n">tail</span><span class="o">)</span>
<span class="k">else</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">kind</span> <span class="o">+</span> <span class="s">" expected "</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We can use this to define a keyword parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
</pre></td><td class="code"><pre><span class="k">implicit</span> <span class="k">def</span> <span class="n">keyword</span><span class="o">(</span><span class="n">chars</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"'"</span> <span class="o">+</span> <span class="n">chars</span> <span class="o">+</span> <span class="s">"'"</span><span class="o">)</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Keyword</span><span class="o">(</span><span class="n">chars1</span><span class="o">)</span> <span class="k">=></span> <span class="n">chars</span> <span class="o">==</span> <span class="n">chars1</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="kc">false</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Marking it as <code class="highlighter-rouge">implicit</code> allows us to write keywords as normal strings, where we can omit the <code class="highlighter-rouge">keyword</code> call (this helps us simplify the notation in our DSL; we can write <code class="highlighter-rouge">"if"</code> instead of <code class="highlighter-rouge">keyword("if")</code>).</p>
<p>We can make other parsers for our other case classes quite simply:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">numericLit</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"number"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">NumericLit</span><span class="o">])</span>
<span class="k">def</span> <span class="n">stringLit</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"string literal"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">StringLit</span><span class="o">])</span>
<span class="k">def</span> <span class="n">ident</span> <span class="k">=</span> <span class="n">token</span><span class="o">(</span><span class="s">"identifier"</span><span class="o">)(</span><span class="k">_</span><span class="o">.</span><span class="n">isInstanceOf</span><span class="o">[</span><span class="kt">Identifier</span><span class="o">])</span></pre></td></tr></tbody></table></code></pre></figure>
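<p>To see these primitives run, here is a minimal self-contained sketch. The <code class="highlighter-rouge">Token</code> subclasses, the <code class="highlighter-rouge">Input</code> alias (a plain <code class="highlighter-rouge">List[Token]</code>) and the <code class="highlighter-rouge">ParseResult</code> hierarchy below are simplified stand-ins assumed for illustration, and may differ from the real definitions. Unlike the version above, this <code class="highlighter-rouge">token</code> pattern-matches on the input, so it fails cleanly on empty input instead of calling <code class="highlighter-rouge">head</code> on it.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Hypothetical token types, assumed for this sketch only
sealed trait Token { def chars: String }
case class Keyword(chars: String) extends Token
case class NumericLit(chars: String) extends Token
case class Identifier(chars: String) extends Token

object PrimitivesDemo {
  // Assumption: the input is simply a list of already-lexed tokens
  type Input = List[Token]

  sealed trait ParseResult[+T]
  case class Success[+T](result: T, rest: Input) extends ParseResult[T]
  case class Failure(msg: String, rest: Input) extends ParseResult[Nothing]

  abstract class Parser[T] { def apply(in: Input): ParseResult[T] }

  // Succeeds if the first token satisfies pred; returns its string and the remaining input
  def token(kind: String)(pred: Token => Boolean): Parser[String] = new Parser[String] {
    def apply(in: Input): ParseResult[String] = in match {
      case head :: tail if pred(head) => Success(head.chars, tail)
      case _                          => Failure(kind + " expected", in)
    }
  }

  def numericLit: Parser[String] = token("number")(_.isInstanceOf[NumericLit])

  val input: Input = List(NumericLit("42"), Keyword("+"))
  val ok = numericLit(input)       // the number is consumed, the keyword remains
  val ko = numericLit(input.tail)  // a keyword is not a number: failure, input untouched
}</code></pre></figure>
<p>Note that a failing parser hands back the input unchanged, which is what later lets an alternative retry from the same position.</p>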
<h3 id="parser-combinators">Parser combinators</h3>
<p>We are going to define the following parser combinators:</p>
<ul>
<li><code class="highlighter-rouge">~</code>: sequential composition</li>
<li><code class="highlighter-rouge"><~</code>, <code class="highlighter-rouge">~></code>: sequential composition, keeping left / right only</li>
<li><code class="highlighter-rouge">|</code>: alternative</li>
<li><code class="highlighter-rouge">opt(X)</code>: option (like a <code class="highlighter-rouge">?</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">rep(X)</code>: repetition (like a <code class="highlighter-rouge">*</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">repsep(P, Q)</code>: interleaved repetition</li>
<li><code class="highlighter-rouge">^^</code>: result conversion (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>)</li>
<li><code class="highlighter-rouge">^^^</code>: constant result (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>, but returning a constant value regardless of result)</li>
</ul>
<p>But first, we’ll write some very basic parser combinators: <code class="highlighter-rouge">success</code> and <code class="highlighter-rouge">failure</code>, that respectively always succeed and always fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">success</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">result</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Success</span><span class="o">(</span><span class="n">result</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">failure</span><span class="o">(</span><span class="n">msg</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Failure</span><span class="o">(</span><span class="n">msg</span><span class="o">,</span> <span class="n">in</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The combinators <code class="highlighter-rouge">~</code>, <code class="highlighter-rouge">|</code>, <code class="highlighter-rouge">^^</code> and <code class="highlighter-rouge">^^^</code> are methods on a <code class="highlighter-rouge">Parser[T]</code> class. Thanks to infix operator notation in Scala, we can write <code class="highlighter-rouge">x.y(z)</code> as <code class="highlighter-rouge">x y z</code>, which allows us to simplify our DSL notation; for instance <code class="highlighter-rouge">A ~ B</code> corresponds to <code class="highlighter-rouge">A.~(B)</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
</pre></td><td class="code"><pre><span class="k">abstract</span> <span class="k">class</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="c1">// An abstract method that defines the parser function
</span> <span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span><span class="k">:</span> <span class="kt">ParseResult</span>
<span class="k">def</span> <span class="o">~[</span><span class="kt">U</span><span class="o">](</span><span class="n">rhs</span><span class="k">:</span> <span class="o">=></span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span> <span class="kt">~</span> <span class="kt">U</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span><span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">tail</span><span class="o">)</span> <span class="k">=></span> <span class="n">rhs</span><span class="o">(</span><span class="n">tail</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">y</span><span class="o">,</span> <span class="n">rest</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Success</span><span class="o">(</span><span class="k">new</span> <span class="o">~(</span><span class="n">x</span><span class="o">,</span> <span class="n">y</span><span class="o">),</span> <span class="n">rest</span><span class="o">)</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">failure</span>
<span class="o">}</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">failure</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">|(</span><span class="n">rhs</span><span class="k">:</span> <span class="o">=></span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">s1</span> <span class="k">@</span> <span class="nc">Success</span><span class="o">(</span><span class="k">_</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=></span> <span class="n">s1</span>
<span class="k">case</span> <span class="n">failure</span> <span class="k">=></span> <span class="n">rhs</span><span class="o">(</span><span class="n">in</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">^^[</span><span class="kt">U</span><span class="o">](</span><span class="n">f</span><span class="k">:</span> <span class="kt">T</span> <span class="o">=></span> <span class="n">U</span><span class="o">)</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">apply</span><span class="o">(</span><span class="n">in</span> <span class="k">:</span> <span class="kt">Input</span><span class="o">)</span> <span class="k">=</span> <span class="nc">Parser</span><span class="o">.</span><span class="k">this</span><span class="o">(</span><span class="n">in</span><span class="o">)</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="nc">Success</span><span class="o">(</span><span class="n">x</span><span class="o">,</span> <span class="n">tail</span><span class="o">)</span> <span class="k">=></span> <span class="nc">Success</span><span class="o">(</span><span class="n">f</span><span class="o">(</span><span class="n">x</span><span class="o">),</span> <span class="n">tail</span><span class="o">)</span>
<span class="k">case</span> <span class="n">x</span> <span class="k">=></span> <span class="n">x</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="o">^^^[</span><span class="kt">U</span><span class="o">](</span><span class="n">r</span><span class="k">:</span> <span class="kt">U</span><span class="o">)</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">]</span> <span class="k">=</span> <span class="o">^^(</span><span class="n">x</span> <span class="k">=></span> <span class="n">r</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 In Scala, <code class="highlighter-rouge">T ~ U</code> is syntactic sugar for <code class="highlighter-rouge">~[T, U]</code>, which is the type of the case class we’ll define below</p>
</blockquote>
<p>For the <code class="highlighter-rouge">~</code> combinator, when both parsers succeed, we wrap the two results in <code class="highlighter-rouge">~</code>, a case class that is equivalent to <code class="highlighter-rouge">Pair</code>, but prints the way we want and allows for the concise type-level notation above.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">case</span> <span class="k">class</span> <span class="nc">~</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">_1</span> <span class="k">:</span> <span class="kt">T</span><span class="o">,</span> <span class="n">_2</span> <span class="k">:</span> <span class="kt">U</span><span class="o">)</span> <span class="o">{</span>
<span class="k">override</span> <span class="k">def</span> <span class="n">toString</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">+</span> <span class="n">_1</span> <span class="o">+</span> <span class="s">" ~ "</span> <span class="o">+</span> <span class="n">_2</span> <span class="o">+</span><span class="s">")"</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>At this point, we thus have <strong>two</strong> different meanings for <code class="highlighter-rouge">~</code>: a <em>function</em> <code class="highlighter-rouge">~</code> that produces a <code class="highlighter-rouge">Parser</code>, and the <code class="highlighter-rouge">~(a, b)</code> <em>case class</em> pair that this parser returns (all of this is encoded in the function signature of the <code class="highlighter-rouge">~</code> function).</p>
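<p>The type-level side of this can be tried in isolation. The sketch below only re-declares the case class (inside a hypothetical <code class="highlighter-rouge">TildeDemo</code> object, with made-up values) to show that <code class="highlighter-rouge">T ~ U</code> and <code class="highlighter-rouge">~[T, U]</code> are the same type, that infix types nest to the left, and that the same name can be used as an infix pattern:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object TildeDemo {
  // Same shape as the case class above
  case class ~[T, U](_1: T, _2: U) {
    override def toString = "(" + _1 + " ~ " + _2 + ")"
  }

  // Int ~ String is the infix spelling of the type ~[Int, String]
  val pair: Int ~ String = new ~(1, "one")
  val sameType: ~[Int, String] = pair

  // Infix types associate to the left, like the results of chained ~ parsers
  val nested: Int ~ String ~ Boolean = new ~(pair, true)

  // The same name also works as an infix pattern when destructuring
  val described: String = nested match {
    case n ~ s ~ b => s + "=" + n + "/" + b
  }
}</code></pre></figure>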
<p>Note that the <code class="highlighter-rouge">|</code> combinator takes the right-hand side parser as a call-by-name argument. This is because we don’t want to evaluate it unless it is strictly needed—that is, if the left-hand side fails.</p>
<p><code class="highlighter-rouge">^^</code> is like a <code class="highlighter-rouge">map</code> operation on <code class="highlighter-rouge">Option</code>; <code class="highlighter-rouge">P ^^ f</code> succeeds iff <code class="highlighter-rouge">P</code> succeeds, in which case it applies the transformation <code class="highlighter-rouge">f</code> to the result of <code class="highlighter-rouge">P</code>. Otherwise, it fails.</p>
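<p>Both behaviours can be observed with a small self-contained sketch. It makes a few simplifying assumptions that are not part of the code above: tokens are plain strings, <code class="highlighter-rouge">word</code> is a hypothetical helper that matches one exact string, and <code class="highlighter-rouge">counted</code> is a made-up wrapper that counts how often the right-hand side of <code class="highlighter-rouge">|</code> actually gets evaluated.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object AltMapDemo {
  type Input = List[String]  // simplification: a token is just a string

  sealed trait ParseResult[+T]
  case class Success[+T](result: T, rest: Input) extends ParseResult[T]
  case class Failure(msg: String, rest: Input) extends ParseResult[Nothing]

  abstract class Parser[T] { self =>
    def apply(in: Input): ParseResult[T]

    // Alternative: rhs is call-by-name, so it is only evaluated if self fails
    def |(rhs: => Parser[T]): Parser[T] = new Parser[T] {
      def apply(in: Input): ParseResult[T] = self(in) match {
        case Success(x, tail) => Success(x, tail)
        case _: Failure       => rhs(in)
      }
    }

    // Result conversion: apply f to the result of a successful parse
    def ^^[U](f: T => U): Parser[U] = new Parser[U] {
      def apply(in: Input): ParseResult[U] = self(in) match {
        case Success(x, tail) => Success(f(x), tail)
        case fail: Failure    => fail
      }
    }
  }

  // Hypothetical helper: matches exactly the string w
  def word(w: String): Parser[String] = new Parser[String] {
    def apply(in: Input): ParseResult[String] = in match {
      case head :: tail if head == w => Success(w, tail)
      case _                         => Failure("'" + w + "' expected", in)
    }
  }

  // Hypothetical instrumentation: counts evaluations of the right-hand side
  var rhsBuilt = 0
  def counted(p: Parser[Boolean]): Parser[Boolean] = { rhsBuilt += 1; p }

  val bool: Parser[Boolean] =
    (word("true") ^^ (_ => true)) | counted(word("false") ^^ (_ => false))

  val r1 = bool(List("true"))   // left side succeeds: right side is never evaluated
  val builtAfterR1 = rhsBuilt
  val r2 = bool(List("false"))  // left side fails: right side is evaluated now
  val builtAfterR2 = rhsBuilt
  val r3 = bool(List("maybe"))  // both sides fail: the last failure is reported
}</code></pre></figure>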
<h3 id="shorthands">Shorthands</h3>
<p>We can now define shorthands for common combinations of parser combinators:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">opt</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Option</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span> <span class="n">p</span> <span class="o">^^</span> <span class="nc">Some</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">None</span><span class="o">)</span>
<span class="k">def</span> <span class="n">rep</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
<span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="n">x</span> <span class="o">::</span> <span class="n">xs</span> <span class="o">}</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">Nil</span><span class="o">)</span>
<span class="k">def</span> <span class="n">repsep</span><span class="o">[</span><span class="kt">T</span>, <span class="kt">U</span><span class="o">](</span><span class="n">p</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">],</span> <span class="n">q</span> <span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">List</span><span class="o">[</span><span class="kt">T</span><span class="o">]]</span> <span class="k">=</span>
<span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">q</span> <span class="o">~></span> <span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">r</span> <span class="o">~</span> <span class="n">rs</span> <span class="k">=></span> <span class="n">r</span> <span class="o">::</span> <span class="n">rs</span> <span class="o">}</span> <span class="o">|</span> <span class="n">success</span><span class="o">(</span><span class="nc">Nil</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Note that none of the above can fail. When <code class="highlighter-rouge">p</code> does not match, they succeed without consuming any input, returning <code class="highlighter-rouge">None</code> or <code class="highlighter-rouge">Nil</code> wrapped in a <code class="highlighter-rouge">Success</code>.</p>
<p>As an exercise, we can implement the <code class="highlighter-rouge">rep1(P)</code> parser combinator, which corresponds to the <code class="highlighter-rouge">+</code> regex quantifier:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
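</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">rep1</span><span class="o">[</span><span class="kt">T</span><span class="o">](</span><span class="n">p</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">T</span><span class="o">])</span> <span class="k">=</span> <span class="n">p</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="n">p</span><span class="o">)</span> <span class="o">^^</span> <span class="o">{</span> <span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="n">x</span> <span class="o">::</span> <span class="n">xs</span> <span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>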
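<p>As a rough check of these shorthands, here is a self-contained sketch that re-implements a cut-down <code class="highlighter-rouge">Parser</code> over a list of strings and runs <code class="highlighter-rouge">opt</code>, <code class="highlighter-rouge">rep</code> and <code class="highlighter-rouge">rep1</code> on small inputs. Several things are assumptions made for the sketch: <code class="highlighter-rouge">Input</code> is a plain <code class="highlighter-rouge">List[String]</code>, <code class="highlighter-rouge">word</code> is a hypothetical helper, <code class="highlighter-rouge">~</code> takes its right-hand side by name (so that the recursive call in <code class="highlighter-rouge">rep</code> is only evaluated when it is needed), and <code class="highlighter-rouge">opt</code> carries a type ascription because this simplified <code class="highlighter-rouge">Parser</code> is invariant in <code class="highlighter-rouge">T</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object RepDemo {
  type Input = List[String]  // simplification: a token is just a string

  sealed trait ParseResult[+T]
  case class Success[+T](result: T, rest: Input) extends ParseResult[T]
  case class Failure(msg: String, rest: Input) extends ParseResult[Nothing]

  case class ~[T, U](_1: T, _2: U)

  abstract class Parser[T] { self =>
    def apply(in: Input): ParseResult[T]

    // Sequence: run self, then rhs on the remaining input (rhs is by-name)
    def ~[U](rhs: => Parser[U]): Parser[T ~ U] = new Parser[T ~ U] {
      def apply(in: Input): ParseResult[T ~ U] = self(in) match {
        case Success(x, tail) => rhs(tail) match {
          case Success(y, rest) => Success(new ~(x, y), rest)
          case fail: Failure    => fail
        }
        case fail: Failure => fail
      }
    }

    // Alternative: only try rhs if self fails
    def |(rhs: => Parser[T]): Parser[T] = new Parser[T] {
      def apply(in: Input): ParseResult[T] = self(in) match {
        case Success(x, tail) => Success(x, tail)
        case _: Failure       => rhs(in)
      }
    }

    // Result conversion
    def ^^[U](f: T => U): Parser[U] = new Parser[U] {
      def apply(in: Input): ParseResult[U] = self(in) match {
        case Success(x, tail) => Success(f(x), tail)
        case fail: Failure    => fail
      }
    }
  }

  def success[T](result: T): Parser[T] = new Parser[T] {
    def apply(in: Input): ParseResult[T] = Success(result, in)
  }

  // Hypothetical helper: matches exactly the string w
  def word(w: String): Parser[String] = new Parser[String] {
    def apply(in: Input): ParseResult[String] = in match {
      case head :: tail if head == w => Success(w, tail)
      case _                         => Failure("'" + w + "' expected", in)
    }
  }

  def opt[T](p: Parser[T]): Parser[Option[T]] =
    p ^^ (x => Some(x): Option[T]) | success(None)
  def rep[T](p: Parser[T]): Parser[List[T]] =
    p ~ rep(p) ^^ { case x ~ xs => x :: xs } | success(Nil)
  def rep1[T](p: Parser[T]): Parser[List[T]] =
    p ~ rep(p) ^^ { case x ~ xs => x :: xs }

  val many = rep(word("a"))(List("a", "a", "b"))  // consumes both "a" tokens
  val empty = rep(word("a"))(List("b"))           // zero matches still succeed
  val maybe = opt(word("a"))(List("b"))           // succeeds with None
  val atLeastOne = rep1(word("a"))(List("b"))     // needs one match, so it fails
}</code></pre></figure>
<p>The only observable difference between <code class="highlighter-rouge">rep</code> and <code class="highlighter-rouge">rep1</code> here is the missing fallback to <code class="highlighter-rouge">success(Nil)</code>, which is what makes <code class="highlighter-rouge">rep1</code> fail on zero matches.</p>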
<h3 id="example-json-parser">Example: JSON parser</h3>
<p>We did not mention <code class="highlighter-rouge">lexical.delimiters</code> and <code class="highlighter-rouge">lexical.reserved</code> in the above, and for the sake of brevity, we omit the implementation of <code class="highlighter-rouge">stringLit</code> and <code class="highlighter-rouge">numericLit</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">JSON</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">+=</span> <span class="o">(</span><span class="s">"{"</span><span class="o">,</span> <span class="s">"}"</span><span class="o">,</span> <span class="s">"["</span><span class="o">,</span> <span class="s">"]"</span><span class="o">,</span> <span class="s">":"</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">reserved</span> <span class="o">+=</span> <span class="o">(</span><span class="s">"null"</span><span class="o">,</span> <span class="s">"true"</span><span class="o">,</span> <span class="s">"false"</span><span class="o">)</span>
<span class="c1">// Return Map
</span> <span class="k">def</span> <span class="n">obj</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"{"</span> <span class="o">~></span> <span class="n">repsep</span><span class="o">(</span><span class="n">member</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span> <span class="o"><~</span> <span class="s">"}"</span> <span class="o">^^</span> <span class="o">(</span><span class="n">ms</span> <span class="k">=></span> <span class="nc">Map</span><span class="o">()</span> <span class="o">++</span> <span class="n">ms</span><span class="o">)</span>
<span class="c1">// Return List
</span> <span class="k">def</span> <span class="n">arr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"["</span> <span class="o">~></span> <span class="n">repsep</span><span class="o">(</span><span class="n">value</span><span class="o">,</span> <span class="s">","</span><span class="o">)</span> <span class="o"><~</span> <span class="s">"]"</span>
<span class="c1">// Return name/value pair:
</span> <span class="k">def</span> <span class="n">member</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[(</span><span class="kt">String</span><span class="o">,</span> <span class="kt">Any</span><span class="o">)]</span> <span class="k">=</span> <span class="n">stringLit</span> <span class="o">~</span> <span class="s">":"</span> <span class="o">~</span> <span class="n">value</span> <span class="o">^^</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">name</span> <span class="o">~</span> <span class="s">":"</span> <span class="o">~</span> <span class="n">value</span> <span class="k">=></span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// Return correct Scala type
</span> <span class="k">def</span> <span class="n">value</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="o">(</span>
<span class="n">obj</span>
<span class="o">|</span> <span class="n">arr</span>
<span class="o">|</span> <span class="n">stringLit</span>
<span class="o">|</span> <span class="n">numericLit</span> <span class="o">^^</span> <span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">toInt</span><span class="o">)</span>
<span class="o">|</span> <span class="s">"null"</span> <span class="o">^^^</span> <span class="kc">null</span>
<span class="o">|</span> <span class="s">"true"</span> <span class="o">^^^</span> <span class="kc">true</span>
<span class="o">|</span> <span class="s">"false"</span> <span class="o">^^^</span> <span class="kc">false</span>
<span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<h3 id="the-trouble-with-left-recursion">The trouble with left-recursion</h3>
<p>Parser combinators work top-down and therefore do not allow for left-recursion. For example, the following would go into an infinite loop, where the parser keeps recursively calling <code class="highlighter-rouge">expr</code> without ever consuming a token:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
</pre></td><td class="code"><pre><span class="k">def</span> <span class="n">expr</span> <span class="k">=</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Let’s take a look at an arithmetic expression parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">Arithmetic</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">++=</span> <span class="nc">List</span><span class="o">(</span><span class="s">"("</span><span class="o">,</span> <span class="s">")"</span><span class="o">,</span> <span class="s">"+"</span><span class="o">,</span> <span class="s">"-"</span><span class="o">,</span> <span class="s">"*"</span><span class="o">,</span> <span class="s">"/"</span><span class="o">)</span>
<span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"*"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This definition of <code class="highlighter-rouge">expr</code>, namely <code class="highlighter-rouge">term ~ rep("-" ~ term)</code>, produces a right-leaning tree. For instance, <code class="highlighter-rouge">1 - 2 - 3</code> produces <code class="highlighter-rouge">1 ~ List("-" ~ 2, "-" ~ 3)</code>.</p>
<p>Since subtraction is left-associative, we want this to evaluate as <code class="highlighter-rouge">(1 - 2) - 3</code>. The solution is to combine calls to <code class="highlighter-rouge">rep</code> with a final <code class="highlighter-rouge">foldLeft</code> on the list:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
</pre></td><td class="code"><pre><span class="k">object</span> <span class="nc">Arithmetic</span> <span class="k">extends</span> <span class="nc">StandardTokenParsers</span> <span class="o">{</span>
<span class="n">lexical</span><span class="o">.</span><span class="n">delimiters</span> <span class="o">++=</span> <span class="nc">List</span><span class="o">(</span><span class="s">"("</span><span class="o">,</span> <span class="s">")"</span><span class="o">,</span> <span class="s">"+"</span><span class="o">,</span> <span class="s">"−"</span><span class="o">,</span> <span class="s">"∗"</span><span class="o">,</span> <span class="s">"/"</span><span class="o">)</span>
<span class="k">def</span> <span class="n">expr</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">term</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"+"</span> <span class="o">~</span> <span class="n">term</span> <span class="o">|</span> <span class="s">"−"</span> <span class="o">~</span> <span class="n">term</span><span class="o">)</span> <span class="o">^^</span> <span class="n">reduceList</span>
<span class="k">def</span> <span class="n">term</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="n">factor</span> <span class="o">~</span> <span class="n">rep</span><span class="o">(</span><span class="s">"∗"</span> <span class="o">~</span> <span class="n">factor</span> <span class="o">|</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">factor</span><span class="o">)</span> <span class="o">^^</span> <span class="n">reduceList</span>
<span class="k">def</span> <span class="n">factor</span><span class="k">:</span> <span class="kt">Parser</span><span class="o">[</span><span class="kt">Any</span><span class="o">]</span> <span class="k">=</span> <span class="s">"("</span> <span class="o">~</span> <span class="n">expr</span> <span class="o">~</span> <span class="s">")"</span> <span class="o">|</span> <span class="n">numericLit</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">reduceList</span><span class="o">(</span><span class="n">list</span><span class="k">:</span> <span class="kt">Expr</span> <span class="kt">~</span> <span class="kt">List</span><span class="o">[</span><span class="kt">String</span> <span class="kt">~</span> <span class="kt">Expr</span><span class="o">])</span><span class="k">:</span> <span class="kt">Expr</span> <span class="o">=</span> <span class="n">list</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="n">x</span> <span class="o">~</span> <span class="n">xs</span> <span class="k">=></span> <span class="o">(</span><span class="n">x</span> <span class="n">foldLeft</span> <span class="n">ps</span><span class="o">)(</span><span class="n">reduce</span><span class="o">)</span>
<span class="o">}</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">reduce</span><span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">r</span><span class="k">:</span> <span class="kt">String</span> <span class="kt">~</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">r</span> <span class="k">match</span> <span class="o">{</span>
<span class="k">case</span> <span class="s">"+"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"-"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">-</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"*"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">*</span> <span class="n">y</span>
<span class="k">case</span> <span class="s">"/"</span> <span class="o">~</span> <span class="n">y</span> <span class="k">=></span> <span class="n">x</span> <span class="o">/</span> <span class="n">y</span>
<span class="k">case</span> <span class="k">_</span> <span class="k">=></span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">MatchError</span><span class="o">(</span><span class="s">"illegal case: "</span> <span class="o">+</span> <span class="n">r</span><span class="o">)</span>
<span class="o">}</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<blockquote>
<p>👉 It used to be that the standard library contained parser combinators, but those are now a <a href="https://github.com/scala/scala-parser-combinators">separate module</a>. This module contains a <code class="highlighter-rouge">chainl1</code> (chain-left) method that reduces after a <code class="highlighter-rouge">rep</code> for you.</p>
</blockquote>
<h2 id="arithmetic-expressions--abstract-syntax-and-proof-principles">Arithmetic expressions — abstract syntax and proof principles</h2>
<p>This section follows Chapter 3 in TAPL.</p>
<h3 id="basics-of-induction">Basics of induction</h3>
<p>Ordinary induction is simply:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose P is a predicate on natural numbers.
Then:
If P(0)
and, for all i, P(i) implies P(i + 1)
then P(n) holds for all n
</code></pre></div></div>
<p>We can also do complete induction:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose P is a predicate on natural numbers.
Then:
If for each natural number n,
given P(i) for all i < n we can show P(n)
then P(n) holds for all n
</code></pre></div></div>
<p>It proves exactly the same thing as ordinary induction; it is simply a restated version. They’re <em>interderivable</em>: assuming one, we can prove the other. Which one to use is simply a matter of style or convenience. We’ll see some more equivalent styles as we go along.</p>
<h3 id="mathematical-representation-of-syntax">Mathematical representation of syntax</h3>
<p>Let’s assume the following grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">t ::=
    true
    false
    if t then t else t
    0
    succ t
    pred t
    iszero t</code></pre></figure>
<p>What does this really define? A few suggestions:</p>
<ul>
<li>A set of character strings</li>
<li>A set of token lists</li>
<li>A set of abstract syntax trees</li>
</ul>
<p>It depends on how you read it; a grammar like the one above contains information about all three.</p>
<p>However, we are mostly interested in the ASTs. The above grammar is therefore called an <strong>abstract grammar</strong>. Its main purpose is to suggest a mapping from character strings to trees.</p>
<p>We won’t be too strict about the concrete syntax. For instance, we’ll freely use parentheses to disambiguate which tree we mean to describe, even though they’re not strictly supported by the grammar. What matters to us here isn’t strict implementation semantics, but rather that we have a framework to talk about ASTs. For our purposes, two terms producing the same AST are the same term; however, two terms that merely have the same evaluation result remain distinct, as they don’t necessarily have the same AST.</p>
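<p>To make the “set of ASTs” reading concrete, here is a minimal Scala sketch of the grammar as an algebraic data type. The names (<code class="highlighter-rouge">AstDemo</code>, <code class="highlighter-rouge">Term</code>, <code class="highlighter-rouge">Succ</code>, and so on) are our own choices for illustration, not something the course prescribes.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object AstDemo {
  // One constructor per production of the grammar (names are illustrative).
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(cond: Term, thenBranch: Term, elseBranch: Term) extends Term

  // "if (iszero 0) then (succ 0) else (pred 0)": the parentheses only select
  // the tree, they leave no trace in the AST itself.
  val example: Term = If(IsZero(Zero), Succ(Zero), Pred(Zero))

  // Two terms that would evaluate to the same result, but are different trees.
  val predSuccZero: Term = Pred(Succ(Zero))
  val zero: Term = Zero

  def main(args: Array[String]): Unit = {
    println(example)
    // Case classes compare structurally: same AST means same term.
    println(Succ(Zero) == Succ(Zero))
    // Different ASTs are different terms, whatever they might evaluate to.
    println(predSuccZero == zero)
  }
}</code></pre></figure>
<p>Structural equality of case classes matches the convention above: terms are equal exactly when their trees are equal.</p>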
<p>How can we express our grammar as mathematical expressions? A grammar describes the legal <em>set</em> of terms in a program by offering a recursive definition. While recursive definitions may seem obvious and simple to a programmer, we have to jump through a few hoops to make sense of them mathematically.</p>
<h4 id="mathematical-representation-1">Mathematical representation 1</h4>
<p>We can use a set $\mathcal{T}$ of terms. The set of terms described by the grammar is then the smallest set $\mathcal{T}$ such that:</p>
<ol>
<li>$\left\{ \text{true}, \text{false}, 0 \right\} \subseteq \mathcal{T}$,</li>
<li>If $t_1 \in \mathcal{T}$ then $\left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \right\} \subseteq \mathcal{T}$,</li>
<li>If $t_1, t_2, t_3 \in \mathcal{T}$ then we also have $\text{if } t_1 \text{ then } t_2 \text{ else } t_3 \in \mathcal{T}$.</li>
</ol>
<h4 id="mathematical-representation-2">Mathematical representation 2</h4>
<p>We can also write this somewhat more graphically:</p>
<script type="math/tex; mode=display">\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\newcommand{\if}{\text{if }}
\newcommand{\then}{\text{ then }}
\newcommand{\else}{\text{ else }}
\newcommand{\ifelse}{\if t_1 \then t_2 \else t_3}
\newcommand{\defeq}{\overset{\text{def}}{=}}
\newenvironment{rcases}
{\left.\begin{aligned}}
{\end{aligned}\right\rbrace}
\text{true } \in \mathcal{T}, \quad
\text{false } \in \mathcal{T}, \quad
0 \in \mathcal{T} \\ \\
\frac{t_1 \in \mathcal{T}}{\text{succ } t_1 \in \mathcal{T}}, \quad
\frac{t_1 \in \mathcal{T}}{\text{pred } t_1 \in \mathcal{T}}, \quad
\frac{t_1 \in \mathcal{T}}{\text{iszero } t_1 \in \mathcal{T}} \\ \\
\frac{t_1 \in \mathcal{T}, \quad t_2 \in \mathcal{T}, \quad t_3 \in \mathcal{T}}{\ifelse \in \mathcal{T}}</script>
<p>This is exactly equivalent to representation 1, but we have just introduced a different notation. Note that “the smallest set closed under…” is often not stated explicitly, but implied.</p>
<h4 id="mathematical-representation-3">Mathematical representation 3</h4>
<p>Alternatively, we can build up our set of terms as an infinite union:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{S}_0 = & & \emptyset \\
\mathcal{S}_{i+1} =
& & \left\{ \text{true}, \text{ false}, 0 \right\} \\
& \cup & \left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \mid t_1 \in \mathcal{S}_i \right\} \\
& \cup & \left\{ \ifelse \mid t_1, t_2, t_3 \in \mathcal{S}_i \right\}
\end{align} %]]></script>
<p>We can thus build our final set as follows:</p>
<script type="math/tex; mode=display">\mathcal{S} = \bigcup_i{\mathcal{S}_i}</script>
<p>Note that we can “pull out” the definition into a generating function $F$:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mathcal{S}_0 & = \emptyset \\
\mathcal{S}_{i+1} & = F(\mathcal{S}_i) \\
\mathcal{S} & = \bigcup_i{\mathcal{S}_i} \\
\end{align} %]]></script>
<p>The generating function is thus defined as:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
F_1(U) & = \left\{ \text{true} \right\} \\
F_2(U) & = \left\{ \text{false} \right\} \\
F_3(U) & = \left\{ 0 \right\} \\
F_4(U) & = \left\{ \text{succ } t_1 \mid t_1 \in U \right\} \\
F_5(U) & = \left\{ \text{pred } t_1 \mid t_1 \in U \right\} \\
F_6(U) & = \left\{ \text{iszero } t_1 \mid t_1 \in U \right\} \\
F_7(U) & = \left\{ \ifelse \mid t_1, t_2, t_3 \in U \right\} \\
\end{align} \\
F(U) = \bigcup_{i=1}^7{F_i(U)} %]]></script>
<p>Each function takes a set of terms $U$ as input and produces “terms justified by $U$” as output; that is, all terms whose immediate subterms are all elements of $U$ (the constants have no subterms, so they are always justified).</p>
<p>The set $U$ is said to be <strong>closed under F</strong> or <strong>F-closed</strong> if $F(U) \subseteq U$.</p>
<p>The set of terms $T$ as defined above is the smallest F-closed set. If $O$ is another F-closed set, then $T \subseteq O$.</p>
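<p>The construction “from below” can be run directly for the first few steps. The following is a small sketch, assuming a hypothetical <code class="highlighter-rouge">Term</code> data type of our own; it implements $F$ on finite sets and iterates it starting from the empty set.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object GeneratingFunctionDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  // F(U): the constants, plus every term whose immediate subterms all lie in U.
  def F(u: Set[Term]): Set[Term] = {
    val constants: Set[Term] = Set(True, False, Zero)
    val unary: Set[Term] = u.flatMap(t => Set[Term](Succ(t), Pred(t), IsZero(t)))
    val conditionals: Set[Term] =
      u.flatMap(t1 => u.flatMap(t2 => u.map(t3 => If(t1, t2, t3): Term)))
    constants ++ unary ++ conditionals
  }

  // S(0) is the empty set, and S(i + 1) = F(S(i)).
  def S(i: Int): Set[Term] =
    Iterator.iterate(Set.empty[Term])(F).drop(i).next()

  def main(args: Array[String]): Unit =
    (0 to 2).foreach(i => println(s"S($i) has ${S(i).size} elements"))
}</code></pre></figure>
<p>Going much further than a few steps is impractical, since the conditional case cubes the size of the set at every step. No finite $\mathcal{S}_i$ is F-closed; only the infinite union is.</p>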
<h4 id="comparison-of-the-representations">Comparison of the representations</h4>
<p>We’ve seen essentially two ways of defining the set (as representation 1 and 2 are equivalent, but with different notation):</p>
<ol>
<li>The smallest set that is closed under certain rules. This is compact and easy to read.</li>
<li>The limit of a sequence of sets. This gives us an <em>induction principle</em> with which we can prove properties of terms by induction.</li>
</ol>
<p>The first one defines the set “from above”, by intersecting F-closed sets.</p>
<p>The second one defines it “from below”, by starting with $\emptyset$ and getting closer and closer to being F-closed.</p>
<p>These are equivalent (we won’t prove it, but Proposition 3.2.6 in TAPL does so), but can serve different uses in practice.</p>
<h3 id="induction-on-terms">Induction on terms</h3>
<p>First, let’s define depth: the <strong>depth</strong> of a term $t$ is the smallest $i$ such that $t\in\mathcal{S}_i$.</p>
<p>The way we defined $\mathcal{S}_i$, it gets larger and larger for increasing $i$; the depth of a term $t$ gives us the step at which $t$ is introduced into the set.</p>
<p>We see that if a term $t$ is in <script type="math/tex">\mathcal{S}_i</script>, then all of its immediate subterms must be in $\mathcal{S}_{i-1}$, meaning that they must have smaller depth.</p>
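<p>Under this definition, depth can also be computed by recursion on the term itself. A sketch, again with an illustrative <code class="highlighter-rouge">Term</code> type of our own: constants first appear in $\mathcal{S}_1$, and any other term appears one step after the deepest of its immediate subterms.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object DepthDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  // depth(t) is the smallest i such that t is in S_i.
  def depth(t: Term): Int = t match {
    case True | False | Zero => 1 // constants are introduced in S_1
    case Succ(t1)            => depth(t1) + 1
    case Pred(t1)            => depth(t1) + 1
    case IsZero(t1)          => depth(t1) + 1
    // All three subterms must already be present, so the deepest one decides.
    case If(t1, t2, t3)      => List(depth(t1), depth(t2), depth(t3)).max + 1
  }

  def main(args: Array[String]): Unit = {
    val t = If(IsZero(Zero), Zero, Succ(Succ(Zero)))
    println(depth(t))
  }
}</code></pre></figure>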
<p>This justifies the principle of <strong>induction on terms</strong>, or <strong>structural induction</strong>. Let P be a predicate on a term:</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>If, for each term s,
given P(r) for all immediate subterms r of s we can show P(s)
then P(t) holds for all t
</code></pre></div></div>
<p>All this says is that if we can prove the induction step from subterms to terms (under the induction hypothesis), then we have proven the induction.</p>
<p>We can also express this structural induction using generating functions, which we <a href="#mathematical-representation-3">introduced previously</a>.</p>
<div class="highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Suppose T is the smallest F-closed set.
If, for each set U,
from the assumption "P(u) holds for every u ∈ U",
we can show that "P(v) holds for every v ∈ F(U)"
then
P(t) holds for all t ∈ T
</code></pre></div></div>
<p>Why can we use this?</p>
<ul>
<li>We assumed that $T$ was the smallest F-closed set, which means that $T\subseteq O$ for any other F-closed set $O$.</li>
<li>Showing the pre-condition (“for each set $U$, from the assumption…”) amounts to showing that the set of all terms satisfying $P$ (call it $O$) is itself an F-closed set.</li>
<li>Since $T\subseteq O$, every element of $T$ satisfies $P$.</li>
</ul>
<h3 id="inductive-function-definitions">Inductive function definitions</h3>
<p>An <a href="https://en.wikipedia.org/wiki/Recursive_definition">inductive definition</a> is used to define the elements in a set recursively, as we have done above. The <a href="https://en.wikipedia.org/wiki/Recursion#The_recursion_theorem">recursion theorem</a> states that a well-formed inductive definition defines a function. To understand what being well-formed means, let’s take a look at some examples.</p>
<p>Let’s define a function on terms a little more formally. Constants are the basic values that can’t be expanded further; in our example, they are <code class="highlighter-rouge">true</code>, <code class="highlighter-rouge">false</code>, <code class="highlighter-rouge">0</code>. As such, the set of constants appearing in a term $t$, written $\text{Consts}(t)$, is defined recursively as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{Consts}(\text{true}) & = \left\{ \text{true} \right\} \\
\text{Consts}(\text{false}) & = \left\{ \text{false} \right\} \\
\text{Consts}(0) & = \left\{ 0 \right\} \\
\text{Consts}(\text{succ } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\text{pred } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\text{iszero } t_1) & = \text{Consts}(t_1) \\
\text{Consts}(\ifelse) & = \text{Consts}(t_1) \cup \text{Consts}(t_2) \cup \text{Consts}(t_3) \\
\end{align} %]]></script>
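<p>The equations for $\text{Consts}$ translate almost line by line into a recursive Scala function. This is only a sketch over a hypothetical <code class="highlighter-rouge">Term</code> type of our own: there is exactly one case per syntactic form, and each recursive call is on an immediate subterm.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object ConstsDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  // One equation per syntactic form, recursing only on immediate subterms.
  def consts(t: Term): Set[Term] = t match {
    case True | False | Zero => Set(t)
    case Succ(t1)            => consts(t1)
    case Pred(t1)            => consts(t1)
    case IsZero(t1)          => consts(t1)
    case If(t1, t2, t3)      => consts(t1) ++ consts(t2) ++ consts(t3)
  }

  def main(args: Array[String]): Unit = {
    println(consts(If(IsZero(Zero), Succ(Zero), Pred(Zero))))
    println(consts(If(True, Zero, False)))
  }
}</code></pre></figure>
<p>Because <code class="highlighter-rouge">Term</code> is sealed, the Scala compiler can warn about a missing case, which covers one of the ways such a definition can go wrong.</p>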
<p>This seems simple, but these semantics aren’t perfect. First off, a mathematical definition simply assigns a convenient name to some previously known thing. But here, we’re defining the thing in terms of itself, recursively. And the semantics above also allow us to define ill-formed inductive definitions:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{BadConsts}(\text{true}) & = \left\{ \text{true} \right\} \\
\text{BadConsts}(\text{false}) & = \left\{ \text{false} \right\} \\
\text{BadConsts}(0) & = \left\{ 0 \right\} \\
\text{BadConsts}(0) & = \left\{ \right\} = \emptyset \\
\text{BadConsts}(\text{succ } t_1) & = \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{pred } t_1) & = \text{BadConsts}(t_1) \\
\text{BadConsts}(\text{iszero } t_1) & = \text{BadConsts}(\text{iszero iszero }t_1) \\
\end{align} %]]></script>
<p>The last rule never bottoms out: each application of it asks for the constants of a strictly larger term (if we implemented it, we’d expect some kind of stack overflow). We’re also missing a rule for if-statements, and we have a second, conflicting rule for <code class="highlighter-rouge">0</code> that produces the empty set.</p>
<p>How do we tell the difference between a well-formed inductive definition, and an ill-formed one as above? What is well-formedness anyway?</p>
<h4 id="what-is-a-function">What is a function?</h4>
<p>A relation over $T, U$ is a subset of $T \times U$, where the Cartesian product is defined as:</p>
<script type="math/tex; mode=display">T\times U = \left\{ (t, u) : t\in T, u\in U \right\}</script>
<p>A function $f$ from $A$ (domain) to $B$ (co-domain) can be viewed as a two-place relation, albeit with two additional properties:</p>
<ul>
<li>It is <strong>total</strong>: $\forall a \in A, \exists b \in B : (a, b) \in f$</li>
<li>It is <strong>deterministic</strong>: $(a, b_1) \in f, (a, b_2) \in f \implies b_1 = b_2$</li>
</ul>
<p>Totality ensures that the whole domain $A$ is covered, while determinism means that the function always produces the same result for a given input.</p>
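<p>For finite relations, both properties can be checked mechanically. The sketch below represents a relation as a set of pairs; the helper names and the string encoding of terms and sets are assumptions made purely for illustration. The second relation mimics the two conflicting rules that $\text{BadConsts}$ has for <code class="highlighter-rouge">0</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object RelationDemo {
  // Total: every element of the domain is related to at least one output.
  def isTotal[A, B](domain: Set[A], r: Set[(A, B)]): Boolean =
    domain.forall(a => r.exists(pair => pair._1 == a))

  // Deterministic: no input is related to two different outputs.
  def isDeterministic[A, B](r: Set[(A, B)]): Boolean =
    r.groupBy(pair => pair._1).values.forall(outputs => outputs.size == 1)

  def isFunction[A, B](domain: Set[A], r: Set[(A, B)]): Boolean =
    if (isTotal(domain, r)) isDeterministic(r) else false

  val domain: Set[String] = Set("0", "true")

  // Each input has exactly one output.
  val good: Set[(String, String)] = Set(("0", "{0}"), ("true", "{true}"))

  // "0" has two outputs, and "true" has none.
  val bad: Set[(String, String)] = Set(("0", "{0}"), ("0", "{}"))

  def main(args: Array[String]): Unit = {
    println(isFunction(domain, good))
    println(isTotal(domain, bad))
    println(isDeterministic(bad))
  }
}</code></pre></figure>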
<h4 id="induction-example-1">Induction example 1</h4>
<p>The definition of $\text{Consts}$ determines a <em>relation</em>: it relates terms ($A$) to the sets of constants that they contain ($B$). The recursion theorem states that it is also a <em>function</em>. The proof is as follows.</p>
<p>$\text{Consts}$ is total and deterministic: for each term $t$ there is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$<sup id="fnref:in-relation-notation"><a href="#fn:in-relation-notation" class="footnote">1</a></sup> . The proof is done by induction on $t$.</p>
<p>To be able to apply the induction principle for terms, we must first show that for an arbitrary term $t$, under the following induction hypothesis:</p>
<blockquote>
<p>For each immediate subterm $s$ of $t$, there is exactly one set of terms $C_s$ such that $(s, C_s) \in \text{Consts}$</p>
</blockquote>
<p>Then the following needs to be proven as an induction step:</p>
<blockquote>
<p>There is <strong>exactly one</strong> set of terms $C$ such that $(t, C) \in \text{Consts}$</p>
</blockquote>
<p>We proceed by cases on $t$:</p>
<ul>
<li>
<p>If $t$ is $0$, $\text{true}$ or $\text{false}$</p>
<p>We can immediately see from the definition of $\text{Consts}$ that there is exactly one set of terms $C$ (namely $C = \left\{ t \right\}$) such that $(t, C) \in \text{Consts}$.</p>
<p>This constitutes our base case.</p>
</li>
<li>
<p>If $t$ is $\text{succ } t_1$, $\text{pred } t_1$ or $\text{iszero } t_1$</p>
<p>The immediate subterm of $t$ is $t_1$, and the induction hypothesis tells us that there is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$. But then it is clear from the definition that there is exactly one set of terms $C = C_1$ such that $(t, C) \in \text{Consts}$.</p>
</li>
<li>
<p>If $t$ is $\ifelse$</p>
<p>The induction hypothesis tells us:</p>
<ul>
<li>There is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$</li>
<li>There is exactly one set of terms $C_2$ such that $(t_2, C_2) \in \text{Consts}$</li>
<li>There is exactly one set of terms $C_3$ such that $(t_3, C_3) \in \text{Consts}$</li>
</ul>
<p>It is clear from the definition of $\text{Consts}$ that there is exactly one set $C = C_1 \cup C_2 \cup C_3$ such that $(t, C) \in \text{Consts}$.</p>
</li>
</ul>
<p>This proves that $\text{Consts}$ is indeed a function.</p>
<p>But what about $\text{BadConsts}$? It is also a relation, but it isn’t a function. For instance, we have $\text{BadConsts}(0) = \left\{ 0 \right\}$ and $\text{BadConsts}(0) = \left\{ \right\}$, which violates determinism. To reformulate this in terms of the above, there are two sets $C$ such that $(0, C) \in \text{BadConsts}$, namely $C = \left\{ 0 \right\}$ and $C = \left\{ \right\}$.</p>
<p>Note that there are many other problems with $\text{BadConsts}$, but this is sufficient to prove that it isn’t a function.</p>
<h4 id="induction-example-2">Induction example 2</h4>
<p>Let’s introduce another inductive definition:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{size}(\text{true}) & = 1 \\
\text{size}(\text{false}) & = 1 \\
\text{size}(0) & = 1 \\
\text{size}(\text{succ}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{pred}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{iszero}\ t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\ifelse) & = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3)\\
\end{align} %]]></script>
<p>We’d like to prove that the number of distinct constants in a term is at most the size of the term; in other words, that $\abs{\text{Consts}(t)} \le \text{size}(t)$.</p>
<p>The proof is by induction on $t$:</p>
<ul>
<li>
<p>$t$ is a constant; $t=\text{true}$, $t=\text{false}$ or $t=0$</p>
<p>The proof is immediate. For constants, the number of constants and the size are both one: $\abs{\text{Consts}(t)} = \abs{\left\{t\right\}} = 1 = \text{size}(t)$</p>
</li>
<li>
<p>$t$ is built with a unary operator; $t = \text{succ}\ t_1$, $t = \text{pred}\ t_1$ or $t = \text{iszero}\ t_1$</p>
<p>By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$.</p>
<p>We can then prove the proposition as follows: $\abs{\text{Consts}(t)} = \abs{\text{Consts}(t_1)} \overset{\text{IH}}{\le} \text{size}(t_1) < \text{size}(t_1) + 1 = \text{size}(t)$</p>
</li>
<li>
<p>$t$ is an if-statement: $t = \ifelse$</p>
<p>By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$, $\abs{\text{Consts}(t_2)} \le \text{size}(t_2)$ and $\abs{\text{Consts}(t_3)} \le \text{size}(t_3)$.</p>
<p>We can then prove the proposition as follows:</p>
</li>
</ul>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\abs{\text{Consts}(t)}
& = \abs{\text{Consts}(t_1)\cup\text{Consts}(t_2)\cup\text{Consts}(t_3)} \\
& \le \abs{\text{Consts}(t_1)}+\abs{\text{Consts}(t_2)}+\abs{\text{Consts}(t_3)} \\
& \overset{\text{IH}}{\le} \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) \\
& = \text{size}(t)
\end{align} %]]></script>
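<p>A proof by induction covers all terms at once, but it can still be reassuring to test the statement on a few of them. The sketch below transcribes $\text{size}$ exactly as defined in this section and $\text{Consts}$ from earlier, over a hypothetical <code class="highlighter-rouge">Term</code> type, and checks the inequality on a handful of sample terms of our own choosing.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object SizeDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  def consts(t: Term): Set[Term] = t match {
    case True | False | Zero => Set(t)
    case Succ(t1)            => consts(t1)
    case Pred(t1)            => consts(t1)
    case IsZero(t1)          => consts(t1)
    case If(t1, t2, t3)      => consts(t1) ++ consts(t2) ++ consts(t3)
  }

  // Transcribed from the equations for size given in this section.
  def size(t: Term): Int = t match {
    case True | False | Zero => 1
    case Succ(t1)            => size(t1) + 1
    case Pred(t1)            => size(t1) + 1
    case IsZero(t1)          => size(t1) + 1
    case If(t1, t2, t3)      => size(t1) + size(t2) + size(t3)
  }

  // The property we proved: a term has at most size(t) distinct constants.
  def boundHolds(t: Term): Boolean = size(t) >= consts(t).size

  val samples: List[Term] = List(
    Zero,
    Succ(Succ(Zero)),
    If(True, Zero, False),
    If(IsZero(Zero), Succ(Zero), Pred(Zero))
  )

  def main(args: Array[String]): Unit =
    samples.foreach(t => println(s"$t: ${consts(t).size} constants, size ${size(t)}"))
}</code></pre></figure>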
<h3 id="operational-semantics-and-reasoning">Operational semantics and reasoning</h3>
<h4 id="evaluation">Evaluation</h4>
<p>Suppose we have the following syntax</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">t ::=                     // terms
    true                  // constant true
    false                 // constant false
    if t then t else t    // conditional</code></pre></figure>
<p>The evaluation relation $t \longrightarrow t'$ is the smallest relation closed under the following rules.</p>
<p>The following are <em>computation rules</em>, defining the “real” computation steps:</p>
<script type="math/tex; mode=display">\begin{align}
\text{if true then } t_2 \else t_3 \longrightarrow t_2
\tag{E-IfTrue}
\label{eq:e-iftrue} \\
\text{if false then } t_2 \else t_3 \longrightarrow t_3
\tag{E-IfFalse}
\label{eq:e-iffalse} \\
\end{align}</script>
<p>The following is a <em>congruence rule</em>, defining where the computation rule is applied next:</p>
<script type="math/tex; mode=display">\frac{t_1 \longrightarrow t_1'}
{\ifelse \longrightarrow \if t_1' \then t_2 \else t_3}
\tag{E-If}
\label{eq:e-if}</script>
<p>We evaluate the condition before either branch to avoid unnecessary work: until the condition is known, we cannot tell which branch will be needed.</p>
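<p>The three rules can be read as a one-step evaluator. Here is a minimal sketch; the <code class="highlighter-rouge">step</code> function and the <code class="highlighter-rouge">Term</code> type are our own names. It returns <code class="highlighter-rouge">None</code> when no rule applies, and each case is labelled with the rule it implements.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object BoolStepDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  // One step of evaluation, or None if no rule applies to t.
  def step(t: Term): Option[Term] = t match {
    case If(True, t2, _)  => Some(t2)                               // E-IfTrue
    case If(False, _, t3) => Some(t3)                               // E-IfFalse
    case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))   // E-If
    case _                => None                                   // true, false
  }

  // if (if true then true else false) then false else true
  val example: Term = If(If(True, True, False), False, True)

  def main(args: Array[String]): Unit = {
    println(step(example))
    println(step(True))
  }
}</code></pre></figure>
<p>Note how the congruence rule never touches the branches: only the condition is rewritten until one of the computation rules applies.</p>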
<h4 id="derivations">Derivations</h4>
<p>We can describe the evaluation logically from the above rules using derivation trees. Suppose we want to evaluate the following (with parentheses added for clarity): <code class="highlighter-rouge">if (if true then true else false) then false else true</code>.</p>
<p>In an attempt to make all this fit onto the screen, <code class="highlighter-rouge">true</code> and <code class="highlighter-rouge">false</code> have been abbreviated <code class="highlighter-rouge">T</code> and <code class="highlighter-rouge">F</code> in the derivation below, and the <code class="highlighter-rouge">then</code> keyword has been replaced with a parenthesis notation for the condition.</p>
<script type="math/tex; mode=display">\frac{
\frac{
\if (T)\ T \else F
\longrightarrow
T
\quad (\ref{eq:e-iftrue})
}{
\if (\if (T)\ T \else F) \ F \else T
\longrightarrow
\if (T) \ F \else T
\quad (\ref{eq:e-if})
}
\qquad
\small{
\if (T) \ F \else T
\longrightarrow
F
\quad (\ref{eq:e-iftrue})
}
}{
\if (\if (T) \ T \else F) \ F \else T
\longrightarrow
F
}</script>
<p>The final statement is a <strong>conclusion</strong>. We say that the derivation is a <strong>witness</strong> for its conclusion (or a <strong>proof</strong> for its conclusion). The derivation records all reasoning steps that lead us to the conclusion.</p>
<h4 id="inversion-lemma">Inversion lemma</h4>
<p>We can introduce the <strong>inversion lemma</strong>, which tells us how we got to a term.</p>
<p>Suppose we are given a derivation $\mathcal{D}$ witnessing the pair $(t, t')$ in the evaluation relation. Then:</p>
<ol>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if \text{true} \then t_2 \else t_3$ and $t'=t_2$ for some $t_2$ and $t_3$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if \text{false} \then t_2 \else t_3$ and $t'=t_3$ for some $t_2$ and $t_3$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \if t_1 \then t_2 \else t_3$ and $t' = \if t_1' \then t_2 \else t_3$, for some $t_1, t_1', t_2, t_3$. Moreover, the immediate subderivation of $\mathcal{D}$ witnesses $(t_1, t_1') \in \longrightarrow$.</li>
</ol>
<p>This is super boring, but we do need to acknowledge the inversion lemma before we can do induction proofs on derivations. Thanks to the inversion lemma, given an arbitrary derivation $\mathcal{D}$ with conclusion $t \longrightarrow t'$, we can proceed with a case-by-case analysis on the final rule used in the derivation tree.</p>
<p>Let’s recall our <a href="#induction-example-2">definition of the size function</a>. In particular, we’ll need the rule for if-statements:</p>
<script type="math/tex; mode=display">\text{size}(\ifelse) = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3)</script>
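<p>We want to prove that if $t \longrightarrow t'$, then $\text{size}(t) > \text{size}(t')$. We proceed by induction on a derivation $\mathcal{D}$ of $t \longrightarrow t'$, with a case analysis on the final rule applied:</p>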
<ol>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if \text{true} \then t_2 \else t_3$ and $t'=t_2$, and the result is immediate from the definition of $\text{size}$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if \text{false} \then t_2 \else t_3$ and $t'=t_3$, and the result is immediate from the definition of $\text{size}$</li>
<li>If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \ifelse$ and $t' = \if t_1' \then t_2 \else t_3$. In this case, $t_1 \longrightarrow t_1'$ is witnessed by a derivation $\mathcal{D}_1$. By the induction hypothesis, $\text{size}(t_1) > \text{size}(t_1')$, and the result is then immediate from the definition of $\text{size}$</li>
</ol>
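<p>We can watch the size shrink on a concrete evaluation. The sketch below is an illustration rather than a proof: it restricts the size function to the boolean language, and the names <code class="highlighter-rouge">step</code>, <code class="highlighter-rouge">trace</code> and <code class="highlighter-rouge">strictlyDecreasing</code> are our own. It records every term visited while evaluating the earlier example and compares consecutive sizes.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object SizeDecreaseDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  def size(t: Term): Int = t match {
    case True | False   => 1
    case If(t1, t2, t3) => size(t1) + size(t2) + size(t3)
  }

  def step(t: Term): Option[Term] = t match {
    case If(True, t2, _)  => Some(t2)                               // E-IfTrue
    case If(False, _, t3) => Some(t3)                               // E-IfFalse
    case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))   // E-If
    case _                => None
  }

  // All terms visited from t until no rule applies any more.
  def trace(t: Term): List[Term] = step(t) match {
    case Some(next) => t :: trace(next)
    case None       => List(t)
  }

  def strictlyDecreasing(xs: List[Int]): Boolean =
    xs.zip(xs.drop(1)).forall { case (a, b) => a > b }

  val example: Term = If(If(True, True, False), False, True)
  val sizes: List[Int] = trace(example).map(size)

  def main(args: Array[String]): Unit = {
    println(sizes)
    println(strictlyDecreasing(sizes))
  }
}</code></pre></figure>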
<h3 id="abstract-machines">Abstract machines</h3>
<p>An abstract machine consists of:</p>
<ul>
<li>A set of <strong>states</strong></li>
<li>A <strong>transition</strong> relation of states, written $\longrightarrow$</li>
</ul>
<p>$t \longrightarrow t'$ means that $t$ evaluates to $t'$ in one step. Note that $\longrightarrow$ is a relation, and that $t \longrightarrow t'$ is shorthand for $(t, t') \in \longrightarrow$. Often, this relation is a partial function: it need not be defined on every state, but there is at most one possible next state. In general, though, there may be several possible next states; determinism isn’t a requirement here.</p>
<h3 id="normal-forms">Normal forms</h3>
<p>A normal form is a term that cannot be evaluated any further. More formally, a term $t$ is a normal form if there is no $t'$ such that $t \longrightarrow t'$. A normal form is a state where the abstract machine is halted; we can regard it as the result of a computation.</p>
<h4 id="values-that-are-normal-form">Values that are normal form</h4>
<p>Previously, we intended for our values (true and false) to be exactly that, the result of a computation. Did we get that right?</p>
<p>Let’s prove that a term $t$ is a value $\iff$ it is in normal form.</p>
<ul>
<li>The $\implies$ direction is immediate from the definition of the evaluation relation $\longrightarrow$.</li>
<li>
<p>The $\impliedby$ direction is more conveniently proven as its contrapositive: if $t$ is not a value, then it is not a normal form, which we can prove by induction on the term $t$.</p>
<p>Since $t$ is not a value, it must be of the form $\ifelse$. If $t_1$ is directly <code class="highlighter-rouge">true</code> or <code class="highlighter-rouge">false</code>, then $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ apply, and we are done.</p>
<p>Otherwise, if $t = \ifelse$ where $t_1$ isn’t a value, by the induction hypothesis, there is a $t_1'$ such that $t_1 \longrightarrow t_1'$. Then rule $\ref{eq:e-if}$ yields $t \longrightarrow \if t_1' \then t_2 \else t_3$, which proves that $t$ is not in normal form.</p>
</li>
</ul>
<h4 id="values-that-are-not-normal-form">Normal forms that are not values</h4>
<p>Let’s introduce new syntactic forms, with new evaluation rules.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">t ::=           // terms
    0           // constant 0
    succ t      // successor
    pred t      // predecessor
    iszero t    // zero test
v ::= nv        // values
nv ::=          // numeric values
    0           // zero value
    succ nv     // successor value</code></pre></figure>
<p>The evaluation rules are given as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 \longrightarrow t_1'}{\text{succ } t_1 \longrightarrow \text{succ } t_1'}
\tag{E-Succ} \label{eq:e-succ}
\\ \\
& \text{pred } 0 \longrightarrow 0
\tag{E-PredZero} \label{eq:e-predzero}
\\ \\
& \text{pred succ } nv_1 \longrightarrow nv_1
\tag{E-PredSucc} \label{eq:e-predsucc}
\\ \\
& \frac{t_1 \longrightarrow t_1'}{\text{pred } t_1 \longrightarrow \text{pred } t_1'}
\tag{E-Pred} \label{eq:e-pred}
\\ \\
& \text{iszero } 0 \longrightarrow \text{true}
\tag{E-IszeroZero} \label{eq:e-iszerozero}
\\ \\
& \text{iszero succ } nv_1 \longrightarrow \text{false}
\tag{E-IszeroSucc} \label{eq:e-iszerosucc}
\\ \\
& \frac{t_1 \longrightarrow t_1'}{\text{iszero } t_1 \longrightarrow \text{iszero } t_1'}
\tag{E-Iszero} \label{eq:e-iszero} \\
\end{align} %]]></script>
<p>All values are still normal forms. But are all normal forms values? Not in this case. For instance, <code class="highlighter-rouge">succ true</code>, <code class="highlighter-rouge">iszero true</code>, etc, are normal forms. These are <strong>stuck terms</strong>: they are in normal form, but are not values. In general, these correspond to some kind of type error, and one of the main purposes of a type system is to rule these kinds of situations out.</p>
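<p>Stuck terms are easy to detect with a one-step evaluator. The sketch below assumes the combined language of booleans and numbers, with names of our own (<code class="highlighter-rouge">step</code>, <code class="highlighter-rouge">isValue</code>, <code class="highlighter-rouge">isStuck</code>): a term is stuck when no rule applies to it and it is not a value.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object StuckDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  final case class Succ(t: Term) extends Term
  final case class Pred(t: Term) extends Term
  final case class IsZero(t: Term) extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  // nv ::= 0 | succ nv
  def isNumericValue(t: Term): Boolean = t match {
    case Zero     => true
    case Succ(t1) => isNumericValue(t1)
    case _        => false
  }

  def isValue(t: Term): Boolean = t match {
    case True | False => true
    case _            => isNumericValue(t)
  }

  def step(t: Term): Option[Term] = t match {
    case If(True, t2, _)                         => Some(t2)                   // E-IfTrue
    case If(False, _, t3)                        => Some(t3)                   // E-IfFalse
    case If(t1, t2, t3)                          => step(t1).map(If(_, t2, t3)) // E-If
    case Succ(t1)                                => step(t1).map(Succ(_))      // E-Succ
    case Pred(Zero)                              => Some(Zero)                 // E-PredZero
    case Pred(Succ(nv)) if isNumericValue(nv)    => Some(nv)                   // E-PredSucc
    case Pred(t1)                                => step(t1).map(Pred(_))      // E-Pred
    case IsZero(Zero)                            => Some(True)                 // E-IszeroZero
    case IsZero(Succ(nv)) if isNumericValue(nv)  => Some(False)                // E-IszeroSucc
    case IsZero(t1)                              => step(t1).map(IsZero(_))    // E-Iszero
    case _                                       => None
  }

  def isNormalForm(t: Term): Boolean = step(t).isEmpty

  // Stuck: a normal form that is not a value.
  def isStuck(t: Term): Boolean = if (isNormalForm(t)) !isValue(t) else false

  def main(args: Array[String]): Unit = {
    println(isStuck(Succ(True)))
    println(isStuck(IsZero(True)))
    println(isStuck(Succ(Zero)))
  }
}</code></pre></figure>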
<h3 id="multi-step-evaluation">Multi-step evaluation</h3>
<p>Let’s introduce the <em>multi-step evaluation</em> relation, $\longrightarrow^*$. It is the reflexive, transitive closure of single-step evaluation, i.e. the smallest relation closed under these rules:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t\longrightarrow t'}{t \longrightarrow^* t'} \\ \\
t \longrightarrow^* t \\ \\
\frac{t \longrightarrow^* t' \qquad t' \longrightarrow^* t''}{t \longrightarrow^* t''}
\end{align}</script>
<p>In other words, $t \longrightarrow^* t'$ holds when $t'$ can be reached from $t$ by any number (possibly zero) of consecutive single-step evaluations.</p>
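<p>In code, following $\longrightarrow^*$ as far as possible is just a loop around a one-step evaluator. A sketch for the boolean language, with hypothetical names <code class="highlighter-rouge">step</code> and <code class="highlighter-rouge">eval</code>; the loop is only guaranteed to stop if evaluation terminates, which we assume here for this small language.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">import scala.annotation.tailrec

object MultiStepDemo {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  final case class If(t1: Term, t2: Term, t3: Term) extends Term

  def step(t: Term): Option[Term] = t match {
    case If(True, t2, _)  => Some(t2)                               // E-IfTrue
    case If(False, _, t3) => Some(t3)                               // E-IfFalse
    case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))   // E-If
    case _                => None
  }

  // Keep taking single steps until we reach a normal form.
  @tailrec
  def eval(t: Term): Term = step(t) match {
    case Some(next) => eval(next) // one more single step
    case None       => t          // zero steps left: the reflexive case
  }

  val example: Term = If(If(True, True, False), False, True)

  def main(args: Array[String]): Unit = {
    println(eval(example))
    println(eval(True))
  }
}</code></pre></figure>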
<h3 id="termination-of-evaluation">Termination of evaluation</h3>
<p>We’ll prove that evaluation terminates, i.e. that for every term $t$ there is some normal form $t'$ such that $t\longrightarrow^* t'$.</p>
<p>First, let’s <a href="#inversion-lemma">recall our proof</a> that $t\longrightarrow t' \implies \text{size}(t) > \text{size}(t')$. Now, for our proof by contradiction, assume that we have an infinite-length sequence $t_0, t_1, t_2, \dots$ such that:</p>
<script type="math/tex; mode=display">t_0 \longrightarrow t_1 \longrightarrow t_2 \longrightarrow \dots
\quad \implies \quad
\text{size}(t_0) > \text{size}(t_1) > \text{size}(t_2) > \dots</script>
<p>But this sequence cannot exist: since $\text{size}(t_0)$ is a finite, natural number, we cannot construct this infinite descending chain from it. This is a contradiction.</p>
<p>Most termination proofs have the same basic form. We want to prove that the relation $R\subseteq X \times X$ is terminating — that is, there are no infinite sequences $x_0, x_1, x_2, \dots$ such that $(x_i, x_{i+1}) \in R$ for each $i$. We proceed as follows:</p>
<ol>
<li>Choose a well-founded set $W$ with partial order $<$, that is, one with no infinite descending chains $w_0 > w_1 > w_2 > \dots$ in $W$. Also choose a function $f: X \rightarrow W$.</li>
<li>Show $f(x) > f(y) \quad \forall (x, y) \in R$</li>
<li>Conclude that there are no infinite sequences $(x_0, x_1, x_2, \dots)$ such that $(x_i, x_{i+1}) \in R$ for each $i$. If there were, we could construct an infinite descending chain in $W$.</li>
</ol>
<p>As a side-note, a (strict) <strong>partial order</strong> is a relation $<$ with the following properties:</p>
<ol>
<li><strong>Anti-symmetry</strong>: $\neg(x < y \land y < x)$</li>
<li><strong>Transitivity</strong>: $x<y \land y<z \implies x < z$</li>
</ol>
<p>We can add a third property to achieve <strong>total order</strong>, namely $x \ne y \implies x <y \lor y<x$.</p>
<h2 id="lambda-calculus">Lambda calculus</h2>
<p>Lambda calculus is Turing complete, and is higher-order (functions are data). In lambda calculus, all computation happens by means of function abstraction and application.</p>
<p>Lambda calculus and Turing machines are equivalent in computational power: any function computable in one is computable in the other.</p>
<p>Suppose we wanted to write a function <code class="highlighter-rouge">plus3</code> in our previous language:</p>
<figure class="highlight"><pre><code class="language-linenos" data-lang="linenos">plus3 x = succ succ succ x</code></pre></figure>
<p>The way we write this in lambda calculus is:</p>
<script type="math/tex; mode=display">\text{plus3 } = \lambda x. \text{ succ}(\text{succ}(\text{succ}(x)))</script>
<p>$\lambda x. t$ is written <code class="highlighter-rouge">x => t</code> in Scala, or <code class="highlighter-rouge">fun x -> t</code> in OCaml. Application of our function, say <code class="highlighter-rouge">plus3(succ 0)</code>, can be written as:</p>
<script type="math/tex; mode=display">(\lambda x. \text{succ succ succ } x)(\text{succ } 0)</script>
<p>Abstraction over functions is possible using higher-order functions, which we call $\lambda$-abstractions. An example of such an abstraction is the function $g$ below, which takes an argument $f$ and uses it in the function position.</p>
<script type="math/tex; mode=display">g = \lambda f. f(f(\text{succ } 0))</script>
<p>If we apply $g$ to an argument like $\text{plus3}$, we can just use the substitution rule to see how that defines a new function.</p>
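<p>As a worked example of the substitution rule (a sketch that keeps $\text{succ}$ and $0$ as in the toy language), applying $g$ to $\text{plus3}$ unfolds as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
g\ \text{plus3}
& = (\lambda f.\ f(f(\text{succ } 0)))\ \text{plus3} \\
& \longrightarrow \text{plus3}(\text{plus3}(\text{succ } 0)) \\
& \longrightarrow \text{plus3}(\text{succ}(\text{succ}(\text{succ}(\text{succ } 0)))) \\
& \longrightarrow \text{succ}(\text{succ}(\text{succ}(\text{succ}(\text{succ}(\text{succ}(\text{succ } 0))))))
\end{align} %]]></script>
<p>The first step substitutes $\text{plus3}$ for $f$; the remaining steps unfold $\text{plus3}$ twice, from the inside out, which gives the number 7.</p>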
<p>Another example: the double function below takes two arguments, as a curried function would. First, it takes the function to apply twice, then the argument on which to apply it, and then returns $f(f(y))$.</p>
<script type="math/tex; mode=display">\text{double} = \lambda f. \lambda y. f(f(y))</script>
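<p>The same definitions can be tried out in Scala. This is only a sketch: it assumes that numbers are modelled as <code class="highlighter-rouge">Int</code> and that <code class="highlighter-rouge">succ</code> is <code class="highlighter-rouge">n + 1</code>, and the names <code class="highlighter-rouge">seven</code>, <code class="highlighter-rouge">six</code> and <code class="highlighter-rouge">plus6</code> are introduced only for illustration.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Assumption: natural numbers are modelled as Int, and succ as n + 1.
val succ: Int => Int = n => n + 1

// plus3 = λx. succ(succ(succ(x)))
val plus3: Int => Int = x => succ(succ(succ(x)))

// g = λf. f(f(succ 0)): the argument f is used in function position.
val g: (Int => Int) => Int = f => f(f(succ(0)))

// double = λf. λy. f(f(y)): curried, it takes f first and then y.
val double: (Int => Int) => Int => Int = f => y => f(f(y))

val seven = g(plus3)        // plus3(plus3(succ(0)))
val six = double(plus3)(0)  // plus3(plus3(0))
val plus6 = double(plus3)   // partial application yields a new function</code></pre></figure>
<p>Because <code class="highlighter-rouge">double</code> is curried, supplying only its first argument gives back a function, which is exactly how multi-argument functions are treated in lambda calculus.</p>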
<h3 id="pure-lambda-calculus">Pure lambda calculus</h3>
<p>Once we have $\lambda$-abstractions, we can actually throw out all other language primitives like booleans and other values; all of these can be expressed as functions, as we’ll see below. In pure lambda-calculus, <em>everything</em> is a function.</p>
<p>Variables will always denote a function, functions always take other functions as parameters, and the result of an evaluation is always a function.</p>
<p>The syntax of lambda-calculus is very simple:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">t ::=       // terms, also called λ-terms
    x       // variable
    λx. t   // abstraction, also called λ-abstractions
    t t     // application</code></pre></figure>
<p>A few rules and syntactic conventions:</p>
<ul>
<li>Application associates to the left, so $t\ u\ v$ means $(t\ u)\ v$, not $t\ (u\ v)$.</li>
<li>Bodies of lambda abstractions extend as far to the right as possible, so $\lambda x. \lambda y.\ x\ y$ means $\lambda x.\ (\lambda y. x\ y)$, not $\lambda x.\ (\lambda y.\ x)\ y$</li>
</ul>
<h4 id="scope">Scope</h4>
<p>The lambda expression $\lambda x.\ t$ <strong>binds</strong> the variable $x$, with a <strong>scope</strong> limited to $t$. Occurrences of $x$ inside of $t$ are said to be <em>bound</em>, while occurrences outside are said to be <em>free</em>.</p>
<p>Let $\text{fv}(t)$ be the set of free variables in a term $t$. It’s defined as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{fv}(x) & = \left\{ x\right\} \\
\text{fv}(\lambda x.\ t_1) & = \text{fv}(t_1) \setminus \left\{ x \right\} \\
\text{fv}(t_1 \ t_2) & = \text{fv}(t_1)\cup\text{fv}(t_2) \\
\end{align} %]]></script>
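<p>The three equations translate almost directly into code. The following is a minimal sketch; the <code class="highlighter-rouge">Term</code> data type and its constructor names are assumptions made for this example, not something defined in the course.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Hypothetical AST for the grammar t ::= x | λx. t | t t
sealed trait Term
case class Var(name: String) extends Term
case class Abs(param: String, body: Term) extends Term
case class App(fun: Term, arg: Term) extends Term

// Free variables, with one case per equation of fv.
def fv(t: Term): Set[String] = t match {
  case Var(x)      => Set(x)
  case Abs(x, t1)  => fv(t1) - x
  case App(t1, t2) => fv(t1) ++ fv(t2)
}

// λx. (λy. x) y: x is bound, and the trailing y is free.
val example = Abs("x", App(Abs("y", Var("x")), Var("y")))
val freeInExample = fv(example)</code></pre></figure>
<p>Here <code class="highlighter-rouge">freeInExample</code> only contains <code class="highlighter-rouge">y</code>, since the outer abstraction removes <code class="highlighter-rouge">x</code> from the set.</p>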
<h4 id="operational-semantics">Operational semantics</h4>
<p>As we saw with our previous language, the rules could be distinguished into <em>computation</em> and <em>congruence</em> rules. For lambda calculus, the only computation rule is:</p>
<script type="math/tex; mode=display">(\lambda x. t_{12})\ v_2 \longrightarrow \left[ x \mapsto v_2 \right] t_{12}
\tag{E-AppAbs}\label{eq:e-appabs}</script>
<p>The notation $\left[ x \mapsto v_2 \right] t_{12}$ means “the term that results from substituting free occurrences of $x$ in $t_{12}$ with $v_2$”.</p>
<p>The congruence rules are:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 \longrightarrow t_1'}{t_1\ t_2 \longrightarrow t_1'\ t_2} \tag{E-App1}\label{eq:e-app1} \\ \\
& \frac{t_2 \longrightarrow t_2'}{t_1\ t_2 \longrightarrow t_1\ t_2'} \tag{E-App2}\label{eq:e-app2} \\
\end{align} %]]></script>
<p>A lambda-expression applied to a value, $(\lambda x.\ t)\ v$, is called a <strong>reducible expression</strong>, or <strong>redex</strong>.</p>
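<p>To make the rules concrete, here is a sketch of a one-step evaluator. It rests on several assumptions: the <code class="highlighter-rouge">Term</code> data type is hypothetical, substitution is the naive one (it is only safe when the substituted value is closed, so that no variable can be captured), and the second congruence rule is only used once the left-hand side is a value, which makes evaluation deterministic.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Hypothetical AST for the grammar t ::= x | λx. t | t t
sealed trait Term
case class Var(name: String) extends Term
case class Abs(param: String, body: Term) extends Term
case class App(fun: Term, arg: Term) extends Term

// In the pure calculus, the only values are abstractions.
def isValue(t: Term): Boolean = t match {
  case Abs(_, _) => true
  case _         => false
}

// [x ↦ s] t, assuming s is closed (no capture-avoiding renaming is done).
def subst(x: String, s: Term, t: Term): Term = t match {
  case Var(y)       => if (y == x) s else t
  case Abs(y, body) => if (y == x) t else Abs(y, subst(x, s, body))
  case App(t1, t2)  => App(subst(x, s, t1), subst(x, s, t2))
}

// One evaluation step, or None if no rule applies.
def step(t: Term): Option[Term] = t match {
  // E-AppAbs: a redex, i.e. an abstraction applied to a value
  case App(Abs(x, body), v) if isValue(v) => Some(subst(x, v, body))
  // E-App1: reduce the function position first
  case App(t1, t2) if !isValue(t1) => step(t1).map(App(_, t2))
  // E-App2: then reduce the argument
  case App(v1, t2) => step(t2).map(App(v1, _))
  case _ => None
}

// Repeat step until no rule applies (this may loop for diverging terms).
def eval(t: Term): Term = step(t) match {
  case Some(next) => eval(next)
  case None       => t
}</code></pre></figure>
<p>For instance, with <code class="highlighter-rouge">val id = Abs("x", Var("x"))</code>, the term <code class="highlighter-rouge">App(id, id)</code> is a redex and steps to <code class="highlighter-rouge">id</code>, while <code class="highlighter-rouge">id</code> itself is a value and does not step.</p>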
<h4 id="evaluation-strategies">Evaluation strategies</h4>
<p>There are alternative evaluation strategies. In the above, we have chosen call by value (which is the standard in most mainstream languages), but we could also choose:</p>
<ul>
<li><strong>Full beta-reduction</strong>: any redex may be reduced at any time. This imposes no restrictions, but in practice we go with a set of restrictions like the ones below (because implementing a fixed order is easier than implementing non-deterministic behavior).</li>
<li><strong>Normal order</strong>: the leftmost, outermost redex is always reduced first. This strategy allows reduction inside unapplied lambda abstractions.</li>
<li><strong>Call-by-name</strong>: allows no reductions inside lambda abstractions. Arguments are not reduced before being substituted in the body of lambda terms when applied. Haskell uses an optimized version of this, call-by-need (aka lazy evaluation).</li>
</ul>
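<p>To see how the strategies differ, here is a sketch on the term $\text{id}\ (\text{id}\ (\lambda z.\ \text{id}\ z))$, where $\text{id} = \lambda x.\ x$ is a name introduced only for this example:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{call-by-value:} \quad & \text{id}\ (\text{id}\ (\lambda z.\ \text{id}\ z)) \longrightarrow \text{id}\ (\lambda z.\ \text{id}\ z) \longrightarrow \lambda z.\ \text{id}\ z \\
\text{call-by-name:} \quad & \text{id}\ (\text{id}\ (\lambda z.\ \text{id}\ z)) \longrightarrow \text{id}\ (\lambda z.\ \text{id}\ z) \longrightarrow \lambda z.\ \text{id}\ z \\
\text{normal order:} \quad & \text{id}\ (\text{id}\ (\lambda z.\ \text{id}\ z)) \longrightarrow \text{id}\ (\lambda z.\ \text{id}\ z) \longrightarrow \lambda z.\ \text{id}\ z \longrightarrow \lambda z.\ z \\
\end{align} %]]></script>
<p>The first two lines look identical, but the first step contracts a different redex: call-by-value must reduce the argument $\text{id}\ (\lambda z.\ \text{id}\ z)$ to a value before applying the outer $\text{id}$, whereas call-by-name applies the outer $\text{id}$ straight away. Both stop at $\lambda z.\ \text{id}\ z$ because they never reduce under a lambda; normal order goes one step further.</p>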
<h3 id="classical-lambda-calculus">Classical lambda calculus</h3>
<p>Classical lambda calculus allows for full beta reduction.</p>
<h4 id="confluence-in-full-beta-reduction">Confluence in full beta reduction</h4>
<p>The congruence rules allow us to reduce a term in different ways; we can choose between $\ref{eq:e-app1}$ and $\ref{eq:e-app2}$ every time we reduce an application, and this offers many possible reduction paths.</p>
<p>While the path is non-deterministic, is the result also non-deterministic? This question took a very long time to answer, but after 25 years or so, it was proven that the result is always the same. This is known as the <strong>Church-Rosser confluence theorem</strong>:</p>
<p>Let $t, t_1, t_2$ be terms such that $t \longrightarrow^* t_1$ and $t \longrightarrow^* t_2$. Then there exists a term $t_3$ such that $t_1 \longrightarrow^* t_3$ and $t_2 \longrightarrow^* t_3$.</p>
<h4 id="alpha-conversion">Alpha conversion</h4>
<p>Substitution is actually trickier than it looks! For instance, in the expression $\lambda x.\ (\lambda y.\ x)\ y$, the first occurrence of $y$ is bound (it refers to a parameter), while the second is free (it does not refer to a parameter). This is comparable to scope in most programming languages, where we should understand that these are two different variables in different scopes, $y_1$ and $y_2$.</p>
<p>The above example had a variable that is both bound and free, which is something that we’ll try to avoid. This is called a hygiene condition.</p>
<p>We can transform an unhygienic expression into a hygienic one by renaming bound variables before performing the substitution. This is known as <strong>alpha conversion</strong>. Alpha conversion is given by the following conversion rule:</p>
<script type="math/tex; mode=display">\frac{y \notin \text{fv}(t)}{(\lambda x.\ t) =_\alpha (\lambda y.\ \left[ x\mapsto y\right]\ t)}
\tag{$\alpha$}
\label{eq:alpha-conv}</script>
<p>And these equivalence rules (in mathematics, an equivalence relation is reflexive, symmetric and transitive; the rules below state symmetry and transitivity):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& \frac{t_1 =_\alpha t_2}{t_2 =_\alpha t_1}
\tag{$\alpha \text{-Symm}$}
\label{eq:alpha-sym}
\\ \\
& \frac{t_1 =_\alpha t_2 \quad t_2 =_\alpha t_3}{t_1 =_\alpha t_3}
\tag{$\alpha \text{-Trans}$}
\label{eq:alpha-trans}
\\
\end{align} %]]></script>
<p>The congruence rules are as usual.</p>
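<p>As a small worked example (the fresh name $z$ is chosen arbitrarily), consider $(\lambda x.\ \lambda y.\ x)\ y$. Substituting naively would give $\lambda y.\ y$, in which the free $y$ has been captured by the binder. Renaming the bound $y$ to $z$ first, which rule $\ref{eq:alpha-conv}$ allows since $z \notin \text{fv}(x)$, avoids the capture:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
(\lambda x.\ \lambda y.\ x)\ y
& =_\alpha (\lambda x.\ \lambda z.\ x)\ y \\
& \longrightarrow \left[ x \mapsto y \right] (\lambda z.\ x) \\
& = \lambda z.\ y
\end{align} %]]></script>
<p>The result is a constant function returning the free variable $y$, as intended, rather than the identity function.</p>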
<h3 id="programming-in-lambda-calculus">Programming in lambda-calculus</h3>
<h4 id="multiple-arguments">Multiple arguments</h4>
<p>The way to handle multiple arguments is by currying: $\lambda x.\ \lambda y.\ t$</p>
<h4 id="booleans">Booleans</h4>
<p>The fundamental, universal operator on booleans is if-then-else, which is what we’ll replicate to model booleans. We’ll denote our booleans as $\text{tru}$ and $\text{fls}$ to be able to distinguish these pure lambda-calculus abstractions from the true and false values of our previous toy language.</p>
<p>We want <code class="highlighter-rouge">true</code> to be equivalent to <code class="highlighter-rouge">if (true)</code>, and <code class="highlighter-rouge">false</code> to <code class="highlighter-rouge">if (false)</code>. The terms $\text{tru}$ and $\text{fls}$ <em>represent</em> boolean values, in that we can use them to test the truth of a boolean value:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{tru } & = \lambda t.\ \lambda f.\ t \\
\text{fls } & = \lambda t.\ \lambda f.\ f \\
\end{align} %]]></script>
<p>We can consider these as booleans. Equivalently <code class="highlighter-rouge">tru</code> can be considered as a function performing <code class="highlighter-rouge">(t1, t2) => if (true) t1 else t2</code>. To understand this, let’s try to apply $\text{tru}$ to two arguments:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
& && \text{tru } v\ w \\
& = && (\lambda t.\ (\lambda f.\ t))\ v\ w \\
& \longrightarrow && (\lambda f.\ v)\ w \\
& \longrightarrow && v \\
\end{align} %]]></script>
<p>This works equivalently for <code class="highlighter-rouge">fls</code>.</p>
<p>We can also do inversion, conjunction and disjunction with lambda calculus, which can be read as particular if-else statements:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{not } & = \lambda b.\ b\ \text{fls}\ \text{tru} \\
\text{and } & = \lambda b.\ \lambda c.\ b\ c\ \text{fls} \\
\text{or } & = \lambda b.\ \lambda c.\ b\ \text{tru}\ c \\
\end{align} %]]></script>
<ul>
<li><code class="highlighter-rouge">not</code> is a function that is equivalent to <code class="highlighter-rouge">not(b) = if (b) false else true</code>.</li>
<li><code class="highlighter-rouge">and</code> is equivalent to <code class="highlighter-rouge">and(b, c) = if (b) c else false</code></li>
<li><code class="highlighter-rouge">or</code> is equivalent to <code class="highlighter-rouge">or(b, c) = if (b) true else c</code></li>
</ul>
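<p>As a check, here is how two of these definitions reduce on concrete booleans (a worked sketch using only the definitions of $\text{tru}$, $\text{fls}$, $\text{and}$ and $\text{or}$):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{and tru fls}
& = (\lambda b.\ \lambda c.\ b\ c\ \text{fls})\ \text{tru fls} \\
& \longrightarrow (\lambda c.\ \text{tru}\ c\ \text{fls})\ \text{fls} \\
& \longrightarrow \text{tru fls fls} \\
& \longrightarrow (\lambda f.\ \text{fls})\ \text{fls} \\
& \longrightarrow \text{fls} \\ \\
\text{or fls tru}
& = (\lambda b.\ \lambda c.\ b\ \text{tru}\ c)\ \text{fls tru} \\
& \longrightarrow (\lambda c.\ \text{fls tru}\ c)\ \text{tru} \\
& \longrightarrow \text{fls tru tru} \\
& \longrightarrow (\lambda f.\ f)\ \text{tru} \\
& \longrightarrow \text{tru}
\end{align} %]]></script>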
<h4 id="pairs">Pairs</h4>
<p>The fundamental operations are construction <code class="highlighter-rouge">pair(a, b)</code>, and selection <code class="highlighter-rouge">pair._1</code> and <code class="highlighter-rouge">pair._2</code>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{pair } & = \lambda f.\ \lambda s.\ \lambda b.\ b\ f\ s\\
\text{fst } & = \lambda p.\ p\ \text{tru} \\
\text{snd } & = \lambda p.\ p\ \text{fls} \\
\end{align} %]]></script>
<ul>
<li><code class="highlighter-rouge">pair</code> is equivalent to <code class="highlighter-rouge">pair(f, s) = (b => b f s)</code></li>
<li>When <code class="highlighter-rouge">tru</code> is applied to <code class="highlighter-rouge">pair</code>, it selects the first element, by definition of the boolean, and that is therefore the definition of <code class="highlighter-rouge">fst</code></li>
<li>Equivalently for <code class="highlighter-rouge">fls</code> applied to <code class="highlighter-rouge">pair</code>, it selects the second element</li>
</ul>
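<p>For two values $v$ and $w$, selecting the first component of a pair reduces as follows under call-by-value (a worked sketch; the pair is built first, since the argument of $\text{fst}$ must be a value):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{fst } (\text{pair } v\ w)
& = \text{fst } ((\lambda f.\ \lambda s.\ \lambda b.\ b\ f\ s)\ v\ w) \\
& \longrightarrow \text{fst } ((\lambda s.\ \lambda b.\ b\ v\ s)\ w) \\
& \longrightarrow \text{fst } (\lambda b.\ b\ v\ w) \\
& = (\lambda p.\ p\ \text{tru})\ (\lambda b.\ b\ v\ w) \\
& \longrightarrow (\lambda b.\ b\ v\ w)\ \text{tru} \\
& \longrightarrow \text{tru}\ v\ w \\
& \longrightarrow (\lambda f.\ v)\ w \\
& \longrightarrow v
\end{align} %]]></script>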
<h4 id="numbers">Numbers</h4>
<p>We’ve actually been representing numbers as lambda-calculus numbers all along! Our <code class="highlighter-rouge">succ</code> function represents what’s more formally called <strong>Church numerals</strong>.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
c_0 & = \lambda s.\ \lambda z.\ z \\
c_1 & = \lambda s.\ \lambda z.\ s\ z \\
c_2 & = \lambda s.\ \lambda z.\ s\ (s\ z) \\
c_3 & = \lambda s.\ \lambda z.\ s\ (s\ (s\ z)) \\
\end{align} %]]></script>
<p>Note that $c_0$’s implementation is the same as that of $\text{fls}$ (just with renamed variables).</p>
<p>Every number $n$ is represented by a term $c_n$ taking two arguments, which are $s$ and $z$ (for “successor” and “zero”), and applies $s$ to $z$, $n$ times. Fundamentally, a number is equivalent to the following:</p>
<script type="math/tex; mode=display">c_n = \lambda s.\ \lambda z.\ \underbrace{s\ (s\ (\dots\ (s}_{n \text{ times}}\ z)\dots))</script>
<p>With this in mind, let us implement some functions on numbers.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{scc } & = \lambda n.\ \lambda s.\ \lambda z.\ s\ (n\ s\ z) \\
\text{add } & = \lambda m.\ \lambda n.\ \lambda s.\ \lambda z.\ m\ s\ (n\ s\ z) \\
\text{mul } & = \lambda m.\ \lambda n.\ m\ (\text{add } n)\ c_0 \\
\text{sub } & = \lambda m.\ \lambda n.\ n\ \text{prd}\ m \\
\text{iszero } & = \lambda m.\ m\ (\lambda x.\ \text{fls})\ \text{tru}
\end{align} %]]></script>
<ul>
<li><strong>Successor</strong> $\text{scc}$: we apply the successor function to $n$ (which has been correctly instantiated with $s$ and $z$)</li>
<li><strong>Addition</strong> $\text{add}$: we pass the instantiated $n$ as the zero of $m$</li>
<li><strong>Subtraction</strong> $\text{sub}$: we apply $\text{prd}$ (the predecessor function, defined below) $n$ times to $m$</li>
<li><strong>Multiplication</strong> $\text{mul}$: instead of the successor function, we pass the addition by $n$ function.</li>
<li><strong>Zero test</strong> $\text{iszero}$: zero has the same implementation as false, so we can lean on that to build an iszero function. An alternative understanding is that we’re building a number, in which we use true for the zero value $z$. If we have to apply the successor function $s$ once or more, we want to get false, so for the successor function we use a function ignoring its input and returning false if applied.</li>
</ul>
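<p>Church numerals can be tried out in Scala, with one caveat: Scala is typed, so this sketch fixes the type of numerals to functions over <code class="highlighter-rouge">Int</code>. The type alias <code class="highlighter-rouge">Church</code> and the helper <code class="highlighter-rouge">toInt</code> are assumptions made for this example. Under this monomorphic type, $\text{mul}$ and $\text{iszero}$ as defined above do not type-check (they instantiate a numeral at a different type), so they are left out.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Assumption: a numeral takes a successor on Int and a zero of type Int.
type Church = (Int => Int) => Int => Int

val c0: Church = s => z => z
val c1: Church = s => z => s(z)
val c2: Church = s => z => s(s(z))

// scc = λn. λs. λz. s (n s z)
val scc: Church => Church = n => s => z => s(n(s)(z))

// add = λm. λn. λs. λz. m s (n s z): n's result is used as the zero of m
val add: Church => Church => Church = m => n => s => z => m(s)(n(s)(z))

// Hypothetical helper: read a numeral back by counting with + 1 from 0.
def toInt(n: Church): Int = n(_ + 1)(0)

val three = toInt(scc(c2))
val four = toInt(add(c2)(scc(c1)))</code></pre></figure>
<p>Reading a numeral back with <code class="highlighter-rouge">toInt</code> is the same idea as applying $c_n$ to a concrete $s$ and $z$: the numeral itself only knows how many times to apply $s$.</p>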
<p>What about predecessor? This is a little harder, and it’ll take a few steps to get there. The main idea is that we find the predecessor by rebuilding the whole succession up until our number. At every step, we must generate the number and its predecessor: zero is $(c_0, c_0)$, and all other numbers are $(c_{n-1}, c_n)$. Once we’ve reconstructed this pair, we can get the predecessor by taking the first element of the pair.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{zz} & = \text{pair } c_0 \ c_0 \\
\text{ss} & = \lambda p.\ \text{pair } (\text{snd } p)\ (\text{scc } (\text{snd } p)) \\
\text{prd} & = \lambda m.\ \text{fst } (m\ \text{ss zz}) \\
\end{align} %]]></script>
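<p>As a sketch of how this plays out for $c_2$, write $\approx$ for "behaves like" (an informal notation used only here, since under call-by-value the terms below are not always syntactically equal):</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{ss zz} & \approx \text{pair } c_0\ c_1 \\
\text{ss } (\text{ss zz}) & \approx \text{pair } c_1\ c_2 \\
\text{prd } c_2 = \text{fst } (c_2\ \text{ss zz}) & \approx \text{fst } (\text{ss } (\text{ss zz})) \approx c_1
\end{align} %]]></script>
<p>For $c_0$, the function $\text{ss}$ is never applied, so $\text{prd } c_0 \approx \text{fst zz} \approx c_0$: the predecessor of zero is zero.</p>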
<details><summary><p>Sidenote</p>
</summary><div class="details-content">
<p>The story goes that Church was stumped by predecessors for a long time. This solution finally came to him while he was at the barber, and he jumped out half shaven to write it down.</p>
</div></details>
<h4 id="lists">Lists</h4>
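<p>Now what about lists? A list is encoded as a function of two arguments: $f$ is applied to the head and the tail when the list is non-empty, and $g$ is returned when the list is empty.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{nil} & = \lambda f.\ \lambda g.\ g \\
\text{cons} & = \lambda x.\ \lambda xs.\ (\lambda f.\ \lambda g.\ f\ x\ xs) \\
\text{head} & = \lambda xs.\ xs\ (\lambda y.\ \lambda ys.\ y)\ \text{nil} \\
\text{isEmpty} & = \lambda xs.\ xs\ (\lambda y.\ \lambda ys.\ \text{fls})\ \text{tru} \\
\end{align} %]]></script>
<p>The head of an empty list is undefined, so the second argument passed to $xs$ in $\text{head}$ is an arbitrary placeholder (here $\text{nil}$).</p>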
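<p>Applying a list to two arguments shows how the encoding works. In this sketch, $v$, $vs$, $a$ and $b$ are arbitrary values introduced for illustration:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{cons } v\ vs\ a\ b & \longrightarrow^* a\ v\ vs \\
\text{nil}\ a\ b & \longrightarrow^* b
\end{align} %]]></script>
<p>A non-empty list hands its head and tail to the first argument, while the empty list returns the second argument unchanged, much like $\text{fls}$.</p>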
<h3 id="recursion-in-lambda-calculus">Recursion in lambda-calculus</h3>
<p>Let’s start by taking a step back. We talked about normal forms and terms for which we terminate; does lambda calculus always terminate? It’s Turing complete, so it must be able to loop infinitely (otherwise, we’d have solved the halting problem!).</p>
<p>The trick to recursion is self-application:</p>
<script type="math/tex; mode=display">\lambda x.\ x\ x</script>
<p>From a type-level perspective, we would cringe at this. This should not be possible in the typed world, but in the untyped world we can do it. We can construct a simple infinite loop in lambda calculus as follows:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{omega }
& = & (\lambda x.\ x\ x)\ (\lambda x.\ x\ x) \\
& \longrightarrow & \ (\lambda x.\ x\ x)\ (\lambda x.\ x\ x)
\end{align} %]]></script>
<p>The expression evaluates to itself in one step; it never reaches a normal form, it loops infinitely, diverges. This is not a stuck term though; evaluation is always possible.</p>
<p>In fact, there are no stuck terms in pure lambda calculus. Every closed term is either a value or reduces further.</p>
<p>So it turns out that $\text{omega}$ isn’t so terribly useful. Let’s try to construct something more practical:</p>
<script type="math/tex; mode=display">Y_f = (\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x))</script>
<p>Now, the divergence is a little more interesting:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
Y_f & = & (\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x)) \\
& \longrightarrow & f\ ((\lambda x.\ f\ (x\ x))\ (\lambda x.\ f\ (x\ x))) \\
& = & f\ (Y_f) \\
& \longrightarrow & \dots \\
& = & f\ (f\ (Y_f)) \\
\end{align} %]]></script>
<p>This $Y_f$ function is known as a <strong>Y-combinator</strong>. It still loops infinitely (though note that while it works in classical lambda calculus, it blows up in call-by-value, where the argument $Y_f$ must be fully evaluated before $f$ is ever applied), so let’s try to build something more useful.</p>
<p>To delay the infinite recursion, we could build something like a poison pill:</p>
<script type="math/tex; mode=display">\text{poisonpill} = \lambda y.\ \text{omega}</script>
<p>It can be passed around (after all, it’s just a value), but applying it will cause our program to loop infinitely. This is the core idea we’ll use for defining the <strong>fixed-point combinator</strong> $\text{fix}$, which allows us to do recursion. It’s defined as follows:</p>
<script type="math/tex; mode=display">\text{fix} = \lambda f.\ (\lambda x.\ f\ (\lambda y.\ x\ x\ y))\ (\lambda x.\ f\ (\lambda y.\ x\ x\ y))</script>
<p>This looks a little intricate, and we won’t need to fully understand the definition. What’s important is mostly how it is used to define a recursive function. For instance, if we wanted to define a modulo function in our toy language, we’d do it as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="n">mod</span><span class="o">(</span><span class="n">x</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">y</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span>
  <span class="k">if</span> <span class="o">(</span><span class="n">y</span> <span class="o">></span> <span class="n">x</span><span class="o">)</span> <span class="n">x</span>
  <span class="k">else</span> <span class="n">mod</span><span class="o">(</span><span class="n">x</span> <span class="o">-</span> <span class="n">y</span><span class="o">,</span> <span class="n">y</span><span class="o">)</span></code></pre></figure>
<p>In lambda calculus, we’d define this as:</p>
<script type="math/tex; mode=display">\text{mod} = \text{fix } (\lambda f.\ \lambda x.\ \lambda y.\
(\text{gt } y\ x)\ x\ (f\ (\text{sub } x\ y)\ y)
)</script>
<p>We’ve assumed that a greater-than $\text{gt}$ function was available here.</p>
<p>More generally, we can define a recursive function as:</p>
<script type="math/tex; mode=display">\text{fix } \bigl(\lambda f.\ (\textit{recursion on } f)\bigr)</script>
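<p>The call-by-value fixed-point combinator can be transcribed into Scala, with one adjustment: self-application $x\ x$ does not type-check directly, so this sketch introduces a hypothetical wrapper type <code class="highlighter-rouge">Rec</code> to make it typable. The modulo function uses Scala's built-in integers and <code class="highlighter-rouge">if</code> instead of Church encodings.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Hypothetical wrapper: a Rec[A] holds a function that takes a Rec[A],
// which is what lets us write the self-application x x in a typed language.
case class Rec[A](unroll: Rec[A] => A)

// fix = λf. (λx. f (λy. x x y)) (λx. f (λy. x x y))
def fix[A, B](f: (A => B) => (A => B)): A => B = {
  // w plays the role of λx. f (λy. x x y). The eta-expansion y => ...
  // delays the self-application until a recursive call is actually made.
  val w: Rec[A => B] => (A => B) = x => f(y => x.unroll(x)(y))
  w(Rec[A => B](w))
}

// mod = fix (λf. λx. λy. if (y > x) x else f (x - y) y), on Int
val mod: Int => Int => Int =
  fix[Int, Int => Int](f => x => y => if (y > x) x else f(x - y)(y))

val oneLeft = mod(7)(3)</code></pre></figure>
<p>Without the inner <code class="highlighter-rouge">y => ...</code>, building <code class="highlighter-rouge">fix(f)</code> would already loop forever in a strict language; with it, the recursion only unfolds one level each time <code class="highlighter-rouge">f</code> is called.</p>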
<h3 id="equivalence-of-lambda-terms">Equivalence of lambda terms</h3>
<p>We’ve seen how to define Church numerals and successor. How can we prove that $\text{scc } c_n$ is equal to $c_{n+1}$?</p>
<p>The naive approach unfortunately doesn’t work; they do not evaluate to the same value.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align*}
\text{scc } c_2
& = (\lambda n.\ \lambda s.\ \lambda z.\ s\ (n\ s\ z))\ (\lambda s.\ \lambda z.\ s\ (s\ z)) \\
& \longrightarrow \lambda s.\ \lambda z.\ s\ ((\lambda s.\ \lambda z.\ s\ (s\ z))\ s\ z) \\
& \neq \lambda s.\ \lambda z.\ s\ (s\ (s\ z)) \\
& = c_3 \\
\end{align*} %]]></script>
<p>This still seems very close. If we could simplify a little further, we would see that they are the same.</p>
<p>The intuition behind the Church numeral representation was that a number $n$ is represented as a term that “does something $n$ times to something else”. $\text{scc}$ takes a term that “does something $n$ times to something else”, and returns a term that “does something $n+1$ times to something else”.</p>
<p>What we really care about is that $\text{scc } c_2$ <em>behaves</em> the same as $c_3$ when applied to two arguments. We want <em>behavioral equivalence</em>. But what does that mean? Roughly, two terms $s$ and $t$ are behaviorally equivalent if there is no “test” that distinguishes $s$ and $t$.</p>
<p>Let’s define this notion of “test” a little more precisely, and specify how we’re going to observe the results of a test. We can use the notion of <strong>normalizability</strong> to define a simple notion of a test:</p>
<blockquote>
<p>Two terms $s$ and $t$ are said to be <strong>observationally equivalent</strong> if they are either both normalizable (i.e. they reach a normal form after a finite number of evaluation steps), or both diverge.</p>
</blockquote>
<p>In other words, we observe a term’s behavior by running it and seeing if it halts. Note that this is not decidable (by the halting problem).</p>
<p>For instance, $\text{omega}$ and $\text{tru}$ are not observationally equivalent (one diverges, one halts), while $\text{tru}$ and $\text{fls}$ are (they both halt).</p>
<p>Observational equivalence isn’t a strong enough test for what we need; we need behavioral equivalence.</p>
<blockquote>
<p>Two terms $s$ and $t$ are said to be <strong>behaviorally equivalent</strong> if, for every finite sequence of values $v_1, v_2, \dots, v_n$ the applications $s\ v_1\ v_2\ \dots\ v_n$ and $t\ v_1\ v_2\ \dots\ v_n$ are observationally equivalent.</p>
</blockquote>
<p>This allows us to assert that true and false are indeed different:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{tru}\ x\ \Omega & \longrightarrow x \\
\text{fls}\ x\ \Omega & \longrightarrow \Omega \\
\end{align} %]]></script>
<p>Here $\Omega$ stands for $\text{omega}$. The former returns a normal form, while the latter diverges.</p>
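<p>Returning to the motivating example, we can check informally that $\text{scc } c_2$ and $c_3$ behave the same once applied to two arguments. In this sketch, $v$ and $w$ are arbitrary values:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{scc } c_2\ v\ w
& \longrightarrow (\lambda s.\ \lambda z.\ s\ (c_2\ s\ z))\ v\ w \\
& \longrightarrow (\lambda z.\ v\ (c_2\ v\ z))\ w \\
& \longrightarrow v\ (c_2\ v\ w) \\
& \longrightarrow^* v\ (v\ (v\ w)) \\ \\
c_3\ v\ w
& \longrightarrow (\lambda z.\ v\ (v\ (v\ z)))\ w \\
& \longrightarrow v\ (v\ (v\ w))
\end{align} %]]></script>
<p>Both applications reach the same term, so from that point on they either both normalize or both diverge. The reduction that was blocked under the lambdas becomes possible as soon as the arguments are supplied.</p>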
<h2 id="types">Types</h2>
<p>As previously, to define a language, we start with a <em>set of terms</em> and <em>values</em>, as well as an <em>evaluation relation</em>. But now, we’ll also define a set of <strong>types</strong> (denoted with a first capital letter) classifying values according to their “shape”. We can define a <em>typing relation</em> $t:\ T$. We must check that the typing relation is <em>sound</em> in the sense that:</p>
<script type="math/tex; mode=display">\frac{t: T \qquad t\longrightarrow^* v}{v: T}
\qquad\text{and}\qquad
\frac{t: T}{t \text{ is a value, or } \exists t' \text{ such that } t\longrightarrow t'}</script>
<p>These rules represent some kind of safety and liveness, but are more commonly referred to as <a href="#properties-of-the-typing-relation">preservation and progress</a>, which we’ll talk about later. The first one states that types are preserved throughout evaluation, while the second says that if we can type-check, then evaluation of $t$ will not get stuck.</p>
<p>In our previous toy language, we can introduce two types, booleans and numbers:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">T ::=     // types
    Bool  // type of booleans
    Nat   // type of numbers</code></pre></figure>
<p>Our typing rules are then given by:</p>
<script type="math/tex; mode=display">\begin{align}
\text{true } : \text{ Bool}
\tag{T-True} \label{eq:t-true} \\ \\
\text{false } : \text{ Bool}
\tag{T-False} \label{eq:t-false} \\ \\
0: \text{ Nat}
\tag{T-Zero} \label{eq:t-zero} \\ \\
\frac{t_1: \text{Bool} \quad t_2 : T \quad t_3: T}{\ifelse : T}
\tag{T-If} \label{eq:t-if} \\ \\
\frac{t_1: \text{Nat}}{\text{succ } t_1: \text{Nat}}
\tag{T-Succ} \label{eq:t-succ} \\ \\
\frac{t_1: \text{Nat}}{\text{pred } t_1: \text{Nat}}
\tag{T-Pred} \label{eq:t-pred} \\ \\
\frac{t_1: \text{Nat}}{\text{iszero } t_1: \text{Bool}}
\tag{T-IsZero} \label{eq:t-iszero} \\ \\
\end{align}</script>
<p>With these typing rules in place, we can construct typing derivations to justify every pair $t: T$ (which we can also denote as a $(t, T)$ pair) in the typing relation, as we have done previously with evaluation. Proofs of properties about the typing relation often proceed by induction on these typing derivations.</p>
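<p>As an illustration, here is a sketch of a typing derivation for the term $\text{if iszero } 0 \text{ then } 0 \text{ else pred } 0$ (an example term chosen here, not one from the course). Each leaf $0: \text{Nat}$ is an instance of $\ref{eq:t-zero}$, the two inner steps use $\ref{eq:t-iszero}$ and $\ref{eq:t-pred}$, and the final step uses $\ref{eq:t-if}$ with $T = \text{Nat}$:</p>
<script type="math/tex; mode=display">\frac{
\dfrac{0 : \text{Nat}}{\text{iszero } 0 : \text{Bool}}
\qquad
0 : \text{Nat}
\qquad
\dfrac{0 : \text{Nat}}{\text{pred } 0 : \text{Nat}}
}{
\text{if iszero } 0 \text{ then } 0 \text{ else pred } 0 : \text{Nat}
}</script>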
<p>Like other static program analyses, type systems are generally imprecise. They do not always predict exactly what kind of value will be returned, but simply a conservative approximation. For instance, <code class="highlighter-rouge">if true then 0 else false</code> cannot be typed with the above rules, even though it will certainly evaluate to a number. We could of course add a typing rule for <code class="highlighter-rouge">if true</code> statements, but there is still a question of how useful this is, and how much complexity it adds to the type system, and especially for proofs. Indeed, the inversion lemma below becomes much more tedious when we have more rules.</p>
<h3 id="properties-of-the-typing-relation">Properties of the Typing Relation</h3>
<p>The safety (or soundness) of this type system can be expressed by the following two properties:</p>
<ul>
<li>
<p><strong>Progress</strong>: A well-typed term is not stuck.</p>
<p>If $t\ :\ T$ then either $t$ is a value, or else $t\longrightarrow t'$ for some $t'$.</p>
</li>
<li>
<p><strong>Preservation</strong>: Types are preserved by one-step evaluation.</p>
<p>If $t\ :\ T$ and $t\longrightarrow t'$, then $t'\ :\ T$.</p>
</li>
</ul>
<p>We will prove these later, but first we must state a few lemmas.</p>
<h4 id="inversion-lemma-1">Inversion lemma</h4>
<p>Again, for types we need to state the same (boring) inversion lemma:</p>
<ol>
<li>If $\text{true}: R$, then $R = \text{Bool}$.</li>
<li>If $\text{false}: R$, then $R = \text{Bool}$.</li>
<li>If $\ifelse: R$, then $t_1: \text{ Bool}$, $t_2: R$ and $t_3: R$</li>
<li>If $0: R$ then $R = \text{Nat}$</li>
<li>If $\text{succ } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$</li>
<li>If $\text{pred } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$</li>
<li>If $\text{iszero } t_1: R$ then $R = \text{Bool}$ and $t_1: \text{Nat}$</li>
</ol>
<p>From the inversion lemma, we can directly derive a typechecking algorithm:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">def</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t</span><span class="k">:</span> <span class="kt">Expr</span><span class="o">)</span><span class="k">:</span> <span class="kt">T</span> <span class="o">=</span> <span class="n">t</span> <span class="k">match</span> <span class="o">{</span>
  <span class="k">case</span> <span class="nc">True</span> <span class="o">|</span> <span class="nc">False</span> <span class="k">=></span> <span class="nc">Bool</span>
  <span class="k">case</span> <span class="nc">If</span><span class="o">(</span><span class="n">t1</span><span class="o">,</span> <span class="n">t2</span><span class="o">,</span> <span class="n">t3</span><span class="o">)</span> <span class="k">=></span>
    <span class="k">val</span> <span class="n">type1</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">type2</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t2</span><span class="o">)</span>
    <span class="k">val</span> <span class="n">type3</span> <span class="k">=</span> <span class="n">typeof</span><span class="o">(</span><span class="n">t3</span><span class="o">)</span>
    <span class="k">if</span> <span class="o">(</span><span class="n">type1</span> <span class="o">==</span> <span class="nc">Bool</span> <span class="o">&&</span> <span class="n">type2</span> <span class="o">==</span> <span class="n">type3</span><span class="o">)</span> <span class="n">type2</span>
    <span class="k">else</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
  <span class="k">case</span> <span class="nc">Zero</span> <span class="k">=></span> <span class="nc">Nat</span>
  <span class="k">case</span> <span class="nc">Succ</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
    <span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Nat</span>
    <span class="k">else</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
  <span class="k">case</span> <span class="nc">Pred</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
    <span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Nat</span>
    <span class="k">else</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
  <span class="k">case</span> <span class="nc">IsZero</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="k">=></span>
    <span class="k">if</span> <span class="o">(</span><span class="n">typeof</span><span class="o">(</span><span class="n">t1</span><span class="o">)</span> <span class="o">==</span> <span class="nc">Nat</span><span class="o">)</span> <span class="nc">Bool</span>
    <span class="k">else</span> <span class="k">throw</span> <span class="k">new</span> <span class="nc">Error</span><span class="o">(</span><span class="s">"not typable"</span><span class="o">)</span>
<span class="o">}</span></code></pre></figure>
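<p>The algorithm above leaves the definitions of the expression and type data types implicit. The following self-contained sketch fills them in under assumed names (<code class="highlighter-rouge">Type</code>, <code class="highlighter-rouge">Expr</code> and their constructors are hypothetical), and uses a variant, <code class="highlighter-rouge">typeOpt</code>, that returns an <code class="highlighter-rouge">Option</code> instead of throwing, so that untypable terms are easy to test for.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Hypothetical data types for the toy language and its types.
sealed trait Type
case object Bool extends Type
case object Nat extends Type

sealed trait Expr
case object True extends Expr
case object False extends Expr
case object Zero extends Expr
case class If(cond: Expr, thenBranch: Expr, elseBranch: Expr) extends Expr
case class Succ(e: Expr) extends Expr
case class Pred(e: Expr) extends Expr
case class IsZero(e: Expr) extends Expr

// Variant of typeof: Some(type) if the term is typable, None otherwise.
def typeOpt(t: Expr): Option[Type] = t match {
  case True | False => Some(Bool)
  case Zero         => Some(Nat)
  case If(t1, t2, t3) =>
    // T-If: the condition is a Bool and both branches have the same type
    (typeOpt(t1), typeOpt(t2), typeOpt(t3)) match {
      case (Some(Bool), Some(ty2), Some(ty3)) if ty2 == ty3 => Some(ty2)
      case _ => None
    }
  case Succ(t1)   => typeOpt(t1).filter(_ == Nat)
  case Pred(t1)   => typeOpt(t1).filter(_ == Nat)
  case IsZero(t1) => typeOpt(t1).filter(_ == Nat).map(_ => Bool)
}

// if true then 0 else false: evaluates to a number, but is not typable.
val rejected = typeOpt(If(True, Zero, False))
// if iszero 0 then succ 0 else 0: typable with type Nat.
val accepted = typeOpt(If(IsZero(Zero), Succ(Zero), Zero))</code></pre></figure>
<p>The first example is the conservative case mentioned earlier: the branches have different types, so the term is rejected even though evaluation would never reach the <code class="highlighter-rouge">else</code> branch.</p>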
<h4 id="canonical-form">Canonical form</h4>
<p>A simple lemma that will be useful for the progress theorem is that of canonical forms. Given a type, it tells us what kind of values we can expect:</p>
<ol>
<li>If $v$ is a value of type Bool, then $v$ is either $\text{true}$ or $\text{false}$</li>
<li>If $v$ is a value of type Nat, then $v$ is a numeric value</li>
</ol>
<p>The proof is somewhat immediate from the syntax of values.</p>
<h4 id="progress-theorem">Progress Theorem</h4>
<p><strong>Theorem</strong>: suppose that $t$ is a well-typed term of type $T$. Then either $t$ is a value, or else there exists some $t'$ such that $t\longrightarrow t'$.</p>
<p><strong>Proof</strong>: by induction on a derivation of $t: T$.</p>
<ul>
<li>The cases for $\ref{eq:t-true}$, $\ref{eq:t-false}$ and $\ref{eq:t-zero}$ are immediate, since $t$ is a value in these cases.</li>
<li>
<p>For $\ref{eq:t-if}$, we have $t=\ifelse$, with $t_1: \text{Bool}$, $t_2: T$ and $t_3: T$. By the induction hypothesis, either $t_1$ is a value, or there is some $t_1'$ such that $t_1 \longrightarrow t_1'$.</p>
<p>If $t_1$ is a value, then rule 1 of the <a href="#canonical-form">canonical form lemma</a> tells us that $t_1$ must be either $\text{true}$ or $\text{false}$, in which case $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ applies to $t$.</p>
<p>Otherwise, if $t_1 \longrightarrow t_1'$, then by $\ref{eq:e-if}$, $t\longrightarrow \if t_1' \then t_2 \text{ else } t_3$.</p>
</li>
<li>
<p>For $\ref{eq:t-succ}$, we have $t = \text{succ } t_1$.</p>
<p>By rule 5 of the <a href="#inversion-lemma-1">inversion lemma</a>, $t_1: \text{Nat}$, so the induction hypothesis applies to $t_1$. If $t_1$ is a value, then by rule 2 of the <a href="#canonical-form">canonical form lemma</a>, $t_1 = nv$ for some numeric value $nv$, and therefore $\text{succ } t_1$ is a value. Otherwise, if $t_1 \longrightarrow t_1'$, then $t\longrightarrow \text{succ }t_1'$.</p>
</li>
<li>The cases for $\ref{eq:t-pred}$ and $\ref{eq:t-iszero}$ are similar.</li>
</ul>
<h4 id="preservation-theorem">Preservation Theorem</h4>
<p><strong>Theorem</strong>: Types are preserved by one-step evaluation. If $t: T$ and $t\longrightarrow t'$, then $t': T$.</p>
<p><strong>Proof</strong>: by induction on the given typing derivation</p>
<ul>
<li>For $\ref{eq:t-true}$ and $\ref{eq:t-false}$, the precondition doesn’t hold (no reduction is possible), so it’s trivially true. Indeed, $t$ is already a value, either $t=\text{ true}$ or $t=\text{ false}$.</li>
<li>For $\ref{eq:t-if}$, there are three evaluation rules by which $t\longrightarrow t’$ can be derived, depending on $t_1$
<ul>
<li>If $t_1 = \text{true}$, then by $\ref{eq:e-iftrue}$ we have $t'=t_2$, and from rule 3 of the <a href="#inversion-lemma-1">inversion lemma</a> and the assumption that $t: T$, we have $t_2: T$, that is $t': T$</li>
<li>If $t_1 = \text{false}$, then by $\ref{eq:e-iffalse}$ we have $t'=t_3$, and from rule 3 of the <a href="#inversion-lemma-1">inversion lemma</a> and the assumption that $t: T$, we have $t_3: T$, that is $t': T$</li>
<li>If $t_1 \longrightarrow t_1'$, then by the induction hypothesis, $t_1': \text{Bool}$. Combining this with the assumption that $t_2: T$ and $t_3: T$, we can apply $\ref{eq:t-if}$ to conclude $\if t_1' \then t_2 \else t_3: T$, that is $t': T$</li>
</ul>
</li>
</ul>
<h3 id="messing-with-it">Messing with it</h3>
<h4 id="removing-a-rule">Removing a rule</h4>
<p>What if we remove $\ref{eq:e-predzero}$? Then <code class="highlighter-rouge">pred 0</code> type checks, but it is stuck and is not a value; the <a href="#progress-theorem">progress theorem</a> fails.</p>
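<p>To make this concrete, here is a minimal Scala sketch of the toy language. The names (<code class="highlighter-rouge">ToyLanguage</code>, <code class="highlighter-rouge">typeOf</code>, <code class="highlighter-rouge">step</code>, and the <code class="highlighter-rouge">withPredZero</code> flag) are assumptions made for illustration; they do not come from the course material. Switching the flag off models the removal of the rule.</p>

```scala
object ToyLanguage {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  case class Succ(t: Term) extends Term
  case class Pred(t: Term) extends Term
  case class IsZero(t: Term) extends Term
  case class If(c: Term, thn: Term, els: Term) extends Term

  sealed trait Type
  case object Bool extends Type
  case object Nat extends Type

  def isNumeric(t: Term): Boolean = t match {
    case Zero     => true
    case Succ(t1) => isNumeric(t1)
    case _        => false
  }

  def isValue(t: Term): Boolean = t == True || t == False || isNumeric(t)

  // Type checker: None means "does not type check".
  def typeOf(t: Term): Option[Type] = t match {
    case True | False => Some(Bool)
    case Zero         => Some(Nat)
    case Succ(t1)     => typeOf(t1).filter(_ == Nat)
    case Pred(t1)     => typeOf(t1).filter(_ == Nat)
    case IsZero(t1)   => typeOf(t1).filter(_ == Nat).map(_ => Bool)
    case If(c, a, b) =>
      for {
        tc <- typeOf(c) if tc == Bool
        ta <- typeOf(a)
        tb <- typeOf(b) if ta == tb
      } yield ta
  }

  // One-step evaluation: None means "no rule applies".
  // withPredZero = false models the language without the pred 0 rule.
  def step(t: Term, withPredZero: Boolean = true): Option[Term] = t match {
    case If(True, a, _)                    => Some(a)
    case If(False, _, b)                   => Some(b)
    case If(c, a, b)                       => step(c, withPredZero).map(If(_, a, b))
    case Succ(t1)                          => step(t1, withPredZero).map(Succ(_))
    case Pred(Zero)                        => if (withPredZero) Some(Zero) else None
    case Pred(Succ(nv)) if isNumeric(nv)   => Some(nv)
    case Pred(t1)                          => step(t1, withPredZero).map(Pred(_))
    case IsZero(Zero)                      => Some(True)
    case IsZero(Succ(nv)) if isNumeric(nv) => Some(False)
    case IsZero(t1)                        => step(t1, withPredZero).map(IsZero(_))
    case _                                 => None
  }
}
```

<p>With the flag off, <code class="highlighter-rouge">Pred(Zero)</code> still has type <code class="highlighter-rouge">Nat</code>, but <code class="highlighter-rouge">step</code> returns <code class="highlighter-rouge">None</code> and <code class="highlighter-rouge">isValue</code> is false: a well-typed, stuck term.</p>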
<h4 id="changing-type-checking-rule">Changing type-checking rule</h4>
<p>What if we change the $\ref{eq:t-if}$ to the following?</p>
<script type="math/tex; mode=display">\frac{
t_1 : \text{Bool} \quad
t_2 : \text{Nat} \quad
t_3 : \text{Nat}
}{
(\ifelse) : \text{Nat}
}
\tag{T-If 2}
\label{eq:t-if2}</script>
<p>This doesn’t break our type system. It’s still sound, but it rejects if-else expressions that return other things than numbers (e.g. booleans). But that is an expressiveness problem, not a soundness problem; our type system disallows things that would otherwise be fine by the evaluation rules.</p>
<h4 id="adding-bit">Adding bit</h4>
<p>We could add a boolean-to-natural function <code class="highlighter-rouge">bit(t)</code>. We’d have to add it to the grammar, add some evaluation and typing rules, and prove progress and preservation.</p>
<script type="math/tex; mode=display">\begin{align}
\text{bit true} \longrightarrow 0 \\ \\
\text{bit false} \longrightarrow 1 \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{bit }t_1 \longrightarrow \text{bit }t_1'}
\\ \\
\frac{t : \text{Bool}}{\text{bit } t : \text{Nat}}
\end{align}</script>
<p>We’ll do something similar below, so the full proof is omitted.</p>
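<p>As a rough illustration, the <code class="highlighter-rouge">bit</code> rules can be sketched in Scala on a cut-down version of the toy language. The encoding (<code class="highlighter-rouge">BitExtension</code>, <code class="highlighter-rouge">Bit</code>, <code class="highlighter-rouge">typeOf</code>, <code class="highlighter-rouge">step</code>) is an assumption of this sketch, and it follows the evaluation rules exactly as they are written above.</p>

```scala
object BitExtension {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case object Zero extends Term
  case class Succ(t: Term) extends Term
  case class If(c: Term, thn: Term, els: Term) extends Term
  case class Bit(t: Term) extends Term // the new construct

  sealed trait Type
  case object Bool extends Type
  case object Nat extends Type

  def typeOf(t: Term): Option[Type] = t match {
    case True | False => Some(Bool)
    case Zero         => Some(Nat)
    case Succ(t1)     => typeOf(t1).filter(_ == Nat)
    case If(c, a, b) =>
      for {
        tc <- typeOf(c) if tc == Bool
        ta <- typeOf(a)
        tb <- typeOf(b) if ta == tb
      } yield ta
    // New typing rule: bit t : Nat whenever t : Bool.
    case Bit(t1) => typeOf(t1).filter(_ == Bool).map(_ => Nat)
  }

  def step(t: Term): Option[Term] = t match {
    case If(True, a, _)  => Some(a)
    case If(False, _, b) => Some(b)
    case If(c, a, b)     => step(c).map(If(_, a, b))
    case Succ(t1)        => step(t1).map(Succ(_))
    case Bit(True)       => Some(Zero)        // bit true  --> 0
    case Bit(False)      => Some(Succ(Zero))  // bit false --> 1
    case Bit(t1)         => step(t1).map(Bit(_)) // congruence rule
    case _               => None
  }
}
```

<p>Preservation can be spot-checked on examples: <code class="highlighter-rouge">Bit(If(True, False, True))</code> has type <code class="highlighter-rouge">Nat</code>, and so does the term it steps to, <code class="highlighter-rouge">Bit(False)</code>.</p>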
<h2 id="simply-typed-lambda-calculus">Simply typed lambda calculus</h2>
<p>Simply Typed Lambda Calculus (STLC) is also denoted $\lambda_\rightarrow$. The “pure” form of STLC is not very interesting on the type-level (unlike for the term-level of pure lambda calculus), so we’ll allow base values that are not functions, like booleans and integers. To talk about STLC, we always begin with some set of “base types”:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>T ::= // types
Bool // type of booleans
T -> T // type of functions</pre></td></tr></tbody></table></code></pre></figure>
<p>In the following examples, we’ll work with a mix of our previously defined toy language, and lambda calculus. This will give us a little syntactic sugar.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
</pre></td><td class="code"><pre>t ::= // terms
x // variable
λx. t // abstraction
t t // application
true // constant true
false // constant false
if t then t else t // conditional
v ::= // values
λx. t // abstraction value
true // true value
false // false value</pre></td></tr></tbody></table></code></pre></figure>
<h3 id="type-annotations">Type annotations</h3>
<p>We will annotate lambda-abstractions with the expected type of the argument, as follows:</p>
<script type="math/tex; mode=display">\lambda x: T_1 .\ t_1</script>
<p>We could also omit it, and let type inference do the job (as in OCaml), but for now, we’ll do the above. This will make it simpler, as we won’t have to discuss inference just yet.</p>
<h3 id="typing-rules">Typing rules</h3>
<p>In STLC, we’ve introduced abstraction. To add a typing rule for that, we need to encode the concept of an environment $\Gamma$, which is a set of typing assumptions about variables (each of the form $x: T$). We also introduce the “turnstile” symbol $\vdash$: the judgment $\Gamma\vdash t: T$ means that the term $t$ has type $T$ under the assumptions in $\Gamma$.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{
\bigl( \Gamma \cup (x : T_1) \bigr) \vdash t_2 : T_2
}{ \Gamma\vdash(\lambda x: T_1.\ t_2): T_1 \rightarrow T_2 }
\tag{T-Abs} \label{eq:t-abs} \\ \\
\frac{x: T \in \Gamma}{\Gamma\vdash x: T}
\tag{T-Var} \label{eq:t-var} \\ \\
\frac{
\Gamma\vdash t_1 : T_{11}\rightarrow T_{12}
\quad
\Gamma\vdash t_2 : T_{11}
}{\Gamma\vdash t_1\ t_2 : T_{12}}
\tag{T-App} \label{eq:t-app}
\end{align}</script>
<p>This additional concept must be taken into account in our definition of progress and preservation:</p>
<ul>
<li><strong>Progress</strong>: If $\Gamma\vdash t : T$, then either $t$ is a value or else $t\longrightarrow t’$ for some $t’$</li>
<li><strong>Preservation</strong>: If $\Gamma\vdash t : T$ and $t\longrightarrow t’$, then $\Gamma\vdash t’ : T$</li>
</ul>
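<p>The typing rules $\ref{eq:t-var}$, $\ref{eq:t-abs}$ and $\ref{eq:t-app}$ translate fairly directly into a recursive function. The following Scala sketch is only one possible encoding: the environment is modelled as a <code class="highlighter-rouge">Map</code>, and all names (<code class="highlighter-rouge">Stlc</code>, <code class="highlighter-rouge">Env</code>, <code class="highlighter-rouge">typeOf</code>) are assumptions made for illustration.</p>

```scala
object Stlc {
  sealed trait Type
  case object Bool extends Type
  case class Fun(from: Type, to: Type) extends Type

  sealed trait Term
  case class Var(name: String) extends Term
  case class Abs(param: String, paramType: Type, body: Term) extends Term
  case class App(fn: Term, arg: Term) extends Term
  case object True extends Term
  case object False extends Term
  case class If(cond: Term, thn: Term, els: Term) extends Term

  // The environment Γ, modelled as a map from variable names to types.
  type Env = Map[String, Type]

  // Returns Some(T) if Γ ⊢ t : T can be derived, None otherwise.
  def typeOf(env: Env, t: Term): Option[Type] = t match {
    case True | False => Some(Bool)
    case If(c, a, b) =>
      for {
        tc <- typeOf(env, c) if tc == Bool
        ta <- typeOf(env, a)
        tb <- typeOf(env, b) if ta == tb
      } yield ta
    // T-Var: look the variable up in Γ.
    case Var(x) => env.get(x)
    // T-Abs: type the body under Γ ∪ (x: T1).
    case Abs(x, t1, body) =>
      typeOf(env + (x -> t1), body).map(t2 => Fun(t1, t2))
    // T-App: the argument type must match the parameter type.
    case App(f, a) =>
      (typeOf(env, f), typeOf(env, a)) match {
        case (Some(Fun(t11, t12)), Some(ta)) if ta == t11 => Some(t12)
        case _                                            => None
      }
  }
}
```

<p>For instance, <code class="highlighter-rouge">typeOf(Map.empty, Abs("x", Bool, Var("x")))</code> yields <code class="highlighter-rouge">Some(Fun(Bool, Bool))</code>, while the open term <code class="highlighter-rouge">Var("x")</code> has no type in the empty environment.</p>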
<p>To prove these, we must take the same steps as above. We’ll introduce the inversion lemma for typing relations, and restate the canonical forms lemma in order to prove the progress theorem.</p>
<h3 id="inversion-lemma-2">Inversion lemma</h3>
<p>Let’s start with the inversion lemma.</p>
<ol>
<li>If $\Gamma\vdash\text{true} : R$ then $R = \text{Bool}$</li>
<li>If $\Gamma\vdash\text{false} : R$ then $R = \text{Bool}$</li>
<li>If $\Gamma\vdash\ifelse : R$ then $\Gamma\vdash t_1 : \text{Bool}$ and $\Gamma\vdash t_2, t_3: R$.</li>
<li>If $\Gamma\vdash x: R$ then $x: R \in\Gamma$</li>
<li>If $\Gamma\vdash\lambda x: T_1 .\ t_2 : R$ then $R = T_1 \rightarrow R_2$ for some $R_2$ with $\Gamma\cup(x: T_1)\vdash t_2: R_2$</li>
<li>If $\Gamma\vdash t_1\ t_2 : R$ then there is some type $T_{11}$ such that $\Gamma\vdash t_1 : T_{11} \rightarrow R$ and $\Gamma\vdash t_2 : T_{11}$.</li>
</ol>
<h3 id="canonical-form-1">Canonical form</h3>
<p>The canonical forms are given as follows:</p>
<ol>
<li>If $v$ is a value of type Bool, then it is either $\text{true}$ or $\text{false}$</li>
<li>If $v$ is a value of type $T_1 \rightarrow T_2$ then $v$ has the form $\lambda x: T_1 .\ t_2$</li>
</ol>
<h3 id="progress">Progress</h3>
<p>Finally, we get to prove the progress by induction on typing derivations.</p>
<p><strong>Theorem</strong>: Suppose that $t$ is a closed, well-typed term (that is, $\Gamma\vdash t: T$ for some type $T$, with an empty environment $\Gamma = \emptyset$). Then either $t$ is a value, or there is some $t’$ such that $t\longrightarrow t’$.</p>
<ul>
<li>For boolean constants, the proof is immediate as $t$ is a value</li>
<li>For variables, the proof is immediate as $t$ is closed, and the precondition therefore doesn’t hold</li>
<li>For abstraction, the proof is immediate as $t$ is a value</li>
<li>
<p>Application is the only case we must treat.</p>
<p>Consider $t = t_1\ t_2$, with $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2: T_{11}$.</p>
<p>By the induction hypothesis, $t_1$ is either a value, or it can make a step of evaluation. The same goes for $t_2$.</p>
<p>If $t_1$ can reduce, then rule $\ref{eq:e-app1}$ applies to $t$. Otherwise, if it is a value, and $t_2$ can take a step, then $\ref{eq:e-app2}$ applies. Otherwise, if they are both values, then the canonical forms lemma above tells us that $t_1$ has the form $\lambda x: T_{11}.\ t_{12}$, and so rule $\ref{eq:e-appabs}$ (that is, $\beta$-reduction) applies to $t$.</p>
</li>
</ul>
<h3 id="preservation">Preservation</h3>
<p><strong>Theorem</strong>: If $\Gamma\vdash t: T$ and $t \longrightarrow t’$ then $\Gamma\vdash t’: T$.</p>
<p><strong>Proof</strong>: by induction on typing derivations. We proceed on a case-by-case basis, as we have done so many times before. But one case is hard: application.</p>
<p>For $t = t_1\ t_2$, such that $\Gamma\vdash t_1 : T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2 : T_{11}$, and where $T=T_{12}$, we want to show $\Gamma\vdash t’ : T_{12}$.</p>
<p>To do this, we must use the <a href="#inversion-lemma">inversion lemma for evaluation</a> (note that we haven’t written it down for STLC, but the idea is the same). There are three subcases for it, starting with the following:</p>
<p>The left-hand side is $t_1 = \lambda x: T_{11}.\ t_{12}$, and the right-hand side of application $t_2$ is a value $v_2$. In this case, we know that the result of the evaluation is given by $t’ = \left[ x\mapsto v_2 \right] t_{12}$.</p>
<p>And here, we already run into trouble, because we do not yet know how types behave under substitution. We will therefore need to introduce some lemmas.</p>
<h4 id="weakening-lemma">Weakening lemma</h4>
<p>Weakening tells us that we can <em>add</em> assumptions to the context without losing any true typing statements:</p>
<p>If $\Gamma\vdash t: T$, and the environment $\Gamma$ has no information about $x$—that is, $x\notin \text{dom}(\Gamma)$—then the initial assumption still holds if we add information about $x$ to the environment:</p>
<script type="math/tex; mode=display">\bigl(\Gamma \cup (x: S)\bigr)\vdash t: T</script>
<p>Moreover, the latter $\vdash$ derivation has the same depth as the former.</p>
<h4 id="permutation-lemma">Permutation lemma</h4>
<p>Permutation tells us that the order of assumptions in $\Gamma$ does not matter.</p>
<p>If $\Gamma \vdash t: T$ and $\Delta$ is a permutation of $\Gamma$, then $\Delta\vdash t: T$.</p>
<p>Moreover, the latter $\vdash$ derivation has the same depth as the former.</p>
<h4 id="substitution-lemma">Substitution lemma</h4>
<p>Substitution tells us that types are preserved under substitution.</p>
<p>That is, if $\Gamma\cup(x: S) \vdash t: T$ and $\Gamma\vdash s: S$, then $\Gamma\vdash \left[x\mapsto s\right] t: T$.</p>
<p>The proof goes by induction on the derivation of $\Gamma\cup(x: S) \vdash t: T$, that is, by cases on the final typing rule used in the derivation.</p>
<ul>
<li>
<p>Case $\ref{eq:t-app}$: in this case, $t = t_1\ t_2$.</p>
<p>Thanks to typechecking, we know that the environment validates $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_1: T_2 \rightarrow T_1$ and $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_2: T_2$. In this case, the resulting type of the application is $T=T_1$.</p>
<p>By the induction hypothesis, $\Gamma\vdash[x\mapsto s]t_1 : T_2 \rightarrow T_1$, and $\Gamma\vdash[x\mapsto s]t_2 : T_2$.</p>
<p>By $\ref{eq:t-app}$, the environment then also verifies the application of these two substitutions as $T$: $\Gamma\vdash[x\mapsto s]t_1\ [x\mapsto s]t_2: T$. We can factorize the substitution to obtain the conclusion, i.e. $\Gamma\vdash \left[x\mapsto s\right](t_1\ t_2): T$</p>
</li>
<li>Case $\ref{eq:t-var}$: if $t=z$ ($t$ is a simple variable $z$) where $z: T \in \bigl(\Gamma\cup (x: S)\bigr)$. There are two subcases to consider here, depending on whether $z$ is $x$ or another variable:
<ul>
<li>If $z=x$, then $\left[x\mapsto s\right] z = s$, and $T = S$ since $x: S$ is the only assumption about $x$. The required result is then $\Gamma\vdash s: S$, which is among the assumptions of the lemma</li>
<li>If $z\ne x$, then $\left[x\mapsto s\right] z = z$, and the desired result is immediate</li>
</ul>
</li>
<li>
<p>Case $\ref{eq:t-abs}$: if $t=\lambda y: T_2.\ t_1$, with $T=T_2\rightarrow T_1$, and $\bigl(\Gamma\cup (x: S)\cup (y: T_2)\bigr)\vdash t_1 : T_1$.</p>
<p>Based on our <a href="#alpha-conversion">hygiene convention</a>, we may assume $x\ne y$ and $y \notin \text{fv}(s)$.</p>
<p>Using <a href="#permutation-lemma">permutation</a> on the subderivation for the body, $\bigl(\Gamma\cup (x: S)\cup (y: T_2)\bigr)\vdash t_1 : T_1$, we obtain $\bigl(\Gamma\cup (y: T_2)\cup (x: S)\bigr)\vdash t_1 : T_1$ (we have simply changed the order of $x$ and $y$).</p>
<p>Using <a href="#weakening-lemma">weakening</a> on the other given derivation in the lemma ($\Gamma\vdash s: S$), we obtain $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash s: S$.</p>
<p>By the induction hypothesis, $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash\left[x\mapsto s\right] t_1: T_1$.</p>
<p>By $\ref{eq:t-abs}$, we have $\Gamma\vdash(\lambda y: T_2.\ [x\mapsto s]t_1): T_2 \rightarrow T_1$</p>
<p>By the definition of substitution, this is $\Gamma\vdash([x\mapsto s]\lambda y: T_2.\ t_1): T_2 \rightarrow T_1$.</p>
</li>
</ul>
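<p>The substitution lemma can also be checked on concrete terms. The Scala sketch below is a test on one example, not a proof. It relies on the hygiene convention instead of renaming bound variables, and its names (<code class="highlighter-rouge">Substitution</code>, <code class="highlighter-rouge">subst</code>, <code class="highlighter-rouge">typeOf</code>) are assumptions made for illustration.</p>

```scala
object Substitution {
  sealed trait Type
  case object Bool extends Type
  case class Fun(from: Type, to: Type) extends Type

  sealed trait Term
  case class Var(name: String) extends Term
  case class Abs(param: String, paramType: Type, body: Term) extends Term
  case class App(fn: Term, arg: Term) extends Term
  case object True extends Term

  type Env = Map[String, Type]

  def typeOf(env: Env, t: Term): Option[Type] = t match {
    case True   => Some(Bool)
    case Var(x) => env.get(x)
    case Abs(x, t1, body) =>
      typeOf(env + (x -> t1), body).map(t2 => Fun(t1, t2))
    case App(f, a) =>
      (typeOf(env, f), typeOf(env, a)) match {
        case (Some(Fun(t11, t12)), Some(ta)) if ta == t11 => Some(t12)
        case _                                            => None
      }
  }

  // [x ↦ s] t. We rely on the hygiene convention: bound names differ
  // from x and from the free variables of s, so no renaming is done.
  def subst(x: String, s: Term, t: Term): Term = t match {
    case Var(z)           => if (z == x) s else t
    case Abs(y, ty, body) => Abs(y, ty, subst(x, s, body))
    case App(f, a)        => App(subst(x, s, f), subst(x, s, a))
    case True             => True
  }
}
```

<p>Taking $\Gamma = \{f: \text{Bool}\rightarrow\text{Bool}\}$, $t = f\ x$ and $s = \text{true}$, the term <code class="highlighter-rouge">App(Var("f"), Var("x"))</code> has type <code class="highlighter-rouge">Bool</code> under $\Gamma\cup(x: \text{Bool})$, and <code class="highlighter-rouge">subst("x", True, t)</code> has the same type under $\Gamma$ alone.</p>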
<h4 id="proof">Proof</h4>
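<p>We now have the following pieces at our disposal:</p>
<ul>
<li>Weakening</li>
<li>Permutation</li>
<li>Type preservation under substitution</li>
</ul>
<p>We won’t write out the full proof of type preservation under reduction (i.e. preservation), but these pieces are what the hard subcase needs: by inversion, $\Gamma\cup(x: T_{11})\vdash t_{12}: T_{12}$ and $\Gamma\vdash v_2: T_{11}$, so the substitution lemma gives $\Gamma\vdash\left[x\mapsto v_2\right]t_{12}: T_{12}$, that is $\Gamma\vdash t’: T_{12}$.</p>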
<h3 id="erasure">Erasure</h3>
<p>Type annotations do not play any role in evaluation. In STLC, we don’t do any run-time checks, we only run compile-time type checks. Therefore, types can be removed before evaluation. This often happens in practice, where types do not appear in the compiled form of a program; they’re typically encoded in an untyped fashion. The semantics of this conversion can be formalized by an erasure function:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\text{erase}(x) & = x \\
\text{erase}(\lambda x: T_1. t_2) & = \lambda x. \text{erase}(t_2) \\
\text{erase}(t_1\ t_2) & = \text{erase}(t_1)\ \text{erase}(t_2)
\end{align} %]]></script>
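<p>The erasure function is a straightforward structural recursion. As a sketch, assuming separate (hypothetical) data types <code class="highlighter-rouge">Term</code> for typed terms and <code class="highlighter-rouge">UTerm</code> for untyped ones, it could look like this in Scala:</p>

```scala
object Erasure {
  sealed trait Type
  case object Bool extends Type
  case class Fun(from: Type, to: Type) extends Type

  // Typed terms: abstractions carry a type annotation.
  sealed trait Term
  case class Var(name: String) extends Term
  case class Abs(param: String, paramType: Type, body: Term) extends Term
  case class App(fn: Term, arg: Term) extends Term

  // Untyped terms: no annotations anywhere.
  sealed trait UTerm
  case class UVar(name: String) extends UTerm
  case class UAbs(param: String, body: UTerm) extends UTerm
  case class UApp(fn: UTerm, arg: UTerm) extends UTerm

  // One case per equation of the erase function.
  def erase(t: Term): UTerm = t match {
    case Var(x)          => UVar(x)
    case Abs(x, _, body) => UAbs(x, erase(body))
    case App(f, a)       => UApp(erase(f), erase(a))
  }
}
```

<p>Using two distinct types makes it impossible to leave an annotation behind by accident, since <code class="highlighter-rouge">UAbs</code> has no field to store one.</p>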
<h3 id="curry-howard-correspondence">Curry-Howard Correspondence</h3>
<p>The Curry-Howard correspondence tells us that there is a correspondence between propositional logic and types.</p>
<p>An implication $P\supset Q$ (which could also be written $P\implies Q$) can be proven by transforming evidence for $P$ into evidence for $Q$. A conjunction $P\land Q$ is a <a href="#pairs-1">pair</a> of evidence for $P$ and evidence for $Q$. For more examples of these correspondences, see the <a href="https://en.wikipedia.org/wiki/Brouwer–Heyting–Kolmogorov_interpretation">Brouwer–Heyting–Kolmogorov (BHK) interpretation</a>.</p>
<table>
<thead>
<tr>
<th style="text-align: left">Logic</th>
<th style="text-align: left">Programming languages</th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">Propositions</td>
<td style="text-align: left">Types</td>
</tr>
<tr>
<td style="text-align: left">$P \supset Q$</td>
<td style="text-align: left">Type $P\rightarrow Q$</td>
</tr>
<tr>
<td style="text-align: left">$P \land Q$</td>
<td style="text-align: left"><a href="#pairs-1">Pair type</a> $P\times Q$</td>
</tr>
<tr>
<td style="text-align: left">$P \lor Q$</td>
<td style="text-align: left"><a href="#sum-type">Sum type</a> $P+Q$</td>
</tr>
<tr>
<td style="text-align: left">$\exists x\in S: \phi(x)$</td>
<td style="text-align: left">Dependent type $\sum_{x: S} \phi(x)$</td>
</tr>
<tr>
<td style="text-align: left">$\forall x\in S: \phi(x)$</td>
<td style="text-align: left">$\forall (x:S): \phi(x)$</td>
</tr>
<tr>
<td style="text-align: left">Proof of $P$</td>
<td style="text-align: left">Term $t$ of type $P$</td>
</tr>
<tr>
<td style="text-align: left">$P$ is provable</td>
<td style="text-align: left">Type $P$ is inhabited</td>
</tr>
<tr>
<td style="text-align: left">Proof simplification</td>
<td style="text-align: left">Evaluation</td>
</tr>
</tbody>
</table>
<p>In Scala, all types are inhabited except for the bottom type <code class="highlighter-rouge">Nothing</code>. Singleton types are only inhabited by a single term.</p>
<p>As an example of the equivalence, we’ll see that application is equivalent to <a href="https://en.wikipedia.org/wiki/Modus_ponens">modus ponens</a>:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash t_1 : P \supset Q \quad \Gamma\vdash t_2 : P}{\Gamma\vdash t_1\ t_2 : Q}</script>
<p>This also tells us that if we can prove something, we can evaluate it.</p>
<p>How can we prove the following? Remember that $\rightarrow$ is right-associative.</p>
<script type="math/tex; mode=display">(A \land B) \rightarrow C \rightarrow ((C\land A)\land B)</script>
<p>The proof is actually a somewhat straightforward conversion to lambda calculus:</p>
<script type="math/tex; mode=display">\lambda p: A\times B.\ \lambda c: C.\ \text{pair} (\text{pair} (c\ \text{fst}(p)) \text{snd}(p))</script>
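<p>The same proof can be written as a Scala function, reading tuples as conjunctions and function types as implications. This is only a sketch: the names <code class="highlighter-rouge">CurryHoward</code> and <code class="highlighter-rouge">proof</code> are made up for illustration.</p>

```scala
object CurryHoward {
  // (A ∧ B) → C → ((C ∧ A) ∧ B)
  // Writing a total function of this type amounts to proving the formula.
  def proof[A, B, C]: ((A, B)) => C => ((C, A), B) =
    p => c => ((c, p._1), p._2)
}
```

<p>The fact that this type checks for arbitrary <code class="highlighter-rouge">A</code>, <code class="highlighter-rouge">B</code> and <code class="highlighter-rouge">C</code> is the evidence; running it on concrete values simply rearranges the components.</p>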
<h3 id="extensions-to-stlc">Extensions to STLC</h3>
<h4 id="base-types">Base types</h4>
<p>Up until now, we’ve defined our base types (such as $\text{Nat}$ and $\text{Bool}$) manually: we’ve added them to the syntax of types, with associated constants ($\text{zero}, \text{true}, \text{false}$) and operators ($\text{succ}, \text{pred}$), as well as associated typing and evaluation rules.</p>
<p>This is a lot of minutiae though, especially for theoretical discussions. For those, we can often ignore the term-level inhabitants of the base types, and just treat them as uninterpreted constants: we don’t really need the distinction between constants and values. For theory, we can just assume that some generic base types (e.g. $B$ and $C$) exist, without defining them further.</p>
<h4 id="unit-type">Unit type</h4>
<p>In C-like languages, this type is usually called <code class="highlighter-rouge">void</code>. To introduce it, we do not add any computation rules. We must only add it to the grammar, values and types, and then add a single typing rule that trivially verifies units.</p>
<script type="math/tex; mode=display">\Gamma\vdash\text{unit}:\text{Unit}
\label{eq:t-unit} \tag{T-Unit}</script>
<p>Units are not too interesting, but <em>are</em> quite useful in practice, in part because they allow for other extensions.</p>
<h4 id="sequencing">Sequencing</h4>
<p>We can define sequencing as two statements following each other:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
t1; t2</pre></td></tr></tbody></table></code></pre></figure>
<p>This implies adding some evaluation and typing rules, defined below:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \longrightarrow t_1'}{t_1;\ t_2 \longrightarrow t_1';\ t_2}
\label{eq:e-seq}\tag{E-Seq} \\ \\
(\text{unit};\ t_2) \longrightarrow t_2
\label{eq:e-seqnext}\tag{E-SeqNext} \\ \\
\frac{\Gamma\vdash t_1 : \text{Unit} \quad \Gamma\vdash t_2: T_2}{\Gamma\vdash t_1;\ t_2 : T_2}
\label{eq:t-seq}\tag{T-Seq} \\
\end{align}</script>
<p>But there’s another way that we could define sequencing: simply as syntactic sugar, a derived form for something else. In this way, we define an external language, that is transformed to an internal language by the compiler in the desugaring step.</p>
<script type="math/tex; mode=display">t_1;\ t_2 \defeq (\lambda x: \text{Unit}.\ t_2)\ t_1
\qquad \text{where } x\notin\text{ FV}(t_2)</script>
<p>This is useful to know, because it makes proving soundness much easier. We do not need to re-state the inversion lemma, or re-prove preservation and progress. We can simply rely on the proof for the underlying internal language.</p>
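<p>A desugaring step for sequencing might look like the following Scala sketch. The AST and the way fresh names are chosen (<code class="highlighter-rouge">_0</code>, <code class="highlighter-rouge">_1</code>, ...) are assumptions of this example; any name that is not free in $t_2$ would do.</p>

```scala
object SeqDesugar {
  sealed trait Type
  case object UnitT extends Type

  sealed trait Term
  case class Var(name: String) extends Term
  case class Abs(param: String, paramType: Type, body: Term) extends Term
  case class App(fn: Term, arg: Term) extends Term
  case object UnitV extends Term
  // Only exists in the external language; removed by desugar.
  case class Sequence(first: Term, second: Term) extends Term

  def freeVars(t: Term): Set[String] = t match {
    case Var(x)          => Set(x)
    case Abs(x, _, body) => freeVars(body) - x
    case App(f, a)       => freeVars(f) ++ freeVars(a)
    case UnitV           => Set.empty
    case Sequence(a, b)  => freeVars(a) ++ freeVars(b)
  }

  // t1; t2  becomes  (λx: Unit. t2) t1   where x ∉ FV(t2)
  def desugar(t: Term): Term = t match {
    case Sequence(a, b) =>
      val b2 = desugar(b)
      val fv = freeVars(b2)
      // Pick the first name among _0, _1, ... that is not free in t2.
      val fresh = Iterator.from(0).map(i => s"_$i").find(n => !fv(n)).get
      App(Abs(fresh, UnitT, b2), desugar(a))
    case Abs(x, ty, body) => Abs(x, ty, desugar(body))
    case App(f, a)        => App(desugar(f), desugar(a))
    case _                => t
  }
}
```

<p>After desugaring, no <code class="highlighter-rouge">Sequence</code> node remains, so the type checker and evaluator of the internal language never need to know about sequencing.</p>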
<h4 id="ascription">Ascription</h4>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
t as T</pre></td></tr></tbody></table></code></pre></figure>
<p>Ascription lets us ask the compiler to check that a term really has the type we claim it has.</p>
<p>The typing rule is simply:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash t_1 : T}{\Gamma\vdash t_1 \text{ as } T: T}
\label{eq:t-ascribe}\tag{T-Ascribe}</script>
<p>This seems like it preserves soundness, but instead of doing the whole proof over again, we’ll just propose a simple desugaring:</p>
<script type="math/tex; mode=display">t \text{ as } T \defeq (\lambda x: T.\ x)\ t</script>
<p>An ascription is equivalent to applying the identity function on $T$ to the term $t$.</p>
<h4 id="pairs-1">Pairs</h4>
<p>We can introduce pairs into our grammar.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
</pre></td><td class="code"><pre>t ::=
...
{t, t} // pair
t.1 // first projection
t.2 // second projection
v ::=
...
{v, v} // pair value
T ::=
...
T1 x T2 // product types</pre></td></tr></tbody></table></code></pre></figure>
<p>We can also introduce evaluation rules for pairs:</p>
<script type="math/tex; mode=display">\begin{align}
\left\{v_1, v_2\right\}.1 \longrightarrow v_1
\tag{E-PairBeta1}\label{eq:e-pairbeta1} \\ \\
\left\{v_1, v_2\right\}.2 \longrightarrow v_2
\tag{E-PairBeta2}\label{eq:e-pairbeta2} \\ \\
\frac{t_1 \longrightarrow t_1'}{t_1.1\longrightarrow t_1'.1}
\tag{E-Proj1}\label{eq:e-proj1} \\ \\
\frac{t_1 \longrightarrow t_1'}{t_1.2\longrightarrow t_1'.2}
\tag{E-Proj2}\label{eq:e-proj2} \\ \\
\frac{t_1 \longrightarrow t_1'}{\left\{t_1, t_2\right\} \longrightarrow \left\{t_1', t_2\right\}}
\tag{E-Pair1}\label{eq:e-pair1} \\ \\
\frac{t_2 \longrightarrow t_2'}{\left\{v_1, t_2\right\} \longrightarrow \left\{v_1, t_2'\right\}}
\tag{E-Pair2}\label{eq:e-pair2} \\ \\
\end{align}</script>
<p>The typing rules are then:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{
\Gamma\vdash t_1: T_1 \quad \Gamma\vdash t_2: T_2
}{
\Gamma\vdash\left\{ t_1, t_2 \right\} : T_1 \times T_2
} \label{eq:t-pair} \tag{T-Pair} \\ \\
\frac{\Gamma\vdash t_1 : T_{11}\times T_{12}}{\Gamma\vdash t_1.1:T_{11}}
\label{eq:t-proj1}\tag{T-Proj1} \\ \\
\frac{\Gamma\vdash t_1 : T_{11}\times T_{12}}{\Gamma\vdash t_1.2:T_{12}}
\label{eq:t-proj2}\tag{T-Proj2} \\ \\
\end{align}</script>
<p>Pairs have to be added “the hard way”: we do not really have a way to define them in a derived form, as we have no existing language features to piggyback onto.</p>
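<p>The evaluation rules for pairs can be sketched as a small-step function in Scala. This is an illustration under stated assumptions: the AST names are hypothetical, and the sketch fixes a left-to-right evaluation order for pair components, so the second component is only reduced once the first is a value.</p>

```scala
object Pairs {
  sealed trait Term
  case object True extends Term
  case object False extends Term
  case class If(c: Term, thn: Term, els: Term) extends Term
  case class Pair(fst: Term, snd: Term) extends Term
  case class Proj1(t: Term) extends Term
  case class Proj2(t: Term) extends Term

  def isValue(t: Term): Boolean = t match {
    case True | False => true
    case Pair(a, b)   => isValue(a) && isValue(b)
    case _            => false
  }

  def step(t: Term): Option[Term] = t match {
    case If(True, a, _)  => Some(a)
    case If(False, _, b) => Some(b)
    case If(c, a, b)     => step(c).map(If(_, a, b))
    // E-PairBeta1 / E-PairBeta2: project out of a pair value.
    case Proj1(Pair(v1, v2)) if isValue(v1) && isValue(v2) => Some(v1)
    case Proj2(Pair(v1, v2)) if isValue(v1) && isValue(v2) => Some(v2)
    // E-Proj1 / E-Proj2: reduce under the projection.
    case Proj1(t1) => step(t1).map(Proj1(_))
    case Proj2(t1) => step(t1).map(Proj2(_))
    // E-Pair1, then E-Pair2 once the first component is a value.
    case Pair(a, b) if !isValue(a) => step(a).map(Pair(_, b))
    case Pair(a, b)                => step(b).map(Pair(a, _))
    case _                         => None
  }

  // Repeatedly apply step until no rule applies.
  def eval(t: Term): Term = step(t) match {
    case Some(next) => eval(next)
    case None       => t
  }
}
```

<p>Note that a projection only fires on a fully evaluated pair: <code class="highlighter-rouge">Proj1(Pair(If(True, False, True), True))</code> first reduces the conditional inside the pair, and only then projects.</p>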
<h4 id="tuples">Tuples</h4>
<p>Tuples are like pairs, except that we do not restrict them to 2 elements; we allow an arbitrary number of elements, from 1 to n. We can use pairs to encode tuples: <code class="highlighter-rouge">(a, b, c)</code> can be encoded as <code class="highlighter-rouge">(a, (b, c))</code>. For performance and convenience, though, most languages implement them natively.</p>
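<p>The encoding of a triple as nested pairs can be written out in Scala as follows. The helper names (<code class="highlighter-rouge">encode</code>, <code class="highlighter-rouge">decode</code>, <code class="highlighter-rouge">third</code>) are made up for this sketch.</p>

```scala
object TupleEncoding {
  // (a, b, c) encoded as (a, (b, c))
  def encode[A, B, C](t: (A, B, C)): (A, (B, C)) = (t._1, (t._2, t._3))

  // The inverse direction, back to a native triple.
  def decode[A, B, C](p: (A, (B, C))): (A, B, C) = (p._1, p._2._1, p._2._2)

  // The third projection becomes "second of second" on the encoding.
  def third[A, B, C](p: (A, (B, C))): C = p._2._2
}
```

<p>Projections on the encoded form are chains of pair projections, which hints at why a native implementation is usually more convenient.</p>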
<h4 id="records">Records</h4>
<p>We can easily generalize tuples to records by annotating each field with a label. A record is a bundle of values with labels; it’s a map of labels to values and types. The order of the fields in a record doesn’t matter, the only index is the label.</p>
<p>If we allow numeric labels, then we can encode a tuple as a record, where the index implicitly encodes the numeric label of the record representation.</p>
<p>No mainstream language has language-level support for records (two case classes in Scala may have the same arguments but a different constructor, so it’s not quite the same; records are more like anonymous objects). This is because they’re often quite inefficient in practice, but we’ll still use them as a theoretical abstraction.</p>
<h3 id="sums-and-variants">Sums and variants</h3>
<h4 id="sum-type">Sum type</h4>
<p>A sum type $T = T_1 + T_2$ is a <em>disjoint</em> union of $T_1$ and $T_2$. Pragmatically, we can have sum types in Scala with case classes extending a sealed trait:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="k">sealed</span> <span class="k">trait</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">+T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">Some</span><span class="o">[</span><span class="kt">+T</span><span class="o">](</span><span class="n">value</span><span class="k">:</span> <span class="kt">T</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">T</span><span class="o">]</span>
<span class="k">case</span> <span class="k">object</span> <span class="nc">None</span> <span class="k">extends</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">Nothing</span><span class="o">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>In this example, <code class="highlighter-rouge">Option = Some + None</code>. We say that $T_1$ is on the left, and $T_2$ on the right. Disjointness is ensured by the tags $\text{inl}$ and $\text{inr}$. We can <em>think</em> of these as functions that inject into the left or right of the sum type $T$:</p>
<script type="math/tex; mode=display">\text{inl}: T_1 \rightarrow T_1 + T_2 \\
\text{inr}: T_2 \rightarrow T_1 + T_2</script>
<p>Still, these aren’t really functions; they don’t actually have function type. Instead, we use them to tag the left and right side of a sum type, respectively.</p>
<p>Another way to think of these stems from the <a href="/#curry-howard-correspondence">Curry-Howard correspondence</a>. Recall that in the <a href="https://en.wikipedia.org/wiki/Brouwer%E2%80%93Heyting%E2%80%93Kolmogorov_interpretation">BHK interpretation</a>, a proof of $P \lor Q$ is a pair <code class="highlighter-rouge"><a, b></code> where <code class="highlighter-rouge">a</code> is 0 (also denoted $\text{inl}$) and <code class="highlighter-rouge">b</code> is a proof of $P$, <em>or</em> <code class="highlighter-rouge">a</code> is 1 (also denoted $\text{inr}$) and <code class="highlighter-rouge">b</code> is a proof of $Q$.</p>
<p>To use elements of a sum type, we can introduce a <code class="highlighter-rouge">case</code> construct that lets us pattern-match on a sum type, and thus distinguish the left type from the right one.</p>
<p>We need to introduce these three special forms in our syntax:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
</pre></td><td class="code"><pre>t ::= ... // terms
inl t // tagging (left)
inr t // tagging (right)
case t of inl x => t | inr x => t // case
v ::= ... // values
inl v // tagged value (left)
inr v // tagged value (right)
T ::= ... // types
T + T // sum type</pre></td></tr></tbody></table></code></pre></figure>
<p>This also leads us to introduce some new evaluation rules:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\begin{rcases}
\text{case } (& \text{inl } v_0) \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2 \\
\end{rcases} \longrightarrow [x_1 \mapsto v_0] t_1
\label{eq:e-caseinl}\tag{E-CaseInl} \\ \\
\begin{rcases}
\text{case } (& \text{inr } v_0) \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2 \\
\end{rcases} \longrightarrow [x_2 \mapsto v_0] t_2
\label{eq:e-caseinr}\tag{E-CaseInr} \\ \\
\frac{t_0 \longrightarrow t_0'}{
\begin{rcases}
\text{case } & t_0 \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2
\end{rcases} \longrightarrow \begin{cases}
\text{case } & t_0' \text{ of} \\
& \text{inl } x_1 \Rightarrow t_1 \ \mid \\
& \text{inr } x_2 \Rightarrow t_2
\end{cases}
} \label{eq:e-case}\tag{E-Case} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{inl }t_1 \longrightarrow \text{inl }t_1'}
\label{eq:e-inl}\tag{E-Inl} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{inr }t_1 \longrightarrow \text{inr }t_1'}
\label{eq:e-inr}\tag{E-Inr} \\ \\
\end{align} %]]></script>
<p>And we’ll also introduce three typing rules:</p>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t_1 : T_1}{\Gamma\vdash\text{inl } t_1 : T_1 + T_2}
\label{eq:t-inl}\tag{T-Inl} \\ \\
\frac{\Gamma\vdash t_1 : T_2}{\Gamma\vdash\text{inr } t_1 : T_1 + T_2}
\label{eq:t-inr}\tag{T-Inr} \\ \\
\frac{
\Gamma\vdash t_0 : T_1 + T_2 \quad
\Gamma\cup(x_1: T_1) \vdash t_1 : T \quad
\Gamma\cup(x_2: T_2) \vdash t_2 : T
}{
\Gamma\vdash\text{case } t_0 \text{ of inl } x_1 \Rightarrow t_1 \mid \text{inr } x_2 \Rightarrow t_2 : T
}
\label{eq:t-case}\tag{T-Case} \\
\end{align}</script>
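<p>For intuition, Scala’s standard <code class="highlighter-rouge">Either[A, B]</code> behaves like the sum type $A + B$, with <code class="highlighter-rouge">Left</code> in the role of $\text{inl}$ and <code class="highlighter-rouge">Right</code> in the role of $\text{inr}$. The function below is a made-up example of the <code class="highlighter-rouge">case</code> construct; as in $\ref{eq:t-case}$, both branches must produce the same type (here <code class="highlighter-rouge">String</code>).</p>

```scala
object Sums {
  // Either[Int, Boolean] plays the role of Int + Boolean.
  def describe(x: Either[Int, Boolean]): String = x match {
    case Left(n)  => s"number $n"  // inl x1 => t1
    case Right(b) => s"boolean $b" // inr x2 => t2
  }
}
```

<p>Each branch binds the payload of the tag it matches, just as the evaluation rules substitute $v_0$ for $x_1$ or $x_2$.</p>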
<h4 id="sums-and-uniqueness-of-type">Sums and uniqueness of type</h4>
<p>The rules $\ref{eq:t-inl}$ and $\ref{eq:t-inr}$ may seem confusing at first. We only have one type to deduce from, so what do we assign to $T_2$ and $T_1$, respectively? These rules mean that we have lost uniqueness of types: if $t$ has type $T$, then $\text{inl } t$ has type $T+U$ <strong>for every</strong> $U$.</p>
<p>There are a couple of solutions to this:</p>
<ol>
<li>We can infer $U$ as needed during typechecking</li>
<li>Give constructors different names and only allow each name to appear in one sum type. This requires generalization to <a href="#variants">variants</a>, which we’ll see next. OCaml adopts this solution.</li>
<li>Annotate each inl and inr with the intended sum type.</li>
</ol>
<p>For now, we don’t want to look at type inference and variants, so we’ll choose the third approach for simplicity. We’ll introduce these annotations as ascriptions on the injection operators in our grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre>t ::=
...
inl t as T
inr t as T
v ::=
...
inl v as T
inr v as T</pre></td></tr></tbody></table></code></pre></figure>
<p>The evaluation rules would be exactly the same as previously, but with ascriptions in the syntax. The only difference is that the injection operators now also specify <em>which</em> sum type we’re injecting into, for the sake of uniqueness of type.</p>
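<p>With the annotation in place, the type of an injection can be read off directly, so each term has at most one type again. A minimal Scala sketch of the corresponding typing function, with hypothetical names (<code class="highlighter-rouge">AnnotatedSums</code>, <code class="highlighter-rouge">Inl</code>, <code class="highlighter-rouge">Inr</code>, <code class="highlighter-rouge">annot</code>) and only two base terms:</p>

```scala
object AnnotatedSums {
  sealed trait Type
  case object Bool extends Type
  case object UnitT extends Type
  case class Sum(left: Type, right: Type) extends Type

  sealed trait Term
  case object True extends Term
  case object UnitV extends Term
  case class Inl(t: Term, annot: Type) extends Term // inl t as T
  case class Inr(t: Term, annot: Type) extends Term // inr t as T

  def typeOf(t: Term): Option[Type] = t match {
    case True  => Some(Bool)
    case UnitV => Some(UnitT)
    // inl t1 as T1+T2 is well typed only if t1 : T1.
    case Inl(t1, Sum(l, r)) if typeOf(t1) == Some(l) => Some(Sum(l, r))
    // inr t1 as T1+T2 is well typed only if t1 : T2.
    case Inr(t1, Sum(l, r)) if typeOf(t1) == Some(r) => Some(Sum(l, r))
    // Wrong payload type, or an annotation that is not a sum type.
    case _ => None
  }
}
```

<p>The checker never has to guess the other summand: it is part of the annotation, and the payload is simply checked against the matching side.</p>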
<h4 id="variants">Variants</h4>
<p>Just as we generalized binary products to labeled records, we can generalize binary sums to labeled variants. We can label the members of the sum type, so that we write $\langle l_1: T_1, l_2: T_2 \rangle$ instead of $T_1 + T_2$ ($l_1$ and $l_2$ are the labels).</p>
<p>As a motivating example, we’ll show a useful idiom that is possible with variants, the optional value. We’ll use this to create a table. The example below is just like in OCaml.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
</pre></td><td class="code"><pre><span class="nc">OptionalNat</span> <span class="k">=</span> <span class="o"><</span><span class="n">none</span><span class="k">:</span> <span class="kt">Unit</span><span class="o">,</span> <span class="n">some</span><span class="k">:</span> <span class="kt">Nat></span><span class="o">;</span>
<span class="nc">Table</span> <span class="k">=</span> <span class="nc">Nat</span> <span class="o">-></span> <span class="nc">OptionalNat</span><span class="o">;</span>
<span class="n">emptyTable</span> <span class="k">=</span> <span class="n">λt</span><span class="k">:</span> <span class="kt">Nat.</span> <span class="kt"><none</span><span class="o">=</span><span class="n">unit</span><span class="o">></span> <span class="n">as</span> <span class="nc">OptionalNat</span><span class="o">;</span>
<span class="n">extendTable</span> <span class="k">=</span>
<span class="n">λt</span><span class="k">:</span> <span class="kt">Table.</span> <span class="kt">λkey:</span> <span class="kt">Nat.</span> <span class="kt">λval:</span> <span class="kt">Nat.</span>
<span class="kt">λsearch:</span> <span class="kt">Nat.</span>
<span class="kt">if</span> <span class="o">(</span><span class="kt">equal</span> <span class="kt">search</span> <span class="kt">key</span><span class="o">)</span> <span class="kt">then</span> <span class="kt"><some</span><span class="o">=</span><span class="k">val</span><span class="o">></span> <span class="n">as</span> <span class="nc">OptionalNat</span>
<span class="k">else</span> <span class="o">(</span><span class="n">t</span> <span class="n">search</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The implementation works a bit like a linked list, with linear look-up. We can use the result from the table by distinguishing the outcome with a <code class="highlighter-rouge">case</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre><span class="n">x</span> <span class="k">=</span> <span class="k">case</span> <span class="n">t</span><span class="o">(</span><span class="mi">5</span><span class="o">)</span> <span class="n">of</span>
<span class="o"><</span><span class="n">none</span><span class="k">=</span><span class="n">u</span><span class="o">></span> <span class="k">=></span> <span class="mi">999</span>
<span class="o">|</span> <span class="o"><</span><span class="n">some</span><span class="k">=</span><span class="n">v</span><span class="o">></span> <span class="k">=></span> <span class="n">v</span></pre></td></tr></tbody></table></code></pre></figure>
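<p>The same idea can be sketched in Scala, with <code class="highlighter-rouge">Option[Int]</code> standing in for <code class="highlighter-rouge">OptionalNat</code>. This is only an illustration under that assumption; the names <code class="highlighter-rouge">TableSketch</code> and <code class="highlighter-rouge">lookupOrDefault</code> are made up for the example.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object TableSketch {
  // A table is just a function from keys to optional values.
  type Table = Int => Option[Int]

  // The empty table answers None for every key.
  val emptyTable: Table = _ => None

  // Extending wraps the old table in a new function; a lookup that misses
  // the newest key falls through to the older layers (hence linear look-up).
  def extendTable(t: Table, key: Int, value: Int): Table =
    search => if (search == key) Some(value) else t(search)

  // Mirrors the `case ... of` above, with 999 as the fallback.
  def lookupOrDefault(t: Table, key: Int): Int = t(key) match {
    case None    => 999
    case Some(v) => v
  }

  def main(args: Array[String]): Unit = {
    val t = extendTable(extendTable(emptyTable, 5, 42), 7, 1)
    println(lookupOrDefault(t, 5)) // 42
    println(lookupOrDefault(t, 6)) // 999
  }
}</code></pre></figure>
<p>Each call to <code class="highlighter-rouge">extendTable</code> adds one closure around the previous table, which is why a lookup may have to walk through every earlier extension.</p>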
<h3 id="recursion">Recursion</h3>
<p>In STLC, all programs terminate. We won’t go into too much detail on this topic, but the main idea is that evaluation of a well-typed program is guaranteed to halt; we say that the well-typed terms are <em>normalizable</em>.</p>
<p>Indeed, the infinite recursions from untyped lambda calculus (terms like $\text{omega}$ and $\text{fix}$) are not typable, and thus cannot appear in STLC. Since we can’t express $\text{fix}$ as a term in STLC, we add it as a primitive to get recursion.</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
</pre></td><td class="code"><pre>t ::=
...
fix t</pre></td></tr></tbody></table></code></pre></figure>
<p>We’ll need to add evaluation rules recreating its behavior, and a typing rule that restricts its use to the intended use-case.</p>
<script type="math/tex; mode=display">\begin{align}
\text{fix } (\lambda x: T_1.\ t_2) \longrightarrow \left[
x\mapsto (\text{fix }(\lambda x: T_1.\ t_2))
\right] t_2
\label{eq:e-fixbeta}\tag{E-FixBeta} \\ \\
\frac{t_1 \longrightarrow t_1'}{\text{fix }t_1 \longrightarrow \text{fix }t_1'}
\label{eq:e-fix}\tag{E-Fix} \\ \\
\frac{\Gamma\vdash t_1 : T_1 \rightarrow T_1}{\Gamma\vdash\text{fix }t_1:T_1}
\label{eq:t-fix}\tag{T-Fix}
\end{align}</script>
<p>In order for a function to be recursive, the function needs to map a type to the same type, hence the restriction of $T_1 \rightarrow T_1$. The type $T_1$ will itself be a function type if we’re doing a recursion. Still, note that the type system doesn’t enforce this. There will actually be situations in which it is handy to use something other than a function type inside a fix operator.</p>
<p>Seeing that this fixed-point notation can be a little involved, we can introduce some nice syntactic sugar to work with it:</p>
<script type="math/tex; mode=display">\text{letrec } x: T_1 = t_1 \text{ in } t_2
\quad \defeq \quad
\text{let } x = \text{fix } (\lambda x: T_1.\ t_1) \text{ in } t_2</script>
<p>This $t_1$ can now refer to the $x$; that’s the convenience offered by the construct. Although we don’t strictly need to introduce typing rules (it’s syntactic sugar, we’re relying on existing constructs), a typing rule for this could be:</p>
<script type="math/tex; mode=display">\frac{\Gamma\cup(x:T_1)\vdash t_1:T_1 \quad \Gamma\cup(x: T_1)\vdash t_2:T_2}{\Gamma\vdash\text{letrec } x: T_1 = t_1 \text{ in } t_2:T_2}</script>
<p>In Scala, a common error message is that a recursive function needs an explicit return type, for the same reasons as the typing rule above.</p>
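<p>As a rough Scala analogue (a sketch, not part of the calculus), a <code class="highlighter-rouge">fix</code> combinator can be written with a type that mirrors $\ref{eq:t-fix}$ for $T_1 = A \rightarrow B$. The names <code class="highlighter-rouge">FixSketch</code> and <code class="highlighter-rouge">factorial</code> are chosen for the example only.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object FixSketch {
  // fix takes a generator of type (A => B) => (A => B) and ties the knot.
  // The eta-expansion (a => ...) delays the unfolding, which a call-by-value
  // language needs; the explicit result type is required because fix is recursive.
  def fix[A, B](f: (A => B) => (A => B)): A => B =
    a => f(fix(f))(a)

  // The generator never refers to itself; it recurses through `self`.
  val factorial: Int => Int =
    fix[Int, Int](self => n => if (n == 0) 1 else n * self(n - 1))

  def main(args: Array[String]): Unit =
    println(factorial(5)) // 120
}</code></pre></figure>
<p>Removing the result type <code class="highlighter-rouge">A => B</code> from <code class="highlighter-rouge">fix</code> makes the Scala compiler reject the definition, which is the error message mentioned above.</p>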
<h3 id="references">References</h3>
<h4 id="mutability">Mutability</h4>
<p>In most programming languages, variables are (or can be) mutable. That is, variables can provide a name referring to a previously calculated value, as well as a way of overwriting this value with another (under the same name). How can we model this in STLC?</p>
<p>Some languages (e.g. OCaml) actually formally separate variables from mutation. In OCaml, variables are only for naming; the binding between a variable and a value is immutable. However, there is the concept of <em>mutable values</em>, also called <em>reference cells</em> or <em>references</em>. This is the style we’ll study, as it is easier to work with formally. A mutable value is represented at the type level as a <code class="highlighter-rouge">Ref T</code> (or perhaps even a <code class="highlighter-rouge">Ref(Option T)</code>, since the null pointer cannot produce a value).</p>
<p>The basic operations are allocation with the <code class="highlighter-rouge">ref</code> operator, dereferencing with <code class="highlighter-rouge">!</code> (in C, we use the <code class="highlighter-rouge">*</code> prefix), and assignment with <code class="highlighter-rouge">:=</code>, which updates the content of the reference cell. Assignment returns a <code class="highlighter-rouge">unit</code> value.</p>
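<p>To make the three operations concrete, here is a minimal Scala sketch of a reference cell. The class <code class="highlighter-rouge">Ref</code> and the helper <code class="highlighter-rouge">ref</code> are assumptions made for this illustration; Scala itself has no built-in <code class="highlighter-rouge">Ref</code> type.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object RefSketch {
  // A mutable cell holding a value of type T (the `Ref T` of the text).
  final class Ref[T](private var contents: T) {
    def unary_! : T = contents             // dereference: !r
    def :=(v: T): Unit = { contents = v }  // assignment: r := v, returns unit
  }

  // Allocation, mirroring the `ref` operator.
  def ref[T](v: T): Ref[T] = new Ref(v)

  def main(args: Array[String]): Unit = {
    val r = ref(5)
    println(!r) // 5
    r := 7
    println(!r) // 7
  }
}</code></pre></figure>
<p>Note that <code class="highlighter-rouge">r</code> itself is an immutable <code class="highlighter-rouge">val</code>: the binding never changes, only the contents of the cell it names.</p>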
<h4 id="aliasing">Aliasing</h4>
<p>Two variables can reference the same cell: we say that they are <em>aliases</em> for the same cell. Aliasing is when we have different references (under different names) to the same cell. Modifying the value of the reference cell through one alias modifies the value for all other aliases.</p>
<p>The possibility of aliasing is all around us, in object references, explicit pointers (in C), arrays, communication channels, I/O devices; there’s practically no way around it. Yet, alias analysis is quite complex and costly, and often makes it hard for compilers to do optimizations they would like to do.</p>
<p>With mutability, the order of operations now matters; <code class="highlighter-rouge">r := 1; r := 2</code> isn’t the same as <code class="highlighter-rouge">r := 2; r := 1</code>. If we recall the <a href="#confluence-in-full-beta-reduction">Church-Rosser theorem</a>, we’ve lost the principle that all reduction paths lead to the same result. Therefore, some language designers disallow it (Haskell). But there are benefits to allowing it, too: efficiency, dependency-driven data flow (e.g. in GUI), shared resources for concurrency (locks), etc. Therefore, most languages provide it.</p>
<p>Still, languages without mutability have come up with a bunch of abstractions that allow us to have some of the benefits of mutability, like monads and lenses.</p>
<h4 id="typing-rules-1">Typing rules</h4>
<p>We’ll introduce references as a type <code class="highlighter-rouge">Ref T</code> to represent a variable of type <code class="highlighter-rouge">T</code>. We can construct a reference as <code class="highlighter-rouge">r = ref 5</code>, and access the contents of the reference using <code class="highlighter-rouge">!r</code> (this would return <code class="highlighter-rouge">5</code> instead of <code class="highlighter-rouge">ref 5</code>).</p>
<p>Let’s define references in our language:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
</pre></td><td class="code"><pre>t ::= // terms
unit // unit constant
x // variable
λx: T. t // abstraction
t t // application
ref t // reference creation
!t // dereference
t := t // assignment</pre></td></tr></tbody></table></code></pre></figure>
<script type="math/tex; mode=display">\begin{align}
\frac{\Gamma\vdash t_1 : T_1}{\Gamma\vdash \text{ref } t_1 : \text{Ref } T_1}
\label{eq:t-ref}\tag{T-Ref} \\ \\
\frac{\Gamma\vdash t_1: \text{Ref } T_1}{\Gamma\vdash !t_1 : T_1}
\label{eq:t-deref}\tag{T-Deref} \\ \\
\frac{\Gamma\vdash t_1 : \text{Ref } T_1 \quad \Gamma\vdash t_2: T_1}{\Gamma\vdash t_1 := t_2 : \text{Unit}}
\label{eq:t-assign}\tag{T-Assign} \\ \\
\end{align}</script>
<h4 id="evaluation-1">Evaluation</h4>
<p>What is the <em>value</em> of <code class="highlighter-rouge">ref 0</code>? The crucial observation is that evaluating <code class="highlighter-rouge">ref 0</code> must <em>do</em> something. Otherwise, the following two snippets would behave the same:</p>
<figure class="highlight"><pre><code class="language-c" data-lang="c"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
</pre></td><td class="code"><pre><span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>

<span class="n">r</span> <span class="o">=</span> <span class="n">ref</span> <span class="mi">0</span>
<span class="n">s</span> <span class="o">=</span> <span class="n">r</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Evaluating <code class="highlighter-rouge">ref 0</code> should allocate some storage, and return a reference (or pointer) to that storage. A reference names a location in the <strong>store</strong> (also known as the <em>heap</em>, or just <em>memory</em>). Concretely, the store could be an array of 8-bit bytes, indexed by 32-bit integers. More abstractly, it’s an array of values, or even more abstractly, a partial function from locations to values.</p>
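<p>The difference between the two snippets can be sketched in Scala by modelling the store explicitly as a finite map. Everything here (<code class="highlighter-rouge">StoreSketch</code>, <code class="highlighter-rouge">alloc</code>, integers as locations) is a hypothetical model for illustration, not a definition from the calculus.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object StoreSketch {
  type Location = Int
  // The store as a finite map: a partial function from locations to values.
  type Store = Map[Location, Int]

  // Allocation picks a location outside dom(store) and returns it with the grown store.
  // Using store.size is fresh only because locations are handed out as 0, 1, 2, ...
  // and never removed.
  def alloc(store: Store, v: Int): (Location, Store) = {
    val l = store.size
    (l, store + (l -> v))
  }

  def main(args: Array[String]): Unit = {
    // r = ref 0; s = ref 0 : two distinct cells
    val (r, store1) = alloc(Map.empty, 0)
    val (s, store2) = alloc(store1, 0)
    println(r == s) // false

    // r = ref 0; s = r : one cell, two names
    val (r2, store3) = alloc(Map.empty, 0)
    val s2 = r2
    val store4 = store3 + (s2 -> 1) // assign through the alias s2
    println(store4(r2)) // 1, the update is visible through r2 as well
  }
}</code></pre></figure>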
<p>We can introduce this idea of locations in our syntax. This syntax is exactly the same as the previous one, but adds the notion of locations:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr"><table class="rouge-table"><tbody><tr><td class="gutter gl"><pre class="lineno">1
2
3
4
5
6
7
8
9
10
11
12
13
14
</pre></td><td class="code"><pre>v ::= // values
unit // unit constant
λx: T. t // abstraction value
l // store location
t ::= // terms
unit // unit constant
x // variable
λx: T. t // abstraction
t t // application
ref t // reference creation
!t // dereference
t := t // assignment
l // store location </pre></td></tr></tbody></table></code></pre></figure>
<p>This doesn’t mean that we’ll allow programmers to write explicit locations in their programs. We just use this as a modeling trick; we’re enriching the internal language to include some run-time structures.</p>
<p>With this added notion of stores and locations, the result of an evaluation now depends on the store in which it is evaluated, which we need to reflect in our evaluation rules. Evaluation must now include terms $t$ <strong>and</strong> store $\mu$:</p>
<script type="math/tex; mode=display">t \mid \mu \longrightarrow t' \mid \mu'</script>
<p>Let’s take a look at the evaluation rules for STLC with references, operator by operator.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1'\mid\mu'}{t_1 := t_2 \mid \mu \longrightarrow t_1' := t_2 \mid \mu'}
\label{eq:e-assign1}\tag{E-Assign1} \\ \\
\frac{t_2 \mid \mu \longrightarrow t_2'\mid\mu'}{t_1 := t_2 \mid \mu \longrightarrow t_1 := t_2' \mid \mu'}
\label{eq:e-assign2}\tag{E-Assign2} \\ \\
l := v_2 \mid \mu \longrightarrow \text{unit}\mid[l\mapsto v_2]\mu
\label{eq:e-assign}\tag{E-Assign} \\ \\
\end{align}</script>
<p>The rules $\ref{eq:e-assign1}$ and $\ref{eq:e-assign2}$ evaluate the two sides of an assignment until they become values. Once both have been reduced, we can do the actual assignment: as per $\ref{eq:e-assign}$, we update the store and return <code class="highlighter-rouge">unit</code>.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1' \mid \mu'}{\text{ref } t_1 \mid \mu \longrightarrow \text{ref } t_1' \mid \mu'}
\label{eq:e-ref}\tag{E-Ref} \\ \\
\frac{l \notin \text{dom}(\mu)}{\text{ref } v_1 \mid \mu \longrightarrow l \mid (\mu \cup (l\mapsto v_1))}
\label{eq:e-refv}\tag{E-RefV}
\end{align}</script>
<p>A reference $\text{ref }t_1$ first evaluates $t_1$ until it is a value ($\ref{eq:e-ref}$). To evaluate the reference operator, we find a fresh location $l$ in the store, to which it binds $v_1$, and it returns the location $l$.</p>
<script type="math/tex; mode=display">\begin{align}
\frac{t_1 \mid \mu \longrightarrow t_1' \mid \mu'}{!t_1 \mid \mu \longrightarrow !t_1' \mid \mu'}
\label{eq:e-deref}\tag{E-Deref} \\ \\
\frac{\mu(l) = v}{!l\mid\mu \longrightarrow v\mid\mu}
\label{eq:e-derefloc}\tag{E-DerefLoc}
\end{align}</script>
<p>We find the same congruence rule as usual in $\ref{eq:e-deref}$, where a term $!t_1$ first evaluates $t_1$ until it is a value. Once it is a value, we can return the value in the current store using $\ref{eq:e-derefloc}$.</p>
<p>The evaluation rules for abstraction and application are augmented with stores, but otherwise unchanged.</p>
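<p>The store-passing rules above can be sketched as a small-step evaluator in Scala. This is a minimal illustration under a few assumptions: abstraction and application are left out, numerals (<code class="highlighter-rouge">Num</code>) are added so that there is something to store, fresh locations are chosen as <code class="highlighter-rouge">mu.size</code>, and the left side of an assignment is reduced before the right side. All names are made up for the example.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object RefEval {
  sealed trait Term
  case object UnitV extends Term
  case class Num(n: Int) extends Term // assumption: numerals, for the demo only
  case class Loc(l: Int) extends Term
  case class Ref(t: Term) extends Term
  case class Deref(t: Term) extends Term
  case class Assign(t1: Term, t2: Term) extends Term

  type Store = Map[Int, Term]

  def isValue(t: Term): Boolean = t match {
    case UnitV | Num(_) | Loc(_) => true
    case _ => false
  }

  // One evaluation step on a (term, store) pair; None when no rule applies.
  def step(t: Term, mu: Store): Option[(Term, Store)] = t match {
    case Ref(v) if isValue(v) => // E-RefV
      val l = mu.size // fresh as long as locations are only ever created here
      Some((Loc(l), mu + (l -> v)))
    case Ref(t1) => // E-Ref
      step(t1, mu).map { case (t1p, mup) => (Ref(t1p), mup) }
    case Deref(Loc(l)) => // E-DerefLoc
      mu.get(l).map(v => (v, mu))
    case Deref(t1) => // E-Deref
      step(t1, mu).map { case (t1p, mup) => (Deref(t1p), mup) }
    case Assign(Loc(l), v) if isValue(v) => // E-Assign
      if (mu.contains(l)) Some((UnitV, mu.updated(l, v))) else None
    case Assign(t1, t2) if !isValue(t1) => // E-Assign1
      step(t1, mu).map { case (t1p, mup) => (Assign(t1p, t2), mup) }
    case Assign(v1, t2) => // E-Assign2
      step(t2, mu).map { case (t2p, mup) => (Assign(v1, t2p), mup) }
    case _ => None
  }

  // Iterate step until no rule applies.
  @annotation.tailrec
  def eval(t: Term, mu: Store): (Term, Store) = step(t, mu) match {
    case Some((t2, mu2)) => eval(t2, mu2)
    case None => (t, mu)
  }

  def main(args: Array[String]): Unit = {
    println(eval(Deref(Ref(Num(0))), Map.empty))          // (Num(0),Map(0 -> Num(0)))
    println(eval(Assign(Ref(Num(0)), Num(1)), Map.empty)) // (UnitV,Map(0 -> Num(1)))
  }
}</code></pre></figure>
<p>Every case returns the store alongside the term, which is exactly what the judgment $t \mid \mu \longrightarrow t' \mid \mu'$ expresses: even the congruence cases must thread through whatever store changes the sub-step made.</p>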
<h4 id="store-typing">Store typing</h4>
<p>What is the type of a location? The answer to this depends on what is in the store. Unless we specify it, a store could contain anything at a given location, which is problematic for typechecking. The solution is to type the locations themselves. This leads us to a typed store:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mu = (& l_1 \mapsto \text{Nat}, \\
& l_2 \mapsto \lambda x: \text{Unit}. x)
\end{align} %]]></script>
<p>As a first attempt at a typing rule, we can just say that the type of a location is given by the type of the value in the store at that location:</p>
<script type="math/tex; mode=display">\frac{\Gamma\vdash\mu(l) : T_1}{\Gamma\vdash l : \text{Ref } T_1}</script>
<p>This is problematic though; in the following, the typing derivation for $!l_2$ would be infinite because we have a cyclic reference:</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
\mu =\ (& l_1 \mapsto \lambda x: \text{Nat}.\ !l_2\ x, \\
& l_2 \mapsto \lambda x: \text{Nat}.\ !l_1\ x)
\end{align} %]]></script>
<p>The core of the problem here is that we would need to recompute the type of a location every time. But this shouldn’t be necessary. Seeing that references are strongly typed as <code class="highlighter-rouge">Ref T</code>, we know exactly what type of value we can place in a given store location. Indeed, the typing rules we chose for references guarantee that a given location in the store is always used to hold values of the same type.</p>
<p>So to fix this problem, we need to introduce a <strong>store typing</strong>. This is a partial function from locations to types, which we’ll denote by $\Sigma$.</p>
<p>Suppose we’re given a store typing $\Sigma$ describing the store $\mu$. We can use $\Sigma$ to look up the types of locations, without doing a lookup in $\mu$:</p>
<script type="math/tex; mode=display">\frac{\Sigma(l) = T_1}{\Gamma\mid\Sigma\vdash l : \text{Ref } T_1}
\label{eq:t-loc}\tag{T-Loc}</script>
<p>This tells us how to check the store typing, but how do we create it? We can start with an empty typing $\Sigma = \emptyset$, and add a typing relation with the type of $v_1$ when a new location is created during evaluation of $\ref{eq:e-refv}$.</p>
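<p>A type checker using a store typing could be sketched as follows. This is an illustrative fragment under stated assumptions: variables, abstraction and application are omitted (so $\Gamma$ does not appear), numerals of type <code class="highlighter-rouge">Nat</code> are added for the demo, and all names are hypothetical.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object StoreTyping {
  sealed trait Type
  case object UnitT extends Type
  case object NatT extends Type
  case class RefT(elem: Type) extends Type

  sealed trait Term
  case object UnitV extends Term
  case class Num(n: Int) extends Term // assumption: numerals of type Nat
  case class Loc(l: Int) extends Term
  case class Ref(t: Term) extends Term
  case class Deref(t: Term) extends Term
  case class Assign(t1: Term, t2: Term) extends Term

  type StoreTy = Map[Int, Type] // Sigma: a partial function from locations to types

  // Returns None when no typing rule applies.
  def typeOf(t: Term, sigma: StoreTy): Option[Type] = t match {
    case UnitV => Some(UnitT)
    case Num(_) => Some(NatT)
    case Loc(l) => sigma.get(l).map(ty => RefT(ty)) // T-Loc: no look-up in the store
    case Ref(t1) => typeOf(t1, sigma).map(ty => RefT(ty)) // T-Ref
    case Deref(t1) => typeOf(t1, sigma) match { // T-Deref
      case Some(RefT(ty)) => Some(ty)
      case _ => None
    }
    case Assign(t1, t2) => // T-Assign
      (typeOf(t1, sigma), typeOf(t2, sigma)) match {
        case (Some(RefT(ty1)), Some(ty2)) if ty1 == ty2 => Some(UnitT)
        case _ => None
      }
  }

  def main(args: Array[String]): Unit = {
    val sigma: StoreTy = Map(0 -> NatT)
    println(typeOf(Deref(Loc(0)), sigma))         // Some(NatT)
    println(typeOf(Assign(Loc(0), UnitV), sigma)) // None
  }
}</code></pre></figure>
<p>Because <code class="highlighter-rouge">typeOf</code> never inspects the store itself, cyclic stores like the one above no longer cause an infinite derivation.</p>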
<p>The rest of the typing rules remain the same, but are augmented with the store typing. So in conclusion, we have updated our evaluation rules with a <em>store</em> $\mu$, and our typing rules with a <em>store typing</em> $\Sigma$.</p>
<h4 id="safety">Safety</h4>
<p>Let’s take a look at progress and preservation in this new type system. Preservation turns out to be more interesting, so let’s look at that first.</p>
<p>We’ve added a store and a store typing, so the statement of preservation needs to mention both. Naively, we’d write:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma\vdash t: T \text{ and }
t\mid\mu\longrightarrow t'\mid\mu'
\quad \implies \quad
\Gamma\mid\Sigma\vdash t': T</script>
<p>But this would be wrong! In this statement, $\Sigma$ and $\mu$ would not be constrained to be correlated at all, which they need to be. This constraint can be defined as follows:</p>
<p>A store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$ (which we denote by $\Gamma\mid\Sigma\vdash\mu$) if the following is satisfied:</p>
<script type="math/tex; mode=display">\text{dom}(\mu) = \text{dom}(\Sigma)
\quad \text{and} \quad
\Gamma\mid\Sigma\vdash\mu(l) : \Sigma(l),\ \forall l\in\text{dom}(\mu)</script>
<p>This gets us closer, and we can write the following preservation statement:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma \vdash t : T \text{ and }
t\mid\mu \longrightarrow t'\mid\mu' \text{ and }
\Gamma\mid\Sigma \vdash \mu
\quad \implies \quad
\Gamma\mid\Sigma\vdash t' : T</script>
<p>But this is still wrong! When we create a new cell with $\ref{eq:e-refv}$, we would break the correspondence between store typing and store.</p>
<p>The correct version of the preservation theorem is the following:</p>
<script type="math/tex; mode=display">\Gamma\mid\Sigma \vdash t : T \text{ and }
t\mid\mu \longrightarrow t'\mid\mu' \text{ and }
\Gamma\mid\Sigma \vdash \mu
\quad \implies \quad
\text{for some } \Sigma' \supseteq \Sigma, \;\;
\Gamma\mid\Sigma'\vdash t' : T</script>
<p>This preservation theorem just asserts that there is <em>some</em> store typing $\Sigma' \supseteq \Sigma$ (agreeing with $\Sigma$ on all old locations, but possibly containing new ones), such that $t'$ is well typed in $\Sigma'$.</p>
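<p>As a small worked instance of why the store typing has to be allowed to grow (a sketch, using the <code class="highlighter-rouge">Nat</code> type from the earlier examples), consider allocating a cell in the empty store:</p>
<script type="math/tex; mode=display">\text{ref } 0 \mid \emptyset \longrightarrow l \mid (l \mapsto 0)
\qquad
\Sigma = \emptyset, \quad \Sigma' = (l \mapsto \text{Nat})</script>
<p>Before the step, $\emptyset\mid\Sigma\vdash \text{ref } 0 : \text{Ref Nat}$ holds by $\ref{eq:t-ref}$. After the step the term is the bare location $l$, which has no type under $\Sigma = \emptyset$ because $\ref{eq:t-loc}$ needs $\Sigma(l)$ to be defined. Under the extended $\Sigma'$ we get $\emptyset\mid\Sigma'\vdash l : \text{Ref Nat}$, so the type is preserved.</p>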
<p>The progress theorem must also be extended with stores and store typings:</p>
<p>Suppose that $t$ is a closed, well-typed term; that is, $\emptyset\mid\Sigma\vdash t: T$ for some type $T$ and some store typing $\Sigma$. Then either $t$ is a value or else, for any store $\mu$ such that $\emptyset\mid\Sigma\vdash\mu$<sup id="fnref:well-typed-store-notation"><a href="#fn:well-typed-store-notation" class="footnote">2</a></sup>, there is some term $t'$ and store $\mu'$ with $t\mid\mu \longrightarrow t'\mid\mu'$.</p>
<div class="footnotes">
<ol>
<li id="fn:in-relation-notation">
<p>$(t, C) \in \text{Consts}$ is equivalent to $\text{Consts}(t) = C$ <a href="#fnref:in-relation-notation" class="reversefootnote">↩</a></p>
</li>
<li id="fn:well-typed-store-notation">
<p>Recall that this notation is used to say a store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$, as defined in the section on <a href="#safety">safety in STLC with stores</a>. <a href="#fnref:well-typed-store-notation" class="reversefootnote">↩</a></p>
</li>
</ol>
</div>
<p>⚠ Work in progress</p>
<h2>Writing a parser with parser combinators</h2>
<p>In Scala, you can (ab)use operator overloading to create an embedded DSL (EDSL) for grammars. A grammar may look as follows in a grammar description language (Bison, Yacc, ANTLR, …):</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">Expr ::= Term {'+' Term | '-' Term}
Term ::= Factor {'*' Factor | '/' Factor}
Factor ::= Number | '(' Expr ')'</code></pre></figure>
<p>In Scala, we can model it as follows:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLit</code></pre></figure>
<p>This is perhaps a little less elegant, but allows us to encode it directly into our language, which is often useful for interop.</p>
<p>The <code class="highlighter-rouge">~</code>, <code class="highlighter-rouge">|</code>, <code class="highlighter-rouge">rep</code> and <code class="highlighter-rouge">opt</code> are <strong>parser combinators</strong>. These are primitives with which we can construct a full parser for the grammar of our choice.</p>
<h3>Boilerplate</h3>
<p>First, let’s define a class <code class="highlighter-rouge">ParseResult[T]</code> as an ad-hoc monad; parsing can either succeed or fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Covariant in T, so that a Failure can be returned wherever a ParseResult[T] is expected
sealed trait ParseResult[+T]
case class Success[T](result: T, in: Input) extends ParseResult[T]
case class Failure(msg: String, in: Input) extends ParseResult[Nothing]</code></pre></figure>
<p>👉 <code class="highlighter-rouge">Nothing</code> is the bottom type in Scala; it contains no members, and nothing can extend it.</p>
<p>Let’s also define the tokens produced by the lexer (which we won’t define) as case classes extending <code class="highlighter-rouge">Token</code>:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">sealed trait Token
case class Keyword(chars: String) extends Token
case class NumericLit(chars: String) extends Token
case class StringLit(chars: String) extends Token
case class Identifier(chars: String) extends Token</code></pre></figure>
<p>Input into the parser is then a lazy stream of tokens (with positions for error diagnostics, which we’ll omit here):</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">type Input = Reader[Token]</code></pre></figure>
<p>We can then define a standard, sample parser which looks as follows at the type level:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">class StandardTokenParsers {
  type Parser[T] = Input => ParseResult[T]
}</code></pre></figure>
<h3>The basic idea</h3>
<p>For each language (defined by a grammar symbol <code class="highlighter-rouge">S</code>), define a function <code class="highlighter-rouge">f</code> that, given an input stream <code class="highlighter-rouge">i</code> (with tail <code class="highlighter-rouge">i'</code>):</p>
<ul>
<li>if a prefix of <code class="highlighter-rouge">i</code> is in <code class="highlighter-rouge">S</code>, returns <code class="highlighter-rouge">Success(Pair(x, i'))</code>, where <code class="highlighter-rouge">x</code> is a result for <code class="highlighter-rouge">S</code></li>
<li>otherwise, returns <code class="highlighter-rouge">Failure(msg, i)</code>, where <code class="highlighter-rouge">msg</code> is an error message string</li>
</ul>
<p>The first is called <em>success</em>, the second is <em>failure</em>. We can compose operations on this somewhat conveniently, like we would on a monad (like <code class="highlighter-rouge">Option</code>).</p>
<h3>Simple parser primitives</h3>
<p>All of the above boilerplate allows us to define a parser, which succeeds if the first token in the input satisfies some given predicate <code class="highlighter-rouge">pred</code>. When it succeeds, it reads the token string, and splits the input there.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def token(kind: String)(pred: Token => Boolean) = new Parser[String] {
  def apply(in: Input) =
    if (pred(in.head)) Success(in.head.chars, in.tail)
    else Failure(kind + " expected ", in)
}</code></pre></figure>
<p>We can use this to define a keyword parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">implicit def keyword(chars: String) = token("'" + chars + "'") {
  case Keyword(chars1) => chars == chars1
  case _ => false
}</code></pre></figure>
<p>Marking it as <code class="highlighter-rouge">implicit</code> allows us to write keywords as normal strings, where we can omit the <code class="highlighter-rouge">keyword</code> call (this helps us simplify the notation in our DSL; we can write <code class="highlighter-rouge">"if"</code> instead of <code class="highlighter-rouge">keyword("if")</code>).</p>
<p>We can make other parsers for our other case classes quite simply:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def numericLit = token("number")(_.isInstanceOf[NumericLit])
def stringLit = token("string literal")(_.isInstanceOf[StringLit])
def ident = token("identifier")(_.isInstanceOf[Identifier])</code></pre></figure>
<h3>Parser combinators</h3>
<p>We are going to define the following parser combinators:</p>
<ul>
<li><code class="highlighter-rouge">~</code>: sequential composition</li>
<li><code class="highlighter-rouge">&lt;~</code>, <code class="highlighter-rouge">~></code>: sequential composition, keeping left / right only</li>
<li><code class="highlighter-rouge">|</code>: alternative</li>
<li><code class="highlighter-rouge">opt(X)</code>: option (like a <code class="highlighter-rouge">?</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">rep(X)</code>: repetition (like a <code class="highlighter-rouge">*</code> quantifier in a regex)</li>
<li><code class="highlighter-rouge">repsep(P, Q)</code>: interleaved repetition</li>
<li><code class="highlighter-rouge">^^</code>: result conversion (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>)</li>
<li><code class="highlighter-rouge">^^^</code>: constant result (like a <code class="highlighter-rouge">map</code> on an <code class="highlighter-rouge">Option</code>, but returning a constant value regardless of result)</li>
</ul>
<p>But first, we’ll write some very basic parser combinators: <code class="highlighter-rouge">success</code> and <code class="highlighter-rouge">failure</code>, which respectively always succeed and always fail:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def success[T](result: T) = new Parser[T] {
  def apply(in: Input) = Success(result, in)
}

def failure(msg: String) = new Parser[Nothing] {
  def apply(in: Input) = Failure(msg, in)
}</code></pre></figure>
<p>All of the above are methods on a <code class="highlighter-rouge">Parser[T]</code> class.</p>
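<p>Thanks to infix space notation in Scala, we can denote <code class="highlighter-rouge">x.y(z)</code> as <code class="highlighter-rouge">x y z</code>, which allows us to simplify our DSL notation; for instance <code class="highlighter-rouge">A ~ B</code> corresponds to <code class="highlighter-rouge">A.~(B)</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">abstract class Parser[T] {
  // An abstract method that defines the parser function
  def apply(in: Input): ParseResult[T]

  def ~[U](rhs: Parser[U]) = new Parser[T ~ U] {
    def apply(in: Input) = Parser.this(in) match {
      case Success(x, tail) => rhs(tail) match {
        case Success(y, rest) => Success(new ~(x, y), rest)
        case failure => failure
      }
      case failure => failure
    }
  }

  def |(rhs: => Parser[T]) = new Parser[T] {
    def apply(in: Input) = Parser.this(in) match {
      case s1 @ Success(_, _) => s1
      case failure => rhs(in)
    }
  }

  def ^^[U](f: T => U) = new Parser[U] {
    def apply(in: Input) = Parser.this(in) match {
      case Success(x, tail) => Success(f(x), tail)
      case x => x
    }
  }

  def ^^^[U](r: U): Parser[U] = ^^(x => r)
}</code></pre></figure>
<p>👉 In Scala, <code class="highlighter-rouge">T ~ U</code> is syntactic sugar for <code class="highlighter-rouge">~[T, U]</code>, which is the type of the case class we’ll define below.</p>
<p>For the <code class="highlighter-rouge">~</code> combinator, when everything works, we’re using <code class="highlighter-rouge">~</code>, a case class that is equivalent to <code class="highlighter-rouge">Pair</code>, but prints the way we want it to and allows for the concise type-level notation above.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">case class ~[T, U](_1: T, _2: U) {
  override def toString = "(" + _1 + " ~ " + _2 + ")"
}</code></pre></figure>
<p>At this point, we thus have two different meanings for <code class="highlighter-rouge">~</code>: a <em>function</em> <code class="highlighter-rouge">~</code> that produces a <code class="highlighter-rouge">Parser</code>, and the <code class="highlighter-rouge">~(a, b)</code> <em>case class</em> pair that this parser returns (all of this is encoded in the function signature of the <code class="highlighter-rouge">~</code> function).</p>
<p>Note that the <code class="highlighter-rouge">|</code> combinator takes the right-hand side parser as a call-by-name argument. This is because we don’t want to evaluate it unless it is strictly needed, that is, if the left-hand side fails.</p>
<p><code class="highlighter-rouge">^^</code> is like a <code class="highlighter-rouge">map</code> operation on <code class="highlighter-rouge">Option</code>; <code class="highlighter-rouge">P ^^ f</code> succeeds iff <code class="highlighter-rouge">P</code> succeeds, in which case it applies the transformation <code class="highlighter-rouge">f</code> on the result of <code class="highlighter-rouge">P</code>. Otherwise, it fails.</p>
<h3>Shorthands</h3>
<p>We can now define shorthands for common combinations of parser combinators:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def opt[T](p: Parser[T]): Parser[Option[T]] = p ^^ Some | success(None)

def rep[T](p: Parser[T]): Parser[List[T]] =
  p ~ rep(p) ^^ { case x ~ xs => x :: xs } | success(Nil)

def repsep[T, U](p: Parser[T], q: Parser[U]): Parser[List[T]] =
  p ~ rep(q ~> p) ^^ { case r ~ rs => r :: rs } | success(Nil)</code></pre></figure>
<p>Note that none of the above can fail. They may, however, return <code class="highlighter-rouge">None</code> or <code class="highlighter-rouge">Nil</code> wrapped in <code class="highlighter-rouge">success</code>.</p>
<p>As an exercise, we can implement the <code class="highlighter-rouge">rep1(P)</code> parser combinator, which corresponds to the <code class="highlighter-rouge">+</code> regex quantifier:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def rep1[T](p: Parser[T]) = p ~ rep(p)</code></pre></figure>
<h3>Example: JSON parser</h3>
<p>We did not mention <code class="highlighter-rouge">lexical.delimiters</code> and <code class="highlighter-rouge">lexical.reserved</code> in the above, and for the sake of brevity, we omit the implementation of <code class="highlighter-rouge">stringLit</code> and <code class="highlighter-rouge">numericLit</code>.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object JSON extends StandardTokenParsers {
  lexical.delimiters += ("{", "}", "[", "]", ":", ",")
  lexical.reserved += ("null", "true", "false")

  // Return Map
  def obj: Parser[Any] =
    "{" ~> repsep(member, ",") &lt;~ "}" ^^ (ms => Map() ++ ms)

  // Return List
  def arr: Parser[Any] =
    "[" ~> repsep(value, ",") &lt;~ "]"

  // Return name/value pair
  def member: Parser[Any] =
    stringLit ~ ":" ~ value ^^ {
      case name ~ ":" ~ value => (name, value)
    }

  // Return correct Scala type
  def value: Parser[Any] =
      obj
    | arr
    | stringLit
    | numericLit ^^ (_.toInt)
    | "null" ^^^ null
    | "true" ^^^ true
    | "false" ^^^ false
}</code></pre></figure>
<h3>The trouble with left-recursion</h3>
<p>Parser combinators work top-down and therefore do not allow for left-recursion.</p>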
<p>For example, the following would go into an infinite loop, where the parser keeps recursively calling <code class="highlighter-rouge">expr</code> on the same token:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">def expr = expr ~ "-" ~ term</code></pre></figure>
<p>Let’s take a look at an arithmetic expression parser:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object Arithmetic extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "+", "-", "*", "/")
  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term)
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor)
  def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLit
}</code></pre></figure>
<p>This definition of <code class="highlighter-rouge">expr</code>, namely <code class="highlighter-rouge">term ~ rep("-" ~ term)</code>, produces a right-leaning tree. For instance, <code class="highlighter-rouge">1 - 2 - 3</code> produces <code class="highlighter-rouge">1 ~ List("-" ~ 2, "-" ~ 3)</code>.</p>
<p>The solution is to combine calls to <code class="highlighter-rouge">rep</code> with a final <code class="highlighter-rouge">foldLeft</code> on the list:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object Arithmetic extends StandardTokenParsers {
  lexical.delimiters ++= List("(", ")", "+", "-", "*", "/")
  def expr: Parser[Any] = term ~ rep("+" ~ term | "-" ~ term) ^^ reduceList
  def term: Parser[Any] = factor ~ rep("*" ~ factor | "/" ~ factor) ^^ reduceList
  def factor: Parser[Any] = "(" ~ expr ~ ")" | numericLit

  // Fold the list of (operator, operand) pairs from the left,
  // starting from the first operand
  private def reduceList(list: Int ~ List[String ~ Int]): Int = list match {
    case x ~ xs => xs.foldLeft(x)(reduce)
  }

  private def reduce(x: Int, r: String ~ Int) = r match {
    case "+" ~ y => x + y
    case "-" ~ y => x - y
    case "*" ~ y => x * y
    case "/" ~ y => x / y
    case _ => throw new MatchError("illegal case: " + r)
  }
}</code></pre></figure>
<p>👉 It used to be that the standard library contained parser combinators, but those are now a separate module. This module contains a <code class="highlighter-rouge">chainl1</code> (chain-left) method that reduces after a <code class="highlighter-rouge">rep</code> for you.</p>
<h2>Arithmetic expressions: abstract syntax and proof principles</h2>
<p>This section follows Chapter 3 in TAPL.</p>
<h3>Basics of induction</h3>
<p>Ordinary induction is simply:</p>
<figure class="highlight"><pre><code>Suppose P is a predicate on natural numbers.
Then:
    If P(0)
    and, for all i, P(i) implies P(i + 1)
    then P(n) holds for all n</code></pre></figure>
<p>We can also do complete induction:</p>
<figure class="highlight"><pre><code>Suppose P is a predicate on natural numbers.
Then:
    If for each natural number n,
    given P(i) for all i &lt; n we can show P(n)
    then P(n) holds for all n</code></pre></figure>
<p>It proves exactly the same thing as ordinary induction; it is simply a restated version. They’re <em>interderivable</em>; assuming one, we can prove the other. Which one to use is simply a matter of style or convenience. We’ll see some more equivalent styles as we go along.</p>
<h3>Mathematical representation of syntax</h3>
<p>Let’s assume the following grammar:</p>
<figure class="highlight"><pre><code class="language-antlr" data-lang="antlr">t ::=
    true
    false
    if t then t else t
    0
    succ t
    pred t
    iszero t</code></pre></figure>
<p>What does this really define? A few suggestions:</p>
<ul>
<li>A set of character strings</li>
<li>A set of token lists</li>
<li>A set of abstract syntax trees</li>
</ul>
<p>It depends on how you read it; a grammar like the one above contains information about all three.</p>
<p>However, we are mostly interested in the ASTs. The above grammar is therefore called an <strong>abstract grammar</strong>. Its main purpose is to suggest a mapping from character strings to trees.</p>
<p>We won’t be too strict in how we use it. For instance, we’ll freely use parentheses to disambiguate what tree we mean to describe, even though they’re not strictly supported by the grammar. What matters to us here aren’t strict implementation semantics, but rather that we have a framework to talk about ASTs. For our purposes, we’ll consider that two terms producing the same AST are basically the same; still, we’ll distinguish terms that only have the same evaluation result, as they don’t necessarily have the same AST.</p>
<p>How can we express our grammar as mathematical expressions?</p>
A grammar describes the legal set of terms in a program by offering a recursive definition. While recursive definitions may seem obvious and simple to a programmer, we have to go through a few hoops to make sense of them mathematically.Mathematical representation 1We can use a set $\mathcal{T}$ of terms. The grammar is then the smallest set such that: $\left\{ \text{true}, \text{false}, 0 \right\} \subseteq \mathcal{T}$, If $t_1 \in \mathcal{T}$ then $\left\{ \text{succ } t_1, \text{pred } t_1, \text{iszero } t_1 \right\} \subseteq \mathcal{T}$, If $t_1, t_2, t_3 \in \mathcal{T}$ then we also have $\text{if } t_1 \text{ then } t_2 \text{ else } t_3 \in \mathcal{T}$.Mathematical representation 2We can also write this somewhat more graphically:This is exactly equivalent to representation 1, but we have just introduced a different notation. Note that “the smallest set closed under…” is often not stated explicitly, but implied.Mathematical representation 3Alternatively, we can build up our set of terms as an infinite union:We can thus build our final set as follows:Note that we can “pull out” the definition into a generating function $F$:The generating function is thus defined as:Each function takes a set of terms $U$ as input and produces “terms justified by $U$” as output; that is, all terms that have the items of $U$ as subterms.The set $U$ is said to be closed under F or F-closed if $F(U) \subseteq U$.The set of terms $T$ as defined above is the smallest F-closed set. If $O$ is another F-closed set, then $T \subseteq O$.Comparison of the representationsWe’ve seen essentially two ways of defining the set (as representation 1 and 2 are equivalent, but with different notation): The smallest set that is closed under certain rules. This is compact and easy to read. The limit of a series of sets. 
This gives us an induction principle with which we can prove properties of terms.

The first one defines the set “from above”, by intersecting F-closed sets.

The second one defines it “from below”, by starting with $\emptyset$ and getting closer and closer to being F-closed.

These are equivalent (we won’t prove it, but Proposition 3.2.6 in TAPL does so), but can serve different uses in practice.

Induction on terms

First, let’s define depth: the depth of a term $t$ is the smallest $i$ such that $t \in \mathcal{S}_i$.

The way we defined $\mathcal{S}_i$, it gets larger and larger for increasing $i$; the depth of a term $t$ gives us the step at which $t$ is introduced into the set.

We see that if a term $t$ is in $\mathcal{S}_i$, then all of its immediate subterms must be in $\mathcal{S}_{i-1}$, meaning that they must have smaller depth.

This justifies the principle of induction on terms, or structural induction. Let P be a predicate on terms:

    If, for each term s, given P(r) for all immediate subterms r of s we can show P(s)
    then P(t) holds for all t

All this says is that if we can prove the induction step from subterms to terms (under the induction hypothesis), then we have proven the induction.

We can also express this structural induction using generating functions, which we introduced previously.

    Suppose T is the smallest F-closed set.
    If, for each set U, from the assumption "P(u) holds for every u ∈ U", we can show that "P(v) holds for every v ∈ F(U)"
    then P(t) holds for all t ∈ T

Why can we use this? We assumed that $T$ was the smallest F-closed set, which means that $T\subseteq O$ for any other F-closed set $O$. Showing the pre-condition (“for each set $U$, from the assumption…”) amounts to showing that the set of all terms satisfying $P$ (call it $O$) is itself an F-closed set. Since $T\subseteq O$, every element of $T$ satisfies $P$.

Inductive function definitions

An inductive definition is used to define the elements in a set recursively, as we have done above.
The recursion theorem states that a well-formed inductive definition defines a function. To understand what being well-formed means, let’s take a look at some examples.Let’s define our grammar function a little more formally. Constants are the basic values that can’t be expanded further; in our example, they are true, false, 0. As such, the set of constants appearing in a term $t$, written $\text{Consts}(t)$, is defined recursively as follows:This seems simple, but these semantics aren’t perfect. First off, a mathematical definition simply assigns a convenient name to some previously known thing. But here, we’re defining the thing in terms of itself, recursively. And the semantics above also allow us to define ill-formed inductive definitions:The last rule produces infinitely large rules (if we implemented it, we’d expect some kind of stack overflow). We’re missing the rules for if-statements, and we have a useless rule for 0, producing empty sets.How do we tell the difference between a well-formed inductive definition, and an ill-formed one as above? What is well-formedness anyway?What is a function?A relation over $T, U$ is a subset of $T \times U$, where the Cartesian product is defined as:A function $f$ from $A$ (domain) to $B$ (co-domain) can be viewed as a two-place relation, albeit with two additional properties: It is total: $\forall a \in A, \exists b \in B : (a, b) \in f$ It is deterministic: $(a, b_1) \in f, (a, b_2) \in f \implies b_1 = b_2$Totality ensures that the A domain is covered, while being deterministic just means that the function always produces the same result for a given input.Induction example 1As previously stated, $\text{Consts}$ is a relation. It maps terms (A) into the set of constants that they contain (B). The induction theorem states that it is also a function. The proof is as follows.$\text{Consts}$ is total and deterministic: for each term $t$ there is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$1 . 
The proof is done by induction on $t$.

To apply the induction principle for terms, we take an arbitrary term $t$ and assume the following induction hypothesis:

    For each immediate subterm $s$ of $t$, there is exactly one set of terms $C_s$ such that $(s, C_s) \in \text{Consts}$

We must then prove the induction step:

    There is exactly one set of terms $C$ such that $(t, C) \in \text{Consts}$

We proceed by cases on $t$:

- If $t$ is $0$, $\text{true}$ or $\text{false}$: we can immediately see from the definition of $\text{Consts}$ that there is exactly one set of terms $C = \left\{ t \right\}$ such that $(t, C) \in \text{Consts}$. This constitutes our base case.
- If $t$ is $\text{succ } t_1$, $\text{pred } t_1$ or $\text{iszero } t_1$: the immediate subterm of $t$ is $t_1$, and the induction hypothesis tells us that there is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$. But then it is clear from the definition that there is exactly one set of terms $C = C_1$ such that $(t, C) \in \text{Consts}$.
- If $t$ is $\ifelse$: the induction hypothesis tells us:
    - There is exactly one set of terms $C_1$ such that $(t_1, C_1) \in \text{Consts}$
    - There is exactly one set of terms $C_2$ such that $(t_2, C_2) \in \text{Consts}$
    - There is exactly one set of terms $C_3$ such that $(t_3, C_3) \in \text{Consts}$

  It is clear from the definition of $\text{Consts}$ that there is exactly one set $C = C_1 \cup C_2 \cup C_3$ such that $(t, C) \in \text{Consts}$.

This proves that $\text{Consts}$ is indeed a function.

But what about $\text{BadConsts}$? It is also a relation, but it isn’t a function. For instance, we have $\text{BadConsts}(0) = \left\{ 0 \right\}$ and $\text{BadConsts}(0) = \left\{ \right\}$, which violates determinism.
To reformulate this in terms of the above, there are two sets $C$ such that $(0, C) \in \text{BadConsts}$, namely $C = \left\{ 0 \right\}$ and $C = \left\{ \right\}$.

Note that there are many other problems with $\text{BadConsts}$, but this is sufficient to prove that it isn’t a function.

Induction example 2

Let’s introduce another inductive definition, the size of a term (as defined in TAPL, where every constructor counts for one):

$$
\begin{align}
\text{size}(\text{true}) & = 1 \\
\text{size}(\text{false}) & = 1 \\
\text{size}(0) & = 1 \\
\text{size}(\text{succ } t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{pred } t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\text{iszero } t_1) & = \text{size}(t_1) + 1 \\
\text{size}(\ifelse) & = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) + 1
\end{align}
$$

We’d like to prove that the number of distinct constants in a term is at most the size of the term. In other words, that $\abs{\text{Consts}(t)} \le \text{size}(t)$.

The proof is by induction on $t$:

- $t$ is a constant; $t=\text{true}$, $t=\text{false}$ or $t=0$. The proof is immediate. For constants, the number of constants and the size are both one: $\abs{\text{Consts}(t)} = \abs{\left\{t\right\}} = 1 = \text{size}(t)$
- $t$ is a function; $t = \text{succ}\ t_1$, $t = \text{pred}\ t_1$ or $t = \text{iszero}\ t_1$. By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$. We can then prove the proposition as follows: $\abs{\text{Consts}(t)} = \abs{\text{Consts}(t_1)} \overset{\text{IH}}{\le} \text{size}(t_1) = \text{size}(t) - 1 < \text{size}(t)$
- $t$ is an if-statement: $t = \ifelse$. By the induction hypothesis, $\abs{\text{Consts}(t_1)} \le \text{size}(t_1)$, $\abs{\text{Consts}(t_2)} \le \text{size}(t_2)$ and $\abs{\text{Consts}(t_3)} \le \text{size}(t_3)$.
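We can then prove the proposition as follows:

$$
\begin{align}
\abs{\text{Consts}(t)}
  & = \abs{\text{Consts}(t_1) \cup \text{Consts}(t_2) \cup \text{Consts}(t_3)} \\
  & \le \abs{\text{Consts}(t_1)} + \abs{\text{Consts}(t_2)} + \abs{\text{Consts}(t_3)} \\
  & \overset{\text{IH}}{\le} \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) \\
  & < \text{size}(t)
\end{align}
$$

Operational semantics and reasoning

Evaluation

Suppose we have the following syntax:

    t ::=                     // terms
        true                  // constant true
        false                 // constant false
        if t then t else t    // conditional

The evaluation relation $t \longrightarrow t'$ is the smallest relation closed under the following rules.

The following are computation rules, defining the “real” computation steps:

$$
\if \text{true} \then t_2 \else t_3 \longrightarrow t_2
\label{eq:e-iftrue}\tag{E-IfTrue}
$$

$$
\if \text{false} \then t_2 \else t_3 \longrightarrow t_3
\label{eq:e-iffalse}\tag{E-IfFalse}
$$

The following is a congruence rule, defining where the computation rule is applied next:

$$
\frac{t_1 \longrightarrow t_1'}
     {\if t_1 \then t_2 \else t_3 \longrightarrow \if t_1' \then t_2 \else t_3}
\label{eq:e-if}\tag{E-If}
$$

We evaluate the condition before the branches in order to save on evaluation: until the condition is known, we cannot tell which branch will be needed.

Derivations

We can describe the evaluation logically from the above rules using derivation trees. Suppose we want to evaluate the following (with parentheses added for clarity): if (if true then true else false) then false else true.

In an attempt to make all this fit onto the screen, true and false have been abbreviated T and F in the derivation below, and the then keyword has been replaced with a parenthesis notation for the condition.

$$
\frac{
    \text{if (T) T else F} \longrightarrow \text{T}
    \quad \text{(E-IfTrue)}
}{
    \text{if (if (T) T else F) F else T} \longrightarrow \text{if (T) F else T}
}
\quad \text{(E-If)}
$$

The final statement is a conclusion. We say that the derivation is a witness for its conclusion (or a proof for its conclusion). The derivation records all reasoning steps that lead us to the conclusion.

Inversion lemma

We can introduce the inversion lemma, which tells us how we got to a term.

Suppose we are given a derivation $\mathcal{D}$ witnessing the pair $(t, t')$ in the evaluation relation.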
Then one of the following holds:

- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if true \then t_2 \else t_3$ and $t'=t_2$ for some $t_2$ and $t_3$
- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if false \then t_2 \else t_3$ and $t'=t_3$ for some $t_2$ and $t_3$
- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \if t_1 \then t_2 \else t_3$ and $t' = \if t_1' \then t_2 \else t_3$, for some $t_1, t_1', t_2, t_3$. Moreover, the immediate subderivation of $\mathcal{D}$ witnesses $(t_1, t_1') \in \longrightarrow$.

This is super boring, but we do need to acknowledge the inversion lemma before we can do induction proofs on derivations. Thanks to the inversion lemma, given an arbitrary derivation $\mathcal{D}$ with conclusion $t \longrightarrow t'$, we can proceed with a case-by-case analysis on the final rule used in the derivation tree.

Let’s recall our definition of the size function. In particular, we’ll need the rule for if-statements:

$$
\text{size}(\ifelse) = \text{size}(t_1) + \text{size}(t_2) + \text{size}(t_3) + 1
$$

We want to prove that if $t \longrightarrow t'$, then $\text{size}(t) > \text{size}(t')$. The proof is by induction on a derivation $\mathcal{D}$ of $t \longrightarrow t'$:

- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iftrue})$, then we have $t = \if true \then t_2 \else t_3$ and $t'=t_2$, and the result is immediate from the definition of $\text{size}$
- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-iffalse})$, then we have $t = \if false \then t_2 \else t_3$ and $t'=t_3$, and the result is immediate from the definition of $\text{size}$
- If the final rule applied in $\mathcal{D}$ was $(\ref{eq:e-if})$, then we have $t = \ifelse$ and $t' = \if t_1' \then t_2 \else t_3$. In this case, $t_1 \longrightarrow t_1'$ is witnessed by a derivation $\mathcal{D}_1$.
By the induction hypothesis, $\text{size}(t_1) > \text{size}(t_1’)$, and the result is then immediate from the definition of $\text{size}$Abstract machinesAn abstract machine consists of: A set of states A transition relation of states, written $\longrightarrow$$t \longrightarrow t’$ means that $t$ evaluates to $t’$ in one step. Note that $\longrightarrow$ is a relation, and that $t \longrightarrow t’$ is shorthand for $(t, t’) \in \longrightarrow$. Often, this relation is a partial function (not necessarily covering the domain A; there is at most one possible next state). But without loss of generality, there may be many possible next states, determinism isn’t a criterion here.Normal formsA normal form is a term that cannot be evaluated any further. More formally, a term $t$ is a normal form if there is no $t’$ such that $t \longrightarrow t’$. A normal form is a state where the abstract machine is halted; we can regard it as the result of a computation.Values that are normal formPreviously, we intended for our values (true and false) to be exactly that, the result of a computation. Did we get that right?Let’s prove that a term $t$ is a value $\iff$ it is in normal form. The $\implies$ direction is immediate from the definition of the evaluation relation $\longrightarrow$. The $\impliedby$ direction is more conveniently proven as its contrapositive: if $t$ is not a value, then it is not a normal form, which we can prove by induction on the term $t$. Since $t$ is not a value, it must be of the form $\ifelse$. If $t_1$ is directly true or false, then $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ apply, and we are done. Otherwise, if $t = \ifelse$ where $t_1$ isn’t a value, by the induction hypothesis, there is a $t_1’$ such that $t_1 \longrightarrow t_1’$. Then rule $\ref{eq:e-if}$ yields $\if t_1’ \then t_2 \else t_3$, which proves that $t$ is not in normal form. 
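To make the notions of value, normal form and single-step evaluation concrete, here is a minimal Scala sketch of the boolean language. The names `Term`, `BoolEval`, `step`, `isNormalForm`, `eval` and `size` are illustrative choices, not part of the original notes, and `size` assumes the TAPL definition in which every constructor counts for one.

```scala
// Terms of the boolean language: true, false, if t then t else t
sealed trait Term
case object True extends Term
case object False extends Term
final case class If(cond: Term, thn: Term, els: Term) extends Term

object BoolEval {
  // Values are exactly the constants true and false.
  def isValue(t: Term): Boolean = t match {
    case True | False => true
    case _            => false
  }

  // One step of evaluation. None means that no rule applies,
  // i.e. that t is a normal form.
  def step(t: Term): Option[Term] = t match {
    case If(True, t2, _)  => Some(t2)                                // E-IfTrue
    case If(False, _, t3) => Some(t3)                                // E-IfFalse
    case If(t1, t2, t3)   => step(t1).map(t1p => If(t1p, t2, t3))    // E-If
    case _                => None                                    // true, false
  }

  def isNormalForm(t: Term): Boolean = step(t).isEmpty

  // Size of a term (assumed definition: one per constructor).
  def size(t: Term): Int = t match {
    case True | False   => 1
    case If(t1, t2, t3) => size(t1) + size(t2) + size(t3) + 1
  }

  // Repeatedly apply step until a normal form is reached.
  @annotation.tailrec
  def eval(t: Term): Term = step(t) match {
    case Some(next) => eval(next)
    case None       => t
  }
}
```

For the term from the derivation example, `BoolEval.step(If(If(True, True, False), False, True))` gives `Some(If(True, False, True))`, and `BoolEval.eval` of the same term gives `False`. In this language `isValue(t)` and `isNormalForm(t)` agree on every term, which is what the proof above establishes.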
Values that are not normal form

Let’s introduce new syntactic forms, with new evaluation rules.

    t ::=           // terms
        0           // constant 0
        succ t      // successor
        pred t      // predecessor
        iszero t    // zero test

    v ::= nv        // values

    nv ::=          // numeric values
        0           // zero value
        succ nv     // successor value

The evaluation rules are given as follows:

All values are still normal forms. But are all normal forms values? Not in this case. For instance, succ true, iszero true, etc., are normal forms. These are stuck terms: they are in normal form, but are not values. In general, these correspond to some kind of type error, and one of the main purposes of a type system is to rule these kinds of situations out.

Multi-step evaluation

Let’s introduce the multi-step evaluation relation, $\longrightarrow^*$. It is the reflexive, transitive closure of single-step evaluation, i.e. the smallest relation closed under these rules:

In other words, it corresponds to any number of single consecutive evaluations.

Termination of evaluation

We’ll prove that evaluation terminates, i.e. that for every term $t$ there is some normal form $t'$ such that $t\longrightarrow^* t'$.

First, let’s recall our proof that $t\longrightarrow t' \implies \text{size}(t) > \text{size}(t')$. Now, for our proof by contradiction, assume that we have an infinite-length sequence $t_0, t_1, t_2, \dots$ such that:

$$
t_0 \longrightarrow t_1 \longrightarrow t_2 \longrightarrow \dots
\quad \implies \quad
\text{size}(t_0) > \text{size}(t_1) > \text{size}(t_2) > \dots
$$

But this sequence cannot exist: since $\text{size}(t_0)$ is a finite, natural number, we cannot construct this infinite descending chain from it. This is a contradiction.

Most termination proofs have the same basic form. We want to prove that the relation $R\subseteq X \times X$ is terminating, that is, that there are no infinite sequences $x_0, x_1, x_2, \dots$ such that $(x_i, x_{i+1}) \in R$ for each $i$. We proceed as follows:

- Choose a well-founded set $W$ with partial order $<$, that is, one in which there are no infinite descending chains $w_0 > w_1 > w_2 > \dots$. Also choose a function $f: X \rightarrow W$.
- Show $f(x) > f(y) \quad \forall (x, y) \in R$
- Conclude that there are no infinite sequences $(x_0, x_1, x_2, \dots)$ such that $(x_i, x_{i+1}) \in R$ for each $i$. If there were, we could construct an infinite descending chain in $W$.

As a side-note, a (strict) partial order is defined by the following properties:

- Anti-symmetry: $\neg(x < y \land y < x)$
- Transitivity: $x<y \land y<z \implies x < z$

We can add a third property to achieve total order, namely $x \ne y \implies x <y \lor y<x$.

Lambda calculus

Lambda calculus is Turing complete, and is higher-order (functions are data). In lambda calculus, all computation happens by means of function abstraction and application.

Lambda calculus is equivalent to Turing machines in computational power.

Suppose we wanted to write a function plus3 in our previous language:

    plus3 x = succ (succ (succ x))

The way we write this in lambda calculus is:

$$
\text{plus3} = \lambda x.\ \text{succ}\ (\text{succ}\ (\text{succ}\ x))
$$

$\lambda x. t$ is written x => t in Scala, or fun x -> t in OCaml. Application of our function, say plus3(succ 0), can be written as:

$$
(\lambda x.\ \text{succ}\ (\text{succ}\ (\text{succ}\ x)))\ (\text{succ}\ 0)
$$

Abstraction over functions is possible using higher-order functions, which we call $\lambda$-abstractions. An example of such an abstraction is the function $g$ below, which takes an argument $f$ and uses it in the function position.

If we apply $g$ to an argument like $\text{plus3}$, we can just use the substitution rule to see how that defines a new function.

Another example: the double function below takes two arguments, as a curried function would. First, it takes the function to apply twice, then the argument on which to apply it, and then returns $f(f(y))$.

$$
\text{double} = \lambda f.\ \lambda y.\ f\ (f\ y)
$$

Pure lambda calculus

Once we have $\lambda$-abstractions, we can actually throw out all other language primitives like booleans and other values; all of these can be expressed as functions, as we’ll see below.
In pure lambda-calculus, everything is a function.

Variables will always denote a function, functions always take other functions as parameters, and the result of an evaluation is always a function.

The syntax of lambda-calculus is very simple:

    t ::=        // terms, also called λ-terms
        x        // variable
        λx. t    // abstraction, also called λ-abstractions
        t t      // application

A few rules and syntactic conventions:

- Application associates to the left, so $t\ u\ v$ means $(t\ u)\ v$, not $t\ (u\ v)$.
- Bodies of lambda abstractions extend as far to the right as possible, so $\lambda x. \lambda y.\ x\ y$ means $\lambda x.\ (\lambda y.\ x\ y)$, not $\lambda x.\ (\lambda y.\ x)\ y$

Scope

The lambda expression $\lambda x.\ t$ binds the variable $x$, with a scope limited to $t$. Occurrences of $x$ inside of $t$ are said to be bound, while occurrences outside are said to be free.

Let $\text{fv}(t)$ be the set of free variables in a term $t$. It’s defined as follows:

$$
\begin{align}
\text{fv}(x) & = \left\{ x \right\} \\
\text{fv}(\lambda x.\ t_1) & = \text{fv}(t_1) \setminus \left\{ x \right\} \\
\text{fv}(t_1\ t_2) & = \text{fv}(t_1) \cup \text{fv}(t_2)
\end{align}
$$

Operational semantics

As we saw with our previous language, the rules can be divided into computation and congruence rules. For lambda calculus, the only computation rule is:

$$
(\lambda x.\ t_{12})\ v_2 \longrightarrow \left[ x \mapsto v_2 \right] t_{12}
\tag{E-AppAbs}
$$

The notation $\left[ x \mapsto v_2 \right] t_{12}$ means “the term that results from substituting free occurrences of $x$ in $t_{12}$ with $v_2$”.

The congruence rules are:

$$
\frac{t_1 \longrightarrow t_1'}{t_1\ t_2 \longrightarrow t_1'\ t_2}
\label{eq:e-app1}\tag{E-App1}
$$

$$
\frac{t_2 \longrightarrow t_2'}{v_1\ t_2 \longrightarrow v_1\ t_2'}
\label{eq:e-app2}\tag{E-App2}
$$

A lambda-expression applied to a value, $(\lambda x.\ t)\ v$, is called a reducible expression, or redex.

Evaluation strategies

There are alternative evaluation strategies. In the above, we have chosen call by value (which is the standard in most mainstream languages), but we could also choose:

- Full beta-reduction: any redex may be reduced at any time. This offers no restrictions, but in practice, we go with a set of restrictions like the ones below (because coding a fixed way is easier than coding probabilistic behavior).
- Normal order: the leftmost, outermost redex is always reduced first.
  This strategy allows reducing inside unapplied lambda terms.
- Call-by-name: allows no reductions inside lambda abstractions. Arguments are not reduced before being substituted in the body of lambda terms when applied. Haskell uses an optimized version of this, call-by-need (aka lazy evaluation).

Classical lambda calculus

Classical lambda calculus allows for full beta reduction.

Confluence in full beta reduction

The congruence rules allow us to reduce in different ways; we can choose between $\ref{eq:e-app1}$ and $\ref{eq:e-app2}$ every time we reduce an application, and this offers many possible reduction paths.

While the path is non-deterministic, is the result also non-deterministic? This question took a very long time to answer, but after 25 years or so, it was proven that the result is always the same. This is known as the Church-Rosser confluence theorem:

    Let $t, t_1, t_2$ be terms such that $t \longrightarrow^* t_1$ and $t \longrightarrow^* t_2$. Then there exists a term $t_3$ such that $t_1 \longrightarrow^* t_3$ and $t_2 \longrightarrow^* t_3$

Alpha conversion

Substitution is actually trickier than it looks! For instance, in the expression $\lambda x.\ (\lambda y.\ x)\ y$, the first occurrence of $y$ is bound (it refers to a parameter), while the second is free (it does not refer to a parameter). This is comparable to scope in most programming languages, where we should understand that these are two different variables in different scopes, $y_1$ and $y_2$.

The above example had a variable that is both bound and free, which is something that we’ll try to avoid. This is called a hygiene condition.

We can transform an unhygienic expression into a hygienic one by renaming bound variables before performing the substitution. This is known as alpha conversion.
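Renaming bound variables before substituting can be sketched in Scala as follows. This is only an illustration under stated assumptions: the names `LambdaTerm`, `Var`, `Abs`, `App`, `Subst`, `fv`, `fresh` and `subst` are made up for this sketch, and the renaming scheme (appending a number to the variable name) is one arbitrary choice among many.

```scala
// Untyped lambda terms: variables, abstractions and applications.
sealed trait LambdaTerm
final case class Var(name: String) extends LambdaTerm
final case class Abs(param: String, body: LambdaTerm) extends LambdaTerm
final case class App(fun: LambdaTerm, arg: LambdaTerm) extends LambdaTerm

object Subst {
  // Free variables of a term.
  def fv(t: LambdaTerm): Set[String] = t match {
    case Var(x)    => Set(x)
    case Abs(x, b) => fv(b) - x
    case App(f, a) => fv(f) ++ fv(a)
  }

  // A name derived from `base` that does not occur in `avoid`.
  def fresh(base: String, avoid: Set[String]): String =
    Iterator.from(1).map(i => s"$base$i").find(n => !avoid(n)).get

  // Capture-avoiding substitution [x -> s] t.
  def subst(x: String, s: LambdaTerm, t: LambdaTerm): LambdaTerm = t match {
    case Var(y)    => if (y == x) s else t
    case App(f, a) => App(subst(x, s, f), subst(x, s, a))
    // x is shadowed by the binder: nothing to substitute below it.
    case Abs(y, _) if y == x => t
    // The binder y would capture a free y of s: alpha-convert first.
    case Abs(y, b) if fv(s).contains(y) =>
      val y2 = fresh(y, (fv(s) ++ fv(b)) + x)
      Abs(y2, subst(x, s, subst(y, Var(y2), b)))
    case Abs(y, b) => Abs(y, subst(x, s, b))
  }
}
```

For instance, substituting the free variable `y` for `x` in `λy. x` must not produce `λy. y`; with this sketch, `Subst.subst("x", Var("y"), Abs("y", Var("x")))` renames the binder and returns `Abs("y1", Var("y"))`.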
Alpha conversion is given by the following conversion rule:

And these equivalence rules (in mathematics, an equivalence relation is reflexive, symmetric and transitive):

The congruence rules are as usual.

Programming in lambda-calculus

Multiple arguments

The way to handle multiple arguments is by currying: $\lambda x.\ \lambda y.\ t$

Booleans

The fundamental, universal operator on booleans is if-then-else, which is what we’ll replicate to model booleans. We’ll denote our booleans as $\text{tru}$ and $\text{fls}$ to be able to distinguish these pure lambda-calculus abstractions from the true and false values of our previous toy language.

We want true to be equivalent to if (true), and false to if (false). The terms $\text{tru}$ and $\text{fls}$ represent boolean values, in that we can use them to test the truth of a boolean value:

We can consider these as booleans. Equivalently, tru can be considered as a function performing (t1, t2) => if (true) t1 else t2. To understand this, let’s try to apply $\text{tru}$ to two arguments:

This works equivalently for fls.

We can also do inversion, conjunction and disjunction with lambda calculus, which can be read as particular if-else statements:

- not is a function that is equivalent to not(b) = if (b) false else true.
- and is equivalent to and(b, c) = if (b) c else false
- or is equivalent to or(b, c) = if (b) true else c

Pairs

The fundamental operations are construction pair(a, b), and selection pair._1 and pair._2.

- pair is equivalent to pair(f, s) = (b => b f s)
- When a pair is applied to tru, it selects the first element, by definition of the boolean, and that is therefore the definition of fst
- Equivalently, when a pair is applied to fls, it selects the second element

Numbers

We’ve actually been representing numbers as lambda-calculus numbers all along!
Our succ function represents what’s more formally called Church numerals.Note that $c_0$’s implementation is the same as that of $\text{fls}$ (just with renamed variables).Every number $n$ is represented by a term $c_n$ taking two arguments, which are $s$ and $z$ (for “successor” and “zero”), and applies $s$ to $z$, $n$ times. Fundamentally, a number is equivalent to the following:With this in mind, let us implement some functions on numbers. Successor $\text{scc}$: we apply the successor function to $n$ (which has been correctly instantiated with $s$ and $z$) Addition $\text{add}$: we pass the instantiated $n$ as the zero of $m$ Subtraction $\text{sub}$: we apply $\text{pred}$ $n$ times to $m$ Multiplication $\text{mul}$: instead of the successor function, we pass the addition by $n$ function. Zero test $\text{iszero}$: zero has the same implementation as false, so we can lean on that to build an iszero function. An alternative understanding is that we’re building a number, in which we use true for the zero value $z$. If we have to apply the successor function $s$ once or more, we want to get false, so for the successor function we use a function ignoring its input and returning false if applied.What about predecessor? This is a little harder, and it’ll take a few steps to get there. The main idea is that we find the predecessor by rebuilding the whole succession up until our number. At every step, we must generate the number and its predecessor: zero is $(c_0, c_0)$, and all other numbers are $(c_{n-1}, c_n)$. Once we’ve reconstructed this pair, we can get the predecessor by taking the first element of the pair.SidenoteThe story goes that Church was stumped by predecessors for a long time. This solution finally came to him while he was at the barber, and he jumped out half shaven to write it down.ListsNow what about lists?Recursion in lambda-calculusLet’s start by taking a step back. 
We talked about normal forms and terms for which we terminate; does lambda calculus always terminate? It’s Turing complete, so it must be able to loop infinitely (otherwise, we’d have solved the halting problem!).

The trick to recursion is self-application:

From a type-level perspective, we would cringe at this. This should not be possible in the typed world, but in the untyped world we can do it. We can construct a simple infinite loop in lambda calculus as follows:

The expression evaluates to itself in one step; it never reaches a normal form: it loops infinitely, or diverges. This is not a stuck term though; evaluation is always possible.

In fact, there are no stuck terms in pure lambda calculus. Every closed term is either a value or reduces further.

So it turns out that $\text{omega}$ isn’t so terribly useful. Let’s try to construct something more practical:

Now, the divergence is a little more interesting:

This $Y_f$ function is known as a Y-combinator. It still loops infinitely (though note that while it works in classical lambda calculus and in call-by-name, it diverges in call-by-value), so let’s try to build something more useful.

To delay the infinite recursion, we could build something like a poison pill:

It can be passed around (after all, it’s just a value), but evaluating it will cause our program to loop infinitely. This is the core idea we’ll use for defining the fixed-point combinator $\text{fix}$, which allows us to do recursion. It’s defined as follows:

This looks a little intricate, and we won’t need to fully understand the definition. What’s important is mostly how it is used to define a recursive function.
For instance, if we wanted to define a modulo function in our toy language, we’d do it as follows:

    def mod(x, y) =
      if (y > x) x
      else mod(x - y, y)

In lambda calculus, we’d define this as:

We’ve assumed that a greater-than $\text{gt}$ function was available here.

More generally, we can define a recursive function as:

Equivalence of lambda terms

We’ve seen how to define Church numerals and successor. How can we prove that $\text{scc } c_n$ is equal to $c_{n+1}$?

The naive approach unfortunately doesn’t work; they do not evaluate to the same value.

This still seems very close. If we could simplify a little further, we would see that they are the same.

The intuition behind the Church numeral representation was that a number $n$ is represented as a term that “does something $n$ times to something else”. $\text{scc}$ takes a term that “does something $n$ times to something else”, and returns a term that “does something $n+1$ times to something else”.

What we really care about is that $\text{scc } c_2$ behaves the same as $c_3$ when applied to two arguments. We want behavioral equivalence. But what does that mean? Roughly, two terms $s$ and $t$ are behaviorally equivalent if there is no “test” that distinguishes $s$ and $t$.

Let’s define this notion of “test” a little more precisely, and specify how we’re going to observe the results of a test. We can use the notion of normalizability to define a simple notion of a test:

    Two terms $s$ and $t$ are said to be observationally equivalent if they are either both normalizable (i.e. they reach a normal form after a finite number of evaluation steps), or both diverge.

In other words, we observe a term’s behavior by running it and seeing if it halts.
Note that this is not decidable (by the halting problem).

For instance, $\text{omega}$ and $\text{tru}$ are not observationally equivalent (one diverges, one halts), while $\text{tru}$ and $\text{fls}$ are (they both halt).

Observational equivalence isn’t strong enough of a test for what we need; we need behavioral equivalence.

    Two terms $s$ and $t$ are said to be behaviorally equivalent if, for every finite sequence of values $v_1, v_2, \dots, v_n$ the applications $s\ v_1\ v_2\ \dots\ v_n$ and $t\ v_1\ v_2\ \dots\ v_n$ are observationally equivalent.

This allows us to assert that true and false are indeed different:

The former returns a normal form, while the latter diverges.

Types

As previously, to define a language, we start with a set of terms and values, as well as an evaluation relation. But now, we’ll also define a set of types (denoted with a first capital letter) classifying values according to their “shape”. We can define a typing relation $t:\ T$. We must check that the typing relation is sound in the sense that:

These rules represent some kind of safety and liveness, but are more commonly referred to as progress and preservation, which we’ll talk about later. The first one states that types are preserved throughout evaluation, while the second says that if we can type-check, then evaluation of $t$ will not get stuck.

In our previous toy language, we can introduce two types, booleans and numbers:

    T ::=       // types
        Bool    // type of booleans
        Nat     // type of numbers

Our typing rules are then given by:

With these typing rules in place, we can construct typing derivations to justify every pair $t: T$ (which we can also denote as a $(t, T)$ pair) in the typing relation, as we have done previously with evaluation. Proofs of properties about the typing relation often proceed by induction on these typing derivations.

Like other static program analyses, type systems are generally imprecise.
They do not always predict exactly what kind of value will be returned, but simply a conservative approximation. For instance, if true then 0 else false cannot be typed with the above rules, even though it will certainly evaluate to a number. We could of course add a typing rule for if true statements, but there is still a question of how useful this is, and how much complexity it adds to the type system, and especially for proofs. Indeed, the inversion lemma below becomes much more tedious when we have more rules.Properties of the Typing RelationThe safety (or soundness) of this type system can be expressed by the following two properties: Progress: A well-typed term is not stuck. If $t\ :\ T$ then either $t$ is a value, or else $t\longrightarrow t’$ for some $t’$. Preservation: Types are preserved by one-step evaluation. If $t\ :\ T$ and $t\longrightarrow t’$, then $t’\ :\ T$. We will prove these later, but first we must state a few lemmas.Inversion lemmaAgain, for types we need to state the same (boring) inversion lemma: If $\text{true}: R$, then $R = \text{Bool}$. If $\text{false}: R$, then $R = \text{Bool}$. 
3. If $\ifelse: R$, then $t_1: \text{ Bool}$, $t_2: R$ and $t_3: R$
4. If $0: R$ then $R = \text{Nat}$
5. If $\text{succ } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$
6. If $\text{pred } t_1: R$ then $R = \text{Nat}$ and $t_1: \text{Nat}$
7. If $\text{iszero } t_1: R$ then $R = \text{Bool}$ and $t_1: \text{Nat}$

From the inversion lemma, we can directly derive a typechecking algorithm:

    def typeof(t: Expr): T = t match {
      case True | False => Bool
      case If(t1, t2, t3) =>
        val type1 = typeof(t1)
        val type2 = typeof(t2)
        val type3 = typeof(t3)
        if (type1 == Bool && type2 == type3) type2
        else throw new Error("not typable")
      case Zero => Nat
      case Succ(t1) =>
        if (typeof(t1) == Nat) Nat
        else throw new Error("not typable")
      case Pred(t1) =>
        if (typeof(t1) == Nat) Nat
        else throw new Error("not typable")
      case IsZero(t1) =>
        if (typeof(t1) == Nat) Bool
        else throw new Error("not typable")
    }

Canonical form

A simple lemma that will be useful for the progress theorem is that of canonical forms. Given a type, it tells us what kind of values we can expect:

1. If $v$ is a value of type Bool, then $v$ is either $\text{true}$ or $\text{false}$
2. If $v$ is a value of type Nat, then $v$ is a numeric value

The proof is immediate from the syntax of values.

Progress Theorem

Theorem: suppose that $t$ is a well-typed term of type $T$. Then either $t$ is a value, or else there exists some $t'$ such that $t\longrightarrow t'$.

Proof: by induction on a derivation of $t: T$.

- The cases for $\ref{eq:t-true}$, $\ref{eq:t-false}$ and $\ref{eq:t-zero}$ are immediate, since $t$ is a value in these cases.
- For $\ref{eq:t-if}$, we have $t=\ifelse$, with $t_1: \text{Bool}$, $t_2: T$ and $t_3: T$. By the induction hypothesis, either $t_1$ is a value, or there is some $t_1'$ such that $t_1 \longrightarrow t_1'$. If $t_1$ is a value, then rule 1 of the canonical form lemma tells us that $t_1$ must be either $\text{true}$ or $\text{false}$, in which case $\ref{eq:e-iftrue}$ or $\ref{eq:e-iffalse}$ applies to $t$.
  Otherwise, if $t_1 \longrightarrow t_1'$, then by $\ref{eq:e-if}$, $t\longrightarrow \if t_1' \then t_2 \text{ else } t_3$
- For $\ref{eq:t-succ}$, we have $t = \text{succ } t_1$, with $t_1: \text{Nat}$ by rule 5 of the inversion lemma. By the induction hypothesis, either $t_1$ is a value or it takes a step. If $t_1$ is a value, then by rule 2 of the canonical form lemma, $t_1 = nv$ for some numeric value $nv$. Therefore, $\text{succ } t_1$ is a value. If $t_1 \longrightarrow t_1'$, then $t\longrightarrow \text{succ }t_1'$.
- The cases for $\ref{eq:t-pred}$ and $\ref{eq:t-iszero}$ are similar.

Preservation Theorem

Theorem: Types are preserved by one-step evaluation. If $t: T$ and $t\longrightarrow t'$, then $t': T$.

Proof: by induction on the given typing derivation.

- For $\ref{eq:t-true}$ and $\ref{eq:t-false}$, the precondition doesn’t hold (no reduction is possible), so it’s trivially true. Indeed, $t$ is already a value, either $t=\text{ true}$ or $t=\text{ false}$.
- For $\ref{eq:t-if}$, there are three evaluation rules by which $t\longrightarrow t'$ can be derived, depending on $t_1$:
    - If $t_1 = \text{true}$, then by $\ref{eq:e-iftrue}$ we have $t'=t_2$, and from rule 3 of the inversion lemma and the assumption that $t: T$, we have $t_2: T$, that is $t': T$
    - If $t_1 = \text{false}$, then by $\ref{eq:e-iffalse}$ we have $t'=t_3$, and from rule 3 of the inversion lemma and the assumption that $t: T$, we have $t_3: T$, that is $t': T$
    - If $t_1 \longrightarrow t_1'$, then by the induction hypothesis, $t_1': \text{Bool}$. Combining this with the assumption that $t_2: T$ and $t_3: T$, we can apply $\ref{eq:t-if}$ to conclude $\if t_1' \then t_2 \else t_3: T$, that is $t': T$

Messing with it

Removing a rule

What if we remove $\ref{eq:e-predzero}$? Then pred 0 type checks, but it is stuck and is not a value; the progress theorem fails.

Changing type-checking rule

What if we change the $\ref{eq:t-if}$ to the following?

This doesn’t break our type system. It’s still sound, but it rejects if-else expressions that return other things than numbers (e.g. booleans).
But that is an expressiveness problem, not a soundness problem; our type system disallows things that would otherwise be fine by the evaluation rules.

Adding bit

We could add a boolean to natural function bit(t). We’d have to add it to the grammar, add some evaluation and typing rules, and prove progress and preservation.

We’ll do something similar below, so the full proof is omitted.

Simply typed lambda calculus

Simply Typed Lambda Calculus (STLC) is also denoted $\lambda_\rightarrow$. The “pure” form of STLC is not very interesting on the type-level (unlike for the term-level of pure lambda calculus), so we’ll allow base values that are not functions, like booleans and integers. To talk about STLC, we always begin with some set of “base types”:

    T ::=        // types
      Bool       // type of booleans
      T -> T     // type of functions

In the following examples, we’ll work with a mix of our previously defined toy language, and lambda calculus. This will give us a little syntactic sugar.

    t ::=                   // terms
      x                     // variable
      λx. t                 // abstraction
      t t                   // application
      true                  // constant true
      false                 // constant false
      if t then t else t    // conditional

    v ::=                   // values
      λx. t                 // abstraction value
      true                  // true value
      false                 // false value

Type annotations

We will annotate lambda-abstractions with the expected type of the argument, as follows:

We could also omit it, and let type inference do the job (as in OCaml), but for now, we’ll do the above. This will make it simpler, as we won’t have to discuss inference just yet.

Typing rules

In STLC, we’ve introduced abstraction. To add a typing rule for that, we need to encode the concept of an environment $\Gamma$, which is a set of variable assignments. 
We also introduce the “turnstile” symbol $\vdash$, meaning that the environment can verify the right-hand side typing, or that $\Gamma$ must imply the right-hand side.

This additional concept must be taken into account in our definition of progress and preservation:

- Progress: If $\Gamma\vdash t : T$, then either $t$ is a value or else $t\longrightarrow t’$ for some $t’$
- Preservation: If $\Gamma\vdash t : T$ and $t\longrightarrow t’$, then $\Gamma\vdash t’ : T$

To prove these, we must take the same steps as above. We’ll introduce the inversion lemma for typing relations, and restate the canonical forms lemma in order to prove the progress theorem.

Inversion lemma

Let’s start with the inversion lemma.

1. If $\Gamma\vdash\text{true} : R$ then $R = \text{Bool}$
2. If $\Gamma\vdash\text{false} : R$ then $R = \text{Bool}$
3. If $\Gamma\vdash\ifelse : R$ then $\Gamma\vdash t_1 : \text{Bool}$ and $\Gamma\vdash t_2, t_3: R$.
4. If $\Gamma\vdash x: R$ then $x: R \in\Gamma$
5. If $\Gamma\vdash\lambda x: T_1 .\ t_2 : R$ then $R = T_1 \rightarrow R_2$ for some $R_2$ with $\Gamma\cup(x: T_1)\vdash t_2: R_2$
6. If $\Gamma\vdash t_1\ t_2 : R$ then there is some type $T_{11}$ such that $\Gamma\vdash t_1 : T_{11} \rightarrow R$ and $\Gamma\vdash t_2 : T_{11}$.

Canonical form

The canonical forms are given as follows:

1. If $v$ is a value of type Bool, then it is either $\text{true}$ or $\text{false}$
2. If $v$ is a value of type $T_1 \rightarrow T_2$ then $v$ has the form $\lambda x: T_1 .\ t_2$

Progress

Finally, we get to prove progress by induction on typing derivations.

Theorem: Suppose that $t$ is a closed, well typed term (that is, $\Gamma\vdash t: T$ for some type $T$). Then either $t$ is a value, or there is some $t’$ such that $t\longrightarrow t’$. 
- For boolean constants, the proof is immediate as $t$ is a value
- For variables, the proof is immediate as $t$ is closed, and the precondition therefore doesn’t hold
- For abstraction, the proof is immediate as $t$ is a value
- Application is the only case we must treat. Consider $t = t_1\ t_2$, with $\Gamma\vdash t_1: T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2: T_{11}$. By the induction hypothesis, $t_1$ is either a value, or it can make a step of evaluation. The same goes for $t_2$. If $t_1$ can reduce, then rule $\ref{eq:e-app1}$ applies to $t$. Otherwise, if it is a value, and $t_2$ can take a step, then $\ref{eq:e-app2}$ applies. Otherwise, if they are both values (so neither congruence rule applies), then the canonical forms lemma above tells us that $t_1$ has the form $\lambda x: T_{11}.\ t_{12}$, and so rule $\ref{eq:e-appabs}$ applies to $t$.

Preservation

Theorem: If $\Gamma\vdash t: T$ and $t \longrightarrow t’$ then $\Gamma\vdash t’: T$.

Proof: by induction on typing derivations. We proceed on a case-by-case basis, as we have done so many times before. But one case is hard: application.

For $t = t_1\ t_2$, such that $\Gamma\vdash t_1 : T_{11} \rightarrow T_{12}$ and $\Gamma\vdash t_2 : T_{11}$, and where $T=T_{12}$, we want to show $\Gamma\vdash t’ : T_{12}$.

To do this, we must use the inversion lemma for evaluation (note that we haven’t written it down for STLC, but the idea is the same). There are three subcases for it, starting with the following:

The left-hand side is $t_1 = \lambda x: T_{11}.\ t_{12}$, and the right-hand side of application $t_2$ is a value $v_2$. In this case, we know that the result of the evaluation is given by $t’ = \left[ x\mapsto v_2 \right] t_{12}$.

And here, we already run into trouble, because we do not know how types act under substitution. 
We will therefore need to introduce some lemmas.

Weakening lemma

Weakening tells us that we can add assumptions to the context without losing any true typing statements:

If $\Gamma\vdash t: T$, and the environment $\Gamma$ has no information about $x$ (that is, $x\notin \text{dom}(\Gamma)$), then the initial assumption still holds if we add information about $x$ to the environment: $\Gamma\cup(x: S)\vdash t: T$ for any type $S$.

Moreover, the latter $\vdash$ derivation has the same depth as the former.

Permutation lemma

Permutation tells us that the order of assumptions in $\Gamma$ does not matter.

If $\Gamma \vdash t: T$ and $\Delta$ is a permutation of $\Gamma$, then $\Delta\vdash t: T$.

Moreover, the latter $\vdash$ derivation has the same depth as the former.

Substitution lemma

Substitution tells us that types are preserved under substitution.

That is, if $\Gamma\cup(x: S) \vdash t: T$ and $\Gamma\vdash s: S$, then $\Gamma\vdash \left[x\mapsto s\right] t: T$.

The proof goes by induction on the derivation of $\Gamma\cup(x: S) \vdash t: T$, that is, by cases on the final typing rule used in the derivation.

- Case $\ref{eq:t-app}$: in this case, $t = t_1\ t_2$. Thanks to typechecking, we know that the environment validates $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_1: T_2 \rightarrow T_1$ and $\bigl(\Gamma\cup (x: S)\bigr)\vdash t_2: T_2$. In this case, the resulting type of the application is $T=T_1$. By the induction hypothesis, $\Gamma\vdash[x\mapsto s]t_1 : T_2 \rightarrow T_1$, and $\Gamma\vdash[x\mapsto s]t_2 : T_2$. By $\ref{eq:t-app}$, the environment then also verifies the application of these two substitutions as $T$: $\Gamma\vdash[x\mapsto s]t_1\ [x\mapsto s]t_2: T$. We can factorize the substitution to obtain the conclusion, i.e. $\Gamma\vdash \left[x\mapsto s\right](t_1\ t_2): T$
- Case $\ref{eq:t-var}$: if $t=z$ ($t$ is a simple variable $z$) where $z: T \in \bigl(\Gamma\cup (x: S)\bigr)$. 
There are two subcases to consider here, depending on whether $z$ is $x$ or another variable:
  - If $z=x$, then $\left[x\mapsto s\right] z = s$. The result is then $\Gamma\vdash s: S$, which is among the assumptions of the lemma
  - If $z\ne x$, then $\left[x\mapsto s\right] z = z$, and the desired result is immediate
- Case $\ref{eq:t-abs}$: if $t=\lambda y: T_2.\ t_1$, with $T=T_2\rightarrow T_1$, and $\bigl(\Gamma\cup (x: S)\cup (y: T_2)\bigr)\vdash t_1 : T_1$. Based on our hygiene convention, we may assume $x\ne y$ and $y \notin \text{fv}(s)$. Using permutation on the first given subderivation in the lemma ($\Gamma\cup(x: S) \vdash t: T$), we obtain $\bigl(\Gamma\cup (y: T_2)\cup (x: S)\bigr)\vdash t_1 : T_1$ (we have simply changed the order of $x$ and $y$). Using weakening on the other given derivation in the lemma ($\Gamma\vdash s: S$), we obtain $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash s: S$. By the induction hypothesis, $\bigl(\Gamma\cup (y: T_2)\bigr)\vdash\left[x\mapsto s\right] t_1: T_1$. By $\ref{eq:t-abs}$, we have $\Gamma\vdash(\lambda y: T_2.\ [x\mapsto s]t_1): T_2 \rightarrow T_1$. By the definition of substitution, this is $\Gamma\vdash([x\mapsto s]\lambda y: T_2.\ t_1): T_2 \rightarrow T_1$.

Proof

We’ve now proven the following lemmas:

- Weakening
- Permutation
- Type preservation under substitution
- Type preservation under reduction (i.e. preservation)

We won’t actually do the proof, we’ve just set up the pieces we need for it.

Erasure

Type annotations do not play any role in evaluation. In STLC, we don’t do any run-time checks, we only run compile-time type checks. Therefore, types can be removed before evaluation. This often happens in practice, where types do not appear in the compiled form of a program; they’re typically encoded in an untyped fashion. 
The semantics of this conversion can be formalized by an erasure function:

Curry-Howard Correspondence

The Curry-Howard correspondence tells us that there is a correspondence between propositional logic and types.

An implication $P\supset Q$ (which could also be written $P\implies Q$) can be proven by transforming evidence for $P$ into evidence for $Q$. A conjunction $P\land Q$ is a pair of evidence for $P$ and evidence for $Q$. For more examples of these correspondences, see the Brouwer–Heyting–Kolmogorov (BHK) interpretation.

Logic | Programming languages
Propositions | Types
$P \supset Q$ | Type $P\rightarrow Q$
$P \land Q$ | Pair type $P\times Q$
$P \lor Q$ | Sum type $P+Q$
$\exists x\in S: \phi(x)$ | Dependent type $\sum{x: S, \phi(x)}$
$\forall x\in S: \phi(x)$ | $\forall (x:S): \phi(x)$
Proof of $P$ | Term $t$ of type $P$
$P$ is provable | Type $P$ is inhabited
Proof simplification | Evaluation

In Scala, all types are inhabited except for the bottom type Nothing. Singleton types are only inhabited by a single term.

As an example of the equivalence, we’ll see that application is equivalent to modus ponens:

This also tells us that if we can prove something, we can evaluate it.

How can we prove the following? Remember that $\rightarrow$ is right-associative.

The proof is actually a somewhat straightforward conversion to lambda calculus:

Extensions to STLC

Base types

Up until now, we’ve defined our base types (such as $\text{Nat}$ and $\text{Bool}$) manually: we’ve added them to the syntax of types, with associated constants ($\text{zero}, \text{true}, \text{false}$) and operators ($\text{succ}, \text{pred}$), as well as associated typing and evaluation rules.

This is a lot of minutiae though, especially for theoretical discussions. For those, we can often ignore the term-level inhabitants of the base types, and just treat them as uninterpreted constants: we don’t really need the distinction between constants and values. For theory, we can just assume that some generic base types (e.g. 
$B$ and $C$) exist, without defining them further.

Unit type

In C-like languages, this type is usually called void. To introduce it, we do not add any computation rules. We must only add it to the grammar, values and types, and then add a single typing rule that trivially verifies units.

Units are not too interesting, but are quite useful in practice, in part because they allow for other extensions.

Sequencing

We can define sequencing as two statements following each other:

    t ::=
      ...
      t1; t2

This implies adding some evaluation and typing rules, defined below:

But there’s another way that we could define sequencing: simply as syntactic sugar, a derived form for something else. In this way, we define an external language, that is transformed to an internal language by the compiler in the desugaring step.

This is useful to know, because it makes proving soundness much easier. We do not need to re-state the inversion lemma, or re-prove preservation and progress. We can simply rely on the proof for the underlying internal language.

Ascription

    t ::=
      ...
      t as T

Ascription allows us to have a compiler type-check a term as really being of the correct type.

The typing rule is simply:

This seems like it preserves soundness, but instead of doing the whole proof over again, we’ll just propose a simple desugaring:

An ascription is equivalent to the identity function, typed to take and return $T$, applied to the term $t$.

Pairs

We can introduce pairs into our grammar.

    t ::= ...
      {t, t}     // pair
      t.1        // first projection
      t.2        // second projection

    v ::= ...
      {v, v}     // pair value

    T ::= ...
      T1 x T2    // product types

We can also introduce evaluation rules for pairs:

The typing rules are then:

Pairs have to be added “the hard way”: we do not really have a way to define them in a derived form, as we have no existing language features to piggyback onto.

Tuples

Tuples are like pairs, except that we do not restrict them to 2 elements; we allow an arbitrary number from 1 to n. 
We can use pairs to encode tuples: (a, b, c) can be encoded as (a, (b, c)). For performance and convenience, though, most languages implement them natively.

Records

We can easily generalize tuples to records by annotating each field with a label. A record is a bundle of values with labels; it’s a map of labels to values and types. The order of the fields in a record doesn’t matter; the only index is the label.

If we allow numeric labels, then we can encode a tuple as a record, where the index implicitly encodes the numeric label of the record representation.

Few mainstream languages have language-level support for records (two case classes in Scala may have the same arguments but a different constructor, so it’s not quite the same; records are more like anonymous objects). This is because they’re often quite inefficient in practice, but we’ll still use them as a theoretical abstraction.

Sums and variants

Sum type

A sum type $T = T_1 + T_2$ is a disjoint union of $T_1$ and $T_2$. Pragmatically, we can have sum types in Scala with case classes extending a sealed trait:

    sealed trait Option[+T]
    case class Some[+T](value: T) extends Option[T]
    case object None extends Option[Nothing]

In this example, Option = Some + None. We say that $T_1$ is on the left, and $T_2$ on the right. Disjointness is ensured by the tags $\text{inl}$ and $\text{inr}$. We can think of these as functions that inject into the left or right of the sum type $T$:

Still, these aren’t really functions, they don’t actually have function type. Instead, we use them to tag the left and right side of a sum type, respectively.

Another way to think of these stems from the Curry-Howard correspondence. 
Recall that in the BHK interpretation, a proof of $P \lor Q$ is a pair <a, b> where a is 0 (also denoted $\text{inl}$) and b a proof of $P$, or a is 1 (also denoted $\text{inr}$) and b is a proof of $Q$.

To use elements of a sum type, we can introduce a case construct that allows us to pattern-match on a sum type, allowing us to distinguish the left type from the right one.

We need to introduce these three special forms in our syntax:

    t ::= ...                              // terms
      inl t                                // tagging (left)
      inr t                                // tagging (right)
      case t of inl x => t | inr x => t    // case

    v ::= ...                              // values
      inl v                                // tagged value (left)
      inr v                                // tagged value (right)

    T ::= ...                              // types
      T + T                                // sum type

This also leads us to introduce some new evaluation rules:

And we’ll also introduce three typing rules:

Sums and uniqueness of type

The rules $\ref{eq:t-inr}$ and $\ref{eq:t-inl}$ may seem confusing at first. We only have one type to deduce from, so what do we assign to $T_2$ and $T_1$, respectively? These rules mean that we have lost uniqueness of types: if $t$ has type $T$, then $\text{inl } t$ has type $T+U$ for every $U$.

There are a couple of solutions to this:

- We can infer $U$ as needed during typechecking
- Give constructors different names and only allow each name to appear in one sum type. This requires generalization to variants, which we’ll see next. OCaml adopts this solution.
- Annotate each inl and inr with the intended sum type.

For now, we don’t want to look at type inference and variants, so we’ll choose the third approach for simplicity. We’ll introduce these annotations as ascriptions on the injection operators in our grammar:

    t ::=
      ...
      inl t as T
      inr t as T

    v ::=
      ...
      inl v as T
      inr v as T

The evaluation rules would be exactly the same as previously, but with ascriptions in the syntax. 
The injection operators now also specify which sum type we’re injecting into, for the sake of uniqueness of type.

Variants

Just as we generalized binary products to labeled records, we can generalize binary sums to labeled variants. We can label the members of the sum type, so that we write $\langle l_1: T_1, l_2: T_2 \rangle$ instead of $T_1 + T_2$ ($l_1$ and $l_2$ are the labels).

As a motivating example, we’ll show a useful idiom that is possible with variants, the optional value. We’ll use this to create a table. The example below is just like in OCaml.

    OptionalNat = <none: Unit, some: Nat>;
    Table = Nat -> OptionalNat;
    emptyTable = λt: Nat. <none=unit> as OptionalNat;

    extendTable =
      λt: Table. λkey: Nat. λval: Nat.
        λsearch: Nat.
          if (equal search key) then <some=val> as OptionalNat
          else (t search)

The implementation works a bit like a linked list, with linear look-up. We can use the result from the table by distinguishing the outcome with a case:

    x = case t(5) of
      <none=u> => 999
    | <some=v> => v

Recursion

In STLC, all programs terminate. We won’t go into too much detail on this topic, but the main idea is that evaluation of a well-typed program is guaranteed to halt; we say that the well-typed terms are normalizable.

Indeed, the infinite recursions from untyped lambda calculus (terms like $\text{omega}$ and $\text{fix}$) are not typable, and thus cannot appear in STLC. Since we can’t express $\text{fix}$ in STLC, instead of defining it as a term in the language, we can add it as a primitive to get recursion.

    t ::=
      ...
      fix t

We’ll need to add evaluation rules recreating its behavior, and a typing rule that restricts its use to the intended use-case.

In order for a function to be recursive, the function needs to map a type to the same type, hence the restriction of $T_1 \rightarrow T_1$. The type $T_1$ will itself be a function type if we’re doing a recursion. Still, note that the type system doesn’t enforce this. 
There will actually be situations in which it will be handy to use something other than a function type inside a fix operator.

Seeing that this fixed-point notation can be a little involved, we can introduce some nice syntactic sugar to work with it:

This $t_1$ can now refer to the $x$; that’s the convenience offered by the construct. Although we don’t strictly need to introduce typing rules (it’s syntactic sugar, we’re relying on existing constructs), a typing rule for this could be:

In Scala, a common error message is that a recursive function needs an explicit return type, for the same reasons as the typing rule above.

References

Mutability

In most programming languages, variables are (or can be) mutable. That is, variables can provide a name referring to a previously calculated value, as well as a way of overwriting this value with another (under the same name). How can we model this in STLC?

Some languages (e.g. OCaml) actually formally separate variables from mutation. In OCaml, variables are only for naming; the binding between a variable and a value is immutable. However, there is the concept of mutable values, also called reference cells or references. This is the style we’ll study, as it is easier to work with formally. A mutable value is represented at the type level as a Ref T (or perhaps even a Ref(Option T), since the null pointer cannot produce a value).

The basic operations are allocation with the ref operator, dereferencing with ! (in C, we use the * prefix), and assignment with :=, which updates the content of the reference cell. Assignment returns a unit value.

Aliasing

Two variables can reference the same cell: we say that they are aliases for the same cell. Aliasing is when we have different references (under different names) to the same cell. 
Modifying the value of the reference cell through one alias modifies the value for all other aliases.

The possibility of aliasing is all around us, in object references, explicit pointers (in C), arrays, communication channels, I/O devices; there’s practically no way around it. Yet, alias analysis is quite complex, costly, and often makes it hard for compilers to do optimizations they would like to do.

With mutability, the order of operations now matters; r := 1; r := 2 isn’t the same as r := 2; r := 1. If we recall the Church-Rosser theorem, we’ve lost the principle that all reduction paths lead to the same result. Therefore, some language designers disallow mutation (e.g. in Haskell). But there are benefits to allowing it, too: efficiency, dependency-driven data flow (e.g. in GUI), shared resources for concurrency (locks), etc. Therefore, most languages provide it.

Still, languages without mutability have come up with a bunch of abstractions that allow us to have some of the benefits of mutability, like monads and lenses.

Typing rules

We’ll introduce references as a type Ref T to represent a variable of type T. We can construct a reference as r = ref 5, and access the contents of the reference using !r (this would return 5 instead of ref 5).

Let’s define references in our language:

    t ::=          // terms
      unit         // unit constant
      x            // variable
      λx: T. t     // abstraction
      t t          // application
      ref t        // reference creation
      !t           // dereference
      t := t       // assignment

Evaluation

What is the value of ref 0? The crucial observation is that evaluating ref 0 must do something. Otherwise, the two following would behave the same:

    r = ref 0
    s = ref 0

    r = ref 0
    s = r

Evaluating ref 0 should allocate some storage, and return a reference (or pointer) to that storage. A reference names a location in the store (also known as the heap, or just memory). Concretely, the store could be an array of 8-bit bytes, indexed by 32-bit integers. 
More abstractly, it’s an array of values, or even more abstractly, a partial function from locations to values.

We can introduce this idea of locations in our syntax. This syntax is exactly the same as the previous one, but adds the notion of locations:

    v ::=          // values
      unit         // unit constant
      λx: T. t     // abstraction value
      l            // store location

    t ::=          // terms
      unit         // unit constant
      x            // variable
      λx: T. t     // abstraction
      t t          // application
      ref t        // reference creation
      !t           // dereference
      t := t       // assignment
      l            // store location

This doesn’t mean that we’ll allow programmers to write explicit locations in their programs. We just use this as a modeling trick; we’re enriching the internal language to include some run-time structures.

With this added notion of stores and locations, the result of an evaluation now depends on the store in which it is evaluated, which we need to reflect in our evaluation rules. Evaluation must now include terms $t$ and store $\mu$:

Let’s take a look at the evaluation rules for STLC with references, operator by operator.

The assignments $\ref{eq:e-assign1}$ and $\ref{eq:e-assign2}$ evaluate terms until they become values. When they have been reduced, we can do the actual assignment: as per $\ref{eq:e-assign}$, we update the store and return unit.

A reference $\text{ref }t_1$ first evaluates $t_1$ until it is a value ($\ref{eq:e-ref}$). To evaluate the reference operator, we find a fresh location $l$ in the store, to which it binds $v_1$, and it returns the location $l$.

We find the same congruence rule as usual in $\ref{eq:e-deref}$, where a term $!t_1$ first evaluates $t_1$ until it is a value. Once it is a value, we can return the value in the current store using $\ref{eq:e-derefloc}$.

The evaluation rules for abstraction and application are augmented with stores, but otherwise unchanged.

Store typing

What is the type of a location? The answer to this depends on what is in the store. 
Unless we specify it, a store could contain anything at a given location, which is problematic for typechecking. The solution is to type the locations themselves. This leads us to a typed store:

As a first attempt at a typing rule, we can just say that the type of a location is given by the type of the value in the store at that location:

This is problematic though; in the following, the typing derivation for $!l_2$ would be infinite because we have a cyclic reference:

The core of the problem here is that we would need to recompute the type of a location every time. But that shouldn’t be necessary. Seeing that references are strongly typed as Ref T, we know exactly what type of value we can place in a given store location. Indeed, the typing rules we chose for references guarantee that a given location in the store is always used to hold values of the same type.

So to fix this problem, we need to introduce a store typing. This is a partial function from locations to types, which we’ll denote by $\Sigma$.

Suppose we’re given a store typing $\Sigma$ describing the store $\mu$. We can use $\Sigma$ to look up the types of locations, without doing a lookup in $\mu$:

This tells us how to check the store typing, but how do we create it? We can start with an empty typing $\Sigma = \emptyset$, and add a typing relation with the type of $v_1$ when a new location is created during evaluation of $\ref{eq:e-refv}$.

The rest of the typing rules remain the same, but are augmented with the store typing. So in conclusion, we have updated our evaluation rules with a store $\mu$, and our typing rules with a store typing $\Sigma$.

Safety

Let’s take a look at progress and preservation in this new type system. Preservation turns out to be more interesting, so let’s look at that first.

We’ve added a store and a store typing, so we need to extend the statement of preservation to include these. Naively, we’d write:

But this would be wrong! 
In this statement, $\Sigma$ and $\mu$ would not be constrained to be correlated at all, which they need to be. This constraint can be defined as follows:

A store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$ (which we denote by $\Gamma\mid\Sigma\vdash\mu$) if the following is satisfied:

This gets us closer, and we can write the following preservation statement:

But this is still wrong! When we create a new cell with $\ref{eq:e-refv}$, we would break the correspondence between store typing and store.

The correct version of the preservation theorem is the following:

This preservation theorem just asserts that there is some store typing $\Sigma’ \supseteq \Sigma$ (agreeing with $\Sigma$ on the values of all old locations, but that may also have added new locations), such that $t’$ is well typed in $\Sigma’$.

The progress theorem must also be extended with stores and store typings:

Suppose that $t$ is a closed, well-typed term; that is, $\emptyset\mid\Sigma\vdash t: T$ for some type $T$ and some store typing $\Sigma$. Then either $t$ is a value or else, for any store $\mu$ such that $\emptyset\mid\Sigma\vdash\mu$ [2], there is some term $t’$ and store $\mu’$ with $t\mid\mu \longrightarrow t’\mid\mu’$.

[1] $(t, C) \in \text{Consts}$ is equivalent to $\text{Consts}(t) = C$
[2] Recall that this notation is used to say a store $\mu$ is well typed with respect to a typing context $\Gamma$ and a store typing $\Sigma$, as defined in the section on safety in STLC with stores.

CS-451 Distributed Algorithms2018-09-18T00:00:00+00:002018-09-18T00:00:00+00:00https://kjaer.io/distributed-algorithms
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a> <ul>
<li><a href="#links" id="markdown-toc-links">Links</a> <ul>
<li><a href="#fair-loss-link-fll" id="markdown-toc-fair-loss-link-fll">Fair loss link (FLL)</a></li>
<li><a href="#stubborn-link" id="markdown-toc-stubborn-link">Stubborn link</a></li>
<li><a href="#perfect-link" id="markdown-toc-perfect-link">Perfect link</a></li>
</ul>
</li>
<li><a href="#impossibility-of-consensus" id="markdown-toc-impossibility-of-consensus">Impossibility of consensus</a> <ul>
<li><a href="#solvable-atomicity-problem" id="markdown-toc-solvable-atomicity-problem">Solvable atomicity problem</a></li>
<li><a href="#unsolvable-atomicity-problem" id="markdown-toc-unsolvable-atomicity-problem">Unsolvable atomicity problem</a></li>
</ul>
</li>
<li><a href="#failure-detection" id="markdown-toc-failure-detection">Failure detection</a></li>
</ul>
</li>
<li><a href="#reliable-broadcast" id="markdown-toc-reliable-broadcast">Reliable broadcast</a> <ul>
<li><a href="#best-effort-broadcast" id="markdown-toc-best-effort-broadcast">Best-effort broadcast</a></li>
<li><a href="#reliable-broadcast-1" id="markdown-toc-reliable-broadcast-1">Reliable broadcast</a></li>
<li><a href="#uniform-reliable-broadcast" id="markdown-toc-uniform-reliable-broadcast">Uniform reliable broadcast</a></li>
</ul>
</li>
<li><a href="#causal-order-broadcast" id="markdown-toc-causal-order-broadcast">Causal order broadcast</a> <ul>
<li><a href="#motivation" id="markdown-toc-motivation">Motivation</a></li>
<li><a href="#causality" id="markdown-toc-causality">Causality</a></li>
<li><a href="#algorithm" id="markdown-toc-algorithm">Algorithm</a></li>
</ul>
</li>
<li><a href="#total-order-broadcast" id="markdown-toc-total-order-broadcast">Total order broadcast</a></li>
<li><a href="#consensus" id="markdown-toc-consensus">Consensus</a> <ul>
<li><a href="#consensus-algorithm" id="markdown-toc-consensus-algorithm">Consensus algorithm</a></li>
<li><a href="#uniform-consensus-algorithm" id="markdown-toc-uniform-consensus-algorithm">Uniform consensus algorithm</a></li>
<li><a href="#uniform-consensus-algorithm-with-eventually-perfect-failure-detector" id="markdown-toc-uniform-consensus-algorithm-with-eventually-perfect-failure-detector">Uniform consensus algorithm with eventually perfect failure detector</a></li>
</ul>
</li>
<li><a href="#atomic-commit" id="markdown-toc-atomic-commit">Atomic commit</a> <ul>
<li><a href="#non-blocking-atomic-commit-nbac" id="markdown-toc-non-blocking-atomic-commit-nbac">Non-Blocking Atomic Commit (NBAC)</a></li>
<li><a href="#2-phase-commit" id="markdown-toc-2-phase-commit">2-Phase Commit</a></li>
</ul>
</li>
<li><a href="#terminating-reliable-broadcast-trb" id="markdown-toc-terminating-reliable-broadcast-trb">Terminating reliable broadcast (TRB)</a></li>
<li><a href="#group-membership" id="markdown-toc-group-membership">Group membership</a></li>
<li><a href="#view-synchronous-vs-communication" id="markdown-toc-view-synchronous-vs-communication">View-Synchronous (VS) communication</a></li>
<li><a href="#from-message-passing-to-shared-memory" id="markdown-toc-from-message-passing-to-shared-memory">From message passing to Shared memory</a></li>
<li><a href="#byzantine-failures" id="markdown-toc-byzantine-failures">Byzantine failures</a></li>
</ul>
<p>⚠ <em>Work in progress</em></p>
<!-- More -->
<h2 id="introduction">Introduction</h2>
<ul>
<li><a href="http://dcl.epfl.ch/site/education/da">Website</a></li>
<li>Course follows the book <em>Introduction to Reliable (and Secure) Distributed Programming</em></li>
<li>Final exam is 60%</li>
<li>Projects in teams of 2-3 are 40%
<ul>
<li>The project is the implementation of a blockchain</li>
<li>Send team members to matej.pavlovic@epfl.ch</li>
</ul>
</li>
<li>No midterm</li>
</ul>
<p>Distributed algorithms sit between the application and the channel.</p>
<p>We have a few commonly used abstractions:</p>
<ul>
<li><strong>Processes</strong> abstract computers</li>
<li><strong>Channels</strong> abstract networks</li>
<li><strong>Failure detectors</strong> abstract time</li>
</ul>
<p>When defining a problem, there are two important properties that we care about:</p>
<ul>
<li><strong>Safety</strong> states that nothing bad should happen</li>
<li><strong>Liveness</strong> states that something good should happen</li>
</ul>
<p>Safety alone is trivially satisfied by doing nothing, so we also need liveness to make sure that the correct things actually happen.</p>
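<p>To make the distinction concrete, here is a small Python sketch that checks one property of each kind on a finite trace of sent and delivered messages. The function names <code>no_creation</code> and <code>some_delivery</code> and the example traces are illustrative assumptions, not part of the course material. A true liveness property is a statement about infinite executions, so the second check is only a finite stand-in.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python">def no_creation(sent, delivered):
    """Safety check: every delivered message was actually sent."""
    return all(m in sent for m in delivered)


def some_delivery(sent, delivered):
    """Finite stand-in for liveness: if anything was sent, something arrived."""
    return len(sent) == 0 or len(delivered) != 0


sent = ["a", "b", "c"]
silent = []            # a channel that does nothing at all
lossy = ["a", "c"]     # a channel that drops "b" but delivers the rest

# The silent channel is safe, but nothing good ever happens.
print(no_creation(sent, silent), some_delivery(sent, silent))
# The lossy channel is safe and also makes progress.
print(no_creation(sent, lossy), some_delivery(sent, lossy))
</code></pre></figure>
<p>The silent channel passes the safety check and fails the progress check, which is why a specification that only lists safety properties is not enough.</p>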
<h3 id="links">Links</h3>
<p>Two nodes can communicate through a link by passing messages. However, this message passing can be faulty: it can drop messages or repeat them. How can we ensure correct and reliable message passing under such conditions?</p>
<p>A link has two basic types of events:</p>
<ul>
<li>Send</li>
<li>Deliver</li>
</ul>
<h4 id="fair-loss-link-fll">Fair loss link (FLL)</h4>
<p>A fair loss link is a link that may lose or repeat some packets. This is the weakest type of link we can assume. In practice, it corresponds to UDP.</p>
<p>Deliver can be thought of as a reception event on the receiver end. The terminology used here (“deliver”) implies that the link delivers to the client, but this can equally be thought of as the client receiving from the link.</p>
<p>For a link to be considered a fair-loss link, we must respect the following three properties:</p>
<ul>
<li><strong>Fair loss</strong>: if the sender sends infinitely many times, the receiver must deliver infinitely many times. This does not guarantee that all messages get through, but at least ensures that some messages get through.</li>
<li><strong>No creation</strong>: every delivery must be the result of a send; no message may be created out of thin air.</li>
<li><strong>Finite duplication</strong>: a message can only be repeated by the link a finite number of times.</li>
</ul>
<h4 id="stubborn-link">Stubborn link</h4>
<p>A stubborn link is one that stubbornly delivers messages; that is, it ensures that the message is received, with no regard to performance.</p>
<p>A stubborn link can be implemented with a FLL as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">upon</span> <span class="n">send</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
    <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
        <span class="n">FLL</span><span class="o">.</span><span class="n">send</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">FLL</span><span class="o">.</span><span class="n">deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
    <span class="n">trigger</span> <span class="n">deliver</span><span class="p">(</span><span class="n">m</span><span class="p">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>The above uses generic pseudocode, but the syntax we’ll use in this course is as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">StubbornLinks</span> <span class="p">(</span><span class="n">sp2p</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">FairLossLinks</span> <span class="p">(</span><span class="n">flp2p</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">sp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
    <span class="k">while</span> <span class="bp">True</span> <span class="n">do</span><span class="p">:</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">flp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">flp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">sp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Note that this piece of code is meant to sit between two abstraction levels; it is between the channel and the application. As such, it receives sends from the application and forwards them to the link, and receives delivers from the link and forwards them to the application.</p>
<p>It must respect the interface of the underlying FLL, and as such, only specifies send and deliver hooks.</p>
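<p>As a minimal runnable sketch of the same idea, the following Python simulates a stubborn link on top of a lossy channel. The class names, the <code class="highlighter-rouge">loss</code> probability and the <code class="highlighter-rouge">rounds</code> bound are assumptions made for illustration: a real stubborn link retransmits forever, which the bounded loop only stands in for.</p>

```python
import random


class FairLossLink:
    """Hypothetical lossy channel: each send is dropped with probability `loss`."""

    def __init__(self, loss, seed=0):
        self.loss = loss
        self.rng = random.Random(seed)
        self.on_deliver = None  # callback installed by the layer above

    def send(self, m):
        # The message gets through only some of the time.
        if self.rng.random() >= self.loss:
            self.on_deliver(m)


class StubbornLink:
    """Retransmits every message `rounds` times (a stand-in for `while True`)."""

    def __init__(self, fll, rounds):
        self.fll = fll
        self.rounds = rounds
        self.delivered = []  # everything handed to the application, duplicates included
        fll.on_deliver = self.delivered.append

    def send(self, m):
        for _ in range(self.rounds):
            self.fll.send(m)


link = StubbornLink(FairLossLink(loss=0.5, seed=42), rounds=50)
link.send("hello")
print(len(link.delivered), "copies of", link.delivered[0])
```

<p>Note that the application sees the same message many times: the stubborn link trades duplicates for delivery, and removing them is left to the layer above.</p>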
<h4 id="perfect-link">Perfect link</h4>
<p>Here again, we respect the send/deliver interface. The properties are:</p>
<ul>
<li><strong>Validity</strong> or reliable delivery: if both peers are correct, then every message sent is eventually delivered</li>
<li><strong>No duplication</strong></li>
<li><strong>No creation</strong></li>
</ul>
<p>This is the type of link that we usually use: TCP behaves much like a perfect link, although it also offers more guarantees (notably message ordering, which this definition of a perfect link does not require). TCP stubbornly retransmits a message until it gets an acknowledgement, at which point it can stop transmitting. Acknowledgements aren’t actually needed <em>in theory</em>: the algorithm would still work without them, but it would completely flood the network. Acknowledgements are therefore a practical consideration for performance; just note that the theorists don’t care about them.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">PerfectLinks</span> <span class="p">(</span><span class="n">pp2p</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">StubbornLinks</span> <span class="p">(</span><span class="n">sp2p</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span><span class="p">;</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">pp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">sp2pSend</span><span class="p">,</span> <span class="n">dest</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">sp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span>
    <span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span> <span class="n">then</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">pp2pDeliver</span><span class="p">,</span> <span class="n">src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">;</span>
        <span class="n">add</span> <span class="n">m</span> <span class="n">to</span> <span class="n">delivered</span><span class="p">;</span></pre></td></tr></tbody></table></code></pre></figure>
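<p>The deduplication step can be sketched in a few lines of Python. This is only an illustration: the class and method names are made up, and the sketch remembers <code class="highlighter-rouge">(src, m)</code> pairs rather than bare messages, which is an assumption so that two different senders can send the same payload.</p>

```python
class PerfectLink:
    """Filters the duplicates produced by a stubborn link below it."""

    def __init__(self):
        self.delivered = set()  # (src, m) pairs already handed to the application
        self.output = []        # what the application actually sees, in order

    def on_stubborn_deliver(self, src, m):
        if (src, m) not in self.delivered:
            self.delivered.add((src, m))
            self.output.append((src, m))


pl = PerfectLink()
# The stubborn link repeats ("p1", "a") three times in total.
for src, m in [("p1", "a"), ("p1", "a"), ("p2", "a"), ("p1", "b"), ("p1", "a")]:
    pl.on_stubborn_deliver(src, m)
print(pl.output)  # [('p1', 'a'), ('p2', 'a'), ('p1', 'b')]
```

<p>The <code class="highlighter-rouge">delivered</code> set grows without bound here, just as in the pseudocode; this is fine for reasoning about correctness but would need pruning in practice.</p>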
<h3 id="impossibility-of-consensus">Impossibility of consensus</h3>
<p>Suppose we’d like to compute prime numbers on a distributed system. Let <em>P</em> be the producer of prime numbers. Whenever it finds one, it notifies two servers, <em>S1</em> and <em>S2</em> about it. A client <em>C</em> may request the full list of known prime numbers from either server.</p>
<p>As in any distributed system, we want the servers to behave as a single (abstract) machine.</p>
<h4 id="solvable-atomicity-problem">Solvable atomicity problem</h4>
<p><em>P</em> finds 1013 as a new prime number, and sends it to <em>S1</em>, which receives it immediately, and <em>S2</em>, which receives it after a long delay. In the meantime, before both servers have received the update, we have an atomicity problem: one server has a different list from the other. In this time window, <em>C</em> will get different results from <em>S1</em> (which has numbers up to 1013) and <em>S2</em> (which only has numbers up to 1009, which is the previous prime).</p>
<p>A simple way to solve this is to have <em>C</em> send the new number (1013) to the other servers; if it requested from <em>S1</em> it’ll send the update to <em>S2</em> as a kind of write back, to make sure that <em>S2</em> also has it for the next request. We haven’t strictly defined the problem or its requirements, but this may need to assume a link that guarantees delivery and order (i.e. TCP, not UDP).</p>
<h4 id="unsolvable-atomicity-problem">Unsolvable atomicity problem</h4>
<p>Now assume that we have two prime number producers <em>P1</em> and <em>P2</em>. This introduces a new atomicity problem: the updates may not reach all servers atomically in order, and the servers cannot agree on the order.</p>
<p>This is <strong>impossible</strong> to solve; we won’t prove it, but Turing universality is lost (unless we make very strong assumptions). This result is known as the <a href="https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf"><em>impossibility of consensus</em></a>.</p>
<h3 id="failure-detection">Failure detection</h3>
<p>A <strong>failure detector</strong> is a distributed oracle that provides processes with suspicions about crashed processes. There are two kinds of failure detectors, with the following properties:</p>
<ul>
<li><strong>Perfect</strong>
<ul>
<li><strong>Strong completeness</strong>: eventually, every process that crashed is permanently suspected by every correct process</li>
<li><strong>Strong accuracy</strong>: no process is suspected before it crashes</li>
</ul>
</li>
<li><strong>Eventually perfect</strong>
<ul>
<li><strong>Strong completeness</strong></li>
<li><strong>Eventual strong accuracy</strong>: eventually, no correct process is ever suspected</li>
</ul>
</li>
</ul>
<p>An eventually perfect detector may make mistakes and may operate under a delay. But eventually, it will tell us the truth.</p>
<p>A failure detector can be implemented by the following algorithm:</p>
<ol>
<li>Processes periodically send heartbeat messages</li>
<li>A process sets a timeout based on the worst-case round trip of a message exchange</li>
<li>A process suspects another process of having failed if that process’s heartbeat does not arrive before the timeout expires</li>
<li>A process that delivers a message from a suspected process revises its suspicion and doubles the timeout</li>
</ol>
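<p>The four steps above can be sketched as follows. This is a simplified illustration rather than a real implementation: time is an explicit integer passed in by the caller, and the names <code class="highlighter-rouge">HeartbeatDetector</code>, <code class="highlighter-rouge">on_heartbeat</code> and <code class="highlighter-rouge">tick</code> are hypothetical.</p>

```python
class HeartbeatDetector:
    """Timeout-based failure detector driven by an explicit logical clock."""

    def __init__(self, peers, timeout):
        self.timeout = {p: timeout for p in peers}  # per-peer timeout
        self.last_seen = {p: 0 for p in peers}      # time of the last heartbeat
        self.suspected = set()

    def on_heartbeat(self, peer, now):
        self.last_seen[peer] = now
        if peer in self.suspected:
            # False suspicion: revise it and be more patient next time.
            self.suspected.discard(peer)
            self.timeout[peer] *= 2

    def tick(self, now):
        # Suspect every peer whose heartbeat is overdue.
        for peer, seen in self.last_seen.items():
            if now - seen > self.timeout[peer]:
                self.suspected.add(peer)


fd = HeartbeatDetector(["q", "r"], timeout=2)
fd.on_heartbeat("q", now=1)
fd.on_heartbeat("r", now=1)
fd.tick(now=4)                 # both have been silent for 3 > 2 time units
print(sorted(fd.suspected))    # ['q', 'r']
fd.on_heartbeat("q", now=5)    # q was only slow, not crashed
fd.tick(now=6)
print(sorted(fd.suspected), fd.timeout["q"])  # ['r'] 4
```

<p>Doubling the timeout is what lets the detector eventually stop making mistakes once message delays become bounded: after finitely many false suspicions, the timeout exceeds the actual delay.</p>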
<p>Failure detection algorithms are all designed under certain <strong>timing assumptions</strong>. The following timing assumptions are possible:</p>
<ul>
<li><strong>Synchronous</strong>
<ul>
<li><strong>Processing</strong>: the time it takes for a process to execute is bounded and known.</li>
<li><strong>Delays</strong>: there is a known upper bound on the time it takes for a message to be received</li>
<li><strong>Clocks</strong>: the drift between a local clock and the global, real-time clock is bounded and known</li>
</ul>
</li>
<li><strong>Eventually synchronous</strong>: the timing assumptions hold eventually</li>
<li><strong>Asynchronous</strong>: no assumptions</li>
</ul>
<p>These 3 possible assumption levels mean that the world is divided into 3 kinds of failure detection algorithms. The algorithm above, which revises wrong suspicions and doubles its timeout, appears to be based on the eventually synchronous assumption.</p>
<details><summary><p>Not exam material</p>
</summary><div class="details-content">
<h2 id="mathematically-robust-distributed-systems">Mathematically robust distributed systems</h2>
<p>Some bugs in distributed systems can be very difficult to catch (catching them could involve long and costly simulation; with $n$ computers, it takes time $2^n$ to simulate all possible cases), and can be very costly when they happen.</p>
<p>The only way to be sure that there are no bugs is to <em>prove</em> it formally and mathematically.</p>
<h3 id="definition-of-the-distributed-system-graph">Definition of the distributed system graph</h3>
<p>Let $G(V, E)$ be a graph, where $V$ is the set of process nodes, and $E$ is the set of channel edges connecting the processes.</p>
<p>Two nodes $p$ and $q$ are <strong>neighbors</strong> if and only if there is an edge $\left\{ p, q \right\} \in E$.</p>
<p>Let $X \subseteq V$ be the set of <strong>crashed nodes</strong>. The other nodes are <strong>correct nodes</strong>.</p>
<p>We’ll define a <strong>path</strong> as a sequence of nodes $(p_1, p_2, \dots, p_n)$ such that $\forall i \in \left\{1, \dots, n-1\right\}$, $p_i$ and $p_{i+1}$ are neighbors.</p>
<p>Two nodes $p$ and $q$ are <strong>connected</strong> if we have a path $(p_1, p_2, \dots, p_n)$ such that $p_1 = p$ and $p_n = q$.</p>
<p>They are <strong>n-connected</strong> if there are $n$ disjoint paths connecting them; two paths $A = \left\{ p_1, \dots, p_a \right\}$ and $B = \left\{ q_1, \dots, q_b \right\}$ connecting $p$ and $q$ are disjoint if $A \cap B = \left\{ p, q \right\}$ (i.e. $p$ and $q$ are the only two nodes that the paths have in common).</p>
<p>The graph is <strong>k-connected</strong> if, $\forall \left\{ p, q \right\} \subseteq V$ there are $k$ disjoint paths between $p$ and $q$.</p>
<h3 id="example-on-a-simple-algorithm">Example on a simple algorithm</h3>
<p>Each node $p$ holds a message $m_p$ and a set $p.R$. The goal is for two nodes $p$ and $q$ to have $(p, m_p) \in q.R$ and $(q, m_q) \in p.R$; that is, they want to exchange messages, to <em>communicate reliably</em>. The algorithm is as follows:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">for</span> <span class="n">each</span> <span class="n">node</span> <span class="n">p</span><span class="p">:</span>
    <span class="n">initially</span><span class="p">:</span>
        <span class="n">send</span> <span class="p">(</span><span class="n">p</span><span class="p">,</span> <span class="n">m</span><span class="p">(</span><span class="n">p</span><span class="p">))</span> <span class="n">to</span> <span class="nb">all</span> <span class="n">neighbors</span>

    <span class="n">upon</span> <span class="n">reception</span> <span class="n">of</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">):</span>
        <span class="n">add</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> <span class="n">to</span> <span class="n">p</span><span class="o">.</span><span class="n">R</span>
        <span class="n">send</span> <span class="p">(</span><span class="n">v</span><span class="p">,</span> <span class="n">m</span><span class="p">)</span> <span class="n">to</span> <span class="nb">all</span> <span class="n">neighbors</span></pre></td></tr></tbody></table></code></pre></figure>
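<p>To experiment with this algorithm, here is a small Python simulation. It is a sketch under two assumptions that are not in the pseudocode: messages are processed one at a time from a queue, and a node forwards a pair only the first time it receives it (otherwise the simulation would never terminate on a graph with cycles). The function name <code class="highlighter-rouge">flood</code> and its parameters are made up for the example.</p>

```python
from collections import deque


def flood(neighbors, messages, crashed=frozenset()):
    """Run the flooding algorithm and return the set R of every correct node."""
    R = {p: set() for p in neighbors if p not in crashed}
    queue = deque()
    # Initially, every correct node sends (p, m_p) to all its neighbors.
    for p in R:
        for q in neighbors[p]:
            queue.append((q, (p, messages[p])))
    while queue:
        dest, item = queue.popleft()
        if dest in crashed or item in R[dest]:
            continue  # crashed nodes do nothing; known pairs are not forwarded again
        R[dest].add(item)
        for q in neighbors[dest]:
            queue.append((q, item))
    return R


# A path a - b - c: a and c are connected only through b.
neighbors = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
messages = {"a": "ma", "b": "mb", "c": "mc"}

print(("a", "ma") in flood(neighbors, messages)["c"])                 # True
print(("a", "ma") in flood(neighbors, messages, crashed={"b"})["c"])  # False
```

<p>The second call shows why connectivity matters: once the only intermediary node crashes, the two endpoints can no longer communicate reliably.</p>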
<h4 id="reliable-communication">Reliable communication</h4>
<p>Now, let’s prove that if two nodes $p$ and $q$ are connected, then they communicate reliably. We’ll do this by induction along a path $(p_1, \dots, p_n)$ with $p_1 = p$ and $p_n = q$; formally, we’d like to prove that the proposition $\mathcal{P}_k$, defined as “$p_k \text{ receives } (p, m_p)$”, is true for $k\in \left\{ 2, \dots, n \right\}$.</p>
<ul>
<li>
<p><strong>Base case</strong></p>
<p>According to the algorithm, $p=p_1$ initially sends $(p, m_p)$ to $p_2$. So $p_2$ receives $(p, m_p)$ from $p_1$, and $\mathcal{P}_2$ is true.</p>
</li>
<li>
<p><strong>Induction step</strong></p>
<p>Suppose that the induction hypothesis $\mathcal{P}_k$ is true for some $k \in \left\{2, \dots, n-1 \right\}$.</p>
<p>Then, according to the algorithm, $p_k$ sends $(p, m_p)$ to $p_{k+1}$, meaning that $p_{k+1}$ receives $(p, m_p)$ from $p_k$, which means that $\mathcal{P}_{k+1}$ is true.</p>
</li>
</ul>
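<p>Thus $\mathcal{P}_k$ is true for all $k \in \left\{2, \dots, n \right\}$; in particular, $q = p_n$ receives $(p, m_p)$. By symmetry, $p$ receives $(q, m_q)$.</p>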
<h3 id="robustness-property">Robustness property</h3>
<p>If at most $k$ nodes are crashed, and the graph is $(k+1)$-connected, then all correct nodes <strong>communicate reliably</strong>.</p>
<p>We prove this by contradiction. We want to prove $\mathcal{P}$, so let’s suppose that the opposite, $\bar{\mathcal{P}}$, is true: at most $k$ nodes are crashed and the graph is $(k+1)$-connected, but there are 2 correct nodes $p$ and $q$ that <em>do not</em> communicate reliably. Hopefully, doing so will lead us to a contradiction that allows us to assert $\mathcal{P}$.</p>
<p>As the graph is $(k+1)$-connected, there exist $k+1$ disjoint paths $(P_1, P_2, \dots, P_{k+1})$ connecting $p$ and $q$. We assumed that $p$ and $q$ do not communicate reliably, meaning that all paths between them are “cut” by at least one crashed node. As the paths are disjoint, this requires at least $k+1$ crashed nodes to cut them all.</p>
<p>This is a contradiction: we were working under the assumption that at most $k$ nodes were crashed, and showed that at least $k+1$ nodes must be crashed. This disproves $\bar{\mathcal{P}}$ and proves $\mathcal{P}$.</p>
<h3 id="random-failures">Random failures</h3>
<p>Let’s assume that $p$ and $q$ are connected by a single path of length 1, only separated by a single intermediary node. If each node has a probability $f$ of crashing, then the probability of communicating reliably is $1-f$.</p>
<p>Now, suppose that the path is of length $n$; the probability of communicating reliably is the probability of none of the nodes crashing; individually, that is $1-f$, so for the whole chain, the probability is $(1-f)^n$.</p>
<p>However, if we have $n$ paths of length 1 (that is, instead of setting them up serially like previously, we set them up in parallel), the probability of not communicating reliably is that of all intermediary nodes crashing, which is $f^n$; thus, the probability of actually communicating reliably is $1-f^n$.</p>
<p>If our nodes are connected by $n$ paths of length $m$, the probability of not communicating reliably is that of all lines being cut. The probability of a single line being cut is $1 - (1 - f)^m$, so the probability of all $n$ lines being cut is $(1 - (1 - f)^m)^n$. The probability of communicating reliably is one minus that, namely $1 - (1 - (1 - f)^m)^n$.</p>
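<p>These formulas are easy to play with numerically. The short Python sketch below simply evaluates them; the function names are made up for the example, and crashes are assumed to be independent, as in the text.</p>

```python
def serial(f, m):
    """One path of m intermediary nodes: every node must stay up."""
    return (1 - f) ** m


def parallel(f, n):
    """n paths of one node each: at least one node must stay up."""
    return 1 - f ** n


def serial_parallel(f, n, m):
    """n disjoint paths of m nodes each: at least one path must stay uncut."""
    return 1 - (1 - (1 - f) ** m) ** n


f = 0.1
print(serial(f, 3))              # about 0.729
print(parallel(f, 3))            # about 0.999
print(serial_parallel(f, 3, 3))  # about 0.980
```

<p>With a 10% crash probability, three nodes in series are noticeably less reliable than a single node, while three nodes in parallel are far more reliable; the general formula reduces to the two special cases when $n = 1$ or $m = 1$.</p>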
<h3 id="example-proof">Example proof</h3>
<p>Assume an infinite 2D grid of nodes. Nodes $p$ and $q$ are connected, with the distance in the shortest path being $D$. What is the probability of communicating reliably when this distance tends to infinity?</p>
<script type="math/tex; mode=display">\newcommand{\abs}[1]{\left\lvert#1\right\rvert}
\lim_{D \rightarrow \infty} = \dots</script>
<p>First, let’s define a sequence of grids $G_k$. $G_0$ is a single node, $G_{k+1}$ is built from 9 grids $G_k$.</p>
<p>$G_{k+1}$ is <strong>correct</strong> if at least 8 of its 9 grids are correct.</p>
<p>We’ll introduce the concept of a “meta-correct” node; this is not really anything official, just something we’re making up for the purpose of this proof. Consider a grid $G_n$. A node $p$ is “meta-correct” if:</p>
<ul>
<li>It is in a correct grid $G_n$, and</li>
<li>It is in a correct grid $G_{n-1}$, and</li>
<li>It is in a correct grid $G_{n-2}$, …</li>
</ul>
<p>For the sake of this proof, let’s just admit that all meta-correct nodes are connected; if you take two nodes $p$ and $q$ that are both meta-correct, there will be a path of nodes connecting them.</p>
<h4 id="step-1">Step 1</h4>
<p>If $x$ is the probability that $G_k$ is correct, what is the probability $P(x)$ that $G_{k+1}$ is correct?</p>
<p>$G_{k+1}$ is built up of 9 subgrids $G_k$. Let $P_i$ be the probability of exactly $i$ subgrids being incorrect; the probability of $G_{k+1}$ being correct is the probability of at most one subgrid being incorrect.</p>
<script type="math/tex; mode=display">% <![CDATA[
\begin{align}
P_0 & = x^9 \\
P_1 & = 9(1-x)x^8 \\
P(x) & = P_0 + P_1 = x^9 + 9(1-x)x^8 \\
\end{align} %]]></script>
<h4 id="step-2">Step 2</h4>
<p>Let $\alpha = 0.9$, and $z(x) = 1 + \alpha (x-1)$.</p>
<p>We will admit the following: if $x \in [0.99, 1]$ then $z(x) \le P(x)$.</p>
<p>Let $P_k$ be the result of applying $P$ (as defined in step 1) to $1-f$, $k$ times: $P_k = P(P(P(\dots P(1-f))))$. We will prove that $P_k \ge 1 - \alpha^k, \forall k \ge 0$, by induction:</p>
<ul>
<li><strong>Base case</strong>: taking $f = 0.01$, $P_0 = 1-f = 0.99$ and $1-\alpha^0 = 1-1 = 0$, so $P_0 \ge 1-\alpha^0$.</li>
<li>
<p><strong>Induction step</strong>:</p>
<p>Let’s suppose that $P_k \ge 1-\alpha^k$. We want to prove this for $k+1$, namely $P_{k+1} \ge 1 - \alpha^{k+1}$.</p>
<script type="math/tex; mode=display">P_{k+1} = P(P_k) \ge z(P_k) \ge z(1 - \alpha^k) \\
P_{k+1} \ge 1 + \alpha(1 - \alpha^k - 1) \\
P_{k+1} \ge 1 - \alpha^{k+1}</script>
</li>
</ul>
<p>This proves the result that $\forall k, P_k \ge 1 - \alpha^k$.</p>
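<p>As a sanity check (not a proof), the bound can be verified numerically. The sketch below assumes $f = 0.01$ and $\alpha = 0.9$ as in the text, and iterates $P$ for 100 steps; the variable names are illustrative.</p>

```python
def P(x):
    """Probability that G_{k+1} is correct when each subgrid is correct with probability x."""
    return x**9 + 9 * (1 - x) * x**8


def z(x, alpha=0.9):
    """The linear lower bound used in the induction."""
    return 1 + alpha * (x - 1)


alpha, f = 0.9, 0.01
p_k = 1 - f          # P_0
bound_holds = True
for k in range(100):
    # Check P_k >= 1 - alpha^k, then move on to P_{k+1} = P(P_k).
    bound_holds = bound_holds and p_k >= 1 - alpha**k
    p_k = P(p_k)
print(bound_holds)   # True
```

<p>In practice $P_k$ approaches 1 much faster than $1 - \alpha^k$ does; the linear bound $z$ is only there to make the induction easy to write down.</p>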
<h4 id="step-3">Step 3</h4>
<p>Todo.</p>
</div></details>
<h2 id="reliable-broadcast">Reliable broadcast</h2>
<p>Broadcast is useful for applications with pubsub-like mechanisms, where the subscribers might need some reliability guarantees from the publisher (sometimes called quality of service, or QoS).</p>
<h3 id="best-effort-broadcast">Best-effort broadcast</h3>
<p>Best-effort broadcast (beb) has the following properties:</p>
<ul>
<li><strong>BEB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>BEB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>BEB3 No creation</strong>: no message is delivered unless it was broadcast</li>
</ul>
<p>The broadcasting process may still crash in the middle of a broadcast, before it has sent the message to everyone. Best-effort broadcast offers no guarantee in that case.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">PerfectLinks</span> <span class="p">(</span><span class="n">pp2p</span><span class="p">)</span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">forall</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span><span class="p">,</span> <span class="n">the</span> <span class="nb">set</span> <span class="n">of</span> <span class="nb">all</span> <span class="n">nodes</span> <span class="ow">in</span> <span class="n">the</span> <span class="n">system</span><span class="p">,</span> <span class="n">do</span><span class="p">:</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">pp2pSend</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">pp2pDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>This is not the most efficient algorithm, but we’re not concerned about that. We just care about whether it’s correct, which we’ll sketch out a proof for:</p>
<ul>
<li><strong>Validity</strong>: by the validity property of perfect links and the facts that:
<ul>
<li>the sender sends the message to all</li>
<li>every correct process that <code class="highlighter-rouge">pp2pDelivers</code> a message also <code class="highlighter-rouge">bebDelivers</code> it</li>
</ul>
</li>
<li><strong>No duplication</strong>: by the no duplication property of perfect links</li>
<li><strong>No creation</strong>: by the no creation property of perfect links</li>
</ul>
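<p>To see what best-effort broadcast does and does not guarantee, here is a tiny Python sketch. It assumes perfect links (each send results in exactly one delivery) and models a crash of the sender with a hypothetical <code class="highlighter-rouge">crash_after</code> parameter, which is not part of the algorithm.</p>

```python
def beb_broadcast(sender, m, processes, crash_after=None):
    """Send m to every process in turn; optionally crash after `crash_after` sends.

    Returns a dict mapping each process that delivered to the (sender, m) pair.
    """
    delivered = {}
    for i, p in enumerate(processes):
        if crash_after is not None and i >= crash_after:
            break  # the sender crashed in the middle of the broadcast
        delivered[p] = (sender, m)  # perfect link: exactly one delivery
    return delivered


processes = ["p1", "p2", "p3"]
print(sorted(beb_broadcast("p1", "m", processes)))                 # ['p1', 'p2', 'p3']
print(sorted(beb_broadcast("p1", "m", processes, crash_after=2)))  # ['p1', 'p2']
```

<p>In the second call, <code class="highlighter-rouge">p2</code> delivers the message but <code class="highlighter-rouge">p3</code> never does, even though both are correct. No property of best-effort broadcast is violated, since validity only applies when the sender is correct.</p>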
<h3 id="reliable-broadcast-1">Reliable broadcast</h3>
<p>Reliable broadcast has the following properties:</p>
<ul>
<li><strong>RB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>RB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>RB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>RB4 Agreement</strong>: for any message $m$, if a <strong>correct</strong> process delivers $m$, then every correct process delivers $m$</li>
</ul>
<p>Notice that RB has the same properties as best-effort, but also adds a guarantee RB4: even if the broadcaster crashes in the middle of a broadcast and is unable to send to other processes, we’ll honor the agreement property. This is done by distinguishing receiving and delivering; the broadcaster may not have sent to everyone, but in that case, reliable broadcast makes sure that no one delivers.</p>
<p>Note that a process may still deliver and crash before others deliver; it is then incorrect, and we have no guarantees that the message will be delivered to others.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span>
    <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
    <span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span>
    <span class="n">forall</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span> <span class="n">do</span><span class="p">:</span>
        <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span> <span class="c1"># application tells us to broadcast</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="c1"># deliver to itself</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="c1"># broadcast to others using beb</span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
        <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
        <span class="k">if</span> <span class="n">pi</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">correct</span><span class="p">:</span> <span class="c1"># echo if sender not in correct</span>
            <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>
        <span class="k">else</span><span class="p">:</span>
            <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span>

<span class="n">Upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
    <span class="n">forall</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span> <span class="ow">in</span> <span class="k">from</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="n">do</span><span class="p">:</span> <span class="c1"># echo all previous messages from crashed pi</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>The idea is to echo all messages from a node that has crashed. By the time we get the crash notification from the oracle, we may already have received messages from the crashed node without knowing that it had crashed. This is because the failure detector only reports a crash eventually, so the notification may arrive some time after the crash itself. To handle this, we also echo all the old messages received from that node.</p>
<p>We’ll sketch a proof for the properties:</p>
<ul>
<li><strong>Validity</strong>: as above</li>
<li><strong>No duplication</strong>: as above</li>
<li><strong>No creation</strong>: as above</li>
<li><strong>Agreement</strong>: Assume some correct process $p_i$ <code class="highlighter-rouge">rbDelivers</code> a message $m$ that was broadcast through <code class="highlighter-rouge">rbBroadcast</code> by some process $p_k$. If $p_k$ is correct, then by the validity property of best-effort broadcast, all correct processes will get the message through <code class="highlighter-rouge">bebDeliver</code>, and then deliver $m$ through <code class="highlighter-rouge">rbDeliver</code>. If $p_k$ crashes, then by the completeness property of the failure detector $P$, $p_i$ detects the crash and broadcasts $m$ with <code class="highlighter-rouge">bebBroadcast</code> to all. Since $p_i$ is correct, by the validity property of best-effort broadcast, all correct processes <code class="highlighter-rouge">bebDeliver</code> and then <code class="highlighter-rouge">rbDeliver</code> $m$.</li>
</ul>
<p>Note that the proof only uses the completeness property of the failure detector, not its accuracy. Therefore, the failure detector can be either perfect or eventually perfect.</p>
<h3 id="uniform-reliable-broadcast">Uniform reliable broadcast</h3>
<p>Uniform broadcast satisfies the following properties:</p>
<ul>
<li><strong>URB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>URB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>URB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>URB4 Uniform agreement</strong>: for any message $m$, if a process delivers $m$, then every correct process delivers $m$</li>
</ul>
<p>We’ve removed the word “correct” from the agreement property, and this changes everything. This is the strongest guarantee: if any process delivers a message, even a process that crashes right afterwards, then every correct process delivers it too.</p>
<p>The algorithm is given by:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">uniformBroadcast</span> <span class="p">(</span><span class="n">urb</span><span class="p">)</span><span class="o">.</span>
<span class="n">Uses</span><span class="p">:</span>
    <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span><span class="o">.</span>
    <span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span><span class="o">.</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span> <span class="c1"># set of correct nodes, initialized to all nodes
</span>    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># set of delivered and already forwarded messages
</span>    <span class="n">ack</span><span class="p">[</span><span class="n">Message</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># set of nodes that have acknowledged Message
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>

<span class="c1"># before broadcasting, save message in forward
</span><span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">urbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="n">U</span> <span class="p">{[</span><span class="bp">self</span><span class="p">,</span><span class="n">m</span><span class="p">]}</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="bp">self</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>

<span class="c1"># if I haven't sent the message, echo it
# if I've already sent it, don't do it again
</span><span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
    <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
    <span class="k">if</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">forward</span><span class="p">:</span>
        <span class="n">forward</span> <span class="p">:</span><span class="o">=</span> <span class="n">forward</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]};</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>

<span class="c1"># deliver the message when we know that all correct processes have seen it
# (and if we haven't delivered already)
</span><span class="n">upon</span> <span class="n">event</span> <span class="p">(</span><span class="k">for</span> <span class="nb">any</span> <span class="p">[</span><span class="n">pj</span><span class="p">,</span><span class="n">m</span><span class="p">]</span> <span class="ow">in</span> <span class="n">forward</span><span class="p">)</span> <span class="n">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">)</span> <span class="ow">and</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">urbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>

<span class="k">def</span> <span class="nf">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">correct</span> <span class="err">⊆</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span></pre></td></tr></tbody></table></code></pre></figure>
<p>To prove correctness, we first need a simple lemma: if a correct process $p_i$ <code class="highlighter-rouge">bebDeliver</code>s a message $m$, then $p_i$ eventually <code class="highlighter-rouge">urbDeliver</code>s the message $m$.</p>
<p>This can be proven as follows: any process that <code class="highlighter-rouge">bebDeliver</code>s $m$ <code class="highlighter-rouge">bebBroadcast</code>s $m$. By the completeness property of the failure detector $P$, and the validity property of best-effort broadcasting, there is a time at which $p_i$ <code class="highlighter-rouge">bebDeliver</code>s $m$ from every correct process and hence <code class="highlighter-rouge">urbDeliver</code>s it.</p>
<p>The proof is then:</p>
<ul>
<li><strong>Validity</strong>: If a correct process $p_i$ <code class="highlighter-rouge">urbBroadcast</code>s a message $m$, then $p_i$ eventually <code class="highlighter-rouge">bebBroadcast</code>s and <code class="highlighter-rouge">bebDeliver</code>s $m$. By our lemma, $p_i$ <code class="highlighter-rouge">urbDeliver</code>s it.</li>
<li><strong>No duplication</strong>: as best-effort</li>
<li><strong>No creation</strong>: as best-effort</li>
<li><strong>Uniform agreement</strong>: Assume some process $p_i$ <code class="highlighter-rouge">urbDeliver</code>s a message $m$. By the algorithm and the completeness <em>and</em> accuracy properties of the failure detector, every correct process <code class="highlighter-rouge">bebDeliver</code>s $m$. By our lemma, every correct process will <code class="highlighter-rouge">urbDeliver</code> $m$.</li>
</ul>
<p>Unlike previous algorithms, this relies on perfect failure detection. But under the assumption that a majority of processes stay correct, we can make do with an eventually perfect failure detector. To do so, we remove the crash event above, and replace the <code class="highlighter-rouge">can_deliver</code> method with the following:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="nf">can_deliver</span><span class="p">(</span><span class="n">m</span><span class="p">):</span>
    <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">])</span> <span class="o">></span> <span class="n">N</span><span class="o">/</span><span class="mi">2</span></pre></td></tr></tbody></table></code></pre></figure>
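<p>As a rough illustration, the two delivery conditions can be written as plain Python predicates. This is only a sketch: <code class="highlighter-rouge">can_deliver_perfect</code>, <code class="highlighter-rouge">can_deliver_majority</code> and their explicit arguments are hypothetical names chosen for this example, whereas the algorithm above keeps <code class="highlighter-rouge">correct</code> and <code class="highlighter-rouge">ack</code> as process state.</p>

```python
def can_deliver_perfect(correct: set, acks: set) -> bool:
    """Perfect failure detector variant: every process still believed
    correct has acknowledged (echoed) the message."""
    return correct.issubset(acks)


def can_deliver_majority(acks: set, n: int) -> bool:
    """Majority variant: strictly more than half of the n processes
    have acknowledged the message."""
    return 2 * len(acks) > n


# Five processes; p4 and p5 have been reported as crashed.
correct = {"p1", "p2", "p3"}
acks = {"p1", "p2", "p3"}
print(can_deliver_perfect(correct, acks))       # all correct processes acked
print(can_deliver_majority(acks, n=5))          # 3 of 5 is a majority
print(can_deliver_majority({"p1", "p2"}, n=4))  # 2 of 4 is not
```

<p>The majority variant works because, if a majority of processes are correct, any set of more than $N/2$ acknowledgements contains at least one correct process, and that process has echoed the message to everyone.</p>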
<h2 id="causal-order-broadcast">Causal order broadcast</h2>
<h3 id="motivation">Motivation</h3>
<p>So far, we didn’t consider ordering among messages. In particular, we considered messages to be independent. Two messages from the same process might not be delivered in the order they were broadcast.</p>
<h3 id="causality">Causality</h3>
<p>The above means that <strong>causality</strong> is broken: a message $m_1$ that causes $m_2$ might be delivered by some process after $m_2$.</p>
<p>Let $m_1$ and $m_2$ be any two messages. $m_1\longrightarrow m_2$ ($m_1$ <strong>causally precedes</strong> $m_2$) if and only if at least one of the following holds:</p>
<ul>
<li><strong>C1 (FIFO Order)</strong>: Some process $p_i$ broadcasts $m_1$ before broadcasting $m_2$</li>
<li><strong>C2 (Causal Order)</strong>: Some process $p_i$ delivers $m_1$ and then broadcasts $m_2$</li>
<li><strong>C3 (Transitivity)</strong>: There is a message $m_3$ such that $m_1 \longrightarrow m_3$ and $m_3 \longrightarrow m_2$.</li>
</ul>
<p>The <strong>causal order property (CO)</strong> is given by the following: if any process $p_i$ delivers a message $m_2$, then $p_i$ must have delivered every message $m_1$ such that $m_1 \longrightarrow m_2$.</p>
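<p>To make the definition concrete, here is a small Python sketch that builds the relation from its direct edges and checks the causal order property on one delivery sequence. The function names and the representation of $\longrightarrow$ as a set of pairs are assumptions made for this example only.</p>

```python
from itertools import product


def causal_closure(direct: set) -> set:
    """Transitive closure (rule C3) of the direct edges given by C1 and C2."""
    closure = set(direct)
    changed = True
    while changed:
        changed = False
        for (a, b), (c, d) in product(list(closure), repeat=2):
            if b == c and (a, d) not in closure:
                closure.add((a, d))
                changed = True
    return closure


def respects_causal_order(deliveries: list, precedes: set) -> bool:
    """Check CO for one process: every causal predecessor of a delivered
    message must appear earlier in that process's delivery sequence."""
    seen = set()
    for m in deliveries:
        for (a, b) in precedes:
            if b == m and a not in seen:
                return False
        seen.add(m)
    return True


# m1 -> m2 by C1 (same sender), m2 -> m3 by C2 (delivered, then broadcast).
precedes = causal_closure({("m1", "m2"), ("m2", "m3")})
print(("m1", "m3") in precedes)                             # pair added by C3
print(respects_causal_order(["m1", "m2", "m3"], precedes))  # causal order kept
print(respects_causal_order(["m2", "m1", "m3"], precedes))  # m2 before m1
```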
<h3 id="algorithm">Algorithm</h3>
<p>We get reliable causal broadcast by building on reliable broadcast, and uniform causal broadcast by building on uniform reliable broadcast.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableCausalOrderBroadcast</span> <span class="p">(</span><span class="n">rco</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rcoBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">past</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>
    <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span> <span class="n">pastm</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span> <span class="n">do</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
        <span class="k">for</span> <span class="p">[</span><span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="p">]</span> <span class="ow">in</span> <span class="n">pastm</span><span class="p">:</span>
            <span class="k">if</span> <span class="n">n</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
                <span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="o">></span>
                <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">n</span><span class="p">}</span>
                <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="n">sn</span><span class="p">,</span> <span class="n">n</span><span class="p">]}</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
        <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">m</span><span class="p">}</span>
        <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="n">U</span> <span class="p">{[</span><span class="n">pi</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>This algorithm ensures causal reliable broadcast. The idea is to re-broadcast all past messages every time, making sure we don’t deliver twice. This is obviously not efficient, but it works in theory.</p>
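<p>The no-waiting idea can be tried out with a small Python simulation. This is a sketch under simplifying assumptions: the class <code class="highlighter-rouge">NoWaitingCausalProcess</code> is hypothetical, messages are unique strings, <code class="highlighter-rouge">past</code> is kept as an ordered list, and the reliable broadcast layer is replaced by calling <code class="highlighter-rouge">rb_deliver</code> by hand.</p>

```python
class NoWaitingCausalProcess:
    """One process of the no-waiting algorithm; the network is simulated by
    calling rb_deliver() by hand (a hypothetical stand-in for rbDeliver)."""

    def __init__(self, name: str):
        self.name = name
        self.delivered = []   # messages in rcoDeliver order
        self.past = []        # (sender, message) pairs, in causal order

    def rco_broadcast(self, m: str) -> tuple:
        packet = (self.name, list(self.past), m)   # ship a copy of the past
        self.past.append((self.name, m))
        return packet

    def rb_deliver(self, packet: tuple) -> None:
        sender, past_m, m = packet
        if m in self.delivered:
            return
        # First deliver everything that causally precedes m, then m itself.
        for s, n in past_m + [(sender, m)]:
            if n not in self.delivered:
                self.delivered.append(n)
                if (s, n) not in self.past:
                    self.past.append((s, n))


p1, p2, p3 = (NoWaitingCausalProcess(n) for n in ("p1", "p2", "p3"))
pkt1 = p1.rco_broadcast("m1")
p2.rb_deliver(pkt1)
pkt2 = p2.rco_broadcast("m2")   # m1 -> m2, since p2 delivered m1 first
p3.rb_deliver(pkt2)             # m2 reaches p3 before m1 does
p3.rb_deliver(pkt1)             # the late copy of m1 is ignored
print(p3.delivered)
```

<p>Even though $m_2$ reaches <code class="highlighter-rouge">p3</code> first, $m_1$ travels inside its past and is delivered before it, so no process ever has to wait.</p>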
<p>To improve this, we can implement a form of garbage collection. We can remove a message from <code class="highlighter-rouge">past</code> only once all other correct processes have delivered it. To do this, we need a perfect failure detector.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span> <span class="n">GarbageCollection</span> <span class="o">+</span> <span class="n">previous</span> <span class="n">algorithm</span>
<span class="n">Uses</span><span class="p">:</span>
    <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
    <span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span> <span class="c1"># set of all nodes
</span>    <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># forall m
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span><span class="p">:</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>

<span class="n">upon</span> <span class="k">for</span> <span class="n">some</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">,</span> <span class="bp">self</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]:</span>
    <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="bp">self</span><span class="p">}</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">ACK</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="p">[</span><span class="n">ACK</span><span class="p">,</span> <span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
    <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
    <span class="k">if</span> <span class="n">correct</span><span class="o">.</span><span class="n">forall</span><span class="p">(</span><span class="k">lambda</span> <span class="n">pj</span><span class="p">:</span> <span class="n">pj</span> <span class="ow">in</span> <span class="n">ack</span><span class="p">[</span><span class="n">m</span><span class="p">]):</span> <span class="c1"># if all correct in ack
</span>        <span class="n">past</span> <span class="p">:</span><span class="o">=</span> <span class="n">past</span> \ <span class="p">{[</span><span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="p">]}</span> <span class="c1"># remove message from past</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We need the perfect failure detector’s strong accuracy property to prove the causal order property. We don’t need the failure detector’s completeness property; if we don’t know that a process has crashed, it has no impact on correctness, only on performance, since it just means that we won’t delete the past.</p>
<p>Another algorithm is given below. It uses a <a href="https://en.wikipedia.org/wiki/Vector_clock">“vector clock” VC</a> as an alternative, more efficient encoding of the past. A VC is updated under the following rules:</p>
<ul>
<li>Initially all clocks are zero.</li>
<li>Each time a process sends a message, it increments its own logical clock in the vector by one and then sends a copy of its own vector.</li>
<li>Each time a process receives a message, it increments its own logical clock in the vector by one and updates each element in its vector by taking the maximum of the value in its own vector clock and the value in the vector in the received message (for every element).</li>
</ul>
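<p>The three update rules can be sketched in a few lines of Python. The class <code class="highlighter-rouge">VectorClock</code> and its method names are hypothetical, chosen for this illustration; the clock is stored as a dictionary from process name to counter.</p>

```python
class VectorClock:
    """Vector clock following the three update rules listed above."""

    def __init__(self, owner: str, processes: list):
        self.owner = owner
        self.clock = {p: 0 for p in processes}   # rule 1: start at zero

    def send(self) -> dict:
        self.clock[self.owner] += 1              # rule 2: tick, then copy
        return dict(self.clock)

    def receive(self, other: dict) -> None:
        self.clock[self.owner] += 1              # rule 3: tick ...
        for p, value in other.items():           # ... then element-wise max
            self.clock[p] = max(self.clock[p], value)


procs = ["p1", "p2"]
a, b = VectorClock("p1", procs), VectorClock("p2", procs)
stamp = a.send()     # a ticks its own entry and sends a copy
b.receive(stamp)     # b ticks its own entry, then merges the stamp
print(a.clock, b.clock)
```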
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">ReliableCausalOrderBroadcast</span> <span class="p">(</span><span class="n">rco</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span> <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
    <span class="k">for</span> <span class="nb">all</span> <span class="n">pi</span> <span class="ow">in</span> <span class="n">S</span><span class="p">:</span>
        <span class="n">VC</span><span class="p">[</span><span class="n">pi</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="mi">0</span>
    <span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rcoBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="bp">self</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VC</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span>
    <span class="n">VC</span><span class="p">[</span><span class="bp">self</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">VC</span><span class="p">[</span><span class="bp">self</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span><span class="p">;</span> <span class="c1"># we have seen the message, so increment VC
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">pj</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">]</span><span class="o">></span><span class="p">:</span>
    <span class="k">if</span> <span class="n">pj</span> <span class="o">!=</span> <span class="bp">self</span><span class="p">:</span>
        <span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="n">pending</span> <span class="n">U</span> <span class="p">(</span><span class="n">pj</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span>
        <span class="n">deliver</span><span class="o">-</span><span class="n">pending</span><span class="p">()</span>

<span class="k">def</span> <span class="nf">deliver</span><span class="o">-</span><span class="n">pending</span><span class="p">():</span>
    <span class="c1"># deliver every pending message whose causal past has already been delivered
</span>    <span class="k">while</span> <span class="n">exists</span> <span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span> <span class="ow">in</span> <span class="n">pending</span> <span class="n">such</span> <span class="n">that</span> <span class="n">forall</span> <span class="n">pk</span><span class="p">:</span> <span class="p">(</span><span class="n">VC</span><span class="p">[</span><span class="n">pk</span><span class="p">]</span> <span class="o">>=</span> <span class="n">VCm</span><span class="p">[</span><span class="n">pk</span><span class="p">]):</span>
        <span class="n">pending</span> <span class="p">:</span><span class="o">=</span> <span class="n">pending</span> \ <span class="p">(</span><span class="n">s</span><span class="p">,</span> <span class="p">[</span><span class="n">Data</span><span class="p">,</span><span class="n">VCm</span><span class="p">,</span><span class="n">m</span><span class="p">])</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">rcoDeliver</span><span class="p">,</span> <span class="n">s</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
        <span class="n">VC</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="p">:</span><span class="o">=</span> <span class="n">VC</span><span class="p">[</span><span class="n">s</span><span class="p">]</span> <span class="o">+</span> <span class="mi">1</span></pre></td></tr></tbody></table></code></pre></figure>
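<p>Here is a runnable sketch of the waiting idea: a message stays in <code class="highlighter-rouge">pending</code> until the local vector clock is at least as large as the message’s timestamp in every component. The class <code class="highlighter-rouge">WaitingCausalProcess</code> is hypothetical, and the reliable broadcast layer is again replaced by calling <code class="highlighter-rouge">rb_deliver</code> by hand, so this only illustrates the delivery rule.</p>

```python
class WaitingCausalProcess:
    """One process of the waiting algorithm. vc[p] counts the messages from
    p that this process has rcoDelivered; rb_deliver() plays the role of
    the rbDeliver event and is called by hand in this simulation."""

    def __init__(self, name: str, processes: list):
        self.name = name
        self.vc = {p: 0 for p in processes}
        self.pending = []      # (sender, vc_m, m) not yet deliverable
        self.delivered = []    # (sender, m) in rcoDeliver order

    def rco_broadcast(self, m: str) -> tuple:
        self.delivered.append((self.name, m))       # deliver to ourselves
        packet = (self.name, dict(self.vc), m)      # stamp with current VC
        self.vc[self.name] += 1
        return packet

    def rb_deliver(self, packet: tuple) -> None:
        if packet[0] != self.name:
            self.pending.append(packet)
            self._deliver_pending()

    def _deliver_pending(self) -> None:
        progress = True
        while progress:
            progress = False
            for packet in list(self.pending):
                sender, vc_m, m = packet
                # Deliverable once we have delivered all the sender had.
                if all(self.vc[p] >= vc_m[p] for p in self.vc):
                    self.pending.remove(packet)
                    self.delivered.append((sender, m))
                    self.vc[sender] += 1
                    progress = True


names = ["p1", "p2", "p3"]
p1, p2, p3 = (WaitingCausalProcess(n, names) for n in names)
pkt1 = p1.rco_broadcast("m1")
p2.rb_deliver(pkt1)
pkt2 = p2.rco_broadcast("m2")   # stamped with VC[p1] = 1
p3.rb_deliver(pkt2)             # held back: p3 has not delivered m1 yet
held_back = list(p3.pending)
p3.rb_deliver(pkt1)             # unblocks m1, then m2
print(p3.delivered)
```

<p>Compared to the no-waiting algorithm, each message only carries one counter per process instead of the whole past, at the price of sometimes holding a message back.</p>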
<h2 id="total-order-broadcast">Total order broadcast</h2>
<p>In <a href="#reliable-broadcast">reliable broadcast</a>, the processes are free to deliver in any order they wish. In <a href="#causal-order-broadcast">causal broadcast</a>, the processes must deliver in causal order. But causal order is only partial: messages that are not causally related may be delivered in different orders by different processes.</p>
<p>In <strong>total order</strong> broadcast, the processes must deliver all messages according to the same order. Note that this is orthogonal to causality, or even FIFO ordering. It can be <em>made</em> to respect causal or FIFO ordering, but at its core, it is only concerned with all processes delivering in the same order.</p>
<p>An application using total order broadcast would be Bitcoin; for the blockchain, we want to make sure that everybody gets messages in the same order, for consistency.</p>
<p>The properties are:</p>
<ul>
<li><strong>RB1 Validity</strong>: if $p_i$ and $p_j$ are correct then every message broadcast by $p_i$ is eventually delivered by $p_j$</li>
<li><strong>RB2 No duplication</strong>: no message is delivered more than once</li>
<li><strong>RB3 No creation</strong>: no message is delivered unless it was broadcast</li>
<li><strong>RB4 Agreement</strong>: for any message $m$, if a <strong>correct</strong> process delivers $m$, then every correct process delivers $m$</li>
<li><strong>TO1 (Uniform) Total Order</strong>: Let $m$ and $m'$ be any two messages. Let $p_i$ be any (correct) process that delivers $m$ without having delivered $m'$ before. Then no (correct) process delivers $m'$ before $m$</li>
</ul>
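<p>A simplified check of the total order property on a finished run can be sketched as follows. It only compares the messages that two processes have both delivered, which is an assumption of this example and slightly weaker than TO1 as stated; the function names are hypothetical.</p>

```python
from itertools import combinations


def same_relative_order(seq_a: list, seq_b: list) -> bool:
    """True if the messages delivered by both processes appear in the
    same relative order in the two delivery sequences."""
    common = set(seq_a).intersection(seq_b)
    return [m for m in seq_a if m in common] == [m for m in seq_b if m in common]


def respects_total_order(deliveries: dict) -> bool:
    """Pairwise check over the delivery sequences of all processes."""
    return all(
        same_relative_order(a, b)
        for a, b in combinations(deliveries.values(), 2)
    )


# Causally unrelated m1 and m2: fine for causal broadcast, not for total order.
print(respects_total_order({"p1": ["m1", "m2"], "p2": ["m2", "m1"]}))
# p2 is simply lagging behind p1, which does not violate the order.
print(respects_total_order({"p1": ["m1", "m2", "m3"], "p2": ["m1", "m2"]}))
```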
<p>The algorithm can be implemented as:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Implements</span><span class="p">:</span> <span class="n">TotalOrder</span> <span class="p">(</span><span class="n">to</span><span class="p">)</span>
<span class="n">Uses</span><span class="p">:</span>
    <span class="n">ReliableBroadcast</span> <span class="p">(</span><span class="n">rb</span><span class="p">)</span>
    <span class="n">Consensus</span> <span class="p">(</span><span class="n">cons</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">init</span><span class="o">></span><span class="p">:</span>
    <span class="n">unordered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span> <span class="c1"># two sets
</span>    <span class="n">wait</span> <span class="p">:</span><span class="o">=</span> <span class="bp">False</span>
    <span class="n">sn</span> <span class="p">:</span><span class="o">=</span> <span class="mi">1</span> <span class="c1"># sequence number
</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">toBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">rbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">rbDeliver</span><span class="p">,</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="ow">and</span> <span class="n">m</span> <span class="ow">not</span> <span class="ow">in</span> <span class="n">delivered</span><span class="p">:</span>
    <span class="n">unordered</span><span class="o">.</span><span class="n">add</span><span class="p">((</span><span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="p">))</span>

<span class="n">upon</span> <span class="n">unordered</span> <span class="ow">not</span> <span class="n">empty</span> <span class="ow">and</span> <span class="ow">not</span> <span class="n">wait</span><span class="p">:</span>
    <span class="n">wait</span> <span class="p">:</span><span class="o">=</span> <span class="bp">True</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">propose</span><span class="p">,</span> <span class="n">unordered</span><span class="o">></span> <span class="k">with</span> <span class="n">sn</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">decide</span><span class="p">,</span> <span class="n">decided</span><span class="o">></span> <span class="k">with</span> <span class="n">sn</span><span class="p">:</span>
    <span class="n">unordered</span><span class="o">.</span><span class="n">remove</span><span class="p">(</span><span class="n">decided</span><span class="p">)</span>
    <span class="n">ordered</span> <span class="o">=</span> <span class="n">sort</span><span class="p">(</span><span class="n">decided</span><span class="p">)</span>
    <span class="k">for</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span> <span class="ow">in</span> <span class="n">ordered</span><span class="p">:</span>
        <span class="n">trigger</span> <span class="o"><</span><span class="n">toDeliver</span><span class="p">,</span> <span class="n">sm</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>
        <span class="n">delivered</span><span class="o">.</span><span class="n">add</span><span class="p">(</span><span class="n">m</span><span class="p">)</span>
    <span class="n">sn</span> <span class="o">+=</span> <span class="mi">1</span>
    <span class="n">wait</span> <span class="o">=</span> <span class="bp">False</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Our total order broadcast is based on consensus, which we describe below.</p>
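<p>To illustrate the last handler, here is a minimal Python sketch of what happens once a batch has been decided. The name <code class="highlighter-rouge">deliver_batch</code> and the use of Python’s built-in <code class="highlighter-rouge">sorted</code> as the deterministic order are assumptions made for this example; all that matters is that every process applies the same ordering rule to the decided batch.</p>

```python
def deliver_batch(decided, delivered):
    """Return the messages of a decided batch in delivery order.

    decided: set of (sender, message) pairs chosen by consensus.
    delivered: set of messages already delivered (updated in place).
    """
    order = []
    # Every process sorts the same decided set the same way,
    # so every process obtains the same delivery order.
    for sender, message in sorted(decided):
        if message not in delivered:
            delivered.add(message)
            order.append((sender, message))
    return order


# Two processes that decided the same batch deliver in the same order,
# no matter in which order they rb-delivered the messages.
batch_seen_by_p = {(2, "b"), (1, "a"), (3, "c")}
batch_seen_by_q = {(3, "c"), (2, "b"), (1, "a")}
order_p = deliver_batch(batch_seen_by_p, set())
order_q = deliver_batch(batch_seen_by_q, set())
```

<p>Since consensus hands the same set to everyone, sorting it locally is enough to get a common order without any further communication.</p>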
<h2 id="consensus">Consensus</h2>
<p>In the (uniform) consensus problem, the processes all propose values, and need to agree on one of these. This gives rise to two basic events: a proposition, and a decision. Solving consensus is key to solving many problems in distributed computing (total order broadcast, atomic commit, …).</p>
<p>The properties that we would like to see are:</p>
<ul>
<li><strong>C1 Validity</strong>: if a value is decided, it has been proposed</li>
<li><strong>C2 (Uniform) Agreement</strong>: no two correct processes decide differently (in the uniform variant: no two processes, correct or not, decide differently)</li>
<li><strong>C3 Termination</strong>: every correct process eventually decides</li>
<li><strong>C4 Integrity</strong>: Every process decides at most once</li>
</ul>
<p>If C2 is Uniform Agreement, then we talk about uniform consensus.</p>
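<p>As a rough way to make these properties concrete, the sketch below checks them on a finished execution. The function <code class="highlighter-rouge">check_consensus</code> and its dictionary-based trace format are hypothetical, chosen only for this example; C4 is implied by the format, since a dictionary stores at most one decision per process.</p>

```python
def check_consensus(proposals, decisions, correct):
    """Check C1 to C3 on a finished execution.

    proposals: dict process -> proposed value.
    decisions: dict process -> decided value (absent if it never decided).
    correct: set of processes that never crashed.
    """
    validity = all(v in proposals.values() for v in decisions.values())
    agreement = len({decisions[p] for p in correct if p in decisions}) <= 1
    uniform_agreement = len(set(decisions.values())) <= 1
    termination = all(p in decisions for p in correct)
    return {
        "validity": validity,
        "agreement": agreement,
        "uniform_agreement": uniform_agreement,
        "termination": termination,
    }


# Process 1 decided "a" and then crashed; the others decided "b".
result = check_consensus(
    proposals={1: "a", 2: "b", 3: "c"},
    decisions={1: "a", 2: "b", 3: "b"},
    correct={2, 3},
)
```

<p>In this execution, agreement holds among the correct processes, but uniform agreement does not, because the crashed process decided a different value before crashing.</p>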
<p>Todo: write about consensus and fairness, does it violate validity?</p>
<p>We can build consensus using total order broadcast, which is described above. Conversely, total order broadcast can be built with consensus. It turns out that <strong>consensus and total order broadcast are equivalent problems in a system with reliable channels</strong>.</p>
<p>Blockchain is based on consensus. Bitcoin mining is actually about solving consensus: a leader is chosen to decide on the broadcast order, and this leader gains 50 bitcoin. Seeing that this is a lot of money, many people want to be the leader; but we only want a single leader. Nakamoto’s solution is to choose the leader by giving out a hard problem. The computation can only be done by brute force; there are no smart tricks. So people put <a href="https://digiconomist.net/bitcoin-energy-consumption">enormous amounts of energy</a> towards solving this. Usually, only a single miner will win a given block; the probability of two simultaneous winners is small, but the <a href="https://bitcoin.org/bitcoin.pdf">original Bitcoin paper</a> specifies that we should wait a little before rewarding the winner, in case there are two winners.</p>
<h3 id="consensus-algorithm">Consensus algorithm</h3>
<p>Suppose that there are $n$ processes. At the beginning, every process proposes a value; to decide, the processes go through $n$ rounds incrementally. At each round, the process with the id corresponding to the round number is the leader of the round. Note that the rounds are not global time; we may make them so in examples for the sake of simplicity, but rounds are simply a local thing, which are somewhat synchronized by message passing from the leader.</p>
<p>The leader decides its current proposal and broadcasts it to all. A process that is not the leader waits. It can either deliver the proposal of the leader to adopt it, or suspect the leader. In any case, we can move on to the next round at that moment. Note that processes don’t need to move on at the same time, they can do so at different moments.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">todo</span></pre></td></tr></tbody></table></code></pre></figure>
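<p>In the meantime, the round structure described above can be sketched as a small sequential simulation. This is not the algorithm from the lecture, only an illustration under strong assumptions: processes are numbered 1 to $n$, a crashed process crashes before leading its round, and the failure detector is perfect. The name <code class="highlighter-rouge">hierarchical_consensus</code> is made up for this example.</p>

```python
def hierarchical_consensus(proposals, crashed):
    """Simulate the rounds for processes numbered 1..n.

    proposals: dict process id -> initial proposal.
    crashed: ids of processes that crash before leading their round.
    Returns a dict process id -> decided value for the processes that decide.
    """
    n = len(proposals)
    current = dict(proposals)  # current proposal of each process
    decisions = {}
    for leader in range(1, n + 1):
        if leader in crashed:
            # The others suspect the leader and move on to the next round.
            continue
        # The leader decides its current proposal and broadcasts it.
        decisions[leader] = current[leader]
        for follower in range(leader + 1, n + 1):
            # Later processes adopt the leader's value as their proposal.
            current[follower] = current[leader]
    return decisions


no_crash = hierarchical_consensus({1: "a", 2: "b", 3: "c"}, crashed=set())
first_leader_down = hierarchical_consensus({1: "a", 2: "b", 3: "c"}, crashed={1})
```

<p>Without crashes, everyone decides the proposal of process 1. If process 1 crashes before its round, process 2 becomes the first leader to impose its value.</p>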
<p>correctness argument todo</p>
<h3 id="uniform-consensus-algorithm">Uniform consensus algorithm</h3>
<p>The idea here is to do the same thing, but instead of deciding at the beginning of the round, we wait until round $n$.</p>
<p>not taking notes today, don’t feel like it.</p>
<h3 id="uniform-consensus-algorithm-with-eventually-perfect-failure-detector">Uniform consensus algorithm with eventually perfect failure detector</h3>
<p>This assumes a correct majority, and an eventually perfect failure detector.</p>
<p>When a process suspects another process, it sends it a message. When a new leader takes over, it asks what the previous value was, and at least one process will respond.</p>
<h2 id="atomic-commit">Atomic commit</h2>
<p>The unit of data processing in a distributed system is the <em>transaction</em>. A transaction describes the actions to be taken, and can be terminated either by <strong>committing</strong> or <strong>aborting</strong>.</p>
<h3 id="non-blocking-atomic-commit-nbac">Non-Blocking Atomic Commit (NBAC)</h3>
<p>The <strong>nonblocking atomic commit (NBAC)</strong> abstraction is used to make all processes agree on the outcome of a transaction (commit or abort) in a reliable way. As in consensus, every process proposes an initial value of 0 or 1 (no or yes), and must decide on a final value 0 or 1 (abort or commit). Unlike consensus, the processes here seek to decide 1, but every process has a veto right.</p>
<p>The properties of NBAC are:</p>
<ul>
<li><strong>NBAC1. Agreement</strong>: no two processes decide differently</li>
<li><strong>NBAC2. Termination</strong>: every correct process eventually decides</li>
<li><strong>NBAC3. Commit-validity</strong>: 1 can only be decided if all processes propose 1</li>
<li><strong>NBAC4. Abort-validity</strong>: 0 can only be decided if some process crashes or votes 0</li>
</ul>
<p>Note that here, NBAC may decide to abort if some process crashes, even though all processes have proposed 1 (commit).</p>
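<p>One way to read the two validity properties is as a filter on the outcomes that an execution is allowed to produce. The sketch below encodes NBAC3 and NBAC4 literally; the function name <code class="highlighter-rouge">allowed_outcomes</code> and the input format are assumptions made for this example.</p>

```python
def allowed_outcomes(votes, crashed):
    """Return the set of decisions permitted by NBAC3 and NBAC4.

    votes: dict process -> 0 (abort) or 1 (commit).
    crashed: set of processes that crash during the execution.
    """
    outcomes = set()
    if all(v == 1 for v in votes.values()):
        outcomes.add(1)  # commit-validity is satisfied
    if crashed or any(v == 0 for v in votes.values()):
        outcomes.add(0)  # abort-validity is satisfied
    return outcomes


unanimous = allowed_outcomes({1: 1, 2: 1}, crashed=set())
vetoed = allowed_outcomes({1: 1, 2: 0}, crashed=set())
unanimous_with_crash = allowed_outcomes({1: 1, 2: 1}, crashed={2})
```

<p>When everyone votes 1 and a process crashes, both outcomes pass the filter; it is the algorithm that picks one of them, and agreement ensures that all processes pick the same.</p>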
<p>We can implement NBAC using three underlying abstractions:</p>
<ul>
<li>A perfect failure detector P</li>
<li>Uniform consensus</li>
<li>Best-effort broadcast BEB</li>
</ul>
<p>It works as follows: every process $p$ broadcasts its initial vote (0 or 1, abort or commit) to all other processes using BEB. It waits to hear something from every process $q$ in the system; this is either done through <em>beb</em>-delivery from $q$, or by detecting the crash of $q$. At this point, two situations are possible:</p>
<ul>
<li>If $p$ gets 0 (abort) from any other process, or if it detects a crash, it invokes consensus with a proposal to abort (0).</li>
<li>Otherwise, if it receives the vote to commit (1) from all processes, then it invokes consensus with a proposal to commit (1).</li>
</ul>
<p>Once the consensus is over, every process <em>nbac</em>-decides according to the outcome of the consensus.</p>
<p>We can write this more formally:</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Events</span><span class="p">:</span>
    <span class="n">Request</span><span class="p">:</span> <span class="o"><</span><span class="n">Propose</span><span class="p">,</span> <span class="n">v1</span><span class="o">></span>
    <span class="n">Indication</span><span class="p">:</span> <span class="o"><</span><span class="n">Decide</span><span class="p">,</span> <span class="n">v2</span><span class="o">></span>

<span class="n">Properties</span><span class="p">:</span>
    <span class="n">NBAC1</span><span class="p">,</span> <span class="n">NBAC2</span><span class="p">,</span> <span class="n">NBAC3</span><span class="p">,</span> <span class="n">NBAC4</span>

<span class="n">Implements</span><span class="p">:</span> <span class="n">nonBlockingAtomicCommit</span> <span class="p">(</span><span class="n">nbac</span><span class="p">)</span>

<span class="n">Uses</span><span class="p">:</span>
    <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
    <span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
    <span class="n">UniformConsensus</span> <span class="p">(</span><span class="n">uc</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
    <span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="mi">1</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="err">Ø</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">all_processes</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span><span class="p">:</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">correct</span> \ <span class="p">{</span><span class="n">pi</span><span class="p">}</span>

<span class="c1"># Broadcast our own vote v to all processes</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Propose</span><span class="p">,</span> <span class="n">v</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="n">v</span><span class="o">></span>

<span class="c1"># Record the vote v received from process pi</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">pi</span><span class="p">,</span> <span class="n">v</span><span class="o">></span><span class="p">:</span>
    <span class="n">delivered</span> <span class="p">:</span><span class="o">=</span> <span class="n">delivered</span> <span class="n">U</span> <span class="p">{</span><span class="n">pi</span><span class="p">}</span>
    <span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="n">prop</span> <span class="o">*</span> <span class="n">v</span>

<span class="c1"># Every process has either voted or been detected as crashed</span>
<span class="n">upon</span> <span class="n">event</span> <span class="n">correct</span> \ <span class="n">delivered</span> <span class="o">=</span> <span class="err">Ø</span><span class="p">:</span>
    <span class="k">if</span> <span class="n">correct</span> <span class="o">!=</span> <span class="n">all_processes</span><span class="p">:</span>
        <span class="n">prop</span> <span class="p">:</span><span class="o">=</span> <span class="mi">0</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">ucPropose</span><span class="p">,</span> <span class="n">prop</span><span class="o">></span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">ucDecide</span><span class="p">,</span> <span class="n">decision</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">Decide</span><span class="p">,</span> <span class="n">decision</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>We use multiplication to factor in the decisions we get from other processes; if we get a single 0, the final proposition will be 0 too. If we get only 1s, the final proposition will be 1 too. Otherwise, this should be a fairly straightforward implementation of the description we gave.</p>
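<p>The combination rule can also be written as ordinary Python. This is only a sketch of the local computation at one process: the function <code class="highlighter-rouge">nbac_proposal</code> and its arguments are hypothetical names, and the broadcast, the failure detector and consensus itself are left out.</p>

```python
def nbac_proposal(all_processes, votes_delivered, crashed_detected):
    """Value that one process hands to uniform consensus.

    all_processes: set of all process ids.
    votes_delivered: dict process -> vote (0 or 1) beb-delivered so far.
    crashed_detected: set of processes reported by the failure detector.
    Returns None while the process still waits for a vote from a process
    that has not been detected as crashed.
    """
    correct = all_processes - crashed_detected
    if not correct.issubset(votes_delivered.keys()):
        return None  # still waiting: correct \ delivered is not empty
    prop = 1
    for vote in votes_delivered.values():
        prop *= vote  # a single 0 forces the product to 0
    if crashed_detected:
        prop = 0  # a detected crash also forces a proposal to abort
    return prop


everyone_commits = nbac_proposal({1, 2, 3}, {1: 1, 2: 1, 3: 1}, set())
one_veto = nbac_proposal({1, 2, 3}, {1: 1, 2: 0, 3: 1}, set())
one_crash = nbac_proposal({1, 2, 3}, {1: 1, 2: 1}, {3})
still_waiting = nbac_proposal({1, 2, 3}, {1: 1, 2: 1}, set())
```

<p>The product plays the role of a logical AND over the votes, which is why a single veto is enough to turn the proposal into an abort.</p>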
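<p>We need a perfect failure detector $P$. An eventually perfect failure detector $\diamond P$ is not enough: it may falsely suspect a correct process, in which case a process could propose 0 and abort could be decided even though no process crashed or voted 0, which violates abort-validity.</p>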
<h3 id="2-phase-commit">2-Phase Commit</h3>
<p>This is a <em>blocking</em> algorithm. Unlike NBAC, this algorithm does not use consensus. It operates under a relaxed set of constraints; the termination property has been replaced with weak termination, which just says that if the coordinator process $p$ doesn’t crash, then all correct processes eventually decide.</p>
<p>In 2PC, we have a leading coordinator process $p$ which takes the decision. It asks everyone to vote, makes a decision, and notifies everyone of the decision.</p>
<p>As the name indicates, there are two phases in this algorithm:</p>
<ol>
<li><strong>Voting phase</strong>: As before, proposals are sent with best-effort broadcast. The coordinator $p$ collects all these proposals.</li>
<li><strong>Commit phase</strong>: Again, just as before, $p$ decides to abort if it receives any abort proposals, or if it detects any crashes with its perfect failure detector. Otherwise, if it receives proposals to commit from everyone, it decides to commit. It then sends this decision to all processes with BEB.</li>
</ol>
<p>If $p$ crashes, all processes are blocked, waiting for its response.</p>
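<p>The two phases and the blocking case can be sketched as follows. This is a simplified model rather than a full protocol: the function <code class="highlighter-rouge">two_phase_commit</code> is a made-up name, crashes are assumed to happen before the vote is sent, and a crashed coordinator is assumed to crash before announcing anything.</p>

```python
def two_phase_commit(votes, crashed, coordinator):
    """Outcome seen by the participants, or None if they stay blocked.

    votes: dict process -> 0 (abort) or 1 (commit).
    crashed: set of processes detected as crashed.
    coordinator: the process that collects votes and announces the decision.
    """
    if coordinator in crashed:
        return None  # nobody announces a decision: participants block
    # Voting phase: the coordinator collects the votes of live processes.
    collected = [v for p, v in votes.items() if p not in crashed]
    # Commit phase: abort on any 0 vote or any detected crash.
    if crashed or 0 in collected:
        return 0
    return 1


votes = {1: 1, 2: 1, 3: 1}
commit = two_phase_commit(votes, crashed=set(), coordinator=1)
abort_on_crash = two_phase_commit(votes, crashed={3}, coordinator=1)
blocked = two_phase_commit(votes, crashed={1}, coordinator=1)
```

<p>The <code class="highlighter-rouge">None</code> result is the blocking behavior: without consensus, nobody can take over the decision from the crashed coordinator.</p>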
<h2 id="terminating-reliable-broadcast-trb">Terminating reliable broadcast (TRB)</h2>
<p>Like reliable broadcast, terminating reliable broadcast (TRB) is a communication primitive used to disseminate a message among a set of processes in a reliable way. However, TRB is stricter than URB.</p>
<p>In TRB, there is a specific broadcaster process $p_{\text{src}}$, known by all processes. It is supposed to broadcast a message $m$. We’ll also define a distinct message $\phi \ne m$. The other processes need to deliver $m$ if $p_{\text{src}}$ is correct, but may deliver $\phi$ if $p_{\text{src}}$ crashes.</p>
<p>The idea is that if $p_{\text{src}}$ crashes, the other processes may detect that it’s crashed, without having ever received $m$. But this doesn’t mean that $m$ wasn’t sent; $p_{\text{src}}$ may have crashed while it was in the process of sending $m$, so some processes may have delivered it while others might never do so.</p>
<p>For a process $p$, the following cases cannot be distinguished:</p>
<ul>
<li>Some other process $q$ has delivered $m$; this means that $p$ should keep waiting for it</li>
<li>No process will ever deliver $m$; this means that $p$ should <strong>not</strong> keep waiting for it</li>
</ul>
<p>TRB solves this by adding this missing piece of information to (uniform) reliable broadcast. It ensures that every process either delivers the message $m$ or delivers the failure indicator $\phi$.</p>
<p>The properties of TRB are:</p>
<ul>
<li><strong>TRB1. Integrity</strong>: If a process delivers a message $m$, then either $m$ is $\phi$ or $m$ was broadcast by $p_{\text{src}}$</li>
<li><strong>TRB2. Validity</strong>: If the sender $p_{\text{src}}$ is correct and broadcasts a message $m$, then $p_{\text{src}}$ eventually delivers $m$</li>
<li><strong>TRB3. (Uniform) Agreement</strong>: For any message $m$, if a correct process (any process) delivers $m$, then every correct process delivers $m$</li>
<li><strong>TRB4. Termination</strong>: Every correct process eventually delivers exactly one message</li>
</ul>
<p>Unlike reliable broadcast, every correct process delivers a message, even if the broadcaster crashes. Indeed, with (uniform) reliable broadcast, when the broadcaster crashes, the other processes may deliver <em>nothing</em>.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">Events</span><span class="p">:</span>
    <span class="n">Request</span><span class="p">:</span> <span class="o"><</span><span class="n">trbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="c1"># broadcasts a message m to all processes</span>
    <span class="n">Indication</span><span class="p">:</span> <span class="o"><</span><span class="n">trbDeliver</span><span class="p">,</span> <span class="n">p_src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="c1"># delivers a message m, or the failure ϕ</span>

<span class="n">Properties</span><span class="p">:</span>
    <span class="n">TRB1</span><span class="p">,</span> <span class="n">TRB2</span><span class="p">,</span> <span class="n">TRB3</span><span class="p">,</span> <span class="n">TRB4</span>

<span class="n">Implements</span><span class="p">:</span>
    <span class="n">trbBroadcast</span> <span class="p">(</span><span class="n">trb</span><span class="p">)</span>

<span class="n">Uses</span><span class="p">:</span>
    <span class="n">BestEffortBroadcast</span> <span class="p">(</span><span class="n">beb</span><span class="p">)</span>
    <span class="n">PerfectFailureDetector</span> <span class="p">(</span><span class="n">P</span><span class="p">)</span>
    <span class="n">Consensus</span> <span class="p">(</span><span class="n">cons</span><span class="p">)</span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Init</span><span class="o">></span><span class="p">:</span>
    <span class="n">proposal</span> <span class="p">:</span><span class="o">=</span> <span class="n">null</span>
    <span class="n">correct</span> <span class="p">:</span><span class="o">=</span> <span class="n">S</span>

<span class="c1"># When application broadcasts:</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">trbBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">bebBroadcast</span><span class="p">,</span> <span class="n">m</span><span class="o">></span>

<span class="c1"># When the perfect failure detector detects a crash</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Crash</span><span class="p">,</span> <span class="n">pi</span><span class="o">></span> <span class="ow">and</span> <span class="p">(</span><span class="n">proposal</span> <span class="o">=</span> <span class="n">null</span><span class="p">):</span>
    <span class="k">if</span> <span class="n">pi</span> <span class="o">==</span> <span class="n">p_src</span><span class="p">:</span>
        <span class="n">proposal</span> <span class="p">:</span><span class="o">=</span> <span class="err">ϕ</span>

<span class="c1"># When the message of the broadcaster arrives first</span>
<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">bebDeliver</span><span class="p">,</span> <span class="n">p_src</span><span class="p">,</span> <span class="n">m</span><span class="o">></span> <span class="ow">and</span> <span class="p">(</span><span class="n">proposal</span> <span class="o">=</span> <span class="n">null</span><span class="p">):</span>
    <span class="n">proposal</span> <span class="p">:</span><span class="o">=</span> <span class="n">m</span>

<span class="n">upon</span> <span class="n">event</span> <span class="p">(</span><span class="n">proposal</span> <span class="o">!=</span> <span class="n">null</span><span class="p">):</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">Propose</span><span class="p">,</span> <span class="n">proposal</span><span class="o">></span>

<span class="n">upon</span> <span class="n">event</span> <span class="o"><</span><span class="n">Decide</span><span class="p">,</span> <span class="n">decision</span><span class="o">></span><span class="p">:</span>
    <span class="n">trigger</span> <span class="o"><</span><span class="n">trbDeliver</span><span class="p">,</span> <span class="n">p_src</span><span class="p">,</span> <span class="n">decision</span><span class="o">></span></pre></td></tr></tbody></table></code></pre></figure>
<p>todo explain how we use consensus, and why P is necessary.</p>
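<p>Pending that explanation, here is a small sketch of how the proposals and the decision fit together in the pseudocode. Everything in it is illustrative: <code class="highlighter-rouge">PHI</code>, <code class="highlighter-rouge">trb_proposal</code> and <code class="highlighter-rouge">trb_deliver</code> are hypothetical names, and consensus is replaced by a stand-in that simply picks one of the proposed values.</p>

```python
PHI = object()  # stands for the failure indicator ϕ


def trb_proposal(received_from_src, src_crash_detected):
    """What a process proposes to consensus, or None if it must keep waiting.

    In this sketch a received message takes priority over a detected crash;
    in the pseudocode, whichever event happens first sets the proposal.
    """
    if received_from_src is not None:
        return received_from_src
    if src_crash_detected:
        return PHI
    return None


def trb_deliver(proposals):
    """Stand-in for consensus: every process delivers the same proposed value.

    Picking the proposal of the smallest process id is an arbitrary choice
    made for this sketch; any rule that returns one proposed value would do.
    """
    return proposals[min(proposals)]


# The broadcaster crashed in the middle of its broadcast: process 1 received
# "m", process 2 received nothing and only detected the crash.
proposals = {1: trb_proposal("m", True), 2: trb_proposal(None, True)}
delivered = trb_deliver(proposals)
```

<p>Both processes end up delivering the same value, even though only one of them ever received the message; this is the piece of information that reliable broadcast alone could not provide.</p>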
<h2 id="group-membership">Group membership</h2>
<p>Every view is a pair $(i, M)$, where $i$ is the numbering of the view, and $M$ is a set of processes.</p>
<p>Properties:</p>
<ul>
<li><strong>Memb1. Local Monotonicity</strong>: If a process installs view $(j, M)$ after $(k, N)$, then $j > k$ and $\abs{M} < \abs{N}$ (the only reason to change a view is to remove a process from the set when it crashes).</li>
<li><strong>Memb2. Agreement</strong>: No two processes install views $(j, M)$ and $(j, M')$ such that $M \ne M'$.</li>
<li><strong>Memb3. Completeness</strong>: If a process $p$ crashes, then there is an integer $j$ such that every correct process installs view $(j, M)$ in which $p\notin M$</li>
<li><strong>Memb4. Accuracy</strong>: If some process installs a view $(i, M)$ and $p\notin M$ then $p$ has crashed.</li>
</ul>
<p>The implementation uses consensus and a perfect failure detector.</p>
<figure class="highlight"><pre><code class="language-python" data-lang="python"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="n">todo</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We use a <code class="highlighter-rouge">wait</code> variable, just like in total order broadcast. This prevents a process from triggering a new view installation before the previous one has completed.</p>
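<p>The local part of view handling can be sketched as below. This is not the lecture’s implementation: <code class="highlighter-rouge">next_view</code> and <code class="highlighter-rouge">is_locally_monotonic</code> are hypothetical helpers, and the consensus step that makes all processes install the same view is left out.</p>

```python
def next_view(view, detected_crashes):
    """Return the view to propose after some crashes, or None if unchanged.

    view: pair (i, members) with members a frozenset of processes.
    detected_crashes: set of processes reported by the failure detector.
    """
    number, members = view
    remaining = members - detected_crashes
    if remaining == members:
        return None  # nothing to remove, keep the current view
    return (number + 1, frozenset(remaining))


def is_locally_monotonic(views):
    """Check Memb1 on the sequence of views installed by one process."""
    return all(
        j > k and len(m) < len(n)
        for (k, n), (j, m) in zip(views, views[1:])
    )


v0 = (0, frozenset({1, 2, 3}))
v1 = next_view(v0, {3})
v2 = next_view(v1, {2, 3})
monotonic = is_locally_monotonic([v0, v1, v2])
```

<p>Each new view has a larger number and strictly fewer members than the previous one, which is exactly what local monotonicity asks for.</p>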
<h2 id="view-synchronous-vs-communication">View-Synchronous (VS) communication</h2>
<p>This abstraction brings together reliable broadcast and group membership. However, combining them naively introduces a subtle problem, which justifies introducing the solution as a new abstraction. Indeed, if a message is broadcast right as we’re installing a view, some processes may deliver it in the old view and others in the new one. To solve this, we must introduce some notion of phases in which messages can or cannot be sent.</p>
<h2 id="from-message-passing-to-shared-memory">From message passing to Shared memory</h2>
<p>The Cloud is an example of shared memory, with which we interact by message passing.</p>
<p>A register contains integers….</p>
<h2 id="byzantine-failures">Byzantine failures</h2>
<p>So far, we’ve only considered situations in which nodes crash. In this section, we’ll consider a new case: the one where nodes go “evil”, a situation we call <strong>byzantine failures</strong>.</p>
<p>Suppose that our nodes are arranged in a grid. $S$ sends a message $m$ to $R$ by broadcasting $(S, m)$. With a simple broadcast algorithm, we just broadcast the message to the neighbor, which may be a byzantine node $B$ that alters the message before rebroadcasting it. Because $B$ can simply do that, we see that this simple algorithm is not enough to deal with byzantine failures.</p>
<p>To deal with this problem, we’ll consider some other algorithms.</p>
<p>First, consider the case where there are $n$ intermediary nodes between $S$ and $R$ (this is not a daisy chain of nodes, but instead just $n$ paths of length 2 between $S$ and $R$). We assume that $S$ and $R$ are both correct (non-Byzantine) nodes, but the intermediary nodes may be Byzantine.</p>
<p>For this algorithm, we define $k = \frac{n}{2} - 1$ if $n$ is even, and $k = \frac{n - 1}{2}$ if it is odd. In both cases, $k$ is the largest integer strictly smaller than $\frac{n}{2}$: the idea is that $k$ nodes are always a minority among the $n$ intermediary nodes, so that with at most $k$ Byzantine nodes, any $k+1$ nodes include at least one correct node. Let’s also assume that $R$ has a set $\Omega$ that acts as its memory, and a variable $x$, initially set to $x = 0$. Our goal is to have $x = m$.</p>
<p>$S$ simply sends out the message $m$ to its neighbors. The intermediary nodes forward the messages that they receive to $R$. Finally, when $R$ receives a message $m$ from an intermediary node $p$, it adds $(p, m)$ to the set $\Omega$. When $\Omega$ contains pairs $(p_i, m)$ for $k+1$ distinct nodes $p_i$ and the same message $m$, it can set $x = m$ (essentially, deliver the message).</p>
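<p>The threshold rule at $R$ can be sketched in a few lines of Python. The names <code class="highlighter-rouge">tolerance</code> and <code class="highlighter-rouge">receive</code> are invented for this example, and the sketch assumes that all forwarded messages have already arrived, so it ignores timing.</p>

```python
def tolerance(n):
    """Largest number k of Byzantine relays tolerated among n relays."""
    return n // 2 - 1 if n % 2 == 0 else (n - 1) // 2


def receive(n, forwarded):
    """Value x held by R after the relays have forwarded their messages.

    forwarded: dict relay -> message it sent to R (Byzantine relays may lie).
    Returns 0 (the initial value of x) if no message has k+1 supporters.
    """
    k = tolerance(n)
    supporters = {}
    for relay, message in forwarded.items():
        supporters.setdefault(message, set()).add(relay)
    for message, relays in supporters.items():
        if len(relays) >= k + 1:
            return message
    return 0


# n = 5 relays, so k = 2: two liars cannot outvote three correct relays.
two_liars = receive(5, {1: "m", 2: "m", 3: "m", 4: "fake", 5: "fake"})
# With k + 1 = 3 liars, the fake message reaches the threshold.
three_liars = receive(5, {1: "fake", 2: "fake", 3: "fake", 4: "m", 5: "m"})
```

<p>The second call shows why the bound on the number of Byzantine nodes matters: as soon as $k+1$ relays collude, $R$ has no way to tell their message apart from a genuine one.</p>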
<p>We’ll prove properties on this. The main point to note is that these proofs make no assumption on the potentially Byzantine nodes.</p>
<ul>
<li>
<p><strong>Safety</strong>: if the number of Byzantine nodes $f$ is $f \le k$, then $x = 0$ or $x = m$.</p>
<p>The proof is by contradiction. Let’s suppose that the opposite is true, i.e. that $x = m'$, where $m' \ne m$. Then, according to the algorithm, this means that there must be $k+1$ nodes $p_1, \dots, p_{k+1}$ such that $\forall i \in \left\{ 1, \dots, k+1 \right\}$, we have $(p_i, m') \in \Omega$. But according to the algorithm, there are only two reasons for such a pair being in the set: either $p_i$ operates in good faith, having received $m'$ from $S$, or it operates in bad faith, being a Byzantine node. The first case is impossible, as $S$ is correct and only sends $m$. The alternative case can only happen if there are $k+1$ Byzantine nodes, which is also impossible (since by assumption $f \le k$). This contradiction proves the safety property.</p>
</li>
<li>
<p><strong>Liveness</strong>: if $f \le k$, we eventually have $x = m$.</p>
<p>To prove this, we first define a set of $k+1$ correct (non-Byzantine) intermediary nodes; such a set exists because $f \le k$ and $n \ge 2k + 1$. These nodes all receive $m$ from $S$ and send it to $R$, which places it in $\Omega$. Eventually, we’ll have $k+1$ nodes in the set, and then $x=m$.</p>
<p>By the liveness and safety properties, we know that initially $x=0$, eventually $x=m$, and we never have $x=m'$.</p>
</li>
<li>
<p><strong>Optimality</strong>: if $f \ge k + 1$, it is impossible to ensure the safety property.</p>
<p>Assume we have $k+1$ Byzantine nodes sending $m'$ to $R$. According to the algorithm, we get $x = m'$, so no safety.</p>
<p>We can conclude that we can tolerate at most $k$ Byzantine nodes.</p>
</li>
</ul>
<p>But here we only considered the specific case of length 2 paths. Let’s now consider the general case, which is the $(2k+1)$-connected graph. In this case, we consider any such graph, and each node $p$ needs to broadcast a message $m_p$. Every node $p$ has a set $p_X$ of received messages, and a set $p_R$ of accepted messages.</p>
<p>The algorithm is as follows. Initially, every node $p$ sends $(p, \emptyset, m_p)$ to its neighbors. When a node $p$ receives $(u, \Omega, m)$ from a neighbor $q$, with $p\notin\Omega$ and $q\notin\Omega$, it sends $(u, \Omega\cup \left\{ q \right\}, m)$ to its neighbors, and adds that triple to $p_X$. When there exist a node $q$, a message $m$ and $k+1$ sets $\Omega_1, \dots, \Omega_{k+1}$ such that $\bigcap_{i=1}^{k+1} \Omega_i = \left\{ q \right\}$ and the $k+1$ triples $(q, \Omega_i, m)$ are all in $p_X$, $p$ can add $(q, m)$ to $p_R$.</p>
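<p>The acceptance test at the end of the algorithm can be sketched as a brute-force search. The sketch rests on two assumptions that are not spelled out in the notes: each set $\Omega$ is taken to contain the source and every node the message was relayed through, and the condition on the sets is read as “the chosen sets pairwise share only the source”, which is what node-disjoint paths require. The function name <code class="highlighter-rouge">accepted</code> is made up for this example.</p>

```python
from itertools import combinations


def accepted(p_X, k):
    """Return the (source, message) pairs that p can add to p_R.

    p_X: set of triples (source, relays, message), relays being a frozenset
    that contains the source and every node the message went through.
    k: maximum number of Byzantine nodes.
    """
    by_claim = {}
    for source, relays, message in p_X:
        by_claim.setdefault((source, message), []).append(relays)
    result = set()
    for (source, message), paths in by_claim.items():
        for chosen in combinations(paths, k + 1):
            # The chosen paths may only share the source itself.
            if all(a & b == {source} for a, b in combinations(chosen, 2)):
                result.add((source, message))
                break
    return result


p_X = {
    ("q", frozenset({"q", "a"}), "m"),
    ("q", frozenset({"q", "b"}), "m"),
    # Both copies of the fake message went through node "a".
    ("q", frozenset({"q", "a"}), "fake"),
    ("q", frozenset({"q", "a", "c"}), "fake"),
}
result = accepted(p_X, k=1)
```

<p>The fake message is rejected because its two paths overlap in a relay, so a single Byzantine node could have produced both copies. Trying all combinations is exponential in general; it is used here only to keep the sketch short.</p>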
<p>We’ll prove the following properties under the hypotheses that we have at most $k$ Byzantine nodes (a minority), and that the graph is connected (otherwise we couldn’t broadcast messages between the nodes)</p>
<ul>
<li>
<p><strong>Safety</strong>: If $p$ and $q$ are two correct nodes, we never have $(p, m_p')\in q_R$ (where $m_p' \ne m_p$). In other words, no fake messages are accepted.</p>
<p>The proof is by contradiction, in which we use induction to arrive at a contradictory conclusion. We suppose the opposite of our claim, namely that there are two correct nodes $p$ and $q$ such that $(p, m_p') \in q_R$.</p>
<p>According to our algorithm, we then have $k+1$ sets $\Omega_1, \dots, \Omega_{k+1}$ that are disjoint apart from $p$ (their intersection is $\left\{ p \right\}$), and $k+1$ elements $(p, \Omega_i, m_p') \in q_X$.</p>
<p>To reach the contradiction, we’ll need to prove a sub-property: that each set $\Omega_i$ contains at least one Byzantine node. Since the sets only share the correct node $p$, this requires $k+1$ distinct Byzantine nodes, which contradicts the hypothesis that there are at most $k$. We prove the sub-property by contradiction. We’ll suppose the opposite, namely that $\Omega_i$ contains no Byzantine node (i.e. that its nodes are all correct). I won’t write down the proof of this, but it’s in the lecture notes if needed (it’s by induction).</p>
</li>
</ul>
⚠ Work in progressCS-323 Operating Systems2018-02-21T00:00:00+00:002018-02-21T00:00:00+00:00https://kjaer.io/os
<img src="https://kjaer.io/images/hero/trees.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#intro" id="markdown-toc-intro">Intro</a> <ul>
<li><a href="#a-little-bit-of-history" id="markdown-toc-a-little-bit-of-history">A little bit of history</a></li>
<li><a href="#what-does-the-os-do" id="markdown-toc-what-does-the-os-do">What does the OS do?</a></li>
<li><a href="#where-does-the-os-live" id="markdown-toc-where-does-the-os-live">Where does the OS live?</a></li>
<li><a href="#os-interfaces" id="markdown-toc-os-interfaces">OS interfaces</a> <ul>
<li><a href="#system-calls" id="markdown-toc-system-calls">System calls</a></li>
<li><a href="#traps" id="markdown-toc-traps">Traps</a></li>
<li><a href="#interrupts" id="markdown-toc-interrupts">Interrupts</a></li>
</ul>
</li>
<li><a href="#os-control-flow" id="markdown-toc-os-control-flow">OS control flow</a></li>
<li><a href="#os-design-goals" id="markdown-toc-os-design-goals">OS design goals</a></li>
<li><a href="#high-level-os-structure" id="markdown-toc-high-level-os-structure">High-level OS structure</a></li>
</ul>
</li>
<li><a href="#processes" id="markdown-toc-processes">Processes</a> <ul>
<li><a href="#what-is-a-process" id="markdown-toc-what-is-a-process">What is a process?</a></li>
<li><a href="#outline-of-the-linux-shell" id="markdown-toc-outline-of-the-linux-shell">Outline of the Linux shell</a></li>
<li><a href="#linux-process-tree" id="markdown-toc-linux-process-tree">Linux process tree</a></li>
<li><a href="#multiprocessing" id="markdown-toc-multiprocessing">Multiprocessing</a> <ul>
<li><a href="#process-switching" id="markdown-toc-process-switching">Process switching</a></li>
<li><a href="#process-scheduling" id="markdown-toc-process-scheduling">Process scheduling</a></li>
<li><a href="#scheduling-algorithm" id="markdown-toc-scheduling-algorithm">Scheduling algorithm</a></li>
<li><a href="#application-multiprocess-structuring" id="markdown-toc-application-multiprocess-structuring">Application multiprocess structuring</a></li>
</ul>
</li>
<li><a href="#interprocess-communication" id="markdown-toc-interprocess-communication">Interprocess communication</a> <ul>
<li><a href="#message-passing" id="markdown-toc-message-passing">Message passing</a></li>
<li><a href="#remote-procedure-call" id="markdown-toc-remote-procedure-call">Remote procedure call</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#application-multithreading-and-synchronization" id="markdown-toc-application-multithreading-and-synchronization">Application multithreading and synchronization</a> <ul>
<li><a href="#multithreading-vs-multiprocessing" id="markdown-toc-multithreading-vs-multiprocessing">Multithreading vs. multiprocessing</a></li>
<li><a href="#pthreads" id="markdown-toc-pthreads">Pthreads</a></li>
<li><a href="#synchronization" id="markdown-toc-synchronization">Synchronization</a></li>
<li><a href="#kernel-multithreading" id="markdown-toc-kernel-multithreading">Kernel multithreading</a></li>
</ul>
</li>
<li><a href="#memory-management" id="markdown-toc-memory-management">Memory management</a> <ul>
<li><a href="#goals-and-assumptions" id="markdown-toc-goals-and-assumptions">Goals and assumptions</a></li>
<li><a href="#virtual-and-physical-address-spaces" id="markdown-toc-virtual-and-physical-address-spaces">Virtual and physical address spaces</a></li>
<li><a href="#mapping-methods-and-allocation" id="markdown-toc-mapping-methods-and-allocation">Mapping methods and allocation</a> <ul>
<li><a href="#base-and-bounds" id="markdown-toc-base-and-bounds">Base and bounds</a></li>
<li><a href="#segmentation" id="markdown-toc-segmentation">Segmentation</a></li>
<li><a href="#paging" id="markdown-toc-paging">Paging</a></li>
<li><a href="#segmentation-with-paging" id="markdown-toc-segmentation-with-paging">Segmentation with Paging</a></li>
</ul>
</li>
<li><a href="#optimizations" id="markdown-toc-optimizations">Optimizations</a> <ul>
<li><a href="#memory-allocation" id="markdown-toc-memory-allocation">Memory allocation</a></li>
<li><a href="#protection" id="markdown-toc-protection">Protection</a></li>
<li><a href="#sharing" id="markdown-toc-sharing">Sharing</a></li>
<li><a href="#feature-comparison" id="markdown-toc-feature-comparison">Feature comparison</a></li>
<li><a href="#tlbs" id="markdown-toc-tlbs">TLBs</a></li>
</ul>
</li>
<li><a href="#dealing-with-large-virtual-address-spaces" id="markdown-toc-dealing-with-large-virtual-address-spaces">Dealing with large virtual address spaces</a></li>
<li><a href="#process-switching-and-memory-management" id="markdown-toc-process-switching-and-memory-management">Process switching and memory management</a></li>
</ul>
</li>
<li><a href="#demand-paging" id="markdown-toc-demand-paging">Demand Paging</a> <ul>
<li><a href="#page-replacement" id="markdown-toc-page-replacement">Page replacement</a></li>
<li><a href="#frame-allocation" id="markdown-toc-frame-allocation">Frame allocation</a></li>
<li><a href="#optimizations-1" id="markdown-toc-optimizations-1">Optimizations</a> <ul>
<li><a href="#prepaging" id="markdown-toc-prepaging">Prepaging</a></li>
<li><a href="#cleaning" id="markdown-toc-cleaning">Cleaning</a></li>
<li><a href="#free-frame-pool" id="markdown-toc-free-frame-pool">Free frame pool</a></li>
<li><a href="#copy-on-write" id="markdown-toc-copy-on-write">Copy-on-write</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#file-systems" id="markdown-toc-file-systems">File Systems</a> <ul>
<li><a href="#what-is-a-file-system" id="markdown-toc-what-is-a-file-system">What is a file system?</a></li>
<li><a href="#interface" id="markdown-toc-interface">Interface</a> <ul>
<li><a href="#access-primitives" id="markdown-toc-access-primitives">Access primitives</a></li>
<li><a href="#concurrency-primitives" id="markdown-toc-concurrency-primitives">Concurrency primitives</a></li>
<li><a href="#naming-primitives" id="markdown-toc-naming-primitives">Naming primitives</a></li>
</ul>
</li>
<li><a href="#disks" id="markdown-toc-disks">Disks</a> <ul>
<li><a href="#how-does-a-disk-work" id="markdown-toc-how-does-a-disk-work">How does a disk work?</a></li>
<li><a href="#interface-1" id="markdown-toc-interface-1">Interface</a></li>
<li><a href="#performance" id="markdown-toc-performance">Performance</a> <ul>
<li><a href="#caching" id="markdown-toc-caching">Caching</a></li>
<li><a href="#read-ahead" id="markdown-toc-read-ahead">Read-ahead</a></li>
<li><a href="#disk-scheduling" id="markdown-toc-disk-scheduling">Disk scheduling</a></li>
<li><a href="#disk-allocation" id="markdown-toc-disk-allocation">Disk allocation</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#os-file-system-implementation" id="markdown-toc-os-file-system-implementation">OS File System Implementation</a> <ul>
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#disk-structure" id="markdown-toc-disk-structure">Disk structure</a></li>
<li><a href="#data-allocation" id="markdown-toc-data-allocation">Data allocation</a></li>
<li><a href="#in-memory-data-structures" id="markdown-toc-in-memory-data-structures">In-memory data structures</a></li>
<li><a href="#pseudo-code" id="markdown-toc-pseudo-code">Pseudo-code</a></li>
<li><a href="#loose-ends" id="markdown-toc-loose-ends">Loose ends</a></li>
<li><a href="#alternative-file-access-method-memory-mapping" id="markdown-toc-alternative-file-access-method-memory-mapping">Alternative file access method: memory mapping</a></li>
</ul>
</li>
<li><a href="#dealing-with-crashes" id="markdown-toc-dealing-with-crashes">Dealing with crashes</a> <ul>
<li><a href="#atomic-writes" id="markdown-toc-atomic-writes">Atomic writes</a></li>
<li><a href="#intentions-log" id="markdown-toc-intentions-log">Intentions log</a></li>
<li><a href="#comparison" id="markdown-toc-comparison">Comparison</a></li>
</ul>
</li>
<li><a href="#log-structured-file-system-lfs" id="markdown-toc-log-structured-file-system-lfs">Log-Structured File System (LFS)</a> <ul>
<li><a href="#writing" id="markdown-toc-writing">Writing</a></li>
<li><a href="#reading" id="markdown-toc-reading">Reading</a></li>
<li><a href="#checkpoints" id="markdown-toc-checkpoints">Checkpoints</a></li>
<li><a href="#disk-cleaning" id="markdown-toc-disk-cleaning">Disk cleaning</a></li>
<li><a href="#summary" id="markdown-toc-summary">Summary</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#alternative-storage-media" id="markdown-toc-alternative-storage-media">Alternative storage media</a> <ul>
<li><a href="#raid" id="markdown-toc-raid">RAID</a></li>
<li><a href="#ssd" id="markdown-toc-ssd">SSD</a></li>
</ul>
</li>
<li><a href="#virtual-machines" id="markdown-toc-virtual-machines">Virtual Machines</a> <ul>
<li><a href="#virtualization" id="markdown-toc-virtualization">Virtualization</a></li>
<li><a href="#terminology" id="markdown-toc-terminology">Terminology</a></li>
<li><a href="#history" id="markdown-toc-history">History</a></li>
<li><a href="#vmm-implementation" id="markdown-toc-vmm-implementation">VMM Implementation</a> <ul>
<li><a href="#popekgoldberg-theorem" id="markdown-toc-popekgoldberg-theorem">Popek/Goldberg Theorem</a></li>
</ul>
</li>
<li><a href="#virtual-memory" id="markdown-toc-virtual-memory">Virtual memory</a></li>
</ul>
</li>
</ul>
<!-- More -->
<h2 id="intro">Intro</h2>
<h3 id="a-little-bit-of-history">A little bit of history</h3>
<p>In the early, early days, users would program the raw machine. The first “abstractions” were libraries for scientific functions (<code class="highlighter-rouge">sin</code>, <code class="highlighter-rouge">cos</code>, …), and libraries for doing I/O (at the time, considered a major improvement; it was an extra card deck that you’d put behind yours). And indeed, I/O libraries are the first pieces of an OS.</p>
<p>In the early days, only one program would run at a time. Computers were expensive, so to get more use of the resources, we started doing multiprogramming; this just means that multiple programs are in memory at once.</p>
<h3 id="what-does-the-os-do">What does the OS do?</h3>
<ul>
<li>It provides <strong>abstraction</strong>, and makes the hardware easier to use. It serves as an interface between hardware and user-defined programs. In hardware, we have CPU, memory, disks and devices. The OS abstracts to processes, memory locations, files, virtual devices.</li>
<li>It does <strong>resource management</strong>: it allocates hardware resources (memory, CPU) between programs and users, to make sure that they can all use the resources fairly and independently (not corrupting each other’s memory for instance)</li>
</ul>
<h3 id="where-does-the-os-live">Where does the OS live?</h3>
<p>The CPU always operates in dual mode: there’s the <em>kernel mode</em> and the <em>user mode</em>. In terms of hardware, there’s a bit that tells you which mode you are in. In short, kernel mode is the God-mode, and user-mode is very restricted.</p>
<p>In kernel mode, there are additional, privileged instructions: for instance, the instruction to set the mode bit is only available in kernel mode. In this mode, you have direct access to all of memory and to devices. In user mode, you don’t actually have direct access to the disk; you can ask the kernel to write something for you (the OS is the only entity that can write to a device).</p>
<p>The OS runs in kernel mode, and applications run in user mode. This allows the OS to protect itself, and to solely manage applications and devices.</p>
<h3 id="os-interfaces">OS interfaces</h3>
<h4 id="system-calls">System calls</h4>
<p>Switching from kernel mode to user mode is easy; in kernel mode, we can simply change the mode bit and give control to the user program. The other way is more tricky; to go from user to kernel mode, a device generates an <em>interrupt</em>, or a program executes a <em>trap</em> or a <em>system call</em>. For that purpose, there’s a system call interface that allows the kernel and user entities to communicate. This interface is essential to the integrity of the OS. System calls include process management, memory management, file systems, device management, <a href="https://github.com/freebsd/freebsd/blob/master/sys/sys/syscall.h">etc</a>.</p>
<p>Traditionally, system calls went through interrupts with a number; more recently, ISAs provide dedicated instructions (you put the system call number in register <code class="highlighter-rouge">%eax</code>, then execute the system call instruction. If you need to pass parameters, you can put them in other registers, or use registers as pointers to memory). That being said, in practice, almost everyone goes through the kernel API, which is a library layer over the system call interface (i.e. almost nobody writes system calls by hand; the library takes care of putting the appropriate data in the appropriate registers, and doing the syscalls).</p>
<p><em>Side note</em>: <a href="https://meltdownattack.com">Meltdown and Spectre</a> were about some of the things in kernel space being visible in user space.</p>
<p><em>Side note</em>: when doing a <code class="highlighter-rouge">printf</code>, we’re not writing a <code class="highlighter-rouge">printf</code> kernel call directly. First of all, there’s no kernel call code specifically for print: the console is considered to be a device that the kernel can write to. Second, we’re using the language library (called libc for C) as an additional abstraction over the kernel API (see <a href="https://github.com/lattera/glibc/blob/master/stdio-common/printf.c">source for printf</a>). Note that libc makes system calls look like a function call (deliberately) but it is not just a regular function call: it’s also a user-kernel transition from one program (user) to another (kernel). This is a much more expensive operation.</p>
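<p>As a rough illustration of the layering described above, the sketch below skips <code class="highlighter-rouge">printf</code> and calls the libc wrapper <code class="highlighter-rouge">write</code> directly on a file descriptor. The helper name <code class="highlighter-rouge">write_all</code> is an assumption made for this example, not part of libc; it simply retries until the whole buffer has been handed to the kernel.</p>

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: push `len` bytes to `fd` through the write(2)
 * wrapper, retrying on short writes and on EINTR.
 * Returns the number of bytes written, or -1 on error. */
ssize_t write_all(int fd, const char *buf, size_t len) {
    size_t done = 0;
    while (done < len) {
        /* Each call is a user-to-kernel transition, not a plain function call. */
        ssize_t n = write(fd, buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted by a signal: retry */
            return -1;
        }
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

<p>Calling <code class="highlighter-rouge">write_all(STDOUT_FILENO, "hello\n", 6)</code> writes to the console through the same kernel path that <code class="highlighter-rouge">printf</code> eventually takes, since the console is just another device behind a file descriptor.</p>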
<h4 id="traps">Traps</h4>
<p>A trap is generated by the CPU as a result of an error (divide by zero, privileged instruction execution in user mode, illegal access to memory, …). Generally, the result is that God (the kernel) sees this and punishes you (kills the process). Still, this works like an “involuntary” system call. It sets the mode to the kernel mode and lets it do its job.</p>
<h4 id="interrupts">Interrupts</h4>
<p>Generated by a device that needs attention (packet arrived from network, disk IO completed, …). It’s identified by an interrupt number which, roughly speaking, identifies the device.</p>
<h3 id="os-control-flow">OS control flow</h3>
<p>The OS is an event-driven program. In other words, if there’s nothing to do, it does nothing. If there’s an interrupt, a system call or a trap, it starts running.</p>
<p>The (simplified) execution flow is as follows:</p>
<ol>
<li>The user program executes a system call or a trap with code <code class="highlighter-rouge">i</code>, or there’s an interrupt with code <code class="highlighter-rouge">i</code>.</li>
<li>The hardware receives this, puts the machine in kernel mode, and depending on what it’s received, does the following:
<ul>
<li><strong>System call</strong>: put the machine in kernel mode, set <code class="highlighter-rouge">PC = SystemCallVector[i]</code></li>
<li><strong>Trap</strong>: put the machine in kernel mode, set <code class="highlighter-rouge">PC = TrapVector[i]</code></li>
<li><strong>Interrupt</strong>: put the machine in kernel mode, set <code class="highlighter-rouge">PC = InterruptVector[i]</code></li>
</ul>
</li>
<li>The kernel executes system call <code class="highlighter-rouge">i</code> handler routine (after checking that the <code class="highlighter-rouge">i</code> parameter is valid), and then executes a “return from kernel” instruction</li>
<li>The hardware puts the machine back in user mode</li>
<li>The user executes the next instruction after the system call.</li>
</ol>
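<p>The vector lookup in the steps above can be sketched in plain user-space C as a table of function pointers. This is only a simulation under stated assumptions: the handler names, the table size and the <code class="highlighter-rouge">-1</code> error convention are invented for illustration, and a real kernel’s vectors are set up by hardware-specific code.</p>

```c
#define NUM_SYSCALLS 2

typedef long (*handler_t)(long arg);

/* Two made-up handlers standing in for real kernel routines. */
static long sys_double(long arg) { return 2 * arg; }
static long sys_increment(long arg) { return arg + 1; }

/* Plays the role of SystemCallVector: index i maps to handler i. */
static const handler_t system_call_vector[NUM_SYSCALLS] = {
    sys_double,
    sys_increment,
};

/* Hypothetical dispatcher: check that i is valid, then jump through
 * the vector. Returns -1 for an invalid system call number. */
long dispatch_syscall(int i, long arg) {
    if (i < 0 || i >= NUM_SYSCALLS) return -1;
    return system_call_vector[i](arg);
}
```

<p>The bounds check before the indirect call corresponds to the kernel checking that the <code class="highlighter-rouge">i</code> parameter is valid before running the handler.</p>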
<h3 id="os-design-goals">OS design goals</h3>
<ul>
<li><strong>Correct abstractions</strong></li>
<li><strong>Performance</strong></li>
<li><strong>Portability</strong>: You cannot do this completely, but you can try to minimize the architecture-dependent sections so that you only have to rewrite those to port to a new system.</li>
<li><strong>Reliability</strong>: The OS must never fail. It must carefully check inbound parameters. For instance, the inbound address parameter must be valid.</li>
</ul>
<h3 id="high-level-os-structure">High-level OS structure</h3>
<p>The simplest OS structure is the “monolithic OS”, where there’s a hard line between the whole OS, in kernel mode, and the user programs, in user mode. But the OS is a huge piece of software (millions of lines of code), so if something goes wrong in kernel mode, the machine will likely halt or crash.</p>
<p>Luckily, many pieces of the OS can be in user mode, and the above is a strong incentive to put them there. Indeed, some parts of the OS don’t need the privileged access, and we save some effort that way. A good example of those are daemons (printer daemon, system log, etc.), that are placed in user mode. This is known as the “systems programs” structure.</p>
<p>Bringing this idea to the extreme, we have the microkernel. The idea is to place the absolute minimum in kernel mode, and all the rest in user mode. This places not only daemons in user mode, but also process management, memory management, the file system, etc. Note that these still communicate with the microkernel, but run the majority of their code in user mode. In practice, microkernels have been a commercial failure, and the “systems programs” structure has won out.</p>
<h2 id="processes">Processes</h2>
<h3 id="what-is-a-process">What is a process?</h3>
<p>A process is a program in execution: a program is executable code on a file on disk, and when it runs, it’s a process. A process can do just about anything, as far as the user is concerned (shell, compiler, editor, browser, etc). Each process has a unique process identifier, the <code class="highlighter-rouge">pid</code>.</p>
<p>Basic operations on a process:</p>
<ul>
<li>Create</li>
<li>Terminate
<ul>
<li>Normal exit</li>
<li>Error (trap)</li>
<li>Terminated by another process (<code class="highlighter-rouge">SIGKILL</code>)</li>
</ul>
</li>
</ul>
<p>Linux has a simple yet powerful list of process primitives:</p>
<ul>
<li><code class="highlighter-rouge">fork()</code>: if a process runs a fork, it becomes the parent of the new process. The new process is an <em>identical copy of the process that called fork</em>, except for one thing: in the parent, the call to <code class="highlighter-rouge">fork</code> returns the <code class="highlighter-rouge">pid</code> of the child, while in the child, it returns <code class="highlighter-rouge">0</code>.</li>
<li><code class="highlighter-rouge">exec(filename)</code>: loads into memory the file with the given filename. Note that running this completely replaces the code that called the <code class="highlighter-rouge">exec</code>, it’s not like a function call that returns to the caller’s execution context. But it does leave the environment intact! More on that later</li>
<li><code class="highlighter-rouge">wait()</code>: waits for one of the children to terminate (the basic <code class="highlighter-rouge">wait</code> function waits for any child; there are more advanced versions that wait for a particular one).</li>
<li><code class="highlighter-rouge">exit()</code>: terminates the process (it’s a suicide)</li>
</ul>
<p>Here’s some typical forking code:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="p">(</span><span class="n">pid</span> <span class="o">=</span> <span class="n">fork</span><span class="p">())</span> <span class="p">{</span>
<span class="n">wait</span><span class="p">();</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="n">exec</span><span class="p">(</span><span class="n">filename</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p><code class="highlighter-rouge">fork</code> will execute the fork, return the <code class="highlighter-rouge">pid</code> of the child in the parent, and <code class="highlighter-rouge">0</code> in the child. The parent therefore enters the if-statement and waits, while the child goes into the else-statement, and executes the file. We’re assuming that the program in <code class="highlighter-rouge">filename</code> ends with an <code class="highlighter-rouge">exit()</code> (it’s inserted by the C compiler if you don’t explicitly write it), which causes the child process to terminate, and the parent process to stop waiting.</p>
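<p>For reference, here is a minimal runnable sketch of the same pattern using the actual POSIX calls. The notes’ <code class="highlighter-rouge">exec(filename)</code> and <code class="highlighter-rouge">wait()</code> are simplified; this version assumes <code class="highlighter-rouge">execvp</code> and <code class="highlighter-rouge">waitpid</code> as the concrete variants, and the helper name <code class="highlighter-rouge">run_and_wait</code> is made up for the example.</p>

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] in a child process and return its
 * exit status, or -1 if fork/wait failed or the child did not exit
 * normally. */
int run_and_wait(char *const argv[]) {
    pid_t pid = fork();
    if (pid < 0) return -1;      /* fork failed, no child was created */
    if (pid == 0) {
        /* Child: fork returned 0. Replace this process image. */
        execvp(argv[0], argv);
        _exit(127);              /* only reached if the exec failed */
    }
    /* Parent: fork returned the child's pid. Wait for that child. */
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>Unlike the condensed version above, this sketch also handles a failed <code class="highlighter-rouge">fork</code> (negative return value), which would otherwise be mistaken for the parent branch.</p>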
<p><em>Side note</em>: This may seem a bit strange: why would the new process need to be a copy? Why use fork+exec instead of simply create (like Windows does)? A process is not just code, it’s environment and code. The environment includes ownership, open files and environment variables; note that it does not include registers or the heap or stack, or anything like that (that would probably be a big security issue). A call to <code class="highlighter-rouge">exec</code> doesn’t replace the environment, it only changes the code (and the registers, etc. are reset). This default behavior allows us to have simple function calls, without a million arguments for environment like in Windows.</p>
<h3 id="outline-of-the-linux-shell">Outline of the Linux shell</h3>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">forever</span> <span class="p">{</span>
<span class="n">read</span> <span class="n">from</span> <span class="n">input</span>
<span class="k">if</span> <span class="p">(</span><span class="n">logout</span><span class="p">)</span> <span class="n">exit</span><span class="p">()</span>
<span class="k">if</span> <span class="p">(</span><span class="n">pid</span> <span class="o">=</span> <span class="n">fork</span><span class="p">())</span> <span class="p">{</span>
<span class="n">wait</span><span class="p">()</span>
<span class="p">}</span> <span class="k">else</span> <span class="p">{</span>
<span class="c1">// You can make changes to the environment</span>
<span class="c1">// of the child here.</span>
<span class="n">exec</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>You can manipulate the environment of the child without changing the environment of the parent by adding code where the comment is. For instance, a command shell may redirect <code class="highlighter-rouge">stdin</code> and <code class="highlighter-rouge">stdout</code> to a file before doing an <code class="highlighter-rouge">exec</code>.</p>
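<p>A hedged sketch of that redirection, assuming the usual POSIX calls <code class="highlighter-rouge">open</code>, <code class="highlighter-rouge">dup2</code> and <code class="highlighter-rouge">execvp</code>; the helper name <code class="highlighter-rouge">run_redirected</code> and the exit codes used on failure are choices made for this example only.</p>

```c
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] with its stdout redirected to `path`.
 * Returns the child's exit status, or -1 on failure. */
int run_redirected(char *const argv[], const char *path) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child only: the parent's stdout is left untouched. */
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0) _exit(126);
        if (dup2(fd, STDOUT_FILENO) < 0) _exit(126);
        close(fd);
        /* Open files are part of the environment, so they survive exec. */
        execvp(argv[0], argv);
        _exit(127);              /* only reached if the exec failed */
    }
    int status;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

<p>This is roughly what happens for a command such as <code class="highlighter-rouge">echo hi &gt; out.txt</code>: the environment change is made between the fork and the exec, so it affects the child only.</p>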
<h3 id="linux-process-tree">Linux process tree</h3>
<p>The first process, <code class="highlighter-rouge">init</code>, runs from boot. When the user logs in, it forks and waits, and the child executes the shell (say, bash): at this stage, the child process is the shell process for the user running bash. Now, the user runs <code class="highlighter-rouge">make</code>, so the shell forks and waits, and its child executes <code class="highlighter-rouge">make</code> and exits. Now another user logs in, so the initial parent creates another child, and opens another shell (say, zsh). Note that init can create another process while waiting because it’s not really a normal process, it’s a “black magic” custom process. All of this creates a tree hierarchy of processes and their children.</p>
<p><img src="/images/os/process-tree.gif" alt="Linux process tree example gif" /></p>
<h3 id="multiprocessing">Multiprocessing</h3>
<p>As far as the OS is concerned, it either computes or does I/O. Back in the old days, in single-process systems, I/O would leave the CPU idle until it was complete. In a multiprocess system, another process can do processing in that time.</p>
<p>There are five states that a process in a multiprocessing system can be in:</p>
<ol>
<li>New</li>
<li>Ready</li>
<li>Running</li>
<li>Waiting</li>
<li>Terminated.</li>
</ol>
<p><img src="/images/os/process-states.png" alt="Process state diagram" /></p>
<p>Typically, a process will start in new (1), go to ready (2), then running (3), then go to waiting to do I/O (4), then to ready (2, using an interrupt), then back to running (3, picked by the scheduler); it can loop in 2-4 an arbitrary number of times, and eventually terminates (5, leaving from the running state 3).</p>
<h4 id="process-switching">Process switching</h4>
<p>A process switch is a switch from one process running on the CPU to another process, done in such a way that you can later switch back to the process that was holding the CPU.</p>
<p>This is not so simple, because a process consists of many things: the code, stack, heap, values in registers (including PC) and MMU info (memory management unit, we can ignore this for now). The first three reside in process-private locations, but the last two reside in shared locations. To make the swap possible, the registers are backed up to the process control block (PCB, though Linux has another name for it); in addition, it contains information about the process, and some save area (free space).</p>
<p>A process switch is an expensive operation. It requires saving and restoring a lot of stuff, so it has to be implemented very efficiently (often written in machine code, as one of the only parts of the kernel), and has to be used with care.</p>
<h4 id="process-scheduling">Process scheduling</h4>
<p>In general, many processes may be in the ready state. It is the scheduler’s job to pick out a process in the ready state and set it to the running state.</p>
<p>In practice, there’s often a timer on the running state that sets a process back to ready after say, 10ms, if it isn’t doing IO. A scheduler that does this is called a <em>preemptive scheduler</em>.</p>
<p>In a non-preemptive scheduler, a process can monopolize the CPU, as it is only stopped when it voluntarily does so; thus a non-preemptive scheduler is only useful in special circumstances (for instance, in the computer of an airplane autopilot, where we absolutely must have the thread running at all times).</p>
<p>The scheduler must remember the running process, and maintain sets of queues (ready queue, I/O device queue). The PCBs sit in queues.</p>
<p>The scheduler runs when:</p>
<ol>
<li>A process starts or terminates (system call)</li>
<li>The running process performs an I/O (system call)</li>
<li>I/O completes (I/O interrupt)</li>
<li>Timer expires (timer interrupt)</li>
</ol>
<h4 id="scheduling-algorithm">Scheduling algorithm</h4>
<p>What makes a good scheduling algorithm? It depends.</p>
<p>There are two kinds of processes, that require different scheduling tactics:</p>
<ul>
<li><strong>Interactive processes</strong>: the user is waiting for the result (browser, editor, …)</li>
<li><strong>Batch processes</strong>: the user will look at the result later (supercomputing center, …).</li>
</ul>
<p>For batch, we want high throughput, meaning that we want a high number of jobs completed over say, a day. From a throughput perspective, the scheduler is overhead, so we must run it as little as possible. But for interactive applications, we want to optimize response time, so we want to go from ready to running quickly, which means running the scheduler often.</p>
<p>We’ll look at a few algorithms.</p>
<ul>
<li><strong>FCFS (first come first served)</strong>: when a process goes to ready, we insert at the tail of the queue; this way, we always run the oldest process (head of the queue). By definition, this is non-preemptive. It has low overhead, good throughput, but uneven response time (you can get stuck behind a long job), and in extreme cases, a process can monopolize the CPU.</li>
<li><strong>SJF (shortest job first)</strong>: we order the queue according to job length. We always run the head of the queue, which is the shortest available job. It can be preemptive or non-preemptive, but we’ll consider the preemptive only. It has good response time for short jobs, but can lead to starvation of long jobs, and it’s difficult to predict job length.</li>
<li><strong>RR (Round Robin)</strong>: We define a time quantum x. When a process is ready, we put it at the tail (like FCFS). We run the head and run it for x time, after which we put it at the tail of the queue.</li>
</ul>
<p>RR works as a nice compromise for long and short jobs. Short jobs finish quickly, long jobs are not postponed forever. There’s no need to know job length; we discover length by counting the number of time quanta it needs.</p>
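<p>To make the Round Robin behavior concrete, here is a small simulation sketch. It assumes that all jobs are ready at time 0, that context switches are free, and that job lengths are known to the simulator (the scheduler itself would not need them); the function name <code class="highlighter-rouge">round_robin</code> is made up for this example.</p>

```c
/* Simulate Round Robin for n jobs that are all ready at time 0.
 * burst[i] is the CPU time job i needs; finish[i] receives the time
 * at which job i completes. Assumes quantum > 0. */
void round_robin(const int burst[], int n, int quantum, int finish[]) {
    if (n <= 0 || quantum <= 0) return;
    int remaining[n];            /* C99 variable-length array */
    int left = n, now = 0;
    for (int i = 0; i < n; i++) {
        remaining[i] = burst[i] > 0 ? burst[i] : 0;
        if (remaining[i] == 0) { finish[i] = 0; left--; }
    }
    while (left > 0) {
        /* With every job arriving at time 0, cycling through the array
         * visits jobs in the same order as a FIFO ready queue would. */
        for (int i = 0; i < n; i++) {
            if (remaining[i] == 0) continue;
            int slice = remaining[i] < quantum ? remaining[i] : quantum;
            now += slice;
            remaining[i] -= slice;
            if (remaining[i] == 0) {
                finish[i] = now;
                left--;
            }
        }
    }
}
```

<p>With job lengths 5, 2 and 8 and a quantum of 3, the jobs complete at times 10, 5 and 15: the short job finishes at time 5, whereas under FCFS in the same order it would have to wait behind the first job and finish at time 7.</p>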
<h4 id="application-multiprocess-structuring">Application multiprocess structuring</h4>
<p>If we wanted to build a web server, the minimal version would just wait for incoming requests, read files from disk and send the file back in response. This is pretty terrible, because the server sits idle while it waits for the disk. Instead, we can use several processes, so that while one process is reading from disk, another can accept connections and use the CPU in the meantime, overlapping computing with IO.</p>
<p>Thus, the server now has a structure with a single listener, and a number of worker processes. But generating processes is expensive, especially if we have to do it for <em>every</em> request! Therefore, we can create what’s called a <em>process pool</em>: we create worker processes during initialization, and hand incoming requests to them. The structure remains the same (a listener, multiple workers), but the amount of work per request decreases, since sending a message is cheaper than creating a worker.</p>
<p>However, to be able to do so, we need a mechanism for interprocess communication, which is what we’ll see in the next section.</p>
<h3 id="interprocess-communication">Interprocess communication</h3>
<p>There are two key ways of doing this, and they’re somewhat related: message passing, and remote procedure call.</p>
<h4 id="message-passing">Message passing</h4>
<p>This is the simplest, with only two primitives: send a message, and receive a message. Note that message passing is always by value, never by reference.</p>
<p>This is implemented in the kernel through a table (<code class="highlighter-rouge">proctable</code>) which, for every <code class="highlighter-rouge">pid</code> has a queue (linked list) of waiting messages for the process with <code class="highlighter-rouge">pid</code> to process; other processes can add to this queue.</p>
<p>There are of course more complex alternatives:</p>
<ul>
<li><strong>Symmetric vs. asymmetric</strong>: in symmetric, the sender and the receiver both communicate with a specific, given process (they know each other). In asymmetric, the sender sends to a particular process, while the receiver accepts a message from any process and gets the message together with the sender’s <code class="highlighter-rouge">pid</code> (to be able to reply). Asymmetric is more common.</li>
<li><strong>Blocking vs. non-blocking</strong>: for sending, non-blocking returns immediately after the message is sent, while blocking blocks until the message has been received. For receiving, non-blocking returns immediately (whether or not a message is present), while blocking blocks until a message is present. Non-blocking sending combined with blocking receiving is the more common choice.</li>
</ul>
<p>A very common pattern is:</p>
<ul>
<li>Client:
<ul>
<li>Send request</li>
<li>Blocking receive</li>
</ul>
</li>
<li>Server
<ul>
<li>Blocking receive</li>
<li>Send reply</li>
</ul>
</li>
</ul>
<p>This works somewhat like a function call, and it’s basically how remote procedure calls work.</p>
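<p>The request/reply pattern above can be sketched with two processes. Note the swapped-in mechanism: this example uses POSIX pipes as the message channel rather than the per-<code class="highlighter-rouge">pid</code> message queue described earlier, and the “doubling server” and the name <code class="highlighter-rouge">request_double</code> are invented for illustration.</p>

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo: a child process acts as a server that doubles an
 * integer. Returns the reply, or -1 if any step failed. */
int request_double(int x) {
    int to_server[2], to_client[2];
    if (pipe(to_server) < 0 || pipe(to_client) < 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Server: close the ends it does not use. */
        close(to_server[1]);
        close(to_client[0]);
        int req;
        /* Blocking receive... */
        if (read(to_server[0], &req, sizeof req) != (ssize_t)sizeof req) _exit(1);
        int rep = 2 * req;
        /* ...then send reply. */
        if (write(to_client[1], &rep, sizeof rep) != (ssize_t)sizeof rep) _exit(1);
        _exit(0);
    }
    /* Client: send request (the value is copied into the pipe),
     * then do a blocking receive for the reply. */
    int reply = -1;
    int ok = write(to_server[1], &x, sizeof x) == (ssize_t)sizeof x
          && read(to_client[0], &reply, sizeof reply) == (ssize_t)sizeof reply;
    close(to_server[0]); close(to_server[1]);
    close(to_client[0]); close(to_client[1]);
    waitpid(pid, NULL, 0);
    return ok ? reply : -1;
}
```

<p>The integer is passed by value: the server works on its own copy, and nothing in the client’s memory is visible to it.</p>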
<h4 id="remote-procedure-call">Remote procedure call</h4>
<p>Another possibility is to use remote procedure calls. To use these, we link a client stub with the client code, and a server stub with the server code. In other words, every code instance has an associated stub. A call then goes through the following steps:</p>
<ol>
<li>Client code does a proc call to the client stub</li>
<li>Client stub sends a message through the kernel</li>
<li>Server stub receives the request from the kernel</li>
<li>Server stub calls the correct code through a proc call</li>
<li>Server code returns from the proc call, handing the return value to the server stub</li>
<li>Server stub replies to client stub through a message through the kernel</li>
<li>Client stub returns correct value to client code through a proc call</li>
</ol>
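<p>The division of labor between stubs and code can be sketched in a single process. Everything here is an assumption made for illustration: the message layout, the <code class="highlighter-rouge">PROC_ADD</code> identifier, and especially <code class="highlighter-rouge">transport</code>, which stands in for the kernel’s send and receive by copying the message.</p>

```c
#include <string.h>

/* Wire format assumed for this example: a procedure id, two arguments
 * and a slot for the result. */
struct message { int proc; int a; int b; int result; };

enum { PROC_ADD = 1 };

/* Server code: an ordinary procedure that knows nothing about messages. */
static int add_impl(int a, int b) { return a + b; }

/* Server stub: unpack the request, make the local call, pack the reply. */
static void server_stub(struct message *m) {
    if (m->proc == PROC_ADD) m->result = add_impl(m->a, m->b);
}

/* Stand-in for the kernel: copy the message by value to the "server",
 * run the server stub, and copy the reply back. */
static void transport(struct message *m) {
    struct message copy;
    memcpy(&copy, m, sizeof copy);
    server_stub(&copy);
    memcpy(m, &copy, sizeof copy);
}

/* Client stub: looks like a local function to the client code. */
int rpc_add(int a, int b) {
    struct message m = { PROC_ADD, a, b, 0 };
    transport(&m);
    return m.result;
}
```

<p>From the client code’s point of view, <code class="highlighter-rouge">rpc_add(2, 3)</code> is an ordinary procedure call; the packing, message exchange and unpacking are hidden in the stubs.</p>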
<h2 id="application-multithreading-and-synchronization">Application multithreading and synchronization</h2>
<h3 id="multithreading-vs-multiprocessing">Multithreading vs. multiprocessing</h3>
<p>In our example about the web server, there’s still a problem: disk access is expensive. To remediate this problem, we usually use caches for recently read data. But worker 1 and worker 2 are different processes, their memory is separate, and what we’ve cached for worker 1 hasn’t been cached for worker 2, who will store a separate copy (waste of time, waste of cache).</p>
<p>This is exactly what multiprocessing is bad at, and multithreading is good at. If not sharing memory is the problem, then sharing memory is the solution.</p>
<p>A thread is just like a process, but it does <strong>not</strong> have its own heap and globals. Two threads in a process share code, globals and heap, but have separate stacks, registers and program counters (PC).</p>
<p>Processes provide separation, and in particular, memory separation. This is useful for things that really should remain separate, so for coarse-grain separation. Threads do not provide such separation; in particular, they share memory, so they are suitable for tighter integrations.</p>
<h3 id="pthreads">Pthreads</h3>
<p>Having multiple threads reading and writing to shared data is obviously a performance advantage, but is also a problem, as it can lead to data races. Data races are a result of <em>interleaving</em> of the thread executions. This means that to write a multi-threaded program, a program must be correct for all interleavings. To do so, we use pthreads.</p>
<p>The basic approach to multithreading is to divide “work” among threads. Then you need to think of which data is shared, where it’s accessed, and put the accesses to shared data in a critical section (n.b. this sentence would give professors a heart attack, but this is a good way to think about it). Putting code in a critical section means that only one thread at a time can execute it, so no other thread’s execution of that section can be interleaved with it.</p>
<p>Pthreads have a few available methods:</p>
<ul>
<li><code class="highlighter-rouge">Pthread_create(&threadid, threadcode, arg)</code>: Creates thread, returns <code class="highlighter-rouge">threadid</code>, runs <code class="highlighter-rouge">threadcode</code> with argument <code class="highlighter-rouge">arg</code>.</li>
<li><code class="highlighter-rouge">Pthread_exit(status)</code>: terminates the thread, optionally returns status</li>
<li><code class="highlighter-rouge">Pthread_join(threadid, &status)</code>: waits for thread <code class="highlighter-rouge">threadid</code> to exit, and receives status, if any.</li>
<li><code class="highlighter-rouge">Pthread_mutex_lock(mutex)</code>: if mutex is held, block. Otherwise, acquire it and proceed.</li>
<li><code class="highlighter-rouge">Pthread_mutex_unlock(mutex)</code>: release mutex</li>
</ul>
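<p>A minimal sketch of thread creation and joining follows. One caveat: the actual POSIX functions are spelled in lowercase (<code class="highlighter-rouge">pthread_create</code>, <code class="highlighter-rouge">pthread_join</code>) and take an extra attributes argument; the capitalized names in the list above are presumably shorthand or wrappers. The helper name <code class="highlighter-rouge">square_in_thread</code> is invented for this example.</p>

```c
#include <pthread.h>
#include <stdint.h>

/* Thread body: receives its argument and returns its status through
 * void pointers, as required by pthread_create. Returning from this
 * function has the same effect as calling pthread_exit with the value. */
static void *square(void *arg) {
    intptr_t n = (intptr_t)arg;
    return (void *)(n * n);
}

/* Hypothetical helper: compute n*n on a separate thread.
 * Returns -1 if the thread could not be created or joined. */
long square_in_thread(long n) {
    pthread_t tid;
    void *status;
    if (pthread_create(&tid, NULL, square, (void *)(intptr_t)n) != 0) return -1;
    if (pthread_join(tid, &status) != 0) return -1;
    return (long)(intptr_t)status;
}
```

<p>On most systems this needs to be compiled with the <code class="highlighter-rouge">-pthread</code> flag.</p>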
<h3 id="synchronization">Synchronization</h3>
<p>To synchronize code, we use <em>mutex locks</em>. A common mistake is to assume that lines such as <code class="highlighter-rouge">a += 1</code> are atomic; in reality, such a line corresponds to a load into a register, an increment of said register, and a store of the register into memory. Another thread’s instructions may be interleaved anywhere in this sequence. Note that some machines have atomic increments available.</p>
<p>We can use locks, but single locking inhibits parallelism. To remediate this, there are two approaches: fine-grained locking, and privatization.</p>
<p>Privatization is the idea of defining local variables for each thread, using them for accesses in the loop, and only accessing shared data after the loop; when things are being put in common, we can lock.</p>
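<p>A small sketch of privatization, under assumed names and sizes (<code class="highlighter-rouge">NUM_THREADS</code>, <code class="highlighter-rouge">ITEMS_PER_THREAD</code> and <code class="highlighter-rouge">privatized_count</code> are all made up here): each thread counts into a local variable and takes the lock only once, to merge its result into the shared total.</p>

```c
#include <pthread.h>

#define NUM_THREADS 4
#define ITEMS_PER_THREAD 100000

static long total = 0;           /* shared between all threads */
static pthread_mutex_t total_lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    long local = 0;              /* private: lives on this thread's stack */
    for (int i = 0; i < ITEMS_PER_THREAD; i++) local += 1;
    /* One lock acquisition per thread instead of one per iteration. */
    pthread_mutex_lock(&total_lock);
    total += local;
    pthread_mutex_unlock(&total_lock);
    return NULL;
}

/* Runs the workers and returns the merged total, or -1 if a thread
 * could not be created. */
long privatized_count(void) {
    pthread_t threads[NUM_THREADS];
    total = 0;
    for (int i = 0; i < NUM_THREADS; i++)
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0) return -1;
    for (int i = 0; i < NUM_THREADS; i++) pthread_join(threads[i], NULL);
    return total;
}
```

<p>Locking around every <code class="highlighter-rouge">total += 1</code> inside the loop would also be correct, but it would serialize the threads; here the loop runs without any synchronization and only the final merge is in a critical section.</p>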
<p>As an example:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ListenerThread</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">MAX_THREADS</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="kr">thread</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">Pthread_create</span><span class="p">(</span><span class="err">…</span><span class="p">)</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Receive</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">put</span> <span class="n">request</span> <span class="n">in</span> <span class="n">queue</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">WorkerThread</span> <span class="p">{</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">take</span> <span class="n">request</span> <span class="n">out</span> <span class="n">of</span> <span class="n">queue</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">read</span> <span class="n">file</span> <span class="n">from</span> <span class="n">disk</span>
<span class="n">Send</span><span class="p">(</span><span class="n">reply</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p><strong>Warning</strong>: This won’t work! A worker may try to take a request out of an empty queue. We need to tell the worker(s) that there is something for them to do, an item in the queue for them. This style of dividing work is sometimes called task parallelism. Pthreads have a mechanism for this:</p>
<ul>
<li><code class="highlighter-rouge">Pthread_cond_wait(cond, mutex)</code>: wait for a signal on cond, releasing the mutex while waiting. The mutex is re-acquired before the call returns.</li>
<li><code class="highlighter-rouge">Pthread_cond_signal(cond, mutex)</code>: Signal one thread waiting on cond. The signaled thread re-acquires the mutex at some point in the future. If no thread is waiting, this is a no-op. Note that the function signature isn’t actually correct (there’s no mutex parameter) but it makes it easier to explain.</li>
<li><code class="highlighter-rouge">Pthread_cond_broadcast(cond, mutex)</code>: signal all threads waiting on cond. If no thread is waiting, it’s a no-op.</li>
</ul>
<p>We must hold the mutex when calling any of these.</p>
<p>We can now fix our previous example:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ListenerThread</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">MAX_THREADS</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="kr">thread</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">Pthread_create</span><span class="p">(</span><span class="err">…</span><span class="p">)</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Receive</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">put</span> <span class="n">request</span> <span class="n">in</span> <span class="n">queue</span>
<span class="c1">// Signal that the queue isn't empty</span>
<span class="n">Pthread_cond_signal</span><span class="p">(</span><span class="n">notempty</span><span class="p">,</span> <span class="n">queuelock</span><span class="p">)</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">WorkerThread</span> <span class="p">{</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="c1">// Wait until there is something in the queue before taking it.</span>
<span class="n">Pthread_cond_wait</span><span class="p">(</span><span class="n">notempty</span><span class="p">,</span> <span class="n">queuelock</span><span class="p">)</span>
<span class="n">take</span> <span class="n">request</span> <span class="n">out</span> <span class="n">of</span> <span class="n">queue</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">read</span> <span class="n">file</span> <span class="n">from</span> <span class="n">disk</span>
<span class="n">Send</span><span class="p">(</span><span class="n">reply</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Still, there is a slight bug in the above; if all worker threads are busy doing something, the <code class="highlighter-rouge">Pthread_cond_signal</code> will no-op. This is important to note: if no thread is waiting, either because they’re not listening or because they’re busy, the signal is a no-op. In general, signals have no memory, and they’re forgotten if no thread is waiting. So we need an extra variable to remember them:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ListenerThread</span> <span class="p">{</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span><span class="o">=</span><span class="mi">0</span><span class="p">;</span> <span class="n">i</span><span class="o"><</span><span class="n">MAX_THREADS</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="kr">thread</span><span class="p">[</span><span class="n">i</span><span class="p">]</span> <span class="o">=</span> <span class="n">Pthread_create</span><span class="p">(</span><span class="err">…</span><span class="p">)</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Receive</span><span class="p">(</span><span class="n">request</span><span class="p">)</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">put</span> <span class="n">request</span> <span class="n">in</span> <span class="n">queue</span>
<span class="n">avail</span><span class="o">++</span>
<span class="n">Pthread_cond_signal</span><span class="p">(</span><span class="n">notempty</span><span class="p">,</span> <span class="n">queuelock</span><span class="p">)</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="n">WorkerThread</span> <span class="p">{</span>
<span class="n">forever</span> <span class="p">{</span>
<span class="n">Pthread_mutex_lock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="k">while</span> <span class="p">(</span><span class="n">avail</span> <span class="o"><=</span> <span class="mi">0</span><span class="p">)</span> <span class="n">Pthread_cond_wait</span><span class="p">(</span><span class="n">notempty</span><span class="p">,</span> <span class="n">queuelock</span><span class="p">)</span>
<span class="n">take</span> <span class="n">request</span> <span class="n">out</span> <span class="n">of</span> <span class="n">queue</span>
<span class="n">avail</span><span class="o">--</span>
<span class="n">Pthread_mutex_unlock</span><span class="p">(</span><span class="n">queuelock</span><span class="p">)</span>
<span class="n">read</span> <span class="n">file</span> <span class="n">from</span> <span class="n">disk</span>
<span class="n">Send</span><span class="p">(</span><span class="n">reply</span><span class="p">)</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that we’re doing a <code class="highlighter-rouge">while</code> here, not <code class="highlighter-rouge">if</code>. If we used an <code class="highlighter-rouge">if</code> instead, we may have a problem. It’s important not to think of the signaler passing the mutex to the waiter. The only thing we know is that the waiter will acquire the mutex lock <em>at some point</em> in the future; another thread may come and steal it in the meantime, and take the item in the queue! This is subtle, and we’ll have exercises about it (week 4).</p>
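<p>For reference, here is a hedged, compilable sketch of the final version using the actual POSIX calls (lowercase names, and <code class="highlighter-rouge">pthread_cond_signal</code> without a mutex argument). The fixed-size ring buffer, <code class="highlighter-rouge">QUEUE_CAP</code>, <code class="highlighter-rouge">PER_WORKER</code>, integer “requests” and the <code class="highlighter-rouge">run_demo</code> driver are assumptions made for illustration; they replace the network and disk parts of the pseudocode.</p>

```c
#include <assert.h>
#include <pthread.h>

#define QUEUE_CAP 64    /* hypothetical fixed capacity */
#define PER_WORKER 5    /* hypothetical: requests handled per worker */

static int queue[QUEUE_CAP];
static int head = 0, avail = 0;
static pthread_mutex_t queuelock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t notempty = PTHREAD_COND_INITIALIZER;

/* Listener side: enqueue a request and wake one waiting worker. */
void put_request(int request) {
    pthread_mutex_lock(&queuelock);
    assert(avail < QUEUE_CAP);       /* sketch: a full queue is not handled */
    queue[(head + avail) % QUEUE_CAP] = request;
    avail++;
    pthread_cond_signal(&notempty);
    pthread_mutex_unlock(&queuelock);
}

/* Worker side: block until a request is available, then dequeue it. */
int take_request(void) {
    pthread_mutex_lock(&queuelock);
    while (avail <= 0)               /* re-check the condition after every wakeup */
        pthread_cond_wait(&notempty, &queuelock);
    int request = queue[head];
    head = (head + 1) % QUEUE_CAP;
    avail--;
    pthread_mutex_unlock(&queuelock);
    return request;
}

/* Worker thread: handles PER_WORKER requests and adds them up. */
static void *worker(void *arg) {
    long *sum = arg;
    for (int i = 0; i < PER_WORKER; i++)
        *sum += take_request();
    return NULL;
}

/* Starts two workers, feeds them requests 1..10, returns the combined sum. */
long run_demo(void) {
    pthread_t w[2];
    long sums[2] = {0, 0};
    for (int i = 0; i < 2; i++)
        pthread_create(&w[i], NULL, worker, &sums[i]);
    for (int r = 1; r <= 2 * PER_WORKER; r++)
        put_request(r);
    for (int i = 0; i < 2; i++)
        pthread_join(w[i], NULL);
    return sums[0] + sums[1];
}
```

<p>Because <code class="highlighter-rouge">avail</code> remembers how many requests are pending, it does not matter whether a signal arrives while the workers are busy: a worker that later finds <code class="highlighter-rouge">avail &gt; 0</code> never waits.</p>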
<h3 id="kernel-multithreading">Kernel multithreading</h3>
<p>The kernel is a server. There are requests from users (system calls, traps), and from devices (interrupts). It’s an event-driven program, waiting for one of the above to start running. In Linux, there’s one kernel thread for each user thread; this is not the case in other OSes. This is called a 1-to-1 mapping.</p>
<p>How does it work? When the user thread makes a system call, the processor switches to kernel mode, sets the PC to the system call handler routine, and sets the SP to the kernel stack of the kernel thread. To go back to user mode, we set the SP to the stack of the user thread and the PC to the user thread’s PC, return from kernel mode, and resume running in the user thread.</p>
<p>The kernel and the user threads have different stacks; otherwise, another thread in user mode could read and modify the kernel’s stack, which would be a huge security risk.</p>
<p>For synchronization in the kernel, there’s a different synchronization library (not Pthreads, which is a user-level library), but otherwise, it works as in any multithreaded program. There is a notable difference though; there are also interrupts in the kernel. When a device interrupts, the PC is set to the interrupt handler, the SP to the interrupt thread stack, and we run the interrupt handler.</p>
<p>This is not exam-stuff, but let’s talk about it anyway. Interrupts must be served quickly, so the interrupt handler cannot block. The solution (in Linux) is to add another set of threads (soft interrupt threads). On an interrupt, we put the request in a queue for the soft interrupt threads, which will do the bulk of the work.</p>
<h2 id="memory-management">Memory management</h2>
<h3 id="goals-and-assumptions">Goals and assumptions</h3>
<p>For this part of the course, a few assumptions:</p>
<ul>
<li>We will only be concerned about processor and main memory; we do not worry about L1, L2, L3 caches and disk.</li>
<li>A program must be in memory to be executed (we’ll revisit this next week).</li>
</ul>
<p>The kernel is generally located in low physical memory (close to 0), because interrupt vectors are in low memory.</p>
<h3 id="virtual-and-physical-address-spaces">Virtual and physical address spaces</h3>
<p>For protection, one process must not be able to read or write in the memory of another process, or of the kernel. We cannot have unprotected access: all CPU memory accesses must be checked somehow (this is done in hardware; checking errors are traps).</p>
<p>Additionally, we want transparency: the programmer should not have to worry about where the program is in memory, or where or what other programs are in memory. With transparency, the program can be anywhere in memory, and it shouldn’t affect the behavior or code of the program.</p>
<p>This is why we have the abstraction of virtual address space. The virtual (or logical) address space is what the program(mer) thinks is its memory; its addresses are the ones generated by the CPU. The physical address space is where the program actually resides in memory.</p>
<p>The piece of hardware that provides mapping from virtual-to-physical and protection is the Memory Management Unit (MMU).</p>
<p>The size of the virtual address space is only limited by the size of addresses that CPUs can generate; on older 32-bit machines, 2<sup>32</sup> bytes (4GB), on newer, 64-bit machines, 2<sup>64</sup> bytes (16 exabytes!). The physical address space is limited by the physical size of the memory; nowadays, it’s on the order of a few GB.</p>
<h3 id="mapping-methods-and-allocation">Mapping methods and allocation</h3>
<h4 id="base-and-bounds">Base and bounds</h4>
<p>The virtual memory space is a linear address space, where addresses go from <code class="highlighter-rouge">0</code> to <code class="highlighter-rouge">MAX</code>.</p>
<p>The physical memory space is a linear address space, where addresses go from <code class="highlighter-rouge">BASE</code> to <code class="highlighter-rouge">BOUNDS = BASE + MAX</code>.</p>
<p><img src="/images/os/base-bounds-mmu.png" alt="MMU for base and bounds" /></p>
<p>The MMU has a relocation register that holds the base value, and a limit register holding the bounds value. If a virtual address is smaller than the limit, the MMU adds the value of the relocation register to the virtual address. Otherwise, it generates a trap.</p>
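<p>The check-then-add logic can be sketched in a few lines of C. This is only a software model of what the hardware does; the names <code class="highlighter-rouge">struct bb_mmu</code> and <code class="highlighter-rouge">bb_translate</code> are hypothetical, and the sketch assumes the limit register holds the size of the virtual address space.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Software model of a base-and-bounds MMU (illustrative names). */
struct bb_mmu {
    uint32_t relocation;  /* base: physical address where the process starts */
    uint32_t limit;       /* assumed here: size of the virtual address space */
};

/* Returns true and stores the physical address, or false where the
 * hardware would raise a trap. */
bool bb_translate(const struct bb_mmu *mmu, uint32_t vaddr, uint32_t *paddr) {
    assert(mmu != NULL && paddr != NULL);
    if (vaddr >= mmu->limit)
        return false;                  /* out of bounds: trap */
    *paddr = mmu->relocation + vaddr;  /* in bounds: relocate */
    return true;
}
```

<p>Every memory access goes through this check, which is why it has to be done in hardware.</p>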
<p>All early machines were done this way, but this is not commonly used anymore, except in very specific machines (CRAY supercomputers keep it simple to go fast).</p>
<h4 id="segmentation">Segmentation</h4>
<p>With segmentation, you have multiple base and bounds within a single address space. There’s a set of segments from <code class="highlighter-rouge">0</code> to <code class="highlighter-rouge">n</code>, and each segment <code class="highlighter-rouge">i</code> is linear from <code class="highlighter-rouge">0</code> to <code class="highlighter-rouge">MAX(i)</code>. A segment is whatever you want it to be. It’s a more user-centric view of memory, where each segment contains one or more units of the program (a procedure’s code, an array, …). Segmentation reduces internal fragmentation (segments are not fixed size, can be fit exactly to requirements), but can create external fragmentation (some small free memory locations may remain which never get used).</p>
<p>The virtual address space is two-dimensional: a segment number <code class="highlighter-rouge">s</code>, and an offset <code class="highlighter-rouge">d</code> within the segment.</p>
<p>The physical address space is a set of segments, each linear. The segments don’t have to be ordered like the virtual space.</p>
<p><img src="/images/os/segmentation-mmu.png" alt="MMU for segmentation" /></p>
<p>The MMU contains a STBR (Segment Table Base Register) that points to the segment table in memory, and a STLR (Segment Table Length Register) containing the length of the segment table. The segment table is indexed by segment number, and contains <code class="highlighter-rouge">(base, limit)</code> pairs where <code class="highlighter-rouge">base</code> is the physical address of the segment in memory, and <code class="highlighter-rouge">limit</code> is the length of the segment.</p>
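<p>A rough software model of this lookup, under the assumption that both an out-of-range segment number and an out-of-range offset cause a trap. The type and function names (<code class="highlighter-rouge">struct seg_mmu</code>, <code class="highlighter-rouge">seg_translate</code>) are hypothetical.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct segment { uint32_t base; uint32_t limit; };

/* Software model of the two MMU registers (illustrative names). */
struct seg_mmu {
    const struct segment *stbr;  /* segment table base register */
    size_t stlr;                 /* segment table length register */
};

/* Translates (segment s, offset d); returns false where hardware would trap. */
bool seg_translate(const struct seg_mmu *mmu, size_t s, uint32_t d,
                   uint32_t *paddr) {
    assert(mmu != NULL && paddr != NULL);
    if (s >= mmu->stlr)
        return false;                  /* no such segment: trap */
    const struct segment *entry = &mmu->stbr[s];
    if (d >= entry->limit)
        return false;                  /* offset past the segment end: trap */
    *paddr = entry->base + d;
    return true;
}
```

<p>This is base and bounds applied once per segment, with one extra check on the segment number.</p>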
<p>Segmentation was long a hallmark of Intel’s architecture, but over time they’ve given it up for paging (it got programmers confused).</p>
<h4 id="paging">Paging</h4>
<p>A page is a fixed-size portion of virtual memory. A frame is a fixed-size portion of physical memory. These two are always the same size, and they have to be. A typical size these days is 4-8KB. Paging reduces external fragmentation (no unusable free frames, since they are all of equal size), but may increase internal fragmentation (memory is allocated in say 4KB chunks, so going over 10B into the next page wastes 4086B).</p>
<p>The virtual address space is linear from <code class="highlighter-rouge">0</code> up to a multiple of the page size. For mapping purposes, the virtual address is split into two pieces: a page number <code class="highlighter-rouge">p</code> and an offset <code class="highlighter-rouge">o</code> within that page.</p>
<p>The physical address space is a non-contiguous set of frames. There’s one frame per page, but not necessarily in order.</p>
<p><img src="/images/os/paging-mmu.png" alt="MMU for paging" /></p>
<p>The MMU has a page table to translate page number to frame number; the offset stays the same.</p>
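<p>As an illustrative sketch, assuming 32-bit addresses and 4KB pages, the split and lookup can be written as below. The <code class="highlighter-rouge">struct pte</code> layout and the name <code class="highlighter-rouge">page_translate</code> are assumptions, and the sketch already includes a valid bit per entry.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u                 /* assumed 4KB pages: 2^12 bytes */
#define PAGE_SIZE  (1u << PAGE_SHIFT)

struct pte { uint32_t frame; bool valid; };  /* illustrative entry layout */

/* Translates vaddr using a single-level page table of npages entries.
 * Returns false where the hardware would trap. */
bool page_translate(const struct pte *table, size_t npages,
                    uint32_t vaddr, uint32_t *paddr) {
    assert(table != NULL && paddr != NULL);
    uint32_t page   = vaddr >> PAGE_SHIFT;       /* high bits: page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);   /* low 12 bits: offset */
    if (page >= npages || !table[page].valid)
        return false;
    /* The frame number replaces the page number; the offset is unchanged. */
    *paddr = (table[page].frame << PAGE_SHIFT) | offset;
    return true;
}
```

<p>Because the page size is a power of two, the split is just a shift and a mask; no addition or comparison against a per-page limit is needed.</p>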
<h4 id="segmentation-with-paging">Segmentation with Paging</h4>
<p>This is a combination of the two previous methods. It looks like segmentation, but underneath it’s paged. It aims to reduce both internal and external fragmentation.</p>
<p>The virtual address space is exactly like segmentation: we define segments for every abstract memory group of varying sizes.</p>
<p>However, the physical is just like in paging (non-contiguous set of frames).</p>
<p><img src="/images/os/segmentation-paging-mmu.png" alt="MMU for segmentation with paging" /></p>
<p>Each process keeps one segment table, and multiple page tables (one for each segment). The logical addresses are first interpreted as <code class="highlighter-rouge">(segment, offset)</code>, which the segment table can translate into <code class="highlighter-rouge">(page_table_number, page_number, offset)</code>, which we can give to the page table to resolve a physical address.</p>
<h3 id="optimizations">Optimizations</h3>
<h4 id="memory-allocation">Memory allocation</h4>
<p>How do we find memory for a newly arrived process? Base-and-bounds leaves “holes” in the memory, so we need to pick a “hole” for the new process to use. There are a few dynamic memory allocation methods:</p>
<ul>
<li><strong>First-fit</strong>: Take the first hole that fits</li>
<li><strong>Best-fit</strong>: Take the smallest hole where we can fit what’s been requested. This sounds nice, but it also leaves small holes behind, which is problematic</li>
<li><strong>Worst-fit</strong>: takes the largest hole, and leaves big holes behind</li>
</ul>
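<p>The three policies above differ only in which of the sufficiently large holes they select. A minimal sketch, assuming the free holes are given as an array of sizes (the names <code class="highlighter-rouge">pick_hole</code> and <code class="highlighter-rouge">enum policy</code> are hypothetical):</p>

```c
#include <assert.h>
#include <stddef.h>

enum policy { FIRST_FIT, BEST_FIT, WORST_FIT };

/* Returns the index of the chosen hole, or -1 if no hole is large enough. */
int pick_hole(const size_t *holes, int n, size_t request, enum policy policy) {
    assert(holes != NULL || n == 0);
    int chosen = -1;
    for (int i = 0; i < n; i++) {
        if (holes[i] < request)
            continue;                           /* hole too small */
        if (chosen < 0) {
            chosen = i;                         /* first hole that fits */
            if (policy == FIRST_FIT)
                break;
        } else if (policy == BEST_FIT && holes[i] < holes[chosen]) {
            chosen = i;                         /* smaller hole that still fits */
        } else if (policy == WORST_FIT && holes[i] > holes[chosen]) {
            chosen = i;                         /* larger hole */
        }
    }
    return chosen;
}
```

<p>First-fit can stop at the first match, while best-fit and worst-fit have to scan every hole (unless the holes are kept sorted by size).</p>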
<p>The problem here is external fragmentation: small holes become unusable, so part of memory can never be used. This is a serious problem. Main memory allocation is easier to do in segmentation: the pieces are smaller, but there is more than one of them. Main memory allocation is easiest in paging, as pages have a fixed size. But we now have an internal fragmentation problem: part of the last page may be unused. With a reasonable page size though, this is not too much of a problem.</p>
<h4 id="protection">Protection</h4>
<p>Fine-grain protection means having different protections for different parts of the address space. In the page table, we can have a few bits to indicate read/write/execute rights on the memory, and whether the memory is valid. For instance, code should be valid, read-only (don’t let an attacker inject code) and executable.</p>
<p>In base-and-bounds, this is not really possible; we do not have any break-up, there’s no page table. In segmentation, this is easy to do: there’s a segment for the code, so we set the bits in the segment table. Same goes for paging, we can set the bits, but we have to do it in every code page.</p>
<h4 id="sharing">Sharing</h4>
<p>Sharing memory between processes allows us to run the same program twice in different processes (we can share the code), or read the same file twice in different processes (we may want to share memory corresponding to the file).</p>
<p>This is not possible in base and bounds. With segmentation, we create a segment for shared data, and add an entry in the segment tables of both processes pointing to the shared segment in memory. With paging, this is somewhat more complex; we need to share pages, add entries in the page tables of both processes that point to the shared pages, and manage multiple shared pages.</p>
<h4 id="feature-comparison">Feature comparison</h4>
<p>Segmentation is best for sharing and fine-grain protection, while paging is best for memory allocation. This is the reason we have segmentation with paging: it combines the best of both worlds and allows us to do all three fairly easily.</p>
<h4 id="tlbs">TLBs</h4>
<p>For performance, we have TLBs (translation lookaside buffers), a piece of hardware outside physical memory.</p>
<p>They are there because the page table actually is in memory, not in a separate device as we may have been led to believe. A virtual address reference would (without the TLB) result in 2 physical memory accesses, which would reduce performance by a factor of 2. The TLB is a small fast cache of mappings <code class="highlighter-rouge">(page_nb -> frame_nb)</code>. If the mapping for <code class="highlighter-rouge">page_nb</code> is found in the TLB, we use the associated value instead of looking up the page table. If not, we have to do a lookup, but we then add the mapping to the TLB.</p>
<p>The TLB is highly specialized, expensive hardware, but it’s also very fast. It’s an associative memory cache, meaning that there’s a hardware comparator for every entry, so the TLB is usually rather small (64-1024 entries). If it’s full, we replace existing entries.</p>
<p>Note that flushing the TLB is actually the <em>real</em> reason that process switching is so expensive (moreso than switching register values, like we said earlier). Typical TLB hit rates are 95-100%.</p>
<h3 id="dealing-with-large-virtual-address-spaces">Dealing with large virtual address spaces</h3>
<p>The virtual address space is 64 bits, and we’ve got 4KB pages, which means 12 bits are needed for offset, which leaves 52 bits for page number, which would require 2<sup>52</sup> page table entries. Say they’re 4 bytes each, we’d need 2<sup>54</sup> bytes (18 petabytes) of memory, which is way more than we have!</p>
<p>There are a few solutions to this (hierarchical page tables, hashed page tables, inverted page tables, …) but we’ll just take a look at hierarchical page tables.</p>
<p>The number of levels of hierarchy can vary, but here we’ll just talk about two-level page tables:</p>
<ul>
<li>In a single level page table, we have a set number of bits indicating the page number. In a two-level page table, we break this page number <code class="highlighter-rouge">p</code> into two numbers <code class="highlighter-rouge">p1</code> and <code class="highlighter-rouge">p2</code>.</li>
<li>The top-level page table entry is indexed by <code class="highlighter-rouge">p1</code>, and contains a pointer to second-level page table (and a valid bit)</li>
<li>The second-level page table entry is indexed by <code class="highlighter-rouge">p2</code>, and contains a frame number containing page <code class="highlighter-rouge">(p1, p2)</code> (and a valid bit)</li>
</ul>
<p>This is useful because most address spaces are sparsely populated: As we said above, the virtual address space is very large (say, 2<sup>52</sup> pages), and we won’t be using most virtual pages. But still, in a one-level page table we need an entry for every page. Therefore, most entries in the page table will be empty (in reality, they’re not empty, just invalid: the valid bit will be 0).</p>
<p>We can solve this with hierarchical page tables. The top-level table needs one entry for every possible value of <code class="highlighter-rouge">p1</code> (2<sup>k</sup> entries if <code class="highlighter-rouge">p1</code> is k bits wide), but we only need second-level page tables for the populated parts of the first level. For every invalid entry in the first level, we save a whole second-level table, with its one entry per possible value of <code class="highlighter-rouge">p2</code>.</p>
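<p>A two-level walk might look like the following sketch. It assumes 32-bit virtual addresses split into 10 bits for <code class="highlighter-rouge">p1</code>, 10 bits for <code class="highlighter-rouge">p2</code> and a 12-bit offset; these widths, the entry layouts and the name <code class="highlighter-rouge">two_level_walk</code> are illustrative assumptions. A <code class="highlighter-rouge">NULL</code> pointer plays the role of a cleared valid bit in the top-level table.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define P1_BITS     10u
#define P2_BITS     10u
#define OFFSET_BITS 12u   /* assumed 4KB pages; 10 + 10 + 12 = 32 bits */

struct l2_entry { uint32_t frame; bool valid; };
struct l1_entry { const struct l2_entry *table; };  /* NULL means invalid */

/* Returns false where the hardware would trap. */
bool two_level_walk(const struct l1_entry *top, uint32_t vaddr,
                    uint32_t *paddr) {
    assert(top != NULL && paddr != NULL);
    uint32_t p1     = vaddr >> (P2_BITS + OFFSET_BITS);
    uint32_t p2     = (vaddr >> OFFSET_BITS) & ((1u << P2_BITS) - 1);
    uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1);

    const struct l2_entry *second = top[p1].table;  /* 1st table access */
    if (second == NULL)
        return false;          /* no second-level table for this region */
    if (!second[p2].valid)                          /* 2nd table access */
        return false;
    *paddr = (second[p2].frame << OFFSET_BITS) | offset;
    return true;
}
```

<p>In this configuration, a process that touches a single page needs the top-level table plus one second-level table (2 × 1024 entries), instead of the 2<sup>20</sup> entries of a flat table.</p>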
<p>Note that hierarchical page tables are counter-productive for dense address spaces, but since most address spaces are quite sparse, it’s usually worth it. The price to be paid here is that each level adds another memory access: with n levels, a reference costs n+1 memory accesses (n for the page tables, 1 for the data itself). But the TLB still works, so if TLB misses are rare (i.e. TLB hit rate 99%), we can live with it!</p>
<h3 id="process-switching-and-memory-management">Process switching and memory management</h3>
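<p>On a process switch, the mappings cached in the TLB belong to the old process and must not be used by the new one. One solution is to invalidate all TLB entries on process switch. This makes the process switch expensive!</p>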
<p>The other solution is to have the process identifier in TLB entries, so that a match is a match on both <code class="highlighter-rouge">pid</code> and <code class="highlighter-rouge">pageno</code>. The process switch is now much cheaper; although this makes the TLB more complicated and expensive, all modern machines have this feature.</p>
<h2 id="demand-paging">Demand Paging</h2>
<p>We’re now going to drop the assumption that all of a program must be in memory. Typically, part of the program is in memory, and all of it is on disk.</p>
<p>Remember: the CPU can only directly access memory. It cannot access data on disk directly, it must go through the OS to do IO.</p>
<p>But what if the program accesses a part of the program that’s only on disk? The program will be suspended, the OS is run to get the page from disk, and the program is restarted. This is called a <em>page fault</em>, and what the OS does in response is <em>page fault handling</em>.</p>
<p>To discover the page fault, we use the valid bit in the page table. Without demand paging, if it’s 1, the page is valid, otherwise it’s invalid. With demand paging:</p>
<ul>
<li>If the valid bit is 0: the page is invalid <strong>or</strong> the page is on disk</li>
<li>If the valid bit is 1: the page is valid <strong>and</strong> it’s in memory</li>
</ul>
<p><em>Sidenote</em>: there’s also an additional table to see invalid, on-disk pages.</p>
<p>Access to a page whose valid bit is 0 generates a trap. The OS allocates a free frame to the process, finds the page on disk, and gets the disk to transfer the page to this allocated frame. While the disk is busy, we can invoke the scheduler to run another process; when the disk interrupt comes in, we go back to the page fault handling, which updates the page table, sets the valid bit to 1, and gives control back to the process, restarting the previously faulting instruction, which will now find a valid bit in the page table.</p>
<h3 id="page-replacement">Page replacement</h3>
<p>If no free frame is available for the disk to fill up, we must pick a frame to be replaced, and invalidate its page table and TLB entries (writing the page back to disk first if its modified bit is set). Picking the correct page to replace is very important for performance: a normal memory access is 1ns, a page fault is 10ms. In general, we prefer replacing clean pages over dirty ones (one disk IO vs two, because we can save a writeback by picking the clean page).</p>
<p>There are many replacement policies:</p>
<ul>
<li><strong>Random</strong>: easy to implement, performs poorly</li>
<li><strong>FIFO</strong>: oldest page (brought in the earliest) is replaced. This is very easy to implement with a queue of pages, but performs pretty badly.</li>
<li><strong>OPT</strong>: replace the page that will be referenced furthest in the future. This sounds fantastic and perfect, and it is: it’s an optimal algorithm, but we can’t actually implement this, because we can’t predict the future. We just use it as a basis for comparison.</li>
<li><strong>LRU</strong>: replace least recently accessed page. This is not the same as FIFO, which looks at when a page was brought in (vs. when it was used for LRU). LRU is difficult to implement perfectly: we’d need to timestamp every memory reference, which would be too expensive. But we <em>can</em> approximate it fairly well with a single reference bit: periodically, we read out and store all reference bits (those are bits set by the hardware when a page is referenced), and reset them all to 0. We keep all the reference bits of the last few periods; the more periods kept, the better the approximation. We replace the page with the smallest value of reference bit history (i.e. the one last referenced the furthest back).</li>
<li><strong>Second-chance</strong>: also called FIFO with second-chance. When we bring in a page, we place it at the tail of the queue, as with FIFO. To replace, instead of taking the head page invariably as with FIFO, we give it a second chance if its reference bit is 1 (if it’s 0, no second chances, replace it). A second chance means that we put it at the tail of the queue, set the reference bit to 0, and look at the head of the queue to see if we should replace that or give it a second chance (this happens in a loop until we can replace). This works, as it is a combination of FIFO and the LRU approximation with one reference bit</li>
<li><strong>Clock</strong>: this is a variation of second chance. Imagine all the pages arranged around a (one-armed) clock. For the replacement, we look at the page where the hand of the clock is; if the bit is 0, we replace it, but if it is 1, we set it to 0 and move the hand of the clock to the next page. This is actually exactly the same as FIFO+second-chance; the clock points to the head of the queue. But clock is more efficiently implemented (Microsoft’s code is closed source, but rumor goes that it’s quite close to a clock).</li>
</ul>
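<p>The clock policy from the list above is short enough to sketch in full. The names <code class="highlighter-rouge">struct clock</code> and <code class="highlighter-rouge">clock_pick_victim</code> are hypothetical, and this version makes one choice the description leaves open: after selecting a victim, the hand also advances past it.</p>

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct frame { int page; bool referenced; };

struct clock {
    struct frame *frames;  /* all frames, arranged in a circle */
    size_t nframes;
    size_t hand;           /* index of the frame the hand points at */
};

/* Returns the index of the frame to replace. Terminates after at most
 * one full sweep plus one step, since every visited bit is cleared. */
size_t clock_pick_victim(struct clock *c) {
    assert(c != NULL && c->nframes > 0);
    for (;;) {
        size_t idx = c->hand;
        struct frame *f = &c->frames[idx];
        c->hand = (c->hand + 1) % c->nframes;  /* move the hand on */
        if (!f->referenced)
            return idx;             /* bit is 0: replace this page */
        f->referenced = false;      /* bit is 1: give a second chance */
    }
}
```

<p>Compared to FIFO with second chance, nothing is moved between the head and tail of a queue; only the hand index changes, which is where the efficiency gain comes from.</p>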
<p>In addition to the replacement algorithm, you can pick varying replacement scopes:</p>
<ul>
<li><strong>Local replacement</strong> replaces a page of the faulting process. You cannot affect someone else’s working set (a term we’ll define below), but it’s very inflexible for yourself, since you can’t grow (you cause yourself to thrash)</li>
<li><strong>Global replacement</strong> replaces any page (e.g. for FIFO, local replaces the oldest page of the faulting process, while global picks the oldest page overall). This can allow for denial-of-service-like attacks where one process takes over the whole memory, causing everyone else to thrash.</li>
</ul>
<p>It’s difficult to say which one is better; in a way, there’s a tradeoff to pick from. If you’re an OS writer, though, you have to favor security before performance, so you have to favor local replacement. In reality, most OSes find some middle ground: you use frame allocation periodically (see below), with local replacement in-between. You may thrash for a short amount of time, but at the next period you’ll probably get the bigger allocation, so that’s fine. You can’t cause thrashing of others, so that’s fine too.</p>
<h3 id="frame-allocation">Frame allocation</h3>
<p>There’s a close link between the degree of multiprocessing and the page fault rate. We can decide to:</p>
<ul>
<li><strong>Give each process frames for all of its pages</strong>: this implies low multiprocessing, few page faults, slow switching on I/O</li>
<li><strong>Give each process 1 frame</strong>: high multiprocessing, many page faults (<em>thrashing</em>) and quick switching on I/O.</li>
</ul>
<p>Where is the correct tradeoff between these two extremes?</p>
<p>To answer this, we need to know about the <em>working set</em> of a process: it’s the set of pages that a process will need for execution over the next execution interval. The intuition is that if the working set isn’t in memory, there will be many page faults; if it is, there will be none. There’s nothing to be gained by putting anything more than the working set in memory.</p>
<p>If you could somehow give each process its working set, we would have a good tradeoff between few page faults and moderate degree of multiprocessing. Why use the working set instead of all pages? It’s the principle of locality: at any given time, a process only accesses part of its pages (initialization, main, termination, error, …).</p>
<p>The frame allocation policy must then be able to give each process enough frames to maintain its working set in memory. If there’s no space in memory for all working sets, we need to swap out one or more processes; if there’s space, we can swap one in.</p>
<p>The obvious question now is how we can predict a working set: in practice, what people do is to assume that it’ll be the same as before. The working set for the next 10k references will be the same as for the last 10k references. This prediction isn’t perfect, because a program may change state (e.g. going from initialization to main); this will temporarily cause a high page fault rate, but then it’ll work well again.</p>
<p>So how do we measure the past working set? Well, we don’t really need the working set, we only need its size. Periodically, we count the reference bits set to 1, and then reset them all to 0. The working set size is thus the number of reference bits set to 1.</p>
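<p>The periodic sampling described above can be sketched as follows. This is only an illustration: <code class="highlighter-rouge">PageTableEntry</code> and <code class="highlighter-rouge">sample_working_set_size</code> are hypothetical names, and the reference bit that real hardware sets on each access is simulated by hand.</p>

```python
from dataclasses import dataclass


@dataclass
class PageTableEntry:
    frame: int
    referenced: bool = False  # set by the (simulated) hardware on each access


def sample_working_set_size(page_table):
    """Count the reference bits set to 1, then reset them all to 0."""
    size = sum(1 for entry in page_table.values() if entry.referenced)
    for entry in page_table.values():
        entry.referenced = False
    return size


# Hypothetical process with 8 resident pages.
page_table = {page: PageTableEntry(frame=page) for page in range(8)}

# During one sampling interval, the process touches pages 1, 2 and 5
# (page 2 twice, which still only sets one bit).
for page in (1, 2, 5, 2):
    page_table[page].referenced = True

estimate = sample_working_set_size(page_table)       # 3 distinct pages
next_estimate = sample_working_set_size(page_table)  # 0: bits were reset
```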
<p><em>N.B.</em>: Frame allocation is done periodically, while page replacement is done at page fault time.</p>
<h3 id="optimizations-1">Optimizations</h3>
<h4 id="prepaging">Prepaging</h4>
<p>So far we’ve been paging in one page at a time. Prepaging takes multiple pages at a time. This works because of locality of virtual memory access: nearby pages are often accessed soon after. This avoids future page faults, process switches, and also allows for better disk performance (more on that later), but may bring in unnecessary pages.</p>
<h4 id="cleaning">Cleaning</h4>
<p>So far, we’ve preferred to replace clean pages. Cleaning is about writing out dirty pages when the disk is idle, so that we have more clean pages at replacement time. The drawback is that the page may be modified again, which may create useless disk traffic. Still, cleaning is usually a gain, so pretty much all systems do it periodically.</p>
<h4 id="free-frame-pool">Free frame pool</h4>
<p>So far, we used all possible frames for pages. With a free frame pool, we keep some frames unused. This makes page fault handling quick, because the replacement work is moved outside of page fault time. The drawback is that this reduces effective main memory size.</p>
<h4 id="copy-on-write">Copy-on-write</h4>
<p>This is a clever trick to share pages between the processes that:</p>
<ol>
<li>Are initially the same</li>
<li>Are likely to be read-only (remain the same)</li>
<li>But may be modified in the future (although unlikely)</li>
</ol>
<p>To understand how this works, let’s remind ourselves how read-only sharing works: we make the page table entries point to the same frame, and set the read-only bit so that it generates a trap if the process writes; the trap is treated as an illegal memory access.</p>
<p>We can’t do that in our case, because we will allow the unlikely write. Instead, when we receive the trap, we create a separate frame for the faulting process, insert the <code class="highlighter-rouge">(pageno, frameno)</code> in the page table, set the read-only bit to off (allowing read-write) and copy that page into that frame: further accesses won’t page fault.</p>
<p>This works well if the page is rarely written, and allows us to save <code class="highlighter-rouge">sharers - 1</code> frames. Obviously, it works poorly if the page is written often, as it causes more page faults without saving any frames. In practice though, this works very well and all OSes provide it. Linux implements <code class="highlighter-rouge">fork()</code> this way: we don’t copy the page straight away, we just wait for the first modification to copy the page.</p>
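<p>As a rough illustration, here is a toy simulation of the copy-on-write trap handling. All names (<code class="highlighter-rouge">Frame</code>, <code class="highlighter-rouge">Process</code>, <code class="highlighter-rouge">share_copy_on_write</code>, <code class="highlighter-rouge">write</code>) are made up for this sketch, and the "trap" is just a branch in Python. It also assumes one small refinement not described above: the last remaining sharer keeps the original frame instead of copying it.</p>

```python
class Frame:
    def __init__(self, data):
        self.data = bytearray(data)
        self.sharers = 1  # number of page table entries pointing here


class Process:
    def __init__(self):
        self.page_table = {}  # page number -> [frame, writable]


def share_copy_on_write(parent, child, pageno):
    """Make child share parent's frame; both mappings become read-only."""
    frame = parent.page_table[pageno][0]
    frame.sharers += 1
    parent.page_table[pageno] = [frame, False]
    child.page_table[pageno] = [frame, False]


def write(process, pageno, offset, value):
    frame, writable = process.page_table[pageno]
    if not writable:
        # Simulated trap: the page is marked read-only.
        if frame.sharers > 1:
            # Give the faulting process a private copy of the page.
            frame.sharers -= 1
            frame = Frame(bytes(frame.data))
        # Turn the read-only bit off: further writes will not trap.
        process.page_table[pageno] = [frame, True]
    frame.data[offset] = value


parent, child = Process(), Process()
parent.page_table[0] = [Frame(b"hello"), True]
share_copy_on_write(parent, child, 0)
shared_before = parent.page_table[0][0] is child.page_table[0][0]

write(child, 0, 0, ord("j"))   # traps, copies, then writes
write(parent, 0, 0, ord("H"))  # traps, but no copy: parent is the last sharer

parent_data = bytes(parent.page_table[0][0].data)
child_data = bytes(child.page_table[0][0].data)
```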
<h2 id="file-systems">File Systems</h2>
<h3 id="what-is-a-file-system">What is a file system?</h3>
<p>The file system provides permanent storage. How permanent is permanent? For this course, it’ll just mean across machine failures and restarts (this is a stronger guarantee than across program invocations, but less strong than across disk failures or data center failures).</p>
<p>Main memory is obviously not suitable for <em>permanent</em> storage. In this course, we’ll only talk about disks and flash (although other technologies exist, such as tape, or battery-backed memory).</p>
<p>There’s no universally accepted definition of what a file is, but we’ll say the following: it’s an un-interpreted collection of objects surviving machine restarts. Un-interpreted means that the file system doesn’t know what the data means, only the application does. These objects can be different things for different OSes; in Linux, they are simple bytes (this is more common, and we’ll only talk about this), but other OSes see them as records.</p>
<p>Should a file be typed or untyped? Typed means that the file system (FS) knows what the object means. This is useful, because it allows us to invoke certain programs by default, prevent errors, and perhaps even store the file more efficiently on disk. However, it can be inflexible, as it may require typecasts, and for the OS, it would involve a lot of code for each type.</p>
<p>We will look at untyped files, which is how Linux works (but Windows uses a typed system). Note that filename extensions (<code class="highlighter-rouge">.sh</code>, <code class="highlighter-rouge">.html</code>, …) aren’t types: in Linux, they are pure convention. The user knows, but the system doesn’t do anything with it. In Windows, the user knows, and the system knows and enforces the type.</p>
<h3 id="interface">Interface</h3>
<h4 id="access-primitives">Access primitives</h4>
<p>The main access primitives to the file system are:</p>
<ul>
<li><code class="highlighter-rouge">uid = create(...)</code>: creates a new file and returns a <code class="highlighter-rouge">uid</code>, a non human-readable unique identifier</li>
<li><code class="highlighter-rouge">delete(uid)</code>: deletes the file with the given <code class="highlighter-rouge">uid</code></li>
<li><code class="highlighter-rouge">read(uid, buffer, from, to)</code>: read from file with <code class="highlighter-rouge">uid</code>, from byte <code class="highlighter-rouge">from</code> to byte <code class="highlighter-rouge">to</code> (this may cause an EOF condition) into a memory buffer (previously allocated, must be of sufficient size)</li>
<li><code class="highlighter-rouge">write(uid, buffer, from, to)</code>: write to file with <code class="highlighter-rouge">uid</code> from byte <code class="highlighter-rouge">from</code> to byte <code class="highlighter-rouge">to</code> from a memory buffer</li>
</ul>
<p>The above <code class="highlighter-rouge">read</code> and <code class="highlighter-rouge">write</code> are <strong>random access</strong> primitives: there’s no connection between two successive accesses. This is opposed to <strong>sequential access</strong>, which reads from where we stopped reading last time, and writes to where we stopped writing last time.</p>
<p>In a sequential read, the file system keeps a file pointer <code class="highlighter-rouge">fp</code> (initially 0), and to execute <code class="highlighter-rouge">read(uid, buffer, bytes)</code>, reads <code class="highlighter-rouge">bytes</code> bytes into <code class="highlighter-rouge">buffer</code> and increments <code class="highlighter-rouge">fp</code> by <code class="highlighter-rouge">bytes</code>.</p>
<p>Note that sequential access could be built on top of random access; all it takes is maintaining the file pointer. Sequential access is essentially a convenience, where the FS manages the file pointer. We can also build random access on top of sequential access thanks to the <code class="highlighter-rouge">seek(uid, from)</code> primitive, which allows us to move to a certain location in the file.</p>
<p>Since sequential access is more common, and managing the file pointer manually leads to many errors, Linux offers sequential access with the addition of the <code class="highlighter-rouge">seek</code> primitive to enable random access.</p>
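<p>To make the layering concrete, here is a minimal sketch of sequential access built on top of random access. The class names are hypothetical, an in-memory byte string stands in for the file, and the range is treated as half-open (<code class="highlighter-rouge">start</code> included, <code class="highlighter-rouge">end</code> excluded), which is an assumption of this sketch.</p>

```python
class RandomAccessFile:
    """Stand-in for a file that only offers random access."""

    def __init__(self, content):
        self.content = content

    def read(self, start, end):
        # Returns bytes [start, end); the result is shorter at EOF.
        return self.content[start:end]


class SequentialFile:
    """Sequential access: the file pointer is managed for the caller."""

    def __init__(self, file):
        self.file = file
        self.fp = 0  # file pointer, initially 0

    def read(self, nbytes):
        data = self.file.read(self.fp, self.fp + nbytes)
        self.fp += len(data)  # next read continues where this one stopped
        return data

    def seek(self, position):
        # Moving the file pointer gives random access back to the caller.
        self.fp = position


f = SequentialFile(RandomAccessFile(b"hello world"))
first = f.read(5)    # b"hello"
second = f.read(1)   # b" "
f.seek(0)
again = f.read(5)    # b"hello" once more
f.seek(6)
rest = f.read(100)   # b"world", cut short by EOF
```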
<h4 id="concurrency-primitives">Concurrency primitives</h4>
<p>If two processes access the same file, what happens to the file pointer <code class="highlighter-rouge">fp</code>? It would get overwritten. To solve this problem, we have the notion of an open file, with the following primitives:</p>
<ul>
<li><code class="highlighter-rouge">tid = open(uid, ...)</code>: creates an instance of file with <code class="highlighter-rouge">uid</code>, accessible by this process only, with the temporary, process-unique id <code class="highlighter-rouge">tid</code>. The file pointer <code class="highlighter-rouge">fp</code> is associated with the <code class="highlighter-rouge">tid</code>, not the <code class="highlighter-rouge">uid</code> (in other words, it is private to the process)</li>
<li><code class="highlighter-rouge">close(tid)</code>: destroys the instance of the file</li>
</ul>
<p>There are different ways of implementing this concurrency, and we can make the changes visible at different moments: on <code class="highlighter-rouge">write</code> (immediately visible), on <code class="highlighter-rouge">close</code> (separate instances until close), or never (separate instances altogether). In Linux, writes are immediately visible (this is much easier to implement).</p>
<h4 id="naming-primitives">Naming primitives</h4>
<p>A naming is a mapping from a human-readable string to a <code class="highlighter-rouge">uid</code>. A directory is a collection of such mappings.</p>
<p>There are a bunch of naming primitives, which we won’t go into in detail, but here are some:</p>
<ul>
<li><code class="highlighter-rouge">insert(string, uid)</code></li>
<li><code class="highlighter-rouge">uid = lookup(string)</code></li>
<li><code class="highlighter-rouge">remove(string, uid)</code></li>
</ul>
<p>There are also directory primitives:</p>
<ul>
<li><code class="highlighter-rouge">createDirectory</code>: equivalent to <code class="highlighter-rouge">mkdir</code></li>
<li><code class="highlighter-rouge">deleteDirectory</code>: equivalent to <code class="highlighter-rouge">rm -rf</code></li>
<li><code class="highlighter-rouge">setWorkingDirectory</code>: equivalent to <code class="highlighter-rouge">cd</code></li>
<li><code class="highlighter-rouge">string = listWorkingDirectory</code>: equivalent to <code class="highlighter-rouge">ls</code></li>
<li><code class="highlighter-rouge">list(directory)</code>: equivalent to <code class="highlighter-rouge">ls ./dir</code></li>
</ul>
<p>The directory structure is hierarchical; in Linux, it’s actually not a tree but an acyclic graph, as Linux allows sharing a single <code class="highlighter-rouge">uid</code> under different names. This is called a <strong>hard link</strong>: assuming the mapping <code class="highlighter-rouge">(string1, uid)</code> already exists, we can create a new mapping <code class="highlighter-rouge">(string2, uid)</code>. This is opposed to soft links, where we would add a <code class="highlighter-rouge">(string2, string1)</code> mapping. The difference shows when we remove the first mapping: with hard linking, the second mapping remains valid, while with soft linking it becomes a dangling reference.</p>
<p>To keep the graph acyclic, Linux prevents hard links from making cycles by disallowing hard links to directories (only allowing links to files, i.e. leaves in the graph).</p>
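<p>The difference between the two kinds of links can be observed from Python with the standard <code class="highlighter-rouge">os</code> module. This sketch assumes a Linux file system that supports both hard and symbolic links; the file names are arbitrary.</p>

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as directory:
    original = os.path.join(directory, "original.txt")
    hard = os.path.join(directory, "hard.txt")
    soft = os.path.join(directory, "soft.txt")

    with open(original, "w") as f:
        f.write("some data")

    os.link(original, hard)     # hard link: a second name for the same inode
    os.symlink(original, soft)  # soft link: a name that maps to another name

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    links_before = os.stat(original).st_nlink  # names pointing at the inode

    os.remove(original)  # remove the first mapping

    with open(hard) as f:
        hard_content = f.read()                # the data is still reachable
    soft_entry_exists = os.path.lexists(soft)  # the link itself remains...
    soft_target_exists = os.path.exists(soft)  # ...but it now dangles
```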
<h3 id="disks">Disks</h3>
<h4 id="how-does-a-disk-work">How does a disk work?</h4>
<p><img src="/images/os/disk.png" alt="Anatomy of a disk" /></p>
<p>Information on a disk is magnetically recorded. The platters turn on the spindle (on the order of 15,000 RPM). The arms do not move independently, they all move together. A platter has many concentric circles called tracks, on which the basic unit is a sector: on a disk, you cannot read or write a single byte, you can only work with a full sector at a time (which is usually around 512B).</p>
<h4 id="interface-1">Interface</h4>
<p>The disk interface is thus:</p>
<ul>
<li><code class="highlighter-rouge">readSector(logical_sector_number, buffer)</code></li>
<li><code class="highlighter-rouge">writeSector(logical_sector_number, buffer)</code></li>
</ul>
<p>The logical sector number is a function of the platter, cylinder or track, and sector. The main task of the file system is to translate from user interface methods to disk interface methods (e.g. from <code class="highlighter-rouge">read</code> to <code class="highlighter-rouge">readSector</code>).</p>
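<p>One common convention for this function is classic cylinder-head-sector (CHS) addressing, where sectors within a track are numbered from 1. The sketch below uses that convention with a made-up geometry; modern drives hide their real geometry, so this is purely illustrative.</p>

```python
def chs_to_logical(cylinder, head, sector, heads, sectors_per_track):
    """Classic CHS to logical sector number (sector is 1-based)."""
    return (cylinder * heads + head) * sectors_per_track + (sector - 1)


def logical_to_chs(lsn, heads, sectors_per_track):
    """Inverse mapping: logical sector number back to (cylinder, head, sector)."""
    cylinder, rest = divmod(lsn, heads * sectors_per_track)
    head, sector_index = divmod(rest, sectors_per_track)
    return cylinder, head, sector_index + 1


# Hypothetical geometry: 4 heads (platter surfaces), 63 sectors per track.
HEADS, SECTORS_PER_TRACK = 4, 63

first = chs_to_logical(0, 0, 1, HEADS, SECTORS_PER_TRACK)       # 0
next_track = chs_to_logical(0, 1, 1, HEADS, SECTORS_PER_TRACK)  # 63
lsn = chs_to_logical(2, 1, 5, HEADS, SECTORS_PER_TRACK)         # 571
round_trip = logical_to_chs(lsn, HEADS, SECTORS_PER_TRACK)      # (2, 1, 5)
```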
<h4 id="performance">Performance</h4>
<p>To access the disk, we need to do:</p>
<ul>
<li><strong>Head selection</strong>: select platter. This is very cheap (ns), and it’s just an electronic switch (multiplexer)</li>
<li><strong>Seek</strong>: move arm over cylinder. This is very expensive (few ms), as it is a mechanical motion. The cost of a seek is approximately linear in the number of cylinders that you have to move over.</li>
<li><strong>Rotational latency</strong>: move head over sector. This is also expensive (15000 RPM, half a revolution to do on average, so around 2ms)</li>
<li><strong>Transfer time</strong>: disk throughput is in the hundreds of MB/s (up to around 1GB/s), so transferring a sector is cheap (microseconds)</li>
</ul>
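<p>As a back-of-the-envelope check of these orders of magnitude, the following sketch adds up the three non-negligible costs (head selection is ignored). The 4ms seek time and the 200MB/s throughput are assumed values chosen for illustration, not measurements.</p>

```python
def rotational_latency_ms(rpm):
    """Average rotational latency: half a revolution."""
    ms_per_revolution = 60_000 / rpm
    return ms_per_revolution / 2


def transfer_time_ms(nbytes, throughput_bytes_per_s):
    return nbytes / throughput_bytes_per_s * 1000


def access_time_ms(seek_ms, rpm, nbytes, throughput_bytes_per_s):
    return (seek_ms
            + rotational_latency_ms(rpm)
            + transfer_time_ms(nbytes, throughput_bytes_per_s))


latency = rotational_latency_ms(15_000)         # 2.0 ms
transfer = transfer_time_ms(4096, 200_000_000)  # about 0.02 ms for 4 KiB
total = access_time_ms(4.0, 15_000, 4096, 200_000_000)
```

<p>With these numbers, the mechanical parts (seek and rotation) account for nearly all of the roughly 6ms total, which is why the rules below focus on avoiding them.</p>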
<h5 id="caching">Caching</h5>
<p>The above is slow, so <em>rule number one</em> of optimizing disk access is to not access the disk: as much as possible, use main memory as a cache. To do this, we reserve kernel memory for a cache and keep recently accessed blocks in it, which reduces latency and disk load; cache entries are file blocks. There are two strategies associated with this:</p>
<ul>
<li><strong>Write-through</strong>: write to cache, write to disk, return to user</li>
<li><strong>Write-back</strong>: write to cache, return to user, and later, write to disk</li>
</ul>
<p>Write-back (also called write-behind) has a much better response time (microseconds vs ms), but in case of a crash, there’s a window of vulnerability that write-through doesn’t have. Still, because of the performance impact, Linux chose write-back with a periodic cache flush, and provides a primitive to flush data.</p>
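<p>The two strategies can be contrasted with a toy model. The classes below are hypothetical and keep the "disk" in a dictionary; they only illustrate when the disk write happens relative to returning to the user.</p>

```python
class Disk:
    def __init__(self):
        self.blocks = {}
        self.writes = 0  # number of (slow) disk writes performed

    def write_sector(self, address, data):
        self.blocks[address] = data
        self.writes += 1


class WriteThroughCache:
    def __init__(self, disk):
        self.disk = disk
        self.blocks = {}

    def write(self, address, data):
        self.blocks[address] = data             # write to cache
        self.disk.write_sector(address, data)   # write to disk, then return


class WriteBackCache:
    def __init__(self, disk):
        self.disk = disk
        self.blocks = {}
        self.dirty = set()

    def write(self, address, data):
        self.blocks[address] = data  # write to cache
        self.dirty.add(address)      # return immediately; disk write comes later

    def flush(self):
        for address in sorted(self.dirty):
            self.disk.write_sector(address, self.blocks[address])
        self.dirty.clear()


wt_disk, wb_disk = Disk(), Disk()
wt, wb = WriteThroughCache(wt_disk), WriteBackCache(wb_disk)
for version in (b"v1", b"v2", b"v3"):
    wt.write(5, version)
    wb.write(5, version)

writes_before_flush = wb_disk.writes   # 0: a crash now would lose the data
on_disk_before_flush = 5 in wb_disk.blocks
wb.flush()
```

<p>Here write-through performs three disk writes, while write-back performs a single one at flush time, at the cost of the vulnerability window before the flush.</p>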
<h5 id="read-ahead">Read-ahead</h5>
<p>Still, if you have to use disk, <em>rule number two</em> is don’t wait for the disk. Prefetching allows us to put more blocks in the buffer cache than were requested, so that we can avoid disk IO for the next block; this is especially useful for sequential access (which most accesses are). Note that this does not reduce the number of disk IOs, and could actually increase them (for non-sequential access). Still, in practice it’s usually a win, and in Linux, this is the default: it always reads a block ahead.</p>
<p><code class="highlighter-rouge">posix_fadvise(2)</code> allows an application to tell the kernel how it expects to use a file, so that the kernel can choose an appropriate read-ahead and caching strategy. Among others, the following values can be supplied as the advice parameter:</p>
<ul>
<li><code class="highlighter-rouge">POSIX_FADV_RANDOM</code>: expect accesses in random order (e.g. text editor)</li>
<li><code class="highlighter-rouge">POSIX_FADV_SEQUENTIAL</code>: expect accesses in sequential order (e.g. media player)</li>
<li><code class="highlighter-rouge">POSIX_FADV_WILLNEED</code>: expect accesses in the near future (e.g. database index)</li>
<li><code class="highlighter-rouge">POSIX_FADV_DONTNEED</code>: do not expect accesses in the near future (e.g. backup application)</li>
</ul>
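<p>On Linux, Python exposes this call as <code class="highlighter-rouge">os.posix_fadvise</code>. The sketch below advises the kernel that a temporary file will be read sequentially; the file size and chunk size are arbitrary, and the advice is only a hint, so the program behaves the same whether or not the kernel acts on it.</p>

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"x" * 1_000_000)
    tmp.flush()

    fd = os.open(tmp.name, os.O_RDONLY)
    try:
        # Offset 0 with length 0 covers the file from the start to its end.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_SEQUENTIAL)

        total = 0
        while chunk := os.read(fd, 64 * 1024):
            total += len(chunk)
    finally:
        os.close(fd)
```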
<p>Since Linux 2.4.10, the <code class="highlighter-rouge">open(2)</code> syscall accepts an <code class="highlighter-rouge">O_DIRECT</code> flag (<a href="http://yarchive.net/comp/linux/o_direct.html">a controversial addition</a>). When Direct I/O is enabled for a file, data is transferred directly between the disk and user-space buffers, bypassing the file buffer cache; read-ahead and caching are disabled. This is useful for databases, which probably want to manage their own caches.</p>
<h5 id="disk-scheduling">Disk scheduling</h5>
<p>Now, for times that we are actually reading, we can implement <em>rule number three</em> which is to minimize seeks by doing clever disk scheduling. There are different disk scheduling policies:</p>
<ul>
<li><strong>FCFS first-come-first-served</strong>: serve requests in the order that they were requested</li>
<li><strong>SSTF shortest-seek-time-first</strong>: pick “nearest” request in queue; note that this can lead to starvation (far-away request may never be served), but leads to very good seek times</li>
<li><strong>SCAN</strong>: continue moving the head in one direction, and pick up requests as we’re moving. Also called the elevator algorithm, as the head goes up and down.</li>
<li><strong>C-SCAN</strong>: similar to SCAN, but the elevator only goes one way. From cylinders <code class="highlighter-rouge">0</code> to <code class="highlighter-rouge">MAX_CYL</code>, we pick up requests, and then going down from <code class="highlighter-rouge">MAX_CYL</code> to <code class="highlighter-rouge">0</code> we don’t serve any. This leads to more uniform wait times.</li>
<li><strong>C-LOOK</strong>: similar to C-SCAN, but instead of moving the head over the whole cylinder range, we only move it between the minimum and maximum in the queue</li>
</ul>
<p>In practice, a variation of C-LOOK is most common.</p>
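<p>To compare some of these policies, the sketch below computes the service order and the total head movement (in cylinders) for FCFS, SSTF and C-LOOK. The function names and the request queue are made up for illustration; a real scheduler also has to deal with requests arriving while the head moves.</p>

```python
def fcfs(head, requests):
    """Serve requests in arrival order."""
    return list(requests)


def sstf(head, requests):
    """Repeatedly serve the request nearest to the current head position."""
    pending = list(requests)
    order = []
    while pending:
        nearest = min(pending, key=lambda cylinder: abs(cylinder - head))
        pending.remove(nearest)
        order.append(nearest)
        head = nearest
    return order


def c_look(head, requests):
    """Sweep upwards from the head, then jump back to the lowest request."""
    ahead = sorted(c for c in requests if c >= head)
    behind = sorted(c for c in requests if c < head)
    return ahead + behind


def total_seek_distance(head, order):
    distance = 0
    for cylinder in order:
        distance += abs(cylinder - head)
        head = cylinder
    return distance


# Hypothetical queue of cylinder numbers, head initially at cylinder 53.
head = 53
queue = [98, 183, 37, 122, 14, 124, 65, 67]

distances = {
    policy.__name__: total_seek_distance(head, policy(head, queue))
    for policy in (fcfs, sstf, c_look)
}
# distances == {"fcfs": 640, "sstf": 236, "c_look": 322}
```

<p>On this queue SSTF moves the head the least, but nothing prevents it from postponing the far-away requests indefinitely if new nearby requests keep arriving; C-LOOK gives up some seek distance in exchange for bounded, more uniform waiting.</p>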
<h5 id="disk-allocation">Disk allocation</h5>
<p>Finally, <em>rule number four</em> is to minimize rotational latency by doing clever disk allocation. The idea here is to locate consecutive blocks of a file on consecutive sectors in a cylinder.</p>
<p>When we’re under low load, it works best to do disk allocation; under high load, disk scheduling works better. This is because under high load, we get many good scheduling opportunities, while allocation typically gets defeated by interleaved access patterns for different files. Under low load, we have few requests in the queue and thus few scheduling opportunities; we also tend to see more sequential access, so it makes sense to optimize for that.</p>
<h3 id="os-file-system-implementation">OS File System Implementation</h3>
<h4 id="introduction">Introduction</h4>
<p>The file system has two main functionalities: the naming/directory system, and the storage (on-disk and in-memory data structures). In this class, we’ll mainly focus on storage, because once we have this, directories can be trivially implemented by using a file to store directories.</p>
<p>As we’ve said before, the main task of the file system is to translate from user interface methods (<code class="highlighter-rouge">read(uid, buffer, bytes)</code>) to disk interface methods (<code class="highlighter-rouge">readSector(logical_sector_number, buffer)</code>). To study this, we’ll introduce two small simplifications. First, we’ll simplify <code class="highlighter-rouge">read</code>, which normally allows reading an arbitrary number of bytes, to only allow reading a block at a time. A block has a fixed size; typically, a block is (sector size) × 2<sup>n</sup> (for instance, 4KB block size and 512B sector size). Second, for simplicity, we’ll just assume that a block is the same size as a sector.</p>
<p>A terminology note: the word pointer is often used, and it means a disk address (i.e. <code class="highlighter-rouge">logical_sector_number</code>), not a memory address.</p>
<h4 id="disk-structure">Disk structure</h4>
<p>The disk consists of a few data structures:</p>
<ul>
<li><strong>Boot block</strong>: it’s at a fixed location on disk (usually sector 0); it contains the boot loader, which is read on machine boot.</li>
<li><strong>Device directory</strong>: a fixed, reserved area on disk. It’s an array of records (called “device directory entries”, DDE). It’s indexed by <code class="highlighter-rouge">uid</code>, and each DDE record contains information about the file with that <code class="highlighter-rouge">uid</code> (in-use bit, reference count, size, access rights, disk addresses).</li>
<li><strong>User data</strong>: can be allocated in different ways, as we’ll see below</li>
<li><strong>Free space</strong></li>
</ul>
<h4 id="data-allocation">Data allocation</h4>
<p>User data can be allocated in a number of ways:</p>
<ul>
<li><strong>Contiguous</strong>: the data blocks of a file are contiguous on disk. With this, we only need one pointer per device directory entry, as well as the length of the file. This creates disk fragmentation (many unusable holes) and is generally quite impractical
<figure>
<img src="/images/os/contiguous-alloc.png" alt="Contiguous allocation schema" />
<figcaption>The DDE only contains a single pointer to the start of the contiguous data blocks</figcaption>
</figure>
</li>
<li><strong>Linked list</strong>: each data block contains a pointer to the next. We only need one pointer in the DDE (2 if we store the tail), to the first data block. But this is inefficient, especially for random access, because it may require reading many blocks before getting to the right one; additionally, it’s not great because we have a pointer taking up space in the block (and now usable block sizes aren’t a power of 2 anymore)
<figure>
<img src="/images/os/linked-list-alloc.png" alt="Linked list allocation schema" />
<figcaption>The DDE contains a pointer to the first data block, and every data block has a pointer to the next one</figcaption>
</figure>
</li>
<li><strong>Indexed</strong>: N pointers in DDE, pointing to data blocks. The problem here is size: you can’t represent more than N blocks.
<figure>
<img src="/images/os/indexed-alloc.png" alt="Indexed allocation schema" />
<figcaption>The DDE can only point to up to N data blocks</figcaption>
</figure>
</li>
<li><strong>Indexed with indirect blocks</strong>: we keep the idea of N pointers in the DDE, but with a twist: the first M pointers in the DDE point to data blocks, while the last N-M pointers point to indirect blocks, which in turn contain pointers to data blocks (or to further indirect blocks). This is optimized for small files (they can be fully represented by the direct pointers), but also works well for large files (they can be represented in deeper block levels). This can be seen as an unbalanced tree (as opposed to a full tree).
<figure>
<img src="/images/os/indirect-indexed-alloc.png" alt="Indirect indexing allocation schema" />
<figcaption>This example shows direct indexing (left), first-level indirection (middle) and second-level indirection (right)</figcaption>
</figure>
</li>
<li><strong>Extent based</strong>: the DDE contains a pointer as before, but we also store the length of the extent (a small integer). In other words, in this model, the DDE points to a sequence of disk blocks. This does even better than the above, offers good sequential <em>and</em> random access, and can be combined with indirect blocks. This is common practice in Linux.
<figure>
<img src="/images/os/extent-alloc.png" alt="Extent based allocation schema" />
<figcaption>The DDE contains pointers to contiguous blocks, as well as the length of these contiguous sections</figcaption>
</figure>
</li>
</ul>
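<p>To get a feel for how far indirect blocks stretch, the sketch below computes the capacity of one possible layout and locates a file block within it. The layout (12 direct pointers, one single-indirect and one double-indirect pointer, 4KB blocks, 4-byte pointers) is an assumption made for this example, not something prescribed by the text above.</p>

```python
BLOCK_SIZE = 4096     # bytes per block (assumed)
POINTER_SIZE = 4      # bytes per disk address (assumed)
DIRECT_POINTERS = 12  # the M direct pointers in the DDE (assumed)

POINTERS_PER_BLOCK = BLOCK_SIZE // POINTER_SIZE  # 1024


def max_file_blocks():
    """Direct blocks + one single-indirect level + one double-indirect level."""
    return DIRECT_POINTERS + POINTERS_PER_BLOCK + POINTERS_PER_BLOCK ** 2


def locate(block_number):
    """Say which pointer chain leads to the given block of a file."""
    if block_number < DIRECT_POINTERS:
        return ("direct", block_number)
    block_number -= DIRECT_POINTERS
    if block_number < POINTERS_PER_BLOCK:
        return ("single", block_number)
    block_number -= POINTERS_PER_BLOCK
    if block_number < POINTERS_PER_BLOCK ** 2:
        # Index in the double-indirect block, then in the indirect block.
        return ("double",
                block_number // POINTERS_PER_BLOCK,
                block_number % POINTERS_PER_BLOCK)
    raise ValueError("file too large for this layout")


max_bytes = max_file_blocks() * BLOCK_SIZE  # 4299210752 bytes, about 4 GiB
small = locate(5)        # ("direct", 5): no extra disk read needed
medium = locate(12)      # ("single", 0): one indirect block to read first
large = locate(1036)     # ("double", 0, 0): two indirect blocks to read first
```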
<p>One final word: we also need to keep track of free space. This can be done with a linked list or a bitmap, which is an <code class="highlighter-rouge">array[#sectors]</code>, with single-bit entries representing whether the sector is free or in-use.</p>
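<p>A free-space bitmap is simple enough to sketch in a few lines. The class below is hypothetical; it packs one bit per sector into a <code class="highlighter-rouge">bytearray</code>, with 1 meaning in-use and 0 meaning free, and uses a naive first-fit search.</p>

```python
class FreeSpaceBitmap:
    def __init__(self, n_sectors):
        self.n_sectors = n_sectors
        self.bits = bytearray((n_sectors + 7) // 8)  # all sectors start free

    def is_free(self, sector):
        return ((self.bits[sector // 8] >> (sector % 8)) & 1) == 0

    def mark_used(self, sector):
        self.bits[sector // 8] |= 1 << (sector % 8)

    def mark_free(self, sector):
        self.bits[sector // 8] &= ~(1 << (sector % 8)) & 0xFF

    def allocate(self):
        """Return the first free sector and mark it in-use (first fit)."""
        for sector in range(self.n_sectors):
            if self.is_free(sector):
                self.mark_used(sector)
                return sector
        raise OSError("no free sector left")


bitmap = FreeSpaceBitmap(16)
a = bitmap.allocate()  # 0
b = bitmap.allocate()  # 1
c = bitmap.allocate()  # 2
bitmap.mark_free(b)    # sector 1 becomes free again
d = bitmap.allocate()  # 1: the freed sector is reused
```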
<h4 id="in-memory-data-structures">In-memory data structures</h4>
<ul>
<li>
<p><strong>Cache</strong>: a fixed, contiguous area of kernel memory is used as a cache. Its size is m×n, where m is the max number of cache blocks, and n is the block size. This usually makes up a large chunk of memory.</p>
</li>
<li>
<p><strong>Cache directory</strong>: usually a hash table, where the index is given by <code class="highlighter-rouge">hash(disk address)</code>, with an overflow list for collisions. Each entry usually has a dirty bit.</p>
</li>
<li>
<p><strong>Queue of pending user requests</strong></p>
</li>
<li>
<p><strong>Queue of pending disk requests</strong></p>
</li>
<li>
<p><strong>Active file tables</strong>: an array for the whole file system, with an entry per <em>open file</em>. Each entry contains the device directory entry of the file, and additional information.</p>
</li>
<li>
<p><strong>Open file tables</strong>: an array per process, with an entry per <em>open file of that process</em>. It’s indexed by the file descriptor <code class="highlighter-rouge">fd</code>. Each entry contains a pointer to the entry in the active file table, the file pointer <code class="highlighter-rouge">fp</code>, and additional information.</p>
<p><img src="/images/os/open-file-table.png" alt="Open file table" /></p>
</li>
</ul>
<p>An inode is a data structure on a filesystem on Linux and other Unix-like operating systems that stores all the information about a file except its name and its actual data (<a href="http://www.linfo.org/inode.html">source</a>).</p>
<h4 id="pseudo-code">Pseudo-code</h4>
<p>Putting it all together, the pseudo-code of the file system primitives looks like this (with some major simplifications, since we omit permission checks, return value checks, etc.):</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">create</span><span class="p">():</span>
    <span class="n">find</span> <span class="n">a</span> <span class="n">free</span> <span class="n">uid</span> <span class="p">(</span><span class="k">with</span> <span class="n">refcount</span> <span class="o">=</span> <span class="mi">0</span><span class="p">)</span> <span class="c1"># device directory is cached; this is usually easy
</span>    <span class="nb">set</span> <span class="n">refcount</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="n">fill</span> <span class="ow">in</span> <span class="n">additional</span> <span class="n">info</span>
    <span class="n">write</span> <span class="n">back</span> <span class="n">to</span> <span class="n">cache</span> <span class="p">(</span><span class="ow">and</span> <span class="n">eventually</span> <span class="n">to</span> <span class="n">disk</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">uid</span>
<span class="k">def</span> <span class="nf">delete</span><span class="p">(</span><span class="n">uid</span><span class="p">):</span>
    <span class="n">find</span> <span class="n">inode</span>
    <span class="n">refcount</span><span class="o">--</span>
    <span class="k">if</span> <span class="n">refcount</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
        <span class="n">free</span> <span class="nb">all</span> <span class="n">data</span> <span class="n">blocks</span> <span class="ow">and</span> <span class="n">indirect</span> <span class="n">blocks</span>
        <span class="nb">set</span> <span class="n">entries</span> <span class="ow">in</span> <span class="n">free</span> <span class="n">space</span> <span class="n">bitmap</span> <span class="n">to</span> <span class="mi">0</span>
    <span class="n">write</span> <span class="n">back</span> <span class="n">to</span> <span class="n">cache</span> <span class="p">(</span><span class="ow">and</span> <span class="n">eventually</span> <span class="n">to</span> <span class="n">disk</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">open</span><span class="p">(</span><span class="n">uid</span><span class="p">):</span>
    <span class="n">check</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span> <span class="k">if</span> <span class="n">uid</span> <span class="n">already</span> <span class="nb">open</span>
    <span class="k">if</span> <span class="n">so</span><span class="p">:</span>
        <span class="n">refcount</span><span class="o">++</span> <span class="c1"># in active file table
</span>        <span class="n">allocate</span> <span class="n">entry</span> <span class="ow">in</span> <span class="nb">open</span> <span class="nb">file</span> <span class="n">table</span>
        <span class="n">point</span> <span class="n">to</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">find</span> <span class="n">free</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span>
        <span class="n">read</span> <span class="n">inode</span> <span class="ow">and</span> <span class="n">copy</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span> <span class="n">entry</span>
        <span class="n">refcount</span> <span class="o">=</span> <span class="mi">1</span>
        <span class="n">allocate</span> <span class="n">entry</span> <span class="ow">in</span> <span class="nb">open</span> <span class="nb">file</span> <span class="n">table</span>
        <span class="n">point</span> <span class="n">to</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span>
    <span class="n">fp</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="k">return</span> <span class="n">tid</span>
<span class="k">def</span> <span class="nf">read</span><span class="p">():</span>
    <span class="n">find</span> <span class="n">fp</span> <span class="ow">in</span> <span class="nb">open</span> <span class="nb">file</span> <span class="n">table</span> <span class="ow">and</span> <span class="n">increment</span>
    <span class="n">compute</span> <span class="n">block</span> <span class="n">number</span> <span class="n">to</span> <span class="n">be</span> <span class="n">read</span>
    <span class="n">find</span> <span class="n">disk</span> <span class="n">address</span> <span class="ow">in</span> <span class="n">inode</span> <span class="ow">in</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span>
    <span class="n">look</span> <span class="n">up</span> <span class="ow">in</span> <span class="n">cache</span> <span class="n">directory</span> <span class="p">(</span><span class="n">disk</span> <span class="n">address</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">present</span><span class="p">:</span>
        <span class="k">return</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">find</span> <span class="n">free</span> <span class="n">entry</span> <span class="ow">in</span> <span class="n">cache</span>
        <span class="n">readSector</span><span class="p">(</span><span class="n">disk</span> <span class="n">addr</span><span class="p">,</span> <span class="n">free</span> <span class="n">cache</span> <span class="n">block</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">write</span><span class="p">():</span>
    <span class="n">find</span> <span class="n">fp</span> <span class="ow">in</span> <span class="nb">open</span> <span class="nb">file</span> <span class="n">table</span> <span class="ow">and</span> <span class="n">increment</span>
    <span class="n">compute</span> <span class="n">block</span> <span class="n">number</span> <span class="n">to</span> <span class="n">be</span> <span class="n">written</span>
    <span class="n">find</span> <span class="o">/</span> <span class="n">allocate</span> <span class="n">disk</span> <span class="n">address</span> <span class="ow">in</span> <span class="n">Active</span> <span class="n">File</span> <span class="n">Table</span>
    <span class="n">Look</span> <span class="n">up</span> <span class="ow">in</span> <span class="n">cache</span> <span class="n">directory</span> <span class="p">(</span><span class="n">disk</span> <span class="n">address</span><span class="p">)</span>
    <span class="k">if</span> <span class="n">present</span><span class="p">:</span>
        <span class="n">overwrite</span>
    <span class="k">else</span><span class="p">:</span>
        <span class="n">find</span> <span class="n">free</span> <span class="n">cache</span> <span class="n">entry</span>
        <span class="n">overwrite</span>
<span class="k">def</span> <span class="nf">lseek</span><span class="p">(</span><span class="n">tid</span><span class="p">,</span> <span class="n">new_fp</span><span class="p">):</span>
    <span class="ow">in</span> <span class="nb">open</span> <span class="nb">file</span> <span class="n">table</span><span class="p">,</span> <span class="nb">set</span> <span class="n">fp</span> <span class="o">=</span> <span class="n">new_fp</span>
</code></pre></div></div>
<p>In the pseudo-code above, we don’t write synchronously to disk; we typically use cache writeback. This is fine for user data, but metadata is written to disk more aggressively, because losing metadata does affect the integrity of the file system!</p>
<p>Inodes contain a reference count (or refcount for short) because of hard links. The refcount is equal to the number of DDEs referencing the inode; if we have hard-linked files, we’ll have multiple DDEs referencing the same inode. To keep the file system consistent, the inode cannot be removed until no DDE references it any more (<code class="highlighter-rouge">refcount == 0</code>).</p>
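<p>The refcount can be observed from user space. The sketch below assumes a POSIX-like file system that supports hard links; the helper name <code class="highlighter-rouge">link_counts</code> and the file names are made up for illustration. It reads the link count that <code class="highlighter-rouge">os.stat</code> reports in <code class="highlighter-rouge">st_nlink</code> before and after adding and removing a hard link.</p>

```python
import os
import tempfile


def link_counts():
    """Return the link counts seen after create, hard-link and unlink."""
    counts = []
    with tempfile.TemporaryDirectory() as d:
        original = os.path.join(d, "original.txt")
        alias = os.path.join(d, "alias.txt")
        with open(original, "w") as f:
            f.write("hello")
        counts.append(os.stat(original).st_nlink)  # one directory entry

        os.link(original, alias)                   # second entry, same inode
        counts.append(os.stat(original).st_nlink)
        same_inode = os.stat(original).st_ino == os.stat(alias).st_ino

        os.unlink(original)                        # drop one entry; the inode survives
        counts.append(os.stat(alias).st_nlink)
        with open(alias) as f:
            data = f.read()                        # still readable through the alias
    return counts, same_inode, data
```

<p>The data only disappears once the last directory entry is removed, which is the refcount rule described above.</p>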
<h4 id="loose-ends">Loose ends</h4>
<p>For memory cache replacement, we keep an LRU list. Unlike with memory management, this is easy to do here, because file accesses are far less frequent than memory accesses. Again, replacing a clean block is usually preferred, if possible. Periodically (say, every 30s, when the disk is idle), we also flush the cache.</p>
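<p>A minimal sketch of such a cache, under simplifying assumptions: the class <code class="highlighter-rouge">BufferCache</code> and its fields are hypothetical, the disk is replaced by a list that records write-backs, and a miss simply yields an empty block. It is not how any particular kernel does it, only an illustration of LRU order with a preference for evicting clean blocks.</p>

```python
from collections import OrderedDict


class BufferCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # addr -> (data, dirty), least recently used first
        self.disk_writes = []        # records write-backs, for illustration only

    def _evict(self):
        # Prefer the least recently used clean block: evicting it costs no disk write.
        for addr, (_, dirty) in self.blocks.items():
            if not dirty:
                del self.blocks[addr]
                return
        # Every block is dirty: write back the least recently used one.
        addr, (data, _) = self.blocks.popitem(last=False)
        self.disk_writes.append((addr, data))

    def access(self, addr, data=None):
        """Read a block (data is None) or write it, updating the LRU order."""
        if addr in self.blocks:
            current, dirty = self.blocks.pop(addr)
        else:
            if len(self.blocks) >= self.capacity:
                self._evict()
            current, dirty = b"", False  # stand-in for reading the block from disk
        if data is not None:
            current, dirty = data, True
        self.blocks[addr] = (current, dirty)  # most recently used goes last
        return current

    def flush(self):
        """Periodic flush: write all dirty blocks back and mark them clean."""
        for addr, (data, dirty) in list(self.blocks.items()):
            if dirty:
                self.disk_writes.append((addr, data))
                self.blocks[addr] = (data, False)
```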
<p>Directories are stored as files. The only difference is that you cannot read them, only the OS can.</p>
<p>A very common sequence of operations is an <code class="highlighter-rouge">open</code> followed by a <code class="highlighter-rouge">read</code>; this executes a directory lookup, an inode lookup and a disk read for the data, which means that the disk head moves between directory, inode and data for a simple read. Since this is such a common pattern, how to do it properly has been a subject of research for a long time: we want these operations to be efficient, so the directory, inode and data should, if possible, sit next to each other in the same “cylinder group”. We won’t go into more detail in this class, but just know that it is a thing.</p>
<p>For file system startup, in a perfect world, no special actions are necessary. But sometimes, things aren’t normal: a disk sector can go bad, the FS can have bugs… Therefore, it’s common to check the file system. Linux has a utility called <code class="highlighter-rouge">fsck</code> (file system check) that checks that no sectors are allocated twice, and that no sectors are both allocated and on the free list (this could happen if a crash occurs in the middle of freeing a sector). Effectively, <code class="highlighter-rouge">fsck</code> reconstructs the free list instead of trusting what is already there.</p>
<p>Another thing that can go wrong is that some sectors can go bad. To solve this, we do replication of important blocks, like the boot block, and in modern systems, the device directory.</p>
<p>Disk fragmentation happens even if we have good allocation. Over time, we’ll end up with small “holes” of one or two sectors, and files spread all over the disk. On a fragmented disk, it’s no longer possible to do good allocation. The solution is to take the FS offline, move files into contiguous locations, and put it back online (you can do it without taking the FS down, but it’s tricky and not advisable).</p>
<h4 id="alternative-file-access-method-memory-mapping">Alternative file access method: memory mapping</h4>
<p>There are alternative file access methods: <code class="highlighter-rouge">mmap()</code> maps the contents of a file in memory, and <code class="highlighter-rouge">munmap()</code> removes the mapping. This is the real reason why people wanted 64-bit address spaces: if you <code class="highlighter-rouge">mmap</code> a few big files, you’re out of address space really fast! (The reason wasn’t that the heap ran into the stack, or that the code got too big for 32 bits; it’s mainly an <code class="highlighter-rouge">mmap</code> problem). With 64 bits, there is ample space between the heap and the stack for <code class="highlighter-rouge">mmap</code>-ed files.</p>
<p>To access an <code class="highlighter-rouge">mmap</code>-ed file, we access the relevant memory region; this triggers a page fault, which causes the corresponding page of the file to be brought in.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mmap</span><span class="p">():</span>
<span class="n">allocate</span> <span class="n">page</span> <span class="n">table</span> <span class="n">entries</span>
<span class="nb">set</span> <span class="n">valid</span> <span class="n">bit</span> <span class="n">to</span> <span class="s">"invalid"</span>
<span class="n">on</span> <span class="n">read</span><span class="p">:</span>
<span class="n">page</span> <span class="n">fault</span>
<span class="c1"># just like in demand paging:
</span> <span class="nb">file</span> <span class="o">=</span> <span class="n">backing</span> <span class="n">store</span> <span class="k">for</span> <span class="n">mapped</span> <span class="n">region</span> <span class="n">of</span> <span class="n">memory</span>
<span class="nb">set</span> <span class="n">valid</span> <span class="n">bit</span> <span class="n">to</span> <span class="s">"valid"</span>
<span class="n">on</span> <span class="n">write</span><span class="p">:</span>
<span class="n">write</span> <span class="n">to</span> <span class="n">memory</span>
</code></pre></div></div>
<p>Page replacement will take care of getting the data to disk (we can also force it through <code class="highlighter-rouge">msync()</code>). <code class="highlighter-rouge">mmap</code> is very good for small, random accesses to big files (in databases, for instance), as it only brings the data we need into memory. It’s also an easier programming model than <code class="highlighter-rouge">lseek</code> and <code class="highlighter-rouge">read</code>, and it’s easier to reuse data this way.</p>
<p>But there are restrictions too: you can only <code class="highlighter-rouge">mmap</code> on a page boundary, it’s not easy to extend a file, and for small files, reads are usually cheaper than the <code class="highlighter-rouge">mmap</code> + page fault. Indeed, for random access to small files, it may be better to read the whole file into memory first.</p>
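<p>As a small illustration, here is a sketch using Python’s standard <code class="highlighter-rouge">mmap</code> module, which wraps the system call. The helper name <code class="highlighter-rouge">mmap_demo</code> and the four-page temporary file are assumptions made for the example. It maps a single page at a page-aligned offset, writes to it as if it were memory, and reads the bytes back through the ordinary <code class="highlighter-rouge">lseek</code> and <code class="highlighter-rouge">read</code> path.</p>

```python
import mmap
import os
import tempfile


def mmap_demo():
    """Write through a mapping of the third page of a file, then read it back."""
    page = mmap.ALLOCATIONGRANULARITY       # mapping offsets must be multiples of this
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"\0" * (4 * page))    # a file spanning four pages
        # Map one page of the file, starting at a page-aligned offset.
        with mmap.mmap(fd, page, offset=2 * page) as m:
            m[0:5] = b"hello"               # a plain memory write, no lseek/write calls
            m.flush()                       # force the dirty page out, like msync()
        os.lseek(fd, 2 * page, os.SEEK_SET)
        return os.read(fd, 5)
    finally:
        os.close(fd)
        os.remove(path)
```

<p>Passing an offset that is not a multiple of the allocation granularity raises an error, which is the page boundary restriction mentioned above.</p>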
<p>Another use of <code class="highlighter-rouge">mmap</code> is for sharing memory between processes, as a form of interprocess communication (using the <code class="highlighter-rouge">MAP_SHARED</code> and <code class="highlighter-rouge">MAP_ANONYMOUS</code> flags).</p>
<h3 id="dealing-with-crashes">Dealing with crashes</h3>
<h4 id="atomic-writes">Atomic writes</h4>
<p>Crashes can happen in the middle of a sequence of writes, which can have catastrophic effects. Therefore, we aim for atomicity: it’s all or nothing. In a file system, that means that either all updates are on disk, or no updates are on disk; we do not allow for in-betweens. How can we make sure that all or no updates to an open file get to disk?</p>
<p>We’ll operate under the assumption that a single sector disk write is atomic (which is 99.99999…% true. Disk vendors work very hard at this, we’ll just assume it’s always true).</p>
<p>To switch atomically from old to new, we cannot write in-place. Instead, we write to new, separate blocks on the disk, and switch the pointer in the DDE atomically. Old blocks are then de-allocated. If the crash happens before they’re de-allocated, the file system check will fix it.</p>
<p>A more precise description of this process follows: when opening, we read the DDE into the AFT, which effectively creates a copy of the (on-disk) DDE in memory. For writes, we allocate new blocks for the data, write these addresses into the incore DDE, and write to cache. On close, we write all cached blocks to disk blocks, and atomically write the DDE to disk.</p>
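<p>The steps above can be sketched with a toy model. Everything here is an assumption made for illustration: the disk is a dictionary, the classes <code class="highlighter-rouge">ToyDisk</code> and <code class="highlighter-rouge">OpenFile</code> are hypothetical, and the single assignment to the on-disk DDE stands in for the atomic single-sector write.</p>

```python
class ToyDisk:
    def __init__(self):
        self.blocks = {}     # disk address -> block contents
        self.dde = {}        # file name -> {blockno: disk address} (the on-disk DDE)
        self.next_free = 0

    def allocate(self):
        addr = self.next_free
        self.next_free += 1
        return addr


class OpenFile:
    def __init__(self, disk, name):
        self.disk, self.name = disk, name
        self.incore = dict(disk.dde.get(name, {}))  # in-memory copy of the DDE
        self.cache = {}                             # new address -> data, not yet on disk

    def write(self, blockno, data):
        addr = self.disk.allocate()                 # never overwrite in place
        self.incore[blockno] = addr
        self.cache[addr] = data

    def close(self):
        self.disk.blocks.update(self.cache)         # 1. data blocks reach the disk
        old = self.disk.dde.get(self.name, {})
        self.disk.dde[self.name] = dict(self.incore)  # 2. the atomic DDE switch
        for addr in set(old.values()) - set(self.incore.values()):
            self.disk.blocks.pop(addr, None)        # 3. de-allocate the old blocks


def read_file(disk, name):
    """Read a file as it would look after a reboot, using only on-disk state."""
    return [disk.blocks[a] for _, a in sorted(disk.dde[name].items())]
```

<p>If the process “crashes” before <code class="highlighter-rouge">close()</code>, the on-disk DDE still points to the old blocks, so a reader sees the old contents in full; after <code class="highlighter-rouge">close()</code>, it sees the new contents in full.</p>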
<h4 id="intentions-log">Intentions log</h4>
<p>An alternative method is the intentions log. We reserve an area of disk for the intentions log.</p>
<p><img src="/images/os/log.gif" alt="GIF of the steps involved in writing with a log" /></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">write</span><span class="p">():</span>
<span class="n">write</span> <span class="n">to</span> <span class="n">cache</span>
<span class="n">write</span> <span class="n">to</span> <span class="n">log</span>
<span class="n">make</span> <span class="ow">in</span><span class="o">-</span><span class="n">memory</span> <span class="n">inode</span> <span class="n">point</span> <span class="n">to</span> <span class="n">update</span> <span class="ow">in</span> <span class="n">log</span>
<span class="k">def</span> <span class="nf">close</span><span class="p">():</span>
<span class="n">write</span> <span class="n">old</span> <span class="ow">and</span> <span class="n">new</span> <span class="n">inode</span> <span class="n">to</span> <span class="n">log</span> <span class="ow">in</span> <span class="n">one</span> <span class="n">disk</span> <span class="n">write</span>
<span class="n">copy</span> <span class="n">updates</span> <span class="k">from</span> <span class="n">log</span> <span class="n">to</span> <span class="n">original</span> <span class="n">disk</span> <span class="n">locations</span>
<span class="n">when</span> <span class="nb">all</span> <span class="n">updates</span> <span class="n">are</span> <span class="n">done</span><span class="p">,</span> <span class="n">overwrite</span> <span class="n">inode</span> <span class="k">with</span> <span class="n">new</span> <span class="n">value</span>
<span class="n">remove</span> <span class="n">updates</span><span class="p">,</span> <span class="ow">and</span> <span class="n">old</span> <span class="ow">and</span> <span class="n">new</span> <span class="n">inodes</span> <span class="k">from</span> <span class="n">log</span>
</code></pre></div></div>
<p>On recovery from a crash (in other words, on any reboot, since we can’t know if we’ve crashed or not), we search through the log, and for every new inode found, we find and copy the updates to their original location. If all updates are done, we write to the new inode, and then remove updates and old and new inodes from log.</p>
<p>If the new inode is in the log and we crash, then we’ll end up with the new inode. If it isn’t, then we keep the old one. This works even if we crash during crash recovery.</p>
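<p>Here is a toy simulation of the log and its recovery procedure, under strong simplifying assumptions: a single file, one set of updates at a time, and a Python list playing the role of the log area. The names <code class="highlighter-rouge">Disk</code>, <code class="highlighter-rouge">log_write</code>, <code class="highlighter-rouge">log_commit</code> and <code class="highlighter-rouge">apply_log</code> are hypothetical, and the <code class="highlighter-rouge">crash_after</code> parameter only exists to simulate a crash halfway through.</p>

```python
class Disk:
    def __init__(self):
        self.data = {}            # home locations: block number -> contents
        self.inode = {"size": 0}  # the on-disk inode of our single file
        self.log = []             # reserved log area: a list of records


def log_write(disk, blockno, contents):
    """write(): the update goes to the log, not to its home location."""
    disk.log.append(("update", blockno, contents))


def log_commit(disk, new_inode):
    """close(), first step: old and new inode go to the log in one disk write."""
    disk.log.append(("inodes", dict(disk.inode), dict(new_inode)))


def apply_log(disk, crash_after=None):
    """Copy committed updates home. The same routine serves as crash recovery."""
    commit = next((r for r in disk.log if r[0] == "inodes"), None)
    if commit is None:
        disk.log.clear()          # no new inode in the log: keep the old state
        return
    applied = 0
    for record in disk.log:
        if record[0] == "update":
            if crash_after is not None and applied == crash_after:
                raise RuntimeError("simulated crash")
            disk.data[record[1]] = record[2]   # idempotent: safe to repeat
            applied += 1
    disk.inode = dict(commit[2])  # all updates are home: switch to the new inode
    disk.log.clear()              # finally, empty the log
```

<p>Because copying an update to its home location can be repeated safely, running <code class="highlighter-rouge">apply_log</code> again after a crash in the middle of it gives the same result, which is why recovery survives a crash during recovery.</p>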
<h4 id="comparison">Comparison</h4>
<p>The DDE method has one disk write for <code class="highlighter-rouge">write()</code>, and one for <code class="highlighter-rouge">close()</code>. The log method does two disk writes for <code class="highlighter-rouge">write()</code> (one to the log, another for the data), and one for <code class="highlighter-rouge">close()</code>. And yet surprisingly, the log works better! That’s because writes to the log are sequential, so we have no seeks. Data blocks stay in place, and the existing disk allocation is preserved. To further optimize this, we can write from cache or log to data when the disk is idle, or during cache replacement.</p>
<p>With DDE, disk allocation gets messed up, and we have high fragmentation.</p>
<p>Virtually all modern OS file systems use some form of logging.</p>
<h3 id="log-structured-file-system-lfs">Log-Structured File System (LFS)</h3>
<p>This is an alternative way of structuring the file system, which takes the idea of the log to the extreme. The rationale is that today, we have large memories and large buffer caches. Most reads are served from cache anyway, so most disk traffic is write traffic. So to optimize I/O, we need to optimize disk writes by making sure we write sequentially to disk. The key idea in LFS is to use the log, an <strong>append-only</strong> data structure <strong>on disk</strong>.</p>
<h4 id="writing">Writing</h4>
<p><em>All</em> writes are to a log, including data and inode modifications. Writes first go in a cache, and are eventually written with writeback. Writes also go into an in-memory buffer. When it’s full, we append it to the log: that way, we add to the log in blocks called <em>segments</em>. Since we’re adding sequentially, we don’t do any seeks on write.</p>
<p>A single segment contains both data and inode blocks, and the log is a series of segments.</p>
<p><img src="/images/os/log-segments.png" alt="The log is made of a series of segments" /></p>
<p><img src="/images/os/segment.png" alt="A segment contains both data and inode blocks" /></p>
<h4 id="reading">Reading</h4>
<p>When we write an inode, we record its new disk address in the inode map; this is an in-memory table of inode disk addresses:</p>
<p><img src="/images/os/inode-map.png" alt="Inode map" /></p>
<p>The inode map holds the following mapping: <code class="highlighter-rouge">uid -> disk address of the last inode for that uid</code></p>
<p>To read, we need to go through the indirection of the inode map:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">open</span><span class="p">():</span>
<span class="n">get</span> <span class="n">inode</span> <span class="n">address</span> <span class="k">from</span> <span class="n">inode</span> <span class="nb">map</span>
<span class="n">read</span> <span class="n">inode</span> <span class="k">from</span> <span class="n">disk</span> <span class="n">into</span> <span class="n">active</span> <span class="nb">file</span> <span class="n">table</span>
<span class="k">def</span> <span class="nf">read</span><span class="p">():</span> <span class="c1"># as before
</span> <span class="n">get</span> <span class="k">from</span> <span class="n">cache</span>
<span class="n">get</span> <span class="k">from</span> <span class="n">disk</span> <span class="n">address</span> <span class="ow">in</span> <span class="n">inode</span>
</code></pre></div></div>
<p>With the added indirection of the inode map, reading seems more complicated; but performance is chiefly determined by disk reads, so at the end of the day, there is little difference.</p>
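<p>To make the write and read paths concrete, here is a toy in-memory model. It is only a sketch under assumptions: the class <code class="highlighter-rouge">ToyLFS</code> is hypothetical, segments hold four blocks, the log is a flat Python list whose indices play the role of disk addresses, and an inode is a plain dictionary from <code class="highlighter-rouge">blockno</code> to address.</p>

```python
SEGMENT_SIZE = 4  # blocks per segment (tiny, for illustration)


class ToyLFS:
    def __init__(self):
        self.log = []        # on-disk log: a flat list of blocks
        self.segment = []    # in-memory segment buffer
        self.inode_map = {}  # uid -> log address of the latest inode

    def flush(self):
        self.log.extend(self.segment)  # one sequential write of a whole segment
        self.segment = []

    def _append(self, block):
        """Buffer a block and return the log address it will occupy."""
        addr = len(self.log) + len(self.segment)
        self.segment.append(block)
        if len(self.segment) == SEGMENT_SIZE:
            self.flush()
        return addr

    def _block_at(self, addr):
        if addr < len(self.log):
            return self.log[addr]
        return self.segment[addr - len(self.log)]  # still in the segment buffer

    def _inode(self, uid):
        if uid not in self.inode_map:
            return {}
        return self._block_at(self.inode_map[uid])[2]

    def write(self, uid, blockno, data):
        inode = dict(self._inode(uid))  # copy: old inodes are never modified
        inode[blockno] = self._append(("data", uid, blockno, data))
        self.inode_map[uid] = self._append(("inode", uid, inode))

    def read(self, uid, blockno):
        # Indirection: inode map -> latest inode -> data block address.
        return self._block_at(self._inode(uid)[blockno])[3]
```

<p>Overwriting a block never touches the old copy: the new data and a new inode are appended, and only the inode map entry changes.</p>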
<h4 id="checkpoints">Checkpoints</h4>
<p>How do we get the inode map to disk? We do what’s called a <strong>checkpoint</strong>: we write a copy of the inode map to a fixed location on disk, and we put a marker in the log.</p>
<p><em>Fake news alert</em>: we said “all data is written to the log”, but in reality, it’s “all except for checkpoints”.</p>
<p>On a crash, we start from the inode map in the checkpoint. This contains addresses of all inodes written <em>before</em> the last checkpoint. To recover, we need to find the inodes that were in the in-memory inode map before the crash, but not yet written in the checkpoint. To do this, we “<strong>roll forward</strong>”. Remember that the checkpoint put a marker in the log: from this marker forward, we scan for inodes in the log, and add their addresses into the inode map.</p>
<p>We need to pick a time interval between checkpoints. Too short, and we’ll have too much disk I/O; too long, and recovery will take too long (the forward scan can be slow). Since crashes are assumed to be rare (if they aren’t, you should probably buy another machine), the compromise typically errs towards longer intervals, and thus longer recovery times. There are exceptions for critical applications: if your OS is controlling a plane, then you probably want recovery to be fast, so that the plane doesn’t crash because of forward scanning.</p>
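<p>Checkpointing and rolling forward can be sketched as follows. This is a simplified model, not the actual LFS format: the log is a Python list of tuples whose first two fields are a kind (<code class="highlighter-rouge">"data"</code> or <code class="highlighter-rouge">"inode"</code>) and a <code class="highlighter-rouge">uid</code>, and the marker is stored as a log position inside the checkpoint rather than as a record in the log. The function names are hypothetical.</p>

```python
def checkpoint(log, inode_map):
    """Save a copy of the inode map together with the current log position."""
    return {"inode_map": dict(inode_map), "marker": len(log)}


def recover(log, ckpt):
    """Rebuild the inode map after a crash by rolling forward from the marker."""
    inode_map = dict(ckpt["inode_map"])
    for addr in range(ckpt["marker"], len(log)):
        kind, uid = log[addr][0], log[addr][1]
        if kind == "inode":
            inode_map[uid] = addr  # later inodes supersede earlier ones
    return inode_map
```

<p>The longer the interval between checkpoints, the more log there is after the marker, and the longer the loop in <code class="highlighter-rouge">recover</code> runs.</p>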
<h4 id="disk-cleaning">Disk cleaning</h4>
<p>What if the disk is full? We always write to the end of the log, never overwriting. This means that the disk gets full quickly. Therefore, we need to <strong>clean</strong> the disk. In this process, we reclaim old data. In this context, “old” means logically overwritten but not physically overwritten. Data is logically overwritten when there’s a later write to the same <code class="highlighter-rouge">(uid, blockno)</code> in the log; it is physically overwritten once it’s been committed to disk (so if there’s an older version of <code class="highlighter-rouge">(uid, blockno)</code> in the log, it’s not physically overwritten yet).</p>
<p>Disk cleaning is done one segment at a time. We determine which blocks in the segment are new, and write them into a buffer; when the buffer is full, we add it to the log as a segment. The cleaned segment is then marked free.</p>
<p>How do we determine whether a block is old or new? We actually have to modify <code class="highlighter-rouge">write()</code> a little. Instead of only writing data to the buffer and log, we must write data, <code class="highlighter-rouge">uid</code> and <code class="highlighter-rouge">blockno</code>. With this, for a given block, we can take its disk address, its <code class="highlighter-rouge">uid</code> and <code class="highlighter-rouge">blockno</code>, and look them up in the inode map. This gives us the latest inode for that <code class="highlighter-rouge">uid</code>; if the inode records a different disk address for <code class="highlighter-rouge">blockno</code>, then the block must be old. Otherwise, it must be new.</p>
<p><img src="/images/os/cleaning-old.png" alt="How to determine whether a block is old with the inode map" /></p>
<p><img src="/images/os/cleaning-new.png" alt="How to determine whether a block is new with the inode map" /></p>
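<p>The old-or-new test can be written down in a few lines. The sketch assumes a simplified log layout: a list in which data blocks are tuples <code class="highlighter-rouge">("data", uid, blockno, contents)</code>, inode blocks are tuples <code class="highlighter-rouge">("inode", uid, {blockno: address})</code>, and list indices play the role of disk addresses. The function names are hypothetical, and only data blocks are handled.</p>

```python
def is_live(log, inode_map, addr):
    """A data block is new (live) if the latest inode for its uid points at addr."""
    _, uid, blockno, _ = log[addr]
    if uid not in inode_map:
        return False                  # no inode refers to this file any more
    inode = log[inode_map[uid]][2]    # blockno -> address, from the latest inode
    return inode.get(blockno) == addr


def live_blocks(log, inode_map, start, end):
    """Collect the live data blocks of the segment log[start:end]."""
    return [log[a] for a in range(start, end)
            if log[a][0] == "data" and is_live(log, inode_map, a)]
```

<p>A real cleaner would then append the collected blocks to the log and write updated inodes for them, since moving a block changes its address.</p>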
<p>Taking cleaning into account, the log actually is more complicated than a simple linear log. It’s actually a sequence of segments, some in use, some free. Writes thus don’t only append to the log, but typically write to a free segment.</p>
<p><img src="/images/os/log-empty-segments.png" alt="The log with free segments" /></p>
<p>But the whole point of the log is sequential access so that we never have to move the head! This is true, but the segments tend to be very large (100s of sectors, MBs), so we tend to still get large sequential accesses with the occasional seek. This stays rare, so performance is still very good.</p>
<h4 id="summary">Summary</h4>
<p>In summary, LFS reads mostly from cache. The writes to disk are heavily optimized, so we have very few seeks. The reads from disk are a bit more expensive, but it’s not so bad, since there are few of them. With LFS, we have to endure the cost of cleaning.</p>
<p>The checkpoint and the log are on disk; the checkpoint region is at a fixed location, and the log uses the remainder of the disk. Individual segments in the log contain data and inode sectors, where data includes <code class="highlighter-rouge">uid</code> and <code class="highlighter-rouge">blockno</code>.</p>
<p>Additional in-memory data structures assist us in this system: the <strong>cache</strong> is a regular write-behind buffer cache, and the <strong>segment buffer</strong> is for the segment currently being written, which goes to the log all at once when it’s full. The <strong>inode map</strong> is a map indexed by <code class="highlighter-rouge">uid</code> pointing to the last-written inode for <code class="highlighter-rouge">uid</code>. In addition to this, we also have the <strong>active</strong> and <strong>open file tables</strong>, as before.</p>
<p>Writes put data and inodes into (write-back) cache and in-memory segment buffer, which goes to a free segment in the log once it’s full, which means almost no seeks on write! To open, we read the inode map to find the inode, and read the inode into the active file table. To read, we get from cache; if it isn’t there, we must get from disk using the disk address in the inode, as before.</p>
<p>LFS is actually more complicated than what we presented, and it has not become mainstream. This is mainly because cleaning has a considerable cost, and causes performance dips (and people like systems that are predictable). On top of that, the complexity of this system makes it very difficult to implement. Still, there are similar ideas in some commercial systems (especially high-throughput systems, like computing clusters).</p>
<p>What <em>has</em> become popular are journaling file systems, which use a log (called a journal) for reliability (ext4 on Linux, for example).</p>
<p>Note that cleaning is somewhat similar to garbage collection, both in its function and its effects. But cleaning happens on disk (as opposed to in memory), and is much slower.</p>
<h2 id="alternative-storage-media">Alternative storage media</h2>
<p>Since the 1970s, disks have improved by a factor of 100 to 1000 in terms of physical size, cost and capacity. But bandwidth and latency of disks, the two most important factors for performance, have lagged behind, only improving 10-fold. This is problematic for servers, big data computations, transaction systems, etc.</p>
<p>There are two developments that have risen as a response to this: RAID and SSDs. They both improve performance, but remain more costly than disks.</p>
<h3 id="raid">RAID</h3>
<p><strong>RAID</strong> stands for “Redundant Array of Independent Disks”. The main concept behind RAID is <strong>striping</strong>: rather than put a file on one disk, we stripe it across a number of disks. The idea here is that if you put all your data on one disk, the bandwidth isn’t so good, but you can parallelize this: in the best of worlds, the bandwidth of n disks is n times the bandwidth of a single disk. There are other factors that put a cap on this (bandwidth of IO bus, controller, etc), but essentially this holds.</p>
<p><img src="/images/os/striping.png" alt="Striping across disks" /></p>
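<p>Striping boils down to a mapping from a logical block number to a disk and a position on that disk. The sketch below assumes simple round-robin, block-level striping; the function names <code class="highlighter-rouge">locate</code> and <code class="highlighter-rouge">stripe</code> are made up for the example.</p>

```python
def locate(block, n_disks):
    """Map a logical block number to (disk index, block index on that disk)."""
    return block % n_disks, block // n_disks


def stripe(data_blocks, n_disks):
    """Distribute a file's blocks round-robin over n_disks disks."""
    disks = [[] for _ in range(n_disks)]
    for i, block in enumerate(data_blocks):
        disk, _ = locate(i, n_disks)
        disks[disk].append(block)
    return disks
```

<p>Consecutive blocks land on different disks, so a large sequential read can keep all the disks busy at once.</p>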
<p>The downside here is that a single disk failure makes all data unavailable. When talking about disk failures, the main figure is MTBF (mean time between failures), which is typically 50,000 hours (a bit under 6 years). But for a RAID array without redundancy, it’s the MTBF of a single disk, <em>divided by n</em>. So for 10 disks, we’re looking at roughly half a year, which is clearly not acceptable.</p>
<p>That’s why RAID also has a notion of <strong>redundancy</strong>. Each RAID configuration has a RAID level that describes its redundancy. What we described above is RAID-0, i.e. no redundancy. But in reality, people use RAID-1 (mirroring), RAID-4 (parity disk), or RAID-5 (distributed parity).</p>
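<p>The arithmetic is short enough to check directly. The sketch uses the 50,000 hour figure from the text and assumes that disks fail independently; the function name is hypothetical.</p>

```python
HOURS_PER_YEAR = 24 * 365


def array_mtbf_hours(disk_mtbf_hours, n_disks):
    """Approximate MTBF of an n-disk array without redundancy."""
    return disk_mtbf_hours / n_disks


single_disk_years = 50_000 / HOURS_PER_YEAR                      # about 5.7 years
ten_disk_years = array_mtbf_hours(50_000, 10) / HOURS_PER_YEAR   # about 0.57 years
```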
<ul>
<li>
<p><strong>RAID-0</strong>: offers the best possible read and write bandwidth, but the lack of redundancy means that any disk failure results in data loss, which is typically not acceptable.</p>
</li>
<li>
<p><strong>RAID-1</strong>: we keep a full mirror (back-up) of all the data. Writes must go to both data and mirror disk, but reads may go to any of them. After a crash, we recover from the surviving disk. But in this process, we’ve also halved storage capacity for the same number of disks as RAID-0; we can do better.</p>
<p><img src="/images/os/raid-1.png" alt="RAID-1 disks" /></p>
</li>
<li><strong>RAID-2</strong>: almost never used in practice; stripes data at the bit level (instead of block level), and uses a Hamming code stored across at least 2 additional drives for error correction. The added complexity of Hamming codes offers no significant advantage over the simpler parity bits (it can repair undetected transient errors on a single bad bit, which other levels cannot do, but at the cost of an extra drive), so RAID-2 has rarely been implemented.
<figure>
<img src="/images/os/raid-2.png" alt="RAID-2 disks" />
<figcaption>Source: <a href="https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID2_arch.svg">Wikipedia</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA 3.0</a></figcaption>
</figure>
</li>
<li>
<p><strong>RAID-3</strong>: almost never used in practice; stripes data at the byte level (instead of block level), is able to correct single-bit errors, and uses a dedicated parity disk. Because it stripes at the byte level, any single block I/O is spread over all disks, meaning that RAID-3 cannot service multiple requests simultaneously.</p>
<figure>
<img src="/images/os/raid-3.png" alt="RAID-3 disks" />
<figcaption>Two six-byte blocks (in different colors) and their two parity bytes. Source: <a href="https://en.wikipedia.org/wiki/Standard_RAID_levels#/media/File:RAID_3.svg">Wikipedia</a>, <a href="https://creativecommons.org/licenses/by-sa/3.0/">CC-BY-SA 3.0</a></figcaption>
</figure>
</li>
<li>
<p><strong>RAID-4</strong>: N data disks and 1 parity disk. Parity is a simple form of error detection and repair (not specific to RAID, also used in communications). A parity bit is computed as the XOR of 4 bits: p = a ⊕ b ⊕ c ⊕ d. If we lose a bit, say bit b, we can reconstruct it as b = a ⊕ c ⊕ d ⊕ p. In RAID-4, we use the same idea at the disk block level; the parity disk holds the XOR of the bits of data blocks at the same position. Reads read from data disks, while writes must write to data and parity disks. On a crash, we can recover using the parity disk and the other disks.</p>
<p><img src="/images/os/raid-4.png" alt="RAID-4 disks" /></p>
<p>The issue is that every write implies an additional access to a parity disk, which quickly becomes a bottleneck for write-heavy workloads; we can do better.</p>
</li>
<li>
<p><strong>RAID-5</strong>: just like RAID-4, but we distribute the parity blocks over all disks; this balances parity write load over all disks.</p>
</li>
<li>
<p><strong>RAID-6</strong>: double parity, in case of double disk failures</p>
</li>
<li><strong>RAID-1+0</strong>: RAID-0 of RAID-1 configurations</li>
</ul>
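<p>The parity idea used by RAID-4 and RAID-5 can be tried out on byte strings. This is a minimal sketch in which each “disk block” is a <code class="highlighter-rouge">bytes</code> object of the same length; the function names are chosen for the example.</p>

```python
from functools import reduce


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def reconstruct(surviving_blocks, parity):
    """Recover the single missing data block from the survivors and the parity."""
    return xor_blocks(list(surviving_blocks) + [parity])
```

<p>For example, with data blocks a, b, c and d and parity p = a ⊕ b ⊕ c ⊕ d, calling <code class="highlighter-rouge">reconstruct([a, c, d], p)</code> gives back b, because every other term cancels out under XOR.</p>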
<h3 id="ssd">SSD</h3>
<p>The first thing to note is that SSDs (Solid State Disks) are not in fact disks. They’re purely electronic drives (based on NAND flash), with no moving parts. The basic units in NAND flash are pages (4K) and blocks (e.g. 64 pages).</p>
<p>SSDs were made to be a plug-and-play replacement for HDDs; they have the same form factor and interface as HDDs (even though they have no other technical reason to be so). The SSD interface is thus very much like a disk, with <code class="highlighter-rouge">readSector</code> and <code class="highlighter-rouge">writeSector</code>. The bandwidth is higher than disk, and the latency is much lower (order of 100 μs for read, 300 μs for write).</p>
<p>Basic operations are:</p>
<ul>
<li><code class="highlighter-rouge">read(page)</code></li>
<li><code class="highlighter-rouge">write(page)</code>: note that we cannot <em>re</em>write a page, only write to a blank one (see below)</li>
<li><code class="highlighter-rouge">erase(block)</code>: we must erase a block before one of its pages can be rewritten. There’s a limited number of erase cycles that SSD manufacturers guarantee (order of 100,000)</li>
</ul>
<p>Since a page cannot be rewritten until its whole block has been erased, we write pages sequentially within a block. We cannot overwrite in place, but need to erase a block before writing to it again. This should remind us of LFS… and indeed, an LFS-like system is perfectly suited for SSDs.</p>
<p>As in LFS, we clean a block before erasing it: we move its live data to a new block, and then erase it. To help with this, there is the TRIM command, <code class="highlighter-rouge">TRIM(range of logical sector numbers)</code>, which tells the device that some sectors are no longer in use. If that’s been called, then there’s no need to clean those sectors; we can just trust what we’ve been told and erase without checking.</p>
<p>Still, when we do cleaning, our strategy when picking blocks is to even out the number of erase cycles, in an attempt to extend the longevity of the SSD.</p>
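<p>These constraints can be captured in a toy model of a flash block. This is only a sketch: the class <code class="highlighter-rouge">FlashBlock</code>, the four-page block size and the selection function are assumptions for illustration, and real devices implement this logic in their firmware.</p>

```python
PAGES_PER_BLOCK = 4  # real devices use e.g. 64; kept small here


class FlashBlock:
    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK  # None means blank (erased)
        self.next_page = 0                     # pages are written sequentially
        self.erase_count = 0

    def write(self, data):
        """Write to the next blank page and return its index."""
        if self.next_page == PAGES_PER_BLOCK:
            raise RuntimeError("block is full: erase it before writing again")
        self.pages[self.next_page] = data
        self.next_page += 1
        return self.next_page - 1

    def erase(self):
        """Blank the whole block; this is what wears the device out."""
        self.pages = [None] * PAGES_PER_BLOCK
        self.next_page = 0
        self.erase_count += 1


def pick_block_to_erase(candidates):
    """Wear leveling: among reclaimable blocks, erase the least-worn one."""
    return min(candidates, key=lambda block: block.erase_count)
```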
<h2 id="virtual-machines">Virtual Machines</h2>
<h3 id="virtualization">Virtualization</h3>
<p>Virtualization is an instance of indirection, and to be more specific, of layering.</p>
<blockquote>
<p>“Any problem in computer science can be solved with another layer of indirection. But that usually will create another problem.”</p>
<p>— David Wheeler</p>
</blockquote>
<p>There are three mechanisms of virtualization:</p>
<ul>
<li><strong>Multiplexing</strong>: take one resource, and expose it as multiple virtual entities. Each entity appears to have the full resource (e.g. virtual memory on the MMU)</li>
<li><strong>Aggregation</strong>: take multiple resources, and expose them as a single virtual entity. The goal is typically to achieve enhanced properties, be it bandwidth, capacity, availability, redundancy… A canonical example for this is RAID.</li>
<li><strong>Emulation</strong>: make a resource appear as a different type of resource. For instance, use software to emulate a virtual resource which is different from the underlying physical resource. Examples include RAM disks, Android Emulator or the JVM (which is simply emulating a different instruction set)</li>
</ul>
<h3 id="terminology">Terminology</h3>
<p>A virtual machine (<strong>VM</strong>) is virtualization applied to an entire computer: a machine running on a machine. Usually, we use virtualization to run many machines on a single machine. The real, base machine is the <strong>physical</strong> machine, and the ones running on top are <strong>virtual</strong> machines.</p>
<p>The <strong>VMM</strong> (Virtual Machine Monitor) is a resource manager for VMs. It is comparable to the OS in the sense that they both provide resource management (between processes for an OS, between VMs for a VMM), but an OS provides abstractions (processes, address spaces, FS, …) and a VMM doesn’t; a VMM provides an identical copy of the machine.</p>
<p>The VMM provides creation, destruction and scheduling of VMs, and takes care of memory, disk and I/O management for VMs. Note that this is <em>similar</em> to what an OS does (but with processes replaced by VMs), but again, it’s not the same because a VMM doesn’t provide abstraction.</p>
<p>In a VM, you typically run an OS (called a <strong>guest OS</strong>), with applications on top. This is opposed to the <strong>host OS</strong>, which is the one running on the metal (as opposed to in the VM).</p>
<p>The <strong>hypervisor</strong> or <strong>Type I VMM</strong> is a VMM that is also a host OS (e.g. Xen, VMware vSphere, Microsoft Hyper-V), while a <strong>hosted VMM</strong> or <strong>Type II VMM</strong> is separate from the host OS (e.g. KVM, VMware Workstation and Fusion, or Parallels).</p>
<p>An application or OS running on real hardware runs in <strong>native mode</strong>. If it runs on virtual hardware, we say it’s running in <strong>virtualized mode</strong>.</p>
<h3 id="history">History</h3>
<p>As a bit of history, the VM was developed in the 1970’s by IBM. Back then, mainframes were <em>very</em> expensive, and VMs allowed sharing by users running different OSes. VMs fell out of favor in the 80’s and early 90’s as cheap microprocessors were introduced, but became prominent again from the late 90’s onwards, mainly for server consolidation (most servers are mostly idle, so to save power, we can run multiple servers on the same machine).</p>
<h3 id="vmm-implementation">VMM Implementation</h3>
<p>There are three requirements for a VMM:</p>
<ul>
<li><strong>Equivalence</strong>: the virtual hardware needs to be sufficiently equivalent to the underlying hardware that you can run the same software in the virtual machine</li>
<li><strong>Safety</strong>: the VM must be completely isolated from other VMs and from the VMM. This also means that VMs cannot interfere with each other, or hog resources.</li>
<li><strong>Performance</strong>: the overhead of virtualization must be low enough that the VM can still reasonably run, i.e. that it can be used just as if it were running directly on the hardware.</li>
</ul>
<p>How do we implement a VMM? A major problem to solve is around kernel vs. user mode, privileged instructions and system calls. In native mode, the kernel-user boundary is between the applications and the OS, but where should we place it in virtualized mode?</p>
<p>Keeping it between the applications and the OS is incorrect; we would ruin the safety requirement because, as we saw in the beginning of this course, kernel mode is God mode. We can place the boundary between the VMM and the guest OSes, but that raises another problem: the guest OSes are now in user mode, even though they were written for kernel mode. As a result, we run into a host of problems:</p>
<ol>
<li>System calls (that usually put the machine in kernel mode) won’t work correctly</li>
<li>Since the OS and the applications are in user mode together, the applications could access the guest OS memory</li>
<li>The guest OS uses privileged instructions</li>
</ol>
<p>We can solve the above with the following solutions:</p>
<ol>
<li>
<p>If an application does a syscall, it wants that syscall to be handled by its <em>guest OS</em>. However, syscalls go to the VMM, which holds the syscall vectors for the physical machine. Since we want to reach the syscall vectors of the guest OS, the VMM must somehow forward the syscall.</p>
<p>Remember that hardware directs syscalls to physical syscall vectors; it puts the machine into kernel mode and jumps to the machine’s syscall handler.</p>
<p>Therefore, at boot time, the VMM installs physical machine syscall vectors. At guest OS boot time, the OS wants to install its syscall vectors through a privileged instruction. But being in user mode, it traps to the VMM, which will thus gain the information of where the OS syscall handlers are.</p>
<p>When the VMM’s syscall handler is called, it can set the <abbr title="Program Counter">PC</abbr> to the appropriate syscall vector in the OS and return to user mode, in which the guest OS will run its syscall handler. Inevitably, this guest OS syscall handler will execute the “return to user mode” instruction, which traps to the VMM; the VMM can then set the PC to the instruction following the original syscall and return to user mode in the guest OS.</p>
</li>
<li>
<p>On switching back from the VMM to the VM, if the return address is in the guest OS, we allow all memory accesses within the VM. If it is outside the guest OS (i.e. to an application), we disallow access to the guest OS memory. This is all done through protections in the page table.</p>
</li>
<li>
<p>The solution to this is outlined in the first one, but the gist of it is that privileged instructions will trap to the VMM, which will handle them (i.e. emulate the outcome of the privileged instructions) and return to the OS kernel (except for return to user mode, where we return to the application). This is known as <strong>trap-and-emulate</strong>. Another solution is <strong>dynamic binary translation</strong>, where we translate a privileged instruction to an unprivileged one. But this has very high overhead and is best avoided.</p>
</li>
</ol>
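<p>The syscall forwarding described in the first solution can be sketched as a toy model. This is only an illustration under simplifying assumptions: the names <code class="highlighter-rouge">Vmm</code>, <code class="highlighter-rouge">trapInstallVector</code> and <code class="highlighter-rouge">trapSyscall</code> are hypothetical, and a real VMM manipulates the PC and privilege level rather than calling a function.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">// Toy model of syscall forwarding; every name here is hypothetical.
object TrapAndEmulateSketch {
  type Handler = Int => String

  class Vmm {
    // Guest syscall handler, learned when the guest's privileged
    // "install syscall vector" instruction traps to the VMM.
    private var guestVector: Option[Handler] = None

    // The guest OS runs in user mode, so installing its vector traps here.
    def trapInstallVector(handler: Handler): Unit =
      guestVector = Some(handler)

    // The hardware delivers every syscall to the VMM, which forwards it
    // to the guest OS handler it recorded earlier.
    def trapSyscall(number: Int): String = guestVector match {
      case Some(handler) => handler(number)
      case None          => "no guest handler installed"
    }
  }

  def main(args: Array[String]): Unit = {
    val vmm = new Vmm
    println(vmm.trapSyscall(1)) // before the guest OS has booted
    vmm.trapInstallVector(n => s"guest OS handled syscall $n")
    println(vmm.trapSyscall(1)) // forwarded to the guest handler
  }
}</code></pre></figure>
<p>The point of the sketch is that the VMM never needs to be told where the guest handlers are; it learns this as a side effect of emulating the trapped privileged instruction.</p>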
<p>Note that unprivileged instructions (from an application, but also from the OS) can be run directly, without having to pass through the VMM; they are directly sent to the metal for performance. This is called <strong>limited direct execution</strong>: we let some instructions (but not all!) run directly on the metal.</p>
<h4 id="popekgoldberg-theorem">Popek/Goldberg Theorem</h4>
<p>An instruction is <strong>privileged</strong> if it can only be executed in kernel mode. It is <strong>sensitive</strong> if it behaves differently in kernel and user mode. The Popek/Goldberg theorem states:</p>
<center>A VMM exists for an architecture ⇔ {sensitive} ⊆ {privileged}</center>
<p>Note that trap-and-emulate doesn’t work if some sensitive instructions don’t trap. Indeed, any sensitive instruction that doesn’t trap in user mode is unprivileged, and these exist in common architectures: x86 has 17 sensitive, unprivileged instructions. Writing a VMM is still possible, but it is more complicated.</p>
<p>In this case, we must resort to some workarounds. We could do a binary rewrite of the guest OS to remove sensitive unprivileged instructions, but this is very tricky and has high overhead. More recently though, we have solutions like Intel VT-x and AMD-V (2005, available on all 64-bit processors) which duplicate protection rings to meet the Popek/Goldberg criteria.</p>
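<p>The Popek/Goldberg criterion is a plain subset check, which can be written down as a minimal sketch. The instruction names below are made up for illustration and do not correspond to any real instruction set.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object PopekGoldbergSketch {
  // The criterion holds when every sensitive instruction is also privileged.
  def satisfiesCriterion(sensitive: Set[String], privileged: Set[String]): Boolean =
    sensitive.subsetOf(privileged)

  // Sensitive instructions that would not trap in user mode.
  def nonTrapping(sensitive: Set[String], privileged: Set[String]): Set[String] =
    sensitive.diff(privileged)

  def main(args: Array[String]): Unit = {
    // Hypothetical instruction names, for illustration only.
    val privileged = Set("setPageTableBase", "disableInterrupts")
    val sensitive = Set("setPageTableBase", "readModeFlag")
    println(satisfiesCriterion(sensitive, privileged)) // false
    println(nonTrapping(sensitive, privileged))        // Set(readModeFlag)
  }
}</code></pre></figure>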
<h3 id="virtual-memory">Virtual memory</h3>
<p>To implement virtual memory, we use paging with valid bits.</p>
<p>Remember that in native mode, processes (and the processor) issue virtual addresses. The memory management unit (MMU) produces a physical address from the TLB or page table, which goes to memory. The OS allocates physical memory to processes, and installs page table and TLB entries to do so.</p>
<p>In virtualized mode, the VMM allocates memory between VMs, and installs page tables and TLB entries to do so. The guest OS thinks it’s running on the real machine and has control over all memory of the real machine, but it is in fact running on the virtual machine and only has a limited portion (as allocated by the VMM).</p>
<p>To implement this, we need two levels of memory translation. The virtual address is still the virtual address (<strong>VA</strong>). What we called physical address (<strong>PA</strong>) in native mode becomes the guest physical address (<strong>gPA</strong>) in virtualized mode, and we introduce an additional address, the host physical address (<strong>hPA</strong>).</p>
<p>In native mode, we simply translated VA → PA. In virtualized mode, we translate VA → gPA in the guest OS (using guest TLB and guest page tables), and gPA → hPA in the VMM (using the real TLB and physical page tables). This means that we must have one guest page table per process in the guest OS, which we let reside in the guest OS memory. We also have a physical page table per VM, which resides in VMM memory.</p>
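<p>As a rough sketch of the two translation levels, we can model each page table as a map from page numbers to frame numbers. The page numbers and the names <code class="highlighter-rouge">guestPageTable</code>, <code class="highlighter-rouge">vmPageTable</code> and <code class="highlighter-rouge">translate</code> are assumptions made for this example; real page tables are multi-level structures walked by the MMU.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object TwoLevelTranslationSketch {
  // VA page -> gPA page, kept by the guest OS (one table per guest process).
  val guestPageTable: Map[Int, Int] = Map(0 -> 5, 1 -> 7)
  // gPA page -> hPA page, kept by the VMM (one table per VM).
  val vmPageTable: Map[Int, Int] = Map(5 -> 42, 7 -> 13)

  // Returns None if either level has no mapping
  // (which would be a page fault in a real system).
  def translate(vaPage: Int): Option[Int] =
    guestPageTable.get(vaPage).flatMap(gpaPage => vmPageTable.get(gpaPage))

  def main(args: Array[String]): Unit = {
    println(translate(0)) // Some(42)
    println(translate(2)) // None
  }
}</code></pre></figure>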
<p>The VA → gPA translation is done by the guest OS (remember that it doesn’t change); to allocate a new page to a process, it finds a free frame, and inserts the page-frame mapping in both page table and TLB. These insertions are privileged instructions that trap to the VMM, which will trap-and-emulate. The gPA → hPA translation is done as in the native OS before. To allocate a page to a VM, we find a free frame and insert the page-to-frame mapping in the VM’s page table, and copy the mapping over in the TLB.</p>
<p>This is an expensive process: we need to walk two levels of page tables. But modern processors have <strong>nested page tables</strong> available, which allows us to access the nested levels in parallel and in hardware, which reduces the cost of VM memory management.</p>
<p>Without nested page tables, we have a challenge: we only have one physical TLB for the whole physical machine and all of the VMs! One solution is to implement <strong>shadow page tables</strong>: we keep two copies of the guests’ page tables. The <strong>real copy</strong> is in the hardware (backed by the real TLB) and contains VA → hPA, while the <strong>shadow copy</strong> is in software and contains VA → gPA. The guest updates the shadow copy, and the VMM updates the real copy from the shadow copy.</p>
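<p>To illustrate how the VMM could derive the VA → hPA copy, here is a minimal sketch that composes the guest’s VA → gPA mappings with the VMM’s gPA → hPA mappings. The function name <code class="highlighter-rouge">buildRealCopy</code> and the page numbers are hypothetical, and a real VMM would update entries incrementally rather than rebuild the whole table.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object ShadowPageTableSketch {
  // Composes VA -> gPA (guest) with gPA -> hPA (VMM) into VA -> hPA.
  // Entries whose guest frame has no host frame are left out.
  def buildRealCopy(guestCopy: Map[Int, Int], vmTable: Map[Int, Int]): Map[Int, Int] =
    guestCopy.collect {
      case (va, gpa) if vmTable.contains(gpa) => va -> vmTable(gpa)
    }

  def main(args: Array[String]): Unit = {
    val guestCopy = Map(0 -> 5, 1 -> 7, 2 -> 9) // VA -> gPA, updated by the guest
    val vmTable = Map(5 -> 42, 7 -> 13)         // gPA -> hPA, owned by the VMM
    println(buildRealCopy(guestCopy, vmTable))
  }
}</code></pre></figure>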
CS-206 Parallelism and Concurrency2017-02-22T00:00:00+00:002017-02-22T00:00:00+00:00https://kjaer.io/parcon
<img src="https://kjaer.io/images/hero/epfl-bc.jpg" class="webfeedsFeaturedVisual">
<ul id="markdown-toc">
<li><a href="#introduction" id="markdown-toc-introduction">Introduction</a></li>
<li><a href="#part-1-parallelism" id="markdown-toc-part-1-parallelism">Part 1: Parallelism</a> <ul>
<li><a href="#what-is-parallel-computing" id="markdown-toc-what-is-parallel-computing">What is parallel computing?</a></li>
<li><a href="#parallelism-on-the-jvm" id="markdown-toc-parallelism-on-the-jvm">Parallelism on the JVM</a> <ul>
<li><a href="#definitions" id="markdown-toc-definitions">Definitions</a></li>
<li><a href="#implementation" id="markdown-toc-implementation">Implementation</a></li>
</ul>
</li>
<li><a href="#atomicity" id="markdown-toc-atomicity">Atomicity</a> <ul>
<li><a href="#synchronized-blocks" id="markdown-toc-synchronized-blocks">Synchronized blocks</a></li>
<li><a href="#deadlocks" id="markdown-toc-deadlocks">Deadlocks</a> <ul>
<li><a href="#resolving-deadlocks" id="markdown-toc-resolving-deadlocks">Resolving deadlocks</a></li>
</ul>
</li>
<li><a href="#memory-model" id="markdown-toc-memory-model">Memory model</a></li>
</ul>
</li>
<li><a href="#running-computations-in-parallel" id="markdown-toc-running-computations-in-parallel">Running computations in parallel</a> <ul>
<li><a href="#signature-of-parallel" id="markdown-toc-signature-of-parallel">Signature of parallel</a></li>
<li><a href="#underlying-hardware-architecture-affects-performance" id="markdown-toc-underlying-hardware-architecture-affects-performance">Underlying hardware architecture affects performance</a></li>
</ul>
</li>
<li><a href="#tasks" id="markdown-toc-tasks">Tasks</a></li>
<li><a href="#how-do-we-measure-performance" id="markdown-toc-how-do-we-measure-performance">How do we measure performance?</a> <ul>
<li><a href="#work-and-depth" id="markdown-toc-work-and-depth">Work and depth</a></li>
<li><a href="#asymptotic-analysis" id="markdown-toc-asymptotic-analysis">Asymptotic analysis</a></li>
<li><a href="#empirical-analysis-benchmarking" id="markdown-toc-empirical-analysis-benchmarking">Empirical analysis: Benchmarking</a></li>
</ul>
</li>
<li><a href="#parallelizing-important-algorithms" id="markdown-toc-parallelizing-important-algorithms">Parallelizing important algorithms</a> <ul>
<li><a href="#parallel-merge-sort" id="markdown-toc-parallel-merge-sort">Parallel merge sort</a> <ul>
<li><a href="#copying-array-in-parallel" id="markdown-toc-copying-array-in-parallel">Copying array in parallel</a></li>
</ul>
</li>
<li><a href="#parallel-map" id="markdown-toc-parallel-map">Parallel map</a> <ul>
<li><a href="#comparison-of-arrays-and-immutable-trees" id="markdown-toc-comparison-of-arrays-and-immutable-trees">Comparison of arrays and immutable trees</a></li>
</ul>
</li>
<li><a href="#parallel-reduce" id="markdown-toc-parallel-reduce">Parallel reduce</a></li>
<li><a href="#associative-andor-commutative-operations" id="markdown-toc-associative-andor-commutative-operations">Associative and/or commutative operations</a> <ul>
<li><a href="#making-an-operation-commutative-is-easy" id="markdown-toc-making-an-operation-commutative-is-easy">Making an operation commutative is easy</a></li>
<li><a href="#constructing-associative-operations" id="markdown-toc-constructing-associative-operations">Constructing associative operations</a></li>
</ul>
</li>
<li><a href="#parallel-scan" id="markdown-toc-parallel-scan">Parallel scan</a> <ul>
<li><a href="#on-trees" id="markdown-toc-on-trees">On trees</a></li>
<li><a href="#on-arrays" id="markdown-toc-on-arrays">On arrays</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#data-parallelism" id="markdown-toc-data-parallelism">Data parallelism</a> <ul>
<li><a href="#workload" id="markdown-toc-workload">Workload</a></li>
<li><a href="#parallel-for-loop" id="markdown-toc-parallel-for-loop">Parallel for-loop</a></li>
<li><a href="#non-parallelizable-operations" id="markdown-toc-non-parallelizable-operations">Non-parallelizable operations</a></li>
<li><a href="#parallelizable-operations" id="markdown-toc-parallelizable-operations">Parallelizable operations</a></li>
<li><a href="#parallel-collections" id="markdown-toc-parallel-collections">Parallel collections</a> <ul>
<li><a href="#avoiding-parallel-errors" id="markdown-toc-avoiding-parallel-errors">Avoiding parallel errors</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#data-parallel-abstractions" id="markdown-toc-data-parallel-abstractions">Data-parallel abstractions</a> <ul>
<li><a href="#iterators" id="markdown-toc-iterators">Iterators</a></li>
<li><a href="#splitters" id="markdown-toc-splitters">Splitters</a></li>
<li><a href="#builders" id="markdown-toc-builders">Builders</a></li>
<li><a href="#combiners" id="markdown-toc-combiners">Combiners</a> <ul>
<li><a href="#implementing-combiners" id="markdown-toc-implementing-combiners">Implementing combiners</a></li>
<li><a href="#two-phase-construction" id="markdown-toc-two-phase-construction">Two-phase construction</a></li>
<li><a href="#conc-trees" id="markdown-toc-conc-trees">Conc-Trees</a></li>
<li><a href="#combiners-using-conc-trees" id="markdown-toc-combiners-using-conc-trees">Combiners using Conc-Trees</a></li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><a href="#part-2-concurrent-programming" id="markdown-toc-part-2-concurrent-programming">Part 2: Concurrent programming</a> <ul>
<li><a href="#a-surprising-program" id="markdown-toc-a-surprising-program">A surprising program</a></li>
<li><a href="#overview-of-threads" id="markdown-toc-overview-of-threads">Overview of threads</a> <ul>
<li><a href="#some-more-definitions" id="markdown-toc-some-more-definitions">Some more definitions</a></li>
</ul>
</li>
<li><a href="#monitors" id="markdown-toc-monitors">Monitors</a> <ul>
<li><a href="#memory-model-1" id="markdown-toc-memory-model-1">Memory model</a> <ul>
<li><a href="#volatile-fields" id="markdown-toc-volatile-fields">Volatile fields</a></li>
</ul>
</li>
<li><a href="#executors" id="markdown-toc-executors">Executors</a></li>
<li><a href="#atomic-primitives" id="markdown-toc-atomic-primitives">Atomic primitives</a></li>
</ul>
</li>
<li><a href="#programming-without-locks" id="markdown-toc-programming-without-locks">Programming without locks</a> <ul>
<li><a href="#lazy-values" id="markdown-toc-lazy-values">Lazy values</a></li>
<li><a href="#collections" id="markdown-toc-collections">Collections</a></li>
</ul>
</li>
<li><a href="#futures" id="markdown-toc-futures">Futures</a> <ul>
<li><a href="#synchronous-try" id="markdown-toc-synchronous-try">Synchronous: <code class="highlighter-rouge">Try</code></a></li>
<li><a href="#asynchronous-future" id="markdown-toc-asynchronous-future">Asynchronous: <code class="highlighter-rouge">Future</code></a> <ul>
<li><a href="#recover-and-recoverwith" id="markdown-toc-recover-and-recoverwith">Recover and recoverWith</a></li>
</ul>
</li>
<li><a href="#implementation-of-flatmap-on-future" id="markdown-toc-implementation-of-flatmap-on-future">Implementation of FlatMap on Future</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#part-3-actors" id="markdown-toc-part-3-actors">Part 3: Actors</a> <ul>
<li><a href="#why-actors" id="markdown-toc-why-actors">Why Actors?</a></li>
<li><a href="#what-is-an-actor" id="markdown-toc-what-is-an-actor">What is an Actor?</a> <ul>
<li><a href="#the-actor-trait" id="markdown-toc-the-actor-trait">The Actor Trait</a></li>
</ul>
</li>
<li><a href="#a-simple-stateful-actor" id="markdown-toc-a-simple-stateful-actor">A simple, stateful Actor</a> <ul>
<li><a href="#how-messages-are-sent" id="markdown-toc-how-messages-are-sent">How messages are sent</a></li>
<li><a href="#the-actors-context" id="markdown-toc-the-actors-context">The Actor’s Context</a></li>
<li><a href="#creating-and-stopping-actors" id="markdown-toc-creating-and-stopping-actors">Creating and Stopping Actors</a></li>
</ul>
</li>
<li><a href="#message-processing" id="markdown-toc-message-processing">Message Processing</a> <ul>
<li><a href="#revisiting-the-bank-account-example" id="markdown-toc-revisiting-the-bank-account-example">Revisiting the Bank Account Example</a></li>
<li><a href="#message-delivery-guarantees" id="markdown-toc-message-delivery-guarantees">Message Delivery Guarantees</a></li>
<li><a href="#message-ordering" id="markdown-toc-message-ordering">Message Ordering</a></li>
</ul>
</li>
<li><a href="#designing-actor-systems" id="markdown-toc-designing-actor-systems">Designing Actor Systems</a> <ul>
<li><a href="#actor-based-logging" id="markdown-toc-actor-based-logging">Actor-Based Logging</a></li>
<li><a href="#handling-timeouts" id="markdown-toc-handling-timeouts">Handling Timeouts</a></li>
</ul>
</li>
<li><a href="#testing-actor-systems" id="markdown-toc-testing-actor-systems">Testing Actor Systems</a></li>
<li><a href="#failure-handling-with-actors" id="markdown-toc-failure-handling-with-actors">Failure Handling with Actors</a> <ul>
<li><a href="#strategies" id="markdown-toc-strategies">Strategies</a></li>
<li><a href="#restarts" id="markdown-toc-restarts">Restarts</a></li>
<li><a href="#lifecycle-hooks" id="markdown-toc-lifecycle-hooks">Lifecycle Hooks</a></li>
<li><a href="#lifecycle-monitoring" id="markdown-toc-lifecycle-monitoring">Lifecycle Monitoring</a></li>
<li><a href="#the-chlidren-list" id="markdown-toc-the-chlidren-list">The Children List</a></li>
<li><a href="#the-error-kernel" id="markdown-toc-the-error-kernel">The Error Kernel</a></li>
<li><a href="#eventstream" id="markdown-toc-eventstream">EventStream</a></li>
<li><a href="#unhandled-messages" id="markdown-toc-unhandled-messages">Unhandled Messages</a></li>
<li><a href="#persistent-actor-state" id="markdown-toc-persistent-actor-state">Persistent Actor State</a></li>
</ul>
</li>
<li><a href="#actors-are-distributed" id="markdown-toc-actors-are-distributed">Actors are Distributed</a> <ul>
<li><a href="#the-impact-of-network-communication" id="markdown-toc-the-impact-of-network-communication">The Impact of Network Communication</a></li>
<li><a href="#actor-path" id="markdown-toc-actor-path">Actor Path</a></li>
<li><a href="#clusters" id="markdown-toc-clusters">Clusters</a></li>
<li><a href="#eventual-consistency" id="markdown-toc-eventual-consistency">Eventual Consistency</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#part-4-big-data-analysis-with-scala-and-spark" id="markdown-toc-part-4-big-data-analysis-with-scala-and-spark">Part 4: Big Data Analysis with Scala and Spark</a> <ul>
<li><a href="#data-parallel-to-distributed-data-parallel" id="markdown-toc-data-parallel-to-distributed-data-parallel">Data-Parallel to Distributed Data-Parallel</a> <ul>
<li><a href="#latency" id="markdown-toc-latency">Latency</a></li>
</ul>
</li>
<li><a href="#rdds-sparks-distributed-collections" id="markdown-toc-rdds-sparks-distributed-collections">RDDs, Spark’s Distributed Collections</a> <ul>
<li><a href="#creating-rdds" id="markdown-toc-creating-rdds">Creating RDDs</a></li>
<li><a href="#transformations-and-actions" id="markdown-toc-transformations-and-actions">Transformations and Actions</a></li>
<li><a href="#benefits-of-laziness-for-large-scale-data" id="markdown-toc-benefits-of-laziness-for-large-scale-data">Benefits of laziness for Large-Scale Data</a></li>
<li><a href="#caching-and-persistence" id="markdown-toc-caching-and-persistence">Caching and Persistence</a></li>
</ul>
</li>
<li><a href="#reductions" id="markdown-toc-reductions">Reductions</a></li>
<li><a href="#distributed-key-value-pairs-pair-rdds" id="markdown-toc-distributed-key-value-pairs-pair-rdds">Distributed Key-Value Pairs (Pair RDDs)</a> <ul>
<li><a href="#creating-a-pair-rdd" id="markdown-toc-creating-a-pair-rdd">Creating a Pair RDD</a></li>
<li><a href="#groupbykey" id="markdown-toc-groupbykey"><code class="highlighter-rouge">groupByKey</code></a></li>
<li><a href="#reducebykey" id="markdown-toc-reducebykey"><code class="highlighter-rouge">reduceByKey</code></a></li>
<li><a href="#mapvalues" id="markdown-toc-mapvalues"><code class="highlighter-rouge">mapValues</code></a></li>
<li><a href="#countbykey" id="markdown-toc-countbykey"><code class="highlighter-rouge">countByKey</code></a></li>
<li><a href="#keys" id="markdown-toc-keys"><code class="highlighter-rouge">keys</code></a></li>
<li><a href="#example" id="markdown-toc-example">Example</a></li>
</ul>
</li>
<li><a href="#joins" id="markdown-toc-joins">Joins</a></li>
<li><a href="#shuffles" id="markdown-toc-shuffles">Shuffles</a> <ul>
<li><a href="#partitioning" id="markdown-toc-partitioning">Partitioning</a></li>
<li><a href="#optimizing-with-partitioners" id="markdown-toc-optimizing-with-partitioners">Optimizing with Partitioners</a></li>
<li><a href="#wide-vs-narrow-dependencies" id="markdown-toc-wide-vs-narrow-dependencies">Wide vs Narrow Dependencies</a></li>
</ul>
</li>
<li><a href="#structured-and-unstructured-data" id="markdown-toc-structured-and-unstructured-data">Structured and Unstructured Data</a></li>
<li><a href="#spark-sql" id="markdown-toc-spark-sql">Spark SQL</a> <ul>
<li><a href="#getting-started" id="markdown-toc-getting-started">Getting started</a></li>
<li><a href="#dataframes" id="markdown-toc-dataframes">DataFrames</a> <ul>
<li><a href="#cleaning-data-with-dataframes" id="markdown-toc-cleaning-data-with-dataframes">Cleaning Data with DataFrames</a></li>
<li><a href="#common-actions-on-dataframes" id="markdown-toc-common-actions-on-dataframes">Common actions on DataFrames</a></li>
<li><a href="#joins-on-dataframes" id="markdown-toc-joins-on-dataframes">Joins on DataFrames</a></li>
<li><a href="#optimizations-on-dataframes" id="markdown-toc-optimizations-on-dataframes">Optimizations on DataFrames</a></li>
<li><a href="#limitations" id="markdown-toc-limitations">Limitations</a></li>
</ul>
</li>
<li><a href="#datasets" id="markdown-toc-datasets">Datasets</a> <ul>
<li><a href="#creating-datasets" id="markdown-toc-creating-datasets">Creating Datasets</a></li>
<li><a href="#transformations-on-datasets" id="markdown-toc-transformations-on-datasets">Transformations on Datasets</a></li>
<li><a href="#aggregators" id="markdown-toc-aggregators">Aggregators</a></li>
<li><a href="#dataset-actions" id="markdown-toc-dataset-actions">Dataset Actions</a></li>
<li><a href="#limitations-of-datasets" id="markdown-toc-limitations-of-datasets">Limitations of Datasets</a></li>
</ul>
</li>
<li><a href="#datasets-vs-dataframes-vs-rdds" id="markdown-toc-datasets-vs-dataframes-vs-rdds">Datasets vs DataFrames vs RDDs</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<!-- More -->
<p>These are my notes from the <a href="http://lara.epfl.ch/w/parcon17:top">CS-206 Parallelism and Concurrency course</a>. Prerequisites are:</p>
<ul>
<li><a href="/funprog/">Functional Programming</a></li>
<li><a href="/algorithms/">Algorithms</a></li>
<li>Computer Architecture</li>
</ul>
<p>Please note that these notes won’t be as good or complete as in the previous semester, as some of the lectures in this course were given ex cathedra instead of as a MOOC.</p>
<h2 id="introduction">Introduction</h2>
<p>Almost every desktop, laptop, and mobile device today has multiple processors; it is therefore important to learn how to harness these resources. We’ll see how functional programming applies to parallelization. We’ll also learn how to estimate and measure performance.</p>
<h2 id="part-1-parallelism">Part 1: Parallelism</h2>
<h3 id="what-is-parallel-computing">What is parallel computing?</h3>
<p><em>Parallel computing</em> is a type of computation in which many calculations are performed at the same time. The basic principle is to divide the computation into smaller subproblems, each of which can be solved simultaneously. This is, of course, assuming that parallel hardware is at our disposal, with shared access to memory. Parallel programming is much harder than sequential programming, but we can get significant <em>speedups</em>.</p>
<p>Parallelism and concurrency are closely related concepts:</p>
<ul>
<li><strong>Parallel program</strong>: uses parallel hardware to execute computation more quickly. It is mainly concerned with division into subproblems and optimal use of parallel hardware</li>
<li><strong>Concurrent program</strong>: may or may not run multiple executions at the same time. It is mainly concerned with modularity, responsiveness or maintainability (convenience).</li>
</ul>
<p>The two often overlap; neither is the superset of the other.</p>
<p>Parallelism manifests itself at different granularity levels.</p>
<ul>
<li><strong>Bit-level parallelism</strong>: processing multiple bits of data in parallel</li>
<li><strong>Instruction-level parallelism</strong>: executing different instructions from the same instruction stream in parallel</li>
<li><strong>Task-level parallelism</strong>: executing separate instruction streams in parallel</li>
</ul>
<p>The first two are mainly implemented in hardware or in compilers; as developers, we focus on task-level parallelism.</p>
<h3 id="parallelism-on-the-jvm">Parallelism on the JVM</h3>
<h4 id="definitions">Definitions</h4>
<p>A process is an instance of a program that is executing in the OS. The same program can be started as a process more than once, or even simultaneously in the same OS. The operating system <em>multiplexes</em> many different processes onto a limited number of CPUs, so that each gets <em>time slices</em> of execution. This mechanism is called <em>multitasking</em>.</p>
<p>Two different processes cannot access each other’s memory directly — they
are isolated. Interprocess communication methods exist, but they aren’t particularly straightforward.</p>
<p>Each process can contain multiple independent concurrency units called
<em>threads</em>. They can be started programmatically within the program, and they share the same memory address space — this allows them to exchange information by doing memory read/writes.</p>
<p>Each thread has a program counter and a program stack. JVM threads can’t modify each other’s stack memory; they can only modify the heap memory.</p>
<h4 id="implementation">Implementation</h4>
<p>Each JVM process starts with a <strong>main thread</strong>. To start additional threads:</p>
<ol>
<li>Define a <code class="highlighter-rouge">Thread</code> subclass.</li>
<li>Instantiate a new <code class="highlighter-rouge">Thread</code> object.</li>
<li>Call <code class="highlighter-rouge">start</code> on the <code class="highlighter-rouge">Thread</code> object.</li>
</ol>
<p>Notice that the same class can be used to start multiple threads.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">HelloThread</span> <span class="k">extends</span> <span class="nc">Thread</span> <span class="o">{</span>
  <span class="k">override</span> <span class="k">def</span> <span class="n">run</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
    <span class="n">println</span><span class="o">(</span><span class="s">"Hello world!"</span><span class="o">)</span>
  <span class="o">}</span>
<span class="o">}</span>

<span class="k">val</span> <span class="n">t</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HelloThread</span> <span class="c1">// new thread instance
</span>
<span class="n">t</span><span class="o">.</span><span class="n">start</span><span class="o">()</span> <span class="c1">// start thread
</span><span class="n">t</span><span class="o">.</span><span class="n">join</span><span class="o">()</span> <span class="c1">// wait for its completion</span></code></pre></figure>
<p><code class="highlighter-rouge">t.join()</code> blocks the main thread’s execution until the <code class="highlighter-rouge">t</code> thread is done executing.</p>
<p>Let’s look at a more complex example:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">HelloThread</span> <span class="k">extends</span> <span class="nc">Thread</span> <span class="o">{</span>
  <span class="k">override</span> <span class="k">def</span> <span class="n">run</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
    <span class="n">println</span><span class="o">(</span><span class="s">"Hello"</span><span class="o">)</span>
    <span class="n">println</span><span class="o">(</span><span class="s">"world!"</span><span class="o">)</span>
  <span class="o">}</span>
<span class="o">}</span>

<span class="k">def</span> <span class="n">main</span><span class="o">()</span><span class="k">:</span> <span class="kt">Unit</span> <span class="o">=</span> <span class="o">{</span>
  <span class="k">val</span> <span class="n">t</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HelloThread</span>
  <span class="k">val</span> <span class="n">s</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">HelloThread</span>
  <span class="n">t</span><span class="o">.</span><span class="n">start</span><span class="o">()</span>
  <span class="n">s</span><span class="o">.</span><span class="n">start</span><span class="o">()</span>
  <span class="n">t</span><span class="o">.</span><span class="n">join</span><span class="o">()</span>
  <span class="n">s</span><span class="o">.</span><span class="n">join</span><span class="o">()</span>
<span class="o">}</span></code></pre></figure>
<p>Running it multiple times might yield the following output:</p>
<figure class="highlight"><pre><code class="language-text" data-lang="text">Hello
world!
Hello
world!

Hello
world!
Hello
world!

Hello
Hello
world!
world!</code></pre></figure>
<p>On the first two executions, the threads happened to execute linearly; first <code class="highlighter-rouge">t</code>, then <code class="highlighter-rouge">s</code>. But on the third attempt, the first thread printed <code class="highlighter-rouge">Hello</code>, but then the second thread kicked in, also printed <code class="highlighter-rouge">Hello</code> — before the first had time to print out <code class="highlighter-rouge">world!</code>, and then they both completed.</p>
<h3 id="atomicity">Atomicity</h3>
<p>The above shows that <strong>two parallel threads can overlap arbitrarily</strong>. However, we sometimes want to ensure that a sequence of statements executes at once, as if they were just one statement, meaning that we don’t want them to overlap. This is called atomicity.</p>
<p>An operation is <em>atomic</em> if it appears as if it occurred instantaneously from the point of view of other threads.</p>
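<p>The implementation of <code class="highlighter-rouge">getUniqueId()</code> below isn’t atomic, as it suffers from the same problem as the hello world example above: two threads can both read the same value of <code class="highlighter-rouge">uidCount</code> before either writes the incremented value back, in which case both return the same ID.</p>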
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">private</span> <span class="k">var</span> <span class="n">uidCount</span> <span class="k">=</span> <span class="mi">0L</span> <span class="c1">// 0 as a long
</span><span class="k">def</span> <span class="n">getUniqueId</span><span class="o">()</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="o">{</span>
  <span class="n">uidCount</span> <span class="k">=</span> <span class="n">uidCount</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="n">uidCount</span>
<span class="o">}</span></code></pre></figure>
<h4 id="synchronized-blocks">Synchronized blocks</h4>
<p>How can we secure it from this problem? How do we get it to execute atomically?</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">private</span> <span class="k">val</span> <span class="n">x</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">AnyRef</span> <span class="o">{}</span>
<span class="k">private</span> <span class="k">var</span> <span class="n">uidCount</span> <span class="k">=</span> <span class="mi">0L</span>
<span class="k">def</span> <span class="n">getUniqueId</span><span class="o">()</span><span class="k">:</span> <span class="kt">Long</span> <span class="o">=</span> <span class="n">x</span><span class="o">.</span><span class="n">synchronized</span> <span class="o">{</span>
  <span class="n">uidCount</span> <span class="k">=</span> <span class="n">uidCount</span> <span class="o">+</span> <span class="mi">1</span>
  <span class="n">uidCount</span>
<span class="o">}</span></code></pre></figure>
<p>The <code class="highlighter-rouge">synchronized</code> block is used to achieve atomicity. Code blocks after a <code class="highlighter-rouge">synchronized</code> call on an object <code class="highlighter-rouge">x</code> are never executed on two threads at once. The JVM ensures this by storing an object called the <em>monitor</em> in each object. At most one thread can own the monitor at any particular time, and releases it when it’s done executing.</p>
<p><code class="highlighter-rouge">synchronized</code> blocks can even be nested.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">class</span> <span class="nc">Account</span><span class="o">(</span><span class="k">private</span> <span class="k">var</span> <span class="n">amount</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">0</span><span class="o">)</span> <span class="o">{</span>
  <span class="k">def</span> <span class="n">transfer</span><span class="o">(</span><span class="n">target</span><span class="k">:</span> <span class="kt">Account</span><span class="o">,</span> <span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span>
    <span class="k">this</span><span class="o">.</span><span class="n">synchronized</span> <span class="o">{</span> <span class="c1">// synchronized block on source account
</span>      <span class="n">target</span><span class="o">.</span><span class="n">synchronized</span> <span class="o">{</span> <span class="c1">// and on target account
</span>        <span class="k">this</span><span class="o">.</span><span class="n">amount</span> <span class="o">-=</span> <span class="n">n</span>
        <span class="n">target</span><span class="o">.</span><span class="n">amount</span> <span class="o">+=</span> <span class="n">n</span>
      <span class="o">}</span>
    <span class="o">}</span>
<span class="o">}</span></code></pre></figure>
<p>This way, the thread gets a monitor on account A, and then on account B. Once it has monitors on both, it can transfer the amount from A to B. Another thread can do this with C and D in parallel.</p>
<h4 id="deadlocks">Deadlocks</h4>
<p>Sometimes though, this may cause the code to freeze, or to <em>deadlock</em>. This is a scenario in which two or more threads compete for resources (such as monitor ownership) and wait for each other to finish without releasing the resources they have already acquired.</p>
<p>The following code can cause a deadlock, depending on how the two threads are interleaved:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">val</span> <span class="n">a</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Account</span><span class="o">(</span><span class="mi">50</span><span class="o">)</span>
<span class="k">val</span> <span class="n">b</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">Account</span><span class="o">(</span><span class="mi">70</span><span class="o">)</span>
<span class="c1">// thread T1
</span><span class="n">a</span><span class="o">.</span><span class="n">transfer</span><span class="o">(</span><span class="n">b</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span>
<span class="c1">// thread T2
</span><span class="n">b</span><span class="o">.</span><span class="n">transfer</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="mi">10</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
<p>Suppose <code class="highlighter-rouge">T1</code> gets the monitor for <code class="highlighter-rouge">a</code> and <code class="highlighter-rouge">T2</code> gets the monitor for <code class="highlighter-rouge">b</code> before either thread has acquired its second monitor. Each thread then waits for the other to release the monitor it needs, leaving us in a deadlock.</p>
<h5 id="resolving-deadlocks">Resolving deadlocks</h5>
<p>One approach is to always acquire resources in the same order. This assumes an ordering relationship on the resources. In our example, we can simply assign unique IDs to the accounts, and order our <code class="highlighter-rouge">synchronized</code> calls according to this ID.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">val</span> <span class="n">uid</span> <span class="k">=</span> <span class="n">getUniqueId</span><span class="o">()</span>
<span class="k">private</span> <span class="k">def</span> <span class="n">lockAndTransfer</span><span class="o">(</span><span class="n">target</span><span class="k">:</span> <span class="kt">Account</span><span class="o">,</span> <span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span>
<span class="k">this</span><span class="o">.</span><span class="n">synchronized</span> <span class="o">{</span>
<span class="n">target</span><span class="o">.</span><span class="n">synchronized</span> <span class="o">{</span>
<span class="k">this</span><span class="o">.</span><span class="n">amount</span> <span class="o">-=</span> <span class="n">n</span>
<span class="n">target</span><span class="o">.</span><span class="n">amount</span> <span class="o">+=</span> <span class="n">n</span>
<span class="o">}</span>
<span class="o">}</span>
<span class="k">def</span> <span class="n">transfer</span><span class="o">(</span><span class="n">target</span><span class="k">:</span> <span class="kt">Account</span><span class="o">,</span> <span class="n">n</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span>
<span class="k">if</span> <span class="o">(</span><span class="k">this</span><span class="o">.</span><span class="n">uid</span> <span class="o"><</span> <span class="n">target</span><span class="o">.</span><span class="n">uid</span><span class="o">)</span> <span class="k">this</span><span class="o">.</span><span class="n">lockAndTransfer</span><span class="o">(</span><span class="n">target</span><span class="o">,</span> <span class="n">n</span><span class="o">)</span>
<span class="k">else</span> <span class="n">target</span><span class="o">.</span><span class="n">lockAndTransfer</span><span class="o">(</span><span class="k">this</span><span class="o">,</span> <span class="o">-</span><span class="n">n</span><span class="o">)</span></pre></td></tr></tbody></table></code></pre></figure>
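<p>Put together, the pieces above can be sketched as a small self-contained program. The <code class="highlighter-rouge">Bank</code> and <code class="highlighter-rouge">TransferDemo</code> objects, the <code class="highlighter-rouge">balance</code> accessor and the <code class="highlighter-rouge">simulate</code> method are assumptions added for illustration; only <code class="highlighter-rouge">Account</code>, <code class="highlighter-rouge">transfer</code>, <code class="highlighter-rouge">lockAndTransfer</code> and <code class="highlighter-rouge">getUniqueId</code> come from the notes.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object Bank {
  private val lock = new AnyRef {}
  private var uidCount = 0L
  def getUniqueId(): Long = lock.synchronized {
    uidCount = uidCount + 1
    uidCount
  }
}

class Account(private var amount: Int = 0) {
  val uid: Long = Bank.getUniqueId()

  private def lockAndTransfer(target: Account, n: Int): Unit =
    this.synchronized {
      target.synchronized {
        this.amount -= n
        target.amount += n
      }
    }

  // The account with the smaller uid is always locked first,
  // whichever direction the money flows in.
  def transfer(target: Account, n: Int): Unit =
    if (target.uid > this.uid) this.lockAndTransfer(target, n)
    else target.lockAndTransfer(this, -n)

  // Hypothetical accessor, added so that the result can be inspected.
  def balance: Int = this.synchronized { amount }
}

object TransferDemo {
  // Two threads transfer in opposite directions, `rounds` times each.
  def simulate(rounds: Int): (Int, Int) = {
    val a = new Account(50)
    val b = new Account(70)
    val t1 = new Thread {
      override def run(): Unit = (1 to rounds).foreach(_ => a.transfer(b, 10))
    }
    val t2 = new Thread {
      override def run(): Unit = (1 to rounds).foreach(_ => b.transfer(a, 10))
    }
    t1.start(); t2.start()
    t1.join(); t2.join()
    (a.balance, b.balance)
  }
}</code></pre></figure>
<p>Both threads end up locking <code class="highlighter-rouge">a</code> before <code class="highlighter-rouge">b</code>, since <code class="highlighter-rouge">a</code> was created first and has the smaller <code class="highlighter-rouge">uid</code>. Neither thread can therefore hold one monitor while waiting for the other, and the opposite transfers cancel out.</p>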
<h4 id="memory-model">Memory model</h4>
<p>A <em>memory model</em> is a set of rules describing how threads interact when accessing shared memory. The Java Memory Model is the memory model for the JVM. There are many rules, but the ones we chose to remember in the context of this course are:</p>
<ol>
<li>Two threads writing to separate locations in memory do not need synchronization.</li>
<li>A thread X that calls <code class="highlighter-rouge">join</code> on another thread Y is guaranteed to observe all the writes by thread Y after <code class="highlighter-rouge">join</code> returns. Note that if we don’t call <code class="highlighter-rouge">join</code>, there’s no guarantee that X will see any of Y’s changes when it reads in memory.</li>
</ol>
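<p>As a rough illustration of both rules, the sketch below has two threads write to two different array elements, and reads the results only after calling <code class="highlighter-rouge">join</code> on each thread. The names <code class="highlighter-rouge">JoinDemo</code> and <code class="highlighter-rouge">compute</code>, as well as the two computations, are made up for this example.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object JoinDemo {
  def compute(): (Int, Int) = {
    // Two separate memory locations, one per thread (rule 1).
    val results = new Array[Int](2)
    val t1 = new Thread {
      override def run(): Unit = { results(0) = (1 to 100).sum }
    }
    val t2 = new Thread {
      override def run(): Unit = { results(1) = (1 to 10).product }
    }
    t1.start(); t2.start()
    // Once join returns, the writes made by that thread are visible here (rule 2).
    t1.join(); t2.join()
    (results(0), results(1))
  }
}</code></pre></figure>
<p>Reading <code class="highlighter-rouge">results</code> before the two <code class="highlighter-rouge">join</code> calls would give no guarantee about which values are observed.</p>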
<p>We will not be using threads and the <code class="highlighter-rouge">synchronized</code> primitive directly in the remainder of the course. However, the methods in the course are based on these, and knowledge about them is indeed useful.</p>
<h3 id="running-computations-in-parallel">Running computations in parallel</h3>
<p>How can we run the following code in parallel?</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="n">pNormTwoPart</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Int</span><span class="o">],</span> <span class="n">p</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">m</span> <span class="k">=</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">val</span> <span class="o">(</span><span class="n">sum1</span><span class="o">,</span> <span class="n">sum2</span><span class="o">)</span> <span class="k">=</span> <span class="o">(</span><span class="n">sumSegment</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">m</span><span class="o">),</span>
<span class="n">sumSegment</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="n">m</span><span class="o">,</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span><span class="o">))</span>
<span class="n">power</span><span class="o">(</span><span class="n">sum1</span> <span class="o">+</span> <span class="n">sum2</span><span class="o">,</span> <span class="mi">1</span><span class="o">/</span><span class="n">p</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
<p>We just wrap the two calls in <code class="highlighter-rouge">parallel</code>, which runs both computations in parallel and returns the pair of results:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="n">pNormTwoPart</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Int</span><span class="o">],</span> <span class="n">p</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">m</span> <span class="k">=</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span> <span class="o">/</span> <span class="mi">2</span>
<span class="k">val</span> <span class="o">(</span><span class="n">sum1</span><span class="o">,</span> <span class="n">sum2</span><span class="o">)</span> <span class="k">=</span> <span class="n">parallel</span><span class="o">(</span><span class="n">sumSegment</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">m</span><span class="o">),</span>
<span class="n">sumSegment</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="n">m</span><span class="o">,</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span><span class="o">))</span>
<span class="n">power</span><span class="o">(</span><span class="n">sum1</span> <span class="o">+</span> <span class="n">sum2</span><span class="o">,</span> <span class="mi">1</span><span class="o">/</span><span class="n">p</span><span class="o">)</span>
<span class="o">}</span></pre></td></tr></tbody></table></code></pre></figure>
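<p>The <code class="highlighter-rouge">parallel</code> construct is not defined in these notes. To give an idea of what such a construct might do, here is a minimal sketch built directly on threads and <code class="highlighter-rouge">join</code>. The signature with two by-name parameters, the object name <code class="highlighter-rouge">ParallelSketch</code> and the error handling are assumptions; an actual implementation may well rely on a thread pool rather than spawning a new thread per call.</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala">object ParallelSketch {
  // Both arguments are passed by name, so neither is evaluated
  // before `parallel` decides where to run it.
  def parallel[A, B](taskA: => A, taskB: => B): (A, B) = {
    var resultB: Option[B] = None
    var failure: Option[Throwable] = None
    val worker = new Thread {
      override def run(): Unit =
        try resultB = Some(taskB)
        catch { case e: Throwable => failure = Some(e) }
    }
    worker.start()
    val resultA = taskA // evaluated on the calling thread in the meantime
    worker.join()       // after join, the worker's writes are visible
    failure.foreach(e => throw e) // rethrow a failure of taskB, if any
    (resultA, resultB.get)
  }
}</code></pre></figure>
<p>The by-name parameters are what make this work: with ordinary by-value parameters, both arguments would already have been computed sequentially before the body of <code class="highlighter-rouge">parallel</code> started.</p>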
<p>Recursion works very well with parallelism. We can for instance spin up an arbitrary number of threads:</p>
<figure class="highlight"><pre><code class="language-scala" data-lang="scala"><table class="rouge-table"><tbody><tr><td class="code"><pre><span class="k">def</span> <span class="n">pNormRec</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Int</span><span class="o">],</span> <span class="n">p</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span>
<span class="n">power</span><span class="o">(</span><span class="n">segmentRec</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="mi">0</span><span class="o">,</span> <span class="n">a</span><span class="o">.</span><span class="n">length</span><span class="o">),</span> <span class="mi">1</span><span class="o">/</span><span class="n">p</span><span class="o">)</span>
<span class="c1">// like sumSegment but parallel
</span><span class="k">def</span> <span class="n">segmentRec</span><span class="o">(</span><span class="n">a</span><span class="k">:</span> <span class="kt">Array</span><span class="o">[</span><span class="kt">Int</span><span class="o">],</span> <span class="n">p</span><span class="k">:</span> <span class="kt">Double</span><span class="o">,</span> <span class="n">s</span><span class="k">:</span> <span class="kt">Int</span><span class="o">,</span> <span class="n">t</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="o">{</span>
<span class="k">if</span> <span class="o">(</span><span class="n">t</span> <span class="o">-</span> <span class="n">s</span> <span class="o"><</span> <span class="n">threshold</span><span class="o">)</span>
<span class="n">sumSegment</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="n">s</span><span class="o">,</span> <span class="n">t</span><span class="o">)</span> <span class="c1">// small segment: do it sequentially
</span> <span class="k">else</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">m</span> <span class="k">=</span> <span class="n">s</span> <span class="o">+</span> <span class="o">(</span><span class="n">t</span> <span class="o">-</span> <span class="n">s</span><span class="o">)/</span><span class="mi">2</span>
<span class="k">val</span> <span class="o">(</span><span class="n">sum1</span><span class="o">,</span> <span class="n">sum2</span><span class="o">)</span> <span class="k">=</span> <span class="n">parallel</span><span class="o">(</span><span class="n">segmentRec</span><span class="o">(</span><span class="n">a</span><span class="o">,</span> <span class="n">p</span><span class="o">,</span> <span class="n">s</span><span class="o">,</span> <span class="n">m</span><span class="o">),</span>
<span class="n">segment