# CS101/3 - Cognitive Science and Artificial Intelligence
## Info
- **Lecture**
- Mondays 14:00-15:00 RC426
- **Labs:**
- LT1105, LT1201, LT1221, LT1320 :: Mondays 15:00-17:00
- **Demonstrators:**
- Adel Dadaa <[adel.dadaa@strath.ac.uk](mailto:adel.dadaa@strath.ac.uk)>
- Pat Prochacki <[pat.prochacki@strath.ac.uk](mailto:pat.prochacki@strath.ac.uk)>
- Tochukwu Umeasiegbu <[tochukwu.umeasiegbu@strath.ac.uk](mailto:tochukwu.umeasiegbu@strath.ac.uk)>
- **Links**
- [Course Mattermost Channel](https://mattermost.cis.strath.ac.uk/learning/channels/cs101-22-24)
## Marking Scheme
*Final due date for all assignments: 26 February*
- 10% :: Participation
- 30% :: Assignment 1 - [Yak Shaving](./cs101-csai-lec1.pdf)
  - 20% :: Create a git repository with a file, and share it
  - 10% :: Put a transcript of a session with the Emacs doctor in that file
- 30% :: Assignment 2 - [Probability and Text](./cs101-csai-lec1.pdf)
  - 10% :: Write a program to output random characters (a minimal sketch follows this list)
  - 10% :: Write a program that, given a character, predicts the next character
  - 10% :: Write a program to output a sequence of characters
  - 10% (bonus) :: Write a program that outputs a sequence of characters conditional on the previous two characters
- 30% :: Assignment 3 - [Stochastic Parrot](./cs101-csai-assignment3.md)
  - 10% :: Write a program to output random words
  - 10% :: Write a program that, given a word, predicts the next word
  - 10% :: Write a program to output a sequence of words
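
For orientation, here is a minimal sketch of the kind of program the "output random characters" item asks for. The choice of alphabet, the helper name `letter_probs`, and the use of `random.choices` are illustrative assumptions, not a required approach.

```python
#!/usr/bin/env python3
"""Print one random character drawn from the letter distribution of a text file."""
import sys
import random
from string import ascii_lowercase

ALPHABET = ascii_lowercase + " "   # assumed alphabet: a-z plus space

def letter_probs(filename):
    """Map each character in ALPHABET to its relative frequency in the file."""
    counts = {}
    with open(filename, encoding="utf-8") as fp:
        for c in fp.read().lower():
            if c in ALPHABET:
                counts[c] = counts.get(c, 0) + 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

if __name__ == "__main__":
    probs = letter_probs(sys.argv[1])
    # random.choices draws one item weighted by the given probabilities
    print(random.choices(list(probs), weights=list(probs.values()))[0])
```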
## Topics
- Yak Shaving - Software Engineering Tooling [(PDF)](./cs101-csai-lec1.pdf)
- Some Philosophical Experiments
- Symbol Manipulation and Logic
- Probability and Text prediction
- Vector Spaces and Word embedding
## Python Quick-Start Guide
[![Python quick-start video](https://img.youtube.com/vi/r54u4z_qay0/0.jpg)](https://www.youtube.com/watch?v=r54u4z_qay0)
[Setting up Python - Write-Up](https://gitlab.cis.strath.ac.uk/xgb21195/cs101-csai/-/blob/main/setup.md?ref_type=heads)
#+title: CS101/3 - Cognitive Science and Artificial Intelligence
* Info
** Lecture
Mondays 14:00-15:00 RC426
** Labs:
- LT1105, LT1201, LT1221, LT1320 :: Mondays 15:00-17:00
*** Demonstrators:
- Adel Dadaa <[[mailto:adel.dadaa@strath.ac.uk][adel.dadaa@strath.ac.uk]]>
- Pat Prochacki <[[mailto:pat.prochacki@strath.ac.uk][pat.prochacki@strath.ac.uk]]>
- Tochukwu Umeasiegbu <[[mailto:tochukwu.umeasiegbu@strath.ac.uk][tochukwu.umeasiegbu@strath.ac.uk]]>
** Links
- [[https://mattermost.cis.strath.ac.uk/learning/channels/cs101-22-24][Course Mattermost Channel]]
* Marking Scheme
- 10% :: Participation
- 30% :: Assignment 1 - [[./cs101-csai-lec1.pdf][Yak Shaving]]
  - 20% :: Create a git repository with a file, and share it
  - 10% :: Put a transcript of a session with the Emacs doctor in that file
- 30% :: Assignment 2 - [[./cs101-csai-lec1.pdf][Probability and Text]]
  - 10% :: Write a program to output random characters
  - 10% :: Write a program that, given a character, predicts the next character
  - 10% :: Write a program to output a sequence of characters
  - 10% (bonus) :: Write a program that outputs a sequence of characters conditional on the previous two characters
- 30% :: Assignment 3
* Topics
1. Yak Shaving - Software Engineering Tooling [[./cs101-csai-lec1.pdf][(PDF)]]
2. Some Philosophical Experiments
3. Symbol Manipulation and Logic
4. Probability and Text prediction
5. Vector Spaces and Word embedding
# CS101/3 - CS/AI Assignment 3 - The Stochastic Parrot
Assignment 3 is just like Assignment 2, but with words instead of letters.
You may wish to use the Python NLTK (https://www.nltk.org), the Natural Language
Toolkit, to help break the text up into tokens (words).
You may adapt the helper functions from Assignment 2 to work with words
rather than letters, or you may write your own from scratch. A minimal
tokenisation sketch follows below.
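
The sketch below shows one way to split a file into word tokens with NLTK. The helper name `file2words` is an illustrative assumption, and you may need to download the `punkt` tokenizer data once (e.g. `nltk.download('punkt')`) before `word_tokenize` will run.

```python
# Minimal sketch: split a text file into lower-cased word tokens with NLTK.
# Assumes the punkt tokenizer data has already been downloaded.
from nltk.tokenize import word_tokenize

def file2words(filename):
    """Return the list of lower-cased tokens in the file."""
    with open(filename, encoding="utf-8") as fp:
        return [w.lower() for w in word_tokenize(fp.read())]

if __name__ == "__main__":
    import sys
    print(file2words(sys.argv[1])[:20])   # show the first 20 tokens
```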
Optionally, instead of doing this programming exercise, you may write a
short essay of no more than 2000 words on one of the topics in the
philosophy of mind that we have touched on:
- Turing's Imitation Game
- Dneprov's Game (or, equivalently, Searle's Chinese Room)
- Logical Behaviorism or Functionalism
- The Engineering End-Run
Assignment 3 is worth 30% in total.
## 3a word probabilities - 10%
Given an input text, compute the word probabilities and generate a word:

```
xgb21195@cafe:~$ ./assignment-3a example.txt
the
```
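
One possible shape for 3a, building on the tokenisation sketch above. The helper name `word_probs` and the use of `random.choices` are illustrative assumptions, not a prescribed solution.

```python
#!/usr/bin/env python3
# Sketch for 3a: print one word drawn from the word distribution of the input text.
import sys
import random
from collections import Counter
from nltk.tokenize import word_tokenize

def word_probs(filename):
    """Map each word in the file to its relative frequency."""
    with open(filename, encoding="utf-8") as fp:
        counts = Counter(w.lower() for w in word_tokenize(fp.read()))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

if __name__ == "__main__":
    probs = word_probs(sys.argv[1])
    print(random.choices(list(probs), weights=list(probs.values()))[0])
```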
## 3b conditional word probabilities - 10%
Given an input text and a word, generate the next word according to the
conditional probabilities in the text:

```
xgb21195@cafe:~$ ./assignment-3b example.txt the
cat
```
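
A minimal sketch of one way to approach 3b, assuming NLTK tokenisation as above. The helper name `next_word_counts` and the error behaviour for words with no successor are illustrative choices, not requirements.

```python
#!/usr/bin/env python3
# Sketch for 3b: given a text and a word, print a plausible next word, sampled
# from the conditional distribution of words that follow it in the text.
import sys
import random
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

def next_word_counts(filename):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    with open(filename, encoding="utf-8") as fp:
        words = [w.lower() for w in word_tokenize(fp.read())]
    for prev, cur in zip(words, words[1:]):
        followers[prev][cur] += 1
    return followers

if __name__ == "__main__":
    filename, word = sys.argv[1], sys.argv[2].lower()
    options = next_word_counts(filename)[word]
    if not options:
        sys.exit(f"'{word}' is never followed by another word in {filename}")
    print(random.choices(list(options), weights=list(options.values()))[0])
```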
## 3c a stochastic parrot - 10%
Given an input text, generate a sentence of a particular length according
to the conditional word distributions:

```
xgb21195@cafe:~$ ./assignment-3c example.txt 10
the cat ate a burrito that is not a butterfly
```
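
A minimal sketch of one way to approach 3c, chaining the conditional sampling from 3b. Restarting from a random word when a dead end is reached is an illustrative choice, not a requirement.

```python
#!/usr/bin/env python3
# Sketch for 3c: generate a sequence of N words, each sampled from the
# conditional distribution of words that follow the previous word in the text.
import sys
import random
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

def next_word_counts(filename):
    """Return (word -> Counter of following words, list of all tokens)."""
    followers = defaultdict(Counter)
    with open(filename, encoding="utf-8") as fp:
        words = [w.lower() for w in word_tokenize(fp.read())]
    for prev, cur in zip(words, words[1:]):
        followers[prev][cur] += 1
    return followers, words

if __name__ == "__main__":
    filename, length = sys.argv[1], int(sys.argv[2])
    followers, words = next_word_counts(filename)
    word = random.choice(words)          # start from a random word in the text
    out = [word]
    while len(out) < length:
        options = followers.get(word)
        if not options:                  # dead end: restart from a random word
            word = random.choice(words)
        else:
            word = random.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    print(" ".join(out))
```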
@@ -22,7 +22,9 @@
 *This is only slightly more than half the class!*
-Marking will happen this week.
+Marking will begin this week.
 If you are having trouble, *ask a demonstrator for help*.
+If you do not do this assignment, you will not be able to submit any
+of the other assignments and you will get *0* for this module.
New binary files (no text preview):
- img/idlewin.png (34.3 KiB)
- img/pycharmwin.png (123 KiB)
- img/pythoncmd.png (63.5 KiB)
- img/vscodewin.png (46.2 KiB)
@@ -7,14 +7,15 @@ def file2prob(filename):
     Read a file and return a dictionary of letters and
     their probabilities
     """
-    letter_dict = { c: 0 for c in letters }
+    letter_dict = {}
     letter_total = 0
-    with open(filename) as fp:
+    with open(filename, encoding="utf-8") as fp:
         for c in fp.read():
-            if c.lower() not in letter_dict:
+            if c.lower() not in letters:
                 continue
-            letter_dict[c.lower()] += 1
+            c = c.lower()
+            letter_dict[c] = letter_dict.get(c, 0) + 1
             letter_total += 1
     probs = { c: letter_dict[c]/letter_total for c in letter_dict }
@@ -27,22 +28,23 @@ def file2pairs(filename):
     conditional probability of a letter given its
     predecessor.
     """
-    letter_dict = { c: { a: 0 for a in letters }
-                    for c in letters }
+    letter_dict = {}
     previous = None
-    with open(filename) as fp:
+    with open(filename, encoding="utf-8") as fp:
         for c in fp.read():
-            if c not in letter_dict:
-                continue
+            c = c.lower()
+            if c not in letters:
+                continue
             if previous is None:
                 previous = c
                 continue
-            letter_dict[previous][c] += 1
+            d = letter_dict.setdefault(previous, {})
+            d[c] = d.get(c, 0) + 1
             previous = c
     probs = { c: { d: letter_dict[c][d]/sum(letter_dict[c].values())
-                   for d in letters } for c in letters }
+                   for d in letter_dict[c] } for c in letter_dict }
     return probs
 if __name__ == '__main__':
aabbcc
abcd
abcda
Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind. On behalf of the future, I ask you of the past to leave us alone. You are not welcome among us. You have no sovereignty where we gather.
We have no elected government, nor are we likely to have one, so I address you with no greater authority than that with which liberty itself always speaks. I declare the global social space we are building to be naturally independent of the tyrannies you seek to impose on us. You have no moral right to rule us nor do you possess any methods of enforcement we have true reason to fear.
Governments derive their just powers from the consent of the governed. You have neither solicited nor received ours. We did not invite you. You do not know us, nor do you know our world. Cyberspace does not lie within your borders. Do not think that you can build it, as though it were a public construction project. You cannot. It is an act of nature and it grows itself through our collective actions.
You have not engaged in our great and gathering conversation, nor did you create the wealth of our marketplaces. You do not know our culture, our ethics, or the unwritten codes that already provide our society more order than could be obtained by any of your impositions.
You claim there are problems among us that you need to solve. You use this claim as an excuse to invade our precincts. Many of these problems don't exist. Where there are real conflicts, where there are wrongs, we will identify them and address them by our means. We are forming our own Social Contract. This governance will arise according to the conditions of our world, not yours. Our world is different.
Cyberspace consists of transactions, relationships, and thought itself, arrayed like a standing wave in the web of our communications. Ours is a world that is both everywhere and nowhere, but it is not where bodies live.
We are creating a world that all may enter without privilege or prejudice accorded by race, economic power, military force, or station of birth.
We are creating a world where anyone, anywhere may express his or her beliefs, no matter how singular, without fear of being coerced into silence or conformity.
Your legal concepts of property, expression, identity, movement, and context do not apply to us. They are all based on matter, and there is no matter here.
Our identities have no bodies, so, unlike you, we cannot obtain order by physical coercion. We believe that from ethics, enlightened self-interest, and the commonweal, our governance will emerge. Our identities may be distributed across many of your jurisdictions. The only law that all our constituent cultures would generally recognize is the Golden Rule. We hope we will be able to build our particular solutions on that basis. But we cannot accept the solutions you are attempting to impose.
In the United States, you have today created a law, the Telecommunications Reform Act, which repudiates your own Constitution and insults the dreams of Jefferson, Washington, Mill, Madison, DeToqueville, and Brandeis. These dreams must now be born anew in us.
You are terrified of your own children, since they are natives in a world where you will always be immigrants. Because you fear them, you entrust your bureaucracies with the parental responsibilities you are too cowardly to confront yourselves. In our world, all the sentiments and expressions of humanity, from the debasing to the angelic, are parts of a seamless whole, the global conversation of bits. We cannot separate the air that chokes from the air upon which wings beat.
In China, Germany, France, Russia, Singapore, Italy and the United States, you are trying to ward off the virus of liberty by erecting guard posts at the frontiers of Cyberspace. These may keep out the contagion for a small time, but they will not work in a world that will soon be blanketed in bit-bearing media.
Your increasingly obsolete information industries would perpetuate themselves by proposing laws, in America and elsewhere, that claim to own speech itself throughout the world. These laws would declare ideas to be another industrial product, no more noble than pig iron. In our world, whatever the human mind may create can be reproduced and distributed infinitely at no cost. The global conveyance of thought no longer requires your factories to accomplish.
These increasingly hostile and colonial measures place us in the same position as those previous lovers of freedom and self-determination who had to reject the authorities of distant, uninformed powers. We must declare our virtual selves immune to your sovereignty, even as we continue to consent to your rule over our bodies. We will spread ourselves across the Planet so that no one can arrest our thoughts.
We will create a civilization of the Mind in Cyberspace. May it be more humane and fair than the world your governments have made before.
../lec2
\ No newline at end of file
#!/bin/bash
# Mark Assignment 1 for a single student repository.
CS=/home/xgb21195/teaching/cs101
TOK=`cat ~/.gitlab`   # GitLab token (not used in this script)

mark1 () {
    date 1>&2
    echo "Marking Assignment 1 for ${dsname}" 1>&2
    echo "Marking Assignment 1"; echo
    marks=0
    # Count and scan all non-hidden files in the current directory.
    nfiles=`find . -type f | sed '/^\.\/\./d' | wc -l`
    find . -type f | sed '/^\.\/\./d' | while read f; do
        touch ${marking}/assignment-1.files
        if grep -q 'I am the psychotherapist. Please, describe your problems.' "$f"; then
            touch ${marking}/assignment-1.emacs
        fi
    done
    if test -f ${marking}/assignment-1.files; then
        echo " Found some files... 20/20 marks"
        marks=20
    else
        echo " Found no files... 0/20 marks"
    fi
    if test -f ${marking}/assignment-1.emacs; then
        echo " Found an interaction with the Emacs doctor... 10/10 marks"
        marks=$(($marks + 10))
    else
        echo " Found no interaction with the Emacs doctor... 0/10 marks"
    fi
    echo
    echo "${marks}/30 marks in total"
    echo "${marks} marks" 1>&2
}

dsname="$1"
student="${CS}/students/${dsname}"
marking="${CS}/marking/${dsname}"
mark1
from multiprocessing import Pool
from scipy.stats import wasserstein_distance as emd
from Levenshtein import distance as levenshtein
from sys import argv
from sys import stdout
from lec2.letters import file2prob, letters
from random import choices, sample
from subprocess import PIPE, DEVNULL, STDOUT
import subprocess
import ast
import os
import re
CS="/home/xgb21195/teaching/cs101"
republic = "/home/xgb21195/teaching/cs101/cs101-csai/lec2/republic.txt"
modest = "/home/xgb21195/teaching/cs101/cs101-csai/marking/modest.txt"
aabbcc = "/home/xgb21195/teaching/cs101/cs101-csai/marking/aabbcc.txt"
sbre = re.compile(r"^ *\['([a-z ])'\].*|.*: ([a-z ])$")
def ngrams(data, n=2):
    """
    From the data, construct a dictionary of n-grams and
    the probabilities of following letters. That is, the
    conditional probability of a letter given its n predecessors.
    """
    letter_dict = {}
    previous = []
    for c in data:
        c = c.lower()
        if c not in letters:
            continue
        if len(previous) == n:
            key = "".join(previous)
            # count this letter as a successor of the preceding n-gram
            counts = letter_dict.setdefault(key, {})
            counts[c] = counts.get(c, 0) + 1
        previous.append(c)
        previous = previous[-n:]
    probs = { c: { d: letter_dict[c][d]/sum(letter_dict[c].values())
                   for d in letter_dict[c].keys() } for c in letter_dict.keys() }
    return probs
def file2ngrams(filename, n=2):
    """
    Read a file and return a dictionary of n-grams and
    the probabilities of following letters. That is, the
    conditional probability of a letter given its n
    predecessors.
    """
    with open(filename, encoding="utf-8") as fp:
        return ngrams(fp.read(), n)

def find_assignment(repo, name):
    # Walk the student's repository looking for a file whose name starts
    # with "assignment" and contains the part name (e.g. "2a").
    for d, subs, files in os.walk(repo):
        for fname in files:
            lname = fname.lower()
            if lname.startswith("assignment") and name in lname:
                return os.path.join(d, fname)
def ispython(fname):
    # A file is treated as Python if its contents parse as Python source.
    try:
        with open(fname) as fp:
            p = fp.read()
            ast.parse(p)
            return True
    except Exception:
        return False

def runner(fname):
    # Return a function that runs the submitted program (via the interpreter
    # if it is Python, directly otherwise) and captures its output.
    if ispython(fname):
        def _run(*args):
            cmd = ["/bin/python", fname] + [str(a) for a in args]
            try:
                proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
                if proc.returncode is not None and proc.returncode != 0:
                    return None
                return proc.stdout.decode("utf-8")
            except Exception as e:
                print(f"Error running program: {e}")
                return None
    else:
        def _run(*args):
            cmd = [fname] + [str(a) for a in args]
            try:
                proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
                if proc.returncode is not None and proc.returncode != 0:
                    return None
                return proc.stdout.decode("utf-8")
            except Exception as e:
                print(f"Error running program: {e}")
                return None
    return _run

def run(args):
    fname = args[0]
    _run = runner(fname)
    result = _run(*args[1:])
    return result.rstrip("\r\n") if result is not None else None

def pmark(indent, msg, marks):
    # Print an indented, dot-padded marking line, e.g. "Foo........ 2/2 marks".
    l = indent*4 + len(msg)
    dent = ' ' * (indent*4)
    dots = '.' * (69-l)
    print(dent + msg + dots + " " + marks + " marks")
def mark2a(repo):
tally = 0
stdout.write("Finding assignment 2a...")
prog = find_assignment(repo, "2a")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2a", "0/10")
return 0
print(f" {prog}")
output = run((prog, aabbcc))
if output is None:
pmark(1, "Program cannot be run", "0/10")
return 0
else:
pmark(1, "Program exists and can be run", "1/1")
tally += 1
output = output.strip()
if len(output) == 1 and output in "abc":
pmark(1, "Program output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Program output is not well-formed", "0/2")
print(" Expected a single letter, either 'a', 'b', or 'c', got:")
print(f" {repr(output)}")
print(" Will try to compensate for this for the rest of 2a")
print(" Testing AABBCC distribution...")
counts = {}
with Pool(processes=4) as pool:
for c in pool.imap_unordered(run, [(prog, aabbcc) for _ in range(2500)]):
if c is None:
continue
if len(c) != 1:
m = sbre.match(c)
if m is None:
continue
bra, end = m.groups()
if bra is not None: c = bra
else: c = end
counts[c] = counts.get(c, 0) + 1
probs = { c : counts[c]/sum(counts.values()) for c in counts }
n = len(probs)
if n != 3:
pmark(2, f"Strange distribution of size {n} found", "0/2")
else:
pmark(2, f"Ternary distribution found", "1/1")
tally += 1
balanced = all( abs(0.3333-v) < 0.03 for v in probs.values() )
if balanced:
pmark(2, "Balanced ternary distribution found", "1/1")
tally += 1
else:
pmark(2, "Unbalanced ternary distribution found", "0/1")
print(f" {probs}")
modest_probs = file2prob(modest)
print(" Testing A Modest Proposal...")
counts = {}
with Pool(processes=24) as pool:
for c in pool.imap_unordered(run, [(prog, modest) for _ in range(2500)]):
if c is None:
continue
if len(c) != 1:
m = sbre.match(c)
if m is None:
continue
bra, end = m.groups()
if bra is not None: c = bra
else: c = end
counts[c] = counts.get(c, 0) + 1
probs = { c : counts[c]/sum(counts.values()) for c in counts }
if len(probs) > 0:
exp_support = sorted(list(modest_probs.keys()))
ans_support = sorted(list(probs.keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d > 2:
pmark(2, f"Distribution with bad support (d={d}) found", "0/2")
print(f" Exp: {exp_support}")
print(f" Got: {ans_support}")
else:
pmark(2, f"Distribution with acceptable support (d={d}) found", "2/2")
tally += 2
u = list(modest_probs.values())
v = list(probs.values())
d = emd(u, v)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution acceptably close (d={df})", "3/3")
tally += 3
else:
pmark(2, f"Distribution too far away (d={df})", "0/3")
else:
pmark(2, "Could not understand program output", "0/5")
return tally
def mark2b(repo):
tally = 0
stdout.write("Finding assignment 2b...")
prog = find_assignment(repo, "2b")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2b", "0/10")
return 0
print(f" {prog}")
# prog = "../assignment-2b.py"
pairs = file2ngrams(modest, 1)
probs = file2prob(modest)
support = [k for k in probs.keys() if k != ' ']
distrib = [probs[k] for k in support]
output = run((prog, modest, "e"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if output in letters:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/2")
else:
pmark(1, "Program cannot be run", "0/10")
return 0
for init in sample(support, 2, counts=[int(1000*p) for p in distrib]):
stdout.write(f" Computing 2-gram distribution starting with {init}......")
counts = {}
with Pool(processes=24) as pool:
for c in pool.imap_unordered(run, [(prog, modest, init) for _ in range(2500)]):
if c is None or len(c) != 1:
continue
counts[c] = counts.get(c, 0) + 1
print(" done.")
if len(counts) == 0:
pmark(2, "Bad program output", "0/2")
continue
probs = { c : counts[c]/sum(counts.values()) for c in counts }
exp_support = sorted(list(pairs[init].keys()))
exp_distrib = [pairs[init][k] for k in exp_support]
ans_support = sorted(list(probs.keys()))
ans_distrib = [probs[k] for k in ans_support]
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}")
print(f" Got {ans_support}")
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
    # Sneaky test case: 'd' has no successor in abcd.txt, so the program
    # should exit with an error (non-zero return code).
    proc = subprocess.run(["/bin/python", prog, "abcd.txt", "d"], timeout=10, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
    if proc.returncode != 0:
        pmark(1, f"Correct error on sneaky test case", "2/2")
        tally += 2
    else:
        pmark(1, f"Sneaky test case does not error", "0/2")
    return tally
def mark2c(repo):
tally = 0
stdout.write("Finding assignment 2c...")
prog = find_assignment(repo, "2c")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2c", "0/10")
return 0
print(f" {prog}")
return mark_ngrams(prog, 1)
def mark2d(repo):
tally = 0
stdout.write("Finding assignment 2d (bonus)...")
prog = find_assignment(repo, "2d")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2d", "0/10")
return 0
print(f" {prog}")
return mark_ngrams(prog, 2)
def mark_ngrams(prog, n):
output = run((prog, modest, 100000))
if output is None:
pmark(1, f"Could not run program", "0/10")
return 0
output = output.strip()
exp = file2ngrams(modest, n)
ans = ngrams(output, n)
tally = 0
output = run((prog, modest, "100"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if len(output) == 100:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/8")
return tally
else:
pmark(1, "Program cannot be run", "0/10")
return tally
for gram in sample(list(exp.keys()), 3):
print(f" Comparing distributions for {n+1}-grams starting with {gram}")
if gram not in ans:
pmark(2, f"{n}-gram '{gram}' not found in answer", "0/2")
continue
exp_cond = exp[gram]
ans_cond = ans[gram]
exp_support = sorted(list(exp[gram].keys()))
ans_support = sorted(list(ans[gram].keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}")
print(f" Got {ans_support}")
exp_distrib = [exp[gram][k] for k in exp_support]
ans_distrib = [ans[gram][k] for k in ans_support]
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
return tally
def mkseed(s):
    # Derive a small deterministic seed from the student's username so that
    # the same random test cases are chosen on every marking run.
    x = 0
    for c in s:
        x ^= ord(c)
    return x

if __name__ == '__main__':
    from random import seed
    dsname = argv[1]
    marks = os.path.join(CS, "marking", dsname)
    repo = os.path.join(CS, "repos", dsname)
    tally = 0
    bonus = 0
    print("Marking Assignment 2")
    print("====================")
    print()
    seed(mkseed(dsname))
    tally += mark2a(repo)
    tally += mark2b(repo)
    tally += mark2c(repo)
    bonus += mark2d(repo)
    print("="*80)
    pmark(0, "Total for assignment 2", f"{tally+bonus}/30")
from multiprocessing import Pool
from string import ascii_letters, digits
from scipy.stats import wasserstein_distance as emd
from Levenshtein import distance as levenshtein
from sys import argv
from sys import stdout
from random import choices, sample
from subprocess import PIPE, DEVNULL, STDOUT
from nltk.tokenize import word_tokenize
import subprocess
import ast
import os
import re
CS="/home/xgb21195/teaching/cs101"
republic = "/home/xgb21195/teaching/cs101/cs101-csai/lec2/republic.txt"
barlow = "/home/xgb21195/teaching/cs101/cs101-csai/marking/barlow.txt"
with open(barlow) as fp:
barlow_words = [w.lower() for w in word_tokenize(fp.read())]
aabbcc = "/home/xgb21195/teaching/cs101/cs101-csai/marking/aabbcc.txt"
mary = "/home/xgb21195/teaching/cs101/cs101-csai/marking/mary.txt"
with open(mary) as fp:
mary_words = [w.lower() for w in word_tokenize(fp.read())]
nltkvomit = '[nltk_data] Downloading package punkt to /home/xgb21195/nltk_data...\n[nltk_data] Package punkt is already up-to-date!\n'
letters = ascii_letters + digits
def ntoks(data, n=2):
    """
    From the data, construct a dictionary of n-tokens and
    the probabilities of following words. That is, the
    conditional probability of a word given its n predecessors.
    """
    word_dict = {}
    previous = []
    for w in word_tokenize(data):
        w = w.lower()
        if len(previous) == n:
            key = tuple(previous)
            # count this word as a successor of the preceding n-token key
            counts = word_dict.setdefault(key, {})
            counts[w] = counts.get(w, 0) + 1
        previous.append(w)
        previous = previous[-n:]
    probs = { c: { d: word_dict[c][d]/sum(word_dict[c].values())
                   for d in word_dict[c].keys() } for c in word_dict.keys() }
    return probs
def file2ntoks(filename, n=2):
    """
    Read a file and return a dictionary of n-toks and
    the probabilities of following words. That is, the
    conditional probability of a word given its n
    predecessors.
    """
    with open(filename, encoding="utf-8") as fp:
        return ntoks(fp.read(), n)

def file2prob(filename):
    """
    Read a file and return a dictionary of words and
    their probabilities
    """
    word_dict = {}
    word_total = 0
    with open(filename, encoding="utf-8") as fp:
        for w in word_tokenize(fp.read()):
            w = w.lower()
            word_dict[w] = word_dict.get(w, 0) + 1
            word_total += 1
    probs = { w: word_dict[w]/word_total for w in word_dict }
    return probs

def find_assignment(repo, name):
    # Walk the student's repository looking for a file whose name starts
    # with "assignment" and contains the part name (e.g. "3a").
    for d, subs, files in os.walk(repo):
        for fname in files:
            lname = fname.lower()
            if lname.startswith("assignment") and name in lname:
                return os.path.join(d, fname)
def ispython(fname):
try:
with open(fname) as fp:
p = fp.read()
ast.parse(p)
return True
except:
return False
def runner(fname):
if ispython(fname):
def _run(*args):
cmd = ["/bin/python", fname] + [str(a) for a in args]
try:
proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
if proc.returncode is not None and proc.returncode != 0:
return None
return proc.stdout.decode("utf-8")
except Exception as e:
print(f"Error running program: {e}")
return None
else:
def _run(*args):
cmd = [fname] + [str(a) for a in args]
try:
proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
if proc.returncode is not None and proc.returncode != 0:
return None
return proc.stdout.decode("utf-8")
except Exception as e:
print(f"Error running program: {e}")
return None
return _run
def run(args):
fname = args[0]
_run = runner(fname)
result = _run(*args[1:])
return result.removeprefix(nltkvomit).rstrip("\r\n") if result is not None else None
def pmark(indent, msg, marks):
l = indent*4 + len(msg)
dent = ' ' * (indent*4)
dots = '.' * (69-l)
print(dent + msg + dots + " " + marks + " marks")
def mark3a(repo):
tally = 0
stdout.write("Finding assignment 3a...")
prog = find_assignment(repo, "3a")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3a", "0/10")
return 0
print(f" {prog}")
output = run((prog, mary))
if output is None:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, mary)))
return 0
else:
pmark(1, "Program exists and can be run", "1/1")
tally += 1
output = output.strip().removeprefix(nltkvomit)
if output in mary_words:
pmark(1, "Program output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Program output is not well-formed", "0/2")
print(f" Expected a single word in {mary_words}, got:")
print(f" {repr(output)}"[:500] + "...")
print(" Testing mary.txt distribution...")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, mary) for _ in range(360)]):
if w is None:
continue
counts[w] = counts.get(w, 0) + 1
probs = { w : counts[w]/sum(counts.values()) for w in counts }
n = len(probs)
if n != 5:
pmark(2, f"Strange distribution of size {n} found", "0/2")
else:
pmark(2, f"5-ary distribution found", "1/1")
tally += 1
balanced = all( abs(0.2-v) < 0.05 for v in probs.values() )
if balanced:
pmark(2, "Balanced 5-ary distribution found", "1/1")
tally += 1
else:
pmark(2, "Unbalanced 5-ary distribution found", "0/1")
print(f" {probs}")
barlow_probs = file2prob(barlow)
print(" Testing barlow.txt...")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, barlow) for _ in range(360)]):
if w is None:
continue
counts[w] = counts.get(w, 0) + 1
probs = { w : counts[w]/sum(counts.values()) for w in counts }
if len(probs) > 0:
exp_support = sorted(list(barlow_probs.keys()))
ans_support = sorted(list(k.lower() for k in probs.keys()))
d = levenshtein(" ".join(exp_support), " ".join(ans_support))
if d > 50:
pmark(2, f"Distribution with bad support (d={d}) found", "0/2")
print(f" Exp: {exp_support}"[:500] + "...")
print(f" Got: {ans_support}"[:500] + "...")
else:
pmark(2, f"Distribution with acceptable support (d={d}) found", "2/2")
tally += 2
u = list(barlow_probs.values())
v = list(probs.values())
d = emd(u, v)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution acceptably close (d={df})", "3/3")
tally += 3
else:
pmark(2, f"Distribution too far away (d={df})", "0/3")
else:
pmark(2, "Could not understand program output", "0/5")
return tally
def mark3b(repo):
tally = 0
stdout.write("Finding assignment 3b...")
prog = find_assignment(repo, "3b")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3b", "0/10")
return 0
print(f" {prog}")
all_pairs = file2ntoks(barlow, 1)
probs = file2prob(barlow)
support = [k for k in probs.keys()]
distrib = [probs[k] for k in support]
output = run((prog, barlow, "it"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if output in barlow_words:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/2")
print(f" Expected one of JPB's words, got {output[:100]}...")
else:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, barlow, "it")))
return 0
# it was unclear in the instructions whether punctuation
# should be included. NLTK's word_tokenize() thinks that
# punctuation symbols are words. if the student does not
# include punctuation, drop it from our reference
# distribution
def filter(d):
return { k: desc(w) for k,w in d.items()
if all(c in letters for c in k[0]) }
def desc(v):
return filter(v) if isinstance(v, dict) else v
def renorm(d):
return { k: v/sum(d.values()) for k, v in d.items() }
words = filter(probs)
for init in sample(list(words.keys()), 2, counts=[int(1000*p) for p in words.values()]):
stdout.write(f" Computing 2-token distribution starting with '{init}'......")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, barlow, init) for _ in range(120)]):
if w is None:
continue
w = w.lower()
counts[w] = counts.get(w, 0) + 1
print(" done.")
if len(counts) == 0:
pmark(2, "Bad program output", "0/2")
continue
probs = { w : counts[w]/sum(counts.values()) for w in counts }
if all(all(c in letters for c in w) for w in probs):
print(" No punctuation found at top level, renormalising.")
pairs = { k: renorm(v) for k, v in filter(all_pairs).items() }
else:
pairs = all_pairs
exp_support = sorted(list(pairs[(init,)].keys()))
exp_distrib = [pairs[(init,)][k] for k in exp_support]
exp_string = " ".join(t[0] for t in exp_support)
ans_support = sorted(list(probs.keys()))
ans_distrib = [probs[k] for k in ans_support]
ans_string = " ".join(ans_support)
if len(exp_support) == 0:
if len(ans_support) == 0:
pmark(2, f"Correctly returned an empty distribution", "2/2")
else:
pmark(2, f"Invalid distribution size {len(ans_support)} should be {len(exp_support)}", "0/2")
else:
d = levenshtein(exp_string, ans_string)
if d < 200:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
if len(ans_distrib) > 0:
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.1:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" {exp_distrib}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
print(f" {ans_distrib}"[:500] + "...")
else:
pmark(2, "Distribution is empty", "0/1")
    # Sneaky test case: "lamb" is the last word of mary.txt and has no
    # successor, so the program should exit with an error (non-zero code).
    proc = subprocess.run(["/bin/python", prog, mary, "lamb"], timeout=10, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
    if proc.returncode != 0:
        pmark(1, f"Correct error on sneaky test case", "2/2")
        tally += 2
    else:
        pmark(1, f"Sneaky test case does not error", "0/2")
    return tally
def mark3c(repo):
tally = 0
stdout.write("Finding assignment 3c...")
prog = find_assignment(repo, "3c")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3c", "0/10")
return 0
print(f" {prog}")
return mark_toks(prog, 1)
def mark3d(repo):
tally = 0
stdout.write("Finding assignment 3d (bonus)...")
prog = find_assignment(repo, "3d")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3d", "0/10")
return 0
print(f" {prog}")
return mark_toks(prog, 2)
def mark_toks(prog, n):
output = run((prog, barlow, 100))
if output is None:
pmark(1, f"Could not run program", "0/10")
print(" " + " ".join((prog, barlow, "100")))
return 0
exp = file2ntoks(barlow, n)
tally = 0
output = []
with Pool(processes=6) as pool:
for s in pool.imap_unordered(run, [(prog, barlow, 100) for _ in range(24)]):
if s is None:
continue
output.append(s.strip())
if len(output) > 0:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
sentence = " ".join(output)
words = word_tokenize(sentence)
ans = ntoks(sentence, n)
nwords = len(words)
if nwords == 2400:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, f"Expected 2400 words in total, got {nwords}", "0/8")
return tally
else:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, barlow, "it")))
return tally
for word in sample(list(exp.keys()), 3):
print(f" Comparing distributions for {n+1}-words starting with {' '.join(word)}")
if word not in ans:
pmark(2, f"{n}-word '{' '.join(word)}' not found in answer", "0/2")
print(f" answer: {list(' '.join(w) for w in ans)}")
continue
exp_cond = exp[word]
ans_cond = ans[word]
exp_support = sorted(list(exp[word].keys()))
ans_support = sorted(list(ans[word].keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
exp_distrib = [exp[word][k] for k in exp_support]
ans_distrib = [ans[word][k] for k in ans_support]
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.1:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
return tally
def mkseed(s):
    # Derive a small deterministic seed from the student's username so that
    # the same random test cases are chosen on every marking run.
    x = 0
    for c in s:
        x ^= ord(c)
    return x

if __name__ == '__main__':
    from random import seed
    dsname = argv[1]
    marks = os.path.join(CS, "marking", dsname)
    repo = os.path.join(CS, "repos", dsname)
    tally = 0
    bonus = 0
    print("Marking Assignment 3")
    print("====================")
    print()
    seed(mkseed(dsname))
    tally += mark3a(repo)
    tally += mark3b(repo)
    tally += mark3c(repo)
    print("="*80)
    pmark(0, "Total for assignment 3", f"{tally}/30")
mary had a little lamb