# CS101/3 - Cognitive Science and Artificial Intelligence
## Info
- **Lecture**
- Mondays 14:00-15:00 RC426
- **Labs:**
- LT1105, LT1201, LT1221, LT1320 :: Mondays 15:00-17:00
- **Demonstrators:**
- Adel Dadaa <[adel.dadaa@strath.ac.uk](mailto:adel.dadaa@strath.ac.uk)>
- Pat Prochacki <[pat.prochacki@strath.ac.uk](mailto:pat.prochacki@strath.ac.uk)>
- Tochukwu Umeasiegbu <[tochukwu.umeasiegbu@strath.ac.uk](mailto:tochukwu.umeasiegbu@strath.ac.uk)>
- **Links**
- [Course Mattermost Channel](https://mattermost.cis.strath.ac.uk/learning/channels/cs101-22-24)
## Marking Scheme
*Final due date for all assignments: 26 February*
- 10% :: Participation
- 30% :: Assignment 1 - [Yak Shaving](./cs101-csai-lec1.pdf)
  - 20% :: Create a git repository with a file, and share it
  - 10% :: Put a transcript of a session with the Emacs doctor in that file
- 30% :: Assignment 2 - [Probability and Text](./cs101-csai-lec1.pdf)
  - 10% :: Write a program to output random characters (a minimal sketch follows this list)
  - 10% :: Write a program that, given a character, predicts the next character
  - 10% :: Write a program to output a sequence of characters
  - 10% (bonus) :: Write a program that outputs a sequence of characters conditional on the previous two characters
- 30% :: Assignment 3 - [Stochastic Parrot](./cs101-csai-assignment3.md)
  - 10% :: Write a program to output random words
  - 10% :: Write a program that, given a word, predicts the next word
  - 10% :: Write a program to output a sequence of words
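
For orientation, here is a minimal sketch of the kind of program the "output random characters" item asks for. The choice of alphabet, the helper name `letter_probs`, and the use of `random.choices` are illustrative assumptions, not a required approach.

```python
#!/usr/bin/env python3
"""Print one random character drawn from the letter distribution of a text file."""
import sys
import random
from string import ascii_lowercase

ALPHABET = ascii_lowercase + " "   # assumed alphabet: a-z plus space

def letter_probs(filename):
    """Map each character in ALPHABET to its relative frequency in the file."""
    counts = {}
    with open(filename, encoding="utf-8") as fp:
        for c in fp.read().lower():
            if c in ALPHABET:
                counts[c] = counts.get(c, 0) + 1
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

if __name__ == "__main__":
    probs = letter_probs(sys.argv[1])
    # random.choices draws one item weighted by the given probabilities
    print(random.choices(list(probs), weights=list(probs.values()))[0])
```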
## Topics
- Yak Shaving - Software Engineering Tooling [(PDF)](./cs101-csai-lec1.pdf)
- Some Philosophical Experiments
- Symbol Manipulation and Logic
- Probability and Text prediction
- Vector Spaces and Word embedding
## Python Quick-Start Guide
[![Python quick-start video](https://img.youtube.com/vi/r54u4z_qay0/0.jpg)](https://www.youtube.com/watch?v=r54u4z_qay0)
[Setting up Python - Write-Up](https://gitlab.cis.strath.ac.uk/xgb21195/cs101-csai/-/blob/main/setup.md?ref_type=heads)
#+title: CS101/3 - Cognitive Science and Artificial Intelligence
* Info
** Lecture
Mondays 14:00-15:00 RC426
** Labs:
- LT1105, LT1201, LT1221, LT1320 :: Mondays 15:00-17:00
*** Demonstrators:
- Adel Dadaa <[[mailto:adel.dadaa@strath.ac.uk][adel.dadaa@strath.ac.uk]]>
- Pat Prochacki <[[mailto:pat.prochacki@strath.ac.uk][pat.prochacki@strath.ac.uk]]>
- Tochukwu Umeasiegbu <[[mailto:tochukwu.umeasiegbu@strath.ac.uk][tochukwu.umeasiegbu@strath.ac.uk]]>
** Links
- [[https://mattermost.cis.strath.ac.uk/learning/channels/cs101-22-24][Course Mattermost Channel]]
* Marking Scheme
- 10% :: Participation
- 30% :: Assignment 1 - [[./cs101-csai-lec1.pdf][Yak Shaving]]
  - 20% :: Create a git repository with a file, and share it
  - 10% :: Put a transcript of a session with the Emacs doctor in that file
- 30% :: Assignment 2 - [[./cs101-csai-lec1.pdf][Probability and Text]]
  - 10% :: Write a program to output random characters
  - 10% :: Write a program that, given a character, predicts the next character
  - 10% :: Write a program to output a sequence of characters
  - 10% (bonus) :: Write a program that outputs a sequence of characters conditional on the previous two characters
- 30% :: Assignment 3
* Topics
1. Yak Shaving - Software Engineering Tooling [[./cs101-csai-lec1.pdf][(PDF)]]
2. Some Philosophical Experiments
3. Symbol Manipulation and Logic
4. Probability and Text prediction
5. Vector Spaces and Word embedding
# CS101/3 - CS/AI Assignment 3 - The Stochastic Parrot
Assignment 3 is just like Assignment 2, but with words instead of letters.
You may wish to use the Python NLTK (https://www.nltk.org), the Natural Language
Toolkit, to help break the text up into tokens (words).
You may adapt the helper functions from Assignment 2 to work with words
rather than letters, or you may write your own from scratch. A minimal
tokenisation sketch follows below.
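
The sketch below shows one way to split a file into word tokens with NLTK. The helper name `file2words` is an illustrative assumption, and you may need to download the `punkt` tokenizer data once (e.g. `nltk.download('punkt')`) before `word_tokenize` will run.

```python
# Minimal sketch: split a text file into lower-cased word tokens with NLTK.
# Assumes the punkt tokenizer data has already been downloaded.
from nltk.tokenize import word_tokenize

def file2words(filename):
    """Return the list of lower-cased tokens in the file."""
    with open(filename, encoding="utf-8") as fp:
        return [w.lower() for w in word_tokenize(fp.read())]

if __name__ == "__main__":
    import sys
    print(file2words(sys.argv[1])[:20])   # show the first 20 tokens
```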
Optionally, instead of doing this programming exercise, you may write a
short essay of no more than 2000 words on one of the topics in the
philosophy of mind that we have touched on:
- Turing's Imitation Game
- Dneprov's Game (or, equivalently, Searle's Chinese Room)
- Logical Behaviorism or Functionalism
- The Engineering End-Run
Assignment 3 is worth 30% in total.
## 3a word probabilities - 10%
Given an input text, compute the word probabilities and generate a word:

```
xgb21195@cafe:~$ ./assignment-3a example.txt
the
```
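
One possible shape for 3a, building on the tokenisation sketch above. The helper name `word_probs` and the use of `random.choices` are illustrative assumptions, not a prescribed solution.

```python
#!/usr/bin/env python3
# Sketch for 3a: print one word drawn from the word distribution of the input text.
import sys
import random
from collections import Counter
from nltk.tokenize import word_tokenize

def word_probs(filename):
    """Map each word in the file to its relative frequency."""
    with open(filename, encoding="utf-8") as fp:
        counts = Counter(w.lower() for w in word_tokenize(fp.read()))
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()}

if __name__ == "__main__":
    probs = word_probs(sys.argv[1])
    print(random.choices(list(probs), weights=list(probs.values()))[0])
```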
## 3b conditional word probabilities - 10%
Given an input text and a word, generate the next word according to the
conditional probabilities in the text:

```
xgb21195@cafe:~$ ./assignment-3b example.txt the
cat
```
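
A minimal sketch of one way to approach 3b, assuming NLTK tokenisation as above. The helper name `next_word_counts` and the error behaviour for words with no successor are illustrative choices, not requirements.

```python
#!/usr/bin/env python3
# Sketch for 3b: given a text and a word, print a plausible next word, sampled
# from the conditional distribution of words that follow it in the text.
import sys
import random
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

def next_word_counts(filename):
    """Map each word to a Counter of the words that follow it."""
    followers = defaultdict(Counter)
    with open(filename, encoding="utf-8") as fp:
        words = [w.lower() for w in word_tokenize(fp.read())]
    for prev, cur in zip(words, words[1:]):
        followers[prev][cur] += 1
    return followers

if __name__ == "__main__":
    filename, word = sys.argv[1], sys.argv[2].lower()
    options = next_word_counts(filename)[word]
    if not options:
        sys.exit(f"'{word}' is never followed by another word in {filename}")
    print(random.choices(list(options), weights=list(options.values()))[0])
```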
## 3c a stochastic parrot - 10%
Given an input text, generate a sentence of a particular length according
to the conditional word distributions:

```
xgb21195@cafe:~$ ./assignment-3c example.txt 10
the cat ate a burrito that is not a butterfly
```
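
A minimal sketch of one way to approach 3c, chaining the conditional sampling from 3b. Restarting from a random word when a dead end is reached is an illustrative choice, not a requirement.

```python
#!/usr/bin/env python3
# Sketch for 3c: generate a sequence of N words, each sampled from the
# conditional distribution of words that follow the previous word in the text.
import sys
import random
from collections import Counter, defaultdict
from nltk.tokenize import word_tokenize

def next_word_counts(filename):
    """Return (word -> Counter of following words, list of all tokens)."""
    followers = defaultdict(Counter)
    with open(filename, encoding="utf-8") as fp:
        words = [w.lower() for w in word_tokenize(fp.read())]
    for prev, cur in zip(words, words[1:]):
        followers[prev][cur] += 1
    return followers, words

if __name__ == "__main__":
    filename, length = sys.argv[1], int(sys.argv[2])
    followers, words = next_word_counts(filename)
    word = random.choice(words)          # start from a random word in the text
    out = [word]
    while len(out) < length:
        options = followers.get(word)
        if not options:                  # dead end: restart from a random word
            word = random.choice(words)
        else:
            word = random.choices(list(options), weights=list(options.values()))[0]
        out.append(word)
    print(" ".join(out))
```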
@@ -22,7 +22,9 @@
 *This is only slightly more than half the class!*
-Marking will happen this week.
+Marking will begin this week.
 If you are having trouble, *ask a demonstrator for help*.
+If you do not do this assignment, you will not be able to submit any
+of the other assignments and you will get *0* for this module.
New binary files (no text preview):
- img/idlewin.png (34.3 KiB)
- img/pycharmwin.png (123 KiB)
- img/pythoncmd.png (63.5 KiB)
- img/vscodewin.png (46.2 KiB)
@@ -7,14 +7,15 @@ def file2prob(filename):
     Read a file and return a dictionary of letters and
     their probabilities
     """
-    letter_dict = { c: 0 for c in letters }
+    letter_dict = {}
     letter_total = 0
-    with open(filename) as fp:
+    with open(filename, encoding="utf-8") as fp:
         for c in fp.read():
-            if c.lower() not in letter_dict:
+            if c.lower() not in letters:
                 continue
-            letter_dict[c.lower()] += 1
+            c = c.lower()
+            letter_dict[c] = letter_dict.get(c, 0) + 1
             letter_total += 1
     probs = { c: letter_dict[c]/letter_total for c in letter_dict }
@@ -27,22 +28,23 @@ def file2pairs(filename):
     conditional probability of a letter given its
     predecessor.
     """
-    letter_dict = { c: { a: 0 for a in letters }
-                    for c in letters }
+    letter_dict = {}
     previous = None
-    with open(filename) as fp:
+    with open(filename, encoding="utf-8") as fp:
         for c in fp.read():
-            if c not in letter_dict:
-                continue
+            c = c.lower()
+            if c not in letters:
+                continue
             if previous is None:
                 previous = c
                 continue
-            letter_dict[previous][c] += 1
+            d = letter_dict.setdefault(previous, {})
+            d[c] = d.get(c, 0) + 1
             previous = c
     probs = { c: { d: letter_dict[c][d]/sum(letter_dict[c].values())
-                   for d in letters } for c in letters }
+                   for d in letter_dict[c] } for c in letter_dict }
     return probs
 if __name__ == '__main__':
aabbcc
abcd
abcda
Governments of the Industrial World, you weary giants of flesh and steel, I come from Cyberspace, the new home of Mind. On behalf of the future, I ask you of the past to leave us alone. You are not welcome among us. You have no sovereignty where we gather.
We have no elected government, nor are we likely to have one, so I address you with no greater authority than that with which liberty itself always speaks. I declare the global social space we are building to be naturally independent of the tyrannies you seek to impose on us. You have no moral right to rule us nor do you possess any methods of enforcement we have true reason to fear.
Governments derive their just powers from the consent of the governed. You have neither solicited nor received ours. We did not invite you. You do not know us, nor do you know our world. Cyberspace does not lie within your borders. Do not think that you can build it, as though it were a public construction project. You cannot. It is an act of nature and it grows itself through our collective actions.
You have not engaged in our great and gathering conversation, nor did you create the wealth of our marketplaces. You do not know our culture, our ethics, or the unwritten codes that already provide our society more order than could be obtained by any of your impositions.
You claim there are problems among us that you need to solve. You use this claim as an excuse to invade our precincts. Many of these problems don't exist. Where there are real conflicts, where there are wrongs, we will identify them and address them by our means. We are forming our own Social Contract. This governance will arise according to the conditions of our world, not yours. Our world is different.
Cyberspace consists of transactions, relationships, and thought itself, arrayed like a standing wave in the web of our communications. Ours is a world that is both everywhere and nowhere, but it is not where bodies live.
We are creating a world that all may enter without privilege or prejudice accorded by race, economic power, military force, or station of birth.
We are creating a world where anyone, anywhere may express his or her beliefs, no matter how singular, without fear of being coerced into silence or conformity.
Your legal concepts of property, expression, identity, movement, and context do not apply to us. They are all based on matter, and there is no matter here.
Our identities have no bodies, so, unlike you, we cannot obtain order by physical coercion. We believe that from ethics, enlightened self-interest, and the commonweal, our governance will emerge. Our identities may be distributed across many of your jurisdictions. The only law that all our constituent cultures would generally recognize is the Golden Rule. We hope we will be able to build our particular solutions on that basis. But we cannot accept the solutions you are attempting to impose.
In the United States, you have today created a law, the Telecommunications Reform Act, which repudiates your own Constitution and insults the dreams of Jefferson, Washington, Mill, Madison, DeToqueville, and Brandeis. These dreams must now be born anew in us.
You are terrified of your own children, since they are natives in a world where you will always be immigrants. Because you fear them, you entrust your bureaucracies with the parental responsibilities you are too cowardly to confront yourselves. In our world, all the sentiments and expressions of humanity, from the debasing to the angelic, are parts of a seamless whole, the global conversation of bits. We cannot separate the air that chokes from the air upon which wings beat.
In China, Germany, France, Russia, Singapore, Italy and the United States, you are trying to ward off the virus of liberty by erecting guard posts at the frontiers of Cyberspace. These may keep out the contagion for a small time, but they will not work in a world that will soon be blanketed in bit-bearing media.
Your increasingly obsolete information industries would perpetuate themselves by proposing laws, in America and elsewhere, that claim to own speech itself throughout the world. These laws would declare ideas to be another industrial product, no more noble than pig iron. In our world, whatever the human mind may create can be reproduced and distributed infinitely at no cost. The global conveyance of thought no longer requires your factories to accomplish.
These increasingly hostile and colonial measures place us in the same position as those previous lovers of freedom and self-determination who had to reject the authorities of distant, uninformed powers. We must declare our virtual selves immune to your sovereignty, even as we continue to consent to your rule over our bodies. We will spread ourselves across the Planet so that no one can arrest our thoughts.
We will create a civilization of the Mind in Cyberspace. May it be more humane and fair than the world your governments have made before.
../lec2
\ No newline at end of file
#!/bin/bash
# Mark Assignment 1 for a single student repository.
CS=/home/xgb21195/teaching/cs101
TOK=`cat ~/.gitlab`   # GitLab token (not used in this script)

mark1 () {
    date 1>&2
    echo "Marking Assignment 1 for ${dsname}" 1>&2
    echo "Marking Assignment 1"; echo
    marks=0
    # Count and scan all non-hidden files in the current directory.
    nfiles=`find . -type f | sed '/^\.\/\./d' | wc -l`
    find . -type f | sed '/^\.\/\./d' | while read f; do
        touch ${marking}/assignment-1.files
        if grep -q 'I am the psychotherapist. Please, describe your problems.' "$f"; then
            touch ${marking}/assignment-1.emacs
        fi
    done
    if test -f ${marking}/assignment-1.files; then
        echo " Found some files... 20/20 marks"
        marks=20
    else
        echo " Found no files... 0/20 marks"
    fi
    if test -f ${marking}/assignment-1.emacs; then
        echo " Found an interaction with the Emacs doctor... 10/10 marks"
        marks=$(($marks + 10))
    else
        echo " Found no interaction with the Emacs doctor... 0/10 marks"
    fi
    echo
    echo "${marks}/30 marks in total"
    echo "${marks} marks" 1>&2
}

dsname="$1"
student="${CS}/students/${dsname}"
marking="${CS}/marking/${dsname}"
mark1
from multiprocessing import Pool
from scipy.stats import wasserstein_distance as emd
from Levenshtein import distance as levenshtein
from sys import argv
from sys import stdout
from lec2.letters import file2prob, letters
from random import choices, sample
from subprocess import PIPE, DEVNULL, STDOUT
import subprocess
import ast
import os
import re
CS="/home/xgb21195/teaching/cs101"
republic = "/home/xgb21195/teaching/cs101/cs101-csai/lec2/republic.txt"
modest = "/home/xgb21195/teaching/cs101/cs101-csai/marking/modest.txt"
aabbcc = "/home/xgb21195/teaching/cs101/cs101-csai/marking/aabbcc.txt"
sbre = re.compile(r"^ *\['([a-z ])'\].*|.*: ([a-z ])$")
def ngrams(data, n=2):
    """
    From the data, construct a dictionary of n-grams and
    the probabilities of following letters. That is, the
    conditional probability of a letter given its n predecessors.
    """
    letter_dict = {}
    previous = []
    for c in data:
        c = c.lower()
        if c not in letters:
            continue
        if len(previous) == n:
            key = "".join(previous)
            # count this letter as a successor of the preceding n-gram
            counts = letter_dict.setdefault(key, {})
            counts[c] = counts.get(c, 0) + 1
        previous.append(c)
        previous = previous[-n:]
    probs = { c: { d: letter_dict[c][d]/sum(letter_dict[c].values())
                   for d in letter_dict[c].keys() } for c in letter_dict.keys() }
    return probs
def file2ngrams(filename, n=2):
    """
    Read a file and return a dictionary of n-grams and
    the probabilities of following letters. That is, the
    conditional probability of a letter given its n
    predecessors.
    """
    with open(filename, encoding="utf-8") as fp:
        return ngrams(fp.read(), n)

def find_assignment(repo, name):
    # Walk the student's repository looking for a file whose name starts
    # with "assignment" and contains the part name (e.g. "2a").
    for d, subs, files in os.walk(repo):
        for fname in files:
            lname = fname.lower()
            if lname.startswith("assignment") and name in lname:
                return os.path.join(d, fname)
def ispython(fname):
    # A file is treated as Python if its contents parse as Python source.
    try:
        with open(fname) as fp:
            p = fp.read()
            ast.parse(p)
            return True
    except Exception:
        return False

def runner(fname):
    # Return a function that runs the submitted program (via the interpreter
    # if it is Python, directly otherwise) and captures its output.
    if ispython(fname):
        def _run(*args):
            cmd = ["/bin/python", fname] + [str(a) for a in args]
            try:
                proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
                if proc.returncode is not None and proc.returncode != 0:
                    return None
                return proc.stdout.decode("utf-8")
            except Exception as e:
                print(f"Error running program: {e}")
                return None
    else:
        def _run(*args):
            cmd = [fname] + [str(a) for a in args]
            try:
                proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
                if proc.returncode is not None and proc.returncode != 0:
                    return None
                return proc.stdout.decode("utf-8")
            except Exception as e:
                print(f"Error running program: {e}")
                return None
    return _run

def run(args):
    fname = args[0]
    _run = runner(fname)
    result = _run(*args[1:])
    return result.rstrip("\r\n") if result is not None else None

def pmark(indent, msg, marks):
    # Print an indented, dot-padded marking line, e.g. "Foo........ 2/2 marks".
    l = indent*4 + len(msg)
    dent = ' ' * (indent*4)
    dots = '.' * (69-l)
    print(dent + msg + dots + " " + marks + " marks")
def mark2a(repo):
tally = 0
stdout.write("Finding assignment 2a...")
prog = find_assignment(repo, "2a")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2a", "0/10")
return 0
print(f" {prog}")
output = run((prog, aabbcc))
if output is None:
pmark(1, "Program cannot be run", "0/10")
return 0
else:
pmark(1, "Program exists and can be run", "1/1")
tally += 1
output = output.strip()
if len(output) == 1 and output in "abc":
pmark(1, "Program output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Program output is not well-formed", "0/2")
print(" Expected a single letter, either 'a', 'b', or 'c', got:")
print(f" {repr(output)}")
print(" Will try to compensate for this for the rest of 2a")
print(" Testing AABBCC distribution...")
counts = {}
with Pool(processes=4) as pool:
for c in pool.imap_unordered(run, [(prog, aabbcc) for _ in range(2500)]):
if c is None:
continue
if len(c) != 1:
m = sbre.match(c)
if m is None:
continue
bra, end = m.groups()
if bra is not None: c = bra
else: c = end
counts[c] = counts.get(c, 0) + 1
probs = { c : counts[c]/sum(counts.values()) for c in counts }
n = len(probs)
if n != 3:
pmark(2, f"Strange distribution of size {n} found", "0/2")
else:
pmark(2, f"Ternary distribution found", "1/1")
tally += 1
balanced = all( abs(0.3333-v) < 0.03 for v in probs.values() )
if balanced:
pmark(2, "Balanced ternary distribution found", "1/1")
tally += 1
else:
pmark(2, "Unbalanced ternary distribution found", "0/1")
print(f" {probs}")
modest_probs = file2prob(modest)
print(" Testing A Modest Proposal...")
counts = {}
with Pool(processes=24) as pool:
for c in pool.imap_unordered(run, [(prog, modest) for _ in range(2500)]):
if c is None:
continue
if len(c) != 1:
m = sbre.match(c)
if m is None:
continue
bra, end = m.groups()
if bra is not None: c = bra
else: c = end
counts[c] = counts.get(c, 0) + 1
probs = { c : counts[c]/sum(counts.values()) for c in counts }
if len(probs) > 0:
exp_support = sorted(list(modest_probs.keys()))
ans_support = sorted(list(probs.keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d > 2:
pmark(2, f"Distribution with bad support (d={d}) found", "0/2")
print(f" Exp: {exp_support}")
print(f" Got: {ans_support}")
else:
pmark(2, f"Distribution with acceptable support (d={d}) found", "2/2")
tally += 2
u = list(modest_probs.values())
v = list(probs.values())
d = emd(u, v)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution acceptably close (d={df})", "3/3")
tally += 3
else:
pmark(2, f"Distribution too far away (d={df})", "0/3")
else:
pmark(2, "Could not understand program output", "0/5")
return tally
def mark2b(repo):
tally = 0
stdout.write("Finding assignment 2b...")
prog = find_assignment(repo, "2b")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2b", "0/10")
return 0
print(f" {prog}")
# prog = "../assignment-2b.py"
pairs = file2ngrams(modest, 1)
probs = file2prob(modest)
support = [k for k in probs.keys() if k != ' ']
distrib = [probs[k] for k in support]
output = run((prog, modest, "e"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if output in letters:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/2")
else:
pmark(1, "Program cannot be run", "0/10")
return 0
for init in sample(support, 2, counts=[int(1000*p) for p in distrib]):
stdout.write(f" Computing 2-gram distribution starting with {init}......")
counts = {}
with Pool(processes=24) as pool:
for c in pool.imap_unordered(run, [(prog, modest, init) for _ in range(2500)]):
if c is None or len(c) != 1:
continue
counts[c] = counts.get(c, 0) + 1
print(" done.")
if len(counts) == 0:
pmark(2, "Bad program output", "0/2")
continue
probs = { c : counts[c]/sum(counts.values()) for c in counts }
exp_support = sorted(list(pairs[init].keys()))
exp_distrib = [pairs[init][k] for k in exp_support]
ans_support = sorted(list(probs.keys()))
ans_distrib = [probs[k] for k in ans_support]
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}")
print(f" Got {ans_support}")
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
    # Sneaky test case: 'd' has no successor in abcd.txt, so the program
    # should exit with an error (non-zero return code).
    proc = subprocess.run(["/bin/python", prog, "abcd.txt", "d"], timeout=10, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
    if proc.returncode != 0:
        pmark(1, f"Correct error on sneaky test case", "2/2")
        tally += 2
    else:
        pmark(1, f"Sneaky test case does not error", "0/2")
    return tally
def mark2c(repo):
tally = 0
stdout.write("Finding assignment 2c...")
prog = find_assignment(repo, "2c")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2c", "0/10")
return 0
print(f" {prog}")
return mark_ngrams(prog, 1)
def mark2d(repo):
tally = 0
stdout.write("Finding assignment 2d (bonus)...")
prog = find_assignment(repo, "2d")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 2d", "0/10")
return 0
print(f" {prog}")
return mark_ngrams(prog, 2)
def mark_ngrams(prog, n):
output = run((prog, modest, 100000))
if output is None:
pmark(1, f"Could not run program", "0/10")
return 0
output = output.strip()
exp = file2ngrams(modest, n)
ans = ngrams(output, n)
tally = 0
output = run((prog, modest, "100"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if len(output) == 100:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/8")
return tally
else:
pmark(1, "Program cannot be run", "0/10")
return tally
for gram in sample(list(exp.keys()), 3):
print(f" Comparing distributions for {n+1}-grams starting with {gram}")
if gram not in ans:
pmark(2, f"{n}-gram '{gram}' not found in answer", "0/2")
continue
exp_cond = exp[gram]
ans_cond = ans[gram]
exp_support = sorted(list(exp[gram].keys()))
ans_support = sorted(list(ans[gram].keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}")
print(f" Got {ans_support}")
exp_distrib = [exp[gram][k] for k in exp_support]
ans_distrib = [ans[gram][k] for k in ans_support]
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
return tally
def mkseed(s):
    # Derive a small deterministic seed from the student's username so that
    # the same random test cases are chosen on every marking run.
    x = 0
    for c in s:
        x ^= ord(c)
    return x

if __name__ == '__main__':
    from random import seed
    dsname = argv[1]
    marks = os.path.join(CS, "marking", dsname)
    repo = os.path.join(CS, "repos", dsname)
    tally = 0
    bonus = 0
    print("Marking Assignment 2")
    print("====================")
    print()
    seed(mkseed(dsname))
    tally += mark2a(repo)
    tally += mark2b(repo)
    tally += mark2c(repo)
    bonus += mark2d(repo)
    print("="*80)
    pmark(0, "Total for assignment 2", f"{tally+bonus}/30")
from multiprocessing import Pool
from string import ascii_letters, digits
from scipy.stats import wasserstein_distance as emd
from Levenshtein import distance as levenshtein
from sys import argv
from sys import stdout
from random import choices, sample
from subprocess import PIPE, DEVNULL, STDOUT
from nltk.tokenize import word_tokenize
import subprocess
import ast
import os
import re
CS="/home/xgb21195/teaching/cs101"
republic = "/home/xgb21195/teaching/cs101/cs101-csai/lec2/republic.txt"
barlow = "/home/xgb21195/teaching/cs101/cs101-csai/marking/barlow.txt"
with open(barlow) as fp:
barlow_words = [w.lower() for w in word_tokenize(fp.read())]
aabbcc = "/home/xgb21195/teaching/cs101/cs101-csai/marking/aabbcc.txt"
mary = "/home/xgb21195/teaching/cs101/cs101-csai/marking/mary.txt"
with open(mary) as fp:
mary_words = [w.lower() for w in word_tokenize(fp.read())]
nltkvomit = '[nltk_data] Downloading package punkt to /home/xgb21195/nltk_data...\n[nltk_data] Package punkt is already up-to-date!\n'
letters = ascii_letters + digits
def ntoks(data, n=2):
    """
    From the data, construct a dictionary of n-tokens and
    the probabilities of following words. That is, the
    conditional probability of a word given its n predecessors.
    """
    word_dict = {}
    previous = []
    for w in word_tokenize(data):
        w = w.lower()
        if len(previous) == n:
            key = tuple(previous)
            # count this word as a successor of the preceding n-token key
            counts = word_dict.setdefault(key, {})
            counts[w] = counts.get(w, 0) + 1
        previous.append(w)
        previous = previous[-n:]
    probs = { c: { d: word_dict[c][d]/sum(word_dict[c].values())
                   for d in word_dict[c].keys() } for c in word_dict.keys() }
    return probs
def file2ntoks(filename, n=2):
    """
    Read a file and return a dictionary of n-toks and
    the probabilities of following words. That is, the
    conditional probability of a word given its n
    predecessors.
    """
    with open(filename, encoding="utf-8") as fp:
        return ntoks(fp.read(), n)

def file2prob(filename):
    """
    Read a file and return a dictionary of words and
    their probabilities
    """
    word_dict = {}
    word_total = 0
    with open(filename, encoding="utf-8") as fp:
        for w in word_tokenize(fp.read()):
            w = w.lower()
            word_dict[w] = word_dict.get(w, 0) + 1
            word_total += 1
    probs = { w: word_dict[w]/word_total for w in word_dict }
    return probs

def find_assignment(repo, name):
    # Walk the student's repository looking for a file whose name starts
    # with "assignment" and contains the part name (e.g. "3a").
    for d, subs, files in os.walk(repo):
        for fname in files:
            lname = fname.lower()
            if lname.startswith("assignment") and name in lname:
                return os.path.join(d, fname)
def ispython(fname):
try:
with open(fname) as fp:
p = fp.read()
ast.parse(p)
return True
except:
return False
def runner(fname):
if ispython(fname):
def _run(*args):
cmd = ["/bin/python", fname] + [str(a) for a in args]
try:
proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
if proc.returncode is not None and proc.returncode != 0:
return None
return proc.stdout.decode("utf-8")
except Exception as e:
print(f"Error running program: {e}")
return None
else:
def _run(*args):
cmd = [fname] + [str(a) for a in args]
try:
proc = subprocess.run(cmd, timeout=15, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
if proc.returncode is not None and proc.returncode != 0:
return None
return proc.stdout.decode("utf-8")
except Exception as e:
print(f"Error running program: {e}")
return None
return _run
def run(args):
fname = args[0]
_run = runner(fname)
result = _run(*args[1:])
return result.removeprefix(nltkvomit).rstrip("\r\n") if result is not None else None
def pmark(indent, msg, marks):
l = indent*4 + len(msg)
dent = ' ' * (indent*4)
dots = '.' * (69-l)
print(dent + msg + dots + " " + marks + " marks")
def mark3a(repo):
tally = 0
stdout.write("Finding assignment 3a...")
prog = find_assignment(repo, "3a")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3a", "0/10")
return 0
print(f" {prog}")
output = run((prog, mary))
if output is None:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, mary)))
return 0
else:
pmark(1, "Program exists and can be run", "1/1")
tally += 1
output = output.strip().removeprefix(nltkvomit)
if output in mary_words:
pmark(1, "Program output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Program output is not well-formed", "0/2")
print(f" Expected a single word in {mary_words}, got:")
print(f" {repr(output)}"[:500] + "...")
print(" Testing mary.txt distribution...")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, mary) for _ in range(360)]):
if w is None:
continue
counts[w] = counts.get(w, 0) + 1
probs = { w : counts[w]/sum(counts.values()) for w in counts }
n = len(probs)
if n != 5:
pmark(2, f"Strange distribution of size {n} found", "0/2")
else:
pmark(2, f"5-ary distribution found", "1/1")
tally += 1
balanced = all( abs(0.2-v) < 0.05 for v in probs.values() )
if balanced:
pmark(2, "Balanced 5-ary distribution found", "1/1")
tally += 1
else:
pmark(2, "Unbalanced 5-ary distribution found", "0/1")
print(f" {probs}")
barlow_probs = file2prob(barlow)
print(" Testing barlow.txt...")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, barlow) for _ in range(360)]):
if w is None:
continue
counts[w] = counts.get(w, 0) + 1
probs = { w : counts[w]/sum(counts.values()) for w in counts }
if len(probs) > 0:
exp_support = sorted(list(barlow_probs.keys()))
ans_support = sorted(list(k.lower() for k in probs.keys()))
d = levenshtein(" ".join(exp_support), " ".join(ans_support))
if d > 50:
pmark(2, f"Distribution with bad support (d={d}) found", "0/2")
print(f" Exp: {exp_support}"[:500] + "...")
print(f" Got: {ans_support}"[:500] + "...")
else:
pmark(2, f"Distribution with acceptable support (d={d}) found", "2/2")
tally += 2
u = list(barlow_probs.values())
v = list(probs.values())
d = emd(u, v)
df = "%.04f" % d
if d < 0.01:
pmark(2, f"Distribution acceptably close (d={df})", "3/3")
tally += 3
else:
pmark(2, f"Distribution too far away (d={df})", "0/3")
else:
pmark(2, "Could not understand program output", "0/5")
return tally
def mark3b(repo):
tally = 0
stdout.write("Finding assignment 3b...")
prog = find_assignment(repo, "3b")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3b", "0/10")
return 0
print(f" {prog}")
all_pairs = file2ntoks(barlow, 1)
probs = file2prob(barlow)
support = [k for k in probs.keys()]
distrib = [probs[k] for k in support]
output = run((prog, barlow, "it"))
if output is not None:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
if output in barlow_words:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, "Output is not well-formed", "0/2")
print(f" Expected one of JPB's words, got {output[:100]}...")
else:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, barlow, "it")))
return 0
# it was unclear in the instructions whether punctuation
# should be included. NLTK's word_tokenize() thinks that
# punctuation symbols are words. if the student does not
# include punctuation, drop it from our reference
# distribution
def filter(d):
return { k: desc(w) for k,w in d.items()
if all(c in letters for c in k[0]) }
def desc(v):
return filter(v) if isinstance(v, dict) else v
def renorm(d):
return { k: v/sum(d.values()) for k, v in d.items() }
words = filter(probs)
for init in sample(list(words.keys()), 2, counts=[int(1000*p) for p in words.values()]):
stdout.write(f" Computing 2-token distribution starting with '{init}'......")
counts = {}
with Pool(processes=6) as pool:
for w in pool.imap_unordered(run, [(prog, barlow, init) for _ in range(120)]):
if w is None:
continue
w = w.lower()
counts[w] = counts.get(w, 0) + 1
print(" done.")
if len(counts) == 0:
pmark(2, "Bad program output", "0/2")
continue
probs = { w : counts[w]/sum(counts.values()) for w in counts }
if all(all(c in letters for c in w) for w in probs):
print(" No punctuation found at top level, renormalising.")
pairs = { k: renorm(v) for k, v in filter(all_pairs).items() }
else:
pairs = all_pairs
exp_support = sorted(list(pairs[(init,)].keys()))
exp_distrib = [pairs[(init,)][k] for k in exp_support]
exp_string = " ".join(t[0] for t in exp_support)
ans_support = sorted(list(probs.keys()))
ans_distrib = [probs[k] for k in ans_support]
ans_string = " ".join(ans_support)
if len(exp_support) == 0:
if len(ans_support) == 0:
pmark(2, f"Correctly returned an empty distribution", "2/2")
else:
pmark(2, f"Invalid distribution size {len(ans_support)} should be {len(exp_support)}", "0/2")
else:
d = levenshtein(exp_string, ans_string)
if d < 200:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
if len(ans_distrib) > 0:
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.1:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" {exp_distrib}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
print(f" {ans_distrib}"[:500] + "...")
else:
pmark(2, "Distribution is empty", "0/1")
    # Sneaky test case: "lamb" is the last word of mary.txt and has no
    # successor, so the program should exit with an error (non-zero code).
    proc = subprocess.run(["/bin/python", prog, mary, "lamb"], timeout=10, stdin=DEVNULL, stdout=PIPE, stderr=STDOUT)
    if proc.returncode != 0:
        pmark(1, f"Correct error on sneaky test case", "2/2")
        tally += 2
    else:
        pmark(1, f"Sneaky test case does not error", "0/2")
    return tally
def mark3c(repo):
tally = 0
stdout.write("Finding assignment 3c...")
prog = find_assignment(repo, "3c")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3c", "0/10")
return 0
print(f" {prog}")
return mark_toks(prog, 1)
def mark3d(repo):
tally = 0
stdout.write("Finding assignment 3d (bonus)...")
prog = find_assignment(repo, "3d")
if prog is None:
print(" not found")
pmark(1, "Could not find assignment 3d", "0/10")
return 0
print(f" {prog}")
return mark_toks(prog, 2)
def mark_toks(prog, n):
output = run((prog, barlow, 100))
if output is None:
pmark(1, f"Could not run program", "0/10")
print(" " + " ".join((prog, barlow, "100")))
return 0
exp = file2ntoks(barlow, n)
tally = 0
output = []
with Pool(processes=6) as pool:
for s in pool.imap_unordered(run, [(prog, barlow, 100) for _ in range(24)]):
if s is None:
continue
output.append(s.strip())
if len(output) > 0:
pmark(1, "Program exists and can be run", "2/2")
tally += 2
sentence = " ".join(output)
words = word_tokenize(sentence)
ans = ntoks(sentence, n)
nwords = len(words)
if nwords == 2400:
pmark(1, "Output is well-formed", "2/2")
tally += 2
else:
pmark(1, f"Expected 2400 words in total, got {nwords}", "0/8")
return tally
else:
pmark(1, "Program cannot be run", "0/10")
print(" " + " ".join((prog, barlow, "it")))
return tally
for word in sample(list(exp.keys()), 3):
print(f" Comparing distributions for {n+1}-words starting with {' '.join(word)}")
if word not in ans:
pmark(2, f"{n}-word '{' '.join(word)}' not found in answer", "0/2")
print(f" answer: {list(' '.join(w) for w in ans)}")
continue
exp_cond = exp[word]
ans_cond = ans[word]
exp_support = sorted(list(exp[word].keys()))
ans_support = sorted(list(ans[word].keys()))
d = levenshtein("".join(exp_support), "".join(ans_support))
if d < 4:
pmark(2, f"Support is good (d={d})", "1/1")
tally += 1
else:
pmark(2, f"Support is not good (d={d})", "0/1")
print(f" Exp {exp_support}"[:500] + "...")
print(f" Got {ans_support}"[:500] + "...")
exp_distrib = [exp[word][k] for k in exp_support]
ans_distrib = [ans[word][k] for k in ans_support]
d = emd(exp_distrib, ans_distrib)
df = "%.04f" % d
if d < 0.1:
pmark(2, f"Distribution is acceptably close (d={df})", "1/1")
tally += 1
else:
pmark(2, f"Distribution is too far (d={df})", "0/1")
return tally
def mkseed(s):
    # Derive a small deterministic seed from the student's username so that
    # the same random test cases are chosen on every marking run.
    x = 0
    for c in s:
        x ^= ord(c)
    return x

if __name__ == '__main__':
    from random import seed
    dsname = argv[1]
    marks = os.path.join(CS, "marking", dsname)
    repo = os.path.join(CS, "repos", dsname)
    tally = 0
    bonus = 0
    print("Marking Assignment 3")
    print("====================")
    print()
    seed(mkseed(dsname))
    tally += mark3a(repo)
    tally += mark3b(repo)
    tally += mark3c(repo)
    print("="*80)
    pmark(0, "Total for assignment 3", f"{tally}/30")
mary had a little lamb