WORCESTER POLYTECHNIC INSTITUTE
Computer Science Department

CS4341 - Artificial Intelligence

Version: Wed Apr 24 19:52:44 EDT 2013

Course Contents

Lecture 1: Introduction (1)

Course Information
-- email, web, book, intro page, projects, weekly exams
-- myWPI, web-turnin
-- my preparation
-- sources for slides
What is AI? Definitions:
-- AI is the study of ...
- computations that make it possible to perceive, reason, and act.
- how to make computers do things which, at the moment, people do better.
- the design of intelligent agents.
- how to make computers act like those in the movies!
Four goals: thinking/acting, humanly/rationally
Rational: does the right thing given what it knows
Thinking Humanly (reasoning)
- Cognitive modeling
- Implement model of reasoning
- Does it reason like a human?
Acting Humanly (behavior)
- Turing Test
- does not include perception
Thinking Rationally (reasoning)
- Laws of thought ("logic")
- works in practice?
Acting Rationally (behavior) [this book]
- The Rational Agent approach
- tries to find best outcome, or best 'expected' outcome.
- actions should achieve one's goals
Engineering goal -- solve real-world problems
Scientific goal -- explain various sorts of intelligence
How AI has changed
-- focus on systems that act rationally
-- this is the book's focus
-- there are areas that this book doesn't include (e.g., design, creativity)
Foundations of AI
- Philosophy
- Mathematics
- Economics
- Neuroscience
- Psychology
- Computer Engineering
- Control theory and cybernetics
- Linguistics
The Near-Term Applications
-- e.g., routine design
-- e.g., detect credit card fraud
The Long-Term Applications
-- what is still left to do...????
-- chess? Deep Blue
-- space? Remote Agent and Deep Space 1
-- autonomous vehicles?
What Intelligent Systems Can Do
-- diagnosis, design, planning, scheduling, navigation, vision, tutoring, learning, ...
AI Sheds New Light on Traditional Questions
-- computers provide new concepts & language
-- computers require precision (e.g., what is "creativity"?)
-- explore impact of technique or knowledge (add/remove)
-- theories --> computational models --> implementations --> results --> refinements
-- use of computers allows testing
-- well tested methods used as tools
AI Helps Us to Become More Intelligent
-- suggests new/better ways to tackle problems
AI Is Becoming Less Conspicuous, yet More Essential
-- Airport gate allocation
-- many embedded applications (cars, washing machines, ...)
Criteria for Success
-- clear definition of task and implementable procedure for it
-- regularities or constraints available
-- other knowledge
-- solves real problem
-- provides new theory/method
-- suggests new opportunities

Lecture 2: Intelligent Agents (2-2.3)

Agents & Environments
- agent, sensors, actuators, environment
- percept, percept sequence
- action, action sequence
- agent program implements agent function (percepts --> actions)
Rationality
The Nature of Environments

Lecture 3: Intelligent Agents (2.4, 3.1)

Structure of Agents
- Agent = architecture + program
- Table-driven program: table indexed by percept sequences
  -- full table not practical for real problems
  -- but note Case-Based Reasoning, tables in chess, and memoization (look-up tables).
- Simple Reflex Agents
  -- next action depends on current percept only
  -- condition-action rule
  -- Rule-Match picks rule to use
  -- environment must be fully observable
  -- there must always be a matching rule (otherwise ???)
  -- the basic idea behind rule-based systems
- Model-Based Reflex Agents
  -- internal state: keep track of best guess of state of environment
  -- model: how next state depends on current state and action
  -- in casual use, model = internal state (i.e., a model of environment)
- Goal-Based Agents
  -- goals: desirable situations (result is achieved/happy or not)
  -- needs to have: what will happen if I do this...?
  -- can check relevant actions wrt achieving goal
- Utility-Based Agents
  -- combine with model
  -- utility: quality of being useful (degrees of happy)
  -- utility function: estimates the performance measure
  -- maximize expected utility: will behave rationally
- Learning Agents
  - agents can learn to become more competent
  - learning element: makes improvements
  - performance element: selects actions
  - critic: determines (using fixed performance standard)
    whether/how performance element should be modified
    -- i.e., it will perform differently after modification
  - problem generator: suggests actions that lead to new experiences
- Representations of the environment
  -- atomic: no internal structure
  -- factored: vector of attribute values (features)
  -- structured: objects with attributes and relationships
  -- consequences ???
Problem-solving Agents
- Goal formulation: adopt goal: first step in problem-solving
- Problem formulation: decide what actions and states to consider
- with options: may need to examine future actions to determine value
- solution to some problems is a set of actions ("path")
- solution to other problems is a state
Well-defined problems & solutions
- initial state
- set of possible actions applicable in state s
- transition model gives state resulting from each action
- state space: set of reachable states from initial state
  -- state-to-state transitions form a graph
- goal test detects goal state (the state or its properties)
  -- might be more than one goal state
- step cost: cost of taking an action from state to state
- path costs: cost of following a path
- solution: path from initial state to goal
- optimal solution: lowest cost solution
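A minimal sketch of how the components above (initial state, actions, transition model, goal test, step cost) might be packaged in Python. The road map `ROADS`, its costs, and the names `RouteProblem` and `path_cost` are invented for illustration and are not taken from the course materials.

```python
from dataclasses import dataclass

# Hypothetical road map: ROADS[city][neighbour] = step cost (made-up values).
ROADS = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 2, "D": 5},
    "C": {"D": 1},
    "D": {},
}

@dataclass(frozen=True)
class RouteProblem:
    initial: str
    goal: str

    def actions(self, state):
        """Actions applicable in `state`: drive to a neighbouring city."""
        return list(ROADS[state])

    def result(self, state, action):
        """Transition model: driving to a city puts you in that city."""
        return action

    def is_goal(self, state):
        """Goal test."""
        return state == self.goal

    def step_cost(self, state, action):
        return ROADS[state][action]

def path_cost(problem, actions):
    """Follow `actions` from the initial state; return (total cost, final state)."""
    state, total = problem.initial, 0
    for action in actions:
        total += problem.step_cost(state, action)
        state = problem.result(state, action)
    return total, state

problem = RouteProblem("A", "D")
```

Here `path_cost(problem, ["B", "C", "D"])` gives `(4, "D")`, while the shorter-looking path `["C", "D"]` costs 5, so the first is the optimal solution for this map.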

Lecture 4: Uninformed search (3.2-3.4)

Example problems
- toy problems vs real-world problems
- toy:
  - vacuum world (goal = squares clean; solution = path)
  - 8-puzzle (goal = configuration; solution = path)
  - 8-queens puzzle (goal = configuration; solution = state)
- real-world
  - route-finding
  - touring (e.g., traveling salesperson problem)
  - VLSI layout
  - robot navigation
  - packing a cargo plane
Searching for solutions
- search tree
  -- nodes = states
  -- links = actions (with costs)
  -- root node = start state
- expand node: apply possible actions to generate new states
- parent nodes lead to child nodes
- leaf node: no children (yet)
- frontier: leaf nodes ready for expansion
- search strategy: how to select which node to expand next
  -- determined by how frontier queue built and how selection made
  -- e.g., FIFO queue, LIFO queue, priority queue
- loops and redundant paths (graph)
- Tree-Search vs Graph-Search
  -- for graph search recognize where you have already searched
Uninformed search
- Uninformed: no additional information about states
- Informed: uses knowledge of how "promising" a state is (wrt goal)
Breadth-first
- all nodes at one level expanded before any nodes at next level
- test for goal at generation time (save time/space)
- huge memory requirements
Uniform-cost
- assumes different step costs
- expand node with lowest current path cost: g(n)
- use priority queue
- alternative higher-cost paths to a node are ignored
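The uniform-cost idea can be sketched with Python's `heapq` as the priority queue, ordered by g(n). The graph `GRAPH` and the function name are assumptions made for this example only.

```python
import heapq

# Hypothetical weighted graph: GRAPH[node][neighbour] = step cost.
GRAPH = {"A": {"B": 1, "C": 4}, "B": {"C": 2, "D": 5}, "C": {"D": 1}, "D": {}}

def uniform_cost_search(graph, start, goal):
    """Return (path cost, path) for a cheapest path, or None if no path exists."""
    frontier = [(0, start, [start])]   # priority queue ordered by g(n)
    best = {start: 0}                  # cheapest known cost to each node
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if g > best.get(state, float("inf")):
            continue                   # a cheaper path to this node was already found
        if state == goal:              # goal test at expansion time
            return g, path
        for nxt, cost in graph[state].items():
            new_g = g + cost
            if new_g < best.get(nxt, float("inf")):
                best[nxt] = new_g
                heapq.heappush(frontier, (new_g, nxt, path + [nxt]))
    return None
```

On this graph, `uniform_cost_search(GRAPH, "A", "D")` returns `(4, ["A", "B", "C", "D"])`; the direct but more expensive route to C (cost 4) is discarded in favour of the route through B (cost 3).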
Depth-first
- expands most recently generated node
- goes deep down a path before investigating alternatives
- involves backing up from nodes that have no successors left to expand
- space complexity much better than Breadth-first
- the basic search of AI (often with modifications)
Depth-limited
- depth-first with predetermined search depth limit
- path not explored past depth limit
- need to pick good value for limit (based on problem)
Iterative deepening depth-first
- depth-first with varying depth limit
- start with depth at 0 and increase it
- some redundancy but not significant
- adds a touch of Breadth-first, as at each level, whole tree may be searched
- preferred uninformed search
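A short sketch of depth-limited search wrapped in an iterative-deepening loop. The tree `TREE`, the `max_depth` cap, and the function names are assumptions for illustration.

```python
# Hypothetical search tree as adjacency lists.
TREE = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
        "D": [], "E": [], "F": ["G"], "G": []}

def depth_limited(graph, state, goal, limit, path):
    """Depth-first search that never goes more than `limit` steps below `state`."""
    if state == goal:
        return path
    if limit == 0:
        return None                     # cut off: depth limit reached
    for nxt in graph[state]:
        if nxt in path:                 # avoid loops along the current path
            continue
        found = depth_limited(graph, nxt, goal, limit - 1, path + [nxt])
        if found is not None:
            return found
    return None

def iterative_deepening(graph, start, goal, max_depth=20):
    """Run depth-limited search with limits 0, 1, 2, ... up to max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited(graph, start, goal, limit, [start])
        if found is not None:
            return found
    return None
```

`iterative_deepening(TREE, "A", "F")` returns `["A", "C", "F"]`: limits 0 and 1 fail, and limit 2 re-searches the shallow levels before reaching F.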

Lecture 5: Informed search (3.5-3.7)

Heuristic/Informed Search
-- use problem-specific knowledge to gain efficiency
-- can guide and prune
-- evaluation function --- f(n)
- cost estimate for path through n to goal
-- actual path cost to node n --- g(n)
-- heuristic function --- h(n)
- estimated cost of cheapest path from n to goal
- uses "heuristic" to estimate ("rule of thumb")
Greedy best-first search
- f(n) = h(n) --- instead of g(n)
- sample heuristic = "as the crow flies"
  -- e.g., roads are always longer, but it's a good estimate.
- greedy -- doesn't take current cost into account!
A* search
- "A star": a kind of best-first search
- estimated path cost through n
- f(n) = g(n) + h(n)
- pick lowest f(n) each time
- complete: will always find goal if there is one
- optimal: finds best path
- h(n) must be admissible -- i.e., optimistic!
  -- it never overestimates actual cost to goal
- accurate h(n) close to or equals actual cost
  -- what if h(n) = actual cost???
- can run out of space
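A sketch of A* on a small grid, with f(n) = g(n) + h(n) and Manhattan distance as an admissible h(n) for unit-cost, 4-connected moves. The grid layout and the name `a_star` are assumptions for illustration.

```python
import heapq

# Hypothetical map: '#' is a wall, '.' is free; each move costs 1.
GRID = ["..#.",
        "..#.",
        "...."]

def a_star(grid, start, goal):
    """Return a cheapest path as a list of (row, col) cells, or None."""
    def h(cell):                       # Manhattan distance: never overestimates here
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(h(start), 0, start)]  # entries are (f, g, cell)
    g_cost = {start: 0}
    parent = {start: None}
    while frontier:
        f, g, cell = heapq.heappop(frontier)
        if g > g_cost[cell]:
            continue                   # stale queue entry
        if cell == goal:
            path = []
            while cell is not None:    # walk parent links back to the start
                path.append(cell)
                cell = parent[cell]
            return path[::-1]
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != "#":
                new_g = g + 1
                if new_g < g_cost.get((nr, nc), float("inf")):
                    g_cost[(nr, nc)] = new_g
                    parent[(nr, nc)] = cell
                    heapq.heappush(frontier, (new_g + h((nr, nc)), new_g, (nr, nc)))
    return None
```

For `a_star(GRID, (0, 0), (0, 3))` the wall forces a detour through the bottom row, so the returned path has 8 cells (7 moves).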
Memory-bounded heuristic search
- Recursive best-first
  -- it prunes search if another branch becomes better
  -- but remembers best cost of pruned subtree
- Simplified Memory-bounded A* (SMA*)
  -- uses A* until memory full
  -- expands newest best leaf, deletes oldest worst leaf.
- SMA* robust choice for searching
Heuristic functions
- good heuristics lower effective branching factor
  -- i.e., branching that actually occurs in a search
- ebf close to 1 indicates few unnecessary branches
- heuristic functions with close to correct values are best
- use relaxed problems (fewer restrictions) to generate heuristics
- cost of optimal soln. to relaxed problem is admissible heuristic for original problem
  -- (e.g., Manhattan distance for 8 puzzle)
- Pattern databases: store exact costs for subproblems
  -- gives heuristic value for cost of full problem

Lecture 6: Local Search

Local search & optimization problems
- local search usually looking for a solution state, not a path
- usually looks around a state (or states) by modifying it (them)
- optimization: find best state, measured by an objective function
- state space "landscape"
  -- surface formed by function's value across all states
- global maximum (optimum) vs. local maximum
- could be looking for minimum (gradient descent)
Hill-climbing
- looking for maximum
- search moves in direction of most improvement at each move
  -- steepest ascent (it's greedy)
  -- just records current state
- problems: local maxima; ridges; plateaux
- getting unstuck: stochastic (add some randomness at each move)
- random restart hill-climbing: a set of random start states
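A sketch of steepest-ascent hill climbing on a one-dimensional landscape, plus a restart wrapper. The list `LANDSCAPE` and both function names are invented for this example; the restart points are passed in explicitly rather than drawn at random so the behaviour is easy to follow.

```python
# Hypothetical objective values; index = state, neighbours = adjacent indices.
LANDSCAPE = [1, 3, 5, 4, 2, 6, 9, 7]

def hill_climb(values, start):
    """Move to the best neighbour until no neighbour is better; return the final state."""
    current = start
    while True:
        neighbours = [i for i in (current - 1, current + 1) if 0 <= i < len(values)]
        if not neighbours:
            return current
        best = max(neighbours, key=lambda i: values[i])
        if values[best] <= values[current]:
            return current              # local (possibly global) maximum
        current = best

def restart_hill_climb(values, starts):
    """Run hill climbing from each start state and keep the best result."""
    return max((hill_climb(values, s) for s in starts), key=lambda i: values[i])
```

Starting at state 0 the search stops at state 2 (value 5), a local maximum; starting at state 4 it reaches state 6 (value 9), the global maximum. `restart_hill_climb(LANDSCAPE, [0, 4])` therefore returns 6.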
Simulated-annealing
- annealing = heating then gradually cooling
- minimize cost (descent)
- disturb search out of local minima
- gradually disturb ("shake") less over time
- makes a random move: accepts it with some probability
- probability is lower the worse the move makes things (a shake)
  -- you're still trying to go down hill to global minimum
- probability slowly decreases also depending on time
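A sketch of simulated annealing for cost minimisation. The acceptance rule exp(-delta / T) and the geometric cooling schedule are one common choice, assumed here rather than taken from the notes; `COSTS`, `cost`, `neighbour`, and all parameter values are hypothetical.

```python
import math
import random

def simulated_annealing(cost, neighbour, start, t0=10.0, cooling=0.95,
                        steps=500, rng=None):
    """Return the lowest-cost state seen during the run."""
    rng = rng or random.Random(0)       # fixed seed only to make runs repeatable
    current = best = start
    t = t0
    for _ in range(steps):
        candidate = neighbour(current, rng)
        delta = cost(candidate) - cost(current)
        # Improving moves are always taken; worsening moves ("shakes") are taken
        # with probability exp(-delta / t), which shrinks as delta grows or t falls.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            current = candidate
        if cost(current) < cost(best):
            best = current
        t = max(t * cooling, 1e-9)      # gradual cooling; floor avoids division by zero
    return best

# Hypothetical cost landscape over states 0..7.
COSTS = [9, 7, 8, 3, 5, 4, 1, 6]

def cost(i):
    return COSTS[i]

def neighbour(i, rng):
    """Random step left or right, clamped to the valid range."""
    return min(len(COSTS) - 1, max(0, i + rng.choice((-1, 1))))

result = simulated_annealing(cost, neighbour, start=0)
```

Because the function tracks the best state seen, the result is never worse than the start state; whether it reaches the global minimum depends on the schedule and the random draws.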
Local beam search
- beam searches move in restricted areas of search space
- k random start states
- expand all states
- pick k best, and continue
- may have poor diversity (i.e., stuck in a region of the state space)
- variants add some randomness to encourage "diversity"
Local search in Continuous spaces
- continuous actions/states lead to infinite branching factors!
- easiest solution -- make discrete changes
  -- e.g., consider new states only by making discrete (delta) changes
- can also compute local gradients for hill-climbing

Lecture 7: Genetic Algorithms

Genetic Algorithms (text's overview)
- analogy to natural selection
  -- survival of the fittest
- works on a series of populations of individuals (states)
  -- each population producing the next
- initial population of k random states (k often 100+)
- each state is rated by an objective/fitness function
  -- higher value, fitter individuals
- individuals represent descriptions of states (using features)
  -- often as a binary string
- fitter individuals replicated
  -- fitter get better chance of taking part in production of next population
  -- more fit, more copies
- randomly select pairs for mating (crossover)
- for each pair, randomly select crossover point.
- crossover produces new pairs (for next population).
- a small number of individuals are mutated (very small random change)
- stop after some number of generations,
  when very fit individual appears,
  or if best (or avg) fitness is stable.
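A compact sketch of the loop described above: fitness-proportional selection, single-point crossover, and bit-flip mutation on binary strings. The parameter values, the "+1" weighting trick, and the use of bit-count as fitness are assumptions for illustration; it stops after a fixed number of generations.

```python
import random

def genetic_algorithm(fitness, length=12, pop_size=20, generations=40,
                      p_mutate=0.05, seed=0):
    """Return the fittest individual (a list of 0/1 bits) seen in any generation."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Fitter individuals get a better chance of being selected as parents.
        weights = [fitness(ind) + 1 for ind in pop]   # +1 keeps every weight positive
        nxt = []
        while len(nxt) < pop_size:
            mum, dad = rng.choices(pop, weights=weights, k=2)
            cut = rng.randint(1, length - 1)          # random crossover point
            for child in (mum[:cut] + dad[cut:], dad[:cut] + mum[cut:]):
                # Mutation: flip each bit with small probability.
                nxt.append([b ^ 1 if rng.random() < p_mutate else b for b in child])
        pop = nxt[:pop_size]
        best = max(pop + [best], key=fitness)
    return best

# Toy fitness: number of 1 bits in the string.
best = genetic_algorithm(sum)
```

The returned individual is at least as fit as the best member of the initial random population, since the best-so-far is carried along between generations.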
Genetic Algorithms (additional information)
- See these A Quick Introduction to Genetic Algorithms notes.
- many variations of algorithm
- all have individuals, populations, fitness, crossover, mutation
- vary by:
  - population size
  - whether the population size varies
  - representation of individuals
    -- direct representation (e.g., LISP program)
    -- coded representation (e.g., binary string)
  - how crossover done
  - probability of mutation
  - whether some individuals copied from previous population
  - whether individuals are checked for legality after crossover/mutation
  - how fitness is calculated and used
  - whether diversity is used to select for a new population
- See these Diversity Selection notes.
GAs and Creativity
- Koza
- automated circuit design
- uses circuit description language
- each individual in the population is a circuit description

Lecture 8: Adversarial Search

Games
- multi-agent, competitive
- deterministic, turn-taking, two-player, zero-sum, fully observable
- zero sum: one wins & one loses; or both draw.
- very large game trees (search spaces): need to "prune" and ignore parts of game tree
  -- (search tree < game tree)
- chess: roughly 10⁴⁰ distinct positions in search graph, and far more nodes (about 10¹⁵⁴) in game tree (intractable)
- terminal state: game is over (one player has won, or it's a draw)
- looking ahead: complete search can find terminal states (correct utility)
- utility function: e.g., win (+1), lose (-1), draw (0)
- looking ahead: can limit depth and estimate utility
- ply: a move by one player
- need legal move generator (can filter by what's "plausible")
- use transposition (hash) table of evaluations at previously seen positions
- can use pruning strategies
  -- e.g., based on shallow, fast evaluation
  -- danger: may prune the path that leads to a win!
Optimal decisions in games (Minimax)
- assume both players play optimally (they want to win)
- A plays their best move, assuming that B responds with their best move
  -- all the way down the tree!
- High utility = player1 wins; Low utility = player2 wins.
- Player1 tries to move value up, Player2 tries to move value down.
- Search down the tree to terminal state, then back the values up taking min or max values until all states resulting from move choices have values that indicate what they'll lead to if played. Pick the best.
- pick move that avoids opponent's best moves!
- time is exponential in search depth. :-(
- getting to optimal requires searching to terminal states
  -- just not viable for huge game trees!
Alpha-Beta pruning
- pruning!
- an addition to minimax
- don't expand a node that can't provide a score that's better than what you already have
- time/space saved can allow deeper searches (e.g., twice as deep)
- still exponential with depth, but visits fewer nodes due to pruning
- game tree branch order affects pruning possibilities
- chess: could order by expected utility
  -- e.g., captures; threats; move forward; move back
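A sketch of alpha-beta pruning on a nested-list game tree. The optional `visited` list is an addition for this example so the effect of pruning can be observed; the tree values are hypothetical.

```python
TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]

def alphabeta(node, alpha=float("-inf"), beta=float("inf"),
              maximizing=True, visited=None):
    """Minimax value of `node`, skipping branches that cannot change the result."""
    if isinstance(node, (int, float)):
        if visited is not None:
            visited.append(node)            # record which leaves were evaluated
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                       # min player would never allow this branch
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)
        if alpha >= beta:
            break                           # max player already has something better
    return value
```

With `seen = []`, `alphabeta(TREE, visited=seen)` returns 3 and `seen` is `[3, 12, 8, 2, 14, 5, 2]`: leaves 4 and 6 are never examined, because after seeing 2 the middle move is already worse than the 3 guaranteed by the first move. Reordering the third subtree as `[2, 5, 14]` would prune two more leaves, which is why branch order matters.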
Imperfect decisions
- can't search tree to terminal state
- cut off search earlier and use evaluation function
  -- accurate estimate of chances of winning in that state (i.e., utility)
- depth limited, or iterative deepening ("anytime algorithm")
- Features:
  - # of pieces
  - strength of pieces (queen > pawn)
  - mobility (poss. moves)
  - control (squares threatened)
  - threats (potential captures)
  - patterns of pieces (e.g., diagonal pawns)
- Evaluation function: often a weighted linear function
Chess: Heuristic Continuation Fights the Horizon Effect
- fixed depth search produces a "horizon" (may be bad beyond it!)
- singular-extension
  -- if one move's value is much better than rest, then keep looking down that branch, as it's a place where the most change in value could result from minimaxing
- search-until quiescent
  -- look for quiet (i.e., no possible captures)
Chess: Deep Blue plays Grandmaster Chess
- see this and this
- first machine to win chess game against reigning world champion
- uses alpha-beta search, with selective extensions
- could search to a depth of 12 ply
- has opening "book" and all five-or-fewer piece endgames
- massively parallel, 30-node, RS/6000, SP-based computer system enhanced with 480 special purpose VLSI chess chips
- evaluates 200,000,000 chess positions per second
- several months working with a grandmaster on evaluation function
- "In three minutes, ... it computes everything it knows about the current position from scratch."
Chinook: world man-machine checkers champion
- see this and this.

Lecture 9: Constraint Satisfaction Problems 1 (6.1-6.2)

Defining CSPs
- Constraint Satisfaction Problem (CSP)
- set of constraints that specify allowable combinations of values of variables
  -- e.g., X₁ ≥ X₂, X₁ > X₃, X₂ ≥ X₃
- set of variables (each one can have a value)
  -- e.g., Vbls = { X₁, X₂, X₃ }
- a set of allowable values (domain) for each variable
  -- e.g., the domain of each variable is {1, 2, 3, 4, 5}
  -- usually discrete, finite domains
- the problem is to find a complete and consistent assignment
  -- all variables have values, no constraints are violated
- there may be several, or no, consistent assignments
- the result may need to be all or one consistent assignment
- constraint graph: nodes = variables; links show constraint influence
  -- If constraint SA ≠ WA then SA-----WA in graph
- constraint propagation:
  -- the influence of removing inconsistent values can spread through the graph (prune domains)
- constraints can be fully enumerated
  -- show all allowable assignments for variables in the constraint
  -- e.g., { (red, green), (red, blue), ... (blue, green)}
- types of problem solvers for CSPs
  -- search making one variable assignment at a time
  -- gradually eliminate inconsistent values from domains
  -- manipulate a potential solution until it becomes consistent
- unary constraints include one variable (e.g., X ≠ blue )
- binary constraints include two variables (e.g., A > B+3 )
  -- usually can reduce to all binary constraints
- global constraints: e.g., Alldiff (means "all different")
- preference constraints: ( ProfDCB prefers afternoon )
  -- other assignments are consistent, but suboptimal (incur cost)
- resource constraints: Atmost(10, A, B, C, D) (i.e., 10 max)
- bounds: reason using variable domains represented by [lower, upper]
- Examples: map coloring, scheduling, 8 queens, cryptarithmetic, Sudoku
Inference in CSPs by Constraint propagation
- Node consistency: variable's unary constraints satisfied
- Arc consistency: binary constraints satisfied between two variables
  -- e.g., variables X and Y
  -- for every value in the domain of X there's a value in the domain of Y that satisfies constraint
  (i.e., there's potential for a solution!)
  -- larger goal: aim to make whole graph arc consistent by removing domain values that don't give arc consistency
- AC-3 algorithm: if domain of a variable is reduced, then look to see if that affects variables connected to it by constraints!
  -- i.e., the effects are propagated, until failure, or graph is arc consistent.
  -- even if result isn't a solution, it will be much easier to solve! (small domains)
- Path consistency: look at triples of variables.
  -- IF A----B----C is a path, THEN, for every consistent assignment of values to both A and C (consistent with the constraints on both A and on C), there must be an assignment to B that is consistent with the A----B constraints AND the B----C constraints.
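A sketch of AC-3 for binary constraints. Constraints are stored per directed arc as a Python predicate; that representation, the variable names, and the example problem (X < Y < Z over {1, 2, 3}) are assumptions for illustration.

```python
from collections import deque

def ac3(domains, constraints):
    """Prune `domains` in place until every arc is consistent.

    domains:     {variable: set of values}
    constraints: {(x, y): predicate(value_of_x, value_of_y)} for each directed arc
    Returns False if some domain becomes empty, True otherwise.
    """
    queue = deque(constraints)
    while queue:
        x, y = queue.popleft()
        ok = constraints[(x, y)]
        # Values of x with no supporting value in the domain of y.
        removed = {vx for vx in domains[x] if not any(ok(vx, vy) for vy in domains[y])}
        if removed:
            domains[x] -= removed
            if not domains[x]:
                return False                # failure: no value left for x
            # The change to x may affect every other arc that points at x.
            queue.extend((z, w) for (z, w) in constraints if w == x and z != y)
    return True

def less(a, b):
    return a < b

def greater(a, b):
    return a > b

domains = {v: {1, 2, 3} for v in "XYZ"}
constraints = {("X", "Y"): less, ("Y", "X"): greater,
               ("Y", "Z"): less, ("Z", "Y"): greater}
consistent = ac3(domains, constraints)
```

Here propagation alone solves the problem: the domains shrink to X = {1}, Y = {2}, Z = {3}. With domains {1, 2} the same constraints make Y's domain empty and `ac3` reports failure.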

Lecture 10: Constraint Satisfaction Problems 2 (6.3-6.4)

Backtracking search for CSPs
- depth-first search that chooses value for one variable at a time,
  and backtracks when a variable has no legal value left to assign.
  -- backtrack to a choice point on failure.
  -- keeps a single representation of the state and alters it
- Choices?
  -- which variable to assign next?
  -- which order to assign values to that variable?
- Variable choice
  - choose vbl with fewest remaining values
    -- most constrained vbl is more likely to fail soon
    -- 1,000+ times better performance
  - choose vbl that is involved in constraints with largest number of other vbls
    -- most influence
- Value assignment order
  - prefer the value that rules out the fewest values in the closest vbls in the constraint graph
    -- leave max flexibility for subsequent assignments
- Search mixed with inference
  - after choice of value for vbl X do inference (e.g., arc consistency)
  - forward checking: make unassigned neighbors of X arc consistent with X
  - maintaining arc consistency (MAC): do AC-3 on neighbors of X
- Intelligent backtracking on failure
  - normal backtracking is "chronological"
    -- unwind in reverse temporal order
  - improved backtracking is "dependency-based"
    -- unwind to point that contributed to failure
    -- e.g. conflict-directed backjumping
  - no-good: keep track of set of vbls and their values that cause a problem
    -- no-good set gives early warning of failure
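A sketch of backtracking search with the fewest-remaining-values variable choice and forward checking, applied to map coloring. The map data follows the usual Australia example; the data structures and function name are assumptions for illustration, and backtracking here is plain chronological.

```python
NEIGHBOURS = {
    "WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
    "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
    "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": [],
}
COLOURS = ["red", "green", "blue"]

def backtrack(assignment, domains):
    """Return a complete consistent assignment, or None if there is none."""
    if len(assignment) == len(NEIGHBOURS):
        return assignment
    unassigned = [v for v in NEIGHBOURS if v not in assignment]
    # Variable choice: fewest remaining values first.
    var = min(unassigned, key=lambda v: len(domains[v]))
    for value in domains[var]:
        pruned = dict(domains)
        pruned[var] = [value]
        # Forward checking: rule out `value` for each unassigned neighbour.
        for n in NEIGHBOURS[var]:
            if n not in assignment:
                pruned[n] = [c for c in domains[n] if c != value]
        if all(pruned[v] for v in unassigned):       # no domain wiped out
            result = backtrack({**assignment, var: value}, pruned)
            if result is not None:
                return result
    return None                                      # triggers backtracking in the caller

solution = backtrack({}, {v: list(COLOURS) for v in NEIGHBOURS})
```

With three colours a solution exists; with only two, the WA, NT, SA triangle cannot be coloured and the search returns `None` after forward checking empties a domain.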
Min-conflicts
- Local search for CSPs -- uses one state and modifies it
- 8-queens problem
- move randomly chosen conflicted piece
  -- move it to position with least conflicts (min-conflicts)
- works well for hard problems
- works well if there are many solutions in state space
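A sketch of min-conflicts for n-queens with one queen per row. The step limit, the seed, and the random tie-breaking are assumptions; the function returns `None` if it runs out of steps, so a result is not guaranteed for every seed.

```python
import random

def conflicts(cols, row, col):
    """Number of other queens attacking square (row, col); cols[r] is row r's column."""
    return sum(1 for r, c in enumerate(cols)
               if r != row and (c == col or abs(c - col) == abs(r - row)))

def min_conflicts(n=8, max_steps=10_000, seed=0):
    """Local search: repeatedly move a conflicted queen to its least-conflicted column."""
    rng = random.Random(seed)
    cols = [rng.randrange(n) for _ in range(n)]      # random complete assignment
    for _ in range(max_steps):
        conflicted = [r for r in range(n) if conflicts(cols, r, cols[r]) > 0]
        if not conflicted:
            return cols                              # consistent: a solution
        row = rng.choice(conflicted)                 # randomly chosen conflicted queen
        scores = [conflicts(cols, row, c) for c in range(n)]
        best = min(scores)
        cols[row] = rng.choice([c for c in range(n) if scores[c] == best])
    return None

queens = min_conflicts()
```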
Constraint posting
- constraints can record knowledge
- consider vbl X
- reasoning infers constraints
- post a constraint (X > 10)
- post another constraint (X < 12)
- don't decide value for X until you know a lot about it!
- Least Commitment
Conditional CSPs
- configuration problems
- not all variables known in advance (unlike basic CSP!)
- use a part in the config, then add its variables
- i.e., vbls are conditional
- e.g., car config rules
  -- RV means Require Variable
  -- RNV means Require No Variable
- Package="luxury" ==>_RV Sunroof
- Sunroof="type2" ==>_RV Opener
- Type="convertible" ==>_RNV Sunroof

Lecture 11: Logical Agents & Propositional Logic (7.1-7.5, 7.7)

Knowledge-based agents
- reasoning using representations of knowledge
- KB = knowledge base = collection of knowledge
- logic = declarative knowledge representation language
- TELL = agent told new knowledge
- ASK = agent asked what it knows or can "infer"
- axiom: taken as given, as being true
- knowledge level vs. implementation level
Wumpus World
- discrete, static, single-agent, partially observable
- requires reasoning to update world model in order to decide moves
Logic Intro
- allows truth values True and False
- KB has sentences in logic
- syntax = legal structure of sentence
- semantics = meaning of sentence given "possible world"
- model = possible world
- a sentence is true in some models and false in others
- model m makes sentence a true ≡ m satisfies a
- a entails b: b follows logically from a: a |= b
- iff in every model in which a is true, b is also true
  i.e., M(a) ⊆ M(b)
- logical inference uses logic to provide answers (e.g., about s)
- model checking = enumerating all possible models
  to see if for all models in which KB is true, s is true
  M(KB) ⊆ M(s)
  KB |= s
- Inference: finding if something follows from what you know
- lots of things are entailed by the KB, inference is looking for one particular one.
- |-_i = inference using algorithm i
- KB |-_i s = s can be derived from the KB
- a "sound" inference algorithm is truth preserving
- model checking is sound
- a "complete" inference algorithm can produce any sentence that is entailed
  -- i.e., anything that follows logically
- if KB is true in the real world, then any sentence a derived from KB by a sound inference procedure is also true in the real world.
- grounding: connecting the logical reasoning with the agent's real world
- the agent's sensors create the connection
Propositional Logic
- propositional symbols: each stands for a proposition (true or false)
- connectives: 'not' (negation), 'and' (conjunction), 'or' (disjunction),
  'implies' (implication/if-then), 'iff' (if-and-only-if/biconditional/equivalence)
- operator precedence
- a model determines a truth value for every propositional symbol
- semantics: how to compute truth value for any sentence
- rules for evaluating truth of the 5 connectives
- note TFF for (P ⇒ Q) implication, and F implies anything
- truth tables: every assignment of T/F to propositions
- KB is set of propositions saying when they're true
  e.g., P_x,y is true if there's a pit in location [x,y]
- KB includes sentences about propositions
  e.g., ¬B_1,1
- simple inference: model checking for KB |-_i s
  -- check all assignments of T/F to propositions
  -- find assignments where KB is true (all sentences are true)
  -- look for how s is assigned.
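A sketch of entailment by model checking: enumerate every assignment of T/F to the symbols and confirm the query holds wherever the KB holds. Representing sentences as Python functions of a model, and the symbol names `B11`, `P12`, `P21`, are assumptions for this example.

```python
from itertools import product

def entails(kb, query, symbols):
    """True iff `query` is true in every model in which `kb` is true."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if kb(model) and not query(model):
            return False                 # found a model of KB where the query fails
    return True

SYMBOLS = ["B11", "P12", "P21"]

def kb(m):
    """No breeze in [1,1], and a breeze in [1,1] iff a pit in [1,2] or [2,1]."""
    return (not m["B11"]) and (m["B11"] == (m["P12"] or m["P21"]))

def no_pit_12(m):
    return not m["P12"]

def pit_21(m):
    return m["P21"]
```

`entails(kb, no_pit_12, SYMBOLS)` is `True`, while `entails(kb, pit_21, SYMBOLS)` is `False`. The loop visits 2^n models for n symbols, which is why this only suits small KBs.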
Propositional Theorem Proving
- theorem proving = applying rules of inference to KB to try to show what we want
- logical equivalence = true in same set of models [e.g., ¬(¬P) ≡ P ]
- valid sentence = tautology = true in all models [e.g., P v ¬P ]
- satisfiable sentence = true in some model
- P is valid iff ¬P is unsatisfiable
  i.e., if there are no models that satisfy ¬P
- KB |= b iff (KB ∧ ¬b) is unsatisfiable
  - e.g., to show b assume b to be false and add ¬b to the KB
    i.e., KB ∧ ¬b
  - then try, by inference, to show this causes a contradiction
  - if there's a contradiction then b must in fact follow from KB
  - known as proof by "refutation"
Inference and Proofs
- inference rules can be used in sequence in a proof
- Modus Ponens: given a and (a ⇒ b) then b can be inferred
- And-Elimination: given (a ∧ b) infer a
- all the logical equivalences can be used as inference rules, as they preserve truth
  e.g., ¬(¬P) ≡ P
- monotonicity: set of entailed sentences only grows as more are added to the KB
- inference rules might apply to anything in the KB (control needed)
Proof by Resolution
- Resolution: an inference rule
- works on clauses: disjunction of literals
  e.g., P ∨ Q ∨ ¬R
- (a ∨ b) resolves with (¬a ∨ c) giving (b ∨ c)
- removes the complementary literals (a, ¬a)
- result has all of the other literals
- remove duplicated literals
- Resolution uses Conjunctive Normal Form (CNF)
- e.g., <clause> ∧ <clause> ∧ <clause>
- can convert any propositional logic sentence to CNF
- If you're trying to prove a
  - 1. convert (KB ∧ ¬a) into CNF
  - 2. use resolution inference rule on the resulting clauses
  - 3. if a resolvent is empty then we have a contradiction, and a is proved.
  - 4. if no new clauses result then the proof ends.
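A sketch of steps 2 to 4 above, assuming the conversion to CNF has already been done by hand. Clauses are frozensets of string literals with "~" marking negation; that encoding and the function names are assumptions for illustration.

```python
from itertools import combinations

def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit

def resolve(c1, c2):
    """All resolvents of two clauses: drop one complementary pair, keep the rest."""
    return [frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
            for lit in c1 if negate(lit) in c2]

def resolution_refutes(clauses):
    """True if the empty clause is derivable, i.e. the clause set is unsatisfiable."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:
                    return True          # empty clause: contradiction found
                new.add(resolvent)
        if new <= clauses:
            return False                 # no new clauses: the proof attempt ends
        clauses |= new

# KB: (P => Q) i.e. (~P v Q), and P.  To prove Q, add ~Q.
PROOF_OF_Q = [frozenset({"~P", "Q"}), frozenset({"P"}), frozenset({"~Q"})]
```

`resolution_refutes(PROOF_OF_Q)` is `True`, so Q follows from the KB. Dropping the clause for P leaves a satisfiable set, and the function returns `False`. Using sets means duplicated literals are removed automatically.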
Using Horn clauses
- Horn clause: disjunction of literals, with at most one positive
  e.g., P ∨ ¬Q ∨ ¬R
- resolution on Horn clauses produces Horn clauses
- Horn clauses can be written as implications (nicer to read/write)
  e.g., (a ⇒ b) ≡ (¬a ∨ b)
- normal form is A ∧ B ⇒ C
- proofs controlled by forward-chaining or backward-chaining search strategies
- AND-OR graph
- forward: (data-driven) starts from known facts (positive literals) and works forwards by inferences until the query is found.
  e.g., if you want to prove C, given A and also B, then use (A ∧ B ⇒ C) to provide C.
- backward: (goal-directed) starts from query and works back trying to show that all the things that lead to the query can be inferred.
  e.g., if you want to prove C, and (A ∧ B ⇒ C), then prove both A and also B.
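A sketch of forward chaining over propositional Horn clauses written as (premises, conclusion) pairs. The rule set and function name are assumptions chosen for illustration.

```python
def forward_chain(rules, facts, query):
    """Data-driven inference: fire rules whose premises are all known until
    the query is derived or nothing new can be added."""
    known = set(facts)
    changed = True
    while changed and query not in known:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)    # Modus Ponens on a Horn clause
                changed = True
    return query in known

# Hypothetical KB: each pair reads "premises => conclusion".
RULES = [
    (["P"], "Q"),
    (["L", "M"], "P"),
    (["B", "L"], "M"),
    (["A", "P"], "L"),
    (["A", "B"], "L"),
]
```

From facts A and B the chain is L, then M, then P, then Q, so `forward_chain(RULES, ["A", "B"], "Q")` is `True`. With only A known, no rule can fire and the result is `False`.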
Agents based on Propositional logic (brief summary)
- problem: percepts (e.g., Stench) only apply at a particular time
- adding ¬Stench to a KB that already contains Stench gives contradiction!
- fluent: something that changes
- need to state what changes and what doesn't for each action
- this is known as the "frame problem"
- hard to deal with in propositional logic as there are only symbols
- we can make symbols Stench¹ and Stench² etc to show different times
  N.B., the superscript is part of the symbol and has no influence in the logic.

Lecture 12: First Order Logic (8.1-8.3, 8.4)

Representation revisited
- Propositional logic - facts
- First Order Logic - facts, objects and relations
- can include variables
- includes statements about some or all (quantifiers)
- FOL assumes world with objects and relations
- true or false or unknown
- standard syntax -- "syntactic sugar" provides allowed variants
Syntax & Semantics
- models contain objects (Richard), relations (brother-of), properties (king), functions (left leg)
- syntactic elements in the language are symbols
- constant symbols (Richard) stand for objects
- predicate symbols (Brother) stand for relations
- function symbols (LeftLeg) stand for functions
- interpretation specifies exactly what in the model symbols refer to
- terms refer to objects - e.g., Richard, or LeftLeg(Richard)
- atomic sentences = facts - e.g., Brother(Richard, John)
- logical connectives
- Quantifiers -- 'for all' ∀ and 'there exists' ∃ -- use variables
- ∀x King(x) ⇒ Person(x) --- note TFF
- ∃x Crown(x) ∧ OnHead(x, John)
- quantifier order matters
- ∀x ∃y Loves(x, y)
- ∃y ∀x Loves(x, y)
- use different vbl names for each quantifier
- ∃ and ∀ are related by ≡ rules -- how?
- equality: two terms refer to same object --- e.g., Father(John) = Henry
- alternative semantics
- unique-names assumption -- every constant refers to distinct object
- closed-world assumption -- if we don't know it's true, it's false
- domain closure -- # domain elements = # constant symbols
Using FOL
- TELL -- add "assertions" to KB
- ASK queries -- can retrieve directly or infer
- ASKVARS gives vbl bindings/substitutions for the answer
  e.g., ASKVARS(KB, Person(x)) gives {x/John} and also {x/Richard}
- theorems are derived from axioms (i.e., from basic factual info and definitions)
- theorems can be used in inference too
- unlike Propositional logic, can make statements about any time
  e.g., ∀t HaveArrow(t + 1) ⇔ (HaveArrow(t) ∧ ¬Action(Shoot, t))
Knowledge Engineering in FOL
- knowledge engineering = KB construction for task/domain
  1. Identify task: what needs to be represented
  2. Assemble relevant knowledge: knowledge acquisition
  3. Decide on vocabulary: predicates, functions and constants
    i.e., define the Ontology
  4. Encode general knowledge about domain
  5. Encode specific problem instance (e.g., info from sensors)
  6. Pose queries and get answers (ASK)
  7. Debug the KB (and individual sentences)

Lecture 13: Inference in First Order Logic (9.1-9.5)

Propositional vs First Order Inference
- simple inefficient approach: convert FOL to propositional logic then do inference
- remove quantifiers and variables
- ∀ -- if possible do Universal Instantiation (substitute variables with ground terms)
- ∃ -- pick a Skolem constant to stand for the thing that exists.
- typically generates lots of sentences, many irrelevant
Unification
- for FOL use Generalized Modus Ponens (MP)
- find substitutions for variables that makes regular MP useable
- Generalized MP is MP "lifted" to apply to variables
- unification = finding substitutions that make different logical expressions look identical
  e.g., UNIFY(Knows(John,x), Knows(y,Bill)) = {x/Bill, y/John}
- after unification then a P with vbls matches the P in (P ⇒ Q) allowing MP
  
  Note: skip section about making retrieval more efficient
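A sketch of unification over simple terms. The encoding is an assumption made for this example: compound terms are tuples such as `("Knows", "John", "x")`, strings starting with a lowercase letter are variables, and other strings are constants or predicate names. An occurs check is included.

```python
def is_var(t):
    return isinstance(t, str) and t[:1].islower()

def walk(t, subst):
    """Follow variable bindings until reaching an unbound variable or a non-variable."""
    while is_var(t) and t in subst:
        t = subst[t]
    return t

def occurs(var, t, subst):
    """True if `var` appears inside term `t` (prevents bindings like x/F(x))."""
    t = walk(t, subst)
    if t == var:
        return True
    return isinstance(t, tuple) and any(occurs(var, a, subst) for a in t)

def unify(x, y, subst=None):
    """Return a substitution (dict) that makes x and y identical, or None."""
    subst = {} if subst is None else subst
    x, y = walk(x, subst), walk(y, subst)
    if x == y:
        return subst
    if is_var(x):
        return None if occurs(x, y, subst) else {**subst, x: y}
    if is_var(y):
        return None if occurs(y, x, subst) else {**subst, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):
            subst = unify(a, b, subst)
            if subst is None:
                return None
        return subst
    return None                          # mismatched constants or shapes
```

`unify(("Knows", "John", "x"), ("Knows", "y", "Bill"))` returns `{"y": "John", "x": "Bill"}`, matching the example above. `unify(("Knows", "John", "x"), ("Knows", "x", "Elizabeth"))` returns `None`, because x cannot be both John and Elizabeth unless the variables are first renamed apart.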
Forward Chaining
- useful for Situation ⇒ Response systems (rules)
- use definite clauses: disjunctions of literals with exactly one positive
- perfect for sentences such as: King(x) ∧ Greedy(x) ⇒ Evil(x)
  which converts into a definite clause
- algorithm: start from known facts, use all rules whose premises are satisfied, and add the conclusions to the known facts, and repeat until query answered.
- sound and complete
- may not be efficient
- incremental forward chaining: every new fact inferred in iteration t must be derived from at least one new fact inferred in iteration t-1.
Backward Chaining
- works backwards from goal query
  from conclusions back to premises
- uses definite clauses
- needs to keep track of accumulated substitutions
- can be done by depth-first search
- AND-OR tree
- used in Logic Programming (e.g., Prolog)
  
  Note: skip section 9.4.3-9.4.6
Resolution
- Every sentence of FOL can be converted into an inferentially equivalent Conjunctive Normal Form (CNF) sentence
  i.e., a conjunction of clauses, with each clause being a disjunction of literals:
  clause e.g., ¬American(x) ∨ ¬Weapon(y) ∨ ¬Hostile(z) ∨ ¬Sells(x,y,z) ∨ Criminal(x)
- to convert to CNF
  1. eliminate implications
  2. move ¬ inwards
  3. standardize variables
  4. Skolemize to remove existential quantifiers
  5. drop universal quantifiers
  6. distribute ∨ over ∧
  7. result is a set of clauses connected by ∧
- resolution inference:
  1. take two clauses with complementary literals
  2. find a substitution that allows one to "cancel out the other"
  3. what's left over, with the substitution, forms the resolvent clause
- resolution proof: prove KB |= a by proving that (KB ∧ ¬a) is unsatisfiable,
  by deriving the empty clause.
- each resolution step adds a new clause to the KB (increasing in size)
  
  Note: skip section 9.5.4-9.5.5
- Resolution Strategies: resolution needs guidance about which clauses to try to resolve
  - Unit preference: prefer resolutions in which one of the clauses is a single literal (a unit clause), since the resolvent is shorter than the other clause
  - Set of Support: always use a member of a predetermined set of clauses in each resolution step (e.g., initially use negated query -- add every resolvent to the set of support)
  - Input Resolution: every resolution step uses at least one clause from the KB or the query
  - Subsumption: eliminate all sentences that are more specific than (subsumed by) something already in the KB

Lecture 14: Classical Planning (10-10.3, 10.4.4, 11.1-11.2.2)

Definition
- devising a plan of action to achieve one's goals
- world is represented by a collection of variables
- a search problem: initial state; actions available; result of acting; goal test.
- state: a conjunction of fluents (with no variables)
- closed world assumption
- unique names assumption
- Action: defined using an action schema using vbls (represents a set of specific actions)
  e.g., Fly: fly from Boston to SF, fly from Austin to NYC, ...
- actions only mention preconditions and effects
- preconditions must be true in order to do the action
- effects: delete list (no longer true) & add list (new fluents)
  e.g., ¬At(p,from) ∧ At(p,to)
- initial state: a specific state description
  e.g., At(C1,SFO) ∧ At(C2,JFK) ∧ ...
- goal: a conjunction of literals that may contain vbls
  e.g., At(C1,JFK) ∧ At(C2,SFO)
- note that actions may have costs, or the count of actions could be used if we assume equal costs.
Planning as state-space search
- Forward state-space search (progression)
  -- start from initial state and apply actions until goal is found
  -- strong domain-independent heuristic needed, and available
  -- most planning systems use forward search
- Backward relevant-states search (regression)
  -- start from goal and apply relevant actions backwards until initial state found
  -- select actions that could contribute to the goal, but don't negate an element of the goal
  -- previous state is current state without the add list and including the preconditions
- heuristics for planning
  - try to find a relaxed problem
    -- ignore all preconditions
    -- ignore some preconditions
    -- ignore delete lists
    -- ignore some fluents
  - use decomposition
    -- assume independent subgoals, solve separately, combine costs
  - use pattern databases
    -- stored cost for problems with particular pattern in them
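A small sketch of forward (progression) search over ground actions with preconditions, add lists and delete lists. The names Action, fly and forward_search are hypothetical helpers for this sketch, fluents are plain strings, and breadth-first search stands in for the heuristic search a real planner would use.

```python
from collections import deque, namedtuple

# A ground action: name plus three sets of fluents.
Action = namedtuple("Action", "name precond add delete")

def applicable(state, action):
    # Preconditions must all hold in the current state.
    return action.precond <= state

def result(state, action):
    # Remove the delete list, then add the add list.
    return (state - action.delete) | action.add

def forward_search(initial, goal, actions):
    """Breadth-first progression search; returns a list of action names or None."""
    start = frozenset(initial)
    goal = frozenset(goal)
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:
            return plan
        for a in actions:
            if applicable(state, a):
                nxt = result(state, a)
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None

def fly(p, frm, to):
    # One ground instance of a Fly(p, from, to) action schema.
    return Action(f"Fly({p},{frm},{to})",
                  precond=frozenset({f"At({p},{frm})"}),
                  add=frozenset({f"At({p},{to})"}),
                  delete=frozenset({f"At({p},{frm})"}))

airports = ("SFO", "JFK")
actions = [fly(p, a, b) for p in ("P1", "P2")
           for a in airports for b in airports if a != b]
print(forward_search({"At(P1,SFO)", "At(P2,JFK)"},
                     {"At(P1,JFK)", "At(P2,SFO)"}, actions))
```

Because every action has the same cost here, breadth-first search returns a plan with the fewest actions.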
Planning graph
- can give a better heuristic estimate for guiding planning search
- graph can be used to estimate how many steps to reach goal
- GraphPlan: extract plan from searching in the planning graph
- for propositions only (no variables)
- connects possible states with possible actions
- S₀, A₀, S₁, A₁, ...
- S_i is all the literals that could hold at time i, depending on the actions taken in prior steps.
- A_i is all the actions that could be taken from S_i including "persistence" (i.e., no change / no-op action).
- build new S levels with actions between until there's no change in the literals included (levelled off)
- planning graph isn't too costly to construct
- can extract plan as a backward search once all literals from the goal are present in some S level and they aren't marked as mutually exclusive.
- Mutex links = mutual exclusion
  i.e., things that can't exist together
  e.g., Have(Cake) with ¬Have(Cake)
  e.g., Have(Cake) with Eaten(Cake)
  e.g., Bake(Cake) with Eat(Cake) (i.e., actions have conflicting prereqs)
- Mutex between actions too: Inconsistent effects; Interferences; Competing needs.
- if any goal literal is not in final S_i level then problem is not solvable
- heuristic: can estimate the cost of achieving any goal literal by what level of the graph it first appears (level cost)
- heuristics: for goal with conjunction of literals, try sum of level costs
Partial Order Planning
- totally ordered plans: linear sequence of actions
- partial order plans: actions with ordering constraints
  e.g., add liquid to flour BEFORE whisk together
- find flaw in plan at each stage and suggest an action to add to fix it
- use "least commitment" to fix flaw
- build partial order plan
- backtrack if necessary
- can combine with libraries of high-level plans
Schedules
- include how long an action takes, and when it should occur
- plan first and schedule later
- can also have resource constraints
  e.g., there is only one engine hoist
  that's important as plan is partial order plan
- resources reusable or consumable
- duration of plan used as cost function
- actions have durations, and earliest & latest start times
- slack: range of start times
- CPM: Critical Path Method
- critical path is the one whose duration is longest
  whole plan can't be shorter
- from start, can look at earliest start for each action in a path
- from end, can look at latest start for each action in a path
- order constraints impose possible actual start times
- resource constraints add additional restrictions
  e.g., actions using the one hoist can't overlap
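A sketch of the Critical Path Method described above: a forward pass gives earliest start times, a backward pass gives latest start times, and slack is the difference. The function name critical_path and the job durations are illustrative assumptions; resource constraints are ignored and the ordering constraints are assumed to be acyclic.

```python
def critical_path(durations, before):
    """durations: {action: time}; before: list of (a, b), a must finish before b starts."""
    preds = {a: [] for a in durations}
    succs = {a: [] for a in durations}
    for a, b in before:
        preds[b].append(a)
        succs[a].append(b)

    # Topological order so predecessors are processed first.
    indeg = {a: len(preds[a]) for a in durations}
    ready = [a for a in durations if indeg[a] == 0]
    order = []
    while ready:
        a = ready.pop()
        order.append(a)
        for b in succs[a]:
            indeg[b] -= 1
            if indeg[b] == 0:
                ready.append(b)

    # Forward pass: earliest start = latest finish among predecessors.
    es = {}
    for a in order:
        es[a] = max((es[p] + durations[p] for p in preds[a]), default=0)
    makespan = max(es[a] + durations[a] for a in durations)

    # Backward pass: latest start that still lets every successor start on time.
    ls = {}
    for a in reversed(order):
        ls[a] = min((ls[s] for s in succs[a]), default=makespan) - durations[a]

    slack = {a: ls[a] - es[a] for a in durations}
    return es, ls, slack, makespan

durations = {"AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
             "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10}
before = [("AddEngine1", "AddWheels1"), ("AddWheels1", "Inspect1"),
          ("AddEngine2", "AddWheels2"), ("AddWheels2", "Inspect2")]
es, ls, slack, makespan = critical_path(durations, before)
print(makespan, slack)
```

Actions with zero slack lie on the critical path; in this example those are the three actions of the second job.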
Hierarchical Planning (Hierarchical Task Networks)
- humans plan using high-level actions (HLAs) first
  e.g., get to airport, fly, drive to destination
  i.e., HLA + HLA + HLA
- hierarchical decomposition: higher = more abstract; lower = more concrete
- each HLA has one or more "refinements"
- a refinement is a more concrete sequence of actions (either HLAs or primitive actions)
- can refine plans recursively down to primitives
- at least one of the fully refined plans must achieve the goal
- can use a plan library of refinements
- a lot of knowledge about refinements can be encoded
- planner effectively searches the space of plan refinements
- it can be done breadth first.

Lecture 15: Knowledge Representation (8.4, 12.1-12.6)

"knowledge is power"
how many types of knowledge representation have we seen so far?
Ontological Engineering
- ontology: those concepts that exist and can be reasoned about in the world
- general concepts: events, time, physical objects, beliefs
- Ontological Engineering: representing these concepts
- Upper Ontology (e.g., SUMO) (Adam Pease, WPI, BS&MS)
- add more details down to specific levels (e.g., Wumpus)
- all upper level details (axioms) must still be relevant at lower levels (apart from exceptions)
- ontologies produced by:
  - a team of ontologists/logicians
  - importing categories, attributes and values from databases
  - extracting information from text documents automagically
  - doing it wiki style with open access
Categories and Objects
- category knowledge is vital
  e.g., supports recognition and also prediction
- use Basketball(b) or "reify" it to Basketballs
- subclass and member relations
- subclasses form a taxonomy (e.g., plants)
- Basketballs ⊂ Balls
- BB9 ∈ Basketballs
- for categories assume ∀
- (x ∈ Basketballs) ⇒ Spherical(x)
- Orange(x) ∧ Round(x) ∧ Diameter(x) = 9.5 ∧ x ∈ Balls ⇒ x ∈ Basketballs
- Males and Females are subclasses of Animals
- they are an exhaustive decomposition
- they are disjoint (no members in common)
- can define categories
  x ∈ Bachelors ⇔ Unmarried(x) ∧ x ∈ Adults ∧ x ∈ Males
- natural kinds: most real-world categories have no clear-cut definitions
  e.g., games, tomatoes, chairs, ...
  ... think of a definition based on an example, think of a counter-example!
- Physical decomposition also needs to be represented
- Part-of hierarchies
- tricky! is "cheek is part-of face" the same as "wheel is part-of car"?
- composite objects have structural relationships between parts: e.g., Attached(x,y)
- bunch: objects with definite parts but no structure
  BunchOf(Apples)
- Measurements: uses measure objects
  Length(L1) = Inches(1.5) = Centimeters(3.81)
- some things don't have a scale (e.g., beauty), but still can use
  Beauty(Rose1) > Beauty(Weed1)
- Stuff -- part of stuff is stuff (e.g., butter)
- intrinsic properties: belonging to the substance of the object
  e.g., color, flavor, ownership, ...
- extrinsic properties: belonging to the object
  e.g., length, shape, weight, ...
- a category that includes only intrinsic properties is a substance
- what is half of a pile of sand?
Events
- events are actions based on points in time
- fluent: may change over time -- At(DCB, Office)
- assert that it's true -- T(At(DCB, Office))
- events take place over a time interval
  Happens(e,i) where i = (t1, t2)
- events can make fluents become true or false at some time
  Terminates(e,f,t) --- event e causes fluent f to cease to hold at time t
- Processes: actions where any part of the action is still the same type
- sorta like "stuff" for objects
- e.g., Flyings
- Time intervals: moments (zero duration) and extended intervals
- predicates for time intervals
  - Meet(i,j) ⇔ End(i) = Begin(j)
  - Before(i,j) ⇔ End(i) < Begin(j)
  - After(i,j)
  - During(i,j)
  - Overlap(i,j)
  - Begins(i,j)
  - Finishes(i,j)
  - Equals(i,j)
- Fluents and objects -- an object is a chunk of space-time!
- President(USA) denotes a single object that consists of different people at different times!
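The time-interval predicates listed above can be sketched directly as comparisons on endpoints. Only Meet and Before are defined in these notes; the other definitions below are assumptions that follow the usual interval-algebra reading, with an interval represented as a (begin, end) pair where begin < end.

```python
def begin(i):
    return i[0]

def end(i):
    return i[1]

def meet(i, j):
    return end(i) == begin(j)

def before(i, j):
    return end(i) < begin(j)

def after(j, i):
    # After(j, i) holds exactly when Before(i, j) holds.
    return before(i, j)

def during(i, j):
    # i lies strictly inside j.
    return begin(j) < begin(i) and end(i) < end(j)

def overlap(i, j):
    # i starts first and ends while j is still in progress.
    return begin(i) < begin(j) < end(i) < end(j)

def begins(i, j):
    return begin(i) == begin(j)

def finishes(i, j):
    return end(i) == end(j)

def equals(i, j):
    return begins(i, j) and finishes(i, j)

print(meet((1, 3), (3, 5)), before((1, 2), (3, 5)), overlap((1, 4), (3, 6)))
```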
Mental events and objects
- agents need statements about beliefs (mental objects)
- propositional attitudes: believes, knows, wants, intends, informs
- need Modal logic: include qualifications of a statement, such as "usual", "possible", "necessary", "impossible", "always", "believed", ...
- K_AP means "A knows P"
- can make statements about one agent's knowledge about another's knowledge
  e.g., K_A[K_BP]
  i.e., A knows that B knows
- K_AP ⇒ K_A(K_AP)
  i.e., if they know something then they know that they know it
- need complicated (!) collection of "possible worlds" to figure out the semantics.
Reasoning with categories
- semantic networks: graphical way of representing knowledge + inference
- most semantic networks have an underlying logic
- distinguish between categories and individuals
  MalePersons vs. John
  SubsetOf vs. MemberOf
- inheritance: properties of categories flow down to subcategories
- multiple inheritance: MemberOf(tux,Penguins), MemberOf(tux,Birds), does tux fly?
- semantic nets allow "default" values
  these can be overridden by specified values in subcategories
- description logics: logics tuned to categories and for deciding relationships between them
- subsumption: checking if one category is a subset of another by checking definitions
- classification: checking whether an object belongs to a category
- consistency: checking if category definition is logically satisfiable
- dl language is intended to be easier to write than FOL
- but description logics typically lack negation and disjunction
- dl emphasises tractability of inference
- And[Man, AtLeast(3, Son), AtMost(2, Daughter),
  All(Son, And(Unemployed, Married, All(Spouse, Doctor))),
  All(Daughter, And(Professor, Fills(Department, Physics, Math)))]
Default information
- example of default knowledge?
- monotonic: new statements produced by inference are added to KB (nothing already inferred is ever retracted)
- nonmonotonic: override inherited properties: e.g., with Legs(John,1)
- new evidence can override default statement (can't have both 1 and 2 legs!)
- nonmonotonic logics: "circumscription", and "default logic"
- circumscription: add circumscribed predicates
  e.g., Bird(x) ∧ ¬Abnormal(x) ⇒ Flies(x)
- assume ¬Abnormal(x) unless Abnormal(x) is declared to be true
- default logic: includes default rules
- Bird(x) : Flies(x) / Flies(x)
  if prereq Bird(x) is true, and justification Flies(x) is consistent with KB, then conclude Flies(x)
- Nixon-diamond semantic net example
- Truth Maintenance: retracting facts as needed (belief revision)
- suppose P had been assumed by default, but ¬P is found
- need to retract P and assert ¬P, but also retract all sentences inferred from P!
- JTMS: justification-based truth maintenance
- annotate each sentence in KB with justification
  sentences from which it was inferred
- allows sentences with multiple justifications not to be retracted
- sentences without justification are marked as out (not deleted), allowing efficient future changes
- ATMS: assumption-based TMS
  keeps track of all the assumptions that would cause a sentence to be true.

Lecture 16: Quantifying uncertainty (13.1-13.3)

Acting under uncertainty
- Intro...
- uncertainty due to partial observability, nondeterminism
- uncertainty due to Laziness, Theoretical Ignorance, Practical Ignorance.
- belief state: set of all possible worlds the uncertain agent might be in
- Summarizing uncertainty...
- the connection between effect and cause is not a logical consequence, but is affected by degree of belief (probability)
- probability summarizes uncertainty
- probability statements made wrt knowledge states (what's known)
- Uncertainty and rational decisions...
- agents prefer some outcomes over others
- utility: quality of being useful (preferences)
- basic idea: if it is highly probable and highly useful, that's good!
- Decision Theory = Probability Theory + Utility Theory
- Principle of maximum expected utility
  Agent is "rational" iff it chooses the action that yields the highest expected utility, averaged over all the possible outcomes.
Basic Notation
- what probabilities are about...
- sample space: set of all possible worlds
  mutually exclusive & exhaustive
  e.g., set of all rolls from a pair of dice (1,1),(1,2),...,(6,6)
- probability model: numerical probability with each possible world (0 to 1)
- pair of dice: P(Total=11) = P((5,6)) + P((6,5)) = 1/36 + 1/36 = 1/18 (an unconditional probability)
- P(doubles) = 6/36 = 1/6
- P(cavity) = 0.2
- unconditional P, or prior P (i.e., there's no other evidence)
- if first die is 5, P(doubles | Die1 = 5) = ??
- conditional P, or posterior P (i.e., it depends on other evidence)
  e.g., P(cavity | toothache) = 0.6
  P(cavity | toothache ∧ ¬cavity) = 0
- product rule: P(a ∧ b) = P(a | b) P(b)
- the language of propositions (probability assertions)...
- random variable: variables in probability theory e.g., Weather, Cavity, Toothache
- each random variable has a domain of values
  e.g., Weather has {sunny, rain, cloudy, snow}
- can write "sunny" for Weather = sunny
- P(Weather) = < 0.6, 0.1, 0.29, 0.01 >
  stands for P(Weather=sunny) = 0.6, P(Weather=rain) = 0.1, P(Weather=cloudy) = 0.29, P(Weather=snow) = 0.01
- probabilities sum to 1.
- the P statement defines a "probability distribution" for the single variable Weather (here, as a vector)
- joint probability distribution: P(Weather, Cavity)
  includes some of the random variables
- this is a 4 * 2 table of probability values
  {sunny, rain, cloudy, snow}, {cavity, ¬cavity}
- P(sunny, Cavity) is 2 element vector
  sunny with cavity, sunny with no cavity
- P(sunny, cavity) is a 1 element vector
- full joint probability distribution
  includes all of the random variables
  e.g., P(Weather, Toothache, Cavity)
- a possible world is an assignment of values to all the variables under consideration
  e.g., 4 * 2 possible worlds for vbls Weather and Cavity
- skip probability axioms and their reasonableness...
- where do probabilities come from...
- different views
  - frequentist: from experiments, observed samples
  - objectivist: probabilities are real aspects of the universe
  - subjectivist: a way of characterizing an agent's belief, without external physical significance
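A short sketch of the pair-of-dice sample space used above, assuming fair dice so that every possible world has the same probability. The helper names prob and cond_prob are made up for this sketch; exact fractions avoid rounding.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered rolls (die1, die2), mutually exclusive and exhaustive.
worlds = list(product(range(1, 7), repeat=2))
p_world = Fraction(1, len(worlds))  # fair dice assumed

def prob(event):
    """Unconditional probability: add up the worlds in which the event holds."""
    return sum(p_world for w in worlds if event(w))

def cond_prob(event, given):
    """Conditional probability from the product rule: P(a | b) = P(a and b) / P(b)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)

total_11 = lambda w: w[0] + w[1] == 11
doubles = lambda w: w[0] == w[1]
die1_is_5 = lambda w: w[0] == 5

print(prob(total_11), prob(doubles), cond_prob(doubles, die1_is_5))
```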
Inference using Full Joint Distributions
- full joint distribution for Toothache, Catch, Cavity (sum to 1)
- look at worlds where proposition is true and add their probabilities
- marginal probability: use a subset of the variables
  i.e., cavity in all of the 4 situations of the 2 other vbls.
- marginalization: sum up all values over the other variables
  P(Cavity) = sum of P(Cavity, z), over z, where z is {Catch, Toothache}
- similarly for conditional probabilities (conditioning)
- usually want to compute conditional probabilities
  i.e., use the effect of evidence
- P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
  from product rule
- P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
- view 1/P(toothache) as a "normalization factor" = α
  without knowing value of P(toothache)
- P(Cavity | toothache) = α P(Cavity, toothache)
- = α[P(Cavity, toothache, catch) +P(Cavity, toothache, ¬catch)]
- but you need full joint distribution to answer, so it doesn't scale :-(
- in general P(X | e) = αP(X, e) = α ∑_y P(X, e, y)
  where e is all the evidence, y is all possible combinations of values from the unobserved vbls.
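A sketch of inference from a full joint distribution by summing matching worlds and normalizing. The eight table entries are illustrative values assumed for this sketch (they sum to 1 and give P(cavity | toothache) = 0.6); prob and posterior are hypothetical helper names.

```python
# Full joint over (toothache, catch, cavity); values assumed for illustration.
VARS = ("toothache", "catch", "cavity")
joint = {
    (True, True, True): 0.108, (True, False, True): 0.012,
    (True, True, False): 0.016, (True, False, False): 0.064,
    (False, True, True): 0.072, (False, False, True): 0.008,
    (False, True, False): 0.144, (False, False, False): 0.576,
}

def prob(**fixed):
    """Marginal probability: sum over all worlds consistent with the fixed values."""
    return sum(p for world, p in joint.items()
               if all(world[VARS.index(k)] == v for k, v in fixed.items()))

def posterior(query, **evidence):
    """P(query | evidence) as {value: probability}, using the normalization factor alpha."""
    unnorm = {v: prob(**{query: v}, **evidence) for v in (True, False)}
    alpha = 1 / sum(unnorm.values())
    return {v: alpha * p for v, p in unnorm.items()}

print(prob(cavity=True))
print(posterior("cavity", toothache=True))
```

The table has 2^n entries for n Boolean variables, which is why this approach does not scale.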

Lecture 17: Uncertainty & Bayes (13.4-13.5)

Independence
- some variables have no influence on others
  e.g., evidence about toothache, catch and cavity have no influence on cloudiness (they're independent)
  i.e., P(cloudy | toothache, catch, cavity) = P(cloudy)
- if independent: P(a | b) = P(a), or P(b | a) = P(b), or P(a ∧ b) = P(a)P(b)
- can generalize for P (probability distributions)
- it factors large joint distributions into smaller ones.
- nice but often hard to find.
Bayes' Rule
- Rule: P(b|a) = P(a|b)P(b) / P(a)
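- as a set of equations with background evidence e
  P(Y | X, e) = P(X | Y, e) P(Y | e) / P(X | e)
  where e could be toothache and catch
- Applying Bayes' rule: the simplest case...
- Best thought of as
  P(cause | effect) = P(effect | cause) P(cause) / P(effect)
  with e.g., effect = symptom, cause = disease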
- diagnosis problem: given a symptom what is the disease?
- uses causal knowledge -- what things cause what effects
- Using Bayes' rule: combining evidence...
- Toothache and Catch are probably dependent
- If there's a Cavity, then Cavity can cause Toothache, and Cavity can cause Catch, but neither has a direct effect on the other.
- i.e., in the presence of Cavity, Toothache and Catch can be considered independent
- called "conditional independence"
- P(toothache∧catch | Cavity) = P(toothache|Cavity) P(catch|Cavity)
- to decompose a full joint distribution, using conditional independence
  giving three smaller tables
- this allows probabilistic systems to scale up.
- in general
  P(Cause, Effect₁,...,Effect_n) = P(Cause) * Π P(Effect_i | Cause)
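A sketch of combining evidence with the factorization above: multiply the prior on the cause by one conditional per observed effect, then normalize. The function name posterior_cause and the numbers are assumptions made for illustration.

```python
def posterior_cause(prior, likelihoods, observed):
    """
    prior:       {cause_value: P(Cause = cause_value)}
    likelihoods: {effect: {cause_value: P(effect = true | Cause = cause_value)}}
    observed:    {effect: True/False}
    Returns P(Cause | observed effects), assuming the effects are
    conditionally independent given the cause.
    """
    unnorm = {}
    for c, p in prior.items():
        for effect, value in observed.items():
            p_true = likelihoods[effect][c]
            p *= p_true if value else 1 - p_true
        unnorm[c] = p
    alpha = 1 / sum(unnorm.values())  # normalization factor
    return {c: alpha * p for c, p in unnorm.items()}

# Illustrative (assumed) numbers for Cavity, Toothache and Catch.
prior = {True: 0.2, False: 0.8}
likelihoods = {"toothache": {True: 0.6, False: 0.1},
               "catch": {True: 0.9, False: 0.2}}
print(posterior_cause(prior, likelihoods, {"toothache": True, "catch": True}))
```

Only 1 + 2 + 2 numbers are stored here instead of the 7 independent entries of the full joint table, and the saving grows with the number of effects.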

Lecture 18: Probabilistic Reasoning (14.1-14.2, 14.4, 16.1-16.2)

Representing knowledge in an uncertain domain
- Bayesian network: data structure that can represent a full joint distribution using conditional independence and smaller distributions.
- a directed acyclic graph.
- if node1-------->node2 then node1 is "parent" of node2
  node1 has a "direct influence" on node2
- conditional independence is indicated by lack of link between two nodes, but with shared parent
- independent variables aren't connected to others
- nodes annotated with conditional probability distribution
  P(X_i | Parents(X_i)) -- giving effects of parents on that node
- when building a network order variables so that causes precede effects
- include links from parents if one variable directly influences another
Semantics of Bayesian networks
- For a particular entry in the joint distribution over all n variables
  i.e., X₁=x₁ ∧ ... ∧ X_n=x_n
  P(x₁,....,x_n) = Π P(x_i | parents(X_i)) -- varying i from 1 to n.
- e.g., for john, mary, alarm, not burglary, not earthquake
  P(j, m, a, ¬b, ¬e) = P(j|a) P(m|a) P(a|¬b,¬e) P(¬b) P(¬e)
  by tracing back to parents.
- causal models: causes ---> effects
- diagnostic models: effects ---> causes
- causal models easier to build, and easier to get probabilities for nodes
- skip 14.2.2 and 14.3
Exact inference in Bayesian Networks
- usual problem is to compute posterior probability for query vbls
  given some event (some assignment to evidence variables)
  - X is query vbl
  - E is set of evidence variables E₁,...,E_m
  - e is observed event (evidence)
  - Y is set of nonevidence, nonquery vbls Y₁,...,Y_l
    the "hidden variables"
  - complete set of vbls = {X} ∪ E ∪ Y
  - typical query P(X | e)
- sample query P(Burglary | JohnCalls=true, MaryCalls=true) = <0.284, 0.716>
- i.e., P(B | j, m), and e = earthquake, a = alarm, b = burglary
- Inference by enumeration...
- for typical query
- in general P(X | e) = αP(X, e) = α ∑_y P(X, e, y)
  where y is all possible combinations of values from the unobserved vbls.
- note that P(x₁,....,x_n) = Π P(x_i | parents(X_i))
- that allows P(x, e, y) to be calculated
- P(b | j, m) = αP(b, j, m) = α ∑_e ∑_a P(b)P(e)P(a|b,e)P(j|a)P(m|a)
- note that this uses each of the P(x_i | parents(X_i)) in the network
- skip the rest
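A sketch of inference by enumeration for the burglary query above. The conditional probability tables are not listed in these notes; the values below are the commonly used textbook numbers for this network and should be treated as assumptions of the sketch.

```python
from itertools import product

# Assumed CPT values (probability that each variable is true given its parents).
P_B = 0.001
P_E = 0.002
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # keyed by (burglary, earthquake)
P_J = {True: 0.90, False: 0.05}                      # keyed by alarm
P_M = {True: 0.70, False: 0.01}                      # keyed by alarm

def p(value, p_true):
    # Probability of a Boolean value given the probability of True.
    return p_true if value else 1 - p_true

def joint(b, e, a, j, m):
    """One entry of the full joint: product of P(x_i | parents(X_i))."""
    return (p(b, P_B) * p(e, P_E) * p(a, P_A[(b, e)])
            * p(j, P_J[a]) * p(m, P_M[a]))

def query_burglary(j, m):
    """P(Burglary | j, m): sum out the hidden variables e and a, then normalize."""
    unnorm = {b: sum(joint(b, e, a, j, m)
                     for e, a in product((True, False), repeat=2))
              for b in (True, False)}
    alpha = 1 / sum(unnorm.values())
    return {b: alpha * v for b, v in unnorm.items()}

print(query_burglary(True, True))
```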
Quick Intro to Utility
- Decision Theory: choose amongst actions based on immediate outcomes
- in nondeterministic, partially observable environment
- RESULT(a) is a random vbl that has values that are possible outcome states of action a
- P(RESULT(a)=s' | a, e)
  probability of outcome s' given action a executed and evidence observations e
- utility function: U(s') gives a number expressing desirability/usefulness of the state s'
- EU(a | e) -- expected utility of an action:
  with lots of outcomes we need a way of weighting their utility by their probability
- EU(a | e) = ∑_s' P(RESULT(a)=s' | a, e) * U(s')
- maximum expected utility (MEU): a rational agent should pick the action that maximizes the expected utility
- action = argmax_a EU(a | e)
- Preferences in choice:
  A > B -- agent prefers A over B
  A ~ B -- agent is indifferent between A and B
  A ≥ B -- agent prefers A over B, or is indifferent between them
- there are axioms of utility theory that if followed will have an agent exhibit rational behavior.
- if so
  U(A) > U(B) ⇔ A > B
  U(A) = U(B) ⇔ A ~ B
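A sketch of the expected-utility calculation and the MEU choice described above. The actions, outcome states, probabilities and utilities are made-up illustrations, and the function names are hypothetical.

```python
def expected_utility(outcomes, utility):
    """EU(a | e) = sum over s' of P(RESULT(a)=s' | a, e) * U(s')."""
    return sum(p * utility[s] for s, p in outcomes.items())

def best_action(model, utility):
    """MEU: pick the action with the highest expected utility."""
    return max(model, key=lambda a: expected_utility(model[a], utility))

# Hypothetical example: outcome distributions P(RESULT(a)=s' | a, e) per action.
model = {
    "leave_umbrella": {"dry": 0.7, "wet": 0.3},
    "take_umbrella": {"dry_but_carrying": 1.0},
}
utility = {"dry": 10, "wet": 0, "dry_but_carrying": 8}

for a in model:
    print(a, expected_utility(model[a], utility))
print(best_action(model, utility))
```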

Lecture 19: Learning from examples (18.1-18.4)

Intro
- Review the "Learning Agent"
- agent is learning if it changes its performance, hopefully for the better, on future tasks after obtaining observations about the world.
- basic case: "from examples"...
  given input-output pairs, learn function that predicts outputs for new inputs.
- called "Inductive Learning"
  -- inductive inference learns something general from specific things
- learning handles lack of agent designer's knowledge about the world, how it changes, or how to operate in it.
Forms of Learning
- Factors affecting learning:
  - Component to be improved
  - Prior knowledge agent has
  - Representation used for the data/observations
  - Representation used for the Component
  - Feedback available to learn from
- Components that might be learned include:
  - direct mapping from state to actions
  - inference of relevant properties of the world from percept sequence
  - information about the way the world evolves
  - information about the results of possible actions
  - the desirability of world states (utility)
  - the desirability of actions
  - goals describing classes of states to be achieved
- Component Representations include logic, and Bayesian networks.
- Much learning concerns factored data representations (vector of attribute/values)
- Feedback to learn from: three types of learning...
  - unsupervised: learns patterns in input with no feedback (e.g., clustering)
  - reinforcement: agent learns from rewards/punishments which actions were good/bad
  - supervised: agent gets input and is told the matching output
- problems: noise in data: incorrect or missing
Supervised Learning
- "training set": input-output pairs (x_i, y_i), generated by unknown function y = f(x)
- find function h (hypothesis) that approximates f
- "test set": some additional examples ≠ training set
  -- used to test h (i.e., can h(x) correctly predict y?)
- classification: discrete set of y values (e.g., diseases)
- Boolean classification: y=true or y=false (learn goal predicate)
- regression: y is a number
- hypothesis space: a set of functions that h belongs to
- consistent hypothesis: agrees with all the data
- Ockham's razor: prefer the simplest consistent hypothesis
  e.g., prefer small decision trees
Learning Decision Trees (by induction)
- decision tree representation
- trees can be understood by people
- decision trees are good for some types of problems but not all
- decisions reached by a series of tests (path through tree)
- a node is a test of an attribute
- links from each node are labelled with each of the possible attribute values
- leaf nodes are labelled with a y value (the output)
- as trees are built additional nodes are added below single root node
- not all attributes need to be included
- there are many possible trees (most are inefficient)
- if useful, paths through trees can be rewritten as rules, or logical statements.
- inducing decision trees from examples
- typical input is a vector of x values and a single y value
  x = { Sunny=very, Windy=moderate }, y = Sailing
  x = { Sunny=moderate, Windy=none }, y = Hiking
- use greedy divide-and-conquer approach to learn trees
- grow one level of tree below each node, moving down the tree
- nodes are picked by their discriminating/sorting power ("important attributes")
  i.e., splitting the data to maximize progress towards leaf nodes
- start at top with most important node, next level is a set of decision tree learning problems with smaller sets of data that were produced by the previous node's split.
- results
  - reach leaf node with single y value if data is split perfectly
  - run out of data but there are still attributes left to use on that path, then we don't have an observation for that case
  - if we use all the attributes on a path but still have data, then there is noise in the data.
- learning curve: improvement in accuracy of learning
  e.g., gradually increase training set size, and get increase in proportion of test set correct (exponential)
- choosing attribute tests
- pick most important attribute at each step of tree learning
- how good are the subsets of the data produced by each attribute
  i.e., how well sorted
- use entropy: a measure of uncertainty
- a data subset with an equal mix of data leaves us uncertain about the result
- want to reduce uncertainty - increase the amount of sorting that has been done - "information gain"
- Gain: entropy of data set before using attribute, minus entropy of data subsets after using an attribute, is expected reduction in entropy (information gain)
- check the Gain for each available attribute at that point in the tree, and use the one with the greatest Gain.
- generalization and overfitting
- overfitting: the learner finds patterns in the training data that do not generalize (e.g., coincidences, noise), and the tree will try to accommodate them.
  i.e., it overcommits, and learns too much (such as noise)
- decision tree pruning: eliminate nodes (leading to leaf nodes) that are not relevant.
- likely to prune nodes that provide very small information gain
- significance test: use statistics to test whether that deviation in the data is significantly different from no or normal deviation
  i.e., what are the chances that this could occur normally
- pruning reduces the decision tree learning's sensitivity to noise
- broadening the applicability
- need to handle
  -- missing data
  -- attributes with many possible values (weakens Gain test)
  -- continuous and integer valued attributes (infinite set of values)
  :: use split points for node tests (e.g., Weight > 160)
  -- continuous valued output attributes: regression tree to predict output value
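A sketch of the entropy and Gain calculation used for choosing attribute tests. The examples reuse the Sunny/Windy attributes from the notes, but the four data rows are invented for illustration, and the function names are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of output values."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy before the split minus the weighted entropy of the subsets after it."""
    labels = [y for _, y in examples]
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[attribute], []).append(y)
    remainder = sum(len(s) / len(examples) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

def best_attribute(examples, attributes):
    """Greedy choice: the attribute with the greatest Gain."""
    return max(attributes, key=lambda a: information_gain(examples, a))

# Made-up training data: (x, y) pairs with x as attribute/value dict.
examples = [
    ({"Sunny": "very", "Windy": "moderate"}, "Sailing"),
    ({"Sunny": "moderate", "Windy": "none"}, "Hiking"),
    ({"Sunny": "very", "Windy": "none"}, "Hiking"),
    ({"Sunny": "moderate", "Windy": "moderate"}, "Sailing"),
]
print(information_gain(examples, "Sunny"), information_gain(examples, "Windy"))
print(best_attribute(examples, ["Sunny", "Windy"]))
```

In this data Windy sorts the examples perfectly while Sunny leaves each subset evenly mixed, so Windy would become the root test.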
Evaluating and Choosing the Best Hypothesis
- Intro
- stationarity assumption: probability distribution over examples doesn't change over time.
- independent: each example is independent of previous examples
- identically distributed: each example has an identical prior probability distribution
- error rate of hypothesis h(x): proportion of mistakes it makes
- low error rate may still not predict well for other data
- cross-validation: using the data in multiple ways to build and test
- holdout cross-validation: randomly split data set into training set and test set
  -- need large training set to learn well
  -- but...need large test set to test well
- k-fold cross-validation: divide data into k subsets; use each subset in turn as the test set while training on the other k-1; use the average error to estimate the accuracy of a tree trained on all data. k=10 is common.
- Model selection: complexity vs. goodness of fit
- model selection: choosing the type of hypothesis to define a space of things that can be learned. i.e., h comes from the space.
- optimization: getting the best h from the space
- size: an approximation of the complexity of the hypothesis
  -- e.g., linear function < quadratic function
  -- e.g., small decision tree < larger decision tree
- find best 'size' that balances underfitting and overfitting to give best test set accuracy.
- wrapper: an algorithm to try to find the best size, that takes a learning algorithm (e.g., decision tree learning) and some examples
  -- it varies size, uses cross validation to learn error rate
  -- stops at lowest error, when h starts to overfit
  -- then learns with all data for a hyp of that size.
- From error rates to loss
- not all errors are created equal!
  -- better to get false +ves? (told you have disease when you don't)
  -- false -ves? (not told you have disease when you do)
- need to take that utility into account as well
- assume h(x) gives ŷ instead of y
- loss function: loss of utility by getting an error
- can use just L(y, ŷ)
- small loss is better (we want to minimize it)
- Loss functions
  - Absolute value loss: L₁(y,ŷ) = |y-ŷ|
  - Squared error loss: L₂(y,ŷ) = (y-ŷ)²
  - 0/1 loss: L_0/1(y,ŷ) = 0 if y=ŷ else 1
- generalized loss: taking prior probability distribution over all I/O pairs into account
- empirical loss: for an h, assume data equally likely, sum loss for each h(x)
- estimated best hypothesis: the h with the minimum empirical loss
- small-scale learning: problems with dozens to 1000s of examples
- large-scale learning: millions of examples -- restricted by computation
- Regularization
- explicitly penalizing complex hypotheses
- can search for hypotheses that minimize
  empirical loss + complexity
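A sketch of the three loss functions, empirical loss as an average over the examples, and selection of the hypothesis that minimizes empirical loss plus a complexity penalty. The weighting parameter lam and the complexity function are assumptions of this sketch; the notes only say "empirical loss + complexity".

```python
def l1_loss(y, y_hat):
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over (x, y) examples, each weighted equally."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

def best_hypothesis(loss, hypotheses, examples, complexity=lambda h: 0, lam=0.0):
    """Pick h minimizing empirical loss + lam * complexity(h) (lam is an assumed weight)."""
    return min(hypotheses,
               key=lambda h: empirical_loss(loss, h, examples) + lam * complexity(h))

examples = [(0, 0), (1, 2), (2, 4)]

def double(x):
    return 2 * x

def plus_one(x):
    return x + 1

print(empirical_loss(l2_loss, double, examples))
print(empirical_loss(l2_loss, plus_one, examples))
print(best_hypothesis(l2_loss, [double, plus_one], examples).__name__)
```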

Lecture 20: More learning (18.7-18.8)

Artificial Neural Networks
- Intro
- neurons: brain cells
- neural networks (NNs): networks of simulated neurons (units)
- neuron "fires" when a linear combination of inputs exceeds some threshold
- Neural network structures
- units: the nodes/units of a NN
- link: connections between nodes
- activation: the output from a node
- output of one node can be the input to another
- weight: links have weights w_i,j on them
- unit j takes weighted sum of all inputs w_i,j × a_i
- weighted sum is in_j
- bias weight: each node has a dummy input fixed to 1 with a weight on it
- an activation function g converts in_j to a_j
- perceptron: a unit with g as a hard threshold
- sigmoid perceptron: a unit with g as a softer threshold
- these are non-linear activation functions
- feed-forward network: connections are only towards the output from input
- recurrent network: allows loops (i.e., more complex, and powerful)
- layers: single layer has input to units and output from those units.
- hidden units: a layer of units that lie between the inputs and the outputs (their activations are not part of the network's output)
- classification/categorization: usually as many outputs as classes
- Single-layer feed-forward neural networks
- known as "perceptron networks"
- activation function g determines training process
- error is y - h_w(x)
- as this does 0/1 classification both y and h_w(x) can be 0 or 1.
- perceptron learning rule: assumes hard threshold, does weight updates depending on error
- logistic regression: uses softened threshold, does weight updates depending on error
  - h_w(x) = sigmoid function applied to the data (i.e., to x).
  - w_i ← w_i + α(y - h_w(x)) × h_w(x)(1 - h_w(x)) × x_i
- function can be learned if it is linearly separable
  i.e., it learns linear decision boundaries
  OK = { and, or } Not OK = { xor }
- learning curve for perceptrons sometimes better than decision trees, sometimes not.
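
The perceptron learning rule above can be sketched as follows. This is a minimal illustration, not the course's reference code: the function names, the learning rate of 1.0, the epoch count, and the choice that the hard threshold fires at exactly 0 are all assumptions. It trains on AND (linearly separable) and on XOR (not linearly separable).

```python
def step(z):
    """Hard threshold g: 1 if z >= 0, else 0 (tie-breaking at 0 is an assumption)."""
    return 1 if z >= 0 else 0

def predict(w, x):
    # w[0] is the bias weight; its dummy input is fixed to 1
    return step(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))

def train_perceptron(examples, alpha=1.0, epochs=50):
    """Perceptron learning rule: w_i <- w_i + alpha * (y - h_w(x)) * x_i."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(w, x)      # error is -1, 0, or +1
            w[0] += alpha * err          # bias input is 1
            for i, xi in enumerate(x):
                w[i + 1] += alpha * err * xi
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and = train_perceptron(AND)
and_ok = all(predict(w_and, x) == y for x, y in AND)   # separable: learned
w_xor = train_perceptron(XOR)
xor_ok = all(predict(w_xor, x) == y for x, y in XOR)   # not separable: never all correct
```

No single linear boundary classifies all four XOR examples, so `xor_ok` stays false however long training runs.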
- Multilayer feed-forward neural networks
- has hidden units in a layer or layers
- network is a function h_w(x) parameterized by weights w, where x is an input vector.
- output is expressed as a fn of inputs and weights (including use of g)
- train using gradient descent loss-minimization method
- neural network does nonlinear regression
  -- i.e., fitting a non-linear fn to some data
  -- non-linear as NN provides nested non-linear threshold/activation fns.
- Learning in Multilayer neural networks
- goal output is y
- NN returns h_w(x)
- error vector at output is y - h_w(x)
- outputs may depend on all weights in the NN
- back-propagate error from output layer to hidden layers
- at output layer, update rule adjusts weights depending on error:
- Let Err_k be error of k^th element of error vector
- Define Δ_k = Err_k × g'(in_k)
  where g' is the derivative of g, and in_k is the weighted sum of the inputs to unit k.
- update rule for the weight between hidden unit j and output unit k is
  w_j,k ← w_j,k + α × a_j × Δ_k
  where α is the learning rate and a_j is the activation of hidden unit j.
- at hidden layer, update rule adjusts weights depending on the amount of error for which the hidden layer unit might be responsible.
- the Δ_k values are divided according to strength of connection between hidden node and all the connected output nodes k.
- Define Δ_j = g'(in_j) Σ_k w_j,k Δ_k
  where in_j is the weighted sum of the inputs to hidden unit j, the w_j,k are the weights from unit j to all the output nodes to which it is connected, and Δ_k is the error for each of those nodes.
- update rule for the weight between input i and hidden unit j is
  w_i,j ← w_i,j + α × a_i × Δ_j
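
A minimal sketch of one back-propagation step for a network with a single hidden layer. It assumes the standard update rules Δ_k = Err_k × g'(in_k) and Δ_j = g'(in_j) Σ_k w_j,k Δ_k, a sigmoid g, and a bias supplied as a constant 1 in the input vector; the function names and the learning rate are illustrative assumptions.

```python
import math

def g(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid: g'(z) = g(z) * (1 - g(z))."""
    s = g(z)
    return s * (1.0 - s)

def forward(x, w_in, w_out):
    """w_in[i][j]: weight from input i to hidden unit j.
    w_out[j][k]: weight from hidden unit j to output unit k."""
    n_hidden, n_out = len(w_in[0]), len(w_out[0])
    in_j = [sum(w_in[i][j] * x[i] for i in range(len(x))) for j in range(n_hidden)]
    a_j = [g(z) for z in in_j]
    in_k = [sum(w_out[j][k] * a_j[j] for j in range(n_hidden)) for k in range(n_out)]
    a_k = [g(z) for z in in_k]
    return in_j, a_j, in_k, a_k

def backprop_step(x, y, w_in, w_out, alpha=0.5):
    """Update the weights in place for one example; return (delta_j, delta_k)."""
    in_j, a_j, in_k, a_k = forward(x, w_in, w_out)
    # output layer: Delta_k = Err_k * g'(in_k)
    delta_k = [(y[k] - a_k[k]) * g_prime(in_k[k]) for k in range(len(a_k))]
    # hidden layer: Delta_j = g'(in_j) * sum_k w_j,k * Delta_k (uses the old weights)
    delta_j = [g_prime(in_j[j]) * sum(w_out[j][k] * delta_k[k] for k in range(len(delta_k)))
               for j in range(len(a_j))]
    for j in range(len(a_j)):
        for k in range(len(delta_k)):
            w_out[j][k] += alpha * a_j[j] * delta_k[k]
    for i in range(len(x)):
        for j in range(len(a_j)):
            w_in[i][j] += alpha * x[i] * delta_j[j]
    return delta_j, delta_k
```

Under these assumptions each update is a gradient-descent step on the squared error ½ Σ_k (y_k - a_k)², which is why the quantities x_i × Δ_j and a_j × Δ_k appear in the updates.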
- Learning neural network structures
- if using fully connected networks
- choices - how many hidden layers and their sizes.
- usually trial and error
- use cross validation technique to estimate error.
Nonparametric Models (skim!)
- a parametric model uses a fixed number of parameters (e.g., the size of x)
- nonparametric model can change with more data
- instance-based learning stores data as it arrives
- simple table: ask for h(x) find x in the table and return the y
- if not in table then a problem.
- use k-nearest neighbors in the stored data
- take plurality vote of the neighbors as the answer.
- nearest: needs a distance metric
- use Manhattan distance or Euclidean distance between query and data points
- works well in low-dimensional spaces, with lots of data
- k-d trees: balanced binary tree with arbitrary number of dimensions
- split data along one dimension at each level of the tree, cycling through the dimensions
- nearest neighbors is easy if query isn't near a boundary
- if it is you need to check on both sides of the split
- works well with up to 20 dimensions with millions of examples
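
A minimal sketch of k-nearest-neighbors classification with a plurality vote, using a linear scan rather than a k-d tree. The function names and the toy data are illustrative assumptions.

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))

def knn_classify(query, data, k=3, distance=euclidean):
    """data is a list of (x, y) pairs; returns the plurality label of the k nearest x."""
    neighbors = sorted(data, key=lambda xy: distance(query, xy[0]))[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]

# toy stored examples (hypothetical)
data = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
        ((5.0, 5.0), "B"), ((5.2, 4.9), "B"), ((4.8, 5.1), "B")]

label_near_a = knn_classify((1.1, 1.0), data)
label_near_b = knn_classify((5.0, 5.2), data, distance=manhattan)
```

Using an odd k avoids ties when there are two classes.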

Lecture 21: Knowledge in Learning (19.1-19.3)

Logical formulation of learning

ML using prior knowledge of the world to learn hypothesis
put Hypothesis (h), Examples and matching Classifications (x's and y's) as set of logical sentences
given new example (in logic) use h to infer classification
Examples and hypotheses
examples in terms of values for Attributes
example x₁: Alternate=Yes, Bar=No, Fri=No, Hungry=Yes, ...
i.e., Alternate(X₁) ∧ ¬Bar(X₁) ∧ ¬Fri/Sat(X₁) ∧ Hungry(X₁)...
classification (Goal predicate) -- WillWait(X₁) or ¬WillWait(X₁)
each hyp h_j is in form -- ∀x Goal(x) ⇔ C_j
where candidate definition C_j is a logical expression
C_j for a decision tree can be expressed as a logical expression for each path (using ∧) linked by ∨
h_j predicts that the set of examples that satisfies C_j are examples of Goal(x)
Those examples are the "extension" of the goal
Hyp space H = {h₁, ..., h_n}
Learning alg believes h₁ ∨ h₂ ∨ ... ∨ h_n
if h_i not consistent with new example it can be removed
- can be false negative for h_i
  h falsely says that it should be negative, but it is in fact positive
- can be false positive for h_i
  h falsely says that it should be positive, but it is in fact negative
note that hyp space H is vast, so this is not practical via theorem proving.
Current-best-hypothesis search
maintain single h and adjust it as new examples arrive
for each h_i keep all examples that it classifies (+ve) (the extension)
those examples define the hypothesis
if new example is false negative -- include in the extension ("generalization")
if new example is false positive -- remove from the extension ("specialization")
note that when doing generalization or specialization you need to check that the result is compatible with previously seen examples.
in fact what is needed is for h_i to be modified to reflect generalization or specialization.
for generalization h_i needs to become less precise (drop conditions from C_i)
for specialization h_i needs to become more precise (add conditions to C_i)
at each step there are multiple possibilities, not all of which are good, but a choice must be made, so backtracking will be needed.
at each step checking that the result is compatible with previously seen examples is expensive.
i.e., with large number of examples and large hyp space H it isn't practical.
Least-commitment search (Version space)
least-commitment: make least change necessary
keep around summary of all hyps consistent with data seen so far
new example may alter summary slightly to reduce it
"version space": only those hyps still consistent with data (after reduction)
incremental learning
version space defined by upper boundary G (general) and lower boundary S (specific)
*** do simple example **
G starts with True (i.e., the most general hypothesis)
S starts with False (i.e., the most specific hypothesis)
S and G get updated by +ve and -ve examples
any hyp between S and G must agree with all the examples
updates
- False positive for S_i --- S_i is too general, so throw it out of S
- False negative for S_i --- S_i is too specific, so replace it by all of its immediate generalizations (i.e., move that portion of S up towards G)
- False positive for G_i --- G_i is too general, so replace it by all of its immediate specializations (i.e., move that portion of G down towards S)
- False negative for G_i --- G_i is too specific, so throw it out of G
results
- one hyp remains (hooray!)
- S or G becomes empty (i.e., no consistent h for training set)
- run out of examples with several h remaining
Version space approach is probably not practical in many situations (especially with noise), but it's a great model

Knowledge in learning

...skim this section...
moral: background knowledge can allow faster learning
Note Explanation Based Learning (EBL)
Hypothesis: what is being learned (h)
Descriptions: all the examples (x's)
Classifications: all the classifications (y's)
Background: existing relevant knowledge

Explanation based learning (EBL)
- it works by "explaining" a solution
- Extracting general rules from examples
- construct proof for problem (e.g., using backward-chaining theorem prover)
- e.g., prove Derivative(X², X) = 2X
- e.g., prove Simplify(1 × (0 + X), w)
  
  i.e., can it be simplified?
- construct two proof trees simultaneously
  
  original proof
  the same proof with all constants replaced by variables
  i.e., a generalized proof tree
- extract general rule from generalized proof tree
- EBL steps
  
  construct proof of example using background knowledge
  also construct parallel proof with variables
  construct new rule with lhs including leaves of proof tree ⇒ rhs as example with variables and bindings applied.
  
  i.e., lhs terms are the conditions that the background knowledge shows to be true, which need to be true to make this inference again in the future
  
  drop any conditions on lhs that are true regardless of values of variables in rhs
  result is a new rule that summarizes the result of applying background knowledge
  ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
- Improving efficiency
- can also extract more general rules from the generalized proof tree by using non-leaf nodes
- tradeoff: general rules apply to more cases, but don't find answer as directly
- tradeoff: adding lots of specific rules makes each one apply directly to a specific set of situations, but finding the right one becomes harder (increased branching factor!)
- tradeoff: check whether parts of each new rule are easy to solve, but this makes learning time longer.
- tradeoff: "easy to solve" varies as rules are added.
Lecture 22: Reinforcement Learning (21.1-21.2)
- Introduction
  "reward" or "reinforcement": feedback for action
  Markov Decision Processes (MDPs): quick overview!
  reinforcement learning: based on rewards
  simple, fully observable environments, but with probabilistic action outcomes
  possible use by different agent types
  
  utility-based agent: learns utility function on states
  -- uses it to select actions that maximize expected outcome utility
  Q-learning agent: learns action-utility function (Q-function)
  -- the expected utility of taking a given action in a given state
  reflex agent: learns a policy that maps states directly to actions
  
  Model based vs. Model free
  
  Model based approach to RL
  -- learn MDP model: transitions and rewards (or approximation)
  
  Model free approach to RL
  -- do not learn the model
- Passive Reinforcement Learning
  "passive learning": agent's policy is fixed, learn utilities of states
  state-based representation, fully observable environment
  given a policy
  goal: learn how good the policy π is
  i.e., learn utility function U^π(s)
  does not know transition model in advance
  does not know reward function in advance
  agent makes "trials" using the policy
  each trial runs to the terminal state
  the agent's percepts supply the current state s and the reward for that state.
  use reward info to learn the expected utility for each state s
  Direct utility estimation
  reward-to-go: expected total reward from that state onwards to terminal state
  after each trial calculate reward-to-go for each state, and make expected utility for that state the running average.
  use reward-to-go as direct evidence of actual expected utility for state
  need many trials to get right answer (converges slowly).
  
  however, utilities of states are not independent, as...
  The utility of each state equals its own reward plus the expected utility of its successor states
  They obey Bellman's equations U^π(s) = R(s) + γΣ_s'P(s' | s, π(s))U^π(s')
  i.e., U^π(s) depends on U^π(s'), the next state's utility
  
  Adaptive Dynamic Programming
  does trials as before
  learns transition probabilities from observations
  -- how often do you get to s' from s by doing a?
  learns reward function R(s) from observations
  -- in new state, just store the reward given
  plugs values into Bellman equations
  solve for utilities
  
  Temporal Difference (TD) Learning
  make computation easier and obtain an approximate utility
  just adjust utility of state based only on the observed successor
  don't need transition model, as transitions are observed.
  e.g., after some learning, calculate
  U^π(1,3) = R(1,3) + U^π(2,3)
  where (2,3) is the observed successor
  if that calculated value ≠ current utility value for U^π(1,3) then update it in the right direction.
  update using the TD Update Rule for s to s'
  U^π(s) = U^π(s) + α( R(s) + γU^π(s') - U^π(s) )
  where α = learning rate, γ = discount
  R(s) + γU^π(s') is approx/noisy utility measure
  make learning rate gradually decrease with the number of samples
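
A minimal sketch of passive TD learning using the TD update rule above. The trial below (a deterministic walk along the top row of a grid world to a +1 terminal state), the step reward of -0.04, γ = 1, and the decreasing learning-rate schedule α(n) = 60/(59 + n) are illustrative assumptions.

```python
def td_learn(trials, gamma=1.0):
    """Each trial is a list of (state, reward) percepts ending in a terminal state."""
    U, N = {}, {}
    for trial in trials:
        for (s, r), (s2, r2) in zip(trial, trial[1:]):
            U.setdefault(s, r)       # a new state starts with its own reward
            U.setdefault(s2, r2)
            N[s] = N.get(s, 0) + 1
            alpha = 60.0 / (59.0 + N[s])   # decreases with the number of visits
            # TD update: U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
            U[s] += alpha * (r + gamma * U[s2] - U[s])
    return U

# one hypothetical trial, repeated: (1,3) -> (2,3) -> (3,3) -> terminal (4,3)
trial = [((1, 3), -0.04), ((2, 3), -0.04), ((3, 3), -0.04), ((4, 3), 1.0)]
U = td_learn([trial] * 200)
```

With this deterministic trial the estimates approach U(3,3) = 0.96, U(2,3) = 0.92 and U(1,3) = 0.88; no transition model is ever built.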
Lecture 23: Natural Language Processing (22.1-22.4)
- Intro
  knowledge acquisition: need language understanding for getting new knowledge
- Language models
  language model: predict the probability distribution of language
  language: set of strings of characters
  grammar: rules that define legal structure (syntax)
  semantics: allocate meaning
  natural language: English, Spanish, ...
  word combinations have probabilities (some rare; some sorta OK)
  ambiguity: probability distribution over possible meanings
  -- "He saw her duck"
  language is huge so models are approximate
  N-gram character models
  simple language model: probability distribution over characters
  probability of sequence of N characters P( c_1:N )
  e.g., P("the") = 0.027
  n-gram: sequence of length n
  --- (bigram, trigram samples)
  --- Google books Ngram Viewer
  n-gram is Markov chain of order n-1
  --- P(c_i) depends on immediately preceding characters (e.g., previous 2 for a trigram)
  i.e., P(c_1:N) = Π_i=1..N P(c_i | c_i-2:i-1)
  extract n-gram probabilities from a corpus (large body of text)
  
  language identification: given text, what language is it written in?
  example
  trigram model of each language (i.e., probabilities)
  i.e., have P(text|language)
  want P(language|text)
  = P(text|language)P(language)/P(text) and drop P(text)
  P(language) is dominated by P(text|language) term in calculation so it can be approximate and still OK
  argmax_l P(l) Π_i=1..N P(c_i | c_i-2:i-1, l), where l ranges over the languages
  Smoothing n-gram models
  one corpus isn't the same as another, so an n-gram model is only approximate
  things claimed to be 0 probabilities actually are possible
  smoothing: adjust zero probabilities up, and others slightly down (sum to 1)
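
A minimal sketch of language identification with character trigram models and add-one (Laplace) smoothing. The tiny corpora, the assumed alphabet size of 256, the two-space padding, and the uniform prior are illustrative assumptions; a real system would train on a large corpus per language.

```python
import math
from collections import Counter

def train(text):
    """Count trigrams and their two-character contexts in a corpus."""
    padded = "  " + text.lower()
    trigrams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    contexts = Counter(padded[i:i + 2] for i in range(len(padded) - 2))
    return trigrams, contexts

def log_likelihood(text, model, alphabet_size=256):
    """log P(text | language) = sum_i log P(c_i | c_i-2:i-1), add-one smoothed."""
    trigrams, contexts = model
    padded = "  " + text.lower()
    total = 0.0
    for i in range(len(padded) - 2):
        tri = padded[i:i + 3]
        # unseen trigrams get a small nonzero probability instead of 0
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + alphabet_size))
    return total

def identify(text, models, priors=None):
    """argmax over languages of log P(language) + log P(text | language)."""
    def score(lang):
        prior = priors[lang] if priors else 1.0 / len(models)
        return math.log(prior) + log_likelihood(text, models[lang])
    return max(models, key=score)

models = {
    "english": train("the quick brown fox jumps over the lazy dog and the cat "
                     "sat on the mat with the other cat"),
    "spanish": train("el perro corre por el campo y el gato duerme en la casa "
                     "con el otro gato"),
}
guess_en = identify("the cat and the dog", models)
guess_es = identify("el gato y el perro", models)
```

Working in log space avoids numerical underflow when multiplying many small probabilities.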
  N-gram word models
  n-grams for words
  probability of word sequence
  3-gram word model sentences are starting to look somewhat reasonable
- Text Classification
  categorization: given text what type is it?
  e.g., spam, positive/negative movie review, ...
  could use supervised learning
  "features" for category: word level, character level
  keep top 100 or so features
  can use supervised learning with features (e.g., decision tree)
  
  train n-gram word model for ¬spam and another for spam.
  P(category|message) ∝ P(message|category)P(category)
  by Bayes rule and ignoring P(message)
  pick larger probability P(¬spam|message) vs. P(spam|message)
  
  can use data compression for classification
  e.g., add new msg to spam and compress, add same msg to ¬spam and compress, the greatest relative reduction indicates category!
- Information Retrieval (IR)
  task of finding relevant documents
  needs
  
  corpus of documents
  query in query language
  result set (possibly relevant documents)
  presentation of result set
  
  Boolean keyword model
  -- query language with AND/OR/NOT
  -- look in document for keywords
  
  IR scoring functions: query returns a score for a document
  high score = high relevance
  TF = frequency of a word in a document
  IDF = inverse document frequency of a word
  --- if a word appears in most documents it has less importance
  DF = the number of documents that contain a word
  use these to return a score for a document and some query words.
  
  Precision = proportion of result set that are actually relevant
  Recall = proportion of all relevant documents in corpus that are returned in the result set.
  can make tradeoffs between P and R
  tweaks include adjusting case (car = CAR = Car); stemming (run = runs = running); synonyms (sofa = couch)
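
A minimal sketch of the precision and recall definitions above. The function name and the document IDs are illustrative assumptions.

```python
def precision_recall(result_set, relevant):
    """Return (precision, recall) for a result set against the truly relevant documents."""
    result_set, relevant = set(result_set), set(relevant)
    hits = len(result_set & relevant)          # relevant documents that were returned
    precision = hits / len(result_set) if result_set else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# hypothetical query: 4 documents returned, 3 documents in the corpus are relevant
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
```

Returning more documents tends to raise recall and lower precision, which is the tradeoff noted above.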
  
  PageRank developed by Google
  PR(p) depends on PR of all pages that link to page p, and the count of number of links from each of the pages that link to p.
  i.e., depends on Σ_i( PR(in_i)/C(in_i) )
  the HITS algorithm first gets pages that satisfy query, then does a similar sort of analysis
  Finds Hubs and Authorities
  e.g., authority pages have many relevant pages pointing to them.
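
A minimal sketch of iterative PageRank built around the sum Σ_i( PR(in_i)/C(in_i) ). The damping factor d = 0.85 and the (1 - d)/N term are the commonly used form and are assumptions beyond what these notes state; the three-page link graph is hypothetical, and every page is assumed to have at least one outgoing link.

```python
def pagerank(links, d=0.85, iterations=100):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {}
        for p in pages:
            # sum over pages q that link to p of PR(q) / C(q), C(q) = out-link count
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) / n + d * incoming
        pr = new
    return pr

# hypothetical web: A links to B and C, B links to C, C links to A
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

In this graph C collects rank from both A and B, so it ends up ranked highest.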
  
  Question answering: query is a question
  been around for a while!
  D.C.Brown (1974) A survey and analysis of question answering systems,
  M.Sc. Thesis, University of Kent, Canterbury, England.
  
  Can use standard question types
  Convert questions into standard type, then into web search query.
  Selections of text retrieved are analysed.
  Uses knowledge about what type of answer is expected
  e.g., who vs. how many expects name vs. number
  (used in Watson)
- Information extraction
  Acquire knowledge by skimming text and looking for objects & relationships
  e.g., extract addresses
  Approaches:
  
  Finite-state automata
  Probabilistic models (skip this)
  Conditional random fields (skip this)
  Ontology extraction
  Automated template construction
  Machine reading
  
  Finite-state automata
  
  assume text is description of single thing
  extract attributes (e.g., Manufacturer, Model, Price)
  define "template" for each attribute
  template is defined as a finite-state automaton (e.g., regular expression)
  regex -- can define sequence, repetition, optional items
  template may have test for pre and post context
  e.g., price is 100 dollars
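
A minimal sketch of an attribute template written as a regular expression, with "price is" as the required pre-context and "dollars" as an optional post-context. The pattern and the function name are illustrative assumptions, not a template from the course.

```python
import re

# sequence ("price" then "is"), optional items ("$", cents, "dollars"), repetition (\d+)
PRICE = re.compile(r"price\s+is\s+\$?(\d+(?:\.\d\d)?)(?:\s+dollars)?", re.IGNORECASE)

def extract_price(text):
    """Return the price as a float, or None if the template does not match."""
    match = PRICE.search(text)
    return float(match.group(1)) if match else None

price_a = extract_price("The price is 100 dollars")
price_b = extract_price("Price is $249.99 today")
price_c = extract_price("no cost mentioned here")
```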
  
  finite-state automata can be cascaded (sequence)
  modularizes the knowledge
  works very well with text in restricted domains
  1st tokenize
  2nd detect complex words (e.g., company names)
  3rd group words and tag (e.g., noun phrases)
  4th handle complex phrases
  5th merge related structures
  
  Ontology extraction
  
  build ontology of facts from large corpus
  precision is vital
  use very general templates
  templates that match fact-giving syntax
  
  Automated template construction
  
  looking for templates that reveal particular relation
  e.g., subcategory; author-title; etc.
  start with some examples in the form of simple templates
  use those to retrieve text
  infer other templates from the text
  use context around the match to add to new templates (e.g., "type of"; "wrote")
  
  Machine reading
  
  needs to learn many templates
  start with general syntactic templates
  learns underlying probabilities
Lecture 24: Natural Language for Communication (23)
- Communication
  language is intended to send messages
  syntax = structure
  semantics = meaning
  pragmatics = practical issues affecting meaning that relate to context
  language is too vast and complex for trigrams to be the only tool
- Phrase Structure Grammars
  need rules that define the legal language -- a grammar
  part of speech (lexical category) -- Noun, Verb, Article, Pronoun, etc.
  syntactic categories -- noun phrase (NP), verb phrase (VP)
  combinations form phrase structure of sentence -- e.g., NP VP
  Non-terminals -- Article, Noun, NP, ...
  Terminals -- "the", "wumpus", ...
  parsing -- finding the structure of a sentence using grammar
  usually tree form
  [S [NP [Article "every"] [Noun "wumpus"]] [VP [Verb "smells"]]]
  generation -- using the grammar rules to produce sentences
  simple grammars can overgenerate (e.g., "me go home")
  
  the form of the rules alters the complexity of the languages that the grammar can parse/generate (Chomsky Hierarchy)
  
  recursively enumerable (unrestricted rules)
  context-sensitive (can apply a rule in a specific context)
  context-free (used in any context)
  regular (highly restricted)
  
  context-free grammar
  
  S → NP VP
  NP → Article Noun
  ...
  
  probabilistic context-free grammar (PCFG)
  S → NP VP [0.90]
  NP → Article Noun [0.25]
  ...
  
  probability assigned to every string
  lexicon -- words with lexical category and probabilities
  probability of sentence is product of probabilities of rules and words
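
A minimal sketch of computing the probability of a parse under a PCFG as the product of the rule and word probabilities. The two rule probabilities 0.90 and 0.25 come from the rules above; the VP rule probability and all lexicon probabilities are made-up values for illustration.

```python
# rule probabilities: (lhs, rhs) -> probability
RULES = {
    ("S", ("NP", "VP")): 0.90,
    ("NP", ("Article", "Noun")): 0.25,
    ("VP", ("Verb",)): 0.40,          # hypothetical value
}
# lexicon: (lexical category, word) -> probability (all hypothetical values)
LEXICON = {
    ("Article", "every"): 0.05,
    ("Noun", "wumpus"): 0.15,
    ("Verb", "smells"): 0.10,
}

def tree_prob(tree):
    """tree is (label, child, ...); a lexical node is (category, word)."""
    label, children = tree[0], tree[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return LEXICON[(label, children[0])]
    p = RULES[(label, tuple(child[0] for child in children))]
    for child in children:
        p *= tree_prob(child)
    return p

# [S [NP [Article "every"] [Noun "wumpus"]] [VP [Verb "smells"]]]
tree = ("S",
        ("NP", ("Article", "every"), ("Noun", "wumpus")),
        ("VP", ("Verb", "smells")))
p_sentence = tree_prob(tree)   # 0.90 * 0.25 * 0.05 * 0.15 * 0.40 * 0.10
```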
- Syntactic Analysis (Parsing)
  Parsing: using grammar to find phrase structure
  top down: start with S and work down to words
  bottom up: start with words and work up to S
  use memory (chart) to keep track of successful parses of parts of sentence to prevent having to reparse them again later
  syntactic ambiguity: multiple ways to parse a sentence
  "he eats grass and leaves" (leaves can be a N or a V)
  look for best parse -- related to probability
  could use A* with cost 1/p of parse found so far
  
  learning probabilities for PCFGs
  learn grammar from data
  large corpus of correctly parsed sentences (treebank)
  extract rules from parses and estimate probabilities from count frequencies
- Augmented grammars and Semantic Interpretation
  lexicalized PCFGs
  probabilities depend on relationships between words that rule includes
  "eat a banana" vs. "eat a bandana"
  augmented PCFG includes syntactic structure as well as word relationships
  'head' of phrase is most important word (e.g., v = "eat", n = "banana")
  VP(v) → Verb(v) NP(n) [P(v, n)]
  P(v, n) depends on v and n.
  P(eat, bandana) is very low
  use smoothing for very low probabilities so that they aren't zero
  can learn P(v, n) from treebank
  
  grammar rules can be expressed in logic
  parsing can be expressed as logical inference
  not really practical for unrestricted parsing
  could be used for language generation
  
  Case agreement and subject-verb agreement
  there are a variety of additional linguistic rules that need to be expressed somehow in order to parse/generate correctly.
  getting them all into the grammar could mean adding lots of extra non-terminals
  e.g., subjective case ("I"), objective case ("me")
  e.g., subject-verb agreement ("I smell bad", "he smells bad", "they smell bad")
  Instead, add parameters to the non-terminals
  NP(c, pn, head)
  c = case, pn = person/number (e.g., 1st person singular), head = head word of phrase
  
  Semantic interpretation
  compositional semantics: semantics of phrase depends on semantics of subphrases
  i.e., the meaning can be built up during bottom-up parse
  syntax rules annotated with semantic functions
  meanings carried up the parse tree and composed
  "John loves Mary" → Loves(John, Mary)
  meaning of "loves" is the lambda expression
  λy λx Loves(x,y)
  "Mary" gets bound to y, on one branch of parse tree.
  Higher up the parse tree, "John" gets bound to x.
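
A minimal sketch of the composition above using Python lambdas. Representing the logical term Loves(John, Mary) as a tuple is an illustrative assumption.

```python
# meaning of "loves": λy λx Loves(x, y)
loves = lambda y: (lambda x: ("Loves", x, y))

# VP "loves Mary": "Mary" is bound to y, leaving λx Loves(x, Mary)
vp_meaning = loves("Mary")

# S "John loves Mary": higher up the tree, "John" is bound to x
s_meaning = vp_meaning("John")
```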
  
  Pragmatics -- influence of current situation on the meaning
  Indexicals: "I am in Worcester today" -- "I", "today"
  Speech Act: determining speaker's intent
  "Could you close the door?" ("yes, I could")
  could even require input from perception
  "Give me that book"
  
  Ambiguity!
  "Squad helps dog bite victim"
  Almost every utterance is ambiguous.
  Alternative meanings get pruned out by native speakers.
  Lexical ambiguity: "bank" two kinds of noun, a verb, and an adjective
  Syntactic ambiguity: "I saw the flower in the park"
  seeing in the park, flower in the park
  Metaphor: "All the world's a stage" (no it isn't)
  Disambiguation: needs knowledge
  
  World model: knowledge of what is likely in the world
  Mental model: speaker's belief and hearer's belief
  Language model: likelihood of certain string of words
  Acoustic model: concerns sequences of sounds
- Machine Translation
  translate source to target (e.g., English to French)
  perfect translation requires complete understanding of the text
  Alternative meanings get pruned out by native speakers.
  → Alternatív jelentések kap metszett ki anyanyelvű.
  → Los informes alternativos se cortan fuera a hablar.
  → Alternative reports are cut out to speak.
  other languages have different words for different situations where English may have one (and v.v.)
  Levels of translation:
  
  English → Interlingua → French
  English Semantics → French Semantics
  English Syntax → French Syntax
  English words → French Words
  
  Statistical machine translation
  use large bilingual corpus of translations to train probabilistic model
  f* = argmax_f P(f | e) = argmax_f P(e | f)P(f)
  P(e | f) is a translation model (but P(f | e) can be found directly)
  P(f) is a language model for French
  Phrase approach -- find best French phrase for each short English phrase
  P(f_i | e_i) are known
  sequence of French phrases is 'distorted' to a new order (for better French)
  P(d_i) distortion probabilities are known (learned)
  P(f, d | e) = Π_i P(f_i | e_i) P(d_i)
  use a search to find best f for the e.
- Speech recognition
  Speech recognition: identify sequence of spoken words
  many problems...
  Segmentation: no pauses between spoken words
  Coarticulation: adjacent sounds affect each other
  Homophones: to, too, two.
  Use vector of features from audio signal to represent the speech
  argmax P(word | sound) = argmax P(sound | word) P(word)
  for some time period
  P(sound | word) is the acoustic model -- the sounds of words
  P(word) is the language model (for each utterance)
  Markov assumption: the current state Word_t depends on a fixed number of previous states.
  
  Acoustic Model
  sound waves --- A-to-D converter --- sampling rate
  quantization factor: precision of each measurement (8-12 bits)
  phones: different speech sounds (about 100)
  phoneme: smallest unit of sound with a distinct meaning for a language (e.g., pill vs. kill)
  kit vs. skill --- the K is two different phones but one phoneme
  frames: overlapping time slices through signal (e.g., 10 ms)
  vector of discrete features for each frame (e.g., energy at different frequencies)
  
  phone model
  transition probabilities between parts of a phone
  forms a hidden Markov model
  parts have expected features
  parts are onset, middle, end
  e.g., could take 5-10 frames as input and recognize phone [m]
  
  pronunciation model
  transition probabilities between phones
  e.g., [ t ow m aa t ow ]
  can augment to show dialect variation and coarticulation
  [t] [ow] vs. [t] [ah] at the start of "tomato"
  
  Language Model
  based on corpus of task-specific text
  use transcripts of spoken interactions (e.g., airline reservations)
  include all task-specific vocabulary
  have voice interface ask specific questions to constrain user input
  Building a Speech Recognizer
  Components:
  
  high quality microphone
  low background noise
  signal processing algorithms
  features used
  phone models
  word pronunciation models
  language model
  
  phone models & word pronunciation models often hand developed
  probabilities come from speech corpus
  models can now be learned automatically
  performance error less than 1% for limited topics
  up to 10-20% error in larger vocabularies
  task specific interaction lowers error
Lecture 25: Perception
- Intro
  Perception: interpreting response of sensors
  vision, hearing, touch -- plus radio, GPS, infrared, etc
  sensor model: P(E | S), the probability of the evidence E that the sensors provide given the state S of the environment
  object model: describes objects in the world (e.g., 3D geometry)
  rendering model: how stimulus is produced from the world (e.g., lighting)
  lots of ambiguity in vision: some managed by using prior knowledge
  video camera may deliver 10 GB per minute
  i.e., what to use, what to ignore?
  
  feature extraction: simple computations applied to sensor observations
  recognition: making key distinctions between objects, perhaps labelling them
  reconstruction: build geometric model of world from image(s)
- Image formation
  imaging distorts the appearance of objects (e.g., perspective, foreshortening) *1*
  scene → sensor → 2D image
  pixels: smallest units of image
  image formed at the image plane (e.g., via pin-hole camera) *2*
  f is distance from pinhole to image plane
  (x,y) is point on image plane
  (X,Y,Z) is location in scene
  x = -fX/Z, y = -fY/Z
  image is inverted up-down & left-right
  larger Z, smaller x & y
  parallel lines converge in the image at vanishing point
  note the importance of Z: if you know the rest, you can calculate Z!
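
A minimal sketch of pinhole projection using x = -fX/Z, y = -fY/Z. The function name and the sample points are illustrative assumptions.

```python
def project(point, f=1.0):
    """Project scene point (X, Y, Z) through a pinhole onto the image plane."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("point must be in front of the pinhole (Z > 0)")
    return (-f * X / Z, -f * Y / Z)

near = project((2.0, 4.0, 2.0))   # same X, Y ...
far = project((2.0, 4.0, 4.0))    # ... but twice as far away

# points along a line parallel to the Z axis converge toward the vanishing point (0, 0)
trail = [project((1.0, 1.0, z)) for z in (1.0, 10.0, 100.0)]
```

The negative signs give the up-down and left-right inversion, and doubling Z halves both image coordinates.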
  Lens Systems
  lens gathers more light *3*
  have limited depth of field
  i.e., can 'focus' light from a limited range of Z values
  outside that range will give unsharp image
  
  Scaled orthographic projection
  if points on object have very limited Z variation then scaling factor f/Z (in -fX/Z) is effectively a constant s
  i.e., x = sX, y = sY
  
  Light and Shading
  brightness of image depends on brightness of patch of surface that projects to the pixel.
  main causes of varying brightness:
  --- overall intensity of light
  --- reflecting more or less of the light
  --- shading due to not facing the light as much
  diffuse reflection: light evenly scattered
  i.e., brightness doesn't depend on viewing direction
  specular reflection: brightness depends on viewing direction
  specularities: small patches where there's specular reflection *4*
  default assumption is distant point light source
  amount of light at surface patch depends on angle between the normal to the patch and the illumination direction. *5*
  diffuse surface patch reflects some fraction of light
  --- diffuse albedo (e.g., white paper has 0.90)
  Lambert's cosine law for brightness of diffuse patch
  I = ρI₀cosθ
  where ρ is diffuse albedo,
  I₀ is intensity of light source,
  θ is angle between light source direction and surface normal.
  note that lighting provides surface information (due to θ)
  surface with no light is in shadow
  interreflections: prevent shadows from being completely black
  ambient illumination: from interreflections
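
A minimal sketch of Lambert's cosine law, I = ρI₀cosθ, with cosθ computed from the surface normal and the direction to the light. Clamping negative values to 0 (a patch facing away from the light) is an assumption added here; interreflections and ambient light are ignored.

```python
import math

def lambert_intensity(albedo, source_intensity, normal, light_dir):
    """Brightness of a diffuse patch: albedo * I0 * cos(theta), clamped at 0."""
    dot = sum(n * l for n, l in zip(normal, light_dir))
    norms = math.sqrt(sum(n * n for n in normal)) * math.sqrt(sum(l * l for l in light_dir))
    cos_theta = dot / norms
    return albedo * source_intensity * max(0.0, cos_theta)

# white paper (albedo 0.90) under a unit-intensity distant point source
facing = lambert_intensity(0.90, 1.0, (0, 0, 1), (0, 0, 1))                    # theta = 0
tilted = lambert_intensity(0.90, 1.0, (0, 0, 1), (math.sqrt(3) / 2, 0, 0.5))   # theta = 60 degrees
away = lambert_intensity(0.90, 1.0, (0, 0, 1), (0, 0, -1))                     # facing away
```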
  
  Color
  (or, using my trigram system, Colour)
  energy at different wavelengths (spectral energy density)
  humans see red, green, blue (dogs)
  principle of trichromacy: by mixing three colors humans can be fooled into seeing the original color (e.g., TV)
  model light source with different R/G/B intensities
  model surfaces with different albedos for R/G/B
- Early image-processing operations
  early: reducing the amount of data, starting interpretation into compact representation
  early: usually local operation (rely on small part of the image)
  early: often in parallel
  
  edge detection
  straight lines or curves in image
  significant change in brightness
  different kinds of edges (types detected later) *6*
  
  depth discontinuities (object to background)
  surface orientation discontinuities (edge of object)
  reflectance discontinuities (change of surface material)
  illumination discontinuities (shadows)
  
  in 1D brightness is I(x)
  edge is sharp change in brightness *7*
  detect change by a large magnitude of the derivative I'(x)
  noise may give this, so smooth/blur first --- (I * Blur)'
  Blur = Gaussian filter G_σ
  (I * Blur)' = (I * G_σ)' = I * G_σ'
  convolution of I and G_σ'
  σ is the standard deviation -- small blurs less
  corresponds to replacing each pixel by avg values of those around
  --- giving closer ones more weight and further away less weight.
  think of it as a small operator that scans across the image
  peaks (max of large gradient) in processed image correspond to edges *8*
  similar in 2D --- also interested in edge orientation θ(x,y)
  link edge points that are related by orientation
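
A minimal sketch of 1D edge detection by convolving I with the derivative of a Gaussian, I * G_σ', and picking peaks of the response. The kernel radius, the border handling (repeating the edge pixel), the threshold, and the test signal are illustrative assumptions.

```python
import math

def gaussian_derivative_kernel(sigma):
    """Samples of G_sigma'(t) = -t / sigma^2 * G_sigma(t) for t = -r..r."""
    radius = int(3 * sigma) + 1
    kernel = []
    for t in range(-radius, radius + 1):
        gauss = math.exp(-t * t / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
        kernel.append(-t / (sigma * sigma) * gauss)
    return kernel

def convolve(signal, kernel):
    """Discrete convolution: out(x) = sum_t signal(x - t) * kernel(t)."""
    radius = len(kernel) // 2
    out = []
    for x in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            idx = x - (k - radius)
            idx = min(max(idx, 0), len(signal) - 1)   # repeat border pixels
            acc += signal[idx] * weight
        out.append(acc)
    return out

def detect_edges(signal, sigma=1.0, threshold=1.0):
    """Positions where |I * G_sigma'| is a local peak above the threshold."""
    r = [abs(v) for v in convolve(signal, gaussian_derivative_kernel(sigma))]
    return [x for x in range(1, len(r) - 1)
            if r[x] > threshold and r[x] >= r[x - 1] and r[x] > r[x + 1]]

# dark region, one transition pixel, bright region: the edge is centered at index 10
brightness = [10] * 10 + [30] + [50] * 10
edges = detect_edges(brightness)
```

A larger σ smooths more noise away but also spreads the response to each edge over more pixels.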
  texture analysis
  spatially repeating pattern on surface that can be detected visually
  e.g., grass, pebbles
  use multi-pixel patch -- characterize patch by histogram of pixel (edge) orientations
  histogram changes in an image area suggest change in object
  orientations largely illumination invariant
  optical flow
  direction and speed of motion of object in the image *10*
  object or camera moving between frames of video
  rate of flow can indicate distance, and show actions
  need corresponding point between two images (2 frames)
  select image patch at (x₀, y₀) at time t₀
  compare patch with places around that point in second image at time t₀+D_t
  at (x₀+D_x, y₀+D_y)
  minimize the measure of Sum of Squared Differences
  i.e., find best (D_x, D_y)
  optical flow at (x₀, y₀) is (v_x, v_y) = (D_x/D_t, D_y/D_t)
  there needs to be some texture for this to work
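
A minimal sketch of estimating optical flow at one point by block matching with the Sum of Squared Differences. The patch size, the search range, and the synthetic frames (a pattern with all pixel values distinct, shifted by 1 pixel in x and 2 pixels in y) are illustrative assumptions.

```python
def ssd(frame_a, frame_b, x0, y0, dx, dy, half=1):
    """SSD between the patch at (x0, y0) in frame_a and at (x0+dx, y0+dy) in frame_b."""
    total = 0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            diff = frame_a[y0 + j][x0 + i] - frame_b[y0 + dy + j][x0 + dx + i]
            total += diff * diff
    return total

def optical_flow_at(frame_a, frame_b, x0, y0, dt=1.0, search=2, half=1):
    """Find the (D_x, D_y) minimizing SSD; return (v_x, v_y) = (D_x/D_t, D_y/D_t)."""
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            score = ssd(frame_a, frame_b, x0, y0, dx, dy, half)
            if best is None or score < best[0]:
                best = (score, dx, dy)
    _, dx, dy = best
    return (dx / dt, dy / dt)

# synthetic 8x8 textured frame: every pixel value is distinct
SIZE = 8
frame1 = [[(x + SIZE * y) * 37 % 64 for x in range(SIZE)] for y in range(SIZE)]
# second frame: the same pattern moved 1 pixel right and 2 pixels down (0 where unknown)
frame2 = [[frame1[y - 2][x - 1] if y >= 2 and x >= 1 else 0 for x in range(SIZE)]
          for y in range(SIZE)]

flow = optical_flow_at(frame1, frame2, 3, 3)
```

On a uniform region every candidate displacement would give the same SSD, which is why some texture is needed.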
  Segmentation of Images
  break image into regions of similar pixels *11*
  regions often indicate edges of objects
  can either detect region boundaries, or regions themselves
  detect region boundaries: train classifier based on brightness, color and texture
  estimates P_b(x,y,θ) --- probability of boundary b at x,y at angle θ
  however, may not form closed curves
  Alternative approach: cluster pixels based on brightness, color and texture
  maximize similarity of pixels in cluster, and maximize difference between clusters
- Object recognition by appearance
  appearance: what object looks like
  simple/consistent objects: just test for distinctive features in the image
  e.g., works quite well for faces
  slide round window over image, compute features, use classifier, find faces!
  overlapping windows might be combined to report single face
  train classifier with marked-up face images *12*
  
  Complex appearance and pattern elements
  several effects move features around in an image: *13*
  
  foreshortening: viewing slanted surface
  aspect: object at different rotation angles
  occlusion: parts hidden by other parts or objects
  deformation: objects with moving parts/regions
  
  try looking across image for object parts (also vary scale)
  if related parts are close together then object detected
  i.e., look for image features together in approx the right place
  heuristic --- use spatial information (e.g., car wheels at bottom)
  Pedestrian detection with Histogram of Gradient features
  use histograms of local orientations in an image *14*
  break image into cells -- make orientation histogram for each cell
  emphasise important gradients by weights that show how significant they are relative to others in the same cell
  gives Histogram of Gradient feature
  train classifier with existing training sets
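The per-cell step can be sketched as follows. This is a simplified illustration, not a full Histogram of Gradient pipeline: the gradient is a central difference, the number of bins is arbitrary, and dividing by the cell total is an assumed way of weighting each gradient relative to the others in the same cell.

```python
import math

def cell_orientation_histogram(cell, bins=4):
    """Histogram of gradient orientations for one cell, weighted by gradient
    magnitude and normalized by the total magnitude in the cell."""
    hist = [0.0] * bins
    height, width = len(cell), len(cell[0])
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = cell[y][x + 1] - cell[y][x - 1]
            gy = cell[y + 1][x] - cell[y - 1][x]
            mag = math.hypot(gx, gy)
            if mag == 0:
                continue                           # flat pixel, no orientation
            angle = math.atan2(gy, gx) % math.pi   # unsigned orientation in [0, pi)
            b = int(angle / math.pi * bins) % bins
            hist[b] += mag
    total = sum(hist)
    return [v / total for v in hist] if total else hist

# Brightness changes only along x, so all gradient weight lands in bin 0.
vertical_edge = [[0, 0, 9, 9]] * 4
# Brightness changes only along y, so all gradient weight lands in bin 2.
horizontal_edge = [[0] * 4, [0] * 4, [9] * 4, [9] * 4]
hist_v = cell_orientation_histogram(vertical_edge)
hist_h = cell_orientation_histogram(horizontal_edge)
```

Concatenating the histograms of all cells in a detection window gives the feature vector that would be passed to the classifier.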
- Reconstructing the 3D world
  recover 3D model from image
  i.e., can we do P(Scene|Image) = P(Image|Scene)P(Scene)/P(Image) ?
  Motion parallax
  camera moves relative to 3D scene *16*
  apparent motion in image tells us about camera movement and depth info in scene
  viewer translational velocity T
  Z(x,y) is z-coordinate of point in scene corresponding to image point (x,y)
  optical flow
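  v_x(x,y) = (-T_x + xT_z)/Z(x,y)
  v_y(x,y) = (-T_y + yT_z)/Z(x,y)
  (reduces to xT_z/Z(x,y) and yT_z/Z(x,y) when the camera moves only along its optical axis)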
  can detect relative depths from optical flow
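A small worked example of why only relative depth is recovered. It assumes the camera translates only along its optical axis, so that v_x = xT_z/Z; the function name and the sample values are made up for illustration.

```python
def depth_over_speed(x, v_x):
    """Return Z / T_z for an image point at horizontal offset x with horizontal
    flow v_x, assuming v_x = x * T_z / Z (translation along the optical axis only)."""
    return x / v_x

# Two image points at the same offset x; the second one flows half as fast.
near = depth_over_speed(x=0.2, v_x=0.1)
far = depth_over_speed(x=0.2, v_x=0.05)
depth_ratio = far / near   # the unknown T_z cancels in the ratio
```

Each value is only known up to the unknown speed T_z, but the ratio of two of them does not depend on it: slower flow at the same image position means a more distant point.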
  Binocular stereopsis
  two images separated in space *17*
  disparity: difference in location in two images of same features
  need to solve the correspondence problem
  displacement of eyes (cameras) by amount b along x-axis (approx 6cm)
  horizontal disparity (in image) H = b/Z
  measure disparity, know b, obtain Z the depth of some point on object
  humans fixate: look at a certain depth
  small variations in depth correspond to small angles at the eye
  smallest detectable angle is about 5 seconds of arc
  (a minute of arc is 1/60th of a degree)
  (a second of arc is 1/60th of an arcminute)
  e.g., at 30cm we can detect 0.036mm!
  generalize to multiple views *18*19*
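The 0.036mm figure can be checked from H = b/Z. Differentiating gives a disparity change of b·δZ/Z² for a depth change δZ, so δZ = δH·Z²/b. The sketch below assumes the 6cm baseline and the 5 arcsecond threshold quoted above, and treats disparity as an angle in radians; the function names are illustrative.

```python
import math

def depth_from_disparity(baseline, disparity):
    """Invert H = b / Z to get the depth Z."""
    return baseline / disparity

def smallest_detectable_depth_change(depth, baseline, min_angle_arcsec=5.0):
    """delta_Z = delta_H * Z^2 / b, with delta_H the smallest detectable angle."""
    delta_h = math.radians(min_angle_arcsec / 3600.0)   # arcseconds to radians
    return delta_h * depth ** 2 / baseline

# Fixating at 30 cm with eyes 6 cm apart (all lengths in mm).
delta_z_mm = smallest_detectable_depth_change(depth=300.0, baseline=60.0)
# delta_z_mm is roughly 0.036
```

Because δZ grows with Z², the same angular threshold corresponds to much coarser depth resolution at larger distances.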
  Shading
  variation in intensity of light from different portions of a surface in the scene
  due to geometry and reflectance properties
  very hard to recover these from the image
  there are many interreflections
  Contour
  we can extract distance and 3D properties from outlines *21*
  figure-ground problem: which is foreground, which is background?
  big clue is T-junctions
  assume "ground plane"
  i.e., nearer objects project to points lower in image
  Objects and geometric structure of scenes
  can use horizon detector: objects whose images are closer to the horizon are further away *22*
  also, pedestrians are approx same height so image size reflects distance
  
  for solid object with distinct feature points m_i
  pose detection, for use for industrial robots manipulating parts
  assume rotation and translation of object, and projection to image
  image point p_i = Q(m_i)
  Q is the same for all image points
  if three object features can be found in the image then equations can be solved (e.g., using edges and vertex detection)
  i.e., all m_i of object can be predicted
  and object position and "pose" is known allowing manipulation
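To make p_i = Q(m_i) concrete, here is one possible form of Q. It is only a sketch under strong simplifying assumptions: rotation about the z-axis only, a pinhole camera with focal length `f`, and a hypothetical helper `make_Q`. Solving for the unknown rotation and translation from three matched features is not shown.

```python
import math

def make_Q(theta, t, f=1.0):
    """Build Q: rotate a model point by theta about the z-axis, translate by t,
    then project onto the image plane of a pinhole camera with focal length f."""
    c, s = math.cos(theta), math.sin(theta)

    def Q(m):
        X = c * m[0] - s * m[1] + t[0]
        Y = s * m[0] + c * m[1] + t[1]
        Z = m[2] + t[2]
        return (f * X / Z, f * Y / Z)

    return Q

# The same Q maps every model feature point m_i to its image point p_i.
model_points = [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
Q = make_Q(theta=0.0, t=(0.0, 0.0, 2.0))
image_points = [Q(m) for m in model_points]
```

Once the parameters of Q are known, applying it to the remaining model points predicts where they should appear in the image.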
- Object recognition from structural information
  use knowledge of object being seen
  e.g., simple model of human body
  deformable template: moveable image blocks with relationships
  e.g., leg image relative to body image *23*
  
  model geometry of body with eleven rectangular segments with connections and constraints
  "cardboard people": model forms a tree rooted at torso
  segments move independently of segment to which they're connected
  e.g., lower arm relative to upper arm
  image rectangle should resemble the model segment
  relationship between image rectangles should match expected relationships between associated model segments
  find best match
  can use size of rectangle/image to help
  color can help matching
  Appearance model: model of segments reflecting most likely position of person in the world, based on the image *24*
  
  Coherent appearance
  tracking people in video *25*
  look for torso in lots of frames
  build up a reliable appearance model that explains many frames
- Using Vision
  many applications!
  e.g., surveillance, sports, HCI, games, ...
  in simple cases with large fixed backgrounds can subtract background from complete image leaving image of interest
  can train classifier on optical flow to recognize standard actions
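Background subtraction for a fixed camera can be sketched as a per-pixel comparison. The `threshold` value is an arbitrary assumption, and both images are plain lists of brightness rows.

```python
def subtract_background(frame, background, threshold=20):
    """Return a mask with 1 where the frame differs from the stored background
    by more than the threshold, and 0 elsewhere."""
    return [[1 if abs(p - b) > threshold else 0 for p, b in zip(frame_row, bg_row)]
            for frame_row, bg_row in zip(frame, background)]

background = [[100, 100, 100],
              [100, 100, 100]]
frame = [[101, 180, 99],
         [100, 175, 102]]
mask = subtract_background(frame, background)
```

The pixels marked 1 are the "image of interest"; small brightness changes below the threshold are treated as noise.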
  
  Image retrieval
  find relevant images from a database
  can be done via IR techniques (e.g., images have keywords)
  can learn keywords for image by using tagged training images and nearest-neighbors methods (test image similar to training image?)
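A minimal nearest-neighbor sketch of keyword transfer. The feature vectors, the squared Euclidean distance, and the tiny training set are all invented for illustration; real systems would compute features from the images.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_neighbor_keywords(test_features, training_set):
    """Return the keywords of the tagged training image whose feature vector
    is closest to the test image's feature vector."""
    _, keywords = min(training_set,
                      key=lambda item: squared_distance(test_features, item[0]))
    return keywords

# (feature vector, keywords) pairs for tagged training images.
training_set = [
    ((0.9, 0.1, 0.2), ["sunset", "beach"]),
    ((0.1, 0.8, 0.3), ["forest", "trail"]),
    ((0.2, 0.2, 0.9), ["ocean", "boat"]),
]
suggested = nearest_neighbor_keywords((0.15, 0.75, 0.35), training_set)
```

Once an image has keywords, it can be indexed and retrieved with ordinary IR techniques.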
  
  Reconstruction from many views
  assume a familiar 3D object, then we have an object model
  determine correspondences between image points and object points
  use correspondences to determine parameters of camera (and lens)
  test this by projecting other model points through camera to image
  determine whether there are matching image points nearby
  can confirm model
  applications include...
  
  Model-building: use video or collection of pictures to extract detailed 3D model of object *26*
  Matching moves: to put computer graphics characters in real video, determine actual camera moves so that graphics characters can be rendered correctly.
  Path reconstruction: robots can reconstruct object that they have seen, and use camera information to construct record of path
  Using vision for controlling movement
  navigation -- e.g., autonomous vehicles
  Lateral control: stay in lane
  Longitudinal control: stay away from vehicle ahead
  Obstacle avoidance: avoid other cars, and pedestrians
  adjust steering, acceleration and braking
  need position & orientation relative to lane
  use edge detection to find lane markers
  augment with map knowledge: vision is confirmation
  but obstacles aren't (usually) on the map
  use binocular stereopsis for car ahead distance
  augment with laser rangefinders to build probability maps of surroundings
  use landmarks to reset absolute position information
  for driving you don't need ALL the information from an image
  DARPA Urban Challenge
Lecture 26: Watson
Lecture 27: AI at WPI
Lecture 28: AI at WPI
Markov Decision Processes Quick Overview
- agent must choose an action from ACTIONS(s) in each state s (at each time step)
- begins at start state in a fully observable environment
- sequential decision problem: find a (good) sequence of actions to terminal state
- terminal states have rewards (may be +ve or -ve)
- actions are unreliable (stochastic)
  --- some probability that movement will not be in direction chosen
  --- e.g., 0.8 in intended direction, 0.1 in two others.
- transition model: the outcome of each action at each state
- transition probabilities (to s' from s due to a) are known --- P(s' | s,a)
- transitions are Markovian: probabilities do not depend on earlier states, just s.
- utility function for agent depends on sequence of states (environment history)
- in each state agent gets a reward R(s)
  --- may be +ve or -ve
  --- negative rewards encourage agent not to be there!
- simple utility is sum of the rewards received
  --- including at a terminal state, where a larger reward may occur (perhaps -ve)
  --- U([s₀, s₁, ...]) = R(s₀) + R(s₁) + ...
- discounted rewards (using "discount factor" γ)
  --- U([s₀, s₁, ...]) = R(s₀) + γR(s₁) + γ²R(s₂) + ...
- γ between 0 and 1,
  --- expresses preference for known current rewards over less well known future rewards.
- Markov Decision Process: states, actions, rewards, Markovian transitions.
- Policy π(s) : Solution to MDP: what action to take in any state
- each time policy is executed from s₀ it may lead to a different sequence of states (stochastic)
- quality of policy is "expected utility" of environment histories generated by policy.
- Optimal Policy π*(s): one that yields the highest expected utility
- if agent knows current state s it can then execute action π*(s) (Reflex Agent)
- changing R(s) values affects π*(s)
- maximize expected utility
  π*(s) = argmax_a Σ_s' P(s' | s,a)U(s')
  i.e., agent can choose action that maximizes expected utility of next state
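The two formulas above, discounted utility and the one-step choice of action, can be sketched as follows. The state names, the transition table `P`, and the utility table `U` are made-up illustrations; computing U itself (for example by value iteration) is not shown.

```python
def discounted_utility(rewards, gamma=1.0):
    """U([s0, s1, ...]) = R(s0) + gamma * R(s1) + gamma^2 * R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

def best_action(s, actions, P, U):
    """pi*(s) = argmax over a of the sum over s' of P(s' | s, a) * U(s').

    P maps (s, a) to a dict {s': probability}; U maps each state to its utility."""
    def expected_utility(a):
        return sum(p * U[s2] for s2, p in P[(s, a)].items())
    return max(actions, key=expected_utility)

# Hypothetical two-action example with unreliable (stochastic) actions.
P = {
    ("start", "left"): {"good": 0.8, "bad": 0.2},
    ("start", "right"): {"good": 0.1, "bad": 0.9},
}
U = {"good": 1.0, "bad": -1.0}
choice = best_action("start", ["left", "right"], P, U)

# Two steps costing -0.04 each, followed by a terminal reward of +1.
total = discounted_utility([-0.04, -0.04, 1.0], gamma=0.5)
```

With γ = 1 the discounted sum reduces to the simple additive utility; smaller γ makes the terminal reward count for less the later it arrives.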
  
http://web.cs.wpi.edu/~dcb/courses/CS4341/2013/contents.html
