Computer Science Department

CS4341: Artificial Intelligence

Version: Wed Apr 24 19:52:44 EDT 2013

Course Contents

Lecture 1: Introduction (1)

  • Course Information
    -- email, web, book, intro page, projects, weekly exams
    -- myWPI, web-turnin
    -- my preparation
    -- sources for slides

  • What is AI? Definitions:
    -- AI is the study of ...
    • computations that make it possible to perceive, reason, and act.
    • how to make computers do things which, at the moment, people do better.
    • the design of intelligent agents.
    • how to make computers act like those in the movies!

  • Four goals: thinking/acting, humanly/rationally
  • Rational: does the right thing given what it knows

  • Thinking Humanly (reasoning)
    • Cognitive modeling
    • Implement model of reasoning
    • Does it reason like a human?
  • Acting Humanly (behavior)
    • Turing Test
        -- avoids definition of intelligence
          * How would you define it ???
        -- includes language, learning, knowledge, reasoning
        -- system intelligent if passes test
        -- person or machine ? (Eliza)
    • does not include perception
  • Thinking Rationally (reasoning)
    • Laws of thought ("logic")
    • works in practice?
  • Acting Rationally (behavior) [this book]
    • The Rational Agent approach
    • tries to find best outcome, or best 'expected' outcome.
    • actions should achieve one's goals

  • Engineering goal -- solve real-world problems
  • Scientific goal -- explain various sorts of intelligence
  • How AI has changed
    -- focus on systems that act rationally
    -- this is the book's focus
    -- there are areas that this book doesn't include (e.g., design, creativity)

  • Foundations of AI
    • Philosophy
    • Mathematics
    • Economics
    • Neuroscience
    • Psychology
    • Computer Engineering
    • Control theory and cybernetics
    • Linguistics

  • The Near-Term Applications
    -- e.g., routine design
    -- e.g., detect credit card fraud
  • The Long-Term Applications
    -- what is still left to do...????
    -- chess? Deep Blue
    -- space? Remote Agent and Deep Space 1
      "Remote Agent (RA) is a model-based, reusable, artificial intelligence (AI) software system that enables goal-based spacecraft commanding and robust fault recovery. RA was flight validated during an experiment onboard Deep Space 1 (DS1) between May 17 and May 21, 1999."
    -- autonomous vehicles?

  • What Intelligent Systems Can Do
    -- diagnosis, design, planning, scheduling, navigation, vision, tutoring, learning, ...

  • AI Sheds New Light on Traditional Questions
    -- computers provide new concepts & language
    -- computers require precision (e.g., what is "creativity"?)
    -- explore impact of technique or knowledge (add/remove)
    -- theories --> computational models --> implementations --> results --> refinements
    -- use of computers allows testing
    -- well tested methods used as tools

  • AI Helps Us to Become More Intelligent
    -- suggests new/better ways to tackle problems

  • AI Is Becoming Less Conspicuous, yet More Essential
    -- Airport gate allocation
    -- many embedded applications (cars, washing machines, ...)

  • Criteria for Success
    -- clear definition of task and implementable procedure for it
    -- regularities or constraints available
    -- other knowledge
    -- solves real problem
    -- provides new theory/method
    -- suggests new opportunities

Lecture 2: Intelligent Agents (2-2.3)

  • Agents & Environments
    • agent, sensors, actuators, environment
    • percept, percept sequence
    • action, action sequence
    • agent program implements agent function (percepts --> actions)

  • Rationality
      -- agent actions change the state of the environment
      -- Performance measure evaluates sequence of environment states
      -- Rational agent
        For each possible percept sequence, a rational agent should select an action that is expected to maximize its performance measure, given the evidence provided by the percept sequence and whatever built-in knowledge the agent has.
      -- What is rational depends on
      • performance measure,
      • prior knowledge,
      • performable actions,
      • percept sequence.
      -- maximize expected performance
      -- information gathering changes future percepts (helps maximize expected performance)
      -- exploration (investigate unknown environment)
      -- agent autonomy: doesn't rely only on agent designer's knowledge

  • The Nature of Environments
      -- Task environment for agent
      -- PEAS = Performance measure, Environment, Actuators, Sensors
      -- Properties of task environments
      • Fully observable vs partially observable
        -- sensors detect all relevant aspects of environment
      • Single agent vs multiagent
        -- multiagent: competitive vs cooperative
      • Deterministic vs stochastic
        -- i. state of environment completely determined by current state & agent action.
        -- ii. outcomes determined by probabilities
      • Episodic vs sequential
        -- i. agent experiences atomic episodes
        -- i. next episode does not depend on previous actions
        -- ii. current action could affect all future ones.
      • Static vs dynamic
        -- dynamic if environment changes while agent is deliberating
      • Discrete vs continuous
        -- relates to percepts and actions
      • Known vs unknown
        -- refers to agent's knowledge
        -- are outcomes (or their probabilities) known for all actions?
      -- hard! = partially observable, multiagent, stochastic, sequential, dynamic, unknown

Lecture 3: Intelligent Agents (2.4, 3.1)

  • Structure of Agents
    • Agent = architecture + program
    • Table-driven program: table indexed by percept sequences
      -- full table not practical for real problems
      -- but note Case-Based Reasoning, tables in chess, and memoization (look-up tables).

    • Simple Reflex Agents
      -- next action depends on current percept only
      -- condition-action rule
      -- Rule-Match picks rule to use
      -- environment must be fully observable
      -- there must always be a matching rule (otherwise ???)
      -- the basic idea behind rule-based systems
    • Model-Based Reflex Agents
      -- internal state: keep track of best guess of state of environment
      -- model: how next state depends on current state and action
      -- in casual use, model = internal state (i.e., a model of environment)
    • Goal-Based Agents
      -- goals: desirable situations (result is achieved/happy or not)
      -- needs to have: what will happen if I do this...?
      -- can check relevant actions wrt achieving goal
    • Utility-Based Agents
      -- combine with model
      -- utility: quality of being useful (degrees of happy)
      -- utility function: estimates the performance measure
      -- maximize expected utility: will behave rationally
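
A minimal sketch of a simple reflex agent, assuming a two-square vacuum world with squares "A" and "B". The function name and the percept encoding (location, status) are illustrative choices, not taken from the course notes. It shows the condition-action idea: the action depends on the current percept only.

```python
def reflex_vacuum_agent(percept):
    """Map the current percept (location, status) to an action."""
    location, status = percept
    if status == "Dirty":                          # condition-action rule 1
        return "Suck"
    return "Right" if location == "A" else "Left"  # rules 2 and 3


print(reflex_vacuum_agent(("A", "Dirty")))   # Suck
print(reflex_vacuum_agent(("A", "Clean")))   # Right
```

Because no percept history is kept, this agent cannot tell when both squares are clean, which is one reason the model-based variant keeps internal state.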

    • Learning Agents
      • agents can learn to become more competent
      • learning element: makes improvements
      • performance element: selects actions
      • critic: determines (using fixed performance standard)
        whether/how performance element should be modified
        -- i.e., it will perform differently after modification
      • problem generator: suggests actions that lead to new experiences

    • Representations of the environment
      -- atomic: no internal structure
      -- factored: vector of attribute values (features)
      -- structured: objects with attributes and relationships
      -- consequences ???

  • Problem-solving Agents
    • Goal formulation: adopt goal: first step in problem-solving
    • Problem formulation: decide what actions and states to consider
    • with options: may need to examine future actions to determine value
    • solution to some problems is a set of actions ("path")
    • solution to other problems is a state

  • Well-defined problems & solutions
    • initial state
    • set of possible actions applicable in state s
    • transition model gives state resulting from each action
    • state space: set of reachable states from initial state
      -- state-to-state transitions form a graph
    • goal test detects goal state (the state or its properties)
      -- might be more than one goal state
    • step cost: cost of taking an action from state to state
    • path costs: cost of following a path
    • solution: path from initial state to goal
    • optimal solution: lowest cost solution

Lecture 4: Uninformed search (3.2-3.4)

  • Example problems
    • toy problems vs real-world problems
    • toy:
      • vacuum world (goal = squares clean; solution = path)
      • 8-puzzle (goal = configuration; solution = path)
      • 8-queens puzzle (goal = configuration; solution = state)
    • real-world
      • route-finding
      • touring (e.g., traveling salesperson problem)
      • VLSI layout
      • robot navigation
      • packing a cargo plane

  • Searching for solutions
    • search tree
      -- nodes = states
      -- links = actions (with costs)
      -- root node = start state
    • expand node: apply possible actions to generate new states
    • parent nodes lead to child nodes
    • leaf node: no children (yet)
    • frontier: leaf nodes ready for expansion
    • search strategy: how to select which node to expand next
      -- determined by how frontier queue built and how selection made
      -- e.g., FIFO queue, LIFO queue, priority queue
    • loops and redundant paths (graph)
    • Tree-Search vs Graph-Search
      -- for graph search recognize where you have already searched

  • Uninformed search
    • Uninformed: no additional information about states
    • Informed: uses knowledge of how "promising" a state is (wrt goal)

  • Breadth-first
    • all nodes at one level expanded before any nodes at next level
    • test for goal at generation time (save time/space)
    • huge memory requirements
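
A minimal sketch of breadth-first graph search, assuming states are hashable and a hypothetical neighbors(state) function supplies successors. The small GRAPH is made up for illustration. The goal test is applied when a node is generated, as noted above.

```python
from collections import deque


def breadth_first_search(start, goal, neighbors):
    """Return a list of states from start to goal, or None if unreachable."""
    if start == goal:
        return [start]
    frontier = deque([start])      # FIFO queue
    parent = {start: None}         # doubles as the set of states already seen
    while frontier:
        state = frontier.popleft()
        for nxt in neighbors(state):
            if nxt not in parent:
                parent[nxt] = state
                if nxt == goal:    # goal test at generation time
                    path = [nxt]
                    while parent[path[-1]] is not None:
                        path.append(parent[path[-1]])
                    return path[::-1]
                frontier.append(nxt)
    return None


# Illustrative graph (an assumption, not from the notes).
GRAPH = {"S": ["A", "B"], "A": ["C"], "B": ["C", "G"], "C": ["G"], "G": []}
print(breadth_first_search("S", "G", lambda s: GRAPH[s]))   # ['S', 'B', 'G']
```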

  • Uniform-cost
    • assumes different step costs
    • expand node with lowest current path cost: g(n)
    • use priority queue
    • alternative higher-cost paths to a node are ignored
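
A minimal sketch of uniform-cost search using Python's heapq as the priority queue, ordered by path cost g(n). The successors(state) function returning (next_state, step_cost) pairs and the example graph are assumptions made for illustration.

```python
import heapq


def uniform_cost_search(start, goal, successors):
    """Return (cost, path) for a lowest-cost path, or None."""
    frontier = [(0, start, [start])]          # entries are (g, state, path)
    best_g = {start: 0}
    while frontier:
        g, state, path = heapq.heappop(frontier)
        if state == goal:                     # goal test at expansion time
            return g, path
        if g > best_g.get(state, float("inf")):
            continue                          # a cheaper path was found already
        for nxt, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g, nxt, path + [nxt]))
    return None


# Illustrative weighted graph (an assumption, not from the notes).
WGRAPH = {"S": [("A", 1), ("B", 5)], "A": [("B", 2), ("G", 10)],
          "B": [("G", 3)], "G": []}
print(uniform_cost_search("S", "G", lambda s: WGRAPH[s]))
# (6, ['S', 'A', 'B', 'G'])
```

Here the goal test is done when a node is selected for expansion rather than at generation, since a cheaper path to the goal may still be on the frontier.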

  • Depth-first
    • expands most recently generated node
    • goes deep down a path before investigating alternatives
    • involves backing up from nodes that don't expand (aren't expanded)
    • space complexity much better than Breadth-first
    • the basic search of AI (often with modifications)

  • Depth-limited
    • depth-first with predetermined search depth limit
    • path not explored past depth limit
    • need to pick good value for limit (based on problem)

  • Iterative deepening depth-first
    • depth-first with varying depth limit
    • start with depth at 0 and increase it
    • some redundancy but not significant
    • adds a touch of Breadth-first, as at each level, whole tree may be searched
    • preferred uninformed search
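
A minimal sketch of iterative deepening built on depth-limited depth-first search. The neighbors(state) function, the max_depth cap, and the example graph are assumptions for illustration.

```python
def depth_limited(state, goal, neighbors, limit, path):
    """Depth-first search that does not explore past the depth limit."""
    if state == goal:
        return path
    if limit == 0:
        return None
    for nxt in neighbors(state):
        if nxt not in path:                   # avoid loops along the current path
            found = depth_limited(nxt, goal, neighbors, limit - 1, path + [nxt])
            if found is not None:
                return found
    return None


def iterative_deepening(start, goal, neighbors, max_depth=20):
    """Run depth-limited search with limits 0, 1, 2, ... up to max_depth."""
    for limit in range(max_depth + 1):
        found = depth_limited(start, goal, neighbors, limit, [start])
        if found is not None:
            return found
    return None


# Illustrative graph (an assumption, not from the notes).
IDS_GRAPH = {"S": ["A", "B"], "A": ["C"], "B": ["C", "G"], "C": ["G"], "G": []}
print(iterative_deepening("S", "G", lambda s: IDS_GRAPH[s]))   # ['S', 'B', 'G']
```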

Lecture 5: Informed search (3.5-3.7)

  • Heuristic/Informed Search
    -- use problem-specific knowledge to gain efficiency
    -- can guide and prune
    -- evaluation function --- f(n)
    • cost estimate for path through n to goal
    -- actual path cost to node n --- g(n)
    -- heuristic function --- h(n)
    • estimated cost of cheapest path from n to goal
    • uses "heuristic" to estimate ("rule of thumb")

  • Greedy best-first search
    • f(n) = h(n)   --- instead of g(n)
    • sample heuristic = "as the crow flies"
      -- e.g., roads are always longer, but it's a good estimate.
    • greedy -- doesn't take current cost into account!

  • A* search
    • "A star": a kind of best-first search
    • estimated path cost through n
    • f(n) = g(n) + h(n)
    • pick lowest f(n) each time
    • complete: will always find goal if there is one
    • optimal: finds best path
    • h(n) must be admissible -- i.e., optimistic!
      -- it never overestimates actual cost to goal
    • accurate h(n) close to or equals actual cost
      -- what if h(n) = actual cost???
    • can run out of space
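
A minimal sketch of A* with f(n) = g(n) + h(n), assuming a hypothetical successors(state) function that returns (next_state, step_cost) pairs and a heuristic h(state). The example graph and heuristic values are made up, chosen so that h never overestimates the true remaining cost.

```python
import heapq


def a_star(start, goal, successors, h):
    """Return (cost, path) for the path found by A*, or None."""
    frontier = [(h(start), 0, start, [start])]   # entries are (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)   # lowest f(n) first
        if state == goal:
            return g, path
        if g > best_g[state]:
            continue                                  # stale queue entry
        for nxt, cost in successors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier,
                               (new_g + h(nxt), new_g, nxt, path + [nxt]))
    return None


# Illustrative graph and admissible heuristic (assumptions, not from the notes).
A_GRAPH = {"S": [("A", 1), ("B", 4)], "A": [("B", 2), ("G", 6)],
           "B": [("G", 2)], "G": []}
H = {"S": 4, "A": 3, "B": 2, "G": 0}
print(a_star("S", "G", lambda s: A_GRAPH[s], lambda s: H[s]))
# (5, ['S', 'A', 'B', 'G'])
```

The best_g dictionary grows with the number of states reached, which is the space problem noted above.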

  • Memory-bounded heuristic search
    • Iterative-deepening A* (IDA*)
      -- use f values for cutoff, instead of d
    • Recursive best-first
      -- it prunes search if another branch becomes better
      -- but remembers best cost of pruned subtree
    • Simplified Memory-bounded A* (SMA*)
      -- uses A* until memory full
      -- expands newest best leaf, deletes oldest worst leaf.
    • SMA* robust choice for searching

  • Heuristic functions
    • good heuristics lower effective branching factor
      -- i.e., branching that actually occurs in a search
    • ebf close to 1 indicates few unnecessary branches
    • heuristic functions with close-to-correct values are best
    • use relaxed problems (fewer restrictions) to generate heuristics
    • cost of optimal soln. to relaxed problem is admissible heuristic for original problem
      -- (e.g., Manhattan distance for 8 puzzle)
    • Pattern databases: store exact costs for subproblems
      -- gives heuristic value for cost of full problem
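
A minimal sketch of the Manhattan-distance heuristic for the 8-puzzle. The state encoding (a tuple of 9 integers in row-major order, with 0 for the blank) is an assumption made for illustration. It corresponds to the relaxed problem in which a tile may move to any adjacent square, occupied or not.

```python
def manhattan(state, goal):
    """Sum over tiles 1..8 of horizontal plus vertical distance to the goal square."""
    total = 0
    for tile in range(1, 9):                      # the blank (0) is not counted
        row, col = divmod(state.index(tile), 3)
        goal_row, goal_col = divmod(goal.index(tile), 3)
        total += abs(row - goal_row) + abs(col - goal_col)
    return total


GOAL = (1, 2, 3, 4, 5, 6, 7, 8, 0)
print(manhattan((1, 2, 3, 4, 5, 6, 7, 0, 8), GOAL))   # 1
```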

Lecture 6: Local Search

  • Local search & optimization problems
    • local search usually looking for a solution state, not a path
    • usually looks around a state (or states) by modifying it (them)
    • optimization: find best state, measured by an objective function
    • state space "landscape"
      -- surface formed by function's value across all states
    • global maximum (optimum) vs. local maximum
    • could be looking for minimum (gradient descent)

  • Hill-climbing
    • looking for maximum
    • search moves in direction of most improvement at each move
      -- steepest ascent (it's greedy)
      -- just records current state
    • problems: local maxima; ridges; plateaux
    • getting unstuck: stochastic (add some randomness at each move)
    • random restart hill-climbing: a set of random start states
  • Simulated-annealing
    • annealing = heating then gradually cooling
    • minimize cost (descent)
    • disturb search out of local minima
    • gradually disturb ("shake") less over time
    • makes a random move: accepts it with some probability
    • probability decreases if move makes things worse (a shake)
      -- you're still trying to go down hill to global minimum
    • probability slowly decreases also depending on time
  • Local beam search
    • beam searches move in restricted areas of search space
    • k random start states
    • expand all states
    • pick k best, and continue
    • may have poor diversity (i.e., stuck in a region of the state space)
    • variants add some randomness to encourage "diversity"
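
A minimal sketch of steepest-ascent hill-climbing, assuming hypothetical value(state) and neighbors(state) functions. The tiny one-dimensional landscape VALUES is made up to show how the search can stop at a local maximum.

```python
def hill_climb(start, value, neighbors):
    """Move to the best neighbor until no neighbor improves on the current state."""
    current = start                       # only the current state is recorded
    while True:
        best = max(neighbors(current), key=value)
        if value(best) <= value(current):
            return current                # local (possibly global) maximum
        current = best


# Illustrative landscape: states are indices, objective is VALUES[i].
VALUES = [1, 3, 2, 5, 9, 4]


def index_neighbors(i):
    return [j for j in (i - 1, i + 1) if 0 <= j < len(VALUES)]


print(hill_climb(0, lambda i: VALUES[i], index_neighbors))   # 1 (local maximum)
print(hill_climb(3, lambda i: VALUES[i], index_neighbors))   # 4 (global maximum)
```

Starting from index 0 the search stops at index 1 even though index 4 is better, which is the motivation for random restarts, simulated annealing, and beam variants.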

  • Local search in Continuous spaces
    • continuous actions/states lead to infinite branching factors!
    • easiest solution -- make discrete changes
      -- e.g., consider new states only by making discrete (delta) changes
    • can also compute local gradients for hill-climbing

Lecture 7: Genetic Algorithms

  • Genetic Algorithms (text's overview)
    • analogy to natural selection
      -- survival of the fittest
    • works on a series of populations of individuals (states)
      -- each population producing the next
    • initial population of k random states (k often 100+)
    • each state is rated by an objective/fitness function
      -- higher value, fitter individuals
    • individuals represent descriptions of states (using features)
      -- often as a binary string
    • fitter individuals replicated
      -- fitter get better chance of taking part in production of next population
      -- more fit, more copies
    • randomly select pairs for mating (crossover)
    • for each pair, randomly select crossover point.
    • crossover produces new pairs (for next population).
    • a small number of individuals are mutated (very small random change)
    • stop after some number of generations,
      when very fit individual appears,
      or if best (or avg) fitness is stable.
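
A minimal sketch of the loop described above, using binary strings as individuals. All parameter names and defaults (length, k, generations, p_mutate, seed) are assumptions for illustration, and the stopping rule is simply a fixed number of generations. The fitness function is assumed to return non-negative numbers.

```python
import random


def genetic_algorithm(fitness, length=12, k=30, generations=40,
                      p_mutate=0.05, seed=0):
    """Evolve binary strings; return the fittest individual of the last population."""
    rng = random.Random(seed)
    population = ["".join(rng.choice("01") for _ in range(length))
                  for _ in range(k)]
    for _ in range(generations):
        # Fitter individuals get a better chance of being selected as parents.
        weights = [fitness(ind) + 1 for ind in population]   # +1 avoids all-zero weights
        next_population = []
        while len(next_population) < k:
            mom, dad = rng.choices(population, weights=weights, k=2)
            cut = rng.randint(1, length - 1)                 # random crossover point
            for child in (mom[:cut] + dad[cut:], dad[:cut] + mom[cut:]):
                if rng.random() < p_mutate:                  # rare, small mutation
                    i = rng.randrange(length)
                    flipped = "1" if child[i] == "0" else "0"
                    child = child[:i] + flipped + child[i + 1:]
                next_population.append(child)
        population = next_population[:k]
    return max(population, key=fitness)


# Example objective (an assumption): count of 1 bits.
best = genetic_algorithm(lambda s: s.count("1"))
print(best, best.count("1"))
```

Many of the variations listed in these notes (elitism, legality checks, diversity-based selection) would change only a line or two of this loop.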

  • Genetic Algorithms (additional information)
    • See the notes "A Quick Introduction to Genetic Algorithms".
    • many variations of algorithm
    • all have individuals, populations, fitness, crossover, mutation
    • vary by:
      • population size
      • whether the population size varies
      • representation of individuals
        -- direct representation (e.g., LISP program)
        -- coded representation (e.g., binary string)
      • how crossover done
      • probability of mutation
      • whether some individuals copied from previous population
      • whether individuals are checked for legality after crossover/mutation
      • how fitness is calculated and used
      • whether diversity is used to select for a new population
    • See the "Diversity Selection" notes.

  • GAs and Creativity
    • Koza
    • automated circuit design
    • uses circuit description language
    • each individual in the population is a circuit description

Lecture 8: Adversarial Search

  • Games
    • multi-agent, competitive
    • deterministic, turn-taking, two-player, zero-sum, fully observable
    • zero sum: one wins & one loses; or both draw.
    • very large game trees (search spaces): need to "prune" and ignore parts of game tree
      -- (search tree < game tree)
    • chess has about 10^40 nodes in game tree (intractable)
    • terminal state: one person has won
    • looking ahead: complete search can find terminal states (correct utility)
    • utility function: e.g., win (+1), lose (-1), draw (0)
    • looking ahead: can limit depth and estimate utility
    • ply: a move by one player
    • need legal move generator (can filter by what's "plausible")
    • use transposition (hash) table of evaluations at previously seen positions
    • can use pruning strategies
      -- e.g., based on shallow, fast evaluation
      -- danger: may prune the path that leads to a win!

  • Optimal decisions in games (Minimax)
    • assume both players play optimally (they want to win)
    • A plays their best move, assuming that B responds with their best move
      -- all the way down the tree!
    • High utility = player1 wins; Low utility = player2 wins.
    • Player1 tries to move value up, Player2 tries to move value down.
    • Search down the tree to terminal state, then back the values up taking min or max values until all states resulting from move choices have values that indicate what they'll lead to if played. Pick the best.
    • pick move that avoids opponent's best moves!
    • time is exponential in search depth. :-(
    • getting to optimal requires searching to terminal states
      -- just not viable for huge game trees!
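
A minimal sketch of minimax on a game tree given as nested lists, where a non-list value is the utility of a terminal state. This tree encoding and the example values are assumptions for illustration.

```python
def minimax(node, maximizing=True):
    """Back up utilities from the terminal states, alternating max and min."""
    if not isinstance(node, list):        # terminal state: return its utility
        return node
    values = [minimax(child, not maximizing) for child in node]
    return max(values) if maximizing else min(values)


# Two-ply example: Player1 (max) moves, then Player2 (min) replies.
TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(minimax(TREE))   # 3
```

Each of Player1's moves is worth the minimum of its replies (3, 2, 2), so the first move is picked.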

  • Alpha-Beta pruning
    • pruning!
    • an addition to minimax
    • don't expand a node that can't provide a score that's better than what you already have
    • time/space saved can allow deeper searches (e.g., twice as deep)
    • still exponential with depth, but visits fewer nodes due to pruning
    • game tree branch order affects pruning possibilities
    • chess: could order by expected utility
      -- e.g., captures; threats; move forward; move back
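
A minimal sketch of alpha-beta pruning on a nested-list game tree (a non-list value is a terminal utility). The tree encoding and the optional visited list, used here only to count evaluated leaves, are assumptions for illustration.

```python
def alphabeta(node, alpha=float("-inf"), beta=float("inf"),
              maximizing=True, visited=None):
    """Minimax value of node, skipping branches that cannot change the result."""
    if not isinstance(node, list):
        if visited is not None:
            visited.append(node)          # record each leaf actually evaluated
        return node
    if maximizing:
        value = float("-inf")
        for child in node:
            value = max(value, alphabeta(child, alpha, beta, False, visited))
            alpha = max(alpha, value)
            if alpha >= beta:             # min player will never allow this branch
                break
        return value
    value = float("inf")
    for child in node:
        value = min(value, alphabeta(child, alpha, beta, True, visited))
        beta = min(beta, value)
        if alpha >= beta:                 # max player already has something better
            break
    return value


AB_TREE = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
seen = []
print(alphabeta(AB_TREE, visited=seen), len(seen))   # 3 7
```

Only 7 of the 9 leaves are evaluated here. Reordering the third branch as [2, 5, 14] would prune two more, which illustrates why branch order matters.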

  • Imperfect decisions
    • can't search tree to terminal state
    • cut off search earlier and use evaluation function
      -- accurate estimate of chances of winning in that state (i.e., utility)
    • depth limited, or iterative deepening ("anytime algorithm")
    • Features:
      • # of pieces
      • strength of pieces (queen > pawn)
      • mobility (poss. moves)
      • control (squares threatened)
      • threats (potential captures)
      • patterns of pieces (e.g., diagonal pawns)
    • Evaluation function: often a weighted linear function

  • Chess: Heuristic Continuation Fights the Horizon Effect
    • fixed depth search produces a "horizon" (may be bad beyond it!)
    • singular-extension
      -- if one move's value is much better than rest, then keep looking down that branch, as it's a place where the most change in value could result from minimaxing
    • search-until quiescent
      -- look for quiet (i.e., no possible captures)

  • Chess: Deep Blue plays Grandmaster Chess
    • first machine to win chess game against reigning world champion
    • uses alpha-beta search, with selective extensions
    • could search to a depth of 12 ply
    • has opening "book" and all five-or-fewer piece endgames
    • massively parallel, 30-node, RS/6000, SP-based computer system enhanced with 480 special purpose VLSI chess chips
    • evaluates 200,000,000 chess positions per second
    • several months working with a grandmaster on evaluation function
    • "In three minutes, ... it computes everything it knows about the current position from scratch."

  • Chinook: world man-machine checkers champion

Lecture 9: Constraint Satisfaction Problems 1 (6.1-6.2)

  • Defining CSPs
    • Constraint Satisfaction Problem (CSP)
    • set of constraints that specify allowable combinations of values of variables
      -- e.g., X1 ≥ X2,   X1 > X3,   X2 ≥ X3
    • set of variables (each one can have a value)
      -- e.g., Vbls = { X1, X2, X3 }
    • a set of allowable values (domain) for each variable
      -- e.g., the domain of each variable is {1, 2, 3, 4, 5}
      -- usually discrete, finite domains
    • the problem is to find a complete and consistent assignment
      -- all variables have values, no constraints are violated
    • there may be several, or no, consistent assignments
    • the result may need to be all or one consistent assignment

    • constraint graph: nodes = variables; links show constraint influence
      -- If constraint SA ≠ WA then SA-----WA in graph
    • constraint propagation:
      -- the influence of removing inconsistent values can spread through the graph (prune domains)
    • constraints can be fully enumerated
      -- show all allowable assignments for variables in the constraint
      -- e.g., { (red, green), (red, blue), ... (blue, green)}

    • types of problem solvers for CSPs
      -- search making one variable assignment at a time
      -- gradually eliminate inconsistent values from domains
      -- manipulate a potential solution until it becomes consistent

    • unary constraints include one variable (e.g., X ≠ blue )
    • binary constraints include two variables (e.g., A > B+3 )
      -- usually can reduce to all binary constraints
    • global constraints: e.g., Alldiff (means "all different")
    • preference constraints: ( ProfDCB prefers afternoon )
      -- other assignments are consistent, but suboptimal (incur cost)
    • resource constraints: Atmost(10, A, B, C, D)   (i.e., 10 max)
    • bounds: reason using variable domains represented by [lower, upper]

    • Examples: map coloring, scheduling, 8 queens, cryptarithmetic, Sudoku

  • Inference in CSPs by Constraint propagation
    • Node consistency: variable's unary constraints satisfied
    • Arc consistency: binary constraints satisfied between two variables
      -- e.g., variables X and Y
      -- for every value in the domain of X there's a value in the domain of Y that satisfies constraint
          (i.e., there's potential for a solution!)
      -- larger goal: aim to make whole graph arc consistent by removing domain values that don't give arc consistency

    • AC-3 algorithm: if domain of a variable is reduced, then look to see if that affects variables connected to it by constraints!
      -- i.e., the effects are propagated, until failure, or graph is arc consistent.
      -- even if result isn't a solution, it will be much easier to solve! (small domains)
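
A minimal sketch of AC-3, applied to the example constraints X1 ≥ X2, X1 > X3, X2 ≥ X3 with domains {1, ..., 5}. The data layout (a dict of sets for domains, a dict of neighbor lists, and an allowed(x, vx, y, vy) test) is an assumption made for illustration.

```python
from collections import deque


def revise(domains, x, y, allowed):
    """Remove values of x that have no supporting value in the domain of y."""
    removed = False
    for vx in list(domains[x]):
        if not any(allowed(x, vx, y, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed


def ac3(domains, neighbors, allowed):
    """Prune domains in place; return False if some domain becomes empty."""
    queue = deque((x, y) for x in domains for y in neighbors[x])
    while queue:
        x, y = queue.popleft()
        if revise(domains, x, y, allowed):
            if not domains[x]:
                return False                  # failure: no value left for x
            for z in neighbors[x]:
                if z != y:
                    queue.append((z, x))      # propagate the change
    return True


CONSTRAINTS = {("X1", "X2"): lambda a, b: a >= b,
               ("X1", "X3"): lambda a, b: a > b,
               ("X2", "X3"): lambda a, b: a >= b}


def allowed(x, vx, y, vy):
    if (x, y) in CONSTRAINTS:
        return CONSTRAINTS[(x, y)](vx, vy)
    return CONSTRAINTS[(y, x)](vy, vx)


domains = {v: set(range(1, 6)) for v in ("X1", "X2", "X3")}
neighbors = {"X1": ["X2", "X3"], "X2": ["X1", "X3"], "X3": ["X1", "X2"]}
print(ac3(domains, neighbors, allowed), domains)
```

For this example only X1 = 1 and X3 = 5 are removed. The graph is then arc consistent but not yet solved, so search is still needed.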

    • Path consistency: look at triples of variables.
      -- IF A----B----C is a path, THEN, for every consistent assignment of values to both A and C (consistent with the constraints on A and C), there must be an assignment to B that is consistent with the A----B constraints AND the B----C constraints.

Lecture 10: Constraint Satisfaction Problems 2 (6.3-6.4)

  • Backtracking search for CSPs
    • depth-first search that chooses value for one variable at a time,
      and backtracks when a variable has no legal value left to assign.
      -- backtrack to a choice point on failure.
      -- keeps a single representation of the state and alters it

    • Choices?
      -- which variable to assign next?
      -- which order to assign values to that variable?

    • Variable choice
      • choose vbl with fewest remaining values
        -- most constrained vbl is more likely to fail soon
        -- 1,000+ times better performance
      • choose vbl that is involved in constraints with largest number of other vbls
        -- most influence

    • Value assignment order
      • prefer the value that rules out the fewest values in the closest vbls in the constraint graph
        -- leave max flexibility for subsequent assignments

    • Search mixed with inference
      • after choice of value for vbl X do inference (e.g., arc consistency)
      • forward checking: make the unassigned neighbors of X arc consistent with X
      • maintaining arc consistency (MAC): do AC-3 on neighbors of X

    • Intelligent backtracking on failure
      • normal backtracking is "chronological"
        -- unwind in reverse temporal order
      • improved backtracking is "dependency-based"
        -- unwind to point that contributed to failure
        -- e.g. conflict-directed backjumping
      • no-good: keep track of set of vbls and their values that cause a problem
        -- no-good set gives early warning of failure
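
A minimal sketch of backtracking search with the fewest-remaining-values variable choice and forward checking, assuming every constraint is a "not equal" constraint between neighbors (as in map coloring). The function name, the data layout, and the map below are illustrative. For brevity it copies the domains at each step instead of altering a single state.

```python
def backtrack(assignment, domains, neighbors):
    """Return a complete consistent assignment (dict), or None."""
    if len(assignment) == len(domains):
        return assignment
    # Choose the unassigned variable with the fewest remaining values.
    var = min((v for v in domains if v not in assignment),
              key=lambda v: len(domains[v]))
    for value in sorted(domains[var]):
        # Forward checking: remove value from the unassigned neighbors' domains.
        pruned = {n: domains[n] - {value}
                  for n in neighbors[var] if n not in assignment}
        if all(pruned.values()):              # no neighbor domain became empty
            new_domains = {**domains, **pruned, var: {value}}
            result = backtrack({**assignment, var: value}, new_domains, neighbors)
            if result is not None:
                return result
    return None                               # triggers chronological backtracking


MAP = {"WA": ["NT", "SA"], "NT": ["WA", "SA", "Q"],
       "SA": ["WA", "NT", "Q", "NSW", "V"], "Q": ["NT", "SA", "NSW"],
       "NSW": ["Q", "SA", "V"], "V": ["SA", "NSW"], "T": []}
colors = {v: {"red", "green", "blue"} for v in MAP}
solution = backtrack({}, colors, MAP)
print(solution)
```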

  • Min-conflicts
    • Local search for CSPs -- uses one state and modifies it
    • 8-queens problem
    • move randomly chosen conflicted piece
      -- move it to position with least conflicts (min-conflicts)
    • works well for hard problems
    • works well if there are many solutions in state space
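
A minimal sketch of min-conflicts for n-queens, assuming one queen per row so that a state is a list of column indices. The names, the step limit, and the seeded random generator are assumptions for illustration.

```python
import random


def conflicts(cols, row, col):
    """Number of queens attacking square (row, col), ignoring the queen in that row."""
    return sum(1 for r, c in enumerate(cols)
               if r != row and (c == col or abs(c - col) == abs(r - row)))


def min_conflicts(n=8, max_steps=10000, seed=0):
    """Return a list of column positions with no attacks, or None if steps run out."""
    rng = random.Random(seed)
    cols = [rng.randrange(n) for _ in range(n)]       # random complete assignment
    for _ in range(max_steps):
        conflicted = [r for r in range(n) if conflicts(cols, r, cols[r]) > 0]
        if not conflicted:
            return cols
        row = rng.choice(conflicted)                  # randomly chosen conflicted piece
        counts = [conflicts(cols, row, c) for c in range(n)]
        best = min(counts)
        # Move it to a position with the fewest conflicts (ties broken randomly).
        cols[row] = rng.choice([c for c in range(n) if counts[c] == best])
    return None


print(min_conflicts())
```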

  • Constraint posting
    • constraints can record knowledge
    • consider vbl X
    • reasoning infers constraints
    • post a constraint (X > 10)
    • post another constraint (X < 12)
    • don't decide value for X until you know a lot about it!
    • Least Commitment

  • Conditional CSPs
    • configuration problems
    • not all variables known in advance (unlike basic CSP!)
    • use a part in the config, then add its variables
    • i.e., vbls are conditional
    • e.g., car config rules
      -- RV means Require Variable
      -- RNV means Require No Variable
    • Package="luxury" ==>RV Sunroof
    • Sunroof="type2" ==>RV Opener
    • Type="convertible" ==>RNV Sunroof

Lecture 11: Logical Agents & Propositional Logic (7.1-7.5, 7.7)

  • Knowledge-based agents
    • reasoning using representations of knowledge
    • KB = knowledge base = collection of knowledge
    • logic = declarative knowledge representation language
    • TELL = agent told new knowledge
    • ASK = agent asked what it knows or can "infer"
    • axiom: taken as given, as being true
    • knowledge level vs. implementation level

  • Wumpus World
    • discrete, static, single-agent, partially observable
    • requires reasoning to update world model in order to decide moves

  • Logic Intro
    • allows truth values True and False
    • KB has sentences in logic
    • syntax = legal structure of sentence
    • semantics = meaning of sentence given "possible world"
    • model = possible world
    • a sentence is true in some models and false in others
    • model m makes sentence a true   ≡   m satisfies a
    • a entails b:   b follows logically from a:   a |= b
    • iff in every model in which a is true, b is also true
      i.e., M(a) ⊆ M(b)
    • logical inference uses logic to provide answers (e.g., about s)
    • model checking = enumerating all possible models
       to see if for all models in which KB is true, s is true
          M(KB) ⊆ M(s)
          KB |= s

    • Inference: finding if something follows from what you know
    • lots of things are entailed by the KB; inference looks for one particular one.
    • |-i     = inference using algorithm i
    • KB |-i s     = s can be derived from the KB
    • a "sound" inference algorithm is truth preserving
    • model checking is sound
    • a "complete" inference algorithm can produce any sentence that is entailed
      -- i.e., anything that follows logically

    • if KB is true in the real world, then any sentence a derived from KB by a sound inference procedure is also true in the real world.
    • grounding: connecting the logical reasoning with the agent's real world
    • the agent's sensors create the connection

  • Propositional Logic
    • propositional symbols: each stands for a proposition (true or false)
    • connectives: 'not' (negation), 'and' (conjunction), 'or' (disjunction),
      'implies' (implication/if-then), 'iff' (if-and-only-if/biconditional/equivalence)
    • operator precedence
    • a model determines a truth value for every propositional symbol
    • semantics: how to compute truth value for any sentence
    • rules for evaluating truth of the 5 connectives
    • note TFF for (P ⇒ Q) implication, and F implies anything
    • truth tables: every assignment of T/F to propositions

    • KB is set of propositions saying when they're true
      e.g., P_{x,y} is true if there's a pit in location [x,y]
    • KB includes sentences about propositions
      e.g., ¬B_{1,1}

    • simple inference: model checking for KB |-i s
      -- check all assignments of T/F to propositions
      -- find assignments where KB is true (all sentences are true)
      -- look for how s is assigned.
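
A minimal sketch of entailment by model checking, assuming the KB and the query are given as Python functions from a model (a dict mapping symbol names to True/False) to a truth value. The symbol names B11, P12, P21 are shorthand for a breeze in [1,1] and pits in [1,2] and [2,1]. The rule that a breeze occurs exactly when a neighboring square has a pit is assumed here for the example.

```python
from itertools import product


def tt_entails(kb, query, symbols):
    """True iff query holds in every model in which kb holds."""
    for values in product([True, False], repeat=len(symbols)):
        model = dict(zip(symbols, values))    # one assignment of T/F to symbols
        if kb(model) and not query(model):
            return False                      # a model of KB in which query is false
    return True


# KB: no breeze in [1,1], and B11 <=> (P12 or P21).
def kb(m):
    return (not m["B11"]) and (m["B11"] == (m["P12"] or m["P21"]))


SYMBOLS = ["B11", "P12", "P21"]
print(tt_entails(kb, lambda m: not m["P12"], SYMBOLS))   # True
print(tt_entails(kb, lambda m: m["P21"], SYMBOLS))       # False
```

The loop visits 2^n models for n symbols, so this approach is only practical for small KBs.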

  • Propositional Theorem Proving
    • theorem proving = applying rules of inference to KB to try to show what we want
    • logical equivalence = true in same set of models [e.g., ¬(¬P) ≡ P ]
    • valid sentence = tautology = true in all models [e.g., P ∨ ¬P ]
    • satisfiable sentence = true in some model
    • P is valid iff ¬P is unsatisfiable
      i.e., if there are no models that satisfy ¬P

    • KB |= b   iff   (KB ∧ ¬b) is unsatisfiable
      • e.g., to show b assume b to be false and add ¬b to the KB
        i.e., KB ∧ ¬b
      • then try, by inference, to show this causes a contradiction
      • if there's a contradiction then b must in fact follow from KB
      • known as proof by "refutation"

  • Inference and Proofs
    • inference rules can be used in sequence in a proof
    • Modus Ponens: given a and (a ⇒ b) then b can be inferred
    • And-Elimination: given (a ∧ b) infer a
    • all the logical equivalences can be used as inference rules, as they preserve truth
      e.g., ¬(¬P) ≡ P
    • monotonicity: set of entailed sentences only grows as more are added to the KB
    • inference rules might apply to anything in the KB (control needed)

  • Proof by Resolution
    • Resolution: an inference rule
    • works on clauses: disjunction of literals
      e.g., P ∨ Q ∨ ¬R
    • (a ∨ b) resolves with (¬a ∨ c) giving (b ∨ c)
    • removes the complementary literals (a, ¬a)
    • result has all of the other literals
    • remove duplicated literals

    • Resolution uses Conjunctive Normal Form (CNF)
    • e.g., <clause> ∧ <clause> ∧ <clause>
    • can convert any propositional logic sentence to CNF

    • If you're trying to prove a
      • 1. convert (KB ∧ ¬a) into CNF
      • 2. use resolution inference rule on the resulting clauses
      • 3. if a resolvent is empty then we have a contradiction, and a is proved.
      • 4. if no new clauses result then the proof ends.
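
A minimal sketch of proof by refutation with the resolution rule, assuming the sentence (KB ∧ ¬a) has already been converted to CNF by hand. Clauses are encoded as frozensets of string literals, with "~P" standing for ¬P. That encoding and the function names are illustrative.

```python
from itertools import combinations


def negate(lit):
    return lit[1:] if lit.startswith("~") else "~" + lit


def resolve(c1, c2):
    """All resolvents of two clauses; sets drop duplicated literals automatically."""
    return [frozenset((c1 - {lit}) | (c2 - {negate(lit)}))
            for lit in c1 if negate(lit) in c2]


def resolution_refutes(clauses):
    """True if the empty clause (a contradiction) can be derived."""
    clauses = set(clauses)
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for resolvent in resolve(c1, c2):
                if not resolvent:             # empty clause: contradiction found
                    return True
                new.add(resolvent)
        if new <= clauses:                    # no new clauses: the proof ends
            return False
        clauses |= new


# KB: (P => Q) i.e. (~P v Q), and P.  To prove Q, add ~Q.
kb_clauses = [frozenset({"~P", "Q"}), frozenset({"P"})]
print(resolution_refutes(kb_clauses + [frozenset({"~Q"})]))   # True
```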

  • Using Horn clauses
    • Horn clause: disjunction of literals, with at most one positive
      e.g., P ∨ ¬Q ∨ ¬R
    • resolution on Horn clauses produces Horn clauses
    • Horn clauses can be written as implications (nicer to read/write)
      e.g., (a ⇒ b) ≡ (¬a ∨ b)
    • normal form is A ∧ B ⇒ C
    • proofs controlled by forward-chaining or backward-chaining search strategies
    • AND-OR graph
    • forward: (data-driven) starts from known facts (positive literals) and works forwards by inferences until the query is found.
      e.g., if you want to prove C, given A and also B, then use (A ∧ B ⇒ C) to provide C.
    • backward: (goal-directed) starts from query and works back trying to show that all the things that lead to the query can be inferred.
      e.g., if you want to prove C, and (A ∧ B ⇒ C), then prove both A and also B.
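
    Both search strategies can be sketched for propositional definite clauses. This is a rough illustration, not a reference implementation: the rule format (premises, conclusion) and the function names forward_chain and backward_chain are assumptions made for the example.

```python
def forward_chain(rules, facts, query):
    """Data-driven: fire rules from known facts until the query appears.

    rules is a list of (premises, conclusion) pairs, e.g. (("A", "B"), "C")
    for A ∧ B ⇒ C; facts is a collection of known symbols.
    """
    known = set(facts)
    changed = True
    while changed:
        if query in known:
            return True
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return query in known


def backward_chain(rules, facts, query, seen=frozenset()):
    """Goal-directed: prove the query by proving all premises of some rule."""
    if query in facts:
        return True
    if query in seen:                # already trying to prove this goal: avoid loops
        return False
    return any(
        all(backward_chain(rules, facts, p, seen | {query}) for p in premises)
        for premises, conclusion in rules
        if conclusion == query
    )


rules = [(("A", "B"), "C"), (("C",), "D")]
print(forward_chain(rules, {"A", "B"}, "D"))
print(backward_chain(rules, {"A", "B"}, "D"))
```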

  • Agents based on Propositional logic (brief summary)
    • problem: percepts (e.g., Stench) only apply at a particular time
    • adding ¬Stench to a KB that already contains Stench gives contradiction!
    • fluent: something that changes
    • need to state what changes and what doesn't for each action
    • this is known as the "frame problem"
    • hard to deal with in propositional logic as there are only symbols
    • we can make symbols Stench^1 and Stench^2 etc. to show different times
      N.B., the superscript is part of the symbol and has no influence in the logic.

Lecture 12: First Order Logic (8.1-8.3, 8.4)

  • Representation revisited
    • Propositional logic - facts
    • First Order Logic - facts, objects and relations
    • can include variables
    • includes statements about some or all (quantifiers)
    • FOL assumes world with objects and relations
    • true or false or unknown
    • standard syntax -- "syntactic sugar" provides allowed variants

  • Syntax & Semantics
    • models contain objects (Richard), relations (brother-of), properties (king), functions (left leg)
    • syntactic elements in the language are symbols
    • constant symbols (Richard) stand for objects
    • predicate symbols (Brother) stand for relations
    • function symbols (LeftLeg) stand for functions
    • interpretation specifies exactly what in the model symbols refer to

    • terms refer to objects - e.g., Richard, or LeftLeg(Richard)
    • atomic sentences = facts - e.g., Brother(Richard, John)
    • logical connectives
    • Quantifiers -- 'for all' ∀ and 'there exists' ∃ -- use variables
    • ∀x King(x) ⇒ Person(x) --- note TFF
    • ∃x Crown(x) ∧ OnHead(x, John)

    • quantifier order matters
    • ∀x ∃y Loves(x, y)
    • ∃y ∀x Loves(x, y)
    • use different vbl names for each quantifier
    • ∃ and ∀ are related by ≡ rules -- how?

    • equality: two terms refer to same object --- e.g., Father(John) = Henry

    • alternative semantics
    • unique-names assumption -- every constant refers to distinct object
    • closed-world assumption -- if we don't know it's true, it's false
    • domain closure -- # domain elements = # constant symbols

  • Using FOL
    • TELL -- add "assertions" to KB
    • ASK queries -- can retrieve directly or infer
    • ASKVARS gives vbl bindings/substitutions for the answer
      e.g., ASKVARS(KB, Person(x))   gives   {x/John} and also {x/Richard}
    • theorems are derived from axioms (i.e., from basic factual info and definitions)
    • theorems can be used in inference too
    • unlike Propositional logic, can make statements about any time
      e.g., ∀t HaveArrow(t + 1) ⇔ (HaveArrow(t) ∧ ¬Action(Shoot, t))

  • Knowledge Engineering in FOL
    • knowledge engineering = KB construction for task/domain
      1. Identify task: what needs to be represented
      2. Assemble relevant knowledge: knowledge acquisition
      3. Decide on vocabulary: predicates, functions and constants
        i.e., define the Ontology
      4. Encode general knowledge about domain
      5. Encode specific problem instance (e.g., info from sensors)
      6. Pose queries and get answers (ASK)
      7. Debug the KB (and individual sentences)

Lecture 13: Inference in First Order Logic (9.1-9.5)

  • Propositional vs First Order Inference
    • simple inefficient approach: convert FOL to propositional logic then do inference
    • remove quantifiers and variables
    • ∀ -- if possible do Universal Instantiation (substitute variables with ground terms)
    • ∃ -- pick a Skolem constant to stand for the thing that exists.
    • typically generates lots of sentences, many irrelevant

  • Unification
    • for FOL use Generalized Modus Ponens (MP)
    • find substitutions for variables that makes regular MP useable
    • Generalized MP is MP "lifted" to apply to variables
    • unification = finding substitutions that make different logical expressions look identical
      e.g., UNIFY(Knows(John,x), Knows(y,Bill)) = {x/Bill, y/John}
    • after unification then a P with vbls matches the P in (P ⇒ Q) allowing MP
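
    A small unifier makes the idea concrete. It is a sketch under assumed conventions that are not from the notes: variables are lowercase strings, constants and predicate names are capitalized strings, and compound expressions are tuples such as ("Knows", "John", "x"). The names is_var, substitute, occurs and unify are hypothetical.

```python
def is_var(t):
    """Assumed convention: variables are strings starting with a lowercase letter."""
    return isinstance(t, str) and t[:1].islower()


def substitute(t, theta):
    """Apply substitution theta to t, following chains of bindings."""
    if is_var(t):
        return substitute(theta[t], theta) if t in theta else t
    if isinstance(t, tuple):
        return tuple(substitute(a, theta) for a in t)
    return t


def occurs(var, t, theta):
    """True if var appears inside t (prevents bindings like x/F(x))."""
    t = substitute(t, theta)
    if t == var:
        return True
    return isinstance(t, tuple) and any(occurs(var, a, theta) for a in t)


def unify(x, y, theta=None):
    """Return a substitution that makes x and y identical, or None if none exists."""
    theta = {} if theta is None else theta
    x, y = substitute(x, theta), substitute(y, theta)
    if x == y:
        return theta
    if is_var(x):
        return None if occurs(x, y, theta) else {**theta, x: y}
    if is_var(y):
        return None if occurs(y, x, theta) else {**theta, y: x}
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for a, b in zip(x, y):
            theta = unify(a, b, theta)
            if theta is None:
                return None
        return theta
    return None


print(unify(("Knows", "John", "x"), ("Knows", "y", "Bill")))
```

    With the two sentences from the example above, the result binds x to Bill and y to John.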

      Note: skip section about making retrieval more efficient

  • Forward Chaining
    • useful for Situation ⇒ Response systems (rules)
    • use definite clauses: disjunctions of literals with exactly one positive
    • perfect for sentences such as: King(x) ∧ Greedy(x) ⇒ Evil(x)
      which converts into a definite clause
    • algorithm: start from known facts, use all rules whose premises are satisfied, and add the conclusions to the known facts, and repeat until query answered.
    • sound and complete
    • may not be efficient
    • incremental forward chaining: every new fact inferred in iteration t must be derived from at least one new fact inferred in iteration t-1.

  • Backward Chaining
    • works backwards from goal query
      from conclusions back to premises
    • uses definite clauses
    • needs to keep track of accumulated substitutions
    • can be done by depth-first search
    • AND-OR tree
    • used in Logic Programming (e.g., Prolog)

      Note: skip section 9.4.3-9.4.6

  • Resolution
    • Every sentence of FOL can be converted into an inferentially equivalent Conjunctive Normal Form (CNF) sentence
      i.e., a conjunction of clauses, with each clause being a disjunction of literals:
      clause e.g., ¬American(x) ∨ ¬Weapon(y) ∨ ¬Hostile(z) ∨ ¬Sells(x,y,z) ∨ Criminal(x)
    • to convert to CNF
      1. eliminate implications
      2. move ¬ inwards
      3. standardize variables
      4. Skolemize to remove existential quantifiers
      5. drop universal quantifiers
      6. distribute ∨ over ∧
      7. result is a set of clauses connected by ∧
    • resolution inference:
      1. take two clauses with complementary literals
      2. find a substitution that allows one to "cancel out the other"
      3. what's left over, with the substitution, forms the resolvent clause
    • resolution proof: prove KB |= a by proving that (KB ∧ ¬a) is unsatisfiable,
      by deriving the empty clause.
    • each resolution step adds a new clause to the KB (increasing in size)

      Note: skip section 9.5.4-9.5.5

    • Resolution Strategies: resolution needs guidance about which clauses to try to resolve
      • Unit preference: always include a single literal in the resolution (gets shorter clauses back)
      • Set of Support: always use a member of a predetermined set of clauses in each resolution step (e.g., initially use negated query -- add every resolvent to the set of support)
      • Input Resolution: every resolution step uses at least one clause from the KB or the query
      • Subsumption: eliminate all sentences that are more specific than (subsumed by) something already in the KB

Lecture 14: Classical Planning (10-10.3, 10.4.4, 11.1-11.2.2)

  • Definition
    • devising a plan of action to achieve one's goals
    • world is represented by a collection of variables
    • a search problem: initial state; actions available; result of acting; goal test.
    • state: a conjunction of fluents (with no variables)
    • closed world assumption
    • unique names assumption

    • Action: defined using an action schema using vbls (represents a set of specific actions)
      e.g., Fly: fly from Boston to SF, fly from Austin to NYC, ...
    • actions only mention preconditions and effects
    • preconditions must be true in order to do the action
    • effects: delete list (no longer true) & add list (new fluents)
      e.g., ¬At(p,from) ∧ At(p,to)
    • initial state: a specific state description
      e.g., At(C1,SFO) ∧ At(C2,JFK) ∧ ...
    • goal: a conjunction of literals that may contain vbls
      e.g., At(C1,JFK) ∧ At(C2,SFO)
    • note that actions may have costs, or the count of actions could be used if we assume equal costs.
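
    The state and action representation above can be sketched as follows. This is an illustration only: fluents are plain strings, the Action class and the fly helper are hypothetical names, and the single precondition At(p, from) is an assumed simplification of the Fly schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    preconditions: frozenset   # fluents that must hold before acting
    add_list: frozenset        # fluents that become true
    delete_list: frozenset     # fluents that are no longer true


def applicable(state, action):
    """An action can be done only if all of its preconditions are in the state."""
    return action.preconditions <= state


def result(state, action):
    """Successor state: remove the delete list, then add the add list."""
    if not applicable(state, action):
        raise ValueError(f"{action.name} is not applicable")
    return (state - action.delete_list) | action.add_list


def fly(p, frm, to):
    """One ground instance of an assumed Fly(p, from, to) action schema."""
    return Action(
        name=f"Fly({p},{frm},{to})",
        preconditions=frozenset({f"At({p},{frm})"}),
        add_list=frozenset({f"At({p},{to})"}),
        delete_list=frozenset({f"At({p},{frm})"}),
    )


state = frozenset({"At(P1,SFO)", "At(P2,JFK)"})
goal = frozenset({"At(P1,JFK)"})
state = result(state, fly("P1", "SFO", "JFK"))
print(goal <= state)   # goal test: every goal fluent holds in the state
```

    Under the closed world assumption any fluent not in the set is false, so negative effects only need to be recorded as deletions.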

  • Planning as state-space search
    • Forward state-space search (progression)
      -- start from initial state and apply actions until goal is found
      -- strong domain-independent heuristic needed, and available
      -- most planning systems use forward search
    • Backward relevant-states search (regression)
      -- start from goal and apply relevant actions backwards until initial state found
      -- select actions that could contribute to the goal, but don't negate an element of the goal
      -- previous state is current state without the add list and including the preconditions

    • heuristics for planning
      • try to find a relaxed problem
        -- ignore all preconditions
        -- ignore some preconditions
        -- ignore delete lists
        -- ignore some fluents
      • use decomposition
        -- assume independent subgoals, solve separately, combine costs
      • use pattern databases
        -- stored cost for problems with particular pattern in them

  • Planning graph
    • can give a better heuristic estimate for guiding planning search
    • graph can be used to estimate how many steps to reach goal
    • GraphPlan: extract plan from searching in the planning graph
    • for propositions only (no variables)

    • connects possible states with possible actions
    • S0, A0, S1, A1, ...
    • Si is all the literals that could hold at time i, depending on the actions taken in prior steps.
    • Ai is all the actions that could be taken from Si including "persistence" (i.e., no change / no-op action).
    • build new S levels with actions between until there's no change in the literals included (levelled off)
    • planning graph isn't too costly to construct

    • can extract plan as a backward search once all literals from the goal are present in some S level and they aren't marked as mutually exclusive.

    • Mutex links = mutual exclusion
      i.e., things that can't exist together
      e.g., Have(Cake) with ¬Have(Cake)
      e.g., Have(Cake) with Eaten(Cake)
      e.g., Bake(Cake) with Eat(Cake) (i.e., actions have conflicting prereqs)
    • Mutex between actions too: Inconsistent effects; Interferences; Competing needs.

    • if any goal literal is not in final Si level then problem is not solvable
    • heuristic: can estimate the cost of achieving any goal literal by what level of the graph it first appears (level cost)
    • heuristics: for goal with conjunction of literals, try sum of level costs

  • Partial Order Planning
    • totally ordered plans: linear sequence of actions
    • partial order plans: actions with ordering constraints
      i.e., add liquid to flour BEFORE whisk together
    • find flaw in plan at each stage and suggest an action to add to fix it
    • use "least commitment" to fix flaw
    • build partial order plan
    • backtrack if necessary
    • can combine with libraries of high-level plans

  • Schedules
    • include how long an action takes, and when it should occur
    • plan first and schedule later
    • can also have resource constraints
      e.g., there is only one engine hoist
      that's important as plan is partial order plan
    • resources reusable or consumable
    • duration of plan used as cost function
    • actions have durations, and earliest & latest start times
    • slack: range of start times

    • CPM: Critical Path Method
    • critical path is the one whose duration is longest
      whole plan can't be shorter
    • from start, can look at earliest start for each action in a path
    • from end, can look at latest start for each action in a path
    • order constraints impose possible actual start times
    • resource constraints add additional restrictions
      e.g., actions using the one hoist can't overlap
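
    The forward and backward passes of CPM can be sketched as below, ignoring resource constraints. The function name critical_path and the action names and durations in the example are illustrative assumptions, not data from the notes.

```python
def critical_path(durations, before):
    """Compute earliest start, latest start, slack and total plan duration.

    durations maps each action to its duration; before is a list of (a, b)
    pairs meaning a must finish before b starts (the ordering constraints).
    """
    preds = {a: [] for a in durations}
    succs = {a: [] for a in durations}
    for a, b in before:
        preds[b].append(a)
        succs[a].append(b)

    # Topological order, so every action is visited after its predecessors.
    indegree = {a: len(preds[a]) for a in durations}
    ready = [a for a in durations if indegree[a] == 0]
    order = []
    while ready:
        a = ready.pop()
        order.append(a)
        for b in succs[a]:
            indegree[b] -= 1
            if indegree[b] == 0:
                ready.append(b)
    if len(order) != len(durations):
        raise ValueError("ordering constraints contain a cycle")

    # From the start: earliest start of each action.
    es = {}
    for a in order:
        es[a] = max((es[p] + durations[p] for p in preds[a]), default=0)
    total = max(es[a] + durations[a] for a in durations)

    # From the end: latest start of each action.
    ls = {}
    for a in reversed(order):
        ls[a] = min((ls[s] for s in succs[a]), default=total) - durations[a]

    slack = {a: ls[a] - es[a] for a in durations}
    return es, ls, slack, total


durations = {"AddEngine1": 30, "AddWheels1": 30, "Inspect1": 10,
             "AddEngine2": 60, "AddWheels2": 15, "Inspect2": 10}
before = [("AddEngine1", "AddWheels1"), ("AddWheels1", "Inspect1"),
          ("AddEngine2", "AddWheels2"), ("AddWheels2", "Inspect2")]
es, ls, slack, total = critical_path(durations, before)
print(total, slack)
```

    Actions with zero slack lie on the critical path; delaying any of them lengthens the whole plan.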

  • Hierarchical Planning (Hierarchical Task Networks)
    • humans plan using high-level actions (HLAs) first
      e.g., get to airport, fly, drive to destination
      i.e., HLA + HLA + HLA
    • hierarchical decomposition: higher = more abstract; lower = more concrete
    • each HLA has one or more "refinements"
    • a refinement is a more concrete sequence of actions (either HLAs or primitive actions)
    • can refine plans recursively down to primitives
    • at least one of the fully refined plans must achieve the goal
    • can use a plan library of refinements
    • a lot of knowledge about refinements can be encoded
    • planner effectively searches the space of plan refinements
    • it can be done breadth first.

Lecture 15: Knowledge Representation (8.4, 12.1-12.6)

  • "knowledge is power"
  • how many types of knowledge representation have we seen so far?

  • Ontological Engineering
    • ontology: those concepts that exist and can be reasoned about in the world
    • general concepts: events, time, physical objects, beliefs
    • Ontological Engineering: representing these concepts
    • Upper Ontology (e.g., SUMO) (Adam Pease, WPI, BS&MS)
    • add more details down to specific levels (e.g., Wumpus)
    • all upper level details (axioms) must still be relevant at lower levels (apart from exceptions)
    • ontologies produced by:
      • a team of ontologists/logicians
      • importing categories, attributes and values from databases
      • extracting information from text documents automagically
      • doing it wiki style with open access

  • Categories and Objects
    • category knowledge is vital
      e.g., supports recognition and also prediction
    • use Basketball(b) or "reify" it to Basketballs
    • subclass and member relations
    • subclasses form a taxonomy (e.g., plants)
    • Basketballs ⊂ Balls
    • BB9 ∈ Basketballs
    • for categories assume ∀
    • (x ∈ Basketballs) ⇒ Spherical(x)
    • Orange(x) ∧ Round(x) ∧ Diameter(x) = 9.5 ∧ x ∈ Balls   ⇒   x ∈ Basketballs

    • Males and Females are subclasses of Animals
    • they are an exhaustive decomposition
    • they are disjoint (no members in common)
    • can define categories
      x ∈ Bachelors   ⇔   Unmarried(x) ∧ x ∈ Adults ∧ x ∈ Males
    • natural kinds: most real-world categories have no clear-cut definitions
      e.g., games, tomatoes, chairs, ...
      ... think of a definition based on an example, think of a counter-example!

    • Physical decomposition also needs to be represented
    • Part-of hierarchies
    • tricky! is "cheek part-of face" the same as "wheel is part-of car"?
    • composite objects have structural relationships between parts: e.g., Attached(x,y)
    • bunch: objects with definite parts but no structure

    • Measurements: uses measure objects
      Length(L1) = Inches(1.5) = Centimeters(3.81)
    • some things don't have a scale (e.g., beauty), but still can use
      Beauty(Rose1) > Beauty(Weed1)

    • Stuff -- part of stuff is stuff (e.g., butter)
    • intrinsic properties: belonging to the substance of the object
      e.g., color, flavor, ownership, ...
    • extrinsic properties: belonging to the object
      e.g., length, shape, weight, ...
    • a category that includes only intrinsic properties is a substance
    • what is half of a pile of sand?

  • Events
    • events are actions based on points in time
    • fluent: may change over time -- At(DCB, Office)
    • assert that it's true -- T(At(DCB, Office))
    • events take place over a time interval
      Happens(e,i) where i = (t1, t2)
    • events can make fluents become true or false at some time
      Terminates(e,f,t) --- event e causes fluent f to cease to hold at time t

    • Processes: actions where any part of the action is still the same type
    • sorta like "stuff" for objects
    • e.g., Flyings

    • Time intervals: moments (zero duration) and extended intervals
    • predicates for time intervals
      • Meet(i,j) ⇔ End(i) = Begin(j)
      • Before(i,j) ⇔ End(i) < Begin(j)
      • After(i,j)
      • During(i,j)
      • Overlap(i,j)
      • Begins(i,j)
      • Finishes(i,j)
      • Equals(i,j)

    • Fluents and objects -- an object is a chunk of space-time!
    • President(USA) denotes a single object that consists of different people at different times!

  • Mental events and objects
    • agents need statements about beliefs (mental objects)
    • propositional attitudes: believes, knows, wants, intends, informs
    • need Modal logic: include qualifications of a statement, such as "usual", "possible", "necessary", "impossible", "always", "believed", ...

    • K_A P means "A knows P"
    • can make statements about one agent's knowledge about another's knowledge
      e.g., K_A[K_B P]
      i.e., A knows that B knows P
    • K_A P ⇒ K_A(K_A P)
      i.e., if they know something then they know that they know it
    • need complicated (!) collection of "possible worlds" to figure out the semantics.

  • Reasoning with categories
    • semantic networks: graphical way of representing knowledge + inference
    • most semantic networks have an underlying logic
    • distinguish between categories and individuals
      MalePersons vs. John
      SubsetOf vs. MemberOf
    • inheritance: properties of categories flow down to subcategories
    • multiple inheritance: MemberOf(tux,Penguins), MemberOf(tux,Birds), does tux fly?
    • semantic nets allow "default" values
      these can be overridden by specified values in subcategories
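
    Inheritance with overridable defaults can be sketched with a tiny network. This is only an illustration: the Node class and its lookup method are hypothetical, and resolving multiple inheritance by trying parents in the order listed is an assumed policy, not something the notes prescribe.

```python
_MISSING = object()   # sentinel so that False and None can be real property values


class Node:
    """A category or an individual in a tiny semantic network."""

    def __init__(self, name, parents=(), **props):
        self.name = name
        self.parents = list(parents)   # SubsetOf / MemberOf links, tried in order
        self.props = props             # values specified directly on this node

    def _find(self, prop):
        if prop in self.props:         # a local value overrides inherited defaults
            return self.props[prop]
        for parent in self.parents:
            value = parent._find(prop)
            if value is not _MISSING:
                return value
        return _MISSING

    def lookup(self, prop, default=None):
        value = self._find(prop)
        return default if value is _MISSING else value


birds = Node("Birds", flies=True)
penguins = Node("Penguins", parents=[birds], flies=False)
tux = Node("tux", parents=[penguins, birds])
persons = Node("Persons", legs=2)
john = Node("John", parents=[persons], legs=1)

print(tux.lookup("flies"), john.lookup("legs"))
```

    Listing Penguins before Birds is what makes the more specific category win for tux; with the order reversed this sketch would answer differently, which is the multiple-inheritance ambiguity noted above.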

    • description logics: logics tuned to categories and for deciding relationships between them
    • subsumption: checking if one category is a subset of another by checking definitions
    • classification: checking whether an object belongs to a category
    • consistency: checking if category definition is logically satisfiable
    • DL language is intended to be easier to write than FOL
    • but they typically lack negation and disjunction
    • DL emphasises tractability of inference
    • And[Man, AtLeast(3, Son), AtMost(2, Daughter),
              All(Son, And(Unemployed, Married, All(Spouse, Doctor))),
              All(Daughter, And(Professor, Fills(Department, Physics, Math)))]

  • Default information
    • example of default knowledge?
    • monotonic: new statements produced by inference added to KB
    • nonmonotonic: override inherited properties: e.g., with Legs(John,1)
    • new evidence can override default statement (can't have both 1 and 2 legs!)
    • nonmonotonic logics: "circumscription", and "default logic"

    • circumscription: add circumscribed predicates
      e.g., Bird(x) ∧ ¬Abnormal(x) ⇒ Flies(x)
    • assume ¬Abnormal(x) unless Abnormal(x) is declared to be true

    • default logic: includes default rules
    • Bird(x) : Flies(x) / Flies(x)
      if prereq Bird(x) is true, and justification Flies(x) is consistent with KB, then conclude Flies(x)
    • Nixon-diamond semantic net example
        Republican(Nixon) ∧ Quaker(Nixon)
        Republican(x) : ¬Pacifist(x) / ¬Pacifist(x)
        Quaker(x) : Pacifist(x) / Pacifist(x)

    • Truth Maintenance: retracting facts as needed (belief revision)
    • suppose P had been assumed by default, but ¬P is found
    • need to retract P and assert ¬P, but also retract all sentences inferred from P!
    • JTMS: justification-based truth maintenance
    • annotate each sentence in KB with justification
      sentences from which it was inferred
    • allows sentences with multiple justifications not to be retracted
    • sentences without justification are marked as out (not deleted), allowing efficient future changes
    • ATMS: assumption-based TMS
      keeps track of all the assumptions that would cause a sentence to be true.

Lecture 16: Quantifying uncertainty (13.1-13.3)

  • Acting under uncertainty
    • Intro...
    • uncertainty due to partial observability, nondeterminism
    • uncertainty due to Laziness, Theoretical Ignorance, Practical Ignorance.
    • belief state: set of all possible worlds the uncertain agent might be in

    • Summarizing uncertainty...
    • connection between effect and cause is not a logical consequence, but is affected by degree of belief (probability)
    • probability summarizes uncertainty
    • probability statements made wrt knowledge states (what's known)

    • Uncertainty and rational decisions...
    • agents prefer some outcomes over others
    • utility: quality of being useful (preferences)
    • basic idea: if it is highly probable and highly useful, that's good!
    • Decision Theory = Probability Theory + Utility Theory
    • Principle of maximum expected utility
      Agent is "rational" iff it chooses the action that yields the highest expected utility, averaged over all the possible outcomes.

  • Basic Notation
    • what probabilities are about...
    • sample space: set of all possible worlds
      mutually exclusive & exhaustive
      e.g., set of all rolls from a pair of dice (1,1),(1,2),...,(6,6)
    • probability model: numerical probability with each possible world (0 to 1)
    • pair of dice: P(Total=11) = P((5,6)) + P((6,5)) = 1/36 + 1/36 = 1/18 (an unconditional probability)
    • P(doubles) = 6/36 = 1/6 (six of the 36 equally likely rolls)
    • P(cavity) = 0.2
    • unconditional P, or prior P (i.e., there's no other evidence)
    • if first die is 5, P(doubles | Die1 = 5) = ??
    • conditional P, or posterior P (i.e., it depends on other evidence)
      e.g., P(cavity | toothache) = 0.6
            P(cavity | toothache ∧ ¬cavity) = 0
    • product rule: P(a ∧ b) = P(a | b) P(b)
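
    These quantities can be checked by enumerating the sample space. The sketch assumes two fair dice, so each of the 36 possible worlds has probability 1/36; the helper names prob and cond are hypothetical.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 rolls of a pair of dice, assumed fair.
worlds = list(product(range(1, 7), repeat=2))
P = {w: Fraction(1, 36) for w in worlds}


def prob(event):
    """Unconditional probability: add up the worlds where the event holds."""
    return sum(P[w] for w in worlds if event(w))


def cond(event, given):
    """Conditional probability P(event | given) = P(event ∧ given) / P(given)."""
    return prob(lambda w: event(w) and given(w)) / prob(given)


def total_is_11(w):
    return sum(w) == 11


def doubles(w):
    return w[0] == w[1]


def first_is_5(w):
    return w[0] == 5


print(prob(total_is_11))
print(prob(doubles))
print(cond(doubles, first_is_5))

# Product rule: P(a ∧ b) = P(a | b) P(b)
both = prob(lambda w: doubles(w) and first_is_5(w))
print(both == cond(doubles, first_is_5) * prob(first_is_5))
```

    Using Fraction keeps the arithmetic exact, so the product rule check is an equality rather than an approximate comparison.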

    • the language of propositions (probability assertions)...
    • random variable: variables in probability theory e.g., Weather, Cavity, Toothache
    • each random variable has a domain of values
      e.g., Weather has {sunny, rain, cloudy, snow}
    • can write "sunny" for Weather = sunny

    • P(Weather) = < 0.6, 0.1, 0.29, 0.01 >
      stands for
        P(Weather = sunny) = 0.6
        P(Weather = rain) = 0.1
        P(Weather = cloudy) = 0.29
        P(Weather = snow) = 0.01
    • probabilities sum to 1.
    • the P statement defines a "probability distribution" for the single variable Weather (here, as a vector)

    • joint probability distribution: P(Weather, Cavity)
      includes some of the random variables
    • this is a 4 * 2 table of probability values
      {sunny, rain, cloudy, snow}, {cavity, ¬cavity}
    • P(sunny, Cavity) is 2 element vector
      sunny with cavity, sunny with no cavity
    • P(sunny, cavity) is a 1 element vector
    • full joint probability distribution
      includes all of the random variables
      e.g., P(Weather, Toothache, Cavity)

    • a possible world is an assignment of values to all the variables under consideration
      e.g., 4 * 2 possible worlds for vbls Weather and Cavity

    • skip probability axioms and their reasonableness...

    • where do probabilities come from...
    • different views
      • frequentist: from experiments, observed samples
      • objectivist: probabilities are real aspects of the universe
      • subjectivist: a way of characterizing an agent's belief, without external physical significance

  • Inference using Full Joint Distributions
    • full joint distribution for Toothache, Catch, Cavity (sum to 1)
    • look at worlds where proposition is true and add their probabilities
    • marginal probability: use a subset of the variables
        P(cavity) = 0.108 + 0.012 + 0.072 + 0.008
      i.e., cavity in all of the 4 situations of the 2 other vbls.
    • marginalization: sum up all values over the other variables
      P(Cavity) = sum of P(Cavity, z), over z, where z is {Catch, Toothache}
    • similarly for conditional probabilities (conditioning)

    • usually want to compute conditional probabilities
      i.e., use the effect of evidence
    • P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
      from product rule
    • P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
    • view 1/P(toothache) as a "normalization factor" = α
      without knowing value of P(toothache)
    • P(Cavity | toothache) = α P(Cavity, toothache)
    • = α[P(Cavity, toothache, catch) +P(Cavity, toothache, ¬catch)]
    • but you need full joint distribution to answer, so it doesn't scale :-(
    • in general P(X | e) = αP(X, e) = α ∑_y P(X, e, y)
      where e is all the evidence, y is all possible combinations of values from the unobserved vbls.
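
    The normalization recipe can be tried on the Toothache, Catch, Cavity table. The four cavity entries are the ones summed above; the four ¬cavity entries are assumed here from the textbook's dentist example and do not appear in these notes. The names VARS and query are hypothetical.

```python
VARS = ("Toothache", "Catch", "Cavity")

# Full joint distribution, keyed by (toothache, catch, cavity).
joint = {
    (True, True, True): 0.108,   (True, False, True): 0.012,
    (False, True, True): 0.072,  (False, False, True): 0.008,
    (True, True, False): 0.016,  (True, False, False): 0.064,   # assumed entries
    (False, True, False): 0.144, (False, False, False): 0.576,  # assumed entries
}


def query(x, evidence):
    """Return P(x | evidence) as {True: p, False: 1 - p}.

    Sums the joint over the unobserved variables for each value of x,
    then normalizes with alpha so that P(evidence) is never needed directly.
    """
    i = VARS.index(x)
    unnormalized = {True: 0.0, False: 0.0}
    for world, p in joint.items():
        if all(world[VARS.index(v)] == val for v, val in evidence.items()):
            unnormalized[world[i]] += p
    alpha = 1.0 / sum(unnormalized.values())
    return {val: alpha * p for val, p in unnormalized.items()}


print(query("Cavity", {}))                    # marginal P(Cavity)
print(query("Cavity", {"Toothache": True}))   # P(Cavity | toothache)
```

    The table has 2^3 entries here; with n Boolean variables it has 2^n, which is why this approach does not scale.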

Lecture 17: Uncertainty & Bayes (13.4-13.5)

  • Independence
    • some variables have no influence on others
      e.g., evidence about toothache, catch and cavity have no influence on cloudiness (they're independent)
      i.e., P(cloudy | toothache, catch, cavity) = P(cloudy)
    • if independent: P(a | b) = P(a), or P(b | a) = P(b), or P(a ∧ b) = P(a)P(b)
    • can generalize for P (probability distributions)
    • it factors large joint distributions into smaller ones.
    • nice but often hard to find.

  • Bayes' Rule
    • Rule: P(b|a) = P(a|b)P(b) / P(a)
    • as a set of equations with background evidence e
        P(Y|X,e) = P(X|Y,e)P(Y,e) / P(X,e)
      where e could be toothache and catch

    • Applying Bayes' rule: the simplest case...
    • Best thought of as
        P(cause|effect) = P(effect|cause)P(cause) / P(effect)
      with e.g., effect = symptom, cause = disease
    • diagnosis problem: given a symptom what is the disease?
    • uses causal knowledge -- what things cause what effects
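
    A one-line function shows the diagnostic direction being computed from causal knowledge. The numbers are purely illustrative assumptions (a rare disease and a common symptom) and do not come from these notes; the function name bayes is hypothetical.

```python
def bayes(p_effect_given_cause, p_cause, p_effect):
    """P(cause | effect) = P(effect | cause) P(cause) / P(effect)."""
    return p_effect_given_cause * p_cause / p_effect


# Assumed illustrative values:
p_symptom_given_disease = 0.7   # causal knowledge: the disease often causes the symptom
p_disease = 1 / 50000           # prior: the disease is rare
p_symptom = 0.01                # prior: the symptom is fairly common

p_disease_given_symptom = bayes(p_symptom_given_disease, p_disease, p_symptom)
print(p_disease_given_symptom)
```

    Even though the disease usually produces the symptom, the posterior stays small because the prior for the disease is so much lower than the prior for the symptom.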

    • Using Bayes' rule: combining evidence...
    • Toothache and Catch are probably dependent
    • If there's a Cavity, then Cavity can cause Toothache, and Cavity can cause Catch, but neither has a direct effect on the other.
    • i.e., in the presence of Cavity, Toothache and Catch can be considered independent
    • called "conditional independence"
    • P(toothache∧catch | Cavity) = P(toothache|Cavity) P(catch|Cavity)

    • to decompose a full joint distribution, using conditional independence
        P(Toothache, Catch, Cavity)
        = P(Toothache, Catch | Cavity) * P(Cavity)
        = P(Toothache|Cavity) P(Catch|Cavity) * P(Cavity)
        = P(Cavity) * P(Toothache|Cavity) P(Catch|Cavity)
      giving three smaller tables
    • this allows probabilistic systems to scale up.
    • in general
      P(Cause, Effect_1,...,Effect_n) = P(Cause) * Π_i P(Effect_i | Cause)

Lecture 18: Probabilistic Reasoning (14.1-14.2, 14.4, 16.1-16.2)

  • Representing knowledge in an uncertain domain
    • Bayesian network: data structure that can represent a full joint distribution using conditional independence and smaller distributions.
    • a directed acyclic graph.
    • if node1-------->node2 then node1 is "parent" of node2
      node1 has a "direct influence" on node2
    • conditional independence is indicated by lack of link between two nodes, but with shared parent
    • independent variables aren't connected to others
    • nodes annotated with conditional probability distribution
      P(Xi | Parents(Xi))     -- giving effects of parents on that node
    • when building a network order variables so that causes precede effects
    • include links from parents if one variable directly influences another

  • Semantics of Bayesian networks
    • For a particular entry in the joint distribution over all n variables
      i.e., X1=x1 ∧ ... ∧ Xn=xn
      P(x1,....,xn) = Π P(xi | parents(Xi))     -- varying i from 1 to n.
    • e.g., for john, mary, alarm, not burglary, not earthquake
        P(j, m, a, ¬b, ¬e)   =   P(j|a) P(m|a) P(a | ¬b∧¬e) P(¬b) P(¬e)
      by tracing back to parents.

    • causal models: causes ---> effects
    • diagnostic models: effects ---> causes
    • causal models easier to build, and easier to get probabilities for nodes

    • skip 14.2.2 and 14.3

  • Exact inference in Bayesian Networks
    • usual problem is to compute posterior probability for query vbls
      given some event (some assignment to evidence variables)
      • X is query vbl
      • E is set of evidence variables E1,...,Em
      • e is observed event (evidence)
      • Y is set of nonevidence, nonquery vbls Y1,...,Yl
        the "hidden variables"
      • complete set of vbls X = {X} ∪ E ∪ Y
      • typical query P(X | e)
    • sample query P(Burglary | JohnCalls=true, MaryCalls=true)   = <0.284, 0.716>
    • i.e., P(B | j, m), and e = earthquake, a = alarm, b = burglary

    • Inference by enumeration...
    • for typical query
    • in general P(X | e) = αP(X, e) = α ∑_y P(X, e, y)
      where y is all possible combinations of values from the unobserved vbls.
    • note that P(x1,....,xn) = Π P(xi | parents(Xi))
    • that allows P(x, e, y) to be calculated

    • P(B | j, m) = αP(B, j, m) = α ∑_e ∑_a P(B)P(e)P(a|B,e)P(j|a)P(m|a)
    • note that this uses each of the P(xi | parents(Xi)) in the network
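
    Inference by enumeration for the sample query can be sketched as below. The conditional probability tables are assumed from the textbook's burglary network and are not listed in these notes; the helper names pr, joint and burglary_given_calls are hypothetical.

```python
from itertools import product

# Assumed CPTs for the burglary network (each value is the probability of "true").
P_B = 0.001                                    # P(burglary)
P_E = 0.002                                    # P(earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(alarm | B, E)
P_J = {True: 0.90, False: 0.05}                # P(johnCalls | A)
P_M = {True: 0.70, False: 0.01}                # P(maryCalls | A)


def pr(p_true, value):
    """Probability that a Boolean variable takes the given value."""
    return p_true if value else 1.0 - p_true


def joint(b, e, a, j, m):
    """One entry of the joint: the product of P(xi | parents(Xi))."""
    return (pr(P_B, b) * pr(P_E, e) * pr(P_A[(b, e)], a)
            * pr(P_J[a], j) * pr(P_M[a], m))


def burglary_given_calls(j=True, m=True):
    """P(B | j, m): sum out the hidden variables e and a, then normalize."""
    unnormalized = {
        b: sum(joint(b, e, a, j, m)
               for e, a in product((True, False), repeat=2))
        for b in (True, False)
    }
    alpha = 1.0 / sum(unnormalized.values())
    return {b: alpha * p for b, p in unnormalized.items()}


print(burglary_given_calls())
```

    With these assumed tables the result is approximately <0.284, 0.716>, matching the sample query above.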

    • skip the rest

  • Quick Intro to Utility
    • Decision Theory: choose amongst actions based on immediate outcomes
    • in nondeterministic, partially observable environment
    • RESULT(a) is a random vbl that has values that are possible outcome states of action a
    • P(RESULT(a)=s' | a, e)
      probability of outcome s' given action a executed and evidence observations e

    • utility function: U(s') gives a number expressing desirability/usefulness of the state s'
    • EU(a | e) -- expected utility of an action:
          with lots of outcomes we need a way of weighting their utility by their probability
    • EU(a | e) = ∑_s' P(RESULT(a)=s' | a, e) * U(s')
    • maximum expected utility (MEU): a rational agent should pick the action that maximizes the expected utility
    • action = argmax_a EU(a | e)
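
    The MEU choice can be sketched in a few lines. The actions, outcome probabilities and utilities below are hypothetical values made up for illustration, as are the names expected_utility and best_action.

```python
def expected_utility(outcome_probs, utility):
    """EU(a | e): weight the utility of each outcome state by its probability."""
    return sum(p * utility[s] for s, p in outcome_probs.items())


def best_action(model, utility):
    """MEU: pick the action whose expected utility is highest."""
    return max(model, key=lambda a: expected_utility(model[a], utility))


# Hypothetical P(RESULT(a) = s' | a, e) for two actions.
model = {
    "take_umbrella": {"dry_but_encumbered": 1.0},
    "leave_umbrella": {"dry": 0.7, "wet": 0.3},
}
# Hypothetical utility function U(s').
utility = {"dry": 10, "wet": -20, "dry_but_encumbered": 6}

for action, outcomes in model.items():
    print(action, expected_utility(outcomes, utility))
print(best_action(model, utility))
```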

    • Preferences in choice:
      A > B -- agent prefers A over B
      A ~ B -- agent is indifferent between A and B
      A ≿ B -- agent prefers A over B, or is indifferent between them
    • there are axioms of utility theory that if followed will have an agent exhibit rational behavior.
    • if so
      U(A) > U(B) ⇔ A > B
      U(A) = U(B) ⇔ A ~ B

Lecture 19: Learning from examples (18.1-18.4)

  • Intro
    • Review the "Learning Agent"
    • agent is learning if it changes its performance, hopefully for the better, on future tasks after obtaining observations about the world.
    • basic case: "from examples"...
      given input-output pairs, learn function that predicts outputs for new inputs.
    • called "Inductive Learning"
      -- inductive inference learns something general from specific things
    • learning handles lack of agent designer's knowledge about the world, how it changes, or how to operate in it.

  • Forms of Learning
    • Factors affecting learning:
      • Component to be improved
      • Prior knowledge agent has
      • Representation used for the data/observations
      • Representation used for the Component
      • Feedback available to learn from
    • Components that might be learned include:
      • direct mapping from state to actions
      • inference of relevant properties of the world from percept sequence
      • information about the way the world evolves
      • information about the results of possible actions
      • the desirability of world states (utility)
      • the desirability of actions
      • goals describing classes of states to be achieved
    • Component Representations include logic, and Bayesian networks.
    • Much learning concerns factored data representations (vector of attribute/values)
    • Feedback to learn from: three types of learning...
      • unsupervised: learns patterns in input with no feedback (e.g., clustering)
      • reinforcement: agent learns from rewards/punishments which actions were good/bad
      • supervised: agent gets input and is told the matching output
    • problems: noise in data: incorrect or missing

  • Supervised Learning
    • "training set": input-output pairs (xi, yi), generated by unknown function y = f(x)
    • find function h (hypothesis) that approximates f
    • "test set": some additional examples ≠ training set
      -- used to test h (i.e., can h(x) correctly predict y?)
    • classification: discrete set of y values (e.g., diseases)
    • Boolean classification: y=true or y=false (learn goal predicate)
    • regression: y is a number
    • hypothesis space: a set of functions that h belongs to
    • consistent hypothesis: agrees with all the data
    • Ockham's razor: prefer the simplest consistent hypothesis
      e.g., prefer small decision trees

  • Learning Decision Trees (by induction)
    • decision tree representation
    • trees can be understood by people
    • decision trees are good for some types of problems but not all
    • decisions reached by a series of tests (path through tree)
    • a node is a test of an attribute
    • links from each node are labelled with each of the possible attribute values
    • leaf nodes are labelled with a y value (the output)
    • as trees are built additional nodes are added below single root node
    • not all attributes need to be included
    • there are many possible trees (most are inefficient)
    • if useful, paths through trees can be rewritten as rules, or logical statements.

    • inducing decision trees from examples
    • typical input is a vector of x values and a single y value
      x = { Sunny=very, Windy=moderate }, y = Sailing
      x = { Sunny=moderate, Windy=none }, y = Hiking
    • use greedy divide-and-conquer approach to learn trees
    • grow one level of tree below each node, moving down the tree
    • nodes are picked by their discriminating/sorting power ("important attributes")
      i.e., splitting the data to maximize progress towards leaf nodes
    • start at top with most important node, next level is a set of decision tree learning problems with smaller sets of data that were produced by the previous node's split.
    • results
      • reach leaf node with single y value if data is split perfectly
      • run out of data but there are still attributes left to use on that path, then we don't have an observation for that case
      • if we use all the attributes on a path but still have data, then there is noise in the data.
    • learning curve: improvement in accuracy of learning
      e.g., gradually increase training set size, and get increase in proportion of test set correct (exponential)

    • choosing attribute tests
    • pick most important attribute at each step of tree learning
    • how good are the subsets of the data produced by each attribute
      i.e., how well sorted
    • use entropy: a measure of uncertainty
    • a data subset with an equal mix of data leaves us uncertain about the result
    • want to reduce uncertainty - increase the amount of sorting that has been done - "information gain"
    • Gain: entropy of data set before using attribute, minus entropy of data subsets after using an attribute, is expected reduction in entropy (information gain)
    • check the Gain for each available attribute at that point in the tree, and use the one with the greatest Gain.
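
    The Gain calculation above can be sketched as follows. This is a minimal illustration, not the course's implementation: the function names and the four toy examples (in the style of the Sailing/Hiking data) are assumptions made for the sketch.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(examples, attribute):
    """Entropy before the split minus the weighted entropy of the subsets after it.

    `examples` is a list of (x, y) pairs, where x is a dict of attribute values.
    """
    before = entropy([y for _, y in examples])
    subsets = {}
    for x, y in examples:
        subsets.setdefault(x[attribute], []).append(y)
    after = sum(len(ys) / len(examples) * entropy(ys) for ys in subsets.values())
    return before - after

# Hypothetical toy data: Windy sorts the examples perfectly, Sunny not at all.
examples = [
    ({"Sunny": "very", "Windy": "moderate"}, "Sailing"),
    ({"Sunny": "very", "Windy": "none"}, "Hiking"),
    ({"Sunny": "moderate", "Windy": "moderate"}, "Sailing"),
    ({"Sunny": "moderate", "Windy": "none"}, "Hiking"),
]
best = max(["Sunny", "Windy"], key=lambda a: information_gain(examples, a))
```

    Here the attribute with the greatest Gain is the one whose subsets are each a single class, so it would be chosen as the test at this node.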

    • generalization and overfitting
    • overfitting: with more attributes there are more chance patterns in the training data, and the tree will try to accommodate them.
      i.e., it overcommits, and learns too much (such as noise)
    • decision tree pruning: eliminate nodes (leading to leaf nodes) that are not relevant.
    • likely to prune nodes that provide very small information gain
    • significance test: use statistics to test whether that deviation in the data is significantly different from no or normal deviation
      i.e., what are the chances that this could occur normally
    • pruning reduces the decision tree learning's sensitivity to noise

    • broadening the applicability
    • need to handle
      -- missing data
      -- attributes with many possible values (weakens Gain test)
      -- continuous and integer valued attributes (infinite set of values)
          :: use split points for node tests (e.g., Weight > 160)
      -- continuous valued output attributes: regression tree to predict output value

  • Evaluating and Choosing the Best Hypothesis
    • Intro
    • stationarity assumption: probability distribution over examples doesn't change over time.
    • independent: each example is independent of previous examples
    • identically distributed: each example has an identical prior probability distribution
    • error rate of hypothesis h(x): proportion of mistakes it makes
    • low error rate may still not predict well for other data

    • cross-validation: using the data in multiple ways to build and test
    • holdout cross-validation: randomly split data set into training set and test set
      -- need large training set to learn well
      -- but...need large test set to test well
    • k-fold cross-validation: divide data into k subsets; use each subset to test; use average error to estimate the accuracy of a tree trained on all data. k=10 is common.
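
    A minimal sketch of k-fold cross-validation, assuming a learner is any function that takes training examples and returns a hypothesis h. The names k_fold_error and majority_learner, and the toy data, are hypothetical; the majority learner is only a stand-in for a real learner such as decision tree learning.

```python
import random

def k_fold_error(examples, learn, k=10, seed=0):
    """Average test-set error rate over k folds.

    `learn(training_examples)` must return a hypothesis h, where h(x) predicts y.
    """
    data = list(examples)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal subsets
    errors = []
    for i, test in enumerate(folds):
        # Train on every fold except the one held out for testing.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h = learn(train)
        wrong = sum(1 for x, y in test if h(x) != y)
        errors.append(wrong / len(test))
    return sum(errors) / k

def majority_learner(train):
    """Hypothetical stand-in learner: always predict the most common y."""
    ys = [y for _, y in train]
    most_common = max(set(ys), key=ys.count)
    return lambda x: most_common

examples = [(i, i % 4 == 0) for i in range(40)]     # 10 True, 30 False
estimate = k_fold_error(examples, majority_learner, k=10)
```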

    • Model selection: complexity vs. goodness of fit
    • model selection: choosing the type of hypothesis to define a space of things that can be learned. i.e., h comes from the space.
    • optimization: getting the best h from the space
    • size: an approximation of the complexity of the hypothesis
      -- e.g., linear function < quadratic function
      -- e.g., small decision tree < larger decision tree
    • find best 'size' that balances underfitting and overfitting to give best test set accuracy.
    • wrapper: an algorithm to try to find the best size, that takes a learning algorithm (e.g., decision tree learning) and some examples
      -- it varies size, uses cross validation to learn error rate
      -- stops at lowest error, when h starts to overfit
      -- then learns with all data for a hyp of that size.

    • From error rates to loss
    • not all errors are created equal!
      -- better to get false +ves? (told you have disease when you don't)
      -- false -ves? (not told you have disease when you do)
    • need to take that utility into account as well
    • assume h(x) gives ŷ instead of y
    • loss function: loss of utility by getting an error
        L(x,y,ŷ) = Utility(result of using y given an input x) - Utility(result of using ŷ given an input x)
    • can use just L(y, ŷ)
    • small loss is better (we want to minimize it)
    • Loss functions
      • Absolute value loss: L1(y,ŷ) = |y-ŷ|
      • Squared error loss: L2(y,ŷ) = (y-ŷ)²
      • 0/1 loss: L0/1(y,ŷ) = 0 if y=ŷ else 1
    • generalization loss: expected loss, taking prior probability distribution over all I/O pairs into account
    • empirical loss: for an h, assume data equally likely, sum loss for each h(x)
    • estimated best hypothesis: the h with the minimum empirical loss
    • small-scale learning: problems with dozens to 1000s of examples
    • large-scale learning: millions of examples -- restricted by computation
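
    The three loss functions and the choice of an estimated best hypothesis can be sketched as below. The empirical loss is taken as the average loss over the examples; the data and the two candidate hypotheses are made up for illustration.

```python
def l1_loss(y, y_hat):
    """Absolute value loss."""
    return abs(y - y_hat)

def l2_loss(y, y_hat):
    """Squared error loss."""
    return (y - y_hat) ** 2

def zero_one_loss(y, y_hat):
    """0/1 loss: no loss for a correct prediction, a loss of 1 for any error."""
    return 0 if y == y_hat else 1

def empirical_loss(loss, h, examples):
    """Average loss of hypothesis h over the (x, y) examples."""
    return sum(loss(y, h(x)) for x, y in examples) / len(examples)

# Hypothetical data and two candidate hypotheses.
examples = [(1, 2), (2, 4), (3, 7)]
candidates = {"double": lambda x: 2 * x, "plus_one": lambda x: x + 1}
best = min(candidates,
           key=lambda name: empirical_loss(l2_loss, candidates[name], examples))
```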

    • Regularization
    • explicitly penalizing complex hypotheses
    • can search for hypotheses that minimize
      empirical loss + λ × complexity (λ sets the tradeoff between the two)

Lecture 20: More learning (18.7-18.8)

  • Artificial Neural Networks
    • Intro
    • neurons: brain cells
    • neural networks (NNs): networks of simulated neurons (units)
    • neuron "fires" when a linear combination of inputs exceeds some threshold

    • Neural network structures
    • units: the nodes of a NN
    • link: connections between nodes
    • activation: the output from a node
    • output of one node can be the input to another
    • weight: links have weights wi,j on them
    • unit j takes weighted sum of all inputs wi,j × ai
    • weighted sum is inj
    • bias weight: each node has a dummy input fixed to 1 with a weight on it
    • an activation function g converts inj to aj
    • perceptron: a unit with g as a hard threshold
    • sigmoid perceptron: a unit with g as a softer threshold
    • these are non-linear activation functions
    • feed-forward network: connections are only towards the output from input
    • recurrent network: allows loops (i.e., more complex, and powerful)
    • layers: single layer has input to units and output from those units.
    • hidden units: units in a layer between the inputs and the output units (their outputs are not outputs of the network)
    • classification/categorization: usually as many outputs as classes

    • Single-layer feed-forward neural networks
    • known as "perceptron networks"
    • activation function g determines training process
    • error is y - hw(x)
    • as this does 0/1 classification both y and hw(x) can be 0 or 1.
    • perceptron learning rule: assumes hard threshold, does weight updates depending on error
        wi ← wi + α(y - hw(x)) × xi
    • logistic regression: uses softened threshold, does weight updates depending on error
      • hw(x) = sigmoid function applied to the data (i.e., to x).
      • wi ← wi + α(y - hw(x)) × hw(x)(1 - hw(x)) × xi

    • function can be learned if it is linearly separable
      i.e., it learns linear decision boundaries
      OK = { and, or }     Not OK = { xor }
    • learning curve for perceptrons sometimes better than decision trees, sometimes not.
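
    A minimal sketch of the perceptron learning rule, assuming a hard threshold at 0 and a bias weight on a dummy input fixed to 1. The function names, the learning rate of 1, and the epoch count are choices made for this sketch.

```python
def predict(w, x):
    """Hard-threshold unit: 1 if the weighted sum (including bias) is >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, [1] + list(x))) >= 0 else 0

def train_perceptron(examples, alpha=1.0, epochs=100):
    """Apply wi <- wi + alpha * (y - hw(x)) * xi to each example in turn."""
    n_inputs = len(examples[0][0])
    w = [0.0] * (n_inputs + 1)          # w[0] is the bias weight (dummy input fixed to 1)
    for _ in range(epochs):
        for x, y in examples:
            err = y - predict(w, x)     # -1, 0, or +1 for 0/1 classification
            for i, xi in enumerate([1] + list(x)):
                w[i] += alpha * err * xi
    return w

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

w_and = train_perceptron(AND)
and_learned = all(predict(w_and, x) == y for x, y in AND)

w_xor = train_perceptron(XOR)
xor_learned = all(predict(w_xor, x) == y for x, y in XOR)
```

    AND is linearly separable, so the rule settles on weights that classify all four cases; XOR is not, so no setting of the weights can classify all four, however long training runs.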

    • Multilayer feed-forward neural networks
    • has hidden units in a layer or layers
    • network is a function hw(x) parameterized by weights w, where x is an input vector.
    • output is expressed as a fn of inputs and weights (including use of g)
    • train using gradient descent loss-minimization method
    • neural network does nonlinear regression
      -- i.e., fitting a non-linear fn to some data
      -- non-linear as NN provides nested non-linear threshold/activation fns.

    • Learning in Multilayer neural networks
    • goal output is y
    • NN returns hw(x)
    • error vector at output is y - hw(x)
    • outputs may depend on all weights in the NN
    • back-propagate error from output layer to hidden layers

    • at output layer, update rule adjusts weights depending on error:
    • Let Errk be error of kth element of error vector
    • Define
        Δk = Errk × g'(ink)
      where g' is the derivative of g, and ink is the sum of the inputs to unit k.
    • update rule for the weight between hidden unit j and output unit k is
        wj,k ← wj,k + α × aj × Δk
        α is the learning rate (how much you want to update the weight each time), and
        aj is the output from the hidden unit j.

    • at hidden layer, update rule adjusts weights depending on the amount of error for which the hidden layer unit might be responsible.
    • the Δk values are divided according to strength of connection between hidden node and all the connected output nodes k.
    • Define
        Δj = g'(inj) ∑k wj,k Δk
      where inj is the sum of the inputs to hidden unit j, the wj,k are the weights from unit j to all the output nodes to which it is connected, and Δk is the error for each of those nodes.
    • update rule for the weight between inputs and hidden unit j is
        wi,j ← wi,j + α × ai × Δj
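
    The two update rules can be sketched as a single back-propagation step. This is an illustration under stated assumptions: g is the sigmoid, there is one hidden layer and a single output unit, bias weights use a dummy activation of 1, and the starting weights and the example (x, y) are hypothetical.

```python
import math

def g(z):
    """Sigmoid activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def g_prime(z):
    """Derivative of the sigmoid."""
    return g(z) * (1.0 - g(z))

def forward(w_hidden, w_out, x):
    """Return (a_in, in_j, a_j, in_k, output) for a network with one output unit."""
    a_in = [1.0] + list(x)                                    # dummy input 1 for bias weights
    in_j = [sum(w * a for w, a in zip(ws, a_in)) for ws in w_hidden]
    a_j = [1.0] + [g(z) for z in in_j]                        # dummy activation 1 for output bias
    in_k = sum(w * a for w, a in zip(w_out, a_j))
    return a_in, in_j, a_j, in_k, g(in_k)

def backprop_step(w_hidden, w_out, x, y, alpha=0.5):
    """One update of all weights for a single example (x, y)."""
    a_in, in_j, a_j, in_k, out = forward(w_hidden, w_out, x)
    delta_k = (y - out) * g_prime(in_k)                       # delta_k = Err_k * g'(in_k)
    # delta_j = g'(in_j) * sum_k w_jk * delta_k; with one output unit the sum has one term.
    delta_j = [g_prime(in_j[j]) * w_out[j + 1] * delta_k for j in range(len(in_j))]
    new_w_out = [w + alpha * a * delta_k for w, a in zip(w_out, a_j)]
    new_w_hidden = [[w + alpha * a * delta_j[j] for w, a in zip(ws, a_in)]
                    for j, ws in enumerate(w_hidden)]
    return new_w_hidden, new_w_out

# Hypothetical starting weights: 2 inputs, 2 hidden units, 1 output unit.
w_hidden = [[0.1, 0.4, -0.3], [-0.2, 0.3, 0.5]]
w_out = [0.2, -0.4, 0.6]
x, y = (1.0, 0.0), 1.0

error_before = abs(y - forward(w_hidden, w_out, x)[-1])
w_hidden, w_out = backprop_step(w_hidden, w_out, x, y)
error_after = abs(y - forward(w_hidden, w_out, x)[-1])
```

    Repeating the step over all training examples for many passes is what gradient-descent training amounts to; a single step should only nudge the output toward y.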

    • Learning neural network structures
    • if using fully connected networks, the choices are how many hidden layers and their sizes.
    • usually trial and error
    • use cross validation technique to estimate error.

  • Nonparametric Models (skim!)
    • a parametric model uses a fixed number of parameters (e.g., the size of x )
    • nonparametric model can change with more data
    • instance-based learning stores data as it arrives
    • simple table: ask for h(x) find x in the table and return the y
    • if not in table then a problem.
    • use k-nearest neighbors in the stored data
    • take plurality vote of the neighbors as the answer.
    • nearest: needs a distance metric
    • use Manhattan distance or Euclidean distance between query and data points
    • works well in low-dimensional spaces, with lots of data
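
    A minimal sketch of k-nearest-neighbors lookup with a plurality vote, using a plain linear scan of the stored data. The function names and the stored points are hypothetical.

```python
import math
from collections import Counter

def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Euclidean (straight-line) distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def knn_classify(stored, query, k=3, distance=euclidean):
    """Plurality vote among the k stored (x, y) examples nearest to the query."""
    nearest = sorted(stored, key=lambda ex: distance(ex[0], query))[:k]
    votes = Counter(y for _, y in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical stored examples: two well-separated groups.
stored = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
          ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B")]
label = knn_classify(stored, (2, 2), k=3)
```

    The linear scan looks at every stored example for each query; a k-d tree is one way to avoid that when there is a lot of data.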

    • k-d trees: balanced binary tree with arbitrary number of dimensions
    • split data along one dimension at each node, cycling through the dimensions
    • nearest neighbors is easy if query isn't near a boundary
    • if it is you need to check on both sides of the split
    • works well with up to 20 dimensions with millions of examples

Lecture 21: Knowledge in Learning (19.1-19.3)

  • Logical formulation of learning
    • ML using prior knowledge of the world to learn hypothesis
    • put Hypothesis (h), Examples and matching Classifications (x's and y's) as set of logical sentences
    • given new example (in logic) use h to infer classification

    • Examples and hypotheses
    • examples in terms of values for Attributes
    • example x1: Alternate=Yes, Bar=No, Fri=No, Hungry=Yes, ...
    • i.e., Alternate(X1) ∧ ¬Bar(X1) ∧ ¬Fri/Sat(X1) ∧ Hungry(X1)...
    • classification (Goal predicate) -- WillWait(X1)   or   ¬WillWait(X1)
    • each hyp hj is in form -- ∀x Goal(x) ⇔ Cj
      where candidate definition Cj is a logical expression
    • Cj for a decision tree can be expressed as a logical expression for each path (using ∧) linked by ∨
    • hj predicts that the set of examples that satisfies Cj are examples of Goal(x)
    • Those examples are the "extension" of the goal
    • Hyp space H = {h1, ..., hn}
    • Learning alg believes h1 ∨ h2 ∨ ... ∨ hn

    • if hi not consistent with new example it can be removed
      • can be false negative for hi
        h falsely says that it should be negative, but it is in fact positive
      • can be false positive for hi
        h falsely says that it should be positive, but it is in fact negative
    • note that hyp space H is vast, so this is not practical via theorem proving.

    • Current-best-hypothesis search
    • maintain single h and adjust it as new examples arrive
    • for each hi keep all examples that it classifies (+ve) (the extension)
    • those examples define the hypothesis
    • if new example is false negative -- include in the extension ("generalization")
    • if new example is false positive -- remove from the extension ("specialization")
    • note that when doing generalization or specialization you need to check that the result is compatible with previously seen examples.

    • in fact what is needed is for hi to be modified to reflect generalization or specialization.
    • for generalization hi needs to become less precise (drop conditions from Ci)
    • for specialization hi needs to become more precise (add conditions to Ci)
    • at each step there are multiple possibilities, not all of which are good, but a choice must be made, so backtracking will be needed.
    • at each step checking that the result is compatible with previously seen examples is expensive.
    • i.e., with large number of examples and large hyp space H it isn't practical.

    • Least-commitment search (Version space)
    • least-commitment: make least change necessary
    • keep around summary of all hyps consistent with data seen so far
    • new example may alter summary slightly to reduce it
    • "version space": only those hyps still consistent with data (after reduction)
    • incremental learning
    • version space defined by upper boundary G (general) and lower boundary S (specific)
    • *** do simple example **
    • G starts with True (i.e., the most general hypothesis)
    • S starts with False (i.e., the most specific hypothesis)
    • S and G get updated by +ve and -ve examples
    • any hyp between S and G must agree with all the examples
    • updates
      • False positive for Si --- Si is too general, so throw it out of S
      • False negative for Si --- Si is too specific, so replace it by all of its immediate generalizations (i.e., move that portion of S up towards G)
      • False positive for Gi --- Gi is too general, so replace it by all of its immediate specializations (i.e., move that portion of G down towards S)
      • False negative for Gi --- Gi is too specific, so throw it out of G
    • results
      • one hyp remains (hooray!)
      • S or G becomes empty (i.e., no consistent h for training set)
      • run out of examples with several h remaining
    • Version space approach is probably not practical in many situations (especially with noise), but it's a great model

  • Knowledge in learning
    • ...skim this section...
    • moral: background knowledge can allow faster learning
    • Note Explanation Based Learning (EBL)
    • Hypothesis: what is being learned (h)
    • Descriptions: all the examples (x's)
    • Classifications: all the classifications (y's)
    • Background: existing relevant knowledge
        Hypothesis ∧ Descriptions |= Classifications
        Background |= Hypothesis

  • Explanation based learning (EBL)
    • Intro
    • converts general "first-principles" theories to useful special-purpose knowledge
    • allows reasoning speedup in the future
    • takes the solution to a specific problem and learns a general rule for a class of similar problems.
    • more than just memoization (where only the specific case is learned)
    • it works by "explaining" a solution

    • Extracting general rules from examples
    • construct proof for problem (e.g., using backward-chaining theorem prover)
    • e.g., prove Derivative(X², X) = 2X
    • e.g., prove Simplify(1 × (0 + X), w)
        i.e., can it be simplified?
    • construct two proof trees simultaneously
      • original proof
      • the same proof with all constants replaced by variables
        i.e., a generalized proof tree
    • extract general rule from generalized proof tree

    • EBL steps
      1. construct proof of example using background knowledge
      2. also construct parallel proof with variables
      3. construct new rule with lhs including leaves of proof tree ⇒ rhs as example with variables and bindings applied.
          i.e., lhs terms are the conditions that the background knowledge shows to be true, which need to be true to make this inference again in the future
      4. drop any conditions on lhs that are true regardless of values of variables in rhs
      5. result is a new rule that summarizes the result of applying background knowledge
          ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)

    • Improving efficiency
    • can also extract more general rules from the generalized proof tree by using non-leaf nodes
    • tradeoff: general rules apply to more cases, but don't find answer as directly
    • tradeoff: adding lots of specific rules makes each one apply directly to a specific set of situations, but finding the right one becomes harder (increased branching factor!)
    • tradeoff: check whether parts of each new rule are easy to solve, but this makes learning time longer.
    • tradeoff: "easy to solve" varies as rules are added.

    Lecture 22: Reinforcement Learning (21.1-21.2)

    • Introduction
      • "reward" or "reinforcement": feedback for action
      • Markov Decision Processes (MDPs): quick overview
      • reinforcement learning: based on rewards
      • simple, fully observable environments, but with probabilistic action outcomes
      • possible use by different agent types
        • utility-based agent: learns utility function on states
          -- uses it to select actions that maximize expected outcome utility
        • Q-learning agent: learns action-utility function (Q-function)
          -- the expected utility of taking a given action in a given state
        • reflex agent: learns a policy that maps states directly to actions
      • Model based vs. Model free
        • Model based approach to RL
          -- learn MDP model: transitions and rewards (or approximation)
        • Model free approach to RL
          -- do not learn the model

    • Passive Reinforcement Learning
      • "passive learning": agent's policy is fixed, learn utilities of states
      • state-based representation, fully observable environment
      • given a policy
      • goal: learn how good the policy π is
        i.e., learn utility function Uπ(s)
      • does not know transition model in advance
      • does not know reward function in advance
      • agent makes "trials" using the policy
      • each trial runs to the terminal state
      • the agent's percepts supply the current state s and the reward for that state.
      • use reward info to learn the expected utility for each state s

        Direct utility estimation
      • reward-to-go: expected total reward from that state onwards to terminal state
      • after each trial calculate reward-to-go for each state, and make expected utility for that state the running average.
      • use reward-to-go as direct evidence of actual expected utility for state
      • need many trials to get right answer (converges slowly).
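
      Direct utility estimation can be sketched as a running average of observed reward-to-go. This is a minimal illustration: the representation of a trial as a list of (state, reward) pairs, the state names, and the rewards are all assumptions made for the sketch.

```python
from collections import defaultdict

def direct_utility_estimate(trials, gamma=1.0):
    """Running average of observed reward-to-go for each state.

    Each trial is a list of (state, reward) pairs ending at a terminal state.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for trial in trials:
        reward_to_go = 0.0
        # Walk backwards from the terminal state, accumulating reward-to-go.
        for state, reward in reversed(trial):
            reward_to_go = reward + gamma * reward_to_go
            totals[state] += reward_to_go
            counts[state] += 1
    return {s: totals[s] / counts[s] for s in totals}

# Hypothetical trials under a fixed policy; rewards are made up.
trials = [
    [("A", -1.0), ("B", -1.0), ("Goal", 10.0)],
    [("A", -1.0), ("C", -1.0), ("Pit", -10.0)],
]
U = direct_utility_estimate(trials)
```

      State A appears in both trials, so its estimate is the average of its two observed reward-to-go values; the other states have been seen only once each.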

      • however, utilities of states are not independent, as...
        The utility of each state equals its own reward plus the expected utility of its successor states
      • They obey Bellman's equations Uπ(s) = R(s) + γΣs'P(s' | s, π(s))Uπ(s')
      • i.e., Uπ(s) depends on Uπ(s'), the next state's utility

        Adaptive Dynamic Programming
      • does trials as before
      • learns transition probabilities from observations
        -- how often do you get to s' from s by doing a?
      • learns reward function R(s) from observations
        -- in new state, just store the reward given
      • plugs values into Bellman equations
      • solve for utilities

        Temporal Difference (TD) Learning
      • make computation easier and obtain an approximate utility
      • just adjust utility of state based only on the observed successor
      • don't need transition model, as transitions are observed.
      • e.g., after some learning, calculate
        Uπ(1,3) = R(1,3) + Uπ(2,3)
        where (2,3) is the observed successor
      • if that calculated value ≠ current utility value for Uπ(1,3) then update it in the right direction.
      • update using the TD Update Rule for s to s'
        Uπ(s) ← Uπ(s) + α( R(s) + γUπ(s') - Uπ(s) )
        where α = learning rate, γ = discount
      • R(s) + γUπ(s') is approx/noisy utility measure
      • make learning rate gradually decrease with the number of samples
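
      A minimal sketch of passive TD learning using the update rule above. The trial format, the state names and rewards, and the choice of learning rate 1/n (where n is the number of visits to the state) are assumptions made for the sketch; other decreasing schedules are possible.

```python
from collections import defaultdict

def td_update(U, s, reward, s_next, alpha, gamma=1.0):
    """Apply U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s)) for one observed transition."""
    U[s] += alpha * (reward + gamma * U[s_next] - U[s])

def td_learn(trials, gamma=1.0):
    """Each trial is a list of (state, reward) pairs ending at a terminal state."""
    U = defaultdict(float)
    visits = defaultdict(int)
    for trial in trials:
        for (s, r), (s_next, _) in zip(trial, trial[1:]):
            visits[s] += 1
            alpha = 1.0 / visits[s]       # assumed schedule: decreases with the number of samples
            td_update(U, s, r, s_next, alpha, gamma)
        terminal, r_terminal = trial[-1]
        U[terminal] = r_terminal          # utility of a terminal state is its reward
    return dict(U)

# Hypothetical deterministic trial, repeated; rewards are made up.
trial = [("A", -1.0), ("B", -1.0), ("Goal", 10.0)]
U = td_learn([trial] * 50)
```

      No transition model is used: each update looks only at the observed successor. The estimates approach the true utilities gradually, and states further from the terminal state lag behind those nearer to it.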

    Lecture 23: Natural Language Processing (22.1-22.4)

    • Intro
      • knowledge acquisition: need language understanding for getting new knowledge
    • Language models
      • language model: predict the probability distribution of language
      • language: set of strings of characters
      • grammar: rules that define legal structure (syntax)
      • semantics: allocate meaning
      • natural language: English, Spanish, ...
      • word combinations have probabilities (some rare; some sorta OK)
      • ambiguity: probability distribution over possible meanings
        -- "He saw her duck"
      • language is huge so models are approximate

        N-gram character models
      • simple language model: probability distribution over characters
      • probability of sequence of N characters P( c1:N )
      • e.g., P("the") = 0.027
      • n-gram: sequence of length n
        --- (bigram, trigram samples)
        --- Google books Ngram Viewer
      • n-gram is Markov chain of order n-1
        --- P(ci) depends on immediately preceding characters (e.g., previous 2 for a trigram)
      • i.e., for a trigram model, P(c1:N) = Πi=1..N P(ci | ci-2:i-1)
      • extract n-gram probabilities from a corpus (large body of text)

      • language identification: given text, what language is it written in ?
      • trigram model of each language (i.e., probabilities)
      • i.e., have P(text|language)
      • want P(language|text)
      • = P(text|language)P(language)/P(text) and drop P(text)
      • P(language) is dominated by P(text|language) term in calculation so it can be approximate and still OK
        argmaxl P(l) Πi=1..N P(ci | ci-2:i-1, l)   (l ranges over languages)

        Smoothing n-gram models
      • one corpus isn't the same as another, so an n-gram model is approximate
      • things claimed to be 0 probabilities actually are possible
      • smoothing: adjust zero probabilities up, and others slightly down (sum to 1)
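
      A minimal sketch of language identification with smoothed character trigram models. Add-one smoothing is used here as one simple way to keep unseen trigrams from getting zero probability; it is an assumption of the sketch, as are the function names and the tiny corpora (a real model would be trained on a large corpus).

```python
import math
from collections import Counter

def train_trigrams(text):
    """Count character trigrams and their two-character contexts in a corpus."""
    trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    contexts = Counter(text[i:i + 2] for i in range(len(text) - 2))
    return trigrams, contexts, set(text)

def log_prob(text, model, vocab_size):
    """Sum of log P(ci | ci-2:i-1), with add-one smoothing."""
    trigrams, contexts, _ = model
    total = 0.0
    for i in range(len(text) - 2):
        tri = text[i:i + 3]
        total += math.log((trigrams[tri] + 1) / (contexts[tri[:2]] + vocab_size))
    return total

def identify(text, models, priors):
    """argmax over languages of log P(language) + log P(text | language)."""
    alphabet = set(text)
    for _, _, chars in models.values():
        alphabet |= chars
    return max(models, key=lambda lang: math.log(priors[lang])
               + log_prob(text, models[lang], len(alphabet)))

# Tiny hypothetical corpora.
models = {
    "english": train_trigrams("the cat and the dog saw the other thing there"),
    "spanish": train_trigrams("el gato y el perro que estaba en la casa del pueblo"),
}
priors = {"english": 0.5, "spanish": 0.5}
guess = identify("the other cat", models, priors)
```

      Working with sums of logs rather than products of probabilities avoids numerical underflow on long texts.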

        N-gram word models
      • n-grams for words
      • probability of word sequence
      • 3-gram word model sentences are starting to look somewhat reasonable

    • Text Classification
      • categorization: given text what type is it?
      • e.g., spam, positive/negative movie review, ...
      • could use supervised learning
      • "features" for category: word level, character level
      • keep top 100 or so features
      • can use supervised learning with features (e.g., decision tree)

      • train n-gram word model for ¬spam and another for spam.
      • P(category|message) ∝ P(message|category)P(category)
        by Bayes rule and ignoring P(message)
      • pick larger probability P(¬spam|message) vs. P(spam|message)
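
      A minimal sketch of this classifier using 1-gram (unigram) word models, the simplest n-gram word model. The training messages, the priors, and the use of add-one smoothing for unseen words are assumptions made for the sketch.

```python
import math
from collections import Counter

def train_unigram(messages):
    """Word counts for one category (a 1-gram word model)."""
    return Counter(word for msg in messages for word in msg.lower().split())

def log_score(message, counts, prior, vocab_size):
    """log P(category) + log P(message | category), with add-one smoothing."""
    total_words = sum(counts.values())
    score = math.log(prior)
    for word in message.lower().split():
        score += math.log((counts[word] + 1) / (total_words + vocab_size))
    return score

def classify(message, models, priors):
    """Pick the category with the larger P(message | category) P(category)."""
    vocab = set(message.lower().split())
    for counts in models.values():
        vocab |= set(counts)
    return max(models, key=lambda cat: log_score(message, models[cat],
                                                 priors[cat], len(vocab)))

# Hypothetical training messages.
spam = ["win money now", "free money offer", "win a free prize now"]
not_spam = ["meeting at noon", "project report due monday", "lunch at noon"]
models = {"spam": train_unigram(spam), "not_spam": train_unigram(not_spam)}
priors = {"spam": 0.5, "not_spam": 0.5}
label = classify("free prize money", models, priors)
```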

      • can use data compression for classification
      • e.g., add new msg to spam and compress, add same msg to ¬spam and compress, the greatest relative reduction indicates category!

    • Information Retrieval (IR)
      • task of finding relevant documents
      • needs
        • corpus of documents
        • query in query language
        • result set (possibly relevant documents)
        • presentation of result set
      • Boolean keyword model
        -- query language with AND/OR/NOT
        -- look in document for keywords

      • IR scoring functions: query returns a score for a document
      • high score = high relevance
      • TF = frequency of a word in a document
      • IDF = inverse document frequency of a word
        --- if a word appears in most documents it has less importance
      • DF = the number of documents that contain a word
      • use these to return a score for a document and some query words.
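
      One simple way to combine TF, DF and IDF into a score is sketched below. The formula (sum over query words of TF × log(N / DF)) is an assumption chosen for brevity, not necessarily the scoring function used in the course; the documents are made up.

```python
import math
from collections import Counter

def tf_idf_scores(query, documents):
    """Score each document as the sum over query words of TF * IDF.

    IDF is taken as log(N / DF) here, one simple choice among several.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # DF: number of documents that contain the word.
    df = Counter(word for words in tokenized for word in set(words))
    scores = []
    for words in tokenized:
        tf = Counter(words)             # TF: count of each word in this document
        score = sum(tf[q] * math.log(n_docs / df[q])
                    for q in query.lower().split() if df[q] > 0)
        scores.append(score)
    return scores

documents = [
    "the wumpus smells bad",
    "the agent smells the wumpus",
    "the agent grabs the gold",
]
scores = tf_idf_scores("the gold", documents)
best = max(range(len(documents)), key=lambda i: scores[i])
```

      The word "the" appears in every document, so its IDF is log(1) = 0 and it contributes nothing; only the rarer query word separates the documents.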

      • Precision = proportion of result set that are actually relevant
      • Recall = proportion of all relevant documents in corpus that are returned in the result set.
      • can make tradeoffs between P and R
      • tweaks include adjusting case (car = CAR = Car); stemming (run = runs = running); synonyms (sofa = couch)
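
      Precision and recall can be computed directly from the two sets involved. A minimal sketch, with hypothetical document ids:

```python
def precision_recall(result_set, relevant):
    """Precision and recall for a returned result set, given the truly relevant documents."""
    result_set, relevant = set(result_set), set(relevant)
    hits = result_set & relevant        # returned documents that are actually relevant
    precision = len(hits) / len(result_set) if result_set else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical document ids: 4 returned, 6 relevant in the corpus, 2 in common.
p, r = precision_recall(result_set=[1, 2, 3, 4], relevant=[2, 4, 5, 6, 7, 8])
```

      Returning more documents can only keep recall the same or raise it, but tends to lower precision, which is the tradeoff between P and R.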

      • PageRank developed by Google
      • PR(p) depends on PR of all pages that link to page p, and the count of number of links from each of the pages that link to p.
        i.e., depends on Σi( PR(ini)/C(ini) )
      • the HITS algorithm first gets pages that satisfy query, then does a similar sort of analysis
      • Finds Hubs and Authorities
      • e.g., authority pages have many relevant pages pointing to them.
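
      The PageRank dependency above can be sketched as an iterative computation of PR(p) = (1 - d)/N + d × Σi PR(ini)/C(ini), where d is a damping factor and N the number of pages. The value d = 0.85, the iteration count, and the four-page web are assumptions for the sketch, which also assumes every page has at least one outgoing link.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Iteratively compute PR(p) = (1 - d)/N + d * sum over in-links i of PR(i)/C(i).

    `links` maps each page to the list of pages it links to; C(i) is len(links[i]).
    Assumes every page has at least one out-link.
    """
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}    # start with equal rank everywhere
    for _ in range(iterations):
        new_pr = {}
        for p in pages:
            incoming = sum(pr[i] / len(links[i]) for i in pages if p in links[i])
            new_pr[p] = (1 - damping) / n + damping * incoming
        pr = new_pr
    return pr

# Hypothetical four-page web.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
top = max(ranks, key=ranks.get)
```

      Page D has no incoming links, so its rank stays at the baseline (1 - d)/N, while the page with the most incoming links ends up ranked highest.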

      • Question answering: query is a question
      • been around for a while!
          D.C.Brown (1974) A survey and analysis of question answering systems,
          M.Sc. Thesis, University of Kent, Canterbury, England.
      • Can use standard question types
      • Convert questions into standard type, then into web search query.
      • Selections of text retrieved are analysed.
      • Uses knowledge about what type of answer is expected
        e.g., who vs. how many expects name vs. number
        (used in Watson)

    • Information extraction
      • Acquire knowledge by skimming text and looking for objects & relationships
        e.g., extract addresses
      • Approaches:
        • Finite-state automata
        • Probabilistic models (skip this)
        • Conditional random fields (skip this)
        • Ontology extraction
        • Automated template construction
        • Machine reading

      • Finite-state automata
        • assume text is description of single thing
        • extract attributes (e.g., Manufacturer, Model, Price)
        • define "template" for each attribute
        • template defined as a finite-state automaton (e.g., regular expression)
        • regex -- can define sequence, repetition, optional items
        • template may have test for pre and post context
          e.g., price is 100 dollars
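
        A template for the Price attribute can be sketched as a regular expression that tests the context around a number. The pattern below is a hypothetical template, accepting either the pre-context "$" or the post-context "dollars".

```python
import re

# Hypothetical template: a number with pre-context "$" or post-context "dollars".
PRICE = re.compile(r"(?:\$\s?(\d+(?:\.\d\d)?))|(?:(\d+(?:\.\d\d)?)\s+dollars)")

def extract_price(text):
    """Return the first price found as a float, or None if the template does not match."""
    match = PRICE.search(text)
    if match is None:
        return None
    # Exactly one of the two groups is filled, depending on which context matched.
    return float(match.group(1) or match.group(2))

price = extract_price("The price is 100 dollars, shipping included.")
```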

        • finite-state automata can be cascaded (sequence)
        • modularizes the knowledge
        • works very well with text in restricted domains
        • 1st tokenize
        • 2nd detect complex words (e.g., company names)
        • 3rd group words and tag (e.g., noun phrases)
        • 4th handle complex phrases
        • 5th merge related structures

      • Ontology extraction
        • build ontology of facts from large corpus
        • precision is vital
        • use very general templates
        • templates that match fact-giving syntax

      • Automated template construction
        • looking for templates that reveal particular relation
          e.g., subcategory; author-title; etc.
        • start with some examples in the form of simple templates
        • use those to retrieve text
        • infer other templates from the text
        • use context around the match to add to new templates (e.g., "type of"; "wrote")

      • Machine reading
        • needs to learn many templates
        • start with general syntactic templates
        • learns underlying probabilities

    Lecture 24: Natural Language for Communication (23)

    • Communication
      • language is intended to send messages
      • syntax = structure
      • semantics = meaning
      • pragmatics = practical issues affecting meaning that relate to context
      • language is too vast and complex for trigrams to be the only tool

    • Phrase Structure Grammars
      • need rules that define the legal language -- a grammar
      • part of speech (lexical category) -- Noun, Verb, Article, Pronoun, etc.
      • syntactic categories -- noun phrase (NP), verb phrase (VP)
      • combinations form phrase structure of sentence -- e.g., NP VP
      • Non-terminals -- Article, Noun, NP, ...
      • Terminals -- "the", "wumpus", ...
      • parsing -- finding the structure of a sentence using grammar
        usually tree form
        [S [NP [Article "every"] [Noun "wumpus"]] [VP [Verb "smells"]]]
      • generation -- using the grammar rules to produce sentences
      • simple grammars can overgenerate (e.g., "me go home")

      • the form of the rules alters the complexity of the languages that the grammar can parse/generate (Chomsky Hierarchy)
        • recursively enumerable (unrestricted rules)
        • context-sensitive (can apply a rule in a specific context)
        • context-free (used in any context)
        • regular (highly restricted)
      • context-free grammar
          S → NP VP
          NP → Article Noun
      • probabilistic context-free grammar (PCFG)
          S → NP VP [0.90]
          NP → Article Noun [0.25]
      • probability assigned to every string
      • lexicon -- words with lexical category and probabilities
      • probability of sentence is product of probabilities of rules and words
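
      The product of rule and word probabilities can be sketched for a single parse tree. The first two rule probabilities are the ones given above; the VP rule probability and all lexicon probabilities are hypothetical values chosen for illustration.

```python
# Rule probabilities: the first two are from the grammar above; the third is hypothetical.
RULES = {
    ("S", ("NP", "VP")): 0.90,
    ("NP", ("Article", "Noun")): 0.25,
    ("VP", ("Verb",)): 0.40,
}
# Hypothetical lexicon: P(word | lexical category).
LEXICON = {
    ("Article", "every"): 0.05,
    ("Noun", "wumpus"): 0.15,
    ("Verb", "smells"): 0.10,
}

def tree_probability(tree):
    """Product of the probabilities of every rule and word used in a parse tree.

    A tree is (category, child, child, ...) and a leaf is (category, "word").
    """
    category, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return LEXICON[(category, children[0])]
    rhs = tuple(child[0] for child in children)
    p = RULES[(category, rhs)]
    for child in children:
        p *= tree_probability(child)
    return p

# [S [NP [Article "every"] [Noun "wumpus"]] [VP [Verb "smells"]]]
parse = ("S",
         ("NP", ("Article", "every"), ("Noun", "wumpus")),
         ("VP", ("Verb", "smells")))
p = tree_probability(parse)
```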

    • Syntactic Analysis (Parsing)
      • Parsing: using grammar to find phrase structure
      • top down: start with S and work down to words
      • bottom up: start with words and work up to S
      • use memory (chart) to keep track of successful parses of parts of the sentence, to avoid reparsing them later
      • syntactic ambiguity: multiple ways to parse a sentence
        "he eats grass and leaves" (leaves can be a N or a V)
      • look for best parse -- related to probability
      • could use A* with cost 1/p of parse found so far

      • learning probabilities for PCFGs
      • learn grammar from data
      • large corpus of correctly parsed sentences (treebank)
      • extract rules from parses and count frequencies to learn the probabilities

    • Augmented grammars and Semantic Interpretation
      • lexicalized PCFGs
      • probabilities depend on relationships between words that rule includes
        "eat a banana" vs. "eat a bandana"
      • augmented PCFG includes syntactic structure as well as word relationships
      • 'head' of phrase is most important word (e.g., v = "eat", n = "banana")
      • VP(v) → Verb(v) NP(n)   [P(v, n)]
      • P(v, n) depends on v and n.
      • P(eat, bandana) is very low
      • use smoothing for very low probabilities so that they aren't zero
      • can learn P(v, n) from treebank

      • grammar rules can be expressed in logic
      • parsing can be expressed as logical inference
      • not really practical for unrestricted parsing
      • could be used for language generation

      • Case agreement and subject-verb agreement
      • there are a variety of additional linguistic rules that need to be expressed somehow in order to parse/generate correctly.
      • getting them all into the grammar could mean adding lots of extra non-terminals
      • e.g., subjective case ("I"), objective case ("me")
      • e.g., subject-verb agreement ("I smell bad", "he smells bad", "they smell bad")
      • Instead, add parameters to the non-terminals
        NP(c, pn, head)
        c = case, pn = person/number (e.g., 1st person singular), head = head word of phrase

      • Semantic interpretation
      • compositional semantics: semantics of phrase depends on semantics of subphrases
        i.e., the meaning can be built up during bottom-up parse
      • syntax rules annotated with semantic functions
      • meanings carried up the parse tree and composed
      • "John loves Mary" → Loves(John, Mary)
      • meaning of "loves" is the lambda expression
        λy λx Loves(x,y)
      • "Mary" gets bound to y, on one branch of parse tree.
      • Higher up the parse tree, "John" gets bound to x.
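
The binding order can be mimicked with nested Python lambdas. This is only an illustration of the composition step; the tuple ("Loves", x, y) is an assumed stand-in for the logical term Loves(x, y).

```python
# Meaning of "loves": λy λx Loves(x, y), written as nested lambdas.
loves = lambda y: lambda x: ("Loves", x, y)

john, mary = "John", "Mary"

vp_meaning = loves(mary)        # VP "loves Mary": y is bound, leaving λx Loves(x, Mary)
s_meaning = vp_meaning(john)    # S "John loves Mary": x is bound higher up the tree
```

The object is consumed first because the verb combines with its NP inside the VP before the subject is attached at the S node.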

      • Pragmatics -- influence of current situation on the meaning
      • Indexicals: "I am in Worcester today" -- "I", "today"
      • Speech Act: determining speaker's intent
        "Could you close the door?" ("yes, I could")
      • could even require input from perception
        "Give me that book"

      • Ambiguity!
      • "Squad helps dog bite victim"
      • Almost every utterance is ambiguous.
      • Alternative meanings get pruned out by native speakers.
      • Lexical ambiguity: "bank" two kinds of noun, a verb, and an adjective
      • Syntactic ambiguity: "I saw the flower in the park"
        seeing in the park, flower in the park
      • Metaphor: "All the world's a stage" (no it isn't)
      • Disambiguation: needs knowledge
        • World model: knowledge of what is likely in the world
        • Mental model: speaker's belief and hearer's belief
        • Language model: likelihood of certain string of words
        • Acoustic model: concerns sequences of sounds

    • Machine Translation
      • translate source to target (e.g., English to French)
      • perfect translation requires complete understanding of the text
      • Alternative meanings get pruned out by native speakers.
        → Alternatív jelentések kap metszett ki anyanyelvű.
        → Los informes alternativos se cortan fuera a hablar.
        → Alternative reports are cut out to speak.
      • other languages have different words for different situations where English may have one (and vice versa)
      • Levels of translation:
        • English → Interlingua → French
        • English Semantics → French Semantics
        • English Syntax → French Syntax
        • English words → French Words

      • Statistical machine translation
      • use large bilingual corpus of translations to train probabilistic model
      • f* = argmax_f P(f | e) = argmax_f P(e | f) P(f)
      • P(e | f) is a translation model (but P(f | e) can be found directly)
      • P(f) is a language model for French
      • Phrase approach -- find best French phrase for each short English phrase
      • P(f_i | e_i) are known
      • sequence of French phrases is 'distorted' to a new order (for better French)
      • P(d_i) distortion probabilities are known (learned)
      • P(f, d | e) = Π_i P(f_i | e_i) P(d_i)
      • use a search to find best f for the e.
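
A rough sketch of scoring one candidate with the phrase model above, i.e., the product over phrases of P(f_i | e_i) P(d_i). All probabilities, the distortion values and the function name translation_score are hypothetical; a real system would search over many segmentations and orderings.

```python
from math import prod

def translation_score(phrase_pairs, distortions, phrase_p, distortion_p):
    """P(f, d | e) = product over i of P(f_i | e_i) * P(d_i) for one candidate."""
    return prod(phrase_p[(f, e)] * distortion_p[d]
                for (f, e), d in zip(phrase_pairs, distortions))

# Hypothetical tables: P(f_i | e_i) and P(d_i).
phrase_p = {("le", "the"): 0.8, ("noir", "black"): 0.7, ("chien", "dog"): 0.5}
distortion_p = {0: 0.6, 1: 0.2, -2: 0.1}

# "the black dog" -> "le chien noir": "noir" and "chien" swap places,
# expressed here by the assumed distortion values 0, +1, -2.
pairs = [("le", "the"), ("noir", "black"), ("chien", "dog")]
score = translation_score(pairs, [0, 1, -2], phrase_p, distortion_p)
```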

    • Speech recognition
      • Speech recognition: identify sequence of spoken words
      • many problems...
      • Segmentation: no pauses between spoken words
      • Coarticulation: adjacent sounds affect each other
      • Homophones: to, too, two.
      • Use vector of features from audio signal to represent the speech
      • argmax P(word | sound) = argmax P(sound | word) P(word)
        for some time period
      • P(sound | word) is the acoustic model -- the sounds of words
      • P(word) is the language model (for each utterance)
      • Markov assumption: the current state Word_t depends on a fixed number of previous states.
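
A toy illustration of argmax P(sound | word) P(word) for the homophones above. It assumes a first-order Markov (bigram) language model, so P(word) becomes P(word | previous word); every number is invented.

```python
def best_word(prev_word, candidates, acoustic_p, bigram_p):
    """argmax over words of P(sound | word) * P(word | previous word)."""
    return max(candidates,
               key=lambda w: acoustic_p[w] * bigram_p.get((prev_word, w), 0.0))

# Hypothetical numbers: the three homophones fit the sound equally well,
# so the language model has to decide.
acoustic_p = {"to": 0.3, "too": 0.3, "two": 0.3}
bigram_p = {
    ("want", "to"): 0.20, ("want", "too"): 0.001, ("want", "two"): 0.01,
    ("the", "to"): 0.0001, ("the", "too"): 0.0001, ("the", "two"): 0.02,
}
words = ["to", "too", "two"]
after_want = best_word("want", words, acoustic_p, bigram_p)   # "to"
after_the = best_word("the", words, acoustic_p, bigram_p)     # "two"
```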

      • Acoustic Model
      • sound waves --- A-to-D converter --- sampling rate
      • quantization factor: precision of each measurement (8-12 bits)
      • phones: different speech sounds (about 100)
      • phoneme: smallest unit of sound with a distinct meaning for a language (e.g., pill vs. kill)
      • kit vs. skill --- the K is two different phones but one phoneme
      • frames: overlapping time slices through signal (e.g., 10 ms)
      • vector of discrete features for each frame (e.g., energy at different frequencies)

      • phone model
      • transition probabilities between parts of a phone
      • forms a hidden Markov model
      • parts have expected features
      • parts are onset, middle, end
      • could take 5-10 frames as input and recognize a phone (e.g., [m])

      • pronunciation model
      • transition probabilities between phones
      • e.g., [ t ow m aa t ow ]
      • can augment to show dialect variation and coarticulation
      • [t] [ow] vs. [t] [ah] at the start of "tomato"

      • Language Model
      • based on corpus of task-specific text
      • use transcripts of spoken interactions (e.g., airline reservations)
      • include all task-specific vocabulary
      • have voice interface ask specific questions to constrain user input

      • Building a Speech Recognizer
      • Components:
        • high quality microphone
        • low background noise
        • signal processing algorithms
        • features used
        • phone models
        • word pronunciation models
        • language model
      • phone models & word pronunciation models often hand developed
      • probabilities come from speech corpus
      • models can now be learned automatically
      • performance error less than 1% for limited topics
      • up to 10-20% error in larger vocabularies
      • task specific interaction lowers error

    Lecture 25: Perception

    • Intro
      • Perception: interpreting response of sensors
      • vision, hearing, touch -- plus radio, GPS, infrared, etc
      • sensor model: sensor (S) provides evidence about the environment (E), i.e., P(E | S)
      • object model: describes objects in the world (e.g., 3D geometry)
      • rendering model: how stimulus is produced from the world (e.g., lighting)
      • lots of ambiguity in vision: some managed by using prior knowledge
      • video camera may deliver 10 GB per minute
      • i.e., what to use, what to ignore?

      • feature extraction: simple computations applied to sensor observations
      • recognition: making key distinctions between objects, perhaps labelling them
      • reconstruction: build geometric model of world from image(s)

    • Image formation
      • imaging distorts the appearance of objects (e.g., perspective, foreshortening) *1*
      • scene → sensor → 2D image
      • pixels: smallest units of image
      • image formed at the image plane (e.g., via pin-hole camera) *2*
      • f is distance from pinhole to image plane
      • (x,y) is point on image plane
      • (X,Y,Z) is location in scene
      • x = -fX/Z, y = -fY/Z
      • image is inverted up-down & left-right
      • larger Z, smaller x & y
      • parallel lines converge in the image at vanishing point
      • note the importance of Z: if you know the rest, you can calculate Z!
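
The projection equations x = -fX/Z, y = -fY/Z can be sketched directly. The function names are assumptions for illustration, and depth_from_image simply solves the same equation for Z when X is known.

```python
def project(X, Y, Z, f=1.0):
    """Pinhole projection of scene point (X, Y, Z) onto the image plane at distance f."""
    if Z <= 0:
        raise ValueError("point must be in front of the pinhole (Z > 0)")
    return (-f * X / Z, -f * Y / Z)

def depth_from_image(x, X, f=1.0):
    """Solve x = -f X / Z for Z, given image coordinate x and scene coordinate X."""
    return -f * X / x

near = project(2.0, 1.0, 4.0)    # (-0.5, -0.25): inverted, since signs flip
far = project(2.0, 1.0, 8.0)     # (-0.25, -0.125): larger Z, smaller x and y
Z = depth_from_image(near[0], 2.0)
```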

      • Lens Systems
      • lens gathers more light *3*
      • have limited depth of field
        i.e., can 'focus' light from a limited range of Z values
      • outside that range will give unsharp image

      • Scaled orthographic projection
      • if points on object have very limited Z variation then scaling factor f/Z (in -fX/Z) is effectively a constant s
      • i.e., x = sX, y = sY

      • Light and Shading
      • brightness of image depends on brightness of patch of surface that projects to the pixel.
      • main causes of varying brightness:
        --- overall intensity of light
        --- reflecting more or less of the light
        --- shading due to not facing the light as much
      • diffuse reflection: light evenly scattered
        i.e., brightness doesn't depend on viewing direction
      • specular reflection: brightness depends on viewing direction
      • specularities: small patches where there's specular reflection *4*
      • default assumption is distant point light source
      • amount of light at surface patch depends on angle between the normal to the patch and the illumination direction. *5*
      • diffuse surface patch reflects some fraction of light
        --- diffuse albedo (e.g., white paper has 0.90)
      • Lambert's cosine law for brightness of diffuse patch
            I = ρ I_0 cos θ
        where ρ is diffuse albedo,
        I_0 is intensity of light source,
        θ is angle between light source direction and surface normal.
      • note that lighting provides surface information (due to θ)
      • surface with no light is in shadow
      • interreflections: prevent shadows from being completely black
      • ambient illumination: from interreflections
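
Lambert's cosine law can be sketched as below, with cos θ taken from the dot product of the surface normal and the light direction. Clamping negative values to zero (a patch facing away from the light) is an added assumption, and interreflections and ambient light are ignored.

```python
import math

def lambert_brightness(albedo, source_intensity, normal, light_dir):
    """I = rho * I_0 * cos(theta); a patch facing away from the light gets 0."""
    dot = sum(n * l for n, l in zip(normal, light_dir))
    norm = (math.sqrt(sum(n * n for n in normal))
            * math.sqrt(sum(l * l for l in light_dir)))
    cos_theta = dot / norm
    return albedo * source_intensity * max(0.0, cos_theta)

# White paper (albedo 0.90) lit head-on, then at 60 degrees from the normal.
head_on = lambert_brightness(0.90, 1.0, (0, 0, 1), (0, 0, 1))
tilted = lambert_brightness(0.90, 1.0, (0, 0, 1), (math.sqrt(3), 0, 1))
```

The tilted patch is half as bright as the head-on one because cos 60° = 0.5, which is why shading carries information about surface orientation.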

      • Color
      • (or, using my trigram system, Colour)
      • energy at different wavelengths (spectral energy density)
      • humans see red, green, blue (dogs)
      • principle of trichromacy: by mixing three colors humans can be fooled into seeing the original color (e.g., TV)
      • model light source with different R/G/B intensities
      • model surfaces with different albedos for R/G/B

    • Early image-processing operations
      • early: reducing the amount of data, starting interpretation into compact representation
      • early: usually local operation (rely on small part of the image)
      • early: often in parallel

      • edge detection
      • straight lines or curves in image
      • significant change in brightness
      • different kinds of edges (types detected later) *6*
        • depth discontinuities (object to background)
        • surface orientation discontinuities (edge of object)
        • reflectance discontinuities (change of surface material)
        • illumination discontinuities (shadows)
      • in 1D brightness is I(x)
      • edge is sharp change in brightness *7*
      • detect edge by a large magnitude of the derivative I'(x)
      • noise may give this, so smooth/blur first --- (I * Blur)'
      • Blur = Gaussian filter Gσ
      • (I * Blur)' = (I * Gσ)' = I * Gσ'
      • convolution of I and Gσ'
      • σ is the standard deviation -- a smaller σ blurs less
      • corresponds to replacing each pixel by a weighted average of the pixels around it
        --- giving closer ones more weight and those further away less weight.
      • think of it as a small operator that scans across the image
      • peaks (max of large gradient) in processed image correspond to edges *8*
      • similar in 2D --- also interested in edge orientation θ(x,y)
      • link edge points that are related by orientation
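
A pure-Python sketch of the 1D case: convolve I with the derivative of a Gaussian, then look for the peak in the response. The kernel radius (about 3σ), the border handling (replicating end samples) and the function names are assumptions made for this illustration.

```python
import math

def gaussian_derivative_kernel(sigma, radius=None):
    """Samples of G_sigma'(u) = -u / sigma^2 * G_sigma(u) for u in [-radius, radius]."""
    radius = radius if radius is not None else int(3 * sigma) + 1
    def g(u):
        return math.exp(-u * u / (2 * sigma * sigma)) / (math.sqrt(2 * math.pi) * sigma)
    return [-u / (sigma * sigma) * g(u) for u in range(-radius, radius + 1)]

def convolve(signal, kernel):
    """(I * k)(x) = sum over u of I(x - u) k(u), replicating the border samples."""
    r = len(kernel) // 2
    n = len(signal)
    out = []
    for x in range(n):
        total = 0.0
        for u in range(-r, r + 1):
            i = min(max(x - u, 0), n - 1)     # clamp index at the image border
            total += signal[i] * kernel[u + r]
        out.append(total)
    return out

def strongest_edge(signal, sigma=1.0):
    """Index with the largest |I * G_sigma'|, i.e., the most likely edge location."""
    response = convolve(signal, gaussian_derivative_kernel(sigma))
    return max(range(len(signal)), key=lambda x: abs(response[x]))

step = [0.0] * 10 + [1.0] * 10        # brightness jumps between samples 9 and 10
edge_at = strongest_edge(step)
```

A larger σ suppresses more noise but also spreads the peak, so the edge is located less precisely.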

      • texture analysis
      • spatially repeating pattern on surface that can be detected visually
      • e.g., grass, pebbles
      • use multi-pixel patch -- characterize patch by histogram of pixel (edge) orientations
      • histogram changes in an image area suggest change in object
      • orientations largely illumination invariant

      • optical flow
      • direction and speed of motion of object in the image *10*
      • object or camera moving between frames of video
      • rate of flow can indicate distance, and show actions
      • need corresponding point between two images (2 frames)
      • select image patch at (x0, y0) at time t0
      • compare patch with places around that point in second image at time t0+Dt
        at (x0+Dx, y0+Dy)
      • minimize the measure of Sum of Squared Differences
        i.e., find best (Dx, Dy)
      • optical flow at (x0, y0) is (vx, vy) = (Dx/Dt, Dy/Dt)
      • there needs to be some texture for this to work
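
The patch-matching procedure can be sketched as an exhaustive search over small displacements. Images are assumed to be lists of rows of brightness values; the search range, patch size and helper names are hypothetical, and no bounds checking is done near the image border.

```python
def ssd(img_a, img_b, x0, y0, dx, dy, half=1):
    """Sum of squared differences between the patch at (x0, y0) in img_a
    and the patch at (x0 + dx, y0 + dy) in img_b."""
    total = 0.0
    for j in range(-half, half + 1):
        for i in range(-half, half + 1):
            diff = img_a[y0 + j][x0 + i] - img_b[y0 + dy + j][x0 + dx + i]
            total += diff * diff
    return total

def optical_flow_at(img_a, img_b, x0, y0, dt=1.0, search=2, half=1):
    """Find the (Dx, Dy) minimizing SSD and return (vx, vy) = (Dx/Dt, Dy/Dt)."""
    candidates = [(dx, dy) for dy in range(-search, search + 1)
                  for dx in range(-search, search + 1)]
    dx, dy = min(candidates, key=lambda d: ssd(img_a, img_b, x0, y0, d[0], d[1], half))
    return (dx / dt, dy / dt)

def make_image(cx, cy, size=9):
    """Dark image with a 3x3 textured patch (values 1..9) centred at (cx, cy)."""
    img = [[0.0] * size for _ in range(size)]
    value = 1.0
    for j in range(-1, 2):
        for i in range(-1, 2):
            img[cy + j][cx + i] = value
            value += 1.0
    return img

frame0 = make_image(4, 4)
frame1 = make_image(6, 5)                         # patch moved by (+2, +1)
flow = optical_flow_at(frame0, frame1, 4, 4)      # (2.0, 1.0)
```

On a uniform patch every displacement gives the same SSD, which is why some texture is needed for the match to be unique.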

      • Segmentation of Images
      • break image into regions of similar pixels *11*
      • regions often indicate edges of objects
      • can either detect region boundaries, or regions themselves
      • detect region boundaries: train classifier based on brightness, color and texture
        estimates Pb(x,y,θ) --- probability of boundary b at x,y at angle θ
      • however, may not form closed curves
      • Alternative approach: cluster pixels based on brightness, color and texture
      • maximize similarity of pixels in cluster, and maximize difference between clusters

    • Object recognition by appearance
      • appearance: what object looks like
      • simple/consistent objects: just test for distinctive features in the image
      • e.g., works quite well for faces
      • slide round window over image, compute features, use classifier, find faces!
      • overlapping windows might be combined to report single face
      • train classifier with marked-up face images *12*

      • Complex appearance and pattern elements
      • several effects move features around in an image: *13*
        • foreshortening: viewing slanted surface
        • aspect: object at different rotation angles
        • occlusion: parts hidden by other parts or objects
        • deformation: objects with moving parts/regions
      • try looking across image for object parts (also vary scale)
        if related parts are close together then object detected
      • i.e., look for image features together in approx the right place
      • heuristic --- use spatial information (e.g., car wheels at bottom)

      • Pedestrian detection with Histogram of Gradient features
      • use histograms of local orientations in an image *14*
      • break image into cells -- make orientation histogram for each cell
      • emphasise important gradients by weights that show how significant they are relative to others in the same cell
      • gives Histogram of Gradient feature
      • train classifier with existing training sets

    • Reconstructing the 3D world
      • recover 3D model from image
      • i.e., can we do P(Scene | Image) ∝ P(Image | Scene) P(Scene) ?

      • Motion parallax
      • camera moves relative to 3D scene *16*
      • apparent motion in image tells us about camera movement and depth info in scene
      • viewer translational velocity T
      • Z(x,y) is z-coordinate of point in scene corresponding to image point (x,y)
      • optical flow
        v_x(x,y) = x T_z / Z(x,y)
        v_y(x,y) = y T_z / Z(x,y)
        (for translation along the optical axis only, i.e., T_x = T_y = 0)
      • can detect relative depths from optical flow

      • Binocular stereopsis
      • two images separated in space *17*
      • disparity: difference in location in two images of same features
      • need to solve the correspondence problem
      • displacement of eyes (cameras) by amount b along x-axis (approx 6cm)
      • horizontal disparity (in image) H = b/Z
      • measure disparity, know b, obtain Z the depth of some point on object
      • humans fixate: look at a certain depth
      • small variations in depth correspond to small angles at the eye
      • smallest detectable angle is about 5 seconds of arc
        (a minute of arc is 1/60th of a degree)
        (a second of arc is 1/60th of an arcminute)
      • e.g., at 30cm we can detect 0.036mm!
      • generalize to multiple views *18*19*
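
The 0.036mm figure can be checked with a short calculation. It assumes the small-angle relation between angular disparity and a depth difference, δθ = b δZ / Z², which is not stated in the notes; the function name is hypothetical.

```python
import math

def smallest_detectable_depth(Z, baseline, min_angle_arcsec):
    """delta_Z = delta_theta * Z^2 / b, from delta_theta = b * delta_Z / Z^2."""
    delta_theta = math.radians(min_angle_arcsec / 3600.0)   # arcseconds -> radians
    return delta_theta * Z * Z / baseline

# Z = 30 cm, b = 6 cm, smallest detectable angle = 5 seconds of arc (all in metres).
dz = smallest_detectable_depth(Z=0.30, baseline=0.06, min_angle_arcsec=5)
dz_mm = dz * 1000.0    # roughly 0.036 mm
```

Because δZ grows with Z², depth resolution from stereo falls off quickly with distance.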

      • Shading
      • variation in intensity of light from different portions of a surface in the scene
      • due to geometry and reflectance properties
      • very hard to recover these from the image
      • there are many interreflections

      • Contour
      • we can extract distance and 3D properties from outlines *21*
      • figure-ground problem: which is foreground, which is background?
      • big clue is T-junctions
      • assume "ground plane"
      • i.e., nearer objects project to points lower in image

      • Objects and geometric structure of scenes
      • can use horizon detector: objects whose images are closer to the horizon are further away *22*
      • also, pedestrians are approx same height so image size reflects distance

      • for solid object with distinct feature points mi
      • pose detection, used by industrial robots manipulating parts
      • assume rotation and translation of object, and projection to image
      • image point pi = Q(mi)
      • Q is the same for all image points
      • if three object features can be found in the image then equations can be solved (e.g., using edges and vertex detection)
      • i.e., all mi of object can be predicted
        and object position and "pose" is known allowing manipulation

    • Object recognition from structural information
      • use knowledge of object being seen
      • e.g., simple model of human body
      • deformable template: moveable image blocks with relationships
        e.g., leg image relative to body image *23*

      • model geometry of body with eleven rectangular segments with connections and constraints
      • "cardboard people": model forms a tree rooted at torso
      • segments move independently of segment to which they're connected
        e.g., lower arm relative to upper arm
      • image rectangle should resemble the model segment
      • relationship between image rectangles should match expected relationships between associated model segments
      • find best match
      • can use size of rectangle/image to help
      • color can help matching
      • Appearance model: model of segments reflecting most likely position of person in the world, based on the image *24*

      • Coherent appearance
      • tracking people in video *25*
      • look for torso in lots of frames
      • build up a reliable appearance model that explains many frames

    • Using Vision
      • many applications!
      • e.g., surveillance, sports, HCI, games, ...
      • in simple cases with large fixed backgrounds can subtract background from complete image leaving image of interest
      • can train classifier on optical flow to recognize standard actions

      • Image retrieval
      • find relevant images from a database
      • can be done via IR techniques (e.g., images have keywords)
      • can learn keywords for image by using tagged training images and nearest-neighbors methods (test image similar to training image?)

      • Reconstruction from many views
      • assume a familiar 3D object, then we have an object model
      • determine correspondences between image points and object points
      • use correspondences to determine parameters of camera (and lens)
      • test this by projecting other model points through camera to image
      • determine whether there are matching image points nearby
      • can confirm model
      • applications include...
        • Model-building: use video or collection of pictures to extract detailed 3D model of object *26*
        • Matching moves: to put computer graphics characters in real video, determine actual camera moves so that graphics characters can be rendered correctly.
        • Path reconstruction: robots can reconstruct object that they have seen, and use camera information to construct record of path

      • Using vision for controlling movement
      • navigation -- e.g., autonomous vehicles
      • Lateral control: stay in lane
      • Longitudinal control: stay away from vehicle ahead
      • Obstacle avoidance: avoid other cars, and pedestrians
      • adjust steering, acceleration and braking
      • need position & orientation relative to lane
      • use edge detection to find lane markers
      • augment with map knowledge: vision is confirmation
      • but obstacles aren't (usually) on the map
      • use binocular stereopsis for car ahead distance
      • augment with laser rangefinders to build probability maps of surroundings
      • use landmarks to reset absolute position information
      • for driving you don't need ALL the information from an image
      • DARPA Urban Challenge

    Lecture 26: Watson

      (see Watson talk slides & videos)

    Lecture 27: AI at WPI

    Lecture 28: AI at WPI

    Markov Decision Processes Quick Overview

    • agent must choose an action from ACTIONS(s) in each state s (at each time step)
    • begins at start state in a fully observable environment
    • sequential decision problem: find a (good) sequence of actions to terminal state
    • terminal states have rewards (may be +ve or -ve)
    • actions are unreliable (stochastic)
      --- some probability that movement will not be in direction chosen
      --- e.g., 0.8 in intended direction, 0.1 in two others.
    • transition model: the outcome of each action at each state
    • transition probabilities (to s' from s due to a) are known --- P(s' | s,a)
    • transitions are Markovian: probabilities do not depend on earlier states, just s.

    • utility function for agent depends on sequence of states (environment history)
    • in each state agent gets a reward R(s)
      --- may be +ve or -ve
      --- negative rewards encourage agent not to be there!
    • simple utility is sum of the rewards received
      --- including at a terminal state, where a larger reward may occur (perhaps -ve)
      --- U([s0, s1, ...]) = R(s0) + R(s1) + ...
    • discounted rewards (using "discount factor" γ)
      --- U([s0, s1, ...]) = R(s0) + γ R(s1) + γ^2 R(s2) + ...
    • γ between 0 and 1,
      --- expresses preference for known current rewards over less well known future rewards.
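
A short sketch of the two utility definitions; γ = 1 reduces the discounted form to the plain sum. The reward sequence is hypothetical (a small step cost followed by a +1 terminal reward).

```python
def discounted_utility(rewards, gamma=1.0):
    """U([s0, s1, ...]) = R(s0) + gamma R(s1) + gamma^2 R(s2) + ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [-0.04, -0.04, -0.04, 1.0]                   # R(s0), R(s1), R(s2), R(s3)
additive = discounted_utility(rewards)                 # 0.88
discounted = discounted_utility(rewards, gamma=0.5)    # -0.04 - 0.02 - 0.01 + 0.125
```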

    • Markov Decision Process: states, actions, rewards, Markovian transitions.
    • Policy π(s) : Solution to MDP: what action to take in any state
    • each time policy is executed from s0 it may lead to a different sequence of states (stochastic)
    • quality of policy is "expected utility" of environment histories generated by policy.
    • Optimal Policy π*(s): one that yields the highest expected utility
    • if agent knows current state s it can then execute action π*(s) (Reflex Agent)
    • changing R(s) values affects π*(s)
    • maximize expected utility
      π*(s) = argmax_a Σ_s' P(s' | s,a) U(s')
      i.e., agent can choose action that maximizes expected utility of next state
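
The one-step argmax can be sketched as follows, assuming the utilities U(s') are already known. The states, transition table and utility values are invented for illustration; computing U itself (e.g., by value iteration) is not shown.

```python
def best_action(state, actions, P, U):
    """pi*(s) = argmax over a of the sum over s' of P(s' | s, a) * U(s')."""
    def expected_utility(a):
        return sum(p * U[s2] for s2, p in P[(state, a)].items())
    return max(actions(state), key=expected_utility)

# Hypothetical one-state example: P[(s, a)] maps each s' to P(s' | s, a).
P = {
    ("s", "right"): {"goal": 0.8, "pit": 0.1, "s": 0.1},
    ("s", "left"): {"s": 0.9, "pit": 0.1},
}
U = {"goal": 1.0, "pit": -1.0, "s": 0.2}

def actions(state):
    return ["left", "right"]

choice = best_action("s", actions, P, U)   # expected utility: right 0.72, left 0.08
```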
