WORCESTER POLYTECHNIC INSTITUTE
Computer Science Department
CS4341 Artificial Intelligence
Version:
Wed Apr 24 19:52:44 EDT 2013
Course Contents
Lecture 1: Introduction (1)
 Course Information
 email, web, book, intro page, projects, weekly exams
 myWPI, webturnin
 my preparation
 sources for slides
 What is AI? Definitions:
 AI is the study of ...
 computations that make it possible to
perceive, reason, and act.
 how to make computers do things which, at the
moment, people do better.
 the design of intelligent agents.
 how to make computers act like those in the movies!
 Four goals: thinking/acting, humanly/rationally
 Rational: does the right thing given what it knows
 Thinking Humanly (reasoning)
 Cognitive modeling
 Implement model of reasoning
 Does it reason like a human?
 Acting Humanly (behavior)
 Turing Test
 avoids definition of intelligence
* How would you define it ???
 includes language, learning, knowledge, reasoning
 system intelligent if passes test
 person or machine ?
(Eliza)
 does not include perception
 Thinking Rationally (reasoning)
 Laws of thought ("logic")
 works in practice?
 Acting Rationally (behavior) [this book]
 The Rational Agent approach
 tries to find best outcome, or best 'expected' outcome.
 actions should achieve one's goals
 Engineering goal: solve real-world problems
 Scientific goal: explain various sorts of intelligence
 How AI has changed
 focus on systems that act rationally
 this is the book's focus
 there are areas that this book doesn't include (e.g., design, creativity)
 Foundations of AI
 Philosophy
 Mathematics
 Economics
 Neuroscience
 Psychology
 Computer Engineering
 Control theory and cybernetics
 Linguistics
 The Near-Term Applications
 e.g., routine design
 e.g., detect credit card fraud
 The Long-Term Applications
 what is still left to do...????
 chess?
Deep Blue
 space?
Remote Agent and Deep Space 1
"Remote Agent (RA) is a model-based, reusable,
artificial intelligence (AI) software system that enables
goal-based spacecraft commanding and robust fault recovery. RA
was flight validated during an experiment onboard Deep Space 1
(DS1) between May 17 and May 21, 1999."
 autonomous vehicles?
 What Intelligent Systems Can Do
 diagnosis, design, planning, scheduling, navigation, vision,
tutoring, learning, ...
 AI Sheds New Light on Traditional Questions
 computers provide new concepts & language
 computers require precision (e.g., what is "creativity"?)
 explore impact of technique or knowledge (add/remove)
 theories → computational models → implementations → results → refinements
 use of computers allows testing
 well tested methods used as tools
 AI Helps Us to Become More Intelligent
 suggests new/better ways to tackle problems
 AI Is Becoming Less Conspicuous, yet More Essential
 Airport gate allocation
 many embedded applications (cars, washing machines, ...)
 Criteria for Success
 clear definition of task and implementable procedure for it
 regularities or constraints available
 other knowledge
 solves real problem
 provides new theory/method
 suggests new opportunities
Lecture 2: Intelligent Agents (2-2.3)
 Agents & Environments
 agent, sensors, actuators, environment
 percept, percept sequence
 action, action sequence
 agent program implements agent function (percepts → actions)
 Rationality
 agent actions change the state of the environment
 Performance measure evaluates sequence of environment states
 Rational agent
For each possible percept sequence, a rational agent should
select an action that is expected to maximize its performance
measure, given the evidence provided by the percept sequence
and whatever builtin knowledge the agent has.
 What is rational depends on
 performance measure,
 prior knowledge,
 performable actions,
 percept sequence.
 maximize expected performance
 information gathering changes future percepts
(helps maximize expected performance)
 exploration (investigate unknown environment)
 agent autonomy: doesn't rely only on the agent designer's knowledge
 The Nature of Environments
 Task environment for agent
 PEAS = Performance measure, Environment, Actuators, Sensors
 Properties of task environments
 Fully observable vs partially observable
 sensors detect all relevant aspects of environment
 Single agent vs multiagent
 multiagent: competitive vs cooperative
 Deterministic vs stochastic
 i. state of environment completely determined by current state & agent action.
 ii. outcomes determined by probabilities
 Episodic vs sequential
 i. agent experiences atomic episodes;
next episode does not depend on previous actions
 ii. current action could affect all future ones.
 Static vs dynamic
 dynamic if environment changes while agent is deliberating
 Discrete vs continuous
 relates to percepts and actions
 Known vs unknown
 refers to the agent's knowledge
 are outcomes (or their probabilities) known for all actions?
 hard! = partially observable, multiagent, stochastic,
sequential, dynamic, unknown
Lecture 3: Intelligent Agents (2.4, 3.1)
 Structure of Agents
 Agent = architecture + program
 Tabledriven program: table indexed by percept sequences
 full table not practical for real problems
 but note Case-Based Reasoning, tables in chess, and
memoization (lookup tables).
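The memoization idea can be shown in a few lines. A minimal, illustrative sketch (Fibonacci stands in for any expensive computation whose results are worth tabulating by input):

```python
from functools import lru_cache

# A lookup table in miniature: cache the result for each input
# so repeated "percepts" avoid recomputation.
@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """Naive recursion made fast by a table of past results."""
    return n if n < 2 else fib(n - 1) + fib(n - 2)

print(fib(30))  # -> 832040, reusing cached subresults
```

Without the cache this recursion is exponential; with it, each input is computed once, which is exactly why full percept-sequence tables are appealing in principle but impractical at real-problem scale.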
 Simple Reflex Agents
 next action depends on current percept only
 condition-action rule
 RuleMatch picks rule to use
 environment must be fully observable
 there must always be a matching rule (otherwise ???)
 the basic idea behind rulebased systems
 ModelBased Reflex Agents
 internal state: keep track of best guess of state of environment
 model: how next state depends on current state and action
 in casual use, model = internal state (i.e., a model of environment)
 GoalBased Agents
 goals: desirable situations (result is achieved/happy or not)
 needs to have: what will happen if I do this...?
 can check relevant actions wrt achieving goal
 UtilityBased Agents
 combine with model
 utility: quality of being useful (degrees of happy)
 utility function: estimates the performance measure
 maximize expected utility: will behave rationally
 Learning Agents
 agents can learn to become more competent
 learning element: makes improvements
 performance element: selects actions
 critic: determines (using fixed performance standard)
whether/how performance element should be modified
 i.e., it will perform differently after modification
 problem generator: suggests actions that lead to new experiences
 Representations of the environment
 atomic: no internal structure
 factored: vector of attribute values (features)
 structured: objects with attributes and relationships
 consequences ???
 Problemsolving Agents
 Goal formulation: adopt goal: first step in problemsolving
 Problem formulation: decide what actions and states to consider
 with options: may need to examine future actions to determine value
 solution to some problems is a set of actions ("path")
 solution to other problems is a state
 Welldefined problems & solutions
 initial state
 set of possible actions applicable in state s
 transition model gives state resulting from each action
 state space: set of reachable states from initial state
 statetostate transitions form a graph
 goal test detects goal state (the state or its properties)
 might be more than one goal state
 step cost: cost of taking an action from state to state
 path costs: cost of following a path
 solution: path from initial state to goal
 optimal solution: lowest cost solution
Lecture 4: Uninformed search (3.2-3.4)
 Example problems
 toy problems vs real-world problems
 toy:
 vacuum world (goal = squares clean; solution = path)
 8puzzle (goal = configuration; solution = path)
 8queens puzzle (goal = configuration; solution = state)
 real-world
 routefinding
 touring (e.g., traveling salesperson problem)
 VLSI layout
 robot navigation
 packing a cargo plane
 Searching for solutions
 search tree
 nodes = states
 links = actions (with costs)
 root node = start state
 expand node: apply possible actions to generate new states
 parent nodes lead to child nodes
 leaf node: no children (yet)
 frontier: leaf nodes ready for expansion
 search strategy: how to select which node to expand next
 determined by how frontier queue built and how selection made
 e.g., FIFO queue, LIFO queue, priority queue
 loops and redundant paths (graph)
 TreeSearch vs GraphSearch
 for graph search recognize where you have already searched
 Uninformed search
 Uninformed: no additional information about states
 Informed: uses knowledge of how "promising" a state is (wrt goal)
 Breadth-first
 all nodes at one level expanded before any nodes at next level
 test for goal at generation time (save time/space)
 huge memory requirements
 Uniform-cost
 assumes different step costs
 expand node with lowest current path cost: g(n)
 use priority queue
 alternative higher-cost paths to a node are ignored
 Depth-first
 expands most recently generated node
 goes deep down a path before investigating alternatives
 involves backing up from nodes that don't expand (aren't expanded)
 space complexity much better than Breadth-first
 the basic search of AI (often with modifications)
 Depth-limited
 depth-first with predetermined search depth limit
 path not explored past depth limit
 need to pick good value for limit (based on problem)
 Iterative deepening depth-first
 depth-first with varying depth limit
 start with depth at 0 and increase it
 some redundancy but not significant
 adds a touch of Breadth-first, as at each level, the
whole tree may be searched
 preferred uninformed search
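A minimal sketch of iterative deepening over a toy state space (the graph here is hypothetical; states are strings, links are the actions):

```python
def depth_limited(graph, node, goal, limit, path):
    """Depth-first search that refuses to explore past the depth limit."""
    if node == goal:
        return path
    if limit == 0:
        return None
    for child in graph.get(node, []):
        if child not in path:  # avoid loops along the current path
            found = depth_limited(graph, child, goal, limit - 1, path + [child])
            if found:
                return found
    return None

def iterative_deepening(graph, start, goal, max_depth=20):
    """Run depth-limited search with limits 0, 1, 2, ... until the goal is found."""
    for limit in range(max_depth + 1):
        result = depth_limited(graph, start, goal, limit, [start])
        if result:
            return result
    return None

# hypothetical adjacency lists
graph = {'A': ['B', 'C'], 'B': ['D'], 'C': ['E'], 'D': ['F'], 'E': ['F']}
print(iterative_deepening(graph, 'A', 'F'))  # -> ['A', 'B', 'D', 'F']
```

Because shallow limits are tried first, the first solution found is a shallowest one, which is the Breadth-first flavor the notes mention, at Depth-first memory cost.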
Lecture 5: Informed search (3.5-3.7)
 Heuristic/Informed Search
 use problem-specific knowledge to gain efficiency
 can guide and prune
 evaluation function  f(n)
 cost estimate for path through n to goal
 actual path cost to node n  g(n)
 heuristic function  h(n)
 estimated cost of cheapest path from n to goal
 uses "heuristic" to estimate ("rule of thumb")
 Greedy bestfirst search
 f(n) = h(n)  instead of g(n)
 sample heuristic = "as the crow flies"
 e.g., roads are always longer, but it's a good estimate.
 greedy  doesn't take current cost into account!
 A* search
 "A star": a kind of bestfirst search
 estimated path cost through n
 f(n) = g(n) + h(n)
 pick lowest f(n) each time
 complete: will always find goal if there is one
 optimal: finds best path
 h(n) must be admissible  i.e., optimistic!
 it always underestimates actual cost to goal
 accurate h(n) close to or equals actual cost
 what if h(n) = actual cost???
 can run out of space
 Memory-bounded heuristic search
Iterative-deepening A* (IDA*)
 use f values for cutoff, instead of d
 Recursive bestfirst
 it prunes search if another branch becomes better
 but remembers best cost of pruned subtree
 Simplified Memory-bounded A* (SMA*)
 uses A* until memory full
 expands newest best leaf, deletes oldest worst leaf.
 SMA* robust choice for searching
 Heuristic functions
 good heuristics lower effective branching factor
 i.e., branching that actually occurs in a search
 ebf close to 1 indicates few unnecessary branches
 heuristic functions whose values are close to the true costs are best
 use relaxed problems (fewer restrictions) to generate heuristics
 cost of optimal soln. to relaxed problem is admissible
heuristic for original problem
 (e.g., Manhattan distance for 8 puzzle)
 Pattern databases: store exact costs for subproblems
 gives heuristic value for cost of full problem
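A compact sketch of A* itself, using a priority queue keyed on f(n) = g(n) + h(n). The grid problem and the Manhattan-distance heuristic are illustrative, not from the lecture:

```python
import heapq

def astar(start, goal, neighbors, h):
    """A*: always expand the frontier node with lowest f(n) = g(n) + h(n)."""
    frontier = [(h(start), 0, start, [start])]   # (f, g, state, path)
    best_g = {start: 0}
    while frontier:
        f, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, g
        for nxt, cost in neighbors(state):
            g2 = g + cost
            if g2 < best_g.get(nxt, float('inf')):  # ignore costlier routes
                best_g[nxt] = g2
                heapq.heappush(frontier, (g2 + h(nxt), g2, nxt, path + [nxt]))
    return None, float('inf')

# hypothetical problem: move on a 5x5 grid from (0,0) to (4,4), step cost 1
def neighbors(p):
    x, y = p
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < 5 and 0 <= ny < 5:
            yield (nx, ny), 1

def manhattan(p):  # admissible: never overestimates the remaining cost
    return abs(4 - p[0]) + abs(4 - p[1])

path, cost = astar((0, 0), (4, 4), neighbors, manhattan)
print(cost)  # -> 8, the optimal number of moves
```

With an admissible h, the first time the goal is popped its g value is optimal, which is why the goal test happens at expansion rather than generation.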
Lecture 6: Local Search
 Local search & optimization problems
 local search usually looking for a solution state, not a path
 usually looks around a state (or states) by modifying it (them)
 optimization: find best state, measured by an objective function
 state space "landscape"
 surface formed by function's value across all states
 global maximum (optimum) vs. local maximum
 could be looking for minimum (gradient descent)
 Hill-climbing
 looking for maximum
 search moves in direction of most improvement at each move
 steepest ascent (it's greedy)
 just records current state
 problems: local maxima; ridges; plateaux
 getting unstuck: stochastic (add some randomness at each move)
 random-restart hill-climbing: a set of random start states
 Simulated annealing
 annealing = heating then gradually cooling
 minimize cost (descent)
 disturb search out of local minima
 gradually disturb ("shake") less over time
 makes a random move: accepts it with some probability
 probability decreases if move makes things worse (a shake)
 you're still trying to go down hill to global minimum
 probability slowly decreases also depending on time
 Local beam search
 beam searches move in restricted areas of search space
 k random start states
 expand all states
 pick k best, and continue
 may have poor diversity (i.e., stuck in a region of the state space)
 variants add some randomness to encourage "diversity"
 Local search in Continuous spaces
 continuous actions/states lead to infinite branching factors!
 easiest solution  make discrete changes
 e.g., consider new states only by making discrete (delta) changes
 can also compute local gradients for hill-climbing
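Steepest-ascent hill climbing and the random-restart fix can be sketched together. The 1-D landscape below is invented purely to show a local maximum trapping a single climb:

```python
import random

def hill_climb(start, value, neighbors):
    """Steepest ascent: move to the best neighbor until none improves."""
    current = start
    while True:
        best = max(neighbors(current), key=value)
        if value(best) <= value(current):
            return current          # a peak -- possibly only a local one
        current = best

def random_restart(value, neighbors, candidates, restarts=10, seed=1):
    """Random-restart hill climbing: keep the best peak over several runs."""
    rng = random.Random(seed)
    peaks = [hill_climb(rng.choice(candidates), value, neighbors)
             for _ in range(restarts)]
    return max(peaks, key=value)

# hypothetical landscape: global peak at x=3, a lower local peak at x=10
def value(x):
    return -(x - 3) ** 2 if x < 7 else -(x - 10) ** 2 - 5

def neighbors(x):
    return [x - 1, x + 1]

print(hill_climb(12, value, neighbors))  # -> 10 (stuck on the local peak)
print(random_restart(value, neighbors, list(range(15))))
```

A single climb started at 12 stops at the local maximum 10; restarting from random states usually lets some run find the global peak at 3.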
Lecture 7: Genetic Algorithms
 Genetic Algorithms (text's overview)
 analogy to natural selection
 survival of the fittest
 works on a series of populations of individuals (states)
 each population producing the next
 initial population of k random states (k often 100+)
 each state is rated by an objective/fitness function
 higher value, fitter individuals
 individuals represent descriptions of states (using features)
 often as a binary string
 fitter individuals replicated
 fitter individuals get a better chance of taking part in producing the next population
 more fit, more copies
 randomly select pairs for mating (crossover)
 for each pair, randomly select crossover point.
 crossover produces new pairs (for next population).
 a small number of individuals are mutated (very small random change)
 stop after some number of generations,
when very fit individual appears,
or if best (or avg) fitness is stable.
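The loop above can be sketched in miniature. Everything here is illustrative: the "one-max" fitness (count of 1 bits), the population size, the 5% mutation rate, and fitness-proportional selection are one arbitrary choice among the many variants listed below:

```python
import random

def genetic_algorithm(fitness, length=12, pop_size=20, generations=60, seed=4):
    """Minimal GA: fitness-proportional selection, one-point crossover,
    rare single-bit mutation. Individuals are bit lists."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # fitter individuals are more likely to be chosen as parents
            a, b = rng.choices(pop, weights=[fitness(i) + 1 for i in pop], k=2)
            cut = rng.randrange(1, length)      # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.05:             # small mutation chance
                j = rng.randrange(length)
                child[j] ^= 1                   # flip one bit
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = genetic_algorithm(fitness=sum)   # one-max: maximize number of 1s
print(sum(best))
```

The stop criterion here is a fixed generation count; swapping in a "fitness is stable" test is a one-line change.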
 Genetic Algorithms (additional information)
 See these
A Quick Introduction to Genetic Algorithms notes.
 many variations of algorithm
 all have individuals, populations, fitness, crossover, mutation
 vary by:
 population size
 whether the population size varies
 representation of individuals
 direct representation (e.g., LISP program)
 coded representation (e.g., binary string)
 how crossover done
 probability of mutation
 whether some individuals copied from previous population
 whether individuals are checked for legality after crossover/mutation
 how fitness is calculated and used
 whether diversity is used to select for a new population
 See these
Diversity Selection notes.
 GAs and Creativity
 Koza
 automated circuit design
 uses circuit description language
 each individual in the population is a circuit description
Lecture 8: Adversarial Search
 Games
 multiagent, competitive
 deterministic, turntaking, twoplayer, zerosum, fully observable
 zero sum: one wins & one loses; or both draw.
 very large game trees (search spaces): need to "prune" and ignore parts of
game tree
 (search tree < game tree)
 chess has 10^{40} nodes in game tree (intractable)
 terminal state: one person has won
 looking ahead: complete search can find terminal states (correct utility)
 utility function: e.g., win (+1), lose (-1), draw (0)
 looking ahead: can limit depth and estimate utility
 ply: a move by one player
 need legal move generator (can filter by what's "plausible")
 use transposition (hash) table of evaluations at previously seen positions
 can use pruning strategies
 e.g., based on shallow, fast evaluation
 danger: may prune the path that leads to a win!
 Optimal decisions in games (Minimax)
 assume both players play optimally (they want to win)
 A plays their best move, assuming that B responds with their best move
 all the way down the tree!
 High utility = player1 wins; Low utility = player2 wins.
 Player1 tries to move value up, Player2 tries to move value down.
 Search down the tree to terminal state, then back the values up taking
min or max values until all states resulting from move choices
have values that indicate what they'll lead to if played. Pick the
best.
 pick move that avoids opponent's best moves!
 time is exponential in search depth. :(
 getting to optimal requires searching to terminal states
 just not viable for huge game trees!
 AlphaBeta pruning
 pruning!
 an addition to minimax
 don't expand a node that can't provide a score better than what you already have
 time/space saved can allow deeper searches (e.g., twice as deep)
 still exponential with depth, but visits fewer nodes due to pruning
 game tree branch order affects pruning possibilities
 chess: could order by expected utility
 e.g., captures; threats; move forward; move back
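Minimax with alpha-beta pruning fits in one function. This sketch uses a hard-coded toy game tree (nested lists with numeric utilities at the leaves) rather than a real game's move generator:

```python
import math

def alphabeta(state, maximizing, alpha=-math.inf, beta=math.inf):
    """Minimax with alpha-beta pruning.

    Leaves are numbers (utilities); internal nodes are lists of children.
    MAX backs up the largest value, MIN the smallest; a branch is cut off
    as soon as it cannot beat a value the other player already guarantees."""
    if not isinstance(state, list):        # terminal: return its utility
        return state
    if maximizing:
        value = -math.inf
        for child in state:
            value = max(value, alphabeta(child, False, alpha, beta))
            alpha = max(alpha, value)
            if alpha >= beta:              # MIN above will never allow this
                break
        return value
    else:
        value = math.inf
        for child in state:
            value = min(value, alphabeta(child, True, alpha, beta))
            beta = min(beta, value)
            if alpha >= beta:              # MAX above already has better
                break
        return value

# three MIN replies to each of three MAX moves (a classic textbook shape)
tree = [[3, 12, 8], [2, 4, 6], [14, 5, 2]]
print(alphabeta(tree, True))   # -> 3
```

Note how the second branch is abandoned after seeing the leaf 2: MIN can already force at most 2 there, which MAX, holding 3, will never choose, exactly the "game tree branch order affects pruning" point above.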
 Imperfect decisions
 can't search tree to terminal state
 cut off search earlier and use evaluation function
 accurate estimate of chances of winning in that state (i.e., utility)
 depth limited, or iterative deepening ("anytime algorithm")
 Features:
 # of pieces
 strength of pieces (queen > pawn)
 mobility (poss. moves)
 control (squares threatened)
 threats (potential captures)
 patterns of pieces (e.g., diagonal pawns)
 Evaluation function: often a weighted linear function
 Chess: Heuristic Continuation Fights the Horizon Effect
 fixed depth search produces a "horizon" (may be bad beyond it!)
 singular extension
 if one move's value is much better than rest, then keep
looking down that branch, as it's a place where the most
change in value could result from minimaxing
 search until quiescent
 look for quiet (i.e., no possible captures)
 Chess: Deep Blue plays Grandmaster Chess
 first machine to win chess game against reigning world champion
 uses alphabeta search, with selective extensions
 could search to a depth of 12 ply
 has opening "book" and all fiveorfewer piece endgames
 massively parallel, 30-node, RS/6000, SP-based computer system
enhanced with 480 special-purpose VLSI chess chips
 evaluates 200,000,000 chess positions per second
 several months working with a grandmaster on evaluation function
 "In three minutes, ... it computes everything it knows about
the current position from scratch."
 Chinook: world man-machine checkers champion
Lecture 9: Constraint Satisfaction Problems 1 (6.1-6.2)
 Defining CSPs
 Constraint Satisfaction Problem (CSP)
 set of constraints that specify allowable combinations of
values of variables
 e.g., X_{1} ≥ X_{2},
X_{1} > X_{3},
X_{2} ≥ X_{3}
 set of variables (each one can have a value)
 e.g., Vbls = { X_{1}, X_{2}, X_{3} }
 a set of allowable values (domain) for each variable
 e.g., the domain of each variable is {1, 2, 3, 4, 5}
 usually discrete, finite domains
 the problem is to find a complete and consistent assignment
 all variables have values, no constraints are violated
 there may be several, or no, consistent assignments
 the result may need to be all or one consistent assignment
 constraint graph: nodes = variables; links show constraint influence
 If there is a constraint SA ≠ WA, then there is an SA-WA link in the graph
 constraint propagation:
 the influence of removing
inconsistent values can spread through the graph (prune domains)
 constraints can be fully enumerated
 show all allowable assignments for variables in the constraint
 e.g., { (red, green), (red, blue), ... (blue, green)}
 types of problem solvers for CSPs
 search making one variable assignment at a time
 gradually eliminate inconsistent values from domains
 manipulate a potential solution until it becomes consistent
 unary constraints include one variable (e.g., X ≠ blue )
 binary constraints include two variables (e.g., A > B+3 )
 usually can reduce to all binary constraints
 global constraints: e.g., Alldiff (means "all different")
 preference constraints: ( ProfDCB prefers afternoon )
 other assignments are consistent, but suboptimal (incur cost)
 resource constraints: Atmost(10, A, B, C, D) (i.e., 10 max)
 bounds: reason using variable domains represented by [lower, upper]
 Examples: map coloring, scheduling, 8 queens,
cryptarithmetic, Sudoku
 Inference in CSPs by Constraint propagation
 Node consistency: variable's unary constraints satisfied
 Arc consistency: binary constraints satisfied between two variables
 e.g., variables X and Y
 for every value in the domain of X there's a value in the
domain of Y that satisfies constraint
(i.e., there's potential for a solution!)
 larger goal: aim to make whole graph arc consistent by removing domain
values that don't give arc consistency

AC-3 algorithm: if the domain of a variable is reduced, then
look to see if that affects the variables connected to it by
constraints!
 i.e., the effects are propagated, until failure, or graph is
arc consistent.
 even if result isn't a solution, it will be much easier to
solve! (small domains)
 Path consistency: look at triples of variables.
 IF ABC is a path, THEN, for every consistent assignment of
values to both A and B (consistent with the
constraints on both A and on B), there must be an assignment to B
that is consistent with the AB constraints AND the BC
constraints.
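AC-3 can be sketched directly from the description above. The example constraints (X1 ≥ X2, X1 > X3, X2 ≥ X3 over domains {1..5}) are the ones used earlier in this lecture; the dict-of-lambdas encoding is just one convenient representation:

```python
from collections import deque

def revise(domains, constraints, x, y):
    """Remove values of x for which no value of y satisfies the arc (x, y)."""
    removed = False
    for vx in list(domains[x]):
        if not any(constraints[(x, y)](vx, vy) for vy in domains[y]):
            domains[x].remove(vx)
            removed = True
    return removed

def ac3(domains, constraints):
    """AC-3: propagate until every arc is consistent, or a domain empties."""
    queue = deque(constraints)                 # start with all arcs (x, y)
    while queue:
        x, y = queue.popleft()
        if revise(domains, constraints, x, y):
            if not domains[x]:
                return False                   # inconsistent: no solution
            # x's domain shrank, so re-check the arcs pointing at x
            queue.extend((z, x) for (z, w) in constraints if w == x and z != y)
    return True

domains = {v: set(range(1, 6)) for v in ('X1', 'X2', 'X3')}
constraints = {
    ('X1', 'X2'): lambda a, b: a >= b, ('X2', 'X1'): lambda a, b: a <= b,
    ('X1', 'X3'): lambda a, b: a > b,  ('X3', 'X1'): lambda a, b: a < b,
    ('X2', 'X3'): lambda a, b: a >= b, ('X3', 'X2'): lambda a, b: a <= b,
}
print(ac3(domains, constraints), domains)
```

Here propagation prunes 1 from X1 (nothing in X3 is below it) and 5 from X3; the result is not yet a solution, but every remaining value has support, so search afterwards is much easier.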
Lecture 10: Constraint Satisfaction Problems 2 (6.3-6.4)
 Backtracking search for CSPs
 depth-first search that chooses a value for one variable at a time,
and backtracks when a variable has no legal values left to assign.
 backtrack to a choice point on failure.
 keeps a single representation of the state and alters it
 Choices?
 which variable to assign next?
 which order to assign values to that variable?
 Variable choice
 choose vbl with fewest remaining values
 most constrained vbl is more likely to fail soon
 1,000+ times better performance
 choose vbl that is involved in constraints with largest
number of other vbls
 most influence
 Value assignment order
 prefer the value that rules out the fewest values in the
closest vbls in the constraint graph
 leave max flexibility for subsequent assignments
 Search mixed with inference
 after choice of value for vbl X do inference (e.g., arc consistency)
 forward checking: check arc consistency
 maintaining arc consistency (MAC): do AC3 on neighbors of X
 Intelligent backtracking on failure
 normal backtracking is "chronological"
 unwind in reverse temporal order
 improved backtracking is "dependencybased"
 unwind to point that contributed to failure
 e.g., conflict-directed backjumping
 nogood: keep track of set of vbls and their values that
cause a problem
 nogood set gives early warning of failure
 Min-conflicts
 Local search for CSPs  uses one state and modifies it
 8queens problem
 move randomly chosen conflicted piece
 move it to position with least conflicts (min-conflicts)
 works well for hard problems
 works well if there are many solutions in state space
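The 8-queens version of min-conflicts, as a minimal sketch (the representation, one queen per column with `cols[c]` giving its row, and the random tie-breaking are illustrative choices):

```python
import random

def conflicts(cols, col, row):
    """How many other queens would attack a queen placed at (col, row)."""
    return sum(1 for c, r in enumerate(cols)
               if c != col and (r == row or abs(r - row) == abs(c - col)))

def min_conflicts(n=8, max_steps=10000, seed=0):
    """Local search for n-queens: repeatedly pick a random conflicted queen
    and move it to the row in its column with the fewest conflicts."""
    rng = random.Random(seed)
    cols = [rng.randrange(n) for _ in range(n)]     # random complete assignment
    for _ in range(max_steps):
        conflicted = [c for c in range(n) if conflicts(cols, c, cols[c])]
        if not conflicted:
            return cols                             # complete AND consistent
        c = rng.choice(conflicted)                  # random conflicted queen
        # min-conflicts value choice, breaking ties randomly
        cols[c] = min(range(n),
                      key=lambda r: (conflicts(cols, c, r), rng.random()))
    return None

print(min_conflicts())
```

Unlike backtracking, this starts from a full (inconsistent) assignment and repairs it, which is why it works so well when solutions are dense in the state space.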
 Constraint posting
 constraints can record knowledge
 consider vbl X
 reasoning infers constraints
 post a constraint (X > 10)
 post another constraint (X < 12)
 don't decide value for X until you know a lot about it!
 Least Commitment
 Conditional CSPs
 configuration problems
 not all variables known in advance (unlike basic CSP!)
 use a part in the config, then add its variables
 i.e., vbls are conditional
 e.g., car config rules
 RV means Require Variable
 RNV means Require No Variable
 Package="luxury" ==>_{RV} Sunroof
 Sunroof="type2" ==>_{RV} Opener
 Type="convertible" ==>_{RNV} Sunroof
Lecture 11: Logical Agents & Propositional Logic (7.1-7.5, 7.7)
 Knowledgebased agents
 reasoning using representations of knowledge
 KB = knowledge base = collection of knowledge
 logic = declarative knowledge representation language
 TELL = agent told new knowledge
 ASK = agent asked what it knows or can "infer"
 axiom: taken as given, as being true
 knowledge level vs. implementation level
 Wumpus World
 discrete, static, singleagent, partially observable
 requires reasoning to update world model in order to decide moves
 Logic Intro
 allows truth values True and False
 KB has sentences in logic
 syntax = legal structure of sentence
 semantics = meaning of sentence given "possible world"
 model = possible world
 a sentence is true in some models and false in others
 model m makes sentence a true
≡ m satisfies a
 a entails b: b follows logically from a: a ⊨ b
 iff every model in which a is true also makes b true,
i.e., M(a) ⊆ M(b)
 logical inference uses logic to provide answers (e.g., about s)
 model checking = enumerating all possible models
to see if, for all models in which KB is true, s is true:
M(KB) ⊆ M(s)
KB ⊨ s
 Inference: finding if something follows from what you know
 lots of things are entailed by the KB, inference is looking
for one particular one.
 ⊢_{i} = inference using algorithm i
 KB ⊢_{i} s = s can be
derived from the KB
 a "sound" inference algorithm is truth preserving
 model checking is sound
 a "complete" inference algorithm can produce any sentence
that is entailed
 i.e., anything that follows logically
 if KB is true in the real world, then any sentence a
derived from KB by a sound inference procedure is also true in
the real world.
 grounding: connecting the logical reasoning with the agent's real world
 the agent's sensors create the connection
 Propositional Logic
 propositional symbols: each stands for a proposition (true
or false)
 connectives: 'not' (negation), 'and' (conjunction), 'or'
(disjunction),
'implies' (implication/ifthen), 'iff'
(ifandonlyif/biconditional/equivalence)
 operator precedence
 a model determines a truth value for every propositional symbol
 semantics: how to compute truth value for any sentence
 rules for evaluating truth of the 5 connectives
 note that T ⇒ F gives F for (P ⇒ Q) implication, and F implies anything
 truth tables: every assignment of T/F to propositions
 KB is set of propositions saying when they're true
e.g., P_{x,y} is true if there's a pit in location [x,y]
 KB includes sentences about propositions
e.g., ¬B_{1,1}
 simple inference: model checking for KB ⊢_{i} s
 check all assignments of T/F to propositions
 find assignments where KB is true (all sentences are true)
 look for how s is assigned.
 Propositional Theorem Proving
 theorem proving = applying rules of inference to KB to try
to show what we want
 logical equivalence = true in same set of models [e.g., ¬(¬P) ≡ P ]
 valid sentence = tautology = true in all models [e.g., P v ¬P ]
 satisfiable sentence = true in some model
 P is valid iff ¬P is unsatisfiable
i.e., if there are no models that satisfy ¬P
 KB ⊨ b iff (KB ∧ ¬b) is unsatisfiable
 e.g., to show b assume b to be false
and add ¬b to the KB
i.e., KB ∧ ¬b
 then try, by inference, to show this causes a contradiction
 if there's a contradiction then b must in fact
follow from KB
 known as proof by "refutation"
 Inference and Proofs
 inferences rules can be used in sequence in a proof
 Modus Ponens: given a and (a ⇒ b) then b can be inferred
 AndElimination: given (a ∧ b) infer a
 all the logical equivalences can be used as inference rules,
as they preserve truth
e.g., ¬(¬P) ≡ P
 monotonicity: set of entailed sentences only grows as more
are added to the KB
 inference rules might apply to anything in the KB (control needed)
 Proof by Resolution
 Resolution: an inference rule
 works on clauses: disjunction of literals
e.g., P ∨ Q ∨ ¬R
 (a ∨ b) resolves with (¬a ∨ c)
giving (b ∨ c)
 removes the complementary literals (a, ¬a)
 result has all of the other literals
 remove duplicated literals
 Resolution uses Conjunctive Normal Form (CNF)
 e.g., <clause> ∧ <clause> ∧ <clause>
 can convert any propositional logic sentence to CNF
 If you're trying to prove a
 1. convert (KB ∧ ¬a) into CNF
 2. use resolution inference rule on the resulting clauses
 3. if a resolvent is empty then we have a contradiction,
and a is proved.
 4. if no new clauses result then the proof ends.
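Steps 1-4 can be sketched for propositional clauses. The literal encoding (strings, with '-' marking negation) and the brute-force pairing of clauses are illustrative simplifications:

```python
from itertools import combinations

def resolve(c1, c2):
    """All resolvents of two clauses.
    A clause is a frozenset of literals; '-P' is the negation of 'P'."""
    out = []
    for lit in c1:
        comp = lit[1:] if lit.startswith('-') else '-' + lit
        if comp in c2:
            # drop the complementary pair, keep all other literals
            out.append((c1 - {lit}) | (c2 - {comp}))
    return out

def resolution_entails(kb_clauses, query):
    """Refutation: add the negated query, resolve until the empty clause
    appears (entailed) or no new clauses can be produced (not entailed)."""
    neg = frozenset({query[1:] if query.startswith('-') else '-' + query})
    clauses = set(map(frozenset, kb_clauses)) | {neg}
    while True:
        new = set()
        for c1, c2 in combinations(clauses, 2):
            for r in resolve(c1, c2):
                if not r:
                    return True            # empty clause = contradiction
                new.add(r)
        if new <= clauses:
            return False                   # nothing new: proof ends
        clauses |= new

# KB in CNF: (P => Q) becomes (-P v Q); plus the fact P
kb = [{'-P', 'Q'}, {'P'}]
print(resolution_entails(kb, 'Q'))   # -> True
print(resolution_entails(kb, 'R'))   # -> False
```

Resolving {-P, Q} with {P} yields {Q}, which then cancels the negated query {-Q} to give the empty clause, so Q is proved by refutation.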
 Using Horn clauses
 Horn clause: disjunction of literals, with at most one positive
e.g., P ∨ ¬Q ∨ ¬R
 resolution on Horn clauses produces Horn clauses
 Horn clauses can be written as implications (nicer to read/write)
e.g., (a ⇒ b) ≡ (¬a ∨ b)
 normal form is A ∧ B ⇒ C
 proofs controlled by forwardchaining or backwardchaining
search strategies
 ANDOR graph
 forward: (datadriven) starts from known facts (positive literals) and
works forwards by inferences until the query is found.
e.g., if you want to prove C, given A and also B, then use
(A ∧ B ⇒ C) to provide C.
 backward: (goaldirected) starts from query and works back trying to show
that all the things that lead to the query can be
inferred.
e.g., if you want to prove C, and (A ∧ B ⇒ C), then
prove both A and also B.
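Forward chaining over propositional Horn clauses is short enough to sketch whole. Rules are encoded here as (premises, conclusion) pairs, an illustrative representation of definite clauses like A ∧ B ⇒ C:

```python
def forward_chain(facts, rules, query):
    """Data-driven inference: fire every rule whose premises are all known,
    add its conclusion, and repeat until nothing new can be inferred."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and set(premises) <= known:
                known.add(conclusion)      # rule fires
                changed = True
    return query in known

# hypothetical rule base: A ∧ B ⇒ C,  C ⇒ D
rules = [(('A', 'B'), 'C'), (('C',), 'D')]
print(forward_chain({'A', 'B'}, rules, 'D'))   # -> True
print(forward_chain({'A'}, rules, 'D'))        # -> False
```

Backward chaining would run the same rules in the other direction: to prove D, prove C; to prove C, prove both A and B.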
 Agents based on Propositional logic (brief summary)
 problem: percepts (e.g., Stench) only apply at a particular time
 adding ¬Stench to a KB that already contains Stench gives a contradiction!
 fluent: something that changes
 need to state what changes and what doesn't for each action
 this is known as the "frame problem"
 hard to deal with in propositional logic as there are only symbols
 we can make symbols Stench^{1} and
Stench^{2} etc to show different times
N.B., the superscript is part of the symbol and has no
influence in the logic.
Lecture 12: First Order Logic (8.1-8.3, 8.4)
 Representation revisited
 Propositional logic  facts
 First Order Logic  facts, objects and relations
 can include variables
 includes statements about some or all (quantifiers)
 FOL assumes world with objects and relations
 true or false or unknown
 standard syntax  "syntactic sugar" provides allowed variants
 Syntax & Semantics
 models contain objects (Richard), relations (brotherof),
properties (king), functions (left leg)
 syntactic elements in the language are symbols
 constant symbols (Richard) stand for objects
 predicate symbols (Brother) stand for relations
 function symbols (LeftLeg) stand for functions
 interpretation specifies exactly what in the model symbols
refer to
 terms refer to objects  e.g., Richard, or LeftLeg(Richard)
 atomic sentences = facts  e.g., Brother(Richard, John)
 logical connectives
 Quantifiers  'for all' ∀ and 'there exists' ∃
 use variables
 ∀x King(x) ⇒ Person(x)  note that T ⇒ F gives F
 ∃x Crown(x) ∧ OnHead(x, John)
 quantifier order matters
 ∀x ∃y Loves(x, y)
 ∃y ∀x Loves(x, y)
 use different vbl names for each quantifier
 ∃ and ∀ are related by ≡ rules  how?
 equality: two terms refer to same object  e.g., Father(John) = Henry
 alternative semantics
 unique-names assumption  every constant refers to a distinct object
 closed-world assumption  if we don't know it's true, it's false
 domain closure  # domain elements = # constant symbols
 Using FOL
 TELL  add "assertions" to KB
 ASK queries  can retrieve directly or infer
 ASKVARS gives vbl bindings/substitutions for the answer
e.g., ASKVARS(KB, Person(x)) gives {x/John}
and also {x/Richard}
 theorems are derived from axioms (i.e., from basic factual info and definitions)
 theorems can be used in inference too
 unlike Propositional logic, can make statements about any time
e.g., ∀t HaveArrow(t + 1) ⇔ (HaveArrow(t) ∧ ¬Action(Shoot, t))
 Knowledge Engineering in FOL
 knowledge engineering = KB construction for task/domain
 Identify task: what needs to be represented
 Assemble relevant knowledge: knowledge acquisition
 Decide on vocabulary: predicates, functions and constants
i.e., define the Ontology
 Encode general knowledge about domain
 Encode specific problem instance (e.g., info from sensors)
 Pose queries and get answers (ASK)
 Debug the KB (and individual sentences)
Lecture 13: Inference in First Order Logic (9.1-9.5)
 Propositional vs First Order Inference
 simple inefficient approach: convert FOL to propositional logic then do inference
 remove quantifiers and variables
 ∀  if possible do Universal Instantiation
(substitute variables with ground terms)
 ∃  pick a Skolem constant to stand for the thing
that exists.
 typically generates lots of sentences, many irrelevant
 Unification
 for FOL use Generalized Modus Ponens (MP)
 find substitutions for variables that make regular MP usable
 Generalized MP is MP "lifted" to apply to variables
 unification = finding substitutions that make different
logical expressions look identical
e.g., UNIFY(Knows(John,x), Knows(y,Bill)) = {x/Bill, y/John}
 after unification, a P with vbls matches
the P in (P ⇒ Q), allowing MP
Note: skip section about making retrieval more efficient
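The unification idea above can be sketched in Python. This is an illustrative toy, not the book's algorithm verbatim: terms are tuples, variables are lowercase strings, constants start uppercase, and the occurs-check is omitted for brevity.

```python
def is_var(t):
    # a variable is a lowercase string, e.g. 'x'; constants start uppercase
    return isinstance(t, str) and t[0].islower()

def unify(x, y, s=None):
    """Unify terms x and y under substitution s; return extended s or None."""
    if s is None:
        s = {}
    if x == y:
        return s
    if is_var(x):
        return unify_var(x, y, s)
    if is_var(y):
        return unify_var(y, x, s)
    if isinstance(x, tuple) and isinstance(y, tuple) and len(x) == len(y):
        for xi, yi in zip(x, y):  # unify argument lists element by element
            s = unify(xi, yi, s)
            if s is None:
                return None
        return s
    return None

def unify_var(v, t, s):
    """Bind variable v to term t, chasing existing bindings first."""
    if v in s:
        return unify(s[v], t, s)
    if is_var(t) and t in s:
        return unify(v, s[t], s)
    s = dict(s)   # (occurs-check omitted for brevity)
    s[v] = t
    return s
```

For example, `unify(('Knows','John','x'), ('Knows','y','Bill'))` yields the substitution {x/Bill, y/John} from the notes.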
 Forward Chaining
 useful for Situation ⇒ Response systems (rules)
 use definite clauses: disjunctions of literals with exactly
one positive
 perfect for sentences such as: King(x) ∧ Greedy(x) ⇒ Evil(x)
which converts into a definite clause
 algorithm: start from known facts, use all rules whose
premises are satisfied, and add the conclusions to the known
facts, and repeat until query answered.
 sound and complete
 may not be efficient
 incremental forward chaining: every new fact inferred in
iteration t must be derived from at least one new fact inferred
in iteration t-1.
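The forward-chaining loop can be sketched minimally in Python, using ground (variable-free) rules for brevity; the fact and rule names are illustrative ground instances.

```python
def forward_chain(rules, facts, query):
    """Fire every rule whose premises are all known; add conclusions and
    repeat until nothing changes or the query is derived.
    Rules are (premise_set, conclusion) pairs (definite clauses)."""
    known = set(facts)
    changed = True
    while changed and query not in known:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and premises <= known:
                known.add(conclusion)
                changed = True
    return query in known

# a ground instance of King(x) ∧ Greedy(x) ⇒ Evil(x)
rules = [(frozenset({'King(John)', 'Greedy(John)'}), 'Evil(John)')]
```

With both facts asserted the query `Evil(John)` succeeds; with only `King(John)` it fails.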
 Backward Chaining
 works backwards from goal query
from conclusions back to premises
 uses definite clauses
 needs to keep track of accumulated substitutions
 can be done by depth-first search
 ANDOR tree
 used in Logic Programming (e.g., Prolog)
Note: skip sections 9.4.3-9.4.6
 Resolution
 Every sentence of FOL can be converted into an inferentially
equivalent Conjunctive Normal Form (CNF) sentence
i.e., a conjunction of clauses, with each clause being a
disjunction of literals:
clause e.g., ¬American(x) ∨ ¬Weapon(y) ∨ ¬Hostile(z)
∨ ¬Sells(x,y,z) ∨ Criminal(x)
 to convert to CNF
 eliminate implications
 move ¬ inwards
 standardize variables
 Skolemize to remove existential quantifiers
 drop universal quantifiers
 distribute ∨ over ∧
 result is a set of clauses connected by ∧
 resolution inference:
 take two clauses with complementary literals
 find a substitution that allows one to "cancel out the other"
 what's left over, with the substitution, forms the resolvent clause
 resolution proof: prove KB ⊨ a by proving that (KB ∧ ¬a)
is unsatisfiable,
by deriving the empty clause.
 each resolution step adds a new clause to the KB (increasing in size)
Note: skip sections 9.5.4-9.5.5
 Resolution Strategies: resolution needs guidance about which
clauses to try to resolve
 Unit preference: prefer resolutions in which one clause is a
single literal (gets shorter clauses back)
 Set of Support: always use a member of a predetermined
set of clauses in each resolution step (e.g., initially use negated
query  add every resolvent to the set of support)
 Input Resolution: always use clauses from KB or the query
 Subsumption: eliminate all sentences that are more
specific (subsumed by) than something already in the KB
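The refutation loop can be sketched for the propositional case (the lifted FOL version would add unification). Clauses are frozensets of literals, with `-` marking negation; this naive version tries all clause pairs with no strategy.

```python
def negate(lit):
    """Complement of a literal: P <-> -P."""
    return lit[1:] if lit.startswith('-') else '-' + lit

def resolve(c1, c2):
    """All resolvents of two clauses (sets of literals)."""
    out = []
    for lit in c1:
        if negate(lit) in c2:
            out.append(frozenset((c1 - {lit}) | (c2 - {negate(lit)})))
    return out

def resolution_entails(kb, alpha):
    """Refutation: KB |= alpha iff KB ∧ ¬alpha derives the empty clause.
    alpha is assumed to be a single literal here."""
    clauses = set(kb) | {frozenset([negate(alpha)])}
    while True:
        new = set()
        for a in clauses:
            for b in clauses:
                if a != b:
                    for r in resolve(a, b):
                        if not r:          # empty clause derived
                            return True
                        new.add(r)
        if new <= clauses:                 # no new clauses: not entailed
            return False
        clauses |= new
```

E.g., from {¬P ∨ Q, P} the query Q is entailed (resolving with ¬Q yields the empty clause), while R is not.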
Lecture 14: Classical Planning (10-10.3, 10.4.4, 11.1-11.2.2)
 Definition
 devising a plan of action to achieve one's goals
 world is represented by a collection of variables
 a search problem: initial state; actions available; result of
acting; goal test.
 state: a conjunction of fluents (with no variables)
 closed world assumption
 unique names assumption
 Action: defined using an action schema using vbls (represents a set of
specific actions)
e.g., Fly: fly from Boston to SF, fly from Austin to NYC, ...
 actions only mention preconditions and effects
 preconditions must be true in order to do the action
 effects: delete list (no longer true) & add list (new fluents)
e.g., ¬At(p,from) ∧ At(p,to)
 initial state: a specific state description
e.g., At(C1,SFO) ∧ At(C2,JFK) ∧ ...
 goal: a conjunction of literals that may contain vbls
e.g., At(C1,JFK) ∧ At(C2,SFO)
 note that actions may have costs, or the count of actions
could be used if we assume equal costs.
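The state/action representation above can be sketched directly: states are sets of ground fluents (closed world), and an action has preconditions, a delete list, and an add list. The names (Fly, P1, BOS, SFO) are illustrative ground instances of a schema.

```python
from collections import namedtuple

# a ground STRIPS-style action: preconditions must hold in the state;
# effects are a delete list (fluents removed) and an add list (fluents added)
Action = namedtuple('Action', 'name precond delete add')

def applicable(state, action):
    """An action is applicable when all its preconditions hold."""
    return action.precond <= state

def result(state, action):
    """Successor state: remove the delete list, then add the add list."""
    return (state - action.delete) | action.add

fly = Action('Fly(P1,BOS,SFO)',
             precond=frozenset({'At(P1,BOS)'}),
             delete=frozenset({'At(P1,BOS)'}),
             add=frozenset({'At(P1,SFO)'}))

s0 = frozenset({'At(P1,BOS)', 'At(C1,BOS)'})
s1 = result(s0, fly) if applicable(s0, fly) else s0
```

Forward state-space search is then ordinary search over these states with `applicable`/`result` as the transition model.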
 Planning as statespace search
 Forward statespace search (progression)
 start from initial state and apply actions until goal is found
 strong domainindependent heuristic needed, and available
 most planning systems use forward search
 Backward relevantstates search (regression)
 start from goal and apply relevant actions backwards until initial
state found
 select actions that could contribute to the goal, but
don't negate an element of the goal
 previous state is current state without the add list and
including the preconditions
 heuristics for planning
 try to find a relaxed problem
 ignore all preconditions
 ignore some preconditions
 ignore delete lists
 ignore some fluents
 use decomposition
 assume independent subgoals, solve separately, combine costs
 use pattern databases
 stored cost for problems with particular pattern in them
 Planning graph
 can give a better heuristic estimate for guiding planning search
 graph can be used to estimate how many steps to reach goal
 GraphPlan: extract plan from searching in the planning graph
 for propositions only (no variables)
 connects possible states with possible actions
 S_{0}, A_{0}, S_{1}, A_{1}, ...
 S_{i} is all the literals that could hold at time i,
depending on the actions taken in prior steps.
 A_{i} is all the actions that could be taken from
S_{i} including "persistence" (i.e., no change /
noop action).
 build new S levels with actions between until there's no
change in the literals included (levelled off)
 planning graph isn't too costly to construct
 can extract plan as a backward search once all literals from
the goal are present in some S level and they aren't
marked as mutually exclusive.
 Mutex links = mutual exclusion
i.e., things that can't exist together
e.g., Have(Cake) with ¬Have(Cake)
e.g., Have(Cake) with Eaten(Cake)
e.g., Bake(Cake) with Eat(Cake) (i.e., actions have conflicting prereqs)
 Mutex between actions too: Inconsistent effects; Interferences; Competing needs.
 if any goal literal is not in final S_{i} level then
problem is not solvable
 heuristic: can estimate the cost of achieving any goal literal by what
level of the graph it first appears (level cost)
 heuristics: for goal with conjunction of literals, try sum of level costs
 Partial Order Planning
 totally ordered plans: linear sequence of actions
 partial order plans: actions with ordering constraints
i.e., add liquid to flour BEFORE whisk together
 find flaw in plan at each stage and suggest an action to
add to fix it
 use "least commitment" to fix flaw
 build partial order plan
 backtrack if necessary
 can combine with libraries of highlevel plans
 Schedules
 include how long an action takes, and when it should occur
 plan first and schedule later
 can also have resource constraints
e.g., there is only one engine hoist
that matters because a partial-order plan may put such actions in parallel
 resources reusable or consumable
 duration of plan used as cost function
 actions have durations, and earliest & latest start times
 slack: range of start times
 CPM: Critical Path Method
 critical path is the one whose duration is longest
whole plan can't be shorter
 from start, can look at earliest start for each action in a path
 from end, can look at latest start for each action in a path
 order constraints impose possible actual start times
 resource constraints add additional restrictions
e.g., actions using the one hoist can't overlap
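The CPM forward/backward passes can be sketched as follows; the action names and durations below are a made-up toy loosely in the spirit of the assembly example, and the fixpoint loops are a simple (not optimal) way to propagate the order constraints.

```python
def schedule(durations, orders):
    """Critical Path Method: earliest/latest start times and slack.
    durations maps action -> time; orders is a list of (before, after)
    constraints. Fixpoint loops are O(n*m) -- fine for small plans."""
    acts = list(durations)
    es = {a: 0 for a in acts}
    for _ in acts:                       # forward pass from the start
        for b, a in orders:
            es[a] = max(es[a], es[b] + durations[b])
    finish = max(es[a] + durations[a] for a in acts)
    ls = {a: finish - durations[a] for a in acts}
    for _ in acts:                       # backward pass from the end
        for b, a in orders:
            ls[b] = min(ls[b], ls[a] - durations[b])
    slack = {a: ls[a] - es[a] for a in acts}  # zero slack = on the critical path
    return es, ls, slack

durations = {'AddEngine': 30, 'AddWheels': 30, 'Inspect': 10}
orders = [('AddEngine', 'AddWheels'), ('AddWheels', 'Inspect')]
es, ls, slack = schedule(durations, orders)
```

Here the single chain is the critical path, so every action has zero slack and the plan cannot finish before 70.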
 Hierarchical Planning (Hierarchical Task Networks)
 humans plan using high-level actions (HLAs) first
e.g., get to airport, fly, drive to destination
i.e., HLA + HLA + HLA
 hierarchical decomposition:
higher = more abstract; lower = more concrete
 each HLA has one or more "refinements"
 a refinement is a more concrete sequence of actions
(either HLAs or primitive actions)
 can refine plans recursively down to primitives
 at least one of the fully refined plans must achieve the goal
 can use a plan library of refinements
 a lot of knowledge about refinements can be encoded
 planner effectively searches the space of plan refinements
 it can be done breadth first.
Lecture 15: Knowledge Representation (8.4, 12.1-12.6)
 "knowledge is power"
 how many types of knowledge representation have we seen so far?
 Ontological Engineering
 ontology: those concepts that exist and can be reasoned about
in the world
 general concepts: events, time, physical objects, beliefs
 Ontological Engineering: representing these concepts
 Upper Ontology (e.g.,
SUMO)
(Adam Pease, WPI, BS&MS)
 add more details down to specific levels (e.g., Wumpus)
 all upper level details (axioms) must still be relevant at lower
levels (apart from exceptions)
 ontologies produced by:
 a team of ontologists/logicians
 importing categories, attributes and values from databases
 extracting information from text documents automagically
 doing it wiki style with open access
 Categories and Objects
 category knowledge is vital
e.g., supports recognition and also prediction
 use Basketball(b) or "reify" it to Basketballs
 subclass and member relations
 subclasses form a taxonomy (e.g., plants)
 Basketballs ⊂ Balls
 BB9 ∈ Basketballs
 for categories assume ∀
 (x ∈ Basketballs) ⇒ Spherical(x)
 Orange(x) ∧ Round(x) ∧ Diameter(x) = 9.5
∧ x ∈ Balls ⇒ x ∈ Basketballs
 Males and Females are subclasses of Animals
 they are an exhaustive decomposition
 they are disjoint (no members in common)
 can define categories
x ∈ Bachelors ⇔ Unmarried(x) ∧
x ∈ Adults ∧ x ∈ Males
 natural kinds: most realworld categories have no clearcut
definitions
e.g., games, tomatoes, chairs, ...
... think of a definition based on an example, think of a counterexample!
 Physical decomposition also needs to be represented
 Partof hierarchies
 tricky! is "cheek partof face" the same as
"wheel is partof car"?
 composite objects have structural relationships between
parts: e.g., Attached(x,y)
 bunch: objects with definite parts but no structure
BunchOf(Apples)
 Measurements: uses measure objects
Length(L1) = Inches(1.5) = Centimeters(3.81)
 some things don't have a scale (e.g., beauty), but still
can use
Beauty(Rose1) > Beauty(Weed1)
 Stuff  part of stuff is stuff (e.g., butter)
 intrinsic properties: belonging to the substance of the
object
e.g., color, flavor, ownership, ...
 extrinsic properties: belonging to the object
e.g.,
length, shape, weight, ...
 a category that includes only intrinsic properties is a substance
 what is half of a pile of sand?
 Events
 events are actions based on points in time
 fluent: may change over time  At(DCB, Office)
 assert that it's true  T(At(DCB, Office))
 events take place over a time interval
Happens(e,i) where i = (t1, t2)
 events can make fluents become true or false at some time
Terminates(e,f,t)  event e causes fluent f to cease to
hold at time t
 Processes: actions where any part of the action is still the
same type
 sorta like "stuff" for objects
 e.g., Flyings
 Time intervals: moments (zero duration) and extended intervals
 predicates for time intervals
 Meet(i,j) ⇔ End(i) = Begin(j)
 Before(i,j) ⇔ End(i) < Begin(j)
 After(i,j)
 During(i,j)
 Overlap(i,j)
 Begins(i,j)
 Finishes(i,j)
 Equals(i,j)
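These interval predicates can be sketched over concrete (begin, end) pairs. The exact boundary conventions (strict vs. non-strict comparisons) vary between formalizations, so treat the choices below as one plausible reading.

```python
# time intervals as (begin, end) pairs with begin <= end
def meet(i, j):     return i[1] == j[0]                    # i ends where j begins
def before(i, j):   return i[1] < j[0]
def after(i, j):    return before(j, i)
def during(i, j):   return j[0] <= i[0] and i[1] <= j[1]   # i inside j
def overlap(i, j):  return i[0] < j[0] < i[1] < j[1]
def begins(i, j):   return i[0] == j[0] and i[1] < j[1]
def finishes(i, j): return i[1] == j[1] and j[0] < i[0]
def equals(i, j):   return i == j
```

E.g., (1, 3) meets (3, 5), and (2, 3) is during (1, 4).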
 Fluents and objects  an object is a chunk of spacetime!
 President(USA) denotes a single object that consists of
different people at different times!
 Mental events and objects
 agents need statements about beliefs (mental objects)
 propositional attitudes: believes, knows, wants, intends, informs
 need Modal logic: include qualifications of a statement,
such as "usual", "possible", "necessary", "impossible", "always",
"believed", ...
 K_{A}P means "A knows P"
 can make statements about one agent's knowledge about
another's knowledge
e.g., K_{A}[K_{B}P]
i.e., A knows that B knows
 K_{A}P ⇒ K_{A}(K_{A}P)
i.e., if they know something then they know that they know it
 need complicated (!) collection of "possible worlds" to figure
out the semantics.
 Reasoning with categories
 semantic networks: graphical way of representing knowledge + inference
 most semantic networks have an underlying logic
 distinguish between categories and individuals
MalePersons vs. John
SubsetOf vs. MemberOf
 inheritance: properties of categories flow down to subcategories
 multiple inheritance: MemberOf(tux,Penguins),
MemberOf(tux,Birds), does tux fly?
 semantic nets allow "default" values
these can be overridden by specified values in subcategories
 description logics: logics tuned to categories and for
deciding relationships between them
 subsumption: checking if one category is a subset of another
by checking definitions
 classification: checking whether an object belongs to a
category
 consistency: checking if category definition is logically satisfiable
 dl language is intended to be easier to write than FOL
 but they typically lack negation and disjunction
 dl emphasises tractability of inference
 And[Man, AtLeast(3, Son), AtMost(2, Daughter),
All(Son, And(Unemployed, Married, All(Spouse, Doctor)))
All(Daughter, And(Professor, Fills(Department, Physics, Math)))]
 Default information
 example of default knowledge?
 monotonic: inference only adds statements to the KB; earlier conclusions are never invalidated
 nonmonotonic: override inherited properties: e.g., with Legs(John,1)
 new evidence can override default statement
(can't have both 1 and 2 legs!)
 nonmonotonic logics: "circumscription", and "default logic"
 circumscription: add circumscribed predicates
e.g., Bird(x) ∧ ¬Abnormal(x) ⇒ Flies(x)
 assume ¬Abnormal(x) unless Abnormal(x) is declared to be true
 default logic: includes default rules
 Bird(x) : Flies(x) / Flies(x)
if prereq Bird(x) is true, and justification Flies(x) is consistent with
KB, then conclude Flies(x)
 Nixondiamond semantic net example
Republican(Nixon) ∧ Quaker(Nixon)
Republican(x) : ¬Pacifist(x) / ¬Pacifist(x)
Quaker(x) : Pacifist(x) / Pacifist(x)
 Truth Maintenance: retracting facts as needed (belief revision)
 suppose P had been assumed by default, but ¬P is found
 need to retract P and assert ¬P, but also retract all
sentences inferred from P!
 JTMS: justification-based truth maintenance
 annotate each sentence in KB with justification
sentences from which it was inferred
 allows sentences with multiple justifications not to be retracted
 sentences without justification are marked as out
(not deleted), allowing efficient future changes
 ATMS: assumptionbased TMS
keeps track of all the assumptions that would cause a
sentence to be true.
Lecture 16: Quantifying uncertainty (13.1-13.3)
 Acting under uncertainty
 Intro...
 uncertainty due to partial observability, nondeterminism
 uncertainty due to
Laziness, Theoretical Ignorance, Practical Ignorance.
 belief state: set of all possible worlds the uncertain agent might be in
 Summarizing uncertainty...
 connections between effect and cause is not a logical
consequence, but is affected by degree of belief (probability)
 probability summarizes uncertainty
 probability statements made wrt knowledge states (what's known)
 Uncertainty and rational decisions...
 agents prefer some outcomes over others
 utility: quality of being useful (preferences)
 basic idea: if it is highly probable and highly useful, that's good!
 Decision Theory = Probability Theory + Utility Theory
 Principle of maximum expected utility
Agent is "rational" iff it chooses the action that yields
the highest expected utility, averaged over all the possible
outcomes.
 Basic Notation
 what probabilities are about...
 sample space: set of all possible worlds
mutually exclusive & exhaustive
e.g., set of all rolls from a pair of dice (1,1),(1,2),...,(6,6)
 probability model: numerical probability with each possible
world (0 to 1)
 pair of dice: P(Total=11) = P((5,6)) + P((6,5)) = 1/36 +
1/36 = 1/18 (an unconditional probability)
 P(doubles) = 1/6 (for fair dice)
 P(cavity) = 0.2
 unconditional P, or prior P (i.e., there's no other evidence)
 if first die is 5, P(doubles | Die1 = 5) = ??
 conditional P, or posterior P (i.e., it depends on other evidence)
e.g., P(cavity | toothache) = 0.6
P(cavity | toothache ∧ ¬cavity) = 0
 product rule: P(a ∧ b) = P(a | b) P(b)
 the language of propositions (probability assertions)...
 random variable: variables in probability theory
e.g., Weather, Cavity, Toothache
 each random variable has a domain of values
e.g., Weather has {sunny, rain, cloudy, snow}
 can write "sunny" for Weather = sunny
 P(Weather) = < 0.6, 0.1, 0.29, 0.01 >
stands for
P(Weather = sunny) = 0.6
P(Weather = rain) = 0.1
P(Weather = cloudy) = 0.29
P(Weather = snow) = 0.01
 probabilities sum to 1.
 the P statement defines a "probability distribution"
for the single variable Weather (here, as a vector)
 joint probability distribution: P(Weather, Cavity)
includes some of the random variables
 this is a 4 * 2 table of probability values
{sunny, rain, cloudy, snow}, {cavity, ¬cavity}
 P(sunny, Cavity) is 2 element vector
sunny with cavity, sunny with no cavity
 P(sunny, cavity) is a 1 element vector
 full joint probability distribution
includes all of the random variables
e.g., P(Weather, Toothache, Cavity)
 a possible world is an assignment of values to all the
variables under consideration
e.g., 4 * 2 possible worlds for vbls Weather and Cavity
 skip probability axioms and their reasonableness...
 where do probabilities come from...
 different views
 frequentist: from experiments, observed samples
 objectivist: probabilities are real aspects of the universe
 subjectivist: a way of characterizing an agent's belief,
without external physical significance
 Inference using Full Joint Distributions
 full joint distribution for Toothache, Catch, Cavity (sum to 1)
 look at worlds where proposition is true and add their probabilities
 marginal probability: use a subset of the variables
P(cavity) = 0.108 + 0.012 + 0.072 + 0.008 = 0.2
i.e., cavity in all of the 4 situations of the 2 other vbls.
 marginalization: sum up all values over the other variables
P(Cavity) = sum of P(Cavity, z), over z,
where z is {Catch, Toothache}
 similarly for conditional probabilities (conditioning)
 usually want to compute conditional probabilities
i.e., use the effect of evidence
 P(cavity | toothache) = P(cavity ∧ toothache) / P(toothache)
from product rule
 P(¬cavity | toothache) = P(¬cavity ∧ toothache) / P(toothache)
 view 1/P(toothache) as a "normalization factor" = α
without knowing value of P(toothache)
 P(Cavity | toothache) = α P(Cavity, toothache)
 = α[P(Cavity, toothache, catch) + P(Cavity, toothache, ¬catch)]
 but you need full joint distribution to answer, so it doesn't scale :(
 in general
P(X | e) = α P(X, e)
= α ∑_y P(X, e, y)
where e is all the evidence, y ranges over all possible combinations of values from
the unobserved vbls.
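This whole section can be sketched with the dentist numbers quoted above: the full joint is a table keyed by (toothache, catch, cavity) truth values, and marginalization and conditioning are just sums over it.

```python
# full joint over (Toothache, Catch, Cavity); numbers are the textbook's
# dentist-domain table, keyed by (toothache, catch, cavity) truth values
P = {(True,  True,  True):  0.108, (True,  True,  False): 0.016,
     (True,  False, True):  0.012, (True,  False, False): 0.064,
     (False, True,  True):  0.072, (False, True,  False): 0.144,
     (False, False, True):  0.008, (False, False, False): 0.576}

def prob(pred):
    """Marginalization: add the probabilities of all worlds where pred holds."""
    return sum(p for world, p in P.items() if pred(world))

p_cavity = prob(lambda w: w[2])                      # 0.108+0.012+0.072+0.008
p_toothache = prob(lambda w: w[0])
# conditioning via the product rule: P(cavity | toothache)
p_cavity_given_toothache = prob(lambda w: w[2] and w[0]) / p_toothache
```

This reproduces P(cavity) = 0.2 and P(cavity | toothache) = 0.6 from the notes, and also shows why the approach doesn't scale: the table has 2^n entries.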
Lecture 17: Uncertainty & Bayes (13.4-13.5)
 Independence
 some variables have no influence on others
e.g., evidence about toothache, catch and cavity have no influence on
cloudiness (they're independent)
i.e., P(cloudy | toothache, catch, cavity) = P(cloudy)
 if independent: P(a | b) = P(a), or P(b | a) = P(b), or P(a ∧ b) = P(a)P(b)
 can generalize for P (probability distributions)
 it factors large joint distributions into smaller ones.
 nice but often hard to find.
 Bayes' Rule
 Rule: P(b|a) = P(a|b)P(b) / P(a)
 as a set of equations with background evidence e
P(Y|X,e)
= P(X|Y,e)P(Y|e) / P(X|e)
where e could be toothache and catch
 Applying Bayes' rule: the simplest case...
 Best thought of as
P(cause|effect)
= P(effect|cause)P(cause) / P(effect)
with e.g., effect = symptom, cause = disease
 diagnosis problem: given a symptom what is the disease?
 uses causal knowledge  what things cause what effects
 Using Bayes' rule: combining evidence...
 Toothache and Catch are probably dependent
 If there's a Cavity, then Cavity can cause Toothache, and
Cavity can cause Catch, but neither has a direct effect on
the other.
 i.e., in the presence of Cavity, Toothache and Catch can be
considered independent
 called "conditional independence"
 P(toothache ∧ catch | Cavity)
= P(toothache|Cavity) P(catch|Cavity)
 to decompose a full joint distribution, using conditional independence
P(Toothache, Catch, Cavity)
= P(Toothache, Catch | Cavity) * P(Cavity)
= P(Toothache|Cavity) P(Catch|Cavity) * P(Cavity)
= P(Cavity) * P(Toothache|Cavity) P(Catch|Cavity)
giving three smaller tables
 this allows probabilistic systems to scale up.
 in general
P(Cause, Effect_{1},...,Effect_{n})
= P(Cause) *
Π P(Effect_{i} | Cause)
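This cause/effects factorization can be sketched as a tiny naive-Bayes-style classifier. The conditional numbers below are derived from the dentist full joint table (e.g., P(toothache|cavity) = 0.6, P(catch|cavity) = 0.9), so they are the book's example values, not new data.

```python
def posterior(prior, cond, effects):
    """P(Cause | effects) ∝ P(Cause) * Π P(effect_i | Cause), normalized.
    prior: {cause: P(cause)}; cond: {(effect, cause): P(effect | cause)}."""
    scores = {}
    for cause, p in prior.items():
        for e in effects:
            p *= cond[(e, cause)]     # conditional independence given the cause
        scores[cause] = p
    z = sum(scores.values())          # the normalization factor alpha
    return {c: s / z for c, s in scores.items()}

prior = {'cavity': 0.2, 'no_cavity': 0.8}
cond = {('toothache', 'cavity'): 0.6, ('toothache', 'no_cavity'): 0.1,
        ('catch', 'cavity'): 0.9,     ('catch', 'no_cavity'): 0.2}
post = posterior(prior, cond, ['toothache', 'catch'])
```

This stores 1 + 2n numbers per cause instead of a 2^n joint table, which is why the decomposition scales.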
Lecture 18: Probabilistic Reasoning (14.1-14.2, 14.4, 16.1-16.2)
 Representing knowledge in an uncertain domain
 Bayesian network: data structure that can represent a full
joint distribution using conditional independence and smaller distributions.
 a directed acyclic graph.
 if node1 → node2 then node1 is "parent" of node2
node1 has a "direct influence" on node2
 conditional independence is indicated by lack of link between
two nodes, but with shared parent
 independent variables aren't connected to others
 nodes annotated with conditional probability distribution
P(X_{i} | Parents(X_{i}))
 giving effects of parents on that node
 when building a network order variables so that causes
precede effects
 include links from parents if one variable directly influences another
 Semantics of Bayesian networks
 For a particular entry in the joint distribution over all n variables
i.e., X_{1}=x_{1} ∧ ... ∧
X_{n}=x_{n}
P(x_{1},....,x_{n})
= Π P(x_{i} | parents(X_{i}))
 varying i from 1 to n.
 e.g., for john, mary, alarm, not burglary, not earthquake
P(j, m, a, ¬b, ¬e)
= P(j|a) P(m|a) P(a | ¬b∧¬e) P(¬b) P(¬e)
by tracing back to parents.
 causal models: causes → effects
 diagnostic models: effects → causes
 causal models easier to build, and easier to get
probabilities for nodes
 skip 14.2.2 and 14.3
 Exact inference in Bayesian Networks
 usual problem is to compute posterior probability for query
vbls
given some event (some assignment to evidence variables)
 X is query vbl
 E is set of evidence variables E_{1},...,E_{m}
 e is observed event (evidence)
 Y is set of nonevidence, nonquery vbls Y_{1},...,Y_{l}
the "hidden variables"
 complete set of vbls X = {X} ∪ E ∪ Y
 typical query P(X | e)
 sample query P(Burglary | JohnCalls=true, MaryCalls=true) = <0.284, 0.716>
 i.e., P(B | j, m), and e = earthquake, a = alarm, b = burglary
 Inference by enumeration...
 for typical query
 in general
P(X | e) = α P(X, e)
= α ∑_y P(X, e, y)
where y ranges over all possible combinations of values from
the unobserved vbls.
 note that
P(x_{1},....,x_{n})
= Π P(x_{i} | parents(X_{i}))
 that allows P(x, e, y) to be calculated
 P(B | j, m) = α P(B, j, m)
= α ∑_{e}∑_{a} P(b)P(e)P(a|b,e)P(j|a)P(m|a)
 note that this uses each of the P(x_{i} | parents(X_{i}))
in the network
 skip the rest
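The burglary query above can be reproduced by enumeration. The CPT numbers below are the textbook's standard burglary-network values (P(b)=0.001, P(e)=0.002, etc.), hard-coded for this one example rather than a general network representation.

```python
def p_node(var, val, ev):
    """CPT lookup: probability that var=val given its parents' values in ev."""
    if var == 'B':
        p = 0.001
    elif var == 'E':
        p = 0.002
    elif var == 'A':
        p = {(True, True): 0.95, (True, False): 0.94,
             (False, True): 0.29, (False, False): 0.001}[(ev['B'], ev['E'])]
    elif var == 'J':
        p = 0.90 if ev['A'] else 0.05
    else:  # 'M'
        p = 0.70 if ev['A'] else 0.01
    return p if val else 1 - p

ORDER = ['B', 'E', 'A', 'J', 'M']     # causes before effects

def enumerate_all(vars_left, ev):
    """Sum out the hidden variables: Σ_y Π_i P(x_i | parents(X_i))."""
    if not vars_left:
        return 1.0
    v, rest = vars_left[0], vars_left[1:]
    if v in ev:
        return p_node(v, ev[v], ev) * enumerate_all(rest, ev)
    return sum(p_node(v, val, ev) * enumerate_all(rest, {**ev, v: val})
               for val in (True, False))

def query(var, ev):
    """P(var | ev) by enumeration, normalized by alpha."""
    dist = {val: enumerate_all(ORDER, {**ev, var: val}) for val in (True, False)}
    z = sum(dist.values())
    return {val: p / z for val, p in dist.items()}

pb = query('B', {'J': True, 'M': True})   # ≈ {True: 0.284, False: 0.716}
```

This matches the <0.284, 0.716> answer quoted in the notes.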
 Quick Intro to Utility
 Decision Theory: choose amongst actions based on immediate outcomes
 in nondeterministic, partially observable environment
 RESULT(a) is a random vbl that has values that are possible outcome states of
action a
 P(RESULT(a)=s' | a, e)
probability of outcome s' given action a executed and evidence observations e
 utility function: U(s') gives a number expressing desirability/usefulness
of the state s'
 EU(a | e)  expected utility of an action:
with lots of outcomes we need a way of
weighting their utility by their probability
 EU(a | e) = ∑_{s'} P(RESULT(a)=s' | a, e) * U(s')
 maximum expected utility (MEU): a rational agent should pick
the action that maximizes the expected utility
 action = argmax_{a} EU(a | e)
 Preferences in choice:
A > B  agent prefers A over B
A ~ B  agent is indifferent between A and B
A ≥ B  agent prefers A over B, or is indifferent between them
 there are axioms of utility theory that if followed will have
an agent exhibit rational behavior.
 if so
U(A) > U(B) ⇔ A > B
U(A) = U(B) ⇔ A ~ B
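A minimal MEU sketch: the two actions and their (probability, utility) outcome lists below are made up for illustration.

```python
def expected_utility(action, outcomes):
    """EU(a|e) = Σ_s' P(RESULT(a)=s' | a, e) * U(s').
    outcomes maps each action to a list of (probability, utility) pairs."""
    return sum(p * u for p, u in outcomes[action])

def best_action(outcomes):
    """The rational (MEU) choice: argmax_a EU(a | e)."""
    return max(outcomes, key=lambda a: expected_utility(a, outcomes))

# hypothetical outcome model for two actions
outcomes = {'stay': [(1.0, 4.0)],
            'move': [(0.8, 10.0), (0.2, -10.0)]}
```

Here EU(stay) = 4 but EU(move) = 0.8·10 + 0.2·(-10) = 6, so a rational agent moves despite the risk.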
Lecture 19: Learning from examples (18.1-18.4)
 Intro
 Review the "Learning Agent"
 agent is learning if it changes its performance, hopefully
for the better, on future tasks after obtaining observations
about the world.
 basic case: "from examples"...
given inputoutput pairs, learn function that predicts
outputs for new inputs.
 called "Inductive Learning"
 inductive inference learns something general from specific things
 learning handles lack of agent designer's knowledge about the
world, how it changes, or how to operate in it.
 Forms of Learning
 Factors affecting learning:
 Component to be improved
 Prior knowledge agent has
 Representation used for the data/observations
 Representation used for the Component
 Feedback available to learn from
 Components that might be learned include:
 direct mapping from state to actions
 inference of relevant properties of the world from
percept sequence
 information about the way the world evolves
 information about the results of possible actions
 the desirability of world states (utility)
 the desirability of actions
 goals describing classes of states to be achieved
 Component Representations include logic, and Bayesian networks.
 Much learning concerns factored data representations (vector of attribute/values)
 Feedback to learn from: three types of learning...
 unsupervised: learns patterns in input with no feedback (e.g., clustering)
 reinforcement: agent learns from rewards/punishments
which actions were good/bad
 supervised: agent gets input and is told the matching output
 problems: noise in data: incorrect or missing
 Supervised Learning
 "training set": inputoutput pairs (x_{i}, y_{i}),
generated by unknown function y = f(x)
 find function h (hypothesis) that approximates f
 "test set": some additional examples ≠ training set
 used to test h (i.e., can h(x) correctly predict y?)
 classification: discrete set of y values (e.g., diseases)
 Boolean classification: y=true or y=false (learn goal predicate)
 regression: y is a number
 hypothesis space: a set of functions that h belongs to
 consistent hypothesis: agrees with all the data
 Ockham's razor: prefer the simplest consistent hypothesis
e.g., prefer small decision trees
 Learning Decision Trees (by induction)
 decision tree representation
 trees can be understood by people
 decisions trees are good for some types of problems but not all
 decisions reached by a series of tests (path through tree)
 a node is a test of an attribute
 links from each node are labelled with each of the possible
attribute values
 leaf nodes are labelled with a y value (the output)
 as trees are built additional nodes are added below single
root node
 not all attributes need to be included
 there are many possible trees (most are inefficient)
 if useful, paths through trees can be rewritten as rules, or logical statements.
 inducing decision trees from examples
 typical input is a vector of x values and a single y value
x = { Sunny=very, Windy=moderate }, y = Sailing
x = { Sunny=moderate, Windy=none }, y = Hiking
 use greedy divideandconquer approach to learn trees
 grow one level of tree below each node, moving down the tree
 nodes are picked by their discriminating/sorting power
("important attributes")
i.e., splitting the data to maximize progress towards leaf nodes
 start at top with most important node, next level is a set of
decision tree learning problems with smaller sets of data
that were produced by the previous node's split.
 results
 reach leaf node with single y value if data is split perfectly
 run out of data but there are still attributes left to
use on that path, then we don't have an observation for that case
 if we use all the attributes on a path but still have
data, then there is noise in the data.
 learning curve: improvement in accuracy of learning
e.g., gradually increase training set size, and get increase
in proportion of test set correct (exponential)
 choosing attribute tests
 pick most important attribute at each step of tree learning
 how good are the subsets of the data produced by each
attribute
i.e., how well sorted
 use entropy: a measure of uncertainty
 a data subset with an equal mix of data leaves us uncertain
about the result
 want to reduce uncertainty  increase the amount of sorting
that has been done  "information gain"
 Gain: entropy of data set before using attribute,
minus entropy of data subsets after
using an attribute, is expected reduction in entropy
(information gain)
 check the Gain for each available attribute at that point in
the tree, and use the one with the greatest Gain.
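The entropy/Gain computation can be sketched directly over class counts; the 6-positive/6-negative split below is a made-up example.

```python
from math import log2

def entropy(counts):
    """Entropy in bits of a class distribution given as a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

def information_gain(parent_counts, child_counts):
    """Entropy before the split minus the weighted entropy of the subsets
    the attribute produces (the expected reduction in uncertainty)."""
    total = sum(parent_counts)
    remainder = sum(sum(c) / total * entropy(c) for c in child_counts)
    return entropy(parent_counts) - remainder

# 6 pos / 6 neg examples split into a pure subset [4,0] and a mixed one [2,6]
g = information_gain([6, 6], [[4, 0], [2, 6]])   # ≈ 0.459 bits
```

Decision-tree learning computes this Gain for each remaining attribute and splits on the largest.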
 generalization and overfitting
 overfitting: having more data tends to introduce more patterns in the data,
and the tree will try to accommodate that.
i.e., it overcommits, and learns too much (such as noise)
 decision tree pruning: eliminate nodes (leading to leaf nodes)
that are not relevant.
 likely to prune nodes that provide very small information gain
 significance test: use statistics to test whether that
deviation in the data is significantly different from no or normal
deviation
i.e., what are the chances that this could occur normally
 pruning reduces the decision tree learning's sensitivity to noise
 broadening the applicability
 need to handle
 missing data
 attributes with many possible values (weakens Gain test)
 continuous and integer valued attributes (infinite set of values)
:: use split points for node tests (e.g., Weight > 160)
 continuous valued output attributes: regression tree to predict output value
 Evaluating and Choosing the Best Hypothesis
 Intro
 stationarity assumption: probability distribution over
examples doesn't change over time.
 independent: each example is independent of previous examples
 identically distributed: each example has an identical prior probability distribution
 error rate of hypothesis h(x): proportion of mistakes it makes
 low error rate may still not predict well for other data
 crossvalidation: using the data in multiple ways to build and test
 holdout crossvalidation: randomly split data set into
training set and test set
 need large training set to learn well
 but...need large test set to test well
 kfold crossvalidation: divide data into k subsets; use
each subset to test; use average error to estimate the accuracy of
a tree trained on all data. k=10 is common.
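A sketch of k-fold cross-validation, with a trivial majority-class learner standing in for decision-tree learning; the dataset and helper names are illustrative.

```python
def majority(train):
    """A trivial 'learner': predict the most common label in the training set."""
    labels = [y for _, y in train]
    return max(set(labels), key=labels.count)

def err_rate(h, test):
    """Proportion of test examples the hypothesis (here a constant) gets wrong."""
    return sum(y != h for _, y in test) / len(test)

def k_fold_error(examples, learner, error_rate, k=10):
    """Each of the k folds is held out once for testing while the learner
    trains on the remaining k-1 folds; return the average test error."""
    folds = [examples[i::k] for i in range(k)]
    total = 0.0
    for i in range(k):
        test = folds[i]
        train = [ex for j, f in enumerate(folds) if j != i for ex in f]
        total += error_rate(learner(train), test)
    return total / k

# 15 positives and 5 negatives; each fold of 4 holds exactly one negative,
# so the majority learner errs on 1 of 4 test examples per fold
data = [(i, 1 if i < 15 else 0) for i in range(20)]
err = k_fold_error(data, majority, err_rate, k=5)
```

Every example is used for both training and testing, but never both at once within a fold.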
 Model selection: complexity vs. goodness of fit
 model selection: choosing the type of hypothesis to define a
space of things that can be learned. i.e., h comes from the space.
 optimization: getting the best h from the space
 size: an approximation of the complexity of the hypothesis
 e.g., linear function < quadratic function
 e.g., small decision tree < larger decision tree
 find best 'size' that balances underfitting and overfitting to
give best test set accuracy.
 wrapper: an algorithm to try to find the best size, that
takes a learning algorithm (e.g., decision tree learning) and some
examples
 it varies size, uses cross validation to learn error rate
 stops at lowest error, when h starts to overfit
 then learns with all data for a hyp of that size.
 From error rates to loss
 not all errors are created equal!
 better to get false +ves? (told you have disease when you don't)
 false -ves? (not told you have disease when you do)
 need to take that utility into account as well
 assume h(x) gives ŷ instead of y
 loss function: loss of utility by getting an error
L(x,y,ŷ) = Utility(result of using y given an input x)
- Utility(result of using ŷ given an input x)
 can use just L(y, ŷ)
 small loss is better (we want to minimize it)
 Loss functions
 Absolute value loss: L_{1}(y,ŷ) = |y - ŷ|
 Squared error loss: L_{2}(y,ŷ) = (y - ŷ)^{2}
 0/1 loss: L_{0/1}(y,ŷ) = 0 if y = ŷ, else 1
 generalized loss: taking prior probability distribution over
all I/O pairs into account
 empirical loss: for an h, assume data equally likely, sum
loss for each h(x)
 estimated best hypothesis: the h with the minimum empirical loss
 small-scale learning: problems with dozens to 1000s of examples
 largescale learning: millions of examples  restricted by computation
 Regularization
 explicitly penalizing complex hypotheses
 can search for hypotheses that minimize
empirical loss + complexity
Lecture 20: More learning (18.7-18.8)
 Artificial Neural Networks
 Intro
 neurons: brain cells
 neural networks (NNs): networks of simulated neurons (units)
 neuron "fires" when a linear combination of inputs exceeds
some threshold
 Neural network structures
 units: the nodes/units of a NN
 link: connections between nodes
 activation: the output from a node
 output of one node can be the input to another
 weight: links have weights w_{i,j} on them
 unit j takes weighted sum of all inputs w_{i,j} × a_{i}
 weighted sum is in_{j}
 bias weight: each node has a dummy input fixed to 1 with a weight on it
 an activation function g converts in_{j} to a_{j}
 perceptron: a unit with g as a hard threshold
 sigmoid perceptron: a unit with g as a softer threshold
 these are nonlinear activation functions
 feedforward network: connections are only towards the output from input
 recurrent network: allows loops (i.e., more complex, and powerful)
 layers: single layer has input to units and output from those units.
 hidden units: a layer of units that do not connect to inputs or outputs
 classification/categorization: usually as many outputs as classes
 Singlelayer feedforward neural networks
 known as "perceptron networks"
 activation function g determines training process
 error is y − h_{w}(x)
 as this does 0/1 classification both y and
h_{w}(x) can be 0 or 1.
 perceptron learning rule: assumes hard threshold, does weight
updates depending on error
w_{i} ← w_{i} + α(y − h_{w}(x)) × x_{i}
 logistic regression: uses softened threshold, does weight
updates depending on error
 h_{w}(x) = sigmoid function
applied to the weighted sum of the inputs
 w_{i} ← w_{i} + α(y − h_{w}(x)) × h_{w}(x)(1 − h_{w}(x)) × x_{i}
 function can be learned if it is linearly separable
i.e., it learns linear decision boundaries
OK = { and, or } Not OK = { xor }
 learning curve for perceptrons sometimes better than decision
trees, sometimes not.
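The two single-unit update rules above can be sketched as follows; the training data (the OR function, which is linearly separable) and the learning rate are invented for illustration, and x[0] is the dummy bias input fixed at 1.

```python
import math

def perceptron_update(w, x, y, alpha=0.1):
    # h_w(x) = hard threshold on the weighted sum (x[0] is the bias input)
    h = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
    return [wi + alpha * (y - h) * xi for wi, xi in zip(w, x)]

def logistic_update(w, x, y, alpha=0.1):
    # softened threshold: sigmoid of the weighted sum
    h = 1 / (1 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
    return [wi + alpha * (y - h) * h * (1 - h) * xi for wi, xi in zip(w, x)]

# learn OR (linearly separable, so the perceptron rule converges)
examples = [([1, 0, 0], 0), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = [0.0, 0.0, 0.0]
for _ in range(20):
    for x, y in examples:
        w = perceptron_update(w, x, y)
# after training, every OR example is classified correctly
```

XOR, by contrast, is not linearly separable, so no weight vector would ever classify all four of its examples correctly.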
 Multilayer feedforward neural networks
 has hidden units in a layer or layers
 network is a function h_{w}(x)
parameterized by weights w, where x is an input vector.
 output is expressed as a fn of inputs and weights (including
use of g)
 train using gradient descent lossminimization method
 neural network does nonlinear regression
 i.e., fitting a nonlinear fn to some data
 nonlinear as NN provides nested nonlinear threshold/activation fns.
 Learning in Multilayer neural networks
 goal output is y
 NN returns h_{w}(x)
 error vector at output is y − h_{w}(x)
 outputs may depend on all weights in the NN
 backpropagate error from output layer to hidden layers
 at output layer, update rule adjusts weights depending on error:
 Let Err_{k} be error of k^{th} element of error vector
 Define
Δ_{k} = Err_{k} × g'(in_{k})
where g' is the derivative of g, and in_{k} is the
sum of the inputs to unit k.
 update rule for the weight between hidden unit j and output unit k is
w_{j,k} ←
w_{j,k} + α × a_{j} × Δ_{k}
where
α is the learning rate (how much you want to
update the weight each time), and
a_{j} is the output from the hidden unit j.
 at hidden layer, update rule adjusts weights depending on the
amount of error for which the hidden layer unit might be responsible.
 the Δ_{k} values are divided according to
strength of connection between hidden node and all the
connected output nodes k.
 Define
Δ_{j}
= g'(in_{j}) ∑_{k} w_{j,k} Δ_{k}
where in_{j} is the sum of the inputs to hidden unit j,
the w_{j,k} are the weights from unit j to all the
output nodes to which it is connected,
and Δ_{k} is the error for each of those
nodes.
 update rule for the weight between inputs and hidden unit j is
w_{i,j} ←
w_{i,j} + α × a_{i} × Δ_{j}
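A minimal sketch of one backpropagation step for a small 2-input, 4-hidden-unit, 1-output sigmoid network, using the Δ_k and Δ_j rules above; the network shape, data, and learning rate are invented. Repeated updates on a single example should shrink its error.

```python
import math, random

def g(z):
    return 1 / (1 + math.exp(-z))                  # sigmoid activation

def forward(x, W1, W2):
    in_h = [sum(w * a for w, a in zip(row, [1] + x)) for row in W1]  # bias = 1
    a_h = [g(z) for z in in_h]
    in_o = sum(w * a for w, a in zip(W2, [1] + a_h))
    return a_h, g(in_o)

def backprop_step(x, y, W1, W2, alpha=0.5):
    a_h, a_o = forward(x, W1, W2)
    # output delta: Err_k * g'(in_k), with g' = g(1 - g) for the sigmoid
    delta_o = (y - a_o) * a_o * (1 - a_o)
    # hidden deltas: g'(in_j) * w_{j,k} * Delta_k (W2[0] is the bias weight)
    delta_h = [a * (1 - a) * W2[j + 1] * delta_o for j, a in enumerate(a_h)]
    W2 = [w + alpha * a * delta_o for w, a in zip(W2, [1] + a_h)]
    W1 = [[w + alpha * a * dj for w, a in zip(row, [1] + x)]
          for row, dj in zip(W1, delta_h)]
    return W1, W2

random.seed(0)
W1 = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(4)]
W2 = [random.uniform(-1, 1) for _ in range(5)]
x, y = [1.0, 0.0], 1.0
before = abs(y - forward(x, W1, W2)[1])
for _ in range(50):
    W1, W2 = backprop_step(x, y, W1, W2)
after = abs(y - forward(x, W1, W2)[1])
# repeated updates move the output toward y, so after < before
```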
 Learning in neural networks structures
 if use fully connected networks
 choices  how many hidden layers and their sizes.
 usually trial and error
 use cross validation technique to estimate error.
 Nonparametric Models (skim!)
 a parametric model uses a fixed number of parameters
(independent of the number of examples)
 nonparametric model can change with more data
 instancebased learning stores data as it arrives
 simple table: to compute h(x), find x in the table and return the corresponding y
 if x is not in the table, there's a problem.
 use knearest neighbors in the stored data
 take plurality vote of the neighbors as the answer.
 nearest: needs a distance metric
 use Manhattan distance or Euclidean distance between query
and data points
 works well in lowdimensional spaces, with lots of data
 kd trees: balanced binary tree with arbitrary number of dimensions
 split data at every dimension
 nearest neighbors is easy if query isn't near a boundary
 if it is you need to check on both sides of the split
 works well with up to 20 dimensions with millions of examples
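The k-nearest-neighbors lookup above can be sketched as a linear scan with Euclidean distance and a plurality vote (a kd-tree would replace the scan for large data); the toy dataset is invented.

```python
from collections import Counter

def knn_classify(query, data, k=3):
    """data: list of (point, label); returns the plurality label of the
    k stored points nearest to the query point."""
    dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query)) ** 0.5
    nearest = sorted(data, key=lambda pl: dist(pl[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

data = [((0, 0), 'a'), ((1, 0), 'a'), ((0, 1), 'a'),
        ((5, 5), 'b'), ((6, 5), 'b'), ((5, 6), 'b')]
print(knn_classify((1, 1), data))   # -> 'a' (all 3 nearest neighbors are 'a')
```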
Lecture 21: Knowledge in Learning (19.119.3)
 Logical formulation of learning
 ML using prior knowledge of the world to learn hypothesis
 put Hypothesis (h), Examples and matching Classifications (x's
and y's) as set of logical sentences
 given new example (in logic) use h to infer classification
 Examples and hypotheses
 examples in terms of values for Attributes
 example x_{1}: Alternate=Yes, Bar=No, Fri=No, Hungry=Yes, ...
 i.e., Alternate(X_{1}) ∧ ¬Bar(X_{1}) ∧
¬Fri/Sat(X_{1}) ∧ Hungry(X_{1})...
 classification (Goal predicate)  WillWait(X_{1}) or
¬WillWait(X_{1})
 each hyp h_{j} is in form 
∀x Goal(x) ⇔ C_{j}
where candidate definition C_{j} is a logical expression
 C_{j} for a decision tree can be expressed as a
logical expression for each path (using ∧) linked by ∨
 h_{j} predicts that the set of examples that satisfies C_{j}
are examples of Goal(x)
 Those examples are the "extension" of the goal
 Hyp space H = {h_{1}, ..., h_{n}}
 Learning alg believes h_{1} ∨ h_{2} ∨ ... ∨ h_{n}
 if h_{i} not consistent with new example it can be removed
 can be false negative for h_{i}
h falsely says that it should be negative, but it is in fact positive
 can be false positive for h_{i}
h falsely says that it should be positive, but it is in fact negative
 note that hyp space H is vast, so this is not practical via
theorem proving.
 Currentbesthypothesis search
 maintain single h and adjust it as new examples arrive
 for each h_{i} keep all examples that it classifies (+ve)
(the extension)
 those examples define the hypothesis
 if new example is false negative  include in the extension
("generalization")
 if new example is false positive  remove from the extension
("specialization")
 note that when doing generalization or specialization you
need to check that the result is compatible with previously seen
examples.
 in fact what is needed is for h_{i} to be modified to
reflect generalization or specialization.
 for generalization h_{i} needs to become less precise
(drop conditions from C_{i})
 for specialization h_{i} needs to become more precise
(add conditions to C_{i})
 at each step there are multiple possibilities, not all of
which are good, but a choice must be made, so backtracking will be needed.
 at each step checking that the result is compatible with previously seen
examples is expensive.
 i.e., with large number of examples and large hyp space H it
isn't practical.
 Leastcommitment search (Version space)
 leastcommitment: make least change necessary
 keep around summary of all hyps consistent with data seen so far
 new example may alter summary slightly to reduce it
 "version space": only those hyps still consistent with data (after reduction)
 incremental learning
 version space defined by upper boundary G (general) and lower
boundary S (specific)
 *** do simple example **
 G starts with True (i.e., the most general hypothesis)
 S starts with False (i.e., the most specific hypothesis)
 S and G get updated by +ve and ve examples
 any hyp between S and G must agree with all the examples
 updates
 False positive for S_{i}  S_{i} is
too general, so throw it out of S
 False negative for S_{i}  S_{i} is
too specific, so replace it by all of its immediate
generalizations (i.e., move that portion of S up towards G)
 False positive for G_{i}  G_{i} is
too general, so replace it by all of its immediate
specializations (i.e., move that portion of G down towards S)
 False negative for G_{i}  G_{i} is
too specific, so throw it out of G
 results
 one hyp remains (hooray!)
 S or G becomes empty (i.e., no consistent h for training set)
 run out of examples with several h remaining
 Version space approach is probably not practical in many
situations (especially with noise), but it's a great model
 Knowledge in learning
 ...skim this section...
 moral: background knowledge can allow faster learning
 Note Explanation Based Learning (EBL)
 Hypothesis: what is being learned (h)
 Descriptions: all the examples (x's)
 Classifications: all the classifications (y's)
 Background: existing relevant knowledge
Hypothesis ∧ Descriptions ⊨ Classifications
Background ⊨ Hypothesis
 Explanation based learning (EBL)
 Intro
 converts general "firstprinciples" theories to useful
specialpurpose knowledge
 allows reasoning speedup in the future
 takes the solution to a specific problem and learns a general
method that covers similar problems.
 more than just memoization (the specific case is learned)
 it works by "explaining" a solution
 Extracting general rules from examples
 construct proof for problem (e.g., using backwardchaining theorem prover)
 e.g., prove Derivative(X^{2}, X) = 2X
 e.g., prove Simplify(1 × (0 + X), w)
i.e., can it be simplified?
 construct two proof trees simultaneously
 original proof
 the same proof with all constants replaced by variables
i.e., a generalized proof tree
 extract general rule from generalized proof tree
 EBL steps
 construct proof of example using background knowledge
 also construct parallel proof with variables
 construct new rule with lhs including leaves of proof tree ⇒
rhs as example with variables and bindings applied.
i.e., lhs terms are the conditions that the background
knowledge shows to be true, which need to be true to make
this inference again in the future
 drop any conditions on lhs that are true regardless of values
of variables in rhs
 result is a new rule that summarizes the result of applying
background knowledge
ArithmeticUnknown(z) ⇒ Simplify(1 × (0 + z), z)
 Improving efficiency
 can also extract more general rules from the generalized
proof tree by using nonleaf nodes
 tradeoff: general rules apply to more cases, but don't find
answer as directly
 tradeoff: adding lots of specific rules makes each one apply
directly to a specific set of situations, but finding the
right one becomes harder (increased branching factor!)
 tradeoff: check whether parts of each new rule are easy to
solve, but this makes learning time longer.
 tradeoff: "easy to solve" varies as rules are added.
Lecture 22: Reinforcement Learning (21.121.2)
 Introduction
 "reward" or "reinforcement": feedback for action
 Markov Decision Processes: quick MDP overview!
 reinforcement learning: based on rewards
 simple, fully observable environments, but with probabilistic action outcomes
 possible use by different agent types
 utilitybased agent: learns utility function on states
 uses it to select actions that maximize expected outcome utility
 Qlearning agent: learns actionutility function (Qfunction)
 the expected utility of taking a given action in a given state
 reflex agent: learns a policy that maps states directly to actions
 Model based vs. Model free
 Model based approach to RL
 learn MDP model: transitions and rewards (or approximation)
 Model free approach to RL
 do not learn the model
 Passive Reinforcement Learning
 "passive learning": agent's policy is fixed, learn utilities of states
 statebased representation, fully observable environment
 given a policy
 goal: learn how good the policy π is
i.e., learn utility function U^{π}(s)
 does not know transition model in advance
 does not know reward function in advance
 agent makes "trials" using the policy
 each trial runs to the terminal state
 the agent's percepts supply the current state s and the
reward for that state.
 use reward info to learn the expected utility for each state s
Direct utility estimation
 rewardtogo: expected total reward from that state onwards
to terminal state
 after each trial calculate rewardtogo for each state, and
make expected utility for that state the running average.
 use rewardtogo as direct evidence of actual expected
utility for state
 need many trials to get right answer (converges slowly).
 however, utilities of states are not independent, as...
The utility of each state equals its own reward plus the
expected utility of its successor states
 They obey Bellman's equations
U^{π}(s) = R(s) + γ Σ_{s'} P(s' | s, π(s)) U^{π}(s')
 i.e., U^{π}(s) depends on U^{π}(s'),
the next state's utility
Adaptive Dynamic Programming
 does trials as before
 learns transition probabilities from observations
 how often do you get to s' from s by doing a?
 learns reward function R(s) from observations
 in new state, just store the reward given
 plugs values into Bellman equations
 solve for utilities
Temporal Difference (TD) Learning
 make computation easier and obtain an approximate utility
 just adjust utility of state based only on the observed successor
 don't need transition model, as transitions are observed.
 e.g., after some learning, calculate
U^{π}(1,3) = R(1,3) + γU^{π}(2,3)
where (2,3) is the observed successor
 if that calculated value ≠ current utility value for
U^{π}(1,3) then update it in the right direction.
 update using the TD Update Rule for a transition from s to s'
U^{π}(s) ← U^{π}(s) + α( R(s) + γU^{π}(s') − U^{π}(s) )
where α = learning rate, γ = discount
 R(s) + γU^{π}(s') is approx/noisy utility measure
 make learning rate gradually decrease with the number of samples
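The TD update above can be sketched on a tiny chain of states A → B → C; the rewards and the fixed policy are invented, C is terminal (its utility is held at its reward), and for simplicity the learning rate is kept constant rather than decreasing.

```python
def td_update(U, s, s2, R, alpha=0.1, gamma=1.0):
    # TD rule: U(s) <- U(s) + alpha * ( R(s) + gamma*U(s') - U(s) )
    U[s] = U[s] + alpha * (R[s] + gamma * U[s2] - U[s])

R = {'A': -0.04, 'B': -0.04, 'C': 1.0}    # C is the terminal state
U = {'A': 0.0, 'B': 0.0, 'C': 1.0}
for _ in range(100):                       # many trials of A -> B -> C
    td_update(U, 'A', 'B', R)
    td_update(U, 'B', 'C', R)
# U(B) converges toward R(B) + U(C) = 0.96, and U(A) toward 0.92
```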
Lecture 23: Natural Language Processing (22.122.4)
 Intro
 knowledge acquisition: need language understanding for
getting new knowledge
 Language models
 language model: predict the probability distribution of language
 language: set of strings of characters
 grammar: rules that define legal structure (syntax)
 semantics: assigns meaning
 natural language: English, Spanish, ...
 word combinations have probabilities (some rare; some sorta OK)
 ambiguity: probability distribution over possible meanings
 "He saw her duck"
 language is huge so models are approximate
Ngram character models
 simple language model: probability distribution over characters
 probability of sequence of N characters P( c_{1:N} )
 e.g., P("the") = 0.027
 ngram: sequence of length n
 (bigram, trigram samples)
 Google books Ngram Viewer
 ngram is a Markov chain of order n−1
 P(c_{i}) depends on the n−1 immediately preceding
characters (e.g., previous 2 for a trigram)
 i.e., P(c_{1:N})
= Π_{i=1..N} P(c_{i} | c_{i−2:i−1})
 extract ngram probabilities from a corpus (large body of text)
 language identification: given text, what language is it written in ?
example
 trigram model of each language (i.e., probabilities)
 i.e., have P(text | language)
 want P(language | text)
 = P(text | language)P(language)/P(text), and drop P(text)
 P(language) is dominated by the P(text | language) term in the
calculation, so it can be approximate and still OK
argmax_{l} P(l) Π_{i=1..N} P(c_{i} | c_{i−2:i−1}, l)
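The language-identification example above can be sketched with character-trigram counts, a uniform prior over languages, and crude add-one smoothing for unseen trigrams. The two one-line "corpora" are invented and far too small for real use.

```python
from collections import Counter

def trigram_model(text):
    """Counts of all character trigrams in the corpus text."""
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def score(text, model, total):
    """Product of smoothed P(c_i | c_{i-2:i-1}) estimates for the query."""
    p = 1.0
    for i in range(len(text) - 2):
        p *= (model[text[i:i + 3]] + 1) / (total + 1)   # add-one smoothing
    return p

corpora = {'en': "the cat sat on the mat with the hat",
           'es': "el gato se sienta en la alfombra"}
models = {l: trigram_model(t) for l, t in corpora.items()}
totals = {l: sum(m.values()) for l, m in models.items()}

def identify(text):
    # uniform prior P(l), so argmax over the likelihood term alone
    return max(models, key=lambda l: score(text, models[l], totals[l]))

print(identify("the hat"))   # -> 'en'
```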
Smoothing ngram models
 one corpus isn't the same as another, so an ngram model is approximate
 things claimed to have 0 probability are actually possible
 smoothing: adjust zero probabilities up, and others slightly
down (sum to 1)
Ngram word models
 ngrams for words
 probability of word sequence
 3gram word model sentences are starting to look somewhat reasonable
 Text Classification
 categorization: given text what type is it?
 e.g., spam, positive/negative movie review, ...
 could use supervised learning
 "features" for category: word level, character level
 keep top 100 or so features
 can use supervised learning with features (e.g., decision tree)
 train ngram word model for ¬spam and another for spam.
 P(category | message) ∝ P(message | category)P(category)
by Bayes rule and ignoring P(message)
 pick the larger probability: P(¬spam | message) vs. P(spam | message)
 can use data compression for classification
 e.g., add new msg to spam and compress, add same msg to
¬spam and compress, the greatest relative reduction
indicates category!
 Information Retrieval (IR)
 task of finding relevant documents
 needs
 corpus of documents
 query in query language
 result set (possibly relevant documents)
 presentation of result set
 Boolean keyword model
 query language with AND/OR/NOT
 look in document for keywords
 IR scoring functions: query returns a score for a document
 high score = high relevance
 TF = frequency of a word in a document
 IDF = inverse document frequency of a word
 if a word appears in most documents it has less importance
 DF = the number of documents that contain a word
 use these to return a score for a document and some query words.
 Precision = proportion of result set that are actually relevant
 Recall = proportion of all relevant documents in corpus that
are returned in the result set.
 can make tradeoffs between P and R
 tweaks include adjusting case (car = CAR = Car);
stemming (run = runs = running); synonyms (sofa = couch)
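The TF and IDF quantities above combine into a simple document score; this sketch uses IDF = log(N/DF), one common choice rather than the only one, over an invented mini-corpus. A word appearing in every document gets IDF log(1) = 0, i.e., no importance.

```python
import math

docs = ["the cat sat", "the dog sat", "the cat ate the cat food"]

def tfidf_score(query_words, doc, docs):
    """Score a document for the query words using TF * log(N / DF)."""
    words = doc.split()
    score = 0.0
    for q in query_words:
        tf = words.count(q)                                # term frequency
        df = sum(1 for d in docs if q in d.split())        # document frequency
        if df:
            score += tf * math.log(len(docs) / df)
    return score

scores = [tfidf_score(["cat"], d, docs) for d in docs]
print(scores.index(max(scores)))   # -> 2 (that doc mentions "cat" twice)
```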
 PageRank developed by Google
 PR(p) depends on PR of all pages that link to page p, and
the count of number of links from each of the pages that link to p.
i.e., depends on
Σ_{i}( PR(in_{i})/C(in_{i}) )
 the HITS algorithm first gets pages that satisfy query,
then does a similar sort of analysis
 Finds Hubs and Authorities
 e.g., authority pages have many relevant pages pointing to them.
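A sketch of PageRank iteration on an invented 3-page link graph. The core term is the sum given above, Σ_i PR(in_i)/C(in_i); the (1−d)/N damping term is the standard extra piece that keeps rank from draining out of the graph.

```python
links = {'A': ['B', 'C'],   # A links to B and C
         'B': ['C'],
         'C': ['A']}

def pagerank(links, d=0.85, iters=50):
    pages = list(links)
    pr = {p: 1 / len(pages) for p in pages}     # start uniform
    for _ in range(iters):
        pr = {p: (1 - d) / len(pages)
                 + d * sum(pr[q] / len(links[q])        # PR(in_i) / C(in_i)
                           for q in pages if p in links[q])
              for p in pages}
    return pr

pr = pagerank(links)
# C gets the most incoming link weight (all of B's single link, half of A's)
print(max(pr, key=pr.get))   # -> 'C'
```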
 Question answering: query is a question
 been around for a while!
D.C.Brown (1974) A survey and analysis of question answering systems,
M.Sc. Thesis, University of Kent, Canterbury, England.
 Can use standard question types
 Convert questions into standard type, then into web search query.
 Selections of text retrieved are analysed.
 Uses knowledge about what type of answer is expected
e.g., who vs. how many expects name vs. number
(used in Watson)
 Information extraction
 Acquire knowledge by skimming text and looking for objects & relationships
e.g., extract addresses
 Approaches:
 Finitestate automata
 Probabilistic models (skip this)
 Conditional random fields (skip this)
 Ontology extraction
 Automated template construction
 Machine reading
 Finitestate automata
 assume text is description of single thing
 extract attributes (e.g., Manufacturer, Model, Price)
 define "template" for each attribute
 template defines as finitestate automata (e.g., regular expression)
 regex  can define sequence, repetition, optional items
 template may have test for pre and post context
e.g., price is 100 dollars
 finitestate automata can be cascaded (sequence)
 modularizes the knowledge
 works very well with text in restricted domains
 1st tokenize
 2nd detect complex words (e.g., company names)
 3rd group words and tag (e.g., noun phrases)
 4th handle complex phrases
 5th merge related structures
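The attribute-template idea above can be sketched as a regular expression for a price attribute, with pre-context ("is"/"costs"/"for") and post-context (a currency word); the pattern and sample text are invented.

```python
import re

# template for the Price attribute: pre-context, the amount, post-context
price_template = re.compile(r'(?:is|costs|for)\s+(\d+(?:\.\d{2})?)\s+dollars')

text = "The QX-17 is a nice camera. It costs 100 dollars at most stores."
match = price_template.search(text)
print(match.group(1))   # -> '100'
```

Several such templates, one per attribute, can then be cascaded over the same text, which modularizes the extraction knowledge.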
 Ontology extraction
 build ontology of facts from large corpus
 precision is vital
 use very general templates
 templates that match factgiving syntax
 Automated template construction
 looking for templates that reveal particular relation
e.g., subcategory; authortitle; etc.
 start with some examples in the form of simple templates
 use those to retrieve text
 infer other templates from the text
 use context around the match to add to new templates
(e.g., "type of"; "wrote")
 Machine reading
 needs to learn many templates
 start with general syntactic templates
 learns underlying probabilities
Lecture 24: Natural Language for Communication (23)
 Communication
 language is intended to send messages
 syntax = structure
 semantics = meaning
 pragmatics = practical issues affecting meaning that relate to context
 language is too vast and complex for trigrams to be the only tool
 Phrase Structure Grammars
 need rules that define the legal language  a grammar
 part of speech (lexical category)  Noun, Verb, Article, Pronoun, etc.
 syntactic categories  noun phrase (NP), verb phrase (VP)
 combinations form phrase structure of sentence  e.g., NP VP
 Nonterminals  Article, Noun, NP, ...
 Terminals  "the", "wumpus", ...
 parsing  finding the structure of a sentence using grammar
usually tree form
[S [NP [Article "every"] [Noun "wumpus"]] [VP [Verb "smells"]]]
 generation  using the grammar rules to produce sentences
 simple grammars can overgenerate (e.g., "me go home")
 need rules that define the legal language  a grammar
 the form of the rules alter the complexity of the languages
that the grammar can parse/generate (Chomsky Hierarchy)
 recursively enumerable (unrestricted rules)
 contextsensitive (can apply a rule in a specific context)
 contextfree (used in any context)
 regular (highly restricted)
 contextfree grammar
S → NP VP
NP → Article Noun
...
 probabilistic contextfree grammar (PCFG)
S → NP VP [0.90]
NP → Article Noun [0.25]
...
 probability assigned to every string
 lexicon  words with lexical category and probabilities
 probability of sentence is product of probabilities of rules and words
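The last point above can be sketched directly: the probability of a parse is the product of the probabilities of the grammar rules and lexicon entries it uses. The rule and lexicon probabilities here are invented toy numbers.

```python
rules = {('S', ('NP', 'VP')): 0.90,
         ('NP', ('Article', 'Noun')): 0.25,
         ('VP', ('Verb',)): 0.40}
lexicon = {('Article', 'every'): 0.05,
           ('Noun', 'wumpus'): 0.15,
           ('Verb', 'smells'): 0.10}

def parse_probability(rule_seq, word_seq):
    """Product of rule probabilities and lexicon probabilities used."""
    p = 1.0
    for r in rule_seq:
        p *= rules[r]
    for w in word_seq:
        p *= lexicon[w]
    return p

# "every wumpus smells": S -> NP VP, NP -> Article Noun, VP -> Verb
p = parse_probability(
        [('S', ('NP', 'VP')), ('NP', ('Article', 'Noun')), ('VP', ('Verb',))],
        [('Article', 'every'), ('Noun', 'wumpus'), ('Verb', 'smells')])
print(p)   # 0.9 * 0.25 * 0.4 * 0.05 * 0.15 * 0.1
```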
 Syntactic Analysis (Parsing)
 Parsing: using grammar to find phrase structure
 top down: start with S and work down to words
 bottom up: start with words and work up to S
 use memory (chart) to keep track of successful parses of
parts of sentence to prevent having to reparse them again
later
 syntactic ambiguity: multiple ways to parse a sentence
"he eats grass and leaves" ("leaves" can be a noun or a verb)
 look for best parse  related to probability
 could use A* with cost 1/p of parse found so far
 learning probabilities for PCFGs
 learn grammar from data
 large corpus of correctly parsed sentences (treebank)
 extract rules from parses and learn count frequencies
 Augmented grammars and Semantic Interpretation
 lexicalized PCFGs
 probabilities depend on relationships between words that
rule includes
"eat a banana" vs. "eat a bandana"
 augmented PCFG includes syntactic structure as well as word relationships
 'head' of phrase is most important word (e.g., v = "eat", n = "banana")
 VP(v) → Verb(v) NP(n) [P(v, n)]
 P(v, n) depends on v and n.
 P(eat, bandana) is very low
 use smoothing for very low probabilities so that they aren't zero
 can learn P(v, n) from treebank
 grammar rules can be expressed in logic
 parsing can be expressed as logical inference
 not really practical for unrestricted parsing
 could be used for language generation
 Case agreement and subjectverb agreement
 there are a variety of additional linguistic rules that need to
be expressed somehow in order to parse/generate correctly.
 getting them all into the grammar could mean adding lots of extra nonterminals
 e.g., subjective case ("I"), objective case ("me")
 e.g., subjectverb agreement ("I smell bad", "he smells bad", "they smell bad")
 Instead, add parameters to the nonterminals
NP(c, pn, head)
c = case, pn = person/number (e.g., 1st person singular), head = head word of phrase
 Semantic interpretation
 compositional semantics: semantics of phrase depends on semantics of subphrases
i.e., the meaning can be built up during bottomup parse
 syntax rules annotated with semantic functions
 meanings carried up the parse tree and composed
 "John loves Mary" → Loves(John, Mary)
 meaning of "loves" is the lambda expression
λy λx Loves(x,y)
 "Mary" gets bound to y, on one branch of parse tree.
 Higher up the parse tree, "John" gets bound to x.
 Pragmatics  influence of current situation on the meaning
 Indexicals: "I am in Worcester today"  "I", "today"
 Speech Act: determining speaker's intent
"Could you close the door?" ("yes, I could")
 could even require input from perception
"Give me that book"
 Ambiguity!
 "Squad helps dog bite victim"
 Almost every utterance is ambiguous.
 Alternative meanings get pruned out by native speakers.
 Lexical ambiguity: "bank" two kinds of noun, a verb, and an adjective
 Syntactic ambiguity: "I saw the flower in the park"
seeing in the park, flower in the park
 Metaphor: "All the world's a stage" (no it isn't)
 Disambiguation: needs knowledge
 World model: knowledge of what is likely in the world
 Mental model: speaker's belief and hearer's belief
 Language model: likelihood of certain string of words
 Acoustic model: concerns sequences of sounds
 Machine Translation
 translate source to target (e.g., English to French)
 perfect translation requires complete understanding of the text
 Alternative meanings get pruned out by native speakers.
→ Alternatív jelentések kap metszett ki anyanyelvű.
→ Los informes alternativos se cortan fuera a hablar.
→ Alternative reports are cut out to speak.
 other languages have different words for different
situations where English may have one (and v.v.)
 Levels of translation:
 English → Interlingua → French
 English Semantics → French Semantics
 English Syntax → French Syntax
 English words → French Words
 Statistical machine translation
 use large bilingual corpus of translations to train probabilistic model
 f* = argmax_{f} P(f | e) = argmax_{f} P(e | f)P(f)
 P(e | f) is a translation model (but P(f | e) can be learned directly)
 P(f) is a language model for French
 Phrase approach: find the best French phrase for each short English phrase
 P(f_{i} | e_{i}) are known
 sequence of French phrases is 'distorted' to a new order
(for better French)
 P(d_{i}) distortion probabilities are known (learned)
 P(f, d | e) = Π_{i} P(f_{i} | e_{i}) P(d_{i})
 use a search to find best f for the e.
 Speech recognition
 Speech recognition: identify sequence of spoken words
 many problems...
 Segmentation: no pauses between spoken words
 Coarticulation: adjacent sounds affect each other
 Homophones: to, too, two.
 Use vector of features from audio signal to represent the speech
 argmax P(word | sound) = argmax P(sound | word) P(word)
for some time period
 P(sound | word) is the acoustic model, the sounds of words
 P(word) is the language model (for each utterance)
 Markov assumption: the current state Word_{t} depends on
a fixed number of previous states.
 Acoustic Model
 sound waves → A-to-D converter → sampling rate
 quantization factor: precision of each measurement (8-12 bits)
 phones: different speech sounds (about 100)
 phoneme: smallest unit of sound with a distinct meaning for
a language (e.g., pill vs. kill)
 kit vs. skill: the K is two different phones but one phoneme
 frames: overlapping time slices through signal (e.g., 10 ms)
 vector of discrete features for each frame (e.g., energy at different frequencies)
 phone model
 transition probabilities between parts of a phone
 Form hidden Markov Model
 parts have expected features
 parts are onset, middle, end
 could take 5-10 frames as input and recognize, e.g., the phone [m]
 pronunciation model
 transition probabilities between phones
 e.g., [ t ow m aa t ow ]
 can augment to show dialect variation and coarticulation
 [t] [ow] vs. [t] [ah] at the start of "tomato"
 Language Model
 based on corpus of taskspecific text
 use transcripts of spoken interactions (e.g., airline reservations)
 include all taskspecific vocabulary
 have voice interface ask specific questions to constrain user input
 Building a Speech Recognizer
 Components:
 high quality microphone
 low background noise
 signal processing algorithms
 features used
 phone models
 word pronunciation models
 language model
 phone models & word pronunciation models often hand developed
 probabilities come from speech corpus
 models can now be learned automatically
 performance error less than 1% for limited topics
 up to 10-20% error with larger vocabularies
 task specific interaction lowers error
Lecture 25: Perception
 Intro
 Perception: interpreting response of sensors
 vision, hearing, touch  plus radio, GPS, infrared, etc
 sensor model: sensor (S) provides evidence about the
environment (E), i.e., P(E | S)
 object model: describes objects in the world (e.g., 3D geometry)
 rendering model: how a stimulus is produced from the world
(e.g., lighting)
 lots of ambiguity in vision: some managed by using prior knowledge
 video camera may deliver 10 GB per minute
 i.e., what to use, what to ignore?
 feature extraction: simple computations applied to sensor observations
 recognition: making key distinctions between objects, perhaps
labelling them
 reconstruction: build geometric model of world from image(s)
 Image formation
 imaging distorts the appearance of objects (e.g.,
perspective, foreshortening) *1*
 scene → sensor → 2D image
 pixels: smallest units of image
 image formed at the image plane (e.g., via pinhole camera) *2*
 f is distance from pinhole to image plane
 (x,y) is point on image plane
 (X,Y,Z) is location in scene
 x = fX/Z, y = fY/Z
 image is inverted updown & leftright
 larger Z, smaller x & y
 parallel lines converge in the image at vanishing point
 note the importance of Z: if you know the rest, you can
calculate Z!
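The pinhole projection equations above, x = fX/Z and y = fY/Z, can be sketched directly; the numbers are invented, and the up-down/left-right inversion is ignored (image-plane coordinates are kept upright).

```python
def project(X, Y, Z, f):
    """Perspective projection of scene point (X, Y, Z), focal distance f."""
    return (f * X / Z, f * Y / Z)

f = 0.05                        # e.g., 50 mm pinhole-to-image-plane distance
near = project(2.0, 1.0, 4.0, f)
far = project(2.0, 1.0, 8.0, f)
print(near, far)   # doubling Z halves the image coordinates
```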
 Lens Systems
 lens gathers more light *3*
 have limited depth of field
i.e., can 'focus' light from a limited range of Z values
 outside that range will give unsharp image
 Scaled orthographic projection
 if points on object have very limited Z variation then
scaling factor f/Z (in fX/Z) is effectively a constant s
 i.e., x = sX, y = sY
 Light and Shading
 brightness of image depends on brightness of patch of
surface that projects to the pixel.
 main causes of varying brightness:
 overall intensity of light
 reflecting more or less of the light
 shading due to not facing the light as much
 diffuse reflection: light evenly scattered
i.e., brightness doesn't depend on viewing direction
 specular reflection: brightness depends on viewing direction
 specularities: small patches where there's specular reflection *4*
 default assumption is distant point light source
 amount of light at surface patch depends on angle between
the normal to the patch and the illumination direction. *5*
 diffuse surface patch reflects some fraction of light
 diffuse albedo (e.g., white paper has 0.90)
 Lambert's cosine law for brightness of diffuse patch
I = ρI_{0}cosθ
where ρ is diffuse albedo,
I_{0} is intensity of light source,
θ is angle between light source
direction and surface normal.
 note that lighting provides surface information (due to θ)
 surface with no light is in shadow
 interreflections: prevent shadows from being completely black
 ambient illumination: from interreflections
 Color
 (or, using my trigram system, Colour)
 energy at different wavelengths (spectral energy density)
 humans see red, green, blue
(dogs)
 principle of trichromacy: by mixing three colors humans can
be fooled into seeing the original color (e.g., TV)
 model light source with different R/G/B intensities
 model surfaces with different albedos for R/G/B
 Early imageprocessing operations
 early: reducing the amount of data, starting interpretation
into compact representation
 early: usually local operation (rely on small part of the image)
 early: often in parallel
 edge detection
 straightlines or curves in image
 significant change in brightness
 different kinds of edges (types detected later) *6*
 depth discontinuities (object to background)
 surface orientation discontinuities (edge of object)
 reflectance discontinuities (change of surface material)
 illumination discontinuities (shadows)
 in 1D brightness is I(x)
 edge is sharp change in brightness *7*
 detect change by large change in derivative I'(x)
 noise may give this, so smooth/blur first  (I * Blur)'
 Blur = Gaussian filter G_{σ}
 (I * Blur)' = (I * G_{σ})' = I * G_{σ}'
 convolution of I and G_{σ}'
 σ is the standard deviation  small blurs less
 corresponds to replacing each pixel by avg values of those around
 giving closer ones more weight and further away less weight.
 think of it as a small operator that scans across the image
 peaks (max of large gradient) in processed image correspond
to edges *8*
 similar in 2D  also interested in edge orientation θ(x,y)
 link edge points that are related by orientation
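A 1D sketch of the edge detector above: convolve the brightness signal I with the derivative of a Gaussian, (I * G_σ'), then take the peak of the response magnitude as the edge location. The step signal, σ, and kernel radius are invented.

```python
import math

def gaussian_deriv_kernel(sigma, radius):
    # G_sigma'(u) = -u / (sigma^3 sqrt(2 pi)) * exp(-u^2 / (2 sigma^2))
    return [-u / (sigma ** 3 * math.sqrt(2 * math.pi))
            * math.exp(-u * u / (2 * sigma ** 2))
            for u in range(-radius, radius + 1)]

def convolve(signal, kernel):
    # response[i] = sum_u kernel(u) * signal[i - u], clamping at the ends
    r = len(kernel) // 2
    return [sum(kernel[r + u] * signal[min(max(i - u, 0), len(signal) - 1)]
                for u in range(-r, r + 1))
            for i in range(len(signal))]

I = [10] * 10 + [200] * 10          # a step edge between index 9 and 10
response = convolve(I, gaussian_deriv_kernel(sigma=1.0, radius=3))
edge = max(range(len(response)), key=lambda i: abs(response[i]))
print(edge)   # peak response lands at the brightness step
```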
 texture analysis
 spatially repeating pattern on surface that can be detected visually
 e.g., grass, pebbles
 use multipixel patch  characterize patch by histogram of
pixel (edge) orientations
 histogram changes in an image area suggest change in object
 orientations largely illumination invariant
 optical flow
 direction and speed of motion of object in the image *10*
 object or camera moving between frames of video
 rate of flow can indicate distance, and show actions
 need corresponding point between two images (2 frames)
 select image patch at (x_{0}, y_{0}) at time t_{0}
 compare patch with places around that point in second image
at time t_{0}+D_{t}
at (x_{0}+D_{x}, y_{0}+D_{y})
 minimize the measure of Sum of Squared Differences
i.e., find best (D_{x}, D_{y})
 optical flow at (x_{0}, y_{0}) is
(v_{x}, v_{y})
= (D_{x}/D_{t}, D_{y}/D_{t})
 there needs to be some texture for this to work
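The SSD search above can be sketched as a brute-force loop over candidate displacements. The frame contents, patch size, and search range below are illustrative assumptions.

```python
# Optical flow for one patch: find the displacement (Dx, Dy) that
# minimizes the Sum of Squared Differences (SSD) between the patch
# at (x0, y0) in frame0 and displaced patches in frame1.
import numpy as np

def best_displacement(frame0, frame1, x0, y0, half=2, search=4):
    patch = frame0[y0-half:y0+half+1, x0-half:x0+half+1]
    best, best_ssd = (0, 0), np.inf
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cand = frame1[y0+dy-half:y0+dy+half+1, x0+dx-half:x0+dx+half+1]
            ssd = np.sum((patch - cand) ** 2)
            if ssd < best_ssd:
                best, best_ssd = (dx, dy), ssd
    return best

# Textured image shifted by (Dx, Dy) = (3, 1) between the two frames.
rng = np.random.default_rng(1)
frame0 = rng.random((30, 30))
frame1 = np.roll(np.roll(frame0, 1, axis=0), 3, axis=1)
dx, dy = best_displacement(frame0, frame1, 15, 15)
print(dx, dy)  # recovers the true shift: 3 1
```

The flow vector is then (v_x, v_y) = (D_x/D_t, D_y/D_t) for frame interval D_t. Note the random texture is what makes the minimum unique: on a uniform region every displacement gives the same SSD.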
 Segmentation of Images
 break image into regions of similar pixels *11*
 regions often indicate edges of objects
 can either detect region boundaries, or regions themselves
 detect region boundaries: train a classifier based on
brightness, color and texture
estimates P_{b}(x,y,θ):
the probability of a boundary at (x,y) at angle θ
 however, may not form closed curves
 Alternative approach: cluster pixels based on brightness, color and texture
 maximize similarity of pixels in cluster, and maximize
difference between clusters
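The clustering alternative can be sketched with a basic k-means on pixel brightness. Using two clusters, brightness only (no color or texture), and a synthetic two-region image are illustrative simplifications.

```python
# Segment an image by clustering pixel brightness with k-means:
# pixels in a cluster are similar to each other, and the cluster
# means are pushed apart by the alternating assign/update steps.
import numpy as np

def kmeans_1d(values, k=2, iters=20):
    """Basic k-means on scalar pixel values; returns a label per pixel."""
    centers = np.linspace(values.min(), values.max(), k)
    for _ in range(iters):
        labels = np.argmin(np.abs(values[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = values[labels == j].mean()
    return labels

# Image with a dark left half and a bright right half (plus noise).
rng = np.random.default_rng(2)
img = np.hstack([rng.normal(50, 5, (8, 8)), rng.normal(200, 5, (8, 8))])
labels = kmeans_1d(img.ravel()).reshape(img.shape)
print(labels[:, :8].max(), labels[:, 8:].min())  # each half gets one label: 0 1
```

As the notes warn for the boundary-detector approach, nothing here forces the recovered regions to have clean closed boundaries; clustering sidesteps that by labeling every pixel directly.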
 Object recognition by appearance
 appearance: what object looks like
 simple/consistent objects: just test for distinctive features in the image
 e.g., works quite well for faces
 slide a window around the image, compute features, use a classifier, find faces!
 overlapping windows might be combined to report single face
 train classifier with marked-up face images *12*
 Complex appearance and pattern elements
 several effects move features around in an image: *13*
 foreshortening: viewing slanted surface
 aspect: object at different rotation angles
 occlusion: parts hidden by other parts or objects
 deformation: objects with moving parts/regions
 try looking across the image for object parts (also vary scale);
if related parts are close together then the object is detected
 i.e., look for image features together in approx the right place
 heuristic: use spatial information (e.g., car wheels at bottom)
 Pedestrian detection with Histogram of Gradient features
 use histograms of local orientations in an image *14*
 break image into cells  make orientation histogram for each cell
 emphasize important gradients by weights that show how
significant they are relative to others in the same cell
 gives Histogram of Gradient feature
 train classifier with existing training sets
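The cell-and-histogram construction can be sketched as follows. The cell size, bin count, magnitude weighting, and the synthetic ramp image are illustrative assumptions; real HOG pipelines add block normalization, which is omitted here.

```python
# Histogram-of-Gradient (HOG) feature sketch: break the image into
# cells and build one gradient-orientation histogram per cell, with
# each pixel's vote weighted by its gradient magnitude.
import numpy as np

def hog_cells(img, cell=8, bins=9):
    gy, gx = np.gradient(img.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx) % np.pi      # unsigned orientation in [0, pi)
    h, w = img.shape
    feats = []
    for cy in range(0, h - cell + 1, cell):
        for cx in range(0, w - cell + 1, cell):
            m = mag[cy:cy+cell, cx:cx+cell].ravel()
            a = ang[cy:cy+cell, cx:cx+cell].ravel()
            hist, _ = np.histogram(a, bins=bins, range=(0, np.pi), weights=m)
            feats.append(hist)
    return np.array(feats)

# Horizontal brightness ramp: every gradient points along x, so each
# cell's magnitude-weighted votes land in a single orientation bin.
img = np.tile(np.arange(16, dtype=float) * 10.0, (16, 1))
feats = hog_cells(img)
print(feats.shape)  # (4, 9): four 8x8 cells, nine bins each
```

The concatenated per-cell histograms are the feature vector handed to the classifier.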
 Reconstructing the 3D world
 recover 3D model from image
 i.e., can we compute P(Scene | Image) ∝ P(Image | Scene)P(Scene) ?
 Motion parallax
 camera moves relative to 3D scene *16*
 apparent motion in image tells us about camera mvt and depth info in scene
 viewer translational velocity T
 Z(x,y) is zcoordinate of point in scene corresponding to image point (x,y)
 optical flow
v_{x}(x,y) = xT_{z}/Z(x,y)
v_{y}(x,y) = yT_{z}/Z(x,y)
 can detect relative depths from optical flow
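The parallax relation above can be checked numerically: with pure forward translation T_z, flow magnitude is inversely proportional to depth, so relative depths can be read straight off the flow. The velocity and depth values below are illustrative.

```python
# Motion parallax: v_x = x * Tz / Z, so a nearer point (small Z)
# produces a larger image flow than a farther one at the same x.
Tz = 2.0                                       # camera speed toward the scene
points = {"near": (0.5, 4.0), "far": (0.5, 16.0)}  # image x, scene depth Z
for name, (x, Z) in points.items():
    vx = x * Tz / Z                 # predicted flow
    print(name, vx, x * Tz / vx)    # flow, and depth recovered by inverting it
# near 0.25 4.0
# far 0.0625 16.0
```

Note only relative depth comes for free: recovering absolute Z requires knowing T_z, which is why flow from a moving camera gives depth ordering more readily than metric depth.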
 Binocular stereopsis
 two images separated in space *17*
 disparity: difference in location in two images of same features
 need to solve the correspondence problem
 displacement of eyes (cameras) by amount b along x-axis (approx 6cm)
 horizontal disparity (in image) H = b/Z
 measure disparity, know b, obtain Z the depth of some point on object
 humans fixate: look at a certain depth
 small variations in depth correspond to small angles at the eye
 smallest detectable angle is about 5 seconds of arc
(a minute of arc is 1/60th of a degree)
(a second of arc is 1/60th of an arcminute)
 e.g., at 30cm we can detect 0.036mm!
 generalize to multiple views *18* *19*
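The 0.036mm figure quoted above follows from the disparity relation. Using the notes' values (b ≈ 6cm, fixation at 30cm, smallest detectable disparity change ≈ 5 arcseconds):

```python
# Depth resolution of binocular stereopsis: angular disparity is
# theta ~ b/Z, so a small disparity change dtheta corresponds to a
# depth change dZ ~ (Z**2 / b) * dtheta.
import math

b = 0.06                             # eye separation (baseline), metres
Z = 0.30                             # fixation depth, metres
dtheta = 5 / 3600 * math.pi / 180    # 5 arcseconds in radians

dZ = Z**2 / b * dtheta
print(round(dZ * 1000, 3))           # depth resolution in mm: 0.036
```

The Z² factor is why this exquisite depth resolution falls off quickly with distance: doubling the fixation depth quadruples the smallest detectable depth difference.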
 Shading
 variation in intensity of light from different portions of a
surface in the scene
 due to geometry and reflectance properties
 very hard to recover these from the image
 there are many inter-reflections
 Contour
 we can extract distance and 3D properties from outlines *21*
 figure-ground problem: which is foreground, which is background?
 big clue is T-junctions
 assume "ground plane"
 i.e., nearer objects project to points lower in image
 Objects and geometric structure of scenes
 can use horizon detector: objects whose images are closer to the horizon are
further away *22*
 also, pedestrians are approx the same height, so image size reflects distance
 for solid object with distinct feature points m_{i}
 pose detection, for use for industrial robots manipulating parts
 assume rotation and translation of object, and projection to image
 image point p_{i} = Q(m_{i})
 Q is the same for all image points
 if three object features can be found in the image then
equations can be solved (e.g., using edges and vertex detection)
 i.e., all m_{i} of object can be predicted
and object position and "pose" is known allowing manipulation
 Object recognition from structural information
 use knowledge of object being seen
 e.g., simple model of human body
 deformable template: moveable image blocks with relationships
e.g., leg image relative to body image *23*
 model geometry of body with eleven rectangular segments with
connections and constraints
 "cardboard people": model forms a tree rooted at torso
 segments can move relative to the segment to which they're connected
e.g., lower arm relative to upper arm
 image rectangle should resemble the model segment
 relationship between image rectangles should match expected
relationships between associated model segments
 find best match
 can use size of rectangle/image to help
 color can help matching
 Appearance model: model of segments reflecting most likely position of
person in the world, based on the image *24*
 Coherent appearance
 tracking people in video *25*
 look for torso in lots of frames
 build up a reliable appearance model that explains many frames
 Using Vision
 many applications!
 e.g., surveillance, sports, HCI, games, ...
 in simple cases with large fixed backgrounds can subtract
background from complete image leaving image of interest
 can train classifier on optical flow to recognize standard actions
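The fixed-background case can be sketched as a subtract-and-threshold step. The frames, object, and threshold value are illustrative assumptions.

```python
# Background subtraction with a fixed camera: subtract a known
# background frame and threshold the absolute difference to get a
# foreground mask covering just the object of interest.
import numpy as np

background = np.zeros((10, 10))
frame = background.copy()
frame[3:6, 4:7] = 200.0                    # a bright object enters the scene

mask = np.abs(frame - background) > 50     # foreground where change is large
ys, xs = np.nonzero(mask)
print(mask.sum(), (ys.min(), ys.max(), xs.min(), xs.max()))
# 9 (3, 5, 4, 6): nine foreground pixels spanning rows 3-5, columns 4-6
```

In practice the background is usually a running average of recent frames rather than a single stored image, so slow lighting changes don't trigger the mask.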
 Image retrieval
 find relevant images from db
 can be done via IR techniques (e.g., images have keywords)
 can learn keywords for image by using tagged training images and
nearestneighbors methods (test image similar to training image?)
 Reconstruction from many views
 assume a familiar 3D object, then we have an object model
 determine correspondences between image points and object points
 use correspondences to determine parameters of the camera (and lens)
 test this by projecting other model points through camera
to image
 determine whether there are matching image points nearby
 can confirm model
 applications include...
 Modelbuilding: use video or collection of pictures
to extract detailed 3D model of object *26*
 Matching moves: to put computer graphics characters in
real video, determine actual camera moves so that graphics
characters can be rendered correctly.
 Path reconstruction: robots can reconstruct objects that
they have seen, and use camera information to construct a
record of their path
 Using vision for controlling movement
 navigation  e.g., autonomous vehicles
 Lateral control: stay in lane
 Longitudinal control: stay away from vehicle ahead
 Obstacle avoidance: avoid other cars, and pedestrians
 adjust steering, acceleration and braking
 need position & orientation relative to lane
 use edge detection to find lane markers
 augment with map knowledge: vision is confirmation
 but obstacles aren't (usually) on the map
 use binocular stereopsis for car ahead distance
 augment with laser rangefinders to build probability maps
of surroundings
 use landmarks to reset absolute position information
 for driving you don't need ALL the information from an image
 DARPA Urban Challenge
Lecture 26: Watson
(see Watson talk slides & videos)
Lecture 27: AI at WPI
Lecture 28: AI at WPI
Markov Decision Processes Quick Overview
 agent must choose an action from ACTIONS(s) in each state s (at each time step)
 begins at start state in a fully observable environment
 sequential decision problem: find a (good) sequence of actions to terminal state
 terminal states have rewards (may be +ve or -ve)
 actions are unreliable (stochastic)
 some probability that movement will not be in direction chosen
 e.g., 0.8 in intended direction, 0.1 in two others.
 transition model: the outcome of each action at each state
 transition probabilities (to s' from s due to a) are known: P(s' | s,a)
 transitions are Markovian: probabilities do not depend on earlier states, just s.
 utility function for agent depends on sequence of states (environment history)
 in each state agent gets a reward R(s)
 may be +ve or -ve
 negative rewards encourage agent not to be there!
 simple utility is sum of the rewards received
 including at a terminal state, where a larger reward may occur (perhaps -ve)
 U([s_{0}, s_{1}, ...]) = R(s_{0}) + R(s_{1}) + ...
 discounted rewards (using "discount factor" γ)
 U([s_{0}, s_{1}, ...])
= R(s_{0}) + γR(s_{1}) + γ^{2}R(s_{2}) + ...
 γ between 0 and 1,
 expresses preference for known
current rewards over less well known future rewards.
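The discounted-utility formula is easy to evaluate for a concrete state sequence. The rewards and γ below are illustrative choices (a small per-step cost followed by a terminal +1, in the style of grid-world examples):

```python
# Discounted utility of a state sequence:
# U([s0, s1, ...]) = R(s0) + gamma*R(s1) + gamma^2*R(s2) + ...
gamma = 0.9
rewards = [-0.04, -0.04, -0.04, 1.0]   # step costs, then a terminal +1

U = sum(gamma**t * r for t, r in enumerate(rewards))
print(round(U, 4))  # 0.6206
```

With γ < 1 the terminal +1 is worth only γ³ ≈ 0.729 from the start state, which is exactly the "prefer current rewards over future ones" effect described above.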
 Markov Decision Process: states, actions, rewards, Markovian transitions.
 Policy π(s) : Solution to MDP: what action to take in any state
 each time the policy is executed from s_{0} it may lead to a different sequence
of states (stochastic)
 quality of policy is "expected utility" of environment histories
generated by policy.
 Optimal Policy π*(s): one that yields the highest expected utility
 if the agent knows its current state s it can then execute action π*(s) (Reflex Agent)
 changing R(s) values affects π*(s)
 maximize expected utility
π*(s) = argmax_{a} Σ_{s'} P(s' | s,a)U(s')
i.e., agent can choose action that maximizes expected utility of next state
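The one-step action choice can be sketched directly from the formula, using the 0.8/0.1/0.1 motion model mentioned above. The state names, utility values, and transition table are illustrative assumptions, not from the notes.

```python
# pi*(s) = argmax_a sum_s' P(s' | s,a) U(s'):
# pick the action whose expected next-state utility is highest.

# Utilities of the three states the agent can end up in.
U = {"left": 0.2, "up": 0.8, "right": 0.4}

# P(s' | s, a): 0.8 in the intended direction, 0.1 in each of the others.
P = {a: {s: (0.8 if s == a else 0.1) for s in U} for a in U}

def best_action(P, U):
    return max(P, key=lambda a: sum(P[a][s] * U[s] for s in P[a]))

print(best_action(P, U))
# "up": expected utility 0.8*0.8 + 0.1*0.2 + 0.1*0.4 = 0.7, the highest
```

Note this greedy rule is optimal only because U(s') is assumed to already be the true utility of each successor state; computing those utilities in the first place is what value iteration (or policy iteration) does.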
Return to Lecture 22 notes
