WPI Worcester Polytechnic Institute

Computer Science Department
------------------------------------------

AIRG Topics - Spring 2015


Our group meets on Thursdays at 11:00 a.m.
  • Fall Semester
    A term - FL 320
    B term - Beckett Conf. Rm.
  • Spring Semester
    C term - Kaven Hall 111B
      (next to the vending machines)
    D term - FL 311


Jan 15
No meeting

Jan 22
Andi Dhroso, PhD Student, Bioinformatics
Detecting genomic elements of extreme conservation in higher eukaryotes by integration of hash mapping and cache-oblivious in-memory computing
    Genomics is one of the first life science disciplines to enter the era of Big Data, facing challenges in all three dimensions -- volume, variety, and velocity. Yet, in spite of a plethora of sequencing data, we are still far from creating a complete encyclopedia of the functional and structural elements of the genome. In 2004, an example of this knowledge gap came about when Bejerano and Haussler discovered 481 DNA elements in the syntenic positions of the human, mouse, and rat genomes that were 100% identical, called the ultraconserved elements (UCEs). Recently, using an advanced data-mining alignment-free approach, we have shown that this phenomenon exists beyond the animal kingdom and outside the regions of synteny.
          Our current goal is to provide a comprehensive atlas of the regions of extreme conservation in higher eukaryotes, providing insights into the structural organization, function, and evolution of these elements. However, this task of all-against-all comparison of dozens of eukaryotic genomes may not be feasible using current approaches. For instance, the original findings of syntenic-only UCEs relied on a whole-genome alignment of the three genomes and took 1 day on a 24-node cluster. Our comprehensive alignment-free algorithm, which guarantees finding all non-syntenic and syntenic LIMEs, took 3 days on a 48-CPU cluster. Here, we present a new hybrid approach that integrates the ideas of hash mapping and cache-oblivious in-memory computing. Our algorithm leverages the concept of help-me-help-you, where the data structures are tailored to maximize cache hits while minimizing cache misses. As a result, our hybrid algorithm is almost 300 times faster than the current state-of-the-art method and scales to unassembled genomes. It has been applied to detect the earliest evidence of extreme conservation by including the recently sequenced coelacanth and lamprey genomes in the large-scale analysis.
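The hash-mapping half of this idea can be sketched in a few lines: index every k-mer of one genome in a hash table, then stream the second genome against that index with a single probe per position. This is an illustrative simplification for the talk announcement, not the authors' algorithm; the function name and parameters are invented for the sketch.

```python
from collections import defaultdict

def shared_kmers(genome_a, genome_b, k):
    """Index every k-mer of genome_a in a hash map, then stream
    genome_b against the index. Returns the set of k-mers that
    occur in both sequences -- candidate conserved elements."""
    index = defaultdict(list)
    for i in range(len(genome_a) - k + 1):
        index[genome_a[i:i + k]].append(i)
    hits = set()
    for j in range(len(genome_b) - k + 1):
        kmer = genome_b[j:j + k]
        if kmer in index:  # one hash probe per position
            hits.add(kmer)
    return hits
```

The cache-oblivious part of the real method concerns how such tables are laid out in memory, which this sketch does not attempt to model.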

Jan 29
POSTPONED
Seth Adjei
Refining Learning Maps with Data Fitting Techniques
MS Thesis presentation
Advisor: Prof. Neil Heffernan
Reader: Prof. Joseph Beck
    Learning maps have been used to represent student knowledge for many years. These maps are usually hand-made by experts in a given domain. However, these hand-made maps have not been found to be predictive of student performance. Several methods have been proposed to find better-fitting learning maps, including the Learning Factors Analysis (LFA) model and Q-matrices.
          In this thesis I report on the application of one of the operations proposed in the LFA method to a small section of a skill graph and develop a greedy search algorithm for finding better-fitting models for this graph. Additionally, I investigate the factors that influence the search for better data-fitting models using the proposed algorithm. I also present an empirical study in which PLACEments, an adaptive testing system that employs a skill graph, is modified to test the strength of prerequisite skill links in a given learning map, and I propose a method for refining learning maps based on those findings.
          I found that the proposed greedy search algorithm performs as well as the original skill graph but with a smaller set of skills. Additionally, it was found that, among other factors, the number of unnecessary skills, the number of items in the graph, and the guess and slip rates of the items tagged with skills in the graph have an impact on the search. Further, the size of the evaluation data set impacts the search: the more data that exists, the more predictive the learned skill graph.
          Additionally, I have shown that PLACEments can be used to refine skill graphs by detecting the strengths of prerequisite links between skills within a graph.
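The kind of greedy search described, which keeps dropping skills as long as model fit does not suffer, can be illustrated as follows. This is a sketch, not the thesis's actual algorithm; `score` stands in for whatever model-fit measure (e.g., predictive accuracy on held-out data) is used.

```python
def greedy_refine(skills, score):
    """Greedily drop one skill at a time from the graph as long as
    the fit score does not decrease. `score` is a caller-supplied
    evaluation function over a list of skills."""
    current = list(skills)
    best = score(current)
    improved = True
    while improved:
        improved = False
        for s in list(current):
            candidate = [x for x in current if x != s]
            if not candidate:
                continue
            c_score = score(candidate)
            if c_score >= best:  # same or better fit, fewer skills
                current, best = candidate, c_score
                improved = True
                break
    return current, best
```

This mirrors the reported finding: the refined graph fits as well as the original but with fewer skills.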

Feb 5
Faculty Candidate talk at same time
AIRG cancelled for today

Feb 12
Carolina Ruiz
Discovering stroke patient sub-populations via clustering
    The objective of our study is to find meaningful groups in data from ischemic stroke patients using unsupervised clustering. The data are modeled using Gaussian mixture models with a variety of covariance structures. Cluster parameters in each of these models are estimated by maximum likelihood via the Expectation-Maximization algorithm. The best models are then selected using information-theoretic criteria. We observe that the stroke patients can be grouped into a small number of medically relevant clusters, defined primarily by the presence of diabetes and atrial fibrillation. Characteristics of the clusters found are discussed using statistical comparisons and data visualization. This is joint work with CS PhD student Ahmedul Kabir and researchers from Boston College and the University of Massachusetts Medical School.
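The pipeline in this abstract, EM-fitted Gaussian mixtures followed by information-criterion model selection, can be sketched for the 1-D case as below. The function names are invented for this sketch, and the actual study uses multivariate mixtures with several covariance structures.

```python
import numpy as np

def em_gmm_1d(x, k, iters=200):
    """Fit a 1-D Gaussian mixture to data x by EM.
    Returns (weights, means, variances, log-likelihood)."""
    mu = np.percentile(x, np.linspace(10, 90, k)).astype(float)
    var = np.full(k, x.var())
    w = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, variances
        nk = r.sum(axis=0)
        w = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    dens = w * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
    return w, mu, var, np.log(dens.sum(axis=1)).sum()

def bic(loglik, n_params, n):
    """Bayesian information criterion; lower is better."""
    return n_params * np.log(n) - 2 * loglik
```

Model selection then amounts to fitting several values of k (and covariance structures, in the multivariate case) and keeping the fit with the lowest criterion value.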

Feb 19
Academic Advising Appointment Day
No AIRG Meeting

Feb 26
Xinyue Liu
Collective Traffic Prediction using Social Media Data
    Road traffic prediction is an important task in modern transportation systems, affecting people's daily lives in many ways. Existing solutions focus on exploiting past and current traffic data collected by various kinds of sensors, such as loop detectors and GPS devices. However, these sensors are expensive to deploy, hard to maintain, and may fail under extreme conditions (such as during a winter storm). In real-world road systems, only a small fraction of the road segments are equipped with sensors; for the remaining segments we have no access to past or current traffic condition data, and previous traffic prediction methods fail. In this work, we examine the possibility of using online social media data with location tags to predict traffic conditions in a road network. Social media data can cover a much larger area of the road system than conventional road sensors. The idea is that when detection devices are unavailable at some locations, tweets published in the area can still help predict the traffic conditions there. We first tested the performance of using social media information together with complete historical traffic data to predict traffic conditions. We then propose a method to address the case where parts of the road system have no historical traffic data. Our proposed method makes predictions based on Twitter semantics and the spatial-temporal dependencies in the traffic network. Experimental results on traffic data collected in the Los Angeles area of California demonstrate that our proposed method can significantly improve the prediction performance.
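A toy version of the idea is to regress a segment's speed on tweet-derived features plus the speeds of spatially adjacent segments. The feature names are hypothetical and plain least squares stands in for the talk's more sophisticated spatio-temporal model.

```python
import numpy as np

def fit_traffic_model(tweet_feats, neighbor_speeds, observed_speed):
    """Least-squares fit of a segment's speed from tweet-derived
    features plus the speeds of spatially adjacent segments."""
    X = np.hstack([tweet_feats, neighbor_speeds, np.ones((len(observed_speed), 1))])
    coef, *_ = np.linalg.lstsq(X, observed_speed, rcond=None)
    return coef

def predict_speed(coef, tweet_feats, neighbor_speeds):
    """Apply the fitted coefficients to new segments."""
    X = np.hstack([tweet_feats, neighbor_speeds, np.ones((len(tweet_feats), 1))])
    return X @ coef
```

The point of the sketch is only the data flow: tweet features substitute for missing sensor readings on unsensed segments.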

March 5
Yan Wang
How Long Must We Spin Our Wheels?
    Analysis of Student Time and Classifier Inaccuracy
    Wheel-spinning is the phenomenon where students, in spite of repeated practice, make no progress towards mastering a skill. Prior research has shown that a considerable number of students can get stuck in the mastery learning cycle, unable to master the skill despite the affordances of the educational software. In such situations, the tutor's promise of "infinite practice" via mastery learning becomes more a curse than a blessing. Prior research on wheel-spinning overlooks two aspects: how much time is spent wheel-spinning, and the problem of imbalanced data. This work provides an estimate of the amount of time students spend wheel-spinning. A first-cut approximation is that 24% of student time in the ASSISTments system is spent wheel-spinning. However, the data used to train the wheel-spinning model were imbalanced, biasing the model's predictions and causing it to undercount wheel-spinning. We identify this misprediction as an instance of a general model-extrapolation issue within Educational Data Mining, provide an algebraic workaround that modifies the detector's predictions to better align with reality, and show that students spend approximately 28% of their time wheel-spinning in ASSISTments.
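The talk's specific algebraic workaround isn't reproduced here, but a standard prior-shift correction of the same flavor rescales a classifier's predicted positive-class probability from the (imbalanced) training-set prior to the prior believed to hold in deployment:

```python
def prior_correct(p, train_prior, true_prior):
    """Re-weight a classifier's positive-class probability p (learned
    under class prior train_prior) to reflect the true prior, via
    Bayes' rule: rescale the odds and renormalize."""
    num = p * true_prior / train_prior
    den = num + (1 - p) * (1 - true_prior) / (1 - train_prior)
    return num / den
```

When the true prior exceeds the training prior, the corrected probability rises, which is the direction of the 24% to 28% adjustment reported in the abstract.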

Mar 12
Ugrad & Grad student break

D term Location = FL 311

Mar 19
Seth Adjei
Refining Learning Maps with Data Fitting Techniques
Advisor: Prof. Neil Heffernan
Reader: Prof. Joseph Beck
    Learning maps have been used to represent student knowledge for many years. These maps are usually hand-made by experts in a given domain. However, these hand-made maps have not been found to be predictive of student performance. Several methods have been proposed to find better-fitting learning maps, including the Learning Factors Analysis (LFA) model and Q-matrices.
        In this thesis I report on the application of one of the operations proposed in the LFA method to a small section of a skill graph and develop a greedy search algorithm for finding better-fitting models for this graph. Additionally, I investigate the factors that influence the search for better data-fitting models using the proposed algorithm. I also present an empirical study in which PLACEments, an adaptive testing system that employs a skill graph, is modified to test the strength of prerequisite skill links in a given learning map, and I propose a method for refining learning maps based on those findings.
        I found that the proposed greedy search algorithm performs as well as the original skill graph but with a smaller set of skills. Additionally, it was found that, among other factors, the number of unnecessary skills, the number of items in the graph, and the guess and slip rates of the items tagged with skills in the graph have an impact on the search. Further, the size of the evaluation data set impacts the search: the more data that exists, the more predictive the learned skill graph.
        Additionally, I have shown that PLACEments can be used to refine skill graphs by detecting the strengths of prerequisite links between skills within a graph.

Mar 26
Ruikun Luo
Recognition and Prediction of Human Reaching Motion in Manipulation Tasks
    This presentation focuses on recognition and prediction of human reaching motion in manipulation tasks. Several supervised learning methods have been proposed for this purpose, but we seek a method that can run on-the-fly and adapt to new people and new motion styles as they emerge. Thus, unlike previous work, we propose an unsupervised online learning approach, which requires no offline training or manual categorization of trajectories. Our approach consists of a two-layer library of Gaussian Mixture Models (GMMs) that can be used for both recognition and prediction. We do not assume that the number of motion classes is known a priori, so the library grows when it cannot explain a newly observed trajectory. Given an observed portion of a trajectory, the framework predicts the remainder by first determining which GMM the trajectory belongs to and then using Gaussian Mixture Regression to predict the rest. We tested our method on motion-capture data recorded during assembly tasks. Our results suggest that the proposed framework outperforms supervised methods in terms of both recognition and prediction. We also show the benefit of using our two-layer framework over simpler approaches.
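The prediction step of Gaussian Mixture Regression conditions each component's joint Gaussian over (observed, remaining) trajectory dimensions on the observed part, then blends the components. A minimal single-Gaussian version of that conditioning step (the helper name is invented for this sketch):

```python
import numpy as np

def gaussian_conditional(mu, sigma, x_obs, obs_idx, pred_idx):
    """Conditional mean of a joint Gaussian N(mu, sigma) given
    observed dimensions: mu_p + S_po S_oo^{-1} (x_obs - mu_o).
    This is the core step GMR applies per mixture component."""
    mu_o, mu_p = mu[obs_idx], mu[pred_idx]
    S_oo = sigma[np.ix_(obs_idx, obs_idx)]
    S_po = sigma[np.ix_(pred_idx, obs_idx)]
    return mu_p + S_po @ np.linalg.solve(S_oo, x_obs - mu_o)
```

In full GMR, the per-component conditional means are averaged with weights proportional to each component's likelihood on the observed portion.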

Apr 02
Chiying Wang
Efficient & Scalable CDMC Framework
    The Collective Dynamic Modeling Clustering (CDMC) framework is a general algorithmic strategy for simultaneous clustering and dynamical modeling of sequence data. However, it leaves open several important design choices. In the current research, I work on three essential problems that need to be solved in the CDMC framework: clustering initialization techniques, dynamic model type, and convergence properties. In addition to those outstanding problems, scalability becomes equally significant when dealing with large volumes of sequence data, and a systematic plan for deploying the CDMC algorithmic framework across clusters is valuable for practical applications. The experimental results in my work show that the proposed clustering initialization technique and dynamic models significantly strengthen the collective dynamical modeling and clustering of time series, with respect to effectiveness, efficiency, robustness, and scalability.

Apr 9
Adrian Boteanu
Solving and Explaining Analogies Using Semantic Networks
    Analogies are a fundamental human reasoning pattern that relies on relational similarity. Understanding how analogies are formed facilitates the generalization and transfer of knowledge between contexts. The approach presented in this work focuses on obtaining precise interpretations of analogies. We leverage noisy semantic networks to answer and explain a wide spectrum of analogy questions. The core of our contribution, the Semantic Similarity Engine, consists of methods for extracting and comparing graph-contexts that reveal the relational parallelism that analogies are based on, while mitigating uncertainty in the semantic network. We demonstrate these methods in two tasks: answering multiple-choice analogy questions and generating human-readable analogy explanations. We evaluate our approach on two datasets totaling 600 analogy questions. Our results show reliable performance and a low false-positive rate in question answering; human evaluators agreed with 96% of our analogy explanations. In addition, we present initial results on using our analogical reasoning approach to generalize robot tasks.
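In the spirit of the multiple-choice task (a drastic simplification with an invented triple-store format, not the Semantic Similarity Engine itself), answering an analogy question amounts to comparing the relations that link the query pair against those linking each candidate pair:

```python
def best_analogy(net, a, b, candidates):
    """Pick the candidate pair whose connecting relations in the
    semantic network (a set of (source, relation, target) triples)
    best match those linking a to b, by Jaccard overlap."""
    def relations(x, y):
        return {r for (s, r, t) in net if s == x and t == y}
    target = relations(a, b)
    def overlap(pair):
        rs = relations(*pair)
        union = target | rs
        return len(target & rs) / len(union) if union else 0.0
    return max(candidates, key=overlap)
```

The shared relation set also doubles as a rudimentary explanation of why the chosen pair is analogous.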

Apr 16
Thierry Petit
Associate Professor at Mines de Nantes / LINA, France
Visiting Professor, School of Business WPI
Finding Diverse Solutions of High Quality to Constraint Optimization Problems
    A number of effective techniques for constraint-based optimization can be used to generate either diverse or high-quality solutions independently, but no framework is devoted to accomplishing both simultaneously. In this work, we tackle this issue with a generic paradigm that can be implemented in most existing constraint solvers. We show that our technique can be specialized to produce diverse solutions of high quality in the context of over-constrained problems. Furthermore, our paradigm allows us to consider diversity from a different point of view, based on generic concepts expressed by global constraints. Experiments demonstrate the solving efficiency of our approach.
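One simple way to combine the two goals outside a constraint solver (an illustrative post-processing sketch, not the paper's in-solver paradigm) is a greedy filter that scans solutions by decreasing quality and keeps only those sufficiently far, in Hamming distance, from everything kept so far:

```python
def diverse_top(solutions, quality, k, min_dist):
    """Return up to k solutions, scanned in order of decreasing
    quality, keeping one only if its Hamming distance to every
    previously kept solution is at least min_dist."""
    def hamming(a, b):
        return sum(x != y for x, y in zip(a, b))
    kept = []
    for s in sorted(solutions, key=quality, reverse=True):
        if all(hamming(s, t) >= min_dist for t in kept):
            kept.append(s)
            if len(kept) == k:
                break
    return kept
```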

Apr 23
Undergraduate Project Presentation Day

Apr 30
(reserved for possible MS thesis presentations)



AIRG Coordinator / Fri Apr 3 19:44:16 EDT 2015