Yizhou Yan

PhD student in Computer Science (WPI) | yyan2@wpi.edu


I am Yizhou Yan, a second-year PhD student in the Computer Science department at WPI. I'm interested in big data analytics and mining. Recently, my main focus has been scalable outlier detection, especially local outlier detection on distributed systems such as Hadoop. I also have experience in other areas, such as non-negative matrix factorization and bioinformatics.



PhD in Computer Science
Worcester Polytechnic Institute
5-Year Program
Supervisor: Elke A. Rundensteiner
GPA: 4.0/4.0


Master in Computer Science
Dalian University of Technology
2-Year Program
Supervisor: Yu Liu
GPA: 3.8/4.0 RANK: 2/51

JUNE 2015

Bachelor in Computer Science
Dalian University of Technology
4-Year Program
GPA: 3.8/4.0 Major Course GPA: 4.0/4.0 RANK: 15/289

JUNE 2013


Multi-tactic Distance-based Outlier Detection. ICDE 2017. (Accepted)
Distributed Local Outlier Detection in Big Data. (In submission.)

Yu LIU, Zhen HUANG, Yizhou YAN, Yufeng CHEN. Science Navigation Map: an Interactive Data Mining Tool for Literature Analysis. WWW’15 Companion, Florence, Italy, May 18-22, 2015.

Yu LIU, Zhen HUANG, Jing FANG, Yizhou YAN. An Article Level Metric in the Context of Research Community. WWW’14 Companion, Seoul, Korea, April 7-11, 2014.

(First Student Author) Zhewen SHI, Yu LIU, Yizhou YAN, Xiaowei ZHAO. A Hierarchical Community Detection Method in Complex Networks. Journal of Computational Information Systems, vol.9, no.24, pp. 9715-9724, 2013.

In preparation for submission
1. Correlation-aware Top-N Local Outlier Detection in Big Data.
2. Knowledge Discovery leveraging Citation Cascade.


Outlier Detection with Frequent Sequence Patterns.
Developing an algorithm for a new definition of frequent sequence mining.

In time series data, strong time dependencies often exist among events/items. These dependencies can be used to detect outliers: a given time series is an outlier if it violates the frequent time dependencies. In this work, we first aim to propose an algorithm that captures time dependencies under a new definition of frequent sequence mining. We then plan to design a streaming version of this kind of outlier detection.

Jan 2017 --- present

Correlation-aware Top-N Local Outlier Detection in Big Data.
Proposed a correlation-aware multi-granularity pruning strategy together with a customized indexing structure for Top-N LOF calculation.

In this work, we present DTLOF, the first distributed solution for detecting Top-N local outliers on shared-nothing distributed platforms. DTLOF is based on a multi-granularity pruning strategy that quickly prunes most points from the outlier candidates without computing their exact LOF scores or even conducting a kNN search. A customized indexing structure not only effectively supports the pruning strategy but also accelerates the kNN search when necessary. Furthermore, we propose a "safe-eliminating" zone bounding strategy that efficiently locates the points not needed by any other machine and therefore significantly reduces communication costs. Lastly, we propose a correlation-aware strategy that effectively scales DTLOF to high-dimensional data.

June 2016 --- Jan 2017

Distributed Local Outlier Detection at Scale.
Proposed the first framework for distributed LOF calculation, together with two efficient partitioning strategies (PDLOF and DDLOF), and developed a faster method for kNN search.

In this work, we present the first distributed solution for the Local Outlier Factor (LOF) method. Our solution features a distributed LOF pipeline framework, called DLOF. Each stage of the LOF computation is conducted in a fully distributed fashion by leveraging a critical invariant observation on intermediate value management. We design a partitioning strategy which ensures that each machine is self-sufficient in all stages of the LOF pipeline. Moreover, based on the convergence property derived from analyzing this strategy in the context of real-world datasets, we introduce a number of data-driven optimization strategies.

Nov 2015 --- June 2016
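
For readers unfamiliar with LOF, the single-machine computation that DLOF distributes can be illustrated with a minimal brute-force sketch. It follows the standard LOF definition rather than the DLOF pipeline itself; the function name `lof_scores`, the sample points, and the choice k=2 are illustrative assumptions.

```python
import math


def lof_scores(points, k):
    """Brute-force LOF for a small list of points (tuples of floats).

    Assumes fewer than k exact duplicates of any point, so that no
    reachability sum is zero.
    """
    n = len(points)
    # Pairwise Euclidean distances.
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []  # indices in each point's k-distance neighborhood
    k_dist = []     # distance from each point to its k-th nearest neighbor
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]
        k_dist.append(kd)
        # Keep ties at the k-distance, as in the standard definition.
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean ratio of the neighbors' density to the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Four points on a unit square plus one distant point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
```

Scores near 1 indicate points whose density matches that of their neighbors; the distant point receives a much larger score. The quadratic distance matrix in this sketch is the part that does not scale to large datasets.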

Distance-based Outlier Detection at Scale.
Proposed two novel partitioning strategies (data-driven and cost-driven) for distributed distance-based outlier detection.

In this work, we present DOD, the first distributed distance-based outlier detection approach built on MapReduce-based infrastructure. DOD features a single-pass execution framework that minimizes communication overhead. Furthermore, DOD overturns two fundamental assumptions widely adopted in the distributed analytics literature, namely cardinality-based load balancing and one algorithm for all data. The multi-tactic strategy of DOD achieves a truly balanced workload by taking data characteristics into account during data partitioning, and it assigns the most appropriate algorithm to each partition based on our theoretical cost models established for distinct classes of detection algorithms.

Sep 2015 --- Nov 2015
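
As a point of reference for the DOD work, distance-based outlier detection can be sketched on a single machine as below. The sketch assumes the common neighbor-count definition (a point is an outlier if it has fewer than k neighbors within radius r), which the description above does not spell out; the function name `distance_outliers`, the parameters, and the sample points are illustrative, and this is a nested-loop baseline rather than DOD itself.

```python
import math


def distance_outliers(points, r, k):
    """Return indices of points with fewer than k neighbors within radius r."""
    outliers = []
    for i, p in enumerate(points):
        count = 0
        for j, q in enumerate(points):
            if i != j and math.dist(p, q) <= r:
                count += 1
                if count >= k:
                    break  # enough neighbors found: p is an inlier
        if count < k:
            outliers.append(i)
    return outliers


# Four points on a unit square plus one distant point (hypothetical data).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
outlier_ids = distance_outliers(points, r=1.5, k=2)
```

The early exit once k neighbors are found means dense regions are cheap to process while sparse regions are expensive, which is one reason balancing load by point count alone can be misleading.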

Knowledge discovery with multi-level citation networks.
Proposed a simple method taking citation cascades into consideration when exploring through corpora.

We propose a novel model, Matrix Factorization with Markov Chains (MF-MC), which models documents from two different but related perspectives: the document perspective and the citation perspective. The combination is implemented with a Multi-View Non-negative Matrix Factorization (MVNMF) model. Furthermore, the multi-view structure of citation cascades is captured by a Markov chain process as well as a Bernoulli process. To illustrate the usefulness of the MF-MC model, we conduct two experimental studies, document clustering and topic detection, on the benchmark dataset Cora. According to the results, our model outperforms several state-of-the-art models.

Dec 2014 --- March 2015

A benchmark dataset for H-sequence.
Generated a benchmark dataset for h-sequence.

We have constructed a benchmark dataset that can be used for various dynamic academic impact assessments involving time sequences (e.g. h-sequence). A corresponding system is under development, which will provide management of sequence data for scholars in Computer Science. This system will be publicly accessible as a website very soon.

March 2014 --- Oct 2015

Gene Set Enrichment Analysis.
Proposed an NMF method for gene bi-clustering.

In collaboration with Dr. Aedin C Culhane at Harvard School of Public Health, we incorporated the notion of degree of membership from fuzzy mathematics into the traditional NMF-based bi-clustering method and proposed a novel process for classifying genes and phenotypes while finding associations between them at the same time. Sparseness is also calculated to avoid noise.

March 2013 --- Sep 2013

Automated retrieval of relevant articles for GeneSigDB.
Proposed an efficient strategy to expand GeneSigDB.

In cooperation with Dr. Aedin C Culhane at Harvard School of Public Health, we describe a new strategy that uses data mining methods to identify the subset of publications most relevant to GeneSigDB. This approach is expected to improve the efficiency of the manual biocuration pipeline for GeneSigDB. The process comprises the optimization of PMC search keywords using Latent Semantic Analysis and the Vector Space Model, the extraction of tables from PDF files, and the classification of results. Biocurators found the pipeline useful and manually confirmed that 94% of predicted gene-list-positive articles contained gene signatures.

Aug 2012 --- March 2013


2016 Graduate Research Innovation Exchange (GRIE) Final List
From Worcester Polytechnic Institute

Oct 2014

Excellent Postgraduate Award
From Dalian University of Technology, Top 5%

Oct 2014

First Class Scholarship for Postgraduates
From Dalian University of Technology, Top 15%

Sep 2013

Third Prize in NPMCM
National Postgraduate MCM, China

Sep 2013

Learning Merit Scholarship, Individual Scholarship
Learning Merit Scholarship (Top 15%, twice; Top 5%, once), Individual Scholarship (Top 10%, once), from Dalian University of Technology, China

Sep 2009 --- June 2012

Honorable Mention in ICM, USA

Feb 2012

Third Prize in CUMCM
Undergraduate MCM, China

June 2011

Third Prize in ACM
Dalian, China

Sep 2010


Teaching Assistant at Worcester Polytechnic Institute
Discrete Math: once; Database I: three times; Software Engineering: once; Introduction to Program Design: once; Object-Oriented Design Concepts: once

Aug 2015 --- present

Internship at Baidu
Research Developer at Baidu Knowledge

April 2015 --- July 2015

Teaching Assistant at Dalian University of Technology
Computer Networking: once; Introduction to Algorithms: once; Database: once

Sep 2013 --- June 2014



+1 508-414-8049

Created by Yizhou Yan