Feiyang Huang

Machine Learning @ Weill Cornell Medicine



Genome Set-Sketch



Fast reference-free genome comparison using SetSketch, written in C++ and MATLAB

Alignment-free (AF) sequence comparison is fast and remains accurate for sequences of low similarity. SetSketch is a recently developed sketching algorithm that lets the user trade off between the space efficiency of HyperLogLog and the speed of MinHash. We therefore created GenomeSetSketch, a simple C++ interface and MATLAB GUI for SetSketch, to enable efficient alignment-free comparison of DNA sequences. We adopted the AFproject workflow to benchmark across several alignment-free sequence comparison settings. In these benchmarking experiments, GenomeSetSketch performed better than most AF comparison tools. We also provide a pipeline for tuning multiple hyperparameters of the algorithm to achieve a desired level of performance, and we investigated how different parameters affect how accurately the algorithm determines the distance between two sequences. GenomeSetSketch can be used for fast comparison of genome sequences when the number of sequences is large or when sequence consensus is lacking because of rapid mutations and gene transfers. It can also be used to generate summary statistics for the large numbers of short reads produced by next-generation sequencing.
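To give a feel for what sketch-based, alignment-free comparison involves, here is a minimal C++ sketch. It is not the GenomeSetSketch code and it does not implement SetSketch itself: it uses a plain bottom-s MinHash over k-mers as a simpler stand-in, and every name in it (`kmers`, `jaccard`, `bottom_sketch`, `sketch_jaccard`) is a hypothetical helper chosen for illustration. The parameters `k` (k-mer length) and `s` (sketch size) are likewise assumptions, not the project's settings.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iterator>
#include <set>
#include <string>
#include <vector>

// Collect the distinct k-mers of a sequence (hypothetical helper).
std::set<std::string> kmers(const std::string& seq, std::size_t k) {
    std::set<std::string> out;
    if (k == 0 || seq.size() < k) return out;
    for (std::size_t i = 0; i + k <= seq.size(); ++i) {
        out.insert(seq.substr(i, k));
    }
    return out;
}

// Exact Jaccard similarity of two k-mer sets, used here as a reference value.
double jaccard(const std::set<std::string>& a, const std::set<std::string>& b) {
    if (a.empty() && b.empty()) return 1.0;
    std::size_t inter = 0;
    for (const auto& x : a) inter += b.count(x);
    return static_cast<double>(inter) /
           static_cast<double>(a.size() + b.size() - inter);
}

// Bottom-s sketch: the s smallest distinct hash values, in ascending order.
// std::hash is an illustrative choice; a real tool would use a stronger hash.
std::vector<std::uint64_t> bottom_sketch(const std::set<std::string>& ks,
                                         std::size_t s) {
    assert(s > 0);
    std::set<std::uint64_t> hashes;
    std::hash<std::string> h;
    for (const auto& x : ks) hashes.insert(static_cast<std::uint64_t>(h(x)));
    std::vector<std::uint64_t> v(hashes.begin(), hashes.end());
    if (v.size() > s) v.resize(s);
    return v;
}

// Estimate Jaccard similarity from two bottom-s sketches: keep the s smallest
// hashes of the merged sketches and count how many occur in both inputs.
double sketch_jaccard(const std::vector<std::uint64_t>& a,
                      const std::vector<std::uint64_t>& b, std::size_t s) {
    std::vector<std::uint64_t> uni;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(uni));
    if (uni.size() > s) uni.resize(s);
    if (uni.empty()) return 1.0;
    std::size_t shared = 0;
    for (std::uint64_t x : uni) {
        if (std::binary_search(a.begin(), a.end(), x) &&
            std::binary_search(b.begin(), b.end(), x)) {
            ++shared;
        }
    }
    return static_cast<double>(shared) / static_cast<double>(uni.size());
}
```

Calling `sketch_jaccard(bottom_sketch(kmers(x, k), s), bottom_sketch(kmers(y, k), s), s)` gives an estimate of the k-mer Jaccard similarity of sequences `x` and `y` without aligning them. A larger `s` generally reduces the variance of the estimate at the cost of more memory per sequence, which is the kind of trade-off a hyperparameter tuning pipeline would explore.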

In collaboration with Trisha Karani, Wenxuan Lu and Yuqi Zhang.
Final project for Computational Genomics: Sequences, taught by Ben Langmead.