sanjitsbatra _at_ gmail _dot_ com

Bio


I am a graduate student in the Department of Computer Science at the University of California, Berkeley. My primary interests are Computational Genomics and Artificial Intelligence. Prior to graduate school, I was a bioinformatics programmer at The Center for Genome Architecture at Baylor College of Medicine and Rice University, where I worked with Dr. Erez Aiden to develop computational methods for genome assembly using Hi-C.

I was a Computer Science undergraduate (B.Tech and M.Tech) at the Indian Institute of Technology, Delhi, from 2010 to 2016, where I worked primarily with Dr. Jayadeva on machine learning as part of my undergraduate thesis. I also worked on number theory under the guidance of Dr. Amitabha Tripathi and on approximation algorithms with Dr. Naveen Garg.


Publications

Computational Genomics

De novo assembly of the Aedes aegypti genome using Hi-C yields chromosome-length scaffolds

The Zika outbreak, spread by the Aedes aegypti mosquito, highlights the need to create high-quality assemblies of large genomes in a rapid and cost-effective fashion. Here, we combine Hi-C data with existing draft assemblies to generate chromosome-length scaffolds. We validate this method by assembling a human genome, de novo, from short reads alone (67X coverage). We then combine our method with draft sequences to create genome assemblies of the mosquito disease vectors Aedes aegypti and Culex quinquefasciatus, each consisting of three scaffolds corresponding to the three chromosomes in each species. These assemblies indicate that virtually all genomic rearrangements among these species occur within, rather than between, chromosome arms. The genome assembly procedure we describe is fast, inexpensive, accurate, and can be applied to many species.
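
The core signal behind Hi-C scaffolding is simple: loci that are adjacent in the genome contact each other far more often than distant loci, so draft contigs can be ordered by contact frequency. The sketch below is a toy illustration of that idea only, with hypothetical contact counts and a greedy chaining heuristic; the actual pipeline also orients contigs, corrects misjoins, and splits scaffolds into chromosomes.

```python
import numpy as np

# Toy, symmetric Hi-C contact matrix among 5 contigs (hypothetical counts;
# real pipelines also normalize for contig length before ordering).
C = np.array([
    [ 0, 50,  5,  2,  1],
    [50,  0, 40,  4,  2],
    [ 5, 40,  0, 60,  3],
    [ 2,  4, 60,  0, 70],
    [ 1,  2,  3, 70,  0],
])

def greedy_scaffold(C):
    """Chain contigs greedily: start from contig 0 and repeatedly append
    the unplaced contig with the most Hi-C contacts to the current end."""
    n = C.shape[0]
    order = [0]
    remaining = set(range(n)) - {0}
    while remaining:
        nxt = max(remaining, key=lambda j: C[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

print(greedy_scaffold(C))  # -> [0, 1, 2, 3, 4] for this toy matrix
```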


Machine Learning

Learning a hyperplane regressor by minimizing an exact bound on the VC dimension

The proposed approach, termed the Minimal Complexity Machine (MCM) Regressor, involves solving a simple linear programming problem. Experimental results show that, on a number of benchmark datasets, the proposed approach yields regressors with error rates much lower than those obtained with conventional SVM regressors, while often using fewer support vectors. On some benchmark datasets, the number of support vectors is less than one tenth the number used by SVMs, indicating that the MCM does indeed learn simpler representations.
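
For readers curious what the underlying linear program looks like, here is a minimal sketch of the hard-margin MCM classifier formulation, min h subject to 1 <= y_i (w . x_i + b) <= h, solved with SciPy. The regression variant in this paper instead constrains the residuals y_i - w . x_i - b, but the LP structure is analogous. The toy data and variable layout below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np
from scipy.optimize import linprog

# Toy linearly separable data (illustrative, not from the paper).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, scale=0.6, size=(20, 2)),
               rng.normal(loc=-2.0, scale=0.6, size=(20, 2))])
y = np.concatenate([np.ones(20), -np.ones(20)])

n, d = X.shape
# Decision variables: [w (d entries), b, h]; the objective is h alone.
c = np.zeros(d + 2)
c[-1] = 1.0

# -y_i (w . x_i + b) <= -1   (every margin at least 1)
A1 = np.column_stack([-y[:, None] * X, -y, np.zeros(n)])
b1 = -np.ones(n)
#  y_i (w . x_i + b) - h <= 0   (h bounds every margin from above)
A2 = np.column_stack([y[:, None] * X, y, -np.ones(n)])
b2 = np.zeros(n)

res = linprog(c, A_ub=np.vstack([A1, A2]),
              b_ub=np.concatenate([b1, b2]),
              bounds=[(None, None)] * (d + 1) + [(1.0, None)])
w, b, h = res.x[:d], res.x[d], res.x[d + 1]
print("h =", h, " training accuracy =",
      np.mean(np.sign(X @ w + b) == y))
```

Minimizing h shrinks the ratio of the largest to the smallest margin, which is what drives the bound on the VC dimension down.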


Machine Learning

Learning Sparse Fuzzy Classifiers through Minimizing the VC dimension

We suggest the use of a simple fuzzy membership in combination with the MCM. Experimental results show that, on a number of benchmark datasets, the fuzzy MCM yields results comparable to fuzzy SVMs, and learns representations that require far fewer support vectors. On several benchmark datasets, the fuzzy MCM classifier yields excellent test set accuracies while using one-tenth the number of support vectors used by fuzzy SVMs.


Machine Learning

Feature Selection for classification of hyperspectral data by minimizing a tight bound on the VC dimension

This paper employs a recently proposed filter feature-selection algorithm based on minimizing a tight bound on the VC dimension. We successfully apply this algorithm to widely used hyperspectral images to determine a reasonable subset of bands without any user-defined stopping criteria, and demonstrate that it outperforms state-of-the-art methods in terms of both the sparsity of the feature set and the accuracy of classification.


Machine Learning

Feature Selection through Minimization of the VC dimension

Experimental results show that the MCM learns very sparse representations; on many datasets, the kernel MCM yields comparable or better test set accuracies while using less than one-tenth the number of support vectors. For a linear hyperplane classifier in the input space, the VC dimension is upper bounded by the number of features; hence, a linear classifier with a small VC dimension is parsimonious in the set of features it employs. In this paper, we use the linear MCM to learn a classifier in which a large number of weights are zero; features whose weights are zero can be discarded, and the remaining features constitute the selected subset.
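
Once the linear MCM drives most weights to zero, the selection step itself is a simple threshold-and-prune, sketched below with a hypothetical weight vector (not taken from the paper):

```python
import numpy as np

# Hypothetical weight vector from a trained linear classifier;
# features with (numerically) zero weights are dropped.
w = np.array([0.0, 1.7, 0.0, -0.4, 1e-9, 0.0])
tol = 1e-6
selected = np.flatnonzero(np.abs(w) > tol)
print("selected feature indices:", selected)      # -> [1 3]

X = np.random.default_rng(0).normal(size=(8, 6))  # toy data matrix
X_reduced = X[:, selected]                        # keep only selected features
print("reduced shape:", X_reduced.shape)          # (8, 2)
```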


Machine Learning

Sparse Short-Term Time Series Forecasting Models via Minimum Model Complexity

Time series forecasting is of fundamental importance in big data analysis. The prediction of noisy, non-stationary, and chaotic time series demands good generalization from small amounts of data. Vapnik showed that the total risk depends on both the empirical error and the model complexity, where the latter may be measured in terms of the Vapnik–Chervonenkis (VC) dimension. In other words, good generalization requires minimizing model complexity. The recently proposed Minimal Complexity Machine (MCM) has been shown to minimize a tight bound on the VC dimension, and has been extended to Minimal Complexity Machine Regression (MCMR). In this paper, we present an original approach based on the MCM regressor, which builds sparse and accurate models for short-term time series forecasting. Results on a number of datasets establish that the proposed approach is superior to a number of state-of-the-art methods and yields sparse models. These sparse models extract only the most important information present in the data, thereby achieving high accuracy. Sparsity in time series forecasting models is also important for reducing evaluation time, which matters when models must be evaluated in real time, for example as part of trading flows.
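
As a concrete illustration of the short-term forecasting setup, the sketch below builds a lagged design matrix from a toy series and fits a one-step-ahead linear predictor; ordinary least squares stands in here for the MCM regressor, and the series, lag count, and helper names are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def lag_matrix(series, n_lags):
    """Build a lagged design matrix for one-step-ahead forecasting:
    row t holds [y_t, ..., y_{t+n_lags-1}], and the target is y_{t+n_lags}."""
    X = np.column_stack([series[i:len(series) - n_lags + i]
                         for i in range(n_lags)])
    y = series[n_lags:]
    return X, y

# Toy noisy sine series; a linear model maps the last 5 values to the next.
rng = np.random.default_rng(0)
t = np.arange(200)
series = np.sin(0.2 * t) + 0.1 * rng.standard_normal(200)

X, y = lag_matrix(series, n_lags=5)
A = np.column_stack([X, np.ones(len(X))])   # add an intercept column
w, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ w
print("in-sample RMSE:", np.sqrt(np.mean((pred - y) ** 2)))
```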


Machine Learning

An Optimization-Free Approach to Predictive Analytical Chromatography

Manuscript in Preparation


Number Theory

On a Linear Diophantine Problem involving the Fibonacci and Lucas sequences

For a set A of relatively prime positive integers, let Gamma(A) denote the set of integers that can be formed by taking nonnegative integer linear combinations of the elements of A. Then there are only finitely many positive integers that do not belong to Gamma(A). Let g(A) and n(A) denote the largest such integer and the number of such integers, respectively. We determine both g(A) and n(A) for two sets that arise naturally from the Fibonacci and Lucas sequences.
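
As a worked example of these definitions: for A = {3, 5}, the nonnegative integer combinations miss exactly 1, 2, 4, and 7, so g(A) = 7 and n(A) = 4. The brute-force sketch below (not the paper's method, which derives closed-form answers) checks such small cases:

```python
from functools import reduce
from math import gcd

def frobenius_stats(A, limit=10_000):
    """Brute-force g(A) (largest non-representable positive integer) and
    n(A) (count of non-representable positive integers) for a set A of
    relatively prime positive integers, by dynamic programming up to limit."""
    assert reduce(gcd, A) == 1, "A must be relatively prime"
    reachable = [False] * (limit + 1)
    reachable[0] = True  # the empty combination represents 0
    for m in range(1, limit + 1):
        reachable[m] = any(m >= a and reachable[m - a] for a in A)
    gaps = [m for m in range(1, limit + 1) if not reachable[m]]
    return max(gaps), len(gaps)

print(frobenius_stats([3, 5]))      # -> (7, 4)
# A Fibonacci-flavoured set, in the spirit of the paper:
print(frobenius_stats([5, 8, 13]))
```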


Number Theory

Some problems concerning the Frobenius number for extensions of an arithmetic progression

Manuscript Submitted


Side Projects