Ph.D., Department of Computer Science, University of Wisconsin-Madison, U.S.A., 2015 (expected)
M.S., Department of Computer Science, University of Wisconsin-Madison, U.S.A., 2011
B.Tech., Department of Computer Science and Engineering, IIT Madras, India, 2009
At a high level, my current research focuses on problems at the intersection of large-scale data management and machine learning. I am currently investigating the application of relational-style optimizations for feature engineering and machine learning. My previous projects include:
Columbus: We aim to systematize the black art of feature engineering in analytics using data management ideas. We formulate a framework of declarative operations for feature selection and devise a novel optimizer that improves performance at scale.
Bismarck: A unified system to handle several machine learning techniques. We use a popular numerical optimization algorithm and standard RDBMS features to achieve simplicity, efficiency and scalability. We also contributed code to the open-source library MADlib.
Staccato: Integrating the management of uncertain content, specifically Optical Character Recognition (OCR) data, with an RDBMS. We use a probabilistic model and devise a novel approximation framework to trade off between quality and performance.
SystemML: Integrating scalable ensemble learning and cross-validation techniques into SystemML, a declarative system for machine learning that provides an R-like language over Hadoop.
Distributed and Scalable PCA in the Cloud. Arun Kumar, Nikos Karampatziakis, Paul Mineiro, Markus Weimer, and Vijay Narayanan. NIPS BigLearn 2013.
Feature Selection in Enterprise Analytics: A Demonstration using an R-based Data Analytics System. Pradap Konda, Arun Kumar, Christopher Ré, and Vaishnavi Sashikanth. VLDB 2013 (Demo).
Hazy: Making it Easier to Build and Maintain Big-data Analytics. Arun Kumar, Feng Niu, and Christopher Ré. ACM Queue, 2013 (Invited to CACM March 2013).
Brainwash: A Data System for Feature Engineering. Anderson et al. CIDR 2013 (Vision).
Towards a Unified Architecture for in-RDBMS Analytics. Xixuan Feng*, Arun Kumar*, Benjamin Recht, and Christopher Ré. ACM SIGMOD 2012.
The MADlib Analytics Library or MAD Skills, the SQL. Hellerstein et al. VLDB 2012 (Industrial).
Probabilistic Management of OCR Data using an RDBMS. Arun Kumar and Christopher Ré. VLDB 2012.
On Reducing Delay in Mobile Data Collection-based WSNs. Arun Kumar, Krishna Sivalingam, and Adithya Kumar. Springer Wireless Networks 2012.
Flexible Multimedia Content Retrieval Using InfoNames. Arun Kumar, Ashok Anand, Athula Balachandran, Vyas Sekar, Aditya Akella, and Srinivasan Seshan. ACM SIGCOMM 2010 (Demo).
Energy-Efficient Mobile Data Collection in WSNs with Delay Reduction using Wireless Communication. Arun Kumar and Krishna Sivalingam. IEEE/ACM COMSNETS 2010.
Large-Scale Analytics for the Enterprise from the R Environment. Microsoft Jim Gray Systems Lab 2013, Madison, Wisconsin.
Brainwash: A Data System for Feature Engineering. CIDR 2013, Asilomar, California.
Probabilistic Management of OCR Data using an RDBMS. VLDB 2012, Istanbul, Turkey.
Towards a Unified Architecture for in-RDBMS Analytics. ACM SIGMOD 2012, Scottsdale, Arizona.