Random Forest Optimizations under Data Drifts (Thesis)
– Created a distributed ensemble learning system for binary classification, using multiple decision trees (Random Forest) on evolving streaming datasets of 25M+ records.
– Researched and proposed three scaling optimizations for:
- Data Modeling: Gaussian Approximation for streamlined handling of numerical attributes in unbounded data streams, improving resource allocation (95% improvement)
- Resampling Enhancement: Refined the Online Bagging approach by centralizing its function, resulting in reduced data transactions (80% improvement)
- Base Learner Adaptation: Designed and implemented a dynamic accuracy monitoring method that halts and resumes learner adjustments to account for "static data" periods, optimizing performance without unnecessary growth (70% memory improvement, 90% accuracy under data drifts)
Online Credit Card Fraud Detection
– Built an end-to-end real-time fraud detection system for credit card transactions using an adaptive Random Forest of Hoeffding Trees on datasets of 10M+ tuples, addressing issues such as imbalanced classes, online bagging, voting boosting, and overfitting.
– Developed with Scala and Java. Deployed on Apache Spark using both HDFS and Apache Kafka as source/sink. Results: 92% accuracy and 95% F1-score.
Lupus (NPSLE) Classification using ML
– Designed and implemented ML pipelines including Support Vector Machines (SVM), K-Nearest Neighbors, and Random Forest on clinical data for the diagnosis of Lupus (NPSLE), researching the impact of applying Machine Learning techniques to such sensitive, high-dimensional data.
– Analyzed resting-state connectivity fMRI data using feature selection, classification, and cross-validation techniques. Results: Random Forest (Acc: 77%, F1-score: 79%), SVM (Acc: 74%, F1-score: 73%).
Online Random Sampling for Group-By Queries
– Implemented Random Sampling for Group-By Queries to effectively answer queries with a single aggregate and a single group-by clause.
– The algorithm runs in two phases: pre-processing (first Flink job), followed by reservoir sampling (second Flink job).
– Founded a start-up initiative, RεScan, a user-friendly app promoting correct recycling habits. When a user scans an item's barcode, the integrated machine learning model determines and suggests the recyclability category (if any), guiding users to dispose of their waste properly.
– Implemented Convolutional Neural Networks to distinguish recycling categories based on barcode images; used Firebase to maintain a dynamic repository of barcode-to-product feature mappings.
– Designed a PyCharm extension integrating AI for code generation in extensive software development. Released a beta version offering AI-generated code suggestions with descriptive explanations of the changes.
– Developed with Python, using OpenAI’s GPT-3.5-Turbo model.
Ask Question App
– Built front-end and back-end modules, secure blockchain authentication via MetaMask wallet integration, and smart contracts supporting Q&A interactions and tipping (upvoting) posts through Ethereum Attestation Service with a custom schema.
– Implemented on the Rinkeby Ethereum testnet; developed using Solidity, Truffle Suite, Node.js, and React.js.
School Dashboard App
– Developed a full-stack website for authorized teachers to manage classes, students, and grades, with personnel registration and authentication.
IEEE TUC Website
– Built a full-stack website for the local IEEE student branch to support user registration and authentication, team creation for contests, workshop management, and posts and news sections.
– Developed using the Django framework, Python, PostgreSQL, HTML, and CSS.