About

data science · 12 years
Illustrated portrait of Ujjwal Singh Rao

Hi, I'm Ujjwal — @brightertiger around the internet.

I'm a data science leader with 12 years of experience across big data analytics, predictive modeling, machine learning, deep learning, and natural language processing. I graduated from the Indian Institute of Technology (IIT) Kharagpur in 2013. Outside work, I compete on Kaggle as @brightertiger, maintain open source Python packages, and write about machine learning.

If you come across any collaboration opportunities, don't hesitate to get in touch.

Professional Experience

2014 — present

2024 — now

MSCI

Vice President

I am a member of the Data Extraction team, leading initiatives in AI-powered document intelligence and workflow automation. My work involves:

  • Building Large Language Model (LLM) agents for answering complex questions across millions of financial documents using agentic Retrieval-Augmented Generation (RAG) architectures and vector databases. These systems enable accurate information retrieval and synthesis at scale.
  • Designing and deploying LLM agents for automating end-to-end workflows, reducing manual intervention and accelerating data extraction processes across the organization.
  • Developing RAG pipelines that use LLMs to fetch data and information from financial documents with high precision and reliability.

2023 — 2024

HERE Maps

Lead Data Scientist

I was a member of the Map Observables team, tasked with constructing Self-Driving Maps for BMW's Urban Cruise Control. My work involved:

  • Tackling global-scale challenges by harnessing petabytes of data to create high-definition maps for autonomous driving. I improved key performance indicators such as False Positives, False Negatives, and Accuracy by more than 50% compared with legacy systems.
  • Applying machine learning algorithms, including XGBoost models, to integrate observations from diverse input sources such as dashcams and overhead imagery. This made it possible to deduce the accurate location and attributes of road signs.
  • Crafting innovative graph-based solutions to counteract positional observation drift from drive-based data sources used in map content. This implementation resulted in a notable reduction of False Positives by around 5%, surpassing the performance of radial search-based clustering.
  • Constructing a question-answering engine using LLaMA over extensive product and data requirement documents for data validations. This tool let users search these documents efficiently and extract details, significantly enhancing productivity.

2021 — 2023

Gojek Tech

Senior Data Scientist

I was a member of the Care Tech team, where I leveraged machine learning, deep learning, and natural language processing techniques to extract insights and facilitate automation. This involved analyzing customer service interactions across diverse channels such as email, in-app requests, chat, Twitter, and more. My work involved:

  • Facilitating AI/ML-driven intent detection through the implementation of multilingual NLP models. I developed intent classification models based on XLM-RoBERTa to support various languages, including Bahasa and English, achieving an accuracy rate exceeding 80%. Additionally, I deployed these models into production using TorchScript and MLflow.
  • Constructing named entity recognition (NER) models based on IndoBERT, utilizing open-source IndoNLU datasets. These models were designed to identify entities such as food, quantity, date, and chit-chat within text utterances.
  • Enhancing the search experience for help center articles by incorporating tags to encompass semantic diversity in search queries. I implemented a TF-IDF and Logistic Regression pipeline to extract pertinent keywords for each article, contributing to an improved search functionality.
  • Establishing a pipeline for issue discovery to identify emerging themes in service tickets and app reviews. I implemented topic modeling using the pyLDAvis and BERTopic libraries. Additionally, I trained sentence transformer models using SetFit for better results.

2014 — 2021

American Express

Data Analyst → Senior Data Scientist

Senior Data Scientist 2018 — 2021

I was part of the data science team working on the Natural Language Understanding (NLU) layer of the AskAmex chatbot. My work involved:

  • Training transformer-based models (such as BERT, DistilBERT, and RoBERTa) for intent classification. I removed label noise from training datasets using various robust machine learning techniques, which led to a 5% increase in prediction accuracy.
  • Building human-in-the-loop (HITL) pipelines for collecting labeled data at a minimal cost. I used weak supervision and active learning strategies to filter relevant data points for annotation. I built various interactive tools to help data labelers work efficiently. I introduced best practices and quality checks in the annotation pipelines to ensure high-quality output.
  • Collaborating with product teams to improve customer experience. I built interactive tools to visualize the performance of servicing journeys. These tools helped identify the edge cases that often led to automation failures. I introduced tracking of sentiment-level KPIs (in addition to automation) to capture channel performance holistically.

Data Scientist 2017 — 2018

I was part of the data science team working on an offer recommendation engine for the mobile app and website. My work involved:

  • Building factorization machine models to predict click-through rate. I built Spark-based feature engineering pipelines to process terabytes of clickstream data for training these models. The models were part of the final stacked ensemble that was deployed in production.
  • Optimizing impression caps on offers to drive higher overall engagement on the channel. I built XGBoost models to analyze the sensitivity of click-through rate with respect to impressions. I used the partial dependence plots from these models to identify the impression cap that maximized the F-beta score.

Senior Data Analyst 2015 — 2017

I was part of the modeling team working on up-sell and cross-sell targeting via email campaigns. My work involved:

  • Building artificial neural network-based models. These were binary classification models which predicted the probability of an existing customer taking up a more premium product. These models replaced the legacy logistic regression models by delivering better performance while simultaneously driving operational efficiency.
  • Migrating the legacy data transformation and feature engineering pipelines from SAS to Python to support the deployment of the above-mentioned neural network models in production. I also enabled automated retraining pipelines to address data drift.

Data Analyst 2014 — 2015

I joined the customer marketing team focusing on international markets (non-US). I worked on:

  • Targeting strategy for dynamic email campaigns in partnership with Movable Ink. The focus was to increase customer spending on small merchants in the UK. I analyzed transaction data to understand the location and category preferences of the customers. The analysis generated content-based recommendations displayed to the customer via dynamic emails. The open and click rates for these campaigns were significantly higher than the long-term average.
  • Supporting a joint venture with Gurunavi. Amex partnered with Gurunavi to offer dining recommendations to customers in Japan. I designed customer segments by clustering spending patterns across various industry verticals. The customer segments mapped to different personas, each of which received an exclusive set of restaurant recommendations.

Education

2020 — 2025

Georgia Institute of Technology

Master of Science in Analytics

abverdict

pip install abverdict

A/B testing statistics for Python — sample size calculation, statistical significance, Bayesian and sequential analysis, survival analysis, and stakeholder report generation, with a web calculator.

pygarble

pip install pygarble

A zero-dependency Python library that detects gibberish, garbled text, keyboard mashing, encoding errors, and text corruption — 24 detection strategies including Markov chains, n-gram analysis, and mojibake detection, with 99.5% precision.

More projects — LLM agents, computer vision, NLP — at github.com/brightertiger.

Kaggle

Competitions Master

🥇 ×3  ·  🥈 ×12  ·  🥉 ×4  ·  as @brightertiger

  • #3 of 1,621 teams in Jigsaw Multilingual Toxic Comment Classification
  • #6 of 3,308 teams in SIIM-ISIC Melanoma Detection
  • #16 of 3,943 teams in TalkingData AdTracking Fraud Detection

Writing

substack →

A/B Testing Done Right

A cohesive Python library that brings together sample size calculations, statistical analysis, and result interpretation for A/B testing — supporting conversion, magnitude, and timing experiments with both frequentist and sequential methods.

PyGarble: Detecting Gibberish Text in Python

A Python library that detects gibberish, keyboard mashing, and corrupted text using complementary detection strategies including keyboard pattern detection, vowel ratio analysis, and ensemble methods.