Available for collaboration

Raj Thakur - Senior ML Engineer

Senior ML Engineer at Amazon Web Services (AWS)

Specializing in ML framework optimization and acceleration on AWS Trainium and Inferentia chips. Expert in PyTorch, JAX, and custom silicon optimization for high-performance ML systems.

About Raj Thakur - Senior ML Engineer

I'm a Senior Software Development Engineer at Amazon, specializing in ML framework optimization and acceleration on AWS Trainium and Inferentia chips. With 10+ years of experience, I architect high-performance ML systems that enable efficient training and inference at cloud scale.


My expertise spans PyTorch and JAX optimization, custom silicon acceleration, PJRT runtime development, and building production-ready ML frameworks. I specialize in optimizing deep learning workloads for AWS Trainium (training) and AWS Inferentia (inference) custom chips.

ML Engineering Skills & Technical Expertise

🧠

ML Frameworks & Deep Learning

PyTorch XLA, JAX, PJRT Runtime, Transformers, CNNs, RNNs, LSTMs, Attention Mechanisms, BERT, GPT architectures

ML Acceleration & Custom Silicon

AWS Trainium, AWS Inferentia, Custom Silicon Optimization, Hardware-Software Co-design, Performance Profiling, Memory Optimization

🔄

Distributed ML & Training

Data Parallelism, Model Parallelism, Gradient Synchronization, Distributed PyTorch, Multi-node Training

🚀

ML Operations & Deployment

Model Serving, A/B Testing, ML Pipelines, Kubernetes, Docker, CI/CD for ML, Model Monitoring

📊

Data Engineering & Processing

AWS Services, S3, Lambda, Step Functions, ElasticSearch, OpenSearch, Apache Spark, Kafka, ETL Pipelines, Data Lakes, Real-time Streaming, Big Data Analytics

🔬

Programming Languages

C++, Java, Python, Shell

Professional Experience - AWS & Tech Companies

Amazon Web Services (AWS)

5 yrs 2 mos • Full-time

Senior Software Development Engineer

Jul 2024 - Present • United States • On-site

Leading ML framework optimization for AWS Trainium and Inferentia chips, with a focus on PyTorch and JAX acceleration

  • Developed the PJRT backend for AWS Trainium, enabling seamless JAX training on custom silicon
  • Optimized the PyTorch XLA integration, achieving 1.2x faster training on Trainium chips
  • Native PyTorch development
  • HLO-based graph optimizations
PyTorch • JAX • AWS Trainium • AWS Inferentia

AWS AI SDE II

Dec 2022 - Jul 2024 • 1 yr 8 mos • Santa Clara, California, United States

ML workloads for the AWS Personalize service and a custom payments solution for AWS customers

  • Contributed to the control plane of the AWS Personalize service
  • Developed a custom payments pipeline for an AWS strategic customer, covering management of the custom contract lifecycle
  • Architected and drove projects that contributed to $200+ million in free cash flow benefits for AWS
AWS SageMaker • AWS Bedrock • AWS Lambda • AWS Textract • AWS Step Functions • AWS S3 • AWS Personalize • PyTorch • Machine Learning

Software Development Engineer II

Jun 2020 - Dec 2022 • 2 yrs 7 mos • Bengaluru, Karnataka, India

OpenSearch development and distributed systems engineering

  • Core contributor to OpenSearch project development
  • Built scalable search and analytics solutions
  • Developed distributed system components for high-performance data processing
OpenSearch • Distributed Systems • Search • Analytics

Grab

2 yrs • Full-time

Senior Software Engineer

Oct 2019 - Jun 2020 • 9 mos • Bengaluru Area, India

Lead engineer on the settlement platform for Grab Financial Group

  • Designed and developed core business units for post-payment processing for SME merchants
  • Built payment gateway integration systems handling millions of transactions
  • Architected international payment facilitation systems
  • Migrated settlement platform from worker-based to event-driven architecture
Fintech • Payment Systems • Event-Driven Architecture

Tata Consultancy Services

2 yrs 1 mo • Full-time

Software Engineer

Aug 2015 - Aug 2017 • 2 yrs 1 mo • Hyderabad, Telangana, India

Full-stack development and enterprise software solutions

Enterprise Software • Full-Stack Development • Java

Technical Writing

I regularly write about machine learning infrastructure, system design, and emerging AI technologies. My articles focus on practical insights from building production ML systems.

Editor of the Software System Design publication on Medium, featuring in-depth articles on scalable architecture and distributed systems.

📱
UX Design

Split Bill App Design: A Guide to Creating a Seamless Splitwise-like Experience

Comprehensive guide to designing intuitive bill-splitting applications with focus on user experience.

🔧
Software Engineering

DI Frameworks — Spring, Guice and Dagger

Comprehensive comparison of dependency injection frameworks for large-scale applications.

🏗️
System Design

System Design Interview Primer

Essential tips and strategies for tackling high-level system design questions in technical interviews.

Let's Connect

Interested in discussing ML innovations, system architecture, or collaboration opportunities?