
RIPPLe: Building a Bridge Between LSST and DeepLense

It's hard to believe seven weeks have flown by. In that time, I've consumed countless cups of green tea and developed a single obsession: getting petabytes of astronomical data ready for deep learning. When I started this Google Summer of Code project, the mission seemed straightforward enough: build a pipeline to feed data from the Legacy Survey of Space and Time (LSST) into machine learning models for the DeepLense project. The Vera C. Rubin Observatory, which will conduct the LSST, is a firehose of cosmic data, set to produce 20 terabytes every single night. Buried in that data, we expect to find around 100,000 new gravitational lenses, a massive jump from the few hundred we know of today. Each one is a cosmic magnifying glass that can help us understand the mysteries of dark matter. But first, you have to find them. That's where my project, RIPPLe, comes in.

Phase 0: The Foundation

I look back at who I was in February, happily working my way through An...

The Normalization Nightmare

How hard can it be to scale pixel values to a nice range for a neural network, like 0 to 1? Turns out, it's incredibly hard when dealing with astronomical data. Min-Max Scaling? A single hot pixel or cosmic ray outlier will completely wreck the scale for the entire image. Z-score Standardization? This works better but can result in negative values, which some neural network architectures don't like. Asinh Stretch? This is what astronomers use to visualize images, as it handles the huge dynamic range. It's fantastic for making pretty pictures, but trying to explain it to PyTorch is another story.
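The compromise I've been converging on is to combine the robust and the pretty approaches: clip outliers with percentiles first, then apply an asinh stretch so the result lands in [0, 1]. Here's a minimal sketch of that idea; the percentile bounds and softening factor are illustrative placeholders, not tuned values.

```python
import numpy as np

def asinh_normalize(img, clip_percentiles=(1.0, 99.0), softening=10.0):
    """Normalize an astronomical image to [0, 1] with an asinh stretch.

    Percentile clipping tames the hot pixels / cosmic rays that wreck a
    plain min-max scale; asinh then compresses the huge dynamic range.
    """
    lo, hi = np.percentile(img, clip_percentiles)
    clipped = np.clip(img, lo, hi)
    scaled = (clipped - lo) / max(hi - lo, 1e-12)  # robust min-max to [0, 1]
    stretched = np.arcsinh(softening * scaled) / np.arcsinh(softening)
    return stretched.astype(np.float32)
```

A single 1e6-valued hot pixel barely moves the output this way, whereas naive min-max would squash every real source to near zero.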

Welcome to PSF Hell

Today I learned about the Point Spread Function (PSF). In simple terms, it's how a star (a point of light) gets "smeared out" by the atmosphere and telescope optics. The problem? This smearing effect is different for each filter band (g, r, i, etc.). For a neural network to properly compare colors, the images need to have a consistent "blurriness." This means I have to perform PSF matching: taking the sharpest image and deliberately blurring it to match the fuzziest one. It feels completely counterintuitive, and the math is giving me flashbacks to my toughest signal processing classes.
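To convince myself the math works, I wrote a toy version under a big simplifying assumption: treat every PSF as a Gaussian (real pipelines use the measured PSF model, so this is just the intuition, not production code; it also assumes SciPy is available).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def match_psf(img, fwhm_current, fwhm_target):
    """Degrade a sharp image toward a fuzzier target PSF (Gaussian toy model).

    Convolving a Gaussian of width sigma_c with a kernel of width sigma_k
    yields sigma_t with sigma_t^2 = sigma_c^2 + sigma_k^2, so the matching
    kernel needs sigma_k = sqrt(sigma_t^2 - sigma_c^2).
    """
    if fwhm_target < fwhm_current:
        raise ValueError("can only blur toward a wider (fuzzier) PSF")
    to_sigma = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> sigma
    sigma_kernel = np.sqrt(fwhm_target**2 - fwhm_current**2) * to_sigma
    return gaussian_filter(img, sigma=sigma_kernel)
```

The key sanity check: blurring spreads the light around but conserves the total flux, so the peak drops while the sum stays put.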

Diving into Preprocessing

With the core data fetching stable, I'm getting a head start on Phase 2 goals to show for the midterm. This means preprocessing: turning the "raw" data into something a machine learning model can use. I created a new file, preprocessor.py, and started mapping out the steps. It feels like climbing a whole new mountain.

If You Can't Measure It, You Can't Improve It

I've officially implemented performance monitoring. Every major operation—from fetching to caching to quality control—now logs its timing and memory usage. The metrics are beautiful, a little concerning, and beautiful again. I've already spotted a few bottlenecks, with some operations taking over 5 seconds (I'm looking at you, collection fallback logic). Time to optimize!
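The mechanism behind it is nothing fancy; here's a stripped-down sketch of the kind of decorator I wrap around each operation (the `fetch_cutout` stand-in below is hypothetical, and the real version records more than just time and peak memory).

```python
import functools
import logging
import time
import tracemalloc

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ripple.metrics")

def monitored(func):
    """Log wall-clock time and peak Python memory for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            log.info("%s took %.3fs, peak %.1f KiB",
                     func.__name__, elapsed, peak / 1024)
    return wrapper

@monitored
def fetch_cutout(n):
    # hypothetical stand-in for a real fetch operation
    return [b"x" * 1024 for _ in range(n)]
```

Once every operation logs through the same decorator, spotting the 5-second outliers is just a matter of grepping the logs.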

Obsession with Quality Control

It turns out that just getting the data isn't enough—it has to be good data. So this week, I built in a quality validation step. My validate_cutout_quality() function now checks for bad pixels, saturation, cosmic rays, signal variation, and proper dimensions.
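In spirit, the function looks something like this simplified sketch; the thresholds are illustrative placeholders, not the production values, and the real version also runs a cosmic-ray check.

```python
import numpy as np

def validate_cutout_quality(cutout, expected_shape=(64, 64),
                            saturation_level=60000.0, min_std=1e-3):
    """Return (is_good, issues) for a cutout array.

    Checks dimensions, non-finite (bad) pixels, saturation, and whether
    there is any signal variation at all.
    """
    issues = []
    if cutout.shape != expected_shape:
        issues.append("wrong dimensions")
    if not np.isfinite(cutout).all():
        issues.append("bad pixels (NaN/inf)")
    if np.any(cutout >= saturation_level):
        issues.append("saturated pixels")
    if np.nanstd(cutout) < min_std:
        issues.append("no signal variation")
    return (len(issues) == 0), issues
```

Returning the list of issues (instead of a bare boolean) turned out to matter: it lets the pipeline log *why* a cutout was rejected.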

Multi-Band Synchronization

Getting synchronized multi-band cutouts took a full week to solve. The trick, which seems so obvious now, was to calculate the bounding box just once using a reference coordinate system (WCS) and then apply that same box to all the different filter bands (g-band, r-band, etc.). Before this, I was wrestling with cases where the g-band had data but the r-band would throw an exception. The goal was to "handle missing bands gracefully." When I asked for a definition of "gracefully," we settled on a simple one: "don't crash." Mission accomplished.
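Stripped of all the LSST machinery, the fix boils down to something like this toy sketch: build one pixel box in the shared reference frame and reuse it for every band, treating a missing band as data rather than as a crash.

```python
import numpy as np

def synced_cutouts(band_images, center_xy, size):
    """Cut the SAME pixel box from every band.

    band_images: dict like {'g': array, 'r': array}; a band may be None.
    center_xy:   (x, y) pixel center in the shared reference frame.
    """
    x, y = center_xy
    half = size // 2
    # Compute the bounding box ONCE, from the reference frame...
    box = (slice(y - half, y + half), slice(x - half, x + half))
    cutouts = {}
    for band, img in band_images.items():
        if img is None:
            # "handle missing bands gracefully" == don't crash
            cutouts[band] = None
            continue
        # ...then apply that same box to every band.
        cutouts[band] = img[box]
    return cutouts
```

The real version does this with the WCS and sky coordinates instead of raw pixel indices, but the "one box, many bands" principle is the same.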

Batch Processing

ThreadPoolExecutor is my new best friend. I've redesigned the data flow (for the third time), and now the system can process over 100 cutouts per minute! The parallel fetching pipeline works like a charm. 
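Since the fetches are I/O-bound, threads are the right tool. A minimal sketch of the pattern (with a hypothetical `fetch_one` standing in for the real Butler call):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(coord):
    """Hypothetical stand-in for a single I/O-bound cutout fetch."""
    time.sleep(0.01)  # simulate network latency
    return {"coord": coord, "ok": True}

def fetch_batch(coords, max_workers=8):
    """Fetch many cutouts in parallel.

    Completion order is arbitrary, so each result is keyed back to the
    coordinate that produced it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, c): c for c in coords}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

With a simulated 10 ms latency per fetch, 8 workers turn a ~1 s sequential batch of 100 into roughly an eighth of that.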

Cache Me If You Can

The LRU (Least Recently Used) cache is implemented! It's strangely satisfying to watch the cache hit rate climb above 80% on repeated queries for the same tracts and patches. The performance difference is night and day. Since we're dealing with "astronomical" images (pun very much intended), I also added some memory monitoring to make sure the cache doesn't eat all our RAM.
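The core of it fits in a few lines of `OrderedDict`; this sketch tracks hit rate the same way, though the real version also evicts based on memory pressure.

```python
from collections import OrderedDict

class CutoutCache:
    """A minimal LRU cache sketch with hit-rate tracking."""

    def __init__(self, max_items=128):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keyed on (tract, patch, band), repeated queries for the same sky region hit the cache instead of the network, which is where that 80%+ hit rate comes from.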

The subtle art of chugging coffee

When code doesn't work the way you want, you debug. When debugging doesn't work the way you want, you don't sleep. When you don't sleep, you chug coffee; when you chug coffee, you don't sleep. When you don't sleep, you debug code. When you debug code and find a solution, you release dopamine and sleep like a baby (albeit in the afternoon). That's what the whole week has been like.

Down the Error Handling Rabbit Hole

 You know what they say about the best-laid plans? I initially thought I'd just need to handle a few common Butler exceptions. I was wrong. So, so wrong. I've now encountered DataIdValueError, LookupError, connection timeouts, missing datasets, invalid coordinate systems (WCS), and more. I’ve had to build an entire hierarchy of custom exceptions to manage it all. But on the bright side, the code can now fail gracefully instead of just crashing!
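The shape of that hierarchy, sketched with illustrative names (the actual class names in the project differ), looks like this; the point is that one base class lets callers catch everything the pipeline can raise:

```python
class RippleError(Exception):
    """Base class: everything the pipeline raises derives from this."""

class DataAccessError(RippleError):
    """Butler lookups, missing datasets, bad data IDs."""

class CoordinateError(RippleError):
    """Invalid or missing WCS, out-of-range coordinates."""

class NetworkError(DataAccessError):
    """Connection timeouts and transient server failures."""

def fetch_with_fallback(fetch, fallback):
    """Fail gracefully: try the primary fetch, fall back on known errors.

    Unknown exceptions still propagate, so genuine bugs aren't swallowed.
    """
    try:
        return fetch()
    except RippleError:
        return fallback()
```

Catching only the project's own base class is the part that makes the failure "graceful" without hiding real bugs behind a bare `except`.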

The LsstDataFetcher Awakens

Three days into Phase 1, and the LsstDataFetcher class is already becoming a beast. It started as a simple idea: a tool to fetch a small image cutout. Now, it's ballooned to handle bounding box parameters, multi-band image synchronization, data quality checks, and even partial data recovery. This is quickly evolving from a simple wrapper into a full-fledged data access layer. My coffee consumption has officially doubled.

Phase 0 Complete! (Finally)

Success! All seven of my comprehensive environment tests are passing.  The foundation for the project feels rock-solid. I've confirmed that PyTorch gets along with the LSST stack (which feels like a small miracle) and even managed to get GPU support working with our Quadro P600. Feeling really good about where things stand. With the setup phase complete, tomorrow I start on Phase 1: implementing the actual data access tools. I'm nervous but incredibly excited!

My Brain Hurts

Okay, so astronomers working with LSST don't just use Right Ascension and Declination like everyone else. They have a whole sky-mapping system built on tracts and patches. The best way I can describe it is to imagine the night sky is a giant quilt. A tract is a big square piece of that quilt, and a patch is a smaller square within it. I spent days just trying to figure out how to convert standard coordinates into this system. The CoordinateConverter class is finally starting to make sense, but it's been a deep dive into concepts like SpherePoint, degrees, and the difference between PARENT and LOCAL origins. Astronomy coordinate systems are wild!

The Great LSST Stack Update

This week's adventure? Environment management. I discovered we were running an older version of the LSST Stack (v28.0.1) while the latest was v29.1.1. The update took two full days, mostly because of breaking changes between the versions. Somehow, I now have three different LSST stacks on my machine (don't ask why). The good news is that VS Code is playing nicely with all of them, and lsst-activate has become my favorite new command. It’s clear that managing the environment is half the battle in this project.

My First Foray into the Butler Repository

My first week of the coding period is in the books! I started with what I thought would be a simple task: creating a Butler repository from some demo data. I quickly learned that "simple" and "LSST" rarely belong in the same sentence.  The demo data has its own peculiar structure, and I had to wrap my head around instrument registration and setting up collections. My terminal is now a historical record of my many failed butler create and butler register-instrument attempts. But, after a lot of trial and error, I finally have a working repository! In a project this massive, these small wins feel absolutely huge.

Damn, couldn't believe I got in again.

I was glued to Reddit and the website for four hours. The results came out later than the scheduled 18:00 UTC, and I didn't get the email until four hours past the scheduled time. None of my friends got in, so I was kind of anticipating a rejection too, but to my surprise I got in. I'm really excited to go through this journey again. It's going to be amazing.

Proposal Sent! The Waiting Game Begins...

Drafted, revised, got feedback (mentally, from myself mostly, haha), polished, and finally hit submit on the proposal! It feels pretty comprehensive: I outlined the pipeline modules, the timeline, potential hurdles... Pouring all that learning from the courses and the tests into it felt good. Now... we wait. Fingers, toes, everything crossed! Whatever happens, I've learned a ton already.

Proposal Mode: Activated

 Okay, tests done, now the real writing begins. The proposal. Need to show I understand LSST, DeepLense, the project goals, AND have a solid plan. Drafting sections on objectives, methodology, the timeline... This is where the deep dive into the LSST pipeline docs and the project description really matters. 

Tests Submitted!

 Phew! That was a rush. Spent the last couple of days tuning the models, generating plots (ROC curves!), writing up the notebooks for Task 1 and Task 2. Calculating metrics, explaining the strategy... it's a mini-project in itself. Uploaded everything to GitHub, sent the email with the links, CV, weights. Fingers crossed the results look okay! Glad I got the ML/DL courses done beforehand.

Task 2: Lens Finding

 Finished up Test I (mostly). Now onto Specific Test II - Lens Finding. Downloaded the data... wow, way more non-lenses than lenses. This class imbalance is serious! Accuracy won't mean much here. Need to focus on ROC AUC and maybe precision/recall for the lens class. Sticking with ConvNeXt V2 but need to add class weighting to the loss function. This feels more like a real-world astro problem.
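The weighting itself is simple inverse-frequency balancing; here's a small sketch of how I'd compute it (numpy only, so it's easy to check), with the result then handed to PyTorch as `torch.nn.CrossEntropyLoss(weight=torch.tensor(w, dtype=torch.float32))`.

```python
import numpy as np

def class_weights(labels, n_classes=2):
    """Inverse-frequency ("balanced") class weights.

    With, say, 9x more non-lenses than lenses, the rare lens class gets
    a proportionally larger weight, so the loss stops rewarding models
    that just predict "non-lens" everywhere.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)
```

For a 90/10 split this gives roughly (0.56, 5.0): misclassifying a lens costs about nine times as much as misclassifying a non-lens.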

ConvNeXt V2 Seems to Work! (Task 1)

 After some experimentation, fine-tuning ConvNeXt V2 Tiny seems to be the way to go for Test I. Setting up the custom dataset loader, handling the 1-channel to 3-channel conversion, applying ImageNet normalization... It's training! Watching the validation AUC climb is oddly satisfying. Need to calculate the ROC/AUC properly for the report.
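The channel trick is just replication plus the standard ImageNet statistics; a minimal numpy sketch of that preprocessing step (it assumes the input has already been scaled to [0, 1]):

```python
import numpy as np

# ImageNet statistics used by pretrained ConvNeXt / ResNet backbones
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_three_channel(img):
    """Replicate a single-channel (H, W) image into (3, H, W) and apply
    per-channel ImageNet normalization, as a pretrained backbone expects."""
    stacked = np.repeat(img[np.newaxis, :, :], 3, axis=0)  # (3, H, W)
    return (stacked - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]
```

Replicating the grayscale image into all three channels lets the pretrained first-layer filters see sensible inputs without retraining them from scratch.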

Test Time! Task 1

 Okay, time to tackle the GSoC tests. Downloaded the "Common Test I" dataset. Three classes: no substructure, sphere, vortex. Looks like a classic image classification problem. The Deep Learning courses are paying off! Decided to go with PyTorch and try fine-tuning a pre-trained model. Maybe a ResNet? Or something newer... ConvNeXt V2 looks promising. Let the coding commence!

The LSST Butler

Finally making some progress interacting with a mock LSST repository using the Butler. My first butler.get('calexp', ...) call works! Retrieving calibrated exposures, checking metadata... It's complex, abstracting away the file system, but I can see how powerful it is for managing petabytes of data. Still feels a bit like black magic though.

CNNs - Seeing the Light (or Lens?)

Reached the Convolutional Neural Networks (CNNs) course in the Deep Learning Specialization. YES! This feels directly relevant. Processing images... detecting features... This is exactly what lens finding and classification models do. Suddenly the DeepLense project tasks make a lot more sense. Getting excited about actually applying this. 

Deep Dive into Deep Learning (Literally)

While wrestling with the LSST setup, also started Andrew Ng's Deep Learning Specialization. Course 1: Neural Networks and Deep Learning. Feels like a good refresher and goes deeper than the ML one. Vectorization, activation functions... Good stuff. Need this foundation solid if I'm going to build a pipeline for deep learning models. 

Environment Setup Shenanigans

 Alright, time to get the LSST stack running locally. Following the lsstinstall guide... dependencies... conda conflicts... Docker maybe? Spent a good chunk of the day just getting the basic LSST commands to run without throwing errors. Small victory, but a victory nonetheless! One step at a time.

Understanding LSST & DeepLense

 Trying to wrap my head around this project. LSST Science Pipelines, the Butler API... it's a whole ecosystem. And DeepLense – classifying dark matter substructure? Super-resolution? Feeling a mix of "this is awesome" and "where do I even start?". Time to hit the docs. Hard.

Projects are LIVE! DeepLense & LSST

Okay, GSoC orgs and projects announced! Scrolled through... ML4Sci... DeepLense... "Data Processing Pipeline for the LSST". Huh. Rubin Observatory, massive data, deep learning for lens finding... This sounds intense. And exactly the kind of astro+coding challenge I was looking for. Reading the description now...

Neural Nets & TensorFlow – Entering the Matrix?

Moved on to the next part of the ML Specialization: Advanced Algorithms. Neural Networks! And TensorFlow! This feels like the real deep stuff. Multi-class classification... My laptop is getting a workout just running the examples.

Supervised Learning... Supervised Me!

Making decent progress on the ML course. Linear Regression, Logistic Regression... check! It's clicking. Starting to imagine how this could apply to sorting astronomical data. Still feels very general, but useful nonetheless. 

Back to School!

Decided if I'm serious about GSoC '25, especially in the ML/Astro space, I need to brush up hard. Diving into Andrew Ng's Machine Learning Specialization on Coursera. Regression, classification... feels like building foundational bricks. Gotta get these concepts down pat before the projects even drop. 

GSoC again?

I'm thinking of applying to GSoC again. It was such an amazing learning experience last time, and I feel there is still a lot more to learn.