
RIPPLe: Building a Bridge Between LSST and DeepLense

It's hard to believe seven weeks have flown by. In that time, I've consumed countless cups of green tea and developed a single obsession: getting petabytes of astronomical data ready for deep learning. When I started this Google Summer of Code project, the mission seemed straightforward enough: build a pipeline to feed data from the Legacy Survey of Space and Time (LSST) into machine learning models for the DeepLense project. The Vera C. Rubin Observatory, which will conduct the LSST, is a firehose of cosmic data, set to produce 20 terabytes every single night. Buried in that data, we expect to find around 100,000 new gravitational lenses, a massive jump from the few hundred we know of today. Each one is a cosmic magnifying glass that can help us understand the mysteries of dark matter. But first, you have to find them. That's where my project, RIPPLe, comes in.

Phase 0: The Foundation

I look back at who I was in February, happily working my way through An...

The Normalization Nightmare

How hard can it be to scale pixel values to a nice range for a neural network, like 0 to 1? Turns out, it's incredibly hard when dealing with astronomical data. Min-Max Scaling? A single hot pixel or cosmic ray outlier will completely wreck the scale for the entire image. Z-score Standardization? This works better but can result in negative values, which some neural network architectures don't like. Asinh Stretch? This is what astronomers use to visualize images, as it handles the huge dynamic range. It's fantastic for making pretty pictures, but trying to explain it to PyTorch is another story.
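The compromise I've been converging on is to combine the robust and the pretty approaches: clip outliers with percentiles first, then apply an asinh stretch so the result lands in [0, 1]. Here's a minimal sketch of that idea; the percentile bounds and softening factor are illustrative placeholders, not tuned values.

```python
import numpy as np

def asinh_normalize(img, clip_percentiles=(1.0, 99.0), softening=10.0):
    """Normalize an astronomical image to [0, 1] with an asinh stretch.

    Percentile clipping tames the hot pixels / cosmic rays that wreck a
    plain min-max scale; asinh then compresses the huge dynamic range.
    """
    lo, hi = np.percentile(img, clip_percentiles)
    clipped = np.clip(img, lo, hi)
    scaled = (clipped - lo) / max(hi - lo, 1e-12)  # robust min-max to [0, 1]
    stretched = np.arcsinh(softening * scaled) / np.arcsinh(softening)
    return stretched.astype(np.float32)
```

A single 1e6-valued hot pixel barely moves the output this way, whereas naive min-max would squash every real source to near zero.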

Welcome to PSF Hell

Today I learned about the Point Spread Function (PSF). In simple terms, it's how a star (a point of light) gets "smeared out" by the atmosphere and telescope optics. The problem? This smearing effect is different for each filter band (g, r, i, etc.). For a neural network to properly compare colors, the images need to have a consistent "blurriness." This means I have to perform PSF matching: taking the sharpest image and deliberately blurring it to match the fuzziest one. It feels completely counterintuitive, and the math is giving me flashbacks to my toughest signal processing classes.
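To convince myself the math works, I wrote a toy version under a big simplifying assumption: treat every PSF as a Gaussian (real pipelines use the measured PSF model, so this is just the intuition, not production code; it also assumes SciPy is available).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def match_psf(img, fwhm_current, fwhm_target):
    """Degrade a sharp image toward a fuzzier target PSF (Gaussian toy model).

    Convolving a Gaussian of width sigma_c with a kernel of width sigma_k
    yields sigma_t with sigma_t^2 = sigma_c^2 + sigma_k^2, so the matching
    kernel needs sigma_k = sqrt(sigma_t^2 - sigma_c^2).
    """
    if fwhm_target < fwhm_current:
        raise ValueError("can only blur toward a wider (fuzzier) PSF")
    to_sigma = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))  # FWHM -> sigma
    sigma_kernel = np.sqrt(fwhm_target**2 - fwhm_current**2) * to_sigma
    return gaussian_filter(img, sigma=sigma_kernel)
```

The key sanity check: blurring spreads the light around but conserves the total flux, so the peak drops while the sum stays put.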

Diving into Preprocessing

With the core data fetching stable, I'm getting a head start on Phase 2 goals to show for the midterm. This means preprocessing: turning the "raw" data into something a machine learning model can use. I created a new file, preprocessor.py, and started mapping out the steps. It feels like climbing a whole new mountain.

If You Can't Measure It, You Can't Improve It

I've officially implemented performance monitoring. Every major operation—from fetching to caching to quality control—now logs its timing and memory usage. The metrics are beautiful, a little concerning, and beautiful again. I've already spotted a few bottlenecks, with some operations taking over 5 seconds (I'm looking at you, collection fallback logic). Time to optimize!
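The mechanism behind it is nothing fancy; here's a stripped-down sketch of the kind of decorator I wrap around each operation (the `fetch_cutout` stand-in below is hypothetical, and the real version records more than just time and peak memory).

```python
import functools
import logging
import time
import tracemalloc

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ripple.metrics")

def monitored(func):
    """Log wall-clock time and peak Python memory for each call."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        tracemalloc.start()
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed = time.perf_counter() - start
            _, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            log.info("%s took %.3fs, peak %.1f KiB",
                     func.__name__, elapsed, peak / 1024)
    return wrapper

@monitored
def fetch_cutout(n):
    # hypothetical stand-in for a real fetch operation
    return [b"x" * 1024 for _ in range(n)]
```

Once every operation logs through the same decorator, spotting the 5-second outliers is just a matter of grepping the logs.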

Obsession with Quality Control

It turns out that just getting the data isn't enough—it has to be good data. So this week, I built in a quality validation step. My validate_cutout_quality() function now checks for bad pixels, saturation, cosmic rays, signal variation, and proper dimensions.
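In spirit, the function looks something like this simplified sketch; the thresholds are illustrative placeholders, not the production values, and the real version also runs a cosmic-ray check.

```python
import numpy as np

def validate_cutout_quality(cutout, expected_shape=(64, 64),
                            saturation_level=60000.0, min_std=1e-3):
    """Return (is_good, issues) for a cutout array.

    Checks dimensions, non-finite (bad) pixels, saturation, and whether
    there is any signal variation at all.
    """
    issues = []
    if cutout.shape != expected_shape:
        issues.append("wrong dimensions")
    if not np.isfinite(cutout).all():
        issues.append("bad pixels (NaN/inf)")
    if np.any(cutout >= saturation_level):
        issues.append("saturated pixels")
    if np.nanstd(cutout) < min_std:
        issues.append("no signal variation")
    return (len(issues) == 0), issues
```

Returning the list of issues (instead of a bare boolean) turned out to matter: it lets the pipeline log *why* a cutout was rejected.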

Multi-Band Synchronization

Getting synchronized multi-band cutouts took a full week to solve. The trick, which seems so obvious now, was to calculate the bounding box just once using a reference coordinate system (WCS) and then apply that same box to all the different filter bands (g-band, r-band, etc.). Before this, I was wrestling with cases where the g-band had data but the r-band would throw an exception. The goal was to "handle missing bands gracefully." When I asked for a definition of "gracefully," we settled on a simple one: "don't crash." Mission accomplished.
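Stripped of all the LSST machinery, the fix boils down to something like this toy sketch: build one pixel box in the shared reference frame and reuse it for every band, treating a missing band as data rather than as a crash.

```python
import numpy as np

def synced_cutouts(band_images, center_xy, size):
    """Cut the SAME pixel box from every band.

    band_images: dict like {'g': array, 'r': array}; a band may be None.
    center_xy:   (x, y) pixel center in the shared reference frame.
    """
    x, y = center_xy
    half = size // 2
    # Compute the bounding box ONCE, from the reference frame...
    box = (slice(y - half, y + half), slice(x - half, x + half))
    cutouts = {}
    for band, img in band_images.items():
        if img is None:
            # "handle missing bands gracefully" == don't crash
            cutouts[band] = None
            continue
        # ...then apply that same box to every band.
        cutouts[band] = img[box]
    return cutouts
```

The real version does this with the WCS and sky coordinates instead of raw pixel indices, but the "one box, many bands" principle is the same.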

Batch Processing

ThreadPoolExecutor is my new best friend. I've redesigned the data flow (for the third time), and now the system can process over 100 cutouts per minute! The parallel fetching pipeline works like a charm. 
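Since the fetches are I/O-bound, threads are the right tool. A minimal sketch of the pattern (with a hypothetical `fetch_one` standing in for the real Butler call):

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_one(coord):
    """Hypothetical stand-in for a single I/O-bound cutout fetch."""
    time.sleep(0.01)  # simulate network latency
    return {"coord": coord, "ok": True}

def fetch_batch(coords, max_workers=8):
    """Fetch many cutouts in parallel.

    Completion order is arbitrary, so each result is keyed back to the
    coordinate that produced it.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, c): c for c in coords}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results
```

With a simulated 10 ms latency per fetch, 8 workers turn a ~1 s sequential batch of 100 into roughly an eighth of that.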

Cache Me If You Can

The LRU (Least Recently Used) cache is implemented! It's strangely satisfying to watch the cache hit rate climb above 80% on repeated queries for the same tracts and patches. The performance difference is night and day. Since we're dealing with "astronomical" images (pun very much intended), I also added some memory monitoring to make sure the cache doesn't eat all our RAM.
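The core of it fits in a few lines of `OrderedDict`; this sketch tracks hit rate the same way, though the real version also evicts based on memory pressure.

```python
from collections import OrderedDict

class CutoutCache:
    """A minimal LRU cache sketch with hit-rate tracking."""

    def __init__(self, max_items=128):
        self.max_items = max_items
        self._store = OrderedDict()
        self.hits = self.misses = 0

    def get(self, key):
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._store[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self._store[key] = value
        self._store.move_to_end(key)
        if len(self._store) > self.max_items:
            self._store.popitem(last=False)  # evict least recently used

    @property
    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keyed on (tract, patch, band), repeated queries for the same sky region hit the cache instead of the network, which is where that 80%+ hit rate comes from.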

The subtle art of chugging coffee

When code doesn't work the way you want, you debug. When debugging doesn't work the way you want, you don't sleep. When you don't sleep, you chug coffee; when you chug coffee, you don't sleep. When you don't sleep, you debug code. When you debug code and find a solution, you release dopamine and sleep like a baby (albeit in the afternoon). That's what the whole week has been like.

Down the Error Handling Rabbit Hole

 You know what they say about the best-laid plans? I initially thought I'd just need to handle a few common Butler exceptions. I was wrong. So, so wrong. I've now encountered DataIdValueError, LookupError, connection timeouts, missing datasets, invalid coordinate systems (WCS), and more. I’ve had to build an entire hierarchy of custom exceptions to manage it all. But on the bright side, the code can now fail gracefully instead of just crashing!
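The shape of that hierarchy, sketched with illustrative names (the actual class names in the project differ), looks like this; the point is that one base class lets callers catch everything the pipeline can raise:

```python
class RippleError(Exception):
    """Base class: everything the pipeline raises derives from this."""

class DataAccessError(RippleError):
    """Butler lookups, missing datasets, bad data IDs."""

class CoordinateError(RippleError):
    """Invalid or missing WCS, out-of-range coordinates."""

class NetworkError(DataAccessError):
    """Connection timeouts and transient server failures."""

def fetch_with_fallback(fetch, fallback):
    """Fail gracefully: try the primary fetch, fall back on known errors.

    Unknown exceptions still propagate, so genuine bugs aren't swallowed.
    """
    try:
        return fetch()
    except RippleError:
        return fallback()
```

Catching only the project's own base class is the part that makes the failure "graceful" without hiding real bugs behind a bare `except`.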

The LsstDataFetcher Awakens

Three days into Phase 1, and the LsstDataFetcher class is already becoming a beast. It started as a simple idea: a tool to fetch a small image cutout. Now, it's ballooned to handle bounding box parameters, multi-band image synchronization, data quality checks, and even partial data recovery. This is quickly evolving from a simple wrapper into a full-fledged data access layer. My coffee consumption has officially doubled.

Phase 0 Complete! (Finally)

Success! All seven of my comprehensive environment tests are passing.  The foundation for the project feels rock-solid. I've confirmed that PyTorch gets along with the LSST stack (which feels like a small miracle) and even managed to get GPU support working with our Quadro P600. Feeling really good about where things stand. With the setup phase complete, tomorrow I start on Phase 1: implementing the actual data access tools. I'm nervous but incredibly excited!

My Brain Hurts

Okay, so astronomers working with LSST don't just use Right Ascension and Declination like everyone else. They have a whole sky-mapping system built on tracts and patches. The best way I can describe it is to imagine the night sky is a giant quilt. A tract is a big square piece of that quilt, and a patch is a smaller square within it. I spent days just trying to figure out how to convert standard coordinates into this system. The CoordinateConverter class is finally starting to make sense, but it's been a deep dive into concepts like SpherePoint, degrees, and the difference between PARENT and LOCAL origins. Astronomy coordinate systems are wild!

The Great LSST Stack Update

This week's adventure? Environment management. I discovered we were running an older version of the LSST Stack (v28.0.1) while the latest was v29.1.1. The update took two full days, mostly because of breaking changes between the versions. Somehow, I now have three different LSST stacks on my machine (don't ask why). The good news is that VS Code is playing nicely with all of them, and lsst-activate has become my favorite new command. It’s clear that managing the environment is half the battle in this project.

My First Foray into the Butler Repository

My first week of the coding period is in the books! I started with what I thought would be a simple task: creating a Butler repository from some demo data. I quickly learned that "simple" and "LSST" rarely belong in the same sentence.  The demo data has its own peculiar structure, and I had to wrap my head around instrument registration and setting up collections. My terminal is now a historical record of my many failed butler create and butler register-instrument attempts. But, after a lot of trial and error, I finally have a working repository! In a project this massive, these small wins feel absolutely huge.

Damn, couldn't believe I got in again.

I was glued to Reddit and the website for four hours. The results came out later than the scheduled 18:00 UTC, and I didn't get the email until four hours past the scheduled time. None of my friends got in, so I was kind of anticipating a rejection too, but to my surprise I got in. I'm really excited to go through this journey again. It's going to be amazing.

Proposal Sent! The Waiting Game Begins...

Drafted, revised, got feedback (mentally, from myself mostly, haha), polished, and finally hit submit on the proposal! It feels pretty comprehensive: I outlined the pipeline modules, the timeline, potential hurdles... Pouring all that learning from the courses and the tests into it felt good. Now... we wait. Fingers, toes, everything crossed! Whatever happens, I've learned a ton already.

Proposal Mode: Activated

 Okay, tests done, now the real writing begins. The proposal. Need to show I understand LSST, DeepLense, the project goals, AND have a solid plan. Drafting sections on objectives, methodology, the timeline... This is where the deep dive into the LSST pipeline docs and the project description really matters. 

Tests Submitted!

 Phew! That was a rush. Spent the last couple of days tuning the models, generating plots (ROC curves!), writing up the notebooks for Task 1 and Task 2. Calculating metrics, explaining the strategy... it's a mini-project in itself. Uploaded everything to GitHub, sent the email with the links, CV, weights. Fingers crossed the results look okay! Glad I got the ML/DL courses done beforehand.

Task 2: Lens Finding

 Finished up Test I (mostly). Now onto Specific Test II - Lens Finding. Downloaded the data... wow, way more non-lenses than lenses. This class imbalance is serious! Accuracy won't mean much here. Need to focus on ROC AUC and maybe precision/recall for the lens class. Sticking with ConvNeXt V2 but need to add class weighting to the loss function. This feels more like a real-world astro problem.
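The weighting itself is simple inverse-frequency balancing; here's a small sketch of how I'd compute it (numpy only, so it's easy to check), with the result then handed to PyTorch as `torch.nn.CrossEntropyLoss(weight=torch.tensor(w, dtype=torch.float32))`.

```python
import numpy as np

def class_weights(labels, n_classes=2):
    """Inverse-frequency ("balanced") class weights.

    With, say, 9x more non-lenses than lenses, the rare lens class gets
    a proportionally larger weight, so the loss stops rewarding models
    that just predict "non-lens" everywhere.
    """
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)
```

For a 90/10 split this gives roughly (0.56, 5.0): misclassifying a lens costs about nine times as much as misclassifying a non-lens.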

ConvNeXt V2 Seems to Work! (Task 1)

 After some experimentation, fine-tuning ConvNeXt V2 Tiny seems to be the way to go for Test I. Setting up the custom dataset loader, handling the 1-channel to 3-channel conversion, applying ImageNet normalization... It's training! Watching the validation AUC climb is oddly satisfying. Need to calculate the ROC/AUC properly for the report.
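The channel trick is just replication plus the standard ImageNet statistics; a minimal numpy sketch of that preprocessing step (it assumes the input has already been scaled to [0, 1]):

```python
import numpy as np

# ImageNet statistics used by pretrained ConvNeXt / ResNet backbones
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def to_three_channel(img):
    """Replicate a single-channel (H, W) image into (3, H, W) and apply
    per-channel ImageNet normalization, as a pretrained backbone expects."""
    stacked = np.repeat(img[np.newaxis, :, :], 3, axis=0)  # (3, H, W)
    return (stacked - IMAGENET_MEAN[:, None, None]) / IMAGENET_STD[:, None, None]
```

Replicating the grayscale image into all three channels lets the pretrained first-layer filters see sensible inputs without retraining them from scratch.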

Test Time! Task 1

 Okay, time to tackle the GSoC tests. Downloaded the "Common Test I" dataset. Three classes: no substructure, sphere, vortex. Looks like a classic image classification problem. The Deep Learning courses are paying off! Decided to go with PyTorch and try fine-tuning a pre-trained model. Maybe a ResNet? Or something newer... ConvNeXt V2 looks promising. Let the coding commence!

The LSST Butler

Finally making some progress interacting with a mock LSST repository using the Butler. My first butler.get('calexp', ...) call works! Retrieving calibrated exposures, checking metadata... It's complex, abstracting away the file system, but I can see how powerful it is for managing petabytes of data. Still feels a bit like black magic though.

CNNs - Seeing the Light (or Lens?)

Reached the Convolutional Neural Networks (CNNs) course in the Deep Learning Specialization. YES! This feels directly relevant. Processing images... detecting features... This is exactly what lens finding and classification models do. Suddenly the DeepLense project tasks make a lot more sense. Getting excited about actually applying this. 

Deep Dive into Deep Learning (Literally)

While wrestling with the LSST setup, also started Andrew Ng's Deep Learning Specialization. Course 1: Neural Networks and Deep Learning. Feels like a good refresher and goes deeper than the ML one. Vectorization, activation functions... Good stuff. Need this foundation solid if I'm going to build a pipeline for deep learning models. 

Environment Setup Shenanigans

 Alright, time to get the LSST stack running locally. Following the lsstinstall guide... dependencies... conda conflicts... Docker maybe? Spent a good chunk of the day just getting the basic LSST commands to run without throwing errors. Small victory, but a victory nonetheless! One step at a time.

Understanding LSST & DeepLense

 Trying to wrap my head around this project. LSST Science Pipelines, the Butler API... it's a whole ecosystem. And DeepLense – classifying dark matter substructure? Super-resolution? Feeling a mix of "this is awesome" and "where do I even start?". Time to hit the docs. Hard.

Projects are LIVE! DeepLense & LSST

Okay, GSoC orgs and projects announced! Scrolled through... ML4Sci... DeepLense... "Data Processing Pipeline for the LSST". Huh. Rubin Observatory, massive data, deep learning for lens finding... This sounds intense. And exactly the kind of astro+coding challenge I was looking for. Reading the description now...

Neural Nets & TensorFlow – Entering the Matrix?

Moved on to the next part of the ML Specialization: Advanced Algorithms. Neural Networks! And TensorFlow! This feels like the real deep stuff. Multi-class classification... My laptop is getting a workout just running the examples.

Supervised Learning... Supervised Me!

Making decent progress on the ML course. Linear Regression, Logistic Regression... check! It's clicking. Starting to imagine how this could apply to sorting astronomical data. Still feels very general, but useful nonetheless. 

Back to School!

Decided if I'm serious about GSoC '25, especially in the ML/Astro space, I need to brush up hard. Diving into Andrew Ng's Machine Learning Specialization on Coursera. Regression, classification... feels like building foundational bricks. Gotta get these concepts down pat before the projects even drop. 

GSoC again?

I'm thinking of applying to GSoC again. It was such an amazing learning experience last time, and I feel there is still a lot more to learn.