Showing posts from July, 2025

RIPPLe: Building a Bridge Between LSST and DeepLense

It's hard to believe seven weeks have flown by. In that time, I've consumed countless cups of green tea and developed a single obsession: getting petabytes of astronomical data ready for deep learning. When I started this Google Summer of Code project, the mission seemed straightforward enough: I was tasked with building a pipeline to feed data from the Legacy Survey of Space and Time (LSST) into machine learning models for the DeepLense project. The Vera C. Rubin Observatory, which will conduct the LSST, is a firehose of cosmic data, set to produce 20 terabytes every single night. Buried in that data, we expect to find around 100,000 new gravitational lenses—a massive jump from the few hundred we know of today. Each one is a cosmic magnifying glass that can help us understand the mysteries of dark matter. But first, you have to find them. That’s where my project, RIPPLe, comes in.

Phase 0: The Foundation

I look back at who I was in February, happily working my way through An...

The Normalization Nightmare

How hard can it be to scale pixel values to a nice range for a neural network, like 0 to 1? Turns out, it's incredibly hard when dealing with astronomical data. Min-Max Scaling? A single hot pixel or cosmic ray outlier will completely wreck the scale for the entire image. Z-score Standardization? This works better but can result in negative values, which some neural network architectures don't like. Asinh Stretch? This is what astronomers use to visualize images, as it handles the huge dynamic range. It's fantastic for making pretty pictures, but trying to explain it to PyTorch is another story.
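To make the comparison concrete, here is a tiny NumPy sketch of the three approaches on a fake cutout with a single hot pixel. The function names, the synthetic data, and the asinh scale factor are all made up for illustration; they are not what RIPPLe actually uses.

```python
import numpy as np

def min_max_scale(img):
    # A single hot pixel inflates the denominator and squashes
    # every other pixel toward zero.
    return (img - img.min()) / (img.max() - img.min())

def z_score(img):
    # Centered on zero, so roughly half the pixels come out negative.
    return (img - img.mean()) / img.std()

def asinh_stretch(img, scale=0.1):
    # The classic astronomy stretch: roughly linear near zero,
    # logarithmic for bright pixels, so the dynamic range is compressed.
    return np.arcsinh(img / scale)

# Synthetic cutout: faint noise plus one "hot pixel" outlier.
rng = np.random.default_rng(0)
img = rng.normal(0.0, 1.0, (64, 64))
img[10, 10] = 5000.0  # cosmic ray / hot pixel

print("min-max 99th pct:", np.percentile(min_max_scale(img), 99))  # nearly everything ~0
print("z-score minimum :", z_score(img).min())                     # negative values appear
print("asinh maximum   :", asinh_stretch(img).max())               # outlier tamed to ~11
```

The asinh version is the only one where the outlier no longer dominates, which is exactly why astronomers like it—and why I'm trying to work out how to bake it into the training pipeline.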

Welcome to PSF Hell

Today I learned about the Point Spread Function (PSF). In simple terms, it's how a star (a point of light) gets "smeared out" by the atmosphere and telescope optics. The problem? This smearing effect is different for each filter band (g, r, i, etc.). For a neural network to properly compare colors, the images need to have a consistent "blurriness." This means I have to perform PSF matching: taking the sharpest image and deliberately blurring it to match the fuzziest one. It feels completely counterintuitive, and the math is giving me flashbacks to my toughest signal processing classes.
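As a sanity check for myself, here is a toy version of PSF matching under the simplifying assumption that both PSFs are Gaussian, in which case the matching kernel is just another Gaussian whose width is the difference in quadrature. The real pipeline's kernels are more sophisticated; the match_psf helper and the sigma values below are only illustrative.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def match_psf(image, sigma_image, sigma_target):
    """Degrade a sharp image to a broader (fuzzier) target PSF.

    Toy model: both PSFs are assumed Gaussian, so the matching kernel
    is itself a Gaussian with width sqrt(sigma_target^2 - sigma_image^2).
    """
    if sigma_target < sigma_image:
        raise ValueError("Can only blur down to a worse PSF, not sharpen.")
    sigma_kernel = np.sqrt(sigma_target**2 - sigma_image**2)
    return gaussian_filter(image, sigma_kernel)

# Example: bring a sharp g-band cutout down to the fuzzier i-band seeing.
g_band = np.random.default_rng(1).normal(size=(64, 64))
g_matched = match_psf(g_band, sigma_image=1.2, sigma_target=2.0)
```

The counterintuitive part is right there in the code: the "fix" is to throw away resolution on purpose so that every band is equally blurry.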

Diving into Preprocessing

With the core data fetching stable, I'm getting a head start on the Phase 2 goals so I have something to show for the midterm evaluation. This means preprocessing: turning the "raw" data into something a machine learning model can use. I created a new file, preprocessor.py, and started mapping out the steps. It feels like climbing a whole new mountain.
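For the curious, something like the skeleton below is the rough shape I have in mind. The class name, defaults, and method bodies are hypothetical placeholders for this post, not the actual contents of preprocessor.py.

```python
import numpy as np

class CutoutPreprocessor:
    """Hypothetical sketch of the preprocessing steps, in order."""

    def __init__(self, asinh_scale=0.1):
        self.asinh_scale = asinh_scale

    def __call__(self, cutout: np.ndarray) -> np.ndarray:
        cutout = self.replace_bad_pixels(cutout)
        cutout = self.match_psf(cutout)
        cutout = self.normalize(cutout)
        return cutout.astype(np.float32)

    def replace_bad_pixels(self, cutout):
        # Placeholder: clip extreme outliers to the 99.9th percentile.
        hi = np.percentile(cutout, 99.9)
        return np.clip(cutout, None, hi)

    def match_psf(self, cutout):
        # PSF matching will live here once I survive the math.
        return cutout

    def normalize(self, cutout):
        # Asinh stretch as a stand-in normalization choice.
        return np.arcsinh(cutout / self.asinh_scale)
```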

If You Can't Measure It, You Can't Improve It

I've officially implemented performance monitoring. Every major operation—from fetching to caching to quality control—now logs its timing and memory usage. The metrics are beautiful, a little concerning, and beautiful again. I've already spotted a few bottlenecks, with some operations taking over 5 seconds (I'm looking at you, collection fallback logic). Time to optimize!
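If you want to roll your own version of this, a small context manager built on the standard library's time and tracemalloc modules gets you most of the way. This is a simplified sketch of the idea, not RIPPLe's actual monitoring code, and the operation name in the usage line is just an example.

```python
import time
import tracemalloc
import logging
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("metrics")

@contextmanager
def monitor(operation: str):
    """Log wall-clock time and peak traced memory for a block of code."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        log.info("%s took %.2f s, peak memory %.1f MB",
                 operation, elapsed, peak / 1e6)

# Usage: wrap any major operation.
with monitor("collection fallback"):
    sum(range(10_000_000))  # stand-in for a slow fetch
```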

Obsession with Quality Control

It turns out that just getting the data isn't enough—it has to be good data. So this week, I built in a quality validation step. My validate_cutout_quality() function now checks for bad pixels, saturation, cosmic rays, signal variation, and proper dimensions.
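To give a flavor of what those checks look like, here is a sketch of a validate_cutout_quality()-style function. The thresholds, the expected shape, and the crude MAD-based cosmic-ray flag are stand-ins I picked for illustration, not the project's real criteria.

```python
import numpy as np

def validate_cutout_quality(cutout, expected_shape=(64, 64),
                            saturation_level=65000.0,
                            max_bad_pixel_frac=0.05,
                            min_signal_std=1e-3):
    """Illustrative per-cutout quality checks; returns named boolean flags."""
    checks = {}

    # Proper dimensions.
    checks["shape_ok"] = cutout.shape == expected_shape

    # Bad pixels: NaNs and infinities.
    bad_frac = np.mean(~np.isfinite(cutout))
    checks["bad_pixels_ok"] = bad_frac <= max_bad_pixel_frac

    finite = cutout[np.isfinite(cutout)]
    if finite.size == 0:
        checks["saturation_ok"] = checks["signal_ok"] = checks["cosmic_ray_ok"] = False
    else:
        # Saturation: any pixel at or above a full-well proxy.
        checks["saturation_ok"] = finite.max() < saturation_level

        # Signal variation: a flat cutout is probably empty or masked.
        checks["signal_ok"] = finite.std() > min_signal_std

        # Crude cosmic-ray flag: pixels wildly above the robust scatter.
        median = np.median(finite)
        mad = np.median(np.abs(finite - median)) + 1e-12
        checks["cosmic_ray_ok"] = (np.abs(finite - median) / mad).max() < 50

    checks["passed"] = all(checks.values())
    return checks
```

Returning named flags instead of a single pass/fail boolean has already paid off: it tells me *why* a cutout was rejected, which makes debugging the upstream fetch far less painful.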