Wasco County Asbestos Scraper

Python Pandas Playwright

About This Project

When 3,500 Addresses Taught Me Why Geocoding Is Harder Than It Looks

The Problem Nobody Thinks About Until They Need To

Here’s a question that seems deceptively simple: How do you find every building in a county that might contain asbestos?

The answer is straightforward enough: scrape the county assessor’s website, filter for properties built before 1978, when the EPA broadened its ban on spray-applied asbestos materials, and boom, you’re done. Except you’re not done at all. Because a list of 3,500 street addresses sitting in a CSV file is about as useful as a phone book without phone numbers. You need those addresses on a map, and that’s when things get interesting.

Spoiler alert: I started with a 26% success rate and ended up at 95.7%. The journey between those two numbers taught me more about persistence in data work than any tutorial ever could.

When Free Geocoding APIs Let You Down (Gently)

I built the scraper first, and that part was almost fun. Playwright automated the browser, clicked through thousands of property records, and extracted everything: parcel IDs, addresses, construction years. Clean data, well-structured, ready to visualize. I felt like a data engineering genius for approximately 45 minutes.

Then came the geocoding. “No problem,” I thought, “I’ll just use Nominatim.” For the uninitiated, Nominatim is OpenStreetMap’s free geocoding service, and it works great… for major cities with well-documented addresses. Wasco County, Oregon? Not so much.

The first test run was humbling. Of my 3,526 properties, Nominatim successfully geocoded about 900. That’s 26%. Twenty-six percent! I had addresses like “15 Oak Street” being placed in completely different states, rural routes ending up in the Pacific Ocean, and PO boxes somehow geocoding to random coordinates in Idaho. The system was trying its best, but rural Oregon addresses were just outside its comfort zone.

I could have stopped there, called it “good enough,” and moved on. But watching 74% of my data fail felt like leaving a puzzle three-quarters unsolved. Besides, what’s the point of building an asbestos risk dashboard if three-quarters of the properties don’t show up on the map?

Building a Geocoding Frankenstein (The Good Kind)

This is where the project shifted from “quick weekend scraper” to “okay, now I’m genuinely curious how to solve this.” I started researching every free geocoding API I could find, and that’s when I discovered the US Census Bureau’s geocoding service. It’s less well known than Google or Nominatim, but it’s designed specifically for US addresses.

The Census geocoder was night and day better. On the first test, my success rate jumped from 26% to nearly 92%. Suddenly, rural routes were resolving correctly, small-town addresses were landing on actual buildings, and the map started looking legitimate. But I still had 8% of addresses failing, and I’d already invested too much time to accept 92% when 95%+ felt achievable.
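For readers who haven’t used it, here is a rough sketch of what a single Census lookup can look like. It is not the project’s actual code: the function names and the sample address are hypothetical, and the endpoint, the `Public_AR_Current` benchmark, and the response shape are my reading of the public Census geocoder API. The sketch only builds the request URL and parses a response, so it leaves the HTTP call to whatever client you prefer.

```python
from urllib.parse import urlencode

# Public one-line-address endpoint of the Census geocoder (assumed from its API docs).
CENSUS_URL = "https://geocoding.geo.census.gov/geocoder/locations/onelineaddress"


def census_request_url(address, benchmark="Public_AR_Current"):
    """Build the lookup URL for a single one-line address."""
    query = urlencode({"address": address, "benchmark": benchmark, "format": "json"})
    return f"{CENSUS_URL}?{query}"


def parse_census_response(payload):
    """Return (lat, lon) for the first match, or None when nothing matched."""
    matches = payload.get("result", {}).get("addressMatches", [])
    if not matches:
        return None
    coords = matches[0]["coordinates"]
    # The Census API reports x = longitude and y = latitude.
    return coords["y"], coords["x"]
```

Keeping URL building and parsing separate from the network call makes both halves easy to test against saved responses before spending hours on a full run.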

So I built what I now lovingly call my “hybrid geocoding Frankenstein.” The system tries four geocoding attempts in sequence, then falls back to a last resort if all of them fail:

  1. Primary attack: Hit the Census geocoder with the full address including ZIP code
  2. First fallback: If that fails, try Census without ZIP (sometimes ZIP codes cause false negatives)
  3. Second fallback: Try Nominatim with full address (catches what Census missed)
  4. Third fallback: Nominatim without ZIP code
  5. Last resort: If an address has failed every API, place a marker at the city center with a flag that says “approximate location only”

The logic was simple: each geocoding service has different strengths and different datasets. The Census excels at residential addresses but sometimes chokes on commercial properties. Nominatim understands intersections and landmarks better. By cascading through multiple services, I could catch edge cases that would slip through any single approach.
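The cascade above can be sketched in a few lines. This is a minimal illustration under assumptions, not the original implementation: `census` and `nominatim` are hypothetical callables that take an address string and return `(lat, lon)` or `None`, and `city_centers` is a hypothetical lookup of city name to coordinates.

```python
def geocode_with_fallbacks(street, city, state, zip_code,
                           census, nominatim, city_centers):
    """Try each strategy in order and report which one produced the result."""
    with_zip = f"{street}, {city}, {state} {zip_code}"
    without_zip = f"{street}, {city}, {state}"

    # Order mirrors the list above: Census first, then Nominatim,
    # each tried with and without the ZIP code.
    attempts = [
        ("census", census, with_zip),
        ("census_no_zip", census, without_zip),
        ("nominatim", nominatim, with_zip),
        ("nominatim_no_zip", nominatim, without_zip),
    ]
    for label, geocoder, query in attempts:
        coords = geocoder(query)
        if coords is not None:
            return {"coords": coords, "source": label, "approximate": False}

    # Last resort: city center, flagged so the map can say "approximate".
    center = city_centers.get(city)
    if center is not None:
        return {"coords": center, "source": "city_center", "approximate": True}
    return {"coords": None, "source": None, "approximate": False}
```

Recording the `source` label alongside the coordinates makes it possible to see afterwards how many addresses each strategy rescued.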

Watching Progress Bars Like They’re Sporting Events

Here’s something nobody tells you about geocoding 3,500 addresses: it takes forever. The free Nominatim API enforces a one-request-per-second rate limit, which is totally reasonable and respectful, but it means you’re looking at nearly an hour just for that service alone. Add in Census queries and fallback attempts, and suddenly you’re staring at a 2.5-hour processing window.
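To make the arithmetic concrete, here is one way to enforce a one-request-per-second limit and estimate the floor on runtime. It is a sketch under assumptions: the `Throttle` class and `minimum_runtime_minutes` are hypothetical names, and the clock and sleep functions are injectable only so the behavior can be checked without actually waiting.

```python
import time


class Throttle:
    """Block so successive calls are at least `min_interval` seconds apart."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, if any

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def minimum_runtime_minutes(n_requests, min_interval=1.0):
    """Lower bound on runtime if every request waits the full interval."""
    return n_requests * min_interval / 60
```

Calling `throttle.wait()` right before each Nominatim request keeps the script within the limit, and `minimum_runtime_minutes(3526)` comes out just under 59 minutes, which matches the “nearly an hour” figure.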

I built in progress updates every 50 addresses because watching a script run for two and a half hours without feedback feels like sending a probe to Mars and hoping it lands. The terminal would spit out messages like “Processed 250/3526 addresses, 228 successful (91.2%)” and I’d find myself doing mental math on the remaining time, calculating whether I could grab coffee without missing the moment when the success rate ticked past 95%.
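A progress reporter like that takes only a few lines. The sketch below is illustrative: `progress_line` and `run_with_progress` are hypothetical names, and `geocode` stands in for any function that returns coordinates or `None`.

```python
def progress_line(processed, total, successful):
    """Format a status message in the style quoted above."""
    rate = 100 * successful / processed if processed else 0.0
    return (f"Processed {processed}/{total} addresses, "
            f"{successful} successful ({rate:.1f}%)")


def run_with_progress(addresses, geocode, every=50):
    """Geocode each address, printing a status line every `every` items."""
    results = []
    successful = 0
    total = len(addresses)
    for i, address in enumerate(addresses, start=1):
        coords = geocode(address)
        results.append(coords)
        if coords is not None:
            successful += 1
        # Report on each interval boundary and once more at the very end.
        if i % every == 0 or i == total:
            print(progress_line(i, total, successful))
    return results
```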

The really nerdy satisfaction came from watching the success rate climb as the hybrid approach caught addresses that initially failed. An address would fail Census, fail the Census-without-ZIP attempt, and then suddenly Nominatim would come through with perfect coordinates. Or vice versa. It felt like watching a relay team where each runner had different specialties.

When the final run completed with 3,375 successful geocodes out of 3,526 (95.7%), I honestly did a small fist pump. Those remaining 151 failures? Mostly placeholder addresses like “NO SITUS ADDRESS” or parcels with incomplete data in the county records. Not much any geocoding service could do with those.
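Placeholders like these can be screened out before any API call, which saves rate-limited requests. This is a minimal sketch, assuming the only known marker is the “NO SITUS” text mentioned above; the `PLACEHOLDER_MARKERS` tuple and `is_geocodable` are hypothetical names, and a real county dataset would likely need more markers.

```python
# Markers that indicate a placeholder rather than a real address.
# Only "NO SITUS" comes from the data described above; extend as needed.
PLACEHOLDER_MARKERS = ("NO SITUS",)


def is_geocodable(address):
    """Return False for empty values and known placeholder addresses."""
    if address is None:
        return False
    # Collapse whitespace and normalize case before matching.
    cleaned = " ".join(str(address).split()).upper()
    if not cleaned:
        return False
    return not any(marker in cleaned for marker in PLACEHOLDER_MARKERS)
```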

From Data Points to Actual Insight

Once I had coordinates for 95% of my properties, building the Streamlit dashboard felt almost anticlimactic. Folium handles the mapping, Pandas wrangles the data, and Streamlit ties it together with a clean interface. But watching it come alive was genuinely satisfying.

The color-coding system emerged naturally: red markers for properties built before 1970 (highest asbestos risk), orange for 1970-1977 (medium risk), green for 1978 and later (post-ban, low risk). Toggle on the heat map and suddenly you can see geographic clusters: neighborhoods where entire blocks were built in the 1950s and ’60s, rural areas with scattered high-risk properties, downtown commercial districts lit up in red.
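The tier boundaries translate directly into a small function. The sketch below is illustrative only; `risk_tier` is a hypothetical name, and the returned color strings are simply the ones named above, which could be handed to whatever marker styling the map library expects.

```python
def risk_tier(year_built):
    """Map a construction year to a (risk level, marker color) pair."""
    if year_built < 1970:
        return ("high", "red")
    if year_built < 1978:
        return ("medium", "orange")
    return ("low", "green")
```

Note the boundaries: 1970 through 1977 inclusive is the medium tier, and 1978 itself already counts as low.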

The filtering capabilities transformed the tool from “pretty map” to “actually useful.” Click to show only high-risk properties and you’re looking at a targeted list for inspection prioritization. Drag the year slider to show only 1940s-era buildings and suddenly you’re building a historical survey. Export to CSV and you’ve got a spreadsheet ready for whatever analysis comes next.
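Stripped of the dashboard widgets, the year slider and the export boil down to a range filter and a CSV writer. The sketch below uses only the standard library as a stand-in for the Pandas and Streamlit code, and the `parcel_id` and `year_built` field names are assumptions, not the project’s actual column names.

```python
import csv
import io


def filter_by_year(rows, year_min, year_max):
    """Keep rows whose construction year falls in [year_min, year_max]."""
    return [r for r in rows if year_min <= r["year_built"] <= year_max]


def to_csv(rows, fieldnames):
    """Serialize rows (a list of dicts) to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `filter_by_year(rows, 1940, 1949)` gives the 1940s-era subset, and passing that result to `to_csv` produces text ready to hand to a download button or write to disk.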

What I didn’t expect was how much fun the heat map would be. There’s something weirdly hypnotic about watching density patterns emerge: you can see where Wasco County grew over decades, identify boom periods by construction clusters, and spot outliers where single old buildings sit isolated in otherwise newer developments.

The Unglamorous Truth About Data Projects

Building this taught me something important about real-world data work: the actual scraping and dashboard code probably represent 30% of the effort. The other 70% was solving the geocoding problem, testing edge cases, handling failures gracefully, and making sure the system degrades sensibly when things go wrong.

Nobody sees the hours spent figuring out why certain addresses geocode to the wrong state. Nobody cares about the careful rate limiting that keeps you from getting banned by free APIs. Nobody notices the fallback logic that quietly catches failures and tries alternative strategies. But without those unglamorous pieces, you’d have a broken dashboard showing 26% of the data in the wrong locations.

There’s a particular satisfaction in building systems that just work, quietly handling complexity so the end user sees something simple and reliable. When someone clicks on a property marker and sees accurate information about a building that might contain asbestos, they’re not thinking about the multi-API geocoding strategy or the two-and-a-half-hour processing window. They’re just getting the answer they need, which is exactly the point.

Why I’ll Never Look at Addresses the Same Way

This project fundamentally changed how I think about location data. Every address is a small puzzle: some are trivial to solve, others require multiple attempts and creative strategies. Rural addresses are harder than urban ones. Older properties often have messier records. ZIP codes can help or hurt depending on how the geocoding service interprets them.

I’ve also developed a weird appreciation for the US Census Bureau’s geocoding team. Their service isn’t as well-known as commercial offerings, but it’s solving genuinely hard problems with US address standardization, and it does it well. Combining it with community-driven services like Nominatim creates something more robust than either could achieve alone.

The dashboard itself has become more useful than I initially planned. What started as “let me visualize asbestos risk” turned into a tool for understanding county-wide construction patterns, identifying geographic clusters for public health interventions, and making publicly available data actually accessible to people who need it. Sometimes the best projects are the ones that grow beyond their original scope because you keep asking “what if I could also…”

And honestly, there’s something deeply satisfying about taking messy, scattered public records and transforming them into an interactive map where anyone can explore and filter and export exactly what they need. It’s not flashy AI or cutting-edge algorithms. It’s just persistent problem-solving and a willingness to try four different approaches when the first one doesn’t work.

Which, when I think about it, might be the most valuable skill in data engineering: knowing when to stop accepting “good enough” and start figuring out how to get to “actually good.”