Big data vs. bad data

My first blog post. Written from experience, with the aim of building a community around the way I see big data.

This is the first thing I’ve ever written here. I’m not a tool reviewer and I’m not selling anything — I’ve spent years building large-scale big data solutions from scratch, mostly without leaning on open-source tools. Those tools accelerate your time to go-live, sure. But they’re hard to maintain, and that maintenance bill comes due later. So I want to start this community from the perspective of the problems, not the tooling.

Let me start with the thing I believe most strongly.

The real enemy is bad data, not big data

First, what even is “big data”? Anything that involves large volumes of data. That’s it. And here’s my honest take after years of doing this: the challenge a seasoned data engineer actually faces in warehousing is bad data, far more than it’s a big-data problem.

That sounds backwards, so let me be precise. Big data engineers need to understand that they’re there to cater to scale — to warehouse everything and build a platform that data analysts and ML engineers can make sense of. I’m not saying you should ignore data quality; warehousing everything reliably is the job. But the pain, the late nights, the silent dashboard breakages — those almost always trace back to data that’s dirty, late, duplicated, or wrongly shaped. Not to volume.

It gets even more challenging when people lean on NoSQL, or stuff JSON / blob attributes into SQL columns. Those databases and datatypes are fine, as long as they’re used properly. When they’re used in places where no ORM/ODM schema is enforced, nothing stops the shape of the data from drifting. That’s a code smell upstream, and it lands on your plate.
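One way to keep that from leaking downstream is to check the blob at ingestion and quarantine whatever doesn't match. Here's a minimal sketch in Python, not a definitive implementation. The `EXPECTED_ATTRS` contract, its keys, and the function names are all hypothetical; your real contract comes from whoever owns the source.

```python
import json

# Hypothetical contract for a JSON "attributes" column: key -> allowed type(s).
EXPECTED_ATTRS = {"color": str, "weight_kg": (int, float), "tags": list}


def validate_attrs(raw):
    """Return (parsed_dict_or_None, list_of_problems) for one JSON blob."""
    try:
        attrs = json.loads(raw)
    except (TypeError, ValueError) as exc:
        return None, [f"not valid JSON: {exc}"]
    if not isinstance(attrs, dict):
        return None, ["top-level value is not an object"]

    problems = []
    for key, expected in EXPECTED_ATTRS.items():
        if key not in attrs:
            problems.append(f"missing key: {key}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(attrs[key], bool) or not isinstance(attrs[key], expected):
            problems.append(f"wrong type for {key}: {type(attrs[key]).__name__}")
    for key in sorted(attrs.keys() - EXPECTED_ATTRS.keys()):
        problems.append(f"unexpected key: {key}")
    return attrs, problems


def split_rows(raw_rows):
    """Route clean rows to the load and everything else to quarantine."""
    clean, quarantined = [], []
    for raw in raw_rows:
        attrs, problems = validate_attrs(raw)
        if problems:
            quarantined.append((raw, problems))
        else:
            clean.append(attrs)
    return clean, quarantined
```

The point of the quarantine list is that nothing gets dropped: bad rows are still warehoused, just kept away from the tables analysts read.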

Here’s the upside, though: as a data engineer, you always hold the power of transformation.

You can always transform your way out — carefully

Say a field needs to be deprecated from the source DB, and going forward it’ll be inferred from some other column instead. You don’t have to do a big-bang migration and risk breaking every dashboard. You handle it in a transformation script, and you route the change through a versioning strategy:

1. Keep both versions of the data alive: the old column and the inferred one.
2. Re-point the dashboard to the new version.
3. Then kill the column from the warehouse, now that nothing reads it.
4. And only after that, kill it from the source DB. That’s the safe last step.
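To make the "keep both versions alive" step concrete, here's a rough sketch of what the transformation script could look like. Everything in it is an assumption for illustration: the deprecated field `is_premium`, the `plan` column it gets inferred from, and the rule inside `infer_is_premium` are hypothetical.

```python
def infer_is_premium(plan):
    """v2 rule (hypothetical): derive the flag from the plan column."""
    return plan in {"pro", "enterprise"}


def transform(row):
    """Emit both versions side by side so dashboards can migrate one at a time."""
    return {
        "order_id": row["order_id"],
        "plan": row["plan"],
        # v1: the deprecated source field; None once the source stops sending it.
        "is_premium_v1": row.get("is_premium"),
        # v2: inferred, does not depend on the deprecated field at all.
        "is_premium_v2": infer_is_premium(row["plan"]),
    }


def mismatches(rows):
    """Return order_ids where v1 is still present and disagrees with v2."""
    out = []
    for row in map(transform, rows):
        v1, v2 = row["is_premium_v1"], row["is_premium_v2"]
        if v1 is not None and v1 != v2:
            out.append(row["order_id"])
    return out
```

Running something like `mismatches` while both versions are live gives you a list of rows to explain before you re-point any dashboard. If that list isn't empty, the inference rule isn't ready yet.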

That’s controllable — if you have a data engineering team that owns the code and the custom requirements, with features curated for your needs. This is also exactly where Data SaaS lets you down. It either gets very expensive, or it lacks the features you actually want. Usually both, eventually.

Data sourcing is the part you cannot mess up

If I had to point at the single most technical, most sophistication-hungry part of this job, it’s data sourcing. This is where you cannot afford to be careless.

Sourcing has to be done with 100% reliability, especially for data that cannot be reproduced. Think about stateful data: you’re not just replicating rows, you’re capturing every state transition of that data. Lose one, and you can’t reconstruct it later. There’s no re-run that saves you. So this is the zone where reliability gets engineered first, before anything else.

And the way you get reliability here is boring and underrated: you ask questions. Even basic ones.

Ask the dumb questions first

Before I write a line of ingestion code, I walk down a chain like this:

What kind of data source is it? Is it a DB? What kind of DB?
What kind of ingestion does it need? Do we do log-based ingestion — reading the WAL / oplog / binlog — or can we get away with something less sophisticated, like querying on a schedule?
Are deletes possible at the source? This one decides the whole approach.

That delete question matters more than people expect. If rows can be deleted, a scheduled query never sees them go, so only the log will tell you. If data is never deleted from the source DB, we don’t need to reach for log-based capture; a scheduled watermark read is fine. But then the watermark column has to be updated on every write, and that’s on the developer. Right there you’ve created a human dependency: if a developer runs a manual migration and forgets to bump the watermark column, you silently lose data. That’s a bad-data failure dressed up as an ingestion success.

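SELECT *
FROM orders
WHERE updated_at > :last_watermark   -- bump :last_watermark after every successful run
ORDER BY updated_at
LIMIT 10000;
-- Caveat: if the LIMIT cuts through rows sharing one updated_at, the rest get skipped.
-- A unique tie-breaker column in the ORDER BY and the WHERE avoids that.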
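Here's a small, self-contained sketch of that failure mode, using an in-memory SQLite database so it runs anywhere. The table layout, the `read_batch` helper, and the timestamps are assumptions for illustration. It's also a slight variation on the query above: the watermark is an `(updated_at, id)` pair, so rows that share a timestamp at the edge of a batch aren't skipped.

```python
import sqlite3


def read_batch(conn, watermark, limit=10000):
    """Fetch rows changed after `watermark`, an (updated_at, id) pair."""
    last_ts, last_id = watermark
    rows = conn.execute(
        """
        SELECT id, status, updated_at
        FROM orders
        WHERE updated_at > ? OR (updated_at = ? AND id > ?)
        ORDER BY updated_at, id
        LIMIT ?
        """,
        (last_ts, last_ts, last_id, limit),
    ).fetchall()
    # Advance the watermark only from rows actually read. A real job would
    # persist it only after the batch has been loaded successfully.
    if rows:
        watermark = (rows[-1][2], rows[-1][0])
    return rows, watermark


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "new", "2024-01-01T10:00:00"), (2, "new", "2024-01-01T10:05:00")],
)

first, wm = read_batch(conn, ("", 0))  # initial run picks up both rows

# A well-behaved write bumps updated_at, so the next run sees it.
conn.execute("UPDATE orders SET status = 'paid', updated_at = '2024-01-01T11:00:00' WHERE id = 1")
# A manual fix that forgets updated_at is invisible to the next run.
conn.execute("UPDATE orders SET status = 'cancelled' WHERE id = 2")

second, wm = read_batch(conn, wm)  # only order 1 comes back
```

The second run finishes without a single error, and the warehouse still says order 2 is "new". Nothing in the ingestion logs will ever tell you otherwise.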

Are deletes possible at the source?

No: scheduled watermark read. Simpler, but the dev must bump the watermark, or you silently lose data.

Yes: log-based CDC. Read the WAL / oplog / binlog, which captures every state transition.
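To show why the log matters once deletes and state transitions are in play, here's a toy replay of a change log. The event shape (`lsn`, `op`, `key`, `row`) is made up for this sketch and is not the format of any real CDC tool. The idea is only that every change arrives as its own ordered event.

```python
def apply_changes(events):
    """Replay an ordered change log into current state plus a full history."""
    current = {}   # primary key -> latest row image
    history = []   # every transition, in log order, never overwritten
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            current[key] = event["row"]
        elif op == "delete":
            current.pop(key, None)
        else:
            raise ValueError(f"unknown op: {op}")
        history.append((event["lsn"], op, key, event.get("row")))
    return current, history


# Hypothetical log for one order that is created, paid, refunded, then deleted.
events = [
    {"lsn": 1, "op": "insert", "key": 7, "row": {"status": "new"}},
    {"lsn": 2, "op": "update", "key": 7, "row": {"status": "paid"}},
    {"lsn": 3, "op": "update", "key": 7, "row": {"status": "refunded"}},
    {"lsn": 4, "op": "delete", "key": 7},
]
current, history = apply_changes(events)
```

After the replay, `current` is empty because the row is gone, but `history` still holds all four transitions. A scheduled query against the source would have seen, at best, whichever state happened to be there when it ran, and then nothing at all.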

Then the questions stop being about data and start being about plumbing:

Where does this DB physically live? It may sit in another VPC, or another cloud provider entirely.
So how do you even reach it? VPC peering? Private Service Connect if you’re on the same cloud provider? A bastion host over SSH?
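Whichever route you pick, it's worth proving the path works before any ingestion code runs. A minimal preflight sketch, assuming plain TCP reachability is what you want to check; the function name is hypothetical, and the host and port would be whatever your peering, Private Service Connect endpoint, or SSH tunnel exposes.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS failures alike.
        return False
```

This only tells you that the port answers, not that credentials or replication permissions are right. It's still the cheapest question to ask first.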

Networking and bandwidth aren’t someone else’s problem here — they’re warehousing concerns. How data physically moves is part of the reliability story, not a footnote to it.

Keep asking questions. That’s not a junior habit you grow out of; it’s the habit that keeps you correct.

The job, honestly

People think this role is “backend engineering, but for analytics.” It’s more than that, and a lot less glamorous. Being a data engineer means you’re essentially an expert backend engineer, with the added leverage of being a data expert. You carry both. And a lot of the work isn’t the engineering itself; it’s the unglamorous groundwork at the start. Framing the problem. Mapping the source. Tracing where state can be lost. Wiring the access. The “engineering” is the small part at the end.

So if you take one thing from this first post: stop optimizing for volume before you’ve earned the right to. Get the sourcing 100% reliable, ask the basic questions out loud, and treat correctness as the headline problem. Big data will always be there to scale into. Bad data is the thing that quietly burns you.

More soon. This is the perspective I want to build this community around — the problems, and what they actually cost to solve.