There's this moment every developer experiences... Production is down. Users are complaining. Your carefully crafted test suite with 95% coverage is sitting there, smugly green in CI/CD, while your application burns.
I used to be a test evangelist. TDD was my religion. I'd spend hours writing unit tests, integration tests, end-to-end tests. My coverage reports looked beautiful. Then production taught me a harsh lesson: tests tell you what you thought could go wrong. Logs tell you what actually went wrong.
The Test-First Mindset (And Why I Held It)
For the first three years of my dev journey, I believed in a simple formula: More tests = fewer bugs. Write the test first, watch it fail, make it pass, refactor. The TDD mantra. And honestly, it worked pretty well in development.
My test suites caught type errors, prevented regressions, and made refactoring safer. When I changed something, the tests would scream if I broke something else. It felt like having a safety net. The dopamine hit of seeing all tests pass was real.
I remember arguing with a friend (also a dev) who insisted on adding detailed logging to every major function. "Why?" I asked. "If the tests pass, the code works. If something breaks, the tests will tell us where." He just smiled and said, "Wait until you have to debug something that only fails in production with real user data."
I thought he was stuck in the past. Logs were for people who didn't write good tests.
What Logs Tell You That Tests Can't
Here's what I finally understood: tests validate your assumptions. Logs record reality.
Tests run in controlled environments. You mock the database to return exactly what you expect. You stub the API to throw exactly the exceptions you anticipate. But production is chaos. Real users do things you never imagined. External services behave in ways their documentation doesn't mention. Network conditions you've never seen in testing cause timing issues that only manifest under load.
According to Henrik Warne's analysis, problems in production often aren't even bugs. Sometimes the software works exactly as coded, but the result isn't what someone expects. Maybe a configuration is wrong. Maybe an external system changed its behavior. Maybe a user discovered a workflow that makes no sense but is technically valid.
Logs give you the breadcrumbs. The actual state of variables at runtime. The sequence of events that led to the problem. The context that tests can't capture because tests don't run against real data, real timing, real concurrency.
Tests Still Matter (I'm Not a Monster)
To be clear, I haven't abandoned testing. I still write tests. But my perspective has shifted. Tests are great for:
- Catching regressions during development
- Validating business logic
- Ensuring code behavior matches expectations
- Making refactoring safer
- Serving as documentation
But tests are terrible at:
- Debugging production issues
- Revealing unexpected system states
- Showing actual data that triggered a bug
- Capturing timing and concurrency issues
- Demonstrating real-world integration failures
According to the BrowserStack testing guide, testing focuses on prevention while debugging focuses on problem-solving. Testing says "does this work?" Logging says "what happened?"
The Better Stack logging practices guide points out that most production environments default to INFO level logging to avoid noise, but often lack enough detail to troubleshoot issues when they arise. That's the balance we need to find.
How to Do Logging Right
After getting burned enough times, here's what I've learned about effective logging:
Log the Right Things
Don't log everything. Don't log nothing. Log the things that tell a story:
- Entry and exit of major functions (with parameters and results)
- External API calls (request, response, timing)
- Database queries that modify data
- Authentication and authorization decisions
- State transitions (order created > payment processed > shipped)
- Configuration values at startup
- Resource usage at critical points
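In practice, instrumenting a single function along those lines looks something like this. This is a minimal sketch, assuming a generic logger object like the ones used throughout this post and a hypothetical ordersApi HTTP client:

async function fetchUserOrders(userId) {
  // Entry: record the parameters we were called with
  logger.debug('fetchUserOrders called', { userId });
  const start = Date.now();
  try {
    // External API call: capture target, result size, and timing
    const response = await ordersApi.get(`/users/${userId}/orders`);
    logger.info('Orders API call succeeded', {
      userId,
      orderCount: response.data.length,
      durationMs: Date.now() - start
    });
    return response.data;
  } catch (error) {
    // Failure path: keep the same identifiers so the two outcomes correlate
    logger.error('Orders API call failed', {
      userId,
      durationMs: Date.now() - start,
      error: error.message
    });
    throw error;
  }
}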
Use Structured Logging
Stop doing this:
console.log("User " + userId + " purchased " + itemName + " for $" + price);
Start doing this:
logger.info('Purchase completed', {
  userId: userId,
  itemId: itemId,
  itemName: itemName,
  price: price,
  currency: 'USD',
  timestamp: Date.now()
});
When you need to find all purchases by a specific user or above a certain price, structured logs are searchable. String concatenation is not.
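In Datadog, for instance, a query like @userId:USR-789 or @price:>100 surfaces the matching events in seconds. Good luck grepping concatenated strings for the same answer.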
Get Your Log Levels Right
I follow this pattern based on AWS logging guidelines:
- DEBUG: Variable values, detailed execution flow (development only)
- INFO: Significant business events, user actions, system milestones
- WARN: Unexpected but handled situations, performance issues, deprecated usage
- ERROR: Operation failures that don't crash the app
- CRITICAL: Failures that affect the entire system
Production should run at INFO by default. When something breaks, you temporarily bump to DEBUG for the affected service, investigate, then dial it back down.
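Here's a minimal sketch of wiring that up, assuming winston (exact level names vary by library; winston, for instance, has no CRITICAL out of the box):

const winston = require('winston');

// Default to INFO; LOG_LEVEL lets you override per service or environment
const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.json(),
  transports: [new winston.transports.Console()]
});

logger.debug('Filtered out while the level is info');
logger.info('Service started', { port: 3000 });

// During an incident, bump verbosity at runtime...
logger.level = 'debug';
logger.debug('Now visible while you investigate');

// ...then dial it back down
logger.level = 'info';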
Include Context, Always
A log line like "Payment failed" is useless. This is useful:
logger.error('Payment processing failed', {
  orderId: 'ORD-12345',
  userId: 'USR-789',
  amount: 49.99,
  paymentMethod: 'credit_card',
  gateway: 'stripe',
  gatewayResponse: response.body,
  attemptCount: 3,
  error: error.message,
  stackTrace: error.stack
});
Now you can actually investigate. You know which order, which user, what amount, which payment gateway, what the gateway said, and how many times it was tried.
Don't Log Sensitive Data
This should be obvious, but I've seen it happen: never log passwords, credit card numbers, API keys, session tokens, or personally identifiable information without proper hashing or redaction.
If you must log something sensitive for debugging, hash it consistently so you can track it across logs without exposing the actual value:
const crypto = require('crypto');

logger.info('User authentication attempt', {
  userId: userId,
  // Truncated hash: enough to correlate across log lines, not to recover the value
  passwordHash: crypto.createHash('sha256').update(password).digest('hex').substring(0, 8),
  success: true
});
Make Logs Searchable and Aggregated
Local log files are fine for development. In production, use a log aggregation service (I use Datadog, but ELK Stack, Splunk, or Papertrail work too). Being able to search across all service instances and correlate logs by request ID is essential.
Every request should get a unique trace ID that flows through all services:
// API Gateway
const traceId = generateTraceId();
logger.info('Request received', { traceId, path: req.path });
// Pass traceId to downstream services
// All subsequent logs include it
logger.info('Database query', { traceId, query: 'SELECT...' });
logger.info('Cache hit', { traceId, key: 'user_123' });
logger.info('Response sent', { traceId, status: 200 });
Now you can filter your logs by that trace ID and see the entire request lifecycle.
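Here's a sketch of how I'd wire that up at the gateway, assuming Express and Node 18+ (for the built-in fetch and crypto.randomUUID); the user-service URL is a stand-in, and logger is the winston-style logger from earlier:

const express = require('express');
const crypto = require('crypto');

const app = express();

// Reuse an upstream trace ID if one exists, otherwise mint a new one,
// so a single ID follows the request across every service
app.use((req, res, next) => {
  req.traceId = req.headers['x-trace-id'] || crypto.randomUUID();
  res.setHeader('x-trace-id', req.traceId);
  logger.info('Request received', { traceId: req.traceId, path: req.path });
  next();
});

app.get('/users/:id', async (req, res) => {
  // Forward the trace ID so downstream services log it too
  const response = await fetch(`http://user-service/users/${req.params.id}`, {
    headers: { 'x-trace-id': req.traceId }
  });
  const user = await response.json();
  logger.info('Response sent', { traceId: req.traceId, status: 200 });
  res.json(user);
});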
The Balance I've Found
These days, I write tests for logic and logs for runtime. Tests validate my understanding of how code should work. Logs validate how it actually works.
When I write a new feature:
- Write unit tests for the business logic
- Add integration tests for critical paths
- Instrument the code with comprehensive logging
- Deploy to staging and verify logs tell a coherent story
- Monitor production logs for the first 48 hours after deployment
When production breaks:
- Check recent deployments
- Search logs for errors and warnings
- Find the trace ID of a failing request
- Follow the log trail to identify where things went wrong
- Reproduce locally if possible, add a test for the fix
- Deploy the fix
- Verify in logs that the issue is resolved
Tests catch bugs in development. Logs catch bugs in production. You need both, but if I had to choose which one saves me more time and stress when things go wrong... it's logs. Every single time.
Production Is the Real Test
Here's the thing: your test suite, no matter how comprehensive, is a simulation. Production is reality. Tests run with mocked data, controlled timing, and predictable conditions. Production has real users doing weird things, external services having bad days, network glitches, configuration drift, and all the chaos that comes with actual usage.
The Rookout engineering blog notes that many teams avoid collecting DEBUG and TRACE logs in production because they're expensive in storage and performance. But when you need them, you REALLY need them. They recommend dynamic log verbosity - being able to crank up logging temporarily when investigating an issue, then dial it back down.
That's the approach I've adopted. Run lean in production normally, but have the ability to increase logging verbosity for specific users, specific endpoints, or specific time windows when something smells off.
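A sketch of what that can look like, assuming the Express app and winston-style logger from the earlier snippets, sitting behind an authenticated admin route:

// Assumes app.use(express.json()) and proper auth middleware are in place
let revertTimer = null;

app.post('/admin/log-level', (req, res) => {
  const { level = 'debug', minutes = 15 } = req.body;
  logger.level = level;
  logger.warn('Log level changed', { level, minutes });

  // Auto-revert so nobody forgets to dial it back down
  clearTimeout(revertTimer);
  revertTimer = setTimeout(() => {
    logger.level = 'info';
    logger.warn('Log level reverted to info');
  }, minutes * 60 * 1000);

  res.json({ level, revertsInMinutes: minutes });
});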
In My Opinion
I'm not telling you to stop writing tests. Write tests. Write good tests. But also write good logs.
Think of it this way: tests are your seatbelt. Logs are your security camera. The seatbelt helps prevent damage. The camera shows you exactly what happened when something still goes wrong.
And when you're debugging a production issue at 3 AM, you'll be grateful for every log line that helps you understand what went wrong. Trust me on that one.
So these days, before I call something “done,” I ask myself one question: if this breaks at 3 AM, will the logs tell me why? If the answer is no, I'm not done yet.
Future me will thank present me. Past me wishes he'd learned this lesson sooner.