We ❤️ Open Source
A community education resource
Building resilient applications: 6 best practices in software architecture
Designing for failure: Essential strategies for creating robust, fault-tolerant systems.

In today’s world, apps don’t just need to work—they need to keep working, even when things break. That’s where resilience in software architecture comes in. In my experience, resilience is often misunderstood as just “adding retries” or “putting stuff in queues.” But it’s much deeper than that. It’s about designing systems that expect failure—and recover gracefully.
Best practices in software architecture
Here are a few best practices I follow when aiming for resilience:
1) Design for failure, not perfection
Assume that downstream services will fail, and plan what happens next. For example, if the payment gateway is down, do not bring the entire system down with it. Let the user save their cart, or mark the order as payment pending and show a message such as “The payment service is currently down, but your order is saved. We will notify you once the service is up.” Alternatively, offer other payment methods.
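The payment example above can be sketched roughly as follows. This is a minimal illustration, not a real gateway integration: `Order`, `PaymentGatewayError`, `place_order`, and the `charge` callable are all hypothetical names introduced here.

```python
from dataclasses import dataclass


class PaymentGatewayError(Exception):
    """Raised when the (hypothetical) payment gateway cannot be reached."""


@dataclass
class Order:
    order_id: str
    amount_cents: int
    status: str = "new"


def place_order(order, charge):
    """Try to charge; if the gateway is down, keep the order instead of failing."""
    try:
        charge(order)  # 'charge' stands in for the real gateway call
    except PaymentGatewayError:
        # The failure was planned for: save the order and tell the user.
        order.status = "payment_pending"
        return ("The payment service is currently down, but your order is saved. "
                "We will notify you once the service is up.")
    order.status = "paid"
    return "Payment received. Thank you!"
```

The point of the sketch is that the gateway outage is handled as an expected branch of the checkout flow, so the rest of the request still completes.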
2) Use bulkheads and circuit breakers
Isolate failures so they don’t cascade across the system. Bulkheads separate parts of your system so that if one breaks, it doesn’t drag the others down with it. For example, separate your user authentication service from your product catalog service. If auth fails, the catalog should still be accessible.
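One common way to build a bulkhead inside a single process is to give each dependency its own limited pool of concurrent slots. The sketch below assumes that approach; the `Bulkhead` class, the slot counts, and the two instances are illustrative, not taken from any particular library.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    """Caps concurrent calls to one dependency so it cannot starve the others."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing, so callers of a slow dependency
        # do not pile up and exhaust shared threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments: a saturated auth service leaves the catalog untouched.
auth_bulkhead = Bulkhead("auth", max_concurrent=1)
catalog_bulkhead = Bulkhead("catalog", max_concurrent=10)
```

Because each compartment has its own semaphore, exhausting the auth slots only rejects auth calls; catalog calls continue to be served.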
Circuit breakers monitor communication between services. For example, if a downstream service fails or responds too slowly, the circuit breaker trips and temporarily halts further requests to that service. It then periodically checks whether the service has recovered before allowing traffic again. This prevents overwhelming a failing service with more requests and avoids cascading failures that can crash the whole system.
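A minimal circuit breaker along these lines might look like the sketch below. The class name, the thresholds, and the injectable `clock` parameter are assumptions made for illustration. It counts only exceptions, so "too slow" would need to surface as a timeout exception from the wrapped call.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-open it after a failed trial call.
                self._opened_at = self._clock()
            raise
        # Success: close the circuit and reset the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the breaker is open, callers get an immediate `CircuitOpenError` instead of waiting on a failing service, which is what gives that service room to recover.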
3) Graceful degradation
Make sure users can still get partial functionality when something’s down. It’s about failing smart, not failing loud. Give users the best service you can, even when not everything is working.
For example: If image processing fails, show a placeholder image instead of breaking the entire product page.
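The placeholder-image example could be sketched like this. The function names, the product dictionary shape, and the placeholder path are hypothetical, chosen only to show the pattern.

```python
import logging

logger = logging.getLogger("product_page")

PLACEHOLDER_IMAGE_URL = "/static/placeholder.png"  # hypothetical path


def render_product_page(product, get_image_url):
    """Build the page data; fall back to a placeholder if the image step fails."""
    try:
        image_url = get_image_url(product["id"])
    except Exception:
        # Degrade quietly for the user, but leave a trace for the team.
        logger.warning("image processing failed for product %s",
                       product["id"], exc_info=True)
        image_url = PLACEHOLDER_IMAGE_URL
    return {
        "name": product["name"],
        "price": product["price"],
        "image_url": image_url,
    }
```

Only the optional part of the page is wrapped in the fallback; the name and price are still returned as normal.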
4) Idempotency
Retry safely without causing duplicate actions or side effects.
For example: When processing a payment, use a unique transaction ID to ensure retries don’t result in double charges. Likewise, if a user clicks “Pay” twice, they should be charged only once; without idempotency, the second click produces a second charge.
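Here is a small sketch of the transaction-ID idea, assuming an in-memory store for brevity. `PaymentProcessor` and the `charge` callable are hypothetical. A production system would more likely keep the IDs in a durable store with a uniqueness constraint, and would not hold one global lock while charging.

```python
import threading


class PaymentProcessor:
    """Remembers results by transaction ID so a retry never charges twice."""

    def __init__(self, charge):
        self._charge = charge   # hypothetical function that moves the money
        self._results = {}      # transaction_id -> stored result
        self._lock = threading.Lock()

    def pay(self, transaction_id, amount_cents):
        with self._lock:
            if transaction_id in self._results:
                # Same request seen before: return the original outcome.
                return self._results[transaction_id]
            result = self._charge(amount_cents)
            self._results[transaction_id] = result
            return result
```

If the client generates the transaction ID once per checkout attempt, both a double click on “Pay” and an automatic retry reuse the same ID and receive the first result back.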
5) Observability is non-negotiable
Logs, metrics, tracing… you need all three to catch and recover from failure fast. When things break, and they will, you should be able to troubleshoot effectively: what failed, where, and when. In my experience, some teams skip logging altogether, while others log everything and drown in noise. Striking the right balance is key.
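As a rough illustration of how the three signals fit together, the sketch below uses only the Python standard library. The `observed` helper, the `metrics` counter, and the field names are assumptions for this example; a real service would typically send these to dedicated logging, metrics, and tracing backends.

```python
import json
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("checkout")
metrics = Counter()  # stand-in for a real metrics client


@contextmanager
def observed(operation, trace_id=None):
    """Record one structured log line and one counter tick per operation."""
    trace_id = trace_id or uuid.uuid4().hex  # ties related log lines together
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield trace_id
    except Exception:
        outcome = "error"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        metrics[f"{operation}.{outcome}"] += 1
        logger.info(json.dumps({
            "operation": operation,
            "trace_id": trace_id,
            "outcome": outcome,
            "duration_ms": round(duration_ms, 2),
        }))
```

Wrapping a call as `with observed("charge_card"):` produces one log line per operation rather than a stream of ad hoc messages, which is one way to avoid both extremes mentioned above.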
6) Chaos testing
Don’t just hope your app survives real-world failures. Test it. On purpose. Kill a service, stop a timer, disconnect a database, or delay a service’s response by injecting latency. Create chaos! It’s smart engineering. Resilience isn’t an add-on; it’s part of good architecture. The more we design with failure in mind, the more reliable our systems become.
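At the smallest scale, this kind of fault injection can be tried in a test suite by wrapping a dependency call. The `with_chaos` wrapper and its parameters below are hypothetical, shown only to illustrate injecting delays and failures. Dedicated chaos tooling works at the infrastructure level instead.

```python
import random
import time


class InjectedFault(Exception):
    """Failure raised on purpose by the chaos wrapper."""


def with_chaos(func, failure_rate=0.0, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap func so calls can be delayed and randomly failed."""
    rng = rng or random.Random()
    name = getattr(func, "__name__", "call")

    def wrapper(*args, **kwargs):
        if delay_seconds:
            sleep(delay_seconds)  # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper
```

Swapping a real client method for `with_chaos(client_method, failure_rate=0.2, delay_seconds=1.0)` in a staging test is a cheap way to check that the fallbacks, breakers, and retries described earlier actually engage.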

Conclusion
In my experience, resilience isn’t just about handling failure—it’s about expecting it. Using bulkheads, circuit breakers, and chaos testing has helped me design systems that bounce back fast. Observability is non-negotiable if you want to catch issues before your users do. Build with failure in mind, and your system will thank you.
If you’re an architect or engineer, I’d love to hear what’s worked for you too.
The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.