We ❤️ Open Source

A community education resource

Building resilient applications: 6 best practices in software architecture

Designing for failure: Essential strategies for creating robust, fault-tolerant systems.

In today’s world, apps don’t just need to work—they need to keep working, even when things break. That’s where resilience in software architecture comes in. In my experience, resilience is often misunderstood as just “adding retries” or “putting stuff in queues.” But it’s much deeper than that. It’s about designing systems that expect failure—and recover gracefully.

Best practices in software architecture

Here are a few best practices I follow when aiming for resilience:

1) Design for failure, not perfection

Assume that downstream services will fail, and plan what happens next. For example: if the payment gateway is down, do not bring the entire system down with it. Allow the user to save their cart, or mark the order as payment pending. Show a message such as “The payment service is currently down, but your order is saved. We will notify you once the service is up.” Or consider offering alternate payment methods.
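Here is a minimal sketch of that payment fallback in Python. The names `Order`, `PaymentGatewayError`, and `place_order`, and the `charge` callable standing in for a real gateway client, are all assumptions made for illustration.

```python
from dataclasses import dataclass


class PaymentGatewayError(Exception):
    """Raised by the (hypothetical) gateway client when a charge cannot be made."""


@dataclass
class Order:
    order_id: str
    amount_cents: int
    status: str = "NEW"


def place_order(order, charge):
    """Try to charge; if the gateway is down, keep the order instead of failing."""
    try:
        charge(order.order_id, order.amount_cents)
    except PaymentGatewayError:
        # The gateway failed, but the order survives and can be retried later.
        order.status = "PAYMENT_PENDING"
        return ("The payment service is currently down, but your order is saved. "
                "We will notify you once the service is up.")
    order.status = "PAID"
    return "Payment received. Thank you!"
```

The point is that the failure path is a planned outcome with its own order status, not an unhandled exception that takes the checkout page down.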

2) Use bulkheads and circuit breakers

Isolate failures so they don’t cascade across the system. Think of bulkheads as compartments that separate parts of your system, so that if one breaks, it doesn’t drag the others down with it. For example: separate your user authentication service from your product catalog service. If auth fails, the catalog should still be accessible.
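One common way to build a bulkhead inside a single process is to give each dependency its own small pool of concurrency slots. The sketch below assumes that approach; `Bulkhead`, `BulkheadFullError`, and the two instances at the bottom are hypothetical names, and the limits are arbitrary.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slot left."""


class Bulkhead:
    """Caps how many calls to one dependency may run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing behind a struggling service.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no free slot in this compartment")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Separate compartments: slow auth calls cannot use up the catalog's slots.
auth_bulkhead = Bulkhead(max_concurrent=2)
catalog_bulkhead = Bulkhead(max_concurrent=10)
```

Because each service draws from its own slots, a hung auth backend can exhaust only `auth_bulkhead`, while catalog requests keep flowing.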

Circuit breakers monitor communication between services. For example: if a downstream system fails or becomes too slow, the circuit breaker trips (opens) and temporarily halts further requests to that service. It then periodically checks whether the service has recovered before allowing traffic again. This keeps a failing service from being overwhelmed with more requests and avoids cascading failures that can crash the whole system.
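The trip, wait, and probe cycle described above can be sketched in a few lines. This is a simplified, single-threaded illustration, not a production implementation: `CircuitBreaker`, `CircuitOpenError`, and the default thresholds are assumptions, and slow calls count as failures only if the wrapped function raises on timeout.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock          # injectable so tests can control time
        self._failures = 0
        self._opened_at = None       # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; request not sent")
            # Timeout elapsed: half-open, so let one trial request through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                # Trip the breaker, or re-trip it after a failed trial request.
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get a fast `CircuitOpenError` they can turn into a fallback, and the struggling service gets no traffic until the reset timeout has passed.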

3) Graceful degradation

Make sure users can still get partial functionality when something’s down. It’s about failing smart, not failing loud. Provide users with the best service you can, even when everything is not perfect.

For example: If image processing fails, show a placeholder image instead of breaking the entire product page.
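As a rough sketch of the placeholder example, assuming a hypothetical `render_product_page` function, a `process_image` callable, and a made-up placeholder path:

```python
import logging

logger = logging.getLogger("product_page")

PLACEHOLDER_IMAGE = "/static/placeholder.png"  # hypothetical path


def render_product_page(product, process_image):
    """Build the page data; fall back to a placeholder if the image step fails."""
    try:
        image_url = process_image(product["image"])
    except Exception:
        # Record the failure for later, but keep the rest of the page working.
        logger.warning("image processing failed for product %s; using placeholder",
                       product["id"], exc_info=True)
        image_url = PLACEHOLDER_IMAGE
    return {"id": product["id"], "name": product["name"], "image_url": image_url}
```

The user still sees the product name and can still buy it; only the non-essential part of the page degrades.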

4) Idempotency

Retry safely without causing duplicate actions or side effects.

For example: when processing a payment, use a unique transaction ID so that retries don’t result in double charges. Likewise, if a user clicks “Pay” twice, they should be charged only once. Without idempotency, the second click creates a second charge.
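A minimal sketch of the transaction ID idea follows. `PaymentProcessor` and its in-memory dictionary are assumptions for illustration; a real system would likely need a durable store with an atomic "insert if absent" so that concurrent retries are also safe.

```python
class PaymentProcessor:
    """Remembers the result for each transaction ID so retries are harmless."""

    def __init__(self, charge):
        self._charge = charge   # hypothetical callable that performs the real charge
        self._results = {}      # transaction_id -> result (in-memory, illustration only)

    def pay(self, transaction_id, amount_cents):
        # A repeated ID means a retry or a double click: return the stored result.
        if transaction_id in self._results:
            return self._results[transaction_id]
        result = self._charge(amount_cents)
        self._results[transaction_id] = result
        return result
```

Calling `pay("txn-123", 1999)` twice charges once and returns the same result both times, which is what makes automatic retries safe to add.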

5) Observability is non-negotiable

Logs, metrics, tracing… you need all three to catch and recover from failure fast. When things break, and they will, you should be able to troubleshoot effectively: what failed, where, and when. In my experience, some teams skip logging altogether, while others log everything and drown in noise. Striking the right balance is the real mantra.
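To make "what failed, where, and when" concrete, here is a small standard-library sketch that emits one structured log line, one counter increment, and one trace ID per operation. The `observed` helper and the `metrics` counter are hypothetical stand-ins for whatever logging, metrics, and tracing tools your team actually uses.

```python
import json
import logging
import time
import uuid
from collections import Counter
from contextlib import contextmanager

logger = logging.getLogger("orders")
metrics = Counter()  # stand-in for a real metrics client


@contextmanager
def observed(operation, trace_id=None):
    """Record outcome, duration, and a trace ID for one operation."""
    trace_id = trace_id or uuid.uuid4().hex
    start = time.perf_counter()
    outcome = "ok"
    try:
        yield trace_id
    except Exception:
        outcome = "error"
        raise
    finally:
        duration_ms = (time.perf_counter() - start) * 1000
        metrics[f"{operation}.{outcome}"] += 1
        # One line per operation: enough to answer what, where, and when.
        logger.info(json.dumps({
            "operation": operation,
            "outcome": outcome,
            "trace_id": trace_id,
            "duration_ms": round(duration_ms, 2),
        }))
```

Wrapping a call in `with observed("checkout") as trace_id:` gives one log line per operation rather than one per statement, which is one way to strike the balance between silence and noise.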

6) Chaos testing

Don’t just hope your app survives real-world failures. Test it. On purpose. Kill a service, stop a timer, disconnect a database, or delay a response from a service by adding a timer. Create chaos! It’s smart engineering. Resilience isn’t an add-on; it’s part of good architecture. The more we design with failure in mind, the more reliable our systems become.
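At the smallest scale, fault injection can be a wrapper that delays or fails a call on purpose in a test environment. The sketch below assumes that approach; `chaotic` and `InjectedFault` are hypothetical names, and dedicated chaos tooling works at the infrastructure level rather than inside the code like this.

```python
import random
import time


class InjectedFault(Exception):
    """A failure raised on purpose by the chaos wrapper."""


def chaotic(func, failure_rate=0.0, delay_seconds=0.0, rng=None, sleep=time.sleep):
    """Wrap func so that calls are delayed and randomly fail."""
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if delay_seconds:
            sleep(delay_seconds)            # simulate a slow dependency
        if rng.random() < failure_rate:
            raise InjectedFault("injected failure")  # simulate an outage
        return func(*args, **kwargs)

    return wrapper
```

Wrapping a dependency with, say, `failure_rate=0.2` in a staging run is a quick way to check whether your fallbacks, circuit breakers, and alerts behave as designed.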

Building resilient software architecture diagram.
Image by Tapasya Syal created using Napkin.AI. 

Conclusion

In my experience, resilience isn’t just about handling failure—it’s about expecting it. Using bulkheads, circuit breakers, and chaos testing has helped me design systems that bounce back fast. Observability is non-negotiable if you want to catch issues before your users do. Build with failure in mind, and your system will thank you.

If you’re an architect or engineer, I’d love to hear what’s worked for you too.

About the Author

Tapasya Syal is a Software Architect and AI enthusiast with over 18 years of experience designing scalable enterprise solutions. She enjoys simplifying complex tech, and believes in continuous learning and sharing practical insights.

The opinions expressed on this website are those of each author, not of the author's employer or All Things Open/We Love Open Source.
