Ashwin Poojary on the role of SRE in ensuring optimal system infrastructure

Poojary believes SREs like him must be agile learners, quickly assimilating new information and technologies to solve complex problems creatively

Photo courtesy of Ashwin Poojary

Opinions expressed by Digital Journal contributors are their own.

Site Reliability Engineering (SRE) is becoming increasingly relevant as information technology (IT) evolves. SRE ensures that intricate infrastructures deliver peak performance and reliability. Leading companies in social media, artificial intelligence (AI), gaming, and semiconductor technologies, such as NVIDIA, lean heavily on SRE principles to bolster their system infrastructures. This strategic focus supports their continuous integration and deployment practices and their large-scale operations, all while prioritizing customer satisfaction through reliability and performance.

Central to NVIDIA’s push towards technological excellence is Ashwin Poojary, a seasoned professional who has significantly shaped the company’s SRE. As the director of site reliability engineering and DevOps, Poojary leads a dynamic team overseeing NVIDIA’s cloud foundation, database foundation, and enterprise applications. His role is critical in ensuring that NVIDIA’s infrastructure is reliable, scalable, and prepared for future AI, gaming, and semiconductor technology demands.

With years of experience at renowned tech companies, Poojary has gained deep expertise in SRE. Drawing on his time at companies such as Facebook (Meta), Google, and Twitter (X), where he played a critical role in supporting infrastructure services, he offers valuable insights into SRE’s role in achieving balanced and peak system performance.

Harmonizing dev ambitions and ops stability

SRE bridges the divide between development (Dev) teams, driven to rapidly innovate and release new features, and operations (Ops) teams, whose primary focus is maintaining system stability and reliability. Poojary notes that this disconnect often leads to tension, with Ops imposing restrictions to safeguard system integrity and Dev finding ways to circumvent these restrictions for faster updates.

SREs resolve these conflicts by establishing a framework that quantifies reliability in terms of Service-Level Objectives (SLOs) and Service-Level Agreements (SLAs). This provides a clear, objective basis for decision-making and brings a more strategic approach to launching new features and managing system reliability. By adopting a formula to assess whether a new feature should be released, SRE replaces the subjective debate over release readiness with a process that is transparent and fair.
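The article does not spell out the formula Poojary has in mind. One widely used way for SRE teams to make this kind of release decision is an error budget: the amount of unreliability an SLO permits over a given window. The Python sketch below is illustrative only. The function names, the threshold, and the sample numbers are assumptions made for this example, not NVIDIA’s or Poojary’s actual method.

```python
def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Return the fraction of the error budget still unspent.

    slo_target is the availability objective as a fraction (e.g. 0.999).
    The result is 1.0 when nothing has failed and goes negative once
    the budget is overspent.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # Failures the SLO tolerates over this window of traffic.
    allowed_failures = (1 - slo_target) * total_requests
    return 1 - failed_requests / allowed_failures


def can_release(slo_target: float, total_requests: int,
                failed_requests: int, min_budget: float = 0.0) -> bool:
    """Hypothetical release gate: ship only while budget remains.

    min_budget is an assumed policy knob; 0.0 means "any budget left".
    """
    remaining = error_budget_remaining(slo_target, total_requests,
                                       failed_requests)
    return remaining > min_budget


# Assumed sample window: 1,000,000 requests against a 99.9% SLO,
# which tolerates roughly 1,000 failed requests.
print(can_release(0.999, 1_000_000, 400))   # True: budget remains
print(can_release(0.999, 1_000_000, 1200))  # False: budget overspent
```

Under a policy like this, Dev can keep shipping while budget remains, and releases pause once it is spent, so the decision rests on the same number for both teams.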

Poojary explains, “SREs not only oversee the ongoing reliability of products but also work closely with Dev teams to ensure that innovations align with the system’s capacity to maintain performance standards.”

Through this balance and unity, SREs ensure that the drive for innovation does not compromise user experience, leading to more robust and reliable services that meet expectations.

Reducing the cost of failure

Addressing and learning from system failures is crucial for robust system performance. Poojary emphasizes the role of SRE teams in monitoring systems and taking a proactive approach to failure management. By thoroughly analyzing failures and their impact on system performance, SREs can devise solutions that mitigate such issues, thereby reducing the cost of failure.

While at Meta, Poojary’s expertise in SRE shone through his critical analysis of Peripheral Component Interconnect Express (PCIe) hardware vulnerabilities. He authored an article detailing how he and his team monitored PCIe-based components, uncovering their susceptibility to various software-related failures and performance degradations that posed significant challenges for system maintenance.

To address these issues, Poojary spearheaded the development of system tools such as PCIcrawler and MachineChecker, and also used Facebook’s standard tooling, including Scuba and IPMI Tool, to enhance fault detection and streamline maintenance processes. As a result, Poojary’s team significantly strengthened the reliability, resilience, and performance of Meta’s hardware fleet, setting new benchmarks in system optimization and maintenance efficiency.

“This approach to failure management underlines the philosophy that each failure is an opportunity for growth, driving innovations that contribute to more seamless and efficient system operations,” Poojary mentions.

These strategies and learning processes enable the team to refine their products and systems continuously, ensuring they evolve to be more reliable and user-friendly for future applications.

Excellent collaboration among teams

Poojary’s leadership of Twitter’s geographically distributed team, spread across Bangalore, Boston, London, San Francisco, San Jose, and Seattle, exemplifies the critical role of SREs in ensuring seamless collaboration. Under his guidance, the team, which included Blobstore, Cache, Database, and Data Platform SREs, among others, provided various infrastructure services supporting Twitter’s overall structure.

“SRE’s purpose is not limited to ensuring the reliability and efficiency of systems. It also extends to bridging the gap between various departments to harmonize practices and decisions across the board,” Poojary explains.

SREs are instrumental in disseminating best practices and reviewing reliability decisions, which are crucial for enhancing cross-departmental product development. Such collaborative effort is essential in creating a cohesive environment where teams can work harmoniously towards common goals, leveraging the strengths of diverse units to innovate and improve products.

The greater responsibility ahead

With the level of work and expectations placed on SREs, Poojary acknowledges the need for a proactive stance toward innovation and learning. This involves staying abreast of the latest cloud technologies, automation tools, and software development practices. It also demands a solution-oriented and efficiency-centered mindset, always looking for ways to automate tasks, improve system resilience, and reduce downtime.

Poojary believes SREs like him must be agile learners, quickly assimilating new information and technologies to solve complex problems creatively. Such a culture of innovation and learning is essential for personal and team growth, ensuring that organizations remain competitive and resilient in the face of technological change.
