Enable America Jobs

Job Information

Procter & Gamble Senior Site Reliability Engineer (Manufacturing IT Operations) in Taguig City, Philippines

Job Location

Taguig City

Job Description

Overview of the job

As the Senior SRE Lead in the Manufacturing IT Operations – Incident Response Team, you will lead incident response efforts, ensuring swift and effective resolution of critical system issues. You will also play a key role in ensuring the reliability, scalability, and performance of our systems and services. Collaborating with cross-functional teams, you will design, implement, and automate robust systems, monitoring tools, and processes. Your additional responsibilities include leading the SRE team, managing the team's schedule, managing service level objectives (SLOs) and service level indicators (SLIs), and overseeing reporting. You will report directly to the IT Operations Director.

Your team

You will lead the SRE – Incident Response team, providing guidance, support, and mentorship to team members as they navigate their roles. Collaborating closely with technically skilled professionals, including software engineers, DevOps specialists, subject matter experts, and other SREs, you will foster a culture of technical expertise, continuous learning, and knowledge sharing while encouraging innovation and new ideas. In addition, you will collaborate directly with our site customers and users, ensuring their needs and expectations are met through reliable, high-performing systems. You will report directly to the IT Operations Director.

What success looks like

Success as an SRE Lead spans several areas of the role, including incident response, monitoring and reliability, effective collaboration with customers and users, and additional leadership responsibilities:

  • Lead the swift response and resolution to critical incidents, ensuring minimal impact on system availability and user experience, while driving continuous improvement in incident management processes.

  • Ensure high system availability and reliability through robust monitoring, optimization of system architecture, and cross-functional collaboration to design and implement resilient systems.

  • Lead the implementation of comprehensive monitoring solutions that provide real-time insights into system performance, enabling proactive incident response and continuous improvement of system visibility and resource optimization.

  • Collaborate directly with customers and users to understand their needs, proactively address concerns, and provide exceptional customer support to ensure reliable and performant systems that meet their expectations.

  • Lead the SRE team, providing guidance, support, and mentorship to team members, fostering a culture of technical excellence and continuous learning.

  • Manage the team schedule effectively to ensure coverage and support throughout the week, maintaining the reliability and availability of our systems.

  • Manage SLOs and SLIs, ensuring that the defined service level objectives and indicators are met or exceeded.

  • Oversee reporting, providing accurate and timely updates on incident response, system performance, and other relevant metrics to stakeholders.

  • Report directly to the IT Operations Director, providing insights and recommendations and collaborating on strategic initiatives.

Responsibilities of the role

Team Leadership:

  • Lead the SRE team, providing guidance, support, and mentorship to foster a culture of technical excellence and continuous learning.

  • Manage the team schedule effectively to ensure coverage and support throughout the week, maintaining the reliability and availability of systems.

  • Oversee reporting, providing accurate and timely updates on incident response, system performance, and other relevant metrics to stakeholders.

  • Foster a collaborative and inclusive team culture, promoting effective communication, knowledge sharing, and professional development.

Incident Response:

  • Lead incident response efforts, swiftly resolving critical incidents to minimize downtime and user impact.

  • Implement effective incident management processes, ensuring clear communication, coordination, and documentation.

  • Conduct root cause analysis, implement preventive measures, and drive continuous improvement.

Reliability:

  • Ensure high system availability through robust monitoring, alerting, and automated incident response systems.

  • Optimize system architecture and configurations for improved performance, scalability, and fault tolerance.

  • Collaborate cross-functionally to design and implement resilient systems using industry best practices.

Monitoring:

  • Implement comprehensive monitoring solutions, providing real-time insights into system performance and health.

  • Configure and manage monitoring tools, ensuring accurate and actionable alerts for proactive incident response.

  • Continuously evaluate and enhance monitoring strategies to improve system visibility and resource optimization.

Upskilling:

  • Stay updated with industry trends, technologies, and best practices in Site Reliability Engineering.

  • Continuously develop technical skills in system architecture, automation, cloud technologies, and incident response.

  • Share knowledge, mentor team members, and foster a culture of learning and upskilling.

Managing Customers:

  • Collaborate directly with users and customers to understand their needs and pain points.

  • Proactively address customer/user concerns, ensuring reliable and performant systems.

  • Provide exceptional customer support, communicate updates and resolutions, and gather feedback for continuous improvement.

Job Qualifications

Role Requirements

Technical Expertise and Experience:

  • Knowledge of or familiarity with system administration, including Linux/Unix environments and cloud platforms (such as AWS, Azure, or GCP).

  • Experience with configuration management tools and infrastructure-as-code frameworks (e.g., Terraform).

  • Proficiency in at least one programming language (e.g., Python, C#) and experience with scripting for automation tasks.

  • Understanding of networking protocols, network infrastructures, load balancing, and DNS management.

  • Familiarity with containerization and orchestration technologies (e.g., Docker, Kubernetes).

  • Familiarity with databases and proficiency in writing SQL queries.

  • Experience or familiarity with monitoring and observability tools (e.g., Prometheus, Grafana).

  • Knowledge of incident response methodologies, root cause analysis, and implementing preventive measures.

  • Understanding of security best practices and experience with implementing secure systems.

  • Experience with Manufacturing Execution Systems (e.g., Proficy) or Manufacturing Operations is a plus.

  • At least 7 years of industry experience, preferably in software engineering, software development, SRE, DevOps, or technical consulting.

Soft Skills:

  • Strong problem-solving and troubleshooting skills, with the ability to analyze complex issues and devise effective solutions.

  • Excellent communication and collaboration skills to work effectively with cross-functional teams, stakeholders, and customers.

  • Strong leadership skills to guide and mentor the SRE team, fostering technical excellence and continuous learning.

  • Ability to manage time and schedules effectively, ensuring coverage and support throughout the week.

  • Strong attention to detail and commitment to delivering high-quality work.

  • Proactive and self-motivated, with a continuous learning mindset and a drive to stay current with industry trends and technologies.

  • Ability to thrive under pressure and effectively manage incidents, ensuring timely resolutions and minimizing downtime.

This role requires a commitment to a standard 5-day workweek consisting of four weekdays and one weekend day (Saturday or Sunday). The nature of the SRE Lead position requires coverage and support throughout the week to ensure the reliability and availability of our systems. We value work-life balance and will strive to provide a predictable and manageable schedule within this framework while still meeting the needs of our customers and maintaining the stability of our services.

About us

We produce globally recognized brands, and we grow the best business leaders in the industry. With a portfolio of trusted brands as diverse as ours, it is paramount that our leaders are able to lead the vast array of brands, categories, and functions with courage. We serve consumers around the world with one of the strongest portfolios of trusted, quality, leadership brands, including Always®, Ariel®, Gillette®, Head & Shoulders®, Herbal Essences®, Oral-B®, Pampers®, Pantene®, Tampax® and more. Our community includes operations in approximately 70 countries worldwide.

Visit http://www.pg.com to learn more.

We are an equal opportunity employer and value diversity at our company. We do not discriminate against individuals on the basis of race, color, gender, age, national origin, religion, sexual orientation, gender identity or expression, marital status, citizenship, disability, HIV/AIDS status, or any other legally protected factor.

Job Schedule

Full time

Job Number

R000114678

Job Segmentation

Experienced Professionals (Job Segmentation)