Get to know our Team:
We are living in exciting times. Technology is reshaping how we live, and we want to use it to redefine how financial services are offered. Grab is leading a consortium to apply for a Digital Bank license and build a bank on the right foundation, using data, technology and trust to solve problems and serve customers. We have big dreams to unlock, and financial inclusion for people in our region is just one of them. If you have what it takes, help us build our new Digibank.
We are seeking a driven and motivated individual to join our Engineering team for our new Digibank initiative. This role will be based in Malaysia.
Get to know the Role:
As a Site Reliability Engineer (SRE), you will help build a meaningful engineering discipline, combining software and systems to develop creative engineering solutions to operations problems.
Much of our support and software development focuses on optimizing existing systems, building infrastructure and reducing work through automation.
You’ll join a team of curious problem solvers with a diverse set of perspectives who are thinking big and taking risks.
As an SRE, you’ll focus on making production applications and systems run better.
SREs are key contributors to core infrastructure and functional development teams throughout the software life cycle, helping to support software for reliability and scale.
Key areas of focus include automation, application/platform uptime and quality, packaging/distribution techniques, platform design “operability”, analytics, deployment, adoption, and tool development, among others.
The position wears many hats, from owning day-to-day health and performance, to identifying incidents and developing remediation plans, to working with open source software and experienced packaging techniques, to working with development teams and contributing to the strategic roadmap and its execution.
Candidates from a variety of software, platform, or automation engineering backgrounds will be considered for this position.
The day-to-day activities:
Design, code, test and deliver software to automate manual operational work
Troubleshoot priority incidents, facilitate blameless post-mortems and ensure permanent closure of incidents
Engage with development teams throughout the life cycle to help develop software for reliability and scale, ensuring minimal refactoring or changes
Perform analytics on previous incidents and usage patterns to better predict issues and take proactive actions
Perform L1/L2/L3 support activities for the Production Support project, including analysis and design work and assessing the impact of requirements across all system components
Build self-healing and resiliency patterns and drive their wider adoption
Design automated software and product upgrades, change management, and release management solutions
Participate in 24x7 support coverage as needed
The must haves:
Bachelor's degree in information systems, information technology, computer science, or similar.
3+ years of professional experience in a software management position.
Experience with Docker, containers, and Kubernetes (k8s).
Direct production operations experience in a cloud environment.
Experience contributing to technology and product strategy.
Experience leading capability building initiatives across diverse areas such as infrastructure and operations automation, software quality, delivery automation and other core engineering.
Demonstrated experience driving operational efficiency and transparency in a growing engineering organization.