AWS Operational Support Engineer

Job description

Provide technical support and maintenance for applications and infrastructure hosted on Amazon Web Services (AWS).
Monitor system performance, troubleshoot issues, implement best practices for scalability and reliability, and ensure smooth operation of cloud environments while collaborating with development and operations teams.

  • Monitor and optimize AWS-based production environments, ensuring high availability, performance, and resilience.
  • Manage observability tools such as CloudWatch, X-Ray, Datadog, and custom dashboards to provide real-time visibility across infrastructure and applications.
  • Lead the response to critical production incidents, including troubleshooting, root cause analysis, mitigation, and post-mortem reporting.
  • Maintain and improve disaster recovery processes, failover procedures, backup strategies, and resilience testing.
  • Support ECS clusters, EC2 instances, load balancers, databases, RabbitMQ, and S3 services.
  • Troubleshoot and support CI/CD pipelines, Infrastructure as Code deployments, and production releases.
  • Collaborate with development teams to improve application performance, database queries, and operational reliability.
  • Manage IAM roles, security configurations, certificates, and access controls according to best practices.
  • Create and maintain operational documentation, runbooks, and incident procedures.


Requirements

  • Proven experience in AWS operations and production support environments.
  • Strong knowledge of EC2, ECS, S3, IAM, VPC, CloudWatch, Route 53, Load Balancers, and Infrastructure as Code tools such as CloudFormation or AWS CDK.
  • Experience with Docker, ECS, CI/CD pipelines, Jenkins, and AWS Code services (such as CodeBuild, CodeDeploy, and CodePipeline).
  • Knowledge of Aurora PostgreSQL, MongoDB Atlas, RabbitMQ, and monitoring tools such as Datadog and X-Ray.
  • Strong troubleshooting skills across infrastructure, networking, databases, and cloud services.
  • Experience with disaster recovery, failover testing, backup strategies, and resilience practices.
  • Scripting and automation skills using Python and Bash.
  • Ability to read and troubleshoot Java, Python, or TypeScript code.
  • Experience managing production incidents, root cause analysis, and operational runbooks.
  • Strong communication, analytical, and documentation skills.

