Job Specifications
Requisition ID: 255228
Join a purpose-driven winning team, committed to results, in an inclusive and high-performing culture.
The Senior Platform Engineer will play a critical role within the Enterprise Data & AI Technology organization - one of Scotiabank’s most significant enterprise-wide strategic initiatives. This organization drives data-enabled decision making, AI innovation, and technology modernization across the Bank.
The Senior Platform Engineer will be responsible for building, tuning, and managing infrastructure, as well as DevOps, platform site reliability, monitoring, troubleshooting, and enhancing and enabling new features on the Data & AI platform(s) in line with the Bank's Data & AI strategy. This involves working with cross-functional teams such as IAM, Network, Cloud Ops, Security, and client partners on integration, process automation, platform enhancement, and delivery of new projects.
Is this role right for you? In this role, you will:
Guidance and Direction: Provide clear direction to the team, set goals, and keep the team accountable for their deliverables. Align team goals with the overall direction of the Azure & Databricks Platform roadmap and enterprise standards.
Technical Oversight: Own the technical direction across Azure and Databricks: Azure networking and security architecture (VNets, Private Endpoints, NSGs, route tables, Azure Firewall), Azure Identity & Access Management (RBAC, PIM), and Databricks platform governance (Unity Catalog, workspace configuration, cluster policies). Ensure best practices for reliability, cost, and security are consistently applied.
Quality Assurance: Ensure high-quality support delivery for platform users; adhere to platform SLAs/SLOs and service objectives.
Process Improvements: Continually improve platform processes and SOPs for efficiency and automation. Design and develop reusable Terraform modules for Azure native resources and Databricks (clusters, SQL warehouses, Unity Catalog objects), enabling consistent, scalable, and automated deployments via Terraform Cloud/Enterprise and CI/CD.
Customer Relations: Build strong relationships with data engineers, analysts, and platform users. Communicate proactively with stakeholders and cross‑functional teams (Platform, Security, Cloud Ops, Networking, Data Governance) to align priorities, manage expectations, and drive adoption of platform standards.
Advanced Monitoring and Troubleshooting: Troubleshoot and resolve performance issues across Databricks jobs, clusters, SQL warehouses, and Azure dependencies. Implement Azure Monitor and Log Analytics‑based observability with custom dashboards for cluster/job health, driver/executor metrics, and cost insights. Establish proactive alerting and early issue detection via logs/metrics for Databricks and Azure services.
Site Reliability: Analyze, triage, and resolve platform issues promptly to achieve SLOs and platform reliability objectives. Drive error‑budget aware practices, post‑incident reviews, and resilience engineering (e.g., autoscaling, retry/backoff strategies, policy guardrails).
Incident Management: Provide support during major incidents, including after‑hours support. Lead incident response, communications to users and stakeholders, and root‑cause analysis with clear action items and follow‑through.
Observability Tools Development: Design, build, and deploy logging/monitoring solutions for early detection and actionable insights. Standardize ingestion to Log Analytics from Databricks (audit logs, cluster events, job runs) and key Azure resources; build dashboards and alert rules to reduce MTTR.
Release Control Management: Maintain and enhance the Infrastructure & Platform release pipeline using Terraform, Terraform Cloud, Azure DevOps and/or GitHub Actions, with source control in GitHub/Bitbucket and artifact promotion via ACR/Artifacts. Enforce approvals, change windows, and automated checks to ensure safe, repeatable releases.
Client Pipeline Management: Implement CI/CD for infrastructure and analytics workloads using Terraform, Docker, Azure DevOps/GitHub Actions, and Artifact/Container registries. Automate Terraform plan/apply, Databricks Bundle releases, policy validation, and security scanning to streamline delivery and ensure compliance.
Credential Security: Set up Azure Key Vault and HashiCorp Vault for secret management; integrate with Databricks secret scopes and workload identities. Enforce least‑privilege access via Azure RBAC and rotate credentials per policy.
Vendor and Technical Support Interaction: Partner with Microsoft and Databricks support and product teams to fine‑tune and troubleshoot components, plan upgrades, and adopt new capabilities aligned to roadmap and enterprise controls.
Mentorship: Mentor junior engineers in best practices for building, deploying, testing, and supporting services on Azure and Databricks. Promote a culture of automation, documentation, and continuous learning.
Skills
Do you have the skills that will enable you to su