I keep production infrastructure stable, scalable, and quiet at 3 AM. Currently an SRE at an AI company running Azure-heavy environments (GPU workloads, vector databases, and customer-facing services) where uptime and cost both matter.

I work best when I can own infrastructure end-to-end: design it, automate it, monitor it, and fix it before users notice. I write clear docs, communicate proactively in async setups, and treat on-call seriously.

Core stack:
Azure: AKS, VMs, Storage, Networking, Monitor, RBAC
IaC: Terraform / OpenTofu, including state drift remediation and resource imports
Kubernetes: Helm, Deployments / StatefulSets, pod troubleshooting, zero-downtime rollouts
CI/CD: GitHub Actions, Jenkins automation pipelines
Observability: Prometheus, Grafana, Azure Monitor, Slack alerting
Databases: PostgreSQL, MongoDB Atlas, Milvus
DR/HA: AKS disaster recovery with Velero, GitOps/Flux, zone-redundant storage (~30–45 min RTO)

Also experienced with: Docker/ACR, compliance frameworks (ISO 27001, ISO 42001, SOC 2 Type 2), and Git workflows on messy real-world repos.

Open to long-term part-time remote roles. Comfortable with US, AU, or EU time zone overlap.