Job Description:
The Resident Engineer – Data Center is responsible for the end-to-end on-site technical operations, maintenance, and SLA governance of mission-critical data center infrastructure. This role acts as the single on-site technical authority and service representative, ensuring maximum uptime, compliance with SLAs, operational excellence, and seamless coordination between the client, OEMs, subcontractors, and internal support teams.
The position requires deep hands-on expertise across electrical, mechanical, IT, and monitoring systems, strong troubleshooting capabilities, and the ability to manage incident response, preventive maintenance, compliance audits, and service reporting in line with industry best practices (Uptime Institute, TIA, ISO, ITIL).
Position Structure:
Department:
Line Manager:
Stream:
Job Requirements
Key Qualifications:
- Master's Degree
- Graduate
Position Type:
- FULL-TIME
Education
- Bachelor’s Degree in Electrical Engineering, Mechanical Engineering, Electronics, Mechatronics, or Computer Engineering
- Diploma holders with extensive data center experience may also be considered
Key Skills & Competencies
Technical Skills
- Strong knowledge of UPS, generators, electrical distribution, and cooling systems
- Hands-on experience with BMS, DCIM, and monitoring tools
- Understanding of IT infrastructure, cabling standards, and rack power management
- Strong troubleshooting and fault isolation skills
Operational & Soft Skills
- SLA and KPI-driven mindset
- Excellent incident handling and RCA skills
- Strong documentation and reporting ability
- Ability to work under pressure in critical environments
- Excellent communication and coordination skills
- Strong customer service orientation
Preferred Certifications (highly desirable)
- Uptime Institute:
  - Accredited Tier Specialist (ATS)
  - Data Center Operations Specialist (DCOS)
- Vendor Certifications:
  - UPS / Cooling OEM certifications
- IT & Service Management:
  - ITIL Foundation
  - ISO 27001 / ISO 20000 Awareness
- Safety:
  - NEBOSH / IOSH
  - First Aid & Fire Safety
Required experience:
- 5–8 years of hands-on experience in data center operations, preferably in managed services or SLA-driven environments
- Proven experience working in Tier II / Tier III / Tier IV data centers
- Experience managing mission-critical infrastructure with 24/7 operations
Working Conditions
- 24/7 operational environment with shift-based or on-call requirements
- On-site presence at the customer's data center
- High-accountability role in mission-critical infrastructure
Duties & Responsibilities
Key Responsibilities
1. Data Center Operations & SLA Management
- Act as the primary on-site engineer responsible for compliance with contractual Service Level Agreements (SLAs).
- Ensure 24/7 availability, reliability, and performance of data center infrastructure.
- Monitor and manage KPIs, including MTTR, MTBF, uptime metrics, and service response times.
- Ensure timely escalation, coordination, and resolution of incidents as per the SLA matrix.
- Maintain service continuity during planned and unplanned activities.
2. Electrical Systems Management
- Operate and maintain LV/MV panels, UPS systems, battery banks, PDUs, ATS, STS, and grounding systems.
- Monitor power quality, load balancing, redundancy (N, N+1, 2N), and capacity utilization.
- Supervise UPS battery health checks, discharge tests, and replacement activities.
- Coordinate shutdowns, switchovers, and power maintenance activities with zero impact on live IT loads.
3. Mechanical & Cooling Systems
- Operate and monitor precision air conditioning systems (CRAC/CRAH), chillers, AHUs, DX units, and cooling towers.
- Manage temperature, humidity, airflow, hot/cold aisle containment, and energy efficiency.
- Perform root cause analysis for cooling alarms and thermal incidents.
- Ensure compliance with ASHRAE thermal guidelines.
4. IT & Low Voltage Infrastructure
- Support server racks, structured cabling, fiber/copper links, patching, labeling, and rack power distribution.
- Coordinate installations, de-installations, and migrations with client IT teams.
- Monitor network, BMS, DCIM, CCTV, access control, and fire detection and suppression systems.
- Ensure documentation accuracy for rack layouts, power maps, and connectivity diagrams.
5. Preventive & Corrective Maintenance
- Plan and execute preventive maintenance (PM) schedules for all DC assets.
- Supervise OEMs and subcontractors during PM and corrective maintenance (CM) activities.
- Ensure all maintenance is performed in accordance with OEM guidelines and safety standards.
- Review maintenance reports, punch lists, and corrective actions.
6. Incident Management & Root Cause Analysis
- Lead incident response for alarms, faults, and outages.
- Perform detailed Root Cause Analysis (RCA) and submit incident reports within defined timelines.
- Implement corrective and preventive actions (CAPA) to avoid recurrence.
- Participate in post-incident reviews and service improvement plans.
7. Monitoring, Reporting & Documentation
- Monitor alarms and alerts through BMS, DCIM, EMS, and NMS platforms.
- Prepare and submit daily logs, weekly summaries, and monthly SLA reports.
- Maintain asset registers, O&M manuals, SOPs, EOPs, MOPs, and escalation matrices.
- Ensure documentation readiness for audits and client reviews.
8. Compliance, Safety & Best Practices
- Ensure compliance with ISO 27001, ISO 20000, ISO 22301, ISO 45001, and local safety regulations.
- Enforce HSE policies, LOTO procedures, and risk assessments.
- Support internal and external audits, certifications, and inspections.
- Promote continuous improvement and operational excellence.
9. Client & Stakeholder Coordination
- Act as the on-site technical interface between the client, vendors, and internal teams.
- Participate in service review meetings, change management discussions, and planning sessions.
- Provide technical guidance, recommendations, and capacity planning insights to clients.
- Maintain professional communication and customer satisfaction at all times.
Reporting Responsibilities
Daily Tasks:
- Monitor DC infrastructure alarms and system health
- Log readings for power, cooling, and environmental parameters
- Respond to incidents, alarms, and service requests
- Update shift logs and incident trackers
- Coordinate minor maintenance and vendor activities
Weekly Tasks:
- Review SLA performance and incident trends
- Conduct preventive checks on critical systems
- Validate backup systems, redundancy paths, and failover readiness
- Update asset and configuration documentation
- Participate in coordination and planning meetings
Monthly Tasks:
- Perform and oversee scheduled preventive maintenance
- Prepare and submit monthly SLA, availability, and performance reports
- Conduct capacity, risk, and compliance reviews
- Review RCAs and implement improvement actions
- Support audits, drills, and management reviews