EMS: Risk and Mitigations
By Brian Thiry, Principal Analyst

ReliabilityFirst (RF) had the pleasure of contributing substantially to the recently published NERC Operating Committee Reference Document titled Risks and Mitigations for Losing EMS Functions.

Developed by the EMS Working Group (EMSWG), this document discusses the risks associated with losing Energy Management System (EMS) functions and shares mitigation strategies used to reduce these risks when operators lose situational awareness tools. The purpose of this Reference Document is to:

  • identify and discuss the risk of losing EMS functions,
  • analyze the causes of EMS events reported through the ERO Event Analysis Process (EAP), and
  • share mitigation strategies to reduce these risks to EMS reliability. 

The paper begins with a short summary, “What is an EMS and why is it important?” It defines and explains the components of an EMS, including but not limited to State Estimation (SE), Contingency Analysis (CA), and the Inter-Control Center Communications Protocol (ICCP). Figures 1 and 2 illustrate a simplified EMS configuration and the interdependencies among EMS applications.

The paper details the risks associated with losing EMS functions. The most impactful EMS risk is the loss of Supervisory Control and Data Acquisition (SCADA). Without SCADA, operators cannot remotely operate devices, nor do they have the metered data points from the remote terminal units (RTUs) needed to monitor system stability. The loss of SE, CA, or ICCP presents different challenges: respectively, the inability to determine non-metered data points, to assess the impact of the worst credible contingency, and to receive data from neighboring systems. Even though the loss of SCADA had the highest number of occurrences over the four years analyzed, the RF region is seeing a significant trend toward fewer SCADA outages and a rise in SE outages.

[Figure: 2018 EMS Page 8.png]

The paper goes on to analyze the underlying reasons for these EMS outages. The EAP includes cause coding to determine the root and contributing causes of EMS events. Four underlying categories were developed:

  • Software failures,
  • Communication failures,
  • Facility outages, and
  • Maintenance outages

These four categories show that EMS outages have different causes, each calling for a different mitigation strategy. If every outage were a software failure, the industry could point at the vendors and demand more resilient platforms, or entities could work internally through their own IT departments to ensure that the architecture of their EMS (including databases and memory allocations) is suitable for their needs. However, EMS outages have different causes. Some outages are due to external modeling issues, while others reveal settings that need to be fine-tuned for convergence. Sometimes the loss of a communications path or of supply power to the control center (or data center) results in an EMS outage. In other cases, a system upgrade or patch disables EMS functionality.

Whatever the reason, all of these events are analyzed through the Event Analysis Process (EAP) and cause-coded. During the cause-coding process, the entities provide details on the corrective actions and mitigation strategies that address the root and contributing causes (e.g., software upgrades, additional training, verifying the model, enhancing the loss-of-data procedure, or calibrating settings with the help of the vendor).
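
To make the cause-coding idea concrete, here is a minimal Python sketch of how cause-coded events could be tallied by the four categories and paired with their corrective actions. It is an illustration only, not the EAP's actual data model: the record fields, the function names, and every sample event below are hypothetical, and the pairing of corrective actions with categories is assumed for the example.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class CauseCategory(Enum):
    """The four underlying categories named in the reference document."""
    SOFTWARE = "Software failure"
    COMMUNICATION = "Communication failure"
    FACILITY = "Facility outage"
    MAINTENANCE = "Maintenance outage"


@dataclass(frozen=True)
class EmsEvent:
    """Hypothetical record for one cause-coded EMS event."""
    event_id: str
    lost_function: str          # e.g. "SCADA", "SE", "CA", "ICCP"
    category: CauseCategory
    corrective_action: str


def tally_by_category(events):
    """Count events per category, reporting zero for unused categories."""
    counts = Counter(event.category for event in events)
    return {category: counts.get(category, 0) for category in CauseCategory}


def actions_by_category(events):
    """Group the reported corrective actions under each cause category."""
    grouped = {category: [] for category in CauseCategory}
    for event in events:
        grouped[event.category].append(event.corrective_action)
    return grouped


# Made-up sample events for illustration; these are NOT real EAP data.
SAMPLE_EVENTS = [
    EmsEvent("E1", "SE", CauseCategory.SOFTWARE, "calibrate settings with the vendor"),
    EmsEvent("E2", "SCADA", CauseCategory.SOFTWARE, "software upgrade"),
    EmsEvent("E3", "ICCP", CauseCategory.COMMUNICATION, "enhance loss-of-data procedure"),
    EmsEvent("E4", "SCADA", CauseCategory.FACILITY, "additional training"),
    EmsEvent("E5", "CA", CauseCategory.MAINTENANCE, "verify the model"),
]

if __name__ == "__main__":
    for category, count in tally_by_category(SAMPLE_EVENTS).items():
        print(f"{category.value}: {count}")
```

Grouping by category rather than by lost function reflects the point made above: the same lost function (for example, SCADA) can stem from different causes, and each cause calls for a different mitigation.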

Mitigation strategies, specifically detective and corrective controls, are discussed to explain why the loss of EMS functions has not directly led to the loss of generation, transmission lines, or customer load. Possibly the biggest change since the early days of EMS has been the overlapping coverage of situational awareness. Reliability Coordinators and neighboring Transmission Operators and Balancing Authorities work together to help monitor member and neighboring systems during the loss of EMS functionality. The paper highlights ten good utility practices, including manning substations and implementing conservative operations, that help maintain reliability while an EMS is being repaired.

Finally, the reference document highlights the NERC Monitoring and Situational Awareness Conference, where the industry gathers annually to raise awareness of current and emerging EMS issues, exchange best practices, and collaborate with vendors. Past Lessons Learned from the EAP are shared, and entities highlight some of their own best practices for EMS resiliency.