GPU Hang Detection and Recovery

July 15, 2007

Abstract

This paper provides background information about how Windows Vista® detects GPU hangs and works to recover the desktop.

The current version of this paper is maintained on the Web at:

References and resources discussed here are listed at the end of this paper.

Contents

Background

Improvements in Windows Vista

Step 1. Detects the Unresponsive GPU

Step 2. Recovers the Desktop

Running Benchmarks

Windows Vista and OCA Data

Conclusion

Resources

Disclaimer

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, email address, logo, person, place or event is intended or should be inferred.

© 2007 Microsoft Corporation. All rights reserved.

Microsoft, Windows, and Windows Vista are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Background

The last ten years have seen major evolutions in the graphics processing units (GPUs) in the graphics-rich Windows® platform. Simultaneously, the increase in graphics horsepower that began after the release of Windows 2000 has been accompanied by increased graphics driver-related crashes, adversely affecting the stability of Windows systems. This increasing stability problem encompasses every GPU vendor’s hardware.

Windows XP introduced a “timeout” during which Windows detected hung GPUs and produced a “bugcheck 0xEA” crash report. Soon, this crash report became the most widely encountered report in the Microsoft Online Crash Analysis (OCA) data. To curb data loss related to GPU hangs, Windows XP Service Pack 1 introduced a significant design change by reverting to the basic Microsoft-provided video graphics array (VGA) driver when this hang situation occurred. End users were given an opportunity to save their data but were still prompted to reboot as soon as possible to regain a fully functional Windows desktop.

Microsoft OCA data shows that 20 percent of all Windows crashes are due to GPU hardware hangs or instabilities and that the GPU is the largest device category reporting crashes (ahead of categories such as network adapters, hard drives, and USB cameras). Avoiding GPU hangs represents the most significant opportunity to improve the user experience by increasing system stability. This was a key factor in the redesign of the Windows graphics software infrastructure in Windows Vista®.

Improvements in Windows Vista

A key goal for Windows Vista is to make the graphically rich Windows desktop consistently highly interactive to end users. Usability research and industry standards indicate that 2 to 5 seconds is the maximum acceptable response time for ensuring a continuous workflow and a satisfactory user experience. If desktop activity refreshes less frequently than every 2 seconds, a poor end-user experience results.

If graphics hardware causes system hangs, Windows Vista attempts to ameliorate the end-user experience by resetting the GPU and restoring the desktop to the state it was in when the hang occurred. The goals are that non-graphics–related applications do not lose data and that well-behaved graphics applications continue to function after the recovery. Microsoft OCA data reveals that Windows Vista successfully restores the desktop in 93 percent of cases of timeout detection.

Figure 1. Breakdown of successful system recovery versus crashes

To restore the desktop, Windows Vista takes these steps, as discussed in the following sections:

1. Detects the unresponsive GPU.

2. Recovers the desktop.

Step 1. Detects the Unresponsive GPU

Windows Vista continually monitors the work that is being submitted to the GPU. The GPU processes this work and then displays it on the user’s screen.

On most GPUs, graphics operations take no more than a few milliseconds to complete. Occasionally, however, the GPU encounters a bad hardware state (such as corrupted data in the hardware registers) and can make no further progress. In this situation, the GPU hangs and the screen display freezes. Often users cannot move the mouse, and even typing CTRL+ALT+DEL produces no response. If the problem is not corrected, the system remains in this state. If end users receive no audio feedback from movie or game play, they are likely to hold down the power button to restart the system. Unfortunately, this causes end users to lose all unsaved work.

Windows Vista allows the GPU 2 seconds to finish a work item that should typically take a few milliseconds. If the work item is not completed within 2 seconds, Windows Vista initiates steps to reset and recover the GPU.

Important: Detection of a GPU hang is independent of the level of system activity, such as disk activity and paging. System activity and CPU workload neither trigger a timeout detection and recovery (TDR) event nor increase the chances of inducing one.
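
The detection rule described above can be illustrated with a small simulation. The following Python sketch is not how Windows Vista implements detection; the GpuWatchdog class and its method names are hypothetical and exist only for illustration. It records when each work item is submitted and reports any item that remains outstanding for more than 2 seconds. The check depends only on the elapsed time of the GPU work item, not on CPU or disk activity.

```python
from dataclasses import dataclass, field

TIMEOUT_SECONDS = 2.0  # the 2-second allowance described in the text


@dataclass
class GpuWatchdog:
    """Toy model of timeout detection (hypothetical, for illustration only)."""
    timeout: float = TIMEOUT_SECONDS
    pending: dict = field(default_factory=dict)  # work item id -> submit time

    def submit(self, item_id, now):
        """Record the time at which a work item was handed to the GPU."""
        self.pending[item_id] = now

    def complete(self, item_id):
        """Forget a work item once the GPU reports it as finished."""
        self.pending.pop(item_id, None)

    def timed_out(self, now):
        """Return ids of work items outstanding longer than the timeout."""
        return sorted(i for i, t in self.pending.items() if now - t > self.timeout)


watchdog = GpuWatchdog()
watchdog.submit("draw-1", now=0.000)
watchdog.complete("draw-1")           # finished within milliseconds
watchdog.submit("draw-2", now=0.010)  # never completes: the simulated hang

print(watchdog.timed_out(now=1.5))  # nothing has exceeded 2 seconds yet
print(watchdog.timed_out(now=2.5))  # draw-2 has now been pending too long
```

In this model, an item reported by timed_out would be the trigger for the reset and recovery steps described in the next section.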

Step 2. Recovers the Desktop

After Windows Vista detects a GPU hang, it attempts to restore the original state of the desktop. Restoring a fully functional desktop to end users avoids loss of non-graphics data and also avoids the need for a reboot. TDR events are automatically logged in the background as a “0x117” TDR, which contributes to the graphics driver OCA counts.

Running Benchmarks

Graphics hardware might require a long time to complete certain intensive operations such as running benchmarking applications, which are common only among system reviewers and avid gamers. During benchmarks with advanced tests, high resolution, and high-quality visual effects, low-end GPUs might require more than 2 seconds to complete a command. If end users are watching a video or playing a game at less than one frame per second, the display is reduced to a stuttering slideshow. Even though the GPU is not actually hung, it is making little progress, which leads to the generation of a TDR event and end users seeing the TDR notification.
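
The relationship between frame rate and the 2-second allowance can be made concrete with a short calculation. The Python sketch below assumes, purely for illustration, that each frame corresponds to a single GPU work item; real workloads split frames into many commands, so this is only a rough guide. The function names are hypothetical.

```python
TDR_TIMEOUT_SECONDS = 2.0  # default allowance described in this paper


def seconds_per_frame(frames_per_second):
    """Time the GPU spends on one frame at the given frame rate."""
    return 1.0 / frames_per_second


def exceeds_timeout(frames_per_second, timeout=TDR_TIMEOUT_SECONDS):
    """True if one frame's GPU work would outlast the timeout.

    Assumption (not from the paper): one frame equals one GPU work item.
    """
    return seconds_per_frame(frames_per_second) > timeout


for fps in (30.0, 1.0, 0.4):
    print(fps, seconds_per_frame(fps), exceeds_timeout(fps))
```

Under this assumption, a benchmark running at one frame per second already looks like a slideshow but stays inside the allowance, whereas one running below 0.5 frames per second would exceed it.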

Best Practice: Test benchmark applications by using the registry keys for GPU hang detection, as documented in “Timeout Detection and Recovery of GPUs through WDDM.” Altering these registry keys does not affect any other functionality in Windows Vista because they are intended only for GPU timeout detection. System manufacturers can run benchmarking or similar applications by altering these registry values and reverting to default values for shipping production systems. Graphics IHVs must not work around this problem by disabling features in their graphics drivers. Although benchmark applications might result in a false-positive hang detection, normal applications run by end users should not. New benchmark applications should be designed so that they dynamically adjust to the limitations of the GPU.
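
As a minimal sketch of the alter-then-revert workflow described above, the following Python code generates the text of two .reg files: one that lengthens the timeout for a benchmarking run and one that removes the override so that the default applies again. This paper does not name the registry values. The key path and the TdrDelay value name (a REG_DWORD holding the timeout in seconds) are taken from Microsoft's TDR registry key documentation and should be confirmed against the referenced paper for the target system. The reg_file helper and the 10-second figure are hypothetical.

```python
# Key path and value name come from Microsoft's TDR registry documentation,
# not from this paper; verify them before use.
GRAPHICS_DRIVERS_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\GraphicsDrivers"
)


def reg_file(values):
    """Build .reg file text; a value of None deletes the entry (default applies)."""
    lines = [
        "Windows Registry Editor Version 5.00",
        "",
        f"[{GRAPHICS_DRIVERS_KEY}]",
    ]
    for name, value in values.items():
        if value is None:
            lines.append(f'"{name}"=-')  # .reg syntax for removing a value
        else:
            lines.append(f'"{name}"=dword:{value:08x}')  # DWORD in hexadecimal
    return "\n".join(lines) + "\n"


benchmark = reg_file({"TdrDelay": 10})   # hypothetical 10-second allowance
shipping = reg_file({"TdrDelay": None})  # remove the override before shipping

print(benchmark)
print(shipping)
```

Generating the files rather than editing the registry directly keeps the change reviewable and makes the revert step explicit for production images.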

Windows Vista and OCA Data

Microsoft OCA data for Windows Vista shows that, in most TDR reports, the graphics hardware is truly hung; that is, it cannot process any additional commands and is not refreshing the screen. Allowing additional time to the GPU does not solve the problem.

Figure 2 shows the most common end-user scenarios that trigger the hang recovery process. These scenarios involve applications that do not demand more than the allotted time for execution on the GPU, so hung GPU problems in these scenarios must be diagnosed and fixed. A number of TDR notifications occur when the graphics hardware is not making progress quickly enough, even though it is functioning properly. Although we cannot measure this number directly, we believe it is low for two reasons: the class of application or user activity that commonly causes TDRs and the fact that fast GPUs cause just as many TDRs as slow GPUs do.

Figure 2. Top user activities triggering a GPU hang

Conclusion

Features in Windows Vista significantly improve the end-user experience by recovering from GPU hang conditions and returning control of the desktop to the user. Data loss for the user is avoided, and most applications continue to function after the desktop has been restored.

Resources

Timeout Detection and Recovery of GPUs through WDDM

Provides technical details about GPU hang detection in Windows Vista.

System vendors who need to validate various aspects of the system through stress testing and running powerful benchmarks can disable the timeout detection mechanism by using the registry keys discussed in this paper.

Queries

If you have questions that are not answered in this document, send an email to:
