Understanding Triple Buffering Issues on Intel Integrated Graphics in Windows

Hey there,

I’m a bit confused about the whole triple buffering situation with Intel integrated graphics on Windows. From what I know, triple buffering in OpenGL drivers for Intel integrated graphics on Windows 10 and newer uses a three-frame swapchain, which means there’s always a one-frame delay, since there’s an extra frame in flight. However, I stumbled upon this discussion (Error: Synchronization Failure - #6 by mariokleiner), and it seems like things might not be that straightforward.

I’m really curious about why triple buffering, combined with VSync, could cause timing issues beyond the usual one-frame delay. Can someone explain this?

I’m mainly asking because I don’t see these issues (at least I think so) using OpenGL, i.e., I get quite a long delay between trigger and actual screen change, but it seems to be constant. Just want to understand what can go wrong and make sure I’m not overlooking something.

I think it is confusing because even the aim of triple-buffering is ill-defined, and driver implementation details add further unknowns. The driver could even switch between different strategies on the fly, depending on whether the application at hand appears to be consistently faster than needed to serve the current refresh rate, or slower. So both safely detecting triple-buffering purely in software and safely working around it is close to impossible.

Normally, in OpenGL, you call swapbuffers(), possibly followed by a dummy graphics operation, followed by glFinish(), followed by taking the flip timestamp. This assumes that swapbuffers() requests a swap of the just-created frame to the foreground (meaning to the buffer that is physically sent out to the monitor), and that glFinish() waits until this request has actually been served.

But with triple-buffering, the “foreground” is just an intermediate buffer, which might or might not become the real output buffer sometime later; and glFinish() only waits until the swap to the intermediate buffer has been served, which can happen immediately or in sync with a hardware VSYNC (not necessarily the one that puts the just-prepared frame into the foreground, though). Which one it is depends on the implemented triple-buffer strategy and on the timing history that determines the state and availability of the involved buffers. So there are many unknowns, which makes it so difficult to deal with triple-buffering. Just my 2 cents.
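To make that sequence concrete, here is a minimal MATLAB/Octave + Psychtoolbox sketch. Treat it as illustrative only: in real scripts you would just call Screen('Flip') and let it perform this sequence internally.

```
% Illustrative sketch only: the classic "swap -> dummy op -> glFinish ->
% take timestamp" logic described above is what Screen('Flip') performs
% internally on a conventional double-buffered setup.
AssertOpenGL;
win = Screen('OpenWindow', max(Screen('Screens')), 0);
Screen('FillRect', win, 255);        % render this frame's stimulus
Screen('DrawingFinished', win);      % submit all pending drawing commands
% Screen('Flip') now requests the buffer swap, issues a dummy graphics
% operation, waits for completion (glFinish-style), then takes the CPU
% timestamp. With triple buffering, that wait may already end when the
% swap into an intermediate buffer completes, so the returned timestamp
% need not correspond to actual scanout of this frame.
vbl = Screen('Flip', win);
fprintf('Reported flip timestamp: %.6f s (GetSecs clock).\n', vbl);
sca;
```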


@_qx1147 explains the meat of it quite well. Let me add a few things, hopefully without being too repetitive:

Triple-buffering as a hazard can be generalized to n-buffering with any n other than 2. Different operating systems, or different versions of the same OS, can use a different n, or even change n dynamically depending on various factors.

Even with pure triple-buffering, there isn’t necessarily one frame of extra lag. It may, e.g., not happen on the first flip after a multi-frame pause: the display system is idle and can flip your current frame to the display right away, as your just-flipped frame is at the head of the wait queue. If your application renders and flips faster than video refresh, the triple buffering kicks in and adds a one-frame delay, as “flipped” buffers pile up in the wait queue. If your application slows down or pauses a tiny bit at some point, the wait queue may run empty again and the next flip will again display immediately. So you might alternate between one frame of extra lag and no extra lag.

Given that the timestamps you can collect will most likely not correspond at all to what really happened on the display, and that you may use these timestamps to schedule presentation of future stimulus images, the whole system can enter a kind of complex feedback cycle where what you want, and what gets reported to you as stimulus onset time, is more than one frame off from reality, even with plain triple-buffering. Modern graphics drivers also judge GPU load by how many frames are pending in the wait queue, or by how often vblank “deadlines” were missed, to speed up or slow down the GPU (as a power-saving measure), change the n-buffering strategy, and various other things.
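A toy model may help to illustrate this queue behaviour. The following plain MATLAB/Octave sketch (no Psychtoolbox required, all numbers invented, queue depth assumed to be 2) simulates a driver-side wait queue: while the script renders faster than refresh, every frame carries one refresh of extra lag; after a single pause the queue drains and the extra lag drops back to zero.

```
% Toy simulation of a swap wait queue (assumed depth 2) behind the
% scanout buffer. All numbers are made up for illustration only.
refresh   = 1/60;                                   % 60 Hz refresh interval
renderDur = [repmat(0.010, 1, 8), 0.100, repmat(0.010, 1, 5)];  % one long pause
t = 0; nextVbl = refresh; queued = 0;               % frames waiting to scan out
for k = 1:numel(renderDur)
    t = t + renderDur(k);                           % rendering done, flip requested
    while nextVbl <= t                              % each past vblank displayed one queued frame
        queued  = max(queued - 1, 0);
        nextVbl = nextVbl + refresh;
    end
    if queued == 2                                  % queue full: the flip request blocks
        t = nextVbl; queued = queued - 1; nextVbl = nextVbl + refresh;
    end
    fprintf('frame %2d requested at %6.1f ms, %d frame(s) queued ahead of it\n', ...
            k, 1000 * t, queued);
    queued = queued + 1;                            % this frame joins the queue
end
```

Real drivers are of course far more complicated and may change strategy on the fly; the sketch is only meant to show why the extra lag is not a constant.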

In practice I’ve seen errors of up to 3 video frame durations, e.g., 50 msecs at a standard 60 Hz refresh rate.

There are also multiple sources of triple buffering. E.g., on MS-Windows the DWM compositor is the typical culprit, but the low-level display driver itself may impose its own triple buffering on top of the DWM’s. User settings can add more, depending on GPU and driver. Another source of triple buffering on MS-Windows is most kinds of hybrid-graphics setups with two graphics chips, e.g., NVidia Optimus laptops.

The DWM, like compositors on other OSes, can also decide to drop a frame from presentation completely if that is opportune for the system. In other words, a stimulus image may never be shown at all, without you knowing it. These things can easily trip up attempts to validate timing with photo diodes and similar standard approaches. I found many ways to fool myself, or get fooled by the system doing something I didn’t expect, when using photo diodes. Which is why I generally don’t trust statements from random people that they verified something with a photo diode and it was fine. Whether the photo diodes tell the truth, or just the truth one likes to hear, depends a lot on the specific software and hardware setup, the paradigm, the exact stimulus script tested, and the analysis.

On MS-Windows, with a properly configured system, a non-buggy graphics/display driver, properly written experiment software, and many other factors in place, one can get the DWM to switch to standby/bypass; then only the low-level display driver could introduce triple buffering again. Getting to this configuration depends on various rules that change over time with the OS version etc. The safest choice is a single-display setup with a standard-DPI monitor.

On MS-Windows with Intel graphics, to my knowledge, triple buffering is applied by the Intel display driver in addition to any potential DWM-based buffering, and I don’t know of a way to disable it, after hours of searching the internet, forums, Intel user support forums, the Windows registry etc. There may be some Intel chips that don’t do this in combination with some Intel driver versions on some Windows versions under some unknown conditions. Such unicorns have been spotted occasionally. But almost always, when somebody using Windows with an Intel chip reports trouble, it is the triple buffering causing it. The only machine with an Intel chip that I have, running Windows 11, always triple-buffers, and it is disruptive to my work: it makes timing tests on Windows 11 impossible for me and annoys me with constant sync failures. So you can be sure I tried a lot to get rid of it for my own benefit, to no avail.

Psychtoolbox startup timing tests are designed - and failure thresholds chosen - to usually detect such DWM or triple-buffering interference indirectly by causing the sync tests to fail, e.g., because fewer than 50 valid flip samples can be collected within the test duration, or the “stddev” of the timing samples exceeds 0.2 msecs, and other criteria I can’t remember right now. So a sync failure usually means a real failure, although an occasional sync failure, if it is the exception instead of the rule, could also mean that your system was just noisy or overloaded - false positives are unavoidable. If the welcome image with the frog wiggles during startup, you can be sure triple-buffering is active, specifically the kind caused by the Intel display driver, so that is a strong visual indication that it is game over for you. More diagnostic methods are described in our help texts.
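For illustration only (this is not the actual startup test code), one can do a crude check of the same kind by hand with Screen('GetFlipInterval'), which collects flip samples and reports how many were valid and how noisy they were; the shipped diagnostics also include scripts such as VBLSyncTest and PerceptualVBLSyncTest.

```
% Crude manual check in the spirit of the startup sync test: measure the
% flip interval from many samples and inspect sample count and stddev.
win = Screen('OpenWindow', max(Screen('Screens')), 0);
[ifi, nValid, sd] = Screen('GetFlipInterval', win, 300);   % request up to 300 samples
fprintf('Refresh %.4f ms from %d valid samples, stddev %.4f ms\n', ...
        1000 * ifi, nValid, 1000 * sd);
if nValid < 50 || sd > 0.0002
    fprintf('Suspicious timing - possible compositor or triple-buffering interference.\n');
end
sca;
```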

So all in all, if you wanted a constant one-frame lag due to constantly active triple-buffering, you would probably need to write a script that always presents stimuli faster than the video refresh rate of your monitor, without ever pausing for any reason (e.g., between blocks of trials, between trials, while waiting for an external trigger or a subject response), carefully maintaining that steady state against all the things that can cause timing variations on a Windows machine. And you would need photo diodes to verify at all times that this really is the case, fool-proofed against all the ways one can fool oneself with such a setup.
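Purely as an illustration of that workaround idea (not a recommendation), such a steady-state loop would look roughly like this: trivially cheap drawing, a flip request on every iteration, and no waits for triggers, responses or inter-trial breaks anywhere.

```
% Sketch of the "never pause" steady-state loop described above: render
% trivially fast and request a flip on every iteration, so the driver's
% swap queue never runs empty. Any wait for a trigger or a response
% breaks the steady state.
win = Screen('OpenWindow', max(Screen('Screens')), 0);
for frame = 1:600
    Screen('FillRect', win, mod(frame, 2) * 255);   % trivially cheap stimulus
    Screen('DrawingFinished', win);
    Screen('Flip', win);                            % flip as soon as possible, every refresh
end
sca;
```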

The basic underlying problem with software-based visual stimulus presentation timing is that no operating system, apart from Linux in certain configurations, has well-working programming interfaces for visual timing and timestamping, where the application can simply tell the operating system what it wants timing-wise and the OS reports back timestamps of real stimulus onset with reliability and accuracy.

On Windows and macOS we always depend on all kinds of hacks and tricks to get timing working well enough, and we have to make assumptions about how the underlying software and hardware work for these tricks to hold. Those assumptions used to be fine in the olden days, but much less so since at least around the year 2010. They are sometimes broken, and then things go sideways, e.g., whenever a desktop compositor kicks in for whatever reason, or with triple-buffering or n-buffering of any kind. Psychtoolbox is the most advanced when it comes to tricks and diagnostics, compared to all other alternatives I am aware of, but there are still plenty of ways to break it. At least most of the time our diagnostics are capable of reporting the problems, e.g., in the form of the dreaded sync failures, before harm is done.

On properly configured Linux/X11 with open-source graphics/display drivers, i.e., generally on graphics hardware other than NVidia, we have reliable APIs to tell the OS when we want something to show up on the screen, and to get reports back on whether it was shown, whether it was shown in a reliable way, and when exactly the stimulus image left the video output of your graphics card. Triple-buffering, 4-buffering, 5-buffering or such doesn’t matter. Even on Linux + NVidia we at least have more robust ways of detecting trouble when it happens, although the ability to fix or avoid trouble is far more limited.
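In Psychtoolbox terms, those reports surface through the return values of Screen('Flip'). A minimal sketch, assuming a well-behaved Linux/X11 setup where these timestamps reflect true scanout:

```
% Sketch: schedule a presentation time and read back what gets reported.
% On a properly configured Linux/X11 setup with open-source drivers these
% timestamps reflect true scanout; elsewhere they may not.
win = Screen('OpenWindow', max(Screen('Screens')), 0);
ifi = Screen('GetFlipInterval', win);
vbl = Screen('Flip', win);
twhen = vbl + 10 * ifi;                               % request onset 10 refreshes from now
Screen('FillRect', win, 255);
[vbl, onset, ~, missed] = Screen('Flip', win, twhen); % onset timestamp + deadline-miss estimate
fprintf('Requested %.4f s, reported onset %.4f s, deadline missed: %d\n', ...
        twhen, onset, missed > 0);
sca;
```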

That’s why the recommendation is to switch to Linux if one wants to use Intel graphics in a safe way.


Thank you both for the extensive explanation!

I’m in the lucky position of having a way to diagnose these things using a BioSemi EEG system and a photo sensor. Maybe I’ve just been lucky so far, but my delays between buffer flip and monitor response seem to be constant (I’m not sure what will happen under load, though).

All the things you’re describing seem like unfortunate side effects of using a relatively high-level API like OpenGL on an opaque OS like Windows. Out of interest, do you think that using something more modern and low-level, like a recent DirectX version or Vulkan, would alleviate these problems? I know that for a large project like Psychtoolbox switching graphics APIs is not feasible, but I’m curious.

Assuming you use it correctly in your context. Not saying you don’t; I don’t know you or your skills or experience. Just that I’ve spent enough time fooling myself with photo sensors and more advanced hardware methods to know there are plenty of ways to fool oneself, especially given that modern operating systems and display hardware have added many new failure modes that didn’t exist 10 or 20 years ago, and most people in the field seem to operate on mental models that might have been barely appropriate around the year 2005.

One thing that always makes me shake my head is when somebody on a forum makes the general claim that timing with system X is just fine because they tested it with a photo diode. Such tests almost never translate to anything beyond their specific restricted hardware+software setup for their specific paradigm coded with their specific script, if at all.

The type of API doesn’t have much to do with it, only the specifics of how that API is implemented on a specific OS and windowing system. Right now, OpenGL on Linux/X11/GLX is the only reliable system I’ve encountered. The relevant OpenGL timing API was created and standardized somewhere in the early 1990s; it just wasn’t implemented anywhere but on some multi-million-dollar SGI graphics supercomputers running SGI’s Unix variant IRIX, and on desktop Linux/X11. And the Linux implementation is partially so good for our use cases because I have spent over a thousand hours working on improving it since late 2009, specifically with the goal of making it as good as possible for neuroscience applications, in collaboration with other open-source Linux developers. So there was a specific strong drive and effort from my side to make Linux so strong for neuroscience.

Other OSes have made some half-baked attempts at timing in the past. I have tried using these in Psychtoolbox, and each time it ended in failure after spending many weeks. Windows Vista introduced DWM-related APIs which never worked in any meaningful or reliable way on Windows Vista, Windows 7, Windows 8, … MS gave up and disabled/deprecated these APIs somewhere around the time of Windows 8 or 8.1. Given that these DWM APIs were built on top of DirectX APIs, that doesn’t inspire confidence in anything more low-level like Direct3D/DirectX either.

macOS had some APIs which were just as unusable in every macOS version tested since OSX 10.4 or 10.5. Psychtoolbox’s hacks were always able to outperform those easily in precision, performance and reliability (and in generally not crashing for random reasons).

It always depends on what use cases the implementers of such APIs target, and all of them target far lower quality and reliability standards than what is needed for neuroscience, at least so far, on the non-Linux operating systems.

As far as the future goes:

Psychtoolbox can and does already use modern Vulkan as a display API for certain use cases where it is beneficial, e.g., HDR display support, or very high precision SDR and HDR framebuffers on Linux. OpenGL is then used for drawing/rendering/image post-processing, and the final stimulus image is handed over to our Vulkan driver for display. One can build such hybrid solutions to try to get the best of both worlds. But the display part for Linux and Windows alone took about 1000 hours of work, paid for by VESA specifically to enable HDR support in Psychtoolbox. And another >270 hours of work just for macOS, where Vulkan is implemented as a very thin wrapper on top of macOS Metal and CoreAnimation, provided by the open-source MoltenVK driver, to which I have also contributed timing-related enhancements already.

Timing-wise there is no real benefit to Vulkan right now in practice; it is actually worse than our OpenGL trickery on any OS, or than the proper Linux OpenGL timing APIs, although I found hacky tricks to abuse the OpenGL timestamping APIs on Linux to get sort-of-reasonable timing under Vulkan with Linux.

The Khronos consortium, authors and shepherds of the Vulkan spec, has a working group that is trying to specify and release a new optional timing API extension for Vulkan. I have been involved in the feedback and discussions around that extension for years, trying to make sure it is also a good fit for neuroscience applications. Unfortunately it is a very slow standardization process, involving many stakeholders, dragging out for multiple years now, starting, stopping, sometimes going a bit in circles. Judging from initial progress, I expected that thing to be finished sometime in late 2020, but now it is 2023 and I wouldn’t dare to predict with confidence whether the spec is weeks, months or even years away from release. Or how long it would take after its official release for actually usable implementations to show up on different operating systems, if at all, given that this Vulkan extension is optional, not mandatory.

On Linux + AMD/Intel graphics, I have had an open-source prototype implementation of an earlier predecessor of this spec running since early 2021. Since autumn 2021, after multiple hundreds of hours of research and development, I even have a prototype implementation of some novel fine-grained visual timing support that uses VRR displays in a very new way to allow control of stimulus timing at sub-millisecond granularity, on top of that other prototype built on an earlier spec version. But all that work is stalled indefinitely, until the new spec is out and implemented on Linux, and our funding situation is better - if ever.

So someday, hopefully sooner rather than later, Vulkan will have something new to offer, and Psychtoolbox is already more ready for it than any other software in this field. The actual quality and usability for neuroscience will depend a lot on how much OS and graphics card vendors actually care about the level of quality needed by niche fields like neuroscience, and on how much I can do about that. Which in practice means almost nothing at all on Windows, macOS or other proprietary systems, and a lot on Linux, assuming the lack of financial funding provided to Psychtoolbox by our mostly indifferent users doesn’t force me to give up long before.


I want to start by saying a big thank you for all the hard work and personal sacrifices you’ve made for the visual and cognitive neuroscience community. While I usually steer clear of using Matlab for reasons unrelated to Psychtoolbox, I truly appreciate the significant contributions you’ve made in terms of precise timing, troubleshooting OS and driver quirks, and your efforts to improve timing in the Linux ecosystem.

And while I’m certainly not in a position to make funding decisions, I will suggest considering a membership next time our lab applies for a grant.

I’m still trying to wrap my head around a couple of timing-related aspects here. Specifically, when we talk about timing issues, are we talking about the timing of the frame presentation itself, or are we more concerned with timestamping that presentation? For instance, on a non-VRR display with true double buffering, does Psychtoolbox employ any strategies or methods to reduce jitter? Or is the main problem on such a system obtaining accurate timestamps for scanout (mitigated by using a photo diode instead of timestamps/triggers)?

Psychtoolbox has also worked fine with GNU/Octave since forever, if one doesn’t like Matlab. I do most of my development and testing with Octave to make sure it does.

Wrt. timing in the Linux ecosystem, you would need PTB, though, to profit from all the visual timing improvements, as other toolkits have far too primitive and naive approaches to visual stimulation timing and to verification of correct timing. Generally, PTB’s visual stimulation is far ahead of anything else I know of, not just wrt. timing.

Both: trying to get stimuli onto the screen at the time the user’s script requests, and timestamping and problem reporting. Frame pacing, i.e., using these timestamps to decide what to show when, and calculating proper target presentation times ‘twhen’ for the Screen('Flip', ..., twhen); command, is usually left to the user’s script, as only the user’s script can know what the proper pacing for a task is. But our demos show best practices, and there are various higher-level toolkits available for PTB which do some domain-specific frame pacing. Of course the timestamps are the critical ingredient for any frame-pacing strategy, for sync between different stimulation modalities, for reaction time measurements, etc. Our Psychtoolbox/PsychDocumentation/ folder contains multiple PDFs with info and tips about frame pacing and related topics.
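As a concrete example of that division of labour, a minimal user-script frame-pacing loop, following the pattern used throughout the PTB demos, might look like this:

```
% Minimal frame-pacing sketch: the user script computes the target
% presentation time 'twhen' of the next flip from the timestamp returned
% by the previous flip.
win = Screen('OpenWindow', max(Screen('Screens')), 0);
ifi = Screen('GetFlipInterval', win);
waitframes = 6;                        % e.g., update the stimulus every 6 refreshes
vbl = Screen('Flip', win);             % establish a baseline timestamp
for i = 1:60
    Screen('FillRect', win, mod(i, 2) * 255);
    % Target half a refresh before the intended vblank so Flip locks onto it:
    vbl = Screen('Flip', win, vbl + (waitframes - 0.5) * ifi);
end
sca;
```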

Over the holidays, I had the chance to spend some time experimenting with timing on macOS/Metal, and without much hassle I can get the latency between submitting a frame and presentation down to approx. 5 ms (at 120 fps, trusting what Metal reports as the presentation timestamp), which seems pretty good - though you need to limit the number of frames in flight as well as disable triple-buffering + compositing to achieve this kind of latency.

I’m sure the same can be achieved on D3D12 and theoretically also on Vulkan (although I haven’t been able to test that yet).

In theory, by rendering into the front buffer and using beam racing or beam chasing, much smaller latencies should be possible, but that is very tricky to get right and probably not worth the hassle.

Edit: I should add that I’m aware that your experiments with presentedTime were not super successful a few years ago, but it looks like that was fixed - judging by the distribution of presentedTime timestamps (pending actual testing with a photo sensor):
Average: 8.333382173840471 ms, Std: 5.078905225148959e-10 ms (calculated from 999 deltas)