Parallel playback of multiple 4K HDR videos on different displays

MATLAB R2023b, Psychtoolbox 3.0.19, GStreamer 1.22.6, CPU: AMD Ryzen 9 5900X 12-core processor @ 3.7 GHz, RAM: 32 GB, GPU: NVIDIA GeForce RTX 3080, OS: Windows 10 Pro, Displays: two 55” LG OLED G2 (3840x2160).

Hello again,

In a previous post I asked about playback of a single 4K HDR (PQ-encoded) video file (3840x2160 at 60Hz) using Psychtoolbox-3. Now I would like to ask whether it is possible to play three 4K HDR video files with different compression rates at the same time.

All video files have a framerate between 24Hz and 60Hz. The videos are 5-10s long, have no sound, and are encoded with AV1 (which my GPU is capable of decoding). The three videos I want to play at the same time share the same content and framerate, but have different compression rates and sometimes different resolutions.

Using the latest update of Psychtoolbox-3, I am able to play the three videos in real time; however, many frames are dropped, and for high-framerate videos the judder is very visible.

Given the capabilities of the computer’s graphics card, I still expect some dropped frames, but they shouldn’t result in such noticeable judder.

Here is the script that I am using to play the three videos and to verify the number of dropped frames.

global GL;
% Get our 4K displays indexes

display_indexes = [] ;
j = 1 ;
for i = Screen('Screens') % query actual screen indexes instead of assuming 0:3
    metadata = Screen('Resolution',i) ;
    if metadata.width == 3840 && metadata.height == 2160
        display_indexes(j) = i ;
        j=j+1 ;
    end
end

right_index = 1 ;

% Define the screen no
screen_no = display_indexes(right_index) ;

try
    %Define the preferences

    Screen('Preference', 'SkipSyncTests', 0);
    Screen('Preference', 'Verbosity', 4) ;
    PsychGPUControl('SetGPUPerformance', 10) ;

    % - Prepare setup of imaging pipeline for onscreen window. This is the
    % first step in the sequence of configuration steps.
    PsychImaging('PrepareConfiguration');
    PsychImaging('AddTask', 'General', 'EnableHDR', 'Nits', 'HDR10');
    PsychImaging('AddTask', 'General', 'FloatingPoint32Bit');

    % Open the window

    [win, rect] = PsychImaging('OpenWindow', screen_no, 0, [], [], [], [], [], 0);
    AssertGLSL

    hdrProperties = PsychHDR('GetHDRProperties', win);
    display(hdrProperties);

    glActiveTexture(GL.TEXTURE0);
    AssertOpenGL;

    KbName('UnifyKeyNames');

    % OPEN MOVIE PARAMETERS 

    async = 0 ;
    preloadSecs=1 ; % We probably want to load all the video beforehand
    specialFlags1=0;
    pixelFormat=11; % 11 for HDR
    maxNumberThreads=[];
    movieOptions=[];

    % Get the movies path

    pathA='path\DevilMayCry5.mp4';
    pathB='path\DevilMayCry5_H_3840x2160.mp4';
    pathC='path\DevilMayCry5_M_3840x2160.mp4';

    % Open the videos
    [movieA, duration, fps] = Screen('OpenMovie', win, pathA,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);
    movieB = Screen('OpenMovie', win, pathB,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);
    movieC = Screen('OpenMovie', win, pathC,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);

    % Seek to start of movies (timeindex 0):
    Screen('SetMovieTimeIndex', movieA, 0);
    Screen('SetMovieTimeIndex', movieB, 0);
    Screen('SetMovieTimeIndex', movieC, 0);

    % PLAY MOVIE PARAMETERS
    rate=1;
    loop=1;
    soundvolume=0;

    % Start playback of movies.
    Screen('PlayMovie', movieA, rate, loop, soundvolume);
    Screen('PlayMovie', movieB, rate, loop, soundvolume);
    Screen('PlayMovie', movieC, rate, loop, soundvolume);

    time_to_get_frame = 0;
    frames_read = 0;

    while true

        % Return next frame in movie, in sync with current playback
        % time.

        ttgf = tic;

        %GET MOVIE IMAGE PARAMETERS
        texA = Screen('GetMovieImage', win, movieA, 1, []);
        texB = Screen('GetMovieImage', win, movieB, 1, []);
        texC = Screen('GetMovieImage', win, movieC, 1, []);

        read_time = toc(ttgf);
        time_to_get_frame = time_to_get_frame + read_time;
        frames_read = frames_read + 1;


        if( mod(frames_read,fps)==0 )
            fprintf( 1, 'Time to read a second of the movie is  = %g s\n', time_to_get_frame/frames_read*fps )
        end

        if texA>0
            Screen('Close', texA);
        end
        if texB>0
            Screen('Close', texB);
        end
        if texC>0
            Screen('Close', texC);
        end


        [keyIsDown, secs, keyCode, deltaSecs] = KbCheck();

        if(keyIsDown)

            if all(keyCode(KbName('ESCAPE')))
                Screen('PlayMovie', movieA, 0);
                Screen('CloseMovie', movieA);

                Screen('PlayMovie', movieB, 0);
                Screen('CloseMovie', movieB);

                Screen('PlayMovie', movieC, 0);
                Screen('CloseMovie', movieC);

                % throw() needs an MException object, so use error() to abort:
                error('break the experiment') ;

            end
        end

    end


catch ME
    % catch error: This is executed in case something goes wrong in the
    % 'try' part due to programming error etc.:

    Screen('CloseAll');
    fclose('all');
    Priority(0);

    sca ;

    display( 'Exception caught' );
    rethrow(ME);

    % Output the error message that describes the error:
end

Would there be any trick in Psychtoolbox that optimizes the performance if we are using multiple short video files with no sound?

Also, I wonder if using dual-GPU can improve the playback of multiple videos at the same time. I have not looked much yet, but is PTB capable of using two GPUs for decoding three videos at the same time, or would I need to specify which videos are decoded in which GPU, and which GPU is connected with which display?

Thanks,
Dounia

Two things:

  1. I don’t see any screen flip in your script. How do you flip the three screens?
  2. What makes you think this is a GPU limitation? (It may well be, but I wonder what evidence you have.)

Hello,

Thank you very much for responding to me.

I simply removed it to keep the example code less complicated. In the full script I do flip: two videos are shown on the two screens, and a dedicated button switches from those two videos to another video, again shown on both screens.

So the idea behind this experiment is that I want to show two distorted videos (distorted either by AV1 compression or by decreasing the resolution) and give the observer the choice of switching between the two distorted videos and the reference one.

However, for simplicity I did not include all of these details in the example code, since I wanted to focus on the speed of reading the textures and the number of dropped frames in that case.

From the specifications on this website, the NVIDIA RTX 3090, which is slightly faster than the RTX 3080, can only decode 849 frames per second of 1080p 8-bit AV1-encoded video. Since we have 4K 10-bit (HDR) content, we are talking about roughly 170 frames per second. So if we play three movies, we can only decode around 56 frames per second with an RTX 3090.

Because of this, I was already expecting some frames to be dropped, but not so many as to produce noticeable judder for the observers.

So that is why I was looking into whether this can be optimised, or whether adding another GPU is an option that is doable with PTB, without further modifications to the PTB code.

Thanks,
Dounia

Hi again Dounia,

Congratulations, I have picked your specific sub-problem of improving parallel playback performance of multiple 4k HDR-10 movies again for “free support on resolving a non-trivial issue”, sponsored by Mathworks Neuroscience - MATLAB and Simulink Solutions - MATLAB & Simulink. You get advice and help on this sub-topic of your more general question for free that would normally have cost your lab way more than 1000 Euros. Mathworks provided us with 5 such “Jokers” for free support in the time period October 2023 to June 2024, and you get to use the second one out of five.

On to the topic:

I agree that according to NVidia’s specs, 56 fps would be the absolute best you could hope for when playing 3 movies in parallel, and that’s a theoretical best, probably less in practice, ergo dropped frames. NVidia mentions on the page you cited that some more expensive Quadro GPUs have multiple NVDEC engines and so would allow higher performance, but all GeForce cards have only one NVDEC per graphics card.

And those multi NVDEC ones are very expensive, starting at 1500-2000 Euros for entry level models: https://developer.nvidia.com/video-encode-and-decode-gpu-support-matrix-new

But luckily, for your specific use case of short movie clips without sound, there is some cost-free or cheap trickery that might work. There are some other pitfalls though, due to dual-display use and HDR, so let’s see how this goes.

My first question would be: how do you address your two monitors? Do you use one single PTB onscreen window that spans both HDR monitors, because the two monitors are set up as one single virtual ultra-wide HDR monitor via NVidia Mosaic functionality? Your display_indexes() selection code suggests otherwise, but your movie playback code suggests there is only one onscreen window. This is mostly interesting, or rather potentially problematic, if we assume that presentation timing control matters and you want the two monitors to display the same frames of each of the two presented movies, i.e. play both in sync. So there might be challenges ahead.

Wrt. actual movie playback:

  1. Your script doesn’t have optimal parallelism when fetching frames from the different movies, as the GetMovieImage calls block waiting for the arrival of new movie frames (the 1 flag). For better performance you’d want to avoid blocking and instead poll each movie for availability of new frames, and only redraw and update the display when new movie frames arrive. The approach can be seen in PlayDualMoviesDemo.m and PlayDualMoviesTutorial.m.
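As a rough illustration of that polling approach (a sketch only, modeled on PlayDualMoviesDemo.m; the dstRectA/dstRectB destination rectangles are assumed to be defined elsewhere):

```matlab
% Poll each movie with waitForImage = 0: GetMovieImage returns immediately
% with 0 if no new frame is ready yet, a texture handle > 0 for a new
% frame, and -1 at the end of a non-looped movie.
texA = 0; texB = 0; texC = 0;
while true
    newtexA = Screen('GetMovieImage', win, movieA, 0);
    newtexB = Screen('GetMovieImage', win, movieB, 0);
    newtexC = Screen('GetMovieImage', win, movieC, 0);
    if newtexA == -1 || newtexB == -1 || newtexC == -1
        break; % end of at least one movie reached
    end

    if newtexA > 0 || newtexB > 0 || newtexC > 0
        % Keep only the most recent texture of each movie:
        if newtexA > 0; if texA > 0; Screen('Close', texA); end; texA = newtexA; end
        if newtexB > 0; if texB > 0; Screen('Close', texB); end; texB = newtexB; end
        if newtexC > 0; if texC > 0; Screen('Close', texC); end; texC = newtexC; end

        % Redraw only when something new arrived:
        if texA > 0; Screen('DrawTexture', win, texA, [], dstRectA); end
        if texB > 0; Screen('DrawTexture', win, texB, [], dstRectB); end
        Screen('Flip', win);
    end
end
```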

  2. One cheap way to try to get by with your current hardware would be to improve buffering of data during live playback. This will probably only help for non-looped playback though. In your code, set specialFlags1 = 2 + 256 to disable any decoding and output of sound (saving all computation time and overhead for that), and to disable deinterlacing (pointless for non-interlaced footage, but maybe it gets some processing block out of the way for slightly lower overhead anyway). Then set async = 4 and preloadSecs = 4 or something around that. This only works meaningfully for movies without sound like yours, and would buffer up to preloadSecs == 4 seconds of decoded movie data internally. You’d essentially open all movies, then at the beginning of a trial call Screen('PlayMovie', ...) on all of them, which starts video decoding into the internal buffer queue. Then wait up to preloadSecs seconds before actually starting your GetMovieImage loop to fetch+draw+display frames. This way GStreamer would fill up a reservoir of preloadSecs seconds of movie, and then you would drain it in your loop at an appropriate speed. In this case it might make sense to use the wait flag 1 in GetMovieImage as you do right now, and only draw and flip once you have new frames from each movie. Given that decoding at 56 fps is not fast enough to sustain 60 fps, the loop would drain the internal buffer queue, but possibly slowly enough / with enough headstart to make it through the movies without dropping frames or slowing down below the target 60 fps. Note that due to the async = 4 setting, playback should never drop any movie frames, but play out all frames in order. The whole playback would slow down to < 60 fps though if this approach can’t keep up.
A preloadSecs of at least 10 seconds would allow buffering the whole movie this way, but I don’t think your machine has enough RAM to do this with 3 movies of 10 seconds each, as the movies alone would need ~42 GB of RAM and your machine only has 32 GB in total.
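A minimal sketch of how the 'OpenMovie' and trial-start code could look with these settings (paths and window as in your script; the exact preloadSecs value is something to tune):

```matlab
specialFlags1 = 2 + 256; % 2 = no sound decoding, 256 = no deinterlacing
async         = 4;       % background buffering into an internal queue
preloadSecs   = 4;       % buffer up to ~4 seconds of decoded frames

movieA = Screen('OpenMovie', win, pathA, async, preloadSecs, specialFlags1, 11);
movieB = Screen('OpenMovie', win, pathB, async, preloadSecs, specialFlags1, 11);
movieC = Screen('OpenMovie', win, pathC, async, preloadSecs, specialFlags1, 11);

% At trial start: kick off decoding into the buffer queues...
Screen('PlayMovie', movieA, 1, 0, 0);
Screen('PlayMovie', movieB, 1, 0, 0);
Screen('PlayMovie', movieC, 1, 0, 0);

% ...then give GStreamer a headstart to fill its reservoir, before
% entering the usual blocking GetMovieImage + draw + flip loop:
WaitSecs(preloadSecs);
```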

  3. The slightly more expensive way, but still much cheaper than adding more graphics cards or a far more powerful and expensive graphics card, would be to upgrade your computer to at least 64 GB of RAM. Looking up the specs of your processor suggests one could upgrade it to up to 128 GB. Then all content fits into the buffer queue; you can set preloadSecs to a bit more than 10 seconds and make it through the movie without slowdown, assuming enough headstart between ‘PlayMovie’ and the start of the actual playback loop. The good thing is that movies of your kind, YUV 4:2:0 10-bit encoding, decoded with pixelFormat 11 for HDR/WCG, only take up half the texture memory of YUV 4:4:4 or RGB 4:4:4 content, due to the chroma subsampling. So you get by with 2 Bytes per color component (10-bit net color packed into 16-bit containers) * 1.5 components per pixel = 3 Bytes per pixel, i.e. 10 seconds * 60 fps * 3840 * 2160 * 2 * 1.5 ~ 13.9 GB per movie, * 3 ~ 41.8 GB for all three 10-second movies.
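The same arithmetic as a quick sanity check (note the GB vs. GiB distinction: the ~13.9 figure is gibibytes):

```matlab
secs = 10; fps = 60;          % duration and framerate per movie
w = 3840; h = 2160;           % 4K resolution
bytesPerPixel = 2 * 1.5;      % 16-bit containers, 4:2:0 => 1.5 samples/pixel
perMovie = secs * fps * w * h * bytesPerPixel;  % bytes per decoded movie
fprintf('%.1f GiB per movie, %.1f GiB for all three\n', ...
        perMovie / 2^30, 3 * perMovie / 2^30);
```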

  4. If looped playback is needed and 2 or 3 don’t do the trick well enough, you could take playback matters completely into your own hands and load the whole movies into PTB textures at the beginning of a trial. You wouldn’t use active playback by GStreamer, but prefetch everything, and then your main loop would just cycle through all textures and draw+flip them. This approach definitely needs a RAM upgrade to at least 64 GB. Follow the method in LoadMovieIntoTexturesDemo.m to prefetch everything, with a correspondingly long break between trials for loading the movies. That demo has various built-in benchmark modes to test how fast such loading can work.
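An unoptimized sketch of that prefetch idea (LoadMovieIntoTexturesDemo.m shows faster variants); shown for one movie only, assuming non-looped opening, no sound, and sufficient memory:

```matlab
[movie, ~, fps] = Screen('OpenMovie', win, pathA, 0, 10, 2 + 256, 11);
Screen('PlayMovie', movie, 1, 0, 0);

% Fetch every frame into a texture, without closing any of them:
texlist = [];
while true
    tex = Screen('GetMovieImage', win, movie, 1);
    if tex == -1
        break; % end of movie reached, all frames fetched
    end
    texlist(end+1) = tex; %#ok<AGROW>
end
Screen('PlayMovie', movie, 0);

% Trial loop: cycle through the prefetched textures at the movie
% framerate (assumes display refresh matches the movie fps). Two passes
% here as a simple looping example:
vbl = Screen('Flip', win);
for i = [1:numel(texlist), 1:numel(texlist)]
    Screen('DrawTexture', win, texlist(i));
    vbl = Screen('Flip', win, vbl + 0.5 / fps);
end
```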

If you use active playback as in items 2 and 3, you can check for dropped frames yourself by comparing the presentation timestamps pts optionally returned by [tex, pts] = Screen('GetMovieImage', win, movie, 1); against the expected delta of 1/fps. At the end of playback, dropped = Screen('PlayMovie', movie, 0) will also return a dropped-frame count, based on comparing expected vs. actual pts. This is the same number as printed by PTB into the Matlab window if dropped frames are detected at the end of playback.
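Sketched for one movie (a lastpts variable is assumed to be initialized to -1 before the loop):

```matlab
% Inside the fetch loop: compare successive presentation timestamps.
[tex, pts] = Screen('GetMovieImage', win, movieA, 1);
if lastpts >= 0 && (pts - lastpts) > 1.5 / fps
    fprintf('Probable frame drop: pts jumped by %.1f ms\n', 1000 * (pts - lastpts));
end
lastpts = pts;

% At end of playback: PTB's own dropped-frame count.
droppedA = Screen('PlayMovie', movieA, 0);
fprintf('Movie A: %i frames dropped during playback.\n', droppedA);
```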

I’d first try approach 2 as the cheap solution; if that isn’t good enough, upgrade RAM to at least 64 GB and try 3, or 4 for full control, especially for looped playback.

There are also hybrid approaches between 3 and 4 possible to try to shorten loading time of movies = wait time between trials, but it makes the code more involved.

So these are some things to try. However, the use of multiple displays and HDR may cause some more complications and timing issues, independent of all the above, which just relates to optimizing decoding and playback of demanding movie content, and would apply to any movie content and pixelFormat 11 playback.

Once you choose real HDR display, Psychtoolbox will switch to its Vulkan display backend for driving HDR display devices, instead of its standard OpenGL display backend, at least on MS-Windows. This comes with additional unavoidable overhead. My recent Psychtoolbox tests with Matlab R2023b showed another problem with Vulkan display on MS-Windows with NVidia graphics cards: flip completion is reported one frame too early, apparently an NVidia Vulkan graphics/display driver bug on MS-Windows. This means wrong visual stimulus onset timestamps, but ironically it could help slightly with performance in your case if one doesn’t care about perfect visual timing. So to separate causes, it could be useful to first test all this without the ‘EnableHDR’ PsychImaging task, i.e. with the display in SDR mode.

More importantly, on dual-display setups you can run into the problem of the video refresh cycles of both monitors not being synchronized; a difference in refresh timing or phase of the display scanout cycles can cause all kinds of additional timing stutter and lead to more dropped frames than you would expect. We do have GraphicsDisplaySyncAcrossDualHeadsTest() as a test of synchronized display scanout for MS-Windows. E.g., GraphicsDisplaySyncAcrossDualHeadsTest([2,3]) would test sync between PTB screens 2 and 3. It could be that “NVidia Mosaic” mode helps with that, if it is supported by NVidia consumer GPUs on Windows. I certainly never tested whether it exists or works with PTB, as we don’t generally recommend NVidia hardware for optimal use with PTB. If it worked, one could turn both HDR monitors into one ultra-wide virtual HDR monitor, open one onscreen window on that monitor, which spans both physical monitors, and hopefully the NVidia driver would synchronize display scanout across the monitors to avoid some timing judder. But as I said, this was never tested with PTB on Windows; there may be problems wrt. timing, or it may not work with HDR.

Update: Dual-display mode should be workable on GeForce class:

I just tried under Windows 10 on my GeForce GTX 1650, so this is how to enable it:

  1. Open NVidia control panel → “Configure Surround, PhysX”
  2. Check the “Span displays with surround” checkbox.
  3. Press the “configure surround button” and set up your two HDR monitors to get unified into one virtual monitor with 60 Hz refresh rate and 7680 x 2160 pixels resolution.
  4. Apply settings etc.
  5. Start Matlab. Then run: PerceptualVBLSyncTest([],[],[],[],300, 0, 1)

You should get a window spanning both HDR monitors, with the slightly jittering horizontal yellow line around half-way down the display, and close to it a horizontal tear-line, as we intentionally provoke tearing flips. What’s important is that the tear-line is at the same vertical position on both monitors → this tells you their refresh cycles are properly synchronized, which is what you want. The ESCape key ends the test, or it ends by itself after 300 seconds.

Repeat PerceptualVBLSyncTest([],[],[],[],300, 1, 1) with vsync, to confirm homogeneous grayscale flicker without tear-line. Or PerceptualVBLSyncTest([],4,[],[],300, 1, 1) to confirm dual-display stereo works.

You can also run PerceptualVBLSyncTest([],[],[],[],300, 0, 1, 1) to repeat the test under Vulkan display backend.

In my test, the sync worked as desired, and also HDR mode worked across both displays, so if that reproduces on your setup, at least dual-display HDR related timing problems should not be an issue. And you could (ab)use the stereo mode 4 to draw the separate movies into the left / right display (== left/right eye “stereo setup”), as a small convenience.

End of update.

We do have some special optimizations (more of a dirty hack actually, but it seems to work) for dual-display stereo HDR-10 in Psychtoolbox for Linux with suitable AMD graphics hardware, which could help with such dual-display HDR setups. You’d need a sufficiently powerful AMD RDNA2 graphics card for that however, and it has some limitations; help PsychImaging, section UseStaticHDRHack, explains that procedure. But maybe let’s first stick with your existing Windows + NVidia setup and see how far you get.

Wrt. the use of multiple graphics cards: it depends a lot on the use case and specific hardware configuration whether Psychtoolbox can deal well with multiple GPUs or not. For your specific use case of multiple HDR displays, you absolutely want PTB to use only one GPU for driving the displays. In theory you could use multiple GPUs for the hardware-accelerated video decoding, because PTB doesn’t care which GPU the decoded video frames in RAM come from. But in practice, as far as I can see from skimming GStreamer code, the hardware decoder pipeline won’t automatically distribute movie decoding workload across different GPUs, and the GPU drivers probably won’t do this either, so you would probably still end up with one overloaded GPU and another completely idle GPU with unused decoders. So if you wanted to throw more hardware at the problem, it would be far more advisable to buy one more powerful (and way more expensive!) GPU with multiple hardware decoders, because NVidia docs suggest that decoding of multiple movies gets automatically distributed and load-balanced across multiple NVDEC video decoding engines. But the approaches pointed out above are possibly good enough for your purpose and free of cost, or way cheaper. Also, memory bandwidth and overall latency could become a limiting factor with 3 simultaneous movies at these resolutions and framerates, so a GPU with more decoders may not necessarily help as much.

A few other things wrt. your script, unrelated to the actual problem, just some redundant code:

  1. If you use PsychDefaultSetup(2); at the very beginning of the script, you can drop KbName('UnifyKeyNames');, AssertOpenGL and AssertGLSL. The latter two come too late in your script to help anyway: AssertOpenGL needs to go before the first Screen command, and PsychImaging would already have failed to set up HDR if AssertGLSL had anything to detect.

  2. The Screen('SetMovieTimeIndex', movie, 0); calls are redundant, because time index 0 is where every movie starts after ‘OpenMovie’.

  3. I don’t know what the global GL and glActiveTexture(GL.TEXTURE0); are supposed to do in your sample code? Should be pointless / redundant.

  4. [win, rect] = PsychImaging('OpenWindow', screen_no, 0); would be enough, as you leave all other parameters at their defaults anyway.

  5. PsychGPUControl('SetGPUPerformance', 10) doesn’t hurt, but doesn’t do anything on NVidia gpu’s either atm., it is only implemented for AMD gpu’s.

  6. PsychImaging('AddTask', 'General', 'FloatingPoint32Bit'); doesn’t hurt, to document explicitly what precision you want to use for your study. It is implied though by ‘EnableHDR’, so not strictly technically necessary.
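Putting items 1-6 together, the setup part of the script could shrink to something like this sketch:

```matlab
PsychDefaultSetup(2); % implies AssertOpenGL and KbName('UnifyKeyNames')

PsychImaging('PrepareConfiguration');
PsychImaging('AddTask', 'General', 'EnableHDR', 'Nits', 'HDR10'); % implies 32 bpc float
[win, rect] = PsychImaging('OpenWindow', screen_no, 0);
```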

So far until here,
-mario


Hello Mario,
Thank you very much for your detailed response and for all the suggestions and solutions that you provided me.
I am truly sorry for the late response. I completely missed the notification of your response because of the Christmas break.

In the first place, I wished to use the dual-display mode, but I could not use the Nvidia surround while also keeping the HDR mode enabled on Windows.

Since you said you have done it, I will test it again and try to do it myself.

Thus, because I could not use the dual-display mode, I drove the monitors separately.
I did not show the full extent of the code in the example since I was trying to briefly explain what I did.
I was loading the first test video’s frame on one display and then the second test video’s frame on the second display. This method did introduce a very small delay between the two monitors (since I couldn’t flip both frames at exactly the same time), but overall that delay was not noticeable to the subjects in the experiment. So the solution, even if not perfect, was good enough for our specific use case.

I have done that, and it did improve the performance. I was able to successfully play all three HDR 4K high-framerate videos at the same time and in sync. Even though my experiment used different distorted videos with varying bitrates, the full experiment ran very smoothly, with no noticeable lag. A small number of frames were dropped for each video, but so few that it was not noticeable and did not cause any issues with the data we were collecting. Also, the dropped frames were similar between the three videos, so we did not have any synchronization issues throughout the experiment.

Thank you very much for this suggestion. As you said, I could not really load all the videos into memory before the start of playback because of memory limits, but I was able to make it work by preloading only a few seconds, accepting a very small number of dropped frames that were not noticeable, which led to smooth playback of all videos and a smooth run of the experiment.

I have not had the opportunity to try all of the suggestions, since approaches 1 and 2 were enough for my specific use case and for the experiment itself. However, I will definitely use all of these suggestions to improve my future experiments with higher-resolution/framerate HDR videos.

This completely makes sense. We were wondering if it is more beneficial for the lab to add more GPUs or upgrade the GPU. Thank you for your input; it is incredibly helpful.

I have removed most of these. I tested multiple parameters before to ensure that none of these were causing any issues of playback or limiting the playback of the videos.

Again, I cannot thank you enough for all the help throughout this process and this whole playback issue. I can confidently confirm that all issues were resolved and that the full experiment ran successfully and smoothly, with no apparent issues.

Thank you very much.

Best Regards,
Dounia

Great that it worked out. I’ve removed that support membership token, as you won’t need it. Mathworks pays for this support incident.

Best,
-mario
