Real-time playback of 4K HDR Videos

Windows 10, MATLAB R2023b, Psychtoolbox 3.0.19, GStreamer 1.22.6, GPU: NVIDIA GeForce RTX 3080, Display: 55” LG OLED G2 (3840x2160)

Hello,

I wrote a small script that plays a 4K HDR (PQ-encoded) video file (3840x2160 at 60 Hz) using Psychtoolbox-3. However, the video stutters and Psychtoolbox cannot maintain the frame rate.

After further investigation, I found that it was not a GPU issue, since GStreamer could decode the video in real time, and it was also not a video complexity issue, since Psychtoolbox could play an HD video of higher complexity in real time without any problem.

I inspected the code provided in Psychtoolbox (PsychMovieSupportGStreamer.c), and I think the main issue is that Psychtoolbox fetches frames from GStreamer into a CPU buffer (using the GstSample class, in line 2505) instead of getting GL textures directly from GStreamer on the GPU via the GstGLBuffer and GstGLMemory classes. Each frame is therefore transferred back from CPU memory to the GPU when the texture is uploaded.

Did anyone have a similar issue and find a workaround for it?
Is it possible to play frames from GStreamer without the need for GPU-CPU-GPU memory transfers?

Thanks,
Dounia

Example script and representative sample movie file would be needed. How does our own PlayMoviesDemo fare? What framerate does it maintain? What are the specs of your PC? How long are your movie clips? Do they need sound, or just video?

How did you find any of this out? Details!

While that is true, as GStreamer’s GL support was far from ready when we would have needed it, I wouldn’t expect that in itself to be a huge drain on performance. Even with download and upload again, we are only talking about 2.78 GB/sec of bandwidth per movie * 3 movies = 8.34 GB/sec, and even PCIe 3.0 16x buses, introduced in the year 2010, can handle over 15 GB/sec, so they should be able to deal with this, barring any significant operating system or driver bugs and inefficiencies.

The question is more: what exactly do you want to achieve, and can that be achieved on your setup?

Hello,

Thank you very much for responding to me.

So first, I think it’s better to give some context about what I’m trying to do with Psychtoolbox. I want to run a psychophysical experiment where I show the observer two distorted 4K HDR videos on two LG OLED G2 displays (on repeat). I also want an option where the observer can switch from viewing the distorted videos to the reference video. In summary, I want to play three 4K HDR videos (at 60 Hz) at the same time, and because the videos have the same content at different distortion levels, I also want to play all three in synchronisation.

All videos are between 5 and 10 seconds, with no sound.

The PC specs are: CPU: AMD Ryzen 9 5900X 12-Core processor 3.7GHz, RAM: 32GB, GPU: NVIDIA GeForce RTX 3080, OS: Windows 10 Pro.

I have the code for the whole experiment, but for simplicity I have included a simplified version of it (in this code I am not drawing the textures, just measuring how much time Psychtoolbox takes to decode and return a texture). Some sample movie files can be found in this drive folder.

global GL;
% Get our 4K displays indexes

display_indexes = [];
j = 1;
for i = Screen('Screens') % Query the actual set of screens instead of assuming 0:3
    metadata = Screen('Resolution', i);
    if metadata.width == 3840 && metadata.height == 2160
        display_indexes(j) = i;
        j = j + 1;
    end
end

right_index = 1 ;

% Define the screen no
screen_no = display_indexes(right_index) ;

try
    %Define the preferences

    Screen('Preference', 'SkipSyncTests', 0);
    Screen('Preference', 'Verbosity', 3);
    PsychGPUControl('SetGPUPerformance', 10);

    % Needed before opening the onscreen window, so that the GL constants
    % struct and low-level gl* calls (glActiveTexture below) are available:
    InitializeMatlabOpenGL;

    % - Prepare setup of imaging pipeline for onscreen window. This is the
    % first step in the sequence of configuration steps.
    PsychImaging('PrepareConfiguration');
    PsychImaging('AddTask', 'General', 'EnableHDR', 'Nits', 'HDR10');
    PsychImaging('AddTask', 'General', 'FloatingPoint32Bit');

    % Open the window

    [win, rect] = PsychImaging('OpenWindow', screen_no, 0, [], [], [], [], [], 0);
    AssertGLSL;

    hdrProperties = PsychHDR('GetHDRProperties', win);
    display(hdrProperties);

    glActiveTexture(GL.TEXTURE0);
    AssertOpenGL;

    KbName('UnifyKeyNames');

    % OPEN MOVIE PARAMETERS 

    async = 0 ;
    preloadSecs=1 ; % Buffer 1 second of video; a setting of -1 asks GStreamer to try to buffer the whole movie
    specialFlags1=0;
    pixelFormat=11; % 11 for HDR
    maxNumberThreads=[];
    movieOptions=[];

    % Get the movies path

    pathA='path\DevilMayCry5.mp4';
    pathB='path\DevilMayCry5_H_3840x2160.mp4';
    pathC='path\DevilMayCry5_M_3840x2160.mp4';

    % Open the videos
    [movieA, duration, fps] = Screen('OpenMovie', win, pathA,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);
    movieB = Screen('OpenMovie', win, pathB,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);
    movieC = Screen('OpenMovie', win, pathC,  async, preloadSecs, specialFlags1, pixelFormat, maxNumberThreads, movieOptions);

    % Seek to start of movies (timeindex 0):
    Screen('SetMovieTimeIndex', movieA, 0);
    Screen('SetMovieTimeIndex', movieB, 0);
    Screen('SetMovieTimeIndex', movieC, 0);

    % PLAY MOVIE PARAMETERS
    rate=1;
    loop=1;
    soundvolume=0;

    % Start playback of movies.
    Screen('PlayMovie', movieA, rate, loop, soundvolume);
    Screen('PlayMovie', movieB, rate, loop, soundvolume);
    Screen('PlayMovie', movieC, rate, loop, soundvolume);

    time_to_get_frame = 0;
    frames_read = 0;

    while true

        % Return next frame in movie, in sync with current playback
        % time.

        ttgf = tic;

        %GET MOVIE IMAGE PARAMETERS
        texA = Screen('GetMovieImage', win, movieA, 1, []);
        texB = Screen('GetMovieImage', win, movieB, 1, []);
        texC = Screen('GetMovieImage', win, movieC, 1, []);

        read_time = toc(ttgf);
        time_to_get_frame = time_to_get_frame + read_time;
        frames_read = frames_read + 1;


        if( mod(frames_read,fps)==0 )
            fprintf( 1, 'Time to read a second of the movie is  = %g s\n', time_to_get_frame/frames_read*fps )
        end

        if texA>0
            Screen('Close', texA);
        end
        if texB>0
            Screen('Close', texB);
        end
        if texC>0
            Screen('Close', texC);
        end


        [keyIsDown, secs, keyCode, deltaSecs] = KbCheck();

        if(keyIsDown)

            if all(keyCode(KbName('ESCAPE')))
                Screen('PlayMovie', movieA, 0);
                Screen('CloseMovie', movieA);

                Screen('PlayMovie', movieB, 0);
                Screen('CloseMovie', movieB);

                Screen('PlayMovie', movieC, 0);
                Screen('CloseMovie', movieC);

                % throw() needs an MException object, not a plain string:
                ME = MException('experiment:abort', 'break the experiment');
                throw(ME);

            end
        end

    end


catch ME
    % catch error: This is executed in case something goes wrong in the
    % 'try' part due to programming error etc.:

    Screen('CloseAll');
    fclose('all');
    Priority(0);

    sca ;

    display( 'Exception caught' );
    rethrow(ME);

    % Output the error message that describes the error:
end

To be able to find the main issue, I have tried different videos. I will summarise my findings here:

  • A 4K HDR video at 60 fps (DevilMayCry5.mp4 file in the drive folder) cannot be played in real time (frames are not dropped, so playback falls behind instead). I have also tried the PlayMoviesDemo from Psychtoolbox, and the issue is the same. This is the output of the code:

ITER=1::Movie: path\DevilMayCry5.mp4 : 10.000000 seconds duration, 60.000000 fps, w x h = 3840 x 2160…
Elapsed time 58.436810 seconds, for 2543 frames. Average framerate 43.517091 fps.

  • Three 4K HDR videos at 60 fps (DevilMayCry5.mp4, DevilMayCry5_H_3840x2160.mp4, DevilMayCry5_M_3840x2160.mp4 files in the drive folder) can be decoded in real time using GStreamer (or even when playing them in a web browser in fullscreen mode). Here is the code I used to decode them with GStreamer:
#include <gst/gst.h>
#include <time.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    GstElement* pipeline1;
    GstElement* pipeline2;
    GstElement* pipeline3;

    GstBus* bus1;
    GstBus* bus2;
    GstBus* bus3;

    GstMessage* msg1;
    GstMessage* msg2;
    GstMessage* msg3;

    GMainLoop* loop;

    double time_init; 
    double time_diff;


    /* Initialize GStreamer */
    gst_init(&argc, &argv);
    loop = g_main_loop_new(NULL, FALSE); 

    /* Create the elements */
    pipeline1 = gst_parse_launch("playbin uri=file:///path/DevilMayCry5.mp4", NULL);
    pipeline2 = gst_parse_launch("playbin uri=file:///path/DevilMayCry5_H_3840x2160.mp4", NULL);
    pipeline3 = gst_parse_launch("playbin uri=file:///path/DevilMayCry5_M_3840x2160.mp4", NULL);

    /* Start measuring the time. Note: clock() returns wall-clock time with
       MSVC on Windows, but CPU time on POSIX systems; g_get_monotonic_time()
       would be the portable choice for elapsed-time measurement. */
    time_init = (double) clock() / CLOCKS_PER_SEC;

    /* Start playing */
    gst_element_set_state(pipeline1, GST_STATE_PLAYING);
    gst_element_set_state(pipeline2, GST_STATE_PLAYING);
    gst_element_set_state(pipeline3, GST_STATE_PLAYING);

    /* Wait until error or EOS */
    bus1 = gst_element_get_bus(pipeline1);
    bus2 = gst_element_get_bus(pipeline2);
    bus3 = gst_element_get_bus(pipeline3);

    msg1 = gst_bus_timed_pop_filtered(bus1, GST_CLOCK_TIME_NONE, (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
    msg2 = gst_bus_timed_pop_filtered(bus2, GST_CLOCK_TIME_NONE, (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
    msg3 = gst_bus_timed_pop_filtered(bus3, GST_CLOCK_TIME_NONE, (GstMessageType)(GST_MESSAGE_ERROR | GST_MESSAGE_EOS));
    
    

    /* Look for errors */
    if (GST_MESSAGE_TYPE(msg1) == GST_MESSAGE_ERROR || GST_MESSAGE_TYPE(msg2) == GST_MESSAGE_ERROR || GST_MESSAGE_TYPE(msg3) == GST_MESSAGE_ERROR) {
        g_error("An error occurred! Re-run with the GST_DEBUG=*:WARN environment variable set for more details.");
    }

    /* Display the time spent to decode the videos*/
    time_diff = (double) clock() / CLOCKS_PER_SEC - time_init;
    printf("The elapsed time is %f seconds\n", time_diff); 

    /* Free resources */
    gst_message_unref(msg1);
    gst_message_unref(msg2);
    gst_message_unref(msg3);

    gst_object_unref(bus1);
    gst_element_set_state(pipeline1, GST_STATE_NULL);
    gst_object_unref(pipeline1);

    gst_object_unref(bus2);
    gst_element_set_state(pipeline2, GST_STATE_NULL);
    gst_object_unref(pipeline2);

    gst_object_unref(bus3);
    gst_element_set_state(pipeline3, GST_STATE_NULL);
    gst_object_unref(pipeline3);

    return 0;
}
  • Three 4K HDR videos at 30 fps (Pubg.mp4, Pubg_H_3840x2160.mp4, Pubg_M_3840x2160.mp4 files in the drive folder) can be played in real time; however, Psychtoolbox could not keep the full framerate, and around 2-17 frames were dropped for each file, which may cause a synchronisation problem.

  • Three HD HDR videos at 60fps (DevilMayCry5_H_1920x1080.mp4, DevilMayCry5_M_1920x1080.mp4, DevilMayCry5_L_1920x1080.mp4 files in the drive folder) can be played in real-time.

  • An HD HDR video at 60 fps with a higher complexity (~6.9 MB file size, DevilMayCry5_H_1920x1080.mp4) can be played in real time, while a 4K HDR video at 60 fps with a lower complexity (~2.53 MB, DevilMayCry5_L_3840x2160.mp4) cannot.

After testing all of these variations, I concluded that it was not an issue with decoding speed, but mostly an issue of texture size and of how many textures need to be transferred from GStreamer to Psychtoolbox. This is why I think that videos with high resolution and framerate cannot be played in real time if we rely on GPU-CPU-GPU memory transfers.

In summary, my main goal is to be able to play three 4K HDR videos at 60 Hz in real time without dropping any frames (to keep the synchronisation of the videos). From the results of GStreamer I can determine that my GPU is capable of it, but I am not sure if such an experiment is possible to do in Psychtoolbox.

Hi again Dounia,

Congratulations: because it is of general enough interest, I have picked your specific sub-problem (improving playback performance of 4k HDR-10 content via hardware accelerated video decoding on modern GPUs) for “free support on resolving a non-trivial issue”, sponsored by Mathworks Neuroscience - MATLAB and Simulink Solutions - MATLAB & Simulink. You get advice and help on this sub-topic of your more general question for free, which would normally have cost your lab well over 1000 Euros. Mathworks provided us with 5 such “Jokers” for free support in the period October 2023 to June 2024, and you get to use the first of the five.

I’ve spent a couple of hours looking at this; here are my findings on why these performance issues happen and what one could do about them. Skip ahead to the “Proposed solution” section if you want to get straight to my advice on what you should probably do instead of trying to get better performance:

Problem analysis, technical background:

So you are right that there are inefficiencies in our GStreamer movie playback engine, but not quite where you thought they are:

  • Our current movie playback engine indeed uses our own homemade conversion code to convert and upload decoded movie frames from system RAM to suitable OpenGL textures for processing and display on the gpu. It doesn’t use GStreamer’s own OpenGL post-processing path. The reason is that GStreamer didn’t have those capabilities when the current movie engine was written, and once that functionality became available in later GStreamer versions, it turned out to be insufficient for many of the more demanding and special use cases of PTB movie playback, especially when it comes to high color precision (10/12/16 bpc) content for wide color gamut (WCG) and high dynamic range (HDR) cases, but also for various special playback modes that PTB has to offer. Nowadays it would be a lot of work to convert our code to the new GL-based GStreamer elements, and my feeling is that even the latest GStreamer 1.22 is not quite up to the task of implementing all of what PTB can do in a proper manner. So while this may happen someday, now is not that day. While testing HDR playback and looking at the capabilities of various GStreamer hardware accelerated video decoding plugins on Windows and Linux, I saw, e.g., that they sometimes don’t offer interop between hw decoding and the OpenGL plugins, or they don’t offer “data stays on gpu at all times” processing at higher than 8 bpc bit depths, iow. not suitable for WCG/HDR. E.g., by default on Windows, GStreamer will use the Direct3D11/DXVA based hardware video decoder plugins, but these cannot decode to a GLMemory OpenGL texture surface, only to a D3D11Memory surface. This is fine if one has a purely MS-Windows specific playback pipeline that uses Direct3D 11 throughout, but otherwise content will always be downloaded to system RAM, even if one used the OpenGL plugins of GStreamer, iow. a gpu (D3D11) → cpu → gpu (OpenGL) copy.
It is similar on Linux when using VAAPI hardware decoding, where one can decode to a VAAPI surface, great for simple video playback, but use with OpenGL requires a gpu (vasurface) → cpu → gpu (OpenGL) copy. Iow. this bounce via system RAM is often unavoidable.

  • The good news is that while this gpu → cpu → gpu detour does cause some overhead, it is not that significant, because PCIe buses are fast nowadays, as my estimates in a previous post show. Playback via PTB will therefore be a bit slower, due to the data transfers, than a full on-gpu pipeline suitable for simple playback scenarios, e.g., a typical native video player app for your operating system, or maybe what your web browser uses if it supports hw accelerated playback. But the slowdown is not that significant in most cases, and you gain enormous flexibility for experimental paradigms via PTB’s OpenGL pipeline.

  • The actual reason why performance is so degraded is related, though: the video hardware decoders can only output frames in certain video memory encodings / layouts efficiently (or at all). Whenever there is a mismatch between what the codec provides and what PTB expects for its own internal texture conversion routines, GStreamer will insert format converters which execute in software on the cpu, and depending on source and target format/layout, the computational cost can be rather large. This is what happened here: PTB’s HDR/WCG code paths all expect a format that doesn’t sit well with hardware decoders, so most time is not spent decoding or displaying video, but converting from one format into another on the cpu! Not necessarily multi-threaded btw., so even many cores don’t necessarily help.

  • So what I did in the proposed solution below is add new/improved HDR/WCG decode shader code and other enhancements, so that Screen() should now be able to accept video frames in the most common formats/memory layouts produced by hardware decoders, at least on Windows and Linux with typical AMD/NVidia/Intel gpu’s. This avoids all the costly conversion and leaves just a small overhead for the gpu → cpu → gpu transfer, at least for common codec formats like MPEG2, H264, H265/HEVC, VP9, and, on very modern gpu’s of the latest generations, also AV1.

  • My own hardware is all a bit too old for AV1 decoding, so I could see speedups in decoding of at least 2x or more for H265 and VP9, whereas your provided sample movies, being AV1-encoded, trigger full software decoding on my machines, at framerates of about 5 - 10 fps at most.

  • Now video decoding is one thing, but the HDR processing (color space conversion as needed, and PQ encoding, all done in GLSL shaders) and the actual output to the gpu via our Vulkan display backend also add quite a bit of overhead. That’s why playback in “SDR mode”, just selecting pixelFormat 11 in Screen('OpenMovie'), can be faster than actual HDR display with our full HDR support enabled for native HDR display devices - although in somewhat wrong colors, due to our primitive HDR → SDR conversion (no clever tone mapping at all, just a BT 2020 → BT 709 color space conversion, and remapping to a more SDR-like range). Whether the slowdown due to the extra gpu overhead is significant depends on the speed of your graphics card. On my machines, the most modern of which are 5 - 7 years old, 4k HDR at 60 fps is often not possible; even on Linux with some special HDR-10 opmode, the overhead pushes it down from 60 fps to 40 fps. But you have a much faster machine and gpu (more than 4x, afaics), so you may fare better.

One way to test which decoding path is likely used (software vs. hardware), and whether precision is retained as wanted, is to run with Screen('Preference', 'Verbosity', 4) and look for PTB-INFO messages about source and sink color spaces and video encoding. Obviously you’d want both sink and source colorspace to be bt2100-pq, so as not to lose actual HDR/WCG somewhere and mess up the color spaces, and 10 bpc output precision instead of degradation to 8 bpc.

In general, seeing a pixel encoding of P010_10LE, P012, P016, or something else with a P prefix suggests hardware video decoding and efficient image transfer. Something like I420_10LE or I420_something suggests that software decoders are used or that some inefficient conversion path is chosen. This is because YUV semi-planar formats (P) are usually far more efficient to decode and transfer to OpenGL than YUV planar formats (I). Packed pixel formats a la RGBA16 or ABGR16 retain all the desired precision, but are usually a death sentence for performance.

So verbosity level 4 is a good spot check for what happens.

If you want a detailed view of how the GStreamer decoding pipeline looks, from reading the file through decoding + post-processing to PTB, you can use PsychTweak('GStreamerDumpFilterGraph', '/tmp/') as the first command after launching Matlab - before any movie playback - to let the engine dump a GraphViz .dot file into the /tmp/ folder. The same can be achieved with your own test program or command line GStreamer pipelines by setting the environment variable GST_DEBUG_DUMP_DOT_DIR=/tmp.

The xdot utility on Linux can read these .dot files and display a nice graph visualizing the whole pipeline, so you can see exactly which codecs are used and which format conversions happen. On MS-Windows you’ll likely find d3d11av1dec or d3d11h265dec, on Linux vaapih265dec or vaapiav1dec, for typical hardware decoders, versus avdec_h265, av1dec or dav1d for pure software decoders.

Proposed solution for testing by you:

I have built new experimental Screen() mex files for MS-Windows 10 and later (tested by myself with GStreamer 1.22.5 on AMD Ryzen 5 2400G “Raven Ridge” integrated Vega graphics and on a NVidia GeForce GTX 1650 Turing class graphics card), and for Ubuntu Linux 20.04.6 LTS (tested by myself on Ubuntu Linux 22.04.3-LTS with the distribution’s standard GStreamer 1.20.3 installation, again on integrated AMD Raven Ridge / Vega and on a machine with AMD Polaris 11). I tested with different 4k HDR movies, typically encoded in H265/HEVC, some encoded in VP9, with typical YUV 4:2:0 sampling and 10 bit per color channel precision. Iow. typical run of the mill 4k HDR-10 content. On all tested systems, performance was significantly improved, usually more than doubled, due to the use of hardware video decoding with the new optimized “hardware codec friendly” playback path.

Please find the following Screen() mex files for download and testing by yourself. These are pre-releases, not yet contained in the upcoming PTB 3.0.19.5 release, but will be integrated in a followup release on successful testing.

For Matlab on Windows:

https://github.com/kleinerm/Psychtoolbox-3/raw/gstreamerHDRHwPerfExp/Psychtoolbox/PsychBasic/MatlabWindowsFilesR2007a/Screen.mexw64

For Matlab on Ubuntu Linux:

https://github.com/kleinerm/Psychtoolbox-3/raw/gstreamerHDRHwPerfExp/Psychtoolbox/PsychBasic/Screen.mexa64

Simply use PlayMoviesDemo() for this, with suitable settings, e.g.,

PlayMoviesDemo ('/path/to/movie.mp4',[],[],[],11)

… to test playback performance in SDR mode, or …

PlayMoviesDemo ('/path/to/movie.mp4',1) for proper HDR mode.

Eager to hear your test results!


Hello,

I cannot express enough my gratitude for your immense help in this project, and for using one of the five jokers for this task.

I have tested the new Screen() MEX file with my code on different 4K HDR 60 Hz videos, and each video played in real time with no noticeable motion distortion. From the output of the code I noticed some dropped frames, but because there are not many of them and I could not notice them visually, I would say it is completely negligible.

Here is the output using the DevilMayCry5 that I have referenced before:

ITER=1::Movie: C:\Users\dh706\Documents\Sequences\pilot\ref\DevilMayCry5.mp4  : 10.000000 seconds duration, 60.000000 fps, w x h = 3840 x 2160...
Elapsed time 26.807423 seconds, for 1599 frames. Average framerate 59.647658 fps.
PTB-INFO: Movie playback had to drop 6 frames of movie 0 to keep playback in sync.

I would say that, compared to before, when the average framerate was 43.5 fps, the improvement is immense, and the playback looks similar to when I use GStreamer to play it.

Again, I am very thankful to you for solving this issue and improving the real-time playback of 4K HDR 60Hz videos.

Thanks,
Dounia
