Can you use OpenMovie in Async mode under WinXP + GStreamer?

I have some movies that play at roughly 90% of realtime, so if I could just pre-load some of the video I could play them without dropping frames. But attempts to use the async flag for 'OpenMovie' seem to do nothing. I tested this by modifying LoadMovieIntoTexturesDemoOSX to use the async flag and then wait some time before actually loading the movie and calling 'GetMovieImage' in benchmark mode. But whether the wait is 2 seconds or 0, the FPS for loading movies is the same. Could this be due to the movie format (MJPEG), my OS (WinXP 64), or GStreamer? (I have to use GStreamer since I use WinXP 64.)

Screen('OpenMovie', win, moviename, 1); % start loading in the background
WaitSecs(2); % let some frames accumulate
[movie movieduration fps imgw imgh] = Screen('OpenMovie', win, moviename); % open the movie for real

....
--- In psychtoolbox@yahoogroups.com, "Alan Robinson" <x_e_o_s_yahoo@...> wrote:
>
> I have some movies which play at roughly 90% realtime, so if I could just pre-load some of the video I could play them without dropping frames. [...]
>

That's not how it works. What the flag does is execute the 'OpenMovie' function in the background on a separate thread. That includes things like locating and opening the file, parsing its headers, loading all needed demuxers and decoders, setting up the pipeline, allocating memory, starting processing threads, etc. Depending on the movie format it may even load and decode the first few (dozen) frames until the pipeline is filled. This process is called "prerolling the pipeline". But that's it. It doesn't decode your whole movie or even substantial parts of it in the background; it just readies everything for a fast start of playback.

I think the QuickTime playback engine on OS/X had the option to decode larger parts of the movie via the optional 'preloadSecs' parameter. However, that never really worked reliably or consistently across movie formats or operating system versions, and was a nice source of hangs, crashes and other funny effects. E.g., my old PowerPC Mac with OS/X 10.4.11 crashed so hard when trying to gaplessly play and preload a movie that I had to power it down for 10 seconds to get it back when I tested this yesterday.
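If the goal is just to overlap the (potentially slow) open/preroll phase with other setup work, the intended two-step async pattern looks roughly like this. This is a minimal sketch; the `async=2` finalization call is my reading of how recent PTB versions hand back the movie handle, so check `Screen OpenMovie?` for your installed version:

```matlab
% Kick off opening/prerolling of the movie on a background thread
% (async flag = 1). This call returns immediately, without a usable handle.
Screen('OpenMovie', win, moviename, 1);

% Do other useful setup work here while the pipeline is built in the
% background: load stimuli, build textures, show instructions, etc.

% Finalize the async open (async flag = 2): this blocks until the
% background open has completed and returns the real movie handle.
[movie, movieduration, fps, imgw, imgh] = Screen('OpenMovie', win, moviename, 2);

% From here on, use 'movie' as usual:
Screen('PlayMovie', movie, 1);
```

Note that this only hides the open/preroll latency; as described above, it does not pre-decode the bulk of the movie.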

When you say 90%, what do you mean? Plays at 90% of the expected rate, or 90% of the time?

The next PTB beta is almost ready for release, probably today. It contains many optimizations, bug fixes and improvements to the GStreamer playback engine, which may help. E.g., a new PlayGaplessMoviesDemo2 -- GStreamer only, but "even more gapless" gapless playback than the classic gapless demo, because it can make use of GStreamer's built-in gapless playback support instead of needing to play tricks like with QT.

Then there's always Linux as the better alternative. Linux GStreamer is of a more recent version, with improved support for multi-threaded decoding from Ubuntu 11.10 onwards. E.g., H264-encoded material can get a nice boost, utilizing up to 4 cores of an 8-core machine.

What format are your movies in? Btw., I have a déjà vu feeling -- didn't we have this conversation already?

-mario
> 0. while whatever
> 1. getimage.
> 2. drawimage.
> 3. drawingfinished.
> 4. kbcheck and other non-visual logic.
> 5. flip.
> 6. endwhile
>
> 3. makes sure that the GPU starts processing in parallel with 4., instead of potentially waiting with the start of processing until 5., yielding more parallelism between CPU and GPU. Without 3., the driver will kick the GPU into action either at 5., or once enough rendering work has accumulated to warrant a start of processing, or after some timeout has been reached. So 3. may or may not help, but it usually doesn't hurt, and it usually does help in corner cases where you'd miss deadlines only by a small amount of time, etc.
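The quoted steps 0.-6. above, written out as a minimal Psychtoolbox sketch (assuming `win` is an open onscreen window and `movie` an already-opened movie handle; per-frame experiment logic is only stubbed in):

```matlab
% Minimal playback loop following steps 0.-6. above.
Screen('PlayMovie', movie, 1);      % start playback at normal speed
while ~KbCheck                      % 0./4. bail out on any keypress
    % 1. Fetch the next movie frame as a texture (blocks until one is ready).
    tex = Screen('GetMovieImage', win, movie);
    if tex <= 0
        break;                      % end of movie reached
    end

    % 2. Draw it into the backbuffer.
    Screen('DrawTexture', win, tex);

    % 3. Tell the GPU to start processing now, in parallel with step 4.
    Screen('DrawingFinished', win);

    % 4. Other non-visual per-frame logic would go here.

    % 5. Show the frame at the next retrace.
    Screen('Flip', win);

    % Release the texture so its backing memory can be recycled.
    Screen('Close', tex);
end
Screen('PlayMovie', movie, 0);
Screen('CloseMovie', movie);
```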

Sorry, I should have been clearer: methods 1-6 were the very first thing I tried after downloading the beta(^2?) version of Screen (minus #4, just to eke out whatever additional ms I could get). It performs the most consistently, and also the worst. The version I posted before, which pre-loads frames instead of using a simple WaitSecs, sometimes performs perfectly, with no missed flips, whereas the simple approach ALWAYS misses a decent number of flips. I've also tried a version which pre-loads textures and uses AsyncFlipBegin before loading the next texture in the draw-load loop; it performs the best of the bunch, often missing as few as 3% of flips, but sometimes as many as 15%.

> So at the moment, with Priority(2), the GStreamer threads run at the same priority as the main visual thread. At Priority(1), the GStreamer threads would still run at elevated realtime-priority levels, but lower than the main (visual) thread -- i think that's what you want.

Right. I've tried 0, 1, and 2. No consistent winner.

> And to clarify: It is not a GStreamer thread, but potentially a dozen threads, depending on the movie format and various settings + the graphics driver also has internal threads for some gpu's + the ptb threads == There are many more threads than your 4 cores at any time competing for resources. Assigning proper priorities is especially important for realtime-apps, but hard to do on an os with rather limited scheduling control and a non-rt scheduler. I could write an essay about the ways the windows scheduler sucks for rt apps.

Interesting. I tried enabling hyperthreading on the computer, resulting in 8 somewhat wimpy cores instead of 4 real ones. It seems to help a little. Watching Task Manager, it looks like a total of 4-5 cores get pegged while the draw-decode loop is running, and all 8 show some activity.

> For fine-grained rt control (with (m)any threads), a properly set up Linux system is the way to go. Even there, ptb currently only uses a fraction of the available tuning mechanisms, i expect to spend many more days/weeks of incrementally improving it, once i find the time to do so.
>
> > Here's some code, in case anybody else wants to give this a try.
> > >
>
> Looks awfully complex, possibly self-defeating.

Possibly. I tried the simplest possible design and added one little bit of complexity at a time until arriving at the final monster that you see here. I can't say for sure that each bit of complexity helped, since the testing is not deterministic, but it is true that the final version with all the bits performs the best, on average.

> 1. Try the specialFlags1 settings 1, 2 and 1+2 -- you don't have sound, so no need to setup sound decoding [although that might get skipped automatically, i don't know]. If your gpu + driver supports yuv textures, that could squeeze out another msec or two.

Tried all three. +2 seems to be slower. +1 makes no difference.

> 2. Only use 1 texture at a time, don't create multiple ahead of time, that will defeat some internal recycling of texture objects and lead to slower texture creation. The code is optimized for the common use pattern for live playback, which you are using in your code.

Texture creation does seem to be one of the bottlenecks. Too bad 1 texture at a time performs the worst.

> 3. At high verbosity levels (6 or so) you'll get some debug output about movie decoding. From that output (something like "xxx buffers queued") you can get an estimate of how long it really takes to decode the full movie, ie., how to set your wait time. Obviously for a 2 second clip @ 120Hz you'd like to see 240 buffers queued the first time you 'getmovieimage' a texture -- then the decoding would be completed at start of playback. The timing stats there also give you a feeling for how long texture creation takes.

Even having all 240 queued doesn't always ensure good performance, because the texture creation and uploading take a while.
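For reference, the per-frame timing breakdown quoted below comes from raising PTB's verbosity as suggested in point 3. (a minimal sketch; level 6 is the value suggested above, the exact output varies by PTB version):

```matlab
% Raise Screen's verbosity so the movie engine prints its queueing and
% per-frame timing statistics ("xxx buffers queued", etc.) to the console.
oldLevel = Screen('Preference', 'Verbosity', 6);

[movie, movieduration, fps, imgw, imgh] = Screen('OpenMovie', win, moviename);
% ... play and fetch frames as usual, watching the PTB-DEBUG output ...

% Restore the previous verbosity level afterwards.
Screen('Preference', 'Verbosity', oldLevel);
```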

These numbers are typical while PTB buffering is occurring:

PTB-DEBUG: Start of frame query to decode completion: 5.144509 msecs.
PTB-DEBUG: Decode completion to texture created: 3.419429 msecs.
PTB-DEBUG: Texture created to fetch completion: 0.020114 msecs.

and then after PTB buffering is finished:

PTB-DEBUG: Start of frame query to decode completion: 0.039949 msecs.
PTB-DEBUG: Decode completion to texture created: 4.603937 msecs.
PTB-DEBUG: Texture created to fetch completion: 0.282717 msecs.

> 4. You still have various flags from previous posts you could try to squeeze out a msec somewhere.

Sorry my post was unclear on this: I have tried all flags (though perhaps not all possible combinations of flags). None consistently rises above the run-to-run noise.

> 5. "Linux is your friend", although i find it adorable how enthusiastic you try to make the elephant dance.

At this point it would have been faster to install WUBI, no doubt. But we have a lot of lab infrastructure that's based on WinXP and dual booting is always such a pain - believe me, I've tried; my first Linux install was Slackware from floppies, using the umsdos filesystem. The problems change dramatically over the years, but the experience remains underwhelming.
--- In psychtoolbox@yahoogroups.com, "Alan Robinson" <x_e_o_s_yahoo@...> wrote:
>
> Sorry, should have been more clear: method 1-6 is the very first thing I tried after downloading the beta(^2?) version of screen [...]
>
> Sorry my post was unclear on this: I have tried all flags (though perhaps not all possible combinations of flags). None consistently rises above the run-to-run noise.
>

The timing is really tight, not much headroom left even in the best case. Other random things you could try:

1. Stick to Priority(1) -- it is the only sensible setting which would at least allow a well working scheduler to schedule the threads in a reasonable way. At levels 0 or 2 all bets are off.

2. Assuming this is not only a CPU scheduling problem, and given that your machine apparently is powerful enough to manage perfect performance at least during some runs, some other factors may be reducing the efficiency of memory access / data transfer / memory management or the GPU. So here come the classics, unrelated to movie playback per se:

a) Run matlab in -nojvm mode to get rid of any non-crucial threads and free up some memory. Maybe use octave instead of Matlab - it is less of a memory and resource hog.

b) Do not use a multi-display setup (in case you do) to free up some graphics driver / gpu resources. Possibly unplug monitors, so no ddc detection interferes, just in case...

c) Disable dynamic GPU power management and set the GPU to permanent maximum performance. On AMD GPUs there is PsychGPUControl() for that. On NVidia GPUs there are some controls in their control panel.

d) Unplug and disable all non-critical devices, e.g., network, extra hard drives etc.

e) Make sure your computer is well cooled, to avoid any potential thermal throttling.

f) Of course disable all virus scanners, indexers etc. etc.

g) Check with dpclat.exe (freeware) if anything is impairing interrupt handling on your machine.
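A couple of the points above (priority, GPU power management) can be set from the script itself. A hedged sketch -- the 'SetGPUPerformance' subfunction of PsychGPUControl() is an assumption about your installed PTB version, and it only has an effect on supported AMD GPUs, so check `help PsychGPUControl` first:

```matlab
% Script-side setup for points 1. and c) above.

Priority(1);  % elevated realtime priority, but below the main visual thread

% Pin the GPU at a fixed high performance level (10 = maximum) so dynamic
% power management can't clock it down mid-trial (AMD GPUs only; assumed
% to be available as PsychGPUControl('SetGPUPerformance', ...)):
PsychGPUControl('SetGPUPerformance', 10);

% ... run the experiment ...

Priority(0);  % restore normal scheduling when done
```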

I think I'm out of ideas other than those. Although it would be interesting to have one of your movies downloadable from somewhere as a reference/test case.

> > 5. "Linux is your friend", although i find it adorable how enthusiastic you try to make the elephant dance.
>
> At this point it would have been faster to install WUBI, no doubt. But we have a lot of lab infrastructure that's based on WinXP and dual booting is always such a pain - believe me, I've tried; my first Linux install was Slackware from floppies, using the umsdos filesystem. The problems change dramatically over the years, but the experience remains underwhelming.
>

Well, not my experience. Good luck with your current approach.
-mario