> 0. while whatever
> 1. getimage.
> 2. drawimage.
> 3. drawingfinished.
> 4. kbcheck and other non-visual logic.
> 5. flip.
> 6. endwhile
>
> Step 3 makes sure that the GPU starts processing in parallel with step 4, instead of potentially waiting until step 5 to start, yielding more parallelism between CPU and GPU. Without step 3, the driver will kick the GPU into action either at step 5, once enough rendering work has accumulated to warrant starting, or after some timeout has been reached. So step 3 may or may not help, but it usually doesn't hurt, and it usually does help in corner cases where you'd otherwise miss deadlines by only a small amount of time.
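The quoted 0-6 loop, written out as a minimal Psychtoolbox sketch (the handles `win` and `movie` and the escape-key exit condition are placeholders; timing bookkeeping and error handling are omitted):

```matlab
% Assumes an open onscreen window 'win' and a movie 'movie' already
% started with Screen('PlayMovie', ...).
done = false;
while ~done                                      % 0. while whatever
    tex = Screen('GetMovieImage', win, movie);   % 1. getimage
    if tex <= 0
        break;                                   % end of movie reached
    end
    Screen('DrawTexture', win, tex);             % 2. drawimage
    Screen('DrawingFinished', win);              % 3. kick the GPU into action now
    [~, ~, keyCode] = KbCheck;                   % 4. kbcheck / non-visual logic
    done = keyCode(KbName('ESCAPE'));
    Screen('Flip', win);                         % 5. flip at next retrace
    Screen('Close', tex);                        % release the texture
end                                              % 6. endwhile
```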
Sorry, I should have been clearer: the 0-6 loop is the very first thing I tried after downloading the beta(^2?) version of Screen (minus #4, just to eke out whatever additional ms I could get). It performs the most consistently, and also the worst. The version I posted before, which pre-loads frames instead of using a simple WaitSecs, sometimes performs perfectly, with no missed flips, whereas the simple approach ALWAYS misses a decent number of flips. I've also tried a version that pre-loads textures and uses AsyncFlipBegin before loading the next texture in the draw-load loop; it performs the best of the bunch, often missing as few as 3% of flips, but sometimes as many as 15%.
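The AsyncFlipBegin variant I described overlaps the flip with fetching the next frame, roughly like this (a sketch with the same placeholder handles as before, not my exact code):

```matlab
% Fetch one frame ahead so a texture is always ready when the flip starts.
tex = Screen('GetMovieImage', win, movie);
while tex > 0
    Screen('DrawTexture', win, tex);
    Screen('DrawingFinished', win);
    Screen('AsyncFlipBegin', win);                   % flip proceeds in the background...
    nexttex = Screen('GetMovieImage', win, movie);   % ...while we fetch the next frame
    Screen('AsyncFlipEnd', win);                     % block until the flip completes
    Screen('Close', tex);                            % release the shown texture
    tex = nexttex;
end
```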
> So at the moment, with Priority(2), the GStreamer threads run at the same priority as the main visual thread. At Priority(1), the GStreamer threads would still run at elevated realtime-priority levels, but lower than the main (visual) thread -- I think that's what you want.
Right. I've tried 0, 1, and 2. No consistent winner.
> And to clarify: it is not a GStreamer thread, but potentially a dozen threads, depending on the movie format and various settings + the graphics driver also has internal threads for some GPUs + the PTB threads == there are many more threads than your 4 cores at any time competing for resources. Assigning proper priorities is especially important for realtime apps, but hard to do on an OS with rather limited scheduling control and a non-rt scheduler. I could write an essay about the ways the Windows scheduler sucks for rt apps.
Interesting. I tried enabling hyperthreading on the computer, resulting in 8 somewhat wimpy cores instead of 4 real ones. It seems to help a little. Watching Task Manager, it looks like a total of 4-5 cores get pegged while the draw-decode loop is running, and all 8 show some activity.
> For fine-grained rt control (with (m)any threads), a properly set up Linux system is the way to go. Even there, PTB currently only uses a fraction of the available tuning mechanisms; I expect to spend many more days/weeks incrementally improving it, once I find the time to do so.
>
> > Here's some code, in case anybody else wants to give this a try.
>
> Looks awfully complex, possibly self-defeating.
Possibly. I tried the simplest possible design and added one little bit of complexity at a time until arriving at the final monster you see here. I can't say for sure that each bit of complexity helped, since the testing is not deterministic, but it is true that the final version with all the bits does perform the best, on average.
> 1. Try the specialFlags1 settings 1, 2 and 1+2 -- you don't have sound, so there's no need to set up sound decoding [although that might get skipped automatically, I don't know]. If your GPU + driver supports YUV textures, that could squeeze out another msec or two.
Tried all three. +2 seems to be slower, and +1 makes no difference.
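For anyone else trying this, the specialFlags1 value is passed when opening the movie; a sketch (the movie path is a placeholder, and I'm assuming the standard argument order of 'OpenMovie' with async and preloadSecs left at their defaults):

```matlab
% specialFlags1: 1 = use YUV textures if the GPU/driver supports them,
% 2 = skip sound decoding; pass 1+2 = 3 to combine both.
movie = Screen('OpenMovie', win, '/path/to/clip.mov', [], [], 1+2);
```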
> 2. Only use 1 texture at a time; don't create multiple ahead of time, as that will defeat some internal recycling of texture objects and lead to slower texture creation. The code is optimized for the common use pattern of live playback, which you are using in your code.
Texture creation does seem to be one of the bottlenecks. Too bad one texture at a time performs the worst.
> 3. At high verbosity levels (6 or so) you'll get some debug output about movie decoding. From that output (something like "xxx buffers queued") you can get an estimate of how long it really takes to decode the full movie, i.e., how to set your wait time. Obviously for a 2 second clip @ 120Hz you'd like to see 240 buffers queued the first time you 'GetMovieImage' a texture -- then the decoding would be completed at start of playback. The timing stats there also give you a feeling for how long texture creation takes.
Even having all 240 frames queued doesn't always ensure good performance, because texture creation and uploading take a while: at 120Hz the per-frame budget is only ~8.3 msecs, and the texture stages alone eat over half of it.
These numbers are typical while PTB buffering is occurring:
PTB-DEBUG: Start of frame query to decode completion: 5.144509 msecs.
PTB-DEBUG: Decode completion to texture created: 3.419429 msecs.
PTB-DEBUG: Texture created to fetch completion: 0.020114 msecs.
and then after PTB buffering is finished:
PTB-DEBUG: Start of frame query to decode completion: 0.039949 msecs.
PTB-DEBUG: Decode completion to texture created: 4.603937 msecs.
PTB-DEBUG: Texture created to fetch completion: 0.282717 msecs.
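For anyone wanting to reproduce these stats, the PTB-DEBUG output above comes from raising the verbosity preference before opening the movie (a sketch):

```matlab
oldLevel = Screen('Preference', 'Verbosity', 6);  % 6 = very verbose debug output
% ... open the movie, play it, and watch the console for the timing stats ...
Screen('Preference', 'Verbosity', oldLevel);      % restore the previous level
```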
> 4. You still have various flags from previous posts you could try to squeeze out a msec somewhere.
Sorry, my post was unclear on this: I have tried all the flags (though perhaps not all possible combinations of them). None consistently rises above the run-to-run noise.
> 5. "Linux is your friend", although I find it adorable how enthusiastically you try to make the elephant dance.
At this point it would have been faster to install WUBI, no doubt. But we have a lot of lab infrastructure that's based on WinXP and dual booting is always such a pain - believe me, I've tried; my first Linux install was Slackware from floppies, using the umsdos filesystem. The problems change dramatically over the years, but the experience remains underwhelming.