[Question] Speed of glReadPixels

I have a question about the speed of this command: glReadPixels.

In my code I have finished the rendering and just need to read the result back into a matrix for further use. I use a command like "myMatrix[][] = glReadPixels(...)". It works, but is much slower than I expected.

The data bus between my motherboard and GPU is PCI Express x16 2.0, which in theory supports a maximum bandwidth of 8 GB/s. As I understand it, glReadPixels basically moves data from a buffer (somewhere in GPU memory) to CPU (host) memory, so physically it should be limited by this bus.

In my code this operation took about 5 ms. The resolution I defined is 1024x768, so that is 1024x768 bytes in 5 ms, which is only about 150 MB/s, far below the physical limit of 8 GB/s. I don't expect to fully reach the physical limit, but I would at least like to get within the same order of magnitude; right now I am below 2% of it.

Therefore, my question is:
is there anything I should set up first to make glReadPixels more efficient?

I don't think glReadPixels in OpenGL should be this inefficient, so I wonder whether I am simply not using it properly in Matlab with Psychtoolbox.

Many thanks
--- In psychtoolbox@yahoogroups.com, "jerremyjsc" wrote:
>
> Hi Mario:
>
> Thank you very much for this clear explanation.
>
> I have tried different formats (GL.RGB etc.) and buffers, and they all show similar speed on my computer; actually I don't mind this tiny difference. My goal was to get to about 2-4 GB/s (it seemed reasonable to expect 25%-50% of the physical limit).
>

Desirable, yes; whether it is reasonable depends a lot on the specific task, how optimally your application is written, and quite a bit of luck. You can get high performance if you use the latest state-of-the-art techniques and optimizations, like clever management of pixel buffer objects and parallelism between cpu and gpu, but not with a simple call to glReadPixels or by tweaking a few parameters. The problem is not inherent to readpixels; it follows from the way the hardware is designed and from the specific performance vs. flexibility tradeoffs chosen by hardware designers and graphics driver developers.

The quality of different gpu's and their drivers also varies by application and by the class of typical applications they are optimized for. These are areas where, e.g., consumer GeForce cards can differ quite a bit in performance from Quadro or Tesla compute boards, partly by hardware design but even more by driver design. A graphics driver often has a dozen different ways of achieving the same task on a given gpu, with identical end results. Which of those is fastest depends a lot on context and the specific application. And how much effort is put into optimizing the different execution paths depends on the target audience. The driver for a cheap GeForce gamer/desktop card will probably have much less optimization and tuning for functions only relevant to high performance computing than the driver for a Tesla compute card, which is specifically sold for such tasks.

The method combining pixel buffer objects with glReadPixels and multi-buffering, which is also used in screen recorders like the one Tobias mentioned for Gnome, is currently the optimal way of doing it. But your application needs to be designed around it to take advantage of the parallelism and asynchrony that pbo's can provide, otherwise it won't give much speedup. The gpu hardware and driver also need to be optimized for that use case.
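
To make the technique concrete, here is a rough sketch of double-buffered pbo readback, written against PTB's mogl bindings. Treat it as an illustration, not a drop-in solution: the stock glReadPixels.m wrapper is not written to pass a pbo byte offset, so that line stands in for a suitably adapted low-level binding, and nframes is just a placeholder.

% Sketch only: double-buffered asynchronous readback via two PBOs.
% Assumes InitializeMatlabOpenGL was run and rendering goes to the
% backbuffer; verify the exact argument/return conventions of your
% mogl wrappers (esp. NULL pointers / byte offsets) before relying on it.
global GL;
w = 1024; h = 768; nbytes = w * h * 4;    % BGRA8: 4 bytes per pixel
pbos = glGenBuffers(2);
for i = 1:2
    glBindBuffer(GL.PIXEL_PACK_BUFFER, pbos(i));
    glBufferData(GL.PIXEL_PACK_BUFFER, nbytes, 0, GL.STREAM_READ);
end
for frame = 1:nframes
    idx  = mod(frame, 2) + 1;    % pbo receiving this frame
    prev = 3 - idx;              % pbo holding last frame's pixels

    % ... render this frame ...

    % With a PIXEL_PACK buffer bound, the last argument of glReadPixels
    % is a byte offset into the pbo and the call returns without blocking:
    glBindBuffer(GL.PIXEL_PACK_BUFFER, pbos(idx));
    glReadPixels(0, 0, w, h, GL.BGRA, GL.UNSIGNED_BYTE, 0);

    if frame > 1
        % Collect last frame's pixels; that transfer had a whole frame of
        % time to complete in the background, so this should not stall:
        glBindBuffer(GL.PIXEL_PACK_BUFFER, pbos(prev));
        raw = glGetBufferSubData(GL.PIXEL_PACK_BUFFER, 0, nbytes);
        % ... process 'raw' (native BGRA byte layout) ...
    end
end
glBindBuffer(GL.PIXEL_PACK_BUFFER, 0);
glDeleteBuffers(2, pbos);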

> The way I measured the timing is: I typed those commands, including tic;glReadPixels;toc, manually in the matlab command window. That is to say, I (physically) just wait until the rendering is done (it should finish within a few ms). I did this to exclude the rendering time, in order to get the timing as correct as possible.
>

Better would be to use what i proposed, run it a couple of dozen or a hundred times, and do it from a script, not from the matlab console. E.g., rendering may only start when you call glReadPixels, because the driver tries to be lazy and defers such operations to the last moment; batching up large chunks of work is more efficient on average. Also, your card not only has to process graphics commands from you, but also from all running applications and the GUI (and the desktop compositor on Windows Vista and later). The mere act of typing anything in the command window or pressing the enter key - and maybe the matlab console scrolling - creates extra graphics load, which can execute inbetween your rendering and your glReadPixels and add a millisecond here and there.
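
As a minimal sketch of that kind of measurement (assuming InitializeMatlabOpenGL was run, an onscreen window is open and the frame has been drawn; note the stock glReadPixels.m wrapper still includes the expensive Matlab-format conversion discussed below):

% Sync once to exclude pending rendering, then average many readbacks:
global GL;
nreps = 100;
glFinish;
t0 = GetSecs;
for i = 1:nreps
    pixels = glReadPixels(0, 0, 1024, 768, GL.RGB, GL.UNSIGNED_BYTE);
end
glFinish;
t = (GetSecs - t0) / nreps;
fprintf('%.3f ms per readback -> %.1f MB/s\n', 1000 * t, ...
    1024 * 768 * 3 / t / 1e6);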

> It seems to me that most of the time was simply spent on converting the format to a matlab matrix. I guess I can only use c#/opengl directly to avoid this problem.
>

No. You can remove the conversion code in our glReadPixels.m file, as explained. Or use the moglcore('glReadPixels') call directly, even skipping parameter checking. If you don't feed the returned matrix into the image processing toolbox/imwrite/imshow, but only into your GPUmat toolbox, you could simply write your processing code slightly differently, so it operates on the native image format returned by OpenGL. You only need the conversion for typical use with Matlab's own image functions.
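
For instance (a sketch; rawpixels stands for the unconverted output, which arrives in OpenGL's components x width x height layout):

% Working directly on the native layout stays cheap:
red = squeeze(rawpixels(1, :, :));   % one channel, width x height
% Pay for the expensive permute only when Matlab's own image
% functions really need height x width x components:
img = permute(rawpixels, [3 2 1]);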

> So my application is simple: just take the image matrix out and then do some other calculations, for which I need to pass it to the GPU via GPUmat (a toolbox that allows matlab to run some basic CUDA commands). It sounds wasteful, because the data is taken from the GPU to the host by glReadPixels and then passed back to the GPU by GPUmat. But GPUmat and psychtoolbox are not integrated, so I simply have to control both from the matlab (host) side. This is just an initial test of both sides (opengl via psychtoolbox, and CUDA via GPUmat). In the end the whole code will be ported to C# to avoid this problem.
>

GPUmat is actually quite interesting; i didn't know about it. It looks like what the commercial Jacket toolkit does, following the same basic implementation approach.

Does it work well for you? What kind of processing do you do? One of my favorite future todos for ptb is to integrate native OpenCL support into our own image processing pipeline, to allow optimal use of gpu computing within ptb for typical tasks. I hope to get around to that sometime this year. But GPUmat looks as if it could be an interesting stop-gap measure to get some basic cuda based gpu computing going with ptb. I had a quick look at their source code and at Cuda's current OpenGL interop api. It should be possible to get some less-than-optimal VRAM<->VRAM data exchange going between ptb and GPUmat. Anything that avoids large data transfers and synchronisation between the host cpu and the gpu should give a large performance boost, even if it is implemented in a relatively hacky and less than optimal fashion.

> In this case, I think this is pretty much the limit of glReadPixels in this situation. Some of your suggested methods should improve things a bit, but as you said, the gains might not be significant. So I will mostly leave it as a problem to be solved by using c# rather than matlab.
>

They can be significant, depending on the implementation effort and the graphics card/driver/app; it's a matter of trying and testing. And it has nothing to do with c# vs. matlab. But nothing beats avoiding host-gpu data transfers whenever possible, regardless of which programming or scripting environment you use.

-mario


> Many thanks
>
>
>
> --- In psychtoolbox@yahoogroups.com, "Mario" wrote:
> >
> >
> >
> > --- In psychtoolbox@yahoogroups.com, "jerremyjsc" wrote:
> > >
> > > Hi:
> > > I just tried it, and it took longer.
> >
> > 'GetImage' is not optimized for speed, only for convenience, flexibility and doing the right thing in all use cases; that's why it takes longer. Internally it uses glReadPixels. It could be made somewhat faster without much effort, but part of the speed gap is because it converts from OpenGL format to Matlab's matrix image format, which is an expensive operation.
> >
> > > And to use GetImage, I need to do a Screen('Flip'), which is not necessary if I use glReadPixels.
> >
> > No, you just need to specify the proper buffername, i.e., 'backBuffer' instead of the default 'frontBuffer'.
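> >
> > For example (a sketch, assuming 'win' is your onscreen window handle):
> >
> > % Read back the backbuffer without flipping first:
> > img = Screen('GetImage', win, [], 'backBuffer');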
> >
> > > So in any case, it seems using glReadPixels is quicker.
> > >
> > > It is 1.2 GB/s on my computer at the moment, and its physical limit should be 8 GB/s, so I am wondering if there is any method to read faster? Or did I simply not set something up for glReadPixels beforehand?
> > >
> >
> > From your numbers i only calculate 600 MB/s, not 1.2 GB/s. However, benchmarking can easily go wrong, depending on how you measure:
> >
> > % ... draw something ...
> > tic
> > pixels = glReadPixels(0, 0, width, height, GL.RGB, GL.UNSIGNED_BYTE);
> > toc
> >
> > -> Wrong. You are measuring the sum of the execution times of drawing and readpixels, yielding a lower performance estimate than the true one. glReadPixels is a cpu-gpu synchronization point: it will block until all pending rendering commands have completed and all relevant internal buffers are flushed. It can easily happen that the rendering commands only start executing when you call glReadPixels, due to internal batch processing.
> >
> > % ... draw something ...
> > glFinish;
> > t1 = GetSecs;
> > pixels = glReadPixels(0, 0, width, height, GL.RGB, GL.UNSIGNED_BYTE);
> > telapsed = GetSecs - t1;
> >
> > -> Less wrong. Now you are mostly measuring actual readback time, although the glFinish() itself will degrade real world performance somewhat, because cpu - gpu synchronization is a performance killer unless used wisely.
> >
> > PTB has an api in Screen('GetWindowInfo'), demonstrated a little bit in DrawingSpeedTest, to do the timing properly by using on-gpu timers which are designed for this kind of benchmarking.
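> >
> > A hedged sketch of that api (the infoType value and result field name below are from memory - check "Screen GetWindowInfo?" and DrawingSpeedTest.m for the authoritative usage):
> >
> > Screen('GetWindowInfo', win, 5);   % arm the on-gpu render timer
> > % ... render and read back here ...
> > Screen('Flip', win);
> > winfo = Screen('GetWindowInfo', win);
> > gputime = winfo.GPULastFrameRenderTime;   % seconds spent on the gpu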
> >
> > glReadPixels in PTB should be as fast as in C; the extra call overhead will be < 10 usecs per call - at least if you called InitializeMatlabOpenGL with the optional debuglevel flag set to zero, so that all the performance degrading but helpful error checks are skipped.
> >
> > Oh and you need to create your own version of glReadPixels.m which omits those lines after the moglcore('glReadPixels', ...); call:
> >
> > % Rearrange data in Matlab friendly format:
> > retpixels = zeros(size(pixels,2), size(pixels,3), size(pixels,1), pclass);
> > for i = 1:numperpixel
> >     retpixels(:,:,i) = pixels(i,:,:);
> > end
> >
> > These lines are computationally expensive, but omitting them will leave you with data that Matlab itself, e.g., the image processing toolbox or imshow() et al., can't process. You could store the raw binary data to disk, or feed it into other custom written mex files for processing, though.
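> >
> > E.g., a minimal sketch for dumping the unconverted data to disk (the filename is just an example; assumes GL.UNSIGNED_BYTE data):
> >
> > f = fopen('frame.raw', 'w');    % raw dump in OpenGL's native layout
> > fwrite(f, pixels(:), 'uint8');
> > fclose(f);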
> >
> > However, the speed of glReadPixels is influenced by many parameters: the specific operating system, graphics driver and gpu hardware, and the specific usage pattern of the application and system. The PCIe bus speed is just a theoretical upper limit. And sometimes there's a large difference in readback speed between pro cards (Quadro/FireGL) and consumer cards (GeForce/Radeon), sometimes as a deliberate differentiating feature of pro vs. consumer gpu's.
> >
> > Example: A format of GL.RGBA or GL.BGRA may be more efficient than GL.RGB, even if you only want the RGB channels. GL.RED may or may not be more efficient for getting luminance data than reading everything. GL.BGRA could be faster or slower than GL.RGBA depending on the 'type' parameter, where GL.UNSIGNED_INT_8_8_8_8 or GL.UNSIGNED_INT_8_8_8_8_REV can be faster or slower than GL.UNSIGNED_BYTE, depending on whether you use GL.RGBA or GL.BGRA.
> >
> > Which specific permutation gives you maximum speed differs by gpu vendor, gpu generation and driver. glPixelStorei(GL.PACK_ALIGNMENT, n); may give you different performance depending on the value of n, on whether the width of your image is a multiple of 1, 2, 4, 8, or 16, and on which type and format you use. Only certain formats are supported directly by the hardware; others need conversion, and that conversion may or may not be hardware accelerated, with varying levels of efficiency.
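> >
> > A sketch of how one could probe this on a given setup (reuses the timing pattern from above; if the stock glReadPixels.m wrapper doesn't know a combo, extend its size computation or call moglcore directly):
> >
> > combos = { GL.RGBA, GL.UNSIGNED_BYTE; ...
> >            GL.BGRA, GL.UNSIGNED_BYTE; ...
> >            GL.RGB,  GL.UNSIGNED_BYTE };
> > for k = 1:size(combos, 1)
> >     glFinish;
> >     t0 = GetSecs;
> >     for i = 1:50
> >         pixels = glReadPixels(0, 0, 1024, 768, combos{k,1}, combos{k,2});
> >     end
> >     glFinish;
> >     fprintf('combo %i: %.3f ms per readback\n', k, 1000*(GetSecs-t0)/50);
> > end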
> >
> > Then it also depends on which buffer you're reading from, and whether that buffer's memory layout is optimized for fast readback, fast rendering, or display on a monitor. Actual pixel data is usually not stored in a linear fashion in gpu memory, as introductory articles about how graphics cards work would make you believe - that was true ten years ago, but no more.
> >
> > Some gpu's have dedicated DMA engines for async transfers while other stuff is going on in parallel, others don't, or only for certain data formats.
> >
> > And the strategy to get maximum readback performance is usually not simple textbook use of glReadPixels at all, but use of PBO's (pixel buffer objects) in combination with glReadPixels, memory mapping, special synchronization primitives, and double- or n-buffering of multiple readback buffers, to take full advantage of hw capabilities. Some bits of this are used in moglmorpher.m and moglFDF.m.
> >
> > There are further limiting factors, e.g., the available bandwidth to your system memory, which is shared between all running applications, os and all hardware.
> >
> > Essentially, your 600 MB/s is not bad for naive use. If you are lucky, and you try different parameter combos, benchmark, and Google around a bit, you may be able to get somewhat more without spending serious amounts (days, weeks or even months) of extra learning and effort. Or maybe not; it depends on your specific hw/sw setup and needs.
> >
> > All these issues are part of the reason why high performance applications usually try not to transfer data from and to host memory, but instead do as much processing as possible on the gpu. Even tasks that are no faster on the gpu, or even a bit slower, can be effectively much faster if they cut down on host <-> gpu data transfers, when the end result needs to go to the display anyway.
> >
> > A different question would be: what are you actually trying to achieve?
> > -mario
> >
> >
> >
> > > Many thanks
> > >
> > > --- In psychtoolbox@yahoogroups.com, "Diederick C. Niehorster" wrote:
> > > >
> > > > Consider using this instead of messing with opengl commands directly:
> > > > http://docs.psychtoolbox.org/GetImage
> > > >
> > > > On Wed, Jan 23, 2013 at 10:05 PM, elladawu wrote:
> > > >
> > > > >
> > > > >
> > > > >
> > > > > um, your calculations are assuming that each pixel is only 1 bit of data...
> > > > >
> > > > >
Hi Mario:

Thank you for the reply again.
There is a lot here for me to study, and I will try to digest it.

Apart from that, GPUmat works well for me.
However, it seems to have its own algorithm for allocating memory during calculation, which might not be optimal for a specific application. And as you said, I believe it should be possible to integrate ptb and GPUmat, so we can reduce the part that goes through CPU<->GPU. I haven't found a way to manage it; I guess the source code needs to be modified, and that won't be easy for me.

Currently, I think openGL and CUDA are easier to integrate in C# for a normal user.

Many thanks

--- In psychtoolbox@yahoogroups.com, "jerremyjsc" wrote:
>
> Hi Mario:
>
> Thank you for the reply again.
> There is a lot here for me to study, and I will try to digest it.
>
> Apart from that, GPUmat works well for me.
> However, it seems to have its own algorithm for allocating memory during calculation, which might not be optimal for a specific application. And as you said, I believe it should be possible to integrate ptb and GPUmat, so we can reduce the part that goes through CPU<->GPU. I haven't found a way to manage it; I guess the source code needs to be modified, and that won't be easy for me.
>

Yes, depending on your specific task, performance will not be maximal. GPUmat uses the same approach as Jacket, and both are targeted, just like Matlab itself, at fast prototyping of code and ideas, or at people who need some speedup cheaply. It's an elegant solution, because you don't need to learn much (if anything) beyond standard Matlab coding, and it allows fast iteration and testing thanks to the automated memory and gpu management. If you code directly in C or C++ with CUDA or OpenCL and you really know what you're doing - or can use libraries for the task written by someone who does - it should be faster. I wouldn't exactly use C# if performance is relevant, because it follows the same logic - a high level language with automatic memory management, managed code and such - and the price you pay is lower performance.

But it depends on the task at hand. Some algorithms, carefully coded, may run just as fast in GPUmat or Jacket as in low-level code, e.g., if you don't need to reallocate memory a lot. These toolkits likely cache and recycle CUDA resources where possible, so their memory management can work quite efficiently with a little help from the Matlab code. The same holds if your algorithm mostly consists of a few computation steps built from standard building blocks which are already available as high-level CUDA primitives, e.g., typical matrix algebra, matrix-vector or matrix-matrix multiplies, FFTs etc.
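
As a hedged illustration of that point (GPUsingle and its overloaded operators are GPUmat's core api; hostMatrix is a placeholder, and whether fft2 is overloaded in your GPUmat version should be checked):

A = GPUsingle(hostMatrix);   % one host -> gpu transfer
B = fft2(A .* A + 2);        % overloaded ops run as CUDA kernels on the gpu
C = single(B);               % one gpu -> host transfer at the end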

> Currently, I think openGL and CUDA are easier to integrate in C# for a normal user.
>

Your post made me curious, so i played a little bit with GPUmat over the weekend and actually ported it to MacOSX. I'm also almost done implementing a basic GPUmat<->PTB interop module, which could be extended to other toolkits. Most of it already works well; just some beautification, performance tuning and a nice demo are left, so this module will be part of the next PTB beta pretty soon.

-mario

