View Issue Details

ID: 0000294
Project: Cinelerra-GG
Category: [All Projects] Feature
View Status: public
Last Update: 2019-09-11 11:47
Reporter: Andrew-R
Assigned To: PhyllisSmith
Priority: normal
Severity: minor
Reproducibility: always
Status: acknowledged
Resolution: open
Product Version: 2019-06
Target Version:
Fixed in Version:
Summary: 0000294: Support for RGBA-Float in playback3d.C
Description:
Hello!

I was testing yet another build of CinGG (Cinelerra Infinity - built: Sep 6 2019 00:54:56) and found that while RGBA-8 colormodel plays at 25 fps for both X11 and OpenGL outputs - RGBA-FLOAT drops down to 5-10 fps.

Looking at cinelerra-5.1/cinelerra/playback3d.C I see

void Playback3D::convert_cmodel(Canvas *canvas,
        VFrame *output,
        int dst_cmodel)
{
// Do nothing if colormodels are equivalent in OpenGL & the image is in hardware.
        int src_cmodel = output->get_color_model();
        if(
                (output->get_opengl_state() == VFrame::TEXTURE ||
                output->get_opengl_state() == VFrame::SCREEN) &&
// OpenGL has no floating point.
                ( (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGB_FLOAT) ||
                  (src_cmodel == BC_RGBA8888 && dst_cmodel == BC_RGBA_FLOAT) ||
                  (src_cmodel == BC_RGB_FLOAT && dst_cmodel == BC_RGB888) ||
                  (src_cmodel == BC_RGBA_FLOAT && dst_cmodel == BC_RGBA8888) ||
// OpenGL sets alpha to 1 on import
                  (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGBA8888) ||
                  (src_cmodel == BC_YUV888 && dst_cmodel == BC_YUVA8888) ||
                  (src_cmodel == BC_RGB_FLOAT && dst_cmodel == BC_RGBA_FLOAT) )
                ) return;


Further down in the same file there is a table describing some conversions, but I'm not sure it covers all cases:

void Playback3D::convert_cmodel_sync(Playback3DCommand *command)

{skip}

static cmodel_shader_table_t cmodel_shader_table[] = {
                        { BC_RGB888, BC_YUV888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_RGB888, BC_YUVA8888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_RGBA8888, BC_RGB888, rgb_to_rgb, rgba_to_rgb_frag },
                        { BC_RGBA8888, BC_RGB_FLOAT, rgb_to_rgb, rgba_to_rgb_frag },
                        { BC_RGBA8888, BC_YUV888, rgb_to_yuv, rgba_to_yuv_frag },
                        { BC_RGBA8888, BC_YUVA8888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_RGB_FLOAT, BC_YUV888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_RGB_FLOAT, BC_YUVA8888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_RGBA_FLOAT,BC_RGB888, rgb_to_rgb, rgba_to_rgb_frag },
                        { BC_RGBA_FLOAT,BC_RGB_FLOAT, rgb_to_rgb, rgba_to_rgb_frag },
                        { BC_RGBA_FLOAT,BC_YUV888, rgb_to_yuv, rgba_to_yuv_frag },
                        { BC_RGBA_FLOAT,BC_YUVA8888, rgb_to_yuv, rgb_to_yuv_frag },
                        { BC_YUV888, BC_RGB888, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUV888, BC_RGBA8888, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUV888, BC_RGB_FLOAT, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUV888, BC_RGBA_FLOAT, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUVA8888, BC_RGB888, yuv_to_rgb, yuva_to_rgb_frag },
                        { BC_YUVA8888, BC_RGBA8888, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUVA8888, BC_RGB_FLOAT, yuv_to_rgb, yuva_to_rgb_frag },
                        { BC_YUVA8888, BC_RGBA_FLOAT, yuv_to_rgb, yuv_to_rgb_frag },
                        { BC_YUVA8888, BC_YUV888, yuv_to_yuv, yuva_to_yuv_frag },


The thing is, you apparently CAN use floating-point textures and renderbuffers in OpenGL 3+!

https://learnopengl.com/Advanced-Lighting/HDR
-----quote-----
Floating point framebuffers

 To implement high dynamic range rendering we need some way to prevent color values getting clamped after each fragment shader run. When framebuffers use a normalized fixed-point color format (like GL_RGB) as their colorbuffer's internal format OpenGL automatically clamps the values between 0.0 and 1.0 before storing them in the framebuffer. This operation holds for most types of framebuffer formats, except for floating point formats that are used for their extended range of values.

 When the internal format of a framebuffer's colorbuffer is specified as GL_RGB16F, GL_RGBA16F, GL_RGB32F or GL_RGBA32F the framebuffer is known as a floating point framebuffer that can store floating point values outside the default range of 0.0 and 1.0. This is perfect for rendering in high dynamic range!

 To create a floating point framebuffer the only thing we need to change is its colorbuffer's internal format parameter:

glBindTexture(GL_TEXTURE_2D, colorBuffer);
glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA16F, SCR_WIDTH, SCR_HEIGHT, 0, GL_RGBA, GL_FLOAT, NULL);

 The default framebuffer of OpenGL (by default) only takes up 8 bits per color component. With a floating point framebuffer with 32 bits per color component (when using GL_RGB32F or GL_RGBA32F) we're using 4 times more memory for storing color values. As 32 bits isn't really necessary unless you need a high level of precision using GL_RGBA16F will suffice.
---end of quote-----
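To put the memory remark from the quote into numbers, here is a minimal sketch that computes the size of a single RGBA texture for the three internal formats mentioned. The names TexFormat, bytes_per_component and texture_bytes are made up for this illustration and are not part of Cinelerra or OpenGL; mipmaps and driver padding are ignored.

```cpp
#include <cstddef>

// Hypothetical labels for the internal formats discussed in the quote.
enum class TexFormat { RGBA8, RGBA16F, RGBA32F };

// Bytes used by one color component in each format.
constexpr std::size_t bytes_per_component(TexFormat f) {
    return f == TexFormat::RGBA8 ? 1 : f == TexFormat::RGBA16F ? 2 : 4;
}

// Size of one uncompressed 4-component texture, no mipmaps, no padding.
constexpr std::size_t texture_bytes(int w, int h, TexFormat f) {
    return static_cast<std::size_t>(w) * static_cast<std::size_t>(h)
         * 4 * bytes_per_component(f);
}
```

For a 1920x1080 frame this gives 8294400 bytes for RGBA8, twice that for RGBA16F and four times that for RGBA32F, which matches the "4 times more memory" statement in the quote.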

Because the RGBA-Float project colormodel mostly needs RGBA-F16 conversions and texture modes, it should be relatively simple ...ah, not if you count all the plugins :/ But if the whole pipeline ran at R16G16B16A16, then ffmpeg would convert video streams into this format at the decoding stage, and Cinelerra would only need to deal with floating-point RGB(A) until the final encoding stage, where ffmpeg would again pick up the data and convert it for the specific encoder .... The plugins, however, all use the old 8-bit/channel OpenGL logic, so they will probably require additional attention.

glxinfo for me (OpenGL 3.3/DX10 era hw) gives:

glxinfo | grep float
    GLX_ARB_fbconfig_float, GLX_ARB_framebuffer_sRGB, GLX_ARB_multisample,
    GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
    GLX_ARB_create_context_robustness, GLX_ARB_fbconfig_float,
    GLX_EXT_create_context_es_profile, GLX_EXT_fbconfig_packed_float,
    GLX_ARB_fbconfig_float, GLX_ARB_framebuffer_sRGB,
    GLX_EXT_fbconfig_packed_float, GLX_EXT_framebuffer_sRGB,
    GL_ARB_color_buffer_float, GL_ARB_compressed_texture_pixel_storage,
    GL_ARB_debug_output, GL_ARB_depth_buffer_float, GL_ARB_depth_clamp,
    GL_ARB_get_texture_sub_image, GL_ARB_half_float_pixel,
    GL_ARB_half_float_vertex, GL_ARB_instanced_arrays,
    GL_ARB_texture_filter_anisotropic, GL_ARB_texture_float,
    GL_ATI_blend_equation_separate, GL_ATI_texture_float,
    GL_EXT_framebuffer_sRGB, GL_EXT_packed_depth_stencil, GL_EXT_packed_float,
    GL_ARB_clip_control, GL_ARB_color_buffer_float, GL_ARB_compatibility,
    GL_ARB_debug_output, GL_ARB_depth_buffer_float, GL_ARB_depth_clamp,
    GL_ARB_half_float_pixel, GL_ARB_half_float_vertex,
    GL_ARB_texture_float, GL_ARB_texture_mirror_clamp_to_edge,
    GL_ATI_texture_float, GL_ATI_texture_mirror_once, GL_EXT_abgr,
    GL_EXT_packed_float, GL_EXT_packed_pixels, GL_EXT_pixel_buffer_object,
    GL_EXT_clip_control, GL_EXT_clip_cull_distance, GL_EXT_color_buffer_float,
    GL_EXT_draw_elements_base_vertex, GL_EXT_float_blend, GL_EXT_frag_depth,
    GL_OES_texture_border_clamp, GL_OES_texture_float,
    GL_OES_texture_float_linear, GL_OES_texture_half_float,
    GL_OES_texture_half_float_linear, GL_OES_texture_npot,
    GL_OES_vertex_half_float

(for Core and Compat profiles, and GLX server/client parts)
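A program can make the same check as the grep above on an extension list. The sketch below is only an illustration: has_extension is a hypothetical helper, and it matches whole tokens so that GL_OES_texture_float is not confused with GL_OES_texture_float_linear. In a real OpenGL 3+ context the names would normally come from glGetStringi(GL_EXTENSIONS, i) rather than from glxinfo text.

```cpp
#include <sstream>
#include <string>

// Returns true if `name` appears as a whole token in a space- or
// comma-separated extension list such as the glxinfo output above.
bool has_extension(const std::string &list, const std::string &name) {
    std::string cleaned = list;
    for (char &c : cleaned)
        if (c == ',') c = ' ';          // treat commas as separators
    std::istringstream in(cleaned);
    std::string token;
    while (in >> token)
        if (token == name) return true; // exact match only, no prefixes
    return false;
}
```

Whole-token matching is the point of the design: a plain substring search would report GL_OES_texture_float as present whenever only the _linear variant is listed.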

Mesa has had this enabled by default since June 2018 (on hardware where it matters and where the specific driver actually implements all the paths)

https://gitlab.freedesktop.org/lima/mesa/commit/66673bef941af344314fe9c91cad8cd330b245eb

Anyway, I'm leaving this ticket hanging around, just in case GG or anyone else has some time to play around with this idea.
Steps To Reproduce: Try to set the project format to RGBA-Float (from the default RGBA-8) and the Output driver to X11-OpenGL. Observe the slowdown when playing even a single video track.
Additional Information: git version

commit 721a106de35567bcab14a0e92718767189acf176 (grafted, HEAD -> master, origin/master, origin/HEAD)
Author: Good Guy <[email protected]>
Date: Wed Sep 4 12:26:37 2019 -0600

    add crop plugin, add timeline bars, render setup err chks, minor tweaks
Tags: No tags attached.

Activities

Olaf

2019-09-11 11:47

reporter   ~0002121

@Andrew-R,
Any graphics card from the GeForce GTS 450 upward with Nvidia's OpenGL drivers should play 1080/25p without any problems (RGBA-float/X11-OpenGL) and without any hardware modifications. My 450 card reaches about 115 FPS with unprocessed Full HD FFvhuff material on the timeline, including PCM audio tracks. (glxgears: 109910 frames in 5.0 seconds = 21981.947 FPS)

If a code optimization can later improve performance by 1-2 FPS, all the better. But, sorry as I am to say it, in image processing the practical benefit of the powerful proprietary drivers conflicts with the ideological aspects.
Andrew-R

2019-09-11 09:52

reporter   ~0002120

@Olaf:

Yes, I use mesa/nouveau, but my card can be partially reclocked, and even its boot clocks are not very low (as they are on many other cards), so it is slower, but not intolerably slow. Also, my first test stream was an AV1 1080p video (no audio), and my second test stream was a 720x400 H.264 file scaled up to 1080p. The second case was faster, but still around 20 fps, not 25 ..... (speaking about the RGBA-float project colorspace and X11-OpenGL output, specifically)
Olaf

2019-09-11 08:00

reporter   ~0002119

Andrew-R, do your test results (5-10 fps) refer to nouveau and Mesa? Nvidia's drivers ship an OpenGL implementation that is much faster than Mesa. With them and my age-old graphics card I play 1080/25p in RGBA-FLOAT at 25 FPS. The FPS only drops after image manipulation.
Andrew-R

2019-09-11 02:17

reporter   ~0002118

And for some reason my attempt at a reply disappeared :/ Maybe I just typed it too slowly!

Anyway, I was about to add this 'brilliant' idea about contacting the X/mesa (and nouveau) mailing lists, in the hope that they will find some interest in actual application hacking.

Links
https://lists.x.org/archives/xorg-devel/2018-February/055861.html
Depth 30 enablement for modesetting-ddx and fixups for glamor.
Mario Kleiner mario.kleiner.de at gmail.com

---------quote------
I used my photometer to make sure the bits come through while testing
on NVidia + nouveau, and then also quickly tested on old AMD gfx +
radeon-kms and on Intel + intel-kms to make sure that regular desktop
and OpenGL apps render correctly on that hw as well.
---end quote-----

https://lists.freedesktop.org/archives/mesa-dev/2019-February/214689.html
[Mesa-dev] 10-bit fbconfigs break most video players using VAAPI+GLX

https://github.com/skeggsb/nouveau/commit/ca5fe1a3e31e1f1e77274616e18296ddd0daba32
kms/nv50-: add fp16 scanout support

Phyllis - I'm sorry about you and your dog! OpenGL is a big standard, but I hope this quest can be resolved with time and collective work...

PS: llvmpipe with mesa 19.3.0-git should actually be able to run some compute shaders - not very fast, but maybe good enough for prototyping on machines whose video cards are too old to have OpenGL 4.3+ in hardware (like my machine).
PhyllisSmith

2019-09-11 01:25

manager   ~0002116

@Andrew

GG did look at your patch. He is not sure how this will affect plugins and it is unclear if there might be risks involved. You probably already know that he is no expert when it comes to OpenGL (me and the dog have had to cover our ears in the last couple of months when he was working on OpenGL coding).

As always, it is good to have this logged as an issue for future consideration by others.
PhyllisSmith

2019-09-11 00:49

manager   ~0002114

@Andrea
"I still have one thing to understand: doesn't implementing a CMS mean having internal LUTs that avoid making continuous conversions between color spaces and color models? Or, in any case, to make them faster and more precise because they always refer to the same absolute XYZ coordinates as the colours?"

I am not sure that the below quote from GG helps illuminate the question above, but here it is anyway:
Start quote: "There is an optional feature that can be used via .opts lines from the ffmpeg decoded files. This is via the video_filter=colormatrix=...ffmpeg plugin. There may be other good plugins (lut3d...) that can also accomplish a desired color transform. This .opts feature affects the file colorspace on a file by file basis, although in principle it should be possible to set up a histogram plugin or any of the F_lut* plugins to remap the colortable, either by table or interp.

For output, the yuv<->rgb transformations are via the YUV class, and its tables are initialized using YUV::yuv_set_colors. This sets up the transfer tables for just one version (one of bt601,709,2020) and the color range for mpeg or jpeg. This is limited, but since the product is usually a render, and this transform is designed to match display parameters, it is often correct to select the output colorspace once. If the render needs a colorspace mapping, then the session can be nested and the session output remapped by a plugin.

This is all not very glitzy or highly automated, but it does provide a wide color mapping capability." End Quote.

With our limited equipment and lack of color management knowledge, I think CMS can only be integrated into CinGG by someone with in depth knowledge of both video and programming skills.
Andrew-R

2019-09-08 13:10

reporter   ~0002089

Something like this patch?

----------
diff --git a/cinelerra-5.1/cinelerra/vdevicex11.C b/cinelerra-5.1/cinelerra/vdevicex11.C
index 24d1be0..6d87692 100644
--- a/cinelerra-5.1/cinelerra/vdevicex11.C
+++ b/cinelerra-5.1/cinelerra/vdevicex11.C
@@ -333,6 +333,13 @@ void VDeviceX11::new_output_buffer(VFrame **result, int file_colormodel, EDL *ed
                                }
                                break;

+                       case BC_RGBA_FLOAT:
+                       case BC_RGB_FLOAT:
+                       if( device->out_config->driver == PLAYBACK_X11_GL
+                               && !output->use_scrollbars )
+                                       bitmap_type = BITMAP_PRIMARY;
+                               break;
+
                        case BC_YUV420P:
                                if( device->out_config->driver == PLAYBACK_X11_XV &&
                                    window->accel_available(display_colormodel, 0) &&
diff --git a/cinelerra-5.1/guicast/bctexture.C b/cinelerra-5.1/guicast/bctexture.C
index 52787e1..f1fd166 100644
--- a/cinelerra-5.1/guicast/bctexture.C
+++ b/cinelerra-5.1/guicast/bctexture.C
@@ -124,9 +124,9 @@ void BC_Texture::create_texture(int w, int h, int colormodel)
                glGenTextures(1, (GLuint*)&texture_id);
                glBindTexture(GL_TEXTURE_2D, (GLuint)texture_id);
                glEnable(GL_TEXTURE_2D);
-               int internal_format = texture_components == 4 ? GL_RGBA8 : GL_RGB8 ;
+               int internal_format = texture_components == 4 ? GL_RGBA32F : GL_RGB32F ;
                glTexImage2D(GL_TEXTURE_2D, 0, internal_format, texture_w, texture_h,
-                               0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
+                               0, GL_RGBA, GL_FLOAT, 0);
                window_id = BC_WindowBase::get_synchronous()->current_window->get_id();
                BC_WindowBase::get_synchronous()->put_texture(texture_id,
                        texture_w, texture_h, texture_components);
-----------------

Note that it only adds 32F floating-point textures, not rendering into a floating-point framebuffer (glxinfo still does not mark any of those FBConfigs as floating-point capable ..maybe a new kernel is needed, or a new card :})

I can see 30-bit integer formats (with 2-bit alpha), but I'm not sure how useful those can be ....

558 GLXFBConfigs:
    visual x bf lv rg d st colorbuffer sr ax dp st accumbuffer ms cav
  id dep cl sp sz l ci b ro r g b a F gb bf th cl r g b a ns b eat
----------------------------------------------------------------------------
0x075 0 tc 0 32 0 r . . 10 10 10 2 . . 0 0 0 0 0 0 0 0 0 None

opengl3_patch_very_small_speedup.patch (1,525 bytes)
Andrew-R

2019-09-08 10:36

reporter   ~0002088

While looking at cinelerra-5.1/cinelerra/vdevicex11.C, I noticed that VDeviceX11::new_output_buffer(VFrame **result, int file_colormodel, EDL *edl) first sets bitmap_type = BITMAP_TEMP and then overrides it to bitmap_type = BITMAP_PRIMARY only if display_colormodel is BC_BGR8888, or one of three other cases with additional constraints (BC_YUV420P, BC_YUV422P, BC_YUV422). The switch does not include BC_RGB_FLOAT or BC_RGBA_FLOAT, so a temporary conversion is used in this case (from the project's RGBA-float (32 bits per component, yes?) to the output's ...) in VDeviceX11::write_buffer.

So, if I want a speed-up, I must include the rgba-32f type in this switch, allocate textures with this format too, and update whichever check decides what the OpenGL display can accelerate so that it also accepts RGB(A)-float output .....
Andrea_Paz

2019-09-08 08:22

updater   ~0002087

I thank GG for the clear and thorough explanation.
I'm a bit sad because I seem to have understood that CinGG will never have a color management (CMS) unless you change EVERYTHING. This is impossible and not even desirable.
I still have one thing to understand: doesn't implementing a CMS mean having internal LUTs that avoid making continuous conversions between color spaces and color models? Or, in any case, to make them faster and more precise because they always refer to the same absolute XYZ coordinates as the colours?
Andrew-R

2019-09-08 07:04

reporter   ~0002086

Wow, thanks a lot to both Phyllis and GG for such a technical answer!

I see your point about the rgb8 -> rgba_float conversion step done in Cinelerra itself being the slow path.
I was looking at babl (http://gegl.org/babl/index.html#Usage) - I have it installed for GIMP anyway .....

Maybe some function in babl-0.1.72/extensions/sse4-int8.c can be useful for Cin (even if just for testing the speedup/idea)?

I see functions like:

#if defined(USE_SSE4_1)

/* SSE 4 */
#include <smmintrin.h>

#include <stdint.h>
#include <stdlib.h>

#include "babl.h"
#include "babl-cpuaccel.h"
#include "extensions/util.h"

static inline void
conv_y8_yF (const Babl *conversion,
            const uint8_t *src,
            float *dst,
            long samples)
{
  const float factor = 1.0f / 255.0f;
  const __v4sf factor_vec = {1.0f / 255.0f, 1.0f / 255.0f, 1.0f / 255.0f, 1.0f / 255.0f};
  const uint32_t *s_vec;
  __v4sf *d_vec;

  long n = samples;

  s_vec = (const uint32_t *)src;
  d_vec = (__v4sf *)dst;

  while (n >= 4)
    {
      __m128i in_val;
      __v4sf out_val;
      in_val = _mm_insert_epi32 ((__m128i)_mm_setzero_ps(), *s_vec++, 0);
      in_val = _mm_cvtepu8_epi32 (in_val);
      out_val = _mm_cvtepi32_ps (in_val) * factor_vec;
      _mm_storeu_ps ((float *)d_vec++, out_val);
      n -= 4;
    }

  src = (const uint8_t *)s_vec;
  dst = (float *)d_vec;

  while (n)
    {
      *dst++ = (float)(*src++) * factor;
      n -= 1;
    }
}

[...]

static void
conv_rgb8_rgbF (const Babl *conversion,
                const uint8_t *src,
                float *dst,
                long samples)
{
  conv_y8_yF (conversion, src, dst, samples * 3);
}

static void
conv_rgba8_rgbaF (const Babl *conversion,
                  const uint8_t *src,
                  float *dst,
                  long samples)
{
  conv_y8_yF (conversion, src, dst, samples * 4);
}

#endif
--------------

Also, because babl sort of specializes in color management too, maybe it can be reused at least for some stages if/when color management comes to Cinelerra-GG ....

PS: my CPU has SSE4.1:

cat /proc/cpuinfo | grep sse4
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 popcnt aes xsave avx f16c lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs xop skinit wdt lwp fma4 tce nodeid_msr tbm topoext perfctr_core perfctr_nb cpb hw_pstate ssbd vmmcall bmi1 arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold
PhyllisSmith

2019-09-08 01:35

manager   ~0002085

Short answer by Phyllis: we could duplicate your results on my previous laptop but the 2 computers we use daily have multiple cpus and the fps stayed just about 24 all of the time. It may be possible to re-write a bunch of stuff to specialize it for opengl data formats and texture design, but is a great distance from the current design that targets the internal data model you specify. And "The if (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGB_FLOAT) ... is testing to see if the conversion can be skipped (returns call). Whereas the table accesses a shader needed to do a color conversion." in response to concern about "covering all cases".

Long "real" answer by GoodGuy:
hi

This is a sort of fuzzy analysis of the data transfers for the
different rendering drivers, color models and formats that are
used when you do stuff. This data was collected using the
"prof2" program which is in the main cin5 src directory.
This profiler program uses an alarm signal to frequently
collect stack traces to see a bird's-eye view of the program
execution. Here is a piece of some of the data I collected
using this program:

The top part is the time observed for each alarm interrupt.
In this case, it is cpu time slices, 100 per sec, that create
an integral histogram of where it was when it was interrupted.

   1.540s 1.4% ff_hevc_put_hevc_qpel_hv8_8_sse4 /mnt0/build5/cinelerra-5.1/bin/cin
   1.600s 1.4% ff_hevc_put_hevc_bi_qpel_hv8_8_sse4 /mnt0/build5/cinelerra-5.1/bin/cin
   1.800s 1.6% shmdt /lib64/libc-2.28.so
   2.070s 1.8% ff_hevc_deblocking_boundary_strengths /mnt0/build5/cinelerra-5.1/bin/cin
   2.130s 1.9% copy_CTB_to_hv /mnt0/build5/cinelerra-5.1/bin/cin
  19.020s 17.0% yuv420_bgr32_mmx /mnt0/build5/cinelerra-5.1/bin/cin
  23.230s 20.8% BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int) /mnt0/build5/cinelerra-5.1/bin/cin
  37.430s 33.4% _fini /mnt0/build5/cinelerra-5.1/bin/cin
------------
this part tries to walk the stack, and show the cpu stack path histogram of the
time spent in execution. It shows how it got to the bad guys at the bottom
of the stack (above).
  10.180s 9.1% BC_Xfer::xfer_slices(int) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.760s 10.5% FFVideoStream::load(VFrame*, long) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.760s 10.5% FFVideoConvert::convert_cmodel(VFrame*, AVFrame*) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.760s 10.5% FFMPEG::decode(int, long, VFrame*) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.760s 10.5% FileFFMPEG::read_frame(VFrame*) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.910s 10.6% File::read_frame(VFrame*, int) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.920s 10.6% VEdit::read_frame(VFrame*, long, int, CICache*, int, int, int) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.930s 10.7% VRender::process_buffer(long, int) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  11.970s 10.7% VRender::run() 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  13.280s 11.9% non-virtual thunk to BC_Xfer::Slicer::run() 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  19.020s 17.0% yuv420_bgr32_mmx 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  23.230s 20.8% BC_Xfer::xfer_rgba8888_to_rgba_float(unsigned int, unsigned int) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  27.020s 24.1% Thread::entrypoint(void*) 1.0 /mnt0/build5/cinelerra-5.1/bin/cin
  27.020s 24.1% start_thread 1.0 /lib64/libpthread-2.28.so
  37.520s 33.5% _fini 1.0 /mnt0/build5/cinelerra-5.1/bin/cin

This is mostly just smoke and mirrors, but it does show that the program is
spending a great deal of time converting first from the media format (yuv420)
to rgb using ffmpeg software scale (sws) transfers to convert to rgb. This is
a good idea, since there are a bunch of yuv models (BT601,BT709,BT2020 and
MPEG/JPG color ranges) and cin5 only supports one of these at a time. There
may be more than one type in your session. That makes the ffmpeg yuv->rgb
conversion a needed feature. Next, since your internal buffers are rgb float,
the data is converted BC_Xfer::xfer_rgba8888_to_rgba_float. That requires
the use of the floating point unit to operate the conversion and memory
transfers. The float instructions are much slower than integer instructions,
and performance varies greatly depending on cpu models. The rgb8888 to
rgb float step is in there because "you said to" in the session format.

OpenGL can render float or 8bit, and probably at nearly the same speed, but
that is not where the time is spent. It is mostly in the decode, and media
data prep for the session render format that is soaking up all of the time.

It is true that textures support a wide variety of data/color models, but
the demand is for the render setup, not usually for graphics performance.
This puts a big constraint on what needs to be programmed, since the result
is targeting the software renders (always used as the reference for render)
since depending on opengl can produce results that are hard to control.
There are a very high number of rendering options for opengl, and every
implementation may or may not be exactly identical.

So, it may be possible to re-write a bunch of stuff to specialize it for
opengl data formats and texture design, but is a great distance from the
current design that targets the internal data model you specify.

convert_cmodel is only used to convert frames that are from nested edl
renders, not composer canvas render drawing. composer canvas draws are
normally Playback3D::write_buffer_sync->Playback3D::draw_output, which
may use an opengl fragment shader yuv_to_rgb_frag to convert if the
drawn frame is yuv. The shader tables you reference are opengl fragment
shaders that are used to convert data for the nested renders.
The if (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGB_FLOAT) ...
is testing to see if the conversion can be skipped (returns call).
The table accesses a shader needed to do a color conversion.

To affect drawing, the screen buffer format is normally defined by
the x11 visual chosen during glx probe, and is constrained to rgb8888
by BC_WindowBase::glx_window_fb_configs. With modern video and device
formats, this may need to be upgraded very soon, but I can't talk phyllis
into buying any new video graphics cards, high depth monitors, or any
new tvs to try out any of this, so for the time being, this is what I
can actually test, because it is what is here at my house.

The texture formats are also (sadly) always 8bit, BC_Texture::create_texture.
It sets the basic parameters for texture internal formats which are
almost always used as the data design for opengl operations. This also
needs to be upgraded, but it would use more graphics memory and may
introduce performance issues also. Mesa (software) opengl is used by
almost all distros, unless you specify that you want something else...
so internal format choice may widely affect mesa, and therefore the speed
of rendering in cin5.


and so in summary, it is true that cin5 may be able to use better
opengl configurations and parameters, but usually it is doing what it
does for pretty good reasons. The main purpose for opengl in cin5
seems to be to speed up editing, not produce the best render design.

gg
Andrea_Paz

2019-09-06 07:40

updater   ~0002079

I'm sorry I can't help but I thank you for the work you do, which I think is really important for CinGG. It could be the beginning of a color management.
Andrew-R

2019-09-06 02:08

reporter   ~0002077

Actually, I tried to hack a bit on Cinelerra, but while my hack seems to work, in that it shows an image in the Compositor window, it doesn't speed things up :/

--------------------
diff --git a/cinelerra-5.1/cinelerra/playback3d.C b/cinelerra-5.1/cinelerra/playback3d.C
index a7f185b..e45edc6 100644
--- a/cinelerra-5.1/cinelerra/playback3d.C
+++ b/cinelerra-5.1/cinelerra/playback3d.C
@@ -1491,11 +1491,14 @@ void Playback3D::convert_cmodel(Canvas *canvas,
        if(
                (output->get_opengl_state() == VFrame::TEXTURE ||
                output->get_opengl_state() == VFrame::SCREEN) &&
+(
 // OpenGL has no floating point.
+/*
                ( (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGB_FLOAT) ||
                  (src_cmodel == BC_RGBA8888 && dst_cmodel == BC_RGBA_FLOAT) ||
                  (src_cmodel == BC_RGB_FLOAT && dst_cmodel == BC_RGB888) ||
-                 (src_cmodel == BC_RGBA_FLOAT && dst_cmodel == BC_RGBA8888) ||
+                 (src_cmodel == BC_RGBA_FLOAT && dst_cmodel == BC_RGBA8888) ||
+*/
 // OpenGL sets alpha to 1 on import
                  (src_cmodel == BC_RGB888 && dst_cmodel == BC_RGBA8888) ||
                  (src_cmodel == BC_YUV888 && dst_cmodel == BC_YUVA8888) ||
diff --git a/cinelerra-5.1/guicast/bctexture.C b/cinelerra-5.1/guicast/bctexture.C
index 52787e1..cc50454 100644
--- a/cinelerra-5.1/guicast/bctexture.C
+++ b/cinelerra-5.1/guicast/bctexture.C
@@ -124,9 +124,9 @@ void BC_Texture::create_texture(int w, int h, int colormodel)
                glGenTextures(1, (GLuint*)&texture_id);
                glBindTexture(GL_TEXTURE_2D, (GLuint)texture_id);
                glEnable(GL_TEXTURE_2D);
-               int internal_format = texture_components == 4 ? GL_RGBA8 : GL_RGB8 ;
+               int internal_format = texture_components == 4 ? GL_RGBA16F : GL_RGB16F ;
                glTexImage2D(GL_TEXTURE_2D, 0, internal_format, texture_w, texture_h,
-                               0, GL_RGBA, GL_UNSIGNED_BYTE, 0);
+                               0, GL_RGBA, GL_FLOAT, 0);
                window_id = BC_WindowBase::get_synchronous()->current_window->get_id();
                BC_WindowBase::get_synchronous()->put_texture(texture_id,
                        texture_w, texture_h, texture_components);
--------------

ogl3_no_speedup.diff (1,799 bytes)

Issue History

Date Modified Username Field Change
2019-09-05 23:51 Andrew-R New Issue
2019-09-06 02:08 Andrew-R File Added: ogl3_no_speedup.diff
2019-09-06 02:08 Andrew-R Note Added: 0002077
2019-09-06 07:40 Andrea_Paz Note Added: 0002079
2019-09-08 01:35 PhyllisSmith Assigned To => PhyllisSmith
2019-09-08 01:35 PhyllisSmith Status new => acknowledged
2019-09-08 01:35 PhyllisSmith Note Added: 0002085
2019-09-08 07:04 Andrew-R Note Added: 0002086
2019-09-08 08:22 Andrea_Paz Note Added: 0002087
2019-09-08 10:36 Andrew-R Note Added: 0002088
2019-09-08 13:10 Andrew-R File Added: opengl3_patch_very_small_speedup.patch
2019-09-08 13:10 Andrew-R Note Added: 0002089
2019-09-11 00:49 PhyllisSmith Note Added: 0002114
2019-09-11 01:25 PhyllisSmith Note Added: 0002116
2019-09-11 02:17 Andrew-R Note Added: 0002118
2019-09-11 08:00 Olaf Note Added: 0002119
2019-09-11 09:52 Andrew-R Note Added: 0002120
2019-09-11 11:47 Olaf Note Added: 0002121