Cache strategy
This discussion is connected to the gegl-developer-list.gnome.org mailing list which is provided by the GIMP developers and not related to gimpusers.com.
This is a read-only list on gimpusers.com so this discussion thread is read-only, too.
Cache strategy | Patrik Östman | 14 Jun 21:06 |
Cache strategy | Patrik Östman | 15 Jun 21:12 |
Cache strategy | Øyvind Kolås | 15 Jun 23:46 |
Cache strategy | Martin Nordholts | 16 Jun 07:29 |
Cache strategy | johannes hanika | 16 Jun 09:58 |
Cache strategy | Øyvind Kolås | 16 Jun 10:48 |
Cache strategy | Øyvind Kolås | 16 Jun 10:54 |
Cache strategy | jcupitt@gmail.com | 16 Jun 12:11 |
Cache strategy | Øyvind Kolås | 16 Jun 12:32 |
Cache strategy | jcupitt@gmail.com | 16 Jun 12:38 |
Cache strategy | Øyvind Kolås | 16 Jun 14:37 |
Cache strategy | jcupitt@gmail.com | 16 Jun 15:29 |
Cache strategy | hendrik@topoi.pooq.com | 16 Jun 15:52 |
Cache strategy
Hi. I have some questions regarding the cache strategy in GEGL. I have tested using both 0.0.20 and 0.0.22 and I find them different in the way nodes are cached.
GEGL 0.0.20:
Creating the graph below:
1
|
2
|
3
|
4
Rendering node 4 and after that rendering node 2. When rendering node 2 the output is taken from the cache because it was already calculated when rendering node 4. This is also what I had expected. Now linking node 2 to node 5 and creating a new branch:
1
|
2--
| |
3 5
|
4
Rendering node 4 again, the graph is re-rendered from node 2 to node 4. This should not be necessary, because the output of node 2 does not change when its output pad is connected to another node. What is the reason for this?
GEGL 0.0.22:
In GEGL 0.0.22 I am not able to use any cached node output other than that of the last rendered node. E.g., creating the first graph above, rendering node 4 and then rendering node 4 again returns the result from the cache. But if I render node 2 (after rendering node 4), the graph is rendered from node 1 to node 2. Were any significant changes to cache handling made between 0.0.20 and 0.0.22, or is there a setting that must be turned on to get 'per node cache' functionality?
Another question: what is the difference between the GEGL_BLIT_CACHE and GEGL_BLIT_DEFAULT flags of the 'blit' function?
Regards
Patrik Östman
Cache strategy
Hi.
I am reposting my previous message because of bad formatting. Hope this will be better...
I have some questions regarding the cache strategy in GEGL. I have tested using both 0.0.20 and 0.0.22 and I find them different in the way nodes are cached.
GEGL 0.0.20:
Creating the graph below:
1
|
2
|
3
|
4
Rendering node 4 and after that rendering node 2.
When rendering node 2 the output is taken from the cache
because it was already calculated when rendering node 4.
This is also what I had expected. Now linking node 2 to
node 5 and creating a new branch:
1
|
2--
| |
3 5
|
4
Rendering node 4 again, the graph is re-rendered from node 2 to node 4. This should not be necessary, because the output of node 2 does not change when its output pad is connected to another node. What is the reason for this?
GEGL 0.0.22:
In GEGL 0.0.22 I am not able to use any cached node output other than that of the last rendered node. E.g., creating the first graph above, rendering node 4 and then rendering node 4 again returns the result from the cache. But if I render node 2 (after rendering node 4), the graph is rendered from node 1 to node 2. Were any significant changes to cache handling made between 0.0.20 and 0.0.22, or is there a setting that must be turned on to get 'per node cache' functionality?
Another question: what is the difference between the GEGL_BLIT_CACHE and GEGL_BLIT_DEFAULT flags of the 'blit' function?
Regards
Patrik Östman
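(For concreteness, here is a minimal editorial sketch, not from the original mail, of how such a linear 1-2-3-4 chain might be built and rendered through the public C API. The operation names are arbitrary examples, and both the "gegl:" prefixes and the exact gegl_node_blit() signature may differ between the 0.0.x releases discussed here and later GEGL versions.)

#include <gegl.h>

int
main (int argc, char **argv)
{
  gegl_init (&argc, &argv);

  /* Build the linear chain 1 -> 2 -> 3 -> 4 from the question above. */
  GeglNode *graph = gegl_node_new ();
  GeglNode *n1 = gegl_node_new_child (graph, "operation", "gegl:checkerboard", NULL);
  GeglNode *n2 = gegl_node_new_child (graph, "operation", "gegl:brightness-contrast", NULL);
  GeglNode *n3 = gegl_node_new_child (graph, "operation", "gegl:gaussian-blur", NULL);
  GeglNode *n4 = gegl_node_new_child (graph, "operation", "gegl:unsharp-mask", NULL);
  gegl_node_link_many (n1, n2, n3, n4, NULL);

  GeglRectangle roi = { 0, 0, 256, 256 };
  gfloat *pixels = g_new (gfloat, roi.width * roi.height * 4);

  /* Render node 4 with GEGL_BLIT_CACHE so the result is kept in that
     node's cache, then render node 2; whether node 2 can then be served
     from a cache is exactly what the question above is about. */
  gegl_node_blit (n4, 1.0, &roi, babl_format ("RGBA float"),
                  pixels, GEGL_AUTO_ROWSTRIDE, GEGL_BLIT_CACHE);
  gegl_node_blit (n2, 1.0, &roi, babl_format ("RGBA float"),
                  pixels, GEGL_AUTO_ROWSTRIDE, GEGL_BLIT_CACHE);

  g_free (pixels);
  g_object_unref (graph);
  gegl_exit ();
  return 0;
}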
Cache strategy
On Mon, Jun 15, 2009 at 8:12 PM, Patrik Östman wrote:
node 1 to node 2. Were any significant changes to cache handling made between 0.0.20 and 0.0.22, or is there a setting that must be turned on to get 'per node cache' functionality?
The caching policies and mechanisms in GEGL are still basic and rough. At the moment the GeglOperationClass has a boolean no_cache member that is checked. Which operation subclasses cache automatically has changed lately, since the memory consumption of caching every single node probably outweighs the benefit of a cache on every node. In particular, I think point operations now explicitly do not cache their results.
I've entertained the idea of using special nodes to indicate caches, but that would to some extent muddle the graph. Perhaps a better solution is to allow turning caches on dynamically, on demand, rather than only defining such hints in the class; the existing implementation should be possible to bend in both ways.
Another question: what is the difference between the GEGL_BLIT_CACHE and GEGL_BLIT_DEFAULT flags of the 'blit' function?
GEGL_BLIT_DEFAULT will not create a cache for the node but render directly; using GEGL_BLIT_CACHE instead will create a cache that results are rendered into and served from, making subsequent requests much faster when neither the properties nor the graph have changed.
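(Editorial sketch, not code from the mail: the no_cache hint mentioned above is a member of GeglOperationClass, so an operation that should never keep a cache would set it in its class_init. "MyPointOp" is a made-up operation and the surrounding GObject boilerplate is elided.)

static void
my_point_op_class_init (MyPointOpClass *klass)
{
  GeglOperationClass *operation_class = GEGL_OPERATION_CLASS (klass);

  /* Point operations are cheap to recompute per pixel, so opt out of
     caching their results entirely. */
  operation_class->no_cache = TRUE;
}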
Cache strategy
Øyvind Kolås wrote:
On Mon, Jun 15, 2009 at 8:12 PM, Patrik Östman wrote:
node 1 to node 2. Were any significant changes to cache handling made between 0.0.20 and 0.0.22, or is there a setting that must be turned on to get 'per node cache' functionality?
The caching policies and mechanisms in GEGL are still basic and rough. At the moment the GeglOperationClass has a boolean no_cache member that is checked. Which operation subclasses cache automatically has changed lately, since the memory consumption of caching every single node probably outweighs the benefit of a cache on every node. In particular, I think point operations now explicitly do not cache their results.
I've entertained the idea of using special nodes to indicate caches, but that would to some extent muddle the graph. Perhaps a better solution is to allow turning caches on dynamically, on demand, rather than only defining such hints in the class; the existing implementation should be possible to bend in both ways.
I've been thinking for a while lately about what the overall model would be for a redesign of the caching strategy used in GEGL. First of all, I don't think explicit cache nodes would solve that many problems. We would still need a way to propagate invalidation of nodes through the graph to invalidate the cache, and that logic still needs to be on the lowest node level. Furthermore, as you say, only allowing caching in explicit cache nodes will be very cumbersome if one needs to do clever on-demand caching in parts of the graph, and that will be needed in an interactive application with a large graph such as GIMP. For example, when you interactively apply geometric transforms on a layer, you will want to have caching right before the compositor op that composes the transformed layer onto the projection. You will not want to have caches on the compositor ops all the time though because it would cause a giant memory overhead.
One caching strategy that I think we will come pretty far with is to handle caching in a wrapper function to GeglOperation::process(). That is, the GeglEvalVisitor would not call GeglOperation::process() directly but instead something like GeglOperation::do_process(). This method would contain common logic for caching. If caching is turned off, it would simply call GeglOperation::process() and not do much more. If caching is turned on, it would not bother calling GeglOperation::process() and instead just return what it has cached. This strategy would also collect caching to a common place instead of spreading it across the GeglEvalVisitor and the GeglNeedVisitor (former GeglCrVisitor).
The wrapper function GeglOperation::do_process() would also be a natural place to have logic for turning off the "visibility" or "activeness" of a node. When a node is invisible/inactive, it acts as a gegl:nop and just passes its input directly to its output. That will allow us to easily implement toggling the visibility of traditional layers, and also toggling the "visibility" of layer effects such as a blur or emboss.
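(A rough, purely illustrative sketch of the proposed wrapper; GeglOperation::do_process() does not exist in GEGL at this point, and all of the node_* helpers below are hypothetical placeholders for the cache and visibility bookkeeping.)

static GeglBuffer *
gegl_operation_do_process (GeglOperation       *operation,
                           const GeglRectangle *roi)
{
  GeglNode *node = gegl_operation_get_node (operation); /* hypothetical accessor */

  /* Invisible/inactive nodes behave like gegl:nop: pass the input through. */
  if (!node_is_visible (node))
    return g_object_ref (node_get_input_buffer (node));

  /* With caching turned on and a valid cache, skip process() entirely. */
  if (node_has_valid_cache (node, roi))
    return g_object_ref (node_get_cache_buffer (node));

  /* Otherwise run the real GeglOperation::process() implementation. */
  GeglBuffer *result = node_call_process (operation, roi);

  if (node_wants_cache (node))
    node_store_in_cache (node, result, roi);

  return result;
}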
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
/ Martin
Cache strategy
Hi,
I'm new to the list and was following your caching discussion with some interest, as I was just implementing something similar for an open-source interactive photo-development application and was thinking about using GEGL instead.
For this application, there are two main tasks which have to be fast: zooming/panning (especially quick 1:1 views) and history stack popping (to compare whether your latest changes really improved the image).
As I wanted to develop images and not a powerful library, I used a very simple approach to caching:
- There is a fixed-size cache with lines large enough to hold all relevant pixels for display (e.g. 3*float*512*512), with about 5 cache lines.
- The first operation scales and crops the input buffer to fit this window.
- Each operation allocates its output buffer in this cache, tagged with a hash that depends on all parameters of all operations from the original down to here.
- The graph of operations is simply traversed as needed and the cache lines are replaced by a least-recently-used scheme.
Obviously, this does not work terribly well with zooming and panning. To address this, there is a second pixel pipeline with a downscaled buffer, which is always evaluated as a whole, not only the visible parts (this should really be replaced by something as cool as the tile buffer in GeglBuffer).
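(The scheme described above, sketched as self-contained C for illustration; this is not darktable's or GEGL's actual code, and the hashing of the parameter history is left out.)

#include <stdint.h>

#define NUM_LINES   5
#define LINE_FLOATS (3 * 512 * 512)   /* 3 * float * 512 * 512, as above */

typedef struct
{
  uint64_t hash;      /* hash over all params from the source down to this op */
  uint64_t last_used; /* global counter value at last access; 0 = never used  */
  float    buf[LINE_FLOATS];
} cache_line_t;

static cache_line_t lines[NUM_LINES];
static uint64_t     use_counter = 0;

/* Return the buffer for `hash`; *hit tells the caller whether it still has
   to run the operation and fill the buffer. */
static float *
cache_get (uint64_t hash, int *hit)
{
  int lru = 0;

  for (int i = 0; i < NUM_LINES; i++)
    {
      if (lines[i].hash == hash && lines[i].last_used != 0)
        {
          lines[i].last_used = ++use_counter;  /* refresh LRU position */
          *hit = 1;
          return lines[i].buf;
        }
      if (lines[i].last_used < lines[lru].last_used)
        lru = i;   /* remember the least recently used line */
    }

  /* Miss: evict the least recently used line and hand it out for writing. */
  lines[lru].hash      = hash;
  lines[lru].last_used = ++use_counter;
  *hit = 0;
  return lines[lru].buf;
}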
Martin Nordholts wrote:
For example, when you
interactively apply geometric transforms on a layer, you will want to have caching right before the compositor op that composes the transformed layer onto the projection. You will not want to have caches on the compositor ops all the time though because it would cause a giant memory overhead.
This is handled well, because changing the last operation in the graph will need the output of the previous one, thus incrementing the "more recently used" value of that one and preventing the important previous cache line from being swapped out.
If this swaps out other important cache lines along the way, a mechanism to detect such a situation can be implemented (only the parameters of the same op changing very often and quickly); such a thing is required for the history stack anyway, as it would also overflow.
So I was wondering whether such a global (in the parent GEGL node?), straightforward LRU cache with an appropriate hash could work for the tiles in GEGL as well (one tile = one cache line, or one mip-map level of a tile = one cache line)? Did you try it already? If so, how did it go?
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
This is definitely true, although I hope there will always be enough work on disjoint tiles, or just parallel read accesses on the same tile (having two threads write the same tile doesn't really make sense anyway).
-jo
Cache strategy
On Tue, Jun 16, 2009 at 8:58 AM, johannes hanika wrote:
This is handled well, because changing the last operation in the graph will need the output of the previous one, thus incrementing the "more recently used" value of that one and preventing the important previous cache line from being swapped out.
If this swaps out other important cache lines along the way, a mechanism to detect such a situation can be implemented (only the parameters of the same op changing very often and quickly); such a thing is required for the history stack anyway, as it would also overflow.
So I was wondering whether such a global (in the parent GEGL node?), straightforward LRU cache with an appropriate hash could work for the tiles in GEGL as well (one tile = one cache line, or one mip-map level of a tile = one cache line)? Did you try it already? If so, how did it go?
Having caches for all nodes effectively implements what you suggest, since the caches are just GeglBuffers, and GeglBuffers that are unused will end up having all their tiles swapped out to disk.
Cache strategy
On Tue, Jun 16, 2009 at 6:31 AM, Martin Nordholts wrote:
One caching strategy that I think we will come pretty far with is to handle caching in a wrapper function to GeglOperation::process(). That is, the GeglEvalVisitor would not call GeglOperation::process() directly but instead something like GeglOperation::do_process(). This method would contain common logic for caching. If caching is turned off, it would simply call GeglOperation::process() and not do much more. If caching is turned on, it would not bother calling GeglOperation::process() and instead just return what it has cached. This strategy would also collect caching to a common place instead of spreading it across the GeglEvalVisitor and the GeglNeedVisitor (former GeglCrVisitor).
The wrapper function GeglOperation::do_process() would also be a natural place to have logic for turning off the "visibility" or "activeness" of a node. When a node is invisible/inactive, it acts as a gegl:nop and just passes its input directly to its output. That will allow us to easily implement toggling the visibility of traditional layers, and also toggling the "visibility" of layer effects such as a blur or emboss.
The current approach where the common logic in the node/processing code creates the destination GeglBuffer for the computation is already well separated. GeglOperation::do_process doesn't belong inside GeglOperation but would be a new name for the current (and perhaps extended) functionality of figuring out how to create the GeglBuffer for the computation.
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
In most instances, this would only incur synchronization of a few tiles where the chunks/work regions overlap. Unless you are stupid and compute with chunk-size ~= tile-size, the impact of this should be mostly negligible.
Cache strategy
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
In most instances, this would only incur synchronization of a few tiles where the chunks/work regions overlap. Unless you are stupid and compute with chunk-size ~= tile-size, the impact of this should be mostly negligible.
You would still need a lock on the cache, wouldn't you? For example, if the cache is held as a GHashTable of tiles, even if individual tiles are disjoint and not shared, you'll still need to lock the hash table before you can search it. A couple of locks on every tile of every node will hurt SMP scaling badly.
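(Illustrative sketch of the locking pattern being discussed, using GLib 2.32-style mutexes; this is not actual GEGL or vips code. Even when threads touch disjoint tiles, every lookup or insert on a shared per-node tile table goes through the lock.)

#include <glib.h>

typedef struct
{
  GMutex      mutex;   /* guards the table below             */
  GHashTable *tiles;   /* key: tile index, value: tile data  */
} NodeCache;

static gpointer
node_cache_get_tile (NodeCache *cache, gint tile_index)
{
  gpointer tile;

  g_mutex_lock (&cache->mutex);   /* memory barrier on every access */
  tile = g_hash_table_lookup (cache->tiles, GINT_TO_POINTER (tile_index));
  g_mutex_unlock (&cache->mutex);

  return tile;
}

static void
node_cache_put_tile (NodeCache *cache, gint tile_index, gpointer tile)
{
  g_mutex_lock (&cache->mutex);
  g_hash_table_insert (cache->tiles, GINT_TO_POINTER (tile_index), tile);
  g_mutex_unlock (&cache->mutex);
}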
For what it's worth, vips handles this by keeping caches thread-private. If threads are working in disjoint areas of the image, there's no benefit to a shared cache anyway. As you say, there will be a degree of recomputation for some operations (eg. convolution), but that's a small cost compared to lock/unlock.
vips has two types of cache: a very small (just 1 or 2 tiles) thread-private cache on every image, and a large and complex shared cache operation that can be explicitly added to the graph. The GUI adds a cache operator automatically just after every operation that can output to the display, but you can use it elsewhere if you wish.
John
Cache strategy
On Tue, Jun 16, 2009 at 11:11 AM, wrote:
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
In most instances, this would only incur synchronization of a few tiles where the chunks/work regions overlap. Unless you are stupid and compute with chunk-size ~= tile-size, the impact of this should be mostly negligible.
You would still need a lock on the cache, wouldn't you? For example, if the cache is held as a GHashTable of tiles, even if individual tiles are disjoint and not shared, you'll still need to lock the hash table before you can search it. A couple of locks on every tile of every node will hurt SMP scaling badly.
You do not need to make this a global hashtable; the way it is done with GeglBuffers, it would end up being one hashtable per node that has a cache, not one cache for all nodes. If I understood your concern correctly, it was about having a single global lock. Also note that GEGL does not use the tile size for the computations; tile dimensions are a concept that belongs only on the GeglBuffer level and should be ignored when using the GeglBuffer API and at higher levels of abstraction.
For what it's worth, vips handles this by keeping caches thread-private. If threads are working in disjoint areas of the image, there's no benefit to a shared cache anyway. As you say, there will be a degree of recomputation for some operations (eg. convolution), but that's a small cost compared to lock/unlock.
vips has two types of cache: a very small (just 1 or 2 tiles) thread-private cache on every image, and a large and complex shared cache operation that can be explicitly added to the graph. The GUI adds a cache operator automatically just after every operation that can output to the display, but you can use it elsewhere if you wish.
This type of cache is the cache that is currently in use in GEGL, although it isn't an explicit member of the graph but is contained within the preceding node (it is the sparse tiled buffer written to by the operation contained in the node, instead of allocating/recycling temporary one-off buffers). The caches are GeglBuffers. Each GeglBuffer (and thus cache) has a tile cache as well that keeps frequently used data from being swapped out to disk.
Cache strategy
2009/6/16 Øyvind Kolås :
On Tue, Jun 16, 2009 at 11:11 AM, wrote:
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
In most instances, this would only incur synchronization of a few tiles where the chunks/work regions overlap. Unless you are stupid and compute with chunk-size ~= tile-size, the impact of this should be mostly negligible.
You would still need a lock on the cache, wouldn't you? For example, if the cache is held as a GHashTable of tiles, even if individual tiles are disjoint and not shared, you'll still need to lock the hash table before you can search it. A couple of locks on every tile of every node will hurt SMP scaling badly.
You do not need to make this a global hashtable; the way it is done with GeglBuffers, it would end up being one hashtable per node that has a cache, not one cache for all nodes. If I understood your concern correctly, it was about having a single global lock.
I'm probably misunderstanding. I was worried that there would be a hash of tiles on each node, and that this hash would be accessed by more than one thread.
If you have that, then even if tiles are disjoint, you'll need a lock/unlock pair around each hash table access, won't you? Otherwise, how can you add or remove tiles safely?
John
Cache strategy
On Tue, Jun 16, 2009 at 11:38 AM, wrote:
2009/6/16 Øyvind Kolås :
On Tue, Jun 16, 2009 at 11:11 AM, wrote:
Another thing worth mentioning is that caches on every node don't scale well to concurrent evaluation of the graph, since the evaluators would need to synchronize usage of the caches all the time, preventing performance from scaling nicely as you use more CPU cores/CPUs.
In most instances, this would only incur synchronization of a few tiles where the chunks/work regions overlap. Unless you are stupid and compute with chunk-size ~= tile-size, the impact of this should be mostly negligible.
You would still need a lock on the cache, wouldn't you? For example, if the cache is held as a GHashTable of tiles, even if individual tiles are disjoint and not shared, you'll still need to lock the hash table before you can search it. A couple of locks on every tile of every node will hurt SMP scaling badly.
You do not need to make this a global hashtable; the way it is done with GeglBuffers, it would end up being one hashtable per node that has a cache, not one cache for all nodes. If I understood your concern correctly, it was about having a single global lock.
I'm probably misunderstanding. I was worried that there would be a hash of tiles on each node, and that this hash would be accessed by more than one thread.
If you have that, then even if tiles are disjoint, you'll need a lock/unlock pair around each hash table access, won't you? Otherwise, how can you add or remove tiles safely?
You do need a lock/unlock pair around tile accesses for GeglBuffers generally, if buffers are to be shared between processes (as mentioned earlier, caches are normal buffers allocated on demand for the nodes). One of the GSoC projects this year is investigating GPU-based storage for GeglBuffers; this will involve further extending the semantics of tile locking, not only for writing but also for reading, since migrations of tile data to/from GPU-based texture storage will be needed.
Locking mutexes for such accesses, perhaps even over batches of tiles when reading/writing a rectangle of pixels, I expect to be detrimental to performance only if the parallelization is implemented so that the threads block each other's access. This should be avoidable.
GeglBuffer used to have, and still has, code that allows access from multiple processes; in that case the processes have separate tile caches but still need atomic accesses to ensure that the buffer/tile revisions are up to date. Personally I find it more interesting to extend GEGL in the direction of multiple processes, or even multiple hosts processing simultaneously, than the programmer overhead of synchronizing threads.
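(Purely as an illustration of the revision idea mentioned above, not GEGL's actual tile code: each shared tile could carry a revision counter that a process compares, atomically, against the revision of its local copy before trusting it.)

#include <glib.h>

typedef struct
{
  gint   revision;   /* bumped atomically on every write to the tile */
  guchar data[];     /* pixel data follows                           */
} SharedTile;

typedef struct
{
  SharedTile *shared;        /* the tile as seen by all processes        */
  gint        seen_revision; /* revision our local copy was made from    */
  guchar     *local_copy;    /* per-process cached pixels                */
} CachedTile;

static gboolean
cached_tile_is_fresh (CachedTile *tile)
{
  /* An atomic read of the shared counter is enough to decide whether the
     local copy needs to be refreshed. */
  return g_atomic_int_get (&tile->shared->revision) == tile->seen_revision;
}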
Cache strategy
2009/6/16 Øyvind Kolås :
Locking mutexes for such accesses, perhaps even over batches of tiles when reading/writing a rectangle of pixels, I expect to be detrimental to performance only if the parallelization is implemented so that the threads block each other's access. This should be avoidable.
In my experience, lock/unlock pairs hurt performance even if threads are not interfering. A lock creates a memory barrier and, even if the processors are not contending, caches must be synchronised and flushed. I imagine the overhead varies a lot with the architecture.
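(As an aside, a rough way to put a number on that uncontended cost, not from the original mail: time a large number of lock/unlock pairs on a mutex that no other thread ever touches.)

#include <pthread.h>
#include <stdio.h>
#include <time.h>

int
main (void)
{
  pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
  const long      n = 10 * 1000 * 1000;
  struct timespec t0, t1;

  clock_gettime (CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < n; i++)
    {
      pthread_mutex_lock (&mutex);    /* each pair implies a memory barrier */
      pthread_mutex_unlock (&mutex);
    }
  clock_gettime (CLOCK_MONOTONIC, &t1);

  double seconds = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
  printf ("%.1f ns per uncontended lock/unlock pair\n", seconds / n * 1e9);
  return 0;
}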
Anyway! This is a theoretical concern at the moment I guess.
John
Cache strategy
Have a look at the Lumiera mailing list, especially their current threads on scheduling. Links to these messages can be found at http://lists.lumiera.org/pipermail/lumiera/2009-June/thread.html
It seems to me that they're dealing with related issues, although with a different set of constraints.
Though I don't think it's time to merge the projects (it probably never will be), I think both would benefit from shared ideas and shared brainstorming.
-- hendrik