What is the fastest way to do Array Table Lookup with an Integer Index?What is the correct way to create a single-instance WPF application?In Java, what is the best way to determine the size of an object?What is the best way to iterate over a dictionary?What are the differences between a multidimensional array and an array of arrays in C#?What is the point of Lookup<TKey, TElement>?Fastest Way of Inserting in Entity FrameworkPass an array of integers to ASP.NET Web API?Fastest way to convert Image to Byte array

Can planetary bodies have a second axis of rotation?

Are actors contractually obligated to certain things like going nude/ Sensual Scenes/ Gory Scenes?

US entry with tourist visa but past alcohol arrest

What is the fastest way to do Array Table Lookup with an Integer Index?

Do things made of adamantine rust?

How could artificial intelligence harm us?

Is It Possible to Have Different Sea Levels, Eventually Causing New Landforms to Appear?

Repeat elements in list, but the number of times each element is repeated is provided by a separate list

Writing a letter of recommendation for a mediocre student

Should the pagination be reset when changing the order?

Understanding an example in Golan's "Linear Algebra"

What do these pins mean? Where should I plug them in?

Debussy as term for bathroom?

What do solvers like Gurobi and CPLEX do when they run into hard instances of MIP

What was an "insurance cover"?

As a discovery writer, how do I complete an unfinished novel (which has highly diverged from the original plot ) after a time-gap?

When does removing Goblin Warchief affect its cost reduction ability?

Is there any actual security benefit to restricting foreign IPs?

How is underwater propagation of sound possible?

Can someone explain to me the parameters of a lognormal distribution?

What is the need of methods like GET and POST in the HTTP protocol?

What was the deeper meaning of Hermione wanting the cloak?

Spectrum of a Subspace of Matrices

How to fix folder structure in Windows 7 and 10



What is the fastest way to do Array Table Lookup with an Integer Index?


What is the correct way to create a single-instance WPF application?In Java, what is the best way to determine the size of an object?What is the best way to iterate over a dictionary?What are the differences between a multidimensional array and an array of arrays in C#?What is the point of Lookup<TKey, TElement>?Fastest Way of Inserting in Entity FrameworkPass an array of integers to ASP.NET Web API?Fastest way to convert Image to Byte array






.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;








13















I have a video processing application that moves a lot of data.



To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.



However I'm at the point where all the lookups now takes 30% of the processing time. I'm wondering if it might be slow RAM.. However, I would still like to try to optimize it some more.



Currently I have the following:



public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];


I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.



int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];




Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.



Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.



I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).



Would an IntPtr be faster? How would I go about that?



Attached below is a screenshot of time distribution:



enter image description here










share|improve this question





















  • 4





    It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

    – 500 - Internal Server Error
    15 hours ago











  • The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

    – RobotRock
    15 hours ago











  • I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

    – RobotRock
    15 hours ago












  • @RobotRock: Off-topic , What tool are you using to display the time distribution?

    – Postlagerkarte
    15 hours ago











  • @Postlagerkarte It's VS2017 Diagnostics Tool

    – RobotRock
    15 hours ago

















13















I have a video processing application that moves a lot of data.



To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.



However I'm at the point where all the lookups now takes 30% of the processing time. I'm wondering if it might be slow RAM.. However, I would still like to try to optimize it some more.



Currently I have the following:



public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];


I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.



int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];




Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.



Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.



I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).



Would an IntPtr be faster? How would I go about that?



Attached below is a screenshot of time distribution:



enter image description here










share|improve this question





















  • 4





    It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

    – 500 - Internal Server Error
    15 hours ago











  • The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

    – RobotRock
    15 hours ago











  • I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

    – RobotRock
    15 hours ago












  • @RobotRock: Off-topic , What tool are you using to display the time distribution?

    – Postlagerkarte
    15 hours ago











  • @Postlagerkarte It's VS2017 Diagnostics Tool

    – RobotRock
    15 hours ago













13












13








13


5






I have a video processing application that moves a lot of data.



To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.



However I'm at the point where all the lookups now takes 30% of the processing time. I'm wondering if it might be slow RAM.. However, I would still like to try to optimize it some more.



Currently I have the following:



public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];


I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.



int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];




Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.



Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.



I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).



Would an IntPtr be faster? How would I go about that?



Attached below is a screenshot of time distribution:



enter image description here










share|improve this question
















I have a video processing application that moves a lot of data.



To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.



However I'm at the point where all the lookups now takes 30% of the processing time. I'm wondering if it might be slow RAM.. However, I would still like to try to optimize it some more.



Currently I have the following:



public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];


I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.



int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];




Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.



Some progress was in shortening the function stack, so no getters but a straight retrieval from a readonly array.



I've tried converting to ushort as well, but it seemed to be slower (as I understand it's due to word size).



Would an IntPtr be faster? How would I go about that?



Attached below is a screenshot of time distribution:



enter image description here







c# memory lookup






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited 29 mins ago









JL2210

4,1944 gold badges12 silver badges41 bronze badges




4,1944 gold badges12 silver badges41 bronze badges










asked 15 hours ago









RobotRockRobotRock

2,2013 gold badges34 silver badges78 bronze badges




2,2013 gold badges34 silver badges78 bronze badges










  • 4





    It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

    – 500 - Internal Server Error
    15 hours ago











  • The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

    – RobotRock
    15 hours ago











  • I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

    – RobotRock
    15 hours ago












  • @RobotRock: Off-topic , What tool are you using to display the time distribution?

    – Postlagerkarte
    15 hours ago











  • @Postlagerkarte It's VS2017 Diagnostics Tool

    – RobotRock
    15 hours ago












  • 4





    It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

    – 500 - Internal Server Error
    15 hours ago











  • The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

    – RobotRock
    15 hours ago











  • I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

    – RobotRock
    15 hours ago












  • @RobotRock: Off-topic , What tool are you using to display the time distribution?

    – Postlagerkarte
    15 hours ago











  • @Postlagerkarte It's VS2017 Diagnostics Tool

    – RobotRock
    15 hours ago







4




4





It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

– 500 - Internal Server Error
15 hours ago





It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?

– 500 - Internal Server Error
15 hours ago













The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

– RobotRock
15 hours ago





The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations

– RobotRock
15 hours ago













I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

– RobotRock
15 hours ago






I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.

– RobotRock
15 hours ago














@RobotRock: Off-topic , What tool are you using to display the time distribution?

– Postlagerkarte
15 hours ago





@RobotRock: Off-topic , What tool are you using to display the time distribution?

– Postlagerkarte
15 hours ago













@Postlagerkarte It's VS2017 Diagnostics Tool

– RobotRock
15 hours ago





@Postlagerkarte It's VS2017 Diagnostics Tool

– RobotRock
15 hours ago












2 Answers
2






active

oldest

votes


















27
















It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);



results first:



AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms


code:



using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P

static int Gather(int[] source, int[] index, int[] results, bool avx)
// normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return

int y = 0;
if (Avx2.IsSupported && avx)

var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

unsafe

fixed (int* sPtr = source)

// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)

rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);



// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;

// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)

results[y] = source[index[y]];

return result;


static void Main()

// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);

// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);

int[] results = new int[1024];

void GatherLocal(bool avx)

// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: avx; slow loop from from");
for (int i = 0; i < 32; i++)

Console.Write(results[i].ToString("x2"));

Console.WriteLine();

const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for TimeLoop loops: watch.ElapsedMillisecondsms");
Console.WriteLine();

GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);







share|improve this answer



























  • Awesome! I think that's what I need! I'm gonna give it a go.

    – RobotRock
    14 hours ago






  • 1





    @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

    – Marc Gravell
    14 hours ago











  • Single loop is fine for me too. The double was more legacy than anything else.

    – RobotRock
    14 hours ago











  • @RobotRock note: still working on it, but it is possible that scale should be 4, not 1

    – Marc Gravell
    14 hours ago






  • 1





    @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

    – Marc Gravell
    14 hours ago


















-2
















RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be Memory Bound, but that is still plenty fast.



Of course at the given sizes, this array is 6 Million entries in size. That will likely not fit in any cache. And will take forever to itterate over. It does not mater what the speed is, this is simply too much data.



As a general rule, video processing is done on the GPU nowadays. GPU's are literally desinged to operate on giant arrays. Because that is what the Image you are seeing right now is - a giant array.



If you have to keep it on the GPU side, maybe caching or Lazy Initilisation would help? Chances are that you do not truly need every value. You only need to common values. Take a examples from dicerolling: If you roll 2 6-sided dice, every result from 2-12 is possible. But the result 7 happens 6 out of 36 casess. The 2 and 12 only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial then the 2 and 12.






share|improve this answer



























  • while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

    – Marc Gravell
    14 hours ago













Your Answer






StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "1"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);



);














draft saved

draft discarded
















StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f57991432%2fwhat-is-the-fastest-way-to-do-array-table-lookup-with-an-integer-index%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown

























2 Answers
2






active

oldest

votes








2 Answers
2






active

oldest

votes









active

oldest

votes






active

oldest

votes









27
















It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);



results first:



AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms


code:



using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P

static int Gather(int[] source, int[] index, int[] results, bool avx)
// normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return

int y = 0;
if (Avx2.IsSupported && avx)

var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

unsafe

fixed (int* sPtr = source)

// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)

rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);



// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;

// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)

results[y] = source[index[y]];

return result;


static void Main()

// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);

// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);

int[] results = new int[1024];

void GatherLocal(bool avx)

// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: avx; slow loop from from");
for (int i = 0; i < 32; i++)

Console.Write(results[i].ToString("x2"));

Console.WriteLine();

const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for TimeLoop loops: watch.ElapsedMillisecondsms");
Console.WriteLine();

GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);







share|improve this answer



























  • Awesome! I think that's what I need! I'm gonna give it a go.

    – RobotRock
    14 hours ago






  • 1





    @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

    – Marc Gravell
    14 hours ago











  • Single loop is fine for me too. The double was more legacy than anything else.

    – RobotRock
    14 hours ago











  • @RobotRock note: still working on it, but it is possible that scale should be 4, not 1

    – Marc Gravell
    14 hours ago






  • 1





    @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

    – Marc Gravell
    14 hours ago















27
















It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);



results first:



AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms


code:



using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P

static int Gather(int[] source, int[] index, int[] results, bool avx)
// normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return

int y = 0;
if (Avx2.IsSupported && avx)

var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

unsafe

fixed (int* sPtr = source)

// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)

rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);



// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;

// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)

results[y] = source[index[y]];

return result;


static void Main()

// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);

// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);

int[] results = new int[1024];

void GatherLocal(bool avx)

// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: avx; slow loop from from");
for (int i = 0; i < 32; i++)

Console.Write(results[i].ToString("x2"));

Console.WriteLine();

const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for TimeLoop loops: watch.ElapsedMillisecondsms");
Console.WriteLine();

GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);







share|improve this answer



























  • Awesome! I think that's what I need! I'm gonna give it a go.

    – RobotRock
    14 hours ago






  • 1





    @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

    – Marc Gravell
    14 hours ago











  • Single loop is fine for me too. The double was more legacy than anything else.

    – RobotRock
    14 hours ago











  • @RobotRock note: still working on it, but it is possible that scale should be 4, not 1

    – Marc Gravell
    14 hours ago






  • 1





    @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

    – Marc Gravell
    14 hours ago













27














27










27









It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);



results first:



AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms


code:



using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P

static int Gather(int[] source, int[] index, int[] results, bool avx)
// normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return

int y = 0;
if (Avx2.IsSupported && avx)

var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

unsafe

fixed (int* sPtr = source)

// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)

rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);



// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;

// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)

results[y] = source[index[y]];

return result;


static void Main()

// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);

// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);

int[] results = new int[1024];

void GatherLocal(bool avx)

// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: avx; slow loop from from");
for (int i = 0; i < 32; i++)

Console.Write(results[i].ToString("x2"));

Console.WriteLine();

const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for TimeLoop loops: watch.ElapsedMillisecondsms");
Console.WriteLine();

GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);







share|improve this answer















It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER** . This is exposed in .NET Core 3, and should work something like below, which is the single loop scenario (you can probably work from here to get the double-loop version);



results first:



AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms

AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms


code:



using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P

static int Gather(int[] source, int[] index, int[] results, bool avx)
// normally you wouldn't have avx as a parameter; that is just so
// I can turn it off and on for the test; likewise the "int" return
// here is so I can monitor (in the test) how much we did in the "old"
// loop, vs AVX2; in real code this would be void return

int y = 0;
if (Avx2.IsSupported && avx)

var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);

unsafe

fixed (int* sPtr = source)

// note: here I'm assuming we are trying to fill "results" in
// a single outer loop; for a double-loop, you'll probably need
// to slice the spans
for (int i = 0; i < rv.Length; i++)

rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);



// move past everything we've processed via SIMD
y += rv.Length * Vector256<int>.Count;

// now do anything left, which includes anything not aligned to 256 bits,
// plus the "no AVX2" scenario
int result = y;
int end = results.Length; // hoist, since this is not the JIT recognized pattern
for (; y < end; y++)

results[y] = source[index[y]];

return result;


static void Main()

// invent some random data
var rand = new Random(12345);
int size = 1024 * 512;
int[] data = new int[size];
for (int i = 0; i < data.Length; i++)
data[i] = rand.Next(255);

// build a fake index
int[] index = new int[1024];
for (int i = 0; i < index.Length; i++)
index[i] = rand.Next(size);

int[] results = new int[1024];

void GatherLocal(bool avx)

// prove that we're getting the same data
Array.Clear(results, 0, results.Length);
int from = Gather(data, index, results, avx);
Console.WriteLine($"AVX enabled: avx; slow loop from from");
for (int i = 0; i < 32; i++)

Console.Write(results[i].ToString("x2"));

Console.WriteLine();

const int TimeLoop = 1024 * 512;
var watch = Stopwatch.StartNew();
for (int i = 0; i < TimeLoop; i++)
Gather(data, index, results, avx);
watch.Stop();
Console.WriteLine($"for TimeLoop loops: watch.ElapsedMillisecondsms");
Console.WriteLine();

GatherLocal(false);
if (Avx2.IsSupported) GatherLocal(true);








share|improve this answer














share|improve this answer



share|improve this answer








edited 13 hours ago

























answered 14 hours ago









Marc GravellMarc Gravell

822k212 gold badges2224 silver badges2623 bronze badges




822k212 gold badges2224 silver badges2623 bronze badges















  • Awesome! I think that's what I need! I'm gonna give it a go.

    – RobotRock
    14 hours ago






  • 1





    @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

    – Marc Gravell
    14 hours ago











  • Single loop is fine for me too. The double was more legacy than anything else.

    – RobotRock
    14 hours ago











  • @RobotRock note: still working on it, but it is possible that scale should be 4, not 1

    – Marc Gravell
    14 hours ago






  • 1





    @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

    – Marc Gravell
    14 hours ago

















  • Awesome! I think that's what I need! I'm gonna give it a go.

    – RobotRock
    14 hours ago






  • 1





    @RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

    – Marc Gravell
    14 hours ago











  • Single loop is fine for me too. The double was more legacy than anything else.

    – RobotRock
    14 hours ago











  • @RobotRock note: still working on it, but it is possible that scale should be 4, not 1

    – Marc Gravell
    14 hours ago






  • 1





    @RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

    – Marc Gravell
    14 hours ago
















Awesome! I think that's what I need! I'm gonna give it a go.

– RobotRock
14 hours ago





Awesome! I think that's what I need! I'm gonna give it a go.

– RobotRock
14 hours ago




1




1





@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

– Marc Gravell
14 hours ago





@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly

– Marc Gravell
14 hours ago













Single loop is fine for me too. The double was more legacy than anything else.

– RobotRock
14 hours ago





Single loop is fine for me too. The double was more legacy than anything else.

– RobotRock
14 hours ago













@RobotRock note: still working on it, but it is possible that scale should be 4, not 1

– Marc Gravell
14 hours ago





@RobotRock note: still working on it, but it is possible that scale should be 4, not 1

– Marc Gravell
14 hours ago




1




1





@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

– Marc Gravell
14 hours ago





@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer

– Marc Gravell
14 hours ago













-2
















RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be Memory Bound, but that is still plenty fast.



Of course at the given sizes, this array is 6 Million entries in size. That will likely not fit in any cache. And will take forever to itterate over. It does not mater what the speed is, this is simply too much data.



As a general rule, video processing is done on the GPU nowadays. GPU's are literally desinged to operate on giant arrays. Because that is what the Image you are seeing right now is - a giant array.



If you have to keep it on the GPU side, maybe caching or Lazy Initilisation would help? Chances are that you do not truly need every value. You only need to common values. Take a examples from dicerolling: If you roll 2 6-sided dice, every result from 2-12 is possible. But the result 7 happens 6 out of 36 casess. The 2 and 12 only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial then the 2 and 12.






share|improve this answer



























  • while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

    – Marc Gravell
    14 hours ago















-2
















RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be Memory Bound, but that is still plenty fast.



Of course at the given sizes, this array is 6 Million entries in size. That will likely not fit in any cache. And will take forever to itterate over. It does not mater what the speed is, this is simply too much data.



As a general rule, video processing is done on the GPU nowadays. GPU's are literally desinged to operate on giant arrays. Because that is what the Image you are seeing right now is - a giant array.



If you have to keep it on the GPU side, maybe caching or Lazy Initilisation would help? Chances are that you do not truly need every value. You only need to common values. Take a examples from dicerolling: If you roll 2 6-sided dice, every result from 2-12 is possible. But the result 7 happens 6 out of 36 casess. The 2 and 12 only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial then the 2 and 12.






share|improve this answer



























  • while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

    – Marc Gravell
    14 hours ago













-2














-2










-2









RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be Memory Bound, but that is still plenty fast.



Of course at the given sizes, this array is 6 Million entries in size. That will likely not fit in any cache. And will take forever to itterate over. It does not mater what the speed is, this is simply too much data.



As a general rule, video processing is done on the GPU nowadays. GPU's are literally desinged to operate on giant arrays. Because that is what the Image you are seeing right now is - a giant array.



If you have to keep it on the GPU side, maybe caching or Lazy Initilisation would help? Chances are that you do not truly need every value. You only need to common values. Take a examples from dicerolling: If you roll 2 6-sided dice, every result from 2-12 is possible. But the result 7 happens 6 out of 36 casess. The 2 and 12 only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial then the 2 and 12.






share|improve this answer















RAM is already one of the fastest things possible. The only memory faster is the CPU caches. So it will be Memory Bound, but that is still plenty fast.



Of course at the given sizes, this array is 6 Million entries in size. That will likely not fit in any cache. And will take forever to itterate over. It does not mater what the speed is, this is simply too much data.



As a general rule, video processing is done on the GPU nowadays. GPU's are literally desinged to operate on giant arrays. Because that is what the Image you are seeing right now is - a giant array.



If you have to keep it on the GPU side, maybe caching or Lazy Initilisation would help? Chances are that you do not truly need every value. You only need to common values. Take a examples from dicerolling: If you roll 2 6-sided dice, every result from 2-12 is possible. But the result 7 happens 6 out of 36 casess. The 2 and 12 only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial then the 2 and 12.







share|improve this answer














share|improve this answer



share|improve this answer








edited 15 hours ago

























answered 15 hours ago









ChristopherChristopher

4,1282 gold badges9 silver badges25 bronze badges




4,1282 gold badges9 silver badges25 bronze badges















  • while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

    – Marc Gravell
    14 hours ago

















  • while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

    – Marc Gravell
    14 hours ago
















while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

– Marc Gravell
14 hours ago





while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)

– Marc Gravell
14 hours ago


















draft saved

draft discarded















































Thanks for contributing an answer to Stack Overflow!


  • Please be sure to answer the question. Provide details and share your research!

But avoid


  • Asking for help, clarification, or responding to other answers.

  • Making statements based on opinion; back them up with references or personal experience.

To learn more, see our tips on writing great answers.




draft saved


draft discarded














StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f57991432%2fwhat-is-the-fastest-way-to-do-array-table-lookup-with-an-integer-index%23new-answer', 'question_page');

);

Post as a guest















Required, but never shown





















































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown

































Required, but never shown














Required, but never shown












Required, but never shown







Required, but never shown







Popular posts from this blog

Canceling a color specificationRandomly assigning color to Graphics3D objects?Default color for Filling in Mathematica 9Coloring specific elements of sets with a prime modified order in an array plotHow to pick a color differing significantly from the colors already in a given color list?Detection of the text colorColor numbers based on their valueCan color schemes for use with ColorData include opacity specification?My dynamic color schemes

Invision Community Contents History See also References External links Navigation menuProprietaryinvisioncommunity.comIPS Community ForumsIPS Community Forumsthis blog entry"License Changes, IP.Board 3.4, and the Future""Interview -- Matt Mecham of Ibforums""CEO Invision Power Board, Matt Mecham Is a Liar, Thief!"IPB License Explanation 1.3, 1.3.1, 2.0, and 2.1ArchivedSecurity Fixes, Updates And Enhancements For IPB 1.3.1Archived"New Demo Accounts - Invision Power Services"the original"New Default Skin"the original"Invision Power Board 3.0.0 and Applications Released"the original"Archived copy"the original"Perpetual licenses being done away with""Release Notes - Invision Power Services""Introducing: IPS Community Suite 4!"Invision Community Release Notes

199年 目錄 大件事 到箇年出世嗰人 到箇年死嗰人 節慶、風俗習慣 導覽選單