What is the fastest way to do Array Table Lookup with an Integer Index?
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where all the lookups now take 30% of the processing time. I wonder whether it might be slow RAM, but I would still like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress came from shortening the call stack: no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but that seemed to be slower (as I understand it, due to word size).
Would an IntPtr be faster? How would I go about that?
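(For reference, a minimal self-contained sketch of the pointer-based variant the question asks about; this is hypothetical illustration code, not from the post, and assumes the project is compiled with unsafe code enabled. Pinning the arrays with fixed lets the loop run without per-access bounds checks.)

```csharp
using System;

class PointerGatherSketch
{
    // Hypothetical: the same gather loop as in the question, but with
    // pinned pointers so the JIT emits no array bounds checks inside
    // the hot loop.
    static unsafe void Gather(int[] source, int[] lookUp, int[] results)
    {
        fixed (int* src = source, idx = lookUp, dst = results)
        {
            int n = results.Length;
            for (int p = 0; p < n; p++)
                dst[p] = src[idx[p]]; // same lookup, no bounds checks
        }
    }

    static void Main()
    {
        // tiny demo data, standing in for largeArray / lookUp
        int[] source = { 10, 20, 30, 40 };
        int[] lookUp = { 3, 0, 2, 1 };
        int[] results = new int[4];
        Gather(source, lookUp, results);
        Console.WriteLine(string.Join(",", results)); // 40,10,30,20
    }
}
```

Whether this actually beats the safe loop depends on how well the JIT already elides the bounds checks; it is worth measuring rather than assuming.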
Attached below is a screenshot of time distribution:
Tags: c#, memory, lookup
It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?
– 500 - Internal Server Error
15 hours ago
The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations
– RobotRock
15 hours ago
I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.
– RobotRock
15 hours ago
@RobotRock: Off-topic, but what tool are you using to display the time distribution?
– Postlagerkarte
15 hours ago
@Postlagerkarte It's VS2017 Diagnostics Tool
– RobotRock
15 hours ago
edited 29 mins ago by JL2210
asked 15 hours ago by RobotRock
2 Answers
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3, and should work something like below, which is the single-loop scenario (you can probably work from here to get the double-loop version);
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P
{
    static int Gather(int[] source, int[] index, int[] results, bool avx)
    {
        // normally you wouldn't have avx as a parameter; that is just so
        // I can turn it off and on for the test; likewise the "int" return
        // here is so I can monitor (in the test) how much we did in the "old"
        // loop, vs AVX2; in real code this would be void return
        int y = 0;
        if (Avx2.IsSupported && avx)
        {
            var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
            var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
            unsafe
            {
                fixed (int* sPtr = source)
                {
                    // note: here I'm assuming we are trying to fill "results" in
                    // a single outer loop; for a double-loop, you'll probably need
                    // to slice the spans
                    for (int i = 0; i < rv.Length; i++)
                    {
                        rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
                    }
                }
            }
            // move past everything we've processed via SIMD
            y += rv.Length * Vector256<int>.Count;
        }
        // now do anything left, which includes anything not aligned to 256 bits,
        // plus the "no AVX2" scenario
        int result = y;
        int end = results.Length; // hoist, since this is not the JIT recognized pattern
        for (; y < end; y++)
            results[y] = source[index[y]];
        return result;
    }

    static void Main()
    {
        // invent some random data
        var rand = new Random(12345);
        int size = 1024 * 512;
        int[] data = new int[size];
        for (int i = 0; i < data.Length; i++)
            data[i] = rand.Next(255);
        // build a fake index
        int[] index = new int[1024];
        for (int i = 0; i < index.Length; i++)
            index[i] = rand.Next(size);
        int[] results = new int[1024];

        void GatherLocal(bool avx)
        {
            // prove that we're getting the same data
            Array.Clear(results, 0, results.Length);
            int from = Gather(data, index, results, avx);
            Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
            for (int i = 0; i < 32; i++)
                Console.Write(results[i].ToString("x2"));
            Console.WriteLine();

            const int TimeLoop = 1024 * 512;
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < TimeLoop; i++)
                Gather(data, index, results, avx);
            watch.Stop();
            Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
            Console.WriteLine();
        }

        GatherLocal(false);
        if (Avx2.IsSupported) GatherLocal(true);
    }
}
Awesome! I think that's what I need! I'm gonna give it a go.
– RobotRock
14 hours ago
@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly
– Marc Gravell♦
14 hours ago
Single loop is fine for me too. The double was more legacy than anything else.
– RobotRock
14 hours ago
@RobotRock note: still working on it, but it is possible that scale should be 4, not 1
– Marc Gravell♦
14 hours ago
@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer
– Marc Gravell♦
14 hours ago
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So this will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and will take a long time to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value; you mostly need the common values. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but the result 7 happens in 6 out of 36 cases, while 2 and 12 each happen in only 1 out of 36. So having the 7 stored is a lot more beneficial than the 2 and 12.
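(The lazy-initialisation idea above can be sketched as follows; this is hypothetical illustration code, not from the answer, and assumes the computed values are non-negative so that -1 can serve as an "unset" sentinel.)

```csharp
using System;

class LazyLookupSketch
{
    // Hypothetical sketch: compute each table entry only the first time
    // it is requested, instead of precomputing the whole table up front.
    const int Unset = -1; // assumes real values are never negative
    readonly int[] table;
    readonly Func<int, int> compute;

    public LazyLookupSketch(int size, Func<int, int> compute)
    {
        table = new int[size];
        Array.Fill(table, Unset);
        this.compute = compute;
    }

    public int Get(int i)
    {
        int v = table[i];
        if (v == Unset)
            table[i] = v = compute(i); // pay the cost once, on first use
        return v;
    }

    static void Main()
    {
        var lut = new LazyLookupSketch(100, i => i * i);
        Console.WriteLine(lut.Get(7)); // computed now
        Console.WriteLine(lut.Get(7)); // served from the table
    }
}
```

Whether this helps depends on the access distribution: it only wins when a large fraction of the table is never touched.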
while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)
– Marc Gravell♦
14 hours ago
The gather answer above was answered 14 hours ago and edited 13 hours ago by Marc Gravell♦.
Awesome! I think that's what I need! I'm gonna give it a go.
– RobotRock
14 hours ago
1
@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly
– Marc Gravell♦
14 hours ago
Single loop is fine for me too. The double was more legacy than anything else.
– RobotRock
14 hours ago
@RobotRock note: still working on it, but it is possible that scale should be 4, not 1
– Marc Gravell♦
14 hours ago
@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer
– Marc Gravell♦
14 hours ago
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So the lookup will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and it will take a long time to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but 7 comes up in 6 out of 36 cases, while 2 and 12 come up in only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial than the 2 and 12.
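A minimal sketch of the lazy-initialisation idea above (all names hypothetical): compute a lookup value only on first access and cache it, so frequently-requested entries like the "7" pay the computation cost once:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: instead of precomputing all 6 million entries,
// cache only the values actually requested.
class LazyLookup
{
    private readonly Dictionary<int, int> cache = new Dictionary<int, int>();
    private readonly Func<int, int> compute;

    public LazyLookup(Func<int, int> compute) => this.compute = compute;

    public int this[int index]
    {
        get
        {
            if (!cache.TryGetValue(index, out int value))
                cache[index] = value = compute(index); // first access: compute and remember
            return value;
        }
    }
}
```

Whether this wins depends on how skewed the access pattern really is; for uniformly distributed indexes the dictionary overhead would likely make it slower than the flat array.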
while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)
– Marc Gravell♦
14 hours ago
edited 15 hours ago
answered 15 hours ago
Christopher
4,128 2 gold badges 9 silver badges 25 bronze badges
It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?
– 500 - Internal Server Error
15 hours ago
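The single-loop suggestion in the comment above can be sketched as follows (names such as `lookUp` follow the question; the row-major layout is an assumption):

```csharp
static class SingleLoopSketch
{
    // Collapses the nested y/x loops into one loop over height * width,
    // assuming source values and a row-major lookUp of that same length.
    public static int[] Lookup(int[] source, int[] lookUp, int height, int width)
    {
        var results = new int[height * width];
        for (int i = 0; i < results.Length; i++)
            results[i] = source[lookUp[i]];
        return results;
    }
}
```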
The indexes are distributed from equirectangular to cubic mapping (and back). A lookup table seemed the fastest approach, as the alternative was performing a lot of multiplications and cos/sin calculations.
– RobotRock
15 hours ago
I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.
– RobotRock
15 hours ago
@RobotRock: Off-topic, but what tool are you using to display the time distribution?
– Postlagerkarte
15 hours ago
@Postlagerkarte It's VS2017 Diagnostics Tool
– RobotRock
15 hours ago