What is the fastest way to do Array Table Lookup with an Integer Index?
I have a video processing application that moves a lot of data.
To speed things up, I have made a lookup table, as many calculations in essence only need to be calculated one time and can be reused.
However, I'm at the point where all the lookups now take 30% of the processing time. I wonder whether it might be slow RAM, but I would still like to try to optimize it some more.
Currently I have the following:
public readonly int[] largeArray = new int[3000*2000];
public readonly int[] lookUp = new int[width*height];
I then perform a lookup with a pointer p (which is equivalent to width * y + x) to fetch the result.
int[] newResults = new int[width*height];
int p = 0;
for (int y = 0; y < height; y++)
for (int x = 0; x < width; x++, p++)
newResults[p] = largeArray[lookUp[p]];
Note that I cannot do an entire array copy to optimize. Also, the application is heavily multithreaded.
Some progress came from shortening the call stack: no getters, just a straight retrieval from a readonly array.
I've tried converting to ushort as well, but that seemed to be slower (as I understand it, due to word size).
Would an IntPtr be faster? How would I go about that?
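(For reference, a minimal self-contained sketch of the pointer-based variant the question asks about; this is hypothetical illustration code, not from the post, and assumes the project is compiled with unsafe code enabled. Pinning the arrays with fixed lets the loop run without per-access bounds checks.)

```csharp
using System;

class PointerGatherSketch
{
    // Hypothetical: the same gather loop as in the question, but with
    // pinned pointers so the JIT emits no array bounds checks inside
    // the hot loop.
    static unsafe void Gather(int[] source, int[] lookUp, int[] results)
    {
        fixed (int* src = source, idx = lookUp, dst = results)
        {
            int n = results.Length;
            for (int p = 0; p < n; p++)
                dst[p] = src[idx[p]]; // same lookup, no bounds checks
        }
    }

    static void Main()
    {
        // tiny demo data, standing in for largeArray / lookUp
        int[] source = { 10, 20, 30, 40 };
        int[] lookUp = { 3, 0, 2, 1 };
        int[] results = new int[4];
        Gather(source, lookUp, results);
        Console.WriteLine(string.Join(",", results)); // 40,10,30,20
    }
}
```

Whether this actually beats the safe loop depends on how well the JIT already elides the bounds checks; it is worth measuring rather than assuming.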
Attached below is a screenshot of time distribution:
Tags: c#, memory, lookup
It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?
– 500 - Internal Server Error
15 hours ago
The indexes are distributed from equirectangular to cubic mapping (and back). A lookup would seem to be the fastest approach, as the alternative was performing a lot of multiplications, cos, sin calculations
– RobotRock
15 hours ago
I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.
– RobotRock
15 hours ago
@RobotRock: Off-topic, but what tool are you using to display the time distribution?
– Postlagerkarte
15 hours ago
@Postlagerkarte It's VS2017 Diagnostics Tool
– RobotRock
15 hours ago
edited 29 mins ago by JL2210
asked 15 hours ago by RobotRock
2 Answers
It looks like what you're doing here is effectively a "gather". Modern CPUs have dedicated instructions for this, in particular VPGATHER**. This is exposed in .NET Core 3, and should work something like below, which is the single-loop scenario (you can probably work from here to get the double-loop version);
results first:
AVX enabled: False; slow loop from 0
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 1524ms
AVX enabled: True; slow loop from 1024
e7ad04457529f201558c8a53f639fed30d3a880f75e613afe203e80a7317d0cb
for 524288 loops: 667ms
code:
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

static class P
{
    static int Gather(int[] source, int[] index, int[] results, bool avx)
    {
        // normally you wouldn't have avx as a parameter; that is just so
        // I can turn it off and on for the test; likewise the "int" return
        // here is so I can monitor (in the test) how much we did in the "old"
        // loop, vs AVX2; in real code this would be void return
        int y = 0;
        if (Avx2.IsSupported && avx)
        {
            var iv = MemoryMarshal.Cast<int, Vector256<int>>(index);
            var rv = MemoryMarshal.Cast<int, Vector256<int>>(results);
            unsafe
            {
                fixed (int* sPtr = source)
                {
                    // note: here I'm assuming we are trying to fill "results" in
                    // a single outer loop; for a double-loop, you'll probably need
                    // to slice the spans
                    for (int i = 0; i < rv.Length; i++)
                    {
                        rv[i] = Avx2.GatherVector256(sPtr, iv[i], 4);
                    }
                }
            }
            // move past everything we've processed via SIMD
            y += rv.Length * Vector256<int>.Count;
        }
        // now do anything left, which includes anything not aligned to 256 bits,
        // plus the "no AVX2" scenario
        int result = y;
        int end = results.Length; // hoist, since this is not the JIT recognized pattern
        for (; y < end; y++)
            results[y] = source[index[y]];
        return result;
    }

    static void Main()
    {
        // invent some random data
        var rand = new Random(12345);
        int size = 1024 * 512;
        int[] data = new int[size];
        for (int i = 0; i < data.Length; i++)
            data[i] = rand.Next(255);
        // build a fake index
        int[] index = new int[1024];
        for (int i = 0; i < index.Length; i++)
            index[i] = rand.Next(size);
        int[] results = new int[1024];

        void GatherLocal(bool avx)
        {
            // prove that we're getting the same data
            Array.Clear(results, 0, results.Length);
            int from = Gather(data, index, results, avx);
            Console.WriteLine($"AVX enabled: {avx}; slow loop from {from}");
            for (int i = 0; i < 32; i++)
                Console.Write(results[i].ToString("x2"));
            Console.WriteLine();

            const int TimeLoop = 1024 * 512;
            var watch = Stopwatch.StartNew();
            for (int i = 0; i < TimeLoop; i++)
                Gather(data, index, results, avx);
            watch.Stop();
            Console.WriteLine($"for {TimeLoop} loops: {watch.ElapsedMilliseconds}ms");
            Console.WriteLine();
        }

        GatherLocal(false);
        if (Avx2.IsSupported) GatherLocal(true);
    }
}
Awesome! I think that's what I need! I'm gonna give it a go.
– RobotRock
14 hours ago
@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly
– Marc Gravell♦
14 hours ago
Single loop is fine for me too. The double was more legacy than anything else.
– RobotRock
14 hours ago
@RobotRock note: still working on it, but it is possible that scale should be 4, not 1
– Marc Gravell♦
14 hours ago
@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer
– Marc Gravell♦
14 hours ago
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So this will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and will take a long time to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value; you mostly need the common values. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but the result 7 happens in 6 out of 36 cases, while 2 and 12 each happen in only 1 out of 36. So having the 7 stored is a lot more beneficial than the 2 and 12.
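(The lazy-initialisation idea above can be sketched as follows; this is hypothetical illustration code, not from the answer, and assumes the computed values are non-negative so that -1 can serve as an "unset" sentinel.)

```csharp
using System;

class LazyLookupSketch
{
    // Hypothetical sketch: compute each table entry only the first time
    // it is requested, instead of precomputing the whole table up front.
    const int Unset = -1; // assumes real values are never negative
    readonly int[] table;
    readonly Func<int, int> compute;

    public LazyLookupSketch(int size, Func<int, int> compute)
    {
        table = new int[size];
        Array.Fill(table, Unset);
        this.compute = compute;
    }

    public int Get(int i)
    {
        int v = table[i];
        if (v == Unset)
            table[i] = v = compute(i); // pay the cost once, on first use
        return v;
    }

    static void Main()
    {
        var lut = new LazyLookupSketch(100, i => i * i);
        Console.WriteLine(lut.Get(7)); // computed now
        Console.WriteLine(lut.Get(7)); // served from the table
    }
}
```

Whether this helps depends on the access distribution: it only wins when a large fraction of the table is never touched.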
while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)
– Marc Gravell♦
14 hours ago
The gather answer above was answered 14 hours ago and edited 13 hours ago by Marc Gravell♦.
Awesome! I think that's what I need! I'm gonna give it a go.
– RobotRock
14 hours ago
1
@RobotRock great! just note that your double-loop is a complication; the above is focused on the single loop scenario, but the principles are the same; you just need to slice your spans correctly
– Marc Gravell♦
14 hours ago
Single loop is fine for me too. The double was more legacy than anything else.
– RobotRock
14 hours ago
@RobotRock note: still working on it, but it is possible that scale should be 4, not 1
– Marc Gravell♦
14 hours ago
@RobotRock awesome; goes from 1534ms (no AVX2) to 698ms (using AVX2); updating answer
– Marc Gravell♦
14 hours ago
RAM is already one of the fastest things possible; the only memory faster is the CPU caches. So the lookup will be memory-bound, but that is still plenty fast.
Of course, at the given sizes this array is 6 million entries. That will likely not fit in any cache, and it will take a long time to iterate over. It does not matter what the speed is; this is simply too much data.
As a general rule, video processing is done on the GPU nowadays. GPUs are literally designed to operate on giant arrays, because that is what the image you are seeing right now is: a giant array.
If you have to keep it on the CPU side, maybe caching or lazy initialisation would help? Chances are that you do not truly need every value, only the common ones. Take an example from dice rolling: if you roll two 6-sided dice, every result from 2 to 12 is possible, but 7 comes up in 6 out of 36 cases, while 2 and 12 come up in only 1 out of 36 cases each. So having the 7 stored is a lot more beneficial than the 2 and 12.
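A minimal sketch of the lazy-initialisation idea above (all names hypothetical): compute a lookup value only on first access and cache it, so frequently-requested entries like the "7" pay the computation cost once:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: instead of precomputing all 6 million entries,
// cache only the values actually requested.
class LazyLookup
{
    private readonly Dictionary<int, int> cache = new Dictionary<int, int>();
    private readonly Func<int, int> compute;

    public LazyLookup(Func<int, int> compute) => this.compute = compute;

    public int this[int index]
    {
        get
        {
            if (!cache.TryGetValue(index, out int value))
                cache[index] = value = compute(index); // first access: compute and remember
            return value;
        }
    }
}
```

Whether this wins depends on how skewed the access pattern really is; for uniformly distributed indexes the dictionary overhead would likely make it slower than the flat array.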
while a GPU might be nice here, modern CPUs also have operations specifically to support this scenario - this is a "gather" (see answer above)
– Marc Gravell♦
14 hours ago
edited 15 hours ago
answered 15 hours ago
Christopher
4,128 2 gold badges 9 silver badges 25 bronze badges
It'll be slightly faster to only have one for that goes to height*width, but other than that it's hard to see any obvious optimization opportunities. Is there any pattern to how the indexes in lookUp are distributed?
– 500 - Internal Server Error
15 hours ago
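The single-loop suggestion in the comment above can be sketched as follows (names such as `lookUp` follow the question; the row-major layout is an assumption):

```csharp
static class SingleLoopSketch
{
    // Collapses the nested y/x loops into one loop over height * width,
    // assuming source values and a row-major lookUp of that same length.
    public static int[] Lookup(int[] source, int[] lookUp, int height, int width)
    {
        var results = new int[height * width];
        for (int i = 0; i < results.Length; i++)
            results[i] = source[lookUp[i]];
        return results;
    }
}
```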
The indexes are distributed from equirectangular to cubic mapping (and back). A lookup table seemed the fastest approach, as the alternative was performing a lot of multiplications and cos/sin calculations.
– RobotRock
15 hours ago
I've converted to a single loop, but it isn't the issue. I've added a screenshot with a more elaborate view of the code with timing from the profiler.
– RobotRock
15 hours ago
@RobotRock: Off-topic, but what tool are you using to display the time distribution?
– Postlagerkarte
15 hours ago
@Postlagerkarte It's VS2017 Diagnostics Tool
– RobotRock
15 hours ago