IEEE 754 square root with Newton-RaphsonSqrt (square root) functionFloating point representation in IEEE 754 formatApproximating the square root using an iterative methodC++ implementation of Java's floatToIntBits() and intBitsToFloat()Newton's square rootSquare Root CalculatorFunctional abstraction to find nth root of a number - Newton raphsonUtility Method to find the Square root of a numberInteger square rootVisualizing Newton-Raphson method for finding zeroes of a function

Smallest PRIME containing the first 11 primes as sub-strings

Why is STARTTLS used when it can be downgraded very easily?

Beyond Futuristic Technology for an Alien Warship?

Isn't the detector always measuring, and thus always collapsing the state?

Implementation of a Thread Pool in C++

A word that refers to saying something in an attempt to anger or embarrass someone into doing something that they don’t want to do?

Would allowing versatile weapons wielded in two hands to benefit from Dueling be unbalanced?

My machine, client installed VPN,

Why, even after his imprisonment, do people keep calling Hannibal Lecter "Doctor"?

What are one's options when facing religious discrimination at the airport?

Can you cure a Gorgon's Petrifying Breath before it finishes turning a target to stone?

Can the President of the US limit First Amendment rights?

How to add the real hostname in the beginning of Linux cli command

Delete n lines skip 1 line script

Phonetic distortion when words are borrowed among languages

Knights and Knaves: What does C say?

Is it mandatory to use contractions in tag questions and the like?

Can I pay some of the cost of an activated ability lots of times to get more out of the effect?

Can adverbs modify adjectives?

What would happen if I build a half bath without permits?

Garage door sticks on a bolt

Top off gas with old oil, is that bad?

Avoiding dust scattering when you drill

Is it possible to take a database offline when doing a backup using an SQL job?

IEEE 754 square root with Newton-Raphson

Sqrt (square root) functionFloating point representation in IEEE 754 formatApproximating the square root using an iterative methodC++ implementation of Java's floatToIntBits() and intBitsToFloat()Newton's square rootSquare Root CalculatorFunctional abstraction to find nth root of a number - Newton raphsonUtility Method to find the Square root of a numberInteger square rootVisualizing Newton-Raphson method for finding zeroes of a function

.everyoneloves__top-leaderboard:empty,.everyoneloves__mid-leaderboard:empty,.everyoneloves__bot-mid-leaderboard:empty margin-bottom:0;

I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:

decompose.h:

#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#ifdef __cplusplus
extern "C" 
#endif

#include <float.h>

#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name

typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#ifdef __cplusplus

#endif

#endif

decompose_generic.h:

#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1

#ifdef __cplusplus
extern "C" 
#endif

#define SIGN_SIZE 1

union decompose 
 TYPE f;
 struct decompose_s 
 mantissa m:MANTISSA_SIZE;
 exponent e:EXPONENT_SIZE;
 sign s:SIGN_SIZE;
 __attribute__((packed)) o;
;

static inline union decompose decompose(TYPE f)

 union decompose result;
 result.f = f;
 return result;


#define is_nan(x) NAME_SFX(isnan)(x)

#ifdef __cplusplus

#endif

#endif

sqrt.c:

#include <math.h>
#include <errno.h>

#include "decompose.h"

TYPE NAME_SFX(sqrt)(TYPE a)
 (r.o.e >> 1);
 r.o.e += EXPONENT_BIAS;

 /* Newton-Raphson */
 while(result_old2 != r.f)
 
 if(c % 2 == 0) result_old2 = r.f;
 r.f -= (r.f*r.f-a)/(2*r.f);
 c++;
 

 return r.f;

What I'm looking for:

Are there any efficiency problems?

Are there any accuracy problems?

Are there any other miscellaneous things that I can improve?

As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.

One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.

I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

4

$begingroup$
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
$endgroup$
– Toby Speight
7 hours ago

add a comment
|

I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:

decompose.h:

#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#ifdef __cplusplus
extern "C" 
#endif

#include <float.h>

#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name

typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#ifdef __cplusplus

#endif

#endif

decompose_generic.h:

#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1

#ifdef __cplusplus
extern "C" 
#endif

#define SIGN_SIZE 1

union decompose 
 TYPE f;
 struct decompose_s 
 mantissa m:MANTISSA_SIZE;
 exponent e:EXPONENT_SIZE;
 sign s:SIGN_SIZE;
 __attribute__((packed)) o;
;

static inline union decompose decompose(TYPE f)

 union decompose result;
 result.f = f;
 return result;


#define is_nan(x) NAME_SFX(isnan)(x)

#ifdef __cplusplus

#endif

#endif

sqrt.c:

#include <math.h>
#include <errno.h>

#include "decompose.h"

TYPE NAME_SFX(sqrt)(TYPE a)
 (r.o.e >> 1);
 r.o.e += EXPONENT_BIAS;

 /* Newton-Raphson */
 while(result_old2 != r.f)
 
 if(c % 2 == 0) result_old2 = r.f;
 r.f -= (r.f*r.f-a)/(2*r.f);
 c++;
 

 return r.f;

What I'm looking for:

Are there any efficiency problems?

Are there any accuracy problems?

Are there any other miscellaneous things that I can improve?

As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.

One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.

I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

4

$begingroup$
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
$endgroup$
– Toby Speight
7 hours ago

add a comment
|

I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:

decompose.h:

#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#ifdef __cplusplus
extern "C" 
#endif

#include <float.h>

#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name

typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#ifdef __cplusplus

#endif

#endif

decompose_generic.h:

#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1

#ifdef __cplusplus
extern "C" 
#endif

#define SIGN_SIZE 1

union decompose 
 TYPE f;
 struct decompose_s 
 mantissa m:MANTISSA_SIZE;
 exponent e:EXPONENT_SIZE;
 sign s:SIGN_SIZE;
 __attribute__((packed)) o;
;

static inline union decompose decompose(TYPE f)

 union decompose result;
 result.f = f;
 return result;


#define is_nan(x) NAME_SFX(isnan)(x)

#ifdef __cplusplus

#endif

#endif

sqrt.c:

#include <math.h>
#include <errno.h>

#include "decompose.h"

TYPE NAME_SFX(sqrt)(TYPE a)
 (r.o.e >> 1);
 r.o.e += EXPONENT_BIAS;

 /* Newton-Raphson */
 while(result_old2 != r.f)
 
 if(c % 2 == 0) result_old2 = r.f;
 r.f -= (r.f*r.f-a)/(2*r.f);
 c++;
 

 return r.f;

What I'm looking for:

Are there any efficiency problems?

Are there any accuracy problems?

Are there any other miscellaneous things that I can improve?

As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.

One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.

I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:

decompose.h:

#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#ifdef __cplusplus
extern "C" 
#endif

#include <float.h>

#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name

typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#ifdef __cplusplus

#endif

#endif

decompose_generic.h:

#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1

#ifdef __cplusplus
extern "C" 
#endif

#define SIGN_SIZE 1

union decompose 
 TYPE f;
 struct decompose_s 
 mantissa m:MANTISSA_SIZE;
 exponent e:EXPONENT_SIZE;
 sign s:SIGN_SIZE;
 __attribute__((packed)) o;
;

static inline union decompose decompose(TYPE f)

 union decompose result;
 result.f = f;
 return result;


#define is_nan(x) NAME_SFX(isnan)(x)

#ifdef __cplusplus

#endif

#endif

sqrt.c:

#include <math.h>
#include <errno.h>

#include "decompose.h"

TYPE NAME_SFX(sqrt)(TYPE a)
 (r.o.e >> 1);
 r.o.e += EXPONENT_BIAS;

 /* Newton-Raphson */
 while(result_old2 != r.f)
 
 if(c % 2 == 0) result_old2 = r.f;
 r.f -= (r.f*r.f-a)/(2*r.f);
 c++;
 

 return r.f;

What I'm looking for:

Are there any efficiency problems?

Are there any accuracy problems?

Are there any other miscellaneous things that I can improve?

As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.

One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.

I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.

c reinventing-the-wheel numerical-methods floating-point

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

edited 7 hours ago

asked 9 hours ago

JL2210

38812 bronze badges

asked 9 hours ago

JL2210

38812 bronze badges

asked 9 hours ago

JL2210

38812 bronze badges

4

$begingroup$
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
$endgroup$
– Toby Speight
7 hours ago

add a comment
|

4

$begingroup$
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
$endgroup$
– Toby Speight
7 hours ago

It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.

– Toby Speight
7 hours ago

add a comment
|

3 Answers
3

active

oldest

votes

Initial approximation

Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:

uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias

If you change the order of operations, you can handle the biasing in a single constant:

uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias

The constant can then be simplified to:

uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above

Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):

original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
 0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
 1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
 2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
 3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )

With your code, the mapping is (also linear mapping within intervals):

original number is exponent orig interval
exponent in interval in sqrt maps to
 0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
 1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
 2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
 3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )

My mapping is far more accurate and yours isn't even monotonic.

Newton-Raphson

Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.

The entire code could look like this:

float sqrtf(float x)

 float y = x;
 // Approximation
 uint32_t* i = (uint32_t*)&x;
 *i = (*i >> 1) + (127 << 22);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;


double sqrt(double x)

 double y = x;
 // Approximation
 uint64_t* i = (uint64_t*)&x;
 *i = (*i >> 1) + (1023 << 51);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago

1

$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago

$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago

1

$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago

$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
$endgroup$
– JL2210
4 hours ago

|
show 4 more comments

Modular odd check

The compiler is likely to do this anyway, but can't

if(c % 2 == 0)

if (!(c & 1))

Factoring

Isn't

r.f -= (r.f*r.f-a)/(2*r.f);

equivalent to:

r.f = (r.f + a/r.f)/2?

Optimization

Depending on which compiler you're using and what flags you pass it, some of your attempted math risks modification by the optimizer, and in an implementation so closely coupled to the hardware, this might matter. Have you checked your produced assembly? Can you show us your build flags?

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

2

$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago

2

$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago

1

$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago

|
show 6 more comments

I don't see why you define this constant yourself:

#define MANTISSA_SIZE 52

Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).

I think there's danger of integer overflow here:

/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;

We'd be better extracting into a big enough temporary, and applying the exponent bias to that:

int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;

It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago

$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago

add a comment
|

Your Answer

StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");

StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);

else
createEditor();

);

function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);

);

draft saved

draft discarded

StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f229573%2fieee-754-square-root-with-newton-raphson%23new-answer', 'question_page');

);

Post as a guest

Name

Required, but never shown

3 Answers
3

active

oldest

votes

3 Answers
3

active

oldest

votes

Initial approximation

Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:

uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias

If you change the order of operations, you can handle the biasing in a single constant:

uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias

The constant can then be simplified to:

uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above

original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
 0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
 1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
 2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
 3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )

With your code, the mapping is (also linear mapping within intervals):

original number is exponent orig interval
exponent in interval in sqrt maps to
 0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
 1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
 2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
 3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )

My mapping is far more accurate and yours isn't even monotonic.

Newton-Raphson

The entire code could look like this:

float sqrtf(float x)

 float y = x;
 // Approximation
 uint32_t* i = (uint32_t*)&x;
 *i = (*i >> 1) + (127 << 22);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;


double sqrt(double x)

 double y = x;
 // Approximation
 uint64_t* i = (uint64_t*)&x;
 *i = (*i >> 1) + (1023 << 51);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago

1

$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago

$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago

1

$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago

$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
$endgroup$
– JL2210
4 hours ago

|
show 4 more comments

Initial approximation

Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:

uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias

If you change the order of operations, you can handle the biasing in a single constant:

uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias

The constant can then be simplified to:

uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above

original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
 0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
 1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
 2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
 3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )

With your code, the mapping is (also linear mapping within intervals):

original number is exponent orig interval
exponent in interval in sqrt maps to
 0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
 1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
 2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
 3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )

My mapping is far more accurate and yours isn't even monotonic.

Newton-Raphson

The entire code could look like this:

float sqrtf(float x)

 float y = x;
 // Approximation
 uint32_t* i = (uint32_t*)&x;
 *i = (*i >> 1) + (127 << 22);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;


double sqrt(double x)

 double y = x;
 // Approximation
 uint64_t* i = (uint64_t*)&x;
 *i = (*i >> 1) + (1023 << 51);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago

1

$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago

$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago

1

$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago

$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
$endgroup$
– JL2210
4 hours ago

|
show 4 more comments

Initial approximation

Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:

uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias

If you change the order of operations, you can handle the biasing in a single constant:

uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias

The constant can then be simplified to:

uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above

original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
 0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
 1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
 2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
 3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )

With your code, the mapping is (also linear mapping within intervals):

original number is exponent orig interval
exponent in interval in sqrt maps to
 0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
 1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
 2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
 3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )

My mapping is far more accurate and yours isn't even monotonic.

Newton-Raphson

The entire code could look like this:

float sqrtf(float x)

 float y = x;
 // Approximation
 uint32_t* i = (uint32_t*)&x;
 *i = (*i >> 1) + (127 << 22);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;


double sqrt(double x)

 double y = x;
 // Approximation
 uint64_t* i = (uint64_t*)&x;
 *i = (*i >> 1) + (1023 << 51);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

Initial approximation

Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:

uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias

If you change the order of operations, you can handle the biasing in a single constant:

uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias

The constant can then be simplified to:

uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above

original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
 0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
 1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
 2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
 3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )

With your code, the mapping is (also linear mapping within intervals):

original number is exponent orig interval
exponent in interval in sqrt maps to
 0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
 1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
 2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
 3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
 4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )

My mapping is far more accurate and yours isn't even monotonic.

Newton-Raphson

The entire code could look like this:

float sqrtf(float x)

 float y = x;
 // Approximation
 uint32_t* i = (uint32_t*)&x;
 *i = (*i >> 1) + (127 << 22);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;


double sqrt(double x)

 double y = x;
 // Approximation
 uint64_t* i = (uint64_t*)&x;
 *i = (*i >> 1) + (1023 << 51);
 // Newton-Raphson
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 x = (x + y/x) / 2;
 return x;

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

edited 4 hours ago

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

answered 7 hours ago

Rainer P.

1,4265 silver badges14 bronze badges

$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago

1

$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago

$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago

1

$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago

$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
$endgroup$
– JL2210
4 hours ago

|
show 4 more comments

$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago

1

$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago

$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago

1

$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago

$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
$endgroup$
– JL2210
4 hours ago

This won't work for 80-bit long double, which can't be represented as an integer.

– JL2210
6 hours ago

Also note that this violates strict aliasing and is undefined behavior.

– JL2210
4 hours ago

How many bits of accuracy does the current implementation provide?

– JL2210
4 hours ago

Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210

– Rainer P.
4 hours ago

How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?

– JL2210
4 hours ago

|
show 4 more comments

Modular odd check

The compiler is likely to do this anyway, but can't

if(c % 2 == 0)

if (!(c & 1))

Factoring

Isn't

r.f -= (r.f*r.f-a)/(2*r.f);

equivalent to:

r.f = (r.f + a/r.f)/2?

Optimization

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

2

$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago

2

$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago

1

$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago

|
show 6 more comments

Modular odd check

The compiler is likely to do this anyway, but can't

if(c % 2 == 0)

if (!(c & 1))

Factoring

Isn't

r.f -= (r.f*r.f-a)/(2*r.f);

equivalent to:

r.f = (r.f + a/r.f)/2?

Optimization

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

2

$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago

2

$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago

1

$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago

|
show 6 more comments

Modular odd check

The compiler is likely to do this anyway, but can't

if(c % 2 == 0)

if (!(c & 1))

Factoring

Isn't

r.f -= (r.f*r.f-a)/(2*r.f);

equivalent to:

r.f = (r.f + a/r.f)/2?

Optimization

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

Modular odd check

The compiler is likely to do this anyway, but can't

if(c % 2 == 0)

if (!(c & 1))

Factoring

Isn't

r.f -= (r.f*r.f-a)/(2*r.f);

equivalent to:

r.f = (r.f + a/r.f)/2?

Optimization

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

edited 8 hours ago

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

answered 8 hours ago

Reinderien

9,89016 silver badges41 bronze badges

2

$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago

2

$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago

1

$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago

|
show 6 more comments

2

$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago

2

$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago

1

$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago

1

$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago

Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.

– Reinderien
8 hours ago

@Reinderien Thanks. I've added a clarification.

– JL2210
8 hours ago

Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.

– Reinderien
8 hours ago

The second has two divisions while the first has one.

– JL2210
8 hours ago

Produced assembly seems fine, GCC with optimization levels three and size.

– JL2210
8 hours ago

|
show 6 more comments

I don't see why you define this constant yourself:

#define MANTISSA_SIZE 52

Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).

I think there's danger of integer overflow here:

/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;

We'd be better extracting into a big enough temporary, and applying the exponent bias to that:

int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;

It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago

$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago

add a comment
|

I don't see why you define this constant yourself:

#define MANTISSA_SIZE 52

Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).

I think there's danger of integer overflow here:

/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;

We'd be better extracting into a big enough temporary, and applying the exponent bias to that:

int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;

It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago

$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago

add a comment
|

I don't see why you define this constant yourself:

#define MANTISSA_SIZE 52

Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).

I think there's danger of integer overflow here:

/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;

We'd be better extracting into a big enough temporary, and applying the exponent bias to that:

int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;

It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

I don't see why you define this constant yourself:

#define MANTISSA_SIZE 52

Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).

I think there's danger of integer overflow here:

/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;

We'd be better extracting into a big enough temporary, and applying the exponent bias to that:

int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;

It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

edited 7 hours ago

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

answered 7 hours ago

Toby Speight

32k7 gold badges46 silver badges137 bronze badges

$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago

$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago

add a comment
|

$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago

$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago

Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.

– JL2210
7 hours ago

I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.

– JL2210
6 hours ago

add a comment
|

draft saved

draft discarded

Thanks for contributing an answer to Code Review Stack Exchange!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

UfBhmCAQCZRQw44w4dEYag71YRUq fEx8aR0b C9m

搜尋此網誌

Mfcttrf

3 Answers
3

Initial approximation

Newton-Raphson

Modular odd check

Factoring

Optimization

Your Answer

Post as a guest

3 Answers
3

3 Answers
3

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Post as a guest

Popular posts from this blog

199年目錄大件事到箇年出世嗰人到箇年死嗰人節慶、風俗習慣導覽選單

3 Answers 3

Initial approximation

Newton-Raphson

Modular odd check

Factoring

Optimization

Your Answer

Sign up or log in

Post as a guest

Post as a guest

3 Answers 3

3 Answers 3

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Initial approximation

Newton-Raphson

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Modular odd check

Factoring

Optimization

Sign up or log in

Post as a guest

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Sign up or log in

Post as a guest

Popular posts from this blog

199年 目錄 大件事 到箇年出世嗰人 到箇年死嗰人 節慶、風俗習慣 導覽選單

3 Answers
3

3 Answers
3

3 Answers
3

199年目錄大件事到箇年出世嗰人到箇年死嗰人節慶、風俗習慣導覽選單