IEEE 754 square root with Newton-Raphson


I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:



decompose.h:



#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#ifdef __cplusplus
extern "C" {
#endif

#include <float.h>

#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name

typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#ifdef __cplusplus
}
#endif

#endif


decompose_generic.h:



#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1

#ifdef __cplusplus
extern "C" {
#endif

#define SIGN_SIZE 1

union decompose
{
    TYPE f;
    struct decompose_s
    {
        mantissa m:MANTISSA_SIZE;
        exponent e:EXPONENT_SIZE;
        sign s:SIGN_SIZE;
    } __attribute__((packed)) o;
};

static inline union decompose decompose(TYPE f)
{
    union decompose result;
    result.f = f;
    return result;
}


#define is_nan(x) NAME_SFX(isnan)(x)

#ifdef __cplusplus
}
#endif

#endif


sqrt.c:



#include <math.h>
#include <errno.h>

#include "decompose.h"

TYPE NAME_SFX(sqrt)(TYPE a)
{
    union decompose r = decompose(a);
    TYPE result_old2 = 0;
    int c = 0;

    /* Divide the exponent by 2 */
    r.o.e -= EXPONENT_BIAS;
    r.o.e = (r.o.e & 1) | (r.o.e >> 1);
    r.o.e += EXPONENT_BIAS;

    /* Newton-Raphson */
    while(result_old2 != r.f)
    {
        if(c % 2 == 0) result_old2 = r.f;
        r.f -= (r.f*r.f-a)/(2*r.f);
        c++;
    }

    return r.f;
}



What I'm looking for:



  • Are there any efficiency problems?


  • Are there any accuracy problems?


  • Are there any other miscellaneous things that I can improve?


As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.



One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.
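For illustration, the float variant might look something like the sketch below. This is hypothetical, not the actual file: it assumes IEEE 754 binary32 and the same extern "C" guards and GCC packed-bit-field conventions as the double version above.

/* Hypothetical sketch of a float decompose.h (binary32 layout assumed) */
#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1

#include <float.h>

#define MANTISSA_SIZE 23
#define EXPONENT_SIZE 8
#define EXPONENT_BIAS 0x7f
#define TYPE float
#define TYPE_MIN FLT_MIN
#define TYPE_MAX FLT_MAX
#define NAME_SFX(name) name##f   /* sqrt -> sqrtf */

typedef unsigned int mantissa;
typedef unsigned char exponent;
typedef unsigned char sign;

#include "decompose_generic.h"

#endif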



I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.










c reinventing-the-wheel numerical-methods floating-point

asked 9 hours ago, edited 7 hours ago – JL2210

  • It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>. – Toby Speight, 7 hours ago
3 Answers

Initial approximation



Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:



uint64_t i = *(uint64_t*)&f;   // double to int
i = i - (1023ULL << 52);       // remove IEEE754 bias
i = i >> 1;                    // divide exponent by 2
i = i + (1023ULL << 52);       // add IEEE754 bias


If you change the order of operations, you can handle the biasing in a single constant:



uint64_t i = *(uint64_t*)&f;                      // double to int
i = i >> 1;                                       // divide exponent by 2
i = i + ((1023ULL << 52) - (1023ULL << 51));      // remove/add IEEE754 bias


The constant can then be simplified to:



uint64_t i = *(uint64_t*)&f;         // double to int
i = (i >> 1) + (1023ULL << 51);      // all of the above


Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):



original   number is        exponent   mantissa    orig interval
exponent   in interval      in sqrt    first bit   maps to
0          [  1.0,  2.0 )   0          0           [ 1.0, 1.5 )
1          [  2.0,  4.0 )   0          1           [ 1.5, 2.0 )
2          [  4.0,  8.0 )   1          0           [ 2.0, 3.0 )
3          [  8.0, 16.0 )   1          1           [ 3.0, 4.0 )
4          [ 16.0, 32.0 )   2          0           [ 4.0, 6.0 )


With your code, the mapping is (also linear mapping within intervals):



original   number is        exponent   orig interval
exponent   in interval      in sqrt    maps to
0          [  1.0,  2.0 )   0          [ 1.0, 2.0 )
1          [  2.0,  4.0 )   0          [ 1.0, 2.0 )
2          [  4.0,  8.0 )   1          [ 2.0, 4.0 )
3          [  8.0, 16.0 )   1          [ 2.0, 4.0 )
4          [ 16.0, 32.0 )   2          [ 4.0, 8.0 )


My mapping is far more accurate and yours isn't even monotonic.



Newton-Raphson



Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.



The entire code could look like this:



#include <stdint.h>

float sqrtf(float x)
{
    float y = x;
    // Approximation
    uint32_t* i = (uint32_t*)&x;
    *i = (*i >> 1) + (127 << 22);
    // Newton-Raphson
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    return x;
}

double sqrt(double x)
{
    double y = x;
    // Approximation
    uint64_t* i = (uint64_t*)&x;
    *i = (*i >> 1) + (1023ULL << 51);
    // Newton-Raphson
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    return x;
}

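A variant that sidesteps the strict-aliasing concern raised in the comments below could route the type pun through memcpy instead of a pointer cast. This is only a sketch under the same binary64 assumption, and the function name is made up for illustration; optimizing compilers typically turn such a small fixed-size memcpy into a plain register move.

#include <stdint.h>
#include <string.h>

/* Sketch only: same algorithm as above, but the type pun is done with
 * memcpy, which is well-defined with respect to aliasing. */
double sqrt_punned(double x)
{
    double y = x;
    uint64_t i;

    memcpy(&i, &x, sizeof i);          /* bits of the double */
    i = (i >> 1) + (1023ULL << 51);    /* initial approximation */
    memcpy(&x, &i, sizeof x);

    /* Newton-Raphson, as above */
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    return x;
}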





– Rainer P. (answered 7 hours ago, edited 4 hours ago)

  • This won't work for 80-bit long double, which can't be represented as an integer. – JL2210, 6 hours ago

  • Also note that this violates strict aliasing and is undefined behavior. – JL2210, 4 hours ago

  • How many bits of accuracy does the current implementation provide? – JL2210, 4 hours ago

  • Your approximation is off by 41% at worst, that's about 1 bit (with my approximation it's ~6%). @JL2210 – Rainer P., 4 hours ago

  • How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double? – JL2210, 4 hours ago

Modular odd check



The compiler is likely to do this anyway, but can't



if(c % 2 == 0)


be



if (!(c & 1))


?



Factoring



Isn't



r.f -= (r.f*r.f-a)/(2*r.f);


equivalent to:



r.f = (r.f + a/r.f)/2?
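
(For reference, the two forms are indeed algebraically identical; writing $x$ for r.f:

$$x - \frac{x^2 - a}{2x} = \frac{2x^2 - x^2 + a}{2x} = \frac{x^2 + a}{2x} = \frac{1}{2}\left(x + \frac{a}{x}\right).)$$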


Optimization



Depending on which compiler you're using and what flags you pass it, some of your attempted math risks modification by the optimizer, and in an implementation so closely coupled to the hardware, this might matter. Have you checked your produced assembly? Can you show us your build flags?






– Reinderien (answered 8 hours ago, edited 8 hours ago)

  • Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself. – Reinderien, 8 hours ago

  • @Reinderien Thanks. I've added a clarification. – JL2210, 8 hours ago

  • Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality. – Reinderien, 8 hours ago

  • The second has two divisions while the first has one. – JL2210, 8 hours ago

  • Produced assembly seems fine, GCC with optimization levels three and size. – JL2210, 8 hours ago

I don't see why you define this constant yourself:




#define MANTISSA_SIZE 52



Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).
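
One possible spelling, as a sketch only (not from the answer itself): it assumes FLT_RADIX == 2 and an IEEE 754 binary64 double, and it accounts for the implicit leading bit that the comments below point out.

#include <float.h>

/* Sketch: derive the hand-written constants from <float.h>.  The stored
 * mantissa field is one bit narrower than DBL_MANT_DIG because the
 * leading significand bit is implicit. */
#define MANTISSA_SIZE (DBL_MANT_DIG - 1)   /* 53 - 1 == 52 */
#define EXPONENT_BIAS (DBL_MAX_EXP - 1)    /* 1024 - 1 == 1023 == 0x3ff */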




I think there's danger of integer overflow here:




/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;



We'd be better extracting into a big enough temporary, and applying the exponent bias to that:



int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;


It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.






– Toby Speight (answered 7 hours ago, edited 7 hours ago)

  • Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid. – JL2210, 7 hours ago

  • I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG. – JL2210, 6 hours ago












