IEEE 754 square root with Newton-Raphson
I have an implementation of the sqrt function that uses a combination of IEEE 754, packed bitfields, and the Newton-Raphson algorithm:

decompose.h:
#ifndef DECOMPOSE_H
#define DECOMPOSE_H 1
#ifdef __cplusplus
extern "C" {
#endif
#include <float.h>
#define MANTISSA_SIZE 52
#define EXPONENT_SIZE 11
#define EXPONENT_BIAS 0x3ff
#define TYPE double
#define TYPE_MIN DBL_MIN
#define TYPE_MAX DBL_MAX
#define NAME_SFX(name) name
typedef unsigned long long mantissa;
typedef unsigned short exponent;
typedef unsigned char sign;
#include "decompose_generic.h"
#ifdef __cplusplus
}
#endif
#endif
decompose_generic.h:
#ifndef DECOMPOSE_GENERIC_H
#define DECOMPOSE_GENERIC_H 1
#ifdef __cplusplus
extern "C" {
#endif
#define SIGN_SIZE 1
union decompose
{
    TYPE f;
    struct decompose_s
    {
        mantissa m:MANTISSA_SIZE;
        exponent e:EXPONENT_SIZE;
        sign s:SIGN_SIZE;
    } __attribute__((packed)) o;
};
static inline union decompose decompose(TYPE f)
{
    union decompose result;
    result.f = f;
    return result;
}
#define is_nan(x) NAME_SFX(isnan)(x)
#ifdef __cplusplus
}
#endif
#endif
sqrt.c:
#include <math.h>
#include <errno.h>
#include "decompose.h"
TYPE NAME_SFX(sqrt)(TYPE a)
{
    int c = 0;
    TYPE result_old2 = 0;
    union decompose r = decompose(a);

    /* Divide the exponent by 2 */
    r.o.e -= EXPONENT_BIAS;
    r.o.e = (r.o.e & 1) | (r.o.e >> 1);
    r.o.e += EXPONENT_BIAS;

    /* Newton-Raphson */
    while(result_old2 != r.f)
    {
        if(c % 2 == 0) result_old2 = r.f;
        r.f -= (r.f*r.f - a)/(2*r.f);
        c++;
    }
    return r.f;
}
What I'm looking for:
Are there any efficiency problems?
Are there any accuracy problems?
Are there any other miscellaneous things that I can improve?
As far as I've tested, this is as accurate as possible. It even matches the results from the GNU C library.
One note: This is part of a modular system. There will be a decompose.h for each floating point type (float, double, long double), and they all include decompose_generic.h.
I'm slightly new to this area of C (math, IEEE 754), but not so new that I'd add the beginner tag.
c reinventing-the-wheel numerical-methods floating-point
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
– Toby Speight, 7 hours ago
asked 9 hours ago by JL2210, edited 7 hours ago
3 Answers
Initial approximation
Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:
uint64_t i = *(uint64_t*)&f;     // double to int
i = i - (1023ULL << 52);         // remove IEEE754 bias
i = i >> 1;                      // divide exponent by 2
i = i + (1023ULL << 52);         // add IEEE754 bias
If you change the order of operations, you can handle the biasing in a single constant:
uint64_t i = *(uint64_t*)&f;                   // double to int
i = i >> 1;                                    // divide exponent by 2
i = i + ((1023ULL << 52) - (1023ULL << 51));   // remove/add IEEE754 bias
The constant can then be simplified to:
uint64_t i = *(uint64_t*)&f;      // double to int
i = (i >> 1) + (1023ULL << 51);   // all of the above
Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):
original number is exponent mantissa orig interval
exponent in interval in sqrt first bit maps to
0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )
With your code, the mapping is (also linear mapping within intervals):
original number is exponent orig interval
exponent in interval in sqrt maps to
0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )
My mapping is far more accurate and yours isn't even monotonic.
Newton-Raphson
Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.
The entire code could look like this:
#include <stdint.h>

float sqrtf(float x)
{
    float y = x;
    // Approximation
    uint32_t* i = (uint32_t*)&x;
    *i = (*i >> 1) + (127 << 22);
    // Newton-Raphson
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    return x;
}

double sqrt(double x)
{
    double y = x;
    // Approximation
    uint64_t* i = (uint64_t*)&x;
    *i = (*i >> 1) + (1023ULL << 51);
    // Newton-Raphson
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    x = (x + y/x) / 2;
    return x;
}
answered 7 hours ago by Rainer P., edited 4 hours ago
This won't work for 80-bit long double, which can't be represented as an integer.
– JL2210, 6 hours ago

Also note that this violates strict aliasing and is undefined behavior.
– JL2210, 4 hours ago

How many bits of accuracy does the current implementation provide?
– JL2210, 4 hours ago

@JL2210 Your approximation is off by 41% at worst, that's about 1 bit (with my approximation it's ~6%).
– Rainer P., 4 hours ago

How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works for long double?
– JL2210, 4 hours ago
Modular odd check
The compiler is likely to do this anyway, but can't
if(c % 2 == 0)
be
if (!(c & 1))
?
Factoring
Isn't
r.f -= (r.f*r.f-a)/(2*r.f);
equivalent to:
r.f = (r.f + a/r.f)/2;
Optimization
Depending on which compiler you're using and what flags you pass it, some of your attempted math risks modification by the optimizer, and in an implementation so closely coupled to the hardware, this might matter. Have you checked your produced assembly? Can you show us your build flags?
– Reinderien
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
– Reinderien, 8 hours ago

@Reinderien Thanks. I've added a clarification.
– JL2210, 8 hours ago

Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
– Reinderien, 8 hours ago

The second has two divisions while the first has one.
– JL2210, 8 hours ago

Produced assembly seems fine; GCC with optimization levels three and size.
– JL2210, 8 hours ago
I don't see why you define this constant yourself:
#define MANTISSA_SIZE 52
Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).
I think there's danger of integer overflow here:
/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;
We'd do better to extract the value into a big enough temporary, and apply the exponent bias to that:
int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;
It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
– JL2210, 7 hours ago

I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
– JL2210, 6 hours ago
Your Answer
StackExchange.ifUsing("editor", function ()
StackExchange.using("externalEditor", function ()
StackExchange.using("snippets", function ()
StackExchange.snippets.init();
);
);
, "code-snippets");
StackExchange.ready(function()
var channelOptions =
tags: "".split(" "),
id: "196"
;
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function()
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled)
StackExchange.using("snippets", function()
createEditor();
);
else
createEditor();
);
function createEditor()
StackExchange.prepareEditor(
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader:
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/4.0/"u003ecc by-sa 4.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
,
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
);
);
Sign up or log in
StackExchange.ready(function ()
StackExchange.helpers.onClickDraftSave('#login-link');
);
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function ()
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fcodereview.stackexchange.com%2fquestions%2f229573%2fieee-754-square-root-with-newton-raphson%23new-answer', 'question_page');
);
Post as a guest
Required, but never shown
3 Answers
3
active
oldest
votes
3 Answers
3
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
Initial approximation
Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:
uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias
If you change the order of operations, you can handle the biasing in a single constant:
uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias
The constant can then be simplified to:
uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above
Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):
original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )
With your code, the mapping is (also linear mapping within intervals):
original number is exponent orig interval
exponent in interval in sqrt maps to
0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )
My mapping is far more accurate and yours isn't even monotonic.
Newton-Raphson
Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.
The entire code could look like this:
float sqrtf(float x)
float y = x;
// Approximation
uint32_t* i = (uint32_t*)&x;
*i = (*i >> 1) + (127 << 22);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
double sqrt(double x)
double y = x;
// Approximation
uint64_t* i = (uint64_t*)&x;
*i = (*i >> 1) + (1023 << 51);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
$endgroup$
$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago
1
$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago
$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago
1
$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago
$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works forlong double
?
$endgroup$
– JL2210
4 hours ago
|
show 4 more comments
$begingroup$
Initial approximation
Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:
uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias
If you change the order of operations, you can handle the biasing in a single constant:
uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias
The constant can then be simplified to:
uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above
Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):
original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )
With your code, the mapping is (also linear mapping within intervals):
original number is exponent orig interval
exponent in interval in sqrt maps to
0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )
My mapping is far more accurate and yours isn't even monotonic.
Newton-Raphson
Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.
The entire code could look like this:
float sqrtf(float x)
float y = x;
// Approximation
uint32_t* i = (uint32_t*)&x;
*i = (*i >> 1) + (127 << 22);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
double sqrt(double x)
double y = x;
// Approximation
uint64_t* i = (uint64_t*)&x;
*i = (*i >> 1) + (1023 << 51);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
$endgroup$
$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago
1
$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago
$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago
1
$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago
$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works forlong double
?
$endgroup$
– JL2210
4 hours ago
|
show 4 more comments
$begingroup$
Initial approximation
Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:
uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias
If you change the order of operations, you can handle the biasing in a single constant:
uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias
The constant can then be simplified to:
uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above
Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):
original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )
With your code, the mapping is (also linear mapping within intervals):
original number is exponent orig interval
exponent in interval in sqrt maps to
0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )
My mapping is far more accurate and yours isn't even monotonic.
Newton-Raphson
Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.
The entire code could look like this:
float sqrtf(float x)
float y = x;
// Approximation
uint32_t* i = (uint32_t*)&x;
*i = (*i >> 1) + (127 << 22);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
double sqrt(double x)
double y = x;
// Approximation
uint64_t* i = (uint64_t*)&x;
*i = (*i >> 1) + (1023 << 51);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
$endgroup$
Initial approximation
Packed bit fields are not only unnecessary here, they make the result worse. You do something along the lines of:
uint64_t i = *(uint64_t*)&f; // double to int
i = i - (1023 << 52); // remove IEEE754 bias
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52); // add IEEE754 bias
If you change the order of operations, you can handle the biasing in a single constant:
uint64_t i = *(uint64_t*)&f; // double to int
i = i >> 1; // divide exponent by 2
i = i + (1023 << 52 - 1023 << 51); // remove/add IEEE754 bias
The constant can then be simplified to:
uint64_t i = *(uint64_t*)&f; // double to int
i = (i >> 1) + (1023 << 51); // all of the above
Note that I didn't bother to mask when shifting, so the exponent's rightmost bit drops into the mantissa, which is shifted as well. This is a feature, not a bug. With my approximation, the mapping is as follows (linear mapping within intervals):
original number is exponent manitssa orig interval
exponent in interval in sqrt first bit maps to
0 [ 1.0, 2.0 ) 0 0 [ 1.0, 1.5 )
1 [ 2.0, 4.0 ) 0 1 [ 1.5, 2.0 )
2 [ 4.0, 8.0 ) 1 0 [ 2.0, 3.0 )
3 [ 8.0, 16.0 ) 1 1 [ 3.0, 4.0 )
4 [ 16.0, 32.0 ) 2 0 [ 4.0, 6.0 )
With your code, the mapping is (also linear mapping within intervals):
original number is exponent orig interval
exponent in interval in sqrt maps to
0 [ 1.0, 2.0 ) 0 [ 1.0, 2.0 )
1 [ 2.0, 4.0 ) 0 [ 1.0, 2.0 )
2 [ 4.0, 8.0 ) 1 [ 2.0, 4.0 )
3 [ 8.0, 16.0 ) 1 [ 2.0, 4.0 )
4 [ 16.0, 32.0 ) 2 [ 4.0, 8.0 )
My mapping is far more accurate and yours isn't even monotonic.
Newton-Raphson
Given a good approximation, Newton-Raphson doubles the number of significant digits on each iteration (quadratic convergence). The above approximation provides about 4 bits of accuracy (max error: 6% or ~1/16), so 3 Newton-Raphson iterations are required for single and 4 iterations for double precision.
The entire code could look like this:
float sqrtf(float x)
float y = x;
// Approximation
uint32_t* i = (uint32_t*)&x;
*i = (*i >> 1) + (127 << 22);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
double sqrt(double x)
double y = x;
// Approximation
uint64_t* i = (uint64_t*)&x;
*i = (*i >> 1) + (1023 << 51);
// Newton-Raphson
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
x = (x + y/x) / 2;
return x;
edited 4 hours ago
answered 7 hours ago
Rainer P.Rainer P.
1,4265 silver badges14 bronze badges
1,4265 silver badges14 bronze badges
$begingroup$
This won't work for 80-bit long double, which can't be represented as an integer.
$endgroup$
– JL2210
6 hours ago
1
$begingroup$
Also note that this violates strict aliasing and is undefined behavior.
$endgroup$
– JL2210
4 hours ago
$begingroup$
How many bits of accuracy does the current implementation provide?
$endgroup$
– JL2210
4 hours ago
1
$begingroup$
Your approximation is off by 41% at worst, that's about 1bit. (with my approximation it's ~6%) - @JL2210
$endgroup$
– Rainer P.
4 hours ago
$begingroup$
How can I have something similar to your implementation that 1) doesn't violate strict aliasing and 2) works forlong double
?
$endgroup$
– JL2210
4 hours ago
|
show 4 more comments
Modular odd check
The compiler is likely to do this anyway, but can't
if (c % 2 == 0)
be
if (!(c & 1))
?
Factoring
Isn't
r.f -= (r.f*r.f - a)/(2*r.f);
equivalent to:
r.f = (r.f + a/r.f)/2;
Optimization
Depending on which compiler you're using and what flags you pass it, some of your attempted math risks modification by the optimizer, and in an implementation so closely coupled to the hardware, this might matter. Have you checked your produced assembly? Can you show us your build flags?
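The two update forms are algebraically identical: r - (r²-a)/(2r) = (r²+a)/(2r) = (r + a/r)/2. An illustrative check (not from the post) that they converge to the same value; note the first form needs one division per step, the second two:

```c
/* Newton's method for sqrt(a) in the "subtract the correction" form:
 * r -= (r*r - a)/(2*r).  One division per step. */
static double newton_sub(double a, int steps)
{
    double r = a;                      /* any positive start converges */
    while (steps--)
        r -= (r * r - a) / (2 * r);
    return r;
}

/* The same iteration in the "average with the quotient" form:
 * r = (r + a/r)/2.  Two divisions per step. */
static double newton_avg(double a, int steps)
{
    double r = a;
    while (steps--)
        r = (r + a / r) / 2;
    return r;
}
```

For a = 2 and eight steps both land on 1.4142135623730951; any last-bit differences come only from the different rounding of the two expressions.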
edited 8 hours ago
answered 8 hours ago
Reinderien
9,890 16 silver badges 41 bronze badges
2
$begingroup$
Soooo. The problem isn't with my answer (and I don't think it should receive your edit, yet); it's with the code you posted. A strict interpretation of the CR policies would suggest that since there's already an answer, you can't edit your code. However, I think instead that you should edit your question to show your actual (non-preprocessed) code, and I'll then amend this answer myself.
$endgroup$
– Reinderien
8 hours ago
2
$begingroup$
@Reinderien Thanks. I've added a clarification.
$endgroup$
– JL2210
8 hours ago
1
$begingroup$
Re. factoring: So my first stab at the math was wrong, and I modified it to reflect reality.
$endgroup$
– Reinderien
8 hours ago
1
$begingroup$
The second has two divisions while the first has one.
$endgroup$
– JL2210
8 hours ago
1
$begingroup$
Produced assembly seems fine, GCC with optimization levels three and size.
$endgroup$
– JL2210
8 hours ago
|
show 6 more comments
I don't see why you define this constant yourself:
#define MANTISSA_SIZE 52
Given we already assume that FLT_RADIX is 2, we can use the appropriate macro from <float.h> (DBL_MANT_DIG for double, etc.).
I think there's danger of integer overflow here:
/* Divide the exponent by 2 */
r.o.e -= EXPONENT_BIAS;
r.o.e = (r.o.e & 1) | (r.o.e >> 1);
r.o.e += EXPONENT_BIAS;
We'd be better extracting into a big enough temporary, and applying the exponent bias to that:
int e = r.o.e - EXPONENT_BIAS;
e = (e & 1) | (e >> 1);
r.o.e = e + EXPONENT_BIAS;
It might be possible to shift in place and then correct using half the bias; I haven't checked whether that's equivalent.
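Concretely, both constants can be derived from <float.h> rather than hard-coded. The -1 accounts for the implicit leading bit, as pointed out in the comments (DBL_MANT_DIG counts the full 53-bit significand, while only 52 bits are stored):

```c
#include <float.h>

/* Assuming IEEE 754 binary64 doubles: the stored mantissa field is
 * DBL_MANT_DIG - 1 = 52 bits (the leading 1 is implicit), and the
 * exponent bias is DBL_MAX_EXP - 1 = 1023. */
#define MANTISSA_SIZE (DBL_MANT_DIG - 1)   /* 52, without a magic number */
#define EXPONENT_BIAS (DBL_MAX_EXP - 1)    /* 1023 */
```

This keeps the code correct if it is later retargeted at float (FLT_MANT_DIG) or long double (LDBL_MANT_DIG), bias aside.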
edited 7 hours ago
answered 7 hours ago
Toby Speight
32k 7 gold badges 46 silver badges 137 bronze badges
$begingroup$
Signed bit-shifting isn't well-defined. I don't think the second half of your answer is valid.
$endgroup$
– JL2210
7 hours ago
$begingroup$
I agree with the top point, but it would be DBL_MANT_DIG-1, not DBL_MANT_DIG.
$endgroup$
– JL2210
6 hours ago
add a comment
|
4
$begingroup$
It's a shame this isn't C++, where you could check that the floating-point representation matches your expectation using std::numeric_limits<T>::is_iec559. I don't know of an equivalent test for C, other than separately checking all the macros in <float.h>.
$endgroup$
– Toby Speight
7 hours ago
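A rough C-side stand-in for that check, along the lines the comment suggests, is to verify at compile time that <float.h> reports the IEEE 754 binary64 parameters (this assumes C11 for static_assert):

```c
#include <assert.h>   /* static_assert (C11) */
#include <float.h>

/* Approximation of std::numeric_limits<double>::is_iec559 in C:
 * statically check the parameters IEEE 754 binary64 implies.  This
 * doesn't prove full conformance (rounding, subnormals, NaN payloads),
 * but it rejects any platform where the bit-level tricks can't work. */
static_assert(FLT_RADIX == 2,       "binary floating point required");
static_assert(DBL_MANT_DIG == 53,   "binary64 has a 53-bit significand");
static_assert(DBL_MAX_EXP == 1024,  "binary64 max exponent");
static_assert(DBL_MIN_EXP == -1021, "binary64 min exponent");
```

Placed at file scope, this turns a silent wrong-result on exotic hardware into a compile error.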