Fast Division and Square Root Algorithms for AVR 8-Bit Microcontrollers in UAV Projects


#1 21621283 04 Dec 2010 22:38

Edward Henderson

I have a digital board intended for a flight control system on RC aircraft, with grand plans for a home-built UAV. I have been doing a lot of study of the math and algorithms that go into such systems. Some of the discussions cover the efficiency of the algorithms, in terms of how many multiply and addition operations are required for each update. The AVR 8-bit devices have a built-in hardware multiply instruction for 8-bit operands; 16-bit operations require a few cycles more. What they do NOT have is a divide operation. A divide is the same as multiplication by the reciprocal, but how do you get the reciprocal? The other operation that is not represented is the square root.

The avr-libc library includes divide and sqrt functions, but they are fairly costly according to the documentation. The __divsf3 function, which I assume is the divide (documentation is limited here), takes 465 clock cycles. That is about 23 us at 20 MHz, at one instruction cycle per clock (I think that is correct). That seems like an awfully long time to perform a division, especially if a mul operation takes only 2 cycles! sqrt takes 492 cycles, which is also pretty high. For some reason, I would expect sqrt to take more.

I'm sure I can dig into the details of these routines, but if anyone has some insight, that would be great. Perhaps if I knew a bit more about how these particular routines work, I might be able to optimize a version specifically for my task. I'm sure that the avr-libc versions are somewhat generic, so, making one that is specific to a single purpose might save some of those cycles. Anyone have any experience in this area?
#2 21621284 07 Dec 2010 19:32

Joe Wolin

The folks on the forum at http://www.avrfreaks.net might have a good answer.

Let us know what you find out...
#3 21621285 09 Dec 2010 12:09

Edward Henderson

AVR Freaks might help, but I was thinking my question was more of a general microprocessor question. The question is more what types of shortcuts can one make with a small micro.

I was discussing transcendental functions (sin, cos, tan) with a friend recently. The processor I am using takes 1600+ clock cycles for a sin or cos calculation. The calculations are done in floating point.

For small angles, a Taylor series expansion can be used instead. This expansion requires only multiplication and addition, which have direct hardware support. While this isn't a full sin/cos function, for some applications it might work quite well (small angles), and the performance will be significantly better.

Those are the types of optimization I am looking for.
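
As a rough illustration of the Taylor idea, here is a minimal C sketch. The function name `sin_small`, the three-term truncation, and the use of radians are assumptions for illustration, not code from anyone in this thread. It is written in float to match the setup described above; the same polynomial could also be evaluated in fixed point to lean on the 8-bit hardware multiplier more directly.

```c
/* Hypothetical helper: sin(x) ~ x - x^3/6 + x^5/120 for small |x| (radians).
   Horner form: after computing x*x it needs three more multiplies
   and two adds, and no division at run time. */
static float sin_small(float x)
{
    const float c3 = -1.0f / 6.0f;    /* constants folded at compile time */
    const float c5 =  1.0f / 120.0f;
    float x2 = x * x;
    return x * (1.0f + x2 * (c3 + x2 * c5));
}
```

For |x| up to about 0.5 rad the truncation error of this polynomial is on the order of 1e-6. It grows quickly for larger angles, so the input would have to be range-limited first.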
#4 21621286 09 Dec 2010 12:47

Joe Wolin

Thanks for the update. A Taylor Series does sound like the way to go.
#5 21621287 17 Dec 2010 02:34

Randy Dawson

Ed, I'm going to suggest HAKMEM as a note you might enjoy.

It's a somewhat dated MIT paper (1972), but it has lots of mathematical gems in it.

Just plain interesting reading, too!
#6 21621288 17 Dec 2010 20:07

Joe Wolin

Here's a link to the paper Randy is talking about. Thanks Randy.

http://www.inwap.com/pdp10/hbaker/hakmem/hakmem.html
#7 21621289 19 Dec 2010 04:21

Bill Westfield

The internal multiply instruction you talk about is for 8-bit integers, while the __divsf3 and sqrt functions are for 32-bit FLOATING POINT numbers, which is a lot different.

For high speed math as you might need in a flight control system, you probably want to look for a library of high-speed FIXED POINT FRACTIONAL math functions. This is nicely searchable. While I don't have any direct experience, there is this: https://sourceforge.net/projects/avrfix/
#8 21621290 04 Jan 2011 12:34

Gary Crowell

Hi Ed, I have a book titled "Math Toolkit for Real-Time Programming" by Jack Crenshaw. It may not be everything you're looking for, but it's interesting to read anyway. The author did the 'Programmer's Toolbox' column for Embedded Systems Programming magazine for many years, and his columns are the main reason I've kept most of the back issues.

I'm sure we'll be seeing each other this week, so I'll bring it along.
#9 21621291 17 Feb 2011 10:51

Bruce Land

I wrote some 16 bit fixed point routines for AVRGCC. The format is 8:8. Multiply speed is about 40 cycles, divide speed is about 360 cycles worst case. Code is at
http://people.ece.cornell.edu/land/courses/ece4760/Math/index.html
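
For readers unfamiliar with the 8:8 format, a minimal sketch of what such routines compute is shown below. This is not the Cornell code; the names `q8_8`, `q8_8_mul` and `q8_8_div` are hypothetical, and the divide simply uses the C `/` operator (which on AVR expands to a software division routine), so it shows the scaling only, not a cycle-optimized implementation.

```c
#include <stdint.h>

typedef int16_t q8_8;              /* hypothetical name: signed 8.8, 1.0 == 256 */

/* Multiply: the 32-bit product carries 16 fraction bits; drop 8 of them. */
static q8_8 q8_8_mul(q8_8 a, q8_8 b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    return (q8_8)(p / 256);        /* truncates toward zero; no overflow check */
}

/* Divide: pre-scale the numerator by 2^8 so the quotient stays in 8.8. */
static q8_8 q8_8_div(q8_8 a, q8_8 b)
{
    if (b == 0)                    /* assumed policy: saturate on divide by zero */
        return (a < 0) ? INT16_MIN : INT16_MAX;
    return (q8_8)(((int32_t)a * 256) / b);
}
```

For example, 1.5 is stored as 384 and 2.25 as 576; `q8_8_mul(384, 576)` gives 864, which is 3.375 in 8:8.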
#10 21621292 11 Oct 2011 11:41

Bruce Land

I wrote a simple 16-bit reciprocal and sqrt for Mega644 in GCC.
http://people.ece.cornell.edu/land/courses/ece4760/Math/index.html
#11 21621293 16 Oct 2011 05:25

Per Zackrisson

As a rule in realtime applications:
Find out how precise you will have to be.
Eight bits gives you a 1.4% error in sine or cosine.
If that is not enough, try 16 bits, and so on.
You can also do the calculations beforehand and put them in a table, trading memory for speed.
#12 21621294 18 Oct 2011 09:41

Ralph Pruitt

Hi Edward,

Let me add one more item to this discussion. In microcontrollers there is always a tradeoff of speed versus use of memory. At the most basic level you can write the routine and have the low-speed microcontroller attempt to do the calculations as fast as an algorithm can be coded. The other solution is to code all or part of the routine using lookup tables, where the calculations have been performed in advance and placed into the table. The only negative will be your resolution. If this is a simple 8-bit result that is passed an 8-bit value, this uses 256 bytes of EEPROM; if it is a 12-bit result from a 12-bit parameter, that can use 1536 bytes. The key is to use only the resolution that you need. Further, tables can be used along with the algorithm for key sections, so the solution can be a hybrid.

Just some ideas to consider when using low processing power micros.
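
A minimal sketch of the 8-bit-in, 8-bit-out case, using square root as the example function. The names `sqrt_lut`, `sqrt_lut_init` and `sqrt8_q4_4`, and the choice of an unsigned 4.4 fixed-point result, are assumptions for illustration. The table is filled at startup here only to keep the sketch short; on a real part the 256 values would more likely be generated offline and stored as constants.

```c
#include <stdint.h>

static uint8_t sqrt_lut[256];          /* 256 entries x 1 byte = 256 bytes */

/* Fill the table once: entry i holds floor(16 * sqrt(i)). */
static void sqrt_lut_init(void)
{
    uint16_t r = 0;
    for (uint16_t i = 0; i < 256; i++) {
        uint32_t target = (uint32_t)i * 256u;        /* i scaled by 2^8 */
        while ((uint32_t)(r + 1) * (r + 1) <= target)
            r++;                                     /* r = floor(sqrt(target)) */
        sqrt_lut[i] = (uint8_t)r;
    }
}

/* Lookup: sqrt of an 8-bit integer, result in unsigned 4.4 fixed point. */
static uint8_t sqrt8_q4_4(uint8_t x)
{
    return sqrt_lut[x];
}
```

After `sqrt_lut_init()`, `sqrt8_q4_4(4)` returns 32, which is 2.0 in 4.4 format. The run-time cost is a single indexed read; the price is the fixed 4-bit fractional resolution.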
#13 21621295 18 Oct 2011 10:14

Todd Hayden

Good points.

Another thing I have done to avoid consuming memory space with a 1:1 table is to use piecewise linear approximations to the function. Select the table entries so that interpolation between two entries still gives the resolution needed. The table size can sometimes be reduced dramatically for a small increase in processing of the value.
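
A minimal sketch of the idea, assuming a first-quadrant sine with a 17-entry table and uniformly spaced breakpoints (non-uniform breakpoints, chosen to match the curvature, would need a search step that is omitted here). The names `sin_tab` and `sin_q15_quarter`, the Q15 output scaling, and the 8-bit angle covering 0 to 90 degrees are all illustrative assumptions.

```c
#include <stdint.h>

/* sin(k * 90deg / 16) * 32767 for k = 0..16: 17 entries, 34 bytes.
   On an AVR this would normally live in flash (PROGMEM, read with
   pgm_read_word); a plain array keeps the sketch portable. */
static const int16_t sin_tab[17] = {
        0,  3212,  6393,  9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* a = 0..255 maps to 0..90 degrees (exclusive); result is sine in Q15. */
static int16_t sin_q15_quarter(uint8_t a)
{
    uint8_t idx  = a >> 4;             /* which segment, 0..15 */
    uint8_t frac = a & 0x0F;           /* position inside the segment, 0..15 */
    int16_t y0 = sin_tab[idx];
    int16_t y1 = sin_tab[idx + 1];
    /* y0 + (y1 - y0) * frac / 16; y1 >= y0 here, so the shift is safe */
    return (int16_t)(y0 + (((int32_t)(y1 - y0) * frac) >> 4));
}
```

A 1:1 table for the same 8-bit input at 16-bit resolution would take 512 bytes; this version takes 34 bytes plus one multiply and a shift per lookup.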

Topic summary

The discussion addresses the challenge of implementing fast division and square root operations on AVR 8-bit microcontrollers for UAV flight control systems, where hardware multiply instructions exist but no native divide or square root instructions are available. The built-in avr-libc floating-point divide (__divsf3) and sqrt functions are computationally expensive, consuming hundreds of clock cycles, which is critical for real-time control. Suggested approaches include using fixed-point fractional math libraries to improve speed, such as the AVRFix project, and employing mathematical shortcuts like Taylor series expansions for transcendental functions to reduce computational load. Lookup tables and piecewise linear approximations are recommended to trade memory usage for speed, with careful consideration of required precision and resolution. References to classic resources like the MIT HAKMEM paper and the "Math Toolkit for Real-Time Programming" book provide additional algorithmic insights. Practical implementations of 16-bit fixed-point reciprocal and square root routines for AVR GCC are available, demonstrating achievable cycle counts for multiplication (~40 cycles) and division (~360 cycles). Overall, the discussion emphasizes balancing precision, speed, and memory constraints in embedded UAV flight control math computations on AVR microcontrollers.
Summary generated by the language model.

FAQ

TL;DR: On AVR, MUL is 2 cycles, while 16‑bit fixed‑point divide is ~360 cycles worst‑case [AVR Instruction Set Manual; Elektroda, Anonymous, #21621291]. "Find out how precise you will have to be." [Elektroda, Anonymous, post #21621293] This FAQ shows faster division, sqrt, and trig for UAV control loops on 8‑bit AVR using fixed‑point and approximations.

Why it matters: Faster math keeps your control loop stable at higher update rates without adding a costly FPU MCU.

Quick Facts

- AVR 8‑bit devices: hardware 8×8 MUL (2 cycles), no hardware DIV/FPU [AVR Instruction Set Manual].
- Reported float divide __divsf3 ≈ 465 cycles; float sqrt ≈ 492 cycles [Elektroda, Anonymous, post #21621283]
- Fixed‑point 16‑bit (Q8.8) implementation: mul ≈ 40 cycles, div ≈ 360 cycles worst‑case [Elektroda, Anonymous, post #21621291]
- Float sin/cos calls can take 1600+ cycles on the referenced platform [Elektroda, Anonymous, post #21621285]
- Memory tradeoff: 256‑entry LUT = 256 bytes; 12‑bit case ≈ 1536 bytes [Elektroda, Anonymous, post #21621294]

How can I do fast division on AVR 8‑bit without hardware divide?

Avoid float. Use fixed‑point and compute a reciprocal via Newton–Raphson: y ≈ 1/x, then multiply by y. This uses only adds, shifts, and MUL, which AVR provides in hardware [AVR Instruction Set Manual]. A Q8.8 library measured ~360 cycles worst‑case for 16‑bit divide [Elektroda, Anonymous, post #21621291]. For high‑speed math, the thread recommends looking for a library of fixed‑point fractional functions [Elektroda, Anonymous, post #21621289]. Libraries like avrfix and Cornell’s routines provide working code paths [Elektroda, Anonymous, #21621289; Elektroda, Anonymous, #21621291].

Should I avoid float in UAV control loops on AVR?

Yes. Float division and sqrt run in software and cost hundreds of cycles, reducing loop bandwidth [Elektroda, Anonymous, post #21621283]. Fixed‑point keeps operations to integer math, with measured 16‑bit mul ≈ 40 cycles and div ≈ 360 cycles worst‑case [Elektroda, Anonymous, post #21621291]. Many control systems fit within 16‑bit fractional ranges if you scale signals appropriately [Math Toolkit for Real-Time Programming].

Is __divsf3 really that slow on AVR?

It is software‑emulated and reported at ≈465 cycles in the referenced setup [Elektroda, Anonymous, post #21621283]. Actual cost varies with compiler version and optimization level because there is no hardware divide or FPU [avr-libc User Manual; AVR Instruction Set Manual]. For time‑critical code, convert to fixed‑point division via reciprocal iterations [Elektroda, Anonymous, post #21621291].

What's a fast way to compute sqrt(x) on AVR?

Use fixed‑point Newton–Raphson or a binary restoring integer sqrt. A forum implementation provides 16‑bit reciprocal and sqrt for ATmega644 in GCC [Elektroda, Anonymous, post #21621292]. Float sqrt was reported around 492 cycles in the thread’s context [Elektroda, Anonymous, post #21621283]. Fixed‑point variants avoid float overhead and let you control scaling and saturation [Math Toolkit for Real-Time Programming].
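
One way the integer variant might look is the bit‑by‑bit (digit‑by‑digit) square root below, which uses only shifts, adds and compares. This is a generic sketch, not the forum implementation; the name `isqrt16` is hypothetical.

```c
#include <stdint.h>

/* floor(sqrt(x)) for a 16-bit input, one result bit per loop pass. */
static uint8_t isqrt16(uint16_t x)
{
    uint16_t res = 0;
    uint16_t bit = 1u << 14;          /* highest power of four <= 65535 */

    while (bit > x)                   /* skip result bits that must be zero */
        bit >>= 2;

    while (bit != 0) {
        if (x >= res + bit) {         /* does this result bit fit? */
            x   -= res + bit;
            res  = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint8_t)res;
}
```

The loop runs at most eight times for a 16‑bit input. For a fixed‑point input with 2n fraction bits the result carries n fraction bits, so the same routine can serve a Q‑format value once the input has been scaled accordingly.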

How do I implement a fast 1/x using Newton–Raphson in fixed‑point?

1. Get an initial guess y0 from a small LUT or bit trick.
2. Iterate y_{k+1} = y_k(2 − x·y_k), rescaled in Q‑format.
3. Clamp/saturate to handle near‑zero inputs.

This converges quadratically with a good initial guess [Newton's method]. AVR MUL helps, as integer multiplies are fast [AVR Instruction Set Manual].
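
The three steps might be sketched in C as follows, under several assumptions that are not from the thread: unsigned 16‑bit integer operands, the divisor normalized into [0.5, 1), a linear first guess 3 − 2m instead of a LUT, the reciprocal held in Q2.14, and a fixed three iterations. The name `udiv16_nr` is hypothetical. The result is an approximation; for large quotients the last bit or two can differ from the exact integer quotient.

```c
#include <stdint.h>

/* Approximate n / d using a Newton-Raphson reciprocal (no divide). */
static uint16_t udiv16_nr(uint16_t n, uint16_t d)
{
    if (d == 0)
        return UINT16_MAX;                     /* step 3: saturate */

    /* Normalise d so that m / 65536 lies in [0.5, 1). */
    uint16_t m = d;
    uint8_t  s = 0;
    while ((m & 0x8000u) == 0) { m <<= 1; s++; }

    /* Step 1: linear first guess r0 = 3 - 2m, held in Q2.14. */
    uint16_t r = (uint16_t)(49152u - (m >> 1));

    /* Step 2: r <- r * (2 - m*r), three passes. */
    for (uint8_t i = 0; i < 3; i++) {
        uint16_t t = (uint16_t)(((uint32_t)m * r) >> 16);  /* m*r in Q2.14 */
        uint16_t c = (uint16_t)(32768u - t);               /* 2 - m*r      */
        r = (uint16_t)(((uint32_t)r * c) >> 14);
    }

    /* 1/d = r * 2^s / 2^30, so n/d is about (n * r) >> (30 - s). */
    return (uint16_t)(((uint32_t)n * r) >> (30 - s));
}
```

Each pass costs two 16×16 multiplies, which on AVR are built from several 8×8 MUL instructions. Whether this beats a plain software divide on a given part would have to be measured.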

Can I replace sin/cos with faster approximations on AVR?

Yes. For small angles, use a Taylor series truncated to a few terms, using only adds and multiplies [Elektroda, Anonymous, post #21621285]. For broader ranges, use LUTs with piecewise linear interpolation to meet your error budget [Elektroda, Anonymous, post #21621295]. Eight‑bit precision can be ~1.4% error in sine/cosine per the thread [Elektroda, Anonymous, post #21621293].

Is CORDIC worth it for trig on 8‑bit AVR?

CORDIC uses only shifts and adds, making it attractive when MUL is costly or absent. AVR has fast MUL, but CORDIC still avoids float and can be tuned for precision [CORDIC]. You must do range reduction and manage scaling factors, or accuracy suffers [CORDIC]. Consider LUT+interpolation if memory allows [Elektroda, Anonymous, post #21621295].
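
A minimal rotation‑mode CORDIC sketch in C, to show the shift‑and‑add structure. The Q14 scaling (1.0 == 16384, angle in radians), the 14 iterations, and the names `sincos_q14`, `atan_q14` and `cordic_sincos_q14` are assumptions for illustration. The input must already be range‑reduced to −π/2..+π/2, and the code relies on arithmetic right shift of negative values, which GCC provides.

```c
#include <stdint.h>

typedef struct { int16_t s, c; } sincos_q14;   /* sine and cosine, Q14 */

/* atan(2^-i) in Q14 radians for i = 0..13 */
static const int16_t atan_q14[14] = {
    12868, 7596, 4014, 2037, 1023, 512, 256, 128, 64, 32, 16, 8, 4, 2
};

/* z: angle in Q14 radians, |z| <= pi/2 (about 25736). */
static sincos_q14 cordic_sincos_q14(int16_t z)
{
    int16_t x = 9949;      /* 0.60725 in Q14: pre-compensates the CORDIC gain */
    int16_t y = 0;

    for (uint8_t i = 0; i < 14; i++) {
        int16_t dx = x >> i;
        int16_t dy = y >> i;
        if (z >= 0) { x -= dy; y += dx; z -= atan_q14[i]; }  /* rotate +atan */
        else        { x += dy; y -= dx; z += atan_q14[i]; }  /* rotate -atan */
    }

    sincos_q14 r = { y, x };
    return r;
}
```

For 30 degrees (0.5236 rad, about 8579 in Q14) the sine comes out near 8192 and the cosine near 14189. Truncation in the shifts costs some accuracy in the low bits, so the usable precision is somewhat below the full 14 bits.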

How big should lookup tables be for trig or sqrt, and what's the tradeoff?

Match table resolution to required precision. An 8‑bit in/out table costs 256 bytes; a 12‑bit mapping costs ~1536 bytes [Elektroda, Anonymous, post #21621294]. You can reduce size using nonuniform breakpoints with linear interpolation between entries, saving memory with minimal extra math [Elektroda, Anonymous, post #21621295].

Which fixed‑point format should I use (Q8.8 vs Q1.15)?

Q8.8 is simple and integrates with 16‑bit math on AVR, with tested routines available [Elektroda, Anonymous, post #21621291]. Q1.15 gives higher fractional precision for values in −1..1, useful for normalized vectors and filters [Math Toolkit for Real-Time Programming]. Choose the format that fits your signal ranges and avoids overflow [Math Toolkit for Real-Time Programming].

Any ready‑made fixed‑point math libraries for AVR?

Yes. See avrfix on SourceForge for fixed‑point fractional functions [Elektroda, Anonymous, post #21621289]. Cornell’s ECE4760 page publishes Q8.8 multiply, divide, reciprocal, and sqrt code for AVR‑GCC [Elektroda, Anonymous, #21621291; Elektroda, Anonymous, #21621292]. These save you from writing assembly and offer known cycle counts [Elektroda, Anonymous, post #21621291].

How many cycles does 16‑bit fixed‑point multiply/divide take on AVR?

One published Q8.8 implementation measured multiply ≈ 40 cycles and divide ≈ 360 cycles worst‑case [Elektroda, Anonymous, post #21621291]. These avoid software floating‑point overhead while using hardware MUL [AVR Instruction Set Manual]. Use them to size control‑loop budgets alongside sensor and I/O costs [Elektroda, Anonymous, post #21621291].

Common pitfalls with Newton–Raphson for reciprocal or sqrt on fixed‑point?

Poor initial guesses can slow convergence or diverge; clamp domains and precondition inputs [Newton's method]. Scale to avoid overflow during intermediate multiplies, especially for values near zero [Math Toolkit for Real-Time Programming]. Saturate outputs and check for division by zero before iteration [Math Toolkit for Real-Time Programming].

Do AVR 8‑bit MCUs have any hardware help for division?

No. Classic 8‑bit AVR provides an 8×8 hardware multiplier but no divide instruction and no FPU [AVR Instruction Set Manual]. Therefore, C float arithmetic compiles into calls to software routines such as __divsf3 and __mulsf3, and sqrt is a library function, all supplied by libgcc/avr‑libc [avr-libc User Manual].

Where can I learn more math tricks for small micros?

See HAKMEM for classic bit‑level and numeric hacks [HAKMEM]. Jack Crenshaw’s “Math Toolkit for Real‑Time Programming” dives into fixed‑point and numerics for embedded systems [Elektroda, Anonymous, post #21621290]. The Cornell ECE4760 math page includes AVR‑GCC examples with cycle notes [Elektroda, Anonymous, #21621291; Elektroda, Anonymous, #21621292].
