Faster copy (struct copy will be done using an optimized inline memcpy)
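A minimal sketch of this point, using a made-up struct type: the plain assignment in the loop can be lowered by the compiler to an optimized inline memcpy, so there is no need to call memcpy by hand.

    struct point { double x, y, z; };      /* hypothetical struct type */

    void copy_points(struct point *dst, const struct point *src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src[i];   /* plain struct assignment; typically compiled
                                  into an optimized inline memcpy */
    }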
Cheaper arithmetic
Avoid divisions by creating a table of 1/j and multiplying by 1/j instead (see the sketch after this list)
Unsigned division is cheaper than signed division
Modulo can often be transformed into if-statements
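A small sketch of two of the ideas above, with made-up names and sizes; whether these rewrites pay off depends on the compiler and the target:

    #define N 1024

    static double inv[N];                 /* inv[j] = 1.0 / j, precomputed once */

    void init_inv(void)
    {
        for (int j = 1; j < N; j++)
            inv[j] = 1.0 / j;
    }

    double scale(double x, int j)         /* assumes 1 <= j < N */
    {
        return x * inv[j];                /* multiply by 1/j instead of dividing by j */
    }

    /* i = (i + 1) % n rewritten as an if-statement, avoiding an integer division */
    unsigned next_index(unsigned i, unsigned n)
    {
        i++;
        if (i >= n)
            i = 0;
        return i;
    }

The index function uses unsigned types; where a division or modulo cannot be removed, unsigned operands are generally cheaper than signed ones.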
Fewer operations
Calculations in loops
Avoid casting
Remove if statements from loops if they only apply to one particular iteration (see the sketch after this list)
Use a small number of function arguments so they can be passed in registers instead of on the stack
Inline functions
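A sketch of two of the points above, using made-up functions: the loop-invariant sqrt(scale) is hoisted out of the loop, and the if-statement that only applies to i == 0 is moved out of the loop body.

    #include <math.h>

    /* Before: sqrt(scale) is recomputed and the if-statement is evaluated on
       every iteration, even though it only matters when i == 0. */
    void process_slow(double *a, int n, double scale)
    {
        for (int i = 0; i < n; i++) {
            if (i == 0)
                a[i] = 0.0;
            else
                a[i] *= sqrt(scale);
        }
    }

    /* After: the calculation is hoisted out of the loop and the special
       iteration is handled once, before the loop. */
    void process_fast(double *a, int n, double scale)
    {
        double s = sqrt(scale);           /* loop-invariant, computed once */
        if (n > 0)
            a[0] = 0.0;
        for (int i = 1; i < n; i++)
            a[i] *= s;
    }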
Parallelism
Using the restrict keyword (see the sketch after this list)
Double initialization
Useless initialization
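A minimal sketch of restrict (function and parameter names are made up): the qualifier promises the compiler that dst and src never alias, so it does not have to reload src after every store to dst and can vectorize the loop more aggressively.

    /* C99: restrict tells the compiler the two arrays do not overlap. */
    void add_arrays(float *restrict dst, const float *restrict src, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] += src[i];
    }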
Algorithms and Data Structures
Algorithms optimized for speed may consume more memory due to precomputed data structures or caching
Conversely, algorithms optimized for memory may have to compute values "on the fly" to avoid storing them in a lookup table (see the sketch below).
This tradeoff is not trivial, as the theoretically best algorithm may be slower in practice due to how the program interacts with memory!
EXAMPLE: Sorting
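As a hypothetical illustration of the speed/memory tradeoff (separate from the sorting example): counting set bits in a byte either from a precomputed lookup table or on the fly.

    #include <stdint.h>

    /* Speed-optimized: a 256-byte lookup table, filled once at startup. */
    static uint8_t popcount_table[256];

    void init_popcount_table(void)
    {
        for (int v = 0; v < 256; v++) {
            int c = 0;
            for (int b = v; b != 0; b >>= 1)
                c += b & 1;
            popcount_table[v] = (uint8_t)c;
        }
    }

    int popcount_fast(uint8_t v)  { return popcount_table[v]; }

    /* Memory-optimized: no table, the value is computed on the fly. */
    int popcount_small(uint8_t v)
    {
        int c = 0;
        for (; v != 0; v >>= 1)
            c += v & 1;
        return c;
    }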
Cache utilization
Code optimized for speed must exploit the CPU cache effectively to get faster execution times. However, this is not trivial and is platform dependent.
Memory hierarchy
Cache access
Data cache vs text (instruction) cache
Cache models and levels
Spatial locality vs temporal locality
Instructions and data are normally not accessed in a random order; the access pattern follows loops, subroutines, data structures, arrays, etc. (locality is illustrated in the sketch after this list)
Stack vs Heap performance
Virtual memory
Cost effective?
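A small sketch of spatial locality: a C 2D array is stored row by row, so the row-major traversal below touches consecutive addresses (and therefore consecutive cache lines), while the column-major traversal jumps through memory.

    #define ROWS 1024
    #define COLS 1024

    /* Good spatial locality: the innermost loop walks consecutive addresses. */
    long sum_row_major(int m[ROWS][COLS])
    {
        long sum = 0;
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                sum += m[i][j];
        return sum;
    }

    /* Poor spatial locality: consecutive accesses are COLS * sizeof(int) bytes apart. */
    long sum_col_major(int m[ROWS][COLS])
    {
        long sum = 0;
        for (int j = 0; j < COLS; j++)
            for (int i = 0; i < ROWS; i++)
                sum += m[i][j];
        return sum;
    }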
Concurrency
Keep the critical section small (see the sketch after this list)
Spin-lock vs semaphore
Lock-free
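A sketch of keeping the critical section small with a POSIX mutex (the shared counter and expensive_work are made up): the expensive computation is done outside the lock, and only the shared update is protected.

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_total = 0;

    extern long expensive_work(int item);    /* hypothetical helper */

    void process_item(int item)
    {
        long result = expensive_work(item);  /* expensive part, outside the lock */

        pthread_mutex_lock(&lock);           /* critical section kept as small as possible */
        shared_total += result;
        pthread_mutex_unlock(&lock);
    }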
Distributed compute
Big data
MapReduce
SIMD
I/O
Misc
Loop unrolling
Pointer arithmetic
Compiler optimizations
Bitfields
Bitmaps
Register allocation
Floating point arithmetic
BigInt or 128-bit integers on a 64-bit architecture
Inlining
Branchless programming (Pipelining)
Control hazards
Branches can lead to many stalls in the pipeline. Normally the branch is predicted, and the pipeline stalls only if the prediction turns out to be the wrong path. It is much better if we can avoid the control statement altogether!
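A small sketch of removing a branch: clamping negative values to zero with a bit mask instead of an if-statement, so there is no branch for the CPU to predict. The mask trick assumes an arithmetic right shift of negative values, which is the common behaviour but implementation-defined in C.

    #include <stdint.h>

    /* Branching version: the CPU has to predict the if-statement. */
    int32_t clamp_branch(int32_t x)
    {
        if (x < 0)
            return 0;
        return x;
    }

    /* Branchless version: x >> 31 is 0 for non-negative x and all ones for
       negative x (assuming arithmetic shift), so the mask zeroes out negatives. */
    int32_t clamp_branchless(int32_t x)
    {
        return x & ~(x >> 31);
    }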
Alignment
RISC vs CISC
CISC => Requires fewer instructions, but these instructions need a complex implementation, which can make all instructions slow, even the simple ones.
Compilers are often good at making fast code with simple instructions.
The longest delay in the circuit determines the clock speed.
Handcoding assembly for performance?
Sometimes, but at the cost of development efficiency; more lines of code also increase the number of bugs