Memory hierarchy
The memory is structured as a hierarchy. At the top we have the CPU registers; traversing down, we move further away from the CPU and the access times get slower.
Here is a table showing typical access times for each type of memory.
Access time | Type |
---|---|
1-2 ns | CPU registers |
3-10 ns | L1 cache |
25-50 ns | L2 cache |
30-90 ns | Main memory |
5-20 ms | Hard drive |
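The effect of the hierarchy can be observed even from Python: traversing a NumPy array along its contiguous (row-major) axis typically beats strided column access, purely because of cache behavior. A rough, machine-dependent sketch (the array size is an arbitrary choice):

```python
import time
import numpy as np

n = 2000
a = np.ones((n, n))  # C-order: rows are contiguous in memory

# Row traversal reads contiguous memory: cache lines are fully used.
t0 = time.perf_counter()
row_sum = sum(a[i].sum() for i in range(n))
t_rows = time.perf_counter() - t0

# Column traversal jumps n * 8 bytes between elements: more cache misses.
t0 = time.perf_counter()
col_sum = sum(a[:, j].sum() for j in range(n))
t_cols = time.perf_counter() - t0

print(row_sum == col_sum, t_rows, t_cols)
```

Both loops compute the same sum; on most machines the row traversal is noticeably faster.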
Parallel programming models
Multi-threaded programming (MT)
Message Passing Interface (MPI)
Map-reduce (MR)
- distributed hash table
Spark (SP)
Resilient Distributed Datasets (RDDs)
- lineage
Hadoop
Hadoop Distributed File System (HDFS)
RAID
https://en.wikipedia.org/wiki/RAID
Redundant array of inexpensive disks (RAID) is a collection of methods for combining multiple disks into one or more logical units for performance, redundancy, or both.
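The redundancy idea behind parity-based RAID levels such as RAID 5 can be sketched in a few lines: the parity block is the XOR of the data blocks, so any single lost disk can be reconstructed from the survivors. A minimal illustration with made-up byte strings:

```python
# Three "disks" holding one stripe of data each (hypothetical contents).
disks = [b"\x01\x02", b"\x0f\x10", b"\xaa\xbb"]

# Parity block: bytewise XOR of all data blocks.
parity = bytes(a ^ b ^ c for a, b, c in zip(*disks))

# Simulate losing one disk; XOR of the survivors and the parity recovers it.
lost = 1
survivors = [d for i, d in enumerate(disks) if i != lost] + [parity]
recovered = bytes(x ^ y ^ z for x, y, z in zip(*survivors))
assert recovered == disks[lost]
```

This only tolerates a single disk failure per stripe; other RAID levels trade capacity for more redundancy or more performance.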
Bloom filters
Bloom filters are a probabilistic data structure for membership queries. There are two operations: inserting an item x and querying an item x. If x is present in the bloom filter, the query will always be answered correctly. However, there is a probability p that a query is answered positively even though the item is not present in the bloom filter. Thus, there are no false negatives, but false positives do occur. The filter uses k different hash functions. They determine which bits to set when inserting an item x, and which bits to check when querying an item x.
The filter itself is a bit array of m bits, for example:

0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0
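The two operations can be sketched with a from-scratch filter. This is a minimal illustration, not production code (real code would use a library such as pybloom); deriving the k positions from salted SHA-256 digests is one simple choice among many:

```python
import hashlib

class Bloom:
    """Minimal Bloom filter: m bits, k hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # True if all k bits are set (may be a false positive).
        return all(self.bits[pos] for pos in self._positions(item))


b = Bloom(m=1024, k=4)
b.add("hello")
print("hello" in b)  # always True: no false negatives
```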
If we assume uniform hashing and that n items have been inserted into a filter of m bits, then the probability that all k positions are already set when querying an item that was never inserted (a false positive) is

p = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k

Setting k to

k = (m/n) ln 2

will minimize the error probability. The following properties hold for optimal bloom filters: the false-positive rate becomes p ≈ 0.6185^(m/n), and a target rate p requires about m/n = -log2(p)/ln 2 ≈ 1.44 log2(1/p) bits per item.
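As a sanity check, the formulas can be evaluated numerically (n and p below are example values, not from the text):

```python
import math

n, p = 10_000, 0.01  # expected number of items, target error rate

# Bits needed for an optimal filter: m = -n ln p / (ln 2)^2
m = math.ceil(-n * math.log(p) / math.log(2) ** 2)
# Optimal number of hash functions: k = (m / n) ln 2
k = round(m / n * math.log(2))

# Plug back into p ≈ (1 - e^(-kn/m))^k to verify the target rate.
p_est = (1 - math.exp(-k * n / m)) ** k
print(m, k, round(p_est, 4))  # → 95851 7 0.01
```

So a 1% error rate costs roughly 9.6 bits per item with 7 hash functions.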
Bloom filters can be multi-threaded with some care. Counting item frequencies in a stream X, where many items occur only once, can be done approximately in the following way with a bloom filter:

```python
from pybloom import BloomFilter
from collections import defaultdict

# X: input items, p: target error rate, tau: frequency threshold
f = BloomFilter(capacity=len(X), error_rate=p)
d = defaultdict(int)
for x in X:
    if x in f:
        # Seen before (or a false positive): count it exactly from now on.
        d[x] += 1
    else:
        f.add(x)

for x in d.keys():
    # d[x] misses the first occurrence, hence the + 1.
    if (d[x] + 1) > tau:
        print(x, d[x] + 1)
```

Only items seen more than once get an entry in d, so the dictionary stays small when most items are unique; false positives in the filter can inflate a count by one.
Cache behavior
Tries
Optimization ideas
If the current solution is too slow we might try out some general ideas to make it faster:
- Change the problem
- Refactor
  - Generally, if we are using Python, the shorter the program is, the faster it is. Try to simplify and remove unnecessary code.
  - Use NumPy whenever you can.
- Compile
  - Cython
  - Numba
- Optimize
  - Measure running time, identify hot spots, and choose better data structures. Constants may matter.
  - Think about the memory hierarchy. The theoretical computational complexity is often misleading on today's computers.
- Parallelize
  - Based on the structure of the workload: data flow, computational effort, hardware, ...
  - Measure the serial and parallel parts if that is possible. Use Amdahl's law to calculate the theoretical speedup.
  - Make use of SIMD instructions.
- Choose a suitable multiprogramming paradigm:
  - Multi-threading
  - Message passing
  - Map-reduce
  - Spark
  - ...
  - Use specialized frameworks that suit the problem in question.
  - Avoid oversubscription.
- Scale the solution by:
  - Dedicated hardware: GPU, TPU
  - Clusters
  - Clouds
If the data is too big, you need to distribute it as well as the computation. Move the data to e.g.:
- RAID with SSDs
- A cluster with local disks and HDFS
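Amdahl's law mentioned in the list above is simple enough to sketch directly: if a fraction s of the work is inherently serial, the speedup on N workers is 1 / (s + (1 - s) / N), and it can never exceed 1/s.

```python
def amdahl_speedup(serial_fraction, n_workers):
    """Theoretical speedup when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_workers)

# With 5% serial work, 8 workers give about 5.9x ...
print(amdahl_speedup(0.05, 8))       # ≈ 5.93
# ... and even unlimited workers cannot beat the 1 / 0.05 = 20x ceiling.
print(amdahl_speedup(0.05, 10**9))   # approaches 20
```

This is why measuring the serial part first matters: it bounds what parallelization can ever buy you.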