
This is the README file for my program "bandwidth".

Bandwidth is a benchmark that attempts to measure
memory bandwidth. 

Bandwidth is useful because memory bandwidth needs to
be measured to give you a clear idea of what your
computer is capable of. Relying on specs alone does
not provide a full picture, as specs can be misleading.

--------------------------------------------------

My program "bandwidth" performs sequential and random
reads and writes of varying sizes. This permits
you to infer from the graph how each type of memory
is performing. For instance, when bandwidth
writes a 256-byte chunk, you know that because
caches are normally write-back, this chunk
will reside entirely in the L1 cache, whereas
a 512 kB chunk will mainly reside in L2.
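
As a rough sketch of the idea only, and not the code that
bandwidth itself uses (the real tests are custom assembly
routines), the following C function times sequential reads
of one chunk size. The function name, its parameters and
the use of C11 timespec_get() are assumptions made for this
illustration. A megabyte here means 1,000,000 bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

/* Hypothetical helper (not from bandwidth's source): reads a chunk of
 * chunk_bytes sequentially, `loops` times, and returns MB per second.
 * Returns -1.0 on bad input, allocation failure or a zero time delta. */
double sequential_read_mb_per_sec(size_t chunk_bytes, unsigned long loops)
{
    size_t words = chunk_bytes / sizeof(uint64_t);
    if (words == 0 || loops == 0)
        return -1.0;

    uint64_t *chunk = malloc(words * sizeof(uint64_t));
    if (chunk == NULL)
        return -1.0;
    for (size_t i = 0; i < words; i++)
        chunk[i] = i;   /* touch every word so the pages are mapped */

    /* Reading through a volatile pointer stops the compiler from
     * removing or hoisting the loads. */
    const volatile uint64_t *p = chunk;
    uint64_t sum = 0;

    struct timespec start, end;
    timespec_get(&start, TIME_UTC);   /* wall-clock time, C11 */
    for (unsigned long l = 0; l < loops; l++)
        for (size_t i = 0; i < words; i++)
            sum += p[i];
    timespec_get(&end, TIME_UTC);
    (void)sum;

    free(chunk);

    double seconds = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    if (seconds <= 0.0)
        return -1.0;

    double megabytes = (double)words * sizeof(uint64_t) * (double)loops / 1e6;
    return megabytes / seconds;
}
```

Calling it once with a small size such as 256 and once with
a much larger size, then comparing the two results, gives a
crude version of what the graph shows. Plain C loops like
this will not reach the rates that the assembly routines do.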

You could run a non-artificial benchmark and
observe that a general performance number is lower
on one machine or higher on another, but that might
conceal the cause.

So the purpose of this program is to help you 
home in on the cause of good or bad system
performance.

It also shows you the best-case scenario: the
maximum bandwidth achieved using sequential
memory accesses is typically the ideal figure.

Release 1.10:
	- ARM 64 support, ARM 32 refinements. Apple M1 support.
Release 1.9:
	- More object-oriented improvements. Fixed Windows 64-bit support. Removed Linux framebuffer test.
Release 1.8:
	- More object-oriented improvements. Windows 64-bit supported.
Release 1.7:
	- Separated object-oriented C (OOC) from bandwidth app.
Release 1.6:
	- Converted the code to my conception of object-oriented C.
Release 1.5:
	- Fixed AVX bug. Added --nice mode and CPU temperature monitoring (OS/X only).
Release 1.4:
	- Added randomized 256-bit AVX reader & writer tests (Intel64 only).
Release 1.3:
	- Added CSV output. Updated ARM code for Raspberry π 3.
Release 1.2:
	- Put 32-bit ARM code back in.
Release 1.1:
	- Added larger font.
Release 1.0:
	- Moved graphing into BMPGraphing module.
	- Finally added LODS benchmarking, which
	  proves how badly lodsb/lodsw/lodsd/lodsq
	  perform.
	- Added switches --faster and --fastest.
Release 0.32:
	- Improved AVX support.
Release 0.31:
	- Added cache detection for Intel 32-bit CPUs.
	- Added a little AVX support.
	- Fixed vector-to/from-main transfer bugs.
Release 0.30 added cache detection for Intel 64-bit CPUs.
Release 0.29 improved graph granularity with more
	128-byte tests and removed ARM support.
Release 0.28 added a proper test of CPU features e.g. SSE 4.1.
Release 0.27 added finer-granularity 128-byte tests.
Release 0.26 fixed an issue with AMD processors.
Release 0.25 made network bandwidth bidirectional.
Release 0.24 added network bandwidth testing.

Release 0.23 added:
	- Mac OS/X 64-bit support.
	- Vector-to-vector register transfer test.
	- Main register to/from vector register transfer test.
	- Main register byte/word/dword/qword to/from 
	  vector register test (pinsr*, pextr* instructions).
	- Memory copy test using SSE2.
	- Automatic checks under Linux for SSE2 & SSE4.

Release 0.22 added:
	- Register-to-register transfer test.
	- Register-to/from-stack transfer tests.

Release 0.21 added:
	- Standardized memory chunks to always be
	  a multiple of 256-byte mini-chunks.
	- Random memory accesses, in which the
	  256-byte mini-chunks are accessed
	  in a random order and, inside each
	  mini-chunk, the 32/64/128-bit data are
	  accessed pseudo-randomly as well.
	- Now 'bandwidth' includes chunk sizes that 
	  are not powers of 2, which increases 
	  data points around the key chunk sizes 
	  corresponding to common L1 and L2 cache 
	  sizes.
	- Command-line options:
		--fast for 0.25 seconds per test.
		--slow for 20 seconds per test.
		--title for adding a graph title.
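
The random ordering of 256-byte mini-chunks described above
can be sketched in C roughly as follows. This is only an
illustration under assumptions: the function names, the
xorshift generator and the Fisher-Yates shuffle are choices
made here and are not taken from bandwidth's source. For
brevity the sketch reads each mini-chunk front to back,
whereas the real test also randomizes accesses inside it.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MINI_CHUNK_BYTES 256

/* Small xorshift generator so the sketch does not depend on rand(). */
static uint32_t next_random(uint32_t *state)
{
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Hypothetical helper: returns a shuffled list of mini-chunk indices
 * for a chunk whose size is a multiple of 256 bytes, or NULL on error.
 * The caller frees the result. */
size_t *shuffled_mini_chunk_order(size_t chunk_bytes, uint32_t seed,
                                  size_t *count_out)
{
    size_t count = chunk_bytes / MINI_CHUNK_BYTES;
    *count_out = 0;
    if (count == 0 || chunk_bytes % MINI_CHUNK_BYTES != 0)
        return NULL;

    size_t *order = malloc(count * sizeof *order);
    if (order == NULL)
        return NULL;
    for (size_t i = 0; i < count; i++)
        order[i] = i;

    /* Fisher-Yates shuffle; the slight modulo bias is ignored here. */
    uint32_t state = seed ? seed : 1u;   /* xorshift state must be nonzero */
    for (size_t i = count - 1; i > 0; i--) {
        size_t j = next_random(&state) % (i + 1);
        size_t tmp = order[i];
        order[i] = order[j];
        order[j] = tmp;
    }

    *count_out = count;
    return order;
}

/* Visits the mini-chunks in the given order and sums their 64-bit words. */
uint64_t read_in_order(const uint64_t *chunk, const size_t *order,
                       size_t count)
{
    const size_t words = MINI_CHUNK_BYTES / sizeof(uint64_t);
    uint64_t sum = 0;
    for (size_t k = 0; k < count; k++) {
        const uint64_t *mini = chunk + order[k] * words;
        for (size_t w = 0; w < words; w++)
            sum += mini[w];
    }
    return sum;
}
```

Because every mini-chunk is visited exactly once, the sum
matches a plain sequential pass over the same data, which
makes the ordering easy to check.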

Release 0.20 added graphing, with the graph
stored in a BMP image file. It also added the
--slow option for more precise runs.

Release 0.19 added a second 128-bit SSE writer
routine that bypasses the caches, in addition
to the one that doesn't.

Release 0.18 was my Grand Unified bandwidth
benchmark that brought together support for
four operating systems:
	- Linux
	- Windows Mobile
	- 32-bit Windows
	- Mac OS/X 64-bit
and two processor architectures:
	- x86
	- Intel64
I've written custom assembly routines for
each architecture.

Total run time for the default speed, which
has 5 seconds per test, is about 30 minutes.

--------------------------------------------------
This program is provided without any warranty
and AS-IS. See the file COPYING for details.

Zack Smith
1@zsmith.co
June 2019

