Similar to `LongAdder` in Java or `ThreadCachedInt` in folly. In scenarios with highly concurrent writes but few reads, it can provide dozens of times the write performance of plain atomic operations.

The benchmark results below are measured per 100 calls.
```
goos: linux
goarch: amd64
pkg: github.com/chen3feng/atomiccounter
cpu: Intel(R) Xeon(R) Gold 6133 CPU @ 2.50GHz
BenchmarkNonAtomicAdd-16    9508723    135.3 ns/op
BenchmarkAtomicAdd-16        582798     2070 ns/op
BenchmarkCounter-16         4748263    263.1 ns/op
```
Under macOS with an M1 Pro:
```
goos: darwin
goarch: arm64
pkg: github.com/chen3feng/atomiccounter
BenchmarkNonAtomicAdd-10    47337121    22.14 ns/op
BenchmarkAtomicAdd-10         180942     6861 ns/op
BenchmarkCounter-10         14871549    81.02 ns/op
```
From top to bottom: the write cost of non-atomic (and therefore unsafe) addition, atomic addition, and `atomiccounter`. Under highly concurrent writes, `atomiccounter` is only a few times slower than non-atomic addition, but much faster than atomic addition.
But reads are much slower:
```
goos: darwin
goarch: arm64
pkg: github.com/chen3feng/atomiccounter
BenchmarkNonAtomicRead-10    1000000000    0.3112 ns/op
BenchmarkAtomicRead-10       1000000000    0.5336 ns/op
BenchmarkCounterRead-10        54609476     21.20 ns/op
```
In addition, each `atomiccounter.Int64` object consumes 8 KB of memory, so use it only in a small number of scenarios with a large number of concurrent writes but few reads, such as counting the number of requests.
Contention over shared memory is one of the biggest performance killers in multi-core programs. For counters with a large number of writes, performance suffers severely if an ordinary atomic variable is used.
In scenarios with few reads, a standard solution is to spread the writes across different variables and accumulate them on read. Java's `LongAdder`, folly's `ThreadCachedInt`, and per-CPU variables in the Linux kernel all use this method; the implementation details differ, but the idea is the same.
There was no well-known implementation for this purpose in Go, so I implemented this library.
To reduce the memory footprint, multiple `Int64` objects may share the same memory chunk.
An int64 array whose size is an integer multiple of the CPU cache line size forms a cell; a group of cells forms a chunk. Because each cell is cache-line-aligned in size, and its first and last fields are padded with cache-line-sized blanks, false sharing is avoided.
The `chunk.lastIndex` member records the next unused index, used when allocating `Int64` objects. An `Int64` object contains two fields, the chunk pointer and the index within the cell, so multiple `Int64` objects can share the same chunk while each accesses a different index in every cell.
The address of the most recently created chunk is recorded in the global variable `lastChunk`. When an `Int64` object is created, `lastIndex` is incremented; if it reaches the number of int64 slots in a cell, the chunk is fully allocated and a new chunk must be created.
Please first understand Go’s GMP scheduling model.
The best performance would come from obtaining the index of the current `M` in Go and directly accessing the corresponding cell, so that different `M`s never conflict and atomic operations could even be avoided.
But I haven’t found a way to get the index of the current `M`. Therefore, this implementation uses a hash of the current `M`’s address as the subscript to access the cell, and the measured performance is still quite good.
As long as the number of cells in each chunk is larger than the typical number of CPU cores, the impact of hash collisions is small, so different `M`s will access different cells with high probability.
When increasing the value of an `Int64` object, the hash of the current `M`’s address is used as the subscript to select the corresponding cell in the chunk; the `Int64.index` member is then used as the subscript into that cell’s int64 array.
When reading, traverse all cells, accumulating the value at `Int64.index` in each cell’s array, and return the sum.