mirror of https://github.com/Oxalide/vsphere-influxdb-go.git synced 2023-10-10 11:36:51 +00:00

add vendoring with go dep

This commit is contained in:
Adrian Todorov
2017-10-25 20:52:40 +00:00
parent 704f4d20d1
commit a59409f16b
1627 changed files with 489673 additions and 0 deletions

View File

@@ -0,0 +1,9 @@
// Package engine can be imported to initialize and register all available TSDB engines.
//
// Alternatively, you can import any individual subpackage underneath engine.
package engine // import "github.com/influxdata/influxdb/tsdb/engine"
import (
// Initialize and register tsm1 engine
_ "github.com/influxdata/influxdb/tsdb/engine/tsm1"
)

View File

@@ -0,0 +1,451 @@
# File Structure
A TSM file is composed of four sections: header, blocks, index, and footer.
```
┌────────┬────────────────────────────────────┬─────────────┬──────────────┐
│ Header │ Blocks │ Index │ Footer │
│5 bytes │ N bytes │ N bytes │ 4 bytes │
└────────┴────────────────────────────────────┴─────────────┴──────────────┘
```
Header is composed of a magic number to identify the file type and a version number.
```
┌───────────────────┐
│ Header │
├─────────┬─────────┤
│ Magic │ Version │
│ 4 bytes │ 1 byte │
└─────────┴─────────┘
```
Blocks are sequences of block CRC32 and data. The block data is opaque to the file. The CRC32 is used for recovery to ensure blocks have not been corrupted due to bugs outside of our control. The length of the blocks is stored in the index.
```
┌───────────────────────────────────────────────────────────┐
│ Blocks │
├───────────────────┬───────────────────┬───────────────────┤
│ Block 1 │ Block 2 │ Block N │
├─────────┬─────────┼─────────┬─────────┼─────────┬─────────┤
│ CRC │ Data │ CRC │ Data │ CRC │ Data │
│ 4 bytes │ N bytes │ 4 bytes │ N bytes │ 4 bytes │ N bytes │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
```
Following the blocks is the index for the blocks in the file. The index is composed of a sequence of index entries ordered lexicographically by key and then by time. Each index entry starts with a key length and key followed by a count of the number of blocks in the file. Each block entry is composed of the min and max time for the block, the offset into the file where the block is located and the size of the block.
The index structure can provide efficient access to all blocks as well as the ability to determine the cost associated with accessing a given key. Given a key and timestamp, we know exactly which file contains the block for that timestamp as well as where that block resides and how much data to read to retrieve the block. If we know we need to read all or multiple blocks in a file, we can use the size to determine how much to read in a given IO.
_TBD: The block length stored in the block data could probably be dropped since we store it in the index._
```
┌────────────────────────────────────────────────────────────────────────────┐
│ Index │
├─────────┬─────────┬──────┬───────┬─────────┬─────────┬────────┬────────┬───┤
│ Key Len │ Key │ Type │ Count │Min Time │Max Time │ Offset │ Size │...│
│ 2 bytes │ N bytes │1 byte│2 bytes│ 8 bytes │ 8 bytes │8 bytes │4 bytes │ │
└─────────┴─────────┴──────┴───────┴─────────┴─────────┴────────┴────────┴───┘
```
The last section is the footer that stores the offset of the start of the index.
```
┌─────────┐
│ Footer │
├─────────┤
│Index Ofs│
│ 8 bytes │
└─────────┘
```
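As a sketch, assuming the offset is stored big-endian (the diagram does not specify a byte order), a reader can locate the index by reading the final 8 bytes of the file:
```
import (
	"encoding/binary"
	"errors"
	"os"
)

// indexOffset reads the footer and returns the position of the index section.
func indexOffset(f *os.File) (int64, error) {
	fi, err := f.Stat()
	if err != nil {
		return 0, err
	}
	if fi.Size() < 8 {
		return 0, errors.New("file too small to contain a footer")
	}
	var footer [8]byte
	if _, err := f.ReadAt(footer[:], fi.Size()-8); err != nil {
		return 0, err
	}
	return int64(binary.BigEndian.Uint64(footer[:])), nil
}
```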
# File System Layout
The file system is organized as a directory per shard, where each shard is an integer number. Associated with each shard directory, there is a set of other directories and files:
* a wal directory - contains a set of numerically increasing WAL segment files named #####.wal. The wal directory is separate from the directory containing the TSM files so that different storage types can be used if necessary.
* .tsm files - a set of numerically increasing TSM files containing compressed series data.
* .tombstone files - files named after the corresponding TSM file as #####.tombstone. These contain measurement and series keys that have been deleted. These files are removed during compactions.
# Data Flow
Writes are appended to the current WAL segment and are also added to the Cache. Each WAL segment is size bounded and rolls over to a new file after it fills up. The cache is also size bounded; snapshots are taken and WAL compactions are initiated when the cache becomes too full. If the inbound write rate exceeds the WAL compaction rate for a sustained period, the cache may become too full, in which case new writes will fail until the compaction process catches up. The WAL and Cache are separate entities and do not interact with each other. The Engine coordinates writes to both.
When WAL segments fill up and have been closed, the Compactor reads the WAL entries and combines them with one or more existing TSM files. This process runs continuously until all WAL files are compacted and there is a minimum number of TSM files. As each TSM file is completed, it is loaded and referenced by the FileStore.
Queries are executed by constructing Cursors for keys. The Cursors iterate over slices of Values. When the current Values are exhausted, a Cursor requests the next set of Values from the Engine. The Engine returns a slice of Values by querying the FileStore and Cache. The Values in the Cache are overlaid on top of the values returned from the FileStore. The FileStore reads and decodes blocks of Values according to the index for the file.
Updates (writing a newer value for a point that already exists) occur as normal writes. Since cached values overwrite existing values, newer writes take precedence.
Deletes occur by writing a delete entry for the measurement or series to the WAL and then updating the Cache and FileStore. The Cache evicts all relevant entries. The FileStore writes a tombstone file for each TSM file that contains relevant data. These tombstone files are used at startup time to ignore blocks as well as during compactions to remove deleted entries.
# Compactions
Compactions are a serial and continuously running process that iteratively optimizes the storage for queries. Specifically, it does the following:
* Converts closed WAL files into TSM files and removes the closed WAL files
* Combines smaller TSM files into larger ones to improve compression ratios
* Rewrites existing files that contain series data that has been deleted
* Rewrites existing files that contain writes with more recent data to ensure a point exists in only one TSM file.
The compaction algorithm is continuously running and always selects files to compact based on a priority.
1. If there are closed WAL files, the 5 oldest WAL segments are added to the set of compaction files.
2. If any TSM files contain points with older timestamps that also exist in the WAL files, those TSM files are added to the compaction set.
3. If any TSM files have a tombstone marker, those TSM files are added to the compaction set.
The compaction algorithm generates a set of SeriesIterators that return a sequence of `key`, `Values` where each `key` returned is lexicographically greater than the previous one. The iterators are ordered such that WAL iterators will override any values returned by the TSM file iterators. WAL iterators read and cache the WAL segment so that deletes later in the log can be processed correctly. TSM file iterators use the tombstone files to ensure that deleted series are not returned during iteration. As each key is processed, the Values slice is grown, sorted, and then written to a new block in the new TSM file. The blocks can be split based on number of points or size of the block. If the total size of the current TSM file would exceed the maximum file size, a new file is created.
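The following is a rough sketch of that merge rule over two inputs, using a hypothetical `kv` pair type; the real iterators also handle tombstones, block splitting, and an arbitrary number of inputs:
```
// kv is a hypothetical (key, values) pair produced by an iterator.
type kv struct {
	key    string
	values []Value
}

// merge combines two key-sorted sequences. For equal keys the WAL values are
// appended after the TSM values so a later sort/dedup keeps the newer writes.
func merge(tsm, wal []kv) []kv {
	var out []kv
	for len(tsm) > 0 || len(wal) > 0 {
		switch {
		case len(wal) == 0 || (len(tsm) > 0 && tsm[0].key < wal[0].key):
			out = append(out, tsm[0])
			tsm = tsm[1:]
		case len(tsm) == 0 || wal[0].key < tsm[0].key:
			out = append(out, wal[0])
			wal = wal[1:]
		default: // equal keys: WAL values overlay TSM values
			out = append(out, kv{tsm[0].key, append(tsm[0].values, wal[0].values...)})
			tsm, wal = tsm[1:], wal[1:]
		}
	}
	return out
}
```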
Deletions can occur while a new file is being written. Since the new TSM file is not complete, a tombstone would not be written for it. This could result in deleted values getting written into a new file. To prevent this, if a compaction is running and a delete occurs, the current compaction is aborted and a new compaction is started.
When all WAL files in the current compaction have been processed and the new TSM files have been successfully written, the new TSM files are renamed to their final names, the WAL segments are truncated and the associated snapshots are released from the cache.
The compaction process then runs again until there are no more WAL files and the minimum number of TSM files exist that are also under the maximum file size.
# WAL
Currently, there is a WAL per shard. This means all the writes in a WAL segment are for the given shard. It also means that writes across a lot of shards append to many files which might result in more disk IO due to seeking to the end of multiple files.
Two options are being considered:
## WAL per Shard
This is the current behavior of the WAL. This option is conceptually easier to reason about. For example, compactions that read in multiple WAL segments are assured that all the WAL entries pertain to the current shard. If it completes a compaction, it is safe to remove the WAL segment. It is also easier to deal with shard deletions as all the WAL segments can be dropped along with the other shard files.
The drawback of this option is the potential for turning sequential write IO into random IO in the presence of multiple shards and writes to many different shards.
## Single WAL
Using a single WAL adds some complexity to compactions and deletions. Compactions will need to either sort all the WAL entries in a segment by shard first and then run compactions on each shard or the compactor needs to be able to compact multiple shards concurrently while ensuring points in existing TSM files in different shards remain separate.
Deletions would not be able to reclaim WAL segments immediately as in the case where there is a WAL per shard. Similarly, a compaction of a WAL segment that contains writes for a deleted shard would need to be dropped.
Currently, we are moving towards a Single WAL implementation.
# Cache
The purpose of the cache is to make data in the WAL queryable. Every time a point is written to a WAL segment, it is also written to an in-memory cache. The cache is split into two parts: a "hot" part, representing the most recent writes, and a "cold" part containing snapshots for which an active WAL compaction process is underway.
Queries are satisfied with values read from the cache and finalized TSM files. Points in the cache always take precedence over points in TSM files with the same timestamp. Queries are never read directly from WAL segment files, which are designed to optimize write rather than read performance.
The cache tracks its size on a "point-calculated" basis: the RAM storage footprint for a point is determined by calling its `Size()` method. While this does not correspond directly to the actual RAM footprint in the cache, the two values are sufficiently well correlated for the purpose of controlling RAM usage.
If the cache becomes too full, or the cache has been idle for too long, a snapshot of the cache is taken and a compaction process is initiated for the related WAL segments. When the compaction of these segments is complete, the related snapshots are released from the cache.
In cases where IO performance of the compaction process falls behind the incoming write rate, it is possible that writes might arrive at the cache while the cache is both too full and the compaction of the previous snapshot is still in progress. In this case, the cache will reject the write, causing the write to fail.
Well behaved clients should interpret write failures as back pressure and should either discard the write or back off and retry the write after a delay.
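For example, a hypothetical client-side helper (not part of the engine) might implement that policy as follows:
```
import "time"

// writeWithRetry treats write errors as back pressure: it retries a failed
// write with exponential backoff and gives up after a fixed number of attempts.
func writeWithRetry(write func() error, attempts int) error {
	delay := 100 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = write(); err == nil {
			return nil
		}
		time.Sleep(delay) // back off before the next attempt
		delay *= 2
	}
	return err // caller may discard the write at this point
}
```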
# TSM File Index
Each TSM file contains a full index of the blocks contained within the file. The existing index structure is designed to allow for a binary search across the index to find the starting block for a key. We would then seek to that start key and sequentially scan each block to find the location of a timestamp.
One issue with the existing structure is that seeking to a given timestamp for a key has an unknown cost. This can cause variability in read performance that would be very difficult to fix. Another issue is that startup time would grow in proportion to the number and size of TSM files on disk, since we would need to scan the entire file to find all keys contained in the file. This could be addressed by using a separate index-like file or by changing the index structure.
We've chosen to update the block index structure to ensure a TSM file is fully self-contained, supports consistent IO characteristics for sequential and random accesses, and provides an efficient load time regardless of file size. The implications of these changes are that the index is slightly larger and that we need to be able to search the index despite each entry being variably sized.
The following are some alternative design options to handle the cases where the index is too large to fit in memory. We are currently planning to use an indirect MMAP indexing approach for loaded TSM files.
### Indirect MMAP Indexing
One option is to MMAP the index into memory and record the pointers to the start of each index entry in a slice. When searching for a given key, the pointers are used to perform a binary search on the underlying mmap data. When the matching key is found, the block entries can be loaded and searched, or a subsequent binary search on the blocks can be performed.
A variation of this can also be done without MMAPs by seeking and reading in the file. The underlying file cache will still be utilized in this approach.
As an example, if we have an index structure in memory such as:
```
┌────────────────────────────────────────────────────────────────────┐
│ Index │
├─┬──────────────────────┬──┬───────────────────────┬───┬────────────┘
│0│ │62│ │145│
├─┴───────┬─────────┬────┼──┴──────┬─────────┬──────┼───┴─────┬──────┐
│Key 1 Len│ Key │... │Key 2 Len│ Key 2 │ ... │ Key 3 │ ... │
│ 2 bytes │ N bytes │ │ 2 bytes │ N bytes │ │ 2 bytes │ │
└─────────┴─────────┴────┴─────────┴─────────┴──────┴─────────┴──────┘
```
We would build an `offsets` slice where each element points to the byte location of the first key in the index slice.
```
┌────────────────────────────────────────────────────────────────────┐
│ Offsets │
├────┬────┬────┬─────────────────────────────────────────────────────┘
│ 0 │ 62 │145 │
└────┴────┴────┘
```
Using this offset slice we can find `Key 2` by doing a binary search over the offsets slice. Instead of comparing the value in the offsets (e.g. `62`), we use that as an index into the underlying index to retrieve the key at position `62` and perform our comparisons with that.
When we have identified the correct position in the index for a given key, we could perform another binary search or a linear scan. This should be fast as well since each index entry is 28 bytes and they are all contiguous in memory.
The size of the offsets slice would be proportional to the number of unique series. If we limit file sizes to 4GB, we would use 4 bytes for each pointer.
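A sketch of that search, assuming big-endian key lengths as in the layout above:
```
import (
	"bytes"
	"encoding/binary"
	"sort"
)

// searchOffsets returns the position in offsets of the first key >= target.
// index is the mmap'd index section; each offset points at a 2-byte key
// length followed by the key bytes.
func searchOffsets(index []byte, offsets []uint32, target []byte) int {
	return sort.Search(len(offsets), func(i int) bool {
		off := offsets[i]
		n := uint32(binary.BigEndian.Uint16(index[off : off+2]))
		return bytes.Compare(index[off+2:off+2+n], target) >= 0
	})
}
```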
### LRU/Lazy Load
A second option could be to have the index work as a memory bounded, lazy-load style cache. When a cache miss occurs, the index structure is scanned to find the key, and the entries are loaded and added to the cache, which causes the least-recently used entries to be evicted.
### Key Compression
Another option is to compress keys using a key-specific dictionary encoding. For example,
```
cpu,host=server1 value=1
cpu,host=server2 value=2
memory,host=server1 value=3
```
These keys could be compressed by expanding each key into its respective parts: measurement, tag keys, tag values, and fields. For each part a unique number is assigned, e.g.
Measurements
```
cpu = 1
memory = 2
```
Tag Keys
```
host = 1
```
Tag Values
```
server1 = 1
server2 = 2
```
Fields
```
value = 1
```
Using this encoding dictionary, the string keys could be converted to a sequence of integers:
```
cpu,host=server1 value=1 --> 1,1,1,1
cpu,host=server2 value=2 --> 1,1,2,1
memory,host=server1 value=3 --> 2,1,1,1
```
These sequences of small integers can then be compressed further using a bit-packed format such as Simple9 or Simple8b. The resulting byte slices would be a multiple of 4 or 8 bytes (using Simple9 or Simple8b respectively), which could be used in place of the string key.
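A toy sketch of such a dictionary (hypothetical; a real implementation would also need to persist the mappings alongside the file):
```
// dictionary assigns each distinct string a small integer ID. One dictionary
// would be kept per key part: measurements, tag keys, tag values, and fields.
type dictionary struct {
	ids map[string]uint64
}

func newDictionary() *dictionary {
	return &dictionary{ids: make(map[string]uint64)}
}

// id returns the existing ID for s, assigning the next free ID if needed.
func (d *dictionary) id(s string) uint64 {
	if v, ok := d.ids[s]; ok {
		return v
	}
	v := uint64(len(d.ids) + 1)
	d.ids[s] = v
	return v
}
```
With one dictionary per part, `cpu,host=server1 value=1` would encode to the sequence of IDs shown above.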
### Separate Index
Another option might be to have a separate index file (BoltDB) that serves as the storage for the `FileIndex` and is transient. This index would be recreated at startup and updated at compaction time.
# Components
These are some of the high-level components and their responsibilities. These ideas are preliminary.
## WAL
* Append-only log composed of fixed size segment files.
* Writes are appended to the current segment
* Roll-over to new segment after filling the current segment
* Closed segments are never modified and are used for startup and recovery as well as compactions.
* There is a single WAL for the store as opposed to a WAL per shard.
## Compactor
* Continuously running, iterative file storage optimizer
* Takes closed WAL files and existing TSM files and combines them into one or more new TSM files
## Cache
* Holds recently written series data
* Has max size and a flushing limit
* When the flushing limit is crossed, a snapshot is taken and a compaction process for the related WAL segments is commenced.
* If a write comes in, the cache is too full, and the previous snapshot is still being compacted, the write will fail.
## Engine
* Maintains references to the Cache, FileStore, WAL, etc.
* Creates a cursor
* Receives writes, coordinates queries
* Hides underlying files and types from clients
## Cursor
* Iterates forward or reverse for given key
* Requests values from Engine for key and timestamp
* Has no knowledge of TSM files or WAL - delegates to Engine to request next set of Values
## FileStore
* Manages TSM files
* Maintains the file indexes and references to active files
* Opening a TSM file entails reading in the index section and adding it to the `FileIndex`. The block data is then MMAPed up to the index offset to avoid holding the index in memory twice.
## FileIndex
* Provides location information to a file and block for a given key and timestamp.
## Interfaces
```
// SeriesIterator returns the key and []Value such that a key is only returned
// once and subsequent calls to Next() do not return the same key twice.
type SeriesIterator interface {
	Next() (key string, values []Value, err error)
}
```
## Types
_NOTE: the actual func names are illustrative of the type of functionality each type is responsible for._
```
// TSMWriter writes sets of keys and Values to a TSM file.
type TSMWriter struct {}
func (t *TSMWriter) Write(key string, values []Value) error
func (t *TSMWriter) Close() error
```
```
// WALIterator returns the key and []Value for a set of WAL segment files.
type WALIterator struct{
	Files []*os.File
}
func (r *WALIterator) Next() (key, []Value, error)
```
```
// TSMIterator returns the key and values from a TSM file.
type TSMIterator struct {}
func (r *TSMIterator) Next() (key, []Value, error)
```
```
type Compactor struct {}
func (c *Compactor) Compact(iters ...SeriesIterator) error
```
```
type Engine struct {
	wal       *WAL
	cache     *Cache
	fileStore *FileStore
	compactor *Compactor
}
func (e *Engine) ValuesBefore(key string, timestamp time.Time) ([]Value, error)
func (e *Engine) ValuesAfter(key string, timestamp time.Time) ([]Value, error)
```
```
type Cursor struct{
	engine *Engine
}
...
```
```
// FileStore maintains references to the active TSM files.
type FileStore struct {}
func (f *FileStore) ValuesBefore(key string, timestamp time.Time) ([]Value, error)
func (f *FileStore) ValuesAfter(key string, timestamp time.Time) ([]Value, error)
```
```
type FileIndex struct {}
// Location returns the file and the offset of the block within that file that contains the requested key and timestamp.
func (f *FileIndex) Location(key, timestamp) (*os.File, uint64, error)
```
```
type Cache struct {}
func (c *Cache) Write(key string, values []Value, checkpoint uint64) error
func (c *Cache) SetCheckpoint(checkpoint uint64) error
func (c *Cache) Cursor(key string) tsdb.Cursor
```
```
type WAL struct {}
func (w *WAL) Write(key string, values []Value)
func (w *WAL) ClosedSegments() ([]*os.File, error)
```
# Concerns
## Performance
This design is concerned with the following categories of performance:
* Write Throughput/Latency
* Query Throughput/Latency
* Startup time
* Compaction Throughput/Latency
* Memory Usage
### Writes
Write throughput is bounded by the time to process the write on the CPU (parsing, sorting, etc.), adding to and evicting from the Cache, and appending the write to the WAL. The first two items are CPU bound and can be tuned and optimized if they become a bottleneck. The WAL write can be tuned such that in the worst case every write requires at least 2 IOPS (write + fsync), or batched so that multiple writes are queued and fsync'd in sizes matching one or more disk blocks. Performing more work with each IO will improve throughput.
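As a sketch of the batching trade-off (a hypothetical helper, not the engine's actual WAL code), the cost of the write + fsync pair is amortized over the whole batch:
```
import "os"

// flushBatch appends queued writes to the segment file and fsyncs once,
// so many logical writes share a single write + fsync pair.
func flushBatch(f *os.File, batch [][]byte) error {
	for _, b := range batch {
		if _, err := f.Write(b); err != nil {
			return err
		}
	}
	return f.Sync()
}
```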
Write latency is minimal for the WAL write since there are no seeks. The latency is bounded by the time to complete any write and fsync calls.
### Queries
Query throughput is directly related to how many blocks can be read in a period of time. The index structure contains enough information to determine if one or multiple blocks can be read in a single IO.
Query latency is determined by how long it takes to find and read the relevant blocks. The in-memory index structure contains the offsets and sizes of all blocks for a key. This allows every block to be read in 2 IOPS (seek + read) regardless of position, structure or size of file.
### Startup
Startup time is proportional to the number of WAL files, TSM files and tombstone files. WAL files can be read and processed in large batches using the WALIterators. TSM files require reading the index block into memory (5 IOPS/file). Tombstone files are expected to be small and infrequent and would require approximately 2 IOPS/file.
### Compactions
Compactions are IO intensive in that they may need to read multiple, large TSM files to rewrite them. The throughput of compactions (MB/s), as well as the latency of each compaction, must be kept consistent even as data sizes grow.
To address these concerns, compactions prioritize old WAL files over optimizing storage/compression to avoid data being hidden during overload situations. This also accounts for the fact that shards will eventually become cold for writes so that existing data will be able to be optimized. To maintain consistent performance, the number of each type of file processed as well as the size of each file processed is bounded.
### Memory Footprint
The memory footprint should not grow unbounded due to additional files or series keys of large sizes or numbers. Some options for addressing this concern are covered in the design options discussed in the TSM File Index section.
## Concurrency
The main concern with concurrency is that reads and writes should not block each other. Writes add entries to the Cache and append entries to the WAL. During queries, the contention points will be the Cache and existing TSM files. Since the Cache and TSM file data is only accessed through the engine by the cursors, several strategies can be used to improve concurrency.
1. Cached series data is returned to cursors as a copy. Since cache snapshots are released following compaction, cursor iteration and writes to the same series could block each other. Iterating over copies of the values can relieve some of this contention.
2. TSM data values returned by the engine are new references to Values and not access to the actual TSM files. This means that the `Engine`, through the `FileStore` can limit contention.
3. Compactions are the only place where new TSM files are added and removed. Since this is a serial, continuously running process, file contention is minimized.
## Robustness
The two robustness concerns considered by this design are writes filling the cache and crash recovery.
### Cache Exhaustion
The cache is used to hold the contents of uncompacted WAL segments in memory until such time that the compaction process has had a chance to convert the write-optimised WAL segments into read-optimised TSM files.
The question arises about what to do in the case that the inbound write rate temporarily exceeds the compaction rate. There are four alternatives:
* block the write until the compaction process catches up
* cache the write and hope that the compaction process catches up before memory exhaustion occurs
* evict older cache entries to make room for new writes
* fail the write and propagate the error back to the database client as a form of back pressure
The current design chooses the last option - failing the writes. While this option reduces the apparent robustness of the database API from the perspective of the clients, it does provide a means by which the database can communicate, via back pressure, the need for clients to back off temporarily. Well behaved clients should respond to write errors either by discarding the write or by retrying it after a delay in the hope that the compaction process will eventually catch up. The problem with the first two options is that they may exhaust server resources. The problem with the third option is that queries (which don't touch WAL segments) might silently return incomplete results during compaction periods; with the selected option the possibility of incomplete queries is at least flagged by the presence of write errors during periods of degraded compaction performance.
### Crash Recovery
Crash recovery is facilitated by the following two properties: the append-only nature of WAL segments and the write-once nature of TSM files. If the server crashes, incomplete compactions are discarded and the cache is rebuilt from the discovered WAL segments. Compactions will then resume in the normal way. Similarly, TSM files are immutable once they have been created and registered with the file store. A compaction may replace an existing TSM file, but the replaced file is not removed from the file system until the replacement file has been created and synced to disk.
# Errata
This section is reserved for errata. In cases where the document is incorrect or inconsistent, such errata will be noted here with the contents of this section taking precedence over text elsewhere in the document in the case of discrepancies. Future full revisions of this document will fold the errata text back into the body of the document.
# Revisions
## 14 February 2016
* refined description of cache behaviour and robustness to reflect current design based on snapshots. Most references to checkpoints and evictions have been removed. See discussion here - https://goo.gl/L7AzVu
## 11 November 2015
* initial design published

View File

@@ -0,0 +1,5 @@
{
"files": [
"00000001.tsl"
]
}

View File

@@ -0,0 +1,133 @@
package tsm1
import "io"
// BitReader reads bits from an io.Reader.
type BitReader struct {
data []byte
buf struct {
v uint64 // bit buffer
n uint // available bits
}
}
// NewBitReader returns a new instance of BitReader that reads from data.
func NewBitReader(data []byte) *BitReader {
b := new(BitReader)
b.Reset(data)
return b
}
// Reset sets the underlying reader on b and reinitializes.
func (r *BitReader) Reset(data []byte) {
r.data = data
r.buf.v, r.buf.n = 0, 0
r.readBuf()
}
// CanReadBitFast returns true if calling ReadBitFast() is allowed.
// Fast bit reads are allowed when at least 2 values are in the buffer.
// This is because the buffer does not need to be refilled and the caller
// can inline the calls.
func (r *BitReader) CanReadBitFast() bool { return r.buf.n > 1 }
// ReadBitFast is an optimized bit read.
// IMPORTANT: Only allowed if CanReadBitFast() is true!
func (r *BitReader) ReadBitFast() bool {
v := (r.buf.v&(1<<63) != 0)
r.buf.v <<= 1
r.buf.n -= 1
return v
}
// ReadBit returns the next bit from the underlying data.
func (r *BitReader) ReadBit() (bool, error) {
v, err := r.ReadBits(1)
return v != 0, err
}
// ReadBits reads nbits from the underlying data into a uint64.
// nbits must be from 1 to 64, inclusive.
func (r *BitReader) ReadBits(nbits uint) (uint64, error) {
// Return EOF if there is no more data.
if r.buf.n == 0 {
return 0, io.EOF
}
// Return bits from the buffer if enough bits are available.
if nbits <= r.buf.n {
// Return all bits, if requested.
if nbits == 64 {
v := r.buf.v
r.buf.v, r.buf.n = 0, 0
r.readBuf()
return v, nil
}
// Otherwise mask returned bits.
v := (r.buf.v >> (64 - nbits))
r.buf.v <<= nbits
r.buf.n -= nbits
if r.buf.n == 0 {
r.readBuf()
}
return v, nil
}
// Otherwise read all available bits in current buffer.
v, n := r.buf.v, r.buf.n
// Read new buffer.
r.buf.v, r.buf.n = 0, 0
r.readBuf()
// Append new buffer to previous buffer and shift to remove unnecessary bits.
v |= (r.buf.v >> n)
v >>= 64 - nbits
// Remove used bits from new buffer.
bufN := nbits - n
if bufN > r.buf.n {
bufN = r.buf.n
}
r.buf.v <<= bufN
r.buf.n -= bufN
if r.buf.n == 0 {
r.readBuf()
}
return v, nil
}
func (r *BitReader) readBuf() {
// Determine number of bytes to read to fill buffer.
byteN := 8 - (r.buf.n / 8)
// Limit to the length of our data.
if n := uint(len(r.data)); byteN > n {
byteN = n
}
// Optimized 8-byte read.
if byteN == 8 {
r.buf.v = uint64(r.data[7]) | uint64(r.data[6])<<8 |
uint64(r.data[5])<<16 | uint64(r.data[4])<<24 |
uint64(r.data[3])<<32 | uint64(r.data[2])<<40 |
uint64(r.data[1])<<48 | uint64(r.data[0])<<56
r.buf.n = 64
r.data = r.data[8:]
return
}
// Otherwise append bytes to buffer.
for i := uint(0); i < byteN; i++ {
r.buf.n += 8
r.buf.v |= uint64(r.data[i]) << (64 - r.buf.n)
}
// Move data forward.
r.data = r.data[byteN:]
}
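// Example (a hypothetical usage sketch, not part of this file):
//
//	r := NewBitReader([]byte{0xA0}) // buffer holds bits 10100000
//	v, _ := r.ReadBits(3)           // v == 0b101 == 5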

View File

@@ -0,0 +1,180 @@
package tsm1_test
import (
"bytes"
"io"
"math"
"math/rand"
"reflect"
"testing"
"testing/quick"
"github.com/dgryski/go-bitstream"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func TestBitStreamEOF(t *testing.T) {
br := tsm1.NewBitReader([]byte("0"))
b, err := br.ReadBits(8)
if err != nil {
t.Fatal(err)
}
if b != '0' {
t.Error("ReadBits(8) didn't return first byte")
}
if _, err := br.ReadBits(8); err != io.EOF {
t.Error("ReadBits(8) on empty string didn't return EOF")
}
// 0 = 0b00110000
br = tsm1.NewBitReader([]byte("0"))
buf := bytes.NewBuffer(nil)
bw := bitstream.NewWriter(buf)
for i := 0; i < 4; i++ {
bit, err := br.ReadBit()
if err == io.EOF {
break
}
if err != nil {
t.Error("GetBit returned error err=", err.Error())
return
}
bw.WriteBit(bitstream.Bit(bit))
}
bw.Flush(bitstream.One)
err = bw.WriteByte(0xAA)
if err != nil {
t.Error("unable to WriteByte")
}
c := buf.Bytes()
if len(c) != 2 || c[1] != 0xAA || c[0] != 0x3f {
t.Error("bad return from 4 read bytes")
}
_, err = tsm1.NewBitReader([]byte("")).ReadBit()
if err != io.EOF {
t.Error("ReadBit on empty string didn't return EOF")
}
}
func TestBitStream(t *testing.T) {
buf := bytes.NewBuffer(nil)
br := tsm1.NewBitReader([]byte("hello"))
bw := bitstream.NewWriter(buf)
for {
bit, err := br.ReadBit()
if err == io.EOF {
break
}
if err != nil {
t.Error("GetBit returned error err=", err.Error())
return
}
bw.WriteBit(bitstream.Bit(bit))
}
s := buf.String()
if s != "hello" {
t.Error("expected 'hello', got=", []byte(s))
}
}
func TestByteStream(t *testing.T) {
buf := bytes.NewBuffer(nil)
br := tsm1.NewBitReader([]byte("hello"))
bw := bitstream.NewWriter(buf)
for i := 0; i < 3; i++ {
bit, err := br.ReadBit()
if err == io.EOF {
break
}
if err != nil {
t.Error("GetBit returned error err=", err.Error())
return
}
bw.WriteBit(bitstream.Bit(bit))
}
for i := 0; i < 3; i++ {
byt, err := br.ReadBits(8)
if err == io.EOF {
break
}
if err != nil {
t.Error("ReadBits(8) returned error err=", err.Error())
return
}
bw.WriteByte(byte(byt))
}
u, err := br.ReadBits(13)
if err != nil {
t.Error("ReadBits returned error err=", err.Error())
return
}
bw.WriteBits(u, 13)
bw.WriteBits(('!'<<12)|('.'<<4)|0x02, 20)
// 0x2f == '/'
bw.Flush(bitstream.One)
s := buf.String()
if s != "hello!./" {
t.Errorf("expected 'hello!./', got=%x", []byte(s))
}
}
// Ensure bit reader can read random bits written to a stream.
func TestBitReader_Quick(t *testing.T) {
if err := quick.Check(func(values []uint64, nbits []uint) bool {
// Limit nbits to 64.
for i := 0; i < len(values) && i < len(nbits); i++ {
nbits[i] = (nbits[i] % 64) + 1
values[i] = values[i] & (math.MaxUint64 >> (64 - nbits[i]))
}
// Write bits to a buffer.
var buf bytes.Buffer
w := bitstream.NewWriter(&buf)
for i := 0; i < len(values) && i < len(nbits); i++ {
w.WriteBits(values[i], int(nbits[i]))
}
w.Flush(bitstream.Zero)
// Read bits from the buffer.
r := tsm1.NewBitReader(buf.Bytes())
for i := 0; i < len(values) && i < len(nbits); i++ {
v, err := r.ReadBits(nbits[i])
if err != nil {
t.Errorf("unexpected error(%d): %s", i, err)
return false
} else if v != values[i] {
t.Errorf("value mismatch(%d): got=%d, exp=%d (nbits=%d)", i, v, values[i], nbits[i])
return false
}
}
return true
}, &quick.Config{
Values: func(a []reflect.Value, rand *rand.Rand) {
a[0], _ = quick.Value(reflect.TypeOf([]uint64{}), rand)
a[1], _ = quick.Value(reflect.TypeOf([]uint{}), rand)
},
}); err != nil {
t.Fatal(err)
}
}

View File

@@ -0,0 +1,174 @@
package tsm1
// boolean encoding uses 1 bit per value. Each compressed byte slice contains a 1 byte header
// indicating the compression type, followed by a variable byte encoded length indicating
// how many booleans are packed in the slice. The remaining bytes contain 1 byte for every
// 8 boolean values encoded.
import (
"encoding/binary"
"fmt"
)
const (
// booleanUncompressed is an uncompressed boolean format.
// Not yet implemented.
booleanUncompressed = 0
// booleanCompressedBitPacked is a bit-packed format using 1 bit per boolean
booleanCompressedBitPacked = 1
)
// BooleanEncoder encodes a series of booleans to an in-memory buffer.
type BooleanEncoder struct {
// The encoded bytes
bytes []byte
// The current byte being encoded
b byte
// The number of bools packed into b
i int
// The total number of bools written
n int
}
// NewBooleanEncoder returns a new instance of BooleanEncoder.
func NewBooleanEncoder(sz int) BooleanEncoder {
return BooleanEncoder{
bytes: make([]byte, 0, (sz+7)/8),
}
}
// Reset sets the encoder to its initial state.
func (e *BooleanEncoder) Reset() {
e.bytes = e.bytes[:0]
e.b = 0
e.i = 0
e.n = 0
}
// Write encodes b to the underlying buffer.
func (e *BooleanEncoder) Write(b bool) {
// If we have filled the current byte, flush it
if e.i >= 8 {
e.flush()
}
// Use 1 bit for each boolean value, shift the current byte
// by 1 and set the least significant bit accordingly
e.b = e.b << 1
if b {
e.b |= 1
}
// Increment the current boolean count
e.i++
// Increment the total boolean count
e.n++
}
func (e *BooleanEncoder) flush() {
// Pad remaining byte w/ 0s
for e.i < 8 {
e.b = e.b << 1
e.i++
}
// If we have bits set, append them to the byte slice
if e.i > 0 {
e.bytes = append(e.bytes, e.b)
e.b = 0
e.i = 0
}
}
// Flush is a no-op.
func (e *BooleanEncoder) Flush() {}
// Bytes returns a new byte slice containing the encoded booleans from previous calls to Write.
func (e *BooleanEncoder) Bytes() ([]byte, error) {
// Ensure the current byte is flushed
e.flush()
b := make([]byte, 10+1) // 1 type byte plus up to 10 bytes for the varint count
// Store the encoding type in the 4 high bits of the first byte
b[0] = byte(booleanCompressedBitPacked) << 4
i := 1
// Encode the number of booleans written
i += binary.PutUvarint(b[i:], uint64(e.n))
// Append the packed booleans
return append(b[:i], e.bytes...), nil
}
// BooleanDecoder decodes a series of booleans from an in-memory buffer.
type BooleanDecoder struct {
b []byte
i int
n int
err error
}
// SetBytes initializes the decoder with a new set of bytes to read from.
// This must be called before calling any other methods.
func (e *BooleanDecoder) SetBytes(b []byte) {
if len(b) == 0 {
return
}
// The first byte stores the encoding type. Only one bit-packed format exists
// currently, so it is ignored for now.
b = b[1:]
count, n := binary.Uvarint(b)
if n <= 0 {
e.err = fmt.Errorf("BooleanDecoder: invalid count")
return
}
e.b = b[n:]
e.i = -1
e.n = int(count)
if min := len(e.b) * 8; min < e.n {
// Shouldn't happen - TSM file was truncated/corrupted
e.n = min
}
}
// Next returns whether there are any bits remaining in the decoder.
// It returns false if there was an error decoding.
// The error is available on the Error method.
func (e *BooleanDecoder) Next() bool {
if e.err != nil {
return false
}
e.i++
return e.i < e.n
}
// Read returns the next bit from the decoder.
func (e *BooleanDecoder) Read() bool {
// Index into the byte slice
idx := e.i >> 3 // integer division by 8
// Bit position
pos := 7 - (e.i & 0x7)
// The mask to select the bit
mask := byte(1 << uint(pos))
// The packed byte
v := e.b[idx]
// Returns true if the bit is set
return v&mask == mask
}
// Error returns the error encountered during decoding, if one occurred.
func (e *BooleanDecoder) Error() error {
return e.err
}
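// Example (a hypothetical usage sketch, not part of this file): encoding
// true, false, true yields 0x10 (type header), 0x03 (varint count) and
// 0xA0 (bits 101, zero padded):
//
//	enc := NewBooleanEncoder(3)
//	enc.Write(true)
//	enc.Write(false)
//	enc.Write(true)
//	b, _ := enc.Bytes() // []byte{0x10, 0x03, 0xA0}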

View File

@@ -0,0 +1,161 @@
package tsm1_test
import (
"reflect"
"testing"
"testing/quick"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func Test_BooleanEncoder_NoValues(t *testing.T) {
enc := tsm1.NewBooleanEncoder(0)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec tsm1.BooleanDecoder
dec.SetBytes(b)
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_BooleanEncoder_Single(t *testing.T) {
enc := tsm1.NewBooleanEncoder(1)
v1 := true
enc.Write(v1)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec tsm1.BooleanDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got false, exp true")
}
if v1 != dec.Read() {
t.Fatalf("unexpected value: got %v, exp %v", dec.Read(), v1)
}
}
func Test_BooleanEncoder_Multi_Compressed(t *testing.T) {
enc := tsm1.NewBooleanEncoder(10)
values := make([]bool, 10)
for i := range values {
values[i] = i%2 == 0
enc.Write(values[i])
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if exp := 4; len(b) != exp {
t.Fatalf("unexpected length: got %v, exp %v", len(b), exp)
}
var dec tsm1.BooleanDecoder
dec.SetBytes(b)
for i, v := range values {
if !dec.Next() {
t.Fatalf("unexpected next value: got false, exp true")
}
if v != dec.Read() {
t.Fatalf("unexpected value at pos %d: got %v, exp %v", i, dec.Read(), v)
}
}
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_BooleanEncoder_Quick(t *testing.T) {
if err := quick.Check(func(values []bool) bool {
expected := values
if values == nil {
expected = []bool{}
}
// Write values to encoder.
enc := tsm1.NewBooleanEncoder(1024)
for _, v := range values {
enc.Write(v)
}
// Retrieve compressed bytes.
buf, err := enc.Bytes()
if err != nil {
t.Fatal(err)
}
// Read values out of decoder.
got := make([]bool, 0, len(values))
var dec tsm1.BooleanDecoder
dec.SetBytes(buf)
for dec.Next() {
got = append(got, dec.Read())
}
// Verify that input and output values match.
if !reflect.DeepEqual(expected, got) {
t.Fatalf("mismatch:\n\nexp=%#v\n\ngot=%#v\n\n", expected, got)
}
return true
}, nil); err != nil {
t.Fatal(err)
}
}
func Test_BooleanDecoder_Corrupt(t *testing.T) {
cases := []string{
"", // Empty
"\x10\x90", // Packed: invalid count
"\x10\x7f", // Packed: count greater than remaining bits, multiple bytes expected
"\x10\x01", // Packed: count greater than remaining bits, one byte expected
}
for _, c := range cases {
var dec tsm1.BooleanDecoder
dec.SetBytes([]byte(c))
if dec.Next() {
t.Fatalf("exp next == false, got true for case %q", c)
}
}
}
func BenchmarkBooleanDecoder_2048(b *testing.B) { benchmarkBooleanDecoder(b, 2048) }
func benchmarkBooleanDecoder(b *testing.B, size int) {
e := tsm1.NewBooleanEncoder(size)
for i := 0; i < size; i++ {
e.Write(i&1 == 1)
}
bytes, err := e.Bytes()
if err != nil {
b.Fatalf("unexpected error: %v", err)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
var d tsm1.BooleanDecoder
d.SetBytes(bytes)
var n int
for d.Next() {
_ = d.Read()
n++
}
if n != size {
b.Fatalf("expected to read %d booleans, but read %d", size, n)
}
}
}

View File

@@ -0,0 +1,766 @@
package tsm1
import (
"fmt"
"math"
"os"
"sync"
"sync/atomic"
"time"
"github.com/influxdata/influxdb/influxql"
"github.com/influxdata/influxdb/models"
"github.com/influxdata/influxdb/tsdb"
"github.com/uber-go/zap"
)
// ringShards specifies the number of partitions that the hash ring used to
// store the entry mappings contains. It must be a power of 2. From empirical
// testing, a value above the number of cores on the machine does not provide
// any additional benefit. For now we'll set it to the number of cores on the
// largest box we could imagine running influx.
const ringShards = 4096
var (
// ErrSnapshotInProgress is returned if a snapshot is attempted while one is already running.
ErrSnapshotInProgress = fmt.Errorf("snapshot in progress")
)
// ErrCacheMemorySizeLimitExceeded returns an error indicating an operation
// could not be completed due to exceeding the cache-max-memory-size setting.
func ErrCacheMemorySizeLimitExceeded(n, limit uint64) error {
return fmt.Errorf("cache-max-memory-size exceeded: (%d/%d)", n, limit)
}
// entry is a set of values and some metadata.
type entry struct {
mu sync.RWMutex
values Values // All stored values.
// The type of values stored. Read only so doesn't need to be protected by
// mu.
vtype int
}
// newEntryValues returns a new instance of entry with the given values. If the
// values are not valid, an error is returned.
//
// newEntryValues takes an optional hint to indicate the initial buffer size.
// The hint is only respected if it's positive.
func newEntryValues(values []Value, hint int) (*entry, error) {
// Ensure we start off with a reasonably sized values slice.
if hint < 32 {
hint = 32
}
e := &entry{}
if len(values) > hint {
e.values = make(Values, 0, len(values))
} else {
e.values = make(Values, 0, hint)
}
e.values = append(e.values, values...)
// No values, don't check types and ordering
if len(values) == 0 {
return e, nil
}
et := valueType(values[0])
for _, v := range values {
// Make sure all the values are the same type
if et != valueType(v) {
return nil, tsdb.ErrFieldTypeConflict
}
}
// Set the type of values stored.
e.vtype = et
return e, nil
}
// add adds the given values to the entry.
func (e *entry) add(values []Value) error {
if len(values) == 0 {
return nil // Nothing to do.
}
// Are any of the new values the wrong type?
for _, v := range values {
if e.vtype != valueType(v) {
return tsdb.ErrFieldTypeConflict
}
}
// entry currently has no values, so add the new ones and we're done.
e.mu.Lock()
if len(e.values) == 0 {
// Ensure we start off with a reasonably sized values slice.
if len(values) < 32 {
e.values = make(Values, 0, 32)
e.values = append(e.values, values...)
} else {
e.values = values
}
e.mu.Unlock()
return nil
}
// Append the new values to the existing ones...
e.values = append(e.values, values...)
e.mu.Unlock()
return nil
}
// deduplicate sorts and orders the entry's values. If values are already deduped and sorted,
// the function does no work and simply returns.
func (e *entry) deduplicate() {
e.mu.Lock()
defer e.mu.Unlock()
if len(e.values) == 0 {
return
}
e.values = e.values.Deduplicate()
}
// count returns the number of values in this entry.
func (e *entry) count() int {
e.mu.RLock()
n := len(e.values)
e.mu.RUnlock()
return n
}
// filter removes all values with timestamps between min and max inclusive.
func (e *entry) filter(min, max int64) {
e.mu.Lock()
e.values = e.values.Exclude(min, max)
e.mu.Unlock()
}
// size returns the size of this entry in bytes.
func (e *entry) size() int {
e.mu.RLock()
sz := e.values.Size()
e.mu.RUnlock()
return sz
}
// InfluxQLType returns the data type of the entry's values.
func (e *entry) InfluxQLType() (influxql.DataType, error) {
e.mu.RLock()
defer e.mu.RUnlock()
return e.values.InfluxQLType()
}
// Statistics gathered by the Cache.
const (
// levels - point in time measures
statCacheMemoryBytes = "memBytes" // level: Size of in-memory cache in bytes
statCacheDiskBytes = "diskBytes" // level: Size of on-disk snapshots in bytes
statSnapshots = "snapshotCount" // level: Number of active snapshots.
statCacheAgeMs = "cacheAgeMs" // level: Number of milliseconds since cache was last snapshotted at sample time
// counters - accumulative measures
statCachedBytes = "cachedBytes" // counter: Total number of bytes written into snapshots.
statWALCompactionTimeMs = "WALCompactionTimeMs" // counter: Total number of milliseconds spent compacting snapshots
statCacheWriteOK = "writeOk"
statCacheWriteErr = "writeErr"
statCacheWriteDropped = "writeDropped"
)
// storer is the interface that describes a cache's store.
type storer interface {
entry(key string) (*entry, bool) // Get an entry by its key.
write(key string, values Values) error // Write an entry to the store.
add(key string, entry *entry) // Add a new entry to the store.
remove(key string) // Remove an entry from the store.
keys(sorted bool) []string // Return an optionally sorted slice of entry keys.
apply(f func(string, *entry) error) error // Apply f to all entries in the store in parallel.
applySerial(f func(string, *entry) error) error // Apply f to all entries in serial.
reset() // Reset the store to an initial unused state.
}
// Cache maintains an in-memory store of Values for a set of keys.
type Cache struct {
// Due to a bug in atomic, size needs to be the first word in the struct, as
// that's the only place where you're guaranteed to be 64-bit aligned on a
// 32 bit system. See: https://golang.org/pkg/sync/atomic/#pkg-note-BUG
size uint64
snapshotSize uint64
mu sync.RWMutex
store storer
maxSize uint64
// snapshots are the cache objects that are currently being written to tsm files
// they're kept in memory while flushing so they can be queried along with the cache.
// they are read only and should never be modified
snapshot *Cache
snapshotting bool
// This number is the number of pending or failed WriteSnapshot attempts since the last successful one.
snapshotAttempts int
stats *CacheStatistics
lastSnapshot time.Time
// A one-time synchronization used to initialize the cache with a store. Since the store can allocate a
// large amount of memory across shards, we lazily create it.
initialize atomic.Value
initializedCount uint32
}
// NewCache returns an instance of a cache which will use a maximum of maxSize bytes of memory.
// Only used for engine caches, never for snapshots.
func NewCache(maxSize uint64, path string) *Cache {
c := &Cache{
maxSize: maxSize,
store: emptyStore{},
stats: &CacheStatistics{},
lastSnapshot: time.Now(),
}
c.initialize.Store(&sync.Once{})
c.UpdateAge()
c.UpdateCompactTime(0)
c.updateCachedBytes(0)
c.updateMemSize(0)
c.updateSnapshots()
return c
}
// CacheStatistics hold statistics related to the cache.
type CacheStatistics struct {
MemSizeBytes int64
DiskSizeBytes int64
SnapshotCount int64
CacheAgeMs int64
CachedBytes int64
WALCompactionTimeMs int64
WriteOK int64
WriteErr int64
WriteDropped int64
}
// Statistics returns statistics for periodic monitoring.
func (c *Cache) Statistics(tags map[string]string) []models.Statistic {
return []models.Statistic{{
Name: "tsm1_cache",
Tags: tags,
Values: map[string]interface{}{
statCacheMemoryBytes: atomic.LoadInt64(&c.stats.MemSizeBytes),
statCacheDiskBytes: atomic.LoadInt64(&c.stats.DiskSizeBytes),
statSnapshots: atomic.LoadInt64(&c.stats.SnapshotCount),
statCacheAgeMs: atomic.LoadInt64(&c.stats.CacheAgeMs),
statCachedBytes: atomic.LoadInt64(&c.stats.CachedBytes),
statWALCompactionTimeMs: atomic.LoadInt64(&c.stats.WALCompactionTimeMs),
statCacheWriteOK: atomic.LoadInt64(&c.stats.WriteOK),
statCacheWriteErr: atomic.LoadInt64(&c.stats.WriteErr),
statCacheWriteDropped: atomic.LoadInt64(&c.stats.WriteDropped),
},
}}
}
// init initializes the cache and allocates the underlying store. Once initialized,
// the store is re-used until Free is called.
func (c *Cache) init() {
if !atomic.CompareAndSwapUint32(&c.initializedCount, 0, 1) {
return
}
c.mu.Lock()
c.store, _ = newring(ringShards)
c.mu.Unlock()
}
// Free releases the underlying store and memory held by the Cache.
func (c *Cache) Free() {
if !atomic.CompareAndSwapUint32(&c.initializedCount, 1, 0) {
return
}
c.mu.Lock()
c.store = emptyStore{}
c.mu.Unlock()
}
// Write writes the set of values for the key to the cache. This function is goroutine-safe.
// It returns an error if the cache will exceed its max size by adding the new values.
func (c *Cache) Write(key string, values []Value) error {
c.init()
addedSize := uint64(Values(values).Size())
// Enough room in the cache?
limit := c.maxSize
n := c.Size() + addedSize
if limit > 0 && n > limit {
atomic.AddInt64(&c.stats.WriteErr, 1)
return ErrCacheMemorySizeLimitExceeded(n, limit)
}
if err := c.store.write(key, values); err != nil {
atomic.AddInt64(&c.stats.WriteErr, 1)
return err
}
// Update the cache size and the memory size stat.
c.increaseSize(addedSize)
c.updateMemSize(int64(addedSize))
atomic.AddInt64(&c.stats.WriteOK, 1)
return nil
}
// WriteMulti writes the map of keys and associated values to the cache. This
// function is goroutine-safe. It returns an error if the cache would exceed
// its max size by adding the new values. The write attempts to write as many
// values as possible. If one key fails, the others can still succeed and an
// error will be returned.
func (c *Cache) WriteMulti(values map[string][]Value) error {
c.init()
var addedSize uint64
for _, v := range values {
addedSize += uint64(Values(v).Size())
}
// Enough room in the cache?
limit := c.maxSize // maxSize is safe for reading without a lock.
n := c.Size() + addedSize
if limit > 0 && n > limit {
atomic.AddInt64(&c.stats.WriteErr, 1)
return ErrCacheMemorySizeLimitExceeded(n, limit)
}
var werr error
c.mu.RLock()
store := c.store
c.mu.RUnlock()
// We'll optimistically set size here, and then decrement it for write errors.
c.increaseSize(addedSize)
for k, v := range values {
if err := store.write(k, v); err != nil {
// The write failed, hold onto the error and adjust the size delta.
werr = err
addedSize -= uint64(Values(v).Size())
c.decreaseSize(uint64(Values(v).Size()))
}
}
// If some points in the batch were dropped, an error is returned and the
// error stat is incremented as well.
if werr != nil {
atomic.AddInt64(&c.stats.WriteDropped, 1)
atomic.AddInt64(&c.stats.WriteErr, 1)
}
// Update the memory size stat
c.updateMemSize(int64(addedSize))
atomic.AddInt64(&c.stats.WriteOK, 1)
return werr
}
// Snapshot takes a snapshot of the current cache, adds it to the slice of caches that
// are being flushed, and resets the current cache with new values.
func (c *Cache) Snapshot() (*Cache, error) {
c.init()
c.mu.Lock()
defer c.mu.Unlock()
if c.snapshotting {
return nil, ErrSnapshotInProgress
}
c.snapshotting = true
c.snapshotAttempts++ // increment the number of times we tried to do this
// If no snapshot exists, create a new one, otherwise update the existing snapshot
if c.snapshot == nil {
store, err := newring(ringShards)
if err != nil {
return nil, err
}
c.snapshot = &Cache{
store: store,
}
}
// Did a prior snapshot exist that failed? If so, return the existing
// snapshot to retry.
if c.snapshot.Size() > 0 {
return c.snapshot, nil
}
c.snapshot.store, c.store = c.store, c.snapshot.store
snapshotSize := c.Size()
// Save the size of the snapshot on the snapshot cache
atomic.StoreUint64(&c.snapshot.size, snapshotSize)
// Save the size of the snapshot on the live cache
atomic.StoreUint64(&c.snapshotSize, snapshotSize)
// Reset the cache's store.
c.store.reset()
atomic.StoreUint64(&c.size, 0)
c.lastSnapshot = time.Now()
c.updateCachedBytes(snapshotSize) // increment the number of bytes added to the snapshot
c.updateSnapshots()
return c.snapshot, nil
}
// Deduplicate sorts the snapshot before returning it. The compactor and any queries
// coming in while it writes will need the values sorted.
func (c *Cache) Deduplicate() {
c.mu.RLock()
store := c.store
c.mu.RUnlock()
// Apply a function that simply calls deduplicate on each entry in the ring.
// apply cannot return an error in this invocation.
_ = store.apply(func(_ string, e *entry) error { e.deduplicate(); return nil })
}
// ClearSnapshot removes the snapshot cache from the list of flushing caches and
// adjusts the size.
func (c *Cache) ClearSnapshot(success bool) {
c.init()
c.mu.Lock()
defer c.mu.Unlock()
c.snapshotting = false
if success {
c.snapshotAttempts = 0
c.updateMemSize(-int64(atomic.LoadUint64(&c.snapshotSize))) // decrement the number of bytes in cache
// Reset the snapshot's store, and reset the snapshot to a fresh Cache.
c.snapshot.store.reset()
c.snapshot = &Cache{
store: c.snapshot.store,
}
atomic.StoreUint64(&c.snapshotSize, 0)
c.updateSnapshots()
}
}
// Size returns the number of point-calculated bytes the cache currently uses.
func (c *Cache) Size() uint64 {
return atomic.LoadUint64(&c.size) + atomic.LoadUint64(&c.snapshotSize)
}
// increaseSize increases size by delta.
func (c *Cache) increaseSize(delta uint64) {
atomic.AddUint64(&c.size, delta)
}
// decreaseSize decreases size by delta.
func (c *Cache) decreaseSize(delta uint64) {
// Per sync/atomic docs, bit-flip delta minus one to perform subtraction within AddUint64.
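// e.g. with size == 10 and delta == 3, adding ^uint64(2) wraps around to 7.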
atomic.AddUint64(&c.size, ^(delta - 1))
}
// MaxSize returns the maximum number of bytes the cache may consume.
func (c *Cache) MaxSize() uint64 {
return c.maxSize
}
// Keys returns a sorted slice of all keys under management by the cache.
func (c *Cache) Keys() []string {
c.mu.RLock()
store := c.store
c.mu.RUnlock()
return store.keys(true)
}
// unsortedKeys returns a slice of all keys under management by the cache. The
// keys are not sorted.
func (c *Cache) unsortedKeys() []string {
c.mu.RLock()
store := c.store
c.mu.RUnlock()
return store.keys(false)
}
// Values returns a copy of all values, deduped and sorted, for the given key.
func (c *Cache) Values(key string) Values {
var snapshotEntries *entry
c.mu.RLock()
e, ok := c.store.entry(key)
if c.snapshot != nil {
snapshotEntries, _ = c.snapshot.store.entry(key)
}
c.mu.RUnlock()
if !ok {
if snapshotEntries == nil {
// No values in hot cache or snapshots.
return nil
}
} else {
e.deduplicate()
}
// Build the sequence of entries that will be returned, in the correct order.
// Calculate the required size of the destination buffer.
var entries []*entry
sz := 0
if snapshotEntries != nil {
snapshotEntries.deduplicate() // guarantee we are deduplicated
entries = append(entries, snapshotEntries)
sz += snapshotEntries.count()
}
if e != nil {
entries = append(entries, e)
sz += e.count()
}
// Any entries? If not, return.
if sz == 0 {
return nil
}
// Create the buffer, and copy all hot values and snapshots. Individual
// entries are sorted at this point, so now the code has to check if the
// resultant buffer will be sorted from start to finish.
values := make(Values, sz)
n := 0
for _, e := range entries {
e.mu.RLock()
n += copy(values[n:], e.values)
e.mu.RUnlock()
}
values = values[:n]
values = values.Deduplicate()
return values
}
// Delete removes all values for the given keys from the cache.
func (c *Cache) Delete(keys []string) {
c.DeleteRange(keys, math.MinInt64, math.MaxInt64)
}
// DeleteRange removes the values for all keys containing points
// with timestamps between min and max from the cache.
//
// TODO(edd): Lock usage could possibly be optimised if necessary.
func (c *Cache) DeleteRange(keys []string, min, max int64) {
c.init()
c.mu.Lock()
defer c.mu.Unlock()
for _, k := range keys {
// Make sure the key exists in the cache; skip it if not
e, ok := c.store.entry(k)
if !ok {
continue
}
origSize := uint64(e.size())
if min == math.MinInt64 && max == math.MaxInt64 {
c.decreaseSize(origSize)
c.store.remove(k)
continue
}
e.filter(min, max)
if e.count() == 0 {
c.store.remove(k)
c.decreaseSize(origSize)
continue
}
c.decreaseSize(origSize - uint64(e.size()))
}
atomic.StoreInt64(&c.stats.MemSizeBytes, int64(c.Size()))
}
// SetMaxSize updates the memory limit of the cache.
func (c *Cache) SetMaxSize(size uint64) {
c.mu.Lock()
c.maxSize = size
c.mu.Unlock()
}
// values returns the values for the key. It assumes the data is already sorted.
// It doesn't lock the cache but it does read-lock the entry if there is one for the key.
// values should only be used in compact.go in the CacheKeyIterator.
func (c *Cache) values(key string) Values {
e, _ := c.store.entry(key)
if e == nil {
return nil
}
e.mu.RLock()
v := e.values
e.mu.RUnlock()
return v
}
// ApplyEntryFn applies the function f to each entry in the Cache.
// ApplyEntryFn calls f on each entry in turn, within the same goroutine.
// It is safe for use by multiple goroutines.
func (c *Cache) ApplyEntryFn(f func(key string, entry *entry) error) error {
c.mu.RLock()
store := c.store
c.mu.RUnlock()
return store.applySerial(f)
}
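For example, ApplyEntryFn can be used to walk the store serially without reaching into entry internals. A hypothetical helper (the name and tallying logic are illustrative only):

```go
package tsm1

// countEntries is a hypothetical illustration of ApplyEntryFn: it visits
// every entry in turn and tallies the number of keys under management.
func countEntries(c *Cache) (int, error) {
	n := 0
	err := c.ApplyEntryFn(func(key string, e *entry) error {
		n++
		return nil
	})
	return n, err
}
```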
// CacheLoader processes a set of WAL segment files, and loads a cache with the data
// contained within those files. Processing of the supplied files takes place in the
// order they exist in the files slice.
type CacheLoader struct {
files []string
Logger zap.Logger
}
// NewCacheLoader returns a new instance of a CacheLoader.
func NewCacheLoader(files []string) *CacheLoader {
return &CacheLoader{
files: files,
Logger: zap.New(zap.NullEncoder()),
}
}
// Load returns a cache loaded with the data contained within the segment files.
// If, during reading of a segment file, corruption is encountered, that segment
// file is truncated up to and including the last valid byte, and processing
// continues with the next segment file.
func (cl *CacheLoader) Load(cache *Cache) error {
var r *WALSegmentReader
for _, fn := range cl.files {
if err := func() error {
f, err := os.OpenFile(fn, os.O_CREATE|os.O_RDWR, 0666)
if err != nil {
return err
}
defer f.Close()
// Log some information about the segments.
stat, err := os.Stat(f.Name())
if err != nil {
return err
}
cl.Logger.Info(fmt.Sprintf("reading file %s, size %d", f.Name(), stat.Size()))
// Nothing to read, skip it
if stat.Size() == 0 {
return nil
}
if r == nil {
r = NewWALSegmentReader(f)
defer r.Close()
} else {
r.Reset(f)
}
for r.Next() {
entry, err := r.Read()
if err != nil {
n := r.Count()
cl.Logger.Info(fmt.Sprintf("file %s corrupt at position %d, truncating", f.Name(), n))
if err := f.Truncate(n); err != nil {
return err
}
break
}
switch t := entry.(type) {
case *WriteWALEntry:
if err := cache.WriteMulti(t.Values); err != nil {
return err
}
case *DeleteRangeWALEntry:
cache.DeleteRange(t.Keys, t.Min, t.Max)
case *DeleteWALEntry:
cache.Delete(t.Keys)
}
}
return r.Close()
}(); err != nil {
return err
}
}
return nil
}
// WithLogger sets the logger on the CacheLoader.
func (cl *CacheLoader) WithLogger(log zap.Logger) {
cl.Logger = log.With(zap.String("service", "cacheloader"))
}
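Putting the pieces together, a CacheLoader is typically constructed with the WAL segment paths found at startup and replayed, in order, into a fresh Cache. A hedged sketch (the memory limit and helper name are assumptions):

```go
package tsm1

// rebuildCache is a hypothetical helper showing the CacheLoader flow: replay
// the given WAL segments, in order, into a new Cache.
func rebuildCache(segments []string) (*Cache, error) {
	cache := NewCache(1<<30, "") // assumed 1 GiB memory limit
	loader := NewCacheLoader(segments)
	if err := loader.Load(cache); err != nil {
		// Corrupt segments are truncated and replay continues, so an error
		// here signals a more serious I/O failure.
		return nil, err
	}
	return cache, nil
}
```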
// UpdateAge updates the age statistic based on the current time.
func (c *Cache) UpdateAge() {
c.mu.RLock()
defer c.mu.RUnlock()
ageStat := int64(time.Since(c.lastSnapshot) / time.Millisecond)
atomic.StoreInt64(&c.stats.CacheAgeMs, ageStat)
}
// UpdateCompactTime updates WAL compaction time statistic based on d.
func (c *Cache) UpdateCompactTime(d time.Duration) {
atomic.AddInt64(&c.stats.WALCompactionTimeMs, int64(d/time.Millisecond))
}
// updateCachedBytes increases the cachedBytes counter by b.
func (c *Cache) updateCachedBytes(b uint64) {
atomic.AddInt64(&c.stats.CachedBytes, int64(b))
}
// updateMemSize updates the memSize level by b.
func (c *Cache) updateMemSize(b int64) {
atomic.AddInt64(&c.stats.MemSizeBytes, b)
}
func valueType(v Value) int {
switch v.(type) {
case FloatValue:
return 1
case IntegerValue:
return 2
case StringValue:
return 3
case BooleanValue:
return 4
default:
return 0
}
}
// updateSnapshots updates the snapshotsCount and the diskSize levels.
func (c *Cache) updateSnapshots() {
// Update disk stats
atomic.StoreInt64(&c.stats.DiskSizeBytes, int64(atomic.LoadUint64(&c.snapshotSize)))
atomic.StoreInt64(&c.stats.SnapshotCount, int64(c.snapshotAttempts))
}
type emptyStore struct{}
func (e emptyStore) entry(key string) (*entry, bool) { return nil, false }
func (e emptyStore) write(key string, values Values) error { return nil }
func (e emptyStore) add(key string, entry *entry) {}
func (e emptyStore) remove(key string) {}
func (e emptyStore) keys(sorted bool) []string { return nil }
func (e emptyStore) apply(f func(string, *entry) error) error { return nil }
func (e emptyStore) applySerial(f func(string, *entry) error) error { return nil }
func (e emptyStore) reset() {}

View File

@@ -0,0 +1,206 @@
package tsm1_test
import (
"fmt"
"math/rand"
"sync"
"testing"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func TestCacheCheckConcurrentReadsAreSafe(t *testing.T) {
values := make(tsm1.Values, 1000)
timestamps := make([]int64, len(values))
series := make([]string, 100)
for i := range timestamps {
timestamps[i] = int64(rand.Int63n(int64(len(values))))
}
for i := range values {
values[i] = tsm1.NewValue(timestamps[i*len(timestamps)/len(values)], float64(i))
}
for i := range series {
series[i] = fmt.Sprintf("series%d", i)
}
wg := sync.WaitGroup{}
c := tsm1.NewCache(1000000, "")
ch := make(chan struct{})
for _, s := range series {
for _, v := range values {
c.Write(s, tsm1.Values{v})
}
wg.Add(3)
go func(s string) {
defer wg.Done()
<-ch
c.Values(s)
}(s)
go func(s string) {
defer wg.Done()
<-ch
c.Values(s)
}(s)
go func(s string) {
defer wg.Done()
<-ch
c.Values(s)
}(s)
}
close(ch)
wg.Wait()
}
func TestCacheRace(t *testing.T) {
values := make(tsm1.Values, 1000)
timestamps := make([]int64, len(values))
series := make([]string, 100)
for i := range timestamps {
timestamps[i] = int64(rand.Int63n(int64(len(values))))
}
for i := range values {
values[i] = tsm1.NewValue(timestamps[i*len(timestamps)/len(values)], float64(i))
}
for i := range series {
series[i] = fmt.Sprintf("series%d", i)
}
wg := sync.WaitGroup{}
c := tsm1.NewCache(1000000, "")
ch := make(chan struct{})
for _, s := range series {
for _, v := range values {
c.Write(s, tsm1.Values{v})
}
wg.Add(1)
go func(s string) {
defer wg.Done()
<-ch
c.Values(s)
}(s)
}
errC := make(chan error)
wg.Add(1)
go func() {
defer wg.Done()
<-ch
s, err := c.Snapshot()
if err == tsm1.ErrSnapshotInProgress {
return
}
if err != nil {
errC <- fmt.Errorf("failed to snapshot cache: %v", err)
return
}
s.Deduplicate()
c.ClearSnapshot(true)
}()
close(ch)
go func() {
wg.Wait()
close(errC)
}()
for err := range errC {
if err != nil {
t.Error(err)
}
}
}
func TestCacheRace2Compacters(t *testing.T) {
values := make(tsm1.Values, 1000)
timestamps := make([]int64, len(values))
series := make([]string, 100)
for i := range timestamps {
timestamps[i] = int64(rand.Int63n(int64(len(values))))
}
for i := range values {
values[i] = tsm1.NewValue(timestamps[i*len(timestamps)/len(values)], float64(i))
}
for i := range series {
series[i] = fmt.Sprintf("series%d", i)
}
wg := sync.WaitGroup{}
c := tsm1.NewCache(1000000, "")
ch := make(chan struct{})
for _, s := range series {
for _, v := range values {
c.Write(s, tsm1.Values{v})
}
wg.Add(1)
go func(s string) {
defer wg.Done()
<-ch
c.Values(s)
}(s)
}
fileCounter := 0
mapFiles := map[int]bool{}
mu := sync.Mutex{}
errC := make(chan error)
for i := 0; i < 2; i++ {
wg.Add(1)
go func() {
defer wg.Done()
<-ch
s, err := c.Snapshot()
if err == tsm1.ErrSnapshotInProgress {
return
}
if err != nil {
errC <- fmt.Errorf("failed to snapshot cache: %v", err)
return
}
mu.Lock()
mapFiles[fileCounter] = true
fileCounter++
myFiles := map[int]bool{}
for k, e := range mapFiles {
myFiles[k] = e
}
mu.Unlock()
s.Deduplicate()
c.ClearSnapshot(true)
mu.Lock()
defer mu.Unlock()
for k := range myFiles {
if _, ok := mapFiles[k]; !ok {
errC <- fmt.Errorf("something else deleted one of my files")
return
} else {
delete(mapFiles, k)
}
}
}()
}
close(ch)
go func() {
wg.Wait()
close(errC)
}()
for err := range errC {
if err != nil {
t.Error(err)
}
}
}

View File

@@ -0,0 +1,883 @@
package tsm1
import (
"errors"
"fmt"
"io/ioutil"
"math"
"math/rand"
"os"
"reflect"
"runtime"
"strings"
"sync"
"sync/atomic"
"testing"
"github.com/golang/snappy"
)
func TestCache_NewCache(t *testing.T) {
c := NewCache(100, "")
if c == nil {
t.Fatalf("failed to create new cache")
}
if c.MaxSize() != 100 {
t.Fatalf("new cache max size not correct")
}
if c.Size() != 0 {
t.Fatalf("new cache size not correct")
}
if len(c.Keys()) != 0 {
t.Fatalf("new cache keys not correct: %v", c.Keys())
}
}
func TestCache_CacheWrite(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(3*valuesSize, "")
if err := c.Write("foo", values); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if err := c.Write("bar", values); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if n := c.Size(); n != 2*valuesSize {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", 2*valuesSize, n)
}
if exp, keys := []string{"bar", "foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
}
func TestCache_CacheWrite_TypeConflict(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, int(64))
values := Values{v0, v1}
valuesSize := v0.Size() + v1.Size()
c := NewCache(uint64(2*valuesSize), "")
if err := c.Write("foo", values[:1]); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if err := c.Write("foo", values[1:]); err == nil {
t.Fatalf("expected field type conflict")
}
if exp, got := uint64(v0.Size()), c.Size(); exp != got {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", exp, got)
}
}
func TestCache_CacheWriteMulti(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(30*valuesSize, "")
if err := c.WriteMulti(map[string][]Value{"foo": values, "bar": values}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if n := c.Size(); n != 2*valuesSize {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", 2*valuesSize, n)
}
if exp, keys := []string{"bar", "foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
}
// Tests that the cache stats and size are correctly maintained during writes.
func TestCache_WriteMulti_Stats(t *testing.T) {
limit := uint64(1)
c := NewCache(limit, "")
ms := NewTestStore()
c.store = ms
// Not enough room in the cache.
v := NewValue(1, 1.0)
values := map[string][]Value{"foo": []Value{v, v}}
if got, exp := c.WriteMulti(values), ErrCacheMemorySizeLimitExceeded(uint64(v.Size()*2), limit); !reflect.DeepEqual(got, exp) {
t.Fatalf("got %q, expected %q", got, exp)
}
// Fail one of the values in the write.
c = NewCache(50, "")
c.init()
c.store = ms
ms.writef = func(key string, v Values) error {
if key == "foo" {
return errors.New("write failed")
}
return nil
}
values = map[string][]Value{"foo": []Value{v, v}, "bar": []Value{v}}
if got, exp := c.WriteMulti(values), errors.New("write failed"); !reflect.DeepEqual(got, exp) {
t.Fatalf("got %v, expected %v", got, exp)
}
// Cache size decreased correctly.
if got, exp := c.Size(), uint64(16); got != exp {
t.Fatalf("got %v, expected %v", got, exp)
}
// Write stats updated
if got, exp := c.stats.WriteDropped, int64(1); got != exp {
t.Fatalf("got %v, expected %v", got, exp)
} else if got, exp := c.stats.WriteErr, int64(1); got != exp {
t.Fatalf("got %v, expected %v", got, exp)
}
}
func TestCache_CacheWriteMulti_TypeConflict(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, int64(3))
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(3*valuesSize, "")
if err := c.WriteMulti(map[string][]Value{"foo": values[:1], "bar": values[1:]}); err == nil {
t.Fatalf(" expected field type conflict")
}
if exp, got := uint64(v0.Size()), c.Size(); exp != got {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", exp, got)
}
if exp, keys := []string{"foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
}
func TestCache_Cache_DeleteRange(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(30*valuesSize, "")
if err := c.WriteMulti(map[string][]Value{"foo": values, "bar": values}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if n := c.Size(); n != 2*valuesSize {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", 2*valuesSize, n)
}
if exp, keys := []string{"bar", "foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
c.DeleteRange([]string{"bar"}, 2, math.MaxInt64)
if exp, keys := []string{"bar", "foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
if got, exp := c.Size(), valuesSize+uint64(v0.Size()); exp != got {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", exp, got)
}
if got, exp := len(c.Values("bar")), 1; got != exp {
t.Fatalf("cache values mismatch: got %v, exp %v", got, exp)
}
if got, exp := len(c.Values("foo")), 3; got != exp {
t.Fatalf("cache values mismatch: got %v, exp %v", got, exp)
}
}
func TestCache_DeleteRange_NoValues(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(3*valuesSize, "")
if err := c.WriteMulti(map[string][]Value{"foo": values}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if n := c.Size(); n != valuesSize {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", 2*valuesSize, n)
}
if exp, keys := []string{"foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
c.DeleteRange([]string{"foo"}, math.MinInt64, math.MaxInt64)
if exp, keys := 0, len(c.Keys()); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
if got, exp := c.Size(), uint64(0); exp != got {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", exp, got)
}
if got, exp := len(c.Values("foo")), 0; got != exp {
t.Fatalf("cache values mismatch: got %v, exp %v", got, exp)
}
}
func TestCache_Cache_Delete(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
values := Values{v0, v1, v2}
valuesSize := uint64(v0.Size() + v1.Size() + v2.Size())
c := NewCache(30*valuesSize, "")
if err := c.WriteMulti(map[string][]Value{"foo": values, "bar": values}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if n := c.Size(); n != 2*valuesSize {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", 2*valuesSize, n)
}
if exp, keys := []string{"bar", "foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
c.Delete([]string{"bar"})
if exp, keys := []string{"foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
if got, exp := c.Size(), valuesSize; exp != got {
t.Fatalf("cache size incorrect after 2 writes, exp %d, got %d", exp, got)
}
if got, exp := len(c.Values("bar")), 0; got != exp {
t.Fatalf("cache values mismatch: got %v, exp %v", got, exp)
}
if got, exp := len(c.Values("foo")), 3; got != exp {
t.Fatalf("cache values mismatch: got %v, exp %v", got, exp)
}
}
func TestCache_Cache_Delete_NonExistent(t *testing.T) {
c := NewCache(1024, "")
c.Delete([]string{"bar"})
if got, exp := c.Size(), uint64(0); exp != got {
t.Fatalf("cache size incorrect exp %d, got %d", exp, got)
}
}
// This tests writing two batches to the same series. The first batch
// is sorted. The second batch is also sorted but contains duplicates.
func TestCache_CacheWriteMulti_Duplicates(t *testing.T) {
v0 := NewValue(2, 1.0)
v1 := NewValue(3, 1.0)
values0 := Values{v0, v1}
v3 := NewValue(4, 2.0)
v4 := NewValue(5, 3.0)
v5 := NewValue(5, 3.0)
values1 := Values{v3, v4, v5}
c := NewCache(0, "")
if err := c.WriteMulti(map[string][]Value{"foo": values0}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if err := c.WriteMulti(map[string][]Value{"foo": values1}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if exp, keys := []string{"foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after 2 writes, exp %v, got %v", exp, keys)
}
expAscValues := Values{v0, v1, v3, v5}
if exp, got := len(expAscValues), len(c.Values("foo")); exp != got {
t.Fatalf("value count mismatch: exp: %v, got %v", exp, got)
}
if deduped := c.Values("foo"); !reflect.DeepEqual(expAscValues, deduped) {
t.Fatalf("deduped ascending values for foo incorrect, exp: %v, got %v", expAscValues, deduped)
}
}
func TestCache_CacheValues(t *testing.T) {
v0 := NewValue(1, 0.0)
v1 := NewValue(2, 2.0)
v2 := NewValue(3, 3.0)
v3 := NewValue(1, 1.0)
v4 := NewValue(4, 4.0)
c := NewCache(512, "")
if deduped := c.Values("no such key"); deduped != nil {
t.Fatalf("Values returned for no such key")
}
if err := c.Write("foo", Values{v0, v1, v2, v3}); err != nil {
t.Fatalf("failed to write 3 values, key foo to cache: %s", err.Error())
}
if err := c.Write("foo", Values{v4}); err != nil {
t.Fatalf("failed to write 1 value, key foo to cache: %s", err.Error())
}
expAscValues := Values{v3, v1, v2, v4}
if deduped := c.Values("foo"); !reflect.DeepEqual(expAscValues, deduped) {
t.Fatalf("deduped ascending values for foo incorrect, exp: %v, got %v", expAscValues, deduped)
}
}
func TestCache_CacheSnapshot(t *testing.T) {
v0 := NewValue(2, 0.0)
v1 := NewValue(3, 2.0)
v2 := NewValue(4, 3.0)
v3 := NewValue(5, 4.0)
v4 := NewValue(6, 5.0)
v5 := NewValue(1, 5.0)
v6 := NewValue(7, 5.0)
v7 := NewValue(2, 5.0)
c := NewCache(512, "")
if err := c.Write("foo", Values{v0, v1, v2, v3}); err != nil {
t.Fatalf("failed to write 3 values, key foo to cache: %s", err.Error())
}
// Grab snapshot, and ensure it's as expected.
snapshot, err := c.Snapshot()
if err != nil {
t.Fatalf("failed to snapshot cache: %v", err)
}
expValues := Values{v0, v1, v2, v3}
if deduped := snapshot.values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("snapshotted values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
// Ensure cache is still as expected.
if deduped := c.Values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("post-snapshot values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
// Write a new value to the cache.
if err := c.Write("foo", Values{v4}); err != nil {
t.Fatalf("failed to write post-snap value, key foo to cache: %s", err.Error())
}
expValues = Values{v0, v1, v2, v3, v4}
if deduped := c.Values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("post-snapshot write values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
// Write a new, out-of-order, value to the cache.
if err := c.Write("foo", Values{v5}); err != nil {
t.Fatalf("failed to write post-snap value, key foo to cache: %s", err.Error())
}
expValues = Values{v5, v0, v1, v2, v3, v4}
if deduped := c.Values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("post-snapshot out-of-order write values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
// Clear snapshot, ensuring non-snapshot data untouched.
c.ClearSnapshot(true)
expValues = Values{v5, v4}
if deduped := c.Values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("post-clear values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
// Create another snapshot
snapshot, err = c.Snapshot()
if err != nil {
t.Fatalf("failed to snapshot cache: %v", err)
}
if err := c.Write("foo", Values{v4, v5}); err != nil {
t.Fatalf("failed to write post-snap value, key foo to cache: %s", err.Error())
}
c.ClearSnapshot(true)
snapshot, err = c.Snapshot()
if err != nil {
t.Fatalf("failed to snapshot cache: %v", err)
}
if err := c.Write("foo", Values{v6, v7}); err != nil {
t.Fatalf("failed to write post-snap value, key foo to cache: %s", err.Error())
}
expValues = Values{v5, v7, v4, v6}
if deduped := c.Values("foo"); !reflect.DeepEqual(expValues, deduped) {
t.Fatalf("post-snapshot out-of-order write values for foo incorrect, exp: %v, got %v", expValues, deduped)
}
}
// Tests that Snapshot updates statistics correctly.
func TestCache_Snapshot_Stats(t *testing.T) {
limit := uint64(16)
c := NewCache(limit, "")
values := map[string][]Value{"foo": []Value{NewValue(1, 1.0)}}
if err := c.WriteMulti(values); err != nil {
t.Fatal(err)
}
_, err := c.Snapshot()
if err != nil {
t.Fatal(err)
}
// Store size should have been reset.
if got, exp := c.Size(), uint64(16); got != exp {
t.Fatalf("got %v, expected %v", got, exp)
}
// Cached bytes should have been increased.
if got, exp := c.stats.CachedBytes, int64(16); got != exp {
t.Fatalf("got %v, expected %v", got, exp)
}
}
func TestCache_CacheEmptySnapshot(t *testing.T) {
c := NewCache(512, "")
// Grab snapshot, and ensure it's as expected.
snapshot, err := c.Snapshot()
if err != nil {
t.Fatalf("failed to snapshot cache: %v", err)
}
if deduped := snapshot.values("foo"); !reflect.DeepEqual(Values(nil), deduped) {
t.Fatalf("snapshotted values for foo incorrect, exp: %v, got %v", nil, deduped)
}
// Ensure cache is still as expected.
if deduped := c.Values("foo"); !reflect.DeepEqual(Values(nil), deduped) {
t.Fatalf("post-snapshotted values for foo incorrect, exp: %v, got %v", Values(nil), deduped)
}
// Clear snapshot.
c.ClearSnapshot(true)
if deduped := c.Values("foo"); !reflect.DeepEqual(Values(nil), deduped) {
t.Fatalf("post-snapshot-clear values for foo incorrect, exp: %v, got %v", Values(nil), deduped)
}
}
func TestCache_CacheWriteMemoryExceeded(t *testing.T) {
v0 := NewValue(1, 1.0)
v1 := NewValue(2, 2.0)
c := NewCache(uint64(v1.Size()), "")
if err := c.Write("foo", Values{v0}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
if exp, keys := []string{"foo"}, c.Keys(); !reflect.DeepEqual(keys, exp) {
t.Fatalf("cache keys incorrect after writes, exp %v, got %v", exp, keys)
}
if err := c.Write("bar", Values{v1}); err == nil || !strings.Contains(err.Error(), "cache-max-memory-size") {
t.Fatalf("wrong error writing key bar to cache: %v", err)
}
// Grab snapshot, write should still fail since we're still using the memory.
_, err := c.Snapshot()
if err != nil {
t.Fatalf("failed to snapshot cache: %v", err)
}
if err := c.Write("bar", Values{v1}); err == nil || !strings.Contains(err.Error(), "cache-max-memory-size") {
t.Fatalf("wrong error writing key bar to cache: %v", err)
}
// Clear the snapshot and the write should now succeed.
c.ClearSnapshot(true)
if err := c.Write("bar", Values{v1}); err != nil {
t.Fatalf("failed to write key foo to cache: %s", err.Error())
}
expAscValues := Values{v1}
if deduped := c.Values("bar"); !reflect.DeepEqual(expAscValues, deduped) {
t.Fatalf("deduped ascending values for bar incorrect, exp: %v, got %v", expAscValues, deduped)
}
}
func TestCache_Deduplicate_Concurrent(t *testing.T) {
if testing.Short() || os.Getenv("GORACE") != "" || os.Getenv("APPVEYOR") != "" {
t.Skip("Skipping test in short, race, appveyor mode.")
}
values := make(map[string][]Value)
for i := 0; i < 1000; i++ {
for j := 0; j < 100; j++ {
values[fmt.Sprintf("cpu%d", i)] = []Value{NewValue(int64(i+j)+int64(rand.Intn(10)), float64(i))}
}
}
wg := sync.WaitGroup{}
c := NewCache(1000000, "")
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 1000; i++ {
c.WriteMulti(values)
}
}()
wg.Add(1)
go func() {
defer wg.Done()
for i := 0; i < 1000; i++ {
c.Deduplicate()
}
}()
wg.Wait()
}
// Ensure the CacheLoader can correctly load from a single segment, even if it's corrupted.
func TestCacheLoader_LoadSingle(t *testing.T) {
// Create a WAL segment.
dir := mustTempDir()
defer os.RemoveAll(dir)
f := mustTempFile(dir)
w := NewWALSegmentWriter(f)
p1 := NewValue(1, 1.1)
p2 := NewValue(1, int64(1))
p3 := NewValue(1, true)
values := map[string][]Value{
"foo": []Value{p1},
"bar": []Value{p2},
"baz": []Value{p3},
}
entry := &WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
t.Fatal("write points", err)
}
if err := w.Flush(); err != nil {
t.Fatalf("flush error: %v", err)
}
// Load the cache using the segment.
cache := NewCache(1024, "")
loader := NewCacheLoader([]string{f.Name()})
if err := loader.Load(cache); err != nil {
t.Fatalf("failed to load cache: %s", err.Error())
}
// Check the cache.
if values := cache.Values("foo"); !reflect.DeepEqual(values, Values{p1}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p1})
}
if values := cache.Values("bar"); !reflect.DeepEqual(values, Values{p2}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p2})
}
if values := cache.Values("baz"); !reflect.DeepEqual(values, Values{p3}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p3})
}
// Corrupt the WAL segment.
if _, err := f.Write([]byte{1, 4, 0, 0, 0}); err != nil {
t.Fatalf("corrupt WAL segment: %s", err.Error())
}
// Reload the cache using the segment.
cache = NewCache(1024, "")
loader = NewCacheLoader([]string{f.Name()})
if err := loader.Load(cache); err != nil {
t.Fatalf("failed to load cache: %s", err.Error())
}
// Check the cache.
if values := cache.Values("foo"); !reflect.DeepEqual(values, Values{p1}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p1})
}
if values := cache.Values("bar"); !reflect.DeepEqual(values, Values{p2}) {
t.Fatalf("cache key bar not as expected, got %v, exp %v", values, Values{p2})
}
if values := cache.Values("baz"); !reflect.DeepEqual(values, Values{p3}) {
t.Fatalf("cache key baz not as expected, got %v, exp %v", values, Values{p3})
}
}
// Ensure the CacheLoader can correctly load from two segments, even if one is corrupted.
func TestCacheLoader_LoadDouble(t *testing.T) {
// Create a WAL segment.
dir := mustTempDir()
defer os.RemoveAll(dir)
f1, f2 := mustTempFile(dir), mustTempFile(dir)
w1, w2 := NewWALSegmentWriter(f1), NewWALSegmentWriter(f2)
p1 := NewValue(1, 1.1)
p2 := NewValue(1, int64(1))
p3 := NewValue(1, true)
p4 := NewValue(1, "string")
// Write first and second segment.
segmentWrite := func(w *WALSegmentWriter, values map[string][]Value) {
entry := &WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
t.Fatal("write points", err)
}
if err := w.Flush(); err != nil {
t.Fatalf("flush error: %v", err)
}
}
values := map[string][]Value{
"foo": []Value{p1},
"bar": []Value{p2},
}
segmentWrite(w1, values)
values = map[string][]Value{
"baz": []Value{p3},
"qux": []Value{p4},
}
segmentWrite(w2, values)
// Corrupt the first WAL segment.
if _, err := f1.Write([]byte{1, 4, 0, 0, 0}); err != nil {
t.Fatalf("corrupt WAL segment: %s", err.Error())
}
// Load the cache using the segments.
cache := NewCache(1024, "")
loader := NewCacheLoader([]string{f1.Name(), f2.Name()})
if err := loader.Load(cache); err != nil {
t.Fatalf("failed to load cache: %s", err.Error())
}
// Check the cache.
if values := cache.Values("foo"); !reflect.DeepEqual(values, Values{p1}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p1})
}
if values := cache.Values("bar"); !reflect.DeepEqual(values, Values{p2}) {
t.Fatalf("cache key bar not as expected, got %v, exp %v", values, Values{p2})
}
if values := cache.Values("baz"); !reflect.DeepEqual(values, Values{p3}) {
t.Fatalf("cache key baz not as expected, got %v, exp %v", values, Values{p3})
}
if values := cache.Values("qux"); !reflect.DeepEqual(values, Values{p4}) {
t.Fatalf("cache key qux not as expected, got %v, exp %v", values, Values{p4})
}
}
// Ensure the CacheLoader can load deleted series
func TestCacheLoader_LoadDeleted(t *testing.T) {
// Create a WAL segment.
dir := mustTempDir()
defer os.RemoveAll(dir)
f := mustTempFile(dir)
w := NewWALSegmentWriter(f)
p1 := NewValue(1, 1.0)
p2 := NewValue(2, 2.0)
p3 := NewValue(3, 3.0)
values := map[string][]Value{
"foo": []Value{p1, p2, p3},
}
entry := &WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
t.Fatal("write points", err)
}
if err := w.Flush(); err != nil {
t.Fatalf("flush error: %v", err)
}
dentry := &DeleteRangeWALEntry{
Keys: []string{"foo"},
Min: 2,
Max: 3,
}
if err := w.Write(mustMarshalEntry(dentry)); err != nil {
t.Fatal("write points", err)
}
if err := w.Flush(); err != nil {
t.Fatalf("flush error: %v", err)
}
// Load the cache using the segment.
cache := NewCache(1024, "")
loader := NewCacheLoader([]string{f.Name()})
if err := loader.Load(cache); err != nil {
t.Fatalf("failed to load cache: %s", err.Error())
}
// Check the cache.
if values := cache.Values("foo"); !reflect.DeepEqual(values, Values{p1}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p1})
}
// Reload the cache using the segment.
cache = NewCache(1024, "")
loader = NewCacheLoader([]string{f.Name()})
if err := loader.Load(cache); err != nil {
t.Fatalf("failed to load cache: %s", err.Error())
}
// Check the cache.
if values := cache.Values("foo"); !reflect.DeepEqual(values, Values{p1}) {
t.Fatalf("cache key foo not as expected, got %v, exp %v", values, Values{p1})
}
}
func mustTempDir() string {
dir, err := ioutil.TempDir("", "tsm1-test")
if err != nil {
panic(fmt.Sprintf("failed to create temp dir: %v", err))
}
return dir
}
func mustTempFile(dir string) *os.File {
f, err := ioutil.TempFile(dir, "tsm1test")
if err != nil {
panic(fmt.Sprintf("failed to create temp file: %v", err))
}
return f
}
func mustMarshalEntry(entry WALEntry) (WalEntryType, []byte) {
bytes := make([]byte, 1024<<2)
b, err := entry.Encode(bytes)
if err != nil {
panic(fmt.Sprintf("error encoding: %v", err))
}
return entry.Type(), snappy.Encode(b, b)
}
// TestStore implements the storer interface and can be used to mock out a
// Cache's storer implementation.
type TestStore struct {
entryf func(key string) (*entry, bool)
writef func(key string, values Values) error
addf func(key string, entry *entry)
removef func(key string)
keysf func(sorted bool) []string
applyf func(f func(string, *entry) error) error
applySerialf func(f func(string, *entry) error) error
resetf func()
}
func NewTestStore() *TestStore { return &TestStore{} }
func (s *TestStore) entry(key string) (*entry, bool) { return s.entryf(key) }
func (s *TestStore) write(key string, values Values) error { return s.writef(key, values) }
func (s *TestStore) add(key string, entry *entry) { s.addf(key, entry) }
func (s *TestStore) remove(key string) { s.removef(key) }
func (s *TestStore) keys(sorted bool) []string { return s.keysf(sorted) }
func (s *TestStore) apply(f func(string, *entry) error) error { return s.applyf(f) }
func (s *TestStore) applySerial(f func(string, *entry) error) error { return s.applySerialf(f) }
func (s *TestStore) reset() { s.resetf() }
var fvSize = uint64(NewValue(1, float64(1)).Size())
func BenchmarkCacheFloatEntries(b *testing.B) {
cache := NewCache(uint64(b.N)*fvSize, "")
vals := make([][]Value, b.N)
for i := 0; i < b.N; i++ {
vals[i] = []Value{NewValue(1, float64(i))}
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
if err := cache.Write("test", vals[i]); err != nil {
b.Fatal("err:", err, "i:", i, "N:", b.N)
}
}
}
type points struct {
key string
vals []Value
}
func BenchmarkCacheParallelFloatEntries(b *testing.B) {
c := b.N * runtime.GOMAXPROCS(0)
cache := NewCache(uint64(c)*fvSize*10, "")
vals := make([]points, c)
for i := 0; i < c; i++ {
v := make([]Value, 10)
for j := 0; j < 10; j++ {
v[j] = NewValue(1, float64(i+j))
}
vals[i] = points{key: fmt.Sprintf("cpu%v", rand.Intn(20)), vals: v}
}
i := int32(-1)
b.ResetTimer()
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
j := atomic.AddInt32(&i, 1)
v := vals[j]
if err := cache.Write(v.key, v.vals); err != nil {
b.Fatal("err:", err, "j:", j, "N:", b.N)
}
}
})
}
func BenchmarkEntry_add(b *testing.B) {
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
b.StopTimer()
values := make([]Value, 10)
for i := 0; i < 10; i++ {
values[i] = NewValue(int64(i+1), float64(i))
}
otherValues := make([]Value, 10)
for i := 0; i < 10; i++ {
otherValues[i] = NewValue(1, float64(i))
}
entry, err := newEntryValues(values, 0) // Will use default allocation size.
if err != nil {
b.Fatal(err)
}
b.StartTimer()
if err := entry.add(otherValues); err != nil {
b.Fatal(err)
}
}
})
}

View File

@@ -0,0 +1,867 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
//
// DO NOT EDIT!
// Source: compact.gen.go.tmpl
package tsm1
import (
"runtime"
)
// merge combines the next set of blocks into merged blocks.
func (k *tsmKeyIterator) mergeFloat() {
// No blocks left and no pending merged values: we're done
if len(k.blocks) == 0 && len(k.merged) == 0 && len(k.mergedFloatValues) == 0 {
return
}
dedup := len(k.mergedFloatValues) != 0
if len(k.blocks) > 0 && !dedup {
// If we have more than one block or any partially tombstoned blocks, we may need to dedup
dedup = len(k.blocks[0].tombstones) > 0 || k.blocks[0].partiallyRead()
// Quickly scan each block to see if any overlap with the prior block, if they overlap then
// we need to dedup as there may be duplicate points now
for i := 1; !dedup && i < len(k.blocks); i++ {
if k.blocks[i].partiallyRead() {
dedup = true
break
}
if k.blocks[i].minTime <= k.blocks[i-1].maxTime || len(k.blocks[i].tombstones) > 0 {
dedup = true
break
}
}
}
k.merged = k.combineFloat(dedup)
}
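The dedup decision above reduces to a pairwise overlap test between consecutive blocks sorted by minTime. A standalone sketch of that predicate (the `span` type stands in for the `block` struct's time fields):

```go
package main

import "fmt"

// span mirrors the minTime/maxTime fields of a block.
type span struct{ minTime, maxTime int64 }

// needsDedup reports whether blocks sorted by minTime overlap: a block that
// starts at or before the previous block's end may contain duplicate points.
func needsDedup(blocks []span) bool {
	for i := 1; i < len(blocks); i++ {
		if blocks[i].minTime <= blocks[i-1].maxTime {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsDedup([]span{{0, 5}, {6, 9}})) // false: disjoint
	fmt.Println(needsDedup([]span{{0, 5}, {5, 9}})) // true: shared timestamp
}
```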
// combine returns a new set of blocks using the current blocks in the buffers. If dedup
// is true, all the blocks will be decoded, deduped and sorted in order. If dedup is false,
// only blocks that are smaller than the chunk size will be decoded and combined.
func (k *tsmKeyIterator) combineFloat(dedup bool) blocks {
if dedup {
for len(k.mergedFloatValues) < k.size && len(k.blocks) > 0 {
for len(k.blocks) > 0 && k.blocks[0].read() {
k.blocks = k.blocks[1:]
}
if len(k.blocks) == 0 {
break
}
first := k.blocks[0]
minTime := first.minTime
maxTime := first.maxTime
// Adjust the min/max times to the boundaries of any overlapping blocks.
for i := 0; i < len(k.blocks); i++ {
if k.blocks[i].overlapsTimeRange(minTime, maxTime) && !k.blocks[i].read() {
if k.blocks[i].minTime < minTime {
minTime = k.blocks[i].minTime
}
if k.blocks[i].maxTime > minTime && k.blocks[i].maxTime < maxTime {
maxTime = k.blocks[i].maxTime
}
}
}
// We have some overlapping blocks so decode all, append in order and then dedup
for i := 0; i < len(k.blocks); i++ {
if !k.blocks[i].overlapsTimeRange(minTime, maxTime) || k.blocks[i].read() {
continue
}
v, err := DecodeFloatBlock(k.blocks[i].b, &[]FloatValue{})
if err != nil {
k.err = err
return nil
}
// Remove values we already read
v = FloatValues(v).Exclude(k.blocks[i].readMin, k.blocks[i].readMax)
// Filter out only the values for overlapping block
v = FloatValues(v).Include(minTime, maxTime)
if len(v) > 0 {
// Record that we read a subset of the block
k.blocks[i].markRead(v[0].UnixNano(), v[len(v)-1].UnixNano())
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = FloatValues(v).Exclude(ts.Min, ts.Max)
}
k.mergedFloatValues = k.mergedFloatValues.Merge(v)
// Allow other goroutines to run
runtime.Gosched()
}
}
// Since we combined multiple blocks, we could have more values than we should put into
// a single block. We need to chunk them up into groups and re-encode them.
return k.chunkFloat(nil)
} else {
var chunked blocks
var i int
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
// If this block is already full, just add it as is
if BlockCount(k.blocks[i].b) >= k.size {
chunked = append(chunked, k.blocks[i])
} else {
break
}
i++
// Allow other goroutines to run
runtime.Gosched()
}
if k.fast {
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
chunked = append(chunked, k.blocks[i])
i++
// Allow other goroutines to run
runtime.Gosched()
}
}
// If we only have 1 block left, just append it as is and avoid decoding/recoding
if i == len(k.blocks)-1 {
if !k.blocks[i].read() {
chunked = append(chunked, k.blocks[i])
}
i++
}
// The remaining blocks can be combined and we know that they do not overlap and
// so we can just append each, sort and re-encode.
for i < len(k.blocks) && len(k.mergedFloatValues) < k.size {
if k.blocks[i].read() {
i++
continue
}
v, err := DecodeFloatBlock(k.blocks[i].b, &[]FloatValue{})
if err != nil {
k.err = err
return nil
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = FloatValues(v).Exclude(ts.Min, ts.Max)
}
k.blocks[i].markRead(k.blocks[i].minTime, k.blocks[i].maxTime)
k.mergedFloatValues = k.mergedFloatValues.Merge(v)
i++
// Allow other goroutines to run
runtime.Gosched()
}
k.blocks = k.blocks[i:]
return k.chunkFloat(chunked)
}
}
func (k *tsmKeyIterator) chunkFloat(dst blocks) blocks {
if len(k.mergedFloatValues) > k.size {
values := k.mergedFloatValues[:k.size]
cb, err := FloatValues(values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: values[0].UnixNano(),
maxTime: values[len(values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedFloatValues = k.mergedFloatValues[k.size:]
return dst
}
// Re-encode the remaining values into the last block
if len(k.mergedFloatValues) > 0 {
cb, err := FloatValues(k.mergedFloatValues).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: k.mergedFloatValues[0].UnixNano(),
maxTime: k.mergedFloatValues[len(k.mergedFloatValues)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedFloatValues = k.mergedFloatValues[:0]
}
return dst
}
// merge combines the next set of blocks into merged blocks.
func (k *tsmKeyIterator) mergeInteger() {
// No blocks left and no pending merged values: we're done
if len(k.blocks) == 0 && len(k.merged) == 0 && len(k.mergedIntegerValues) == 0 {
return
}
dedup := len(k.mergedIntegerValues) != 0
if len(k.blocks) > 0 && !dedup {
// If we have more than one block or any partially tombstoned blocks, we may need to dedup
dedup = len(k.blocks[0].tombstones) > 0 || k.blocks[0].partiallyRead()
// Quickly scan each block to see if any overlap with the prior block, if they overlap then
// we need to dedup as there may be duplicate points now
for i := 1; !dedup && i < len(k.blocks); i++ {
if k.blocks[i].partiallyRead() {
dedup = true
break
}
if k.blocks[i].minTime <= k.blocks[i-1].maxTime || len(k.blocks[i].tombstones) > 0 {
dedup = true
break
}
}
}
k.merged = k.combineInteger(dedup)
}
// combine returns a new set of blocks using the current blocks in the buffers. If dedup
// is true, all the blocks will be decoded, deduped and sorted in order. If dedup is false,
// only blocks that are smaller than the chunk size will be decoded and combined.
func (k *tsmKeyIterator) combineInteger(dedup bool) blocks {
if dedup {
for len(k.mergedIntegerValues) < k.size && len(k.blocks) > 0 {
for len(k.blocks) > 0 && k.blocks[0].read() {
k.blocks = k.blocks[1:]
}
if len(k.blocks) == 0 {
break
}
first := k.blocks[0]
minTime := first.minTime
maxTime := first.maxTime
// Adjust the min/max times to the boundaries of any overlapping blocks.
for i := 0; i < len(k.blocks); i++ {
if k.blocks[i].overlapsTimeRange(minTime, maxTime) && !k.blocks[i].read() {
if k.blocks[i].minTime < minTime {
minTime = k.blocks[i].minTime
}
if k.blocks[i].maxTime > minTime && k.blocks[i].maxTime < maxTime {
maxTime = k.blocks[i].maxTime
}
}
}
// We have some overlapping blocks so decode all, append in order and then dedup
for i := 0; i < len(k.blocks); i++ {
if !k.blocks[i].overlapsTimeRange(minTime, maxTime) || k.blocks[i].read() {
continue
}
v, err := DecodeIntegerBlock(k.blocks[i].b, &[]IntegerValue{})
if err != nil {
k.err = err
return nil
}
// Remove values we already read
v = IntegerValues(v).Exclude(k.blocks[i].readMin, k.blocks[i].readMax)
// Filter out only the values for overlapping block
v = IntegerValues(v).Include(minTime, maxTime)
if len(v) > 0 {
// Record that we read a subset of the block
k.blocks[i].markRead(v[0].UnixNano(), v[len(v)-1].UnixNano())
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = IntegerValues(v).Exclude(ts.Min, ts.Max)
}
k.mergedIntegerValues = k.mergedIntegerValues.Merge(v)
// Allow other goroutines to run
runtime.Gosched()
}
}
// Since we combined multiple blocks, we could have more values than we should put into
// a single block. We need to chunk them up into groups and re-encode them.
return k.chunkInteger(nil)
} else {
var chunked blocks
var i int
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
// If this block is already full, just add it as is
if BlockCount(k.blocks[i].b) >= k.size {
chunked = append(chunked, k.blocks[i])
} else {
break
}
i++
// Allow other goroutines to run
runtime.Gosched()
}
if k.fast {
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
chunked = append(chunked, k.blocks[i])
i++
// Allow other goroutines to run
runtime.Gosched()
}
}
// If we only have 1 block left, just append it as is and avoid decoding/recoding
if i == len(k.blocks)-1 {
if !k.blocks[i].read() {
chunked = append(chunked, k.blocks[i])
}
i++
}
// The remaining blocks can be combined and we know that they do not overlap and
// so we can just append each, sort and re-encode.
for i < len(k.blocks) && len(k.mergedIntegerValues) < k.size {
if k.blocks[i].read() {
i++
continue
}
v, err := DecodeIntegerBlock(k.blocks[i].b, &[]IntegerValue{})
if err != nil {
k.err = err
return nil
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = IntegerValues(v).Exclude(ts.Min, ts.Max)
}
k.blocks[i].markRead(k.blocks[i].minTime, k.blocks[i].maxTime)
k.mergedIntegerValues = k.mergedIntegerValues.Merge(v)
i++
// Allow other goroutines to run
runtime.Gosched()
}
k.blocks = k.blocks[i:]
return k.chunkInteger(chunked)
}
}
func (k *tsmKeyIterator) chunkInteger(dst blocks) blocks {
if len(k.mergedIntegerValues) > k.size {
values := k.mergedIntegerValues[:k.size]
cb, err := IntegerValues(values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: values[0].UnixNano(),
maxTime: values[len(values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedIntegerValues = k.mergedIntegerValues[k.size:]
return dst
}
// Re-encode the remaining values into the last block
if len(k.mergedIntegerValues) > 0 {
cb, err := IntegerValues(k.mergedIntegerValues).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: k.mergedIntegerValues[0].UnixNano(),
maxTime: k.mergedIntegerValues[len(k.mergedIntegerValues)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedIntegerValues = k.mergedIntegerValues[:0]
}
return dst
}
// merge combines the next set of blocks into merged blocks.
func (k *tsmKeyIterator) mergeString() {
// No blocks left and no pending merged values: we're done
if len(k.blocks) == 0 && len(k.merged) == 0 && len(k.mergedStringValues) == 0 {
return
}
dedup := len(k.mergedStringValues) != 0
if len(k.blocks) > 0 && !dedup {
// If we have more than one block or any partially tombstoned blocks, we may need to dedup
dedup = len(k.blocks[0].tombstones) > 0 || k.blocks[0].partiallyRead()
// Quickly scan each block to see if any overlap with the prior block, if they overlap then
// we need to dedup as there may be duplicate points now
for i := 1; !dedup && i < len(k.blocks); i++ {
if k.blocks[i].partiallyRead() {
dedup = true
break
}
if k.blocks[i].minTime <= k.blocks[i-1].maxTime || len(k.blocks[i].tombstones) > 0 {
dedup = true
break
}
}
}
k.merged = k.combineString(dedup)
}
// combine returns a new set of blocks using the current blocks in the buffers. If dedup
// is true, all the blocks will be decoded, deduped and sorted in order. If dedup is false,
// only blocks that are smaller than the chunk size will be decoded and combined.
func (k *tsmKeyIterator) combineString(dedup bool) blocks {
if dedup {
for len(k.mergedStringValues) < k.size && len(k.blocks) > 0 {
for len(k.blocks) > 0 && k.blocks[0].read() {
k.blocks = k.blocks[1:]
}
if len(k.blocks) == 0 {
break
}
first := k.blocks[0]
minTime := first.minTime
maxTime := first.maxTime
// Adjust the min/max times to the boundaries of any overlapping blocks.
for i := 0; i < len(k.blocks); i++ {
if k.blocks[i].overlapsTimeRange(minTime, maxTime) && !k.blocks[i].read() {
if k.blocks[i].minTime < minTime {
minTime = k.blocks[i].minTime
}
if k.blocks[i].maxTime > minTime && k.blocks[i].maxTime < maxTime {
maxTime = k.blocks[i].maxTime
}
}
}
// We have some overlapping blocks so decode all, append in order and then dedup
for i := 0; i < len(k.blocks); i++ {
if !k.blocks[i].overlapsTimeRange(minTime, maxTime) || k.blocks[i].read() {
continue
}
v, err := DecodeStringBlock(k.blocks[i].b, &[]StringValue{})
if err != nil {
k.err = err
return nil
}
// Remove values we already read
v = StringValues(v).Exclude(k.blocks[i].readMin, k.blocks[i].readMax)
// Filter out only the values for overlapping block
v = StringValues(v).Include(minTime, maxTime)
if len(v) > 0 {
// Record that we read a subset of the block
k.blocks[i].markRead(v[0].UnixNano(), v[len(v)-1].UnixNano())
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = StringValues(v).Exclude(ts.Min, ts.Max)
}
k.mergedStringValues = k.mergedStringValues.Merge(v)
// Allow other goroutines to run
runtime.Gosched()
}
}
// Since we combined multiple blocks, we could have more values than we should put into
// a single block. We need to chunk them up into groups and re-encode them.
return k.chunkString(nil)
} else {
var chunked blocks
var i int
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
// If this block is already full, just add it as is
if BlockCount(k.blocks[i].b) >= k.size {
chunked = append(chunked, k.blocks[i])
} else {
break
}
i++
// Allow other goroutines to run
runtime.Gosched()
}
if k.fast {
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
chunked = append(chunked, k.blocks[i])
i++
// Allow other goroutines to run
runtime.Gosched()
}
}
// If we only have 1 block left, just append it as is and avoid decoding/recoding
if i == len(k.blocks)-1 {
if !k.blocks[i].read() {
chunked = append(chunked, k.blocks[i])
}
i++
}
// The remaining blocks can be combined and we know that they do not overlap and
// so we can just append each, sort and re-encode.
for i < len(k.blocks) && len(k.mergedStringValues) < k.size {
if k.blocks[i].read() {
i++
continue
}
v, err := DecodeStringBlock(k.blocks[i].b, &[]StringValue{})
if err != nil {
k.err = err
return nil
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = StringValues(v).Exclude(ts.Min, ts.Max)
}
k.blocks[i].markRead(k.blocks[i].minTime, k.blocks[i].maxTime)
k.mergedStringValues = k.mergedStringValues.Merge(v)
i++
// Allow other goroutines to run
runtime.Gosched()
}
k.blocks = k.blocks[i:]
return k.chunkString(chunked)
}
}
func (k *tsmKeyIterator) chunkString(dst blocks) blocks {
if len(k.mergedStringValues) > k.size {
values := k.mergedStringValues[:k.size]
cb, err := StringValues(values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: values[0].UnixNano(),
maxTime: values[len(values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedStringValues = k.mergedStringValues[k.size:]
return dst
}
// Re-encode the remaining values into the last block
if len(k.mergedStringValues) > 0 {
cb, err := StringValues(k.mergedStringValues).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: k.mergedStringValues[0].UnixNano(),
maxTime: k.mergedStringValues[len(k.mergedStringValues)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedStringValues = k.mergedStringValues[:0]
}
return dst
}
// merge combines the next set of blocks into merged blocks.
func (k *tsmKeyIterator) mergeBoolean() {
// No blocks left and no pending merged values: we're done
if len(k.blocks) == 0 && len(k.merged) == 0 && len(k.mergedBooleanValues) == 0 {
return
}
dedup := len(k.mergedBooleanValues) != 0
if len(k.blocks) > 0 && !dedup {
// If we have more than one block or any partially tombstoned blocks, we may need to dedup
dedup = len(k.blocks[0].tombstones) > 0 || k.blocks[0].partiallyRead()
// Quickly scan each block to see if any overlap with the prior block, if they overlap then
// we need to dedup as there may be duplicate points now
for i := 1; !dedup && i < len(k.blocks); i++ {
if k.blocks[i].partiallyRead() {
dedup = true
break
}
if k.blocks[i].minTime <= k.blocks[i-1].maxTime || len(k.blocks[i].tombstones) > 0 {
dedup = true
break
}
}
}
k.merged = k.combineBoolean(dedup)
}
// combine returns a new set of blocks using the current blocks in the buffers. If dedup
// is true, all the blocks will be decoded, deduped and sorted in order. If dedup is false,
// only blocks that are smaller than the chunk size will be decoded and combined.
func (k *tsmKeyIterator) combineBoolean(dedup bool) blocks {
if dedup {
for len(k.mergedBooleanValues) < k.size && len(k.blocks) > 0 {
for len(k.blocks) > 0 && k.blocks[0].read() {
k.blocks = k.blocks[1:]
}
if len(k.blocks) == 0 {
break
}
first := k.blocks[0]
minTime := first.minTime
maxTime := first.maxTime
// Adjust the min/max times to the boundaries of any overlapping blocks.
for i := 0; i < len(k.blocks); i++ {
if k.blocks[i].overlapsTimeRange(minTime, maxTime) && !k.blocks[i].read() {
if k.blocks[i].minTime < minTime {
minTime = k.blocks[i].minTime
}
if k.blocks[i].maxTime > minTime && k.blocks[i].maxTime < maxTime {
maxTime = k.blocks[i].maxTime
}
}
}
// We have some overlapping blocks so decode all, append in order and then dedup
for i := 0; i < len(k.blocks); i++ {
if !k.blocks[i].overlapsTimeRange(minTime, maxTime) || k.blocks[i].read() {
continue
}
v, err := DecodeBooleanBlock(k.blocks[i].b, &[]BooleanValue{})
if err != nil {
k.err = err
return nil
}
// Remove values we already read
v = BooleanValues(v).Exclude(k.blocks[i].readMin, k.blocks[i].readMax)
// Filter out only the values for overlapping block
v = BooleanValues(v).Include(minTime, maxTime)
if len(v) > 0 {
// Record that we read a subset of the block
k.blocks[i].markRead(v[0].UnixNano(), v[len(v)-1].UnixNano())
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = BooleanValues(v).Exclude(ts.Min, ts.Max)
}
k.mergedBooleanValues = k.mergedBooleanValues.Merge(v)
// Allow other goroutines to run
runtime.Gosched()
}
}
// Since we combined multiple blocks, we could have more values than we should put into
// a single block. We need to chunk them up into groups and re-encode them.
return k.chunkBoolean(nil)
} else {
var chunked blocks
var i int
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
// If this block is already full, just add it as is
if BlockCount(k.blocks[i].b) >= k.size {
chunked = append(chunked, k.blocks[i])
} else {
break
}
i++
// Allow other goroutines to run
runtime.Gosched()
}
if k.fast {
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
chunked = append(chunked, k.blocks[i])
i++
// Allow other goroutines to run
runtime.Gosched()
}
}
// If we only have 1 block left, just append it as is and avoid decoding/recoding
if i == len(k.blocks)-1 {
if !k.blocks[i].read() {
chunked = append(chunked, k.blocks[i])
}
i++
}
// The remaining blocks can be combined and we know that they do not overlap and
// so we can just append each, sort and re-encode.
for i < len(k.blocks) && len(k.mergedBooleanValues) < k.size {
if k.blocks[i].read() {
i++
continue
}
v, err := DecodeBooleanBlock(k.blocks[i].b, &[]BooleanValue{})
if err != nil {
k.err = err
return nil
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = BooleanValues(v).Exclude(ts.Min, ts.Max)
}
k.blocks[i].markRead(k.blocks[i].minTime, k.blocks[i].maxTime)
k.mergedBooleanValues = k.mergedBooleanValues.Merge(v)
i++
// Allow other goroutines to run
runtime.Gosched()
}
k.blocks = k.blocks[i:]
return k.chunkBoolean(chunked)
}
}
func (k *tsmKeyIterator) chunkBoolean(dst blocks) blocks {
if len(k.mergedBooleanValues) > k.size {
values := k.mergedBooleanValues[:k.size]
cb, err := BooleanValues(values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: values[0].UnixNano(),
maxTime: values[len(values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedBooleanValues = k.mergedBooleanValues[k.size:]
return dst
}
// Re-encode the remaining values into the last block
if len(k.mergedBooleanValues) > 0 {
cb, err := BooleanValues(k.mergedBooleanValues).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: k.mergedBooleanValues[0].UnixNano(),
maxTime: k.mergedBooleanValues[len(k.mergedBooleanValues)-1].UnixNano(),
key: k.key,
b: cb,
})
k.mergedBooleanValues = k.mergedBooleanValues[:0]
}
return dst
}

View File

@@ -0,0 +1,223 @@
package tsm1
import (
"runtime"
)
{{range .}}
// merge combines the next set of blocks into merged blocks.
func (k *tsmKeyIterator) merge{{.Name}}() {
// No blocks left and no pending merged values: we're done
if len(k.blocks) == 0 && len(k.merged) == 0 && len(k.merged{{.Name}}Values) == 0 {
return
}
dedup := len(k.merged{{.Name}}Values) != 0
if len(k.blocks) > 0 && !dedup {
// If we have more than one block or any partially tombstoned blocks, we may need to dedup
dedup = len(k.blocks[0].tombstones) > 0 || k.blocks[0].partiallyRead()
// Quickly scan each block to see if any overlap with the prior block, if they overlap then
// we need to dedup as there may be duplicate points now
for i := 1; !dedup && i < len(k.blocks); i++ {
if k.blocks[i].partiallyRead() {
dedup = true
break
}
if k.blocks[i].minTime <= k.blocks[i-1].maxTime || len(k.blocks[i].tombstones) > 0 {
dedup = true
break
}
}
}
k.merged = k.combine{{.Name}}(dedup)
}
// combine returns a new set of blocks using the current blocks in the buffers. If dedup
// is true, all the blocks will be decoded, deduped and sorted in order. If dedup is false,
// only blocks that are smaller than the chunk size will be decoded and combined.
func (k *tsmKeyIterator) combine{{.Name}}(dedup bool) blocks {
if dedup {
for len(k.merged{{.Name}}Values) < k.size && len(k.blocks) > 0 {
for len(k.blocks) > 0 && k.blocks[0].read() {
k.blocks = k.blocks[1:]
}
if len(k.blocks) == 0 {
break
}
first := k.blocks[0]
minTime := first.minTime
maxTime := first.maxTime
// Adjust the min/max times to the boundaries of any overlapping blocks.
for i := 0; i < len(k.blocks); i++ {
if k.blocks[i].overlapsTimeRange(minTime, maxTime) && !k.blocks[i].read() {
if k.blocks[i].minTime < minTime {
minTime = k.blocks[i].minTime
}
if k.blocks[i].maxTime > minTime && k.blocks[i].maxTime < maxTime {
maxTime = k.blocks[i].maxTime
}
}
}
// We have some overlapping blocks so decode all, append in order and then dedup
for i := 0; i < len(k.blocks); i++ {
if !k.blocks[i].overlapsTimeRange(minTime, maxTime) || k.blocks[i].read() {
continue
}
v, err := Decode{{.Name}}Block(k.blocks[i].b, &[]{{.Name}}Value{})
if err != nil {
k.err = err
return nil
}
// Remove values we already read
v = {{.Name}}Values(v).Exclude(k.blocks[i].readMin, k.blocks[i].readMax)
// Filter out only the values for overlapping block
v = {{.Name}}Values(v).Include(minTime, maxTime)
if len(v) > 0 {
// Record that we read a subset of the block
k.blocks[i].markRead(v[0].UnixNano(), v[len(v)-1].UnixNano())
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = {{.Name}}Values(v).Exclude(ts.Min, ts.Max)
}
k.merged{{.Name}}Values = k.merged{{.Name}}Values.Merge(v)
// Allow other goroutines to run
runtime.Gosched()
}
}
// Since we combined multiple blocks, we could have more values than we should put into
// a single block. We need to chunk them up into groups and re-encode them.
return k.chunk{{.Name}}(nil)
} else {
var chunked blocks
var i int
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
// If this block is already full, just add it as is
if BlockCount(k.blocks[i].b) >= k.size {
chunked = append(chunked, k.blocks[i])
} else {
break
}
i++
// Allow other goroutines to run
runtime.Gosched()
}
if k.fast {
for i < len(k.blocks) {
// skip this block if its values were already read
if k.blocks[i].read() {
i++
continue
}
chunked = append(chunked, k.blocks[i])
i++
// Allow other goroutines to run
runtime.Gosched()
}
}
// If we only have 1 block left, just append it as is and avoid decoding/recoding
if i == len(k.blocks)-1 {
if !k.blocks[i].read() {
chunked = append(chunked, k.blocks[i])
}
i++
}
// The remaining blocks can be combined and we know that they do not overlap and
// so we can just append each, sort and re-encode.
for i < len(k.blocks) && len(k.merged{{.Name}}Values) < k.size {
if k.blocks[i].read() {
i++
continue
}
v, err := Decode{{.Name}}Block(k.blocks[i].b, &[]{{.Name}}Value{})
if err != nil {
k.err = err
return nil
}
// Apply each tombstone to the block
for _, ts := range k.blocks[i].tombstones {
v = {{.Name}}Values(v).Exclude(ts.Min, ts.Max)
}
k.blocks[i].markRead(k.blocks[i].minTime, k.blocks[i].maxTime)
k.merged{{.Name}}Values = k.merged{{.Name}}Values.Merge(v)
i++
// Allow other goroutines to run
runtime.Gosched()
}
k.blocks = k.blocks[i:]
return k.chunk{{.Name}}(chunked)
}
}
func (k *tsmKeyIterator) chunk{{.Name}}(dst blocks) blocks {
if len(k.merged{{.Name}}Values) > k.size {
values := k.merged{{.Name}}Values[:k.size]
cb, err := {{.Name}}Values(values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: values[0].UnixNano(),
maxTime: values[len(values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.merged{{.Name}}Values = k.merged{{.Name}}Values[k.size:]
return dst
}
// Re-encode the remaining values into the last block
if len(k.merged{{.Name}}Values) > 0 {
cb, err := {{.Name}}Values(k.merged{{.Name}}Values).Encode(nil)
if err != nil {
k.err = err
return nil
}
dst = append(dst, &block{
minTime: k.merged{{.Name}}Values[0].UnixNano(),
maxTime: k.merged{{.Name}}Values[len(k.merged{{.Name}}Values)-1].UnixNano(),
key: k.key,
b: cb,
})
k.merged{{.Name}}Values = k.merged{{.Name}}Values[:0]
}
return dst
}
{{ end }}
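The dedup decision above relies on the invariant that a key's blocks are sorted by minTime: duplicates are only possible when a block begins at or before the previous block's maxTime, carries tombstones, or was partially read. A minimal sketch of that detection, using a hypothetical blockRange type in place of the template's block:

```go
package main

import "fmt"

// blockRange is a hypothetical stand-in for the compaction block type,
// keeping only the two timestamps the overlap scan needs.
type blockRange struct{ minTime, maxTime int64 }

// needsDedup mirrors the scan above: blocks are sorted by minTime, so
// any block starting at or before the prior block's maxTime may contain
// duplicate timestamps.
func needsDedup(blocks []blockRange) bool {
	for i := 1; i < len(blocks); i++ {
		if blocks[i].minTime <= blocks[i-1].maxTime {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(needsDedup([]blockRange{{0, 10}, {11, 20}})) // false: disjoint
	fmt.Println(needsDedup([]blockRange{{0, 10}, {10, 20}})) // true: both contain t=10
}
```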

View File

@@ -0,0 +1,18 @@
[
{
"Name":"Float",
"name":"float"
},
{
"Name":"Integer",
"name":"integer"
},
{
"Name":"String",
"name":"string"
},
{
"Name":"Boolean",
"name":"boolean"
}
]
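This data file drives code generation: benbjohnson/tmpl decodes the JSON array and executes the .tmpl file once per element, so each `{{.Name}}`/`{{.name}}` reference expands to Float/float, Integer/integer, and so on. A rough approximation with the standard library's text/template (the template text here is illustrative, not the real one):

```go
package main

import (
	"os"
	"text/template"
)

func main() {
	// Each map mirrors one entry of the JSON data file above.
	data := []map[string]string{
		{"Name": "Float", "name": "float"},
		{"Name": "Integer", "name": "integer"},
	}
	// A toy template using the same placeholders as the .tmpl files.
	tmpl := template.Must(template.New("gen").Parse(
		"// combine{{.Name}} merges buffered {{.name}} blocks.\n"))
	for _, d := range data {
		if err := tmpl.Execute(os.Stdout, d); err != nil {
			panic(err)
		}
	}
}
```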

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,77 @@
package tsm1
import (
"math"
"github.com/influxdata/influxdb/tsdb"
)
// multiFieldCursor wraps cursors for multiple fields on the same series
// key. Instead of returning a plain interface value in the call for Next(),
// it returns a map[string]interface{} for the field values
type multiFieldCursor struct {
fields []string
cursors []tsdb.Cursor
ascending bool
keyBuffer []int64
valueBuffer []interface{}
}
// NewMultiFieldCursor returns an instance of Cursor that joins the results of cursors.
func NewMultiFieldCursor(fields []string, cursors []tsdb.Cursor, ascending bool) tsdb.Cursor {
return &multiFieldCursor{
fields: fields,
cursors: cursors,
ascending: ascending,
keyBuffer: make([]int64, len(cursors)),
valueBuffer: make([]interface{}, len(cursors)),
}
}
func (m *multiFieldCursor) SeekTo(seek int64) (key int64, value interface{}) {
for i, c := range m.cursors {
m.keyBuffer[i], m.valueBuffer[i] = c.SeekTo(seek)
}
return m.read()
}
func (m *multiFieldCursor) Next() (int64, interface{}) {
return m.read()
}
func (m *multiFieldCursor) Ascending() bool {
return m.ascending
}
func (m *multiFieldCursor) read() (int64, interface{}) {
t := int64(math.MaxInt64)
if !m.ascending {
t = int64(math.MinInt64)
}
// find the time we need to combine all fields
for _, k := range m.keyBuffer {
if k == tsdb.EOF {
continue
}
if m.ascending && t > k {
t = k
} else if !m.ascending && t < k {
t = k
}
}
// get the value and advance each of the cursors that have the matching time
if t == math.MinInt64 || t == math.MaxInt64 {
return tsdb.EOF, nil
}
mm := make(map[string]interface{})
for i, k := range m.keyBuffer {
if k == t {
mm[m.fields[i]] = m.valueBuffer[i]
m.keyBuffer[i], m.valueBuffer[i] = m.cursors[i].Next()
}
}
return t, mm
}
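To see the join behavior, here is a sketch with a hypothetical in-memory cursor (sliceCursor is not part of the package; only the tsdb.Cursor methods it implements are):

```go
package tsm1

import "github.com/influxdata/influxdb/tsdb"

// sliceCursor is a hypothetical in-memory tsdb.Cursor used only to
// exercise NewMultiFieldCursor; timestamps must be sorted ascending.
type sliceCursor struct {
	ts   []int64
	vals []interface{}
	i    int
}

func (c *sliceCursor) SeekTo(seek int64) (int64, interface{}) {
	for c.i = 0; c.i < len(c.ts) && c.ts[c.i] < seek; c.i++ {
	}
	return c.Next()
}

func (c *sliceCursor) Next() (int64, interface{}) {
	if c.i >= len(c.ts) {
		return tsdb.EOF, nil
	}
	t, v := c.ts[c.i], c.vals[c.i]
	c.i++
	return t, v
}

func (c *sliceCursor) Ascending() bool { return true }

// Joining two fields: at t=1 both cursors match, so the returned map
// holds both values; at t=2 and t=3 only one field is present.
//
//	cur := NewMultiFieldCursor(
//		[]string{"usage", "load"},
//		[]tsdb.Cursor{
//			&sliceCursor{ts: []int64{1, 2}, vals: []interface{}{0.5, 0.6}},
//			&sliceCursor{ts: []int64{1, 3}, vals: []interface{}{1.0, 1.5}},
//		},
//		true,
//	)
//	k, v := cur.SeekTo(0) // k == 1, v == map[string]interface{}{"usage": 0.5, "load": 1.0}
```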

View File

@@ -0,0 +1,938 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
//
// DO NOT EDIT!
// Source: encoding.gen.go.tmpl
package tsm1
import (
"fmt"
"sort"
)
// Values represents a slice of values.
type Values []Value
func (a Values) MinTime() int64 {
return a[0].UnixNano()
}
func (a Values) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a Values) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a Values) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a Values) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a Values) Deduplicate() Values {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a Values) Exclude(min, max int64) Values {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a Values) Include(min, max int64) Values {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a Values) Merge(b Values) Values {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make(Values, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
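// Worked example (a sketch): given
//   a = [ (10, 1.0), (20, 2.0) ]
//   b = [ (20, 9.0), (30, 3.0) ]
// a.Merge(b) yields [ (10, 1.0), (20, 9.0), (30, 3.0) ]: the
// equal-timestamp branch above drops a's value at t=20 and the final
// else branch appends b's.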
// Sort methods
func (a Values) Len() int { return len(a) }
func (a Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a Values) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
// FloatValues represents a slice of Float values.
type FloatValues []FloatValue
func (a FloatValues) MinTime() int64 {
return a[0].UnixNano()
}
func (a FloatValues) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a FloatValues) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a FloatValues) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a FloatValues) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a FloatValues) Deduplicate() FloatValues {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a FloatValues) Exclude(min, max int64) FloatValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a FloatValues) Include(min, max int64) FloatValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a FloatValues) Merge(b FloatValues) FloatValues {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make(FloatValues, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
func (a FloatValues) Encode(buf []byte) ([]byte, error) {
return encodeFloatValuesBlock(buf, a)
}
func encodeFloatValuesBlock(buf []byte, values []FloatValue) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
venc := getFloatEncoder(len(values))
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
tsenc.Write(v.unixnano)
venc.Write(v.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockFloat64, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putFloatEncoder(venc)
return b, err
}
// Sort methods
func (a FloatValues) Len() int { return len(a) }
func (a FloatValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a FloatValues) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
// IntegerValues represents a slice of Integer values.
type IntegerValues []IntegerValue
func (a IntegerValues) MinTime() int64 {
return a[0].UnixNano()
}
func (a IntegerValues) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a IntegerValues) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a IntegerValues) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a IntegerValues) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a IntegerValues) Deduplicate() IntegerValues {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a IntegerValues) Exclude(min, max int64) IntegerValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a IntegerValues) Include(min, max int64) IntegerValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a IntegerValues) Merge(b IntegerValues) IntegerValues {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make(IntegerValues, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
func (a IntegerValues) Encode(buf []byte) ([]byte, error) {
return encodeIntegerValuesBlock(buf, a)
}
func encodeIntegerValuesBlock(buf []byte, values []IntegerValue) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
venc := getIntegerEncoder(len(values))
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
tsenc.Write(v.unixnano)
venc.Write(v.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockInteger, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putIntegerEncoder(venc)
return b, err
}
// Sort methods
func (a IntegerValues) Len() int { return len(a) }
func (a IntegerValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a IntegerValues) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
// StringValues represents a slice of String values.
type StringValues []StringValue
func (a StringValues) MinTime() int64 {
return a[0].UnixNano()
}
func (a StringValues) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a StringValues) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a StringValues) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a StringValues) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a StringValues) Deduplicate() StringValues {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a StringValues) Exclude(min, max int64) StringValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a StringValues) Include(min, max int64) StringValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a StringValues) Merge(b StringValues) StringValues {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make(StringValues, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
func (a StringValues) Encode(buf []byte) ([]byte, error) {
return encodeStringValuesBlock(buf, a)
}
func encodeStringValuesBlock(buf []byte, values []StringValue) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
venc := getStringEncoder(len(values))
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
tsenc.Write(v.unixnano)
venc.Write(v.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockString, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putStringEncoder(venc)
return b, err
}
// Sort methods
func (a StringValues) Len() int { return len(a) }
func (a StringValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a StringValues) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
// BooleanValues represents a slice of Boolean values.
type BooleanValues []BooleanValue
func (a BooleanValues) MinTime() int64 {
return a[0].UnixNano()
}
func (a BooleanValues) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a BooleanValues) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a BooleanValues) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a BooleanValues) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a BooleanValues) Deduplicate() BooleanValues {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a BooleanValues) Exclude(min, max int64) BooleanValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a BooleanValues) Include(min, max int64) BooleanValues {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a BooleanValues) Merge(b BooleanValues) BooleanValues {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make(BooleanValues, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
func (a BooleanValues) Encode(buf []byte) ([]byte, error) {
return encodeBooleanValuesBlock(buf, a)
}
func encodeBooleanValuesBlock(buf []byte, values []BooleanValue) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
venc := getBooleanEncoder(len(values))
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
tsenc.Write(v.unixnano)
venc.Write(v.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockBoolean, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putBooleanEncoder(venc)
return b, err
}
// Sort methods
func (a BooleanValues) Len() int { return len(a) }
func (a BooleanValues) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a BooleanValues) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
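Each generated Encode method pairs with the matching Decode*Block function defined in encoding.go; a minimal in-package round trip (a sketch that only uses identifiers defined in these files) looks like:

```go
package tsm1

// roundTripFloat is a hypothetical in-package helper showing the pairing
// of FloatValues.Encode with DecodeFloatBlock.
func roundTripFloat() ([]FloatValue, error) {
	vals := FloatValues{
		FloatValue{unixnano: 1000, value: 1.5},
		FloatValue{unixnano: 2000, value: 2.5},
	}
	b, err := vals.Encode(nil)
	if err != nil {
		return nil, err
	}
	// Decoding returns the same two values, still in timestamp order.
	return DecodeFloatBlock(b, &[]FloatValue{})
}
```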

View File

@@ -0,0 +1,209 @@
package tsm1
import (
"fmt"
"sort"
)
{{range .}}
// {{.Name}}Values represents a slice of {{.Name}} values.
type {{.Name}}Values []{{.Name}}Value
func (a {{.Name}}Values) MinTime() int64 {
return a[0].UnixNano()
}
func (a {{.Name}}Values) MaxTime() int64 {
return a[len(a)-1].UnixNano()
}
func (a {{.Name}}Values) Size() int {
sz := 0
for _, v := range a {
sz += v.Size()
}
return sz
}
func (a {{.Name}}Values) ordered() bool {
if len(a) <= 1 {
return true
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
return false
}
}
return true
}
func (a {{.Name}}Values) assertOrdered() {
if len(a) <= 1 {
return
}
for i := 1; i < len(a); i++ {
if av, ab := a[i-1].UnixNano(), a[i].UnixNano(); av >= ab {
panic(fmt.Sprintf("not ordered: %d %d >= %d", i, av, ab))
}
}
}
// Deduplicate returns a new slice with any values that have the same timestamp removed.
// The Value that appears last in the slice is the one that is kept.
func (a {{.Name}}Values) Deduplicate() {{.Name}}Values {
if len(a) == 0 {
return a
}
// See if we're already sorted and deduped
var needSort bool
for i := 1; i < len(a); i++ {
if a[i-1].UnixNano() >= a[i].UnixNano() {
needSort = true
break
}
}
if !needSort {
return a
}
sort.Stable(a)
var i int
for j := 1; j < len(a); j++ {
v := a[j]
if v.UnixNano() != a[i].UnixNano() {
i++
}
a[i] = v
}
return a[:i+1]
}
// Exclude returns the subset of values not in [min, max]
func (a {{.Name}}Values) Exclude(min, max int64) {{.Name}}Values {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() >= min && a[j].UnixNano() <= max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Include returns the subset values between min and max inclusive.
func (a {{.Name}}Values) Include(min, max int64) {{.Name}}Values {
var i int
for j := 0; j < len(a); j++ {
if a[j].UnixNano() < min || a[j].UnixNano() > max {
continue
}
a[i] = a[j]
i++
}
return a[:i]
}
// Merge overlays b to top of a. If two values conflict with
// the same timestamp, b is used. Both a and b must be sorted
// in ascending order.
func (a {{.Name}}Values) Merge(b {{.Name}}Values) {{.Name}}Values {
if len(a) == 0 {
return b
}
if len(b) == 0 {
return a
}
// Normally, both a and b should not contain duplicates. Due to a bug in older versions, it's
// possible that stored blocks contain duplicate values. Remove them if they exist before
// merging.
a = a.Deduplicate()
b = b.Deduplicate()
if a[len(a)-1].UnixNano() < b[0].UnixNano() {
return append(a, b...)
}
if b[len(b)-1].UnixNano() < a[0].UnixNano() {
return append(b, a...)
}
out := make({{.Name}}Values, 0, len(a)+len(b))
for len(a) > 0 && len(b) > 0 {
if a[0].UnixNano() < b[0].UnixNano() {
out, a = append(out, a[0]), a[1:]
} else if len(b) > 0 && a[0].UnixNano() == b[0].UnixNano() {
a = a[1:]
} else {
out, b = append(out, b[0]), b[1:]
}
}
if len(a) > 0 {
return append(out, a...)
}
return append(out, b...)
}
{{ if ne .Name "" }}
func (a {{.Name}}Values) Encode(buf []byte) ([]byte, error) {
return encode{{.Name}}ValuesBlock(buf, a)
}
func encode{{ .Name }}ValuesBlock(buf []byte, values []{{.Name}}Value) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
venc := get{{ .Name }}Encoder(len(values))
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
tsenc.Write(v.unixnano)
venc.Write(v.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, {{ .Type }}, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
put{{.Name}}Encoder(venc)
return b, err
}
{{ end }}
// Sort methods
func (a {{.Name}}Values) Len() int { return len(a) }
func (a {{.Name}}Values) Swap(i, j int) { a[i], a[j] = a[j], a[i] }
func (a {{.Name}}Values) Less(i, j int) bool { return a[i].UnixNano() < a[j].UnixNano() }
{{ end }}

View File

@@ -0,0 +1,27 @@
[
{
"Name":"",
"name":"",
"Type":""
},
{
"Name":"Float",
"name":"float",
"Type":"BlockFloat64"
},
{
"Name":"Integer",
"name":"integer",
"Type":"BlockInteger"
},
{
"Name":"String",
"name":"string",
"Type":"BlockString"
},
{
"Name":"Boolean",
"name":"boolean",
"Type":"BlockBoolean"
}
]

View File

@@ -0,0 +1,880 @@
package tsm1
import (
"encoding/binary"
"fmt"
"runtime"
"time"
"github.com/influxdata/influxdb/influxql"
"github.com/influxdata/influxdb/pkg/pool"
"github.com/influxdata/influxdb/tsdb"
)
const (
// BlockFloat64 designates a block encodes float64 values.
BlockFloat64 = byte(0)
// BlockInteger designates a block encodes int64 values.
BlockInteger = byte(1)
// BlockBoolean designates a block encodes boolean values.
BlockBoolean = byte(2)
// BlockString designates a block encodes string values.
BlockString = byte(3)
// encodedBlockHeaderSize is the size of the header for an encoded block. There is one
// byte encoding the type of the block.
encodedBlockHeaderSize = 1
)
func init() {
// Prime the pools with one encoder/decoder for each available CPU.
vals := make([]interface{}, 0, runtime.NumCPU())
for _, p := range []*pool.Generic{
timeEncoderPool, timeDecoderPool,
integerEncoderPool, integerDecoderPool,
floatEncoderPool, floatDecoderPool,
stringEncoderPool, stringDecoderPool,
booleanEncoderPool, booleanDecoderPool,
} {
vals = vals[:0]
// Check one out to force the allocation now and hold onto it
for i := 0; i < runtime.NumCPU(); i++ {
v := p.Get(tsdb.DefaultMaxPointsPerBlock)
vals = append(vals, v)
}
// Add them all back
for _, v := range vals {
p.Put(v)
}
}
}
var (
// encoder pools
timeEncoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return NewTimeEncoder(sz)
})
integerEncoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return NewIntegerEncoder(sz)
})
floatEncoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return NewFloatEncoder()
})
stringEncoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return NewStringEncoder(sz)
})
booleanEncoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return NewBooleanEncoder(sz)
})
// decoder pools
timeDecoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return &TimeDecoder{}
})
integerDecoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return &IntegerDecoder{}
})
floatDecoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return &FloatDecoder{}
})
stringDecoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return &StringDecoder{}
})
booleanDecoderPool = pool.NewGeneric(runtime.NumCPU(), func(sz int) interface{} {
return &BooleanDecoder{}
})
)
// Value represents a TSM-encoded value.
type Value interface {
// UnixNano returns the timestamp of the value in nanoseconds since unix epoch.
UnixNano() int64
// Value returns the underlying value.
Value() interface{}
// Size returns the number of bytes necessary to represent the value and its timestamp.
Size() int
// String returns the string representation of the value and its timestamp.
String() string
// internalOnly is unexported to ensure implementations of Value
// can only originate in this package.
internalOnly()
}
// NewValue returns a new Value with the underlying type dependent on value.
func NewValue(t int64, value interface{}) Value {
switch v := value.(type) {
case int64:
return IntegerValue{unixnano: t, value: v}
case float64:
return FloatValue{unixnano: t, value: v}
case bool:
return BooleanValue{unixnano: t, value: v}
case string:
return StringValue{unixnano: t, value: v}
}
return EmptyValue{}
}
// NewIntegerValue returns a new integer value.
func NewIntegerValue(t int64, v int64) Value {
return IntegerValue{unixnano: t, value: v}
}
// NewFloatValue returns a new float value.
func NewFloatValue(t int64, v float64) Value {
return FloatValue{unixnano: t, value: v}
}
// NewBooleanValue returns a new boolean value.
func NewBooleanValue(t int64, v bool) Value {
return BooleanValue{unixnano: t, value: v}
}
// NewStringValue returns a new string value.
func NewStringValue(t int64, v string) Value {
return StringValue{unixnano: t, value: v}
}
// EmptyValue is used when there is no appropriate other value.
type EmptyValue struct{}
// UnixNano returns tsdb.EOF.
func (e EmptyValue) UnixNano() int64 { return tsdb.EOF }
// Value returns nil.
func (e EmptyValue) Value() interface{} { return nil }
// Size returns 0.
func (e EmptyValue) Size() int { return 0 }
// String returns the empty string.
func (e EmptyValue) String() string { return "" }
func (_ EmptyValue) internalOnly() {}
func (_ StringValue) internalOnly() {}
func (_ IntegerValue) internalOnly() {}
func (_ BooleanValue) internalOnly() {}
func (_ FloatValue) internalOnly() {}
// Encode converts the values to a byte slice. If there are no values,
// this function panics.
func (a Values) Encode(buf []byte) ([]byte, error) {
if len(a) == 0 {
panic("unable to encode block type")
}
switch a[0].(type) {
case FloatValue:
return encodeFloatBlock(buf, a)
case IntegerValue:
return encodeIntegerBlock(buf, a)
case BooleanValue:
return encodeBooleanBlock(buf, a)
case StringValue:
return encodeStringBlock(buf, a)
}
return nil, fmt.Errorf("unsupported value type %T", a[0])
}
// InfluxQLType returns the influxql.DataType the values map to.
func (a Values) InfluxQLType() (influxql.DataType, error) {
if len(a) == 0 {
return influxql.Unknown, fmt.Errorf("no values to infer type")
}
switch a[0].(type) {
case FloatValue:
return influxql.Float, nil
case IntegerValue:
return influxql.Integer, nil
case BooleanValue:
return influxql.Boolean, nil
case StringValue:
return influxql.String, nil
}
return influxql.Unknown, fmt.Errorf("unsupported value type %T", a[0])
}
// BlockType returns the type of value encoded in a block or an error
// if the block type is unknown.
func BlockType(block []byte) (byte, error) {
blockType := block[0]
switch blockType {
case BlockFloat64, BlockInteger, BlockBoolean, BlockString:
return blockType, nil
default:
return 0, fmt.Errorf("unknown block type: %d", blockType)
}
}
// BlockCount returns the number of timestamps encoded in block.
func BlockCount(block []byte) int {
if len(block) <= encodedBlockHeaderSize {
panic(fmt.Sprintf("count of short block: got %v, exp %v", len(block), encodedBlockHeaderSize))
}
// first byte is the block type
tb, _, err := unpackBlock(block[1:])
if err != nil {
panic(fmt.Sprintf("BlockCount: error unpacking block: %s", err.Error()))
}
return CountTimestamps(tb)
}
// DecodeBlock takes a byte slice and decodes it into values of the appropriate type
// based on the block.
func DecodeBlock(block []byte, vals []Value) ([]Value, error) {
if len(block) <= encodedBlockHeaderSize {
panic(fmt.Sprintf("decode of short block: got %v, exp %v", len(block), encodedBlockHeaderSize))
}
blockType, err := BlockType(block)
if err != nil {
return nil, err
}
switch blockType {
case BlockFloat64:
var buf []FloatValue
decoded, err := DecodeFloatBlock(block, &buf)
if len(vals) < len(decoded) {
vals = make([]Value, len(decoded))
}
for i := range decoded {
vals[i] = decoded[i]
}
return vals[:len(decoded)], err
case BlockInteger:
var buf []IntegerValue
decoded, err := DecodeIntegerBlock(block, &buf)
if len(vals) < len(decoded) {
vals = make([]Value, len(decoded))
}
for i := range decoded {
vals[i] = decoded[i]
}
return vals[:len(decoded)], err
case BlockBoolean:
var buf []BooleanValue
decoded, err := DecodeBooleanBlock(block, &buf)
if len(vals) < len(decoded) {
vals = make([]Value, len(decoded))
}
for i := range decoded {
vals[i] = decoded[i]
}
return vals[:len(decoded)], err
case BlockString:
var buf []StringValue
decoded, err := DecodeStringBlock(block, &buf)
if len(vals) < len(decoded) {
vals = make([]Value, len(decoded))
}
for i := range decoded {
vals[i] = decoded[i]
}
return vals[:len(decoded)], err
default:
panic(fmt.Sprintf("unknown block type: %d", blockType))
}
}
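// Usage sketch: DecodeBlock reuses the caller's slice when it is large
// enough, so a loop over many blocks (blks is hypothetical) can avoid
// per-block allocations:
//
//	buf := make([]Value, 0, tsdb.DefaultMaxPointsPerBlock)
//	for _, blk := range blks {
//		vals, err := DecodeBlock(blk, buf[:cap(buf)])
//		if err != nil {
//			return err
//		}
//		// consume vals before the next iteration overwrites buf
//	}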
// FloatValue represents a float64 value.
type FloatValue struct {
unixnano int64
value float64
}
// UnixNano returns the timestamp of the value.
func (v FloatValue) UnixNano() int64 {
return v.unixnano
}
// Value returns the underlying float64 value.
func (v FloatValue) Value() interface{} {
return v.value
}
// Size returns the number of bytes necessary to represent the value and its timestamp.
func (v FloatValue) Size() int {
return 16
}
// String returns the string representation of the value and its timestamp.
func (v FloatValue) String() string {
return fmt.Sprintf("%v %v", time.Unix(0, v.unixnano), v.value)
}
func encodeFloatBlock(buf []byte, values []Value) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
// A float block is encoded using different compression strategies
// for timestamps and values.
// Encode values using Gorilla float compression
venc := getFloatEncoder(len(values))
// Encode timestamps using an adaptive encoder that uses delta-encoding,
// frame-or-reference and run length encoding.
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
vv := v.(FloatValue)
tsenc.Write(vv.unixnano)
venc.Write(vv.value)
}
venc.Flush()
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded float values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockFloat64, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putFloatEncoder(venc)
return b, err
}
// DecodeFloatBlock decodes the float block from the byte slice
// and appends the float values to a.
func DecodeFloatBlock(block []byte, a *[]FloatValue) ([]FloatValue, error) {
// The block type is the first byte; make sure we actually have a float block
blockType := block[0]
if blockType != BlockFloat64 {
return nil, fmt.Errorf("invalid block type: exp %d, got %d", BlockFloat64, blockType)
}
block = block[1:]
tb, vb, err := unpackBlock(block)
if err != nil {
return nil, err
}
tdec := timeDecoderPool.Get(0).(*TimeDecoder)
vdec := floatDecoderPool.Get(0).(*FloatDecoder)
var i int
err = func() error {
// Setup our timestamp and value decoders
tdec.Init(tb)
err = vdec.SetBytes(vb)
if err != nil {
return err
}
// Decode both a timestamp and value
for tdec.Next() && vdec.Next() {
ts := tdec.Read()
v := vdec.Values()
if i < len(*a) {
elem := &(*a)[i]
elem.unixnano = ts
elem.value = v
} else {
*a = append(*a, FloatValue{ts, v})
}
i++
}
// Did timestamp decoding have an error?
err = tdec.Error()
if err != nil {
return err
}
// Did float decoding have an error?
err = vdec.Error()
if err != nil {
return err
}
return nil
}()
timeDecoderPool.Put(tdec)
floatDecoderPool.Put(vdec)
return (*a)[:i], err
}
// BooleanValue represents a boolean value.
type BooleanValue struct {
unixnano int64
value bool
}
// Size returns the number of bytes necessary to represent the value and its timestamp.
func (v BooleanValue) Size() int {
return 9
}
// UnixNano returns the timestamp of the value in nanoseconds since unix epoch.
func (v BooleanValue) UnixNano() int64 {
return v.unixnano
}
// Value returns the underlying boolean value.
func (v BooleanValue) Value() interface{} {
return v.value
}
// String returns the string representation of the value and its timestamp.
func (v BooleanValue) String() string {
return fmt.Sprintf("%v %v", time.Unix(0, v.unixnano), v.Value())
}
func encodeBooleanBlock(buf []byte, values []Value) ([]byte, error) {
if len(values) == 0 {
return nil, nil
}
// A boolean block is encoded using different compression strategies
// for timestamps and values.
venc := getBooleanEncoder(len(values))
// Encode timestamps using an adaptive encoder
tsenc := getTimeEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
vv := v.(BooleanValue)
tsenc.Write(vv.unixnano)
venc.Write(vv.value)
}
// Encoded timestamp values
tb, err := tsenc.Bytes()
if err != nil {
return err
}
// Encoded boolean values
vb, err := venc.Bytes()
if err != nil {
return err
}
// Pack the block: a type byte, the uvarint length of the encoded timestamps,
// the timestamp block, then the value block
b = packBlock(buf, BlockBoolean, tb, vb)
return nil
}()
putTimeEncoder(tsenc)
putBooleanEncoder(venc)
return b, err
}
// DecodeBooleanBlock decodes the boolean block from the byte slice
// and appends the boolean values to a.
func DecodeBooleanBlock(block []byte, a *[]BooleanValue) ([]BooleanValue, error) {
// The block type is the first byte; make sure we actually have a boolean block
blockType := block[0]
if blockType != BlockBoolean {
return nil, fmt.Errorf("invalid block type: exp %d, got %d", BlockBoolean, blockType)
}
block = block[1:]
tb, vb, err := unpackBlock(block)
if err != nil {
return nil, err
}
tdec := timeDecoderPool.Get(0).(*TimeDecoder)
vdec := booleanDecoderPool.Get(0).(*BooleanDecoder)
var i int
err = func() error {
// Setup our timestamp and value decoders
tdec.Init(tb)
vdec.SetBytes(vb)
// Decode both a timestamp and value
for tdec.Next() && vdec.Next() {
ts := tdec.Read()
v := vdec.Read()
if i < len(*a) {
elem := &(*a)[i]
elem.unixnano = ts
elem.value = v
} else {
*a = append(*a, BooleanValue{ts, v})
}
i++
}
// Did timestamp decoding have an error?
err = tdec.Error()
if err != nil {
return err
}
// Did boolean decoding have an error?
err = vdec.Error()
if err != nil {
return err
}
return nil
}()
timeDecoderPool.Put(tdec)
booleanDecoderPool.Put(vdec)
return (*a)[:i], err
}
// IntegerValue represents an int64 value.
type IntegerValue struct {
unixnano int64
value int64
}
// Value returns the underlying int64 value.
func (v IntegerValue) Value() interface{} {
return v.value
}
// UnixNano returns the timestamp of the value.
func (v IntegerValue) UnixNano() int64 {
return v.unixnano
}
// Size returns the number of bytes necessary to represent the value and its timestamp.
func (v IntegerValue) Size() int {
return 16
}
// String returns the string representation of the value and its timestamp.
func (v IntegerValue) String() string {
return fmt.Sprintf("%v %v", time.Unix(0, v.unixnano), v.Value())
}
func encodeIntegerBlock(buf []byte, values []Value) ([]byte, error) {
tsEnc := getTimeEncoder(len(values))
vEnc := getIntegerEncoder(len(values))
var b []byte
err := func() error {
for _, v := range values {
vv := v.(IntegerValue)
tsEnc.Write(vv.unixnano)
vEnc.Write(vv.value)
}
// Encoded timestamp values
tb, err := tsEnc.Bytes()
if err != nil {
return err
}
// Encoded int64 values
vb, err := vEnc.Bytes()
if err != nil {
return err
}
// Pack the block: type byte, uvarint timestamp length, timestamps, then values
b = packBlock(buf, BlockInteger, tb, vb)
return nil
}()
putTimeEncoder(tsEnc)
putIntegerEncoder(vEnc)
return b, err
}
// DecodeIntegerBlock decodes the integer block from the byte slice
// and appends the integer values to a.
func DecodeIntegerBlock(block []byte, a *[]IntegerValue) ([]IntegerValue, error) {
blockType := block[0]
if blockType != BlockInteger {
return nil, fmt.Errorf("invalid block type: exp %d, got %d", BlockInteger, blockType)
}
block = block[1:]
// The first 8 bytes is the minimum timestamp of the block
tb, vb, err := unpackBlock(block)
if err != nil {
return nil, err
}
tdec := timeDecoderPool.Get(0).(*TimeDecoder)
vdec := integerDecoderPool.Get(0).(*IntegerDecoder)
var i int
err = func() error {
// Setup our timestamp and value decoders
tdec.Init(tb)
vdec.SetBytes(vb)
// Decode both a timestamp and value
for tdec.Next() && vdec.Next() {
ts := tdec.Read()
v := vdec.Read()
if i < len(*a) {
elem := &(*a)[i]
elem.unixnano = ts
elem.value = v
} else {
*a = append(*a, IntegerValue{ts, v})
}
i++
}
// Did timestamp decoding have an error?
err = tdec.Error()
if err != nil {
return err
}
// Did int64 decoding have an error?
err = vdec.Error()
if err != nil {
return err
}
return nil
}()
timeDecoderPool.Put(tdec)
integerDecoderPool.Put(vdec)
return (*a)[:i], err
}
// StringValue represents a string value.
type StringValue struct {
unixnano int64
value string
}
// Value returns the underlying string value.
func (v StringValue) Value() interface{} {
return v.value
}
// UnixNano returns the timestamp of the value.
func (v StringValue) UnixNano() int64 {
return v.unixnano
}
// Size returns the number of bytes necessary to represent the value and its timestamp.
func (v StringValue) Size() int {
return 8 + len(v.value)
}
// String returns the string representation of the value and its timestamp.
func (v StringValue) String() string {
return fmt.Sprintf("%v %v", time.Unix(0, v.unixnano), v.Value())
}
func encodeStringBlock(buf []byte, values []Value) ([]byte, error) {
tsEnc := getTimeEncoder(len(values))
vEnc := getStringEncoder(len(values) * len(values[0].(StringValue).value))
var b []byte
err := func() error {
for _, v := range values {
vv := v.(StringValue)
tsEnc.Write(vv.unixnano)
vEnc.Write(vv.value)
}
// Encoded timestamp values
tb, err := tsEnc.Bytes()
if err != nil {
return err
}
// Encoded string values
vb, err := vEnc.Bytes()
if err != nil {
return err
}
// Pack the block: type byte, uvarint timestamp length, timestamps, then values
b = packBlock(buf, BlockString, tb, vb)
return nil
}()
putTimeEncoder(tsEnc)
putStringEncoder(vEnc)
return b, err
}
// DecodeStringBlock decodes the string block from the byte slice
// and appends the string values to a.
func DecodeStringBlock(block []byte, a *[]StringValue) ([]StringValue, error) {
blockType := block[0]
if blockType != BlockString {
return nil, fmt.Errorf("invalid block type: exp %d, got %d", BlockString, blockType)
}
block = block[1:]
// The first 8 bytes is the minimum timestamp of the block
tb, vb, err := unpackBlock(block)
if err != nil {
return nil, err
}
tdec := timeDecoderPool.Get(0).(*TimeDecoder)
vdec := stringDecoderPool.Get(0).(*StringDecoder)
var i int
err = func() error {
// Setup our timestamp and value decoders
tdec.Init(tb)
err = vdec.SetBytes(vb)
if err != nil {
return err
}
// Decode both a timestamp and value
for tdec.Next() && vdec.Next() {
ts := tdec.Read()
v := vdec.Read()
if i < len(*a) {
elem := &(*a)[i]
elem.unixnano = ts
elem.value = v
} else {
*a = append(*a, StringValue{ts, v})
}
i++
}
// Did timestamp decoding have an error?
err = tdec.Error()
if err != nil {
return err
}
// Did string decoding have an error?
err = vdec.Error()
if err != nil {
return err
}
return nil
}()
timeDecoderPool.Put(tdec)
stringDecoderPool.Put(vdec)
return (*a)[:i], err
}
func packBlock(buf []byte, typ byte, ts []byte, values []byte) []byte {
// We encode the length of the timestamp block using a variable byte encoding.
// This allows small byte slices to take up 1 byte while larger ones use 2 or more.
sz := 1 + binary.MaxVarintLen64 + len(ts) + len(values)
if cap(buf) < sz {
buf = make([]byte, sz)
}
b := buf[:sz]
b[0] = typ
i := binary.PutUvarint(b[1:1+binary.MaxVarintLen64], uint64(len(ts)))
i += 1
// block is <len timestamp bytes>, <ts bytes>, <value bytes>
copy(b[i:], ts)
// We don't encode the value length because we know it's the rest of the block after
// the timestamp block.
copy(b[i+len(ts):], values)
return b[:i+len(ts)+len(values)]
}
func unpackBlock(buf []byte) (ts, values []byte, err error) {
// Unpack the timestamp block length
tsLen, i := binary.Uvarint(buf)
if i <= 0 {
err = fmt.Errorf("unpackBlock: unable to read timestamp block length")
return
}
// Unpack the timestamp bytes
tsIdx := int(i) + int(tsLen)
if tsIdx > len(buf) {
err = fmt.Errorf("unpackBlock: not enough data for timestamp")
return
}
ts = buf[int(i):tsIdx]
// Unpack the value bytes
values = buf[tsIdx:]
return
}
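// Layout written by packBlock and read back by unpackBlock (unpackBlock
// receives the slice after the leading type byte has been stripped):
//
//	┌────────┬───────────────────┬─────────────────┬─────────────────┐
//	│ type   │ uvarint ts length │ timestamp block │ value block     │
//	│ 1 byte │ 1-10 bytes        │ N bytes         │ remaining bytes │
//	└────────┴───────────────────┴─────────────────┴─────────────────┘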
// ZigZagEncode converts a int64 to a uint64 by zig zagging negative and positive values
// across even and odd numbers. Eg. [0,-1,1,-2] becomes [0, 1, 2, 3].
func ZigZagEncode(x int64) uint64 {
return uint64(uint64(x<<1) ^ uint64((int64(x) >> 63)))
}
// ZigZagDecode converts a previously zigzag encoded uint64 back to a int64.
func ZigZagDecode(v uint64) int64 {
return int64((v >> 1) ^ uint64((int64(v&1)<<63)>>63))
}
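// Worked example: ZigZagEncode maps 0, -1, 1, -2, 2 to 0, 1, 2, 3, 4, so
// values of small magnitude (either sign) become small unsigned integers
// that are cheap to encode; ZigZagDecode inverts the mapping:
//
//	ZigZagEncode(-3) == 5
//	ZigZagDecode(5)  == -3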
func getTimeEncoder(sz int) TimeEncoder {
x := timeEncoderPool.Get(sz).(TimeEncoder)
x.Reset()
return x
}
func putTimeEncoder(enc TimeEncoder) { timeEncoderPool.Put(enc) }
func getIntegerEncoder(sz int) IntegerEncoder {
x := integerEncoderPool.Get(sz).(IntegerEncoder)
x.Reset()
return x
}
func putIntegerEncoder(enc IntegerEncoder) { integerEncoderPool.Put(enc) }
func getFloatEncoder(sz int) *FloatEncoder {
x := floatEncoderPool.Get(sz).(*FloatEncoder)
x.Reset()
return x
}
func putFloatEncoder(enc *FloatEncoder) { floatEncoderPool.Put(enc) }
func getStringEncoder(sz int) StringEncoder {
x := stringEncoderPool.Get(sz).(StringEncoder)
x.Reset()
return x
}
func putStringEncoder(enc StringEncoder) { stringEncoderPool.Put(enc) }
func getBooleanEncoder(sz int) BooleanEncoder {
x := booleanEncoderPool.Get(sz).(BooleanEncoder)
x.Reset()
return x
}
func putBooleanEncoder(enc BooleanEncoder) { booleanEncoderPool.Put(enc) }
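All encode paths above follow the same pool discipline: get an encoder sized for the value count, use it, and put it back even on error. The encode* functions do this with a closure; a defer-based sketch of the same pattern, with encodeTimestamps as a hypothetical in-package caller:

```go
package tsm1

// encodeTimestamps illustrates the get/put pairing with defer instead of
// the closure style used in the encode* functions above.
func encodeTimestamps(ts []int64) ([]byte, error) {
	enc := getTimeEncoder(len(ts))
	defer putTimeEncoder(enc) // return the encoder to the pool on every path
	for _, t := range ts {
		enc.Write(t)
	}
	return enc.Bytes()
}
```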

File diff suppressed because it is too large

File diff suppressed because it is too large

File diff suppressed because it is too large

View File

@@ -0,0 +1,659 @@
// Generated by tmpl
// https://github.com/benbjohnson/tmpl
//
// DO NOT EDIT!
// Source: file_store.gen.go.tmpl
package tsm1
// ReadFloatBlock reads the next block as a set of float values.
func (c *KeyCursor) ReadFloatBlock(buf *[]FloatValue) ([]FloatValue, error) {
// No matching blocks to decode
if len(c.current) == 0 {
return nil, nil
}
// First block is the oldest block containing the points we're searching for.
first := c.current[0]
*buf = (*buf)[:0]
values, err := first.r.ReadFloatBlockAt(&first.entry, buf)
if err != nil {
return nil, err
}
// Remove values we already read
values = FloatValues(values).Exclude(first.readMin, first.readMax)
// Remove any tombstones
tombstones := first.r.TombstoneRange(c.key)
values = c.filterFloatValues(tombstones, values)
// Check we have remaining values.
if len(values) == 0 {
return nil, nil
}
// Only one block with this key and time range so return it
if len(c.current) == 1 {
if len(values) > 0 {
first.markRead(values[0].UnixNano(), values[len(values)-1].UnixNano())
}
return values, nil
}
// Use the current block time range as our overlapping window
minT, maxT := first.readMin, first.readMax
if len(values) > 0 {
minT, maxT = values[0].UnixNano(), values[len(values)-1].UnixNano()
}
if c.ascending {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the min time range to ensure values are returned in ascending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MinTime < minT && !cur.read() {
minT = cur.entry.MinTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Expand our window to cover both the first block and the first
// overlapping block, so the entire overlapping region is decoded and
// merged together.
if cur.entry.MaxTime > maxT {
maxT = cur.entry.MaxTime
}
values = FloatValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []FloatValue
v, err := cur.r.ReadFloatBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterFloatValues(tombstones, v)
// Remove values we already read
v = FloatValues(v).Exclude(cur.readMin, cur.readMax)
if len(v) > 0 {
// Only use values in the overlapping window
v = FloatValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = FloatValues(values).Merge(v)
}
cur.markRead(minT, maxT)
}
} else {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the max time range to ensure values are returned in descending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MaxTime > maxT && !cur.read() {
maxT = cur.entry.MaxTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Expand our window to cover both the first block and the first
// overlapping block, so the entire overlapping region is decoded and
// merged together.
if cur.entry.MinTime < minT {
minT = cur.entry.MinTime
}
values = FloatValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []FloatValue
v, err := cur.r.ReadFloatBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterFloatValues(tombstones, v)
// Remove values we already read
v = FloatValues(v).Exclude(cur.readMin, cur.readMax)
// If the block we decoded should have all of its values included, mark it as read so we
// don't use it again.
if len(v) > 0 {
v = FloatValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = FloatValues(v).Merge(values)
}
cur.markRead(minT, maxT)
}
}
first.markRead(minT, maxT)
return values, err
}
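// Worked sketch of the ascending path above: suppose the oldest block A
// spans [0,10] and a newer block B spans [5,15]. A is decoded first,
// giving a window of [0,10]; B overlaps, so the window grows to [0,15],
// B is decoded and merged via Merge(v), where the newer block's values
// win at any shared timestamp, and both blocks are marked read over
// [0,15].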
// ReadIntegerBlock reads the next block as a set of integer values.
func (c *KeyCursor) ReadIntegerBlock(buf *[]IntegerValue) ([]IntegerValue, error) {
// No matching blocks to decode
if len(c.current) == 0 {
return nil, nil
}
// First block is the oldest block containing the points we're searching for.
first := c.current[0]
*buf = (*buf)[:0]
values, err := first.r.ReadIntegerBlockAt(&first.entry, buf)
if err != nil {
return nil, err
}
// Remove values we already read
values = IntegerValues(values).Exclude(first.readMin, first.readMax)
// Remove any tombstones
tombstones := first.r.TombstoneRange(c.key)
values = c.filterIntegerValues(tombstones, values)
// Check we have remaining values.
if len(values) == 0 {
return nil, nil
}
// Only one block with this key and time range so return it
if len(c.current) == 1 {
if len(values) > 0 {
first.markRead(values[0].UnixNano(), values[len(values)-1].UnixNano())
}
return values, nil
}
// Use the current block time range as our overlapping window
minT, maxT := first.readMin, first.readMax
if len(values) > 0 {
minT, maxT = values[0].UnixNano(), values[len(values)-1].UnixNano()
}
if c.ascending {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the min time range to ensure values are returned in ascending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MinTime < minT && !cur.read() {
minT = cur.entry.MinTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Expand our window to cover both the first block and the first
// overlapping block, so the entire overlapping region is decoded and
// merged together.
if cur.entry.MaxTime > maxT {
maxT = cur.entry.MaxTime
}
values = IntegerValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []IntegerValue
v, err := cur.r.ReadIntegerBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterIntegerValues(tombstones, v)
// Remove values we already read
v = IntegerValues(v).Exclude(cur.readMin, cur.readMax)
if len(v) > 0 {
// Only use values in the overlapping window
v = IntegerValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = IntegerValues(values).Merge(v)
}
cur.markRead(minT, maxT)
}
} else {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the max time range to ensure values are returned in descending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MaxTime > maxT && !cur.read() {
maxT = cur.entry.MaxTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Expand our window to cover both the first block and the first
// overlapping block, so the entire overlapping region is decoded and
// merged together.
if cur.entry.MinTime < minT {
minT = cur.entry.MinTime
}
values = IntegerValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []IntegerValue
v, err := cur.r.ReadIntegerBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterIntegerValues(tombstones, v)
// Remove values we already read
v = IntegerValues(v).Exclude(cur.readMin, cur.readMax)
// If the block we decoded should have all of its values included, mark it as read so we
// don't use it again.
if len(v) > 0 {
v = IntegerValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = IntegerValues(v).Merge(values)
}
cur.markRead(minT, maxT)
}
}
first.markRead(minT, maxT)
return values, err
}
// ReadStringBlock reads the next block as a set of string values.
func (c *KeyCursor) ReadStringBlock(buf *[]StringValue) ([]StringValue, error) {
// No matching blocks to decode
if len(c.current) == 0 {
return nil, nil
}
// First block is the oldest block containing the points we're searching for.
first := c.current[0]
*buf = (*buf)[:0]
values, err := first.r.ReadStringBlockAt(&first.entry, buf)
if err != nil {
return nil, err
}
// Remove values we already read
values = StringValues(values).Exclude(first.readMin, first.readMax)
// Remove any tombstones
tombstones := first.r.TombstoneRange(c.key)
values = c.filterStringValues(tombstones, values)
// Check we have remaining values.
if len(values) == 0 {
return nil, nil
}
// Only one block with this key and time range so return it
if len(c.current) == 1 {
if len(values) > 0 {
first.markRead(values[0].UnixNano(), values[len(values)-1].UnixNano())
}
return values, nil
}
// Use the current block time range as our overlapping window
minT, maxT := first.readMin, first.readMax
if len(values) > 0 {
minT, maxT = values[0].UnixNano(), values[len(values)-1].UnixNano()
}
if c.ascending {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the min time range to ensure values are returned in ascending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MinTime < minT && !cur.read() {
minT = cur.entry.MinTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MaxTime > maxT {
maxT = cur.entry.MaxTime
}
values = StringValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []StringValue
v, err := cur.r.ReadStringBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterStringValues(tombstones, v)
// Remove values we already read
v = StringValues(v).Exclude(cur.readMin, cur.readMax)
if len(v) > 0 {
// Only use values in the overlapping window
v = StringValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = StringValues(values).Merge(v)
}
cur.markRead(minT, maxT)
}
} else {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the max time range to ensure values are returned in descending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MaxTime > maxT && !cur.read() {
maxT = cur.entry.MaxTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MinTime < minT {
minT = cur.entry.MinTime
}
values = StringValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []StringValue
v, err := cur.r.ReadStringBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterStringValues(tombstones, v)
// Remove values we already read
v = StringValues(v).Exclude(cur.readMin, cur.readMax)
// If the block we decoded should have all of its values included, mark it as read so we
// don't use it again.
if len(v) > 0 {
v = StringValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = StringValues(v).Merge(values)
}
cur.markRead(minT, maxT)
}
}
first.markRead(minT, maxT)
return values, err
}
// ReadBooleanBlock reads the next block as a set of boolean values.
func (c *KeyCursor) ReadBooleanBlock(buf *[]BooleanValue) ([]BooleanValue, error) {
// No matching blocks to decode
if len(c.current) == 0 {
return nil, nil
}
// First block is the oldest block containing the points we're searching for.
first := c.current[0]
*buf = (*buf)[:0]
values, err := first.r.ReadBooleanBlockAt(&first.entry, buf)
if err != nil {
return nil, err
}
// Remove values we already read
values = BooleanValues(values).Exclude(first.readMin, first.readMax)
// Remove any tombstones
tombstones := first.r.TombstoneRange(c.key)
values = c.filterBooleanValues(tombstones, values)
// Check we have remaining values.
if len(values) == 0 {
return nil, nil
}
// Only one block with this key and time range so return it
if len(c.current) == 1 {
if len(values) > 0 {
first.markRead(values[0].UnixNano(), values[len(values)-1].UnixNano())
}
return values, nil
}
// Use the current block time range as our overlapping window
minT, maxT := first.readMin, first.readMax
if len(values) > 0 {
minT, maxT = values[0].UnixNano(), values[len(values)-1].UnixNano()
}
if c.ascending {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the min time range to ensure values are returned in ascending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MinTime < minT && !cur.read() {
minT = cur.entry.MinTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MaxTime > maxT {
maxT = cur.entry.MaxTime
}
values = BooleanValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []BooleanValue
v, err := cur.r.ReadBooleanBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterBooleanValues(tombstones, v)
// Remove values we already read
v = BooleanValues(v).Exclude(cur.readMin, cur.readMax)
if len(v) > 0 {
// Only use values in the overlapping window
v = BooleanValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = BooleanValues(values).Merge(v)
}
cur.markRead(minT, maxT)
}
} else {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the max time range to ensure values are returned in descending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MaxTime > maxT && !cur.read() {
maxT = cur.entry.MaxTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MinTime < minT {
minT = cur.entry.MinTime
}
values = BooleanValues(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []BooleanValue
v, err := cur.r.ReadBooleanBlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filterBooleanValues(tombstones, v)
// Remove values we already read
v = BooleanValues(v).Exclude(cur.readMin, cur.readMax)
// If the block we decoded should have all of its values included, mark it as read so we
// don't use it again.
if len(v) > 0 {
v = BooleanValues(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = BooleanValues(v).Merge(values)
}
cur.markRead(minT, maxT)
}
}
first.markRead(minT, maxT)
return values, err
}

View File

@@ -0,0 +1,168 @@
package tsm1
{{range .}}
// Read{{.Name}}Block reads the next block as a set of {{.name}} values.
func (c *KeyCursor) Read{{.Name}}Block(buf *[]{{.Name}}Value) ([]{{.Name}}Value, error) {
// No matching blocks to decode
if len(c.current) == 0 {
return nil, nil
}
// First block is the oldest block containing the points we're searching for.
first := c.current[0]
*buf = (*buf)[:0]
values, err := first.r.Read{{.Name}}BlockAt(&first.entry, buf)
if err != nil {
return nil, err
}
// Remove values we already read
values = {{.Name}}Values(values).Exclude(first.readMin, first.readMax)
// Remove any tombstones
tombstones := first.r.TombstoneRange(c.key)
values = c.filter{{.Name}}Values(tombstones, values)
// Check we have remaining values.
if len(values) == 0 {
return nil, nil
}
// Only one block with this key and time range so return it
if len(c.current) == 1 {
if len(values) > 0 {
first.markRead(values[0].UnixNano(), values[len(values)-1].UnixNano())
}
return values, nil
}
// Use the current block time range as our overlapping window
minT, maxT := first.readMin, first.readMax
if len(values) > 0 {
minT, maxT = values[0].UnixNano(), values[len(values)-1].UnixNano()
}
if c.ascending {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the min time range to ensure values are returned in ascending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MinTime < minT && !cur.read() {
minT = cur.entry.MinTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MaxTime > maxT {
maxT = cur.entry.MaxTime
}
values = {{.Name}}Values(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []{{.Name}}Value
v, err := cur.r.Read{{.Name}}BlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filter{{.Name}}Values(tombstones, v)
// Remove values we already read
v = {{.Name}}Values(v).Exclude(cur.readMin, cur.readMax)
if len(v) > 0 {
// Only use values in the overlapping window
v = {{.Name}}Values(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = {{.Name}}Values(values).Merge(v)
}
cur.markRead(minT, maxT)
}
} else {
// Blocks are ordered by generation, we may have values in the past in later blocks, if so,
// expand the window to include the max time range to ensure values are returned in descending
// order
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.MaxTime > maxT && !cur.read() {
maxT = cur.entry.MaxTime
}
}
// Find first block that overlaps our window
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
if cur.entry.OverlapsTimeRange(minT, maxT) && !cur.read() {
// Shrink our window so it's the intersection of the first overlapping block and the
// first block. We do this to minimize the region that overlaps and needs to
// be merged.
if cur.entry.MinTime < minT {
minT = cur.entry.MinTime
}
values = {{.Name}}Values(values).Include(minT, maxT)
break
}
}
// Search the remaining blocks that overlap our window and append their values so we can
// merge them.
for i := 1; i < len(c.current); i++ {
cur := c.current[i]
// Skip this block if it doesn't contain points we're looking for or they have already been read
if !cur.entry.OverlapsTimeRange(minT, maxT) || cur.read() {
cur.markRead(minT, maxT)
continue
}
tombstones := cur.r.TombstoneRange(c.key)
var a []{{.Name}}Value
v, err := cur.r.Read{{.Name}}BlockAt(&cur.entry, &a)
if err != nil {
return nil, err
}
// Remove any tombstoned values
v = c.filter{{.Name}}Values(tombstones, v)
// Remove values we already read
v = {{.Name}}Values(v).Exclude(cur.readMin, cur.readMax)
// If the block we decoded should have all of its values included, mark it as read so we
// don't use it again.
if len(v) > 0 {
v = {{.Name}}Values(v).Include(minT, maxT)
// Merge the remaining values with the existing
values = {{.Name}}Values(v).Merge(values)
}
cur.markRead(minT, maxT)
}
}
first.markRead(minT, maxT)
return values, err
}
{{ end }}

View File

@@ -0,0 +1,18 @@
[
{
"Name":"Float",
"name":"float"
},
{
"Name":"Integer",
"name":"integer"
},
{
"Name":"String",
"name":"string"
},
{
"Name":"Boolean",
"name":"boolean"
}
]

File diff suppressed because it is too large

View File

@@ -0,0 +1,249 @@
package tsm1
import (
"bytes"
"fmt"
"testing"
)
func TestMergeSeriesKey_Single(t *testing.T) {
a := make(chan seriesKey, 5)
for i := 0; i < cap(a); i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
}
merged := merge(a)
close(a)
exp := []string{"0", "1", "2", "3", "4"}
for v := range merged {
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_Nil(t *testing.T) {
merged := merge(nil)
for v := range merged {
t.Fatalf("value mismatch: got %v, exp nil", v)
}
merged = merge(nil, nil)
for v := range merged {
t.Fatalf("value mismatch: got %v, exp nil", v)
}
}
func TestMergeSeriesKey_Duplicates(t *testing.T) {
a := make(chan seriesKey, 5)
b := make(chan seriesKey, 5)
for i := 0; i < cap(a); i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
b <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
}
merged := merge(a, b)
close(a)
close(b)
exp := []string{"0", "1", "2", "3", "4"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_Alternating(t *testing.T) {
a := make(chan seriesKey, 2)
b := make(chan seriesKey, 2)
for i := 0; i < cap(a); i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2))}
b <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2+1))}
}
merged := merge(a, b)
close(a)
close(b)
exp := []string{"0", "1", "2", "3"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", string(got.key), exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_AlternatingDuplicates(t *testing.T) {
a := make(chan seriesKey, 2)
b := make(chan seriesKey, 2)
c := make(chan seriesKey, 2)
for i := 0; i < cap(a); i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2))}
b <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2+1))}
c <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2))}
}
merged := merge(a, b, c)
close(a)
close(b)
close(c)
exp := []string{"0", "1", "2", "3"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", string(got.key), exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_Unbuffered(t *testing.T) {
a := make(chan seriesKey)
b := make(chan seriesKey)
go func() {
for i := 0; i < 2; i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2))}
}
close(a)
}()
go func() {
for i := 0; i < 2; i++ {
b <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2+1))}
}
close(b)
}()
merged := merge(a, b)
exp := []string{"0", "1", "2", "3"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", string(got.key), exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_OneEmpty(t *testing.T) {
a := make(chan seriesKey)
b := make(chan seriesKey)
go func() {
for i := 0; i < 2; i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i*2))}
}
close(a)
}()
close(b)
merged := merge(a, b)
exp := []string{"0", "2"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
func TestMergeSeriesKey_Overlapping(t *testing.T) {
a := make(chan seriesKey)
b := make(chan seriesKey)
c := make(chan seriesKey)
go func() {
for i := 0; i < 3; i++ {
a <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
}
close(a)
}()
go func() {
for i := 4; i < 7; i++ {
b <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
}
close(b)
}()
go func() {
for i := 0; i < 9; i++ {
c <- seriesKey{key: []byte(fmt.Sprintf("%d", i))}
}
close(c)
}()
merged := merge(a, b, c)
exp := []string{"0", "1", "2", "3", "4", "5", "6", "7", "8"}
for v := range merged {
if len(exp) == 0 {
t.Fatalf("more values than expected: got %v", v)
}
if got, exp := v, exp[0]; !bytes.Equal(got.key, []byte(exp)) {
t.Fatalf("value mismatch: got %v, exp %v", string(got.key), exp)
}
exp = exp[1:]
}
if len(exp) > 0 {
t.Fatalf("missed values: %v", exp)
}
}
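// Taken together, these cases pin down merge's contract as exercised here
// (inferred from the tests; merge itself lives elsewhere in the package):
// inputs are ascending channels of seriesKey, the output is a single
// ascending stream, and a key emitted by more than one input appears once.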

File diff suppressed because it is too large

View File

@@ -0,0 +1,20 @@
// +build !windows
package tsm1
import "os"
func syncDir(dirName string) error {
// fsync the dir to flush the rename
dir, err := os.OpenFile(dirName, os.O_RDONLY, os.ModeDir)
if err != nil {
return err
}
defer dir.Close()
return dir.Sync()
}
// renameFile will rename the source to target using os function.
func renameFile(oldpath, newpath string) error {
return os.Rename(oldpath, newpath)
}
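// replaceFileDurably is an illustrative sketch, not part of the original
// file: the idiom these helpers support is renaming a finished temp file
// into place, then fsyncing the containing directory so the rename itself
// survives a crash.
func replaceFileDurably(dir, tmpPath, finalPath string) error {
if err := renameFile(tmpPath, finalPath); err != nil {
return err
}
return syncDir(dir)
}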

View File

@@ -0,0 +1,18 @@
package tsm1
import "os"
func syncDir(dirName string) error {
return nil
}
// renameFile will rename the source to target using os function. If target exists it will be removed before renaming.
func renameFile(oldpath, newpath string) error {
if _, err := os.Stat(newpath); err == nil {
if err = os.Remove(newpath); nil != err {
return err
}
}
return os.Rename(oldpath, newpath)
}
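// Background note (an assumption, not in the original file): the remove
// before rename is needed because Windows renames historically did not
// atomically replace an existing target the way POSIX rename(2) does, and
// directories cannot be fsynced on Windows, hence the no-op syncDir above.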

View File

@@ -0,0 +1,285 @@
package tsm1
/*
This code is originally from: https://github.com/dgryski/go-tsz and has been modified to remove
the timestamp compression functionality.
It implements the float compression as presented in: http://www.vldb.org/pvldb/vol8/p1816-teller.pdf.
This implementation uses a sentinel value of NaN which means that float64 NaN cannot be stored using
this version.
*/
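// Worked example (illustrative): math.Float64bits(12.0) is 0x4028000000000000
// and math.Float64bits(24.0) is 0x4038000000000000, so their XOR delta is
// 0x0010000000000000 -- a single set bit. Instead of a full 64 bits, the
// stream stores control bits plus the leading-zero count, the significant-bit
// count, and that one significant bit.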
import (
"bytes"
"fmt"
"math"
"github.com/dgryski/go-bits"
"github.com/dgryski/go-bitstream"
)
const (
// floatUncompressed is an uncompressed format using 8 bytes per value.
// Not yet implemented.
floatUncompressed = 0
// floatCompressedGorilla is a compressed format using the gorilla paper encoding
floatCompressedGorilla = 1
)
// uvnan is the constant returned from math.NaN().
const uvnan = 0x7FF8000000000001
// FloatEncoder encodes multiple float64s into a byte slice.
type FloatEncoder struct {
val float64
err error
leading uint64
trailing uint64
buf bytes.Buffer
bw *bitstream.BitWriter
first bool
finished bool
}
// NewFloatEncoder returns a new FloatEncoder.
func NewFloatEncoder() *FloatEncoder {
s := FloatEncoder{
first: true,
leading: ^uint64(0),
}
s.bw = bitstream.NewWriter(&s.buf)
s.buf.WriteByte(floatCompressedGorilla << 4)
return &s
}
// Reset sets the encoder back to its initial state.
func (s *FloatEncoder) Reset() {
s.val = 0
s.err = nil
s.leading = ^uint64(0)
s.trailing = 0
s.buf.Reset()
s.buf.WriteByte(floatCompressedGorilla << 4)
s.bw.Resume(0x0, 8)
s.finished = false
s.first = true
}
// Bytes returns a copy of the underlying byte buffer used in the encoder.
func (s *FloatEncoder) Bytes() ([]byte, error) {
return s.buf.Bytes(), s.err
}
// Flush indicates there are no more values to encode.
func (s *FloatEncoder) Flush() {
if !s.finished {
// write an end-of-stream record
s.finished = true
s.Write(math.NaN())
s.bw.Flush(bitstream.Zero)
}
}
// Write encodes v to the underlying buffer.
func (s *FloatEncoder) Write(v float64) {
// Only allow NaN as a sentinel value
if math.IsNaN(v) && !s.finished {
s.err = fmt.Errorf("unsupported value: NaN")
return
}
if s.first {
// first point
s.val = v
s.first = false
s.bw.WriteBits(math.Float64bits(v), 64)
return
}
vDelta := math.Float64bits(v) ^ math.Float64bits(s.val)
if vDelta == 0 {
s.bw.WriteBit(bitstream.Zero)
} else {
s.bw.WriteBit(bitstream.One)
leading := bits.Clz(vDelta)
trailing := bits.Ctz(vDelta)
// Clamp number of leading zeros to avoid overflow when encoding
leading &= 0x1F
if leading >= 32 {
leading = 31
}
// TODO(dgryski): check if it's 'cheaper' to reset the leading/trailing bits instead
if s.leading != ^uint64(0) && leading >= s.leading && trailing >= s.trailing {
s.bw.WriteBit(bitstream.Zero)
s.bw.WriteBits(vDelta>>s.trailing, 64-int(s.leading)-int(s.trailing))
} else {
s.leading, s.trailing = leading, trailing
s.bw.WriteBit(bitstream.One)
s.bw.WriteBits(leading, 5)
// Note that if leading == trailing == 0, then sigbits == 64. But that
// value doesn't actually fit into the 6 bits we have.
// Luckily, we never need to encode 0 significant bits, since that would
// put us in the other case (vdelta == 0). So instead we write out a 0 and
// adjust it back to 64 on unpacking.
sigbits := 64 - leading - trailing
s.bw.WriteBits(sigbits, 6)
s.bw.WriteBits(vDelta>>trailing, int(sigbits))
}
}
s.val = v
}
// FloatDecoder decodes a byte slice into multiple float64 values.
type FloatDecoder struct {
val uint64
leading uint64
trailing uint64
br BitReader
b []byte
first bool
finished bool
err error
}
// SetBytes initializes the decoder with b. Must call before calling Next().
func (it *FloatDecoder) SetBytes(b []byte) error {
var v uint64
if len(b) == 0 {
v = uvnan
} else {
// first byte is the compression type.
// we currently just have gorilla compression.
it.br.Reset(b[1:])
var err error
v, err = it.br.ReadBits(64)
if err != nil {
return err
}
}
// Reset all fields.
it.val = v
it.leading = 0
it.trailing = 0
it.b = b
it.first = true
it.finished = false
it.err = nil
return nil
}
// Next returns true if there are remaining values to read.
func (it *FloatDecoder) Next() bool {
if it.err != nil || it.finished {
return false
}
if it.first {
it.first = false
// mark as finished if there were no values.
if it.val == uvnan { // IsNaN
it.finished = true
return false
}
return true
}
// read compressed value
var bit bool
if it.br.CanReadBitFast() {
bit = it.br.ReadBitFast()
} else if v, err := it.br.ReadBit(); err != nil {
it.err = err
return false
} else {
bit = v
}
if !bit {
// it.val = it.val
} else {
var bit bool
if it.br.CanReadBitFast() {
bit = it.br.ReadBitFast()
} else if v, err := it.br.ReadBit(); err != nil {
it.err = err
return false
} else {
bit = v
}
if !bit {
// reuse leading/trailing zero bits
// it.leading, it.trailing = it.leading, it.trailing
} else {
bits, err := it.br.ReadBits(5)
if err != nil {
it.err = err
return false
}
it.leading = bits
bits, err = it.br.ReadBits(6)
if err != nil {
it.err = err
return false
}
mbits := bits
// 0 significant bits here means we overflowed and we actually need 64; see comment in encoder
if mbits == 0 {
mbits = 64
}
it.trailing = 64 - it.leading - mbits
}
mbits := uint(64 - it.leading - it.trailing)
bits, err := it.br.ReadBits(mbits)
if err != nil {
it.err = err
return false
}
vbits := it.val
vbits ^= (bits << it.trailing)
if vbits == uvnan { // IsNaN
it.finished = true
return false
}
it.val = vbits
}
return true
}
// Values returns the current float64 value.
func (it *FloatDecoder) Values() float64 {
return math.Float64frombits(it.val)
}
// Error returns the current decoding error.
func (it *FloatDecoder) Error() error {
return it.err
}

View File

@@ -0,0 +1,286 @@
package tsm1_test
import (
"math"
"reflect"
"testing"
"testing/quick"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func TestFloatEncoder_Simple(t *testing.T) {
// Example from the paper
s := tsm1.NewFloatEncoder()
s.Write(12)
s.Write(12)
s.Write(24)
// extra tests
// floating point masking/shifting bug
s.Write(13)
s.Write(24)
// delta-of-delta sizes
s.Write(24)
s.Write(24)
s.Write(24)
s.Flush()
b, err := s.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var it tsm1.FloatDecoder
if err := it.SetBytes(b); err != nil {
t.Fatalf("unexpected error creating float decoder: %v", err)
}
want := []float64{
12,
12,
24,
13,
24,
24,
24,
24,
}
for _, w := range want {
if !it.Next() {
t.Fatalf("Next()=false, want true")
}
vv := it.Values()
if w != vv {
t.Errorf("Values()=(%v), want (%v)\n", vv, w)
}
}
if it.Next() {
t.Fatalf("Next()=true, want false")
}
if err := it.Error(); err != nil {
t.Errorf("it.Error()=%v, want nil", err)
}
}
func TestFloatEncoder_SimilarFloats(t *testing.T) {
s := tsm1.NewFloatEncoder()
want := []float64{
6.00065e+06,
6.000656e+06,
6.000657e+06,
6.000659e+06,
6.000661e+06,
}
for _, v := range want {
s.Write(v)
}
s.Flush()
b, err := s.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var it tsm1.FloatDecoder
if err := it.SetBytes(b); err != nil {
t.Fatalf("unexpected error creating float decoder: %v", err)
}
for _, w := range want {
if !it.Next() {
t.Fatalf("Next()=false, want true")
}
vv := it.Values()
if w != vv {
t.Errorf("Values()=(%v), want (%v)\n", vv, w)
}
}
if it.Next() {
t.Fatalf("Next()=true, want false")
}
if err := it.Error(); err != nil {
t.Errorf("it.Error()=%v, want nil", err)
}
}
var TwoHoursData = []struct {
v float64
}{
// 2h of data
{761}, {727}, {763}, {706}, {700},
{679}, {757}, {708}, {739}, {707},
{699}, {740}, {729}, {766}, {730},
{715}, {705}, {693}, {765}, {724},
{799}, {761}, {737}, {766}, {756},
{719}, {722}, {801}, {747}, {731},
{742}, {744}, {791}, {750}, {759},
{809}, {751}, {705}, {770}, {792},
{727}, {762}, {772}, {721}, {748},
{753}, {744}, {716}, {776}, {659},
{789}, {766}, {758}, {690}, {795},
{770}, {758}, {723}, {767}, {765},
{693}, {706}, {681}, {727}, {724},
{780}, {678}, {696}, {758}, {740},
{735}, {700}, {742}, {747}, {752},
{734}, {743}, {732}, {746}, {770},
{780}, {710}, {731}, {712}, {712},
{741}, {770}, {770}, {754}, {718},
{670}, {775}, {749}, {795}, {756},
{741}, {787}, {721}, {745}, {782},
{765}, {780}, {811}, {790}, {836},
{743}, {858}, {739}, {762}, {770},
{752}, {763}, {795}, {792}, {746},
{786}, {785}, {774}, {786}, {718},
}
func TestFloatEncoder_Roundtrip(t *testing.T) {
s := tsm1.NewFloatEncoder()
for _, p := range TwoHoursData {
s.Write(p.v)
}
s.Flush()
b, err := s.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var it tsm1.FloatDecoder
if err := it.SetBytes(b); err != nil {
t.Fatalf("unexpected error creating float decoder: %v", err)
}
for _, w := range TwoHoursData {
if !it.Next() {
t.Fatalf("Next()=false, want true")
}
vv := it.Values()
// t.Logf("it.Values()=(%+v, %+v)\n", time.Unix(int64(tt), 0), vv)
if w.v != vv {
t.Errorf("Values()=(%v), want (%v)\n", vv, w.v)
}
}
if it.Next() {
t.Fatalf("Next()=true, want false")
}
if err := it.Error(); err != nil {
t.Errorf("it.Error()=%v, want nil", err)
}
}
func TestFloatEncoder_Roundtrip_NaN(t *testing.T) {
s := tsm1.NewFloatEncoder()
s.Write(1.0)
s.Write(math.NaN())
s.Write(2.0)
s.Flush()
_, err := s.Bytes()
if err == nil {
t.Fatalf("expected error. got nil")
}
}
func Test_FloatEncoder_Quick(t *testing.T) {
quick.Check(func(values []float64) bool {
expected := values
if values == nil {
expected = []float64{}
}
// Write values to encoder.
enc := tsm1.NewFloatEncoder()
for _, v := range values {
enc.Write(v)
}
enc.Flush()
// Read values out of decoder.
got := make([]float64, 0, len(values))
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec tsm1.FloatDecoder
if err := dec.SetBytes(b); err != nil {
t.Fatal(err)
}
for dec.Next() {
got = append(got, dec.Values())
}
// Verify that input and output values match.
if !reflect.DeepEqual(expected, got) {
t.Fatalf("mismatch:\n\nexp=%#v\n\ngot=%#v\n\n", expected, got)
}
return true
}, nil)
}
func TestFloatDecoder_Empty(t *testing.T) {
var dec tsm1.FloatDecoder
if err := dec.SetBytes([]byte{}); err != nil {
t.Fatalf("unexpected error: %v", err)
}
if dec.Next() {
t.Fatalf("exp next == false, got true")
}
}
func BenchmarkFloatEncoder(b *testing.B) {
for i := 0; i < b.N; i++ {
s := tsm1.NewFloatEncoder()
for _, tt := range TwoHoursData {
s.Write(tt.v)
}
s.Flush()
}
}
func BenchmarkFloatDecoder(b *testing.B) {
s := tsm1.NewFloatEncoder()
for _, tt := range TwoHoursData {
s.Write(tt.v)
}
s.Flush()
bytes, err := s.Bytes()
if err != nil {
b.Fatalf("unexpected error: %v", err)
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
var it tsm1.FloatDecoder
if err := it.SetBytes(bytes); err != nil {
b.Fatalf("unexpected error creating float decoder: %v", err)
}
for j := 0; j < len(TwoHoursData); it.Next() {
j++
}
}
}

View File

@@ -0,0 +1,324 @@
package tsm1
// Integer encoding uses two different strategies depending on the range of values in
// the uncompressed data. Encoded values are first encoded using zig zag encoding.
// This interleaves positive and negative integers across a range of positive integers.
//
// For example, [-2,-1,0,1] becomes [3,1,0,2]. See
// https://developers.google.com/protocol-buffers/docs/encoding?hl=en#signed-integers
// for more information.
//
// If all the zig zag encoded values are less than 1 << 60 - 1, they are compressed using
// simple8b encoding. If any value is larger than 1 << 60 - 1, the values are stored uncompressed.
//
// Each encoded byte slice contains a 1 byte header followed by multiple 8 byte packed integers
// or 8 byte uncompressed integers. The 4 high bits of the first byte indicate the encoding type
// for the remaining bytes.
//
// There are currently two encoding types that can be used with room for 16 total. These additional
// encoding slots are reserved for future use. One improvement to be made is to use a patched
// encoding such as PFOR if only a small number of values exceed the max compressed value range. This
// should improve compression ratios with very large integers near the ends of the int64 range.
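// Worked example (illustrative): zig zag encoding computes
// uint64((v << 1) ^ (v >> 63)), so 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3,
// matching the [-2,-1,0,1] -> [3,1,0,2] mapping described above.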
import (
"encoding/binary"
"fmt"
"github.com/jwilder/encoding/simple8b"
)
const (
// intUncompressed is an uncompressed format using 8 bytes per point
intUncompressed = 0
// intCompressedSimple is a bit-packed format using simple8b encoding
intCompressedSimple = 1
// intCompressedRLE is a run-length encoding format
intCompressedRLE = 2
)
// IntegerEncoder encodes int64s into byte slices.
type IntegerEncoder struct {
prev int64
rle bool
values []uint64
}
// NewIntegerEncoder returns a new integer encoder with an initial buffer of values sized at sz.
func NewIntegerEncoder(sz int) IntegerEncoder {
return IntegerEncoder{
rle: true,
values: make([]uint64, 0, sz),
}
}
// Flush is a no-op.
func (e *IntegerEncoder) Flush() {}
// Reset sets the encoder back to its initial state.
func (e *IntegerEncoder) Reset() {
e.prev = 0
e.rle = true
e.values = e.values[:0]
}
// Write encodes v to the underlying buffers.
func (e *IntegerEncoder) Write(v int64) {
// Delta-encode each value as it's written. This happens before
// ZigZagEncoding because the deltas could be negative.
delta := v - e.prev
e.prev = v
enc := ZigZagEncode(delta)
if len(e.values) > 1 {
e.rle = e.rle && e.values[len(e.values)-1] == enc
}
e.values = append(e.values, enc)
}
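// For example (illustrative): Write(10), Write(20), Write(30) yields a
// constant delta of 10, so e.values holds ZigZagEncode(10) three times and
// e.rle stays true, letting Bytes() choose the RLE format.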
// Bytes returns a copy of the underlying buffer.
func (e *IntegerEncoder) Bytes() ([]byte, error) {
// Only run-length encode if it could reduce storage size.
if e.rle && len(e.values) > 2 {
return e.encodeRLE()
}
for _, v := range e.values {
// Value is too large to encode using packed format
if v > simple8b.MaxValue {
return e.encodeUncompressed()
}
}
return e.encodePacked()
}
func (e *IntegerEncoder) encodeRLE() ([]byte, error) {
// Large varints can take up to 10 bytes. We're storing an 8-byte first
// value, up to two 10-byte varints, and 1 type byte.
var b [31]byte
// 4 high bits used for the encoding type
b[0] = byte(intCompressedRLE) << 4
i := 1
// The first value
binary.BigEndian.PutUint64(b[i:], e.values[0])
i += 8
// The first delta
i += binary.PutUvarint(b[i:], e.values[1])
// The number of times the delta is repeated
i += binary.PutUvarint(b[i:], uint64(len(e.values)-1))
return b[:i], nil
}
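// Size sketch (numbers match Test_IntegerEncoder_CounterRLE in this
// package): the six values 1e15, 1e15+1, ..., 1e15+5 encode as 1 type byte,
// an 8-byte first zig-zag value, 1 varint byte for the constant delta of 1,
// and 1 varint byte for the repeat count of 5 -- 11 bytes in total.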
func (e *IntegerEncoder) encodePacked() ([]byte, error) {
if len(e.values) == 0 {
return nil, nil
}
// Encode all but the first value. First value is written unencoded
// using 8 bytes.
encoded, err := simple8b.EncodeAll(e.values[1:])
if err != nil {
return nil, err
}
b := make([]byte, 1+(len(encoded)+1)*8)
// 4 high bits of first byte store the encoding type for the block
b[0] = byte(intCompressedSimple) << 4
// Write the first value since it's not part of the encoded values
binary.BigEndian.PutUint64(b[1:9], e.values[0])
// Write the encoded values
for i, v := range encoded {
binary.BigEndian.PutUint64(b[9+i*8:9+i*8+8], v)
}
return b, nil
}
func (e *IntegerEncoder) encodeUncompressed() ([]byte, error) {
if len(e.values) == 0 {
return nil, nil
}
b := make([]byte, 1+len(e.values)*8)
// 4 high bits of first byte store the encoding type for the block
b[0] = byte(intUncompressed) << 4
for i, v := range e.values {
binary.BigEndian.PutUint64(b[1+i*8:1+i*8+8], v)
}
return b, nil
}
// IntegerDecoder decodes a byte slice into int64s.
type IntegerDecoder struct {
// 240 is the maximum number of values that can be encoded into a single uint64 using simple8b
values [240]uint64
bytes []byte
i int
n int
prev int64
first bool
// The first value for a run-length encoded byte slice
rleFirst uint64
// The delta value for a run-length encoded byte slice
rleDelta uint64
encoding byte
err error
}
// SetBytes sets the underlying byte slice of the decoder.
func (d *IntegerDecoder) SetBytes(b []byte) {
if len(b) > 0 {
d.encoding = b[0] >> 4
d.bytes = b[1:]
} else {
d.encoding = 0
d.bytes = nil
}
d.i = 0
d.n = 0
d.prev = 0
d.first = true
d.rleFirst = 0
d.rleDelta = 0
d.err = nil
}
// Next returns true if there are any values remaining to be decoded.
func (d *IntegerDecoder) Next() bool {
if d.i >= d.n && len(d.bytes) == 0 {
return false
}
d.i++
if d.i >= d.n {
switch d.encoding {
case intUncompressed:
d.decodeUncompressed()
case intCompressedSimple:
d.decodePacked()
case intCompressedRLE:
d.decodeRLE()
default:
d.err = fmt.Errorf("unknown encoding %v", d.encoding)
}
}
return d.err == nil && d.i < d.n
}
// Error returns the last error encountered by the decoder.
func (d *IntegerDecoder) Error() error {
return d.err
}
// Read returns the next value from the decoder.
func (d *IntegerDecoder) Read() int64 {
switch d.encoding {
case intCompressedRLE:
return ZigZagDecode(d.rleFirst) + int64(d.i)*ZigZagDecode(d.rleDelta)
default:
v := ZigZagDecode(d.values[d.i])
// v is the delta encoded value, we need to add the prior value to get the original
v = v + d.prev
d.prev = v
return v
}
}
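// RLE read example (illustrative): with a decoded first value of 1000 and a
// decoded delta of 10, successive Read calls return 1000, 1010, 1020, ... as
// d.i advances, without ever materializing a values slice.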
func (d *IntegerDecoder) decodeRLE() {
if len(d.bytes) == 0 {
return
}
if len(d.bytes) < 8 {
d.err = fmt.Errorf("IntegerDecoder: not enough data to decode RLE starting value")
return
}
var i, n int
// The next 8 bytes are the starting value
first := binary.BigEndian.Uint64(d.bytes[i : i+8])
i += 8
// The next 1-10 bytes are the delta value
value, n := binary.Uvarint(d.bytes[i:])
if n <= 0 {
d.err = fmt.Errorf("IntegerDecoder: invalid RLE delta value")
return
}
i += n
// The last 1-10 bytes are how many times the value repeats
count, n := binary.Uvarint(d.bytes[i:])
if n <= 0 {
d.err = fmt.Errorf("IntegerDecoder: invalid RLE repeat value")
return
}
// Store the first value and delta value so we do not need to allocate
// a large values slice. We can compute the value at position d.i on
// demand.
d.rleFirst = first
d.rleDelta = value
d.n = int(count) + 1
d.i = 0
// We've processed all the bytes
d.bytes = nil
}
func (d *IntegerDecoder) decodePacked() {
if len(d.bytes) == 0 {
return
}
if len(d.bytes) < 8 {
d.err = fmt.Errorf("IntegerDecoder: not enough data to decode packed value")
return
}
v := binary.BigEndian.Uint64(d.bytes[0:8])
// The first value is always unencoded
if d.first {
d.first = false
d.n = 1
d.values[0] = v
} else {
n, err := simple8b.Decode(&d.values, v)
if err != nil {
// Should never happen, the only error that could be returned is if the value to be decoded was not
// actually encoded by simple8b encoder.
d.err = fmt.Errorf("failed to decode value %v: %v", v, err)
}
d.n = n
}
d.i = 0
d.bytes = d.bytes[8:]
}
func (d *IntegerDecoder) decodeUncompressed() {
if len(d.bytes) == 0 {
return
}
if len(d.bytes) < 8 {
d.err = fmt.Errorf("IntegerDecoder: not enough data to decode uncompressed value")
return
}
d.values[0] = binary.BigEndian.Uint64(d.bytes[0:8])
d.i = 0
d.n = 1
d.bytes = d.bytes[8:]
}

View File

@@ -0,0 +1,646 @@
package tsm1
import (
"math"
"math/rand"
"reflect"
"testing"
"testing/quick"
)
func Test_IntegerEncoder_NoValues(t *testing.T) {
enc := NewIntegerEncoder(0)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if len(b) > 0 {
t.Fatalf("unexpected lenght: exp 0, got %v", len(b))
}
var dec IntegerDecoder
dec.SetBytes(b)
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_IntegerEncoder_One(t *testing.T) {
enc := NewIntegerEncoder(1)
v1 := int64(1)
enc.Write(v1)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; intCompressedSimple != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v1)
}
}
func Test_IntegerEncoder_Two(t *testing.T) {
enc := NewIntegerEncoder(2)
var v1, v2 int64 = 1, 2
enc.Write(v1)
enc.Write(v2)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; intCompressedSimple != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v2)
}
}
func Test_IntegerEncoder_Negative(t *testing.T) {
enc := NewIntegerEncoder(3)
var v1, v2, v3 int64 = -2, 0, 1
enc.Write(v1)
enc.Write(v2)
enc.Write(v3)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; intCompressedSimple != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v2)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v3 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v3)
}
}
func Test_IntegerEncoder_Large_Range(t *testing.T) {
enc := NewIntegerEncoder(2)
var v1, v2 int64 = math.MinInt64, math.MaxInt64
enc.Write(v1)
enc.Write(v2)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; intUncompressed != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v2)
}
}
func Test_IntegerEncoder_Uncompressed(t *testing.T) {
enc := NewIntegerEncoder(3)
var v1, v2, v3 int64 = 0, 1, 1 << 60
enc.Write(v1)
enc.Write(v2)
enc.Write(v3)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("expected error: %v", err)
}
// 1 byte header + 3 * 8 byte values
if exp := 25; len(b) != exp {
t.Fatalf("length mismatch: got %v, exp %v", len(b), exp)
}
if got := b[0] >> 4; intUncompressed != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v2)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if v3 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), v3)
}
}
func Test_IntegerEncoder_NegativeUncompressed(t *testing.T) {
values := []int64{
-2352281900722994752, 1438442655375607923, -4110452567888190110,
-1221292455668011702, -1941700286034261841, -2836753127140407751,
1432686216250034552, 3663244026151507025, -3068113732684750258,
-1949953187327444488, 3713374280993588804, 3226153669854871355,
-2093273755080502606, 1006087192578600616, -2272122301622271655,
2533238229511593671, -4450454445568858273, 2647789901083530435,
2761419461769776844, -1324397441074946198, -680758138988210958,
94468846694902125, -2394093124890745254, -2682139311758778198,
}
enc := NewIntegerEncoder(256)
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("expected error: %v", err)
}
if got := b[0] >> 4; intUncompressed != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_AllNegative(t *testing.T) {
enc := NewIntegerEncoder(3)
values := []int64{
-10, -5, -1,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; intCompressedSimple != got {
t.Fatalf("encoding type mismatch: exp uncompressed, got %v", got)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_CounterPacked(t *testing.T) {
enc := NewIntegerEncoder(16)
values := []int64{
1e15, 1e15 + 1, 1e15 + 2, 1e15 + 3, 1e15 + 4, 1e15 + 6,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != intCompressedSimple {
t.Fatalf("unexpected encoding format: expected simple, got %v", b[0]>>4)
}
// Should use 1 header byte + two 8-byte words if delta-encoding is used based on
// value sizes. Without delta-encoding, we'd get 49 bytes.
if exp := 17; len(b) != exp {
t.Fatalf("encoded length mismatch: got %v, exp %v", len(b), exp)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_CounterRLE(t *testing.T) {
enc := NewIntegerEncoder(16)
values := []int64{
1e15, 1e15 + 1, 1e15 + 2, 1e15 + 3, 1e15 + 4, 1e15 + 5,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != intCompressedRLE {
t.Fatalf("unexpected encoding format: expected RLE, got %v", b[0]>>4)
}
// Should use 1 header byte, 8 byte first value, 1 var-byte for delta and 1 var-byte for
// count of deltas in this particular RLE.
if exp := 11; len(b) != exp {
t.Fatalf("encoded length mismatch: got %v, exp %v", len(b), exp)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_Descending(t *testing.T) {
enc := NewIntegerEncoder(16)
values := []int64{
7094, 4472, 1850,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != intCompressedRLE {
t.Fatalf("unexpected encoding format: expected simple, got %v", b[0]>>4)
}
// Should use 1 header byte, 8 byte first value, 1 var-byte for delta and 1 var-byte for
// count of deltas in this particular RLE.
if exp := 12; len(b) != exp {
t.Fatalf("encoded length mismatch: got %v, exp %v", len(b), exp)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_Flat(t *testing.T) {
enc := NewIntegerEncoder(16)
values := []int64{
1, 1, 1, 1,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != intCompressedRLE {
t.Fatalf("unexpected encoding format: expected simple, got %v", b[0]>>4)
}
// Should use 1 header byte, 8 byte first value, 1 var-byte for delta and 1 var-byte for
// count of deltas in this particular RLE.
if exp := 11; len(b) != exp {
t.Fatalf("encoded length mismatch: got %v, exp %v", len(b), exp)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_MinMax(t *testing.T) {
enc := NewIntegerEncoder(2)
values := []int64{
math.MinInt64, math.MaxInt64,
}
for _, v := range values {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != intUncompressed {
t.Fatalf("unexpected encoding format: expected simple, got %v", b[0]>>4)
}
if exp := 17; len(b) != exp {
t.Fatalf("encoded length mismatch: got %v, exp %v", len(b), exp)
}
var dec IntegerDecoder
dec.SetBytes(b)
i := 0
for dec.Next() {
if i >= len(values) {
t.Fatalf("read too many values: got %v, exp %v", i, len(values))
}
if values[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), values[i])
}
i++
}
if i != len(values) {
t.Fatalf("failed to read enough values: got %v, exp %v", i, len(values))
}
}
func Test_IntegerEncoder_Quick(t *testing.T) {
quick.Check(func(values []int64) bool {
expected := values
if values == nil {
expected = []int64{} // is this really expected?
}
// Write values to encoder.
enc := NewIntegerEncoder(1024)
for _, v := range values {
enc.Write(v)
}
// Retrieve encoded bytes from encoder.
buf, err := enc.Bytes()
if err != nil {
t.Fatal(err)
}
// Read values out of decoder.
got := make([]int64, 0, len(values))
var dec IntegerDecoder
dec.SetBytes(buf)
for dec.Next() {
if err := dec.Error(); err != nil {
t.Fatal(err)
}
got = append(got, dec.Read())
}
// Verify that input and output values match.
if !reflect.DeepEqual(expected, got) {
t.Fatalf("mismatch:\n\nexp=%#v\n\ngot=%#v\n\n", expected, got)
}
return true
}, nil)
}
func Test_IntegerDecoder_Corrupt(t *testing.T) {
cases := []string{
"", // Empty
"\x00abc", // Uncompressed: less than 8 bytes
"\x10abc", // Packed: less than 8 bytes
"\x20abc", // RLE: less than 8 bytes
"\x2012345678\x90", // RLE: valid starting value but invalid delta value
"\x2012345678\x01\x90", // RLE: valid starting, valid delta value, invalid repeat value
}
for _, c := range cases {
var dec IntegerDecoder
dec.SetBytes([]byte(c))
if dec.Next() {
t.Fatalf("exp next == false, got true")
}
}
}
func BenchmarkIntegerEncoderRLE(b *testing.B) {
enc := NewIntegerEncoder(1024)
x := make([]int64, 1024)
for i := 0; i < len(x); i++ {
x[i] = int64(i)
enc.Write(x[i])
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
enc.Bytes()
}
}
func BenchmarkIntegerEncoderPackedSimple(b *testing.B) {
enc := NewIntegerEncoder(1024)
x := make([]int64, 1024)
for i := 0; i < len(x); i++ {
// Small amount of randomness prevents RLE from being used
x[i] = int64(i) + int64(rand.Intn(10))
enc.Write(x[i])
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
enc.Bytes()
enc.Reset()
for i := 0; i < len(x); i++ {
enc.Write(x[i])
}
}
}
func BenchmarkIntegerDecoderPackedSimple(b *testing.B) {
x := make([]int64, 1024)
enc := NewIntegerEncoder(1024)
for i := 0; i < len(x); i++ {
// Small amount of randomness prevents RLE from being used
x[i] = int64(i) + int64(rand.Intn(10))
enc.Write(x[i])
}
bytes, _ := enc.Bytes()
b.ResetTimer()
var dec IntegerDecoder
for i := 0; i < b.N; i++ {
dec.SetBytes(bytes)
for dec.Next() {
}
}
}
func BenchmarkIntegerDecoderRLE(b *testing.B) {
x := make([]int64, 1024)
enc := NewIntegerEncoder(1024)
for i := 0; i < len(x); i++ {
x[i] = int64(i)
enc.Write(x[i])
}
bytes, _ := enc.Bytes()
b.ResetTimer()
var dec IntegerDecoder
dec.SetBytes(bytes)
for i := 0; i < b.N; i++ {
dec.SetBytes(bytes)
for dec.Next() {
}
}
}

File diff suppressed because it is too large

View File

@@ -0,0 +1,578 @@
package tsm1
import (
"sort"
"fmt"
"runtime"
"sync"
"github.com/influxdata/influxdb/influxql"
"github.com/influxdata/influxdb/tsdb"
"github.com/uber-go/zap"
)
type cursor interface {
close() error
next() (t int64, v interface{})
}
// cursorAt provides a buffered cursor interface.
// This is required for literal value cursors which don't have a time value.
type cursorAt interface {
close() error
peek() (k int64, v interface{})
nextAt(seek int64) interface{}
}
type nilCursor struct{}
func (nilCursor) next() (int64, interface{}) { return tsdb.EOF, nil }
// bufCursor implements a buffered cursor.
type bufCursor struct {
cur cursor
buf struct {
key int64
value interface{}
filled bool
}
ascending bool
}
// newBufCursor returns a buffered wrapper for cur.
func newBufCursor(cur cursor, ascending bool) *bufCursor {
return &bufCursor{cur: cur, ascending: ascending}
}
func (c *bufCursor) close() error {
err := c.cur.close()
c.cur = nil
return err
}
// next returns the buffer, if filled. Otherwise returns the next key/value from the cursor.
func (c *bufCursor) next() (int64, interface{}) {
if c.buf.filled {
k, v := c.buf.key, c.buf.value
c.buf.filled = false
return k, v
}
return c.cur.next()
}
// unread pushes k and v onto the buffer.
func (c *bufCursor) unread(k int64, v interface{}) {
c.buf.key, c.buf.value = k, v
c.buf.filled = true
}
// peek reads the next key/value without removing it from the cursor.
func (c *bufCursor) peek() (k int64, v interface{}) {
k, v = c.next()
c.unread(k, v)
return
}
// nextAt returns the next value where key is equal to seek.
// Skips over any keys that come before seek in the cursor's direction.
// If the key doesn't exist then a nil value is returned instead.
func (c *bufCursor) nextAt(seek int64) interface{} {
for {
k, v := c.next()
if k != tsdb.EOF {
if k == seek {
return v
} else if c.ascending && k < seek {
continue
} else if !c.ascending && k > seek {
continue
}
c.unread(k, v)
}
// Return "nil" value for type.
switch c.cur.(type) {
case floatCursor:
return (*float64)(nil)
case integerCursor:
return (*int64)(nil)
case stringCursor:
return (*string)(nil)
case booleanCursor:
return (*bool)(nil)
default:
panic("unreachable")
}
}
}
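// Example (illustrative): on an ascending cursor holding keys 3, 5 and 8,
// nextAt(5) skips 3 and returns the value at 5, while nextAt(6) pushes 8
// back via unread and returns a typed nil, letting callers distinguish "no
// value at this timestamp" from end of data.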
// statsBufferCopyIntervalN is the number of points that are read before
// copying the stats buffer to the iterator's stats field. This is used to
// amortize the cost of using a mutex when updating stats.
const statsBufferCopyIntervalN = 100
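// For example, at the default of 100 the statsLock in copyStats is acquired
// roughly once per hundred points rather than once per point.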
{{range .}}
type {{.name}}FinalizerIterator struct {
influxql.{{.Name}}Iterator
logger zap.Logger
}
func new{{.Name}}FinalizerIterator(inner influxql.{{.Name}}Iterator, logger zap.Logger) *{{.name}}FinalizerIterator {
itr := &{{.name}}FinalizerIterator{ {{.Name}}Iterator: inner, logger: logger}
runtime.SetFinalizer(itr, (*{{.name}}FinalizerIterator).closeGC)
return itr
}
func (itr *{{.name}}FinalizerIterator) closeGC() {
runtime.SetFinalizer(itr, nil)
itr.logger.Error("{{.Name}}Iterator finalized by GC")
itr.Close()
}
func (itr *{{.name}}FinalizerIterator) Close() error {
runtime.SetFinalizer(itr, nil)
return itr.{{.Name}}Iterator.Close()
}
type {{.name}}Iterator struct {
cur {{.name}}Cursor
aux []cursorAt
conds struct {
names []string
curs []cursorAt
}
opt influxql.IteratorOptions
m map[string]interface{} // map used for condition evaluation
point influxql.{{.Name}}Point // reusable buffer
statsLock sync.Mutex
stats influxql.IteratorStats
statsBuf influxql.IteratorStats
}
func new{{.Name}}Iterator(name string, tags influxql.Tags, opt influxql.IteratorOptions, cur {{.name}}Cursor, aux []cursorAt, conds []cursorAt, condNames []string) *{{.name}}Iterator {
itr := &{{.name}}Iterator{
cur: cur,
aux: aux,
opt: opt,
point: influxql.{{.Name}}Point{
Name: name,
Tags: tags,
},
statsBuf: influxql.IteratorStats{
SeriesN: 1,
},
}
itr.stats = itr.statsBuf
if len(aux) > 0 {
itr.point.Aux = make([]interface{}, len(aux))
}
if opt.Condition != nil {
itr.m = make(map[string]interface{}, len(aux)+len(conds))
}
itr.conds.names = condNames
itr.conds.curs = conds
return itr
}
// Next returns the next point from the iterator.
func (itr *{{.name}}Iterator) Next() (*influxql.{{.Name}}Point, error) {
for {
seek := tsdb.EOF
if itr.cur != nil {
// Read from the main cursor if we have one.
itr.point.Time, itr.point.Value = itr.cur.next{{.Name}}()
seek = itr.point.Time
} else {
// Otherwise find lowest aux timestamp.
for i := range itr.aux {
if k, _ := itr.aux[i].peek(); k != tsdb.EOF {
if seek == tsdb.EOF || (itr.opt.Ascending && k < seek) || (!itr.opt.Ascending && k > seek) {
seek = k
}
}
}
itr.point.Time = seek
}
// Exit if we have no more points or we are outside our time range.
if itr.point.Time == tsdb.EOF {
itr.copyStats()
return nil, nil
} else if itr.opt.Ascending && itr.point.Time > itr.opt.EndTime {
itr.copyStats()
return nil, nil
} else if !itr.opt.Ascending && itr.point.Time < itr.opt.StartTime {
itr.copyStats()
return nil, nil
}
// Read from each auxiliary cursor.
for i := range itr.opt.Aux {
itr.point.Aux[i] = itr.aux[i].nextAt(seek)
}
// Read from condition field cursors.
for i := range itr.conds.curs {
itr.m[itr.conds.names[i]] = itr.conds.curs[i].nextAt(seek)
}
// Evaluate condition, if one exists. Retry if it fails.
if itr.opt.Condition != nil && !influxql.EvalBool(itr.opt.Condition, itr.m) {
continue
}
// Track points returned.
itr.statsBuf.PointN++
// Copy buffer to stats periodically.
if itr.statsBuf.PointN % statsBufferCopyIntervalN == 0 {
itr.copyStats()
}
return &itr.point, nil
}
}
// copyStats copies from the itr stats buffer to the stats under lock.
func (itr *{{.name}}Iterator) copyStats() {
itr.statsLock.Lock()
itr.stats = itr.statsBuf
itr.statsLock.Unlock()
}
// Stats returns stats on the points processed.
func (itr *{{.name}}Iterator) Stats() influxql.IteratorStats {
itr.statsLock.Lock()
stats := itr.stats
itr.statsLock.Unlock()
return stats
}
// Close closes the iterator.
func (itr *{{.name}}Iterator) Close() error {
cursorsAt(itr.aux).close()
itr.aux = nil
cursorsAt(itr.conds.curs).close()
itr.conds.curs = nil
if itr.cur != nil {
err := itr.cur.close()
itr.cur = nil
return err
}
return nil
}
// {{.name}}LimitIterator
type {{.name}}LimitIterator struct {
input influxql.{{.Name}}Iterator
opt influxql.IteratorOptions
n int
}
func new{{.Name}}LimitIterator(input influxql.{{.Name}}Iterator, opt influxql.IteratorOptions) *{{.name}}LimitIterator {
return &{{.name}}LimitIterator{
input: input,
opt: opt,
}
}
func (itr *{{.name}}LimitIterator) Stats() influxql.IteratorStats { return itr.input.Stats() }
func (itr *{{.name}}LimitIterator) Close() error { return itr.input.Close() }
func (itr *{{.name}}LimitIterator) Next() (*influxql.{{.Name}}Point, error) {
// Check if we are beyond the limit.
if (itr.n-itr.opt.Offset) > itr.opt.Limit {
return nil, nil
}
// Read the next point.
p, err := itr.input.Next()
if p == nil || err != nil {
return nil, err
}
// Increment counter.
itr.n++
// Offsets are handled by a higher level iterator so return all points.
return p, nil
}
// {{.name}}Cursor represents an object for iterating over a single {{.name}} field.
type {{.name}}Cursor interface {
cursor
next{{.Name}}() (t int64, v {{.Type}})
}
func new{{.Name}}Cursor(seek int64, ascending bool, cacheValues Values, tsmKeyCursor *KeyCursor) {{.name}}Cursor {
if ascending {
return new{{.Name}}AscendingCursor(seek, cacheValues, tsmKeyCursor)
}
return new{{.Name}}DescendingCursor(seek, cacheValues, tsmKeyCursor)
}
type {{.name}}AscendingCursor struct {
cache struct {
values Values
pos int
}
tsm struct {
buf []{{.Name}}Value
values []{{.Name}}Value
pos int
keyCursor *KeyCursor
}
}
func new{{.Name}}AscendingCursor(seek int64, cacheValues Values, tsmKeyCursor *KeyCursor) *{{.name}}AscendingCursor {
c := &{{.name}}AscendingCursor{}
c.cache.values = cacheValues
c.cache.pos = sort.Search(len(c.cache.values), func(i int) bool {
return c.cache.values[i].UnixNano() >= seek
})
c.tsm.keyCursor = tsmKeyCursor
c.tsm.buf = make([]{{.Name}}Value, 10)
c.tsm.values, _ = c.tsm.keyCursor.Read{{.Name}}Block(&c.tsm.buf)
c.tsm.pos = sort.Search(len(c.tsm.values), func(i int) bool {
return c.tsm.values[i].UnixNano() >= seek
})
return c
}
// peekCache returns the current time/value from the cache.
func (c *{{.name}}AscendingCursor) peekCache() (t int64, v {{.Type}}) {
if c.cache.pos >= len(c.cache.values) {
return tsdb.EOF, {{.Nil}}
}
item := c.cache.values[c.cache.pos]
return item.UnixNano(), item.({{.ValueType}}).value
}
// peekTSM returns the current time/value from tsm.
func (c *{{.name}}AscendingCursor) peekTSM() (t int64, v {{.Type}}) {
if c.tsm.pos < 0 || c.tsm.pos >= len(c.tsm.values) {
return tsdb.EOF, {{.Nil}}
}
item := c.tsm.values[c.tsm.pos]
return item.UnixNano(), item.value
}
// close closes the cursor and any dependent cursors.
func (c *{{.name}}AscendingCursor) close() error {
c.tsm.keyCursor.Close()
c.tsm.keyCursor = nil
c.tsm.buf = nil
c.cache.values = nil
c.tsm.values = nil
return nil
}
// next returns the next key/value for the cursor.
func (c *{{.name}}AscendingCursor) next() (int64, interface{}) { return c.next{{.Name}}() }
// next{{.Name}} returns the next key/value for the cursor.
func (c *{{.name}}AscendingCursor) next{{.Name}}() (int64, {{.Type}}) {
ckey, cvalue := c.peekCache()
tkey, tvalue := c.peekTSM()
// No more data in cache or in TSM files.
if ckey == tsdb.EOF && tkey == tsdb.EOF {
return tsdb.EOF, {{.Nil}}
}
// Both cache and tsm files have the same key, cache takes precedence.
if ckey == tkey {
c.nextCache()
c.nextTSM()
return ckey, cvalue
}
// Buffered cache key precedes that in TSM file.
if ckey != tsdb.EOF && (ckey < tkey || tkey == tsdb.EOF) {
c.nextCache()
return ckey, cvalue
}
// Buffered TSM key precedes that in cache.
c.nextTSM()
return tkey, tvalue
}
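// As a worked example of the merge above: with ascending iteration, cache
// times {1, 3} and TSM times {1, 2} yield 1 (the cache wins the tie), then 2
// (TSM), then 3 (cache).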
// nextCache returns the next value from the cache.
func (c *{{.name}}AscendingCursor) nextCache() {
if c.cache.pos >= len(c.cache.values) {
return
}
c.cache.pos++
}
// nextTSM returns the next value from the TSM files.
func (c *{{.name}}AscendingCursor) nextTSM() {
c.tsm.pos++
if c.tsm.pos >= len(c.tsm.values) {
c.tsm.keyCursor.Next()
c.tsm.values, _ = c.tsm.keyCursor.Read{{.Name}}Block(&c.tsm.buf)
if len(c.tsm.values) == 0 {
return
}
c.tsm.pos = 0
}
}
type {{.name}}DescendingCursor struct {
cache struct {
values Values
pos int
}
tsm struct {
buf []{{.Name}}Value
values []{{.Name}}Value
pos int
keyCursor *KeyCursor
}
}
func new{{.Name}}DescendingCursor(seek int64, cacheValues Values, tsmKeyCursor *KeyCursor) *{{.name}}DescendingCursor {
c := &{{.name}}DescendingCursor{}
c.cache.values = cacheValues
c.cache.pos = sort.Search(len(c.cache.values), func(i int) bool {
return c.cache.values[i].UnixNano() >= seek
})
if t, _ := c.peekCache(); t != seek {
c.cache.pos--
}
c.tsm.keyCursor = tsmKeyCursor
c.tsm.buf = make([]{{.Name}}Value, 10)
c.tsm.values, _ = c.tsm.keyCursor.Read{{.Name}}Block(&c.tsm.buf)
c.tsm.pos = sort.Search(len(c.tsm.values), func(i int) bool {
return c.tsm.values[i].UnixNano() >= seek
})
if t, _ := c.peekTSM(); t != seek {
c.tsm.pos--
}
return c
}
// peekCache returns the current time/value from the cache.
func (c *{{.name}}DescendingCursor) peekCache() (t int64, v {{.Type}}) {
if c.cache.pos < 0 || c.cache.pos >= len(c.cache.values) {
return tsdb.EOF, {{.Nil}}
}
item := c.cache.values[c.cache.pos]
return item.UnixNano(), item.({{.ValueType}}).value
}
// peekTSM returns the current time/value from tsm.
func (c *{{.name}}DescendingCursor) peekTSM() (t int64, v {{.Type}}) {
if c.tsm.pos < 0 || c.tsm.pos >= len(c.tsm.values) {
return tsdb.EOF, {{.Nil}}
}
item := c.tsm.values[c.tsm.pos]
return item.UnixNano(), item.value
}
// close closes the cursor and any dependent cursors.
func (c *{{.name}}DescendingCursor) close() error {
c.tsm.keyCursor.Close()
c.tsm.keyCursor = nil
c.tsm.buf = nil
c.cache.values = nil
c.tsm.values = nil
return nil
}
// next returns the next key/value for the cursor.
func (c *{{.name}}DescendingCursor) next() (int64, interface{}) { return c.next{{.Name}}() }
// next{{.Name}} returns the next key/value for the cursor.
func (c *{{.name}}DescendingCursor) next{{.Name}}() (int64, {{.Type}}) {
ckey, cvalue := c.peekCache()
tkey, tvalue := c.peekTSM()
// No more data in cache or in TSM files.
if ckey == tsdb.EOF && tkey == tsdb.EOF {
return tsdb.EOF, {{.Nil}}
}
// Both cache and tsm files have the same key, cache takes precedence.
if ckey == tkey {
c.nextCache()
c.nextTSM()
return ckey, cvalue
}
// Buffered cache key precedes that in TSM file.
if ckey != tsdb.EOF && (ckey > tkey || tkey == tsdb.EOF) {
c.nextCache()
return ckey, cvalue
}
// Buffered TSM key precedes that in cache.
c.nextTSM()
return tkey, tvalue
}
// nextCache returns the next value from the cache.
func (c *{{.name}}DescendingCursor) nextCache() {
if c.cache.pos < 0 {
return
}
c.cache.pos--
}
// nextTSM returns the next value from the TSM files.
func (c *{{.name}}DescendingCursor) nextTSM() {
c.tsm.pos--
if c.tsm.pos < 0 {
c.tsm.keyCursor.Next()
c.tsm.values, _ = c.tsm.keyCursor.Read{{.Name}}Block(&c.tsm.buf)
if len(c.tsm.values) == 0 {
return
}
c.tsm.pos = len(c.tsm.values) - 1
}
}
// {{.name}}LiteralCursor represents a cursor that always returns a single value.
// It does not have a time value, so it can only be used with nextAt().
type {{.name}}LiteralCursor struct {
value {{.Type}}
}
func (c *{{.name}}LiteralCursor) close() error { return nil }
func (c *{{.name}}LiteralCursor) peek() (t int64, v interface{}) { return tsdb.EOF, c.value }
func (c *{{.name}}LiteralCursor) next() (t int64, v interface{}) { return tsdb.EOF, c.value }
func (c *{{.name}}LiteralCursor) nextAt(seek int64) interface{} { return c.value }
// {{.name}}NilLiteralCursor represents a cursor that always returns a typed nil value.
// It does not have a time value, so it can only be used with nextAt().
type {{.name}}NilLiteralCursor struct {}
func (c *{{.name}}NilLiteralCursor) close() error { return nil }
func (c *{{.name}}NilLiteralCursor) peek() (t int64, v interface{}) { return tsdb.EOF, (*{{.Type}})(nil) }
func (c *{{.name}}NilLiteralCursor) next() (t int64, v interface{}) { return tsdb.EOF, (*{{.Type}})(nil) }
func (c *{{.name}}NilLiteralCursor) nextAt(seek int64) interface{} { return (*{{.Type}})(nil) }
{{end}}
var _ = fmt.Print

View File

@@ -0,0 +1,30 @@
[
{
"Name":"Float",
"name":"float",
"Type":"float64",
"ValueType":"FloatValue",
"Nil":"0"
},
{
"Name":"Integer",
"name":"integer",
"Type":"int64",
"ValueType":"IntegerValue",
"Nil":"0"
},
{
"Name":"String",
"name":"string",
"Type":"string",
"ValueType":"StringValue",
"Nil":"\"\""
},
{
"Name":"Boolean",
"name":"boolean",
"Type":"bool",
"ValueType":"BooleanValue",
"Nil":"false"
}
]

View File

@@ -0,0 +1,92 @@
package tsm1
import (
"fmt"
"github.com/influxdata/influxdb/influxql"
"github.com/uber-go/zap"
)
func newLimitIterator(input influxql.Iterator, opt influxql.IteratorOptions) influxql.Iterator {
switch input := input.(type) {
case influxql.FloatIterator:
return newFloatLimitIterator(input, opt)
case influxql.IntegerIterator:
return newIntegerLimitIterator(input, opt)
case influxql.StringIterator:
return newStringLimitIterator(input, opt)
case influxql.BooleanIterator:
return newBooleanLimitIterator(input, opt)
default:
panic(fmt.Sprintf("unsupported limit iterator type: %T", input))
}
}
type floatCastIntegerCursor struct {
cursor integerCursor
}
func (c *floatCastIntegerCursor) close() error { return c.cursor.close() }
func (c *floatCastIntegerCursor) next() (t int64, v interface{}) { return c.nextFloat() }
func (c *floatCastIntegerCursor) nextFloat() (int64, float64) {
t, v := c.cursor.nextInteger()
return t, float64(v)
}
type integerCastFloatCursor struct {
cursor floatCursor
}
func (c *integerCastFloatCursor) close() error { return c.cursor.close() }
func (c *integerCastFloatCursor) next() (t int64, v interface{}) { return c.nextInteger() }
func (c *integerCastFloatCursor) nextInteger() (int64, int64) {
t, v := c.cursor.nextFloat()
return t, int64(v)
}
type cursorsAt []cursorAt
func (c cursorsAt) close() {
for _, cur := range c {
cur.close()
}
}
// newMergeFinalizerIterator creates a new Merge iterator from the inputs. If the call to Merge succeeds,
// the resulting Iterator will be wrapped in a finalizer iterator.
// If Merge returns an error, the inputs will be closed.
func newMergeFinalizerIterator(inputs []influxql.Iterator, opt influxql.IteratorOptions, log zap.Logger) (influxql.Iterator, error) {
itr, err := influxql.Iterators(inputs).Merge(opt)
if err != nil {
influxql.Iterators(inputs).Close()
return nil, err
}
return newFinalizerIterator(itr, log), nil
}
// newFinalizerIterator creates a new iterator that installs a runtime finalizer
// to ensure close is eventually called if the iterator is garbage collected.
// This additional guard attempts to protect against clients of CreateIterator not
// correctly closing them and leaking cursors.
func newFinalizerIterator(itr influxql.Iterator, log zap.Logger) influxql.Iterator {
if itr == nil {
return nil
}
switch inner := itr.(type) {
case influxql.FloatIterator:
return newFloatFinalizerIterator(inner, log)
case influxql.IntegerIterator:
return newIntegerFinalizerIterator(inner, log)
case influxql.StringIterator:
return newStringFinalizerIterator(inner, log)
case influxql.BooleanIterator:
return newBooleanFinalizerIterator(inner, log)
default:
panic(fmt.Sprintf("unsupported finalizer iterator type: %T", itr))
}
}
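// As an illustrative sketch of the guard above: an iterator that is dropped
// without being closed is eventually cleaned up by the GC, e.g.
//
//	itr := newFinalizerIterator(inner, logger)
//	itr = nil // leaked; closeGC logs an error and closes it on a later GC cycle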

View File

@@ -0,0 +1,32 @@
// +build solaris
package tsm1
import (
"os"
"syscall"
"golang.org/x/sys/unix"
)
func mmap(f *os.File, offset int64, length int) ([]byte, error) {
mmap, err := unix.Mmap(int(f.Fd()), 0, length, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return nil, err
}
if err := unix.Madvise(mmap, syscall.MADV_RANDOM); err != nil {
unix.Munmap(mmap)
return nil, err
}
return mmap, nil
}
func munmap(b []byte) (err error) {
return unix.Munmap(b)
}
// From: github.com/boltdb/bolt/bolt_unix.go
func madvise(b []byte, advice int) (err error) {
return unix.Madvise(b, advice)
}

View File

@@ -0,0 +1,21 @@
// +build !windows,!plan9,!solaris
package tsm1
import (
"os"
"syscall"
)
func mmap(f *os.File, offset int64, length int) ([]byte, error) {
mmap, err := syscall.Mmap(int(f.Fd()), 0, length, syscall.PROT_READ, syscall.MAP_SHARED)
if err != nil {
return nil, err
}
return mmap, nil
}
func munmap(b []byte) (err error) {
return syscall.Munmap(b)
}

View File

@@ -0,0 +1,117 @@
package tsm1
import (
"errors"
"os"
"reflect"
"sync"
"syscall"
"unsafe"
)
// mmap implementation for Windows
// Based on: https://github.com/edsrzf/mmap-go
// Based on: https://github.com/boltdb/bolt/bolt_windows.go
// Ref: https://groups.google.com/forum/#!topic/golang-nuts/g0nLwQI9www
// We keep this map so that we can get back the original handle from the memory address.
var handleLock sync.Mutex
var handleMap = map[uintptr]syscall.Handle{}
var fileMap = map[uintptr]*os.File{}
func openSharedFile(f *os.File) (file *os.File, err error) {
var access, createmode, sharemode uint32
var sa *syscall.SecurityAttributes
access = syscall.GENERIC_READ
sharemode = uint32(syscall.FILE_SHARE_READ | syscall.FILE_SHARE_WRITE | syscall.FILE_SHARE_DELETE)
createmode = syscall.OPEN_EXISTING
fileName := f.Name()
pathp, err := syscall.UTF16PtrFromString(fileName)
if err != nil {
return nil, err
}
h, e := syscall.CreateFile(pathp, access, sharemode, sa, createmode, syscall.FILE_ATTRIBUTE_NORMAL, 0)
if e != nil {
return nil, e
}
// NewFile does not add a finalizer, so the returned file must be closed manually.
return os.NewFile(uintptr(h), fileName), nil
}
func mmap(f *os.File, offset int64, length int) (out []byte, err error) {
// Open a file mapping handle.
sizehi := uint32(length >> 32)
sizelo := uint32(length) & 0xffffffff
sharedHandle, errno := openSharedFile(f)
if errno != nil {
return nil, os.NewSyscallError("CreateFile", errno)
}
h, errno := syscall.CreateFileMapping(syscall.Handle(sharedHandle.Fd()), nil, syscall.PAGE_READONLY, sizehi, sizelo, nil)
if h == 0 {
return nil, os.NewSyscallError("CreateFileMapping", errno)
}
// Create the memory map.
addr, errno := syscall.MapViewOfFile(h, syscall.FILE_MAP_READ, 0, 0, uintptr(length))
if addr == 0 {
return nil, os.NewSyscallError("MapViewOfFile", errno)
}
handleLock.Lock()
handleMap[addr] = h
fileMap[addr] = sharedHandle
handleLock.Unlock()
// Convert to a byte array.
hdr := (*reflect.SliceHeader)(unsafe.Pointer(&out))
hdr.Data = uintptr(unsafe.Pointer(addr))
hdr.Len = length
hdr.Cap = length
return
}
// munmap Windows implementation
// Based on: https://github.com/edsrzf/mmap-go
// Based on: https://github.com/boltdb/bolt/bolt_windows.go
func munmap(b []byte) (err error) {
handleLock.Lock()
defer handleLock.Unlock()
addr := (uintptr)(unsafe.Pointer(&b[0]))
if err := syscall.UnmapViewOfFile(addr); err != nil {
return os.NewSyscallError("UnmapViewOfFile", err)
}
handle, ok := handleMap[addr]
if !ok {
// should be impossible; we would've seen the error above
return errors.New("unknown base address")
}
delete(handleMap, addr)
e := syscall.CloseHandle(syscall.Handle(handle))
if e != nil {
return os.NewSyscallError("CloseHandle", e)
}
file, ok := fileMap[addr]
if !ok {
// should be impossible; we would've seen the error above
return errors.New("unknown base address")
}
delete(fileMap, addr)
e = file.Close()
if e != nil {
return errors.New("close file" + e.Error())
}
return nil
}

View File

@@ -0,0 +1,26 @@
package tsm1
import "sync"
var bufPool sync.Pool
// getBuf returns a buffer with length size from the buffer pool.
func getBuf(size int) *[]byte {
x := bufPool.Get()
if x == nil {
b := make([]byte, size)
return &b
}
buf := x.(*[]byte)
if cap(*buf) < size {
b := make([]byte, size)
return &b
}
*buf = (*buf)[:size]
return buf
}
// putBuf returns a buffer to the pool.
func putBuf(buf *[]byte) {
bufPool.Put(buf)
}
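// withPooledBuf is an illustrative sketch (not part of the original package)
// showing the intended pairing of getBuf and putBuf around a scratch buffer.
func withPooledBuf(size int, fn func([]byte)) {
buf := getBuf(size)
defer putBuf(buf)
// The callback uses the pooled buffer; it is returned to the pool on exit.
fn(*buf)
}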

File diff suppressed because it is too large Load Diff

File diff suppressed because it is too large Load Diff

View File

@@ -0,0 +1,299 @@
package tsm1
import (
"fmt"
"sort"
"sync"
"sync/atomic"
"github.com/cespare/xxhash"
)
// partitions is the number of slots in the ring's continuum. It
// defines the maximum number of members you can have in the ring.
// If fewer members are chosen when creating a ring, then
// they're evenly spread across this many slots in the continuum.
const partitions = 4096
// ring is a structure that maps series keys to entries.
//
// ring is implemented as a crude hash ring, in so much that you can have
// variable numbers of members in the ring, and the appropriate member for a
// given series key can always consistently be found. Unlike a true hash ring
// though, this ring is not resizeable—there must be at most 256 members in the
// ring, and the number of members must always be a power of 2.
//
// ring works as follows: Each member of the ring contains a single store, which
// contains a map of series keys to entries. A ring's continuum always has 4096
// slots, and a member takes up one or more of these slots (depending on how
// many members are specified to be in the ring).
//
// To determine the partition that a series key should be added to, the series
// key is hashed and the hash, modulo the number of continuum slots, is used as
// an index into the ring.
//
type ring struct {
// The unique set of partitions in the ring.
// len(partitions) <= len(continuum)
partitions []*partition
// A mapping of partition to location on the ring continuum. This is used
// to lookup a partition.
continuum []*partition
// Number of keys within the ring. This is used to provide a hint for
// allocating the return values in keys(). It will not be perfectly accurate
// since it doesn't consider adding duplicate keys, or trying to remove non-
// existent keys.
keysHint int64
}
// newring returns a new ring initialised with n partitions. n must always be a
// power of 2, and for performance reasons should be larger than the number of
// cores on the host. The supported set of values for n is:
//
// {1, 2, 4, 8, 16, 32, 64, 128, 256}.
//
func newring(n int) (*ring, error) {
if n <= 0 || n > partitions {
return nil, fmt.Errorf("invalid number of paritions: %d", n)
}
r := ring{
continuum: make([]*partition, partitions), // maximum number of partitions.
}
// The trick here is to map N partitions to all points on the continuum,
// such that the first eight bits of a given hash will map directly to one
// of the N partitions.
for i := 0; i < len(r.continuum); i++ {
if (i == 0 || i%(partitions/n) == 0) && len(r.partitions) < n {
r.partitions = append(r.partitions, &partition{
store: make(map[string]*entry),
entrySizeHints: make(map[uint64]int),
})
}
r.continuum[i] = r.partitions[len(r.partitions)-1]
}
return &r, nil
}
// reset resets the ring so it can be reused. Before removing references to entries
// within each partition it gathers sizing information to provide hints when
// reallocating entries in partition maps.
//
// reset is not safe for use by multiple goroutines.
func (r *ring) reset() {
for _, partition := range r.partitions {
partition.reset()
}
r.keysHint = 0
}
// getPartition retrieves the hash ring partition associated with the provided
// key.
func (r *ring) getPartition(key string) *partition {
return r.continuum[int(xxhash.Sum64([]byte(key))%partitions)]
}
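// For example, the same series key always maps to the same partition,
// regardless of when it is looked up (the key below is hypothetical):
//
//	r, _ := newring(16)
//	p1 := r.getPartition("cpu,host=a#!~#value")
//	p2 := r.getPartition("cpu,host=a#!~#value")
//	// p1 == p2, since the key hashes to a fixed continuum slot.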
// entry returns the entry for the given key.
// entry is safe for use by multiple goroutines.
func (r *ring) entry(key string) (*entry, bool) {
return r.getPartition(key).entry(key)
}
// write writes values to the entry in the ring's partition associated with key.
// If no entry exists for the key then one will be created.
// write is safe for use by multiple goroutines.
func (r *ring) write(key string, values Values) error {
return r.getPartition(key).write(key, values)
}
// add adds an entry to the ring.
func (r *ring) add(key string, entry *entry) {
r.getPartition(key).add(key, entry)
atomic.AddInt64(&r.keysHint, 1)
}
// remove deletes the entry for the given key.
// remove is safe for use by multiple goroutines.
func (r *ring) remove(key string) {
r.getPartition(key).remove(key)
if r.keysHint > 0 {
atomic.AddInt64(&r.keysHint, -1)
}
}
// keys returns all the keys from all partitions in the hash ring. The returned
// keys will be in order if sorted is true.
func (r *ring) keys(sorted bool) []string {
keys := make([]string, 0, atomic.LoadInt64(&r.keysHint))
for _, p := range r.partitions {
keys = append(keys, p.keys()...)
}
if sorted {
sort.Strings(keys)
}
return keys
}
// apply applies the provided function to every entry in the ring under a read
// lock using a separate goroutine for each partition. The provided function
// will be called with each key and the corresponding entry. The first error
// encountered will be returned, if any. apply is safe for use by multiple
// goroutines.
func (r *ring) apply(f func(string, *entry) error) error {
var (
wg sync.WaitGroup
res = make(chan error, len(r.partitions))
)
for _, p := range r.partitions {
wg.Add(1)
go func(p *partition) {
defer wg.Done()
p.mu.RLock()
for k, e := range p.store {
if err := f(k, e); err != nil {
res <- err
p.mu.RUnlock()
return
}
}
p.mu.RUnlock()
}(p)
}
go func() {
wg.Wait()
close(res)
}()
// Collect results.
for err := range res {
if err != nil {
return err
}
}
return nil
}
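// Illustrative use of apply, counting entries across all partitions. The
// counter is updated atomically because f runs concurrently, once per
// partition:
//
//	var n int64
//	_ = r.apply(func(key string, e *entry) error {
//		atomic.AddInt64(&n, 1)
//		return nil
//	})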
// applySerial is similar to apply, but invokes f on each partition in the same
// goroutine.
// apply is safe for use by multiple goroutines.
func (r *ring) applySerial(f func(string, *entry) error) error {
for _, p := range r.partitions {
p.mu.RLock()
for k, e := range p.store {
if err := f(k, e); err != nil {
p.mu.RUnlock()
return err
}
}
p.mu.RUnlock()
}
return nil
}
// partition provides safe access to a map of series keys to entries.
type partition struct {
mu sync.RWMutex
store map[string]*entry
// entrySizeHints stores hints for appropriate sizes to pre-allocate the
// []Values in an entry. entrySizeHints will only contain hints for entries
// that were present prior to the most recent snapshot, preventing unbounded
// growth over time.
entrySizeHints map[uint64]int
}
// entry returns the partition's entry for the provided key.
// It's safe for use by multiple goroutines.
func (p *partition) entry(key string) (*entry, bool) {
p.mu.RLock()
e, ok := p.store[key]
p.mu.RUnlock()
return e, ok
}
// write writes the values to the entry in the partition, creating the entry
// if it does not exist.
// write is safe for use by multiple goroutines.
func (p *partition) write(key string, values Values) error {
p.mu.RLock()
e, ok := p.store[key]
p.mu.RUnlock()
if ok {
// Hot path.
return e.add(values)
}
p.mu.Lock()
defer p.mu.Unlock()
// Check again.
if e, ok = p.store[key]; ok {
return e.add(values)
}
// Create a new entry using a preallocated size if we have a hint available.
hint := p.entrySizeHints[xxhash.Sum64([]byte(key))]
e, err := newEntryValues(values, hint)
if err != nil {
return err
}
p.store[key] = e
return nil
}
// add adds a new entry for key to the partition.
func (p *partition) add(key string, entry *entry) {
p.mu.Lock()
p.store[key] = entry
p.mu.Unlock()
}
// remove deletes the entry associated with the provided key.
// remove is safe for use by multiple goroutines.
func (p *partition) remove(key string) {
p.mu.Lock()
delete(p.store, key)
p.mu.Unlock()
}
// keys returns an unsorted slice of the keys in the partition.
func (p *partition) keys() []string {
p.mu.RLock()
keys := make([]string, 0, len(p.store))
for k := range p.store {
keys = append(keys, k)
}
p.mu.RUnlock()
return keys
}
// reset resets the partition by reinitialising the store. reset returns hints
// about sizes that the entries within the store could be reallocated with.
func (p *partition) reset() {
p.mu.Lock()
defer p.mu.Unlock()
// Collect the allocated sizes of values for each entry in the store.
p.entrySizeHints = make(map[uint64]int)
for k, entry := range p.store {
// If the capacity is large then there are many values in the entry.
// Store a hint to pre-allocate the next time we see the same entry.
entry.mu.RLock()
if cap(entry.values) > 128 { // 4 x the default entry capacity size.
p.entrySizeHints[xxhash.Sum64([]byte(k))] = cap(entry.values)
}
entry.mu.RUnlock()
}
// Reset the store.
p.store = make(map[string]*entry, len(p.store))
}

View File

@@ -0,0 +1,122 @@
package tsm1
import (
"fmt"
"runtime"
"sync"
"testing"
)
func TestRing_newRing(t *testing.T) {
examples := []struct {
n int
returnErr bool
}{
{n: 1}, {n: 2}, {n: 4}, {n: 8}, {n: 16}, {n: 32}, {n: 64}, {n: 128}, {n: 256},
{n: 0, returnErr: true}, {n: 3, returnErr: true}, {n: 512, returnErr: true},
}
for i, example := range examples {
r, err := newring(example.n)
if err != nil {
if example.returnErr {
continue // expecting an error.
}
t.Fatal(err)
}
if got, exp := len(r.partitions), example.n; got != exp {
t.Fatalf("[Example %d] got %v, expected %v", i, got, exp)
}
// Check partitions distributed correctly
partitions := make([]*partition, 0)
for i, partition := range r.continuum {
if i == 0 || partition != partitions[len(partitions)-1] {
partitions = append(partitions, partition)
}
}
if got, exp := len(partitions), example.n; got != exp {
t.Fatalf("[Example %d] got %v, expected %v", i, got, exp)
}
}
}
var strSliceRes []string
func benchmarkRingkeys(b *testing.B, r *ring, keys int) {
// Add some keys
for i := 0; i < keys; i++ {
r.add(fmt.Sprintf("cpu,host=server-%d value=1", i), nil)
}
b.ReportAllocs()
b.ResetTimer()
for i := 0; i < b.N; i++ {
strSliceRes = r.keys(false)
}
}
func BenchmarkRing_keys_100(b *testing.B) { benchmarkRingkeys(b, MustNewRing(256), 100) }
func BenchmarkRing_keys_1000(b *testing.B) { benchmarkRingkeys(b, MustNewRing(256), 1000) }
func BenchmarkRing_keys_10000(b *testing.B) { benchmarkRingkeys(b, MustNewRing(256), 10000) }
func BenchmarkRing_keys_100000(b *testing.B) { benchmarkRingkeys(b, MustNewRing(256), 100000) }
func benchmarkRingWrite(b *testing.B, r *ring, n int) {
for i := 0; i < b.N; i++ {
var wg sync.WaitGroup
for i := 0; i < runtime.GOMAXPROCS(0); i++ {
errC := make(chan error)
wg.Add(1)
go func() {
defer wg.Done()
for j := 0; j < n; j++ {
if err := r.write(fmt.Sprintf("cpu,host=server-%d value=1", j), Values{}); err != nil {
errC <- err
}
}
}()
go func() {
wg.Wait()
close(errC)
}()
for err := range errC {
if err != nil {
b.Error(err)
}
}
}
}
}
func BenchmarkRing_write_1_100(b *testing.B) { benchmarkRingWrite(b, MustNewRing(1), 100) }
func BenchmarkRing_write_1_1000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(1), 1000) }
func BenchmarkRing_write_1_10000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(1), 10000) }
func BenchmarkRing_write_1_100000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(1), 100000) }
func BenchmarkRing_write_4_100(b *testing.B) { benchmarkRingWrite(b, MustNewRing(4), 100) }
func BenchmarkRing_write_4_1000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(4), 1000) }
func BenchmarkRing_write_4_10000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(4), 10000) }
func BenchmarkRing_write_4_100000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(4), 100000) }
func BenchmarkRing_write_32_100(b *testing.B) { benchmarkRingWrite(b, MustNewRing(32), 100) }
func BenchmarkRing_write_32_1000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(32), 1000) }
func BenchmarkRing_write_32_10000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(32), 10000) }
func BenchmarkRing_write_32_100000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(32), 100000) }
func BenchmarkRing_write_128_100(b *testing.B) { benchmarkRingWrite(b, MustNewRing(128), 100) }
func BenchmarkRing_write_128_1000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(128), 1000) }
func BenchmarkRing_write_128_10000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(128), 10000) }
func BenchmarkRing_write_128_100000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(256), 100000) }
func BenchmarkRing_write_256_100(b *testing.B) { benchmarkRingWrite(b, MustNewRing(256), 100) }
func BenchmarkRing_write_256_1000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(256), 1000) }
func BenchmarkRing_write_256_10000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(256), 10000) }
func BenchmarkRing_write_256_100000(b *testing.B) { benchmarkRingWrite(b, MustNewRing(256), 100000) }
func MustNewRing(n int) *ring {
r, err := newring(n)
if err != nil {
panic(err)
}
return r
}

View File

@@ -0,0 +1,133 @@
package tsm1
// String encoding uses snappy compression to compress blocks of strings. Each string is
// appended to a byte slice, prefixed with its length as a variable-length integer, followed
// by the string bytes. The accumulated bytes are compressed using the snappy compressor, and
// a 1-byte header is used to indicate the type of encoding.
import (
"encoding/binary"
"fmt"
"github.com/golang/snappy"
)
const (
// stringUncompressed is an uncompressed format encoding strings as raw bytes.
// Not yet implemented.
stringUncompressed = 0
// stringCompressedSnappy is a compressed encoding using Snappy compression
stringCompressedSnappy = 1
)
// StringEncoder encodes multiple strings into a byte slice.
type StringEncoder struct {
// The encoded bytes
bytes []byte
}
// NewStringEncoder returns a new StringEncoder with an initial buffer ready to hold sz bytes.
func NewStringEncoder(sz int) StringEncoder {
return StringEncoder{
bytes: make([]byte, 0, sz),
}
}
// Flush is no-op
func (e *StringEncoder) Flush() {}
// Reset sets the encoder back to its initial state.
func (e *StringEncoder) Reset() {
e.bytes = e.bytes[:0]
}
// Write encodes s to the underlying buffer.
func (e *StringEncoder) Write(s string) {
b := make([]byte, 10)
// Append the length of the string using variable byte encoding
i := binary.PutUvarint(b, uint64(len(s)))
e.bytes = append(e.bytes, b[:i]...)
// Append the string bytes
e.bytes = append(e.bytes, s...)
}
// Bytes returns a copy of the underlying buffer.
func (e *StringEncoder) Bytes() ([]byte, error) {
// Compress the currently appended bytes using snappy and prefix with
// a 1 byte header for future extension
data := snappy.Encode(nil, e.bytes)
return append([]byte{stringCompressedSnappy << 4}, data...), nil
}
// StringDecoder decodes a byte slice into strings.
type StringDecoder struct {
b []byte
l int
i int
err error
}
// SetBytes initializes the decoder with bytes to read from.
// This must be called before calling any other method.
func (e *StringDecoder) SetBytes(b []byte) error {
// First byte stores the encoding type, only have snappy format
// currently so ignore for now.
var data []byte
if len(b) > 0 {
var err error
data, err = snappy.Decode(nil, b[1:])
if err != nil {
return fmt.Errorf("failed to decode string block: %v", err.Error())
}
}
e.b = data
e.l = 0
e.i = 0
e.err = nil
return nil
}
// Next returns true if there are any values remaining to be decoded.
func (e *StringDecoder) Next() bool {
if e.err != nil {
return false
}
e.i += e.l
return e.i < len(e.b)
}
// Read returns the next value from the decoder.
func (e *StringDecoder) Read() string {
// Read the length of the string
length, n := binary.Uvarint(e.b[e.i:])
if n <= 0 {
e.err = fmt.Errorf("StringDecoder: invalid encoded string length")
return ""
}
// The length of this string plus the length of the variable byte encoded length
e.l = int(length) + n
lower := e.i + n
upper := lower + int(length)
if upper < lower {
e.err = fmt.Errorf("StringDecoder: length overflow")
return ""
}
if upper > len(e.b) {
e.err = fmt.Errorf("StringDecoder: not enough data to represent encoded string")
return ""
}
return string(e.b[lower:upper])
}
// Error returns the last error encountered by the decoder.
func (e *StringDecoder) Error() error {
return e.err
}
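// stringRoundTripExample is an illustrative sketch (not part of the original
// package) showing the encoder/decoder round trip described above.
func stringRoundTripExample() {
enc := NewStringEncoder(1024)
enc.Write("hello")
enc.Write("world")
// Bytes snappy-compresses everything written so far.
b, err := enc.Bytes()
if err != nil {
return
}
var dec StringDecoder
if err := dec.SetBytes(b); err != nil {
return
}
for dec.Next() {
fmt.Println(dec.Read()) // prints "hello", then "world"
}
}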

View File

@@ -0,0 +1,177 @@
package tsm1
import (
"fmt"
"reflect"
"testing"
"testing/quick"
)
func Test_StringEncoder_NoValues(t *testing.T) {
enc := NewStringEncoder(1024)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec StringDecoder
if err := dec.SetBytes(b); err != nil {
t.Fatalf("unexpected error creating string decoder: %v", err)
}
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_StringEncoder_Single(t *testing.T) {
enc := NewStringEncoder(1024)
v1 := "v1"
enc.Write(v1)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec StringDecoder
if err := dec.SetBytes(b); err != nil {
t.Fatalf("unexpected error creating string decoder: %v", err)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got false, exp true")
}
if v1 != dec.Read() {
t.Fatalf("unexpected value: got %v, exp %v", dec.Read(), v1)
}
}
func Test_StringEncoder_Multi_Compressed(t *testing.T) {
enc := NewStringEncoder(1024)
values := make([]string, 10)
for i := range values {
values[i] = fmt.Sprintf("value %d", i)
enc.Write(values[i])
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if b[0]>>4 != stringCompressedSnappy {
t.Fatalf("unexpected encoding: got %v, exp %v", b[0], stringCompressedSnappy)
}
if exp := 51; len(b) != exp {
t.Fatalf("unexpected length: got %v, exp %v", len(b), exp)
}
var dec StringDecoder
if err := dec.SetBytes(b); err != nil {
t.Fatalf("unexpected erorr creating string decoder: %v", err)
}
for i, v := range values {
if !dec.Next() {
t.Fatalf("unexpected next value: got false, exp true")
}
if v != dec.Read() {
t.Fatalf("unexpected value at pos %d: got %v, exp %v", i, dec.Read(), v)
}
}
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_StringEncoder_Quick(t *testing.T) {
quick.Check(func(values []string) bool {
expected := values
if values == nil {
expected = []string{}
}
// Write values to encoder.
enc := NewStringEncoder(1024)
for _, v := range values {
enc.Write(v)
}
// Retrieve encoded bytes from encoder.
buf, err := enc.Bytes()
if err != nil {
t.Fatal(err)
}
// Read values out of decoder.
got := make([]string, 0, len(values))
var dec StringDecoder
if err := dec.SetBytes(buf); err != nil {
t.Fatal(err)
}
for dec.Next() {
if err := dec.Error(); err != nil {
t.Fatal(err)
}
got = append(got, dec.Read())
}
// Verify that input and output values match.
if !reflect.DeepEqual(expected, got) {
t.Fatalf("mismatch:\n\nexp=%#v\n\ngot=%#v\n\n", expected, got)
}
return true
}, nil)
}
func Test_StringDecoder_Empty(t *testing.T) {
var dec StringDecoder
if err := dec.SetBytes([]byte{}); err != nil {
t.Fatal(err)
}
if dec.Next() {
t.Fatalf("exp Next() == false, got true")
}
}
func Test_StringDecoder_CorruptRead(t *testing.T) {
cases := []string{
"\x10\x03\b\x03Hi", // Higher length than actual data
"\x10\x1dp\x9c\x90\x90\x90\x90\x90\x90\x90\x90\x90length overflow----",
}
for _, c := range cases {
var dec StringDecoder
if err := dec.SetBytes([]byte(c)); err != nil {
t.Fatal(err)
}
if !dec.Next() {
t.Fatalf("exp Next() to return true, got false")
}
_ = dec.Read()
if dec.Error() == nil {
t.Fatalf("exp an err, got nil: %q", c)
}
}
}
func Test_StringDecoder_CorruptSetBytes(t *testing.T) {
cases := []string{
"0t\x00\x01\x000\x00\x01\x000\x00\x01\x000\x00\x01\x000\x00\x01" +
"\x000\x00\x01\x000\x00\x01\x000\x00\x00\x00\xff:\x01\x00\x01\x00\x01" +
"\x00\x01\x00\x01\x00\x01\x00\x010\x010\x000\x010\x010\x010\x01" +
"0\x010\x010\x010\x010\x010\x010\x010\x010\x010\x010", // Upper slice bounds overflows negative
}
for _, c := range cases {
var dec StringDecoder
if err := dec.SetBytes([]byte(c)); err == nil {
t.Fatalf("exp an err, got nil: %q", c)
}
}
}

View File

@@ -0,0 +1,414 @@
package tsm1
// Timestamp encoding is adaptive and based on structure of the timestamps that are encoded. It
// uses a combination of delta encoding, scaling and compression using simple8b, run length encoding
// as well as falling back to no compression if needed.
//
// Timestamp values to be encoded should be sorted before encoding. When encoded, the values are
// first delta-encoded. The first value is the starting timestamp, subsequent values are the difference
// from the prior value.
//
// Timestamp resolution can be as fine as a nanosecond. Many timestamps are monotonically increasing
// and fall on even boundaries of time such as every 10s. When the timestamps have this structure,
// they are scaled by the largest common divisor that is also a power of 10. This has the effect
// of converting very large integer deltas into very small ones that can be reversed by multiplying them
// by the scaling factor.
//
// Using these adjusted values, if all the deltas are the same, the time range is stored using run
// length encoding. If run length encoding is not possible and all values are less than 1 << 60 - 1
// (~36.5 yrs in nanosecond resolution), then the timestamps are encoded using simple8b encoding. If
// any value exceeds the maximum values, the deltas are stored uncompressed using 8b each.
//
// Each compressed byte slice has a 1 byte header indicating the compression type. The 4 high bits
// indicate the encoding type. The 4 low bits are used by the encoding type.
//
// For run-length encoding, the 4 low bits store the log10 of the scaling factor. The next 8 bytes are
// the starting timestamp, next 1-10 bytes is the delta value using variable-length encoding, finally the
// next 1-10 bytes is the count of values.
//
// For simple8b encoding, the 4 low bits store the log10 of the scaling factor. The next 8 bytes is the
// first delta value stored uncompressed, the remaining bytes are 64-bit words containing compressed delta
// values.
//
// For uncompressed encoding, the delta values are stored using 8 bytes each.
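//
// As a worked example, the four timestamps {0s, 10s, 20s, 30s} in nanoseconds
// delta-encode to {0, 1e10, 1e10, 1e10}. The largest power-of-10 divisor of
// the deltas is 1e10, scaling them to {0, 1, 1, 1}; since all deltas are
// equal, the block is run-length encoded as (first=0, delta=1e10, count=4),
// roughly a dozen bytes regardless of how many timestamps were written.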
import (
"encoding/binary"
"fmt"
"math"
"github.com/jwilder/encoding/simple8b"
)
const (
// timeUncompressed is an uncompressed format using 8 bytes per timestamp
timeUncompressed = 0
// timeCompressedPackedSimple is a bit-packed format using simple8b encoding
timeCompressedPackedSimple = 1
// timeCompressedRLE is a run-length encoding format
timeCompressedRLE = 2
)
// TimeEncoder encodes time.Time to byte slices.
type TimeEncoder interface {
Write(t int64)
Bytes() ([]byte, error)
Reset()
}
type encoder struct {
ts []uint64
bytes []byte
enc *simple8b.Encoder
}
// NewTimeEncoder returns a TimeEncoder with an initial buffer ready to hold sz bytes.
func NewTimeEncoder(sz int) TimeEncoder {
return &encoder{
ts: make([]uint64, 0, sz),
enc: simple8b.NewEncoder(),
}
}
// Reset sets the encoder back to its initial state.
func (e *encoder) Reset() {
e.ts = e.ts[:0]
e.bytes = e.bytes[:0]
e.enc.Reset()
}
// Write adds a timestamp to the compressed stream.
func (e *encoder) Write(t int64) {
e.ts = append(e.ts, uint64(t))
}
func (e *encoder) reduce() (max, divisor uint64, rle bool, deltas []uint64) {
// Compute the deltas in place to avoid allocating another slice
deltas = e.ts
// Starting values for a max and divisor
max, divisor = 0, 1e12
// Indicates whether the deltas can be run-length encoded
rle = true
// Iterate in reverse so we can apply deltas in place
for i := len(deltas) - 1; i > 0; i-- {
// First differential encode the values
deltas[i] = deltas[i] - deltas[i-1]
// We also need to keep track of the max value and largest common divisor
v := deltas[i]
if v > max {
max = v
}
// Reduce the divisor until it evenly divides this delta; it must divide every delta.
for divisor > 1 && v%divisor != 0 {
divisor /= 10
}
// Skip the first value, then check that each delta equals the previous one. The deltas can be RLE'd only if they are all equal.
rle = i == len(deltas)-1 || rle && (deltas[i+1] == deltas[i])
}
return
}
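// For instance, reduce on the times {100, 110, 120} rewrites the slice in
// place to {100, 10, 10}: max is 10, the divisor settles at 10, and rle is
// true because the two deltas are equal.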
// Bytes returns the encoded bytes of all written times.
func (e *encoder) Bytes() ([]byte, error) {
if len(e.ts) == 0 {
return e.bytes[:0], nil
}
// Maximum and largest common divisor. rle is true if dts (the delta timestamps),
// are all the same.
max, div, rle, dts := e.reduce()
// The deltas are all the same, so we can run-length encode them
if rle && len(e.ts) > 1 {
return e.encodeRLE(e.ts[0], e.ts[1], div, len(e.ts))
}
// We can't compress this time-range, the deltas exceed 1 << 60
if max > simple8b.MaxValue {
return e.encodeRaw()
}
return e.encodePacked(div, dts)
}
func (e *encoder) encodePacked(div uint64, dts []uint64) ([]byte, error) {
// Only apply the divisor if it's greater than 1 since division is expensive.
if div > 1 {
for _, v := range dts[1:] {
if err := e.enc.Write(v / div); err != nil {
return nil, err
}
}
} else {
for _, v := range dts[1:] {
if err := e.enc.Write(v); err != nil {
return nil, err
}
}
}
// The compressed deltas
deltas, err := e.enc.Bytes()
if err != nil {
return nil, err
}
sz := 8 + 1 + len(deltas)
if cap(e.bytes) < sz {
e.bytes = make([]byte, sz)
}
b := e.bytes[:sz]
// 4 high bits used for the encoding type
b[0] = byte(timeCompressedPackedSimple) << 4
// 4 low bits are the log10 divisor
b[0] |= byte(math.Log10(float64(div)))
// The first delta value
binary.BigEndian.PutUint64(b[1:9], uint64(dts[0]))
copy(b[9:], deltas)
return b[:9+len(deltas)], nil
}
func (e *encoder) encodeRaw() ([]byte, error) {
sz := 1 + len(e.ts)*8
if cap(e.bytes) < sz {
e.bytes = make([]byte, sz)
}
b := e.bytes[:sz]
b[0] = byte(timeUncompressed) << 4
for i, v := range e.ts {
binary.BigEndian.PutUint64(b[1+i*8:1+i*8+8], uint64(v))
}
return b, nil
}
func (e *encoder) encodeRLE(first, delta, div uint64, n int) ([]byte, error) {
// Large varints can take up to 10 bytes; we're encoding two of them plus an 8-byte timestamp and a 1-byte type.
sz := 31
if cap(e.bytes) < sz {
e.bytes = make([]byte, sz)
}
b := e.bytes[:sz]
// 4 high bits used for the encoding type
b[0] = byte(timeCompressedRLE) << 4
// 4 low bits are the log10 divisor
b[0] |= byte(math.Log10(float64(div)))
i := 1
// The first timestamp
binary.BigEndian.PutUint64(b[i:], uint64(first))
i += 8
// The first delta
i += binary.PutUvarint(b[i:], uint64(delta/div))
// The number of times the delta is repeated
i += binary.PutUvarint(b[i:], uint64(n))
return b[:i], nil
}
// TimeDecoder decodes a byte slice into timestamps.
type TimeDecoder struct {
v int64
i, n int
ts []uint64
dec simple8b.Decoder
err error
// The delta value for a run-length encoded byte slice
rleDelta int64
encoding byte
}
// Init initializes the decoder with bytes to read from.
func (d *TimeDecoder) Init(b []byte) {
d.v = 0
d.i = 0
d.ts = d.ts[:0]
d.err = nil
if len(b) > 0 {
// Encoding type is stored in the 4 high bits of the first byte
d.encoding = b[0] >> 4
}
d.decode(b)
}
// Next returns true if there are any timestamps remaining to be decoded.
func (d *TimeDecoder) Next() bool {
if d.err != nil {
return false
}
if d.encoding == timeCompressedRLE {
if d.i >= d.n {
return false
}
d.i++
d.v += d.rleDelta
return d.i < d.n
}
if d.i >= len(d.ts) {
return false
}
d.v = int64(d.ts[d.i])
d.i++
return true
}
// Read returns the next timestamp from the decoder.
func (d *TimeDecoder) Read() int64 {
return d.v
}
// Error returns the last error encountered by the decoder.
func (d *TimeDecoder) Error() error {
return d.err
}
func (d *TimeDecoder) decode(b []byte) {
if len(b) == 0 {
return
}
switch d.encoding {
case timeUncompressed:
d.decodeRaw(b[1:])
case timeCompressedRLE:
d.decodeRLE(b)
case timeCompressedPackedSimple:
d.decodePacked(b)
default:
d.err = fmt.Errorf("unknown encoding: %v", d.encoding)
}
}
func (d *TimeDecoder) decodePacked(b []byte) {
if len(b) < 9 {
d.err = fmt.Errorf("TimeDecoder: not enough data to decode packed timestamps")
return
}
div := uint64(math.Pow10(int(b[0] & 0xF)))
first := uint64(binary.BigEndian.Uint64(b[1:9]))
d.dec.SetBytes(b[9:])
d.i = 0
deltas := d.ts[:0]
deltas = append(deltas, first)
for d.dec.Next() {
deltas = append(deltas, d.dec.Read())
}
// Compute the prefix sum and scale the deltas back up
last := deltas[0]
if div > 1 {
for i := 1; i < len(deltas); i++ {
dgap := deltas[i] * div
deltas[i] = last + dgap
last = deltas[i]
}
} else {
for i := 1; i < len(deltas); i++ {
deltas[i] += last
last = deltas[i]
}
}
d.i = 0
d.ts = deltas
}
func (d *TimeDecoder) decodeRLE(b []byte) {
if len(b) < 9 {
d.err = fmt.Errorf("TimeDecoder: not enough data for initial RLE timestamp")
return
}
var i, n int
// Lower 4 bits hold the base-10 exponent so we can scale the values back up
mod := int64(math.Pow10(int(b[i] & 0xF)))
i++
// Next 8 bytes is the starting timestamp
first := binary.BigEndian.Uint64(b[i : i+8])
i += 8
// Next 1-10 bytes is our (scaled down by factor of 10) run length values
value, n := binary.Uvarint(b[i:])
if n <= 0 {
d.err = fmt.Errorf("TimeDecoder: invalid run length in decodeRLE")
return
}
// Scale the value back up
value *= uint64(mod)
i += n
// Last 1-10 bytes is how many times the value repeats
count, n := binary.Uvarint(b[i:])
if n <= 0 {
d.err = fmt.Errorf("TimeDecoder: invalid repeat value in decodeRLE")
return
}
d.v = int64(first - value)
d.rleDelta = int64(value)
d.i = -1
d.n = int(count)
}
func (d *TimeDecoder) decodeRaw(b []byte) {
d.i = 0
d.ts = make([]uint64, len(b)/8)
for i := range d.ts {
d.ts[i] = binary.BigEndian.Uint64(b[i*8 : i*8+8])
delta := d.ts[i]
// Compute the prefix sum and scale the deltas back up
if i > 0 {
d.ts[i] = d.ts[i-1] + delta
}
}
}
func CountTimestamps(b []byte) int {
if len(b) == 0 {
return 0
}
// Encoding type is stored in the 4 high bits of the first byte
encoding := b[0] >> 4
switch encoding {
case timeUncompressed:
// Uncompressed timestamps are just 8 bytes each
return len(b[1:]) / 8
case timeCompressedRLE:
// First 9 bytes are the starting timestamp and scaling factor, skip over them
i := 9
// Next 1-10 bytes is our (scaled down by factor of 10) run length values
_, n := binary.Uvarint(b[9:])
i += n
// Last 1-10 bytes is how many times the value repeats
count, _ := binary.Uvarint(b[i:])
return int(count)
case timeCompressedPackedSimple:
// First 9 bytes are the starting timestamp and scaling factor, skip over them
count, _ := simple8b.CountBytes(b[9:])
return count + 1 // +1 is for the first uncompressed timestamp, the starting timestamp in b[1:9]
default:
return 0
}
}

View File

@@ -0,0 +1,604 @@
package tsm1
import (
"reflect"
"testing"
"testing/quick"
"time"
)
func Test_TimeEncoder(t *testing.T) {
enc := NewTimeEncoder(1)
x := []int64{}
now := time.Unix(0, 0)
x = append(x, now.UnixNano())
enc.Write(now.UnixNano())
for i := 1; i < 4; i++ {
x = append(x, now.Add(time.Duration(i)*time.Second).UnixNano())
enc.Write(x[i])
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
for i, v := range x {
if !dec.Next() {
t.Fatalf("Next == false, expected true")
}
if v != dec.Read() {
t.Fatalf("Item %d mismatch, got %v, exp %v", i, dec.Read(), v)
}
}
}
func Test_TimeEncoder_NoValues(t *testing.T) {
enc := NewTimeEncoder(0)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec TimeDecoder
dec.Init(b)
if dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
}
func Test_TimeEncoder_One(t *testing.T) {
enc := NewTimeEncoder(1)
var tm int64
enc.Write(tm)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedPackedSimple {
t.Fatalf("Wrong encoding used: expected uncompressed, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if tm != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), tm)
}
}
func Test_TimeEncoder_Two(t *testing.T) {
enc := NewTimeEncoder(2)
t1 := int64(0)
t2 := int64(1)
enc.Write(t1)
enc.Write(t2)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t2)
}
}
func Test_TimeEncoder_Three(t *testing.T) {
enc := NewTimeEncoder(3)
t1 := int64(0)
t2 := int64(1)
t3 := int64(3)
enc.Write(t1)
enc.Write(t2)
enc.Write(t3)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedPackedSimple {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t2)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t3 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t3)
}
}
func Test_TimeEncoder_Large_Range(t *testing.T) {
enc := NewTimeEncoder(2)
t1 := int64(1442369134000000000)
t2 := int64(1442369135000000000)
enc.Write(t1)
enc.Write(t2)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t2)
}
}
func Test_TimeEncoder_Uncompressed(t *testing.T) {
enc := NewTimeEncoder(3)
t1 := time.Unix(0, 0).UnixNano()
t2 := time.Unix(1, 0).UnixNano()
// About 36.5yrs in ns resolution is the max range for the compressed format.
// This should cause the encoding to fall back to raw points.
t3 := time.Unix(2, (2 << 59)).UnixNano()
enc.Write(t1)
enc.Write(t2)
enc.Write(t3)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("expected error: %v", err)
}
if exp := 25; len(b) != exp {
t.Fatalf("length mismatch: got %v, exp %v", len(b), exp)
}
if got := b[0] >> 4; got != timeUncompressed {
t.Fatalf("Wrong encoding used: expected uncompressed, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t1 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t1)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t2 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t2)
}
if !dec.Next() {
t.Fatalf("unexpected next value: got true, exp false")
}
if t3 != dec.Read() {
t.Fatalf("read value mismatch: got %v, exp %v", dec.Read(), t3)
}
}
func Test_TimeEncoder_RLE(t *testing.T) {
enc := NewTimeEncoder(512)
var ts []int64
for i := 0; i < 500; i++ {
ts = append(ts, int64(i))
}
for _, v := range ts {
enc.Write(v)
}
b, err := enc.Bytes()
if exp := 12; len(b) != exp {
t.Fatalf("length mismatch: got %v, exp %v", len(b), exp)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected uncompressed, got %v", got)
}
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec TimeDecoder
dec.Init(b)
for i, v := range ts {
if !dec.Next() {
t.Fatalf("Next == false, expected true")
}
if v != dec.Read() {
t.Fatalf("Item %d mismatch, got %v, exp %v", i, dec.Read(), v)
}
}
if dec.Next() {
t.Fatalf("unexpected extra values")
}
}
func Test_TimeEncoder_Reverse(t *testing.T) {
enc := NewTimeEncoder(3)
ts := []int64{
int64(3),
int64(2),
int64(0),
}
for _, v := range ts {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeUncompressed {
t.Fatalf("Wrong encoding used: expected uncompressed, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
i := 0
for dec.Next() {
if ts[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), ts[i])
}
i++
}
}
func Test_TimeEncoder_220SecondDelta(t *testing.T) {
enc := NewTimeEncoder(256)
var ts []int64
now := time.Now()
for i := 0; i < 220; i++ {
ts = append(ts, now.Add(time.Duration(i*60)*time.Second).UnixNano())
}
for _, v := range ts {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
// Using RLE, should get 12 bytes
if exp := 12; len(b) != exp {
t.Fatalf("unexpected length: got %v, exp %v", len(b), exp)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected uncompressed, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
i := 0
for dec.Next() {
if ts[i] != dec.Read() {
t.Fatalf("read value %d mismatch: got %v, exp %v", i, dec.Read(), ts[i])
}
i++
}
if i != len(ts) {
t.Fatalf("Read too few values: exp %d, got %d", len(ts), i)
}
if dec.Next() {
t.Fatalf("expecte Next() = false, got true")
}
}
func Test_TimeEncoder_Quick(t *testing.T) {
quick.Check(func(values []int64) bool {
// Write values to encoder.
enc := NewTimeEncoder(1024)
exp := make([]int64, len(values))
for i, v := range values {
exp[i] = int64(v)
enc.Write(exp[i])
}
// Retrieve encoded bytes from encoder.
buf, err := enc.Bytes()
if err != nil {
t.Fatal(err)
}
// Read values out of decoder.
got := make([]int64, 0, len(values))
var dec TimeDecoder
dec.Init(buf)
for dec.Next() {
if err := dec.Error(); err != nil {
t.Fatal(err)
}
got = append(got, dec.Read())
}
// Verify that input and output values match.
if !reflect.DeepEqual(exp, got) {
t.Fatalf("mismatch:\n\nexp=%+v\n\ngot=%+v\n\n", exp, got)
}
return true
}, nil)
}
func Test_TimeEncoder_RLESeconds(t *testing.T) {
enc := NewTimeEncoder(6)
ts := make([]int64, 6)
ts[0] = int64(1444448158000000000)
ts[1] = int64(1444448168000000000)
ts[2] = int64(1444448178000000000)
ts[3] = int64(1444448188000000000)
ts[4] = int64(1444448198000000000)
ts[5] = int64(1444448208000000000)
for _, v := range ts {
enc.Write(v)
}
b, err := enc.Bytes()
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
var dec TimeDecoder
dec.Init(b)
for i, v := range ts {
if !dec.Next() {
t.Fatalf("Next == false, expected true")
}
if v != dec.Read() {
t.Fatalf("Item %d mismatch, got %v, exp %v", i, dec.Read(), v)
}
}
if dec.Next() {
t.Fatalf("unexpected extra values")
}
}
func TestTimeEncoder_Count_Uncompressed(t *testing.T) {
enc := NewTimeEncoder(2)
t1 := time.Unix(0, 0).UnixNano()
t2 := time.Unix(1, 0).UnixNano()
// About 36.5yrs in ns resolution is the max range for the compressed format.
// This should cause the encoding to fall back to raw points.
t3 := time.Unix(2, (2 << 59)).UnixNano()
enc.Write(t1)
enc.Write(t2)
enc.Write(t3)
b, err := enc.Bytes()
if got := b[0] >> 4; got != timeUncompressed {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got, exp := CountTimestamps(b), 3; got != exp {
t.Fatalf("count mismatch: got %v, exp %v", got, exp)
}
}
func TestTimeEncoder_Count_RLE(t *testing.T) {
enc := NewTimeEncoder(5)
ts := make([]int64, 6)
ts[0] = int64(1444448158000000000)
ts[1] = int64(1444448168000000000)
ts[2] = int64(1444448178000000000)
ts[3] = int64(1444448188000000000)
ts[4] = int64(1444448198000000000)
ts[5] = int64(1444448208000000000)
for _, v := range ts {
enc.Write(v)
}
b, err := enc.Bytes()
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got, exp := CountTimestamps(b), len(ts); got != exp {
t.Fatalf("count mismatch: got %v, exp %v", got, exp)
}
}
func TestTimeEncoder_Count_Simple8(t *testing.T) {
enc := NewTimeEncoder(3)
t1 := int64(0)
t2 := int64(1)
t3 := int64(3)
enc.Write(t1)
enc.Write(t2)
enc.Write(t3)
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedPackedSimple {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
if got, exp := CountTimestamps(b), 3; got != exp {
t.Fatalf("count mismatch: got %v, exp %v", got, exp)
}
}
func TestTimeDecoder_Corrupt(t *testing.T) {
cases := []string{
"", // Empty
"\x10\x14", // Packed: not enough data
"\x20\x00", // RLE: not enough data for starting timestamp
"\x2012345678\x90", // RLE: initial timestamp but invalid uvarint encoding
"\x2012345678\x7f", // RLE: timestamp, RLE but invalid repeat
"\x00123", // Raw: data length not multiple of 8
}
for _, c := range cases {
var dec TimeDecoder
dec.Init([]byte(c))
if dec.Next() {
t.Fatalf("exp next == false, got true")
}
}
}
func BenchmarkTimeEncoder(b *testing.B) {
enc := NewTimeEncoder(1024)
x := make([]int64, 1024)
for i := 0; i < len(x); i++ {
x[i] = time.Now().UnixNano()
enc.Write(x[i])
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
enc.Bytes()
enc.Reset()
for i := 0; i < len(x); i++ {
enc.Write(x[i])
}
}
}
func BenchmarkTimeDecoder_Packed(b *testing.B) {
x := make([]int64, 1024)
enc := NewTimeEncoder(1024)
for i := 0; i < len(x); i++ {
x[i] = time.Now().UnixNano()
enc.Write(x[i])
}
bytes, _ := enc.Bytes()
b.ResetTimer()
var dec TimeDecoder
for i := 0; i < b.N; i++ {
dec.Init(bytes)
for dec.Next() {
}
}
}
func BenchmarkTimeDecoder_RLE(b *testing.B) {
x := make([]int64, 1024)
enc := NewTimeEncoder(1024)
for i := 0; i < len(x); i++ {
x[i] = int64(i * 10)
enc.Write(x[i])
}
bytes, _ := enc.Bytes()
b.ResetTimer()
b.StopTimer()
var dec TimeDecoder
b.StartTimer()
for i := 0; i < b.N; i++ {
dec.Init(bytes)
for dec.Next() {
}
}
}
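// TestTimeEncoder_RoundTrip_Sketch is a compact round-trip sketch added for
// illustration (it reuses only the encoder/decoder API exercised above):
// evenly spaced timestamps select the RLE representation, and the chosen
// encoding is always recoverable from the high nibble of the first byte.
func TestTimeEncoder_RoundTrip_Sketch(t *testing.T) {
enc := NewTimeEncoder(3)
in := []int64{0, 10, 20} // equal deltas -> RLE
for _, v := range in {
enc.Write(v)
}
b, err := enc.Bytes()
if err != nil {
t.Fatalf("unexpected error: %v", err)
}
if got := b[0] >> 4; got != timeCompressedRLE {
t.Fatalf("Wrong encoding used: expected rle, got %v", got)
}
var dec TimeDecoder
dec.Init(b)
for i := 0; dec.Next(); i++ {
if got, exp := dec.Read(), in[i]; got != exp {
t.Fatalf("timestamp %d mismatch: got %v, exp %v", i, got, exp)
}
}
}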

View File

@@ -0,0 +1,342 @@
package tsm1
import (
"bufio"
"encoding/binary"
"io"
"io/ioutil"
"math"
"os"
"path/filepath"
"strings"
"sync"
)
const (
v2header = 0x1502
v2headerSize = 4
)
// Tombstoner records tombstones when entries are deleted.
type Tombstoner struct {
mu sync.RWMutex
// Path is the location of the file in which tombstones are recorded. This should be the
// full path to a TSM file.
Path string
// cache of the stats for this tombstone
fileStats []FileStat
// indicates that the stats may be out of sync with what is on disk and they
// should be refreshed.
statsLoaded bool
}
// Tombstone represents an individual deletion.
type Tombstone struct {
// Key is the tombstoned series key.
Key string
// Min and Max are the min and max unix nanosecond times of Key that are deleted. If
// the full range is deleted, Min is math.MinInt64 and Max is math.MaxInt64.
Min, Max int64
}
// Add adds all keys, across all timestamps, to the tombstone.
func (t *Tombstoner) Add(keys []string) error {
return t.AddRange(keys, math.MinInt64, math.MaxInt64)
}
// AddRange adds all keys to the tombstone, specifying that only the data between min and max is to be removed.
func (t *Tombstoner) AddRange(keys []string, min, max int64) error {
if len(keys) == 0 {
return nil
}
t.mu.Lock()
defer t.mu.Unlock()
// If this TSMFile has not been written (mainly in tests), don't write a
// tombstone because the keys will not be written when it's actually saved.
if t.Path == "" {
return nil
}
t.statsLoaded = false
tombstones, err := t.readTombstone()
if err != nil {
return err
}
if cap(tombstones) < len(tombstones)+len(keys) {
ts := make([]Tombstone, len(tombstones), len(tombstones)+len(keys))
copy(ts, tombstones)
tombstones = ts
}
for _, k := range keys {
tombstones = append(tombstones, Tombstone{
Key: k,
Min: min,
Max: max,
})
}
return t.writeTombstone(tombstones)
}
// ReadAll returns all the tombstones recorded in the Tombstoner's tombstone file.
func (t *Tombstoner) ReadAll() ([]Tombstone, error) {
return t.readTombstone()
}
// Delete removes all the tombstone files from disk.
func (t *Tombstoner) Delete() error {
t.mu.Lock()
defer t.mu.Unlock()
if err := os.RemoveAll(t.tombstonePath()); err != nil {
return err
}
t.statsLoaded = false
return nil
}
// HasTombstones returns true if there are any tombstone entries recorded.
func (t *Tombstoner) HasTombstones() bool {
files := t.TombstoneFiles()
return len(files) > 0 && files[0].Size > 0
}
// TombstoneFiles returns any tombstone files associated with Tombstoner's TSM file.
func (t *Tombstoner) TombstoneFiles() []FileStat {
t.mu.RLock()
if t.statsLoaded {
stats := t.fileStats
t.mu.RUnlock()
return stats
}
t.mu.RUnlock()
stat, err := os.Stat(t.tombstonePath())
if err != nil {
t.mu.Lock()
// The file doesn't exist so record that we tried to load it so
// we don't continue to keep trying. This is the common case.
t.statsLoaded = os.IsNotExist(err)
t.fileStats = t.fileStats[:0]
t.mu.Unlock()
return nil
}
t.mu.Lock()
t.fileStats = append(t.fileStats[:0], FileStat{
Path: t.tombstonePath(),
LastModified: stat.ModTime().UnixNano(),
Size: uint32(stat.Size()),
})
t.statsLoaded = true
stats := t.fileStats
t.mu.Unlock()
return stats
}
// Walk calls fn for every Tombstone under the Tombstoner.
func (t *Tombstoner) Walk(fn func(t Tombstone) error) error {
f, err := os.Open(t.tombstonePath())
if os.IsNotExist(err) {
return nil
} else if err != nil {
return err
}
defer f.Close()
var b [4]byte
if _, err := f.Read(b[:]); err != nil {
// Might be a zero length file which should not exist, but
// an old bug allowed them to occur. Treat it as an empty
// v1 tombstone file so we don't abort loading the TSM file.
return t.readTombstoneV1(f, fn)
}
if _, err := f.Seek(0, io.SeekStart); err != nil {
return err
}
if binary.BigEndian.Uint32(b[:]) == v2header {
return t.readTombstoneV2(f, fn)
}
return t.readTombstoneV1(f, fn)
}
func (t *Tombstoner) writeTombstone(tombstones []Tombstone) error {
tmp, err := ioutil.TempFile(filepath.Dir(t.Path), "tombstone")
if err != nil {
return err
}
defer tmp.Close()
var b [8]byte
bw := bufio.NewWriterSize(tmp, 1024*1024)
binary.BigEndian.PutUint32(b[:4], v2header)
if _, err := bw.Write(b[:4]); err != nil {
return err
}
for _, t := range tombstones {
binary.BigEndian.PutUint32(b[:4], uint32(len(t.Key)))
if _, err := bw.Write(b[:4]); err != nil {
return err
}
if _, err := bw.WriteString(t.Key); err != nil {
return err
}
binary.BigEndian.PutUint64(b[:], uint64(t.Min))
if _, err := bw.Write(b[:]); err != nil {
return err
}
binary.BigEndian.PutUint64(b[:], uint64(t.Max))
if _, err := bw.Write(b[:]); err != nil {
return err
}
}
if err := bw.Flush(); err != nil {
return err
}
// fsync the file to flush the write
if err := tmp.Sync(); err != nil {
return err
}
tmpFilename := tmp.Name()
tmp.Close()
if err := renameFile(tmpFilename, t.tombstonePath()); err != nil {
return err
}
return syncDir(filepath.Dir(t.tombstonePath()))
}
func (t *Tombstoner) readTombstone() ([]Tombstone, error) {
var tombstones []Tombstone
if err := t.Walk(func(t Tombstone) error {
tombstones = append(tombstones, t)
return nil
}); err != nil {
return nil, err
}
return tombstones, nil
}
// readTombstoneV1 reads the first version of tombstone files that were not
// capable of storing a min and max time for a key. This is used for backwards
// compatibility with versions prior to 0.13. This format is a simple newline
// separated text file.
func (t *Tombstoner) readTombstoneV1(f *os.File, fn func(t Tombstone) error) error {
r := bufio.NewScanner(f)
for r.Scan() {
line := r.Text()
if line == "" {
continue
}
if err := fn(Tombstone{
Key: line,
Min: math.MinInt64,
Max: math.MaxInt64,
}); err != nil {
return err
}
}
return r.Err()
}
// readTombstoneV2 reads the second version of tombstone files that are capable
// of storing keys and the range of time for the key that points were deleted. This
// format is binary.
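// On-disk layout of each record that follows the 4-byte 0x1502 header
// (all integers big endian; derived from writeTombstone above):
//
// ┌─────────┬─────────┬─────────┬─────────┐
// │ Key Len │   Key   │   Min   │   Max   │
// │ 4 bytes │ N bytes │ 8 bytes │ 8 bytes │
// └─────────┴─────────┴─────────┴─────────┘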
func (t *Tombstoner) readTombstoneV2(f *os.File, fn func(t Tombstone) error) error {
// Skip header, already checked earlier
if _, err := f.Seek(v2headerSize, io.SeekStart); err != nil {
return err
}
n := int64(v2headerSize)
fi, err := f.Stat()
if err != nil {
return err
}
size := fi.Size()
var (
min, max int64
key string
)
b := make([]byte, 4096)
for {
if n >= size {
return nil
}
if _, err = f.Read(b[:4]); err != nil {
return err
}
n += 4
keyLen := int(binary.BigEndian.Uint32(b[:4]))
if keyLen > len(b) {
b = make([]byte, keyLen)
}
if _, err := f.Read(b[:keyLen]); err != nil {
return err
}
key = string(b[:keyLen])
n += int64(keyLen)
if _, err := f.Read(b[:8]); err != nil {
return err
}
n += 8
min = int64(binary.BigEndian.Uint64(b[:8]))
if _, err := f.Read(b[:8]); err != nil {
return err
}
n += 8
max = int64(binary.BigEndian.Uint64(b[:8]))
if err := fn(Tombstone{
Key: key,
Min: min,
Max: max,
}); err != nil {
return err
}
}
}
func (t *Tombstoner) tombstonePath() string {
if strings.HasSuffix(t.Path, "tombstone") {
return t.Path
}
// Filename is 0000001.tsm1
filename := filepath.Base(t.Path)
// Strip off the tsm1
ext := filepath.Ext(filename)
if ext != "" {
filename = strings.TrimSuffix(filename, ext)
}
// Append the "tombstone" suffix to create a 0000001.tombstone file
return filepath.Join(filepath.Dir(t.Path), filename+".tombstone")
}
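// A short usage sketch (an editor's illustration; the path is hypothetical
// and error handling is elided):
//
// ts := &Tombstoner{Path: "000001.tsm"}
// _ = ts.AddRange([]string{"cpu,host=A#!~#value"}, 0, 100)
// _ = ts.Walk(func(tb Tombstone) error {
// // tb.Key has points deleted in [tb.Min, tb.Max]
// return nil
// })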

View File

@@ -0,0 +1,236 @@
package tsm1_test
import (
"io/ioutil"
"os"
"testing"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func TestTombstoner_Add(t *testing.T) {
dir := MustTempDir()
defer func() { os.RemoveAll(dir) }()
f := MustTempFile(dir)
ts := &tsm1.Tombstoner{Path: f.Name()}
entries, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 0; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
stats := ts.TombstoneFiles()
if got, exp := len(stats), 0; got != exp {
t.Fatalf("stat length mismatch: got %v, exp %v", got, exp)
}
ts.Add([]string{"foo"})
entries, err = ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
stats = ts.TombstoneFiles()
if got, exp := len(stats), 1; got != exp {
t.Fatalf("stat length mismatch: got %v, exp %v", got, exp)
}
if stats[0].Size == 0 {
t.Fatalf("got size %v, exp > 0", stats[0].Size)
}
if stats[0].LastModified == 0 {
t.Fatalf("got lastModified %v, exp > 0", stats[0].LastModified)
}
if stats[0].Path == "" {
t.Fatalf("got path %v, exp != ''", stats[0].Path)
}
if got, exp := len(entries), 1; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
if got, exp := entries[0].Key, "foo"; got != exp {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
// Use a new Tombstoner to verify values are persisted
ts = &tsm1.Tombstoner{Path: f.Name()}
entries, err = ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 1; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
if got, exp := entries[0].Key, "foo"; got != exp {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
}
func TestTombstoner_Add_Empty(t *testing.T) {
dir := MustTempDir()
defer func() { os.RemoveAll(dir) }()
f := MustTempFile(dir)
ts := &tsm1.Tombstoner{Path: f.Name()}
entries, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 0; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
ts.Add([]string{})
// Use a new Tombstoner to verify values are persisted
ts = &tsm1.Tombstoner{Path: f.Name()}
entries, err = ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 0; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
stats := ts.TombstoneFiles()
if got, exp := len(stats), 0; got != exp {
t.Fatalf("stat length mismatch: got %v, exp %v", got, exp)
}
}
func TestTombstoner_Delete(t *testing.T) {
dir := MustTempDir()
defer func() { os.RemoveAll(dir) }()
f := MustTempFile(dir)
ts := &tsm1.Tombstoner{Path: f.Name()}
ts.Add([]string{"foo"})
// Use a new Tombstoner to verify values are persisted
ts = &tsm1.Tombstoner{Path: f.Name()}
entries, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 1; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
if got, exp := entries[0].Key, "foo"; got != exp {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
if err := ts.Delete(); err != nil {
fatal(t, "delete tombstone", err)
}
stats := ts.TombstoneFiles()
if got, exp := len(stats), 0; got != exp {
t.Fatalf("stat length mismatch: got %v, exp %v", got, exp)
}
ts = &tsm1.Tombstoner{Path: f.Name()}
entries, err = ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 0; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
}
func TestTombstoner_ReadV1(t *testing.T) {
dir := MustTempDir()
defer func() { os.RemoveAll(dir) }()
f := MustTempFile(dir)
if err := ioutil.WriteFile(f.Name(), []byte("foo\n"), 0x0600); err != nil {
t.Fatalf("write v1 file: %v", err)
}
f.Close()
if err := os.Rename(f.Name(), f.Name()+".tombstone"); err != nil {
t.Fatalf("rename tombstone failed: %v", err)
}
ts := &tsm1.Tombstoner{Path: f.Name()}
_, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
entries, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 1; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
if got, exp := entries[0].Key, "foo"; got != exp {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
// Use a new Tombstoner to verify values are persisted
ts = &tsm1.Tombstoner{Path: f.Name()}
entries, err = ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 1; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
if got, exp := entries[0].Key, "foo"; got != exp {
t.Fatalf("value mismatch: got %v, exp %v", got, exp)
}
}
func TestTombstoner_ReadEmptyV1(t *testing.T) {
dir := MustTempDir()
defer func() { os.RemoveAll(dir) }()
f := MustTempFile(dir)
f.Close()
if err := os.Rename(f.Name(), f.Name()+".tombstone"); err != nil {
t.Fatalf("rename tombstone failed: %v", err)
}
ts := &tsm1.Tombstoner{Path: f.Name()}
_, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
entries, err := ts.ReadAll()
if err != nil {
fatal(t, "ReadAll", err)
}
if got, exp := len(entries), 0; got != exp {
t.Fatalf("length mismatch: got %v, exp %v", got, exp)
}
}

File diff suppressed because it is too large

View File

@@ -0,0 +1,768 @@
package tsm1_test
import (
"fmt"
"io"
"os"
"testing"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
"github.com/golang/snappy"
)
func TestWALWriter_WriteMulti_Single(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
p1 := tsm1.NewValue(1, 1.1)
p2 := tsm1.NewValue(1, int64(1))
p3 := tsm1.NewValue(1, true)
p4 := tsm1.NewValue(1, "string")
values := map[string][]tsm1.Value{
"cpu,host=A#!~#float": []tsm1.Value{p1},
"cpu,host=A#!~#int": []tsm1.Value{p2},
"cpu,host=A#!~#bool": []tsm1.Value{p3},
"cpu,host=A#!~#string": []tsm1.Value{p4},
}
entry := &tsm1.WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.WriteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
for k, v := range e.Values {
for i, vv := range v {
if got, exp := vv.String(), values[k][i].String(); got != exp {
t.Fatalf("points mismatch: got %v, exp %v", got, exp)
}
}
}
if n := r.Count(); n != MustReadFileSize(f) {
t.Fatalf("wrong count of bytes read, got %d, exp %d", n, MustReadFileSize(f))
}
}
func TestWALWriter_WriteMulti_LargeBatch(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
var points []tsm1.Value
for i := 0; i < 100000; i++ {
points = append(points, tsm1.NewValue(int64(i), int64(1)))
}
values := map[string][]tsm1.Value{
"cpu,host=A,server=01,foo=bar,tag=really-long#!~#float": points,
"mem,host=A,server=01,foo=bar,tag=really-long#!~#float": points,
}
entry := &tsm1.WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.WriteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
for k, v := range e.Values {
for i, vv := range v {
if got, exp := vv.String(), values[k][i].String(); got != exp {
t.Fatalf("points mismatch: got %v, exp %v", got, exp)
}
}
}
if n := r.Count(); n != MustReadFileSize(f) {
t.Fatalf("wrong count of bytes read, got %d, exp %d", n, MustReadFileSize(f))
}
}
func TestWALWriter_WriteMulti_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
p1 := tsm1.NewValue(1, int64(1))
p2 := tsm1.NewValue(1, int64(2))
exp := []struct {
key string
values []tsm1.Value
}{
{"cpu,host=A#!~#value", []tsm1.Value{p1}},
{"cpu,host=B#!~#value", []tsm1.Value{p2}},
}
for _, v := range exp {
entry := &tsm1.WriteWALEntry{
Values: map[string][]tsm1.Value{v.key: v.values},
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
}
// Seek back to the beginning of the file for reading
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
for _, ep := range exp {
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.WriteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
for k, v := range e.Values {
if got, exp := k, ep.key; got != exp {
t.Fatalf("key mismatch. got %v, exp %v", got, exp)
}
if got, exp := len(v), len(ep.values); got != exp {
t.Fatalf("values length mismatch: got %v, exp %v", got, exp)
}
for i, vv := range v {
if got, exp := vv.String(), ep.values[i].String(); got != exp {
t.Fatalf("points mismatch: got %v, exp %v", got, exp)
}
}
}
}
if n := r.Count(); n != MustReadFileSize(f) {
t.Fatalf("wrong count of bytes read, got %d, exp %d", n, MustReadFileSize(f))
}
}
func TestWALWriter_WriteDelete_Single(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
entry := &tsm1.DeleteWALEntry{
Keys: []string{"cpu"},
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.DeleteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
if got, exp := len(e.Keys), len(entry.Keys); got != exp {
t.Fatalf("key length mismatch: got %v, exp %v", got, exp)
}
if got, exp := e.Keys[0], entry.Keys[0]; got != exp {
t.Fatalf("key mismatch: got %v, exp %v", got, exp)
}
}
func TestWALWriter_WriteMultiDelete_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
p1 := tsm1.NewValue(1, true)
values := map[string][]tsm1.Value{
"cpu,host=A#!~#value": []tsm1.Value{p1},
}
writeEntry := &tsm1.WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(writeEntry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
// Write the delete entry
deleteEntry := &tsm1.DeleteWALEntry{
Keys: []string{"cpu,host=A#!~value"},
}
if err := w.Write(mustMarshalEntry(deleteEntry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
// Seek back to the beginning of the file for reading
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
// Read the write points first
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.WriteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
for k, v := range e.Values {
if got, exp := len(v), len(values[k]); got != exp {
t.Fatalf("values length mismatch: got %v, exp %v", got, exp)
}
for i, vv := range v {
if got, exp := vv.String(), values[k][i].String(); got != exp {
t.Fatalf("points mismatch: got %v, exp %v", got, exp)
}
}
}
// Read the delete second
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err = r.Read()
if err != nil {
fatal(t, "read entry", err)
}
de, ok := we.(*tsm1.DeleteWALEntry)
if !ok {
t.Fatalf("expected DeleteWALEntry: got %#v", e)
}
if got, exp := len(de.Keys), len(deleteEntry.Keys); got != exp {
t.Fatalf("key length mismatch: got %v, exp %v", got, exp)
}
if got, exp := de.Keys[0], deleteEntry.Keys[0]; got != exp {
t.Fatalf("key mismatch: got %v, exp %v", got, exp)
}
}
func TestWALWriter_WriteMultiDeleteRange_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
p1 := tsm1.NewValue(1, 1.0)
p2 := tsm1.NewValue(2, 2.0)
p3 := tsm1.NewValue(3, 3.0)
values := map[string][]tsm1.Value{
"cpu,host=A#!~#value": []tsm1.Value{p1, p2, p3},
}
writeEntry := &tsm1.WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(writeEntry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
// Write the delete entry
deleteEntry := &tsm1.DeleteRangeWALEntry{
Keys: []string{"cpu,host=A#!~value"},
Min: 2,
Max: 3,
}
if err := w.Write(mustMarshalEntry(deleteEntry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
// Seek back to the beginning of the file for reading
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
// Read the write points first
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err := r.Read()
if err != nil {
fatal(t, "read entry", err)
}
e, ok := we.(*tsm1.WriteWALEntry)
if !ok {
t.Fatalf("expected WriteWALEntry: got %#v", e)
}
for k, v := range e.Values {
if got, exp := len(v), len(values[k]); got != exp {
t.Fatalf("values length mismatch: got %v, exp %v", got, exp)
}
for i, vv := range v {
if got, exp := vv.String(), values[k][i].String(); got != exp {
t.Fatalf("points mismatch: got %v, exp %v", got, exp)
}
}
}
// Read the delete second
if !r.Next() {
t.Fatalf("expected next, got false")
}
we, err = r.Read()
if err != nil {
fatal(t, "read entry", err)
}
de, ok := we.(*tsm1.DeleteRangeWALEntry)
if !ok {
t.Fatalf("expected DeleteWALEntry: got %#v", e)
}
if got, exp := len(de.Keys), len(deleteEntry.Keys); got != exp {
t.Fatalf("key length mismatch: got %v, exp %v", got, exp)
}
if got, exp := de.Keys[0], deleteEntry.Keys[0]; got != exp {
t.Fatalf("key mismatch: got %v, exp %v", got, exp)
}
if got, exp := de.Min, int64(2); got != exp {
t.Fatalf("min time mismatch: got %v, exp %v", got, exp)
}
if got, exp := de.Max, int64(3); got != exp {
t.Fatalf("min time mismatch: got %v, exp %v", got, exp)
}
}
func TestWAL_ClosedSegments(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
w := tsm1.NewWAL(dir)
if err := w.Open(); err != nil {
t.Fatalf("error opening WAL: %v", err)
}
files, err := w.ClosedSegments()
if err != nil {
t.Fatalf("error getting closed segments: %v", err)
}
if got, exp := len(files), 0; got != exp {
t.Fatalf("close segment length mismatch: got %v, exp %v", got, exp)
}
if _, err := w.WriteMulti(map[string][]tsm1.Value{
"cpu,host=A#!~#value": []tsm1.Value{
tsm1.NewValue(1, 1.1),
},
}); err != nil {
t.Fatalf("error writing points: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("error closing wal: %v", err)
}
// Re-open the WAL
w = tsm1.NewWAL(dir)
defer w.Close()
if err := w.Open(); err != nil {
t.Fatalf("error opening WAL: %v", err)
}
files, err = w.ClosedSegments()
if err != nil {
t.Fatalf("error getting closed segments: %v", err)
}
if got, exp := len(files), 1; got != exp {
t.Fatalf("close segment length mismatch: got %v, exp %v", got, exp)
}
}
func TestWAL_Delete(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
w := tsm1.NewWAL(dir)
if err := w.Open(); err != nil {
t.Fatalf("error opening WAL: %v", err)
}
files, err := w.ClosedSegments()
if err != nil {
t.Fatalf("error getting closed segments: %v", err)
}
if got, exp := len(files), 0; got != exp {
t.Fatalf("close segment length mismatch: got %v, exp %v", got, exp)
}
if _, err := w.Delete([]string{"cpu"}); err != nil {
t.Fatalf("error writing points: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("error closing wal: %v", err)
}
// Re-open the WAL
w = tsm1.NewWAL(dir)
defer w.Close()
if err := w.Open(); err != nil {
t.Fatalf("error opening WAL: %v", err)
}
files, err = w.ClosedSegments()
if err != nil {
t.Fatalf("error getting closed segments: %v", err)
}
if got, exp := len(files), 1; got != exp {
t.Fatalf("close segment length mismatch: got %v, exp %v", got, exp)
}
}
func TestWALWriter_Corrupt(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
corruption := []byte{1, 4, 0, 0, 0}
p1 := tsm1.NewValue(1, 1.1)
values := map[string][]tsm1.Value{
"cpu,host=A#!~#float": []tsm1.Value{p1},
}
entry := &tsm1.WriteWALEntry{
Values: values,
}
if err := w.Write(mustMarshalEntry(entry)); err != nil {
fatal(t, "write points", err)
}
if err := w.Flush(); err != nil {
fatal(t, "flush", err)
}
// Write some random bytes to the file to simulate corruption.
if _, err := f.Write(corruption); err != nil {
fatal(t, "corrupt WAL segment", err)
}
// Create the WAL segment reader.
if _, err := f.Seek(0, io.SeekStart); err != nil {
fatal(t, "seek", err)
}
r := tsm1.NewWALSegmentReader(f)
// Try to decode two entries.
if !r.Next() {
t.Fatalf("expected next, got false")
}
if _, err := r.Read(); err != nil {
fatal(t, "read entry", err)
}
if !r.Next() {
t.Fatalf("expected next, got false")
}
if _, err := r.Read(); err == nil {
t.Fatalf("expected read of corrupt entry to return an error, got nil")
}
// Count should only return size of valid data.
expCount := MustReadFileSize(f) - int64(len(corruption))
if n := r.Count(); n != expCount {
t.Fatalf("wrong count of bytes read, got %d, exp %d", n, expCount)
}
}
func TestWriteWALSegment_UnmarshalBinary_WriteWALCorrupt(t *testing.T) {
p1 := tsm1.NewValue(1, 1.1)
p2 := tsm1.NewValue(1, int64(1))
p3 := tsm1.NewValue(1, true)
p4 := tsm1.NewValue(1, "string")
values := map[string][]tsm1.Value{
"cpu,host=A#!~#float": []tsm1.Value{p1, p1},
"cpu,host=A#!~#int": []tsm1.Value{p2, p2},
"cpu,host=A#!~#bool": []tsm1.Value{p3, p3},
"cpu,host=A#!~#string": []tsm1.Value{p4, p4},
}
w := &tsm1.WriteWALEntry{
Values: values,
}
b, err := w.MarshalBinary()
if err != nil {
t.Fatalf("unexpected error, got %v", err)
}
// Test every possible truncation of a write WAL entry
for i := 0; i < len(b); i++ {
// re-allocate to ensure capacity would be exceeded if slicing
truncated := make([]byte, i)
copy(truncated, b[:i])
err := w.UnmarshalBinary(truncated)
if err != nil && err != tsm1.ErrWALCorrupt {
t.Fatalf("unexpected error: %v", err)
}
}
}
func TestWriteWALSegment_UnmarshalBinary_DeleteWALCorrupt(t *testing.T) {
w := &tsm1.DeleteWALEntry{
Keys: []string{"foo", "bar"},
}
b, err := w.MarshalBinary()
if err != nil {
t.Fatalf("unexpected error, got %v", err)
}
// Test every possible truncation of a delete WAL entry
for i := 0; i < len(b); i++ {
// re-allocate to ensure capacity would be exceeded if slicing
truncated := make([]byte, i)
copy(truncated, b[:i])
err := w.UnmarshalBinary(truncated)
if err != nil && err != tsm1.ErrWALCorrupt {
t.Fatalf("unexpected error: %v", err)
}
}
}
func TestWriteWALSegment_UnmarshalBinary_DeleteRangeWALCorrupt(t *testing.T) {
w := &tsm1.DeleteRangeWALEntry{
Keys: []string{"foo", "bar"},
Min: 1,
Max: 2,
}
b, err := w.MarshalBinary()
if err != nil {
t.Fatalf("unexpected error, got %v", err)
}
// Test every possible truncation of a delete range WAL entry
for i := 0; i < len(b); i++ {
// re-allocate to ensure capacity would be exceeded if slicing
truncated := make([]byte, i)
copy(truncated, b[:i])
err := w.UnmarshalBinary(truncated)
if err != nil && err != tsm1.ErrWALCorrupt {
t.Fatalf("unexpected error: %v", err)
}
}
}
func BenchmarkWALSegmentWriter(b *testing.B) {
points := map[string][]tsm1.Value{}
for i := 0; i < 5000; i++ {
k := "cpu,host=A#!~#value"
points[k] = append(points[k], tsm1.NewValue(int64(i), 1.1))
}
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
write := &tsm1.WriteWALEntry{
Values: points,
}
b.ResetTimer()
for i := 0; i < b.N; i++ {
if err := w.Write(mustMarshalEntry(write)); err != nil {
b.Fatalf("unexpected error writing entry: %v", err)
}
}
}
func BenchmarkWALSegmentReader(b *testing.B) {
points := map[string][]tsm1.Value{}
for i := 0; i < 5000; i++ {
k := "cpu,host=A#!~#value"
points[k] = append(points[k], tsm1.NewValue(int64(i), 1.1))
}
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w := tsm1.NewWALSegmentWriter(f)
write := &tsm1.WriteWALEntry{
Values: points,
}
for i := 0; i < 100; i++ {
if err := w.Write(mustMarshalEntry(write)); err != nil {
b.Fatalf("unexpected error writing entry: %v", err)
}
}
r := tsm1.NewWALSegmentReader(f)
b.ResetTimer()
for i := 0; i < b.N; i++ {
b.StopTimer()
f.Seek(0, io.SeekStart)
b.StartTimer()
for r.Next() {
_, err := r.Read()
if err != nil {
b.Fatalf("unexpected error reading entry: %v", err)
}
}
}
}
// MustReadFileSize returns the size of the file, or panics.
func MustReadFileSize(f *os.File) int64 {
stat, err := os.Stat(f.Name())
if err != nil {
panic(fmt.Sprintf("failed to get size of file at %s: %s", f.Name(), err.Error()))
}
return stat.Size()
}
func mustMarshalEntry(entry tsm1.WALEntry) (tsm1.WalEntryType, []byte) {
bytes := make([]byte, 1024<<2)
b, err := entry.Encode(bytes)
if err != nil {
panic(fmt.Sprintf("error encoding: %v", err))
}
return entry.Type(), snappy.Encode(b, b)
}
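// mustMarshalEntry mirrors the framing the segment writer expects: an entry
// type plus a snappy-compressed encoding of the entry. A sketch of reversing
// it for some compressed bytes (an editor's illustration; error handling
// elided, and it assumes the same snappy framing):
//
// body, _ := snappy.Decode(nil, compressed)
// entry := &tsm1.WriteWALEntry{}
// _ = entry.UnmarshalBinary(body)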

View File

@@ -0,0 +1,632 @@
package tsm1
/*
A TSM file is composed for four sections: header, blocks, index and the footer.
┌────────┬────────────────────────────────────┬─────────────┬──────────────┐
│ Header │ Blocks │ Index │ Footer │
│5 bytes │ N bytes │ N bytes │ 4 bytes │
└────────┴────────────────────────────────────┴─────────────┴──────────────┘
Header is composed of a magic number to identify the file type and a version
number.
┌───────────────────┐
│ Header │
├─────────┬─────────┤
│ Magic │ Version │
│ 4 bytes │ 1 byte │
└─────────┴─────────┘
Blocks are sequences of pairs of CRC32 and data. The block data is opaque to the
file. The CRC32 is used for block level error detection. The length of the blocks
is stored in the index.
┌───────────────────────────────────────────────────────────┐
│ Blocks │
├───────────────────┬───────────────────┬───────────────────┤
│ Block 1 │ Block 2 │ Block N │
├─────────┬─────────┼─────────┬─────────┼─────────┬─────────┤
│ CRC │ Data │ CRC │ Data │ CRC │ Data │
│ 4 bytes │ N bytes │ 4 bytes │ N bytes │ 4 bytes │ N bytes │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘
Following the blocks is the index for the blocks in the file. The index is
composed of a sequence of index entries ordered lexicographically by key and
then by time. Each index entry starts with a key length and key followed by a
count of the number of blocks in the file. Each block entry is composed of
the min and max time for the block, the offset into the file where the block
is located and the size of the block.
The index structure can provide efficient access to all blocks as well as the
ability to determine the cost associated with accessing a given key. Given a key
and timestamp, we can determine whether a file contains the block for that
timestamp as well as where that block resides and how much data to read to
retrieve the block. If we know we need to read all or multiple blocks in a
file, we can use the size to determine how much to read in a given IO.
┌────────────────────────────────────────────────────────────────────────────┐
│ Index │
├─────────┬─────────┬──────┬───────┬─────────┬─────────┬────────┬────────┬───┤
│ Key Len │ Key │ Type │ Count │Min Time │Max Time │ Offset │ Size │...│
│ 2 bytes │ N bytes │1 byte│2 bytes│ 8 bytes │ 8 bytes │8 bytes │4 bytes │ │
└─────────┴─────────┴──────┴───────┴─────────┴─────────┴────────┴────────┴───┘
The last section is the footer that stores the offset of the start of the index.
┌─────────┐
│ Footer │
├─────────┤
│Index Ofs│
│ 8 bytes │
└─────────┘
*/
import (
"bufio"
"bytes"
"encoding/binary"
"fmt"
"hash/crc32"
"io"
"os"
"sort"
"sync"
"time"
)
const (
// MagicNumber is written as the first 4 bytes of a data file to
// identify the file as a tsm1 formatted file
MagicNumber uint32 = 0x16D116D1
// Version indicates the version of the TSM file format.
Version byte = 1
// Size in bytes of an index entry
indexEntrySize = 28
// Size in bytes used to store the count of index entries for a key
indexCountSize = 2
// Size in bytes used to store the type of block encoded
indexTypeSize = 1
// Max number of blocks for a given key that can exist in a single file (65535)
maxIndexEntries = (1 << (indexCountSize * 8)) - 1
// Max length of a key in an index entry (measurement + tags), i.e. 65535 bytes
maxKeyLength = (1 << (2 * 8)) - 1
)
var (
// ErrNoValues is returned when TSMWriter.WriteIndex is called and there are no values to write.
ErrNoValues = fmt.Errorf("no values written")
// ErrTSMClosed is returned when performing an operation against a closed TSM file.
ErrTSMClosed = fmt.Errorf("tsm file closed")
// ErrMaxKeyLengthExceeded is returned when attempting to write a key that is too long.
ErrMaxKeyLengthExceeded = fmt.Errorf("max key length exceeded")
// ErrMaxBlocksExceeded is returned when attempting to write a block past the allowed number.
ErrMaxBlocksExceeded = fmt.Errorf("max blocks exceeded")
)
// TSMWriter writes TSM formatted key and values.
type TSMWriter interface {
// Write writes a new block for key containing the values. Writes append
// blocks in the order that the Write function is called. The caller is
// responsible for ensuring keys and blocks are sorted appropriately.
// Values are encoded as a full block. The caller is responsible for
// ensuring a fixed number of values are encoded in each block as well as
// ensuring the Values are sorted. The first and last timestamp values are
// used as the minimum and maximum values for the index entry.
Write(key string, values Values) error
// WriteBlock writes a new block for key containing the bytes in block. WriteBlock appends
// blocks in the order that the WriteBlock function is called. The caller is
// responsible for ensuring keys and blocks are sorted appropriately, and that the
// block and index information is correct for the block. The minTime and maxTime
// timestamp values are used as the minimum and maximum values for the index entry.
WriteBlock(key string, minTime, maxTime int64, block []byte) error
// WriteIndex finishes the TSM write streams and writes the index.
WriteIndex() error
// Flush flushes all pending changes to the underlying file resources.
Flush() error
// Close closes any underlying file resources.
Close() error
// Size returns the current size in bytes of the file.
Size() uint32
}
// IndexWriter writes a TSMIndex.
type IndexWriter interface {
// Add records a new block entry for a key in the index.
Add(key string, blockType byte, minTime, maxTime int64, offset int64, size uint32)
// Entries returns all index entries for a key.
Entries(key string) []IndexEntry
// Keys returns the unique set of keys in the index.
Keys() []string
// KeyCount returns the count of unique keys in the index.
KeyCount() int
// Size returns the size of the current index in bytes.
Size() uint32
// MarshalBinary returns a byte slice encoded version of the index.
MarshalBinary() ([]byte, error)
// WriteTo writes the index contents to a writer.
WriteTo(w io.Writer) (int64, error)
}
// IndexEntry is the index information for a given block in a TSM file.
type IndexEntry struct {
// The min and max time of all points stored in the block.
MinTime, MaxTime int64
// The absolute position in the file where this block is located.
Offset int64
// The size in bytes of the block in the file.
Size uint32
}
// UnmarshalBinary decodes an IndexEntry from a byte slice.
func (e *IndexEntry) UnmarshalBinary(b []byte) error {
if len(b) != indexEntrySize {
return fmt.Errorf("unmarshalBinary: short buf: %v != %v", indexEntrySize, len(b))
}
e.MinTime = int64(binary.BigEndian.Uint64(b[:8]))
e.MaxTime = int64(binary.BigEndian.Uint64(b[8:16]))
e.Offset = int64(binary.BigEndian.Uint64(b[16:24]))
e.Size = binary.BigEndian.Uint32(b[24:28])
return nil
}
// AppendTo writes a binary-encoded version of IndexEntry to b, allocating
// and returning a new slice, if necessary.
func (e *IndexEntry) AppendTo(b []byte) []byte {
if len(b) < indexEntrySize {
if cap(b) < indexEntrySize {
b = make([]byte, indexEntrySize)
} else {
b = b[:indexEntrySize]
}
}
binary.BigEndian.PutUint64(b[:8], uint64(e.MinTime))
binary.BigEndian.PutUint64(b[8:16], uint64(e.MaxTime))
binary.BigEndian.PutUint64(b[16:24], uint64(e.Offset))
binary.BigEndian.PutUint32(b[24:28], uint32(e.Size))
return b
}
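// A round trip through AppendTo and UnmarshalBinary is a fixed 28-byte
// (indexEntrySize) record; a brief sketch:
//
// b := e.AppendTo(nil) // len(b) == indexEntrySize
// var out IndexEntry
// _ = out.UnmarshalBinary(b) // out now equals e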
// Contains returns true if this IndexEntry may contain values for the given time.
// The min and max times are inclusive.
func (e *IndexEntry) Contains(t int64) bool {
return e.MinTime <= t && e.MaxTime >= t
}
// OverlapsTimeRange returns true if the entry's time range overlaps the given min and max times.
func (e *IndexEntry) OverlapsTimeRange(min, max int64) bool {
return e.MinTime <= max && e.MaxTime >= min
}
// String returns a string representation of the entry.
func (e *IndexEntry) String() string {
return fmt.Sprintf("min=%s max=%s ofs=%d siz=%d",
time.Unix(0, e.MinTime).UTC(), time.Unix(0, e.MaxTime).UTC(), e.Offset, e.Size)
}
// NewIndexWriter returns a new IndexWriter.
func NewIndexWriter() IndexWriter {
return &directIndex{
blocks: map[string]*indexEntries{},
}
}
// directIndex is a simple in-memory index implementation for a TSM file. The full index
// must fit in memory.
type directIndex struct {
mu sync.RWMutex
size uint32
blocks map[string]*indexEntries
}
func (d *directIndex) Add(key string, blockType byte, minTime, maxTime int64, offset int64, size uint32) {
d.mu.Lock()
defer d.mu.Unlock()
entries := d.blocks[key]
if entries == nil {
entries = &indexEntries{
Type: blockType,
}
d.blocks[key] = entries
// size of the key stored in the index
d.size += uint32(2 + len(key))
// size of the count of entries stored in the index
d.size += indexCountSize
}
entries.entries = append(entries.entries, IndexEntry{
MinTime: minTime,
MaxTime: maxTime,
Offset: offset,
Size: size,
})
// size of the encoded index entry
d.size += indexEntrySize
}
func (d *directIndex) entries(key string) []IndexEntry {
entries := d.blocks[key]
if entries == nil {
return nil
}
return entries.entries
}
func (d *directIndex) Entries(key string) []IndexEntry {
d.mu.RLock()
defer d.mu.RUnlock()
return d.entries(key)
}
func (d *directIndex) Entry(key string, t int64) *IndexEntry {
d.mu.RLock()
defer d.mu.RUnlock()
entries := d.entries(key)
for _, entry := range entries {
if entry.Contains(t) {
return &entry
}
}
return nil
}
func (d *directIndex) Keys() []string {
d.mu.RLock()
defer d.mu.RUnlock()
var keys []string
for k := range d.blocks {
keys = append(keys, k)
}
sort.Strings(keys)
return keys
}
func (d *directIndex) KeyCount() int {
d.mu.RLock()
n := len(d.blocks)
d.mu.RUnlock()
return n
}
func (d *directIndex) addEntries(key string, entries *indexEntries) {
existing := d.blocks[key]
if existing == nil {
d.blocks[key] = entries
return
}
existing.entries = append(existing.entries, entries.entries...)
}
func (d *directIndex) WriteTo(w io.Writer) (int64, error) {
d.mu.RLock()
defer d.mu.RUnlock()
// Index blocks are written sorted by key
keys := make([]string, 0, len(d.blocks))
for k := range d.blocks {
keys = append(keys, k)
}
sort.Strings(keys)
var (
n int
err error
buf [5]byte
N int64
)
// For each key, individual entries are sorted by time
for _, key := range keys {
entries := d.blocks[key]
if entries.Len() > maxIndexEntries {
return N, fmt.Errorf("key '%s' exceeds max index entries: %d > %d", key, entries.Len(), maxIndexEntries)
}
sort.Sort(entries)
binary.BigEndian.PutUint16(buf[0:2], uint16(len(key)))
buf[2] = entries.Type
binary.BigEndian.PutUint16(buf[3:5], uint16(entries.Len()))
// Append the key length and key
if n, err = w.Write(buf[0:2]); err != nil {
return int64(n) + N, fmt.Errorf("write: writer key length error: %v", err)
}
N += int64(n)
if n, err = io.WriteString(w, key); err != nil {
return int64(n) + N, fmt.Errorf("write: writer key error: %v", err)
}
N += int64(n)
// Append the block type and count
if n, err = w.Write(buf[2:5]); err != nil {
return int64(n) + N, fmt.Errorf("write: writer block type and count error: %v", err)
}
N += int64(n)
// Append each index entry for all blocks for this key
var n64 int64
if n64, err = entries.WriteTo(w); err != nil {
return n64 + N, fmt.Errorf("write: writer entries error: %v", err)
}
N += n64
}
return N, nil
}
func (d *directIndex) MarshalBinary() ([]byte, error) {
var b bytes.Buffer
if _, err := d.WriteTo(&b); err != nil {
return nil, err
}
return b.Bytes(), nil
}
func (d *directIndex) UnmarshalBinary(b []byte) error {
d.mu.Lock()
defer d.mu.Unlock()
d.size = uint32(len(b))
var pos int
for pos < len(b) {
n, key, err := readKey(b[pos:])
if err != nil {
return fmt.Errorf("readIndex: read key error: %v", err)
}
pos += n
var entries indexEntries
n, err = readEntries(b[pos:], &entries)
if err != nil {
return fmt.Errorf("readIndex: read entries error: %v", err)
}
pos += n
d.addEntries(string(key), &entries)
}
return nil
}
func (d *directIndex) Size() uint32 {
return d.size
}
// tsmWriter writes keys and values in the TSM format
type tsmWriter struct {
wrapped io.Writer
w *bufio.Writer
index IndexWriter
n int64
}
// NewTSMWriter returns a new TSMWriter writing to w.
func NewTSMWriter(w io.Writer) (TSMWriter, error) {
index := &directIndex{
blocks: map[string]*indexEntries{},
}
return &tsmWriter{wrapped: w, w: bufio.NewWriterSize(w, 1024*1024), index: index}, nil
}
func (t *tsmWriter) writeHeader() error {
var buf [5]byte
binary.BigEndian.PutUint32(buf[0:4], MagicNumber)
buf[4] = Version
n, err := t.w.Write(buf[:])
if err != nil {
return err
}
t.n = int64(n)
return nil
}
// Write writes a new block containing key and values.
func (t *tsmWriter) Write(key string, values Values) error {
if len(key) > maxKeyLength {
return ErrMaxKeyLengthExceeded
}
// Nothing to write
if len(values) == 0 {
return nil
}
// Write header only after we have some data to write.
if t.n == 0 {
if err := t.writeHeader(); err != nil {
return err
}
}
block, err := values.Encode(nil)
if err != nil {
return err
}
blockType, err := BlockType(block)
if err != nil {
return err
}
var checksum [crc32.Size]byte
binary.BigEndian.PutUint32(checksum[:], crc32.ChecksumIEEE(block))
_, err = t.w.Write(checksum[:])
if err != nil {
return err
}
n, err := t.w.Write(block)
if err != nil {
return err
}
n += len(checksum)
// Record this block in index
t.index.Add(key, blockType, values[0].UnixNano(), values[len(values)-1].UnixNano(), t.n, uint32(n))
// Increment file position pointer
t.n += int64(n)
return nil
}
// WriteBlock writes block for the given key and time range to the TSM file. If the write
// exceeds max entries for a given key, ErrMaxBlocksExceeded is returned. This indicates
// that the index is now full for this key and no future writes to this key will succeed.
func (t *tsmWriter) WriteBlock(key string, minTime, maxTime int64, block []byte) error {
if len(key) > maxKeyLength {
return ErrMaxKeyLengthExceeded
}
// Nothing to write
if len(block) == 0 {
return nil
}
blockType, err := BlockType(block)
if err != nil {
return err
}
// Write header only after we have some data to write.
if t.n == 0 {
if err := t.writeHeader(); err != nil {
return err
}
}
var checksum [crc32.Size]byte
binary.BigEndian.PutUint32(checksum[:], crc32.ChecksumIEEE(block))
_, err = t.w.Write(checksum[:])
if err != nil {
return err
}
n, err := t.w.Write(block)
if err != nil {
return err
}
n += len(checksum)
// Record this block in index
t.index.Add(key, blockType, minTime, maxTime, t.n, uint32(n))
// Increment file position pointer (checksum + block len)
t.n += int64(n)
if len(t.index.Entries(key)) >= maxIndexEntries {
return ErrMaxBlocksExceeded
}
return nil
}
// WriteIndex writes the index section of the file. If there are no index entries to write,
// this returns ErrNoValues.
func (t *tsmWriter) WriteIndex() error {
indexPos := t.n
if t.index.KeyCount() == 0 {
return ErrNoValues
}
// Write the index
if _, err := t.index.WriteTo(t.w); err != nil {
return err
}
var buf [8]byte
binary.BigEndian.PutUint64(buf[:], uint64(indexPos))
// Write the position of the index as the footer
_, err := t.w.Write(buf[:])
return err
}
func (t *tsmWriter) Flush() error {
if err := t.w.Flush(); err != nil {
return err
}
if f, ok := t.wrapped.(*os.File); ok {
if err := f.Sync(); err != nil {
return err
}
}
return nil
}
func (t *tsmWriter) Close() error {
if err := t.Flush(); err != nil {
return err
}
if c, ok := t.wrapped.(io.Closer); ok {
return c.Close()
}
return nil
}
func (t *tsmWriter) Size() uint32 {
return uint32(t.n) + t.index.Size()
}
// verifyVersion verifies that the reader's bytes are a TSM byte
// stream of the correct version (1)
func verifyVersion(r io.ReadSeeker) error {
_, err := r.Seek(0, io.SeekStart)
if err != nil {
return fmt.Errorf("init: failed to seek: %v", err)
}
var b [4]byte
_, err = io.ReadFull(r, b[:])
if err != nil {
return fmt.Errorf("init: error reading magic number of file: %v", err)
}
if binary.BigEndian.Uint32(b[:]) != MagicNumber {
return fmt.Errorf("can only read from tsm file")
}
_, err = io.ReadFull(r, b[:1])
if err != nil {
return fmt.Errorf("init: error reading version: %v", err)
}
if b[0] != Version {
return fmt.Errorf("init: file is version %b. expected %b", b[0], Version)
}
return nil
}
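// readIndexOffset is an illustrative sketch (an editor's addition, not part of
// the original file) of consuming the footer described in the comment at the
// top of this file: the final 8 bytes of a TSM file hold the big-endian offset
// of the start of the index, exactly as WriteIndex records it above.
func readIndexOffset(r io.ReadSeeker) (int64, error) {
// Seek to the footer: the last 8 bytes of the file.
if _, err := r.Seek(-8, io.SeekEnd); err != nil {
return 0, err
}
var buf [8]byte
if _, err := io.ReadFull(r, buf[:]); err != nil {
return 0, err
}
return int64(binary.BigEndian.Uint64(buf[:])), nil
}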

View File

@@ -0,0 +1,653 @@
package tsm1_test
import (
"bytes"
"encoding/binary"
"io"
"io/ioutil"
"os"
"testing"
"github.com/influxdata/influxdb/tsdb/engine/tsm1"
)
func TestTSMWriter_Write_Empty(t *testing.T) {
var b bytes.Buffer
w, err := tsm1.NewTSMWriter(&b)
if err != nil {
t.Fatalf("unexpected error created writer: %v", err)
}
if err := w.WriteIndex(); err != tsm1.ErrNoValues {
t.Fatalf("unexpected error closing: %v", err)
}
if got, exp := len(b.Bytes()), 0; got != exp {
t.Fatalf("file size mismatch: got %v, exp %v", got, exp)
}
}
func TestTSMWriter_Write_NoValues(t *testing.T) {
var b bytes.Buffer
w, err := tsm1.NewTSMWriter(&b)
if err != nil {
t.Fatalf("unexpected error created writer: %v", err)
}
if err := w.Write("foo", []tsm1.Value{}); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
if err := w.WriteIndex(); err != tsm1.ErrNoValues {
t.Fatalf("unexpected error closing: %v", err)
}
if got, exp := len(b.Bytes()), 0; got != exp {
t.Fatalf("file size mismatch: got %v, exp %v", got, exp)
}
}
func TestTSMWriter_Write_Single(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
values := []tsm1.Value{tsm1.NewValue(0, 1.0)}
if err := w.Write("cpu", values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error writing index: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
b, err := ioutil.ReadAll(fd)
if err != nil {
t.Fatalf("unexpected error reading: %v", err)
}
if got, exp := len(b), 5; got < exp {
t.Fatalf("file size mismatch: got %v, exp %v", got, exp)
}
if got := binary.BigEndian.Uint32(b[0:4]); got != tsm1.MagicNumber {
t.Fatalf("magic number mismatch: got %v, exp %v", got, tsm1.MagicNumber)
}
if _, err := fd.Seek(0, io.SeekStart); err != nil {
t.Fatalf("unexpected error seeking: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
readValues, err := r.ReadAll("cpu")
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if len(readValues) != len(values) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), len(values))
}
for i, v := range values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
func TestTSMWriter_Write_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"cpu", []tsm1.Value{tsm1.NewValue(0, 1.0)}},
{"mem", []tsm1.Value{tsm1.NewValue(1, 2.0)}},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
for _, d := range data {
readValues, err := r.ReadAll(d.key)
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(d.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range d.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
}
func TestTSMWriter_Write_MultipleKeyValues(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"cpu", []tsm1.Value{
tsm1.NewValue(0, 1.0),
tsm1.NewValue(1, 2.0)},
},
{"mem", []tsm1.Value{
tsm1.NewValue(0, 1.5),
tsm1.NewValue(1, 2.5)},
},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
for _, d := range data {
readValues, err := r.ReadAll(d.key)
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(d.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range d.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
}
// Tests that writing keys in reverse is able to read them back.
func TestTSMWriter_Write_ReverseKeys(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"mem", []tsm1.Value{
tsm1.NewValue(0, 1.5),
tsm1.NewValue(1, 2.5)},
},
{"cpu", []tsm1.Value{
tsm1.NewValue(0, 1.0),
tsm1.NewValue(1, 2.0)},
},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
for _, d := range data {
readValues, err := r.ReadAll(d.key)
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(d.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range d.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
}
// Tests that writing keys in reverse is able to read them back.
func TestTSMWriter_Write_SameKey(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"cpu", []tsm1.Value{
tsm1.NewValue(0, 1.0),
tsm1.NewValue(1, 2.0)},
},
{"cpu", []tsm1.Value{
tsm1.NewValue(2, 3.0),
tsm1.NewValue(3, 4.0)},
},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
values := append(data[0].values, data[1].values...)
readValues, err := r.ReadAll("cpu")
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
// Tests that calling Read returns all the values for block matching the key
// and timestamp
func TestTSMWriter_Read_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"cpu", []tsm1.Value{
tsm1.NewValue(0, 1.0),
tsm1.NewValue(1, 2.0)},
},
{"cpu", []tsm1.Value{
tsm1.NewValue(2, 3.0),
tsm1.NewValue(3, 4.0)},
},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
for _, values := range data {
// Try the first timestamp
readValues, err := r.Read("cpu", values.values[0].UnixNano())
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(values.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range values.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
// Try the last timestamp too
readValues, err = r.Read("cpu", values.values[1].UnixNano())
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(values.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range values.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
}
func TestTSMWriter_WriteBlock_Empty(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
if err := w.WriteBlock("cpu", 0, 0, nil); err != nil {
t.Fatalf("unexpected error writing block: %v", err)
}
if err := w.WriteIndex(); err != tsm1.ErrNoValues {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
defer fd.Close()
b, err := ioutil.ReadAll(fd)
if err != nil {
t.Fatalf("unexpected error read all: %v", err)
}
if got, exp := len(b), 0; got != exp {
t.Fatalf("file size mismatch: got %v, exp %v", got, exp)
}
}
func TestTSMWriter_WriteBlock_Multiple(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var data = []struct {
key string
values []tsm1.Value
}{
{"cpu", []tsm1.Value{tsm1.NewValue(0, 1.0)}},
{"mem", []tsm1.Value{tsm1.NewValue(1, 2.0)}},
}
for _, d := range data {
if err := w.Write(d.key, d.values); err != nil {
t.Fatalf("unexpected error writing: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err := os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
defer fd.Close()
b, err := ioutil.ReadAll(fd)
if err != nil {
t.Fatalf("unexpected error read all: %v", err)
}
if got, exp := len(b), 5; got < exp {
t.Fatalf("file size mismatch: got %v, exp %v", got, exp)
}
if got := binary.BigEndian.Uint32(b[0:4]); got != tsm1.MagicNumber {
t.Fatalf("magic number mismatch: got %v, exp %v", got, tsm1.MagicNumber)
}
if _, err := fd.Seek(0, io.SeekStart); err != nil {
t.Fatalf("error seeking: %v", err)
}
// Create reader for that file
r, err := tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
f = MustTempFile(dir)
w, err = tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
iter := r.BlockIterator()
for iter.Next() {
key, minTime, maxTime, _, _, b, err := iter.Read()
if err != nil {
t.Fatalf("unexpected error reading block: %v", err)
}
if err := w.WriteBlock(key, minTime, maxTime, b); err != nil {
t.Fatalf("unexpected error writing block: %v", err)
}
}
if err := w.WriteIndex(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
if err := w.Close(); err != nil {
t.Fatalf("unexpected error closing: %v", err)
}
fd, err = os.Open(f.Name())
if err != nil {
t.Fatalf("unexpected error open file: %v", err)
}
// Now create a reader to verify the written blocks matches the originally
// written file using Write
r, err = tsm1.NewTSMReader(fd)
if err != nil {
t.Fatalf("unexpected error created reader: %v", err)
}
defer r.Close()
for _, d := range data {
readValues, err := r.ReadAll(d.key)
if err != nil {
t.Fatalf("unexpected error readin: %v", err)
}
if exp := len(d.values); exp != len(readValues) {
t.Fatalf("read values length mismatch: got %v, exp %v", len(readValues), exp)
}
for i, v := range d.values {
if v.Value() != readValues[i].Value() {
t.Fatalf("read value mismatch(%d): got %v, exp %d", i, readValues[i].Value(), v.Value())
}
}
}
}
func TestTSMWriter_WriteBlock_MaxKey(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error creating writer: %v", err)
}
var key string
for i := 0; i < 100000; i++ {
key += "a"
}
if err := w.WriteBlock(key, 0, 0, nil); err != tsm1.ErrMaxKeyLengthExceeded {
t.Fatalf("expected max key length error writing key: %v", err)
}
}
func TestTSMWriter_Write_MaxKey(t *testing.T) {
dir := MustTempDir()
defer os.RemoveAll(dir)
f := MustTempFile(dir)
defer f.Close()
w, err := tsm1.NewTSMWriter(f)
if err != nil {
t.Fatalf("unexpected error created writer: %v", err)
}
var key string
for i := 0; i < 100000; i++ {
key += "a"
}
if err := w.Write(key, []tsm1.Value{tsm1.NewValue(0, 1.0)}); err != tsm1.ErrMaxKeyLengthExceeded {
t.Fatalf("expected max key length error writing key: %v", err)
}
}