Aggregating Data From InfluxDB

This became a need of mine after I implemented new metric gathering from TrueNAS, which I detailed here. But it is also applicable to anyone who collects a lot of data, wants to keep it long term, and wants to keep disk usage to a minimum.

The overall process takes an existing set of measurement fields, averages them over an interval you define, and writes each averaged data point to a new bucket in InfluxDB, a technique InfluxDB's documentation calls downsampling.

This utilizes a feature of InfluxDB I had not used before called "Tasks".

option task = {name: "task_name", every: 1h}

// Pull the last hour of data and keep only the fields of interest
data =
    from(bucket: "source_bucket")
        |> range(start: -task.every)
        |> filter(fn: (r) => r["_measurement"] == "zpool_stats")
        |> filter(fn: (r) => r["_field"] == "alloc" or r["_field"] == "free" or r["_field"] == "size")

// Average each field down to a single point and write it to the new bucket
data
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> to(bucket: "destination_bucket", org: "organization")

This is an example of my task that runs every hour, reads the last hour of data from my source bucket, and keeps only the fields "alloc", "free", and "size" from the measurement "zpool_stats".

The limitation with this feature is that for each measurement you want to capture, you need to make a separate task. As far as I'm aware at the time of this writing, there is no built-in way to aggregate an entire bucket or every measurement in a bucket. For most applications, though, your dashboards do not use that many measurements in total, so creating a task for each measurement you use will not take much effort. Note that you can use as many fields from each measurement as you want; they just must be specified in the task.
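For example, a second task for another measurement would look almost identical, just with its own filters. The measurement and field names below ("disk_temp" and "temperature") are hypothetical placeholders, not values from my setup:

option task = {name: "downsample_disk_temp", every: 1h}

from(bucket: "source_bucket")
    |> range(start: -task.every)
    // Hypothetical measurement and field names, for illustration only
    |> filter(fn: (r) => r["_measurement"] == "disk_temp")
    |> filter(fn: (r) => r["_field"] == "temperature")
    |> aggregateWindow(every: 1h, fn: mean, createEmpty: false)
    |> to(bucket: "destination_bucket", org: "organization")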

One quick tip: you can build the query you want in InfluxDB's Data Explorer first, which makes it easier to select all of the fields you want to use, and then copy the generated Flux into the task.

After the task is set up and running, you can simply reference the new bucket in Grafana or whatever other graphing service you use. It will have the same measurements and fields that you specified, just with a single data point every hour.
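As a rough sketch, a panel query against the new bucket might look like the following; v.timeRangeStart and v.timeRangeStop are the time range variables that InfluxDB dashboards and Grafana's Flux data source inject:

from(bucket: "destination_bucket")
    // Dashboard-injected time range variables
    |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
    |> filter(fn: (r) => r["_measurement"] == "zpool_stats")
    |> filter(fn: (r) => r["_field"] == "alloc")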

Additionally, I wanted a single metric displayed that contained the data from both buckets. Graphing each bucket's query separately would normally duplicate the data in the graph, so I used a union to join the two different queries together:

bucket1 = <query against the high-granularity source bucket>

bucket2 = <query against the downsampled destination bucket>

union(tables: [bucket1, bucket2])
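Filled in, that might look something like this sketch, reusing the filters from the earlier examples; again, v.timeRangeStart and v.timeRangeStop are the dashboard's injected time range variables:

// High-granularity data from the short-retention source bucket
bucket1 =
    from(bucket: "source_bucket")
        |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
        |> filter(fn: (r) => r["_measurement"] == "zpool_stats" and r["_field"] == "alloc")

// Hourly averages from the long-retention destination bucket
bucket2 =
    from(bucket: "destination_bucket")
        |> range(start: v.timeRangeStart, stop: v.timeRangeStop)
        |> filter(fn: (r) => r["_measurement"] == "zpool_stats" and r["_field"] == "alloc")

// Merge both result sets into one stream for the panel
union(tables: [bucket1, bucket2])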

By doing this, it seems I am able to retain the high-granularity data from my source bucket until its retention period ends, and the lower-granularity bucket picks up where that one leaves off with its longer retention period.

Hopefully this has been informative!