"The Time field of a certain database table is time, and the time interval is sometimes greater t .."

jinxing RaqForum 41 No.
39 View • 1 Months ago

From SQL to SPL: Statistics by time window

The Time field of a certain database table is time, and the time interval is sometimes greater than 1 minute.

Time	Value
10:10:00	3
10:11:00	4
10:13:00	5
10:13:00	9
10:13:00	8
10:14:00	2

Now we need to divide the data into windows every minute, fill in the missing windows, and calculate the four values for each window, which are: start_value, the last item of the previous window; end_value, the last item in this window; min, the minimum value of this window; max, the maximum value of this window. The start_value in the first minute is based on the first record in this window; If data for a certain window is missing, replace it with the last item of the previous window (same as the start_value of this window).

start	end	start_value	end_value	min	max
10:10:00	10:11:00	3	3	3	3
10:11:00	10:12:00	3	4	4	4
10:12:00	10:13:00	4	4	4	4
10:13:00	10:14:00	4	8	5	9
10:14:00	10:15:00	8	2	2	2

SQL：

with overview as (
    SELECT 
        distinct on (a.time) a.id, a.time, b.time as "end", a.value, 
        date_trunc('minute', a.time) as minute_start, 
        date_trunc('minute', b.time) as minute_end 
    FROM 
        main a 
    left join 
        main b 
    on 
        a."time"<b."time" and a.id = b.id 
    order by 
        a.time, b.time asc
    ),
overview2 as (
    select 
        id, value, true as backfill,
        date_trunc('minute', "end") as time, 
        date_trunc('minute', "end") as minute
    from 
        overview 
    where 
        minute_start <> minute_end
    UNION ALL
    select 
        id, time, value, false as backfill,
        date_trunc('minute', time) as minute
    from 
        overview
    ),  
overview3 as (
    select 
        * 
    from 
        overview2 
    UNION ALL (
        Select 
            distinct on (a.missingminute) 
            c.id, 
            a.missingminute as time, 
            a.missingminute as minute, 
            c.value, 
            true as backfill 
        from (
            SELECT 
                date_trunc('minute', time.time) as missingminute
            FROM 
                generate_series((select min(minute) from overview2),(select max(minute) from overview2),'1 minute'::interval) time 
            left join (
                select distinct 
                    minute 
                from 
                    overview2
                ) b 
            on 
                date_trunc('minute', time) = b.minute 
            where 
                b.minute isnull
            ) a 
        left join 
            main c 
        on 
            a.missingminute > c.time 
        order by 
            a.missingminute, 
            c.time desc
        ) 
    order by 
        time
    )
select 
    t1.id, 
    t1.minute as minute_start, 
    t1.minute + interval '1 minute' as minute_end, 
    t1.backfill as start_backfill,
    t1.start, 
    t2.end, 
    coalesce(t3.min, t1.start) as min, 
    coalesce(t3.max, t1.start) as max 
from 
    (select distinct on (id, minute) id, minute, value as start, backfill from overview3 order by id, minute, time asc) t1 
left join
    (select distinct on (id, minute) id, minute, value as end from overview3 order by id, minute, time desc) t2 on t1.id = t2.id and t1.minute = t2.minute 
left join
    (select id, minute, min(value) min, max(value) max from overview2 group by id,minute) t3 on t1.id = t3.id and t1.minute = t3.minute

SQL requires multiple layers of nested subqueries and complex join statements to implement, and the code is lengthy and difficult to understand. SPL directly provides time series functions, sequence aligned functions, and position related syntax:

https://try.esproc.com/splx?55o

	A
1	$select * from main.txt
2	=A1.run(Time=time@m(Time))
3	=list=periods@s(A2.min(Time),A2.max(Time),60)
4	=A2.align@a(list,Time)
5	=A4.new(list(#):start, elapse@s(start,60):end, sv=ifn(end_value[-1],~.Value):start_value, ifn(~.m(-1).Value, sv):end_value, ifn(~.min(Value),sv):min, ifn(~.max(Value),sv):max)

A1： Load data.

A2： Change the Time field to full minutes.

A3： Generate a continuous minute time series list.

A4： Align the data with the list, with each group being the data of a window and some windows being empty.

A5： Generate a new two-dimensional table, use the original records from each group to generate a new record, start takes values from the list according to the sequence number, start_value is taken from the end_value of the previous new record, and when this value is null (in the first minute), it is taken from the first record of the current group. end_value is taken from the last record of the previous group, min is taken from the minimum value of this group, and max is taken from the maximum value of this group. If the last three items are missing, use start_value(sv) instead. [-1] represents the relative previous one, and m(-1) represents the last member in the set.

Question source：https://stackoverflow.com/questions/78102526/postgressql-backfill-data-if-doesnt-exist-for-a-start-period

SPL Official Website 👉 https://www.esproc.com

SPL Feedback and Help 👉 https://www.reddit.com/r/esProcSPL

SPL Learning Material 👉 https://c.esproc.com

SPL Source Code and Package 👉 https://github.com/SPLWare/esProc

Discord 👉 https://discord.gg/sxd59A8F2W

Youtube 👉 https://www.youtube.com/@esProc_SPL

Application