Difference Storage

Posted on February 1, 2025 by Status — Leave a comment

A green line of original data, and a yellow line of the differences

Say you want to get the differences of an array of numbers, but also want to be able to restore it back to its original form.

For that, you just need to keep the starting value, and then all the differences.

[7, 8, 9, 10, 11, 12] -> [7, 1, 1, 1, 1, 1]

But why would you want to do that? It turns out for certain kinds of data, this method can store it very compactly (when used in conjunction with certain compression methods). For instance, say you have 14 days of atmospheric pressure data sampled once a second (1209600 samples), stored as floats (np.float32; would be 4.6 MB uncompressed)

Original:
  raw :	   4838400 (   4.614 MB)
  bz2 :	   3212459 (   3.064 MB)	0.464s
  gzip:	   1656078 (   1.579 MB)	0.354s
  lzma:	    956216 ( 933.805 kB)	1.106s
  zlib:	   1656066 (   1.579 MB)	0.326s

Difference:
  raw :	   4838400 (   4.614 MB)
  bz2 :	    632568 ( 617.742 kB)	0.225s
  gzip:	    879766 ( 859.146 kB)	4.625s
  lzma:	    645404 ( 630.277 kB)	3.750s
  zlib:	    962619 ( 940.058 kB)	0.288s

So by taking the differences and compressing those with bz2, you can get it packed down pretty far. But to get this point, one first needs to write both the difference and restoring functions:

import numpy as np

def sdif_v1(arr):
    "like np.diff but keeps the starting value"
    return np.concatenate(([arr[0]], np.diff(arr)))

def ssum_v1(arr):
    "opposite of sdif"
    return np.cumsum(arr)

There, done and done!
But wait, what’s that you have in your data?

>>> pressure
array([1001.19 , 1001.18 ,      nan, 1001.179, 1001.178, 1001.182],
      dtype=float32)

Oh no, that’s a NaN! that means restoring the data is impossible!

>>> ssum_v1(sdif_v1(pressure))
array([1001.19, 1001.18,     nan,     nan,     nan,     nan],
      dtype=float32)

Because the restoring function ssum_v1 does a cumulative sum, and adding anything to a ‘not a number’ value gives a nan, all data is lost after just one of these in your dataset!

Okay, we can fix that. Just need to break up the array into sub parts where the nan is and keep the starting value and differences between each nan.

def sdif_v2(arr):
    nidx = np.argwhere(np.isnan(arr))  # check for nan's and get their indices
    if np.any(nidx): # there's a nan in the array, handle it
        tmp = arr; parts = list(); a = 0
        for i in nidx.flatten():  # go through list of indices
            parts.extend(([tmp[a]], np.diff(tmp[a: i]), [tmp[i]]))
            # store starting value, differences, then nan
            a = i + 1  # next starting position
        parts.extend(([tmp[a]], np.diff(tmp[a:])))  # store remaining segment
        out = np.concatenate(parts)
    else: # nan free!
        out = np.concatenate(([arr[0]], np.diff(arr)))
    return out

def ssum_v2(arr):
    nidx = np.argwhere(np.isnan(arr))
    if np.any(nidx): 
        tmp = arr; parts = list(); a = 0
        for i in nidx.flatten():  
            parts.extend((np.cumsum(tmp[a: i]), [tmp[i]]))
            a = i + 1
        parts.append(np.cumsum(tmp[a:]))
        out = np.concatenate(parts)
    else:
        out = np.cumsum(arr)
    return out

Okay, whew, that’s now taken care of. But wait, now you have two nan’s in a row? Argh! The above fix doesn’t work on runs of nan’s! Now I gotta make another fix!

>>> pressure
array([1001.19 , 1001.18 ,      nan,      nan, 1001.178, 1001.182,
       1001.17 , 1001.13 , 1000.98 , 1000.88 ], dtype=float32)


>>> ssum_v2(sdif_v2(pressure))
array([1001.19 , 1001.18 ,      nan,      nan,      nan, 1001.178,
       1001.182, 1001.17 , 1001.13 , 1000.98 , 1000.88 ], dtype=float32)

def sdif_v3(arr):
    nidx = np.argwhere(np.isnan(arr))  
    if np.any(nidx):
        nidx = nidx.flatten()  # group consecutive nan indices
        nidx = np.split(nidx, np.where(np.diff(nidx) != 1)[0] + 1)
        # above line: https://stackoverflow.com/a/7353335
        tmp = arr; parts = list(); a = 0
        for sub in nidx:
            i = sub[0]
            if len(sub) > 1:  # a run/group of nan's
                parts.extend(([tmp[a]], np.diff(tmp[a: i]), tmp[sub]))
                i = sub[-1]
            else:             # a single nan
                parts.extend(([tmp[a]], np.diff(tmp[a: i]), [tmp[i]]))
            a = i + 1  
        parts.extend(([tmp[a]], np.diff(tmp[a:])))
        out = np.concatenate(parts)
    else:
        out = np.concatenate(([arr[0]], np.diff(arr)))
    return out

def ssum_v3(arr):
    nidx = np.argwhere(np.isnan(arr))
    if np.any(nidx): 
        nidx = nidx.flatten()
        nidx = np.split(nidx, np.where(np.diff(nidx) != 1)[0] + 1)
        tmp = arr; parts = list(); a = 0
        for sub in nidx:
            i = sub[0]
            if len(sub) > 1:
                parts.extend((np.cumsum(tmp[a: i]), tmp[sub]))
                i = sub[-1]
            else:       
                parts.extend((np.cumsum(tmp[a: i]), [tmp[i]]))
            a = i + 1
        parts.append(np.cumsum(tmp[a:]))
        out = np.concatenate(parts)
    else:
        out = np.cumsum(arr)
    return out

Now it’s good right? Wait, now you’re saying you have an infinity in your data?! How’d that get in there?!

>>> pressure
array([1001.19 , 1001.18 ,      nan,      nan, 1001.178,      inf,
       1001.17 , 1001.13 , 1000.98 , 1000.88 ], dtype=float32)

>>> ssum_v3(sdif_v3(pressure))
RuntimeWarning: invalid value encountered in accumulate
array([1001.19 , 1001.18 ,      nan,      nan, 1001.178,      inf,
            nan,      nan,      nan,      nan], dtype=float32)

Man, that’s a different kind of invalid number that also breaks everything! Adding to an infinity just produces Now I have to add in another fix…

...grumble grumble...
ndx, fdx = np.isnan(arr), np.isinf(fdx)
if np.any(ndx) or np.any(fdx): 
    ndx, fdx = np.argwhere(ndx), np.argwhere(fdx)
    idx = np.unique(np.concatenate((ndx.flatten(), fdx.flatten())))
    idx = np.split(idx, np.where(np.diff(idx) != 1)[0] + 1)
    tmp = arr; parts = list(); a = 0
    for sub in idx:
        ...

Okay, now it should finally…
…argh… it still fails to store things correctly when the array STARTS or ENDS with an invalid number…

def sdif(arr):
    "like np.diff but keeps the starting value"
    ndx, fdx = np.isnan(arr), np.isinf(fdx)
    
    if np.any(ndx) or np.any(fdx):
        # group consecutive nan/inf indices
        ndx, fdx = np.argwhere(ndx), np.argwhere(fdx)
        idx = np.unique(np.concatenate((ndx.flatten(), fdx.flatten())))
        idx = np.split(idx, np.where(np.diff(idx) != 1)[0] + 1)
        # above line: https://stackoverflow.com/a/7353335
        tmp = arr; parts = list(); a = 0
        del ndx, fdx
        
        for sub in idx:
            i = sub[0]
            if not i:         # starting with invalids
                parts.append(tmp[sub])
            if len(sub) > 1:  # a run/group of invalids
                parts.extend(([tmp[a]], np.diff(tmp[a: i]), tmp[sub]))
            else:             # a single invalid
                parts.extend(([tmp[a]], np.diff(tmp[a: i]), [tmp[i]]))
            a = sub[-1] + 1  
        if a < len(arr): parts.extend(([tmp[a]], np.diff(tmp[a:])))
        out = np.concatenate(parts)
        del parts, idx
        
    else:                     # invalid free!
        out = np.concatenate(([arr[0]], np.diff(arr)))
    return out

def ssum(arr):
    "opposite of sdif"
    ndx, fdx = np.isnan(arr), np.isinf(fdx)
    
    if np.any(ndx) or np.any(fdx):
        # group consecutive nan/inf indices
        ndx, fdx = np.argwhere(ndx), np.argwhere(fdx)
        idx = np.unique(np.concatenate((ndx.flatten(), fdx.flatten())))
        idx = np.split(idx, np.where(np.diff(idx) != 1)[0] + 1)
        # above line: https://stackoverflow.com/a/7353335
        tmp = arr; parts = list(); a = 0
        del ndx, fdx
        
        for sub in nidx:
            i = sub[0]
            if not i:         # starting with invalids
                parts.append(tmp[sub]) 
            if len(sub) > 1:  #  run/group of invalids
                parts.extend((np.cumsum(tmp[a: i]), tmp[sub]))
            else:             # a single invalid
                parts.extend((np.cumsum(tmp[a: i]), [tmp[i]]))
            a = sub[-1] + 1
        parts.append(np.cumsum(tmp[a:]))
        out = np.concatenate(parts)
        del parts, idx
        
    else:                     # invalid free!
        out = np.cumsum(arr)
    return out

At this point, you should just throw away your data because it’s so rotten with invalid measurements. But there, finally some functions that find the differences and can restore those differences back to the original data. And at this point, I kinda forgot where I was going with this because there were multiple cases that needed to be figured out and handled. Note that the vast majority of the code is just for handling invalid numbers, so if you come across some super long code, it might just be for handling bad data. And after so many versions, now I gotta wonder if I missed other edge cases and it’ll still fail. And there’s probably yet another way to write this…

Tagged with: Python
Posted in Python

Difference Storage

Share this:

Related

Leave a comment Cancel reply