Question:

How can I perform set operations such as union, intersection, etc. on data sets with time interval information, e.g. alarm start and end?


Solution:

In the article "Duration calculations of overlapping periods" we showed how to calculate the total duration of multiple periods.

If you are not interested in the total duration but want to perform set operations on these periods without losing the detailed information of each period, you can use the functions provided below.

Below you will find functions for the following operations:

  • periods2events
  • intersecting_periods
  • unified_periods
  • nonoverlapping_periods_simple
  • diff_periods
  • nonoverlapping_periods


Short description of each function:

periods2events:

This function converts the input data frame, in which each observation consists of the category, start and stop of a time interval, into an event list. Each observation of the event list consists of the category, the timestamp of the event, whether the event is a rising edge (start) or a falling edge (stop), and the number of intervals that are open in parallel at that moment.
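
The conversion can be illustrated on a toy data frame. The following is only a minimal sketch of the idea, independent of the function definition given further down; the data are made up and the timestamps are plain integers for brevity.

```python
import pandas as pd

# Toy input (hypothetical data): two overlapping alarms of categories "A" and "B".
periods = pd.DataFrame({
    "category": ["A", "B"],
    "start": [0, 5],
    "stop": [10, 15],
})

# Every start becomes a rising edge (+1), every stop a falling edge (-1).
events = pd.DataFrame({
    "category": list(periods["category"]) * 2,
    "timestamp": list(periods["start"]) + list(periods["stop"]),
    "edge": [1] * len(periods) + [-1] * len(periods),
})

# Order the events in time (a stable sort keeps the input order on ties).
# The running sum of the edges is the number of intervals open at each event.
events = events.sort_values("timestamp", kind="mergesort").reset_index(drop=True)
events["parallels"] = events["edge"].cumsum()

print(events)
```

In this example the "parallels" column reads 1, 2, 1, 0: both alarms are active between the timestamps 5 and 10.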


intersecting_periods:

This function returns all time intervals where intervals of all categories are overlapping.
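
As a rough sketch of the principle (not the function definition itself): if the periods within one category do not overlap each other, all categories are active exactly where the number of parallel intervals equals the number of categories. The data below are hypothetical and use integer timestamps.

```python
import pandas as pd

# Toy input (hypothetical data): category "A" has two periods, "B" has one.
periods = pd.DataFrame({
    "category": ["A", "A", "B"],
    "start": [0, 20, 5],
    "stop": [10, 30, 25],
})
n_categories = periods["category"].nunique()

# Event list: +1 at every start, -1 at every stop, ordered in time.
events = pd.concat([
    pd.DataFrame({"timestamp": periods["start"], "edge": 1}),
    pd.DataFrame({"timestamp": periods["stop"], "edge": -1}),
]).sort_values("timestamp", kind="mergesort").reset_index(drop=True)
events["parallels"] = events["edge"].cumsum()

# An intersection starts at an event where all categories are active
# and lasts until the next event.
ts = events["timestamp"].tolist()
par = events["parallels"].tolist()
rows = [
    (ts[i], ts[i + 1])
    for i in range(len(ts) - 1)
    if par[i] == n_categories and ts[i] != ts[i + 1]
]
intersections = pd.DataFrame(rows, columns=["start", "stop"])

print(intersections)
```

Here "A" and "B" overlap from 5 to 10 and from 20 to 25.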



unified_periods:

This function returns the unified periods of all intervals and categories.
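
A union of intervals can also be sketched without an event list by sorting the periods and merging neighbours. This is a different, simplified technique than the one used in the function definition below; the data are hypothetical, and touching intervals are treated as one period here.

```python
import pandas as pd

# Toy input (hypothetical data), integer timestamps for brevity.
periods = pd.DataFrame({
    "category": ["A", "B", "A"],
    "start": [0, 5, 20],
    "stop": [10, 15, 30],
})

merged = []
# Walk through the periods in order of their start, ignoring the category.
for start, stop in sorted(zip(periods["start"], periods["stop"])):
    if merged and start <= merged[-1][1]:
        # Overlaps (or touches) the last merged period: extend it.
        merged[-1][1] = max(merged[-1][1], stop)
    else:
        # Gap before this period: open a new merged period.
        merged.append([start, stop])

unified = pd.DataFrame(merged, columns=["start", "stop"])
print(unified)
```

The three input periods collapse into two unified periods, 0 to 15 and 20 to 30.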



nonoverlapping_periods_simple:

This function returns the non-overlapping periods considering 2 priorities. You can specify the high priority category with the "prio" parameter. If this parameter is not set, the first existing category in the input data in alphanumerical order is taken as high priority. If periods of the high and the low priority category overlap, the overlapping time is accounted only to the high priority category and cut from the low priority category. If there are more than 2 categories, all low priority categories are unified first and labelled with the alphanumerically first low priority category.


diff_periods:

This function returns the remaining intervals of the category defined by "minuend" with all overlaps with other categories removed. If no minuend is defined, the alphanumerically first category included in the input data is taken as minuend.
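
The subtraction can be pictured as cutting pieces out of each minuend interval. The sketch below is a simplified stand-alone illustration with hypothetical data and integer timestamps; it does not reproduce the function definition given further down.

```python
import pandas as pd

# Toy input (hypothetical data): "B" cuts two pieces out of the "A" period.
periods = pd.DataFrame({
    "category": ["A", "B", "B"],
    "start": [0, 2, 8],
    "stop": [10, 4, 12],
})
minuend = "A"

is_minuend = periods["category"] == minuend
keep = list(zip(periods.loc[is_minuend, "start"], periods.loc[is_minuend, "stop"]))
cut = sorted(zip(periods.loc[~is_minuend, "start"], periods.loc[~is_minuend, "stop"]))

remaining = []
for start, stop in keep:
    cursor = start  # everything before the cursor has already been handled
    for c_start, c_stop in cut:
        if c_stop <= cursor or c_start >= stop:
            continue  # no overlap with what is left of this interval
        if c_start > cursor:
            remaining.append((cursor, c_start))  # piece before the cut survives
        cursor = max(cursor, c_stop)
    if cursor < stop:
        remaining.append((cursor, stop))  # piece after the last cut survives

diff = pd.DataFrame(remaining, columns=["start", "stop"])
diff.insert(0, "category", minuend)
print(diff)
```

Of the "A" period from 0 to 10, the pieces 0 to 2 and 4 to 8 remain.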

nonoverlapping_periods:

This function returns the non-overlapping periods considering all priorities. You can specify the priority of each category with the "prio" parameter, where the first entry in this vector has the highest priority and the last entry has the lowest priority. If this parameter is not set, the alphanumerical order is taken as the priority ranking. If not all categories within the input data set are mentioned in the priority ranking, the missing categories are ignored. As in the function "nonoverlapping_periods_simple", overlapping periods of different categories are accounted only to the category with the higher priority and cut from the lower priority category.
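
One way to picture the priority ranking is to split the time axis at every start and stop and to assign each resulting segment to the highest-priority category active in it. The sketch below uses this simplified technique on hypothetical data; unlike the function definition further down, it also merges touching periods of the same category, so the original period boundaries within a category are not preserved.

```python
import pandas as pd

# Toy input (hypothetical data), integer timestamps for brevity.
periods = pd.DataFrame({
    "category": ["A", "B", "C"],
    "start": [0, 5, 8],
    "stop": [10, 15, 20],
})
prio = ["A", "B", "C"]  # highest priority first

# Elementary segments between consecutive distinct timestamps.
cuts = sorted(set(periods["start"]) | set(periods["stop"]))

rows = []
for seg_start, seg_stop in zip(cuts[:-1], cuts[1:]):
    # Categories that have a period covering the whole segment.
    covers = (periods["start"] <= seg_start) & (periods["stop"] >= seg_stop)
    active = set(periods.loc[covers, "category"])
    winner = next((c for c in prio if c in active), None)
    if winner is None:
        continue  # gap, or only categories that are missing from prio
    if rows and rows[-1][0] == winner and rows[-1][2] == seg_start:
        rows[-1][2] = seg_stop  # same category continues: extend the period
    else:
        rows.append([winner, seg_start, seg_stop])

result = pd.DataFrame(rows, columns=["category", "start", "stop"])
print(result)
```

In this example "A" keeps 0 to 10, "B" is reduced to 10 to 15 and "C" to 15 to 20.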


Function definitions in Python:


NOTE: To run the functions given below you have to import the packages "pandas" as "pd"  and "numpy" as "np"!


import pandas as pd
import numpy as np

def periods2events(data):
    """
    # Function name:    periods2events 
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #
    # Outputs:          data frame with four variables/columns category, timestamp, edge, parallels
    #
    # Description:      This function converts the input data frame, in which each observation consists of the category, start and stop
    #                   of a time interval, into an event list. Each observation of the event list consists of the category, the timestamp
    #                   of the event, whether a rising or falling edge occurs and the number of intervals open in parallel at that moment.
    #
    # Required custom 
    # functions:        none
    """
    log = pd.DataFrame({'category':2*list(data['category']), 'timestamp':list(data['start']) +  list(data['stop']), 'edge':[1] * len(data) + [-1] * len(data)})
    log_ordered = log.sort_values(by='timestamp')
    log_ordered['parallels'] = log_ordered['edge'].cumsum()
    log_ordered = log_ordered[[not x for x in log_ordered[['timestamp', 'category']].duplicated(keep = 'last')]]
    return log_ordered.reset_index(drop=True)

def intersecting_periods(data, filterby = None):
    """
    # Function name:    intersecting_periods
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #                   optional: filterby ... vector with categories for which the data should be filtered
    #
    # Outputs:          data frame with two variables/columns start and stop
    #
    # Description:      This function returns all time intervals where intervals of all categories are overlapping.
    #
    # Required custom 
    # functions:        periods2events
    """
    if filterby is None:
        filterby = data['category'].unique()
    
    filterby = np.array(filterby)
    log_ordered = periods2events(data[data['category'].apply(lambda x: x in filterby)])
    
    if (len(log_ordered) == log_ordered.groupby('category').count()['timestamp']).any():
        log_intersect = pd.DataFrame()
    else:
        pos_intersect = (len(filterby) == log_ordered['parallels'])
        log_intersect = pd.DataFrame({'start': log_ordered['timestamp'][pos_intersect].reset_index(drop=True), 'stop':log_ordered['timestamp'][[False]+list(pos_intersect[:-1])].reset_index(drop=True)} )
        log_intersect = log_intersect[log_intersect['start'] != log_intersect['stop']]
        
    return log_intersect

def unified_periods(data, filterby = None):
    """
    # Function name:    unified_periods
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #                   optional: filterby ... vector with categories for which the data should be filtered
    #
    # Outputs:          data frame with two variables/columns start and stop
    #
    # Description:      This function returns the unified periods of all intervals and categories.
    #
    # Required custom 
    # functions:        periods2events
    """
    if filterby is None:
        filterby = data['category'].unique()
    
    filterby = np.array(filterby)
    data_filtered = data[data['category'].apply(lambda x: x in filterby)]
    if len(data_filtered) == 0:
        log_unified = pd.DataFrame()
    else:
        log_ordered = periods2events(data_filtered)
        pos_zero = log_ordered['parallels'] == 0
        log_unified = pd.DataFrame({'start': [log_ordered['timestamp'].iloc[0]]+list(log_ordered['timestamp'][[False]+list(pos_zero.iloc[:-1])]), 'stop':list(log_ordered['timestamp'][pos_zero])})
        
    return log_unified



def nonoverlapping_periods_simple(data, filterby = None, prio = None):
    """
    # Function name:    nonoverlapping_periods_simple  
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #                   optional: filterby ... vector with categories for which the data should be filtered
    #                   optional: prio ... category with the higher priority
    #
    # Outputs:          data frame with three variables/columns category, start and stop
    #
    # Description:      This function returns the non-overlapping periods considering 2 priorities.
    #                   You can specify the high priority category with the "prio" parameter. If this parameter is not set, the first existing 
    #                   category in the input data in alphanumerical order will be taken as high priority.
    #                   If there are overlapping periods of the high and low priority category the overlapping time is only accounted as the high
    #                   priority category and cut from the low priority category.
    #                   If there are more than 2 categories all low priority categories will be unified first and labelled with the alphanumerically 
    #                   ordered first low priority category.
    #
    # Required custom 
    # functions:        periods2events, unified_periods
    """
    if filterby is None:
        filterby = np.sort(data['category'].unique())
    if prio is None:
        prio = filterby[0]
     
    filterby = np.array(filterby)    
    low_prio = filterby[filterby != prio]
    data_filtered = data[data['category'].apply(lambda x: x in filterby)]
    log_high_prio = data_filtered[data_filtered['category'] == prio]
    log_low_prio = unified_periods(data[data['category'].apply(lambda x: x in low_prio)])
    log_low_prio['category'] = low_prio[0]
    
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    log_ordered = periods2events(pd.concat([log_high_prio, log_low_prio], sort = True))

    log_ordered['a_on'] = np.cumsum( np.array(log_ordered['category'] == prio) * np.array(log_ordered['edge']))
    log_ordered['b_on'] = np.cumsum( np.array(log_ordered['category'] == low_prio[0]) * np.array(log_ordered['edge']))
    
    log_non_overlap  = log_ordered[(log_ordered['category'] == prio) | ( (log_ordered['category'] == low_prio[0]) & (log_ordered['a_on']== 0))]
    # Where a high priority period starts or stops while a low priority period is open,
    # add the opposite edge for the low priority category to cut it at that timestamp.
    log_cut = log_non_overlap[(log_non_overlap['category'] == prio) & (log_non_overlap['b_on'] == 1)]
    log_cut = log_cut.assign(category = low_prio[0], edge = -log_cut['edge'])
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    log_non_overlap = pd.concat([log_non_overlap, log_cut], sort = True)
    log_non_overlap = log_non_overlap[['category','timestamp','edge']].sort_values('timestamp')
    log_non_overlap_spread = pd.DataFrame({'category': log_non_overlap['category'][log_non_overlap['edge']==1].reset_index(drop=True),'start': log_non_overlap['timestamp'][log_non_overlap['edge']==1].reset_index(drop=True),'stop': log_non_overlap['timestamp'][log_non_overlap['edge']==-1].reset_index(drop=True)})
    return log_non_overlap_spread


def diff_periods(data, filterby = None, minuend = None):
    """
    # Function name:    diff_periods
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #                   optional: filterby ... vector with categories for which the data should be filtered
    #                   optional: minuend ... category from which the periods of the other categories should be subtracted
    #
    # Outputs:          data frame with three variables/columns category, start and stop
    #
    # Description:      This function returns the remaining intervals of the category defined by "minuend" with all overlaps with other categories removed.
    #                   If no minuend is defined, the alphanumerically first category included in the input data will be taken as minuend.
    #
    # Required custom 
    # functions:        periods2events, unified_periods, nonoverlapping_periods_simple
    """
    if filterby is None:
        filterby = np.sort(data['category'].unique())
    if minuend is None:
        minuend = filterby[0]
    
    filterby = np.array(filterby) 
    data_filtered = data[data['category'].apply(lambda x: x in filterby)]
    
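    # Check the filtered data, so that categories excluded by filterby are not counted.
    if len(data_filtered[data_filtered['category'] == minuend]) == 0:
        return pd.DataFrame()
    
    if len(data_filtered[data_filtered['category'] == minuend]) == len(data_filtered):
        return data_filtered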
    
    log_subtrahend = unified_periods(data_filtered[ data_filtered['category'] != minuend])
    log_subtrahend['category'] =  (filterby[filterby != minuend][0])
    log_minuend = data_filtered[ data_filtered['category'] == minuend]
    # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
    log_data = pd.concat([log_subtrahend, log_minuend], sort = False).sort_values('start').reset_index(drop = True)
    log_data = nonoverlapping_periods_simple(log_data, filterby = [minuend, filterby[filterby != minuend][0]], prio = filterby[filterby != minuend][0])
    log_data = log_data[log_data['category'] == minuend]
    return log_data 


def nonoverlapping_periods(data, filterby = None, prio = None):
    """
    # Function name:    nonoverlapping_periods
    #
    # Inputs:           data ... data frame with the three variables/columns category, start and stop
    #                   optional: filterby ... vector with categories for which the data should be filtered
    #                   optional: prio ... vector of categories where the position within the vector defines the priority, starting with the highest priority
    #
    # Outputs:          data frame with three variables/columns category, start and stop
    #
    # Description:      This function returns the non-overlapping periods considering all priorities.
    #                   You can specify the priority of each category with the "prio" parameter, where the first entry in this vector has the highest
    #                   priority and the last entry has the lowest priority. If this parameter is not set, the alphanumerical order will be taken as the
    #                   priority ranking.
    #                   If not all categories within the input data set are mentioned in the priority ranking, the missing categories are ignored.
    #                   Like in the function "nonoverlapping_periods_simple" the overlapping periods of different categories are only accounted as a period 
    #                   of the category with the higher priority and cut from the low priority category.
    #
    # Required custom 
    # functions:        periods2events, unified_periods, diff_periods, nonoverlapping_periods_simple
    """
    if filterby is None:
        filterby = np.sort(data['category'].unique())
    if prio is None:
        prio = filterby
    
    filterby = np.array(filterby) 
    prio = np.array(prio)
    data_filtered = data[data['category'].apply(lambda x: x in filterby)]
    log_non_overlap = pd.DataFrame()
    
    for i in range(len(prio)-1,-1,-1):
        if i >0:
            log_higher_prio = unified_periods(data_filtered[data_filtered['category'].apply(lambda x: x in prio[0:i])])
            log_higher_prio['category']='high'
            log_lower_prio = data_filtered[data_filtered['category']== prio[i]]
            
            # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
            log_bind_prio = pd.concat([log_higher_prio, log_lower_prio], sort = True).sort_values('start')
            
            log_non_overlap_prio = diff_periods(log_bind_prio, minuend = prio[i])
        else:
            log_non_overlap_prio = data_filtered[data_filtered['category']== prio[0]]
            
        if len(log_non_overlap) == 0:
            log_non_overlap = log_non_overlap_prio
        else:
            # DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
            log_non_overlap = pd.concat([log_non_overlap, log_non_overlap_prio], sort = True)
            
    return log_non_overlap.sort_values('start')