PL/SQL Statistics Library


Click on the following link view the PL/SQL source code.

UTIL_NUMERIC Package
PL/SQL Package: UTIL_NUMERIC


The design and implementation of the statistics library in PL/SQL is described in more detail below, following the list of functions. Please view the package specification in the PL/SQL source for descriptions of each function, including: parameters, return values, and exceptions.


UTIL_NUMERIC Functions
Function Name Description
get_stats Core function to create frequency table from array, return record containing frequency table and stats.
get_stats_list Receives comma separated list of numbers, converts to array and calls get_stats_array
get_stats_array Calls get_stats passing array.
populate_frequency_table Populate frequency table with data from array. For each distinct array value store:
key - Unique occurrence of value in array
frequency - Count of occurrences of key in array.
frequency_table_sum Return sum of values in frequency table: sum (key * frequency)
frequency_table_count Return count of all values in frequency table: sum of frequency
frequency_table_mean Return mean value: sum / count
frequency_table_median Return median value: Middle value, or average of two mid numbers if even count.
frequency_table_mode Return values with highest frequency. May be 1 or more values returned.
frequency_table_lowest Return lowest value
frequency_table_highest Return highest value
frequency_table_range Return range = highest - lowest
variance_pop Return population variance VAR_POP
variance_samp Return sample variance VAR_SAMP
stddev_pop Return population standard deviation STDDEV_POP
stddev_samp Return sample standard deviation STDDEV_SAMP
percentile_cont Return continuous interpolated percentile for a frequency table.
percentile_disc Return discrete (actually observed value) percentile for a frequency table.
iqr Return interquartile range, difference between 75th and 25th continuous percentiles.
display_frequency_table Display contents of frequency table
display_stats Display all statistic for frequency table

Designing and Implementing get_stats in PL/SQL

The get_stats function in the util_numeric package provides a deterministic statistical engine for in-memory numeric datasets.

It does not operate directly on database tables, although data from tables can be passed to it by first being placed into an array. Instead, it works with PL/SQL collections and produces a structured result containing:

  • The Frequency Table
  • A complete statistical summary

This blog post explains the design and implementation of that function.


Entry Points

The package supports the following ways to compute statistics:

If you already have data in an array, or have populated the array from a database table, call:

get_stats(p_array IN t_number_array)
              


Thin wrapper for array input, simply delegates to get_stats:

get_stats_array(p_array IN t_number_array)
              


If your data is a comma separated values string:

get_stats_list(p_list IN t_number_array)
              

This converts the string using function list_to_array and then calls get_stats_array.
All three entry points ultimately route to the same core function, get_stats.


The Core Design Principle

The key architectural decision is this:
Build the Frequency Table once, then derive all statistics from it.
Instead of repeatedly scanning the original array, get_stats first transforms the dataset into a canonical distribution.


Step 1 - Populate the Frequency Table

The first operation inside get_stats is:

v_frequency_table := populate_frequency_table(p_array);
              

populate_frequency_table performs the following:

  • Validates input
  • Rejects NULL values
  • Sorts array into ascending order
  • Compresses duplicates into (key, frequency) pairs

Example:
Input array: 1,1,2,2,2,4,5

Frequency Table
Key Frequency
1 2
2 3
4 1
5 1

From this point forward, all statistical calculations operate on this frequency table.


Step 2 - Calculate Summary Statistics

Once the frequency table has been populated, each statistic is computed independently using pure functions:

    v_stats.sum_values := frequency_table_sum(v_frequency_table);
    v_stats.n_total := frequency_table_count(v_frequency_table);
    v_stats.distinct_n := v_frequency_table.COUNT;
    v_stats.mean := frequency_table_mean(v_frequency_table);
    v_stats.median := frequency_table_median(v_frequency_table);
    v_stats.mode_values:= frequency_table_mode(v_frequency_table);
    v_stats.lowest := frequency_table_lowest(v_frequency_table);
    v_stats.highest := frequency_table_highest(v_frequency_table);
    v_stats.range := frequency_table_range(v_frequency_table);
    v_stats.variance_pop := variance_pop(v_frequency_table);
    v_stats.variance_samp := variance_samp(v_frequency_table);
    v_stats.stddev_pop := stddev_pop(v_frequency_table);
    v_stats.stddev_samp := stddev_samp(v_frequency_table);
    v_stats.iqr := iqr(v_frequency_table);
                

Each function:
  • Accepts frequency table data of type t_frequency_table
  • Performs a calculation
  • Has no side-effects
  • Does not re-build, re-sort, or change frequency table data

This makes the implementation predictable, and testable.


Step 3 - Populate the Result Structure

The function returns a composite structure:

  -- Get_stats uses this composite record to store the calculated
  -- statistics and frequency table from which they were generated.
  TYPE t_stats_result IS RECORD (
    stats       t_stats_summary,
    freq_tbl    t_frequency_table
  );
                

The underlying structures used for the frequency table, and statistics are:
  TYPE t_frequency_row IS RECORD (
    key         PLS_INTEGER,
    frequency   PLS_INTEGER
  );

  TYPE t_frequency_table IS TABLE OF t_frequency_row;

  TYPE t_int_table IS TABLE OF PLS_INTEGER;

  -- Used by get_stats to return all statistics for frequency table 
  -- Note this is nested, supporting multiple mode values in t_int_table
  TYPE t_stats_summary IS RECORD (
    sum_values      PLS_INTEGER,
    n_total         PLS_INTEGER,
    distinct_n      PLS_INTEGER,
    mean            NUMBER,
    median          NUMBER,
    mode_values     t_int_table,
    lowest          NUMBER,
    highest         NUMBER,
    range           NUMBER,
    variance_pop    NUMBER,
    variance_samp   NUMBER,
    stddev_pop      NUMBER,
    stddev_samp     NUMBER,
    iqr             NUMBER 
  );
                

Returning both the statistics and frequency table from which they were derived in t_stats_result ensures:
  • The frequency table is available for display
  • Percentiles can be calculated repeatedly without needing to re-compute the frequency table
  • Additional derived calculations can reuse the frequency table


Why Percentiles are Separate

Percentile functions are intentionally not embedded inside get_stats. They are defined separately:

  FUNCTION percentile_disc(
    p_frequency_table IN t_frequency_table,
    p_pct             IN NUMBER 
  ) RETURN PLS_INTEGER;
  
  FUNCTION percentile_cont(
    p_frequency_table IN t_frequency_table,
    p_pct             IN NUMBER 
  ) RETURN NUMBER;
                

This keeps responsibilities clean:
  • get_stats computes the core_summary.
  • Percentiles operate on the returned frequency table.
  • The summary record remains stable and compact.


Complexity

Let:

  • N = total number of elements in source array
  • D = number of distinct values (rows in frequency table)

Then:
  • Frequency table build: O(N log N)
  • Mean, median, mode, variance, standard deviation: O(D)
  • Percentile: O(D)

Once the frequency table is built, all operations are linear in distinct values only.


Example

In this example, a comma separated list of integers will be passed to get_stats_list which returns the result in a record of type t_stats_result.
Run the following code in Oracle SQL Developer:

SET SERVEROUTPUT ON
ACCEPT p_list PROMPT "Enter a list of numbers separated by commas"
ACCEPT p_percentile NUMBER PROMPT "Percentile (number > 0 and < 1)?"

DECLARE
  v_stats_result util_numeric.t_stats_result;
  v_percentile_cont NUMBER;
  v_percentile_disc NUMBER;
BEGIN
  util_admin.log_message('Input list is ' || '&p_list');

  v_stats_result := util_numeric.get_stats_list('&p_list');
  util_numeric.display_frequency_table(v_stats_result);
  util_numeric.display_stats(v_stats_result);
  v_percentile_disc := util_numeric.percentile_disc(v_stats_result.freq_tbl,&p_percentile);
  util_admin.log_message('PCT_DISC ('||to_char(&p_percentile,'0.99')||')='||trim(to_char(v_percentile_disc,'9999999990.9999')));
  v_percentile_cont := util_numeric.percentile_cont(v_stats_result.freq_tbl,&p_percentile);
  util_admin.log_message('PCT_CONT ('||to_char(&p_percentile,'0.99')||')='||trim(to_char(v_percentile_cont,'9999999990.9999')));
EXCEPTION
  WHEN OTHERS THEN
    util_admin.log_message('See error messages above.');
END;
              

You will be prompted to enter a list of numbers, followed by a percentile.
Enter a list of numbers separated by commas : 1,3,1,2,1,5,9,1,7,3,2
Percentile (number > 0 and < 1)? : 0.66

The results are displayed below:

              

Input list is 1,3,1,2,1,5,9,1,7,3,2
----------------------------------------------------------
FREQUENCY TABLE
----------------------------------------------------------
KEY=1 Frequency=4
KEY=2 Frequency=2
KEY=3 Frequency=2
KEY=5 Frequency=1
KEY=7 Frequency=1
KEY=9 Frequency=1
----------------------------------------------------------
STATISTICS
Sum=35
N Total=11
Distinct N=6
Mean=3.1818
Median=2.0000
Mode 1 = 1
Lowest=1
Highest=9
Range=8
Variance Population=6.6942
Variance Sample=7.3636
Standard Deviation Population=2.5873
Standard Deviation Sample=2.7136
Interquartile Range=3.0000
PCT_DISC ( 0.66)=3.0000
PCT_CONT ( 0.66)=3.0000
              

Conclusion

get_stats defines a small, well-structured statistical application.
In our next blog we will develop an APEX application that uses these statistical functions.



Previous Blog HOME