- COUNT
- MIN and MAX
- SUM
- AVG
- COUNT_IF
- SUM_IF and AVG_IF
- SOME
- CountDistinctEstimate, HyperLogLog and HLL
- AGGREGATE_LIST
- MAX_BY and MIN_BY
- TOP and BOTTOM
- TOP_BY and BOTTOM_BY
- TOPFREQ and MODE
- STDDEV and VARIANCE
- CORRELATION and COVARIANCE
- PERCENTILE and MEDIAN
- HISTOGRAM
- LinearHistogram, LogarithmicHistogram and LogHistogram
- CDF (cumulative distribution function)
- BOOL_AND, BOOL_OR and BOOL_XOR
- BIT_AND, BIT_OR, and BIT_XOR
- SessionStart
- AGGREGATE_BY and MULTI_AGGREGATE_BY
- UDAF
COUNT
Signature
COUNT(*)->Uint64
COUNT(T)->Uint64
COUNT(T?)->Uint64
Counting the number of rows in the table (if *
or constant is specified as the argument) or non-empty values in a table column (if the column name is specified as an argument).
Like other aggregate functions, it can be combined with GROUP BY to get statistics on the parts of the table that correspond to the values in the columns being grouped. Use the DISTINCT modifier to count the number of unique values.
Examples
SELECT COUNT(*) FROM my_table;
SELECT key, COUNT(value) FROM my_table GROUP BY key;
SELECT COUNT(DISTINCT value) FROM my_table;
MIN and MAX
Signature
MIN(T?)->T?
MIN(T)->T?
MAX(T?)->T?
MAX(T)->T?
Minimum or maximum value.
As an argument, you may use an arbitrary computable expression with a comparable result.
Examples
SELECT MIN(value), MAX(value) FROM my_table;
SUM
Signature
SUM(Unsigned?)->Uint64?
SUM(Signed?)->Int64?
SUM(Interval?)->Interval?
SUM(Decimal(N, M)?)->Decimal(35, M)?
Sum of the numbers.
As an argument, you may use an arbitrary computable expression with a numeric result or Interval
type.
Integers are automatically expanded to 64 bits to reduce the risk of overflow.
SELECT SUM(value) FROM my_table;
AVG
Signature
AVG(Double?)->Double?
AVG(Interval?)->Interval?
AVG(Decimal(N, M)?)->Decimal(N, M)?
Arithmetic average.
As an argument, you may use an arbitrary computable expression with a numeric result or Interval
type.
Integer values and time intervals are automatically converted to Double.
Examples
SELECT AVG(value) FROM my_table;
COUNT_IF
Signature
COUNT_IF(Bool?)->Uint64?
Number of rows for which the expression specified as the argument is true (the expression's calculation result is true).
The value NULL
is equated to false
(if the argument type is Bool?
).
The function does not do the implicit type casting to Boolean for strings and numbers.
Examples
SELECT
COUNT_IF(value % 2 == 1) AS odd_count
Note
If you need to count the number of unique values on the rows that meet the condition, then unlike other aggregate functions, you can't use the DISTINCT modifier, since the arguments contain no values. To get this result, use, inside a subquery, the built-in IF
function with two arguments (to get NULL
in ELSE), and apply an outer COUNT(DISTINCT ...) to its results.
SUM_IF and AVG_IF
Signature
SUM_IF(Unsigned?, Bool?)->Uint64?
SUM_IF(Signed?, Bool?)->Int64?
SUM_IF(Interval?, Bool?)->Interval?
AVG_IF(Double?, Bool?)->Double?
Sum or arithmetic average, but only for the rows that satisfy the condition passed by the second argument.
Therefore, SUM_IF(value, condition)
is a slightly shorter notation for SUM(IF(condition, value))
, same for AVG
. The argument's data type expansion is similar to the same-name functions without a suffix.
Examples
SELECT
SUM_IF(value, value % 2 == 1) AS odd_sum,
AVG_IF(value, value % 2 == 1) AS odd_avg,
FROM my_table;
When the aggregate function factory is used, Tuple of the value and predicate is passed as the first AGGREGATE_BY
argument.
Examples
$sum_if_factory = AggregationFactory("SUM_IF");
$avg_if_factory = AggregationFactory("AVG_IF");
SELECT
AGGREGATE_BY(AsTuple(value, value % 2 == 1), $sum_if_factory) AS odd_sum,
AGGREGATE_BY(AsTuple(value, value % 2 == 1), $avg_if_factory) AS odd_avg
FROM my_table;
SOME
Signature
SOME(T?)->T?
SOME(T)->T?
Get the value for an expression specified as an argument, for one of the table rows. Gives no guarantee of which row is used. This function is a counterpart for any() in ClickHouse.
Because of no guarantee, SOME
is computationally cheaper than MIN/MAX (that are often used in similar situations).
Examples
SELECT
SOME(value)
FROM my_table;
Alert
When the aggregate function SOME
is called multiple times, it's not guaranteed that all the resulting values are taken from the same row of the source table. To get this guarantee, pack the values into any container and pass it to SOME
. For example, for a structure you can do this using AsStruct
.
CountDistinctEstimate, HyperLogLog and HLL
Signature
CountDistinctEstimate(T)->Uint64?
HyperLogLog(T)->Uint64?
HLL(T)->Uint64?
Approximate estimate of the number of unique values using the HyperLogLog algorithm. Logically does the same as COUNT(DISTINCT ...), but works much faster and with a small margin of error.
Arguments:
- Value for estimation.
- Accuracy (from 4 to 18 inclusive, the default value is 14).
The choice of accuracy enables you to trade off the additional consumption of computing resources and RAM for error reduction.
All three functions are currently aliases, but CountDistinctEstimate
may start using a different algorithm in the future.
Examples
SELECT
CountDistinctEstimate(my_column)
FROM my_table;
SELECT
HyperLogLog(my_column, 4)
FROM my_table;
AGGREGATE_LIST
Signature
AGGREGATE_LIST(T? [, limit:Uint64])->List<T>
AGGREGATE_LIST(T [, limit:Uint64])->List<T>
AGGREGATE_LIST_DISTINCT(T? [, limit:Uint64])->List<T>
AGGREGATE_LIST_DISTINCT(T [, limit:Uint64])->List<T>
Get all column values as a list. In combination with DISTINCT
, returns only unique values. The optional second parameter sets the maximum number of obtained values.
If you know in advance that there are not many unique values, better use the AGGREGATE_LIST_DISTINCT
aggregate function that builds the same result in memory (which may not be enough for a large number of unique values).
The order of elements in the resulting list depends on the implementation and is not set externally. To get an ordered list, sort the result, for example, using ListSort.
To get a list of several values from one string, DO NOT use the AGGREGATE_LIST
function several times, but place all the desired values in a container, for example, via AsList or AsTuple and pass this container in a single AGGREGATE_LIST
invocation.
For example, you can use it in combination with DISTINCT
and the String::JoinFromList function (analog of ','.join(list)
from Python) to print to a string all the values that were seen in the column after applying GROUP BY.
Examples
SELECT
AGGREGATE_LIST( region ),
AGGREGATE_LIST( region, 5 ),
AGGREGATE_LIST( DISTINCT region ),
AGGREGATE_LIST_DISTINCT( region ),
AGGREGATE_LIST_DISTINCT( region, 5 )
FROM users
-- Analog of GROUP_CONCAT from MySQL
SELECT
String::JoinFromList(CAST(AGGREGATE_LIST(region, 2) AS List<String>), ",")
FROM users
There is also a short form of these functions: AGG_LIST
and AGG_LIST_DISTINCT
.
Alert
Executed NOT in a lazy way, so when you use it, you need to make sure that you get a list of reasonable size, around a thousand items. To be on the safe side, you can use a second optional numeric argument that includes a limit on the number of items in the list.
If the number of items in the list is exceeded, the Memory limit exceeded
error is returned.
MAX_BY and MIN_BY
Signature
MAX_BY(T1?, T2)->T1?
MAX_BY(T1, T2)->T1?
MAX_BY(T1, T2, limit:Uint64)->List<T1>?
MIN_BY(T1?, T2)->T1?
MIN_BY(T1, T2)->T1?
MIN_BY(T1, T2, limit:Uint64)->List<T1>?
Return the value of the first argument for the table row in which the second argument was minimum/maximum.
You can optionally specify the third argument N which affects the behavior if the table has multiple rows with the same minimum or maximum value:
- If N is not specified, the value of one of the rows will be returned and the rest are discarded.
- If N is specified, the list with all values will be returned, but no more than N, all values are discarded after reaching the specified number.
When choosing the N value, do not exceed hundreds or thousands to avoid problems with the limited available memory on YTsaurus clusters.
If the job necessarily needs all values and their number can be measured in tens of thousands or more, then instead of these aggregate functions use JOIN
of the original table with a subquery where GROUP BY + MIN/MAX
is performed on the columns you are interested in.
Attention!
If the second argument is always NULL, the aggregation result is NULL.
When the aggregate function factory is used, Tuple
of the value and key is passed as the first AGGREGATE_BY argument.
Examples
SELECT
MIN_BY(value, LENGTH(value)),
MAX_BY(value, key, 100)
FROM my_table;
$min_by_factory = AggregationFactory("MIN_BY");
$max_by_factory = AggregationFactory("MAX_BY", 100);
SELECT
AGGREGATE_BY(AsTuple(value, LENGTH(value)), $min_by_factory),
AGGREGATE_BY(AsTuple(value, key), $max_by_factory)
FROM my_table;
TOP and BOTTOM
Signature
TOP(T?, limit:Uint32)->List<T>
TOP(T, limit:Uint32)->List<T>
BOTTOM(T?, limit:Uint32)->List<T>
BOTTOM(T, limit:Uint32)->List<T>
Return the list of maximum/minimum expression values. The first argument is an expression, the second one is a limit on the number of items.
Examples
SELECT
TOP(key, 3),
BOTTOM(value, 3)
FROM my_table;
$top_factory = AggregationFactory("TOP", 3);
$bottom_factory = AggregationFactory("BOTTOM", 3);
SELECT
AGGREGATE_BY(key, $top_factory),
AGGREGATE_BY(value, $bottom_factory)
FROM my_table;
TOP_BY and BOTTOM_BY
Signature
TOP_BY(T1?, T2, limit:Uint32)->List<T1>
TOP_BY(T1, T2, limit:Uint32)->List<T1>
BOTTOM_BY(T1?, T2, limit:Uint32)->List<T1>
BOTTOM_BY(T1, T2, limit:Uint32)->List<T1>
Return the list of values of the first argument for strings with maximum/minimum values of the second argument. The third argument is the limit on the number of items in the list.
When the aggregate function factory is used, Tuple
of the value and key is passed as the first AGGREGATE_BY argument. In this case, the limit on the number of items is passed as the second argument when creating the factory.
Examples
SELECT
TOP_BY(value, LENGTH(value), 3),
BOTTOM_BY(value, key, 3)
FROM my_table;
$top_by_factory = AggregationFactory("TOP_BY", 3);
$bottom_by_factory = AggregationFactory("BOTTOM_BY", 3);
SELECT
AGGREGATE_BY(AsTuple(value, LENGTH(value)), $top_by_factory),
AGGREGATE_BY(AsTuple(value, key), $bottom_by_factory)
FROM my_table;
TOPFREQ and MODE
Signature
TOPFREQ(T [, num:Uint32 [, bufSize:Uint32]])->List<Struct<Frequency:Uint64, Value:T>>
MODE(T [, num:Uint32 [, bufSize:Uint32]])->List<Struct<Frequency:Uint64, Value:T>>
Getting an approximate list of the most frequent column values with an estimate of their number. Return a list of structures with two fields:
Value
: The found frequent value.Frequency
: Estimate of the number of mentions in the table.
Mandatory argument: the value itself.
Optional arguments:
- For
TOPFREQ
: The desired number of items in the result.MODE
is an alias toTOPFREQ
with 1 in this argument. ForTOPFREQ
, the default value is also 1. - The number of items in the used buffer, which enables you to trade off memory consumption for accuracy. The default value is 100.
Examples
SELECT
MODE(my_column),
TOPFREQ(my_column, 5, 1000)
FROM my_table;
STDDEV and VARIANCE
Signature
STDDEV(Double?)->Double?
STDDEV_POPULATION(Double?)->Double?
POPULATION_STDDEV(Double?)->Double?
STDDEV_SAMPLE(Double?)->Double?
STDDEVSAMP(Double?)->Double?
VARIANCE(Double?)->Double?
VARIANCE_POPULATION(Double?)->Double?
POPULATION_VARIANCE(Double?)->Double?
VARPOP(Double?)->Double?
VARIANCE_SAMPLE(Double?)->Double?
Standard variance and dispersion by column. A one-pass parallel algorithm is used the result of which may differ from that obtained by more common methods that require two passes over the data.
By default, sample dispersion and standard variance are calculated. Several write methods are available:
- With the
POPULATION
suffix/prefix, for example,VARIANCE_POPULATION
andPOPULATION_VARIANCE
: Calculates dispersion/standard variance for the general population. - With the
SAMPLE
suffix or without a suffix, for example,VARIANCE_SAMPLE
,SAMPLE_VARIANCE
, andVARIANCE
: Calculates sample dispersion and standard variance.
There are also several abbreviated aliases, for example, VARPOP
or STDDEVSAMP
.
If all passed values are NULL
, NULL
is returned.
Examples
SELECT
STDDEV(numeric_column),
VARIANCE(numeric_column)
FROM my_table;
CORRELATION and COVARIANCE
Signature
CORRELATION(Double?, Double?)->Double?
COVARIANCE(Double?, Double?)->Double?
COVARIANCE_SAMPLE(Double?, Double?)->Double?
COVARIANCE_POPULATION(Double?, Double?)->Double?
Correlation and covariance of two columns.
Abbreviated versions CORR
or COVAR
are also available. For covariance, versions with the SAMPLE
/POPULATION
suffix similar to VARIANCE described above are available.
Unlike the majority of other aggregate functions, they do not skip NULL
and count it as 0.
When the aggregate function factory is used, Tuple
of two values is passed as the first AGGREGATE_BY argument.
Examples
SELECT
CORRELATION(numeric_column, another_numeric_column),
COVARIANCE(numeric_column, another_numeric_column)
FROM my_table;
$corr_factory = AggregationFactory("CORRELATION");
SELECT
AGGREGATE_BY(AsTuple(numeric_column, another_numeric_column), $corr_factory)
FROM my_table;
PERCENTILE and MEDIAN
Signature
PERCENTILE(Double?, Double)->Double?
PERCENTILE(Interval?, Double)->Interval?
MEDIAN(Double? [, Double])->Double?
MEDIAN(Interval? [, Double])->Interval?
Calculating percentiles according to the amortized version of the TDigest algorithm. MEDIAN
: Alias for PERCENTILE(N, 0.5)
.
Limitation
The first argument (N) must be the name of the table column. If you need to bypass this limitation, you can use a subquery. The limitation is introduced to simplify computations, because several invocations with the same first argument (N) are merged into one pass in the implementation.
SELECT
MEDIAN(numeric_column),
PERCENTILE(numeric_column, 0.99)
FROM my_table;
HISTOGRAM
Signature
HISTOGRAM(Double?)->HistogramStruct?
HISTOGRAM(Double?, weight:Double)->HistogramStruct?
HISTOGRAM(Double?, intervals:Uint32)->HistogramStruct?
HISTOGRAM(Double?, weight:Double, intervals:Uint32)->HistogramStruct?
In the description of signatures, HistogramStruct means the result of an aggregate function that is a structure of a certain form.
Building an approximate histogram by a numerical expression with automatic selection of baskets.
Basic settings
The limit on the number of baskets can be set using the optional argument, the default value is 100. Note that the additional accuracy costs additional computing resources and may have a negative impact on the query execution time and in extreme cases — on its success.
Weight support
You can specify "weight" for each value involved in building the histogram. To do this, you must pass a weight calculation expression to the aggregate function as the second argument. The default weight value is always 1.0
. If non-standard weights are used, the limit on the number of baskets can be set by the third argument.
If two arguments are passed, the meaning of the second argument is determined by its type (integer literal — the limit on the number of baskets, otherwise — weight).
If you need an accurate histogram
- You can use the aggregate functions described below with fixed basket grids: LinearHistogram or LogarithmicHistogram.
- You can calculate the basket number for each row and perform GROUP BY on it.
When the aggregate function factory is used, Tuple of the value and weight is passed as the first AGGREGATE_BY
argument.
Examples
SELECT
HISTOGRAM(numeric_column)
FROM my_table;
SELECT
Histogram::Print(
HISTOGRAM(numeric_column, 10),
50
)
FROM my_table;
$hist_factory = AggregationFactory("HISTOGRAM");
SELECT
AGGREGATE_BY(AsTuple(numeric_column, 1.0), $hist_factory)
FROM my_table;
LinearHistogram, LogarithmicHistogram and LogHistogram
Building a histogram based on an explicitly specified fixed scale of baskets.
Signature
LinearHistogram(Double?)->HistogramStruct?
LinearHistogram(Double? [, binSize:Double [, min:Double [, max:Double]]])->HistogramStruct?
LogarithmicHistogram(Double?)->HistogramStruct?
LogarithmicHistogram(Double? [, logBase:Double [, min:Double [, max:Double]]])->HistogramStruct?
LogHistogram(Double?)->HistogramStruct?
LogHistogram(Double? [, logBase:Double [, min:Double [, max:Double]]])->HistogramStruct?
Arguments:
- An expression whose value is used to build a histogram. All subsequent ones are optional.
- Distance between baskets for
LinearHistogram
or a logarithm base forLogarithmicHistogram
/LogHistogram
(these are aliases). In both cases, the default value is 10. - Minimum value. The default value is minus infinity.
- Maximum value. The default value is plus infinity.
The result format is completely similar to adaptive histograms, which enables you to use the same set of auxiliary functions.
If the variation of input values is uncontrollably large, we recommend specifying a minimum and a maximum value to prevent potential drops due to high memory consumption.
Examples
SELECT
LogarithmicHistogram(numeric_column, 2)
FROM my_table;
CDF (cumulative distribution function)
You can assign a CDF suffix to each Histogram function type to build a cumulative distribution function. Constructions
SELECT
Histogram::ToCumulativeDistributionFunction(Histogram::Normalize(<function_type>Histogram(numeric_column)))
FROM my_table;
and
SELECT
<function_type>HistogramCDF(numeric_column)
FROM my_table;
are completely equivalent.
BOOL_AND, BOOL_OR and BOOL_XOR
Signature
BOOL_AND(Bool?)->Bool?
BOOL_OR(Bool?)->Bool?
BOOL_XOR(Bool?)->Bool?
Applying the appropriate logical operation (AND
/OR
/XOR
) to all values of a boolean column or expression.
These functions do not skip the NULL
value during aggregation, a single NULL
value will turn the result into NULL
. The MIN
/MAX
or BIT_AND
/BIT_OR
/BIT_XOR
functions can be used for aggregation with NULL
skips.
Examples
SELECT
BOOL_AND(bool_column),
BOOL_OR(bool_column),
BOOL_XOR(bool_column)
FROM my_table;
BIT_AND, BIT_OR, and BIT_XOR
Applying the appropriate bitwise operation to all values of a numeric column or expression.
Examples
SELECT
BIT_XOR(unsigned_numeric_value)
FROM my_table;
SessionStart
Without arguments. Only allowed if there is SessionWindow in
GROUP BY/PARTITION BY.
Returns the value of the SessionWindow
key column. In case of SessionWindow
with two arguments — the minimum value of the first argument within a group/partition.
In case of the extended variant of SessionWindoow
— the value of the second tuple item returned by <calculate_lambda>
when the first tuple item is True
.
AGGREGATE_BY and MULTI_AGGREGATE_BY
Applying the aggregate function factory to all column or expression values. The MULTI_AGGREGATE_BY
function requires a structure, tuple, or list in a column or expression value and applies the factory on an item-by-item basis, placing the result in a container of the same form. If different column or expression values contain lists of different lengths, the resulting list will have the smallest of the lengths of those lists.
- Column,
DISTINCT
column, or expression. - Factory.
Examples:
$count_factory = AggregationFactory("COUNT");
SELECT
AGGREGATE_BY(DISTINCT column, $count_factory) as uniq_count
FROM my_table;
SELECT
MULTI_AGGREGATE_BY(nums, AggregationFactory("count")) as count,
MULTI_AGGREGATE_BY(nums, AggregationFactory("min")) as min,
MULTI_AGGREGATE_BY(nums, AggregationFactory("max")) as max,
MULTI_AGGREGATE_BY(nums, AggregationFactory("avg")) as avg,
MULTI_AGGREGATE_BY(nums, AggregationFactory("percentile", 0.9)) as p90
FROM my_table;
UDAF
If the above aggregate functions were not enough for some reason, YQL has a mechanism for describing custom aggregate functions.