- COUNT
- MIN and MAX
- SUM
- AVG
- COUNT_IF
- SUM_IF and AVG_IF
- SOME
- CountDistinctEstimate, HyperLogLog and HLL
- AGGREGATE_LIST
- MAX_BY and MIN_BY
- TOP and BOTTOM
- TOP_BY and BOTTOM_BY
- TOPFREQ and MODE
- STDDEV and VARIANCE
- CORRELATION and COVARIANCE
- PERCENTILE and MEDIAN
- HISTOGRAM
- LinearHistogram, LogarithmicHistogram and LogHistogram
- CDF (cumulative distribution function)
- BOOL_AND, BOOL_OR and BOOL_XOR
- BIT_AND, BIT_OR, and BIT_XOR
- SessionStart
- AGGREGATE_BY and MULTI_AGGREGATE_BY
- UDAF

## COUNT

**Signature**

```
COUNT(*)->Uint64
COUNT(T)->Uint64
COUNT(T?)->Uint64
```

Counting the number of rows in the table (if `*`

or constant is specified as the argument) or non-empty values in a table column (if the column name is specified as an argument).

Like other aggregate functions, it can be combined with GROUP BY to get statistics on the parts of the table that correspond to the values in the columns being grouped. Use the DISTINCT modifier to count the number of unique values.

**Examples**

```
SELECT COUNT(*) FROM my_table;
```

```
SELECT key, COUNT(value) FROM my_table GROUP BY key;
```

```
SELECT COUNT(DISTINCT value) FROM my_table;
```

## MIN and MAX

**Signature**

```
MIN(T?)->T?
MIN(T)->T?
MAX(T?)->T?
MAX(T)->T?
```

Minimum or maximum value.

As an argument, you may use an arbitrary computable expression with a comparable result.

**Examples**

```
SELECT MIN(value), MAX(value) FROM my_table;
```

## SUM

**Signature**

```
SUM(Unsigned?)->Uint64?
SUM(Signed?)->Int64?
SUM(Interval?)->Interval?
SUM(Decimal(N, M)?)->Decimal(35, M)?
```

Sum of the numbers.

As an argument, you may use an arbitrary computable expression with a numeric result or `Interval`

type.

Integers are automatically expanded to 64 bits to reduce the risk of overflow.

```
SELECT SUM(value) FROM my_table;
```

## AVG

**Signature**

```
AVG(Double?)->Double?
AVG(Interval?)->Interval?
AVG(Decimal(N, M)?)->Decimal(N, M)?
```

Arithmetic average.

As an argument, you may use an arbitrary computable expression with a numeric result or `Interval`

type.

Integer values and time intervals are automatically converted to Double.

**Examples**

```
SELECT AVG(value) FROM my_table;
```

## COUNT_IF

**Signature**

```
COUNT_IF(Bool?)->Uint64?
```

Number of rows for which the expression specified as the argument is true (the expression's calculation result is true).

The value `NULL`

is equated to `false`

(if the argument type is `Bool?`

).

The function *does not* do the implicit type casting to Boolean for strings and numbers.

**Examples**

```
SELECT
COUNT_IF(value % 2 == 1) AS odd_count
```

Note

If you need to count the number of unique values on the rows that meet the condition, then unlike other aggregate functions, you can't use the DISTINCT modifier, since the arguments contain no values. To get this result, use, inside a subquery, the built-in `IF`

function with two arguments (to get `NULL`

in ELSE), and apply an outer COUNT(DISTINCT ...) to its results.

## SUM_IF and AVG_IF

**Signature**

```
SUM_IF(Unsigned?, Bool?)->Uint64?
SUM_IF(Signed?, Bool?)->Int64?
SUM_IF(Interval?, Bool?)->Interval?
AVG_IF(Double?, Bool?)->Double?
```

Sum or arithmetic average, but only for the rows that satisfy the condition passed by the second argument.

Therefore, `SUM_IF(value, condition)`

is a slightly shorter notation for `SUM(IF(condition, value))`

, same for `AVG`

. The argument's data type expansion is similar to the same-name functions without a suffix.

**Examples**

```
SELECT
SUM_IF(value, value % 2 == 1) AS odd_sum,
AVG_IF(value, value % 2 == 1) AS odd_avg,
FROM my_table;
```

When the aggregate function factory is used, Tuple of the value and predicate is passed as the first `AGGREGATE_BY`

argument.

**Examples**

```
$sum_if_factory = AggregationFactory("SUM_IF");
$avg_if_factory = AggregationFactory("AVG_IF");
SELECT
AGGREGATE_BY(AsTuple(value, value % 2 == 1), $sum_if_factory) AS odd_sum,
AGGREGATE_BY(AsTuple(value, value % 2 == 1), $avg_if_factory) AS odd_avg
FROM my_table;
```

## SOME

**Signature**

```
SOME(T?)->T?
SOME(T)->T?
```

Get the value for an expression specified as an argument, for one of the table rows. Gives no guarantee of which row is used. This function is a counterpart for any() in ClickHouse.

Because of no guarantee, `SOME`

is computationally cheaper than MIN/MAX (that are often used in similar situations).

**Examples**

```
SELECT
SOME(value)
FROM my_table;
```

Alert

When the aggregate function `SOME`

is called multiple times, it's **not** guaranteed that all the resulting values are taken from the same row of the source table. To get this guarantee, pack the values into any container and pass it to `SOME`

. For example, for a structure you can do this using `AsStruct`

.

## CountDistinctEstimate, HyperLogLog and HLL

**Signature**

```
CountDistinctEstimate(T)->Uint64?
HyperLogLog(T)->Uint64?
HLL(T)->Uint64?
```

Approximate estimate of the number of unique values using the HyperLogLog algorithm. Logically does the same as COUNT(DISTINCT ...), but works much faster and with a small margin of error.

Arguments:

- Value for estimation.
- Accuracy (from 4 to 18 inclusive, the default value is 14).

The choice of accuracy enables you to trade off the additional consumption of computing resources and RAM for error reduction.

All three functions are currently aliases, but `CountDistinctEstimate`

may start using a different algorithm in the future.

**Examples**

```
SELECT
CountDistinctEstimate(my_column)
FROM my_table;
```

```
SELECT
HyperLogLog(my_column, 4)
FROM my_table;
```

## AGGREGATE_LIST

**Signature**

```
AGGREGATE_LIST(T? [, limit:Uint64])->List<T>
AGGREGATE_LIST(T [, limit:Uint64])->List<T>
AGGREGATE_LIST_DISTINCT(T? [, limit:Uint64])->List<T>
AGGREGATE_LIST_DISTINCT(T [, limit:Uint64])->List<T>
```

Get all column values as a list. In combination with `DISTINCT`

, returns only unique values. The optional second parameter sets the maximum number of obtained values.

If you know in advance that there are not many unique values, better use the `AGGREGATE_LIST_DISTINCT`

aggregate function that builds the same result in memory (which may not be enough for a large number of unique values).

The order of elements in the resulting list depends on the implementation and is not set externally. To get an ordered list, sort the result, for example, using ListSort.

To get a list of several values from one string, *DO NOT* use the `AGGREGATE_LIST`

function several times, but place all the desired values in a container, for example, via AsList or AsTuple and pass this container in a single `AGGREGATE_LIST`

invocation.

For example, you can use it in combination with `DISTINCT`

and the String::JoinFromList function (analog of `','.join(list)`

from Python) to print to a string all the values that were seen in the column after applying GROUP BY.

**Examples**

```
SELECT
AGGREGATE_LIST( region ),
AGGREGATE_LIST( region, 5 ),
AGGREGATE_LIST( DISTINCT region ),
AGGREGATE_LIST_DISTINCT( region ),
AGGREGATE_LIST_DISTINCT( region, 5 )
FROM users
```

```
-- Analog of GROUP_CONCAT from MySQL
SELECT
String::JoinFromList(CAST(AGGREGATE_LIST(region, 2) AS List<String>), ",")
FROM users
```

There is also a short form of these functions: `AGG_LIST`

and `AGG_LIST_DISTINCT`

.

Alert

Executed **NOT** in a lazy way, so when you use it, you need to make sure that you get a list of reasonable size, around a thousand items. To be on the safe side, you can use a second optional numeric argument that includes a limit on the number of items in the list.

If the number of items in the list is exceeded, the `Memory limit exceeded`

error is returned.

## MAX_BY and MIN_BY

**Signature**

```
MAX_BY(T1?, T2)->T1?
MAX_BY(T1, T2)->T1?
MAX_BY(T1, T2, limit:Uint64)->List<T1>?
MIN_BY(T1?, T2)->T1?
MIN_BY(T1, T2)->T1?
MIN_BY(T1, T2, limit:Uint64)->List<T1>?
```

Return the value of the first argument for the table row in which the second argument was minimum/maximum.

You can optionally specify the third argument N which affects the behavior if the table has multiple rows with the same minimum or maximum value:

- If N is not specified, the value of one of the rows will be returned and the rest are discarded.
- If N is specified, the list with all values will be returned, but no more than N, all values are discarded after reaching the specified number.

When choosing the N value, do not exceed hundreds or thousands to avoid problems with the limited available memory on YTsaurus clusters.

If the job necessarily needs all values and their number can be measured in tens of thousands or more, then instead of these aggregate functions use `JOIN`

of the original table with a subquery where `GROUP BY + MIN/MAX`

is performed on the columns you are interested in.

Attention!

If the second argument is always NULL, the aggregation result is NULL.

When the aggregate function factory is used, `Tuple`

of the value and key is passed as the first AGGREGATE_BY argument.

**Examples**

```
SELECT
MIN_BY(value, LENGTH(value)),
MAX_BY(value, key, 100)
FROM my_table;
```

```
$min_by_factory = AggregationFactory("MIN_BY");
$max_by_factory = AggregationFactory("MAX_BY", 100);
SELECT
AGGREGATE_BY(AsTuple(value, LENGTH(value)), $min_by_factory),
AGGREGATE_BY(AsTuple(value, key), $max_by_factory)
FROM my_table;
```

## TOP and BOTTOM

**Signature**

```
TOP(T?, limit:Uint32)->List<T>
TOP(T, limit:Uint32)->List<T>
BOTTOM(T?, limit:Uint32)->List<T>
BOTTOM(T, limit:Uint32)->List<T>
```

Return the list of maximum/minimum expression values. The first argument is an expression, the second one is a limit on the number of items.

**Examples**

```
SELECT
TOP(key, 3),
BOTTOM(value, 3)
FROM my_table;
```

```
$top_factory = AggregationFactory("TOP", 3);
$bottom_factory = AggregationFactory("BOTTOM", 3);
SELECT
AGGREGATE_BY(key, $top_factory),
AGGREGATE_BY(value, $bottom_factory)
FROM my_table;
```

## TOP_BY and BOTTOM_BY

**Signature**

```
TOP_BY(T1?, T2, limit:Uint32)->List<T1>
TOP_BY(T1, T2, limit:Uint32)->List<T1>
BOTTOM_BY(T1?, T2, limit:Uint32)->List<T1>
BOTTOM_BY(T1, T2, limit:Uint32)->List<T1>
```

Return the list of values of the first argument for strings with maximum/minimum values of the second argument. The third argument is the limit on the number of items in the list.

When the aggregate function factory is used, `Tuple`

of the value and key is passed as the first AGGREGATE_BY argument. In this case, the limit on the number of items is passed as the second argument when creating the factory.

**Examples**

```
SELECT
TOP_BY(value, LENGTH(value), 3),
BOTTOM_BY(value, key, 3)
FROM my_table;
```

```
$top_by_factory = AggregationFactory("TOP_BY", 3);
$bottom_by_factory = AggregationFactory("BOTTOM_BY", 3);
SELECT
AGGREGATE_BY(AsTuple(value, LENGTH(value)), $top_by_factory),
AGGREGATE_BY(AsTuple(value, key), $bottom_by_factory)
FROM my_table;
```

## TOPFREQ and MODE

**Signature**

```
TOPFREQ(T [, num:Uint32 [, bufSize:Uint32]])->List<Struct<Frequency:Uint64, Value:T>>
MODE(T [, num:Uint32 [, bufSize:Uint32]])->List<Struct<Frequency:Uint64, Value:T>>
```

Getting an approximate list of the most frequent column values with an estimate of their number. Return a list of structures with two fields:

`Value`

: The found frequent value.`Frequency`

: Estimate of the number of mentions in the table.

Mandatory argument: the value itself.

Optional arguments:

- For
`TOPFREQ`

: The desired number of items in the result.`MODE`

is an alias to`TOPFREQ`

with 1 in this argument. For`TOPFREQ`

, the default value is also 1. - The number of items in the used buffer, which enables you to trade off memory consumption for accuracy. The default value is 100.

**Examples**

```
SELECT
MODE(my_column),
TOPFREQ(my_column, 5, 1000)
FROM my_table;
```

## STDDEV and VARIANCE

**Signature**

```
STDDEV(Double?)->Double?
STDDEV_POPULATION(Double?)->Double?
POPULATION_STDDEV(Double?)->Double?
STDDEV_SAMPLE(Double?)->Double?
STDDEVSAMP(Double?)->Double?
VARIANCE(Double?)->Double?
VARIANCE_POPULATION(Double?)->Double?
POPULATION_VARIANCE(Double?)->Double?
VARPOP(Double?)->Double?
VARIANCE_SAMPLE(Double?)->Double?
```

Standard variance and dispersion by column. A one-pass parallel algorithm is used the result of which may differ from that obtained by more common methods that require two passes over the data.

By default, sample dispersion and standard variance are calculated. Several write methods are available:

- With the
`POPULATION`

suffix/prefix, for example,`VARIANCE_POPULATION`

and`POPULATION_VARIANCE`

: Calculates dispersion/standard variance for the general population. - With the
`SAMPLE`

suffix or without a suffix, for example,`VARIANCE_SAMPLE`

,`SAMPLE_VARIANCE`

, and`VARIANCE`

: Calculates sample dispersion and standard variance.

There are also several abbreviated aliases, for example, `VARPOP`

or `STDDEVSAMP`

.

If all passed values are `NULL`

, `NULL`

is returned.

**Examples**

```
SELECT
STDDEV(numeric_column),
VARIANCE(numeric_column)
FROM my_table;
```

## CORRELATION and COVARIANCE

**Signature**

```
CORRELATION(Double?, Double?)->Double?
COVARIANCE(Double?, Double?)->Double?
COVARIANCE_SAMPLE(Double?, Double?)->Double?
COVARIANCE_POPULATION(Double?, Double?)->Double?
```

Correlation and covariance of two columns.

Abbreviated versions `CORR`

or `COVAR`

are also available. For covariance, versions with the `SAMPLE`

/`POPULATION`

suffix similar to VARIANCE described above are available.

Unlike the majority of other aggregate functions, they do not skip `NULL`

and count it as 0.

When the aggregate function factory is used, `Tuple`

of two values is passed as the first AGGREGATE_BY argument.

**Examples**

```
SELECT
CORRELATION(numeric_column, another_numeric_column),
COVARIANCE(numeric_column, another_numeric_column)
FROM my_table;
```

```
$corr_factory = AggregationFactory("CORRELATION");
SELECT
AGGREGATE_BY(AsTuple(numeric_column, another_numeric_column), $corr_factory)
FROM my_table;
```

## PERCENTILE and MEDIAN

**Signature**

```
PERCENTILE(Double?, Double)->Double?
PERCENTILE(Interval?, Double)->Interval?
MEDIAN(Double? [, Double])->Double?
MEDIAN(Interval? [, Double])->Interval?
```

Calculating percentiles according to the amortized version of the TDigest algorithm. `MEDIAN`

: Alias for `PERCENTILE(N, 0.5)`

.

Limitation

The first argument (N) must be the name of the table column. If you need to bypass this limitation, you can use a subquery. The limitation is introduced to simplify computations, because several invocations with the same first argument (N) are merged into one pass in the implementation.

```
SELECT
MEDIAN(numeric_column),
PERCENTILE(numeric_column, 0.99)
FROM my_table;
```

## HISTOGRAM

**Signature**

```
HISTOGRAM(Double?)->HistogramStruct?
HISTOGRAM(Double?, weight:Double)->HistogramStruct?
HISTOGRAM(Double?, intervals:Uint32)->HistogramStruct?
HISTOGRAM(Double?, weight:Double, intervals:Uint32)->HistogramStruct?
```

In the description of signatures, HistogramStruct means the result of an aggregate function that is a structure of a certain form.

Building an approximate histogram by a numerical expression with automatic selection of baskets.

### Basic settings

The limit on the number of baskets can be set using the optional argument, the default value is 100. Note that the additional accuracy costs additional computing resources and may have a negative impact on the query execution time and in extreme cases — on its success.

### Weight support

You can specify "weight" for each value involved in building the histogram. To do this, you must pass a weight calculation expression to the aggregate function as the second argument. The default weight value is always `1.0`

. If non-standard weights are used, the limit on the number of baskets can be set by the third argument.

If two arguments are passed, the meaning of the second argument is determined by its type (integer literal — the limit on the number of baskets, otherwise — weight).

### If you need an accurate histogram

- You can use the aggregate functions described below with fixed basket grids: LinearHistogram or LogarithmicHistogram.
- You can calculate the basket number for each row and perform GROUP BY on it.

When the aggregate function factory is used, Tuple of the value and weight is passed as the first `AGGREGATE_BY`

argument.

**Examples**

```
SELECT
HISTOGRAM(numeric_column)
FROM my_table;
```

```
SELECT
Histogram::Print(
HISTOGRAM(numeric_column, 10),
50
)
FROM my_table;
```

```
$hist_factory = AggregationFactory("HISTOGRAM");
SELECT
AGGREGATE_BY(AsTuple(numeric_column, 1.0), $hist_factory)
FROM my_table;
```

## LinearHistogram, LogarithmicHistogram and LogHistogram

Building a histogram based on an explicitly specified fixed scale of baskets.

**Signature**

```
LinearHistogram(Double?)->HistogramStruct?
LinearHistogram(Double? [, binSize:Double [, min:Double [, max:Double]]])->HistogramStruct?
LogarithmicHistogram(Double?)->HistogramStruct?
LogarithmicHistogram(Double? [, logBase:Double [, min:Double [, max:Double]]])->HistogramStruct?
LogHistogram(Double?)->HistogramStruct?
LogHistogram(Double? [, logBase:Double [, min:Double [, max:Double]]])->HistogramStruct?
```

Arguments:

- An expression whose value is used to build a histogram. All subsequent ones are optional.
- Distance between baskets for
`LinearHistogram`

or a logarithm base for`LogarithmicHistogram`

/`LogHistogram`

(these are aliases). In both cases, the default value is 10. - Minimum value. The default value is minus infinity.
- Maximum value. The default value is plus infinity.

The result format is completely similar to adaptive histograms, which enables you to use the same set of auxiliary functions.

If the variation of input values is uncontrollably large, we recommend specifying a minimum and a maximum value to prevent potential drops due to high memory consumption.

**Examples**

```
SELECT
LogarithmicHistogram(numeric_column, 2)
FROM my_table;
```

## CDF (cumulative distribution function)

You can assign a CDF suffix to each Histogram function type to build a cumulative distribution function. Constructions

```
SELECT
Histogram::ToCumulativeDistributionFunction(Histogram::Normalize(<function_type>Histogram(numeric_column)))
FROM my_table;
```

and

```
SELECT
<function_type>HistogramCDF(numeric_column)
FROM my_table;
```

are completely equivalent.

## BOOL_AND, BOOL_OR and BOOL_XOR

**Signature**

```
BOOL_AND(Bool?)->Bool?
BOOL_OR(Bool?)->Bool?
BOOL_XOR(Bool?)->Bool?
```

Applying the appropriate logical operation (`AND`

/`OR`

/`XOR`

) to all values of a boolean column or expression.

These functions **do not skip** the `NULL`

value during aggregation, a single `NULL`

value will turn the result into `NULL`

. The `MIN`

/`MAX`

or `BIT_AND`

/`BIT_OR`

/`BIT_XOR`

functions can be used for aggregation with `NULL`

skips.

**Examples**

```
SELECT
BOOL_AND(bool_column),
BOOL_OR(bool_column),
BOOL_XOR(bool_column)
FROM my_table;
```

## BIT_AND, BIT_OR, and BIT_XOR

Applying the appropriate bitwise operation to all values of a numeric column or expression.

**Examples**

```
SELECT
BIT_XOR(unsigned_numeric_value)
FROM my_table;
```

## SessionStart

Without arguments. Only allowed if there is SessionWindow in

GROUP BY/PARTITION BY.

Returns the value of the `SessionWindow`

key column. In case of `SessionWindow`

with two arguments — the minimum value of the first argument within a group/partition.

In case of the extended variant of `SessionWindoow`

— the value of the second tuple item returned by `<calculate_lambda>`

when the first tuple item is `True`

.

## AGGREGATE_BY and MULTI_AGGREGATE_BY

Applying the aggregate function factory to all column or expression values. The `MULTI_AGGREGATE_BY`

function requires a structure, tuple, or list in a column or expression value and applies the factory on an item-by-item basis, placing the result in a container of the same form. If different column or expression values contain lists of different lengths, the resulting list will have the smallest of the lengths of those lists.

- Column,
`DISTINCT`

column, or expression. - Factory.

**Examples:**

```
$count_factory = AggregationFactory("COUNT");
SELECT
AGGREGATE_BY(DISTINCT column, $count_factory) as uniq_count
FROM my_table;
SELECT
MULTI_AGGREGATE_BY(nums, AggregationFactory("count")) as count,
MULTI_AGGREGATE_BY(nums, AggregationFactory("min")) as min,
MULTI_AGGREGATE_BY(nums, AggregationFactory("max")) as max,
MULTI_AGGREGATE_BY(nums, AggregationFactory("avg")) as avg,
MULTI_AGGREGATE_BY(nums, AggregationFactory("percentile", 0.9)) as p90
FROM my_table;
```

## UDAF

If the above aggregate functions were not enough for some reason, YQL has a mechanism for describing custom aggregate functions.