Best practices
- Each `DROP` call adds a step to the execution plan, so it is better to avoid multiple column drops. A single `SELECT` with the required set of columns is more efficient (see the example after this list);
- The `LIMIT` operation has its own traits that affect performance. In Apache Spark it works as follows: a LocalLimit operation is executed on every partition, each partial result is sent to a shuffle, and then a GlobalLimit operation is executed on the shuffled data. When the `limit(x).collect()` or `take(x)` actions are used, the `LIMIT` operation executes iteratively: it starts with a single partition and multiplies the number of scanned partitions by the `spark.sql.limit.scaleUpFactor` parameter (4 by default) on every iteration until the requested number of rows is collected. Actions such as `count()` or `write.save("/output/path")` execute the LocalLimit operation on every partition at once, which causes a performance bottleneck for `LIMIT` on large datasets. There are two solutions for this issue:
  - for small values passed to `LIMIT`, when the result of `take(x)` fits into the driver's memory, the following snippet can be used (a self-contained version is shown after this list): `limitedDf = spark.createDataFrame(spark.sparkContext.parallelize(df.take(10000)), df.schema)`;
  - as an alternative to `LIMIT`, a TABLESAMPLE operation can be used (see the sketch after this list). Caution: `TABLESAMPLE (x ROWS)` is equivalent to a `LIMIT x` operation and produces the same LocalLimit and GlobalLimit stages in the execution plan.
- To read data that uses a Hive-compatible partitioning scheme of directories in Cypress, the `recursiveFileLookup` option should be explicitly set to `false`: `spark.read.option("recursiveFileLookup", "false")` (see the example after this list). In SPYT it is set to `true` by default for consistency with YQL and CHYT behaviour.
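
A minimal sketch of the column-pruning recommendation; the DataFrame contents and column names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(1, "a", 10.0, True), (2, "b", 20.0, False)],
    ["id", "name", "score", "flag"],
)

# Less efficient: every drop() adds another projection step to the plan.
trimmed = df.drop("score").drop("flag")

# More efficient: one select() with only the required columns.
trimmed = df.select("id", "name")
```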
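
A self-contained version of the `take(x)` workaround, assuming the limited result fits into the driver's memory; the input path and row count are placeholders:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# `df` stands for any large DataFrame loaded earlier; the path is a placeholder.
df = spark.read.parquet("/path/to/large/dataset")

# take(10000) runs LIMIT iteratively (scaling the number of scanned partitions
# by spark.sql.limit.scaleUpFactor) and brings at most 10000 rows to the driver;
# parallelize() then turns them back into a distributed DataFrame so the rest
# of the pipeline keeps running on the cluster.
limitedDf = spark.createDataFrame(spark.sparkContext.parallelize(df.take(10000)), df.schema)
```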
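
A sketch of the TABLESAMPLE alternative, assuming the input is registered as a temporary view; the view name and sampling fraction are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/path/to/large/dataset")  # placeholder path
df.createOrReplaceTempView("events")               # illustrative view name

# Percentage-based TABLESAMPLE draws a random subset of the data and does not
# go through the LocalLimit/GlobalLimit stages that LIMIT would add to the plan.
sampled = spark.sql("SELECT * FROM events TABLESAMPLE (1 PERCENT)")

# Caution from the note above: the ROWS form is just LIMIT in disguise.
# limited = spark.sql("SELECT * FROM events TABLESAMPLE (10000 ROWS)")
```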
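
A sketch of reading a directory tree with Hive-compatible partitioning in SPYT; the Cypress path, partition column, and the `spark_session()` / `.yt()` entry points are assumptions about the SPYT Python API rather than part of the text above:

```python
from spyt import spark_session  # SPYT entry point (assumed)

# Hypothetical Cypress layout with Hive-compatible partition directories:
#   //home/project/events/date=2024-01-01/...
#   //home/project/events/date=2024-01-02/...
with spark_session() as spark:
    df = (
        spark.read
        .option("recursiveFileLookup", "false")  # treat directories as partitions
        .yt("//home/project/events")             # placeholder path; .yt() is the SPYT reader
    )
    df.printSchema()  # the partition column (date) should appear in the schema
```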