Best practices
- Each `DROP` call adds a step to the execution plan, so it is better to avoid multiple column drops. A single `SELECT` with the required set of columns is more efficient (see the first sketch after this list).
- The `LIMIT` operation has its own traits that affect performance. In Apache Spark it works as follows: a LocalLimit operation is executed on every partition, each partial result is sent to a shuffle, and then a GlobalLimit operation is executed on the shuffled data. When the `limit(x).collect()` or `take(x)` actions are used, the `LIMIT` operation executes iteratively: it starts with a single partition and scales the number of scanned partitions by the `spark.sql.limit.scaleUpFactor` parameter (4 by default) on every iteration until the requested number of rows is reached. Actions such as `count()` or `write.save("/output/path")` execute the LocalLimit operation on every partition, which causes a performance bottleneck for `LIMIT` on large datasets. There are two solutions for this issue (see the second sketch after this list):
  - for small values passed to `LIMIT`, when the result of `take(x)` fits into the driver's memory, one can use the following snippet: `limitedDf = spark.createDataFrame(spark.sparkContext.parallelize(df.take(10000)), df.schema)`;
  - as an alternative to `LIMIT`, a TABLESAMPLE operation can be used. Caution: `TABLESAMPLE (x ROWS)` is equivalent to `LIMIT x` and produces the same LocalLimit and GlobalLimit stages in the execution plan.
- To read data that uses a Hive-compatible partitioning scheme with directories in Cypress, the `recursiveFileLookup` option should be explicitly set to `false`: `spark.read.option("recursiveFileLookup", "false")` (see the last sketch after this list). In SPYT this option is set to `true` by default for unification with YQL and CHYT behaviours.
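A minimal sketch of the first item, assuming a local `SparkSession` and a hypothetical DataFrame with columns `a`, `b`, `c`, `tmp1`, `tmp2`: chained `drop()` calls add extra projection steps, while a single `select()` keeps only the required columns in one step.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input with two temporary columns we do not need downstream.
df = spark.createDataFrame(
    [(1, "x", 2.0, "t1", "t2")],
    ["a", "b", "c", "tmp1", "tmp2"],
)

# Less efficient: every drop() adds another step to the plan.
dropped = df.drop("tmp1").drop("tmp2")

# More efficient: a single projection with only the required columns.
selected = df.select("a", "b", "c")

selected.explain()  # inspect the resulting plan
```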
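The following sketch illustrates both `LIMIT` workarounds from the list above; the input DataFrame and the output path are hypothetical, and a `SparkSession` named `spark` is assumed.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "value")  # hypothetical input

# 1. Collect a small prefix to the driver with take(x) (this path is iterative and
#    avoids a full scan), then rebuild a DataFrame so later actions such as
#    write.save() do not re-run LocalLimit on every partition.
limitedDf = spark.createDataFrame(
    spark.sparkContext.parallelize(df.take(10000)), df.schema
)
limitedDf.write.mode("overwrite").parquet("/tmp/limited_output")  # hypothetical path

# 2. TABLESAMPLE as an alternative to LIMIT. Note that the ROWS form still produces
#    LocalLimit/GlobalLimit stages, so percentage sampling is preferable when an
#    exact row count is not required.
df.createOrReplaceTempView("source_table")
sampled = spark.sql("SELECT * FROM source_table TABLESAMPLE (1 PERCENT)")
sampled.explain()
```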
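And a minimal sketch of the last item. The directory path, the partition column name, and the use of the generic file reader are assumptions for illustration; the relevant part is the explicit `recursiveFileLookup` option.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical Cypress directory with Hive-style partition subdirectories,
# e.g. .../events/date=2024-01-01/... ; adjust the path and format to your setup.
events = (
    spark.read
    .option("recursiveFileLookup", "false")  # SPYT defaults to "true"; "false" enables partition discovery
    .parquet("yt:///home/project/events")
)

# With recursiveFileLookup disabled, the date=... directories are exposed as a
# partition column and can be used for partition pruning.
events.filter("date = '2024-01-01'").show()
```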