Best practices

  • Each DROP call adds a step to the query plan, so it is better to avoid multiple column drops; a single SELECT with the required set of columns is more efficient (see the first sketch after this list);
  • The LIMIT operation has its own traits that affect performance. In Apache Spark it works as follows: a LocalLimit operation is executed on every partition, each partition's result is sent to a shuffle, and then a GlobalLimit operation is executed on the shuffled data. When the limit(x).collect() or take(x) actions are used, the LIMIT operation executes iteratively: it starts with a single partition and multiplies the number of scanned partitions by the spark.sql.limit.scaleUpFactor parameter (4 by default) on each iteration until the requested number of rows is collected. Actions such as count() or write.save("/output/path") execute the LocalLimit operation on every partition, which causes a performance bottleneck for LIMIT on large datasets. There are two solutions for this issue (both are sketched after this list):
    • for small values passed to LIMIT, when the result of take(x) fits into the driver's memory, the following snippet can be used: limitedDf = spark.createDataFrame(spark.sparkContext.parallelize(df.take(10000)), df.schema);
    • a TABLESAMPLE operation can be used as an alternative to LIMIT. Caution: TABLESAMPLE (x ROWS) is equivalent to LIMIT x and produces the same LocalLimit and GlobalLimit stages in the execution plan.
  • To read data that uses a Hive-compatible directory partitioning scheme in Cypress, the recursiveFileLookup option must be explicitly set to false: spark.read.option("recursiveFileLookup", "false") (see the last sketch after this list). In SPYT this option is set to true by default for consistency with YQL and CHYT behaviour.
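
A minimal PySpark sketch of the column-pruning advice from the first item; the input path and column names here are hypothetical and only illustrate the difference between repeated drops and a single select:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    df = spark.read.parquet("/input/path")  # hypothetical input

    # Less efficient: every drop() adds another projection step to the plan
    trimmed = df.drop("col_a").drop("col_b").drop("col_c")

    # More efficient: a single select() that keeps only the required columns
    trimmed = df.select("id", "value", "ts")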
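
The two workarounds for the LIMIT bottleneck, sketched in PySpark; it is assumed that df is an already loaded DataFrame, and the row count, temporary view name, and output path are hypothetical:

    # Solution 1: for small limits, collect the rows with take(), which scans
    # partitions iteratively, and rebuild a DataFrame from the driver's side.
    rows = df.take(10000)                      # must fit into the driver's memory
    limitedDf = spark.createDataFrame(
        spark.sparkContext.parallelize(rows),  # redistribute the collected rows
        df.schema,                             # keep the original schema
    )
    limitedDf.write.save("/output/path")       # no LIMIT stages in this plan

    # Solution 2: percentage-based TABLESAMPLE instead of LIMIT.
    df.createOrReplaceTempView("events")
    sampled = spark.sql("SELECT * FROM events TABLESAMPLE (1 PERCENT)")
    # Caution: TABLESAMPLE (10000 ROWS) would behave like LIMIT 10000 and
    # produce the same LocalLimit/GlobalLimit stages in the execution plan.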
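
A sketch of reading a Hive-style partitioned directory layout from Cypress with the recursiveFileLookup option; the path, file format, and partition column are assumptions made for illustration:

    df = (
        spark.read
        .option("recursiveFileLookup", "false")       # SPYT defaults this to "true"
        .parquet("//home/project/partitioned_data")   # hypothetical Cypress directory
    )
    # With recursiveFileLookup=false Spark performs partition discovery, so
    # subdirectories like date=2024-01-01/ become a "date" partition column.
    df.where("date = '2024-01-01'").show()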