Tag: Spark

  • Spark Join Optimisation

    Last time, we discussed how to find the bottleneck using the Spark UI. In practice, though, many performance issues in our daily Spark jobs are related to data skew. There are many situations where data skew occurs: The first problem is relatively easy to solve. A common and straightforward fix is to repartition the data directly. The…

    Continue Reading →

  • Spark Executor Memory Introduction

    Overview: Hello everyone, today I would like to introduce a problem that we often encounter in our work: “Container killed by YARN for exceeding memory limits”. What causes this problem? How does it differ from an OOM error? How does it relate to the memory structure of the Spark executor? Today, let’s…

    Continue Reading →

  • Spark Basic Tuning Guide

    How does Spark work? Nowadays Python is becoming more and more popular among data scientists, and PySpark has rightfully become one of their most popular tools. Today let’s use a very simple PySpark sample to help us understand more about Spark. ## Architecture First of all, Spark’s architecture is based on the most typical…

    Continue Reading →