Latest Posts

The latest blog posts are published here. Look out for updates on my Twitter or Medium.

PySpark by Example

Here are my worked examples from the very useful LinkedIn Learning course PySpark by Example by Jonathan Fernandes: https://www.linkedin.com/learning/apache-pyspark-by-example


Over the past 12 months or so I have been learning and playing with Apache Spark. I went through the brilliant book by Bill Chambers and Matei Zaharia, Spark: The Definitive Guide, which covers Spark in depth and gives plenty of code snippets one can try out in the spark-shell. Whilst the book is indeed very detailed and provides great examples, the datasets it includes for hands-on practice are only on the order of megabytes (with the exception of the activity-data dataset used for the Streaming examples).

Continue reading

Calling Compiled Scala Code from Python using PySpark

Calling compiled Scala code inside the JVM from Python using PySpark


There is no doubt that Java and Scala are the de facto languages for Data Engineering, whilst Python is certainly the front runner as the language of choice for Data Scientists. Spark, a framework for distributed data analytics, is written in Scala but can also be used from Python, R and Java. Interoperability between Java and Scala is a no-brainer since Scala compiles down to Java bytecode. Calling Scala from Python is a little more involved, but the process is still very simple.

Continue reading

Mirroring Public Repositories on GitHub, Privately

This post walks through the steps involved if you want to fork a public GitHub repository privately. It shows how to keep an open public repository and mirror it in a private repository on GitHub.


These steps were inspired by the 'Mirroring a repository' guide in the GitHub documentation.

Continue reading