Latest Posts¶
Latest blog posts are published here. Look out for updates on my Twitter or Medium.
PySpark by Example¶
Here are my worked examples from the very useful LinkedIn Learning course PySpark by Example by Jonathan Fernandes: https://www.linkedin.com/learning/apache-pyspark-by-example
Over the past 12 months or so I have been learning and playing with Apache Spark. I went through the brilliant book by Bill Chambers and Matei Zaharia, Spark: The Definitive Guide, which covers Spark in depth and gives plenty of code snippets one can try out in the spark-shell. Whilst the book is indeed very detailed and provides great examples, the datasets included for you to get your hands on are only on the order of megabytes (with the exception of the activity-data dataset used for the Streaming examples).
Calling Compiled Scala Code from Python using PySpark¶
Calling compiled Scala code inside the JVM from Python using PySpark
There is no doubt that Java and Scala are the de facto languages for Data Engineering, whilst Python is certainly the front runner as the language of choice for Data Scientists. Spark, a framework for distributed data analytics, is written in Scala but can also be used from Python, R and Java. Interoperability between Java and Scala is a no-brainer, since Scala compiles down to Java bytecode. Calling Scala from Python is a little more involved, but the process is still very simple.
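As a rough illustration of the idea, here is a minimal sketch of one common way to reach compiled Scala code from Python: going through the Py4J gateway that PySpark keeps on the driver. This is not necessarily the approach the post itself takes. The Scala object com.example.Greeter, its greet method and the jar path target/greeter.jar are all hypothetical names used for illustration, and sparkContext._jvm is a private PySpark attribute rather than a documented public API.

```python
from functools import reduce


def resolve_jvm_member(jvm, dotted_path):
    """Walk a dotted path such as 'com.example.Greeter' on a Py4J JVM view."""
    return reduce(getattr, dotted_path.split("."), jvm)


def call_scala(spark, class_path, method, *args):
    """Call a method exposed by a compiled Scala object through the driver JVM."""
    # _jvm is a private attribute that exposes the Py4J view of the driver JVM.
    jvm = spark.sparkContext._jvm
    target = resolve_jvm_member(jvm, class_path)
    return getattr(target, method)(*args)


def demo():
    """Needs pyspark installed and a jar containing the hypothetical Scala object."""
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .config("spark.jars", "target/greeter.jar")  # hypothetical jar path
        .getOrCreate()
    )
    try:
        # com.example.Greeter.greet is an assumed Scala method, not a real library.
        print(call_scala(spark, "com.example.Greeter", "greet", "world"))
    finally:
        spark.stop()
```

Calling demo() only works once the jar is on the driver's classpath. The two helpers themselves just perform attribute lookups, so they can be tried against any object that mimics the same package layout.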
Mirroring Public Repositories on Github, Privately¶
This post walks through the steps involved if you want to fork a public GitHub repository privately. It shows how to keep an open public repository and mirror it in a private repository on GitHub.
These steps were inspired by the 'Mirroring a repository' guide in the GitHub documentation.
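For orientation, the core of a mirror is a bare clone of the public repository followed by a mirror push to the private one. The sketch below wraps those two git commands in Python. It is an assumption about the general shape of the workflow rather than the exact steps from the post, and the function names, the example URLs and the mirror.git working directory are hypothetical.

```python
import subprocess


def mirror_commands(public_url, private_url, workdir="mirror.git"):
    """Build the git commands for a bare clone followed by a mirror push."""
    return [
        # A bare clone copies every ref without creating a working tree.
        ["git", "clone", "--bare", public_url, workdir],
        # A mirror push makes the private remote match all local refs.
        ["git", "-C", workdir, "push", "--mirror", private_url],
    ]


def run_mirror(public_url, private_url, workdir="mirror.git"):
    """Run the commands in order, stopping at the first failure."""
    for cmd in mirror_commands(public_url, private_url, workdir):
        subprocess.run(cmd, check=True)
```

For example, run_mirror("https://github.com/EXAMPLE-USER/public-repo.git", "https://github.com/EXAMPLE-USER/private-repo.git") would need git installed, an existing empty private repository and push access to it. Keeping command construction separate from execution makes it easy to print and review the commands before running them.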