Apache Spark is a fast, general-purpose cluster computing system. First, download it. Spark ships with an interactive shell; the Python version is bin/pyspark. Start it by running ./bin/pyspark from the Spark directory.
The gennum files contain show names and their viewer counts; the genchan files contain show names and the channel each show airs on. We want to find the total number of viewers across all shows for each channel.
Assuming you already have these 6 files, copy them into your Spark directory.
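The computation described above is a join on show name followed by a sum per channel. Here is a minimal sketch of it, under some assumptions that the text does not confirm: each line is taken to look like "show,value" (a viewer count in gennum files, a channel name in genchan files), and the function names and file patterns are hypothetical. The first function is a plain-Python reference for checking results on small samples; the second expresses the same logic with RDD operations and expects the SparkContext that the pyspark shell provides as sc.

```python
from collections import defaultdict


def parse_pair(line):
    """Split an assumed 'show,value' line into a (show, value) tuple."""
    show, value = line.strip().split(",", 1)
    return show.strip(), value.strip()


def viewers_per_channel_local(gennum_lines, genchan_lines):
    """Plain-Python reference: total viewers per channel (inner join on show)."""
    # Sum viewer counts per show across all gennum lines.
    show_total = defaultdict(int)
    for line in gennum_lines:
        if line.strip():
            show, count = parse_pair(line)
            show_total[show] += int(count)

    # Credit each show's total to every channel entry listed for that show.
    channel_total = defaultdict(int)
    for line in genchan_lines:
        if line.strip():
            show, channel = parse_pair(line)
            if show in show_total:  # shows without viewer data are dropped
                channel_total[channel] += show_total[show]
    return dict(channel_total)


def viewers_per_channel_spark(sc, gennum_path, genchan_path):
    """Same logic as an RDD pipeline; sc is an existing SparkContext."""
    # (show, viewers) pairs
    nums = (sc.textFile(gennum_path)
              .filter(lambda l: l.strip())
              .map(parse_pair)
              .mapValues(int))
    # (show, channel) pairs
    chans = (sc.textFile(genchan_path)
               .filter(lambda l: l.strip())
               .map(parse_pair))
    # join gives (show, (channel, viewers)); keep (channel, viewers) and sum.
    return (chans.join(nums)
                 .map(lambda kv: kv[1])
                 .reduceByKey(lambda a, b: a + b))
```

Inside the pyspark shell, something like viewers_per_channel_spark(sc, "gennum*", "genchan*").collect() would run the job, provided the files really match those (hypothetical) patterns and the assumed line format. Comparing its output with viewers_per_channel_local on a few sample lines is a cheap way to check the join before running it on all the files.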