### Introduction

Today we often face a certain kind of problem called dynamic connectivity:
Given a set of N objects.

• Union command: connect two objects.
• Find/connected query: is there a path connecting the two objects?

So there are 2 types of operation in this:

• Find query. Check if two objects are in the same component.
• Union command. Replace components containing two objects with their union.

A neural network is put together by hooking together many of our simple “neurons,” so that the output of a neuron can be the input of another. For example, here is a small neural network:

Knowing how to write Objective-C code is far from a perfect software engineer, it is also of significance to take a deeper look at how it works and why it works. Recently I am reading the Book Effective Objective-C 2.0, which has inspired me a lot. So I decided to write this blog to summerize what I have learnt.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represents classification rules.

k-Nearest Neighbors (kNN) is a machine-learning algorithm. It works like this: we have an existing set of example data, our training set. We have labels for all of this data—we know what class each piece of the data should fall into. When we’re given a new piece of data without a label, we compare that new piece of data to the existing data, every piece of existing data. We then take the most similar pieces of data (the nearest neighbors) and look at their labels. We look at the top k most similar pieces of data from our known dataset; this is where the k comes from. (k is an integer and it’s usually less than 20.) Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify.

In statistics, logistic regression, or logit regression, or logit model is a regression model where the dependent variable (DV) is categorical. So it is a classification algorithm.

Logistic regression:
$$0 \le h_𝜃(x) \le 1$$

$$h_𝜃(x) = g(𝜃^Tx)$$

$$g(z) = \frac{1}{1+e^{-z}}$$

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

For multivariate linear regression, given the following function, our job is to find the right 𝜃.

$$h_𝜃(x) = 𝜃_0+𝜃_1x_1+𝜃_2x_2+…+𝜃_nx_n$$

For convenience of notation, define $x_0 = 1$.

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Apache Spark is a fast and general-purpose cluster computing system. First we need to download it. We can launch Spark’s interactive shell – bin/pyspark for the Python one. Start it by running the following in the Spark directory:

gennum files contain show names and their viewers, genchan files contain show names and their channel. We want to find out the total number of viewer across all shows for each channel.

Recently I am taking the Big Data Specialization courses by UCSD on Coursera. And I got a certificate on the first course, as you can see below: