Introduction

A problem that comes up often is dynamic connectivity. Given a set of N objects, we need to support:

  • Union command: connect two objects.
  • Find/connected query: is there a path connecting the two objects?

In terms of connected components, the two operations are:

  • Find query. Check if two objects are in the same component.
  • Union command. Replace components containing two objects with their union.
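
The two operations can be sketched with a union-find (disjoint-set) structure. This is a minimal sketch, assuming the objects are numbered 0 to N-1 and using weighted quick-union with path halving, which is one common choice and not necessarily the variant the full post settles on. The class and method names are illustrative.

```python
class UnionFind:
    """Disjoint-set forest over the objects 0..n-1 (assumed numbering)."""

    def __init__(self, n):
        self.parent = list(range(n))  # each object starts as its own root
        self.size = [1] * n           # number of objects in each root's tree

    def find(self, p):
        """Return the root that identifies the component containing p."""
        while p != self.parent[p]:
            # Path halving: point p at its grandparent to flatten the tree.
            self.parent[p] = self.parent[self.parent[p]]
            p = self.parent[p]
        return p

    def connected(self, p, q):
        """Find query: are p and q in the same component?"""
        return self.find(p) == self.find(q)

    def union(self, p, q):
        """Union command: merge the components containing p and q."""
        root_p, root_q = self.find(p), self.find(q)
        if root_p == root_q:
            return
        # Attach the smaller tree under the larger one to keep trees shallow.
        if self.size[root_p] < self.size[root_q]:
            root_p, root_q = root_q, root_p
        self.parent[root_q] = root_p
        self.size[root_p] += self.size[root_q]


uf = UnionFind(5)
uf.union(0, 1)
uf.union(3, 4)
print(uf.connected(0, 1))  # True
print(uf.connected(1, 3))  # False
```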
Read more »

A neural network is put together by hooking together many of our simple “neurons,” so that the output of a neuron can be the input of another. For example, here is a small neural network:
[Figure: a small neural network]
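
To make "the output of a neuron can be the input of another" concrete, here is a minimal sketch of a forward pass through a tiny network. It assumes sigmoid activations, and the layer sizes, weights, and biases are made up for illustration; none of them come from the post.

```python
import math


def sigmoid(z):
    """Squash a real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(inputs, weights, bias):
    """One neuron: weighted sum of the inputs plus a bias, then sigmoid."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


def tiny_network(x):
    """Two hidden neurons feeding one output neuron (hypothetical weights)."""
    h1 = neuron(x, [0.5, -0.5], 0.0)
    h2 = neuron(x, [-1.0, 1.0], 0.5)
    # The hidden outputs h1 and h2 become the inputs of the output neuron.
    return neuron([h1, h2], [1.0, 1.0], -1.0)


print(tiny_network([1.0, 2.0]))
```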

Read more »

Knowing how to write Objective-C code is far from enough to make a good software engineer; it is also important to take a deeper look at how it works and why. Recently I have been reading the book Effective Objective-C 2.0, which has inspired me a lot, so I decided to write this post to summarize what I have learned.

Read more »

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents an outcome of the test, and each leaf node represents a class label (the decision taken after computing all attributes). The paths from root to leaf represent classification rules.
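
As a small illustration of that structure, the sketch below hand-codes a tree as nested dictionaries and follows one root-to-leaf path to classify a sample. The attributes (`outlook`, `windy`) and labels are hypothetical, and the tree is written by hand rather than learned from data.

```python
# Internal nodes are dicts naming the attribute to test;
# leaves are plain strings holding the class label.
tree = {
    "attribute": "outlook",
    "branches": {
        "sunny": {
            "attribute": "windy",
            "branches": {True: "stay in", False: "play"},
        },
        "rainy": "stay in",
        "overcast": "play",
    },
}


def classify(node, sample):
    """Walk from the root to a leaf, taking the branch for each test outcome."""
    while isinstance(node, dict):
        outcome = sample[node["attribute"]]
        node = node["branches"][outcome]
    return node


print(classify(tree, {"outlook": "sunny", "windy": False}))  # play
```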


Read more »

k-Nearest Neighbors (kNN) is a machine-learning algorithm. It works like this: we have an existing set of example data, our training set. We have labels for all of this data—we know what class each piece of the data should fall into. When we’re given a new piece of data without a label, we compare that new piece of data to the existing data, every piece of existing data. We then take the most similar pieces of data (the nearest neighbors) and look at their labels. We look at the top k most similar pieces of data from our known dataset; this is where the k comes from. (k is an integer and it’s usually less than 20.) Lastly, we take a majority vote from the k most similar pieces of data, and the majority is the new class we assign to the data we were asked to classify.
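
The procedure described above can be sketched in a few lines. This is a minimal sketch, assuming numeric feature vectors and Euclidean distance as the similarity measure (the excerpt does not name one). The function name and sample data are made up for illustration.

```python
import math
from collections import Counter


def knn_classify(query, training_points, labels, k):
    """Label `query` by majority vote among its k nearest training points."""
    # Compare the query against every piece of existing data.
    distances = [
        (math.dist(query, point), label)
        for point, label in zip(training_points, labels)
    ]
    # Keep the labels of the k most similar (closest) points.
    distances.sort(key=lambda pair: pair[0])
    top_k_labels = [label for _, label in distances[:k]]
    # Majority vote; a tie goes to the label seen first among the neighbors.
    return Counter(top_k_labels).most_common(1)[0][0]


points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify((0.1, 0.2), points, labels, k=3))  # B
```

Sorting the full distance list costs O(n log n) per query, which is fine for a small training set; larger datasets usually call for a spatial index or a partial sort.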

Read more »

In statistics, logistic regression (also called logit regression or the logit model) is a regression model in which the dependent variable (DV) is categorical, so in practice it is used as a classification algorithm.

Logistic regression:
$$0 \le h_\theta(x) \le 1$$

$$h_\theta(x) = g(\theta^T x)$$

$$g(z) = \frac{1}{1+e^{-z}}$$
[Figure: the logistic curve]
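
The formulas above translate directly into code. This is a minimal sketch: the parameter values are hypothetical, and the 0.5 decision threshold is a common convention assumed here, not something stated in the excerpt.

```python
import math


def g(z):
    """Logistic (sigmoid) function: g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def h_logistic(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x), always between 0 and 1."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return g(z)


def predict(theta, x, threshold=0.5):
    """Turn the hypothesis value into a class label (assumed threshold)."""
    return 1 if h_logistic(theta, x) >= threshold else 0


theta = [-1.0, 2.0]  # hypothetical parameters
x = [1.0, 0.5]       # x[0] = 1 plays the role of the intercept term
print(h_logistic(theta, x))  # 0.5
```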

Read more »

In statistics, linear regression is an approach for modeling the relationship between a scalar dependent variable y and one or more explanatory variables (or independent variables) denoted X.

For multivariate linear regression, given the following function, our job is to find the right 𝜃.

$$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$$

For convenience of notation, define $x_0 = 1$.
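
With $x_0 = 1$, the hypothesis becomes a single dot product, $h_\theta(x) = \theta^T x$. The sketch below shows this with made-up parameter values; the function name is illustrative.

```python
def h_linear(theta, features):
    """Evaluate h_theta(x) = theta_0 + theta_1*x_1 + ... + theta_n*x_n."""
    x = [1.0] + list(features)  # prepend x_0 = 1 so theta_0 needs no special case
    if len(theta) != len(x):
        raise ValueError("theta needs one more entry than there are features")
    return sum(t * xi for t, xi in zip(theta, x))


theta = [1.0, 2.0, 3.0]  # hypothetical theta_0, theta_1, theta_2
print(h_linear(theta, [10.0, 100.0]))  # 321.0
```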

Read more »

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.

Read more »

Apache Spark is a fast, general-purpose cluster computing system. After downloading it, we can launch Spark’s interactive shell; bin/pyspark is the Python one. Start it by running the following in the Spark directory:

./bin/pyspark

The gennum files contain show names and their viewer counts; the genchan files contain show names and their channels. We want to find the total number of viewers across all shows for each channel.
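
Before turning to Spark, the join-then-aggregate logic can be sketched in plain Python. This is not Spark code. The sample records are invented, and the sketch assumes a show can appear more than once in either file and can air on several channels.

```python
from collections import defaultdict

# Hypothetical records standing in for parsed gennum / genchan lines.
gennum = [("show_a", 10), ("show_b", 20), ("show_a", 5)]
genchan = [("show_a", "ABC"), ("show_b", "ABC"), ("show_b", "XYZ")]


def viewers_per_channel(show_viewers, show_channels):
    """Join on show name, then sum viewers for each channel."""
    # Step 1: total viewers per show.
    totals = defaultdict(int)
    for show, viewers in show_viewers:
        totals[show] += viewers
    # Step 2: join with the channel records and add up per channel.
    per_channel = defaultdict(int)
    for show, channel in show_channels:
        per_channel[channel] += totals.get(show, 0)
    return dict(per_channel)


print(viewers_per_channel(gennum, genchan))  # {'ABC': 35, 'XYZ': 20}
```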

Read more »

Recently I have been taking the Big Data Specialization courses by UCSD on Coursera, and I earned a certificate for the first course, as you can see below:

[Image: certificate]

Read more »