Sometimes it can be convenient to work on the same Flask project from several machines, each holding its own test database. When working with Flask-Migrate this might generate some inconsistencies. Indeed, more than once I've faced the following error while performing the `flask db migrate; flask db upgrade;` commands:

`Target database is not up to date`

In order to solve this problem there is the need to align the version of the database with the migration table data:

First, we might want to have a look at the actual heads:

`flask db heads`

This command shows the **currently available heads in the revision script directory**.

Then we might check the **current revision of the database**:

`flask db current`

In case of inconsistencies, the outputs of the two commands above will differ.

Then, in order to set up and align the revisions:

`flask db stamp heads`

This sets the revision in the database to the one given as an argument, without performing any migrations (`heads` = the latest).

Finally, in order to migrate I usually perform the following commands:

```
flask db migrate;
flask db upgrade;
```

The *migrate* command can be skipped when there are no new schema changes to capture.

We've learned how to solve the Flask-Migrate error raised when the target database is not up to date.


The Julia language is an open-source programming language designed at MIT with high performance in mind. Julia programs compile to efficient native code for multiple platforms. Julia is dynamically typed, with optional type annotations that can be used to clarify and solidify programs. Julia easily allows expressing many object-oriented and functional programming patterns. The standard library provides *asynchronous I/O*, process control, logging, profiling, a package manager, and more.

There are not many detailed guides online, so I am going to describe the procedure.

From the official download page we can choose among different versions depending on our needs; the major candidates are:

- **Current Stable Release**
- **Long Term Support (LTS) Release** *(Suggested)*
- **Upcoming Release** (not production ready)

Once downloaded, unpack the .tar.gz file to a proper folder; in my case it is going to be `/opt`:

`sudo tar -C /opt -xvf julia-1.0.4-linux-x86_64.tar.gz`

Then let's link the main executable so it is available among our shell commands:

`sudo ln -s /opt/julia-1.0.4/bin/julia /usr/bin/`

You should now be able to open a new terminal and type `julia`: the language's command-line REPL should launch.

`sudo su;`

`mysql -u root -p;`

`CREATE DATABASE dbname;`

`CREATE USER 'username'@'localhost' IDENTIFIED BY 'password';`

`GRANT ALL PRIVILEGES ON dbname.* TO 'username'@'localhost';`

`FLUSH PRIVILEGES;`

Follows a list of useful, freely available, theory-oriented Data Science resources.

**Cosma Shalizi**

https://www.stat.cmu.edu/~cshalizi/

http://bactra.org/weblog/

*Data over Space and Time (36-467/667) - Fall 2018*

https://www.stat.cmu.edu/~cshalizi/dst/18/

This course is an introduction to the opportunities and challenges of analyzing data from processes unfolding over space and time. It will cover basic descriptive statistics for spatial and temporal patterns; linear methods for interpolating, extrapolating, and smoothing spatio-temporal data; basic nonlinear modeling; and statistical inference with dependent observations. Class work will combine practical exercises in R, some mathematics of the underlying theory, and case studies analyzing real data from various fields (economics, history, meteorology, ecology, etc.). Depending on available time and class interest, additional topics may include: statistics of Markov and hidden-Markov (state-space) models; statistics of point processes; simulation and simulation-based inference; agent-based modeling; dynamical systems theory.

**Undergraduate Advanced Data Analysis**

https://www.stat.cmu.edu/~cshalizi/uADA/17/

The goal of this class is to train you in using statistical models to analyze - as data summaries, as predictive instruments, and as tools for scientific inference. We will build on the theory and applications of the linear model, introduced in 36-401, extending it to more general functional forms, and more general kinds of data, emphasizing the computation-intensive methods introduced since the 1980s. After taking the class, when you're faced with a new data-analysis problem, you should be able to (1) select appropriate methods, (2) use statistical software to implement them, (3) critically evaluate the resulting statistical models, and (4) communicate the results of your analyses to collaborators and to non-statisticians.

**Foundations of Data Science**

https://www.cs.cornell.edu/jeh/book.pdf

**Data Science Central**

https://www.datasciencecentral.com/

**R-Bloggers**

https://www.r-bloggers.com/


In the previous post *"Market-Basket Model | Frequent ItemSets Mining | Association Rules"* we introduced concepts related to the *Frequent Itemsets Mining* task, while in this one we are going to explain the **A-Priori Algorithm** and implement this famous algorithm in pure **Python 3.x** to achieve such a complex task.

A basket can be represented as a set of integers such as `{1, 3, 9, 11}`. In order to **find frequent itemsets**, given a basket there is the (theoretical) need to compute all its possible subsets, which can be computationally expensive. Indeed, if we consider a basket composed of $ n $ elements and we want to compute just the subsets of two elements, then the number of all possible subsets of two (or *k=2*) elements is going to be $ \binom{n}{k} = \frac{n!}{k!(n-k)!} $ (see combinations or the proof). In the case of a basket of 20 items the number of all possible subsets composed of two elements is 190, for subsets with cardinality 3 the number rises to 1140, while the number of all subsets with cardinality from 0 to $n$ is $2^n$. Luckily, in **real world problems** we often only need to compute small frequent itemsets with 2 or 3 elements, possibly with **high support**, in order to infer association rules with **high confidence**, making the computational process not so expensive and not so memory greedy. Furthermore, it is often possible to compute frequent itemsets for bigger *k* because many of the items in each basket won't be able to participate in the frequent itemsets.
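The counts above can be checked directly with Python's standard library (`math.comb` is available from Python 3.8):

```python
from math import comb

# Number of 2-element and 3-element subsets of a 20-item basket
print(comb(20, 2))  # 190
print(comb(20, 3))  # 1140
# Number of all possible subsets (cardinality 0 to n) of a 20-item basket
print(2 ** 20)      # 1048576
```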

The **A-Priori** algorithm takes advantage of just one property/observation, called the *monotonicity of itemsets*:

If a set $ I $ of items is frequent, then also every subset of $ I $ is frequent.

Indeed, given $ J \subseteq I $ then all the baskets that contain all the items of $ I $ contain also all the items of $ J $. It follows that the count of $ J $ is at least equal to the count of $ I $. The **A-Priori** algorithm makes use of this property to make information more compact discarding useless information.
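A quick sketch of the monotonicity property on a toy set of baskets (the `support` helper below is just an illustrative counter, not part of the A-Priori implementation that follows):

```python
baskets = [{1, 2, 3}, {1, 2}, {2, 3}, {1, 2, 4}]

def support(itemset, baskets):
    # Number of baskets that contain every item of the itemset
    return sum(itemset.issubset(basket) for basket in baskets)

# The count of any subset J of I is at least the count of I itself
assert support({1, 2}, baskets) <= support({1}, baskets)
assert support({1, 2}, baskets) <= support({2}, baskets)
print(support({1, 2}, baskets), support({2}, baskets))  # 3 4
```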

Given a support threshold $ s $, we call an itemset *maximal* if none of its supersets are frequent. If we store just the maximal itemsets, we know that all the subsets of a maximal itemset are frequent.

In short:

- Given a *maximal* itemset, all its subsets are frequent.
- A superset of a maximal itemset is not frequent.

As mentioned, the algorithm tries to avoid processing useless information, discarding some sets a priori thanks to the previously mentioned *monotonicity property of itemsets*. First of all, the algorithm iterates through all the baskets counting the occurrences of each item. After this process the algorithm finds all the frequent singletons according to the selected **threshold s**.

The algorithm makes a pass over the data for each *set-size k.* For each size *k* there will be two sets of itemsets:

- $ C_k $ is the set of *candidate* itemsets of size $ k $.
- $ L_k $ is the set of truly frequent itemsets of size $ k $.

So the algorithm for each $ k $ alternates two phases: constructing the candidate sets and filtering the frequent ones out of them. $ C_1 $ is going to contain all the singletons, while $ L_1 $ is going to be populated with the elements which appear within baskets at least as many times as the *support threshold s*.

Follows the Python3 implementation of the A-Priori algorithm for itemsets mining. The main function is *get_frequent_items_sets*, which takes as input the dataset (a list of lists of integers representing a list of baskets), the support, and the number of iterations. The rest of the code is quite self-explanatory and abundantly commented, thus it's possible to read it all along the way and understand the main points even if one does not really speak Python.

```
from collections import defaultdict

# input:
#   data: list of lists containing numeric values (the baskets)
#   min_sup: minimum support to be considered frequent (we select itemsets
#            that appear in at least min_sup baskets)
#   steps: number of steps; e.g. to find just couples set it to 2
#          (the default 0 terminates when there are no more candidates)
# returns:
#   solution: dictionary mapping each size k to a set of frozensets
#             representing the frequent itemsets of that size
#             (be careful when iterating on it)
def get_frequent_items_sets(data, min_sup, steps=0):
    # We transform the dataset into a list of sets
    transactions = list()
    # Temporary dictionary to count occurrences
    items = defaultdict(int)
    # Returned dictionary
    solution = dict()
    L_set = set()
    # Fill transactions and count "singletons"
    for line in data:
        transaction = set(line)
        transactions.append(transaction)
        for element in transaction:
            items[element] += 1
    # Add to the solution all frequent singletons
    for item, count in items.items():
        if count >= min_sup:
            L_set.add(frozenset([item]))
    # The loop ends when there are no more candidates
    # or when the user-provided number of steps has been performed
    k = 2
    solution[k - 1] = L_set
    while L_set != set() and k != steps + 1:
        C_set = create_candidates(L_set, k)
        L_set = frequent_items(C_set, transactions, min_sup)
        if L_set != set():
            solution[k] = L_set
        k = k + 1
    return solution

# Creates candidates by joining the set with itself
# input:
#   itemSet: a set of frequent itemsets (frozensets)
#   length: the cardinality of the generated combinations
# returns:
#   a set containing all unions with cardinality equal to length
def create_candidates(itemSet, length):
    return set(i.union(j) for i in itemSet for j in itemSet
               if len(i.union(j)) == length)

# Checks occurrences of candidates in transactions and returns the frequent ones
# input:
#   items: a set of candidate itemsets
#   transactions: list of sets representing baskets
#   min_sup: minimum support
# returns:
#   _itemSet: a set containing all frequent candidates (a subset of items)
def frequent_items(items, transactions, min_sup):
    _itemSet = set()
    localDict = defaultdict(int)
    for item in items:
        for transaction in transactions:
            if item.issubset(transaction):
                localDict[item] += 1
    for item, count in localDict.items():
        if count >= min_sup:
            _itemSet.add(item)
    return _itemSet
```

Example execution:

```
In [1]: get_frequent_items_sets([[1,2,3],[0,1,2],[4,3,1,2],[1,2,3,4]], min_sup=3)
Out[1]:
{1: {frozenset({3}), frozenset({2}), frozenset({1})},
2: {frozenset({1, 3}), frozenset({2, 3}), frozenset({1, 2})},
3: {frozenset({1, 2, 3})}}
```

I also suggest giving a read to a similar Python implementation which additionally outputs the association rules.

The code contained within this article has to be considered for illustration purposes only and not for production environments (do not reinvent the wheel and check out some robust libraries!). **BONUS:** here you can find some more elaborate solutions (including the state of the art: FP-Growth) to compute frequent itemsets both in pure Python (serial) and making use of PySpark (map-reduce)!

That was all!

**The Market-Basket Concept: **

Let's suppose we are working for a market, and we would like to know which *sets of items* **often appear together** in *baskets*. To perform such an analysis we must somehow observe all the baskets and infer this information. Within this scenario we suppose the number of baskets to be very large compared to the number of items or to the average number of items in a basket.

**Frequent Itemsets:**

But what does *frequent* mean? To be more formal we need to introduce the concept of *support threshold*, indicated with $s$: an itemset appearing in at least $s$ baskets is said to be *frequent*.

Although the concept of **Frequent ItemSets** was first applied in the field of market-basket marketing investigations, there are also different fields where it can be applied:

- *Related Concepts:* where words represent *items* and documents represent *baskets*.
- *Biomarkers:* if we consider genes, blood proteins, and diseases, each basket is the set of data about a patient. A frequent itemset consisting of one or more *biomarkers* and a disease might suggest a test for that disease.

**Association Rules:**

We can imagine an association rule as an *if-then* statement. It can be represented as $I \rightarrow j$ where *I* is a set of items and *j* is another item. The implication is that if all the items of *I* appear in a basket then *"most likely"* *j* will appear in the same basket as well.

To formalize such a concept we need to introduce the formal notion of *likely*, defining the **confidence of an association rule** $I \rightarrow j$:

The confidence of an association rule $I \rightarrow j$ is defined as the ratio of the support for $ I \cup \{j\} $ to the support for $ I $.

That is, the confidence of the rule is the fraction of the baskets containing all of *I* that also contain *j*.
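As a sketch, the confidence of a rule can be computed directly from a list of baskets (the `confidence` function and the sample baskets below are illustrative):

```python
baskets = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {2, 3}]

def confidence(I, j, baskets):
    # support(I ∪ {j}) divided by support(I)
    sup_I = sum(I.issubset(b) for b in baskets)
    sup_Ij = sum((I | {j}).issubset(b) for b in baskets)
    return sup_Ij / sup_I

# Rule {1} -> 2: every basket containing item 1 also contains item 2
print(confidence({1}, 2, baskets))  # 1.0
```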

In the next post, *"Frequent Itemsets Mining: The A-Priori Algorithm in Python explained"*, I am going to talk about the most famous algorithm for **Frequent Itemsets Mining**: the **A-Priori Algorithm**.

**Note:** there exist different variants of the A-Priori algorithm as well as entirely different algorithms; worth mentioning is the most performant one: **FP-Growth**.

In this short blog post I am going to summarize content from **Mining of Massive Datasets**, more precisely the part of **Chapter 3** which describes the **concept of distance**.

Beyond all the formal definitions of the concepts **measure/metric** and **distance**, we might pick Wikipedia's very first *"informal definition"* as a good starting point:

A *distance* is a numerical measurement of how far apart [two] objects are.

Follows a *"more abstract"* mathematical definition:

In mathematics, a *metric* or *distance function* is a function that defines a distance between each pair of elements of a set.

Given this short conceptual introduction we should provide an overview of the properties of a generic function in order for it to be called **distance function**. Let's suppose we have a set of points called a *space. *A *distance function* on this space is a function $d(x,y)$ that given two points ($x$ and $y$) produces a real number and satisfies the following axioms:

- $ d(x,y) \geq 0 $; a **distance measure cannot be negative**.
- $ d(x,y) = 0 $ if and only if $x=y$; a distance measure is always positive, except that **the distance from a point to itself is 0**.
- $ d(x,y) = d(y,x) $; a **distance function is symmetric**.
- $ d(x,y) \leq d(x,z) + d(z,y) $; **a distance function always satisfies the *triangle inequality* constraint**.

The last axiom can be intuitively explained using basic geometry. If we consider two points $x$ and $y$ and we define our distance function as the shortest possible line (path) drawn from $x$ to $y$, we clearly do not get any benefit from passing through any other point $z$ in order to go from $x$ to $y$; the best-case scenario is when $z$ lies on the shortest line between the two points $x$ and $y$, bringing no benefit nor penalty to our distance path.
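For instance, with the familiar Euclidean distance in the plane (`math.dist`, available from Python 3.8) the triangle inequality is easy to check numerically:

```python
from math import dist

x, y, z = (0, 0), (3, 4), (1, 1)
# Going through z can never be shorter than the direct path from x to y
assert dist(x, y) <= dist(x, z) + dist(z, y)
print(dist(x, y))  # 5.0
```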

The **Euclidean distance** represents the most common concept of distance; it is the most familiar in our daily life.

An *n-Dimensional Euclidean Space* is the one where points are represented as vectors of $n$ **real numbers**.

The most conventional distance, the Euclidean distance, also referred to as the $L_2$*-Norm Distance*, is defined as follows:

$$ d(x,y)= d([x_1, x_2, \cdots,x_n], [y_1, y_2,\cdots,y_n]) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2} $$

This distance satisfies all the four axioms mentioned above. In Euclidean Spaces the concept of distance can be generalized such that for any constant $r \in \mathbb{N}$ we can define the $L_r$-*Norm Distance* $d$ as:

$$ d(x,y)= d([x_1, x_2, \cdots,x_n], [y_1, y_2,\cdots,y_n]) = (\sum_{i=1}^{n} |(x_i - y_i)|^r)^{1/r} $$

Another common measure is the $L_1$*-Norm Distance*, more often referred to as the *Manhattan Distance*. It is the sum of the magnitudes of the differences in each dimension:

$$ d(x,y)= d([x_1, x_2, \cdots,x_n], [y_1, y_2,\cdots,y_n]) = \sum_{i=1}^{n} |(x_i - y_i)|$$

The $L_\infty$*-Norm Distance* is the limit as $r$ tends to $\infty$ of the $L_r$-*norm*. As $r$ gets larger and larger, only the dimension with the largest difference really matters (becomes dominant); thus, formally, the $L_\infty$*-Norm Distance* is defined as the maximum of $|x_i-y_i|$ over all the dimensions $i$.

$$ d(x,y)= d([x_1, x_2, \cdots,x_n], [y_1, y_2,\cdots,y_n]) = \max{(|x_i-y_i|)} $$

Follows a code snippet of a **pure python3.x implementation of the Euclidean Distances**:

```
import numbers

def euclidean_distance(x, y, r=2):
    # Case wrong r value
    if r != 'inf' and r < 1:
        raise ValueError('r must be at least 1 (r >= 1)')
    # Case the two vectors have different lengths
    if len(x) != len(y):
        raise ValueError('Input Vector Length Mismatch')
    # Case both vectors have length 0
    if len(x) == 0:
        return 0.0
    # Case values are not numeric
    if not all(isinstance(v, numbers.Number) for v in x) or \
       not all(isinstance(v, numbers.Number) for v in y):
        raise TypeError('Euclidean Distance is computed'
                        ' over numerical data only')
    # Case L_infty-Norm Distance
    if r == 'inf':
        return max(abs(xi - yi) for xi, yi in zip(x, y))
    # Compute the L_r-Norm Distance
    return sum(abs(xi - yi)**r for xi, yi in zip(x, y))**(1/r)
```

```
>>> euclidean_distance([1,2,3],[1,2.5,3.5], 1)
1.0
>>> euclidean_distance([1,2,3],[1,2.5,3.5], 2)
0.7071067811865476
>>> euclidean_distance([1,2,3],[1,2.5,3.5], 3)
0.6299605249474366
>>> euclidean_distance([1,2,3],[1,2.5,3.5], 'inf')
0.5
```

It is interesting to note that as $r$ increases, the value of the distance slowly approaches the $L_\infty$-*Norm Distance* (in this case 0.5):

```
>>> [euclidean_distance([1,2,3],[1,2.5,3.5],x) for x in range(1,1000,1)]
[1.0,
 0.7071067811865476,
 0.6299605249474366,
 0.5946035575013605,
 0.5743491774985174,
 ...
 0.5003480865601369,
 0.5003477373047961,
 0.5003473887496088,
 0.5003470408924718]
```

*These implementations have to be considered for illustration purposes only; refer to numpy.linalg or other libraries for production-ready code*.

The Jaccard distance represents a metric to compute the distance between two sets. Given two sets $X$ and $Y$, it is defined as one minus the ratio of the cardinality (size) of the intersection of the two sets over the cardinality of their union:

$$ d(X,Y) = 1 - \frac{|X\cap Y|}{|X\cup Y|} $$

Its value ranges between $0$ and $1$ and it makes use of the Jaccard similarity between two sets, defined as follows:

$$ SIM(X,Y) = \frac{|X\cap Y|}{|X\cup Y|} $$

Follows a code snippet of a **pure python3.x implementation of the Jaccard Distance**:

```
def jaccard_distance(x, y):
    # Accept lists by converting them to sets
    if isinstance(x, list):
        x = set(x)
    if isinstance(y, list):
        y = set(y)
    # Avoid division by 0 (two empty sets)
    if len(x.union(y)) == 0:
        return 0.0
    return 1 - len(x.intersection(y)) / len(x.union(y))
```

```
>>> jaccard_distance([1,2,4], [2,3,4])
0.5
>>> jaccard_distance([1,2,4], [1,2,4])
0.0
```

In the context of the **Cosine Distance** we interpret points as directions from the origin of the plane to the points themselves. In this scenario we measure the distance between two vectors as $1 - cos(\theta)$, where $cos(\theta)$ is the cosine of the angle $\theta$ between the two directions.

Cosine distance is generally used as a metric for measuring distance when the magnitude of the vectors does not really matter; often this happens when working with NLP (Natural Language Processing) text data represented by word counts. The length of the directions does not alter the distance measure; thus, multiplying a vector by a constant does not impact the distance measure at all. The cosine distance (based on the cosine similarity and normalized) can be defined as:

$$ d(x,y) = 1 - \frac {x \cdot y}{||x||_2 ||y||_2} $$

Follows a code snippet of a **pure python3.x implementation of the Cosine Distance**:

```
import numbers

def cosine_distance(x, y):
    # Case the two vectors have different lengths
    if len(x) != len(y):
        raise ValueError('Input Vector Length Mismatch')
    # Case both vectors have length 0
    if len(x) == 0:
        return 0.0
    # Case values are not numeric
    if not all(isinstance(v, numbers.Number) for v in x) or \
       not all(isinstance(v, numbers.Number) for v in y):
        raise TypeError('Cosine Distance is computed'
                        ' over numerical data only')
    # If one of the vectors has 0 magnitude then d is not defined
    if all(v == 0 for v in x) or all(v == 0 for v in y):
        return None
    # Numerator: Dot Product
    num = sum(xi * yi for xi, yi in zip(x, y))
    # Denominator: Product of the L_2 Norms
    den = sum(xi * xi for xi in x)**.5 * sum(yi * yi for yi in y)**.5
    # Divide by 2 to normalize the result into the range [0, 1]
    return (1 - (num / den)) / 2
```

The following example shows that a vector scaling does not impact the cosine distance measure:

```
>>> cosine_distance([1,2,3],[1,2,0])
0.2011928476664016
>>> cosine_distance([1,2,3],[4,8,0])
0.2011928476664016
```

This kind of distance comes into play when the data points are strings. Indeed, it is not as trivial as with real numbers to compute a distance among strings. What is the magnitude of a string? Maybe its length? The **edit distance** between two strings $x=x_1x_2\cdots x_n$ and $y=y_1y_2\cdots y_m$ can be defined as *the smallest number of insertions and deletions of single characters that will convert $x$ to $y$*.

Another way to define the edit distance between two strings $d(x,y)$ is to compute their *longest common subsequence (LCS)*. Given the LCS of the two strings $x$ and $y$ the edit distance can be computed as the length of $x$ plus the length of $y$ minus twice the length of their LCS. Considering $|x|$ as the length of a string we might express the edit distance as:

$$ d(x,y) = |x| + |y| - 2\cdot LCS(x,y)$$
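A minimal sketch of this definition in pure Python; the `lcs_length` helper uses the classic dynamic-programming approach, and both function names are illustrative:

```python
def lcs_length(x, y):
    # Classic dynamic-programming table for the longest common subsequence
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, xi in enumerate(x, 1):
        for j, yj in enumerate(y, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if xi == yj \
                       else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # d(x, y) = |x| + |y| - 2 * LCS(x, y)
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(edit_distance("abcde", "acfde"))  # 2 (delete 'b', insert 'f')
```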

A similar distance measure which takes into account not only insertions and deletions but also substitution is the **Levenshtein Distance**.

Given a space of vectors, it is possible to define the **Hamming Distance** between two vectors as the number of components in which they differ:

e.g.: $x = [1,2,3,4]$ and $y=[0,2,2,4]$ then $d(x,y) = 2$.

Most often it is used when handling boolean vectors, nevertheless, it is possible to apply this metric to vectors having components from any set.

Follows a code snippet of a **pure python3.x implementation of the Hamming Distance**:

```
def hamming_distance(x, y):
    # Count the components in which the two vectors differ
    return sum(xi != yi for xi, yi in zip(x, y))
```

```
>>> hamming_distance([1,2,3],[1,2,3])
0
>>> hamming_distance([1,3,4],[1,2,3])
2
```

It is important to note that some of the distance measures mentioned above are not defined over Euclidean spaces. An important property of Euclidean spaces is that the average of points always exists; this does not hold for sets: it is not possible to compute the average of two sets, and the same limitation also applies to strings. Vector spaces where we define the cosine distance may or may not be Euclidean, depending on the components of the vectors: for example, if we restrict the values of the vectors to be integers, then the space is not going to be Euclidean, while in the case of real numbers it forms a Euclidean space.

Euclidean Distance VS Cosine Distance/Similarity

Levenshtein Distance Implementation

LCS Implementation