Take facts, turn them into knowledge, algorithmically.

Also known as "unsupervised learning", it's what you do when you have a whole lot of unstructured data you know little about.

The problem: check the spelling of things that aren't in the dictionary

Indigo Montoya

Inigo Montana

Inigo Montoya

Neego Montoya

Inigo Mantoya

Numbers we have include: number of times a query is made, distance between queries (e.g., Levenshtein distance)

When a new query comes in, find the most common query within a short distance and suggest it.

And that's it.
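
A minimal sketch of that idea, assuming a plain dictionary of past queries and their counts. The names `levenshtein`, `suggest`, `max_distance`, and the counts themselves are made up for illustration; a real system would need something faster than comparing against every past query.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (dynamic programming, row by row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character from a
                            curr[j - 1] + 1,      # insert a character into a
                            prev[j - 1] + cost))  # substitute (or keep) a character
        prev = curr
    return prev[-1]


def suggest(query, query_counts, max_distance=3):
    """Return the most frequent past query within max_distance edits, or None."""
    nearby = {q: n for q, n in query_counts.items()
              if levenshtein(query, q) <= max_distance}
    if not nearby:
        return None
    return max(nearby, key=nearby.get)


# Hypothetical query log: how often each spelling has been searched for.
query_counts = Counter({
    "Inigo Montoya": 120,
    "Indigo Montoya": 7,
    "Inigo Montana": 3,
    "Neego Montoya": 1,
})

print(suggest("Inigo Mantoya", query_counts))  # Inigo Montoya
```

The distance threshold is a judgment call: too small and real typos are missed, too large and unrelated queries start "correcting" each other.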

Problem: given a big pile of documents, figure out what different categories there are.

Solution: (a whole lot of) simple (high-dimensional) geometry

- Pick some (k) random points in your vector space.
- For each document, figure out the nearest point.
- Move each point to the middle of the documents that picked it.
- Lather, rinse, repeat.
- Voila! Slow-cooked category discovery
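
The recipe above is usually called k-means. Here is a rough sketch in plain Python, assuming documents have already been turned into small numeric tuples. The names `kmeans` and `sq_dist` and the toy points are illustrative, and as a simplification the starting points are sampled from the data rather than from the whole vector space.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20, seed=0):
    rng = random.Random(seed)
    # Simplification: use k randomly chosen data points as the starting centers.
    centers = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        for i, members in enumerate(clusters):
            if members:  # leave a center where it is if nothing picked it
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, clusters


# Toy "documents": two obvious clumps in 2-D.
docs = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]

centers, clusters = kmeans(docs, k=2)
print(sorted(centers))  # [(0.5, 0.5), (10.5, 10.5)]
```

The fixed iteration count is the "slow-cooked" part; stopping once the assignments no longer change would be a natural refinement.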

When you already know something about your data, and you want to apply that knowledge to more, less-known data

You have 100 documents in two different categories. Predict the category for the next 5000 documents.

- For each unknown, figure out the closest known (geometrically) and borrow its category
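
As a sketch, this is nearest-neighbor classification. The function name `nearest_neighbor`, the 2-D vectors, and the category labels below are invented for illustration; real document vectors would have far more dimensions.

```python
import math


def nearest_neighbor(unknown, labeled):
    """Return the label of the known point closest to `unknown`."""
    _, label = min(labeled, key=lambda pair: math.dist(pair[0], unknown))
    return label


# Hypothetical labeled documents as (vector, category) pairs.
labeled = [
    ((1.0, 1.0), "sports"),
    ((1.5, 0.5), "sports"),
    ((8.0, 9.0), "politics"),
    ((9.0, 8.5), "politics"),
]

print(nearest_neighbor((2.0, 1.0), labeled))  # sports
```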

- Figure out a line separating the categories
- Use that line to classify the unknowns
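
The notes don't say how to find the line. One simple option is the perceptron, sketched here under the assumption that the two categories really can be separated by a straight line. The names `train_perceptron`, `classify`, and `dot`, plus the sample vectors, are illustrative only.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def train_perceptron(samples, epochs=100):
    """samples: list of (vector, label) pairs, where label is +1 or -1."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            # A sample on the wrong side (or on the line) nudges the line toward it.
            if y * (dot(w, x) + b) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every known sample is on the correct side
            break
    return w, b


def classify(w, b, x):
    """Which side of the line w.x + b = 0 does x fall on?"""
    return 1 if dot(w, x) + b > 0 else -1


# Hypothetical known documents: category +1 and category -1.
samples = [
    ((2.0, 1.0), 1), ((3.0, 2.0), 1), ((2.0, 3.0), 1),
    ((-2.0, -1.0), -1), ((-1.0, -3.0), -1), ((-3.0, -2.0), -1),
]

w, b = train_perceptron(samples)
print(classify(w, b, (4.0, 4.0)))    # 1
print(classify(w, b, (-4.0, -1.0)))  # -1
```

If the categories overlap, this loop never settles and simply stops after `epochs` passes, which is one reason more robust separators exist.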

Not geometric, but statistical.

Future probabilities derived from prior probabilities

If a drug test has 95% accuracy, and Bob tests positive, what is the probability that he uses drugs?

(hint: it's not 95%)

Answer: Depends on how many people use drugs.

If the rate of drug use is 1%, then we have:

| | test positive | test negative |
|---|---|---|
| users | 95% of 1% | 5% of 1% |
| non-users | 5% of 99% | 95% of 99% |

Share of all tests that come back positive: 0.95% + 4.95% == 5.9%

Share of positive results that are *correct*: 0.95% / 5.9% == 16.1%
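
The same arithmetic as a small function, assuming "95% accuracy" means the test is right 95% of the time for users and non-users alike. The names `posterior`, `sensitivity`, and `specificity` are labels chosen here, not from the original.

```python
def posterior(prior, sensitivity, specificity):
    """Probability of being a user given a positive test (Bayes' rule)."""
    true_pos = sensitivity * prior                # users who test positive
    false_pos = (1 - specificity) * (1 - prior)   # non-users who test positive
    return true_pos / (true_pos + false_pos)


p = posterior(prior=0.01, sensitivity=0.95, specificity=0.95)
print(f"{p:.1%}")  # 16.1%
```

Changing `prior` shows how much the answer depends on how many people use drugs: at a 50% base rate the same test gives 95%.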

- Look at your data, figure out a good numeric representation
- Turn your data into numbers (usually vectors of numbers)
- Run your algorithms
- Profit! (or Fun!)
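
For text, one common "good numeric representation" is a bag of words: one vector slot per word, holding how often that word appears. A minimal sketch, with `bag_of_words` and the sample sentences made up for illustration:

```python
from collections import Counter


def bag_of_words(documents):
    """Turn each document into a vector of word counts over a shared vocabulary."""
    vocabulary = sorted({word for doc in documents for word in doc.lower().split()})
    vectors = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


texts = ["the cat sat", "the dog sat down", "the cat saw the dog"]
vocabulary, vectors = bag_of_words(texts)

print(vocabulary)  # ['cat', 'dog', 'down', 'sat', 'saw', 'the']
print(vectors[2])  # [1, 1, 0, 0, 1, 2]
```

Vectors like these are what the distance-based steps earlier in the talk would operate on.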