2020/09/07
Me
- Lasse Hjorth Madsen
- Master of Political Science way back
- Data scientist at Novo Nordisk
Political Science was broken last time I saw it
- Relied mostly on survey data
- Too little statistics, and of too poor quality
- No consensus on what constitutes good science
The good news: Data science can fix that
- Many more sources of data
- Better statistical tools and understanding
- A mindset from the natural sciences
Case: Finding similar deviations
Novo Nordisk Kalundborg
Deviations at Novo Nordisk
- Problem: We need a way to find similar descriptions of deviations
- Solution: Natural language processing. Document similarity.
- Question: How to turn text into numbers?
Toy example: Document-term frequencies
text                   | are | dinosaurs | reptiles | games | play | kids | like
dinosaurs are reptiles | 1   | 1         | 1        | 0     | 0    | 0    | 0
dinosaurs play games   | 0   | 1         | 0        | 1     | 1    | 0    | 0
kids like games        | 0   | 0         | 0        | 1     | 0    | 1    | 1
- You have got to be joking …
- Bag-of-words: Word order is ignored
- All words are treated equally
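The toy example can be reproduced in a few lines. This is a minimal sketch in Python using only the standard library, not the code behind the talk. The names `docs`, `vocab` and `dtm` are made up for illustration, and the column order (first appearance) is arbitrary, so it may differ from the table.

```python
from collections import Counter

# The three toy documents from the example.
docs = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

# Bag-of-words: split on whitespace and count; word order is thrown away.
counts = [Counter(doc.split()) for doc in docs]

# Vocabulary in order of first appearance (an arbitrary choice).
vocab = list(dict.fromkeys(word for doc in docs for word in doc.split()))

# Document-term matrix: one row per document, one column per term.
dtm = [[count[word] for word in vocab] for count in counts]

print(vocab)
for doc, row in zip(docs, dtm):
    print(doc, row)
```

Every word gets its own column and the same weight, which is exactly the "all words are treated equally" limitation.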
Words plotted by frequency rank and frequency
The hard-to-remember metric: tf-idf
Intuition:
- How often a word appears in a document indicates what the document is about
- On the other hand, very common words carry less meaning
- Inverse document frequency discounts common words:
- \[idf(term) = \log\frac{n_{documents}}{n_{documents\ containing\ term}}\]
- Finally, we just multiply the term frequency by the idf for that term, hence tf-idf
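The recipe above can be sketched in Python on the three toy documents. Several choices here are assumptions, not taken from the talk: term frequency is the raw count (some tools use the share of words in the document instead), the logarithm is the natural log, and cosine similarity is used as one common way to compare documents. The helper names `tf_idf` and `cosine` are hypothetical.

```python
import math
from collections import Counter

docs = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]
tokenized = [doc.split() for doc in docs]
n_documents = len(tokenized)

# Document frequency: the number of documents containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

# idf(term) = log(n_documents / n_documents containing term)
# Assumption: natural logarithm; the base only rescales the weights.
idf = {term: math.log(n_documents / n) for term, n in df.items()}


def tf_idf(tokens):
    """Return {term: tf * idf}, with raw counts as term frequency (an assumption)."""
    tf = Counter(tokens)
    return {term: count * idf[term] for term, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


weights = [tf_idf(tokens) for tokens in tokenized]

for doc, w in zip(docs, weights):
    print(doc, {term: round(value, 3) for term, value in w.items()})

# Documents that share no words have similarity zero.
print(cosine(weights[0], weights[1]), cosine(weights[0], weights[2]))
```

Words that occur in two of the three documents ("dinosaurs", "games") get a lower weight than words that occur in only one, which is the discounting that idf is meant to provide.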
Inverse-document-frequency, idf
More complex: Principal component analysis
Fun with data science
Zetland. Unbreaking news
An analysis of lyrics from popular music
Links
- www.zetland.dk/a/lhjorthmadsen
- gitlab.com/lassehmadsen/social_data_science_talk