2020/09/07
Me
- Lasse Hjorth Madsen
- Master of Political Science way back
- Data scientist at Novo Nordisk
Political Science was broken last time I saw it
- Relied mostly on survey data
- Too little statistics of too poor quality
- No consensus on what constitutes good science
The good news: Data science can fix that
- Many more sources of data
- Better statistical tools and understanding
- A mindset from the natural sciences
Case: Finding similar deviations
Novo Nordisk Kalundborg
Deviations at Novo Nordisk
- Problem: We need a way to find similar descriptions of deviations
- Solution: Natural language processing. Document similarity.
- Question: How to turn text into numbers?
Toy example: Document-term-frequencies
| text | are | dinosaurs | reptiles | games | play | kids | like |
|---|---|---|---|---|---|---|---|
| dinosaurs are reptiles | 1 | 1 | 1 | 0 | 0 | 0 | 0 |
| dinosaurs play games | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
| kids like games | 0 | 0 | 0 | 1 | 0 | 1 | 1 |
- You have got to be joking …
- Bag-of-words: Word order is ignored
- All words are treated equally
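As a rough illustration of the toy example, the following Python sketch rebuilds the document-term counts for the three example sentences and checks that word order really is ignored. The helper names (`term_counts`, `cosine`) are made up for this illustration. Cosine similarity is just one common way to compare count vectors and is an assumption here, not necessarily the measure used in the deviation case.

```python
from collections import Counter
from math import sqrt

docs = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

# Vocabulary in the same column order as the toy table.
vocab = ["are", "dinosaurs", "reptiles", "games", "play", "kids", "like"]


def term_counts(text, vocab):
    """Count how often each vocabulary word occurs in a text (hypothetical helper)."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Cosine similarity of two count vectors; 0.0 if either is all zeros.

    Assumption: cosine is used only as an example similarity measure.
    """
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# One row of counts per document: the document-term matrix.
dtm = [term_counts(doc, vocab) for doc in docs]
for doc, row in zip(docs, dtm):
    print(f"{doc:<24}", row)

# Bag-of-words: reordering the words gives exactly the same row.
assert term_counts("reptiles are dinosaurs", vocab) == dtm[0]

# Documents 1 and 2 share one word ("dinosaurs"); documents 1 and 3 share none.
print(round(cosine(dtm[0], dtm[1]), 3))
print(round(cosine(dtm[0], dtm[2]), 3))
```

Every word counts the same in this sketch, so a shared "are" would weigh as much as a shared "dinosaurs", which is the weakness the last bullet points at.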
Words plotted by frequency rank and frequency
