Social Data Science and a Novo Nordisk case

2020/09/07

Me

Lasse Hjorth Madsen
Master of Political Science way back
Data scientist at Novo Nordisk

Political Science was broken last time I saw it

Relied mostly on survey data
Too little statistics, and of too poor quality
No consensus on what constitutes good science

The good news: Data science can fix that

Many more sources of data
Better statistical tools and understanding
A mindset from the natural sciences

Case: Finding similar deviations

Novo Nordisk Kalundborg

Deviations at Novo Nordisk

Problem: We need a way to find similar descriptions of deviations

Solution: Natural language processing. Document similarity.

Question: How to turn text into numbers?

Toy example: Document-term-frequencies

text	are	dinosaurs	reptiles	games	play	kids	like
dinosaurs are reptiles	1	1	1	0	0	0	0
dinosaurs play games	0	1	0	1	1	0	0
kids like games	0	0	0	1	0	1	1
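
The table above can be reproduced in a few lines. This is a minimal sketch, not the code used at Novo Nordisk: the helper name `document_term_matrix` and the naive whitespace tokenizer are assumptions made for illustration, and the column order is simply copied from the slide.

```python
# Toy documents from the slide.
documents = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

# Column order copied from the slide's table (not alphabetical).
vocabulary = ["are", "dinosaurs", "reptiles", "games", "play", "kids", "like"]


def document_term_matrix(docs, vocab):
    """Count how often each vocabulary term occurs in each document."""
    matrix = []
    for doc in docs:
        # Naive whitespace tokenizer (an assumption; real text needs more care).
        tokens = doc.lower().split()
        matrix.append([tokens.count(term) for term in vocab])
    return matrix


dtm = document_term_matrix(documents, vocabulary)

# Print one row of counts per document.
for doc, row in zip(documents, dtm):
    print(f"{doc:<24}", row)
```

Each row is now a vector of numbers, which is what makes it possible to compare documents numerically.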

You have got to be joking …
Bag-of-words: Word order is ignored
All words are treated equally
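
To see what "word order is ignored" means in practice, here is a small sketch. Cosine similarity is used as the similarity measure, which is an assumption: the slides do not say which measure was used. The function names `bag_of_words` and `cosine_similarity` are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Map each word to its count; the position of the word is discarded."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


original = bag_of_words("dinosaurs play games")
shuffled = bag_of_words("games play dinosaurs")
other = bag_of_words("kids like games")

# Same words in a different order give the same bag.
print(original == shuffled)
# Similarity of a document with its shuffled copy, then with another document.
print(cosine_similarity(original, shuffled))
print(cosine_similarity(original, other))
```

The shuffled sentence is indistinguishable from the original, and the third document is only partly similar because it shares a single word.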

Words plotted by frequency rank and frequency
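
The numbers behind such a plot are just word counts sorted from most to least frequent. A minimal sketch, assuming a made-up sentence as a stand-in for the real corpus (which is not part of these slides):

```python
from collections import Counter

# Hypothetical stand-in text; the talk's own corpus is not included here.
text = "the cat sat on the mat and the dog sat on the cat"

counts = Counter(text.split())

# most_common() sorts by descending frequency, so the position gives the rank.
ranked = [
    (rank, word, freq)
    for rank, (word, freq) in enumerate(counts.most_common(), start=1)
]

for rank, word, freq in ranked:
    print(rank, word, freq)
```

Plotting `rank` against `freq` gives the kind of figure the slide title describes.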

The hard-to-remember metric: tf-idf

Intuition:

How often a word appears indicates what the document is about
On the other hand, very common words carry less meaning
Inverse document frequency discounts common words:
\[idf(term) = \log\frac{n_{documents}}{n_{documents\ containing\ term}}\]
Finally, we just multiply the term frequency by the idf for that term, hence tf-idf
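
The formula above can be tried out on the toy documents. This is a sketch under stated assumptions: the natural logarithm is used (the slide does not specify a base), term frequency is taken as the raw count, and the function names `idf` and `tf_idf` are hypothetical.

```python
import math

# Toy documents from the earlier slide.
documents = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

tokenized = [doc.split() for doc in documents]
n_documents = len(tokenized)


def idf(term):
    """log(n_documents / number of documents containing the term)."""
    n_containing = sum(1 for tokens in tokenized if term in tokens)
    # Assumes the term occurs in at least one document; otherwise this
    # raises ZeroDivisionError.
    return math.log(n_documents / n_containing)


def tf_idf(term, tokens):
    """Raw count of the term in one document multiplied by the term's idf."""
    return tokens.count(term) * idf(term)


# Show the weight of every word in every document.
for doc, tokens in zip(documents, tokenized):
    weights = {term: round(tf_idf(term, tokens), 3) for term in tokens}
    print(doc, weights)
```

Words that occur in two of the three documents ("dinosaurs", "games") get a lower weight than words that occur in only one, which is exactly the discounting the idf term is meant to provide.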

Inverse-document-frequency, idf

More complex: Principal component analysis

Fun with data science

Zetland. Unbreaking news

An analysis of lyrics from popular music

Links

www.zetland.dk/a/lhjorthmadsen
gitlab.com/lassehmadsen/social_data_science_talk