2020/09/07

Me

  • Lasse Hjorth Madsen
  • Master of Political Science way back
  • Data scientist at Novo Nordisk

Political Science was broken last time I saw it

  • Relied mostly on survey data
  • Too little statistics of too poor quality
  • No consensus on what constitutes good science

The good news: Data science can fix that

  • Many more sources of data
  • Better statistical tools and understanding
  • A mindset from the natural sciences

Case: Finding similar deviations

Novo Nordisk Kalundborg

Deviations at Novo Nordisk

  • Problem: We need a way to find similar descriptions of deviations
  • Solution: Natural language processing. Document similarity.
  • Question: How to turn text into numbers?

Toy example: Document-term-frequencies

text                    are  dinosaurs  reptiles  games  play  kids  like
dinosaurs are reptiles    1          1         1      0     0     0     0
dinosaurs play games      0          1         0      1     1     0     0
kids like games           0          0         0      1     0     1     1


  • You have got to be joking …
  • Bag-of-words: Word order is ignored
  • All words are treated equally
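
A minimal Python sketch of how the toy table can be built, assuming lowercase whitespace tokenization. The function name `document_term_matrix` is hypothetical, and the columns come out in alphabetical order, so they are ordered differently from the table above.

```python
from collections import Counter

# The three toy documents from the table above.
docs = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

def document_term_matrix(documents):
    """Return (vocabulary, rows), where rows[i][j] counts term j in document i.

    Assumption: tokens are lowercase words split on whitespace.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # One column per distinct term, sorted alphabetically.
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)  # bag-of-words: only the counts survive
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows

vocabulary, matrix = document_term_matrix(docs)
print(vocabulary)
for doc, row in zip(docs, matrix):
    print(f"{doc:<24}", row)
```

Because only counts are kept, "dinosaurs play games" and "games play dinosaurs" produce the same row. That is what "word order is ignored" means in practice.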

Words plotted by frequency rank and frequency

The hard-to-remember metric: tf-idf

Intuition:

  • How often a word appears in a document indicates what the document is about
  • On the other hand, very common words carry less meaning
  • Inverse document frequency discounts common words:
  • \[idf(term) = \log\frac{n_{documents}}{n_{documents\ containing\ term}}\]
  • Finally, we multiply the term frequency by the idf for that term, hence tf-idf
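
The intuition above can be sketched in a few lines of Python, applied to the three toy documents. Several things here are assumptions rather than part of the slides: term frequency is the raw count, the logarithm is natural, and the names `tf_idf` and `cosine_similarity` are hypothetical. Cosine similarity is one common way to compare the resulting vectors, not necessarily the measure used in the deviation case.

```python
import math
from collections import Counter

docs = [
    "dinosaurs are reptiles",
    "dinosaurs play games",
    "kids like games",
]

def tf_idf(documents):
    """Return (idf, weights); weights[i] maps each term in document i to tf * idf.

    Assumptions: tf is the raw count, log is the natural logarithm.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_documents = len(tokenized)
    # Number of documents that contain each term at least once.
    document_frequency = Counter(
        term for tokens in tokenized for term in set(tokens)
    )
    idf = {
        term: math.log(n_documents / df)
        for term, df in document_frequency.items()
    }
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({term: tf * idf[term] for term, tf in counts.items()})
    return idf, weights

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

idf, weights = tf_idf(docs)
for term in sorted(idf):
    print(f"{term:<10} idf = {idf[term]:.3f}")
print("doc 1 vs doc 2:", round(cosine_similarity(weights[0], weights[1]), 3))
print("doc 1 vs doc 3:", round(cosine_similarity(weights[0], weights[2]), 3))
```

Words that occur in two of the three documents ("dinosaurs", "games") get a lower idf than words that occur in only one. The first and third documents share no words, so their similarity is zero.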

Inverse-document-frequency, idf

More complex: Principal component analysis

Fun with data science

Zetland. Unbreaking news

An analysis of lyrics from popular music

Links