Nick TenBrink New Adventures in Data Science:

Using Natural Language Processing to Topic Model Supreme Court Cases

The Supreme Court is the highest court in the United States. The nine justices decide which cases to hear and rule on. I used natural language processing and unsupervised learning to perform topic modeling on the Supreme Court's opinions.

The tools I used were NLTK, scikit-learn and spaCy, as well as Beautiful Soup for web scraping. I scraped the site https://caselaw.findlaw.com/court/us-supreme-court/ to get all the opinions over the history of the US Supreme Court (23,000 case documents!). My original intent was to find which Supreme Court justices were most similar to each other, but my model mainly just picked out the topics they wrote about (for example Fourth Amendment cases, Sixth Amendment cases, labor laws, etc.), so I pivoted to topic modeling on the cases themselves. Because of the large amount of “legalese” in the documents, my model often had a hard time differentiating the cases, so I ended up with one large miscellaneous group (or topic).

First I scraped Wikipedia to get the names of all the justices.

Then I scraped the site caselaw.findlaw.com to get the link to each individual case on that site, then grabbed the text of the opinions from each link.
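The link-collection step can be sketched roughly as follows. The post used Beautiful Soup; this sketch swaps in the standard library's `html.parser` so it runs without extra installs or network access. The sample HTML, the `marker` substring and the `CaseLinkParser` name are all made up for illustration, and the real page markup may look quite different.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE_URL = "https://caselaw.findlaw.com/court/us-supreme-court/"


class CaseLinkParser(HTMLParser):
    """Collect absolute URLs from <a href=...> tags that look like case links."""

    def __init__(self, base_url, marker):
        super().__init__()
        self.base_url = base_url
        self.marker = marker  # hypothetical substring that identifies a case page
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href and self.marker in href:
            # Resolve relative links against the index page URL.
            self.links.append(urljoin(self.base_url, href))


# Made-up index page standing in for a downloaded listing; not the real markup.
SAMPLE_INDEX_HTML = """
<ul>
  <li><a href="/us-supreme-court/384/436.html">Miranda v. Arizona</a></li>
  <li><a href="/us-supreme-court/347/483.html">Brown v. Board of Education</a></li>
  <li><a href="/about.html">About</a></li>
</ul>
"""

parser = CaseLinkParser(BASE_URL, marker="/us-supreme-court/")
parser.feed(SAMPLE_INDEX_HTML)
case_links = parser.links
print(case_links)
```

In a real run, each URL in `case_links` would then be downloaded and the opinion text extracted from the page body.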

I pulled out the name of each case and divided the text between justices, since each case often had multiple justices writing opinions: a majority opinion, concurring opinions and dissenting opinions.

Then I used my separated-by-justice documents to see which justices were similar and what topics (from the topic modeling) they wrote about each year.

I used tf-idf (term frequency-inverse document frequency) to pull out the words that are distinctive for a document relative to the overall corpus (all the documents). I used non-negative matrix factorization (NNMF, often abbreviated NMF) to do topic modeling. NNMF is a form of unsupervised machine learning, which I used to find topics based on a document's contents. NNMF factors high-dimensional vectors into a low-dimensional representation. In this case, I chose 20 as the number of topics I wanted it to find.

Here is one example of a case that my model correctly identified as a 4th amendment case based on the keywords highlighted:

[Figure: case text with the keywords highlighted]

Here are the results of case topics over the years:

[Figure: graph of case topics by year]

I found it interesting that there were several cases related to Native Americans in the 1960s but virtually none after that.

Also interesting is that the Supreme Court has taken a lighter caseload since the 1990s. Here is a graph of just that time period (with the miscellaneous topic removed):

[Figure: graph of case topics since 1990]

Creating a Rating System for International Soccer Teams

With the World Cup coming up, I wanted to create a better system for rating international teams than the official FIFA rankings. Elo ratings are a good place to start. Elo ratings were originally created for chess but are used in many competitive games, including esports.

Elo ratings are pretty straightforward. Each team starts with an arbitrary number, usually 1500, which is updated after every game. Elo also gives the expected probability of an outcome. In the standard form of the formula, $E_a$ is the expected win rate of team a against team b:

$$E_a = \frac{1}{1 + 10^{(R_b - R_a)/400}}$$

$R'_a$ is the updated rating after a match:

$$R'_a = R_a + K (S_a - E_a)$$

where E is the expected rate of winning, R is the Elo rating, K is the learning rate, and S is the score (win = 1, draw = 0.5, loss = 0).
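Here is a minimal Python sketch of those two quantities, assuming the standard Elo form (a base-10 logistic curve with a 400-point scale). The function names and the default `k=20.0` are placeholders for illustration, not the values tuned in this post.

```python
def expected_score(rating_a, rating_b):
    """Expected rate of winning E for team a against team b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_rating(rating_a, expected_a, score_a, k=20.0):
    """New rating R' = R + K * (S - E); k=20.0 is an assumed placeholder."""
    return rating_a + k * (score_a - expected_a)


# Two teams start at 1500 and team a wins (S = 1 for a, S = 0 for b).
r_a, r_b = 1500.0, 1500.0
e_a = expected_score(r_a, r_b)
e_b = expected_score(r_b, r_a)
new_a = update_rating(r_a, e_a, score_a=1.0)
new_b = update_rating(r_b, e_b, score_a=0.0)
print(e_a, new_a, new_b)
```

With equal ratings the expected score is 0.5 for both sides, so the winner gains exactly what the loser gives up.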

I used five years of international soccer matches, not counting friendlies, to get the ratings. I iterated through those matches to find the optimal learning rate and home field advantage. Home field advantage is a number that is simply added to the Elo rating of the home team before calculating the expected outcome and updating the rating. I optimized both by minimizing the mean squared error. In hindsight, I should have used log-loss instead, as this is a categorical problem (with three categories: win, lose or draw).

[Figure: home field advantage (HFA)]

[Figure: learning rate]
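The search over learning rate and home field advantage could look something like the sketch below. It replays the matches in order, scores each candidate pair by mean squared error as described above, and keeps the best pair. The match list, team names and candidate grids are made up, and a plain grid search is an assumption about how the search might be done.

```python
from itertools import product


def expected_score(rating_a, rating_b):
    """Standard Elo expected score for team a against team b."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def mean_squared_error(matches, k, hfa, start=1500.0):
    """Replay matches in order; return the MSE between home score and expectation."""
    ratings = {}
    total = 0.0
    for home, away, home_score in matches:
        r_home = ratings.get(home, start)
        r_away = ratings.get(away, start)
        # Home field advantage is added only when computing the expectation.
        e_home = expected_score(r_home + hfa, r_away)
        total += (home_score - e_home) ** 2
        # Zero-sum update: the away team loses what the home team gains.
        delta = k * (home_score - e_home)
        ratings[home] = r_home + delta
        ratings[away] = r_away - delta
    return total / len(matches)


# Made-up results as (home, away, home_score): 1 = win, 0.5 = draw, 0 = loss.
matches = [
    ("A", "B", 1.0),
    ("B", "C", 0.5),
    ("C", "A", 0.0),
    ("A", "C", 1.0),
    ("B", "A", 0.0),
    ("C", "B", 0.5),
]

k_values = [10, 20, 30, 40]   # hypothetical candidate learning rates
hfa_values = [0, 50, 100]     # hypothetical candidate home field advantages

best_k, best_hfa = min(
    product(k_values, hfa_values),
    key=lambda pair: mean_squared_error(matches, pair[0], pair[1]),
)
print(best_k, best_hfa, mean_squared_error(matches, best_k, best_hfa))
```

Switching to log-loss would take more than changing the error function, since basic Elo yields a single expected score rather than separate probabilities for win, draw and loss.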

This is the simplest application of Elo ratings. I could possibly increase the accuracy by accounting for the number of goals scored or the goal differential.
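One way to fold goal differential into the update is to scale K by the margin of victory. The sketch below uses the multiplier published by the World Football Elo Ratings as I understand it (1.5 for a two-goal margin, 1.75 for three, plus 1/8 for each goal beyond that). It is not something this post implemented, and the function names and `k=20.0` are placeholders.

```python
def goal_difference_multiplier(goal_diff):
    """Scale factor for K based on the absolute goal difference."""
    n = abs(goal_diff)
    if n <= 1:
        return 1.0
    if n == 2:
        return 1.5
    # Three goals gives 1.75; each further goal adds 1/8.
    return 1.75 + (n - 3) / 8.0


def update_with_margin(rating, expected, score, goal_diff, k=20.0):
    """Elo update with K scaled by the margin of victory (k is a placeholder)."""
    return rating + k * goal_difference_multiplier(goal_diff) * (score - expected)


# A 3-0 win between evenly rated teams moves the rating more than a 1-0 win.
narrow_win = update_with_margin(1500.0, 0.5, 1.0, goal_diff=1)
big_win = update_with_margin(1500.0, 0.5, 1.0, goal_diff=3)
print(narrow_win, big_win)
```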

Here are some of the ratings that I got:

Team         Elo rating
Germany      2058
Brazil       2018
Spain        2007
Argentina    2002
Portugal     1952
France       1897
Mexico       1896