Text Mining

This exercise lets you explore text mining in more depth. It extends the example you know from the text: instead of only eight tweets, you analyze the complete corpus of Donald Trump's tweets from 2017 and use word clouds to visualize how your preprocessing modifies the data.

Data and Libraries

Your task in this exercise is to analyze textual data. You will perform various processing steps and see how the result of a simple visualization with word clouds evolves. Everything you need is available in the nltk and wordcloud libraries (plus some standard-library tools, e.g., for regular expressions).
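If the libraries are not installed yet, a minimal setup might look like the sketch below (the package names are assumed to be the standard PyPI ones; matplotlib is only used to display the clouds):

# shell: pip install nltk wordcloud matplotlib

import re
import nltk
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# resources used later for tokenization and stopword removal;
# newer nltk versions may additionally require nltk.download('punkt_tab')
nltk.download('stopwords')
nltk.download('punkt')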

For this exercise set, we provide Donald Trump's tweets from 2017. You can download the data here; each line contains a single tweet.
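Assuming the file is saved as trump_tweets_2017.txt (the actual name depends on the download), the corpus can be read line by line:

# each line is one tweet; skip empty lines
with open('trump_tweets_2017.txt', encoding='utf-8') as f:
    tweets = [line.strip() for line in f if line.strip()]

print(len(tweets), 'tweets loaded')
print(tweets[0])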

Word clouds without pre-processing

Load the data and create a word cloud without any further processing of the text data. Does this already work? What problems do you observe?
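A minimal sketch, assuming the tweets list from above: concatenate the raw tweets and hand them to WordCloud. Note that WordCloud applies a small built-in English stopword list by default, so pass an empty set if you want the truly unprocessed picture.

raw_text = ' '.join(tweets)

wc = WordCloud(width=800, height=400, background_color='white',
               stopwords=set())          # disable the built-in stopword list
wc.generate(raw_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()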

Pre-processing textual data

Clean up the textual data, e.g., using the methods discussed in the lecture. Create a new word cloud based on the cleaned corpus.
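One possible cleaning pipeline, as a sketch (the exact steps depend on what was covered in the lecture): lowercasing, removing URLs, mentions, and hashtags, tokenizing, dropping stopwords and non-alphabetic tokens, and stemming.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

stop_words = set(stopwords.words('english'))
stemmer = PorterStemmer()

def clean(tweet):
    tweet = tweet.lower()
    tweet = re.sub(r'https?://\S+', '', tweet)   # remove URLs
    tweet = re.sub(r'[@#]\w+', '', tweet)        # remove mentions and hashtags
    tokens = word_tokenize(tweet)
    tokens = [t for t in tokens if t.isalpha() and t not in stop_words]
    return [stemmer.stem(t) for t in tokens]

cleaned_tweets = [clean(t) for t in tweets]

# word cloud over the cleaned corpus
cleaned_text = ' '.join(token for tweet in cleaned_tweets for token in tweet)
wc = WordCloud(width=800, height=400, background_color='white').generate(cleaned_text)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()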

Use TF-IDF instead of TF

By default, the word clouds are based on simple term frequencies (TF). Calculate the TF-IDF, i.e., the term frequency weighted by the inverse document frequency, and create a new word cloud based on these weights. How does the word cloud change?
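A sketch of one common variant, treating each tweet as a document and using idf = log(N / df); other formulations (e.g., with smoothing) are equally valid. The aggregated weights can be passed to WordCloud via generate_from_frequencies. This assumes cleaned_tweets from the previous step.

import math
from collections import Counter

n_docs = len(cleaned_tweets)

# document frequency: in how many tweets does each term occur?
df = Counter()
for tokens in cleaned_tweets:
    df.update(set(tokens))

# corpus-wide term frequency weighted by inverse document frequency
tf = Counter(token for tokens in cleaned_tweets for token in tokens)
tfidf = {term: count * math.log(n_docs / df[term]) for term, count in tf.items()}

wc = WordCloud(width=800, height=400, background_color='white')
wc.generate_from_frequencies(tfidf)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

Note that terms occurring in every tweet receive a weight of zero with this formula and therefore disappear from the cloud.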