Learn more about text mining: https://www.datacamp.com/courses/intro-to-text-mining-bag-of-words
Hi, I'm Ted. I'm the instructor for this intro text mining course. Let's kick things off by defining text mining and quickly covering two text mining approaches.
Academic text mining definitions are long, but I prefer a more practical approach. So text mining is simply the process of distilling actionable insights from text. Here we have a satellite image of San Diego overlaid with social media pictures and traffic information for the roads. It is simply too much information to help you navigate around town. This is like a bunch of text that you couldn’t possibly read and organize quickly, like a million tweets or the entire works of Shakespeare. You’re drinking from a firehose! So in this example if you need directions to get around San Diego, you need to reduce the information in the map. Text mining works in the same way. You can text mine a bunch of tweets or of all of Shakespeare to reduce the information just like this map. Reducing the information helps you navigate and draw out the important features.
This is a text mining workflow. After defining your problem statement you transition from an unorganized state to an organized state, finally reaching an insight. In chapter 4, you'll use this in a case study comparing google and amazon.
The text mining workflow can be broken up into 6 distinct components. Each step is important and helps to ensure you have a smooth transition from an unorganized state to an organized state. This helps you stay organized and increases your chances of a meaningful output.
The first step involves problem definition. This lays the foundation for your text mining project. Next is defining the text you will use as your data. As with any analytical project it is important to understand the medium and data integrity because these can effect outcomes. Next you organize the text, maybe by author or chronologically.
Step 4 is feature extraction. This can be calculating sentiment or in our case extracting word tokens into various matrices. Step 5 is to perform some analysis. This course will help show you some basic analytical methods that can be applied to text. Lastly, step 6 is the one in which you hopefully answer your problem questions, reach an insight or conclusion, or in the case of predictive modeling produce an output.
Now let’s learn about two approaches to text mining. The first is semantic parsing based on word syntax. In semantic parsing you care about word type and order. This method creates a lot of features to study. For example a single word can be tagged as part of a sentence, then a noun and also a proper noun or named entity. So that single word has three features associated with it. This effect makes semantic parsing "feature rich". To do the tagging, semantic parsing follows a tree structure to continually break up the text.
In contrast, the bag of words method doesn’t care about word type or order. Here, words are just attributes of the document. In this example we parse the sentence "Steph Curry missed a tough shot". In the semantic example you see how words are broken down from the sentence, to noun and verb phrases and ultimately into unique attributes.
Bag of words treats each term as just a single token in the sentence no matter the type or order. For this introductory course, we’ll focus on bag of words, but will cover more advanced methods in later courses!
Let’s get a quick taste of text mining!