{"id":144124,"date":"2022-07-27T08:00:25","date_gmt":"2022-07-27T12:00:25","guid":{"rendered":"https:\/\/www.kdnuggets.com\/?p=144124"},"modified":"2022-10-11T19:10:57","modified_gmt":"2022-10-11T23:10:57","slug":"is-domain-knowledge-important-for-machine-learning","status":"publish","type":"post","link":"https:\/\/www.kdnuggets.com\/2022\/07\/domain-knowledge-important-machine-learning.html","title":{"rendered":"Is Domain Knowledge Important for Machine Learning?"},"content":{"rendered":"<p><center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/rosidi_domain_knowledge_important_machine_learning_2.png\" alt=\"Is Domain Knowledge Important for Machine Learning?\" width=\"100%\"\/><\/center><br \/>\n&nbsp; <\/p>\n<p>Developing machine learning models involves a lot of steps. Whether you\u2019re working with labeled or unlabeled data, you might think numbers are just numbers, and it doesn\u2019t matter what each of the features of a dataset signifies when it comes to spitting out insights with the potential for true impact. It\u2019s true that there are tons of great machine learning libraries out there like <a href=\"https:\/\/scikit-learn.org\/stable\/\" rel=\"noopener\" target=\"_blank\">scikit-learn<\/a> which make it straightforward to gather up some data and plop them into a cookie-cutter model. 
Pretty quickly, you might start to think there\u2019s no problem you can\u2019t solve with machine learning.<\/p>\n<p>Frankly, that\u2019s a beginner\u2019s mindset. You are not yet aware of everything you don\u2019t know. Datasets given in machine learning courses or the free ones you find online have often already been groomed and are convenient to use when applying machine learning models, but once you take your skills and knowledge out of the play-pen and into the real world, you\u2019ll face some additional challenges.<\/p>\n<p>Lots of people believe that domain knowledge, or additional 
knowledge regarding the industry or area the data pertains to, is superfluous. And it\u2019s kind of true. Do you NEED domain knowledge in the area you\u2019re developing the model for? No. You can still produce fairly accurate models without it. In principle, machine learning and deep learning can be treated as black-box approaches. This means you can put labeled data into a model without deep knowledge of the area and without even looking at the data very closely.<\/p>\n<p>If you go down this route, though, you\u2019ll have to deal with the consequences. This is a very inefficient way to train classifiers: to produce accurate models, you\u2019ll need massive amounts of labeled data and a lot of computational power.<\/p>\n<p>If you incorporate domain knowledge into your architecture and your model, it can make it a lot easier to explain the results, both to yourself and to an outside viewer. Every bit of domain knowledge can serve as a stepping stone through the black box of a machine learning model.<\/p>\n<p>It\u2019s very easy to think that domain knowledge isn\u2019t required because, for lots of visible datasets like <a href=\"https:\/\/cocodataset.org\/#home\" rel=\"noopener\" target=\"_blank\">COCO<\/a>, the limited domain knowledge that is required is part of being a seeing human. Even more complex datasets that contain cancer cells are similarly obvious to the human eye, despite a lack of expert-level knowledge. 
You can do a basic evaluation of similarities or differences between cells without any specific medical knowledge.<\/p>\n<p>Natural language processing (NLP) and computer vision are prime examples of areas where it\u2019s easy to think that domain knowledge is entirely unnecessary. That\u2019s largely because they are such normal tasks for us that we may not even notice how we\u2019re applying our domain knowledge.<\/p>\n<p>If you start working in areas like outlier detection, which isn\u2019t such an everyday human task, the importance of domain knowledge quickly becomes apparent.<\/p>\n<p>&nbsp;<\/p>\n<h1>Domain Knowledge for Data Pre-Processing<\/h1>\n<p>&nbsp;<\/p>\n<p><center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/rosidi_domain_knowledge_important_machine_learning_3.png\" alt=\"Domain Knowledge for Data Pre-Processing\" width=\"70%\"\/><\/center><br \/>\n&nbsp; <\/p>\n<p>Let\u2019s dig into how domain knowledge can be leveraged in the data pre-processing step of the machine learning model development cycle.<\/p>\n<p>In a dataset, not every data point has the same value. If you collect 100 new samples that are identical, they don\u2019t give the model any additional information. They might actually skew the model toward a pattern that isn\u2019t important.<\/p>\n<p>If you\u2019re looking at 100 pictures of umbrellas, and you know this model is supposed to classify all kinds of accessories, it\u2019s clear that your sample dataset isn\u2019t representative of the whole population. 
Without domain knowledge, it can be very challenging to know which data points add value and which are already represented in the dataset.<\/p>\n<p>If you\u2019re working in an area that doesn\u2019t so easily lend itself to your existing general knowledge, you can unknowingly build biases into the training data, which can hurt the accuracy and robustness of your model.<\/p>\n<p>Another way in which domain knowledge can pack a punch in the data pre-processing step is in determining feature importance. If you have a good feel for the importance of each feature, you can develop better strategies to process the data accordingly. To do so, it\u2019s really important to understand what the features actually represent. This has a big influence on how you handle the features going forward.<\/p>\n<p>&nbsp;<\/p>\n<h1>Domain Knowledge for Choosing the Right Model<\/h1>\n<p>&nbsp;<\/p>\n<p><center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/rosidi_domain_knowledge_important_machine_learning_1.png\" alt=\"Domain Knowledge for Choosing the Right Model\" width=\"70%\"\/><\/center><br \/>\n&nbsp; <\/p>\n<p>There are many different machine learning models, and some fit better than others depending on several factors. Is the data labeled or unlabeled? How much data do you have? What data types do the features have? Are the data types of the features homogeneous? Random forests, for example, can handle heterogeneous feature types with relatively little preprocessing. Is your target output a continuous value or a class label? Choosing the right model is important, but it\u2019s very rare to be able to apply your selected model directly without making adjustments to it.<\/p>\n<p>Selecting the right model requires in-depth machine learning knowledge, but there are lots of resources out there to help you make your selection if you\u2019re not quite a machine learning expert. 
I\u2019ve gathered my top three machine learning cheat sheets, from <a href=\"https:\/\/towardsdatascience.com\/5-minutes-cheat-sheet-explaining-all-machine-learning-models-3fea1cf96f05\" rel=\"noopener\" target=\"_blank\">Towards Data Science<\/a>, <a href=\"https:\/\/www.datacamp.com\/cheat-sheet\/machine-learning-cheat-sheet\" rel=\"noopener\" target=\"_blank\">DataCamp<\/a>, and <a href=\"https:\/\/docs.microsoft.com\/en-us\/azure\/machine-learning\/algorithm-cheat-sheet\" rel=\"noopener\" target=\"_blank\">Microsoft<\/a>.<\/p>\n<p>&nbsp;<\/p>\n<h1>Domain Knowledge for Adjusting Model and Architecture<\/h1>\n<p>&nbsp;<\/p>\n<p><center><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/rosidi_domain_knowledge_important_machine_learning_4.png\" alt=\"Domain Knowledge for Adjusting Model and Architecture\" width=\"70%\"\/><\/center><br \/>\n&nbsp; <\/p>\n<p>Domain knowledge allows you to better adjust the model to fit the situation. Mathematical optimizations only go so far, and to get big jumps in performance, you often need considerable domain knowledge in the area.<\/p>\n<p>One significant way to use your domain knowledge to improve the accuracy and robustness of your model is to build that knowledge into the architecture of the model you are developing.<\/p>\n<p>As I mentioned before, natural language processing is one of the areas of machine learning which makes it clear how domain knowledge can be helpful. Let\u2019s talk about word embeddings and attention to showcase how speaking a human language is a big help, but thinking like a linguist can really elevate the performance of a natural language processing model.<\/p>\n<p>&nbsp;<\/p>\n<h2>Domain Knowledge for Natural Language Processing<\/h2>\n<p>&nbsp;<\/p>\n<p>Domain knowledge has been applied across many applications of machine learning. 
Small adjustments have been made over the last few decades to better apply machine learning models in many areas. The models used in natural language processing are a clear case. Let\u2019s walk through a few examples of how these developments came about.<\/p>\n<p>&nbsp;<\/p>\n<h3>Word Embeddings<\/h3>\n<p>&nbsp;<\/p>\n<p>If you think about numbers and words, there\u2019s a pretty big difference in how we work with them. If you have the heights of everyone in a group, you can easily spit out some stats regarding the median, outliers, etc., all based on height. If everyone in the group gave you a word to represent how they are feeling today, how would you convert that to any kind of meaningful aggregate?<\/p>\n<p>You should consider what you can do to create a digital representation of a word. Should you just use letters? Does that make sense? As people who speak the language, we immediately grasp the meaning behind a word. We do not store words by their letters. Think about a tree. Did you picture a tree or did your mind go to t-r-e-e? Storing a word as its letters doesn\u2019t really bring us any advantage when it comes to understanding meaning or significance.<\/p>\n<p><a href=\"https:\/\/machinelearningmastery.com\/what-are-word-embeddings\/\" rel=\"noopener\" target=\"_blank\">Word embeddings<\/a> are \u201ca type of word representation that allows words with similar meaning to have a similar [numerical] representation\u201d. The numerical representations are learned using <a href=\"https:\/\/www.stratascratch.com\/blog\/overview-of-machine-learning-algorithms-unsupervised-learning\/\" rel=\"noopener\" target=\"_blank\">unsupervised learning models<\/a>.<\/p>\n<p>These numerical representations are vectors that represent how a word is used. 
This numerical representation goes so far as to allow you to use the Euclidean distance between two word representations to quantify how similarly two words are used in the training text.<\/p>\n<p>The vectors for \u201cAdidas\u201d and \u201cNike\u201d would likely be quite similar. Exactly what each dimension of the vector represents is unclear, as the vectors are learned using unsupervised learning, but it makes sense that words representing similar concepts have similar representations as far as the model is concerned.<\/p>\n<p>Check out our post \u201c<a href=\"https:\/\/www.stratascratch.com\/blog\/supervised-vs-unsupervised-learning\/\" rel=\"noopener\" target=\"_blank\">Supervised vs Unsupervised Learning<\/a>\u201d if you want to know what supervised and unsupervised learning actually are and the algorithms that use these learning approaches.<\/p>\n<p>&nbsp;<\/p>\n<h3>Attention<\/h3>\n<p>&nbsp;<\/p>\n<p>Attention is a very valuable concept, and it has made its way into natural language processing and image recognition for a very good reason.<\/p>\n<p>Neural networks have been applied to natural language processing tasks such as translation since <a href=\"https:\/\/aylien.com\/blog\/a-review-of-the-recent-history-of-natural-language-processing\" rel=\"noopener\" target=\"_blank\">the early 2000s<\/a>. Around 2013, <a href=\"https:\/\/www.analyticsvidhya.com\/blog\/2021\/06\/lstm-for-text-classification\/\" rel=\"noopener\" target=\"_blank\">long short-term memory<\/a> (LSTM) networks rose to prominence in the field and dominated for a few years. An LSTM translation model reads the sentence, creates a hidden representation, and then uses the hidden representation to generate the output sentence.<\/p>\n<p>As humans, if we translate, we don\u2019t just read the sentence and spit out the translation. 
We tend to look at the whole sentence again and again, or we\u2019ll focus on certain parts when we want to revisit the context of the target word.<\/p>\n<p>Take the word \u201cread\u201d, for example. Is it representing present or past tense? Which other parts of the sentence do you focus on to determine that? Do you need information from surrounding sentences? \u201cI read books every year\u201d means one thing on its own, but if I were to say, \u201cI did many things before I went to university. I read books every year. I ate dinner with my parents every day,\u201d the sentence takes on a different meaning: \u201cread\u201d represents an action that takes place in the present in the first example and in the past in the second.<\/p>\n<p>Attention builds these relationships between different words in the sentence or sentences. For each word we want to translate, it highlights different words in the original sentence according to how important they are to the target word. Someone who isn\u2019t an experienced or professional translator might just translate word-for-word, and while the general information would still be conveyed, accounting for the importance of associated words when translating the target word will produce a much more accurate translation. Attention allows us to build that professional translation pattern into the architecture of the deep learning model.<\/p>\n<p>&nbsp;<\/p>\n<h1>Why Domain Knowledge Is Crucial for Machine Learning<\/h1>\n<p>&nbsp;<\/p>\n<p>Without domain knowledge, you can check all the boxes of producing an acceptable model which spits out some numbers. With domain knowledge, you\u2019ll know what data is best to use to train and test your model. You will also realize how you can tailor the model you use to better represent the dataset and the problem you\u2019re trying to tackle, and how to make the best use of the insights your model produces.<\/p>\n<p>Machine learning is a toolbox. 
If you pull out an electric saw, you\u2019ll probably manage to cut some wood, but you probably won\u2019t be able to construct a bunch of cabinets without the expert knowledge of a carpenter. Domain knowledge lets you take the impact of your machine learning skills to a much higher level.<\/p>\n<p>&nbsp;<br \/>\n&nbsp;<br \/>\n<b><a href=\"https:\/\/www.stratascratch.com\" target=\"_blank\" rel=\"noopener\">Nate Rosidi<\/a><\/b> is a data scientist who also works in product strategy. He's also an adjunct professor teaching analytics, and is the founder of <a href=\"https:\/\/www.stratascratch.com\/\">StrataScratch<\/a>, a platform helping data scientists prepare for their interviews with real interview questions from top companies. Connect with him on <a href=\"https:\/\/twitter.com\/StrataScratch\">Twitter: StrataScratch<\/a> or <a href=\"https:\/\/www.linkedin.com\/in\/nathanrosidi\/\">LinkedIn<\/a>.<br \/>\n&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"If you incorporate domain knowledge into your architecture and your model, it can make it a lot easier to explain the results, both to yourself and to an outside viewer. 
Every bit of domain knowledge can serve as a stepping stone through the black box of a machine learning model.\n","protected":false},"author":206,"featured_media":144133,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","_seopress_robots_follow":"","_seopress_robots_imageindex":"","_seopress_robots_snippet":"","_seopress_robots_primary_cat":"none","_seopress_robots_breadcrumbs":"","_seopress_robots_freeze_modified_date":"","_seopress_robots_custom_modified_date":"","_seopress_robots_canonical":"","_seopress_social_fb_title":"","_seopress_social_fb_desc":"","_seopress_social_fb_img":"","_seopress_social_fb_img_attachment_id":0,"_seopress_social_fb_img_width":0,"_seopress_social_fb_img_height":0,"_seopress_social_twitter_title":"","_seopress_social_twitter_desc":"","_seopress_social_twitter_img":"","_seopress_social_twitter_img_attachment_id":0,"_seopress_social_twitter_img_width":0,"_seopress_social_twitter_img_height":0,"_seopress_redirections_value":"","_seopress_redirections_enabled":"","_seopress_redirections_enabled_regex":"","_seopress_redirections_logged_status":"both","_seopress_redirections_param":"","_seopress_redirections_type":301,"_seopress_analysis_target_kw":"","inline_featured_image":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"mc4wp_mailchimp_campaign":[],"footnotes":"","_links_to":"","_links_to_target":""},"categories":[5286],"tags":[197],"class_list":["post-144124","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kdnuggets-originals","tag-machine-learning"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/144124","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdnuggets.co
m\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/users\/206"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/comments?post=144124"}],"version-history":[{"count":0,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/144124\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media\/144133"}],"wp:attachment":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media?parent=144124"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/categories?post=144124"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/tags?post=144124"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}