{"id":205602,"date":"2026-06-05T08:00:45","date_gmt":"2026-06-05T12:00:45","guid":{"rendered":"https:\/\/www.kdnuggets.com\/?p=205602"},"modified":"2026-06-04T16:24:41","modified_gmt":"2026-06-04T20:24:41","slug":"3-spacy-tricks-for-efficient-text-processing-entity-recognition","status":"publish","type":"post","link":"https:\/\/www.kdnuggets.com\/3-spacy-tricks-for-efficient-text-processing-entity-recognition","title":{"rendered":"3 SpaCy Tricks for Efficient Text Processing &#038; Entity Recognition"},"content":{"rendered":"<p><img decoding=\"async\" src=\"https:\/\/www.kdnuggets.com\/wp-content\/uploads\/kdn-3-spacy-tricks-for-efficient-text-processing-entity-recognition-feature.png\" alt=\"3 SpaCy Tricks for Efficient Text Processing & Entity Recognition\" width=\"100%\" \/><br \/>\n&nbsp;<\/p>\n<h2><font color=\"#f3ac35\">#&nbsp;<\/font>Introduction<\/h2>\n<p>&nbsp;<br \/>\nThanks especially to contemporary large language models, <a href=\"https:\/\/www.kdnuggets.com\/tag\/natural-language-processing\" target=\"_blank\"><strong>natural language processing<\/strong><\/a> (NLP) is a fundamental pillar of modern AI and software systems. You'll find NLP techniques and technologies powering everything from search engines and chatbots to automated customer support routing and entity extraction pipelines. When it comes to production-grade NLP in Python, <a href=\"https:\/\/spacy.io\/\" target=\"_blank\"><strong>spaCy<\/strong><\/a> is the undisputed industry standard. 
spaCy is designed specifically for production use, offering industrial-strength speed, pre-trained statistical and transformer models, and an intuitive API.<\/p>\n<p>Unfortunately, many developers treat spaCy as a simple black box monolith. They load a model, run it on text, and accept the default processing speeds and extraction limits. When scaling from a local prototype to processing millions of documents, these default configurations can become computational bottlenecks, leading to latency, bloated memory footprints, and missed domain-specific entities. In order to build high-performance text processing pipelines, you must understand how to optimize spaCy's internal execution flow.<\/p>\n<p>In this article, we will explore three essential spaCy tricks that every developer should have in their toolkit to maximize processing speed and customize entity recognition: selective pipeline loading, parallel batch processing, and hybrid rule-based statistical entity recognition.<\/p>\n<p>Before getting started, ensure you have spaCy installed, as well as its lightweight general-purpose English model:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>pip install spacy\r\npython -m spacy download en_core_web_sm<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<h2><font color=\"#f3ac35\">#&nbsp;<\/font>1. 
Selective Pipeline Loading & Component Disabling<\/h2>\n<p>&nbsp;<br \/>\nBy default, when you load a pre-trained spaCy model (such as <code style=\"background: #F5F5F5;\">en_core_web_sm<\/code>), spaCy initializes a complete NLP pipeline. This pipeline typically includes:<\/p>\n<ul>\n<li>a tokenizer<\/li>\n<li>a part-of-speech tagger (<code style=\"background: #F5F5F5;\">tagger<\/code>)<\/li>\n<li>a dependency parser (<code style=\"background: #F5F5F5;\">parser<\/code>)<\/li>\n<li>a lemmatizer (<code style=\"background: #F5F5F5;\">lemmatizer<\/code>)<\/li>\n<li>an attribute ruler (<code style=\"background: #F5F5F5;\">attribute_ruler<\/code>)<\/li>\n<li>a named entity recognizer (<code style=\"background: #F5F5F5;\">ner<\/code>)<\/li>\n<\/ul>\n<p>While this rich default feature set is excellent, it comes with substantial computational overhead. If your application only needs to perform named entity recognition (NER), running the dependency parser and lemmatizer is a waste of CPU cycles and memory. Conversely, if you are only cleaning text and extracting lemmas, running the statistical NER model is equally wasteful. You can avoid this by excluding components at load time, or by temporarily disabling them during execution using a context manager.<\/p>\n<p>This naive approach loads and runs every default component on the text, regardless of whether the components' outputs are actually used:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\nimport time\r\n\r\n# Load the small English model\r\nnlp = spacy.load(\"en_core_web_sm\")\r\n\r\ntexts = [\"Apple is looking at buying U.K. 
startup for $1 billion\"] * 1000\r\n\r\n# Naive execution: runs tagger, parser, lemmatizer, and ner on every doc\r\n# Assume we only care about named entities here\r\nstart_time = time.time()\r\nfor text in texts:\r\n    doc = nlp(text)\r\n    entities = [(ent.text, ent.label_) for ent in doc.ents]\r\n\r\nduration_full = time.time() - start_time\r\n\r\nprint(f\"Full pipeline processed 1,000 docs in: {duration_full:.4f} seconds\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<div style=\"width:98%; overflow:auto; padding-left:10px; padding-bottom:10px; padding-top:10px; background:#E1E1E1\">\n<pre><code>Full pipeline processed 1,000 docs in: 2.8540 seconds<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Now let's optimize execution in two specific ways. First, we will be excluding heavy, unused components like the dependency parser at load time. Second, we will use <code style=\"background: #F5F5F5;\">nlp.select_pipes()<\/code> to temporarily disable components when processing specific workloads.<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\nimport time\r\n\r\n# Load time optimization: Exclude the heavy parser and tagger from the start\r\n# This reduces initialization time and memory footprint\r\nnlp_optimized = spacy.load(\"en_core_web_sm\", exclude=[\"parser\", \"tagger\"])\r\n\r\ntexts = [\"Apple is looking at buying U.K. 
startup for $1 billion\"] * 1000\r\n\r\n# Context-manager optimization: disable components temporarily\r\n# parser and tagger were excluded at load time; attribute_ruler and lemmatizer are disabled here\r\nstart_time = time.time()\r\nwith nlp_optimized.select_pipes(disable=[\"attribute_ruler\", \"lemmatizer\"]):\r\n    for text in texts:\r\n        doc = nlp_optimized(text)\r\n        entities = [(ent.text, ent.label_) for ent in doc.ents]\r\n\r\nduration_opt = time.time() - start_time\r\n\r\nprint(f\"Optimized pipeline processed 1,000 docs in: {duration_opt:.4f} seconds\")\r\n# duration_full comes from the naive snippet above; run both in the same session\r\nprint(f\"Speedup: {duration_full \/ duration_opt:.2f}x faster!\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Let's compare runtimes:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #E1E1E1;\">\n<pre><code>Full pipeline processed 1,000 docs in: 2.8739 seconds\r\nOptimized pipeline processed 1,000 docs in: 1.7859 seconds\r\nSpeedup: 1.61x faster!<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In the optimized example, passing <code style=\"background: #F5F5F5;\">exclude=[\"parser\", \"tagger\"]<\/code> to <code style=\"background: #F5F5F5;\">spacy.load()<\/code> completely prevents these components from being loaded into memory. As a second way of reaching much the same outcome, we passed <code style=\"background: #F5F5F5;\">disable=[\"attribute_ruler\", \"lemmatizer\"]<\/code> to <code style=\"background: #F5F5F5;\">nlp_optimized.select_pipes()<\/code>, which temporarily disables those components inside the <code style=\"background: #F5F5F5;\">with<\/code> block and restores them when the block exits. As a result, when we process the text, spaCy skips dependency parsing, part-of-speech tagging, and lemmatization, which are computationally expensive, and goes straight to entity recognition. This yields a noticeable speedup (about 1.6x in this run) with no effect on NER accuracy, and the absolute time saved grows with the size of the corpus.<\/p>\n<p>&nbsp;<\/p>\n<h2><font color=\"#f3ac35\">#&nbsp;<\/font>2. 
High-Throughput Batch Processing with nlp.pipe & Metadata Propagation<\/h2>\n<p>&nbsp;<br \/>\nIf you are iterating over a large corpus (e.g. pandas DataFrames, database rows, or raw text files), calling the <code style=\"background: #F5F5F5;\">nlp<\/code> object on individual strings in a loop (e.g. <code style=\"background: #F5F5F5;\">[nlp(text) for text in texts]<\/code>) is an anti-pattern.<\/p>\n<p>Sequential processing prevents spaCy from optimizing memory buffers, grouping operations, and leveraging multi-core parallelization. Also, when processing text for database storage or ETL pipelines, you often need to carry metadata (like a record ID, timestamp, or category) through the NLP process so you can map the resulting entities back to the correct database rows.<\/p>\n<p>The solution is to use <code style=\"background: #F5F5F5;\">nlp.pipe()<\/code>. This method processes documents as a <em>stream<\/em>, buffers them internally, and supports multi-processing. By setting <code style=\"background: #F5F5F5;\">as_tuples=True<\/code>, you can feed tuples of <code style=\"background: #F5F5F5;\">(text, context)<\/code> to spaCy. 
It will return <code style=\"background: #F5F5F5;\">(doc, context)<\/code> pairs, letting you pass metadata straight through the pipeline.<\/p>\n<p>This naive approach runs processing sequentially and uses manual index tracking to align the resulting documents with their database IDs, which is brittle and slow:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\nimport time\r\n\r\nnlp = spacy.load(\"en_core_web_sm\", exclude=[\"parser\", \"tagger\"])\r\n\r\n# Raw database records with unique IDs\r\nrecords = [\r\n    {\"id\": f\"DB-REC-{i}\", \"text\": \"Google was founded in September 1998 by Larry Page and Sergey Brin.\"}\r\n    for i in range(1000)\r\n]\r\n\r\n# Sequential loop: slow and manually managed metadata\r\nstart_time = time.time()\r\nextracted_data = []\r\nfor i, record in enumerate(records):\r\n    doc = nlp(record[\"text\"])\r\n    entities = [(ent.text, ent.label_) for ent in doc.ents]\r\n    extracted_data.append({\r\n        \"id\": record[\"id\"],\r\n        \"entities\": entities\r\n    })\r\n\r\nduration_seq = time.time() - start_time\r\n\r\nprint(f\"Sequential loop processed 1,000 docs in: {duration_seq:.4f} seconds\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #E1E1E1;\">\n<pre><code>Sequential loop processed 1,000 docs in: 2.7375 seconds<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Here, we stream the data using <code style=\"background: #F5F5F5;\">nlp.pipe<\/code>, leveraging batch processing and multi-core parallelization (<code style=\"background: #F5F5F5;\">n_process<\/code>), while letting the database ID ride along as a context variable:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\nimport 
time\r\n\r\n# Keep your imports and definitions global so child processes can see them\r\nnlp = spacy.load(\"en_core_web_sm\", exclude=[\"parser\", \"tagger\"])\r\n\r\n# Wrap the actual execution code in the main block\r\nif __name__ == '__main__':\r\n    records = [\r\n        {\"id\": f\"DB-REC-{i}\", \"text\": \"Google was founded in September 1998 by Larry Page and Sergey Brin.\"}\r\n        for i in range(1000)\r\n    ]\r\n\r\n    start_time = time.time()\r\n\r\n    # Format input as a list of (text, context) tuples\r\n    stream_input = [(rec[\"text\"], rec[\"id\"]) for rec in records]\r\n\r\n    # Stream batches and use all available CPU cores with n_process=-1\r\n    extracted_data_pipe = []\r\n    docs_stream = nlp.pipe(stream_input, as_tuples=True, batch_size=256, n_process=-1)\r\n\r\n    for doc, rec_id in docs_stream:\r\n        entities = [(ent.text, ent.label_) for ent in doc.ents]\r\n        extracted_data_pipe.append({\r\n            \"id\": rec_id,\r\n            \"entities\": entities\r\n        })\r\n\r\n    duration_pipe = time.time() - start_time\r\n\r\n    print(f\"nlp.pipe processed 1,000 docs in: {duration_pipe:.4f} seconds\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #E1E1E1;\">\n<pre><code>nlp.pipe processed 1,000 docs in: 7.1310 seconds<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In the optimized code snippet, we restructure the input dataset into a sequence of tuples: <code style=\"background: #F5F5F5;\">(text_string, metadata_context)<\/code>. 
When calling <code style=\"background: #F5F5F5;\">nlp.pipe(stream_input, as_tuples=True, batch_size=256, n_process=-1)<\/code>:<\/p>\n<ul>\n<li><code style=\"background: #F5F5F5;\">batch_size=256<\/code> tells spaCy to buffer and process texts in groups of 256, minimizing internal Python loop overhead<\/li>\n<li><code style=\"background: #F5F5F5;\">n_process=-1<\/code> tells spaCy to automatically detect your system's CPU count and parallelize tokenization and the pipeline components across all available cores<\/li>\n<li><code style=\"background: #F5F5F5;\">as_tuples=True<\/code> instructs spaCy to yield pairs of <code style=\"background: #F5F5F5;\">(doc, context)<\/code>, so the metadata (the record ID) stays aligned with the processed document without manual index arrays or list-alignment code<\/li>\n<\/ul>\n<p>The astute reader will note that the parallel batch version actually took longer than the sequential loop (7.13 seconds versus 2.74 seconds). This is due to the fixed overhead of setting up the parallel job: with only 1,000 short documents, that overhead outweighs the gains. The savings become evident as the number of documents grows.<\/p>\n<p>Re-running the same code excerpts with 10,000 records instead of 1,000 gives the following results (the hard-coded print labels still read 1,000):<\/p>\n<div style=\"width:98%; overflow:auto; padding-left:10px; padding-bottom:10px; padding-top:10px; background:#E1E1E1\">\n<pre><code>Sequential loop processed 1,000 docs in: 27.6733 seconds\r\nnlp.pipe processed 1,000 docs in: 11.5444 seconds<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>At this scale the parallel version is roughly 2.4x faster, and you can see how the savings would continue to compound.<\/p>\n<p>&nbsp;<\/p>\n<h2><font color=\"#f3ac35\">#&nbsp;<\/font>3. 
Hybrid Named Entity Recognition with <code style=\"background: #F5F5F5;\">EntityRuler<\/code><\/h2>\n<p>&nbsp;<br \/>\nPre-trained statistical and transformer-based NER models are incredibly powerful for recognizing general entity types like <code style=\"background: #F5F5F5;\">ORG<\/code>, <code style=\"background: #F5F5F5;\">PERSON<\/code>, or <code style=\"background: #F5F5F5;\">DATE<\/code> based on context. However, these models often fail to recognize domain-specific terms (such as custom product SKUs, legacy code IDs, or highly niche medical terms) because they weren't exposed to them during training.<\/p>\n<p>Fine-tuning a deep learning statistical model on custom entities is one solution, but it can require labeling thousands of sentences and runs the risk of \"catastrophic forgetting,\" in which the model forgets how to recognize standard entities along the way.<\/p>\n<p>A cleaner, highly efficient solution is a hybrid NER approach using spaCy's <code style=\"background: #F5F5F5;\">EntityRuler<\/code>. The <code style=\"background: #F5F5F5;\">EntityRuler<\/code> allows you to define patterns (exact phrase strings or token-based pattern dictionaries, which can include regular expressions) and inject them directly into your pipeline. 
You can add it <strong>before<\/strong> the statistical NER &mdash; to pre-tag deterministic entities and help the model make context decisions &mdash; or <strong>after<\/strong> it  &mdash; to act as a fallback or override.<\/p>\n<p>Developers often try to patch statistical NER gaps by running regex on the text <strong>after<\/strong> running the spaCy pipeline, resulting in manual coordinate offset math and disconnected data structures:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\nimport re\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\ntext = \"Please review system ticket ID: TKT-98421 on our corporate portal.\"\r\n\r\ndoc = nlp(text)\r\n\r\n# Standard statistical NER misses custom ticket IDs\r\nentities = [(ent.text, ent.label_) for ent in doc.ents]\r\nprint(\"Before post-process:\", entities)\r\n\r\n# Post-process regex patch\r\nticket_pattern = r\"TKT-\\d+\"\r\nmatches = re.finditer(ticket_pattern, text)\r\ncustom_ents = []\r\nfor match in matches:\r\n    # Requires complex char-to-token offset conversion to build spans\r\n    custom_ents.append((match.group(), \"TICKET_ID\"))\r\n\r\n# We now have two disconnected lists of entities that must be merged manually\r\nprint(\"Regex entities:\", custom_ents)<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #E1E1E1;\">\n<pre><code>Before post-process: []\r\nRegex entities: [('TKT-98421', 'TICKET_ID')]<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>By adding an <code style=\"background: #F5F5F5;\">EntityRuler<\/code> component directly to the pipeline, we merge rule-based regex patterns and statistical parsing into a single, unified <code style=\"background: #F5F5F5;\">doc.ents<\/code> output:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; 
padding-top: 10px; background: #F5F5F5;\">\n<pre><code>import spacy\r\n\r\nnlp = spacy.load(\"en_core_web_sm\")\r\n\r\n# Add the entity_ruler before ner so it pre-tags entities (placing it after ner also works)\r\nruler = nlp.add_pipe(\"entity_ruler\", before=\"ner\")\r\n\r\n# Define token-level patterns, including regular expressions\r\npatterns = [\r\n    # Match a single token made of \"TKT-\" followed by one or more digits\r\n    # (raw string so that \\d reaches the regex engine unchanged)\r\n    {\"label\": \"TICKET_ID\", \"pattern\": [{\"TEXT\": {\"REGEX\": r\"^TKT-\\d+$\"}}]},\r\n    # Match specific domain phrases exactly\r\n    {\"label\": \"ORG\", \"pattern\": \"corporate portal\"}\r\n]\r\nruler.add_patterns(patterns)\r\n\r\ntext = \"Please review system ticket ID: TKT-98421 on our corporate portal.\"\r\ndoc = nlp(text)\r\n\r\n# Both statistical and rule-based entities are consolidated inside doc.ents\r\nfor ent in doc.ents:\r\n    print(f\"Entity: {ent.text:<20} | Label: {ent.label_}\")<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>Output:<\/p>\n<div style=\"width: 98%; overflow: auto; padding-left: 10px; padding-bottom: 10px; padding-top: 10px; background: #E1E1E1;\">\n<pre><code>Entity: TKT-98421            | Label: TICKET_ID\r\nEntity: corporate portal     | Label: ORG<\/code><\/pre>\n<\/div>\n<p>&nbsp;<\/p>\n<p>In this hybrid implementation, we call <code style=\"background: #F5F5F5;\">nlp.add_pipe(\"entity_ruler\", before=\"ner\")<\/code>. The <code style=\"background: #F5F5F5;\">EntityRuler<\/code> acts as a native pipeline component. When the text is processed:<\/p>\n<ul>\n<li>The tokenizer splits the sentence into tokens.<\/li>\n<li>The <code style=\"background: #F5F5F5;\">EntityRuler<\/code> runs before the statistical model, identifying tokens that match our ticket regex pattern or exact phrase strings and tagging them as <code style=\"background: #F5F5F5;\">TICKET_ID<\/code> or <code style=\"background: #F5F5F5;\">ORG<\/code>.<\/li>\n<li>The statistical <code style=\"background: #F5F5F5;\">ner<\/code> component runs next. 
Because these tokens already carry entity annotations, it leaves them in place and adapts its predictions around them, avoiding conflicts.<\/li>\n<\/ul>\n<p>This ensures that all entities, both learned statistical ones and deterministic rule-based ones, coexist cleanly within a single, cohesive <code style=\"background: #F5F5F5;\">Doc.ents<\/code> sequence, eliminating the need for brittle post-process sorting or offset adjustments.<\/p>\n<p>&nbsp;<\/p>\n<h2><font color=\"#f3ac35\">#&nbsp;<\/font>Wrapping Up<\/h2>\n<p>&nbsp;<br \/>\nOptimizing spaCy is about transitioning from default configurations to pipelines that respect your system resources and domain-specific requirements.<\/p>\n<p>By adopting these three tricks, you can design highly efficient, production-grade text processing pipelines:<\/p>\n<ul>\n<li>Selective loading & component disabling eliminates unnecessary computation, speeding up processing by roughly 1.6x in the benchmark above.<\/li>\n<li>Batch processing with <code style=\"background: #F5F5F5;\">nlp.pipe<\/code> parallelizes execution across CPU cores, which pays off on larger corpora, and setting <code style=\"background: #F5F5F5;\">as_tuples=True<\/code> propagates critical metadata without index-mapping bugs.<\/li>\n<li>Hybrid NER with <code style=\"background: #F5F5F5;\">EntityRuler<\/code> blends deterministic pattern-matching rules with general statistical inference, improving extraction coverage for custom domains without retraining.<\/li>\n<\/ul>\n<p>Deploying these design patterns ensures that your NLP pipelines remain scalable, memory-efficient, and tailored to the unique vocabulary of your business data.<br \/>\n&nbsp;<br \/>\n&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"In this article, we will explore three essential spaCy tricks that every developer should have in their toolkit to maximize processing speed and customize entity 
recognition.\n","protected":false},"author":99,"featured_media":205604,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"_seopress_titles_title":"","_seopress_titles_desc":"","_seopress_robots_index":"","_seopress_robots_follow":"","_seopress_robots_imageindex":"","_seopress_robots_snippet":"","_seopress_robots_primary_cat":"","_seopress_robots_breadcrumbs":"","_seopress_robots_freeze_modified_date":"","_seopress_robots_custom_modified_date":"","_seopress_robots_canonical":"","_seopress_social_fb_title":"","_seopress_social_fb_desc":"","_seopress_social_fb_img":"","_seopress_social_fb_img_attachment_id":0,"_seopress_social_fb_img_width":0,"_seopress_social_fb_img_height":0,"_seopress_social_twitter_title":"","_seopress_social_twitter_desc":"","_seopress_social_twitter_img":"","_seopress_social_twitter_img_attachment_id":0,"_seopress_social_twitter_img_width":0,"_seopress_social_twitter_img_height":0,"_seopress_redirections_value":"","_seopress_redirections_enabled":"","_seopress_redirections_enabled_regex":"","_seopress_redirections_logged_status":"","_seopress_redirections_param":"","_seopress_redirections_type":0,"_seopress_analysis_target_kw":"","inline_featured_image":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"mc4wp_mailchimp_campaign":[],"footnotes":"","_links_to":"","_links_to_target":""},"categories":[5286],"tags":[1746],"class_list":["post-205602","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-kdnuggets-originals","tag-natural-language-processing"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/205602","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/ty
pes\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/users\/99"}],"replies":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/comments?post=205602"}],"version-history":[{"count":2,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/205602\/revisions"}],"predecessor-version":[{"id":205610,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/posts\/205602\/revisions\/205610"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media\/205604"}],"wp:attachment":[{"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/media?parent=205602"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/categories?post=205602"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.kdnuggets.com\/wp-json\/wp\/v2\/tags?post=205602"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}