NLP Utilities Modularization
Type: Document
Status: Published
Created: Jan 23, 2026
Updated: Feb 6, 2026
Updated by: Dosu Bot

Modular Breakdown#

Text Cleaning#

The text_cleaning module provides functions for common normalization and cleaning tasks, including:

  • clean_text: Configurable pipeline for cleaning text (HTML, URLs, punctuation, emojis, numbers, whitespace, Unicode normalization, contractions, repeated characters, lowercasing).
  • remove_html_tags, remove_urls, remove_punctuation, remove_emojis, remove_numbers, normalize_whitespace, normalize_unicode, normalize_contractions, replace_repeated_chars: Standalone functions for the individual cleaning steps.
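
To make the idea of a configurable cleaning pipeline concrete, here is a minimal sketch using only the standard library. It is not the package's implementation: clean_text_sketch, its keyword flags, and the underscore-prefixed helpers are hypothetical names chosen for illustration, and it covers only a few of the steps listed above.

```python
import html
import re
import unicodedata

# Illustrative patterns only; the real module may use different rules.
_TAG_RE = re.compile(r"<[^>]+>")
_URL_RE = re.compile(r"https?://\S+|www\.\S+")
_SPACE_RE = re.compile(r"\s+")


def _strip_html(text: str) -> str:
    """Replace anything that looks like a tag with a space, then unescape entities."""
    return html.unescape(_TAG_RE.sub(" ", text))


def _strip_urls(text: str) -> str:
    """Replace http(s) and www-style URLs with a space."""
    return _URL_RE.sub(" ", text)


def _collapse_whitespace(text: str) -> str:
    """Collapse runs of whitespace to single spaces and trim the ends."""
    return _SPACE_RE.sub(" ", text).strip()


def clean_text_sketch(
    text: str,
    *,
    strip_html: bool = True,      # hypothetical flag
    strip_urls: bool = True,      # hypothetical flag
    lowercase: bool = True,       # hypothetical flag
    unicode_form: str = "NFKC",   # hypothetical flag; empty string skips normalization
) -> str:
    """Apply each enabled step in a fixed order and return the cleaned string."""
    if strip_html:
        text = _strip_html(text)
    if strip_urls:
        text = _strip_urls(text)
    if unicode_form:
        text = unicodedata.normalize(unicode_form, text)
    if lowercase:
        text = text.lower()
    return _collapse_whitespace(text)
```

Each step is an ordinary function, so callers can run the whole pipeline or reuse a single step on its own, which is the same split the module list above describes.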

Tokenization#

The tokenization module offers:

  • simple_tokenize: Whitespace-based tokenization.
  • nltk_tokenize: NLTK's word tokenizer. If the required NLTK resources (e.g., 'punkt') are unavailable, it falls back to NLTK's regex-based wordpunct_tokenize, which splits on punctuation and needs no downloaded data.
  • spacy_tokenize: spaCy-based tokenization.
  • segment_sentences: Sentence segmentation with NLTK's sentence tokenizer. If the tokenizer's resources are missing, it falls back to a regex-based sentence splitter.
  • get_ngrams: N-gram generation.
  • remove_stopwords: Stopword removal (NLTK).
  • stem_words: Porter stemming (NLTK).
  • lemmatize_words: WordNet lemmatization (NLTK).

These fallback mechanisms ensure that tokenization and sentence segmentation remain functional even if NLTK data is not pre-installed, though results may be less linguistically precise without the full NLTK resources.
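
The fallback behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the module's actual code: tokenize_with_fallback and split_sentences_with_fallback are hypothetical names, and the extra branch for NLTK not being installed at all is an addition made so the sketch runs anywhere.

```python
import re

# Same pattern NLTK's wordpunct tokenizer is built on: word runs or punctuation runs.
_WORD_PUNCT_RE = re.compile(r"\w+|[^\w\s]+")
# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
_SENT_END_RE = re.compile(r"(?<=[.!?])\s+")


def tokenize_with_fallback(text: str) -> list[str]:
    """Prefer NLTK's word tokenizer; degrade when NLTK or its data is missing."""
    try:
        from nltk.tokenize import word_tokenize, wordpunct_tokenize
    except ImportError:
        # Assumption: NLTK itself is absent, so use a plain regex.
        return _WORD_PUNCT_RE.findall(text)
    try:
        return word_tokenize(text)
    except LookupError:
        # Raised by NLTK when the 'punkt' data has not been downloaded.
        return wordpunct_tokenize(text)


def split_sentences_with_fallback(text: str) -> list[str]:
    """Prefer NLTK's sentence tokenizer; otherwise split on end punctuation."""
    try:
        from nltk.tokenize import sent_tokenize
        return sent_tokenize(text)
    except (ImportError, LookupError):
        return [s for s in _SENT_END_RE.split(text.strip()) if s]
```

The regex splitter will mis-handle abbreviations such as "Dr." or "e.g.", which is the kind of precision loss the paragraph above refers to.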

Classification#

The classification module includes:

  • naive_bayes_classify: Text classification using Naive Bayes (scikit-learn).
  • svm_classify: Text classification using SVM (scikit-learn).
  • sentiment_analysis_basic: Lexicon-based sentiment analysis (NLTK).

Similarity#

The similarity module provides:

  • cosine_similarity_texts: Cosine similarity using TF-IDF (scikit-learn).
  • jaccard_similarity: Jaccard similarity.
  • levenshtein_distance: Edit distance.
  • semantic_similarity_word_embeddings: Semantic similarity using spaCy embeddings.
  • jaro_similarity, jaro_winkler_similarity, hamming_distance, longest_common_subsequence: Additional string similarity and distance measures.
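
For readers unfamiliar with the two simplest measures in this list, here is a self-contained sketch of Jaccard similarity over word sets and Levenshtein edit distance. The function names jaccard_sketch and levenshtein_sketch are hypothetical, and the choices made here (lowercasing, whitespace tokenization, returning 1.0 for two empty inputs) are assumptions that may differ from the module's behaviour.

```python
def jaccard_sketch(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    set_a = set(a.lower().split())
    set_b = set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_sketch(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete ch_a
                    current[j - 1] + 1,      # insert ch_b
                    previous[j - 1] + cost,  # substitute or match
                )
            )
        previous = current
    return previous[-1]
```

Jaccard returns a score between 0 and 1 where higher means more similar, while Levenshtein returns a count where lower means more similar, so the two are not directly comparable.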

Extraction#

The extraction module includes:

  • extract_emails, extract_phone_numbers, extract_urls, extract_hashtags, extract_mentions, extract_ip_addresses: Regex-based extractors.
  • extract_named_entities, extract_entities_by_type: Named entity recognition using spaCy.
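
The regex-based extractors can be approximated with a few compiled patterns. The patterns and function names below are illustrative assumptions, not the ones the extraction module ships; real-world email and handle formats are looser than these expressions allow.

```python
import re

# Simplified patterns for illustration only.
_EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# The lookbehind keeps '#' and '@' inside words (such as emails) from matching.
_HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
_MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def find_emails(text: str) -> list[str]:
    """Return every substring that looks like an email address."""
    return _EMAIL_RE.findall(text)


def find_hashtags(text: str) -> list[str]:
    """Return hashtag names without the leading '#'."""
    return _HASHTAG_RE.findall(text)


def find_mentions(text: str) -> list[str]:
    """Return mentioned handles without the leading '@'."""
    return _MENTION_RE.findall(text)
```

Whether the module's own extractors keep or drop the leading "#" and "@" is not stated here, so check the returned values before relying on either form.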

Encoding#

The encoding module provides:

  • encode_base64, decode_base64: Base64 encoding/decoding.
  • url_encode, url_decode: URL encoding/decoding.
  • html_encode, html_decode: HTML entity encoding/decoding.
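
All three encoding pairs map onto standard-library calls. The sketch below shows one plausible way such wrappers could look; the wrapper names and the UTF-8 default are assumptions rather than the module's documented signatures.

```python
import base64
import html
from urllib.parse import quote, unquote


def b64_encode_text(text: str, encoding: str = "utf-8") -> str:
    """Encode text to bytes, Base64 those bytes, and return an ASCII string."""
    return base64.b64encode(text.encode(encoding)).decode("ascii")


def b64_decode_text(data: str, encoding: str = "utf-8") -> str:
    """Reverse b64_encode_text."""
    return base64.b64decode(data).decode(encoding)


def url_encode_text(text: str) -> str:
    """Percent-encode every reserved character, including '/'."""
    return quote(text, safe="")


def url_decode_text(text: str) -> str:
    """Reverse url_encode_text."""
    return unquote(text)


def html_encode_text(text: str) -> str:
    """Escape '&', '<', '>' and quotes as HTML entities."""
    return html.escape(text)


def html_decode_text(text: str) -> str:
    """Turn HTML entities back into characters."""
    return html.unescape(text)
```

Base64 operates on bytes, so a text wrapper has to pick a character encoding on the way in and out; using the same one in both directions is what makes the round trip lossless.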

Unified Import Interface#

All major NLP utilities are re-exported in the package’s __init__.py, so you can import them directly from insideLLMs.nlp without referencing submodules. This enables concise, readable imports and allows you to mix utilities from different domains in a single statement:

from insideLLMs.nlp import (
    clean_text, simple_tokenize, naive_bayes_classify,
    cosine_similarity_texts, extract_emails, encode_base64
)

This interface is consistent for both example scripts and core code, reducing import complexity and improving maintainability.
See the unified import interface in the package source.

Usage Examples#

Text Cleaning and Tokenization#

from insideLLMs.nlp import clean_text, simple_tokenize, remove_stopwords

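raw = "Hello! Visit https://example.com for more info. 😊"
cleaned = clean_text(raw)                  # run the cleaning pipeline with its default settings
tokens = simple_tokenize(cleaned)          # split on whitespace
tokens_no_stop = remove_stopwords(tokens)  # drop stopwords (uses NLTK)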

Classification#

from insideLLMs.nlp import naive_bayes_classify

# Labeled training data and the texts to classify
train_texts = ["I love cats", "I hate rain"]
train_labels = ["positive", "negative"]
test_texts = ["cats are great", "rain is bad"]
predictions = naive_bayes_classify(train_texts, train_labels, test_texts)

Similarity#

from insideLLMs.nlp import cosine_similarity_texts

score = cosine_similarity_texts("I like apples", "I enjoy apples")

Extraction#

from insideLLMs.nlp import extract_emails, extract_named_entities

text = "Contact us at support@example.com. Barack Obama was the 44th president."
emails = extract_emails(text)            # regex-based
entities = extract_named_entities(text)  # spaCy-based named entity recognition

Encoding#

from insideLLMs.nlp import encode_base64, decode_base64

encoded = encode_base64("hello world")
decoded = decode_base64(encoded)

Dependency Management#

Many utilities rely on external libraries (NLTK, spaCy, scikit-learn). The package manages these dependencies centrally and lazily: each function checks for the resources it needs at runtime and installs them if they are missing. For example, tokenization and sentiment analysis functions ensure NLTK is available, while spaCy-based functions ensure the appropriate model is loaded. This design minimizes setup friction and keeps example code clean.
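
A minimal sketch of the lazy-check half of this pattern is shown below. The helper name require and its install_hint parameter are hypothetical, not the package's API, and unlike the behaviour described above this version only reports a missing dependency instead of installing it.

```python
import importlib
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def require(module_name: str, install_hint: str = "") -> ModuleType:
    """Import a module on first use and cache it for later calls.

    Hypothetical helper: raises ImportError with an actionable message
    when the optional dependency is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = install_hint or f"pip install {module_name}"
        raise ImportError(
            f"{module_name!r} is required for this feature; try: {hint}"
        ) from exc


def count_words_lazily(text: str) -> int:
    """Example caller: the dependency is resolved inside the function body."""
    re_module = require("re")  # stand-in for an optional library such as nltk
    return len(re_module.findall(r"\w+", text))
```

Because the import happens inside the function rather than at module load time, importing the package stays cheap and a missing optional library only affects the functions that actually need it.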

Best Practices#

  • Use the unified import interface for all NLP utilities.
  • When using advanced features (e.g., spaCy models, scikit-learn classifiers), ensure your environment can install required dependencies.
  • For reproducibility in CI or production, pre-install NLTK data and spaCy models rather than relying on runtime downloads. In CI, explicitly download the required NLTK resources (such as 'punkt', 'punkt_tab', 'stopwords', 'wordnet', 'vader_lexicon') before running tests to avoid runtime errors.
  • The tokenization and sentence segmentation utilities will gracefully degrade to simpler algorithms if NLTK resources are missing, but for best results, ensure all required NLTK data is available.
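
The CI pre-download step could look something like the sketch below. The function name predownload and its injectable download argument are assumptions made so the logic can be exercised without network access; the resource names are the ones listed above, and nltk.download is the real NLTK call used when no replacement is supplied.

```python
from typing import Callable, Iterable, Optional

# Resource names taken from the best-practice list above.
REQUIRED_NLTK_RESOURCES = ("punkt", "punkt_tab", "stopwords", "wordnet", "vader_lexicon")


def predownload(
    resources: Iterable[str] = REQUIRED_NLTK_RESOURCES,
    download: Optional[Callable[[str], bool]] = None,
) -> list[str]:
    """Try to fetch each resource and return the names that failed.

    `download` is a hypothetical hook: any callable that takes a resource
    name and returns True on success. It defaults to nltk.download.
    """
    if download is None:
        import nltk  # imported lazily so the helper loads without NLTK

        def download(name: str) -> bool:
            return bool(nltk.download(name, quiet=True))

    return [name for name in resources if not download(name)]


# Typical CI usage (needs NLTK and network access):
#     failed = predownload()
#     if failed:
#         raise SystemExit(f"Could not download NLTK resources: {failed}")
```

Failing the build when a download fails surfaces the problem during setup instead of as a silent fallback to the less precise tokenizers later in the test run.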