NLP Utilities Modularization

Modular Breakdown

Text Cleaning

The text_cleaning module provides functions for common normalization and cleaning tasks, including:

clean_text: Configurable pipeline for cleaning text (HTML, URLs, punctuation, emojis, numbers, whitespace, Unicode normalization, contractions, repeated characters, lowercasing).
remove_html_tags, remove_urls, remove_punctuation, remove_emojis, remove_numbers, normalize_whitespace, normalize_unicode, normalize_contractions, replace_repeated_chars: Individual functions for each cleaning step.
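
To illustrate what a configurable cleaning pipeline of this kind can look like, here is a minimal standard-library sketch. The function name clean_text_sketch and its keyword flags are hypothetical and chosen for illustration only; the actual clean_text signature and defaults may differ.

```python
import html
import re
import unicodedata


def clean_text_sketch(
    text,
    *,
    strip_html=True,
    strip_urls=True,
    strip_punctuation=True,
    lowercase=True,
    collapse_whitespace=True,
):
    """Hypothetical cleaning pipeline; each flag toggles one step."""
    # Normalize Unicode first so later regexes see canonical characters.
    text = unicodedata.normalize("NFKC", text)
    if strip_html:
        # Drop tags, then decode entities such as &amp;.
        text = html.unescape(re.sub(r"<[^>]+>", " ", text))
    if strip_urls:
        # URLs must go before punctuation removal, or they get split apart.
        text = re.sub(r"https?://\S+|www\.\S+", " ", text)
    if strip_punctuation:
        # Replace anything that is neither a word character nor whitespace.
        text = re.sub(r"[^\w\s]", " ", text)
    if lowercase:
        text = text.lower()
    if collapse_whitespace:
        text = re.sub(r"\s+", " ", text).strip()
    return text


print(clean_text_sketch("<p>Hello,   WORLD!</p> See https://example.com now"))
```

The order of steps matters in a pipeline like this: removing punctuation before URLs would leave URL fragments behind.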

Tokenization

The tokenization module offers:

simple_tokenize: Whitespace-based tokenization.
nltk_tokenize: NLTK's word tokenizer. If required NLTK resources (e.g., 'punkt') are unavailable, this function gracefully falls back to a punctuation-based tokenizer (wordpunct_tokenize).
spacy_tokenize: spaCy-based tokenization.
segment_sentences: Sentence segmentation with NLTK. If NLTK's sentence tokenizer resources are missing, this function falls back to a regex-based sentence splitter.
get_ngrams: N-gram generation.
remove_stopwords: Stopword removal (NLTK).
stem_words: Porter stemming (NLTK).
lemmatize_words: WordNet lemmatization (NLTK).

These fallback mechanisms ensure that tokenization and sentence segmentation remain functional even if NLTK data is not pre-installed, though results may be less linguistically precise without the full NLTK resources.

Classification

The classification module includes:

naive_bayes_classify: Text classification using Naive Bayes (scikit-learn).
svm_classify: Text classification using SVM (scikit-learn).
sentiment_analysis_basic: Lexicon-based sentiment analysis (NLTK).

Similarity

The similarity module provides:

cosine_similarity_texts: Cosine similarity using TF-IDF (scikit-learn).
jaccard_similarity: Jaccard similarity.
levenshtein_distance: Edit distance.
semantic_similarity_word_embeddings: Semantic similarity using spaCy embeddings.
jaro_similarity, jaro_winkler_similarity, hamming_distance, longest_common_subsequence: Additional string similarity and distance measures.
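
Two of the measures listed above are simple enough to sketch with the standard library. The function names and the whitespace-based, case-insensitive tokenization in the Jaccard example are assumptions for illustration; the package's own functions may tokenize or normalize differently.

```python
def jaccard_sketch(text_a, text_b):
    """Jaccard similarity over lowercase whitespace tokens: |A & B| / |A | B|."""
    set_a = set(text_a.lower().split())
    set_b = set(text_b.lower().split())
    if not set_a and not set_b:
        return 1.0  # Two empty texts are treated as identical.
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_sketch(a, b):
    """Edit distance via dynamic programming, keeping only two rows."""
    if len(a) < len(b):
        a, b = b, a  # Keep the shorter string in the inner loop.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


print(jaccard_sketch("I like apples", "I enjoy apples"))  # 0.5
print(levenshtein_sketch("kitten", "sitting"))            # 3
```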

Extraction

The extraction module includes:

extract_emails, extract_phone_numbers, extract_urls, extract_hashtags, extract_mentions, extract_ip_addresses: Regex-based extractors.
extract_named_entities, extract_entities_by_type: Named entity recognition using spaCy.
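
A regex-based extractor typically compiles one pattern per entity type and returns all matches. The patterns below are deliberately simplified assumptions (real-world email and mention formats are more varied), and the names ending in _sketch are hypothetical rather than the package's own.

```python
import re

# Simplified patterns for illustration; they do not cover every valid form.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
# The lookbehind avoids matching '#' or '@' in the middle of a word,
# for example the '@' inside an email address.
HASHTAG_RE = re.compile(r"(?<!\w)#(\w+)")
MENTION_RE = re.compile(r"(?<!\w)@(\w+)")


def extract_emails_sketch(text):
    return EMAIL_RE.findall(text)


def extract_hashtags_sketch(text):
    return HASHTAG_RE.findall(text)


def extract_mentions_sketch(text):
    return MENTION_RE.findall(text)


sample = "Thanks @alice and @bob_1! Write to support@example.com. #nlp #AI"
print(extract_emails_sketch(sample))
print(extract_hashtags_sketch(sample))
print(extract_mentions_sketch(sample))
```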

Encoding

The encoding module provides:

encode_base64, decode_base64: Base64 encoding/decoding.
url_encode, url_decode: URL encoding/decoding.
html_encode, html_decode: HTML entity encoding/decoding.
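
Each of these pairs maps naturally onto the Python standard library. The sketch below shows the stdlib calls such helpers could wrap; whether the package uses exactly these calls is an assumption, and the wrapper names are hypothetical.

```python
import base64
import html
from urllib.parse import quote, unquote


def b64_encode_sketch(text):
    # Base64 operates on bytes, so encode to UTF-8 first.
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def b64_decode_sketch(data):
    return base64.b64decode(data).decode("utf-8")


# Base64 round trip.
encoded = b64_encode_sketch("hello world")
print(encoded)                      # aGVsbG8gd29ybGQ=
print(b64_decode_sketch(encoded))   # hello world

# URL percent-encoding round trip.
url_encoded = quote("a b&c")
print(url_encoded)                  # a%20b%26c
print(unquote(url_encoded))         # a b&c

# HTML entity round trip.
html_encoded = html.escape("<b>Tom & Jerry</b>")
print(html_encoded)                 # &lt;b&gt;Tom &amp; Jerry&lt;/b&gt;
print(html.unescape(html_encoded))  # <b>Tom & Jerry</b>
```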

Unified Import Interface

All major NLP utilities are re-exported in the package’s __init__.py, so you can import them directly from insideLLMs.nlp without referencing submodules. This enables concise, readable imports and allows you to mix utilities from different domains in a single statement:

from insideLLMs.nlp import (
    clean_text, simple_tokenize, naive_bayes_classify,
    cosine_similarity_texts, extract_emails, encode_base64
)

This interface is consistent for both example scripts and core code, reducing import complexity and improving maintainability.
See the unified import interface in the package source.

Usage Examples

Text Cleaning and Tokenization

from insideLLMs.nlp import clean_text, simple_tokenize, remove_stopwords

raw = "Hello! Visit https://example.com for more info. 😊"
cleaned = clean_text(raw)
tokens = simple_tokenize(cleaned)
tokens_no_stop = remove_stopwords(tokens)

Classification

from insideLLMs.nlp import naive_bayes_classify

train_texts = ["I love cats", "I hate rain"]
train_labels = ["positive", "negative"]
test_texts = ["cats are great", "rain is bad"]
predictions = naive_bayes_classify(train_texts, train_labels, test_texts)

Similarity

from insideLLMs.nlp import cosine_similarity_texts

score = cosine_similarity_texts("I like apples", "I enjoy apples")

Extraction

from insideLLMs.nlp import extract_emails, extract_named_entities

text = "Contact us at support@example.com. Barack Obama was the 44th president."
emails = extract_emails(text)
entities = extract_named_entities(text)

Encoding

from insideLLMs.nlp import encode_base64, decode_base64

encoded = encode_base64("hello world")
decoded = decode_base64(encoded)

Dependency Management

Many utilities rely on external libraries (NLTK, spaCy, scikit-learn). The package manages these dependencies centrally and lazily: functions check for and install required resources at runtime if needed. For example, tokenization and sentiment analysis functions will ensure NLTK is available, while spaCy-based functions will ensure the appropriate model is loaded. This design minimizes setup friction and keeps example code clean.
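
The lazy part of this design can be sketched with importlib: the optional dependency is imported only when a function that needs it is first called. This is a simplified assumption about the pattern, with a hypothetical helper name; unlike the behavior described above, the sketch raises a helpful error instead of installing anything.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_import(module_name, install_hint=None):
    """Import an optional dependency on first use and cache the module.

    Hypothetical helper: raises ImportError with an install hint when the
    dependency is missing, rather than installing it automatically.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = install_hint or f"pip install {module_name}"
        raise ImportError(
            f"'{module_name}' is required for this feature; try: {hint}"
        ) from exc


def dump_sketch(obj):
    # The dependency is resolved only when this function is called.
    # A stdlib module stands in here for NLTK, spaCy, or scikit-learn.
    json = lazy_import("json")
    return json.dumps(obj)


print(dump_sketch([1, 2, 3]))
```

Because lru_cache does not cache exceptions, a dependency installed after a failed call is picked up on the next attempt.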

Best Practices

Use the unified import interface for all NLP utilities.
When using advanced features (e.g., spaCy models, scikit-learn classifiers), ensure your environment can install required dependencies.
For reproducibility in CI or production, pre-install NLTK data and spaCy models as needed. In CI environments, explicitly download required NLTK resources (such as 'punkt', 'punkt_tab', 'stopwords', 'wordnet', 'vader_lexicon') before running tests to avoid runtime errors.
The tokenization and sentence segmentation utilities will gracefully degrade to simpler algorithms if NLTK resources are missing, but for best results, ensure all required NLTK data is available.
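
A CI setup step for the NLTK resources named above might look like the following sketch. The function name predownload_nltk and its injectable download parameter are hypothetical; the only NLTK call used is nltk.download(name, quiet=True), which returns a truthy value on success.

```python
REQUIRED_NLTK_RESOURCES = (
    "punkt",
    "punkt_tab",
    "stopwords",
    "wordnet",
    "vader_lexicon",
)


def predownload_nltk(resources=REQUIRED_NLTK_RESOURCES, download=None):
    """Download each resource and return the names that failed.

    `download` is injectable so the logic can be exercised without network
    access; by default it wraps nltk.download.
    """
    if download is None:
        import nltk  # Imported lazily so the module loads without NLTK.

        def download(name):
            return nltk.download(name, quiet=True)

    return [name for name in resources if not download(name)]


# In a CI script, one might fail the build when anything is missing:
#     failed = predownload_nltk()
#     if failed:
#         raise SystemExit(f"Missing NLTK resources: {', '.join(failed)}")
```

Running a step like this before the test suite keeps the tests from depending on network access at runtime.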