Modular Breakdown#
Text Cleaning#
The text_cleaning module provides functions for common normalization and cleaning tasks, including:
- clean_text: Configurable pipeline for cleaning text (HTML, URLs, punctuation, emojis, numbers, whitespace, Unicode normalization, contractions, repeated characters, lowercasing).
- remove_html_tags, remove_urls, remove_punctuation, remove_emojis, remove_numbers, normalize_whitespace, normalize_unicode, normalize_contractions, replace_repeated_chars.
Tokenization#
The tokenization module offers:
- simple_tokenize: Whitespace-based tokenization.
- nltk_tokenize: NLTK's word tokenizer. If required NLTK resources (e.g., 'punkt') are unavailable, this function gracefully falls back to a punctuation-based tokenizer (wordpunct_tokenize).
- spacy_tokenize: spaCy-based tokenization.
- segment_sentences: Sentence segmentation using NLTK or regex. If NLTK's sentence tokenizer resources are missing, this function falls back to a regex-based sentence splitter.
- get_ngrams: N-gram generation.
- remove_stopwords: Stopword removal (NLTK).
- stem_words: Porter stemming (NLTK).
- lemmatize_words: WordNet lemmatization (NLTK).
These fallback mechanisms ensure that tokenization and sentence segmentation remain functional even if NLTK data is not pre-installed, though results may be less linguistically precise without the full NLTK resources.
Classification#
The classification module includes:
- naive_bayes_classify: Text classification using Naive Bayes (scikit-learn).
- svm_classify: Text classification using SVM (scikit-learn).
- sentiment_analysis_basic: Lexicon-based sentiment analysis (NLTK).
Similarity#
The similarity module provides:
- cosine_similarity_texts: Cosine similarity using TF-IDF (scikit-learn).
- jaccard_similarity: Jaccard similarity.
- levenshtein_distance: Edit distance.
- semantic_similarity_word_embeddings: Semantic similarity using spaCy embeddings.
- jaro_similarity, jaro_winkler_similarity, hamming_distance, longest_common_subsequence.
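To make two of these measures concrete, here is a minimal pure-Python sketch of Jaccard similarity and Levenshtein distance. The function names `jaccard_sketch` and `levenshtein_sketch` are hypothetical, and the choice to compare lowercased whitespace tokens in the Jaccard case is an assumption; the package's own functions may tokenize or normalize differently.

```python
def jaccard_sketch(a, b):
    """Jaccard similarity over lowercased whitespace tokens (assumed tokenization)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty texts are identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_sketch(s, t):
    """Edit distance via dynamic programming, keeping only one previous row."""
    prev = list(range(len(t) + 1))  # distance from "" to each prefix of t
    for i, cs in enumerate(s, 1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, `jaccard_sketch("I like apples", "I enjoy apples")` shares two of four distinct tokens and returns 0.5, while `levenshtein_sketch("kitten", "sitting")` returns 3.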
Extraction#
The extraction module includes:
- extract_emails, extract_phone_numbers, extract_urls, extract_hashtags, extract_mentions, extract_ip_addresses: Regex-based extractors.
- extract_named_entities, extract_entities_by_type: Named entity recognition using spaCy.
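As a rough illustration of what a regex-based extractor looks like, the sketch below uses only the standard library. The patterns and the names `find_emails`, `find_hashtags`, and `find_mentions` are assumptions for this example; the package's extractors may use different (and likely stricter) patterns.

```python
import re

# Deliberately simple patterns; real-world email syntax is far more permissive.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
HASHTAG_RE = re.compile(r"#\w+")
# The lookbehind keeps the "@example" inside an email address from matching.
MENTION_RE = re.compile(r"(?<!\w)@\w+")


def find_emails(text):
    """Return all substrings that look like email addresses."""
    return EMAIL_RE.findall(text)


def find_hashtags(text):
    """Return all #hashtags, including the leading '#'."""
    return HASHTAG_RE.findall(text)


def find_mentions(text):
    """Return all @mentions that are not part of a longer word."""
    return MENTION_RE.findall(text)
```

Patterns this simple trade accuracy for readability, which is usually acceptable for exploratory text processing but not for validation.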
Encoding#
The encoding module provides:
- encode_base64, decode_base64: Base64 encoding/decoding.
- url_encode, url_decode: URL encoding/decoding.
- html_encode, html_decode: HTML entity encoding/decoding.
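These three encodings all have standard-library equivalents, so a plausible sketch of such helpers is shown below. The wrapper names (`b64_encode_sketch` and so on) are hypothetical, and the assumption that text is handled as UTF-8 is this example's, not something the package documents here.

```python
import base64
import html
from urllib.parse import quote, unquote


def b64_encode_sketch(text):
    """Encode a str as Base64 text (UTF-8 assumed)."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def b64_decode_sketch(data):
    """Decode Base64 text back to a str (UTF-8 assumed)."""
    return base64.b64decode(data).decode("utf-8")


def url_encode_sketch(text):
    """Percent-encode everything except unreserved characters."""
    return quote(text, safe="")


def url_decode_sketch(text):
    """Reverse percent-encoding."""
    return unquote(text)


def html_encode_sketch(text):
    """Replace &, <, > and quotes with HTML entities."""
    return html.escape(text)


def html_decode_sketch(text):
    """Turn HTML entities back into characters."""
    return html.unescape(text)
```

Each pair round-trips: decoding the encoded form of a string returns the original string.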
Unified Import Interface#
All major NLP utilities are re-exported in the package’s __init__.py, so you can import them directly from insideLLMs.nlp without referencing submodules. This enables concise, readable imports and allows you to mix utilities from different domains in a single statement:
```python
from insideLLMs.nlp import (
    clean_text, simple_tokenize, naive_bayes_classify,
    cosine_similarity_texts, extract_emails, encode_base64
)
```
This interface is consistent for both example scripts and core code, reducing import complexity and improving maintainability.
See the unified import interface in the package source.
Usage Examples#
Text Cleaning and Tokenization#
```python
from insideLLMs.nlp import clean_text, simple_tokenize, remove_stopwords

raw = "Hello! Visit https://example.com for more info. 😊"
cleaned = clean_text(raw)                  # run the configurable cleaning pipeline
tokens = simple_tokenize(cleaned)          # split on whitespace
tokens_no_stop = remove_stopwords(tokens)  # drop stopwords (uses NLTK)
```
Classification#
```python
from insideLLMs.nlp import naive_bayes_classify

train_texts = ["I love cats", "I hate rain"]
train_labels = ["positive", "negative"]
test_texts = ["cats are great", "rain is bad"]

# Train on the labeled texts, then predict a label for each test text.
predictions = naive_bayes_classify(train_texts, train_labels, test_texts)
```
Similarity#
```python
from insideLLMs.nlp import cosine_similarity_texts

# TF-IDF cosine similarity between the two texts.
score = cosine_similarity_texts("I like apples", "I enjoy apples")
```
Extraction#
```python
from insideLLMs.nlp import extract_emails, extract_named_entities

text = "Contact us at support@example.com. Barack Obama was the 44th president."
emails = extract_emails(text)             # regex-based extraction
entities = extract_named_entities(text)   # spaCy named entity recognition
```
Encoding#
```python
from insideLLMs.nlp import encode_base64, decode_base64

encoded = encode_base64("hello world")
decoded = decode_base64(encoded)
```
Dependency Management#
Many utilities rely on external libraries (NLTK, spaCy, scikit-learn). The package manages these dependencies centrally and lazily: functions check for and install required resources at runtime if needed. For example, tokenization and sentiment analysis functions will ensure NLTK is available, while spaCy-based functions will ensure the appropriate model is loaded. This design minimizes setup friction and keeps example code clean.
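One common way to get this kind of lazy behaviour is to defer the import until a function is actually called. The sketch below illustrates the idea only: `require` and `count_tokens_lazily` are hypothetical names, and unlike the mechanism described above this version does not install anything; it just reports what is missing.

```python
import importlib


def require(module_name, install_hint=None):
    """Hypothetical helper: import a dependency on demand, with a clear error."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        hint = install_hint or f"pip install {module_name}"
        raise ImportError(
            f"'{module_name}' is required for this feature; try: {hint}"
        ) from exc


def count_tokens_lazily(text):
    """Example consumer: the 're' module is only looked up when this runs."""
    re_module = require("re")  # stands in for a heavier dependency such as NLTK
    return len(re_module.findall(r"\w+", text))
```

Because the check happens inside the function body, importing the package stays cheap, and users only need the dependencies for the features they actually call.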
Best Practices#
- Use the unified import interface for all NLP utilities.
- When using advanced features (e.g., spaCy models, scikit-learn classifiers), ensure your environment can install required dependencies.
- For reproducibility in CI or production, pre-install NLTK data and spaCy models as needed. In CI environments, explicitly download required NLTK resources (such as 'punkt', 'punkt_tab', 'stopwords', 'wordnet', 'vader_lexicon') before running tests to avoid runtime errors.
- The tokenization and sentence segmentation utilities will gracefully degrade to simpler algorithms if NLTK resources are missing, but for best results, ensure all required NLTK data is available.
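For the CI recommendation above, a small setup script along the following lines could pre-download the listed NLTK resources. The function name `download_nltk_resources` and the injectable `download` parameter are assumptions made for this sketch (the parameter exists so the logic can be exercised without network access); `nltk.download(name, quiet=True)` is the real NLTK call and returns a boolean indicating success.

```python
# Resource names taken from the best-practice list above.
REQUIRED_NLTK_RESOURCES = (
    "punkt", "punkt_tab", "stopwords", "wordnet", "vader_lexicon",
)


def download_nltk_resources(resources=REQUIRED_NLTK_RESOURCES, download=None):
    """Download each resource and return the names that failed."""
    if download is None:
        import nltk  # imported here so the module loads without NLTK installed

        def download(name):
            return nltk.download(name, quiet=True)

    return [name for name in resources if not download(name)]
```

In a CI step you might call `download_nltk_resources()` before the test run and fail the job if the returned list is non-empty.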