A fast, multilingual text processing utility that filters stopwords from input text. Supports 33 languages with efficient O(1) lookup using Bash associative arrays.
Note: For documents over 2,000 words, consider the Python implementation, which performs better on large inputs. Both use the same NLTK stopwords data.
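One way to act on that note is a size-based dispatch in a wrapper script. This is an illustrative sketch only: the 2,000-word threshold comes from the note above, and the Python entry point name in the comment is hypothetical.

```shell
# Illustrative size-based dispatch; the threshold is from the note above
# and the Python entry point name is hypothetical.
doc=$(mktemp)
printf 'the quick brown fox %.0s' {1..600} > "$doc"   # ~2,400-word sample
words=$(wc -w < "$doc")
if (( words > 2000 )); then
  impl=python   # e.g. python3 stopwords.py < "$doc"
else
  impl=bash     # e.g. ./stopwords < "$doc"
fi
echo "$words words -> use the $impl implementation"
rm -f "$doc"
```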
- Multilingual Support: Filter stopwords in 33 different languages
- Multiple Output Formats: Single-line, list, or word frequency counts
- Flexible Input: Accept text via command-line arguments or stdin
- Punctuation Control: Optionally preserve or remove punctuation marks
- Case-Insensitive: Matches stopwords regardless of case
- Fast Performance: O(1) stopword lookup using associative arrays
- Dual Usage: Use as a standalone script or source as a Bash function
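The O(1) lookup works by loading each stopword as a key in a Bash associative array, so membership is a hash lookup rather than a linear scan. A minimal self-contained sketch of the idea (requires Bash 4+; the word list here is a tiny illustrative subset, not the real data):

```shell
# Sketch of O(1) stopword lookup with a Bash associative array.
# The word list is a tiny illustrative subset of an English list.
declare -A stop
for w in the a an and over; do
  stop[$w]=1
done

filtered=()
for w in The quick brown fox; do
  # ${w,,} lowercases the word, giving the case-insensitive match
  [[ -n ${stop[${w,,}]:-} ]] && continue
  filtered+=("$w")
done
echo "${filtered[*]}"   # -> quick brown fox
```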
Quick install:

```bash
curl -fsSL https://raw.githubusercontent.com/Open-Technology-Foundation/stopwords.bash/main/install.sh | sudo bash
```

System-wide (recommended):

```bash
git clone https://github.com/Open-Technology-Foundation/stopwords.bash
cd stopwords.bash
sudo ./install.sh install
```

User-local (no sudo):

```bash
PREFIX=$HOME/.local ./install.sh install
```

This installs the script to $PREFIX/bin/stopwords and the stopwords data to /usr/share/stopwords/ (33 languages, ~170KB). If Python NLTK stopwords are already installed, the data installation is skipped automatically.
```bash
# Verify installation
./install.sh check

# Uninstall (system)
sudo ./install.sh uninstall

# Uninstall (user)
PREFIX=$HOME/.local ./install.sh uninstall
```

Basic usage:

```bash
./stopwords 'the quick brown fox jumps over the lazy dog'
# Output: quick brown fox jumps lazy dog

echo 'the quick brown fox' | ./stopwords
cat document.txt | ./stopwords
```

Other languages:

```bash
./stopwords -l spanish 'el rápido zorro marrón salta sobre el perro perezoso'
# Output: rápido zorro marrón salta perro perezoso
```

Punctuation handling:

```bash
./stopwords 'Hello, world!'     # Output: hello world
./stopwords -p 'Hello, world!'  # Output: hello, world!
```

List output:

```bash
./stopwords -w 'the quick brown fox'
# Output:
# quick
# brown
# fox
```

Frequency counts:

```bash
./stopwords -c 'the fox jumps and the fox runs'
# Output:
# 1 jumps
# 1 runs
# 2 fox

./stopwords -c < document.txt
```

Supported languages:

albanian, arabic, azerbaijani, basque, belarusian, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, tamil, turkish
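To see which of these lists are actually installed, you can enumerate the data directory. This sketch assumes one file per language (the NLTK corpus layout); a temp directory stands in for /usr/share/stopwords so the example is self-contained.

```shell
# Sketch: enumerate installed stopword lists. Assumes one file per
# language (NLTK corpus layout); a temp dir stands in for
# /usr/share/stopwords so the demo runs anywhere.
data_dir=$(mktemp -d)
touch "$data_dir"/english "$data_dir"/spanish "$data_dir"/german
langs=$(ls "$data_dir" | paste -sd, -)
echo "Available languages: $langs"
rm -rf "$data_dir"
```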
| Option | Long Form | Description |
|---|---|---|
| -l LANG | --language LANG | Set the language for stopwords (default: english) |
| -p | --keep-punctuation | Keep punctuation marks (default: remove) |
| -w | --list-words | Output filtered words as a list (one per line) |
| -c | --count | Output word frequency counts (sorted ascending) |
| -V | --version | Show version information |
| -h | --help | Show help message |
Sourced as a function:

```bash
source stopwords
stopwords 'the quick brown fox'        # Output: quick brown fox
stopwords -l spanish 'el rápido zorro' # Output: rápido zorro
```

Practical examples:

```bash
# Extract keywords from a document
cat article.txt | ./stopwords -w | sort -u

# Find most common words
./stopwords -c < article.txt | tail -20

# Clean search queries
echo "how to install python on ubuntu" | ./stopwords
# Output: install python ubuntu

# Batch preprocessing
for file in corpus/*.txt; do
  ./stopwords < "$file" > "processed/$(basename "$file")"
done
```

Exit codes:

- 0: Success
- 1: Data directory or stopwords file not found
- 2: Missing argument for option
- 22: Invalid option
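These exit codes can be acted on in a wrapper script. The `handle_status` helper below is hypothetical (not part of this package); it simply maps the documented codes to messages after an invocation like `./stopwords ...`.

```shell
# Map the documented exit codes to messages; handle_status is a
# hypothetical wrapper helper, e.g. called with $? after ./stopwords.
handle_status() {
  case $1 in
    0)  echo "ok" ;;
    1)  echo "error: stopwords data not found" >&2; return 1 ;;
    2)  echo "error: missing argument for option" >&2; return 2 ;;
    22) echo "error: invalid option" >&2; return 22 ;;
    *)  echo "error: unexpected status $1" >&2; return 1 ;;
  esac
}

handle_status 0    # -> ok
```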
Stopwords data not found?
The script searches these locations in order:
1. $NLTK_DATA/corpora/stopwords/ (custom NLTK path)
2. /usr/share/nltk_data/corpora/stopwords/ (system NLTK)
3. /usr/share/stopwords/ (bundled fallback)
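The lookup order above can be reproduced in a few lines; this is a sketch of the documented order, not the script's actual code, and the first existing directory wins.

```shell
# Sketch of the documented search order: first existing directory wins.
find_stopwords_dir() {
  local d
  for d in "${NLTK_DATA:-/nonexistent}/corpora/stopwords" \
           /usr/share/nltk_data/corpora/stopwords \
           /usr/share/stopwords; do
    if [ -d "$d" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}

find_stopwords_dir || echo "no stopwords data found" >&2
```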
Solutions:
```bash
# Install this package
sudo ./install.sh install

# OR use Python NLTK
pip install nltk && python -m nltk.downloader stopwords

# OR set NLTK_DATA manually
export NLTK_DATA=/path/to/your/nltk_data
```

User-local install not in PATH?

```bash
# Add to ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
```

GPL-3. See LICENSE.
Contributions welcome! Submit issues or pull requests on GitHub.
Stopword lists sourced from the NLTK corpus.