A fast, multilingual text processing utility that filters stopwords from input text. Supports 33 languages with efficient O(1) lookup using Bash associative arrays.
Note: For documents over 2,000 words, consider the Python implementation, which performs better on large inputs. Both use the same NLTK stopwords data.
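One way to act on that note is a size-based dispatch in a wrapper script. This is an illustrative sketch only: the 2,000-word threshold comes from the note above, and the Python entry point name in the comment is hypothetical.

```shell
# Illustrative size-based dispatch; the threshold is from the note above
# and the Python entry point name is hypothetical.
doc=$(mktemp)
printf 'the quick brown fox %.0s' {1..600} > "$doc"   # ~2,400-word sample
words=$(wc -w < "$doc")
if (( words > 2000 )); then
  impl=python   # e.g. python3 stopwords.py < "$doc"
else
  impl=bash     # e.g. ./stopwords < "$doc"
fi
echo "$words words -> use the $impl implementation"
rm -f "$doc"
```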
- Multilingual Support: Filter stopwords in 33 different languages
- Multiple Output Formats: Single-line, list, or word frequency counts
- Flexible Input: Accept text via command-line arguments or stdin
- Punctuation Control: Optionally preserve or remove punctuation marks
- Case-Insensitive: Matches stopwords regardless of case
- Fast Performance: O(1) stopword lookup using associative arrays
- Dual Usage: Use as a standalone script or source as a Bash function
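The O(1) lookup works by loading each stopword as a key in a Bash associative array, so membership is a hash lookup rather than a linear scan. A minimal self-contained sketch of the idea (requires Bash 4+; the word list here is a tiny illustrative subset, not the real data):

```shell
# Sketch of O(1) stopword lookup with a Bash associative array.
# The word list is a tiny illustrative subset of an English list.
declare -A stop
for w in the a an and over; do
  stop[$w]=1
done

filtered=()
for w in The quick brown fox; do
  # ${w,,} lowercases the word, giving the case-insensitive match
  [[ -n ${stop[${w,,}]:-} ]] && continue
  filtered+=("$w")
done
echo "${filtered[*]}"   # -> quick brown fox
```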
Quick install:

```bash
curl -fsSL https://raw.githubusercontent.com/Open-Technology-Foundation/stopwords.bash/main/install.sh | sudo bash
```

System-wide (recommended):

```bash
git clone https://github.com/Open-Technology-Foundation/stopwords.bash
cd stopwords.bash
sudo ./install.sh install
```

User-local (no sudo):

```bash
PREFIX=$HOME/.local ./install.sh install
```

This installs the script to $PREFIX/bin/stopwords and the stopwords data to /usr/share/stopwords/ (33 languages, ~170KB). If Python NLTK stopwords are already installed, the data installation is skipped automatically.
```bash
# Verify installation
./install.sh check

# Uninstall (system)
sudo ./install.sh uninstall

# Uninstall (user)
PREFIX=$HOME/.local ./install.sh uninstall
```

Basic usage:

```bash
./stopwords 'the quick brown fox jumps over the lazy dog'
# Output: quick brown fox jumps lazy dog

echo 'the quick brown fox' | ./stopwords
cat document.txt | ./stopwords
```

Other languages:

```bash
./stopwords -l spanish 'el rápido zorro marrón salta sobre el perro perezoso'
# Output: rápido zorro marrón salta perro perezoso
```

Punctuation handling:

```bash
./stopwords 'Hello, world!'     # Output: hello world
./stopwords -p 'Hello, world!'  # Output: hello, world!
```

List output:

```bash
./stopwords -w 'the quick brown fox'
# Output:
# quick
# brown
# fox
```

Frequency counts:

```bash
./stopwords -c 'the fox jumps and the fox runs'
# Output:
# 1 jumps
# 1 runs
# 2 fox

./stopwords -c < document.txt
```

Supported languages:

albanian, arabic, azerbaijani, basque, belarusian, bengali, catalan, chinese, danish, dutch, english, finnish, french, german, greek, hebrew, hinglish, hungarian, indonesian, italian, kazakh, nepali, norwegian, portuguese, romanian, russian, slovene, spanish, swedish, tajik, tamil, turkish
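To see which of these lists are actually installed, you can enumerate the data directory. This sketch assumes one file per language (the NLTK corpus layout); a temp directory stands in for /usr/share/stopwords so the example is self-contained.

```shell
# Sketch: enumerate installed stopword lists. Assumes one file per
# language (NLTK corpus layout); a temp dir stands in for
# /usr/share/stopwords so the demo runs anywhere.
data_dir=$(mktemp -d)
touch "$data_dir"/english "$data_dir"/spanish "$data_dir"/german
langs=$(ls "$data_dir" | paste -sd, -)
echo "Available languages: $langs"
rm -rf "$data_dir"
```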
| Option | Long Form | Description |
|---|---|---|
| -l LANG | --language LANG | Set the language for stopwords (default: english) |
| -p | --keep-punctuation | Keep punctuation marks (default: remove) |
| -w | --list-words | Output filtered words as a list (one per line) |
| -c | --count | Output word frequency counts (sorted ascending) |
| -V | --version | Show version information |
| -h | --help | Show help message |
Sourced as a function:

```bash
source stopwords
stopwords 'the quick brown fox'        # Output: quick brown fox
stopwords -l spanish 'el rápido zorro' # Output: rápido zorro
```

Practical examples:

```bash
# Extract keywords from a document
cat article.txt | ./stopwords -w | sort -u

# Find most common words
./stopwords -c < article.txt | tail -20

# Clean search queries
echo "how to install python on ubuntu" | ./stopwords
# Output: install python ubuntu

# Batch preprocessing
for file in corpus/*.txt; do
  ./stopwords < "$file" > "processed/$(basename "$file")"
done
```

Exit codes:

- 0: Success
- 1: Data directory or stopwords file not found
- 2: Missing argument for option
- 22: Invalid option
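These exit codes can be acted on in a wrapper script. The `handle_status` helper below is hypothetical (not part of this package); it simply maps the documented codes to messages after an invocation like `./stopwords ...`.

```shell
# Map the documented exit codes to messages; handle_status is a
# hypothetical wrapper helper, e.g. called with $? after ./stopwords.
handle_status() {
  case $1 in
    0)  echo "ok" ;;
    1)  echo "error: stopwords data not found" >&2; return 1 ;;
    2)  echo "error: missing argument for option" >&2; return 2 ;;
    22) echo "error: invalid option" >&2; return 22 ;;
    *)  echo "error: unexpected status $1" >&2; return 1 ;;
  esac
}

handle_status 0    # -> ok
```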
Stopwords data not found?
The script searches these locations in order:
1. $NLTK_DATA/corpora/stopwords/ (custom NLTK path)
2. /usr/share/nltk_data/corpora/stopwords/ (system NLTK)
3. /usr/share/stopwords/ (bundled fallback)
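The lookup order above can be reproduced in a few lines; this is a sketch of the documented order, not the script's actual code, and the first existing directory wins.

```shell
# Sketch of the documented search order: first existing directory wins.
find_stopwords_dir() {
  local d
  for d in "${NLTK_DATA:-/nonexistent}/corpora/stopwords" \
           /usr/share/nltk_data/corpora/stopwords \
           /usr/share/stopwords; do
    if [ -d "$d" ]; then
      echo "$d"
      return 0
    fi
  done
  return 1
}

find_stopwords_dir || echo "no stopwords data found" >&2
```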
Solutions:
```bash
# Install this package
sudo ./install.sh install

# OR use Python NLTK
pip install nltk && python -m nltk.downloader stopwords

# OR set NLTK_DATA manually
export NLTK_DATA=/path/to/your/nltk_data
```

User-local install not in PATH?

```bash
# Add to ~/.bashrc
export PATH="$HOME/.local/bin:$PATH"
```

GPL-3. See LICENSE.
Contributions welcome! Submit issues or pull requests on GitHub.
Stopword lists sourced from the NLTK corpus.