10 Most Frequent Words In ‘Dune: Part Two’ with Python 

A guide to finding the 10 most frequently used words in 'Dune: Part Two' using Python

So, for all of you Dune geeks and Python lovers: I wanted my first article about Python to include an explanation of the code, so I wrote a script that counts the occurrences of the 10 most frequent words in Dune by Frank Herbert, in anticipation of the upcoming movie ‘Dune: Part Two’.

We start by importing the necessary modules: Counter from collections, re for regular expressions, and matplotlib.pyplot as plt for plotting.

from collections import Counter
import re
import matplotlib.pyplot as plt

We define the get_word_frequency(file_path) function to read the text from a file, tokenize it, count word frequencies, and return the results. The regular expression uses a negative lookahead to exclude common words (such as "the", "of", and "said") that I expected would otherwise dominate the results.

def get_word_frequency(file_path):
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()        
        words = re.findall(r'\b(?!(?:the|of|a|to|and|he|in|his|it|you|said|that|i|s|she|was|with|her|this|at|for|on|is|'
                           r'had|they|him|as|from|there|be|but|we|my|not|one|them|have|by|what|me|your|thought|out|t|'
                           r'an|into|could|are|their|were|man|will|no|now|do|all'
                           r'|been|here|ll|ve|up|see|who|must|its|can|back|asked)\b)\w+\b', text.lower())
        word_count = Counter(words)  # Count the frequency of each word
        return word_count
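To see how the negative lookahead filters tokens, here is a minimal sketch on a short sample sentence. The shortened exclusion list and the sample text are assumptions made for illustration; they are not the full list used above or text from the novel.

```python
from collections import Counter
import re

# Hypothetical, shortened exclusion list (the list in the article is much longer)
EXCLUDED = r'the|of|a|to|and|is|s|t'

# \b(?!(?:...)\b) rejects a token only when an excluded word matches it in full,
# so "spice" is kept even though it starts with the excluded "s".
PATTERN = re.compile(r'\b(?!(?:' + EXCLUDED + r')\b)\w+\b')

sample = "The spice is the key to Arrakis, and Paul's path leads to the spice."
tokens = PATTERN.findall(sample.lower())
counts = Counter(tokens)

print(tokens)
print(counts.most_common(1))
```

Note that "paul's" is split into "paul" and "s" at the apostrophe, which is why single letters such as "s" and "t" appear in the exclusion list.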

Now we create a function that prints the top 10 words and plots them as a cool histogram (strictly speaking, a bar chart of word counts). The generate_histogram() function calls get_word_frequency() to obtain the word frequencies.

def generate_histogram():
    dune_file_path = "dune.txt"  
    word_frequency = get_word_frequency(dune_file_path)
    most_common_words = word_frequency.most_common(10)  # Get the 10 most common words

    # Extract words and frequencies for plotting
    words = [word[0] for word in most_common_words]
    frequencies = [word[1] for word in most_common_words]

    print("Top 10 most common words in Dune (excluding common words):")
    for word, frequency in most_common_words:
        print(f"{word}: {frequency}")
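As a small aside, the two list comprehensions above can also be written with zip(*pairs), which transposes the (word, count) pairs into separate tuples. This is only a sketch with made-up counts; the numbers below are illustrative and not taken from the book.

```python
from collections import Counter

# Illustrative counts only; not taken from the novel
word_frequency = Counter({"spice": 5, "sand": 3, "worm": 2, "water": 1})

# most_common(n) returns a list of (word, count) pairs, highest count first
most_common_words = word_frequency.most_common(3)

# zip(*pairs) transposes the pairs into one tuple of words and one of counts
words, frequencies = zip(*most_common_words)

print(words)
print(frequencies)
```

If the counter is empty, the unpacking raises a ValueError, so the list comprehension form is the safer choice when the input file might contain no matching words.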

In the same function, we generate the histogram using Matplotlib:

    # Create histogram
    plt.figure(figsize=(10, 6))
    plt.bar(words, frequencies, color='skyblue')
    plt.xlabel('Words')
    plt.ylabel('Count')
    plt.title('Top 10 Most Common Words in Dune')
    plt.xticks(rotation=45, ha='center')
    plt.tight_layout()
    plt.show()


# Call the function so the script actually runs
generate_histogram()

Output:

Top 10 most common words in Dune (excluding common words):
paul: 1735
jessica: 903
baron: 593
duke: 581
fremen: 520
hawat: 429
mother: 415
stilgar: 403
water: 377
kynes: 369

I did add a picture under every histogram in the code, but that will be in another article 😉

Let me know if you want to see more posts like this, and go read some of my other articles about code, data, and climate change.



