Counting words and characters in a PDF with Python

I recently had to run some analyses on a number of PDFs, and I chose Python for the job (for several reasons).

In this article we'll see how to read a PDF and count its words and characters.

To read the PDF we'll use PyPDF2, which can be installed with pip:

pip install PyPDF2

Here is a code example:

import PyPDF2

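def extract_text(path='merged.pdf'):
    """Return the text of every page in the PDF, or '' if the file is missing."""
    try:
        with open(path, 'rb') as file:
            reader = PyPDF2.PdfReader(file)
            # Join pages with a newline so the last word of one page does not
            # fuse with the first word of the next; `or ''` guards against
            # pages that yield no text.
            return '\n'.join(page.extract_text() or '' for page in reader.pages)
    except FileNotFoundError:
        print(f"The file {path} does not exist.")
        return ''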

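def count_words(text):
    # split() with no arguments splits on any run of whitespace
    words = text.split()
    return len(words)

def count_characters(text):
    # Drop newlines so only the remaining characters (spaces included) are counted
    text = text.replace('\n', '')
    return len(text)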

if __name__ == "__main__":
    text = extract_text()

    if text:
        word_count = count_words(text)
        character_count_without_newlines = count_characters(text)

        print(f"Total words: {word_count}")
        print(f"Total characters without newlines: {character_count_without_newlines}")

The character count excludes newlines, since they were not relevant to my analysis.

As an exercise, you can of course modify the function so that it counts them as well.
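
One possible way to approach that exercise is sketched below. It is only a sketch: the include_newlines parameter is a hypothetical addition that is not part of the original code, and the sample string stands in for text extracted from a real PDF.

```python
def count_characters(text, include_newlines=False):
    """Count characters; newlines are skipped unless include_newlines is True."""
    # include_newlines is an assumed, illustrative parameter
    if not include_newlines:
        text = text.replace('\n', '')
    return len(text)


# Sample text standing in for the output of extract_text()
sample = "first line\nsecond line\n"
print(count_characters(sample))                         # newlines excluded
print(count_characters(sample, include_newlines=True))  # newlines included
```

Since the flag defaults to False, existing calls such as count_characters(text) keep behaving as before.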

Enjoy!
