Text Files: Encodings, Efficient Processing, and Analysis
In the previous article, we covered the basics of working with files in Python. Now we'll dive deeper into working with text files, which are one of the most common types of files in programming. 📝
Text files are used everywhere: for storing configurations, logs, data, program source code, and much more. The ability to work with them efficiently is an important skill for any programmer.
Features of Working with Text Files
Text files store sequences of characters organized into lines. When working with them in Python, you need to consider several important aspects:
Character Encodings
Encoding is a way of representing characters as bytes. Different encodings use different schemes to map characters.
In the modern world, there are many different languages and alphabets. For a computer to work with text in different languages, various character encoding systems have been developed:
- ASCII — the simplest encoding, contains only Latin letters, numbers, and basic symbols (a total of 128 characters)
- UTF-8 — the modern standard, supporting all languages of the world (including emoji 😊)
- Windows-1251 (cp1251) — encoding for Cyrillic, popular in Windows
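To make the difference concrete, here's a quick sketch that encodes the same word with each of these encodings and compares the byte counts (run in the interactive interpreter):

>>> word = "Привет"
>>> len(word.encode('utf-8'))   # two bytes per Cyrillic character in UTF-8
12
>>> len(word.encode('cp1251'))  # one byte per character in cp1251
6
>>> # word.encode('ascii') would raise UnicodeEncodeError - ASCII has no Cyrillic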
In Python, it's recommended to always use UTF-8, especially if your text contains more than just English letters:
# Writing text in different encodings
>>> text = "Привет, мир! Hello, world! 你好,世界!"

# Writing in UTF-8 (standard for international texts)
>>> with open('text_utf8.txt', 'w', encoding='utf-8') as file:
...     file.write(text)

# Writing in ASCII (English letters only)
>>> try:
...     with open('text_ascii.txt', 'w', encoding='ascii') as file:
...         file.write(text)
... except UnicodeEncodeError as e:
...     print(f"ASCII encoding error: {e}")

# Writing in cp1251 (Cyrillic for Windows)
>>> with open('text_cp1251.txt', 'w', encoding='cp1251', errors='replace') as file:
...     # With errors='replace', Chinese characters will be replaced with '?'
...     file.write(text)
ASCII encoding error: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)
As you can see, trying to write Russian or Chinese text in ASCII encoding causes an error, since ASCII only supports English characters.
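If you genuinely need to produce a file in a narrower encoding, open() accepts an errors parameter that controls what happens to characters the encoding cannot represent. A small sketch (the file names are just examples):

>>> text = "Привет, мир! Hello!"

# errors='replace' substitutes '?' for unencodable characters
>>> with open('ascii_replace.txt', 'w', encoding='ascii', errors='replace') as file:
...     file.write(text)

# errors='ignore' silently drops them instead
>>> with open('ascii_ignore.txt', 'w', encoding='ascii', errors='ignore') as file:
...     file.write(text)

>>> with open('ascii_replace.txt', 'r', encoding='ascii') as file:
...     print(file.read())
??????, ???! Hello!

Both options lose information, so treat them as a last resort; writing UTF-8 avoids the problem entirely.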
Reading with the Correct Encoding
When reading a file, it's important to specify the same encoding with which it was created:
# Reading a file with encoding specified
>>> with open('text_utf8.txt', 'r', encoding='utf-8') as file:
...     content = file.read()
...     print(f"Contents in UTF-8: {content}")

# Reading with the wrong encoding can lead to errors
>>> try:
...     with open('text_utf8.txt', 'r', encoding='ascii') as file:
...         content = file.read()
...         print(f"Contents with wrong encoding: {content}")
... except UnicodeDecodeError as e:
...     print(f"Decoding error: {e}")
Contents in UTF-8: Привет, мир! Hello, world! 你好,世界!
Decoding error: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
Determining a File's Encoding
Sometimes you receive a text file and don't know which encoding it was saved in. Python cannot determine the encoding automatically, but there are third-party libraries that can help, such as chardet (installed with pip install chardet):
# chardet library for determining encoding
>>> import chardet

# Let's create a file in cp1251
>>> with open('text_cp1251.txt', 'w', encoding='cp1251') as file:
...     file.write("Привет, мир!")

# Read the file as bytes and determine the encoding
>>> with open('text_cp1251.txt', 'rb') as file:
...     raw_data = file.read()
...     result = chardet.detect(raw_data)
...     print(f"Detected encoding: {result}")

# Now we can open the file with the correct encoding
>>> encoding = result['encoding']
>>> with open('text_cp1251.txt', 'r', encoding=encoding) as text_file:
...     content = text_file.read()
...     print(f"Correctly read content: {content}")
Detected encoding: {'encoding': 'windows-1251', 'confidence': 0.99, 'language': 'Russian'}
Correctly read content: Привет, мир!
Efficiently Reading Large Text Files
When working with large text files, it's important to use methods that don't load the entire file into memory. This is especially critical when you're working with files that are hundreds of megabytes or gigabytes in size.
Why Shouldn't You Read the Entire File at Once?
When you use the read() method without arguments, Python loads the entire file into memory:
with open('big_file.txt', 'r') as file:
    content = file.read()  # The entire file is loaded into memory!
This can cause problems:
- Memory consumption — if the file is very large (e.g., gigabytes), it can occupy all available RAM, which will lead to slowdowns or even program crashes
- Delay — reading the entire file at once takes time, and your program will "hang" until the reading is complete
- Inefficiency — often you don't need all the data at once, but rather sequential access to it
Reading Line by Line
A more efficient approach is to read the file line by line using a loop. Python will load only one line into memory at a time:
# Let's create a test file with a large number of lines
>>> with open('big_file.txt', 'w') as file:
...     for i in range(1000):
...         file.write(f"Line number {i+1}\n")

# Efficient reading by lines
>>> with open('big_file.txt', 'r') as file:
...     line_count = 0
...     for line in file:  # Iterate over lines without loading the entire file into memory
...         line_count += 1
...         if line_count <= 5:  # Show only the first 5 lines
...             print(line.strip())
...     print(f"Total lines: {line_count}")
Line number 1
Line number 2
Line number 3
Line number 4
Line number 5
Total lines: 1000
The advantages of this approach:
- Only one line is in memory at a time
- Processing begins immediately, no need to wait for the entire file to load
- You can stop reading at any time once you've found the data you need (see the sketch after this list)
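For instance, here's a minimal sketch of that last point, searching the big_file.txt created above and stopping at the first match:

# Stop as soon as the line we're looking for is found
>>> with open('big_file.txt', 'r') as file:
...     for line_number, line in enumerate(file, start=1):
...         if line.strip() == 'Line number 500':
...             print(f"Found on line {line_number}")
...             break  # the remaining 500 lines are never read
Found on line 500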
Reading in Blocks
If you need even more control over the reading process, you can read the file in fixed-size blocks:
# Reading a file in blocks
>>> with open('big_file.txt', 'r') as file:
...     block_size = 100  # Block size in characters (in text mode, read() counts characters)
...     blocks_read = 0
...     while True:
...         block = file.read(block_size)
...         if not block:  # If the block is empty, we've reached the end of the file
...             break
...         blocks_read += 1
...         if blocks_read <= 2:  # Show only the first 2 blocks
...             print(f"Block {blocks_read}: {block[:50]!r}...")  # !r shows newlines as \n
...     print(f"Total blocks read: {blocks_read}")
Block 1: 'Line number 1\nLine number 2\nLine number 3\nLine num'...
Block 2: 'ne number 8\nLine number 9\nLine number 10\nLine numb'...
Total blocks read: 159
This method allows you to control the amount of memory used for reading the file. The block size can be adjusted depending on your needs and available memory.
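A common idiomatic variant of the same loop uses iter() with a sentinel value, which calls file.read(block_size) repeatedly until it returns an empty string. A sketch, assuming the 1000-line big_file.txt from earlier (4096 is just a typical starting size):

>>> total_chars = 0
>>> with open('big_file.txt', 'r') as file:
...     for block in iter(lambda: file.read(4096), ''):
...         total_chars += len(block)
>>> print(f"Read {total_chars} characters in blocks")
Read 15893 characters in blocks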
Text Files and Buffering
When writing data to a file, it doesn't always go directly to disk. Instead, the operating system and Python use a buffering mechanism to improve performance.
What is Buffering and Why is it Needed?
Buffering is the process of temporarily storing data in memory (a buffer) before it's actually written to disk.
Why buffering is important:
- Speed — disk write operations are relatively slow, while memory access is fast
- Reduced wear — for some types of disks (especially SSDs), buffering reduces wear by combining many small write operations into fewer large ones
- Energy efficiency — fewer disk operations save energy (important for mobile devices)
Imagine you're writing a letter. Buffering is like drafting it in a notebook (fast) and only then copying it out onto the final sheet to send (slow).
Managing Buffering in Python
Python allows you to control buffering parameters when opening a file:
# Writing with different buffer settings
>>> with open('buffer_test.txt', 'w', buffering=1) as file:
...     file.write("This will be written with line buffering\n")
...     print("Line written with line buffering")

>>> with open('buffer_test2.txt', 'w', buffering=-1) as file:
...     file.write("This will be written with default buffering\n")
...     print("Line written with default buffering")

# Forcing buffer flush
>>> with open('flush_test.txt', 'w') as file:
...     file.write("This line may remain in the buffer\n")
...     file.flush()  # Force write to disk
...     print("Buffer flushed to disk")
Line written with line buffering
Line written with default buffering
Buffer flushed to disk
The buffering parameter can take various values (a short sketch of the less common modes follows the list):
- 0 — buffering is disabled (only for binary files)
- 1 — line buffering (text mode only; the buffer is flushed each time a newline character is written)
- >1 — buffer of the specified size in bytes
- -1 or not specified — default buffering is used (usually an efficient size for the file system)
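Here is a brief sketch of the two less common modes (the file names are made up for the example):

# buffering=0: no buffering at all - allowed only in binary mode
>>> with open('unbuffered.bin', 'wb', buffering=0) as file:
...     file.write(b'goes straight to the OS')

# buffering=65536: an explicit 64 KB buffer
>>> with open('sized_buffer.txt', 'w', buffering=65536) as file:
...     file.write('collected in a 64 KB buffer before being flushed\n')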
When is it Important to Manage Buffering?
Managing buffering is especially important in the following cases:
- Journaling and logging — to be sure that logs are written immediately, especially in case of program crashes
- Working with external devices — when writing to removable media or network drives
- Synchronization between processes — when one process writes data and another reads it
The flush() method allows you to force the buffer contents to be written to disk at any time without closing the file.
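As an illustration of the logging case, here's a minimal sketch (the file name and messages are invented for the example):

>>> import datetime

# Keep the log open and flush after every record so entries
# are not lost if the program crashes later on
>>> log_file = open('app.log', 'a', encoding='utf-8')

>>> def log_event(message):
...     timestamp = datetime.datetime.now().isoformat()
...     log_file.write(f"{timestamp} {message}\n")
...     log_file.flush()  # push the record out of the buffer immediately

>>> log_event("Application started")
>>> log_event("Processing finished")
>>> log_file.close()

Opening the log with buffering=1 would achieve a similar effect for line-oriented records without explicit flush() calls.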
Working with Paths and Relative Paths
When working with text files, it's important to specify paths correctly:
>>> import os
>>> from pathlib import Path

# Current working directory
>>> current_dir = os.getcwd()
>>> print(f"Current directory: {current_dir}")

# Creating a subdirectory
>>> os.makedirs('data', exist_ok=True)

# Writing to a file using a relative path
>>> with open('data/config.txt', 'w') as file:
...     file.write("key=value\n")
...     file.write("debug=true\n")

# Reading from a file using an absolute path
>>> absolute_path = os.path.join(current_dir, 'data', 'config.txt')
>>> with open(absolute_path, 'r') as file:
...     content = file.read()
...     print(f"Contents of config.txt: {content}")

# Using pathlib to work with paths
>>> config_path = Path('data') / 'config.txt'
>>> print(f"Path to file: {config_path}")
>>> print(f"Absolute path: {config_path.absolute()}")

# Reading a file using pathlib
>>> content = config_path.read_text()
>>> print(f"Contents via pathlib: {content}")
Current directory: /path/to/current/directory
Contents of config.txt: key=value
debug=true
Path to file: data/config.txt
Absolute path: /path/to/current/directory/data/config.txt
Contents via pathlib: key=value
debug=true
Practical Example: Reading and Processing a Configuration File
Let's consider a simple practical example: reading a configuration file and using its parameters in a program:
# Let's create a sample configuration file
>>> config_text = """
... # Database parameters
... database_host = localhost
... database_port = 5432
... database_name = myapp
... database_user = admin
... database_password = secret123
...
... # Web server parameters
... server_port = 8080
... debug_mode = True
... log_level = INFO
... """

>>> with open('config.ini', 'w') as config_file:
...     config_file.write(config_text)

# Reading and processing the configuration
>>> def read_config(filename):
...     config = {}
...     with open(filename, 'r') as file:
...         for line in file:
...             # Skip empty lines and comments
...             line = line.strip()
...             if not line or line.startswith('#'):
...                 continue
...             # Split key and value
...             if '=' in line:
...                 key, value = line.split('=', 1)
...                 config[key.strip()] = value.strip()
...     return config

# Read the configuration
>>> app_config = read_config('config.ini')

# Use the parameters
>>> print("Application configuration:")
>>> print(f"Database: {app_config['database_name']} on {app_config['database_host']}:{app_config['database_port']}")
>>> print(f"DB user: {app_config['database_user']}")
>>> print(f"Web server port: {app_config['server_port']}")
>>> print(f"Debug mode: {app_config['debug_mode']}")

Application configuration:
Database: myapp on localhost:5432
DB user: admin
Web server port: 8080
Debug mode: True
In this example, we:
- Created a configuration file with parameters
- Wrote a function to read and process the file line by line
- Extracted the necessary parameters and used them in the program
This approach is often used in real applications to store settings in a human-readable format.
Understanding Check
Let's check how well you've understood the topic of reading and writing text files:
What is the correct way to open a file for reading with UTF-8 encoding?
Now you know the basic principles of working with text files in Python. In the next article, we'll look at working with structured data formats such as JSON and CSV. See you then! 👋