Text Files: Encodings, Efficient Processing, and Analysis

In the previous article we covered the basics of working with files. Here we dig deeper into text files: encodings, efficient reading of large files, and a practical example of parsing a configuration file.

Specifics of Working with Text Files

Text files store sequences of characters organized into lines. When working with them in Python, there are several important points to consider:

Character Encodings

An encoding is a rule for representing characters as bytes. Different encodings map the same character to different byte sequences, and some cannot represent certain characters at all.

Several encodings have been developed so that computers can work with text in different languages and alphabets:

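  • ASCII: the simplest encoding, covering only unaccented Latin letters, digits, and basic punctuation (128 characters in total, including control characters).
  • UTF-8: the modern standard, able to encode every Unicode character and therefore text in any language (including emoji 😊). It uses one to four bytes per character and is backward compatible with ASCII.
  • Windows-1251 (cp1251): a single-byte encoding for Cyrillic, popular on Windows.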
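
To make the difference concrete, the sketch below encodes the same strings with each of the three encodings and compares the resulting byte counts. The sample strings are chosen only for illustration.

```python
# Compare how many bytes the same text occupies in different encodings.
samples = ["Hello", "Привет", "😊"]

for s in samples:
    for enc in ("ascii", "cp1251", "utf-8"):
        try:
            data = s.encode(enc)
            print(f"{s!r} in {enc}: {len(data)} bytes")
        except UnicodeEncodeError:
            print(f"{s!r} cannot be represented in {enc}")

# The Cyrillic word takes one byte per letter in cp1251
# and two bytes per letter in UTF-8; the emoji takes four bytes in UTF-8.
cyrillic_cp1251 = len("Привет".encode("cp1251"))
cyrillic_utf8 = len("Привет".encode("utf-8"))
emoji_utf8 = len("😊".encode("utf-8"))
```

Pure English text produces identical bytes in all three encodings, which is why encoding problems often go unnoticed until the first non-English character appears.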

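In Python, it is recommended to pass encoding='utf-8' explicitly whenever you open a text file, especially if your text contains more than just English letters. Without it, open() may fall back to a platform-dependent default encoding: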

Python 3.13
# Writing text in different encodings
text = "Привет, мир! Hello, world! 你好,世界!"

# Writing in UTF-8 (standard for international texts)
with open('text_utf8.txt', 'w', encoding='utf-8') as file:
    file.write(text)

# Writing in ASCII (only English letters)
try:
    with open('text_ascii.txt', 'w', encoding='ascii') as file:
        file.write(text)
except UnicodeEncodeError as e:
    print(f"ASCII encoding error: {e}")

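# Writing in cp1251 (Cyrillic for Windows)
# cp1251 has no Chinese characters; errors='replace' writes '?' in their place
# (without it, this write would raise UnicodeEncodeError as well)
with open('text_cp1251.txt', 'w', encoding='cp1251', errors='replace') as file:
    file.write(text)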
ASCII encoding error: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

As you can see, writing Russian or Chinese text with the ASCII encoding raises UnicodeEncodeError, because ASCII covers only English letters, digits, and basic symbols.
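
If raising an exception is not what you want, the errors parameter of open() and str.encode() controls what happens to characters the target encoding cannot represent. A short sketch of the most common settings follows; the sample string is illustrative.

```python
text = "Привет, world!"

# 'strict' is the default and raises UnicodeEncodeError
try:
    text.encode("ascii")
except UnicodeEncodeError:
    print("strict: error raised")

# Unsupported characters become '?'
replaced = text.encode("ascii", errors="replace")
# Unsupported characters are silently dropped
ignored = text.encode("ascii", errors="ignore")
# Unsupported characters become \uXXXX escape sequences
escaped = text.encode("ascii", errors="backslashreplace")

print(replaced)
print(ignored)
print(escaped)
```

All three non-strict modes lose or alter data, so they suit logs and best-effort exports better than files you need to read back exactly.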

Reading with the Correct Encoding

When reading a file, specify the same encoding that was used to write it:

Python 3.13
# Reading a file with a specified encoding
with open('text_utf8.txt', 'r', encoding='utf-8') as file:
    content = file.read()
    print(f"Content in UTF-8: {content}")
Content in UTF-8: Привет, мир! Hello, world! 你好,世界!

# Reading with the wrong encoding can lead to errors
try:
    with open('text_utf8.txt', 'r', encoding='ascii') as file:
        content = file.read()
        print(f"Content with wrong encoding: {content}")
except UnicodeDecodeError as e:
    print(f"Decoding error: {e}")
Decoding error: 'ascii' codec can't decode byte 0xd0 in position 0: ordinal not in range(128)
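
A wrong encoding does not always raise an exception. Single-byte encodings such as cp1251 accept almost any byte, so decoding UTF-8 data with them usually succeeds and silently produces garbled text. A minimal sketch, using a temporary file so it leaves nothing behind (the file name is illustrative):

```python
import os
import tempfile

text = "Привет, мир!"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_utf8.txt")  # illustrative file name

    with open(path, "w", encoding="utf-8") as file:
        file.write(text)

    # No exception here, but the result is not the original text
    with open(path, "r", encoding="cp1251") as file:
        garbled = file.read()

print(f"Original: {text}")
print(f"Garbled:  {garbled}")

# The bytes are intact, so reversing the wrong decoding recovers the text
recovered = garbled.encode("cp1251").decode("utf-8")
print(f"Recovered: {recovered}")
```

This kind of silent corruption is harder to notice than an exception, which is another reason to state the encoding explicitly on both the writing and the reading side.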

Efficiently Reading Large Text Files

When working with large text files, it's important to use methods that don't load the entire file into memory. This is especially critical when you are working with files that are hundreds of megabytes or gigabytes in size.

Why You Shouldn't Read the Entire File at Once

When you use the read() method without arguments, Python loads the entire file into memory:

Python 3.13
with open('big_file.txt', 'r', encoding='utf-8') as file:
    content = file.read()  # The entire file is loaded into memory!

This can cause problems:

  1. Memory Consumption — if the file is very large (e.g., gigabytes), it can take up all available RAM, leading to a slowdown or even a program crash.
  2. Latency — reading the entire file at once takes time, and your program will "hang" until the reading is complete.
  3. Inefficiency — often, you don't need all the data at once, but rather sequential access to it.

Line-by-Line Reading

A more efficient approach is to iterate over the file object in a loop. Python reads the file through a small internal buffer and hands your code one line at a time:

Python 3.13
# Create a test file with a large number of lines
with open('big_file.txt', 'w', encoding='utf-8') as file:
    for i in range(1000):
        file.write(f"Line number {i+1}\n")

# Efficient reading line by line
with open('big_file.txt', 'r', encoding='utf-8') as file:
    line_count = 0
    for line in file:  # Iterate over lines without loading the whole file into memory
        line_count += 1
        if line_count <= 5:  # Show only the first 5 lines
            print(line.strip())
    print(f"Total lines: {line_count}")
Line number 1
Line number 2
Line number 3
Line number 4
Line number 5
Total lines: 1000

The advantages of this approach are:

  • Only the current line (plus a small read buffer) is held in memory at a time.
  • Processing starts immediately; no need to wait for the entire file to load.
  • You can stop reading at any time if you find the data you need.
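
The last point can be sketched as a small search helper that returns as soon as it finds a match. The function name find_first_line and the sample file are hypothetical, introduced only for this illustration.

```python
import os
import tempfile


def find_first_line(path, needle, encoding="utf-8"):
    """Return (line_number, line) for the first line containing needle, or None."""
    with open(path, "r", encoding=encoding) as file:
        # enumerate() numbers the lines as they are read, starting from 1
        for number, line in enumerate(file, start=1):
            if needle in line:
                return number, line.rstrip("\n")  # stop reading here
    return None


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "big_file.txt")
    with open(path, "w", encoding="utf-8") as file:
        for i in range(1000):
            file.write(f"Line number {i + 1}\n")

    found = find_first_line(path, "number 42")
    missing = find_first_line(path, "no such text")

print(found)
print(missing)
```

Because the function returns from inside the loop, the rest of the file is never read once a match is found; the with statement still closes the file.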

Reading in Chunks

If you need even more control over the reading process, you can read the file in fixed-size chunks:

Python 3.13
# Reading a file in chunks
with open('big_file.txt', 'r', encoding='utf-8') as file:
    block_size = 100  # Block size in characters (text mode counts characters, not bytes)
    blocks_read = 0

    while True:
        block = file.read(block_size)
        if not block:  # If the block is empty, the end of the file has been reached
            break

        blocks_read += 1
        if blocks_read <= 2:  # Show only the first 2 blocks
            print(f"Block {blocks_read}: {block[:50]}...")  # Print the beginning of the block

    print(f"Total blocks read: {blocks_read}")
Block 1: Line number 1
Line number 2
Line number 3
Line num...
Block 2: ne number 8
Line number 9
Line number 10
Line numb...
Total blocks read: 159

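This method caps the memory used for reading at roughly one block. Note that block boundaries ignore line boundaries, so a line can be split across two blocks, as the output above shows. The block size can be adjusted depending on your needs and available memory.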
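
When the file should be treated as raw bytes, for example to compute a checksum, the same loop works in binary mode, where the size passed to read() is counted in bytes. The sketch below wraps the loop in a generator; the name read_in_chunks, the chunk size, and the sample data are assumptions made for this example.

```python
import hashlib
import os
import tempfile


def read_in_chunks(path, chunk_size=64 * 1024):
    """Yield successive chunks of bytes from the file at path."""
    with open(path, "rb") as file:
        while True:
            chunk = file.read(chunk_size)
            if not chunk:  # an empty bytes object means end of file
                break
            yield chunk


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    payload = b"Line of sample data\n" * 10_000
    with open(path, "wb") as file:
        file.write(payload)

    # Feed the file to SHA-256 one chunk at a time
    digest = hashlib.sha256()
    total_bytes = 0
    for chunk in read_in_chunks(path):
        digest.update(chunk)
        total_bytes += len(chunk)

print(f"Bytes read: {total_bytes}")
print(f"SHA-256: {digest.hexdigest()}")
```

Binary chunks are a good fit for hashing or copying. For text processing, line-by-line iteration is usually simpler, because a fixed-size chunk can end in the middle of a line.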

Practical Example: Reading and Processing a Configuration File

Let's look at a simple practical example: reading a configuration file and using its parameters in a program:

Python 3.13
# Create a sample configuration file
config_text = """
# Database parameters
database_host = localhost
database_port = 5432
database_name = myapp
database_user = admin
database_password = secret123

# Web server parameters
server_port = 8080
debug_mode = True
log_level = INFO
"""

with open('config.ini', 'w', encoding='utf-8') as config_file:
    config_file.write(config_text)

# Reading and processing the configuration
def read_config(filename):
    config = {}

    with open(filename, 'r', encoding='utf-8') as file:
        for line in file:
            # Skip empty lines and comments
            line = line.strip()
            if not line or line.startswith('#'):
                continue

            # Split key and value
            if '=' in line:
                key, value = line.split('=', 1)
                config[key.strip()] = value.strip()

    return config

# Read the configuration
app_config = read_config('config.ini')

# Use the parameters
print("Application Configuration:")
print(f"Database: {app_config['database_name']} on {app_config['database_host']}:{app_config['database_port']}")
print(f"DB User: {app_config['database_user']}")
print(f"Web Server Port: {app_config['server_port']}")
print(f"Debug Mode: {app_config['debug_mode']}")
Application Configuration:
Database: myapp on localhost:5432
DB User: admin
Web Server Port: 8080
Debug Mode: True

In this example, we:

  1. Created a configuration file with parameters.
  2. Wrote a function to read and process the file line by line.
  3. Extracted the necessary parameters and used them in the program.

This approach is often used in real applications to store settings in a human-readable format.
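
Note that read_config returns every value as a string: app_config['server_port'] is '8080', not 8080, and a value of 'False' would still be truthy because it is a non-empty string. One possible way to convert values is sketched below. The helper convert_value and its conversion rules are assumptions for illustration, not part of the example above.

```python
def convert_value(raw):
    """Convert a raw config string to bool or int where it clearly is one."""
    lowered = raw.lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return int(raw)
    except ValueError:
        return raw  # leave everything else as a string


# Same shape as the dictionary returned by read_config
raw_config = {
    "database_host": "localhost",
    "database_port": "5432",
    "server_port": "8080",
    "debug_mode": "True",
    "log_level": "INFO",
}

typed_config = {key: convert_value(value) for key, value in raw_config.items()}

print(typed_config)
```

With the values converted, a check such as if typed_config['debug_mode']: behaves as expected, and the port numbers can be passed directly to code that needs integers.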

In the next article we'll look at structured data formats — JSON and CSV.