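C++ File Reading, Revisited

I. Core Concepts and Basic Workflow of File Reading

1.1 Three Elements of File Operations

File reading is essentially "the transfer of data between external storage and memory". Three core elements matter:

Stream object: the C++ standard library provides the file-reading interface through std::ifstream (input file stream), which acts as the bridge between the program and the external file.
Stream state: the member functions good()/eof()/fail()/bad() report whether an operation succeeded; they reflect the eofbit, failbit, and badbit state flags.
Data buffer: both the operating system and the standard library maintain buffers to reduce the number of disk I/O operations (the default size is implementation-defined, commonly 4 KB or 8 KB).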
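
To make the stream-state element concrete, here is a minimal sketch that reports which flags are set after one formatted read. The helper names describe_state and state_after_int_read are hypothetical and used only for illustration; an in-memory std::istringstream stands in for a file, since std::ifstream shares the same state interface.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Summarize which state flags are set on a stream.
std::string describe_state(const std::ios& s) {
    if (s.good()) return "good";
    std::string out;
    if (s.eof()) out += "eof ";
    if (s.bad()) out += "bad ";
    else if (s.fail()) out += "fail ";  // fail() is also true when badbit is set
    out.pop_back();                     // drop the trailing space
    return out;
}

// Hypothetical helper: try to read one int from `text`, then report the state.
std::string state_after_int_read(const std::string& text) {
    std::istringstream in(text);
    int value = 0;
    in >> value;
    return describe_state(in);
}
```

With this sketch, "42 x" leaves the stream good, "x" sets only failbit, "42" sets only eofbit (the number was read, but the end of input was reached), and an empty input sets both eofbit and failbit.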

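1.2 Basic File-Reading Workflow (Standard Pattern)

Every file-reading operation follows the same core flow of "open, read, close". A standard implementation looks like this: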

#include <fstream>
#include <iostream>
#include <string>

int main() {
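    // 1. Create the stream object and open the file (RAII manages the resource automatically)
    std::ifstream file("example.txt");

    // 2. Check that the file opened successfully
    if (!file.is_open()) { // Right after construction, !file or file.fail() gives the same result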
        std::cerr << "Error: Failed to open file" << std::endl;
        return 1;
    }
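    // 3. Read the file contents (three common approaches)
    std::string line;
    // Approach 1: read line by line (typical for text files)
    while (std::getline(file, line)) {
        std::cout << line << std::endl;
    }
    // Approach 2: read character by character (also works for binary data)
    // char ch;
    // while (file.get(ch)) {
    //     std::cout << ch;
    // }
    // Approach 3: formatted extraction (similar to scanf)
    // std::string word;
    // while (file >> word) {
    //     std::cout << word << " ";
    // }
    // 4. Check whether reading ended normally (at end of file both eofbit and failbit are set, so test eof() first)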
    if (file.eof()) {
        std::cout << "\nSuccess: End of file reached" << std::endl;
    } else if (file.fail()) {
        std::cerr << "\nError: Failed during file reading" << std::endl;
    }
    // 5. Close the file (the destructor closes it automatically; an explicit close() releases the resource earlier)
    file.close();
    return 0;
}

Output (assuming example.txt contains "Hello C++ File IO\nWelcome to Tutorial"):

Hello C++ File IO
Welcome to Tutorial
Success: End of file reached

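II. Text File Reading Techniques in Detail

2.1 Encoding and Line-Ending Handling for Text Files

Encoding: std::ifstream reads raw bytes and performs no encoding conversion. On Linux, where UTF-8 is the usual convention, UTF-8 text passes through unchanged; if the file starts with a BOM (byte order mark, bytes EF BB BF), those three bytes appear at the start of the first line and must be skipped by the program.
Line endings: Linux uses \n as the line terminator, and std::getline() splits on \n only. On Windows, text mode translates \r\n to \n; on Linux, a file with \r\n endings leaves a trailing \r on each line that the program has to strip.
Encoding conversion (C++11 and later; std::codecvt_utf8 is deprecated since C++17):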

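#include <codecvt>
#include <fstream>
#include <locale>
#include <string>

// Read a UTF-8 file into wide strings (C++11 and later; std::codecvt_utf8
// is deprecated since C++17, so compilers may emit a warning)
std::wifstream file("utf8_file.txt");
file.imbue(std::locale(file.getloc(), new std::codecvt_utf8<wchar_t>));
std::wstring wline;
while (std::getline(file, wline)) {
    // Process the wide-character string
}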
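
When the text can stay as UTF-8 bytes, the BOM and \r issues described above can be handled without any conversion. The following is a minimal sketch under that assumption; strip_utf8_bom and read_text_lines are hypothetical helper names, and the function accepts any std::istream so it works with std::ifstream as well as in-memory streams.

```cpp
#include <istream>
#include <sstream>  // only needed to try the function on in-memory text
#include <string>
#include <vector>

// Remove a leading UTF-8 BOM (bytes EF BB BF) if present.
void strip_utf8_bom(std::string& s) {
    if (s.compare(0, 3, "\xEF\xBB\xBF") == 0) s.erase(0, 3);
}

// Read all lines, dropping a BOM on the first line and a trailing '\r'
// left behind by CRLF line endings.
std::vector<std::string> read_text_lines(std::istream& in) {
    std::vector<std::string> lines;
    std::string line;
    while (std::getline(in, line)) {
        if (lines.empty()) strip_utf8_bom(line);  // a BOM can only precede the first line
        if (!line.empty() && line.back() == '\r') line.pop_back();
        lines.push_back(line);
    }
    return lines;
}
```

Opening the file with std::ios::binary before passing it in keeps the behavior identical on Windows and Linux, because no text-mode translation takes place.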

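2.2 Efficient Reading of Large Text Files

For text files larger than about 100 MB, read performance is worth optimizing. Key techniques:

Larger buffer: fewer I/O operations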

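const size_t BUFFER_SIZE = 1024 * 1024; // 1 MB buffer
std::vector<char> buffer(BUFFER_SIZE);  // Must outlive the stream; std::vector avoids a leaked new[]
std::ifstream file;
file.rdbuf()->pubsetbuf(buffer.data(), BUFFER_SIZE); // Call before open(); the effect is implementation-defined
file.open("example.txt");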

Bulk reading: read a large block at once, then process it

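std::string buffer;
buffer.resize(1024 * 1024); // Preallocate 1 MB
while (file.read(&buffer[0], buffer.size())) {
    size_t bytes_read = file.gcount(); // Bytes actually read (a full block inside this loop)
    // Process the first bytes_read characters of buffer
}

// Handle the final partial block: the last read() fails at end of file,
// but gcount() still reports how many bytes it delivered
size_t remaining = file.gcount();
if (remaining > 0) {
    // Process the first remaining characters of buffer
}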
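
The bulk-reading pattern can also be folded into a single loop that handles full and partial blocks the same way. The sketch below counts newline bytes as a stand-in for real processing; count_newlines is a hypothetical name, and chunk_size is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cstddef>
#include <istream>
#include <sstream>  // only needed to try the function on in-memory text
#include <vector>

// Count '\n' bytes by reading the stream in fixed-size chunks.
// Assumes chunk_size > 0.
std::size_t count_newlines(std::istream& in, std::size_t chunk_size = 1 << 20) {
    std::vector<char> buf(chunk_size);
    std::size_t count = 0;
    // read() returns false on the final partial chunk, but gcount() still
    // reports the bytes delivered; once nothing is left, gcount() is 0.
    while (in.read(buf.data(), static_cast<std::streamsize>(buf.size())) ||
           in.gcount() > 0) {
        count += static_cast<std::size_t>(
            std::count(buf.begin(), buf.begin() + in.gcount(), '\n'));
    }
    return count;
}
```

Passing an std::ifstream opened with std::ios::binary works the same way; the chunk size trades memory for fewer read calls.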

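Disabling synchronization: turn off synchronization between the C++ standard streams and C stdio. This affects only the standard streams such as std::cin and std::cout, not std::ifstream itself, so it helps when the program also prints or reads a lot on the console (reported speedups are around 2-3x, but the gain varies).

std::ios_base::sync_with_stdio(false); // Disable synchronization with C stdio
std::cin.tie(nullptr);                 // Untie cin from cout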

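III. Binary File Reading Techniques in Detail

3.1 Core Differences Between Binary and Text Files

Property	Text file	Binary file
Storage	Characters encoded as ASCII / Unicode	Raw binary representation of the data
Line-ending handling	Translated automatically on some platforms (\n ↔ \r\n on Windows)	None; bytes are stored as-is
Typical uses	Configuration files, logs, documents	Images, video, executables, databases
Reading style	By character / by line	By fixed-size block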
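
As a small illustration of the "Storage" row, the sketch below produces both representations of the same 32-bit value. The helper names as_text and as_binary are hypothetical; the binary form uses the host byte order.

```cpp
#include <cstdint>
#include <cstring>
#include <string>

// Text representation: decimal digits, one byte per character.
std::string as_text(std::uint32_t v) {
    return std::to_string(v);
}

// Binary representation: the raw bytes of the value in host byte order.
std::string as_binary(std::uint32_t v) {
    std::string bytes(sizeof v, '\0');
    std::memcpy(&bytes[0], &v, sizeof v);
    return bytes;
}
```

For the value 305419896 (0x12345678), the text form takes 9 bytes while the binary form always takes 4, regardless of the value.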

3.2 Standard Implementation of Binary File Reading

#include <cstdint>
#include <fstream>
#include <iostream>
#include <vector>

// Layout of the binary header to read (12 bytes, no padding on common ABIs;
// fields are assumed to be stored in the host's byte order)
struct ImageHeader {
    uint32_t width;     // 4 bytes
    uint32_t height;    // 4 bytes
    uint16_t bit_depth; // 2 bytes, bits per channel (as the size formula below implies)
    uint16_t channels;  // 2 bytes
};

int main() {
    // 1. Open the file in binary mode (std::ios::binary is required)
    std::ifstream file("image.raw", std::ios::binary);
    if (!file) {
        std::cerr << "Error: Failed to open binary file" << std::endl;
        return 1;
    }
    // 2. Read the fixed-size file header
    ImageHeader header;
    file.read(reinterpret_cast<char*>(&header), sizeof(ImageHeader));

    // 3. Check that the read succeeded (always verify the byte count)
    if (file.gcount() != static_cast<std::streamsize>(sizeof(ImageHeader))) {
        std::cerr << "Error: Failed to read image header" << std::endl;
        return 1;
    }
    std::cout << "Image Info - Width: " << header.width
              << ", Height: " << header.height
              << ", BitDepth: " << header.bit_depth
              << ", Channels: " << header.channels << std::endl;
    // 4. Read the image data (size depends on the header); widen to size_t
    //    first so the multiplication is not done in 32-bit arithmetic
    size_t data_size = static_cast<size_t>(header.width) * header.height
                     * header.channels * (header.bit_depth / 8);
    std::vector<uint8_t> image_data(data_size);

    file.read(reinterpret_cast<char*>(image_data.data()),
              static_cast<std::streamsize>(data_size));
    if (file.gcount() != static_cast<std::streamsize>(data_size)) {
        std::cerr << "Error: Incomplete image data" << std::endl;
        return 1;
    }
    std::cout << "Success: Read " << file.gcount() << " bytes of image data" << std::endl;
    return 0;
}

Output (assuming image.raw is a 256x256 RGB888 image, i.e. 3 channels of 8 bits each):

Image Info - Width: 256, Height: 256, BitDepth: 8, Channels: 3
Success: Read 196608 bytes of image data
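
Reading a struct in one call ties the file format to the host's byte order and struct layout. A more portable variant decodes each field explicitly. The sketch below assumes, purely for illustration, that the header fields are stored little-endian; read_le and read_header_le are hypothetical helper names.

```cpp
#include <cstdint>
#include <istream>
#include <sstream>  // only needed to try the function on in-memory bytes
#include <stdexcept>
#include <string>

struct ImageHeader {
    std::uint32_t width;
    std::uint32_t height;
    std::uint16_t bit_depth;
    std::uint16_t channels;
};

// Read n bytes (1 <= n <= 8) and decode them as an unsigned little-endian integer.
std::uint64_t read_le(std::istream& in, int n) {
    unsigned char bytes[8] = {};
    in.read(reinterpret_cast<char*>(bytes), n);
    if (in.gcount() != n) throw std::runtime_error("unexpected end of data");
    std::uint64_t value = 0;
    for (int i = n - 1; i >= 0; --i) value = (value << 8) | bytes[i];
    return value;
}

// Decode the 12-byte header field by field (little-endian is an assumption here).
ImageHeader read_header_le(std::istream& in) {
    ImageHeader h;
    h.width = static_cast<std::uint32_t>(read_le(in, 4));
    h.height = static_cast<std::uint32_t>(read_le(in, 4));
    h.bit_depth = static_cast<std::uint16_t>(read_le(in, 2));
    h.channels = static_cast<std::uint16_t>(read_le(in, 2));
    return h;
}
```

Because each field is decoded from individual bytes, the result does not depend on the host's endianness or on compiler padding.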

3.3 Random Access in Binary Files

seekg() (set the read position) and tellg() (get the current position) provide random access:

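// 1. Get the file size
file.seekg(0, std::ios::end);            // Move to the end of the file
std::streamoff file_size = file.tellg(); // Current position equals the file size (-1 on failure)
file.seekg(0, std::ios::beg);            // Return to the beginning

// 2. Skip the first 100 bytes, then read
file.seekg(100, std::ios::beg);          // Offset 100 bytes from the beginning
std::vector<char> data(1024);
file.read(data.data(), 1024);

// 3. Move 50 bytes forward from the current position
file.seekg(50, std::ios::cur);

// 4. Move to 200 bytes before the end of the file
file.seekg(-200, std::ios::end);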
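
The seek operations above can be wrapped into two small helpers. This is a minimal sketch; stream_size and read_at are hypothetical names. Both call clear() first, because a stream that has previously hit end of file keeps failbit set and would otherwise refuse further reads.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>  // only needed to try the helpers on in-memory data
#include <string>

// Size of a seekable stream in bytes (-1 if the position cannot be queried).
std::streamoff stream_size(std::istream& in) {
    in.clear();                                   // drop any earlier eof/fail state
    const std::istream::pos_type saved = in.tellg();
    in.seekg(0, std::ios::end);
    const std::streamoff size = in.tellg();
    in.seekg(saved);                              // restore the caller's position
    return size;
}

// Read up to `count` bytes starting at absolute byte `offset`.
// The result is shorter than `count` if the stream ends early.
std::string read_at(std::istream& in, std::streamoff offset, std::size_t count) {
    std::string out(count, '\0');
    in.clear();
    in.seekg(offset, std::ios::beg);
    in.read(&out[0], static_cast<std::streamsize>(count));
    out.resize(static_cast<std::size_t>(in.gcount()));
    return out;
}
```

For file streams, open with std::ios::binary so that byte offsets correspond exactly to positions in the file.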

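IV. Advanced File Reading Techniques

4.1 Memory-Mapped Files (POSIX)

A memory-mapped file maps the file's contents directly into the process address space, which avoids copying data into user-space buffers and is often a good fit for GB-scale files. Standard C++ (including C++17) has no memory-mapping API; on Linux the POSIX mmap() interface is used, which requires <sys/mman.h> and related headers: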

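#include <cstdint>
#include <iostream>
#include <stdexcept>
#include <string>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

class MemoryMappedFile {
public:
    explicit MemoryMappedFile(const std::string& path) : data_(nullptr), size_(0) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd == -1) {
            throw std::runtime_error("Failed to open file");
        }
        off_t end = lseek(fd, 0, SEEK_END); // File size in bytes, -1 on error
        if (end <= 0) {                     // mmap rejects a zero-length mapping
            close(fd);
            throw std::runtime_error("Failed to get file size or file is empty");
        }
        size_ = static_cast<size_t>(end);
        void* addr = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd); // The mapping stays valid after the descriptor is closed
        if (addr == MAP_FAILED) {
            throw std::runtime_error("Failed to map file");
        }
        data_ = addr;
    }
    ~MemoryMappedFile() {
        if (data_ != nullptr) munmap(data_, size_);
    }
    // Non-copyable: a copy would unmap the same region twice
    MemoryMappedFile(const MemoryMappedFile&) = delete;
    MemoryMappedFile& operator=(const MemoryMappedFile&) = delete;
    const void* data() const { return data_; }
    size_t size() const { return size_; }
private:
    void* data_; // Non-const because munmap() takes void*
    size_t size_;
};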

// Usage example
int main() {
    try {
        MemoryMappedFile mmf("large_file.dat");
        const char* data = static_cast<const char*>(mmf.data());
        size_t size = mmf.size();
        std::cout << "Mapped " << size << " bytes. First 10 bytes: ";
        for (size_t i = 0; i < 10 && i < size; ++i) {
            std::cout << std::hex << static_cast<int>(static_cast<uint8_t>(data[i])) << " ";
        }
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}
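
Once a file is mapped, its contents can be scanned like any other memory range. As a minimal sketch, the function below counts newline bytes in a buffer; count_lines is a hypothetical name, and the pointer and size could come from the data() and size() accessors of the class above.

```cpp
#include <cstddef>
#include <cstring>

// Count '\n' bytes in the memory range [data, data + size).
std::size_t count_lines(const char* data, std::size_t size) {
    std::size_t lines = 0;
    const char* p = data;
    const char* end = data + size;
    while (p < end) {
        // memchr scans the remaining bytes for the next newline.
        const void* hit = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
        if (hit == nullptr) break;
        ++lines;
        p = static_cast<const char*>(hit) + 1;  // continue after the newline
    }
    return lines;
}
```

No read calls or intermediate buffers are involved; the operating system pages the file in as the scan touches it.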

4.2 Asynchronous File Reading (C++11 and later)

std::async and std::future let the read run on another thread, so the I/O does not block the main thread:

#include <fstream>
#include <future>
#include <iostream>
#include <stdexcept>
#include <string>
#include <vector>

// Reads a whole file; meant to be run on a worker thread via std::async
std::vector<char> async_read_file(const std::string& path) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        throw std::runtime_error("Failed to open file: " + path);
    }
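    // Get the file size
    file.seekg(0, std::ios::end);
    size_t size = static_cast<size_t>(file.tellg());
    file.seekg(0, std::ios::beg);
    // Read the file contents
    std::vector<char> buffer(size);
    file.read(buffer.data(), static_cast<std::streamsize>(size));
    if (file.gcount() != static_cast<std::streamsize>(size)) {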
        throw std::runtime_error("Incomplete read: " + path);
    }
    return buffer;
}

int main() {
    // Start the read (std::launch::async forces it to run on a separate thread)
    std::future<std::vector<char>> future_data =
        std::async(std::launch::async, async_read_file, "large_file.bin");
    // The main thread can do other work in the meantime
    std::cout << "Waiting for file read completion..." << std::endl;
    // Fetch the result (blocks until the read completes)
    try {
        std::vector<char> data = future_data.get();
        std::cout << "Success: Read " << data.size() << " bytes" << std::endl;
    } catch (const std::exception& e) {
        std::cerr << "Error: " << e.what() << std::endl;
        return 1;
    }
    return 0;
}

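4.3 Thread-Safe Reading of Large Files

When several threads read the same file, coordinate them through a shared file offset so that no two threads read the same block. Each thread opens its own stream, so the stream objects themselves are never shared: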

#include <algorithm>
#include <atomic>
#include <fstream>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <vector>

const size_t BLOCK_SIZE = 1024 * 1024; // 1 MB block size
std::mutex cout_mutex; // Serializes console output
std::atomic<size_t> global_offset(0); // Shared read offset, advanced atomically

void thread_read(const std::string& path, size_t file_size, int thread_id) {
    std::ifstream file(path, std::ios::binary);
    if (!file) {
        std::lock_guard<std::mutex> lock(cout_mutex);
        std::cerr << "Thread " << thread_id << ": Failed to open file" << std::endl;
        return;
    }
    std::vector<char> buffer(BLOCK_SIZE);
    size_t offset;
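    // Claim blocks until the whole file has been handed out
    while ((offset = global_offset.fetch_add(BLOCK_SIZE)) < file_size) {
        // Size of this block (the last one may be smaller than BLOCK_SIZE)
        size_t read_size = std::min(BLOCK_SIZE, file_size - offset);
        // Seek to the block and read it
        file.seekg(static_cast<std::streamoff>(offset));
        file.read(buffer.data(), static_cast<std::streamsize>(read_size));
        // Verify the read
        if (file.gcount() != static_cast<std::streamsize>(read_size)) {
            std::lock_guard<std::mutex> lock(cout_mutex);
            std::cerr << "Thread " << thread_id << ": Failed to read block at " << offset << std::endl;
            file.clear(); // Reset the error state so later blocks can still be read
            continue;
        }
        // Process the data (this example only reports what was read)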
        std::lock_guard<std::mutex> lock(cout_mutex);
        std::cout << "Thread " << thread_id << ": Read " << read_size
                  << " bytes at offset " << offset << std::endl;
    }
}

int main() {
    const std::string file_path = "large_file.bin";
    std::ifstream file(file_path, std::ios::binary);
    if (!file) {
        std::cerr << "Failed to open file" << std::endl;
        return 1;
    }
    file.seekg(0, std::ios::end);
    size_t file_size = file.tellg();
    file.seekg(0, std::ios::beg);

    const int num_threads = 4;
    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i) {
        threads.emplace_back(thread_read, file_path, file_size, i);
    }

    for (auto& th : threads) {
        th.join();
    }

    return 0;
}
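
The atomic counter hands out blocks dynamically. An alternative is to compute the block list up front and assign it statically, for example thread i takes blocks i, i + n, i + 2n, and so on. The sketch below shows only the partitioning step; Block and split_into_blocks are hypothetical names introduced for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One contiguous byte range of the file.
struct Block {
    std::size_t offset;
    std::size_t size;
};

// Split [0, file_size) into consecutive blocks of at most block_size bytes.
// Returns an empty list if block_size is 0.
std::vector<Block> split_into_blocks(std::size_t file_size, std::size_t block_size) {
    std::vector<Block> blocks;
    if (block_size == 0) return blocks;
    for (std::size_t offset = 0; offset < file_size; offset += block_size) {
        // The last block may be shorter than block_size.
        blocks.push_back({offset, std::min(block_size, file_size - offset)});
    }
    return blocks;
}
```

Static assignment avoids the shared counter entirely, while the dynamic scheme in the listing above balances load better when blocks take uneven time to process.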