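Chapter 2: Environment Setup and Tool Preparation

A craftsman who wants to do good work must first sharpen the tools. This chapter walks you through setting up a complete RAG development environment so that you are ready for the hands-on work in later chapters.

📚 Learning Objectives

After completing this chapter, you will be able to:

Set up a complete Python development environment (virtual environment + Jupyter)
Install and configure LlamaIndex and its related dependencies
Configure an OpenAI API key
Install and use a vector database (Chroma)
Prepare a sample dataset and preprocess it

Estimated study time: 1 hour
Difficulty: ⭐☆☆☆☆

Prerequisites

Before starting this chapter, you should have:

Python basics: familiarity with basic Python syntax
Command-line basics: the ability to run commands in a terminal
A text editor: VSCode, PyCharm, or similar

Environment requirements:
- Python >= 3.9
- At least 4 GB of free memory
- 5 GB of free disk space
- A stable network connection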
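
The requirements above can be partly checked from Python itself. The following is a minimal sketch that uses only the standard library; the function name `check_requirements` and the two constants are illustrative and not part of the tutorial's scripts. It checks the interpreter version and free disk space only, since the standard library has no portable way to query free memory.

```python
import shutil
import sys

MIN_PYTHON = (3, 9)      # minimum version from the requirements list
MIN_FREE_DISK_GB = 5     # free disk space from the requirements list


def check_requirements(path: str = ".") -> dict:
    """Return a small report on the Python version and free disk space."""
    free_gb = shutil.disk_usage(path).free / 1024**3
    return {
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "free_disk_gb": round(free_gb, 1),
        "disk_ok": free_gb >= MIN_FREE_DISK_GB,
    }


if __name__ == "__main__":
    for name, value in check_requirements().items():
        print(f"{name}: {value}")
```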

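2.1 Development Environment Configuration

Python Version Management

Why manage Python versions?

Different projects may need different Python versions:

Project A: needs Python 3.9
Project B: needs Python 3.11
Project C (this tutorial): uses Python 3.10 in its examples (3.9 is the minimum)

Problem: what if the system has only one Python version installed?

Solution: use a version management tool.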

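Option 1: pyenv (recommended for macOS/Linux)

Install pyenv:

# macOS (using Homebrew)
brew install pyenv

# Linux
curl https://pyenv.run | bash

Configure environment variables:

# Add to ~/.bashrc or ~/.zshrc
export PYENV_ROOT="$HOME/.pyenv"
export PATH="$PYENV_ROOT/bin:$PATH"
eval "$(pyenv init -)"

# Reload the configuration
source ~/.bashrc  # or: source ~/.zshrc

Install Python 3.10:

# List the installable 3.10.x versions (the dots are escaped so they match literally)
pyenv install --list | grep -E "^ *3\.10\."

# Install Python 3.10.14
pyenv install 3.10.14

# Set the global default version
pyenv global 3.10.14

# Verify the installation
python --version
# Output: Python 3.10.14

Set a specific version for a project:

# In the project directory
cd /path/to/your/project

# Use Python 3.10.14 for this project
pyenv local 3.10.14

# pyenv local writes a .python-version file automatically
cat .python-version
# Output: 3.10.14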

Option 2: conda (recommended for Windows and scientific computing)

Install Miniconda:

# Download the installer
# macOS/Linux: https://docs.conda.io/en/latest/miniconda.html
# Windows: download the .exe installer

# Install on macOS/Linux
bash Miniconda3-latest-MacOSX-arm64.sh  # macOS on Apple Silicon
# or
bash Miniconda3-latest-Linux-x86_64.sh  # Linux

# Initialize conda
conda init bash  # or: conda init zsh
source ~/.bashrc  # reload (use ~/.zshrc for zsh)

Create a Python environment:

# Create a Python 3.10 environment
conda create -n rag_tutorial python=3.10

# Activate the environment
conda activate rag_tutorial

# Verify the version
python --version

Leave the environment:

conda deactivate

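Creating a Virtual Environment

Why use a virtual environment?

It isolates each project's dependencies and avoids version conflicts:

Without virtual environments:
  System Python → globally installed packages → version conflict
  ↑
  Project A needs pandas==1.5.0
  Project B needs pandas==2.0.0
  → Conflict!

With virtual environments:
  ProjectA/venv → pandas==1.5.0
  ProjectB/venv → pandas==2.0.0
  → Each is independent and does not affect the other

Using venv (built into Python)

# 1. Create the project directory
mkdir rag_tutorial
cd rag_tutorial

# 2. Create the virtual environment
python -m venv venv

# 3. Activate the virtual environment
# macOS/Linux:
source venv/bin/activate

# Windows:
# venv\Scripts\activate

# 4. Verify: the prompt is now prefixed with (venv)
# (venv) $   on macOS/Linux
# (venv) >   on Windows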
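
Besides looking at the prompt prefix, activation can also be confirmed from inside Python. This is a small sketch; the helper name `in_virtual_environment` is illustrative. It relies on the fact that inside a venv `sys.prefix` differs from `sys.base_prefix`. Note that this detects venv-style environments only; a conda environment is a separate installation and will typically report `False` here.

```python
import sys


def in_virtual_environment() -> bool:
    """True when the interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtual_environment())
```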

Using conda

# Create and activate the environment (same commands as shown earlier)
conda create -n rag_tutorial python=3.10
conda activate rag_tutorial

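Jupyter Environment Setup

Jupyter Notebook is a convenient tool for learning RAG because it lets you run code interactively.

Install Jupyter

# Make sure the virtual environment is activated
source venv/bin/activate  # or: conda activate rag_tutorial

# Install Jupyter
pip install jupyter jupyterlab

# Verify the installation
jupyter --version

Start Jupyter

# Start JupyterLab (recommended; the interface is more modern)
jupyter lab

# Or start the classic Notebook
jupyter notebook

# A browser window opens automatically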

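Jupyter Tips

1. Keyboard shortcuts:

Shortcut	Action
`Shift + Enter`	Run the current cell and move to the next one
`Ctrl + Enter`	Run the current cell and stay on it
`A`	Insert a cell above (command mode)
`B`	Insert a cell below (command mode)
`DD`	Delete the current cell (command mode)
`M`	Switch the cell to Markdown (command mode)
`Y`	Switch the cell to code (command mode)

2. Magic commands:

# List all magic commands
%lsmagic

# Time a statement
%timeit sum(range(1000))

# List the variables defined in the session
%who

# Run shell commands
!ls -la
!pip list

3. Autocompletion:

# Press Tab to autocomplete
import pandas as pd
pd.read_<Tab>  # shows every function whose name starts with read_

# Show a function's documentation
pd.read_csv?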
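
Magic commands such as `%timeit` only work inside IPython or Jupyter. When the same measurement is needed in a plain `.py` script, the standard-library `timeit` module can be used instead. The sketch below is one way to do that; the helper name `time_statement` and its default arguments are illustrative choices.

```python
import timeit


def time_statement(stmt: str, number: int = 1000, repeat: int = 5) -> float:
    """Return the best observed time per execution of `stmt`, in seconds."""
    # timeit.repeat runs `stmt` `number` times per round and returns one
    # total duration per round; the minimum is the least noisy estimate.
    totals = timeit.repeat(stmt, number=number, repeat=repeat)
    return min(totals) / number


if __name__ == "__main__":
    per_call = time_statement("sum(range(1000))")
    print(f"sum(range(1000)): {per_call * 1e6:.2f} microseconds per call")
```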

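VSCode Configuration (Optional)

If you prefer VSCode, you can set it up with the Python extensions.

Install the extensions

Install VSCode: https://code.visualstudio.com/
Install the following extensions:
- Python (Microsoft)
- Jupyter (Microsoft)
- A Python code formatter extension (for example, Black Formatter)

Configuration steps

# 1. Open the project directory in VSCode
code /path/to/rag_tutorial

# 2. Select the Python interpreter
# Ctrl/Cmd + Shift + P → "Python: Select Interpreter"
# Choose your virtual environment

# 3. Create a new notebook
# File → New File → save it with the .ipynb extension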

2.2 Installing the Core Libraries

Create an automated installation script

Create a setup_env.py script that installs all dependencies automatically.

# File: setup_env.py
"""
Automatic environment setup script for the RAG tutorial.
Installs all required libraries and dependencies.
"""

import subprocess
import sys
import os

def run_command(command, description):
    """
    Run a shell command and report progress.

    Args:
        command: the command to execute
        description: a short description of the command

    Returns:
        True if the command succeeded, False otherwise
    """
    print(f"\n{'='*60}")
    print(f"Running: {description}")
    print(f"{'='*60}")
    print(f"Command: {command}\n")

    try:
        # Output is not captured, so the command's progress streams to the terminal
        subprocess.run(command, shell=True, check=True)
        print(f"✓ {description} - succeeded")
        return True
    except subprocess.CalledProcessError as e:
        print(f"✗ {description} - failed")
        print(f"Error: {e}")
        return False

def check_python_version():
    """Check the Python version."""
    version = sys.version_info
    print(f"Current Python version: {version.major}.{version.minor}.{version.micro}")

    if version < (3, 9):
        print("❌ Python version is too old; 3.9 or later is required")
        return False

    print("✓ Python version meets the requirement")
    return True

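def install_core_packages():
    """Install the core packages. Returns True only if every install succeeded."""
    packages = [
        # LlamaIndex core
        "llama-index-core",
        "llama-index-llms-openai",
        "llama-index-embeddings-openai",
        "llama-index-vector-stores-chroma",

        # Vector database
        "chromadb",

        # Document processing
        "pypdf",
        "docx2txt",
        "python-dotenv",

        # Data processing
        "pandas",
        "numpy",

        # Visualization
        "matplotlib",
        "seaborn",

        # Utilities
        "tqdm",  # progress bars
        "rich",  # nicer terminal output
    ]

    print("\nInstalling core packages...")
    failed = []
    for package in packages:
        # Call pip through the running interpreter so that packages are
        # installed into the active environment, not into another Python
        command = f'"{sys.executable}" -m pip install -U {package}'
        if not run_command(command, f"Install {package}"):
            failed.append(package)

    if failed:
        print(f"\n✗ Failed to install: {', '.join(failed)}")
        return False
    return True

def install_optional_packages():
    """Install the optional packages if the user asks for them."""
    optional = [
        "llama-index-readers-web",  # web scraping
        "llama-index-readers-file", # file readers
        "sentence-transformers",    # open-source embedding models
        "transformers",             # HuggingFace
    ]

    print("\nInstall optional packages? (web scraping, open-source models, etc.)")
    choice = input("Type y to install, anything else to skip: ").strip().lower()

    if choice == 'y':
        for package in optional:
            command = f'"{sys.executable}" -m pip install -U {package}'
            run_command(command, f"Install {package}")

    return True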

def create_env_template():
    """Create a template file for environment variables."""
    env_content = """# OpenAI API configuration
OPENAI_API_KEY=your_openai_api_key_here
# Optional: change this when using a proxy or an OpenAI-compatible API
OPENAI_API_BASE=https://api.openai.com/v1

# Other settings
CHROMA_PERSIST_DIRECTORY=./chroma_db
"""

    with open(".env.template", "w", encoding="utf-8") as f:
        f.write(env_content)

    print("\n✓ Created .env.template")
    print("  Copy it to .env and fill in your API key")
    return True

def create_project_structure():
    """Create the project directory structure."""
    directories = [
        "data/raw",           # raw data
        "data/processed",     # processed data
        "notebooks",          # Jupyter notebooks
        "scripts",            # Python scripts
        "outputs",            # output results
        "chroma_db",          # vector database
    ]

    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        print(f"✓ Created directory: {directory}")

    return True

def create_requirements_txt():
    """Generate requirements.txt."""
    requirements = """# RAG tutorial dependencies

# Core
llama-index-core>=0.10.0
llama-index-llms-openai>=0.1.0
llama-index-embeddings-openai>=0.1.0
llama-index-vector-stores-chroma>=0.1.0

# Vector database
chromadb>=0.4.0

# Document processing
pypdf>=3.0.0
docx2txt>=0.8
python-dotenv>=1.0.0

# Data processing
pandas>=2.0.0
numpy>=1.24.0

# Visualization
matplotlib>=3.7.0
seaborn>=0.12.0

# Utilities
tqdm>=4.65.0
rich>=13.0.0

# Optional
llama-index-readers-web>=0.1.0
llama-index-readers-file>=0.1.0
sentence-transformers>=2.2.0
"""

    with open("requirements.txt", "w", encoding="utf-8") as f:
        f.write(requirements)

    print("\n✓ Created requirements.txt")
    return True

def main():
    """Main entry point."""
    print("""
    ╔═══════════════════════════════════════════════════════╗
    ║   RAG Tutorial - Automatic Environment Setup Tool     ║
    ║   Installs all required libraries and dependencies    ║
    ╚═══════════════════════════════════════════════════════╝
    """)

    # 1. Check the Python version
    if not check_python_version():
        return False

    # 2. Install the core packages
    if not install_core_packages():
        return False

    # 3. Install the optional packages
    install_optional_packages()

    # 4. Create the project structure
    print("\nCreating the project directory structure...")
    create_project_structure()

    # 5. Create the configuration files
    create_env_template()
    create_requirements_txt()

    print(f"""
    {'='*60}
    ✓ Environment setup complete!
    {'='*60}

    Next steps:
    1. Configure your OpenAI API key:
       cp .env.template .env
       Edit .env and fill in your API key

    2. Start Jupyter:
       jupyter lab

    3. Start learning:
       Open the notebooks/ directory to find the tutorial notebooks
    """)

    return True

if __name__ == "__main__":
    success = main()
    sys.exit(0 if success else 1)

Run the installation script

# Make sure the virtual environment is activated
source venv/bin/activate

# Save the code above as setup_env.py, then run it
python setup_env.py

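Manual Installation (Step by Step)

If you want to control the installation yourself, follow these steps:

Step 1: Install LlamaIndex

# Core package
pip install llama-index-core

# OpenAI integrations
pip install llama-index-llms-openai
pip install llama-index-embeddings-openai

# Vector store integration
pip install llama-index-vector-stores-chroma

Step 2: Install the vector database

# Chroma (lightweight, good for learning)
pip install chromadb

# If you need a different database:
# pip install qdrant-client  # Qdrant
# pip install pymilvus       # Milvus

Step 3: Install the document processing tools

# PDF processing
pip install pypdf

# Word processing
pip install docx2txt

# Environment variable management
pip install python-dotenv

Step 4: Install the data processing and visualization tools

# Data processing
pip install pandas numpy

# Visualization
pip install matplotlib seaborn

# Progress bars and terminal formatting
pip install tqdm rich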

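Verify the installation

Create a test script named test_installation.py:

# File: test_installation.py
"""
Check that all libraries are installed correctly.
"""

import importlib
import sys

def test_imports():
    """Test that the key modules can be imported."""
    print("Testing imports...\n")

    # (display name, module to import)
    tests = [
        ("LlamaIndex core", "llama_index.core"),
        ("Chroma", "chromadb"),
        ("Pandas", "pandas"),
        ("NumPy", "numpy"),
        ("Environment variables", "dotenv"),
    ]

    passed = 0
    failed = 0

    for name, module_name in tests:
        try:
            importlib.import_module(module_name)
            print(f"✓ {name}")
            passed += 1
        except ImportError as e:
            print(f"✗ {name} - {e}")
            failed += 1

    print(f"\nTotal: {passed} passed, {failed} failed")
    return failed == 0

def test_chroma():
    """Test basic Chroma functionality."""
    print("\nTesting Chroma...\n")

    try:
        import chromadb

        # Create a temporary in-memory client
        client = chromadb.EphemeralClient()

        # Create a collection
        collection = client.create_collection("test")

        # Add data. The document is embedded with Chroma's default embedding
        # function, which may download a small model the first time it runs.
        collection.add(
            documents=["test document"],
            ids=["test1"]
        )

        # Query
        results = collection.query(
            query_texts=["test"],
            n_results=1
        )

        # The only stored document should come back
        if results["ids"][0] != ["test1"]:
            raise RuntimeError(f"unexpected query result: {results['ids']}")

        print("✓ Chroma works")
        return True

    except Exception as e:
        print(f"✗ Chroma test failed: {e}")
        return False

def main():
    """Run all tests."""
    print("="*60)
    print("RAG Tutorial - Installation Check")
    print("="*60 + "\n")

    # Test imports
    imports_ok = test_imports()

    # Test Chroma
    chroma_ok = test_chroma()

    # Summary
    if imports_ok and chroma_ok:
        print("\n" + "="*60)
        print("✓ All tests passed! The environment is configured correctly.")
        print("="*60)
        return True
    else:
        print("\n" + "="*60)
        print("✗ Some tests failed; please check the installation.")
        print("="*60)
        return False

if __name__ == "__main__":
    success = main()
    # Exit with a non-zero status on failure so that other scripts can detect it
    sys.exit(0 if success else 1)

Run the test:

python test_installation.py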
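
When an import check fails, or a later chapter behaves unexpectedly, it often helps to see exactly which versions are installed. The following is a minimal sketch using the standard-library `importlib.metadata`; the helper name `installed_version` and the `PACKAGES` list are illustrative, with names taken from the pip commands in this chapter.

```python
from importlib import metadata
from typing import Optional

# Distribution names as passed to pip earlier in this chapter.
PACKAGES = ["llama-index-core", "chromadb", "pandas", "numpy", "python-dotenv"]


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in PACKAGES:
        version = installed_version(name)
        print(f"{name:<20} {version or 'NOT INSTALLED'}")
```

Note that distribution names (what pip installs) can differ from import names: `python-dotenv` is imported as `dotenv`, and `llama-index-core` as `llama_index.core`.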

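2.3 Data Preparation

Sample Datasets

Three kinds of sample data are available for this tutorial:

Dataset 1: technical documentation (recommended)

Source: excerpts from the official Python documentation

Contents:
- Python basics tutorial
- Introductions to commonly used libraries
- Code examples

Characteristics:
- Highly structured
- Well suited to learning basic RAG
- Rich in code examples

Download:

# Create the data directory
mkdir -p data/raw

# Download the sample data with curl (Python tutorial)
# -L follows redirects; --fail makes curl exit with an error instead of saving an HTML error page
# If the download fails, the URL may have moved; check the download page on docs.python.org
curl -L --fail -o data/raw/python_tutorial.pdf \
  https://docs.python.org/3/_downloads/python-3.10.0-docs-pdf-letter.pdf

# Or with wget
wget -O data/raw/python_tutorial.pdf \
  https://docs.python.org/3/_downloads/python-3.10.0-docs-pdf-letter.pdf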

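Dataset 2: sample text (for quick tests)

Create a few simple text files for quick testing:

# File: create_sample_data.py
"""
Create sample text data
"""

import os

def create_sample_documents():
    """Create the sample documents."""

    # Make sure the directory exists
    os.makedirs("data/raw", exist_ok=True)

    # Document 1: introduction to Python
    doc1 = """
Python is a high-level programming language

Python is an interpreted, high-level, general-purpose programming language. Its design philosophy
emphasizes code readability and makes heavy use of indentation. Python is dynamically typed and garbage-collected.

Python supports multiple programming paradigms, including structured, object-oriented, and functional programming.
"""

    # Document 2: introduction to machine learning
    doc2 = """
Machine Learning Basics

Machine learning is a branch of artificial intelligence. It enables computers to learn from data
rather than being explicitly programmed.

Common types of machine learning include:
1. Supervised learning: trains on labeled data
2. Unsupervised learning: discovers patterns in data
3. Reinforcement learning: learns through rewards and penalties
"""

    # Document 3: introduction to RAG
    doc3 = """
RAG Technology Overview

RAG (Retrieval-Augmented Generation) is an AI technique that combines
information retrieval with text generation.

The core steps of RAG:
1. Retrieve: find relevant documents in the knowledge base
2. Augment: add the retrieved documents to the prompt
3. Generate: the LLM produces an answer based on the augmented prompt
"""

    # Save the documents
    documents = {
        "python_intro.txt": doc1,
        "ml_intro.txt": doc2,
        "rag_intro.txt": doc3
    }

    for filename, content in documents.items():
        filepath = os.path.join("data/raw", filename)
        with open(filepath, "w", encoding="utf-8") as f:
            f.write(content.strip())
        print(f"✓ Created document: {filepath}")

    print("\n✓ Sample documents created!")
    return True

if __name__ == "__main__":
    create_sample_documents()

Run it:

python create_sample_data.py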

Dataset 3: your own documents (optional)

How to prepare your own data:

# File: prepare_custom_data.py
"""
Prepare a custom dataset
"""

import os
import shutil

def organize_documents(source_dir, target_dir="data/raw"):
    """
    Copy supported documents into the project directory.

    Args:
        source_dir: the source document directory
        target_dir: the target directory
    """
    # Make sure the target directory exists
    os.makedirs(target_dir, exist_ok=True)

    # Supported file formats
    supported_formats = {'.pdf', '.txt', '.md', '.docx', '.html'}

    # Counters
    total_files = 0
    copied_files = 0

    # Walk the source directory
    for root, _dirs, files in os.walk(source_dir):
        for file in files:
            total_files += 1

            # Check the file format
            file_ext = os.path.splitext(file)[1].lower()

            if file_ext in supported_formats:
                src_path = os.path.join(root, file)
                dst_path = os.path.join(target_dir, file)

                # Avoid overwriting: add a numeric suffix until the name is free
                base, ext = os.path.splitext(file)
                counter = 1
                while os.path.exists(dst_path):
                    dst_path = os.path.join(target_dir, f"{base}_{counter}{ext}")
                    counter += 1

                # Copy the file (copy2 also preserves timestamps)
                shutil.copy2(src_path, dst_path)
                copied_files += 1
                print(f"✓ Copied: {file}")

    print(f"\nTotal: {copied_files}/{total_files} files")
    return copied_files > 0

if __name__ == "__main__":
    source = input("Enter the document directory path: ").strip()

    if os.path.isdir(source):
        organize_documents(source)
    else:
        print("Error: directory does not exist")
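
Before copying a large folder, it can be useful to preview which file types it contains and how many of them would be picked up. The sketch below does that with `pathlib` and `collections.Counter`. The function names are illustrative, and `SUPPORTED_FORMATS` simply mirrors the set used in the script above.

```python
from collections import Counter
from pathlib import Path

# Same set of extensions as in the copy script above.
SUPPORTED_FORMATS = {".pdf", ".txt", ".md", ".docx", ".html"}


def summarize_extensions(source_dir: str) -> Counter:
    """Count the files under source_dir (recursively) by lowercase extension."""
    return Counter(
        path.suffix.lower() or "(none)"
        for path in Path(source_dir).rglob("*")
        if path.is_file()
    )


def split_supported(counts: Counter) -> tuple:
    """Return (supported, unsupported) file totals for an extension count."""
    supported = sum(n for ext, n in counts.items() if ext in SUPPORTED_FORMATS)
    return supported, sum(counts.values()) - supported


if __name__ == "__main__":
    folder = "data/raw"  # assumed location; change to the folder you want to inspect
    if Path(folder).is_dir():
        counts = summarize_extensions(folder)
        for ext, n in counts.most_common():
            print(f"{ext:<8} {n}")
        print("supported / unsupported:", split_supported(counts))
    else:
        print(f"{folder} does not exist yet")
```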

Data Preprocessing Basics

Cleaning the data

# File: preprocess_data.py
"""
Data preprocessing script
"""

import os
import re

def clean_text(text: str) -> str:
    """
    Clean a piece of text.

    Args:
        text: the raw text

    Returns:
        The cleaned text
    """
    # Collapse runs of spaces and tabs, but keep line breaks so that the
    # short-line filter below still sees individual lines
    text = re.sub(r'[^\S\n]+', ' ', text)

    # Remove special characters (keep Chinese, English, digits, and punctuation;
    # the hyphen is kept so that compound words stay intact)
    text = re.sub(r"[^\w\s\u4e00-\u9fff\u3000-\u303f\uff00-\uffef\u2018\u2019\u201c\u201d.,!?;:()\"'\[\]-]", '', text)

    # Remove very short lines (10 characters or fewer). This threshold is aggressive
    # for Chinese text, where a short line can be a full sentence; lower it if needed.
    lines = [line.strip() for line in text.split('\n')]
    lines = [line for line in lines if len(line) > 10]

    return '\n'.join(lines)

def read_text_file(filepath: str) -> str:
    """
    Read a text file, trying several encodings.

    Args:
        filepath: path to the file

    Returns:
        The file contents
    """
    encodings = ['utf-8', 'gbk', 'gb2312']

    for encoding in encodings:
        try:
            with open(filepath, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue

    raise ValueError(f"Could not decode file: {filepath}")

def process_text_files(input_dir: str, output_dir: str):
    """
    Process all .txt files in a directory.

    Args:
        input_dir: input directory
        output_dir: output directory
    """
    os.makedirs(output_dir, exist_ok=True)

    # Counters
    processed = 0
    total_chars = 0

    # Iterate over the files
    for filename in os.listdir(input_dir):
        if not filename.endswith('.txt'):
            continue

        try:
            # Read
            filepath = os.path.join(input_dir, filename)
            text = read_text_file(filepath)

            # Clean
            cleaned = clean_text(text)

            # Save
            output_path = os.path.join(output_dir, filename)
            with open(output_path, 'w', encoding='utf-8') as f:
                f.write(cleaned)

            # Update the counters
            processed += 1
            total_chars += len(cleaned)

            print(f"✓ Processed: {filename} ({len(cleaned)} characters)")

        except Exception as e:
            print(f"✗ Failed: {filename} - {e}")

    print(f"\nTotal: {processed} files, {total_chars} characters")
    return processed > 0

def main():
    """Main entry point."""
    input_dir = "data/raw"
    output_dir = "data/processed"

    print("Processing text files...")
    print(f"Input directory: {input_dir}")
    print(f"Output directory: {output_dir}\n")

    success = process_text_files(input_dir, output_dir)

    if success:
        print("\n✓ Processing complete!")
    else:
        print("\n✗ Processing failed")

if __name__ == "__main__":
    main()
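
A line-length filter only works if line breaks survive whitespace normalization. The following small sketch, with illustrative function names, compares two ways of collapsing whitespace: replacing every whitespace run (newlines included) with a space, and collapsing only spaces and tabs within each line.

```python
import re


def collapse_all_whitespace(text: str) -> str:
    """Collapse every whitespace run, newlines included, into one space."""
    return re.sub(r"\s+", " ", text).strip()


def collapse_within_lines(text: str) -> str:
    """Collapse spaces and tabs inside each line but keep the line breaks."""
    # [^\S\n] means "whitespace that is not a newline"
    lines = (re.sub(r"[^\S\n]+", " ", line).strip() for line in text.split("\n"))
    return "\n".join(line for line in lines if line)


sample = "Title\n\nFirst   paragraph\twith  gaps.\nSecond line."

print(collapse_all_whitespace(sample).split("\n"))
# ['Title First paragraph with gaps. Second line.']

print(collapse_within_lines(sample).split("\n"))
# ['Title', 'First paragraph with gaps.', 'Second line.']
```

With the first approach the whole document becomes a single line, so any later per-line step (such as dropping short lines) no longer has lines to work with.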

Converting Data Formats

# Convert TXT to Markdown (easier to display in Jupyter)
import os

def txt_to_markdown(input_dir: str, output_dir: str):
    """
    Convert TXT files to Markdown.

    Args:
        input_dir: input directory
        output_dir: output directory
    """
    os.makedirs(output_dir, exist_ok=True)

    for filename in os.listdir(input_dir):
        if not filename.endswith('.txt'):
            continue

        # Read
        input_path = os.path.join(input_dir, filename)
        with open(input_path, 'r', encoding='utf-8') as f:
            content = f.read()

        # Convert to Markdown, using the file name (without extension) as the title
        title = os.path.splitext(filename)[0]
        markdown = f"# {title}\n\n{content}"

        # Save (only the final extension is swapped for .md)
        output_filename = title + '.md'
        output_path = os.path.join(output_dir, output_filename)

        with open(output_path, 'w', encoding='utf-8') as f:
            f.write(markdown)

        print(f"✓ Converted: {filename} → {output_filename}")

    return True

Complete Data Preparation Pipeline

# File: prepare_data_pipeline.py
"""
Complete data preparation pipeline
"""

import json
import os

# Both modules are the scripts created earlier in this chapter;
# they must be in the same directory as this file
from create_sample_data import create_sample_documents
from preprocess_data import process_text_files

def full_data_pipeline():
    """Run the complete data preparation pipeline."""

    print("="*60)
    print("RAG Tutorial - Data Preparation Pipeline")
    print("="*60 + "\n")

    # 1. Create the directory structure
    print("Step 1: Create the directory structure")
    directories = [
        "data/raw",
        "data/processed",
        "data/eval"  # evaluation data
    ]

    for directory in directories:
        os.makedirs(directory, exist_ok=True)
        print(f"  ✓ {directory}")

    # 2. Create the sample data
    print("\nStep 2: Create the sample data")
    create_sample_documents()

    # 3. Preprocess the data
    print("\nStep 3: Preprocess the data")
    process_text_files("data/raw", "data/processed")

    # 4. Create the evaluation dataset
    print("\nStep 4: Create the evaluation dataset")
    create_evaluation_dataset()

    print("\n" + "="*60)
    print("✓ Data preparation complete!")
    print("="*60)

    # Show data statistics
    show_data_statistics()

    return True

def create_evaluation_dataset():
    """Create the evaluation dataset (question-answer pairs)."""

    eval_data = [
        {
            "question": "What is Python?",
            "answer": "Python is a high-level programming language",
            "source": "python_intro.txt"
        },
        {
            "question": "What are the types of machine learning?",
            "answer": "Supervised learning, unsupervised learning, and reinforcement learning",
            "source": "ml_intro.txt"
        },
        {
            "question": "What are the core steps of RAG?",
            "answer": "Retrieve, augment, generate",
            "source": "rag_intro.txt"
        }
    ]

    with open("data/eval/eval_qa.json", "w", encoding="utf-8") as f:
        json.dump(eval_data, f, ensure_ascii=False, indent=2)

    print("  ✓ Evaluation dataset: data/eval/eval_qa.json")
    return True

def show_data_statistics():
    """Show data statistics."""
    print("\nData statistics:")

    # Count the files
    for directory in ["data/raw", "data/processed"]:
        if os.path.exists(directory):
            files = os.listdir(directory)
            print(f"  {directory}: {len(files)} files")

    # Show previews of up to three processed documents
    print("\nSample documents:")
    if os.path.exists("data/processed"):
        for filename in os.listdir("data/processed")[:3]:
            filepath = os.path.join("data/processed", filename)
            with open(filepath, 'r', encoding='utf-8') as f:
                content = f.read()
                preview = content[:100] + "..." if len(content) > 100 else content
                print(f"\n  {filename}:")
                print(f"    {preview}")

if __name__ == "__main__":
    full_data_pipeline()
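
Once the pipeline has written data/eval/eval_qa.json, a quick sanity check can catch malformed records before they are used for evaluation. The sketch below is one possible check, not part of the tutorial's scripts: the function names are illustrative, and the required keys are the three fields used in the evaluation records above.

```python
import json
from pathlib import Path

# The three fields each evaluation record carries in this chapter.
REQUIRED_KEYS = {"question", "answer", "source"}


def load_eval_dataset(path: str) -> list:
    """Load the evaluation file and verify that every record has the required keys."""
    items = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(items, list):
        raise ValueError("expected a JSON list of question/answer records")
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"record {index} is missing keys: {sorted(missing)}")
    return items


def missing_sources(items: list, raw_dir: str = "data/raw") -> list:
    """Return the source file names that do not exist in raw_dir."""
    return sorted(
        {item["source"] for item in items if not (Path(raw_dir) / item["source"]).is_file()}
    )


if __name__ == "__main__":
    eval_path = "data/eval/eval_qa.json"
    if Path(eval_path).is_file():
        records = load_eval_dataset(eval_path)
        print(f"{len(records)} records loaded")
        print("missing source files:", missing_sources(records))
    else:
        print(f"{eval_path} not found; run the data pipeline first")
```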

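Summary

Key Points

Development environment configuration
- Use pyenv or conda to manage Python versions
- Create a virtual environment to isolate project dependencies
- Set up Jupyter Notebook for interactive development
Core library installation
- LlamaIndex (RAG framework)
- Chroma (vector database)
- Document processing tools
- Use the automated script to simplify installation
Data preparation
- Download or create sample data
- Clean the data during preprocessing
- Convert data formats
- Create an evaluation dataset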

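Learning Checklist

Next Steps

Next chapter: Chapter 3: Basic RAG Implementation
- Load and process documents
- Implement text chunking
- Build vector retrieval
- Generate RAG answers
Related chapters:
- Chapter 4: RAG Evaluation Basics

Further Resources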

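Frequently Asked Questions (FAQ)

Q1: What should I do if installation fails?

Problem: pip install reports an error

Solutions:

Network problems:

# Use a mirror in mainland China
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple package_name

Permission problems:

# Make sure you are inside the virtual environment
source venv/bin/activate  # macOS/Linux
# or
venv\Scripts\activate  # Windows

Dependency conflicts:

# Upgrade pip
pip install --upgrade pip

# Clear the cache
pip cache purge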

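Q2: How do I get an OpenAI API key?

Steps:

Go to https://platform.openai.com/
Sign up or log in
Open the API keys page
Create a new API key
Copy the key (it is shown only once!)

Configuration:

# Create the .env file
cp .env.template .env

# Edit .env and set your key
OPENAI_API_KEY=sk-your-actual-api-key-here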
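
After the key is in .env, it still has to be loaded into the Python process. The following is a minimal sketch of one way to do that, assuming python-dotenv (installed earlier in this chapter) is available; the helper names `load_api_key` and `mask` are illustrative. It falls back to variables already set in the environment, rejects the template placeholder, and prints only a masked form of the key.

```python
import os


def load_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, loading .env first if possible."""
    try:
        # load_dotenv() reads a .env file and, by default, does not override
        # variables that are already set in the environment.
        from dotenv import load_dotenv
        load_dotenv()
    except ImportError:
        pass  # python-dotenv missing: rely on variables already in the environment

    key = os.getenv(env_var, "").strip()
    if not key or key == "your_openai_api_key_here":
        raise RuntimeError(f"{env_var} is not set; edit your .env file first")
    return key


def mask(key: str) -> str:
    """Hide most of the key so that it can be printed safely."""
    return f"{key[:3]}...{key[-4:]}" if len(key) > 8 else "***"


if __name__ == "__main__":
    try:
        print("API key loaded:", mask(load_api_key()))
    except RuntimeError as exc:
        print("Configuration problem:", exc)
```

Printing only the masked form keeps the full key out of notebook outputs and logs, which matters if notebooks are shared or committed.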

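Q3: Jupyter will not start?

Possible causes:

The port is already in use:

# Use a different port
jupyter lab --port 8889

The browser does not open:

# Start without opening a browser automatically
jupyter lab --no-browser

# Then open the URL printed in the terminal manually

Extension conflicts:

# Reinstall Jupyter (-y skips the confirmation prompt)
pip uninstall -y jupyter jupyterlab
pip install jupyter jupyterlab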
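
For the "port already in use" case, a few lines of Python can tell whether a port is free before choosing a value for `--port`. This is a rough sketch with illustrative helper names; it tests a port by trying to bind a TCP socket on the local interface, and 8888 is used as the starting point because it is Jupyter's usual default port.

```python
import socket
from typing import Optional


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(start: int = 8888, attempts: int = 20,
                    host: str = "127.0.0.1") -> Optional[int]:
    """Return the first free port in [start, start + attempts), or None."""
    for port in range(start, start + attempts):
        if port_is_free(port, host):
            return port
    return None


if __name__ == "__main__":
    port = first_free_port()
    if port is None:
        print("No free port found in the tested range")
    else:
        print(f"Try: jupyter lab --port {port}")
```

The result is only a snapshot: another process could take the port between the check and the moment Jupyter starts.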

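Q4: Where should the data live?

Recommended directory layout:

rag_tutorial/
├── data/
│   ├── raw/          # raw data (do not modify)
│   ├── processed/    # processed data
│   └── eval/         # evaluation data
├── notebooks/        # Jupyter notebooks
├── scripts/          # Python scripts
├── outputs/          # output results
└── chroma_db/        # vector database

Notes:
- Do not modify data/raw; keep it as the original backup
- data/processed can be regenerated as often as needed
- Add these to .gitignore: data/, chroma_db/, .env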
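
The .gitignore entries listed above can be added by hand, or with a small helper that only appends what is missing. The following sketch is one way to do it; the function name `ensure_gitignore` is illustrative, and `ENTRIES` repeats the three entries from the note above.

```python
from pathlib import Path

# Entries recommended in the note above.
ENTRIES = ["data/", "chroma_db/", ".env"]


def ensure_gitignore(entries: list, path: str = ".gitignore") -> list:
    """Append the entries that are missing from the file; return the added ones."""
    file = Path(path)
    existing = file.read_text(encoding="utf-8").splitlines() if file.exists() else []
    present = {line.strip() for line in existing}
    added = [entry for entry in entries if entry not in present]
    if added:
        # Rewrite the file with the original lines followed by the new entries
        file.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

Calling `ensure_gitignore(ENTRIES)` from the project root creates the file if needed, and running it a second time changes nothing because every entry is already present.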

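Glossary

Term	Original term (Chinese)	Definition
Virtual Environment	虚拟环境	An isolated Python runtime environment
Jupyter Notebook	Jupyter	An interactive computing environment
Dependencies	依赖	The libraries and packages a project needs
Preprocessing	预处理	Cleaning and transforming data
Data Cleaning	数据清洗	Removing erroneous and noisy data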

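End of chapter

Setting up the environment is the first step in learning, and also the step where things most often go wrong. If you run into problems, check each step carefully and make sure you have not skipped any configuration. A well-prepared environment makes the rest of the tutorial go much more smoothly.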