<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>我爱西红柿</title>
  
  <subtitle>Known knowns, known unknowns, unknown unknowns</subtitle>
  <link href="http://yoursite.com/atom.xml" rel="self"/>
  
  <link href="http://yoursite.com/"/>
  <updated>2025-06-15T13:45:59.000Z</updated>
  <id>http://yoursite.com/</id>
  
  <author>
    <name>我爱西红柿</name>
    
  </author>
  
  <generator uri="https://hexo.io/">Hexo</generator>
  
  <entry>
    <title>Learning vLLM: Deployment and Usage</title>
    <link href="http://yoursite.com/2025/06/15/vllm-1/"/>
    <id>http://yoursite.com/2025/06/15/vllm-1/</id>
    <published>2025-06-15T13:45:59.000Z</published>
    <updated>2025-06-15T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>vLLM is an inference and serving framework for large language models, originally developed by a systems research team at UC Berkeley and focused on inference speed. Its headline feature is the PagedAttention attention algorithm, which raises serving throughput.<br>The core idea is that the KV cache can be allocated dynamically in non-contiguous memory, which improves overall GPU memory utilization and the number of requests served concurrently.</p><p>Architecture reference:<br><a href="https://docs.vllm.ai/en/latest/design/arch_overview.html">https://docs.vllm.ai/en/latest/design/arch_overview.html</a></p><h3 id="安装-Vllm"><a href="#安装-Vllm" class="headerlink" title="安装 Vllm"></a>Installing vLLM</h3><h4 id="环境配置"><a href="#环境配置" class="headerlink" title="环境配置"></a>Environment setup</h4><p>This post deploys Qwen3 0.6B with vLLM, following the Qwen documentation:<br><a href="https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html">https://qwen.readthedocs.io/zh-cn/latest/deployment/vllm.html</a></p><p>Hardware:<br>CPU: AMD 5900X<br>RAM: 128 GB<br>GPU: RTX 3060 12 GB</p><p>Software:<br>vllm: 0.9.0.1<br>Python: 3.12.7<br>Model: Qwen3-0.6B</p><p>For vLLM 0.9.0.1, Python 3.9 – 3.12 is recommended, with CUDA 12.8.</p><p>Install CUDA by following NVIDIA's official guide:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://developer.nvidia.com/cuda-12-8-0-download-archive?target_os=Linux&amp;target_arch=x86_64&amp;Distribution=Ubuntu&amp;target_version=22.04&amp;target_type=deb_local</span><br></pre></td></tr></table></figure><p>Download Miniconda:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh</span><br></pre></td></tr></table></figure><p>Run the installer:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bash Miniconda3-latest-Linux-x86_64.sh</span><br></pre></td></tr></table></figure><ul><li>Press <strong>Enter</strong> when prompted to read the license, then type yes to accept it</li><li>Choose the install location (the default ~&#x2F;miniconda3 is fine)</li><li>At the prompt <strong>Do you wish to initialize Miniconda3?</strong> answer yes</li></ul><p>After installation, configure bash<br>by adding the Miniconda directory to PATH in &#x2F;root&#x2F;.bashrc:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">export PATH=/root/miniconda3/bin:$PATH</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">source /root/.bashrc</span><br></pre></td></tr></table></figure><p>Verify:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">conda --version</span><br><span class="line">conda 25.3.1</span><br></pre></td></tr></table></figure><p>Create the Python environment for the vLLM deployment:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">conda create -n vllm python=3.12.7 </span><br><span class="line">conda activate vllm</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main</span><br><span class="line">conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r</span><br><span class="line">conda config --set show_channel_urls yes</span><br></pre></td></tr></table></figure><p>Install PyTorch<br>At the time of writing, the torch build for CUDA 12.8 had not yet reached a stable release, so the nightly index is used here.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install --pre torch==2.7.0.dev20250310+cu128 --index-url 
https://download.pytorch.org/whl/nightly/cu128</span><br></pre></td></tr></table></figure><h4 id="安装vllm引擎"><a href="#安装vllm引擎" class="headerlink" title="安装vllm引擎"></a>Installing the vLLM engine</h4><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install vllm==0.9.0.1</span><br></pre></td></tr></table></figure><p>Check that PyTorch is installed and can see the GPU:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">python</span><br><span class="line">Python 3.12.7 | packaged by Anaconda, Inc. | (main, Oct  4 2024, 13:27:36) [GCC 11.2.0] on linux</span><br><span class="line">Type &quot;help&quot;, &quot;copyright&quot;, &quot;credits&quot; or &quot;license&quot; for more information.</span><br><span class="line">&gt;&gt;&gt; import torch</span><br><span class="line">&gt;&gt;&gt; print(torch.__version__) </span><br><span class="line">2.7.0.dev20250310+cu128</span><br><span class="line">&gt;&gt;&gt; print(torch.cuda.is_available()) </span><br><span class="line">True</span><br></pre></td></tr></table></figure><p>Verify the vLLM version:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">vllm --version</span><br><span class="line"></span><br><span class="line">INFO 06-02 14:34:37 [__init__.py:243] Automatically detected platform cuda.</span><br><span class="line">INFO 06-02 14:34:39 [__init__.py:31] Available plugins for group 
vllm.general_plugins:</span><br><span class="line">INFO 06-02 14:34:39 [__init__.py:33] - lora_filesystem_resolver -&gt; vllm.plugins.lora_resolvers.filesystem_resolver:register_filesystem_resolver</span><br><span class="line">INFO 06-02 14:34:39 [__init__.py:36] All plugins in this group will be loaded. Set `VLLM_PLUGINS` to control which plugins to load.</span><br><span class="line">0.9.0.1</span><br></pre></td></tr></table></figure><p>Download the model<br>First create the download directory:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir -p /root/models/Qwen/Qwen3 </span><br></pre></td></tr></table></figure><p>Downloading from ModelScope was somewhat faster than from Hugging Face here.<br>Install modelscope first:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install modelscope</span><br></pre></td></tr></table></figure><p>Then download the model:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">modelscope download --model Qwen/Qwen3-0.6B --local_dir /root/models/Qwen/Qwen3</span><br></pre></td></tr></table></figure><p>Load and serve the model with vLLM<br>vLLM can be used in two ways: through the LLM class, or through the OpenAI-compatible API server. This post uses the OpenAI-compatible server.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">CUDA_VISIBLE_DEVICES=0  python3 -m vllm.entrypoints.openai.api_server --model /root/models/Qwen/Qwen3 --served-model-name=Qwen3-0.6B --dtype=bfloat16 --trust-remote-code --max-model-len=1024 --tensor-parallel-size=1 --gpu-memory-utilization=0.85 --enable-reasoning --reasoning-parser deepseek_r1 --port 8000 --api-key 123456</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>What each parameter does:</p><table><thead><tr><th><strong>Parameter</strong></th><th><strong>Purpose</strong></th><th><strong>Value notes</strong></th></tr></thead><tbody><tr><td><code>--model /root/models/Qwen/Qwen3</code></td><td>Model path</td><td>Local directory holding the Qwen model, which must be downloaded beforehand (for example with <code>modelscope download</code>)</td></tr><tr><td><code>--served-model-name=Qwen3-0.6B</code></td><td>Model name exposed by the API</td><td>The identifier clients use in requests (for example <code>model=&quot;Qwen3-0.6B&quot;</code>)</td></tr><tr><td><code>--dtype=bfloat16</code></td><td>Compute precision</td><td><code>bfloat16</code> stores 2 bytes per parameter, which lowers memory use and suits GPUs with limited memory (such as the RTX 3060 used here)</td></tr><tr><td><code>--trust-remote-code</code></td><td>Allow loading custom code</td><td>Needed for models that ship non-standard architecture or tokenizer code</td></tr><tr><td><code>--max-model-len=1024</code></td><td>Maximum context length</td><td>Caps the number of tokens in a single request (the larger the value, the more GPU memory is needed)</td></tr><tr><td><code>--tensor-parallel-size=1</code></td><td>Tensor parallel size</td><td><code>1</code> means a single GPU; with multiple GPUs set it to the GPU count (for example <code>--tensor-parallel-size=4</code>)</td></tr><tr><td><code>--gpu-memory-utilization=0.85</code></td><td>GPU memory fraction</td><td>Reserves 85% of GPU memory for the model weights and KV cache to avoid OOM (the default is 0.9)</td></tr><tr><td><code>--enable-reasoning --reasoning-parser deepseek_r1</code></td><td>Enable reasoning-output parsing</td><td>Uses the DeepSeek R1 parser to return the model's thinking in a separate <code>reasoning_content</code> field (requires vLLM ≥0.7.3)</td></tr><tr><td><code>--port 8000</code></td><td>Listening port</td><td>The API is served at <code>http://&lt;IP&gt;:8000/v1</code> (the firewall must allow the port)</td></tr><tr><td><code>--api-key 123456</code></td><td>API authentication key</td><td>Clients must send the header <code>Authorization: Bearer 123456</code></td></tr></tbody></table><p>Additional notes:</p><ol><li><strong>Model path format</strong>  <ul><li>Either a local path (such as <code>/root/models/Qwen/Qwen3</code>) or a Hugging Face model ID (such as <code>Qwen/Qwen3-0.6B</code>) is accepted.</li></ul></li><li><strong>Memory optimization</strong>  <ul><li><code>bfloat16</code> halves the memory taken by the weights compared with FP32, at the cost of a small loss of precision.</li></ul></li><li><strong>Reasoning output</strong>  
<ul><li>The <code>deepseek_r1</code> parser requires vLLM ≥0.7.3; it separates the model's step-by-step reasoning from the final answer.</li></ul></li><li><strong>Authentication</strong>  <ul><li><code>--api-key</code> forces clients to authenticate with an <code>Authorization: Bearer</code> header, which prevents unauthorized access.</li></ul></li></ol><blockquote><p>Note: adjust the paths, port, and key in these parameters to match your own environment.</p></blockquote><p>Once the server starts successfully, the terminal shows:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/vllm1-1.png"></p><p>Send a request with curl:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">curl http://localhost:8000/v1/chat/completions     -H &quot;Content-Type: application/json&quot;     -H &quot;Authorization: Bearer 123456&quot;     -d &#x27;&#123;</span><br><span class="line">        &quot;model&quot;: &quot;Qwen3-0.6B&quot;,</span><br><span class="line">        &quot;messages&quot;: [</span><br><span class="line">            &#123;&quot;role&quot;: &quot;system&quot;, &quot;content&quot;: &quot;You are a helpful assistant.&quot;&#125;,</span><br><span class="line">            &#123;&quot;role&quot;: &quot;user&quot;, &quot;content&quot;: &quot;你是谁？&quot;&#125;</span><br><span class="line">        ]</span><br><span class="line">    &#125;&#x27;</span><br><span 
class="line">&#123;&quot;id&quot;:&quot;chatcmpl-84534513a50f43abaa7a36e047f780a6&quot;,&quot;object&quot;:&quot;chat.completion&quot;,&quot;created&quot;:1748878398,&quot;model&quot;:&quot;Qwen3-0.6B&quot;,&quot;choices&quot;:[&#123;&quot;index&quot;:0,&quot;message&quot;:&#123;&quot;role&quot;:&quot;assistant&quot;,&quot;reasoning_content&quot;:&quot;\n好的，用户问我是谁。作为AI助手，我需要以合适的方式回答。首先，我应该确认用户的问题，然后提供基本信息。同时，要保持礼貌和专业的形象，避免使用可能引起误解的措辞。需要确保回答简洁明了，让用户感到被理解和支持。最后，检查是否有需要补充的信息，以提供更全面的回答。\n&quot;,&quot;content&quot;:&quot;\n\n我是AI助手，可以帮您解答问题。如果您有任何疑问或需要帮助，请随时告诉我！&quot;,&quot;tool_calls&quot;:[]&#125;,&quot;logprobs&quot;:null,&quot;finish_reason&quot;:&quot;stop&quot;,&quot;stop_reason&quot;:null&#125;],&quot;usage&quot;:&#123;&quot;prompt_tokens&quot;:22,&quot;total_tokens&quot;:122,&quot;completion_tokens&quot;:100,&quot;prompt_tokens_details&quot;:null&#125;,&quot;prompt_logprobs&quot;:null,&quot;kv_transfer_params&quot;:null&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The server log also shows the token throughput:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">INFO 06-02 15:13:40 [async_llm.py:261] Added request chatcmpl-0d62a87c6c2d4146927d7e704b11ffc7.</span><br><span class="line">INFO 06-02 15:13:40 [loggers.py:116] Engine 000: Avg prompt throughput: 2.4 tokens/s, Avg generation throughput: 1.2 tokens/s, Running: 1 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.1%, Prefix cache hit rate: 0.0%</span><br></pre></td></tr></table></figure><p>The service can also be reached from Open WebUI or Cherry Studio once it is configured there.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/vllm1-2.png"></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h3&gt;&lt;p&gt;vLLM is an inference and serving framework for large language models, originally developed by a systems research team at UC Berkeley,</summary>
      
    
    
    
    <category term="AI" scheme="http://yoursite.com/categories/AI/"/>
    
    
    <category term="AI" scheme="http://yoursite.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>Understanding and Planning GPU Compute</title>
    <link href="http://yoursite.com/2024/08/09/gpu_power/"/>
    <id>http://yoursite.com/2024/08/09/gpu_power/</id>
    <published>2024-08-09T13:45:59.000Z</published>
    <updated>2024-08-09T13:45:59.000Z</updated>
    
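    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>In AI training and inference, GPU capability is judged mainly by FLOPS (floating-point operations per second). Cluster compute is quoted in PFLOPS, often shortened to "P": the intelligent-computing centers currently under construction usually advertise how many "P" of compute they provide. A single card is quoted in TFLOPS (teraFLOPS, one trillion floating-point operations per second).</p><p>One MFLOPS (megaFLOPS) equals 10^6 FLOPS;<br>one GFLOPS (gigaFLOPS) equals 10^9 FLOPS;<br>one TFLOPS (teraFLOPS) equals 10^12 FLOPS;<br>one PFLOPS (petaFLOPS) equals 10^15 FLOPS;<br>one EFLOPS (exaFLOPS) equals 10^18 FLOPS.</p><p>Taking the H100 as an example, the spec sheet lists the performance figures for each form factor:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu2-1.png"><br>Compared with the PCIe card, the SXM card differs not only in memory bandwidth but also in throughput at each precision, so keep in mind that the same GPU model performs differently depending on its interface.<br>* indicates figures that use sparsity.</p><h3 id="精度单位"><a href="#精度单位" class="headerlink" title="精度单位"></a>Precision formats</h3><p>The figure above lists precisions such as FP64, FP32, TF32, FP16 and INT8. The choice of precision affects model quality, storage footprint, and training time.<br>Where the figure marks a precision with Tensor Core, it means the dedicated Tensor Core hardware can accelerate that precision and supports mixed-precision training.</p><p>Tensor Cores bring two advantages:<br>Advantage 1: higher performance<br>NVIDIA introduced Tensor Cores with the Volta architecture, where they were optimized only for FP16. On the Hopper architecture, Tensor Cores cover TF32, FP64, FP16 and INT8, with roughly 3x the throughput of the previous generation.</p><p>Advantage 2: mixed precision</p><p>Tensor Cores enable mixed precision by performing the multiply and the accumulate at different precisions: for example, half precision speeds up the matrix multiplication while single or double precision is used to keep the result accurate. See:<br><a href="https://blog.csdn.net/bestpasu/article/details/134098651">https://blog.csdn.net/bestpasu/article/details/134098651</a></p><p>FP64: double-precision floating point, 64 bits (8 bytes); used mainly for large-scale scientific and engineering computing that needs high precision.<br>FP32: single-precision floating point, 32 bits (4 bytes).<br>TF32: a format NVIDIA introduced as a Tensor Core stand-in for FP32; it uses 19 bits (1 sign, 8 exponent, 10 mantissa), keeping the same 8-bit exponent and therefore the same numeric range as FP32.<br>BFLOAT16: a 16-bit floating-point format used for half-precision matrix multiplication; at the same storage size as FP16 it keeps an 8-bit exponent, trading mantissa precision for a much wider dynamic range, which makes overflow less likely during training.<br>FP16: half-precision floating point, 16 bits (2 bytes); commonly used for parameters and gradient computation during training.<br>FP8: 8 bits (1 byte); used in both training and inference. Compared with INT8, FP8 has a wider dynamic range and captures the numeric distribution of LLM parameters more accurately.<br>INT8: 8-bit integer; typically used for quantization after training, converting high-precision floating-point values to low-precision integers to shrink the model and reduce compute while losing as little accuracy as possible.</p><p>According to NVIDIA's website, AI training mainly uses BF16, FP8, TF32 and FP16 to shorten training time; AI inference mainly uses TF32, BF16, FP16, FP8 and INT8 to achieve high throughput at low latency; HPC (high-performance computing) mainly uses FP64 to deliver the accuracy scientific computing requires.<br>(Source: the 韭研公社 app)</p><h3 id="稀疏计算和稠密计算"><a href="#稀疏计算和稠密计算" class="headerlink" 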
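title="稀疏计算和稠密计算"></a>Sparse and dense compute</h3><p>Sparse compute refers to computation in which the data being stored and moved contains many gaps or zero values. The data usually takes the form of matrices in which most elements are 0. Sparse compute is highly efficient on large-scale sparse data.</p><p>Dense compute refers to computation in which the data being stored and moved does not contain many gaps or zero values. The data usually takes the form of matrices in which most elements are non-zero. Dense compute is highly efficient on large-scale dense data.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu3-3.png"></p><p>Typical applications:</p><p>Sparse compute: widely used in image processing, signal processing, recommender systems and similar fields.</p><p>Dense compute: widely used in scientific computing, machine learning, deep learning, autonomous driving and similar fields.</p><h3 id="算力规划计算"><a href="#算力规划计算" class="headerlink" title="算力规划计算"></a>Compute planning</h3><h4 id="GPGPU卡数规划"><a href="#GPGPU卡数规划" class="headerlink" title="GPGPU卡数规划"></a>Planning the number of GPGPU cards</h4><p>Number of GPUs required &#x3D; total compute required &#x2F; compute per card</p><p>As an example, take a requirement of 1000 P and work out the number of cards using H100 SXM servers.</p><p>Sizing is usually done at 16-bit precision. One H100 SXM delivers 1979 TFLOPS of sparse BF16 compute, that is 1979&#x2F;1000 ≈ 1.98 P, so an 8-GPU server delivers about 16 P.<br>1000&#x2F;16 &#x3D; 62.5, which rounds up to 63 servers; for design convenience 64 servers are usually recommended. Dense throughput is half the sparse figure, so the same target in dense compute needs 64 × 2 &#x3D; 128 servers.</p><p>Dense compute is roughly half of sparse compute, so the common shorthand that one H100 is about 1 P refers to dense compute.</p><p>Number of GPUs required: 64 × 8 &#x3D; 512 cards for sparse compute, and 128 × 8 &#x3D; 1024 cards for dense compute.</p><h4 id="根据模型参数量规划算力"><a href="#根据模型参数量规划算力" class="headerlink" title="根据模型参数量规划算力"></a>Planning compute from the model's parameter count</h4><p>Training:<br>total compute &#x3D; 6 × number of tokens × number of model parameters</p><p>Note:<br>The factor 6 comes from the forward and backward passes. The forward pass costs about 2 floating-point operations per parameter per token (one multiply and one add), and the backward pass costs about twice as much as the forward pass. Each token therefore costs about 3 × 2 &#x3D; 6 floating-point operations per model parameter.</p><p>This is an empirical formula: one complete forward and backward pass over a single token takes roughly 6 times the parameter count in floating-point operations.</p><p>As an example, take LLaMA 65B trained on 1.4T tokens and estimate the training time on H100 SXM cards. LLaMA is a dense model, so all 65B parameters are active for every token. It is not an MoE model; for MoE models only the activated parameters should be counted.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu3-1.png"></p><p>Total compute required: 6 × 1.4T × 65B &#x3D; 5.46 × 10^23 floating-point operations</p><p>For the H100, sparse BF16 is about 1.9 PFLOPS and dense is about 1 PFLOPS. Assuming 50% real-world GPU utilization gives about 0.5 PFLOPS per card. Assume a cluster of 2048 cards.</p><p>2048 × 0.5 &#x3D; 1024 PFLOPS; converting PFLOPS to FLOPS means multiplying by 10^15.</p><p>Time &#x3D; (5.46 × 10^23)&#x2F;(1024 × 10^15) ≈ 5.3 × 10^5 seconds ≈ 6.2 days</p><p>On the A100, BF16 Tensor Core compute is 312 TFLOPS per card, so 2048 cards at the same 50% utilization deliver about 319 P.</p><p>Time &#x3D; (5.46 × 10^23)&#x2F;(319 × 10^15) ≈ 1.7 × 10^6 seconds ≈ 20 days</p><p>Large models also need distributed training, so inter-GPU communication bandwidth has to be taken into account as well.</p><h4 id="显存需求计算"><a href="#显存需求计算" class="headerlink" title="显存需求计算"></a>GPU memory requirements</h4><h5 id="推理场景显存（全参微调）"><a href="#推理场景显存（全参微调）" class="headerlink" 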
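title="推理场景显存（全参微调）"></a>GPU memory for inference</h5><p><code>inference memory = memory for model parameters + memory for the KV cache</code></p><p>Take LLaMA-7B as an example.</p><table><thead><tr><th>data type</th><th>bytes per parameter</th></tr></thead><tbody><tr><td>fp32</td><td>4 bytes</td></tr><tr><td>fp16</td><td>2 bytes</td></tr><tr><td>bf16</td><td>2 bytes</td></tr><tr><td>int8</td><td>1 byte</td></tr><tr><td>int4</td><td>0.5 bytes</td></tr></tbody></table><p>Memory for model parameters:</p><p>7B parameters in fp16 need<br>2 × 7B &#x3D; 14 GB</p><p>Note: 2 is the number of bytes per fp16 parameter.</p><p>Memory for the KV cache<br>During inference the model generates one token at a time and feeds the tokens generated so far back in as input to predict the next one.<br>Each time a new token is generated, the model computes the new Q, K and V and uses them to compute the attention weights. The K and V of earlier tokens, however, can be reused in the current decoding step. To speed up inference they are stored in a cache, the KV cache, which lives in GPU memory and saves recomputation.</p><p>memory &#x3D; batch_size × seq_length × hidden_size × layers × 2 × bytes_per_value, where the factor 2 accounts for storing both K and V<br>For LLaMA-7B:</p><ol><li><p>Hidden size:</p><p> •4096: LLaMA 7B has a hidden size of 4096, the dimension of each token's vector as it passes through the transformer layers.</p></li><li><p>Sequence length:</p><p> •2048 tokens: the default maximum sequence length is 2048 tokens, the most the model can process in one forward pass.</p></li><li><p>Batch size:</p><p> •The batch size is tunable and chosen according to available memory and the workload. It can differ between training and inference. Common values are 1, 8, 16 and so on, depending on memory and hardware.</p></li><li><p>Number of layers:</p><p> •32 layers: LLaMA 7B has 32 transformer layers, each performing one round of contextual processing over the tokens.</p></li></ol><p>memory &#x3D; 1 × 2048 × 4096 × 32 × 2 × 2 bytes ≈ 1 GB</p><p>This figure depends on the batch size (set to 1 here), on the number of concurrent users, and on the input and output sequence lengths, so treat it only as a reference.</p><p>Reference: <a href="https://mp.weixin.qq.com/s/7p-UMOv075OHp0dF5M63hw">https://mp.weixin.qq.com/s/7p-UMOv075OHp0dF5M63hw</a></p><p>Production inference deployments also use MQA and GQA to reduce the KV cache.</p><p>Most models publish benchmark reports that list memory use at each precision, for example Qwen's:<br><a href="https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html">https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html</a></p><p>Rule of thumb:<br>For an 8-bit quantized model, every 1B parameters takes at least 1 GB of memory.<br>For example:<br>an 8-bit quantized 7B model takes at least 7 GB<br>a 4-bit quantized 7B model takes at least 3.5 GB<br>a float16 7B model takes at least 14 GB</p><h5 id="训练场景显存（全参数训练）"><a href="#训练场景显存（全参数训练）" class="headerlink" 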
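title="训练场景显存（全参数训练）"></a>GPU memory for training (full-parameter training)</h5><p>Full training runs now generally use mixed-precision training, and the memory requirement depends on the following:</p><p>1. Model parameters: the memory taken by the model itself<br>2. Gradients: the gradients updated during training<br>3. Optimizer state: varies with the optimizer; Adam is used as the example here<br>4. Activations: the values stored during the forward pass. Every layer produces intermediate activations that are needed again in the backward pass to compute gradients, so they must be kept in memory. Activation memory scales with batch_size and seq_length and can be very large in real training. Note: this estimate assumes activations (intermediate results) are stored as float32, at 4 bytes per value.</p><p>Model parameters, gradients and optimizer state are static costs; activations are a dynamic cost and are set aside for the moment.</p><p>Let N be the number of model parameters, for example LLaMA-7B.</p><p>1. Model parameters: full-precision (FP32) training needs 4N bytes for the weights. Mixed-precision training needs 6N bytes, because both an FP16 copy (2N) and an FP32 copy (4N) of the weights are kept.</p><p>2. Gradients: 4N bytes, because this estimate keeps the gradients in FP32.</p><p>3. Optimizer state: depends on the optimizer. The widely used Adam optimizer keeps moving averages of the gradient and of the squared gradient, that is two FP32 states per parameter, so it needs 8N bytes.</p><p>4. Activations: roughly batch size × number of layers × sequence length × per-layer output dimension × 4 bytes.</p><p>Assume a batch size of 32, 12 layers, an input sequence length of 1024, and a per-layer output dimension of 4096.</p><p>Memory used: 32 × 12 × 1024 × 4096 × 4 bytes &#x3D; 6 GB</p><p>With mixed-precision training, the first three items add up to 6N + 4N + 8N &#x3D; 18N bytes, which is about 126 GB for a 7B model.<br>Adding 6 GB of activations gives about 132 GB.</p><p>In practice, training is distributed across multiple GPUs and uses ZeRO to reduce memory, which is now integrated into the DeepSpeed library. Many workloads also train only part of the parameters with PEFT (parameter-efficient fine-tuning) methods such as LoRA and QLoRA. The corresponding memory figures can be found in fine-tuning frameworks such as LLaMA-Factory (<code>https://github.com/hiyouga/LLaMA-Factory</code>) and Unsloth.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu3-2.png"></p><p><a href="https://llm-system-requirements.streamlit.app/">https://llm-system-requirements.streamlit.app/</a></p><p><a href="https://github.com/hiyouga/LLaMA-Factory">https://github.com/hiyouga/LLaMA-Factory</a></p><p>Summary:</p><p>Hugging Face's official calculator can also be used:<br><a href="https://huggingface.co/spaces/hf-accelerate/model-memory-usage">https://huggingface.co/spaces/hf-accelerate/model-memory-usage</a></p><p>References:<br><a href="https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html">https://qwen.readthedocs.io/en/latest/benchmark/speed_benchmark.html</a></p><p><a href="https://mp.weixin.qq.com/s/7p-UMOv075OHp0dF5M63hw">https://mp.weixin.qq.com/s/7p-UMOv075OHp0dF5M63hw</a></p><p><a href="https://github.com/hiyouga/LLaMA-Factory">https://github.com/hiyouga/LLaMA-Factory</a><br><a 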
href="https://gpumap.com/moxing/38887.html">https://gpumap.com/moxing/38887.html</a></p>]]></content>
    
    
      
      
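    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h3&gt;&lt;p&gt;In AI training and inference, GPU capability is judged mainly by FLOPS (floating-point operations per second). Cluster compute is quoted in PFLOPS, often shortened to P: the intelligent-computing centers currently under construction</summary>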
      
    
    
    
    <category term="AI" scheme="http://yoursite.com/categories/AI/"/>
    
    
    <category term="AI" scheme="http://yoursite.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>AI Study Notes 2 (Fine-Tuning Models)</title>
    <link href="http://yoursite.com/2024/07/09/ai_sft/"/>
    <id>http://yoursite.com/2024/07/09/ai_sft/</id>
    <published>2024-07-09T13:45:59.000Z</published>
    <updated>2024-07-09T13:45:59.000Z</updated>
    
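    <content type="html"><![CDATA[<h2 id="什么是微调"><a href="#什么是微调" class="headerlink" title="什么是微调"></a>What is fine-tuning</h2><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/fine-tunning-1.png"><br>Stages of building a large model<br><strong>Pre-training:</strong> unsupervised training on large amounts of unlabeled data produces a model with general knowledge; for example, OpenAI is reported to have used 45 TB of data to train GPT-3. Language data: it covers "English, Chinese, French, German, Spanish, Italian, Dutch, Portuguese and other languages. English makes up the largest share, about 60% of the total data."</p><p>Topic data: it spans many fields, including technology, finance, healthcare, education, law, sports and politics, with technology making up the largest share.</p><p>Data types: multimodal models also need images, audio, video and so on, which are used to train the model's multimedia capabilities.</p><p>A model trained this way has strong general capabilities.</p><p><strong>Fine-tuning:</strong> starting from the pre-trained model, supervised learning on task-specific labeled data, known as SFT (Supervised Fine-Tuning), improves the model's ability in a particular professional domain.</p><h2 id="常见微调方案"><a href="#常见微调方案" class="headerlink" title="常见微调方案"></a>Common fine-tuning approaches</h2><h3 id="微调方法"><a href="#微调方法" class="headerlink" title="微调方法"></a>Fine-tuning methods</h3><p>1. Full fine-tuning<br>Full fine-tuning updates all of the model's parameters. It usually gives the best results but is also the most resource-intensive, because backpropagation and gradient updates run over the entire model.</p><p>Pros: makes full use of all the model's parameters and adapts well.<br>Cons: high compute and storage cost; needs a lot of training data and time.</p><p>2. Adapter methods<br>Adapter methods insert small adapter modules (usually small feed-forward networks) between certain layers of the model. Only these modules are trained during fine-tuning, while the original model's parameters stay frozen.</p><p>Pros: greatly reduces the number of parameters to fine-tune and saves compute.<br>Cons: requires some changes to the model structure and adds a little extra computation.</p><p>In practice the most widely used technique in this family is LoRA (Low-Rank Adaptation), a method that cuts the number of trainable parameters while trying to preserve the model's quality.</p><h3 id="模型选择"><a href="#模型选择" class="headerlink" title="模型选择"></a>Choosing a model</h3><p>Base models and Instruct models</p><p>Downloading models or datasets<br>Hugging Face, or the ModelScope community in China<br><a href="https://huggingface.co/">https://huggingface.co/</a><br><a href="https://www.modelscope.cn/">https://www.modelscope.cn</a></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/fine-tunning-2.png"></p><p>Base model: a pre-trained language model trained mainly on large amounts of unlabeled text. It learns the structure, vocabulary and grammar of language. The training objective is usually a language-modeling task such as next-word prediction or masked-word prediction.</p><p>Instruct model: a base model further fine-tuned with additional supervision (such as human feedback or task instructions). The training data usually consists of task instructions paired with the expected output, with the goal of making the model better at understanding and following specific instructions.</p><p>When to use which:<br>Base model: typically used for generic text generation, preliminary natural-language-processing tasks, or as the foundation for other tasks. These models need further fine-tuning to fit a specific task.</p><p>Instruct model: designed for more concrete applications such as question answering, dialogue systems, summarization, text classification and code generation. They understand the user's intent better and produce answers that follow the instruction.</p><p>Fine-tuning frameworks: DeepSpeed, LLaMA-Factory, Unsloth<br><a 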
href="https://github.com/microsoft/DeepSpeed">https://github.com/microsoft/DeepSpeed</a><br><a href="https://github.com/hiyouga/LLaMA-Factory">https://github.com/hiyouga/LLaMA-Factory</a><br><a href="https://github.com/unslothai/unsloth">https://github.com/unslothai/unsloth</a></p><p>Commonly used open-source models</p><table><thead><tr><th>Model</th><th>Publisher</th><th>Link</th><th>Highlights</th></tr></thead><tbody><tr><td>Llama (2, 3)</td><td>Meta</td><td><a href="https://huggingface.co/meta-llama">https://huggingface.co/meta-llama</a></td><td>Active open-source community with open APIs and plentiful community resources, which makes further development and application easier.</td></tr><tr><td>ChatGLM</td><td>Zhipu (智谱清言)</td><td><a href="https://huggingface.co/THUDM/chatglm-6b">https://huggingface.co/THUDM/chatglm-6b</a></td><td>Optimized for Chinese; multi-turn dialogue</td></tr><tr><td>Baichuan</td><td>Baichuan (百川)</td><td><a href="https://huggingface.co/baichuan-inc">https://huggingface.co/baichuan-inc</a></td><td>Performs well in search, recommendation, advertising and other areas</td></tr><tr><td>Hunyuan-DiT (text-to-image)</td><td>Tencent</td><td><a href="https://huggingface.co/Tencent-Hunyuan">https://huggingface.co/Tencent-Hunyuan</a></td><td>First open-source Chinese-English bilingual DiT architecture</td></tr><tr><td>Qwen</td><td>Alibaba</td><td><a href="https://huggingface.co/Qwen">https://huggingface.co/Qwen</a></td><td>Inference speed, resource usage, Chinese understanding</td></tr><tr><td>Mini-CPM</td><td>Tsinghua University &amp; ModelBest (面壁智能)</td><td><a href="https://huggingface.co/openbmb">https://huggingface.co/openbmb</a></td><td>On-device multimodal large model</td></tr><tr><td>Phi-3</td><td>Microsoft</td><td><a href="https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3">https://huggingface.co/collections/microsoft/phi-3-6626e15e9585a200d2d761e3</a></td><td>Small enough to run on mobile devices</td></tr><tr><td>Gemma</td><td>Google</td><td><a href="https://huggingface.co/google/gemma-7b-it-pytorch">https://huggingface.co/google/gemma-7b-it-pytorch</a></td><td></td></tr></tbody></table><p>Benchmark reference:<br><a href="https://www.cluebenchmarks.com/superclue.html">https://www.cluebenchmarks.com/superclue.html</a></p><h2 id="Demo"><a href="#Demo" class="headerlink" title="Demo"></a>Demo</h2><h3 id="Colab使用"><a href="#Colab使用" class="headerlink" 
title="Colab使用"></a>Using Colab</h3><p><a href="https://colab.research.google.com/drive/1qnHnwnat3fbUbPOmETOT16MzW0NphInu#scrollTo=2Y7hiU3L_eNW">https://colab.research.google.com/drive/1qnHnwnat3fbUbPOmETOT16MzW0NphInu#scrollTo=2Y7hiU3L_eNW</a><br>On the free tier, a Colab session can run for at most 12 hours.</p><h3 id="本地环境部署"><a href="#本地环境部署" class="headerlink" title="本地环境部署"></a>Local environment</h3><p>Environment:<br>OS: ubuntu-22.04.4<br>Kernel: 5.15.0-107-generic<br>GCC: 11.4.0<br>GPU: RTX-3060-12G</p><h3 id="微调测试"><a href="#微调测试" class="headerlink" title="微调测试"></a>Fine-tuning test</h3><p>The test fine-tunes the llama-3-8b-bnb-4bit model with Unsloth, an integrated tool for fine-tuning models. Fine-tuning Mistral, Gemma and Llama through Unsloth is efficient overall and light on resources.<br>At the time of writing Unsloth mainly supported cuda-12.1, so that version is installed on the host.<br>Install CUDA 12.1</p><p>This also installs the GPU driver and the cuda-toolkit.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://developer.nvidia.com/cuda-12-1-0-download-archive</span><br></pre></td></tr></table></figure><p>Follow the steps on that page.<br>After installation, put nvcc on the PATH by adding <code>export PATH=$PATH:/usr/local/cuda-12.1/bin/</code> to &#x2F;etc&#x2F;profile.<br>Then run:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">source /etc/profile</span><br></pre></td></tr></table></figure><p>Check the GPU driver and CUDA versions<br>nvcc version:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">nvcc --version</span><br><span class="line">nvcc: NVIDIA (R) Cuda compiler driver</span><br><span class="line">Copyright (c) 2005-2023 NVIDIA Corporation</span><br><span class="line">Built on Tue_Feb__7_19:32:13_PST_2023</span><br><span class="line">Cuda compilation tools, release 12.1, V12.1.66</span><br><span 
class="line">Build cuda_12.1.r12.1/compiler.32415258_0</span><br><span class="line"></span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">nvidia-smi </span><br><span class="line">Sat May 18 15:26:30 2024       </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |</span><br><span class="line">|-----------------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                                         |                      |               MIG M. 
|</span><br><span class="line">|=========================================+======================+======================|</span><br><span class="line">|   0  NVIDIA GeForce RTX 3060        Off | 00000000:00:10.0 Off |                  N/A |</span><br><span class="line">|  0%   44C    P8              12W / 170W |      1MiB / 12288MiB |      0%      Default |</span><br><span class="line">|                                         |                      |                  N/A |</span><br><span class="line">+-----------------------------------------+----------------------+----------------------+</span><br><span class="line">                                                                                         </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                            |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |</span><br><span class="line">|        ID   ID                                                             Usage      |</span><br><span class="line">|=======================================================================================|</span><br><span class="line">|  No running processes found                                                           |</span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>nvcc and nvidia-smi report different CUDA versions because CUDA has both a runtime API and a driver API. nvcc reports the version of the installed toolkit (the runtime API side), while nvidia-smi reports the highest CUDA version the installed driver supports (the driver API side). The driver is generally backward compatible with older runtime versions, so a driver that supports 12.2 can run the 12.1 toolkit. PyTorch goes by the runtime API version.</p><p>Install and configure mamba</p><p>Python environments are managed with mamba.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -Ls 
https://micro.mamba.pm/api/micromamba/linux-64/latest | tar -xvj bin/micromamba</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mv ~/bin/micromamba /bin/</span><br></pre></td></tr></table></figure><h3 id="环境配置"><a href="#环境配置" class="headerlink" title="环境配置"></a>Environment setup</h3><h4 id="配置mamba环境"><a href="#配置mamba环境" class="headerlink" title="配置mamba环境"></a>Configuring the mamba environment</h4><p>Set up the shell environment variables. Once this is done, packages installed by micromamba and the environments it creates are stored under ~&#x2F;micromamba by default.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">micromamba  shell init -s bash -p ~/micromamba</span><br></pre></td></tr></table></figure><p>Configure a mirror in mainland China to speed up downloads:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">~/.mambarc</span><br><span class="line"></span><br><span class="line">channels:</span><br><span class="line">- defaults</span><br><span class="line">show_channel_urls: true</span><br><span class="line">default_channels:</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2</span><br><span 
class="line">custom_channels:</span><br><span class="line"> conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> bioconda: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br></pre></td></tr></table></figure><p>Activate the environment:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">micromamba activate</span><br></pre></td></tr></table></figure><h4 id="安装unsloth"><a href="#安装unsloth" class="headerlink" title="安装unsloth"></a>Installing unsloth</h4><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td class="code"><pre><span class="line">micromamba create --name unsloth_env python=3.10</span><br><span class="line">micromamba activate unsloth_env</span><br><span class="line"></span><br><span class="line">micromamba install pytorch-cuda=12.1 pytorch cudatoolkit xformers -c pytorch -c nvidia -c xformers</span><br><span class="line"></span><br><span class="line">pip install &quot;unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git&quot; -i 
https://pypi.mirrors.ustc.edu.cn/simple/</span><br><span class="line"></span><br><span class="line"># Newer GPUs (Ampere, Ada, Hopper: RTX 30xx, RTX 40xx, A100, H100, L40)</span><br><span class="line">pip install --no-deps packaging ninja einops flash-attn xformers trl peft accelerate bitsandbytes -i https://pypi.mirrors.ustc.edu.cn/simple/</span><br><span class="line"></span><br><span class="line"># Older GPUs (V100, Tesla T4, RTX 20xx)</span><br><span class="line">pip install --no-deps trl peft accelerate bitsandbytes -i https://pypi.mirrors.ustc.edu.cn/simple/</span><br></pre></td></tr></table></figure><h3 id="模型微调"><a href="#模型微调" class="headerlink" title="模型微调"></a>Model fine-tuning</h3><h4 id="执行模型下载和测试"><a href="#执行模型下载和测试" class="headerlink" title="执行模型下载和测试"></a>Downloading and testing the model</h4><p>Save the following as download.py:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span 
class="line">36</span><br></pre></td><td class="code"><pre><span class="line"># Download and load the model</span><br><span class="line">from unsloth import FastLanguageModel</span><br><span class="line">import torch</span><br><span class="line">max_seq_length = 2048</span><br><span class="line">dtype = None</span><br><span class="line">load_in_4bit = True</span><br><span class="line">model, tokenizer = FastLanguageModel.from_pretrained(</span><br><span class="line">    model_name = &quot;unsloth/llama-3-8b-bnb-4bit&quot;,</span><br><span class="line">    max_seq_length = max_seq_length,</span><br><span class="line">    dtype = dtype,</span><br><span class="line">    load_in_4bit = load_in_4bit,</span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"># Test the model</span><br><span class="line">alpaca_prompt = &quot;&quot;&quot;Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.</span><br><span class="line">### Instruction:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Input:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Response:</span><br><span class="line">&#123;&#125;&quot;&quot;&quot;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">FastLanguageModel.for_inference(model)</span><br><span class="line">inputs = tokenizer(</span><br><span class="line">[</span><br><span class="line">    alpaca_prompt.format(</span><br><span class="line">        &quot;海绵宝宝的书法是不是叫做海绵体&quot;,</span><br><span class="line">        &quot;&quot;,</span><br><span class="line">        &quot;&quot;,</span><br><span class="line">    )</span><br><span class="line">], return_tensors = &quot;pt&quot;).to(&quot;cuda&quot;)</span><br><span class="line"></span><br><span class="line">from transformers import TextStreamer</span><br><span class="line">text_streamer = 
TextStreamer(tokenizer)</span><br><span class="line">_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)</span><br></pre></td></tr></table></figure><p>The model is hosted on Hugging Face, which can be hard to reach from mainland China, so a mirror needs to be configured.<br>Run the script to download the model:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">HF_ENDPOINT=https://hf-mirror.com python download.py</span><br></pre></td></tr></table></figure><p>Because the model has not been trained on this corpus, it cannot give a sensible answer to the question “海绵宝宝的书法是不是叫做海绵体”.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ai2-1.png"></p><h4 id="模型微调-1"><a href="#模型微调-1" class="headerlink" title="模型微调"></a>Fine-tuning the model</h4><p>Create ft.py with the following code:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span 
class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br></pre></td><td class="code"><pre><span class="line">import os</span><br><span class="line">from unsloth import FastLanguageModel</span><br><span class="line">import torch</span><br><span class="line">from trl import SFTTrainer</span><br><span class="line">from transformers import TrainingArguments</span><br><span class="line">from datasets import load_dataset</span><br><span class="line"></span><br><span class="line"># Load the model</span><br><span class="line">max_seq_length = 2048</span><br><span 
class="line">dtype = None</span><br><span class="line">load_in_4bit = True</span><br><span class="line">model, tokenizer = FastLanguageModel.from_pretrained(</span><br><span class="line">    model_name = &quot;unsloth/llama-3-8b-bnb-4bit&quot;, </span><br><span class="line">    max_seq_length = max_seq_length, </span><br><span class="line">    dtype = dtype,     </span><br><span class="line">    load_in_4bit = load_in_4bit,  </span><br><span class="line">)</span><br><span class="line"></span><br><span class="line"># Prepare the training data</span><br><span class="line">alpaca_prompt = &quot;&quot;&quot;Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.</span><br><span class="line">### Instruction:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Input:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Response:</span><br><span class="line">&#123;&#125;&quot;&quot;&quot;</span><br><span class="line"></span><br><span class="line">EOS_TOKEN = tokenizer.eos_token # EOS_TOKEN must be appended to every sample</span><br><span class="line">def formatting_prompts_func(examples):</span><br><span class="line">    instructions = examples[&quot;instruction&quot;]</span><br><span class="line">    inputs       = examples[&quot;input&quot;]</span><br><span class="line">    outputs      = examples[&quot;output&quot;]</span><br><span class="line">    texts = []</span><br><span class="line">    for instruction, input, output in zip(instructions, inputs, outputs):</span><br><span class="line">        # Append EOS_TOKEN, otherwise generation never stops</span><br><span class="line">        text = alpaca_prompt.format(instruction, input, output) + EOS_TOKEN</span><br><span class="line">        texts.append(text)</span><br><span class="line">    return &#123; &quot;text&quot; : texts, &#125;</span><br><span class="line"></span><br><span class="line"># Dataset path on Hugging Face</span><br><span class="line">dataset = 
load_dataset(&quot;shaoyuan/ruozhibatest&quot;, split = &quot;train&quot;)</span><br><span class="line">#dataset = load_dataset(&quot;json&quot;, data_files=&#123;&quot;train&quot;: &quot;./data.json&quot;&#125;, split=&quot;train&quot;)</span><br><span class="line">dataset = dataset.map(formatting_prompts_func, batched = True)</span><br><span class="line"></span><br><span class="line"># Configure the LoRA and training parameters</span><br><span class="line">model = FastLanguageModel.get_peft_model(</span><br><span class="line">    model,</span><br><span class="line">    r = 16,</span><br><span class="line">    target_modules = [&quot;q_proj&quot;, &quot;k_proj&quot;, &quot;v_proj&quot;, &quot;o_proj&quot;,</span><br><span class="line">                      &quot;gate_proj&quot;, &quot;up_proj&quot;, &quot;down_proj&quot;,],</span><br><span class="line">    lora_alpha = 16,</span><br><span class="line">    lora_dropout = 0, </span><br><span class="line">    bias = &quot;none&quot;,    </span><br><span class="line">    use_gradient_checkpointing = True,</span><br><span class="line">    random_state = 3407,</span><br><span class="line">    max_seq_length = max_seq_length,</span><br><span class="line">    use_rslora = False,  </span><br><span class="line">    loftq_config = None, </span><br><span class="line">)</span><br><span class="line"></span><br><span class="line">trainer = SFTTrainer(</span><br><span class="line">    model = model,</span><br><span class="line">    train_dataset = dataset,</span><br><span class="line">    dataset_text_field = &quot;text&quot;,</span><br><span class="line">    max_seq_length = max_seq_length,</span><br><span class="line">    tokenizer = tokenizer,</span><br><span class="line">    args = TrainingArguments(</span><br><span class="line">        per_device_train_batch_size = 2,</span><br><span class="line">        gradient_accumulation_steps = 4,</span><br><span class="line">        warmup_steps = 10,</span><br><span class="line">        max_steps = 60, # 
number of fine-tuning steps</span><br><span class="line">        learning_rate = 2e-4, # learning rate</span><br><span class="line">        fp16 = not torch.cuda.is_bf16_supported(),</span><br><span class="line">        bf16 = torch.cuda.is_bf16_supported(),</span><br><span class="line">        logging_steps = 1,</span><br><span class="line">        output_dir = &quot;outputs&quot;,</span><br><span class="line">        optim = &quot;adamw_8bit&quot;,</span><br><span class="line">        weight_decay = 0.01,</span><br><span class="line">        lr_scheduler_type = &quot;linear&quot;,</span><br><span class="line">        seed = 3407,</span><br><span class="line">    ),</span><br><span class="line">)</span><br><span class="line"># Start training</span><br><span class="line">trainer.train()</span><br><span class="line">model.save_pretrained(&quot;lora_model&quot;)</span><br></pre></td></tr></table></figure><p>Dataset location:<br><a href="https://huggingface.co/datasets/shaoyuan/ruozhibatest">https://huggingface.co/datasets/shaoyuan/ruozhibatest</a></p><p>1. The dataset is downloaded from Hugging Face; a local dataset can be loaded instead, in the format shown below. I used data previously collected from Ruozhiba (弱智吧). The default fine-tuning parameters are a reasonable starting point.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">[</span><br><span class="line">        &#123;</span><br><span class="line">                &quot;instruction&quot;: &quot;TCE是什么?&quot;,</span><br><span class="line">                &quot;input&quot;: &quot;&quot;,</span><br><span class="line">                &quot;output&quot;: &quot;TCE是Tencent Cloud Enterprise的缩写,是腾讯私有云产品&quot;</span><br><span class="line">        &#125;</span><br><span class="line">]</span><br></pre></td></tr></table></figure><p>2. model.save_pretrained saves the fine-tuned LoRA model to a local directory.</p><p>Run the following command to start fine-tuning:</p><figure class="highlight plaintext"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">HF_ENDPOINT=https://hf-mirror.com python ft.py</span><br></pre></td></tr></table></figure><p>A progress bar shows the training progress.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ai2-2.png"></p><p>Running nvidia-smi at this point shows the GPU memory in use:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">nvidia-smi </span><br><span class="line">Sun May 19 14:55:57 2024       </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 535.171.04             Driver Version: 535.171.04   CUDA Version: 12.2     |</span><br><span class="line">|-----------------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                                         |                      |               MIG M. 
|</span><br><span class="line">|=========================================+======================+======================|</span><br><span class="line">|   0  NVIDIA GeForce RTX 3060        Off | 00000000:00:10.0 Off |                  N/A |</span><br><span class="line">| 53%   69C    P2             163W / 170W |   6296MiB / 12288MiB |     85%      Default |</span><br><span class="line">|                                         |                      |                  N/A |</span><br><span class="line">+-----------------------------------------+----------------------+----------------------+</span><br><span class="line">                                                                                         </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                            |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |</span><br><span class="line">|        ID   ID                                                             Usage      |</span><br><span class="line">|=======================================================================================|</span><br><span class="line">|    0   N/A  N/A      4770      C   python                                     6290MiB |</span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>1. When the run finishes, a lora_model folder is created in the working directory. This is the fine-tuned model.</p><h3 id="微调后测试"><a href="#微调后测试" class="headerlink" title="微调后测试"></a>Testing after fine-tuning</h3><p>After fine-tuning, ask the same question again.<br>Save the following as test.py:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line">import os</span><br><span class="line">from unsloth import FastLanguageModel</span><br><span class="line">import torch</span><br><span class="line">from transformers import TextStreamer</span><br><span class="line"></span><br><span class="line">if True:</span><br><span class="line">    from unsloth import FastLanguageModel</span><br><span class="line">    model, tokenizer = FastLanguageModel.from_pretrained(</span><br><span class="line">        model_name = &quot;lora_model&quot;, # load the fine-tuned LoRA model</span><br><span class="line">        max_seq_length = 2048,</span><br><span class="line">        dtype = None,</span><br><span class="line">        load_in_4bit = True,</span><br><span class="line">    )</span><br><span class="line">    FastLanguageModel.for_inference(model) </span><br><span class="line">alpaca_prompt = &quot;&quot;&quot;Below is an instruction that describes a task, paired with an input that provides further context. 
Write a response that appropriately completes the request.</span><br><span class="line">### Instruction:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Input:</span><br><span class="line">&#123;&#125;</span><br><span class="line">### Response:</span><br><span class="line">&#123;&#125;&quot;&quot;&quot;</span><br><span class="line"></span><br><span class="line">inputs = tokenizer(</span><br><span class="line">[</span><br><span class="line">    alpaca_prompt.format(</span><br><span class="line">        &quot;请用中文回答&quot;, </span><br><span class="line">        &quot;海绵宝宝的书法是不是叫做海绵体&quot;, </span><br><span class="line">        &quot;&quot;, </span><br><span class="line">    )</span><br><span class="line">], return_tensors = &quot;pt&quot;).to(&quot;cuda&quot;)</span><br><span class="line"></span><br><span class="line">text_streamer = TextStreamer(tokenizer)</span><br><span class="line">_ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = 128)</span><br></pre></td></tr></table></figure><p>1. This loads the lora_model that was just fine-tuned locally and tests it.</p><p>Result:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ai2-3.png"><br>The model now answers the question and even adds some elaboration of its own. The answer is not very accurate, but this is only a fine-tune, not full training.</p><p>Note: the downloaded model is stored in</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">/root/.cache/huggingface/hub/models--unsloth--llama-3-8b-bnb-4bit</span><br></pre></td></tr></table></figure><p>To merge the fine-tuned model with the original model and quantize the result into a 4-bit GGUF file,<br>append the following to the end of the code:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">model.save_pretrained_gguf(&quot;model&quot;, tokenizer, quantization_method = &quot;q4_k_m&quot;)</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>The resulting GGUF file can be loaded and used locally with the GPT4All app:</p><p><a 
href="https://gpt4all.io/index.html">https://gpt4all.io/index.html</a></p><p>On macOS, for example, copy the GGUF file into the GPT4All directory and it can be loaded:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cp model-unsloth.Q4_K_M.gguf ~/Library/Application\ Support/nomic.ai/GPT4All</span><br></pre></td></tr></table></figure><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ai2-4.png"></p><p>Other tools such as Ollama and Dify can also load the model.</p><p>Notes:<br>Downloaded datasets are stored in</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./.cache/huggingface/datasets/downloads/</span><br></pre></td></tr></table></figure><p>Hugging Face download mirror: <a href="https://hf-mirror.com/">https://hf-mirror.com/</a></p><p>Removing the NVIDIA driver:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">sudo nvidia-uninstall</span><br><span class="line">sudo apt purge -y &#x27;^nvidia-*&#x27; &#x27;^libnvidia-*&#x27;</span><br><span class="line">sudo rm -r /var/lib/dkms/nvidia</span><br><span class="line">sudo apt -y autoremove</span><br><span class="line">sudo update-initramfs -c -k `uname -r`</span><br><span class="line">sudo update-grub2</span><br><span class="line">read -p &quot;Press any key to reboot... 
&quot; -n1 -s</span><br><span class="line">sudo reboot</span><br></pre></td></tr></table></figure><h3 id="总结"><a href="#总结" class="headerlink" title="总结"></a>Summary</h3><p>1. This walkthrough runs the fine-tuning locally. For your own experiments, Google Colab is faster and more convenient.</p><p>Reference notebook:<br><a href="https://colab.research.google.com/drive/1qnHnwnat3fbUbPOmETOT16MzW0NphInu?usp=sharing">https://colab.research.google.com/drive/1qnHnwnat3fbUbPOmETOT16MzW0NphInu?usp=sharing</a></p><p>2. A model fine-tuned this way is not guaranteed to reproduce the answers in the corpus verbatim. When the answers must be authoritative and cannot be wrong, as with medical information or policy questions, what is needed is semantic matching rather than a fine-tuned LLM. For those cases a model combined with a knowledge base, that is, model plus RAG, is the better choice.</p><p>Hugging Face course:</p><p><a href="https://huggingface.co/learn/nlp-course/chapter5/1?fw=pt">https://huggingface.co/learn/nlp-course/chapter5/1?fw=pt</a></p><p>References:<br><a href="https://www.youtube.com/watch?v=LPmI-Ok5fUc&t=815s&ab_channel=AI%E6%8E%A2%E7%B4%A2%E4%B8%8E%E5%8F%91%E7%8E%B0">https://www.youtube.com/watch?v=LPmI-Ok5fUc&amp;t=815s&amp;ab_channel=AI%E6%8E%A2%E7%B4%A2%E4%B8%8E%E5%8F%91%E7%8E%B0</a><br><a href="https://mp.weixin.qq.com/s/hTcNz7fP3ym_tK6OZaWu7A">https://mp.weixin.qq.com/s/hTcNz7fP3ym_tK6OZaWu7A</a><br><a href="https://mp.weixin.qq.com/s/VV1BUMQIMrb5LxQNusQsDg">https://mp.weixin.qq.com/s/VV1BUMQIMrb5LxQNusQsDg</a><br><a href="https://www.53ai.com/news/qianyanjishu/1274.html">https://www.53ai.com/news/qianyanjishu/1274.html</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;什么是微调&quot;&gt;&lt;a href=&quot;#什么是微调&quot; class=&quot;headerlink&quot; title=&quot;什么是微调&quot;&gt;&lt;/a&gt;什么是微调&lt;/h2&gt;&lt;p&gt;&lt;img src=&quot;https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com</summary>
      
    
    
    
    <category term="AI" scheme="http://yoursite.com/categories/AI/"/>
    
    
    <category term="AI" scheme="http://yoursite.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>Training Digit Recognition with the MNIST Dataset</title>
    <link href="http://yoursite.com/2024/06/09/mnist_train/"/>
    <id>http://yoursite.com/2024/06/09/mnist_train/</id>
    <published>2024-06-09T13:45:59.000Z</published>
    <updated>2024-06-09T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<p>Environment<br>OS: ubuntu-22.04<br>Kernel: 5.15.0-101-generic<br>GPU: NVIDIA-T4<br>Python: 3.10.12<br>Docker: 24.0.5</p><p>This post trains handwritten digit recognition on the MNIST dataset.<br>Download the dataset with the script below.</p><h3 id="环境初始化配置"><a href="#环境初始化配置" class="headerlink" title="环境初始化配置"></a>Initial environment setup</h3><p>First install torch and torchvision:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip install torch torchvision</span><br></pre></td></tr></table></figure><p>Install CUDA and the GPU driver by following the official guide. CUDA 12.1 is installed here, and by default the matching GPU driver is installed with it.<br><a href="https://developer.nvidia.com/cuda-12-1-0-download-archive">https://developer.nvidia.com/cuda-12-1-0-download-archive</a><br>CUDA 12.4 also works and can be downloaded from the same archive.<br>The installation is complete once nvidia-smi lists the GPU:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">nvidia-smi </span><br><span class="line">Sun May 26 13:53:22 2024       </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |</span><br><span 
class="line">|-----------------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                                         |                      |               MIG M. |</span><br><span class="line">|=========================================+======================+======================|</span><br><span class="line">|   0  Tesla T4                        On | 00000000:00:08.0 Off |                    0 |</span><br><span class="line">| N/A   29C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |</span><br><span class="line">|                                         |                      |                  N/A |</span><br><span class="line">+-----------------------------------------+----------------------+----------------------+</span><br><span class="line">                                                                                         </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                            |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |</span><br><span class="line">|        ID   ID                                                             Usage      |</span><br><span class="line">|=======================================================================================|</span><br><span class="line">|  No running processes found                                                           |</span><br><span 
class="line">+---------------------------------------------------------------------------------------+</span><br></pre></td></tr></table></figure><p>Install docker-ce</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">Reference: https://docs.docker.com/engine/install/ubuntu/</span><br><span class="line"></span><br><span class="line">Installed version: docker-ce v24.0.5</span><br></pre></td></tr></table></figure><p>Install nvidia-container-toolkit so that containers can use the GPU</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | apt-key add -</span><br><span class="line"></span><br><span class="line">curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu22.04/nvidia-docker.list &gt; /etc/apt/sources.list.d/nvidia-docker.list</span><br><span class="line"></span><br><span class="line">apt update</span><br><span class="line"></span><br><span class="line">apt -y install nvidia-container-toolkit</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">systemctl restart docker</span><br></pre></td></tr></table></figure><p>Verification<br>Start the nvidia&#x2F;cuda:12.1.0-base-ubuntu20.04 container with docker, pass the host GPUs through with the --gpus flag, and run nvidia-smi to check that the GPU is visible inside the container.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">docker run --gpus all nvidia/cuda:12.1.0-base-ubuntu20.04 nvidia-smi</span><br><span class="line"></span><br><span class="line">Sun May 26 06:03:56 2024       </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 530.30.02              Driver Version: 530.30.02    CUDA Version: 12.1     |</span><br><span class="line">|-----------------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name                  Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp  Perf            Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                                         |                      |               MIG M. 
|</span><br><span class="line">|=========================================+======================+======================|</span><br><span class="line">|   0  Tesla T4                        On | 00000000:00:08.0 Off |                    0 |</span><br><span class="line">| N/A   29C    P8               11W /  70W|      2MiB / 15360MiB |      0%      Default |</span><br><span class="line">|                                         |                      |                  N/A |</span><br><span class="line">+-----------------------------------------+----------------------+----------------------+</span><br><span class="line">                                                                                         </span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                            |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |</span><br><span class="line">|        ID   ID                                                             Usage      |</span><br><span class="line">|=======================================================================================|</span><br><span class="line">|  No running processes found                                                           |</span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line"></span><br></pre></td></tr></table></figure><h3 id="下载MNIST训练数据"><a href="#下载MNIST训练数据" class="headerlink" title="下载MNIST训练数据"></a>下载MNIST训练数据</h3><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span 
class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line">import os</span><br><span class="line">from torchvision import datasets</span><br><span class="line"></span><br><span class="line">rootdir = &quot;/home/mnist-data/&quot;</span><br><span class="line">traindir = rootdir + &quot;/train&quot;</span><br><span class="line">testdir = rootdir + &quot;/test&quot;</span><br><span class="line"></span><br><span class="line">train_dataset = datasets.MNIST(root=rootdir, train=True, download=True)</span><br><span class="line">test_dataset = datasets.MNIST(root=rootdir, train=False, download=True)</span><br><span class="line"></span><br><span class="line">number = 0</span><br><span class="line">for img, label in train_dataset:</span><br><span class="line">    savedir = traindir + &quot;/&quot; + str(label)</span><br><span class="line">    os.makedirs(savedir, exist_ok=True)</span><br><span class="line">    savepath = savedir + &quot;/&quot; + str(number).zfill(5) + &quot;.png&quot;</span><br><span class="line">    img.save(savepath)</span><br><span class="line">    number = number + 1</span><br><span class="line">    print(savepath)</span><br><span class="line"></span><br><span class="line">number = 0</span><br><span class="line">for img, label in test_dataset:</span><br><span class="line">    savedir = testdir + &quot;/&quot; + 
str(label)</span><br><span class="line">    os.makedirs(savedir, exist_ok=True)</span><br><span class="line">    savepath = savedir + &quot;/&quot; + str(number).zfill(5) + &quot;.png&quot;</span><br><span class="line">    img.save(savepath)</span><br><span class="line">    number = number + 1</span><br><span class="line">    print(savepath)</span><br></pre></td></tr></table></figure><p>Save the script to a file and run it to download the data.</p><p>After the download, the data directory contains 3 folders</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">ls /home/mnist-data/</span><br><span class="line">MNIST  test  train</span><br></pre></td></tr></table></figure><p>MNIST folder: holds the original MNIST training and test data files, including</p><ul><li><p>train-images-idx3-ubyte: training-set images.</p></li><li><p>train-labels-idx1-ubyte: training-set labels.</p></li><li><p>t10k-images-idx3-ubyte: test-set images.</p></li><li><p>t10k-labels-idx1-ubyte: test-set labels.</p></li></ul><p>train folder: the training set, 60,000 images of handwritten digits at 28x28 pixels, saved by the script as PNG files in one subfolder per label. These images are used to train the model.</p><p>test folder: the test set, 10,000 images of handwritten digits at 28x28 pixels, saved the same way. These images are used to evaluate the trained model.</p><p>Characteristics:</p><ul><li>Labels: every image has a label giving the digit it shows (0 to 9).</li><li>Normalization: every image is normalized to 28x28 pixels with the digit centered in the frame.</li></ul><p>Start the training container, mounting the downloaded data at &#x2F;workspace&#x2F;data</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --gpus all -itd --rm -v /home/mnist-data:/workspace/data nvcr.io/nvidia/pytorch:24.05-py3</span><br></pre></td></tr></table></figure><h3 id="在容器中进行训练"><a href="#在容器中进行训练" class="headerlink" title="在容器中进行训练"></a>Training inside the container</h3><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span 
class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br></pre></td><td class="code"><pre><span class="line">import torch</span><br><span class="line">import torch.nn as nn</span><br><span class="line">import torch.optim as optim</span><br><span class="line">import torch.nn.functional as F</span><br><span class="line">from torchvision import datasets, transforms</span><br><span class="line">from torch.utils.data import DataLoader</span><br><span class="line"></span><br><span class="line"># Define the network: two conv layers followed by two fully connected layers</span><br><span class="line">class Net(nn.Module):</span><br><span class="line">    def __init__(self):</span><br><span class="line">        super(Net, self).__init__()</span><br><span class="line">        self.conv1 = nn.Conv2d(1, 32, 3, 1)</span><br><span class="line">        self.conv2 = nn.Conv2d(32, 64, 3, 1)</span><br><span class="line">        self.dropout1 = nn.Dropout2d(0.25)</span><br><span class="line">        self.dropout2 = nn.Dropout(0.5)</span><br><span class="line">        self.fc1 = nn.Linear(9216, 128)</span><br><span class="line">        self.fc2 = nn.Linear(128, 10)</span><br><span class="line"></span><br><span class="line">    def forward(self, x):</span><br><span class="line">        x = self.conv1(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = self.conv2(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = F.max_pool2d(x, 
2)</span><br><span class="line">        x = self.dropout1(x)</span><br><span class="line">        x = torch.flatten(x, 1)</span><br><span class="line">        x = self.fc1(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = self.dropout2(x)</span><br><span class="line">        x = self.fc2(x)</span><br><span class="line">        output = F.log_softmax(x, dim=1)</span><br><span class="line">        return output</span><br><span class="line"></span><br><span class="line"># Define the preprocessing (tensor conversion and MNIST mean/std normalization)</span><br><span class="line">transform = transforms.Compose([</span><br><span class="line">    transforms.ToTensor(),</span><br><span class="line">    transforms.Normalize((0.1307,), (0.3081,))</span><br><span class="line">])</span><br><span class="line"></span><br><span class="line"># Load the training and test sets from the mounted data directory</span><br><span class="line">train_dataset = datasets.MNIST(root=&#x27;/workspace/data&#x27;, train=True, download=False, transform=transform)</span><br><span class="line">test_dataset = datasets.MNIST(root=&#x27;/workspace/data&#x27;, train=False, download=False, transform=transform)</span><br><span class="line"></span><br><span class="line">train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)</span><br><span class="line">test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)</span><br><span class="line"></span><br><span class="line"># Use the GPU if one is available, otherwise fall back to the CPU</span><br><span class="line">device = torch.device(&quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;)</span><br><span class="line">model = Net().to(device)</span><br><span class="line">optimizer = optim.Adam(model.parameters())</span><br><span class="line"></span><br><span class="line"># Train the model for one epoch</span><br><span class="line">def train(model, device, train_loader, optimizer, epoch):</span><br><span class="line">    model.train()</span><br><span class="line">    for batch_idx, (data, target) in enumerate(train_loader):</span><br><span class="line">        data, 
target = data.to(device), target.to(device)</span><br><span class="line">        optimizer.zero_grad()</span><br><span class="line">        output = model(data)</span><br><span class="line">        loss = F.nll_loss(output, target)</span><br><span class="line">        loss.backward()</span><br><span class="line">        optimizer.step()</span><br><span class="line">        if batch_idx % 100 == 0:</span><br><span class="line">            print(f&#x27;Train Epoch: &#123;epoch&#125; [&#123;batch_idx * len(data)&#125;/&#123;len(train_loader.dataset)&#125; &#x27;</span><br><span class="line">                  f&#x27;(&#123;100. * batch_idx / len(train_loader):.0f&#125;%)]\tLoss: &#123;loss.item():.6f&#125;&#x27;)</span><br><span class="line"></span><br><span class="line"># Evaluate the model on the test set</span><br><span class="line">def test(model, device, test_loader):</span><br><span class="line">    model.eval()</span><br><span class="line">    test_loss = 0</span><br><span class="line">    correct = 0</span><br><span class="line">    with torch.no_grad():</span><br><span class="line">        for data, target in test_loader:</span><br><span class="line">            data, target = data.to(device), target.to(device)</span><br><span class="line">            output = model(data)</span><br><span class="line">            test_loss += F.nll_loss(output, target, reduction=&#x27;sum&#x27;).item()</span><br><span class="line">            pred = output.argmax(dim=1, keepdim=True)</span><br><span class="line">            correct += pred.eq(target.view_as(pred)).sum().item()</span><br><span class="line"></span><br><span class="line">    test_loss /= len(test_loader.dataset)</span><br><span class="line">    print(f&#x27;\nTest set: Average loss: &#123;test_loss:.4f&#125;, Accuracy: &#123;correct&#125;/&#123;len(test_loader.dataset)&#125; &#x27;</span><br><span class="line">          f&#x27;(&#123;100. 
* correct / len(test_loader.dataset):.0f&#125;%)\n&#x27;)</span><br><span class="line"></span><br><span class="line"># Run training and testing for 10 epochs</span><br><span class="line">for epoch in range(1, 11):</span><br><span class="line">    train(model, device, train_loader, optimizer, epoch)</span><br><span class="line">    test(model, device, test_loader)</span><br><span class="line"></span><br><span class="line"># Save the trained weights</span><br><span class="line">torch.save(model.state_dict(), &quot;/workspace/mnist_cnn.pt&quot;)</span><br><span class="line">print(&quot;Model saved to /workspace/mnist_cnn.pt&quot;)</span><br></pre></td></tr></table></figure><p>Save the script as mnist_train.py and run <code>python mnist_train.py</code><br>It loads the MNIST dataset that was downloaded on the host and mounted into the container, trains the model, and writes the trained weights, mnist_cnn.pt, to the workspace directory.</p><h3 id="加载模型进行测试验证"><a href="#加载模型进行测试验证" class="headerlink" title="加载模型进行测试验证"></a>Loading the model and verifying it</h3><p>Save the following script as test.py</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span 
class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br></pre></td><td class="code"><pre><span class="line">import torch</span><br><span class="line">import torch.nn as nn</span><br><span class="line">import torch.nn.functional as F</span><br><span class="line">from torchvision import transforms</span><br><span class="line">from PIL import Image</span><br><span class="line">import argparse</span><br><span class="line"></span><br><span class="line"># Define the same network architecture that was used for training</span><br><span class="line">class Net(nn.Module):</span><br><span class="line">    def __init__(self):</span><br><span class="line">        super(Net, self).__init__()</span><br><span class="line">        self.conv1 = nn.Conv2d(1, 32, 3, 1)</span><br><span class="line">        self.conv2 = nn.Conv2d(32, 64, 3, 1)</span><br><span class="line">        self.dropout1 = nn.Dropout2d(0.25)</span><br><span class="line">        self.dropout2 = nn.Dropout(0.5)</span><br><span 
class="line">        self.fc1 = nn.Linear(9216, 128)</span><br><span class="line">        self.fc2 = nn.Linear(128, 10)</span><br><span class="line"></span><br><span class="line">    def forward(self, x):</span><br><span class="line">        x = self.conv1(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = self.conv2(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = F.max_pool2d(x, 2)</span><br><span class="line">        x = self.dropout1(x)</span><br><span class="line">        x = torch.flatten(x, 1)</span><br><span class="line">        x = self.fc1(x)</span><br><span class="line">        x = F.relu(x)</span><br><span class="line">        x = self.dropout2(x)</span><br><span class="line">        x = self.fc2(x)</span><br><span class="line">        output = F.log_softmax(x, dim=1)</span><br><span class="line">        return output</span><br><span class="line"></span><br><span class="line"># Use the GPU if one is available, otherwise fall back to the CPU</span><br><span class="line">device = torch.device(&quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;)</span><br><span class="line"></span><br><span class="line"># Load the trained weights; map_location lets a GPU-trained file load on a CPU-only host</span><br><span class="line">model = Net().to(device)</span><br><span class="line">model.load_state_dict(torch.load(&quot;/workspace/mnist_cnn.pt&quot;, map_location=device))</span><br><span class="line">model.eval()</span><br><span class="line"></span><br><span class="line"># Define the preprocessing: grayscale, 28x28, same normalization as in training</span><br><span class="line">transform = transforms.Compose([</span><br><span class="line">    transforms.Grayscale(num_output_channels=1),</span><br><span class="line">    transforms.Resize((28, 28)),</span><br><span class="line">    transforms.ToTensor(),</span><br><span class="line">    transforms.Normalize((0.1307,), (0.3081,))</span><br><span class="line">])</span><br><span class="line"></span><br><span class="line">def predict_image(image_path):</span><br><span class="line">    image = Image.open(image_path)</span><br><span 
class="line">    image = transform(image).unsqueeze(0).to(device)</span><br><span class="line">    </span><br><span class="line">    with torch.no_grad():</span><br><span class="line">        output = model(image)</span><br><span class="line">        pred = output.argmax(dim=1, keepdim=True)</span><br><span class="line">    </span><br><span class="line">    return pred.item()</span><br><span class="line"></span><br><span class="line">if __name__ == &quot;__main__&quot;:</span><br><span class="line">    parser = argparse.ArgumentParser(description=&#x27;MNIST Image Prediction&#x27;)</span><br><span class="line">    parser.add_argument(&#x27;image_path&#x27;, type=str, help=&#x27;Path to the image to be predicted&#x27;)</span><br><span class="line">    args = parser.parse_args()</span><br><span class="line"></span><br><span class="line">    # Predict the digit shown in the image</span><br><span class="line">    prediction = predict_image(args.image_path)</span><br><span class="line">    print(f&#x27;The predicted digit is: &#123;prediction&#125;&#x27;)</span><br></pre></td></tr></table></figure><p>Run the verification, passing the path of an image</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">python test.py data/test/8/00527.png </span><br><span class="line"></span><br><span class="line"></span><br><span class="line">Output:</span><br><span class="line">The predicted digit is: 8</span><br><span class="line"></span><br><span class="line">python test.py data/test/1/00239.png </span><br><span class="line"></span><br><span class="line">The predicted digit is: 1</span><br><span 
class="line"></span><br></pre></td></tr></table></figure><p>The images under the test directory are convenient for quick verification.</p><p>DIGITS can also be used to load and verify the model through a graphical interface.</p><p><a href="https://licensecounter.jp/engineer-voice/blog/articles/20240408_ngc_nvidia_gpu_cloud.html">https://licensecounter.jp/engineer-voice/blog/articles/20240408_ngc_nvidia_gpu_cloud.html</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;p&gt;Environment&lt;br&gt;OS: ubuntu-22.04&lt;br&gt;Kernel: 5.15.0-101-generic&lt;br&gt;GPU: NVIDIA-T4&lt;br&gt;Python: 3.10.12&lt;br&gt;Docker: 24.0.5&lt;/p&gt;
&lt;p&gt;Train a handwritten-digit recognizer on the MNIST dataset&lt;br&gt;</summary>
      
    
    
    
    <category term="AI" scheme="http://yoursite.com/categories/AI/"/>
    
    
    <category term="AI" scheme="http://yoursite.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>GPU Interconnect Methods</title>
    <link href="http://yoursite.com/2023/11/03/gpu-1/"/>
    <id>http://yoursite.com/2023/11/03/gpu-1/</id>
    <published>2023-11-03T13:45:59.000Z</published>
    <updated>2023-11-03T13:45:59.000Z</updated>
    
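    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a><a href="#%E6%A6%82%E8%BF%B0" title="概述"></a>Overview</h3><p>As large AI models keep developing, more and more users need to put large numbers of GPUs to work on AI training. At its core, AI training and inference is parallel computation across many GPUs. The main strategies are data parallelism (the training data is split into subsets, each processed by a different GPU), model parallelism (different layers of the neural network are assigned to different GPUs), and tensor parallelism (the tensors of a single layer are split into smaller blocks computed on different GPUs). Every one of these strategies exchanges large volumes of data between GPUs, so the interconnect must offer high bandwidth and low latency, with no congestion and no packet loss.</p><h3 id="同服务器内GPU间连接"><a href="#同服务器内GPU间连接" class="headerlink" title="同服务器内GPU间连接"></a><a href="#%E5%90%8C%E6%9C%8D%E5%8A%A1%E5%99%A8%E5%86%85GPU%E9%97%B4%E8%BF%9E%E6%8E%A5" title="同服务器内GPU间连接"></a>GPU connections within a server</h3><h4 id="PCIE连接"><a href="#PCIE连接" class="headerlink" title="PCIE连接"></a><a href="#PCIE%E8%BF%9E%E6%8E%A5" title="PCIE连接"></a>PCIe connection</h4><p>A GPU card bought on its own plugs directly into a PCIe slot of the server, and the GPU is connected to the CPU over PCIe lanes. The main problem with PCIe is that its bandwidth is too low for today's large AI models: even PCIe 5.0 is about 7 times slower than fourth-generation NVLink.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-1.jpg"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-2.jpg"></p><p>Image source: <code>https://www.sohu.com/a/747247345_121865302#:~:text=%E7%9B%B8%E6%AF%94%E4%BA%8EPCIe%EF%BC%8CNVLink,%E5%A5%BD%E7%9A%84%E6%80%A7%E8%83%BD%E5%92%8C%E6%95%88%E7%8E%87%E3%80%82&amp;text=%E7%AE%80%E8%80%8C%E8%A8%80%E4%B9%8B%EF%BC%8CPCIe,%E5%88%86%E5%88%AB%E6%9C%89%E5%93%AA%E4%BA%9B%E4%BC%98%E5%8A%A3%E5%8A%BF%EF%BC%9F</code></p><p>When PCIe is a good fit:<br>1. A single card is fast enough for the workload, so the card can simply be passed through on its own.</p><h4 id="Nvlink连接"><a href="#Nvlink连接" class="headerlink" title="Nvlink连接"></a><a href="#Nvlink%E8%BF%9E%E6%8E%A5" title="Nvlink连接"></a>NVLink connection</h4><p>PCIe is a bandwidth bottleneck and can only connect GPUs pairwise. With NVLink, a GPU can access the memory of a remote GPU without going over the PCIe bus, overall performance is higher than with PCIe, and in combination with NVSwitch eight cards can be interconnected.</p><p>Interconnecting 2 to 8 GPUs and presenting them to workloads as one unit requires GPUs in the SXM form factor. SXM GPUs are mainly used in DGX servers (currently available only from NVIDIA) and in partner-designed servers built on HGX boards.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-3.jpg"><br>How are that many GPUs connected? Through NVLink, which provides the high-bandwidth links.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-4.jpg"></p><table><thead><tr><th>PCIe version</th><th>PCIe 1.0</th><th>PCIe 2.0</th><th>PCIe 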
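3.0</th><th>PCIe 4.0</th><th>PCIe 5.0</th></tr></thead><tbody><tr><td>Release year</td><td>2003</td><td>2007</td><td>2010</td><td>2017</td><td>2019</td></tr><tr><td>Encoding</td><td>8b&#x2F;10b</td><td>8b&#x2F;10b</td><td>128b&#x2F;130b</td><td>128b&#x2F;130b</td><td>128b&#x2F;130b</td></tr><tr><td>Transfer rate (GT&#x2F;s)</td><td>2.5</td><td>5</td><td>8</td><td>16</td><td>32</td></tr><tr><td>x16 bandwidth (GB&#x2F;s, both directions combined)</td><td>8</td><td>16</td><td>32</td><td>64</td><td>128</td></tr></tbody></table><p>With fourth-generation NVLink, a single NVIDIA H100 Tensor Core GPU supports up to 18 NVLink connections for a total bandwidth of 900 GB&#x2F;s, 7 times the bandwidth of PCIe 5.0.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-5.jpg"><br>NVLink itself connects a pair of GPU cards. Interconnecting more cards requires NVSwitch, for example for the 8 H800 GPUs inside one DGX server.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-6.jpg"></p><p>As the figure below shows, each H100 GPU is connected to 4 NVLink switch chips, and the NVLink bandwidth between GPUs reaches 900 GB&#x2F;s. Each H100 SXM GPU is also connected to the CPU over PCIe, so data computed on any of the 8 GPUs can be sent to the CPU.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-7.jpg"><br>When NVLink is a good fit:<br>1. A single card does not provide enough compute for the workload, so several cards must be interconnected.</p><h4 id="跨节点互联"><a href="#跨节点互联" class="headerlink" title="跨节点互联"></a><a href="#%E8%B7%A8%E8%8A%82%E7%82%B9%E4%BA%92%E8%81%94" title="跨节点互联"></a>Cross-node interconnect</h4><h5 id="RDMA概述"><a href="#RDMA概述" class="headerlink" title="RDMA概述"></a><a href="#RDMA%E6%A6%82%E8%BF%B0" title="RDMA概述"></a>RDMA overview</h5><p>Training very large models takes multiple machines with multiple cards each. The training job is split across the cards and run as distributed training, which involves both partitioning the model and communication between cards. The mainstream parallel training strategies are data parallelism, model parallelism, tensor parallelism, and pipeline parallelism. All of them place high demands on the cluster network, which must provide low latency and high bandwidth.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-8.jpg"></p><p>GPU training of large AI models needs at least 100 Gbps to 400 Gbps of network bandwidth, which in practice calls for an RDMA (Remote Direct Memory Access) network.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-9.jpg"><br>As the data path above shows, with a traditional TCP&#x2F;IP stack the data is copied several times between buffers on the server, and the operating system has to add or strip the TCP and IP headers. These steps add transfer latency and consume a large amount of CPU, so they cannot meet the requirements of high-performance computing.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-10.jpg"><br>RDMA bypasses the operating system kernel and accesses the memory of another server directly, which removes intermediate layers, improves overall forwarding efficiency, and lowers latency.<br>What RDMA offers compared with a traditional TCP network:<br><img 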
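src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-11.jpg"></p><h4 id="RDMA的核心价值："><a href="#RDMA的核心价值：" class="headerlink" title="RDMA的核心价值："></a><a href="#RDMA%E7%9A%84%E6%A0%B8%E5%BF%83%E4%BB%B7%E5%80%BC%EF%BC%9A" title="RDMA的核心价值："></a>Core benefits of RDMA:</h4><p>Zero copy: an RDMA application transfers data directly, bypassing the kernel network stack, so the data does not have to be copied from user-space memory into the kernel-space buffers of the network stack.<br>Kernel bypass: data moves directly between the NIC and user-space memory, which removes the CPU copy between kernel space and user space.<br>CPU offload: an application can access the memory of a remote host directly, which reduces CPU consumption on the remote host.</p><h4 id="RDMA实现："><a href="#RDMA实现：" class="headerlink" title="RDMA实现："></a><a href="#RDMA%E5%AE%9E%E7%8E%B0%EF%BC%9A" title="RDMA实现："></a>RDMA implementations:</h4><p>InfiniBand: a technology driven mainly by Mellanox, which was later acquired by NVIDIA. It is entirely separate from traditional Ethernet, with its own protocol stack and dedicated NICs, cables, and network devices, so the overall cost is high. The speeds currently promoted for IB are 200 Gbps (HDR) and 400 Gbps (NDR).</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-12.jpg"><br>RoCE: Ethernet-based RDMA, proposed by the IBTA, in two versions, RoCEv1 and RoCEv2. RoCEv1 does not use the network layer, so its packets carry no IP header and cannot be routed across subnets, which leaves it with almost no practical use. RoCEv2 runs over UDP and IP, and relies on PFC (priority-based flow control), ECN (explicit congestion notification), and DCQCN (Data Center Quantized Congestion Notification) to turn a conventional Ethernet into a lossless one with zero packet loss.</p><p>iWARP: implemented on top of TCP. It builds DDP (Direct Data Placement) over TCP to achieve zero copy.</p><p>RoCE and iWARP only require support in the NIC and work with ordinary Ethernet switches, although the DCQCN algorithm of RoCEv2 additionally requires the switches to support RED (Random Early Detection) and ECN (Explicit Congestion Notification).</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-13.jpg"></p><h3 id="GPU池化方案"><a href="#GPU池化方案" class="headerlink" title="GPU池化方案"></a><a href="#GPU%E6%B1%A0%E5%8C%96%E6%96%B9%E6%A1%88" title="GPU池化方案"></a>GPU pooling solutions</h3><h4 id="概念"><a href="#概念" class="headerlink" title="概念"></a><a href="#%E6%A6%82%E5%BF%B5" title="概念"></a>Concept</h4><p>GPU pooling treats GPU resources the way CPU and memory are already pooled. The key points are on-demand allocation, dynamic scaling, and release after use. GPU pooling addresses three problems: 1. uneven utilization of GPU resources; 2. remote access to GPUs; 3. unified support for heterogeneous GPUs.<br>In AI workloads, a call reaches the GPU through the following layers:<br>1. User app: the business layer, which runs the user's training or inference jobs.<br>2. Framework layer: deep learning frameworks such as pytorch and TensorFlow.<br>3. CUDA Runtime and the surrounding libraries, such as cudart, cublas, cudnn, cufft, and cusparse.<br>4. CUDA User Driver: the user-mode CUDA driver libraries, such as cuda and nvml.<br>5. CUDA Kernel Driver: the kernel-mode driver, such as nvidia.ko.<br>6. The GPU hardware.</p><p><img 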
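src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-14.jpg"><br>Current GPU pooling solutions are mostly implemented by intercepting API calls at the CUDA Runtime&#x2F;Driver layer.<br>GPU pooling must also be built on a scheme that provides both fault isolation and compute isolation.</p><h4 id="业内方案"><a href="#业内方案" class="headerlink" title="业内方案"></a><a href="#%E4%B8%9A%E5%86%85%E6%96%B9%E6%A1%88" title="业内方案"></a>Industry solutions</h4><p>Bitfusion<br>Bitfusion, owned by VMware, has a server side and a client side.<br>The server is deployed on physical servers that have GPUs and virtualizes those GPUs for use by multiple workloads.<br>The client is deployed on the nodes that actually need GPU resources. It intercepts the workload's GPU requests, sends them over the network to the Bitfusion server, and returns the results once the computation is done. A similar interception can be built on the open-source cuda_hook code: <a href="https://github.com/Bruce-Lee-LY/cuda_hook">https://github.com/Bruce-Lee-LY/cuda_hook</a><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-15.jpg"></p><p>How it works:<br>The client implements the CUDA Driver interface, intercepts every GPU request and forwards it over the network to the server for processing. When the server finishes, the result is returned to the app.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-16.jpg"></p><p>Orion X from 趋动科技 (VirtAI Tech, China)<br>Orion X is similar to Bitfusion: a client deployed on the workload side intercepts CUDA Driver requests and forwards them to the server for processing. The components are:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/gpu1-17.jpg"></p><ul><li>Orion Controller: manages the resources of the whole GPU pool. It responds to vGPU requests from Orion Client, allocates Orion vGPU resources from the pool for the CUDA application on the client side, and returns them.</li><li>Orion Server: the back-end service that turns GPUs into pooled resources. It is deployed on every CPU and GPU node and takes over all physical GPUs in the machine. When an application on the Orion Client side runs, a connection to Orion Server is established through the scheduling of Orion Controller. Orion Server provides an isolated runtime environment and real GPU compute for all CUDA calls of that application.</li><li>Orion Client: emulates the NVIDIA CUDA runtime library environment and provides a new, API-compatible implementation for CUDA programs. Together with the other Orion components, it presents a number of virtual GPUs (Orion vGPU) to CUDA applications. A CUDA application that uses the CUDA dynamic libraries can, through operating system environment settings, be linked at run time to the dynamic libraries provided by Orion Client. Because Orion Client emulates the NVIDIA CUDA runtime environment, CUDA applications run on Orion vGPU transparently and without modification.</li></ul><p>Biggest problem<br>The solution depends on NVIDIA MPS underneath, which sends kernels from multiple processes to the MPS server or directly to the GPU and so avoids frequent context switches between processes on the GPU. The drawback is a higher failure rate. In particular, a fault that spreads from one process to others is generally unacceptable.</p><h4 id="框架实现"><a href="#框架实现" class="headerlink" title="框架实现"></a><a href="#%E6%A1%86%E6%9E%B6%E5%AE%9E%E7%8E%B0" title="框架实现"></a>Framework-level implementation</h4><p>DDP（Distributed Data 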
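Parallelism）<br>Workloads built on PyTorch can use DDP for multi-node, multi-GPU training and improve GPU utilization.<br>Distributed training in PyTorch relies on two strategies, data parallelism and model parallelism. DDP itself implements data parallelism and can be combined with model parallelism. In data parallelism, the data is split into subsets that are trained on different GPUs. This strategy is simple to implement, but with very large datasets an uneven split of the data may lead to inconsistent training results. Model parallelism places different parts of the model on different GPUs. It avoids the data-splitting problem but is more complex to implement.</p><p>References:<br><a href="https://mp.weixin.qq.com/s/GYiZk3Fgqqse6YfAfvmX7g">https://mp.weixin.qq.com/s/GYiZk3Fgqqse6YfAfvmX7g</a><br><a href="https://www.nvidia.cn/data-center/nvlink/#:~:text=NVLink%20%E6%98%AF%E4%B8%80%E7%A7%8DGPU,%E5%A4%9A%E5%AF%B9%E5%A4%9AGPU%20%E9%80%9A%E4%BF%A1%E3%80%82">https://www.nvidia.cn/data-center/nvlink/#:~:text&#x3D;NVLink%20%E6%98%AF%E4%B8%80%E7%A7%8DGPU,%E5%A4%9A%E5%AF%B9%E5%A4%9AGPU%20%E9%80%9A%E4%BF%A1%E3%80%82</a><br><a href="https://www.sdnlab.com/25923.html">https://www.sdnlab.com/25923.html</a><br><a href="https://aijishu.com/a/1060000000133430">https://aijishu.com/a/1060000000133430</a></p>]]></content>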
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;&lt;a href=&quot;#%E6%A6%82%E8%BF%B0&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;概述&lt;/h3&gt;&lt;p&gt;随着AI大模型的深入发展，越来越多用户需要将大量G</summary>
      
    
    
    
    <category term="GPU" scheme="http://yoursite.com/categories/GPU/"/>
    
    
    <category term="GPU" scheme="http://yoursite.com/tags/GPU/"/>
    
  </entry>
  
  <entry>
    <title>Stable Diffusion Learning Series 1 (Installation and Deployment on Windows)</title>
    <link href="http://yoursite.com/2023/10/03/stablediffusion/"/>
    <id>http://yoursite.com/2023/10/03/stablediffusion/</id>
    <published>2023-10-03T13:45:59.000Z</published>
    <updated>2023-10-03T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>Stable Diffusion is currently the strongest open-source option for AI image generation. It was developed jointly by Stability AI, CompVis and Runway and is released as open source (see the repository below for the license terms). This article covers deploying and using it on a local PC.<br><a href="https://github.com/Stability-AI/stablediffusion">https://github.com/Stability-AI/stablediffusion</a></p><p>This article uses stable-diffusion-webui, a project that wraps Stable Diffusion. It is simple and intuitive, which makes it quick to get started.</p><h3 id="安装环境"><a href="#安装环境" class="headerlink" title="安装环境"></a>Installation environment</h3><table><thead><tr><th>Hardware &#x2F; software</th><th>Version &#x2F; model</th></tr></thead><tbody><tr><td>GPU</td><td>RTX 3060 12GB</td></tr><tr><td>OS</td><td>Windows 11</td></tr><tr><td>Python</td><td>3.10.6</td></tr><tr><td>conda</td><td>23.5.2</td></tr><tr><td>GPU driver</td><td>537.42 (corresponds to CUDA 12.2)</td></tr><tr><td>CUDA version</td><td>12.2</td></tr><tr><td>git</td><td>2.42.0</td></tr><tr><td>stable-diffusion-webui</td><td>1.6</td></tr></tbody></table><h3 id="安装部署"><a href="#安装部署" class="headerlink" title="安装部署"></a>Installation and deployment</h3><h4 id="基础环境部署"><a href="#基础环境部署" class="headerlink" title="基础环境部署"></a>Base environment setup</h4><h5 id="git安装"><a href="#git安装" class="headerlink" title="git安装"></a>Installing git</h5><p><a href="https://git-scm.com/download/win">https://git-scm.com/download/win</a></p><p>Download the latest git installer and click through it with the default options.</p><h5 id="python安装"><a href="#python安装" class="headerlink" title="python安装"></a>Installing Python</h5><p>Python is installed and managed through conda. Pay attention to the Python version: do not use a version newer than 3.10.x. The installer I downloaded is<br>Miniconda3-py310_23.5.2-0-Windows-x86_64<br><a href="https://repo.anaconda.com/miniconda/">https://repo.anaconda.com/miniconda/</a></p><p>Reference: <a href="https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Dependencies">https://github.com/AUTOMATIC1111/stable-diffusion-webui/wiki/Dependencies</a></p><p>Click through the installer. Once it finishes, the python command should run normally in CMD:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">python --version</span><br><span class="line">Python 
3.10.12</span><br></pre></td></tr></table></figure><p>Configure the conda channels</p><p>From the Start menu, open miniconda3 as administrator<br>and run:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">conda config --set show_channel_urls yes</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>This generates the configuration file.</p><p>Edit the configuration file to add the Tsinghua University mirror:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">C:\Users\wansh\.condarc   //replace wansh with your own user name</span><br></pre></td></tr></table></figure><p>Paste in the following content:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">channels:</span><br><span class="line"> - defaults</span><br><span class="line">show_channel_urls: true</span><br><span class="line">default_channels:</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/r</span><br><span class="line"> - https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/msys2</span><br><span class="line">custom_channels:</span><br><span class="line"> conda-forge: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> msys2: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> bioconda: 
https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> menpo: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> pytorch: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> pytorch-lts: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br><span class="line"> simpleitk: https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud</span><br></pre></td></tr></table></figure><p>In the conda3 cmd prompt, run the following command to set the package index that pip downloads from. Here it points to the Aliyun mirror:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip config set global.index-url https://mirrors.aliyun.com/pypi/simple</span><br></pre></td></tr></table></figure><p>Check the setting afterwards:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">pip config list</span><br><span class="line">global.index-url=&#x27;https://mirrors.aliyun.com/pypi/simple&#x27;</span><br></pre></td></tr></table></figure><h4 id="CUDA配置"><a href="#CUDA配置" class="headerlink" title="CUDA配置"></a>CUDA configuration</h4><p>Check which CUDA version the installed GPU driver corresponds to.<br>In the conda3 cmd prompt, run:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br></pre></td><td class="code"><pre><span class="line">nvidia-smi.exe</span><br><span class="line">Fri 
Oct  6 15:14:48 2023</span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| NVIDIA-SMI 537.42                 Driver Version: 537.42       CUDA Version: 12.2     |</span><br><span class="line">|-----------------------------------------+----------------------+----------------------+</span><br><span class="line">| GPU  Name                     TCC/WDDM  | Bus-Id        Disp.A | Volatile Uncorr. ECC |</span><br><span class="line">| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |</span><br><span class="line">|                                         |                      |               MIG M. |</span><br><span class="line">|=========================================+======================+======================|</span><br><span class="line">|   0  NVIDIA GeForce RTX 3060      WDDM  | 00000000:2B:00.0  On |                  N/A |</span><br><span class="line">|  0%   53C    P8              16W / 170W |   5441MiB / 12288MiB |      0%      Default |</span><br><span class="line">|                                         |                      |                  N/A |</span><br><span class="line">+-----------------------------------------+----------------------+----------------------+</span><br><span class="line"></span><br><span class="line">+---------------------------------------------------------------------------------------+</span><br><span class="line">| Processes:                                                                            |</span><br><span class="line">|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |</span><br><span class="line">|        ID   ID                                                             Usage      |</span><br></pre></td></tr></table></figure><p>对应的是12.2，去Nvidia官网下载对应的CUDA版本安装<br><a 
href="https://developer.nvidia.com/cuda-toolkit-archive">https://developer.nvidia.com/cuda-toolkit-archive</a></p><h5 id="终端命令配置"><a href="#终端命令配置" class="headerlink" title="终端命令配置"></a>终端命令配置</h5><p>配置代理需要拉取stable-diffusion-webui，需要conda3-cmd能够访问github</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">set https_proxy=http://127.0.0.1:33210</span><br><span class="line">set http_proxy=http://127.0.0.1:33210</span><br></pre></td></tr></table></figure><p>验证</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">curl -I www.google.com</span><br></pre></td></tr></table></figure><p>状态码返回200表示ok</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">HTTP/1.1 200 OK</span><br><span class="line">Transfer-Encoding: chunked</span><br><span class="line">Cache-Control: private</span><br><span class="line">Connection: keep-alive</span><br><span class="line">Content-Security-Policy-Report-Only: object-src &#x27;none&#x27;;base-uri &#x27;self&#x27;;script-src &#x27;nonce-EsiPFL30vCI9foliBMkTLA&#x27; &#x27;strict-dynamic&#x27; &#x27;report-sample&#x27; &#x27;unsafe-eval&#x27; &#x27;unsafe-inline&#x27; https: http:;report-uri 
https://csp.withgoogle.com/csp/gws/other-hp</span><br><span class="line">Content-Type: text/html; charset=ISO-8859-1</span><br><span class="line">Date: Fri, 06 Oct 2023 07:32:54 GMT</span><br><span class="line">Expires: Fri, 06 Oct 2023 07:32:54 GMT</span><br><span class="line">Keep-Alive: timeout=4</span><br><span class="line">P3p: CP=&quot;This is not a P3P policy! See g.co/p3phelp for more info.&quot;</span><br><span class="line">Proxy-Connection: keep-alive</span><br><span class="line">Server: gws</span><br><span class="line">Set-Cookie: 1P_JAR=2023-10-06-07; expires=Sun, 05-Nov-2023 07:32:54 GMT; path=/; domain=.google.com; Secure</span><br><span class="line">Set-Cookie: AEC=Ackid1QHNMFx6j8Bfaco7KM-Wc2Il-3JpKjmJcRYM3QqzErZfcup19XB43Y; expires=Wed, 03-Apr-2024 07:32:54 GMT; path=/; domain=.google.com; Secure; HttpOnly; SameSite=lax</span><br><span class="line">Set-Cookie: NID=511=ksuU76xakl0AZHIz-SjvI3pBnThANk3EBkMB7E4ZD1JNMxpQI8pg8rttvpYMdMqJSgfTwVt0Dqv-5V5p4uwnCRgb-KA_iOqHQ9lNPcsi0PjgXVbWAYhVIG2oCxmw_Jfw5XhA6QbDbpQcMq3zS9zkjx9gUwgHS-Howlm5ip9uU84; expires=Sat, 06-Apr-2024 07:32:54 GMT; path=/; domain=.google.com; HttpOnly</span><br><span class="line">X-Frame-Options: SAMEORIGIN</span><br><span class="line">X-Xss-Protection: 0</span><br></pre></td></tr></table></figure><h4 id="stable-diffusion-webui部署"><a href="#stable-diffusion-webui部署" class="headerlink" title="stable-diffusion-webui部署"></a>stable-diffusion-webui部署</h4><p>拉取stable-diffusion-webui代码<br>需要电脑<br>在conda3-CMD中执行E: 切换到E盘，按自己环境能提供的磁盘执行，因为装C盘会占用很多空间</p><p>clone代码</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git -b v1.6.0</span><br></pre></td></tr></table></figure><p>下载stable diffusion的训练模型</p><p>sd-v1-4.ckpt</p><p><a 
href="https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/tree/main">https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/tree/main</a></p><p>The model acts as the library of learned visual material that AI image generation draws on.</p><p>After downloading, place it in the E:\stable-diffusion-webui\models\Stable-diffusion directory. Replace E: with the drive you deployed to.</p><p>In the conda3 cmd prompt, run:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">cd stable-diffusion-webui</span><br><span class="line">webui-user.bat</span><br></pre></td></tr></table></figure><p>The script automatically downloads the required dependencies.</p><p>When it succeeds, it opens a browser at <a href="http://127.0.0.1:7860/">http://127.0.0.1:7860/</a></p><p>Enter a prompt to generate an image.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/sd1-1.png"></p><p>You can also build prompts with this prompt generator:<br><a href="https://tinygeeker.github.io/menu/autocue/#/?from=tencent">https://tinygeeker.github.io/menu/autocue/#/?from=tencent</a></p><h3 id="常见问题"><a href="#常见问题" class="headerlink" title="常见问题"></a>Common problems</h3><p>1. RuntimeError: Torch is not able to use GPU</p><p>The main cause is that PyTorch cannot reach the GPU because the CUDA and torch versions are incompatible.<br>Some guides online skip the check with a launch argument, but then images are generated on the CPU, which is far too slow. It is better to fix the root cause.<br>The steps are as follows. First uninstall torch:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">pip uninstall torch</span><br></pre></td></tr></table></figure><p>Then install a suitable torch build from <a href="https://pytorch.org/get-started/locally/">https://pytorch.org/get-started/locally/</a>. My environment has NVIDIA CUDA 12.2, but torch did not yet have a matching 12.2 build. The 11.8 and 12.1 builds also run fine. stable-diffusion-webui 1.6 currently supports up to 11.8, which works thanks to backward compatibility. With a working pip index configured, the package is downloaded automatically.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/sd1-2.png"></p><p>Verification<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/sd1-3.png"></p><p>References:</p><p><a 
href="https://zhuanlan.zhihu.com/p/610628741">https://zhuanlan.zhihu.com/p/610628741</a></p><p><a href="https://www.uisdc.com/47-stable-diffusion-models">https://www.uisdc.com/47-stable-diffusion-models</a></p><p><a href="https://zhuanlan.zhihu.com/p/622410028">https://zhuanlan.zhihu.com/p/622410028</a><br><a href="https://aitechtogether.com/python/82781.html">https://aitechtogether.com/python/82781.html</a>       </p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;概述&lt;/h3&gt;&lt;p&gt;stable diffusion做为目前AI绘图内开源的最强王者，本文主要在本地PC上部署使用是由Stability AI、CompVis與</summary>
      
    
    
    
    <category term="AI" scheme="http://yoursite.com/categories/AI/"/>
    
    
    <category term="AI" scheme="http://yoursite.com/tags/AI/"/>
    
  </entry>
  
  <entry>
    <title>Building Bare Metal Images with a Dockerfile</title>
    <link href="http://yoursite.com/2022/12/26/elemental/"/>
    <id>http://yoursite.com/2022/12/26/elemental/</id>
    <published>2022-12-26T13:45:59.000Z</published>
    <updated>2022-12-26T13:45:59.000Z</updated>
    
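    <content type="html"><![CDATA[<h3 id="Mutable和Immutable介绍"><a href="#Mutable和Immutable介绍" class="headerlink" title="Mutable和Immutable介绍"></a>Mutable and immutable infrastructure</h3><p>The representative cloud native technologies are containers, service mesh, microservices, immutable infrastructure and declarative APIs. The biggest contribution of container technology is packaging an application into a container image through a Dockerfile, which delivers immutable infrastructure and standardizes the application template.<br>Before containers, infrastructure was mutable: applications were deployed directly on the OS, changes took effect after a restart, and anything could be modified at any time.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-1.png"><br>Container technology is the typical example of immutable infrastructure. It introduces the container image: a Dockerfile packages the application into a standardized image, and containers are started from that image. No permanent change can be made inside a running container. Any change has to go through an update to the Dockerfile.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-2.png"></p><p>The immutable idea is now gradually moving down from containers to the bare metal OS: a bare metal image is built from a Dockerfile, which makes the bare metal OS immutable.</p><p>A typical open-source project in this area is Elemental.</p><h3 id="Elemental概述"><a href="#Elemental概述" class="headerlink" title="Elemental概述"></a>Elemental overview</h3><p>Elemental is a collection of tools whose goal is centralized, fully cloud native OS building and management through Kubernetes.</p><p>The operating system of the cluster nodes is built and maintained from container images with the Elemental CLI, and is installed on new hosts with the Elemental CLI.<br>Elemental Operator and Rancher System Agent give Rancher Manager full control over Elemental clusters, from installing and managing the OS on the nodes to provisioning new K3s or RKE2 clusters in a centralized way.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-3.png"></p><p>Components of the Elemental project</p><ul><li><p>elemental-toolkit - a set of OS utilities that enable OS management through containers. It includes dracut modules, bootloader configuration, cloud-init style configuration services, and more.</p></li><li><p>elemental-operator - connects to Rancher Manager and handles the machineRegistration and machineInventory CRDs.</p></li><li><p>elemental-register - registers machines through machineRegistrations and installs them through elemental-cli.</p></li><li><p>elemental-cli - installs any derivative based on elemental-toolkit. It builds an OCI container image into an ISO image that can run on virtual machines, physical machines and embedded devices.</p></li><li><p>rancher-system-agent - runs on the installed system and fetches commands from Rancher Manager to install and run rancher-agent on the system, registering it with Rancher.</p></li></ul><p>Project page: <a href="https://github.com/rancher/elemental-toolkit">https://github.com/rancher/elemental-toolkit</a></p><h3 id="配置使用"><a href="#配置使用" class="headerlink" 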
title="配置使用"></a>Configuration and usage</h3><p>Everything below is done on a host with Docker installed.</p><p>Prerequisites:</p><ul><li>A host with Docker installed</li><li>A Harbor image registry</li><li>ESXi, or a physical PC or server, for testing the built ISO</li></ul><p>Workflow for building an ISO with elemental-toolkit</p><ul><li><p>Base image flavors:<br>teal: SLE Micro for Rancher based one, shipping packages from SLE Micro 5.3.<br>green: openSUSE based one, shipping packages from openSUSE Leap 15.4 repositories.<br>blue: Fedora based one, shipping packages from Fedora 33 repositories<br>orange: Ubuntu based one, shipping packages from Ubuntu 20.10 repositories</p></li><li><p>Customize the image and produce an OCI image</p></li><li><p>Run the Elemental build on the machine with Docker<br>UEFI boot, choose a suitable instance type<br>Cloud-init userdata initialization<br>Default user&#x2F;pass: root&#x2F;cos</p></li><li><p>Upgrade the custom image<br>elemental upgrade --no-verify --reboot -d niusmallnan&#x2F;containeros:dev</p></li></ul><p>On the host with Docker installed, create the &#x2F;root&#x2F;derivative directory.</p><p>Overall directory layout</p><pre><code>/root/derivative/
├── Dockerfile
├── cloud-init.yaml
├── install.sh
├── installer.sh
├── k3s
├── k3s-airgap-images-amd64.tar.gz
├── manifest.yaml
├── nginx.yaml
├── overlay
│   └── iso
│       └── boot
│           └── grub2
│               └── grub.cfg
└── repositories.yaml</code></pre><p>Demo architecture</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-4.png"></p><ul><li>The OS built with Elemental includes K3s.</li><li>The YAML of the application to deploy is placed in the &#x2F;var&#x2F;lib&#x2F;rancher&#x2F;k3s&#x2F;server&#x2F;manifests directory. Once K3s has started, it deploys that YAML automatically and starts the application.</li></ul><p>Download the K3s air-gap image bundle and the CLI binary:<br><a href="https://github.com/k3s-io/k3s/releases">https://github.com/k3s-io/k3s/releases</a></p><p>nginx.yaml is loaded by K3s after it starts and stands in for a demo application:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: apps/v1</span><br><span class="line">kind: Deployment</span><br><span class="line">metadata:</span><br><span class="line">  name: nginx-deployment</span><br><span class="line">spec:</span><br><span class="line">  selector:</span><br><span class="line">    matchLabels:</span><br><span class="line">      app: nginx</span><br><span class="line">  replicas: 1</span><br><span class="line">  template:</span><br><span class="line">    metadata:</span><br><span class="line">      labels:</span><br><span class="line">        app: nginx</span><br><span class="line">    spec:</span><br><span class="line">      containers:</span><br><span class="line">      - name: nginx</span><br><span class="line">        image: wanshaoyuan/nginx:v1.0</span><br><span class="line">        ports:</span><br><span class="line">        - containerPort: 80</span><br><span class="line"></span><br><span class="line">---</span><br><span class="line">apiVersion: v1</span><br><span class="line">kind: Service</span><br><span class="line">metadata:</span><br><span class="line">  name: my-service</span><br><span class="line">spec:</span><br><span class="line">  type: NodePort</span><br><span class="line">  selector:</span><br><span class="line">    app: 
nginx</span><br><span class="line">  ports:</span><br><span class="line">    - port: 80</span><br><span class="line">      targetPort: 80</span><br><span class="line">      nodePort: 30007</span><br></pre></td></tr></table></figure><p>Create the Dockerfile</p><pre><code>ARG LUET_VERSION=0.32.0
FROM quay.io/luet/base:$LUET_VERSION AS luet

FROM registry.suse.com/suse/sle-micro-rancher/5.2
ARG ARCH=amd64
ENV ARCH=$&#123;ARCH&#125;

# Copy the luet config file pointing to the upgrade repository
COPY repositories.yaml /etc/luet/luet.yaml

# Copy luet from the official images
COPY --from=luet /usr/bin/luet /usr/bin/luet

ENV LUET_NOLOCK=true
RUN luet install -y \
       toolchain/yip \
       toolchain/luet \
       utils/installer \
       system/cos-setup \
       system/immutable-rootfs \
       system/grub2-config \
       system/base-dracut-modules

RUN  mkdir /var/lib/rancher/k3s/agent/images/ -p &amp;&amp;  mkdir /var/lib/rancher/k3s/server/manifests -p
COPY install.sh /system/oem/
COPY k3s /usr/local/bin
COPY nginx.yaml /system/oem/
COPY k3s-airgap-images-amd64.tar.gz /system/oem/
RUN  chmod a+x /usr/local/bin/k3s &amp;&amp; chmod a+x /system/oem/install.sh
WORKDIR /system/oem
RUN  INSTALL_K3S_SKIP_START=&quot;true&quot; INSTALL_K3S_SKIP_ENABLE=&quot;true&quot; INSTALL_K3S_SKIP_DOWNLOAD=&quot;true&quot; sh install.sh

## System layout
# Required by k3s etc.
RUN mkdir /usr/libexec &amp;&amp; mkdir /usr/local/bin -p &amp;&amp; touch /usr/libexec/.keep

# Copy custom files
# COPY files/ /

# Copy cloud-init default configuration
COPY cloud-init.yaml /system/oem/

# Generate initrd
RUN mkinitrd

# OS level configuration
RUN echo &quot;VERSION=999&quot; &gt; /etc/os-release
RUN echo &quot;GRUB_ENTRY_NAME=derivative&quot; &gt;&gt; /etc/os-release
RUN echo &quot;welcome to our derivative&quot; &gt;&gt; /etc/issue.d/01-derivative</code></pre><p>Create the cloud-init file, which mainly configures disk partitioning and the login user name and password<br>cloud-init.yaml</p><pre><code>name: &quot;Default settings&quot;
stages:
   initramfs:
     # Setup default hostname
     - name: &quot;Branding&quot;
       hostname: &quot;derivative&quot;
     # Setup an admin group with sudo access
     - name: &quot;Setup groups&quot;
       ensure_entities:
       - entity: |
            kind: &quot;group&quot;
            group_name: &quot;admin&quot;
            password: &quot;x&quot;
            gid: 900
     # Setup network - openSUSE specific
     - name: &quot;Network setup&quot;
       files:
       - path: /etc/sysconfig/network/ifcfg-eth0
         content: |
                  BOOTPROTO=&#39;dhcp&#39;
                  STARTMODE=&#39;onboot&#39;
         permissions: 0600
         owner: 0
         group: 0
     # Setup a custom user
     - name: &quot;Setup users&quot;
       users:
       # Replace the default user name here and settings
        joe:
          # Comment passwd for no password
          passwd: &quot;joe&quot;
          shell: /bin/bash
          homedir: &quot;/home/joe&quot;
          groups:
          - &quot;admin&quot;
       #authorized_keys:
       # Replace here with your ssh keys
       # joe:
       # - ssh-rsa ....
     # Setup sudo
     - name: &quot;Setup sudo&quot;
       files:
       - path: &quot;/etc/sudoers&quot;
         owner: 0
         group: 0
         permissions: 0600
         content: |
            Defaults always_set_home
            Defaults secure_path=&quot;/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/bin:/usr/local/sbin&quot;
            Defaults env_reset
            Defaults env_keep = &quot;LANG LC_ADDRESS LC_CTYPE LC_COLLATE LC_IDENTIFICATION LC_MEASUREMENT LC_MESSAGES LC_MONETARY LC_NAME LC_NUMERIC LC_PAPER LC_TELEPHONE LC_ATIME LC_ALL LANGUAGE LINGUAS XDG_SESSION_COOKIE&quot;
            Defaults !insults
            root ALL=(ALL) ALL
            %admin ALL=(ALL) NOPASSWD: ALL
            @includedir /etc/sudoers.d
       commands:
       - passwd -l root
   # Setup persistency so k3s works properly
   # See also: https://rancher.github.io/elemental-toolkit/docs/reference/immutable_rootfs/#configuration-with-an-environment-file
   rootfs.after:
    - name: &quot;Immutable Layout configuration&quot;
      environment_file: /run/cos/cos-layout.env
      environment:
        VOLUMES: &quot;LABEL=COS_OEM:/oem LABEL=COS_PERSISTENT:/var&quot;
        OVERLAY: &quot;tmpfs:25%&quot;
        RW_PATHS: &quot;/usr/local /etc /srv&quot;
        PERSISTENT_STATE_PATHS: &gt;-
          /etc/systemd
          /etc/rancher
          /etc/ssh
          /etc/iscsi
          /etc/cni
          /home
          /opt
          /root
          /usr/libexec
          /var/log
          /var/lib/wicked
          /var/lib/longhorn
          /var/lib/cni
          /usr/local/bin
        PERSISTENT_STATE_TARGET: &gt;-
          /etc/systemd
          /etc/rancher
          /etc/ssh
          /etc/iscsi
          /etc/cni
          /home
          /opt
          /root
          /usr/libexec
          /var/log
          /var/lib/kubelet
          /var/lib/wicked
          /var/lib/longhorn
          /var/lib/cni
          /usr/local/bin
        PERSISTENT_STATE_BIND: &quot;true&quot;
   # Finally, let&#39;s start k3s when network is available, and download the SSH key from github for the joe user
   network:
     - name: &quot;Deploy cos-system&quot;
       commands:
         - elemental install /dev/sda
         - systemctl enable k3s &amp;&amp; systemctl start  k3s
   after-install:
     - name: &quot;install k3s&quot;
       commands:
         - mount /dev/sda5 /var
         - mkdir -p  /var/lib/rancher/k3s/agent/images/  &amp;&amp; mkdir /var/lib/rancher/k3s/server/manifests -p
         - cp /system/oem/k3s-airgap-images-amd64.tar.gz  /var/lib/rancher/k3s/agent/images/
         - cp /system/oem/nginx.yaml /var/lib/rancher/k3s/server/manifests
         - reboot</code></pre><p>Create the manifest.yaml file, which defines the files needed to boot the OS</p><pre><code>iso:
  rootfs:
    - channel:system/cos
  uefi:
    - channel:live/grub2-efi-image
  image:
    - channel:live/grub2
    - channel:live/grub2-efi-image
  label: &quot;COS_LIVE&quot;

name: &quot;cOS-0&quot;

# Raw disk creation values start
raw_disk:
  x86_64:
    # which packages to install and the target to install them at
    packages:
      - name: channel:system/grub2-efi-image
        target: efi
      - name: channel:system/grub2-config
        target: root
      - name: channel:system/grub2-artifacts
        target: root/grub2
      - name: channel:recovery/cos-img
        target: root/cOS

repositories:
  - uri: quay.io/costoolkit/releases-teal
    arch: &quot;x86_64&quot;</code></pre><p>Create the repositories.yaml file</p><pre><code>logging:
  color: false
  enable_emoji: false
general:
   debug: false
   spinner_charset: 9
repositories:
- name: &quot;cos&quot;
  description: &quot;cOS official&quot;
  type: &quot;docker&quot;
  enable: true
  cached: true
  priority: 1
  verify: false
  urls:
  - &quot;quay.io/costoolkit/releases-green&quot;</code></pre><p>Create the grub file that configures kernel boot<br>Under the &#x2F;root&#x2F;derivative&#x2F;overlay&#x2F;iso&#x2F;boot&#x2F; directory, create the file grub2&#x2F;grub.cfg</p><pre><code>search --no-floppy --file --set=root /boot/kernel
set default=0
set timeout=10
set timeout_style=menu
set linux=linux
set initrd=initrd
if [ &quot;$&#123;grub_cpu&#125;&quot; = &quot;x86_64&quot; -o &quot;$&#123;grub_cpu&#125;&quot; = &quot;i386&quot; -o &quot;$&#123;grub_cpu&#125;&quot; = &quot;arm64&quot; ];then
    if [ &quot;$&#123;grub_platform&#125;&quot; = &quot;efi&quot; ]; then
        if [ &quot;$&#123;grub_cpu&#125;&quot; != &quot;arm64&quot; ]; then
            set linux=linuxefi
            set initrd=initrdefi
        fi
    fi
fi
if [ &quot;$&#123;grub_platform&#125;&quot; = &quot;efi&quot; ]; then
    echo &quot;Please press &#39;t&#39; to show the boot menu on this console&quot;
fi
set font=($root)/boot/$&#123;grub_cpu&#125;/loader/grub2/fonts/unicode.pf2
if [ -f $&#123;font&#125; ];then
    loadfont $&#123;font&#125;
fi
menuentry &quot;cOS&quot; --class os --unrestricted &#123;
    echo Loading kernel...
    $linux ($root)/boot/kernel.xz cdroot root=live:CDLABEL=COS_LIVE rd.live.dir=/ rd.live.squashimg=rootfs.squashfs console=tty1 console=ttyS0 rd.cos.disable
    echo Loading initrd...
    $initrd ($root)/boot/rootfs.xz
&#125;

if [ &quot;$&#123;grub_platform&#125;&quot; = &quot;efi&quot; ]; then
    hiddenentry &quot;Text mode&quot; --hotkey &quot;t&quot; &#123;
        set textmode=true
        terminal_output console
    &#125;
fi</code></pre><p>Build the image first</p><pre><code>docker build -t 172.16.1.208/library/example:v4.0 .</code></pre><p>The image has to be pushed to the image registry before the ISO can be built</p><pre><code>docker push 172.16.1.208/library/example:v4.0</code></pre><p>Build the ISO</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --rm -ti -v $(pwd):/build quay.io/costoolkit/elemental-cli:v0.0.15-ae4f000 --config-dir /build --overlay-iso /build/overlay/iso --debug build-iso -o /build 172.16.1.208/library/example:v4.0</span><br></pre></td></tr></table></figure><p>Note: only public image registries are currently supported, private registries are not.<br><a href="https://github.com/rancher/elemental-cli/issues/389">https://github.com/rancher/elemental-cli/issues/389</a></p><p>When the build completes, it produces the cOS-0.iso image file</p><figure 
class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">o250800372/iso / -chmod 0755 -- -boot_image grub bin_path=/boot/x86_64/loader/eltorito.img -boot_image grub grub2_mbr=/tmp/elemental-iso250800372/iso//boot/x86_64/loader/boot_hybrid.img -boot_image grub grub2_boot_info=on -boot_image any partition_offset=16 -boot_image any cat_path=/boot/x86_64/boot.catalog -boot_image any cat_hidden=on -boot_image any boot_info_table=on -boot_image any platform_id=0x00 -boot_image any emul_type=no_emulation -boot_image any load_size=2048 -append_partition 2 0xef /tmp/elemental-iso250800372/iso/boot/uefi.img -boot_image any next -boot_image any efi_path=--interval:appended_partition_2:all:: -boot_image any platform_id=0xef -boot_image any emul_type=no_emulation&#x27; </span><br><span class="line">DEBU[2023-03-12T11:54:38Z] Xorriso: xorriso 1.4.6 : RockRidge filesystem manipulator, libburnia project.</span><br><span class="line"></span><br><span class="line">Drive current: -outdev &#x27;/build/cOS-0.iso&#x27;</span><br><span class="line">Media current: stdio file, overwriteable</span><br><span class="line">Media status : is blank</span><br><span class="line">Media summary: 0 sessions, 0 data blocks, 0 data, 5851m free</span><br><span class="line">xorriso : UPDATE : 623 files added in 1 seconds</span><br><span class="line">Added to ISO image: directory &#x27;/&#x27;=&#x27;/tmp/elemental-iso250800372/iso&#x27;</span><br><span class="line">xorriso : NOTE : 
Copying to System Area: 512 bytes from file &#x27;/tmp/elemental-iso250800372/iso/boot/x86_64/loader/boot_hybrid.img&#x27;</span><br><span class="line">xorriso : UPDATE : Writing:      24576s    6.5%   fifo 100%  buf  50%</span><br><span class="line">xorriso : UPDATE : Writing:     221184s   58.1%   fifo 100%  buf  50%  415.0xD </span><br><span class="line">ISO image produced: 380645 sectors</span><br><span class="line">Written to medium : 380656 sectors at LBA 48</span><br><span class="line">Writing to &#x27;/build/cOS-0.iso&#x27; completed successfully.</span><br></pre></td></tr></table></figure><p>Upload cOS-0.iso to ESXi or another virtualization platform, or write it to a USB drive and install directly on a physical machine.</p><p>The VM used here has 4 vCPUs, 4 GB of memory, and a 60 GB disk.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-5.png"><br>Once the ISO is loaded, the installer partitions the disk, initializes and installs the system automatically, and then reboots into the installed system.</p><p>SSH username and password: joe&#x2F;joe</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-6.png"><br>On the installed system, check the K3s cluster that has already been deployed.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-7.png"></p><p>View the automatically deployed applications.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/elemental-8.png"><br>Access the application.</p><p>Because the whole system is locked against modification, commands that try to change files fail in any directory of the operating system. For example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">rm -rf *</span><br><span class="line"></span><br><span class="line">evice or resource busy</span><br><span class="line">rm: cannot remove &#x27;var/lib/kubelet/pods/cbf59b3a-d29a-4129-a3c9-8b79b1235104/volumes/kubernetes.io~projected/kube-api-access-zf8c5&#x27;: Device or resource busy</span><br><span class="line">rm: cannot remove 
&#x27;var/lib/kubelet/pods/ea697a4c-8cb8-425f-8e50-6396f5669167/volumes/kubernetes.io~projected/kube-api-access-bq66h&#x27;: Device or resource busy</span><br><span class="line">rm: cannot remove &#x27;var/lib/kubelet/pods/f18cd482-4c6f-4dd0-80fa-5fc314d3cc5b/volumes/kubernetes.io~projected/kube-api-access-8fdq7&#x27;: Device or resource busy</span><br><span class="line">rm: cannot remove &#x27;var/lib/longhorn&#x27;: Device or resource busy</span><br><span class="line">rm: cannot remove &#x27;var/lib/wicked&#x27;: Device or resource busy</span><br><span class="line">rm: cannot remove &#x27;var/log&#x27;: Device or resource busy</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">touch  1</span><br><span class="line">touch: cannot touch &#x27;1&#x27;: Read-only file system</span><br></pre></td></tr></table></figure><h3 id="总结"><a href="#总结" class="headerlink" title="总结"></a>Summary</h3><p>Elemental turns the operating system into immutable infrastructure and brings the traditional OS into the cloud-native workflow: the OS is built from a Dockerfile and is released and maintained uniformly through CI/CD. Two major use cases come to mind so far. The first is the edge, where operating systems have to be deployed and installed on edge devices in bulk. The second is B2B delivery: a vendor packages its application, the container orchestrator, and the OS together with Elemental, then simply loads the ISO at the customer site to complete the deployment, ready to use out of the box. The OS can also be standardized and managed uniformly.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;Mutable和Immutable介绍&quot;&gt;&lt;a href=&quot;#Mutable和Immutable介绍&quot; class=&quot;headerlink&quot; title=&quot;Mutable和Immutable介绍&quot;&gt;&lt;/a&gt;Mutable和Immutable介绍&lt;/h3&gt;&lt;p&gt;云原</summary>
      
    
    
    
    <category term="kubernetes" scheme="http://yoursite.com/categories/kubernetes/"/>
    
    
    <category term="kubernetes" scheme="http://yoursite.com/tags/kubernetes/"/>
    
  </entry>
  
  <entry>
    <title>Zero Trust and SPIFFE (Part 1)</title>
    <link href="http://yoursite.com/2022/11/17/spiffe/"/>
    <id>http://yoursite.com/2022/11/17/spiffe/</id>
    <published>2022-11-17T13:45:59.000Z</published>
    <updated>2022-11-17T13:45:59.000Z</updated>
    
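    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a><a href="#%E6%A6%82%E8%BF%B0" title="概述"></a>Overview</h3><p>The traditional network security model divides the network into zones: traffic inside a zone is trusted, and zones are separated from each other by firewalls. This approach no longer fits the cloud-native era.<br>1. Traffic inside a single zone cannot be controlled. This matters all the more now that containers are deployed at scale, because container IPs are neither visible at layer 2 nor fixed, so security policy cannot be tied to them.</p><p>2. Traditional perimeter firewalls use statically configured rules, which cannot keep up with a dynamically changing environment such as cloud native.<br>The zero-trust security framework was proposed against this background.<br>Zero trust means that nothing is trusted by default until it has been verified. Identity authentication and access-policy control together enforce least-privilege access. At its core, zero-trust security is identity-centric dynamic access control. The SPIFFE project (Secure Production Identity Framework For Everyone) is a general-purpose secure identity framework that issues and authenticates a secure identity for every workload in a production environment, for example in the form of X.509 certificates.</p><p><a href="https://spiffe.io/">https://spiffe.io/</a><br>SPIFFE is itself an open-source project hosted by the CNCF; it officially graduated in September 2022.</p><h3 id="架构和概念解析"><a href="#架构和概念解析" class="headerlink" title="架构和概念解析"></a><a href="#%E6%9E%B6%E6%9E%84%E5%92%8C%E6%A6%82%E5%BF%B5%E8%A7%A3%E6%9E%90" title="架构和概念解析"></a>Architecture and Concepts</h3><p>SPIFFE (Secure Production Identity Framework For Everyone): a general-purpose secure identity framework.<br>SPIRE (SPIFFE Runtime Environment): a production-ready implementation of the SPIFFE standard. It performs node attestation and workload attestation so that it can securely issue identity credentials to services and verify the identities of other services against a predefined set of conditions.<br><a href="https://github.com/spiffe/spire">https://github.com/spiffe/spire</a></p><p><img src="https://spiffe.io/img/server_and_agent.png"></p><p>SPIRE consists of a SPIRE Server and one or more SPIRE Agents.</p><p>The Server acts as the signing authority (CA) for the certificates that are issued to workloads through the Agents. It also maintains and validates those certificates.</p><p>An Agent runs on every node that hosts workloads. It receives certificates from the Server and keeps them in a cache. It also exposes the SPIFFE Workload API to workloads and can act as an SDS (Secret Discovery Service) provider, supplying the certificates and validation material used for mTLS traffic.</p><p><img src="https://d2908q01vomqb2.cloudfront.net/fe2ef495a1152561572949784c16bf23abb28057/2021/07/21/envoy-spire.png"></p><p>The SPIFFE framework consists mainly of the following parts:</p><p>SPIFFE ID: identifies a workload within a trust domain. It is a URI-formatted string structured as follows:<br><img src="https://pic3.zhimg.com/v2-cb8de4befb4b9fa5a0dd46fbb649d16e_b.jpg"><br>That is, spiffe:&#x2F;&#x2F; followed by the trust domain name, then &#x2F; and the workload name or another identifier for the workload.</p><p>SVID (SPIFFE Verifiable Identity Document):<br>An SVID comes in one of two formats: an X.509 certificate or a JWT. X.509-SVIDs can be used to establish end-to-end mutual TLS encrypted connections. JWT-SVIDs are useful where end-to-end mutual TLS is not needed or not feasible, for example when a load balancer sits in the path. JWTs are also useful for authenticating to the many cloud services that already support JWT-based authentication. Whether JWT-SVIDs or X.509-SVIDs are used, the SPIFFE ID, the trust bundle format, and the Workload API are the same.</p><p>Trust 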
Bundle: the set of public keys used to verify SVIDs.</p><p>Workload API: the API through which a workload obtains its SPIFFE ID, SVID, and trust bundle.</p><p>SPIFFE Federation: different trust domains share their SPIFFE trust bundles. For example, once the SPIRE environment in data center A and the SPIRE environment in data center B establish a federation relationship, workloads in each can verify identities issued by the other.</p><p>Note: "workload" here is not the same thing as a workload in Kubernetes. It means any object that needs to be onboarded to SPIFFE, which can be a Docker container, a VM, a Kubernetes pod, and so on.</p><h3 id="演示"><a href="#演示" class="headerlink" title="演示"></a><a href="#%E6%BC%94%E7%A4%BA" title="演示"></a>Demo</h3><p>Software versions:<br>1. Kubernetes v1.24.8<br>2. SPIRE v1.5.3</p><h4 id="Spire部署"><a href="#Spire部署" class="headerlink" title="Spire部署"></a><a href="#Spire%E9%83%A8%E7%BD%B2" title="Spire部署"></a>Deploying SPIRE</h4><p>Deploy local-path-provisioner<br>spire-server is a stateful service that depends on storage, so local-path-provisioner is deployed here<br>and set as the default StorageClass.</p><p><a href="https://github.com/rancher/local-path-provisioner.git">https://github.com/rancher/local-path-provisioner.git</a></p><p>Once that is done, check the StorageClass:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">kubectl get sc</span><br><span class="line"></span><br><span class="line">NAME                   PROVISIONER             RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION   AGE</span><br><span class="line"></span><br><span class="line">local-path (default)   rancher.io/local-path   Delete          WaitForFirstConsumer   false                  2d7h</span><br></pre></td></tr></table></figure><p>Deploy spire-server and spire-agent<br>Clone this project:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">git clone https://github.com/spiffe/spire-tutorials</span><br></pre></td></tr></table></figure><p>Change to the spire-tutorials&#x2F;k8s&#x2F;quickstart directory and apply the namespace manifest:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span 
class="line">kubectl apply -f spire-namespace.yaml</span><br></pre></td></tr></table></figure><p>Configure permissions for spire-server:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply \</span><br><span class="line">    -f server-account.yaml \</span><br><span class="line">    -f spire-bundle-configmap.yaml \</span><br><span class="line">    -f server-cluster-role.yaml</span><br></pre></td></tr></table></figure><p>Deploy spire-server:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply \</span><br><span class="line">    -f server-configmap.yaml \</span><br><span class="line">    -f server-statefulset.yaml \</span><br><span class="line">    -f server-service.yaml</span><br></pre></td></tr></table></figure><p>Check the deployment status:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">kubectl get statefulset --namespace spire</span><br><span class="line"></span><br><span class="line">NAME           READY   AGE</span><br><span class="line"></span><br><span class="line">spire-server   1/1     
2d8h</span><br></pre></td></tr></table></figure><p>Deploy spire-agent<br>1. Configure permissions  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply \</span><br><span class="line">    -f agent-account.yaml \</span><br><span class="line">    -f agent-cluster-role.yaml</span><br></pre></td></tr></table></figure><p>2. Deploy spire-agent  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply \</span><br><span class="line">    -f agent-configmap.yaml \</span><br><span class="line">    -f agent-daemonset.yaml</span><br></pre></td></tr></table></figure><p>3. Check  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">kubectl get daemonset --namespace spire</span><br><span class="line"></span><br><span class="line">NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE</span><br><span class="line"></span><br><span class="line">spire-agent   2         2         2       2            2           &lt;none&gt;          2d8h</span><br></pre></td></tr></table></figure><p>4. Register the spire-agent node entry, then a workload entry for the default service account  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">kubectl exec -n spire spire-server-0 -- \</span><br><span class="line">    /opt/spire/bin/spire-server entry create \</span><br><span class="line">    -spiffeID spiffe://example.org/ns/spire/sa/spire-agent \</span><br><span class="line">    -selector k8s_sat:cluster:demo-cluster \</span><br><span class="line">    -selector k8s_sat:agent_ns:spire \</span><br><span class="line">    -selector k8s_sat:agent_sa:spire-agent \</span><br><span class="line">    -node</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">kubectl exec -n spire spire-server-0 -- \</span><br><span class="line">    /opt/spire/bin/spire-server entry create \</span><br><span class="line">    -spiffeID spiffe://example.org/ns/default/sa/default \</span><br><span class="line">    -parentID spiffe://example.org/ns/spire/sa/spire-agent \</span><br><span class="line">    -selector k8s:ns:default \</span><br><span class="line">    -selector k8s:sa:default</span><br></pre></td></tr></table></figure>
<p>5. Verify<br>In this setup spire-agent exposes its socket file on each Kubernetes node at &#x2F;run&#x2F;spire&#x2F;sockets&#x2F;agent.sock. Deploy a test container to check it:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f client-deployment.yaml</span><br></pre></td></tr></table></figure><p>Verify that the container can access the socket:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">kubectl exec -it $(kubectl get pods -o=jsonpath=&#x27;&#123;.items[0].metadata.name&#125;&#x27; \</span><br><span class="line">   -l app=client)  -- /opt/spire/bin/spire-agent api fetch -socketPath /run/spire/sockets/agent.sock</span><br></pre></td></tr></table></figure>
<p>If the agent is running properly, a list of SVIDs is printed:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">SPIFFE ID:              spiffe://example.org/ns/default/sa/default</span><br><span class="line"></span><br><span class="line">SVID Valid After:       
2022-12-25 11:41:16 +0000 UTC</span><br><span class="line"></span><br><span class="line">SVID Valid Until:       2022-12-25 12:41:26 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #1 Valid After:      2022-12-23 15:04:07 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #1 Valid Until:      2022-12-24 15:04:17 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #2 Valid After:      2022-12-24 03:04:07 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #2 Valid Until:      2022-12-25 03:04:17 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #3 Valid After:      2022-12-24 15:04:07 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #3 Valid Until:      2022-12-25 15:04:17 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #4 Valid After:      2022-12-25 03:04:07 +0000 UTC</span><br><span class="line"></span><br><span class="line">CA #4 Valid Until:      2022-12-26 03:04:17 +0000 UTC</span><br></pre></td></tr></table></figure><p>Demo application deployment<br>This demo uses Envoy together with X.509-SVIDs to secure communication between microservices.</p><p><img src="https://spiffe.io/img/checkouts/spiffe/spire-tutorials/k8s/envoy-x509/images/SPIRE_Envoy_diagram.png"></p><p>As the diagram shows, the frontend services connect to the backend service over mTLS connections that the Envoy sidecars establish by authenticating each other with X.509-SVIDs.</p><p>SPIRE Agent natively supports acting as an SDS server for Envoy; Envoy connects to the SDS service through a local socket.</p><p>Change to the spire-tutorials&#x2F;k8s&#x2F;envoy-x509 directory.</p><p>Deploy the application:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span 
class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -k k8s/.</span><br><span class="line"></span><br><span class="line">configmap/backend-balance-json-data created</span><br><span class="line"></span><br><span class="line">configmap/backend-envoy created</span><br><span class="line"></span><br><span class="line">configmap/backend-profile-json-data created</span><br><span class="line"></span><br><span class="line">configmap/backend-transactions-json-data created</span><br><span class="line"></span><br><span class="line">configmap/frontend-2-envoy created</span><br><span class="line"></span><br><span class="line">configmap/frontend-envoy created</span><br><span class="line"></span><br><span class="line">configmap/symbank-webapp-2-config created</span><br><span class="line"></span><br><span class="line">configmap/symbank-webapp-config created</span><br><span class="line"></span><br><span class="line">service/backend-envoy created</span><br><span class="line"></span><br><span class="line">service/frontend-2 created</span><br><span class="line"></span><br><span class="line">service/frontend created</span><br><span class="line"></span><br><span class="line">deployment.apps/backend created</span><br><span class="line"></span><br><span class="line">deployment.apps/frontend-2 created</span><br><span class="line"></span><br><span class="line">deployment.apps/frontend 
created</span><br></pre></td></tr></table></figure><p>Take the backend module as an example.<br>The file k8s&#x2F;backend&#x2F;config&#x2F;envoy.yaml shows how Envoy is configured to connect to the spire-agent socket:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">clusters:</span><br><span class="line"></span><br><span class="line">- name: spire_agent</span><br><span class="line"></span><br><span class="line">  connect_timeout: 0.25s</span><br><span class="line"></span><br><span class="line">  http2_protocol_options: &#123;&#125;</span><br><span class="line"></span><br><span class="line">  hosts:</span><br><span class="line"></span><br><span class="line">    - pipe:</span><br><span class="line"></span><br><span class="line">        path: /run/spire/sockets/agent.sock</span><br></pre></td></tr></table></figure><p>Manually register backend, frontend, and frontend-2 with spire-server. SPIRE can also register workloads automatically through the SPIRE Controller Manager (<a href="https://github.com/spiffe/spire-controller-manager">https://github.com/spiffe/spire-controller-manager</a>).  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">bash create-registration-entries.sh</span><br></pre></td></tr></table></figure><p>After registration, list the registered entries:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl exec -n spire spire-server-0 -c spire-server -- /opt/spire/bin/spire-server  entry show -selector 
k8s:ns:default</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span 
class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br></pre></td><td class="code"><pre><span class="line">Found 4 entries</span><br><span class="line"></span><br><span class="line">Entry ID         : 3478c441-3e25-40e7-96d9-ef74611f2205</span><br><span class="line"></span><br><span class="line">SPIFFE ID        : spiffe://example.org/ns/default/sa/default</span><br><span class="line"></span><br><span class="line">Parent ID        : spiffe://example.org/ns/spire/sa/spire-agent</span><br><span class="line"></span><br><span class="line">Revision         : 0</span><br><span class="line"></span><br><span class="line">X509-SVID TTL    : default</span><br><span class="line"></span><br><span class="line">JWT-SVID TTL     : default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:ns:default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:sa:default</span><br><span class="line"></span><br><span class="line">Entry ID         : c188d47c-e886-492e-bf67-6a6bf42c3667</span><br><span class="line"></span><br><span class="line">SPIFFE ID        : spiffe://example.org/ns/default/sa/default/backend</span><br><span class="line"></span><br><span class="line">Parent ID        : spiffe://example.org/ns/spire/sa/spire-agent</span><br><span class="line"></span><br><span class="line">Revision         : 0</span><br><span class="line"></span><br><span class="line">X509-SVID TTL    : 
default</span><br><span class="line"></span><br><span class="line">JWT-SVID TTL     : default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:container-name:envoy</span><br><span class="line"></span><br><span class="line">Selector         : k8s:ns:default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:pod-label:app:backend</span><br><span class="line"></span><br><span class="line">Selector         : k8s:sa:default</span><br><span class="line"></span><br><span class="line">Entry ID         : 6c376401-67d4-499a-a9d9-6ab71caf69c4</span><br><span class="line"></span><br><span class="line">SPIFFE ID        : spiffe://example.org/ns/default/sa/default/frontend</span><br><span class="line"></span><br><span class="line">Parent ID        : spiffe://example.org/ns/spire/sa/spire-agent</span><br><span class="line"></span><br><span class="line">Revision         : 0</span><br><span class="line"></span><br><span class="line">X509-SVID TTL    : default</span><br><span class="line"></span><br><span class="line">JWT-SVID TTL     : default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:container-name:envoy</span><br><span class="line"></span><br><span class="line">Selector         : k8s:ns:default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:pod-label:app:frontend</span><br><span class="line"></span><br><span class="line">Selector         : k8s:sa:default</span><br><span class="line"></span><br><span class="line">Entry ID         : 49f88c69-b4ee-4656-b740-6dbee5bb89a3</span><br><span class="line"></span><br><span class="line">SPIFFE ID        : spiffe://example.org/ns/default/sa/default/frontend-2</span><br><span class="line"></span><br><span class="line">Parent ID        : spiffe://example.org/ns/spire/sa/spire-agent</span><br><span class="line"></span><br><span class="line">Revision         : 0</span><br><span class="line"></span><br><span 
class="line">X509-SVID TTL    : default</span><br><span class="line"></span><br><span class="line">JWT-SVID TTL     : default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:container-name:envoy</span><br><span class="line"></span><br><span class="line">Selector         : k8s:ns:default</span><br><span class="line"></span><br><span class="line">Selector         : k8s:pod-label:app:frontend-2</span><br><span class="line"></span><br><span class="line">Selector         : k8s:sa:default</span><br></pre></td></tr></table></figure><p>The output shows the SPIFFE ID of each entry.</p><p>Access the services:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">kubectl get svc</span><br><span class="line"></span><br><span class="line">NAME            TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE</span><br><span class="line"></span><br><span class="line">backend-envoy   ClusterIP      None            &lt;none&gt;        9001/TCP         2d8h</span><br><span class="line"></span><br><span class="line">frontend        LoadBalancer   10.43.106.227   &lt;pending&gt;     3000:32082/TCP   2d8h</span><br><span class="line"></span><br><span class="line">frontend-2      LoadBalancer   10.43.203.167   &lt;pending&gt;     3002:30664/TCP   2d8h</span><br><span class="line"></span><br><span class="line">go-demo         NodePort       10.43.120.2     &lt;none&gt;        8080:30007/TCP   2d10h</span><br><span class="line"></span><br><span class="line">kubernetes      ClusterIP      10.43.0.1       &lt;none&gt; 
       443/TCP          2d10h</span><br></pre></td></tr></table></figure><p>The NodePort for frontend is 32082.</p><p>The NodePort for frontend-2 is 30664.</p><p>frontend shows the account details of Jacob Marley<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/spiffe-1.png"><br>frontend-2 shows the account details of Alex Fergus<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/spiffe-2.png"></p><p>Update the policy so that only the frontend service may access backend:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl apply -f backend-envoy-configmap-update.yaml</span><br></pre></td></tr></table></figure><p>This simply updates the backend Envoy configuration that corresponds to k8s&#x2F;backend&#x2F;config&#x2F;envoy.yaml,<br>removing the following entry from the match_subject_alt_names list (the second listing shows the list as it was before the update):  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">- exact: &quot;spiffe://example.org/ns/default/sa/default/frontend-2&quot;</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">match_subject_alt_names:</span><br><span class="line"></span><br><span class="line">                - exact: &quot;spiffe://example.org/ns/default/sa/default/frontend&quot;</span><br><span class="line"></span><br><span class="line">                - exact: &quot;spiffe://example.org/ns/default/sa/default/frontend-2&quot;</span><br></pre></td></tr></table></figure><p>Restart the backend service so that it picks up the new configuration:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">kubectl scale deployment backend --replicas=0</span><br><span class="line"></span><br><span 
class="line">kubectl scale deployment backend --replicas=1</span><br></pre></td></tr></table></figure><p>Accessing frontend again still displays normally, while accessing frontend-2 now gives the following result:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/spiffe-3.png"></p><h3 id="总结"><a href="#总结" class="headerlink" title="总结"></a><a href="#%E6%80%BB%E7%BB%93" title="总结"></a>Summary</h3><p>SPIFFE can be integrated in several ways, for example with Istio's Envoy sidecar or with OPA policies, which allows flexible, fine-grained control over application access permissions.</p><p>References:<br><a href="https://jimmysong.io/blog/why-istio-need-spire/">https://jimmysong.io/blog/why-istio-need-spire/</a><br><a href="https://mp.weixin.qq.com/s/4eEEYb8RuOFOmLcdL3N6wA">https://mp.weixin.qq.com/s/4eEEYb8RuOFOmLcdL3N6wA</a><br><a href="https://atbug.com/what-is-spiffe-and-spire/">https://atbug.com/what-is-spiffe-and-spire/</a><br><a href="https://www.nginx-cn.net/blog/mtls-architecture-nginx-service-mesh/">https://www.nginx-cn.net/blog/mtls-architecture-nginx-service-mesh/</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;&lt;a href=&quot;#%E6%A6%82%E8%BF%B0&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;概述&lt;/h3&gt;&lt;p&gt;传统的网络安全模型通过划分不同的网络分区，同一个网</summary>
      
    
    
    
    <category term="安全" scheme="http://yoursite.com/categories/%E5%AE%89%E5%85%A8/"/>
    
    
    <category term="安全" scheme="http://yoursite.com/tags/%E5%AE%89%E5%85%A8/"/>
    
  </entry>
  
  <entry>
    <title>eBPF Study Notes 3: Learning and Understanding XDP</title>
    <link href="http://yoursite.com/2022/10/17/ebpf_3/"/>
    <id>http://yoursite.com/2022/10/17/ebpf_3/</id>
    <published>2022-10-17T13:45:59.000Z</published>
    <updated>2022-10-17T13:45:59.000Z</updated>
    
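    <content type="html"><![CDATA[<h2 id="eBPF学习摘要3-XDP学习和理解"><a href="#eBPF学习摘要3-XDP学习和理解" class="headerlink" title="eBPF学习摘要3-XDP学习和理解"></a>eBPF Study Notes 3: Learning and Understanding XDP</h2><h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a><a href="#%E6%A6%82%E8%BF%B0" title="概述"></a>Overview</h3><p>On Linux, a received network packet travels NIC -&gt; rx_ring -&gt; sk_buff -&gt; network protocol processing (for example ip_rcv). The problem with this path is that it involves a large number of switches between kernel space and user space, which costs a lot of performance. Kernel bypass technologies such as DPDK and SolarFlare emerged to improve network performance. DPDK bypasses the kernel entirely: the user-space application accesses the network hardware directly, which makes packet processing more efficient and reduces the cost of switching. This approach has drawbacks of its own, though: 1. Because it bypasses the kernel, it is hard to integrate with the tools that already exist in the Linux kernel, and some functionality has to be redeveloped. 2. It needs dedicated CPU cores for packet processing.<br>XDP (eXpress Data Path) is an eBPF hook that lets eBPF programs run inside the kernel to process network packets; it was introduced in Linux kernel 4.8. Its approach is the opposite of kernel bypass: as soon as a packet leaves the driver, before an sk_buff is allocated, an XDP program can capture and process it, which greatly improves packet-processing efficiency.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp3_1.png"></p><h3 id="实现方式"><a href="#实现方式" class="headerlink" title="实现方式"></a><a href="#%E5%AE%9E%E7%8E%B0%E6%96%B9%E5%BC%8F" title="实现方式"></a>How It Works</h3><p>As shown in the figure below:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp3_2.png"><br>1. A packet arrives at the NIC and triggers the XDP program.<br>2. The XDP program reads the rules configured in BPF maps and applies the corresponding action to the packet, usually one of:<br>(1) XDP_DROP: drop the packet immediately at very little CPU cost, which is effective against DDoS attacks.<br>(2) XDP_PASS: pass the packet on to the kernel network stack as normal.<br>(3) XDP_REDIRECT: redirect the packet to another NIC, or deliver it directly to user space through AF_XDP.<br>(4) XDP_TX: send the processed packet back out of the same NIC it arrived on.</p><p>Three processing modes:</p><p>XDP has three attachment points in the network stack:<br>Offloaded XDP: on programmable NICs, the XDP program runs directly on the NIC.<br>Native XDP: the default mode. With a supporting NIC driver, packets can be processed as soon as they reach the kernel. (Most NICs already support this.)</p><p>Offloaded and Native modes</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp3_3.png"></p><p>Generic XDP: when the NIC and driver support neither of the two modes above, packets can be processed at the receive_skb function instead. This point is relatively late in the receive path, though still before the tc hook, so it has the worst performance of the three and is generally used for testing and debugging.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp3_4.png"></p><h3 id="应用场景"><a href="#应用场景" class="headerlink" title="应用场景"></a><a href="#%E5%BA%94%E7%94%A8%E5%9C%BA%E6%99%AF" title="应用场景"></a>Use Cases</h3><ul><li>Load balancers: XDP_TX and XDP_REDIRECT enable fast packet forwarding. Many Kubernetes network plugins that replace kube-proxy to implement Service load balancing work this way. All of Facebook's traffic is processed and forwarded by an XDP-based layer 4 load balancer (katran)（<a 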
href="https://lpc.events/event/11/contributions/950/attachments/889/1704/lpc_from_xdp_to_socket_fb.pdf">https://lpc.events/event/11/contributions/950/attachments/889/1704/lpc_from_xdp_to_socket_fb.pdf</a>）</li><li>Firewalls: Cloudflare uses XDP in L4Drop, its DDoS mitigation system, to drop packets at a high rate without heavy CPU usage (<a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/">https://blog.cloudflare.com/how-to-drop-10-million-packets/</a>)</li><li>Traffic monitoring and sampling: because XDP sits at the very front of the kernel network stack, a custom eBPF program can sample network traffic there. Many eBPF-based APM tools are implemented this way.</li></ul><h3 id="性能测试"><a href="#性能测试" class="headerlink" title="性能测试"></a><a href="#%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95" title="性能测试"></a>Performance Testing</h3><p><a href="https://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf">https://people.netfilter.org/hawk/presentations/KernelRecipes2018/XDP_Kernel_Recipes_2018.pdf</a><br><a href="http://vger.kernel.org/lpc_net2018_talks/lpc18-xdp-future.pdf">http://vger.kernel.org/lpc_net2018_talks/lpc18-xdp-future.pdf</a><br><a href="https://blog.cloudflare.com/how-to-drop-10-million-packets/">https://blog.cloudflare.com/how-to-drop-10-million-packets/</a><br><a href="https://blog.csdn.net/hbhgyu/article/details/109354273">https://blog.csdn.net/hbhgyu/article/details/109354273</a><br>These cover tests of packet-drop performance, forwarding performance, and DDoS mitigation capability.</p><h3 id="总结："><a href="#总结：" class="headerlink" title="总结："></a><a href="#%E6%80%BB%E7%BB%93%EF%BC%9A" title="总结："></a>Summary:</h3><p>As eBPF continues to evolve, XDP can deliver performance close to DPDK's while offering better compatibility and flexibility. It should keep maturing, and the ecosystem of software around eBPF and XDP will keep growing.</p><p>References:<br><a href="https://zhuanlan.zhihu.com/p/453005342">https://zhuanlan.zhihu.com/p/453005342</a><br><a href="https://zhuanlan.zhihu.com/p/438158551">https://zhuanlan.zhihu.com/p/438158551</a><br><a href="https://mp.weixin.qq.com/s/H9imUbdJnfj1NKdK9jtxEw">https://mp.weixin.qq.com/s/H9imUbdJnfj1NKdK9jtxEw</a><br><a href="https://zhuanlan.zhihu.com/p/321387418">https://zhuanlan.zhihu.com/p/321387418</a><br><a 
href="https://mp.weixin.qq.com/s/lUvxUkFg4w1X0ioktxGiHA">https://mp.weixin.qq.com/s/lUvxUkFg4w1X0ioktxGiHA</a><br><a href="https://www.seekret.io/blog/a-gentle-introduction-to-xdp/">https://www.seekret.io/blog/a-gentle-introduction-to-xdp/</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;eBPF学习摘要3-XDP学习和理解&quot;&gt;&lt;a href=&quot;#eBPF学习摘要3-XDP学习和理解&quot; class=&quot;headerlink&quot; title=&quot;eBPF学习摘要3-XDP学习和理解&quot;&gt;&lt;/a&gt;eBPF学习摘要3-XDP学习和理解&lt;/h2&gt;&lt;h3 id=&quot;概</summary>
      
    
    
    
    <category term="Linux" scheme="http://yoursite.com/categories/Linux/"/>
    
    
    <category term="Linux" scheme="http://yoursite.com/tags/Linux/"/>
    
  </entry>
  
  <entry>
    <title>eBPF Study Notes 2: Tool Usage (bpftrace)</title>
    <link href="http://yoursite.com/2022/09/18/ebpf_2/"/>
    <id>http://yoursite.com/2022/09/18/ebpf_2/</id>
    <published>2022-09-18T13:45:59.000Z</published>
    <updated>2022-09-18T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a><a href="#%E6%A6%82%E8%BF%B0" title="概述"></a>Overview</h3><p>bpftrace is a dynamic tracing tool built on eBPF. Programs are written in a DSL (Domain Specific Language) and compiled to eBPF bytecode with LLVM, and bpftrace uses BCC to interact with the Linux BPF subsystem. Scripts written in the DSL (which resembles awk) can be run directly, with no need to compile and load anything into the kernel by hand. For dynamic tracing inside the kernel, bpftrace mainly relies on kprobe and tracepoint probes. With bpftrace you can dig deeper when troubleshooting the operating system, for example counting calls to a function and measuring its latency, tracing OOM kills, or tracing dropped TCP packets, all with custom scripts.<br>There is also a project called BCC. The difference from bpftrace is that BCC lets you develop eBPF programs in general-purpose languages such as Python, Lua, and C++.</p><p><a href="https://github.com/iovisor/bpftrace">https://github.com/iovisor/bpftrace</a><br><a href="https://github.com/iovisor/bcc">https://github.com/iovisor/bcc</a></p><h3 id="安装和基础使用"><a href="#安装和基础使用" class="headerlink" title="安装和基础使用"></a><a href="#%E5%AE%89%E8%A3%85%E5%92%8C%E5%9F%BA%E7%A1%80%E4%BD%BF%E7%94%A8" title="安装和基础使用"></a>Installation and Basic Usage</h3><p>System environment:<br>Ubuntu: 20.04<br>Kernel: 5.4.0-125-generic</p><p>See the official installation guide:<br><a href="https://github.com/iovisor/bpftrace/blob/master/INSTALL.md">https://github.com/iovisor/bpftrace/blob/master/INSTALL.md</a><br>It covers installation on each Linux distribution as well as a Docker-based installation.<br>My system is Ubuntu 20.04, so I follow the Ubuntu instructions. Because bpftrace depends on eBPF capabilities, the available features vary by kernel version; for example, kprobes support arrived in 4.1 and tracepoints support in 4.7. The project provides a requirements-check script for verifying an existing environment: <a href="https://github.com/iovisor/bpftrace/blob/master/scripts/check_kernel_features.sh">https://github.com/iovisor/bpftrace/blob/master/scripts/check_kernel_features.sh</a><br>Running it prints:  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">./check_kernel_features.sh</span><br><span class="line"></span><br><span class="line">All required features present!</span><br></pre></td></tr></table></figure><p>Install bpftrace  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">sudo apt-get install 
-y bpftrace bpfcc-tools</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Once installation finishes, check the version  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">bpftrace --version</span><br><span class="line"></span><br><span class="line">bpftrace v0.9.4</span><br></pre></td></tr></table></figure><p>List the kprobe probes matching tcp* that the current kernel supports  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br></pre></td><td class="code"><pre><span class="line">bpftrace -l &#x27;kprobe:tcp*&#x27;</span><br><span class="line"></span><br><span class="line">kprobe:tcp_mmap</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_info_chrono_stats</span><br><span class="line"></span><br><span 
class="line">kprobe:tcp_init_sock</span><br><span class="line"></span><br><span class="line">kprobe:tcp_splice_data_recv</span><br><span class="line"></span><br><span class="line">kprobe:tcp_push</span><br><span class="line"></span><br><span class="line">kprobe:tcp_send_mss</span><br><span class="line"></span><br><span class="line">kprobe:tcp_cleanup_rbuf</span><br><span class="line"></span><br><span class="line">kprobe:tcp_set_rcvlowat</span><br><span class="line"></span><br><span class="line">kprobe:tcp_recv_timestamp</span><br><span class="line"></span><br><span class="line">kprobe:tcp_enter_memory_pressure</span><br><span class="line"></span><br><span class="line">kprobe:tcp_leave_memory_pressure</span><br><span class="line"></span><br><span class="line">kprobe:tcp_ioctl</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_info</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_md5sig_pool</span><br><span class="line"></span><br><span class="line">kprobe:tcp_set_state</span><br><span class="line"></span><br><span class="line">kprobe:tcp_shutdown</span><br><span class="line"></span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>Kernel static probes: tracepoints  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span 
class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br></pre></td><td class="code"><pre><span class="line">bpftrace -l &#x27;tracepoint:*&#x27;</span><br><span class="line"></span><br><span class="line">kprobe:tcp_mmap</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_info_chrono_stats</span><br><span class="line"></span><br><span class="line">kprobe:tcp_init_sock</span><br><span class="line"></span><br><span class="line">kprobe:tcp_splice_data_recv</span><br><span class="line"></span><br><span class="line">kprobe:tcp_push</span><br><span class="line"></span><br><span class="line">kprobe:tcp_send_mss</span><br><span 
class="line"></span><br><span class="line">kprobe:tcp_cleanup_rbuf</span><br><span class="line"></span><br><span class="line">kprobe:tcp_set_rcvlowat</span><br><span class="line"></span><br><span class="line">kprobe:tcp_recv_timestamp</span><br><span class="line"></span><br><span class="line">kprobe:tcp_enter_memory_pressure</span><br><span class="line"></span><br><span class="line">kprobe:tcp_leave_memory_pressure</span><br><span class="line"></span><br><span class="line">kprobe:tcp_ioctl</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_info</span><br><span class="line"></span><br><span class="line">kprobe:tcp_get_md5sig_pool</span><br><span class="line"></span><br><span class="line">kprobe:tcp_set_state</span><br><span class="line"></span><br><span class="line">kprobe:tcp_shutdown</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_compound</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_compound_status</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_start</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_splice</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_vector</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_io_done</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_done</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_write_start</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_write_opened</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_write_io_done</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_write_done</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_read_err</span><br><span class="line"></span><br><span 
class="line">tracepoint:nfsd:nfsd_write_err</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_layoutstate_alloc</span><br><span class="line"></span><br><span class="line">tracepoint:nfsd:nfsd_layoutstate_unhash</span><br><span class="line"></span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>For example, list the files opened by every process  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line">bpftrace -e &#x27;tracepoint:syscalls:sys_enter_openat &#123; printf(&quot;%s %s\n&quot;, comm, str(args-&gt;filename)); &#125;&#x27;</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/memory/kubepods.slice/memory.numa_stat</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpu.stat</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.stat</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage</span><br><span 
class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage_percpu</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/cpu,cpuacct/kubepods.slice/cpuacct.usage_all</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/pids/kubepods.slice/pids.current</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/pids/kubepods.slice/pids.max</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.bfq.sectors_recursive</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.bfq.io_serviced_recur</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.sectors_recursive</span><br><span class="line"></span><br><span class="line">kubelet /sys/fs/cgroup/blkio/kubepods.slice/blkio.throttle.io_serviced_</span><br><span class="line"></span><br><span class="line">...</span><br></pre></td></tr></table></figure><p>Press Ctrl+C to stop.<br>Complex commands can also be saved as scripts and run that way. After a default installation, many ready-made scripts are already present under &#x2F;usr&#x2F;sbin&#x2F;:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span 
class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br></pre></td><td class="code"><pre><span class="line">ls /usr/sbin/|grep 
&quot;.*.bt&quot;</span><br><span class="line"></span><br><span class="line">bashreadline.bt</span><br><span class="line"></span><br><span class="line">biolatency.bt</span><br><span class="line"></span><br><span class="line">biosnoop.bt</span><br><span class="line"></span><br><span class="line">biostacks.bt</span><br><span class="line"></span><br><span class="line">bitesize.bt</span><br><span class="line"></span><br><span class="line">capable.bt</span><br><span class="line"></span><br><span class="line">cpuwalk.bt</span><br><span class="line"></span><br><span class="line">dcsnoop.bt</span><br><span class="line"></span><br><span class="line">ebtables</span><br><span class="line"></span><br><span class="line">ebtables-nft</span><br><span class="line"></span><br><span class="line">ebtables-nft-restore</span><br><span class="line"></span><br><span class="line">ebtables-nft-save</span><br><span class="line"></span><br><span class="line">ebtables-restore</span><br><span class="line"></span><br><span class="line">ebtables-save</span><br><span class="line"></span><br><span class="line">execsnoop.bt</span><br><span class="line"></span><br><span class="line">gethostlatency.bt</span><br><span class="line"></span><br><span class="line">killsnoop.bt</span><br><span class="line"></span><br><span class="line">loads.bt</span><br><span class="line"></span><br><span class="line">mdflush.bt</span><br><span class="line"></span><br><span class="line">naptime.bt</span><br><span class="line"></span><br><span class="line">oomkill.bt</span><br><span class="line"></span><br><span class="line">opensnoop.bt</span><br><span class="line"></span><br><span class="line">pidpersec.bt</span><br><span class="line"></span><br><span class="line">runqlat.bt</span><br><span class="line"></span><br><span class="line">runqlen.bt</span><br><span class="line"></span><br><span class="line">setuids.bt</span><br><span class="line"></span><br><span class="line">statsnoop.bt</span><br><span 
class="line"></span><br><span class="line">swapin.bt</span><br><span class="line"></span><br><span class="line">syncsnoop.bt</span><br><span class="line"></span><br><span class="line">syscount.bt</span><br><span class="line"></span><br><span class="line">tcpaccept.bt</span><br><span class="line"></span><br><span class="line">tcpconnect.bt</span><br><span class="line"></span><br><span class="line">tcpdrop.bt</span><br><span class="line"></span><br><span class="line">tcplife.bt</span><br><span class="line"></span><br><span class="line">tcpretrans.bt</span><br><span class="line"></span><br><span class="line">tcpsynbl.bt</span><br><span class="line"></span><br><span class="line">threadsnoop.bt</span><br><span class="line"></span><br><span class="line">vfscount.bt</span><br><span class="line"></span><br><span class="line">vfsstat.bt</span><br><span class="line"></span><br><span class="line">writeback.bt</span><br><span class="line"></span><br><span class="line">xfsdist.bt</span><br></pre></td></tr></table></figure><p><a href="https://github.com/iovisor/bpftrace/tree/master/tools">https://github.com/iovisor/bpftrace/tree/master/tools</a> also contains many scripts and examples.</p><p>For example, running tcpconnect.bt shows the TCP connections initiated on this host  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span 
class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line">tcpconnect.bt </span><br><span class="line"></span><br><span class="line">Attaching 2 probes...</span><br><span class="line"></span><br><span class="line">Tracing tcp connections. Hit Ctrl-C to end.</span><br><span class="line"></span><br><span class="line">TIME     PID      COMM             SADDR                                   SPORT  DADDR                                   DPORT </span><br><span class="line"></span><br><span class="line">23:02:09 1686607  coredns          127.0.0.1                               42216  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:10 1686607  coredns          127.0.0.1                               42218  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:11 1680193  kubelet          10.0.1.11                               34732  10.0.1.15                               13429 </span><br><span class="line"></span><br><span class="line">23:02:11 1680193  kubelet          10.0.1.11                               41570  10.0.1.244                              13429 </span><br><span class="line"></span><br><span class="line">23:02:11 1686607  coredns          127.0.0.1                               42224  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:11 1680193  kubelet          127.0.0.1                               55010  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:11 1680193  kubelet          10.0.1.11                               33098  10.0.1.145                              13429 </span><br><span class="line"></span><br><span class="line">23:02:12 1686607  coredns       
   127.0.0.1                               42230  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:13 1680193  kubelet          127.0.0.1                               34214  127.0.0.1                               13429 </span><br><span class="line"></span><br><span class="line">23:02:13 1686607  coredns          127.0.0.1                               42234  127.0.0.1                               13429</span><br></pre></td></tr></table></figure><p>Trace open() calls system-wide  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span 
class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br></pre></td><td class="code"><pre><span class="line">opensnoop.bt</span><br><span class="line"></span><br><span class="line">Attaching 6 probes...</span><br><span class="line"></span><br><span class="line">Tracing open syscalls... Hit Ctrl-C to end.</span><br><span class="line"></span><br><span class="line">PID    COMM               FD ERR PATH</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/087/0872950b-6ca9-420e-8872-9</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/087/0872950b-6ca9-420e-8872-9</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/435/435debae-6288-4661-835d-e</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/435/435debae-6288-4661-835d-e</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/092/09246824-dd69-4f4c-8924-6</span><br><span class="line"></span><br><span class="line">1686817 AsyncBlockInput     2   0 /var/lib/clickhouse_storage/store/092/09246824-dd69-4f4c-8924-6</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span 
class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br><span class="line"></span><br><span class="line">1680193 kubelet             2   0 /sys/fs/cgroup/memory/kubepods.slice/kubepods-besteffort.slice/</span><br></pre></td></tr></table></figure><p>Show the processes doing the most block I/O and how much data each one reads or writes</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">biotop-bpfcc</span><br><span class="line"></span><br><span class="line">20:53:20 loadavg: 0.96 1.40 1.26 2/2392 3295655</span><br><span class="line"></span><br><span class="line">PID    COMM             D MAJ MIN DISK       I/O  Kbytes  AVGms</span><br><span class="line"></span><br><span class="line">337    jbd2/vda1-8      W 252 0   vda          2   412.0   5.16</span><br><span class="line"></span><br><span class="line">1680139 etcd             W 252 0   vda         14    60.0   2.61</span><br><span class="line"></span><br><span class="line">3295622 rancher-system-  R 252 0   vda          1    56.0   2.07</span><br></pre></td></tr></table></figure><p>Show the command and arguments of each newly executed process</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br></pre></td><td class="code"><pre><span class="line">execsnoop-bpfcc -T</span><br><span class="line"></span><br><span 
class="line">TIME     PCOMM            PID    PPID   RET ARGS</span><br><span class="line"></span><br><span class="line">21:22:05 rancher-system-  3310507 1        0 /usr/local/bin/rancher-system-agent sentinel</span><br><span class="line"></span><br><span class="line">21:22:06 cilium-cni       3310516 1680193   0 /opt/cni/bin/cilium-cni</span><br><span class="line"></span><br><span class="line">21:22:06 iptables         3310525 1680164   0 /usr/sbin/iptables -w 5 -W 100000 -S KUBE-PROXY-CANARY -t mangle</span><br><span class="line"></span><br><span class="line">21:22:06 ip6tables        3310524 1680164   0 /usr/sbin/ip6tables -w 5 -W 100000 -S KUBE-PROXY-CANARY -t mangle</span><br><span class="line"></span><br><span class="line">21:22:06 nsenter          3310526 1680193   0 /usr/bin/nsenter --net=/proc/1688099/ns/net -F -- ip -o -4 addr show dev eth0 scope global</span><br><span class="line"></span><br><span class="line">21:22:06 ip               3310526 1680193   0 /usr/sbin/ip -o -4 addr show dev eth0 scope global</span><br><span class="line"></span><br><span class="line">21:22:06 nsenter          3310527 1680193   0 /usr/bin/nsenter --net=/proc/1688099/ns/net -F -- ip -o -6 addr show dev eth0 scope global</span><br><span class="line"></span><br><span class="line">21:22:06 ip               3310527 1680193   0 /usr/sbin/ip -o -6 addr show dev eth0 scope global</span><br><span class="line"></span><br><span class="line">21:22:06 runc             3310528 1679752   0 /usr/bin/runc --version</span><br><span class="line"></span><br><span class="line">21:22:06 docker-init      3310534 1679752   0 /usr/bin/docker-init --version</span><br></pre></td></tr></table></figure><p>What some commonly used scripts do</p><ul><li>killsnoop.bt: traces signals sent through the kill() system call</li><li>tcpconnect.bt: traces new outbound TCP connections</li><li>pidpersec.bt: counts new processes created per second (via fork)</li><li>opensnoop.bt: traces open() system calls</li><li>vfsstat.bt: counts some VFS calls, summarized per second</li><li>bashreadline.bt: prints the bash commands entered in every running shell</li><li>tcplife.bt: traces the lifespan of TCP connections</li><li>biotop-bpfcc: shows block I/O per process, in the style of top</li></ul><h3 id="bpftrace执行原理"><a href="#bpftrace执行原理" class="headerlink" title="bpftrace执行原理"></a><a href="#bpftrace%E6%89%A7%E8%A1%8C%E5%8E%9F%E7%90%86" title="bpftrace执行原理"></a>How bpftrace Executes</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp2_2.png"><br>User space<br>1. The user writes an eBPF program, either in eBPF assembly or in a restricted subset of C.<br>2. The LLVM&#x2F;Clang compiler compiles the program into eBPF bytecode.<br>3. The bpf() system call loads the bytecode into the kernel.</p><p>Kernel space<br>1. When bytecode is loaded through bpf(), the kernel first verifies that it is safe.<br>2. The JIT (Just In Time) compiler translates the eBPF bytecode into native machine code.<br>3. Depending on what the program is for, the machine code is attached to a particular kernel execution path (for example, a program that traces kernel state is attached to kprobes). Whenever the kernel reaches that path, the attached eBPF code runs.<br>4. The program exchanges data with user space through maps.</p><p>Summary<br>bpftrace and BCC give a concrete feel for what eBPF offers: kernel functionality can be extended without modifying kernel source or recompiling the kernel. Beyond tracing tools such as bpftrace, open-source projects built on eBPF include Falco, which detects security threats in Pods, and the Katran load balancer. The ebpf_exporter component can also export metrics from custom eBPF programs to Prometheus for monitoring. The eBPF ecosystem keeps growing.</p><h3 id="Perf工具使用"><a href="#Perf工具使用" class="headerlink" title="Perf工具使用"></a><a href="#Perf%E5%B7%A5%E5%85%B7%E4%BD%BF%E7%94%A8" title="Perf工具使用"></a>Using Perf</h3><p>Perf (Performance Event) has been part of the Linux kernel since the 2.6 series. It relies mainly on the CPU's PMU (Performance Monitoring Unit) and on Linux tracepoints to sample targets and analyze performance. Perf is not really related to eBPF; it is covered here because it can also trace applications dynamically and likewise builds on tracepoints. The difference is that Perf offers a fixed set of capabilities, whereas bpftrace, being built on eBPF, lets scripts be composed and attached flexibly.</p><p>Installation<br>The operating system used here is Ubuntu 20.04 with kernel 5.4.0-125-generic.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">sudo apt-get install linux-tools-common linux-tools-&quot;$(uname -r)&quot; linux-cloud-tools-&quot;$(uname -r)&quot; linux-tools-generic linux-cloud-tools-generic</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Verify the version  
</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">perf -v</span><br><span class="line"></span><br><span class="line">perf version 5.4.195</span><br></pre></td></tr></table></figure><p>List the events available for sampling</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br></pre></td><td class="code"><pre><span class="line">perf list</span><br><span class="line"></span><br><span class="line">List of pre-defined events (to be used in -e):</span><br><span 
class="line"></span><br><span class="line">  alignment-faults                                   [Software event]</span><br><span class="line"></span><br><span class="line">  bpf-output                                         [Software event]</span><br><span class="line"></span><br><span class="line">  context-switches OR cs                             [Software event]</span><br><span class="line"></span><br><span class="line">  cpu-clock                                          [Software event]</span><br><span class="line"></span><br><span class="line">  cpu-migrations OR migrations                       [Software event]</span><br><span class="line"></span><br><span class="line">  dummy                                              [Software event]</span><br><span class="line"></span><br><span class="line">  emulation-faults                                   [Software event]</span><br><span class="line"></span><br><span class="line">  major-faults                                       [Software event]</span><br><span class="line"></span><br><span class="line">  minor-faults                                       [Software event]</span><br><span class="line"></span><br><span class="line">  page-faults OR faults                              [Software event]</span><br><span class="line"></span><br><span class="line">  task-clock                                         [Software event]</span><br><span class="line"></span><br><span class="line">  duration_time                                      [Tool event]</span><br><span class="line"></span><br><span class="line">  msr/tsc/                                           [Kernel PMU event]</span><br><span class="line"></span><br><span class="line">  rNNN                                               [Raw hardware event descriptor]</span><br><span class="line"></span><br><span class="line">  cpu/t1=v1[,t2=v2,t3 ...]/modifier                  [Raw hardware event descriptor]</span><br><span class="line"></span><br><span 
class="line">   (see &#x27;man perf-list&#x27; on how to encode it)</span><br><span class="line"></span><br><span class="line">  mem:&lt;addr&gt;[/len][:access]                          [Hardware breakpoint]</span><br><span class="line"></span><br><span class="line">  alarmtimer:alarmtimer_cancel                       [Tracepoint event]</span><br><span class="line"></span><br><span class="line">  alarmtimer:alarmtimer_fired                        [Tracepoint event]</span><br><span class="line"></span><br><span class="line">  ...</span><br></pre></td></tr></table></figure><p>Events fall into three main categories:<br>Hardware events: CPU events read from the PMU, such as cpu-cycles and cache hits or misses.<br>Software events: counters maintained by the kernel itself, such as context switches and page faults.<br>Tracepoint events: events emitted by static kernel tracepoints, such as block I/O and file system activity.</p><p>perf top shows, in real time, which processes and functions consume the most resources. The -g option records call graphs; in the interface, press e to expand or collapse the call chain of the selected entry.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span 
class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br></pre></td><td class="code"><pre><span class="line">perf top -g </span><br><span class="line"></span><br><span class="line">Samples: 284K of event &#x27;cpu-clock:pppH&#x27;, 4000 Hz, Event count (approx.): 33836570425 lost: 0/0 drop: 0/0</span><br><span class="line"></span><br><span class="line">  Children      Self  Shared Object                                          Symbol</span><br><span class="line"></span><br><span class="line">-   20.27%     0.09%  perf                                                   [.] 
__ordered_events__flush.part.0                                                            ◆</span><br><span class="line"></span><br><span class="line">   - 2.20% __ordered_events__flush.part.0                                                                                                                                  ▒</span><br><span class="line"></span><br><span class="line">      - 2.56% deliver_event                                                                                                                                                ▒</span><br><span class="line"></span><br><span class="line">         - 3.39% hist_entry_iter__add                                                                                                                                      ▒</span><br><span class="line"></span><br><span class="line">            - 3.79% iter_add_next_cumulative_entry                                                                                                                         ▒</span><br><span class="line"></span><br><span class="line">               - 3.03% __hists__add_entry.constprop.0                                                                                                                      ▒</span><br><span class="line"></span><br><span class="line">                    3.79% hists__findnew_entry                                                                                                                             ▒</span><br><span class="line"></span><br><span class="line">               - 1.54% callchain_append                                                                                                                                    ▒</span><br><span class="line"></span><br><span class="line">                  - 2.64% append_chain_children                                                                                                                            ▒</span><br><span class="line"></span><br><span 
class="line">                     - 2.22% append_chain_children                                                                                                                         ▒</span><br><span class="line"></span><br><span class="line">                        - 1.73% append_chain_children                                                                                                                      ▒</span><br><span class="line"></span><br><span class="line">                           - 1.34% append_chain_children                                                                                                                   ▒</span><br><span class="line"></span><br><span class="line">                                1.07% append_chain_children                                                                                                                ▒</span><br><span class="line"></span><br><span class="line">+   20.13%     0.18%  perf                                                   [.] deliver_event                                                                             ▒</span><br><span class="line"></span><br><span class="line">+   18.62%     0.04%  perf                                                   [.] hist_entry_iter__add                                                                      ▒</span><br><span class="line"></span><br><span class="line">+   14.47%     0.80%  perf                                                   [.] iter_add_next_cumulative_entry                                                            ▒</span><br><span class="line"></span><br><span class="line">+   12.05%     0.96%  [kernel]                                               [k] do_syscall_64                                                                             ▒</span><br><span class="line"></span><br><span class="line">+    8.99%     0.00%  perf                                                   [.] 
process_thread                                                                            ▒</span><br><span class="line"></span><br><span class="line">+    8.93%     0.22%  [kernel]                                               [k] do_idle                                                                                   ▒</span><br><span class="line"></span><br><span class="line">+    8.83%     1.06%  [kernel]                                               [k] __softirqentry_text_start                                                                 ▒</span><br><span class="line"></span><br><span class="line">+    8.24%     6.28%  perf                                                   [.] append_chain_children                                                                     ▒</span><br><span class="line"></span><br><span class="line">+    7.11%     0.06%  perf                                                   [.] callchain_append                                                                          ▒</span><br><span class="line"></span><br><span class="line">+    5.96%     4.75%  libc-2.31.so                                           [.] pthread_attr_setschedparam                                                                ▒</span><br><span class="line"></span><br><span class="line">+    5.74%     0.25%  perf                                                   [.] 
__hists__add_entry.constprop.0</span><br><span class="line"></span><br><span class="line"></span><br></pre></td></tr></table></figure><p>[.]: the symbol runs in user space<br>[k]: the symbol runs in kernel space</p><p>perf stat summarizes how a program is running</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">perf stat -p 1679752   # press Ctrl+C to print the results</span><br><span class="line"></span><br><span class="line"> Performance counter stats for process id &#x27;1679752&#x27;:</span><br><span class="line"></span><br><span class="line">            682.90 msec task-clock                #    0.066 CPUs utilized          </span><br><span class="line"></span><br><span class="line">              3154      context-switches          #    0.005 M/sec                  </span><br><span class="line"></span><br><span class="line">                36      cpu-migrations            #    0.053 K/sec                  </span><br><span class="line"></span><br><span class="line">              3275      page-faults               #    0.005 M/sec                  </span><br><span class="line"></span><br><span class="line">   &lt;not supported&gt;      cycles                                                      </span><br><span class="line"></span><br><span class="line">   &lt;not supported&gt;      instructions                                                </span><br><span class="line"></span><br><span 
class="line">   &lt;not supported&gt;      branches                                                    </span><br><span class="line"></span><br><span class="line">   &lt;not supported&gt;      branch-misses</span><br></pre></td></tr></table></figure><p>task-clock: CPU time the process consumed; the trailing column expresses it as the number of CPUs utilized<br>context-switches: number of context switches</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br></pre></td><td class="code"><pre><span class="line">Samples: 1K of event &#x27;block:block_rq_issue&#x27;, 1 Hz, Event count 
(approx.): 136 lost: 0/0 drop: 0/0</span><br><span class="line"></span><br><span class="line">  Children      Self  Trace output</span><br><span class="line"></span><br><span class="line">+   16.91%    16.91%  252,0 FF 0 () 0 + 0 [kworker/0:1H]</span><br><span class="line"></span><br><span class="line">+   15.44%    15.44%  252,0 FF 0 () 0 + 0 [kworker/7:1H]</span><br><span class="line"></span><br><span class="line">+    9.56%     9.56%  252,0 FF 0 () 0 + 0 [kworker/3:1H]</span><br><span class="line"></span><br><span class="line">+    8.09%     8.09%  252,0 FF 0 () 0 + 0 [kworker/4:1H]</span><br><span class="line"></span><br><span class="line">+    6.62%     6.62%  252,0 WS 4096 () 18340064 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">+    6.62%     6.62%  252,0 WS 4096 () 18340072 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">+    5.15%     5.15%  252,0 FF 0 () 0 + 0 [kworker/2:1H]</span><br><span class="line"></span><br><span class="line">+    4.41%     4.41%  252,0 FF 0 () 0 + 0 [kworker/6:1H]</span><br><span class="line"></span><br><span class="line">+    2.94%     2.94%  252,0 WS 4096 () 122005280 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     2.21%     2.21%  252,0 WS 4096 () 115952144 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     2.21%     2.21%  252,0 WS 4096 () 122005272 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">+    1.47%     1.47%  252,0 FF 0 () 0 + 0 [kworker/1:1H]</span><br><span class="line"></span><br><span class="line">+    1.47%     1.47%  252,0 WS 4096 () 116164552 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     1.47%     1.47%  252,0 WS 4096 () 116173824 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">+    1.47%     1.47%  252,0 WS 4096 () 122005256 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     1.47%     1.47%  252,0 WS 4096 () 122005288 
+ 8 [etcd]</span><br><span class="line"></span><br><span class="line">     1.47%     1.47%  252,0 WS 4096 () 122005296 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     0.74%     0.74%  252,0 FF 0 () 0 + 0 [kworker/5:1H]</span><br><span class="line"></span><br><span class="line">     0.74%     0.74%  252,0 WS 516096 () 2388520 + 1008 [jbd2/vda1-8]</span><br><span class="line"></span><br><span class="line">     0.74%     0.74%  252,0 WS 372736 () 2389528 + 728 [jbd2/vda1-8]</span><br><span class="line"></span><br><span class="line">     0.74%     0.74%  252,0 WS 4096 () 115700160 + 8 [etcd]</span><br><span class="line"></span><br><span class="line">     0.74%     0.74%  252,0 WS 4096 () 115948608 + 8 [etcd]</span><br></pre></td></tr></table></figure><p>Sample CPU activity on all CPUs for 60 seconds at 99 samples per second. When sampling finishes, a perf.data file is written to the current directory; if the command is run again, the previous file is renamed to perf.data.old. Use -p with a process ID to sample a single process.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">perf record -F 99 -a -g -- sleep 60</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>View the report</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">perf report</span><br></pre></td></tr></table></figure><p>Generate a flame graph</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span 
class="line"># Download the flame graph tools</span><br><span class="line"></span><br><span class="line">git clone https://github.com/brendangregg/FlameGraph.git</span><br><span class="line"></span><br><span class="line"># Dump the samples in perf.data as text</span><br><span class="line"></span><br><span class="line">perf script -i perf.data &gt; perf.unfold</span><br><span class="line"></span><br><span class="line"># Fold each sampled stack into a single line</span><br><span class="line"></span><br><span class="line">FlameGraph/stackcollapse-perf.pl perf.unfold &gt; perf.folded</span><br><span class="line"></span><br><span class="line"># Render the flame graph</span><br><span class="line"></span><br><span class="line">FlameGraph/flamegraph.pl perf.folded &gt; perf.svg</span><br></pre></td></tr></table></figure><p>Open perf.svg in a browser such as Chrome<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp2_1.png"></p><p>For how to read and analyze a flame graph, see<br><a href="https://www.infoq.cn/article/a8kmnxdhbwmzxzsytlga">https://www.infoq.cn/article/a8kmnxdhbwmzxzsytlga</a></p><p>References<br><a href="https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md">https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md</a><br><a href="http://blog.nsfocus.net/bpftrace-dynamic-tracing-0828/">http://blog.nsfocus.net/bpftrace-dynamic-tracing-0828/</a><br><a href="https://www.cnblogs.com/arnoldlu/p/6241297.html">https://www.cnblogs.com/arnoldlu/p/6241297.html</a><br><a href="https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux/8/html/monitoring_and_managing_system_status_and_performance/counting-events-during-process-execution-with-perf-stat_monitoring-and-managing-system-status-and-performance">https://access.redhat.com/documentation/zh-cn/red_hat_enterprise_linux&#x2F;8&#x2F;html&#x2F;monitoring_and_managing_system_status_and_performance&#x2F;counting-events-during-process-execution-with-perf-stat_monitoring-and-managing-system-status-and-performance</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;&lt;a href=&quot;#%E6%A6%82%E8%BF%B0&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;概述&lt;/h3&gt;&lt;p&gt;bpftrace是基于eBPF实现的动态的工具，使</summary>
      
    
    
    
    <category term="Linux" scheme="http://yoursite.com/categories/Linux/"/>
    
    
    <category term="Linux" scheme="http://yoursite.com/tags/Linux/"/>
    
  </entry>
  
  <entry>
    <title>eBPF Study Notes 1 (Overview and Theory)</title>
    <link href="http://yoursite.com/2022/08/23/ebpf_1/"/>
    <id>http://yoursite.com/2022/08/23/ebpf_1/</id>
    <published>2022-08-23T13:45:59.000Z</published>
    <updated>2022-08-23T13:45:59.000Z</updated>
    
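    <content type="html"><![CDATA[<h2 id="eBPF学习摘要1-概述、理论"><a href="#eBPF学习摘要1-概述、理论" class="headerlink" title="eBPF学习摘要1(概述、理论)"></a>eBPF Study Notes 1 (Overview and Theory)</h2><h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a><a href="#%E6%A6%82%E8%BF%B0" title="概述"></a>Overview</h3><p>Early kernel packet filtering copied network packets to user space before filtering them, which made filtering slow. BPF packet filtering was introduced on BSD in 1992, and Linux formally adopted it in kernel 2.1.75. With BPF, packets are filtered directly in the kernel instead of being copied to user space first, which greatly improves filtering performance; tcpdump, for example, is built on BPF. In 2014 BPF was comprehensively extended into eBPF (extended Berkeley Packet Filter), taking it beyond the network stack. The iovisor project later introduced tools such as BCC and bpftrace, which became the best practice for eBPF-based tracing and troubleshooting. eBPF's most significant property is that it runs sandboxed programs inside the kernel, so kernel functionality can be extended without modifying kernel source or recompiling the kernel. A series of open-source projects that use eBPF to improve networking and security, such as Cilium, Katran, and Falco, have emerged as a result, and a growing number of open-source and commercial solutions rely on eBPF to improve networking, security, and observability.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_1.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_2.png"><br>Image source: <a href="https://blog.csdn.net/eBPF_Kindling/article/details/123575619">https://blog.csdn.net/eBPF_Kindling&#x2F;article&#x2F;details&#x2F;123575619</a></p><p>History<br>1992: BPF (Berkeley Packet Filter) is created to provide a way to define custom packet filters inside the kernel, using an assembly-like instruction set, and to make packet capture more efficient (tcpdump).<br>2011: Linux kernel 3.0 adds the BPF JIT, a major improvement that substantially raises performance.<br>2014: In Linux kernel 3.15, BPF is extended into eBPF, broadening its scope to kernel tracing, performance tuning, protocol-stack QoS, and more. Accompanying changes include an extended BPF instruction set (ISA), programming in a high-level language (C), maps, helper functions, and the verifier.<br>2016: Linux kernel 4.8 adds XDP support to eBPF, further expanding its use in networking. Netronome subsequently proposes an eBPF hardware offload scheme, and the Cilium project is officially released.<br>2018: Linux kernel 4.18 introduces BTF (BPF Type Format), which describes BPF objects (programs and maps) with uniform type metadata. This makes it easier to manage eBPF objects across kernel versions and lays the groundwork for eBPF's further development.<br>2018: From kernel 4.20 onward, eBPF is one of the most active areas of the kernel. New features include the sysctl hook, flow dissector, struct_ops, LSM hook, and ring buffer, covering containers, security, networking, and tracing.<br>2021: Microsoft, Facebook, Google, Isovalent, and Netflix found the eBPF Foundation. In the same year, Cilium releases an eBPF-based Service Mesh solution.<br>eBPF basic architecture and usage<br>Reference: <a href="https://blog.51cto.com/dengchj/2944202">https://blog.51cto.com/dengchj/2944202</a></p><h3 id="实现原理"><a href="#实现原理" class="headerlink" title="实现原理"></a><a href="#%E5%AE%9E%E7%8E%B0%E5%8E%9F%E7%90%86" 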
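title="实现原理"></a>How it works</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_3.png"><br>User space<br>1. The user writes an eBPF program, either in eBPF assembly or in the restricted C dialect used for eBPF.<br>2. The LLVM&#x2F;Clang compiler compiles the eBPF program into eBPF bytecode.<br>3. The bpf() system call loads the eBPF bytecode into the kernel.</p><p>Kernel space<br>1. When the user loads eBPF bytecode through the bpf() system call, the kernel first runs safety verification on the bytecode.<br>2. The JIT (Just In Time) compiler translates the eBPF bytecode into native machine code.<br>3. Depending on the program's purpose, the eBPF machine code is attached to a specific kernel execution path (for example, an eBPF program that traces kernel state is attached to a kprobe). When the kernel reaches that path, the attached eBPF code runs.<br>4. The program exchanges data with user-space programs through maps.</p><h4 id="如何保证内核安全性和优缺点"><a href="#如何保证内核安全性和优缺点" class="headerlink" title="如何保证内核安全性和优缺点"></a><a href="#%E5%A6%82%E4%BD%95%E4%BF%9D%E8%AF%81%E5%86%85%E6%A0%B8%E5%AE%89%E5%85%A8%E6%80%A7%E5%92%8C%E4%BC%98%E7%BC%BA%E7%82%B9" title="如何保证内核安全性和优缺点"></a>Kernel safety guarantees, advantages and drawbacks</h4><ul><li><p>Privileges required: any process that loads an eBPF program into the Linux kernel must run in privileged mode (root) or hold the CAP_BPF capability. Untrusted programs cannot load eBPF programs.</p></li><li><p>Verifier: every eBPF program must pass the verifier when it is loaded. The verifier checks, for example, that loops are bounded, that memory is not accessed out of bounds, and that no uninitialized variable is used.</p></li><li><p>Execution protection: once an eBPF program is loaded, its memory becomes read-only. Any attempt to modify it crashes the kernel rather than letting the change through.</p></li><li><p>Restricted kernel access: an eBPF program cannot call arbitrary kernel functions. It can only call the fixed set of eBPF helper functions.</p></li><li><p>The eBPF stack size is limited to MAX_BPF_STACK, which is 512 bytes as of Linux 5.8.</p></li><li><p>eBPF bytecode was originally limited to 4096 instructions. As of Linux 5.8 the limit has been relaxed to 1 million instructions.</p></li></ul><p>Advantages:<br>1. Speed and performance: the code runs in kernel space, so it is fast and efficient.<br>2. Flexibility: kernel functionality can be extended without modifying kernel code, which opens up a very wide range of uses.<br>3. Low intrusiveness: scenarios such as distributed tracing and service governance can be implemented with eBPF without touching user-level code.</p><p>Drawbacks:<br>1. Some eBPF features and capabilities depend on recent kernel versions.<br>2. The learning curve is steep and requires a solid understanding of the Linux kernel and operating system principles.</p><h3 id="目前行业落地情况"><a href="#目前行业落地情况" class="headerlink" title="目前行业落地情况"></a><a href="#%E7%9B%AE%E5%89%8D%E8%A1%8C%E4%B8%9A%E8%90%BD%E5%9C%B0%E6%83%85%E5%86%B5" title="目前行业落地情况"></a>Current industry adoption</h3><p>Applications</p><ul><li><p>Dynamic tracing: bcc, bpftrace</p></li><li><p>Observability and monitoring: Pixie, Hubble</p></li><li><p>Networking: Cilium, Katran</p></li><li><p>Security: Falco, Tracee</p><h3 id="能解决什么问题"><a href="#能解决什么问题" class="headerlink" title="能解决什么问题"></a><a href="#%E8%83%BD%E8%A7%A3%E5%86%B3%E4%BB%80%E4%B9%88%E9%97%AE%E9%A2%98" title="能解决什么问题"></a>What problems it solves</h3><p><img 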
src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_4.png"></p></li></ul><h3 id="为什么需要eBPF能实现可观测性"><a href="#为什么需要eBPF能实现可观测性" class="headerlink" title="为什么需要eBPF能实现可观测性"></a><a href="#%E4%B8%BA%E4%BB%80%E4%B9%88%E9%9C%80%E8%A6%81eBPF%E8%83%BD%E5%AE%9E%E7%8E%B0%E5%8F%AF%E8%A7%82%E6%B5%8B%E6%80%A7" title="为什么需要eBPF能实现可观测性"></a>Why eBPF enables observability</h3><p>eBPF observability: metrics collection<br>Besides conventional metrics such as CPU and memory, eBPF can monitor fine-grained information such as system calls, using kernel kprobes or tracepoints.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_5.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_6.png"></p><p>eBPF observability: distributed tracing<br>Unlike traditional APM, tracing with eBPF does not have to be bound to the application itself. By intercepting socket send&#x2F;recv operations and parsing protocol headers, it derives the call relationships between processes. These can then be correlated with Kubernetes metadata to obtain the call relationships between containers and services.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_7.png"></p><h3 id="展望未来"><a href="#展望未来" class="headerlink" title="展望未来"></a><a href="#%E5%B1%95%E6%9C%9B%E6%9C%AA%E6%9D%A5" title="展望未来"></a>Looking ahead</h3><p>1. eBPF-based service mesh<br>Removes the per-Pod sidecar and implements service governance in kernel space (already implemented in Cilium 1.12).<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_8.png"></p><p>2. eBPF-based load balancer<br>With socket-level eBPF, load-balancing logic can be implemented without handling packets directly or performing NAT. The Service path POD &lt;–&gt; Service &lt;–&gt; POD is reduced to POD &lt;–&gt; POD, so Service network performance becomes roughly equal to Pod network performance. The software structure is shown below:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_9.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_10.png"></p><p>3. eBPF-based network security policy<br>There is no longer any dependence on iptables and no need to create huge numbers of iptables rules, which significantly reduces the performance cost that iptables introduces.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ebfp1_11.png"></p><p>References:</p><p><a href="https://mp.weixin.qq.com/s/Xr8ECrS_fR3aCT1vKJ9yIg">https://mp.weixin.qq.com/s/Xr8ECrS_fR3aCT1vKJ9yIg</a><br><a 
href="https://mp.weixin.qq.com/s?__biz=Mzg5Mjc3MjIyMA==&mid=2247544625&idx=2&sn=7ba07582e0b7fdc0ff3179f2fa2b44d4&source=41#wechat_redirect">https://mp.weixin.qq.com/s?__biz&#x3D;Mzg5Mjc3MjIyMA&#x3D;&#x3D;&amp;mid&#x3D;2247544625&amp;idx&#x3D;2&amp;sn&#x3D;7ba07582e0b7fdc0ff3179f2fa2b44d4&amp;source&#x3D;41#wechat_redirect</a><br><a href="https://blog.csdn.net/eBPF_Kindling/article/details/123575619">https://blog.csdn.net/eBPF_Kindling&#x2F;article&#x2F;details&#x2F;123575619</a><br><a href="https://www.51cto.com/article/715674.html">https://www.51cto.com/article/715674.html</a><br><a href="https://blog.csdn.net/m0_46700908/article/details/124464577?spm=1001.2101.3001.6650.1&utm_medium=distribute.pc_relevant.none-task-blog-2~default~CTRLIST~Rate-1-124464577-blog-123575619.t5_layer_eslanding_D_0&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2~default~CTRLIST~Rate-1-124464577-blog-123575619.t5_layer_eslanding_D_0&utm_relevant_index=2">https://blog.csdn.net/m0_46700908&#x2F;article&#x2F;details&#x2F;124464577?spm&#x3D;1001.2101.3001.6650.1&amp;utm_medium&#x3D;distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-124464577-blog-123575619.t5_layer_eslanding_D_0&amp;depth_1-utm_source&#x3D;distribute.pc_relevant.none-task-blog-2%7Edefault%7ECTRLIST%7ERate-1-124464577-blog-123575619.t5_layer_eslanding_D_0&amp;utm_relevant_index&#x3D;2</a><br><a href="https://colobu.com/2022/05/22/use-ebpf-to-trace-rpcx-microservices/">https://colobu.com/2022/05/22/use-ebpf-to-trace-rpcx-microservices/</a><br><a href="https://arthurchiao.art/blog/ebpf-and-k8s-zh/">https://arthurchiao.art/blog/ebpf-and-k8s-zh/</a><br><a href="https://zhuanlan.zhihu.com/p/480811707">https://zhuanlan.zhihu.com/p/480811707</a><br><a href="https://github.com/mikeroyal/eBPF-Guide">https://github.com/mikeroyal/eBPF-Guide</a><br><a href="https://zhuanlan.zhihu.com/p/373090595">https://zhuanlan.zhihu.com/p/373090595</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;eBPF学习摘要1-概述、理论&quot;&gt;&lt;a href=&quot;#eBPF学习摘要1-概述、理论&quot; class=&quot;headerlink&quot; title=&quot;eBPF学习摘要1(概述、理论)&quot;&gt;&lt;/a&gt;eBPF Study Notes 1 (Overview and Theory)&lt;/h2&gt;&lt;h3 id=&quot;概述&quot;&gt;&lt;a href</summary>
      
    
    
    
    <category term="Linux" scheme="http://yoursite.com/categories/Linux/"/>
    
    
    <category term="Linux" scheme="http://yoursite.com/tags/Linux/"/>
    
  </entry>
  
  <entry>
    <title>Analyzing slow reads and writes in an etcd cluster</title>
    <link href="http://yoursite.com/2022/07/27/fio_etcd/"/>
    <id>http://yoursite.com/2022/07/27/fio_etcd/</id>
    <published>2022-07-27T13:45:59.000Z</published>
    <updated>2022-07-27T13:45:59.000Z</updated>
    
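    <content type="html"><![CDATA[<h3 id="问题现象"><a href="#问题现象" class="headerlink" title="问题现象"></a>Symptoms</h3><p>1. The local cluster that hosts Rancher stalls periodically and commands respond slowly.<br>2. The Rancher-server replicas restart frequently.</p><p>3. In the Rancher UI, switching projects on an idle cluster is slow and clicks respond slowly.</p><p>The etcd logs contain a large number of read-only messages and "too long (xxx ms) to execute" messages.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/etcd-read-only.png"></p><h3 id="问题分析"><a href="#问题分析" class="headerlink" title="问题分析"></a>Analysis</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/etcd-read-only3.jpeg"></p><p>Note: the etcd read and write flows below come from the Tencent Cloud Native community (<a href="https://blog.csdn.net/yunxiao6/article/details/108615472">https://blog.csdn.net/yunxiao6/article/details/108615472</a>).<br>Write flow (using the leader node as the example, see the figure above):</p><p>1. The etcd server module on any etcd node receives a client write request (a follower node first forwards the request to the leader through the Raft module).</p><p>2. The etcd server wraps the request as a Raft request and submits it to the Raft module.</p><p>3. The leader uses the Raft protocol to replicate the message to the follower nodes and, in parallel, persists the log entry to the WAL.</p><p>4. Each follower responds to the request, stating whether it accepts it.</p><p>5. Once a majority of nodes ((n&#x2F;2)+1 members) has accepted the log entry, the request can be committed. The Raft module notifies the etcd server that the entry is committed and can be applied.</p><p>6. The applierV3 module of the etcd server on each node applies the entry asynchronously and writes it to the BoltDB backend through the MVCC module.</p><p>7. After the node the client is connected to has applied the entry, it returns the apply result to the client.</p><p>Read flow:</p><p>1. The etcd server module on any etcd node receives a client read request (Range request) and determines its type. A serializable read goes straight to the Apply flow.</p><p>2. A linearizable read enters the Raft module.</p><p>3. The Raft module sends a ReadIndex request to the leader to obtain the index of the latest data committed in the cluster.</p><p>4. Once the local AppliedIndex is greater than or equal to the CommittedIndex returned by ReadIndex, the request enters the Apply flow.</p><p>5. Apply flow: the KV Index module returns the latest revision for the key name, and the key and value for that revision are then read from BoltDB.</p><p>etcd uses the WAL (write-ahead log) to make its in-memory data durable. WAL writes are bound by disk write speed, so fdatasync latency also affects etcd performance. In this environment the underlying storage is Ceph, a distributed store that synchronizes multiple replicas. Replica synchronization consumes a lot of network and I&#x2F;O resources, and the disks underneath are SAS drives, so the impact on etcd performance is considerable.</p><h4 id="使用FIO模拟etcd-io写入"><a href="#使用FIO模拟etcd-io写入" class="headerlink" title="使用FIO模拟etcd io写入"></a>Simulating etcd I&#x2F;O writes with fio</h4><p>Install fio</p><pre><code>curl -LO \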
https://github.com/rancherlabs/support-tools/raw/master/instant-fio-master/instant-fio-master.sh
bash instant-fio-master.sh</code></pre><p>Create a test directory. Running the test under &#x2F;var&#x2F;lib&#x2F;etcd gives the most direct picture of what etcd itself experiences.</p><pre><code>export PATH=/usr/local/bin:$PATH
cd /var/lib/etcd
mkdir test-data
fio --rw=write --ioengine=sync --fdatasync=1 --directory=test-data --size=100m --bs=2300 --name=mytest</code></pre><p>size: the total amount of data to write<br>bs: the size of each write, in bytes</p><p>To simulate the real I&#x2F;O pattern more closely, use lsof and strace to find out how much data etcd actually writes per operation.</p><p>Use lsof to find the file descriptors of the etcd process:</p><pre><code>lsof -p $(pgrep etcd)|grep wal
etcd    21040 root    7w      REG   252,1 64000000  828705 /var/lib/rancher/etcd/member/wal/1.tmp
etcd    21040 root    8r      DIR   252,1     4096  838659 /var/lib/rancher/etcd/member/wal
etcd    21040 root   11w      REG   252,1 64000000  828702 /var/lib/rancher/etcd/member/wal/0000000000000005-000000000007016b.wal</code></pre><p>11w is the file descriptor used to write the WAL file. Use strace to trace etcd's system calls and see how much data is actually written.</p><pre><code>strace -f -p  $(pgrep etcd) -T -tt  -o test.txt</code></pre><p>Open test.txt and search for <code>write(11</code></p><pre><code>21064 11:23:24.438231 write(11, &quot;\25\3\0\0\0\0\0\203\10\2\20\303\240\345\252\16\32\212\6\10\0\20\2\30\306\276\34\&quot;\377\0052\337&quot;..., 840 &lt;unfinished ...&gt;
21306 11:23:24.438248 &lt;... write resumed&gt; ) = 42 &lt;0.000037&gt;
21215 11:23:24.438263 &lt;... futex resumed&gt; ) = 0 &lt;0.005978&gt;
21068 11:23:24.438277 &lt;... futex resumed&gt; ) = 1 &lt;0.000051&gt;
21064 11:23:24.438291 &lt;... write resumed&gt; ) = 840 &lt;0.000048&gt;
21306 11:23:24.438305 futex(0xc00080cf48, FUTEX_WAIT_PRIVATE, 0, NULL &lt;unfinished ...&gt;
21068 11:23:24.438319 futex(0xc0004d2148, FUTEX_WAIT_PRIVATE, 0, NULL &lt;unfinished ...&gt;
21060 11:23:24.438333 &lt;... nanosleep resumed&gt; NULL) = 0 &lt;0.000247&gt;
21060 11:23:24.438352 nanosleep(&#123;tv_sec=0, tv_nsec=20000&#125;,  &lt;unfinished ...&gt;
21215 11:23:24.438496 futex(0xc00080cf48, FUTEX_WAKE_PRIVATE, 1 &lt;unfinished ...&gt;
21064 11:23:24.438530 fdatasync(11 &lt;unfinished ...&gt;</code></pre><p>File descriptor 11 is followed by an fdatasync after the write completes, and the write call shows that 840 bytes were written this time. Comparing several samples gives a range of 800 to 900 bytes, because this environment is a single-node cluster. The real write size depends directly on the etcd version and the cluster size and is typically around 2300 bytes, so the fio bs parameter is set to 2300 bytes here to simulate etcd writes and observe the latency.</p><p>Test results</p><pre><code>mytest: (g=0): rw=write, bs=(R) 2300B-2300B, (W) 2300B-2300B, (T) 2300B-2300B, ioengine=sync, iodepth=1
fio-3.30-67-gdc472
Starting 1 process
mytest: Laying out IO file (1 file / 100MiB)
Jobs: 1 (f=1)
Jobs: 1 (f=1): [W(1)][100.0%][w=636KiB/s][w=283 IOPS][eta 00m:00s]
mytest: (groupid=0, jobs=1): err= 0: pid=16852: Mon Jul  4 09:46:37 2022
  write: IOPS=253, BW=569KiB/s (583kB/s)(100.0MiB/179902msec); 0 zone resets
    clat (usec): min=5, max=4377, avg=16.96, stdev=32.00
     lat (usec): min=5, max=4377, avg=17.51, stdev=32.04
    clat percentiles (usec):
     |  1.00th=[    8],  5.00th=[   10], 10.00th=[   10], 20.00th=[   11],
     | 30.00th=[   12], 40.00th=[   13], 50.00th=[   14], 60.00th=[   16],
     | 70.00th=[   18], 80.00th=[   22], 90.00th=[   29], 95.00th=[   34],
     | 99.00th=[   49], 99.50th=[   57], 99.90th=[   81], 99.95th=[   96],
     | 99.99th=[ 1369]
   bw (  KiB/s): min=   89, max=  691, per=99.97%, avg=569.10, stdev=63.60, samples=359
   iops        : min=   40, max=  308, avg=253.57, stdev=28.33, samples=359
  lat (usec)   : 10=15.39%, 20=60.57%, 50=23.19%, 100=0.81%, 250=0.03%
  lat (msec)   : 2=0.01%, 4=0.01%, 10=0.01%
  fsync/fdatasync/sync_file_range:
    sync (usec): min=1052, max=434792, avg=3923.05, stdev=3609.22
    sync percentiles (usec):
     |  1.00th=[  1237],  5.00th=[  1385], 10.00th=[  1483], 20.00th=[  1663],
     | 30.00th=[  1876], 40.00th=[  2278], 50.00th=[  4359], 60.00th=[  4752],
     | 70.00th=[  5211], 80.00th=[  5669], 90.00th=[  6325], 95.00th=[  6849],
     | 99.00th=[  8455], 99.50th=[ 12649], 99.90th=[ 22938], 99.95th=[ 23725],
     | 99.99th=[166724]
  cpu          : usr=0.33%, sys=1.60%, ctx=109419, majf=0, minf=14
  IO depths    : 1=200.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, &gt;=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &gt;=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &gt;=64=0.0%
     issued rwts: total=0,45590,0,0 short=45590,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=1

Run status group 0 (all jobs):
  WRITE: bw=569KiB/s (583kB/s), 569KiB/s-569KiB/s (583kB/s-583kB/s), io=100.0MiB (105MB), run=179902-179902msec

Disk stats (read/write):
  vda: ios=4/120187, merge=0/53744, ticks=56/185772, in_queue=9776, util=1.54%</code></pre><p>The key figures are in the<br><code>fsync/fdatasync/sync_file_range:</code> section: <code> 99.00th=[  8455], 99.50th=[ 12649],</code></p><p>In the run shown above, the 99th percentile of sync latency is 8455 usec (about 8.5 ms) and the 99.5th percentile is 12649 usec. etcd requires the 99th percentile of fdatasync latency when writing the WAL file to be below 10 ms, so this storage sits right at that limit and exceeds it in the tail.<br><a href="https://etcd.io/docs/v3.4/op-guide/performance/">https://etcd.io/docs/v3.4/op-guide/performance/</a></p><p>References:</p><pre><code>https://blog.happyhack.io/2021/08/05/fio-and-etcd/
https://www.suse.com/support/kb/doc/?id=000020100
https://www.ibm.com/cloud/blog/using-fio-to-tell-whether-your-storage-is-fast-enough-for-etcd
https://blog.csdn.net/yunxiao6/article/details/108615472</code></pre>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;问题现象&quot;&gt;&lt;a href=&quot;#问题现象&quot; class=&quot;headerlink&quot; title=&quot;问题现象&quot;&gt;&lt;/a&gt;Symptoms&lt;/h3&gt;&lt;p&gt;1. The local cluster that hosts Rancher stalls periodically and commands respond slowly.&lt;br&gt;2. The Rancher-server replicas restart frequently.&lt;/</summary>
      
    
    
    
    <category term="kubernetes" scheme="http://yoursite.com/categories/kubernetes/"/>
    
    
    <category term="kubernetes" scheme="http://yoursite.com/tags/kubernetes/"/>
    
  </entry>
  
  <entry>
    <title>Using Rancher 2.6 Monitoring</title>
    <link href="http://yoursite.com/2022/06/29/rancher_monitor/"/>
    <id>http://yoursite.com/2022/06/29/rancher_monitor/</id>
    <published>2022-06-29T13:45:59.000Z</published>
    <updated>2022-06-29T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<table><thead><tr><th>Software</th><th>Version</th></tr></thead><tbody><tr><td>Rancher</td><td>.9</td></tr><tr><td>Kubernetes</td><td>1.23.7+rke2r2</td></tr></tbody></table><h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>Monitoring in Rancher 2.6 is enabled quite differently from earlier versions. It is built on the upstream Prometheus Operator, which models monitoring and alerting as Kubernetes CRD resources so that the two integrate better and are easier to use. The following resource objects are involved:</p><p>PrometheusRules: define alerting rules.</p><p>Alertmanagers: the CRD that launches Alertmanager and controls its replicas.</p><p>Receivers: configure the channels that receive alert notifications.</p><p>Routes: match alerting rules to receivers.</p><p>ServiceMonitor: defines the endpoints from which Prometheus scrapes metrics.</p><p>PodMonitor: monitors Pods at a finer granularity.</p><p><img src="https://github.com/prometheus-operator/prometheus-operator/raw/main/Documentation/user-guides/images/architecture.png"></p><h3 id="配置使用"><a href="#配置使用" class="headerlink" title="配置使用"></a>Configuration and usage</h3><h4 id="启用监控"><a href="#启用监控" class="headerlink" title="启用监控"></a>Enabling monitoring</h4><p>The steps are as follows.</p><p>Switch to the target cluster and choose Cluster Tools in the lower-left corner to enable Prometheus.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-16.png"></p><p>Deploy it into the System project and tick the option to customize the Helm values.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-17.png"></p><p>Adjust the deployment settings to your requirements.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-18.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-19.png"></p><p>To use remote storage such as InfluxDB, edit the YAML so that the configuration points at InfluxDB.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">remoteRead:</span><br><span class="line">  - url: http://192.168.0.7:8086/api/v1/prom/read?db=prometheus</span><br><span class="line">remoteWrite:</span><br><span class="line">  - url: 
http://192.168.0.7:8086/api/v1/prom/write?db=prometheus</span><br></pre></td></tr></table></figure><p>The default resource limits for node-exporter are low, so it tends to be OOM-killed after running for a long time. Raise its memory limit to 150Mi.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">podLabels:</span><br><span class="line">  jobLabel: node-exporter</span><br><span class="line">resources:</span><br><span class="line">  limits:</span><br><span class="line">    cpu: 200m</span><br><span class="line">    memory: 150Mi</span><br><span class="line">  requests:</span><br><span class="line">    cpu: 100m</span><br><span class="line">    memory: 30Mi</span><br></pre></td></tr></table></figure><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-1.png"></p><p>From this page you can open the page of each component.<br>For example:</p><ul><li>Alertmanager: view current alerts.</li><li>Grafana: view charts of the monitoring data.</li><li>Prometheus Graph: run Prometheus expressions.</li><li>Prometheus Rules: view the alerting expressions configured in Prometheus.</li><li>Prometheus Targets: view the endpoints that metrics are scraped from.</li></ul><h4 id="配置自定义监控指标"><a href="#配置自定义监控指标" class="headerlink" title="配置自定义监控指标"></a>Configuring custom metrics</h4><p>Enabling monitoring automatically adds a set of ServiceMonitor scrape rules and PrometheusRule alerting rules. They mainly cover platform components and the state of the cluster's nodes.</p><p>Custom metrics cover your own workloads, for example JMX monitoring of a Java application.</p><p>There is an official Prometheus JMX exporter. You only need to download its jar and have the Java application load the jar together with its configuration.<br>Taking one application as an example, the overall flow is:<br>Use the JMX exporter to start a small HTTP server inside the Java process.<br>Configure Prometheus to scrape the metrics that HTTP server exposes.<br>Connect Grafana to Prometheus and configure a dashboard.<br>Create a directory:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir  -p /Dockerfile/jmx-exporter/</span><br></pre></td></tr></table></figure><p>Download the JMX exporter jar into this directory.</p><figure 
class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">https://github.com/prometheus/jmx_exporter</span><br><span class="line">https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.12.0/jmx_prometheus_javaagent-0.12.0.jar</span><br></pre></td></tr></table></figure><p>Write the JMX exporter configuration file and place it in the same &#x2F;Dockerfile&#x2F;jmx-exporter&#x2F; directory, so that it is part of the Docker build context.<br>Create simple-config.yml with the following content:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">---</span><br><span class="line">rules:</span><br><span class="line">- pattern: &quot;.*&quot;</span><br></pre></td></tr></table></figure><p>This pattern exports every available metric.<br>To integrate the JMX exporter into Tomcat, write a new Dockerfile. Both the jar and the configuration file are copied to the paths that the -javaagent option refers to.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">FROM tomcat</span><br><span class="line">COPY ./jmx_prometheus_javaagent-0.12.0.jar /jmx-exporter/jmx_prometheus_javaagent-0.12.0.jar</span><br><span class="line">COPY ./simple-config.yml /jmx-exporter/simple-config.yml</span><br><span class="line">ENV CATALINA_OPTS=&quot;-Xms64m -Xmx128m -javaagent:/jmx-exporter/jmx_prometheus_javaagent-0.12.0.jar=6060:/jmx-exporter/simple-config.yml&quot;</span><br></pre></td></tr></table></figure><p>Run docker build again, then start the image with the docker run command below to see the collected metrics. Port 6060 is the JMX exporter port.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">docker build -t tomcat:v1.0 .</span><br><span class="line">docker run -itd -p 8080:8080 -p 6060:6060 tomcat:v1.0</span><br></pre></td></tr></table></figure><p>Open it in a browser:<br><a 
href="http://host_ip:6060/">http://host_ip:6060</a><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-2.png"></p><p>Deploy the application to Rancher.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-3.png"></p><p>Label the Service so that the ServiceMonitor can select it.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl label  svc tomcat app=tomcat</span><br></pre></td></tr></table></figure><p>Create the ServiceMonitor.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: monitoring.coreos.com/v1</span><br><span class="line">kind: ServiceMonitor</span><br><span class="line">metadata:</span><br><span class="line">  name: tomcat-app</span><br><span class="line">  namespace: default</span><br><span class="line">spec:</span><br><span class="line">  endpoints:</span><br><span class="line">  - port: exporter</span><br><span class="line">  selector:</span><br><span class="line">    matchLabels:</span><br><span class="line">      app: tomcat</span><br></pre></td></tr></table></figure><p>Once it is created, the corresponding target appears in Prometheus.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-4.png"></p><p>The corresponding metrics are being scraped as well.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-5.png"></p><p>Open Grafana to add a dashboard. The default username and password are admin&#x2F;prom-operator.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-6.png"></p><p><img 
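src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-7.png"></p><p>Add a dashboard.</p><p>Enter dashboard ID 8878. In an offline environment, download the dashboard in advance and import it as JSON.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-8.png"></p><h4 id="配置告警"><a href="#配置告警" class="headerlink" title="配置告警"></a>Configuring alerts</h4><p>PrometheusRule defines alerting rules, and a set of rules for platform components and nodes is included by default. Routes and Receivers deliver those alerts to the right people. The routing tree classifies alerts quickly and sends each one to the designated recipient. Both Routes and Receivers are configured through a single AlertmanagerConfig.<br>Create an AlertmanagerConfig<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-21.png"></p><p>Choose Email as the alert type.<br>Receivers configure the notification channel.<br>Fill in the SMTP address, the account and password, and the default recipient address.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-35.png"></p><p>Store the mailbox password in a Secret of type Opaque.</p><p>Routes match alerting rules to receivers. The root route that is created by default matches every alerting rule; set its receiver to the one created above.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-22.png"><br>At this point every alert is sent to the configured receiver.</p><p>To split alerts up, create additional Routes and use labels to match the alert name defined in the Prometheus rules.</p><p>For example, to match the rule alert:etcdNoLeader:</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-23.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-24.png"></p><p>A regular expression can also be used to match several rules, for example:</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-25.png"></p><p>The Grouping settings classify and suppress alerts so that large numbers of redundant notifications do not get in the way.</p><p>group_by: defines how alerts are grouped, which has a suppressing effect because alerts in the same group are aggregated and sent together. For example, if host01 runs a database, its alerts include host down and mysql down. If the host goes down, MySQL is necessarily down as well. With both alerts in one group, the host down and mysql down alerts are aggregated into a single notification.</p><p>group_wait: how long to wait before sending the first notification for a newly created alert group.</p><p>group_interval: how long to wait before sending a notification about different alerts that arrive in an existing alert group.</p><p>repeat_interval: how long Alertmanager waits before resending a notification when an alert group keeps firing the same alerts, to avoid prolonged noise.</p><p>Once a route matches and the alert fires, the corresponding alert e-mail arrives.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6-26.png"></p><h4 id="自定义告警"><a href="#自定义告警" class="headerlink" 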
title="自定义告警"></a>Custom alerts</h4><p>When the default alerting rules do not meet your needs, add custom alerts, which in practice means adding a PrometheusRule. The example below adds an alert for Pods that are not in the Running state.</p><p>UI configuration</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-9.png"></p><p>Equivalent YAML</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: monitoring.coreos.com/v1</span><br><span class="line">kind: PrometheusRule</span><br><span class="line">metadata:</span><br><span class="line">  name: podmonitor</span><br><span class="line">  namespace: cattle-monitoring-system</span><br><span class="line">spec:</span><br><span class="line">  groups:</span><br><span class="line">  - name: pod_node_ready</span><br><span class="line">    rules:</span><br><span class="line">    - alert: pod_not_ready</span><br><span class="line">      annotations:</span><br><span class="line">        message: &#x27;&#123;&#123; $labels.namespace &#125;&#125;/&#123;&#123; $labels.pod &#125;&#125; is not ready.&#x27;</span><br><span class="line">      expr: &#x27;sum by (namespace, pod) (kube_pod_status_phase&#123;phase!~&quot;Running|Succeeded&quot;&#125;)</span><br><span class="line">        &gt; 0 &#x27;</span><br><span class="line">      for: 180s</span><br><span class="line">      labels:</span><br><span class="line">        severity: 
严重</span><br></pre></td></tr></table></figure><p>for: how long the condition must hold before the alert fires<br>message: the text included in the alert notification<br>labels.severity: the alert severity (严重 means critical)<br>expr: the expression that retrieves the metric</p><p>Configure the alert receiver</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-10.png"></p><p>Match this PrometheusRule by label<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-11.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/2.6monitor-12.png"></p><h3 id="常见问题"><a href="#常见问题" class="headerlink" title="常见问题"></a>Common problems</h3><p>1. No e-mail arrives after an alert fires<br>The setup used port 465 of the 163 mail SMTP service.<br>Alertmanager reported:<br><code>msg=&quot;Notify for alerts failed&quot; num_alerts=1 err=&quot;cattle-monitoring-system-test-test/email[0]: notify retry canceled after 16 attempts: send STARTTLS command: 454 Command not permitted when TLS active&quot;</code></p><p>Change the configuration as follows:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">spec:</span><br><span class="line">  receivers:</span><br><span class="line">  - emailConfigs:</span><br><span class="line">    - authPassword:</span><br><span class="line">        key: password</span><br><span class="line">        name: altermanager</span><br><span class="line">      authUsername: xx@163.com</span><br><span class="line">      from: xx@163.com</span><br><span class="line">      requireTLS: false</span><br><span class="line">      sendResolved: false</span><br><span class="line">      smarthost: smtp.163.com:465</span><br><span class="line">      tlsConfig: &#123;&#125;</span><br><span class="line">      to: 
xx@qq.com</span><br></pre></td></tr></table></figure><p>Add <code>requireTLS: false</code>. Port 465 already uses implicit TLS, so Alertmanager must not send STARTTLS on top of it.</p><p>2. The internal mail server uses a certificate that is not issued by a trusted CA</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">email_configs:</span><br><span class="line">  - to: &#x27;xxx&#x27;</span><br><span class="line">    tls_config:</span><br><span class="line">      insecure_skip_verify: true</span><br></pre></td></tr></table></figure><p>Add <code>insecure_skip_verify: true</code> under <code>tls_config</code>.<br>References:</p><p><a href="https://mp.weixin.qq.com/s/fT-AXnPP8rrWxTposbi-9A">https://mp.weixin.qq.com/s/fT-AXnPP8rrWxTposbi-9A</a></p><p><a href="https://github.com/prometheus-operator/prometheus-operator">https://github.com/prometheus-operator/prometheus-operator</a></p><p><a href="https://rancher.com/docs/rancher/v2.6/en/monitoring-alerting/guides/enable-monitoring/">https://rancher.com/docs/rancher/v2.6/en/monitoring-alerting/guides/enable-monitoring/</a><br><a href="https://mp.weixin.qq.com/s/c9QGlwQrhLgptNsnQ1m6-w">https://mp.weixin.qq.com/s/c9QGlwQrhLgptNsnQ1m6-w</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;&lt;tr&gt;
&lt;td&gt;Rancher&lt;/td&gt;
&lt;td&gt;.9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Kubernetes&lt;/td&gt;
&lt;td&gt;1.23</summary>
      
    
    
    
    <category term="kubernetes" scheme="http://yoursite.com/categories/kubernetes/"/>
    
    
    <category term="kubernetes" scheme="http://yoursite.com/tags/kubernetes/"/>
    
  </entry>
  
  <entry>
    <title>Basic usage of NeuVector, a cloud-native security platform</title>
    <link href="http://yoursite.com/2022/04/17/neuvector_2/"/>
    <id>http://yoursite.com/2022/04/17/neuvector_2/</id>
    <published>2022-04-17T13:45:59.000Z</published>
    <updated>2022-04-17T13:45:59.000Z</updated>
    
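    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>The previous NeuVector article focused on installation and deployment. This one walks through NeuVector's basic features in practice: vulnerability management, compliance and secrets checks, policy management, admission control policies, dynamic security response and behavior monitoring. It is based on NeuVector:5.0.0-preview.1, the first open-source release of NeuVector.</p><h3 id="安全漏洞管理"><a href="#安全漏洞管理" class="headerlink" title="安全漏洞管理"></a>Vulnerability management</h3><p>NeuVector integrates a CVE vulnerability database that is updated automatically every day, and it can scan the platform (Kubernetes), hosts, containers and image registries for vulnerabilities.</p><p>Configure automatic scanning so that a scan runs whenever the vulnerability database is updated or a new node or container joins.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-1.png"></p><p>Each vulnerability has a risk level, along with the affected component version and the version that fixes it.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-2.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-3.png"></p><p>For each vulnerability you can see its publication date, its scope of impact and the affected component versions.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-4.png"></p><p>Vulnerabilities can be filtered by whether a fix exists, by severity, by publication date and so on.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-5.png"></p><h4 id="配置对接镜像仓库扫描"><a href="#配置对接镜像仓库扫描" class="headerlink" title="配置对接镜像仓库扫描"></a>Configuring image registry scanning</h4><p>Several kinds of image registry are supported, such as docker-registry (Harbor), JFrog Artifactory and Nexus.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-6.png"></p><p>Taking Harbor as an example, fill in the connection settings and credentials. The filter defines the scope of the scan: use <code>uat/*</code> to scan every image in the uat project, or <code>*</code> to scan every image in Harbor. The test settings function shows which images the expression matches.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-7.png"></p><h3 id="合规性检查和机密性检查"><a href="#合规性检查和机密性检查" class="headerlink" title="合规性检查和机密性检查"></a>Compliance checks and secrets checks</h3><p>NeuVector's compliance auditing includes CIS benchmark tests, custom checks, secrets auditing, and scans against industry-standard templates for PCI, GDPR and other regulations.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-8.png"></p><p>The Type column indicates the benchmark a check belongs to. For example, K.4.1.1 corresponds to item 4.1.1 of the Kubernetes CIS benchmark.<br>Checks for containers start with D and checks for images start with I.</p><p>Note: the General Data Protection Regulation (GDPR) is a regulation of the European Union.</p><p>Compliance checks also look for leaked secrets.<br><img 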
src="https://open-docs.neuvector.com/user/pages/06.scanning/01.scanning/02.compliance/secrets_image_4.png"></p><p>These include:</p><pre><code>General Private Keys
General detection of credentials including &#39;apikey&#39;, &#39;api_key&#39;, &#39;password&#39;, &#39;secret&#39;, &#39;passwd&#39; etc.
General passwords in yaml files including &#39;password&#39;, &#39;passwd&#39;, &#39;api_token&#39; etc.
General secrets keys in key/value pairs
Putty Private key
Xml Private key
AWS credentials / IAM
Facebook client secret
Facebook endpoint secret
Facebook app secret
Twitter client Id
Twitter secret key
Github secret
Square product Id
Stripe access key
Slack API token
Slack web hooks
LinkedIn client Id
LinkedIn secret key
Google API key
SendGrid API key
Twilio API key
Heroku API key
MailChimp API key
MailGun API key</code></pre><h3 id="策略管理"><a href="#策略管理" class="headerlink" title="策略管理"></a>Policy management</h3><p>NeuVector manages containers and hosts through groups. Compliance checks, network rules, process and file access rules, and DLP&#x2F;WAF inspection are all configured per group.</p><p>NeuVector automatically adds the hosts of the current cluster to the nodes group and automatically creates groups whose names start with nv. for the containers in the cluster.<br><img 
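src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-9.png"><br>NeuVector groups support three modes: learning mode (named Discover in the NeuVector UI), monitor mode, and protect mode. Each works as follows.<br>Learning mode:<br>Learns and records the network connections between containers and hosts, and the processes that are executed.<br>Automatically builds a whitelist of network rules that protects the application's normal network behavior.<br>Sets a security baseline for the processes running in each service's containers and creates a whitelist of process profile rules.</p><p>Monitor mode:<br>NeuVector monitors network and process activity on containers and hosts, and raises an alert in NeuVector for any behavior that was not recorded in learning mode.</p><p>Protect mode:</p><p>NeuVector monitors network and process activity on containers and hosts, and denies outright any behavior that was not recorded in learning mode.</p><p>Newly created container workloads are discovered automatically and start in learning mode by default; the default can be changed to monitor or protect mode in the settings.</p><p>When the source and destination groups are in different modes, the effective mode is as follows:</p><table><thead><tr><th>Source group mode</th><th>Destination group mode</th><th>Effective mode</th></tr></thead><tbody><tr><td>Learning</td><td>Monitor</td><td>Learning</td></tr><tr><td>Learning</td><td>Protect</td><td>Learning</td></tr><tr><td>Monitor</td><td>Learning</td><td>Learning</td></tr><tr><td>Monitor</td><td>Protect</td><td>Monitor</td></tr><tr><td>Protect</td><td>Learning</td><td>Learning</td></tr><tr><td>Protect</td><td>Monitor</td><td>Monitor</td></tr></tbody></table><p>To keep workloads running stably, when the two modes differ the less restrictive one takes effect.</p><p>Recommended practice in production:<br>A possible rollout path: 1. When launching a new workload, run it in learning mode for a while and carry out full functional and call testing, to capture the workload's real network connections and process executions. 2. Run it in monitor mode for a while to see whether any unexpected cases appear, and add rules where judged necessary. 3. Finally, switch all containers to protect mode as the final state.</p><h4 id="动态微隔离"><a href="#动态微隔离" class="headerlink" title="动态微隔离"></a>Dynamic micro-segmentation</h4><p>Scenario 1: isolating Pods from each other with network policies<br>Create four Nginx workloads in Kubernetes, named and used as follows.<br>workload_name: test-web1  image: nginx  purpose: web server<br>workload_name: test-con1  image: nginx  purpose: client 1<br>workload_name: test-con2  image: nginx  purpose: client 2<br>workload_name: test-con3  image: nginx  purpose: client 3</p><p>Create the workloads:</p><pre><code>kubectl create deployment test-web1 --image=nginx
kubectl expose deployment/test-web1 --port=80 --type=NodePort
kubectl create deployment test-con1 --image=nginx
kubectl create deployment test-con2 --image=nginx
kubectl create deployment test-con3 --image=nginx</code></pre><p>NeuVector now generates the corresponding groups automatically.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-10.png"></p><p>From test-con1, access test-web1 with curl.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-11.png"></p><p>The request succeeds because the groups are in learning mode, and NeuVector automatically adds a rule for this access.</p><p><img 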
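src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-12.png"></p><p>Set both test-web1 and test-con2 to monitor mode.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-13.png"></p><p>Then curl test-web1 from test-con2.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-14.png"></p><p>test-con2 can still reach test-web1, but NeuVector generates an alert.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-15.png"></p><p>The corresponding link in the network activity graph also turns red.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-16.png"></p><p>Set both test-web1 and test-con2 to protect mode, then curl test-web1 from test-con2 again.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-17.png"></p><p>Because curl was never run in this container during learning mode and is not one of the executables NeuVector allows by default, the process itself is blocked and the request cannot be made.</p><p>Set test-con1 to protect mode. test-con1 can now no longer reach external networks.</p><p>Access can be opened by adding a custom network rule.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-18.png"></p><p>The network rules page lists the rules that were generated in learning mode.</p><p>Add an external access rule.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-19.png"></p><p>NeuVector has deep insight into application behavior and inspects payloads to identify the application protocol. Recognized protocols include HTTP, HTTPS, SSL, SSH, DNS, DHCP, NTP, TFTP, ECHO, RTSP, SIP, MySQL, Redis, Zookeeper, Cassandra, MongoDB, PostgreSQL, Kafka, Couchbase, ActiveMQ, ElasticSearch, RabbitMQ, Radius, VoltDB, Consul, Syslog, Etcd, Spark, Apache, Nginx, Jetty, NodeJS, Oracle, MSSQL, and gRPC.</p><p>curl from test-con1 to www.baidu.com now succeeds.</p><p>Summary:<br>Beyond the policies above, NeuVector has built-in network threat detection that quickly identifies common network attacks and protects workload containers.</p><p>Threat detection is active regardless of the group's mode. In learning and monitor modes, threats raise alerts, which can be found under Notifications -&gt; Security Events. In protect mode, threats are both alerted on and blocked. Response rules can also be created based on threat detections.</p><p>The detected threats include:</p><pre><code>SYN flood attack
ICMP flood attack
IP Teardrop attack
TCP split handshake attack
PING death attack
DNS flood DDOS attack
Detect SSH version 1, 2 or 3
Detect SSL TLS v1.0
SSL heartbleed attack
Detect HTTP negative content-length buffer overflow
HTTP smuggling attack
HTTP Slowloris DDOS attack
TCP small window attack
DNS buffer overflow attack
Detect MySQL access deny
DNS zone transfer attack
ICMP tunneling attack
DNS null type attack
SQL injection attack
Apache Struts RCE attack
DNS tunneling attack
TCP Small MSS attack
Cipher Overflow attack
Kubernetes man-in-the-middle attack per CVE-2020-8554</code></pre><h4 id="进程管理"><a href="#进程管理" class="headerlink" title="进程管理"></a>Process management</h4><p>NeuVector can manage the processes running in containers and on hosts.<br>In learning mode, the processes and commands that run are added to the rules automatically.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-20.png"></p><p>Running <code>df -h</code> in test-con1, which is now in protect mode, fails with <code>bash: /bin/df: Operation not permitted</code>.<br>Add a process rule for <code>df</code> to the <code>nv.test-con1.default</code> group.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-21.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-22.png"></p><p>Run the command again and it now succeeds.</p><p>Process management also applies to nodes: rules on the nodes group restrict which processes may run on the hosts. For example, to block docker cp, learning mode reveals that the <code>docker-tar</code> process does the work behind the scenes.<br>Switch the nodes to protect mode and deny the <code>docker-tar</code> process.</p><p><code>docker cp</code> can then no longer be run on the nodes.</p><h3 id="准入策略控制"><a href="#准入策略控制" class="headerlink" title="准入策略控制"></a>Admission control</h3><p>NeuVector integrates with Kubernetes admission control, so admission rules can be configured in the UI to intercept requests and validate the resource objects they carry.<br>NeuVector supports many admission criteria, such as limits on image CVE vulnerabilities, privileged-mode deployments, images that run as root, and specific labels.</p><p>Enable the feature under Policy - Admission Control. Note: admission control must already be enabled on the Kubernetes cluster.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-24.png"></p><p>NeuVector admission control supports two modes, monitor and protect, with the same meaning as the group modes. Here we switch straight to protect mode and add a rule.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-25.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-26.png"></p><p>Once the rule is added, deploying a privileged container from Rancher is rejected, which shows that the rule is in effect.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-27.png"></p><h3 id="动态安全响应"><a href="#动态安全响应" class="headerlink" title="动态安全响应"></a>Dynamic security response</h3><p>NeuVector's response mechanism lets you configure response rules that react dynamically to security events, including vulnerability scan results, CIS benchmark results, admission control events, and more.</p><p><img 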
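src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-28.png"></p><p>Response actions include quarantine, webhook notification, and log suppression.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-29.png"><br>Quarantine: all inbound and outbound network traffic of the container is cut off.<br>Webhook notification: the triggering event is sent as an alert through a webhook.<br>Log suppression: the triggered alert is suppressed.</p><p>Taking a CVE rule as an example: configure containers affected by CVE-2020-16156 to be quarantined.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-30.png"></p><p>The group name sets the scope: if it is left empty the rule applies to all groups; fill in a group name to apply the rule to that group only.</p><p>After the rule is configured, curl to the nginx container from within the cluster fails, and NeuVector shows the container as quarantined.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-31.png"></p><p>When deleting the rule, you can also choose to release the containers it quarantined.</p><p>Notes:<br>1. The quarantine action does not apply to rules triggered by host events.<br>2. Each rule can have multiple actions.</p><h3 id="行为监控"><a href="#行为监控" class="headerlink" title="行为监控"></a>Behavior monitoring</h3><h4 id="网络流量可视化"><a href="#网络流量可视化" class="headerlink" title="网络流量可视化"></a>Network traffic visualization</h4><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-32.png"></p><p>Network traffic visualization gives a clear view of the network connections inside the container cluster and of each container's current sessions. Connections can be filtered and shown as a graph, which helps locate network problems quickly.</p><h4 id="流量抓包"><a href="#流量抓包" class="headerlink" title="流量抓包"></a>Packet capture</h4><p>Packets can be captured per container, so network problems can be investigated in depth without logging in to the host or obtaining elevated privileges.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-33.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector2-34.png"></p><p>Captured packets can be downloaded and analyzed in Wireshark.</p>]]></content>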
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h3&gt;&lt;p&gt;The previous NeuVector article focused on installation and deployment. This article walks through NeuVector's basic features hands-on, mainly covering NeuVector security</summary>
      
    
    
    
    <category term="安全" scheme="http://yoursite.com/categories/%E5%AE%89%E5%85%A8/"/>
    
    
    <category term="安全" scheme="http://yoursite.com/tags/%E5%AE%89%E5%85%A8/"/>
    
  </entry>
  
  <entry>
    <title>Simulating BGP Peers with Bird</title>
    <link href="http://yoursite.com/2022/03/23/bird_bgpper/"/>
    <id>http://yoursite.com/2022/03/23/bird_bgpper/</id>
    <published>2022-03-23T13:45:59.000Z</published>
    <updated>2022-03-23T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<h3 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h3><p>The Calico network plugin is best known for its BGP mode. Testing Calico BGP route synchronization across subnets requires the router that connects the two subnets to support BGP, which makes the test environment considerably harder to build. This document uses Bird to turn a virtual machine into a software router and configures it as the BGP peer of the Kubernetes nodes, so that BGP routes are synchronized.</p><p>Software versions</p><table><thead><tr><th>Software</th><th>Version</th></tr></thead><tbody><tr><td>Kubernetes</td><td>v1.20.15</td></tr><tr><td>calico</td><td>v3.17.2</td></tr></tbody></table><h3 id="拓扑架构图"><a href="#拓扑架构图" class="headerlink" title="拓扑架构图"></a>Topology</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/bird-1.png"></p><p>Hostname: rke-node3<br>host-ip: 192.168.0.7<br>pod-cidr: 10.41.113.192&#x2F;26</p><p>Hostname: rke-node4<br>host-ip: 192.168.0.25<br>pod-cidr: 10.41.57.192&#x2F;26</p><p>Hostname: rke-node6<br>host-ip: 192.168.2.14<br>pod-cidr: 10.41.210.128&#x2F;26</p><p>Hostname: rke-node7<br>host-ip: 192.168.2.15<br>pod-cidr: 10.41.210.0&#x2F;26</p><p>The Kubernetes nodes are spread across two subnets, which are connected by a single VM. Bird runs on that VM as a software router to carry traffic between the subnets, and all nodes belong to the same autonomous system (AS).</p><p>Note: if the underlying platform is OpenStack, the security group on the network ports must be disabled.</p><h3 id="Bird部署配置"><a href="#Bird部署配置" class="headerlink" title="Bird部署配置"></a>Bird deployment and configuration</h3><h4 id="节点配置"><a href="#节点配置" class="headerlink" title="节点配置"></a>Node configuration</h4><p>Bird is deployed on one VM running CentOS 7.6. For this node to act as a software router, the following must be enabled.</p><p>Kernel IP forwarding</p><pre><code>sysctl -a|grep &quot;net.ipv4.ip_forward = 1&quot;
net.ipv4.ip_forward = 1</code></pre><p>iptables packet forwarding</p><pre><code>iptables -P FORWARD ACCEPT</code></pre><p>Nodes that need to reach each other must have static routes to the other subnet.</p><p>For example, on nodes in 192.168.0.0&#x2F;24:</p><pre><code>ip route add 192.168.2.0/24 via 192.168.0.40 dev ens3</code></pre><p>And on nodes in 192.168.2.0&#x2F;24:</p><pre><code>ip route add 192.168.0.0/24 via 192.168.2.16 dev ens3</code></pre><p>Verify connectivity by pinging a host in 192.168.2.0&#x2F;24 from a host in 192.168.0.0&#x2F;24.</p><h4 id="Bird配置"><a href="#Bird配置" class="headerlink" title="Bird配置"></a>Bird configuration</h4><p>Bird configuration file</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 
class="line">3</span><br></pre></td><td class="code"><pre><span class="line">mkdir /bird/</span><br><span class="line">vim /bird/bird.conf</span><br><span class="line"></span><br></pre></td></tr></table></figure><pre><code>router id 192.168.0.40;

filter calico_export_to_bgp_peers &#123;
  if ( net ~ 10.41.0.0/16 ) then &#123;
    accept;
  &#125;
  if ( net ~ 10.42.0.0/16 ) then &#123;
    accept;
  &#125;
  reject;
&#125;

filter calico_kernel_programming &#123;
  if ( net ~ 10.41.0.0/16 ) then &#123;
    accept;
  &#125;
  if ( net ~ 10.42.0.0/16 ) then &#123;
    accept;
  &#125;
  accept;
&#125;

# Configure synchronization between routing tables and kernel.
protocol kernel &#123;
  learn;             # Learn all alien routes from the kernel
  persist;           # Don&#39;t remove routes on bird shutdown
  scan time 2;       # Scan kernel routing table every 2 seconds
  import all;
  export filter calico_kernel_programming; # Default is export none
  graceful restart;  # Turn on graceful restart to reduce potential flaps in
                     # routes when reloading BIRD configuration.  With a full
                     # automatic mesh, there is no way to prevent BGP from
                     # flapping since multiple nodes update their BGP
                     # configuration at the same time, GR is not guaranteed to
                     # work correctly in this scenario.
  merge paths on;    # Allow export multipath routes (ECMP)
&#125;

protocol device &#123;
  debug &#123; states &#125;;
  scan time 2;    # Scan interfaces every 2 seconds
&#125;

protocol direct &#123;
  debug &#123; states &#125;;
  interface -&quot;cali*&quot;, -&quot;kube-ipvs*&quot;, &quot;*&quot;; # Exclude cali* and kube-ipvs* but
                                          # include everything else.  In
                                          # IPVS-mode, kube-proxy creates a
                                          # kube-ipvs0 interface. We exclude
                                          # kube-ipvs0 because this interface
                                          # gets an address for every in use
                                          # cluster IP. We use static routes
                                          # for when we legitimately want to
                                          # export cluster IPs.
&#125;

# Template for all BGP clients
template bgp bgp_template &#123;
  debug &#123; states &#125;;
  description &quot;Connection to BGP peer&quot;;
  local as 63400;
  multihop;
  gateway recursive; # This should be the default, but just in case.
  import all;        # Import all routes, since we don&#39;t know what the upstream
                     # topology is and therefore have to trust the ToR/RR.
  export filter calico_export_to_bgp_peers;  # Only want to export routes for workloads.
  source address 192.168.0.40;  # The local address we use for the TCP connection
  add paths on;
  graceful restart;  # See comment in kernel section about graceful restart.
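  # Note on the three timers that follow. The meanings given here are assumed
  # from the BIRD 1.x BGP options and are not stated in the original article,
  # so check them against the manual for the BIRD version you run:
  #   connect delay time 2;  wait 2 s after protocol startup before the first connection attempt
  #   connect retry time 5;  wait 5 s before retrying a failed connection attempt
  #   error wait time 5,30;  after an error, wait between 5 s and 30 s before restarting the session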
  connect delay time 2;
  connect retry time 5;
  error wait time 5,30;
&#125;

protocol bgp Node_192_168_0_25 from bgp_template &#123;
  rr client;
  neighbor 192.168.0.25 as 63400;
&#125;

protocol bgp Node_192_168_0_7 from bgp_template &#123;
  rr client;
  neighbor 192.168.0.7 as 63400;
&#125;

protocol bgp Node_192_168_2_14 from bgp_template &#123;
  rr client;
  neighbor 192.168.2.14 as 63400;
&#125;

protocol bgp Node_192_168_2_15 from bgp_template &#123;
  rr client;
  neighbor 192.168.2.15 as 63400;
&#125;</code></pre><p>Replace the router id, pod CIDRs, neighbor IPs, and AS number in the configuration with the values for the nodes you actually need to peer with.</p><p>For ease of deployment, Bird is started with Docker here, using the following command:</p><pre><code>docker run  -itd  --net=host --uts=host --cap-add=NET_ADMIN --cap-add=NET_BROADCAST --cap-add=NET_RAW -v /bird/:/etc/bird:ro ibhde/bird4</code></pre><p>Check that the container status is Up:</p><pre><code>docker ps -a</code></pre><h3 id="Calico-BGP对接"><a href="#Calico-BGP对接" class="headerlink" title="Calico BGP对接"></a>Calico BGP peering</h3><p>Install calicoctl on all nodes:</p><pre><code>wget https://github.com/projectcalico/calicoctl/releases/download/v3.17.4/calicoctl-linux-amd64
mv calicoctl-linux-amd64 /usr/bin/calicoctl
chmod a+x /usr/bin/calicoctl</code></pre><p>Disable the global node-to-node full mesh:</p><pre><code>cat &lt;&lt;EOF | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPConfiguration
metadata:
  name: default
spec:
  logSeverityScreen: Info
  nodeToNodeMeshEnabled: false
  asNumber: 63400
EOF</code></pre><p>Label the nodes</p><p>The two groups of nodes get different labels: nodes in 192.168.2.0&#x2F;24 are labeled rack&#x3D;rack-1 and peer with 192.168.2.16; nodes in 192.168.0.0&#x2F;24 are labeled rack&#x3D;rack-2 and peer with 192.168.0.40.</p><pre><code>kubectl label nodes rke-node3 rack=rack-2
kubectl label nodes rke-node4 rack=rack-2
kubectl label nodes rke-node6 rack=rack-1
kubectl label nodes rke-node7 rack=rack-1</code></pre><p>Configure the BGP peers with calicoctl:</p><pre><code>cat &lt;&lt;EOF | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack1-tor
spec:
  peerIP: 192.168.2.16
  asNumber: 63400
  nodeSelector: rack == &#39;rack-1&#39;
EOF</code></pre><pre><code>cat &lt;&lt;EOF | calicoctl apply -f -
apiVersion: projectcalico.org/v3
kind: BGPPeer
metadata:
  name: rack2-tor
spec:
  peerIP: 192.168.0.40
  asNumber: 63400
  nodeSelector: rack == &#39;rack-2&#39;
EOF</code></pre><p>Check the BGP peer connections</p><p>Run the following on a node labeled rack&#x3D;rack-2; it should show an established session with the BGP peer 192.168.0.40.</p><pre><code>calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+------------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |   SINCE    |    INFO     |
+--------------+---------------+-------+------------+-------------+
| 192.168.0.40 | node specific | up    | 2022-03-18 | Established |
+--------------+---------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.</code></pre><p>Run the same command on a node labeled rack&#x3D;rack-1; it should show an established session with the BGP peer 192.168.2.16.</p><pre><code>calicoctl node status
Calico process is running.

IPv4 BGP status
+--------------+---------------+-------+------------+-------------+
| PEER ADDRESS |   PEER TYPE   | STATE |   SINCE    |    INFO     |
+--------------+---------------+-------+------------+-------------+
| 192.168.2.16 | node specific | up    | 2022-03-18 | Established |
+--------------+---------------+-------+------------+-------------+

IPv6 BGP status
No IPv6 peers found.</code></pre><p>Create Pods to verify route synchronization:</p><pre><code>kubectl create deployment test --image=nginx --replicas=5</code></pre><p>Ping between the five replicas to verify that cross-node networking works.</p><p>Check the learned routes on the Bird node:</p><pre><code>ip route
default via 192.168.2.1 dev eth0
10.41.210.0/26 via 192.168.2.15 dev eth0 proto bird
10.42.57.192/26 via 192.168.0.25 dev eth1 proto bird
10.42.113.192/26 via 192.168.0.7 dev eth1 proto bird
10.42.210.128/26 via 192.168.2.14 dev eth0 proto bird
192.168.0.0/24 dev eth1 proto kernel scope link src 192.168.0.40
192.168.2.0/24 dev eth0 proto kernel scope link src 192.168.2.16</code></pre><p>Bird has learned the pod CIDR of every node in the cluster.</p><p>Check the routes on any node; taking 192.168.0.7 as an example, the node also holds routes for every pod CIDR in the cluster.</p><pre><code>ip route
default via 192.168.0.1 dev ens3 proto dhcp src 192.168.0.7 metric 100
10.41.57.192/26 via 192.168.0.25 dev ens3 proto bird
blackhole 10.41.113.192/26 proto bird
10.41.210.0/26 via 192.168.0.40 dev ens3 proto bird
10.41.210.128/26 via 192.168.0.40 dev ens3 proto bird
10.42.57.192/26 via 192.168.0.25 dev ens3 proto bird
192.168.0.0/24 dev ens3 proto kernel scope link src 192.168.0.7
192.168.2.0/24 via 192.168.0.40 dev ens3</code></pre><h4 id="节点POD-CIDR路由统一走默认路由"><a href="#节点POD-CIDR路由统一走默认路由" class="headerlink" title="节点POD-CIDR路由统一走默认路由"></a>Routing all node pod-CIDR traffic through a default route</h4><p>With the setup so far, every node's pod CIDR is synchronized to every node in the cluster, so the number of route entries grows with the size of the Kubernetes cluster. Instead, a default route for the pod networks can be advertised so that all pod traffic from the nodes is sent to the Bird software router. A further benefit is that, with some hardware SDN devices, this makes traffic monitoring possible. Keep in mind, however, how much traffic the router itself can carry.</p><p>Taking the Bird configuration as an example:</p><pre><code>router id 192.168.0.40;

protocol static &#123;
  route 10.41.0.0/16 via 192.168.0.40;
  route 10.42.0.0/16 via 192.168.0.40;
&#125;

filter calico_export_to_bgp_peers &#123;
  if ( net ~ 10.41.0.0/16 ) then &#123;
    accept;
  &#125;
  if ( net ~ 10.42.0.0/16 ) then &#123;
    accept;
  &#125;
  reject;
&#125;

filter calico_kernel_programming &#123;
  if ( net ~ 10.41.0.0/16 ) then &#123;
    accept;
  &#125;
  if ( net ~ 10.42.0.0/16 ) then &#123;
    accept;
  &#125;
  accept;
&#125;

# Configure synchronization between routing tables and kernel.
protocol kernel &#123;
  learn;             # Learn all alien routes from the kernel
  persist;           # Don&#39;t remove routes on bird shutdown
  scan time 2;       # Scan kernel routing table every 2 seconds
  import all;
  export filter calico_kernel_programming; # Default is export none
  graceful restart;  # Turn on graceful restart to reduce potential flaps in
                     # routes when reloading BIRD configuration.  With a full
                     # automatic mesh, there is no way to prevent BGP from
                     # flapping since multiple nodes update their BGP
                     # configuration at the same time, GR is not guaranteed to
                     # work correctly in this scenario.
  merge paths on;    # Allow export multipath routes (ECMP)
&#125;

protocol device &#123;
  debug &#123; states &#125;;
  scan time 2;    # Scan interfaces every 2 seconds
&#125;

protocol direct &#123;
  debug &#123; states &#125;;
  interface -&quot;cali*&quot;, -&quot;kube-ipvs*&quot;, &quot;*&quot;; # Exclude cali* and kube-ipvs* but
                                          # include everything else.  In
                                          # IPVS-mode, kube-proxy creates a
                                          # kube-ipvs0 interface. We exclude
                                          # kube-ipvs0 because this interface
                                          # gets an address for every in use
                                          # cluster IP. We use static routes
                                          # for when we legitimately want to
                                          # export cluster IPs.
&#125;

# Template for all BGP clients
template bgp bgp_template &#123;
  debug &#123; states &#125;;
  description &quot;Connection to BGP peer&quot;;
  local as 63400;
  multihop;
  gateway recursive; # This should be the default, but just in case.
  import all;        # Import all routes, since we don&#39;t know what the upstream
                     # topology is and therefore have to trust the ToR/RR.
  export filter calico_export_to_bgp_peers;  # Only want to export routes for workloads.
  source address 192.168.0.40;  # The local address we use for the TCP connection
  add paths on;
  graceful restart;  # See comment in kernel section about graceful restart.
  connect delay time 2;
  connect retry time 5;
  error wait time 5,30;
&#125;

protocol bgp Node_192_168_0_25 from bgp_template &#123;
  neighbor 192.168.0.25 as 63400;
&#125;

protocol bgp Node_192_168_0_7 from bgp_template &#123;
  neighbor 192.168.0.7 as 63400;
&#125;

protocol bgp Node_192_168_2_14 from bgp_template &#123;
  neighbor 192.168.2.14 as 63400;
&#125;

protocol bgp Node_192_168_2_15 from bgp_template &#123;
  neighbor 192.168.2.15 as 63400;
&#125;</code></pre><p>Compared with the earlier configuration, rr client is removed from each neighbor definition and the static routes to be advertised are added:</p><pre><code>protocol static &#123;
  route 10.41.0.0/16 via 192.168.0.40;
  route 10.42.0.0/16 via 192.168.0.40;
&#125;</code></pre><p>The routes on a host in 192.168.0.0&#x2F;24 now look like this:</p><pre><code>ip route
default via 192.168.0.1 dev ens3 proto dhcp src 192.168.0.7 metric 100
10.41.0.0/16 via 192.168.0.40 dev ens3 proto bird
blackhole 10.41.113.192/26 proto bird
10.42.0.0/16 via 192.168.0.40 dev ens3 proto bird</code></pre><p>Pod CIDR traffic is sent to the 192.168.0.40 interface of the Bird virtual router.<br>The routes on a host in 192.168.2.0&#x2F;24 look like this:</p><pre><code>ip route
default via 192.168.2.1 dev ens7
10.41.0.0/16 via 192.168.2.16 dev ens7 proto bird
blackhole 10.41.210.128/26 proto bird
10.42.0.0/16 via 192.168.2.16 dev ens7 proto bird</code></pre><p>Pod CIDR traffic is sent to the 192.168.2.16 interface of the Bird virtual router.</p><h4 id="节点POD-IP明细路由发布"><a href="#节点POD-IP明细路由发布" class="headerlink" title="节点POD-IP明细路由发布"></a>Advertising per-Pod host routes</h4><p>To advertise Calico's per-Pod routes to the BGP router, the Calico configuration on every node has to be modified, as follows.</p><p>Create a ConfigMap to replace Calico's original bird_aggr.cfg.template file.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/bird-2.png"></p><p>The main changes are:<br>Comment out the local blackhole route, so that the locally aggregated route is no longer generated and synchronized to the BGP router.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line"># route &#123;&#123;$cidr&#125;&#125; blackhole;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Allow the per-Pod routes to be synchronized:<br>in <code>if ( net ~ &#123;&#123;$cidr&#125;&#125; ) then &#123; reject; &#125;</code>, change reject to accept.</p><p>The complete configuration:</p><pre><code># Generated by confd
&#123;&#123;- $block_key := printf "/calico/ipam/v2/host/%s/ipv4/block" (getenv "NODENAME")&#125;&#125;
&#123;&#123;- $static_key := "/calico/staticroutes"&#125;&#125;
&#123;&#123;if or (ls $block_key) (ls $static_key)&#125;&#125;
protocol static &#123;
&#123;&#123;- if ls $block_key&#125;&#125;
   # IP blocks for this host.
&#123;&#123;- range ls $block_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
 #  route &#123;&#123;$cidr&#125;&#125; blackhole;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- if ls $static_key&#125;&#125;
   # Static routes.
&#123;&#123;- range ls $static_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
 #  route &#123;&#123;$cidr&#125;&#125; blackhole;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- end&#125;&#125;
&#125;
&#123;&#123;else&#125;&#125;
# No IP blocks or static routes for this host.
&#123;&#123;end&#125;&#125;

# Aggregation of routes on this host; export the block, nothing beneath it.
function calico_aggr ()
&#123;
&#123;&#123;- range ls $block_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
&#123;&#123;- $affinity := json (getv (printf "%s/%s" $block_key .))&#125;&#125;
  &#123;&#123;- if $affinity.state&#125;&#125;
      # Block &#123;&#123;$cidr&#125;&#125; is &#123;&#123;$affinity.state&#125;&#125;
    &#123;&#123;- if eq $affinity.state "confirmed"&#125;&#125;
      if ( net = &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
      if ( net ~ &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
    &#123;&#123;- end&#125;&#125;
  &#123;&#123;- else &#125;&#125;
      # Block &#123;&#123;$cidr&#125;&#125; is implicitly confirmed.
      if ( net = &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
      if ( net ~ &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
  &#123;&#123;- end &#125;&#125;
&#123;&#123;- end&#125;&#125;
&#125;</code></pre><p>Update calico-node to mount this ConfigMap as the configuration file.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/bird-3.png"></p><p>Recreate the calico-node Pods.</p><p>Check the routes on the Bird node:</p><pre><code>ip route
default via 192.168.2.1 dev eth0
10.41.0.0/16 via 192.168.0.40 dev eth1 proto bird
10.41.57.199 via 192.168.0.25 dev eth1 proto bird
10.41.57.203 via 192.168.0.25 dev eth1 proto bird
10.41.57.204 via 192.168.0.25 dev eth1 proto bird
10.41.113.193 via 192.168.0.7 dev eth1 proto bird
10.41.113.194 via 192.168.0.7 dev eth1 proto bird
10.41.113.195 via 192.168.0.7 dev eth1 proto bird
10.41.113.196 via 192.168.0.7 dev eth1 proto bird
10.41.113.198 via 192.168.0.7 dev eth1 proto bird
10.41.113.201 via 192.168.0.7 dev eth1 proto bird
10.41.113.202 via 192.168.0.7 dev eth1 proto bird
10.41.210.6 via 192.168.2.15 dev eth0 proto bird
10.41.210.7 via 192.168.2.15 dev eth0 proto bird
10.41.210.8 via 192.168.2.15 dev eth0 proto bird
10.41.210.9 via 192.168.2.15 dev eth0 proto bird
10.41.210.137 via 192.168.2.14 dev eth0 proto bird
10.41.210.138 via 192.168.2.14 dev eth0 proto bird
10.42.0.0/16 via 192.168.0.40 dev eth1 proto bird</code></pre><p>Bird has now learned a host route for every Pod. This approach puts heavy load on routing devices: they must maintain a large number of route entries, and every Pod deletion or creation triggers a route update. Evaluate carefully before using it in production.</p><p>In real deployments, an IP pool is sometimes assigned per service or per Deployment. With that pattern Calico's IP pools can no longer be aggregated per node, which leaves scattered addresses that cannot be summarized; in the worst case every Pod produces its own route, so the number of route entries scales with the number of Pods.<br>By default, switches apply convergence protection to BGP routes to guard against route flapping. In a Kubernetes cluster, however, Pods are short-lived and change frequently, so this route-change protection has to be disabled on the network devices to meet Kubernetes' needs. Route convergence speed also differs between devices: for large-scale Pod scale-outs and migrations, or for a switchover between two data centers, you need to benchmark and stress-test the devices' convergence speed in addition to accounting for Pod scheduling and startup time.</p><p><a href="https://blog.51cto.com/u_14992974/2549877">https://blog.51cto.com/u_14992974/2549877</a></p><h4 id="Service-CIDR路由发布"><a href="#Service-CIDR路由发布" class="headerlink" 
title="Service-CIDR路由发布"></a>Advertising the Service CIDR</h4><p>To let clients outside the cluster reach in-cluster services through a Service's cluster IP, the Service CIDR can be advertised through Calico BGP.</p><pre><code>calicoctl patch BGPConfig default --patch &#39;&#123;&quot;spec&quot;: &#123;&quot;serviceClusterIPs&quot;: [&#123;&quot;cidr&quot;: &quot;10.43.0.0/16&quot;&#125;]&#125;&#125;&#39;</code></pre><p>After it is advertised, the Bird node shows several next hops for 10.43.0.0&#x2F;16, because ECMP (equal-cost multipath) is used to load-balance the route.</p><pre><code>ip route
10.43.0.0/16 proto bird
        nexthop via 192.168.0.7 dev eth1 weight 1
        nexthop via 192.168.0.25 dev eth1 weight 1
        nexthop via 192.168.2.14 dev eth0 weight 1
        nexthop via 192.168.2.15 dev eth0 weight 1</code></pre><p>Once per-Pod routes are being advertised, the Service CIDR no longer shows up on the BGP router. This can be fixed by modifying the bird_aggr.cfg.template file.</p><p>Add the following, changing the subnet in $servicesubnet_split to match the cluster's actual Service CIDR:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">&#123;&#123;- $servicesubnet_split := split &quot;10.43.0.0/16&quot; &quot; &quot; &#125;&#125;</span><br><span class="line"></span><br><span class="line">---</span><br><span class="line">  # Service IP block</span><br><span class="line">&#123;&#123;- if $servicesubnet_split&#125;&#125;</span><br><span class="line">&#123;&#123;- range $servicesubnet_split&#125;&#125;</span><br><span class="line">   route &#123;&#123;.&#125;&#125; blackhole;</span><br><span class="line">&#123;&#123;- end&#125;&#125;</span><br><span class="line">&#123;&#123;- end&#125;&#125;</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">---</span><br><span class="line">function accept_servicesubnet () </span><br><span class="line">&#123;</span><br><span class="line">&#123;&#123;- range $servicesubnet_split&#125;&#125;</span><br><span class="line">  if ( net = &#123;&#123;.&#125;&#125; ) then &#123; accept; &#125;</span><br><span class="line">  if ( net ~ &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;</span><br><span class="line">&#123;&#123;- end&#125;&#125;</span><br><span class="line">&#125;</span><br><span class="line">function deny_servicesubnet ()</span><br><span class="line">&#123;</span><br><span class="line">&#123;&#123;- range $servicesubnet_split&#125;&#125;</span><br><span class="line">  if ( net = &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;</span><br><span class="line">  if ( net ~ &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;</span><br><span class="line">&#123;&#123;- end&#125;&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The complete bird_aggr.cfg.template file:</p><pre><code># Generated by confd
&#123;&#123;- $block_key := printf "/calico/ipam/v2/host/%s/ipv4/block" (getenv "NODENAME")&#125;&#125;
&#123;&#123;- $static_key := "/calico/staticroutes"&#125;&#125;
&#123;&#123;- $servicesubnet_split := split "10.43.0.0/16" " " &#125;&#125;
&#123;&#123;if or (ls $block_key) (ls $static_key)&#125;&#125;
protocol static &#123;
&#123;&#123;- if ls $block_key&#125;&#125;
   # IP blocks for this host.
&#123;&#123;- range ls $block_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
 #  route &#123;&#123;$cidr&#125;&#125; blackhole;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- if ls $static_key&#125;&#125;
   # Static routes.
&#123;&#123;- range ls $static_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
 #   route &#123;&#123;$cidr&#125;&#125; blackhole;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- end&#125;&#125;
  # Service IP block
&#123;&#123;- if $servicesubnet_split&#125;&#125;
&#123;&#123;- range $servicesubnet_split&#125;&#125;
   route &#123;&#123;.&#125;&#125; blackhole;
&#123;&#123;- end&#125;&#125;
&#123;&#123;- end&#125;&#125;
&#125;
&#123;&#123;else&#125;&#125;
# No IP blocks or static routes for this host.
&#123;&#123;end&#125;&#125;

# Aggregation of routes on this host; export the block, nothing beneath it.
# Export the service block.
function accept_servicesubnet ()
&#123;
&#123;&#123;- range $servicesubnet_split&#125;&#125;
  if ( net = &#123;&#123;.&#125;&#125; ) then &#123; accept; &#125;
  if ( net ~ &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;
&#123;&#123;- end&#125;&#125;
&#125;
function deny_servicesubnet ()
&#123;
&#123;&#123;- range $servicesubnet_split&#125;&#125;
  if ( net = &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;
  if ( net ~ &#123;&#123;.&#125;&#125; ) then &#123; reject; &#125;
&#123;&#123;- end&#125;&#125;
&#125;
function calico_aggr ()
&#123;
&#123;&#123;- range ls $block_key&#125;&#125;
&#123;&#123;- $parts := split . "-"&#125;&#125;
&#123;&#123;- $cidr := join $parts "/"&#125;&#125;
&#123;&#123;- $affinity := json (getv (printf "%s/%s" $block_key .))&#125;&#125;
  &#123;&#123;- if $affinity.state&#125;&#125;
      # Block &#123;&#123;$cidr&#125;&#125; is &#123;&#123;$affinity.state&#125;&#125;
    &#123;&#123;- if eq $affinity.state "confirmed"&#125;&#125;
      if ( net = &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
      if ( net ~ &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
    &#123;&#123;- end&#125;&#125;
  &#123;&#123;- else &#125;&#125;
      # Block &#123;&#123;$cidr&#125;&#125; is implicitly confirmed.
      if ( net = &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
      if ( net ~ &#123;&#123;$cidr&#125;&#125; ) then &#123; accept; &#125;
  &#123;&#123;- end &#125;&#125;
&#123;&#123;- end&#125;&#125;
&#125;</code></pre>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h3&gt;&lt;p&gt;The Calico network plugin is best known for its calico-bgp mode. Testing calico-bgp route synchronization across subnets requires that the router connecting the two subnets support</summary>
      
    
    
    
    <category term="Network" scheme="http://yoursite.com/categories/Network/"/>
    
    
    <category term="Network" scheme="http://yoursite.com/tags/Network/"/>
    
  </entry>
  
  <entry>
    <title>NeuVector Introduction and Deployment</title>
    <link href="http://yoursite.com/2022/03/17/neuvector_1/"/>
    <id>http://yoursite.com/2022/03/17/neuvector_1/</id>
    <published>2022-03-17T13:45:59.000Z</published>
    <updated>2022-03-17T13:45:59.000Z</updated>
    
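    <content type="html"><![CDATA[<h3 id="Neuvector介绍"><a href="#Neuvector介绍" class="headerlink" title="Neuvector介绍"></a>Introduction to NeuVector</h3><p>NeuVector was among the first companies to build security products for Docker&#x2F;Kubernetes and is a leader in Kubernetes network security. It focuses on securing enterprise container platforms, and its product suits production container environments in any cloud, across clouds, or on premises. NeuVector provides real-time, in-depth container network visualization, east-west container traffic monitoring, active isolation and protection, container host security, and security inside the container. It integrates seamlessly with container management platforms and automates application-level container security.<br>SUSE acquired NeuVector in 2021 and open-sourced it.</p><p>Project repository:<br><a href="https://github.com/neuvector/neuvector">https://github.com/neuvector/neuvector</a></p><h4 id="架构解析"><a href="#架构解析" class="headerlink" title="架构解析"></a>Architecture</h4><p><img src="https://open-docs.neuvector.com/user/pages/01.basics/01.overview/architecture.png"></p><p>NeuVector consists of the Controller, Enforcer, Manager, Scanner, and Updater modules.</p><p>Controller: the control module of NeuVector and the API entry point; it also distributes configuration. High availability is mainly about Controller HA, and deploying three Controllers as a cluster is generally recommended.</p><p>Enforcer: deploys and enforces security policies; it runs as a DaemonSet on every node.<br>Manager: provides the web UI (HTTPS only) and the CLI console for managing NeuVector.<br>Scanner: scans nodes, containers, Kubernetes, and images for CVE vulnerabilities.<br>Updater: a CronJob that periodically updates the CVE database.</p><h4 id="功能介绍"><a href="#功能介绍" class="headerlink" title="功能介绍"></a>Features</h4><ul><li>Vulnerability scanning</li><li>Container network traffic visualization</li><li>Network security policy definition</li><li>L7 firewall</li><li>CI&#x2F;CD security scanning</li><li>Compliance analysis</li></ul><p>This article focuses on installation and deployment; the features themselves are covered in depth in later articles.</p><h3 id="NeuVector安装"><a href="#NeuVector安装" class="headerlink" title="NeuVector安装"></a>Installing NeuVector</h3><p>Installation environment<br>Software versions:<br>OS: Ubuntu 18.04<br>Kubernetes: 1.20.14<br>Rancher: 2.5.12<br>Docker: 19.03.15<br>NeuVector: 5.0.0-b1</p><h4 id="快速部署"><a href="#快速部署" class="headerlink" title="快速部署"></a>Quick deployment</h4><p>Create the namespace:</p><pre><code>kubectl create namespace neuvector</code></pre><p>Deploy the CRDs (Kubernetes 1.19+):</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">kubectl apply -f 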
https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/crd-k8s-1.19.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/waf-crd-k8s-1.19.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/dlp-crd-k8s-1.19.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/admission-crd-k8s-1.19.yaml</span><br></pre></td></tr></table></figure><p>Deploy the CRDs (Kubernetes 1.18 or earlier):</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line"></span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/crd-k8s-1.16.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/waf-crd-k8s-1.16.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/dlp-crd-k8s-1.16.yaml</span><br><span class="line">kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/admission-crd-k8s-1.16.yaml</span><br></pre></td></tr></table></figure><p>Configure RBAC:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span 
class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br></pre></td><td class="code"><pre><span class="line">kubectl create clusterrole neuvector-binding-app --verb=get,list,watch,update --resource=nodes,pods,services,namespaces</span><br><span class="line">kubectl create clusterrole neuvector-binding-rbac --verb=get,list,watch --resource=rolebindings.rbac.authorization.k8s.io,roles.rbac.authorization.k8s.io,clusterrolebindings.rbac.authorization.k8s.io,clusterroles.rbac.authorization.k8s.io</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-app --clusterrole=neuvector-binding-app --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-rbac --clusterrole=neuvector-binding-rbac --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrole neuvector-binding-admission --verb=get,list,watch,create,update,delete --resource=validatingwebhookconfigurations,mutatingwebhookconfigurations</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-admission --clusterrole=neuvector-binding-admission --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrole neuvector-binding-customresourcedefinition --verb=watch,create,get,update --resource=customresourcedefinitions</span><br><span class="line">kubectl create clusterrolebinding  neuvector-binding-customresourcedefinition --clusterrole=neuvector-binding-customresourcedefinition --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrole neuvector-binding-nvsecurityrules --verb=list,delete --resource=nvsecurityrules,nvclustersecurityrules</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-nvsecurityrules 
--clusterrole=neuvector-binding-nvsecurityrules --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-view --clusterrole=view --serviceaccount=neuvector:default</span><br><span class="line">kubectl create rolebinding neuvector-admin --clusterrole=admin --serviceaccount=neuvector:default -n neuvector</span><br><span class="line">kubectl create clusterrole neuvector-binding-nvwafsecurityrules --verb=list,delete --resource=nvwafsecurityrules</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-nvwafsecurityrules --clusterrole=neuvector-binding-nvwafsecurityrules --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrole neuvector-binding-nvadmissioncontrolsecurityrules --verb=list,delete --resource=nvadmissioncontrolsecurityrules</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-nvadmissioncontrolsecurityrules --clusterrole=neuvector-binding-nvadmissioncontrolsecurityrules --serviceaccount=neuvector:default</span><br><span class="line">kubectl create clusterrole neuvector-binding-nvdlpsecurityrules --verb=list,delete --resource=nvdlpsecurityrules</span><br><span class="line">kubectl create clusterrolebinding neuvector-binding-nvdlpsecurityrules --clusterrole=neuvector-binding-nvdlpsecurityrules --serviceaccount=neuvector:default</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Check that the following RBAC objects exist:</p><pre><code>kubectl get clusterrolebinding  | grep neuvector
kubectl get rolebinding -n neuvector | grep neuvector
kubectl get clusterrolebinding  | grep neuvector
neuvector-binding-admission                            ClusterRole/neuvector-binding-admission                            44h
neuvector-binding-app                                  ClusterRole/neuvector-binding-app                                  44h
neuvector-binding-customresourcedefinition             
ClusterRole/neuvector-binding-customresourcedefinition             44h
neuvector-binding-nvadmissioncontrolsecurityrules      ClusterRole/neuvector-binding-nvadmissioncontrolsecurityrules      44h
neuvector-binding-nvsecurityrules                      ClusterRole/neuvector-binding-nvsecurityrules                      44h
neuvector-binding-nvwafsecurityrules                   ClusterRole/neuvector-binding-nvwafsecurityrules                   44h
neuvector-binding-rbac                                 ClusterRole/neuvector-binding-rbac                                 44h
neuvector-binding-view                                 ClusterRole/view                                                   44h
kubectl get rolebinding -n neuvector | grep neuvector
neuvector-admin         ClusterRole/admin            44h</code></pre><p>Deploy NeuVector<br>When the underlying runtime is Docker:</p><pre><code>kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/neuvector-docker-k8s.yaml</code></pre><p>When the underlying runtime is containerd (this YAML file also works for k3s and RKE2):</p><pre><code>kubectl apply -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/neuvector-containerd-k8s.yaml</code></pre><p>On Kubernetes versions below 1.21 the following error is reported. To work around it, download the YAML file and change batch&#x2F;v1 to batch&#x2F;v1beta1:</p><pre><code>error: unable to recognize &quot;https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/neuvector-docker-k8s.yaml&quot;: no matches for kind &quot;CronJob&quot; in version &quot;batch/v1&quot;</code></pre><p>In 1.20.x CronJob is still in beta; it only reached GA in 1.21.</p><p>By default the web UI is exposed through a Service of type LoadBalancer. For easier access, change it to NodePort; it can also be exposed through an Ingress.</p><pre><code>kubectl patch  svc neuvector-service-webui  -n neuvector --type=&#39;json&#39; -p 
&#39;[&#123;&quot;op&quot;:&quot;replace&quot;,&quot;path&quot;:&quot;/spec/type&quot;,&quot;value&quot;:&quot;NodePort&quot;&#125;,&#123;&quot;op&quot;:&quot;add&quot;,&quot;path&quot;:&quot;/spec/ports/0/nodePort&quot;,&quot;value&quot;:30888&#125;]&#39;</code></pre><p>Open https:&#x2F;&#x2F;node_ip:30888</p><p>The default username&#x2F;password is admin&#x2F;admin.</p><p>Click My profile next to the avatar to open the settings page, then set the password and language.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-1.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-2.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-3.png"></p><h4 id="Helm部署"><a href="#Helm部署" class="headerlink" title="Helm部署"></a>Deploying with Helm</h4><p>Add the repo:</p><pre><code>helm repo add neuvector https://neuvector.github.io/neuvector-helm/
helm search repo neuvector/core</code></pre><p>Create the namespace:</p><pre><code>kubectl create namespace neuvector</code></pre><p>Create the ServiceAccount:</p><pre><code>kubectl create serviceaccount neuvector -n neuvector</code></pre><p>Install with Helm:</p><pre><code>helm install neuvector --namespace neuvector neuvector/core  --set registry=docker.io  --set tag=5.0.0-preview.1 --set=controller.image.repository=neuvector/controller.preview --set=enforcer.image.repository=neuvector/enforcer.preview --set manager.image.repository=neuvector/manager.preview --set cve.scanner.image.repository=neuvector/scanner.preview --set cve.updater.image.repository=neuvector/updater.preview </code></pre><p>Helm chart parameters:<br><a href="https://github.com/neuvector/neuvector-helm/tree/master/charts/core">https://github.com/neuvector/neuvector-helm/tree/master/charts/core</a></p><h3 id="高可用架构设计"><a href="#高可用架构设计" class="headerlink" title="高可用架构设计"></a>High-availability architecture</h3><p><img src="https://open-docs.neuvector.com/user/pages/01.basics/01.overview/architecture.png"></p><p>NeuVector HA is mainly about HA of the Controller module. As long as one Controller is up, all data is synchronized among the three replicas.<br>Controller data is stored mainly in the &#x2F;var&#x2F;neuvector&#x2F; directory. When a Pod is rebuilt or the cluster is redeployed, the backup files are loaded from this directory automatically to restore the cluster.</p><h4 id="部署策略"><a href="#部署策略" class="headerlink" title="部署策略"></a>Deployment strategies</h4><p>NeuVector officially describes four HA deployment modes:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-14.png"></p><p>Option 1: no scheduling constraints; Kubernetes schedules the Pods freely.</p><p>Option 2: configure node-label constraints and taint tolerations for the NeuVector control components (manager, controller) as well as the enforcer and scanner components, so that they are deployed together with the Kubernetes master nodes.</p><p>Option 3: use taints to set aside dedicated NeuVector nodes in the Kubernetes cluster, on which only the NeuVector control components may run.</p><p>Option 4: configure node-label constraints and taint tolerations for the NeuVector control components (manager, controller) so that they are deployed on the Kubernetes master nodes. The enforcer and scanner components are not deployed on the masters, which means the master nodes no longer receive scans or policy enforcement.</p><p>The following uses option 2 as the example.<br>Label the master nodes:</p><pre><code>kubectl label nodes nodename nvcontroller=true</code></pre><p>Get the node's taints:</p><pre><code>kubectl get node nodename -o yaml|grep -A 5 taint</code></pre><p>For example, on a master node deployed by Rancher:</p><pre><code>  taints:
  - effect: NoSchedule
    key: node-role.kubernetes.io/controlplane
    value: &quot;true&quot;
  - effect: NoExecute
    key: node-role.kubernetes.io/etcd</code></pre><p>Edit the deployment YAML: add nodeSelector and tolerations to the NeuVector control components (manager, controller), and add only tolerations to the enforcer and scanner components.</p><p>Taking the manager component as an example:</p><pre><code>apiVersion: apps/v1
kind: Deployment
metadata:
  name: neuvector-manager-pod
  namespace: neuvector
spec:
  selector:
    matchLabels:
      app: neuvector-manager-pod
  replicas: 1
  template:
    metadata:
      labels:
        app: neuvector-manager-pod
    spec:
      nodeSelector:
        nvcontroller: &quot;true&quot;
      containers:
        - name: neuvector-manager-pod
          image: neuvector/manager.preview:5.0.0-preview.1
          env:
            - name: CTRL_SERVER_IP
              value: neuvector-svc-controller.neuvector
      restartPolicy: Always
      tolerations:
      - effect: NoSchedule
        key: &quot;node-role.kubernetes.io/controlplane&quot;
        operator: Equal
        value: &quot;true&quot;
      - effect: NoExecute
        operator: &quot;Equal&quot;
        key: &quot;node-role.kubernetes.io/etcd&quot;
        value: 
&quot;true&quot;</code></pre><h4 id="数据持久化"><a href="#数据持久化" class="headerlink" title="数据持久化"></a>Data persistence</h4><p>Set an environment variable to enable persistence of configuration data:</p><pre><code>- env:
  - name: CTRL_PERSIST_CONFIG</code></pre><p>With this variable set, the NeuVector Controller stores its data in &#x2F;var&#x2F;neuvector by default; that directory is a hostPath volume mapped to &#x2F;var&#x2F;neuvector on the host where the Pod runs.</p><p>For stronger data durability, a PV backed by NFS or other storage that supports multi-node read-write access can be used instead.<br>That way no configuration data is lost even if all three Controller Pod replicas are destroyed at once and their hosts are unrecoverable.<br>The following uses NFS as the example.<br>Deploy NFS</p><p>Create the PV and PVC:</p><pre><code>cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: neuvector-data
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    path: /nfsdata
    server: 172.16.0.195
EOF
cat &lt;&lt;EOF | kubectl apply -f -
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: neuvector-data
  namespace: neuvector
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF</code></pre><p>Edit the NeuVector Controller deployment YAML to add the PVC, so that &#x2F;var&#x2F;neuvector is mapped to NFS (by default it is a hostPath mapped to the local host):</p><pre><code>spec:
  template:
    spec:
      volumes:
        - name: nv-share
#         hostPath:                        // replaced by persistentVolumeClaim
#           path: /var/neuvector        // replaced by persistentVolumeClaim
          persistentVolumeClaim:
            claimName: neuvector-data</code></pre><p>Alternatively, mount the NFS directory directly in the NeuVector deployment YAML:</p><pre><code>      volumes:
      - name: nv-share
        nfs:
          path: /opt/nfs-deployment
          server: 172.26.204.144</code></pre><h3 id="多云安全管理"><a href="#多云安全管理" class="headerlink" title="多云安全管理"></a>Multi-cloud security management</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-5.png"><br>In production there is often a need to manage security across multiple clusters; NeuVector supports cluster federation for this.<br>The Federation Master service must be exposed on one cluster and the Federation Worker service deployed on every remote cluster. For more flexibility, both services can be enabled on every cluster.<br>Apply this YAML on every cluster:</p><pre><code>apiVersion: v1
kind: Service
metadata:
  name: neuvector-service-controller-fed-master
  namespace: neuvector
spec:
  ports:
  - 
port: 11443
    name: fed
    nodePort: 30627
    protocol: TCP
  type: NodePort
  selector:
    app: neuvector-controller-pod
---
apiVersion: v1
kind: Service
metadata:
  name: neuvector-service-controller-fed-worker
  namespace: neuvector
spec:
  ports:
  - port: 10443
    name: fed
    nodePort: 31783
    protocol: TCP
  type: NodePort
  selector:
    app: neuvector-controller-pod</code></pre><p>Promote one of the clusters to the primary cluster.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-6.png"><br>When promoting it, configure the exposed IP address and a port that the remote clusters can reach.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-7.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-8.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-9.png"><br>On the primary cluster, generate a token that the remote clusters use to join.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-10.png"><br>On each remote cluster, configure joining the primary cluster by entering the token and the connection port.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-11.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-12.png"></p><p>Multiple NeuVector clusters can then be managed from the UI.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/neuvector-13.png"></p><h3 id="其他配置"><a href="#其他配置" class="headerlink" title="其他配置"></a>Other configuration</h3><h4 id="升级"><a href="#升级" class="headerlink" title="升级"></a>Upgrade</h4><p>If NeuVector was deployed from YAML files, upgrading only requires updating the image tags of the corresponding components, for example:</p><pre><code>kubectl set image deployment/neuvector-controller-pod neuvector-controller-pod=neuvector/controller:2.4.1 -n neuvector
kubectl set image -n neuvector ds/neuvector-enforcer-pod neuvector-enforcer-pod=neuvector/enforcer:2.4.1</code></pre><p>If NeuVector was deployed with Helm, run helm upgrade with the appropriate parameters.</p><h4 id="卸载"><a href="#卸载" class="headerlink" title="卸载"></a>Uninstall</h4><p>Delete the deployed components:</p><pre><code>kubectl delete -f 
https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/neuvector-docker-k8s.yaml</code></pre><p>Delete the RBAC objects:</p><pre><code>kubectl get clusterrolebinding  | grep neuvector|awk &#39;&#123;print $1&#125;&#39;|xargs kubectl delete clusterrolebinding
kubectl get rolebinding -n neuvector | grep neuvector|awk &#39;&#123;print $1&#125;&#39;|xargs kubectl delete rolebinding -n neuvector</code></pre><p>Delete the CRDs:</p><pre><code>kubectl delete -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/crd-k8s-1.19.yaml
kubectl delete -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/waf-crd-k8s-1.19.yaml
kubectl delete -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/dlp-crd-k8s-1.19.yaml
kubectl delete -f https://raw.githubusercontent.com/neuvector/manifests/main/kubernetes/5.0.0/admission-crd-k8s-1.19.yaml</code></pre><p>Summary:<br>NeuVector, now open-sourced by SUSE, is a mature and stable container security platform, and it is expected to integrate more closely with Rancher products in the future.</p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;Neuvector介绍&quot;&gt;&lt;a href=&quot;#Neuvector介绍&quot; class=&quot;headerlink&quot; title=&quot;Neuvector介绍&quot;&gt;&lt;/a&gt;Introduction to NeuVector&lt;/h3&gt;&lt;p&gt;NeuVector was among the first companies to build security products for Docker&amp;#x2F;Kubern</summary>
      
    
    
    
    <category term="安全" scheme="http://yoursite.com/categories/%E5%AE%89%E5%85%A8/"/>
    
    
    <category term="安全" scheme="http://yoursite.com/tags/%E5%AE%89%E5%85%A8/"/>
    
  </entry>
  
  <entry>
    <title>Getting Started with RKE2</title>
    <link href="http://yoursite.com/2022/01/01/rke2/"/>
    <id>http://yoursite.com/2022/01/01/rke2/</id>
    <published>2022-01-01T13:45:59.000Z</published>
    <updated>2022-01-01T13:45:59.000Z</updated>
    
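    <content type="html"><![CDATA[<h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h2><p>Software version: rke2 version v1.22.5+rke2r1<br>OS: Ubuntu 18.04</p><p>RKE2 is Rancher's new Kubernetes distribution, combining features of k3s and RKE1. Compared with RKE1 its main focus is security: it targets the security and compliance requirements of the US federal government, passes the CIS security benchmark, complies with FIPS 140-2, and has its images scanned regularly.<br>From k3s it takes, for example, startup from a single binary with containerd integrated as the underlying runtime.</p><h3 id="与其他Kubernetes部署工具对比"><a href="#与其他Kubernetes部署工具对比" class="headerlink" title="与其他Kubernetes部署工具对比"></a>Comparison with other Kubernetes deployment tools</h3><table><thead><tr><th></th><th>Component integration</th><th>Security</th><th>Containerized components</th><th>Ease of deployment</th></tr></thead><tbody><tr><td>kubeadm</td><td>Low: kubelet, the runtime, and other components must be deployed separately, and the remaining components are then started as static Pods.</td><td>Medium: default security configuration</td><td>Everything except kubelet</td><td>Low: users must set up component HA themselves.</td></tr><tr><td>RKE-1</td><td>Low: the runtime is deployed separately, then the cluster is deployed with rke.</td><td>Medium: default security configuration</td><td>All components</td><td>High: one-command deployment; component HA is automatic.</td></tr><tr><td>RKE-2</td><td>High: a single binary bundles the runtime and kubelet and starts with one command.</td><td>High: built for security and compliant with a range of security benchmarks</td><td>Everything except kubelet</td><td>Medium: each node is installed separately; component HA is automatic.</td></tr></tbody></table><h2 id="RKE2部署"><a href="#RKE2部署" class="headerlink" title="RKE2部署"></a>Deploying RKE2</h2><h3 id="部署前提："><a href="#部署前提：" class="headerlink" title="部署前提："></a>Prerequisites</h3><p>Prerequisites on Linux:</p><ul><li>Disable swap.</li><li>Disable NetworkManager (if present), or configure it to ignore the calico&#x2F;flannel network interfaces.</li><li>Disable SELinux, or configure SELinux rules as described in the link below.</li><li>Use standard FQDN-style hostnames for the nodes.</li></ul><p>If NetworkManager and SELinux must stay enabled, see the following link for how to configure their policies:<br><a href="https://rancher2.docs.rancher.cn/docs/rke2/known_issues/_index#networkmanager">https://rancher2.docs.rancher.cn/docs/rke2/known_issues/_index#networkmanager</a></p><h3 id="通过完整兼容性测试的操作系统："><a href="#通过完整兼容性测试的操作系统：" class="headerlink" title="通过完整兼容性测试的操作系统："></a>Operating systems that pass full compatibility testing</h3><p>Ubuntu 18.04 (amd64)<br>Ubuntu 20.04 (amd64)<br>CentOS&#x2F;RHEL 7.8 (amd64)<br>CentOS&#x2F;RHEL 8.2 (amd64)<br>SLES 15 SP2 (amd64) (v1.18.16+rke2r1 and later)</p><p>Note: the Cilium network plugin relies on eBPF, a kernel feature, so the following kernel version is required:<br>1. Kernel version &gt;&#x3D; 4.9.17</p><h3 id="通过RKE2单机方式快速部署Kubernetes"><a 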
href="#通过RKE2单机方式快速部署Kubernetes" class="headerlink" title="通过RKE2单机方式快速部署Kubernetes"></a>Quick single-server Kubernetes deployment with RKE2</h3><p>Deploy the server</p><p>Download the rke2 binary and set up rke2-server automatically:</p><pre><code>curl -sfL http://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn sh -</code></pre><p>Enable rke2-server at boot:</p><pre><code>systemctl enable rke2-server.service</code></pre><p>Start rke2-server:</p><pre><code>systemctl start rke2-server.service</code></pre><p>At this point rke2 launches kubelet automatically, and kubelet then starts the api-server, controller-manager, etcd, and scheduler as static Pods.</p><p>View the logs:</p><pre><code>journalctl -u rke2-server -f</code></pre><p>By default rke2 creates the following directories:</p><p>&#x2F;var&#x2F;lib&#x2F;rancher&#x2F;rke2&#x2F;: holds the additionally deployed cluster add-ons (CoreDNS, the network plugin, the Ingress controller), the etcd database, and the token that workers use to join.<br>&#x2F;etc&#x2F;rancher&#x2F;rke2&#x2F;: holds the kubeconfig file for connecting to the cluster and the configuration of cluster component parameters.</p><p>Create symlinks for the commonly used CLIs:</p><pre><code>ln -s /var/lib/rancher/rke2/bin/kubectl  /usr/bin/kubectl
ln -s /var/lib/rancher/rke2/bin/ctr /usr/bin/ctr
ln -s /var/lib/rancher/rke2/bin/crictl /usr/bin/crictl</code></pre><p>Set up kubeconfig:</p><pre><code>mkdir -p ~/.kube/
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config</code></pre><p>Verify:</p><pre><code>kubectl get node
NAME        STATUS   ROLES                       AGE   VERSION
rke-node6   Ready    control-plane,etcd,master   72m   v1.22.5+rke2r1</code></pre><p>Get the token that workers use to register with the server:</p><pre><code>cat /var/lib/rancher/rke2/server/token </code></pre><p>Deploy the worker<br>Download the rke2 binary and set up rke2-agent automatically:</p><pre><code>curl -sfL http://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn INSTALL_RKE2_TYPE=&quot;agent&quot;  sh -</code></pre><p>Enable the rke2-agent service at boot:</p><pre><code>systemctl enable rke2-agent.service</code></pre><p>Configure the rke2-agent service:</p><pre><code>mkdir -p /etc/rancher/rke2/
vim /etc/rancher/rke2/config.yaml</code></pre><p>The configuration file contains:</p><pre><code>server: https://&lt;server&gt;:9345
token: &lt;token from server node&gt;</code></pre><p>Note:<br>The rke2 server process listens on port 9345 for new node registrations. The Kubernetes API still listens on port 
6443.</p><p>Start the service and wait for it to come up and register:</p><pre><code>systemctl start rke2-agent.service</code></pre><p>View the logs:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">journalctl -u rke2-agent -f</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Check the resulting deployment:</p><pre><code>kubectl get node
NAME        STATUS   ROLES                       AGE   VERSION
rke-node6   Ready    control-plane,etcd,master   81m   v1.22.5+rke2r1
rke-node7   Ready    &lt;none&gt;                      70m   v1.22.5+rke2r1</code></pre><p>Test:</p><pre><code>kubectl create deployment test --image=busybox:1.28  --replicas=2   -- sleep 30000 </code></pre><h3 id="通过RKE2高可用方式部署Kubernetes"><a href="#通过RKE2高可用方式部署Kubernetes" class="headerlink" title="通过RKE2高可用方式部署Kubernetes"></a>High-availability Kubernetes deployment with RKE2</h3><p>Prerequisites:</p><ul><li><p>A single entry point for the API server (optional). To make the cluster easy to reach from outside, put a single entry point in front of it: an L4 load balancer, a VIP, or round-robin DNS. Inside the cluster, rke2-agent already provides a reverse proxy through which workers reach multiple API servers.</p></li><li><p>An odd number of server nodes (three recommended) that run etcd, the Kubernetes API, and the other control-plane services.</p></li></ul><p>Deployment order:</p><ul><li>Start the first server node</li><li>Join the other server nodes</li><li>Join the agent nodes</li></ul><p>Deploy the load balancer (optional)<br>Using nginx as the example, forward ports 9345 and 6443 to the backend servers.<br>Create nginx.conf:</p><pre><code>events &#123;
  worker_connections  1024;  ## Default: 1024
&#125;
stream &#123;
    upstream kube-apiserver &#123;
        server host1:6443     max_fails=3 fail_timeout=30s;
        server host2:6443     max_fails=3 fail_timeout=30s;
        server host3:6443     max_fails=3 fail_timeout=30s;
    &#125;
    upstream rke2 &#123;
        server host1:9345     max_fails=3 fail_timeout=30s;
        server host2:9345     max_fails=3 fail_timeout=30s;
        server host3:9345     max_fails=3 fail_timeout=30s;
    &#125;
    server &#123;
        listen 6443;
        proxy_connect_timeout 2s;
        proxy_timeout 900s;
        proxy_pass kube-apiserver;
    &#125;
    server &#123;
        listen 9345;
        proxy_connect_timeout 2s;
        proxy_timeout 900s;
        proxy_pass rke2;
    &#125;
&#125;</code></pre><p>Replace the three host entries with the actual IP addresses of the server nodes.</p><p>Start nginx:</p><pre><code>docker run -itd -p 9345:9345  -p 6443:6443 -v ~/nginx.conf:/etc/nginx/nginx.conf nginx</code></pre><p>In production, deploying two nginx instances with keepalived maintaining a VIP between them is recommended as the single entry point.</p><p>Deploy the first server<br>Download the rke2 binary and set up rke2-server automatically:</p><pre><code>curl -sfL http://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn sh -</code></pre><p>Enable rke2-server at boot:</p><pre><code>systemctl enable rke2-server.service</code></pre><p>Create the config.yaml file:</p><pre><code>mkdir /etc/rancher/rke2/ -p </code></pre><!----><pre><code>touch /etc/rancher/rke2/config.yaml</code></pre><p>Add the following content:</p><pre><code>tls-san:
  - xxx.xxx.xxx.xxx
  - www.xxx.com</code></pre><p>Enter the IP address or domain name of the load balancer's entry point here; if there are several, list each on its own line.</p><p>Start rke2-server:</p><pre><code>systemctl start rke2-server.service</code></pre><p>Create symlinks for the commonly used CLIs:</p><pre><code>ln -s /var/lib/rancher/rke2/bin/kubectl  /usr/bin/kubectl
ln -s /var/lib/rancher/rke2/bin/ctr /usr/bin/ctr
ln -s /var/lib/rancher/rke2/bin/crictl /usr/bin/crictl</code></pre><p>Set up kubeconfig:</p><pre><code>mkdir -p ~/.kube/
cp /etc/rancher/rke2/rke2.yaml ~/.kube/config</code></pre><p>The IP address in the kubeconfig file can be changed from 127.0.0.1 to the actual IP address of the load balancer.</p><p>Get the token used to register with the server:</p><pre><code>cat /var/lib/rancher/rke2/server/token </code></pre><p>Configure the other servers<br>Configure the rke2-server service:</p><pre><code>mkdir -p /etc/rancher/rke2/
vim /etc/rancher/rke2/config.yaml</code></pre><p>The configuration file contains:</p><pre><code>server: https://&lt;server&gt;:9345
token: &lt;token from server node&gt;
tls-san:
  - xxx.xxx.xxx.xxx
  - www.xxx.com</code></pre><p>Notes:</p><ul><li>The server address can be that of the first server or that of the external entry point. Best practice is the entry point address: if the first server fails, agents can still obtain cluster information from the other servers through it.</li><li>token is the token from the first server.</li><li>tls-san is the same as on the first server, normally the IP address or domain name of the entry point; it is added to the TLS certificate.</li></ul><p>Download the rke2 binary and set up rke2-server automatically:</p><pre><code>curl -sfL http://rancher-mirror.rancher.cn/rke2/install.sh | INSTALL_RKE2_MIRROR=cn sh 
-</code></pre><p>Enable rke2-server at boot:</p><pre><code>systemctl enable rke2-server.service</code></pre><p>Start rke2-server:</p><pre><code>systemctl start rke2-server.service</code></pre><p>Wait for registration and for the cluster to start.</p><p>Verify:</p><pre><code>kubectl get node
NAME        STATUS   ROLES                       AGE    VERSION
rke-node4   Ready    control-plane,etcd,master   140m   v1.22.5+rke2r1
rke-node5   Ready    control-plane,etcd,master   138m   v1.22.5+rke2r1
rke-node6   Ready    control-plane,etcd,master   19h    v1.22.5+rke2r1
rke-node7   Ready    &lt;none&gt;                      19h    v1.22.5+rke2r1</code></pre><p>Enter the etcd Pod and check the etcd cluster status:</p><pre><code>etcdctl --cert /var/lib/rancher/rke2/server/tls/etcd/server-client.crt --key /var/lib/rancher/rke2/server/tls/etcd/server-client.key --endpoints https://127.0.0.1:2379 --cacert /var/lib/rancher/rke2/server/tls/etcd/server-ca.crt member list
e19d2834bb177be1, started, rke-node4-896165c9, https://192.168.0.25:2380, https://192.168.0.25:2379, false
ec67af24a94fb07c, started, rke-node6-fed10843, https://192.168.0.32:2380, https://192.168.0.32:2379, false
f7e9f28da0a6e5e6, started, rke-node5-4a4b6af5, https://192.168.0.29:2380, https://192.168.0.29:2379, false</code></pre><p>Join agent nodes the same way as in the single-server setup.</p><h3 id="通过RKE2离线部署kubernetes集群"><a href="#通过RKE2离线部署kubernetes集群" class="headerlink" title="通过RKE2离线部署kubernetes集群"></a>Offline (air-gapped) Kubernetes deployment with RKE2</h3><h4 id="Tarball模式"><a href="#Tarball模式" class="headerlink" title="Tarball模式"></a>Tarball mode</h4><p>Offline deployment of RKE2 is similar to k3s: download the offline artifacts in advance, place them in the expected directory, and start the binary.</p><p>Download the offline installation artifacts from the RKE2 release page:<br><a href="https://github.com/rancher/rke2/releases">https://github.com/rancher/rke2/releases</a><br>The main artifacts are:</p><ul><li>rke2-images.linux-amd64.tar</li><li>rke2.linux-amd64.tar.gz</li><li>sha256sum-amd64.txt<br>Depending on the network plugin needed, also download the matching image bundle:</li><li>rke2-images-canal.linux-amd64.tar.gz</li><li>The offline installation script</li></ul><p>Place the downloaded artifacts in one directory on the node, for example &#x2F;root&#x2F;images.</p><p>Download the installation script:</p><pre><code>curl -sfL https://get.rke2.io --output 
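install.sh</code></pre><p>Install:</p><pre><code>INSTALL_RKE2_ARTIFACT_PATH=/root/images sh install.sh</code></pre><p>The script unpacks the offline artifacts into the expected directories.<br>After that the procedure is the same as an online install: start the RKE2 processes to deploy the server and agents.<br>Start rke2</p><p>Enable rke2-server at boot:</p><pre><code>systemctl enable rke2-server.service</code></pre><p>Start rke2-server:</p><pre><code>systemctl start rke2-server.service</code></pre><p>Wait for registration and for the cluster to start.</p><h4 id="Private-Registry"><a href="#Private-Registry" class="headerlink" title="Private Registry"></a>Private Registry</h4><p>Push the images to the registry<br>Rancher's rancher-load-images.sh script can be used together with the rke2-images-all.linux-amd64.txt file to push the images.</p><p>Download the rke2 tarball rke2.linux-amd64.tar.gz<br>Unpack it, then copy the systemd unit files and the rke2 binaries to the corresponding directories:</p><pre><code>cp lib/systemd/system/* /usr/local/lib/systemd/system/</code></pre><!----><pre><code>cp bin/* /usr/local/bin/</code></pre><!----><pre><code>cp share/* /usr/local/share/ -rf</code></pre><p>Edit config.yaml to set the default registry to pull images from:</p><pre><code>system-default-registry: xxx.xxx.xxx.xxx</code></pre><p>If the private registry uses HTTP or self-signed HTTPS, it must be configured in <code>/etc/rancher/rke2/registries.yaml</code>.<br>In my case, however, the insecure-registry setting did not take effect; see this issue for details: <a href="https://github.com/rancher/rke2/issues/2317">https://github.com/rancher/rke2/issues/2317</a></p><h3 id="通过RKE2部署Kubernetes高可用实现原理"><a href="#通过RKE2部署Kubernetes高可用实现原理" class="headerlink" title="通过RKE2部署Kubernetes高可用实现原理"></a>How high availability works in an RKE2-deployed Kubernetes</h3><p>The components that need HA in an RKE2-deployed Kubernetes are the same as in any other Kubernetes.<br>High availability of a Kubernetes cluster concerns:</p><ul><li>etcd</li><li>controller-manager</li><li>scheduler</li><li>apiserver</li></ul><p>etcd: forms an etcd cluster and achieves HA through the leader election of its own Raft algorithm.</p><p>controller manager: HA is ensured by leader election, in which instances compete for a lock.</p><p>scheduler: HA is ensured by leader election, in which instances compete for a lock.</p><p>apiserver: stateless; HA is achieved with a load balancer in front.</p><p>Another point is that in an rke2 cluster, containerd and kubelet are built into the rke2 service, much as in k3s. The rke2 service also embeds a load-balancing proxy, used mainly as the reverse proxy for kubelet connections to the api-server.</p><p>The main difference for HA lies in the single entry point for the API server, because RKE2 handles HA for the other components automatically.</p><p><img 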
src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/RKE-1.png"></p><p>With a single entry point, as with kubeadm and other vanilla Kubernetes setups, all requests go through the load balancer to the rke2-server backends.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/RKE-2.png"></p><p>Without a single entry point for the api-server, kubelet and rke2-agent only need one server address to register. The agent then fetches the addresses of all rke2 servers and stores them in &#x2F;var&#x2F;lib&#x2F;rancher&#x2F;rke2&#x2F;agent&#x2F;etc&#x2F;rke2-api-server-agent-load-balancer.json, from which the reverse-proxy configuration is generated.<br>For example:</p><pre><code>cat rke2-agent-load-balancer.json
&#123;
  &quot;ServerURL&quot;: &quot;https://192.168.3.10:9345&quot;,
  &quot;ServerAddresses&quot;: [
    &quot;192.168.3.11:9345&quot;,
    &quot;192.168.3.12:9345&quot;
  ],
  &quot;Listener&quot;: null
&#125;</code></pre><p>If 192.168.3.10 goes down, the agent switches automatically to another rke2 server. Once 192.168.3.10 recovers, it reconnects to 192.168.3.10.</p><p>As mentioned, rke2 also embeds containerd, which raises a question: if the rke2-agent process goes down, are workloads on the platform affected?<br>They are not. containerd creates containers by having containerd-shim-runc-v2 call runc; if containerd fails, containerd-shim-runc-v2 is reparented to the init process, so it does not exit and existing workload Pods keep running. Note, however, that when rke2-agent exits kubelet exits too, so workload health probing stops, and after the default 5-minute timeout the controller-manager recreates the workload Pods.</p><h2 id="其他使用技巧"><a href="#其他使用技巧" class="headerlink" title="其他使用技巧"></a>Other tips</h2><h3 id="使用RKE2部署Kubernetes使用其他网络插件"><a href="#使用RKE2部署Kubernetes使用其他网络插件" class="headerlink" title="使用RKE2部署Kubernetes使用其他网络插件"></a>Using a different network plugin with RKE2</h3><p>By default rke2 uses canal as the network plugin; calico and cilium are also supported, and switching only takes a configuration change.</p><p>Take cilium as an example.<br>cilium depends on the kernel's BPF feature, so the BPF filesystem must be mounted before enabling it.<br>Check whether it is mounted:</p><pre><code>mount | grep /sys/fs/bpf</code></pre><p>Mount it:</p><pre><code>sudo mount bpffs -t bpf /sys/fs/bpf
sudo bash -c &#39;cat &lt;&lt;EOF &gt;&gt; /etc/fstab
none /sys/fs/bpf bpf rw,relatime 0 0
EOF&#39;</code></pre><p>Check again:</p><pre><code>mount | grep /sys/fs/bpf
bpffs on /sys/fs/bpf type bpf (rw,relatime)
bpffs on /sys/fs/bpf type bpf (rw,relatime)</code></pre><p>Before starting the rke2-server and agent services, edit config.yaml:</p><pre><code>mkdir -p /etc/rancher/rke2/
vim 
/etc/rancher/rke2/config.yaml</code></pre><p>添加以下参数</p><pre><code>cni: cilium</code></pre><p>启动rke2-server</p><pre><code>systemctl start rke2-server.service</code></pre><p>查看是否部署成功</p><pre><code>kubectl get pod -ANAMESPACE     NAME                                                    READY   STATUS              RESTARTS   AGEkube-system   cilium-6rfzw                                            1/1     Running             0          52skube-system   cilium-node-init-998vd                                  1/1     Running             0          52skube-system   cilium-operator-85f67b5cb7-nw7n8                        1/1     Running             0          52skube-system   cilium-operator-85f67b5cb7-qc2vh                        0/1     Pending             0          52skube-system   cloud-controller-manager-rke-node4                      1/1     Running             0          65skube-system   etcd-rke-node4                                          1/1     Running             0          73s</code></pre><h3 id="组件参数配置"><a href="#组件参数配置" class="headerlink" title="组件参数配置"></a>组件参数配置</h3><p>在&#x2F;etc&#x2F;rancher&#x2F;rke2&#x2F;config.yaml 文件中，按照对应组件，添加对应的参数，如apiserver对应为kube-apiserver-arg，组件对应参数为etcd-arg。kube-controller-manager-arg、kube-scheduler-arg、kubelet-arg、kube-proxy-arg。</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">etcd-arg:</span><br><span class="line">  - &quot;quota-backend-bytes=858993459&quot;</span><br><span class="line">  - &quot;max-request-bytes=33554432&quot;</span><br><span class="line">kube-apiserver-arg:</span><br><span class="line">  - &quot;watch-cache=true&quot;</span><br><span class="line">kubelet-arg:</span><br><span 
class="line">  - &quot;system-reserved=cpu=1,memory=2048Mi&quot;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>After configuring, start rke2-server. Apply the same configuration on the agent nodes as well; otherwise the kubelet and kube-proxy arguments will not take effect there.</p><p>Check whether the arguments took effect<br>For example:</p><pre><code>ps aux|grep system-reserved</code></pre><h3 id="集群备份和还原"><a href="#集群备份和还原" class="headerlink" title="集群备份和还原"></a>Cluster backup and restore</h3><p>rke2 backup files are stored in the &#x2F;var&#x2F;lib&#x2F;rancher&#x2F;rke2&#x2F;server&#x2F;db&#x2F;snapshots directory on every node that has the etcd role, so multiple copies are kept.<br>By default a backup is taken every 12 hours and 5 backups are retained.</p><p>Note: the version used here only supports scheduled backups; there is no option to take a backup immediately.</p><p>Restore from a specified backup file</p><p>Stop the rke2-server process</p><pre><code>systemctl stop rke2-server</code></pre><p>Restore from the specified file</p><pre><code>rke2 server \
  --cluster-reset \
  --cluster-reset-restore-path=&lt;PATH-TO-SNAPSHOT&gt;</code></pre><p>For an HA cluster, after the restore succeeds, run <code>rm -rf /var/lib/rancher/rke2/server/db</code> on each of the other server nodes, then restart the server so that it rejoins the cluster.</p><p>Like rke1, rke2 also supports restoring a backup file into a new cluster.</p><h3 id="常见操作"><a href="#常见操作" class="headerlink" title="常见操作"></a>Common operations</h3><p>Reference:<br><a href="https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf">https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf</a></p><h4 id="查看本机运行的容器"><a href="#查看本机运行的容器" class="headerlink" title="查看本机运行的容器"></a>List the containers running on this node</h4><p>With the ctr command</p><pre><code>/var/lib/rancher/rke2/bin/ctr --address /run/k3s/containerd/containerd.sock --namespace k8s.io container ls</code></pre><p>With the crictl command</p><pre><code>export CRI_CONFIG_FILE=/var/lib/rancher/rke2/agent/etc/crictl.yaml
/var/lib/rancher/rke2/bin/crictl ps</code></pre><!----><pre><code>/var/lib/rancher/rke2/bin/crictl --config /var/lib/rancher/rke2/agent/etc/crictl.yaml ps</code></pre><!----><pre><code>/var/lib/rancher/rke2/bin/crictl --runtime-endpoint unix:///run/k3s/containerd/containerd.sock ps -a</code></pre><p>All of these ultimately connect to the containerd socket file.</p><h4 id="查看日志"><a href="#查看日志" class="headerlink" title="查看日志"></a>View logs</h4><pre><code>journalctl -f -u rke2-server
/var/lib/rancher/rke2/agent/containerd/containerd.log
/var/lib/rancher/rke2/agent/logs/kubelet.log</code></pre><h4 id="etcd操作"><a href="#etcd操作" class="headerlink" title="etcd操作"></a>etcd operations</h4><p>etcdctl check perf</p><pre><code>for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl check perf&quot;; done</code></pre><p>etcdctl endpoint status</p><pre><code>for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl endpoint status&quot;; done</code></pre><p>etcdctl endpoint health</p><pre><code>for etcdpod in $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name); do kubectl -n kube-system exec $etcdpod -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl endpoint health&quot;; done</code></pre><p>etcdctl compact</p><pre><code>rev=$(kubectl -n kube-system exec $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1) -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl endpoint status --write-out fields | grep Revision | cut -d: -f2&quot;)
kubectl -n kube-system exec $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1) -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl compact \&quot;$(echo $rev)\&quot;&quot;</code></pre><p>etcdctl defrag</p><pre><code>kubectl -n kube-system exec $(kubectl -n kube-system get pod -l component=etcd --no-headers -o custom-columns=NAME:.metadata.name | head -1) -- sh -c &quot;ETCDCTL_ENDPOINTS=&#39;https://127.0.0.1:2379&#39; ETCDCTL_CACERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-ca.crt&#39; ETCDCTL_CERT=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.crt&#39; ETCDCTL_KEY=&#39;/var/lib/rancher/rke2/server/tls/etcd/server-client.key&#39; ETCDCTL_API=3 etcdctl defrag --cluster&quot;</code></pre><p>For the equivalent commands that call etcdctl directly, see:<br><a href="https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf">https://gist.github.com/superseb/3b78f47989e0dbc1295486c186e944bf</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h2 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h2&gt;&lt;p&gt;Software version: rke2 version v1.22.5+rke2r1&lt;br&gt;OS: Ubuntu 18.04&lt;/p&gt;
&lt;p&gt;RKE2 is Rancher</summary>
      
    
    
    
    <category term="kubernetes" scheme="http://yoursite.com/categories/kubernetes/"/>
    
    
    <category term="kubernetes" scheme="http://yoursite.com/tags/kubernetes/"/>
    
  </entry>
  
  <entry>
    <title>Integrating Jenkins with External Systems</title>
    <link href="http://yoursite.com/2021/10/14/jenkins_Docking_external_system/"/>
    <id>http://yoursite.com/2021/10/14/jenkins_Docking_external_system/</id>
    <published>2021-10-14T13:45:59.000Z</published>
    <updated>2021-10-14T13:45:59.000Z</updated>
    
    <content type="html"><![CDATA[<h3 id="环境准备"><a href="#环境准备" class="headerlink" title="环境准备"></a>Environment preparation</h3><p>Software versions</p><table><thead><tr><th>Software</th><th>Version</th></tr></thead><tbody><tr><td>gitlab</td><td>14.3.0</td></tr><tr><td>Jenkins</td><td>2.303.1</td></tr><tr><td>Harbor</td><td>1.10.2</td></tr><tr><td>Sonar</td><td>9.1</td></tr><tr><td>Nexus</td><td>3.35.0-02</td></tr><tr><td>ArgoCD</td><td>2.1.3</td></tr></tbody></table><h4 id="部署gitlab"><a href="#部署gitlab" class="headerlink" title="部署gitlab"></a>Deploy GitLab</h4><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run --detach --hostname 10.8.242.28 --publish 443:443 --publish 80:80 --publish 1022:22 --name gitlab --restart always --volume /srv/gitlab/config:/etc/gitlab --volume /srv/gitlab/logs:/var/log/gitlab --volume /srv/gitlab/data:/var/opt/gitlab gitlab/gitlab-ce:12.10.3-ce.0</span><br></pre></td></tr></table></figure><p>Replace the hostname value with the node's actual public IP.</p><h4 id="部署Harbor"><a href="#部署Harbor" class="headerlink" title="部署Harbor"></a>Deploy Harbor</h4><p>Harbor deployment and management<br>Before deploying, modify the Docker configuration.<br>Edit the Docker daemon configuration</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line">vim /etc/docker/daemon.json</span><br><span class="line">&#123;</span><br><span class="line"> &quot;insecure-registries&quot; : [&quot;0.0.0.0/0&quot;]</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Restart Docker</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">systemctl restart docker </span><br></pre></td></tr></table></figure><p>Install docker-compose</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span 
class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">curl -L https://github.com/docker/compose/releases/download/1.24.1/docker-compose-`uname -s`-`uname -m` -o /usr/local/bin/docker-compose</span><br><span class="line">chmod +x /usr/local/bin/docker-compose</span><br></pre></td></tr></table></figure><p>Download Harbor</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">https://github.com/goharbor/harbor/releases/download/v1.10.2/harbor-online-installer-v1.10.2.tgz</span><br></pre></td></tr></table></figure><p>Configure harbor.yml</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">hostname: 172.31.48.86 # change to the actual node IP</span><br></pre></td></tr></table></figure><p>Comment out the https section of the configuration.</p><p>Install Harbor</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">./install.sh --with-clair</span><br></pre></td></tr></table></figure><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br></pre></td><td class="code"><pre><span class="line">docker-compose  ps</span><br><span class="line">      Name                     Command                  State                 Ports          </span><br><span 
class="line">---------------------------------------------------------------------------------------------</span><br><span class="line">clair               /docker-entrypoint.sh            Up (healthy)   6060/tcp, 6061/tcp       </span><br><span class="line">harbor-core         /harbor/start.sh                 Up (healthy)                            </span><br><span class="line">harbor-db           /entrypoint.sh postgres          Up (healthy)   5432/tcp                 </span><br><span class="line">harbor-jobservice   /harbor/start.sh                 Up                                      </span><br><span class="line">harbor-log          /bin/sh -c /usr/local/bin/ ...   Up (healthy)   127.0.0.1:1514-&gt;10514/tcp</span><br><span class="line">harbor-portal       nginx -g daemon off;             Up (healthy)   80/tcp                   </span><br><span class="line">nginx               nginx -g daemon off;             Up (healthy)   0.0.0.0:80-&gt;80/tcp       </span><br><span class="line">redis               docker-entrypoint.sh redis ...   Up             6379/tcp                 </span><br><span class="line">registry            /entrypoint.sh /etc/regist ...   Up (healthy)   5000/tcp                 </span><br><span class="line">registryctl         /harbor/start.sh                 Up (healthy)        </span><br><span class="line"></span><br></pre></td></tr></table></figure><p>Open <a href="http://node_ip/">http://node_ip</a> in a browser.</p><p>Default credentials: admin&#x2F;Harbor12345<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/harbor_1.png"></p><h4 id="创建测试项目"><a href="#创建测试项目" class="headerlink" title="创建测试项目"></a>Create a test project</h4><p>Official spring-petclinic sample project: <a href="https://projects.spring.io/spring-petclinic/">https://projects.spring.io/spring-petclinic/</a></p><p>This walkthrough containerizes and deploys the spring-petclinic sample project provided by Spring. The project is built with Spring Boot + Thymeleaf and can use MySQL, H2, and other databases; for convenience, this walkthrough uses the embedded H2 database directly.</p><p>Note: because this walkthrough uses the embedded H2 database, each application instance has its own independent data, which makes the application stateful. The production best practice is to keep data in external storage and deploy the application statelessly.</p><p>Mirror for cloning from within China: <a href="https://gitee.com/wanshaoyuan/spring-petclinic.git">https://gitee.com/wanshaoyuan/spring-petclinic.git</a></p><p>Clone this project and push it to your private GitLab.</p><h3 id="与Gitlab集成"><a href="#与Gitlab集成" class="headerlink" title="与Gitlab集成"></a>Integrating with GitLab</h3><h4 id="安装gitlab插件"><a href="#安装gitlab插件" class="headerlink" title="安装gitlab插件"></a>Install the GitLab plugin</h4><p><img src="https://pic.downk.cc/item/5ebd382ac2a9a83be5766ac4.jpg"></p><h4 id="Gitlab中申请AccessToken"><a href="#Gitlab中申请AccessToken" class="headerlink" title="Gitlab中申请AccessToken"></a>Request an access token in GitLab</h4><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-2.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ci_11.png"></p><p><img src="https://pic.downk.cc/item/5ebd3952c2a9a83be57762d6.jpg"></p><p>Save the generated token somewhere safe.</p><h4 id="配置Jenkins对接gitlab"><a href="#配置Jenkins对接gitlab" class="headerlink" title="配置Jenkins对接gitlab"></a>Configure Jenkins to connect to GitLab</h4><p><img src="https://pic.downk.cc/item/5ebd39b2c2a9a83be577b003.jpg"><br>Add the credential<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/ci_14.png"><br>Test the connection</p><p><img 
src="https://pic.downk.cc/item/5ebd3b2dc2a9a83be5792b5e.jpg"></p><h4 id="测试"><a href="#测试" class="headerlink" title="测试"></a>Test</h4><p>Read the pom.xml file of the spring-petclinic project in GitLab.</p><p><img src="https://pic.downk.cc/item/5ebd3ddac2a9a83be57c2e59.jpg"></p><p><img src="https://pic.downk.cc/item/5ebd3f39c2a9a83be57dc753.jpg"></p><p>The credential used to access the private GitLab project can be either an SSH key or a username and password.<br><img src="https://pic.downk.cc/item/5ebd3efcc2a9a83be57d86c1.jpg"></p><p><img src="https://pic.downk.cc/item/5ebd3f92c2a9a83be57e211a.jpg"><br>Change the branch to main.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-4.png"></p><p>Build<br><img src="https://pic.downk.cc/item/5ebd3fc1c2a9a83be57e4fc9.jpg"></p><p>Use cat to print the contents of the file.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-3.png"></p><p>Run "Build Now"<br><img src="https://pic.downk.cc/item/5ebd407fc2a9a83be57f28ba.jpg"></p><p>The output is the actual content of our pom.xml file.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-5.png"></p><h3 id="与Kubernetes集成构建分布式动态编译环境"><a href="#与Kubernetes集成构建分布式动态编译环境" class="headerlink" title="与Kubernetes集成构建分布式动态编译环境"></a>Integrating with Kubernetes to build a distributed, dynamic build environment</h3><h4 id="安装Kubernetes插件"><a href="#安装Kubernetes插件" class="headerlink" title="安装Kubernetes插件"></a>Install the Kubernetes plugin</h4><p>To provision dynamic agent (slave) Pods through Kubernetes, Jenkins needs the Kubernetes plugin:</p><ul><li>kubernetes</li></ul><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-6.png"></p><h4 id="安装Kubernetes-Continuous-Deploy插件"><a href="#安装Kubernetes-Continuous-Deploy插件" class="headerlink" title="安装Kubernetes Continuous Deploy插件"></a>Install the Kubernetes Continuous Deploy plugin</h4><p>Jenkins relies on a kubeconfig to access Kubernetes. To support kubeconfig-type credentials, install the Kubernetes Continuous Deploy plugin:</p><ul><li>Kubernetes Continuous Deploy</li></ul><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-7.png"></p><h4 id="配置Kubernetes集群"><a href="#配置Kubernetes集群" class="headerlink" title="配置Kubernetes集群"></a>Configure the Kubernetes cluster</h4><p>Configuration: 
Manage Jenkins -&gt; Configure System -&gt; add a new cloud<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-8.png"><br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-9.png"></p><p>Configure the Jenkins URL. The api-server address and certificate key can be left empty here; in that case Jenkins reads the kubeconfig file from the .kube&#x2F; directory under JENKINS_HOME by default and uses it to connect to the cluster. I installed Jenkins from a package, so JENKINS_HOME is &#x2F;var&#x2F;lib&#x2F;jenkins&#x2F;. If Jenkins runs as a container, put the kubeconfig file directly in the ~&#x2F;.kube&#x2F; directory.<br>Save the kubeconfig to a file named config on the Jenkins host.</p><p>Then copy it to ~&#x2F;.kube&#x2F;config inside the Jenkins container</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">docker exec -it jenkins mkdir /root/.kube/</span><br><span class="line">docker cp config  jenkins:/root/.kube/config</span><br></pre></td></tr></table></figure><p>Note:<br>With this approach, when the Jenkins container is restarted the directory is re-initialized and the kubeconfig file is overwritten. In production, mount the file into the container instead.</p><h4 id="验证Pipeline流水线"><a href="#验证Pipeline流水线" class="headerlink" title="验证Pipeline流水线"></a>Verify the Pipeline</h4><p>The integration between Jenkins and Kubernetes is now essentially complete. Before creating the real Pipeline for the Spring Petclinic application, run a quick test to confirm that a Pipeline works with the Jenkins and Kubernetes integration.</p><ul><li>Create a new pipeline-type job named test-hello-pipeline</li></ul><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-10.png"></p><ul><li>Prepare the pipeline test script</li></ul><figure class="highlight groovy"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span 
class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">  agent &#123;</span><br><span class="line">    kubernetes &#123;</span><br><span class="line">      cloud <span class="string">&#x27;Kubernetes&#x27;</span></span><br><span class="line">      namespace <span class="string">&#x27;default&#x27;</span></span><br><span class="line">      yaml <span class="string">&quot;&quot;&quot;</span></span><br><span class="line"><span class="string">apiVersion: v1</span></span><br><span class="line"><span class="string">kind: Pod</span></span><br><span class="line"><span class="string">spec:</span></span><br><span class="line"><span class="string">  containers:</span></span><br><span class="line"><span class="string">    - name: busybox</span></span><br><span class="line"><span class="string">      image: busybox</span></span><br><span class="line"><span class="string">      command:</span></span><br><span class="line"><span class="string">      - sleep</span></span><br><span class="line"><span class="string">      args:</span></span><br><span class="line"><span class="string">      - infinity</span></span><br><span class="line"><span class="string">&quot;&quot;&quot;</span></span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">  stages &#123;</span><br><span class="line">    stage(<span class="string">&#x27;Test&#x27;</span>) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        container(<span class="string">&#x27;busybox&#x27;</span>) &#123;</span><br><span class="line">            sh <span class="string">&quot;echo &#x27;hello 
world&#x27;&quot;</span></span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The script above is a simple declarative pipeline that uses the busybox image to print the string <code>hello world</code>.</p><ul><li>Add the pipeline script</li></ul><p>Paste the test script into the job's pipeline script box:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-13.png"></p><ul><li>Save the pipeline and run a build</li></ul><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-14.png"></p><ul><li>Check the job result</li></ul><p>In Kubernetes you can see that Jenkins automatically created a Pod to run the job; once the job finishes, the Pod is deleted automatically.</p><p>In Jenkins, the build's console output shows <code>hello world</code> as expected:</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-12.png"></p><p>The result confirms that Jenkins and Kubernetes are configured correctly and the Pipeline runs normally.</p><h3 id="Sonar-Qube对接实现代码质量扫描"><a href="#Sonar-Qube对接实现代码质量扫描" class="headerlink" title="Sonar-Qube对接实现代码质量扫描"></a>Integrating SonarQube for code quality scanning</h3><p><img src="https://docs.sonarqube.org/latest/images/dev-cycle.png"><br><img src="https://docs.sonarqube.org/9.1/images/SQ-instance-components.png"></p><h4 id="安装sonarqube"><a href="#安装sonarqube" class="headerlink" title="安装sonarqube"></a>Install SonarQube</h4><p>Initialize</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">helm repo add sonarqube https://SonarSource.github.io/helm-chart-sonarqube</span><br><span class="line">helm repo update</span><br><span class="line">kubectl create namespace sonarqube</span><br></pre></td></tr></table></figure><p>Install SonarQube with helm</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">helm install  sonarqube  --namespace sonarqube  sonarqube/sonarqube --set 
postgresql.persistence.enabled=false</span><br></pre></td></tr></table></figure><p>Note: to keep the deployment quick, persistent storage is not configured for postgresql here, so there is a risk of data loss. In production, postgresql should be designed for HA or backed by persistent storage.<br>Expose the service externally as a NodePort</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl patch svc sonarqube-sonarqube -p &#x27;&#123;&quot;spec&quot;: &#123;&quot;type&quot;: &quot;NodePort&quot;&#125;&#125;&#x27; -n sonarqube </span><br></pre></td></tr></table></figure><p>Look up the NodePort</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">kubectl  get svc -n sonarqube</span><br><span class="line">NAME                            TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE</span><br><span class="line">sonarqube-postgresql            ClusterIP   10.110.40.18    &lt;none&gt;        5432/TCP         4m41s</span><br><span class="line">sonarqube-postgresql-headless   ClusterIP   None            &lt;none&gt;        5432/TCP         4m41s</span><br><span class="line">sonarqube-sonarqube             NodePort    10.106.78.100   &lt;none&gt;        9000:30005/TCP   4m41s</span><br></pre></td></tr></table></figure><p>Confirm that the Pods started successfully</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br></pre></td><td class="code"><pre><span class="line"> kubectl get pod -n sonarqube</span><br><span class="line">NAME                     READY   STATUS    RESTARTS   AGE</span><br><span class="line">sonarqube-postgresql-0   1/1     Running   0          4m7s</span><br><span class="line">sonarqube-sonarqube-0    1/1     Running   0          
4m7s</span><br></pre></td></tr></table></figure><p>Open port 30005 on the node.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-15.png"></p><p>Default credentials: admin&#x2F;admin</p><p>If you need a Chinese interface, just install the plugin:<br>Administration -&gt; Marketplace<br>Search for Chinese and install it<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/jenkins_sonar_6.png"></p><p>Generate a token<br>Request the token under<br>Administration -&gt; Security -&gt; Users -&gt; Tokens<br>Save the generated token<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/jenkins_sonar_8.png"></p><h4 id="Jenkins配置"><a href="#Jenkins配置" class="headerlink" title="Jenkins配置"></a>Jenkins configuration</h4><p>Install the plugin<br>Manage Jenkins -&gt; Manage Plugins<br>Install SonarQube Scanner for Jenkins</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-16.png"></p><p>Configure the plugin<br>Configure the SonarQube server<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-17.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-18.png"></p><p>For Server URL, enter the SonarQube address.</p><p>For Server authentication token, use the token created above by adding a credential<br>of type Secret Text. Put the token value in Secret; ID is the name of this secret.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-19.png"></p><p>Configure the SonarQube scanner (agent)<br>Manage Jenkins -&gt; Global Tool Configuration -&gt; SonarQube Scanner<br>Set it to install automatically.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-20.png"></p><h4 id="FreeStyle风格任务下配置SonarQube"><a href="#FreeStyle风格任务下配置SonarQube" class="headerlink" title="FreeStyle风格任务下配置SonarQube"></a>Configuring SonarQube in a FreeStyle job</h4><p>Take spring-petclinic from the test-gitlab job above as the example.<br>Run the maven build first to produce the class files, then run the scan, because SonarQube's Java analysis needs the compiled .class files in addition to the .java sources.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-25.png"></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -i -v /var/jenkins_home/workspace/:/tmp  maven:3.6-jdk-8 mvn -f /tmp/spring-petclinic/pom.xml clean package 
-DskipTests</span><br></pre></td></tr></table></figure><p>Add an "Execute SonarQube Scanner" step to the build stage.</p><p>Enter the following content</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">sonar.projectKey=test</span><br><span class="line">sonar.projectName=test</span><br><span class="line">sonar.projectVersion=1.0</span><br><span class="line">sonar.sources=src</span><br><span class="line">sonar.java.binaries=target/classes</span><br><span class="line">sonar.language=java</span><br><span class="line">sonar.sourceEncoding=UTF-8</span><br></pre></td></tr></table></figure><p>Notes:<br>sonar.projectKey&#x3D;test # the project key shown in SonarQube<br>sonar.projectName&#x3D;test # the project name shown in SonarQube<br>sonar.projectVersion&#x3D;1.0 # the project version shown in SonarQube<br>sonar.sources&#x3D;src # the source directory to scan<br>sonar.java.binaries&#x3D;target&#x2F;classes  # the directory holding the class files compiled from the Java sources<br>sonar.language&#x3D;java # the only language to scan<br>sonar.sourceEncoding&#x3D;UTF-8 # the source file encoding; UTF-8 is the usual choice</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-21.png"></p><p>Run the build<br>View the result in Jenkins<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-22.png"></p><p>View the result in SonarQube<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-23.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-24.png"></p><h4 id="Jenkins-Pipeline风格任务下配置SonarQube"><a href="#Jenkins-Pipeline风格任务下配置SonarQube" class="headerlink" title="Jenkins-Pipeline风格任务下配置SonarQube"></a>Configuring SonarQube in a Jenkins Pipeline job</h4><p>With a Pipeline, the following steps are required.</p><p>1. Create sonar-project.properties in the root directory of the code repository</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">sonar.projectKey=test2</span><br><span class="line">sonar.projectName=test2</span><br><span class="line">sonar.projectVersion=1.0</span><br><span class="line">sonar.sources=src</span><br><span class="line">sonar.java.binaries=target/classes</span><br><span class="line">sonar.java.source=1.8</span><br><span class="line">sonar.java.target=1.8</span><br><span class="line">sonar.language=java</span><br><span class="line">sonar.sourceEncoding=UTF-8</span><br></pre></td></tr></table></figure><p>2. Add the following stage to the Pipeline</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">stage(&#x27;SonarQube analysis&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        script &#123;</span><br><span class="line">        def sonarqubeScannerHome = tool name: &#x27;SonarQubeScanner&#x27;</span><br><span class="line">            withSonarQubeEnv(&#x27;sonar&#x27;) &#123;</span><br><span class="line">            sh &quot;$&#123;sonarqubeScannerHome&#125;/bin/sonar-scanner&quot;</span><br><span class="line">         &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br></pre></td></tr></table></figure><p>Notes:<br>1. SonarQubeScanner is the name given to the SonarQube Scanner installation under Global Tool Configuration.<br>2. The sonar value passed to withSonarQubeEnv is the name of the SonarQube server entry under Manage Jenkins -&gt; Configure System.</p><p>Clean the workspace</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">rm -rf /var/jenkins_home/workspace/spring-petclinic</span><br></pre></td></tr></table></figure><h3 id="Sonattype-Nexus"><a href="#Sonattype-Nexus" class="headerlink" title="Sonattype-Nexus"></a>Sonatype-Nexus</h3><p>Nexus is an open-source artifact repository. It can store build artifacts such as jar packages, npm packages, and docker images. A repository holding those artifacts can also serve as a private mirror for software that later has to be built inside the internal network.</p><h4 id="部署安装"><a href="#部署安装" class="headerlink" title="部署安装"></a>Deployment and installation</h4><p>Software version: 3.35.0-02<br>To keep this deployment quick and convenient, it runs on Docker.<br>Create the data directory</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">mkdir /var/nexus-data &amp;&amp; chown -R 200 /var/nexus-data</span><br></pre></td></tr></table></figure><p>Run it with Docker</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -d -p 8081:8081 --name nexus -v /var/nexus-data:/nexus-data sonatype/nexus3</span><br></pre></td></tr></table></figure><p>Log in with the initial account and password<br>Account: admin<br>Password: the output of the following command</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">cat /var/nexus-data/admin.password</span><br></pre></td></tr></table></figure><p>Create the repositories<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-45.png"></p><p>There are three repository types: proxy, group, and hosted.<br>proxy: a proxy repository. You can configure an upstream repository address, such as the Aliyun mirror; when an artifact is not found locally, it is looked up in the configured upstream repository.<br>hosted: a repository that stores artifacts locally for local use.<br>group: a repository group that combines several repositories; when a jar is requested, the repositories in the group are searched in order.</p><p>Here we create two Maven2 repositories of type hosted, named spring-petclinic-releases and spring-petclinic-snapshots.</p><p>The releases repository mainly stores release-version artifacts; the snapshots repository stores artifacts produced during continuous integration.</p><p>Set the version policy to release or snapshot as appropriate<br><img 
src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-46.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-47.png"></p><p>Maven configuration<br>Create conf&#x2F;settings.xml under the spring-petclinic directory to hold the credentials for connecting to Nexus3. This file normally lives in maven_home or ~&#x2F;.m2&#x2F;, but because the build runs inside Docker here, it is kept next to the application code.<br>settings.xml:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br></pre></td><td class="code"><pre><span class="line">&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span><br><span class="line">&lt;settings xmlns=&quot;http://maven.apache.org/SETTINGS/1.0.0&quot;</span><br><span class="line">          xmlns:xsi=&quot;http://www.w3.org/2001/XMLSchema-instance&quot;</span><br><span class="line">          xsi:schemaLocation=&quot;http://maven.apache.org/SETTINGS/1.0.0 http://maven.apache.org/xsd/settings-1.0.0.xsd&quot;&gt;</span><br><span class="line">    &lt;pluginGroups&gt;</span><br><span class="line">  &lt;/pluginGroups&gt;</span><br><span class="line">  &lt;proxies&gt;</span><br><span class="line">  &lt;/proxies&gt;</span><br><span class="line"> &lt;servers&gt;</span><br><span class="line">     &lt;server&gt;</span><br><span class="line">      &lt;id&gt;releases&lt;/id&gt;</span><br><span class="line">      
&lt;username&gt;username&lt;/username&gt;</span><br><span class="line">      &lt;password&gt;password&lt;/password&gt;</span><br><span class="line">     &lt;/server&gt;</span><br><span class="line">     &lt;server&gt;</span><br><span class="line">      &lt;id&gt;snapshots&lt;/id&gt;</span><br><span class="line">      &lt;username&gt;username&lt;/username&gt;</span><br><span class="line">      &lt;password&gt;password&lt;/password&gt;</span><br><span class="line">     &lt;/server&gt;</span><br><span class="line">  &lt;/servers&gt;</span><br><span class="line">&lt;/settings&gt;</span><br></pre></td></tr></table></figure><p>Disable the HTTPS check. Nexus3 is exposed over plain HTTP here, so the rule that forces HTTPS links during the Maven build has to be turned off.</p><p><code>src/checkstyle/nohttp-checkstyle.xml</code><br>Comment out:<br><code>&lt;module name=&quot;io.spring.nohttp.checkstyle.check.NoHttpCheck&quot;/&gt;</code><br>After commenting out:<br><code>&lt;!-- &lt;module name=&quot;io.spring.nohttp.checkstyle.check.NoHttpCheck&quot;/&gt; --&gt;</code></p><p>Add the following to pom.xml:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">&lt;distributionManagement&gt;</span><br><span class="line">        &lt;repository&gt;</span><br><span class="line">            &lt;!-- The id can be any name, but it must match the &lt;server&gt; id in settings.xml --&gt;</span><br><span class="line">            &lt;id&gt;releases&lt;/id&gt;</span><br><span class="line">            &lt;!-- Points to the hosted repository that stores Release versions --&gt;</span><br><span class="line">            &lt;url&gt;http://172.16.0.195:8081/repository/spring-petclinic-releases/&lt;/url&gt;</span><br><span 
class="line">        &lt;/repository&gt;</span><br><span class="line">        &lt;snapshotRepository&gt;</span><br><span class="line">            &lt;id&gt;snapshots&lt;/id&gt;</span><br><span class="line">            &lt;!-- Points to the hosted repository that stores Snapshot versions --&gt;</span><br><span class="line">            &lt;url&gt;http://172.16.0.195:8081/repository/spring-petclinic-snapshots/&lt;/url&gt;</span><br><span class="line">        &lt;/snapshotRepository&gt;</span><br><span class="line">  &lt;/distributionManagement&gt;</span><br></pre></td></tr></table></figure><p>Run the build:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -i -v /root/spring-petclinic/:/tmp  maven:3.6-jdk-8 mvn -f /tmp/pom.xml --settings /tmp/conf/settings.xml clean deploy </span><br></pre></td></tr></table></figure><p>After the build completes and the upload succeeds:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-48.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-49.png"></p><p>The uploaded JAR is visible in the spring-petclinic-snapshots repository. A timestamp is appended to the file name, which makes the snapshots easy to tell apart.  </p><p>To upload to the releases repository instead, remove the <code>-SNAPSHOT</code> suffix from <code>&lt;version&gt;2.5.0-SNAPSHOT&lt;/version&gt;</code> in pom.xml; the version is then treated as a release.  </p><h3 id="ArgoCD集成实现CD端对接"><a href="#ArgoCD集成实现CD端对接" class="headerlink" title="ArgoCD集成实现CD端对接"></a>Integrating ArgoCD for the CD side</h3><h4 id="编写并上传部署spring-petclinic的yaml文件和Dockerfile文件"><a href="#编写并上传部署spring-petclinic的yaml文件和Dockerfile文件" class="headerlink" title="编写并上传部署spring-petclinic的yaml文件和Dockerfile文件"></a>Write and upload the YAML manifest and Dockerfile for deploying spring-petclinic</h4><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span 
class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br></pre></td><td class="code"><pre><span class="line">apiVersion: apps/v1</span><br><span class="line">kind: Deployment</span><br><span class="line">metadata:</span><br><span class="line">  name: spring-petclinic-0-0-1</span><br><span class="line">spec:</span><br><span class="line">  selector:</span><br><span class="line">    matchLabels:</span><br><span class="line">      app: 
spring-petclinic</span><br><span class="line">      version: 0.0.1</span><br><span class="line">  replicas: 1</span><br><span class="line">  template:</span><br><span class="line">    metadata:</span><br><span class="line">      labels:</span><br><span class="line">        app: spring-petclinic</span><br><span class="line">        version: 0.0.1</span><br><span class="line">    spec:</span><br><span class="line">      containers:</span><br><span class="line">      - name: spring-petclinic</span><br><span class="line">        image: registry.cn-shenzhen.aliyuncs.com/yedward/spring-petclinic:0.0.1</span><br><span class="line">        resources:</span><br><span class="line">          limits:</span><br><span class="line">            memory: 2Gi</span><br><span class="line">            cpu: 1</span><br><span class="line">        ports:</span><br><span class="line">        - containerPort: 8080</span><br><span class="line">        livenessProbe:</span><br><span class="line">          failureThreshold: 3</span><br><span class="line">          httpGet:</span><br><span class="line">            path: /actuator/health/liveness</span><br><span class="line">            port: 8080</span><br><span class="line">            scheme: HTTP</span><br><span class="line">          initialDelaySeconds: 30</span><br><span class="line">          periodSeconds: 5</span><br><span class="line">          successThreshold: 1</span><br><span class="line">          timeoutSeconds: 2</span><br><span class="line">        readinessProbe:</span><br><span class="line">          failureThreshold: 3</span><br><span class="line">          httpGet:</span><br><span class="line">            path: /actuator/health/readiness</span><br><span class="line">            port: 8080</span><br><span class="line">            scheme: HTTP</span><br><span class="line">          initialDelaySeconds: 30</span><br><span class="line">          periodSeconds: 5</span><br><span class="line">          successThreshold: 
2</span><br><span class="line">          timeoutSeconds: 2</span><br><span class="line">---</span><br><span class="line">apiVersion: v1</span><br><span class="line">kind: Service</span><br><span class="line">metadata:</span><br><span class="line">  name: spring-petclinic-svc-0-0-1</span><br><span class="line">spec:</span><br><span class="line">  selector:</span><br><span class="line">    app: spring-petclinic</span><br><span class="line">    version: 0.0.1</span><br><span class="line">  ports:</span><br><span class="line">  - port: 8080</span><br><span class="line">    targetPort: 8080</span><br><span class="line">  type: NodePort</span><br></pre></td></tr></table></figure><p>Change the image address in the YAML to your actual registry address and project name.<br>Dockerfile:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">FROM registry.cn-shenzhen.aliyuncs.com/yedward/openjdk:8-jre-slim</span><br><span class="line"># In production, run as a non-root user via USER (the user must exist in the base image)</span><br><span class="line">USER appuser</span><br><span class="line">EXPOSE 8080</span><br><span class="line">COPY target/*.jar /app/spring-petclinic.jar</span><br><span class="line">WORKDIR /app</span><br><span class="line">CMD java -Xms1024m -Xmx1024m -jar /app/spring-petclinic.jar</span><br></pre></td></tr></table></figure><p>Upload both files to the spring-petclinic project in GitLab.</p><h4 id="部署ArgoCD"><a href="#部署ArgoCD" class="headerlink" title="部署ArgoCD"></a>Deploy ArgoCD</h4><p>Single-node deployment<br>Use the quick-start manifest from the official site:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">kubectl create namespace argocd</span><br><span class="line">kubectl apply -n argocd -f 
https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml</span><br></pre></td></tr></table></figure><p>部署完后产生以下服务</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br></pre></td><td class="code"><pre><span class="line">NAME                                      READY   STATUS    RESTARTS   AGE</span><br><span class="line">pod/argocd-application-controller-0       1/1     Running   0          5d6h</span><br><span class="line">pod/argocd-dex-server-74588646d-sz9g8     1/1     Running   0          2d2h</span><br><span class="line">pod/argocd-redis-5ccdd9d4fd-csthm         1/1     Running   1          5d6h</span><br><span class="line">pod/argocd-repo-server-5bbb8bdf78-mxkv7   1/1     Running   0          18h</span><br><span class="line">pod/argocd-server-789fb45964-82mzx        1/1     Running   0          18h</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">NAME                            TYPE        CLUSTER-IP      
EXTERNAL-IP   PORT(S)                      AGE</span><br><span class="line">service/argocd-dex-server       ClusterIP   10.43.180.172   &lt;none&gt;        5556/TCP,5557/TCP,5558/TCP   5d6h</span><br><span class="line">service/argocd-metrics          ClusterIP   10.43.184.97    &lt;none&gt;        8082/TCP                     5d6h</span><br><span class="line">service/argocd-redis            ClusterIP   10.43.4.233     &lt;none&gt;        6379/TCP                     5d6h</span><br><span class="line">service/argocd-repo-server      ClusterIP   10.43.9.45      &lt;none&gt;        8081/TCP,8084/TCP            5d6h</span><br><span class="line">service/argocd-server           NodePort    10.43.48.239    &lt;none&gt;        80:31320/TCP,443:31203/TCP   5d6h</span><br><span class="line">service/argocd-server-metrics   ClusterIP   10.43.149.186   &lt;none&gt;        8083/TCP                     5d6h</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">NAME                                 READY   UP-TO-DATE   AVAILABLE   AGE</span><br><span class="line">deployment.apps/argocd-dex-server    1/1     1            1           5d6h</span><br><span class="line">deployment.apps/argocd-redis         1/1     1            1           5d6h</span><br><span class="line">deployment.apps/argocd-repo-server   1/1     1            1           5d6h</span><br><span class="line">deployment.apps/argocd-server        1/1     1            1           5d6h</span><br><span class="line"></span><br><span class="line">NAME                                            DESIRED   CURRENT   READY   AGE</span><br><span class="line">replicaset.apps/argocd-dex-server-74588646d     1         1         1       5d6h</span><br><span class="line">replicaset.apps/argocd-redis-5ccdd9d4fd         1         1         1       5d6h</span><br><span class="line">replicaset.apps/argocd-repo-server-5bbb8bdf78   1         1         1       5d6h</span><br><span 
class="line">replicaset.apps/argocd-server-789fb45964        1         1         1       5d6h</span><br><span class="line"></span><br><span class="line">NAME                                             READY   AGE</span><br><span class="line">statefulset.apps/argocd-application-controller   1/1     5d6h</span><br></pre></td></tr></table></figure><p>Expose the argocd-server service through a NodePort:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl patch svc argocd-server -n argocd -p &#x27;&#123;&quot;spec&quot;: &#123;&quot;type&quot;: &quot;NodePort&quot;&#125;&#125;&#x27;</span><br></pre></td></tr></table></figure><p>Open the dashboard:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/argocd_1.png"></p><p>The default account is admin; read the password from the secret:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath=&quot;&#123;.data.password&#125;&quot; | base64 -d</span><br></pre></td></tr></table></figure><h4 id="配置ArgoCD"><a href="#配置ArgoCD" class="headerlink" title="配置ArgoCD"></a>Configure ArgoCD</h4><p>Connect GitLab<br>Settings -&gt; Repositories -&gt; Connect repo using HTTPS<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-26.png"></p><p>If the Git repository is private and pulling requires a username and password, configure the repository connection in the Argo CD settings.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/argocd_10.png"></p><p>Enter the username and password; if the server uses a self-signed certificate, attach the CA certificate as well.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/argocd_11.png"></p><p>Create a Project<br>Settings -&gt; Projects<br>A project is a management object in Argo CD, and deployment permissions are tied to it.<br>Create the project and set DESTINATIONS, which controls the clusters and namespaces it may deploy to.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-28.png"></p><p>Create an Application<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-29.png"></p><p><img 
src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-30.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-31.png"></p><p>Once created:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-32.png"></p><p>Clicking Sync deploys the YAML manifests to the Kubernetes cluster.<br>The result can be checked in the cluster:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">kubectl get pod </span><br><span class="line">NAME                                      READY   STATUS    RESTARTS   AGE</span><br><span class="line">spring-petclinic-0-0-1-6695b96956-xx9nw   1/1     Running   0          28m</span><br></pre></td></tr></table></figure><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-36.png"></p><h4 id="Harbor中创建对应项目"><a href="#Harbor中创建对应项目" class="headerlink" title="Harbor中创建对应项目"></a>Create the project in Harbor</h4><p>Create a spring-petclinic project in Harbor.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-33.png"></p><h4 id="gitlab-Webhook配置"><a href="#gitlab-Webhook配置" class="headerlink" title="gitlab Webhook配置"></a>GitLab webhook configuration</h4><p>So far the Jenkins CI build is started by hand. GitLab events such as push, merge, and tag push can instead call back into Jenkins and start CI automatically.<br>In Jenkins, enable the build trigger for the project:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-35.png"><br>Generate the Secret token and save it:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-41.png"><br>GitLab configuration:<br>After logging in as root, relax the security setting so that requests to the local network are allowed.<br>In the menu, choose Admin -&gt; Settings -&gt; Network -&gt; Outbound requests.</p><p>Check:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">Allow requests to the local network from web hooks and services  </span><br><span class="line">Allow requests to the local network 
from system hooks</span><br></pre></td></tr></table></figure><p>Add the Jenkins IP to the allowlist and save.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-42.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-40.png"></p><p>Project -&gt; Settings -&gt; Webhooks<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-43.png"></p><p>Fill in the Jenkins callback URL and the token.</p><p>Click Test settings; a build should then start in Jenkins.</p><p>Save the configuration:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-44.png"></p><p>Under Test, pick an event to simulate, then check whether Jenkins starts the job automatically.</p><h4 id="Jenkins配置-1"><a href="#Jenkins配置-1" class="headerlink" title="Jenkins配置"></a>Jenkins configuration</h4><p>Argo CD deploys to Kubernetes automatically when it detects a change in the YAML, so Jenkins only needs an extra stage that modifies the YAML and pushes it.</p><p>Shell commands for each build stage<br>Compile stage:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">docker run -i -v /var/jenkins_home/workspace/:/tmp  maven:3.6-jdk-8 mvn -f /tmp/spring-petclinic/pom.xml clean package -DskipTests</span><br></pre></td></tr></table></figure><p>Code scan stage:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br></pre></td><td class="code"><pre><span class="line">sonar.projectKey=test</span><br><span class="line">sonar.projectName=test</span><br><span class="line">sonar.projectVersion=1.0</span><br><span class="line">sonar.sources=src</span><br><span class="line">sonar.java.binaries=target/classes</span><br><span class="line">sonar.language=java</span><br><span class="line">sonar.sourceEncoding=UTF-8</span><br></pre></td></tr></table></figure><p>Image build stage:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span 
class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">docker login -u useradmin -p password harbor_ip</span><br><span class="line">docker build -t harbor_ip/spring-petclinic/spring-petclinic:$BUILD_NUMBER .</span><br><span class="line">docker push harbor_ip/spring-petclinic/spring-petclinic:$BUILD_NUMBER </span><br></pre></td></tr></table></figure><p>Notes:<br>1. The built-in Jenkins BUILD_NUMBER is used as the image tag, so it matches the Jenkins build number.<br>2. Replace the registry username and password with real credentials. </p><p>Stage that updates and publishes the deployment YAML:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">git clone http://username:password@1.13.173.7/root/spring-petclinic.git</span><br><span class="line">cd spring-petclinic/</span><br><span class="line">git config --global user.email &quot;root@example.com&quot;</span><br><span class="line">git config --global user.name &quot;root&quot;</span><br><span class="line">git remote set-url origin http://username:password@1.13.173.7/root/spring-petclinic.git</span><br><span class="line">sed -i &quot;s/spring-petclinic:.*/spring-petclinic:$BUILD_NUMBER/g&quot; deployment.yaml</span><br><span class="line">git add deployment.yaml</span><br><span class="line">git commit -m &quot;update yaml&quot;</span><br><span class="line">git push origin main</span><br></pre></td></tr></table></figure><p>Also configure a post-build step that deletes the workspace, so that leftovers from one build do not affect the next.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-37.png"></p><p>Run the build. Once it succeeds, check that the image tag deployed in Kubernetes matches the BUILD_NUMBER of that build.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-38.png"></p><figure class="highlight plaintext"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">kubectl describe pod/spring-petclinic-0-0-1-699954b589-h7n58</span><br><span class="line"></span><br><span class="line">Events:</span><br><span class="line">  Type    Reason     Age   From               Message</span><br><span class="line">  ----    ------     ----  ----               -------</span><br><span class="line">  Normal  Scheduled  57m   default-scheduler  Successfully assigned default/spring-petclinic-0-0-1-7969df6996-dn2cc to rke-node2</span><br><span class="line">  Normal  Pulled     57m   kubelet            Container image &quot;1.13.173.7:8080/spring-petclinic/spring-petclinic:15&quot; already present on machine</span><br><span class="line">  Normal  Created    57m   kubelet            Created container spring-petclinic</span><br><span class="line">  Normal  Started    57m   kubelet            Started container spring-petclinic</span><br></pre></td></tr></table></figure><p>Open the node IP plus the NodePort exposed by the spring-petclinic service:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins2-39.png"></p><p>This is a pet clinic management system; pets can be managed from this page.</p><h3 id="备注"><a href="#备注" class="headerlink" title="备注"></a>Notes</h3><p>Built-in Jenkins environment variables<br>The full list is available at ${YOUR_JENKINS_HOST}&#x2F;env-vars.html.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">BUILD_NUMBER: uniquely identifies a build of the job, e.g. 23;</span><br><span class="line">BUILD_ID: essentially the same as BUILD_NUMBER, but a string; older Jenkins versions used a timestamp, e.g. 2011-11-15_16-06-21;</span><br><span class="line">JOB_NAME: name of the job, e.g. JavaHelloWorld;</span><br><span class="line">BUILD_TAG: like BUILD_ID and BUILD_NUMBER, but identifies a build globally, e.g. jenkins-JavaHelloWorld-23;</span><br><span class="line">EXECUTOR_NUMBER: e.g. 0;</span><br><span class="line">NODE_NAME: name of the agent (slave), e.g. MyServer01;</span><br><span class="line">NODE_LABELS: labels of the agent, describing what it is used for, e.g. JavaHelloWorld MyServer01;</span><br><span class="line">JAVA_HOME: Java home directory, e.g. C:\Program Files (x86)\Java\jdk1.7.0_01;</span><br><span class="line">WORKSPACE: working directory of the job, e.g. c:\jenkins\workspace\JavaHelloWorld;</span><br><span class="line">HUDSON_URL = JENKINS_URL: URL of Jenkins, e.g. http://localhost:8000/ ;</span><br><span class="line">BUILD_URL: URL of the build, e.g. http://localhost:8000/job/JavaHelloWorld/23/;</span><br><span class="line">JOB_URL: URL of the job, e.g. http://localhost:8000/job/JavaHelloWorld/;</span><br><span class="line">SVN_REVISION: Subversion revision, e.g. 4;</span><br></pre></td></tr></table></figure>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;环境准备&quot;&gt;&lt;a href=&quot;#环境准备&quot; class=&quot;headerlink&quot; title=&quot;环境准备&quot;&gt;&lt;/a&gt;Environment setup&lt;/h3&gt;&lt;p&gt;Software versions&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Software&lt;/th&gt;
&lt;th&gt;Version&lt;/th&gt;
&lt;/tr&gt;
&lt;/</summary>
      
    
    
    
    <category term="CI/CD" scheme="http://yoursite.com/categories/CI-CD/"/>
    
    
    <category term="CI/CD" scheme="http://yoursite.com/tags/CI-CD/"/>
    
  </entry>
  
  <entry>
    <title>Jenkins Pipeline: Explanation and Usage</title>
    <link href="http://yoursite.com/2021/10/14/jenkins_pipeline/"/>
    <id>http://yoursite.com/2021/10/14/jenkins_pipeline/</id>
    <published>2021-10-14T13:45:59.000Z</published>
    <updated>2021-10-14T13:45:59.000Z</updated>
    
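    <content type="html"><![CDATA[<h3 id="什么是Jenkins-Pipeline"><a href="#什么是Jenkins-Pipeline" class="headerlink" title="什么是Jenkins-Pipeline"></a>What is Jenkins Pipeline</h3><p>Pipeline is a workflow framework that runs on Jenkins and is a core feature of Jenkins 2.x. It splits one large workflow into independent functional modules, making it possible to orchestrate and visualize complex flows that a single job cannot handle.<br>Jenkins Pipeline is also a key tool for CI/CD as code: the Pipeline is written as a Jenkinsfile and stored alongside the application code.</p><p>Pipeline supports two syntaxes:<br>1. Declarative syntax</p><p>A newer syntax added to Jenkins in which the Jenkinsfile is written with a fixed set of keywords. Its style is largely shell-like, which makes it easier to read and simpler to write. The rest of this article uses it.</p><p>2. Scripted syntax<br>Not shell-script style; it follows Groovy syntax and has a steeper learning curve.</p><p>Declarative syntax is recommended: it is clear and simple, and suits most people starting out.</p><h3 id="Pipeline和FreeStyle对比"><a href="#Pipeline和FreeStyle对比" class="headerlink" title="Pipeline和FreeStyle对比"></a>Pipeline vs. FreeStyle</h3><table><thead><tr><th></th><th>Configuration style</th><th>Display</th></tr></thead><tbody><tr><td>FreeStyle</td><td>Configured through the GUI; good for getting started, but hard to set up quickly once the workflow grows</td><td>A single combined log; no complete per-stage view</td></tr><tr><td>Pipeline</td><td>Structured code syntax that is easy to read and maintain; enables CI/CD as code</td><td>Clear stage view, with the build time and build log of each stage</td></tr></tbody></table><h3 id="Jenkins-Pipline语法介绍"><a href="#Jenkins-Pipline语法介绍" class="headerlink" title="Jenkins-Pipline语法介绍"></a>Jenkins Pipeline syntax</h3><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-11.png"></p><p><img src="http://t.eryajf.net/imgs/2021/09/f126d7d667e038cd.jpg"><br>Image source: <a href="https://wiki.eryajf.net/pages/3298.html#_1-%E6%A1%86%E6%9E%B6%E4%BB%8B%E7%BB%8D%E3%80%82">https://wiki.eryajf.net/pages/3298.html#_1-%E6%A1%86%E6%9E%B6%E4%BB%8B%E7%BB%8D%E3%80%82</a></p><ul><li>Agent: an agent is the concrete environment in which Jenkins runs steps.</li><li>Stage: a stage of the Pipeline, such as a clone-code stage or a Build stage. A Pipeline needs at least one stage.</li><li>Step: an actual unit of execution, from running a shell script to building a Docker image. Steps are provided by Jenkins plugins; when a plugin extends the Pipeline DSL, it usually means the plugin implements a new step. In declarative syntax, a stage contains exactly one steps block, which may hold several steps.</li></ul><h4 id="environment"><a href="#environment" class="headerlink" title="environment"></a>environment</h4><p>Environment variables. Depending on where the block is placed, they apply to the whole pipeline or only to a single stage.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span 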
class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">   agent any</span><br><span class="line">   environment &#123;</span><br><span class="line">       DISABLE_AUTH = &#x27;true&#x27;</span><br><span class="line">   &#125;</span><br><span class="line">   stages &#123;</span><br><span class="line">       stage(&#x27;Build&#x27;) &#123;</span><br><span class="line">           steps &#123;</span><br><span class="line">               echo env.DISABLE_AUTH</span><br><span class="line">           &#125;</span><br><span class="line">       &#125;</span><br><span class="line">   &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>When run, the build prints the value of the environment variable.</p><h4 id="options"><a href="#options" class="headerlink" title="options"></a>options</h4><p>Options configure how the pipeline runs, for example how many past builds to keep or the timeout. The following example sets the number of retries.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    options &#123;</span><br><span class="line">        retry(3)</span><br><span class="line">    &#125;</span><br><span class="line">    stages 
&#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">               sh &quot;dwdwe&quot;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The command here cannot run, so the Console Output shows Retrying until the limit of 3 attempts is reached, after which the build stops.  </p><h4 id="parameters"><a href="#parameters" class="headerlink" title="parameters"></a>parameters</h4><p>Parameters supplied to the pipeline at run time; they do not have to be defined separately in the UI.<br>The block appears inside the pipeline block, and only once.<br>Common parameter types:<br>string: a string parameter, for example: <code>parameters &#123; string(name: &#39;DEPLOY_ENV&#39;, defaultValue: &#39;staging&#39;, description: &#39;&#39;) &#125;</code><br>text: a text parameter that may span multiple lines, for example: <code>parameters &#123; text(name: &#39;DEPLOY_TEXT&#39;, defaultValue: &#39;One\nTwo\nThree\n&#39;, description: &#39;&#39;) &#125;</code><br>booleanParam: a boolean parameter, for example: <code>parameters &#123; booleanParam(name: &#39;DEBUG_BUILD&#39;, defaultValue: true, description: &#39;&#39;) &#125;</code><br>choice: a choice parameter, for example: <code>parameters &#123; choice(name: &#39;CHOICES&#39;, choices: [&#39;one&#39;, &#39;two&#39;, &#39;three&#39;], description: &#39;&#39;) &#125;</code></p><p>password: shows a masked password field in the Jenkins parameterized-build UI; any value that must not be displayed can be passed through this parameter.<br><code>parameters &#123; password(name: &#39;PASSWORD&#39;, defaultValue: &#39;SECRET&#39;, description: &#39;A secret password&#39;) &#125;</code></p><p>Note:<br>After the parameters are declared, the job must be built once manually; only then do they appear in the job's parameterized-build form.<br>Example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span 
class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    parameters &#123;</span><br><span class="line">        string(name: &#x27;PERSON&#x27;, defaultValue: &#x27;Mr Jenkins&#x27;, description: &#x27;Who should I say hello to?&#x27;)</span><br><span class="line"></span><br><span class="line">        text(name: &#x27;BIOGRAPHY&#x27;, defaultValue: &#x27;&#x27;, description: &#x27;Enter some information about the person&#x27;)</span><br><span class="line"></span><br><span class="line">        booleanParam(name: &#x27;TOGGLE&#x27;, defaultValue: true, description: &#x27;Toggle this value&#x27;)</span><br><span class="line"></span><br><span class="line">        choice(name: &#x27;CHOICE&#x27;, choices: [&#x27;One&#x27;, &#x27;Two&#x27;, &#x27;Three&#x27;], description: &#x27;Pick something&#x27;)</span><br><span class="line"></span><br><span class="line">        password(name: &#x27;PASSWORD&#x27;, defaultValue: &#x27;SECRET&#x27;, description: &#x27;Enter a password&#x27;)</span><br><span class="line">    &#125;</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">                echo &quot;Hello $&#123;params.PERSON&#125;&quot;</span><br><span class="line"></span><br><span 
class="line">                echo &quot;Biography: $&#123;params.BIOGRAPHY&#125;&quot;</span><br><span class="line"></span><br><span class="line">                echo &quot;Toggle: $&#123;params.TOGGLE&#125;&quot;</span><br><span class="line"></span><br><span class="line">                echo &quot;Choice: $&#123;params.CHOICE&#125;&quot;</span><br><span class="line"></span><br><span class="line">                echo &quot;Password: $&#123;params.PASSWORD&#125;&quot;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>When a build is started, these extra fields appear in the form:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-12.png"></p><h4 id="post"><a href="#post" class="headerlink" title="post"></a>post</h4><p>The post section defines actions that run after the pipeline finishes, selected by condition, for example sending an email when a build fails.<br>Conditions:<br>always: runs regardless of the result.<br>changed: runs when the current pipeline status differs from the previous run.<br>failure: runs when the pipeline fails.<br>success: runs when the pipeline succeeds.<br>unstable: runs when the pipeline is unstable; the status is shown in yellow.<br>aborted: runs when the pipeline is aborted.<br>cleanup: runs after all other post conditions have been evaluated, regardless of the result; typically used for cleaning up the workspace.</p><p>For example:<br>Email the execution result regardless of the outcome.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br></pre></td><td class="code"><pre><span class="line">post &#123;</span><br><span class="line">    always &#123;</span><br><span class="line">        mail to: &#x27;team@example.com&#x27;,</span><br><span class="line">             subject: &quot;Pipeline status: $&#123;currentBuild.fullDisplayName&#125;&quot;,</span><br><span class="line">             body: &quot;The execution result: $&#123;env.BUILD_URL&#125;&quot;</span><br><span class="line"></span><br><span class="line">    &#125;</span><br><span 
class="line">&#125;</span><br></pre></td></tr></table></figure><p>The next example defines three post conditions. Because the shell command cannot be executed, the build fails and the always and failure messages are printed.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">               sh &#x27;dwd&#x27;</span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    post &#123;</span><br><span class="line">        always &#123;</span><br><span class="line">            echo &#x27;already exec&#x27;</span><br><span class="line">        &#125;</span><br><span class="line">        success &#123;</span><br><span class="line">            echo &#x27;exec success&#x27;</span><br><span class="line">        &#125;</span><br><span class="line">        failure &#123;</span><br><span class="line">            echo &#x27;exec failure&#x27;</span><br><span class="line">        &#125;</span><br><span class="line">    
&#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="tool"><a href="#tool" class="headerlink" title="tool"></a>tools</h4><p>The tools directive puts a build tool, installed automatically or manually, on the PATH for the pipeline. It supports maven, jdk and gradle. The tool name must be defined beforehand in the Jenkins system settings under Global Tool Configuration.<br>For example, add the tool under Jenkins -&gt; Global Tool Configuration, then reference it by name in the project.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    tools &#123;</span><br><span class="line">        maven &#x27;maven&#x27;</span><br><span class="line">    &#125;</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">                sh &#x27;mvn --version&#x27;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><h4 id="input"><a href="#input" class="headerlink" title="input"></a>input</h4><p>The input step pauses the pipeline at a given stage so that a person can confirm whether it should continue.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span 
class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br></pre></td><td class="code"><pre><span class="line">pipeline&#123;</span><br><span class="line">    agent any</span><br><span class="line">    environment&#123;</span><br><span class="line">        approvalMap = &#x27;&#x27;</span><br><span class="line">    &#125;</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;pre deploy&#x27;)&#123;</span><br><span class="line">            steps&#123;</span><br><span class="line">                script&#123;</span><br><span class="line">                    approvalMap = input(</span><br><span class="line">                        message: &#x27;Which environment should this be deployed to?&#x27;,</span><br><span class="line">                        ok: &#x27;OK&#x27;,</span><br><span class="line">                        parameters:[</span><br><span class="line">                            choice(name: &#x27;ENV&#x27;,choices: &#x27;test\npre\nprod&#x27;,description: &#x27;Target environment&#x27;),</span><br><span class="line">                            string(name: &#x27;username&#x27;,defaultValue: &#x27;&#x27;,description: &#x27;Enter your username&#x27;)</span><br><span class="line">                        ],</span><br><span class="line">                        submitter: &#x27;admin&#x27;</span><br><span class="line">                    )</span><br><span class="line">                
&#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">        stage(&#x27;deploy&#x27;)&#123;</span><br><span class="line">            steps&#123;</span><br><span class="line">                echo &quot;Operator: $&#123;approvalMap[&#x27;username&#x27;]&#125;&quot;</span><br><span class="line">                echo &quot;Target environment: $&#123;approvalMap[&#x27;ENV&#x27;]&#125;&quot;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br><span class="line"></span><br></pre></td></tr></table></figure><p>After the build is started, the pipeline pauses at the input step and waits for the user to fill in and select the values.<br>message: required. The text shown on the page.</p><p>ok: optional text for the &quot;ok&quot; button on the input form.</p><p>submitter: a comma-separated list of users or external group names allowed to submit this input. By default any user is allowed.</p><p>The result of the example is shown below:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-13.png"></p><h4 id="when"><a href="#when" class="headerlink" title="when"></a>when</h4><p>The when directive lets the pipeline decide whether a stage should run based on the given conditions. It must contain at least one condition.<br>Built-in conditions:<br>branch: matches the branch being built, e.g. <code>when &#123; branch &#39;release-v2.5&#39; &#125;</code> (the Jenkins documentation notes that this only works in a multibranch Pipeline)<br>environment: matches an environment variable, e.g. <code>when &#123; environment name: &#39;DEPLOY_TO&#39;, value: &#39;production&#39; &#125;</code></p><p>not: runs the stage when the nested condition is false. Must contain one condition, for example: <code>when &#123; not &#123; branch &#39;master&#39; &#125; &#125;</code><br>allOf: runs the stage when all nested conditions are true. Must contain at least one condition, for example: <code>when &#123; allOf &#123; branch &#39;master&#39;; environment name: &#39;DEPLOY_TO&#39;, value: &#39;production&#39; &#125; &#125;</code><br>anyOf: runs the stage when at least one nested condition is true. Must contain at least one condition, for example: <code>when &#123; anyOf &#123; branch &#39;master&#39;; branch &#39;staging&#39; &#125; &#125;</code></p><p>The branch condition also accepts wildcard patterns, e.g. <code>when &#123; branch &#39;release-v2.*&#39; &#125;</code><br>Example:</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span 
class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">               git branch: &#x27;main&#x27;, url: &#x27;https://gitee.com/wanshaoyuan/spring-petclinic.git&#x27;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">        stage(&#x27;deploy to prod&#x27;)&#123;</span><br><span class="line">            when &#123;</span><br><span class="line">                branch &#x27;main&#x27;</span><br><span class="line">            &#125;</span><br><span class="line">            steps&#123;</span><br><span class="line">                echo &quot;deploy to prod env&quot;</span><br><span class="line">            &#125;</span><br><span class="line"></span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>When the build runs on the main branch, the second stage runs and prints deploy to prod env.</p><h4 id="parallel"><a href="#parallel" class="headerlink" title="parallel"></a>parallel</h4><p>By default a Pipeline runs its stages serially. With parallel, a stage can declare several nested stages that run at the same time. A stage may contain either steps or parallel, but not both.</p><figure class="highlight plaintext"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line"></span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;one&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">               echo &quot;stage1&quot;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">        stage(&#x27;two&#x27;) &#123;</span><br><span class="line">            failFast true</span><br><span class="line">            parallel &#123;</span><br><span class="line">                stage(&#x27;parallel 1&#x27;) &#123;</span><br><span class="line">                  steps &#123;</span><br><span class="line">                    echo &quot;parallel one&quot;</span><br><span class="line">                  &#125;</span><br><span class="line">                &#125;</span><br><span class="line">                stage(&#x27;parallel 2&#x27;) &#123;</span><br><span class="line">                  steps &#123;</span><br><span class="line">                    echo 
&quot;parallel two&quot;</span><br><span class="line">                  &#125;</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note: adding failFast true to the stage that contains parallel aborts all stages inside the parallel block as soon as one of them fails.<br>This stage runs several branches at once.<br>The Pipeline as shown in Blue Ocean:<br> <img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-15.png"></p><h4 id="script"><a href="#script" class="headerlink" title="script"></a>script</h4><p>The script step embeds a block of Scripted Pipeline (Groovy) code inside a Declarative Pipeline, for logic that the declarative syntax cannot express. The example below only runs a system command, which needs nothing more than the sh step.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;Example&#x27;) &#123;</span><br><span class="line">            steps &#123;</span><br><span class="line">               sh &#x27;date&#x27;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>The sh step runs the given shell command.</p><h4 id="trigger"><a href="#trigger" class="headerlink" title="trigger"></a>triggers</h4><p>The triggers directive defines how builds are started automatically, for example on a periodic schedule.<br>Scheduled builds with cron<br>The fields match those of a Linux crontab: minute, hour, day of month, month, day of week. The example here runs once every minute.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span 
class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    triggers&#123;</span><br><span class="line">      cron(&#x27;* * * * *&#x27;)</span><br><span class="line">    &#125;</span><br><span class="line"></span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;cat file&#x27;)&#123;</span><br><span class="line">            steps&#123;</span><br><span class="line">                sh &#x27;&#x27;&#x27;cat README.md&#x27;&#x27;&#x27;</span><br><span class="line"></span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Note:<br>The H keyword stands for hash. It picks a value within the given range based on a hash of the job name, so the value is stable for a job but differs between jobs, for example<br><code>triggers&#123; cron(&#39;H/15 * * * *&#39;) &#125;</code><br>runs every 15 minutes, perhaps at :07, :22, :37 and :52.</p><h4 id="withCredentials"><a href="#withCredentials" class="headerlink" title="withCredentials"></a>withCredentials</h4><p>withCredentials binds a secret to variables. Credentials created in Jenkins can be referenced as variables inside a Pipeline by wrapping the steps in withCredentials.</p><p>Example<br>Create a credential of type Username with password in the global credentials and enter the username and password.</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br></pre></td><td 
class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">    agent any</span><br><span class="line">    stages &#123;</span><br><span class="line">        stage(&#x27;deploy to test&#x27;)&#123;</span><br><span class="line">            steps&#123;</span><br><span class="line">                withCredentials([usernamePassword(credentialsId: &#x27;harbor_account&#x27;, passwordVariable: &#x27;password&#x27;, usernameVariable: &#x27;username&#x27;)]) &#123;</span><br><span class="line">                    sh &#x27;echo $username&#x27;</span><br><span class="line">                &#125;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">    &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Running the build prints the username stored in the credential (depending on the plugin version and credential settings, Jenkins may mask it in the console output).<br>Notes:<br>usernamePassword: the binding type passed to withCredentials; sshUserPrivateKey (SSH key) and certificate are also supported.<br>credentialsId: the ID of the credential configured in Jenkins.<br>passwordVariable: the name of the variable that receives the password.<br>usernameVariable: the name of the variable that receives the username.</p><p>SSH User Private Key example</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">withCredentials(bindings: [sshUserPrivateKey(credentialsId: &#x27;jenkins-ssh-key-for-abc&#x27;, \</span><br><span class="line">                                             keyFileVariable: &#x27;SSH_KEY_FOR_ABC&#x27;, \</span><br><span class="line">                                             passphraseVariable: &#x27;&#x27;, \</span><br><span class="line">                                             usernameVariable: &#x27;&#x27;)]) &#123;</span><br><span class="line">  // some block</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Certificate example</p><figure class="highlight 
plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">withCredentials(bindings: [certificate(aliasVariable: &#x27;&#x27;, \</span><br><span class="line">                                       credentialsId: &#x27;jenkins-certificate-for-xyz&#x27;, \</span><br><span class="line">                                       keystoreVariable: &#x27;CERTIFICATE_FOR_XYZ&#x27;, \</span><br><span class="line">                                       passwordVariable: &#x27;XYZ-CERTIFICATE-PASSWORD&#x27;)]) &#123;</span><br><span class="line">  // some block</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Reference: <a href="https://www.jenkins.io/zh/doc/book/pipeline/jenkinsfile/#usernames-and-passwords">https://www.jenkins.io/zh/doc/book/pipeline/jenkinsfile/#usernames-and-passwords</a></p><h3 id="BlueOcean使用"><a href="#BlueOcean使用" class="headerlink" title="BlueOcean使用"></a>Using Blue Ocean</h3><p>Jenkins provides the Blue Ocean interface for pipelines, which gives a clear view of how each pipeline run is progressing:</p><p>Install the blueocean plugin<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-14.png"></p><p>Once it is installed, an Open Blue Ocean button appears on the project page.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-16.png"></p><p>The interface uses a cleaner, flat style, and common configuration and operations can be done from it.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-17.png"></p><p>Re-running the Pipeline from a given stage<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-18.png"></p><p>Downloading the logs in one place<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-19.png"></p><h3 id="Jenkinsfile"><a href="#Jenkinsfile" class="headerlink" title="Jenkinsfile"></a>Jenkinsfile</h3><p>Most mainstream CI tools now support Pipeline as 
Code: the whole CI flow is written as code and stored alongside the application code, so that after pulling the repository the CI tool can parse and run the CI definition directly. Jenkins supports this too. Write the Pipeline into a Jenkinsfile, commit it to the code repository, and configure Jenkins to read the Jenkinsfile from the specified path.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-20.png"></p><h3 id="实例Demo最佳实践"><a href="#实例Demo最佳实践" class="headerlink" title="实例Demo最佳实践"></a>Demo: Best Practice Example</h3><p>The complete CI/CD flow<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-2.png"></p><p>1. Clone the source code.<br>2. Compile the source code.<br>3. After compiling, scan the code and the built executables.<br>4. Upload the build artifacts to the artifact repository.<br>5. Build the image and push it to Harbor.<br>6. Update the deployment files in GitLab.<br>7. Trigger an ArgoCD sync, deploying to the Kubernetes environment manually or automatically.</p><h4 id="Jenkins创建连接账号"><a href="#Jenkins创建连接账号" class="headerlink" title="Jenkins创建连接账号"></a>Creating connection accounts in Jenkins</h4><p>Because the pipeline pushes images and modifies the deployment YAML in GitLab, it needs credentials for both services.<br>Manage Jenkins -&gt; Manage Credentials -&gt; create credentials in the global domain<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-3.png"></p><p>Add credentials<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-4.png"></p><p>Set the kind to Username with password<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-5.png"></p><p>Create credentials with the IDs harbor_account and gitlab_account for the Pipeline to connect with.</p><h4 id="代码库配置连接SonarQube信息"><a href="#代码库配置连接SonarQube信息" class="headerlink" title="代码库配置连接SonarQube信息"></a>Configuring the SonarQube connection in the repository</h4><p>To run the scan from a Pipeline, the following steps are needed.</p><p>1. Create a sonar-project.properties file in the root directory of the repository</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">sonar.projectKey=test2</span><br><span class="line">sonar.projectName=test2</span><br><span class="line">sonar.projectVersion=1.0</span><br><span class="line">sonar.sources=src</span><br><span 
class="line">sonar.java.binaries=target/classes</span><br><span class="line">sonar.java.source=1.8</span><br><span class="line">sonar.java.target=1.8</span><br><span class="line">sonar.language=java</span><br><span class="line">sonar.sourceEncoding=UTF-8</span><br></pre></td></tr></table></figure><p>2. Add the following stage to the Pipeline</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br></pre></td><td class="code"><pre><span class="line">stage(&#x27;SonarQube analysis&#x27;) &#123;</span><br><span class="line">  steps &#123;</span><br><span class="line">    script &#123;</span><br><span class="line">      def sonarqubeScannerHome = tool name: &#x27;SonarQubeScanner&#x27;</span><br><span class="line">      withSonarQubeEnv(&#x27;sonar&#x27;) &#123;</span><br><span class="line">        sh &quot;$&#123;sonarqubeScannerHome&#125;/bin/sonar-scanner&quot;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Notes:<br>1. SonarQubeScanner is the name given to the SonarQube Scanner installation in Global Tool Configuration.<br>2. The sonar value passed to withSonarQubeEnv is the name of the SonarQube server entry (sonar-server) under Manage Jenkins -&gt; Configure System.</p><h4 id="Pipeline创建"><a href="#Pipeline创建" class="headerlink" title="Pipeline创建"></a>Creating the Pipeline</h4><p>Create a Pipeline named spring-petclinic<br>Configure the build trigger<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-6.png"></p><p>Paste in the following Pipeline code</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span 
class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br><span class="line">10</span><br><span class="line">11</span><br><span class="line">12</span><br><span class="line">13</span><br><span class="line">14</span><br><span class="line">15</span><br><span class="line">16</span><br><span class="line">17</span><br><span class="line">18</span><br><span class="line">19</span><br><span class="line">20</span><br><span class="line">21</span><br><span class="line">22</span><br><span class="line">23</span><br><span class="line">24</span><br><span class="line">25</span><br><span class="line">26</span><br><span class="line">27</span><br><span class="line">28</span><br><span class="line">29</span><br><span class="line">30</span><br><span class="line">31</span><br><span class="line">32</span><br><span class="line">33</span><br><span class="line">34</span><br><span class="line">35</span><br><span class="line">36</span><br><span class="line">37</span><br><span class="line">38</span><br><span class="line">39</span><br><span class="line">40</span><br><span class="line">41</span><br><span class="line">42</span><br><span class="line">43</span><br><span class="line">44</span><br><span class="line">45</span><br><span class="line">46</span><br><span class="line">47</span><br><span class="line">48</span><br><span class="line">49</span><br><span class="line">50</span><br><span class="line">51</span><br><span class="line">52</span><br><span class="line">53</span><br><span class="line">54</span><br><span class="line">55</span><br><span class="line">56</span><br><span class="line">57</span><br><span class="line">58</span><br><span class="line">59</span><br><span class="line">60</span><br><span class="line">61</span><br><span class="line">62</span><br><span class="line">63</span><br><span class="line">64</span><br><span class="line">65</span><br><span class="line">66</span><br><span 
class="line">67</span><br><span class="line">68</span><br><span class="line">69</span><br><span class="line">70</span><br><span class="line">71</span><br><span class="line">72</span><br><span class="line">73</span><br><span class="line">74</span><br><span class="line">75</span><br><span class="line">76</span><br><span class="line">77</span><br><span class="line">78</span><br><span class="line">79</span><br><span class="line">80</span><br><span class="line">81</span><br><span class="line">82</span><br><span class="line">83</span><br><span class="line">84</span><br><span class="line">85</span><br><span class="line">86</span><br><span class="line">87</span><br><span class="line">88</span><br><span class="line">89</span><br><span class="line">90</span><br><span class="line">91</span><br><span class="line">92</span><br><span class="line">93</span><br><span class="line">94</span><br><span class="line">95</span><br><span class="line">96</span><br><span class="line">97</span><br><span class="line">98</span><br><span class="line">99</span><br><span class="line">100</span><br></pre></td><td class="code"><pre><span class="line">pipeline &#123;</span><br><span class="line">  agent &#123;</span><br><span class="line">    kubernetes &#123;</span><br><span class="line">      cloud &#x27;kubernetes&#x27;</span><br><span class="line">      namespace &#x27;default&#x27;</span><br><span class="line">      yaml &quot;&quot;&quot;</span><br><span class="line">apiVersion: v1</span><br><span class="line">kind: Pod</span><br><span class="line">spec:</span><br><span class="line">  containers:</span><br><span class="line">    - name: git</span><br><span class="line">      image: alpine/git:v2.26.2</span><br><span class="line">      command:</span><br><span class="line">        - cat</span><br><span class="line">      tty: true</span><br><span class="line">    - name: maven</span><br><span class="line">      image: maven:3.6.3-openjdk-8</span><br><span class="line">      
command:</span><br><span class="line">        - cat</span><br><span class="line">      tty: true</span><br><span class="line">      volumeMounts:</span><br><span class="line">        - mountPath: /root/.m2/repository</span><br><span class="line">          name: jenkins-maven-m2-pvc</span><br><span class="line">    - name: docker</span><br><span class="line">      image: docker:19.03-dind</span><br><span class="line">      command:</span><br><span class="line">        - cat</span><br><span class="line">      tty: true</span><br><span class="line">      volumeMounts:</span><br><span class="line">        - mountPath: /var/run/docker.sock</span><br><span class="line">          name: docker-sock</span><br><span class="line">    - name: helm-kubectl</span><br><span class="line">      image: registry.cn-shenzhen.aliyuncs.com/yedward/helm-kubectl:3.3.1-1.18.8</span><br><span class="line">      command:</span><br><span class="line">        - cat</span><br><span class="line">      tty: true</span><br><span class="line">  volumes:</span><br><span class="line">    - name: jenkins-maven-m2-pvc</span><br><span class="line">      persistentVolumeClaim:</span><br><span class="line">        claimName: jenkins-maven-m2-pvc</span><br><span class="line">    - name: docker-sock</span><br><span class="line">      hostPath:</span><br><span class="line">        path: /var/run/docker.sock</span><br><span class="line">        type: &quot;&quot;</span><br><span class="line">&quot;&quot;&quot;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">  </span><br><span class="line">  stages &#123;</span><br><span class="line">    stage(&#x27;Clone&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        container(&#x27;git&#x27;) &#123;</span><br><span class="line">          git branch: &#x27;main&#x27;, credentialsId: &#x27;gitlab&#x27;, url: &#x27;http://172.16.1.184/root/spring-petclinic.git&#x27;    
    &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    stage(&#x27;Build&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        container(&#x27;maven&#x27;) &#123;</span><br><span class="line">          sh &#x27;mvn clean package -DskipTests&#x27;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    stage(&#x27;SonarQube analysis&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        script &#123;</span><br><span class="line">        def sonarqubeScannerHome = tool name: &#x27;SonarQubeScanner&#x27;</span><br><span class="line">            withSonarQubeEnv(&#x27;sonar&#x27;) &#123;</span><br><span class="line">            sh &quot;$&#123;sonarqubeScannerHome&#125;/bin/sonar-scanner&quot;</span><br><span class="line">         &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    stage(&#x27;Publish&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        container(&#x27;docker&#x27;) &#123;</span><br><span class="line">            withCredentials([usernamePassword(credentialsId: &#x27;harbor_account&#x27;, passwordVariable: &#x27;USERPWD&#x27;, usernameVariable: &#x27;USERNAME&#x27;)]) &#123;</span><br><span class="line">                sh &#x27;echo &quot;$USERPWD&quot; | docker login --username=&quot;$USERNAME&quot; 172.16.1.31 --password-stdin&#x27;</span><br><span class="line">                sh &#x27;docker build -t 172.16.1.31/spring-petclinic/spring-petclinic:$BUILD_NUMBER .&#x27;</span><br><span class="line">                sh &#x27;docker push 172.16.1.31/spring-petclinic/spring-petclinic:$BUILD_NUMBER&#x27;</span><br><span class="line">          
  &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">    stage(&#x27;Deploy&#x27;) &#123;</span><br><span class="line">      steps &#123;</span><br><span class="line">        container(&#x27;git&#x27;) &#123;</span><br><span class="line">            withCredentials([usernamePassword(credentialsId: &#x27;gitlab_account&#x27;, passwordVariable: &#x27;USERPWD&#x27;, usernameVariable: &#x27;USERNAME&#x27;)]) &#123;</span><br><span class="line">                sh &#x27;git config --global user.email &quot;root@example.com&quot;&#x27;</span><br><span class="line">                sh &#x27;git config --global user.name &quot;root&quot;&#x27;</span><br><span class="line">                sh &#x27;git remote set-url origin http://$USERNAME:$USERPWD@172.16.1.184/root/spring-petclinic.git&#x27;</span><br><span class="line">                sh &#x27;sed -i &quot;s/spring-petclinic:.*/spring-petclinic:$BUILD_NUMBER/g&quot; deployment.yaml&#x27;</span><br><span class="line">                sh &#x27;git add deployment.yaml&#x27;</span><br><span class="line">                sh &#x27;git commit -m &quot;update yaml&quot;&#x27;</span><br><span class="line">                sh &#x27;git push origin main&#x27;</span><br><span class="line">            &#125;</span><br><span class="line">        &#125;</span><br><span class="line">      &#125;</span><br><span class="line">    &#125;</span><br><span class="line">  &#125;</span><br><span class="line">&#125;</span><br></pre></td></tr></table></figure><p>Notes:<br>1. If the repository cloned in the Clone stage is private, credentials must be configured. The Pipeline Syntax snippet generator can produce the corresponding step.<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-6.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-7.png"></p><p><img 
src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-8.png"></p><p>Enter the Git URL and branch, select the GitLab credential, and generate the corresponding step code.</p><p>Run<br>Edit the spring-petclinic code to comment out the image on the home page, then commit the change to trigger the automated CI/CD run and check the result.</p><p>Comment out the pet image on the home page<br><code>spring-petclinic/src/main/resources/templates/welcome.html</code></p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">&lt;!--&lt;img class=&quot;img-responsive&quot; src=&quot;../static/resources/images/pets.png&quot; th:src=&quot;@&#123;/resources/images/pets.png&#125;&quot;/&gt;--&gt;</span><br></pre></td></tr></table></figure><p>Commit the code again<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-9.png"></p><p>The image no longer appears on the home page<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/Jenkins3-10.png"></p><p>References:<br><a href="https://www.jenkins.io/doc/book/pipeline/syntax/">https://www.jenkins.io/doc/book/pipeline/syntax/</a><br><a href="https://wiki.eryajf.net/pages/3298.html">https://wiki.eryajf.net/pages/3298.html</a></p>]]></content>
    
    
      
      
    <summary type="html">&lt;h3 id=&quot;什么是Jenkins-Pipeline&quot;&gt;&lt;a href=&quot;#什么是Jenkins-Pipeline&quot; class=&quot;headerlink&quot; title=&quot;什么是Jenkins-Pipeline&quot;&gt;&lt;/a&gt;什么是Jenkins-Pipeline&lt;/h3&gt;&lt;p&gt;Pi</summary>
      
    
    
    
    <category term="CI/CD" scheme="http://yoursite.com/categories/CI-CD/"/>
    
    
    <category term="CI/CD" scheme="http://yoursite.com/tags/CI-CD/"/>
    
  </entry>
  
  <entry>
    <title>Application Performance Monitoring 1 - SkyWalking</title>
    <link href="http://yoursite.com/2021/08/26/apm_1/"/>
    <id>http://yoursite.com/2021/08/26/apm_1/</id>
    <published>2021-08-26T13:45:59.000Z</published>
    <updated>2021-08-26T13:45:59.000Z</updated>
    
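    <content type="html"><![CDATA[<h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h2><p>As applications accumulate features and move from monolithic to microservice architectures, they are split into ever finer-grained modules, and locating problems between modules becomes increasingly difficult. Third-party tools are needed to find and localize such problems quickly, providing:<br>1. Response times between modules<br>2. Call chains between application modules<br>3. Identification of slow responses<br>Many APM products are available. Mainstream open-source options include SkyWalking, Zipkin, CAT, Pinpoint, and Elastic APM. These are tightly bound to the programming language and require the application to load the corresponding packages or integrate an SDK, which makes them somewhat intrusive. A newer, cloud-native approach based on a service mesh is neither intrusive to the application nor bound to a programming language.</p><h2 id="SkyWalking介绍"><a href="#SkyWalking介绍" class="headerlink" title="SkyWalking介绍"></a>Introduction to SkyWalking</h2><p>SkyWalking is based on Dapper, Google's paper on distributed tracing. It was created by the Chinese engineer Wu Sheng (吴晟), who open-sourced it and donated it to the Apache Software Foundation. It supports many languages, including Java, PHP, Go, C++, Node.js, Python, .NET, and Lua.</p><h2 id="SkyWalking组件介绍"><a href="#SkyWalking组件介绍" class="headerlink" title="SkyWalking组件介绍"></a>SkyWalking components</h2><p>The overall architecture is shown below<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-1.jpg"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-2.jpg"></p><p>The SkyWalking architecture consists of four parts:</p><p>Agent: the probe integrates with various programming languages and platforms (such as a service mesh), collects tracing and metric data, and sends it to the server.<br>Server (OAP): receives the data sent by the agents, analyzes, processes, and aggregates it, serves queries, and writes the data to the storage backend.<br>Storage: supports several storage backends (Elasticsearch, MySQL, TiDB, ...) and stores the data sent by the server.<br>UI: displays the computed results and the call chains in one place.</p><h2 id="SkyWalking安装"><a href="#SkyWalking安装" class="headerlink" title="SkyWalking安装"></a>Installing SkyWalking</h2><p>Environment</p><table><thead><tr><th>Software</th><th>Version</th></tr></thead><tbody><tr><td>Kubernetes</td><td>v1.18.20</td></tr><tr><td>SkyWalking</td><td>v8.1.0</td></tr></tbody></table><p>SkyWalking officially supports several installation methods. For a quick deployment, this article performs a minimal installation on Kubernetes with the official Helm chart, using Elasticsearch as the storage backend. See the deployment guide:<br><a href="https://github.com/apache/skywalking-kubernetes">https://github.com/apache/skywalking-kubernetes</a></p><p>Set the environment variables</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br></pre></td><td class="code"><pre><span class="line">export SKYWALKING_RELEASE_NAME=skywalking</span><br><span class="line">export 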
SKYWALKING_RELEASE_NAMESPACE=default</span><br></pre></td></tr></table></figure><p>Add the Helm repo</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br></pre></td><td class="code"><pre><span class="line">export REPO=skywalking</span><br><span class="line">helm repo add $&#123;REPO&#125; https://apache.jfrog.io/artifactory/skywalking-helm  </span><br><span class="line">helm repo update</span><br></pre></td></tr></table></figure><p>Install SkyWalking. This installation also deploys an Elasticsearch cluster automatically. To connect to an existing Elasticsearch cluster or to use a different storage backend, pass other parameters at install time.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br></pre></td><td class="code"><pre><span class="line">helm install &quot;$&#123;SKYWALKING_RELEASE_NAME&#125;&quot; $&#123;REPO&#125;/skywalking -n &quot;$&#123;SKYWALKING_RELEASE_NAMESPACE&#125;&quot; \</span><br><span class="line">  --set oap.image.tag=8.1.0-es7 \</span><br><span class="line">  --set oap.storageType=elasticsearch7 \</span><br><span class="line">  --set ui.image.tag=8.1.0 \</span><br><span class="line">  --set elasticsearch.imageTag=7.5.1</span><br></pre></td></tr></table></figure><p>After the deployment finishes, check the pods</p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br><span class="line">7</span><br><span class="line">8</span><br><span class="line">9</span><br></pre></td><td class="code"><pre><span class="line">kubectl get pod </span><br><span class="line">NAME                              READY   STATUS      RESTARTS   AGE</span><br><span class="line">elasticsearch-master-0            1/1     Running     0          
8m54s</span><br><span class="line">elasticsearch-master-1            1/1     Running     0          8m54s</span><br><span class="line">elasticsearch-master-2            1/1     Running     0          8m54s</span><br><span class="line">skywalking-es-init-vl8c7          0/1     Completed   0          8m54s</span><br><span class="line">skywalking-oap-64df9d4b8c-dvksd   1/1     Running     0          3m50s</span><br><span class="line">skywalking-oap-64df9d4b8c-p6thl   1/1     Running     0          8m54s</span><br><span class="line">skywalking-ui-649dc77bd7-t9d7m    1/1     Running     0          8m54s</span><br></pre></td></tr></table></figure><p>This deploys an Elasticsearch cluster and the SkyWalking components.</p><p>For easier access, expose the SkyWalking UI externally through a NodePort service.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br></pre></td><td class="code"><pre><span class="line">kubectl patch  svc skywalking-ui  --type=&#x27;json&#x27; -p &#x27;[&#123;&quot;op&quot;:&quot;replace&quot;,&quot;path&quot;:&quot;/spec/type&quot;,&quot;value&quot;:&quot;NodePort&quot;&#125;,&#123;&quot;op&quot;:&quot;add&quot;,&quot;path&quot;:&quot;/spec/ports/0/nodePort&quot;,&quot;value&quot;:30930&#125;]&#x27;</span><br></pre></td></tr></table></figure><p>Open http:&#x2F;&#x2F;&lt;node-ip&gt;:30930. The default UI looks like this:<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-9.png"></p><h2 id="与应用集成方式"><a href="#与应用集成方式" class="headerlink" title="与应用集成方式"></a>Integrating with applications</h2><p>Option 1: load the agent package when the application starts.<br>For example, go to <a 
href="http://skywalking.apache.org/downloads/">http://skywalking.apache.org/downloads/</a> and download the release tarball, which contains the agent files; then load the agent in the application's startup command. The following Dockerfile for a containerized application shows this approach. Note that <code>-javaagent</code> must come before <code>-jar</code>, because anything after the JAR path is passed to the application as an argument rather than to the JVM.  </p><figure class="highlight plaintext"><table><tr><td class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line">FROM registry.cn-shenzhen.aliyuncs.com/yedward/openjdk:8-jre-slim</span><br><span class="line">USER appuser</span><br><span class="line">EXPOSE 8080</span><br><span class="line">COPY --from=build /usr/src/app/target/*.jar /app/</span><br><span class="line">WORKDIR /app</span><br><span class="line">CMD java -javaagent:/opt/skywalking/agent/skywalking-agent.jar -Xms1024m -Xmx1024m -jar /app/spring-petclinic.jar</span><br></pre></td></tr></table></figure><p>Option 2: mount the agent from outside the image and reference it through parameters. The demo below mainly demonstrates this approach.  </p><p>The main difference between the two is that option 1 requires changing the application's startup command, whereas option 2 requires no change to the application itself; the application only needs to be upgraded (redeployed).</p><h2 id="应用Demo演示"><a href="#应用Demo演示" class="headerlink" title="应用Demo演示"></a>Application demo</h2><p>The demo uses spring-petclinic, a simple application. A gateway in front serves as the single traffic entry point, and the Web module forwards each request to the appropriate backend service, which in turn calls other services.</p><p><a href="https://github.com/wanshaoyuan/spring-petclinic-msa.git">https://github.com/wanshaoyuan/spring-petclinic-msa.git</a></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-3.png"></p><p>Deploy the demo application</p><figure class="highlight plaintext"><table><tr><td 
class="gutter"><pre><span class="line">1</span><br><span class="line">2</span><br><span class="line">3</span><br><span class="line">4</span><br><span class="line">5</span><br><span class="line">6</span><br></pre></td><td class="code"><pre><span class="line"># Download</span><br><span class="line">git clone https://github.com/wanshaoyuan/spring-petclinic-msa.git</span><br><span class="line"></span><br><span class="line"># Deploy the YAML manifests</span><br><span class="line"></span><br><span class="line">kubectl apply -f k8s/local-skywalking/ </span><br></pre></td></tr></table></figure><p>Access the service</p><p><a href="http://host_ip:31080/">http://host_ip:31080</a></p><p>It is a pet clinic system; you can click around to add and edit records.</p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-10.png"></p><p>To view the data in SkyWalking, click the Auto button in the upper-right corner to enable automatic refresh.</p><p>Slowest calls and response-time distribution<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-5.png"></p><p>Service response time and call success rate<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-6.png"></p><p>Global call topology<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-4.png"></p><p><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-7.png"></p><p>Call relationships and latency along each path<br><img src="https://wanshaoyuan.oss-cn-hangzhou.aliyuncs.com/image/apm1-8.png"></p><p>Summary:<br>The SkyWalking UI is polished, and for an open-source product its feature coverage is broad. An APM system is a great help when troubleshooting applications built on microservices. This agent-based approach still depends on the programming language, however. The alternative that avoids this dependency is a service mesh.<br>A service mesh is completely non-intrusive and does not require loading a JAR. It implements tracing mainly through transparent proxying and traffic interception, as Istio does. Its drawback is that it can trace only HTTP requests, so its coverage is limited, and it captures less data than agent-based instrumentation.  </p>]]></content>
    
    
      
      
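    <summary type="html">&lt;h2 id=&quot;概述&quot;&gt;&lt;a href=&quot;#概述&quot; class=&quot;headerlink&quot; title=&quot;概述&quot;&gt;&lt;/a&gt;Overview&lt;/h2&gt;&lt;p&gt;As applications accumulate features and move from monolithic to microservice architectures, they are split into ever finer-grained modules, and locating problems between modules becomes increasingly difficult. Third-party tools</summary>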
      
    
    
    
    <category term="应用上云" scheme="http://yoursite.com/categories/%E5%BA%94%E7%94%A8%E4%B8%8A%E4%BA%91/"/>
    
    
    <category term="应用上云" scheme="http://yoursite.com/tags/%E5%BA%94%E7%94%A8%E4%B8%8A%E4%BA%91/"/>
    
  </entry>
  
</feed>
