Transcription

Atlas 200 DK
V100R020C00

Black Box Instructions

Issue: 01
Date: 2021-01-29

HUAWEI TECHNOLOGIES CO., LTD.

Copyright © Huawei Technologies Co., Ltd. 2021. All rights reserved.

No part of this document may be reproduced or transmitted in any form or by any means without prior written consent of Huawei Technologies Co., Ltd.

Trademarks and Permissions

The Huawei logo and other Huawei trademarks are trademarks of Huawei Technologies Co., Ltd.
All other trademarks and trade names mentioned in this document are the property of their respective holders.

Notice

The purchased products, services and features are stipulated by the contract made between Huawei and the customer. All or part of the products, services and features described in this document may not be within the purchase scope or the usage scope. Unless otherwise specified in the contract, all statements, information, and recommendations in this document are provided "AS IS" without warranties, guarantees or representations of any kind, either express or implied.

The information in this document is subject to change without notice. Every effort has been made in the preparation of this document to ensure accuracy of the contents, but all statements, information, and recommendations in this document do not constitute a warranty of any kind, express or implied.

Issue 01 (2021-01-29) Copyright © Huawei Technologies Co., Ltd.

Contents

1 Overview
2 Configuration
3 Log
4 Snapshot Log
    4.1 Snapshot Log Content
        4.1.1 Head Info
        4.1.2 Boot Region and Run Region
            4.1.2.1 Control Information
            4.1.2.2 Data
    4.2 Snapshot Export

1 Overview

The black box mechanism improves the maintainability of the Ascend AI Processor. When an exception occurs in the system, it preserves the necessary software and hardware parameters, which facilitates fault diagnosis and analysis as well as quick fault locating.

2 Configuration

The black box configuration file is stored in /var/log/npu/conf/bbox/bbox.conf. Table 2-1 describes the configuration items.

Table 2-1 Configuration item description

Item                                Description
MNTN_PATH /var/log/npu/hisi_logs    Path for storing the black box logs. The path cannot contain the relative path /./ and the path length cannot exceed 64 bytes.
MNTN_LOGSPACE_SIZE 64               Available space for storing the log files of each SoC. The unit is MB. The size range is 20 MB to 3 GB.
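The limits in Table 2-1 can be checked before a modified bbox.conf is deployed. The following is a minimal sketch, not the tool's own validation logic. It assumes that each line of the file has the form "NAME value" separated by whitespace, that the item names are written with underscores (MNTN_PATH, MNTN_LOGSPACE_SIZE), and that 3 GB equals 3072 MB. The function names parse_bbox_conf and validate_bbox_conf are hypothetical.

```python
def parse_bbox_conf(text):
    """Parse 'NAME value' lines (assumed file syntax) into a dict of strings."""
    items = {}
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if len(parts) == 2:
            items[parts[0]] = parts[1].strip()
    return items


def validate_bbox_conf(items):
    """Return a list of violations of the limits described in Table 2-1."""
    problems = []

    path = items.get("MNTN_PATH", "")
    if "/./" in path:
        problems.append("MNTN_PATH contains the relative path /./")
    if len(path.encode("utf-8")) > 64:
        problems.append("MNTN_PATH is longer than 64 bytes")

    size = items.get("MNTN_LOGSPACE_SIZE")
    if size is not None:
        try:
            size_mb = int(size)
        except ValueError:
            size_mb = None
        # Assumption: 3 GB is interpreted as 3 * 1024 MB.
        if size_mb is None or not 20 <= size_mb <= 3 * 1024:
            problems.append("MNTN_LOGSPACE_SIZE is outside 20 MB to 3 GB")

    return problems
```

For the default values shown in Table 2-1, validate_bbox_conf returns an empty list.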

3 Log

This section describes how to use the black box log files.

Step 1 Go to the directory where black box logs are stored.

    cd /var/log/npu/hisi_logs/device-id/

NOTE
id indicates the device ID. /var/log/npu/hisi_logs/device-id is the default path for storing black box logs. You can modify the path in the configuration file. For details, see Configuration.

Step 2 View the history.log file.

Find the device-id subdirectory of the device where the exception occurred, and view the history.log file in it. The format and description of the history.log file are as follows:

    [2020-02-27-19:18:46.142527] system exception code [0x68020002]: ModuleName [DRIVER], ExceptionReason [DEVICE_HBL_EXCEPTION], TimeStamp [20200227191842-300803183].

Table 3-1 Field description of history.log

Field                                     Description
[2020-02-27-19:18:46.142527]              Indicates the host handling time.
system exception code [0x68020002]        The exception code is [0x68020002].
ModuleName [DRIVER]                       The exception module is [Driver].
ExceptionReason [DEVICE_HBL_EXCEPTION]    The exception reason is [device heartbeat loss].
TimeStamp [20200227191842-300803183]      The device reporting timestamp is [20200227191842-300803183].

NOTE
The history.log file holds a maximum of 30,000 records. When the number of records exceeds 30,000, the earliest 20,000 records are cleared.

Step 3 View the exception log of a specific module.

The directory name is the same as the timestamp. For example, if the TimeStamp is [20200227191842-300803183], the directory name is 20200227191842-300803183. Specific exception information about the module is stored in this directory. The files are described as follows.

Table 3-2 Log file path and file content

Relative File Path          File Content
DONE                        Black box log recording status
bbox                        Directory for storing the maintenance and testing information of the black box, a statically reserved space
bbox/bbox_info.txt          Basic black box information
bbox/[module].txt           Exception information of [module], for example, ts.txt
bbox/os                     OS maintenance and testing information
bbox/os/os_info.txt         Basic OS information
bbox/os/kbox.txt            Kernel suspension task stack and some kernel logs
bbox/os/hook                Kernel track information
bbox/os/reg                 OS register information
bbox/os/reg/reset_reg.txt   Reset register information
log                         Directory of logs
log/kernel.log              OS kernel log information
log/early_kernel.log        OS startup information at the early stage
mntn                        Directory for storing the maintenance and testing information of each module
mntn/ddr_mntn.txt           DDR maintenance and testing information
mntn/pmu.reg                PMU register information
mntn/tsensor.reg            TSensor register information
snapshot                    Directory for snapshot information
snapshot/hdr.log            Snapshot information
stackcore*                  Process stackcore file

----End
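Steps 2 and 3 can be scripted: read a history.log record, extract its fields, and derive the exception directory from the TimeStamp value. The following is a minimal sketch based only on the sample record shown in this section. The regular expression, the function names, and the example directory name device-0 are assumptions for illustration. Whitespace between a field name and its bracket is treated as optional.

```python
import re
from pathlib import PurePosixPath

# Pattern derived from the single sample record in this section (assumption:
# all records follow the same layout).
RECORD_RE = re.compile(
    r"\[(?P<host_time>[^\]]+)\]\s*"
    r"system exception code\s*\[(?P<code>0x[0-9a-fA-F]+)\]:\s*"
    r"ModuleName\s*\[(?P<module>[^\]]+)\],\s*"
    r"ExceptionReason\s*\[(?P<reason>[^\]]+)\],\s*"
    r"TimeStamp\s*\[(?P<timestamp>[^\]]+)\]"
)


def parse_history_record(line):
    """Return the fields of one history.log record as a dict, or None."""
    match = RECORD_RE.search(line)
    return match.groupdict() if match else None


def exception_dir(device_dir, record):
    """Build the exception directory path: its name equals the TimeStamp."""
    return PurePosixPath(device_dir) / record["timestamp"]
```

For example, calling exception_dir("/var/log/npu/hisi_logs/device-0", record) on the parsed sample record yields a path that ends in 20200227191842-300803183, which is the directory that holds the files listed in Table 3-2.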

NOTICE
1. If an exception occurs in the log-daemon process while black box logs are being flushed, the log content is uncontrollable and may be lost.
2. If the disk where the black box logs are stored (/var/log/npu/hisi_logs) has insufficient space, black box logs cannot be generated.
3. The DONE file records one of the following three statuses:
    STARTING: The exception report is being processed, and exception logs are being exported.
    FILEDONE: The exception report has been processed, exception logs have been exported properly, and the exception information is complete.
    PROCFAIL: The exception report has been processed, but the exception log export failed, and the exception information is incomplete.
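Before analyzing an exception directory, it can be useful to check the DONE status so that incomplete exports are not mistaken for full ones. The following is a minimal sketch. It assumes that the DONE file is plain text containing one of the three status keywords; the exact file layout is not described here, and the function names are hypothetical.

```python
from pathlib import Path

# The three statuses listed in the NOTICE above.
DONE_STATUSES = ("STARTING", "FILEDONE", "PROCFAIL")


def status_from_text(text):
    """Return the first documented status keyword found in the text, or None."""
    for status in DONE_STATUSES:
        if status in text:
            return status
    return None


def read_done_status(exception_dir):
    """Read the DONE file in an exception directory and return its status."""
    done_file = Path(exception_dir) / "DONE"
    if not done_file.is_file():
        return None
    return status_from_text(done_file.read_text(errors="replace"))


def export_is_complete(exception_dir):
    """True only when the export finished and the information is complete."""
    return read_done_status(exception_dir) == "FILEDONE"
```

A directory whose status is STARTING may still be changing, and one whose status is PROCFAIL should be treated as partial evidence only.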

4 Snapshot Log

4.1 Snapshot Log Content
4.2 Snapshot Export

4.1 Snapshot Log Content

The snapshot log content includes head info, the boot region, and the run region. The format is as follows.

    head info
    boot region
        --region config
        --region control
        --area 0
        --area 1
        ...
        --area 7
    run region
        --region config
        --region control
        --area 0
        --area 1
        ...
        --area 7

4.1.1 Head Info

Snapshot head information:

    head info
    magic: 0xeaea2020
    version: 0x100
    reset count: 0x97

Table 4-1 Field description

Field          Description
magic          Magic number used to identify snapshot functions. The value is fixed to 0xeaea2020.
version        Version number. For example, the version number of the sample is 1.0.
reset count    Number of hot resets in the current environment

4.1.2 Boot Region and Run Region

The boot region and the run region have the same structure, consisting of control information and data.

4.1.2.1 Control Information

Control information of the boot region:

    boot region
    region offset: 0x400
    region size: 0x4b000
    -------------------- region config ------------------
    total area: 0xa
    history area: 0x5
    error area: 0x2
    area config:
    used module count: 0x4
    module config:
    module 0 offset: 0x0
    module 0 size: 0x3000
    module 1 offset: 0x3000
    module 1 size: 0x1000
    module 2 offset: 0x4000
    module 2 size: 0x1000
    module 3 offset: 0x5000
    module 3 size: 0x1000
    -------------------- region control -----------------
    area index: 0x1
    error area count: 0x2

Control information of the run region:

    run region
    region offset: 0x4b400
    region size: 0x25800
    -------------------- region config ------------------
    total area: 0xa
    history area: 0x5
    error area: 0x2
    area config:
    used module count: 0x4
    module config:
    module 0 offset: 0x0
    module 0 size: 0x800
    module 1 offset: 0x800
    module 1 size: 0x800
    module 2 offset: 0x1000
    module 2 size: 0x1000
    module 3 offset: 0x2000
    module 3 size: 0x1000
    -------------------- region control -----------------
    area index: 0x4
    error area count: 0x0

Table 4-2 Fields and description

Area              Field                  Description
boot region       region offset          Offset relative to the start address of the snapshot
                  region size            Region size
region config     total area             Total number of areas for storing data
                  history area           Number of areas for storing historical data
                  error area             Number of areas for storing error data
                  used module count      Number of modules in each area
                  module config          Address offset and size of each module. The address offset is relative to the start of the area in which the module is located.
region control    area index             Index of the area where the hot reset data is stored
                  error area count       Number of detected errors that are not exported during the boot process
                  area N control info    Control information about each area. Currently, only the following 7 values of N are used:
                                             0–4: for historical queues
                                             5–6: for exception queues
                  flag                   Queue type of the area:
                                             0: unused
                                             1: L2BUF historical queue
                                             2: L2BUF exception queue
                                             3: DDR historical queue
                                             4: DDR exception queue
                  tag                    Area information status:
                                             0: unused
                                             1: in use and initialized
                                             2: normally in use
                                             3: in use with error
                  exception type         Exception type:
                                             STARTUP_EXCEPTION (0x2c) is displayed in the boot region.
                                             last reset reason is displayed in the run region.
                  module id              Module ID. module id is displayed in the boot region. The run region does not have this field.
                  exception id           Exception ID. exception id is displayed in the boot region. The run region does not have this field.
                  reset number           Number of hot resets when the information is recorded.

NOTE
1. The seven area N control information blocks are divided into two queues. Blocks 0–4 are used for historical queues, and blocks 5–6 are used for exception queues.
   The historical queue follows the ring buffer principle and overwrites the current queue.
   The exception queue follows the read-clear principle. An entry can be cleared and reused only after its content has been read.
2. The black box exports a snapshot only when the value of error area count is not 0. After the export, error area count of the area is cleared. If no new exception record is generated after the next hot reset, the snapshot is not exported.
3. For boot region exceptions, pay attention to the module id, exception type, and exception id. For run region exceptions, pay attention only to the exception type.
4. If the exception code of STARTUP_EXCEPTION or RUN_EXCEPTION is 0xA8**EFFF, which is the default snapshot exception code, snapshot is not supported by the module. The module exception is detected by the BIOS during startup.
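The control information can be read programmatically to decide whether a region holds unexported errors and which queue a given area block belongs to. The following is a minimal sketch. It assumes that the dump is plain text with one "name: 0x..." pair per line, as in the samples in this section; the actual layout of snapshot/hdr.log may differ. All function names are hypothetical.

```python
def parse_region_dump(text):
    """Collect 'name: 0x...' lines of a region dump into a dict of ints.

    Separator lines and headings without a hexadecimal value are skipped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        value = value.strip()
        if sep and value.lower().startswith("0x"):
            fields[name.strip()] = int(value, 16)
    return fields


def has_unexported_errors(fields):
    """Per NOTE 2: a snapshot is exported only if error area count is not 0."""
    return fields.get("error area count", 0) != 0


def area_queue(n):
    """Map area control block N to its queue, per NOTE 1 (0-4 and 5-6)."""
    if 0 <= n <= 4:
        return "historical"
    if 5 <= n <= 6:
        return "exception"
    return None  # other values of N are not currently used
```

Applied to the two samples above, the boot region (error area count 0x2) would trigger an export, while the run region (error area count 0x0) would not.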

Table 4-3 Exception type description

Exception Type          Exception Type Value    Description
DEVICE_COLDBOOT         0x0                     Cold boot without exception
BIOS_EXCEPTION          0x1                     Previous BIOS boot exception
DEVICE_HOTBOOT          0x2                     Hot reset with keys
ABNORMAL_EXCEPTION      0x10                    Hardware exception that is not detected, such as DDR bus suspension
TSENSOR_EXCEPTION       0x1f                    SoC temperature protection reset
PMU_EXCEPTION           0x20                    Hardware reset caused by PMU overcurrent, undervoltage, or overtemperature
DDR_FATAL_EXCEPTION     0x22                    DDR fatal exception reset (for example, DDR overtemperature reset)
OS_PANIC                0x24                    Panic (for example, accessing invalid addresses)
OS_COREDUMP             0x29                    User-mode process core dump
OS_OOM                  0x2a                    OOM exception
OS_HDC                  0x2b                    HDC disconnection
STARTUP_EXCEPTION       0x2c                    Module startup exception
HEARTBEAT_EXCEPTION     0x2d                    Module heartbeat exception
RUN_EXCEPTION           0x2e                    Module running exception
LPM_EXCEPTION           0x32                    LPM exception
TS_EXCEPTION            0x33                    TS exception
DVPP_EXCEPTION          0x35                    DVPP exception
DRIVER_EXCEPTION        0x36                    Driver exception
ZIP_EXCEPTION           0x37                    ZIP exception
TEE_EXCEPTION           0x38                    TEE OS exception
LPFW_EXCEPTION          0x39                    LPFW exception
NETWORK_EXCEPTION       0x3a                    Network exception
ATF_EXCEPTION           0x3c                    ATF exception
DEVICE_LTO_EXCEPTION    0x8a                    Device s