# Dataset Format

## `stru_out`

This file contains the structure and k-mesh data.
It should contain the following data, in order:
- lattice vectors: 3 lines, 3 floating-point numbers on each line, unit: Bohr radius
- reciprocal lattice vectors: 3 lines, 3 floating-point numbers on each line, unit: inverse Bohr radius
- number of k-grid points along each lattice vector: 1 line, `nkx`, `nky`, `nkz`. The
  total number of k-points `nkpts` equals the product of `nkx`, `nky` and
  `nkz`
- Cartesian coordinates of each k-point: `nkpts` lines, 3 floating-point numbers on each line, unit: inverse Bohr radius
- mapping of each k-point to its irreducible counterpart: `nkpts` lines, 1 integer on each line.

The mapping should be read as follows:
if the number on the `n`-th line is `m`, then
the irreducible k-point corresponding to the `n`-th k-point in the full k-point set is the `m`-th
k-point in the full set.
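
The layout above can be read with a short token-based parser. The following is a minimal sketch, not part of the original tooling: the function name `parse_stru_out` and the returned dictionary keys are hypothetical, and it assumes the numbers are written in a notation that Python's `float()` accepts.

```python
import numpy as np


def parse_stru_out(text):
    """Parse the whitespace-separated contents of a `stru_out` file."""
    tokens = text.split()
    pos = 0

    def take(count, convert):
        # Consume the next `count` tokens and convert each one.
        nonlocal pos
        if pos + count > len(tokens):
            raise ValueError("stru_out ended before all data were read")
        values = [convert(t) for t in tokens[pos:pos + count]]
        pos += count
        return values

    lattice = np.array(take(9, float)).reshape(3, 3)   # Bohr
    recip = np.array(take(9, float)).reshape(3, 3)     # inverse Bohr
    nkx, nky, nkz = take(3, int)
    nkpts = nkx * nky * nkz
    kpoints = np.array(take(3 * nkpts, float)).reshape(nkpts, 3)
    ibz_map = take(nkpts, int)  # entry n-1 holds m for the n-th k-point
    return {
        "lattice": lattice,
        "recip_lattice": recip,
        "nk": (nkx, nky, nkz),
        "nkpts": nkpts,
        "kpoints": kpoints,
        "ibz_map": ibz_map,
        # Distinct mapping targets, i.e. the irreducible k-points
        "irreducible": sorted(set(ibz_map)),
    }
```

A typical call would be `parse_stru_out(open("stru_out").read())`. The indices in `ibz_map` are kept exactly as written in the file, so no assumption about 0- or 1-based counting is baked into the parser.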

## `Cs_data_xxx.txt`

These files contain the localized RI triple coefficients.

In plain text format, each file has a header with two integers:
the total number of atoms and the number of periodic unit cells.
The rest of the file, up to its end, consists of blocks of RI coefficients $C$, one for each combination of atom pair and unit cell:
```
i_atom_1  i_atom_2  n_1  n_2  n_3  n_basis_1  n_basis_2  n_aux_basis_1
C(1, 1, 1)
...
C(n_aux_basis_1, n_basis_2, n_basis_1)
```
Here `C` holds the RI coefficients between atoms `i_atom_1` and `i_atom_2` in unit cells separated by
the lattice vector $\mathbf{R} = n_1 \mathbf{a}_1 + n_2 \mathbf{a}_2 + n_3 \mathbf{a}_3$.
The auxiliary basis is located on `i_atom_1`. The numbers of basis functions on `i_atom_1` and `i_atom_2`
are `n_basis_1` and `n_basis_2`, respectively. The number of auxiliary functions is `n_aux_basis_1`.
The indices of `C` run in Fortran order, i.e. the first index runs fastest.
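
For the plain text format, a reader could look like the sketch below. It is an illustration only: the function name `parse_cs_text` is hypothetical, and it assumes each coefficient is written as a single real number, which the text above does not state explicitly.

```python
import numpy as np


def parse_cs_text(text):
    """Parse the contents of a plain-text `Cs_data_xxx.txt` file."""
    tokens = text.split()
    # File header: total number of atoms and number of unit cells
    n_atoms, n_cells = int(tokens[0]), int(tokens[1])
    pos = 2
    blocks = {}
    while pos < len(tokens):
        # Block header: 8 integers
        a1, a2, n1, n2, n3, nb1, nb2, naux = (
            int(t) for t in tokens[pos:pos + 8]
        )
        pos += 8
        size = naux * nb2 * nb1
        data = np.array([float(t) for t in tokens[pos:pos + size]])
        if data.size != size:
            raise ValueError("file ended in the middle of a block")
        pos += size
        # order="F": the first (auxiliary) index runs fastest
        blocks[(a1, a2, n1, n2, n3)] = data.reshape((naux, nb2, nb1), order="F")
    return n_atoms, n_cells, blocks
```

Each array in `blocks` is indexed as `C[i_aux, i_basis_2, i_basis_1]`, matching the index order `C(n_aux_basis_1, n_basis_2, n_basis_1)` shown in the block layout.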

In binary format, the data is organized as in the plain text format, except that the header includes
an extra integer: the number of atom-pair and lattice-vector combinations stored in the file.
The coefficients are saved in double precision. The following Python snippet illustrates the layout
of the binary file:
```python
import struct
import numpy as np

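# Ensure that "Cs_data_0.txt" exists and was generated with binary output mode in the DFT code
cfile_path = "Cs_data_0.txt"

# Maps (i_atom_1, i_atom_2, n_1, n_2, n_3) to the coefficient array of that block
apcells = {}

with open(cfile_path, 'rb') as h:
    # File header: number of atoms, number of unit cells, number of blocks in this file
    n_atoms, n_cells, n_apcell_file = struct.unpack('iii', h.read(12))
    for _ in range(n_apcell_file):
        # Block header: 8 integers, followed by the coefficients in double precision
        a1, a2, r1, r2, r3, nb1, nb2, nbb1 = struct.unpack('i' * 8, h.read(4 * 8))
        apcell = (a1, a2, r1, r2, r3)
        array_size = nb1 * nb2 * nbb1
        array = np.array(struct.unpack('d' * array_size, h.read(8 * array_size)))
        # C-order reshape: the auxiliary index (last axis) runs fastest
        array = np.reshape(array, (nb1, nb2, nbb1))
        apcells[apcell] = array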
```

## `band_out`

This file contains the band energies and occupation numbers from the mean-field starting-point calculation.
It has a 5-line header:
```
n_k_points
n_spins
n_states
n_basis
e_fermi
```
The first 4 lines each contain one integer. The 5th line is a floating-point number, the Fermi energy
in Hartree.

The remaining lines consist of `n_k_points*n_spins` blocks of `n_states+1` lines each, in the format
```
i_k_point    i_spin
1           f_1        e_1_ha      e_1_ev
2           f_2        e_2_ha      e_2_ev
3           f_3        e_3_ha      e_3_ev
...
n           f_n        e_n_ha      e_n_ev
...
```
Each block contains the energies and occupation numbers of the states $\left|\psi_{n,k\sigma}\right\rangle$.
`i_k_point` is the index of the k-point $k$ in the full k-point set.
`i_spin` specifies the spin channel $\sigma$.
In each of the following lines, the first integer specifies the index of the state.
The 3 floating-point numbers are the occupation number, the energy in Hartree and the energy in
electronvolts, respectively.
For a spin-unpolarized calculation, `f_n` is a number from 0 to 2; otherwise it is from 0 to 1.
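
The header and block structure can be read as in the following sketch. The function name `parse_band_out` and the array layout are hypothetical choices, and the sketch assumes that `i_k_point`, `i_spin` and the state index all count from 1, which the description above suggests but does not state.

```python
import numpy as np


def parse_band_out(text):
    """Parse the contents of a `band_out` file."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    # 5-line header: four integers, then the Fermi energy in Hartree
    n_k, n_spins, n_states, n_basis = (int(lines[i].split()[0]) for i in range(4))
    e_fermi = float(lines[4].split()[0])

    shape = (n_spins, n_k, n_states)
    occ = np.zeros(shape)
    e_ha = np.zeros(shape)
    e_ev = np.zeros(shape)

    pos = 5
    for _ in range(n_k * n_spins):
        # Block header: k-point index and spin index (assumed 1-based)
        i_k, i_spin = (int(t) for t in lines[pos].split()[:2])
        for row in lines[pos + 1:pos + 1 + n_states]:
            fields = row.split()
            n = int(fields[0])  # state index (assumed 1-based)
            occ[i_spin - 1, i_k - 1, n - 1] = float(fields[1])
            e_ha[i_spin - 1, i_k - 1, n - 1] = float(fields[2])
            e_ev[i_spin - 1, i_k - 1, n - 1] = float(fields[3])
        pos += n_states + 1

    return {
        "n_basis": n_basis,
        "e_fermi": e_fermi,
        "occ": occ,
        "e_ha": e_ha,
        "e_ev": e_ev,
    }
```

The arrays are indexed as `[spin, k-point, state]` with 0-based indices. Using the indices stored in the file, rather than the block order, means the result does not depend on the order in which the blocks were written.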

## `KS_eigenvector_xxx.txt`

These files contain the wave functions (eigenvectors) from the starting-point calculation, expanded in the orbital basis.
Each file is divided into blocks of `n_states*n_basis*n_spins+1` lines,
where `n_states`, `n_basis` and `n_spins` are taken from the `band_out` file.
Each block stores the data for a particular k-point, $c^i_{n,k\sigma}$:
```
i_k_point
c(1,1,1)_real c(1,1,1)_imag
...
c(i,n,s)_real c(i,n,s)_imag
...
```
The first line contains a single integer, the index of the k-point to which the following data belong.
The remaining lines store the data with running indices $i$, $n$, $\sigma$ in C-style row-major order,
i.e., the spin index runs fastest, then the state index and finally the basis index.
Each line has two floating-point numbers, which are the real and imaginary parts of $c^i_{n,k\sigma}$.
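
The index ordering can be made concrete with the following sketch, which reads the blocks into complex arrays. The function name `parse_ks_eigenvectors` is hypothetical, and the dimensions are passed in explicitly on the assumption that they have already been read from `band_out`.

```python
import numpy as np


def parse_ks_eigenvectors(text, n_basis, n_states, n_spins):
    """Parse the contents of a `KS_eigenvector_xxx.txt` file."""
    tokens = text.split()
    n_coeff = n_basis * n_states * n_spins
    pos = 0
    result = {}
    while pos < len(tokens):
        # Block header: index of the k-point, kept as written in the file
        i_k = int(tokens[pos])
        pos += 1
        raw = np.array([float(t) for t in tokens[pos:pos + 2 * n_coeff]])
        if raw.size != 2 * n_coeff:
            raise ValueError("file ended in the middle of a block")
        pos += 2 * n_coeff
        # Even positions are real parts, odd positions imaginary parts
        coeff = raw[0::2] + 1j * raw[1::2]
        # C-order reshape: spin runs fastest, then state, then basis
        result[i_k] = coeff.reshape(n_basis, n_states, n_spins)
    return result
```

Each returned array is indexed as `c[i_basis, i_state, i_spin]` with 0-based indices, so the element on the second data line of a block with `n_spins = 1` and `n_states = 1` ends up at `c[1, 0, 0]`.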

## `coulomb_mat_xxx.txt`

These files contain the Coulomb matrices in the auxiliary basis.
A single header line contains an integer, the number of irreducible k-points at
which the Coulomb matrices are computed.
The remaining part of the file is organized in blocks:
```
n_aux_basis    row_start    row_end    col_start    col_end
i_k_point      k_weight
v(row_start, col_start  )_real       v(row_start, col_start  )_imag
v(row_start, col_start+1)_real       v(row_start, col_start+1)_imag
...
v(row_end, col_end)_real             v(row_end, col_end)_imag
```
where
- integer `n_aux_basis` is the total number of auxiliary basis functions.
- integers `row_start`, `row_end`, `col_start` and `col_end` mark the submatrix of the full Coulomb matrix
  that this block contains.
- integer `i_k_point` is the index of the k-point of the current Coulomb matrix, in the full k-point list.
- floating-point number `k_weight` is the weight of this irreducible k-point.

After the block header, there should be `(row_end-row_start+1)` times `(col_end-col_start+1)` lines
of matrix element data. Each line contains two floating-point numbers, which are the real and imaginary
parts of the element. The data is stored in C-style row-major order.
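
As an illustration of how the blocks fit together, the sketch below assembles the submatrices into one full matrix per k-point. The function name `parse_coulomb_mat` is hypothetical, and the sketch assumes that the row and column bounds are 1-based and inclusive, as the line count given above implies.

```python
import numpy as np


def parse_coulomb_mat(text):
    """Parse the contents of a `coulomb_mat_xxx.txt` file."""
    tokens = text.split()
    n_irk = int(tokens[0])  # number of irreducible k-points
    pos = 1
    matrices = {}
    weights = {}
    while pos < len(tokens):
        # Block header, line 1: matrix size and submatrix bounds
        n_aux, r0, r1, c0, c1 = (int(t) for t in tokens[pos:pos + 5])
        pos += 5
        # Block header, line 2: k-point index and weight
        i_k = int(tokens[pos])
        weights[i_k] = float(tokens[pos + 1])
        pos += 2
        n_rows, n_cols = r1 - r0 + 1, c1 - c0 + 1
        n_values = 2 * n_rows * n_cols
        raw = np.array([float(t) for t in tokens[pos:pos + n_values]])
        if raw.size != n_values:
            raise ValueError("file ended in the middle of a block")
        pos += n_values
        # Row-major order: the column index runs fastest
        sub = (raw[0::2] + 1j * raw[1::2]).reshape(n_rows, n_cols)
        full = matrices.setdefault(i_k, np.zeros((n_aux, n_aux), dtype=complex))
        # Bounds are assumed 1-based and inclusive
        full[r0 - 1:r1, c0 - 1:c1] = sub
    return n_irk, weights, matrices
```

Storing the blocks in a dictionary keyed by `i_k_point` lets several blocks for the same k-point, whether in one file or spread over several, fill different parts of the same matrix.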