# DataFrame Column Descriptions

This document describes the columns in the provided datasets. Each column is explained below:

## names
- Represents the name of each entry.
- "comp" refers to compounds from the Bordwell dataset.
- "shen" refers to compounds from the Shen dataset.
- "ibond" refers to compounds from the iBonD dataset.
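
A minimal sketch of tagging each entry with its source dataset. It assumes the labels above occur as substrings of the `names` values; the example names and the `source` column are hypothetical and not part of the datasets.

```python
import pandas as pd

# Hypothetical entries; real names in the datasets may be formatted differently.
df = pd.DataFrame({"names": ["comp1", "shen12", "ibond345", "comp7"]})

# Source labels from the list above, mapped to their datasets.
sources = {"comp": "Bordwell", "shen": "Shen", "ibond": "iBonD"}


def source_of(name: str) -> str:
    """Return the dataset whose label occurs in the entry name."""
    for label, dataset in sources.items():
        if label in name:
            return dataset
    return "unknown"


# "source" is a helper column added for illustration only.
df["source"] = df["names"].map(source_of)
bordwell = df[df["source"] == "Bordwell"]
```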

## smiles
- Contains SMILES notation for the neutral molecule.

## pka_exp
- The experimental pKa value.

## ref
- Reference for the experimental pKa from the Bordwell dataset.

## comment
- Additional comments.

## method
- The experimental method used to obtain the pKa. Extracted from iBonD.

## ref_link
- Link to the reference. Extracted from iBonD.

## ref_title
- Title of the reference. Extracted from iBonD.

## name_shen
- Name of the molecule as it appears in the Shen dataset.

## outlier
- True if the entry is considered an outlier. Used to filter out compounds for the linear free-energy relationship (LFER).

## outlier_note
- Notes about why an entry is considered an outlier.
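
A minimal sketch of excluding flagged entries before an LFER fit. Only the column names come from this document; the rows are hypothetical, and the flag is assumed to be stored as a boolean (or 0/1).

```python
import pandas as pd

# Hypothetical rows; only the column names are taken from this document.
df = pd.DataFrame({
    "names": ["comp1", "comp2", "shen3"],
    "pka_exp": [10.2, 25.1, 17.6],
    "outlier": [False, True, False],
    "outlier_note": ["", "hypothetical note", ""],
})

# Keep only entries that are not flagged as outliers.
# astype(bool) also handles a flag stored as 0/1 integers.
df_lfer = df[~df["outlier"].astype(bool)].reset_index(drop=True)
```

If the flag were stored as text (for example after a CSV round trip), it would need to be converted to booleans first, since any non-empty string is truthy.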

## lst_names_deprot
- List of names for each deprotonation site.

## ref_mol_smiles_map
- Mapped SMILES of the reference/neutral molecule.

## lst_smiles_map_deprot
- List of mapped SMILES for each deprotonation site.

## lst_smiles_deprot
- List of SMILES for each deprotonation site.

## lst_atomsite_deprot
- List of atom sites for each deprotonation site.

## lst_atomindex_deprot
- List of atom indices for each deprotonation site.

## cm5
- List of CM5 charges for each atom in the molecule.

## descriptor_vector
- List of descriptor vectors for each atom in the molecule.

## mapper_vector
- List of mapper vectors for each atom in the molecule.

## train_test
- Indicates whether the molecule is part of the training set or the test set.

## fold1 - fold5
- Fold assignments for cross-validation, with each fold representing a different subset of the data used for validation.

## gfn_method_xtb
- Method used for the xTB calculation.

## solvent_model_xtb
- Solvent model used for the xTB calculation.

## solvent_name
- Name of the solvent.

## e_xtb_neutral
- Energy for the neutral molecule calculated with xTB.

## e_dft_neutral
- Energy for the neutral molecule calculated with DFT.

## lst_e_xtb_deprot
- List of energies for each deprotonation site calculated with xTB.

## lst_e_rel_xtb
- List of relative energies for each deprotonation site calculated with xTB.

## lst_e_dft_deprot
- List of energies for each deprotonation site calculated with DFT.

## lst_e_rel_dft
- List of relative energies for each deprotonation site calculated with DFT.

## e_rel_min_dft
- Minimum relative energy among the deprotonation sites, calculated with DFT.

## lst_pka_lfer
- List of pKa values for each deprotonation site calculated with LFER.

## pka_min_lfer
- Minimum pKa value calculated with LFER.

## atomsite_min_lfer
- Atom site for the lowest pKa value calculated with LFER.
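
A minimal sketch of how the minimum pKa and its atom site relate to the per-site list. It assumes `lst_pka_lfer` and `lst_atomsite_deprot` are ordered the same way (one entry per deprotonation site); the helper `lowest_pka_site` and the values are hypothetical.

```python
def lowest_pka_site(lst_pka, lst_atomsite):
    """Return (minimum pKa, atom site at which it occurs)."""
    # Index of the smallest pKa; ties resolve to the first occurrence.
    idx = min(range(len(lst_pka)), key=lst_pka.__getitem__)
    return lst_pka[idx], lst_atomsite[idx]


# Hypothetical values for one molecule with three deprotonation sites.
lst_pka_lfer = [24.3, 17.9, 30.1]
lst_atomsite_deprot = [0, 4, 7]

pka_min_lfer, atomsite_min_lfer = lowest_pka_site(lst_pka_lfer, lst_atomsite_deprot)
```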

## atom_lowest_lfer - atom_lowest_lfer_2
- Lists of binary indicators, one per deprotonation site, marking the site with the lowest pKa value and the sites whose pKa is within an absolute difference of 1 or 2 pKa units from the lowest value, calculated with LFER.
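
A minimal sketch of such indicator lists. It assumes the threshold is inclusive and that 1 marks a flagged site; the helper `lowest_indicators` and the values are hypothetical.

```python
def lowest_indicators(lst_pka, tol=0.0):
    """Flag sites whose pKa lies within `tol` of the lowest pKa (1 = flagged)."""
    pka_min = min(lst_pka)
    return [int(abs(pka - pka_min) <= tol) for pka in lst_pka]


# Hypothetical LFER pKa values for four deprotonation sites.
lst_pka_lfer = [24.3, 17.9, 18.6, 19.5]

lowest = lowest_indicators(lst_pka_lfer)             # lowest site only
within_1 = lowest_indicators(lst_pka_lfer, tol=1.0)  # within 1 pKa unit
within_2 = lowest_indicators(lst_pka_lfer, tol=2.0)  # within 2 pKa units
```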

## lst_pka_pred_reg - atom_lowest_pred_reg2
- Analogous to the LFER columns, but for pKa values predicted with a regression model (both for the test set and for a model trained on the full dataset).

## lst_pka_pred_reg_full - atom_lowest_pred_reg_full2
- Lists and indicators for pKa predictions from regression models trained on the full dataset, including the minimum values and proximity to the lowest predicted values.