How to Format all JSON Files in a Git Repository
by Christoph Schiessl on DevOps, Git, and Python
I'm working on a project that depends on many JSON files for configuration. These files are deployed with the application and, therefore, are kept in the project's Git repository. Internal tools automatically generate some JSON files, while others are manually maintained and updated. Suffice it to say that there is no consistent formatting across these files, but it would be a win for developer productivity if this were the case. At the very least, it would make diffs easier to look at. Long story short, I decided to take matters into my own hands, clean up the formatting, and also update the project's CI to ensure that it stays consistent. There were several steps needed to make this happen ...
Detect all JSON files
Since I'm working in a Git repository, it's a natural choice to use the git ls-files command to get a listing of all JSON files. The official documentation explains this Git command as follows:
Show information about files in the index and the working tree.
By default, git ls-files returns only files that Git already tracks. However, I want to handle tracked and untracked files. To accomplish this, I had to use the --cached option to request tracked files and the --others option to request untracked files.
git ls-files --cached --others
This command has a problem because it also returns files excluded from tracking, which is not what I want. For instance, the directory containing my Python virtual environment is listed in a .gitignore file and, therefore, excluded from tracking. To tell git ls-files to apply the .gitignore rules as usual, we must add the --exclude-standard option.
git ls-files --cached --others --exclude-standard
All right, we now get a list of tracked and untracked files not excluded by a .gitignore file. The last missing piece is to limit the listing to files with a json extension. Fortunately, filtering is another built-in feature of git ls-files, meaning we can append a filter like -- '*.json' to achieve the desired effect.
git ls-files --cached --others --exclude-standard -- '*.json'
If I run this command in my test repository, it detects four JSON files — two that are tracked, two that are untracked, and none of the ones that are excluded by .gitignore files (e.g., in a node_modules directory).
$ git ls-files --cached --others --exclude-standard -- '*.json'
subdirectory/untracked.json
untracked.json
subdirectory/tracked.json
tracked.json
Iterate over detected files
Next, we need to iterate over all detected files to process each one. This could be done with a shell script, but it's easier with Python. The interesting part here is that we need a way to execute our git ls-files
command in a subshell and capture its standard output. Luckily, there's the built-in subprocess
module providing the check_output
function, which is exactly what we need. This function executes the given command, captures its standard output, and then returns the captured output. By default, it's not using a subshell, but this can easily be enabled by setting its shell
parameter to True
.
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}")
Note that we have to decode() the subprocess's output to convert it from UTF-8-encoded bytes to str. Next, I'm using splitlines() to separate it into individual lines (without trailing line break characters). Finally, I'm mapping all file paths to Path objects, which will be convenient during the next step. When you run this script, you get the expected output.
$ python json-formatter.py
Processing subdirectory/untracked.json
Processing untracked.json
Processing subdirectory/tracked.json
Processing tracked.json
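As an aside, splitlines() is a better fit here than split("\n") because command output almost always ends with a trailing newline, and split("\n") would then produce a spurious empty entry. A minimal sketch with a hypothetical two-file listing:

```python
# Command output typically ends with a trailing newline.
listing = "tracked.json\nuntracked.json\n"

# splitlines() drops the line break characters and does not
# produce an empty element for the trailing newline.
print(listing.splitlines())

# split("\n"), by contrast, yields a trailing empty string.
print(listing.split("\n"))
```

With splitlines(), the loop body never runs for a non-existent empty path.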
Format with the json module
To re-format a file, we need to read the file's content, parse the content as JSON, and then serialize the JSON again with consistent indentation and so on. To read the file, we can use the read_text() method of the Path class. This is why I mapped the strings representing the file paths to proper Path objects in the previous step. In any case, once we have the unformatted content in a str variable, we can pass it along to json.loads() for parsing into more meaningful data structures like, for instance, a dict object. Lastly, we take this data structure and serialize it back to a plain str using the json.dumps() function.
import json
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"

listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")
The json.dumps() function takes a parameter called indent to control the number of spaces used to indent nested structures in the str output. I'm using four spaces, but any positive integer will do ...
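To see the effect of the indent parameter in isolation, here is a small sketch with made-up sample data:

```python
import json

data = {"name": "config", "values": [1, 2]}  # hypothetical sample data

# Without indent, json.dumps() emits everything on a single line.
print(json.dumps(data))

# With indent=4, nested structures are spread over multiple lines,
# indented by four spaces per nesting level.
print(json.dumps(data, indent=4))
```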
Check-Only feature for CI integration
If the script is used in a CI environment, it shouldn't attempt to format any files on disk because these changes would be lost again when the CI job is finished. Instead, the script should only fail (i.e., terminate with a non-zero exit status) if it detects incorrectly formatted files. To support this, I'm adding a --check-only option to trigger this behavior when running in CI. I'm deliberately not using a library to parse the command line arguments because I would rather keep things as simple as possible; instead, I'm looking at sys.argv directly.
import json
import sys
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"
CHECK_ONLY = len(sys.argv) >= 2 and sys.argv[1] == "--check-only"

exit_with_failure = False
listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")
        if CHECK_ONLY:
            exit_with_failure = True

if exit_with_failure:
    sys.exit(1)
If the --check-only option has been given, I'm setting a boolean variable exit_with_failure to True if the formatting of the file currently being processed is incorrect. Then, once all files are processed, I check the variable to determine if there were any incorrectly formatted files. If that is the case, then I use sys.exit() to terminate the script with a non-zero exit code and thereby make the whole script fail. This works well with most CI systems because their jobs usually fail when one of the steps defining them fails, and our script will be one such step.
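The exit-status mechanism can be demonstrated in isolation by launching a child Python process that calls sys.exit(1) and inspecting its return code, which is exactly what a CI system does with each step:

```python
import subprocess
import sys

# Run a child interpreter that terminates via sys.exit(1), then
# inspect its exit status the way a CI system would.
result = subprocess.run([sys.executable, "-c", "import sys; sys.exit(1)"])
print(result.returncode)  # non-zero, so a CI job would mark this step as failed
```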
In any case, if you rerun the script with --check-only, you'll first see that it tells you whether the formatting is correct for each JSON file. Secondly, it fails with exit status 1 because there are multiple incorrectly formatted files.
$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ echo $? # print exit status of the previous command ...
1
Write formatted JSON back to file
Lastly, we have to write the formatted JSON back to disk if the script has been started in normal mode — without the --check-only
option. This is pretty easy, though, because the Path
class also has a counterpart for read_text()
, which is unsurprisingly called write_text()
. We need only two extra lines of code to accomplish this ...
import json
import sys
from pathlib import Path
from subprocess import check_output

COMMAND = "git ls-files --cached --others --exclude-standard -- '*.json'"
CHECK_ONLY = len(sys.argv) >= 2 and sys.argv[1] == "--check-only"

exit_with_failure = False
listing: str = check_output(COMMAND, shell=True).decode()
for json_file in map(Path, listing.splitlines()):
    print(f"Processing {json_file}", end="")
    unformatted_content: str = json_file.read_text()
    formatted_content: str = json.dumps(json.loads(unformatted_content), indent=4)
    if unformatted_content == formatted_content:
        print(" => correctly formatted.")
    else:
        print(" => incorrectly formatted.")
        if CHECK_ONLY:
            exit_with_failure = True
        else:
            json_file.write_text(formatted_content)

if exit_with_failure:
    sys.exit(1)
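The read_text()/write_text() round trip at the heart of the script can also be exercised in isolation. A minimal sketch, using a temporary directory and a hypothetical file name:

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    json_file = Path(tmp) / "example.json"      # hypothetical file name
    json_file.write_text('{"key":  "value"}')   # messy formatting on disk

    # Read, parse, and re-serialize with consistent indentation.
    formatted = json.dumps(json.loads(json_file.read_text()), indent=4)
    json_file.write_text(formatted)             # write the formatted JSON back

    print(json_file.read_text())
```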
Conclusion
Now, you can try out the finished script ...
$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ echo $? # command failed because files are not yet correctly formatted
1
$ python json-formatter.py
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => incorrectly formatted.
Processing tracked.json => incorrectly formatted.
$ python json-formatter.py --check-only
Processing subdirectory/untracked.json => correctly formatted.
Processing untracked.json => correctly formatted.
Processing subdirectory/tracked.json => correctly formatted.
Processing tracked.json => correctly formatted.
$ echo $? # command succeeded because files are now correctly formatted
0
Anyway, that's everything I had to say for today. Thank you for reading, and see you soon!